Partly. The Developer plan is $0 for one seat and includes 5,000 base traces per month. Plus is $39 per seat per month with 10,000 base traces included. Beyond the included volume, traces cost $2.50 per 1,000 at 14-day retention or $5.00 per 1,000 at 400-day retention. A team doing active agent development can exhaust the free tier within days, so budget for the per-trace line, not just the seats.

Does LangSmith work without LangChain?

Yes. LangSmith supports OpenTelemetry and traces applications built with the OpenAI SDK, Anthropic SDK, Vercel AI SDK, LlamaIndex, or custom code. That said, its strongest differentiator (graph-native tracing of agent state machines) only pays off if you are on LangGraph. If you are not in that ecosystem, compare it against Braintrust on features and price rather than defaulting to it.

What is Braintrust used for?

The full LLM quality loop in one platform: tracing production traffic, scoring it online with LLM, code, or human scorers, clustering failures into Topics, converting real traces into eval datasets, and gating releases on experiment results. It also ships Loop, an agent that proposes better prompts, scorers, and datasets from your own data. Braintrust lists Notion, Vercel, Replit, Dropbox, and Coursera among its customers.

How much does Braintrust cost?

Three tiers, all flat per team with unlimited users. Starter is free: 1 GB of processed data, 10,000 scores per month, 14-day retention. Pro is $249/month: 5 GB, 50,000 scores, 30-day retention. Enterprise is custom, and adds RBAC, custom retention and export, and hybrid or self-hosted deployment. Overages on Pro run $3 per GB and $1.50 per 1,000 scores. Qualifying startups can get 6 to 12 months of Pro free.

Which tool is best for LLM red teaming?

Promptfoo, and in this trio it is not close. It generates prompt-injection, jailbreak, and data-exfiltration attacks against your actual application, scans models for vulnerabilities, and proxies MCP traffic. Braintrust and LangSmith measure quality; they do not attack your system. OpenAI cited exactly this capability (agentic security testing) when it announced the acquisition.

Can I use Promptfoo and Braintrust or LangSmith together?

Yes, and most serious teams do. Promptfoo runs in CI for pre-ship regression and security testing; Braintrust or LangSmith watches production. They overlap far less than the shared 'eval tool' label suggests: one tests before you ship, the others score and trace what is live. Consolidating to a single tool usually means giving up either CI red teaming or production observability.

What is the difference between LangSmith and Braintrust pricing?

The models differ more than the sticker prices. LangSmith charges per seat ($39/month each on Plus) plus per trace ($2.50 per 1,000 base). Braintrust charges flat per team ($249/month Pro) plus usage ($3 per GB, $1.50 per 1,000 scores) with unlimited seats. Small teams with modest traffic pay less on LangSmith; from about seven seats up, or whenever you want PMs and reviewers in the tool, Braintrust's flat rate wins.
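The seat-versus-flat-rate arithmetic can be checked in a few lines of Python. This is a rough sketch that uses only the list prices quoted in this article. The function names are made up for illustration, and it assumes that overages are billed pro rata and that the 10,000 included traces are shared across the team, neither of which is confirmed here.

```python
import math

# List prices as quoted in this article (assumed current; verify before budgeting).
LANGSMITH_SEAT = 39.00          # Plus, per seat per month
LANGSMITH_INCLUDED = 10_000     # base traces included per month (assumed team-wide)
LANGSMITH_PER_1K = 2.50         # base traces, 14-day retention

BRAINTRUST_PRO = 249.00         # flat per team per month
BRAINTRUST_INCLUDED_GB = 5
BRAINTRUST_INCLUDED_SCORES = 50_000
BRAINTRUST_PER_GB = 3.00
BRAINTRUST_PER_1K_SCORES = 1.50


def langsmith_plus_monthly(seats: int, traces: int) -> float:
    """Seats plus base-trace overage, billed pro rata (an assumption)."""
    extra_traces = max(0, traces - LANGSMITH_INCLUDED)
    return seats * LANGSMITH_SEAT + extra_traces / 1000 * LANGSMITH_PER_1K


def braintrust_pro_monthly(gb: float, scores: int) -> float:
    """Flat team fee plus data and score overage; seat count does not matter."""
    extra_gb = max(0.0, gb - BRAINTRUST_INCLUDED_GB)
    extra_scores = max(0, scores - BRAINTRUST_INCLUDED_SCORES)
    return (BRAINTRUST_PRO
            + extra_gb * BRAINTRUST_PER_GB
            + extra_scores / 1000 * BRAINTRUST_PER_1K_SCORES)


def breakeven_seats() -> int:
    """Smallest seat count whose seat fees alone exceed Braintrust Pro."""
    return math.floor(BRAINTRUST_PRO / LANGSMITH_SEAT) + 1
```

With no overages on either side, six LangSmith seats cost $234 and seven cost $273, which is where the "about seven seats" crossover against the $249 flat rate comes from. Real usage on either platform shifts that point.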

Can you self-host these tools?

Promptfoo: fully. It is open source and runs wherever your CI runs, with no data leaving your infrastructure. LangSmith: managed cloud, bring-your-own-cloud, or self-hosted, with self-hosting tied to Enterprise plans. Braintrust: SaaS by default, with hybrid deployment available and self-hosted options on Enterprise. If data residency is a hard requirement, price the enterprise tiers early: that is where those controls live on both paid platforms.

How does BearPlex set up eval infrastructure for clients?

Eval infrastructure is standard scope on every production AI engagement, not an add-on. The default shape: Promptfoo gates in CI (regression plus red teaming for anything customer-facing), Braintrust or LangSmith for production observability depending on the client's stack and team mix, golden datasets seeded from real traffic, and the review rituals that keep evals honest as the product evolves.

Decision framework

Promptfoo vs Braintrust vs LangSmith: Which LLM Eval Tool in 2026

TL;DR

The right answer changed in 2026. Promptfoo (MIT open source; OpenAI announced its acquisition on March 9, 2026, with a public commitment to keep it open source and model-agnostic) is the pick for pre-ship evals, CI regression gates, and automated red teaming; it is free and still ships multiple releases a month. Braintrust (flat $249/month Pro, unlimited seats on every tier) is the pick when the job is production tracing, online scoring, and dataset curation by mixed engineering and product teams. LangSmith ($39 per seat/month Plus) is the pick when you are on LangChain or LangGraph, both of which hit stable 1.0 in October 2025: nothing traces those graphs as natively. Most production teams we work with run two of the three: Promptfoo in CI plus one observability platform. Running zero of them is the only wrong answer.

Side-by-side comparison

Dimension	Promptfoo	Braintrust
Core job	Test before ship: evals, CI gates, red teaming	Watch after ship: tracing, online scoring, datasets, release gates
Pricing model	Free, MIT open source; enterprise cloud custom-quoted	Flat per-team: free Starter, Pro $249/month, Enterprise custom; unlimited seats on every tier
Free tier limits	None: unlimited local runs, no cloud dependency	1 GB processed data + 10,000 scores/month, 14-day retention, unlimited users
Overage pricing	None (you run it yourself)	$4/GB + $2.50 per 1,000 scores on Starter; $3/GB + $1.50 per 1,000 on Pro
Ownership (2026)	OpenAI acquisition announced March 9, 2026; MIT license and model-agnostic support publicly committed	Independent, venture-backed
Production trace analysis	Not the focus: no live-traffic scoring, drift detection, or alerting	Core product: online scoring, Topics failure clustering, Brainstore trace database
Security / red teaming	Best in class: generated prompt-injection, jailbreak, and data-leak attacks; guardrails; model scanning; MCP proxy	None: measures quality, does not attack your system
CI/CD integration	CLI-first with a GitHub Action; fails builds on regression	SDK-driven experiments and quality gates that block bad releases
Dataset workflows	YAML/CSV test cases; engineer-managed	Trace-to-dataset conversion, snapshots, human review via Custom Views
AI-assisted improvement	LLM-as-judge scoring and agent-rubric graders	Loop agent proposes better prompts, scorers, and datasets from your traces
Non-engineer usability	Engineering tool: config files and CLI	Built for mixed teams: PMs and domain experts review and annotate
Self-hosting	Fully self-hostable (open source, runs wherever CI runs)	SaaS default; hybrid and self-hosted deployment on Enterprise
Compliance posture	Your infra, your data path; nothing leaves by default	Advertises SOC 2 Type II, HIPAA, GDPR, SSO/SAML, RBAC
LangChain / LangGraph fit	Generic provider calls; no graph-aware tracing	Works fine via SDKs, but LangSmith's graph-native LangGraph tracing is deeper than either
Release cadence	Multiple releases per month (0.121.17 on June 16, 2026)	Actively shipping: Loop, Topics, and Brainstore are recent additions

Promptfoo

Open-source pre-ship evals, CI gates, and automated red teaming. Free, local, now backed by OpenAI.

Promptfoo is the open-source standard for testing LLM applications before they ship. You define test cases in YAML or CSV, run them against any provider (OpenAI, Anthropic, Google, Mistral, local models), and score outputs with deterministic assertions, LLM-as-judge rubrics, RAG metrics, or custom code. It runs locally with no cloud dependency, which makes it the easiest tool in this trio to drop into CI: the GitHub Action fails a build when a prompt change regresses quality. It has also grown into a full AI security suite: automated red teaming that generates prompt injection, jailbreak, and data-exfiltration attacks against your actual application, plus guardrails, model security scanning, code scanning, and an MCP proxy. Promptfoo claims 300,000+ open-source users and adoption at 156 of the Fortune 500. In March 2026 OpenAI announced it is acquiring Promptfoo, with a stated commitment to keep the project open source and model-agnostic; releases have continued at pace since (0.121.17 shipped June 16, 2026). What it does not do is production observability: it tests before you ship. It does not score live traffic, detect drift, or alert on quality drops. Pair it with a tracing platform for that.
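To make the "fail the build on regression" idea concrete, here is a minimal plain-Python sketch of a pass-rate gate built on deterministic assertions. It does not use Promptfoo's configuration format or API. `TestCase`, `run_gate`, and `fake_model` are hypothetical names, and the canned model stands in for a real provider call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    prompt: str
    must_contain: str  # deterministic assertion: substring expected in the output


def run_gate(cases: list[TestCase],
             model: Callable[[str], str],
             min_pass_rate: float) -> tuple[float, bool]:
    """Run every case through the model and return (pass_rate, gate_ok)."""
    passed = sum(
        1 for case in cases
        if case.must_contain.lower() in model(case.prompt).lower()
    )
    pass_rate = passed / len(cases)
    return pass_rate, pass_rate >= min_pass_rate


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a provider call; returns canned answers."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
        "Name a primary color.": "Purple is a nice color.",  # deliberately wrong
    }
    return canned.get(prompt, "")


CASES = [
    TestCase("What is the capital of France?", "paris"),
    TestCase("What is 2 + 2?", "4"),
    TestCase("Name a primary color.", "red"),
]
```

In a CI job, a wrapper would call `run_gate(CASES, model, min_pass_rate=0.9)` and exit non-zero when the second return value is `False`, which is what blocks the merge. The threshold is a team decision, not something the tool dictates.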

Pros

MIT-licensed and free; runs entirely locally with no cloud dependency
Best-in-class CI integration: CLI-first, GitHub Action, fails builds on prompt regressions
Automated red teaming generates prompt injection, jailbreak, and data-leak attacks against your real app, not a generic benchmark
Provider-agnostic: OpenAI, Anthropic, Google, Mistral, and local models in one test matrix
Flexible scoring: deterministic assertions, LLM-as-judge rubrics, RAG metrics, custom code
Very active development: multiple releases per month through June 2026
Large install base (Promptfoo claims 300,000+ open-source users and 156 of the Fortune 500)

Cons

No production observability: does not score live traffic, detect drift, or alert on quality drops
Dataset management is files and CLI, not a curation workflow for non-engineers
Web UI is functional but thin next to Braintrust or LangSmith
OpenAI acquisition (announced March 2026) adds roadmap uncertainty despite the public open-source and model-agnostic commitments
Enterprise cloud features are custom-quoted, not self-serve

Best for

→ CI regression gates on prompts, models, and RAG pipelines before anything ships
→ Security red teaming of customer-facing agents (prompt injection, jailbreaks, data exfiltration)
→ Teams that want eval rigor at zero cost with no data leaving their infrastructure

Worst for

→ Production trace analysis, drift detection, or alerting on live traffic
→ Non-engineering teammates curating datasets or reviewing outputs
→ Teams that want one vendor to cover both pre-ship and post-ship quality

Cost model

Free (MIT open source, self-run). Enterprise cloud and support are custom-quoted.

Time to value

Hours: install the CLI, write a YAML config, first eval runs the same day.

Braintrust

Production tracing, online evals, and dataset curation in one platform. Flat per-team pricing, unlimited seats.

Braintrust is a commercial platform that covers the full quality loop: trace production traffic, score it online with LLM, code, or human scorers, cluster failures automatically (Topics), convert real traces into eval datasets, and gate releases on experiment results. Its differentiators are workflow and pricing. Trace-to-dataset conversion plus human review interfaces (Custom Views) make it practical for PMs and domain experts, not just engineers, to curate datasets and annotate outputs. Pricing is flat per team with unlimited seats on every tier: the free Starter includes 1 GB of processed data and 10,000 scores per month at 14-day retention; Pro is $249/month with 5 GB, 50,000 scores, and 30-day retention; Enterprise adds hybrid or self-hosted deployment. Recent additions include Loop, an agent that proposes better prompts, scorers, and datasets from your own data, and Brainstore, a purpose-built trace database. SDKs cover Python, TypeScript, Go, Ruby, and C#, and the platform is framework-agnostic. Braintrust advertises SOC 2 Type II, HIPAA, and GDPR compliance with SSO and RBAC, and lists Notion, Vercel, Replit, Dropbox, and Coursera among its customers. It is not a red-teaming tool, and your eval data lives in their platform unless you negotiate Enterprise deployment.

Pros

One platform for tracing, online scoring, experiments, datasets, and prompt management
Flat per-team pricing with unlimited seats on every tier: no per-seat math as headcount grows
Trace-to-dataset conversion turns production failures into regression tests in a few clicks
Custom Views and human review workflows make dataset curation practical for PMs and domain experts
Loop agent proposes improved prompts, scorers, and datasets from your own traces
Framework-agnostic SDKs: Python, TypeScript, Go, Ruby, C#
Advertises SOC 2 Type II, HIPAA, GDPR, SSO/SAML, RBAC; hybrid deployment available
Startup program: 6 to 12 months of Pro free for qualifying companies

Cons

$249/month Pro is a real line item for small teams, and overages ($3/GB, $1.50 per 1,000 scores) add up on chatty agent workloads
14-day retention on the free Starter tier is tight for debugging slow-burn regressions
No red teaming or security scanning: you still need Promptfoo or similar for adversarial testing
Eval data and workflows live in their platform; migrating off is real work
No graph-native view of LangGraph agents the way LangSmith has

Best for

→ Production LLM products where tracing, online evals, and dataset curation need to live in one system
→ Mixed engineering and product teams: unlimited seats means PMs, analysts, and reviewers join for free
→ Companies that want CI/CD quality gates wired to the same platform that watches production

Worst for

→ Adversarial security testing (use Promptfoo for red teaming)
→ Deep LangGraph agent debugging (LangSmith's graph-native tracing is stronger there)
→ Pre-revenue projects whose traffic would blow past the free tier's 1 GB and 14-day retention

Cost model

Flat per-team: free Starter (1 GB processed data, 10,000 scores/month, 14-day retention), Pro $249/month (5 GB, 50,000 scores, 30-day retention), Enterprise custom. Overages: $3 to $4 per GB, $1.50 to $2.50 per 1,000 scores depending on tier. Unlimited users on all tiers.
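The tier figures above can be turned into a quick estimate of when Pro becomes cheaper than staying on Starter and paying overages. The sketch below is illustrative only: the `TIERS` table and function names are assumptions, overages are billed pro rata here, and the retention difference between the tiers is not modeled.

```python
# Tier figures as quoted in this article; the structure and names are illustrative.
TIERS = {
    "starter": {"base": 0.0, "gb": 1, "scores": 10_000,
                "per_gb": 4.00, "per_1k_scores": 2.50},
    "pro": {"base": 249.0, "gb": 5, "scores": 50_000,
            "per_gb": 3.00, "per_1k_scores": 1.50},
}


def monthly_cost(tier: str, gb: float, scores: int) -> float:
    """Base fee plus pro-rata overage for data and scores on the given tier."""
    t = TIERS[tier]
    extra_gb = max(0.0, gb - t["gb"])
    extra_scores = max(0, scores - t["scores"])
    return (t["base"]
            + extra_gb * t["per_gb"]
            + extra_scores / 1000 * t["per_1k_scores"])


def cheaper_tier(gb: float, scores: int) -> str:
    """Name of the tier with the lower estimated monthly bill."""
    return min(TIERS, key=lambda name: monthly_cost(name, gb, scores))
```

On these assumptions, 20 GB and 60,000 scores come to $201 on Starter against $309 on Pro, while 40 GB and 150,000 scores come to $506 against $504. Price is only one input: Starter's 14-day retention may force the upgrade well before the bill does.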

Time to value

Days: SDK instrumentation to first production traces and experiments within a week.

Decision scenarios

Shipping an AI feature next sprint and need regression gates in CI before launch

→ Promptfoo

Promptfoo. The CLI plus GitHub Action fails the build when a prompt or model change regresses quality. Running the same day, zero spend, no procurement conversation.

Customer-facing agent that touches sensitive data and needs a security review

→ Promptfoo

Promptfoo. Its red teaming generates prompt-injection, jailbreak, and data-exfiltration attacks against your actual application, plus model scanning and an MCP proxy. Neither Braintrust nor LangSmith does adversarial testing at all.

Production LLM product where PMs and domain experts curate datasets from real traffic

→ Braintrust

Braintrust. Trace-to-dataset conversion plus Custom Views make non-engineer review practical, and unlimited seats mean adding reviewers costs nothing extra.

Several AI features in production, growing team, want predictable tooling spend

→ Braintrust

Braintrust. Flat $249/month Pro with unlimited seats beats per-seat pricing as headcount grows: at LangSmith's $39 per seat, seven seats ($273/month) already exceed it before any trace overages.

Stack is LangChain and LangGraph end to end

→ LangSmith + Promptfoo

Use LangSmith for observability: it is graph-native, and with LangChain and LangGraph both stable at 1.0 since October 2025 the ecosystem risk is lower than it used to be. Keep Promptfoo in CI for regression and security testing. This is the most common two-tool split we deploy.

Pre-seed startup with zero tooling budget

→ Promptfoo

Promptfoo: free and unlimited. If you also need tracing, LangSmith's Developer plan (1 seat, 5,000 base traces/month) or Braintrust's free Starter covers early volume. Upgrade when production traffic forces the question.

Healthcare or fintech product where compliance posture and human review matter

→ Braintrust

Braintrust advertises SOC 2 Type II, HIPAA, GDPR, RBAC, and hybrid deployment on Enterprise, and its human review workflows produce the audit trail regulated buyers ask about. Pair it with Promptfoo red teaming before each release: adversarial test evidence increasingly shows up in security questionnaires.

Team is nervous about the OpenAI acquisition of Promptfoo

→ Promptfoo

Still Promptfoo, with eyes open. The MIT license and model-agnostic support are publicly committed, releases have continued at pace since the March 2026 announcement, and the tool runs locally so there is no hosted service to lose. Revisit only if non-OpenAI provider support starts visibly lagging.

FAQ

Common questions

Yes. OpenAI announced the acquisition on March 9, 2026, and both companies publicly committed to keeping the project open source and model-agnostic, with continued support for existing users and customers. Releases have continued at the usual pace since: version 0.121.17 shipped June 16, 2026, and recent releases still added non-OpenAI providers. Because Promptfoo runs locally, your setup keeps working regardless of what happens to any cloud offering.

Get a recommendation tailored to your situation

BearPlex builds production AI systems using all three tools. We'll tell you which fits your case in a 30-minute scoping call.
