Voiceflow migrated between two versions of the same model. Not GPT-3.5 to GPT-4. GPT-3.5 to GPT-3.5, one snapshot to the next. Intent classification dropped 10 percent.
They caught it fast, and that is the remarkable part. The Applied LLMs practitioners who documented the incident point out that it was caught only because the evals existed. Same prompt, same pipeline, same model family, and a tenth of the system's accuracy quietly gone. Without an evaluation suite, that regression ships. It sits in production for weeks. Users feel it before anyone inside the building does, and when someone finally notices, the team blames the prompt.
I think about that story every time a founder tells me their LLM feature is almost there. Almost there usually means someone is on prompt version forty and the demo behaves when the CEO is watching. The bottleneck in these products is rarely the prompt and rarely the model. The bottleneck is that nobody can answer the only question that matters: did the last change make this better or worse?
The Treadmill
Prompt tweaking feels like progress because the feedback is instant. You adjust a sentence, rerun your three favorite test inputs, and the output reads better. Small dopamine hit. Commit. Ship.
What you cannot see is everything else that moved. A language model is a distribution over behaviors, and a prompt edit shifts the entire distribution. You fixed the tone on your three test cases and broke date extraction on cases you never look at. Nobody checked, because nothing forces the check. Without a fixed yardstick, every edit is a coin flip where you only ever see one side of the coin.
Hamel Husain, whose field work on this problem is the sharpest I have read, finds that unsuccessful AI products almost always share a single root cause: the failure to create robust evaluation systems. Not weak models. Not clumsy prompts. A missing measurement layer. The stalled teams all look the same. They polish outputs by eye, they argue about whether responses feel better, and they accumulate a folder of prompts with names like final-v9-ACTUAL-this-one. Motion, mistaken for progress.
The stall has a body count. S&P Global found that in 2025, 42 percent of companies abandoned most of their AI initiatives before they reached production, up from 17 percent a year earlier. Plenty of those deaths were deserved. Some were products that worked, attached to teams who could not prove it. When you cannot demonstrate improvement, you cannot defend the next sprint, and the pilot dies in a budget review.
The Anatomy
A working evaluation system has three levels. Hamel Husain's framework lays them out, and the order matters because each level earns the next.
Assertions. Deterministic unit tests that run on every change: the output parses as JSON, the required fields exist, the date is a real date, the response never mentions a competitor, the refund amount never exceeds the invoice. They are cheap to run and brutal in what they catch. GoDaddy's engineering team measured malformed output on roughly 1 percent of its structured GPT-3.5 calls. One percent sounds tolerable until you multiply by production volume and realize it is a steady stream of broken responses every single day. Assertions catch those in code and trigger retries before a user ever sees them.
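To make the assertion level concrete, here is a minimal sketch of a few of those checks plus the retry loop around them. The field names (refund_amount, refund_date), the helper names, and the three-attempt limit are assumptions made for illustration, not anything GoDaddy or Husain published.

```python
import json
from datetime import date


def check_refund_output(raw: str, invoice_total: float) -> list[str]:
    """Return a list of assertion failures; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    # Hypothetical required fields for a refund response.
    failures = [f"missing field: {f}" for f in ("refund_amount", "refund_date") if f not in data]
    if failures:
        return failures

    # The date must be a real calendar date, not just date-shaped text.
    try:
        date.fromisoformat(data["refund_date"])
    except (TypeError, ValueError):
        failures.append("refund_date is not a real date")

    # The refund must be a number and must never exceed the invoice.
    amount = data["refund_amount"]
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or amount > invoice_total:
        failures.append("refund_amount is invalid or exceeds the invoice")
    return failures


def call_with_retries(generate, invoice_total: float, max_attempts: int = 3) -> dict:
    """Call generate() until its output passes every assertion, or give up."""
    for _ in range(max_attempts):
        raw = generate()
        if not check_refund_output(raw, invoice_total):
            return json.loads(raw)
    raise RuntimeError("model output failed assertions on every attempt")
```

Here generate stands in for whatever function calls your model and returns its raw text. The same check_refund_output function can run in the test suite on every change and in production in front of every response.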
Trace review. Humans reading real production transcripts, on a calendar cadence, with tooling so frictionless there is no excuse to skip it. This is where you learn what is going wrong instead of what you guessed might go wrong, and it is where your failure taxonomy comes from. Once volume outgrows human reading, you add LLM judges, with one hard rule attached: Husain's guidance is to track the correlation between judge scores and human labels before trusting the judge. An unvalidated judge does not measure your failures. It launders them into a green dashboard while users quietly churn.
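One way to put a number on judge trust is sketched below. It assumes both the human reviewer and the LLM judge emit a simple pass/fail label per trace, and it reports raw agreement alongside Cohen's kappa, which discounts the agreement you would get by chance. Kappa is my choice of statistic for binary labels, not one Husain prescribes, and the function name is hypothetical.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare LLM-judge labels with human labels on the same traces."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    # Fraction of traces where judge and human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, given how often each side says "pass".
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)

    # Kappa is undefined when chance agreement is total (both sides always
    # give one identical label); report 0.0, meaning no evidence either way.
    kappa = 0.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}
```

A judge that agrees with humans 95 percent of the time on a dataset where 95 percent of traces pass has told you almost nothing, which is why the chance-corrected figure sits next to the raw one.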
A/B tests. The final word. When the assertions pass and the traces look clean, real users tell you whether the improvement is real. Everything before this level exists to make sure you only spend live traffic on candidates that deserve it.
There is a fourth piece that sits beside the levels rather than inside them: deterministic gates on anything irreversible. GoDaddy gates human handoff with code-identified stop phrases instead of the model's own judgment, because a model that is right most of the time still fires the wrong action constantly at scale. Refunds, deletions, escalations, anything with consequences gets decided by code. We gave this whole discipline its own pillar in the AI Readiness Audit we published, because the public failure data points at evaluation more than at any other layer of the stack.
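A deterministic gate can be very small. The sketch below assumes a hypothetical list of stop phrases and a hypothetical refund rule. Which text you scan, the user's message or the model's output, is a design choice the source does not specify. The point is only that the irreversible decision is made by plain code.

```python
import re

# Hypothetical stop phrases; a real list would come from trace review.
STOP_PHRASES = ("speak to a human", "talk to an agent", "cancel my account")


def should_hand_off(text: str) -> bool:
    """Decide human handoff by exact phrase match, never by model judgment."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return any(phrase in normalized for phrase in STOP_PHRASES)


def approve_refund(requested: float, invoice_total: float) -> bool:
    """Gate an irreversible action on a rule the model cannot talk its way past."""
    return 0 < requested <= invoice_total
```

The model can still draft the reply and propose the action. It just never gets the final say on anything you cannot undo.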
The Afternoon Upgrade
Models now churn faster than budget cycles. Every few months something better and cheaper lands, and every team faces the same question: migrate, or stay put and watch the cost gap widen?
Without evals, migration is a rewrite. Someone re-tests the system by hand, the team argues about whether the new outputs seem fine, somebody senior gets nervous, and the migration slides a quarter. Teams stay on aging models for a year, paying more for worse output, purely because nobody can say what would break.
With an eval suite, migration is an afternoon. Pin the current model version. Point the suite at the candidate. Read the diff: where it wins, where it regresses, which assertions moved. Decide by evening, with a rollback path and a number attached to the decision. Voiceflow's 10 percent drop was never a crisis, because the suite turned a silent regression into a line item.
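Reading the diff can be sketched in a few lines. The function below assumes each eval run has been reduced to a mapping from a test case ID to pass or fail. The names and the result shape are illustrative, not any particular tool's format.

```python
def diff_eval_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-case results from the pinned model and the candidate model."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no test cases")
    return {
        "baseline_pass_rate": sum(baseline[c] for c in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[c] for c in shared) / len(shared),
        # Cases that passed before and fail now: the list to read first.
        "regressions": [c for c in shared if baseline[c] and not candidate[c]],
        # Cases that failed before and pass now.
        "wins": [c for c in shared if not baseline[c] and candidate[c]],
    }
```

The per-case lists matter as much as the headline rate. Two runs can post the same overall score while trading a fixed case for a newly broken one, and only the lists show it.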
The same suite covers you when a provider goes down. GoDaddy lost its chatbots for hours to a provider outage and came away treating outages, malformed outputs, and version drift as routine engineering problems: retries, fallbacks, multi-provider switching, all handled in code. But a fallback provider is only safe to fail over to if you know its quality holds, and the only way to know in advance is a suite you can run against it before the bad day arrives.
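The retry-and-fallback pattern might look like the sketch below. Provider calls are passed in as plain functions, so nothing here depends on a specific vendor SDK. The function name, the two-attempt default, and the is_valid hook are assumptions for illustration.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
    is_valid: Callable[[str], bool],
    attempts_per_provider: int = 2,
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, output) on first valid output."""
    errors = []
    for name, call in providers:
        for _ in range(attempts_per_provider):
            try:
                output = call(prompt)
            except Exception as exc:  # outage, timeout, rate limit, and so on
                errors.append(f"{name}: {exc}")
                continue
            if is_valid(output):
                return name, output
            errors.append(f"{name}: output failed validation")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The is_valid hook is where the assertion suite plugs in, and the provider order should list only fallbacks that have already been run through the same evals.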
This is the asymmetry most teams miss. The prompt is disposable, tuned to one model's quirks. The model is rented, and the terms change without notice. The eval suite is the only layer you own, and it compounds: every production failure becomes a new test case, and every test case makes the next migration cheaper. It outlives every model it ever judged.
The Spec You Never Wrote
Here is the reframe I want to leave you with. Most LLM products have no written spec, because nobody can fully specify in advance what "answer customer emails well" means. But the eval suite specifies it, executably. Each assertion is a requirement someone decided to enforce: the assistant never invents a policy, never quotes a price outside the table, always cites a source. The trace-review taxonomy, ranked by frequency, is your roadmap, ordered by what real users hit rather than by what the loudest voice in planning guessed. The judge rubric is your quality bar, finally written down where it can be argued with.
That is why I say the evals are the product. The model generates. The evals define what good means, and the definition is the part your customers experience.
The same artifacts feed the heavier machinery later. Graded outputs and human preference labels are the raw material of RLHF and alignment tuning, so when the day comes to tune behavior into a model rather than prompt around it, a team with mature evals already owns the dataset that work requires. A team without them starts from zero, twice.
So start, and start ugly. Twenty assertions and one hour a week reading traces beat the elegant evaluation platform you are planning to build next quarter. Wire the suite into CI so no prompt change lands without a score. Promote every production failure into a test case the same day it happens. In six months you will hold the most valuable artifact in your stack, the one thing that gets stronger every time a model gets replaced.
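Starting ugly can be as plain as a JSONL file of cases and a pass-rate gate. The sketch below assumes a hypothetical file layout (one JSON record per line with input, expected, and note fields) and a threshold you choose yourself. It is not tied to any CI product.

```python
import json
from pathlib import Path


def promote_failure(cases_path: Path, input_text: str, expected: str, note: str) -> None:
    """Append a production failure to the test-case file as one JSON line."""
    record = {"input": input_text, "expected": expected, "note": note}
    with cases_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_cases(cases_path: Path) -> list[dict]:
    """Read every test case back, skipping blank lines."""
    with cases_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def ci_exit_code(results: list[bool], min_pass_rate: float) -> int:
    """Return 0 if the suite clears the bar and 1 otherwise, for use as a CI exit status."""
    if not results:
        return 1  # an empty suite should never count as a pass
    return 0 if sum(results) / len(results) >= min_pass_rate else 1
```

A CI job would run the suite over load_cases, collect one boolean per case, and pass ci_exit_code's result to sys.exit so a prompt change that drops below the bar fails the build.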
The teams that win the next model generation will not be the ones with the cleverest prompts. They will be the ones who can tell, by the end of an afternoon, whether the new thing is better. Build the yardstick first. Then argue about prompts.
