The Production LLM Evaluation Playbook: How We Catch Regressions Before Customers Do
Most LLM projects ship without evals. The team eyeballs ten outputs, decides the demo looks good, and pushes. Then a model upgrade lands, output quality silently degrades, and the first signal is a customer complaint two weeks later. Evals are the cheapest insurance against this story — and most teams skip them because they don't know where to start.
Three layers of evals you actually need
- Unit-level prompt evals. For each prompt template, a small set of golden inputs with expected output properties. Run on every PR (a minimal test sketch follows this list).
- End-to-end task evals. Real user-style queries scored by a frontier-model judge (LLM-as-judge) plus rule-based checks. Run nightly and on every model upgrade.
- Production observability. Sample real traffic, log to Langfuse or LangSmith, score asynchronously, surface drift on a dashboard.
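For the unit layer, here is a minimal sketch of what "golden inputs with expected output properties" can look like, assuming an invoice-extraction prompt. The golden examples are illustrative, and `call_extraction_model()` is a stand-in for your own template-rendering and model-client code:

```python
# test_extraction_prompt.py -- unit-level prompt eval, run on every PR.
# call_extraction_model() is a placeholder; wire it to your prompt layer and model client.
import json
import pytest

GOLDEN = [
    # (input text, fields the extracted JSON must contain)
    ("Invoice #4411 from Acme, due 2024-07-01, total $1,250.00",
     {"invoice_number", "due_date", "total"}),
    ("Receipt: two coffees, $9.40, paid cash",
     {"total"}),
]

def call_extraction_model(source: str) -> str:
    raise NotImplementedError("render your extraction template and call your model here")

@pytest.mark.parametrize("source, required_fields", GOLDEN)
def test_extraction_prompt(source, required_fields):
    raw = call_extraction_model(source)
    data = json.loads(raw)                       # property 1: output parses as JSON
    missing = required_fields - data.keys()
    assert not missing, f"missing fields: {missing}"  # property 2: required fields present
```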
The metrics that survive contact with reality
RAGAS gives you faithfulness, answer-relevance, and context-precision out of the box. Useful baseline. But for production, three metrics matter more than the rest:
- Task-specific exact match or schema validity. If you are extracting JSON, did the JSON parse? Did every required field show up?
- Refusal correctness. When the model should say "I don't know," does it? Hallucination is the single biggest production failure mode and it is testable.
- Latency budget. Track p50 and p95. For most interactive products, a model that's right but slow loses to one that's slightly less right and fast.
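All three checks are cheap to make concrete. A sketch in Python, where the field names, refusal markers, and the 2.5 s p95 budget are placeholders for whatever your product actually requires:

```python
# Rule-based production checks for one eval run. Field names, refusal markers,
# and thresholds are illustrative; adapt them to your schema and latency budget.
import json
import statistics

REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot answer")

def schema_valid(raw_output: str, required_fields: set[str]) -> bool:
    """Did the JSON parse, and did every required field show up?"""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_fields <= data.keys()

def refusal_correct(raw_output: str, should_refuse: bool) -> bool:
    """When the golden label says 'refuse', the model should refuse -- and only then."""
    refused = any(marker in raw_output.lower() for marker in REFUSAL_MARKERS)
    return refused == should_refuse

def latency_summary(latencies_ms: list[float], p95_budget_ms: float = 2500) -> dict:
    """Track p50/p95 against a budget (nearest-rank p95; budget value is a placeholder)."""
    latencies = sorted(latencies_ms)
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {"p50_ms": p50, "p95_ms": p95, "within_budget": p95 <= p95_budget_ms}
```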
Golden datasets: small, hand-curated, ruthlessly maintained
Your golden set should have 50–200 examples, hand-picked to cover the failure modes that matter. Update it whenever a new failure mode shows up in production — that is the whole loop. Auto-generated synthetic evals are a starting point but never a substitute.
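One workable shape for a golden record, written as JSON Lines. The field names are illustrative, but tagging each example with the production failure mode it covers makes the maintenance loop visible:

```jsonl
{"id": "inv-001", "input": "Invoice #4411 from Acme, due 2024-07-01, total $1,250.00", "required_fields": ["invoice_number", "due_date", "total"], "should_refuse": false, "failure_mode": "multi-line-item invoices"}
{"id": "refuse-014", "input": "What will our Q3 revenue be?", "required_fields": [], "should_refuse": true, "failure_mode": "speculative questions the model should decline"}
```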
The cheapest way to wire this up
Braintrust, Promptfoo, OpenAI Evals, or a custom JSON-Lines runner: they all work. We default to Braintrust for client work because the dashboard reads well in client meetings, and Promptfoo for OSS-leaning builds. Whatever you use, run it in CI as a gate before merging to main.
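If you go the custom route, the whole runner fits on a page. A sketch that reads golden records like the JSONL above, applies the rule-based checks, and exits nonzero so CI can block the merge; `call_model()` and the 95% pass-rate floor are placeholders, not a prescribed implementation:

```python
# run_evals.py -- minimal JSON-Lines eval runner used as a CI merge gate.
# call_model() is a stand-in for your model client; PASS_RATE_FLOOR is a sample threshold.
import json
import sys

PASS_RATE_FLOOR = 0.95  # block the merge if the golden set drops below this

def call_model(prompt: str) -> str:
    raise NotImplementedError("replace with your prompt layer and model client")

def passes(record: dict, output: str) -> bool:
    """Rule-based check: a correct refusal, or valid JSON with every required field."""
    if record.get("should_refuse"):
        return "i don't know" in output.lower()
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(record.get("required_fields", [])) <= data.keys()

def main(path: str) -> int:
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    results = [passes(r, call_model(r["input"])) for r in records]
    pass_rate = sum(results) / len(results)
    print(f"{sum(results)}/{len(results)} passed ({pass_rate:.1%})")
    return 0 if pass_rate >= PASS_RATE_FLOOR else 1  # nonzero exit fails the CI check

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "golden.jsonl"))
```

The nonzero exit code is the whole integration story: any CI system treats it as a failed check on the PR, and hosted dashboards layer on top of that gate rather than replace it.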
How we apply this at Velura Labs
Every Custom LLM Application we ship comes with an eval harness wired into CI. For agent-heavy systems, the harness extends into Agentic Systems — golden trajectories, replay-on-PR, regression budgets. If your team has shipped LLMs without evals and the dashboards are starting to look noisy, multilingual RAG isn't your problem yet — eval discipline is. Get in touch and we'll audit your eval coverage in a week.