All posts
LLM evaluation · RAGAS · Braintrust

The Production LLM Evaluation Playbook: How We Catch Regressions Before Customers Do

Dr Ishit Karoli
August 28, 2025
2 min read · 5 sections


Most LLM projects ship without evals. The team eyeballs ten outputs, decides the demo looks good, and pushes. Then a model upgrade lands, output quality silently degrades, and the first signal is a customer complaint two weeks later. Evals are the cheapest insurance against this story — and most teams skip them because they don't know where to start.

Three layers of evals you actually need

  • Unit-level prompt evals. For each prompt template, a small set of golden inputs with expected output properties. Run on every PR.
  • End-to-end task evals. Real user-style queries scored by a frontier-model judge (LLM-as-judge) plus rule-based checks. Run nightly and on every model upgrade.
  • Production observability. Sample real traffic, log to Langfuse or LangSmith, score asynchronously, and surface drift on a dashboard.
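The first layer is the easiest to wire up. A minimal sketch of a unit-level prompt eval, assuming a hypothetical invoice-extraction prompt — the golden cases, the `generate` stub, and the substring checks are all illustrative, not a specific framework's API:

```python
# Unit-level prompt eval: golden inputs with expected output properties.
# Run on every PR; fail the build if any case regresses.

GOLDEN_CASES = [
    # Each case pairs an input with substrings the output must contain.
    {"input": "Extract: 'Invoice #123, due 2025-09-01'",
     "must_contain": ["123", "2025-09-01"]},
    {"input": "Extract: 'no invoice data here'",
     "must_contain": ["null"]},
]

def check_output(output: str, must_contain: list[str]) -> list[str]:
    """Return the expected substrings missing from the model output."""
    return [needle for needle in must_contain if needle not in output]

def run_unit_evals(generate) -> list[str]:
    """Run every golden case through `generate` (your model call)."""
    failures = []
    for case in GOLDEN_CASES:
        missing = check_output(generate(case["input"]), case["must_contain"])
        if missing:
            failures.append(f"{case['input']!r}: missing {missing}")
    return failures
```

In practice `generate` would be your real model client; the point is that the golden cases and checks live in version control next to the prompt template.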

The metrics that survive contact with reality

RAGAS gives you faithfulness, answer relevance, and context precision out of the box — a useful baseline. But in production, three metrics matter more than the rest:

  • Task-specific exact match or schema validity. If you are extracting JSON, did the JSON parse? Did every required field show up?
  • Refusal correctness. When the model should say "I don't know," does it? Hallucination is the single biggest production failure mode and it is testable.
  • Latency budget. Track p50 and p95. A model that's right but slow is worse than one that's slightly less right and fast.
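The first two of these are cheap deterministic checks. A sketch, assuming a hypothetical JSON-extraction task — the required fields and refusal markers are placeholders you would tune to your own schema and prompt phrasing:

```python
import json

# Assumptions: your output schema and your prompt's refusal wording.
REQUIRED_FIELDS = {"invoice_id", "due_date"}
REFUSAL_MARKERS = ("i don't know", "i cannot", "not enough information")

def schema_valid(raw: str) -> bool:
    """Did the JSON parse, and did every required field show up?"""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_FIELDS <= obj.keys()

def refused(raw: str) -> bool:
    """Crude string-match refusal detector; tune markers to your prompt."""
    return any(marker in raw.lower() for marker in REFUSAL_MARKERS)

def score(raw: str, should_refuse: bool) -> bool:
    """Pass iff the model refuses exactly when it should."""
    if should_refuse:
        return refused(raw)
    return schema_valid(raw) and not refused(raw)
```

Note the asymmetry: a correct answer that also hedges with "I don't know" still fails, because downstream parsers will choke on it.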

Golden datasets: small, hand-curated, ruthlessly maintained

Your golden set should have 50–200 examples, hand-picked to cover the failure modes that matter. Update it whenever a new failure mode shows up in production — that is the whole loop. Auto-generated synthetic evals are a starting point but never a substitute.

The cheapest way to wire this up

Braintrust, Promptfoo, OpenAI Evals, or a custom JSON Lines runner — they all work. We default to Braintrust for client work because the dashboard reads well in client meetings, and to Promptfoo for OSS-ish builds. Whatever you pick, run it in CI as a gate before merging to main.
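If you go the custom-runner route, the gate itself is a few lines. A sketch, assuming the JSONL case format above and a `generate` callable standing in for your model client; the 95% threshold is an arbitrary starting point:

```python
def gate(cases: list[dict], generate, min_pass_rate: float = 0.95) -> bool:
    """Run all cases; return True only if the pass rate clears the bar.

    In CI, wire the boolean to the exit code, e.g.
    sys.exit(0 if gate(cases, generate) else 1), so a regression
    blocks the merge instead of reaching customers.
    """
    passed = 0
    for case in cases:
        output = generate(case["input"])
        if all(s in output for s in case.get("must_contain", [])):
            passed += 1
    rate = passed / max(len(cases), 1)
    print(f"eval gate: {passed}/{len(cases)} passed ({rate:.0%})")
    return rate >= min_pass_rate
```

A pass-rate threshold rather than an all-or-nothing check keeps the gate usable while the golden set still contains a few known-hard cases.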

How we apply this at Velura Labs

Every Custom LLM Application we ship comes with an eval harness wired into CI. For agent-heavy systems, the harness extends into Agentic Systems — golden trajectories, replay-on-PR, regression budgets. If your team has shipped LLMs without evals and the dashboards are starting to look noisy, multilingual RAG isn't your problem yet — eval discipline is. Get in touch and we'll audit your eval coverage in a week.
