RAG Evaluation in 2026: Beyond "It Seems to Work in Demos"

Dr Ishit Karoli
May 5, 2026

The hardest part of running RAG in production is figuring out what’s broken when quality dips. Was the retrieval bad? Was the answer hallucinated despite good context? Was the question malformed? You can’t debug what you can’t separate. Here’s the evaluation layout we put in place before a RAG system goes live.

Separate retrieval and generation metrics

Retrieval gets its own eval set: query → expected document IDs. Generation gets a different eval set: (query + correct context) → expected answer. Mixing them is the most common evaluation mistake we audit. When the blended metric drops, you have no idea where to look. Split the metrics and your debugging time drops 5×.
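One way to keep the two eval sets apart is to give each its own record type. The sketch below is a minimal illustration, not a prescribed schema: the class names, field names, the `locate_regression` helper, and its default tolerance of 0.02 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalCase:
    """Retrieval eval item: a query and the document IDs it should surface."""
    query: str
    expected_doc_ids: frozenset[str]


@dataclass(frozen=True)
class GenerationCase:
    """Generation eval item. The context is fixed and known-correct,
    so a wrong answer cannot be blamed on the retriever."""
    query: str
    context: tuple[str, ...]
    expected_answer: str


def locate_regression(retrieval_drop: float, generation_drop: float,
                      tolerance: float = 0.02) -> str:
    """Name the stage whose score fell by more than `tolerance`.

    Drops are baseline score minus current score. The default tolerance
    is a hypothetical value, not a recommendation.
    """
    retrieval_bad = retrieval_drop > tolerance
    generation_bad = generation_drop > tolerance
    if retrieval_bad and generation_bad:
        return "both"
    if retrieval_bad:
        return "retrieval"
    if generation_bad:
        return "generation"
    return "none"


retrieval_case = RetrievalCase(
    query="How long do refunds take?",
    expected_doc_ids=frozenset({"policy-7"}),
)
generation_case = GenerationCase(
    query="How long do refunds take?",
    context=("Refunds are issued within 14 days of purchase.",),
    expected_answer="Within 14 days of purchase.",
)
stage = locate_regression(retrieval_drop=0.10, generation_drop=0.0)
```

With separate scores, a drop in one and not the other points directly at the stage to inspect, which a single blended number cannot do.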

Retrieval metrics that actually matter

  • Recall@k: does the right document appear in the top k? Pick k based on how many chunks your generation context window can hold, not aesthetics.
  • MRR (mean reciprocal rank): how high does the first relevant document rank? It averages 1/rank across queries, so it penalises rankings where borderline-relevant results sit above the right one.
  • Coverage: across all queries, how often does the system retrieve any relevant document? A lower-than-expected value here usually means embedding model mismatch, not chunking strategy.
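The three retrieval metrics above can be sketched in a few lines. This is a minimal version under stated assumptions: function names and the sample data are illustrative, Recall@k is computed as the fraction of expected IDs found in the top k (identical to a hit/miss check when each query has one expected document), and coverage counts a query as covered if any expected ID appears anywhere in the returned list.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the expected document IDs that appear in the top k results."""
    if not relevant:
        raise ValueError("each query needs at least one expected document ID")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def is_covered(ranked: Sequence[str], relevant: Set[str]) -> bool:
    """True if the retriever returned any relevant document at all."""
    return any(doc_id in relevant for doc_id in ranked)


# Hypothetical results: (ranked IDs returned, expected IDs) per query.
runs = [
    (["d3", "d1", "d7"], {"d1"}),         # right document at rank 2
    (["d2", "d9", "d4"], {"d2", "d4"}),   # one of two expected IDs in the top 2
    (["d5", "d6", "d8"], {"d0"}),         # nothing relevant retrieved
]

mean_recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
coverage = sum(is_covered(r, rel) for r, rel in runs) / len(runs)
```

Reporting all three together is useful because they fail differently: the third query hurts every metric, while the first only hurts MRR and recall at small k.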

Generation metrics that actually matter

  • Faithfulness: is every claim in the answer supported by the provided context? LLM-as-judge works here if calibrated to your domain.
  • Answer relevance: did the answer address the question? This is different from faithfulness; an answer can be faithful and irrelevant.
  • Citation accuracy: do the cited sources actually support the claims? Critical in regulated domains.
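Faithfulness and citation accuracy can both be framed as "what share of claims does a judge accept as supported by some evidence". The sketch below assumes the answer has already been split into claims and takes the judge as a plain callable, so an LLM call can be plugged in. The `substring_judge` here is only a stand-in so the example runs offline; it is not a real judge, and all names and sample data are hypothetical.

```python
from typing import Callable, Mapping, Sequence, Tuple

# A judge takes (claim, evidence) and says whether the evidence supports the claim.
Judge = Callable[[str, str], bool]


def substring_judge(claim: str, evidence: str) -> bool:
    """Placeholder judge: case-insensitive substring match.
    In practice this would be a calibrated LLM-as-judge call."""
    return claim.lower() in evidence.lower()


def faithfulness(claims: Sequence[str], context: str, judge: Judge) -> float:
    """Share of answer claims the judge finds supported by the provided context."""
    if not claims:
        raise ValueError("no claims to score")
    return sum(judge(claim, context) for claim in claims) / len(claims)


def citation_accuracy(citations: Sequence[Tuple[str, str]],
                      sources: Mapping[str, str], judge: Judge) -> float:
    """Share of (claim, source_id) pairs where the cited source supports the claim.
    A citation to an unknown source ID counts as unsupported."""
    if not citations:
        raise ValueError("no citations to score")
    supported = sum(
        source_id in sources and judge(claim, sources[source_id])
        for claim, source_id in citations
    )
    return supported / len(citations)


sources = {
    "policy-7": "Refunds are issued within 14 days of purchase.",
    "faq-2": "Support is available on weekdays.",
}
context = " ".join(sources.values())

faith = faithfulness(
    ["refunds are issued within 14 days", "support is available on weekdays"],
    context, substring_judge,
)
cite_acc = citation_accuracy(
    [
        ("refunds are issued within 14 days", "policy-7"),
        ("support is available on weekends", "faq-2"),   # source says weekdays
        ("shipping is free", "missing-doc"),             # cited ID does not exist
    ],
    sources, substring_judge,
)
```

Keeping the judge as a parameter also makes calibration easier: the same scoring code can be run with human labels in place of the model to check agreement.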

The eval set you build, not the one you download

Public RAG benchmarks tell you whether your stack is plausible. Your own domain eval set tells you whether it works. Build 100–300 hand-curated queries with expected answers and expected sources before shipping. Re-curate quarterly as your domain shifts. Every team we see in trouble skipped this and tried to retrofit it after a quality incident.

Online metrics complete the picture

Offline evals validate against frozen expectations. Online metrics — citation click-through, thumbs-up rate, follow-up question rate, escalation-to-human rate — tell you how it behaves in the wild. Wire both. Offline regressions catch most issues before deploy; online metrics catch the rest.
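As a rough illustration of the online side, the four signals named above can be aggregated from an interaction log. The `Interaction` record and its field names are assumptions for this sketch, as is the choice to compute thumbs-up rate over rated interactions only; a real logging schema will differ.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Interaction:
    """One answered query, as it might appear in a (hypothetical) event log."""
    thumbs_up: Optional[bool]   # None when the user left no rating
    citation_clicked: bool
    follow_up_asked: bool
    escalated_to_human: bool


def online_metrics(log: Sequence[Interaction]) -> Dict[str, float]:
    """Aggregate per-interaction signals into rates between 0 and 1."""
    if not log:
        raise ValueError("log is empty")
    total = len(log)
    rated = [i for i in log if i.thumbs_up is not None]
    return {
        "citation_click_through": sum(i.citation_clicked for i in log) / total,
        # Assumption: unrated interactions are excluded from the denominator.
        "thumbs_up_rate": sum(bool(i.thumbs_up) for i in rated) / len(rated) if rated else 0.0,
        "follow_up_rate": sum(i.follow_up_asked for i in log) / total,
        "escalation_rate": sum(i.escalated_to_human for i in log) / total,
    }


log = [
    Interaction(thumbs_up=True, citation_clicked=True, follow_up_asked=False, escalated_to_human=False),
    Interaction(thumbs_up=None, citation_clicked=False, follow_up_asked=True, escalated_to_human=False),
    Interaction(thumbs_up=False, citation_clicked=False, follow_up_asked=True, escalated_to_human=True),
    Interaction(thumbs_up=True, citation_clicked=True, follow_up_asked=False, escalated_to_human=False),
]
metrics = online_metrics(log)
```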

Continuous regression suite

Treat the eval set like a test suite. Run it on every change to the prompt, the retriever, or the model. Block deploys on regressions above a threshold. Track per-category metrics — the average can hide a 20% drop in one user segment.
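A deploy gate along these lines might look like the sketch below. It compares per-category scores against a stored baseline and flags any category whose score fell by more than a threshold. The function name, the category names, the scores, and the 0.05 threshold are all illustrative assumptions, and drops are measured as absolute differences in the metric.

```python
from typing import Dict


def find_regressions(baseline: Dict[str, float], current: Dict[str, float],
                     max_drop: float) -> Dict[str, float]:
    """Return {category: drop} for every category that fell by more than max_drop.
    A category missing from `current` is scored 0.0, so it is always flagged."""
    regressions = {}
    for category, base_score in baseline.items():
        drop = base_score - current.get(category, 0.0)
        if drop > max_drop:
            regressions[category] = drop
    return regressions


# Hypothetical per-category scores on the same eval set, before and after a change.
baseline = {"billing": 0.85, "onboarding": 0.85, "legal": 0.85}
current = {"billing": 0.93, "onboarding": 0.92, "legal": 0.65}

mean_drop = sum(baseline.values()) / len(baseline) - sum(current.values()) / len(current)
regressions = find_regressions(baseline, current, max_drop=0.05)
should_block_deploy = bool(regressions)
```

In this example the average moves by less than two points, which would pass a gate on the mean alone, while the per-category check catches the 20-point drop in one segment. In CI, `should_block_deploy` would typically translate into a non-zero exit code.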

How we approach this at Velura Labs

Our RAG & Knowledge Systems engagements ship with retrieval and generation eval sets as deliverables, not afterthoughts. Pair them with Custom LLM Applications for the broader system, and with Agentic Systems if your retrieval is part of an agent workflow. Read our production evaluation playbook for the broader framing and vector database decision for the infra side. Talk to us when your RAG quality is "fine on most queries" and you can’t tell which ones it fails on.
