Building Multilingual RAG for Bharat: Why English-Only Systems Fall Short
The default RAG pipeline most teams ship still assumes English documents and English queries. In India, that assumption locks out roughly 800 million people who think, type, and speak in languages other than English. If you are building for Bharat, the pipeline needs to be multilingual by design from day one, not retrofitted after launch.
Where naive pipelines break
Most off-the-shelf embeddings are trained predominantly on English. When you embed a Hindi query alongside a Tamil document, semantic similarity collapses to noise. You get retrievals that share keywords but not meaning, or no retrievals at all because the query and corpus live in different vector neighbourhoods.
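The collapse is easy to see with cosine similarity directly. The sketch below uses toy 4-dimensional vectors standing in for real embeddings; the numbers are illustrative, not taken from any actual model — they mimic an English-centric embedder that puts the Hindi query and the Tamil document in unrelated regions of the space even though the meaning matches.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors: with an English-centric embedder, the same
# question asked in Hindi lands nowhere near its English phrasing,
# and nowhere near the Tamil document that actually answers it.
en_query    = [0.9, 0.1, 0.0, 0.1]  # "how do I renew my licence"
hi_query    = [0.1, 0.0, 0.9, 0.2]  # same question, Devanagari script
ta_document = [0.0, 0.9, 0.1, 0.1]  # matching answer, Tamil script

print(cosine(en_query, hi_query))    # low despite identical intent
print(cosine(hi_query, ta_document)) # low: retrieval surfaces nothing useful
```

With a multilingual embedder tuned for Indic scripts, those cross-lingual pairs should score close to their monolingual equivalents; that gap is the first thing worth measuring on your own corpus.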
The first symptom is a quiet one: answer-faithfulness scores look fine on your English test set, then collapse when real users from Patna or Coimbatore start asking questions.
The four design choices that matter
- Embedding model. Use multilingual embeddings tuned for Indic scripts — Multilingual-E5, BGE-M3, or domain-specific Indian models. Stock OpenAI embeddings handle Devanagari but degrade for Tamil, Bengali and Telugu.
- Chunking. Sentence boundaries differ across scripts. Naive 500-token chunkers will split Devanagari mid-sentence. Use script-aware splitters or tune chunk size per language.
- Reranking. A multilingual reranker (Cohere Rerank multilingual, BAAI/bge-reranker-v2-m3) closes most of the gap. This is the single highest-leverage retrieval upgrade for Indic corpora.
- Generation model. Sarvam, Bhashini, and Krutrim are catching up fast on Indic generation. Frontier models (Claude, GPT-5, Gemini 2.5) handle Hindi and English well but lose fluency in lower-resource languages.
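The chunking point above is the easiest to get concretely wrong. A minimal sketch of a script-aware splitter: Devanagari ends sentences with the danda (।) or double danda (॥), which byte- and token-count chunkers ignore. Splitting on those boundaries first, then greedily packing whole sentences into chunks, keeps sentences intact. The `max_chars` budget and sample text are illustrative.

```python
import re

# Split on whitespace that follows a sentence terminator, covering
# the Devanagari danda/double danda alongside Latin punctuation.
SENTENCE_END = re.compile(r'(?<=[।॥.!?])\s+')

def chunk_sentences(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

hindi = "भारत एक विशाल देश है। यहाँ कई भाषाएँ बोली जाती हैं। हर भाषा की अपनी लिपि है।"
for chunk in chunk_sentences(hindi, max_chars=40):
    print(chunk)  # each chunk ends on a danda, never mid-sentence
```

In production you would extend the terminator set per script (Tamil and Bengali use the Latin full stop, Urdu uses ۔) rather than hard-coding one regex for all languages.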
Why hybrid retrieval matters more for Indic languages
BM25 plus dense retrieval consistently beats either alone on Indic corpora, partly because transliteration is messy. Users type "namaste" in Latin script, the corpus has it in Devanagari, and only sparse retrieval bridges the gap. We rarely ship a Bharat-facing RAG without hybrid retrieval and a reranker on top.
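One common way to combine the two result lists is reciprocal rank fusion (RRF). The sketch below is a minimal RRF implementation over hypothetical ranked lists — the document IDs and the scenario (BM25 catching a transliterated keyword the dense retriever misses) are invented for illustration; `k = 60` is the value from the original RRF paper.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over rankers of 1/(k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists for a Latin-script query against a
# Devanagari corpus. BM25 surfaces the transliterated keyword match
# the dense retriever buries; fusion keeps both candidates near the top.
bm25_ranking  = ["doc_namaste_dev", "doc_greetings", "doc_misc"]
dense_ranking = ["doc_greetings", "doc_culture", "doc_namaste_dev"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)  # both strong candidates land in the top two
```

A reranker then rescores this fused shortlist, which is why the two techniques compose so well: fusion fixes recall, the reranker fixes precision.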
Evaluation across languages
If you are running RAGAS or DeepEval, run them per-language. A 92% faithfulness average looks fine until you realise it is 96% in English and 68% in Bengali. We build per-language eval splits, score them separately, and ship only when the worst language meets the bar.
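The gating logic described above fits in a few lines. The sketch below aggregates faithfulness per language and blocks the release when the worst language misses the bar; the scores, language codes, and the 0.85 threshold are illustrative, not from a real evaluation run.

```python
def worst_language_gate(
    scores: dict[str, list[float]], bar: float = 0.85
) -> tuple[bool, dict[str, float]]:
    """Average faithfulness per language; ship only if the worst meets the bar."""
    per_lang = {lang: sum(vals) / len(vals) for lang, vals in scores.items()}
    return min(per_lang.values()) >= bar, per_lang

# Illustrative per-sample faithfulness: a healthy pooled average
# hiding a weak language, exactly the failure mode described above.
faithfulness = {
    "en": [0.97, 0.95, 0.96],
    "hi": [0.90, 0.88, 0.92],
    "bn": [0.70, 0.66, 0.68],  # Bengali drags the floor down
}
ship, per_lang = worst_language_gate(faithfulness, bar=0.85)
print(per_lang)
print("ship" if ship else "hold: worst language below bar")
```

The pooled mean here would clear 0.85 comfortably; the per-language floor does not, which is the whole argument for scoring splits separately.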
How we approach this at Velura Labs
Our RAG & Knowledge Systems service handles multilingual corpora natively, with hybrid retrieval, Indic-tuned reranking, and per-language eval baked into the build. For citizen-facing systems with chatbot surfaces, the same patterns extend into our AI & Data Solutions for government engagements. If you are stuck with an English-only pipeline that needs to serve Bharat, talk to us — we will tell you in 30 minutes whether to retrofit or rebuild.