fine-tuning · LoRA · open weights

When Fine-Tuning Actually Pays Off — and the 90% of Cases Where It Doesn’t

Dr Ishit Karoli
September 24, 2025


"We need to fine-tune our own model." We hear this in nine out of ten AI strategy calls, and in roughly nine out of ten of those, the right answer is: probably not. Frontier models plus a careful RAG pipeline plus a tight prompt cover most of the territory. But there is a real, defensible 10% where fine-tuning is the right tool. Here is how to tell.

The four scenarios where fine-tuning genuinely wins

  • Latency or cost compression at scale. If you are running tens of millions of inferences a month, fine-tuning a 7B model can be 10–30× cheaper than calling a frontier API. The math gets very compelling above a million calls a day.
  • Strict output format compliance. If your downstream system breaks on malformed JSON and prompting alone gives you 99.2% compliance, fine-tuning can push it past 99.9%. That last 0.7% is sometimes worth a quarter's budget.
  • Strong domain shift. Legal Marathi, medical Tamil, telecom-specific jargon — niche language models fine-tuned on the right data outperform general-purpose ones for narrow tasks. RAG plus a frontier model usually beats fine-tuning, but not always.
  • On-prem or air-gapped deployment. If the customer requires no API calls, fine-tuning a small open-weight model is the only path. This is increasingly common in BFSI and government.
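The cost-compression claim in the first bullet is easy to sanity-check with back-of-envelope math. The sketch below compares a per-token frontier API against always-on GPUs serving a fine-tuned 7B model; every number in it (prices, token counts, GPU count) is an illustrative assumption, not a quote.

```python
# Illustrative break-even sketch. All prices are assumptions, not quotes:
# frontier API priced per million tokens, self-hosted 7B priced per GPU-hour.

def monthly_cost_api(calls_per_day: int, tokens_per_call: int,
                     usd_per_million_tokens: float) -> float:
    """Monthly spend on a per-token frontier API."""
    tokens_per_month = calls_per_day * 30 * tokens_per_call
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

def monthly_cost_self_hosted(gpus: int, usd_per_gpu_hour: float) -> float:
    """Monthly spend on always-on GPUs serving a fine-tuned 7B model."""
    return gpus * usd_per_gpu_hour * 24 * 30

# Assumed numbers: $5/M tokens, 2,000 tokens/call, 8 GPUs at $2/hr.
api = monthly_cost_api(1_000_000, 2_000, 5.0)    # $300,000/month
hosted = monthly_cost_self_hosted(8, 2.0)        # $11,520/month
print(f"API: ${api:,.0f}  self-hosted: ${hosted:,.0f}  ratio: {api / hosted:.0f}x")
```

With these assumed inputs the ratio lands around 26×, inside the 10–30× band quoted above; the point of the exercise is that the gap only opens up once call volume is high enough to keep the GPUs busy.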

The honest signs you should not fine-tune

If you have fewer than a thousand high-quality training examples, you don't have a fine-tuning project — you have a data project. If you cannot articulate the metric you would lift by ten points, the metric is not real. If your prompt is three lines long and you haven't tried a longer one, fine-tuning is a wildly inefficient way to get the same lift.

LoRA vs full fine-tuning

For 95% of useful fine-tunes, LoRA adapters are the right answer. They train fast, cost a fraction of a full fine-tune, and let you run multiple variants from a single copy of the base weights in GPU memory. Full fine-tunes are warranted when LoRA hits a quality ceiling — typically tasks that need the model to learn genuinely new representations, not just adjustments to existing ones.
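The "fraction of a full fine-tune" point follows directly from parameter counts: LoRA trains two small low-rank matrices per target weight instead of the weight itself. A back-of-envelope sketch, using roughly Llama-7B-like attention shapes (the layer dimensions and rank are illustrative assumptions):

```python
# Trainable-parameter comparison: full fine-tune vs a rank-r LoRA adapter
# on the same weight matrices. Shapes are illustrative, not exact.

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA learns an update B @ A for a frozen d_out x d_in weight,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)

d_model, n_layers, rank = 4096, 32, 16
# Four attention projections (q, k, v, o) per layer, each d_model x d_model.
full = n_layers * 4 * d_model * d_model
lora = n_layers * 4 * lora_params(d_model, d_model, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

At rank 16 the adapter trains two orders of magnitude fewer parameters than the full matrices it modifies, which is where the speed and cost advantage comes from.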

Base model choice in 2026

Llama 3.3, Mistral, Qwen 3, and Phi-4 are the four most common bases we use. The decision usually comes down to license terms (Llama's commercial license has stricter usage clauses than Mistral or Qwen), serving cost, and which one performs best on your specific task in a quick eval. Don't pick on benchmarks alone.
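The "quick eval" step above can be as simple as scoring each candidate base on a handful of task examples and taking the winner. A minimal sketch — the `generate` callables here are toy stand-ins for real model calls (a vLLM instance, an API client, etc.), and exact match is just one possible metric:

```python
# Pick a base model by scoring each candidate on a small task-specific eval set.
from typing import Callable

def exact_match_score(generate: Callable[[str], str],
                      examples: list[tuple[str, str]]) -> float:
    """Fraction of eval examples where the model output matches the reference."""
    hits = sum(generate(prompt).strip() == target for prompt, target in examples)
    return hits / len(examples)

def pick_base(candidates: dict[str, Callable[[str], str]],
              examples: list[tuple[str, str]]) -> str:
    scores = {name: exact_match_score(fn, examples) for name, fn in candidates.items()}
    return max(scores, key=scores.get)

# Toy stand-ins; in practice these would wrap Llama 3.3, Mistral, Qwen 3, Phi-4.
eval_set = [("2+2=", "4"), ("capital of France?", "Paris")]
candidates = {
    "model-a": lambda p: "4" if "2+2" in p else "Paris",
    "model-b": lambda p: "42",
}
print(pick_base(candidates, eval_set))  # model-a wins both examples
```

Twenty to fifty real examples scored this way will tell you more about your task than any public leaderboard.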

How we run fine-tunes at Velura Labs

Our Model Fine-tuning engagements always start with a two-week audit where we honestly tell you whether fine-tuning will beat RAG on your specific task. Half the time, we recommend not fine-tuning. The other half, we ship a LoRA adapter on Llama 3.3 or Qwen 3 served via vLLM through our MLOps pipeline. Get in touch if you'd like a second opinion before committing to a training budget.
