All posts
LLM costprompt cachingtoken optimisation

Cutting LLM Cost in Half Without Touching Quality: A Practical Checklist

Dr Ishit Karoli
May 4, 2026
2 min read· 8 sections

Cutting LLM Cost in Half Without Touching Quality: A Practical Checklist

When a customer’s LLM bill is climbing faster than usage, the first instinct is to swap to a cheaper model. That’s almost always the wrong first move — quality regressions are easy to introduce and hard to detect. Below is the order we run on every production system before considering a model swap.

1. Turn on prompt caching properly

Most teams enable prompt caching but cache the wrong segment. Cache the largest stable prefix — system prompt, tool definitions, retrieved chunks if they repeat across turns. We routinely see cache hit rates jump from 20% to 80% just by reordering the prompt so the volatile bits come last. On Claude and most providers, this alone cuts costs 40–60% for chatty workloads.

2. Stop sending dead context

Audit your prompts. Most production prompts carry 20–40% of tokens that no longer matter — debug examples left from development, redundant instructions, retrieved chunks that didn’t help. Add a "would we miss this if removed?" review to every prompt change. Token diets are unglamorous and effective.

3. Right-size the output

Output tokens are 3–5× more expensive than input on most APIs. Constrain output ruthlessly. If you need a yes/no, ask for "yes" or "no", not a paragraph that ends in yes. If you need JSON, specify the schema with minimal fields. Streaming summaries are nice for chat; they’re a cost smell for batch.

4. Route by complexity

Not every request needs your strongest model. A thin classifier — sometimes regex, sometimes a small LLM — routes simple cases to a cheaper model and reserves the strong model for the genuinely hard 20%. Done right, this drops blended cost by another 30–50% with no measurable quality loss on the routed-down traffic.

5. Batch where latency allows

For non-interactive workloads (overnight summarisation, document ingest, eval runs), batch APIs offer 50% off list price on most providers. The trade-off is overnight latency. For everything that isn’t user-facing real-time, batch.

6. Only now, consider the model swap

After the above, evaluate whether a cheaper or distilled model holds quality on your eval set. If yes, switch. If quality dips, you’ve at least cut the bill before swapping. Most teams swap first and discover they did 15% of the savings the optimisation order would have delivered.

The metric that matters

Track cost per successful task, not cost per token or cost per call. A request that costs $0.001 but fails and retries three times is more expensive than a $0.005 request that succeeds first time. Build the success metric into your telemetry before you start optimising.

How we approach this at Velura Labs

Our Custom LLM Applications engagements include a cost-optimisation pass after the first month of production. Pair with AI Strategy & Roadmap if you’re sizing the budget envelope upfront. Read our evaluation playbook for the quality-side of these trade-offs and our buy-vs-build framework for the bigger procurement question. Talk to us if your LLM bill is outpacing your usage growth — we’ll do a one-day audit and tell you where the leaks are.

Now booking Q3 2026

Let's build the
next chapter of your business.

Quick chat on WhatsApp. We'll map your highest-leverage AI bet, show you a reference architecture, and price the first slice.

80+
shipped projects
12
industries
ISO 9001:2015
certified
98.4%
CSAT