The 2026 MLOps Stack for Indian AI Startups: Cheap, Reliable, Production-Ready
The MLOps blog posts coming out of Silicon Valley assume you have a platform team of twelve and a budget that doesn't blink at AWS. Most Indian AI startups have neither. Here is the stack we routinely deploy for early-stage teams that need to be production-ready without spending more on infra than on engineering.
Serving: vLLM is good enough
For self-hosted models, vLLM has won. It batches dynamically, exposes an OpenAI-compatible API, and handles 8B–70B models well on a single A10 or A100. SGLang and TGI are competitive but vLLM has the broadest ecosystem support. Run it behind a small Caddy or Cloudflare Tunnel — both give you HTTPS without faffing about.
Hosted vs self-hosted: pick by traffic shape
- Bursty, low total volume → hosted (OpenAI, Anthropic, Bedrock). Pay per token; your serving cost is zero when no one is using the system.
- Sustained, high volume → self-hosted on dedicated GPU. Above 5–10 million tokens a day, the math flips and self-hosting wins by 5–10×.
- Mixed → router. LiteLLM or a homegrown gateway routes cheap calls to your self-hosted cluster and frontier-required calls to a hosted API.
Observability: Langfuse for traces, Datadog or Grafana for everything else
Langfuse is the cheapest meaningful traces stack — open source, self-hostable, costs you a small EC2 instance. Wire it up early. For metrics and logs, Datadog if you can afford it, Grafana Cloud's free tier or a self-hosted Loki setup if you can't. Observability is non-negotiable in production; the only question is how cheap you can make it.
Compute: Modal for bursts, Hetzner for steady-state
Modal's serverless GPU is the right answer for irregular workloads — pay per second, no cluster to manage. For steady production load, Hetzner or OVH dedicated GPUs cost a fraction of AWS at the same spec. AWS makes sense for the parts that talk to other AWS services; not for raw GPU.
CI and deploy: GitHub Actions plus Docker
Boring is good. Build a Docker image in CI, push to a registry, deploy on tag. Skip Kubernetes until you have a real reason — for most pre-Series A companies, that day never comes.
How we deploy this at Velura Labs
Our AI Deployment & MLOps retainer ships exactly this stack — vLLM serving, Langfuse traces, Modal for bursts, GitHub Actions CI — with the operational handover and runbooks your team needs. For tactical infra advice without a retainer, our Backend & Infrastructure service covers it. Talk to us if you're staring down a 200K-a-month AWS bill that didn't have to happen.
Available to businesses across the United States (Washington, California, Texas, New York), Europe (France, Italy and the wider EU), the Middle East (Dubai and the Gulf) and India. Get in touch to scope your build.