The 2026 MLOps Stack for Indian AI Startups: Cheap, Reliable, Production-Ready
The MLOps blog posts coming out of Silicon Valley assume you have a platform team of twelve and a budget that doesn't blink at AWS. Most Indian AI startups have neither. Here is the stack we routinely deploy for early-stage teams that need to be production-ready without spending more on infra than on engineering.
Serving: vLLM is good enough
For self-hosted models, vLLM has won. It does continuous batching, exposes an OpenAI-compatible API, and handles 8B-class models comfortably on a single A10; 70B-class models fit a single 80GB A100 only when quantized, otherwise budget two or more GPUs with tensor parallelism. SGLang and TGI are competitive, but vLLM has the broadest ecosystem support. Run it behind a small Caddy reverse proxy or a Cloudflare Tunnel; both give you HTTPS without faffing about.
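Because the API is OpenAI-compatible, the client side is the same code you'd write against a hosted provider. A minimal sketch, assuming a vLLM server running locally on port 8000; the model name and URL are placeholders for your own:

```python
# Call a vLLM server started with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="not-needed-for-local",       # ignored unless you start vLLM with --api-key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise this invoice in one line."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```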
Hosted vs self-hosted: pick by traffic shape
- Bursty, low total volume → hosted (OpenAI, Anthropic, Bedrock). Pay per token; your serving cost is zero when no one is using the system.
- Sustained, high volume → self-hosted on dedicated GPU. Above 5–10 million tokens a day the math flips and self-hosting wins by 5–10×; a back-of-envelope version of that flip follows this list.
- Mixed → router. LiteLLM or a homegrown gateway routes cheap calls to your self-hosted cluster and frontier-required calls to a hosted API, as sketched below.
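First, the break-even arithmetic. Both prices below are illustrative assumptions, not quotes; plug in your own blended hosted rate and your actual GPU rental. With these particular numbers the crossover lands inside the 5–10M band:

```python
# Back-of-envelope break-even volume: flat-rate GPU box vs paying per token.
HOSTED_USD_PER_M_TOKENS = 4.00   # assumed blended (input + output) hosted price
GPU_BOX_USD_PER_MONTH = 900.00   # assumed dedicated GPU server rental

# Daily volume at which a month of per-token billing equals the flat GPU rent.
break_even_tokens_per_day = (
    GPU_BOX_USD_PER_MONTH / 30 / HOSTED_USD_PER_M_TOKENS * 1_000_000
)
print(f"break-even: {break_even_tokens_per_day / 1e6:.1f}M tokens/day")  # 7.5M here
```

Above that volume the hosted bill keeps climbing linearly while the box costs the same, which is where the 5–10× gap opens up.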
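And the router itself. A minimal sketch using LiteLLM's `Router`; the internal endpoint, model names, and the boolean routing rule are all assumptions for illustration, and in practice the rule is whatever heuristic fits your product:

```python
import os
from litellm import Router

router = Router(model_list=[
    {   # self-hosted 8B behind vLLM's OpenAI-compatible endpoint
        "model_name": "cheap",
        "litellm_params": {
            "model": "openai/meta-llama/Llama-3.1-8B-Instruct",
            "api_base": "http://gpu-box.internal:8000/v1",  # placeholder host
            "api_key": "unused",
        },
    },
    {   # hosted frontier model, only for calls that need it
        "model_name": "frontier",
        "litellm_params": {
            "model": "anthropic/claude-3-5-sonnet-20240620",  # example model
            "api_key": os.environ["ANTHROPIC_API_KEY"],
        },
    },
])

def ask(prompt: str, needs_frontier: bool = False) -> str:
    # The simplest possible routing rule: a flag the caller sets.
    resp = router.completion(
        model="frontier" if needs_frontier else "cheap",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```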
Observability: Langfuse for traces, Datadog or Grafana for everything else
Langfuse is the cheapest meaningful tracing stack: open source, self-hostable, and it costs you roughly a small EC2 instance to run. Wire it up early. For metrics and logs, Datadog if you can afford it, Grafana Cloud's free tier or a self-hosted Loki setup if you can't. Observability is non-negotiable in production; the only question is how cheap you can make it.
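"Wire it up early" can be one import swap: Langfuse ships a drop-in replacement for the OpenAI client that traces every call. A sketch, assuming the vLLM endpoint from earlier and a self-hosted Langfuse instance; credentials come from the standard env vars:

```python
# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST from the
# environment; point LANGFUSE_HOST at your self-hosted instance.
from langfuse.openai import OpenAI  # drop-in replacement for openai.OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the vLLM server from the serving section
    api_key="not-needed-for-local",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
)
# Request, response, latency, and token counts now land in Langfuse as a
# trace with no further instrumentation.
```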
Compute: Modal for bursts, Hetzner for steady-state
Modal's serverless GPU is the right answer for irregular workloads — pay per second, no cluster to manage. For steady production load, Hetzner or OVH dedicated GPUs cost a fraction of AWS at the same spec. AWS makes sense for the parts that talk to other AWS services; not for raw GPU.
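The "bursts" half looks like this in practice: a GPU function that scales to zero when idle and bills per second while it runs. A sketch under assumptions, with the GPU type and model as placeholders; a real deployment would keep the model warm with a container lifecycle class rather than loading it per call:

```python
import modal

app = modal.App("burst-inference")
image = modal.Image.debian_slim().pip_install("vllm")

@app.function(gpu="A10G", image=image, timeout=600)
def generate(prompt: str) -> str:
    # Imported inside the function: vllm only exists in the remote image.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    out = llm.generate([prompt], SamplingParams(max_tokens=128))
    return out[0].outputs[0].text

@app.local_entrypoint()
def main():
    print(generate.remote("Summarise this invoice in one line."))
```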
CI and deploy: GitHub Actions plus Docker
Boring is good. Build a Docker image in CI, push to a registry, deploy on tag. Skip Kubernetes until you have a real reason — for most pre-Series A companies, that day never comes.
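The deploy step can be a single boring script that the tag-triggered CI job runs on the box over SSH. A sketch, with the registry path and container name as placeholders; the brief downtime on restart is an acceptable trade at this stage:

```python
import subprocess
import sys

IMAGE = "registry.example.com/acme/inference"  # placeholder registry path
CONTAINER = "inference"

def sh(*cmd: str, check: bool = True) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=check)

def deploy(tag: str) -> None:
    sh("docker", "pull", f"{IMAGE}:{tag}")
    # check=False so the very first deploy, with no container yet, still works.
    sh("docker", "rm", "-f", CONTAINER, check=False)
    sh(
        "docker", "run", "-d", "--name", CONTAINER,
        "--restart", "unless-stopped", "-p", "8000:8000",
        f"{IMAGE}:{tag}",
    )

if __name__ == "__main__":
    deploy(sys.argv[1])  # e.g. python deploy.py v1.4.2
```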
How we deploy this at Velura Labs
Our AI Deployment & MLOps retainer ships exactly this stack — vLLM serving, Langfuse traces, Modal for bursts, GitHub Actions CI — with the operational handover and runbooks your team needs. For tactical infra advice without a retainer, our Backend & Infrastructure service covers it. Talk to us if you're staring down a 200K-a-month AWS bill that didn't have to happen.