Computer Vision in Indian Retail: What Works on a Crowded Shop Floor

Dr Ishit Karoli
May 8, 2026

Shelf-monitoring CV models trained on European modern-trade footage fail spectacularly in Indian retail. SKUs are stacked three-deep, lighting is mixed fluorescent and daylight, and the same brand has six pack variants on the same shelf. The CV stack you ship for India looks meaningfully different from the global one. Here is what we’ve learned shipping it.

The conditions that break off-the-shelf models

  • Stacking and occlusion. SKUs are partially hidden by other SKUs, so detection models trained on cleanly faced shelves miss 30–50% of stock.
  • Pack variant proliferation. One brand, six sizes, three flavours, two packaging refreshes a year. A classifier with a fixed label set needs constant retraining to keep up.
  • Mixed lighting. Tube light + door daylight + phone flash. White balance drifts across the same store visit.
  • Phone-captured input. Reps walk the aisle and snap photos. Angles, distances, and motion blur vary wildly.

The stack that performs

  1. SKU-level detection with India-specific fine-tuning. Start with a strong general detector (YOLO11, RT-DETR) and fine-tune it on locally captured shop-floor data: at least 5,000 frames across store formats. Generic models reach 60–70% accuracy on Indian shelves; tuned models hit 85–92%.
  2. Embedding-based recognition for SKU identity. Detection finds the box; embedding retrieval matches it to the SKU catalogue. The embedding gallery is updated whenever a pack refreshes, which is far cheaper than retraining a classifier.
  3. Pack-variant matrix. Don’t collapse "Lay's 50g classic" and "Lay's 50g cream & onion" into "Lay's 50g." Maintain pack-level granularity in the catalogue and the eval.
  4. Field-app feedback loop. Reps confirm or correct detections on-device. Corrections go back to the training pipeline.
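
The retrieval step in item 2 can be sketched in a few lines. This is a minimal illustration, not our production code: `SkuGallery`, the 0.8 similarity threshold, the SKU ids, and the three-number vectors are hypothetical stand-ins (real embeddings come from a trained image encoder and are much longer). It shows why a pack refresh only needs a new reference embedding rather than a retrained classifier.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkuGallery:
    """Hypothetical gallery: reference embeddings keyed by pack-level SKU id."""

    def __init__(self, min_similarity=0.8):  # threshold is an assumed value
        self.min_similarity = min_similarity
        self.entries = []  # list of (sku_id, embedding)

    def add(self, sku_id, embedding):
        # A SKU may hold several references, e.g. old and refreshed packs.
        self.entries.append((sku_id, list(embedding)))

    def match(self, embedding):
        """Return (sku_id, similarity); sku_id is None below the threshold."""
        best_id, best_sim = None, -1.0
        for sku_id, reference in self.entries:
            sim = cosine(embedding, reference)
            if sim > best_sim:
                best_id, best_sim = sku_id, sim
        if best_sim < self.min_similarity:
            return None, best_sim
        return best_id, best_sim


gallery = SkuGallery()
gallery.add("lays-50g-classic", [0.9, 0.1, 0.0])
gallery.add("lays-50g-cream-onion", [0.1, 0.9, 0.0])

# Toy embedding of a crop showing a refreshed pack design.
crop = [0.2, 0.1, 0.95]
before = gallery.match(crop)   # no confident match yet

# Pack refresh: add one reference embedding, no retraining.
gallery.add("lays-50g-classic", [0.15, 0.1, 1.0])
after = gallery.match(crop)

print(before[0], after[0])
```

In this toy data the refreshed pack is rejected before the gallery update and matched to its pack-level id afterwards. Crops that stay below the threshold are natural candidates for the rep confirmation step in item 4.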

Planogram compliance vs share-of-shelf

These are different problems and need different evals. Planogram compliance asks "is this SKU in its designated facing?" — needs spatial accuracy. Share-of-shelf asks "what percentage of the visible shelf is brand X?" — tolerates spatial error but needs accurate SKU classification. Build the right metrics for the question the brand is actually paying for.
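
To make the difference concrete, here is a rough sketch of both metrics computed from the same detections. The data layout, the 0.5 IoU threshold, and the choice to measure share-of-shelf by detected box area are assumptions for illustration; a brand may instead define share by facing count or linear shelf width.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def planogram_compliance(planogram, detections, min_iou=0.5):
    """Fraction of planogram slots that hold the expected SKU in place."""
    if not planogram:
        return 0.0
    hits = sum(
        1 for slot in planogram
        if any(d["sku"] == slot["sku"] and iou(d["box"], slot["box"]) >= min_iou
               for d in detections)
    )
    return hits / len(planogram)


def share_of_shelf(detections, brand):
    """Share of detected product area that belongs to one brand."""
    total = sum(area(d["box"]) for d in detections)
    ours = sum(area(d["box"]) for d in detections if d["brand"] == brand)
    return ours / total if total else 0.0


# Hypothetical shelf: two slots, two detections.
planogram = [
    {"sku": "A-50g", "box": (0, 0, 10, 10)},
    {"sku": "B-50g", "box": (10, 0, 20, 10)},
]
detections = [
    {"sku": "A-50g", "brand": "A", "box": (1, 0, 11, 10)},   # slightly shifted
    {"sku": "B-50g", "brand": "B", "box": (20, 0, 30, 10)},  # one slot too far right
]

compliance = planogram_compliance(planogram, detections)
brand_b_share = share_of_shelf(detections, "B")
print(compliance, brand_b_share)
```

The misplaced B pack halves the compliance score but leaves brand B's share untouched, which is the sense in which share-of-shelf tolerates spatial error. A wrong SKU label would hurt both numbers.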

On-device vs server inference

Field connectivity is variable. We default to on-device detection for immediate rep feedback (TensorFlow Lite or ONNX Runtime) and server-side reprocessing for analytics-grade accuracy. The two paths use the same training data; the on-device model is quantised and pruned. Don’t make reps wait for the cloud.
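
One way to structure the two paths is an offline-first queue: answer the rep from the local model straight away, and hold frames until the server is reachable. The sketch below assumes hypothetical names throughout (`DualPathInference`, `tiny_model`, `server`); the model and server are stubs, not real TensorFlow Lite or ONNX Runtime calls.

```python
from collections import deque


class DualPathInference:
    """Hypothetical wrapper: instant on-device result, deferred server pass."""

    def __init__(self, on_device_model, server_client):
        self.on_device_model = on_device_model  # callable: frame -> detections
        self.server_client = server_client      # callable: frame -> detections
        self.pending = deque()                  # frames awaiting reprocessing

    def capture(self, frame_id, frame):
        """Run the local model now and queue the frame for the server."""
        detections = self.on_device_model(frame)
        self.pending.append((frame_id, frame))
        return detections

    def sync(self):
        """Upload queued frames; stop quietly if the connection drops."""
        results = {}
        while self.pending:
            frame_id, frame = self.pending[0]
            try:
                results[frame_id] = self.server_client(frame)
            except ConnectionError:
                break  # still offline, keep the frame queued
            self.pending.popleft()
        return results


# Stubs standing in for the quantised model and the server endpoint.
network = {"online": False}


def tiny_model(frame):
    return ["lays-50g-classic"]


def server(frame):
    if not network["online"]:
        raise ConnectionError("no signal")
    return ["lays-50g-classic", "lays-50g-cream-onion"]


pipeline = DualPathInference(tiny_model, server)
instant = pipeline.capture("visit-1/frame-1", b"jpeg-bytes")
offline_results = pipeline.sync()   # nothing uploaded, frame stays queued
queued_while_offline = len(pipeline.pending)

network["online"] = True
online_results = pipeline.sync()    # server result arrives later
print(instant, offline_results, online_results)
```

The rep sees the on-device answer immediately; the fuller server result replaces it in analytics once the queue drains.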

The label problem nobody warns you about

Shop-floor data is expensive to label well. SKUs are tiny in frame, partially visible, and sometimes upside-down. Plan for 8–12 seconds per box, not 2. Set up an annotation team in a Tier-2 city: costs are 3–5× lower than in metros, and quality is comparable when the team is supervised properly.
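
A quick back-of-envelope calculation shows how much the per-box time matters. The 5,000 frames and the 2 versus 8–12 seconds come from this post; the 40 boxes per frame is an assumed figure for a crowded shelf photo, so substitute your own.

```python
def annotation_hours(frames, boxes_per_frame, seconds_per_box):
    """Total labelling effort in hours, before review and rework."""
    return frames * boxes_per_frame * seconds_per_box / 3600


FRAMES = 5000          # minimum fine-tuning set suggested above
BOXES_PER_FRAME = 40   # assumption: varies widely by store format

naive = annotation_hours(FRAMES, BOXES_PER_FRAME, 2)
low = annotation_hours(FRAMES, BOXES_PER_FRAME, 8)
high = annotation_hours(FRAMES, BOXES_PER_FRAME, 12)

print(f"budgeted at 2 s/box: {naive:.0f} h")
print(f"realistic range:     {low:.0f} to {high:.0f} h")
```

Under these assumptions the naive plan budgets about 111 hours, while the realistic range is roughly 444 to 667 hours, four to six times more.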

How we approach this at Velura Labs

Our AI & Data Solutions service handles end-to-end retail CV: data collection, annotation, model training, and on-device deployment. Pair it with Mobile App Development for the rep-facing field app. Read our annotation pipelines piece for the labelling layer and our Bharat design patterns for the rep UX. Talk to us if your shelf-monitoring accuracy has plateaued after the initial rollout; the usual cause is the embedding gallery drifting.
