The AI Observability Stack You Actually Need on Day One

Dr Ishit Karoli
May 6, 2026

LLM applications fail in ways traditional APM doesn’t catch. Latency is fine but the answer is wrong. Cost is fine but token usage is 3× yesterday’s on the same load. The model returned valid JSON but the wrong tool got called. Standard observability tells you the request returned 200. Useful observability tells you when the request returned a bad answer. Here’s the minimum stack we ship on day one.

1. Structured traces around every model call

Every LLM call gets a span with: model, input tokens, output tokens, latency, cost, cache hit/miss, and prompt template version. Use OpenTelemetry conventions where they exist: its GenAI semantic conventions define attributes such as gen_ai.request.model and gen_ai.usage.input_tokens, and OpenTelemetry instrumentation packages exist for both the Anthropic and OpenAI SDKs. Without these, you can’t answer "is my prompt change saving money or burning it?"
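The span fields listed above could be sketched roughly as follows. This is a plain-Python stand-in rather than a real tracing SDK: in practice you would set the same attributes on an OpenTelemetry span. The gen_ai.* keys follow the OpenTelemetry GenAI semantic conventions; the llm.* keys, the `llm_span` helper, the `example-model` name, and the price table are all hypothetical names chosen for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

# Hypothetical USD prices per million tokens; real prices vary by provider and model.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}

# In-memory sink standing in for a tracing backend.
FINISHED_SPANS = []


@dataclass
class LLMSpan:
    model: str
    prompt_template_version: str
    attributes: dict = field(default_factory=dict)

    def record_usage(self, input_tokens: int, output_tokens: int, cache_hit: bool) -> None:
        """Attach token counts, cache status, and derived cost to the span."""
        price = PRICE_PER_MTOK[self.model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        self.attributes.update({
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
            "llm.cache_hit": cache_hit,          # assumed attribute name
            "llm.cost_usd": round(cost, 6),      # assumed attribute name
        })


@contextmanager
def llm_span(model: str, prompt_template_version: str):
    """Wrap one model call; latency is recorded even if the call raises."""
    span = LLMSpan(model, prompt_template_version)
    span.attributes["gen_ai.request.model"] = model
    span.attributes["llm.prompt_template_version"] = prompt_template_version
    start = time.perf_counter()
    try:
        yield span
    finally:
        span.attributes["llm.latency_ms"] = (time.perf_counter() - start) * 1000
        FINISHED_SPANS.append(span)


with llm_span("example-model", "summarize-v7") as span:
    # The real model call would go here; token counts come from the provider's response.
    span.record_usage(input_tokens=1200, output_tokens=300, cache_hit=False)
```

Keeping the prompt template version on the span is what lets you later group cost and latency by prompt change.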

2. Full prompt and response capture (with PII filters)

Store the actual prompt and response, not just metadata. You’ll need them to debug specific failures, run regression evals, and produce audit logs. Add PII redaction at the capture layer. For regulated industries, encrypt at rest and gate access through your audit trail.
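Redaction at the capture layer might look something like the sketch below. The two regular expressions are deliberately simple and only cover emails and phone-like digit runs; real PII detection needs much broader coverage and usually a dedicated library. The `redact` and `capture` helpers and the list-based store are hypothetical.

```python
import re

# Illustrative patterns only: they will miss many PII forms (names, addresses, IDs).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like sequences with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def capture(prompt: str, response: str, store: list) -> None:
    """Redact before anything is persisted, so raw PII never reaches storage."""
    store.append({"prompt": redact(prompt), "response": redact(response)})


captured = []
capture(
    "Mail jane.doe@example.com or call +1 415-555-0100 today",
    "I will contact jane.doe@example.com.",
    captured,
)
```

Redacting inside `capture` rather than at read time means every downstream consumer (debugging, evals, audit logs) sees the same filtered text.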

3. Tool-call telemetry

For agentic systems, every tool call needs its own span with arguments, return value, latency, and success/failure. The most common production failure mode for agents is "model called the wrong tool with the wrong arguments." You won’t catch it without tool-level traces.
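One way to get a span per tool call is a decorator around each tool function. The sketch below is a minimal illustration, assuming tools are plain Python functions called with keyword arguments; `traced_tool`, `TOOL_SPANS`, and the `get_weather` tool are hypothetical names, and a real system would export to its tracing backend instead of a list.

```python
import functools
import time

# In-memory sink standing in for a tracing backend.
TOOL_SPANS = []


def traced_tool(fn):
    """Record arguments, return value, latency, and success/failure for each call."""
    @functools.wraps(fn)
    def wrapper(**kwargs):
        span = {"tool": fn.__name__, "arguments": kwargs}
        start = time.perf_counter()
        try:
            result = fn(**kwargs)
            span.update(success=True, return_value=result)
            return result
        except Exception as exc:
            span.update(success=False, error=repr(exc))
            raise  # the caller still sees the failure; we only observe it
        finally:
            span["latency_ms"] = (time.perf_counter() - start) * 1000
            TOOL_SPANS.append(span)
    return wrapper


@traced_tool
def get_weather(city: str) -> dict:
    # Stub tool for illustration; a real tool would call an external service.
    if not city:
        raise ValueError("city is required")
    return {"city": city, "temp_c": 21}


get_weather(city="Pune")
try:
    get_weather(city="")  # wrong arguments are recorded as a failed span
except ValueError:
    pass
```

Because the arguments are stored on the span, a "wrong tool, wrong arguments" failure shows up as data you can query rather than something inferred from the final answer.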

4. Online quality signals

Wire user feedback (thumbs, citations clicked, follow-up question rate) into the same observability backend as your latency metrics. Quality regressions are usually visible in user behaviour before they show up in evals. Mark each request with its prompt template version so you can correlate.
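Correlating feedback with prompt versions can start as a simple grouped rate. The sketch below assumes each feedback event already carries the prompt template version of the request it refers to; the event shape and the `thumbs_up_rate_by_version` helper are hypothetical.

```python
from collections import defaultdict


def thumbs_up_rate_by_version(events):
    """Return {prompt_template_version: share of thumbs-up events}."""
    counts = defaultdict(lambda: [0, 0])  # version -> [thumbs_up, total]
    for event in events:
        bucket = counts[event["prompt_template_version"]]
        if event["thumbs_up"]:
            bucket[0] += 1
        bucket[1] += 1
    return {version: up / total for version, (up, total) in counts.items()}


feedback = [
    {"prompt_template_version": "v6", "thumbs_up": True},
    {"prompt_template_version": "v6", "thumbs_up": True},
    {"prompt_template_version": "v6", "thumbs_up": False},
    {"prompt_template_version": "v6", "thumbs_up": True},
    {"prompt_template_version": "v7", "thumbs_up": True},
    {"prompt_template_version": "v7", "thumbs_up": False},
]
rates = thumbs_up_rate_by_version(feedback)
```

The same grouping works for the other signals mentioned above, such as citation clicks or follow-up question rate, as long as the version tag travels with the event.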

5. Cost telemetry with budgets

Roll up cost per user, per feature, and per tenant. Alert when daily spend exceeds a set threshold. In our experience, most LLM cost blowouts are a single buggy code path looping. Without per-feature attribution, you find it three days later in the bill.
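A rollup with a budget check can be sketched in a few lines. This assumes each request is logged with a date, a cost in USD, and attribution fields; the record shape, the `daily_cost_by` and `over_budget` helpers, and the 100 USD budget are hypothetical.

```python
from collections import defaultdict


def daily_cost_by(records, key):
    """Sum cost per (date, key value), where key is e.g. 'feature', 'user', or 'tenant'."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["date"], record[key])] += record["cost_usd"]
    return dict(totals)


def over_budget(totals, daily_budget_usd):
    """Return the (date, key value) pairs whose daily spend exceeds the budget."""
    return sorted(k for k, spend in totals.items() if spend > daily_budget_usd)


records = [
    {"date": "2026-05-06", "feature": "search", "tenant": "acme", "cost_usd": 40.0},
    {"date": "2026-05-06", "feature": "search", "tenant": "acme", "cost_usd": 70.0},
    {"date": "2026-05-06", "feature": "chat", "tenant": "acme", "cost_usd": 30.0},
]
per_feature = daily_cost_by(records, "feature")
alerts = over_budget(per_feature, daily_budget_usd=100.0)
```

Running the same rollup with `key="tenant"` or `key="user"` gives the other attributions without changing the logging.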

6. Eval runs on every change

CI runs your offline eval set on every prompt or model change. Block deploys on regressions. We see teams discover quality regressions weeks later in user complaints because their eval suite was never wired to CI. Day-one investment, lifetime payoff.
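The deploy-blocking step can be as small as a comparison against a stored baseline. The sketch below assumes your eval harness produces one score per metric where higher is better; the metric names, the `max_drop` tolerance, and both helper functions are hypothetical.

```python
def eval_gate(baseline, candidate, max_drop=0.02):
    """Return the metrics where the candidate fell more than max_drop below baseline."""
    regressions = []
    for metric, base_score in baseline.items():
        # A metric missing from the candidate run counts as a score of 0.0.
        if candidate.get(metric, 0.0) < base_score - max_drop:
            regressions.append(metric)
    return regressions


def exit_code(regressions):
    """Non-zero exit code makes the CI job fail and blocks the deploy."""
    return 1 if regressions else 0


baseline = {"accuracy": 0.90, "groundedness": 0.85}
candidate = {"accuracy": 0.91, "groundedness": 0.80}
regressed = eval_gate(baseline, candidate)
```

In a CI script you would pass `exit_code(regressed)` to `sys.exit` so the pipeline stops on a regression. The small tolerance is there so ordinary run-to-run noise does not block every deploy.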

The tools we currently like

Langfuse, Helicone, and Arize Phoenix are the open-source / self-hostable options we deploy most often. Datadog and Honeycomb have credible LLM extensions if your org is already on them. Avoid hand-rolling: log lines in Postgres look cheap until you need to slice by prompt version × tool call success.

How we approach this at Velura Labs

Every Custom LLM Applications and Agentic Systems engagement ships with this stack pre-wired. Pair with Backend & Infrastructure for the platform side. Read our evaluation playbook for the offline side and multi-agent patterns for why tool-call telemetry matters most in agent systems. Talk to us if your first production incident is teaching you what you should have wired three months ago.
