r/devops 12d ago

Observability Agent Observability and what I think

Hey all, I wanted to share a perspective on something I've been thinking about a lot lately.

Traditional APM was built for request-response, and AI agents break that model entirely. Most of what's on the market right now is just legacy APM with agent features added, and that leaves a gap you really only feel when things go wrong. You can see the agent's intent (what it decided to do) OR the system-level impact (latency, errors, resource usage), but not both in the same trace. So you're flying blind through the exact moments when cost spikes.

I think observability at the agent layer is one of the real problems here. It's not solved yet. But it's defined well enough that you can instrument properly if you start now.

UC Santa Cruz published research on this last year (arxiv:2508.02736). They used eBPF to intercept TLS traffic and correlate what the agent intended to do with what actually happened at the kernel level. Less than 3% overhead. Point being that this is architecturally possible.

About 5% of AI model requests fail in production today (Datadog, April 2026 survey). Sixty percent of those failures are capacity-related, not model errors. So, it's an operational gap. And teams that built agent-layer observability into their setup caught those failures before they cascaded into outages. Teams that didn't had incidents.
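Back-of-envelope on those two figures, as a rough Python sketch. The 100,000 request volume is a made-up number for illustration, and `failure_breakdown` is just a hypothetical helper; the 5% and 60% come from the survey numbers quoted above.

```python
def failure_breakdown(total_requests, fail_pct=5, capacity_pct=60):
    """Split failed requests into capacity-related and everything else."""
    failed = total_requests * fail_pct // 100      # ~5% of requests fail
    capacity = failed * capacity_pct // 100        # ~60% of failures are capacity-related
    return {"failed": failed, "capacity_related": capacity, "other": failed - capacity}


print(failure_breakdown(100_000))
# {'failed': 5000, 'capacity_related': 3000, 'other': 2000}
```

In other words, roughly 3% of all requests would be failing for capacity reasons alone, which is an ops problem rather than a model problem.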

If you're building agents, start with OpenTelemetry. If you're buying a platform, ask the hard questions: Does this handle reasoning loops as a first-class thing? Can you see the decision tree as a continuous trace? Does it know the difference between a tool failing and the agent misunderstanding the tool? Can you alert on semantic drift?
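To make the "tool failing vs. agent misunderstanding the tool" question concrete, here is a minimal Python sketch (stdlib only). `ToolCall`, `ToolSpec`, `classify`, and the label names are all hypothetical, and the rule itself (unknown tool or wrong arguments means agent misuse, an exception during execution means tool failure) is a simplifying assumption, not any vendor's method. In a real setup the label could be attached to an OpenTelemetry span with `span.set_attribute(...)`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolCall:
    """One tool invocation requested by the agent (hypothetical shape)."""
    tool: str
    args: dict[str, Any]


@dataclass
class ToolSpec:
    """What a tool expects: its exact argument names and the callable itself."""
    required: set[str]
    run: Callable[..., Any]


def classify(call: ToolCall, registry: dict[str, ToolSpec]) -> tuple[str, Optional[Any]]:
    """Return (label, result) where label is 'ok', 'agent_misuse' or 'tool_failure'."""
    spec = registry.get(call.tool)
    if spec is None:
        return "agent_misuse", None          # agent asked for a tool that does not exist
    if set(call.args) != spec.required:
        return "agent_misuse", None          # agent passed missing or unexpected arguments
    try:
        return "ok", spec.run(**call.args)
    except Exception:
        return "tool_failure", None          # the call was well-formed; the tool itself broke
```

The point of the split is alerting: a spike in `tool_failure` is an infra page, while a spike in `agent_misuse` points at prompts, tool descriptions, or the model.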

Those are the questions that separate something actually built for agents from something that's just adding agent features to traditional APM. Honeycomb published their approach. Langfuse and LangSmith are solid for multi-step debugging. There are about 15 tools competing on this now, most built on OpenTelemetry standards.

My candid assessment is that you're going to be in supervised mode for a while. Your agent still needs human approval; there is no way around it right now. That's not going away in the next two years. If a vendor tells you otherwise, that's a red flag.

Curious if people can share: a) what good agent observability actually looks like at your scale, and b) what you're currently missing on the observability side, if anything.

0 Upvotes

7

u/seweso 12d ago

There is NO place for unattended generative AI in devops. So you do not need observability. You should stop the AI bullshit asap imho.

Why would you ever add random entropy to processes which should be rock solid? It's highly inappropriate and unprofessional imho.

I will downvote every post which suggests this bs.

1

u/Imaginary_Gate_698 12d ago

I think the biggest gap right now is correlating agent reasoning with infra-level side effects in one timeline. Most tooling will happily show you token counts, traces, or latency separately, but when an agent starts looping, retrying tools, or generating expensive downstream calls, the causal chain gets fuzzy fast. Feels very similar to early distributed tracing days where everyone had logs, but nobody had context stitching.
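A rough sketch of what that context stitching could look like, in plain Python. Everything here is hypothetical: the `Event` shape, the `step_id` correlation key shared by both layers, and the "same action repeated 3+ times" loop heuristic are assumptions for illustration, not how any particular tool does it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: float      # seconds since the run started
    layer: str     # "agent" (reasoning / tool decision) or "infra" (system-level effect)
    name: str      # e.g. "call:search" or "http_503"
    step_id: int   # hypothetical correlation id shared by both layers


def stitch(agent_events, infra_events):
    """Merge both layers into one timeline ordered by timestamp."""
    return sorted([*agent_events, *infra_events], key=lambda e: e.ts)


def looping_actions(timeline, threshold=3):
    """Agent actions repeated at least `threshold` times within the run."""
    counts = Counter(e.name for e in timeline if e.layer == "agent")
    return {name: n for name, n in counts.items() if n >= threshold}


def side_effects(timeline, action):
    """Infra events that share a step_id with the given agent action."""
    steps = {e.step_id for e in timeline if e.layer == "agent" and e.name == action}
    return [e for e in timeline if e.layer == "infra" and e.step_id in steps]
```

With something like this, "the agent retried `call:search` three times" and "each retry produced a 503" end up in the same view instead of in two separate dashboards.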

1

u/ExternalComment1738 12d ago

the “traditional APM with agent sprinkled on top” criticism feels extremely accurate 😭 most observability stacks still fundamentally assume deterministic request → response flows, while agents behave more like probabilistic execution graphs with shifting internal objectives

the hardest failures i’ve seen aren’t even infrastructure failures themselves, they’re interpretation failures 💀 the tool technically worked, the retrieval technically succeeded, latency looked normal… but the agent formed the wrong world-model halfway through the reasoning chain and everything downstream stayed “healthy” while becoming semantically broken

also agree completely that reasoning loops/memory propagation need to become first-class trace objects instead of invisible prompt state hidden inside black boxes. otherwise debugging agents becomes pure archaeology

1

u/Jony_Dony 12d ago

The interpretation failure point is the real one. You can have perfect latency and zero infra errors while the agent is technically "doing its job" but making authorization-adjacent decisions nobody anticipated. Standard traces won't flag that because the tool call went through clean. What's missing is a way to evaluate whether the agent's actual behavior stayed within the scope it was supposed to operate in, not just whether the calls succeeded.
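One possible shape for that kind of scope check, as a minimal Python sketch. The `Scope` policy format, the call record fields (`tool`, `resource`, `ok`), and the `out_of_scope` helper are all assumptions made up for illustration. The idea is to evaluate calls that succeeded, since those are exactly the ones standard error-based alerting ignores.

```python
from dataclasses import dataclass


@dataclass
class Scope:
    """What the agent is supposed to be allowed to do (hypothetical policy shape)."""
    allowed_tools: set[str]
    allowed_resource_prefixes: tuple[str, ...] = ()


def out_of_scope(calls, scope):
    """Return the successful calls that fell outside the declared scope."""
    violations = []
    for call in calls:
        if not call["ok"]:
            continue  # failed calls already show up in normal error traces
        tool_ok = call["tool"] in scope.allowed_tools
        # str.startswith accepts a tuple of prefixes; an empty tuple matches nothing
        resource_ok = call["resource"].startswith(scope.allowed_resource_prefixes)
        if not (tool_ok and resource_ok):
            violations.append(call)
    return violations
```

A clean tool call that reads `/etc/shadow` would pass every latency and error check, but it would show up here.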

1

u/Any-Grass53 12d ago

The biggest gap right now is still connecting agent reasoning to infra-level failures in one clean trace. Most tools can show logs or decisions separately, but very few actually help debug why an agent made a bad decision under real production conditions.

1

u/Relevant-Worry-3920 1d ago

Are agents going to need human supervision for a while? Yes. But can we make the job more efficient and less time-consuming for humans? That too is a big yes. That's where observability tools like Langfuse and Netra come into play.