Traditional observability relies on sampling telemetry to stay within budget. For AI workloads, that sampling becomes a fatal blind spot.
The Guest: Shahar Azulay, CEO and Co-founder at groundcover
The Bottom Line
- AI workloads generate non-deterministic failures that traditional sampling cannot capture. The three pillars of observability—logs, metrics, and APM—are insufficient for monitoring AI agents making 500 LLM calls per minute across 50,000 spans. Full-fidelity telemetry and AI-as-a-judge evaluation are now requirements, not luxuries.
***
Speaking with TFiR, Shahar Azulay of groundcover defined the current challenge of AI observability as a fundamental mismatch between traditional telemetry sampling models and the non-deterministic, high-cardinality nature of AI workloads.
What Makes AI Observability Different?
Azulay explained that monitoring AI workloads requires fundamentally different telemetry strategies compared to traditional microservices observability. The classical three pillars—logs, metrics, and application performance monitoring (APM)—were designed for deterministic workflows where sampling a fraction of telemetry could reliably surface issues. AI workloads break that model.
Shahar Azulay: “What’s happening to telemetry is really interesting, because on one hand, there’s more telemetry. Monitoring AI workloads is very different. The classical three pillars of logs, metrics, APM—they don’t fit anymore. Monitoring a distributed trace in the legacy microservices world is not the same as monitoring 50,000 spans in an AI workflow that uses 500 LLM calls a minute. It’s different. It requires a different product. You can’t sample the data in the same way. It’s not deterministic. There are a lot of changes in how we monitor AI, and groundcover is spearheading what is called AI observability—how we monitor agents, how we monitor AI workflows.”
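The shift Azulay describes starts at the instrumentation layer: every LLM call becomes its own span inside a much wider trace. Below is a minimal sketch of that pattern using the standard OpenTelemetry Python API; the span name, attributes, and stub client are illustrative assumptions, not groundcover's actual schema.

```python
# Illustrative only: one span per LLM call, so an agent making 500 calls a
# minute produces a trace that can be inspected call by call.
from opentelemetry import trace

tracer = trace.get_tracer("ai.agent.demo")

def call_model(prompt: str) -> str:
    """Stand-in for a real LLM client call."""
    return f"response to: {prompt}"

def traced_llm_call(prompt: str) -> str:
    # Each call gets its own span; attribute names here are made up.
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.prompt.chars", len(prompt))
        answer = call_model(prompt)
        span.set_attribute("llm.response.chars", len(answer))
        return answer
```

At 500 calls per minute, a single two-hour agent session emits tens of thousands of such spans, which is the scale Azulay contrasts with a legacy distributed trace.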
He emphasized that AI telemetry is not only higher in volume and cardinality but also more sensitive. Production traces now include customer prompts and interactions with AI features, raising data sovereignty concerns that traditional observability platforms were not designed to address.
Shahar Azulay: “The data that my observability system now stores about monitoring these agents is much more sensitive than before—it might be even customer prompts. So it’s more sensitive, more high cardinality, plus I need it for more use cases. I want to use the data from my AI agents to operate and build these agents, and I want to write evals that are based on AI to assess how fast they are, how well they operate.”
Broader Context: Why Sampling Fails for AI Workloads
During the full TFiR interview at KubeCon EU, Azulay elaborated on why traditional sampling strategies—designed for deterministic API workflows—cannot reliably detect failures in AI systems. Non-deterministic execution paths mean that even rare errors may never appear in sampled telemetry, and “successful” responses may still be incorrect without additional evaluation layers.
Shahar Azulay: “Before, we used to think about an API that has multiple use cases, variables, different paths—but eventually it’s pretty deterministic. Given a state, and it’s been called 1,000 times a second, maybe I want to sample half a percent of it, which is what most of the industry does to maintain a reasonable budget, and I will catch that error. If it’s coming once in 1,000 times, I will catch it if I sample correctly. Sampling errors and figuring out what is an error in AI is complicated, both in how I store the data and how I figure out what the data means. Even if there’s a clear error, it might be so non-deterministic and rare that sampling will never find it. I might miss a very prominent use case just because the cardinality has now spanned into a trillion different options that the AI model can take. I will not be able to fix that very important customer journey.”
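The arithmetic behind this point is worth spelling out. The sketch below uses illustrative rates (not figures from the interview) to compare a deterministic one-in-1,000 bug with a rare non-deterministic agent failure under the same 0.5% uniform sampling:

```python
# Back-of-the-envelope look at why uniform (head-based) sampling misses rare,
# non-deterministic failures. All rates are illustrative assumptions.

def expected_sampled_errors(requests: int, error_rate: float, sample_rate: float) -> float:
    """Expected number of failing traces that survive uniform sampling."""
    return requests * error_rate * sample_rate

REQUESTS_PER_DAY = 1_000 * 60 * 60 * 24  # 1,000 req/s for one day = 86.4M requests

# Deterministic bug that fires once per 1,000 calls, sampled at 0.5%:
print(expected_sampled_errors(REQUESTS_PER_DAY, 1 / 1_000, 0.005))       # 432.0 -> caught daily

# Non-deterministic agent failure that fires once per 10M requests:
print(expected_sampled_errors(REQUESTS_PER_DAY, 1 / 10_000_000, 0.005))  # ~0.04 -> almost never seen
```

Under these assumed rates, the deterministic bug surfaces hundreds of times a day in the sampled stream, while the rare agentic failure would be expected to appear roughly once every three weeks, which in practice means never when it matters.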
He further explained that AI workflows introduce a new category of failure: responses that appear successful but are functionally incorrect. This requires using AI-as-a-judge mechanisms to evaluate whether outputs are valid, adding another layer of complexity to observability.
Shahar Azulay: “In some cases, the answer looks perfectly valid. Our simple world of API status codes—200 OK means okay—now it’s not. I have to run AI as a judge on top of it to figure out what this entire flow means. The flows are wider, so it can be great here and fail two minutes later. We’re talking about two-hour sessions sometimes. What does it mean to validate that the entire session is healthy? It both doesn’t fit how we used to think about collecting and sampling data, and even if I collect it, I have to use AI as well to make sure it makes sense.”
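Here is a minimal sketch of the AI-as-a-judge pattern he describes, assuming an OpenAI-compatible client; the model name and pass/fail rubric are placeholders, not groundcover's evaluation logic.

```python
# A second model grades whether an agent's answer actually satisfied the
# user's request, even though the HTTP status was 200.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(prompt: str, answer: str) -> bool:
    """Ask a judge model whether the answer is valid for the prompt."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You are an evaluator. Reply with exactly PASS or FAIL: "
                        "does the answer correctly and completely address the prompt?"},
            {"role": "user", "content": f"Prompt:\n{prompt}\n\nAnswer:\n{answer}"},
        ],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("PASS")
```

In a production pipeline the verdict would typically be attached to the trace or stored as an eval record rather than returned in isolation, so that "looked like a 200 but failed the judge" becomes a queryable signal.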
groundcover’s approach to AI observability combines full-fidelity telemetry capture with a bring-your-own-cloud architecture to address both the technical and data sovereignty challenges of monitoring AI workloads at scale.