AI systems fail differently than traditional applications. No error codes. No deterministic paths. Just subtle drift, hallucinations, and non-reproducible edge cases that legacy sampling strategies will never catch. Enterprises deploying AI agents into production are discovering that their observability stack—built for microservices and HTTP status codes—is fundamentally unfit for the non-deterministic chaos of LLM workflows.
The promise of agentic AI is colliding with a monitoring reality: when your AI workflow processes 50,000 spans per minute across 500 LLM calls, traditional observability becomes a liability, not an asset.
The Guest: Shahar Azulay, CEO and Co-Founder at groundcover
Key Takeaways
- AI failures are non-deterministic and invisible to sampling-based observability strategies designed for microservices
- eBPF provides the foundation for autonomous agent monitoring without instrumentation dependencies
- groundcover’s Agent Mode enables agent-to-agent communication and dynamic telemetry collection control
- Bring-your-own-cloud architecture addresses data sovereignty concerns for enterprises adopting AI in production
- Observability is shifting from post-production troubleshooting to real-time operating system for agentic workflows
***
In a recent TFiR interview, Swapnil Bhartiya spoke with Shahar Azulay, CEO and Co-Founder at groundcover, about the fundamental shift happening in observability as enterprises deploy AI agents into production and why traditional monitoring approaches fail for non-deterministic AI workloads.
Why AI Workloads Break Traditional Observability
Azulay identified the core problem: AI systems fail in ways that legacy observability tools were never designed to detect. Traditional monitoring relies on deterministic patterns and sampling strategies that assume predictable code paths—assumptions that collapse under AI workloads.
Q: How are AI failures different from traditional application failures?
Shahar Azulay: “The interesting thing about it is that there are a lot of things happening in non-deterministic paths that cause problems, including how you treat errors. Before, we used to think about having an API. There are multiple use cases for how to operate the API. It has variables. It can go into different paths. But eventually, it’s pretty deterministic. Given a state, and if it is called 1000 times a second, maybe I want to sample half a percent of it, which is what most of the industry does. That’s what we do to maintain a reasonable budget, and I will catch that error. If it’s occurring once in 1000 times, I will catch it if I sample correctly. Sampling errors and figuring out what is an error in AI are complicated, both in terms of how I store data and how I figure out what the data means. Even if there’s a clear error, it might be so non-deterministic and rare that sampling will never find it. I might miss a very prominent use case just because the cardinality now spans into a trillion different options that the AI model can go through. I will not be able to fix that very important customer journey.”
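Azulay's sampling arithmetic can be made concrete. The sketch below is an editorial illustration, not a groundcover figure: it uses the half-percent sampling rate he cites as the industry norm and hypothetical error frequencies to show why head-based sampling reliably catches common errors but almost never captures the rare, high-cardinality paths AI workloads produce.

```python
# Probability that a head-based sampler ever records a rare error.
# Illustrative numbers only: 0.5% sampling (the industry norm Azulay
# cites); the error rates are hypothetical.

def detection_probability(sample_rate: float, error_rate: float, requests: int) -> float:
    """P(at least one errored request is sampled) over `requests` calls,
    assuming sampling decisions and errors are independent."""
    p_miss_per_request = 1 - sample_rate * error_rate
    return 1 - p_miss_per_request ** requests

# A 1-in-1,000 error over a million requests is almost certainly caught...
common = detection_probability(0.005, 1 / 1_000, 1_000_000)
# ...but a one-in-a-million path is usually missed entirely.
rare = detection_probability(0.005, 1 / 1_000_000, 1_000_000)

print(f"1-in-1k error:  {common:.3f}")   # ≈ 0.993
print(f"1-in-1M error:  {rare:.3f}")     # ≈ 0.005
```

The asymmetry is the point: once the space of possible paths explodes, no fixed sampling budget preserves the rare customer journeys Azulay describes.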
He emphasized that the problem extends beyond error detection into the very definition of what constitutes a failure in AI systems.
Q: How do you even define what an error is in an AI system?
Shahar Azulay: “In some cases, the answer looks perfectly valid. In our simpler world of API status codes, where 200 OK meant everything was fine, things were very deterministic — now they’re not. I have to run AI as a judge on top of it to figure out what the entire flow means. The flows are wider, so something can work well at one point and fail two minutes later. We’re sometimes talking about two-hour sessions. What does it mean to validate that the entire session is healthy? So it’s not well-suited to how we used to think about data — sampling and collecting it — and even if I collect it, I still have to use AI to make sure it makes sense.”
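The "AI as a judge" pattern Azulay describes can be sketched in a few lines. This is an editorial illustration of the flow's shape, not groundcover's implementation; `call_judge_model` is a hypothetical hook, stubbed here so the example runs without a model provider.

```python
# Sketch of "AI as a judge" over a session: every span can look valid in
# isolation while the flow as a whole is broken.
from dataclasses import dataclass

@dataclass
class Span:
    status: str   # "ok" stands in for a 200-style status code
    output: str

def call_judge_model(prompt: str) -> str:
    # Hypothetical hook: in practice this would send the session
    # transcript plus a rubric to an LLM. Stubbed to flag contradictions.
    return "unhealthy" if "contradicts" in prompt else "healthy"

def judge_session(spans: list[Span]) -> str:
    if any(s.status != "ok" for s in spans):
        return "unhealthy"                 # deterministic errors still count
    transcript = " ".join(s.output for s in spans)
    return call_judge_model(f"Is this session coherent? {transcript}")

session = [
    Span("ok", "Refund approved."),
    Span("ok", "This contradicts the earlier refund."),
]
print(judge_session(session))  # "unhealthy" although every span returned ok
```

Note the inversion Azulay points to: the status codes are all green, and only a semantic pass over the whole flow surfaces the failure.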
Observability as the Operating System for Agentic AI
Azulay argued that observability is undergoing a fundamental transformation—from a post-production troubleshooting tool to the foundational operating system for agentic workflows. This shift has profound implications for how enterprises build, deploy, and maintain AI systems.
Q: How is observability changing in the context of AI and agentic systems?
Shahar Azulay: “Observability is changing from something that was more post-production-based. I used to investigate downtime and troubleshoot with observability, and I needed a lot of telemetry already to do it, to something that is now pivotal — the source of truth for any AI operating app or agents running on top of it. Whether I’m investigating root cause analysis or building code with my agents, I need context from production that observability can provide to build better and troubleshoot better. Observability is becoming more critical and more a part of the operating system for the agentic SDLC. People are building code with it, troubleshooting code with it, and testing code with it. So it’s just becoming more important.”
This transformation extends to the data itself—both its sensitivity and its volume are increasing simultaneously.
Q: What’s happening to telemetry data in AI systems?
Shahar Azulay: “What’s happening to telemetry is really interesting. On one hand, there’s more telemetry. Monitoring AI workloads is very different. The classical three pillars of logs, metrics, and APM are now changing because they don’t fit anymore. It’s not the same as monitoring a distributed trace in the legacy microservices world versus monitoring 50,000 spans in an AI workflow that uses 500 LLM calls a minute. It’s different. It requires a different product. You can’t sample the data in the same way. It’s not deterministic. There are a lot of changes in how we monitor AI. My customers are now operating in front of my AI features as end users of my company’s services. The data that my observability system now stores while monitoring these agents is much more sensitive than before. It might even include customer prompts. So it’s more sensitive, has higher cardinality, and I need it for more use cases. I want to use the monitors from my AI agents to operate something else that helps me build and fix these agents. I want to write evals that are based on AI to assess how fast and how well they operate. So it’s two different pulls in the same direction.”
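To put the volume side of that in perspective, here is a back-of-envelope calculation using the 50,000 spans per minute Azulay mentions; the average span size is an editorial assumption (LLM spans carrying prompts and outputs run far larger than classic APM spans).

```python
# Back-of-envelope arithmetic for the workload described: 50,000 spans
# per minute. The span size is a hypothetical assumption, not a quoted figure.
SPANS_PER_MIN = 50_000
AVG_SPAN_BYTES = 2_048   # assumed: LLM spans carry prompt/output payloads

spans_per_day = SPANS_PER_MIN * 60 * 24
per_day_gb = spans_per_day * AVG_SPAN_BYTES / 1e9

print(f"{spans_per_day:,} spans/day")        # 72,000,000 spans/day
print(f"≈ {per_day_gb:.0f} GB/day of spans")  # ≈ 147 GB/day
```

At sizes like this, the two pulls Azulay names compound each other: the data is too sensitive to ship casually to a vendor and too voluminous to store naively.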
eBPF as the Foundation for Autonomous Agent Monitoring
Azulay positioned eBPF—extended Berkeley Packet Filter—as the critical enabler for autonomous monitoring in AI-driven environments. Unlike instrumentation-based approaches, eBPF operates at the kernel level, providing complete visibility without code dependencies.
Q: Why is eBPF critical for AI observability?
Shahar Azulay: “There’s already an advantage to eBPF. We support both, of course, but there’s an advantage because your team might not be proficient enough, and you may miss some blind spots. There are things you can’t instrument that eBPF can see, because it observes everything. But when it comes to AI, it’s basically the foundation for autonomy. The code I’m writing isn’t being written by me anymore; it’s being written by my agents, who don’t think like me when it comes to instrumentation. Someone has to create the loop of data flowing from code being shipped to production back to the coding agents so they can make better decisions about how to ship next. eBPF is that foundation of autonomy. The agents can monitor themselves without being dependent on anything. Everything is covered — whether it’s AI-written or human-written — it’s covered in the same way. So I can feed that high-fidelity data back to agents so they can make better decisions about how to build next. We see it as the bedrock for autonomy. If I want to make sure that my SDLC pipeline becomes more and more autonomous with AI, I have to have something completely out-of-band, completely agnostic, that monitors my production, and eBPF is an important part of it.”
Agent Mode and Agent-to-Agent Communication
groundcover announced Agent Mode at KubeCon EU—a capability that transforms the observability platform itself into an agentic system capable of communicating with other AI agents across the development and operations stack.
Q: What is Agent Mode and why did groundcover build it?
Shahar Azulay: “We’ve announced at KubeCon our Agent Mode, which is basically our agentic experience inside groundcover. You can either operate with our agent throughout the platform — navigating, building dashboards, and troubleshooting — or you can delegate tasks using the agent to other agents used today. If you’re using Cursor to build code, you can delegate from Agent Mode in groundcover and get all the context from production into your other AI-native stack. groundcover is basically becoming agent-to-agent aware, enabling communication with both human users and agent users. You can also augment your operations with our agent through a wide range of capabilities that the experience provides. So it’s taking what we’ve built so far — how you store more telemetry privately, securely, and cost-effectively — and allowing agents now to operate on top of it for different tasks.”
Azulay described how Agent Mode emerged directly from customer demand—particularly from enterprises already using groundcover’s MCP (Model Context Protocol) server to feed telemetry into their coding agents.
Q: What drove the development of Agent Mode?
Shahar Azulay: “Agent Mode is a pull that we’ve been feeling significantly from our customers over the past six to nine months. Since we released the MCP server in groundcover, basically all of our customers have been consuming our telemetry through MCP, which is great on one hand, but on the other, they’re essentially building what Agent Mode should have provided them. They’re in their coding agents trying to fix problems, and they’re consuming telemetry to make decisions. Agent Mode is basically the ability to both stay inside the platform and troubleshoot with an AI assistant that can now sift through huge volumes of telemetry and perform complex analyses that you as a human just can’t. On the other hand, it can also delegate information to your agents in a very structured way that is better than MCP through the platform itself. If our customers used Claude before to query data using MCP, now they can use Agent Mode and delegate to Claude in a very structured way, and see the results back in the observability system. So we’re evolving from just being a system of record to also having the ability to control my AI fleet, making sure that I’m coding correctly, solving problems, and understanding what’s going on.”
Q: How does Agent Mode fit into the broader groundcover platform?
Shahar Azulay: “It’s basically part of everything that we do. The platform is becoming agentic in every sense. The agent is aware of every part of the platform itself. Everything is context that you can pass to the agent. It can use this context from within the platform. It’s running close to the data. It queries the data privately. So it’s basically embedded in everything that you do. You can communicate with groundcover differently now through agent-to-agent communication. The agent is more than a chatbot bolted into the platform. It’s part of how we think groundcover should coexist with your other agents. So it’s everywhere and in everything that you can do, whether you’re using the UI or communicating with it somewhere else.”
Data Sovereignty and Bring-Your-Own-Cloud Architecture
Azulay positioned groundcover’s bring-your-own-cloud architecture as critical for enterprises concerned about data sovereignty, privacy, and security when deploying AI in production. This approach stores telemetry in the customer’s own cloud infrastructure rather than in a vendor-controlled data plane.
Q: How does groundcover address data sovereignty concerns?
Shahar Azulay: “groundcover has always been different in how we operate. We operate on top of a bring-your-own-cloud architecture, which basically means we don’t store our customers’ telemetry. We host it in their cloud premises, privately secured, and we’re fit for data abundance in a very cost-effective way. Customers can have more telemetry, more high-fidelity telemetry to troubleshoot. This is exactly what AI needs as well. The major concern about AI is who’s going to use it, and what data is going to move through there. Who’s processing the data, who’s training on this data. Enterprises are very concerned about that. groundcover’s entire approach since the beginning was a bring-your-own-cloud data plane. Basically, we don’t own the data. We store it privately. And even AI is operated on top of Bedrock, Vertex—basically the customer’s own hyperscaler AI connection. So it’s their tokens, it’s their AI models running in their VPC. So even though we provide all these rich experiences, they’re not going out to anywhere else. If you want to adopt AI, if you want to troubleshoot with AI, you don’t have to be concerned about data privacy, about data sovereignty, about data security—even shipping maybe PII data to a vendor that is not necessarily aware of what it’s getting. Maybe you didn’t intend to monitor the customer prompts. They can even punch in information there that you might not have wanted to store as a vendor. So this entire motion of bring-your-own-cloud just gets more emphasized with AI. The data is very private, very confidential, basically to the customers themselves.”
Dynamic Telemetry Collection Control
Azulay revealed that groundcover’s Agent Mode will soon enable dynamic control over telemetry collection—a capability that third-party AI platforms bolted onto static observability data cannot replicate.
Q: How is Agent Mode different from third-party AI tools that query observability data?
Shahar Azulay: “It’s a fun transition to say, ‘I have data,’ even if I’m an external vendor querying groundcover, Datadog, or whatever. I can build AI workflows that query the data and make decisions. It’s a very interesting transition. It just shows how much value there is in it. But it’s not going to last. From the perspective of who owns the data, those owners will eventually be able to create better and more cost-effective value. For example, groundcover’s Agent Mode will soon dynamically control what the eBPF sensor is collecting. So if you have an incident, for example, and I want to collect higher-cardinality data for the next 15 minutes to allow the agent to troubleshoot, then the agent will be able to dynamically collect the data, control the data collection, and transform it to fit what it needs to troubleshoot at that moment. This entire connection between agent control and data collection is exactly what you can’t do with a bolt-on AI platform that assumes the data is already there. Most customers don’t have quality data. Most customers don’t have quality data that fits the current incident. And this data is statically built, not dynamic. It’s built on dashboards and monitors. They’re predefined for specific use cases. These systems have learned over time how to support investigation, but they’re not dynamic or broad enough for what AI needs. So a vendor that cannot control both the collection and the root cause analysis through the same AI platform is going to provide less value and, to be honest, less cost-effective value. It will have to sift through huge volumes of telemetry instead of making complex, intelligent decisions. And that’s where we’re going. Agent Mode will be part of the collection itself as well.”
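The dynamic-collection loop Azulay previews can be sketched as a small controller. Everything here is a hedged illustration of the idea, not groundcover's API: the class name, the rates, and the 15-minute window are assumptions drawn from his incident example.

```python
# Hedged sketch of agent-controlled collection: a hypothetical controller
# lets an agent raise the sensor's sample rate for a bounded incident
# window, after which collection reverts to the baseline automatically.
import time

class CollectionController:
    def __init__(self, baseline_rate: float = 0.005):
        self.baseline_rate = baseline_rate   # illustrative 0.5% default
        self._boost_rate = baseline_rate
        self._boost_until = 0.0

    def boost(self, rate: float, duration_s: float = 15 * 60) -> None:
        """Agent requests higher-cardinality capture during an incident."""
        self._boost_rate = rate
        self._boost_until = time.monotonic() + duration_s

    def current_rate(self) -> float:
        if time.monotonic() < self._boost_until:
            return self._boost_rate
        return self.baseline_rate            # window expired: back to baseline

ctl = CollectionController()
print(ctl.current_rate())   # 0.005 baseline
ctl.boost(1.0)              # capture everything while the agent investigates
print(ctl.current_rate())   # 1.0 for the next 15 minutes
```

The design choice the sketch highlights is the one Azulay draws: a bolt-on AI platform can only read whatever was statically collected, while a system that owns the sensor can widen the aperture exactly when the investigation needs it.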
Observability ROI in the AI Era
Azulay argued that the competitive dynamic around observability is shifting from reliability concerns to velocity and ROI—enterprises with richer telemetry infrastructures can ship AI features faster and iterate more effectively.
Q: What capabilities are required to build trust and accountability into production AI systems?
Shahar Azulay: “Without a very trusted telemetry system, you cannot ship these things with confidence. It’s hard to say if they’re working on one end, and it’s hard to make sure that you can move fast enough. The interesting thing is that, a year ago, you would mostly be worried about downtime. I’m shipping an agent out there, and both my competitors and I are shipping AI features, but theirs is more reliable. Then my engineers wake up at night and open an observability platform — it should be reliable. That’s what you were most focused on. Right now, the focus is shifting to ROI for your customers. A competitor with a more robust telemetry foundation can build agents faster and release more AI-powered features to their customers. They have more production telemetry to make better decisions about what to ship and how to build it. So it’s becoming very clear that if you don’t have a very secure and data-abundant way to store telemetry, you will not be able to ship fast enough. So we see customers becoming very concerned about that. And observability is an important part of all of it.”