
Why AI Observability Fails Without Dynamic Data Collection Control | Shahar Azulay, groundcover | TFiR


Most AI observability integrations sit on top of static data pipelines built around dashboards and monitors that were defined long before the current incident. The AI has no ability to widen collection, raise cardinality, or shift what it is looking at in response to what it finds. It queries whatever data exists and attempts root cause analysis on telemetry that was never designed for dynamic investigation.

In this interview on TFiR, Shahar Azulay, CEO and Co-Founder at groundcover, breaks down why owning both the collection layer and the AI reasoning layer is what separates genuinely useful agent-mode observability from bolted-on AI that processes high volumes of low-quality data.

Guest: Shahar Azulay, CEO and Co-Founder at groundcover
Show: TFiR

Here is what every SRE, platform engineer, and AI infrastructure team needs to know.

Technical Deep Dive

Q: What is groundcover Agent Mode and how does it differ from a chatbot added to an observability platform?

Shahar Azulay, CEO and Co-Founder at groundcover, describes Agent Mode as a capability embedded across the entire platform rather than a chatbot bolted onto the UI. The agent runs close to the data, queries it privately, and can both initiate analysis within the platform and delegate structured information to external agents. It is designed to replace the pattern of engineers using coding agents to manually consume telemetry through MCP while trying to fix production problems.

“The agent is more than a chatbot kind of bolted into the platform. It is part of how we think that groundcover should coexist with your other agents.” — Shahar Azulay, CEO and Co-Founder, groundcover

Q: Why did groundcover customers start consuming telemetry through MCP, and what problem does that create?

Azulay says that in the six to nine months since groundcover released its MCP server, essentially all customers have moved to consuming telemetry through MCP. The pattern emerged because customers wanted to use telemetry to make decisions inside coding agents. The problem is that customers are effectively hand-building what agent mode should have provided natively, which adds complexity and gives up the structured, platform-native integration that agent mode delivers.

“Basically all of our customers are consuming our telemetry through MCP, which is great on one hand, but on the other they’re basically building what agent mode should have given them.” — Shahar Azulay, CEO and Co-Founder, groundcover

Q: How does groundcover agent mode fit into the broader observability stack?

Azulay describes the platform as becoming agentic across every dimension. The agent is context-aware of every part of the platform, runs close to the data, queries it privately, and supports agent-to-agent communication. Engineers can interact with groundcover through the UI or through external agent calls, and the agent carries context from across the platform into every interaction.

“You can communicate with groundcover differently now through agent to agent communication. It is part of everything that you do, whether you are using the UI or whether you communicate with it somewhere else.” — Shahar Azulay, CEO and Co-Founder, groundcover

Q: What is wrong with bolting AI onto an existing observability backend that does not control data collection?

Azulay identifies the core failure as data quality and data dynamism. Most customers do not have quality data that fits the current incident. Existing data is built around dashboards and monitors predefined for known failure patterns. AI needs wide, dynamic data, and a bolted-on AI layer that assumes the data is already there will sift through large volumes of low-relevance telemetry rather than making targeted, intelligent decisions. This makes it both less effective and less cost efficient.

“A vendor that won’t be able to control both the collection and the root cause analysis through the same AI platform is just going to provide less value and less cost effective value.” — Shahar Azulay, CEO and Co-Founder, groundcover

Q: How will groundcover agent mode dynamically control eBPF sensor data collection during an incident?

Azulay explains that agent mode will soon control what the eBPF sensor collects in real time. When an incident occurs, the agent will be able to instruct the sensor to collect higher cardinality data for a defined window, such as 15 minutes, to support deeper troubleshooting. This closes the loop between AI reasoning and data collection, allowing the agent to shape its own inputs rather than working from whatever static telemetry was already being gathered.

“If you have an incident and I want to collect higher cardinality data for the next 15 minutes to allow the agent to troubleshoot, the agent will be able to dynamically collect the data, control the data collection and transform it to fit what it wants to troubleshoot right now.” — Shahar Azulay, CEO and Co-Founder, groundcover
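
The time-boxed widening Azulay describes can be illustrated with a minimal sketch. This is not groundcover's API, which the interview does not describe; every name here (SensorPolicy, CollectionOverride, widen, labels_for, the label names, and the timestamps) is a hypothetical chosen for illustration. The sketch assumes the sensor keeps a baseline set of labels and that an agent can request extra high-cardinality labels for one workload until an expiry time, after which collection narrows back on its own.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CollectionOverride:
    """Hypothetical agent request to widen collection (not a groundcover API)."""
    target: str                    # workload under investigation
    extra_labels: frozenset[str]   # additional high-cardinality labels to keep
    expires_at: datetime           # when the sensor falls back to the baseline


@dataclass
class SensorPolicy:
    """Baseline label set plus any time-boxed overrides currently in force."""
    baseline_labels: frozenset[str]
    overrides: list[CollectionOverride] = field(default_factory=list)

    def widen(self, target, extra_labels, window, now):
        """Record an override for `target` that expires `window` after `now`."""
        override = CollectionOverride(target, frozenset(extra_labels), now + window)
        self.overrides.append(override)
        return override

    def labels_for(self, target, now):
        """Return the labels to collect for `target` at time `now`."""
        # Drop expired overrides so collection narrows again automatically.
        self.overrides = [o for o in self.overrides if o.expires_at > now]
        labels = set(self.baseline_labels)
        for override in self.overrides:
            if override.target == target:
                labels |= override.extra_labels
        return frozenset(labels)


# Usage: widen collection for a hypothetical "checkout" service for 15 minutes.
policy = SensorPolicy(baseline_labels=frozenset({"service", "status_code"}))
start = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary timestamp
policy.widen("checkout", {"pod", "request_path"}, timedelta(minutes=15), now=start)

during = policy.labels_for("checkout", start + timedelta(minutes=5))
other = policy.labels_for("payments", start + timedelta(minutes=5))
after = policy.labels_for("checkout", start + timedelta(minutes=20))
```

In this sketch, `during` contains the baseline labels plus `pod` and `request_path`, while `other` and `after` contain only the baseline. Tying the override to an expiry time rather than an explicit "stop" call is one way to keep the cost of higher cardinality bounded to the incident window even if the agent never follows up.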

Q: Why does data ownership determine long-term AI observability value and cost effectiveness?

Azulay frames data ownership as the structural advantage that determines which observability platforms will provide durable value. A vendor that owns both the collection layer and the AI reasoning layer can make the data fit the problem dynamically. A vendor that does not own collection is forced to process whatever volume of data exists, which is computationally expensive and analytically limited. Azulay describes the MCP-based external query pattern as a valuable but transitional phase that will give way to tighter integration.

“Who owns the data will eventually be able to create better value and more cost effective value.” — Shahar Azulay, CEO and Co-Founder, groundcover

Resources & Documentation

  • groundcover, cloud-native observability platform built on eBPF with agent mode and MCP server support

***

Full Raw Transcript

Swapnil Bhartiya: I want to go back to agent mode. Can you talk about what kind of use cases you have seen where either the beta testers or the production users were actually impressed that, hey, we can see what we couldn’t see earlier?

Shahar Azulay: Interestingly, right, the agent mode is a pull that we feel from our customers. Like, significantly, in the past six to nine months since we released the MCP server in groundcover, basically all of our customers are consuming our telemetry through MCP, which is great on one hand, but on the other they’re basically building what agent mode should have given them, right? They’re in their coding agent trying to fix problems and they’re consuming telemetry to make decisions. Agent mode is basically the ability to both stay inside the platform, right, and troubleshoot with, you know, the AI assistant that can now sift through, you know, huge volumes of telemetry and make very complicated analysis that you as a human just can’t. And on the other hand, it can also kind of delegate information to your agents in a very constructed way that is better than MCP through the platform itself, right? So if our customers, you know, used Claude before to query data using MCP, now they can use the agent mode and delegate to Claude in a very structured way and see the results back in the observability system. So we’re going from just the system of record to also the ability to control my AI fleet and make sure that I’m coding, right, solving problems and I understand what’s going on. So it’s a very interesting kind of pull that we felt from the market. Agent mode has been tested with, you know, enterprises for a few months now in our customer base. And we see these use cases as tremendously valuable. The adoption is phenomenal, which we love to see.

Swapnil Bhartiya: And how or where does agent mode fit into the rest of groundcover’s observability stack?

Shahar Azulay: It’s basically part of everything that we do. The platform is becoming agentic in any sense. The agent is aware of every part of the platform itself. Everything is a context that you can pass, you know, kind of unliterally to the agent. It can use its context from within the platform. It’s running close to the data. It queries the data privately. So it’s basically kind of embedded in everything that you do. You can communicate with groundcover differently now through agent to agent communication. So the agent is more than a chatbot kind of bolted into the platform. Right. It’s part of how we think that groundcover should kind of coexist with your other agents. So it’s kind of everywhere in everything that you can do, whether you’re using the UI or whether you communicate with it somewhere else.

Swapnil Bhartiya: And as organizations, like all the brownfield deployments, are just bolting AI onto the back end, what kind of nightmare is that creating for their teams? Platform teams, SRE teams or builder teams? And do you feel that approach is right, or, if they come and ask you, what should be the right approach?

Shahar Azulay: I mean, it’s a fun transient, right, to say I have data even if I’m an external vendor querying groundcover or Datadog or whatever. Right. I can build AI workflows that query the data and make decisions. It’s a very interesting transient. It just shows how much value there is in it. But it’s not going to last, from the perspective that who owns the data will eventually be able to create better value and more cost effective value. For example, groundcover’s agent mode will soon control dynamically what the eBPF sensor is collecting. So if you have an incident, for example, and I want to, you know, just collect higher cardinality data for the next 15 minutes to allow the agent to troubleshoot, then the agent will be able to dynamically collect the data, control the data collection and transform it basically to fit what it wants to troubleshoot right now. Right. This entire connection between the agent and control of the collection is exactly what you can’t do with a bolted-on AI platform that assumes the data is there. Right. Most of the customers don’t have quality data that fits the current incident. And this data is statically built and not dynamic. Right. It’s built on dashboards and monitors. They’re predefined specific use cases they learned over time to investigate. It’s not dynamic and wide enough like AI needs. So a vendor that won’t be able to control both the collection and the root cause analysis through the same AI platform is just going to provide less value and, to be honest, less cost effective value. It will have to sift through huge volumes of telemetry instead of making very complicated, very kind of, you know, intelligent decisions. And that’s where we’re going. Right. The agent mode will be part of the collection itself as well.
