Why AI Agents Fail in Production Without Trusted Telemetry | Shahar Azulay, groundcover | TFiR

Trusted telemetry is non-negotiable for AI in production. Shahar Azulay of groundcover explains why velocity—not just reliability—is the new differentiator | TFiR

By Monika Chauhan 23 hours ago


Most AI pilots never reach production with confidence intact. Teams discover too late that they cannot verify whether their agents are actually working, and the gap between what competitors ship and what they ship widens with every release cycle. The organizations closing that gap are doing it with better production telemetry, not better models.

In this interview on TFiR, Shahar Azulay, CEO and Co-Founder at groundcover, breaks down why trusted, data-abundant telemetry infrastructure is now a hard prerequisite for shipping AI workloads to production and staying competitive.

Guest: Shahar Azulay, CEO and Co-Founder at groundcover

Show: TFiR

Here is what every platform engineer and AI infrastructure team needs to know.

Technical Deep Dive

Q: What capabilities are required to move AI from pilots into production with trust and accountability?

Shahar Azulay, CEO and Co-Founder at groundcover, argues that a trusted telemetry system is the foundational requirement before any AI workload can ship to production with confidence. Without it, teams have no reliable way to verify whether their agents are functioning correctly. The absence of this foundation makes it impossible to move fast or make informed decisions about what to build next.

“Without a very trusted telemetry system, you cannot ship these things with confidence.” — Shahar Azulay, CEO and Co-Founder, groundcover

Q: How has the competitive pressure around AI deployments shifted over the past year?

Azulay notes that a year ago, the primary concern was reliability and uptime. Engineers were focused on whether their AI features stayed online and whether their system was more dependable than a competitor’s. That calculus has now shifted from reliability to velocity, where the team with richer production telemetry can build faster, release more AI features, and make better-informed product decisions.

“Right now it’s moving to your customers. The other competitor has more rich telemetry to build the agent faster, to release more AI workload features.” — Shahar Azulay, CEO and Co-Founder, groundcover

Q: Why does production telemetry directly affect how fast a team can ship AI features?

Azulay explains that production telemetry is the primary input for deciding what to ship and how to build it. Teams with abundant, high-quality data from live AI workloads can make faster and better decisions at every stage of development. Teams without it are guessing, which slows every decision downstream.

“They have more production telemetry to make better decisions on what to ship and how to build it.” — Shahar Azulay, CEO and Co-Founder, groundcover

Q: What happens to teams that lack a secure, data-abundant telemetry storage system for AI workloads?

Azulay is direct on this point: teams without a secure and data-abundant telemetry storage layer will not be able to ship fast enough to remain competitive. Customers are already aware of this risk and raising it explicitly. Observability is no longer a post-launch concern but a core part of the architecture required before AI goes to production.

“If you don’t have a very secured and very data abundant way to store telemetry, you will not be able to ship fast enough.” — Shahar Azulay, CEO and Co-Founder, groundcover

Q: What role does observability play specifically in AI production readiness?

Azulay frames observability as an important, load-bearing part of the AI production stack rather than supplementary tooling added after deployment. It is the mechanism by which teams build confidence in their agents, validate workload behavior, and generate the production signal needed to accelerate development cycles.

“Observability is an important part of all of that.” — Shahar Azulay, CEO and Co-Founder, groundcover
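To make the idea of "production signal" concrete, here is a minimal sketch in Python of recording one structured telemetry event per agent call and deriving a success rate from those events. It is an illustration only and does not reflect groundcover's product or API. The names `AgentCallRecord`, `call_with_telemetry`, `success_rate`, and `flaky_agent` are hypothetical, and the in-memory list stands in for a real telemetry store.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional


@dataclass
class AgentCallRecord:
    """One telemetry event for a single agent invocation (hypothetical schema)."""
    agent: str
    ok: bool
    latency_ms: float
    error: Optional[str] = None


def call_with_telemetry(
    agent: str,
    fn: Callable[[str], str],
    prompt: str,
    sink: List[AgentCallRecord],
) -> Optional[str]:
    """Run fn(prompt) and append a record to sink on success or failure.

    For brevity this sketch swallows the exception and returns None;
    a real system would likely re-raise after recording the event.
    """
    start = time.perf_counter()
    try:
        result = fn(prompt)
    except Exception as exc:
        elapsed_ms = (time.perf_counter() - start) * 1000
        sink.append(AgentCallRecord(agent, False, elapsed_ms, repr(exc)))
        return None
    elapsed_ms = (time.perf_counter() - start) * 1000
    sink.append(AgentCallRecord(agent, True, elapsed_ms))
    return result


def success_rate(records: List[AgentCallRecord]) -> float:
    """Fraction of recorded calls that succeeded (0.0 when there are none)."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.ok) / len(records)


def flaky_agent(prompt: str) -> str:
    """Stand-in for a real agent: fails on empty input."""
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


records: List[AgentCallRecord] = []
call_with_telemetry("demo", flaky_agent, "hello", records)
call_with_telemetry("demo", flaky_agent, "", records)
print(success_rate(records))             # 0.5
print(json.dumps(asdict(records[-1])))   # the failed call, as a JSON event
```

Even a record this small lets a team answer the question raised in the interview, whether the agent is actually working, with data rather than guesswork.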

Resources & Documentation

groundcover: an AI-native observability platform designed for cloud-native environments and AI workload monitoring

***

Full Raw Transcript

Swapnil Bhartiya: Now, as of course, organizations are moving AI from pilots, most actually pilots fail. But as they move from pilots into production, from your perspective, what capabilities are like absolutely required to build trust and accountability into those systems?

Shahar Azulay: So again, I think without a very trusted telemetry system, you cannot ship these things, like with confidence, right? Because it’s hard to say if they’re working on one end. And again, it’s hard to make sure that you run fast enough. I think the interesting thing is that if, if like a year ago, you would mostly be worried about downtime, right? I’m shipping the agent out there and you know, me and my competitors are both shipping AI features, but theirs is more reliable, right? Then my engineers wake up at night, open an observability platform, it should be reliable. You were most focused about that. Right now it’s moving to our eye, to your customers. The other competitor has a more rich telemetry betting to build the agent faster, to release more AI workload features into their customers. They have more production telemetry to make better decisions on what to ship and how to build it. So it’s becoming very kind of clear that if you don’t have a very secured and very data abundant way to store telemetry, you will not be able to ship fast enough. So we see customers kind of being very concerned about that. And observability, again, is an important part of all of that.
