Most AI agents deployed in production today fail quietly. They hallucinate, loop endlessly, and make confident but incorrect decisions—all because they lack proper context. The problem isn’t the power of large language models. The problem is the system of context around them. When agents don’t have the right context at the right time, reliability breaks down, and traditional observability stacks have no visibility into what’s actually happening inside these decision loops.
This challenge is especially acute for Site Reliability Engineering (SRE) teams, where agents operate at the tail end of the complexity curve. SRE workloads demand certainty, transparency, and trust—qualities that proprietary agent frameworks struggle to deliver at scale.
The Guest: Andre Elizondo, Director of Innovation at Mezmo
Key Takeaways
- AURA is an open source, Apache 2.0-licensed agent harness built in Rust specifically for production SRE workloads, designed to short-circuit the learning curve from agent frameworks to reliable production deployment
- Self-correcting reasoning loops are baked into AURA’s architecture: agents plan, execute, synthesize, and self-evaluate—automatically replanning when confidence is low, just like a human SRE would
- Open source transparency is essential for SRE trust: teams need to own, audit, and understand how agents reach conclusions before deploying them in mission-critical environments
- Context engineering happens on both sides: Mezmo optimizes telemetry data behind MCP tool calls, while AURA manages context orchestration after tool execution
- Production deployment is Kubernetes-native with horizontal scaling, circuit breakers, multi-agent collaboration, and human-in-the-loop approval gates for gradual trust-building
***
In this exclusive interview with Swapnil Bhartiya at TFiR, Andre Elizondo, Director of Innovation at Mezmo, discusses AURA—an open source agent harness designed to solve the reliability, transparency, and context management challenges that plague AI agents in production SRE environments.
The Production Reliability Problem: Agents Fail Quietly
Most organizations racing to deploy AI agents into production face a common problem: agents fail silently. They hallucinate, loop, and make confident but incorrect decisions based on incomplete context. Traditional observability stacks offer no visibility into these failures.
Q: What is the core problem AURA is trying to solve?
Andre Elizondo: “Agents fail quietly, right? That’s a problem that everybody has when building agents and pushing them out into production. When you look at SRE use cases, those are on the tail end of the complexity curve. You have to make sure that they’re very reliable. You have to make sure that they’re going to do the right thing every single time. What we saw in the industry and our own experiences is that there’s a big lift between taking an open source agent framework—something that’s just like a bunch of Lego blocks—and saying, hey, I want to build an SRE agent on this.”
AURA emerged from Mezmo’s own internal AI Root Cause Analysis (RCA) system, which runs in production for all Mezmo customers. The team recognized that the gap between open source agent frameworks and production-ready SRE agents was too wide, forcing teams to absorb three years of agent-development lessons just to get started.
Andre Elizondo: “We wanted to really short circuit people’s experience to get to the end of the curve, and do that in a way that was easy and really familiar for the folks that we built AURA for. It’s all config file driven. It’s very similar to deploying a Kubernetes manifest. AURA really is focused on giving those agents the best possible context needed, where AURA can sit as a standalone project that has its own ecosystem that really enables you to build those AI SRE agents without having to learn the last three years of agent development.”
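As a rough illustration of the config-driven, manifest-like approach Elizondo describes, an agent definition could take a shape like the following. The field names and values here are invented for illustration and are not AURA’s actual schema:

```yaml
# Hypothetical agent definition in the spirit of a Kubernetes manifest.
# All field names are illustrative only, not AURA's real config format.
kind: Agent
metadata:
  name: sre-investigator
spec:
  model:
    provider: anthropic        # one of the supported LLM providers
  reasoning:
    maxReplans: 3              # how many times the agent may replan
    confidenceThreshold: 0.8   # below this, re-enter the reasoning loop
  tools:
    mcpServers:
      - url: https://mcp.example.internal  # any MCP server, auto-discovered
  approval:
    humanInLoop: true          # require sign-off before executing actions
```

The point of the comparison is operational familiarity: an SRE reading such a file can see at a glance what is tunable, the same way a Kubernetes manifest makes a workload’s shape self-evident.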
Why Open Source: Transparency, Trust, and Disruption
The agent orchestration space is crowded with proprietary solutions—many baked into existing observability platforms or offered by venture-backed startups behind API paywalls. Mezmo took a different approach by releasing AURA as Apache 2.0-licensed open source software.
Q: What was your driver behind making AURA open source?
Andre Elizondo: “We went down this journey of thinking about the best possible way to advance the community. There are a lot of different products and solutions in the space right now, either baked into existing observability platforms or offered by standalone startups that focus on taking large amounts of data, making sense of it, and then executing workflows on it. What we believe is that this piece really needs to be open. We wanted to disrupt the idea of it always sitting behind a paywall or locked behind a vendor’s API.”
The philosophy mirrors what Kubernetes did for container orchestration: create an open standard that teams can trust, own, and extend. For SRE teams responsible for agents that investigate issues and execute workflows on their behalf, transparency is non-negotiable.
Andre Elizondo: “As an SRE who is responsible for an SRE agent that is taking some work off your plate and investigating issues for you, accomplishing tasks for you, you need to really be able to trust that agent. You need to be able to really have the right level of transparency, where you understand what that agent is doing, so that you can feel confident utilizing it and trusting its output and understanding how it came to a conclusion. In order to accomplish all these different tasks, we really do believe that open source is the best possible way to do this—to share our practices, make them apparent, and share our learnings with the rest of the industry.”
AURA Is Platform-Agnostic: You Don’t Need Mezmo
Unlike vendor-locked agent platforms, AURA was designed to stand on its own. While Mezmo provides optimized telemetry context through its MCP (Model Context Protocol) server, AURA integrates with any observability solution, incident management platform, or custom tooling via standard MCP interfaces.
Q: Can people get benefit from AURA if they’re not Mezmo customers?
Andre Elizondo: “We considered this from the very beginning. AURA should stand on its own. Whether or not you’re plugging in Mezmo—for us, we plug in through the MCP interface—anything with an MCP server, we can tie into AURA, and those tools are automatically discovered and instrumented to the agents. That could be our MCP server served through Mezmo, but you definitely don’t need to use Mezmo in order to utilize AURA. You can tie it into your favorite observability solution, the solution you’re using today, your incident management solution. AURA is still just as effective. You just may burn a little bit more tokens because you’re not engineering that context behind the scenes where Mezmo is really focused, but you can still get to 100% of your workflows utilizing Mezmo or not.”
The Agent Harness Architecture: Self-Correcting Reasoning Loops
AURA is described as an “agent harness” rather than an “agent framework”—a deliberate distinction emphasizing that it ships with batteries included. The core differentiator is AURA’s built-in planning, execution, synthesis, and evaluation loops designed to prevent cascading failures, context corruption, and drift.
Q: How does AURA’s architecture prevent cascading failures and context corruption?
Andre Elizondo: “This really comes back to the concept of an agent harness and not just calling it an agent framework, because we believe that with AURA, you can start with a simple 50-line config file. That config file gives the agent everything it possibly needs in order to accomplish its task. Everything behind the scenes—how you orchestrate its memory, how you tune its reasoning cycles—we ship with sane defaults. You can definitely utilize the defaults that are out of the box, but you have full control through that simple config where I can customize how the agent is going to plan a little bit longer, reason a little bit longer, utilize its tools to get to an end state.”
The planning loop is the secret sauce. AURA’s self-correction mechanism ensures agents don’t blindly execute workflows without verification.
Andre Elizondo: “There’s things like self-correction built in, where as the agent is first planning and understanding, ‘Hey, what’s the task that’s in front of me?’—whether that’s an ad hoc question from your team, a reactive task like a monitor that just fired, or a proactive task like continually looking at incidents unfolding to make sure Runbooks are updated—any one of these different tasks benefit from this reasoning loop. The agent will first plan, understand what it’s doing, execute any tools, synthesize the outputs so it can collect its thoughts, and then it’ll self-evaluate. There’s actually checks built into the system where if the agent isn’t fully confident in the reasoning that it got to, or if tools didn’t give it the full output of what it needed to see in order to feel comfortable about what the next task should be, it’ll continue to basically prompt a reasoning loop. It’ll go back to a replanning phase and say, ‘Hey, maybe I don’t have the full picture. Maybe I don’t feel super comfortable about this outcome. Let me go and recheck my thinking, see what else is available to me, see what other tools I can call.'”
This mirrors how a human SRE would approach problem-solving: evaluate, verify, and correct course if something doesn’t feel right.
Andre Elizondo: “What we really wanted to avoid was the scenario where you’re using Claude and you get the situation where it’s like, ‘Hey, you did something wrong.’ And it’s like, ‘You’re absolutely right.’ With AURA, the agent understands, the agent is evaluating itself, it’s correcting itself. It’s leaving hints for itself in the future. So if it sees that it did something and that was very well received or gave the outcome that it needed, it’ll leave memories behind for itself so it can quickly get to that solution the next time this happens or something similar happens.”
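The loop Elizondo describes—plan, execute tools, synthesize, self-evaluate, and replan when confidence is low—can be sketched in a few lines of Rust. This is a minimal simulation with stubbed LLM and tool calls, not AURA’s implementation; the threshold, replan cap, and confidence scoring are all invented for illustration:

```rust
// Minimal sketch of a self-correcting reasoning loop (not AURA's code):
// plan -> execute -> synthesize -> self-evaluate, replanning on low confidence.

#[derive(Debug)]
struct Synthesis {
    summary: String,
    confidence: f64, // 0.0..=1.0, produced by the self-evaluation step
}

fn run_task(task: &str, threshold: f64, max_replans: u32) -> Option<Synthesis> {
    for attempt in 0..=max_replans {
        let plan = plan(task, attempt);          // replanning widens the plan
        let outputs = execute(&plan);            // tool calls (stubbed below)
        let result = synthesize(task, &outputs); // collect thoughts + score
        if result.confidence >= threshold {
            return Some(result);                 // confident: stop looping
        }
        // Low confidence: fall through and replan with what was learned.
    }
    None // out of replans: escalate to a human instead of guessing
}

// Stubs standing in for LLM planning, tool execution, and evaluation.
fn plan(task: &str, attempt: u32) -> Vec<String> {
    (0..=attempt).map(|i| format!("{task}: step {i}")).collect()
}
fn execute(plan: &[String]) -> Vec<String> {
    plan.iter().map(|s| format!("output of {s}")).collect()
}
fn synthesize(task: &str, outputs: &[String]) -> Synthesis {
    Synthesis {
        summary: format!("{task}: {} observations", outputs.len()),
        // Stub heuristic: confidence grows with the evidence gathered.
        confidence: 0.4 + 0.2 * outputs.len() as f64,
    }
}

fn main() {
    match run_task("investigate latency spike", 0.75, 3) {
        Some(s) => println!("done: {} (confidence {:.2})", s.summary, s.confidence),
        None => println!("escalating to a human"),
    }
}
```

The essential property is the one Elizondo calls out: the exit condition is confidence, not completion, so an unconvincing first pass triggers another planning cycle rather than a confidently wrong answer.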
Built in Rust, Supports Five LLM Providers, Kubernetes-Native Deployment
AURA is built in Rust for performance and reliability. It supports five LLM providers and integrates with any MCP-compatible tooling. Deployment is designed to be as familiar as deploying any other cloud-native service.
Q: How does an organization actually deploy AURA?
Andre Elizondo: “You can run AURA pretty much anywhere. We also have a recommended path for how you deploy AURA within a container, within your Kubernetes cluster, standalone on an EC2 instance. There’s a lot of different options, and we really wanted to make sure those were easy, approachable options. With MCP, we support basically anything that has an MCP server, whether that’s returning data that is optimized for the agent or not. The system of AURA will take care of ensuring that won’t bloat the context window, won’t cause hallucinations based on the tool returning too much or not returning the right formatted output. There’s a lot of fail-safes built into the system.”
AURA’s plug-and-play architecture allows teams to integrate custom MCP servers, in-house tools, or third-party observability platforms without vendor lock-in.
Andre Elizondo: “The pluggability of AURA was really one thing that we heavily focused on. We didn’t want to be a closed ecosystem where you couldn’t integrate with AURA, couldn’t build on top of AURA, couldn’t test out AURA with the tools that you’re using today. You can plug in AURA into any one of these different systems, and AURA will automatically understand what to do with those tools, how to map them to the appropriate agents. If you have a multi-agent system, you can easily define that within AURA. If you have a security expert, DBA expert, networking expert defined as agents, AURA will take care of ensuring those tools are available to the system while giving you the pluggability of just utilizing that standard ecosystem.”
Distributed Systems Design for SRE Workloads
AURA is not just AI tooling—it’s built as a distributed system with horizontal scaling, circuit breakers, and multi-agent collaboration. This SRE-first mindset sets it apart from typical agent frameworks.
Q: Why is the distributed systems approach critical for production AI?
Andre Elizondo: “It’s twofold. One, because we wanted to build for our audience, we wanted to build a system that was scalable—horizontally scalable. I can have multiple different instances that can have those instances collaborate with each other fairly easily. But also it’s a great way for us to share context to just the agents that need it at that moment in time. It’s a great way to preserve token usage by default. It’s partially us really focusing on our audience as SREs and platform engineers and folks that are running services in production. It should just look like another service with the same level of reliability guarantee, the same level of configurability, and the same level of scalability that you expect from literally any one of your other tools.”
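One of the distributed-systems pieces mentioned above, the circuit breaker, is worth sketching because it is exactly the pattern SREs already apply to flaky dependencies: after repeated tool-call failures, stop calling and fail fast instead of letting a broken tool poison the agent’s context. This is a generic illustration of the pattern, not AURA’s implementation, and a production breaker would add a half-open state with a timeout:

```rust
// Minimal sketch of the circuit-breaker pattern (not AURA's code): after
// `max_failures` consecutive failures, the breaker opens and rejects calls.

struct CircuitBreaker {
    consecutive_failures: u32,
    max_failures: u32,
}

impl CircuitBreaker {
    fn new(max_failures: u32) -> Self {
        Self { consecutive_failures: 0, max_failures }
    }

    /// Run `call` through the breaker; `Err` means rejected or failed.
    fn call<T>(&mut self, call: impl FnOnce() -> Result<T, String>) -> Result<T, String> {
        if self.is_open() {
            return Err("circuit open: skipping call".into()); // fail fast
        }
        match call() {
            Ok(v) => {
                self.consecutive_failures = 0; // a success closes the breaker
                Ok(v)
            }
            Err(e) => {
                self.consecutive_failures += 1; // count toward opening
                Err(e)
            }
        }
    }

    fn is_open(&self) -> bool {
        self.consecutive_failures >= self.max_failures
    }
}

fn main() {
    let mut breaker = CircuitBreaker::new(2);
    for _ in 0..3 {
        let _ = breaker.call(|| Err::<(), String>("tool timed out".to_string()));
    }
    println!("breaker open: {}", breaker.is_open());
}
```

For an agent, the payoff is the same as for any service: a misbehaving MCP tool degrades into a fast, explicit error the agent can reason about, rather than a slow cascade of retries and bloated context.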
The design philosophy prioritizes familiarity and self-service for SRE teams.
Andre Elizondo: “We wanted to make a project, a harness, that was familiar to our audience—something that was self-explanatory. If you’re looking at a config file, you should be able to see, ‘This is what I can tune about the agent. This is why it’s important.’ I don’t have to go and manage an entire SDLC for every single one of my agents. I don’t have 10 different teams all building on 10 different frameworks, all building on 10 different methods of a loop or whatever. AURA really is focused on standardizing that 80% and standardizing that 80% in a way that you can just focus on the 20% that’s relevant to that specific workflow or use case for you, and be able to spread that across your entire team and really be able to build more agents that are more predictably running in production in a way that you would expect a production service to operate.”
The Mezmo Connection: Engineering Context Behind the Tools
Mezmo specializes in telemetry pipelines and active observability. AURA fits into that ecosystem by consuming highly optimized context through Mezmo’s MCP servers—but the relationship is intentionally decoupled.
Q: How does AURA connect to Mezmo’s telemetry and observability platform?
Andre Elizondo: “Going back to the genesis of AURA, we saw a need in the market where I can get to the point where I’m processing a lot of telemetry data, generating insights, looking for the needles in the haystack. We got to a point initially where we were just publishing those through our MCP server. The opportunity we saw, the pain we saw in the ecosystem, is that getting from that point of an MCP server having the right data for you, and then you being able to utilize that data effectively in a workflow that’s meaningful to you, was still at a point where it’s a little uncertain for people.”
Mezmo’s MCP tools are purpose-built for agent consumption, not just API wrappers.
Andre Elizondo: “To integrate Mezmo with AURA is really just through our standard MCP interface. Those MCP tools are really focused for agent consumption. We build AURA and all the tools that we expose through AURA with the intent that somebody might use this in Claude Code, somebody might use this in AURA, somebody might use this in any agent system, but at the end of the day, the downstream customer of that data is an agent. When you look at the philosophy of how we’ve built our MCP server and the tools that we expose, it’s purpose-built. It’s not just us slapping an MCP server on top of an API, which is what a lot of people have done just to say that they have an MCP server.”
The two-sided equation: context engineering before and after tool execution.
Andre Elizondo: “We really focus on two halves of the equation. There’s what we can do to engineer things after the tool call, which is really where AURA comes into play. That’s an open control plane—you can plug that into anything. But then what do we do to engineer the data or engineer the context behind the tool? That’s really where Mezmo as a company, Mezmo as a platform, is really focused. At the end of the day, if you have better tools, you have better data, you have better agents as your outcome.”
Early Reception: A Breath of Fresh Air
AURA was presented at Nvidia GTC last month and has been in beta with design partners. The reception highlights the frustration teams feel with proprietary, API-locked agent platforms.
Q: What kind of response has AURA received from the community?
Andre Elizondo: “The initial reception of the project has been a breath of fresh air, because a lot of these teams are used to seeing an API, seeing a service where in order to test it out, in order to feel it, I have to go sign up for it, get security approval to run this in my environment—all these different steps in order to just see if the thing works for you. That’s led to a lot of frustration in the market. With AURA, what’s great is because it’s Apache 2.0-licensed, because it’s open source, you can start with it right away. You can test it out on your laptop, your cloud environment, your dev environment. You can put it into production as soon as you feel like you trust it.”
The project is being developed in public with rapid iteration.
Andre Elizondo: “We’re working with a few design partners today to get to the point where we feel like, ‘Hey, we’ve incorporated the feedback.’ What we want to focus on is the innovators of SRE, the innovators of platform engineering, the folks that are living on the frontier of how they’re building these agents but really want to focus on standardization, not just reinventing the wheel for every single agent. AURA has been a breath of fresh air for the ecosystem so that we can focus on giving them the best possible tools. Going back to the Kubernetes analogy, we’re giving you a platform, we’re giving you a runtime, we’re giving you all the batteries included in order to make this effective. Now you can just focus on defining the piece that’s relevant to that team, that’s relevant to your workflow, and do that predictably, where you can build five agents, 10 agents, 100 agents, and know that will scale at the end of the day in production.”
The Future: From Elevators to Dark Factories
The long-term vision for AURA—and for AI-driven SRE more broadly—follows a trust-building trajectory similar to early elevator adoption. Human-in-the-loop workflows will gradually give way to fully autonomous operations.
Q: Where do you see AURA and systems of context evolving as organizations move AI agents from demos to operational workloads?
Andre Elizondo: “When we look at where we are in the cycle, we’re in the early phases of seeing AI SREs really get into production, have meaningful value, prove their ROI. Where the natural curve goes, and where we’re focused on with developing the ecosystem around AURA and Mezmo, is—you’re going to start kind of like how we did with the early elevators. If you look at the history of elevators, you have a human in the loop every single time. You don’t really trust the agent. And that’s okay. We have native human-in-the-loop workflows built into the system. We expect that to be how people get started: you want the agent to get to a certain point, and then you want to hit approve, or you want to interrogate its output, and then you want it to go to the next point.”
The long-term goal is the “dark factory” model—full autonomy without human oversight.
Andre Elizondo: “Similar to the elevator analogy, you always had somebody in the elevator pressing the button for you, and then you got comfortable enough to where—there’s certain folks calling it the dark factory analogy, where you feel so comfortable about the factory operating effectively and not needing humans there, that you just turn off the lights. Where this goes in the longer span is enabling those components, enabling the ecosystem where we feel like we have the right pieces in place to enable that dark factory mentality. I think we have a little bit to get there, but where we’re focused on with AURA is we want to bring SREs into the fold.”
Bringing SREs Into the AI Revolution
Throughout his career, Elizondo has focused on teams left behind during major technology shifts. With AURA, the goal is to ensure SRE teams aren’t sidelined during the AI revolution.
Andre Elizondo: “Throughout my career, I’ve always been focused on folks that have been left behind—whether that was DevOps, whether that was containers. There’s always been this group of people in any adoption cycle that is kind of left behind. There’s a great opportunity here for SREs to really jump on all the great innovation, get a lot of the great outcomes of AI and agents specifically in their workflows. What we believe is that the tooling has just been the missing piece, or maybe the piece that isn’t as approachable as it should be, to enable those teams, to enable those SREs, to enable those platform engineering teams to really feel confident about moving along with the rest of the shift and feeling like they can integrate it from day one and trust it from day one.”
The comparison to AI-assisted coding is deliberate. Just as developers moved from tab-complete to autonomous feature generation, SRE workflows will follow a similar arc.
Andre Elizondo: “It’s a very similar trend that we’ve seen with vibe coding or agentic engineering going through in the last couple of years, where everybody was used to hitting tab complete from every single line of code that was generated from GitHub Copilot in the early days. Now we’re at a point where there’s entire agent loops that are building features, building functions, building entire applications. Then the human can be really the last-mile quality gate, or the thing that is guiding its workflow, defining its workflow along the way. That’s where we get to a point in the next probably one to two years of where we have the SRE teams of today enabled, we have the SRE teams of tomorrow enabled. The tooling is there, the ecosystem is there, and they’re able to really come along for the ride and gain the benefits that other areas of the industry are gaining today. That’s why AURA is a very key piece of our strategy, because we really believe in open source being the mechanism for getting folks to trust it, adopt it, and build on it.”
Watch the full TFiR interview with Andre Elizondo here.





