AI Agent Production Failures Solved with Dapr Workflow Engine | Mark Fussell

AI prototypes are deceptively easy to build. But when enterprises try moving those agents into production, they hit a wall: failure recovery, state management, and reliability become deal-breakers. When your network drops mid-transaction or a machine fails during a critical workflow, what happens to your business logic? Most AI agent frameworks simply restart from scratch—potentially charging your customer twice, losing critical context, or corrupting state entirely.

This is the Day 2 operational challenge killing production AI adoption. And it’s exactly what Dapr Agents 1.0 was designed to solve.

The Guest: Mark Fussell, Co-creator and Core Maintainer at Dapr

Key Takeaways

Dapr Agents 1.0 is built on a durable workflow engine that provides automatic crash recovery and state checkpointing for production AI agents running on Kubernetes
The framework uses continuous log-based checkpointing to ensure workflows recover exactly where they left off—preventing duplicate payments, lost context, or corrupted business logic
Dapr is CNCF graduated, vendor-neutral, and runs on any Kubernetes cluster with flexible backing store options
Real-world adoption: Zeiss Vision Care uses Dapr Agents to orchestrate personalized manufacturing workflows for prescription glasses
The shift from microservices to agentic applications represents the next 10x wave in enterprise software—and workflow reliability is the new competitive advantage

***

In this exclusive interview with Swapnil Bhartiya at TFiR, Mark Fussell, Co-creator and Core Maintainer at Dapr, discusses how Dapr Agents 1.0 enables production-ready AI workflows, the role of durable execution in preventing state loss, and why the agentic era will be tenfold bigger than the microservices revolution.

What Is Dapr and Why Workflow Matters for Production AI

Dapr is a CNCF graduated project that provides a set of APIs for building distributed applications, with a focus on microservices architectures. While it offers communication and pub/sub messaging capabilities, its most critical feature for modern AI systems is the workflow engine—a durable execution layer that provides recoverable business state across network failures and machine crashes.

Q: For those who may not know, what is Dapr?

Mark Fussell: “Dapr is a graduated project inside the CNCF. It’s been part of the CNCF for nearly seven years now. The idea is that it provides a set of APIs for building distributed applications, typically focused on microservices architectures. It has APIs for communication and for pub/sub messaging. And I think the most important one that’s emerged as the most popular over the last couple of years is workflow as an engine.

If you think about workflow, you have business state that runs through—you do this, do this, do this—and a workflow engine provides you with a recoverable business state. This is sometimes referred to as durable execution, where you may execute a lot of business logic steps. If things fail—the network goes down, a machine fails, or a blade fails—you want exact recovery from where it left off. For example, if you’ve done a Stripe transaction and the payment fails, you don’t want to restart your workflow and process the payment again—it would be a disaster. So a workflow engine is a perfect fit for this.”

This durable execution model is what makes Dapr a strong fit for production AI agents, where long-running workflows must survive infrastructure failures without losing context or repeating operations that have already completed.
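To make the payment example concrete, here is a minimal sketch of the idea in plain Python. It does not use Dapr's API: `DurableRunner`, `charge_card`, `send_receipt`, and the JSON history file are hypothetical names invented for illustration. It only shows the principle Fussell describes, that a completed step is recorded so a restarted workflow reuses the result instead of running the step again.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


class DurableRunner:
    """Hypothetical helper: runs each named step once and records its result."""

    def __init__(self, history_path: Path) -> None:
        self.history_path = history_path
        # Load results of steps completed before a previous crash, if any.
        self.history = (
            json.loads(history_path.read_text()) if history_path.exists() else {}
        )

    def run_step(self, name, func, *args):
        if name in self.history:
            # Step already completed in an earlier run: reuse the recorded result.
            return self.history[name]
        result = func(*args)
        self.history[name] = result
        # Persist before moving on so a later crash cannot lose this result.
        self.history_path.write_text(json.dumps(self.history))
        return result


charges = []  # stands in for the payment provider's ledger


def charge_card(amount):
    charges.append(amount)
    return {"charged": amount}


def send_receipt(payment, fail):
    if fail:
        raise ConnectionError("network dropped while sending receipt")
    return f"receipt for {payment['charged']}"


def order_workflow(runner, fail_on_receipt):
    payment = runner.run_step("charge", charge_card, 50)
    return runner.run_step("receipt", send_receipt, payment, fail_on_receipt)


with TemporaryDirectory() as tmp:
    path = Path(tmp) / "history.json"
    try:
        # First run: the charge succeeds, then the receipt step fails.
        order_workflow(DurableRunner(path), fail_on_receipt=True)
    except ConnectionError:
        pass
    # A fresh runner simulates a restarted process reading the saved history.
    receipt = order_workflow(DurableRunner(path), fail_on_receipt=False)

print(receipt, charges)  # receipt for 50 [50]
```

The card is charged once even though the workflow function runs twice. This sketch does not cover a crash between the charge itself and the write that records it; closing that gap usually also requires the payment call to be idempotent.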

How Dapr Agents 1.0 Solves Day 2 Operational Challenges

Throughout 2025, enterprises experimented heavily with building agent applications. But as teams moved toward 2026, the question shifted: how do you actually run these agents in production with the same reliability guarantees you’d expect from mission-critical systems? Dapr Agents 1.0 was announced at KubeCon 2025 to answer that question by building an agent framework directly on top of Dapr’s battle-tested workflow engine.

Q: How does Dapr Agents 1.0 solve Day 2 operational challenges around failure, recovery, and state management?

Mark Fussell: “If I look back over 2025, there was a lot of experimentation with building agent applications. As we look towards 2026, people are saying, how do we actually take these into production and make them serious? What we did last year at KubeCon, we announced the first version of Dapr Agents 1.0. We really wanted to focus on how you build an agent framework that allows you to run these agents in a production environment. Going back to workflow, we built Dapr Agents on top of our durable workflow engine to provide this reliability. That’s the key point. Dapr Agents now, after one year, has been tried and tested through a lot of different production-like scenarios.”

Fussell highlighted a real-world example from the KubeCon keynote: Zeiss Vision Care built a production system where customers upload prescription information, and Dapr Agents orchestrates the entire personalized manufacturing and scheduling workflow, with the guarantee that if any part of the system fails, the workflow recovers from where it left off.

Mark Fussell: “There was a keynote yesterday where Zeiss talked about Zeiss Vision Care, where prescription glasses were being made. They built agents that allow you to upload your prescription information. This agent looks at that prescription information and kicks off a personalized way of creating your glasses for you and scheduling with you. It’s a classic workflow. Dapr Agents is built on workflow to provide this durability, this recoverability, such that when you put this into production environments, you can feel confident that if machines fail, it recovers.”

How the Durable Workflow Engine Prevents Data Loss

The technical foundation of Dapr’s reliability guarantees comes from its code-first workflow engine, which uses continuous log-based checkpointing to record every activity in a durable backing store. This architecture ensures that long-running AI tasks can recover from crashes or timeouts without losing data or context—a critical requirement for enterprise AI agents.

Q: How does the durable workflow engine ensure that long-running AI tasks recover from crashes or timeouts without losing data or context?

Mark Fussell: “Dapr’s workflow engine is built as a code-first workflow engine, so a developer writes a set of code with activities. Each one of those activities is basically like a log file—it’s written as a continuous log file appended to a backing data store of your choice. Another thing you get with Dapr’s flexibility is you can choose any one of those backing stores to write this durable log file to. Think of it as your workflow executes—you may do a notification activity or an email activity or a business process check—this continuous stream of log events is being written like fast, light checkpoints. It provides full recoverability. The agent framework, built on this, means that if the agent is calling a tool or calling onto some other system, underneath the covers this log file is being written such that the agent can recover. It loads up all its previous context. Think of it as loading up all the variable state, and it perfectly carries on where it executed last.”

This flexibility extends to state store selection—teams can choose from multiple backing stores depending on their infrastructure requirements, avoiding vendor lock-in while maintaining production-grade reliability.
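The combination of an append-only log and a swappable backing store can be sketched as follows. This is an illustration under stated assumptions, not Dapr's implementation: `LogStore`, `MemoryLogStore`, `FileLogStore`, `record`, and `rebuild_context` are hypothetical names, the event shape is invented, and the prescription values are made up.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Protocol


class LogStore(Protocol):
    """Hypothetical interface: append events and read them back in order."""

    def append(self, event: dict) -> None: ...
    def read_all(self) -> list[dict]: ...


class MemoryLogStore:
    """Keeps events in a list; useful for tests, lost on restart."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def append(self, event: dict) -> None:
        self._events.append(event)

    def read_all(self) -> list[dict]:
        return list(self._events)


class FileLogStore:
    """Appends one JSON object per line to a local file."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def append(self, event: dict) -> None:
        with self._path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def read_all(self) -> list[dict]:
        if not self._path.exists():
            return []
        with self._path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def record(store: LogStore, key: str, value) -> None:
    """Write one lightweight checkpoint to whichever store was chosen."""
    store.append({"key": key, "value": value})


def rebuild_context(store: LogStore) -> dict:
    """Replay the log in order to reconstruct the agent's variable state."""
    context: dict = {}
    for event in store.read_all():
        context[event["key"]] = event["value"]
    return context


with TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "agent.log"
    store = FileLogStore(log_path)
    record(store, "prescription", {"left": -1.5, "right": -2.0})
    record(store, "tool:check_inventory", "in stock")
    # Simulate a restart: a new store object reads the same file.
    recovered = rebuild_context(FileLogStore(log_path))

memory = MemoryLogStore()
record(memory, "step", 1)
record(memory, "step", 2)  # later events override earlier ones on replay

print(recovered)
print(rebuild_context(memory))  # {'step': 2}
```

Because `record` and `rebuild_context` depend only on the `LogStore` interface, the file-backed store could be replaced by a database-backed one without touching the workflow code, which is the kind of flexibility the paragraph above describes.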

Integrating AI Agents into Existing Cloud Native Infrastructure

One of the biggest concerns for platform engineering teams is whether adopting AI agents will create new operational burdens. Dapr Agents 1.0 addresses this by running natively on Kubernetes and integrating seamlessly with existing business processes—allowing teams to augment their current workflows with AI rather than replacing entire systems.

Q: How does Dapr, built on Kubernetes, help platform engineers integrate AI agents into existing cloud native infrastructure without adding operational burden?

Mark Fussell: “Dapr Agents is built first-class to run on a Kubernetes platform. What we see today is that a larger number of people have business processes already, and what they’re actually doing is looking at their business processes and augmenting them with language models. There are two directions here. We see people take a business process that they have, and they may just call onto a language model. In Dapr, you can just call this language conversation API that talks to a language model. The next step is you can take your existing application and just launch one of these Dapr Agents on the side. It will run on your behalf and take advantage of using language models. We see today a lot of existing business processes augmenting with language models first, and then consolidating those into an agent that takes responsibility. The integration for existing systems and agents is very straightforward.”

Vendor Neutrality and Open Source Commitment

Unlike many AI agent frameworks tied to specific hyperscalers or proprietary platforms, Dapr Agents 1.0 is fully open source and vendor-neutral—a critical distinction for enterprises concerned about long-term flexibility and avoiding lock-in.

Q: Can you talk about vendor lock-in? Open source focuses on vendor neutrality, but sometimes you can still get caught in lock-in.

Mark Fussell: “Dapr Agents is open source. It’s all part of the Dapr open source project. It’s fully maintained by the maintainers—Diagrid is one of them, but other companies also contribute to it. So you don’t have any vendor lock-in—it’s completely vendor-neutral. I think that’s one of the benefits. If you go to a lot of the other language frameworks, they’re very much either tied to one of the hyperscalers or vendor-locked. That’s a key point. But I would also say you can run this in any environment you have. You can run it on any Kubernetes cluster, in any environment of your choice. There’s no vendor lock-in in any way.”

From Microservices to Agentic Applications: The Next 10x Wave

Fussell framed the rise of AI agents as a fundamental shift in application development: the same kind of transition as the move from client-server architectures to microservices, but in his estimate roughly ten times larger. The key enabler is combining microservices patterns with large language models to create intelligent, autonomous workflows that augment human operators.

Q: Talk about the evolution of Dapr—from distributed application runtime to now powering production AI agents. What’s next?

Mark Fussell: “Language models are able to amazingly simulate pieces of work. Let me take a good example. We’re working with a logistics company, and they wanted to build an agent that took advantage of a warehouse manager. The warehouse manager had to sit there and look at lots of incoming emails, look at other data coming in, update his database, notify all his clients. Companies now are able to build these agents that have language models that can look at this incoming data and work side by side with these operators. A lot of people are realizing they can augment their business processes by building agentic capabilities. As I mentioned at the beginning, agentic applications are the new microservices plus LLMs. Dapr is ideally suited for all of this.”

Looking forward, Fussell emphasized the trend toward localized models and the need for frameworks that provide true resiliency and reliability—capabilities Dapr has been building for eight years.

Mark Fussell: “People want to be able to swap out the underlying models because they’ll have choices. We’re definitely seeing a trend towards having localized models where people want to run models in their local environment, where they don’t have to depend on cloud models. They’re also looking for an agent framework that does provide this resiliency and reliability, which is key. We look at Dapr as a framework that, although we started eight years ago, is incredibly suited for agentic applications. It’s just amazing how good it is with the conversation API, with the workflow API—that combination going into Dapr Agents 1.0, we’re very proud of the way the community has adopted it and come up with some amazing scenarios.”

Mark Fussell: “If you think that the client-server era was one, and then we went through the microservices era in cloud native, I think the agentic era is going to be tenfold bigger by far. Literally every business process in the world can be augmented with language models. You really want to wrap those in an intelligent agent. Dapr Agents with the 1.0 release, we believe, is the best open source framework there is.”
