Breaking Systems to Build Better Ones: How AI is Reshaping Chaos Engineering

The concept seems counterintuitive: intentionally break your systems to make them more reliable. Yet chaos engineering, which emerged from Netflix’s famous Chaos Monkey tool, has evolved from a novel experiment into a systematic discipline that’s now being enhanced by artificial intelligence (AI).

Kolton Andrus, CEO and Founder of Gremlin, the company that pioneered chaos engineering at scale, recently sat down to discuss the current state of the practice and how AI is transforming reliability engineering. His insights reveal both the maturation of chaos engineering and the practical limitations of AI in managing complex distributed systems.


The Systematic Evolution of Chaos Engineering

“I think we’ve seen chaos engineering evolve quite a bit over the last five years. And some of that is good news, and some of that was bad news,” Andrus explains. The challenge isn’t with the concept itself but with implementation. Many organizations treat chaos engineering as a one-off project rather than a systematic practice.

The companies seeing real success approach chaos engineering with the same rigor they apply to security. “It’s very rare that you go to the security team and they say, well, security is nice to have, you know, if you get to it, that would be great,” Andrus notes. Organizations that treat reliability with proper organizational buy-in, measurement, and accountability see significant improvements.

The key differentiator is moving beyond best-effort implementations to systematic approaches that track progress across teams, measure impact, and hold people accountable for results. This shift from ad-hoc experimentation to disciplined practice marks chaos engineering’s maturation.

AI’s Role and Limitations in Reliability

While AI dominates technical discussions across industries, Andrus maintains a pragmatic perspective on its role in system reliability. “If Skynet comes about tomorrow, it’s going to fail in three days. So I’m not worried about the AI apocalypse, because AI isn’t going to be able to build and maintain and run reliable systems.”

The fundamental challenge lies in the nature of distributed systems versus AI capabilities. “A lot of the LLMs and a lot of what we talk about in the AI world is really non-deterministic, and when we’re talking about distributed systems, we care about it working correctly every time, not just most of the time.”

However, Andrus sees valuable applications for AI in specific areas. AI excels at providing suggestions and guidance rather than making deterministic decisions. This insight shapes Gremlin’s approach to integrating AI into its platform.

Reliability Intelligence: AI-Powered Recommendations

Gremlin’s new Reliability Intelligence platform represents a thoughtful integration of AI into chaos engineering. Rather than replacing human judgment, the platform augments engineers’ capabilities by providing actionable recommendations based on test results.

“We want to tell you, you know, you ran the test, you know whether you passed or failed the test. Now we want to tell you how to go fix it,” Andrus explains. The system combines traditional machine learning with large language models, but with important constraints.

The approach prioritizes trustworthiness over automation. “We would rather give no advice than bad advice,” Andrus emphasizes. Instead of simply feeding data into an LLM, Gremlin built custom machine learning models based on a decade of customer data, with human engineers validating every recommendation.

Beyond Individual Tools: Organizational Change

Perhaps the most significant insight from Andrus concerns organizational culture around reliability. The technical tools exist, but success requires addressing accountability and measurement challenges.

“If it’s owned by everybody, it’s owned by nobody,” he observes. Organizations often struggle with fragmented ownership between SRE teams, developers, and platform teams. The solution involves creating clear accountability structures and metrics.

Andrus poses a provocative question: “We know how someone could cause a bad enough outage and get fired, but if somebody prevents enough outages, how do they get promoted?” Organizations excel at measuring failures but often lack mechanisms to recognize and reward proactive reliability work.

Gremlin addresses this through reliability scores that provide leadership visibility into system risk. This enables informed conversations about resource allocation and priorities rather than reactive firefighting.
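The article does not describe how Gremlin computes its reliability scores, so the following is only an illustrative sketch of the general idea: roll individual failure-test outcomes up into one number that leadership can track. The function name `reliability_score`, the test names, and the rule that an unrun test earns no credit are all assumptions made for this example.

```python
def reliability_score(results):
    """Return a 0-100 score from a mapping of test name to outcome.

    Hypothetical scoring rule (not Gremlin's): each test counts equally.
    True means the test passed, False means it failed, and None means it
    was never run. Unrun tests earn no credit, so untested risk lowers
    the score instead of hiding.
    """
    if not results:
        return 0.0
    passed = sum(1 for outcome in results.values() if outcome is True)
    return round(100.0 * passed / len(results), 1)


# Hypothetical outcomes for one service.
checkout_service = {
    "cpu-exhaustion": True,
    "memory-exhaustion": True,
    "dependency-latency": False,
    "zone-loss": None,  # never run
}

print(reliability_score(checkout_service))  # 2 of 4 passed -> 50.0
```

Counting unrun tests against the score is one possible design choice: it makes gaps in coverage visible in the same number that reports failures.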

Practical Integration and Future Directions

The platform integrates deeply with observability tools, analyzing traffic patterns, error rates, and response times to make intelligent decisions about test execution and recommendations. This data-driven approach enables nuanced analysis, such as distinguishing between different failure patterns and providing guidance appropriate to each.
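As a rough sketch of how observability signals might gate a test run, the snippet below checks a metrics snapshot against guardrails before deciding whether an experiment should proceed. This is not Gremlin's API: the `MetricsSnapshot` and `Guardrails` types, the `decide` function, and every threshold value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricsSnapshot:
    """A point-in-time reading from an observability tool (hypothetical shape)."""
    requests_per_second: float
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


@dataclass(frozen=True)
class Guardrails:
    """Illustrative thresholds; real values depend on the service."""
    min_requests_per_second: float = 10.0
    max_error_rate: float = 0.02
    max_p99_latency_ms: float = 800.0


def decide(snapshot, guardrails=Guardrails()):
    """Return 'skip', 'halt', or 'proceed' for a failure test."""
    # Too little traffic: the test result would say nothing useful.
    if snapshot.requests_per_second < guardrails.min_requests_per_second:
        return "skip"
    # The service is already unhealthy: do not add injected failure on top.
    if snapshot.error_rate > guardrails.max_error_rate:
        return "halt"
    if snapshot.p99_latency_ms > guardrails.max_p99_latency_ms:
        return "halt"
    return "proceed"


healthy = MetricsSnapshot(requests_per_second=250.0, error_rate=0.001, p99_latency_ms=120.0)
degraded = MetricsSnapshot(requests_per_second=250.0, error_rate=0.08, p99_latency_ms=120.0)

print(decide(healthy))   # proceed
print(decide(degraded))  # halt
```

The same check could be re-evaluated while a test is running, so that an experiment stops as soon as the service crosses a guardrail.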

Looking ahead, Andrus envisions chaos engineering becoming as routine as other testing practices. “Just like the unit test, the integration test, we want to run the failure test,” he explains. The goal is embedding reliability testing throughout the development lifecycle rather than treating it as a separate activity.
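To make the comparison with unit tests concrete, here is a minimal sketch of a failure test written with Python's standard `unittest` module. It injects a fault into a dependency and asserts that the caller degrades gracefully. The service, the `fetch_homepage` function, and the fallback items are all invented for this example; a real chaos experiment would inject the failure into running infrastructure instead of a stub.

```python
import unittest


class RecommendationsUnavailable(Exception):
    """Raised by the hypothetical recommendations dependency when it is down."""


def fetch_homepage(recommender):
    """Build a homepage; fall back to a static list if recommendations fail."""
    try:
        items = recommender()
        degraded = False
    except RecommendationsUnavailable:
        items = ["popular-1", "popular-2"]
        degraded = True
    return {"items": items, "degraded": degraded}


def failing_recommender():
    """Injected fault: simulate the dependency being unreachable."""
    raise RecommendationsUnavailable("injected failure")


class HomepageFailureTest(unittest.TestCase):
    def test_happy_path(self):
        page = fetch_homepage(lambda: ["for-you-1"])
        self.assertEqual(page, {"items": ["for-you-1"], "degraded": False})

    def test_dependency_failure_degrades_gracefully(self):
        # The failure test: the page must still render when the dependency is down.
        page = fetch_homepage(failing_recommender)
        self.assertTrue(page["degraded"])
        self.assertEqual(page["items"], ["popular-1", "popular-2"])
```

Saved in a test module, this runs with `python -m unittest` alongside ordinary unit and integration tests, which is the kind of routine placement the quote describes.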

The company has already expanded support for AI infrastructure with GPU-specific testing, recognizing the unique failure characteristics of machine learning workloads. Batch processing failures in AI systems create different challenges than always-on services, requiring specialized approaches.

Engineering Out the Chaos

Despite its name, chaos engineering represents the opposite of chaotic approaches to system reliability. “Chaos engineering is a bit of a misnomer. You know, a lot of people think, Oh, we’re going to go cause chaos and see what happens, and it’s the opposite. We want to engineer the chaos out of our systems.”

This systematic approach to understanding system behavior under stress provides the foundation for building more resilient infrastructure. As AI-generated code increases system complexity, the need for comprehensive reliability testing becomes even more critical.

The evolution of chaos engineering from a Netflix experiment to an AI-enhanced discipline reflects broader changes in how organizations approach system reliability. Success requires not just better tools but organizational commitment to treating reliability as a measurable, accountable practice rather than hoping systems remain stable.
