
How Gremlin’s Kolton Andrus Sees the Future of Chaos Engineering in the Age of AI


Guest: Kolton Andrus (LinkedIn)
Company: Gremlin
Show: KubeStruck
Topic: Cloud Computing

AI is accelerating the speed at which software gets built, deployed, and changed. But as more engineers rely on AI-generated code, the surface area for failure is expanding faster than most teams can track. At KubeCon, Kolton Andrus, CEO of Gremlin, sat down with us to talk about why reliability work is more important than ever and how chaos engineering is evolving to keep up with the new era of distributed, AI-driven systems.

For almost a decade, Andrus has been one of the most recognizable voices in the reliability and chaos engineering space. From his early days working on large-scale outage response at Amazon to building Gremlin into a platform dedicated to helping teams uncover failure before it reaches customers, Andrus has seen the evolution of reliability practices firsthand. At KubeCon, he shared a mix of realism, skepticism, and optimism about how the discipline is adapting to AI and the rapid growth of cloud-native environments.

One of the biggest updates from Gremlin this year is its new partnership and integration with Dynatrace. According to Andrus, the goal was simple: reduce the friction teams experience before they can even begin meaningful chaos and reliability testing. Traditionally, engineers had to model their services manually inside Gremlin, configure monitoring, ensure the right alerts were in place, and verify that performance metrics existed at both staging and production levels. Many teams did not have these basics set up, especially in non-production environments.

The integration with Dynatrace changes that. Gremlin can now pull applications directly from Dynatrace with one click, automatically wire up monitoring, and surface the alerts needed to evaluate a test. Teams can skip the long preamble and get straight to validating how their systems behave under stress. It’s a quality-of-life improvement that addresses one of the industry’s biggest hurdles: limited time and unclear operational baselines.

Stepping back from the announcement, Andrus reflected on the dramatic rise of AI in software delivery. Everywhere across KubeCon, conversations revolved around AI for SRE, AI for operations, or “AI-first” infrastructure. But Andrus is not convinced AI will replace SRE expertise anytime soon. He believes the hype cycle is outpacing what real-world reliability problems demand.

He points to two challenges. First, great SREs have spent years internalizing how complex systems behave. That intuition — built from debugging outages, tracing anomalies, and understanding dependencies — won’t be replicated quickly by AI. Second, high-stakes decisions made during a production outage require judgment and responsibility. A system outage doesn’t give an AI model minutes to think. Customers are already impacted, internal teams are on alert, and any wrong action can worsen the situation. Trusting AI with decisions of that magnitude remains a long way off.

That doesn’t mean AI has no place in reliability. In fact, Gremlin has been integrating machine learning into what it calls Reliability Intelligence — a layer that helps customers interpret test results and understand whether their systems behaved as expected. If something fails, the platform guides teams with actionable advice rather than just surfacing raw data. The goal, as Andrus puts it, is to make it as easy as possible for engineers to do the right thing: run tests, understand system behavior, and improve reliability quickly.

The increasing use of AI-generated code makes this even more important. As Andrus notes, teams are producing more code, faster, with less manual validation. That velocity comes with risk. If developers never explicitly tell an AI model to account for reliability concerns — redundancy, timeouts, dependency failure, scalability — the model will not magically solve them. Teams must test these assumptions deliberately, and chaos engineering remains the most direct way to expose unknowns.
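The idea of deliberately testing such assumptions can be illustrated with a small sketch. This is not Gremlin's API; it is a generic, hypothetical Python example (the names `fetch_recommendations`, `slow_dependency`, and the latency budget are invented for illustration) showing how injecting a latency fault into a dependency reveals whether a caller actually degrades gracefully:

```python
import time

def fetch_recommendations(call_dependency, timeout_s=0.2):
    """Call a downstream service; fall back to a cached default on failure.

    Note: this checks the latency budget after the call returns, which is a
    simplification -- real services enforce timeouts on the call itself.
    """
    start = time.monotonic()
    try:
        result = call_dependency()
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("dependency exceeded latency budget")
        return result
    except Exception:
        # Degraded but available: serve a cached default instead of erroring.
        return ["fallback-item"]

def slow_dependency():
    # Fault injection: simulate a latency spike in the dependency.
    time.sleep(0.5)
    return ["personalized-item"]

def healthy_dependency():
    return ["personalized-item"]
```

Running `fetch_recommendations(slow_dependency)` exercises the failure path and confirms the fallback fires, while `fetch_recommendations(healthy_dependency)` confirms the happy path is untouched. If an AI model generated the caller without a timeout or fallback, an experiment like this would expose the gap before an outage does.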

Another major theme Andrus touched on is the organizational evolution of chaos engineering itself. In the early days, many teams treated chaos engineering as a side project — something interesting to try, often championed by a single engineer or innovation-minded team. But as Gremlin has worked with more mature organizations, the pattern has changed. The most successful companies treat reliability like security: systematic, incentivized, and enforced. Leadership defines reliability standards that every team must meet, and chaos experiments become part of the development lifecycle rather than an optional exercise.

It’s a shift that mirrors the broader “production readiness” culture emerging across cloud-native organizations. With distributed systems becoming more fragile and AI adding unknown variables, reliability can no longer live in the margins. Testing must move upstream, and chaos engineering must mature into a structured practice.

Looking ahead, Andrus sees chaos engineering and reliability engineering becoming even more essential — not because AI will take over operations, but because AI is rapidly transforming how software is built. Whether code is written by humans, machines, or both, the fundamental risks of distributed systems haven’t changed. Dependencies will fail. Latency will spike. Services will misbehave under load. The only question is whether teams will find these weaknesses in advance — or during an outage.
