No-Code Failure Testing for Production: Gremlin’s Kolton Andrus on Failure Flags | TFiR

Software teams are deploying faster than ever. AI-assisted development, automated CI/CD pipelines, and microservices architectures have compressed release cycles from weeks to hours. But the same acceleration that increases feature velocity also multiplies the surface area for failure. Every new function, every third-party dependency, every serverless invocation is a potential failure point — and most of them have never been tested under real production conditions.

Traditional reliability testing focuses almost exclusively on infrastructure: what happens when a VM goes down, when a Kubernetes node fails, when a network partition occurs. But many of the failures that hurt modern applications happen at the application layer: a cloud dependency becomes unreachable, a downstream API responds with unexpected error codes, or latency spikes cascade through distributed systems. These failures can cause catastrophic outages, and they are largely invisible to infrastructure-focused chaos engineering tools.

The gap is made worse by organizational structure. In most enterprise environments, application developers are entirely separate from the platform and infrastructure teams that own chaos testing tooling. When a developer needs to test how their code handles an Amazon S3 outage or a slow identity service, they can face a months-long security review just to pull in a new library, let alone run a live failure experiment. The result is that application-layer reliability testing rarely happens, even at sophisticated engineering organizations.

Gremlin, the chaos engineering and reliability management platform, has built a direct answer to this problem: Failure Flags. Launched in 2026, Failure Flags gives application developers a no-code mechanism to inject failure conditions directly into their services — at the ingress or egress layer — without modifying source code, without waiting on platform teams, and without introducing new security dependencies. It is chaos engineering brought to the application layer, designed for the speed and organizational realities of modern software development.

For enterprise engineering leaders managing hundreds of services, dozens of teams, and the compounding risk of AI-accelerated deployment velocity, Failure Flags represents a meaningful shift in how reliability testing can be operationalized at scale.

The Guest: Kolton Andrus, CEO and Founder at Gremlin

Key Takeaways

  • Failure Flags enables no-code fault injection at the application layer via a drop-in proxy, eliminating the need to modify source code or pull in new libraries — critical for enterprise security compliance.
  • Two deployment models — SDK-based and proxy-based — give teams flexibility to instrument specific function calls or intercept all ingress/egress traffic depending on their architecture.
  • Circuit breaker tuning, fallback validation, and graceful degradation testing are primary use cases, allowing teams to verify resilience patterns before features go live to the full user population.
  • Built-in health checks, reliability scores, and automated test suites surface service-level resilience metrics to engineering leadership — identifying which services carry the most risk and which teams are maintaining reliability hygiene.
  • Gremlin’s fail-safe architecture ensures that any control plane failure or user error results in a fail-open state — traffic continues normally and no failure is injected — so Gremlin never becomes the source of an outage.

***

In this exclusive interview with Swapnil Bhartiya at TFiR, Kolton Andrus, CEO and Founder at Gremlin, discusses the application-layer reliability gap in modern software engineering, the architecture and use cases behind Gremlin’s newly launched Failure Flags product, and the broader challenge of maintaining reliable systems as AI-driven development velocity accelerates deployment frequency and compounds production risk.

The Application-Layer Reliability Gap

Most chaos engineering tooling targets infrastructure: virtual machines, Kubernetes clusters, network layers. But as development teams move into serverless environments and managed hosting, and as application developers become organizationally separated from platform teams, the failures that cause the largest outages are increasingly happening at the application layer — a layer that traditional chaos engineering has largely ignored.

Q: What gaps in reliability testing led Gremlin to look at this problem and build Failure Flags?

Kolton Andrus: “Traditional failure testing — everyone thinks about the infrastructure level. What happens when my VM or my host goes down, or there’s some problem in Kubernetes, or some problem at the network layer. But we see more and more people are building their applications in serverless environments, in managed hosted environments, or the application developers are really separate from the team that owns the platform and the infrastructure. But it’s those application failures that can be major issues in bringing down systems — major bugs or failures that can end up causing catastrophic outages. And so how do we enable those application developers to go do this testing on their own, without having to wait on the infrastructure or the platform team, and to be able to quickly — while they’re building code at an increased speed — answer these questions and mitigate these problems while they’re pushing their code live.”

How Failure Flags Work: SDK and Proxy Models

Gremlin built Failure Flags in two distinct deployment models to accommodate different team structures and security requirements. The SDK model provides granular, in-code instrumentation. The proxy model delivers a no-code, drop-in solution that operates entirely at the network layer — enabling failure injection without any changes to application source code.

Q: What exactly are Failure Flags and how do they work?

Kolton Andrus: “We have two variations of Failure Flags. We have an SDK version. If you want to just include our library, it gives you a nice place where you can wrap a function or a call or a dependency in a failure flag, similar to a feature flag. It lets you enable who sees it, what users or what requests get those failures, and allows you to inject those failures. So what happens if that dependency fails or slows down? What happens if it throws a certain error code or a certain failure mode? You can replicate those. And then the other one — this no-code solution that we built — is a proxy that you can drop in in front of your application regardless of where your application lives, and inject that failure into the request on its way in or on its way out. That can happen both on the ingress side where you’re receiving it — what happens if there’s a failure while we’re getting to the serverless application — or on the dependency side when you’re making an outbound call out to a dependency. I’m calling a cloud service, and that cloud service — what happens if it goes down? Well, let’s test that by injecting a failure, failing those requests, injecting an error code, and really understanding what happens to our system, our application, when that dependency that we rely upon fails.”
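
To make the SDK idea concrete, here is a minimal Python sketch of wrapping a dependency call in a flag that can add latency or throw an error for selected callers. It is not Gremlin's SDK: the `Experiment` and `FailureFlag` classes, the `invoke` method, the label names, and `fetch_profile` are all hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Experiment:
    """Hypothetical description of an active failure experiment."""
    flag_name: str
    match_labels: dict = field(default_factory=dict)  # labels a call must carry
    latency_seconds: float = 0.0                      # delay to add, if any
    raise_error: Optional[Exception] = None           # exception to throw, if any


class FailureFlag:
    """Illustrative stand-in for a failure flag; not Gremlin's SDK."""

    def __init__(self, name, labels, active_experiments):
        self.name = name
        self.labels = labels
        self.active_experiments = active_experiments

    def invoke(self):
        """Apply the first matching experiment; return True if one fired."""
        for exp in self.active_experiments:
            if exp.flag_name != self.name:
                continue
            # Every label named by the experiment must match this call.
            if any(self.labels.get(k) != v for k, v in exp.match_labels.items()):
                continue
            if exp.latency_seconds:
                time.sleep(exp.latency_seconds)
            if exp.raise_error is not None:
                raise exp.raise_error
            return True
        return False


def fetch_profile(user_id, active_experiments):
    # Wrap the dependency call, much as a feature flag wraps a code path.
    FailureFlag("profile-lookup", {"user": user_id}, active_experiments).invoke()
    return {"user": user_id, "plan": "standard"}  # placeholder for the real call
```

With no active experiments the wrapped call behaves normally. An experiment scoped to `{"user": "test-user"}` fails only that caller, which mirrors the "who sees it" control described in the interview.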

Failure Flags vs. Feature Flags: Complementary, Not Competing

The naming of Failure Flags is intentional — it draws on the familiarity engineers already have with feature flags while signaling a fundamentally different purpose. Where feature flags gate code paths and manage rollouts along the happy path, Failure Flags are designed specifically to explore and validate failure paths. The prior Gremlin product in this space was called Application Layer Fault Injection (ALFI), a technically accurate but unwieldy name that limited adoption.

Q: How are Failure Flags different from Feature Flags, and how do the two complement each other?

Kolton Andrus: “When we built a feature similar to this quite a few years back, part of the problem was the name — we called it Application Layer Fault Injection, and that was a mouthful. And the acronym wasn’t that useful. So part of what we did this time around is let’s make it easier for people to understand what we mean by this. Failure Flag is meant to imply feature flag because people are familiar with feature flags. The feature flag is a little different — feature flags are meant to help you deploy or roll out code. They might live in your code base for a longer period of time, but their purpose is really gating what people see and enabling different code paths. Well, often those are still happy paths and they’re not really thinking about the failure path. Similar concept — we’re going to wrap a bit of code in a flag — but in this case we’re going to have the control to say, hey, turn this on and have this fail for everyone, for a subset of people, for a certain time period. It keeps those same concepts that Chaos Engineering brought to the table: let’s scope it down, let’s have control over it, let’s make sure we’re injecting the failure in a safe and thoughtful way. But it also allows us to be very precise about where we’re testing those failures. It need not be at the host level or the container level. We can get down to an individual function call, an individual network call to a dependency, so we understand how it fails. Circuit breakers, I think, are the best example of something you want to tune here. We know circuit breakers are a strong pattern when we’re calling a dependency — we want to open the release valve if that dependency is failing, we don’t want to pound the dependency, we don’t want to wait for that dependency if we think it’s going to fail. And we want to build a fallback or some sort of graceful degradation if things are going wrong. So how do we test those? 
Well, the best way to test it is to have something almost inline right there that we can control — that failure, we can open it up, we can say this failure is occurring, we can tune that circuit breaker, and we can make sure that that fallback works correctly for the end user. That we get the right user experience further up the stack when that failure propagates back to the user experience level.”
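
The circuit breaker case can be sketched in a few lines of Python. This is a deliberately simplified, count-based breaker with no timed reset or half-open state, and the threshold, the `failing_dependency` function, and the `cached_fallback` function are assumptions made for illustration. The point is only to show what an injected failure lets a team observe: how many calls reach the dependency before the breaker opens, and whether the fallback is what the user receives.

```python
class CircuitBreaker:
    """Minimal count-based breaker; the threshold is illustrative."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def call(self, dependency, fallback):
        if self.is_open:
            # Open: skip the dependency entirely and serve the fallback.
            return fallback()
        try:
            result = dependency()
        except Exception:
            self.consecutive_failures += 1
            return fallback()
        self.consecutive_failures = 0
        return result


calls = {"count": 0}


def failing_dependency():
    """Stands in for a dependency whose calls are being failed by injection."""
    calls["count"] += 1
    raise ConnectionError("injected failure")


def cached_fallback():
    return "cached response"


breaker = CircuitBreaker(failure_threshold=3)
responses = [breaker.call(failing_dependency, cached_fallback) for _ in range(10)]
```

After ten requests under injected failure, the dependency has been called only three times and every caller received the fallback. Tuning the threshold then becomes a matter of rerunning the same experiment with a different value.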

The No-Code Proxy: Network as the Solution

A core design principle behind Failure Flags is removing the friction that prevents application teams from running failure tests in the first place. In large enterprise environments, pulling in a new library triggers security reviews that can stretch for months. Gremlin’s proxy approach sidesteps this entirely — the proxy is inspectable, self-contained, and operates at the network layer without touching application code.

Q: How do you actually manage to run these experiments without touching a single line of code?

Kolton Andrus: “While the network is the source of all our problems, the network can be the solution to a lot of our problems as well. By introducing this proxy layer, we can pass through all normal traffic as if nothing has occurred, but we can inject failure into certain requests, or request types, or patterns that we’re matching on the front side or the back side. This is — I don’t have to go pull a library into my application. Some teams, especially at big enterprise companies, there are a lot of security concerns about adding another dependency or pulling in additional libraries — that just increases friction. A team that wants to do this testing might have to now go through a months-long security review process to be able to consume this library, where a proxy that they can inspect, see all the code, and is very simple at its root — they can put it in place in the network path to be able to inject that failure. And this is where, if we need the infrastructure team or the cloud team or the Kubernetes team to be involved in order to go instrument and run those failure modes at the host or container level, in this place, once we have that proxy in place — which is a drop-in solution — we can go manage it through Gremlin’s UI. We can scope it, turn it on, turn it off, and we get all the usual fail-safe and security features that Gremlin provides.”
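
The pass-through behavior described here can be illustrated with a small Python sketch. It models the proxy as an object sitting in front of a `forward` callable and is not Gremlin's implementation: the `InjectingProxy` class, its `enable` and `disable` methods, and the `Request` and `Response` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    direction: str  # "ingress" or "egress"
    host: str
    path: str


@dataclass
class Response:
    status: int
    body: str


class InjectingProxy:
    """Illustrative in-path proxy; names and fields are hypothetical."""

    def __init__(self, forward):
        self.forward = forward  # callable that performs the real request
        self.rule = None        # no rule set means pure pass-through

    def enable(self, direction, host, status):
        self.rule = {"direction": direction, "host": host, "status": status}

    def disable(self):
        self.rule = None

    def handle(self, request):
        rule = self.rule
        if (rule and request.direction == rule["direction"]
                and request.host == rule["host"]):
            # Matching request: answer with the injected error instead.
            return Response(rule["status"], "injected failure")
        # Everything else is forwarded untouched.
        return self.forward(request)
```

The application code never changes in this model. Turning an experiment on or off only changes the rule the proxy holds, which is why the work can be driven from a UI rather than a code deployment.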

Failure Scenarios: Latency, Unreachability, Error Codes, and Pattern Matching

Failure Flags supports a broad spectrum of failure simulation scenarios — from blunt dependency unavailability to highly targeted, pattern-matched injection against specific users, URL paths, cloud regions, or service parameters. This precision was a defining capability in Gremlin founder Kolton Andrus’s prior work at Netflix, where targeted fault injection enabled engineers to reproduce and fix production incidents against isolated accounts without impacting the broader user population.

Q: What kinds of failure scenarios can teams actually simulate through Failure Flags?

Kolton Andrus: “There’s a wide variety. When we’re talking about dependencies, what happens if that dependency is just unreachable? A cloud provider’s having an outage — it’s not that they respond cleanly, they often just do not respond. So we can introduce a situation where the connection doesn’t connect on the other end. We can slow it down. What happens if that service is under load? If a service dependency you’re talking to is under load, the way it’s going to manifest back to you as the consumer is really just a delayed response, delayed processing time. By injecting latency into that connection, we’re really simulating what happens if that service is underwater, failing to respond or responding very slowly. At the SDK level we can be in the code, or at the proxy level we can simulate what happens if certain error conditions hit — what if we want to return a certain error code, a certain type of response body, a certain type of explicit failure message. We can override whatever response is being returned, mock it in essence, and return the exact failure mode that we’re looking to replicate or reproduce. And then lastly, it’s about that pattern matching — I want to match requests of a certain URL path, requests that have certain attributes to them, certain users, certain cloud regions, certain types of service parameters that we have with our environment. That’s what allows us to be very precise about taking a specific example and reproducing it. When I worked at Netflix, we built something similar, and this ended up being a very powerful tool for us because it allowed us to inject failure specifically to our user accounts so that we were the only ones that saw that failure mode, and none of the rest of the population got impacted by it. 
There are a few examples where we saw an error in production, we had a theory about why this error was occurring, we were able to go in and inject this failure in a very precise way against our accounts, reproduce the failure, find and fix the bit of code that was wrong, and then hotfix it. In one instance, we took an incident that had occurred 20 minutes ago, pinpointed the cause, had a hotfix, pushed it out, and then were able to test it on our own accounts before we pushed it live to everybody. That kind of precision testing — that ability to get in and be very precise about reproducing a failure in a production environment but for a limited scope of users — is a very powerful tool for engineers to have.”
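
The scenarios listed in this answer (unreachable dependency, added latency, overridden response, and attribute matching) can be sketched together in Python. The `Rule` fields, the effect names, and the request attributes such as `user` and `region` are assumptions for illustration and do not reflect Gremlin's configuration format.

```python
import time
from dataclasses import dataclass


@dataclass
class Rule:
    match: dict                 # attribute name -> required value
    path_prefix: str = ""       # "" matches every path
    effect: str = "error"       # "unreachable", "latency", or "error"
    delay_seconds: float = 0.0
    status: int = 503
    body: str = "injected failure"


def matches(rule, request):
    """True when the request path and every listed attribute match."""
    if not request["path"].startswith(rule.path_prefix):
        return False
    return all(request.get(k) == v for k, v in rule.match.items())


def apply_rule(rule, request, forward, sleep=time.sleep):
    if not matches(rule, request):
        return forward(request)
    if rule.effect == "unreachable":
        # Simulate a dependency that never answers.
        raise ConnectionError("injected: dependency unreachable")
    if rule.effect == "latency":
        # Simulate a dependency under load, then return the real response.
        sleep(rule.delay_seconds)
        return forward(request)
    # Otherwise override the response with the chosen status and body.
    return {"status": rule.status, "body": rule.body}
```

A rule such as `Rule(match={"user": "internal-1"}, path_prefix="/api/")` affects only that account, which is the kind of scoping the Netflix anecdote relies on: the engineers reproduce the failure for themselves while other users are untouched.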

Built-In Health Checks and Automated Pass/Fail Detection

Knowing whether a system survived a failure test is as important as running the test itself. Gremlin’s proxy architecture provides access to real-time telemetry — response rates, error rates, latency, throughput — for the duration of every experiment, enabling automated health evaluation and instant experiment termination if system health degrades unexpectedly.

Q: There is also a built-in health check component. How does that work and what role does it play?

Kolton Andrus: “One of the most important parts of running failure tests is knowing if it succeeded — if your system was resilient to that type of failure. One of the ways that we’re able to accomplish that is by looking at the overall health of the service or the function that’s under test. Often that looks in a very similar way: do responses flow, are we seeing errors coming from that function, is latency and throughput remaining stable? By processing this code through a proxy, we actually have access to all of that telemetry while we’re running the experiment. We can see that the application or the function remains stable and healthy throughout. Or if it’s not, we can call the experiment, clean it up, and let the engineer know — hey, you didn’t pass this test, it didn’t behave in the way you expected, or overall your system wasn’t as healthy as you wanted it to be. That’s just one of the ways that we can automate understanding how the system behaves so engineers can focus on running the test and understanding whether or not it passed, not needing to manually monitor whether it passed or failed — and then, if it didn’t pass, focusing on how to fix it as quickly as possible.”
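
A rough Python sketch of automated pass/fail evaluation follows. The interview names the signals (errors, latency, throughput) but not the thresholds or the evaluation logic, so the `Sample` fields, the default limits, and the stop-at-first-breach rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    error_rate: float           # fraction of requests that failed, 0.0 to 1.0
    p95_latency_ms: float
    requests_per_second: float


def evaluate(samples, baseline_rps, max_error_rate=0.05,
             max_p95_latency_ms=500.0, min_rps_fraction=0.8):
    """Return (passed, reason, samples_checked); stop at the first breach."""
    checked = 0
    for s in samples:
        checked += 1
        if s.error_rate > max_error_rate:
            return False, "error rate above threshold", checked
        if s.p95_latency_ms > max_p95_latency_ms:
            return False, "latency above threshold", checked
        if s.requests_per_second < baseline_rps * min_rps_fraction:
            return False, "throughput dropped below baseline", checked
    return True, "healthy throughout", checked
```

Returning as soon as a sample breaches a limit corresponds to halting and cleaning up the experiment early, and the `reason` string is what the engineer would be told about why the test did not pass.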

Reliability Scores, Test Suites, and Leadership Visibility

Failure Flags is integrated into Gremlin’s broader reliability management platform, which includes reliability scores, automated test suites, and scheduling capabilities. These features shift reliability from a reactive discipline into a continuously measured, proactively managed program — giving engineering leadership a multi-dimensional view of service health across the entire portfolio.

Q: How do you measure reliability improvement? What do reliability scores look like within the platform?

Kolton Andrus: “We built Failure Flags into the broader Gremlin platform, which has a reliability management component. We have within that reliability scores. We have test suites that are common failure modes that work out of the box to help people understand. We have ways to schedule those so they run on a regular basis. That really is what drives good hygiene around reliability. You can run these types of tests once in a while, but really your system is changing daily, hourly, by the minute in the most advanced systems. So you need to be running these tests on a regular basis to make sure you’re not drifting back into failure. The test suites give us a great mechanism for scheduling and running those tests. They give new engineers — or even senior engineers that are less familiar with the code — a clear place to begin. And we roll that all up into a set of reliability scores that measure the overall reliability of that service on multiple dimensions. That’s really important when it comes to the leadership aspect of running a reliability program. There are many teams, there are many pieces of software. How do you know which pieces of software need more attention? How do you know which teams are doing a great job? How do you recognize and reward people when they’re not having outages, as you want to incentivize the right behavior? That set of reliability scores really gives leadership an opportunity to see an overall view of the system and where risks lie within the system, but also to see where teams are doing a great job and where they’re keeping on top of their failures to produce what we want — which is really boring systems. We really want systems that behave as expected, that don’t fail and don’t surprise people on a regular basis. And counterintuitively, it’s a lot of work to have a boring system. 
You have to do all of the prep work and automation in advance to make sure things run smoothly whenever there’s a traffic spike or a cloud failure or a new bug that gets introduced.”

Real-World Workflow: Failure Flags Alongside Feature Flags

While Failure Flags require no dependency on feature flag tooling, the two capabilities are naturally complementary in a progressive delivery workflow. Teams can use Failure Flags to validate failure handling before enabling a feature flag for a broader user segment — testing graceful degradation, fallback logic, and latency resilience against internal or test users before expanding traffic.

Q: How would a team use Failure Flags alongside their existing feature flag workflow?

Kolton Andrus: “You don’t need to have feature flags to run Failure Flags. They’re similar in concept but there’s no dependency or requirement between them. But if you were launching a new feature and you wanted to failure-test it before it went live for the broader population base, you might turn on that feature flag for a percentage of your users. Well, before you do that, you’re probably going to want to go in and test that failure mode to understand — do you gracefully degrade? An example might be: I want to go test what happens when Amazon S3 has a failure. I have a portion of my code that goes out and calls and fetches assets from that. What I want to do before I turn that feature flag on is I want to go run the Failure Flag to understand — if Amazon isn’t available at this time, do I have a fallback? Can I return something from cache? Can I gracefully degrade? Can I not show a component? Can I calculate or do the piece of work I’m trying to do another way? So that when you go and turn on the feature flag and start putting users on that new functionality, it behaves the way you want. Similarly, if you want to make sure that you’re handling load properly, you might go test the latency of that service first. I’ve got a new internal service — maybe it’s an identity service, maybe it’s a recommendation service — and I’m going to go see what happens if it becomes twice as slow or four times as slow. I’m going to go run that experiment on some of my internal test users and understand: what does that behave like? Can we gracefully degrade? Can we alert? Do we know when that occurs? That way, when we’re turning on that new feature and dialing it up and putting traffic on it, we’ve already tested to see what happens if things start to degrade, so that we ensure that the customer experience remains consistent and of high quality throughout.”
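
The Amazon S3 example can be sketched as a small degradation ladder in Python. The `load_asset` function, the cache dictionary, and the returned `source` field are hypothetical; the sketch only shows the questions a failure experiment would answer, namely whether a cached copy is served and whether the component is hidden when nothing is cached.

```python
def load_asset(key, fetch_remote, cache):
    """Try the remote store, then the cache, then hide the component."""
    try:
        data = fetch_remote(key)
    except (ConnectionError, TimeoutError):
        if key in cache:
            return {"source": "cache", "data": cache[key]}
        # Nothing cached: the caller omits the component instead of erroring.
        return {"source": "none", "data": None}
    cache[key] = data  # keep a copy for the next outage
    return {"source": "remote", "data": data}


def storage_down(key):
    """Stands in for an object store whose requests are being failed."""
    raise ConnectionError("injected: object store unavailable")
```

Running the failing path against internal users before enabling the feature flag confirms that the page still renders from cache, or renders without the asset, before any customer traffic depends on it.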

Production Readiness, Performance Engineering, and Fail-Safe Architecture

A common concern with in-path proxy tooling is performance overhead. Gremlin invested heavily in performance testing and profiling Failure Flags before launch, resolving an edge-case performance degradation discovered in the weeks before release. The platform’s core design principle — fail-safe, fail-open behavior under any control plane failure — ensures that Gremlin itself never becomes the source of a production incident.

Q: How mature is this tool? Is this a pre-production tool, an operations tool, or somewhere in the middle?

Kolton Andrus: “We’ve worked hard to make it a production-worthy tool. If you’re only testing in your dev or your staging environment, you’re missing a whole class of failures that can occur only in production. Production has security groups, it has diversity of traffic, it has different load balancers, it has different traffic paths, and customers do different things than you expect. You can’t always anticipate everything a customer is going to do. We do an extreme amount of performance testing — of our own failure testing — on these types of products before we launch them. We recently found an issue a couple of weeks back before we launched where we saw a performance degradation in an edge case. It was all hands on deck for my team for the next week to go really understand and profile and fix every performance concern that we could find. The type of customers that we have are Fortune 100 Enterprise customers that are putting a large amount of production traffic through these Failure Flags. We need them to be performant when they’re not injecting a failure — they need to basically be non-existent when things are running through the system in a normal capacity. And when we are injecting the failure, we want to make sure we’re only injecting the overhead that relates to the failure. We don’t want to bring along any extra baggage. And of course, everything we build — one of our first principles is always about fail-safes. If there’s ever any failure in the control plane, if there’s any user error or other failure somewhere along the way, we want to fail open — fail in a way that allows the traffic to continue to work, where the failure doesn’t get injected, so that we’re not causing an outage. We never want to be the source of an outage. We want to be the source of very precise testing that teaches us about the system to prevent outages.”
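
The fail-open principle can be shown with a short Python sketch. It assumes a proxy that fetches its rules from a control plane through a hypothetical `fetch_from_control_plane` callable, and the rule format is invented for illustration. Any error or malformed result yields an empty rule set, so traffic is forwarded normally.

```python
def load_rules(fetch_from_control_plane):
    """Return active rules, or none at all if anything goes wrong."""
    try:
        rules = fetch_from_control_plane()
        if not isinstance(rules, list) or not all(isinstance(r, dict) for r in rules):
            raise ValueError("malformed rule set")
        return rules
    except Exception:
        # Fail open: with no rules, every request passes through untouched.
        return []


def handle(request, forward, fetch_from_control_plane):
    for rule in load_rules(fetch_from_control_plane):
        if rule.get("path") == request["path"]:
            # A valid, matching rule is the only way a failure is injected.
            return {"status": rule.get("status", 503)}
    return forward(request)
```

The design choice is that injection requires a positive, well-formed instruction. An unreachable control plane or a bad configuration degrades to ordinary proxying rather than to an unintended outage.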

Getting Started: Self-Serve Trial, Technical Sales, and Onboarding

Gremlin offers a self-serve trial that allows teams to explore Failure Flags without engaging a sales team. For enterprise teams that want guidance on experiment design, test construction, and result interpretation, Gremlin’s technical sales team — composed of engineers with on-call operations experience — provides hands-on advisory support.

Q: How does getting started look? Is there heavy setup involved, or is it self-serve?

Kolton Andrus: “We have a trial within Gremlin so people can go out and self-serve these capabilities on their own. Oftentimes they’ll consult with our team. Our technical sales team is quite adept — they’ve all been on call, they’ve all operated systems. They’re there to help advise people on how to construct the right experiments, how to build the right tests, and how to interpret the results, which ends up being a big value add for a lot of our customers. But the beauty of the no-code solution is at its heart, you drop it in, you go into our UI, you select the failure, and you run it. There isn’t a lot of setup, there isn’t a lot of overhead that needs to be done. What works by default out of the box is very lightweight and easy to set up. But then there’s a lot of depth — if you want to recreate something very specific or bespoke to your environment, you have that flexibility to go in and customize and configure it the way you want.”

The Road Ahead: Reliability in an AI-Accelerated Development World

Kolton Andrus views the reliability problem as a growing crisis, not a solved one. AI-assisted development is enabling 10x increases in code output and deployment frequency — but without corresponding investment in automated reliability testing, that velocity multiplies failure risk at the same rate. The future of Failure Flags lies in deeper CI/CD and AI pipeline integration, making safety nets a standard part of fast-moving development workflows rather than an afterthought.

Q: Where do Failure Flags, reliability, and SRE go from here?

Kolton Andrus: “As I’ve looked over the last six to nine months, the problem of reliability does not seem to be going away. We seem to be having more and more failures. My personal opinion is there’s a lot of great work going on that’s allowing us to increase velocity, 10x the number of code that’s being written, 10x the number of deployments and our velocity overall. But with that comes 10x the opportunity for bugs, risks, and failures to occur. In these environments where things are changing often, I think there’s more risk because of the unintended side effects and the knock-on effects that happen within distributed systems. So I think that engineers need more tooling to help them answer these questions, and more of this tooling needs to fit into their automated pipelines — whether that’s AI-based or traditional CI/CD. When you’re moving quickly, you need safety nets, you need guardrails in place that help you test these edge cases quickly and efficiently so that you can have that be a feedback loop into your development cycle — to make sure you’re accounting for and anticipating these failure modes. Short answer is I think there’s more opportunity for engineers to be able to do this. The key is there are many things happening and many things competing for attention. So as always, one of our other core design principles is: make it easy to do the right thing. How do we make it simple and straightforward for people to go answer these questions so that they’re able to move quickly in these modern environments, but do it in a way that mitigates risks and gives them a safety net?”
