How to Test Application Failures Without Touching Source Code | Kolton Andrus, Gremlin | TFiR

Application teams are shipping code faster than ever, but most failure testing still happens at the infrastructure layer: VMs, Kubernetes nodes, and network paths. That leaves application-level failure modes (dependency outages, error code handling, and circuit breaker behavior) untested until they hit production. In serverless and managed environments, where developers have no access to the underlying infrastructure, the gap is even wider and the blast radius of an untested failure is larger.

In this interview on TFiR, Kolton Andrus, CEO and Founder at Gremlin, breaks down how Failure Flags work, why the no-code proxy architecture removes the biggest adoption blockers at enterprise scale, and how teams can build automated reliability hygiene into their development cycles.

Guest: Kolton Andrus, CEO and Founder at Gremlin
Show: TFiR

Here is what every SRE, platform engineer, and application developer needs to know.

Technical Deep Dive

Q: Why does traditional chaos engineering miss application-layer failures?

Kolton Andrus, CEO and Founder at Gremlin, explains that traditional failure testing focuses on infrastructure: VMs going down, Kubernetes issues, or network-layer problems. As more teams build in serverless and managed environments, and as application developers become separated from the platform teams that own infrastructure, the application layer itself becomes the primary source of catastrophic failures. Testing at the host or container level does not reach those failure modes, and waiting on platform teams to instrument tests slows down developers who are already shipping at high velocity.

“It’s those application failures that can be major issues in bringing down systems or major bugs or failures that can end up causing catastrophic outages.” — Kolton Andrus, CEO and Founder, Gremlin

Q: What are Failure Flags and how do they work technically?

Gremlin offers two implementations of Failure Flags. The SDK version lets developers wrap a function call or dependency in a flag, similar in concept to a feature flag, and then control which users or requests receive injected failures and what type of failure is injected. The no-code version is a proxy that drops in front of an application regardless of where it runs and injects failures into requests on ingress or on outbound dependency calls. Both implementations let teams test what happens when a dependency is unreachable, when latency spikes, or when a specific error code is returned; only the proxy does so without modifying application source code.

“By introducing this proxy layer, we can pass through all normal traffic as if nothing has occurred, but we can inject failure into certain requests or request types or patterns that we’re matching on the front side or the back side.” — Kolton Andrus, CEO and Founder, Gremlin
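
To make the SDK-style wrapping concrete, here is a minimal Python sketch of the idea. It is not Gremlin's SDK: `failure_flag`, `Experiment`, `InjectedFailure`, and `ACTIVE_EXPERIMENTS` are hypothetical names invented for illustration, and a real implementation would fetch experiments from a control plane rather than from a module-level list.

```python
import time
from dataclasses import dataclass


@dataclass
class Experiment:
    """A hypothetical active experiment: which flag, which calls, what fault."""
    flag_name: str
    match_labels: dict         # every key/value must match the call's labels
    latency_ms: int = 0        # delay to add before the real call
    raise_error: bool = False  # raise instead of calling the dependency


class InjectedFailure(Exception):
    """Raised in place of the real call when an experiment asks for an error."""


# Stand-in for experiments fetched from a control plane (assumption).
ACTIVE_EXPERIMENTS = []


def failure_flag(name: str, labels: dict) -> None:
    """Apply any matching experiment; do nothing when none matches."""
    for exp in ACTIVE_EXPERIMENTS:
        if exp.flag_name != name:
            continue
        if any(labels.get(k) != v for k, v in exp.match_labels.items()):
            continue
        if exp.latency_ms:
            time.sleep(exp.latency_ms / 1000)
        if exp.raise_error:
            raise InjectedFailure(f"injected by flag {name!r}")


def fetch_profile(user_id: str) -> dict:
    # The flag sits in front of the dependency call, much as a feature flag
    # sits in front of a code path.
    failure_flag("profile-service", {"user_id": user_id})
    return {"user_id": user_id, "name": "example"}  # placeholder for the real call
```

With no matching experiment the wrapped call behaves normally; appending `Experiment("profile-service", {"user_id": "test-user"}, raise_error=True)` makes only that user's calls fail.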

Q: How are Failure Flags different from Feature Flags?

Feature flags control code rollout and gate which users see which functionality. They are designed for happy paths and do not test failure behavior. Failure Flags borrow the same mental model, wrapping a piece of code or a call in a controllable gate, but the purpose is to inject failure rather than route traffic. Andrus notes that Gremlin previously built a similar capability called Application Layer Fault Injection, but the name created friction. Failure Flags use the familiar feature flag framing to make the concept immediately graspable while serving a fundamentally different purpose: scoped, controlled, precise failure injection with the safety principles chaos engineering established.

“Feature flags are meant to help you deploy or roll out code. Often those are still happy paths and they’re not really thinking about the failure path.” — Kolton Andrus, CEO and Founder, Gremlin

Q: How does the no-code proxy approach work and why does it matter for enterprise adoption?

The no-code proxy sits in the network path and passes through normal traffic without modification, injecting failures only into matched request patterns. For enterprise teams, adding a new library dependency can trigger a months-long security review. A proxy that teams can inspect directly and that is simple at its core avoids that blocker. Once the proxy is in place, engineers manage everything through the Gremlin UI: selecting failure types, scoping targets, and toggling experiments on and off, without any further involvement from infrastructure or platform teams.

“A proxy that they can inspect and see all the code and is very simple at its root, they can put in place in the network path to be able to inject that failure.” — Kolton Andrus, CEO and Founder, Gremlin
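
The pass-through-unless-matched behavior can be sketched in a few lines of Python. This is an illustration of the general pattern, not Gremlin's proxy: `Request`, `Response`, `Rule`, and `proxy` are hypothetical names, and matching on a path prefix plus header values is an assumed rule format.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Request:
    method: str
    path: str
    headers: dict


@dataclass(frozen=True)
class Response:
    status: int
    body: str


@dataclass(frozen=True)
class Rule:
    """Hypothetical match rule: path prefix plus required header values."""
    path_prefix: str
    headers: dict
    inject: Response


def proxy(request: Request, upstream: Callable[[Request], Response],
          rule: Optional[Rule]) -> Response:
    """Forward unmatched traffic untouched; answer matched traffic with the fault."""
    if rule is not None and request.path.startswith(rule.path_prefix) and all(
        request.headers.get(k) == v for k, v in rule.headers.items()
    ):
        return rule.inject
    return upstream(request)
```

The application on either side of the proxy is unchanged; only the rule decides which requests see a failure.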

Q: What failure scenarios can teams simulate with Failure Flags?

Teams can simulate dependency unreachability, where a cloud provider does not respond at all rather than returning a clean error. They can inject latency to simulate a downstream service under load, since a slow dependency manifests as delayed responses to the consumer. They can override responses to return specific error codes, custom response bodies, or explicit failure messages, effectively mocking failure modes in production. Pattern matching lets teams scope injection to specific URL paths, user accounts, cloud regions, or service parameters, making it possible to reproduce a failure for a limited set of users while leaving the rest of the population unaffected.

“When I worked at Netflix, we built something similar, and this ended up being a very powerful tool for us because it allowed us to inject failure specifically to our user accounts so that we were the only ones that saw that failure mode.” — Kolton Andrus, CEO and Founder, Gremlin
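
The three scenario types (no response, added latency, overridden response) can be modeled roughly as follows. This is a sketch under assumptions: `Fault` and `call_with_fault` are hypothetical names, the dependency is reduced to a function returning a `(status, body)` tuple, and unreachability is approximated by raising `ConnectionError` rather than by actually dropping packets.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Fault:
    """Hypothetical fault description covering the three scenarios."""
    unreachable: bool = False              # dependency never answers
    latency_s: float = 0.0                 # extra delay before the response
    override_status: Optional[int] = None  # replace the response outright
    override_body: str = ""


def call_with_fault(call: Callable[[], tuple], fault: Fault,
                    sleep: Callable[[float], None] = time.sleep) -> tuple:
    """Run `call`, which returns (status, body), under the given fault."""
    if fault.unreachable:
        # A provider outage often looks like silence, not a clean error.
        raise ConnectionError("injected: dependency unreachable")
    if fault.latency_s > 0:
        sleep(fault.latency_s)
    if fault.override_status is not None:
        return (fault.override_status, fault.override_body)
    return call()
```

The `sleep` parameter exists only so the delay can be replaced with a recorder in tests.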

Q: Why is circuit breaker tuning a strong use case for Failure Flags?

Circuit breakers are the right pattern for protecting a service from a failing dependency, but they are difficult to validate without actually failing the dependency. Andrus identifies circuit breaker testing as one of the best use cases for Failure Flags because the proxy can sit inline exactly where the dependency call occurs and trigger a controlled failure. Teams can tune the breaker threshold, verify that the fallback behavior triggers correctly, confirm that the end user sees the right degraded experience, and fix issues before that failure ever reaches production traffic at scale.

“The best way to test a circuit breaker is to have something almost inline right there that we can control that failure, open it up, tune that circuit breaker, and make sure that fallback works correctly for the end user.” — Kolton Andrus, CEO and Founder, Gremlin
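
As an illustration of what tuning against a controlled failure means, here is a deliberately small circuit breaker in Python. It is a generic sketch, not code from Gremlin or from any particular resilience library; `CircuitBreaker`, `threshold`, and `reset_after_s` are assumed names, and the injectable `clock` is there only to make the behavior testable without waiting.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int, reset_after_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, dependency: Callable[[], str],
             fallback: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = dependency()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            return fallback()
        self.failures = 0
        return result
```

Pointing `dependency` at a call that is being failed on purpose lets a team check two things: that the breaker stops calling the dependency after `threshold` failures, and that the fallback value is what the end user should actually see.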

Q: How does the built-in health check work during a Failure Flags experiment?

Because all traffic passes through the proxy, Gremlin has access to telemetry for the function or service under test throughout the experiment: response flow, error rates, latency, and throughput. The platform uses this data to determine whether the system remained stable and healthy during the test. If health degrades beyond acceptable bounds, the platform can automatically halt the experiment, clean up the injected failure, and notify the engineer that the test did not pass. This automation means engineers can focus on interpreting results and remediating problems rather than manually monitoring system state while a test runs.

“If it didn’t pass, we want engineers focusing on how they go fix it as quickly as possible, not on whether it passed or failed.” — Kolton Andrus, CEO and Founder, Gremlin
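
The halt-on-degradation logic can be sketched as a loop over telemetry windows. The data shape and thresholds below are assumptions made for illustration (`Sample`, `run_experiment`, `max_error_rate`, and `max_p95_ms` are hypothetical); the interview does not describe how Gremlin defines or evaluates health internally.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One telemetry window observed at the proxy (hypothetical shape)."""
    requests: int
    errors: int
    p95_latency_ms: float


def run_experiment(samples, max_error_rate: float, max_p95_ms: float) -> dict:
    """Consume telemetry windows; halt at the first one that breaches a bound."""
    for i, s in enumerate(samples):
        error_rate = s.errors / s.requests if s.requests else 0.0
        if error_rate > max_error_rate or s.p95_latency_ms > max_p95_ms:
            # A real platform would also remove the injected fault here.
            return {"passed": False, "halted_at_window": i}
    return {"passed": True, "halted_at_window": None}
```

The engineer receives a pass/fail result plus the point at which the experiment stopped, instead of watching dashboards while the test runs.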

Q: How do reliability scores and test suites fit into the Failure Flags platform?

Failure Flags are built into Gremlin’s broader reliability management platform, which includes pre-built test suites covering common failure modes, scheduling capabilities for recurring runs, and multi-dimensional reliability scores per service. Because systems change continuously, one-time testing is insufficient. Scheduled test suites prevent reliability drift by confirming on a regular cadence that known failure modes are still handled correctly. Reliability scores give engineering leadership a consolidated view of where risks exist across many services and teams, and allow organizations to recognize teams that are maintaining consistently stable systems.

“It’s a lot of work to have a boring system. You have to do all the prep work and automation in advance to make sure things run smoothly whenever there’s a traffic spike or a cloud failure or a new bug that gets introduced.” — Kolton Andrus, CEO and Founder, Gremlin

Q: How do Failure Flags integrate into an existing feature flag deployment workflow?

Failure Flags have no dependency on feature flags and work independently. However, Andrus describes a natural workflow where a team preparing to roll out a new feature via feature flag first runs a Failure Flag experiment against that code path to confirm graceful degradation before traffic scales up. For example, before enabling a feature that fetches assets from Amazon S3, a team would run a Failure Flag to verify that the application falls back to cache or degrades gracefully when S3 is unavailable. The same logic applies to latency testing on new internal services: validate degradation behavior on a small set of test users before dialing up the feature flag percentage.

“Before you turn that feature flag on, you want to go run the failure flag to understand: if Amazon isn’t available, do I have a fallback? Can I return something from cache? Can I gracefully degrade?” — Kolton Andrus, CEO and Founder, Gremlin
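
The fallback questions in the S3 example translate into a small piece of application logic that a failure experiment would exercise. The Python below is a sketch with hypothetical names (`load_asset`, `unavailable_store`); the object store is abstracted as a plain function so that no real AWS API is implied, and the injected outage is simulated by a function that raises.

```python
from typing import Callable, Optional


def load_asset(key: str, fetch_from_store: Callable[[str], bytes],
               cache: dict) -> Optional[bytes]:
    """Fetch an asset; fall back to the cache, then to None so the caller
    can hide the component instead of failing the whole page."""
    try:
        data = fetch_from_store(key)
    except (ConnectionError, TimeoutError):
        return cache.get(key)  # stale copy if present, otherwise None
    cache[key] = data
    return data


def unavailable_store(key: str) -> bytes:
    """Stands in for an injected outage of the object store."""
    raise ConnectionError("injected: object store unavailable")
```

Running the new code path against `unavailable_store` before raising the feature flag percentage shows whether users get the cached asset, a hidden component, or an unhandled error.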

Q: Are Failure Flags production-ready or a pre-production testing tool?

Andrus is explicit that Failure Flags are built for production. Testing only in dev or staging environments misses an entire class of failures: production has different security groups, load balancer configurations, traffic diversity, and customer behavior that cannot be fully replicated in lower environments. Gremlin ran extensive performance testing and failure testing on Failure Flags before launch, including resolving a performance degradation edge case discovered weeks before release. The proxy is designed to be effectively invisible under normal traffic and to add overhead only when actively injecting a failure. Fail-safe design ensures that any control plane failure causes the proxy to fail open, allowing traffic to pass through normally without injecting unintended failures.

“If you’re only testing in your dev or staging environment, you’re missing a whole class of failures that can only occur in production.” — Kolton Andrus, CEO and Founder, Gremlin
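
Failing open is simple to state in code: any problem obtaining or interpreting the experiment configuration results in normal pass-through. The sketch below assumes a hypothetical `fetch_rule` callable standing in for the control plane and a rule represented as a dict with a `status` key; neither reflects Gremlin's actual protocol.

```python
from typing import Callable, Optional


def decide_fault(fetch_rule: Callable[[], Optional[dict]]) -> Optional[dict]:
    """Return the fault to inject, or None to pass traffic through untouched."""
    try:
        rule = fetch_rule()
    except Exception:
        return None  # control plane unreachable or misbehaving: fail open
    if not isinstance(rule, dict) or "status" not in rule:
        return None  # missing or malformed rule: also fail open
    return rule


def handle(request_path: str, upstream: Callable[[str], int],
           fetch_rule: Callable[[], Optional[dict]]) -> int:
    """Serve the upstream status unless a valid fault rule says otherwise."""
    fault = decide_fault(fetch_rule)
    if fault is None:
        return upstream(request_path)
    return fault["status"]
```

The design choice is that the only path to an injected failure is a well-formed, successfully retrieved rule; every other outcome leaves production traffic alone.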

Q: How do teams get started with Failure Flags and what does onboarding look like?

Gremlin offers a self-serve trial that lets teams access Failure Flags without a sales engagement. The no-code proxy is designed to be a drop-in: deploy it, open the Gremlin UI, select a failure type, and run the experiment. Gremlin’s technical sales team, who have on-call and production operations backgrounds, are available to help teams design experiments, construct the right test cases, and interpret results. For teams that need bespoke configuration to reproduce environment-specific failure modes, the platform supports deep customization beyond the out-of-the-box defaults.

“The beauty of the no-code solution is you drop it in, go into our UI, select the failure, and run it. There isn’t a lot of setup, there isn’t a lot of overhead.” — Kolton Andrus, CEO and Founder, Gremlin

Q: Where is Gremlin taking Failure Flags and reliability tooling from here?

Andrus sees the reliability problem accelerating rather than stabilizing. The same tooling and AI assistance that multiplies developer velocity also multiplies the number of deployments, dependencies, and potential failure scenarios in production. Distributed systems compound this through unintended side effects and knock-on failures. The direction for Failure Flags is deeper integration into automated pipelines, both AI-assisted and traditional CI/CD, so that failure testing becomes a continuous feedback loop in the development cycle rather than a periodic manual exercise. The core design constraint remains the same: make it easy to do the right thing so engineers can move fast without accumulating hidden reliability debt.

“When you’re moving quickly, you need safety nets, you need guardrails in place that help you test these edge cases in a quick and efficient manner so that you can have that be a feedback loop into your development cycle.” — Kolton Andrus, CEO and Founder, Gremlin

Resources & Documentation

  • Gremlin: chaos engineering and reliability management platform with Failure Flags, reliability scores, and test suites
  • Gremlin Failure Flags: no-code and SDK-based application-layer fault injection

***

Full Raw Transcript

Swapnil Bhartiya: Like everyone else, you’re also shipping new features, but all of that new code that you’re releasing is also new risk. Every function, every dependency, every microservice that you spin up is a potential failure point you have never actually tested under pressure. And when it comes to traditional feature flags, they tell who sees a feature. They do not tell you if that feature survives when things go wrong. Gremlin is trying to change that. They just launched failure flags, a no code way to test how your applications actually behave under failure conditions without touching a single line of source code. And today we have with us once again, Kolton Andrus, CEO and Founder of Gremlin to walk us through it. Kolton, it’s great to have you back on the show. Let’s start with the problem. When it comes to, of course, code, what gaps in reliability testing led Gremlin to look at this problem and build failure flags?

Kolton Andrus: Yeah, so I think, you know, traditional failure testing, everyone thinks about the infrastructure level. What happens when my VM or my host goes down or there’s some problem in Kubernetes or some problem at the network layer. But we see more and more people are building their applications in, you know, serverless environments, in managed hosted environments, or the application developers are really separate from the team that owns the platform and the infrastructure. But it’s those application failures that can be major issues in bringing down systems or major bugs or failures that can end up causing, you know, catastrophic outages. And so how do we enable those application developers to go do this testing on their own without having to wait on the infrastructure, the platform team, and to be able to quickly, while they’re building code at an increased speed, answer these questions and mitigate these problems while they’re pushing their code live.

Swapnil Bhartiya: Can you talk a bit about, if we want to go into the weeds, talk a bit about technical. So what exactly are failure flags and how do they work?

Kolton Andrus: Yeah, so we have two variations of failure flags. We have an SDK version. So if you want to just include our library, it gives you a nice place that you can wrap a function or a call or a dependency in a failure flag, similar to a feature flag. It lets you enable who sees it, what users or what requests get those failures, and allows you to inject those failures. So what happens if that dependency fails or slows down? What happens if it throws a certain error code or a certain failure mode? You can replicate those. And then the other one, this no code solution that we built, is a proxy that you can drop in in front of your application regardless of where your application lives, and inject that failure into the request on its way in or on its way out. And so that can happen both on the ingress side where you’re receiving it. What happens if there’s a failure while we’re getting to the serverless application, or on the dependency side when you’re making an outbound call out to a dependency. So I’m calling a cloud service, and you know, that cloud service. What happens if it goes down? Well, let’s test that by injecting a failure, failing those requests, injecting an error code, and really understanding what happens to our system, our application, when that dependency that we rely upon fails.

Swapnil Bhartiya: Every time I was saying the word failure flag, I was, my tongue was slipping and I almost said feature flag because that is, you know, already well known, well adopted. How are failure flags different? And how do these two complement each other? Do they compete? They complement. How does that part work?

Kolton Andrus: Yeah, when we built a feature similar to this quite a few years back, and part of the problem was the name, we called it Application Layer Fault Injection, and that was a mouthful. And the acronym wasn’t that useful. And so part of what we did this time around is let’s make it easier for people to grok, to understand what we mean by this. And so failure flag is meant to imply feature flag because people are familiar with feature flags and the feature flag is a little different. Feature flags are meant to help you to deploy or roll out code. They might live in your code base for a longer period of time, but their purpose is really gating what people see and enabling different code paths. Well, often those are still happy paths and they’re not really thinking about the failure path. And so similar concept, you know, we’re going to wrap a bit of code in a flag, but in this case, we’re going to have the control to say, hey, turn this on and have this fail for everyone, for a subset of people for a certain time period. And so it keeps those same concepts that Chaos Engineering brought to the table. Hey, let’s scope it down, let’s have control over it. Let’s make sure we’re injecting the failure in a safe and thoughtful way. But it also allows us to be very precise about where we’re testing those failures. And so again, it need not be at the host level or the container level. We can get down to an individual function call, an individual network call to a dependency so we understand how it fails. Circuit breakers, I think, are the best example of something you want to tune here. We know circuit breakers are a strong pattern when we’re calling a dependency, we want to be able to, you know, open the release valve if that dependency is failing, we don’t want to pound the dependency. We don’t want to wait for that dependency if we think it’s going to fail. And we want to build a fallback or some sort of graceful degradation if things are going wrong. 
And so how do we test those? Well, the best way to test it is to have something almost in line right there that we can control that failure. We can open it up, we can say, hey, this failure is occurring. We can tune that circuit breaker and we can make sure that that fallback works correctly for the end user, that we get the right user experience further up the stack when that failure propagates back to the user, the user experience level.

Swapnil Bhartiya: Now I also want to talk about no code claim. How do you actually manage to run these experiments without touching a single line of code?

Kolton Andrus: Yeah, so I think that’s where, while the network is the source of all our problems, the network can be the solution to a lot of our problems as well. And so by introducing this proxy layer, we can pass through all normal traffic as if nothing has occurred, but we can inject failure into certain requests or request types or patterns that we’re matching on the front side or the back side as I mentioned. And again, so this is, hey, I don’t have to go pull a library into my application. Some teams, especially at big enterprise companies, there’s a lot of security concerns about adding another dependency or pulling in additional libraries that just increases friction. A team that wants to do this testing might have to now go through a months-long security review process to be able to consume this library, whereas a proxy that they can inspect and see all the code and is very simple at its root, they can put in place in the network path to be able to inject that failure. And again, this is where if we need the infrastructure team or the cloud team or the Kubernetes team to be involved in order to go instrument and run those failure modes at the host or container level, in this case, once we have that proxy in place, which is a drop-in solution, we can go manage it through Gremlin’s UI and we can go scope it, turn it on, turn it off and we get all the usual fail-safe and security features that Gremlin provides.

Swapnil Bhartiya: Can you talk about what kinds of failure scenarios can teams actually simulate through this?

Kolton Andrus: Yeah, so I think there’s a wide variety. So, you know, when we’re talking about dependencies, what happens if that dependency is just unreachable? You know, hey, a cloud provider’s having an outage, it’s not that they respond cleanly. They often just do not respond. So we can introduce a, you know, the connection doesn’t connect on the other end. We can slow it down. So what happens if that service is under load? And if a service dependency you’re talking to is under load, the way it’s going to manifest back to you as the consumer is really just a delayed response, delayed processing time. And so by injecting latency into that connection, we’re really simulating what happens if that service is underwater, failing to respond or responding very slowly. Because we can be, again, at the SDK level, we can be in the code, or at the proxy level, we can simulate what happens if certain error conditions hit. So what if we want to return a certain error code, a certain type of response body, a certain type of explicit failure message. So that’s where we can override whatever response is being returned and we can respond, you know, we can mock it, in essence, and return the exact failure mode that we’re looking to replicate or to reproduce. And then lastly, really, it’s about that pattern matching. Hey, I want to match requests of a certain URL path, requests that have certain attributes to them, certain users, certain cloud regions, certain types of service parameters that we have with our environment. And so that’s what allows us to be very precise about taking a specific example and reproducing it. When I worked at Netflix, we built something similar, and this ended up being a very powerful tool for us because it allowed us to inject failure specifically to our user accounts so that we were the only ones that saw that failure mode, and none of the rest of the population got impacted by it. And so there’s a few examples where we saw an error in production. 
We had a theory about why this error was occurring. We were able to go in and inject this failure in a very precise way against our accounts, reproduce the failure, find and fix the bit of code that was wrong, and then hotfix it. And so in one instance, we took an incident that had occurred 20 minutes ago, pinpointed the cause, had a hot fix, pushed it out, and then were able to test it on our own accounts before we pushed it live to everybody. So I think that kind of precision testing and that ability to get in and be very precise about reproducing a failure in a production environment, but for a limited scope of users, is a very powerful tool for engineers to have.

Swapnil Bhartiya: If I’m not wrong, there is also a built in health check component. How does that work and what role does that play here?

Kolton Andrus: Yeah, one of the most important parts of running failure tests is knowing if it succeeded, if your system was resilient to that type of failure. And so one of the ways that we’re able to accomplish that is by looking at the overall health of the service or the function that’s under test. And often that looks in a very similar way. Do responses flow? Are we seeing errors coming from that function? Is latency and throughput remaining stable? So by processing this code through a proxy, we actually have access to all of that telemetry while we’re running the experiment. And so while we’re doing that, we can see that the application or the function remains stable and healthy throughout. Or if it’s not, we can call the experiment, clean it up and let the engineer know, hey, you didn’t pass this test, it didn’t behave in the way you expected or overall your system wasn’t as healthy as you wanted it to be. And that’s just one of the ways that we can automate understanding how the system behaves so the engineers can focus on running the test and understanding whether or not it passed, not needing to manually monitor system state, and then focusing on if it didn’t pass, how do they go fix it as quickly as possible.

Swapnil Bhartiya: And can you also talk about how do you measure, like, do you have any reliability score that actually measures, when it comes to failure flags, I just want to understand what do you measure? How do you also ensure the failure rate? Do you understand what I’m talking about? That from A to B, what is the difference after using the failure flags, reliability of the code or the processes?

Kolton Andrus: Yeah, so we built failure flags into the broader Gremlin platform, which has a reliability management component. And so we have within that reliability scores. We have test suites that are common failure modes that work out of the box to help people understand. We have ways to schedule those so they run on a regular basis. And so that really is what drives good hygiene around reliability. You can run these types of tests once in a while, but really your system is changing daily, hourly, you know, by the minute in the most advanced systems. And so you need to be running these tests on a regular basis to make sure you’re not drifting back into failure. And so the test suites give us a great mechanism for scheduling and running those tests. It gives new engineers, or even senior engineers that are less familiar with the code a clear place to begin. And we roll that all up into a set of reliability scores that measure the overall reliability of that service on multiple dimensions. And that’s really important when it comes to the leadership aspect of running a reliability program. There’s many teams, there’s many pieces of software. How do you know which pieces of software need more attention? How do you know which teams are doing a great job? How do you recognize and reward people when they’re not having outages, as you want to incentivize the right behavior? And so that set of reliability scores really gives leadership an opportunity to see an overall view of the system and where risks lie within the system, but also to see where teams are doing a great job and where they’re keeping on top of their failures to produce what we want, which is really boring systems. We really want systems that behave as expected, that don’t fail and don’t surprise people on a regular basis. And counterintuitively, it’s a lot of work to have a boring system. 
You have to do all of the prep work and automation in advance to make sure things run smoothly whenever there’s a traffic spike or a cloud failure or a new bug that gets introduced.

Swapnil Bhartiya: Can you share, if possible, a real world example? How would a team use failure flags alongside their existing feature flag workflow? How seamlessly it integrates, and how does that workflow look like?

Kolton Andrus: Yeah, so you don’t need to have feature flags to run failure flags. So I would say they’re similar in concept, but they’re not. There’s no dependency or requirement between them. But if you were launching a new feature and you wanted to failure test it before it went live for the broader population base, you might turn on that feature flag for a percentage of your users. Well, before you do that, you’re probably going to want to go in and test that failure mode to understand, do you gracefully degrade? And so an example might be, hey, I want to go test what happens when Amazon S3 has a failure. Well, I have a portion of my code that goes out and calls and fetches assets from that. You know what I want to do before I turn that feature flag on is I want to go run the failure flag to understand, hey, if Amazon isn’t available at this time, do I have a fallback? Can I return something from cache? Can I gracefully degrade? Can I not show a component, can I calculate or do the piece of work I’m trying to do another way so that when you go and turn on the feature flag and start putting users on that new functionality, does it behave the way you want? Similarly, if you want to make sure that you are handling the load properly, you might go test the latency of that service first. Hey, I’ve got a new internal service, maybe it’s an identity service, maybe it’s a recommendation service and I’m going to go see what happens if it becomes twice as slow or four times as slow. So I’m going to go run that experiment on some of my internal test users and I’m going to understand what does that behave like? Can we gracefully degrade? Can we alert? Do we know when that occurs? And that way when we’re turning on that new feature and we’re dialing it up and putting traffic on it, we’ve already tested to see what happens if things start to degrade so that we ensure that the customer experience remains consistent and of high quality throughout.

Swapnil Bhartiya: How mature is this tool? Is this a pre-production tool, an operations tool or somewhere in the middle?

Kolton Andrus: Yeah, well we’ve worked hard to make it a production worthy tool. I think. You know, if you’ve probably heard my comments on this before, but if you’re only testing in your dev or your staging environment, you’re missing a whole class of failures that can occur only in production. Production has security groups, it has diversity of traffic, it has different load balancers, it has different traffic paths and customers do different things than you expect. And you can’t always anticipate everything a customer is going to do. And so we do an extreme amount of performance testing of our own failure testing on these types of products before we launch them. And we have for failure flags. You know, we recently found an issue a couple of weeks back before we launched where we saw a performance degradation in an edge case. And it was all hands on deck for my team for the next week to go really understand and profile and fix every performance concern that we could find. Because the type of customers that we have are, you know, Fortune 100 Enterprise customers that are putting a large amount of production traffic through these failure flags. And so we need them to be performant when they’re not injecting a failure. They need to basically be non existent when things are running through the system in a normal capacity. And then when we are injecting the failure, we want to make sure we’re only injecting that overhead that relates to the failure. We don’t want to bring along any extra baggage as part of that. And of course, everything we build, one of our first principles is always about fail safes. And so if there’s ever any failure in the control plane, if there’s any user error or other failure somewhere along the way, then we want to fail open, fail in a way that allows the traffic to continue to work where the failure doesn’t get injected so that we’re not causing an outage. Because we never want to be the source of an outage. 
We want to be the source of very precise testing that teaches us about the system to prevent outages.

Swapnil Bhartiya: And how does getting started look like? Is there heavy setup involved? Is it something they can sign up themselves, self serve, or your teams get involved with onboarding as well?

Kolton Andrus: Yeah, so we have a trial within Gremlin so people can go out and self serve these capabilities on their own. Oftentimes they’ll consult with our team. Our technical sales team is quite adept. They’ve all been on call, they’ve all operated systems. And so they’re there to help advise people on how to construct the right experiments, how to build the right tests, and how to interpret the results, which ends up being a big value add for a lot of our customers. But the beauty of the no code solution is at its heart, you drop it in, you go into our UI, you select the failure and you run it. And so there isn’t a lot of setup, there isn’t a lot of overhead that needs to be done. You know, what works by default out of the box is very lightweight and easy to set up. But then there’s a lot of depth that if you want to recreate something very specific or bespoke to your environment, you have that flexibility to go in and customize and configure it the way you want.

Swapnil Bhartiya: And if I may ask you, of course, you folks just launched it. Where do you think failure flags and reliability and SRE go from here?

Kolton Andrus: Well, I think, you know, as I’ve looked over the last six to nine months, the problem of reliability does not seem to be going away. We seem to be having more and more failures. And my personal opinion is there’s a lot of great work going on that’s allowing us to increase velocity, 10x the number of code that’s being written, 10x the number of deployments and our velocity overall. But with that comes 10x the opportunity for bugs, risks, failures to occur. And in these environments where things are changing often I think there’s more risk because of the unintended side effects and the knock on effects that happen within distributed systems. So I think that engineers need more tooling to help them to be able to go answer these questions and more of this tooling to fit into their automated pipelines. Whether that’s AI based or traditional CI/CD. When you’re moving quickly, you need safety nets, you need guardrails in place that help you to test these edge cases and test them in a quick and efficient manner so that you can have that be a feedback loop into your development cycle to make sure you’re accounting for and anticipating these failure modes. So short answer is I think there’s more opportunity for engineers to be able to do this. And the key is there’s many things happening and many things competing for attention. So as always, one of our other core design principles is make it easy to do the right thing. How do we make it simple and straightforward for people to go answer these questions so that they’re able to move quickly in these modern environments, but do it in a way that mitigates risks and gives them a safety net.

Swapnil Bhartiya: Kolton, this has been a fantastic conversation. This is a pain point most organizations face, so this is going to solve a lot of problems for them. And folks who are watching, please head over to gremlin.com to learn more about Failure Flags and I look forward to many more such conversations ahead. Kolton, thank you and I look forward to chatting with you again. Thank you.

Kolton Andrus: Always a pleasure. Thank you for your time.
