Guest: Kolton Andrus (LinkedIn)
Company: Gremlin (Twitter)
Show: Let’s Talk
Earlier this week, Gremlin announced its new ‘Detected Risks’ capability, which empowers chaos engineering teams to automatically find and fix issues that could lead to infrastructure outages and incidents. We invited Gremlin Co-Founder and CTO Kolton Andrus for a conversation on the evolution of Gremlin as a major player in the chaos engineering and SRE space.
Andrus says that chaos engineering teams should not be seen as ‘firefighters’ who come in to save the day; instead, there should be tools and practices that make everything work smoothly. The addition of ‘Detected Risks’ is a testament to that statement. In this discussion, Kolton talks about how Gremlin helps organizations build the right practices and the right mindset through the right tools.
On reliability:
- For Andrus, reliability is all about having one’s platform/service/product work when the customer needs it to work. Everything else is done to support that goal.
- It is a CTO/CEO-level mandate because the entire organization, at least the entire engineering/technical group, needs to work together to achieve it. Someone has to set the standard and say, “Customer experience is sacred, and we have to invest in ensuring that we meet those expectations.”
SREs have a tough job:
- They have to go in and help the organization do the right thing.
- Sometimes, they need to be the experts in tools and help the organization understand and leverage those tools.
- Sometimes, they need to be the experts in the resilience patterns: circuit breakers, back-offs, time-outs, thread pools, load shedding, and so on. They act as consultants who teach teams about these patterns and make sure they’re incorporated into designs and code (a minimal timeout-and-backoff sketch follows this list).
- They’re often the first to get paged when things are broken, so they have to have real-time analytics, triage, and restoration skills.
- If they do their job well, it looks like they’ve done nothing at all. The software just runs smoothly and boringly. If they don’t do their job well, then they look like heroes when they come in as firefighters to fix it, but that’s not actually the behavior companies want to incentivize.
- Many organizations lack the concept of scores and risks, and a way to track reliability efforts: how do we help the organization understand the good work we’re doing? How do we help them understand the risks that exist, so they’ll prioritize the work to fix them? How do we help them understand that the risks we’ve mitigated have made the system more reliable?
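To make the patterns above concrete, here is a minimal sketch (in Python) of two of them: a per-call timeout and retries with exponential backoff. The names (`call_with_retries`, the profile-service URL) are hypothetical illustrations, not code from the discussion.

```python
import random
import time
import urllib.request
from urllib.error import URLError

def call_with_retries(url, attempts=3, timeout_s=2.0, base_delay_s=0.2):
    """Call a dependency with a per-request timeout and exponential backoff.

    timeout_s bounds how long a single call may hang; the jittered backoff
    keeps retries from hammering a dependency that is already struggling.
    """
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.read()
        except (URLError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff with jitter: 0.2s, 0.4s, 0.8s, ... plus noise.
            time.sleep(base_delay_s * (2 ** attempt) + random.uniform(0, 0.1))

# Usage against a hypothetical internal service:
# profile = call_with_retries("http://profile-service.internal/v1/me")
```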
On chaos engineering:
- It is a learning tool to see how the system behaves when things go wrong. Often, it uncovers an incorrect assumption that needs to be revisited and fixed (a minimal fault-injection sketch follows this list).
- Tracking what’s coming down the pipe, business continuity, disaster recovery, and so on is really about understanding and mitigating future risk. Chaos engineering is about understanding the risks that exist today.
- It is about knowing about the cobwebs and the spiders under the bed rather than being surprised by them. Instead of reacting and responding to fires, understand the failures, then prioritize and mitigate them with a thoughtful engineering approach.
- It’s about investing the right amount, not all, of your time in reliability efforts. You want to go in, be effective, mitigate the issues, and then go back to making customers happy and building good products.
- It’s also about sharing these stories across engineering teams. Engineers are busy and skeptical; they need to be convinced that this is a good use of their time, that they’re going to learn something they didn’t know, and that they’ll be better off on the other side.
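As a concrete illustration of learning how the system behaves when things go wrong, here is a minimal, self-contained fault-injection test in Python. The function names (`get_recommendations`, `fetch_live_recommendations`) are hypothetical; the point is that the test injects a dependency outage and checks a single assumption, that the caller degrades gracefully instead of erroring.

```python
import unittest
from unittest import mock

# Hypothetical service code: return live recommendations, fall back to a
# cached/default list if the dependency is unavailable.
FALLBACK = ["editors-picks"]

def fetch_live_recommendations(user_id):
    raise NotImplementedError("real implementation calls a remote service")

def get_recommendations(user_id):
    try:
        return fetch_live_recommendations(user_id)
    except Exception:
        return FALLBACK  # degrade gracefully instead of failing the page

class RecommendationsChaosTest(unittest.TestCase):
    def test_survives_dependency_outage(self):
        # Inject the failure: pretend the downstream service is down.
        with mock.patch(__name__ + ".fetch_live_recommendations",
                        side_effect=ConnectionError("injected outage")):
            result = get_recommendations(user_id=42)
        # The assumption under test: the caller falls back rather than erroring.
        self.assertEqual(result, FALLBACK)

if __name__ == "__main__":
    unittest.main()
```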
On chaos engineering and generative AI:
- The original instantiation of chaos engineering was to provide the tool; end-users then ran the tests, determined whether the system behaved correctly, and fixed what they found.
- The next iteration was system health: the tool tells you whether the test succeeded, and end-users fix any issue that’s found.
- The natural next step is for the tool to run the test, tell you what went wrong, and tell you how to fix it. There is already a well-known class of problems (e.g., a tight retry loop) and fixes (e.g., the circuit breaker pattern) for which we can recommend either best-practice patterns or specific code changes to mitigate the issue (a textbook circuit-breaker sketch follows this list).
- Everyone looks at the “happy” case. If we have AI run our systems, it will do great when everything is the happy case, and it will fail horribly when things go wrong.
- Andrus thinks there’s a lot of interesting work in teaching those models about failure cases so they understand how to respond.
- How we interact with computers is another opportunity. If you could talk to the tool, negotiate with that tool, help it understand what you’re trying to accomplish, and let it decide upon the details, it would be interesting. (Andrus has a very high bar for what success looks like, having already written a developer tool that is incredibly precise.)
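For the well-known class of problems mentioned above, here is a textbook sketch of one recommended fix, the circuit breaker pattern, in Python. It is a generic illustration of the pattern, not output from Gremlin’s tooling, and the dependency name (`fetch_inventory`) is hypothetical.

```python
import time

class CircuitBreaker:
    """Textbook circuit breaker: after `max_failures` consecutive failures the
    circuit opens and calls fail fast until `reset_after_s` has elapsed."""

    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success resets the failure count
        return result

# Usage with a hypothetical dependency call:
# breaker = CircuitBreaker()
# payload = breaker.call(fetch_inventory, item_id=123)
```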
How Gremlin helps companies improve their reliability posture:
- The original Gremlin platform is for the SRE team. It’s for the experts. It has every knob and dial to do whatever you want, which most teams find intimidating.
- Their new Reliability Management product is meant to be the easy button. It is a robust but safe test suite out of the box: eight tests every team should run, with a score that shows whether each has been accomplished.
- It helps people to score tests, measure risk within their systems, and track the good work their teams are doing.
- It has monitoring so that if something goes wrong, you can stop the test, clean it up, and verify that your system withstood the test and responded correctly.
- It collects a good amount of data, especially about Kubernetes and cloud environments. Because it has agents installed, some risks can be detected without ever running a test, for example, whether a service is deployed redundantly (see the sketch below).
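To illustrate the kind of static check that can flag a risk without running a test, here is a toy Python sketch that flags single-replica Kubernetes Deployments. It assumes data shaped like `kubectl get deployments -o json` output and is not a description of how Gremlin’s Detected Risks actually works.

```python
# A toy illustration of detecting a risk from configuration data alone:
# flag Kubernetes Deployments that run a single replica (no redundancy).

def find_single_replica_deployments(deployments):
    """`deployments` is a list of dicts shaped like Kubernetes Deployment
    objects (e.g., parsed from `kubectl get deployments -o json`)."""
    risks = []
    for d in deployments:
        name = d.get("metadata", {}).get("name", "<unknown>")
        replicas = d.get("spec", {}).get("replicas", 1)
        if replicas < 2:
            risks.append(f"{name}: only {replicas} replica(s), not redundant")
    return risks

# Example input and output:
sample = [{"metadata": {"name": "checkout"}, "spec": {"replicas": 1}},
          {"metadata": {"name": "catalog"}, "spec": {"replicas": 3}}]
print(find_single_replica_deployments(sample))
# -> ['checkout: only 1 replica(s), not redundant']
```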
This summary was written by Camille Gregory.