Introduction To Steadybit Resilience And Chaos Engineering Platform | Benjamin Wilms


In this episode of TFiR Let’s Talk, Swapnil Bhartiya sits down with Benjamin Wilms, CEO and Co-Founder of Steadybit, to discuss the Steadybit Resilience and Chaos Engineering Platform. Wilms explains the origins of chaos engineering and how the practice shifted toward repeating specific situations under controlled conditions. He discusses the current level of chaos engineering adoption and the need for further education within organizations. Lastly, he takes us through the key features of the platform and how it helps detect issues in systems proactively.

Key highlights from this video interview are:

  • Steadybit is a chaos engineering platform founded in 2018. Feedback from customers indicated that teams did not know how or where to start with chaos engineering. This led to a shift: Steadybit now helps you start at the right point in the system so you can improve it proactively. Wilms explains how this shift lets you convert incidents into request regression tests so that you are better prepared.
  • Chaos engineering, despite its name, is actually very structured and planned. Wilms notes that much of the available information says you need to do it in production, yet he believes it should be done at an earlier stage so that SREs and DevOps can work collaboratively to optimize systems proactively. He explains that chaos engineering is about finding the unknowns by testing the system and takes us through the process.
  • Chaos engineering started with Chaos Monkey, which killed a machine so that teams could see and learn how the system reacted. Wilms discusses how this concept shifted toward repeating specific situations under controlled conditions. He explains how Steadybit is helping close the gap between SREs and DevOps. He also notes that many developers are measured on feature velocity rather than production uptime, and that this incentive changes their priorities.
  • Platform engineering, SRE, and DevOps teams are early adopters of chaos engineering because they want to optimize productivity. Wilms feels that some developers do not prioritize reliability and resilience issues until there is an incident, and he believes they need to be more proactive in this respect.
  • Wilms feels that the focus now needs to be on how chaos engineering is brought into an organization and rolled out. Although the concept is already known to some people, he reiterates the need for education. He explains that one of their customers, ManoMano, runs an e-commerce platform and has concluded that chaos engineering is the best way to search for problems.
  • Wilms discusses his motivations behind creating the Steadybit platform. He explains that you install agents into the system, and those agents connect to one central platform and continuously check and scan the system. He takes us through the platform's main features, including how you can run experiments and replay specific incidents.
  • Steadybit has a weak spot analysis feature, which acts as a checker that points out specific weak spots in the system without running an experiment. The next stage is to create an experiment to get further insight. Wilms tells us that Steadybit has a wide variety of customers, from large SaaS companies to manufacturers and insurance companies, and that they can extend the system by writing discovery components with Steadybit's open source extension kit.
  • Steadybit started as a closed source solution, and its central platform, which orchestrates everything, remains closed source. However, some parts are open source, such as the extension kits, which you can use to create your own extensions or to run extensions from the open source community. The policies, including the weak spot policies, are also open source.
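
To make the idea of a structured, repeatable experiment more concrete, here is a minimal Python sketch of the general pattern described in the highlights above: check a steady-state hypothesis, inject a fault under control, check again, and always roll back. It does not use Steadybit's API. The `PriceService`, `Shop`, `run_experiment`, and `can_quote` names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


class DependencyDown(Exception):
    """Raised by the simulated dependency while a fault is injected."""


@dataclass
class PriceService:
    """Hypothetical downstream dependency with a switchable fault."""
    failing: bool = False

    def price(self, item: str) -> float:
        if self.failing:
            raise DependencyDown(item)
        return 9.99


@dataclass
class Shop:
    """Hypothetical system under test; falls back to a cached price."""
    prices: PriceService
    cached_price: float = 9.49

    def quote(self, item: str) -> float:
        try:
            return self.prices.price(item)
        except DependencyDown:
            return self.cached_price


def can_quote(shop: Shop) -> bool:
    """Steady-state hypothesis: the shop still returns a positive quote."""
    try:
        return shop.quote("drill") > 0
    except Exception:
        return False


def run_experiment(shop: Shop, steady_state: Callable[[Shop], bool]) -> dict:
    """Check steady state, inject the fault, check again, then roll back."""
    result = {"before": steady_state(shop)}
    shop.prices.failing = True       # inject the fault under control
    try:
        result["during"] = steady_state(shop)
    finally:
        shop.prices.failing = False  # always roll back the fault
    result["after"] = steady_state(shop)
    return result


report = run_experiment(Shop(PriceService()), can_quote)
print(report)
```

Because the fault is switched on and off deterministically, the same experiment can be rerun after every change, which is what turns a past incident into a regression check. If `Shop.quote` had no fallback, the `during` entry would come back `False` and expose the weak spot before production does.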

Connect with Benjamin Wilms (LinkedIn, Twitter)
Learn more about Steadybit (Twitter)

The summary of the show is written by Emily Nicholls.