As environments grow increasingly complex, reliability is key to an organization’s success. Yet how reliability is defined and measured within an organization is not straightforward. Reliability depends not just on the tools developers have at hand but also on the culture within the organization. Gremlin, an enterprise reliability platform, is focused on helping teams understand their reliability stance and giving them the tools to improve it.
In this episode of TFiR: Let’s Talk, Gremlin Co-Founder and CTO Kolton Andrus talks about the challenges of reliability in today’s distributed world and the role of culture in helping developers produce higher-quality code. He goes on to discuss how Gremlin is enabling teams to define their own test suites and how this can help foster a more proactive approach to reliability throughout the organization.
Key highlights from this video interview:
- Andrus talks about the importance of reliability, saying that if developers had infinite time, they would dedicate it to delivering only the highest-quality code. In practice, time is limited, so they need to make trade-offs. Andrus discusses how this translates into security and reliability.
- With the move to microservice architectures, environments have become more complex to manage. Andrus tells us about the “joke” that in these distributed systems you do not know whose system will bring yours down. He highlights the importance of knowing all your dependencies so that their failures can be handled without impacting customers.
- Andrus discusses where the responsibility for security lies, saying that, from an engineer’s perspective, engineers need to be given the right tools, time, and resources to address critical problems. He talks about why reliability needs to be prioritized.
- Alert fatigue is a key problem, and Andrus feels that alerts need to be tuned to fire only when things are going wrong. He believes the best way to do this is with controlled fault injection, which enables you to test specific scenarios. He discusses the importance of recognizing engineers for doing good work and how reliability needs to be built into the culture.
- Rewarding teams that are getting it right is key, and Andrus shares his experiences working with customers who have commitments to reliability and business continuity in which every team above a certain threshold needs to participate. However, he believes software needs to work at all times and that this will become the standard.
- Andrus discusses the role Gremlin is playing in enabling teams to measure the risks in their systems and the quality of their software with a reliability score. Gremlin has also enabled teams to build their own test suites and roll them out across teams and the organization to help leaders assess risk.
This summary was written by Emily Nicholls.