How SREs measure system reliability has changed dramatically over the years, shifting from traditional metrics toward concepts like SLIs and SLOs. These concepts aim to give engineers a bird’s eye view of the system, from which they can see more clearly how it affects the user’s experience.
In this episode of TFiR: Let’s Talk, Swapnil Bhartiya sits down with Stephan Lips, Staff Software Engineer at Procore, to discuss system reliability and why engineers need to consider the user’s perspective. Lips talks about how well-defined SLIs can provide better coverage of the actual user experience. He also dives into the concept of black box SLIs and walks through how they can be used in a real-world example.
- The shift from traditional observability metrics such as response time to higher-level concepts like Service Level Indicators (SLIs) and Service Level Objectives (SLOs) has changed how SREs measure system reliability. Lips believes we need to look at system reliability from the user’s perspective to develop meaningful SLIs.
- Engineers are so close to the systems they build and support that they rarely experience user journeys the way real-world users do. Lips tells us they tend to view a system’s performance and reliability in lower-level technical terms. Those details do not paint the whole picture, so engineers need to take a bird’s eye view of the system instead.
- Even with well-defined SLIs, it is still possible for the system to fail. Lips believes this depends on which SLIs are chosen and how they are implemented. He talks through an example of a web service for real-time stock quotes and how its SLIs can be adjusted to provide better coverage of the actual high-level user experience.
- Lips discusses how quality engineering embraces the black box concept, in which a given input results in a particular output. He explains how the concept can be adopted by aggregating granular metrics into higher-level SLIs that focus on the user journey and use it as an indicator of system reliability.
- Lips gives an example of where a black box SLI applies: a system that creates user accounts by queueing requests. If the queue gets too long or gets stuck, it can affect the user’s experience. He discusses how a white box SLI and a black box SLI would differ and which would be more beneficial in this case.
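
To make the white box versus black box distinction in the account-creation example more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation discussed in the episode: the `AccountRequest` record, both SLI functions, the thresholds, and the sample data are all hypothetical. The white box SLI looks at an internal detail (queue depth), while the black box SLI only asks whether a submitted request resulted in a usable account within an acceptable time.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AccountRequest:
    """One sign-up attempt as seen from outside the system (hypothetical record)."""
    submitted_at: float          # seconds since the start of the window
    usable_at: Optional[float]   # when the account became usable; None if it never did


def white_box_sli(queue_depth_samples: Sequence[int], max_depth: int) -> float:
    """Fraction of samples in which the internal queue depth stayed within max_depth."""
    if not queue_depth_samples:
        return 1.0  # assumption: no samples means nothing was observed to be bad
    good = sum(1 for depth in queue_depth_samples if depth <= max_depth)
    return good / len(queue_depth_samples)


def black_box_sli(requests: Sequence[AccountRequest], max_wait_s: float) -> float:
    """Fraction of requests that produced a usable account within max_wait_s."""
    if not requests:
        return 1.0  # assumption: no requests means no user was affected
    good = sum(
        1
        for r in requests
        if r.usable_at is not None and r.usable_at - r.submitted_at <= max_wait_s
    )
    return good / len(requests)


# Assumed scenario: the queue is short but stuck, so later requests never complete.
depth_samples = [3, 4, 4, 4, 4]
requests = [
    AccountRequest(submitted_at=0.0, usable_at=2.0),
    AccountRequest(submitted_at=1.0, usable_at=3.5),
    AccountRequest(submitted_at=2.0, usable_at=None),
    AccountRequest(submitted_at=3.0, usable_at=None),
]

white = white_box_sli(depth_samples, max_depth=100)
black = black_box_sli(requests, max_wait_s=5.0)
print(f"white box SLI: {white:.2f}, black box SLI: {black:.2f}")
```

In this assumed scenario the queue never grows past the depth threshold, so the white box SLI reports full health, while the black box SLI shows that half of the sign-ups never produced a usable account. That gap is the kind of difference between internal metrics and the user journey that the discussion points to.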
This summary was written by Emily Nicholls.