Reliability Is A Team Sport | SLOconf 2023

Guests: Adriana Villela | Ana Margarita Medina
Company: Lightstep
Show: Let’s Talk

In this episode of TFiR: Let’s Talk, Swapnil Bhartiya sits down with two guests from Lightstep, Senior Technical Leader Adriana Villela and Staff Developer Advocate Ana Margarita Medina, to talk about translating failures into service level objectives (SLOs) and how these SLOs intersect with system reliability and observability.

Key highlights of this video interview:

A service level objective (SLO) is a target that an organization sets as its reliability goal. SLOs are numbers tied to user impact, and they are a key part of increasing observability.
Embracing failure is very challenging for many organizations, especially larger ones, because culturally it is almost taboo. It becomes extremely difficult to get into the mindset of “It’s okay to fail. It’s okay to admit that there is a problem and to iterate on that.”
Reliability is a team sport. Everyone in the organization needs to come together. Leadership needs to mandate, “Reliability is one of our objectives and key results (OKRs),” and ensure that the people doing the heavy work are getting promoted, that the work is equally distributed, and that the entire organization carries the load of this reliability goal together.
Cultural changes include conducting blameless post-mortems, doing tabletop chaos engineering exercises, and inviting external advocates to inspire engineers toward organic adoption.
Obstacles remain: some engineering teams don’t really want to take on the heavy lifting needed to adopt the new culture, and in other cases there is pushback from leadership.
Villela believes we’re still in a position where things are being willfully misrepresented, e.g., rebranding operations teams as site reliability engineers (SREs), or inserting an entire team in the middle and calling it a DevOps transformation. The right conversations are going on, but there’s still a long way to go.
When deciding which tool to adopt, organizations need to have a good understanding of the space, do due diligence (does the tool follow SRE fundamentals, and is it a good fit for your organization?), do comparisons, talk to people who are using the tool, and also talk to people in the community.
Villela would like to see observability become part of the SRE conversation. She states that you cannot be successful at SRE these days without observability.
Medina would like to bring reliability and observability into the software development lifecycle, i.e., put quality gates before deployment to production in order to know the impact of every code change on the SLOs.
A compelling use case for observability is having the information at your fingertips to resolve incidents in the middle of the night in a relatively timely manner. It saves money because outages do not last as long.
Developers can use SLOs as the metrics that tell them which software or applications they need to keep reliable. SLOs are tied to the OKRs that leadership sets and to the key performance indicators that show business value.
Why open source? 1) It gives software that one organization has developed an opportunity to blossom further. Someone else might have ideas on how to improve it and bring new ideas to the table. 2) It raises the profile of your organization as one that is open to collaboration and open to sharing. 3) It enhances your organization’s developer experience, as most developers would love to be involved in open source. 4) It allows more contributors. You end up with technology that is built by users worldwide, not just in one country or one city. 5) People may help you with coding, using Kubernetes, being a community member, writing docs, etc. You get access to talent that you weren’t able to reach in the past.
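
To make the SLO and quality-gate highlights above more concrete, here is a minimal Python sketch of how an SLO target can be turned into an error budget and then used as a gate before deploying to production. It is an illustration only and was not discussed in the interview: the names SLO, error_budget_remaining, and deploy_gate are hypothetical, and the 20% minimum-budget threshold is an assumption, not a recommendation from the guests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO definition: a named reliability target."""
    name: str
    target: float  # fraction of events that must be good, e.g. 0.999


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no bad events so far; 0.0 means the budget is exactly used up;
    a negative value means the SLO has been violated for this window.
    """
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def deploy_gate(slo: SLO, good_events: int, total_events: int,
                min_budget: float = 0.2) -> bool:
    """Allow a deploy only if enough error budget is left (assumed threshold)."""
    return error_budget_remaining(slo, good_events, total_events) >= min_budget


if __name__ == "__main__":
    checkout = SLO(name="checkout-availability", target=0.999)
    # 100,000 requests in the window, 50 of them failed.
    print(round(error_budget_remaining(checkout, 99_950, 100_000), 3))
    print(deploy_gate(checkout, 99_950, 100_000))
```

In this sketch, a 99.9% target over 100,000 requests allows roughly 100 bad requests, so 50 failures leave about half the budget and the gate passes, while 90 failures leave about a tenth and the gate blocks the deploy.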

This summary was written by Camille Gregory.
