Transitioning From SRE/DevOps To Observability/SLO Leads To Reduction In Alert Fatigue

Guest: Asaf Yigal (LinkedIn)
Company: Logz.io (Twitter)
Show: TFiR: T3M

While many companies are looking to transition to observability, a name change or reading an article on Google’s definition of SRE is not necessarily going to get you all the way there. Not only does observability require changes within organizations in terms of roles, the tools that are used, and how things are done, but it also needs to be done in a way that is scalable and consistent throughout the organization.

In this episode of TFiR: T3M, Asaf Yigal, Co-Founder and CTO at Logz.io, talks about organizations’ transition to observability and what that entails for the company. He touches on some of the key challenges he sees with their customers, the cultural aspect of observability, and how Logz.io is helping organizations.

Introduction to Logz.io:

Logz.io is a cloud-based observability solution with Kubernetes support. The company is committed to offering observability at a reasonable cost, so that you store only the data you need.

How do reliability and observability fit together?

There is some overlap between reliability and observability. However, reliability engineers often do not see the impact that much of their SRE work has on the applications. Observability is approached mainly from the development and business side, to know whether the application is serving customers at the desired service level.

What makes more sense for businesses, SLAs, or SLOs?

Yigal feels that for their business, SLOs make more sense. To make the transition to observability, companies need to define their SLOs and build an observability system around them, ensuring they meet those service level objectives.
The transition from an SRE and DevOps organization to an observability and SLO organization results in a significant alert fatigue reduction.
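One common way to tie alerting to an SLO, rather than to individual metrics, is to page only when the error budget is being consumed too quickly. The sketch below is illustrative and is not taken from the episode or from Logz.io: the names `SLOWindow`, `burn_rate`, and `should_page`, the 99.9% target, and the threshold of 2.0 are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SLOWindow:
    """Request counts observed for one service over a time window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def burn_rate(window: SLOWindow, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value of 1.0 means the error budget is being used exactly on schedule;
    values above 1.0 mean it will run out before the SLO period ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 0.0  # no traffic, so no budget consumed
    error_budget = 1.0 - slo_target
    error_rate = window.failed_requests / window.total_requests
    return error_rate / error_budget


def should_page(window: SLOWindow, slo_target: float, threshold: float = 2.0) -> bool:
    """Page only when the budget burns at least `threshold` times faster than allowed.

    The default threshold is an assumption for this example, not a recommendation.
    """
    return burn_rate(window, slo_target) >= threshold


if __name__ == "__main__":
    quiet = SLOWindow(total_requests=10_000, failed_requests=5)
    noisy = SLOWindow(total_requests=10_000, failed_requests=50)
    print(should_page(quiet, slo_target=0.999))
    print(should_page(noisy, slo_target=0.999))
```

With one rule like this per service, a handful of failed requests that stays within budget does not page anyone, which is one plausible reading of why an SLO-centered approach reduces alert volume.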

What are the challenges teams face on their reliability journey?

Alert fatigue and the mean time to resolution (MTTR) for an issue are the main challenges. MTTR has actually increased over the years, even though there are more tools available today to address it. Part of the problem is that companies have defined so many alerts they do not need that they risk missing the ones they should care about.
Tool sprawl is another problem, but it is possible to use several tools for observability, provided they share unified data collection.
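For readers who want to track the MTTR figure mentioned above, it is simply the average time between an incident being opened and being resolved. The following is a minimal sketch, assuming incidents are available as pairs of timestamps; the function name `mean_time_to_resolution` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - opened) across all incidents.

    `incidents` is a list of (opened, resolved) timestamp pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - opened for opened, resolved in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
        (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
    ]
    print(mean_time_to_resolution(sample))
```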

Where does security come into play from a Kubernetes perspective?

Yigal thinks there are two ways to look at security in a Kubernetes environment: horizontally, in terms of how you want to see the service, the infrastructure, and how the application is laid out; and vertically, in terms of the relationships between the different pods running on different clusters. Security has to take both into consideration, from an infrastructure perspective and as an application owner.

Changing the name of a role to SRE does not translate to effectiveness

Many companies are having teething problems in their transition towards SRE, because simply renaming the development and monitoring team to SRE does not necessarily get the organization to its destination.
Yigal talks about the changes that need to happen in order to make the transition, and he lays out the responsibilities of an SRE team in achieving a level of consistency throughout the organization. Although monitoring staff may be watching all the metrics and have alerts for some of them, that approach does not scale to hundreds or thousands of developers sharing the same environment.

How does Logz.io make it easier for organizations?

Logz.io Kubernetes 360 enables you to look at your clusters and see how the applications are laid out. It also provides a service-level overview and shows whether you are meeting the SLO for a given service.
The company also provides guidance for organizations to help them on their observability journey.

This summary was written by Emily Nicholls.
