Traditionally, service-level objectives (SLOs) focused on how end users and customers experienced your website or product. However, there is a growing trend of applying SLOs more broadly to other areas and in more innovative ways.
It’s also becoming clear that traditional models of monitoring can fall short when it comes to tying business value to the back end. The dashboards they produce can be brittle and require a lot of maintenance, and it can be difficult to bring in the key people.
SLOs try to align SLIs (service-level indicators) with tangible business metrics or values, helping organizations identify the SLIs that contribute to the SLO and giving you an error budget to make changes or improve your software.
In this episode of TFiR Let’s Talk, Swapnil Bhartiya sits down with Austin Parker, Head of Developer Relations at Lightstep, to discuss what exactly SLOs are and why they are a better approach for organizations compared to traditional monitoring practices. He shares his views on the challenges organizations face with SLOs as well as the business value they can bring.
Key highlights from this video interview are:
- One of the key trends at SLOconf was the emerging consensus of SLOs as a primary observability instrument. Although SLOs used to be primarily used by larger enterprises, Parker feels that that is changing. He goes into detail about the trends he is seeing with SLOs and their uses.
- An SLO is a measurement of system performance that connects business value to underlying SLIs. Parker explains what exactly an SLO is and why SLOs are preferable to traditional monitoring practices.
- Parker identifies three key challenges with the traditional model of monitoring: the dashboards are brittle and require constant maintenance, there is distance between the person creating the dashboard and the people responsible for interpreting it, and alerts impose a heavy cognitive load.
- Although the shift left movement has changed the development culture, the organizational culture of how money is allocated has not changed and silos do still remain. Parker discusses how SLOs can be used to break down barriers in organizations.
- Parker believes SLOs can help democratize data and tie together different data sources. He discusses the business value SLOs can bring to enterprise users and the challenges of tying together business value to the back end with SLOs.
- According to Parker, the best way to get the telemetry to make a good SLO is through distributed tracing. He shares his insights into this process and why he feels this is the best approach.
- Getting the right balance of SLOs is key: some enterprises try to switch everything over to SLOs, while others start too small. He details some of the most common mistakes enterprises make and how to avoid them.
The summary of the show is written by Emily Nicholls.
Here is the automated and unedited transcript of the recording. Please note that the transcript has not been edited or reviewed.
Swapnil Bhartiya: Hi, this is your host, Swapnil Bhartiya. And, welcome to another episode of TFiR Let’s Talk and today, we have with us once again, Austin Parker, head of developer relations at Lightstep. Austin, it’s great to have you on the show.
Austin Parker: Hey, it’s great to be here.
Swapnil Bhartiya: Thanks for joining me. If I’m not wrong, SLOconf was a few weeks or a month ago; time passes so fast that it’s hard to keep track of that. But I do want to talk a bit about the conference: what kind of discussions you saw, and what kind of insight it gave into what’s going on in the market. How much awareness is there about SLOs, and how are folks looking at the whole tracing and monitoring space? So, just talk about your observations from the conference.
Austin Parker: Yeah. So first, thanks for having me on again, it’s really great to be here. SLOconf, I believe this was the second year they’ve done this and it was really remarkable to me to see the growth and interest from one year to the next. I want to say there were about 3,000 people that had registered and showed up for it. And it was in a really great format, with these little bite-size, five- or 10-minute talks, so you can just listen to them. So, it was really possible to hear a lot of things. And what we saw was not only this emerging consensus about SLOs as a primary observability instrument, but also how people are using them in really innovative ways. It is still kind of early days, I think for SLOs. Looking at the past couple years, they’ve gone from being very kind of niche, or really only used in kind of larger enterprises that have very mature observability practices.
But what we’re seeing now is more people bringing them down market a bit and making it easier to access and to implement those practices in kind of more average day-to-day teams. And I think that’s what you really saw with SLOconf this year is that heightened awareness, that shift down market, and also maybe a broadening of what you can use SLOs for. So I mean traditionally, your service-level objective is going to be really tightly focused around how are end users and customers experiencing your website, or product or whatever. But there were talks about SLOs for security, there were talks about SLOs for organizational metrics, and management and on call time and things like that. So, it’s interesting how broadly you can apply SLOs to things outside of just that narrow focus on application performance.
Swapnil Bhartiya: Since you talked about the span or scope as well, going all the way from there to security, I think it’s also very important to quickly remind our viewers what SLO actually stands for, not just in terms of service-level agreements or objectives, whatever it is, but in terms of impact, what it should look like.
Austin Parker: Yeah. So the service-level objective or SLO is a measurement of system performance. And, it does this by connecting business value to sort of underlying service-level indicators. A service-level indicator being something that we’re all familiar with, like request latency, or CPU utilization, or concurrent user sessions, or any sort of metric that you can track. What’s important about SLOs and what differentiates them, I think from a traditional monitoring practice is that an SLO really tries to align those SLIs with some visible, tangible business metric or value. So it’s less saying, “Hey, the CPU utilization is really high, we want to have an alert set up if it goes over 90%.” Or “P99 or some tail latency of this particular API endpoint’s very high.” It says, “For 80% of customers that are going through this checkout flow, we want it to complete in under 500 milliseconds.”
And then, that statement gives you a couple things. One is it helps you identify, well, these are the SLIs that contribute to that SLO. Two, and I think this is what’s really important, is it gives you kind of this budget. It gives you an error budget: here’s room in this calculation, that 20% or so of customers that we’re saying can have a worse experience than that. That error budget gives you room, not only to kind of improve your software and make changes, and so on and so forth, but it also kind of recognizes something that I think a lot of people miss when they think about software reliability: systems are complex and they’re getting more and more complex. You can’t just say we’re going to have a hundred percent uptime, or 99.999% uptime.
You’re paying so much for those incremental improvements in reliability, and an SLO gives you a lot of freedom to be able to say like, “Well, it’s actually okay if every now and then, someone maybe has a one second checkout.” We accept that’s going to happen, sometimes that happens for reasons that are within our control, sometimes it happens for reasons that are outside of our control. The SLO really codifies that, gives you room to experiment and gives you room to grow, both for intentional performance changes, but also recognizing the unintentional changes that can cause deleterious effects.
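Editor's note: the checkout example Parker describes (80% of checkouts completing in under 500 milliseconds, leaving roughly 20% as error budget) can be sketched in a few lines of Python. This is a minimal illustration, not Lightstep's or any vendor's implementation; the `LatencySLO` class, the `evaluate` function, and the sample latencies are all hypothetical names and data invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO: `target` fraction of requests finish under `threshold_ms`."""
    target: float        # e.g. 0.80 for "80% of checkouts"
    threshold_ms: float  # e.g. 500 for "under 500 milliseconds"


def evaluate(slo, latencies_ms):
    """Return compliance and error-budget figures for one window of requests."""
    if not 0 < slo.target < 1:
        # A 100% target leaves no error budget at all, which is the case
        # Parker argues against.
        raise ValueError("target must be strictly between 0 and 1")
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("need at least one request to evaluate an SLO")

    # A request is "bad" when it misses the latency threshold.
    bad = sum(1 for ms in latencies_ms if ms >= slo.threshold_ms)
    good_fraction = (total - bad) / total

    # The error budget is the number of bad requests the target tolerates.
    allowed_bad = (1 - slo.target) * total
    return {
        "good_fraction": good_fraction,
        "slo_met": good_fraction >= slo.target,
        "budget_remaining": 1 - bad / allowed_bad,  # negative once overspent
    }


# Made-up checkout latencies for one window; one request is slow.
checkout_slo = LatencySLO(target=0.80, threshold_ms=500)
sample_latencies = [120, 340, 480, 210, 200, 950, 300, 410, 150, 260]
report = evaluate(checkout_slo, sample_latencies)
print(report)
```

In this made-up window, one slow checkout out of ten uses about half of the budget, so the objective still holds and there is room left for further changes.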
Swapnil Bhartiya: Excellent. Thanks for explaining that. Now, I want to go back to monitoring. What’s the problem with the traditional model of monitoring that you feel that SLOs are the right approach there?
Austin Parker: That’s a great question. So I think the primary problem with monitoring practice is that, if I tell you we’re going to go monitor an application, what does that mean? What do you need to look for? Most of the time in a traditional monitoring setup, that means you’re going to go and you’re going to create a dashboard. And those dashboards are going to have a lot of charts on them, they’re going to have line charts, they’re going to have top lists, they’re going to have big numbers, they’re going to have all these different little primitive ways to visualize data.
As an engineer, or as an SRE or as whoever you are, you have to look at that data, and you have to look at those charts and you have to say, “Okay, I have a bunch of raw data, I need to interpret it, I need to display it, so I need to figure out queries that can represent my system state.” I build a dashboard full of those charts, and then you’re done. You say, “Okay, I have my dashboard, this has all my key metrics on it and I’m going to hand this off to someone, and they can look at it and they can figure out if there’s a problem.” Here’s where that falls short. There’s kind of three big reasons. One is that, those dashboards themselves can become very brittle, they can rot. They require constant care, and feeding and maintenance.
We have to make sure that those underlying metrics didn’t change what they represent. We have to make sure that, as we grow, and scale and change our application, that we’re adding new metrics, we’re keeping them up-to-date, and then we have to refresh those dashboards. And that’s a lot of effort that quite often, to be honest, just isn’t put into this. You need specialized teams, you need people whose job it is to kind of keep on top of that. And then, those changes have to be communicated to the end users, to the engineers that are actually responsible for monitoring those dashboards, which adds even more friction. Another big reason this fails is more to do with how much distance there is between the people that are building the dashboards, and putting them together and the people that are responsible for using them to interpret what’s going on.
If you hand me a dashboard, I can go and I can… Like, let’s say I go into Grafana, I find some Kubernetes dashboard on the internet, I plug that in there, and then I have this big complex dashboard that shows me what’s going on in my Kubernetes cluster. If I’m not a Kubernetes expert, all of that is going to just be Greek to me. It’s a ton of information that I don’t really have the context for, I don’t know how these things relate to each other. I don’t know if this line goes up, is that bad? Is that good? It requires a lot of domain knowledge that not every engineer is going to have, and that adds in friction, but it also adds in frustration because you, as the engineer, you feel like you’re constantly having to like go and learn about all this new stuff.
And yes, that’s part of the job, but I feel like most engineers, they don’t necessarily want to become an expert on every single part of their stack. And as the stack gets more complex, we’re effectively mandating that they do become experts on this whole thing. The third reason that I think that traditional query and monitoring approach isn’t really valuable is that, it’s really about alerts. So you have your dashboard, you have your queries, that’s great. You have to make some sort of prediction about like, what is the state of the system… At what point is this so out of spec, or so out of whack that I want to tell a human being, “Hey, come in here and look at this?” And, that’s what alerting does for us. We put a horizontal line across our dashboard or across one of our visualizations and say, “This is the threshold.” “If the line goes above this or goes below this, then tell someone.”
But again, those alerts require care, and feeding and maintenance. Those alerts require a lot of cognitive load, in order to understand what does this actually mean, how does this relate to other parts of the system? Those three reasons are all due to kind of two main things. One, like I said, a lack of context about what’s the relationship between these metrics and what’s going on and two, sort of update, and care and feeding overload. So, SLOs fix the first part of those by tying business value back to a metric, back to a service-level indicator. So anyone that looks at it, especially, you might not know how Kubernetes works, but you do know like, okay, if people can’t check out or if this particular API route that I’m responsible for is being very slow, that’s a problem that’s causing pain for my users, I need to fix that.
On the care and feeding side, because SLOs are really aggregates of a lot of other pieces of data, they tend to be a little easier to actually take care of. And that sounds counterintuitive, because I’m saying here’s a thing that requires more stuff, but you have less SLOs. Because if you look at your system, you don’t have a million interaction points. I mean, maybe you do, but the actual end user journeys you care about through your application tend to be more tightly scoped, there tends to be a countable amount of them. So you can actually have fewer SLOs that are of higher quality, so the overall maintenance burden is reduced because you’re only having to care about 10 or 20% of the alerts, or monitors that you might have had to care about before.
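Editor's note: one way to picture the contrast Parker draws between a threshold line on a chart and an SLO is to run both over the same data. The sketch below is a hedged illustration only; the per-minute latency buckets, the 500 ms threshold, the 80% target, and both function names are assumptions made up for this example.

```python
THRESHOLD_MS = 500  # hypothetical latency threshold
SLO_TARGET = 0.80   # hypothetical target: 80% of requests within the threshold

# Made-up request latencies (ms), grouped into one bucket per minute.
latency_by_minute = [
    [120, 200, 310],
    [150, 620, 180],
    [140, 160, 170],
    [900, 210, 230],
    [130, 190, 220],
]


def threshold_alert_minutes(buckets, threshold_ms):
    """Traditional alert: flag every minute where any request crosses the line."""
    return [
        minute
        for minute, bucket in enumerate(buckets)
        if any(ms > threshold_ms for ms in bucket)
    ]


def slo_is_met(buckets, threshold_ms, target):
    """SLO view: judge the whole window by the fraction of good requests."""
    all_requests = [ms for bucket in buckets for ms in bucket]
    good = sum(1 for ms in all_requests if ms <= threshold_ms)
    return good / len(all_requests) >= target


noisy_minutes = threshold_alert_minutes(latency_by_minute, THRESHOLD_MS)
slo_met = slo_is_met(latency_by_minute, THRESHOLD_MS, SLO_TARGET)
print("minutes that would alert:", noisy_minutes)
print("SLO met for the window:", slo_met)
```

With this sample data, the threshold rule flags two separate minutes for someone to interpret, while the aggregate objective for the window is still met.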
Swapnil Bhartiya: Excellent. I love the three points that you mentioned there. I want to talk about, when you’re talking about the distance between those who are creating it and who are trying to get some… How does SLO… Also, I want to quickly talk about when we look at the whole DevOps movement, the whole idea was to break down old silos, bring folks together, but that problem is still there. So talk about, especially that how SLOs solve that problem. And also culturally, how you are seeing things are changing to further enable, so that organization can embrace this approach as well?
Austin Parker: I believe the real connection here between, kind of observability teams, SRE teams, the people that are usually tasked with this sort of monitoring workload, and the engineers that are building code and shipping it, is that it’s an organizational byproduct of just the way that big groups of people have to organize themselves. Like, I think DevOps and the idea of shifting left and the idea of making us all experts of building, running, shipping, da da, da, when you really get down into the weeds and you look at how, like especially in the enterprise, people are going about doing this, it’s not quite so simple. And the reason is that, we might have changed the development culture, but we didn’t really change the organizational culture, we didn’t change how is money allocated.
Like you still have to say, “Here’s a team, we want to hire X engineers, this is our kind of capital budget, this is our operating budget.” “This is how ladders and promotions and all that work.” And that hierarchical way of building an organization is, runs kind of counter to what you need, I think, to have a full DevOps transformation. So if we accept that we can’t, or are unlikely to change our organizations to make them flatter and less hierarchical and to fix those problems. And, not everyone considers that a problem. But let’s assume that it is, we can’t fix that, so what do we have to do?
We have to come up with these constructs that help us break down those barriers in ways that don’t require organizational chop and churn. And, I think that’s SLOs for you in a nutshell. Because, they’re not something that necessarily are opinionated about how you’re organizing your teams. I can have an SRE team and observability team, for example, that is responsible for creating metrics, creating SLOs, building monitors and alerts and all that sort of stuff, but they can work collaboratively with engineers, with product managers, with customer service and customer success, with account managers. Like, you can start to pull in people from all around the organization because you’re able to center the conversation around customer value. And, that’s something that someone in CS might not necessarily care about Kubernetes. They might not necessarily care about what API framework we were using.
The customer that you’re building your product for, certainly doesn’t care about any of that. They don’t care what library you’re using, they want it to work. So by using SLOs and saying like, “Okay, here’s the thing we actually care about, we care about customer experience,” then that’s something you can take to other parts of the organization and say, “Look, you might not know or care about all these details down here, these technical details, you don’t have to, we need to align on what is the customer experience that we care about monitoring.” And then, as an SRE team or an observability team, they can come in and help to pick the things that need to be monitored from a technical standpoint to align with that business value we’re trying to create.
Swapnil Bhartiya: First of all, thanks for explaining that. Second thing is that, I was about to ask that question also, that when we do talk about SLOs. And I will ask it because really important, how will you connect or associate SLO with business success, so that we are not… Once again, it’s not about technology for the sake of technology. Yes, you’re right. Customer success folks, they don’t care about the underneath technology. To be honest with you, between you and me as well, these technology, we can love to talk as much, but in the end, we are trying to solve a specific problem. These are means to an end, so we should not get too much… I mean, we can because we love technology, but they don’t. So, can you talk about the business value that SLOs can bring to enterprise users?
Austin Parker: Yeah. What I’ve seen, and I’ve talked to a few customers at Lightstep that have done this, is they have built SLOs in as part of an overall strategy of democratizing data. And that kind of goes both ways, so some of that is going to sort of frontline tier one, tier two support and saying, “Look, you can go and you can see really detailed data about customer experience, customer journeys, you can see these SLOs, you can see these dashboards of SLOs, so that, when someone, when a ticket comes through or someone is raising an issue, you can really easily go say and filter like, okay, well, let me see this customer.” “And then boom, I can kind of get a real time snapshot of what performance is like for them.”
The other thing, beyond democratizing data, is that SLOs let you bring in a lot of different data sources. I think when we tackle this from the technical perspective, we get very caught up in what is the exact way we’re getting stuff out of the code. Are we getting traces, or metrics, or logs or whatever? But with an SLO, you can also start to tie in things like user analytics. And you can say, “Maybe our business, maybe our objective is to reduce the amount of clicks that it takes to go from point A to point B.” There was a really interesting article I read about web performance on this. It was talking about, I believe it was from someone that worked at Kroger, the grocery store, and they had gone through, and they did sort of a shootout of here’s a single page app with this framework, that framework, here’s a native app, here’s all these things.
And what they did to demonstrate the performance issues that actual customers were seeing, was they bought a bunch of very inexpensive cell phones, just really cheap cell phones. They throttled the data on them to be slow, as slow as possible. Like, you would be if you were in a rural area, or you were not on a really high powered, fancy iPhone or whatever. And then, they go through and they hand these out to everyone on their team at this performance shootout to demonstrate, look how bad this user experience is. If you are someone that is kind of, you don’t have the fanciest newest technology, you don’t have super high speed internet, this is how long it takes.
And that analytic data is something that they can track and they can track that data, put it into an SLO and say, “Okay, instead of just looking at these synthetic metrics that don’t really tell us much, we can go and start to align ourselves around, what’s the time to First Contentful? What’s the time to painting all the elements, how long until the page is responsive?” Take those metrics and say, create a SLO of, I want it to take less than a minute or less than 30 seconds to be able to kind of complete these searches for these key items, eggs, milk, bread, whatever. That’s something you can then go back to, not just say yes, not just PMs, but that’s something you can go back to the actual business with, the logistics people, the executives. And you can say, “Look, we need to invest in web performance because we’re tracking like it takes us 30 seconds to do this, but if someone uses a different competitor’s app, it takes them 10 seconds.”
What are people going to actually do at the end of the day? We want to make sure we’re capturing, not only that we’re serving our customers, but that we’re able to kind of provide a better experience compared to our competitors. Here’s a way that you can see that, see whatever, oh and you can look at it and you can understand the actual dollar value of these performance… The work that we’re doing on performance. And that to me, that’s remarkable. Because so much, so many projects I’ve seen specifically around application performance and optimization tend to get derailed because it’s not visible work to business leaders. Maybe they understand it, or that they have some conception of like, yes, I need to do this.
But without that SLO there, as a way to really concretely tie the business value to what we’re doing in the back end, you’re going to be stuck in a situation where things are a little more loosey goosey. Maybe you can show your dashboard of just like, well, we took this number down or this number went up, but without that sort of synthetic, that aggregate SLO of like, okay, this is why we’re doing this, this is how it’s making the dollars, making us more dollars, then all it takes is some sort of shift or change, someone to come in that maybe doesn’t understand the technology as well. And all those performance programs just kind of get swept aside, because you can’t demonstrate the value of them.
Swapnil Bhartiya: Excellent. Great. Once again, thank you. Now, we talked about the business value, let’s talk about another [inaudible 00:22:27] friction could be that, first of all, knowing about the right tools for… Because that could also lead to frustrations. Talk about what are the best type of telemetry that you would need, how to get the data that you need, so that once again, teams are more efficient and business value is coming out quickly versus frustration.
Austin Parker: Absolutely. I think the best way to get the telemetry to make a good SLO is through distributed tracing. And I think the best way to get distributed tracing data is to use projects like OpenTelemetry to instrument your code, to instrument your distributed system, because OpenTelemetry and distributed traces will get you actual per user level telemetry signals about how are people using the application. When I click on the checkout button, what happens? And, I can see every step in that. You need to pair that underlying telemetry with observability tooling, tools like Lightstep that can help you kind of aggregate all that data and store it, tools like Nobl9 to help you build these SLOs. But you know that’s kind of a second, not a secondary one, those are very important, but you really have to start with that high quality telemetry data tracing first, you can bring in metrics to that, you can bring in logs to that, but really traces should be the bedrock of your SLO approach.
Just because that’s going to give you the data at the right resolution, at the right place in your stack, in order to build SLOs.
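Editor's note: to make the link between traces and SLOs concrete, here is a small sketch of turning per-request spans into a journey-level SLI. It deliberately avoids any real tracing library: the `Span` tuple is a simplified, hypothetical stand-in for the span data that OpenTelemetry instrumentation would export, and the span names, timings, and `journey_durations` helper are invented for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Span(NamedTuple):
    """Simplified, hypothetical span record (not the OpenTelemetry data model)."""
    trace_id: str
    name: str
    start_ms: int
    end_ms: int


def journey_durations(spans, journey_name):
    """End-to-end duration per trace, for traces that contain `journey_name`."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)

    durations = {}
    for trace_id, trace_spans in by_trace.items():
        if any(s.name == journey_name for s in trace_spans):
            start = min(s.start_ms for s in trace_spans)
            end = max(s.end_ms for s in trace_spans)
            durations[trace_id] = end - start
    return durations


# Made-up spans: two checkout traces and one unrelated search trace.
spans = [
    Span("t1", "checkout", 0, 420),
    Span("t1", "payment", 50, 300),
    Span("t1", "inventory", 310, 400),
    Span("t2", "checkout", 0, 1100),
    Span("t2", "payment", 40, 1000),
    Span("t3", "search", 0, 90),
]

durations = journey_durations(spans, "checkout")
# SLI: fraction of checkout journeys that finished in under 500 ms.
sli = sum(1 for d in durations.values() if d < 500) / len(durations)
print(durations, sli)
```

Because each trace follows one user's request through every step, the same data that feeds the SLI can also show which step (here, the slow `payment` span in trace `t2`) made a journey miss the threshold.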
Swapnil Bhartiya: Now, we talked about picking the right tools, now let’s also talk about some of the missteps or mistakes folks tend to make when they do approach SLOs and what they should avoid there.
Austin Parker: That’s a really… That’s a good one, I got to think about this, one second. So I think the biggest mistake you can make on SLOs is trying to, there’s a sweet spot, is maybe the best way to think about them. I’ve seen people try to go and say, “Okay, we’re going to switch everything over to SLOs.” And if you have like a big enterprise system, you might have thousands, tens of thousands of alerts. You might have hundreds, and hundreds and hundreds of dashboards. Trying, sitting down and saying, “Okay, we’re going to boil the ocean, we’re going to switch all this stuff over to SLOs,” that’s a real… That’s just a painful amount of work and there’s no guarantee that people are going to like it. Like, I think the flip side of this is you can go too small. You can say, “Okay, we’re going to pick this one thing, kind of over in the corner that doesn’t get a lot of use and we’re going to implement SLOs there.”
And what you see in that case is, you didn’t really get enough data to make a good determination about how was this actually impacting my actual end user. Maybe, well, if my end user, my calling service, like I can see things there, but I think you need to kind of… It’s very much like Goldilocks and the Three Bears, you need to find the porridge that’s just right. So a good place to start is identify, maybe sit down rank a couple five, let’s say five or 10 really key interactions through multiple services in your system. They don’t necessarily have to all be… They don’t have to be like the most trafficked things, they don’t have to be like the absolute critical parts of it. They can be secondary subsystems, they can be things like account management.
They could be things like, if you want to focus internally, they can be around CI/CD. Let’s set SLOs on like, how long does it take when we push code to actually get deployed to production? But you need to pick sort of a single subsystem that has multiple interactions, multiple steps through it, that is also really well defined, something where you can sit back and say, “Okay, here’s the dollar value or maybe not the dollar value, but here’s the business value of these interactions.” Pick those and start there. And again, start with one or two of those full journeys and take it kind of slow. You want to do things in parallel, so you don’t turn off all the alerts on day one, you have your SLOs and your traditional alerts. And then start to cut over, teach people about SLOs, teach them how it’s different, make sure you’re taking notes on all this, make sure you’re going and doing trainings and you’re gradually easing people in.
And what I think you’ll find is that, as people start to see this and as you turn down those old alerts and you’re in the SLO only world, you’re going to get people really interested. And they’re going to say, “Well, Hey, how can I do this for my service?” “How can I do that?” “Hey, I’m a front end engineer and I want to do this for the web app.” Or “I want to do this for these authentication routes,” or whatever. And then, you use that to kind of continue building out. So it’s about finding those key, but not critical parts of the app in those systems, starting there, building a few and expanding out slowly, and gradually and building trust in the SLOs too. Because I think that’s the other big difference is that, this is kind of radical in terms of how it can change, how you’re prioritizing work, and how you’re prioritizing performance work versus other kinds. So, it’s really important to build trust in the measurements as you go forward.
Swapnil Bhartiya: Austin, thank you so much for taking time out today and of course, talk about SLOs. Thanks for sharing those insights and I would love to have you back on the show. Thank you.
Austin Parker: Thanks. It’s been great to be here again. See y’all next time.