How Nobl9 Hydrogen Helps Engineers Manage Technical Debt And Reduce Burnout

Guest: Brian Singer (LinkedIn, Twitter)
Company: Nobl9 (LinkedIn, Twitter)
Show: Let’s Talk

The pandemic has accelerated digital transformation and cloud adoption. The shift-left and DevOps movements are already moving a lot of responsibility onto developers’ desks. With accelerated cloud adoption and growing complexity, the cloud-native space is moving even more responsibility into developers’ pipelines. Combined with other factors, this is leading to burnout, which hurts performance and leaves organizations facing long-term challenges.

Nobl9 is a company that helps software developers, DevOps practitioners, and reliability engineers deliver reliable features faster through software-defined service level objectives (SLOs) that link monitoring, logging and tracing data to user happiness and business KPIs. The company recently announced its Hydrogen platform, which is designed to let anyone get started with implementing SLOs.

“It will help organizations and engineers reduce the amount of burnout that they’re having from things like pages going off that aren’t actual issues and manage things like technical debt more effectively to focus on the things that are causing real issues for their teams in terms of operational load,” said Brian Singer, Co-Founder and Chief Product Officer, Nobl9.

In this show, we also talked about technical debt, burnout and how Hydrogen can help organizations manage these more efficiently. Here are some of the topics we covered in this show:

Focus:

What is Hydrogen and what problems are you trying to solve with it?
There’s mass attrition and the tech industry is certainly suffering from workplace burnout. What is causing this burnout?
What leads to the creation of technical debt and is it really that bad?
What are Service Level Indicators (SLIs), how do they help companies identify the technical debt that they’re incurring, and how do they work with SLOs?
We understand the problem area. Now let’s talk about how Hydrogen, SLIs and SLOs solve some of these problems.
What are the core components of the Hydrogen SaaS platform? 
What has been the user feedback for Hydrogen?
What is the right approach to SLOs/SLIs and to using Hydrogen in the most efficient way?
What’s next in the pipeline? What does the roadmap for Hydrogen, or your whole SLO strategy, look like?

[expander_maker]

Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya, and welcome to TFiR Let’s Talk. Nobl9 has recently launched Hydrogen, a platform designed to identify customer-impacting technical debt in a way that engineers and business stakeholders can understand. To take a deep dive into this new platform and the problems companies face or inherit with technical debt, we have with us today, once again, Brian Singer, co-founder and chief product officer at Nobl9. Brian, great to have you back on the show.

Brian Singer: It’s great to be back on, Swap. Really enjoyed it last time.

Swapnil Bhartiya: Let’s start with Hydrogen and what it is. Also, can you tell me about the name? What’s the story behind it, and how does it fit the problem that you’re trying to solve?

Brian Singer: Hydrogen is our SLO platform that has been designed for anyone to get started with implementing service level objectives. What we’ve done is we’ve taken what was formerly an enterprise product that took a bit of work to get up and running, and we’re making it available as a free trial that can then be used at a much lower price point once the trial expires. For us, it’s very exciting because we’re putting these tools into pretty much every organization’s, every engineer’s hands, tools that will really help them reduce the amount of burnout that they’re having from things like pages going off that aren’t actual issues, and manage things like technical debt more effectively to focus on the things that are causing real issues for their teams in terms of operational load.

Swapnil Bhartiya: I want to quickly talk a bit about the cultural shift that is happening. We have been watching all those news reports of the mass resignations that are going on. I think the big revolution that happened was DevOps, where a lot of things are now falling into the developer’s pipeline. Earlier, they wrote something and they were done; someone else took it from there. Now they have to worry about it too. How much of a role is that playing in burnout? And then also, because of COVID, everything is moving to the cloud, everything is moving online, so a lot of things that were not in their pipelines are moving into their pipelines. So can you quickly talk about what is causing this burnout? And then we’ll talk about technical debt and the other things.

Brian Singer: Yeah, so burnout is caused by situations where folks are having to respond to pages more frequently than they can really handle. Google has done a lot of research into this, actually, and they published some findings which say the average engineer or SRE only has the capacity to respond to two pages a day, or two incidents a day, because of how long it takes to root-cause and work on each individual page or incident.

So even with an on-call rotation, if you have more incidents than that per day, you are burning out your engineers. On top of that, even if the on-call rotation is once every four weeks or once every six weeks, if engineers are getting woken up once or twice a night every time they’re on call, eventually they’re going to be looking for another job, because that’s not a situation that anybody wants to find themselves in. I think everyone understands that when you’re operating an always-on service, there are going to be issues. There are going to be things that happen in the middle of the night that you have to deal with, but that shouldn’t be something that happens multiple times per week, every week. And that’s what we’re seeing happen in a lot of different organizations that are making the transition to digital-first and always-on services.
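The capacity argument above can be made concrete with a little arithmetic. What follows is a minimal Python sketch, not anything taken from Nobl9 or Google: it takes the two-incidents-per-day figure quoted in the interview as given, while the `OnCallShift` type, the one-week shift length and the overnight-page limit are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Figure quoted in the interview: roughly two incidents per engineer per day.
MAX_INCIDENTS_PER_DAY = 2
# Hypothetical limit: more than one overnight page in a week is treated
# as "multiple times per week".
MAX_NIGHT_PAGES_PER_WEEK = 1


@dataclass
class OnCallShift:
    """One engineer's week on call (hypothetical structure)."""
    engineer: str
    incidents_per_day: List[int]  # one entry per day of the shift
    night_pages: int              # pages that woke the engineer up


def overloaded_days(shift: OnCallShift) -> int:
    """Count days on which incidents exceeded the per-day capacity."""
    return sum(1 for n in shift.incidents_per_day if n > MAX_INCIDENTS_PER_DAY)


def is_burnout_risk(shift: OnCallShift) -> bool:
    """Flag a shift with any overloaded day or too many overnight pages."""
    return (overloaded_days(shift) > 0
            or shift.night_pages > MAX_NIGHT_PAGES_PER_WEEK)


busy = OnCallShift("engineer-a", [1, 3, 0, 2, 4, 1, 0], night_pages=0)
quiet = OnCallShift("engineer-b", [1, 2, 0, 2, 1, 0, 0], night_pages=1)
print(overloaded_days(busy), is_burnout_risk(busy))    # 2 True
print(overloaded_days(quiet), is_burnout_risk(quiet))  # 0 False
```

A check like this only counts pages. It says nothing about whether those pages reflected real customer impact, which is the part SLO-based alerting is meant to address.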

Swapnil Bhartiya: Perfect. Now let’s talk about technical debt as well.

Brian Singer: Sure.

Swapnil Bhartiya: What leads to the creation of technical debt, and why should companies try to avoid it? In some cases you will say, “Hey, technical debt means my code, right? All the work that I have done.” But the problem is that once you have that technical debt, you are stuck there, which can also lead to another big problem, which is tribal knowledge, sometimes.

Brian Singer: I mean, technical debt is, I think, unavoidable in modern engineering. It comes from everything, starting with changing the platforms that we’re operating on. I remember going through an Angular two to Angular three migration, and there were very good reasons for doing that, but when you make the decision to move to a new platform like that, your entire old platform is now technical debt. Then there are things like the trade-offs that we make when we’re prototyping a new solution. We want to get something to market quickly. We don’t necessarily know what the usage pattern is going to look like, or what areas of the system are going to be stressed when the volume of usage grows. So we make those trade-offs, and that incurs technical debt. And the question is not how do we eliminate technical debt. Obviously we want to reduce it as much as possible. It’s where do we focus on it to get the best outcomes for our organizations, for our employees who have to operate these services, and for our customers who are using them.

Swapnil Bhartiya: Right. And I also want to quickly bring this up, because Nobl9 does a lot of open source: with open source, if you keep creating technical debt, you move so far away from the upstream code that when the next version comes out, you cannot easily adopt it, and the changes you want to get into upstream will not make it in. So...

Brian Singer: Absolutely. Even something as simple as changing an API interface creates technical debt, right? Because now you have to go update all the systems that depend on that API. It’s unavoidable, right? A lot of the time we have very good reasons for doing these things, and then it’s a question of what is the migration path? How quickly do I need to move to this new platform or this new code? Obviously there are different techniques, but we’ve found that service level objectives are a great way to prioritize that technical debt, because you’re actually identifying the points in your application that are the closest to breaking and are under the most stress.

Swapnil Bhartiya: I want to talk about Hydrogen in detail, but first, since you already brought up SLOs, I also want to talk about Service Level Indicators (SLIs). How do they help companies identify the technical debt that they’re incurring, and how do they work with SLOs?

Brian Singer: Sure. Well, I would posit that if it’s not impacting customers or employees, then technical debt doesn’t necessarily matter. There was a great Twitter thread last week where somebody mentioned, “Hey, my application is crashing about once a day in the Kubernetes pods, the pods are getting restarted, and it’s not showing up in my SLOs. Should I do anything about that crashing?” That’s obviously a form of technical debt. And the resounding response was, if it’s not impacting the SLOs, you don’t actually need to do anything about it. A Kubernetes pod crashing once a day is really an expected sort of event.

And so that is a great signal to be able to tell whether or not you need to go deal with that. Now, in terms of having the right SLI to tell you that, you have to have some comfort that the indicators you’re looking at for determining customer impact are the right indicators. Typically the way to start is to look at the services and do a risk assessment. Who’s impacted if the service goes down? What is the impact to that person? How do I tell if the service is going down? And then obviously we can use those SLIs to start to define some parameters: okay, how reliable do we need this indicator to be? What is the threshold for success for this particular metric?
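The relationship between an indicator, its success threshold and the objective can be sketched in a few lines of Python. This is an illustrative sketch only. The request tuples, the 300 ms latency threshold and the 99% target are made-up assumptions, and the function names are hypothetical, not part of any Nobl9 API.

```python
from typing import Iterable, Tuple

Request = Tuple[int, float]  # (HTTP status code, latency in milliseconds)


def is_good(status: int, latency_ms: float, max_latency_ms: float) -> bool:
    """A request counts as good if it did not fail server-side and was fast enough."""
    return status < 500 and latency_ms <= max_latency_ms


def compute_sli(requests: Iterable[Request], max_latency_ms: float = 300.0) -> float:
    """SLI = good requests / total requests over the measured window."""
    requests = list(requests)
    if not requests:
        return 1.0  # assumption: no traffic is treated as meeting the objective
    good = sum(1 for status, latency in requests
               if is_good(status, latency, max_latency_ms))
    return good / len(requests)


def meets_slo(sli: float, target: float) -> bool:
    """The SLO is the target the SLI is expected to stay at or above."""
    return sli >= target


sample = [(200, 120.0), (200, 280.0), (500, 90.0), (200, 450.0)]
sli = compute_sli(sample)
print(sli, meets_slo(sli, target=0.99))  # 0.5 False
```

In the pod-crash example from the interview, restarts that never produce a failed or slow request would leave an SLI like this untouched, which is why the SLO gave no reason to act.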

Swapnil Bhartiya: And as you already mentioned, technical debt is not something people should be worried about. No matter what you’re doing, you will be doing a lot of custom work internally. There will always be a soft fork internally versus upstream. The only thing is that it should not become too big. That is where the problem starts.

Brian Singer: One of the things I think every engineering organization grapples with is: I have a code base that’s a little bit messy. When should I refactor that code? Do I want to do it now, or should I wait till I touch this component? And again, if it’s not impacting customers, then you can usually put that refactoring off till a time when you’re actually going to be touching that code base anyway.

Swapnil Bhartiya: Now let’s bring together SLOs, SLIs and, of course, Hydrogen. How is this new platform going to help teams with the problems that we just talked about?

Brian Singer: Yeah. Well, one of the biggest challenges we’ve found with defining SLOs is just getting started. What we’ve done with Hydrogen is we’ve tried to build something that makes it very easy to go create your first SLO, bring those SLIs into the platform, and then have a repeatable process that allows you to scale that out to the rest of the organization by being able to define SLOs as code. So, a couple of things we focused on. One is that we have to be able to get data from the right sources.

So we support a pretty wide range of data sources, everything from Elastic, where you might have logs in an Elastic cluster that tell you something about the health of your application, to your more common sort of APM tools, to something as basic as Prometheus. The key is getting something in place quickly that can connect to those data sources, and then having a really simple, repeatable process that any engineer can understand to go define those SLOs and have it be part of how you codify services. When you codify user journeys, you’re actually building reliability into that process. So you have an understanding: this is my goal, from a reliability standpoint, for this service when I roll it out. And when I know I’ve incurred tech debt, now I have a signal to tell me if I need to go address that.
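The idea of defining SLOs as code can be illustrated with a small, self-contained Python sketch. It does not reproduce Nobl9’s actual definition format; the `SLODefinition` fields, the example values and the placeholder query text are all hypothetical. It only shows the general pattern of keeping an SLO as reviewable, validated data alongside the service it describes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SLODefinition:
    """A hypothetical SLO record that could live in version control."""
    name: str
    service: str
    data_source: str  # e.g. "prometheus" or "elasticsearch"
    query: str        # source-specific query that yields the SLI
    target: float     # required fraction of good events, e.g. 0.995
    window_days: int  # rolling evaluation window

    def validate(self) -> None:
        """Reject definitions that cannot describe a meaningful objective."""
        if not 0.0 < self.target < 1.0:
            raise ValueError(f"target must be between 0 and 1, got {self.target}")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")


checkout = SLODefinition(
    name="checkout-availability",
    service="checkout",
    data_source="prometheus",
    query="<placeholder: ratio of successful checkout requests>",
    target=0.995,
    window_days=28,
)
checkout.validate()
print(json.dumps(asdict(checkout), indent=2))
```

Because the definition is plain data, it can be reviewed in a pull request and rolled out with the service, which is the repeatable process the answer above describes.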

Swapnil Bhartiya: Now, Hydrogen is a platform. Where do customers run it?

Brian Singer: So Hydrogen is a SaaS platform. If customers want to try it out, they can go to our website, nobl9.com, and sign up for the trial. It’s entirely browser-based, and it has the ability to connect both to data sources that live behind the firewall, using agents, and directly to data sources that are running as SaaS.

Swapnil Bhartiya: What are some of the core components of Hydrogen?

Brian Singer: Sure. I think, when we look at it, it’s obviously, one, the SLO platform itself, the system that goes and ingests data. Then there’s the SLOs-as-code piece, as well as browser-based wizards that allow you to define SLOs across those different data sources. And then there’s the alerting infrastructure that we’ve built, which uses the SLOs to, again, alert if there are actual issues. That’s another piece that we didn’t really touch on, but in terms of pager fatigue, one of the challenges is alerting on incidents that are actually impacting customers.

So it’s about making sure that we only wake up engineers if something is happening that a customer would actually notice. Having alerts based on service level objectives enables you to say, “Hey, I’m not just going to wake somebody up if an error rate is elevated but it’s not impacting an actual user journey that a customer is going through.” So you then have some different layers to your alerting strategy. For some things it’s, hey, somebody needs to look at this maybe in a week or so; it’s not something we are going to fix right now. Versus, oh, we actually need to get somebody out of bed because there’s something very, very wrong happening.
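The layered alerting described above is often expressed in terms of burn rate, meaning how fast the error budget is being consumed relative to plan. The Python sketch below is a generic illustration and not a description of Hydrogen’s alerting; the `page_at` and `ticket_at` thresholds are hypothetical values picked for the example.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than planned the error budget is being spent.

    A burn rate of 1.0 uses up exactly the whole budget over the SLO window.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_rate / budget


def alert_action(error_rate: float, slo_target: float,
                 page_at: float = 10.0, ticket_at: float = 1.0) -> str:
    """Map a burn rate to an action tier (thresholds are illustrative)."""
    rate = burn_rate(error_rate, slo_target)
    if rate >= page_at:
        return "page"    # get somebody out of bed
    if rate >= ticket_at:
        return "ticket"  # somebody should look at this in the coming days
    return "none"        # elevated errors, but still within budget


# With a 99.9% target, the error budget is 0.1% of requests.
print(alert_action(0.02, 0.999))    # page
print(alert_action(0.002, 0.999))   # ticket
print(alert_action(0.0005, 0.999))  # none
```

The same raw error rate can land in different tiers depending on the objective, which is how an elevated but harmless error rate avoids paging anyone.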

Swapnil Bhartiya: Have you released it yet? Are there already users? And if there are users, what has been their feedback, which actually helps you in further improving the product?

Brian Singer: Yeah, so we have a number of users on the platform now. The feedback has been universally positive in terms of how easy it is to get up and running, connect to data sources, start defining SLOs, and have a single language to talk about reliability, sort of a common frame of reference. We deal with a lot of organizations where different teams might be using different tooling as part of their observability strategy, or there may be some legacy tooling. A very common pattern:

You have a monolith, and you might be monitoring it with one set of tools. And then you have services that are dependent on that monolith that you’re monitoring with another set of tools. Being able to come into a situation like that and provide a single view into how things are operating end to end is one of the things that we get a lot of very positive feedback on. But really it’s just, in terms of signing up for Hydrogen directly as a trial, how easy the sign-up is and how quickly you can get up and running with SLOs in really any sort of production environment.

Swapnil Bhartiya: Since you work with other companies, of course, and you are aware of how they use it: I am not asking you to share a playbook, but is there a right approach they should take so that they can better leverage this platform?

Brian Singer: Yeah, I think the most important thing is to not try to build the perfect SLO. So get started with the data that you have. Don’t try to go figure out what the latest observability tooling is, the cool, hot thing, and say, “We’re going to go put that in place, and then we’re going to go build SLOs on top of it.” Really, the data that you have today is probably enough to get quite a bit of value from SLOs and to start identifying which tech debt is problematic and reducing the amount of pager fatigue with what you have today.

And then obviously there’s great tooling that’s being built and is coming out, and you’ll have the ability to work that into your stack and into your SLO strategy. What you’ll find is that when you start to define the use cases you want to build SLOs around, you probably have some of the telemetry that you want already. There might be some other telemetry that you want to go add, but it’s really hard to figure out what to add until you’ve had those internal discussions: What are the risks to the service? What are customers actually trying to do? If I were a customer using this service, when would I call support? Those are the conversations you have up front. You build what you can, and then you just keep improving things from there.

Swapnil Bhartiya: What are the things in the pipeline? Of course, you cannot share a lot at this point, but what does the rough roadmap for Hydrogen, or your whole SLO strategy, look like?

Brian Singer: I can share a couple of things to get people excited. One is that we’re working on a really exciting feature to annotate the charts and the SLOs that we have, and that’s programmatic. So you can do things like, when you have a new release, annotate that onto a chart. It’s really cool because oftentimes the things that are impacting error budgets and burn rate are new releases. So you can see, oh, this is when that release happened, and I can see the burn rate starting to increase on this particular SLO. We’re also working on a new feature, and I won’t give away too much about it, but basically it’s a way to analyze some of your historical data and start to understand how to best formulate an SLO based on some of the things that have happened in the past. So that’s a new feature we’re working on, just to give a little teaser there. We’ll be back, I’m sure, to talk more about that when it’s ready to ship.

Swapnil Bhartiya: Brian, thank you so much for taking time out today to talk about Hydrogen and also go deeper into SLOs and technical debt. I think it was really interesting, so thanks for those insights. And as usual, I would love to have you back on the show. Thank you.

Brian Singer: Thank you. I would love to be back on. Enjoyed it.

[/expander_maker]
