
What Inspired Komodor ‘Workflows’ and How it Can Simplify Kubernetes Troubleshooting


Guest: Ben Ofiri
Company: Komodor
Show: Let’s Talk

Komodor was created from the founders’ own personal pain. Ben Ofiri, CEO and co-founder of Komodor, and his partner were developers for many years. Ofiri worked at Google as a software developer, and his partner, Itiel Shwartz, worked for big enterprises like eBay as well as smaller startups. The pain they experienced centered on being on-call engineers and getting alerts at 3 AM or 3 PM. They realized that, even with all of the monitoring tools and log solutions in place, the first thing that came to mind, according to Ofiri, was “What the heck changed? Because five minutes ago, everything was great.” As an engineer, you need the immediate context of what changed and what happened in the system so the problem can be traced to its root cause.

To that, Ofiri says, “Unfortunately, the monitoring and logging tools are not meant for this use case.” Those tools do a great job of monitoring your system or managing your logs. But when you have an alert in Kubernetes, which is a very fragmented, distributed, and complex system, what you really need is something that understands the relationships between all of the different components that can go wrong, and that can tell you which components are faulty and which are most relevant to the specific alert you’re trying to troubleshoot.

Ofiri says he and Shwartz independently developed internal tools at their respective companies to mitigate this pain. “Once we talked about it in some random coffee shop, almost two years ago, we realized that this pain is shared across different teams, different companies, and it might make sense there would be a generic or cloud platform that can solve it instead of miserable developers like us trying to develop in-house tooling to solve this pain.”

The conversation then shifts to observability. Ofiri believes both observability and monitoring tools are mandatory, and there are some very powerful open-source solutions, such as Grafana and Prometheus, as well as managed offerings. However, what Komodor experienced firsthand was that average organizations have hundreds or thousands of alerts, and their IT staff need to drop everything else every time an alert is raised “to query different tools, to correlate information from different systems, to basically become an expert in Kubernetes or a different environment, just to understand what changed, what’s the root cause of each one of those alerts.” Those same staff members also need to take action to resolve the issue (such as rollbacks, restarts, or increasing memory). To that end, Ofiri says, “What we see is that monitoring and logging are mandatory, and are a huge piece of troubleshooting, but they’re definitely not enough. And when we think about troubleshooting, we always think about the three pillars: understand, manage, and prevent.” So when you have an alert, you need to understand what’s going on, what happened recently in the system, the different contexts, the business logic, and the Kubernetes and infrastructure logic.

Ofiri believes chaos engineering is a novel approach and is fascinated to see whether it becomes the standard. Ideally, tests would solve all of our issues, but even testing has limitations, so there need to be end-to-end tests, integration tests, and tests at different layers. Ofiri sees chaos engineering as the next layer of tests: some things you can’t check in staging or with a canary, so you must test in production. But even as you add layers, you are not fully protected; issues will still occur. They may become rarer, and the ones that remain may be more severe.

But even with testing, you can wind up with a never-ending loop. “If you feel more confident because you have less issues, you’re going to move faster. And once you move faster, you’re going to make more changes. Once you make more changes, you’re going to have more issues in your production.”

Komodor brings an important announcement to KubeCon: Workflows. “People tend to think about Kubernetes as maybe one thing, but in fact, it has thousands of different pieces, right? So it has pods and nodes and services and clusters and load balancers and jobs and Ingress and CRDs, et cetera, et cetera, et cetera. And all of those components relate to each other in one way or another. Right?” Ofiri adds, “Once we saw how much expertise and experience is lacking in R&D organizations, we figured out that a very good product or a very good solution will have to provide them not only a way to observe the status of the different Kubernetes resources but also a single place to conduct very comprehensive and complex queries and checks on top of Kubernetes that they lack the knowledge to do.”

To that end, Komodor wants to democratize knowledge that is currently very sparse in every organization. It does so by baking that information and expertise into Workflows, which automate Kubernetes troubleshooting. This means that when an on-call developer (or a new developer who has no idea how to run these checks) gets an alert, Komodor is already running 20 different checks on that alert to figure out what happened, so the developer or IT staff won’t have to hunt through different tabs or tools to track down where the problem started.
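To make the idea of automated checks concrete, here is a minimal sketch of one kind of check such a workflow might run. It is not Komodor’s implementation; the namespace and label selector (“production”, “app=checkout”) are made-up values, and the script simply uses the official Kubernetes Python client to decide whether unhealthy pods are concentrated on a single node (suggesting a node problem) or spread across nodes (suggesting a workload problem).

```python
# Hypothetical illustration only (not Komodor's code): one automated check a
# troubleshooting workflow could run when a service alert fires.
# Requires the official client: pip install kubernetes
from collections import defaultdict

from kubernetes import client, config


def pod_or_node_issue(namespace: str, label_selector: str) -> str:
    """Guess whether unhealthy pods point at a workload problem or a node problem."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()

    total_by_node = defaultdict(int)      # pods selected by the alert, per node
    unhealthy_by_node = defaultdict(int)  # of those, how many are not Running/Ready

    pods = core.list_namespaced_pod(namespace, label_selector=label_selector).items
    for pod in pods:
        node = pod.spec.node_name or "<unscheduled>"
        total_by_node[node] += 1
        ready = all(cs.ready for cs in (pod.status.container_statuses or []))
        if pod.status.phase != "Running" or not ready:
            unhealthy_by_node[node] += 1

    if not unhealthy_by_node:
        return "no unhealthy pods found for this selector"

    # If every selected pod on some node is unhealthy, suspect the node itself.
    for node, bad in unhealthy_by_node.items():
        if node != "<unscheduled>" and bad == total_by_node[node]:
            return f"possible node issue: all {bad} selected pod(s) on {node} are unhealthy"
    return "unhealthy pods are spread across nodes; more likely an image, config, or probe issue"


if __name__ == "__main__":
    # "production" and "app=checkout" are made-up values for the example.
    print(pod_or_node_issue("production", "app=checkout"))
```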

The summary of the show was written by Jack Wallen.


Here is the rough, unedited transcript of the show…

Swapnil Bhartiya: Swapnil Bhartiya here and welcome to TFiR Newsroom. Today we have with us Ben Ofiri, CEO and co-founder of Komodor. We have been covering Komodor regularly here at TFiR, so our audience does know about the company, but since we are hosting you for the first time, I want to hear from you, I want to get your perspective. What is Komodor all about, and why did you create it?

Ben Ofiri: So we actually created it from our own personal pain. So both me and my partner and other co-founder Itiel were developers for many years. I actually worked for Google for many years as a software developer. And my partner had different experience working for big enterprises like eBay, and then smaller Israeli startups.

And actually the pain that both of us felt as on-call developers, DevOps engineers, basically we’re the kind of guys who used to get the alert at 3:00 AM or 3:00 PM. And what we realized is that even though we have all of those monitoring tools in place and log solutions, et cetera, given an alert, usually the first thing that comes to your mind is, wait, what the heck changed? Because five minutes ago, everything was great. But you need to get this immediate context of what changed, what happened in the system, so you can trace the root cause and understand how to fix it.

And unfortunately, the monitoring and logging tools are not meant for this use case. So they are doing a great job in monitoring your system or providing you management for all of your logs. But when you have an alert specifically in Kubernetes, which is a very fragmented and distributed and complex system, basically what you really want is some place to understand the different relations between all of the different components that can go wrong, and then someone that can tell you which components are actually faulty and, out of all of that noise, what is the most important or most relevant to the specific alert you’re trying to troubleshoot.

And this pain was so hard for us that both of us actually, independently, developed internal tools for our companies to mitigate this pain. And once we talked about it in some random coffee shop, basically almost two years ago, we realized that this pain is shared across different teams, different companies, and it might make sense there would be a generic or cloud platform that can solve it instead of miserable developers like us trying to develop in-house tooling to solve this pain.

Swapnil Bhartiya: If you look at the whole observability space, monitoring, logging, as you were saying, there are good existing tools; they tell you there’s something there, but you need to take an action also, right? That’s what you said, your team was there to fix it. So what kind of evolution do you see in the whole metrics, monitoring, logging, observability space, so that we are also talking about understandability and actionability?

Ben Ofiri: Yeah, yeah. This is a great question. So one thing I can say about observability and monitoring tools is that they’re mandatory, everyone understands their impact, and I can say that it’s almost a commodity these days.

So we have great solutions, either open-source tools, right, like Grafana, Prometheus, et cetera, or of course managed solutions. You know, I don’t want to name-drop, but you can all guess which tools are dominating the category, and those are providing great value for the users. In fact, before them, the customers of those users were the ones to complain when there was an issue. Now, instead of this, at least they’re getting an alert that there’s some degradation and they need to check it. So their value is tremendous.

But what we know, what we see, what we experienced firsthand is that an average organization has hundreds or thousands of alerts, right? Now they need to drop everything else every time there is an alert and do the work themselves: query different tools, correlate information from different systems, basically become an expert in Kubernetes or a different environment, just to understand what changed and what’s the root cause of each one of those alerts.

And then of course, they need to take actions. Sometimes those actions will solve the issue, right? Sometimes you do a rollback and it solves the issue. Sometimes the action is softer. Maybe you will increase the memory and you hope the issue won’t reoccur in the system, right? Maybe you do a restart for some machine and you hope that this will solve the problem; maybe it was a transient issue. So you also need to take a lot of actions, but this as well requires a lot of understanding, of expertise. You obviously don’t want to take an action that might aggravate the symptoms and the issue instead of solving it.

So basically what we see is that monitoring and logging are mandatory, are a huge piece of troubleshooting, but they’re definitely not enough. And when we think about troubleshooting, we always think about the three pillars: understand, manage, prevent. Meaning, when you have an issue, when you have an alert, you need to understand really well what’s going on, what happened recently in the system, the different contexts, the business logic and the Kubernetes or infrastructure logic combined. Then you need to manage the issue, right? Like taking action, maybe communicating with your team members, maybe doing a revert or rollback, et cetera. And then you need to make sure it won’t reoccur in the system, meaning you need to prevent similar issues from reoccurring in your system again. And what we see is that currently those three things are being handled by three, five, seven different tools and different team members. And it’s very, very inefficient and takes a lot of time and resources from the organizations.

Swapnil Bhartiya: If you look at it from a troubleshooting perspective, how much of a role does culture play? We talk about chaos engineering, you know, where you bring the teams together and throw things at them. I mean, it’s not really that chaotic, it’s very planned, but does that also play any role there? When it comes to troubleshooting, so that your teams are actually prepared: hey, this might go wrong and this is how you handle it.

Ben Ofiri: I think chaos engineering, first of all, is a novel approach, and it’s super interesting; it will be super interesting to see if it will become the standard or not.

I will even take maybe a simpler concept, right. Tests, right. Tests, ideally, would solve all of our issues, right? You should just test it. But we all know that there are limitations. So we need to have end-to-end tests and integration tests and different layers of tests. The way I see chaos engineering is maybe like the next layer of tests, right? Some things you can’t check in staging or canary, so you must use production, right. But even though you add more layers, you’re not fully secure. You will still have issues. Maybe the issues that we’ll have will be more severe. Maybe they will be a bit more rare.

But the way I see it, if you feel more confident because you have fewer issues, you’re going to move faster. And once you move faster, you’re going to make more changes. Once you make more changes, you’re going to have more issues in your production. So it’s like a never-ending loop of moving faster, having more issues in production, fixing them faster. I don’t think that it will ever end with a system that doesn’t have any issues. Because if I’m the VP of R&D and I’m saying to my developers, look, it seems that our system is too stable, it probably means they are not moving that fast.

I just read Google’s report, you know, they release once a year a DevOps report about the ecosystem, et cetera. And from their report, the most sophisticated, they call it, I think, the platinum group of DevOps, the best developers, the best organizations, between five to 10% of all of their changes lead to issues. Meaning they push 1,000 changes, they know that some 50 of them will have some bug, and they take it into consideration. It’s okay, it’s part of the drill, right? It’s part of the ecosystem we live in.

So if you ask me, issues will always be a problem. Alerts and incidents will always be something that developers and DevOps engineers will need to handle. And we have to equip them with the right tools and expertise to do that.

Swapnil Bhartiya: Right, and that brings me to my next question, which is also the theme of today’s discussion: as you alluded to earlier, with cloud native, and especially Kubernetes, things get complicated very quickly, and automation plays a very big role there. So if we just look at troubleshooting, you folks are announcing Workflows to automate troubleshooting. You did talk about it a bit, so I can see it, but if I ask you, what was the driver behind this announcement? And then we’ll talk about what it is and how it works.

Ben Ofiri: Sure. So as we probably all know, Kubernetes is a very complex and distributed system, right? So people tend to think about Kubernetes as maybe like one thing, but in fact, it has thousands of different pieces, right? So it has Pods and Nodes and Services and Clusters and Load Balancers and Jobs and Ingress and CRDs, et cetera, et cetera, et cetera. And all of those components relate to each other in one way or another. Right?

For example, between Pods and Nodes, there is a many-to-many relation. So when you have an issue in a Pod, observability might not be enough, because it will show that the Pod does not have enough replicas available. Great, but why? Is it a Node issue? For example, are all of the Pods that are running on the same Node having the same issue? Then it might be a Node issue, right. But how can you check it easily?

Or maybe, maybe it’s a Probe issue, right? Maybe someone just changed the configuration of the Probe of the Pod, and this is why your Pod is having some issues. So when you have an issue, you need to examine so many different components, and you need to have so much expertise and knowledge about how Kubernetes operates, that just giving a way to observe, or visibility upon, all of those resources might not be enough for most organizations.

So once we saw how much expertise and experience is lacking in R&D organizations, we figured out that a very good product or a very good solution will have to provide them not only a way to observe the status of the different Kubernetes resources, but also to allow them basically a single place to conduct very comprehensive and complex queries and checks on top of Kubernetes that currently they lack the knowledge to do.

So what they’re doing usually is they’re forced to escalate to the head of DevOps or to the senior R&D team member who knows how to do those checks, who knows how to do those queries on the different components and how to correlate this information. But what we want to do is basically to democratize this knowledge that currently is very, very, very sparse in every organization. So we do that basically by taking all of this information we have, and all of the knowledge and expertise that our customers have, and offering Workflows that already have all of the expertise and knowledge we have about Kubernetes troubleshooting baked in, in an automated way. So once an on-call developer, a very newbie developer that has no idea how to do those things, gets the alert, Komodor is already running 20 different checks on this specific alert and figuring out what happened, so he won’t even need to explore different tabs or different tools.
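To make the Probe scenario above concrete, here is a minimal sketch of another check an automated workflow could run; it is not Komodor’s implementation, and the Deployment and namespace names (“checkout”, “production”) are hypothetical. It uses the official Kubernetes Python client to compare the liveness probe between the two most recent ReplicaSet revisions of a Deployment, flagging a recent probe change as a likely culprit.

```python
# Hypothetical illustration only (not Komodor's code): check whether the latest rollout
# of a Deployment changed its liveness probe, one of the "probe issue" scenarios above.
# Requires the official client: pip install kubernetes
from kubernetes import client, config


def liveness_probe_changed(namespace: str, deployment_name: str) -> bool:
    config.load_kube_config()
    apps = client.AppsV1Api()

    dep = apps.read_namespaced_deployment(deployment_name, namespace)
    selector = ",".join(f"{k}={v}" for k, v in dep.spec.selector.match_labels.items())

    # ReplicaSets created by a Deployment carry a revision annotation; sorting by it puts
    # the previous and current pod-template revisions at the end of the list.
    replica_sets = apps.list_namespaced_replica_set(namespace, label_selector=selector).items
    replica_sets.sort(
        key=lambda rs: int(rs.metadata.annotations.get("deployment.kubernetes.io/revision", "0"))
    )
    if len(replica_sets) < 2:
        return False  # nothing to compare against

    def probes(rs):
        return [c.liveness_probe for c in rs.spec.template.spec.containers]

    # The client's generated models compare field by field, so != catches any probe change.
    return probes(replica_sets[-1]) != probes(replica_sets[-2])


if __name__ == "__main__":
    # "production" and "checkout" are made-up names for the example.
    if liveness_probe_changed("production", "checkout"):
        print("liveness probe changed in the most recent revision; suspect the probe config")
```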

Swapnil Bhartiya: Can you talk about what change intelligence is? Everybody defines it in their own way. How would you define it, and how much of a role does it play at Komodor?

Ben Ofiri: When we started, a year ago, a year and a half ago, we saw the role that metrics have and that logs have, but what was really missing was changes, right? And I think it makes sense that it was missing, because if we think about it, like five or 10 years ago, a company used to do a release once a quarter. Okay, so it’s not that hard to track quarterly releases, right? Like, you know, it’s February, so nothing happened lately, right.

But since companies moved to a CI/CD model and started to really move fast with all of the microservices and Kubernetes, et cetera, they started to make many more changes. And when we say changes, it can be code changes, it can be configuration changes, it can be infrastructure changes, it can be DB changes, it can be feature flags that changed. And unfortunately it’s all of the above, right? So you have a mix of all of those changes. And the interesting thing about changes is that in 85% of the cases, they are responsible, or they are indeed the root cause, for issues or for incidents in modern organizations.

So, on one hand you have tons of changes, and you know that this is probably the root cause, or has a lot to do with the root cause, of most of your issues. But on the other hand, keeping track of and having visibility into all of those changes is almost mission impossible using, for example, just a log solution, right? Like, you open your log solution and you have three terabytes of logs; now good luck understanding what really changed, in which component, and what to do with this information.

So in our opinion, change intelligence is not only keeping track of all of the changes; it’s how you can make something smart out of it, right? How can you take a change in AWS that affected a security group, trace which machines it affected, then understand which Kubernetes components use those machines, then track it back to some deployment that just happened, and then understand that this is why a Jenkins Job failed, right?

So taking those changes from the different tools and components and correlating them together to construct a coherent story, this is what we call change intelligence, right? This capability, this notion. And the way we see it, without change intelligence, you’re going to lose the battle, right? You’re going to chase 10 different tools every time. You’re going to have your on-call developer, your DevOps engineer, constantly firefighting instead of innovating and developing, which is what you want them to do.

Swapnil Bhartiya: You were also initially talking about the complexity of Kubernetes and how tools are evolving. So if I ask you, what kind of trends are you seeing? I mean, Kubernetes is a huge space, so I’m not talking about trends in general, but especially in this troubleshooting monitoring space.

Ben Ofiri: Yeah. So I think two trends that we’re seeing now are, one, vast adoption of Kubernetes also among enterprises and big, big companies that until now stood on the fence and tried to see where it’s going. I think now it’s safe to say that Kubernetes is enterprise-ready in terms of security, scale, et cetera, and adoption. So we obviously see vast Kubernetes adoption.

With this, with the Kubernetes adoption, we see a very significant trend of adopting Kubernetes-native tools. You probably heard about ArgoCD; like two years ago, when we explored ArgoCD, it was a small project that only a few people knew about. Now we see many customers adopting ArgoCD in order to make Kubernetes-native deployments. So Kubernetes is not only bringing itself, but also a new set of tools that are Kubernetes-native.

So of course we also see a very high adoption of Prometheus, which is not only for Kubernetes, but has very good built-in support for microservices and Kubernetes specifically.

So we definitely see, you know, companies now trying to find tools that are native to Kubernetes. I think this is where Komodor obviously fits in very nicely with this trend. I think organizations understand that Kubernetes is not only Docker orchestration. It’s not a trend that is going to pass. It’s basically the operating system for most cloud operations. And you can see how the security guys, the DevOps, the developers, the SREs, the NOC, all of them are aligned that Kubernetes is going to be the heart, and the different tools, the monitoring, the management, et cetera, need to integrate with Kubernetes. And obviously doing that is not an easy challenge. And this is where tools like Komodor or other tools can fit in and provide a lot of value for those organizations.

Swapnil Bhartiya: Ben, thank you so much for taking time out today to talk about not only these trends and the whole evolution of observability, but also Workflows. And I look forward to our next conversation. Thank you.

Ben Ofiri: Thank you so much for having me.