
Automating Root Cause Analysis in Kubernetes Deployed Applications


Kubernetes has become a ubiquitous platform for creating, deploying, managing and scaling large distributed applications. But things can go wrong, ranging from complete failure of the application to performance and responsiveness issues or isolated problems within certain components. Often it is readily apparent when an application breaks. Sometimes, however, you may not know something is broken or impaired until you hear complaints from a customer, partner, employee, or an internal department or business unit. Variations on this happen all the time: “It was only when customers started complaining that we realized certain items couldn’t be added to the shopping cart,” or “A simple authentication problem stopped customers from being able to track their orders, and we didn’t even know for six hours,” and so on.

The underlying challenge is complexity: an almost infinite number of possible interactions can take place in a distributed environment. These can be infrastructure-related, application-related, or simply particular conditions or workflows that were never anticipated or tested. Most DevOps teams struggle with two questions:

First, how do you know if something critical is broken, ideally before it causes a problem that affects the business? Fortunately, monitoring tools can be configured to generate alerts when critical conditions or thresholds are reached. These human-built alert rules can be an effective way of catching known problems such as increased latency or a growing number of dropped sessions. However, thresholds are often crossed well after the root cause of a problem has occurred. More importantly, detecting a symptom does not mean you know why the problem occurred or how to resolve it.
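As a toy illustration of the kind of human-built alert rule described above (not any particular monitoring product; the metric name and threshold are hypothetical), a threshold check might look like this:

```python
# Minimal sketch of threshold-based alerting: fire only when a metric
# stays above a limit for several consecutive samples. The metric
# (p95 latency) and the 500 ms threshold are hypothetical examples.

def check_thresholds(samples, limit, consecutive=3):
    """Return True if `samples` ends with `consecutive` values above `limit`."""
    if len(samples) < consecutive:
        return False
    return all(s > limit for s in samples[-consecutive:])

latency_ms = [120, 135, 510, 530, 560]  # recent p95 latency samples
if check_thresholds(latency_ms, limit=500):
    print("ALERT: p95 latency above 500 ms for 3 consecutive samples")
```

Note the limitation the article points out: by the time the last three samples breach the threshold, the root cause may already be well in the past.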

This leads to the second question: when something breaks, what was the root cause? Logs often contain the most complete record of what happened, but traditional “search and drill-down” techniques are cumbersome and slow for two reasons: the sheer volume and noisiness of log streams from multiple containers, pods, nodes and clusters; and the innately challenging nature of log-based troubleshooting – typically you don’t know exactly what you’re looking for. The best troubleshooters have instincts that let them spot unusual events and subtle correlations with errors and warnings, but this is a time-consuming and stressful process. When a critical problem hits, the clock starts ticking and urgency mounts. Understanding the problem often takes an “all-hands-on-deck” approach, with DevOps and engineering spending countless hours hunting through log data to piece together what happened. The pressure is on.
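To see why “search and drill-down” scales so poorly, here is a minimal sketch of the manual approach: merge log lines from several containers and grep for error-level keywords. The log format, pod names and keywords are all hypothetical; real log streams run to millions of lines, which is exactly the problem.

```python
# Sketch of manual drill-down: filter merged container log lines for
# error/warning keywords. All log content below is made up.

import re

logs = [
    "2021-10-11T10:01:02 cart-6f9 INFO request ok",
    "2021-10-11T10:01:03 auth-7d2 ERROR token validation failed",
    "2021-10-11T10:01:04 cart-6f9 WARN upstream timeout calling auth",
]

pattern = re.compile(r"\b(ERROR|WARN)\b")
suspects = [line for line in logs if pattern.search(line)]
for line in suspects:
    print(line)
```

Even after filtering, the engineer still has to connect the auth failure to the cart timeout by hand – the correlation step is where the instinct (and the time) goes.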

Fortunately, there is hope when it comes to determining the root cause of critical software failures in Kubernetes deployments. Machine learning (ML) can automatically detect correlated clusters of anomalies and errors within logs. Because ML is much faster and more scalable than humans working with traditional logging and monitoring tools, mean time to resolve (MTTR) can be reduced from hours to minutes. Rather than disrupting already overloaded development and DevOps teams, machine learning lets them focus on other work and spares them the disruption caused by application incidents.

Machine learning for root cause determination works by monitoring logs in real time. It doesn’t require manual training and can achieve high accuracy within a few hours. The ML first learns the structure, normal patterns and correlations of the logs. It then looks for abnormal clusters of anomalous patterns across the logs, which can be used to detect software problems and find their root cause. In essence, the ML can distill millions of log lines down to just a few (typically 5-20) that best explain what happened.
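The article describes this approach only at a high level, so the following is a rough sketch of the general idea rather than any vendor’s actual algorithm: reduce each log line to a “template” by masking the variable parts, learn how often each template normally occurs, and surface the rare ones. Real systems also correlate anomalies across streams and over time; this toy version scores rarity only.

```python
# Toy log-anomaly sketch (an assumption, not a real product's method):
# 1) mask numbers/ids so lines collapse into templates,
# 2) count template frequency over the stream,
# 3) flag lines whose template is rare.

import re
from collections import Counter

def template(line):
    """Mask variable parts (hex ids, numbers) to get the log line's 'shape'."""
    return re.sub(r"0x[0-9a-f]+|\d+", "<*>", line.lower())

def rare_lines(lines, max_count=1):
    counts = Counter(template(l) for l in lines)
    return [l for l in lines if counts[template(l)] <= max_count]

logs = (
    ["request 123 served in 45 ms"] * 1000        # normal, high-frequency
    + ["disk write failed on volume 7: io error"]  # the one unusual event
)
print(rare_lines(logs))  # only the unusual line survives the distillation
```

This is the distillation the article describes: a thousand routine lines collapse into one template and drop out, leaving the handful of lines that best explain what happened.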

Modern DevOps organizations drive excellence, efficiency and effectiveness by automating as many tasks as possible. Until now, root cause identification remained one of the few tasks that couldn’t be automated. But thanks to ML, when critical software incidents occur, DevOps can take the famous advice from The Hitchhiker’s Guide to the Galaxy – don’t panic – and calmly resolve the problem in the shortest possible time.

Author: Ajay Singh, CEO, Zebrium
Bio: Ajay Singh is a strong advocate for creating products that “just work” to address real-life customer needs. As Zebrium CEO, he is passionate about building a world class team focused on using machine learning to find the root cause of software problems through automated log analysis. Prior to Zebrium, Ajay led Product at Nimble Storage from concept to annual revenue of $500M and over 10,000 enterprise customers. Ajay started his career as an engineer and has also held senior product management roles at NetApp and Logitech.

To hear more about cloud native topics, join the Cloud Native Computing Foundation and cloud native community at KubeCon+CloudNativeCon North America 2021 – October 11-15, 2021