Author: Yuval Lev, CTO, Senser
Bio: Yuval Lev, CTO and co-founder of the zero-instrumentation production intelligence pioneer Senser, honed his skills as a tech leader at DriveNets, a provider of cloud-native networking solutions. The company was selected for Intel Ignite’s startup accelerator in 2022.
The Kubernetes troubleshooting nightmare
Imagine the scenario: it’s 3am and you’re troubleshooting yet another Kubernetes headache. Your microservices-based web app is intermittently failing requests, leading to angry customers and plummeting revenue. Your team has been firefighting issues like this for months, ever since migrating to Kubernetes. You wanted the benefits of automated container orchestration, but instead you got endless complexity.
Sound familiar? This was the exact scenario facing a client of ours. The clock was ticking – and they needed to figure out what was going wrong. Fast.
The logs showed timeout errors calling their caching service from the web tier. But their monitoring dashboards reported no problems with the cache pods. After hours of digging, they finally discovered the root cause – the caching pods were competing for resources with co-located noisy neighbor containers, causing slowdowns only for certain requests. Without end-to-end visibility across the cluster, they wasted countless hours chasing decoy problems that magically disappeared as pods were rescheduled.
The client emerged from this crisis bruised but intact. Still, countless stories like this have convinced me that there has to be a better way than spending nights and weekends spelunking through metrics and logs, only to find solutions by accident.
We DevOps practitioners turned to Kubernetes to remove operational headaches, but instead it introduced a maze of complexity. Teams need fundamentally new approaches tailored to this environment.
The promise and perils of Kubernetes for production environments
Kubernetes has become the standard for deploying containerized applications, promising increased flexibility and resilience. It’s no wonder that 70% of organizations use Kubernetes, according to research from Red Hat. K8s abstracts infrastructure complexity so developers can easily deploy containers at scale – handling scheduling, scaling, failovers, secrets, and more.
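To make that abstraction concrete, a minimal Deployment manifest is all it takes to get scheduling, scaling, and failover handled for you. A sketch only – the name, image, and replica count below are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web              # illustrative name
spec:
  replicas: 3            # Kubernetes keeps three pods running, rescheduling them on node failure
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25   # example image
        ports:
        - containerPort: 80
```

From a handful of declarative lines, the control plane takes over placement, restarts, and rolling updates – exactly the operational work teams hoped to shed.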
This automated container orchestration seems like a panacea. But as with any new technology, what you gain in capabilities you lose in simplicity. The complexities of Kubernetes and distributed microservices can often outweigh the benefits when things go wrong.
New hurdles for troubleshooting
While Kubernetes removes some operational burdens, it introduces new challenges for debugging issues:
- Distributed systems spread across nodes inhibit end-to-end visibility. Traditional monitoring only provides visibility into individual containers or nodes, not the interactions among them. This makes it extremely difficult to trace the flow of a request across services.
- Ephemeral containers and pods disappear rapidly, making reproducing failures difficult. A pod running a misconfigured container may crash quickly, then Kubernetes will automatically restart it with a clean slate. Without capturing runtime data beforehand, the source of the failure vanishes.
- Abstracted networking fabric increases complexity for connectivity issues. Problems could originate from Kubernetes components like kube-proxy or CoreDNS, CNI plugins, firewall rules, and more. This opaque networking layer restricts insights.
- Resource sharing leads to contention and “noisy neighbor” problems that are hard to differentiate from things like memory leaks. A slow pod could be from a neighbor hogging CPU, not an app bug.
- Isolation restricts insights into the containers themselves. Because the Kubernetes environment is distributed, inspecting each instance individually is often infeasible. And since containers are ephemeral, SSHing into a pod is an unreliable way to reach key processes and logs.
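The noisy-neighbor contention described above has one well-known partial mitigation: declaring explicit resource requests and limits, so the scheduler reserves capacity and the kubelet enforces a ceiling. A minimal sketch, with a hypothetical caching container:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache            # illustrative name
spec:
  containers:
  - name: cache
    image: redis:7       # example image
    resources:
      requests:          # capacity the scheduler reserves for this pod
        cpu: "500m"
        memory: 256Mi
      limits:            # hard ceiling enforced by the kubelet
        cpu: "1"
        memory: 512Mi
```

Limits don’t solve the visibility problem – a throttled pod still looks healthy on a per-container dashboard – but they make contention bounded and diagnosable.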
Many organizations have adopted observability tools (either commercial platforms or open source solutions) to address these challenges. However, traditional observability approaches fall short in two areas: the completeness and context of the data they capture, and the enrichment of that raw data into meaningful insights.
On the data side, data remains siloed across metrics, logs, and traces, with no way to pivot from a symptom to a distributed root cause. Agents lack full visibility since they run within containers, missing the kernel, host, and network layers. And manual instrumentation is incomplete and creates overhead, so key data gets missed.
On the analytics side, traditional observability approaches focus on individual container metrics rather than system-wide correlations, limiting insight into issues stemming from inter-service dependencies. Alerts typically point to symptoms, not causes, with little automated troubleshooting. And data exploration is manual and reactive: users must hunt for answers themselves.
The net result is that too many teams find themselves as our client did when an alert storm goes off at 3am – trying to manually debug Kubernetes failures in a scenario that can feel like whack-a-mole.
The dream of AIOps
To fully unlock the potential of Kubernetes, current observability capabilities are necessary but insufficient. Organizations need solutions that go beyond traditional metrics, logs, and tracing – infusing next-generation data capture and analytics.
The goal should be solutions that automatically provide systematic visibility and precise answers – not just more alerts. Three key technologies can help:
- Lightweight data capture like eBPF reduces overhead while exposing both applications and infrastructure.
- A service topology captures runtime dynamics, enriching raw data with a contextualized graph of key user and business flows.
- Machine learning extracts correlations and patterns from massive data, surfacing actionable insights.
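Combining the last two ideas – a service topology plus anomaly detection – yields a simple root-cause localization rule: among anomalous services, the deepest one whose dependencies are all healthy is the likely culprit. A minimal sketch, with an invented call graph and invented latency samples:

```python
from statistics import mean, pstdev

# Toy service topology (caller -> callees) and latency samples per service.
# All names and numbers are illustrative, not from any real system.
calls = {"web": ["cache", "auth"], "cache": ["disk"], "auth": [], "disk": []}
latency_ms = {
    "web":   [110, 115, 112, 300, 310],
    "cache": [20, 22, 21, 180, 190],
    "auth":  [15, 14, 16, 15, 14],
    "disk":  [5, 6, 5, 90, 95],
}

def is_anomalous(series, baseline_n=3, z=3.0):
    # Compare the latest sample to a baseline window - a stand-in for the
    # ML-driven anomaly detection described above.
    base = series[:baseline_n]
    spread = pstdev(base) or 1e-9  # avoid dividing by zero on flat baselines
    return (series[-1] - mean(base)) / spread > z

anomalous = {svc for svc, series in latency_ms.items() if is_anomalous(series)}

# Localize the root cause: an anomalous service none of whose callees are
# themselves anomalous is the deepest point of the failure.
root_causes = [svc for svc in anomalous
               if not any(callee in anomalous for callee in calls[svc])]
print(root_causes)  # the web and cache slowdowns trace back to disk
```

Here web, cache, and disk all look anomalous, but walking the topology attributes the whole cascade to disk – the difference between three alerts and one answer.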
Together, these techniques could provide continuous intelligence about distributed environments, detecting anomalies and immediately identifying root causes for any issue. The dream is automatically moving from chaos to clarity.
But genuinely achieving this goal requires purpose-built solutions vs tacked-on features. It demands holistic observability across all layers, not siloed data. It needs next-gen analytics, not just dashboards.
The reality of cloud-native development
Kubernetes delivers immense capabilities but also multiplies complexity. To enjoy the benefits without drowning in confusion, teams need solutions tailored for cloud-native development. The dream of “self-operating” systems has yet to fully materialize.
But with the right innovations, we can get closer to unlocking Kubernetes’ true potential in production environments. Developers want the advantages without the headaches – smarter observability and analytics solutions can bridge these worlds. We have yet to reach the promised land, but its contours are appearing on the horizon.