AI Infrastructure

AI Observability Noise: Why Kubernetes Plus AI Creates Chaos, Says Akamai’s Danielle Cook


Guest: Danielle Cook (LinkedIn)
Company: Akamai
Show Name: An Eye on AI
Topic: Observability

Observability was already a challenge in Kubernetes environments. Now AI workloads have entered the picture, and the noise levels are becoming unmanageable. Danielle Cook, Senior Product Marketing Manager at Akamai and CNCF Ambassador, addresses a growing pain point that platform engineering teams are grappling with at scale. When AI becomes a workload rather than just a tool, observability demands multiply across every layer of the infrastructure stack.

AI workloads generate exponentially more signals than traditional applications. Cook emphasizes that observability is necessary at every single part of the stack, from infrastructure to cloud to application layer. The problem is not a lack of data. The problem is too much data with no clear way to separate signal from noise. Teams are collecting metrics everywhere but struggling to identify what actually matters for business outcomes and system reliability.

The complexity compounds when AI runs on Kubernetes. Cook points out that teams using Kubernetes have already accepted a certain level of complexity. They are managing orchestration, networking, storage, security policies, and countless add-ons that must work together seamlessly. Introducing AI workloads on top of this foundation does not just add another layer. It fundamentally changes the observability equation.

Traditional observability tools were built for more predictable application behavior. AI workloads behave differently. Model inference patterns vary based on input. Training jobs consume resources in spikes. GPU utilization does not follow the same patterns as CPU metrics. Teams need to understand not just whether the AI workload is running, but whether it is running efficiently, whether it is delivering accurate results, and where bottlenecks are occurring in the inference pipeline.
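To make that concrete, here is a minimal sketch of one common pattern, not anything specific to Akamai or to Cook's examples: instrumenting an inference service with the open-source prometheus_client library so request latency and GPU load are exposed as scrapeable metrics. The metric names and the get_gpu_utilization() helper are illustrative placeholders.

```python
# Hypothetical instrumentation sketch for an inference service.
# Metric names and get_gpu_utilization() are placeholders, not a real product's schema.
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

# Request-level latency, bucketed so tail behavior of the inference path stays visible.
INFERENCE_LATENCY = Histogram(
    "model_inference_latency_seconds",
    "Time spent serving a single inference request",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

# GPU utilization as a gauge, since it spikes rather than trending like CPU load.
GPU_UTILIZATION = Gauge(
    "model_gpu_utilization_ratio",
    "Fraction of GPU capacity in use by the model server",
)


def get_gpu_utilization() -> float:
    """Placeholder: a real service would query NVML or a DCGM exporter."""
    return random.uniform(0.2, 0.9)


def handle_request(payload: str) -> str:
    """Serve one inference request while recording latency and GPU load."""
    with INFERENCE_LATENCY.time():
        GPU_UTILIZATION.set(get_gpu_utilization())
        time.sleep(random.uniform(0.02, 0.3))  # stand-in for actual model inference
        return f"prediction for {payload!r}"


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes metrics from :8000/metrics
    while True:
        handle_request("sample input")
```

In a real deployment the gauge would be fed from NVML or a DCGM exporter rather than a random stub, and the histogram buckets would be tuned to the model's actual latency profile.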

Cook notes that the cloud native community is actively working to simplify this challenge. The reality is that no one has all the answers yet. Platform engineering teams are experimenting with different approaches to tame the observability chaos. Some are layering AI-driven analytics on top of existing observability platforms. Others are building custom dashboards focused specifically on AI workload performance. The common thread is that everyone is searching for ways to cut through the noise and focus on metrics that drive real decisions.
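As a simplified illustration of that "cut through the noise" idea, the snippet below applies a rolling z-score filter to a latency series so only sharp deviations from recent behavior surface as alerts. It is a generic statistical sketch, not a reconstruction of any particular vendor's analytics layer.

```python
# Generic noise-reduction sketch: flag only samples that deviate sharply from
# recent behavior instead of alerting on every threshold crossing.
from collections import deque
from statistics import mean, pstdev


def anomalies(samples, window=30, threshold=3.0):
    """Yield (index, value) pairs more than `threshold` standard deviations
    from the rolling mean of the preceding `window` samples."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        history.append(value)


if __name__ == "__main__":
    # Steady inference latency with one genuine spike buried in routine jitter.
    series = [0.12 + 0.01 * (i % 3) for i in range(100)]
    series[70] = 0.95
    print(list(anomalies(series)))  # -> [(70, 0.95)]
```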

What makes this particularly urgent is that AI workloads are not optional experiments anymore. Companies are running inference at the edge, deploying models in production, and building business-critical applications on AI infrastructure. When observability fails in these environments, the impact is immediate and measurable. Slow inference times translate to poor user experiences. Undetected model drift leads to inaccurate predictions. Resource inefficiencies drive up cloud costs without anyone noticing until the bill arrives.
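On the drift point specifically, one lightweight check, sketched below under illustrative assumptions rather than drawn from the conversation, compares the distribution of recent production prediction scores against a reference sample with a two-sample Kolmogorov-Smirnov test and raises a flag when the two diverge.

```python
# Illustrative drift check: synthetic data and a fixed p-value cutoff,
# not a prescribed drift policy from the episode.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Reference: prediction scores captured when the model was last validated.
reference_scores = rng.normal(loc=0.6, scale=0.10, size=5_000)
# Production: recent scores that have quietly shifted lower.
production_scores = rng.normal(loc=0.5, scale=0.12, size=5_000)

result = ks_2samp(reference_scores, production_scores)
if result.pvalue < 0.01:
    print(f"Drift suspected: KS statistic={result.statistic:.3f}, p={result.pvalue:.3g}")
else:
    print("Prediction distribution looks consistent with the reference sample.")
```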

The shift from automation to intelligence in observability pipelines represents the next evolution. Automation can collect metrics and trigger alerts. Intelligence can interpret patterns, predict failures, and recommend optimizations. But Cook is candid about the fact that this transition is still underway. Teams are building the plane while flying it, trying to observe AI workloads with tools that were not designed for this purpose.
