Applying Three Cloud-Native Lessons to Your AI Approach

Author: Rachel Revoy, Sr. Product Marketing Manager, SolarWinds
Bio: Rachel Revoy is a Sr. Product Marketing Manager at SolarWinds, a global software company that helps IT professionals monitor, manage, and secure their networks, systems, and applications across hybrid and multi-cloud environments. Rachel has 7 years of product marketing experience in the tech sector, translating complex enterprise solutions into clear customer value and driving successful go-to-market strategies.

The cloud-native community knows operational chaos intimately. Practitioners have earned their expertise through years of migration, from monolithic architectures to microservices, and from on-premises data centers to the cloud. Through these demanding transitions, three key learnings have emerged: the necessity of a unified view to manage the complexities of distributed systems, the prudence of avoiding a rush to containerize every application, and the knowledge that data quality is foundational to all efforts. No matter the platform or tooling, meaningful progress depends on reliable, contextual information.

Now, as organizations explore AI adoption, we shouldn’t set those hard-won lessons aside. Especially at a time when operational resilience is a growing priority, applying what we’ve already learned can help ensure AI systems are not just functional, but adaptable, dependable, and aligned with the way teams actually work. The principles that shaped cloud-native success, namely observability, data quality, and incremental change, can also guide how we design, deploy, and scale AI. Rather than treat AI as a separate or novel challenge, we have an opportunity to build on the operational excellence we’ve already developed to create systems that are resilient, sustainable, and ready for what’s next.


Lesson 1: Solving the Distributed System’s Observability Paradox

When microservice applications failed, the issue was often a cascade of events across dozens of services, which created the Observability Paradox: traditional monitoring tools designed for monolithic systems couldn’t provide a clear, unified view of what was happening. This fragmented visibility introduced delays, significantly increasing mean time to resolution (MTTR).

We face this paradox again in AI systems. Non-linear models and issues like data drift or silent degradation produce a similar cascade of changes that resists simple measurement. The solution isn’t more data, but more context. We learned to look beyond individual components to the relationships between them, recognizing that the most crucial piece of observability is the context of change, or events. This context can range from system configuration changes to CI/CD deployments and feature flag toggles. Because these everyday changes are the most common causes of issues in production systems, capturing them and correlating them with our existing metrics, logs, and traces is vital for moving from reactive firefighting to a proactive, hypothesis-driven troubleshooting approach.

Today, AI-powered systems can enhance that visibility by automatically correlating events across environments and surfacing patterns teams might miss. But the effectiveness of that insight still depends on designing systems around clear context.
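
As a concrete illustration, here is a minimal sketch using the OpenTelemetry Python SDK that records a deployment and a feature-flag toggle as a span with structured attributes, the kind of change event that can later be correlated with metrics, logs, and traces. The version, commit, and flag names are hypothetical placeholders, and the console exporter stands in for whatever backend you actually use.

```python
# Minimal sketch: emit a "change event" (deployment + feature-flag toggle)
# via the OpenTelemetry Python SDK. All values below are illustrative only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# A basic tracer that prints spans to the console; a real setup would
# export to the observability backend already in use.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("change-events")

# Record the deployment as a span annotated with its change context
# (version, commit, feature-flag toggle) so it can be correlated with
# the metrics, logs, and traces of the affected services.
with tracer.start_as_current_span("deployment") as span:
    span.set_attribute("deployment.version", "2025.11.1")   # illustrative
    span.set_attribute("vcs.commit.sha", "abc123")           # illustrative
    span.add_event(
        "feature_flag.toggled",
        {"feature_flag.key": "new-recommendations", "feature_flag.enabled": True},
    )
```

Emitted this way, the change shows up in the same timeline as the symptoms it may have caused, which is exactly the context a hypothesis-driven investigation needs.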

Lesson 2: Good Data is Still the Priority

The ability to use AI tools for focused problem detection and accurate outcomes is entirely dependent on the quality of the source data. There’s room for improvement, as evidenced by the SolarWinds IT Trends Report 2024, which found that 40% of IT professionals who reported negative experiences with AI attributed the issues to algorithmic errors, a problem often rooted in subpar data. There’s no denying that data quality has been paramount in other situations, like when migrating traditional on-premises data platforms to the cloud. Organizations sometimes focused on “lifting and shifting” the technology without addressing underlying data quality issues or security. The result was a new, costly platform that simply ran into the same problems it had before, proving that the source of the problem was the data. Observability frameworks like OpenTelemetry provide a solid foundation for collecting reliable, consistent data across systems that can then be fed into AI systems.

AI workloads are data-intensive, demanding high-performing infrastructure, including optimized compute and high-speed storage, to be fed efficiently. Using events to isolate the cause of a problem within the application stack only delivers value if we also prioritize data quality and resilient storage at the infrastructure layer. Good data remains the foundation. Without it, even the smartest AI can’t deliver trusted insights, only faster mistakes.
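
To make that concrete, here is a small, purely illustrative Python sketch of a data-quality gate placed in front of an AI pipeline: incomplete or stale telemetry records are dropped before a model ever sees them. The record fields, freshness threshold, and function names are assumptions for the example, not part of any specific product or API.

```python
# Illustrative data-quality gate: only complete, fresh records reach the model.
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = {"timestamp", "service", "metric", "value"}
MAX_STALENESS = timedelta(minutes=5)  # illustrative freshness budget

def is_usable(record: dict) -> bool:
    """Reject records that are incomplete, stale, or obviously invalid."""
    if not REQUIRED_FIELDS.issubset(record):
        return False  # missing context makes correlation impossible
    ts = datetime.fromisoformat(record["timestamp"])
    if datetime.now(timezone.utc) - ts > MAX_STALENESS:
        return False  # stale data produces stale (or wrong) insights
    return record["value"] is not None and record["value"] >= 0

def gate(batch: list[dict]) -> list[dict]:
    """Forward only clean records to the downstream AI/anomaly-detection step."""
    return [r for r in batch if is_usable(r)]

if __name__ == "__main__":
    now = datetime.now(timezone.utc).isoformat()
    sample = [
        {"timestamp": now, "service": "checkout", "metric": "latency_ms", "value": 123.0},
        {"service": "checkout", "metric": "latency_ms", "value": 456.0},  # no timestamp
    ]
    print(gate(sample))  # only the complete, fresh record survives
```

The point isn’t this particular check; it’s that quality rules are enforced before data reaches the model, not after the model has already drawn conclusions from it.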

Lesson 3: Avoid the All-or-Nothing Trap

The impulse to rip out old systems and replace them with new AI frameworks is an “all-or-nothing” approach that often leads to costly, familiar problems. When companies first embraced containerization, many attempted to containerize every application at once, abandoning existing infrastructure and processes. This big-bang approach sometimes resulted in disconnected systems and team silos, which in turn increased overhead costs and mean time to resolution (MTTR).

The strategic journey to AI-powered operational excellence isn’t about wholesale replacement. It’s about a disciplined, incremental approach that leverages our existing cloud-native platforms. These platforms, often anchored by container orchestration like Kubernetes, provide the unified, governed base for safely integrating new AI capabilities. The most resilient and effective transformations are built incrementally, integrating AI capabilities into existing, well-understood platforms rather than creating new silos.

By adopting this “replace a brick at a time” strategy, we can avoid the pitfalls of past transformations, ensuring our platforms remain unified, governed, and secure, and allowing teams to focus on solving problems, not just managing chaos.
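To ground the idea, here is a small, purely illustrative Python sketch of one such brick: a configurable share of traffic is routed to a new AI-based check inside an existing service, while the proven rule-based check stays in place as both the default and the fallback. The function names, thresholds, and rollout percentage are assumptions for the example, not a prescribed implementation.

```python
# Illustrative "brick at a time" rollout: a small, configurable share of
# traffic goes to a new AI-based check inside an existing service, with the
# proven rule-based check as default and fallback. Names/values are assumptions.
import random

AI_ROLLOUT_PERCENT = 5  # start small; raise only as confidence grows

def rule_based_check(latency_ms: float) -> bool:
    """The existing, well-understood check stays in place."""
    return latency_ms > 500

def ai_based_check(latency_ms: float) -> bool:
    """Stand-in for a call to a deployed model, gated behind the rollout."""
    return latency_ms > 480  # placeholder for a real model's decision

def is_anomalous(latency_ms: float) -> bool:
    """Route a sampled fraction of requests to the new path, never all at once."""
    if random.uniform(0, 100) < AI_ROLLOUT_PERCENT:
        try:
            return ai_based_check(latency_ms)
        except Exception:
            # Any failure in the new brick falls back to the known-good one.
            return rule_based_check(latency_ms)
    return rule_based_check(latency_ms)

if __name__ == "__main__":
    print(is_anomalous(520.0))
```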

Lessons Learned

The rapid adoption of AI is mirroring the missteps of past tech transformations, with operational chaos resulting from a lack of intent and a focus on technology over problem-solving. To avoid these familiar pitfalls, let’s remember what we’ve already learned about operational excellence:

  • Solve the Observability Paradox. Instead of chasing a singular root cause, shift the focus to the context of change (Events) to understand how a cascade of changes affects system behavior, which drastically improves MTTR.
  • Prioritize good data. Technology is only as effective as the data it processes. Treat data quality as a design principle, backed by resilient, high-speed storage infrastructure for your AI workloads.
  • Avoid the all-or-nothing trap. Success is built incrementally. Integrate AI capabilities into your existing, well-understood cloud-native platforms to maintain a unified, governed, and secure environment.

By reapplying these hard-won lessons, we can craft a resilient and strategic AI journey to operational excellence, steering our AI evolution with the wisdom of experience.
