The 5 Levels of AI Agents in DevOps

Author: Nishant Modak, CEO of Last9
Bio: Nishant Modak is the CEO of Last9, the unified observability company for AI-native teams, used at scale by companies like Replit, Brightcove, Clevertap, Pine Labs, Text Groove, and more.

I’ve been working in observability and DevOps tooling for years, and lately I’ve been diving deep into AI agents in production environments. The space is full of noise right now — startups claiming their chatbot is “fully agentic,” enterprise vendors slapping “AI-powered” on everything, conference demos that never quite work the same way in production.

But underneath all that noise, something real is happening. After working with teams implementing these systems and building observability tools that support them, I’ve started to see clear patterns in how AI agents actually evolve in DevOps environments.

There seem to be five distinct maturity levels, each with different capabilities, complexities, and practical considerations. This framework has helped me think more clearly about what’s possible today versus what’s still experimental, and where the real value lies for different types of organizations.

Level 1: AI-Enhanced Existing Tools

This is where we started, and honestly, it’s probably where most teams are today without even thinking about it that way. Level 1 is basically taking whatever automation you already have and replacing some of the hardcoded logic with AI decision-making.

The way this actually happened for us was pretty mundane. We were drowning in PagerDuty alerts – you know how it is, everything triggers an alert, half of them are false positives, and your on-call engineers are going crazy trying to figure out what’s actually urgent.

So we started experimenting with having an AI agent look at incoming alerts and try to guess which ones were worth immediately waking someone up versus which ones could wait until morning. It wasn’t some grand AI strategy — we just got tired of being paged for disk space alerts on servers that auto-scale anyway.

The agent plugs into our existing PagerDuty workflow and tries to route alerts to different escalation policies based on context like recent deployments, time of day, and similar historical incidents. Sometimes it gets it wrong and we miss something urgent; sometimes it correctly filters out noise that would have ruined someone’s sleep for no reason.
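
If you’re curious what that shape looks like in code, here’s a minimal sketch. The `classify_urgency` callable is a stand-in for whatever LLM call you wire in (a hypothetical placeholder, not a real PagerDuty API), the policy names are made up, and the real point is the last line: on any failure, alerts fall back to the default escalation policy.

```python
# Rough sketch of context-aware alert routing. `classify_urgency` is a
# stand-in for whatever LLM call you use -- a hypothetical placeholder,
# not a PagerDuty or vendor API. Policy names are made up.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable

@dataclass
class Alert:
    summary: str
    service: str

def build_context(alert: Alert, recent_deploys: list[str],
                  similar_incidents: list[str]) -> str:
    """Assemble the context the model sees alongside the raw alert."""
    hour = datetime.now(timezone.utc).hour
    return (f"Alert: {alert.summary} on {alert.service}\n"
            f"UTC hour: {hour}\n"
            f"Recent deploys: {', '.join(recent_deploys) or 'none'}\n"
            f"Similar past incidents: {', '.join(similar_incidents) or 'none'}")

def route_alert(alert: Alert, classify_urgency: Callable[[str], str],
                recent_deploys: list[str], similar_incidents: list[str]) -> str:
    """Pick an escalation policy; fail safe to the default on any error."""
    try:
        verdict = classify_urgency(
            build_context(alert, recent_deploys, similar_incidents))
        if verdict == "page_now":
            return "urgent-oncall"
        if verdict == "can_wait":
            return "business-hours"
    except Exception:
        pass  # model unavailable or returned garbage; fall through
    return "default-escalation"  # existing guardrails still catch everything
```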

What I like about this approach is that it feels pretty low-risk: if it fails, it fails within existing guardrails, and all your existing safety mechanisms still work. If the AI routing goes completely haywire, the alerts still go somewhere, just maybe to the wrong person initially. You’re making your current automation slightly less dumb, not throwing out everything you’ve built.

Level 2: Context-Aware Operations

This is where AI agents start becoming genuinely useful. It’s where I see most mature teams finding real value, and where I think they should be focusing for the near future.

Level 2 is about giving your AI agents memory about your specific infrastructure and operational patterns. Instead of making decisions based only on current data, they can search through historical context to make much better choices.

The technical foundation here is operational memory — vector databases that store incident histories, postmortems, deployment patterns, and runbooks in ways that AI agents can intelligently search. This isn’t just keyword matching; it’s semantic similarity that can connect related operational concepts across different incidents.
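
As a rough sketch of what that operational memory can look like, assuming you have some embedding function available (the `embed` callable below is a stand-in, not a specific product’s API):

```python
# Sketch of semantic search over past incidents. `embed` is an assumed
# stand-in for any text-embedding model; nothing here is tied to a
# specific vector database product.
from typing import Callable

import numpy as np

class IncidentMemory:
    def __init__(self, embed: Callable[[str], np.ndarray]):
        self.embed = embed
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, postmortem: str) -> None:
        """Store a postmortem or runbook alongside its embedding."""
        self.texts.append(postmortem)
        self.vectors.append(self.embed(postmortem))

    def search(self, alert_text: str, k: int = 3) -> list[str]:
        """Return the k most semantically similar past incidents."""
        q = self.embed(alert_text)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors]
        ranked = sorted(zip(sims, self.texts), reverse=True, key=lambda p: p[0])
        return [text for _, text in ranked[:k]]
```

Cosine similarity over embeddings is what makes this semantic rather than keyword matching: an alert about “connection pool exhaustion” can surface a postmortem that never uses those exact words.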

We’ve seen this work particularly well for incident response. When monitoring systems fire alerts, AI agents can search through historical incident databases and provide contextually relevant information to on-call engineers. Instead of just raw alert data, you get enriched notifications that might say “This pattern matches the database connection pool exhaustion from three weeks ago – here’s what resolved it quickly.”

The agent enhances existing workflows rather than replacing them. It’s not changing your PagerDuty setup or incident response procedures, but it’s providing the institutional memory that usually lives in the heads of senior engineers who aren’t always available at 3am.

This approach scales well because agents learn your operational fingerprint — your specific naming conventions, architectural patterns, and failure modes — rather than applying generic best practices that may not fit your environment.

Level 3: Autonomous Decision-Making

Level 3 is where I started feeling genuinely uncomfortable, and I’m still not sure if that’s a good or bad sign.

This is about AI agents that don’t just provide recommendations — they actually take action based on complex reasoning about your operational state. They remember not just what happened before, but why decisions were made, what worked, what didn’t, and theoretically how to improve over time.

We’ve been experimenting with an auto-scaling agent that goes way beyond simple CPU/memory thresholds. It tries to maintain this operational memory of past scaling decisions, analyze business context like deployment windows and marketing campaigns, and reason through decision trees before taking action. When it decides to scale something up, it logs exactly why it made that choice and what factors it considered.

The interesting part is the learning loop — when a scaling decision works out well or goes badly, the agent is supposed to store that outcome and use it to improve future choices. In theory, it’s building operational expertise about our specific infrastructure patterns.
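
In sketch form, the decision-plus-feedback loop looks something like this. The factor names, thresholds, and the naive success-rate math are illustrative assumptions, not what we actually run:

```python
# Sketch of a scaling decision that logs its reasoning and learns from
# outcomes. Factor names, thresholds, and the naive success-rate math
# are illustrative assumptions, not a production algorithm.
import json
import time

class ScalingAgent:
    def __init__(self):
        self.history: list[dict] = []  # past decisions, outcomes filled in later

    def decide(self, cpu: float, deploy_window: bool, campaign_live: bool) -> dict:
        reasons = []
        scale_up = False
        if cpu > 0.75:
            reasons.append(f"cpu={cpu:.2f} above 0.75 threshold")
            scale_up = True
        if campaign_live:
            reasons.append("marketing campaign live, expecting extra traffic")
            scale_up = True
        # Consult the learning loop: how often did similar past decisions work?
        similar = [d for d in self.history
                   if d["scale_up"] == scale_up and d["outcome"] is not None]
        success_rate = (sum(d["outcome"] for d in similar) / len(similar)
                        if similar else None)
        decision = {"ts": time.time(), "scale_up": scale_up, "reasons": reasons,
                    "deploy_window": deploy_window,
                    "past_success_rate": success_rate, "outcome": None}
        self.history.append(decision)
        print(json.dumps(decision))  # log exactly why, for later debugging
        return decision

    def record_outcome(self, decision: dict, worked: bool) -> None:
        """Close the learning loop: record whether the decision helped."""
        decision["outcome"] = worked
```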

I’ve been running this stuff for maybe six months now, and honestly, the results are all over the place. When it works well, it’s pretty impressive — our infrastructure costs seem to be down and performance feels better than our old rule-based system. But when it makes a bad decision, debugging becomes this whole weird exercise in trying to understand AI reasoning chains.

And here’s the thing nobody talks about: AI agents are terrible at knowing when they don’t know something. I’ve had to build all these confidence scoring systems and human escalation workflows, and I’m still not sure I trust them completely.
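
The gating plumbing itself is the easy part; it’s the scores I don’t fully trust. Something like this, where the threshold and the `page_human` hook are placeholders you’d tune and wire up for your own environment:

```python
# Sketch of confidence gating: act autonomously only above a threshold,
# otherwise hand off to a human. The threshold and `page_human` hook
# are placeholders to tune and wire up yourself.
from typing import Callable

def gated_action(action: Callable[[], None], confidence: float,
                 page_human: Callable[[str], None],
                 threshold: float = 0.85) -> bool:
    """Run `action` only if self-reported confidence clears the bar.
    Returns True if the action ran autonomously."""
    if confidence >= threshold:
        action()
        return True
    page_human(f"Agent confidence {confidence:.2f} below {threshold}; "
               "needs human review before acting.")
    return False
```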

Maybe I’m just not operationally mature enough for this level yet. You definitely need really solid monitoring, incident response processes, and organizational comfort with autonomous systems making decisions that affect real users.

Level 4: Multi-Agent Teams

Level 4 is where I think I might be getting ahead of myself, but I’ve been experimenting with it anyway because, well, it seemed interesting.

This is about coordinating multiple specialized AI agents that work together on complex operational workflows. Instead of one agent trying to handle everything, you have specialists for security, performance, deployment, cost optimization, whatever, and they somehow need to collaborate.

I’ve been tinkering with this setup where we have specialized agents for different domains – security handles vulnerability scanning and compliance stuff, performance manages optimization and capacity planning, deployment handles CI/CD strategies. There’s supposed to be an orchestrator that coordinates their decisions using weighted voting when they disagree.

For complex deployments, the idea is that the security agent analyzes for vulnerabilities first, the performance agent tries to predict system impact, and the deployment agent figures out strategy based on both inputs. When agents disagree about rollout speed versus caution, the orchestrator is supposed to weigh security concerns more heavily than performance optimizations.
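
The weighted voting piece is conceptually simple, which is sort of the problem: the hard part is everything around it. Here’s a toy version with made-up weights, where security can outvote the other two agents combined:

```python
# Toy sketch of weighted voting across specialist agents. Weights are
# made up; here security alone can outvote the other two combined.
WEIGHTS = {"security": 0.6, "performance": 0.25, "deployment": 0.15}

def resolve(votes: dict[str, str]) -> str:
    """Each agent votes for a rollout strategy; return the weighted winner."""
    tally: dict[str, float] = {}
    for agent, strategy in votes.items():
        tally[strategy] = tally.get(strategy, 0.0) + WEIGHTS.get(agent, 0.0)
    return max(tally, key=tally.get)

# Security wants a slow canary, the others want a full rollout:
print(resolve({"security": "canary", "performance": "full", "deployment": "full"}))
# -> "canary": security's 0.6 beats the combined 0.4 of the other two.
```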

When this works — and I stress when — it can handle some genuinely complex scenarios. I’ve seen coordinated incident responses where agents identify security risks, assess performance impact, and implement rollbacks while keeping detailed logs of what they did and why.

But the coordination overhead is just enormous. Each agent needs to understand not just its own thing, but how its decisions affect other agents. And when something goes wrong with multiple AI systems interacting in weird ways… well, debugging becomes an adventure I’m not always prepared for.

I’m starting to think most teams probably don’t need this level of complexity, but maybe I’m wrong. Maybe I just haven’t figured out the right way to do it yet.

Level 5: Fully Agentic Systems

Level 5 is where I wave my hands and talk about the future, because honestly, I don’t think we’re there yet.

This would be where your entire DevOps operation becomes an agentic system. You submit business requirements like “handle 10x more traffic for next month’s product launch” and the system autonomously handles analysis, implementation, testing, deployment, monitoring, the whole thing.

I’ve been poking at some Level 5 concepts, but I’m going to be honest: it feels more like research than something I’d trust with production systems anytime soon. Current LLM limitations, the way errors propagate through complex systems, the lack of sophisticated technical reasoning — there are a lot of gaps.

Maybe some companies will figure this out in the next few years. The ones that do will probably have some serious competitive advantages. But rushing toward Level 5 without really understanding the earlier levels seems like a good way to have some spectacular failures.

What I’ve Learned About AI DevOps

Most of what gets marketed as “AI DevOps” today is essentially chatbot interfaces on existing tools. These systems can summarize logs or generate configuration files, but they lack the contextual reasoning needed for complex operational decisions.

The real value I’ve observed comes from systems that understand your specific operational context and can apply that knowledge to enhance existing workflows. This requires significant investment in data infrastructure and knowledge management, but the returns are substantial for teams ready to make that investment.

Practical Guidance Based on What I’ve Seen

Teams still building foundational DevOps automation should focus there first. AI agents add complexity that becomes valuable only when you have solid operational practices to build upon.

Organizations with mature automation often find Level 1 implementations provide immediate value. The key is choosing specific operational pain points — alert routing, test optimization, deployment decision-making — and enhancing those workflows rather than trying to revolutionize everything at once.

Level 2 represents the sweet spot for most mature DevOps organizations. The context-aware capabilities provide genuine operational leverage, and the technical infrastructure requirements align with investments most teams are already making in observability and data management.

Level 3 and beyond require careful consideration. The autonomous decision-making capabilities can be powerful, but they introduce new operational complexities around debugging, accountability, and system predictability. These levels work best for teams with deep AI/ML expertise and extremely mature operational practices.

The Real Opportunity

Based on what I’ve observed working with teams across different maturity levels, the most valuable AI agent implementations amplify human expertise rather than replacing it. The systems that work best make experienced engineers more effective by providing contextual information, automating routine decisions, and surfacing operational insights that would otherwise require manual analysis.

The key technical insight is that effective AI agents need a deep understanding of your specific operational context. Generic AI models trained on broad datasets provide limited value compared to systems that understand your infrastructure patterns, failure modes, and organizational practices.

This contextual understanding requires investment in data infrastructure and knowledge management systems, but it’s what separates genuinely useful AI agents from the chatbot interfaces that dominate the current market.

The evolution toward agentic operations is real, but it’s happening through incremental improvements to existing workflows rather than dramatic operational transformations. Teams that focus on practical implementations addressing specific operational challenges will see the most value from this technology.
