An AWS DynamoDB/DNS incident hit production infrastructure at 1 AM. An Akuity SRE woke to a PagerDuty alert, diagnosed an image pull backoff error caused by a broken Docker registry, switched to the backup registry, and went back to sleep. Five minutes later: another alert. Same symptom, different application. Instead of staying up all night fixing the same issue repeatedly, the engineer wrote a two-line human-readable runbook. AI handled the next 25 incidents autonomously, and the engineer slept through the rest of the night.
The Guest: Hong Wang, Co-founder and CEO at Akuity
The Bottom Line
- Akuity Intelligence’s agentic loop—detect, diagnose, remediate, verify—autonomously fixes well-known production issues using human-defined runbooks, keeping SRE teams fresh and focused on innovation instead of repetitive troubleshooting
***
Speaking with TFiR, Hong Wang of Akuity described the current state of AI-powered SRE automation and shared a real-world incident in which human-defined runbooks enabled AI to autonomously resolve 25 production incidents overnight.
What Is Akuity Intelligence?
Akuity Intelligence is an agentic SRE capability embedded directly into ArgoCD and Kargo. It operates as a closed-loop system: detect when something happens, diagnose root cause, remediate the issue automatically, and verify the result. The system relies on human-defined runbooks—symptom-solution patterns written in natural language—to ensure deterministic outcomes rather than allowing AI to improvise fixes.
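The detect–diagnose–remediate–verify loop can be sketched in a few lines of Python. This is purely illustrative, not Akuity's implementation: the `Runbook` class, the substring-based symptom matching, and the `apply_remediation` stub are all assumptions made for the sake of the sketch.

```python
# Illustrative sketch of a closed agentic loop: detect -> diagnose ->
# remediate -> verify. Hypothetical structure; not Akuity's actual code.
from dataclasses import dataclass

@dataclass
class Runbook:
    symptom: str   # human-written symptom pattern, e.g. "ImagePullBackOff"
    solution: str  # human-certified remediation, e.g. "override registry A to B"

def apply_remediation(solution: str) -> bool:
    # Placeholder: a real system would call cluster APIs here and
    # report whether the action succeeded.
    return True

def handle_alert(alert: str, runbooks: list[Runbook]) -> str:
    """Run one pass of the closed loop for a single alert."""
    # Detect: the alert itself signals that something happened.
    # Diagnose: match the observed symptom against human-defined runbooks.
    for rb in runbooks:
        if rb.symptom.lower() in alert.lower():
            # Remediate: execute only the human-certified solution.
            ok = apply_remediation(rb.solution)
            # Verify: confirm the fix resolved the symptom, else escalate.
            return "resolved" if ok else "escalate"
    # No certified runbook matches: escalate to a human, never improvise.
    return "escalate"

runbooks = [Runbook("ImagePullBackOff", "override registry A to registry B")]
print(handle_alert("ImagePullBackOff: pulls from registry A failing", runbooks))
print(handle_alert("OOMKilled in payments service", runbooks))
```

The key design point the sketch captures is the fallthrough: when no human-defined pattern matches, the loop escalates rather than letting the AI invent a fix.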
Hong Wang: “We build this agentic experience to help you do troubleshooting, remediation, and verification. It’s a whole loop. We know when something happened, we know what to look at to get to the bottom, we know what the right solution is to fix something, we can take action automatically, and we can verify the result. It’s a closed-loop thing for the agent experience.”
The human-defined runbook is the critical control mechanism. Instead of allowing AI to generate arbitrary solutions, platform teams define acceptable remediation patterns in advance. AI executes those patterns when symptoms match—ensuring consistency, safety, and predictability.
Broader Context: The AWS Incident Example
Wang shared a real-world example from an AWS infrastructure incident that affected Akuity’s production environment. The incident occurred late at night and triggered repeated alerts as the same issue cascaded across multiple applications.
Hong Wang: “There was an AWS incident last year caused by DynamoDB or DNS issues. We got affected because we run our infrastructure on AWS. Our Docker registry was broken, but we have a backup. Our engineer got a PagerDuty call at 1 AM. He started looking at the issue—image pull backoff. The registry cannot pull the image. The guy said, ‘We can just switch over to our backup.’ He fixed it and went back to sleep. Five minutes later, another PagerDuty call—a different application, exactly the same symptom.”
Rather than staying up all night manually fixing each application individually, the engineer wrote a simple two-line runbook that Akuity Intelligence could execute autonomously.
Hong Wang: “He said, ‘I don’t want to stay up the whole night fixing all the issues.’ What he did is he started to write this runbook—human-readable runbook, only two lines. Symptom: image pull backoff error from Docker registry A. Solution: override registry A to registry B. Apply to all applications in my infrastructure. What happened is there were another 20 more cases over the night for exactly the same symptom, same problem. My engineer didn’t wake up anymore because AI just took the job. AI said, ‘I look at what’s going on, I look at what the solution should be,’ and it overrode automatically. In the morning, we looked at it—AI actually took the action 25 times. We know it’s safe. We asked the AI to do exactly what we want.”
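In Kubernetes terms, "override registry A to registry B" amounts to rewriting container image references so that pulls go to the backup registry. The sketch below shows that rewrite logic only; the registry hostnames are made up, and a real remediation would patch workloads through the cluster API rather than manipulate strings.

```python
# Minimal sketch of a registry-override remediation. Hostnames are
# hypothetical; a production fix would patch Deployments via the API server.
BROKEN = "registry-a.example.com"   # hypothetical primary (broken) registry
BACKUP = "registry-b.example.com"   # hypothetical backup registry

def override_registry(image: str) -> str:
    """Swap the registry host in an image reference, leaving repo:tag intact."""
    host, _, rest = image.partition("/")
    if host == BROKEN:
        return f"{BACKUP}/{rest}"
    return image  # images from other registries are left untouched

images = [
    "registry-a.example.com/payments/api:v1.4.2",
    "docker.io/library/nginx:1.25",
]
print([override_registry(i) for i in images])
# → ['registry-b.example.com/payments/api:v1.4.2', 'docker.io/library/nginx:1.25']
```

Applied "to all applications in my infrastructure", as the runbook says, this one rewrite is exactly the action the AI repeated 25 times overnight.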
The key operational outcome: the engineer woke up fresh the next day instead of being exhausted from repetitive manual fixes. This shift—from reactive troubleshooting to proactive guardrail definition—is what Wang identifies as the future of SRE workflows.
Wang emphasized that the runbook approach ensures AI operates deterministically rather than creatively. The goal isn’t to replace human judgment; it’s to automate well-understood problems so engineers can focus on novel challenges and architectural improvements.
Hong Wang: “In this particular case, it’s a well-known issue, well-documented issue. You’re not learning something new from resolving that issue. If you well understand the problem, why not allow AI to do the autopilot and autonomously fix it? On the other side, we’re not seeing that AI will take over all the work. The action to take is not AI coming up with some random idea—’Here’s the way I will fix it.’ It’s coming from the runbook, which is written by the human or certified by the human. We want AI to give us more deterministic results. That’s why, when we design our AI SRE capability, we want that runbook. We want the human to get involved—getting their opinion, getting their preference, passing that as additional context to the AI to make things happen.”
The broader implication: SRE teams shift from being on-call firefighters to being architects of automation. Instead of manually diagnosing and fixing the same issues repeatedly, they document symptom-solution patterns once, and AI handles future occurrences at scale. This reduces burnout, improves sleep quality for on-call engineers, and allows teams to focus on higher-leverage work—reliability architecture, observability improvements, and proactive incident prevention.
Watch the full TFiR interview with Hong Wang here.