The Night of the AWS Outage: How Akuity Intelligence Became Our Team’s On-Call Partner

Author: Hong Wang, Co-Founder and CEO of Akuity
Bio: Hong is the Co-Founder and CEO of Akuity, the leader in cloud-native GitOps, and a founding member of Argo CD, the third most-adopted open source project in the CNCF behind Kubernetes and OpenTelemetry.

In October 2025, AWS had a major outage that kept many teams awake fixing the same failures on repeat. We were affected too, but the incident showed how Akuity Intelligence can act as a reliable partner during real outages. It handled the repetitive work that normally wakes up engineers, providing practical, safe automation when we needed it most.

The AWS Outage Hits

Around 1:00 AM Pacific, AWS began having issues with DynamoDB. The impact spread quickly to the registry we use for pulling container images, and our workloads started failing with ImagePullBackOff.

The on-call engineer got paged, checked the logs, confirmed the primary registry was unavailable, and switched the workloads to the backup registry. The services recovered.

Five minutes later, another alert came in with the same failure. Then another. It became clear this wasn’t an isolated issue, and many workloads were about to hit the same problem.

Turning Knowledge Into a Runbook

Manually patching each workload wasn’t sustainable, so the team wrote a short, plain-language runbook for Akuity Intelligence:

“If a pod fails with ImagePullBackOff because the registry is unavailable, switch it to the backup registry. No approval needed.”

This turned a known fix into a repeatable action. Engineers kept full control of the rule, and Akuity Intelligence handled the repetitive work whenever the same failure appeared.
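
In Kubernetes terms, that runbook's fix amounts to a small image patch on each affected Deployment. Below is a minimal sketch using the official Python Kubernetes client; the registry hostnames and the workload name are hypothetical, and this illustrates the fix itself, not Akuity's internal implementation.

```python
# Sketch: the "switch to the backup registry" fix as a Deployment image patch.
# The registry hostnames and workload name are hypothetical; the primary and
# backup registries are assumed to hold identical images and tags.
from kubernetes import client, config

PRIMARY = "registry.example.com"        # hypothetical primary registry host
BACKUP = "backup-registry.example.com"  # hypothetical backup registry host

def switch_to_backup(name: str, namespace: str) -> None:
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(name, namespace)

    # Rewrite only containers whose image points at the primary registry.
    patched = [
        {"name": c.name, "image": c.image.replace(PRIMARY, BACKUP, 1)}
        for c in dep.spec.template.spec.containers
        if c.image.startswith(PRIMARY)
    ]
    if patched:
        # Strategic merge patch: only the image field changes. In a GitOps
        # setup the same change should also land in Git to avoid drift.
        body = {"spec": {"template": {"spec": {"containers": patched}}}}
        apps.patch_namespaced_deployment(name, namespace, body)

switch_to_backup("checkout-service", "prod")  # hypothetical workload
```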

Akuity Takes Over the Night

Over the next few hours, the same failure occurred roughly 20–25 times across different services. Instead of waking the on-call engineer repeatedly, Akuity Intelligence:

  1. Detected each failure

  2. Confirmed the runbook applied

  3. Patched the workload

  4. Verified recovery

Each fix took about 30 seconds. By morning, the outage was still making news, but our alerts had stopped, and the system had stabilized.
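
For a concrete picture of that loop, here is a hedged sketch of the four steps with the Python Kubernetes client. It reuses the hypothetical switch_to_backup() helper from the sketch above; the real Akuity Intelligence implementation is not public.

```python
# Sketch of one pass of the detect -> confirm -> patch -> verify loop.
import time

from kubernetes import client, config

PRIMARY = "registry.example.com"  # hypothetical primary registry host

def remediate_image_pull_failures(switch_to_backup) -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # 1. Detect: find pods stuck pulling their image.
    for pod in core.list_pod_for_all_namespaces(watch=False).items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting if cs.state else None
            if not waiting or waiting.reason not in ("ImagePullBackOff", "ErrImagePull"):
                continue
            # 2. Confirm the runbook applies: the image is on the unavailable
            #    primary registry and the pod belongs to a Deployment's ReplicaSet.
            owner = next((o for o in pod.metadata.owner_references or []
                          if o.kind == "ReplicaSet"), None)
            if owner is None or not cs.image.startswith(PRIMARY):
                continue
            # A Deployment's ReplicaSet is named <deployment>-<pod-template-hash>.
            deployment = owner.name.rsplit("-", 1)[0]
            # 3. Patch: apply the known fix (the helper from the earlier sketch).
            switch_to_backup(deployment, pod.metadata.namespace)
            # 4. Verify: give the rollout a moment, then check availability.
            time.sleep(30)
            dep = apps.read_namespaced_deployment(deployment, pod.metadata.namespace)
            if not dep.status.unavailable_replicas:
                print(f"recovered: {pod.metadata.namespace}/{deployment}")
```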

Why This Matters for Engineers and SREs

This outage highlighted several practical benefits for teams running Argo CD, Kubernetes, and multi-cluster platforms:

1. Fewer Alerts, Less Fatigue

Without automation, the on-call engineer would have been paged more than 20 times for the same issue. With Akuity Intelligence:

  • alerts stopped

  • repetitive fixes were automated

  • engineers weren’t pulled into the same task again and again

For teams managing many clusters or customer workloads, this reduction in noise is a major operational improvement.

2. Faster, Consistent Remediation

Akuity Intelligence applied the same validated fix every time:

  • no skipping steps

  • no manual typos

  • no variation between engineers

This consistency is critical when supporting large GitOps environments.

3. Turn Known Fixes Into Automated Actions in Minutes

When an engineer identifies the cause of an issue, they can encode that knowledge in a simple runbook. Akuity Intelligence uses that runbook to:

  • watch for the specific failure

  • apply the exact fix automatically

  • validate recovery

  • repeat it across any affected workloads

This captures engineering knowledge and removes repetitive work so teams can focus on root causes and improvement.
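
As an illustration only (Akuity Intelligence accepts plain-language runbooks, as quoted earlier), the same rule can be pictured as data: a failure signature, a condition, a fix, and an approval flag. This hypothetical structure is not Akuity's format.

```python
# Hypothetical sketch of a runbook rule as data, for illustration only.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class RunbookRule:
    reason: str                      # failure signature to match
    applies: Callable[[str], bool]   # extra check on the failing image
    fix: Callable[[str, str], None]  # remediation: (deployment, namespace)
    requires_approval: bool          # the runbook above said "No approval needed"

REGISTRY_FAILOVER = RunbookRule(
    reason="ImagePullBackOff",
    applies=lambda image: image.startswith("registry.example.com"),
    fix=switch_to_backup,            # the patch helper from the first sketch
    requires_approval=False,
)
```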

4. Automation With Human Control

Engineers decide how Akuity Intelligence behaves:

  • which issues matter

  • what the correct fix is

  • when automation should run

  • when approval is needed

Because these rules come directly from the team, Akuity Agents operate in a predictable, controlled way. Even simple automation can make a meaningful difference during widespread outages.

5. Extends Open Source Argo CD

Open source Argo CD gives engineers the visibility they need: unhealthy apps, failed pods, events, and logs. But visibility doesn’t resolve issues during an outage.

The Akuity Platform adds the automation layer. Engineers deploy Akuity Agents instead of running Argo CD themselves. These agents connect the cluster to the Akuity control plane, allowing Akuity Intelligence to:

  • read the same event data engineers see

  • follow team-defined runbooks

  • apply safe, deterministic fixes

  • handle repeated failures automatically

This turns Argo CD’s visibility into real remediation and removes repetitive tasks from the on-call workflow.
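
For reference, the event data in question is the same stream engineers inspect with kubectl get events. A minimal sketch of filtering it for image-pull failures with the Python Kubernetes client:

```python
# Sketch: read cluster events and surface image-pull failures, the same
# signal an on-call engineer sees with `kubectl get events`.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for ev in core.list_event_for_all_namespaces(watch=False).items:
    # The kubelet emits "Failed" and "BackOff" events for failed image pulls.
    if ev.reason in ("Failed", "BackOff") and "pull" in (ev.message or "").lower():
        obj = ev.involved_object
        print(f"{obj.namespace}/{obj.name}: {ev.reason}: {ev.message}")
```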

What Engineering Teams Can Take Away

The outage highlighted a simple truth: the value isn’t in AI “solving everything,” but in using a tool that automates the fixes engineers already understand. When teams define clear, safe rules, Akuity Intelligence can take over the repetitive incident work that appears again and again during large outages.

Teams running Kubernetes at scale know these patterns well:

  • repeated ImagePullBackOff

  • stuck rollouts

  • incorrect image references

  • pods that need restarts

  • failures triggered by upstream services

Most of these issues have straightforward fixes once diagnosed. The AWS outage showed that Akuity Intelligence can apply those fixes reliably and consistently, without pulling engineers out of bed or away from deeper investigation.
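
A stuck rollout, for example, leaves a machine-readable signature in standard Deployment status just as ImagePullBackOff does, which is what makes this class of remediation deterministic. A short sketch:

```python
# Sketch: detect stuck rollouts from standard Deployment conditions. When a
# rollout exceeds progressDeadlineSeconds, Kubernetes sets the Progressing
# condition's reason to ProgressDeadlineExceeded.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

for dep in apps.list_deployment_for_all_namespaces(watch=False).items:
    for cond in dep.status.conditions or []:
        if cond.type == "Progressing" and cond.reason == "ProgressDeadlineExceeded":
            print(f"stuck rollout: {dep.metadata.namespace}/{dep.metadata.name}")
```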

Closing Thoughts

We couldn’t avoid the AWS outage, but we did avoid the repetitive work that normally turns an incident into an all-night effort. The incident showed how engineer-defined, deterministic automation can cut down alerts, protect on-call time, and keep systems stable during upstream failures.
