In October 2025, AWS had a major outage that kept many teams awake fixing the same failures on repeat. We were affected too, but the incident showed how Akuity Intelligence can act as a reliable partner during real outages. It handled the repetitive work that normally wakes up engineers, providing practical, safe automation when we needed it most.
The AWS Outage Hits
Around 1:00 AM Pacific, AWS began having issues with DynamoDB. The impact spread quickly to the registry we use for pulling container images, and our workloads started failing with ImagePullBackOff.
The on-call engineer got paged, checked the logs, confirmed the primary registry was unavailable, and switched the workloads to the backup registry. The services recovered.
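The manual fix amounts to rewriting each workload's image reference from the unavailable primary registry to the mirrored backup. A minimal sketch in Python, assuming hypothetical registry hostnames (the real registry names are not shown here):

```python
# Sketch of the fix the on-call engineer applied by hand: repoint an image
# reference from the unavailable primary registry to a mirrored backup.
# The registry hostnames below are invented for illustration.

PRIMARY_REGISTRY = "registry.primary.example.com"  # assumed name
BACKUP_REGISTRY = "registry.backup.example.com"    # assumed name

def switch_to_backup(image: str) -> str:
    """Rewrite an image reference so it pulls from the backup registry."""
    if image.startswith(PRIMARY_REGISTRY + "/"):
        return BACKUP_REGISTRY + image[len(PRIMARY_REGISTRY):]
    return image  # image comes from elsewhere; leave it untouched
```

For example, `switch_to_backup("registry.primary.example.com/app/api:v1.2")` yields the same repository and tag served from the backup host, while images from other registries pass through unchanged.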
Five minutes later, another alert came in with the same failure. Then another. It became clear this wasn’t an isolated issue, and many workloads were about to hit the same problem.
Turning Knowledge Into a Runbook
Manually patching each workload wasn’t sustainable, so the team wrote a short, plain-language runbook for Akuity Intelligence:
“If a pod fails with ImagePullBackOff because the registry is unavailable, switch it to the backup registry. No approval needed.”
This turned a known fix into a repeatable action. Engineers kept full control of the rule, and Akuity Intelligence handled the repetitive work whenever the same failure appeared.
Akuity Takes Over the Night
Over the next few hours, the same failure occurred roughly 20–25 times across different services. Instead of waking the on-call engineer repeatedly, Akuity Intelligence:
- Detected each failure
- Confirmed the runbook applied
- Patched the workload
- Verified recovery
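The four steps above can be sketched as a simple loop. This is an illustrative stand-in, not Akuity Intelligence's actual implementation: pod records are plain dictionaries and the registry names are invented, whereas the real system reads live cluster events through Akuity Agents.

```python
# Illustrative sketch of the detect -> confirm -> patch -> verify loop.
# Pod records and registry names are simplified stand-ins for real cluster state.

def runbook_applies(pod: dict) -> bool:
    """Runbook condition: ImagePullBackOff caused by the primary registry."""
    return (pod["status"] == "ImagePullBackOff"
            and pod["image"].startswith("registry.primary.example.com/"))

def patch_registry(pod: dict) -> None:
    """Apply the fix: repoint the workload at the backup registry."""
    pod["image"] = pod["image"].replace(
        "registry.primary.example.com/", "registry.backup.example.com/", 1)
    pod["status"] = "Running"  # assume the pull from the backup succeeds

def remediate(pods: list[dict]) -> int:
    """Run the runbook across all workloads; return how many were fixed."""
    fixed = 0
    for pod in pods:
        if pod["status"] != "ImagePullBackOff":
            continue                      # 1. detect each failure
        if not runbook_applies(pod):
            continue                      # 2. confirm the runbook applies
        patch_registry(pod)               # 3. patch the workload
        if pod["status"] == "Running":    # 4. verify recovery
            fixed += 1
    return fixed
```

The key property is that the loop only touches workloads matching the exact failure the runbook describes; anything else is left for a human.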
Each fix took about 30 seconds. By morning, the outage was still making news, but our alerts had stopped, and the system had stabilized.
Why This Matters for Engineers and SREs
This outage highlighted several practical benefits for teams running Argo CD, Kubernetes, and multi-cluster platforms:
1. Fewer Alerts, Less Fatigue
Without automation, the on-call engineer would have been paged more than 20 times for the same issue. With Akuity Intelligence:
- alerts stopped
- repetitive fixes were automated
- engineers weren't pulled into the same task again and again
For teams managing many clusters or customer workloads, this reduction in noise is a major operational improvement.
2. Faster, Consistent Remediation
Akuity Intelligence applied the same validated fix every time:
- no skipped steps
- no manual typos
- no variation between engineers
This consistency is critical when supporting large GitOps environments.
3. Turn Known Fixes Into Automated Actions in Minutes
When an engineer identifies the cause of an issue, they can encode that knowledge in a simple runbook. Akuity Intelligence uses that runbook to:
- watch for the specific failure
- apply the exact fix automatically
- validate recovery
- repeat it across any affected workloads
This captures engineering knowledge and removes repetitive work so teams can focus on root causes and improvement.
4. Automation With Human Control
Engineers decide how Akuity Intelligence behaves:
- which issues matter
- what the correct fix is
- when automation should run
- when approval is needed
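One way to picture these engineer-defined rules is as a small policy table mapping failure reasons to fixes, with an approval flag per rule. This is a hypothetical sketch, not the actual Akuity Intelligence runbook format; the rule for ImagePullBackOff mirrors the "no approval needed" runbook from this incident, and the CrashLoopBackOff rule is an invented example of a gated fix.

```python
# Hypothetical sketch of engineer-defined automation rules. The rule shapes
# and fix names are invented for illustration.

RULES = {
    # failure reason -> (fix name, approval required?)
    "ImagePullBackOff": ("switch-to-backup-registry", False),  # "No approval needed"
    "CrashLoopBackOff": ("restart-pod", True),                 # assumed: riskier, gated
}

def decide(reason: str) -> str:
    """Return what automation should do for a given failure reason."""
    rule = RULES.get(reason)
    if rule is None:
        return "page-on-call"  # no rule defined: humans decide
    fix, needs_approval = rule
    return f"request-approval:{fix}" if needs_approval else f"auto-apply:{fix}"
```

Anything without a rule falls through to paging the on-call engineer, which keeps the automation predictable: it only ever acts on failures the team has explicitly described.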
Because these rules come directly from the team, Akuity Agents operate in a predictable, controlled way. Even simple automation can make a meaningful difference during widespread outages.
5. Extends Open Source Argo CD
Open source Argo CD gives engineers the visibility they need: unhealthy apps, failed pods, events, and logs. But visibility doesn’t resolve issues during an outage.
The Akuity Platform adds the automation layer. Engineers deploy Akuity Agents instead of running Argo CD themselves. These agents connect the cluster to the Akuity control plane, allowing Akuity Intelligence to:
- read the same event data engineers see
- follow team-defined runbooks
- apply safe, deterministic fixes
- handle repeated failures automatically
This turns Argo CD’s visibility into real remediation and removes repetitive tasks from the on-call workflow.
What Engineering Teams Can Take Away
The outage highlighted a simple truth: the value isn’t in AI “solving everything,” but in using a tool that automates the fixes engineers already understand. When teams define clear, safe rules, Akuity Intelligence can take over the repetitive incident work that appears again and again during large outages.
Teams running Kubernetes at scale know these patterns well:
- repeated ImagePullBackOff
- stuck rollouts
- incorrect image references
- pods that need restarts
- failures triggered by upstream services
Most of these issues have straightforward fixes once diagnosed. The AWS outage showed that Akuity Intelligence can apply those steps reliably and consistently, without pulling engineers out of sleep or deeper investigation.
Closing Thoughts
We couldn’t avoid the AWS outage, but we did avoid the repetitive work that normally turns an incident into an all-night effort. The incident showed how engineer-defined, deterministic automation can cut down alerts, protect on-call time, and keep systems stable during upstream failures.