Why Cloud Outages Expose Design Failures, Not Just Tech Failures | John Bradshaw, Akamai

Guest: John Bradshaw
Company: Akamai
Show Name: An Eye on AI
Topics: Cloud Computing

When AWS experienced a major outage recently, the immediate impact was obvious: businesses couldn’t sell, services went dark, operations halted. But according to John Bradshaw, Field CTO Cloud, EMEA at Akamai, the real cost extends far beyond lost transactions. The reputational damage is harder to quantify, harder to fix, and lingers long after systems come back online. More importantly, these outages reveal a fundamental architectural problem that enterprises can actually solve—if they’re willing to invest in resilience before disaster strikes.

Everything Fails—Plan Accordingly

Bradshaw opens with a simple truth: everything fails. A pen runs out of ink, but you have another pen, or a pencil nearby. The same principle applies to cloud infrastructure, yet many enterprises operate as if their primary provider is infallible. “We’ve got to plan for these problems,” Bradshaw explains. That means designing workloads that can move and are naturally tolerant of failure, rather than hoping failure never comes.

The recent AWS outage didn’t just disrupt transactions—it rattled boardrooms. “There will be boards up and down the country, because it hit here in the UK, and no doubt the world, who are concerned about making sure that this doesn’t happen again,” Bradshaw notes. The timing could have been worse—imagine the impact during Black Friday—but the wake-up call was loud enough.

The Hidden Cost of Outages

While the immediate revenue loss from an outage can be calculated, the brand and reputational damage is far harder to measure and repair. Customers lose trust. Stakeholders question reliability. Competitors gain ground. “That is much harder to quantify, and it is much harder to fix, and it goes on for an awfully long, long period of time,” Bradshaw emphasizes.

The paradox is that cloud technology has become so seamless, “almost magical,” that users expect it to just work. Fifteen years ago, joining a video call meant dialing into bridges and hoping you typed the code quickly enough. Today, it’s instant. That expectation of reliability is what makes outages feel so catastrophic when they occur.

Abstraction and Disintermediation: The Path to Resilience

Bradshaw’s solution centers on abstraction and disintermediation—breaking the monolithic dependencies where your entire stack lives in one place, with one provider, using one set of technologies. “We need to break the links where your stack is all in one place, all in one provider, all with one piece of technology,” he explains.

The goal is a heterogeneous infrastructure spanning multiple hyperscalers. You might have workloads running across half a dozen cloud providers, and ideally, you shouldn’t need to worry about which ones they are. The system should be designed to meet security requirements, satisfy regulatory demands, maintain performance, and automatically adapt when a database fails or a deep-sea fiber gets cut.

“If your users don’t ever notice the problem, then you never had a problem,” Bradshaw says. That’s the benchmark for true resilience.
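As a rough illustration of that benchmark, the sketch below tries the same workload on a list of providers in order and returns the first successful response, so the caller never sees the failed attempt. It is a minimal sketch, not Akamai's or Bradshaw's implementation: the provider names, the handler functions, and the `call_with_failover` helper are all hypothetical, and a production system would add health checks, timeouts, and data replication.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(
    providers: Sequence[tuple[str, Callable[[], str]]],
) -> tuple[str, str]:
    """Try each provider in order; return (name, result) from the first success."""
    errors = []
    for name, handler in providers:
        try:
            return name, handler()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical handlers standing in for one workload deployed on two providers.
def provider_a() -> str:
    raise ConnectionError("region unavailable")


def provider_b() -> str:
    return "order accepted"


served_by, response = call_with_failover(
    [("provider-a", provider_a), ("provider-b", provider_b)]
)
print(served_by, response)
```

Here the first provider fails and the second answers, so the user still gets a response. The ordered list is the simplest possible routing policy; weighting by latency, cost, or regulatory constraints would replace it in practice.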

The Investment Challenge

Here’s the uncomfortable truth: designing for resilience costs money upfront. Organizations must actively choose to spend on redundancy, abstraction layers, and multi-cloud orchestration. “You have to want to spend the money on making your solutions and services resilient,” Bradshaw cautions.

The challenge intensifies when systems have never failed before. Why invest in resilience when nothing’s broken? It’s the classic prevention paradox—the cost of preparation feels wasteful until catastrophe strikes, at which point it’s too late. “That’s always a challenge,” Bradshaw acknowledges. “We’ve got to keep front of mind as people start to look at how they protect themselves for the future.”

What This Means for Enterprise Leaders

For CTOs and infrastructure leaders, the message is clear: single-provider dependency is a strategic vulnerability, not a cost optimization. The upfront investment in multi-cloud architecture, abstraction frameworks, and resilience design pays for itself many times over when—not if—the next major outage occurs.

Bradshaw’s insights come at a critical moment, when boards are demanding answers about business continuity. The enterprises that emerge strongest won’t be those with the best disaster recovery plans, but those that designed their infrastructure to never need them. As cloud adoption deepens across every industry, the question isn’t whether your organization can afford to build resilient systems; it’s whether you can afford not to.
