Resilience and High Availability in Uncertain Times

Organizations today rely on highly distributed, hybrid IT infrastructure to conduct operations. What is often overlooked or misunderstood about adopting a cloud-forward operational posture is the risk of an outage, along with the fact that you can take steps to build in resilience even when you don’t own your critical infrastructure.

The infrastructure services on which your operations depend can be (and often are) disrupted by any number of scenarios. Cyberattacks, natural disasters, and human error all conspire to throw sand in the gears of progress. And at a time when there is a great deal of uncertainty in the world, it is important to take the steps necessary to mitigate the risks of an outage and architect your systems for resilience and high availability.

That situation was illustrated on March 1 when, according to the Times of India, “objects struck” an Amazon Web Services data center in the United Arab Emirates, starting a fire. Emergency crews responded and cut power to the facility, knocking out service for the mec1-az2 availability zone for several days. One of our customers was running a SQL Server failover cluster in the AWS Middle East region, and the cluster’s node in that zone was taken offline during the incident.

Fortunately, the customer was prepared. While AWS engineers worked to bring the data center back online, the customer’s cluster remained operational using the node in mec1-az1. This was possible because they had architected the system as a SANless cluster with replication, so other nodes in the cluster already had up-to-date copies of their operational data when the system failed over to its backup. As far as users of the customer’s applications were concerned, service remained available even as the crisis unfolded.

It was a high availability case study written in boldface. The takeaway? No matter how unexpected the circumstances, when your primary infrastructure goes offline, it doesn’t mean you have to be at the mercy of your service provider. The good news is that the lessons learned from this incident are easily applied by other organizations that recognize the value of building resilience, high availability (HA), and disaster recovery (DR) into their critical infrastructure. Here’s how.

Changing the Game

Many organizations still associate high availability with specialized hardware such as SAN storage or proprietary replication appliances. That model is familiar because, for a long time, it was the only option available, and so it dominated data center architecture for years. IT operations teams grew comfortable with the idea. Despite its limitations, the approach worked, and since it wasn’t broke… why fix it?

Eventually that model changed, and what is not widely understood today is that modern HA solutions can now be delivered entirely via software. Instead of hardware-based SAN clusters, it is now possible to deploy SANless clustering to replicate storage between nodes at the block level. This allows applications to fail over automatically and seamlessly without relying on shared storage.
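To make block-level replication without shared storage more concrete, here is a minimal in-memory sketch in Python. It is an illustration only and does not represent how any particular clustering product works. The `Node` and `MirroredVolume` names, the tiny 4-byte block size, and the synchronous write path are all assumptions made for this example.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical, deliberately tiny so the example is easy to follow


@dataclass
class Node:
    """A cluster node with its own local (non-shared) storage."""
    name: str
    blocks: dict = field(default_factory=dict)  # block index -> bytes


class MirroredVolume:
    """Applies every write to all surviving nodes, block by block."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        if not self.nodes:
            raise ValueError("at least one node is required")
        self.active = self.nodes[0]

    def write(self, offset, data):
        # Simplification: only block-aligned writes are supported.
        if offset % BLOCK_SIZE != 0:
            raise ValueError("offset must be block-aligned")
        for i in range(0, len(data), BLOCK_SIZE):
            index = (offset + i) // BLOCK_SIZE
            chunk = data[i:i + BLOCK_SIZE]
            # Assumed synchronous mirroring: each node stores the block
            # before the write is considered complete.
            for node in self.nodes:
                node.blocks[index] = chunk

    def read(self, index):
        # Reads are served from the active node's local copy.
        return self.active.blocks.get(index)

    def fail_over(self):
        # Drop the failed active node and promote the next survivor.
        survivors = [n for n in self.nodes if n is not self.active]
        if not survivors:
            raise RuntimeError("no healthy node left to fail over to")
        self.nodes = survivors
        self.active = survivors[0]
        return self.active


volume = MirroredVolume([Node("node-a"), Node("node-b")])
volume.write(0, b"ORDERS01")
volume.fail_over()
print(volume.active.name, volume.read(0), volume.read(1))
```

Because every block was mirrored before the failure, the promoted node can serve the same data from its own local copy, which is the property the SANless approach relies on.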

However, the longstanding reliance on hardware seems to have resulted in a bias toward a traditional hardware model. Because of that, too many CIOs remain either unaware or unconvinced that they can trust their HA/DR contingencies to software; they want an expensive piece of physical equipment involved if they are to feel comfortable. But once they see that a software-based approach can deliver the same level of protection while also being more flexible and cost efficient in their cloud environment, the conversation changes quickly.

Engage in the Discussion

Whether you are in the planning stages of a new IT estate, or exploring ways to improve resilience for existing IT infrastructure, it’s never too early to engage in a discussion of how to adopt a software-based approach to HA/DR. Even if the conversation is dominated by issues like deployment speed, scalability, and cost, the wrong time to bring up HA/DR is after an outage exposes how dependent your organization is on a particular service or application. Unfortunately for most organizations, this is a reality that only comes up in an incident’s postmortem.

But with each new major service outage, awareness grows that your infrastructure provider may be spending billions of dollars each year to repair, maintain, and modernize its data centers, yet that spending is no guarantee of availability. For organizations whose operations depend on third-party infrastructure and services, clustering software is what keeps the lights on when a failure occurs.

To guard against the inevitability of an outage using SANless clusters, you can either run your applications on a primary node in the cloud with failover to a secondary node, or you can provision local storage with replication software on each cluster node so that every node holds synchronized data in the event of failover. In each case, clustering software monitors application health and automatically moves operation of affected applications to a healthy secondary node in the event of failure. For organizations taking a multi-cloud approach to IT operations, it is also possible to create clusters that allow you to fail over from one cloud to another when necessary. In each case, SANless clusters eliminate single points of failure to keep service running on secondary resources until primary service is restored.
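The monitoring and automatic failover behavior described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the logic of any specific clustering product. The `FailoverMonitor` class, the `is_healthy` callback, and the `max_failures` threshold are hypothetical names chosen for this example.

```python
from typing import Callable, List, Optional


class FailoverMonitor:
    """Tracks the active node and fails over after repeated failed checks."""

    def __init__(self, nodes: List[str],
                 is_healthy: Callable[[str], bool],
                 max_failures: int = 3):
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = list(nodes)          # ordered by failover priority
        self.is_healthy = is_healthy      # hypothetical health probe
        self.max_failures = max_failures  # consecutive misses before failover
        self.active = self.nodes[0]
        self.failures = 0

    def poll(self) -> str:
        """Run one health check and return the node that is active afterwards."""
        if self.is_healthy(self.active):
            self.failures = 0
            return self.active
        self.failures += 1
        if self.failures >= self.max_failures:
            standby = self._pick_standby()
            if standby is not None:
                self.active = standby
                self.failures = 0
        return self.active

    def _pick_standby(self) -> Optional[str]:
        # Choose the first other node that currently passes its health check.
        for node in self.nodes:
            if node != self.active and self.is_healthy(node):
                return node
        return None


status = {"primary": True, "secondary": True}
monitor = FailoverMonitor(["primary", "secondary"],
                          lambda name: status[name], max_failures=2)
status["primary"] = False
monitor.poll()  # first missed check, application stays on the primary
monitor.poll()  # second missed check triggers failover
print(monitor.active)
```

Requiring several consecutive failed checks before moving the application is a common way to avoid failing over on a single transient blip; the right threshold depends on how much downtime the workload can tolerate.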

More than Mere Uptime

It is worth pointing out here that, while maximizing uptime is the primary benefit of a SANless approach to HA/DR, it is far from the only advantage. If your system lacks automated failover, you risk long hours of manual effort spent during the recovery process, and a high likelihood that errors will be made by people under pressure to fix things fast.

It is also probable that valuable data will be lost during the process due to a lack of backup until services are restored. That can mean many hours of lost transactional data. And, while some may not admit it, organizations lacking reliable HA often delay patching and maintenance because they cannot tolerate downtime and fear that an unplanned or unscheduled maintenance event (even if it is related to a known security risk) may be temporarily disruptive.

You Have Options

This raises the point that, while HA/DR systems are not security tools, they do play a crucial role in operating a secure and resilient IT estate. That’s because organizations that have replicated systems and standby environments in place have options when a security incident occurs. When vulnerabilities in your systems are discovered and patches are issued, you can respond quickly and confidently to execute those fixes as soon as they are available, closing the risk window before that vulnerability can be exploited.

If a cyberattack occurs and malware is detected, you can isolate compromised systems, fail over to clean infrastructure, and restore services much more quickly than if you had to rebuild everything from scratch. And in the event of a ransomware attack, having multiple synchronized copies of data and defined recovery workflows can dramatically shorten the time required to get critical applications back online.

Our Times Require HA/DR

These days, service outages are not just caused by hardware failures. They can come from cloud disruptions, natural disasters, bad patches, cybersecurity incidents, or collateral damage from military conflict. Traditional clustering solutions designed primarily for local failover are no longer sufficient. IT teams need disaster recovery capabilities that can automatically verify replication integrity, orchestrate failover to secondary locations, and provide visibility across the full application stack. IT infrastructure architected with solid, software-based high availability and disaster recovery can keep your operations running even when the unexpected occurs. For organizations that need that level of protection, SANless clustering fits the bill.
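As one illustration of what verifying replication integrity can mean in practice, the Python sketch below compares per-block SHA-256 digests of a primary copy and a replica and reports which blocks differ. It is a generic technique shown under simplifying assumptions (both copies fit in memory, and the function names and block size are hypothetical), not a description of any vendor’s verification feature.

```python
import hashlib
from typing import List


def block_digests(data: bytes, block_size: int = 4096) -> List[str]:
    """Return the SHA-256 hex digest of each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def find_mismatched_blocks(primary: bytes, replica: bytes,
                           block_size: int = 4096) -> List[int]:
    """Return indexes of blocks whose contents differ between the copies."""
    a = block_digests(primary, block_size)
    b = block_digests(replica, block_size)
    mismatched = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    # Blocks present on only one side also count as mismatches.
    mismatched.extend(range(min(len(a), len(b)), max(len(a), len(b))))
    return mismatched


primary_copy = b"A" * 8 + b"B" * 8
replica_copy = b"A" * 8 + b"X" * 8
print(find_mismatched_blocks(primary_copy, replica_copy, block_size=8))
```

An empty result means the two copies match block for block; any reported index points to a block that would need to be resynchronized before the replica could be trusted for failover.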

Author Bio: Dave Bermingham is the Senior Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high availability expert and has been honored by his peers by being elected a Microsoft MVP in Clustering six times and a Cloud and Datacenter MVP seven times. Dave is a frequent speaker at technical conferences, including SQL Saturdays, PASS Summit, and MSSQL Tips, and is the author of the Clustering for Mere Mortals blog. Dave holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare, and education.
