
Multi-Cloud SQL Server HADR: Why Single-Cloud DR Fails and How DBAs Can Fix It | Dave Bermingham, SIOS Technology | TFiR


Database administrators have been sold a version of high availability with a critical flaw baked in: when both the primary SQL Server workload and the disaster recovery site live inside the same cloud provider, a single provider-level outage can take out both at once. There is no failover. There is no recovery. There is only waiting for the cloud vendor to come back online.

This is not a theoretical risk. Azure, AWS, and Google Cloud have all experienced regional outages that lasted hours. Organizations that designed their DR strategy around multi-region deployments within a single cloud discovered that broad provider-level incidents can cascade across regions, invalidating assumptions that underpinned their entire business continuity plan.

The stakes are highest for SQL Server environments running transactional, mission-critical workloads — ERP systems, financial platforms, healthcare records, and manufacturing operations where downtime is measured not in inconvenience but in regulatory exposure, revenue loss, and customer trust. For these organizations, the question is not whether to invest in disaster recovery architecture, but whether the architecture they have actually works when the vendor they depend on is the source of the failure.

Multi-Cloud High Availability and Disaster Recovery, or Multi-Cloud HADR, is the architectural answer. By distributing SQL Server workloads and replicated data across fundamentally independent cloud providers — Azure and AWS being the most common pairing — organizations eliminate the single point of failure that resides at the vendor level. But implementation is not trivial. It requires rethinking storage replication, failover automation, network connectivity, identity and access management, cost modeling, and observability — all at once, across two different platform ecosystems.

The organizations that get this right are not the ones with the biggest budgets. They are the ones that have invested in the right tooling, documented and tested their failover procedures exhaustively, and aligned their recovery time and recovery point objectives with what the business actually requires — not just what is technically convenient to configure.

The Guest: Dave Bermingham, Senior Technical Evangelist at SIOS Technology

Key Takeaways

  • Single-cloud DR strategies, even multi-region ones, leave organizations exposed to vendor-level outages that are completely outside DBA control; Multi-Cloud HADR removes this dependency by placing the recovery site with an independent provider.
  • Failover Cluster Instances in a multi-cloud environment cannot use native shared storage options like FSx or Azure File Shares; block-level replication tools like SIOS DataKeeper are required to bridge cloud boundaries.
  • Synchronous replication is generally not viable across cloud providers due to latency; most organizations use synchronous replication within a region for HA and asynchronous replication across clouds for DR, accepting a measured RPO.
  • Untested disaster recovery plans are not disaster recovery plans — automation, documentation, and repeated testing are the difference between a predictable failover and a crisis response.
  • Hidden egress costs, inconsistent security policies across cloud IAM models, and lack of cross-environment visibility are the three most common and costly multi-cloud HADR implementation mistakes.

***


In this exclusive interview with Swapnil Bhartiya at TFiR, Dave Bermingham, Senior Technical Evangelist at SIOS Technology, dives deep into the real-world architecture, failure scenarios, trade-offs, and implementation pitfalls of Multi-Cloud High Availability and Disaster Recovery for SQL Server, ahead of his two technical sessions at the Day of Data conference in Jacksonville.

The Real-World Pain Point: Vendor Dependency and Unplanned Downtime

The core problem Multi-Cloud HADR addresses is not technical complexity — it is control. When an organization’s SQL Server infrastructure, including its disaster recovery site, resides entirely within a single cloud provider, that organization has outsourced its uptime to that vendor. A provider-level incident does not just take down the primary; it eliminates the fallback simultaneously.

Q: For database administrators managing SQL Server today, what is the real-world pain point that Multi-Cloud HADR solves?

Dave Bermingham: “The biggest pain point is unplanned downtime that you can’t control. If you’re relying on a single cloud provider, you’re ultimately dependent on that cloud provider being up and available. Now you can design for multi-region within that cloud and that definitely can help. But there’s a broader issue that impacts availability. If the cloud provider is having a really bad day, in the worst-case scenario you could be sitting there waiting for the cloud provider to fix it if you don’t have another plan in place. Multi-cloud gives the DBA a way to take back some control. It allows you to recover from scenarios that would otherwise be completely out of your hands. And that’s a big deal when uptime really matters.”

Failure Scenario Walkthrough: Azure Goes Down, AWS Takes Over

Understanding Multi-Cloud HADR at the architecture level requires walking through a concrete failure scenario. The Azure-to-AWS pairing is the most common configuration SIOS Technology works with, and the recovery sequence illustrates both what automation enables and where human planning remains essential.

Q: Can you walk us through a specific failure scenario — say an Azure region goes down — and explain how Multi-Cloud HADR handles the recovery?

Dave Bermingham: “Let’s say your primary workload is running in Azure and you’ve got a DR site set up in AWS. Your data is sitting there in AWS and then you need to take some action. If Azure goes down, typically you’re going to have some monitoring software or some clustering software that’s going to detect the outage. At that point your DR plan takes effect — that is a written set of instructions that’s part of the larger business continuity plan. There may be some things happening across the business at the same time that aren’t related to your databases at all. But it’s important that everyone knows their role and that the entire plan has been tested — not just the database piece.

From a SQL Server standpoint, this is where execution matters. There’s usually a priority list of what needs to come online first, and then you walk through that list step by step. Whenever possible, you want those steps to be automated so that you’re not relying on someone trying to remember what to do — typically under a lot of pressure. The exact process is really going to depend on the solution that you’ve enacted. But generally it comes down to bringing the replicated storage online, starting SQL Server, and then updating the connectivity so that the applications and users are redirected to this new environment. If it’s designed properly, that process should be predictable and relatively fast. That comes down to automation, documentation, and testing. An untested DR plan really isn’t a DR plan at all. Test, test, and test again.

The key in that sense is Failover Cluster Instances. When we think about multi-site SQL Server, Always On Availability Groups often comes to mind — that’s a solution that can replicate data synchronously or asynchronously, and it’s a great solution within a single cloud, especially in the same region. Now, Failover Cluster Instances normally require some sort of shared storage device. When you’re talking about a single cloud provider in a single region, you have options — there are hosted solutions like FSx or Azure File Shares. But in a multi-cloud environment you can’t use those solutions. So instead you need to adopt a stateless approach where the data is replicated at the block level using something like SIOS DataKeeper.

Client connectivity also becomes a big factor. You need a solid plan for redirecting applications and users to the active node after a failover, and that usually involves a DNS update or load balancers. Microsoft introduced Distributed Network Names, which can be a significant benefit in a multi-cloud environment if you configure that properly. But on top of that you need to think about security and the routing between the cloud providers. It’s definitely doable, but it takes a bit of planning and some expertise to get it right.”
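The recovery sequence described above (bring the replicated storage online, start SQL Server, redirect clients) lends itself to an ordered, automated runbook. The following is a minimal sketch only, not SIOS's or Microsoft's tooling: the `Step` type, `run_failover`, and the three placeholder actions are hypothetical names, and each placeholder would need to be replaced with whatever commands your chosen replication and clustering solution actually provides.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One entry in the DR priority list (hypothetical structure)."""
    name: str
    action: Callable[[], bool]  # returns True on success


def run_failover(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in priority order and stop at the first failure.

    Returns the names of the completed steps and the name of the
    step that failed (None if every step succeeded).
    """
    completed: List[str] = []
    for step in steps:
        try:
            ok = step.action()
        except Exception:
            ok = False  # an exception counts as a failed step
        if not ok:
            return completed, step.name
        completed.append(step.name)
    return completed, None


# Placeholder actions: assumptions for illustration only. Real versions
# would call the tooling of the solution you have deployed.
def bring_replicated_storage_online() -> bool:
    return True


def start_sql_server() -> bool:
    return True


def redirect_clients() -> bool:
    return True


RUNBOOK = [
    Step("bring replicated storage online", bring_replicated_storage_online),
    Step("start SQL Server", start_sql_server),
    Step("redirect clients", redirect_clients),
]

if __name__ == "__main__":
    done, failed = run_failover(RUNBOOK)
    print("completed:", done)
    print("failed at:", failed)
```

Stopping at the first failed step is a deliberate choice in this sketch: starting SQL Server against storage that did not come online, or redirecting clients to an instance that did not start, would only make a bad day worse. Recording which step failed also gives a repeated DR test something concrete to report.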

RTO and RPO Trade-Offs Across Cloud Providers

Recovery Time Objective and Recovery Point Objective are the two metrics that govern whether a DR strategy is fit for purpose. In a multi-cloud SQL Server deployment, the physics of inter-cloud latency force architectural trade-offs that organizations must consciously negotiate — and align with business stakeholders, not just engineering teams.

Q: How do you balance RTO and RPO when data has to replicate between different cloud providers?

Dave Bermingham: “It comes down to trade-offs. If you want zero data loss you need synchronous replication. The problem is that usually you can’t do synchronous replication across clouds because of the latency. So what most organizations end up doing is using synchronous replication locally for high availability — maybe between availability zones in the same region — and then asynchronous replication across clouds for disaster recovery. That means your recovery point objective in a multi-cloud scenario might be seconds or minutes depending on the workload. And your recovery time objective is going to depend heavily on how you automate your failover process.

The key is to align those trade-offs with the business, not just what’s technically possible. In some cases hourly or even daily snapshots are perfectly fine. In other scenarios where every second counts, you need to minimize potential data loss as much as possible. You’re not going to have a one-size-fits-all solution. Many organizations end up implementing different strategies based on the requirements of each application.”
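The trade-off in this answer can be expressed as two small checks: whether a link is fast enough for synchronous replication, and whether the observed asynchronous lag still fits the RPO the business agreed to. The sketch below is illustrative only. The 5 ms threshold, the link names, and the sample figures are assumptions rather than numbers from the interview; measure your own links and set the threshold to what your workload tolerates.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A replication path with a measured round-trip time."""
    name: str
    rtt_ms: float


def choose_mode(link: Link, max_sync_rtt_ms: float = 5.0) -> str:
    """Pick a replication mode from measured latency.

    max_sync_rtt_ms is an assumed tolerance, not a vendor limit.
    """
    if link.rtt_ms <= max_sync_rtt_ms:
        return "synchronous"
    return "asynchronous"


def worst_case_loss_mb(write_rate_mb_s: float, lag_s: float) -> float:
    """Data at risk if the primary is lost while the replica is lag_s behind."""
    return write_rate_mb_s * lag_s


def rpo_met(observed_lag_s: float, rpo_s: float) -> bool:
    """True when the replica is no further behind than the agreed RPO."""
    return observed_lag_s <= rpo_s


if __name__ == "__main__":
    # Hypothetical measurements for illustration.
    links = [Link("zone-to-zone", 1.2), Link("azure-to-aws", 40.0)]
    for link in links:
        print(link.name, "->", choose_mode(link))
    print("data at risk (MB):", worst_case_loss_mb(2.0, 30.0))
    print("RPO of 60 s met at 30 s lag:", rpo_met(30.0, 60.0))
```

Running the same checks per application, each with its own RPO, is one way to arrive at the mix of strategies described above instead of a single policy for everything.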

Common Multi-Cloud HADR Implementation Mistakes

The gap between a multi-cloud DR strategy that looks sound on paper and one that actually functions under pressure is wide. SIOS Technology’s field experience across enterprise SQL Server deployments surfaces a consistent set of mistakes that organizations make when they first attempt multi-cloud HADR — mistakes that range from architectural misjudgments to operational blind spots.

Q: What are the most common mistakes organizations make when they first try to implement Multi-Cloud SQL Server HADR?

Dave Bermingham: “There are a few common mistakes I see when organizations start going down the multi-cloud path. The biggest one is underestimating the complexity. It sounds great on paper, but now you’re dealing with multiple cloud providers, multiple toolsets, a different way of doing things — and if your team isn’t comfortable across all those platforms, things can get messy pretty quickly.

Security is another big one. Each cloud has its own model for identity, access, and security controls. If you’re not careful you end up with inconsistent policies, and that’s where gaps start to show up.

Cost is something people often overlook, especially egress data cost. Moving data between cloud providers isn’t free, and those charges can add up fast if you’re not paying attention.

I also see people trying hard to avoid vendor lock-in by using only the most basic services across each cloud. The problem is you end up leaving a lot of value on the table by not taking advantage of what each platform does well.

And then there’s visibility. If you don’t have a good way to monitor and manage everything across environments, you’re basically flying blind — and that can lead to performance issues, security risk, and higher cost. Multi-cloud can absolutely work, but you need the right skills, the right tools, and a consistent approach to security and operations to make it successful.”
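The egress point is easy to quantify before the first invoice arrives. A rough sketch follows, assuming that cross-cloud replication traffic roughly equals the volume of changed data and that the provider charges a flat per-GB rate. Both are simplifications, and the rate used in the example is a placeholder, not a published price, so check your provider's current pricing.

```python
def monthly_egress_cost(changed_gb_per_day: float,
                        usd_per_gb: float,
                        days: int = 30) -> float:
    """Estimate the monthly cost of replicating changed data out of a cloud.

    usd_per_gb is whatever your provider bills for outbound transfer;
    it is an input here because no real rate is assumed.
    """
    if changed_gb_per_day < 0 or usd_per_gb < 0:
        raise ValueError("inputs must be non-negative")
    return changed_gb_per_day * days * usd_per_gb


if __name__ == "__main__":
    # Hypothetical workload: 200 GB of changed blocks per day,
    # billed at a placeholder rate of 0.05 USD per GB.
    cost = monthly_egress_cost(200, 0.05)
    print(f"estimated monthly egress: {cost:.2f} USD")
```

Even a crude estimate like this makes it easier to compare replication options and to notice when change volume, rather than database size, is what drives the bill.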

Day of Data Conference: Two Sessions in Jacksonville

Beyond the technical content, Dave Bermingham is bringing this expertise to the Day of Data conference in Jacksonville, with two sessions designed for different experience levels — one deep technical dive on Multi-Cloud HADR and one introductory track aimed at students and professionals making career transitions into database administration.

Q: You are speaking at Day of Data next month. What can attendees expect from your sessions?

Dave Bermingham: “I have two sessions in Jacksonville. One is like we mentioned — we’re going to be talking about Multi-Cloud high availability, with some really great information, some technical details, some how-to’s. But I’m also doing an intro track for those — maybe students that are graduating or people making career changes — on very high-level 101: what is high availability, what do you need to know? There’s going to be lots of great speakers, everything from the nuts and bolts of how to optimize queries to a whole track for students and people looking for career changes that’s going to be more high-level and give you some good information to investigate. I’m looking forward to it.”
