How to Architect Multi-Cloud SQL Server HADR When One Cloud Is Not Enough | Dave Bermingham, SIOS Technology | TFiR

Single-cloud SQL Server DR fails when the provider does. Dave Bermingham of SIOS Technology explains multi-cloud HADR architecture, RTO/RPO trade-offs, and top implementation mistakes.

By Monika Chauhan May 19, 2026


A SQL Server environment designed for high availability inside a single cloud provider is still a single point of failure at the vendor level. When a cloud provider experiences a regional or platform-wide outage, both the primary workload and the disaster recovery site go down together if they share the same infrastructure. No amount of availability zone redundancy within one cloud solves this problem.

In this interview on TFiR, Dave Bermingham, Senior Technical Evangelist at SIOS Technology, breaks down how to architect a genuine multi-cloud SQL Server high availability and disaster recovery solution, what tools and trade-offs are involved, and what DBAs consistently get wrong when they first attempt the implementation.

Guest: Dave Bermingham, Senior Technical Evangelist at SIOS Technology
Show: TFiR

Here is what every DBA managing SQL Server in the cloud needs to know.

Technical Deep Dive

Q: What real-world pain point does multi-cloud HADR solve for SQL Server DBAs?

Dave Bermingham, Senior Technical Evangelist at SIOS Technology, explains that the biggest pain point is unplanned downtime that the DBA cannot control. When a team relies on a single cloud provider, the entire environment is dependent on that provider remaining available. Multi-region design within the same cloud helps, but a provider-wide incident leaves DBAs waiting for the vendor to fix the problem. Multi-cloud HADR gives DBAs a way to recover from scenarios that would otherwise be completely outside their control, which is critical when uptime genuinely matters to the business.

“Multi-cloud gives the DBA a way to take back some control. It allows you to recover from scenarios that would otherwise be completely out of your hands.” — Dave Bermingham, Senior Technical Evangelist, SIOS Technology

Q: What happens step by step when an Azure region goes down and the DR site is in AWS?

Bermingham walks through a concrete scenario: the primary workload runs in Azure, replicated data is sitting in AWS, and monitoring or clustering software detects the Azure outage. At that point the documented DR plan takes effect, which is part of the broader business continuity plan. The execution sequence for SQL Server specifically involves bringing replicated storage online, starting SQL Server, and updating connectivity so that applications and users are redirected to the new environment. Bermingham stresses that automation is essential at every step so that engineers are not trying to recall procedures under pressure. An untested DR plan, he says, is not a DR plan.

“The untested DR plan really is not a DR plan at all. Test, test, and test again.” — Dave Bermingham, Senior Technical Evangelist, SIOS Technology
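The recovery sequence Bermingham describes (bring replicated storage online, start SQL Server, redirect clients) can be sketched as an ordered, automated runbook. The snippet below is a minimal illustration, not a SIOS or Microsoft tool: the `Step` type, `run_failover`, and the three placeholder actions are hypothetical names, and real actions would call whatever replication software, service manager, and DNS or load balancer API an environment actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order, stop at the first failure, and return a log."""
    log = []
    for step in steps:
        ok = step.action()
        log.append(f"{step.name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # later steps depend on earlier ones, so do not continue
    return log


# Hypothetical placeholders: each would wrap a real, tested automation call.
def bring_storage_online() -> bool:
    return True


def start_sql_server() -> bool:
    return True


def redirect_clients() -> bool:
    return True


RUNBOOK = [
    Step("bring replicated storage online", bring_storage_online),
    Step("start SQL Server", start_sql_server),
    Step("redirect clients", redirect_clients),
]

if __name__ == "__main__":
    for line in run_failover(RUNBOOK):
        print(line)
```

Keeping the runbook as data makes it easy to run the same list during a scheduled DR test as during a real outage, which is the point of the "test, test, and test again" advice.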

Q: Why do failover cluster instances require a different storage strategy in a multi-cloud SQL Server deployment?

Failover cluster instances normally require shared storage, and within a single cloud provider and region, hosted options such as FSx or Azure File Shares satisfy that requirement. In a multi-cloud environment those hosted services cannot be used, because each is scoped to a single provider. Bermingham explains that the alternative is to replicate the data at the block level between the nodes, using a tool like SIOS DataKeeper, rather than relying on a shared storage service. Always On Availability Groups, which can replicate synchronously or asynchronously, often come to mind for multi-site SQL Server; Bermingham describes them as a great, straightforward fit within a single cloud, especially within the same region.

“In a multi-cloud environment you need to adopt a stateless approach where the data is replicated at the block level using something like SIOS DataKeeper.” — Dave Bermingham, Senior Technical Evangelist, SIOS Technology
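To make the idea of block-level replication concrete, here is a toy sketch. It is not how SIOS DataKeeper is implemented; it only shows the general technique of tracking which fixed-size blocks changed on a source volume and copying just those blocks to a target. `ReplicatedVolume`, `BLOCK_SIZE`, and the in-memory byte arrays are assumptions made for the example.

```python
BLOCK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


class ReplicatedVolume:
    """Toy source/target pair with dirty-block tracking."""

    def __init__(self, size_blocks: int):
        self.source = bytearray(size_blocks * BLOCK_SIZE)
        self.target = bytearray(size_blocks * BLOCK_SIZE)
        self.dirty = set()  # indices of blocks changed since the last replicate()

    def write(self, offset: int, data: bytes) -> None:
        """Write to the source and mark every touched block as dirty."""
        if offset < 0 or offset + len(data) > len(self.source):
            raise ValueError("write is outside the volume")
        if not data:
            return
        self.source[offset:offset + len(data)] = data
        first = offset // BLOCK_SIZE
        last = (offset + len(data) - 1) // BLOCK_SIZE
        self.dirty.update(range(first, last + 1))

    def replicate(self) -> int:
        """Copy dirty blocks to the target; return how many blocks were sent."""
        sent = 0
        for idx in sorted(self.dirty):
            start = idx * BLOCK_SIZE
            self.target[start:start + BLOCK_SIZE] = self.source[start:start + BLOCK_SIZE]
            sent += 1
        self.dirty.clear()
        return sent


if __name__ == "__main__":
    vol = ReplicatedVolume(size_blocks=4)
    vol.write(2, b"abcd")          # spans blocks 0 and 1
    print(vol.replicate())         # number of blocks copied
    print(vol.target == vol.source)
```

Because replication works below the file system, the target node does not need a shared disk or a provider-specific file service, which is why this model fits a failover cluster instance that spans two clouds.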

Q: How do you handle client connectivity and application redirection after a multi-cloud SQL Server failover?

Bermingham identifies client connectivity as a major factor that requires an explicit plan. After a failover, applications and users must be redirected to the active node, which usually involves a DNS update or a load balancer change. Microsoft also introduced the Distributed Network Name (DNN), referred to in the interview as “dynamic network names,” which Bermingham describes as a big benefit in a multi-cloud environment when configured properly. Security and routing between the cloud providers must be addressed as part of this connectivity design as well. Getting this right takes planning and expertise, but Bermingham considers it definitely doable.

“Client connectivity becomes a big factor. You need a solid plan for redirecting applications and users to the active node after a failover.” — Dave Bermingham, Senior Technical Evangelist, SIOS Technology
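One simple way to reason about redirection is a client-side or orchestration-side check that finds which endpoint is currently accepting connections. The sketch below is an illustration under assumptions: the host names are hypothetical, `pick_active` and `tcp_probe` are names invented for the example, and port 1433 is the SQL Server default. A successful TCP connect only shows that something is listening; it does not prove the node is the active, writable instance, so a production check would need to go further.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def pick_active(
    endpoints: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first endpoint that answers the probe, in priority order."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None


# Hypothetical endpoints: primary in Azure first, DR node in AWS second.
CANDIDATES = [
    ("sql-azure.example.internal", 1433),
    ("sql-aws.example.internal", 1433),
]

if __name__ == "__main__":
    print(pick_active(CANDIDATES))
```

The probe is injectable so the selection logic can be exercised in a DR test without touching the network. The result would then drive the DNS update or load balancer change mentioned above.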

Q: How do you balance RTO and RPO when replicating SQL Server data across cloud providers?

Bermingham frames this as a trade-off problem driven by latency. Zero data loss requires synchronous replication, but synchronous replication is usually not practical across cloud providers because of the latency between them. The common pattern is to use synchronous replication locally for high availability, such as between availability zones in the same region, and asynchronous replication across clouds for disaster recovery. That means the RPO in a multi-cloud scenario may be seconds or minutes, depending on the workload, while the RTO depends heavily on how well the failover process is automated. Bermingham also notes that there is no one-size-fits-all answer: for some applications hourly or even daily snapshots are perfectly fine, while for others every second counts, so organizations should align these trade-offs with business requirements, application by application, rather than defaulting to a single technical approach.

“The key is to align those trade-offs with the business, not just what is technically possible.” — Dave Bermingham, Senior Technical Evangelist, SIOS Technology
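The latency argument can be illustrated with simple arithmetic. Under synchronous replication a commit cannot complete until the remote side acknowledges it, so each commit costs at least the local write time plus one network round trip. The sketch below computes the resulting ceiling for a single session committing one transaction at a time. All numbers are assumptions chosen for illustration, not measurements from the interview, and real systems run many sessions in parallel, so treat this as a rough bound rather than a sizing formula.

```python
def max_serial_commits_per_sec(local_commit_ms: float, rtt_ms: float) -> float:
    """Upper bound on commits per second for one serial session when every
    commit waits for the local write plus one replication round trip."""
    return 1000.0 / (local_commit_ms + rtt_ms)


def choose_mode(rtt_ms: float, max_added_latency_ms: float) -> str:
    """Pick synchronous replication only if the round trip fits the budget."""
    return "synchronous" if rtt_ms <= max_added_latency_ms else "asynchronous"


if __name__ == "__main__":
    # Assumed values: 1 ms local commit, 1 ms between availability zones,
    # 60 ms between two cloud providers, 5 ms latency budget per commit.
    for label, rtt in [("same region, across zones", 1.0), ("across clouds", 60.0)]:
        rate = max_serial_commits_per_sec(local_commit_ms=1.0, rtt_ms=rtt)
        mode = choose_mode(rtt_ms=rtt, max_added_latency_ms=5.0)
        print(f"{label}: about {rate:.0f} serial commits/s, suggested mode: {mode}")
```

With these assumed figures the cross-cloud link cuts the serial commit rate by more than an order of magnitude, which is the reason the cross-cloud leg is usually asynchronous and accepts a non-zero RPO.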

Q: What are the most common mistakes organizations make when implementing multi-cloud SQL Server HADR?

Bermingham identifies five recurring failure patterns. First, teams underestimate complexity: operating across multiple cloud providers means dealing with multiple toolsets and different ways of doing things, and if the team is not comfortable across all of those platforms, things get messy quickly. Second, security policies become inconsistent because each cloud has its own model for identity, access, and security controls, and gaps appear when those are not aligned deliberately. Third, egress data costs are often overlooked, even though moving data between cloud providers is not free and the charges add up fast. Fourth, some teams try to avoid vendor lock-in by using only the most basic services in each cloud, which leaves a lot of value on the table because they do not take advantage of what each platform does well. Fifth, visibility gaps leave teams unable to monitor and manage everything across environments, which leads to performance issues, security risk, and higher cost. Bermingham concludes that multi-cloud SQL Server HADR can absolutely work, but it requires the right skills, the right tools, and a consistent approach to security and operations.

“If you don’t have a good way to monitor and manage everything across environments, you are basically flying blind.” — Dave Bermingham, Senior Technical Evangelist, SIOS Technology
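The egress point among the mistakes above is easy to quantify before committing to a design. The sketch below estimates the monthly cost of replicating changed data out of one cloud into another. The function name, the daily change rate, and the price per GB are all placeholders for illustration; actual egress pricing varies by provider, region, and volume tier and should be taken from the provider's current price list.

```python
def monthly_egress_cost(
    changed_gb_per_day: float,
    price_per_gb: float,
    days: int = 30,
) -> float:
    """Estimated cost of sending replicated changes across clouds for a month."""
    if changed_gb_per_day < 0 or price_per_gb < 0 or days < 0:
        raise ValueError("inputs must be non-negative")
    return changed_gb_per_day * days * price_per_gb


if __name__ == "__main__":
    # Placeholder inputs: 200 GB of changed blocks per day at 0.09 per GB.
    cost = monthly_egress_cost(changed_gb_per_day=200, price_per_gb=0.09)
    print(f"Estimated monthly egress cost: {cost:.2f}")
```

Running the same estimate per application helps decide which workloads justify continuous cross-cloud replication and which can live with periodic snapshots.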

Q: What sessions is Dave Bermingham presenting at Day of Data and who are they designed for?

Bermingham is presenting two sessions at Day of Data in Jacksonville. The first is a technical session on multi-cloud high availability for SQL Server, covering architectural details and practical how-to guidance. The second is an introductory session aimed at graduating students and professionals making career changes, giving a high-level overview of what high availability is and what they need to know to start investigating it further. Bermingham expects the wider event to range from the nuts and bolts of query optimization to a dedicated track for people entering the field.

“I’m also doing a kind of intro track for those maybe students that are graduating or people making career changes on very high level 101 what is high availability.” — Dave Bermingham, Senior Technical Evangelist, SIOS Technology

Resources & Documentation

SIOS Technology, vendor of SIOS DataKeeper, a block-level storage replication solution for SQL Server failover cluster instances in multi-cloud environments
SIOS DataKeeper, block-level replication software that enables shared-storage failover cluster instances across cloud providers
Microsoft Distributed Network Name (DNN), referred to in the interview as dynamic network names, a failover clustering feature supported by SQL Server that simplifies client connectivity redirection in multi-cloud failover scenarios
Microsoft Always On Availability Groups, SQL Server built-in replication supporting synchronous and asynchronous modes for high availability within and across regions

***

Full Raw Transcript

Swapnil Bhartiya: Database administrators are tired of being told that their SQL Server is highly available when everything actually lives in one cloud. If there is an outage in one region of Azure, that means that your entire stack will go down because your primary and DR both depend on the same vendor, the same cloud. So how do we fix this? To talk about that, we have with us once again Dave Bermingham, Senior Technical Evangelist at SIOS Technology, to explore what multi-cloud HADR actually means. He will also be speaking at Day of Data next month, so we are going to cover that as well. Dave, it’s great to have you on the show. You are going to speak at Day of Data next month. For database administrators who are managing SQL Server today, what’s the real-world pain point that multi-cloud HADR solves for them?

Dave Bermingham: Yeah, thanks for having me again, Swapnil. We’re looking forward to going back to Jacksonville. I’ve been there a few times, it’s always a great event, and this time I have two technical sessions. One is, like we mentioned, on multi-cloud high availability: just some really great information, some technical details, some how-tos. But I’m also doing a kind of intro track for those maybe students that are graduating or people making career changes, on just very high-level 101: What is high availability? What do you need to know? So I have two tracks in Jacksonville. As far as the event, there are going to be lots of great speakers. I’m sure there’s going to be everything from the nuts and bolts of how to optimize queries to, again, a whole track for students and people looking for career changes that’s going to be more high level and give you some good information to investigate. So I’m looking forward to it. As for my session on multi-cloud, you asked what some of the real-world pain points are that multi-cloud HADR solves. Of course the biggest pain point is unplanned downtime that you can’t control, right? If you’re relying on a single cloud provider, you’re ultimately dependent on that cloud provider being up and available. Now, you can design for multi-region within that cloud, and that definitely can help. But there’s a broader issue that impacts availability. If the cloud provider is having a really bad day, in the worst-case scenario you could be sitting there waiting for the cloud provider to fix it if you don’t have another plan in place. Well, multi-cloud gives the DBA a way to take back some control. It allows you to recover from scenarios that would otherwise be completely out of your hands. And that’s a big deal when uptime really matters.

Swapnil Bhartiya: Can you walk us through a specific failure scenario? Say an Azure region goes down and explain how multi cloud HADR handles the recovery there?

Dave Bermingham: Yeah. So let’s say your primary workload is running in Azure and you’ve got a DR site set up in AWS. Like we’ve discussed, there are ways of getting your data from Azure into AWS, but essentially your data is sitting there in AWS and then you need to take some action. So if Azure goes down, typically you’re going to have some monitoring software or some clustering software that’s going to detect the outage. At that point your DR plan takes effect, and that is a written set of instructions that’s part of the larger business continuity plan. So there may be some things happening across the business at the same time that aren’t related to your databases at all. But it’s important that everyone knows their role and that the entire plan has been tested, not just the database piece. From a SQL Server standpoint, this is where execution matters. There’s usually a priority list of what needs to come online first, and then you walk through that list step by step. Whenever possible you want those steps to be automated so that you’re not relying on someone trying to remember what to do, typically under a lot of pressure. The exact process is really going to depend on the solution that you’ve enacted. But generally it comes down to bringing the replicated storage online, starting SQL Server, and then updating the connectivity so that the applications and users are redirected to this new environment. If it’s designed properly, that process should be predictable and relatively fast. That comes down to automation, documentation, and testing. The untested DR plan really isn’t a DR plan at all. So test, test, and test again.

Yeah, so the key in that sense is failover cluster instances. Often when we think about multi-site SQL Server, Always On Availability Groups come to mind. That’s a solution that can replicate data synchronously or asynchronously. It’s a great solution within a single cloud, especially within the same region, where things are pretty straightforward. Now, failover cluster instances normally require some sort of shared storage device. And when you’re talking about a single cloud provider, single region, you have options. There are hosted solutions like FSx or Azure File Shares, but in a multi-cloud environment you can’t use those solutions. So instead you need to adopt a stateless approach where the data is replicated at the block level using something like SIOS DataKeeper. Client connectivity also becomes a big factor. You need a solid plan for redirecting applications and users to the active node after a failover. That usually involves a DNS update or load balancers. Microsoft introduced this new thing called dynamic network names, which can be a big benefit in a multi-cloud environment if you configure that properly. But on top of that you need to think about security and the routing between the cloud providers. So it’s definitely doable, but it takes a bit of planning and some expertise to get it right.

Swapnil Bhartiya: And how do you balance recovery time objectives and recovery point objectives when data has to replicate between different cloud providers?

Dave Bermingham: It comes down to trade-offs. So if you want zero data loss, you need synchronous replication. The problem is that usually you can’t do synchronous replication across clouds because of the latency. So what most organizations end up doing is using synchronous replication locally for high availability, maybe between availability zones in the same region, and then asynchronous replication across clouds for disaster recovery. So that means your recovery point objective in a multi-cloud scenario might be seconds or minutes depending on the workload. And your recovery time objective is going to depend heavily on how you automate your failover process. The key is to align those trade-offs with the business, not just what’s technically possible. So in some cases hourly or even daily snapshots are perfectly fine. In other scenarios where every second counts, you need to minimize potential data loss as much as possible. You’re not going to have a one-size-fits-all solution. Many organizations end up implementing different strategies based on the requirements of each application.

Swapnil Bhartiya: What are some of the common mistakes organizations make when they first try to implement multi-cloud SQL Server HADR?

Dave Bermingham: There are a few common mistakes I see when organizations start going down the multi-cloud path. The biggest one is underestimating the complexity. It sounds great on paper, but now you’re dealing with multiple cloud providers, multiple toolsets, a different way of doing things, and if your team isn’t comfortable across all those platforms, things can get messy pretty quickly. Security is another big one. Each cloud has its own model for identity, access, and security controls. So if you’re not careful you end up with inconsistent policies, and that’s where gaps start to show up. Cost is something people often overlook too, especially the egress data cost. Moving data between cloud providers isn’t free, and those charges can add up fast if you’re not paying attention. I also see people trying hard to avoid vendor lock-in by using only the most basic services across each cloud. The problem is you end up leaving a lot of value on the table by not taking advantage of what each platform does well. And then there’s visibility. If you don’t have a good way to monitor and manage everything across environments, you’re basically flying blind. And that can lead to performance issues, security risk, and higher cost. So multi-cloud can absolutely work, but you need the right skills, the right tools, and a consistent approach to security and operations to make it successful.

Swapnil Bhartiya: Thank you so much for joining me and walking us through these practical scenarios ahead of Day of Data. I hope you will have a lot of fun at the event, and I look forward to chatting with you again. Thank you.

Dave Bermingham: I look forward to it. Thanks.
