
Why HA Failover Fails: Overlooked Application Dependencies and Untested Runbooks | Matthew Pollard, SIOS Technology | TFiR


Application failover is only one layer of high availability. When storage replication, virtual IPs, and load balancers are not explicitly included in the HA dependency chain, a successful application move still results in downtime because clients cannot reach the service. Runbooks written months or years ago against a different environment state introduce silent gaps that only surface during an actual outage, precisely when there is no time to find them.

In this interview on TFiR, Matthew Pollard, Customer Experience Software Engineer at SIOS Technology, walks through the application-level dependencies most HA strategies miss, how untested failover procedures create compounding risk, and the specific validation steps IT admins need to run to confirm their environments actually protect what the business depends on.

Guest: Matthew Pollard, Customer Experience Software Engineer at SIOS Technology
Show: TFiR

Here is what every IT admin and platform engineer responsible for uptime needs to know.

Technical Deep Dive

Q: What application-level dependencies are most commonly overlooked in high availability strategies?

Matthew Pollard, Customer Experience Software Engineer at SIOS Technology, explains that teams focusing only on application and database failover often neglect the underlying dependencies those services require to function. Storage is a critical gap: if an application moves to a standby node but cannot access the storage it depends on, the service remains broken. Networking components including virtual IPs and load balancers are equally overlooked, and without them, clients have no path to reach the application even after a successful failover.

“Even if the application fails over, if it doesn’t have the storage that it depends on to function properly, if it doesn’t have the networking components that is required for clients to actually connect to the application, then you’ve still incurred a downtime, regardless of if the application moved or not.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology
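The dependency chain Pollard describes can be sketched as a simple all-or-nothing check. This is a minimal illustration rather than anything SIOS-specific: the `Dependency` type, the function names, and the stubbed check results are hypothetical, and real checks would probe the actual storage mount, virtual IP, and load balancer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Dependency:
    name: str                   # hypothetical label, e.g. "storage"
    check: Callable[[], bool]   # returns True when the dependency is usable


def evaluate_failover(deps: List[Dependency]) -> Dict[str, bool]:
    """Run every check and record the result per dependency."""
    results: Dict[str, bool] = {}
    for dep in deps:
        try:
            results[dep.name] = bool(dep.check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[dep.name] = False
    return results


def service_is_available(results: Dict[str, bool]) -> bool:
    """The service counts as up only if every dependency passed."""
    return bool(results) and all(results.values())


# Stubbed example: the application moved, but its storage did not follow.
example_results = evaluate_failover([
    Dependency("application", lambda: True),
    Dependency("storage", lambda: False),
    Dependency("virtual_ip", lambda: True),
    Dependency("load_balancer", lambda: True),
])
example_available = service_is_available(example_results)
```

In the stubbed example, the application check passes but the service is still reported as unavailable, which mirrors the point in the quote: a successful application move does not count if storage or networking is missing.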

Q: How does the “set it and forget it” mindset lead to unexpected outages in HA environments?

Pollard identifies the “set it and forget it” mindset as a recurring cause of real-world failover failures. Organizations follow documentation and build internal runbooks, and on paper the configuration looks correct. The problem is that runbooks go stale as environments change, and gaps accumulate silently because no one has run a full failover simulation to surface them. Those gaps are first discovered during an actual outage, which is exactly when the environment most needs to work.

“On paper it’s perfect. But because you never tested an actual robust failover procedure, there are holes under there that you were not aware of. And then once the actual failover happens, that’s when you’re finding out about it.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology
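One way to make stale runbooks visible is to record when each one was last exercised end to end and compare that with the last change to the environment it covers. The sketch below assumes a team tracks those two dates; the `Runbook` type, its fields, and the sample entries are hypothetical and not taken from the interview.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Runbook:
    name: str
    last_full_test: date    # hypothetical field: last end-to-end failover test
    last_env_change: date   # hypothetical field: last change to the covered environment


def stale_runbooks(runbooks: List[Runbook]) -> List[str]:
    """Return names of runbooks not tested since the environment last changed."""
    return [rb.name for rb in runbooks if rb.last_full_test < rb.last_env_change]


# Made-up sample data for illustration only.
sample = [
    Runbook("db-failover", date(2023, 3, 1), date(2024, 6, 15)),
    Runbook("app-failover", date(2024, 9, 1), date(2024, 6, 15)),
]
flagged = stale_runbooks(sample)
```

A flagged runbook is not necessarily wrong, but it has not been validated against the current environment, so it is a candidate for the next failover test.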

Q: What steps should IT admins take to keep failover strategies current as environments change?

Pollard outlines a continuous discipline rather than a one-time setup. Teams should stay in active contact with their HA solution providers to track new releases, known issues, available patches, and current best practices for both applications and databases. Applying those recommendations is necessary but not sufficient: the only way to confirm coverage is to test under realistic failure conditions. That means blocking networks, cutting power to systems, and verifying that standby nodes detect the failure, bring all services online, and allow external clients and dependent systems to connect successfully.

“You need to go in and block your networks. You need to cut power to your systems and make sure that the standbys can detect it, they can bring everything in service, and make sure that your clients can actually connect to it.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology
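The last step in that list, confirming that clients can actually connect, can be approximated from a client machine with a plain TCP probe. The following is a rough sketch using only the Python standard library; the function names, timeout, and polling interval are assumptions, and a successful TCP connection shows only that the endpoint is reachable, not that the application behind it is healthy.

```python
import socket
import time
from typing import Optional


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def wait_for_service(host: str, port: int, deadline_s: float = 120.0,
                     interval_s: float = 1.0) -> Optional[float]:
    """Poll until the service accepts connections.

    Returns the seconds elapsed until the first successful connection,
    or None if the deadline passes first.
    """
    start = time.monotonic()
    while True:
        if can_connect(host, port):
            return time.monotonic() - start
        if time.monotonic() - start >= deadline_s:
            return None
        time.sleep(interval_s)
```

During a test, an admin might start `wait_for_service` against the cluster's virtual IP and service port (for example a placeholder such as `192.0.2.10` and `5432`) right after blocking the network or cutting power on the active node. The returned value gives a rough client-side recovery time, and `None` means clients could not reach the service before the deadline.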

Resources & Documentation

  • SIOS Technology, provider of high availability and disaster recovery clustering software for critical applications

***

Full Raw Transcript

Swapnil Bhartiya: Can you talk about what are some common application level dependencies that are often overlooked in HA strategies?

Matthew Pollard: Yeah. So once you get inside the systems, at the level of the applications, the databases, it’s easy to fall into the trap of just thinking that once those services can fail over, it can be moved between systems. You have high availability, but all of those depend on other things, on the systems, they depend on storage. If you’re focused on your applications and you neglect the external databases that they rely on, if you neglect the networking components like your virtual IPs, your load balancers, then even if the application fails over, if it doesn’t have the storage that it depends on to function properly, if it doesn’t have the networking components that is required for clients to actually connect to the application, then you’ve still incurred a downtime, regardless of if the application moved or not, because it’s not functioning properly.

Swapnil Bhartiya: Is it possible for you to share some examples where because of lack of real world failure testing, it led to kind of unexpected outages?

Matthew Pollard: Yeah, of course. And I see that as another symptom of the set it and forget it mindset we mentioned earlier. But I’ve definitely observed many instances where organizations have set up everything, follow the documentation, followed their internal runbooks. On paper it’s perfect. But for example, the runbook was written a while back or the documentation wasn’t what covered you, all of your specific needs. And because you never tested an actual robust failover procedure where you try to, for example, simulate an actual outage that would cause a failover, there are holes under there that you were not aware of. And then once the actual failover happens, that’s when you’re finding out about it, which is when again, you need it to be working the most. Real world does not often follow those ideal on paper scenarios.

Swapnil Bhartiya: What steps can IT admins take to ensure that failover strategies keep up with changing environments that we just discussed?

Matthew Pollard: So what I would say is the steps you take to keep up with it is just testing, updating. Stay in contact with your partners, your providers for your HA solutions, to make sure that you have the newest releases that you’re aware of, any issues that might affect you, and how to work around or remediate them. Any patches you need to apply, make sure that you’re applying best practices as recommended by the provider for your applications, for your HA solutions, for your databases, and again, just back to the tested, because once you’ve done all of that, you need to make sure that you’re actually covering your needs. You need to go in and block your networks. You need to cut power to your systems and make sure that the standbys can detect it, they can bring everything in service, and make sure that your clients can actually connect to it. Anything external that depends on it can still function in that scenario.
