Why High Availability Breaks Even in the Cloud and How to Fix It | Matthew Pollard, SIOS Technology | TFiR

Moving workloads to the cloud does not eliminate downtime risk. Application-level dependencies including storage, virtual IPs, and external databases sit entirely outside the redundancy that cloud providers manage, and most organizations only discover those gaps during an actual outage. Failover runbooks written months or years ago frequently do not reflect the current environment, and untested procedures fail in exactly the scenarios they were written to handle.

In this interview on TFiR, Matthew Pollard, Customer Experience Software Engineer at SIOS Technology, walks through the top five high availability mistakes that persist in 2026, the specific dependency gaps that cause failover to fail silently, and the proactive testing and communication practices that keep HA strategies aligned with changing environments.

Guest: Matthew Pollard, Customer Experience Software Engineer at SIOS Technology
Show: TFiR

Here is what every IT admin and infrastructure engineer needs to know.

Technical Deep Dive

Q: Why does migrating to the cloud not automatically guarantee high availability?

Matthew Pollard, Customer Experience Software Engineer at SIOS Technology, explains that cloud environments do provide redundancy and reliability around cloud-managed infrastructure components, but that protection does not extend inside the systems. Applications, databases, and other internal components require a dedicated high availability solution to provide redundancy and failover capabilities. Organizations that assume cloud migration transfers full HA responsibility to the provider are leaving their most critical workloads unprotected.

“You inherit some kind of high availability protection for the cloud managed infrastructure components, but it wholly neglects everything inside the systems, your applications and other components that do need a specific high availability solution set up to protect them.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology

Q: Why do teams still treat high availability as a one-time setup instead of an ongoing process?

Pollard observes that organizations frequently adopt a set-it-and-forget-it mindset toward HA, even as their environments evolve continuously. New components get added, others are removed or reorganized, and each of those changes can affect the HA solution’s ability to respond correctly. HA must be included in patching and maintenance procedures proactively, because discovering that the solution was not updated to account for an environment change during an actual incident is the worst possible time to find out.

“You might pull something out from under it, you might add something in that it is not accounting for, and then when you have some kind of issue, it is not set up to be prepared to handle that.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology

Q: What application-level dependencies are most commonly overlooked in HA strategies?

Pollard points out that teams often stop at ensuring the application service itself can fail over, without accounting for the dependencies that application relies on. Storage, external databases, virtual IPs, and load balancers are all required for the application to function correctly after a failover. If those components are not included in the HA strategy, the application may move to the standby node but still be unable to serve clients, resulting in a downtime event even though the failover technically completed.

“Even if the application fails over, if it does not have the storage that it depends on, if it does not have the networking components required for clients to actually connect, you have still incurred a downtime regardless of if the application moved or not.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology
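
The dependency gap described above can be checked mechanically. The following is a minimal sketch, not a SIOS feature: it assumes a hand-maintained dependency map and a set of resources the HA solution protects, and every name in it (`unprotected_dependencies`, `web-app`, `orders-db`, and so on) is hypothetical. It walks direct and transitive dependencies and reports the ones that are missing from HA coverage.

```python
from collections import deque


def unprotected_dependencies(app, depends_on, protected):
    """Return every direct or transitive dependency of `app` that is not protected."""
    seen = set()
    queue = deque(depends_on.get(app, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(dep)
        queue.extend(depends_on.get(dep, []))
    return sorted(seen - set(protected))


# Hypothetical inventory: what each component needs in order to serve clients.
depends_on = {
    "web-app": ["shared-storage", "virtual-ip", "orders-db"],
    "virtual-ip": ["load-balancer"],
}
# Hypothetical list of resources the HA configuration currently covers.
protected = {"web-app", "shared-storage"}

gaps = unprotected_dependencies("web-app", depends_on, protected)
print(gaps)  # ['load-balancer', 'orders-db', 'virtual-ip']
```

In this made-up inventory the application and its storage would fail over, but the virtual IP, load balancer, and external database would not, which is the "failover completed, clients still down" case Pollard describes.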

Q: How does skipping real-world failover testing lead to unexpected outages?

Pollard describes a recurring pattern where organizations follow documentation and internal runbooks carefully, producing a setup that looks correct on paper, but never validate it with a realistic failover simulation. Runbooks written months or years earlier may not reflect the current environment, and gaps that were never discovered in testing surface only during an actual outage. Real-world failure scenarios rarely match the ideal conditions that documentation assumes.

“Because you never tested an actual robust failover procedure where you try to simulate an actual outage, there are holes under there that you were not aware of, and that is when you find out about it, which is when you need it to be working the most.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology
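
One inexpensive guard against stale runbooks is to compare the components a runbook names with the current inventory before each test. The sketch below assumes both lists are available as plain names; the function `runbook_drift` and all component names are hypothetical, and it complements real failover testing rather than replacing it.

```python
def runbook_drift(runbook_components, current_components):
    """Compare the components a runbook mentions with the current inventory."""
    runbook = set(runbook_components)
    current = set(current_components)
    return {
        # Present in the environment but never mentioned in the runbook.
        "missing_from_runbook": sorted(current - runbook),
        # Mentioned in the runbook but no longer present in the environment.
        "stale_in_runbook": sorted(runbook - current),
    }


# Hypothetical example: node-b was replaced by node-c and a load balancer was added.
runbook = ["node-a", "node-b", "shared-storage", "virtual-ip"]
current = ["node-a", "node-c", "shared-storage", "virtual-ip", "load-balancer"]

drift = runbook_drift(runbook, current)
print(drift)
# {'missing_from_runbook': ['load-balancer', 'node-c'], 'stale_in_runbook': ['node-b']}
```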

Q: How should organizations balance cost and performance in their high availability strategy?

Pollard frames this as a genuine balancing act with two failure modes. Over-investing produces an overly complex solution that is expensive to maintain and difficult to change. Under-investing saves money upfront but converts those savings into downtime losses when the solution fails to cover all business needs. The correct approach requires a thorough investigation of available solutions, providers, partners, strategies, and configurations to ensure the chosen solution covers all requirements without unnecessary cost or complexity.

“Any money that you save by not putting it into the high availability strategy, you are going to spend anyway in losses when you incur a downtime because it is not robust enough to cover all of your needs.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology
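
The trade-off can be made concrete with simple arithmetic: annual cost is HA spend plus expected downtime losses. The sketch below is purely illustrative. Every figure in it is invented for the example and does not come from the interview or from SIOS, and the model deliberately ignores the maintenance burden of an overly complex solution, which would push the over-engineered option even higher.

```python
def annual_cost(ha_spend, outages_per_year, hours_per_outage, cost_per_hour):
    """HA spend plus expected downtime losses for one year."""
    downtime_loss = outages_per_year * hours_per_outage * cost_per_hour
    return ha_spend + downtime_loss


COST_PER_HOUR = 5_000  # hypothetical cost of one hour of downtime

# Hypothetical options: (HA spend, outages per year, hours per outage).
options = {
    "under-invested": annual_cost(10_000, 4, 6, COST_PER_HOUR),
    "balanced": annual_cost(40_000, 1, 0.5, COST_PER_HOUR),
    "over-engineered": annual_cost(150_000, 1, 0.25, COST_PER_HOUR),
}

for name, total in options.items():
    print(f"{name}: {total:,.0f}")

cheapest = min(options, key=options.get)
print("lowest total cost:", cheapest)  # balanced
```

With these made-up numbers, the 30,000 saved by under-investing is outweighed by 120,000 in downtime losses, while the over-engineered option spends far more than the small amount of downtime it removes.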

Q: What steps should IT admins take to keep failover strategies current as environments change?

Pollard recommends a combination of regular testing, active engagement with HA solution providers to stay current on releases and known issues, and consistent application of vendor-recommended best practices for applications, databases, and HA tooling. Testing should go beyond configuration checks and include realistic simulations such as cutting network access or powering down systems to confirm that standby nodes detect the failure, bring all services online, and allow external clients and dependent systems to connect successfully.

“Block your networks, cut power to your systems, and make sure that the standbys can detect it, bring everything into service, and make sure that your clients can actually connect to it.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology
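
The last step of that test, confirming that clients can actually connect after the failover, can be scripted. The following is a minimal sketch using only Python's standard `socket` module. The function names and the endpoint label are hypothetical, and a TCP connect is only a basic reachability check, not proof that the application is healthy. In a real test the endpoints would be the virtual IP or load balancer addresses clients use, probed after the simulated outage; here a local listener stands in for the service so the example runs anywhere.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unreachable hosts
        return False


def failed_checks(endpoints, probe=can_connect):
    """Return the names of endpoints the probe could not reach."""
    return [name for name, (host, port) in endpoints.items() if not probe(host, port)]


# Stand-in for a client-facing service: a local listener on an ephemeral port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

# Hypothetical endpoint name; in practice this would be the address clients use.
endpoints = {"app-via-virtual-ip": ("127.0.0.1", port)}
reachable_result = failed_checks(endpoints)
listener.close()

print(reachable_result)  # an empty list means every endpoint accepted a connection
```

The `probe` parameter is injectable so the same report logic can be reused with a deeper check, such as an application-level query, in place of the plain TCP connect.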

Q: How does SIOS Technology help IT admins address common high availability challenges?

Pollard describes SIOS’s approach as proactive, open, and communication-driven. The team follows up with customers after issues are identified to confirm remediation steps were taken and to verify the solution is meeting their needs. SIOS offers services to help customers audit their configurations, identify problems before they cause incidents, and validate that their HA setup covers all required components. The priority is ensuring customers are not discovering gaps for the first time during a live outage.

“If I had to sum up our strategy, it is just being proactive, open, and communicative with our base.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology

Q: What emerging trends in high availability should organizations prepare for in 2026?

Pollard identifies the increasing compartmentalization of infrastructure teams as a significant emerging challenge. As environments grow more complex, organizations are forming dedicated teams for networking, databases, operating systems, applications, security, cloud, and infrastructure, and those teams do not always communicate the impact of their changes to each other or to the team responsible for HA. Since a high availability solution must monitor and manage all of those components, any gap in cross-team communication creates a risk that a change in one domain will break HA coverage in another. Pollard also notes that cybersecurity threats are an escalating concern that intersects directly with infrastructure stability.

“A trend I have been seeing emerge is the compartmentalization of teams and team responsibilities within an environment, and it creates the need for clear communication between those teams, especially when you have a high availability solution that is monitoring and managing everything.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology

Q: What is the most important advice for IT admins looking to strengthen their high availability setup?

Pollard emphasizes two non-negotiable practices: thoroughness and proactive planning. Every component that users and teams depend on must be explicitly protected within the HA solution, and the entire setup must be tested as realistically as possible, with QA and pre-production environments configured to match production as closely as achievable. Patching and maintenance windows must account for how the HA solution will respond, since an unplanned failover or alert triggered by a patch can compound the disruption. Verifying recommended steps with vendors before executing changes is essential.

“Be thorough and be proactive. That is probably the best advice that I can give.” — Matthew Pollard, Customer Experience Software Engineer, SIOS Technology

Resources and Documentation

  • SIOS Technology, high availability clustering and disaster recovery solutions for physical, virtual, and cloud environments

***

Full Raw Transcript

Swapnil Bhartiya: You might think that migrating to the cloud automatically solves all your problems and gives you high availability, but in reality it introduces new points of failure. Why? Because teams often mistakenly treat high availability as a one-time setup. But in reality it’s a balancing act that, if neglected, can lead to costly downtime. It is recommended that failover strategies evolve alongside changing environments. And today we have with us once again Matthew Pollard, Customer Experience Software Engineer at SIOS Technology, to uncover the top five high availability mistakes that persist even in 2026 and how to steer clear of them. Matthew, it’s great to have you back.

Matthew Pollard: Thank you for having me.

Swapnil Bhartiya: It’s my pleasure. Can you explain why simply moving to the cloud doesn’t solve or guarantee high availability?

Matthew Pollard: What a lot of organizations seem to think is once they get into the cloud, they inherit some kind of high availability protection for everything. And to a certain degree you do, because you get redundancy and reliability around a lot of the cloud managed infrastructure components. But it wholly neglects everything inside the systems, your applications and other components in the systems that do need a specific high availability solution set up to protect them, to provide redundancy and failover capabilities.

Swapnil Bhartiya: Can you also talk about why teams still treat high availability as a one-time setup instead of a continuous process? I mean, we have seen this in the security space, right? It used to be, hey, security, you did it once. It’s someone else’s problem. But things are changing, and I feel that, just the way we have the whole DevSecOps movement, high availability also cannot be treated as someone else’s problem or something you set up once and forget about.

Matthew Pollard: Yeah, and we certainly do observe organizations treating high availability as this type of set it and forget it type of component in your environment. But environments are complicated in the modern day. They are evolving constantly. New things being added, things being taken away, reshuffled, reorganized, and every one of those, as much as it can impact your applications, your databases, your systems, your high availability solution is something in that stack that does need to be proactively maintained, planned around and included in, for example your patching and your maintenance procedures. Otherwise you might pull something out from under it, you might add something in that it’s not accounting for. And then when you have some kind of issue, it’s not set up to be prepared to handle that. And now you’re finding out about that problem when you need high availability solutions to be doing their job.

Swapnil Bhartiya: So very well said. Thank you. Can you talk about some common application-level dependencies that are often overlooked in HA strategies?

Matthew Pollard: Yeah. So once you get inside the systems at the level of the applications, the databases, it’s easy to fall into the trap of just thinking that once those services can fail over, once they can be moved between systems, you have high availability. But all of those depend on other things on the systems. They depend on storage. If you’re focused on your applications and you neglect the external databases that they rely on, if you neglect the networking components like your virtual IPs, your load balancers, then even if the application fails over, if it doesn’t have the storage that it depends on to function properly, if it doesn’t have the networking components that are required for clients to actually connect to the application, then you’ve still incurred a downtime regardless of if the application moved or not, because it’s not functioning properly.

Swapnil Bhartiya: Is it possible for you to share some examples where because of lack of real world failure testing, it led to kind of unexpected outages?

Matthew Pollard: Yeah, of course. And I see that as another symptom of the set-it-and-forget-it mindset we mentioned earlier. But I’ve definitely observed many instances where organizations have set up everything, followed the documentation, followed their internal runbooks. On paper it’s perfect. But, for example, the runbook was written a while back, or the documentation didn’t cover all of your specific needs. And because you never tested an actual robust failover procedure where you try to, for example, simulate an actual outage that would cause a failover, there are holes under there that you were not aware of. And then once the actual failover happens, that’s when you’re finding out about it, which is when, again, you need it to be working the most. The real world does not often follow those ideal on-paper scenarios.

Swapnil Bhartiya: These days, organizations are also becoming very, very cost-conscious. But at the same time, performance and high availability are also becoming a critical talking point. How can organizations find the right balance between cost and performance in their strategies?

Matthew Pollard: Yeah, it can certainly be difficult. And it’s different for everyone because, like you said, it’s this balancing act. You can throw everything you have, everything you possibly can afford to, at your solution, your strategy. But what results from that is often over-engineered. It’s overly complex, which makes it very hard to maintain or change or upkeep. And like you said, the cost is becoming a greater factor every day. So anything overly expensive becomes a problem. But if you overcorrect for that and you go too little, any money that you save by not putting it into the high availability strategy, the solution, you’re going to spend anyway in losses when you incur a downtime because it’s not robust enough to cover all of your needs. So what you have to do is take a real thorough investigation into your options of solutions, providers, partners, strategies, configurations. Once you’ve chosen your HA solution or partner, provider, strategy, make sure that it is actually covering all of your needs and that you have it configured to cover all of your needs, but without breaking the bank.

Swapnil Bhartiya: What steps can IT admins take to ensure that failover strategies keep up with changing environments that we just discussed?

Matthew Pollard: So what I would say is the steps you take to keep up with it are just testing and updating. Stay in contact with your partners, your providers for your HA solutions, to make sure that you have the newest releases, that you’re aware of any issues that might affect you and how to work around or remediate them, and any patches you need to apply. Make sure that you’re applying best practices as recommended by the provider for your applications, for your HA solutions, for your databases. And again, just back to testing it, because once you’ve done all of that, you need to make sure that you’re actually covering your needs. You need to go in and block your networks, you need to cut power to your systems, and make sure that the standbys can detect it, that they can bring everything in service, and that your clients can actually connect to it, so that anything external that depends on it can still function in that scenario.

Swapnil Bhartiya: Of course, I’m talking to you, so I would also love to know: how does SIOS Technology approach these common high availability challenges that we just talked about to make life easier for IT admins?

Matthew Pollard: Right. So the number one priority is always communication. It’s always getting in touch with our partners, with our users, and making sure that what they’ve configured is actually meeting their needs. It’s following up after they’ve noticed some kind of issue and saying, have you remediated it? Have you taken these steps that we’ve identified to make sure it won’t happen again? Is there any way we can help you make sure that your solution is meeting your needs? Any testing? Do you need us to come in and check your systems? We offer all kinds of services to help customers identify any problems and remediate them before they actually cause any issues. So I would say that if I had to sum up our strategy, it’s just being proactive, open and communicative with our base.

Swapnil Bhartiya: Are there any emerging trends in the HA space that organizations should be aware of in 2026?

Matthew Pollard: I think that especially with some of the high-profile issues that have happened recently, not just HA, but security, it’s always been a huge concern and it’s become an even bigger concern. Cybersecurity is commonly referred to as a type of arms race where it’s always escalating, always improving, both on the attacker and the defender side. And one thing that we’ve been noticing more and more is teams dealing with security, teams dealing with infrastructure, not always communicating enough to understand the impacts of the changes that they’re making. And I would even take that to a more general view. As environments are getting more complex, it’s becoming more common for there to be dedicated teams for each different type of component. A networking team, a database team, an operating system team, an application team, a security team, a cloud team, an infrastructure team. And making sure that all of your teams are communicating with each other, even when that can be complex or hard to set up, is very important, or at the very least that your HA team has lines to communicate with all of those. Because, as we’ve talked about before, high availability is such a complicated type of environment when it has to cover all of these different components, the application dependencies, the application, the infrastructure. So a trend I would say I’ve been seeing emerge is the compartmentalization of teams and team responsibilities within an environment. And it creates the need for clear communication between those teams, especially when you have a high availability solution that’s monitoring everything and managing everything.

Swapnil Bhartiya: And finally, what advice would you give to IT admins looking to strengthen their HA setup?

Matthew Pollard: If you’re looking to strengthen your HA setup, I would say go through and really make sure that everything your users need, everything your teams need, is being protected. I’m going to say it again because it’s just so important. Test, test, test, test, test it all. Test it as realistically as you can. If you’re setting up some kind of QA or pre-production environment, then make sure that it is as similar as it can be to the production environment, because that’s how you get the most credibility out of your testing. And be proactive with your planning. Make sure that when you have to update, when you have to patch, everything is considered, as is the nature of an HA solution having to manage all these different components. If your patching causes the HA solution to trigger, you know, a failover or some kind of alert when it shouldn’t, that can affect the components that you’re patching as well. Check with your vendors to make sure that you’re following their appropriate and recommended steps. It just requires you to be very thorough. So that’s probably the best advice that I can give. Be thorough and be proactive.
