Most enterprise outages don’t come from hardware failures or cyberattacks. They come from the patch cycle. HA architectures built for disaster recovery leave a critical gap when it comes to planned maintenance — and that gap is where production goes dark.
The Guest: Dave Bermingham, Senior Technical Evangelist at SIOS Technology
Key Takeaways:
- Most HA failures happen during maintenance windows, not random outages — HA architectures designed for disaster recovery don’t account for planned patching workflows
- Application-level clustering enables rolling, near-zero downtime updates; hypervisor-level solutions like VMware HA and Hyper-V clustering still require the workload inside the VM to go offline
- Configuration drift between nodes is a silent killer — servers diverge over time and failovers that worked in the lab behave unexpectedly in production
- The standby-node-first approach — patch the standby, fail over, patch the original — reduces risk and preserves a fast rollback path
- A documented, rehearsed patching playbook is the single highest-ROI improvement an IT team can make before the next maintenance window
In a recent TFiR interview, Swapnil Bhartiya spoke with Dave Bermingham, Senior Technical Evangelist at SIOS Technology, about why patching remains one of the most dangerous activities in high availability environments — and the practical architectural patterns that eliminate that risk.
WHY PATCHING BREAKS HIGH AVAILABILITY ENVIRONMENTS
The core problem with HA and maintenance windows
Most HA architectures are engineered around a single question: what happens when something fails unexpectedly? They are not engineered around a different, equally critical question: what happens when the team intentionally takes a node offline to patch it? That design gap is where the majority of real-world outages originate.
“Most outages actually occur during maintenance, not during random failures. When you apply a patch, you’re introducing new code, new drivers, and maybe a reboot. Sometimes these things interact with applications in ways that nobody expected.”
Bermingham explained that because HA architectures don’t include a clear maintenance workflow, IT teams are forced into one of two bad patterns: they delay patching, which accumulates security risk, or they rush through it and hope nothing breaks. Neither is acceptable at enterprise scale.
HYPERVISOR HA VS. APPLICATION-LEVEL CLUSTERING
Why not all HA solutions solve the patching problem
One of the most persistent misconceptions Bermingham encounters is that all HA solutions handle patching the same way. They do not. The distinction between hypervisor-level HA and application-level clustering is critical when evaluating a patching strategy.
Hypervisor-level solutions — VMware HA, Hyper-V clustering — protect against physical host failures and allow maintenance on the underlying hardware. But when the workload that needs patching is the operating system or application running inside the virtual machine, the VM itself still has to go offline. Scheduled downtime is unavoidable.
Application-level clustering solves this differently. Because the application is running across multiple nodes simultaneously, the team can patch one node at a time while the application continues serving users on the other. Rolling updates become operationally possible, not just theoretically possible.
“Application-level clustering becomes much more useful because the application is running on multiple nodes. You can patch one node at a time and perform rolling updates. The application keeps running while you update the environment, which can dramatically reduce the required downtime.”
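The rolling-update idea can be sketched in a few lines. This is a minimal, illustrative model — the `Cluster` class and its methods are hypothetical stand-ins for a real clustering stack, not a SIOS or vendor API — but it captures the invariant that matters: only one node is ever out of service, so the cluster never stops serving.

```python
# Minimal sketch of a rolling update: patch one node at a time while
# the rest of the cluster keeps serving. All names here are illustrative.

class Cluster:
    def __init__(self, nodes):
        # every node starts online and unpatched
        self.nodes = {n: {"online": True, "patched": False} for n in nodes}

    def serving_capacity(self):
        # number of nodes currently able to serve traffic
        return sum(1 for s in self.nodes.values() if s["online"])

    def rolling_update(self):
        for name, state in self.nodes.items():
            state["online"] = False      # take exactly one node out of service
            assert self.serving_capacity() >= 1, "cluster must keep serving"
            state["patched"] = True      # stand-in for: install patch, reboot, validate
            state["online"] = True       # rejoin the cluster before touching the next node

cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.rolling_update()
print(all(s["patched"] for s in cluster.nodes.values()))  # True
```

The assertion inside the loop is the whole point: a rolling update is safe precisely because the serving capacity never reaches zero.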
THE STANDBY-NODE-FIRST PATCHING WORKFLOW
A controlled, step-by-step approach to near-zero downtime patching
Bermingham outlined a concrete patching sequence that applies directly to clustered environments:
1. Patch the standby node first. In a standard two-node cluster, one node is active and one is on standby. Apply the patch to the standby and validate that everything looks correct before proceeding.
2. Fail the workload over. Once the standby is patched and confirmed stable, move the workload to the newly patched node. From the user's perspective, the application stays available; the interruption is typically only a few seconds during the switchover.
3. Patch the original node. The original active node is now the standby. Patch it in the same controlled manner.
4. Preserve the rollback path. If the newly patched node misbehaves after the failover, the team has an immediate rollback option: fail back to the still-unpatched original node while diagnosing the issue.
“Instead of one risky maintenance event, you break the patch event into small, controlled steps. That approach takes a lot of the fear out of patching.”
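The four steps above can be sketched as a simple orchestration function. Node names and the `apply_patch`/`validate` helpers are hypothetical placeholders for real patch tooling and health checks, and this is a sketch of the sequence, not a production failover controller.

```python
# Illustrative sketch of the standby-node-first sequence on a two-node
# cluster. Helpers are stand-ins for real patch and health-check tooling.

def apply_patch(node):
    node["patched"] = True               # stand-in for: install update, reboot

def validate(node):
    return node["patched"]               # stand-in for real health checks

def standby_first_patch(active, standby):
    # 1. Patch the standby node first; users stay on the active node.
    apply_patch(standby)
    if not validate(standby):
        return active, standby           # abort: production never moved

    # 2. Fail over: the patched standby becomes active (seconds of switchover).
    active, standby = standby, active

    # 3. Patch the original node, now acting as the standby.
    apply_patch(standby)

    # 4. Rollback path: until step 3 completes, the original node is still
    #    running the old, known-good code, so the team can fail back to it.
    return active, standby

a = {"name": "node-1", "patched": False}
b = {"name": "node-2", "patched": False}
active, standby = standby_first_patch(a, b)
print(active["name"], active["patched"], standby["patched"])  # node-2 True True
```

The early return after step 1 is what makes the approach low-risk: a bad patch on the standby is discovered before the workload ever moves.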
CONFIGURATION DRIFT AND THE SILENT FAILOVER FAILURE
Why lab testing isn’t enough
Configuration drift is one of the most underestimated risks in HA environments. Over time, individual nodes accumulate small differences — a slightly different patch level, a different driver version, a changed configuration setting. Each difference is minor on its own. Together, they create a system where failover behaves differently in production than it did in testing.
“Over time, servers start to look a little different from each other. Maybe one has a slightly different patch level or driver version or configuration setting. Everything works fine until the moment you fail over. Then suddenly the application behaves differently on the other node.”
Bermingham also flagged a systemic testing gap: lab environments that don’t accurately reflect production. When the lab doesn’t match production, patches that pass lab validation can still cause surprises when they hit live systems. The mitigation is regular, realistic failover rehearsals — not just one-time tests — so the team has high confidence in the exact outcome before a maintenance window begins.
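One straightforward way to catch drift before a failover exposes it is to collect an inventory from each node (package versions, driver versions, config-file hashes) and diff them pairwise. The sketch below assumes hypothetical inventory data; in practice the inventories would come from a package-manager query or a configuration-management report.

```python
# Illustrative drift check: flag any item whose version differs across
# nodes. Inventory contents here are made-up example data.

def find_drift(inventories):
    """inventories: {node_name: {item: version}} -> list of drift findings."""
    all_items = set()
    for inv in inventories.values():
        all_items.update(inv)
    findings = []
    for item in sorted(all_items):
        versions = {node: inv.get(item, "<missing>") for node, inv in inventories.items()}
        if len(set(versions.values())) > 1:   # nodes disagree on this item
            findings.append((item, versions))
    return findings

drift = find_drift({
    "node-1": {"openssl": "3.0.13", "app.conf-sha256": "ab12", "nic-driver": "5.4"},
    "node-2": {"openssl": "3.0.11", "app.conf-sha256": "ab12", "nic-driver": "5.4"},
})
print(drift)  # [('openssl', {'node-1': '3.0.13', 'node-2': '3.0.11'})]
```

Run on a schedule, a check like this turns drift from a silent failover surprise into a routine ticket.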
THE PLAYBOOK: THE HIGHEST-ROI FIX IN PATCH MANAGEMENT
Why documentation and rehearsal outperform tooling changes
When asked for the single most impactful step an IT team can take, Bermingham’s answer was operational, not architectural: document and rehearse the patching procedure.
“Many organizations handle patching differently every time. One person does it one way; another person does it differently—and that’s where mistakes can happen. If you define a repeatable process and practice it so it becomes easy during the maintenance window, the whole team becomes much more confident. If we’re all working off the same playbook, even something as simple as running a planned switchover test before patching can make a huge difference.”
The pre-patch switchover test is a particularly high-value practice. If the team confirms the application can move between nodes successfully before any patching begins, the entire maintenance window becomes significantly less stressful.
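That pre-patch check amounts to a go/no-go gate: run a planned switchover round trip and only open the maintenance window if the application stays healthy on both nodes. The sketch below simulates the cluster state; `switchover` and `app_healthy` are hypothetical placeholders for real cluster commands and health probes.

```python
# Illustrative pre-patch gate: prove the workload can move between
# nodes and back before any patching begins. All helpers are stand-ins.

def pre_patch_gate(switchover, app_healthy):
    switchover()                 # move the workload to the standby node
    if not app_healthy():
        return False             # failover itself is broken: fix that first
    switchover()                 # move it back to the original node
    return app_healthy()

state = {"active": "node-1"}

def switchover():
    state["active"] = "node-2" if state["active"] == "node-1" else "node-1"

def app_healthy():
    return state["active"] in ("node-1", "node-2")   # stand-in for real probes

print(pre_patch_gate(switchover, app_healthy))  # True
```

If the gate returns False, the team has found a failover problem on its own schedule rather than in the middle of a patch.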
THE FUTURE OF PATCH MANAGEMENT
Automation, infrastructure-as-code, and invisible updates
Bermingham sees the long-term trajectory of patch management moving toward smaller, more frequent updates absorbed by systems designed to remain online throughout the process — rather than large, infrequent maintenance events that require after-hours coordination and carry high blast radius.
“Instead of big maintenance events where everyone stays up late hoping nothing breaks, updates will happen in smaller increments and more frequently. Systems will be designed to absorb those changes while staying online. The goal is to make patching almost invisible to end users — security updates will still happen, systems will stay stable, and organizations will not have to choose between protecting their systems and keeping them available.”
High availability infrastructure and automation will be central to achieving that state — not as a future aspiration, but as an architectural requirement for any team that wants to stop treating patch day as a production incident.