Patching a live production system doesn’t have to mean scheduled downtime. With the right HA architecture and a deliberate sequencing strategy, IT teams can patch their entire cluster while keeping the application available — with a fast rollback path if anything goes wrong.
The Guest: Dave Bermingham, Senior Technical Evangelist at SIOS Technology
The Bottom Line:
• The standby-node-first patching workflow (patch the standby, validate, fail over, patch the original) keeps applications available throughout the maintenance window, with only seconds of interruption during switchover, and preserves an immediate rollback path if the patched node doesn't behave as expected.
Speaking with TFiR, Dave Bermingham of SIOS Technology described the current state of near-zero downtime patching in high-availability environments and walked through the exact operational sequence that makes it work in practice.
WHAT IS NEAR-ZERO DOWNTIME PATCHING IN AN HA CLUSTER?
Near-zero downtime patching is a maintenance approach that uses the existing active/standby architecture of a clustered environment to apply updates without taking the application offline. Rather than scheduling a maintenance window where the application is unavailable, the team patches one node at a time — moving the workload to keep it running throughout the process.
“The basic idea is pretty simple. Instead of shutting down the application to patch it, you move the workload somewhere else first. In a clustered environment, you usually have an active node and a standby node. The standby node is already running and ready to take over.”
The workflow has four steps. First, patch the standby node. Second, validate that the patched node is healthy and ready to take over. Third, fail the workload over to the newly patched server, which becomes the active system. Fourth, return to the original node, which is now the standby, and patch it using the same controlled process. From the user's perspective, the application stays available the entire time. The only interruption is typically a few seconds during the switchover.
“Instead of one risky maintenance event, you break the patch event into small, controlled steps. That approach takes a lot of the fear out of patching.”
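As a rough sketch of how a team might script that sequence, the example below walks the four steps in order. Every function in it (patch_node, node_is_healthy, failover_to) is a hypothetical placeholder for whatever patch manager and cluster tooling is actually in use; it is not a SIOS-specific or vendor API.

```python
# Minimal sketch of the standby-node-first patching sequence.
# patch_node, node_is_healthy, and failover_to are hypothetical placeholders
# for whatever patch manager and cluster tooling the team actually runs.

def patch_node(node: str) -> None:
    """Placeholder: apply OS/application patches to the named node."""
    print(f"patching {node}")

def node_is_healthy(node: str) -> bool:
    """Placeholder: run post-patch validation checks on the node."""
    print(f"validating {node}")
    return True

def failover_to(node: str) -> None:
    """Placeholder: move the clustered workload so this node becomes active."""
    print(f"failing over to {node}")

def rolling_patch(active: str, standby: str) -> None:
    # Step 1: patch the node that is NOT carrying the workload.
    patch_node(standby)

    # Step 2: validate before moving anything; the workload has not moved yet,
    # so a bad patch here costs nothing.
    if not node_is_healthy(standby):
        raise RuntimeError(f"{standby} failed validation; workload stays on {active}")

    # Step 3: switch over. Users see only the few seconds the failover takes.
    failover_to(standby)

    # Step 4: the original node is now the standby; repeat the same controlled steps.
    patch_node(active)
    if not node_is_healthy(active):
        raise RuntimeError(f"{active} failed validation after patching")

rolling_patch(active="node-a", standby="node-b")
```

The point of the structure is visible in the ordering: nothing moves until the patched standby has been validated, so each step is small, reversible, and observable on its own.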
The Rollback Path Is Built In
One of the most operationally valuable aspects of this approach is how it simplifies rollback. In a traditional single-server patching scenario, rolling back a bad patch is complex and time-consuming. In the standby-node-first workflow, rollback is immediate: if the patched backup system behaves unexpectedly, the team simply fails back to the unpatched original node while they diagnose the issue.
“If things didn’t go as planned when you patch that backup system, you have a simple way to roll back. You can just fail back to the unpatched node until you figure out what went wrong.”
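Continuing the sketch above, the rollback decision can sit directly after the switchover: if the patched node misbehaves, fail straight back. application_is_healthy is another hypothetical placeholder for whatever checks the team trusts, and failover_to is the same placeholder used in the earlier sketch.

```python
# Sketch of the built-in rollback path, continuing the example above.

def application_is_healthy(node: str) -> bool:
    """Placeholder: confirm the application is serving correctly on this node."""
    return True

def switchover_with_rollback(patched_standby: str, unpatched_active: str) -> bool:
    failover_to(patched_standby)           # patched node becomes active

    if application_is_healthy(patched_standby):
        return True                        # safe to go on and patch the other node

    # Rollback: the original node is still unpatched and intact, so fail back
    # immediately and diagnose the problem without a prolonged outage.
    failover_to(unpatched_active)
    return False
```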
Broader Context: Why Architecture Makes This Possible
In the full TFiR interview, Bermingham explains why this workflow is only available to teams running application-level clustering — not hypervisor-level HA solutions like VMware HA or Hyper-V clustering. Hypervisor-level solutions protect the physical host, but the workload inside the VM still has to come offline for OS and application patching. Application-level clustering, by contrast, runs the application across multiple nodes simultaneously, which is precisely what makes the standby-node-first sequence operationally viable.
Bermingham also addresses configuration drift — the incremental divergence between nodes over time that can cause failovers to behave differently in production than they did in testing — and the importance of rehearsed patching playbooks. Knowing the application can move between nodes successfully before the maintenance window begins is what separates confident patch cycles from anxious ones.
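One way to make that pre-window confidence concrete is a simple drift check that compares the two nodes before the maintenance window opens. The sketch below is a minimal illustration under assumed names: collect_facts and the fields it returns are placeholders standing in for whatever inventory tooling the team already has.

```python
# Hypothetical pre-maintenance drift check. collect_facts is a placeholder for
# however the team gathers node inventory (OS patch level, application version,
# hashes of key config files); the point is comparing the nodes before the
# window opens, not this particular set of fields.

def collect_facts(node: str) -> dict:
    """Placeholder: return the configuration facts that matter for failover."""
    return {"os_patch_level": "...", "app_version": "...", "config_hash": "..."}

def find_drift(node_a: str, node_b: str) -> dict:
    a, b = collect_facts(node_a), collect_facts(node_b)
    return {key: (a[key], b.get(key)) for key in a if a[key] != b.get(key)}

drift = find_drift("node-a", "node-b")
if drift:
    print("Configuration drift detected; resolve it before the maintenance window:")
    for key, (val_a, val_b) in drift.items():
        print(f"  {key}: node-a={val_a!r} node-b={val_b!r}")
```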
Watch the full TFiR interview with Dave Bermingham here