
The Two HA Patching Mistakes Causing Your Production Incidents: Dave Bermingham, SIOS Technology | TFiR


Most IT teams believe their high availability architecture has patching covered. Dave Bermingham of SIOS Technology explains why that assumption is the root cause of two of the most common — and most avoidable — production incidents in enterprise IT.

The Guest: Dave Bermingham, Senior Technical Evangelist at SIOS Technology

The Bottom Line: Hypervisor-level HA solutions protect the physical host but still require the workload inside the VM to go offline for OS and application patching; only application-level clustering enables true rolling updates. Configuration drift between nodes creates silent failover failures that lab testing consistently misses, which makes realistic rehearsal the single most important corrective practice an IT team can adopt.


Speaking with TFiR, Dave Bermingham of SIOS Technology outlined the two systemic HA patching errors that turn routine maintenance into production incidents, and the practices that eliminate them.

WHAT ARE THE MOST COMMON HA PATCHING MISTAKES?

Bermingham identifies two distinct categories of failure. The first is architectural: teams selecting or operating HA solutions that cannot actually support near-zero downtime patching at the OS and application layer. The second is operational: configuration drift between cluster nodes that goes undetected until a failover exposes it in production.

Mistake One: Assuming All HA Solutions Are Equal

The most widespread misunderstanding Bermingham encounters is the assumption that any HA architecture solves the patching problem. It doesn’t — and the distinction between hypervisor-level HA and application-level clustering is the critical variable.

“Hypervisor-level solutions like VMware HA or Hyper-V clustering are great for protecting against hardware failures or allowing maintenance on the physical host. But if you need to patch the operating system or the application running inside the VM, the workload still has to be taken offline. You have scheduled downtime for maintenance inside the virtual machine.”

Application-level clustering operates at a different layer entirely. Because the application runs across multiple nodes, teams can patch one node at a time, so rolling updates become practical in day-to-day operations rather than possible only in theory.

“Because the application is running on multiple nodes, you can patch one node at a time and perform rolling updates. The application keeps running while you update the environment, which can dramatically reduce the downtime required.”

Mistake Two: Configuration Drift and the Lab Testing Gap

The second mistake is subtler and more dangerous because it is invisible until the worst possible moment. Over time, individual nodes in a cluster accumulate small differences — slightly different patch levels, different driver versions, different configuration settings. Each difference is minor in isolation. Together, they create a cluster where failover behavior in production no longer matches what the team tested.

“Over time, servers start to look a little different from each other. Maybe one has a slightly different patch level or driver version or configuration setting. Everything works fine until the moment you fail over. Then suddenly the application behaves differently on the other node.”
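The kind of drift described above can be surfaced before a failover does it for you. What follows is a minimal sketch, not a SIOS tool or anything shown in the interview: it assumes a team has already collected a few facts per node (the node names, keys, and values below are hypothetical) and simply reports every setting that is not identical across the cluster.

```python
from typing import Dict

# Hypothetical per-node facts; how they are collected is outside this sketch.
NodeFacts = Dict[str, str]


def find_drift(nodes: Dict[str, NodeFacts]) -> Dict[str, Dict[str, str]]:
    """Return each setting whose value differs between nodes.

    The result maps setting name -> {node name: value}. A setting that is
    absent on a node is reported with the placeholder "<missing>".
    """
    all_keys = set()
    for facts in nodes.values():
        all_keys.update(facts)

    drift: Dict[str, Dict[str, str]] = {}
    for key in sorted(all_keys):
        values = {name: facts.get(key, "<missing>") for name, facts in nodes.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


if __name__ == "__main__":
    # Illustrative values only, not taken from any real cluster.
    inventory = {
        "node-a": {"os_patch_level": "2024-05", "storage_driver": "1.4.2"},
        "node-b": {"os_patch_level": "2024-05", "storage_driver": "1.3.9"},
    }
    for setting, values in find_drift(inventory).items():
        print(f"DRIFT {setting}: {values}")
```

A report like this is only as good as the facts fed into it, so it complements rather than replaces the rehearsed failovers Bermingham recommends.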

This problem is compounded by a systemic testing gap: lab environments that don’t accurately reflect production. Teams that validate patches in a non-representative lab gain false confidence — and get surprises when the patch hits live systems.

“Teams may test patches in a lab, but the lab environment does not look anything like production. So when the patch hits production, there are surprises. The best practice is to rehearse failovers and patching procedures regularly so the team knows exactly what will happen when the time comes to actually apply the patch.”

Broader Context: The Full Patching Strategy

In the full TFiR interview, Bermingham connects these two mistakes to a broader framework for safe, fast patching in HA environments. The standby-node-first patching workflow — patch the standby, validate, fail the workload over, then patch the original node — is the operational mechanism that makes rolling updates work in practice. A documented, rehearsed patching playbook closes the human coordination gap. And a pre-patch switchover test, run before any patching begins, is the single highest-confidence check a team can perform.
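The standby-node-first sequence can be written down as an explicit, ordered procedure, which is roughly what a rehearsed playbook captures. The sketch below is illustrative only: the function name and the `patch`, `validate`, and `failover` callables are hypothetical placeholders for whatever tooling a team actually uses, not part of any SIOS product or API.

```python
from typing import Callable, List


def standby_first_patch(
    active: str,
    standby: str,
    patch: Callable[[str], None],
    validate: Callable[[str], bool],
    failover: Callable[[str, str], None],
) -> List[str]:
    """Patch the standby, validate it, fail over, then patch the original node.

    Returns a log of the steps taken. If the standby fails validation, the
    workload is never moved and the original active node is left untouched.
    """
    steps: List[str] = []

    # 1. Patch the node that is not carrying the workload.
    patch(standby)
    steps.append(f"patched {standby}")

    # 2. Validate before moving anything.
    if not validate(standby):
        steps.append(f"validation failed on {standby}; workload stays on {active}")
        return steps
    steps.append(f"validated {standby}")

    # 3. Move the workload onto the freshly patched node.
    failover(active, standby)
    steps.append(f"failed over {active} -> {standby}")

    # 4. Patch the original node, which is now the standby.
    patch(active)
    steps.append(f"patched {active}")

    # 5. Confirm the original node is healthy and ready to take the workload back.
    if validate(active):
        steps.append(f"validated {active}")
    else:
        steps.append(f"validation failed on {active}; workload stays on {standby}")
    return steps


if __name__ == "__main__":
    # Stub callables so the ordering can be rehearsed without touching real nodes.
    log = standby_first_patch(
        "node-a",
        "node-b",
        patch=lambda node: None,
        validate=lambda node: True,
        failover=lambda src, dst: None,
    )
    print("\n".join(log))
```

Running the sequence against stubs first is a cheap way to rehearse the order of operations; the pre-patch switchover test mentioned above would still be run against the real cluster before step 1.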

The through-line across all of Bermingham’s recommendations is the same: patch anxiety is an architectural and operational problem, not a skills problem. The teams that treat it as a design challenge — rather than something to manage through caution and luck — are the ones that make patch day routine.

Watch the full TFiR interview with Dave Bermingham here
