Most HA architectures are engineered to survive unexpected failures. They are not engineered for the maintenance window your team schedules every month — and that gap is where production goes dark.

The Guest: Dave Bermingham, Senior Technical Evangelist at SIOS Technology

The Bottom Line:
• HA architectures are designed around the question “what if the server crashes?”, not “what if we intentionally take a node offline to patch it?” Bermingham argues that this design gap is behind most enterprise outages, and it creates a no-win choice: delay patching and accumulate security debt, or rush through it and hope nothing breaks.

[expander_maker]

Speaking with TFiR, Dave Bermingham of SIOS Technology described the current state of high availability patching and explained why the anxiety most IT teams feel during maintenance windows is a direct symptom of an architectural problem that most HA solutions leave unaddressed.

WHAT IS THE HA PATCHING PROBLEM?

High availability architecture is built to answer one question: what happens when something fails unexpectedly? It is not built to answer a different but equally critical question: what happens when the team intentionally takes a node or application offline to apply a patch? Most enterprise HA deployments never close that gap — and every patch cycle exposes it.

“Patching is risky because you’re intentionally changing something in a system that people rely on to be stable. Most outages actually occur during maintenance, not during random failures. When you apply a patch, you’re introducing new code, new drivers, and possibly triggering a reboot. Sometimes these things interact with applications in ways that nobody expected.”

The result is a structural problem that looks like a people problem. IT teams approach patch day with anxiety, and that anxiety produces one of two outcomes: they delay patching — which compounds security risk — or they rush through it and accept the possibility that something will break. Neither outcome is acceptable for organizations running business-critical workloads.

“If the architecture does not include a clear maintenance workflow, you end up with a lot of anxiety around patching. People either delay it, which creates security risks, or they rush through it and hope nothing breaks. A good HA design should make patching feel routine rather than stressful.”

Broader Context: What SIOS Technology Does Differently

In the TFiR interview, Bermingham goes deeper on why not all HA solutions solve this problem equally. Hypervisor-level solutions — VMware HA, Hyper-V clustering — protect against physical host failures but still require the workload inside the VM to go offline for OS and application patching. Scheduled downtime remains unavoidable at that layer.

Application-level clustering, the approach SIOS Technology is built on, solves this at a different layer. Because the application is clustered across multiple nodes, with one node active and another on standby, teams can patch one node at a time using a standby-node-first workflow: patch the standby, validate it, fail the workload over, then patch the original node. From the user’s perspective, the application stays available apart from an interruption that is typically only a few seconds during the switchover.
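The standby-node-first workflow can be sketched in a few lines of Python. This is a minimal simulation rather than SIOS tooling: the `Node` class and the `apply_patch`, `validate`, and `fail_over` helpers are hypothetical placeholders for whatever a real cluster uses to patch, health-check, and switch over. The point it illustrates is the ordering, and the fact that a failed validation stops the run before the node serving users is touched.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a cluster node."""
    name: str
    role: str              # "active" or "standby"
    patch_level: int = 0
    healthy: bool = True


def apply_patch(node: Node, target_level: int) -> None:
    # Placeholder for the real patching step (package install, reboot, ...).
    node.patch_level = target_level


def validate(node: Node, target_level: int) -> bool:
    # Placeholder for real post-patch checks (services up, data in sync, ...).
    return node.healthy and node.patch_level == target_level


def fail_over(active: Node, standby: Node) -> None:
    # Placeholder for the cluster's switchover; here the roles are simply swapped.
    active.role, standby.role = "standby", "active"


def rolling_patch(active: Node, standby: Node, target_level: int) -> list[str]:
    """Patch the standby, validate it, fail over, then patch the original node."""
    log = []
    apply_patch(standby, target_level)
    log.append(f"patched {standby.name}")
    if not validate(standby, target_level):
        # Stop here: the active node is still unpatched and still serving users.
        log.append(f"validation failed on {standby.name}; aborting")
        return log
    log.append(f"validated {standby.name}")
    fail_over(active, standby)
    log.append(f"failed over to {standby.name}")
    apply_patch(active, target_level)
    log.append(f"patched {active.name}")
    status = "validated" if validate(active, target_level) else "validation failed on"
    log.append(f"{status} {active.name}")
    return log


if __name__ == "__main__":
    node_a = Node("node-a", "active")
    node_b = Node("node-b", "standby")
    for step in rolling_patch(node_a, node_b, target_level=1):
        print(step)
```

In a real playbook, each placeholder would be replaced by the team's own documented and rehearsed step.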

Bermingham also covers configuration drift — the incremental divergence between nodes that creates silent failover failures — and makes the case that a documented, rehearsed patching playbook is the single highest-ROI improvement most IT teams can make before their next maintenance window.
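One way to picture configuration drift is as a diff between what each node reports about itself. The sketch below assumes each node's configuration has already been collected into a plain dictionary; the keys and values are invented for illustration, and `find_drift` is a hypothetical helper, not part of any SIOS product.

```python
def find_drift(node_a: dict, node_b: dict) -> dict:
    """Return every setting whose value differs between two nodes.

    A setting missing on one node is reported with None on that side.
    """
    drift = {}
    for key in sorted(node_a.keys() | node_b.keys()):
        value_a, value_b = node_a.get(key), node_b.get(key)
        if value_a != value_b:
            drift[key] = (value_a, value_b)
    return drift


if __name__ == "__main__":
    # Hypothetical inventories; real ones would come from each node itself.
    active = {"os_patch": "2024-05", "driver": "1.2", "service_account": "svc_app"}
    standby = {"os_patch": "2024-03", "driver": "1.2"}
    for setting, (on_active, on_standby) in find_drift(active, standby).items():
        print(f"{setting}: active={on_active} standby={on_standby}")
```

Running a comparison like this before each maintenance window surfaces divergence while it is still cheap to fix, rather than at failover time.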

Watch the TFiR interview with Dave Bermingham here

[/expander_maker]

HA’s Patching Blind Spot Is Killing Uptime: Dave Bermingham, SIOS Technology | TFiR
