Does Windows Operations share blame with Microsoft and CrowdStrike?

The recent emergency patch rollout by cybersecurity firm CrowdStrike caused significant global disruption, underscoring the critical need for robust resilience planning, automation, and rigorous testing to mitigate IT system failures. In this episode of TFiR: Let’s Talk, Rob Hirschfeld, Co-Founder and CEO of RackN, talks about the necessity of having alternate control paths and backup systems in place to ensure quick recovery during such incidents.

Software patch that caused global disruption, with focus on CrowdStrike’s processes

Hirschfeld explains that a software patch caused Windows to crash, inadvertently disrupting IT systems globally.
He attributes the issue to a combination of system errors, automation failures, and human error.
Hirschfeld adds that CrowdStrike quickly distributed patches to millions of devices, indicating a thorough QA process.
Hirschfeld highlights that a postmortem investigation will look into why the patch was released without proper vetting, emphasizing the importance of automation in the process.

IT operations challenges and resilience

Operations teams face challenges in dealing with infrastructure failures, including lack of visibility and resilience planning.
According to Hirschfeld, Microsoft and Windows are not to blame for the infrastructure failures associated with the recent disruptions.
Hirschfeld highlights the complexity of IT systems and the shared responsibility for maintaining their security.
He emphasizes the importance of regularly patching and updating systems to prevent security issues.

Recent patch failure highlights crucial need for IT resilience and backup systems

Hirschfeld emphasizes the importance of backup and resiliency systems to recover from vendor-related disruptions.
The discussion also focuses on the impact the recent patch failure had on customers and whether any measures were taken to mitigate the issue.
Hirschfeld emphasizes the importance of resilience in recovering from cyber attacks, particularly in the face of patch availability.
He further points out that operations teams lacking resilience may experience prolonged outages when issues arise. In contrast, teams with quick recovery processes can significantly minimize the impact of such disruptions.

Guest: Rob Hirschfeld (LinkedIn)
Company: RackN (Twitter)
Show: Newsroom

This summary was written by Monika Chauhan.

