
Christopher Burgess
Contributing Writer

CrowdStrike debacle underscores importance of having a plan

Opinion
29 Jul 2024 | 6 mins
Incident Response | Technology Industry

With so many software agents in today’s IT ecosystems, epic fails like CrowdStrike’s are an inevitability. Make sure your teams are prepared by investing in preparation and rethinking dependencies.

Credit: Rawpixel.com / Shutterstock

The dust has largely settled from the global blue screen of death (BSOD) CrowdStrike inflicted on more than 8.5 million Windows devices by its flawed delivery of a channel file in its Falcon Sensor update, crippling businesses worldwide. Now that nearly all those devices have been restored, it’s time for CISOs to pick up the pieces and contemplate lessons learned.

Much has been said about the magnitude of this event, and about how it underscores just how dependent enterprises have become on single points of failure.

According to Dave DeWalt, speaking to Dow Jones’ MarketWatch podcast “On Watch,” 30,000 CrowdStrike customers were impacted directly, with another 674,000 customers indirectly affected immediately when CrowdStrike hit the button to enable a global update.

DeWalt compared the event to one he and CrowdStrike CEO George Kurtz experienced together at McAfee in 2010, when DeWalt was CEO and Kurtz was his CTO. In that incident, McAfee pushed out an update, realized within minutes that there was an issue, and stopped the rollout. Only 1,672 customers were affected, DeWalt shared.

Why the difference in magnitude? The deployment models of McAfee in 2010 and CrowdStrike in 2024 were different, and thus the scope and scale of the damage were far greater in CrowdStrike’s case.

The salient point, shared but not emphasized enough, is that this is hardly the first time a company has pushed out an update, patch, or other instructions to software sitting on customers’ devices only to have things immediately go sideways at scale.

Murphy’s Law has never been repealed

What can go wrong, will. It is not unreasonable to expect your vendor not to undercut your ability to conduct business. Yet every CISO knows that errors happen, machines fail, and even billion-dollar companies can ship a wonky line of code.

“I can’t recall a better time to discuss the criticality of resilience and continuity of business operations — from threat actors to the solutions designed to defend against them,” Kyle Hanslovan, CEO of EDR maker Huntress, shared with me. “As our technical systems become more interconnected, it’s almost certain mass outage will increase in frequency. I highly encourage all industries to prepare and exercise plans to operate when key technology solutions become unavailable without notice.”

Planning for the unexpected is easier to say than to do — yet plan we must.

During the first half of my career at the CIA, I handled telecommunications (CW, RTTY, satellite; yes, I am that old), often in locales where an adversary government was working hard to block our communications, where sunspot activity would disrupt the 2-30 MHz radio waves, where a transmitter would catch fire, or where the safe holding the crypto would fail. All of these would have been showstoppers had it not been for the practice of having alternatives.

Most importantly, all these alternatives, from degrading to Morse code to rebuilding a transmitter, were practiced regularly. There was no single point of failure. There may have been degradation, but the communication channels stayed up and available.

CISOs need to take this to heart in their implementation strategies, especially in the wake of a catastrophic example like CrowdStrike’s.

Canaries are our friends

The canary was the sacrificial bird taken into coal mines to warn miners of carbon monoxide and diminished oxygen. Today, “canary deployment” can “offer an outlet for releasing new features staggered to mitigate risks such as breaks in services, outages and non-compliance. This approach also allows for a swift and safe rollback to previous working versions as potential issues are navigated and solved,” wrote Dinesh Chacko on June 24 in his article “Canary Deployment: What it is and Why it matters.”

Chacko’s timely piece also notes that canary deployment is complex and requires a healthy investment of resources.
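To make the concept concrete, here is a minimal sketch of the staged-rollout logic a canary deployment implies. Every function name, cohort size, and threshold below is a hypothetical stand-in for whatever fleet-management and telemetry systems an organization actually runs:

```python
import time

COHORTS = [0.01, 0.05, 0.25, 1.0]   # fraction of the fleet per stage
ERROR_BUDGET = 0.002                 # max tolerable failure rate
SOAK_SECONDS = 15 * 60               # observation window per stage

def deploy(fraction: float, version: str) -> None:
    """Push `version` to `fraction` of the fleet (placeholder)."""
    print(f"Deploying {version} to {fraction:.0%} of devices")

def failure_rate(version: str) -> float:
    """Return the observed crash rate for devices on `version` (placeholder)."""
    return 0.0  # wire this to real telemetry

def rollback(version: str) -> None:
    """Revert affected devices to the last known-good build (placeholder)."""
    print(f"Rolling back {version}")

def canary_rollout(version: str) -> bool:
    """Widen the rollout stage by stage, halting at the first sign of trouble."""
    for fraction in COHORTS:
        deploy(fraction, version)
        time.sleep(SOAK_SECONDS)  # let the cohort soak before judging health
        if failure_rate(version) > ERROR_BUDGET:
            rollback(version)     # stop before the blast radius grows
            return False
    return True
```

The essential property is that a bad update is caught while it is touching 1% of devices, not 100% of them.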

CrowdStrike has indicated that it will be adjusting its deployment strategy — and enhancing its testing processes — to include providing “customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.”

These changes align nicely with those discussed in Chacko’s piece, and CISOs must weigh the trade-offs in making their selections going forward, a process that should involve business stakeholders in discussions of risk tolerance.
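On the customer side, that granular control might look something like the ring policy sketched below. This is purely a hypothetical illustration, not any vendor’s actual API or configuration schema; the group names, channels, and delays would come out of exactly the risk-tolerance discussion described above:

```python
# Hypothetical ring-based update policy; all names and values are illustrative.
# Lower-risk host groups take new content first and serve as the organization's
# own canaries; business-critical systems trail behind by design.
UPDATE_RINGS = {
    "lab-and-test":         {"channel": "latest", "delay_hours": 0},
    "general-workstations": {"channel": "n-1",    "delay_hours": 24},
    "point-of-sale":        {"channel": "n-2",    "delay_hours": 72},
}

def policy_for(host_group: str) -> dict:
    """Look up a group's update ring, defaulting to the most conservative."""
    return UPDATE_RINGS.get(host_group, UPDATE_RINGS["point-of-sale"])
```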

Plan for the future

To the credit of CrowdStrike, its many partners, and the CISO/infosec community at large, a lot of midnight oil was burned in the initial days after the faulty update went out, as the community collectively jumped in and lent a hand to mitigate the situation.

Some entities found themselves mildly inconvenienced; others, like Delta Air Lines, found themselves crippled for days, only returning to normal operations on July 25.

“Moving forward, this outage demonstrates that continuous preparation to fortify defenses is vital, especially before outages occur,” Christine Gadsby, CISO at BlackBerry, opined. She continued, “Already understanding what areas are most vulnerable within a system prevents a panicked reaction when something looks amiss and makes it more difficult for hackers to wreak havoc. In a crisis, defense is the best offense; the value of confidence that comes with preparation cannot be underestimated.”

Let me close on a piece of positive news: As of July 25, CrowdStrike reports that 97% of Windows Falcon Sensors are back online. Those directly affected, and now remediated, are reviewing the unexpected hit to their operating expenses, as well as the toll on the workers, be they employees, contractors, or partners, who put in the long hours to fix machines BSOD’d by the rollout.

CISOs should also review what needs to be changed, added, or deleted in their emergency response and business continuity playbooks.

In his CSO op-ed, Andy Ellis opined that administration is the weakest link in IT. He highlighted how too much trust is assumed with respect to the CISO’s IT tools, and he is right.

I also agree with Huntress’ Hanslovan: this will happen again. Given the number of companies with software agents/widgets sitting on endpoint devices, and the nearly daily interaction between those providing the service and their customers’ devices, the odds are not in favor of an error-free future.

Now is the time for each CISO to do a bit of introspection on their team’s ability to address a similar scenario, and to plan, exercise, and be prepared for the unexpected, which could happen today, tomorrow, or, hopefully, never.

Christopher Burgess

Christopher Burgess is a writer, speaker, and commentator on security issues. He is a former senior security advisor to Cisco and has also been a CEO/COO at various startups in the data and security spaces. He served 30+ years within the CIA, which awarded him the Distinguished Career Intelligence Medal upon his retirement; Cisco gave him a stetson and a bottle of single-barrel Jack when he retired from there. Christopher co-authored the book “Secrets Stolen, Fortunes Lost: Preventing Intellectual Property Theft and Economic Espionage in the 21st Century.” He also founded the non-profit Senior Online Safety.
