John Leyden
Senior Writer

CrowdStrike blames testing shortcomings for Windows meltdown

News
24 Jul 2024 · 5 mins
Endpoint Protection, Incident Response, Security

Customers will be given more control over when and where content is downloaded to reduce the risk of similar incidents in the future.

CrowdStrike has blamed a hole in its testing software for the release of a defective content update that hobbled millions of Windows computers worldwide on Friday, July 19.

The hole caused CrowdStrike’s Content Validator tool to miss a flaw in an update for the security vendor’s Falcon Sensor endpoint protection technology. Windows machines that received the update crashed with the infamous Blue Screen of Death (BSOD) and were then forced into a repetitive boot loop that left them unusable.

In its preliminary post-incident review, CrowdStrike confirmed that the crashing of its customers’ computers was due to a flaw in Channel File 291, part of a sensor configuration update released to Windows systems at 04:09 UTC on July 19. In the review it provided an initial explanation for how that flaw came to be deployed, and outlined changes it is making to its processes to avoid a repeat.

CrowdStrike isn’t the only organization considering changes in the wake of the incident: Many CIOs are also rethinking their reliance on cloud software like CrowdStrike’s.


Testing shortcomings exposed

CrowdStrike’s review described the rigorous testing process it applies to new versions of its software agent and the default data files that accompany them — what it calls Sensor Content — but said that the flaw was in a type of exploit signature update it calls Rapid Response Content, which goes through less-rigorous checks.

Customers have the option of operating with the latest version of Sensor Content, or with either of the two previous versions if they prefer to favor reliability over coverage of the most recent attacks. Rapid Response Content, however, is deployed automatically to compatible sensor versions.

Rapid Response Content is stored in a proprietary binary file that contains configuration data rather than code. The files are delivered as configuration updates to the Falcon sensor, making the platform better able to detect the hallmarks of malicious activity based on behavior recognition.

CrowdStrike uses its Content Configuration System to create so-called Template Instances describing the behavior to be detected, storing them in Channel Files that it then tests with a tool called the Content Validator.
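CrowdStrike has not published the Content Validator’s internals, but the kind of pre-release check it performs can be illustrated with a minimal sketch. Everything here is hypothetical: the class names, the field-count rule, and the file layout are assumptions for illustration, not CrowdStrike’s actual format.

```python
# Hypothetical sketch of a pre-deployment content check; the real
# Content Validator's rules and file format are not public.
from dataclasses import dataclass


@dataclass
class TemplateInstance:
    name: str
    fields: list[str]  # match criteria the sensor would evaluate


EXPECTED_FIELD_COUNT = 21  # illustrative constraint, not CrowdStrike's


def validate(instances: list[TemplateInstance]) -> list[str]:
    """Return a list of validation errors; an empty list means the
    channel file passes and is eligible for release."""
    errors = []
    for inst in instances:
        if not inst.name:
            errors.append("template instance missing a name")
        if len(inst.fields) != EXPECTED_FIELD_COUNT:
            errors.append(
                f"{inst.name}: expected {EXPECTED_FIELD_COUNT} fields, "
                f"got {len(inst.fields)}"
            )
    return errors
```

The point of such a gate is that a malformed Template Instance is rejected before deployment; the July 19 incident shows what happens when a bug lets problematic content slip past it.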

Countdown to disaster

Falcon Sensor 7.11 was made generally available to customers on February 28, introducing a new type of template to detect novel attack techniques on interprocess communications (IPC) that abuse so-called Named Pipes.

The first Channel File 291 was released to production on March 5 following a successful stress test. Template Instances that relied on Channel File 291 were released without problems on March 5, April 8 and April 24.

Disaster struck when two additional Template Instances were deployed on July 19. “Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data,” CrowdStrike said in its review.

What seemed like a minor configuration update to a component that had been tested and was already in production triggered a wave of crashes. Nevertheless, CrowdStrike argued it acted responsibly in the run-up to what turned out to be disaster.

“Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production,” CrowdStrike explained in its review.

“When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD),” it added.
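The failure mode CrowdStrike describes, reading past the end of the supplied data, is a classic parser hazard. A stripped-down user-space illustration (not CrowdStrike’s format or code) of how an unchecked offset read behaves:

```python
import struct


def read_u32(buf: bytes, offset: int) -> int:
    # struct.unpack_from raises struct.error when fewer than 4 bytes
    # remain past the offset -- the Python analogue of an out-of-bounds
    # read. In kernel code the same mistake reads arbitrary memory or
    # faults, which on Windows surfaces as a BSOD.
    return struct.unpack_from("<I", buf, offset)[0]


channel_data = bytes(8)   # illustrative 8-byte content blob
read_u32(channel_data, 4)  # fine: bytes 4..7 exist
# read_u32(channel_data, 6) would raise struct.error: only 2 bytes
# remain past offset 6, not the 4 the format expects
```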

Testing improvements

Going forward, CrowdStrike says, updates will be tested locally before being sent to clients. Content update and rollback testing will be carried out, and there will be additional stability and content interface testing.

Existing error handling procedures in the Content Interpreter will also be improved so that, for example, problematic content is caught and rejected rather than triggering an operating system crash.

CrowdStrike will also introduce a staggered deployment strategy for the Rapid Response Content that caused the July 19 incident, it said. It will initially release new content as a “canary deployment” to detect critical issues, then release it to larger and larger portions of its customer base. It will also enable customers to refuse the very latest content releases, offering “granular selection of when and where these updates are deployed,” it said.
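A staggered rollout of this kind is commonly implemented by hashing each host to a stable position in [0, 1) and gating deployment on the release’s current stage. The stage fractions and names below are generic assumptions, not CrowdStrike’s:

```python
import hashlib

# Illustrative rollout rings: fraction of hosts eligible at each stage,
# from a 1% canary up to full deployment.
STAGES = [0.01, 0.10, 0.50, 1.00]


def host_bucket(host_id: str) -> float:
    """Map a host to a stable value in [0, 1) via hashing, so the same
    host always lands in the same ring."""
    digest = hashlib.sha256(host_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_deploy(host_id: str, stage: int) -> bool:
    """True if this host falls inside the current rollout ring."""
    return host_bucket(host_id) < STAGES[stage]
```

Because the rings are nested, every canary host remains in each later stage, and a critical issue caught at stage 0 never reaches the other 99% of the fleet.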

Early reaction to CrowdStrike’s analysis and remediation plan from security experts, such as Kevin Beaumont, has been positive.

“CrowdStrike’s response has been really good post error,” Beaumont said in a thread on Twitter/X. “They clearly realise they need to prioritise safety now.”