Unanswered questions regarding CrowdStrike’s processes that led to a global Windows outage strike at central issues of trust, transparency, validation, and interdependency for CISOs, and could prompt a rethink given the stakes and the ease of defection.

As enterprise CISOs and other executives continue to calculate the impact of CrowdStrike’s disastrous July update glitch, some feel the need to assess alternatives. The big issue is transparency or, more precisely, the lack of meaningful transparency from CrowdStrike. CrowdStrike has been thorough in its technical description of the glitch. But although the vendor has said quite a bit about what happened, it has said virtually nothing publicly about the far more important questions: How did this failure happen, and why? That’s what CISOs need to understand when evaluating next steps.

What happened at CrowdStrike

CrowdStrike’s technical explanation boiled down to a mismatch: the Falcon sensor expected 21 inputs for the now infamous Channel File 291, yet only 20 were delivered. In CrowdStrike’s phrasing: “The new IPC Template Type [deployed on July 19] defined 21 input parameter fields, but the integration code that invoked the Content Interpreter with Channel File 291’s Template Instances supplied only 20 input values to match against. This parameter count mismatch evaded multiple layers of build validation and testing, as it was not discovered during the sensor release testing process.”

And, according to CrowdStrike, attempts to “access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash,” thereby setting off a vicious cycle of BSODs and a global IT outage.

Important questions remain

Despite that technical explanation for the outage, several questions that CISOs need answered remain unaddressed by CrowdStrike. For example: How did that mismatch happen? Who did what to cause it? Was it a CrowdStrike employee or contractor? A third party? Did Microsoft make some small operating system change that fueled the mismatch, which might explain why only Windows devices failed?

There are two critical question categories: First, what led to the glitch? And second, how did CrowdStrike’s quality control efforts fail? The dry explanation given for why quality control missed what would become a catastrophic error (“This parameter count mismatch evaded multiple layers of build validation and testing”) doesn’t explain much. Impacted systems immediately crashed on receiving the patch, delivering a true BSOD. Had CrowdStrike tested even one impacted Windows machine, wouldn’t it have immediately witnessed a full crash, which would be very hard to miss? Did CrowdStrike somehow not test the patch against any of the relevant Windows devices?

Of course, ascertaining what constitutes a relevant Windows device in this instance is another issue. Back on July 20, Microsoft offered impact estimates with no specifics, saying, “We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines.” That 1% figure sounds minimal, but Microsoft’s analysis wasn’t restricted to Windows machines running CrowdStrike, so it provides little insight into the share of CrowdStrike-installed devices that were hit.
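To make CrowdStrike’s description of the out-of-bounds read concrete, the minimal C sketch below shows the general failure pattern: code that iterates over 21 declared fields while the caller supplies only 20 values. It is purely illustrative; the constants, function names, and structure are hypothetical stand-ins, not CrowdStrike’s actual Content Interpreter code.

```c
/*
 * Illustrative sketch only, not CrowdStrike's code: a routine that walks
 * 21 declared template fields while the caller supplies just 20 values,
 * the mismatch CrowdStrike described for Channel File 291.
 */
#include <stdio.h>

#define TEMPLATE_FIELD_COUNT 21  /* fields the new template type defines */
#define SUPPLIED_VALUE_COUNT 20  /* values the invoking code actually passes */

/* Hypothetical stand-in for a content interpreter matching inputs
 * against template fields. */
static void interpret_content(const char *values[], size_t supplied)
{
    for (size_t i = 0; i < TEMPLATE_FIELD_COUNT; i++) {
        /* When i reaches 20, this reads one element past the end of a
         * 20-element array. In user space that is undefined behavior;
         * in a kernel-mode driver an invalid read crashes the system. */
        printf("field %zu -> %p\n", i, (const void *)values[i]);
    }
    (void)supplied;  /* a bounds check against 'supplied' would prevent the read */
}

int main(void)
{
    const char *values[SUPPLIED_VALUE_COUNT] = { "v0" };  /* 20 values, not 21 */
    interpret_content(values, SUPPLIED_VALUE_COUNT);
    return 0;
}
```

In ordinary user-space code a read like this may or may not crash; in a kernel-mode component such as an endpoint sensor, an invalid memory access brings down the entire operating system, which is why the result was a fleet of BSODs rather than a single failed process.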
Hearst CIO Atti Riazi told CSO Online that her team met with CrowdStrike and was told that the vendor halted the Windows patch “in less than an hour, once they saw” what was happening, and that “within that [less than an] hour, 8 million computers” crashed. Given that CrowdStrike has not reported the number of Windows machines that received the patch during that fraction of an hour, the percentage impacted can’t be calculated.

So, in an email exchange, CSO Online asked CrowdStrike officials what percentage of Windows machines running CrowdStrike were impacted. Although CrowdStrike had asked for questions, it chose not to answer that one. It also declined to address a follow-up question about any specific characteristics of the impacted Windows machines. If 8.5 million Windows machines crashed in less than an hour, it’s likely that the percentage of CrowdStrike-enabled Windows machines impacted was quite high. And if that’s the case, questions about how the vendor tested the patch carry more weight, as it would seem that almost any such Windows machine would have surfaced the crash. (CrowdStrike declined a request for an interview.)

Remediation practices in the legal crosshairs

Among the enterprises hardest hit by the outage was Delta Air Lines. Attorneys for Delta sent CrowdStrike a letter arguing that the vendor dropped the ball in the hours and days immediately after the incident, at least in terms of helping enterprises recover. “Although you say ‘CrowdStrike took responsibility for its action,’ CrowdStrike’s current position as reflected in your letter seeks in every way to escape that responsibility,” Delta’s attorneys wrote. “Egregiously, there was no staged rollout to mitigate risk and CrowdStrike did not provide rollback capabilities.”

But the airline hit the vendor hardest for what happened after the problem had been discovered. “CrowdStrike’s offers of assistance during the first 65 hours of the outage simply referred Delta to CrowdStrike’s publicly available remediation website, which instructed Delta to manually reboot every single affected machine. While CrowdStrike eventually offered a supposed automated solution on Sunday, July 21 at 5:27 pm ET, it introduced a second bug that prevented many machines from recovering without additional intervention,” the attorneys wrote. “CrowdStrike CEO George Kurtz’ single offer of support to [Delta CEO] Ed Bastian on the evening of Monday, July 22, was unhelpful and untimely. When made — almost four days after the CrowdStrike disaster began — Delta had already restored its critical systems and most other machines. Many of the remaining machines were located in secure airport areas requiring government-mandated access clearance.”

Interdependence and a question of trust

Riazi, who said Hearst didn’t suffer from the outage nearly as much as Delta and other enterprises did, was quite understanding and supportive of CrowdStrike. “These things happen. We did much better than a lot of our colleagues,” Riazi said, adding that the incident was a stark reminder of the enterprise’s current level of interdependence. “Sometimes we forget how dependent we are. My kids, if they don’t have internet connections, it’s like they don’t have oxygen.”

Riazi, whose CISO at Hearst reports directly to her, was also understanding of CrowdStrike’s silence on how and why the incident happened. “They have lawsuits coming their way, so they are not going to be that specific,” she said. The Hearst CIO also echoed Delta’s frustration with the lack of immediate help.
“I think the biggest issue for us was that we didn’t know how to fix this. It is extremely complex. This is a wakeup call that is not waking up anybody.”

Steve Zalewski, longtime CISO for Levi Strauss until 2021, when he became a cybersecurity consultant, said his greatest concern is that not only does CrowdStrike have direct access to the OS kernel, but that its patches go directly from the vendor to the OS. “Why are security tools not treated like business application tools? What changed? Why are we letting real-time updates fly right into our production environments?” asked Zalewski, who also served as a senior security manager for Kaiser Permanente.

Ironically, the answer is in large part that many enterprises have historically found CrowdStrike’s quality to be quite high. “We trusted them too far because they have been really good for too long,” Zalewski said, stressing that the decision was also made because enterprise IT was cutting back extensively. “We didn’t have the resources or the time so we had to trust the vendor,” he said. Many IT operations considered halting the patches and doing their own testing before allowing them to be deployed, but concluded that “in our minds, the latency of delaying was great. It was higher risk for us to do the testing.”

‘Prove to me that you can test’ or risk defection

Charles Blauner, former CISO for both JPMorgan Chase and Deutsche Bank, and former head of information security for Citi, disagreed with Zalewski regarding the ROI of testing patches before deploying them. “I can do a controlled deployment through policy and mitigate the risk of [security vendors] screwing up. We can change the deployment methodology to manage risk,” said Blauner, now a security consultant running CyberAegis.

According to Blauner, ROI calculations work in favor of a delayed deployment. “The time between the bad guys doing something and CrowdStrike doing something” is relatively small, he said. “So we add an additional 48-hour drip to that. I don’t think it materially changes my risk profile.” (A sketch of what such a staged “drip” might look like appears at the end of this article.)

Blauner’s message to security vendors like CrowdStrike is “prove to me that you can test and then give me control. If you can’t do that, I would look at alternatives.” He also wants labels on every update to indicate urgency: “Is it critical or just a daily update?”

Blauner also had concerns about how CrowdStrike handled its quality control and checks of the update. “Their testing seems pretty damn negligent,” Blauner said. “Given that there’s clearly been a failure in their testing process, how strong are the rest of their processes?” If he had to consider replacement candidates, he said he would suggest CISOs look at either Deep Instinct or SentinelOne.

Blauner did, however, point to one bright spot. Even though security vendors pushed the move to SaaS to get a more regular revenue flow, today SaaS has become an enterprise CISO benefit because it inadvertently reduces vendor lock-in. “SaaS lowers the biggest barriers to switching. While SaaS vendors love the annual subscriber model, it overcomes the barriers that CISOs used to have when doing technology switches,” Blauner said. “I used to have these five-year contracts” and an early exit meant that “I would have three years of cost that I would have to eat. Also, I haven’t built up infrastructure because it’s all in their cloud.”
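The “controlled deployment” Blauner describes amounts to a ring-based rollout: a small canary group takes vendor content immediately, and broader groups receive it only after it has soaked without incident. The C sketch below is a hypothetical illustration of that gating logic; the ring names and delay values are assumptions, not any vendor’s actual policy or product feature.

```c
/*
 * Minimal sketch of the kind of ring-based "drip" Blauner describes:
 * hypothetical ring names and delays, not any vendor's actual policy.
 */
#include <stdio.h>

struct ring_policy {
    const char *name;
    unsigned    delay_hours;  /* how long an update must soak before this ring gets it */
};

static const struct ring_policy rings[] = {
    { "canary (IT lab machines)",       0 },   /* receives vendor content immediately */
    { "early (non-critical endpoints)", 24 },
    { "broad (production fleet)",       48 },  /* the "additional 48-hour drip" */
};

/* Returns 1 if a ring may install an update that has been published
 * for 'update_age_hours', 0 otherwise. */
static int ring_may_deploy(const struct ring_policy *ring, unsigned update_age_hours)
{
    return update_age_hours >= ring->delay_hours;
}

int main(void)
{
    unsigned update_age_hours = 30;  /* example: content published 30 hours ago */

    for (size_t i = 0; i < sizeof(rings) / sizeof(rings[0]); i++) {
        printf("%-32s -> %s\n", rings[i].name,
               ring_may_deploy(&rings[i], update_age_hours) ? "deploy" : "hold");
    }
    return 0;
}
```

Whether a given security vendor exposes this kind of control over its content updates is exactly the point of Blauner’s “give me control” demand.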