CrowdStrike Update Causes BSOD
The ‘most serious IT outage the world has ever seen’ sparked global chaos today - with planes and trains halted, the NHS disrupted, shops closed, football teams unable to sell tickets and banks and TV channels knocked offline.
See also: Reddit, Hacker News, and Slashdot.
Frontier Airlines briefly grounded all flights on Thursday amid a major outage in Microsoft networks, which also knocked out some computer systems at low-cost carriers Allegiant Air and Sun Country Airlines.
Microsoft said on the status page for Azure, its flagship cloud computing platform, that the problem began at 5:56 p.m. and affected multiple systems for customers in the central United States.
Andrew Cunningham (Hacker News):
Airlines, payment processors, 911 call centers, TV networks, and other businesses have been scrambling this morning after a buggy update to CrowdStrike's Falcon security software caused Windows-based systems to crash with a dreaded blue screen of death (BSOD) error message.
The list of services impacted by the outage includes Microsoft Defender, Intune, Teams, PowerBI, Fabric, OneNote, OneDrive for Business, SharePoint Online, Windows 365, Viva Engage, Microsoft Purview, and the Microsoft 365 admin center.
What’s happened today with Crowdstrike is completely unprecedented (and I’ll get to why shortly), and on the scale of the much-feared Y2K bug that threatened to ground the entirety of the world’s computer-based infrastructure once the Year 2000 began.
[…]
The problem here is systemic — that there is a company that the majority of people affected by this outage had no idea existed until today that Microsoft trusted to the extent that they were able to push an update that broke the back of a huge chunk of the world’s digital infrastructure.
Southwest Airlines, the fourth largest airline in the US, is seemingly unaffected by the problematic CrowdStrike update that caused millions of computers to BSoD (Blue Screen of Death) because it used Windows 3.1.
The cause of the failure has been identified as an update to Crowdstrike Falcon antivirus software installed on Windows 10 PCs, but Mac and Linux machines running the same cybersecurity software have been spared.
CrowdStrike’s now-infamous Falcon Sensor software, which last week led to widespread outages of Windows-powered computers, has also caused crashes of Linux machines.
CrowdStrike says the issue has been identified and a fix has been deployed, but fixing these machines won’t be simple for IT admins. The root cause appears to be an update to the kernel-level driver that CrowdStrike uses to secure Windows machines. While CrowdStrike identified the issue and reverted the faulty update after “widespread reports of BSODs on Windows hosts,” it doesn’t appear to help machines that have already been impacted.
This is why I keep telling people that third-party kernel extensions should be banned from production servers, period.
And shipping LIVE cloud updates direct to endpoints, unchecked, without any canaries?
[…]
But since most of the affected systems are in a boot loop that may well require physical (or IPMI) access to the machine.
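The "canaries" mentioned above usually mean a staged rollout: an update goes to a small slice of machines first and only proceeds if they stay healthy. What follows is a minimal sketch of that idea, not a description of how CrowdStrike or any other vendor deploys updates. The function name, the stage percentages, the `is_healthy` callback, and the failure threshold are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    is_healthy: Callable[[str], bool],
    stage_percents: Sequence[int] = (1, 10, 100),  # hypothetical stage sizes
    max_failure_rate: float = 0.01,                # hypothetical halt threshold
) -> Tuple[bool, int]:
    """Update hosts in growing stages; halt as soon as a stage looks unhealthy.

    Returns (completed, number_of_hosts_updated).
    """
    updated = 0
    for percent in stage_percents:
        target = len(hosts) * percent // 100
        batch = hosts[updated:target]
        if not batch:
            continue
        # In a real system this is where the update would be pushed to the batch
        # and health signals (crash reports, heartbeats) would be collected.
        failures = sum(1 for host in batch if not is_healthy(host))
        updated = target
        if failures / len(batch) > max_failure_rate:
            return False, updated  # stop before the next, larger stage
    return True, updated
```

With 1,000 hosts and an update that crashes every machine, this sketch stops after the first 10 hosts instead of reaching all 1,000.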
The macOS version of the Falcon sensor uses a kernel extension (kext) on Intel Macs prior to Big Sur, but because of the limitations of kexts on Apple silicon, it now uses an endpoint security System Extension instead.
People pointing to EndpointSecurity framework in MacOS as the solution for the Crowdstrike problem are missing the point. ES is a typical Apple solution and basically means: anyone who can bypass it has to have exactly one exploit (chain) that will allow them to bypass ALL vendors
Sure yes running drivers in user land has less likelihood of taking down the whole system but it also means their functionality is severely limited by what API the vendor provided. Apple is simply gatekeeper in one more area of their devices.
It would be sufficient for OS protection to mark drivers that crash as dirty and if this happens repeatedly boot without the driver and/or optionally allow a rollback to a previously not crashing configuration
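The policy proposed above can be sketched in a few lines. This is an illustration of the commenter's idea under stated assumptions, not how Windows or any other OS actually handles boot-time drivers. `DriverRecord`, `CRASH_LIMIT`, and the function names are hypothetical.

```python
from dataclasses import dataclass

CRASH_LIMIT = 2  # hypothetical: consecutive crashes before the driver is skipped


@dataclass
class DriverRecord:
    name: str
    crash_streak: int = 0      # consecutive boots that crashed with this driver loaded
    quarantined: bool = False  # once set, the OS boots without the driver


def record_boot(rec: DriverRecord, crashed: bool) -> None:
    """Call after each boot attempt in which the driver was loaded."""
    if crashed:
        rec.crash_streak += 1  # the driver is now "dirty"
        if rec.crash_streak >= CRASH_LIMIT:
            rec.quarantined = True
    else:
        rec.crash_streak = 0   # a clean boot clears the dirty mark


def should_load(rec: DriverRecord) -> bool:
    """Decide at boot time whether to load the driver."""
    return not rec.quarantined


def rollback(rec: DriverRecord) -> None:
    """Model restoring a previously working configuration of the driver."""
    rec.crash_streak = 0
    rec.quarantined = False
```

In this sketch a quarantined driver stays disabled until `rollback` is called, so a machine does not flip back into the boot loop after one clean boot without the driver.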
The EC obviously felt they were helping out third-parties by requiring Microsoft to continue to grant the same level of kernel access that they have. And perhaps this was even a good thing for end-users as these companies could cover security bases that Microsoft wouldn't, for whatever reason – security in general, of course, has not been a Microsoft strong suit, of late. But there are also often unintended consequences of such actions. In this case, a third-party service with a single code-push could take out millions of machines overnight and thus, cripple key infrastructure around the world.
Fast forward nearly two decades, and while Symantec and McAfee are still around, there is a new wave of cloud-based security companies that dominate the space, including CrowdStrike; Windows is much more secure than it used to be, but after the disastrous 2000s, a wave of regulations were imposed on companies requiring them to adhere to a host of requirements that are best met by subscribing to an all-in-one solution that checks all of the relevant boxes, and CrowdStrike fits the bill. What is the same is kernel-level access, and that brings us to last week’s disaster.
This strange tweet got >25k retweets. The author sounds confident, and he uses lots of hex and jargon. There are red flags though… like what’s up with the DEI stuff, and who says “stack trace dump”? Let’s take a closer look…
Patrick Wardle (tweet, Hacker News):
I don’t do Windows but here are some (initial) details about why the CrowdStrike’s CSAgent.sys crashed.
“Professional programmers” focusing on CrowdStrike disassembly/language is a coping mechanism that protects them from realizing that there is a remotely updated 3rd party kernel module that is deployed on significant part of the world. That is why real postmortems are important.
The CrowdStrike BSOD fiasco is extraordinary in its scale and scope; on Monday’s Oxide and Friends, @ahl and I will be joined by security researcher and @LutaSecurity CEO @k8em0 to help us sort through the many layers of this mess
See also: xkcd.
Previously:
- Southwest Airlines and Technical Debt
- Requesting Entitlements, Still Broken
- Little Snitch and the Deprecation of Kernel Extensions
Update (2024-07-23): Sebastiaan de With:
Has anyone checked on the App Store backend? Automated reports have been MIA since the Crowdstrike incident. 👀
Apple devices may not be as vulnerable to a bug in an update to third-party software like CrowdStrike, but that doesn’t mean we can be complacent. Apple itself regularly releases updates, and while it’s essential to install them to patch security vulnerabilities, Apple’s engineers could make a mistake that would cause problems for millions. Howard Oakley’s article reminded me of when an Apple update inadvertently disabled Ethernet (see “El Capitan System Integrity Protection Update Breaks Ethernet,” 29 February 2016). Apple quickly addressed the problem, but the lack of Ethernet prevented some Macs from getting the revised update, requiring manual intervention.
[…]
Even if we give CrowdStrike the benefit of the doubt and say that the bug was a subtle mistake that could have slipped by any developer, I can’t see any excuse for why it wasn’t caught in testing. Either CrowdStrike wasn’t doing real-world testing—the company constantly releases patches like this—or someone messed up big time.
In a statement to The Wall Street Journal, Microsoft blamed the European Commission for an inability to offer the same protections that Macs have. Microsoft said that it is unable to wall off its operating system because of an “understanding” with the European Commission. Back in 2009, Microsoft agreed to interoperability rules that provide third-party security apps with the same level of access to Windows that Microsoft gets. Microsoft agreed to provide kernel access in order to resolve multiple longstanding competition law issues in Europe.
Nothing prevents Microsoft and Crowdstrike from developing and adopting a user space solution if they so wish. But they didn't.
Also I'd like to point out that it is totally possible to completely deadlock macOS with user space endpoint security.
If one has a general worldview for technology today, they can find it in some analysis of this CrowdStrike failure. This saga has everything.
Update (2024-07-24): Oxide Computer Company:
Bryan and Adam were joined by security expert, Katie Moussouris, to discuss the largest global IT outage in history. It was an event as broadly impactful as it will be instructive; as Bryan noted, you can see all of computing from here, from crash dumps to antitrust.
Update (2024-07-26): Bruce Schneier and Barath Raghavan:
The catastrophe is yet another reminder of how brittle global internet infrastructure is. It’s complex, deeply interconnected, and filled with single points of failure. As we experienced last week, a single problem in a small piece of software can take large swaths of the internet and global economy offline.
The brittleness of modern society isn’t confined to tech. We can see it in many parts of our infrastructure, from food to electricity, from finance to transportation. This is often a result of globalization and consolidation, but not always. In information technology, brittleness also results from the fact that hundreds of companies, none of which you’ve heard of, each perform a small but essential role in keeping the internet running. CrowdStrike is one of those companies.
This brittleness is a result of market incentives. In enterprise computing—as opposed to personal computing—a company that provides computing infrastructure to enterprise networks is incentivized to be as integral as possible, to have as deep access into their customers’ networks as possible, and to run as leanly as possible.
Update (2024-07-29): Katie Moussouris:
The cause of the most significant internet outage event in history was a cascade of failures in both testing and deployment capability. The technical bugs in the testing and the client-side interpreter code are one area for improvement, and the process failures that propagated this so widely and quickly are another. Both functional areas need to be addressed to ensure we don’t have to endure an outage of this magnitude again.
I was rather skeptical that this wasn’t an elaborate joke, but yes, @CrowdStrike has apparently emailed its customers & offered a ~$10 UberEats gift card/coupon for any “inconvenience”
…and yes, it errors out when one goes to redeem it, saying it has been cancelled 🫠
ANOTHER opinion piece repeating Microsoft’s claim the EU is responsible for the #CrowdStrike debacle. You can read the “interoperability undertaking” Microsoft made in 2009 yourself… no, it does NOT require kernel access for Windows competitors.
In this blog post, we examine the recent CrowdStrike outage and provide a technical overview of the root cause. We also explain why security products use kernel-mode drivers today and the safety measures Windows provides for third-party solutions. In addition, we share how customers and security vendors can better leverage the integrated security capabilities of Windows for increased security and reliability. Lastly, we provide a look into how Windows will enhance extensibility for future security products.
Update (2024-07-30): Thom Holwerda (via Nick Heer):
It turned out to be a troll tweet. A reply to the tweet by Russakovskii a day later made that very clear: “To be clear, I was trolling last night, but it turned out to be true. Some Southwest systems apparently do run Windows 3.1. lol.”
[…]
These few paragraphs do not say that Southwest is still using ancient Windows versions; it just states that the systems they developed internally, SkySolver and Crew Web Access, look “historic like they were designed on Windows 95”. The fact that they are also available as mobile applications should further make it clear that no, these applications are not running on Windows 3.1 or Windows 95. Southwest pilots and cabin crews are definitely not carrying around pocket laptops from the ’90s.
These paragraphs were then misread, misunderstood, and mangled in a game of social media and bad reporting telephone, and here we are.
Delta has hired prominent attorney David Boies to pursue potential damages from CrowdStrike and Microsoft after a mass outage earlier this month, CNBC’s Phil Lebeau reported on Monday.
Airline cancellations is a good metric, but I want to look directly at air traffic: How many planes were in the air? How many planes should have been in the air?
Update (2024-07-31): Patrick McKenzie (Hacker News):
It would be an overstatement to say that the United States federal government commanded U.S. financial institutions to install CrowdStrike Falcon and thereby embed a landmine into the kernels of all their employees’ computers. Anyone saying that has no idea how banking regulation works.
[…]
Does the FFIEC have a hugely prescriptive view of what you should be doing for malware monitoring? Well, no […] But your consultants will tell you that you want a very responsive answer to II.C.12 in this report and that, since you probably do not have Google’s ability to fill floors of people doing industry-leading security research, you should just buy something which says Yeah We Do That.
CrowdStrike’s sales reps will happily tell you Yeah We Do That.
Update (2024-08-14): See also: Accidental Tech Podcast.
Update (2024-09-17): Rachyl Jones (via Hacker News):
Software engineers at the cybersecurity firm CrowdStrike complained about rushed deadlines, excessive workloads, and increasing technical problems to higher-ups for more than a year before a catastrophic failure of its software paralyzed airlines and knocked banking and other services offline for hours.
“Speed was the most important thing,” said Jeff Gardner, a senior user experience designer at CrowdStrike who said he was laid off in January 2023 after two years at the company. “Quality control was not really part of our process or our conversation.”
6 Comments
Are we sure that macOS hasn't been impacted? A couple of days ago one of my computers reset a couple of preferences like the database of BookPedia and the font size of Messages. A couple of macOS apps told me about "new features". I did not have a kernel panic or a restart. The iPhone of my sister this morning had similar oddities like an alarm chime she hadn't set. Or it's all in my head.
@beatrix While there is a macOS version of the CrowdStrike software, the issue was only with the Windows flavor. So unless you are using a Windows PC where this software is installed, you are not directly impacted. Then indirectly, some remote services that may depend on servers running on Windows may have been down and rolled back. I don’t expect Apple iCloud infrastructure to be running on Azure. But who knows with Apple these days…
Is this still an issue? I've not heard much since the first day, and I'm kind of surprised by how quickly things seem to have gone back to normal.
@someone My understanding is that iCloud uses both Azure and AWS, but I haven’t heard anything about an outage.
@Kristoffer
The fix involves getting someone to have physical access to the computer, boot it in safe mode, and delete a file to allow Windows to boot again.
The place where I work has been working 24/7 to get all computers back online and we're still not there yet.
And even outside of that, the email provider for one of my (less used) email accounts still has webmail and IMAP offline…
It's going to take time!
@Beatrix
As you can see in detail if you read the post from Howard Oakley posted above, Macs are not susceptible to this issue by design. In addition, I have Falcon running on my work Mac and I haven't seen anything out of the ordinary (I'm not a fan of the app, which hogs CPU cycles for no good reason, hanging the entire Mac at times, but it can't touch the system or crash the kernel the same way it did on Windows).
@Corentin Cras-Méneur Thanks for the insight. Sounds like a lot of tedious grunt work.
What I meant was more that to me as a person there were a few hours of "Is this 'Leave everything behind'? Will my wife be able to make it back from China?" that quickly turned into me bitching about "Nobody ever got fired for choosing IBM-ism" and then forgetting about the whole thing, because I wasn't affected in any way at all.
But it's good to be reminded that that was because a lot of people worked their asses off doing tedious gruntwork they really shouldn't have to.