Chaos Lever Podcast

Ned and Chris explore the chaotic fallout from a CrowdStrike Falcon sensor update that crashed Windows systems across various sectors.

Where Were You the Day the Screens Turned Blue?
The tech industry is a house of cards propped up by a mishmash of redundant systems and safety nets. In this episode, Ned and Chris dive into CrowdStrike’s Falcon sensor update on July 19, 2024. This blunder sent Windows systems crashing, causing chaos across airlines, retail stores, and hospitals. They dissect how the update triggered the dreaded Blue Screen of Death and the nightmarish recovery process, especially for BitLocker-encrypted systems. Solutions like macOS’s System Extensions and Linux’s eBPF are tossed around, with a side of skepticism about the balance between speed and security and the inevitable trainwreck of regulatory responses.

What is Chaos Lever Podcast?

Chaos Lever examines emerging trends and new technology for the enterprise and beyond. Hosts Ned Bellavance and Chris Hayner examine the tech landscape through a skeptical lens based on over 40 combined years in the industry. Are we all doomed? Yes. Will the apocalypse be streamed on TikTok? Probably. Does Joni still love Chachi? Decidedly not.

Ned: Ready to party.

Chris: By party, do you mean nap?

Ned: Uh, yeah, we could have a nap party. I’m not even mad about that. That sounds delightful.

Chris: [snoring].

Ned: Hello, alleged human, and welcome to the Chaos Lever podcast. My name is Ned, and I’m definitely not a robot. I am a real human person who does not make random updates to your computer and then crash entire airports. That would be wild, and I am definitely not behind that update through my Skynet uplink. With me is Chris, who is also here. Let’s talk about not Skylink. [crosstalk 00:00:46]. Whatever [laugh].

Chris: Wow. Yeah, I don’t know if you go into this or not, but you know, there are already conspiracy theories out there that this was an actual attack of some kind. Or a test for an actual attack.

Ned: Yeah, I have seen that on the Reddit threads, [background noise] and other places. Ohh, that was loud.

Chris: That was way louder than I thought it was going to be. Sorry [laugh].

Ned: Who likes fizzy water [laugh]? Yeah, on the Reddit forums, and in other places, I’ve seen the conspiracy theories starting to propagate. I feel like those start out as jokes a lot of the time. Like, “Wouldn’t it be funny if this was like a state actor,” or something, but like, when you actually drill down into it at all, you realize that this is just incompetence. Don’t attribute to malfeasance what is more likely just gross incompetence. There’s a pithier way of saying that.

Chris: Well, maybe now we should just talk about what we’re actually talking about because we’re already talking about it.

Ned: Oh, CrowdStruck. UnderStrike?

Chris: Dunn.

Ned: [laugh]. Those of us who have been around the tech industry for a while, and have peeked behind the mysterious curtain to see what actually supports this endeavor that we call modern information technology—

Chris: It’s terrifying. It’s three monkeys in a trench coat.

Ned: Barely. I feel like you quickly become aware of how fragile this entire construction is, and just how many redundancies and safeguards have to be put in place to prevent the entire edifice from crumbling into the proverbial sea.

Chris: Yeah, and just to put a pin on that, in terms of, not only is the technology fragile, so are the people. I saw a joke on LinkedIn today about power-washing the back of your servers to let the packets go faster, and I guarantee there’s somebody out there going, “I haven’t done that. I should do that.”

Ned: [laugh]. If nothing else, it’ll clean the air filters, so that’s probably good.

Chris: It’ll make everything a lot quieter.

Ned: [laugh]. I suppose it will. Oh, silence is golden. The packets go faster in silence. To quote the second-greatest sci-fi movie of all time, Men in Black, “There’s always an Arquillian Battle Cruiser, or a Korilian Death Ray, or an intergalactic plague that is about to wipe out all life on this miserable planet. The only way these people can get on with their happy lives is that they do not know about it!”

Chris: I love that quote.

Ned: Yeah, so just kind of apply that to technology instead of aliens, and it’s pretty much the same thing. The CrowdStrike debacle may not have been a Korilian death ray, but for 8.5 million Windows devices, it basically was. Everything, everywhere, is breaking, all at once, and it is only through the heroic efforts of thousands of ops people diligently doing their jobs that the public is unaware. Of course, the public does occasionally become very aware, and then senators have to hold hearings to grandstand about things they do not even slightly understand.

They’ll hold some CEO’s feet to the fire for an hour, make self-serving proclamations and possibly even attempt to levy a fine or two. Good luck with that now that Chevron Deference is dead. But hey, we’re not a Supreme Court podcast. Go listen to 5-4 for that. Solid plug for 5-4. Definitely [crosstalk 00:04:21] time.

After all the hubbub dies down, honestly, one or two C-level executives will probably fall on their swords to appease the investor public. I wouldn’t feel too sorry for them. It is a metaphorical sword after all, and it comes with a guaranteed payout of several millions of dollars, and a cushy job as a lobbyist or CEO of some other poor unsuspecting private equity firm-acquired disaster where they can oversee another unavoidable catastrophe. It’s the circle of life, Chris.

Chris: I’m not going to sing it. I don’t want to get sued again.

Ned: I—no, you’re singing it in your head though, and I can see it [laugh].

Chris: [laugh].

Ned: Oh, so rather than talking about CrowdStrike for the next 30 minutes, I think we should all just go watch The Lion King—

Chris: The original.

Ned: Which is the best sci-fi movie of all time [laugh].

Chris: I’m not really sure where to go with that [laugh].

Ned: [laugh]. I don’t either. I’m curious to hear the comments that we get in. I did recently watch Dune: Part Two, which was excellent.

Chris: Took you long enough.

Ned: Listen.

Chris: Some of us responsible citizens saw in the theater.

Ned: I have one word for you. That word is children. Anyway.

Chris: You don’t think they would like it?

Ned: I think nightmare fuel would probably be the closest, yeah, Feyd-Rautha—Routha—however the hell you say his name—yeah, those scenes in particular, God, that dude is creepy.

Chris: Yeah, he really inhabited the creepy level of the character.

Ned: Like, Jared Leto levels of creepy.

Chris: No, but except good.

Ned: Yes [laugh]. Yeah, because he’s playing a character, not himself. Oh. Anyway. CrowdStrike. What the hell happened? On Friday, July 19th, 2024, at 5:24 UTC—that’s 1 a.m. for our East Coast peeps, and the day before for California because you’re a bunch of weirdos—security vendor CrowdStrike released an update for their Falcon sensor platform.

Falcon is an endpoint detection and response solution meant to protect systems against viruses, malware, and advanced persistent threats. The update type was a content update, or what CrowdStrike calls a channel file, which you can think of as, like, the virus definitions, except as a modern EDR, it’s a bit more complicated than that, and we’ll get to why that’s important when we get to the root-cause analysis, or what we know so far. Once the channel file was loaded by the Falcon sensor platform, it caused a memory access fault at the kernel level that forced a system crash on all Windows clients. The old Blue Screen of Death popped up, and then the system either rebooted or sat at that screen for a while. Possibly forever.

Chris: Yeah, until somebody touched it.

Ned: Pretty much. So, if you happened to walk into a major airport around that time, you might have been greeted by giant display signs that just had the sad frowny face on it, because now the blue screen has an emoji. And it was kind of funny, actually. I mean, funny for the people, you know, seeing the screens; not funny for everybody who had to deal with the disaster, [unintelligible 00:07:45]

Chris: Right. And were sitting in airports for three days while waiting to, you know, go home.

Ned: Yeah. Depending on which airline you were working with, you may have been not impacted at all, impacted slightly, or still sitting in the airport listening to this right now. I’m so sorry. Maybe don’t fly Delta [laugh] next time. Actually, I don’t know if it was Delta. It might have been United. They’re all terrible. It doesn’t matter.

Chris: But one of the few that wasn’t affected was Southwest.

Ned: Is that because they’re running Linux?

Chris: Allegedly. Again, this is unproven internet theory, but allegedly it’s because their systems were so old that CrowdStrike wouldn’t run on them.

Ned: [laugh]. I feel like we did cover Southwest in a Chaos Lever, or possibly its precursor, when we talked about old, out-of-date systems that are super fragile. Am I remembering correctly?

Chris: I mean, I had that theory, or that thought as well, but I’m also now like, did they just post that, and it became a memory, or is it a real memory?

Ned: [laugh]. It’s hard to say. I will say that it was in fact Delta—and is Delta—that’s having the biggest struggle because they use BitLocker extensively.

Chris: Right. I assume you’re going to get into that.

Ned: Oh, yes.

Chris: Okay. I don’t want to interrupt. Carry on.

Ned: So, we had all these crashes—

Chris: Whenever you’re ready.

Ned: And you know, when your system—

Chris: Just go with [crosstalk 00:09:10]—

Ned: [unintelligible 00:09:11]—

Chris: —whenever [unintelligible 00:09:11]—

Ned: [unintelligible 00:09:13]—

Chris: At anytime—

Ned: [unintelligible 00:09:14]—

Chris: When you could—why would—who—

Ned: [laugh]. We have all these crashed systems, and what do you do with a crashed system? You restart it. But unfortunately, attempts to restart the afflicted systems just resulted in another Blue Screen of Death, because Falcon sensor is loaded as a driver during system boot, and it has been marked as boot required, meaning it must be loaded for the system to boot properly.

As soon as Falcon started, it would load all of its channel files and, predictably, the system would crash again. This rendered all affected systems completely unusable and inaccessible through in-band management. So, you can’t just RDP into this thing and fix it.
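
In practice, “boot required” means the Falcon driver is registered as a boot-start service in Windows. Here is a minimal C sketch of checking that flag; “CSAgent” is the widely reported CrowdStrike service name, treated here as an assumption rather than confirmed fact.

```c
/* Sketch: read a driver service's Start value from the registry.
 * Start = 0 (SERVICE_BOOT_START) means the boot loader loads the driver
 * before you get any chance to intervene in-band. "CSAgent" is assumed. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    DWORD start = 0, size = sizeof start;
    LSTATUS rc = RegGetValueA(HKEY_LOCAL_MACHINE,
                              "SYSTEM\\CurrentControlSet\\Services\\CSAgent",
                              "Start", RRF_RT_REG_DWORD, NULL, &start, &size);
    if (rc != ERROR_SUCCESS) {
        printf("service not found (rc=%ld)\n", (long)rc);
        return 1;
    }
    /* 0 = boot, 1 = system, 2 = auto, 3 = manual, 4 = disabled */
    printf("Start = %lu%s\n", (unsigned long)start,
           start == 0 ? " (boot-start: loaded before anything can stop it)" : "");
    return 0;
}
```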

Chris: So, this makes sense from an EDR perspective, right?

Ned: Yes.

Chris: You want to protect your computer. No matter what tool you have, it’s going to have this boot requirement because you don’t want your system booting without endpoint protection.

Ned: Right.

Chris: Because endpoint protection, ostensibly, is good.

Ned: Ostensibly. The problem, obviously, comes in where your endpoint management is now, effectively, malware that’s crashing your system.

Chris: Right. That would be what we call ‘the downside.’

Ned: [laugh]. Yes. And we will definitely get into that as well. Microsoft has published a blog post where they claim, according to their telemetry, about 8.5 million Windows devices were impacted by this. Now, that’s only about 1 or 2% of all Windows devices out there, so this is not, as a percentage, a ton of devices.

However… it’s still a lot of devices, [laugh] and the impact was pretty severe. As we discussed, airlines had to suspend or cancel flights, retail stores suddenly couldn’t accept payment. Medical devices in hospitals crashed in the middle of surgeries, bowling alleys had to hand out paper and pencils to individuals, who just looked at them like, what the hell is this? How do I track ten frames by hand? How does a turkey even work? [sigh]. Dark times for all of us, Chris.

Chris: That’s the kind of math podcast that needs to come out because I guarantee there’s no one left on earth who knows how to score bowling by hand.

Ned: [laugh]. True story. I was up in Cape Cod, and we went duckpin bowling—which is a real thing. Look it up—

Chris: Oh, it’s so fun. It’s super fun. Definitely look it up.

Ned: Super fun, but the bowling alley was so old that they did not have a computerized scoring system.

Chris: Wow.

Ned: Yeah. They gave me a piece of paper and pencil, and I was like, “Uh, score is not important, right? We’re just here to have fun.” Oh… now to get these systems back to a working state, the offending channel files had to be removed before Falcon was loaded. There’s a few options to do this, and none of them are great or easy.

You can boot the system into Windows safe mode, which only loads the absolute bare minimum of Windows drivers, and then remove the files. For virtual systems, you could mount the system disk on another system and remove the files, and then reattach the drive to the original system, or if you had snapshots or a backup, you could roll back to a prior snapshot. Fortunately, CrowdStrike did pull the offending file from the update servers, so you wouldn’t then immediately redownload it and be back where you were. While it is a huge pain to fix all of these virtual systems, the real pain is those physical systems that don’t have an out-of-band management option. Someone will need to physically sit at the terminal, invoke safe mode, and perform the remediation steps, or use a separate boot device like a thumb drive to perform the maintenance. This is very bad.
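
For the record, the widely circulated manual fix reduces to deleting the offending channel file before Falcon can load it. A minimal C sketch against the Win32 API follows, using the publicly reported directory and file pattern; it assumes you are already in Safe Mode or a recovery environment with the disk unlocked.

```c
/* Sketch of the manual remediation: delete channel files matching the
 * publicly reported pattern C-00000291*.sys from the CrowdStrike driver
 * directory. Run from Safe Mode/recovery, not a normal boot. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    const char *dir = "C:\\Windows\\System32\\drivers\\CrowdStrike";
    char pattern[MAX_PATH], path[MAX_PATH];
    WIN32_FIND_DATAA fd;

    snprintf(pattern, sizeof pattern, "%s\\C-00000291*.sys", dir);
    HANDLE h = FindFirstFileA(pattern, &fd);
    if (h == INVALID_HANDLE_VALUE) {
        puts("No matching channel files found.");
        return 0;
    }
    do {    /* delete every match; Falcon then boots without the bad file */
        snprintf(path, sizeof path, "%s\\%s", dir, fd.cFileName);
        printf("Deleting %s: %s\n", path, DeleteFileA(path) ? "ok" : "FAILED");
    } while (FindNextFileA(h, &fd));
    FindClose(h);
    return 0;
}
```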

Chris: You forgot about the other way to fix the system, which apparently did work on some—at least a number of people’s, which is just keep rebooting it until CrowdStrike Falcon updated, and deleted the file on its own before it crashed because of the file.

Ned: [laugh]. I guess if it does load and the network stack loads in time for it to pull the update and replace it, maybe? Maybe.

Chris: Between 15 and 20 reboots. Sometimes people were getting it to work.

Ned: Wow. That’s awful. But okay. So, another option. Microsoft has published a USB tool to assist with the removal of this file, so you have that option as well. As I mentioned, the BitLocker thing does throw a bit of a wrench in the whole plan because in order to access a BitLocker-protected system drive out-of-band, you have to supply a BitLocker unlock key—

Chris: Yeah.

Ned: And that can be hard to get.

Chris: Well, it’s not like people want their end-users to have that. Again—

Ned: Yes.

Chris: —this is a security concern. Also, the BitLocker key is 48 characters long, so not only finding it but typing it in before BitLocker times out… which it does, apparently.

Ned: So, a bit of a nightmare.

Chris: Not a great situation.

Ned: No. And so, that’s part of the reason Delta is still struggling. I would love to say that, as of right now, we know exactly what caused the error, but honestly portions of the supply chain are still pretty murky. Instead, I will try to explain how a simple update for an EDR caused millions of Windows machines to blue screen, and we can also have fun pointing all the fingers that we have at all the other parties because we’ve got jazz hands. [whispering] It’s all your fault. That works better with a—

Chris: Visual medium?

Ned: Yeah [laugh]. So, to start with, we have to consider what Falcon sensor is actually trying to do. Falcon sensor, as I mentioned, is an EDR product, and it’s meant to scan all activity on the host operating system looking for threats. Most applications aren’t granted that level of access to other applications or to the system as a whole. As you mentioned, Chris, it needs to be in a privileged position.

But that’s the point: you’re trying to prevent other pieces of software from getting themselves into privileged positions to compromise your computer. To understand what it means to be in that privileged position, I’m going to briefly talk about user space and kernel space. Please feel free to interrupt me when I get something wrong, which I will. [whispering] Yes, thank you. Your operating system, whether it’s Windows, Linux, macOS, I don’t know, Solaris—

Chris: AIX?

Ned: —sure—it is responsible for managing the hardware on your system. That includes stuff like memory management, writing data to disk, sensing input from peripherals, and scheduling threads on the CPU. This all happens in what is called kernel space, and it’s considered highly privileged. If something goes wrong in kernel space, the system may have to halt or crash to prevent damage to the hardware, or corruption of data. Ideally, as little as possible should be running in kernel space.

Instead, most applications run in user space, which does not have direct access to the hardware. Applications running in user space interact with the operating system, and make requests based on that operating system’s published APIs. Do you want to write a file to disk? You make an API call and pass the correct information. Need to access memory? Make an API call and specify the address and range.

The operating system will evaluate that request, make sure it’s valid and allowed before executing it. This means when an application runs into issues or it crashes, the operating system is able to handle that crash gracefully—most of the time—and keep other processes and the system as a whole running.
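
To make that user-space contract concrete, here’s a minimal C sketch (POSIX system calls for brevity; Windows goes through Win32 APIs like WriteFile, but the validate-then-execute idea is the same). The kernel checks each request and hands back an error instead of halting the machine:

```c
/* User space asks; the kernel validates, then executes or refuses.
 * A bad request earns an error code, not a crash. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char msg[] = "hello from user space\n";

    /* Valid request: fd 1 (stdout) is open and the buffer is mapped. */
    if (write(1, msg, sizeof msg - 1) < 0)
        perror("write");

    /* Invalid request: fd 9999 was never opened. The kernel rejects it
     * with EBADF and this process carries on; nothing halts. */
    if (write(9999, msg, sizeof msg - 1) < 0)
        printf("kernel refused the bad fd: %s\n", strerror(errno));

    return 0;
}
```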

Chris: Right. And there’s one point to note here. So, first of all, some of the terminology, they call it the kernel; also, they call it Ring 0, meaning it is the lowest possible level of the system, and it has access to everything else that is going on in the system without restriction. Necessary to make sure, for things like EDR tools, that it can scan not only all of the files, but all of the activity, all of the network, all of the I/O, the disk, et cetera, et cetera, et cetera.

Ned: Right.

Chris: One thing people always get upset about is, why does Windows crash so easily? And—

Ned: [laugh].

Chris: While there is an argument to be made that it is fragile and poorly designed and should have a better way of handling things like EDR that needs this access—which is true, and I assume you’ll get to that—

Ned: Yes.

Chris: The other thing is, again, remember, completely unfettered access. If something goes wrong at the kernel level, we get our old friend, unanticipated consequences. And this is extremely bad. So, for example, let’s say you have a system that is running a database. Databases, as you know, are kind of important.

A kernel-level job is trying to write a new file, or a new table, or a new row, or record, or whatever, but it runs into an error with, say, memory misallocation. What is it going to write to the database? It could be writing absolute nonsense. It could completely corrupt the database. Therefore, the kernel crashes preemptively whenever it detects a failure because the consequences of trying to soldier on might be worse.
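
That fail-stop instinct is easy to sketch. Here is a hypothetical record writer in C, imitating the kernel’s bugcheck/BUG_ON pattern; the macro and record type are invented for illustration, not any real kernel API:

```c
/* Fail-stop sketch: on impossible state, halt loudly rather than write
 * garbage onward and corrupt data. Names invented for illustration. */
#include <stdio.h>
#include <stdlib.h>

#define BUG_ON(cond, msg) \
    do { if (cond) { fprintf(stderr, "fatal: %s\n", (msg)); abort(); } } while (0)

typedef struct { size_t len; const char *buf; } record_t;

void write_record(const record_t *r) {
    /* If length and buffer disagree, memory is already suspect; pressing
     * on could corrupt the database, so stop here instead. */
    BUG_ON(r->buf == NULL && r->len != 0, "record has length but no buffer");
    fwrite(r->buf, 1, r->len, stdout);
}

int main(void) {
    record_t good = { 6, "hello\n" };
    write_record(&good);           /* fine */
    record_t bad = { 42, NULL };   /* inconsistent: triggers the fail-stop */
    write_record(&bad);
    return 0;
}
```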

Ned: Right. It’s that, “Out of an abundance of caution, I’m going to fail.”

Chris: Right. Which is the same thing I did in high school.

Ned: [laugh]. Yes… it was better if you didn’t succeed, Chris.

Chris: [laugh].

Ned: So, Windows applications have been able to request access to run in kernel mode for a long time. Generally, that’s a bad idea, for the reasons you just articulated. But Microsoft wasn’t super strict about it. Microsoft is nothing if not accommodating to developers and their terrible ideas. Some applications actually do need to run in kernel mode, in particular, antivirus software.

Applications running in user mode are not generally allowed to access the memory and monitor the behavior of other applications. Microsoft Teams can’t just decide to read the memory space of Slack or kill the Zoom processes, as much as it might want to.

Chris: I was going to say, it totally would.

Ned: [laugh]. The operating system just doesn’t allow that type of nonsense. But, you know, an antivirus application needs a privileged level of access and monitoring to defeat the bad guys. So, antivirus companies like Symantec wrote their application to run in kernel space. Now, Microsoft actually tried to push back on the rampant abuse of kernel mode by antivirus outfits—and others—when Windows Vista was being rolled out. Yeah.

Chris: What, what, what?

Ned: Vista, for you youngsters in the crowd, Vista was the Windows 8 of the early aughts. Hopefully that puts some perspective on things. While Vista was a disaster as an operating system release, they did add a whole bunch of additional functionality and features that brought the client OSes more in line with what the server OSes were doing, and added a bunch of security. And one of the things they really tried to do was lock down kernel mode access. Unfortunately, antivirus companies didn’t like that, and they threw a hissy fit, claiming that since Windows Defender could run in kernel mode, and their stuff couldn’t, Microsoft was abusing their influence, a la Internet Explorer. And Microsoft, still reeling from their decade-long battle with the FTC over antitrust, kowtowed to the AV club, and allowed them to keep their precious kernel mode access.

Chris: It’s not an unreasonable request because all the other players wanted was an even playing field.

Ned: Right.

Chris: The fact that the even playing field was a wide-open security nightmare is still a Microsoft problem.

Ned: [laugh]. Right. Microsoft did add an interesting requirement, though, if you wanted to play in kernel space, and that was driver signing. Antivirus applications would present themselves as device drivers to get to run in kernel mode. A device driver for no actual device, but a device driver nonetheless.

Microsoft created the Windows Hardware Quality Labs Testing Certification—aka WHQL—and once a driver had gone through that lab and gotten its certification, Microsoft would digitally sign the driver and give them the Certified for Windows logo, so they could proudly display ‘Certified for Windows Vista’—or Windows 8 or whatever—on the box when you buy the software, or on their website. Now, vendors could still choose to sign their drivers internally, but the antivirus folks wanted to get that WHQL certification and all the cachet that went with it. As long as your driver code didn’t change, the digital signature would remain valid. So, that means all these antivirus companies—like CrowdStrike—would get that certification, which meant that it had gone through some level of rigorous testing when it came to the way the driver was written and the way it interacted with the kernel. Seems like a good idea.

Chris: I’m for it.

Ned: Unfortunately, external data could be loaded by the driver, like—

Chris: No.

Ned: —virus definitions. But in theory, the actual running code should all live in that signed device driver. So, read in some config, but all the logic in the actual code should live in that device driver. That’s all well and good for loading virus signatures and looking for matches in memory and CPU threads, but Falcon sensor is a modern EDR, and it doesn’t just use signatures. Instead, Falcon uses machine learning to develop behavior patterns, and then it needs to detect and respond to emerging threats that match those behavior patterns.

The channel updates Falcon sensor receives to model that behavior appear to include some amount of pseudocode that is executed by the driver. And it is that injected code from the channel—or lack thereof, actually—that seems to have caused the issue. According to people who have looked at the channel file in question, it is entirely filled with zeros [laugh]. Now, you would hope that the driver would look at a file full of zeros and just ignore it, like, “Nope. That’s invalid.” Falcon sensor chose a slightly different route and crashed.
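
What the missing defensive check might look like, as a hypothetical C sketch. The header layout and magic value here are invented; CrowdStrike’s actual channel-file format is not public. The point is simply that a file of all zeros should die at validation, long before anything dereferences it:

```c
/* Hypothetical channel-file validation: reject empty, truncated, or
 * all-zero input before trusting any field in it. Format is invented. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHANNEL_MAGIC 0xAA44u   /* invented sentinel value */

struct channel_header {
    uint16_t magic;
    uint16_t version;
    uint32_t entry_count;
    uint32_t entry_offset;
};

bool channel_file_valid(const uint8_t *data, size_t len) {
    if (data == NULL || len < sizeof(struct channel_header))
        return false;

    /* A file full of zeros fails the magic check immediately. */
    const struct channel_header *h = (const struct channel_header *)data;
    if (h->magic != CHANNEL_MAGIC)
        return false;

    /* Bounds-check offsets and counts against the real file size so a
     * corrupt header can't steer the parser into unmapped memory. */
    if (h->entry_offset > len ||
        h->entry_count > (len - h->entry_offset) / sizeof(uint32_t))
        return false;

    return true;
}
```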

Chris: Right. So, what we have here is a driver that is legitimate and was tested and proven resilient, which is good.

Ned: Yeah.

Chris: And we have updates that come down the wire multiple times a day and interact directly with that driver, that were not.

Ned: Precisely. And it would appear that of the many tests that were run against that driver, none of the tests were, “Here’s a file full of zeros. What do you do?” Because no one thought that was a thing that would ever occur. But it did. There is a popular breakdown of the Falcon sensor crash dump by Twitter person Perpetualmaniac, which I won’t be linking because after assessing that it was a lack of null pointer checking in the dump, he then went on to make weird disparaging comments about the Rust community and blamed the whole thing on DEI. It got strange and kind of fash-y, so fuck that guy.

Chris: Fair.

Ned: Instead, I’ll include a link to a different Twitter thread, by someone who actually debugs stuff like this for a living, and he basically said that Perpetualmaniac was wrong and thinks that it is uninitialized data being read from a table that caused the crash. Now, considering that the input file was entirely filled with nothing, uninitialized sounds like an understatement. Unfortunately, we won’t know for sure unless CrowdStrike shares their source code for their driver, which seems unlikely. Maybe they should, but I don’t think they will.

The point is that the channel update caused Falcon sensor to attempt to access a memory location that didn’t exist or wasn’t initialized, and the driver crashed, forcing the system to halt in order to prevent possible data corruption. So, that’s where we’re at. Now, it’s time to point fingers.
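
Before the finger-pointing, here is that crash mechanism in miniature: a C sketch with invented names, showing a pointer fetched from a zeroed table and called without a check. In user mode that’s a segfault; in a kernel driver it’s a bugcheck, i.e., the Blue Screen:

```c
/* The failure mode, miniaturized: index a pointer table driven by
 * untrusted input; with the table all zeros, the fetched pointer is NULL
 * and the call faults. Names invented for illustration. */
#include <stdint.h>
#include <stdio.h>

typedef void (*handler_fn)(void);

static handler_fn handler_table[8];   /* never initialized: all zeros */

void dispatch_unsafe(uint32_t idx) {
    handler_table[idx & 7]();         /* NULL call -> access violation */
}

void dispatch_safe(uint32_t idx) {
    handler_fn fn = handler_table[idx & 7];
    if (fn == NULL) {                 /* the missing check */
        fprintf(stderr, "no handler for %u, skipping\n", idx);
        return;
    }
    fn();
}

int main(void) {
    dispatch_safe(3);                 /* logs and carries on */
    /* dispatch_unsafe(3); */         /* uncomment to watch it crash */
    return 0;
}
```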

Chris: Cool. It’s everybody’s favorite part.

Ned: Predictably, in a fuckup of this magnitude, the blame game and armchair quarterbacking are in full effect. Thought leaders are tripping over themselves on LinkedIn to have an opinion about the whole mess. And I’ve seen posts ranging from ‘this is all CrowdStrike’s fault. How did this update ever get out the door?’ ‘This is all Microsoft’s fault. How could they let third parties run in kernel mode?’ ‘This is the customers’ fault for not having phased rollouts.’ Et cetera, et cetera. And then there’s all the conspiracy theories about how this was a state actor, or a planned thing, or, I don’t know, CrowdStrike did it on purpose, for reasons? Anyway.

Chris: Solar flares?

Ned: Oh, I like that one. That’s what made it all zeros. There’s plenty of blame to go around, and none of it is actually helpful while the fire is burning, but now that we’re over a week out, maybe we can take a more nuanced look. Or not. So, how did this update actually leave CrowdStrike’s front door? That’s a great question.

The truth is, we will not know until CrowdStrike tells us or a lawsuit forces legal discovery, and we find out that way. The former could come any day. I’ve checked their [unintelligible 00:27:40] blog posts several times as I was writing this piece, and so far, they haven’t said, but maybe they will.

Chris: Uh, actually, so they did—

Ned: Ooh.

Chris: —at about three o’clock this morning.

Ned: [laugh]. Of course they did.

Chris: They released an official—well, an official unofficial preliminary post-incident review.

Ned: Okay.

Chris: It’s a good name. And basically what they’re saying is, it went through automated testing, but the automated content validator had a bug in it. So, it quote-unquote “passed,” but it was an invalid file.

Ned: Ah.

Chris: “Once the file went out, it was immediately picked up, read by Falcon sensor, and it caused an out-of-bounds memory read, triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).” Unquote.

Ned: So, it seems like their testing harness or whatever they’re using also doesn’t know what to do with the file that’s all zeros.

Chris: Well, yeah. There’s a lot of problems here. First of all, clearly they did not test the tester enough.

Ned: Yeah.

Chris: Because if you have a bug in a testing system in an automated deployment, that is a problem. That is a huge problem.

Ned: And the fact that simply loading the file caused the blue screen pretty quickly makes it sound like they don’t actually push these updates to test machines that then run the update to see if the system crashes. They’re using some other testing process.

Chris: Right, which they do not go into any detail about, unsurprisingly.
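
One cheap safety net that push-to-test-machines implies, sketched here under assumptions: actually exercise the new content in a disposable child process before shipping it. This is a POSIX-only illustration; load_channel_file() is a stand-in for a real loader, not a CrowdStrike API:

```c
/* Canary smoke test: attempt the risky load in a throwaway child. If the
 * child dies on a signal, the artifact never ships. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the real parser; imagine it mapping and parsing the file. */
static int load_channel_file(const char *path) { (void)path; return 0; }

int smoke_test(const char *path) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)                    /* child: attempt the risky load */
        _exit(load_channel_file(path) == 0 ? 0 : 1);

    int status = 0;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status)) {       /* crashed: exactly what we feared */
        fprintf(stderr, "loader died on signal %d\n", WTERMSIG(status));
        return -1;
    }
    return WEXITSTATUS(status) == 0 ? 0 : -1;
}

int main(void) {
    printf("smoke test: %s\n",
           smoke_test("channel-291.bin") == 0 ? "pass" : "fail");
    return 0;
}
```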

Ned: So, I am sure that the lawsuits are forthcoming, and maybe we’ll find out more when legal discovery happens, if it gets that far, but the truth is, CrowdStrike pushes these channel updates frequently. Like you said, Chris, they push these more than once a day. And they have automated testing in place, but they’re trying to stay one step ahead of the bad guys, which means time is of the essence. This specific update was meant to address something: a new vulnerability found in named pipes. They wanted to get that update out before any attacker figured out how to abuse this vulnerability.

So, maybe what they’re doing is sacrificing quality or testing in favor of speed. This is a systemic failure, and it’s not the fault of one person. Yes, maybe someone screwed up and accidentally saved the file empty, but something else in the chain should have caught that.

Chris: Right.

Ned: If a single person can unwittingly push an update that takes down eight-and-a-half million Windows clients, that’s an organizational and systemic problem. There’s also some indications that this isn’t the first time such a transgression has occurred. It appears that Red Hat Enterprise Linux, Debian, and Rocky Linux have all encountered similar crashing problems earlier this year after a channel update was pushed. I think it was April and May were the two months where the issues were found. The issue with Debian in particular was traced to a specific version of the kernel that wasn’t included in CrowdStrike’s testing matrix, but was on their list of supported kernel versions. macOS seems to have weathered the storm, for reasons that we will get to.

Chris: Yeah, and I mean, that’s an important point. And, you know, a lot of times people will say, this only happens to Windows, and that’s absolutely not the case. Anytime something runs unfettered in Ring 0 of any operating system of any kind, you run the risk of causing an immediate crash.

Ned: It’s just not usually so public-facing because you don’t tend to have Linux running your displays that’s also running CrowdStrike.

Chris: Right.

Ned: For whatever reason, we like Windows for that. I don’t know. We’ll get to that, too [laugh]. So, what about Microsoft? Shouldn’t they prevent this kind of thing from happening? In an ideal world, they could. And we’ll get into the technical solutions in a moment, but this is largely not Microsoft’s fault.

Yes, Windows has its flaws—many, many, many flaws—and Microsoft hasn’t always produced the stablest or most secure software. No one could call them blameless with a straight face. But in this specific instance, the system is working as designed, even if the design kind of sucks. Should we be shaming all these organizations who let the update barrel through their environment like salmonella on a cruise ship? That’s an image for you.

Think about the counterexample for a second: let’s say that a zero-day attack was discovered using this named pipes thing, and it was leveraged by a hacking group to infect a major airline with ransomware, and later it came out that CrowdStrike would have protected them if they had been running the newest version of the channel updates. Stupid CISO decided to stay at n minus one for updates. Do you think the defense of not running the latest channel updates as a resiliency strategy would appease litigators and the public at large? I’m going to go with unlikely.

Chris: [laugh]. So, I mean, another point that’s important to note here is that the kind of patch that came out—or the channel update—would not have been stopped by an n minus one effect. N minus one would stop the driver update. Remember, that’s the part that was signed by Microsoft, and is noted as good. The actual kernel—or the actual channel update itself happens automatically, and you can’t do anything about it.

Ned: There is… I was reading through some Reddit posts, and some people did say that there is a way to run a little behind the channel updates, to postpone them by certain periods. There is a way to run, kind of like, n minus one for the channel updates, but there’s an inherent risk in doing that.

Chris: Yeah, like you said, it’s certainly not the sort of thing that a CISO is going to encourage.

Ned: Right. And there’s also a regulatory hurdle with that, too, because there may be compliance and regulations that say you have to be running the latest version. So really, it’s just a rational decision based on balancing priorities and political realities, and trying to protect your customers as best you can. So, the blame ultimately should reside on CrowdStrike for putting out a floud update—floud? Flawed.

Chris: Floud.

Ned: Words. I love them. I like floud, actually. It’s like loud, but with an F. It’s floud.

Chris: It’s like a flan that has opinions.

Ned: An opinionated flan. I like it. Let’s talk about solutions [laugh]. The reason macOS hasn’t encountered a similar fate as the Windows and Linux installations is that Apple doesn’t let CrowdStrike—or really anything else—run in kernel mode. Starting in macOS 10.15—I didn’t look at the codename, so please forgive me—Apple offered System Extensions. These allow an application to stay in user mode while requesting special access to hardware managed by the kernel. At the same time, Apple phased out Kernel Extensions—often shortened to kext, or [pronounced] kext, I guess—

Chris: Yeah, it’s pronounced, unfortunately.

Ned: [sigh]. They phased those out starting in macOS 11. So basically, CrowdStrike doesn’t run in kernel mode on macOS, and thusly, it cannot crash macOS the same way. I don’t know no Mac, so I don’t know about any of that [laugh].

Chris: No, it’s true. And for a while, it was extremely annoying because a lot of programs relied on kexts for similar reasons: to have instant access. Like, a good example is if you have an external audio device and you want that to work as fast—as efficiently as possible, you would want it to work and run in kernel mode.

Ned: Right.

Chris: So, there are actually ways to get around the security that you just talked about in macOS. I don’t recommend it, but it is doable. And the whole point here is that you have this little secret enclave, effectively, where things run in this sort of in-between mode—sandbox, if you will—which we’re going to go into in a second. But if it crashes there, it doesn’t take down the operating system.

Ned: Right. And Linux actually has a similar option with eBPF, which I struggle to say because it’s awkward. And apparently, it’s no longer an acronym. It’s just its own thing. So… that’s weird. eBPF lets applications load into a sandboxed secure kernel execution environment. So, once again, gives them kernel-level access to resources, while applying stringent safety checks to make sure the application doesn’t crash the system.

CrowdStrike now offers running Falcon in user mode on Linux—what they call user mode—which actually uses eBPF under the covers. If you were running in that mode, those previous crashes that happened with Red Hat, and Debian, and—what was it?—Rocky Linux, you would not have been affected by those. I mean, CrowdStrike would—Falcon still would have crashed, but it wouldn’t have crashed your system.
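
For a flavor of why the eBPF route is safer, here is a minimal libbpf-style program in C (compiled with clang -target bpf) that counts execve() calls, a toy version of the syscall monitoring an EDR does. The kernel’s verifier refuses to load it unless, among other things, the map-lookup result is NULL-checked, which is precisely the class of bug at issue here; and a misbehaving program gets rejected or unloaded rather than taking the kernel down:

```c
/* Toy eBPF sensor: count execve() syscalls. The in-kernel verifier
 * rejects the program at load time if the NULL check below is missing. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} exec_count SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int count_execve(void *ctx) {
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&exec_count, &key);
    if (val)                    /* mandatory: the verifier insists */
        __sync_fetch_and_add(val, 1);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```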

Chris: Which is better.

Ned: I think so. Windows has some similar functionality available. There’s the Windows Filtering Platform, Windows Defender Application Control, and Windows Defender Device Guard, all of which have APIs, but none of them have the same mechanisms present that, like, System Extensions for macOS or eBPF for Linux have. So, they provide an API that applications could be rewritten to take advantage of and get, you know, almost kernel levels of access and speed, but they’re not the same as this, sort of, sandbox, secured enclave. There is a project to port eBPF over to Windows, for what it’s worth.

I don’t know if that will be the ultimate solution, but this catastrophic calamity should at least prompt Microsoft to try something similar. I have heard some folks—and we could call this a technical solution—I’ve heard some folks say that you just shouldn’t be running Windows in most of these environments. Like… you’re not wrong. If I could wave a magic wand and turn back time, if I could find a way, Chris—

Chris: Stop it.

Ned: —I would take back all the Windows that hurt you and replace them with Linux variants. Okay, it doesn’t rhyme. [laugh]. It’s the best I could do. If you’re out there, and you’re building a net-new system, that’s, like, an end-user terminal, an IoT device, or even a server running in the cloud, I think anything but Windows is your best bet, and it would probably be malpractice to do otherwise. But like it or not, Windows remains the most popular desktop operating system, and that doesn’t appear to be changing anytime soon. We need a short-term plan to make things better—through some sort of update—and a long-term plan to ditch Windows for most use cases.

Chris: Thoughts? So, a lot of—and I’ll put ‘in my opinion’ around all of this—

Ned: Right.

Chris: A lot of this comes down to the never-ending battle between speed and security, and making assumptions that things are just going to work. After all, like we said, they’ve done multiple channel updates a day for years and years and years and years and years, and while they’ve had a few issues in the past, it’s not very many. This is the sort of thing that leads developers—and, you know, engineering teams—to have a false sense of security, and a false sense that everything they do is golden, and they will never have a problem. Therefore, checks get skipped, checks get removed from the process—because after all, they’re just slowing us down—and that’s a huge issue. The other issue is, when you push everything out all at once, the problem can occur—like it did this time—that everything will crash all at once.

There needs to be some type of a fuzzed deployment. So, let’s just say these things get released on a schedule, I don’t know, every four hours. You get a customer that’s got a hundred servers. Those servers should get that update five minutes apart in, like, groups of 30. That way, if there is a catastrophic failure, it only takes down a percentage of your platform. Now, that’ll happen for every single customer on Earth, and that’s not great, but the assumption is, and should be, that there is high availability built into this, so if half your systems go down, theoretically, the other half can carry the load.
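
A sketch of that wave assignment in C, with an invented hash choice (FNV-1a) and an arbitrary wave count: hash each host ID deterministically into a wave, ship to wave zero, watch health signals, then proceed wave by wave:

```c
/* Staggered-rollout sketch: deterministically bucket hosts into waves so
 * a bad update bricks a slice of the fleet, not all of it at once. */
#include <stdint.h>
#include <stdio.h>

#define NUM_WAVES 30

/* FNV-1a: cheap, deterministic, spreads host IDs evenly across waves. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

int rollout_wave(const char *host_id) {
    return (int)(fnv1a(host_id) % NUM_WAVES);
}

int main(void) {
    const char *fleet[] = { "atl-gate-07", "lax-pos-112", "ord-kiosk-03" };
    for (size_t i = 0; i < sizeof fleet / sizeof fleet[0]; i++)
        printf("%s -> wave %d\n", fleet[i], rollout_wave(fleet[i]));
    /* Release to wave 0 first; only advance when health checks pass. */
    return 0;
}
```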

Ned: Mmm. Yeah, and that’s something that CrowdStrike could change today. That’s within their realm of control.

Chris: Yeah, and I suspect that they will [laugh]. Because the only other option—if this is the situation—is people are going to end up with half of their environment running one antivirus solution, and the other half of their environment running another one.

Ned: That seems worse.

Chris: It’s just as insane as running—insane and difficult to manage as running in a multi-cloud environment. Or a super cloud, as some might say.

Ned: Ugh. I hate you. Yeah, these are all technical solutions. I don’t know if there are any policy solutions, but my biggest concern coming out of all of this is that regulators and litigators are going to get into a hubbub and pass some poorly thought-out legislation that makes things effectively worse.

Chris: I can’t quite figure out how they would make things worse, but I am excited to see them try.

Ned: [laugh]. They’re nothing if not creative. Well, hey, thanks for listening or something. I guess you found it worthwhile enough if you made it all the way to the end, so congratulations to you, friend. You accomplished something today. Now, you can go sit on the couch, update your CrowdStrike channel file, and watch everything crash in beautiful synchronicity. You’ve earned it.

You can find more about this show by visiting our LinkedIn page, just search ‘Chaos Lever,’ or go to our website, chaoslever.com. You’ll find show notes, blog posts, and general tomfoolery. And if we got something wrong, or you have strong opinions about what CrowdStrike should have done, leave us a comment. Leave us a voicemail. We might even listen to it. We’ll be back next week to see what fresh hell is upon us. Ta-ta for now.

Chris: What a mess.

Ned: Mmm. A glorious mess.