1
00:00:00,430 --> 00:00:01,390
Ready to party.

2
00:00:02,090 --> 00:00:03,360
By party, do you mean nap?

3
00:00:04,019 --> 00:00:05,610
Uh, yeah, we could have a nap party.

4
00:00:05,720 --> 00:00:07,450
I’m not even mad about that.

5
00:00:07,480 --> 00:00:08,680
That sounds delightful.

6
00:00:10,650 --> 00:00:10,800
[snoring]

7
00:00:21,779 --> 00:00:25,040
Hello, alleged human, and welcome to the Chaos Lever podcast.

8
00:00:25,040 --> 00:00:27,680
My name is Ned, and I’m definitely not a robot.

9
00:00:27,710 --> 00:00:31,430
I am a real human person who does not make random

10
00:00:31,469 --> 00:00:34,929
updates to your computer and then crash entire airports.

11
00:00:35,619 --> 00:00:37,780
That would be wild, and I am definitely not

12
00:00:37,780 --> 00:00:40,400
behind that update through my Skynet uplink.

13
00:00:40,570 --> 00:00:43,379
With me is Chris, who was also here.

14
00:00:44,250 --> 00:00:45,510
Let’s talk about not Skylink.

15
00:00:46,230 --> 00:00:47,629
[crosstalk] Whatever. [laugh]

16
00:00:48,099 --> 00:00:48,659
Wow.

17
00:00:49,389 --> 00:00:52,520
Yeah, I don’t know if you go into this or not, but you know, there are already

18
00:00:53,469 --> 00:00:57,730
conspiracy theories out there that this was an actual attack of some kind.

19
00:00:58,190 --> 00:00:59,790
Or a test for an actual attack.

20
00:01:00,310 --> 00:01:02,729
Yeah, I have seen that on the Reddit

21
00:01:02,769 --> 00:01:04,989
threads, [background noise] and other places.

22
00:01:05,069 --> 00:01:05,920
Ohh, that was loud.

23
00:01:06,340 --> 00:01:07,940
That was way louder than I thought it was going to be.

24
00:01:08,039 --> 00:01:08,430
Sorry. [laugh]

25
00:01:09,160 --> 00:01:14,420
Who likes fizzy water? [laugh] Yeah, on the Reddit forums, and in

26
00:01:14,420 --> 00:01:19,060
other places, I’ve seen the conspiracy theories starting to propagate.

27
00:01:19,800 --> 00:01:22,870
I feel like those start out as jokes a lot of the time.

28
00:01:22,870 --> 00:01:25,300
Like, “Wouldn’t it be funny if this was like a state actor,”

29
00:01:25,330 --> 00:01:28,409
or something, but like, when you actually drill down into

30
00:01:28,410 --> 00:01:31,940
it at all, you realize that this is just incompetence.

31
00:01:33,930 --> 00:01:39,660
Don’t attribute to malfeasance what is more likely just gross incompetence.

32
00:01:40,460 --> 00:01:41,939
There’s a pithier way of saying that.

33
00:01:42,469 --> 00:01:44,289
Well, maybe now we should just talk about what we’re actually

34
00:01:44,289 --> 00:01:46,140
talking about because we’re already talking about it.

35
00:01:46,260 --> 00:01:47,240
Oh, CrowdStruck.

36
00:01:48,270 --> 00:01:49,000
UnderStrike?

37
00:01:49,040 --> 00:01:49,210
Done.

38
00:01:49,210 --> 00:01:49,240
[laugh]

39
00:01:51,690 --> 00:01:55,759
Those of us who have been around the tech industry for a while,

40
00:01:55,760 --> 00:01:59,559
and have peeked behind the mysterious curtain to see what actually

41
00:01:59,559 --> 00:02:04,010
supports this endeavor that we call modern information technology—

42
00:02:04,279 --> 00:02:04,899
It’s terrifying.

43
00:02:05,500 --> 00:02:07,040
It’s three monkeys in a trench coat.

44
00:02:07,040 --> 00:02:07,089
Barely.

45
00:02:10,440 --> 00:02:15,209
I feel like you quickly become aware of how fragile this entire construction

46
00:02:15,210 --> 00:02:20,969
is, and just how many redundancies and safeguards have to be put in place

47
00:02:21,309 --> 00:02:25,990
to prevent the entire edifice from crumbling into the proverbial sea.

48
00:02:26,430 --> 00:02:28,489
Yeah, and just to put a pin on that, in terms of, not

49
00:02:28,490 --> 00:02:30,540
only is the technology fragile, so are the people.

50
00:02:31,030 --> 00:02:35,230
I saw a joke on LinkedIn today about power-washing the back

51
00:02:35,230 --> 00:02:39,389
of your servers to let the packets go faster, and I guarantee

52
00:02:39,389 --> 00:02:41,649
there’s somebody out there going, “I haven’t done that.

53
00:02:42,260 --> 00:02:44,140
I should do that.”

54
00:02:44,140 --> 00:02:48,250
[laugh] If nothing else, it’ll clean the air filters, so that’s probably good.

55
00:02:49,050 --> 00:02:50,709
It’ll make everything a lot quieter.

56
00:02:52,610 --> 00:02:53,430
[laugh] I suppose it will.

57
00:02:53,630 --> 00:02:55,410
Oh, silence is golden.

58
00:02:55,960 --> 00:02:57,609
The packets go faster in silence.

59
00:02:58,270 --> 00:03:01,480
To quote the second-greatest sci-fi movie of all time, Men

60
00:03:01,480 --> 00:03:05,620
in Black, “There’s always an Arquillian Battle Cruiser,

61
00:03:05,620 --> 00:03:09,190
or a Korilian Death Ray, or an intergalactic plague that

62
00:03:09,190 --> 00:03:12,119
is about to wipe out all life on this miserable planet.

63
00:03:12,410 --> 00:03:14,430
The only way these people can get on with their

64
00:03:14,430 --> 00:03:17,769
happy lives is that they do not know about it!”

65
00:03:18,320 --> 00:03:19,110
I love that quote.

66
00:03:19,400 --> 00:03:22,170
Yeah, so just kind of apply that to technology instead

67
00:03:22,170 --> 00:03:25,720
of aliens, and it’s pretty much the same thing.

68
00:03:26,370 --> 00:03:30,099
The CrowdStrike debacle may not have been a Korilian death

69
00:03:30,099 --> 00:03:35,829
ray, but for 8.5 million Windows devices, it basically was.

70
00:03:36,820 --> 00:03:40,900
Everything, everywhere, is breaking, all at once, and it is

71
00:03:40,930 --> 00:03:45,030
only through the heroic efforts of thousands of ops people

72
00:03:45,040 --> 00:03:49,310
diligently doing their jobs that the public is unaware.

73
00:03:50,130 --> 00:03:54,359
Of course, the public does occasionally become very aware,

74
00:03:54,850 --> 00:03:58,110
and then senators have to hold hearings to grandstand

75
00:03:58,120 --> 00:04:01,070
about things they do not even slightly understand.

76
00:04:01,770 --> 00:04:06,960
They’ll hold some CEO’s feet to the fire for an hour, make self-serving

77
00:04:06,960 --> 00:04:11,169
proclamations and possibly even attempt to levy a fine or two.

78
00:04:11,690 --> 00:04:14,670
Good luck with that now that Chevron Deference is dead.

79
00:04:15,010 --> 00:04:16,670
But hey, we’re not a Supreme Court podcast.

80
00:04:17,430 --> 00:04:18,880
Go listen to 5-4 for that.

81
00:04:20,029 --> 00:04:21,320
Solid plug for 5-4.

82
00:04:21,940 --> 00:04:22,799
Definitely [crosstalk] time.

83
00:04:23,630 --> 00:04:28,380
After all the hubbub dies down, honestly, one or two C-level executives

84
00:04:28,380 --> 00:04:32,169
will probably fall on their swords to appease the investor public.

85
00:04:33,080 --> 00:04:34,409
I wouldn’t feel too sorry for them.

86
00:04:35,290 --> 00:04:39,219
It is a metaphorical sword after all, and it comes with a guaranteed

87
00:04:39,270 --> 00:04:43,970
payout of several millions of dollars, and a cushy job as a lobbyist

88
00:04:43,970 --> 00:04:49,059
or CEO of some other poor unsuspecting private equity firm-acquired

89
00:04:49,080 --> 00:04:54,570
disaster where they can oversee another unavoidable catastrophe.

90
00:04:55,290 --> 00:04:56,659
It’s the circle of life, Chris.

91
00:04:57,349 --> 00:04:58,150
I’m not going to sing it.

92
00:04:58,150 --> 00:04:59,230
I don’t want to get sued again.

93
00:04:59,360 --> 00:05:02,984
I—no, you’re singing it in your head though, and I can see it. [laugh]

94
00:05:03,889 --> 00:05:03,919
[laugh]

95
00:05:05,139 --> 00:05:08,830
Oh, so rather than talking about CrowdStrike for the next 30

96
00:05:08,830 --> 00:05:11,690
minutes, I think we should all just go watch The Lion King—

97
00:05:11,700 --> 00:05:12,300
The original.

98
00:05:12,320 --> 00:05:15,539
Which is the best sci-fi movie of all time. [laugh]

99
00:05:15,649 --> 00:05:17,840
I’m not really sure where to go with that. [laugh]

100
00:05:18,539 --> 00:05:18,969
[laugh] I don’t either.

101
00:05:20,049 --> 00:05:22,309
I’m curious to hear the comments that we get in.

102
00:05:22,320 --> 00:05:28,190
I did recently watch Dune: Part Two, which was excellent.

103
00:05:28,560 --> 00:05:29,560
Took you long enough.

104
00:05:30,090 --> 00:05:30,590
Listen.

105
00:05:30,910 --> 00:05:33,550
Some of us responsible citizens saw it in the theater.

106
00:05:34,380 --> 00:05:35,539
I have one word for you.

107
00:05:36,040 --> 00:05:36,999
That word is children.

108
00:05:37,469 --> 00:05:37,949
Anyway.

109
00:05:38,139 --> 00:05:39,260
You don’t think they would like it?

110
00:05:39,700 --> 00:05:44,830
I think nightmare fuel would probably be the closest, yeah,

111
00:05:44,840 --> 00:05:49,000
Feyd-Rautha—Routha—however the hell you say his name—yeah,

112
00:05:49,000 --> 00:05:51,500
those scenes in particular, God, that dude is creepy.

113
00:05:51,940 --> 00:05:55,510
Yeah, he really inhabited the creepy level of the character.

114
00:05:56,150 --> 00:05:59,719
Like, Jared Leto levels of creepy.

115
00:06:00,020 --> 00:06:01,320
No, but except good.

116
00:06:01,500 --> 00:06:06,100
Yes. [laugh] Yeah, because he’s playing a character, not himself.

117
00:06:06,910 --> 00:06:07,140
Oh.

118
00:06:07,750 --> 00:06:08,410
Anyway.

119
00:06:09,179 --> 00:06:09,269
CrowdStrike.

120
00:06:10,230 --> 00:06:11,239
What the hell happened?

121
00:06:12,040 --> 00:06:19,000
On Friday, July 19th, 2024, at 5:24 UTC—that’s 1 a.m.

122
00:06:19,010 --> 00:06:23,110
for our East Coast peeps, and the day before for California

123
00:06:23,330 --> 00:06:26,480
because you’re a bunch of weirdos—security vendor CrowdStrike

124
00:06:26,520 --> 00:06:29,479
released an update for their Falcon sensor platform.

125
00:06:30,120 --> 00:06:34,219
Falcon is an endpoint detection and response solution meant to protect

126
00:06:34,219 --> 00:06:38,210
systems against viruses, malware, and advanced persistent threats.

127
00:06:38,900 --> 00:06:44,230
The update type was a content update, or what CrowdStrike calls a channel file,

128
00:06:44,559 --> 00:06:50,399
which you can think of as, like, a virus definition, except as a modern EDR,

129
00:06:51,049 --> 00:06:54,560
it’s a bit more complicated than that, and we’ll get to why that’s important

130
00:06:54,740 --> 00:06:59,040
when we get to the root-cause analysis, or what we know so far.

131
00:06:59,860 --> 00:07:03,039
Once the channel file was loaded by the Falcon sensor

132
00:07:03,040 --> 00:07:07,080
platform, it caused a memory access fault at the kernel

133
00:07:07,080 --> 00:07:11,420
level that forced a system crash on all Windows clients.

134
00:07:11,959 --> 00:07:14,750
The old Blue Screen of Death popped up, and then the

135
00:07:14,750 --> 00:07:18,760
system either rebooted or sat at that screen for a while.

136
00:07:19,310 --> 00:07:20,440
Possibly forever.

137
00:07:21,070 --> 00:07:22,600
Yeah, until somebody touched it.

138
00:07:22,990 --> 00:07:23,729
Pretty much.

139
00:07:24,740 --> 00:07:27,919
So, if you happened to walk into a major airport around that time,

140
00:07:28,270 --> 00:07:31,640
you might have been greeted by giant display signs that just had

141
00:07:31,640 --> 00:07:37,220
the sad frowny face on it, because now the blue screen has an emoji.

142
00:07:38,180 --> 00:07:40,289
And it was kind of funny, actually.

143
00:07:40,639 --> 00:07:43,830
I mean, funny for the people, you know, seeing the screens; not funny

144
00:07:43,830 --> 00:07:46,090
for everybody who had to deal with the disaster, [unintelligible]

145
00:07:46,330 --> 00:07:46,650
Right.

146
00:07:46,690 --> 00:07:52,520
And were sitting in airports for three days while waiting to, you know, go home.

147
00:07:52,960 --> 00:07:53,590
Yeah.

148
00:07:53,970 --> 00:07:58,520
Depending on which airline you were working with, you

149
00:07:58,520 --> 00:08:02,050
may have been not impacted at all, impacted slightly, or

150
00:08:02,050 --> 00:08:04,280
still sitting in the airport listening to this right now.

151
00:08:04,949 --> 00:08:05,779
I’m so sorry.

152
00:08:06,240 --> 00:08:08,410
Maybe don’t fly Delta [laugh] next time.

153
00:08:08,990 --> 00:08:10,000
Actually, I don’t know if it was Delta.

154
00:08:10,010 --> 00:08:10,880
It might have been United.

155
00:08:10,960 --> 00:08:11,760
They’re all terrible.

156
00:08:11,950 --> 00:08:12,700
It doesn’t matter.

157
00:08:13,530 --> 00:08:16,620
But one of the few that wasn’t affected was Southwest.

158
00:08:17,010 --> 00:08:18,389
Is that because they’re running Linux?

159
00:08:18,679 --> 00:08:19,429
Allegedly.

160
00:08:19,870 --> 00:08:23,610
Again, this is unproven internet theory, but allegedly it’s because

161
00:08:23,629 --> 00:08:26,410
their systems were so old that CrowdStrike wouldn’t run on them.

162
00:08:27,670 --> 00:08:33,710
[laugh] I feel like we did cover Southwest in a Chaos Lever, or possibly its

163
00:08:33,719 --> 00:08:38,689
precursor, when we talked about old, out-of-date systems that are super fragile.

164
00:08:38,889 --> 00:08:39,990
Am I remembering correctly?

165
00:08:40,340 --> 00:08:44,730
I mean, I had that theory, or that thought as well, but I’m also now like,

166
00:08:45,180 --> 00:08:49,199
did they just post that, and it became a memory, or is it a real memory?

167
00:08:49,530 --> 00:08:51,450
[laugh] It’s hard to say.

168
00:08:52,080 --> 00:08:57,359
I will say that it was in fact Delta—and is Delta—that’s having

169
00:08:57,370 --> 00:09:01,720
the biggest struggle because they use BitLocker extensively.

170
00:09:02,150 --> 00:09:02,340
Right.

171
00:09:02,380 --> 00:09:03,550
I assume you’re going to get into that.

172
00:09:03,810 --> 00:09:04,410
Oh, yes.

173
00:09:04,560 --> 00:09:04,890
Okay.

174
00:09:05,200 --> 00:09:06,010
I don’t want to interrupt.

175
00:09:06,310 --> 00:09:06,710
Carry on.

176
00:09:06,940 --> 00:09:07,780
So, we had all these crashes—

177
00:09:07,780 --> 00:09:08,479
Whenever you’re ready.

178
00:09:08,960 --> 00:09:10,330
And you know, when your system—

179
00:09:10,330 --> 00:09:12,000
Just go with [crosstalk]

180
00:09:12,000 --> 00:09:12,130
—[unintelligible]

181
00:09:12,130 --> 00:09:12,363
—whenever [unintelligible]

182
00:09:12,363 --> 00:09:12,536
—[unintelligible]

183
00:09:12,710 --> 00:09:13,170
—at any time—

184
00:09:13,170 --> 00:09:13,464
[unintelligible]

185
00:09:13,759 --> 00:09:15,729
—when you could—why would—who—

186
00:09:15,740 --> 00:09:18,099
[laugh] We have all these crashed systems,

187
00:09:18,230 --> 00:09:19,459
and what do you do with a crashed system?

188
00:09:19,460 --> 00:09:20,150
You restart it.

189
00:09:20,840 --> 00:09:23,240
But unfortunately, attempts to restart the afflicted

190
00:09:23,240 --> 00:09:26,160
systems just resulted in another blue screen of death.

191
00:09:26,599 --> 00:09:30,959
Because Falcon sensor is loaded as a driver during system

192
00:09:30,969 --> 00:09:35,790
boot, and it has been marked as boot required, meaning

193
00:09:35,790 --> 00:09:38,780
it must be loaded for the system to boot properly.

194
00:09:39,580 --> 00:09:42,880
As soon as Falcon started, it would load all of its channel

195
00:09:42,880 --> 00:09:45,630
files and, predictably, the system would crash again.

196
00:09:46,309 --> 00:09:49,630
This rendered all affected systems completely

197
00:09:49,670 --> 00:09:53,429
unusable and inaccessible through in-band management.

198
00:09:53,440 --> 00:09:56,069
So, you can’t RDP into this thing and fix it.

199
00:09:56,830 --> 00:10:01,240
So, this makes sense from an EDR perspective, right?

200
00:10:01,240 --> 00:10:01,355
Yes.

201
00:10:01,355 --> 00:10:02,699
You want to protect your computer.

202
00:10:03,580 --> 00:10:07,469
No matter what tool you have, it’s going to have this boot requirement

203
00:10:08,290 --> 00:10:11,460
because you don’t want your system booting without endpoint protection.

204
00:10:11,639 --> 00:10:11,939
Right.

205
00:10:11,939 --> 00:10:14,210
Because endpoint protection, ostensibly, is good.

206
00:10:15,090 --> 00:10:15,750
Ostensibly.

207
00:10:16,700 --> 00:10:20,450
The problem, obviously, comes in where your endpoint management

208
00:10:20,450 --> 00:10:23,510
is now, effectively, malware that’s crashing your system.

209
00:10:23,880 --> 00:10:24,230
Right.

210
00:10:24,550 --> 00:10:26,190
That would be what we call ‘the downside.’

211
00:10:26,190 --> 00:10:27,210
[laugh] Yes.

212
00:10:27,639 --> 00:10:29,769
And we will definitely get into that as well.

213
00:10:30,290 --> 00:10:34,700
Microsoft has published a blog post where they claim, according to their

214
00:10:34,700 --> 00:10:39,109
telemetry, about 8.5 million Windows devices were impacted by this.

215
00:10:39,450 --> 00:10:43,210
Now, that’s only about 1 or 2% of all Windows devices out

216
00:10:43,210 --> 00:10:47,530
there, so this is not, as a percentage, a ton of devices.

217
00:10:47,539 --> 00:10:53,010
However… it’s still a lot of devices, [laugh] and the impact was pretty severe.

218
00:10:53,020 --> 00:10:56,489
As we discussed, airlines had to suspend or cancel

219
00:10:56,500 --> 00:11:00,120
flights, retail stores suddenly couldn’t accept payment.

220
00:11:00,690 --> 00:11:03,630
Medical devices and hospitals crashed in the middle of

221
00:11:03,630 --> 00:11:08,180
surgeries, bowling alleys had to hand out paper and pencils to

222
00:11:08,180 --> 00:11:12,210
individuals, who just looked at them like, what the hell is this?

223
00:11:12,400 --> 00:11:14,599
How do I track ten frames by hand?

224
00:11:14,900 --> 00:11:16,520
How does a turkey even work?

225
00:11:17,150 --> 00:11:19,539
[sigh] Dark times for all of us, Chris.

226
00:11:19,730 --> 00:11:23,569
That’s the kind of math podcast that needs to come out because I guarantee

227
00:11:23,570 --> 00:11:26,079
there’s no one left on earth who knows how to score bowling by hand.

228
00:11:28,140 --> 00:11:28,829
[laugh] True story.

229
00:11:29,390 --> 00:11:33,240
I was up in Cape Cod, and we went duckpin bowling—which is a real thing.

230
00:11:33,309 --> 00:11:33,840
Look it up—

231
00:11:33,929 --> 00:11:34,630
Oh, it’s so fun.

232
00:11:34,730 --> 00:11:35,450
It’s super fun.

233
00:11:35,469 --> 00:11:36,280
Definitely look it up.

234
00:11:36,639 --> 00:11:40,860
Super fun, but the bowling alley was so old that

235
00:11:40,860 --> 00:11:43,279
they did not have a computerized scoring system.

236
00:11:43,810 --> 00:11:44,160
Wow.

237
00:11:44,630 --> 00:11:44,950
Yeah.

238
00:11:45,000 --> 00:11:46,630
They gave me a piece of paper and pencil, and

239
00:11:46,630 --> 00:11:49,980
I was like, “Uh, score is not important, right?

240
00:11:50,950 --> 00:11:58,140
We’re just here to have fun.” Oh… now to get these systems back to a working

241
00:11:58,140 --> 00:12:03,580
state, the offending channel files had to be removed before Falcon was loaded.

242
00:12:04,130 --> 00:12:07,550
There’s a few options to do this, and none of them are great or easy.

243
00:12:08,429 --> 00:12:13,110
You can boot the system into Windows safe mode, which only loads the

244
00:12:13,219 --> 00:12:17,580
absolute bare minimum of Windows drivers, and then remove the files.

245
00:12:18,440 --> 00:12:22,710
For virtual systems, you could mount the system disk on another system and

246
00:12:22,710 --> 00:12:27,390
remove the files, and then reattach the drive to the original system, or

247
00:12:27,420 --> 00:12:31,959
if you had snapshots or a backup, you could roll back to a prior snapshot.
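
[Editor's sketch] The "mount the disk and remove the files" option could look roughly like this in Python. The directory and the `C-00000291*.sys` pattern are assumptions taken from CrowdStrike's public remediation guidance, not from this episode, and `remove_bad_channel_files` is a hypothetical helper name. It defaults to a dry run, so nothing is deleted unless you ask.

```python
from pathlib import Path

# Assumption: location and filename pattern follow CrowdStrike's public
# remediation guidance; neither is stated in the episode itself.
CROWDSTRIKE_DIR = Path("Windows/System32/drivers/CrowdStrike")
BAD_CHANNEL_PATTERN = "C-00000291*.sys"


def remove_bad_channel_files(system_root, dry_run=True):
    """Find the offending channel files under a mounted system disk.

    system_root is wherever the affected disk is mounted (hypothetical).
    Returns the matching paths; deletes them only when dry_run is False.
    """
    driver_dir = Path(system_root) / CROWDSTRIKE_DIR
    if not driver_dir.is_dir():
        # Nothing to do: Falcon is not installed on this disk.
        return []
    matches = sorted(driver_dir.glob(BAD_CHANNEL_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Run it once with the default `dry_run=True` to see what would go, then again with `dry_run=False` before reattaching the disk to the original system.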

248
00:12:32,770 --> 00:12:38,540
Fortunately, CrowdStrike did pull the offending file from the update servers,

249
00:12:38,970 --> 00:12:42,850
so you wouldn’t then immediately redownload it and be back where you were.

250
00:12:43,550 --> 00:12:48,339
While it is a huge pain to fix all of these virtual systems, the real pain

251
00:12:48,349 --> 00:12:52,840
is those physical systems that don’t have an out-of-band management option.

252
00:12:53,539 --> 00:12:56,790
Someone will need to physically sit at the terminal, invoke

253
00:12:56,790 --> 00:13:00,880
safe mode, and perform the remediation steps, or use a separate

254
00:13:00,890 --> 00:13:04,260
boot device like a thumb drive to perform the maintenance.

255
00:13:04,360 --> 00:13:06,519
This is very bad.

256
00:13:07,349 --> 00:13:10,389
You forgot about the other way to fix the system, which apparently

257
00:13:10,410 --> 00:13:14,010
did work on some—at least a number of people’s, which is just

258
00:13:14,010 --> 00:13:17,959
keep rebooting it until CrowdStrike Falcon updated, and deleted

259
00:13:17,960 --> 00:13:20,300
the file on its own before it crashed because of the file.

260
00:13:20,890 --> 00:13:25,360
[laugh] I guess if it does load and the network stack loads

261
00:13:25,360 --> 00:13:30,210
in time for it to pull the update and replace it, maybe?

262
00:13:30,730 --> 00:13:31,200
Maybe.

263
00:13:31,650 --> 00:13:33,550
Between 15 and 20 reboots.

264
00:13:33,650 --> 00:13:35,410
Sometimes people were getting it to work.

265
00:13:35,680 --> 00:13:36,290
Wow.

266
00:13:37,110 --> 00:13:37,920
That’s awful.

267
00:13:38,030 --> 00:13:38,909
But okay.

268
00:13:38,970 --> 00:13:40,030
So, another option.

269
00:13:40,990 --> 00:13:45,970
Microsoft has published a USB tool to assist with the

270
00:13:45,970 --> 00:13:49,830
removal of this file, so you have that option as well.

271
00:13:51,330 --> 00:13:55,480
As I mentioned, the BitLocker thing does throw a bit of a wrench

272
00:13:55,650 --> 00:14:00,360
in the whole plan because in order to access a BitLocker-protected

273
00:14:00,360 --> 00:14:05,770
system drive out-of-band, you have to supply a BitLocker unlock key—

274
00:14:06,139 --> 00:14:06,349
Yeah.

275
00:14:07,080 --> 00:14:09,280
And that can be hard to get.

276
00:14:10,080 --> 00:14:13,240
Well, it’s not like people want their end-users to have that.

277
00:14:13,370 --> 00:14:13,720
Again—

278
00:14:13,730 --> 00:14:13,880
Yes.

279
00:14:13,880 --> 00:14:15,380
—this is a security concern.

280
00:14:15,929 --> 00:14:20,290
Also, the BitLocker key is 48 characters long, so not only finding it

281
00:14:20,290 --> 00:14:24,839
but typing it in before BitLocker times out… which it does, apparently.

282
00:14:26,230 --> 00:14:27,230
So, a bit of a nightmare.

283
00:14:27,790 --> 00:14:28,710
Not a great situation.

284
00:14:29,030 --> 00:14:29,329
No.

285
00:14:30,300 --> 00:14:33,360
And so, that’s part of the reason Delta is still struggling.

286
00:14:34,179 --> 00:14:39,099
I would love to say that, as of right now, we know exactly what caused the

287
00:14:39,099 --> 00:14:44,660
error, but honestly portions of the supply chain are still pretty murky.

288
00:14:45,440 --> 00:14:51,650
Instead, I will try to explain how a simple update for an EDR caused millions

289
00:14:51,650 --> 00:14:56,090
of Windows machines to blue screen, and we can also have fun pointing all the

290
00:14:56,090 --> 00:15:01,050
fingers that we have at all the other parties because we’ve got jazz hands.

291
00:15:01,050 --> 00:15:02,200
[whispering] It’s all your fault.

292
00:15:03,030 --> 00:15:04,340
That works better with a—

293
00:15:04,660 --> 00:15:05,430
Visual medium?

294
00:15:05,760 --> 00:15:09,490
Yeah. [laugh] So, to start with, we have to consider

295
00:15:09,490 --> 00:15:11,870
what the Falcon sensor’s actually trying to do.

296
00:15:12,690 --> 00:15:16,930
Falcon sensor, as I mentioned, is an EDR product, and it’s meant to

297
00:15:16,930 --> 00:15:21,209
scan all activity on the host operating system looking for threats.

298
00:15:21,820 --> 00:15:24,900
Most applications aren’t granted that level of access

299
00:15:24,900 --> 00:15:27,849
to other applications or to the system as a whole.

300
00:15:28,400 --> 00:15:32,099
As you mentioned, Chris, it needs to be in a privileged position.

301
00:15:32,679 --> 00:15:37,150
But that’s the point: you’re trying to prevent other pieces of software from

302
00:15:37,150 --> 00:15:41,320
getting themselves into privileged positions to compromise your computer.

303
00:15:42,430 --> 00:15:45,219
To understand what it means to be in that privileged position,

304
00:15:45,220 --> 00:15:48,350
I’m going to briefly talk about user space and kernel space.

305
00:15:48,620 --> 00:15:51,790
Please feel free to interrupt me when I get something wrong, which I will.

306
00:15:52,510 --> 00:15:53,386
[whispering] Yes, thank you.

307
00:15:53,609 --> 00:15:55,920
Your operating system, whether it’s Windows,

308
00:15:55,980 --> 00:15:59,910
Linux, macOS, I don’t know, Solaris—

309
00:16:00,680 --> 00:16:01,280
AIX?

310
00:16:01,760 --> 00:16:06,299
—sure—it is responsible for managing the hardware on your system.

311
00:16:06,410 --> 00:16:10,200
That includes stuff like memory management, writing data to disk,

312
00:16:10,510 --> 00:16:14,460
sensing input from peripherals, and scheduling threads on the CPU.

313
00:16:15,230 --> 00:16:17,510
This all happens in what is called kernel

314
00:16:17,520 --> 00:16:20,530
space, and it’s considered highly privileged.

315
00:16:20,860 --> 00:16:24,870
If something goes wrong in kernel space, the system may have to halt

316
00:16:25,230 --> 00:16:29,579
or crash to prevent damage to the hardware, or corruption of data.

317
00:16:30,490 --> 00:16:34,710
Ideally, as little as possible should be running in kernel space.

318
00:16:35,500 --> 00:16:38,530
Instead, most applications run in user space

319
00:16:38,639 --> 00:16:41,310
which does not have direct access to the hardware.

320
00:16:42,170 --> 00:16:45,940
Applications running in user space interact with the operating system,

321
00:16:45,990 --> 00:16:50,190
and make requests based on that operating system’s published APIs.

322
00:16:50,530 --> 00:16:52,650
Do you want to write a file to disk?

323
00:16:52,950 --> 00:16:55,620
You make an API call and pass the correct information.

324
00:16:56,130 --> 00:16:57,450
Need to access memory?

325
00:16:57,880 --> 00:17:01,220
Make an API call and specify the address and range.

326
00:17:01,960 --> 00:17:04,530
The operating system will evaluate that request,

327
00:17:05,010 --> 00:17:08,839
make sure it’s valid and allowed before executing it.

328
00:17:09,510 --> 00:17:12,810
This means when an application runs into issues or it crashes, the

329
00:17:12,849 --> 00:17:17,329
operating system is able to handle that crash gracefully—most of

330
00:17:17,329 --> 00:17:21,339
the time—and keep other processes and the system as a whole running.
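
[Editor's sketch] A rough user-space illustration of those two points, in Python rather than anything Windows-specific. The function names are hypothetical. The first shows the operating system rejecting an invalid request with an error instead of falling over; the second shows a child process crashing while the parent keeps running.

```python
import errno
import os
import subprocess
import sys


def write_to_closed_descriptor():
    """Ask the OS to write to a descriptor we have already closed."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)
    os.close(write_fd)
    try:
        os.write(write_fd, b"hello")
    except OSError as err:
        # The OS validated the request and refused it; nothing crashed.
        return err.errno
    return None


def run_crashing_child():
    """Start a child process that aborts, and report its exit status."""
    child = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stderr=subprocess.DEVNULL,
    )
    # We only get here because the crash stayed inside the child.
    return child.returncode


if __name__ == "__main__":
    print("errno from bad write:", write_to_closed_descriptor())
    print("child exit status:", run_crashing_child())
```

A kernel-mode driver gets neither courtesy: there is no outer layer to hand it an error code or contain its crash.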

331
00:17:22,440 --> 00:17:22,720
Right.

332
00:17:23,069 --> 00:17:25,129
And there’s one point to note here.

333
00:17:25,129 --> 00:17:27,843
So, first of all, some of the terminology, they call it the

334
00:17:27,920 --> 00:17:32,270
kernel; also, they call it Ring 0, meaning it is the lowest

335
00:17:32,280 --> 00:17:35,000
possible level of the system, and it has access to everything

336
00:17:35,000 --> 00:17:37,910
else that is going on in the system without restriction.

337
00:17:38,679 --> 00:17:42,690
Necessary to make sure, for things like EDR tools, that it can

338
00:17:42,790 --> 00:17:46,770
scan not only all of the files, but all of the activity, all of the

339
00:17:46,770 --> 00:17:49,669
network, all of the I/O, the disk, et cetera, et cetera, et cetera.

340
00:17:50,270 --> 00:17:50,490
Right.

341
00:17:50,870 --> 00:17:56,040
One thing people always get upset about is, why does Windows crash so easily?

342
00:17:56,420 --> 00:17:56,780
And—

343
00:17:58,030 --> 00:17:58,060
[laugh]

344
00:17:58,350 --> 00:18:01,250
While there is an argument to be made that it is fragile and poorly

345
00:18:01,250 --> 00:18:04,100
designed and should have a better way of handling things like EDR

346
00:18:04,140 --> 00:18:07,730
that needs this access—which is true, and I assume you’ll get to that—

347
00:18:07,990 --> 00:18:08,310
Yes.

348
00:18:08,310 --> 00:18:12,320
The other thing is, again, remember, completely unfettered access.

349
00:18:12,349 --> 00:18:16,239
If something goes wrong at the kernel level, we

350
00:18:16,240 --> 00:18:20,700
get our old friend, unanticipated consequences.

351
00:18:21,600 --> 00:18:22,840
And this is extremely bad.

352
00:18:22,850 --> 00:18:27,709
So, for example, let’s say you have a system that is running a database.

353
00:18:28,200 --> 00:18:30,120
Databases, as you know, are kind of important.

354
00:18:31,389 --> 00:18:35,919
A kernel-level job is trying to write a new file, or

355
00:18:35,920 --> 00:18:38,760
a new table, or a new row, or record, or whatever, but

356
00:18:38,760 --> 00:18:42,179
it runs into an error with, say, memory misallocation.

357
00:18:43,290 --> 00:18:44,980
What is it going to write to the database?

358
00:18:45,610 --> 00:18:47,669
It could be writing absolute nonsense.

359
00:18:47,710 --> 00:18:49,949
It could completely corrupt the database.

360
00:18:49,969 --> 00:18:53,639
Therefore, the kernel crashes preemptively whenever it detects a

361
00:18:53,639 --> 00:18:59,280
failure because the consequences of trying to soldier on might be worse.

362
00:18:59,970 --> 00:19:00,420
Right.

363
00:19:00,940 --> 00:19:04,810
It’s that, “Out of an abundance of caution, I’m going to fail.”

364
00:19:05,179 --> 00:19:05,469
Right.

365
00:19:05,889 --> 00:19:08,170
Which is the same thing I did in high school.

366
00:19:09,280 --> 00:19:12,560
[laugh] Yes… it was better if you didn’t succeed, Chris.

367
00:19:12,560 --> 00:19:12,590
[laugh]

368
00:19:13,890 --> 00:19:17,070
So, Windows applications have been able to request

369
00:19:17,080 --> 00:19:19,760
access to run in kernel mode for a long time.

370
00:19:20,309 --> 00:19:24,420
Generally, that’s a bad idea, for the reasons you just articulated.

371
00:19:25,150 --> 00:19:27,389
But Microsoft wasn’t super strict about it.

372
00:19:27,960 --> 00:19:30,600
Microsoft is nothing if not accommodating

373
00:19:30,610 --> 00:19:32,959
to developers and their terrible ideas.

374
00:19:33,860 --> 00:19:36,030
Some applications actually do need to run in

375
00:19:36,030 --> 00:19:38,879
kernel mode, in particular, antivirus software.

376
00:19:39,450 --> 00:19:42,869
Applications running in user mode are not generally allowed to

377
00:19:42,870 --> 00:19:46,529
access the memory and monitor the behavior of other applications.

378
00:19:46,949 --> 00:19:50,220
Microsoft Teams can’t just decide to read the memory space of

379
00:19:50,220 --> 00:19:54,620
Slack or kill the Zoom processes, as much as it might want to.

380
00:19:54,620 --> 00:19:56,010
I was going to say, it totally would.

381
00:19:56,270 --> 00:19:59,939
[laugh] . The operating system just doesn’t allow that type of nonsense.

382
00:20:00,509 --> 00:20:03,739
But, you know, an antivirus application needs a privileged

383
00:20:03,740 --> 00:20:06,700
level of access and monitoring to defeat the bad guys.

384
00:20:07,100 --> 00:20:10,350
So, antivirus companies like Symantec wrote

385
00:20:10,380 --> 00:20:12,640
their application to run in kernel space.

386
00:20:13,500 --> 00:20:18,460
Now, Microsoft actually tried to push back on the rampant abuse of kernel

387
00:20:18,460 --> 00:20:23,949
mode by antivirus outfits—and others—when Windows Vista was being rolled out.

388
00:20:24,440 --> 00:20:24,800
Yeah.

389
00:20:24,920 --> 00:20:26,020
What, what, what?

390
00:20:26,230 --> 00:20:29,930
Vista, for you youngsters in the crowd,

391
00:20:30,310 --> 00:20:33,220
Vista was the Windows 8 of the early aughts.

392
00:20:34,230 --> 00:20:36,230
Hopefully that puts some perspective on things.

393
00:20:37,179 --> 00:20:42,590
While Vista was a disaster as an operating system release, they did add a whole

394
00:20:42,590 --> 00:20:48,410
bunch of additional functionality and features that brought the client OSes more

395
00:20:48,410 --> 00:20:52,810
in line with what the server OSes were doing, and added a bunch of security.

396
00:20:53,040 --> 00:20:57,280
And one of the things they really tried to do was lock down kernel mode access.

397
00:20:58,150 --> 00:21:02,159
Unfortunately, antivirus companies didn’t like that, and they threw a hissy

398
00:21:02,160 --> 00:21:06,550
fit, claiming that since Windows Defender could run in kernel mode, and their

399
00:21:06,550 --> 00:21:12,870
stuff couldn’t, Microsoft was abusing their influence, a la Internet Explorer.

400
00:21:13,530 --> 00:21:17,420
And Microsoft, still reeling from their decade-long battle

401
00:21:17,420 --> 00:21:22,040
with the FTC over antitrust, kowtowed to the AV club, and

402
00:21:22,040 --> 00:21:25,040
allowed them to keep their precious kernel mode access.

403
00:21:25,710 --> 00:21:29,170
It’s not an unreasonable request because all the

404
00:21:29,170 --> 00:21:31,929
other players wanted was an even playing field.

405
00:21:32,120 --> 00:21:32,419
Right.

406
00:21:32,460 --> 00:21:35,360
The fact that the even playing field was a wide-open

407
00:21:35,469 --> 00:21:38,660
security nightmare is still a Microsoft problem.

408
00:21:39,710 --> 00:21:40,190
[laugh] . Right.

409
00:21:40,750 --> 00:21:44,390
Microsoft did add an interesting requirement, though, if you

410
00:21:44,390 --> 00:21:47,550
wanted to play in kernel space, and that was driver signing.

411
00:21:48,889 --> 00:21:51,370
Antivirus applications would present themselves

412
00:21:51,410 --> 00:21:55,050
as device drivers to get to run in kernel mode.

413
00:21:55,480 --> 00:21:59,320
A device driver to nothing, but a device driver nonetheless.

414
00:21:59,960 --> 00:22:05,620
Microsoft created the Windows Hardware Quality Labs Testing Certification—aka

415
00:22:05,750 --> 00:22:14,240
WHQL—and once a driver had gone through that lab and gotten its certification,

416
00:22:14,470 --> 00:22:20,190
Microsoft would digitally sign the driver and give them the Certified for

417
00:22:20,190 --> 00:22:25,139
Windows logo, so they could proudly display ‘Certified for Windows Vista’—or

418
00:22:25,139 --> 00:22:30,540
Windows 8 or whatever—on the box when you buy the software, or on their website.

419
00:22:31,400 --> 00:22:34,430
Now, vendors could still choose to sign their drivers

420
00:22:34,450 --> 00:22:38,750
internally, but the antivirus folks wanted to get that

421
00:22:39,400 --> 00:22:42,710
WHQL certification and all the cachet that went with it.

422
00:22:43,599 --> 00:22:46,369
As long as your driver code didn’t change,

423
00:22:46,500 --> 00:22:49,179
the digital signature would remain valid.

424
00:22:49,630 --> 00:22:53,690
So, that means all these antivirus companies—like CrowdStrike—would

425
00:22:53,690 --> 00:22:56,710
get that certification, which meant that it had gone through

426
00:22:56,720 --> 00:23:00,079
some level of rigorous testing when it came to the way the

427
00:23:00,080 --> 00:23:03,000
driver was written and the way it interacted with the kernel.

428
00:23:03,880 --> 00:23:04,919
Seems like a good idea.

429
00:23:05,609 --> 00:23:06,109
I’m for it.

430
00:23:06,980 --> 00:23:10,649
Unfortunately, external data could be loaded by the driver, like—

431
00:23:10,660 --> 00:23:10,680
No.

432
00:23:11,290 --> 00:23:12,670
—virus definitions.

433
00:23:13,100 --> 00:23:14,909
But in theory, the actual running code

434
00:23:14,910 --> 00:23:18,270
should all live in that signed device driver.

435
00:23:18,429 --> 00:23:21,300
So, read in some config, but all the logic in the

436
00:23:21,300 --> 00:23:23,609
actual code should live in that device driver.

437
00:23:24,150 --> 00:23:27,490
That’s all well and good for loading virus signatures and

438
00:23:27,490 --> 00:23:32,100
looking for matches in memory and CPU threads, but Falcon

439
00:23:32,110 --> 00:23:36,600
sensor is a modern EDR, and it doesn’t just use signatures.

440
00:23:37,170 --> 00:23:40,800
Instead, Falcon uses machine learning to develop behavior

441
00:23:40,800 --> 00:23:44,150
patterns, and then it needs to detect and respond to

442
00:23:44,150 --> 00:23:47,120
emerging threats that match those behavior patterns.

443
00:23:47,740 --> 00:23:53,510
The channel updates Falcon sensor receives to model that behavior, those updates

444
00:23:53,550 --> 00:23:59,639
appear to include some amount of pseudocode that is executed by the driver.

445
00:24:00,309 --> 00:24:03,970
And it is that injected code from the channel—or lack

446
00:24:03,970 --> 00:24:07,010
thereof, actually—that seems to have caused the issue.

447
00:24:07,730 --> 00:24:10,330
According to people who have looked at the channel

448
00:24:10,330 --> 00:24:14,279
file in question, it is entirely filled with zeros.

449
00:24:18,150 --> 00:24:21,670
[laugh] . Now, you would hope that the driver would look

450
00:24:21,670 --> 00:24:25,909
at a file full of zeros and just ignore it, like, “Nope.

451
00:24:26,510 --> 00:24:30,700
That’s invalid.” Falcon sensor chose a slightly different route and crashed.

452
00:24:31,500 --> 00:24:31,800
Right.

453
00:24:32,259 --> 00:24:35,180
So, what we have here is a driver that is legitimate

454
00:24:35,670 --> 00:24:39,169
and was tested and proven resilient, which is good.

455
00:24:39,870 --> 00:24:40,149
Yeah.

456
00:24:40,300 --> 00:24:43,300
And we have updates that come down the wire multiple times

457
00:24:43,300 --> 00:24:47,090
a day and interact directly with that driver, that were not.

458
00:24:47,660 --> 00:24:48,250
Precisely.

459
00:24:48,940 --> 00:24:53,010
And it would appear that of the many tests that were run against

460
00:24:53,010 --> 00:24:56,510
that driver, none of the tests were, “Here’s a file full of zeros.

461
00:24:56,630 --> 00:25:00,520
What do you do?” Because no one thought that was a thing that would ever occur.

462
00:25:01,070 --> 00:25:01,639
But it did.

463
00:25:02,460 --> 00:25:08,940
There is a popular breakdown of the Falcon sensor crash dump by Twitter person

464
00:25:09,240 --> 00:25:15,260
Perpetualmaniac, which I won’t be linking because after assessing that it was

465
00:25:15,280 --> 00:25:20,420
a lack of null pointer checking in the dump, he then went on to make weird

466
00:25:20,429 --> 00:25:24,080
disparaging comments about the Rust community and blamed the whole thing on DEI.

467
00:25:25,740 --> 00:25:29,760
It got strange and kind of fasci, so fuck that guy.

468
00:25:29,990 --> 00:25:30,250
Fair.

469
00:25:30,420 --> 00:25:33,900
Instead, I’ll include a link to a different Twitter thread, by

470
00:25:33,900 --> 00:25:37,989
someone who actually debugs stuff like this for a living, and he

471
00:25:37,990 --> 00:25:42,959
basically said that Perpetualmaniac was wrong and thinks that it is

472
00:25:43,570 --> 00:25:47,300
uninitialized data being read from a table that caused the crash.

473
00:25:48,059 --> 00:25:51,180
Now, considering that the input file was entirely filled

474
00:25:51,180 --> 00:25:54,950
with nothing, uninitialized sounds like an understatement.

475
00:25:55,970 --> 00:26:00,170
Unfortunately, we won’t know for sure unless CrowdStrike shares

476
00:26:00,170 --> 00:26:03,480
their source code for their driver, which seems unlikely.

477
00:26:03,990 --> 00:26:06,239
Maybe they should, but I don’t think they will.

478
00:26:07,080 --> 00:26:11,360
The point is that the channel update caused Falcon sensor to attempt to access a

479
00:26:11,360 --> 00:26:15,560
memory location that didn’t exist or wasn’t initialized, and the driver crashed,

480
00:26:15,719 --> 00:26:19,989
forcing the system to halt in order to prevent possible data corruption.

481
00:26:20,780 --> 00:26:21,800
So, that’s where we’re at.

482
00:26:22,770 --> 00:26:24,070
Now, it’s time to point fingers.

483
00:26:24,590 --> 00:26:25,010
Cool.

484
00:26:25,340 --> 00:26:26,540
[It’s] everybody’s favorite part.

485
00:26:26,950 --> 00:26:30,780
Predictably, in a fuckup of this magnitude, the blame

486
00:26:30,780 --> 00:26:33,949
game and armchair quarterbacking is in full effect.

487
00:26:34,530 --> 00:26:37,090
Thought leaders are tripping over themselves on

488
00:26:37,349 --> 00:26:39,680
LinkedIn to have an opinion about the whole mess.

489
00:26:40,080 --> 00:26:43,760
And I’ve seen posts ranging from ‘this is all CrowdStrike’s fault.

490
00:26:43,930 --> 00:26:47,650
How did this update ever get out the door?’ ‘This is all Microsoft’s fault.

491
00:26:47,860 --> 00:26:50,750
How could they let third parties run in kernel mode?’ ‘This is the

492
00:26:50,750 --> 00:26:55,530
customers’ fault for not having phased rollouts.’ Et cetera, et cetera.

493
00:26:56,280 --> 00:26:59,799
And then there’s all the conspiracy theories about how this was a state actor,

494
00:26:59,799 --> 00:27:05,200
or a planned thing, or, I don’t know, CrowdStrike did it on purpose, for reasons?

495
00:27:05,910 --> 00:27:06,140
Anyway.

496
00:27:06,990 --> 00:27:08,040
Solar flares?

497
00:27:08,660 --> 00:27:09,460
Oh, I like that one.

498
00:27:09,840 --> 00:27:11,030
That’s what made it all zeros.

499
00:27:11,770 --> 00:27:14,439
There’s plenty of blame to go around, and none of it is

500
00:27:14,440 --> 00:27:17,850
actually helpful while the fire is burning, but now that

501
00:27:17,850 --> 00:27:21,640
we’re over a week out, maybe we can take a more nuanced look.

502
00:27:22,110 --> 00:27:22,480
Or not.

503
00:27:23,640 --> 00:27:27,020
So, how did this update actually leave CrowdStrike’s front door?

504
00:27:27,760 --> 00:27:28,630
That’s a great question.

505
00:27:29,349 --> 00:27:33,370
The truth is, we will not know until CrowdStrike tells us or

506
00:27:33,380 --> 00:27:37,330
a lawsuit forces legal discovery, and we find out that way.

507
00:27:38,140 --> 00:27:39,560
The former could come any day.

508
00:27:39,600 --> 00:27:43,160
I’ve checked their [unintelligible] blog posts several times as I was

509
00:27:43,160 --> 00:27:47,680
writing this piece, and so far, they haven’t said, but maybe they will.

510
00:27:48,230 --> 00:27:49,979
Uh, actually, so they did—

511
00:27:50,780 --> 00:27:51,010
Ooh.

512
00:27:51,130 --> 00:27:52,520
—at about three o’clock this morning.

513
00:27:54,230 --> 00:27:54,870
[laugh] . Of course they did.

514
00:27:54,870 --> 00:27:56,910
They released an official—well, an official

515
00:27:56,920 --> 00:27:59,740
unofficial preliminary post-incident review.

516
00:28:00,130 --> 00:28:00,540
Okay.

517
00:28:00,540 --> 00:28:01,370
It’s a good name.

518
00:28:01,840 --> 00:28:04,240
And basically what they’re saying is, it went through automated

519
00:28:04,240 --> 00:28:07,540
testing, but the automated content validator had a bug in it.

520
00:28:08,099 --> 00:28:12,729
So, they passed it—quote-unquote, “Passed,” but it was an invalid file.

521
00:28:13,239 --> 00:28:13,249
Ah.

522
00:28:13,389 --> 00:28:17,540
“Once the file went out, it was immediately picked up, read by Falcon

523
00:28:17,540 --> 00:28:22,250
sensor, and it caused an out-of-bounds memory read, triggering an exception.

524
00:28:23,150 --> 00:28:25,520
This unexpected exception could not be gracefully handled,

525
00:28:25,520 --> 00:28:29,509
resulting in a Windows operating system crash (BSOD).” Unquote.

526
00:28:30,930 --> 00:28:34,379
So, it seems like their testing harness or whatever they’re using

527
00:28:34,540 --> 00:28:37,450
also doesn’t know what to do with the file that’s all zeros.

528
00:28:37,870 --> 00:28:38,480
Well, yeah.

529
00:28:38,530 --> 00:28:39,909
There’s a lot of problems here.

530
00:28:40,030 --> 00:28:43,520
First of all, clearly they did not test the tester enough.

531
00:28:44,410 --> 00:28:44,720
Yeah.

532
00:28:44,770 --> 00:28:46,579
Because if you have a bug in a testing system

533
00:28:46,580 --> 00:28:48,770
in an automated deployment, that is a problem.

534
00:28:48,890 --> 00:28:50,159
That is a huge problem.

535
00:28:50,900 --> 00:28:55,720
And the fact that simply loading the file caused the blue screen pretty

536
00:28:55,720 --> 00:28:59,490
quickly makes it sound like they don’t actually push these updates to

537
00:28:59,490 --> 00:29:04,970
test machines that then run the update to see if the system crashes.

538
00:29:05,309 --> 00:29:08,060
They’re using some other testing process.

539
00:29:08,510 --> 00:29:11,770
Right, which they do not go into any detail about, unsurprisingly.

540
00:29:12,410 --> 00:29:16,600
So, I am sure that the lawsuits are forthcoming, and maybe we’ll

541
00:29:16,600 --> 00:29:20,549
find out more when legal discovery happens, if it gets that far, but

542
00:29:21,009 --> 00:29:25,199
the truth is, CrowdStrike pushes these channel updates frequently.

543
00:29:25,199 --> 00:29:28,460
Like you said, Chris, they push these more than once a day.

544
00:29:28,850 --> 00:29:33,029
And they have automated testing in place, but they’re trying to stay

545
00:29:33,030 --> 00:29:37,029
one step ahead of the bad guys, which means time is of the essence.

546
00:29:37,400 --> 00:29:39,700
This specific update was meant to address

547
00:29:39,700 --> 00:29:43,480
something, a new vulnerability found in named pipes.

548
00:29:44,070 --> 00:29:46,670
They wanted to get that update out before any

549
00:29:46,670 --> 00:29:50,260
attacker figured out how to abuse this vulnerability.

550
00:29:51,160 --> 00:29:53,350
So, maybe what they’re doing is sacrificing

551
00:29:53,350 --> 00:29:56,480
quality or testing in favor of speed.

552
00:29:57,200 --> 00:30:00,520
This is a systematic failure, and it’s not the fault of one person.

553
00:30:00,640 --> 00:30:03,969
Yes, maybe someone screwed up and accidentally saved the file

554
00:30:03,990 --> 00:30:06,600
empty, but something else in the chain should have caught that.

555
00:30:07,310 --> 00:30:07,700
Right.

556
00:30:08,210 --> 00:30:11,060
If a single person can unwittingly push an update that

557
00:30:11,060 --> 00:30:13,430
takes down eight-and-a-half million Windows clients,

558
00:30:14,099 --> 00:30:16,619
that’s an organizational and systematic problem.

559
00:30:17,449 --> 00:30:19,850
There’s also some indications that this isn’t

560
00:30:19,860 --> 00:30:22,870
the first time such a transgression has occurred.

561
00:30:23,390 --> 00:30:28,000
It appears that Red Hat Enterprise Linux, Debian, and

562
00:30:28,000 --> 00:30:31,489
Rocky Linux have all encountered similar crashing problems

563
00:30:31,559 --> 00:30:34,959
earlier this year after a channel update was pushed.

564
00:30:35,730 --> 00:30:40,600
I think it was April and May were the two months where the issues were found.

565
00:30:40,820 --> 00:30:43,980
The issue with Debian in particular was traced to a specific

566
00:30:43,990 --> 00:30:47,760
version of the kernel that wasn’t included in CrowdStrike’s

567
00:30:47,889 --> 00:30:51,990
testing matrix, but was on their list of supported kernel versions.

568
00:30:52,129 --> 00:30:56,680
macOS seems to have weathered the storm, for reasons that we will get to.

569
00:30:57,340 --> 00:30:58,689
Yeah, and I mean, that’s an important point.

570
00:30:58,690 --> 00:31:01,940
And, you know, a lot of times people will say, this only

571
00:31:01,940 --> 00:31:04,090
happens to Windows, and that’s absolutely not the case.

572
00:31:04,290 --> 00:31:07,669
Anytime something runs unfettered in Ring 0 of any operating

573
00:31:07,670 --> 00:31:11,250
system of any kind, you run the risk of causing an immediate crash.

574
00:31:12,109 --> 00:31:15,800
It’s just not usually so public-facing because you don’t tend to

575
00:31:15,800 --> 00:31:20,970
have Linux running your displays that’s also running CrowdStrike.

576
00:31:21,280 --> 00:31:21,540
Right.

577
00:31:21,550 --> 00:31:23,540
For whatever reason, we like Windows for that.

578
00:31:24,130 --> 00:31:24,479
I don’t know.

579
00:31:24,770 --> 00:31:27,409
We’ll get to that, too [laugh] . So, what about Microsoft?

580
00:31:27,830 --> 00:31:29,999
Shouldn’t they prevent this kind of thing from happening?

581
00:31:30,430 --> 00:31:32,690
In an ideal world, they could.

582
00:31:32,970 --> 00:31:35,139
And we’ll get into the technical solutions in a

583
00:31:35,140 --> 00:31:38,680
moment, but this is largely not Microsoft’s fault.

584
00:31:39,440 --> 00:31:46,440
Yes, Windows has its flaws—many, many, many flaws—and Microsoft

585
00:31:46,449 --> 00:31:49,750
hasn’t always produced the stablest or most secure software.

586
00:31:50,290 --> 00:31:52,980
No one could call them blameless with a straight face.

587
00:31:53,240 --> 00:31:56,760
But in this specific instance, the system is working

588
00:31:56,760 --> 00:31:59,730
as designed, even if the design kind of sucks.

589
00:32:00,490 --> 00:32:04,319
Should we be shaming all these organizations who let the update

590
00:32:04,340 --> 00:32:07,750
barrel through their environment like salmonella on a cruise ship?

591
00:32:08,330 --> 00:32:09,170
That’s an image for you.

592
00:32:10,000 --> 00:32:14,280
Think about the counterexample for a second: let’s say that a zero-day

593
00:32:14,300 --> 00:32:18,430
attack was discovered using this named pipes thing, and it was

594
00:32:18,430 --> 00:32:22,419
leveraged by a hacking group to infect a major airline with ransomware,

595
00:32:22,840 --> 00:32:27,010
and later it came out that CrowdStrike would have protected them

596
00:32:27,240 --> 00:32:30,180
if they had been running the newest version of the channel updates.

597
00:32:30,700 --> 00:32:34,590
Stupid CISO decided to stay at n minus one for updates.

598
00:32:35,580 --> 00:32:39,819
Do you think the defense of not running the latest channel updates as

599
00:32:39,820 --> 00:32:43,920
a resiliency strategy would appease litigators and the public at large?

600
00:32:44,790 --> 00:32:45,950
I’m going to go with unlikely.

601
00:32:46,590 --> 00:32:49,320
[laugh] . So, I mean, another point that’s important to note

602
00:32:49,320 --> 00:32:51,730
here is that the kind of patch that came out—or the channel

603
00:32:51,730 --> 00:32:55,620
update—would not have been stopped by an n minus one effect.

604
00:32:56,190 --> 00:32:59,360
N minus one would stop the driver update.

605
00:32:59,879 --> 00:33:03,639
Remember, that’s the part that was signed by Microsoft, and is noted as good.

606
00:33:03,880 --> 00:33:06,270
The actual kernel—or the actual channel update itself

607
00:33:06,490 --> 00:33:08,639
happens automatically, and you can’t do anything about it.

608
00:33:09,490 --> 00:33:12,670
There is… I was reading through some Reddit posts, and some

609
00:33:12,679 --> 00:33:16,620
people did say that there is a way to run a little behind

610
00:33:17,650 --> 00:33:21,550
the channel updates, so to postpone them by certain periods.

611
00:33:21,790 --> 00:33:24,389
There is a way to run, kind of like, n minus one for the

612
00:33:24,390 --> 00:33:28,149
channel updates, but there’s an inherent risk in doing that.

613
00:33:28,500 --> 00:33:29,760
Yeah, like you said, it’s certainly not the

614
00:33:29,770 --> 00:33:32,150
sort of thing that a CISO is going to encourage.

615
00:33:32,849 --> 00:33:33,209
Right.

616
00:33:33,259 --> 00:33:37,100
And there’s also a regulatory hurdle with that, too, because there may be

617
00:33:37,130 --> 00:33:41,370
compliance and regulations that say you have to be running the latest version.

618
00:33:41,890 --> 00:33:46,790
So really, it’s just a rational decision based on balancing priorities and

619
00:33:46,790 --> 00:33:51,390
political realities, and trying to protect your customers as best you can.

620
00:33:52,299 --> 00:33:54,540
So, the blame ultimately should reside on

621
00:33:54,540 --> 00:33:55,735
CrowdStrike for putting out a floud update—floud?

622
00:33:55,735 --> 00:33:56,040
Flawed.

623
00:33:58,150 --> 00:33:58,370
Floud.

624
00:33:58,450 --> 00:33:59,210
Words.

625
00:33:59,890 --> 00:34:00,400
I love them.

626
00:34:00,840 --> 00:34:02,030
I like floud, actually.

627
00:34:02,849 --> 00:34:04,580
It’s like loud, but with an F.

628
00:34:04,840 --> 00:34:05,470
It’s floud.

629
00:34:05,470 --> 00:34:07,250
It’s like a flan that has opinions.

630
00:34:07,710 --> 00:34:08,559
An opinionated flan.

631
00:34:09,540 --> 00:34:09,989
I like it.

632
00:34:10,779 --> 00:34:14,659
Let’s talk about solutions [laugh] . The reason macOS hasn’t encountered

633
00:34:14,659 --> 00:34:20,100
a similar fate as the Linux and Windows installations is that Apple

634
00:34:20,630 --> 00:34:25,160
doesn’t let CrowdStrike—or really anything else—run in kernel mode.

635
00:34:25,599 --> 00:34:29,219
Starting in macOS 10.15—I didn’t look at the codename,

636
00:34:29,600 --> 00:34:34,639
so please forgive me—Apple offered System Extensions.

637
00:34:35,260 --> 00:34:38,190
These allow an application to stay in user mode while

638
00:34:38,190 --> 00:34:41,819
requesting special access to hardware managed by the kernel.

639
00:34:42,370 --> 00:34:47,379
At the same time, Apple phased out Kernel Extensions—often

640
00:34:47,529 --> 00:34:50,429
shortened to kext, or [pronounced] kext, I guess—

641
00:34:50,560 --> 00:34:52,020
Yeah, it’s pronounced, unfortunately.

642
00:34:52,840 --> 00:34:56,020
[sigh] . They phased those out starting in macOS 11.

643
00:34:56,599 --> 00:35:00,150
So basically, CrowdStrike doesn’t run in kernel mode on

644
00:35:00,160 --> 00:35:04,390
macOS, and thusly, it cannot crash macOS the same way.

645
00:35:05,480 --> 00:35:08,135
I don’t know no Mac, so I don’t know about any of that [laugh]

646
00:35:08,400 --> 00:35:09,100
No, it’s true.

647
00:35:09,100 --> 00:35:11,950
And for a while, it was extremely annoying because a lot of programs

648
00:35:11,950 --> 00:35:15,319
relied on kexts for similar reasons: to have instant access.

649
00:35:15,320 --> 00:35:18,370
Like, a good example is if you have an external audio

650
00:35:18,370 --> 00:35:23,190
device and you want that to work as fast—as efficiently as

651
00:35:23,190 --> 00:35:25,569
possible, you would want it to work and run in kernel mode.

652
00:35:25,840 --> 00:35:26,150
Right.

653
00:35:26,590 --> 00:35:29,120
So, there are actually ways to get around the

654
00:35:29,120 --> 00:35:32,270
security that you just talked about in macOS.

655
00:35:32,760 --> 00:35:35,190
I don’t recommend it, but it is doable.

656
00:35:36,020 --> 00:35:38,489
And the whole point here is that you have this little secret

657
00:35:38,500 --> 00:35:41,400
enclave, effectively, where things run in this sort of in-between

658
00:35:41,400 --> 00:35:44,600
mode—sandbox, if you will—which we’re going to go into in a second.

659
00:35:45,160 --> 00:35:48,340
But if it crashes there, it doesn’t take down the operating system.

660
00:35:48,940 --> 00:35:49,240
Right.

661
00:35:49,820 --> 00:35:53,680
And Linux actually has a similar option with eBPF,

662
00:35:53,830 --> 00:35:57,340
which I struggle to say because it’s awkward.

663
00:35:57,639 --> 00:35:59,699
And apparently, it’s no longer an acronym.

664
00:36:00,469 --> 00:36:01,660
It’s just its own thing.

665
00:36:02,220 --> 00:36:03,950
So… that’s weird.

666
00:36:04,500 --> 00:36:08,370
eBPF lets applications load into a sandboxed

667
00:36:08,760 --> 00:36:10,930
secure kernel execution environment.

668
00:36:11,320 --> 00:36:15,490
So, once again, gives them kernel-level access to resources, while applying

669
00:36:15,580 --> 00:36:19,480
stringent safety checks to make sure the application doesn’t crash the system.

670
00:36:20,210 --> 00:36:25,200
CrowdStrike now offers running Falcon in user mode on Linux—what

671
00:36:25,200 --> 00:36:29,629
they call user mode—which actually uses eBPF under the covers.

672
00:36:30,440 --> 00:36:33,500
If you were running in that mode, those previous crashes

673
00:36:33,780 --> 00:36:37,470
that happened with Red Hat, and Debian, and—what was

674
00:36:37,490 --> 00:36:40,750
it?—Rocky Linux, you would not have been affected by those.

675
00:36:41,460 --> 00:36:43,420
I mean, CrowdStrike would—Falcon still would have

676
00:36:43,420 --> 00:36:44,950
crashed, but it wouldn’t have crashed your system.

677
00:36:45,880 --> 00:36:46,659
Which is better.

678
00:36:47,020 --> 00:36:47,630
I think so.

679
00:36:48,330 --> 00:36:52,850
Windows has some similar functionality available.

680
00:36:53,180 --> 00:36:56,600
There’s the Windows Filtering Platform, Windows Defender

681
00:36:56,620 --> 00:37:00,705
Application Control, and Windows Defender Device Guard, all of

682
00:37:00,720 --> 00:37:05,159
which have APIs, but none of them have the same mechanisms present

683
00:37:05,940 --> 00:37:10,680
that, like, System Extensions for macOS or eBPF for Linux have.

684
00:37:11,020 --> 00:37:15,220
So, they provide an API that applications could be rewritten to take

685
00:37:15,220 --> 00:37:20,630
advantage of and get, you know, almost kernel levels of access and speed,

686
00:37:21,370 --> 00:37:26,720
but they’re not the same as this, sort of, sandbox, secured enclave.

687
00:37:27,610 --> 00:37:31,330
There is a project to port eBPF over to Windows, for what it’s worth.

688
00:37:31,830 --> 00:37:35,940
I don’t know if that will be the ultimate solution, but this catastrophic

689
00:37:35,940 --> 00:37:39,970
calamity should at least prompt Microsoft to try something similar.

690
00:37:40,790 --> 00:37:43,350
I have heard some folks—and we could call this a technical

691
00:37:43,350 --> 00:37:45,220
solution—I’ve heard some folks say that you just shouldn’t

692
00:37:45,230 --> 00:37:47,270
be running Windows in most of these environments.

693
00:37:47,900 --> 00:37:49,920
Like… you’re not wrong.

694
00:37:50,700 --> 00:37:55,250
If I could wave a magic wand and turn back time, if I could find a way, Chris—

695
00:37:56,440 --> 00:37:56,799
Stop it.

696
00:37:57,099 --> 00:37:59,400
—I would take back all the Windows that hurt

697
00:37:59,400 --> 00:38:02,450
you and replace them with Linux variants.

698
00:38:03,280 --> 00:38:04,270
Okay, it doesn’t rhyme.

699
00:38:04,920 --> 00:38:06,019
[laugh] . It’s the best I could do.

700
00:38:06,940 --> 00:38:09,600
If you’re out there, and you’re building a net-new system,

701
00:38:10,040 --> 00:38:13,910
that’s, like, an end-user terminal, an IoT device, or even a

702
00:38:13,910 --> 00:38:18,270
server running in the cloud, I think anything but Windows is your

703
00:38:18,280 --> 00:38:22,020
best bet, and it would probably be malpractice to do otherwise.

704
00:38:22,830 --> 00:38:25,500
But like it or not, Windows remains the most popular desktop

705
00:38:25,520 --> 00:38:28,830
operating system, and that doesn’t appear to be changing anytime soon.

706
00:38:29,500 --> 00:38:33,499
We need a short-term plan to make things better—through some sort

707
00:38:33,500 --> 00:38:38,080
of update—and a long-term plan to ditch Windows for most use cases.

708
00:38:38,799 --> 00:38:39,279
Thoughts?

709
00:38:40,080 --> 00:38:45,230
So, a lot of in—I’ll put, ‘in my opinion’ around all of this—

710
00:38:45,650 --> 00:38:45,890
Right.

711
00:38:46,220 --> 00:38:51,730
A lot of this comes down to the never-ending battle between speed and

712
00:38:51,730 --> 00:38:58,020
security, and making assumptions that things are just going to work.

713
00:38:59,320 --> 00:39:03,620
After all, like we said, they’ve done multiple channel updates

714
00:39:03,620 --> 00:39:07,319
a day for years and years and years and years and years, and

715
00:39:07,320 --> 00:39:10,159
while they’ve had a few issues in the past, it’s not very many.

716
00:39:11,130 --> 00:39:14,010
This is the sort of thing that leads developers—and, you know,

717
00:39:14,020 --> 00:39:17,839
engineering teams—to have a false sense of security, and a false sense

718
00:39:17,840 --> 00:39:20,110
that everything they do is golden, and they will never have a problem.

719
00:39:20,980 --> 00:39:25,300
Therefore, checks get skipped, checks get removed from the process—because

720
00:39:25,300 --> 00:39:29,350
after all, they’re just slowing us down—and that’s a huge issue.

721
00:39:30,050 --> 00:39:34,630
The other issue is, when you push everything out all at once, the problem

722
00:39:34,779 --> 00:39:38,560
can occur—like it did this time—that everything will crash all at once.

723
00:39:40,060 --> 00:39:44,250
There needs to be some type of a fuzzed deployment.

724
00:39:44,599 --> 00:39:46,839
So, let’s just say these things get released

725
00:39:46,840 --> 00:39:48,779
on a schedule, I don’t know, every four hours.

726
00:39:49,759 --> 00:39:51,740
You get a customer that’s got a hundred servers.

727
00:39:52,410 --> 00:39:57,250
Those servers should get that update five minutes apart in, like, groups of 30.

728
00:39:58,140 --> 00:40:01,120
That way, if there is a catastrophic failure, it

729
00:40:01,120 --> 00:40:03,940
only takes down a percentage of your platform.

730
00:40:04,460 --> 00:40:06,580
Now, that’ll happen for every single customer on Earth, and

731
00:40:06,580 --> 00:40:10,369
that’s not great, but the assumption is, and should be, that

732
00:40:10,369 --> 00:40:13,560
there is high availability built into this, so if half your

733
00:40:13,560 --> 00:40:16,620
systems go down, theoretically, the other half can carry the load.

734
00:40:17,260 --> 00:40:17,570
Mmm.

735
00:40:18,470 --> 00:40:21,759
Yeah, and that’s something that CrowdStrike could change today.

736
00:40:22,010 --> 00:40:23,470
That’s within their realm of control.

737
00:40:23,470 --> 00:40:26,850
Yeah, and I suspect that they will [laugh] . Because the only

738
00:40:26,850 --> 00:40:29,540
other option—if this is the situation—is people are going to end

739
00:40:29,540 --> 00:40:32,470
up with half of their environment running one antivirus solution,

740
00:40:32,480 --> 00:40:34,370
and the other half of their environment running another one.

741
00:40:35,020 --> 00:40:35,850
That seems worse.

742
00:40:35,930 --> 00:40:39,070
It’s just as insane as running—insane and difficult

743
00:40:39,080 --> 00:40:40,960
to manage as running in a multi-cloud environment.

744
00:40:41,240 --> 00:40:42,669
Or a super cloud, as some might say.

745
00:40:43,449 --> 00:40:43,469
Ugh.

746
00:40:43,780 --> 00:40:44,280
I hate you.

747
00:40:44,900 --> 00:40:47,099
Yeah, these are all technical solutions.

748
00:40:47,590 --> 00:40:50,700
I don’t know if there are any policy solutions, but my biggest

749
00:40:50,700 --> 00:40:54,750
concern coming out of all of this is that regulators and

750
00:40:54,750 --> 00:40:58,579
litigators are going to get into a hubbub and pass some poorly

751
00:40:58,590 --> 00:41:02,970
thought-out legislation that makes things effectively worse.

752
00:41:03,660 --> 00:41:05,739
I can’t quite figure out how they would make

753
00:41:05,740 --> 00:41:08,170
things worse, but I am excited to see them try.

754
00:41:08,180 --> 00:41:11,990
[laugh] . They’re nothing if not creative.

755
00:41:12,780 --> 00:41:14,230
Well, hey, thanks for listening or something.

756
00:41:14,230 --> 00:41:16,130
I guess you found it worthwhile enough if you made it all

757
00:41:16,130 --> 00:41:18,280
the way to the end, so congratulations to you, friend.

758
00:41:18,480 --> 00:41:19,730
You accomplished something today.

759
00:41:19,740 --> 00:41:23,229
Now, you can go sit on the couch, update your CrowdStrike channel

760
00:41:23,230 --> 00:41:26,990
file, and watch everything crash in beautiful synchronicity.

761
00:41:27,140 --> 00:41:27,820
You’ve earned it.

762
00:41:28,500 --> 00:41:30,840
You can find more about this show by visiting our LinkedIn page,

763
00:41:30,840 --> 00:41:34,240
just search ‘Chaos Lever,’ or go to our website, chaoslever.com.

764
00:41:34,520 --> 00:41:36,460
You’ll find show notes, blog posts, and general tomfoolery.

765
00:41:37,280 --> 00:41:39,620
And if we got something wrong, or you have strong opinions

766
00:41:39,620 --> 00:41:42,200
about what CrowdStrike should have done, leave us a comment.

767
00:41:42,379 --> 00:41:43,290
Leave us a voicemail.

768
00:41:43,410 --> 00:41:45,400
We might even listen to it.

769
00:41:45,400 --> 00:41:47,920
We’ll be back next week to see what fresh hell is upon us.

770
00:41:48,240 --> 00:41:48,950
Ta-ta for now.

771
00:41:57,200 --> 00:41:57,790
What a mess.

772
00:41:58,440 --> 00:41:58,770
Mmm.

773
00:41:59,400 --> 00:42:00,359
A glorious mess.