1
00:00:00,000 --> 00:00:04,019
Ben Hartshorne: For all of these
dependencies, there are clearly

2
00:00:04,019 --> 00:00:10,020
several who have built their system
with this challenge in mind and have

3
00:00:10,080 --> 00:00:12,660
a series of different fallbacks.

4
00:00:13,200 --> 00:00:16,170
Uh, I'll, I'll give you the story
of, um, uh, we used LaunchDarkly

5
00:00:16,170 --> 00:00:17,100
for our feature flagging.

6
00:00:17,460 --> 00:00:19,050
Their service was also impacted yesterday.

7
00:00:19,050 --> 00:00:22,230
One would think, oh, we need our
feature flags in order to boot up.

8
00:00:23,070 --> 00:00:27,390
Well, their SDK is built with
the idea that you set your

9
00:00:27,390 --> 00:00:28,650
feature flag defaults in code.

10
00:00:28,995 --> 00:00:31,215
And if we can't reach our service,
we'll go ahead and use those.

11
00:00:32,145 --> 00:00:33,795
And if we can reach our service, great.

12
00:00:33,915 --> 00:00:34,635
We'll update them.

13
00:00:35,145 --> 00:00:37,035
And if we can update
them once, that's great.

14
00:00:37,065 --> 00:00:39,255
If we can connect to the
streaming service even better.

15
00:00:40,455 --> 00:00:44,085
And I, I think they also have, uh,
some, some more, uh, bridging in

16
00:00:44,085 --> 00:00:47,385
there, but we don't use, uh, the,
the more complicated infrastructure.

17
00:00:47,655 --> 00:00:53,535
But this idea that they designed the
system with the expectation that, in the

18
00:00:53,535 --> 00:00:57,465
event of a service unavailability,
things will continue to work,

19
00:00:58,665 --> 00:01:01,155
made the recovery process
all that much better.

20
00:01:01,665 --> 00:01:06,225
And, uh, even when, when, uh, their
service was unavailable and ours

21
00:01:06,225 --> 00:01:10,905
was still running, uh, the SDK
still answers questions in code for

22
00:01:10,905 --> 00:01:12,255
the status of all of these flags.

23
00:01:12,735 --> 00:01:14,925
It doesn't say, oh, I, I
can't reach my upstream.

24
00:01:14,925 --> 00:01:16,575
Suddenly, I can't give
you an answer anymore.

25
00:01:16,785 --> 00:01:20,955
No, the SDK is built with that idea
of local caching so that it can

26
00:01:21,015 --> 00:01:23,745
continue to serve the correct answer.

27
00:01:24,630 --> 00:01:27,300
As far as it knew, from whenever
it lost its connection.

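The fallback ladder Ben describes (defaults compiled into code, upgraded when the service is reachable, with last-known-good values cached locally) can be sketched roughly like this. A minimal illustration only, not the actual LaunchDarkly SDK API; every name here is made up.

```python
# Minimal sketch of the fallback pattern described above: flag defaults live
# in code, and the client keeps serving the last values it successfully
# fetched once the flag service becomes unreachable. Illustrative names only;
# this is not the real LaunchDarkly SDK API.

class FlagClient:
    def __init__(self, defaults, fetch_remote):
        self.defaults = dict(defaults)    # flag defaults set in code
        self.cache = {}                   # last-known-good values from service
        self.fetch_remote = fetch_remote  # callable returning {flag: value}

    def refresh(self):
        """Try to pull fresh flags; on failure, keep serving cached values."""
        try:
            self.cache.update(self.fetch_remote())
            return True
        except ConnectionError:
            return False  # service unreachable: continue with what we know

    def variation(self, key):
        """Answer from the cache if we ever synced, else the in-code default."""
        if key in self.cache:
            return self.cache[key]
        return self.defaults[key]
```

Booting while the service is down just means `refresh()` fails and `variation()` keeps answering from the in-code defaults; once one sync succeeds, the cached values survive any later outage.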
28
00:01:32,160 --> 00:01:33,840
Corey: Welcome to Screaming in the Cloud.

29
00:01:34,080 --> 00:01:35,220
I'm Cory Quinn.

30
00:01:35,400 --> 00:01:39,240
My guest today is one of those folks
that I am disappointed I have not

31
00:01:39,240 --> 00:01:43,289
had on the show until now, just
because I assumed I already had.

32
00:01:43,590 --> 00:01:48,330
Ben Hartshorne is a principal engineer at
Honeycomb, but oh, so much more than that.

33
00:01:48,539 --> 00:01:50,520
Ben, thank you for deigning to join us.

34
00:01:50,789 --> 00:01:52,050
Ben Hartshorne: It's lovely
to be here this morning.

35
00:01:52,440 --> 00:01:55,530
Corey: This episode is sponsored
in part by my day job, Duckbill.

36
00:01:55,530 --> 00:01:58,384
Do you have a horrifying AWS bill?

37
00:01:59,009 --> 00:02:00,869
That can mean a lot of things.

38
00:02:01,110 --> 00:02:06,179
Predicting what it's going to be,
determining what it should be, negotiating

39
00:02:06,179 --> 00:02:11,609
your next long-term contract with AWS,
or just figuring out why it increasingly

40
00:02:11,609 --> 00:02:16,140
resembles a phone number, but nobody
seems to quite know why that is.

41
00:02:16,440 --> 00:02:20,010
To learn more, visit duckbillhq.com.

42
00:02:20,310 --> 00:02:23,190
Remember, you can't duck the Duckbill

43
00:02:23,220 --> 00:02:28,620
Bill, which my CEO reliably informs
me is absolutely not our slogan.

44
00:02:29,025 --> 00:02:35,174
So you gave a talk, uh, about roughly
a month ago, uh, at the inaugural

45
00:02:35,265 --> 00:02:38,144
FinOps, uh, meetup in San Francisco.

46
00:02:39,015 --> 00:02:40,215
Give us the high level.

47
00:02:40,215 --> 00:02:40,945
What did you talk about?

48
00:02:41,520 --> 00:02:43,710
Ben Hartshorne: Well, I got
to talk about two stories.

49
00:02:43,830 --> 00:02:44,970
Um, I love telling stories.

50
00:02:45,210 --> 00:02:49,680
I got to talk about two stories of
how we used Honeycomb and instrumentation

51
00:02:49,950 --> 00:02:51,780
to help optimize our cloud

52
00:02:51,780 --> 00:02:55,680
spending, a topic near and dear to your
heart, uh, is what brought me there.

53
00:02:56,070 --> 00:03:01,410
We gotta look at the overall bill
and say, Hey, what, where are some

54
00:03:01,410 --> 00:03:02,610
of the big things coming from?

55
00:03:02,910 --> 00:03:05,940
Obviously it's people sending
us data and people asking us

56
00:03:05,940 --> 00:03:07,079
questions about those data.

57
00:03:07,920 --> 00:03:09,930
Corey: And if they would just
stop both of those things, your

58
00:03:09,930 --> 00:03:11,250
bill would be so much better.

59
00:03:11,640 --> 00:03:12,990
Ben Hartshorne: It would
be so much smaller.

60
00:03:13,380 --> 00:03:15,390
Um, so would my salary, unfortunately.

61
00:03:15,900 --> 00:03:22,240
Um, so we wanted to reduce some of
those costs, but, uh, it, it's a, it's

62
00:03:22,240 --> 00:03:26,430
a problem that that's hard to get into
just from like a, a general perspective.

63
00:03:26,430 --> 00:03:28,680
You need to really get in and
look at all the details to find

64
00:03:28,680 --> 00:03:29,970
out what you're gonna change.

65
00:03:30,480 --> 00:03:32,550
So, uh, I got to tell two stories.

66
00:03:33,255 --> 00:03:35,025
Uh, reducing costs.

67
00:03:35,355 --> 00:03:40,665
One by switching from AMD to,
uh, Arm architecture for Amazon.

68
00:03:40,665 --> 00:03:42,765
That's the Graviton chipset,
which is fantastic.

69
00:03:43,155 --> 00:03:47,235
Uh, and the other was about the
amazing power of spreadsheets.

70
00:03:49,455 --> 00:03:52,395
As much as I love graphs,
I also love spreadsheets.

71
00:03:53,040 --> 00:03:54,420
I, I'm sorry.

72
00:03:54,630 --> 00:03:55,980
It's a personal failing.

73
00:03:55,980 --> 00:03:56,430
Perhaps.

74
00:03:56,670 --> 00:04:00,510
Corey: It's wild to me how many tools
out there do all kinds of business

75
00:04:00,510 --> 00:04:04,530
adjacent things, but somehow never bother
to realize that if you can just export

76
00:04:04,530 --> 00:04:09,330
a CSV, suddenly you're speaking kind
of the language of your ultimate user.

77
00:04:09,540 --> 00:04:13,500
Play with pandas a little bit more
and spit out an actual Excel file,

78
00:04:13,500 --> 00:04:14,640
and now you're cooking with gas.

79
00:04:14,645 --> 00:04:14,965
Mm-hmm.

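The CSV point is easy to demonstrate. A minimal sketch using only the standard library; the pandas/Excel route Corey mentions would be roughly `pandas.DataFrame(rows).to_excel("costs.xlsx")` instead, assuming pandas plus an Excel engine such as openpyxl is installed. The row data here is invented for illustration.

```python
# Illustrative sketch: dump cost rows as CSV text so they open directly in a
# spreadsheet, speaking "the language of your ultimate user."
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text, header row first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical example rows, not real billing data.
rows = [
    {"service": "Lambda", "usd": 120.50},
    {"service": "S3", "usd": 42.00},
]
print(rows_to_csv(rows, ["service", "usd"]))
```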
80
00:04:16,320 --> 00:04:19,950
Ben Hartshorne: So, uh, the, the second
story is about doing that with Honeycomb.

81
00:04:20,310 --> 00:04:23,969
Taking, uh, a number of different
graphs and looking at, um, five

82
00:04:23,969 --> 00:04:29,070
different attributes of our Lambda,
uh, costs and what was going into

83
00:04:29,070 --> 00:04:33,719
them, and making changes across all
of them in order to, uh, accomplish

84
00:04:33,719 --> 00:04:36,060
an overall cost reduction of about 50%.

85
00:04:36,599 --> 00:04:37,469
Uh, which is really great.

86
00:04:37,860 --> 00:04:42,599
So the, the story, uh, it does
combine my love of graphs, because we

87
00:04:42,599 --> 00:04:44,159
got to see the three lines go down.

88
00:04:44,580 --> 00:04:48,990
Um, the power of spreadsheets
and also this idea that

89
00:04:49,530 --> 00:04:55,560
you can't just look for one answer
to find the, uh, solution to your

90
00:04:55,620 --> 00:04:58,169
problems around, well, anything really.

91
00:04:58,500 --> 00:05:00,510
Uh, but especially around reducing costs.

92
00:05:00,990 --> 00:05:03,870
It's going to be a bunch of
small things that you can put

93
00:05:03,870 --> 00:05:05,400
together, uh, into one place.

94
00:05:06,000 --> 00:05:09,359
Corey: There's a, there's a lot that's
valuable when we start going down that

95
00:05:09,359 --> 00:05:11,969
particular path of starting to look at

96
00:05:12,719 --> 00:05:15,810
things through a, a lens of a
particular kind of data that

97
00:05:15,810 --> 00:05:17,280
you otherwise wouldn't think to.

98
00:05:17,429 --> 00:05:21,929
I, I remain, I maintain that you
remain the only customer we have

99
00:05:21,929 --> 00:05:28,679
found so far that uses Honeycomb to
completely instrument their AWS bill.

100
00:05:28,859 --> 00:05:31,919
Uh, we had not seen that before or since.

101
00:05:32,099 --> 00:05:34,650
It, it makes sense for
you to do it that way.

102
00:05:34,650 --> 00:05:35,460
Absolutely.

103
00:05:36,150 --> 00:05:38,130
It's a bit of a heavy lift for,

104
00:05:38,565 --> 00:05:40,125
shall we say, everyone else.

105
00:05:40,755 --> 00:05:44,655
Ben Hartshorne: Uh, and it, it actually
is a, a bit of a lift for, for us too.

106
00:05:44,655 --> 00:05:49,604
To say we've instrumented the entire bill,
uh, is a, a wonderful thing to, to assert.

107
00:05:49,695 --> 00:05:55,215
And, uh, as we've talked about, we,
we use the power of spreadsheets too.

108
00:05:55,784 --> 00:06:00,195
So there are some aspects, uh, there,
there's some aspects of our

109
00:06:00,195 --> 00:06:03,885
AWS spending, and actually the
really dominant ones, uh, that

110
00:06:04,664 --> 00:06:06,765
lend themselves very easily to, uh,

111
00:06:08,310 --> 00:06:09,000
using Honeycomb.

112
00:06:09,360 --> 00:06:14,400
Um, the best example is Lambda because
Lambda is, uh, charged on a per

113
00:06:14,400 --> 00:06:21,690
millisecond basis and our instrumentation
is collecting spans, traces about your

114
00:06:21,690 --> 00:06:23,610
compute on a per millisecond basis.

115
00:06:23,910 --> 00:06:27,270
There's a very easy translation
there, and so we can get really good

116
00:06:27,270 --> 00:06:30,990
insight into which customers are
spending how much, or rather, which

117
00:06:30,995 --> 00:06:32,825
customers are causing us to spend

118
00:06:32,885 --> 00:06:33,225
how much to

119
00:06:34,815 --> 00:06:39,945
provide our product to them and, uh,
understand how that, how we can balance

120
00:06:39,945 --> 00:06:45,495
our, uh, development resources to both
provide new features and also, uh,

121
00:06:45,495 --> 00:06:49,440
understand when we need to shift and, uh,
spend our attention managing costs.

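The per-millisecond mapping Ben describes reduces to simple arithmetic over span durations. The sketch below is illustrative: the rates are placeholder figures, not current AWS Lambda pricing, and the span shape is invented for the example.

```python
# Back-of-the-envelope version of the mapping described above: span duration
# in milliseconds -> billed GB-seconds -> dollars, rolled up per customer.
# Rates are illustrative placeholders, not current AWS Lambda pricing.
GB_SECOND_RATE = 0.0000166667      # $ per GB-second (example figure)
REQUEST_RATE = 0.20 / 1_000_000    # $ per invocation (example figure)

def invocation_cost(duration_ms, memory_mb):
    """Cost of one invocation, from its duration and configured memory."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * GB_SECOND_RATE + REQUEST_RATE

def cost_per_customer(spans):
    """spans: iterable of (customer_id, duration_ms, memory_mb) tuples."""
    totals = {}
    for customer, duration_ms, memory_mb in spans:
        totals[customer] = totals.get(customer, 0.0) + invocation_cost(
            duration_ms, memory_mb
        )
    return totals
```

Rolling the same per-span attributes up by customer is what turns "how long did this run" telemetry into "which customers are causing us to spend how much."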
122
00:06:50,520 --> 00:06:54,450
Corey: There's a continuum here, and
I think that it tends to follow a

123
00:06:54,450 --> 00:06:59,160
lot around company ethos and company
culture here, where folks have

124
00:06:59,280 --> 00:07:04,110
varying degrees of insight into the
factors that drive their cloud spend.

125
00:07:04,560 --> 00:07:05,230
Uh, you are

126
00:07:05,700 --> 00:07:10,560
clearly an observability company. You
have been observing your AWS bill

127
00:07:10,560 --> 00:07:14,700
for, I would argue longer than it
would've made sense to on some level.

128
00:07:14,700 --> 00:07:18,270
In the very early days you were
doing this and your AWS bill was

129
00:07:18,270 --> 00:07:23,370
not the, the, the limiting factor to
your company's success back in those

130
00:07:23,370 --> 00:07:24,870
days, but, but you did grow into it.

131
00:07:25,349 --> 00:07:28,560
Other folks, even at very large
enterprise scale, more or less,

132
00:07:28,560 --> 00:07:30,300
do this based on vibes.

133
00:07:31,025 --> 00:07:34,805
Most folks, I think, tend to fall
somewhere in the middle of this,

134
00:07:34,805 --> 00:07:36,545
but, but it's not evenly distributed.

135
00:07:36,695 --> 00:07:39,185
Some teams tend to have a
very deep insight into what

136
00:07:39,185 --> 00:07:40,685
they're doing, and others are,

137
00:07:40,685 --> 00:07:42,485
"Amazon bill? You mean the books?"

138
00:07:42,724 --> 00:07:46,265
It, it, again, most tend to
fall somewhere in the center of that.

139
00:07:46,265 --> 00:07:47,914
It's, it's a law of large numbers.

140
00:07:47,914 --> 00:07:50,255
Everything starts to revert to
a mean, past a certain point.

141
00:07:51,455 --> 00:07:53,585
Ben Hartshorne: Well, I mean, you, you
wouldn't have a job if, if they didn't

142
00:07:53,585 --> 00:07:55,534
make it a bit of a challenge to do so.

143
00:07:56,190 --> 00:07:59,250
Corey: Or I might have a better
job depending, but we'll see.

144
00:07:59,489 --> 00:08:02,729
Uh, I, I do wanna detour a little bit
here because as we record this, it is the

145
00:08:02,729 --> 00:08:07,140
day after AWS's big significant outage.

146
00:08:07,140 --> 00:08:11,280
I could really mess with the conspiracy
theorists and say it is their first

147
00:08:11,280 --> 00:08:15,150
major outage of, uh, October of 2025.

148
00:08:15,539 --> 00:08:17,159
Uh, and then people are
like, wait, what do you mean?

149
00:08:17,370 --> 00:08:18,359
What do you mean this is World War?

150
00:08:18,780 --> 00:08:22,049
I like same type of approach, like,
but these things do tend to cluster.

151
00:08:22,320 --> 00:08:24,299
Uh, how was your day yesterday?

152
00:08:25,305 --> 00:08:27,345
Ben Hartshorne: Uh, well,
it did start very early.

153
00:08:27,855 --> 00:08:33,855
Um, uh, our, our service, uh, has,
has presence in multiple regions.

154
00:08:34,275 --> 00:08:40,485
Uh, but we do have our, our main, uh,
US instance in, in Amazon's US East 1.

155
00:08:40,965 --> 00:08:46,555
And so as, uh, things stopped working, uh,
a lot of our service stopped working too.

156
00:08:48,104 --> 00:08:52,305
I mean, the, the outage was, was
significant, but wasn't, uh, pervasive.

157
00:08:52,545 --> 00:08:56,594
There were still some things that
kept functioning, and amazingly,

158
00:08:56,805 --> 00:09:01,844
we actually preserved all of the
customer telemetry that made it

159
00:09:01,844 --> 00:09:03,584
to our front door successfully.

160
00:09:04,185 --> 00:09:06,584
Uh, which is a big deal
because we hate dropping data.

161
00:09:06,944 --> 00:09:09,974
Corey: Yeah, it's, that took some
work in engineering and, and I have to

162
00:09:09,974 --> 00:09:11,265
imagine this was also not an accident.

163
00:09:11,775 --> 00:09:12,645
Ben Hartshorne: It was not an accident.

164
00:09:13,125 --> 00:09:16,185
Now their ability to query
that data during the outage

165
00:09:17,355 --> 00:09:17,685
suffered.

166
00:09:18,255 --> 00:09:20,415
Corey: I, I'm gonna push back on
you on that for a second there.

167
00:09:20,444 --> 00:09:24,525
When AWS's US East 1, where
you have a significant workload

168
00:09:24,525 --> 00:09:30,285
is impacted to this degree, how
important is, uh, observability?

169
00:09:30,680 --> 00:09:32,564
I, I know that when I've
dealt with outages in the

170
00:09:32,655 --> 00:09:33,375
past.

171
00:09:33,405 --> 00:09:36,525
There's, uh, the first thing you try
and figure out is, is it my shitty,

172
00:09:36,525 --> 00:09:38,685
shitty code or is it a global issue?

173
00:09:38,714 --> 00:09:39,464
That's important.

174
00:09:39,584 --> 00:09:43,155
And once you establish it's a global
issue, then you can begin, uh, the

175
00:09:43,155 --> 00:09:45,135
mitigation part of that process.

176
00:09:45,314 --> 00:09:48,464
And yes, observability becomes
extraordinarily important there for

177
00:09:48,464 --> 00:09:50,115
some things, but for others, it's...

178
00:09:50,865 --> 00:09:54,074
There, there's also, at least with
the cloud being as big as it is now,

179
00:09:54,255 --> 00:09:58,365
there's some reputational headline,
uh, risk protection here in that

180
00:09:58,545 --> 00:10:01,545
no one is talking about your site
going down in some weird ways.

181
00:10:01,545 --> 00:10:05,145
Yesterday everyone's talking
about AWS going down, like they

182
00:10:05,145 --> 00:10:06,555
own the reputation of this.

183
00:10:06,735 --> 00:10:07,005
Yeah,

184
00:10:08,324 --> 00:10:09,015
Ben Hartshorne: that's true.

185
00:10:09,464 --> 00:10:15,314
Um, and also when a business's
customers are asking them,

186
00:10:15,825 --> 00:10:17,355
which parts of your service are working?

187
00:10:17,565 --> 00:10:21,495
I know AWS is having a thing,
uh, how bad is it affecting you?

188
00:10:21,675 --> 00:10:23,535
You wanna be able to
give them a solid answer.

189
00:10:24,285 --> 00:10:28,185
So our customers were asking us
yesterday, Hey, are you dropping our data?

190
00:10:28,995 --> 00:10:32,775
And we wanted to be able to give them a,
a reasonable answer even in the moment.

191
00:10:32,955 --> 00:10:37,095
So yes, the, we, we're
able to deflect a certain,

192
00:10:39,020 --> 00:10:40,635
the, the reputational harm.

193
00:10:40,785 --> 00:10:44,324
But at the same time, there are people
that have come back and said, well, I

194
00:10:44,324 --> 00:10:45,615
mean, shouldn't you have done better?

195
00:10:45,915 --> 00:10:48,765
It's important for us to be able
to rebuild our business and to

196
00:10:48,824 --> 00:10:52,094
to move region to region, and we
need you to help us do that too.

197
00:10:52,425 --> 00:10:53,145
Corey: Oh, absolutely.

198
00:10:53,145 --> 00:10:56,235
And I, I actually encountered a lot of
this yesterday when I, uh, early in the

199
00:10:56,235 --> 00:11:01,365
morning tried to get a, uh, what was it, a
Halloween costume, and Amazon's site was not

200
00:11:01,365 --> 00:11:03,704
working properly for some strange reason.

201
00:11:03,885 --> 00:11:05,040
Now, if I read some of the

202
00:11:05,895 --> 00:11:09,194
relatively out-of-touch analyses
in the mainstream press,

203
00:11:09,435 --> 00:11:12,165
uh, that's billions and
billions of dollars lost.

204
00:11:12,165 --> 00:11:16,035
Therefore, I either went to go get
a Halloween costume from another

205
00:11:16,035 --> 00:11:19,485
vendor, or I will never wear
a Halloween costume this year.

206
00:11:19,635 --> 00:11:21,225
Better luck in 2026.

207
00:11:21,765 --> 00:11:23,985
Neither of those is necessarily true,

208
00:11:24,255 --> 00:11:24,915
Ben Hartshorne: and that's really

209
00:11:25,469 --> 00:11:29,819
exactly why we, we were focused
on preserving, successfully storing

210
00:11:29,819 --> 00:11:31,199
our customers' data in the moment.

211
00:11:31,709 --> 00:11:34,949
Because then when the, uh, when the
time comes afterwards, they're like,

212
00:11:34,949 --> 00:11:37,920
okay, now we, we, we said what we
said in the time, in the moment.

213
00:11:38,189 --> 00:11:40,319
Now they're asking us,
okay, what really happened?

214
00:11:40,709 --> 00:11:44,729
Uh, that data is invaluable in
helping our customers piece together

215
00:11:44,969 --> 00:11:47,579
which parts of their services
were working and which weren't,

216
00:11:48,150 --> 00:11:48,959
at what times.

217
00:11:49,380 --> 00:11:52,829
Corey: Did you see a drop in,
uh, telemetry during the outage?

218
00:11:53,370 --> 00:11:54,270
Ben Hartshorne: Yep, for sure.

219
00:11:54,660 --> 00:11:56,910
Corey: Is that because people's systems
were down, or is that because their

220
00:11:56,910 --> 00:11:58,200
systems could not communicate out?

221
00:11:58,380 --> 00:11:58,890
Ben Hartshorne: Both.

222
00:11:59,760 --> 00:12:00,210
Corey: Excellent.

223
00:12:01,740 --> 00:12:05,460
Ben Hartshorne: Uh, we did get some
reports of, uh, from our customers that

224
00:12:05,730 --> 00:12:10,740
their, uh, specifically the OpenTelemetry
Collector that was, uh, gathering the

225
00:12:10,740 --> 00:12:15,180
data from their application was unable
to successfully send it to Honeycomb.

226
00:12:15,810 --> 00:12:18,990
Uh, at the same time we
were not rejecting it.

227
00:12:19,515 --> 00:12:24,344
So clearly there were challenges in
the, the path between those two things.

228
00:12:24,735 --> 00:12:28,875
Uh, whether that was in AWS's network or
in some other network unable to get to AWS,

229
00:12:28,875 --> 00:12:30,015
I, I dunno.

230
00:12:30,194 --> 00:12:35,145
So, uh, we definitely saw there
were issues of reachability.

231
00:12:35,714 --> 00:12:39,615
Uh, and so undoubtedly there was some
data drop there that's completely out

232
00:12:39,615 --> 00:12:43,844
of our... so the, the only part we could
say is once the data got to us, we

233
00:12:43,844 --> 00:12:45,135
were able to successfully store it.

234
00:12:45,525 --> 00:12:48,380
So, um, the question is, uh, was it

235
00:12:49,110 --> 00:12:50,850
customers' apps going down?

236
00:12:51,704 --> 00:12:57,285
Uh, absolutely many of our customers were
down, and they were unable to send us any

237
00:12:57,285 --> 00:12:59,055
telemetry because their app was offline.

238
00:12:59,564 --> 00:13:03,824
Uh, but the, the other side is also
true that the ones that were up

239
00:13:04,094 --> 00:13:07,854
were having trouble getting to us
because of our location in US East.

240
00:13:08,645 --> 00:13:11,615
Corey: Now to continue reading what
the mainstream press had to say about

241
00:13:11,645 --> 00:13:17,225
this, does that mean that you are now
actively considering evacuating AWS

242
00:13:17,225 --> 00:13:21,875
entirely to go to a different provider
that can be more reliable, probably

243
00:13:21,875 --> 00:13:23,135
building your own data centers.

244
00:13:23,465 --> 00:13:25,055
Ben Hartshorne: Yeah, you know,
I've, I've heard people say

245
00:13:25,055 --> 00:13:26,405
that's the thing to do these days.

246
00:13:26,944 --> 00:13:29,319
Now, I, I have helped build
data centers in the past.

247
00:13:30,375 --> 00:13:32,594
Corey: As have I, there's a
reason that both of us have a

248
00:13:32,594 --> 00:13:33,974
job that does not involve that.

249
00:13:34,454 --> 00:13:37,964
Ben Hartshorne: There is. Uh, the data
centers I built were not as reliable as

250
00:13:37,964 --> 00:13:42,645
any of the data centers that are available
from our, our big public cloud providers.

251
00:13:42,750 --> 00:13:45,194
Corey: I, I would've said, unless you
worked at one of those companies building

252
00:13:45,194 --> 00:13:48,104
the data centers, and even back then,
given the time you've been at Honeycomb,

253
00:13:48,104 --> 00:13:51,584
I can say with certainty, you are
not as good at running data centers as

254
00:13:51,584 --> 00:13:53,925
they are because effectively no one is.

255
00:13:53,925 --> 00:13:56,685
This is something that you get to
learn about at significant scale.

256
00:13:56,835 --> 00:13:59,685
The concern is, I see it as one of
consolidation, but I've seen too

257
00:13:59,685 --> 00:14:04,095
many folks try and go multi-cloud
for resilience reasons, and all

258
00:14:04,095 --> 00:14:06,345
they've done is they've added a
second single point of failure.

259
00:14:06,345 --> 00:14:09,045
So now they're exposed to everyone's
outage, and when that happens, their

260
00:14:09,045 --> 00:14:13,185
site continues to fall down in different
ways as opposed to being more resilient,

261
00:14:13,215 --> 00:14:16,665
which is a hell of a lot more than
just picking multiple providers.

262
00:14:17,205 --> 00:14:19,365
Ben Hartshorne: But there is
something to say though of looking

263
00:14:19,365 --> 00:14:21,405
at a business and saying, okay, what.

264
00:14:22,845 --> 00:14:26,865
What is the cost for us to be, you
know, single region versus what is

265
00:14:26,865 --> 00:14:31,455
the cost to be fully, uh, you know,
multi-region where we can fail over

266
00:14:31,455 --> 00:14:33,435
in an instant and nobody notices?

267
00:14:34,005 --> 00:14:35,955
Uh, those cost differences are huge.

268
00:14:36,630 --> 00:14:38,130
And for most businesses

269
00:14:38,490 --> 00:14:40,200
Corey: Of course, it's
a massive investment.

270
00:14:40,200 --> 00:14:40,830
At least 10x.

271
00:14:41,040 --> 00:14:41,220
Ben Hartshorne: Yeah.

272
00:14:41,640 --> 00:14:44,040
So for most businesses
you're not gonna go that far.

273
00:14:44,400 --> 00:14:47,820
Corey: My, my newsletter publication
is entirely bound within US West

274
00:14:47,820 --> 00:14:51,300
2, because if that goes down... that,
that just happened to be for latency

275
00:14:51,300 --> 00:14:52,830
purposes, not reliability reasons.

276
00:14:52,980 --> 00:14:55,530
But if the region is hard down and
I need to send an email newsletter

277
00:14:55,530 --> 00:14:58,470
and it's down for several days,
I'm writing that one by hand.

278
00:14:58,500 --> 00:15:00,420
'cause I've got a different
story to tell that week.

279
00:15:00,420 --> 00:15:02,550
I don't need it to do
the business-as-usual

280
00:15:03,005 --> 00:15:03,125
thing.

281
00:15:03,305 --> 00:15:06,875
And that that's a reflection of
architecture and investment decisions

282
00:15:07,055 --> 00:15:08,675
reflecting the reality of my business.

283
00:15:08,855 --> 00:15:09,035
Ben Hartshorne: Yes.

284
00:15:09,335 --> 00:15:10,955
And that's, that's exactly where to start.

285
00:15:11,585 --> 00:15:15,395
And there are things you can do within
a region to increase a little bit

286
00:15:15,395 --> 00:15:19,205
of resilience to certain services
within that region suffering.

287
00:15:19,865 --> 00:15:25,085
So, um, as an example, uh, uh, I don't
remember how many years ago it was, uh,

288
00:15:25,085 --> 00:15:29,135
but, uh, Amazon had an outage in KMS,
the, uh, the Key Management Service.

289
00:15:29,675 --> 00:15:32,465
And that basically made everything stop.

290
00:15:33,150 --> 00:15:35,400
Uh, you can probably find
out exactly when it happened.

291
00:15:35,640 --> 00:15:36,810
Corey: Yes, I'm pulling that up now.

292
00:15:36,810 --> 00:15:37,560
Please continue.

293
00:15:37,560 --> 00:15:38,130
I'm curious.

294
00:15:38,130 --> 00:15:38,460
Now

295
00:15:38,640 --> 00:15:42,329
Ben Hartshorne: they provide a really
easy way to replicate all of your

296
00:15:42,329 --> 00:15:47,850
keys to another region and a pretty
easy way to fail over accessing those

297
00:15:47,850 --> 00:15:49,110
keys from one region to another.

298
00:15:49,680 --> 00:15:52,920
So even if you're not gonna be
fully multi-region, you can insulate

299
00:15:52,920 --> 00:15:56,490
against individual services that
might have an incident and prevent

300
00:15:56,550 --> 00:15:59,910
those individual services from having an
outsized impact on your application.

301
00:16:00,690 --> 00:16:04,110
You know, we don't need their keys
most of the time, but when you

302
00:16:04,110 --> 00:16:07,170
do need them, you kind of need
them to start your application.

303
00:16:07,170 --> 00:16:10,080
So if you need to scale up or do
something like that and it's not

304
00:16:10,080 --> 00:16:12,600
available, you're really out of luck.

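The KMS pattern Ben describes (replicate keys to a second region, fall back when the primary is unreachable) has roughly this shape. In real code each callable would wrap a regional KMS client's decrypt call against an AWS multi-region key; here the decrypters are injected as plain callables so the sketch runs without AWS, and the region names are just examples.

```python
# Sketch of the regional fail-over idea described above: with a key
# replicated to a second region, try the primary region first and fall back
# to the replica. Decrypters are injected as plain callables so this runs
# without AWS; in real code each would wrap a regional KMS client's decrypt.

def decrypt_with_failover(ciphertext, regional_decrypters):
    """regional_decrypters: ordered {region_name: callable(bytes) -> bytes}."""
    last_err = None
    for region, decrypt in regional_decrypters.items():
        try:
            return region, decrypt(ciphertext)   # first region that works wins
        except ConnectionError as err:           # region or service unreachable
            last_err = err
    raise RuntimeError("decryption failed in every region") from last_err
```

AWS multi-region KMS keys share key material across regions, which is what makes the replica interchangeable with the primary; this wrapper only supplies the try-in-order fail-over around that.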
305
00:16:13,290 --> 00:16:18,210
So I, the, the thing is, I, I don't
wanna advocate that people try and go

306
00:16:18,210 --> 00:16:21,990
fully multi-region, but that's not to
say that we abdicate all responsibility

307
00:16:21,990 --> 00:16:24,065
for insulating our application from

308
00:16:24,795 --> 00:16:27,525
having transient outages
in our dependencies.

309
00:16:27,945 --> 00:16:28,095
Corey: Yeah.

310
00:16:28,150 --> 00:16:31,005
To, to be clear, they did not do
a formal writeup on the KMS issue

311
00:16:31,005 --> 00:16:38,055
on their basically kind of not
terrific, uh, list of, uh, uh, list

312
00:16:38,085 --> 00:16:40,635
of, um, post-event summaries.

313
00:16:40,815 --> 00:16:42,855
It's, things have to be sort
of noisy for that to hit.

314
00:16:43,755 --> 00:16:46,965
I'm sure yesterday's will wind up on
that list once they have, uh, they'll have

315
00:16:47,205 --> 00:16:48,795
had that up before this thing publishes.

316
00:16:49,065 --> 00:16:51,285
But yeah, they did not
put the KMS issue there.

317
00:16:51,465 --> 00:16:52,335
You're completely correct.

318
00:16:53,145 --> 00:16:56,985
It's a, this is the sort of thing
of what is, what is the ba, what is

319
00:16:56,985 --> 00:16:58,365
the blast radius of these issues?

320
00:16:58,725 --> 00:17:04,305
And I, I think that there's this
sense that before we went to the

321
00:17:04,305 --> 00:17:07,305
cloud, everything was more reliable,
but just the opposite is true.

322
00:17:07,665 --> 00:17:10,665
Uh, the difference was, is that if
we were all building our data centers

323
00:17:10,665 --> 00:17:14,025
today, my shitty stuff at Duckbill
is down as it is every, you know,

324
00:17:14,085 --> 00:17:16,905
every random Tuesday and tomorrow.

325
00:17:16,970 --> 00:17:20,120
Honeycomb is down because, oops,
it turns out you once again have

326
00:17:20,120 --> 00:17:22,010
forgotten to replace a bad hard drive.

327
00:17:22,339 --> 00:17:22,790
Cool.

328
00:17:23,300 --> 00:17:24,710
But those are not all happening

329
00:17:24,710 --> 00:17:27,680
at the same time. When you start
with the centralization story,

330
00:17:27,740 --> 00:17:31,970
suddenly a disproportionate swath
of the world is down simultaneously,

331
00:17:31,970 --> 00:17:33,440
and that's where things get weird.

332
00:17:33,830 --> 00:17:37,010
It gets even harder though because
you can test your durability and

333
00:17:37,010 --> 00:17:38,870
your resilience as much as you want.

334
00:17:39,639 --> 00:17:42,459
But it doesn't impact, it doesn't
account for the, the challenge of third

335
00:17:42,459 --> 00:17:44,500
party providers on your critical path.

336
00:17:45,730 --> 00:17:49,060
You're, you obviously need to make
sure that, in order for Honeycomb

337
00:17:49,060 --> 00:17:51,730
to work, Honeycomb itself has to be up.

338
00:17:51,850 --> 00:17:52,929
That's sort of step one.

339
00:17:53,139 --> 00:17:57,490
But to do that, AWS itself has
to be up in certain places.

340
00:17:57,790 --> 00:17:59,320
What other vendors factor into this?

341
00:17:59,409 --> 00:18:00,790
Ben Hartshorne: You know, that
was, I think, the most interesting

342
00:18:00,790 --> 00:18:04,300
part of yesterday's challenge,
bringing the service back up.

343
00:18:04,990 --> 00:18:05,435
Uh, is that

344
00:18:06,195 --> 00:18:09,885
we do rely on an incredible
number of other services.

345
00:18:10,215 --> 00:18:14,085
Uh, there's some list of all of
our vendors that is hundreds long.

346
00:18:14,445 --> 00:18:16,335
Now those are obviously very
different parts of the business.

347
00:18:16,335 --> 00:18:20,385
They involve, uh, you know, companies we
contract with for marketing outreach and

348
00:18:20,385 --> 00:18:22,350
for, uh, business and for all of that.

349
00:18:22,940 --> 00:18:23,210
Corey: Right.

350
00:18:23,210 --> 00:18:26,930
We use Dropbox here, and if Dropbox is
down, uh, it, that doesn't necessarily

351
00:18:26,930 --> 00:18:30,770
impact our ability to wind up serving
our customers, but it does mean I need

352
00:18:30,770 --> 00:18:34,700
to find a different way, for
example, to get the recorded file from

353
00:18:34,700 --> 00:18:36,980
this podcast over to my editing team.

354
00:18:37,070 --> 00:18:39,290
Ben Hartshorne: Yeah, so there's,
there's the very long list.

355
00:18:39,850 --> 00:18:42,820
And then there's the much, much
shorter list of vendors that are

356
00:18:42,820 --> 00:18:46,330
really in the critical path, and
we have a bunch of those too.

357
00:18:46,720 --> 00:18:51,189
Um, we use, uh, uh, vendors for
feature flagging and for sending

358
00:18:51,189 --> 00:18:57,250
email and uh, for, um, uh, some, some
other, uh, forms of telemetry that,

359
00:18:57,340 --> 00:18:58,750
that are destined for other spots.

360
00:19:00,550 --> 00:19:05,110
For the most part, when we get that
many vendors all relying on each other,

361
00:19:06,090 --> 00:19:07,380
they're all down at once.

362
00:19:07,650 --> 00:19:10,410
There's this bootstrapping problem
where they're all trying to come back,

363
00:19:10,410 --> 00:19:13,260
but they all sort of rely on each other
in order to come back successfully.

364
00:19:13,830 --> 00:19:17,190
And I think that's part of what
made yesterday morning's, uh, outage

365
00:19:18,240 --> 00:19:24,270
move from, uh, roughly what, like
midnight to 3:00 AM Pacific all the way

366
00:19:24,270 --> 00:19:29,160
through the rest of the day and, and
still have issues, uh, with, with some

367
00:19:29,160 --> 00:19:32,040
companies up until, uh, five, six, 7:00 PM.

368
00:19:32,400 --> 00:19:35,970
Corey: This episode is sponsored
by my own company, Duckbill,

369
00:19:36,210 --> 00:19:38,125
Having trouble with your AWS bill?

370
00:19:38,415 --> 00:19:41,865
Perhaps it's time to renegotiate
a contract with them.

371
00:19:42,195 --> 00:19:47,595
Maybe you're just wondering how to predict
what's going on in the wide world of AWS.

372
00:19:47,655 --> 00:19:50,295
Well, that's where
Duckbill comes in to help.

373
00:19:50,504 --> 00:19:53,235
Remember, you can't duck the Duckbill

374
00:19:53,235 --> 00:19:56,835
Bill, which I am reliably
informed by my business partner

375
00:19:56,955 --> 00:19:59,445
is absolutely not our motto.

376
00:19:59,685 --> 00:20:02,770
To learn more, visit duckbillhq.com.

377
00:20:03,585 --> 00:20:07,425
The, the Google SRE book talked
about this, oh geez, when was it?

378
00:20:07,455 --> 00:20:08,504
15 years ago now.

379
00:20:08,534 --> 00:20:09,794
Damn near that.

380
00:20:09,825 --> 00:20:13,814
Uh, that at some point when a service
goes down and then it starts to recover,

381
00:20:13,995 --> 00:20:17,715
everything that depends on it will
often basically pummel it back into

382
00:20:17,715 --> 00:20:19,905
submission, trying to talk to the thing.
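
The retry stampede described here is the classic thundering-herd problem; the standard mitigation from the SRE literature is capped exponential backoff with jitter, so a recovering service sees a spread-out trickle of retries rather than a synchronized wall of them. A minimal Python sketch — the helper and its parameters are illustrative, not taken from any particular client library:

```python
import random
import time

def call_with_backoff(request, max_attempts=6, base=0.5, cap=30.0):
    """Retry a flaky call with capped exponential backoff plus full jitter.

    The random sleep desynchronizes clients, so a service coming back up
    is not pummeled back into submission by everyone retrying at once."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Full jitter: sleep a random amount up to the capped backoff.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```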

383
00:20:20,235 --> 00:20:24,945
It's a, like I remember back when I worked
at, uh, as a senior systems engineer at

384
00:20:24,945 --> 00:20:28,185
Media Temple in the days before GoDaddy
bought and then ultimately killed them.

385
00:20:28,514 --> 00:20:28,845
Uh.

386
00:20:29,030 --> 00:20:31,940
They, they, I was touring the
data center my first week.

387
00:20:31,940 --> 00:20:34,040
We had, uh, we had three
different facilities.

388
00:20:34,040 --> 00:20:36,200
I was in one of them and
I asked, okay, great.

389
00:20:36,200 --> 00:20:39,350
I just trip over things and hit
the emergency power off switch.

390
00:20:39,350 --> 00:20:39,770
Great.

391
00:20:39,890 --> 00:20:41,330
And kill the entire data center.

392
00:20:41,630 --> 00:20:43,760
There's an order that you
have to bring things back up.

393
00:20:43,760 --> 00:20:47,000
In the event of those catastrophic
outages, is there a runbook?

394
00:20:47,000 --> 00:20:48,380
And of course there was great.

395
00:20:48,380 --> 00:20:48,740
Where is it?

396
00:20:48,770 --> 00:20:49,790
Oh, it's in Confluence.

397
00:20:49,880 --> 00:20:50,300
Terrific.

398
00:20:50,300 --> 00:20:50,810
Where's that?

399
00:20:50,840 --> 00:20:52,340
Oh, in the rack over there.

400
00:20:52,939 --> 00:20:55,610
And I looked at the data center manager,
and she, she was delightful and

401
00:20:55,610 --> 00:20:57,979
incredibly on point, and she
knew exactly where I was going.

402
00:20:58,310 --> 00:20:59,719
We're gonna print that out right now.

403
00:21:00,050 --> 00:21:01,399
Excellent, excellent.

404
00:21:01,399 --> 00:21:01,820
Like that.

405
00:21:01,850 --> 00:21:02,840
That's why you ask.

406
00:21:02,840 --> 00:21:05,899
It's, it's someone who has never seen it
before, but knows how these things work,

407
00:21:05,899 --> 00:21:09,439
going through that, because you build
dependency on top of dependency and you

408
00:21:09,439 --> 00:21:12,830
never get the luxury of taking a step
back and looking at it with fresh eyes.

409
00:21:13,010 --> 00:21:14,300
But that's what our industry has done.

410
00:21:14,340 --> 00:21:17,790
But you have, you have your vendors that
have their own critical dependencies

411
00:21:18,030 --> 00:21:21,419
that they may or may not have done as
good a job as you have of identifying

412
00:21:21,419 --> 00:21:23,399
those and so on and so forth.

413
00:21:23,399 --> 00:21:26,429
It's the end of a very long chain that
does kind of eat itself at some point.

414
00:21:26,850 --> 00:21:27,030
Ben Hartshorne: Yeah.

415
00:21:27,030 --> 00:21:28,500
There are two things
that that brings to mind.

416
00:21:28,649 --> 00:21:31,860
First, we absolutely saw exactly what
you're describing yesterday in our

417
00:21:31,860 --> 00:21:35,399
traffic patterns, where the, the volume
of incoming traffic would sort of

418
00:21:35,399 --> 00:21:36,629
come along and then it would drop.

419
00:21:36,885 --> 00:21:40,125
As their services went off, and
then it's quiet for a little while,

420
00:21:40,125 --> 00:21:43,275
and then we get this huge spike as
they're trying to like, you know,

421
00:21:43,305 --> 00:21:44,745
bring everything back on all at once.

422
00:21:45,135 --> 00:21:48,254
Uh, thankfully those were sort of
spread out across our customers, so

423
00:21:48,254 --> 00:21:52,815
we didn't have like, just one enormous
spike hit all of our, our servers.

424
00:21:53,205 --> 00:21:56,055
Um, but we did see them on
a, on a per customer basis.

425
00:21:56,060 --> 00:21:57,975
It's, it's a real, very real pattern.

426
00:21:58,485 --> 00:22:03,945
Um, but the second one, for all of
these dependencies, there are clearly

427
00:22:03,945 --> 00:22:05,985
several who have built their system

428
00:22:06,659 --> 00:22:12,750
with this challenge in mind and have
a series of different fallbacks.

429
00:22:13,110 --> 00:22:18,060
Uh, I'll, I'll give
you the story of, um, uh, we used

430
00:22:18,060 --> 00:22:19,530
LaunchDarkly for our feature flagging.

431
00:22:20,550 --> 00:22:22,740
Their service was
also impacted yesterday.

432
00:22:24,600 --> 00:22:27,270
One would think, oh, we need our
feature flags in order to boot up.

433
00:22:28,110 --> 00:22:33,270
Well, their SDK is built with the idea
that you set your feature flag defaults

434
00:22:33,270 --> 00:22:35,915
in code, and if we can't reach our
service, we'll go ahead and use those.

435
00:22:37,169 --> 00:22:38,820
And if we can reach our service, great.

436
00:22:38,940 --> 00:22:39,690
We'll update them.

437
00:22:40,169 --> 00:22:42,060
And if we can update
them once, that's great.

438
00:22:42,090 --> 00:22:44,280
If we can connect to the
streaming service even better.

439
00:22:45,480 --> 00:22:49,320
And I, I think they also have, uh,
some, some more, uh, bridging in there.

440
00:22:49,320 --> 00:22:52,980
But we don't use, uh, the, the
more complicated infrastructure.

441
00:22:53,250 --> 00:22:59,129
But this idea that they designed the
system with the expectation that in the

442
00:22:59,129 --> 00:23:03,030
event of service unavailability,
things will continue to work

443
00:23:04,260 --> 00:23:06,750
made the recovery process
all that much better.

444
00:23:07,260 --> 00:23:11,850
And, uh, even when, when, uh, their
service was unavailable and ours

445
00:23:11,850 --> 00:23:16,470
was still running, uh, the SDK
still answers questions in code for

446
00:23:16,470 --> 00:23:17,850
the status of all of these flags.

447
00:23:18,300 --> 00:23:20,520
It doesn't say, oh, I, I
can't reach my upstream.

448
00:23:20,520 --> 00:23:22,170
Suddenly, I can't give
you an answer anymore.

449
00:23:22,410 --> 00:23:26,550
No, the SDK is built with that idea
of local caching so that it can

450
00:23:26,610 --> 00:23:29,340
continue to serve the correct answer.

451
00:23:30,149 --> 00:23:32,909
So far as it knew from whenever
it lost its connection.

452
00:23:33,030 --> 00:23:37,169
But it means that if, if they have a
transient outage, our stuff doesn't break.
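
The fallback ladder Ben describes — defaults baked into code, then a one-time fetch, then live updates, with the last known values cached through an outage — can be sketched like this (a toy illustration of the pattern, not the actual LaunchDarkly SDK API):

```python
class FlagClient:
    """Toy feature-flag client: boots from in-code defaults, overlays
    whatever it last managed to fetch, and keeps answering from that
    cache whenever the flag service is unreachable."""

    def __init__(self, defaults, fetch):
        self._values = dict(defaults)  # boot-time answers need no network
        self._fetch = fetch            # callable returning {flag: value}

    def refresh(self):
        """Try to pull fresh values; on failure, keep the last known ones."""
        try:
            self._values.update(self._fetch())
            return True
        except ConnectionError:
            return False

    def variation(self, flag, fallback=False):
        """Always answers, even mid-outage, from defaults or cached values."""
        return self._values.get(flag, fallback)
```

The point of the design is that `variation` never touches the network at call time, so a transient vendor outage degrades you to slightly stale flags instead of a failed boot.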

453
00:23:37,830 --> 00:23:42,810
And that kind of design, uh, really,
uh, makes recovering from these like

454
00:23:42,840 --> 00:23:47,520
interdependent outages, uh, feasible
in a way that the, the, uh, the

455
00:23:47,520 --> 00:23:50,185
strict ordering you were describing
just is, is really difficult.

456
00:23:51,075 --> 00:23:54,405
Corey: At least in my case, I, I have
the luxury of knowing these things just

457
00:23:54,405 --> 00:23:58,995
because I'm old and I, I figured this
out before it was SRE common knowledge,

458
00:23:58,995 --> 00:24:02,774
or SRE, was a widely acknowledged thing
where, okay, you have a job server

459
00:24:02,774 --> 00:24:04,845
that runs cron jobs, uh, every day.

460
00:24:05,115 --> 00:24:08,445
And then it, it turns out that, oh,
when you found it missed a cron job.

461
00:24:08,520 --> 00:24:09,270
Oopsy doozy.

462
00:24:09,270 --> 00:24:11,070
That's a problem for some of those things.

463
00:24:11,070 --> 00:24:14,010
So now you start building in error
checking and the rest, and then you

464
00:24:14,010 --> 00:24:17,280
do a restore for three days ago from
backup for that thing, and it suddenly

465
00:24:17,280 --> 00:24:21,150
dinks it, missed all theron jobs and
runs them all, and then hammers some

466
00:24:21,150 --> 00:24:22,980
other system to death when it shouldn't.

467
00:24:22,980 --> 00:24:27,270
And you, you learn iteratively of,
oh, that's kind of a failure mode.

468
00:24:27,450 --> 00:24:29,850
Like when you start externalizing
and hardening APIs, you

469
00:24:29,850 --> 00:24:31,650
build, you learn very quickly.

470
00:24:31,650 --> 00:24:36,210
Everything needs a rate limit, and
you need a way to make bad actors

471
00:24:36,240 --> 00:24:37,830
stop hammering your endpoints.

472
00:24:38,865 --> 00:24:40,575
Not just bad actors, naive ones.

473
00:24:40,755 --> 00:24:43,755
Ben Hartshorne: And, uh, rate limits
are a good, a good example because,

474
00:24:43,875 --> 00:24:48,345
um, uh, that is one of the things that
that did happen, uh, yesterday as people

475
00:24:48,345 --> 00:24:52,000
were coming back, we actually wound
up needing to rate limit ourselves.

476
00:24:53,445 --> 00:24:56,594
We didn't have to rate limit
our customers, but the, because,

477
00:24:56,625 --> 00:24:58,485
uh, so brief digression here.

478
00:24:58,814 --> 00:25:02,024
Um, Honeycomb uses Honeycomb
in order to build Honeycomb.

479
00:25:02,115 --> 00:25:04,784
Uh, we, we are our own
observability vendor.

480
00:25:05,145 --> 00:25:10,544
Uh, now this, this leads to some obvious,
um, uh, challenges in architecture.

481
00:25:11,145 --> 00:25:13,274
Uh, you know, how, how
do we know we're right?

482
00:25:13,665 --> 00:25:16,545
Well, in the beginning we did have
some other services that we'd use to

483
00:25:16,545 --> 00:25:19,305
checkpoint our, our numbers and make sure
that they, they were actually correct.

484
00:25:19,665 --> 00:25:22,845
Uh, but our production instance
sits here and serves our customers

485
00:25:23,145 --> 00:25:26,500
and all of its telemetry goes
into the next one down the chain.

486
00:25:28,140 --> 00:25:31,560
We call that dog food because we are,
uh, you know, the, the whole phrase of

487
00:25:31,560 --> 00:25:34,650
eating your own dog food, uh, drinking
your own champagne is the other,

488
00:25:34,650 --> 00:25:37,050
uh, um, more, more pleasing version.

489
00:25:37,350 --> 00:25:40,200
Um, so the, from our
production, it goes to dog food.

490
00:25:40,200 --> 00:25:40,890
From dog food.

491
00:25:40,890 --> 00:25:42,000
Well, what's dog food made of?

492
00:25:42,000 --> 00:25:43,050
It's made up of, of kibble.

493
00:25:43,140 --> 00:25:45,180
So our third environment is called kibble.

494
00:25:45,360 --> 00:25:49,710
Uh, so the, the dog food telemetry, it
goes into this third environment and

495
00:25:49,710 --> 00:25:52,500
that third environment, well, we need
to know if it's working too, so it

496
00:25:52,500 --> 00:25:54,060
feeds back into our production instance.

497
00:25:54,555 --> 00:25:57,315
Each of these instances,
uh, is emitting telemetry.

498
00:25:57,555 --> 00:26:01,515
Uh, and we have our, um, rate
limiting and our, I'm sorry, our

499
00:26:01,515 --> 00:26:03,225
tail sampling proxy called Refinery.

500
00:26:03,555 --> 00:26:08,745
That, uh, helps us reduce volume so it's
not a, a positively amplifying cycle.

501
00:26:09,525 --> 00:26:15,705
Um, but in this, in this incident
yesterday, uh, we started emitting

502
00:26:15,705 --> 00:26:17,805
logs that we don't normally emit.

503
00:26:18,585 --> 00:26:20,295
These are coming from some of our SDKs,

504
00:26:23,460 --> 00:26:24,180
their services.

505
00:26:24,899 --> 00:26:30,120
And so suddenly we started getting
two or three or four log entries

506
00:26:30,120 --> 00:26:31,680
for every event we were sending.

507
00:26:31,680 --> 00:26:35,909
And, uh, it did get into this
kind of amplifying cycle.

508
00:26:36,629 --> 00:26:39,899
So we, we put, uh, a pretty
heavy rate limit on the kibble

509
00:26:39,899 --> 00:26:43,620
environment in order to squash
that traffic and disrupt the cycle.
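
A blunt per-environment rate limit like the one dropped onto kibble is commonly a token bucket in front of ingest; a minimal sketch of that mechanism (illustrative, not Honeycomb's actual implementation):

```python
import time

class TokenBucket:
    """Admit at most `rate` events per second, with bursts up to `burst`.

    Everything over the limit is shed, which is exactly what you want
    when the goal is to squash traffic and break an amplifying cycle."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: drop the event
```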

510
00:26:43,980 --> 00:26:47,370
Uh, which, which made it
difficult to ensure that was

511
00:26:47,370 --> 00:26:49,020
working correctly. But

512
00:26:49,365 --> 00:26:53,205
it was, and, and that led us to make sure
that, make sure that the production

513
00:26:53,205 --> 00:26:54,225
instance was working all right.

514
00:26:54,645 --> 00:26:57,735
Um, but this idea of rate limits
being a, a critical part of

515
00:26:57,765 --> 00:27:02,415
maintaining an interconnected stack,
uh, in order to, to suppress these

516
00:27:02,415 --> 00:27:04,815
kind of, um, uh, like wavelike

517
00:27:06,270 --> 00:27:10,230
formations, the oscillations that start
growing on each other and amplifying

518
00:27:10,230 --> 00:27:13,890
themselves, uh, can, can take any
infrastructure down and being able

519
00:27:13,890 --> 00:27:17,100
to put in, uh, just the right point,
a little, a couple switches and say,

520
00:27:17,250 --> 00:27:21,240
Nope, suppress that signal, uh, really
made a big difference in our ability

521
00:27:21,240 --> 00:27:23,100
to, to bring back all of the services.

522
00:27:23,195 --> 00:27:23,315
Corey: I,

523
00:27:23,340 --> 00:27:26,610
I want to pivot to one last
topic, but I, we could talk about

524
00:27:26,610 --> 00:27:27,835
this outage for days and hours.

525
00:27:28,605 --> 00:27:32,385
I, but there's, uh, something that you
mentioned you wanted to go into that I

526
00:27:32,385 --> 00:27:37,185
wanted to pick a fight with you over, uh,
was how to get people to instrument their

527
00:27:37,185 --> 00:27:41,534
applications, uh, for observability so
they can understand their applications,

528
00:27:41,534 --> 00:27:43,245
their performance, and the rest.

529
00:27:43,425 --> 00:27:47,115
And I'm gonna go with the easy answer
because it's a pain in the ass.

530
00:27:47,115 --> 00:27:50,445
Ben, have you tried instrumenting
an application that already

531
00:27:50,445 --> 00:27:52,875
exists without having to spend a

532
00:27:52,875 --> 00:27:53,445
week on it?

533
00:27:53,804 --> 00:27:54,345
Ben Hartshorne: I.

534
00:27:58,064 --> 00:27:58,784
you're not wrong.

535
00:27:59,115 --> 00:28:02,175
It's a pain in the ass
and it's getting better.

536
00:28:02,625 --> 00:28:04,274
There's lots of ways to make it better.

537
00:28:04,695 --> 00:28:06,975
Uh, there are packages that
do auto instrumentation.

538
00:28:07,215 --> 00:28:07,784
Corey: Oh yeah, absolutely.

539
00:28:07,784 --> 00:28:08,235
In my case.

540
00:28:08,235 --> 00:28:08,354
Yeah.

541
00:28:08,354 --> 00:28:09,405
It's Claude Code's problem.

542
00:28:09,405 --> 00:28:10,695
Now I'm getting another drink.

543
00:28:10,784 --> 00:28:14,804
Ben Hartshorne: You know, uh, you,
you say that in jest and yet, um,

544
00:28:14,834 --> 00:28:16,965
they are actually getting really good.

545
00:28:17,235 --> 00:28:17,445
Yeah.

546
00:28:17,804 --> 00:28:19,304
Corey: No, that's what I've been doing.

547
00:28:19,304 --> 00:28:20,264
It works super well.

548
00:28:20,264 --> 00:28:22,844
You test it first, obviously, but yeah.

549
00:28:23,565 --> 00:28:24,825
YOLO slammed that into production.

550
00:28:24,885 --> 00:28:25,275
But yeah,

551
00:28:25,575 --> 00:28:27,555
Ben Hartshorne: the, uh, the,
the LLMs are actually getting

552
00:28:27,555 --> 00:28:30,735
pretty good at understanding where
instrumentation can be useful.

553
00:28:30,735 --> 00:28:32,625
I say understanding, I
put that in air quotes.

554
00:28:32,895 --> 00:28:36,885
Uh, they're good at, uh, finding code
that represents a, a good place to,

555
00:28:36,915 --> 00:28:40,215
to put instrumentation and, and adding
it to your code in the right place.

556
00:28:40,555 --> 00:28:42,835
Corey: I need to take another
try one of these days.

557
00:28:42,865 --> 00:28:46,885
Uh, the last time I played with Honeycomb,
I instrumented my home Kubernetes

558
00:28:46,885 --> 00:28:51,085
cluster and I exceeded the limits of
the free tier based on ingest volume

559
00:28:51,145 --> 00:28:52,615
by the second day of every month.

560
00:28:53,215 --> 00:28:58,495
And that led to either you have really
unfair limits, which I don't believe to be

561
00:28:58,525 --> 00:29:03,595
true, or the more insightful question: what
the hell is my Kubernetes cluster doing?

562
00:29:03,595 --> 00:29:04,735
That's that chatty.

563
00:29:05,770 --> 00:29:08,199
So I rebuilt the whole thing
from scratch, so it's time for me

564
00:29:08,199 --> 00:29:09,399
to go back and figure that out.

565
00:29:09,429 --> 00:29:09,699
Ben Hartshorne: Yeah.

566
00:29:09,699 --> 00:29:14,260
So, um, I will say a lot of, a lot
of instrumentation is terrible.

567
00:29:14,980 --> 00:29:21,010
A lot of instrumentation is based on
this idea that every single signal must

568
00:29:21,010 --> 00:29:28,510
be published all the time, and, um,
that that's not relevant to you as a

569
00:29:28,510 --> 00:29:30,070
person running the Kubernetes cluster.

570
00:29:30,689 --> 00:29:35,100
You know, do you need to know
every time, uh, the, the, um, a, a

571
00:29:35,100 --> 00:29:38,429
local pod checks in to see whether
it, uh, needs to be evicted?

572
00:29:38,939 --> 00:29:39,750
No, you don't.

573
00:29:40,169 --> 00:29:44,250
What you're interested in are the,
the types of activities that are

574
00:29:44,250 --> 00:29:47,909
relevant to what you need to do
as an operator of that cluster.

575
00:29:48,179 --> 00:29:49,560
And the same is true of an application.

576
00:29:50,040 --> 00:29:56,250
If you just, you know, put, uh,
uh, in the tracing language, put a

577
00:29:56,250 --> 00:29:58,260
span on every single function call.

578
00:29:58,950 --> 00:30:03,784
You will not have useful traces
because it doesn't map to, uh, a,

579
00:30:04,020 --> 00:30:07,710
a useful way of representing your
user's journey through your product.

580
00:30:08,490 --> 00:30:12,120
So there's definitely some nuance
to getting the right level of

581
00:30:12,120 --> 00:30:17,159
instrumentation, and I think the right
level, it's not a single place, uh, it's

582
00:30:17,159 --> 00:30:20,639
a continuously moving spectrum based
on what you were trying to understand

583
00:30:20,850 --> 00:30:22,350
about what your application is doing.

584
00:30:22,980 --> 00:30:24,600
So, uh, at least at Honeycomb.

585
00:30:25,440 --> 00:30:30,030
We add instrumentation all the time, and
we remove instrumentation all the time

586
00:30:30,960 --> 00:30:35,550
because what's relevant to me now as I'm
building out this feature is different

587
00:30:35,940 --> 00:30:40,379
from what I need to know about that
feature once it is fully built and stable

588
00:30:40,560 --> 00:30:42,600
and running in, in a regular workload.

589
00:30:43,440 --> 00:30:48,720
Um, furthermore, as I'm looking
at a specific problem or question.

590
00:30:48,720 --> 00:30:52,140
I, we talked about, uh, you know, pricing
for Lambdas at the beginning of this.

591
00:30:52,530 --> 00:30:57,300
Um, there was a time when we really
wanted to understand pricing for S3 and

592
00:30:57,390 --> 00:31:00,810
part of our model, it, it's a struggle.

593
00:31:01,020 --> 00:31:04,740
Um, part of our, part of our storage
model is that, uh, we store our customers'

594
00:31:04,740 --> 00:31:07,080
telemetry in S3, in, in many, many

595
00:31:07,295 --> 00:31:07,645
files.

596
00:31:07,645 --> 00:31:11,520
And we put instrumentation around
every single S3 access.

597
00:31:12,125 --> 00:31:16,294
In order to understand both the volume
and the latency of those to, to see

598
00:31:16,294 --> 00:31:19,294
like, okay, should we bundle them
up or resize it like this and how

599
00:31:19,294 --> 00:31:20,675
does that influence SLOs and so on.

600
00:31:21,004 --> 00:31:24,185
And it's incredibly expensive to
do that kind of, uh, experiment.

601
00:31:24,425 --> 00:31:26,524
And it, it's not just
expensive in dollars.

602
00:31:26,885 --> 00:31:30,725
Adding that level of instrumentation
does have an impact on the overall

603
00:31:30,725 --> 00:31:32,345
performance of, of the system.

604
00:31:32,794 --> 00:31:36,845
When you're making 10,000 calls
to S3 and you add a span around

605
00:31:36,845 --> 00:31:39,305
every one, it takes a bit more time.

606
00:31:39,875 --> 00:31:40,325
So.

607
00:31:40,710 --> 00:31:44,340
Once we understood the system well
enough to, to make the change, we wanted

608
00:31:44,340 --> 00:31:45,750
to make, we pulled all that back out.

609
00:31:46,890 --> 00:31:49,950
So, for your Kubernetes cluster,
uh, you know, maybe it's interesting

610
00:31:49,950 --> 00:31:53,760
at the very beginning to, to look
at every single, uh, connection

611
00:31:53,760 --> 00:31:55,350
that any, any process might make.

612
00:31:57,240 --> 00:32:00,210
But if it's your home cluster,
that's not really what you

613
00:32:00,210 --> 00:32:01,710
need to know as an operator.

614
00:32:02,385 --> 00:32:07,395
So finding the right balance there of
instrumentation that lets you fulfill

615
00:32:07,395 --> 00:32:11,055
the needs of the business, that lets
you understand the, the needs of the

616
00:32:11,055 --> 00:32:16,275
operator in order to, uh, best be
able to provide the service that this

617
00:32:16,275 --> 00:32:17,985
business is providing to its customers.

618
00:32:19,125 --> 00:32:22,185
It's a, it's a place somewhere
there in the middle, and you're

619
00:32:22,185 --> 00:32:23,385
gonna need some people to find it,

620
00:32:23,745 --> 00:32:24,465
Corey: and that's

621
00:32:25,215 --> 00:32:26,895
easier said than done for a lot of folks.

622
00:32:27,270 --> 00:32:29,280
But you're right, it is getting
easier to instrument these things.

623
00:32:29,280 --> 00:32:33,660
It is something that is iteratively
getting better all the time, uh, to the

624
00:32:33,660 --> 00:32:37,140
point where now, like this is an area
where AI is surprisingly effective.

625
00:32:37,740 --> 00:32:42,750
It doesn't take a lot to wrap a
function call with a decorator.
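
That wrap-a-function-with-a-decorator step looks roughly like this — a toy stand-in for what real tracing SDKs generate, with an in-memory list playing the role of the exporter:

```python
import functools
import time

SPANS = []  # toy stand-in for a real trace exporter

def traced(fn):
    """Record every call to `fn` as a span: name, duration, any error."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        error = None
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            error = type(exc).__name__
            raise
        finally:
            SPANS.append({
                "name": fn.__name__,
                "duration_s": time.monotonic() - start,
                "error": error,
            })
    return wrapper

@traced
def lookup_user(user_id):
    # Hypothetical handler standing in for real application code.
    return {"id": user_id}
```

The mechanical part really is this small; the judgment call is choosing which functions deserve the decorator.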

626
00:32:42,930 --> 00:32:43,140
Ben Hartshorne: Mm-hmm.

627
00:32:43,875 --> 00:32:46,754
It just takes a lot of doing that
over and over and over again.

628
00:32:47,175 --> 00:32:50,685
You, you do a lot of them and you see
what it looks like and then you see, okay,

629
00:32:50,685 --> 00:32:56,895
which ones of these are actually useful
to me now. And, uh, we want to be

630
00:32:58,245 --> 00:33:01,965
open to that changing and
willing to understand that, uh,

631
00:33:02,115 --> 00:33:03,495
that this is an evolving thing.

632
00:33:03,824 --> 00:33:07,064
And this does actually tie back to
one of the core operating principles

633
00:33:07,155 --> 00:33:14,385
of modern SaaS, uh, architectures, the
ability to deploy your code quickly.

634
00:33:15,405 --> 00:33:19,215
Because if you're in this cycle of
adding instrumentation, of removing

635
00:33:19,215 --> 00:33:20,445
instrumentation, you see a bug.

636
00:33:20,445 --> 00:33:25,185
It has to be easy enough to add a
little bit more data to get insight

637
00:33:25,185 --> 00:33:26,895
into that bug in order to resolve it.

638
00:33:27,555 --> 00:33:31,305
Or nobody's gonna do it, and the
whole business suffers.

639
00:33:31,965 --> 00:33:33,075
What is quickly to you?

640
00:33:34,095 --> 00:33:36,405
uh, in.

641
00:33:38,685 --> 00:33:42,915
Uh, I need to make this change and, uh,
it's visible in my test environment?

642
00:33:43,185 --> 00:33:46,005
A couple of minutes. I need to
make this change and have it

643
00:33:46,005 --> 00:33:47,415
visible running in production?

644
00:33:47,865 --> 00:33:52,995
Um, it depends on like how, how much
the, the, uh, how frequently, how frequent

645
00:33:52,995 --> 00:33:56,385
the bug comes, but I'm, I'm actually
okay with it being about, about an

646
00:33:56,385 --> 00:33:58,605
hour for that kind of, uh, turnaround.

647
00:33:58,935 --> 00:34:01,485
I know a lot of people say you should
have your code running in 15 minutes.

648
00:34:01,875 --> 00:34:02,445
That's great.

649
00:34:02,955 --> 00:34:06,765
Uh, I know that's outta reach for a lot
of people in a lot of industries, so, um.

650
00:34:07,889 --> 00:34:10,469
I'm, I'm not a hardliner
on, on how quickly it has to

651
00:34:10,469 --> 00:34:12,089
be, but it can't be a week.

652
00:34:12,659 --> 00:34:18,299
It can't, it, it can barely, it can't
be a day, just like, you're gonna

653
00:34:18,359 --> 00:34:20,969
wanna do this two or three times
in the course of resolving a bug.

654
00:34:21,299 --> 00:34:26,219
And so if it's something too long,
you're just really pushing out any

655
00:34:26,219 --> 00:34:27,900
ability to respond quickly to a customer.

656
00:34:28,230 --> 00:34:30,929
Corey: I really wanna thank you for taking
the time to speak with me about all this.

657
00:34:31,049 --> 00:34:33,120
If people wanna learn more, where's
the best place for them to go?

658
00:34:33,989 --> 00:34:38,279
Ben Hartshorne: You know, I have,
uh, backed off of almost all of

659
00:34:38,279 --> 00:34:42,029
the platforms on which people carry
on conversations on the internet.

660
00:34:42,239 --> 00:34:42,870
Corey: Everyone

661
00:34:42,870 --> 00:34:43,830
seems to have done this.

662
00:34:44,669 --> 00:34:48,299
Ben Hartshorne: I, I, uh, I,
I did work for Facebook for

663
00:34:48,419 --> 00:34:50,759
two and a half years and, um,

664
00:34:50,940 --> 00:34:51,899
Corey: someday I might forgive you.

665
00:34:52,739 --> 00:34:53,879
Ben Hartshorne: Someday
I might forgive myself.

666
00:34:54,089 --> 00:34:54,600
Um.

667
00:34:58,050 --> 00:35:03,030
Really different environment and, uh, I
could see the allure of the world they're

668
00:35:03,030 --> 00:35:05,010
trying to create and it doesn't match.

669
00:35:05,220 --> 00:35:06,960
Oh, I interviewed there in 2009.

670
00:35:06,960 --> 00:35:08,430
It was, it was incredibly compelling.

671
00:35:08,970 --> 00:35:12,300
Um, it doesn't match the, the view
that I see of the world we're in.

672
00:35:12,750 --> 00:35:17,610
And so, um, uh, I have a, a
presence at, at Honeycomb.

673
00:35:17,700 --> 00:35:22,950
Um, I do have, uh, accounts on
all of the major, um, platforms,

674
00:35:23,250 --> 00:35:24,420
so you can find me there.

675
00:35:24,825 --> 00:35:30,134
Uh, there, there will be links afterwards
I'm sure, but, um, LinkedIn, Bluesky.

676
00:35:30,855 --> 00:35:31,245
I dunno.

677
00:35:31,755 --> 00:35:33,375
GitHub, is that a social
media platform now?

678
00:35:34,035 --> 00:35:34,665
Corey: They wish.

679
00:35:35,384 --> 00:35:36,075
We'll put all this in

680
00:35:36,075 --> 00:35:38,025
the show notes. Problem solved for us.

681
00:35:38,085 --> 00:35:40,095
Thank you so much for taking
the time to speak with me.

682
00:35:40,095 --> 00:35:40,785
I appreciate it.

683
00:35:41,055 --> 00:35:41,745
Ben Hartshorne: It's a real pleasure.

684
00:35:41,895 --> 00:35:42,285
Thank you.

685
00:35:42,585 --> 00:35:45,765
Corey: Ben Hartshorne is the
principal engineer at Honeycomb.

686
00:35:45,855 --> 00:35:48,495
One of them, possibly, they
might have more than one.

687
00:35:48,495 --> 00:35:51,855
Seems to be something you can scale,
unlike my nonsense as Chief Cloud

688
00:35:51,855 --> 00:35:53,325
Economist at the Duckbill Group.

689
00:35:53,805 --> 00:35:55,395
And this is Screaming in the Cloud.

690
00:35:55,740 --> 00:35:58,589
If you've enjoyed this podcast,
please leave a five star review on

691
00:35:58,589 --> 00:36:00,270
your podcast platform of choice.

692
00:36:00,359 --> 00:36:03,600
Whereas if you've hated this podcast,
please leave a five star review on

693
00:36:03,600 --> 00:36:07,500
your podcast platform of choice along
with an insulting comment that won't

694
00:36:07,500 --> 00:36:10,709
work because that platform is down and
not accepting comments at this moment.