1
00:00:00,000 --> 00:00:01,950
Omri Sass: I would say
it's always best effort.

2
00:00:02,009 --> 00:00:05,160
So we learn based on the knowledge
that we have and when we need

3
00:00:05,160 --> 00:00:08,730
to adapt our knowledge, we do
our best effort to adapt it.

4
00:00:08,910 --> 00:00:13,470
Our investment in this area is pretty
good, and like we have people who do

5
00:00:13,470 --> 00:00:16,800
ongoing maintenance and continuously
look at model improvements.

6
00:00:21,600 --> 00:00:23,045
Corey: Welcome to Screaming in the Cloud.

7
00:00:23,620 --> 00:00:29,650
I'm Corey Quinn and I am at long last,
thrilled to notice that something in

8
00:00:29,650 --> 00:00:33,550
this world exists that I've wanted
to exist for a very long time.

9
00:00:33,550 --> 00:00:34,990
But we'll get into that.

10
00:00:35,260 --> 00:00:39,220
Omri Sass has been at Datadog for
something like six years now.

11
00:00:39,280 --> 00:00:40,750
Omri, thank you for joining me.

12
00:00:41,019 --> 00:00:42,250
Omri Sass: Uh, thanks
for having me, Corey.

13
00:00:42,825 --> 00:00:45,945
Corey: This episode is sponsored
in part by my day job, Duckbill.

14
00:00:45,945 --> 00:00:49,125
Do you have a horrifying AWS bill?

15
00:00:49,395 --> 00:00:51,285
That can mean a lot of things.

16
00:00:51,495 --> 00:00:55,515
Predicting what it's going to be,
determining what it should be,

17
00:00:55,755 --> 00:01:00,735
negotiating your next long-term
contract with AWS, or just figuring

18
00:01:00,735 --> 00:01:02,685
out why it increasingly resembles

19
00:01:02,905 --> 00:01:06,535
a phone number, but nobody seems
to quite know why that is.

20
00:01:06,835 --> 00:01:10,405
To learn more, visit duckbillhq.com.

21
00:01:10,705 --> 00:01:13,585
Remember, you can't duck the Duckbill

22
00:01:13,615 --> 00:01:18,985
bill, which my CEO reliably informs
me is absolutely not our slogan.

23
00:01:19,664 --> 00:01:23,054
So you are apparently the
mastermind, as it were, behind

24
00:01:23,054 --> 00:01:25,774
the recently launched updog.ai.

25
00:01:26,024 --> 00:01:28,045
So forgive me, what's Updog?

26
00:01:28,215 --> 00:01:31,365
Omri Sass: I, you know, I was expecting
the conversation to start like that,

27
00:01:31,365 --> 00:01:35,325
and I have to say, uh, it's definitely
not me being the mastermind there.

28
00:01:35,354 --> 00:01:38,715
Uh, I joined, uh, Datadog, like
you said, about six years ago.

29
00:01:39,015 --> 00:01:40,725
This thing has been in the making.

30
00:01:40,755 --> 00:01:43,425
Uh, some folks would say even before that.

31
00:01:43,630 --> 00:01:46,780
There is, I'm happy to share a bit
of the history later, but I joined

32
00:01:46,780 --> 00:01:50,830
the Applied AI group, uh, here at
Datadog a couple of months ago while

33
00:01:50,830 --> 00:01:52,120
this project was already ongoing.

34
00:01:52,120 --> 00:01:54,700
So I do have to give
credit where credit is due.

35
00:01:54,940 --> 00:01:59,860
We have amazing Applied AI folks, like, a
group of data scientists, engineers,

36
00:02:00,205 --> 00:02:03,505
a product manager who's, uh, this has
been his passion for quite a while.

37
00:02:03,595 --> 00:02:06,865
Uh, and so I'm not the mastermind,
I'm just the pretty face

38
00:02:07,165 --> 00:02:08,664
Corey: I like it, and it's
an impressive beard,

39
00:02:08,664 --> 00:02:09,354
I will say.

40
00:02:09,354 --> 00:02:10,045
I, I'm envious.

41
00:02:10,045 --> 00:02:11,485
I can't grow one myself.

42
00:02:11,725 --> 00:02:16,645
Uh, for those who may not be aware
of the beautiful thing that is

43
00:02:16,645 --> 00:02:18,415
Updog, how do you describe it?

44
00:02:18,595 --> 00:02:21,715
Omri Sass: So, Updog is
effectively Downdetector.

45
00:02:21,775 --> 00:02:25,345
So if you're familiar with that, it's
a way of making sure that very common,

46
00:02:25,375 --> 00:02:27,655
uh, SaaS providers are actually up.

47
00:02:28,255 --> 00:02:34,615
But Updog is powered by telemetry of
people who actually use these providers.

48
00:02:34,615 --> 00:02:38,335
It's not a test against all
their APIs or anything like that.

49
00:02:38,665 --> 00:02:43,195
Corey: It, it's also not user-reported,
like Downdetector tends to be.

50
00:02:43,435 --> 00:02:47,180
And I, I have to say it was awfully
considerate of you, uh, given the

51
00:02:47,180 --> 00:02:48,535
day that we're recording this.

52
00:02:48,715 --> 00:02:51,415
Uh, for most of the morning,
Cloudflare has taken significant

53
00:02:51,445 --> 00:02:52,045
outages.

54
00:02:52,045 --> 00:02:55,904
In fact, right now as we speak, there's
a banner at the top of Updog.ai.

55
00:02:56,185 --> 00:02:58,165
Cloudflare is reporting
an outage right now.

56
00:02:58,315 --> 00:03:02,424
Updog is not detecting it, as it is on
API endpoints that are not watched today.

57
00:03:02,545 --> 00:03:06,355
We are working on adding those
API endpoints to our watch list.

58
00:03:06,595 --> 00:03:09,114
Now, this sounds like a no-brainer.

59
00:03:09,114 --> 00:03:13,075
I have been asking various monitoring and
observability companies for this since

60
00:03:13,075 --> 00:03:15,655
I was a young sysadmin, because when

61
00:03:16,130 --> 00:03:19,160
suddenly your website
no longer serves web pages,

62
00:03:19,520 --> 00:03:21,470
Your big question is,
is it my crappy code?

63
00:03:22,320 --> 00:03:25,680
Or is it a global issue
that is affecting everyone?

64
00:03:25,770 --> 00:03:28,380
And it's not a question of whose
throat do I choke here.

65
00:03:28,380 --> 00:03:30,720
It's what do I do to
get this thing back up?

66
00:03:30,930 --> 00:03:34,590
Because if it's a major provider that
has just gone down, there are very few

67
00:03:34,710 --> 00:03:38,670
code changes you are going to make on
your side that'll bring the site back up.

68
00:03:38,820 --> 00:03:41,430
And in fact, you could conceivably
take it from a working state to a

69
00:03:41,430 --> 00:03:42,900
non-working state, and now you have

70
00:03:42,930 --> 00:03:43,740
two problems.

71
00:03:43,980 --> 00:03:47,610
Conversely, if everything else is
reporting fine, maybe look at what's

72
00:03:47,610 --> 00:03:49,860
going on in your specific environment.

73
00:03:50,010 --> 00:03:53,925
For folks who have not lived
in the trenches of 3:00

74
00:03:53,925 --> 00:03:56,580
AM pages, that's a huge deal.

75
00:03:56,580 --> 00:03:58,519
It's effectively bifurcating
the world in two.
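
For anyone who wants the shape of that bifurcation check in code, here is a minimal sketch, assuming a couple of illustrative status URLs and a made-up failure threshold; it captures the "is it me or is it them" question, and is not anything Updog actually runs:

```python
import urllib.request

# Illustrative status endpoints for shared dependencies; a real list
# would come from your own dependency graph, and these URLs are
# assumptions for the sketch.
DEPENDENCIES = {
    "cloudflare": "https://www.cloudflarestatus.com/api/v2/status.json",
    "github": "https://www.githubstatus.com/api/v2/status.json",
}

def looks_global(timeout: float = 3.0) -> bool:
    """Crude bifurcation: if several shared dependencies are
    unreachable at once, suspect a global event before digging
    through your own code."""
    failures = 0
    for name, url in DEPENDENCIES.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status >= 500:
                    failures += 1
        except OSError:  # DNS errors, timeouts, HTTP errors, etc.
            failures += 1
    # Made-up threshold: half of the checked dependencies failing.
    return failures >= max(1, len(DEPENDENCIES) // 2)

if __name__ == "__main__":
    print("suspect a global issue" if looks_global()
          else "look at your own environment first")
```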

76
00:03:59,040 --> 00:04:03,150
Omri Sass: That is precisely right
and I'll say one of the biggest

77
00:04:03,150 --> 00:04:06,900
reasons, uh, that it took so long,
at least for us to get there, is

78
00:04:06,900 --> 00:04:10,380
the understanding that we can't just
test all of these endpoints, right?

79
00:04:10,380 --> 00:04:14,310
There's a reason, like you, uh, mentioned,
Downdetector uses user reports.

80
00:04:15,120 --> 00:04:19,290
If we were to run, uh, synthetic tests,
for example, against all of these

81
00:04:19,290 --> 00:04:23,460
endpoints ourselves, and we
report something as down, we now need

82
00:04:23,460 --> 00:04:27,870
to verify that the thing being down is
the actual website and not our testing

83
00:04:28,385 --> 00:04:29,675
that is no longer correct.

84
00:04:29,855 --> 00:04:33,545
Corey: And it's worse than that, because
you take a look at any, like, take AWS as

85
00:04:33,545 --> 00:04:39,665
I often am forced to do, and you go, oh, okay,
well, is EC2 working in us-east-1? That's

86
00:04:39,665 --> 00:04:42,125
over a hundred data center facilities.

87
00:04:42,305 --> 00:04:43,985
Uh, at that scale,

88
00:04:43,985 --> 00:04:45,725
it's not a question of, is it down?

89
00:04:45,725 --> 00:04:47,255
It's a question of how down is it?

90
00:04:47,525 --> 00:04:51,365
There are, you can have a hundred
customers there and five are saying

91
00:04:51,365 --> 00:04:54,905
things are terrific and five are
saying that they're terrible, and

92
00:04:55,085 --> 00:04:58,685
the rest are at varying degrees
between those two points just because

93
00:04:58,685 --> 00:05:01,924
it's, it's blind people trying
to describe an elephant by touch.

94
00:05:02,255 --> 00:05:03,695
Omri Sass: That is exactly right.

95
00:05:03,784 --> 00:05:07,265
And what you just described
is the realization that we had

96
00:05:07,265 --> 00:05:08,614
about the asymmetry of data.

97
00:05:08,614 --> 00:05:11,854
And I, rest assured, that's probably
the word with the most,

98
00:05:12,065 --> 00:05:13,684
uh, syllables I'm gonna use today.

99
00:05:13,745 --> 00:05:16,745
That's, uh, above my, uh, my IQ grade.

100
00:05:17,370 --> 00:05:21,180
But what you just described is
exactly the realization that we

101
00:05:21,180 --> 00:05:23,820
had about the asymmetry of data.

102
00:05:23,849 --> 00:05:26,640
We have more data than any individual

103
00:05:26,640 --> 00:05:31,500
one of our customers, i.e., we have
all of the blind people touching

104
00:05:31,500 --> 00:05:32,820
the elephant at the same time.

105
00:05:33,870 --> 00:05:36,390
And not needing to describe it, right?

106
00:05:36,390 --> 00:05:42,120
We, we have the sense of touch for all
of these folks, and what we do is

107
00:05:42,120 --> 00:05:47,370
actually look at this data in aggregate
and using it to try and understand whether

108
00:05:47,370 --> 00:05:48,929
all of these endpoints are up or down.

109
00:05:49,260 --> 00:05:51,840
Now let me try to make
that slightly more real.

110
00:05:52,530 --> 00:05:57,630
When we started, uh, going along this
journey, our realization was that when

111
00:05:58,020 --> 00:06:03,599
EC2 is down, that's actually the, the
specific example, uh, when EC2 was down,

112
00:06:03,630 --> 00:06:06,599
the load on our, uh, Watchdog backend

113
00:06:06,599 --> 00:06:12,300
(Watchdog is our machine learning anomaly
detector) increases significantly, because

114
00:06:12,330 --> 00:06:16,860
everyone has a higher error rate and
higher latencies and a drop in throughput.

115
00:06:17,250 --> 00:06:21,599
And so our backend had to compensate
for that, and we saw a surge

116
00:06:21,900 --> 00:06:23,130
in processing power.

117
00:06:23,490 --> 00:06:27,840
So we're like, hey, we're not looking
at customer data, this is purely

118
00:06:27,840 --> 00:06:33,000
within our systems, but something is
definitely going on in the real world.

119
00:06:33,000 --> 00:06:35,730
It's not a byproduct of
anything that we're doing.

120
00:06:35,850 --> 00:06:37,830
It's not tied to any
change that we've made.

121
00:06:37,830 --> 00:06:40,230
It's not anything, our
systems are functioning.

122
00:06:40,710 --> 00:06:45,180
And through investigating that, we
had realized that it's actually tied

123
00:06:45,615 --> 00:06:46,665
to EC2.
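
As a rough illustration of the aggregate signal he is describing, here's a sketch of the "many independent customers spiked at once" idea; the spike factor and quorum are invented for the example, not Datadog's actual model:

```python
from dataclasses import dataclass

@dataclass
class CustomerSeries:
    customer_id: str
    baseline_error_rate: float  # long-run average for this customer
    current_error_rate: float   # most recent window

def provider_looks_down(series: list[CustomerSeries],
                        spike_factor: float = 5.0,
                        quorum: float = 0.3) -> bool:
    """Flag a shared provider only when a meaningful fraction of
    independent customers spike simultaneously, which is what
    separates 'EC2 is down' from one customer's bad deploy."""
    if not series:
        return False
    spiking = sum(
        1 for s in series
        if s.current_error_rate > spike_factor * max(s.baseline_error_rate, 1e-6)
    )
    return spiking / len(series) >= quorum
```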

124
00:06:46,995 --> 00:06:49,725
And then we started figuring
out, wait, what are the most

125
00:06:49,725 --> 00:06:53,325
common things that people rely on
that are observed with Datadog?

126
00:06:53,475 --> 00:06:56,895
And if you look at the, uh,
updog.ai website today,

127
00:06:57,240 --> 00:07:01,380
that's also a really easy way to see
which of their third-party dependencies

128
00:07:01,440 --> 00:07:03,240
people use Datadog to monitor.

129
00:07:03,390 --> 00:07:06,090
'cause it's just the top API
endpoints that we observe.

130
00:07:06,270 --> 00:07:09,000
Corey: Well, I, I'm curious on
some level, like, effectively what

131
00:07:09,000 --> 00:07:12,480
I care about on this today, for
example, there are a bunch of folks

132
00:07:12,480 --> 00:07:14,130
that wound up, and I'm looking at this now.

133
00:07:14,130 --> 00:07:14,670
You, you're right.

134
00:07:14,670 --> 00:07:15,630
You don't have

135
00:07:15,710 --> 00:07:19,969
the Cloudflare endpoints themselves.
But Adyen is first alphabetically.

136
00:07:19,969 --> 00:07:21,440
They took a dip earlier.

137
00:07:21,710 --> 00:07:24,530
Uh, AWS took a little bit of
one, and I'm sure we'll get

138
00:07:24,530 --> 00:07:25,849
back to that in a minute or two.

139
00:07:26,180 --> 00:07:28,280
PayPal was down the drain.

140
00:07:28,310 --> 00:07:30,440
OpenAI had a bunch of issues.

141
00:07:30,620 --> 00:07:33,560
X is probably the worst of all
of these graphs as I look at

142
00:07:33,560 --> 00:07:35,000
that formerly known as Twitter.

143
00:07:35,330 --> 00:07:36,500
And it's great.

144
00:07:36,500 --> 00:07:39,740
This is a high level approach
to, this is what I care about.

145
00:07:39,830 --> 00:07:44,000
I, I almost want to take it one level
beyond this in either direction, where

146
00:07:44,179 --> 00:07:45,349
just give me a traffic light.

147
00:07:45,680 --> 00:07:49,730
Is something globally being messed
up right now, like earlier today

148
00:07:49,730 --> 00:07:53,030
when multiple, multiple services
are all down in the dumps?

149
00:07:53,180 --> 00:07:56,030
That's what I wanna see
at the top, just the, yep.

150
00:07:56,030 --> 00:07:57,440
Things globally are broken.

151
00:07:57,440 --> 00:08:01,400
Maybe it's a routing convergence issue
where a lot of traffic between Oregon

152
00:08:01,400 --> 00:08:03,560
and Virginia no longer routes properly.

153
00:08:03,890 --> 00:08:08,300
Maybe it's that an auth provider is
breaking and everything

154
00:08:08,300 --> 00:08:11,720
is out to lunch at that point.
Like, you almost just wanna see at

155
00:08:11,720 --> 00:08:14,990
a high level with no scroll above
the fold, what's broken right now.
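
To make the traffic-light ask concrete, here's a tiny sketch of one way to collapse per-provider states into a single signal; the three-provider "red" threshold is invented for illustration:

```python
from enum import Enum

class Light(Enum):
    GREEN = "all quiet"
    YELLOW = "isolated trouble"
    RED = "things globally are broken"

def global_traffic_light(provider_is_down: dict[str, bool],
                         red_threshold: int = 3) -> Light:
    """One-glance summary: green when nothing shared is down, yellow
    for an isolated provider, red when several systemically important
    providers fail together (threshold is illustrative)."""
    down = [name for name, is_down in provider_is_down.items() if is_down]
    if not down:
        return Light.GREEN
    return Light.RED if len(down) >= red_threshold else Light.YELLOW
```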

156
00:08:15,255 --> 00:08:17,325
Omri Sass: It's, uh, amazing
that you should say that.

157
00:08:17,415 --> 00:08:19,275
Uh, I know some people who work on this.

158
00:08:19,275 --> 00:08:21,585
I can file the feature
request, uh, for you.

159
00:08:21,705 --> 00:08:22,425
Corey: Oh, honestly, yeah.

160
00:08:22,455 --> 00:08:22,815
Oh yeah.

161
00:08:22,815 --> 00:08:26,145
It was either ask you on social media
or ask you directly here in a scenario

162
00:08:26,145 --> 00:08:27,735
in which you cannot possibly say no.

163
00:08:28,125 --> 00:08:31,095
Omri Sass: Oh, I, I'm, I'm
a mean bastard, but that,

164
00:08:31,095 --> 00:08:33,765
that is, that one's easy for me,
'cause we're already working on it.

165
00:08:33,914 --> 00:08:38,325
Corey: Because then you go the other
direction, where, AWS right now, like each

166
00:08:38,325 --> 00:08:40,365
of these little cards is clickable.

167
00:08:40,605 --> 00:08:44,385
So I click on one and it talks
about different services: Dynamo

168
00:08:44,385 --> 00:08:48,015
DB, Elastic Load Balancing,
Elasticsearch, so on and so forth.

169
00:08:48,375 --> 00:08:51,165
But I'm not seeing, at
least at this level,

170
00:08:51,170 --> 00:08:52,819
uh, here or anywhere else for that matter,

171
00:08:53,000 --> 00:08:54,560
anything that breaks it down by region.

172
00:08:54,740 --> 00:08:58,490
And AWS is very good
at regional isolation.

173
00:08:58,790 --> 00:09:00,229
So I'm of two minds on this.

174
00:09:00,260 --> 00:09:03,020
On the one hand, it's,
well, Stockholm is broken.

175
00:09:03,020 --> 00:09:03,770
What's going on?

176
00:09:03,800 --> 00:09:05,420
Yeah, there are five customers using that.

177
00:09:05,420 --> 00:09:06,020
Give or take.

178
00:09:06,170 --> 00:09:06,920
I exaggerate.

179
00:09:07,280 --> 00:09:07,849
Fine.

180
00:09:07,880 --> 00:09:08,719
It'll be great.

181
00:09:08,900 --> 00:09:10,520
This is a big picture,

182
00:09:10,724 --> 00:09:11,870
what is going on,

183
00:09:11,870 --> 00:09:15,920
regardless of the intricacies of
any given provider's endpoints.

184
00:09:16,375 --> 00:09:18,805
Then on the other, it'd be nice
to have more granularity.

185
00:09:19,015 --> 00:09:20,305
So I can see both sides

186
00:09:20,455 --> 00:09:22,165
Omri Sass: We're working
towards that as well.

187
00:09:22,225 --> 00:09:25,675
Uh, the first release of Updog, which
I think was fortunate timing for

188
00:09:25,675 --> 00:09:29,695
us and for the, uh, for our users
is basically where we are today.

189
00:09:29,695 --> 00:09:32,935
We're investing heavily in improving,
uh, granularity and coverage.

190
00:09:33,115 --> 00:09:37,525
So to your point about, uh, Cloudflare,
uh, we went with the most common APIs.

191
00:09:38,380 --> 00:09:42,460
uh, that people actually look at. If
you look at some of the error messages

192
00:09:42,460 --> 00:09:46,240
that people have been posting in, you
know, on the, uh, the airwaves, formerly

193
00:09:46,240 --> 00:09:49,330
known as Twitter, you'll see that
they mentioned a particular endpoint.

194
00:09:49,330 --> 00:09:53,110
I think it's challenge, uh, Cloudflare,
that one's, uh, as far as I know, not

195
00:09:53,110 --> 00:09:56,710
a documented API, and I'm, I'm sure
I'm gonna be corrected about that,

196
00:09:56,830 --> 00:10:00,819
uh, later, but it's not one that
is commonly observed by our users.

197
00:10:00,819 --> 00:10:02,949
And so when we began this journey,

198
00:10:03,220 --> 00:10:05,170
we had to go off of available telemetry.

199
00:10:05,590 --> 00:10:09,580
Now that it's public, we can take feature
requests, we can learn our lessons, we

200
00:10:09,580 --> 00:10:13,030
can improve everything that we're doing,
which is exactly what we're gonna do.

201
00:10:13,240 --> 00:10:16,720
Cloudflare, in particular, uh,
we're, we're working overtime to

202
00:10:16,720 --> 00:10:17,890
make sure that we account for that.

203
00:10:18,069 --> 00:10:19,810
Corey: Yes, people accuse me of being an

204
00:10:20,095 --> 00:10:24,535
asshole when I say this, but I'm
being completely sincere that I have

205
00:10:24,535 --> 00:10:28,555
trouble using a cloud provider where
I do not understand how it breaks.

206
00:10:28,765 --> 00:10:31,075
And people think, oh, you
think we're gonna go down?

207
00:10:31,075 --> 00:10:32,725
Everything's going to go down.

208
00:10:32,725 --> 00:10:35,185
What I wanna know is, what's
that going to look like?

209
00:10:35,305 --> 00:10:40,675
Until I saw a number of AWS outages, I
did not believe a lot of the assertions

210
00:10:40,675 --> 00:10:45,925
they made about cross, about hard
region separation, because we did see

211
00:10:45,925 --> 00:10:47,815
cascading issues in the early days.

212
00:10:47,905 --> 00:10:49,435
It turns out what those were is,

213
00:10:49,585 --> 00:10:53,155
oh, maybe if you're running everything
in Virginia and it goes down, you're

214
00:10:53,155 --> 00:10:57,595
not the only person trying to spin up
capacity in Oregon at the same time, and

215
00:10:57,595 --> 00:10:59,245
now you have a herd of elephants problem.

216
00:10:59,365 --> 00:11:01,795
Turns out your DR tests
don't account for that.

217
00:11:02,245 --> 00:11:02,725
Omri Sass: Oh, yeah.

218
00:11:02,725 --> 00:11:05,575
And I think that there's a couple
of, uh, famous stories about,

219
00:11:05,875 --> 00:11:09,325
uh, people who realized that they
should do DR ahead of other people

220
00:11:09,325 --> 00:11:11,185
and basically beat the stampede.

221
00:11:11,565 --> 00:11:15,795
So the, their websites were up during
one of these moments, uh, and then

222
00:11:15,795 --> 00:11:18,855
everyone else came up and were like,
Hey, we're, we're out of capacity.

223
00:11:18,855 --> 00:11:19,785
Like, what's going on?

224
00:11:19,785 --> 00:11:22,515
So there's a couple of like famous
stories like that through, uh,

225
00:11:22,545 --> 00:11:26,325
through history and I'll, I'll just
say the example that you just gave,

226
00:11:26,325 --> 00:11:28,455
I think is one that, uh, AWS,

227
00:11:29,160 --> 00:11:32,220
you know, they've told
this story, uh, quite a bit, and

228
00:11:32,220 --> 00:11:34,590
regionality today, I think, makes sense.

229
00:11:34,590 --> 00:11:38,910
But even if you look at the, uh,
us-east-1 outage from a couple of weeks ago,

230
00:11:39,090 --> 00:11:43,380
a lot of the managed services were down,
not necessarily because the service

231
00:11:43,380 --> 00:11:47,670
itself was down, or, uh, you know,
I'm, I'm not involved, I wasn't there.

232
00:11:47,675 --> 00:11:48,835
But my understanding there was

233
00:11:49,280 --> 00:11:53,120
uh, DynamoDB was part of the
services that were directly impacted

234
00:11:53,329 --> 00:11:56,900
and a lot of their other managed
services use Dynamo under the hood.

235
00:11:57,110 --> 00:12:00,500
So the cascading
failure would happen even if

236
00:12:00,500 --> 00:12:01,819
it's not cascading regionally.

237
00:12:01,819 --> 00:12:02,900
It cascades logically

238
00:12:03,255 --> 00:12:06,435
between the different services,
and they have so many of them.

239
00:12:06,525 --> 00:12:08,535
So like, who's to know now?

240
00:12:08,685 --> 00:12:11,595
But we now have this information,
we can use that and learn and

241
00:12:11,595 --> 00:12:13,215
adapt our models based on it.

242
00:12:13,574 --> 00:12:14,805
Corey: What happens?

243
00:12:14,805 --> 00:12:18,135
Uh, again, this is, I'm sure that you're
gonna be nibbled to death by ducks,

244
00:12:18,135 --> 00:12:20,295
and we will get there momentarily, but,

245
00:12:20,580 --> 00:12:24,030
what happens when they, when
that learning becomes invalid?

246
00:12:24,180 --> 00:12:28,770
By which I mean, when we remember
the great S3 apocalypse of 2019, they

247
00:12:28,770 --> 00:12:32,130
completely rebuilt that service under
the hood, which to my understanding

248
00:12:32,130 --> 00:12:36,510
is the third time since it launched
in 2005 that they had done that.

249
00:12:36,900 --> 00:12:39,840
And everything that you've
learned about how it failed back

250
00:12:39,840 --> 00:12:42,360
in 2019 is no longer valid.

251
00:12:43,035 --> 00:12:46,575
Or at least an argument could be made
in good faith that that is the case.

252
00:12:46,635 --> 00:12:49,395
Omri Sass: So for that, I would
say it's always best effort.

253
00:12:49,575 --> 00:12:52,755
So we learn based on the knowledge
that we have and when we need

254
00:12:52,755 --> 00:12:56,325
to adapt our knowledge, we do
our best effort to adapt it.

255
00:12:56,505 --> 00:13:01,035
Our investment in this area is pretty
good and like we have people who do

256
00:13:01,035 --> 00:13:04,395
ongoing maintenance and continuously
look at model improvements.

257
00:13:04,395 --> 00:13:07,365
And so if we do something like
that, hopefully we'll catch it.

258
00:13:07,395 --> 00:13:10,875
And I will say AWS in particular,
but a lot of the providers

259
00:13:10,875 --> 00:13:12,155
that you would see on Updog.

260
00:13:12,795 --> 00:13:14,295
Uh, we have good relationships with them.

261
00:13:14,475 --> 00:13:17,955
We would hope that they come to us and
tell us, Hey, this thing that you're

262
00:13:17,955 --> 00:13:20,685
saying is incorrect, or you need to
update this, let's work on it together.

263
00:13:21,420 --> 00:13:25,829
Corey: Yes, that is what I really want
to get into here because historically

264
00:13:25,890 --> 00:13:31,469
when there have been issues, for a
while I redirected 'gaslighting me' to

265
00:13:31,500 --> 00:13:35,459
the AWS official status page, because
that's what it felt like back in the

266
00:13:35,459 --> 00:13:37,560
days of the green circles everywhere.

267
00:13:37,709 --> 00:13:42,270
I had an inline transformer that
would, that would downgrade things

268
00:13:42,270 --> 00:13:46,380
and be a lot more honest about
how these things worked, because

269
00:13:46,569 --> 00:13:50,500
suddenly it would break in a region and it
would take a while for them to update it.
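
In the spirit of that status-page transformer, a toy sketch; the severity names and their blunter translations are hypothetical:

```python
# Hypothetical mapping for a status-page "downgrader": translate
# official severities into blunter, more pessimistic readings.
HONESTY_FILTER = {
    "operational": "probably fine",
    "informational": "something is quietly on fire",
    "degraded": "it's down",
    "outage": "it's very down",
}

def downgrade(official_severity: str) -> str:
    """Unknown or novel severities are treated with suspicion."""
    return HONESTY_FILTER.get(official_severity.lower(), "assume the worst")
```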

270
00:13:51,069 --> 00:13:51,970
I understand that.

271
00:13:52,000 --> 00:13:55,000
It's a question of how do you
determine the impact of an outage?

272
00:13:55,000 --> 00:13:55,780
How bad is it?

273
00:13:55,780 --> 00:13:58,555
When do you update the global
status page versus not?

274
00:13:59,699 --> 00:14:02,040
And there are political considerations
with this too, I'm not suggesting

275
00:14:02,040 --> 00:14:05,850
otherwise, but it gets back to that
question of, is something on your side

276
00:14:05,850 --> 00:14:08,220
broken or is it on my side that's broken?

277
00:14:08,550 --> 00:14:13,920
And every time this has happened,
they get so salty about Downdetector

278
00:14:13,920 --> 00:14:17,819
and anything like it to the point
where they've had VPs making

279
00:14:18,085 --> 00:14:21,565
publicly disparaging
commentary about the approach.

280
00:14:21,565 --> 00:14:25,675
It's, I'm not looking for the
universal source of truth for things.

281
00:14:25,915 --> 00:14:28,015
That's not what I go to down detector for.

282
00:14:28,015 --> 00:14:32,365
I'm not pulling up downdetector.com
to look for a root cause analysis

283
00:14:32,365 --> 00:14:33,745
to inform my learnings about

284
00:14:34,095 --> 00:14:36,765
things, especially given
that it's just user reports.

285
00:14:36,795 --> 00:14:40,335
So I'm sure today they're trotting
that line out already because there

286
00:14:40,335 --> 00:14:45,345
was an AWS blip on Downdetector, just
because that is what people equate

287
00:14:45,465 --> 00:14:49,575
'the internet is broken' with, which is
sort of a symptom of their own success.

288
00:14:49,785 --> 00:14:54,885
They have been unreasonably annoyed
by the existence of these things.

289
00:14:55,604 --> 00:14:57,165
But it's the first thing we look for.

290
00:14:57,165 --> 00:15:01,515
It's effectively putting a bit of data
around, is anyone I know on the

291
00:15:01,515 --> 00:15:05,805
tire fire formerly known as Twitter
noticing an issue here, because suddenly

292
00:15:05,805 --> 00:15:09,314
things started breaking and I need to
figure out again, is it me or is it you?

293
00:15:09,584 --> 00:15:11,745
Omri Sass: So, uh, there's
two things that you said here.

294
00:15:11,745 --> 00:15:15,074
Uh, first of all, uh, I, I,
my go-to is dumpster fire.

295
00:15:15,165 --> 00:15:16,425
Uh, but I'll take tire fire.

296
00:15:16,425 --> 00:15:18,944
You know, I, I, I learned
something new today.

297
00:15:19,310 --> 00:15:21,230
Corey: Just like the bike shed is
full of yaks in need of a shave.

298
00:15:22,280 --> 00:15:22,640
Omri Sass: Oh, wow.

299
00:15:22,640 --> 00:15:24,950
I'm, I'm taking that as a
personal offense to my beard.

300
00:15:25,040 --> 00:15:29,540
The other thing that you said here,
I think is, uh, a realization that is

301
00:15:29,540 --> 00:15:31,280
starting to permeate across the industry.

302
00:15:31,640 --> 00:15:36,620
And that is that some people will use this
to measure their SLAs and they'll use this

303
00:15:36,620 --> 00:15:40,190
to go to their account teams and complain,
demand credits and things like that.

304
00:15:40,460 --> 00:15:44,840
And that's a valid reason for folks
like AWS and other providers to maybe

305
00:15:44,840 --> 00:15:48,980
get a bit, you know, add a bit of
salt, uh, to their behavior maybe.

306
00:15:49,349 --> 00:15:50,099
Just a smidge.

307
00:15:50,219 --> 00:15:54,000
But the flip side is that more and more
people come to the same realization

308
00:15:54,000 --> 00:15:58,530
that you come to in the first moment of
responding to an incident when I'm still

309
00:15:58,530 --> 00:16:01,050
orienting myself in the 3:00 AM case.

310
00:16:01,319 --> 00:16:03,901
And it's always the, the worst
one is always at 3:00 AM.

311
00:16:04,860 --> 00:16:05,640
I'm groggy.

312
00:16:05,640 --> 00:16:06,840
I'm still orienting myself.

313
00:16:06,840 --> 00:16:10,110
I'm trying to figure out what
action do I take right now?

314
00:16:10,470 --> 00:16:13,110
Hey, this thing is not even my problem.

315
00:16:13,110 --> 00:16:16,890
It comes from somewhere else, is one of
the most important learnings

316
00:16:16,890 --> 00:16:21,840
that I can grab in a moment and not
waste time on uh, on it, especially

317
00:16:21,840 --> 00:16:24,390
given that most people don't think
to ask that unless they're very

318
00:16:24,390 --> 00:16:28,860
experienced and have gone through
these types of issues. Because of that,

319
00:16:29,280 --> 00:16:33,630
we see more and more people who are
actually interested in joining this.

320
00:16:33,839 --> 00:16:38,520
And when we launched, like on the day
of launch, our, our legal team

321
00:16:38,790 --> 00:16:43,920
was basically like ready for all the
inbound, salty, like angry emails.

322
00:16:44,280 --> 00:16:49,079
We didn't get any, but on the same week
that we launched, we had a provider who's

323
00:16:49,079 --> 00:16:55,349
not represented on that page reach out to
us and say, Hey, we're Datadog customers.

324
00:16:55,349 --> 00:16:56,849
Like, why aren't we even up there?

325
00:16:56,849 --> 00:16:57,420
And we're like,

326
00:16:58,065 --> 00:16:58,875
oh wow.

327
00:16:58,935 --> 00:16:59,865
Come, come talk to us.

328
00:16:59,865 --> 00:17:00,105
Like

329
00:17:00,555 --> 00:17:01,605
Corey: We didn't know you broke.

330
00:17:01,665 --> 00:17:02,445
Omri Sass: Yeah, exactly.

331
00:17:03,600 --> 00:17:07,109
Corey: This episode is sponsored
by my own company, Duckbill.

332
00:17:07,440 --> 00:17:10,950
Having trouble with your AWS
bill? Perhaps it's time to

333
00:17:10,950 --> 00:17:13,050
renegotiate a contract with them.

334
00:17:13,379 --> 00:17:18,780
Maybe you're just wondering how to predict
what's going on in the wide world of AWS.

335
00:17:18,839 --> 00:17:21,480
Well, that's where
Duckbill comes to help.

336
00:17:21,690 --> 00:17:24,420
Remember, you can't duck the Duckbill

337
00:17:24,420 --> 00:17:25,500
bill, which I am

338
00:17:25,500 --> 00:17:30,629
reliably informed by my business
partner is absolutely not our motto.

339
00:17:30,875 --> 00:17:34,115
To learn more, visit duckbillhq.com.

340
00:17:34,955 --> 00:17:39,365
On some level, this is a little bit of,
you must be at least this big to wind up

341
00:17:39,365 --> 00:17:42,544
appearing here. Like, this is, I, I know
people are listening to this and they're

342
00:17:42,544 --> 00:17:44,075
gonna take the wrong lesson away from it.

343
00:17:44,165 --> 00:17:46,085
This is not a marketing opportunity.

344
00:17:46,085 --> 00:17:47,764
I'm sorry, but it's not.

345
00:17:47,794 --> 00:17:51,095
This is, this is systemically
important providers.

346
00:17:51,095 --> 00:17:54,695
In fact, I could make an argument about
some of the folks that are included.

347
00:17:54,965 --> 00:17:57,754
Is this really something
that needs to be up here?

348
00:17:57,845 --> 00:18:00,064
Azure DevOps is an example.

349
00:18:01,020 --> 00:18:04,500
Yeah, if you're on Azure, you're
used to it breaking periodically.

350
00:18:04,500 --> 00:18:05,879
I'm sorry, but it's true.

351
00:18:06,090 --> 00:18:08,550
I, we talk about not knowing
if it's a global problem.

352
00:18:08,610 --> 00:18:10,770
There have been multiple occasions
where I'm trying to get GitHub

353
00:18:10,770 --> 00:18:13,200
Actions to work properly, only to
find out that it's, it's GitHub

354
00:18:13,200 --> 00:18:14,639
Actions that's broken at the time.

355
00:18:14,879 --> 00:18:15,870
Omri Sass: Totally fair.

356
00:18:15,870 --> 00:18:18,090
But we still want to be
able to reflect that to you.

357
00:18:18,210 --> 00:18:20,820
Corey: But you also don't wanna turn
this into a scrolling doom forever.

358
00:18:20,970 --> 00:18:24,600
It, it would be interesting almost
to have a frequency algorithm where

359
00:18:24,750 --> 00:18:27,330
when something is breaking right now,

360
00:18:27,765 --> 00:18:30,675
you sort of have to hope that it's
gonna be alphabetically supreme,

361
00:18:30,975 --> 00:18:34,935
whereas it would be nice to surface
that and not have to scroll forever.

362
00:18:34,935 --> 00:18:36,225
Again, minor stuff.
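
A sketch of the surfacing idea in this exchange, assuming an invented ProviderCard shape; a real ranking would weight severity and usage too, but the intent, active incidents first instead of alphabetical order, is the same:

```python
from dataclasses import dataclass

@dataclass
class ProviderCard:
    name: str
    is_down: bool
    minutes_since_last_incident: float

def sort_cards(cards: list[ProviderCard]) -> list[ProviderCard]:
    """Active incidents first, then most recent trouble, then
    alphabetical, so nobody has to hope their outage is
    'alphabetically supreme.'"""
    return sorted(
        cards,
        key=lambda c: (not c.is_down, c.minutes_since_last_incident, c.name),
    )
```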

363
00:18:36,285 --> 00:18:39,015
Part of the problem is you don't get
a lot of opportunities to test this

364
00:18:39,015 --> 00:18:42,255
with wide ranging global outages
that impact multiple providers.

365
00:18:42,255 --> 00:18:43,905
So make hay while the sun shines.

366
00:18:44,535 --> 00:18:45,165
Omri Sass: Uh, exactly.

367
00:18:45,165 --> 00:18:50,355
Like, uh, we, we got two since we
released this, in a very, uh, brief moment.

368
00:18:50,415 --> 00:18:53,385
And I, you know, I, I, I say
this, I might sound like I have a

369
00:18:53,385 --> 00:18:54,945
smile on my face, obviously, like,

370
00:18:55,695 --> 00:18:58,245
My heart goes out to all the people
who have to actually respond to

371
00:18:58,245 --> 00:19:01,605
those incidents and to every one of
their users who's having a rough day.

372
00:19:01,754 --> 00:19:05,595
But to your point, this is a
golden opportunity to learn and

373
00:19:05,595 --> 00:19:08,415
to make sure that that knowledge
is disseminated and available to

374
00:19:08,415 --> 00:19:10,814
everyone as equally as possible.

375
00:19:10,875 --> 00:19:12,345
Never let a good crisis go to waste.

376
00:19:12,524 --> 00:19:16,575
That is, uh, one of the first adages
that I heard our CTO speak, uh,

377
00:19:16,575 --> 00:19:17,925
when I joined the company, and it's

378
00:19:18,270 --> 00:19:19,800
etched into the back of my brain.

379
00:19:20,010 --> 00:19:23,640
Corey: And it, it's an important thing.
Like, I, you've overshot the thing

380
00:19:23,640 --> 00:19:28,200
that I was asking for because as soon as
you start getting too granular, you start

381
00:19:28,200 --> 00:19:30,270
to get into the 'works for me, not for you.'

382
00:19:30,390 --> 00:19:32,340
And it descends into meaningless noise.

383
00:19:32,490 --> 00:19:36,930
All I really wanted was for a
traffic light webpage you folks

384
00:19:36,930 --> 00:19:38,225
put up there, or even a graph.

385
00:19:38,550 --> 00:19:40,020
Don't even put the numbers on it.

386
00:19:40,020 --> 00:19:44,070
Make it logarithmic, just control
for individual big customers that

387
00:19:44,070 --> 00:19:47,790
you have, and just tell me what your
alerting volume is right now, where

388
00:19:47,820 --> 00:19:50,129
that's enough signal to answer

389
00:19:50,129 --> 00:19:53,940
the thing that I have. This is
superior to that because, oh,

390
00:19:53,990 --> 00:19:54,620
great,

391
00:19:54,680 --> 00:19:59,630
now I know whether it's my wifi or whether
it's the site that isn't working. Some

392
00:19:59,630 --> 00:20:02,570
of these services are reliable enough
that if they're not working for me, my

393
00:20:02,570 --> 00:20:05,390
first thought is that my local internet
isn't working as well as it used to.

394
00:20:05,540 --> 00:20:07,550
I mean, Google crossed
that boundary a while back.

395
00:20:07,550 --> 00:20:10,250
If google.com doesn't load,
it's probably your fault.

396
00:20:10,520 --> 00:20:11,330
Omri Sass: Completely agreed.

397
00:20:11,480 --> 00:20:14,390
Corey: My question, without sounding
rude about this, is why did this

398
00:20:14,390 --> 00:20:17,750
take so long to exist as a product?

399
00:20:18,560 --> 00:20:21,500
Omri Sass: When I, I did kind
of the rounds with the team and,

400
00:20:21,530 --> 00:20:25,160
uh, the director of data science
who kind of, uh, runs the group.

401
00:20:25,160 --> 00:20:26,000
He's an old friend of mine.

402
00:20:26,000 --> 00:20:28,310
We've been working very
closely for a long time.

403
00:20:28,400 --> 00:20:33,470
I asked the same question and I heard
a really funny bit of, uh, history and

404
00:20:33,470 --> 00:20:35,940
you can, if you Google 'Datadog Pokemon,'

405
00:20:36,160 --> 00:20:41,260
you, you may find something kind of
funny, uh, where in 2016, two engineers,

406
00:20:41,290 --> 00:20:45,340
uh, here at Datadog realized that,
I think they were using Pokemon Go.

407
00:20:45,340 --> 00:20:48,190
They were like playing all over and
they'd realized that there were a bunch

408
00:20:48,190 --> 00:20:51,075
of connectivity issues and everyone,
like, people were literally like,

409
00:20:51,435 --> 00:20:55,455
trying to swap their phones and like
tap on it, like figure out if it was

410
00:20:55,485 --> 00:20:59,804
the, the phone that was wrong, the wifi
that was wrong, uh, or if Pokemon Go

411
00:20:59,804 --> 00:21:01,965
itself, like Niantic servers were down.

412
00:21:02,085 --> 00:21:06,764
And they basically built a public
Datadog dashboard that kept track

413
00:21:06,764 --> 00:21:11,720
of a whole bunch of, uh, health
measures for the Niantic APIs.

414
00:21:12,045 --> 00:21:16,130
Yeah, and they kind of published that,
it made a splash on then-Twitter, you,

415
00:21:16,135 --> 00:21:19,605
you tell me if it was before or after it
turned into a, a dumpster full of, uh,

416
00:21:19,605 --> 00:21:24,585
tires on fire. And that kind of idea of,
like, hey, we can do this type of public

417
00:21:24,585 --> 00:21:26,745
service, stuck with the same group.

418
00:21:26,805 --> 00:21:30,105
And then a couple of years later we
released Watchdog, the, uh, ML engine

419
00:21:30,105 --> 00:21:31,725
that does anomaly detection for us.

420
00:21:32,205 --> 00:21:35,715
Uh, and then came that realization that
I mentioned earlier where, uh, every

421
00:21:35,715 --> 00:21:38,985
time that there's a major outage with
one of the cloud providers, with one

422
00:21:38,985 --> 00:21:43,455
of the like main SaaS providers out
there, uh, we would see load increase.

423
00:21:43,725 --> 00:21:46,455
And ever since then we
worked on refining the model.

424
00:21:46,695 --> 00:21:51,885
Like it started with data that we had
and then it moved to, uh, what type

425
00:21:51,885 --> 00:21:53,504
of telemetry is the best predictor.

426
00:21:53,715 --> 00:21:57,165
We found that if we take
fairly naive approaches,

427
00:21:57,375 --> 00:22:02,625
it gets noisy, it gets actually
really noisy and we have so many cases

428
00:22:02,625 --> 00:22:04,575
where the service isn't in fact down.

429
00:22:04,845 --> 00:22:08,505
It's a one-off, or it's something
that changed either an hour or in

430
00:22:08,505 --> 00:22:12,165
a customer's environment that makes
it look like the service is down.

431
00:22:12,345 --> 00:22:15,075
So we had to do a lot of refinement
and we ended up building our,

432
00:22:15,075 --> 00:22:17,625
a proprietary model to do this.

433
00:22:17,625 --> 00:22:20,355
So it's an actual ML model
that we built, homegrown.
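
For flavor, here's a toy debouncer illustrating the noise-versus-latency tradeoff being described: one-off blips are suppressed by requiring a few consecutive anomalous windows, at the cost of a little detection delay. The window count is made up, and this is not the homegrown model itself:

```python
from collections import deque

class DebouncedStatus:
    """Mark a provider down only after k consecutive anomalous
    windows, and up again after k consecutive clean ones. Larger k
    means less noise but slower detection."""

    def __init__(self, k: int = 3):
        self.k = k
        self.recent = deque(maxlen=k)
        self.down = False

    def observe(self, window_is_anomalous: bool) -> bool:
        self.recent.append(window_is_anomalous)
        if len(self.recent) == self.k:
            if all(self.recent):
                self.down = True
            elif not any(self.recent):
                self.down = False
        return self.down
```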

434
00:22:20,685 --> 00:22:24,524
Doesn't use any of the, uh, big
AI providers, anything like that.

435
00:22:24,735 --> 00:22:27,045
It processes a massive amount of data.

436
00:22:27,045 --> 00:22:30,615
I think he threw out the, like, number
of, uh, you know, what petabytes

437
00:22:30,615 --> 00:22:32,355
of data it processes in a given time.

438
00:22:32,715 --> 00:22:33,585
I don't remember.

439
00:22:34,004 --> 00:22:39,105
And we had to build a low latency pipeline
for it because the last thing that we

440
00:22:39,105 --> 00:22:44,595
want is to say, oh, this thing is up,
and then, five minutes from it being down,

441
00:22:44,595 --> 00:22:45,345
only show it.

442
00:22:45,675 --> 00:22:47,925
So there's a bunch of these things where

443
00:22:48,300 --> 00:22:51,300
we started building and we're
like, oh, this thing seems to work.

444
00:22:51,300 --> 00:22:55,410
And when it says something's down,
it's mostly down, but not always.

445
00:22:55,650 --> 00:22:56,760
And then it's late.

446
00:22:56,970 --> 00:22:58,740
So it has to be high reliability.

447
00:22:58,740 --> 00:23:01,200
It has to be decoupled from
the rest of Datadog or if we're

448
00:23:01,200 --> 00:23:03,300
down, not that we're ever down.

449
00:23:03,660 --> 00:23:05,970
It's, uh, my favorite joke
to tell in front of users.

450
00:23:05,970 --> 00:23:08,910
'cause you know, in a room full of SREs,
you tell someone, our code is perfect.

451
00:23:08,910 --> 00:23:09,690
We never have any issues.

452
00:23:09,960 --> 00:23:10,830
Everyone starts laughing.

453
00:23:10,830 --> 00:23:12,510
Corey: That's why Datadog
needs to be a provider on this.

454
00:23:12,510 --> 00:23:12,930
And it just could.

455
00:23:12,990 --> 00:23:13,740
That doesn't need to be a graph.

456
00:23:13,740 --> 00:23:14,940
That could just be a static image

457
00:23:15,755 --> 00:23:16,105
Omri Sass: There,

458
00:23:16,260 --> 00:23:16,680
there we go.

459
00:23:17,190 --> 00:23:18,300
We're also working on that.

460
00:23:18,300 --> 00:23:21,930
You should know, we are, we
try to be fairly transparent, uh,

461
00:23:21,960 --> 00:23:23,550
not the topic of today's conversation.

462
00:23:23,550 --> 00:23:27,150
We have a, a FinOps tool, uh,
Datadog Cloud Cost Management.

463
00:23:27,450 --> 00:23:31,320
Uh, we put Datadog costs on there
by default at no additional charge.

464
00:23:31,320 --> 00:23:33,180
So like we, we do try
to make sure that we

465
00:23:33,795 --> 00:23:37,725
accept our place in the ecosystem,
uh, in that way, or try to be humble

466
00:23:37,725 --> 00:23:39,045
about our place in the ecosystem.

467
00:23:39,375 --> 00:23:42,825
Corey: Oh, part of the challenge too is
that I would argue, and you can tell me if

468
00:23:42,825 --> 00:23:47,805
I'm wrong on this, I, I don't see a Datadog
outage as being a critical-path issue.

469
00:23:47,805 --> 00:23:51,165
By which I mean, if you
folks go dark globally,

470
00:23:51,804 --> 00:23:56,514
no one's website should stop working
because, ooh, our telemetry provider is

471
00:23:56,514 --> 00:24:01,345
not working, therefore we're going to just
block on IO or it's going to take us down.

472
00:24:01,465 --> 00:24:03,445
Sure, they're gonna have no
idea what their site is doing

473
00:24:03,445 --> 00:24:05,095
or if their site is up at all.

474
00:24:05,335 --> 00:24:07,465
But it's, you're a second order effect.

475
00:24:07,465 --> 00:24:08,665
You're not critical path.

476
00:24:08,695 --> 00:24:11,965
Omri Sass: That is very correct, and
to be fair, it's something that allows

477
00:24:11,965 --> 00:24:13,345
me to sleep much better at night.

478
00:24:13,680 --> 00:24:19,080
But, um, I will say that there are a
couple of good examples of customers

479
00:24:19,080 --> 00:24:24,000
who use the observability data
either, uh, to gate, uh, deployments.

480
00:24:24,030 --> 00:24:27,480
So if you practice, uh, continuous
deployment or continuous integration

481
00:24:27,810 --> 00:24:30,600
and you don't have observability,
suddenly you need to shut down

482
00:24:30,600 --> 00:24:32,010
your ability to deploy code.
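
A minimal sketch of that observability-gated deploy check, with invented names for the inputs; a real pipeline would wire this to an actual status check:

```python
import sys

def deploys_allowed(observability_healthy: bool, deploy_freeze: bool) -> bool:
    """If you can't see what a rollout is doing, don't roll out."""
    return observability_healthy and not deploy_freeze

if __name__ == "__main__":
    # Example CI step: exit nonzero to block the pipeline.
    if not deploys_allowed(observability_healthy=False, deploy_freeze=False):
        print("observability is degraded; blocking deploys")
        sys.exit(1)
```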

483
00:24:32,010 --> 00:24:36,900
And that may not be, Hey, the website is
down, but it is considered, uh, in our

484
00:24:36,900 --> 00:24:39,090
language, a sev one or sev two incident,

485
00:24:39,090 --> 00:24:39,660
so like the

486
00:24:39,975 --> 00:24:42,405
worst kinds of, uh, incidents.

487
00:24:42,495 --> 00:24:46,605
Uh, and there are also other
companies that are not e-businesses.

488
00:24:46,665 --> 00:24:51,284
Uh, they have real world brick and mortar
or you know, some airlines or things like

489
00:24:51,284 --> 00:24:56,415
that where if they lose, uh, observability
into some parts of their system, there

490
00:24:56,415 --> 00:24:58,185
is a real, real world repercussion.

491
00:24:58,185 --> 00:25:01,725
And so we take the, we take our
own reliability very seriously.

492
00:25:01,725 --> 00:25:02,175
And again, kind of,

493
00:25:02,850 --> 00:25:08,430
uh, adages that I've, um, kind of heard our
CTO say that have been etched in my brain.

494
00:25:08,730 --> 00:25:13,740
Uh, our reliability target, it
needs to be as, as high, so like

495
00:25:13,740 --> 00:25:17,430
as strict as our most strict
customer, and that's how we treat it.

496
00:25:17,520 --> 00:25:19,950
And our, I will say our
ops game is pretty good.

497
00:25:20,500 --> 00:25:21,854
Corey: It, it has to be at some point.

498
00:25:22,080 --> 00:25:26,445
I, I do have, again, things that you
can bikeshed to death if you need to.

499
00:25:26,774 --> 00:25:28,935
AWS, you're monitoring 12 services.

500
00:25:29,175 --> 00:25:30,225
How did you pick those?

501
00:25:30,254 --> 00:25:31,995
Omri Sass: Highest
popularity among our users.

502
00:25:32,024 --> 00:25:35,774
Like these are, uh, by far the
most used, uh, AWS services

503
00:25:35,774 --> 00:25:37,844
among the Datadog customer base.

504
00:25:38,135 --> 00:25:38,855
Corey: Fascinating.

505
00:25:38,915 --> 00:25:41,405
Looking at the list, there
are things that I'm not, I'm not

506
00:25:41,405 --> 00:25:44,555
entirely surprised by any of these
things, with the exception of KMS.

507
00:25:44,795 --> 00:25:48,695
I don't love the fact that that's
as popular as it is, but there are

508
00:25:48,695 --> 00:25:52,145
ways to use it for free and yeah,
it's, it does feel like it's more

509
00:25:52,145 --> 00:25:55,085
critical path and you're gonna
see more operational log noise out

510
00:25:55,085 --> 00:25:56,585
of it, for lack of a better term.

511
00:25:56,705 --> 00:25:57,935
I'm sure that right now, the

512
00:25:58,165 --> 00:26:01,315
biggest thing that someone at
Amazon is upset about internally is

513
00:26:01,315 --> 00:26:02,665
that their service isn't included.

514
00:26:02,845 --> 00:26:04,405
I don't see bedrock on this.

515
00:26:04,405 --> 00:26:04,885
Don't worry.

516
00:26:04,885 --> 00:26:08,335
They already have Anthropic's
and, uh, OpenAI's APIs on here.

517
00:26:08,335 --> 00:26:09,295
So they're covered there.

518
00:26:09,355 --> 00:26:12,534
Omri Sass: And again, if anyone from
the AWS Bedrock team wants to come and

519
00:26:12,534 --> 00:26:13,885
talk to us, they know where to find us.

520
00:26:13,885 --> 00:26:15,409
We, we have good friends
on the Bedrock team.

521
00:26:16,245 --> 00:26:16,665
Corey : Oh yeah.

522
00:26:16,875 --> 00:26:19,545
I, I still find it fun as well,
like there are a couple of these folks.

523
00:26:19,545 --> 00:26:20,955
The, the first one listed:

524
00:26:20,955 --> 00:26:24,675
Adyen, A-D-Y-E-N. I don't, off
the top of my head, I don't know

525
00:26:24,675 --> 00:26:26,055
who that is or what they do.

526
00:26:26,235 --> 00:26:28,095
So this is a marketing story here.

527
00:26:28,215 --> 00:26:28,875
Omri Sass: Oh, interesting.

528
00:26:28,995 --> 00:26:30,405
Oh, well, good for the Adyen folks.

529
00:26:30,405 --> 00:26:33,975
Corey: And, and, 'payments, data, and
financial management in one solution.'

530
00:26:34,035 --> 00:26:34,575
Okay.

531
00:26:35,130 --> 00:26:38,610
Now we know it's basically
a, a pipeline for payments.

532
00:26:38,790 --> 00:26:39,270
Omri Sass: Yes.

533
00:26:39,270 --> 00:26:42,270
And I'm, I'm actually willing to
bet you that you have used them,

534
00:26:42,270 --> 00:26:46,260
you know, just like, uh, Block, uh,
formerly Square, they have devices

535
00:26:46,260 --> 00:26:47,970
where you tap your credit card to use.

536
00:26:47,970 --> 00:26:51,090
I think they're, if, if memory serves,
they're more popular in Europe than the

537
00:26:51,090 --> 00:26:52,620
US, but I've seen their devices here too.

538
00:26:52,920 --> 00:26:54,215
Corey: They are a Dutch
company, which would explain it.

539
00:26:54,750 --> 00:26:55,830
It's, it's that useful stuff.

540
00:26:55,830 --> 00:26:57,390
The world is full of
things that we don't use.

541
00:26:57,390 --> 00:27:00,330
I, it's weird 'cause I'm thinking
of this in the context of

542
00:27:00,390 --> 00:27:03,840
infrastructure providers, like what
the hell kind of cloud provider

543
00:27:03,840 --> 00:27:05,490
is this? A highly specific one,

544
00:27:05,490 --> 00:27:06,450
thank you for asking.

545
00:27:06,780 --> 00:27:07,320
Omri Sass: Exactly.

546
00:27:07,470 --> 00:27:09,570
Corey: What do you see
as coming next for this?

547
00:27:09,570 --> 00:27:13,590
I mean, you mentioned that there's
the idea of the overall, here's

548
00:27:13,590 --> 00:27:16,980
what's broken and we, we learn as we
go on this, but if you had a magic

549
00:27:16,980 --> 00:27:17,754
wand, what would you make it do?

550
00:27:18,495 --> 00:27:23,235
Omri Sass: Well, the easy answer to that
is what Corey said, but jokes aside,

551
00:27:23,295 --> 00:27:25,965
regional visibility into all the services.

552
00:27:26,070 --> 00:27:28,125
Corey: The, the
counterpoint to that is that

553
00:27:28,965 --> 00:27:31,335
global rolling outages
are not a thing with AWS.

554
00:27:31,335 --> 00:27:37,005
They are frequently a thing with GCP
and with Azure it's, it's Tuesday,

555
00:27:37,005 --> 00:27:38,355
they're probably down somewhere already.

556
00:27:38,655 --> 00:27:41,265
Every provider implements these
things somewhat differently.

557
00:27:41,385 --> 00:27:43,815
And then you also have the cascade
effects, like, as we saw this

558
00:27:43,815 --> 00:27:47,325
morning, when Cloudflare goes down,
a lot of API endpoints behind

559
00:27:47,325 --> 00:27:48,915
Cloudflare will also go down.

560
00:27:49,095 --> 00:27:51,915
If AWS has a bad day, so
does half of the internet.

561
00:27:52,365 --> 00:27:58,125
There's a, there's a strong sense that
this is becoming sort of a symptom

562
00:27:58,485 --> 00:28:03,615
of centralization, where it's not,
again, reliability is far higher

563
00:28:03,615 --> 00:28:05,175
today than it ever has been.

564
00:28:05,475 --> 00:28:08,655
The difference is that we've
centralized across a few providers that

565
00:28:08,655 --> 00:28:12,764
when they have a very bad day, it takes
everyone down simultaneously, as opposed

566
00:28:12,764 --> 00:28:16,485
to, your crappy data center is down on
Mondays and Tuesdays and mine is down on

567
00:28:16,485 --> 00:28:19,514
Wednesdays and Thursdays, and that's just
how the internet worked for a long time.

568
00:28:19,785 --> 00:28:24,285
Omri Sass: So I, I think that there's a,
say, a moment of reckoning here for a lot

569
00:28:24,285 --> 00:28:28,845
of companies and, uh, the cloud providers
included and many of their customers.

570
00:28:28,905 --> 00:28:32,775
You know, folks who maybe never
invested in reliability or resilience

571
00:28:32,775 --> 00:28:38,205
or disaster recovery, or any flavor
of being, uh, I think resilient

572
00:28:38,205 --> 00:28:39,495
is probably the best word here,

573
00:28:40,215 --> 00:28:45,375
to any particular outage, because
to your point, while some things

574
00:28:45,375 --> 00:28:48,975
are more centralized, right, there
are the, the main hyperscalers are

575
00:28:48,975 --> 00:28:50,385
where we're mostly centralized.

576
00:28:50,385 --> 00:28:54,495
A lot of things are significantly more
distributed, and that distribution, on the

577
00:28:54,495 --> 00:28:59,504
one hand means we're more resilient in the
overall, in the aggregate, but it's harder

578
00:28:59,504 --> 00:29:01,304
to figure out what's actually broken.

579
00:29:01,784 --> 00:29:02,294
And so,

580
00:29:02,850 --> 00:29:06,930
on the one hand, I would hope that a
bunch of, uh, companies that are critical

581
00:29:06,930 --> 00:29:11,115
for their users that have critical,
uh, infrastructure up in the cloud,

582
00:29:11,850 --> 00:29:16,830
would remember that the internet is
not actually just us-east-1, as, uh,

583
00:29:16,830 --> 00:29:21,810
a lot of folks who consider the clouds
to be, uh, didn't realize that the

584
00:29:21,810 --> 00:29:25,020
cloud was mostly just us-east-1, and
we all learned that the hard way a

585
00:29:25,020 --> 00:29:29,490
couple of weeks ago and would start to
move things, uh, to other places, or

586
00:29:29,975 --> 00:29:31,055
build redundancies.

587
00:29:31,055 --> 00:29:34,925
And then, you know, maybe the
negotiation here is I, uh, I'll

588
00:29:34,925 --> 00:29:38,255
be down, but I'll be down for not
as long as the cloud provider.

589
00:29:38,255 --> 00:29:42,665
I can, uh, fail over safely or degrade
gracefully or any of these things.

590
00:29:43,205 --> 00:29:46,685
A lot of nice sounding terms that we
can throw at it, but that a lot of

591
00:29:46,685 --> 00:29:48,635
folks haven't heard or decided to

592
00:29:48,635 --> 00:29:49,475
not prioritize.

593
00:29:49,485 --> 00:29:51,405
Now is a good, or they don't understand

594
00:29:51,405 --> 00:29:53,685
Corey: What the reality of
that looks like, because you,

595
00:29:53,685 --> 00:29:56,324
you can't simulate an S3 outage.

596
00:29:56,324 --> 00:29:56,534
Yeah.

597
00:29:56,534 --> 00:29:58,125
You can block it from your app.

598
00:29:58,125 --> 00:29:58,604
Sure.

599
00:29:58,604 --> 00:29:59,264
Terrific.

600
00:29:59,445 --> 00:30:02,745
You can't figure out how your
third party critical dependencies

601
00:30:02,745 --> 00:30:03,554
are going to react then.

602
00:30:03,554 --> 00:30:07,665
And when they all rely on each other,
it becomes a very strange mess.

603
00:30:07,665 --> 00:30:09,764
And that's why outages
at this scale are unique.

604
00:30:10,185 --> 00:30:10,425
Omri Sass: Yep.

605
00:30:10,485 --> 00:30:11,415
Completely agreed.

606
00:30:11,850 --> 00:30:14,774
Corey: I, I want to thank you for taking
the time to walk me through the thought

607
00:30:14,774 --> 00:30:16,245
processes behind this and how it works.

608
00:30:16,334 --> 00:30:18,885
If people wanna learn more, where's
the best place for them to find you?

609
00:30:19,290 --> 00:30:20,460
Omri Sass: updog.ai.

610
00:30:20,700 --> 00:30:23,820
Uh, and then after that,
at, uh, datadoghq.com.

611
00:30:23,970 --> 00:30:27,150
And if you're already a Datadog
customer, I'm sure your account

612
00:30:27,150 --> 00:30:28,680
team knows where to find me.

613
00:30:29,700 --> 00:30:32,490
Corey: You're going to regret
saying that, because everyone,

614
00:30:32,490 --> 00:30:34,110
everyone is a Datadog customer.

615
00:30:34,500 --> 00:30:36,210
Omri, thank you so much for your time.

616
00:30:36,210 --> 00:30:36,960
I appreciate it.

617
00:30:37,260 --> 00:30:38,010
Omri Sass: Thanks for having me.

618
00:30:38,340 --> 00:30:41,760
Corey: Omri Sass, Director of
Product Management at Datadog.

619
00:30:41,970 --> 00:30:45,360
I'm Cloud Economist Corey Quinn,
and this is Screaming in the Cloud.

620
00:30:45,660 --> 00:30:48,570
If you've enjoyed this podcast,
please leave a five star review on

621
00:30:48,570 --> 00:30:50,070
your podcast platform of choice.

622
00:30:50,250 --> 00:30:53,610
Whereas if you've hated this podcast,
please leave a five star review on your

623
00:30:53,610 --> 00:30:57,870
podcast platform of choice, along with
an angry, bewildered comment along the

624
00:30:57,870 --> 00:31:00,840
lines of, date a dog? Like Tinder for pets?

625
00:31:00,900 --> 00:31:04,230
That's disgusting, showing
that you did not get the point.