1
00:00:00,270 --> 00:00:04,110
Tyler: So even the simple like S3 will
gladly compute checksums for you.

2
00:00:04,710 --> 00:00:07,680
But to do so, it's a batch
operation and a batch operation

3
00:00:07,680 --> 00:00:09,720
costs $1 per million objects.

4
00:00:10,020 --> 00:00:14,700
If you have a hundred billion objects,
all of a sudden you're faced with, if

5
00:00:14,700 --> 00:00:18,660
I need checksums for all of this data,
I've gotta go drop a hundred K just to

6
00:00:18,660 --> 00:00:23,340
compute checksums, because a billion
is just so astronomically large that

7
00:00:23,340 --> 00:00:26,790
just the simple act of getting checksums
for the data you already have in S3

8
00:00:27,210 --> 00:00:29,460
becomes a very serious pricing discussion

9
00:00:29,880 --> 00:00:31,770
if you aren't ready to
drop that kind of coin.

10
00:00:37,080 --> 00:00:38,700
Corey: Welcome to Screaming in the Cloud.

11
00:00:38,910 --> 00:00:40,050
I'm Corey Quinn.

12
00:00:40,350 --> 00:00:44,550
I am joined today by an early
Duckbill Group customer.

13
00:00:44,730 --> 00:00:49,860
A recent speaker at the inaugural
San Francisco FinOps meetup, I suppose

14
00:00:49,860 --> 00:00:55,769
we'll call it. R. Tyler Croy is an
infrastructure architect over at Scribd.

15
00:00:56,459 --> 00:00:57,300
Tyler, how are you?

16
00:00:57,745 --> 00:00:59,220
Tyler: I'm doing all right.

17
00:01:00,180 --> 00:01:03,269
Corey: This episode is sponsored
in part by my day job, Duckbill.

18
00:01:03,269 --> 00:01:06,479
Do you have a horrifying AWS bill?

19
00:01:06,750 --> 00:01:08,640
That can mean a lot of things.

20
00:01:08,850 --> 00:01:13,949
Predicting what it's going to be,
determining what it should be, negotiating

21
00:01:13,949 --> 00:01:19,380
your next long-term contract with AWS,
or just figuring out why it increasingly

22
00:01:19,380 --> 00:01:23,910
resembles a phone number, but nobody
seems to quite know why that is.

23
00:01:24,210 --> 00:01:24,900
To learn more,

24
00:01:24,900 --> 00:01:25,380
visit

25
00:01:25,495 --> 00:01:27,775
duckbillhq.com.

26
00:01:28,075 --> 00:01:30,925
Remember, you can't duck the Duckbill

27
00:01:30,985 --> 00:01:35,665
bill, which my CEO reliably
informs me is absolutely not

28
00:01:35,665 --> 00:01:39,445
our slogan. You've been there for a while
over at Scribd, six years and change.

29
00:01:39,445 --> 00:01:42,415
Uh, the company boldly answering
the question, what if S3

30
00:01:42,415 --> 00:01:43,615
had a good user interface?

31
00:01:46,015 --> 00:01:47,365
Tyler: Uh, yeah.

32
00:01:47,455 --> 00:01:50,785
I mean, I've, I've joined the
most recent incarnation of Scribd.

33
00:01:50,785 --> 00:01:51,955
Scribd has been a lot of things.

34
00:01:51,955 --> 00:01:53,305
We've been around for 18 years.

35
00:01:54,119 --> 00:01:59,399
I think Scribd has existed longer
than most other tech companies.

36
00:01:59,399 --> 00:02:00,779
So I mean, we've got the, uh,

37
00:02:01,830 --> 00:02:06,039
the lack of a vowel in our name.
We're from the Flickr, the Scribd;

38
00:02:06,059 --> 00:02:08,399
we were before the dot-lys.

39
00:02:08,759 --> 00:02:10,949
Corey: Uh, as we look,
vowels are expensive.

40
00:02:10,949 --> 00:02:14,970
Why buy them if you don't have to? Uh, so,
your talk was fascinating 'cause it, of

41
00:02:14,970 --> 00:02:19,350
course, focused heavily on economics and
it also focused on S3, uh, the reason

42
00:02:19,350 --> 00:02:20,910
that many of us don't sleep anymore.

43
00:02:21,329 --> 00:02:23,430
And that's been.

44
00:02:23,605 --> 00:02:27,924
It was an interesting story just at the
level of scale that you're talking about.

45
00:02:28,255 --> 00:02:31,674
Things that people don't consider
to be expensive, got expensive.

46
00:02:31,704 --> 00:02:36,834
Uh, specifically request charges when
you're doing things in buckets that for

47
00:02:36,834 --> 00:02:41,935
those who are unfamiliar, effectively,
have a crap ton of uploaded text

48
00:02:41,995 --> 00:02:44,230
documents at head of height scale.

49
00:02:45,050 --> 00:02:47,445
Tyler: I think "K crap ton" is
the metric that Storage Lens

50
00:02:47,445 --> 00:02:48,945
shows you at our scale.

51
00:02:49,185 --> 00:02:53,650
So for the uninitiated, Scribd has
user documents, or user-uploaded content.

52
00:02:54,570 --> 00:02:57,090
Documents, typically presentations
through our SlideShare

53
00:02:57,090 --> 00:02:59,490
product going back 18 years.

54
00:02:59,580 --> 00:03:03,900
And so every day, thousands and thousands
of new documents, legal documents,

55
00:03:03,900 --> 00:03:05,490
study guides, et cetera, get uploaded.

56
00:03:05,790 --> 00:03:09,990
And those have been quietly accumulating
in our, in our S3 storage layer

57
00:03:09,990 --> 00:03:12,660
for a long time without anybody
really paying attention to it.

58
00:03:13,470 --> 00:03:16,950
And so a year or so ago, I started
to really look at like, where's a big

59
00:03:16,950 --> 00:03:19,200
dent I can make in Cost Explorer?

60
00:03:19,620 --> 00:03:22,679
Like, if I'm going to take on something
big, what's the biggest thing?

61
00:03:23,100 --> 00:03:24,299
And I saw S3 storage cost.

62
00:03:24,299 --> 00:03:25,560
Corey: Cloud economics.

63
00:03:25,560 --> 00:03:29,549
Instead of pulling an AWS billing org and
going alphabetically, if you start with

64
00:03:29,549 --> 00:03:32,489
the big things first, that tends to have impact.

65
00:03:32,519 --> 00:03:36,120
For years, I was asked about people's
random, uh, Alexa for Business

66
00:03:36,120 --> 00:03:38,609
spend. It's, it is $3 a month.

67
00:03:38,670 --> 00:03:39,420
What are you doing?

68
00:03:43,020 --> 00:03:47,760
Tyler: Yeah, I mean most store, most
companies I think have, like, EC2, S3,

69
00:03:47,820 --> 00:03:49,380
Aurora, like those are the big things.

70
00:03:49,380 --> 00:03:53,430
But once I started to look into our actual
S3 spend, I knew we had a lot of content.

71
00:03:53,430 --> 00:03:56,370
Like we, we talk about the hundreds of
millions of documents that have been

72
00:03:56,370 --> 00:04:00,360
uploaded over the years, but when I
actually looked into what was stored,

73
00:04:00,795 --> 00:04:06,165
you know, in S3, we're talking hundreds
of billions of objects because every

74
00:04:06,165 --> 00:04:09,705
single document that you upload, we
have like format conversion, we have

75
00:04:09,705 --> 00:04:11,295
accessibility changes that get made.

76
00:04:11,595 --> 00:04:17,475
And so every single document became
this diaspora of related objects in S3.

77
00:04:17,834 --> 00:04:21,584
And suddenly, things like Batch
Operations, Intelligent-Tiering, anything

78
00:04:21,584 --> 00:04:26,295
that has a per object charge associated
with it becomes wildly expensive

79
00:04:26,295 --> 00:04:27,960
in a way that requires you to, like,

80
00:04:28,980 --> 00:04:34,230
step back and, like, think about
how should we be doing this?

81
00:04:34,230 --> 00:04:35,355
How should we be storing this data?

82
00:04:36,030 --> 00:04:39,840
'cause that shotgun of objects into
S3 only works for the first billion.

83
00:04:39,960 --> 00:04:42,600
And then after that you might
have to think about what you're doing.

84
00:04:43,020 --> 00:04:45,600
Corey: The, the ergonomics of the
request charges are very different too.

85
00:04:45,600 --> 00:04:49,320
I think philosophically we tend to
see, you know, on some level, oh,

86
00:04:49,380 --> 00:04:53,250
if I stuff an exabyte of data into
S3, that's going to be expensive.

87
00:04:53,460 --> 00:04:57,750
But when we start, I think it's hard for
humans to wrap their head around the idea

88
00:04:57,890 --> 00:05:02,090
of hundreds of billions of objects just
because it's, the difference between a

89
00:05:02,090 --> 00:05:03,980
million and a billion is about a billion.

90
00:05:04,220 --> 00:05:08,870
If you pass a point of scale, you do
an S3 ls to see what objects you have

91
00:05:08,870 --> 00:05:11,840
there, and it'll complete right around
the time the earth crashes into the sun.

92
00:05:12,200 --> 00:05:16,040
It's a, it's just not something
that makes sense. But

93
00:05:16,115 --> 00:05:20,015
on the other side of it, I over-optimize
for a lot of this stuff because I

94
00:05:20,015 --> 00:05:23,195
think at the Duckbill Group now, our
total S3 bill is something like 110

95
00:05:23,195 --> 00:05:28,145
bucks a month right now, and we can do
basically anything we want to S3, it

96
00:05:28,145 --> 00:05:31,895
doesn't materially move the needle on
our business because we are not Scribd.

97
00:05:32,345 --> 00:05:37,835
Tyler: The S3, like you
can abuse S3 for terabytes.

98
00:05:37,895 --> 00:05:41,375
Petabytes even really, like
you can put so much into S3.

99
00:05:41,375 --> 00:05:42,545
It's so incredibly cheap.

100
00:05:42,545 --> 00:05:44,165
It's so incredibly reliable.

101
00:05:44,700 --> 00:05:47,190
And then there's this, like,
something happens.

102
00:05:47,190 --> 00:05:50,070
And I don't know when it happened at
Scribd because I wasn't paying attention.

103
00:05:50,070 --> 00:05:51,510
I'm only looking back in history.

104
00:05:51,780 --> 00:05:54,450
Something happened where we went
from like the first billion to the

105
00:05:54,450 --> 00:05:55,680
next billion to the next billion.

106
00:05:55,800 --> 00:06:00,540
And once you're in the tens or hundreds
of billions of objects, like it's,

107
00:06:00,600 --> 00:06:02,340
it's, it's like quantum physics.

108
00:06:02,340 --> 00:06:06,120
Like all of a sudden all of the physics
that you've learned no longer applies.

109
00:06:06,120 --> 00:06:08,790
You're in a completely different
ballgame and you've gotta figure

110
00:06:08,790 --> 00:06:10,230
out how does this world work?

111
00:06:10,440 --> 00:06:12,930
Because the world I thought
I had doesn't exist anymore.

112
00:06:14,205 --> 00:06:15,015
Corey: Oh, very much so.

113
00:06:15,165 --> 00:06:15,735
It's.

114
00:06:16,620 --> 00:06:19,290
You also have the problem where
especially when we're talking

115
00:06:19,290 --> 00:06:22,320
about all things billing,
it's a lot of hurry up and wait.

116
00:06:22,560 --> 00:06:23,820
Okay, we're gonna make some transitions.

117
00:06:23,820 --> 00:06:25,980
We're gonna try something
here and see how it goes.

118
00:06:26,250 --> 00:06:30,030
And then you have to, in some cases, wait
for objects to age out into the next tier.

119
00:06:30,210 --> 00:06:33,570
Or there's a bunch of request charges
that suddenly mean for this month your

120
00:06:33,570 --> 00:06:35,940
S3 bill is, is just in the stratosphere.

121
00:06:36,030 --> 00:06:39,300
And you get the angry client screaming
phone call like, what have you done?

122
00:06:39,720 --> 00:06:42,600
Yes, there is an amortization story here.

123
00:06:42,780 --> 00:06:44,370
Give it time, don't move it back.

124
00:06:44,370 --> 00:06:44,790
Be patient.

125
00:06:45,150 --> 00:06:45,480
Yeah.

126
00:06:45,930 --> 00:06:49,530
Tyler: And, and I'm, I'm actively,
as we record this, I am waiting for

127
00:06:49,710 --> 00:06:53,640
the first 30 days on some reclassing
to occur for Intelligent-Tiering.

128
00:06:54,060 --> 00:06:57,900
And I, I can't wait until next month
because I'm hoping for a big drop.

129
00:06:59,220 --> 00:06:59,580
Corey: Yeah.

130
00:06:59,640 --> 00:07:02,670
It's, this stuff has become
magic, but you have to speak the

131
00:07:02,670 --> 00:07:04,680
right incantations around it.

132
00:07:04,685 --> 00:07:05,045
Mm-hmm.

133
00:07:06,030 --> 00:07:08,190
Uh, past a certain point of scale,

134
00:07:08,190 --> 00:07:13,140
a lot of things, just in the way that AWS
talks about this, no longer make sense.

135
00:07:13,140 --> 00:07:17,580
Like I, I asked you about using S3
Metadata or S3 Tables for a lot of this

136
00:07:17,580 --> 00:07:22,349
stuff, and your response was the polite
business person equivalent of: kid,

137
00:07:22,349 --> 00:07:23,159
that's adorable.

138
00:07:23,159 --> 00:07:26,010
Do you have any idea what that
would cost just on the sheer number

139
00:07:26,010 --> 00:07:27,815
of objects? Because that's not,

140
00:07:28,565 --> 00:07:32,195
that's not usually the first dimension
we tend to think about historically.

141
00:07:32,284 --> 00:07:36,395
Now metadata and tables are changing
that and vector buckets and directory

142
00:07:36,395 --> 00:07:37,775
buckets and Lord knows what else.

143
00:07:38,044 --> 00:07:41,375
Uh, but that's, that just changes
the way that we think about a

144
00:07:41,375 --> 00:07:44,054
service that honestly is older than
some people listening to this show.

145
00:07:44,700 --> 00:07:47,940
Tyler: I mean, yeah, a lot of these
things really break down in ways

146
00:07:47,940 --> 00:07:51,659
that are challenging. The, uh,
Metadata and Inventory and other

147
00:07:51,659 --> 00:07:54,780
like operations or things that have
happened in S3 over the last few

148
00:07:54,780 --> 00:07:56,099
years that are really interesting.

149
00:07:56,429 --> 00:08:00,150
They're really, really great when
you're sub billion, uh, sub billion

150
00:08:00,150 --> 00:08:01,859
objects. But like S3 Metadata:

151
00:08:02,010 --> 00:08:05,940
the questions I have, or
I want to ask, of these buckets

152
00:08:06,660 --> 00:08:10,890
are not worth the amount of money it
would take to ingest into S3 metadata

153
00:08:10,890 --> 00:08:14,730
and then to to continue storing because
they're just so astronomically huge.

154
00:08:15,270 --> 00:08:19,260
I was looking at, um, I was looking
at a problem with some of these

155
00:08:19,260 --> 00:08:21,930
older objects. Sometime in 2024,

156
00:08:22,200 --> 00:08:23,490
you probably talked about this,

157
00:08:24,225 --> 00:08:27,555
every upload to S3 got a
checksum automatically.

158
00:08:27,555 --> 00:08:30,135
You don't have to do anything. Before that,

159
00:08:30,255 --> 00:08:32,715
you may or may not have a
checksum. If you were using a

160
00:08:32,715 --> 00:08:34,455
proper, you know, AWS SDK,

161
00:08:34,455 --> 00:08:36,735
you did, but if you weren't, who knows?

162
00:08:36,735 --> 00:08:38,534
And at some point who would
ever use anything other

163
00:08:38,835 --> 00:08:38,955
than the

164
00:08:38,955 --> 00:08:39,855
Corey: latest, correct

165
00:08:39,855 --> 00:08:40,455
SDK?

166
00:08:41,805 --> 00:08:43,544
Tyler: Why would you interact with S3

167
00:08:43,544 --> 00:08:45,555
except through the official SDK?

168
00:08:46,215 --> 00:08:50,355
Um, but when you go back
to, like, Scribd's S3 bucket,

169
00:08:51,150 --> 00:08:55,980
it was created the year S3
was created and announced.

170
00:08:55,980 --> 00:08:59,730
And so when we're going back that
far in time, we have billions of

171
00:08:59,730 --> 00:09:01,230
objects that don't have checksums.

172
00:09:01,470 --> 00:09:05,430
And so even the simple like S3 will
gladly compute check sum sums for you.

173
00:09:06,030 --> 00:09:08,070
But to do so, it's a batch operation.

174
00:09:08,130 --> 00:09:11,040
And a batch operation costs
$1 per million objects.

175
00:09:11,340 --> 00:09:15,600
If you have a hundred billion objects,
all of a sudden you're faced with:

176
00:09:15,765 --> 00:09:19,965
if I need checksums for all of this data,
I've gotta go drop a hundred K just to

177
00:09:19,965 --> 00:09:25,005
compute checksums because a billion is
just so astronomically large that just

178
00:09:25,005 --> 00:09:27,885
the simple act of getting checksums
for the data you already have in S3

179
00:09:28,530 --> 00:09:30,780
becomes a very serious pricing discussion

180
00:09:31,230 --> 00:09:33,090
if you aren't ready to
drop that kind of coin.
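
[Editor's sketch] The arithmetic Tyler describes is easy to check. This is a minimal illustration, assuming the flat rate of $1 per million objects quoted in the conversation; the function name and the one-billion comparison are hypothetical, and real S3 pricing may differ.

```python
def per_object_charge(object_count: int, dollars_per_million: float = 1.0) -> float:
    """Total cost of an operation that is billed per object.

    dollars_per_million is the rate quoted in the episode ($1 per million
    objects); it is an assumption here, not a verified price.
    """
    return object_count / 1_000_000 * dollars_per_million


ONE_BILLION = 1_000_000_000
HUNDRED_BILLION = 100 * ONE_BILLION

# The same operation at two scales: only the object count changes.
print(f"1 billion objects:   ${per_object_charge(ONE_BILLION):,.0f}")
print(f"100 billion objects: ${per_object_charge(HUNDRED_BILLION):,.0f}")
```

At the quoted rate the charge grows linearly with object count, which is why the "hundred K" figure follows directly from a hundred billion objects.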

181
00:09:33,840 --> 00:09:36,780
Corey: I have to think just again,
you have lived in this space.

182
00:09:36,780 --> 00:09:41,010
I only dabble from time to time, but my
default approach when I start thinking

183
00:09:41,010 --> 00:09:45,780
about this sort of problem is the idea
of lazy checksumming, lazy conversions.

184
00:09:45,990 --> 00:09:48,960
But when I was at Expensify back
in 2012, something that we learned

185
00:09:49,170 --> 00:09:53,850
was that the typical lifecycle of a
receipt was it would be written once

186
00:09:54,120 --> 00:09:56,340
and it would be read either one time

187
00:09:56,520 --> 00:10:01,439
or never, except for the very end of
the long tail where suddenly things

188
00:10:01,439 --> 00:10:05,880
get read a lot years later during an
audit of some sort or when there's

189
00:10:06,000 --> 00:10:07,260
a question of malfeasance.

190
00:10:07,380 --> 00:10:09,030
So you could never get rid of the data.

191
00:10:09,120 --> 00:10:12,030
You had to have it, but the
expectation was it would never get read.

192
00:10:12,270 --> 00:10:16,829
I have to imagine the majority of stuff
that gets uploaded to Scribd, in many

193
00:10:16,829 --> 00:10:18,600
cases, it's there, but it's not accessed.

194
00:10:19,425 --> 00:10:21,585
Tyler: I would say that's
a pretty good assumption.

195
00:10:21,915 --> 00:10:25,425
The interesting thing about user
uploaded content and user uploaded

196
00:10:25,425 --> 00:10:30,405
documents in in particular, is the
long tail is years and years long.

197
00:10:30,795 --> 00:10:34,035
You know, a study guide that was
created for Catcher in the Rye.

198
00:10:34,290 --> 00:10:39,960
In 2010 is still probably just as
useful in 2025 as it was in 2010

199
00:10:39,960 --> 00:10:42,420
because the catcher in the rye is
a classic and people still wanna,

200
00:10:42,420 --> 00:10:45,510
Corey: Yeah, it gets no access, someone posts
a link to it on Reddit one year

201
00:10:45,510 --> 00:10:49,020
when the entire, and then studying
that and yeah, it's impossible

202
00:10:49,020 --> 00:10:50,010
to predict what's gonna hit.

203
00:10:50,670 --> 00:10:51,780
Tyler: It's impossible to predict.

204
00:10:51,780 --> 00:10:56,150
But one of the things that's been
really interesting about Scribd's

205
00:10:56,175 --> 00:11:00,824
particular flavor of content
is in the last couple of years,

206
00:11:00,824 --> 00:11:04,425
large language models have become
really useful for what Scribd does.

207
00:11:04,665 --> 00:11:07,755
I won't speak to the utility of large
language models in other domains,

208
00:11:08,055 --> 00:11:12,314
but the utility in what Scribd
does in particular, has made old

209
00:11:12,314 --> 00:11:16,079
documents suddenly much more useful,
much more interesting, much more

210
00:11:16,079 --> 00:11:20,115
relevant to users today than they have
ever been before. Because,

211
00:11:20,395 --> 00:11:23,275
before this, you'd have to
sort of rely on like a Reddit

212
00:11:23,275 --> 00:11:26,965
post or you know, something to
like reinvigorate a document.

213
00:11:26,965 --> 00:11:27,235
Right?

214
00:11:27,655 --> 00:11:32,745
But now, if we look at all of this
long, almost 20-year history of Scribd,

215
00:11:32,965 --> 00:11:36,295
if you look at that like a knowledge
base, then all of a sudden we're looking

216
00:11:36,295 --> 00:11:40,105
at a very like very broad horizontal
access pattern that we might wanna be

217
00:11:40,105 --> 00:11:44,815
doing for data science use cases or large
language-model-based applications that,

218
00:11:45,055 --> 00:11:48,175
again, flips the access patterns
that you might have in a traditional

219
00:11:48,175 --> 00:11:52,615
user-generated content site on its head
and makes the storage discussion so much

220
00:11:52,615 --> 00:11:54,805
more challenging, but like in a fun way.

221
00:11:56,425 --> 00:12:00,305
Corey: One of the more horrifying parts of
your talk was when you mentioned that you

222
00:12:01,630 --> 00:12:04,870
had done a lot of digging into various
file formats. You were talking about

223
00:12:04,870 --> 00:12:08,260
even ISOs at one point. I'm like,
oh, hey, someone knows what the Joliet

224
00:12:08,260 --> 00:12:09,550
standard is in this day and age.

225
00:12:09,550 --> 00:12:10,209
Imagine that.

226
00:12:10,689 --> 00:12:15,069
But you picked Parquet and then
started using S3 byte offsets on

227
00:12:15,069 --> 00:12:18,670
reads to be able to just grab the end
of an object, and then figure

228
00:12:18,670 --> 00:12:22,870
out where exactly you'd go and grab
things from exploded document views.

229
00:12:22,930 --> 00:12:26,680
It's, it was a very in-depth
approach. It sounds like

230
00:12:27,075 --> 00:12:31,215
reinventing S3 Tables, S3 Metadata
from first principles, because those

231
00:12:31,215 --> 00:12:34,725
things didn't exist back then, and now
that they do, they're no, they're not

232
00:12:34,725 --> 00:12:36,435
even close to being economically viable.

233
00:12:36,765 --> 00:12:37,125
Tyler: Yeah.

234
00:12:37,125 --> 00:12:40,215
I think if, if they were economically
viable, I mean, S3 tables would be

235
00:12:40,215 --> 00:12:41,955
really interesting for this use case.

236
00:12:42,225 --> 00:12:46,275
The really novel thing about Apache
Parquet files is a lot of what we

237
00:12:46,275 --> 00:12:49,605
are doing at Scribd with Apache
Parquet is not new territory.

238
00:12:49,605 --> 00:12:52,305
It's not necessarily novel, it's
how, you know the quote unquote

239
00:12:52,365 --> 00:12:53,775
lakehouse architectures of,

240
00:12:53,945 --> 00:12:58,445
you know, Delta tables and Iceberg tables
and things like that are doing really,

241
00:12:58,445 --> 00:13:02,675
really fast queries and things like
that on top of S3 object stores or, you

242
00:13:02,675 --> 00:13:03,755
know, Azure, blah, blah, blah, blah.

243
00:13:03,845 --> 00:13:07,895
So like the infrastructure for
picking needles out of these parquet

244
00:13:07,895 --> 00:13:10,115
file haystacks already exist.

245
00:13:10,625 --> 00:13:12,750
The work that
I've been doing is

246
00:13:13,735 --> 00:13:17,485
reusing some of the same principles, but
bringing it to a wildly different domain

247
00:13:17,545 --> 00:13:22,195
of this sort of very, very large content
library that Scribd has, and using

248
00:13:22,195 --> 00:13:24,235
that as a way to reduce object size.

249
00:13:24,325 --> 00:13:27,115
Like the whole, the whole thing
that I was trying to get across at

250
00:13:27,115 --> 00:13:30,565
the, the FinOps meetup that
y'all invited me down for was:

251
00:13:31,125 --> 00:13:36,165
the problem of X is really
expensive at a hundred billion objects.

252
00:13:36,465 --> 00:13:41,385
My solution has been okay, not like to
go negotiate with the team or like try to

253
00:13:41,385 --> 00:13:44,475
find a way to make that cheaper, but try
to get the object count actually lower.

254
00:13:44,625 --> 00:13:47,805
Because if you bring it from a hundred
billion to a hundred million, then

255
00:13:47,805 --> 00:13:50,655
we're in a ballpark to where you can
take advantage of intelligent tiering.

256
00:13:50,835 --> 00:13:53,235
Batch Operations become
much easier to do.

257
00:13:53,475 --> 00:13:54,885
All sorts of things become simpler

258
00:13:55,260 --> 00:13:56,850
if you can reduce that object count.

259
00:13:57,240 --> 00:13:59,610
And when I was looking at other
things like ISOs, I mean the

260
00:13:59,610 --> 00:14:00,990
classics are a classic for a reason.

261
00:14:01,110 --> 00:14:02,069
You know, they never die.

262
00:14:02,310 --> 00:14:06,930
Um, when I was looking at that, like
zip, tar, et cetera, um, I wasn't

263
00:14:06,930 --> 00:14:08,939
able to find a way to get random

264
00:14:09,329 --> 00:14:14,189
byte selections within S3 objects
to work nearly as effectively

265
00:14:14,189 --> 00:14:15,510
as I can with Parquet.

266
00:14:15,989 --> 00:14:19,920
Whereas with Parquet, if I know what
file I'm looking for, I can get it

267
00:14:19,920 --> 00:14:24,270
extremely quickly from within, let's
say, a hundred-megabyte file.

268
00:14:24,300 --> 00:14:31,109
I can go grab you 25 kilobytes with the
same level of, I would say, performance

269
00:14:31,349 --> 00:14:36,750
as most other S3 object accesses
work, because S3 has supported these range

270
00:14:36,940 --> 00:14:40,360
requests for a long time, and one of
the bits of trivia that I was very

271
00:14:40,360 --> 00:14:43,390
pleased to discover, which really,
really made this work well is you

272
00:14:43,390 --> 00:14:46,150
can do negative range reads on S3.

273
00:14:46,300 --> 00:14:48,550
So you can look at the tail of a
file, you can look at the middle of

274
00:14:48,550 --> 00:14:51,820
the file, you can grab any part of
a, of an object that exists in S3.

275
00:14:52,090 --> 00:14:54,850
If you just know where in
the file that it, it exists.
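
[Editor's sketch] The "negative range read" Tyler mentions corresponds to an HTTP suffix range: "bytes=-N" asks for the last N bytes of an object, while "bytes=start-end" asks for an inclusive span in the middle. The helpers below are a hypothetical illustration; read_range assumes the third-party boto3 package and configured AWS credentials, and the bucket and key names are placeholders.

```python
def suffix_range(last_n_bytes: int) -> str:
    """Range header value for the final N bytes of an object (the tail read)."""
    if last_n_bytes <= 0:
        raise ValueError("last_n_bytes must be positive")
    return f"bytes=-{last_n_bytes}"


def span_range(start: int, length: int) -> str:
    """Range header value for `length` bytes beginning at offset `start`.
    HTTP ranges are inclusive, so the end offset is start + length - 1."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be positive")
    return f"bytes={start}-{start + length - 1}"


def read_range(bucket: str, key: str, range_header: str) -> bytes:
    """Fetch only the requested bytes of one S3 object (needs boto3 + credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    response = boto3.client("s3").get_object(Bucket=bucket, Key=key, Range=range_header)
    return response["Body"].read()


print(suffix_range(8))               # tail of the object
print(span_range(1000, 25 * 1024))   # 25 KiB from the middle of a larger object
```

A call such as read_range("example-bucket", "packed/part-0001.parquet", suffix_range(8)) would return just the final 8 bytes, assuming such an object exists.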

276
00:14:55,385 --> 00:14:58,385
Corey: Which is magic,
and there's a, it's magic.

277
00:14:58,385 --> 00:15:02,315
The, the downside, of course, is you
have to know it's there, you have

278
00:15:02,315 --> 00:15:05,615
to have an in-depth understanding of
your workload, which you folks do.

279
00:15:05,975 --> 00:15:07,955
This is also, I think, the
curse of retail pricing.

280
00:15:08,105 --> 00:15:13,700
Um, it's no secret at this point that at
scale, nobody's paying retail, but mm-hmm.

281
00:15:14,095 --> 00:15:16,765
And like, well, of course we're gonna
work and negotiate with you on this

282
00:15:16,765 --> 00:15:18,535
in return for long-term commitments.

283
00:15:18,535 --> 00:15:21,775
We'll wind up giving you various degrees
of discounts, but when you're sitting

284
00:15:21,775 --> 00:15:26,035
there just doing the napkin math to figure
out, okay, if I have a hundred billion

285
00:15:26,035 --> 00:15:28,735
objects and what are you gonna charge me?

286
00:15:28,765 --> 00:15:29,875
Okay, nevermind.

287
00:15:29,875 --> 00:15:33,385
We're going to move on to the next
part of the conversation, because it

288
00:15:33,385 --> 00:15:35,335
doesn't occur to you to go up there.

289
00:15:35,335 --> 00:15:40,285
It's like, so I need at minimum,
a 98% discount on this particular

290
00:15:40,285 --> 00:15:42,655
dimension, even if that's attainable.

291
00:15:42,885 --> 00:15:44,324
It sounds ludicrous.

292
00:15:44,324 --> 00:15:47,444
Like there, there's no way you would even
be able to say that with a straight face.

293
00:15:47,444 --> 00:15:52,095
Like, I'm not gonna go into a car
dealership and ask for a car for 20 bucks

294
00:15:52,095 --> 00:15:53,865
because it's just wasting everyone's time.

295
00:15:54,285 --> 00:15:55,665
Same principle applies here.

296
00:15:55,725 --> 00:15:58,574
The, they have priced themselves out
of some very interesting conversations.

297
00:15:59,265 --> 00:16:03,765
Tyler: The same principle applies,
I think the, you know, the, the

298
00:16:03,765 --> 00:16:05,805
problem domain that we are faced with.

299
00:16:06,180 --> 00:16:10,200
Uh, on top of S3, I think S3, as I
think you've claimed a number of times,

300
00:16:10,200 --> 00:16:11,730
is the eighth wonder of the world.

301
00:16:11,730 --> 00:16:14,130
It is a fantastic piece of infrastructure.

302
00:16:14,820 --> 00:16:20,520
Building on top of it enables so many
different use cases, but when you've

303
00:16:20,520 --> 00:16:23,640
got a large enough scale, you've
got really interesting problems.

304
00:16:23,910 --> 00:16:26,790
And being an engineer, this
is certainly a bias, right?

305
00:16:26,790 --> 00:16:28,770
Like, I don't wanna look
away from those problems.

306
00:16:29,190 --> 00:16:33,360
Getting, getting things cheaper is
sometimes easier to do just with paper.

307
00:16:33,480 --> 00:16:37,260
You know, just signing a contract and
other times stepping back far enough to

308
00:16:37,260 --> 00:16:38,670
look at what we're trying to accomplish.

309
00:16:38,670 --> 00:16:43,230
And coming up with an interesting
technology solution is also a perfectly

310
00:16:43,230 --> 00:16:44,580
reasonable way to solve the problem.

311
00:16:45,180 --> 00:16:49,350
And I think the way that, that I
really am trying to approach what

312
00:16:49,350 --> 00:16:53,550
we're doing with S3 at Scribd is it's
not just about getting the bill lower.

313
00:16:54,170 --> 00:16:57,469
Nobody is gonna give me the time,
the money to, to make the bill lower.

314
00:16:57,920 --> 00:17:00,319
But if I can give us new
capabilities by expanding

315
00:17:01,620 --> 00:17:04,349
what we can do with these hundred
billion objects within the organization,

316
00:17:04,650 --> 00:17:09,329
that is a capability change that you
get from a technology based solution as

317
00:17:09,329 --> 00:17:13,170
opposed to a policy or, you know, uh,
uh, you know, contract based solution.

318
00:17:13,589 --> 00:17:15,089
Both are equally valid, right?

319
00:17:15,450 --> 00:17:17,940
But I'm much better at
one than I am at the other.

320
00:17:18,210 --> 00:17:21,210
Um, that may be a different
story for you, but I'm better

321
00:17:21,210 --> 00:17:22,589
with the, uh, let's build some

322
00:17:23,194 --> 00:17:26,795
code that's gonna solve some
big problems, and hopefully that'll

323
00:17:26,795 --> 00:17:30,065
make the, the chart go down in a
way that makes time, uh, in finance

324
00:17:30,065 --> 00:17:30,365
happy.

325
00:17:31,055 --> 00:17:35,525
Corey: This episode is sponsored by my
own company, Duckbill. Having trouble

326
00:17:35,525 --> 00:17:40,505
with your AWS bill? Perhaps it's time
to renegotiate a contract with them.

327
00:17:40,835 --> 00:17:45,710
Maybe you're just wondering how to predict
what's going on in the wide world of AWS.

328
00:17:46,285 --> 00:17:48,925
Well, that's where Duckbill
comes in to help.

329
00:17:49,105 --> 00:17:51,865
Remember, you can't duck the Duckbill

330
00:17:51,865 --> 00:17:55,465
bill, which I am reliably
informed by my business partner

331
00:17:55,585 --> 00:17:58,075
is absolutely not our motto.

332
00:17:58,315 --> 00:18:01,555
To learn more, visit duckbillhq.com.

333
00:18:02,385 --> 00:18:05,745
Feels like half the time you look
deep in the bill, like every different

334
00:18:05,745 --> 00:18:07,545
usage item, there's a reason for it.

335
00:18:07,545 --> 00:18:09,105
There's ways to optimize around it.

336
00:18:09,675 --> 00:18:09,764
Mm-hmm.

337
00:18:09,764 --> 00:18:11,295
But, but at small to midsize scale,

338
00:18:11,295 --> 00:18:14,535
it feels like it's just a tax on
not knowing those intricacies.

339
00:18:14,774 --> 00:18:18,585
It's also, frankly, why the bigger
bills get less interesting because

340
00:18:18,585 --> 00:18:21,675
you can have a weird misconfiguration
that's, you know, a significant portion

341
00:18:21,675 --> 00:18:24,915
of an $80,000 monthly bill, but by
the time you're like, you know, we

342
00:18:24,915 --> 00:18:26,295
spend a hundred million bucks a year,

343
00:18:27,060 --> 00:18:30,720
no one's gonna spend 40 million of
that on NAT Gateway data processing,

344
00:18:30,720 --> 00:18:32,700
'cause someone's gonna ask where
the hell the money's going

345
00:18:32,700 --> 00:18:34,200
long before it gets to that point.

346
00:18:34,380 --> 00:18:35,490
So it starts to normalize.

347
00:18:35,490 --> 00:18:38,040
You see the, the usual
suspects and services.

348
00:18:38,100 --> 00:18:38,190
Mm-hmm.

349
00:18:38,430 --> 00:18:40,980
S3 of course, being one of
the top three every time.

350
00:18:41,340 --> 00:18:44,460
But in your case, it's, it's not
just about the fact that it's

351
00:18:44,460 --> 00:18:45,930
S3, it's what is the usage type?

352
00:18:45,930 --> 00:18:47,190
What is the dimension breakdown?

353
00:18:47,190 --> 00:18:51,570
What is the ratio of, uh,
of requests to bytes stored?

354
00:18:51,750 --> 00:18:53,280
That's where it starts to
get really interesting.

355
00:18:53,550 --> 00:18:57,179
And there's still no really good
way of saying, oh, 99% of it's this

356
00:18:57,179 --> 00:19:01,560
one S3 bucket because you have to
go diving even to get that out.

357
00:19:01,920 --> 00:19:02,010
It's,

358
00:19:03,930 --> 00:19:06,570
Tyler: You have to go diving to,
to get the specifics on where data

359
00:19:06,570 --> 00:19:10,140
is being stored, especially as it
starts to get more and more costly.

360
00:19:10,530 --> 00:19:14,820
But the use cases that I see more and
more, and, and this is sort of because

361
00:19:14,820 --> 00:19:19,560
of the, the time that we're in right
now, is if you give a data scientist

362
00:19:19,790 --> 00:19:20,270
a bucket,

363
00:19:20,300 --> 00:19:23,780
if you give a data scientist a
table or an engineer a table,

364
00:19:24,050 --> 00:19:26,959
they're gonna start to put data in
it and it starts to explode over time

365
00:19:26,959 --> 00:19:30,020
to where we start to have data sizes
that get large enough to where you're

366
00:19:30,020 --> 00:19:32,030
like, okay, should this be in S3?

367
00:19:32,360 --> 00:19:33,530
We need it to be online.

368
00:19:33,530 --> 00:19:34,459
Should it be in Aurora?

369
00:19:34,459 --> 00:19:35,870
Should it be in ElastiCache? Like,

370
00:19:36,435 --> 00:19:40,665
there's all of these very interesting data
scale problems that are starting to creep

371
00:19:40,665 --> 00:19:45,045
up because data has become so much more
intrinsic in, you know, the, the product

372
00:19:45,045 --> 00:19:48,705
value or what we can do that's really
interesting and everybody wants the data

373
00:19:48,764 --> 00:19:54,135
all the time in every surface possible
with as little latency as possible. And,

374
00:19:54,135 --> 00:19:55,935
normalizing for everything else,

375
00:19:56,205 --> 00:19:59,415
S3 is so like incredibly fast.

376
00:19:59,595 --> 00:20:00,764
It is incredibly fast.

377
00:20:00,764 --> 00:20:02,055
It is incredibly cheap.

378
00:20:02,415 --> 00:20:04,095
You just have to know how to store data in it

379
00:20:04,425 --> 00:20:06,105
to take advantage of those two properties.

380
00:20:06,555 --> 00:20:10,545
And that's, that's sort of the, the,
the, the thrust of the work that I've

381
00:20:10,545 --> 00:20:15,945
been doing over the last year is if you
know how to wield S3, it is probably

382
00:20:15,945 --> 00:20:19,435
the most powerful tool in the toolbox,
but you have to know how to wield it.

383
00:20:19,860 --> 00:20:20,100
Corey: Yeah.

384
00:20:20,159 --> 00:20:20,760
It's magic.

385
00:20:20,895 --> 00:20:23,040
It, it is infinite storage, by definition.

386
00:20:23,040 --> 00:20:24,235
Yes, they can, they can.

387
00:20:24,334 --> 00:20:24,554
Yes.

388
00:20:24,554 --> 00:20:26,070
It's faster than you can fill it.

389
00:20:26,129 --> 00:20:30,240
I know that because I've
done some, uh, yeah.

390
00:20:30,300 --> 00:20:32,220
That's why my test environment is
other people's production accounts.

391
00:20:32,879 --> 00:20:36,120
Uh, it's, it also changes
the nature of behaviors.

392
00:20:36,300 --> 00:20:39,300
If I were to go into a data center
environment and say, great, I

393
00:20:39,300 --> 00:20:42,419
need to just store another 10
petabytes, uh, over in that rack

394
00:20:42,419 --> 00:20:44,040
next week, the response would be.

395
00:20:44,305 --> 00:20:44,395
Right.

396
00:20:45,805 --> 00:20:47,815
Uh, is there someone
smarter I can speak with?

397
00:20:47,815 --> 00:20:49,195
Do you have a goldfish perhaps?

398
00:20:49,525 --> 00:20:55,675
Uh, whereas work with S3 is just
effectively one gigabyte at a time, and

399
00:20:55,975 --> 00:20:59,335
there is no forcing function where, well,
now we need to get a whole bunch more

400
00:20:59,335 --> 00:21:01,225
power, cooling, hardware, et cetera.

401
00:21:01,465 --> 00:21:04,285
You can just keep incrementally
adding and growing forever.

402
00:21:04,285 --> 00:21:05,065
There is no upper

403
00:21:05,150 --> 00:21:08,210
bound to speak of, at least
not at anything you or I

404
00:21:08,210 --> 00:21:08,990
are ever going to encounter.

405
00:21:08,990 --> 00:21:12,800
'cause the money is the bound there
and there's no forcing function that

406
00:21:12,800 --> 00:21:14,150
makes us clean up after ourselves.

407
00:21:14,180 --> 00:21:16,070
It is an unbounded growth problem.

408
00:21:17,030 --> 00:21:18,500
Tyler: It is an unbounded growth problem.

409
00:21:18,500 --> 00:21:21,860
I think there's a, there's an
industry change that has happened

410
00:21:21,860 --> 00:21:22,730
that has influenced this.

411
00:21:22,730 --> 00:21:26,030
I was having a chat with, uh, with
my friend Denny from Databricks about

412
00:21:26,030 --> 00:21:30,140
this. When I first came up in
the industry, how you stored data,

413
00:21:30,524 --> 00:21:34,754
whether it was online or offline,
was in a relational data store of

414
00:21:34,754 --> 00:21:36,314
some form or the data warehouse.

415
00:21:36,524 --> 00:21:40,455
And the goal of all of these foreign key
constraints and the relations between them

416
00:21:40,695 --> 00:21:42,975
was to only store any one piece of data

417
00:21:42,975 --> 00:21:46,814
once. In the last 10 years,
we've said to hell with that.

418
00:21:47,235 --> 00:21:48,525
Denormalize the data.

419
00:21:48,645 --> 00:21:53,085
It's faster to just denormalize it
and to create a copy plus one column of

420
00:21:53,085 --> 00:21:57,165
this table rather than to try to manage
all of these relationships between data.

421
00:21:57,495 --> 00:22:01,395
And so we have, like, excessive
denormalization of data happening across

422
00:22:01,395 --> 00:22:03,495
the data layer for good reasons.

423
00:22:03,730 --> 00:22:05,230
In theory, but for good reasons.

424
00:22:05,379 --> 00:22:10,030
But what that means is this unbounded
growth has happened because we have

425
00:22:10,060 --> 00:22:11,800
infinitely cheap storage, you know?

426
00:22:11,860 --> 00:22:12,100
Right.

427
00:22:12,370 --> 00:22:16,360
And then we also have this push of
denormalizing data, which leads

428
00:22:16,360 --> 00:22:20,920
to crazy data growth as time goes
on, because most new data sets are

429
00:22:20,920 --> 00:22:22,870
not net new original data sets.

430
00:22:23,050 --> 00:22:26,830
They are that old table or
that old dataset, plus some new

431
00:22:26,830 --> 00:22:31,240
properties that I've added is
now an entirely new, you know,

432
00:22:31,649 --> 00:22:34,740
prefix, an entirely new bucket
of, of stuff, right?

433
00:22:35,159 --> 00:22:37,949
And so rather than trying to find
a way to get the least amount of

434
00:22:37,949 --> 00:22:39,495
storage used, we said to hell with it.

435
00:22:40,305 --> 00:22:41,235
S3 is cheap enough.

436
00:22:41,565 --> 00:22:45,765
Just copy the data a bunch, and that
works great until it stops working.

437
00:22:46,370 --> 00:22:49,035
And at the billions is
where it stops working.

438
00:22:49,035 --> 00:22:51,525
Corey: And also until you realize,
okay, you had a data scientist,

439
00:22:51,555 --> 00:22:55,605
uh, copy away five petabytes to do
a quick experiment for two weeks.

440
00:22:55,665 --> 00:22:58,935
Uh, she left the company
in 2012 and, oops, a doozy.

441
00:22:58,935 --> 00:23:00,475
We probably should have
cleaned that up before.

442
00:23:00,745 --> 00:23:01,195
Tyler: Yeah.

443
00:23:01,375 --> 00:23:03,145
That's where intelligent
tiering could really help.

444
00:23:04,705 --> 00:23:08,515
Intelligent tiering and object access
logs have been used quite aggressively

445
00:23:08,605 --> 00:23:12,295
by, by myself and some other folks
that I, I work with to identify

446
00:23:12,295 --> 00:23:16,405
exactly those data sets that were
orphaned, that were suspiciously huge

447
00:23:16,825 --> 00:23:18,295
Corey: Object access logs are great.

448
00:23:18,325 --> 00:23:20,095
Uh, CloudTrail data events are terrible.

449
00:23:20,095 --> 00:23:21,775
I've, I've done the math on this.

450
00:23:21,805 --> 00:23:25,915
It's something like 25 times more
expensive than the S3 request

451
00:23:26,185 --> 00:23:30,025
to write the CloudTrail data
event recording that request.

452
00:23:30,680 --> 00:23:34,280
So professionally speaking, don't do that.

453
00:23:34,490 --> 00:23:37,310
The access logs are a lot more reasonable.

454
00:23:37,315 --> 00:23:39,830
It, it reminds us of the old,
good old days of server web

455
00:23:39,835 --> 00:23:41,025
logs from Apache and whatnot.

456
00:23:41,025 --> 00:23:41,345
Mm-hmm.

457
00:23:42,225 --> 00:23:45,014
Tyler: I set up Webalizer, I know
exactly where things are going

458
00:23:46,004 --> 00:23:46,485
Corey: Exactly.

459
00:23:46,844 --> 00:23:49,814
And uh, my log analytics
suite, uh, that was a very

460
00:23:49,814 --> 00:23:54,024
convoluted one-line awk script.

461
00:23:54,044 --> 00:23:56,114
When you have underpowered systems,
what else are you gonna do?

462
00:23:56,354 --> 00:24:01,574
It's hitting my server and making it
fall over or, uh, tail -f and try and

463
00:24:01,574 --> 00:24:02,669
strain the tea leaves with your teeth.

464
00:24:03,205 --> 00:24:04,764
Tyler: That's what, that's how
I've been doing it actually.

465
00:24:05,935 --> 00:24:05,995
Corey: Yeah.

466
00:24:05,995 --> 00:24:07,195
There are worse choices you could make.

467
00:24:07,735 --> 00:24:11,665
Uh, unfortunately, I, I really wish you
could give a version of your talk at

468
00:24:11,665 --> 00:24:15,504
re:Invent, but it doesn't involve AI,
so there's no way in the world that

469
00:24:15,504 --> 00:24:18,895
it would ever fit on the schedule.
Uh, someone just did the analytics:

470
00:24:18,895 --> 00:24:23,814
something like 400-and-change of the
roughly 500, uh, talks in the, uh, in the

471
00:24:23,814 --> 00:24:28,854
catalog so far are about, or mention, AI
at least once in the title or description.

472
00:24:29,125 --> 00:24:29,425
Tyler: Yeah.

473
00:24:29,514 --> 00:24:30,445
I mean, that's how you get.

474
00:24:30,895 --> 00:24:31,855
That's how you get on stage.

475
00:24:31,855 --> 00:24:32,905
That's how you get some funding.

476
00:24:32,965 --> 00:24:34,945
I mean, you gotta have some
little, some hashtag AI.

477
00:24:35,425 --> 00:24:40,945
I think the, the thing about the AI use
cases, there's a lot of really interesting

478
00:24:40,945 --> 00:24:45,835
things that people are doing that's,
you know, quote unquote AI on, on AWS.

479
00:24:46,554 --> 00:24:50,004
Some of those are with AWS's AI tools.

480
00:24:50,395 --> 00:24:52,165
A lot of them are kind of conventional:

481
00:24:53,070 --> 00:24:58,500
SageMaker, conventional vector stores,
conventional, like, the tools of three

482
00:24:58,500 --> 00:25:01,710
or four years ago. The stuff that's
coming out now or that that has been

483
00:25:01,710 --> 00:25:03,179
announced in the last year or so.

484
00:25:03,389 --> 00:25:05,730
I think we're gonna see a couple
more years before that's really

485
00:25:06,285 --> 00:25:09,705
used in anger for production
products and things like that.

486
00:25:10,335 --> 00:25:12,555
Corey: Th this is the problem too,
is that you're building things out.

487
00:25:12,555 --> 00:25:15,135
Like I, it's easy as hell for me
to go to you and point at some of

488
00:25:15,135 --> 00:25:17,685
these new features, like, well,
why didn't you idiots just build

489
00:25:17,685 --> 00:25:21,285
this thing, build Scribd on top of
this? Because this may shock you:

490
00:25:21,365 --> 00:25:23,475
Scribd is older than three months.

491
00:25:23,505 --> 00:25:23,775
Who knew?

492
00:25:26,865 --> 00:25:27,165
Tyler: Yeah.

493
00:25:27,180 --> 00:25:30,165
I mean, the thing that's fascinating
about Scribd in that regard is

494
00:25:30,675 --> 00:25:32,834
Scribd has evolved, right?

495
00:25:32,834 --> 00:25:35,264
Like we've had very different
business models depending on which,

496
00:25:35,324 --> 00:25:37,094
you know, era of Scribd Yeah.

497
00:25:37,155 --> 00:25:38,324
That, that you look at.

498
00:25:38,685 --> 00:25:43,185
And the design constraints that
influence systems change over time.

499
00:25:43,185 --> 00:25:46,665
And so I think it's a great thing for
most engineers to work for a company

500
00:25:46,665 --> 00:25:50,415
that has had to, like, join a company that

501
00:25:51,020 --> 00:25:52,610
doesn't have greenfield problems.

502
00:25:52,760 --> 00:25:54,920
Join a company that's been
around for five, 10 years.

503
00:25:55,190 --> 00:25:58,220
Because there's really interesting
engineering challenges

504
00:25:58,430 --> 00:26:03,620
when you look at a system that was built
for an era that no longer exists and have

505
00:26:03,620 --> 00:26:05,300
to figure out, how do I convert this?

506
00:26:05,300 --> 00:26:08,150
How do I make this work
in where we are today?

507
00:26:08,930 --> 00:26:11,300
Because you've gotta, you've
got like, every problem has

508
00:26:11,300 --> 00:26:12,890
constraints, but the constraints of

509
00:26:13,919 --> 00:26:15,780
somebody's solution yesterday,

510
00:26:16,080 --> 00:26:20,580
bringing that to today is a very
interesting mental challenge because

511
00:26:20,580 --> 00:26:21,899
you can't throw it out and rebuild it.

512
00:26:22,139 --> 00:26:25,199
You've gotta find a way to
evolve the system as opposed

513
00:26:25,199 --> 00:26:26,100
to burning it to the ground.

514
00:26:26,655 --> 00:26:29,189
Corey: I, I make this observation
from time to time in meetings that

515
00:26:29,189 --> 00:26:32,909
it doesn't take a particularly bright
solutions engineer to be able to

516
00:26:32,909 --> 00:26:36,000
fill an empty whiteboard with an
architecture that largely solves any

517
00:26:36,000 --> 00:26:37,350
given problem you want to throw at them.

518
00:26:37,620 --> 00:26:39,540
The thing is, is that we are not

519
00:26:39,565 --> 00:26:43,345
starting from square one with a greenfield
approach for almost anything these days.

520
00:26:43,375 --> 00:26:44,545
It's great.

521
00:26:44,575 --> 00:26:48,205
We already have an existing business
and existing systems that are doing

522
00:26:48,205 --> 00:26:51,745
this, and no, we don't get to just
turn it all off for 18 months while we

523
00:26:51,745 --> 00:26:54,775
decide to start over from scratch and
then do a slow leisurely migration.

524
00:26:54,775 --> 00:26:56,425
There has to be a path forward here.

525
00:26:58,064 --> 00:27:01,365
Tyler: Yes, there has to be a path
forward, and in Scribd's case, what makes

526
00:27:01,365 --> 00:27:02,834
it interesting is we still get uploads.

527
00:27:02,834 --> 00:27:03,975
Like we're getting uploads.

528
00:27:04,004 --> 00:27:07,485
Thousands and thousands of uploads
happen every day, and so any storage

529
00:27:07,485 --> 00:27:14,205
solution that we come up with on top
of S3 has to slip in or follow behind

530
00:27:14,264 --> 00:27:17,804
something that's getting thousands and
thousands of uploads every single day

531
00:27:18,044 --> 00:27:19,875
and all of the objects that that created.

532
00:27:19,875 --> 00:27:21,195
I was looking at a late,

533
00:27:21,375 --> 00:27:22,365
Corey: I think it was a free database.

534
00:27:23,505 --> 00:27:23,895
Tyler: Thank you.

535
00:27:24,195 --> 00:27:27,495
Uh, thank you for the engagement,
I guess, uh, it's really helpful.

536
00:27:27,795 --> 00:27:28,635
Anything's a database,

537
00:27:28,635 --> 00:27:29,355
Corey: when you hold it wrong.

538
00:27:30,195 --> 00:27:32,865
Tyler: Um, please don't
use Scribd as a database.

539
00:27:33,915 --> 00:27:34,555
Corey: I don't mean Scribd.

540
00:27:34,575 --> 00:27:36,435
I mean you personally,
you have information.

541
00:27:36,555 --> 00:27:38,355
Oh, we can ask you about
it and get answers.

542
00:27:38,505 --> 00:27:39,150
You're just a slow

543
00:27:39,585 --> 00:27:40,245
Tyler: database done.

544
00:27:40,545 --> 00:27:43,335
I am very, I'm a very slow
database, lossy as well.

545
00:27:43,575 --> 00:27:48,645
Um, I was looking at, uh, some,
some assets that I had put

546
00:27:48,645 --> 00:27:51,315
together for a discussion with AWS
at the beginning of the summer.

547
00:27:52,304 --> 00:27:55,695
And on the whiteboard, I
had put one number, you know

548
00:27:55,905 --> 00:27:57,314
this many billion objects.

549
00:27:57,405 --> 00:28:01,574
When I went and I looked at that in the
last two weeks, there are another couple

550
00:28:01,574 --> 00:28:05,235
billion objects in that bucket compared
to what I had put on the whiteboard.

551
00:28:05,985 --> 00:28:08,925
So when you're looking at a
system that is in display exists,

552
00:28:09,795 --> 00:28:14,055
Corey: additional use of the service and
the site, not the classic architectural

553
00:28:14,055 --> 00:28:16,035
pattern called Lambda invokes itself.

554
00:28:16,995 --> 00:28:19,815
Tyler: It was not Lambda invokes
itself, though I am a fan of that one.

555
00:28:20,235 --> 00:28:22,275
Corey: Also, don't put your
logs into the same bucket

556
00:28:22,275 --> 00:28:23,955
you're recording the object access in.

557
00:28:24,899 --> 00:28:27,780
Tyler: That's a, yeah, I, I
don't think I've done that one.

558
00:28:27,899 --> 00:28:29,370
I have done the recursive Lambda.

559
00:28:29,370 --> 00:28:29,940
Well, are we fed?

560
00:28:32,040 --> 00:28:37,710
Um, but like the, the challenge of
architecting or designing something

561
00:28:37,830 --> 00:28:42,930
that has to handle massive scale, but
also work through 18 years of massive

562
00:28:42,930 --> 00:28:44,940
scale, it's such a fascinating problem.

563
00:28:44,940 --> 00:28:49,620
Like there is no bigger problem at
Scribd that has me this excited than

564
00:28:49,620 --> 00:28:54,030
figuring out how we take 18 years
of the largest document library

565
00:28:54,255 --> 00:28:55,635
that exists, as far as I'm aware.

566
00:28:55,875 --> 00:28:55,995
Um.

567
00:28:57,120 --> 00:28:58,890
Figuring out how to make that useful.

568
00:28:59,190 --> 00:29:03,840
You know, high performance, easily
accessible and give new capabilities.

569
00:29:03,840 --> 00:29:05,760
Like S3 is, the S3

570
00:29:05,760 --> 00:29:07,650
service team is
doing this now as well.

571
00:29:07,650 --> 00:29:11,700
Like they're trying to find new ways
to get new capabilities out of S3,

572
00:29:11,880 --> 00:29:12,090
Corey: you know?

573
00:29:12,150 --> 00:29:14,550
Well, they're never breaking the
old capabilities except turning.

574
00:29:14,550 --> 00:29:15,120
Well never break.

575
00:29:15,540 --> 00:29:18,450
None of us want, like, SOAP
endpoints for their API.

576
00:29:18,450 --> 00:29:18,480
Oh.

577
00:29:19,230 --> 00:29:21,060
It's like you're one of them worse.

578
00:29:21,570 --> 00:29:26,070
And I recently discovered the
soap as the seeding as a bucket.

579
00:29:26,985 --> 00:29:28,245
That used to do that sounds like.

580
00:29:28,245 --> 00:29:28,725
I'm kidding.

581
00:29:29,775 --> 00:29:30,495
Tyler: I love that.

582
00:29:30,555 --> 00:29:34,515
I love that SOAP endpoints are still
supported until October of 2025.

583
00:29:34,695 --> 00:29:36,825
It makes me laugh so much that I saw that,

584
00:29:37,035 --> 00:29:37,425
Corey: oh, next

585
00:29:37,430 --> 00:29:39,165
month I missed that.

586
00:29:39,285 --> 00:29:39,885
Okay.

587
00:29:40,635 --> 00:29:40,755
They,

588
00:29:41,115 --> 00:29:42,285
those deprecation warnings.

589
00:29:44,005 --> 00:29:47,665
Tyler: I discovered SOAP recently
for S3, and I discovered that it

590
00:29:47,665 --> 00:29:50,695
was going away recently, both within
about two days of each other, not

591
00:29:50,695 --> 00:29:51,745
that I was planning on using SOAP.

592
00:29:52,105 --> 00:29:57,835
Um, there's a lot of really cool tools
in the S3 toolbox that are underutilized.

593
00:29:57,865 --> 00:30:00,265
I mean, Object Lambda, I've, I've
chatted with you about before.

594
00:30:00,265 --> 00:30:03,025
I think Object Lambda is super
cool, but I don't think it's

595
00:30:03,025 --> 00:30:04,195
seeing a lot of attention.

596
00:30:04,555 --> 00:30:09,415
I think S3 Select was a good idea
that maybe didn't materialize

597
00:30:09,445 --> 00:30:11,125
in, in any particular way, like.

598
00:30:11,330 --> 00:30:15,170
Over the years, the S3 service team
has been doing interesting things, and

599
00:30:15,170 --> 00:30:18,350
I think only now in the last two years
have they like found their footing with

600
00:30:18,350 --> 00:30:22,040
the metadata tables, different types
of buckets, and then vector, uh, vector

601
00:30:22,040 --> 00:30:26,330
buckets, and found a way to like move
up from object storage in a way that

602
00:30:26,360 --> 00:30:29,990
is really fascinating to see how the,
the industry around it is gonna change.

603
00:30:30,350 --> 00:30:34,370
'cause as you've pointed out, like S3
is backwards compatible with just the

604
00:30:34,370 --> 00:30:38,720
classic S3 API, but S3 means a lot of
different things now depend like, it's

605
00:30:38,720 --> 00:30:40,220
like SageMaker, like what do you mean by.

606
00:30:40,845 --> 00:30:41,115
Three.

607
00:30:41,115 --> 00:30:42,645
Like what part of S3
are you talking about?

608
00:30:43,095 --> 00:30:44,985
Um, in my case, I'm just
talking about objects.

609
00:30:45,075 --> 00:30:46,425
I'm not talking about
this other magic stuff.

610
00:30:47,535 --> 00:30:47,715
Corey: Yeah.

611
00:30:47,745 --> 00:30:48,255
Don't worry.

612
00:30:48,315 --> 00:30:50,415
Uh, those will continue to work.

613
00:30:50,415 --> 00:30:54,195
They kind of have to, but it's weird
to almost wonder if you go on a, on a

614
00:30:54,195 --> 00:30:58,155
cruise for the next five years and come
back, how little or how much of the

615
00:30:58,155 --> 00:31:00,495
then current stuff would you recognize?

616
00:31:00,645 --> 00:31:03,375
It's, they keep dripping out
feature enhancements, but they

617
00:31:03,375 --> 00:31:04,575
add up to something meaningful.

618
00:31:05,565 --> 00:31:06,735
Tyler: They do, they do.

619
00:31:06,825 --> 00:31:13,755
There's very, there's a very clear data
strategy from the S3 service team that's

620
00:31:13,755 --> 00:31:16,935
working in concert with, with some of
the other parts, uh, like the Bedrock

621
00:31:16,935 --> 00:31:19,515
team, um, and the SageMaker teams.

622
00:31:20,325 --> 00:31:21,585
To where it is to me.

623
00:31:21,885 --> 00:31:23,775
I mean, I am all about data.

624
00:31:23,775 --> 00:31:24,465
I love data.

625
00:31:24,465 --> 00:31:26,715
I work on data products and data tools.

626
00:31:26,715 --> 00:31:27,855
Like this is my jam.

627
00:31:28,095 --> 00:31:30,645
But it's very like, there's never
been a more exciting time, in my

628
00:31:30,645 --> 00:31:34,695
opinion, to be building on top of S3
as a platform because the platform

629
00:31:34,695 --> 00:31:39,645
itself is rock solid, super fast, super
cheap, and getting more capabilities

630
00:31:40,555 --> 00:31:42,295
every re:Invent, which is super cool.

631
00:31:42,745 --> 00:31:44,425
Corey: Yeah, it's, it's
really neat to see.

632
00:31:44,845 --> 00:31:48,085
Uh, I want to thank you for taking the
time to speak to me about all of this.

633
00:31:48,115 --> 00:31:51,295
If people wanna learn more, and I
strongly suggest they do, where's

634
00:31:51,295 --> 00:31:52,285
the best place for them to find you?

635
00:31:53,215 --> 00:31:55,495
Tyler: Uh, that is a great question.

636
00:31:56,605 --> 00:32:00,625
You can, uh, you can find
my shit posts on Mastodon.

637
00:32:01,080 --> 00:32:02,760
So I'm just hacky.

638
00:32:03,000 --> 00:32:03,930
My server got eaten when

639
00:32:03,930 --> 00:32:08,490
Corey: I had, uh, 10,000 got, I
dunno, the energy to do it again.

640
00:32:09,450 --> 00:32:10,050
Tyler: Yeah, I

641
00:32:10,050 --> 00:32:10,350
Corey: know.

642
00:32:11,400 --> 00:32:14,250
Tyler: I noticed that QuinnyPig
was on Mastodon for a minute and

643
00:32:14,250 --> 00:32:16,575
then I didn't know what happened
and then you just, yeah, I

644
00:32:16,575 --> 00:32:18,450
Corey: basically didn't, I
missed the deprecation warning

645
00:32:18,450 --> 00:32:19,770
and suddenly, nope, no backups.

646
00:32:19,770 --> 00:32:19,890
You're.

647
00:32:20,800 --> 00:32:21,040
Tyler: Oh, no.

648
00:32:22,930 --> 00:32:25,750
Uh, it is definitely the do
it yourself social network.

649
00:32:26,920 --> 00:32:29,440
Corey: Yeah, which is great if I
want to talk to a very specific

650
00:32:29,440 --> 00:32:33,160
subset, uh, archetype of a person
and absolutely no one else.

651
00:32:34,450 --> 00:32:35,710
Tyler: Fair.

652
00:32:36,160 --> 00:32:36,760
Absolutely fair.

653
00:32:36,760 --> 00:32:40,750
I'd say, um, the Scribd tech blog, uh,
tech.scribd.com is a good place to find

654
00:32:40,750 --> 00:32:42,910
some stuff that we periodically share.

655
00:32:43,210 --> 00:32:45,700
Uh, Mastodon's probably the easiest
place to find me, or GitHub.

656
00:32:45,730 --> 00:32:46,840
I'm just R Tyler on GitHub.

657
00:32:47,235 --> 00:32:50,564
Like for the last 20 years that
GitHub's been around, GitHub's

658
00:32:50,564 --> 00:32:52,034
been my primary social network.

659
00:32:52,034 --> 00:32:53,865
So that's where people can find me.

660
00:32:54,165 --> 00:32:54,465
Corey: Awesome.

661
00:32:54,465 --> 00:32:58,274
Mine, uh, historically, my primary
social network is basically Notepad and

662
00:32:58,274 --> 00:33:00,584
text files and terrible data security.

663
00:33:00,584 --> 00:33:01,965
But that, you know,
that's beside the point.

664
00:33:02,354 --> 00:33:04,754
Thank you so much for taking
the time to speak with me.

665
00:33:04,754 --> 00:33:05,534
I appreciate it.

666
00:33:06,419 --> 00:33:06,840
Tyler: Thanks, Corey.

667
00:33:07,860 --> 00:33:10,949
Corey: R Tyler Croy,
infrastructure architect at Scribd.

668
00:33:11,250 --> 00:33:14,520
I'm Cloud Economist Corey Quinn,
and this is Screaming In the Cloud.

669
00:33:14,820 --> 00:33:17,850
If you've enjoyed this podcast,
please leave a five-star review on

670
00:33:17,850 --> 00:33:19,350
your podcast platform of choice.

671
00:33:19,350 --> 00:33:23,040
Whereas if you hated this podcast,
please leave a five-star review on your

672
00:33:23,040 --> 00:33:27,780
podcast platform of choice along with an
angry, insulting comment that completely

673
00:33:27,780 --> 00:33:31,889
transposes a few numbers and you'll have
no idea what the hell it's gonna cost to

674
00:33:31,889 --> 00:33:33,659
retrieve that from Glacier Deep Archive.