Screaming in the Cloud

R. Tyler Croy, a principal engineer at Scribd, joins Corey Quinn to explain what happens when simple tasks cost $100,000. Checking if files are damaged? $100K. Using newer S3 tools? Way too expensive. Normal solutions don't work anymore. Tyler shares how, with this much data, you can't just throw money at the problem; you have to engineer your way out.

About R. Tyler: 
R. Tyler Croy leads infrastructure architecture at Scribd and has been an open source developer for over 14 years. His work spans the FreeBSD, Python, Ruby, Puppet, Jenkins, and Delta Lake communities. Under his leadership, Scribd's Infrastructure Engineering team built Delta Lake for Rust to support a wide variety of high-performance data processing systems. That experience led to Tyler developing the next big iteration of storage architecture to tackle the large-scale full-text compute challenges facing the organization.

Show Highlights:
01:48 Scribd's 18-Year History
04:00 One Document Becomes Billions of Files
05:47 When Normal Physics Stop Working
08:02 Why S3 Metadata Costs Too Much
10:50 How AI Made Old Documents Valuable
13:30 From 100 Billion to 100 Million Objects
15:05 The Curse of Retail Pricing 
19:17 How Data Scientists Create Growth
21:18 De-Normalizing Data Problems
25:29 Evolving Old Systems
27:45 Billions Added Since Summer
29:29 Underused S3 Features
31:48 Where to Find Tyler

Links: 
Scribd: https://tech.scribd.com
Mastodon:  https://hacky.town/@rtyler
GitHub: https://github.com/rtyler

Sponsored by:
duckbillhq.com


What is Screaming in the Cloud?

Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.

Transcript
===

Tyler: So even the simple stuff: S3 will gladly compute checksums for you, but to do so, it's a batch operation, and a batch operation costs $1 per million objects. If you have a hundred billion objects, all of a sudden you're faced with: if I need checksums for all of this data, I've gotta go drop a hundred K just to compute checksums, because a hundred billion is just so astronomically large that the simple act of getting checksums for the data you already have in S3 becomes a very serious pricing discussion if you aren't ready to drop that kind of coin.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I am joined today by an early Duckbill Group customer, and a recent speaker at the inaugural San Francisco FinOps event, I suppose we'll call it: R. Tyler Croy is an infrastructure architect over at Scribd. Tyler, how are you?

Tyler: I'm doing all right.

Corey: This episode is sponsored in part by my day job, Duckbill. Do you have a horrifying AWS bill? That can mean a lot of things: predicting what it's going to be, determining what it should be, negotiating your next long-term contract with AWS, or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is. To learn more, visit duckbillhq.com. Remember, you can't duck the Duckbill Bill, which my CEO reliably informs me is absolutely not our slogan. You've been there for a while over at Scribd, six years and change. The company boldly answering the question: what if S3 had a good user interface?

Tyler: Uh, yeah. I mean, I've joined the most recent incarnation of Scribd. Scribd has been a lot of things. We've been around for 18 years; I think Scribd has existed longer than most other tech companies. I mean, we've got the, uh, lack of a vowel in our name; we're from the Flickr-and-Scribd era, we were before the .lys.

Corey: Vowels are expensive; why buy them if you don't have to? So, your talk was fascinating because it, of course, focused heavily on economics, and it also focused on S3, one of the reasons that many of us don't sleep anymore. It was an interesting story just at the level of scale you're talking about: things that people don't consider to be expensive got expensive. Specifically, request charges when you're doing things in buckets that, for those who are unfamiliar, effectively have a crapton of uploaded text documents at a truly staggering scale.

Tyler: I think "crapton" is the metric that Storage Lens shows you at our scale. So, for the uninitiated, Scribd has user-uploaded content: documents, typically presentations through our SlideShare product, going back 18 years. And so every day, thousands and thousands of new documents, legal documents, study guides, et cetera, get uploaded. And those have been quietly accumulating in our S3 storage layer for a long time without anybody really paying attention to it. And so a year or so ago, I started to really look at: where's a big dent I can make in Cost Explorer? Like, if I'm going to take on something big, what's the biggest thing? And I saw S3 storage costs.

Corey: Cloud economics. Instead of pulling up an AWS bill and going through it alphabetically, if you start with the big stuff first, that tends to have impact. For years, I was asked about people's random Alexa for Business spend; it's $3 a month. What are you doing?

Tyler: Yeah, I mean, most companies, I think, have EC2, S3, Aurora; those are the big things. But once I started to look into our actual S3 spend, I knew we had a lot of content. We talk about the hundreds of millions of documents that have been uploaded over the years, but when I actually looked into what was stored in S3, we're talking hundreds of billions of objects, because for every single document that you upload, we have format conversion, we have accessibility changes that get made. And so every single document became this diaspora of related objects in S3. And suddenly, things like batch operations, intelligent tiering, anything that has a per-object charge associated with it, become wildly expensive in a way that requires you to step back and think about how we should be doing this. How should we be storing this data? Because that shotgun of objects into S3 only works for the first billion, and after that you might have to think about what you're doing.
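(A quick back-of-the-envelope sketch of the per-object arithmetic Tyler is describing. The Intelligent-Tiering monitoring rate below is an assumption based on public list pricing at the time of writing; verify current pricing before relying on it.)

```python
# Rough sketch: what a per-object charge does at a hundred billion objects.
# The $0.0025-per-1,000-objects monitoring rate is an assumption drawn
# from public list pricing; check current pricing before trusting it.
objects = 100_000_000_000  # ~100 billion objects

it_monitoring_monthly = objects / 1_000 * 0.0025
print(f"Intelligent-Tiering monitoring: ${it_monitoring_monthly:,.0f}/month")
# -> Intelligent-Tiering monitoring: $250,000/month, before moving a byte
```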

Corey: The ergonomics of the request charges are very different too. I think philosophically we tend to see, on some level, "oh, if I stuff an exabyte of data into S3, that's going to be expensive." But I think it's hard for humans to wrap their heads around the idea of hundreds of billions of objects, just because the difference between a million and a billion is about a billion. Past a certain point of scale, you do an S3 ls to see what objects you have there, and it'll complete right around the time the Earth crashes into the sun. It's just not something that makes sense. But on the other side of it, I over-optimize for a lot of this stuff because, I think at the Duckbill Group now, our total S3 bill is something like 110 bucks a month, and we can do basically anything we want to S3; it doesn't materially move the needle on our business, because we are not Scribd.
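(For a sense of why that listing never finishes: ListObjectsV2 returns at most 1,000 keys per call, so a hypothetical sketch like this one, with a placeholder bucket name, needs on the order of a hundred million sequential requests to walk a hundred-billion-object bucket.)

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count = 0
# Each page returns at most 1,000 keys, so a 100-billion-object bucket
# means roughly 100 million sequential LIST calls; S3 Inventory is the
# saner tool at that scale.
for page in paginator.paginate(Bucket="example-bucket"):
    count += len(page.get("Contents", []))
print(count)
```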

Tyler: You can abuse S3 for terabytes, petabytes even. You can put so much into S3. It's so incredibly cheap, it's so incredibly reliable. And then there's this sneaky thing, something happens. And I don't know when it happened at Scribd, because I wasn't paying attention; I'm only looking back in history. Something happened where we went from the first billion to the next billion to the next billion. And once you're in the tens or hundreds of billions of objects, it's like quantum physics: all of a sudden, all of the physics that you've learned no longer applies. You're in a completely different ballgame, and you've gotta figure out how this world works, because the world I thought I had doesn't exist anymore.

Corey: Oh, very much so. You also have the problem, especially when we're talking about all things billing, that it's a lot of hurry up and wait. Okay, we're gonna make some transitions, we're gonna try something here and see how it goes. And then you have to, in some cases, wait for objects to age out into the next tier. Or there's a bunch of request charges that suddenly mean that for this month, your S3 bill is just in the stratosphere, and you get the angry client screaming phone call: "What have you done?" Yes, there is an amortization story here. Give it time, don't move it back, be patient.

Tyler: And I'm actively, as we record this, waiting for the first 30 days on some reclassing to occur for intelligent tiering. And I can't wait until next month, because I'm hoping for a big drop.
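(For context, a reclassing like the one Tyler is waiting on is typically kicked off with a lifecycle rule along these lines; the bucket and prefix are placeholders. Objects then need roughly 30 consecutive days without access before Intelligent-Tiering shifts them to a cheaper tier, hence the month of waiting.)

```python
import boto3

s3 = boto3.client("s3")
# Hypothetical rule: transition everything under a prefix into
# Intelligent-Tiering. Objects untouched for 30 consecutive days then
# drop to the Infrequent Access tier on their own.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "reclass-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "documents/"},
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"},
                ],
            }
        ]
    },
)
```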

Corey: Yeah. This stuff has become magic, but you have to speak the right incantations around it. Past a certain point of scale, a lot of things, just in the way that AWS talks about them, no longer make sense. Like, I asked you about using S3 Metadata or S3 Tables for a lot of this stuff, and your response was the polite businessperson equivalent of "kid, that's adorable. Do you have any idea what that would cost just on the sheer number of objects?" Because that's not usually the first dimension we tend to think about historically. Now Metadata and Tables are changing that, and vector buckets and directory buckets and Lord knows what else. But that just changes the way that we think about a service that, honestly, is older than some people listening to this show.

Tyler: I mean, yeah, a lot of these things really break down in ways that are challenging. The metadata and inventory and other operational things that have happened in S3 over the last few years are really interesting. They're really, really great when you're sub a billion objects. But with S3 Metadata, the questions I want to ask of these buckets are not worth the amount of money it would take to ingest into S3 Metadata and then to continue storing, because the buckets are just so astronomically huge. I was looking at a problem with some of these older objects. Sometime in 2024, you probably talked about this, every upload to S3 started getting a checksum automatically; you don't have to do anything. Before that, you may or may not have had a checksum. If you were using a proper, you know, AWS SDK, you did, but if you weren't, who knows? And at some point, who would ever use anything other than the—

Corey: The latest, correct SDK.

Tyler: Why would you interact with S3 except through the official SDK? Um, but when you go back to when Scribd's S3 bucket was created, it was created the year S3 itself was created and announced. And so when we're going back that far in time, we have billions of objects that don't have checksums. And so even the simple stuff: S3 will gladly compute checksums for you, but to do so, it's a batch operation, and a batch operation costs $1 per million objects. If you have a hundred billion objects, all of a sudden you're faced with: if I need checksums for all of this data, I've gotta go drop a hundred K just to compute checksums, because a hundred billion is just so astronomically large that the simple act of getting checksums for the data you already have in S3 becomes a very serious pricing discussion if you aren't ready to drop that kind of coin.
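(The napkin math is as blunt as it sounds, using the $1-per-million-objects Batch Operations rate Tyler quotes; per-job fees and request charges would only push it higher.)

```python
# Napkin math from the episode: S3 Batch Operations bills roughly
# $1 per million objects processed (job fee and requests excluded).
objects = 100_000_000_000           # 100 billion objects without checksums
rate_per_object = 1.00 / 1_000_000  # $1 per million objects
print(f"${objects * rate_per_object:,.0f}")  # -> $100,000
```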

Corey: I have to think, again, you have lived in this space; I only dabble from time to time, but my default approach when I start thinking about this sort of problem is the idea of lazy checksumming, lazy conversions. But when I was at Expensify back in 2012, something we learned was that the typical lifecycle of a receipt was that it would be written once and read either one time or never, except for the very end of the long tail, where suddenly things get read a lot years later, during an audit of some sort or when there's a question of malfeasance. So you could never get rid of the data. You had to have it, but the expectation was it would never get read. I have to imagine the majority of stuff that gets uploaded to Scribd, in many cases, is there, but it's not accessed.

Tyler: I would say that's a pretty good assumption. The interesting thing about user-uploaded content, and user-uploaded documents in particular, is that the long tail is years and years long. You know, a study guide that was created for The Catcher in the Rye in 2010 is still probably just as useful in 2025 as it was in 2010, because The Catcher in the Rye is a classic and people still wanna—

Corey: Yeah, it gets no access until someone posts a link to it on Reddit one year when an entire class is studying that, and yeah, it's impossible to predict what's gonna hit.

Tyler: It's impossible to predict. But one of the things that's been really interesting about Scribd's particular flavor of content is that in the last couple of years, large language models have become really useful for what Scribd does. I won't speak to the utility of large language models in other domains, but their utility in what Scribd does in particular has made old documents suddenly much more useful, much more interesting, much more relevant to users today than they have ever been before. Because before this, you had to sort of rely on a Reddit post or something to reinvigorate a document, right? But now, if we look at all of this almost-20-year history of Scribd as a knowledge base, then all of a sudden we're looking at a very broad, horizontal access pattern that we might want for data science use cases or large language model-based applications. That, again, flips the access patterns that you might have in a traditional user-generated content site on their head, and makes the storage discussion so much more challenging, but in a fun way.

Corey: One of the more horrifying parts of your talk was when you mentioned that you had done a lot of digging into various file formats. You were talking about even ISOs at one point; I'm like, oh hey, someone knows what the Joliet standard is in this day and age. Imagine that. But you picked Parquet, and then started using S3 byte offsets on reads to be able to just grab the end of an object, and then figure out where exactly you'd go and grab things from exploded document views. It was a very in-depth approach. It sounds like you rebuilt S3 Tables, or the rest of Metadata, from first principles, because those things didn't exist back then, and now that they do, they're not even close to being economically viable.

Tyler: Yeah. I think if they were economically viable, S3 Tables would be really interesting for this use case. The really novel thing about Apache Parquet files is that a lot of what we're doing at Scribd with Apache Parquet is not new territory. It's not necessarily novel; it's how the quote-unquote lakehouse architectures of Delta tables and Iceberg tables and things like that are doing really, really fast queries on top of S3 object stores or, you know, Azure, blah blah blah. So the infrastructure for picking needles out of these Parquet-file haystacks already exists. The work that I've been doing is reusing some of the same principles, but bringing them to a wildly different domain: this very, very large content library that Scribd has, and using that as a way to reduce object size. The whole thing I was trying to get across at the FinOps meetup that y'all invited me down for was: the problem is that X is really expensive at a hundred billion objects. My solution has been, okay, not to go negotiate with the team or try to find a way to make that cheaper, but to try to get the object count actually lower. Because if you bring it from a hundred billion to a hundred million, then we're in a ballpark where you can take advantage of intelligent tiering.
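(A minimal sketch of that packing idea, with hypothetical document paths and bytes: many small S3 objects become rows in a single Parquet object, so a pile of related files collapses into one key.)

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical: pack many small per-document artifacts into one
# Parquet object instead of storing one S3 object apiece.
docs = {
    "doc-1/page-1.json": b"...",
    "doc-1/page-2.json": b"...",
    "doc-2/page-1.json": b"...",
}
table = pa.table({
    "path": list(docs.keys()),
    "body": list(docs.values()),
})
# One key on S3 now stands in for len(docs) objects; Parquet's footer
# metadata records where each row lives, so readers can seek to it.
pq.write_table(table, "packed-documents.parquet")
```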

Tyler: Batch operations become much easier to do. All sorts of things become simpler if you can reduce that object count. And when I was looking at other things like ISOs, I mean, the classics are classics for a reason; they never die. When I was looking at zip, tar, et cetera, I wasn't able to find a way to get random byte selections within S3 objects to work nearly as effectively as I can with Parquet. With Parquet, if I know what file I'm looking for, I can get it extremely quickly from within, let's say, a hundred-megabyte file. I can go grab you 25 kilobytes with the same level of performance as most other S3 object accesses, because S3 has supported range requests for a long time. And one of the bits of trivia that I was very pleased to discover, which really made this work well, is that you can do negative range reads on S3. So you can look at the tail of a file, you can look at the middle of a file, you can grab any part of an object that exists in S3, if you just know where in the file it exists.
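(The negative range read Tyler mentions is a standard HTTP suffix-range request. A sketch, with placeholder bucket and key names: Parquet files end with a four-byte little-endian footer length followed by the `PAR1` magic bytes, so two small ranged GETs locate any row without downloading the whole object.)

```python
import struct
import boto3

s3 = boto3.client("s3")
bucket, key = "example-bucket", "packed-documents.parquet"

# Suffix range: the last 8 bytes of the object, wherever it ends.
tail = s3.get_object(Bucket=bucket, Key=key, Range="bytes=-8")["Body"].read()
footer_len = struct.unpack("<I", tail[:4])[0]  # little-endian uint32
assert tail[4:] == b"PAR1"                     # Parquet magic bytes

# Second suffix range: the footer metadata itself, which maps every
# row group and column chunk to a byte offset we can GET directly.
footer = s3.get_object(
    Bucket=bucket, Key=key, Range=f"bytes=-{footer_len + 8}",
)["Body"].read()
```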

Corey: Which is magic. The downside, of course, is you have to know it's there; you have to have an in-depth understanding of your workload, which you folks do. This is also, I think, the curse of retail pricing. It's no secret at this point that at scale, nobody's paying retail, and it's like, "well, of course we're gonna work and negotiate with you on this, and in return for long-term commitments, we'll wind up giving you various degrees of discounts." But when you're sitting there just doing the napkin math to figure out, okay, I have a hundred billion objects, and what are you gonna charge me? Okay, never mind, we're moving on to the next part of the conversation, because it doesn't occur to you to go ask. It's like, "so I need, at minimum, a 98% discount on this particular dimension." Even if that's attainable, it sounds ludicrous; there's no way you would even be able to say it with a straight face. I'm not gonna go into a car dealership and ask for a car for 20 bucks, because it's just wasting everyone's time. The same principle applies here. They have priced themselves out of some very interesting conversations.

Tyler: The same principle applies. I think, you know, the problem domain that we are faced with is on top of S3, and I think S3, as you've claimed a number of times, is the eighth wonder of the world. It is a fantastic piece of infrastructure. Building on top of it enables so many different use cases, but when you've got a large enough scale, you've got really interesting problems. And being in engineering, this is certainly a bias, right? I don't wanna look away from those problems. Getting things cheaper is sometimes easier to do just with paper, you know, just signing a contract. Other times, stepping back far enough to look at what we're trying to accomplish and coming up with an interesting technology solution is also a perfectly reasonable way to solve the problem. And the way that I'm really trying to approach what we're doing with S3 at Scribd is that it's not just about getting the bill lower. Nobody is gonna give me the time or the money just to make the bill lower. But if I can give us new capabilities by expanding what we can do with this hundred billion objects within the organization, that is a capability change that you get from a technology-based solution, as opposed to a policy or, you know, contract-based solution. Both are equally valid, right? But I'm much better at one than I am at the other. That may be a different story for you, but I'm better with the "let's build some code that's gonna solve some big problems, and hopefully that'll make the chart go down in a way that makes the team in finance happy."

Corey: This episode is sponsored by my own company, Duckbill. Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where Duckbill comes in to help. Remember, you can't duck the Duckbill Bill, which I am reliably informed by my business partner is absolutely not our motto. To learn more, visit duckbillhq.com.

It feels like half the time you look deep in the bill, every different usage item, there's a reason for it, there's a way to optimize around it. But at small-to-midsize scale, it feels like it's just a tax on not knowing those intricacies. It's also, frankly, why the bigger bills get less interesting, because you can have a weird misconfiguration that's a significant portion of an $80,000 monthly bill. But by the time you're at "we spend a hundred million bucks a year," no one's gonna spend 40 million of that on NAT Gateway data processing, 'cause someone's gonna ask where the hell the money's going long before it gets to that point. So it starts to normalize. You see the usual suspects in services, S3 of course being one of the top three every time. But in your case, it's not just about the fact that it's S3; it's: what is the usage type? What is the dimension breakdown? What is the ratio of requests to bytes stored? That's where it starts to get really interesting. And there's still no really good way of saying, "oh, 99% of it is this one S3 bucket," because you have to go diving even to get that.

Tyler: You have to go diving to get the specifics on where data is being stored, especially as it starts to get more and more costly. But the use case I see more and more, and this is sort of because of the time that we're in right now, is: if you give a data scientist a bucket, if you give a data scientist or an engineer a table, they're gonna start to put data in it, and it starts to explode over time, to where we have data sizes that get large enough that you're like, okay, should this be in S3? We need it to be online. Should it be in Aurora? Should it be in ElastiCache? There are all of these very interesting data-scale problems that are starting to creep up, because data has become so much more intrinsic to the product value, and everybody wants the data all the time, on every surface possible, with as little latency as possible. And, normalizing for everything else, S3 is incredibly fast and incredibly cheap; you just have to know how to store data in it to take advantage of those two properties. That's sort of the thrust of the work I've been doing over the last year: if you know how to wield S3, it is probably the most powerful tool in the toolbox, but you have to know how to wield it.

Corey: Yeah, it's magic. It is infinite storage, by definition. It's faster than you can fill it. I know that because I've done some, uh... yeah, that's why my test environment is other people's production accounts. It also changes the nature of behaviors. If I were to go into a data center environment and say, "great, I need to store another 10 petabytes over in that rack next week," the response would be, "Right. Is there someone smarter I can speak with? Do you have a goldfish, perhaps?" Whereas growth with S3 is just effectively one gigabyte at a time, and there is no forcing function where, well, now we need a whole bunch more power, cooling, hardware, et cetera. You can just keep incrementally adding and growing forever. There is no bound to speak of, at least not at anything you or I are ever going to encounter, 'cause money is the bound there, and there's no forcing function that makes us clean up after ourselves. It is an unbounded growth problem.

Tyler: It is an unbounded growth problem. I think there's an industry change that has happened that has influenced this; I was having a chat with my friend Denny from Databricks about it. When I first came up in the industry, how you stored data, whether it was online or offline, was in a relational data store of some form, or the data warehouse. And the goal of all of these foreign key constraints and the relations between them was to only store any one piece of data once. In the last 10 years, we've said to hell with that: denormalize the data. It's faster to just denormalize it and create a copy, plus one column of this table, rather than try to manage all of these relationships between data. And so we have excessive denormalization of data happening across the data layer, for good reasons in theory, but for good reasons. What that means is this unbounded growth has happened, because we have infinitely cheap storage, right? And then we also have this push of denormalizing data, which leads to crazy data growth as time goes on, because most new data sets are not net-new, original data sets. They are that old table or that old data set, plus some new properties that I've added, and now it's an entirely new prefix, an entirely new bucket of stuff, right? And so rather than trying to find a way to use the least amount of storage, we said to hell with it, S3 is cheap enough, just copy the data a bunch. And that works great until it stops working. And at the billions is where it stops working.

Corey: And also until you realize, okay, you had a data scientist copy away five petabytes to do a quick experiment for two weeks; she left the company in 2012, and, oops-a-daisy, we probably should have cleaned that up before.

Tyler: Yeah. That's where intelligent tiering could really help. Intelligent tiering and object access logs have been used quite aggressively by myself and some other folks I work with to identify exactly those data sets that were orphaned, that were suspiciously huge.

Corey: Object access logs are great. CloudTrail data events are terrible. I've done the math on this: it's something like 25 times more expensive than the S3 request itself to write the CloudTrail data event recording that request. So, professionally speaking, don't do that. The access logs are a lot more reasonable. It reminds me of the good old days of server web logs from Apache and whatnot.
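(Server access logging is a per-bucket switch; a hedged sketch with placeholder bucket names. The target bucket needs the S3 log-delivery permissions set up separately, and, per Corey's later warning, must never be the bucket being logged.)

```python
import boto3

s3 = boto3.client("s3")
# Hypothetical: deliver access logs for one bucket into another.
# Unlike CloudTrail data events, server access logging adds no
# per-request charge; you pay only for storing the log objects.
s3.put_bucket_logging(
    Bucket="example-documents",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-access-logs",
            "TargetPrefix": "s3-logs/example-documents/",
        }
    },
)
```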

Tyler: I set up Webalizer; I know exactly where things are going.

Corey: Exactly. And, uh, my log analytics suite was a very convoluted one-line awk script. When you have underpowered systems, what else are you gonna do? It's either that, which was hitting my server and making it fall over, or, uh, tail -f and trying to strain the tea leaves with your teeth.

Tyler: That's what... that's how I've been doing it, actually.

Corey: Yeah, there are worse choices you could make. Unfortunately, I really wish you could give a version of your talk at re:Invent, but it doesn't involve AI, so there's no way in the world that it would ever fit on the schedule. Someone just did the analytics: something like 400-and-change of the roughly 500 talks in the catalog so far mention AI at least once in the title or description.

Tyler: Yeah. I mean, that's how you get on stage. That's how you get some funding. You gotta have some hashtag-AI. I think the thing about the AI use cases is that there's a lot of really interesting things people are doing that are, quote-unquote, AI on AWS. Some of those are with AWS's AI tools. A lot of them are kind of conventional: SageMaker, conventional vector stores, the tools of three or four years ago. The stuff that's coming out now, or that has been announced in the last year or so, I think we're gonna see a couple more years before that's really used in anger for production products and things like that.

Corey: This is the problem too: you're building things out over time. It's easy as hell for me to go to you and point at some of these new features, like, "well, why didn't you idiots just build Scribd on top of this?" Because, and this may shock you, Scribd is older than three months. Who knew?

Tyler: Yeah. The thing that's fascinating about Scribd in that regard is that Scribd evolved, right? We've had very different business models depending on which era of Scribd you look at, and the design constraints that influence systems change over time. And so I think it's a great thing for most engineers to work for a company that doesn't have greenfield problems. Join a company that's been around for five, ten years, because there are really interesting engineering challenges when you look at a system that was built for an era that no longer exists and have to figure out: how do I convert this? How do I make this work where we are today? Every problem has constraints, but taking the constraints of somebody's solution yesterday and bringing them to today is a very interesting mental challenge, because you can't throw it out and rebuild it. You've gotta find a way to evolve the system, as opposed to burning it to the ground.

Corey: I make this observation from time to time in meetings: it doesn't take a particularly bright solutions engineer to fill an empty whiteboard with an architecture that largely solves any given problem you want to throw at them. The thing is, we are not starting from square one with a greenfield approach for almost anything these days. Great, we already have an existing business and existing systems that are doing this, and no, we don't get to just turn it all off for 18 months while we decide to start over from scratch and then do a slow, leisurely migration. There has to be a path forward here.

Tyler: Yes, there has to be a path forward, and in Scribd's case, what makes it interesting is we still get uploads. Thousands and thousands of uploads happen every day, and so any storage solution we come up with on top of S3 has to slip in or follow behind something that's getting thousands and thousands of uploads every single day, and all of the objects that creates. I was looking at a—

Corey: I think it was a free database.

Tyler: Thank you. Uh, thank you for the engagement, I guess. It's really helpful. Anything's a database—

Corey: when you hold it wrong.

Tyler: Um, please don't use Scribd as a database.

Corey: I don't mean Scribd. I mean you personally: you have information, we can ask you about it and get answers. You're just a slow—

Tyler: Database, done. I am a very slow database, lossy as well. Um, I was looking at some assets I had put together for a discussion with AWS at the beginning of the summer, and on the whiteboard I had put one number: you know, this many billion objects. When I went and looked at that in the last two weeks, there were another couple billion objects in that bucket compared to what I had put on the whiteboard. So when you're looking at a system that is—

Corey: Additional use of the service and the site, not the classic architectural pattern called "Lambda invokes itself."

Tyler: It was not Lambda invokes itself, though I am a fan of that one.

Corey: Also, don't put your logs into the same bucket you're recording the object access in.

Tyler: Yeah, I don't think I've done that one. I have done the recursive Lambda, though. Um, but the challenge of architecting or designing something that has to handle massive scale, but also work through 18 years of massive scale, is such a fascinating problem. There is no bigger problem at Scribd that has me this excited than figuring out how we take 18 years of the largest document library that exists, as far as I'm aware, and figure out how to make it useful: high performance, easily accessible, and giving us new capabilities. The S3 service team is doing this now as well; they're trying to find new ways to get new capabilities out of S3.

Corey: Well, they're never breaking the old capabilities, except... well, never say never. None of us want, like, SOAP endpoints for their API. And I recently discovered that SOAP used to be a way of addressing a bucket; that it used to do that sounds like... I'm kidding.

Tyler: I love that. I love that SOAP endpoints are still supported until October of 2025. It made me laugh so much when I saw that.

Corey: Oh, next month? I missed that. Okay. They... those deprecation warnings.

Tyler: I discovered SOAP recently for S3, and I discovered that it was going away, both within about two days of each other. Not that I was planning on using SOAP. Um, there's a lot of really cool tools in the S3 toolbox that are underutilized. I mean, Object Lambda, which I've chatted with you about before; I think Object Lambda is super cool, but I don't think it's seeing a lot of attention. I think S3 Select was a good idea that maybe didn't materialize in any particular way. Over the years, the S3 service team has been doing interesting things, and I think only now, in the last two years, have they found their footing with the metadata tables, different types of buckets, and then vector buckets, and found a way to move up from object storage in a way where it's really fascinating to see how the industry around it is gonna change. Because, as you've pointed out, S3 is backwards compatible with just the classic S3 API, but S3 means a lot of different things now. It's like SageMaker: what do you mean by S3? What part of S3 are you talking about? In my case, I'm just talking about objects; I'm not talking about this other magic stuff.

Corey: Yeah, don't worry, those will continue to work. They kind of have to. But it's weird to wonder: if you were to go on a cruise for the next five years and come back, how little, or how much, of the then-current stuff would you recognize? They keep dripping out feature enhancements, but they add up to something meaningful.

Tyler: They do, they do. There's a very clear data strategy from the S3 service team that's working in concert with some of the other parts, like the Bedrock team and the SageMaker teams. I mean, I am all about data. I love data. I work on data products and data tools; this is my jam. There's never been a more exciting time, in my opinion, to be building on top of S3 as a platform, because the platform itself is rock solid, super fast, super cheap, and getting more capabilities every re:Invent, which is super cool.

Corey: Yeah, it's really neat to see. I want to thank you for taking the time to speak with me about all of this. If people want to learn more, and I strongly suggest they do, where's the best place for them to find you?

Tyler: Uh, that is a great question. You can find my shitposts on Mastodon; I'm on hacky.town.

Corey: My server got eaten when I had, uh, 10,000... I dunno if I've got the energy to do it again.

Tyler: Yeah, I know. I noticed that QuinnyPig was on Mastodon for a minute, and then I didn't know what happened, and then you just...

Corey: Yeah, I basically... I missed the deprecation warning and suddenly, nope, no backups. You're done.

Tyler: Oh, no. Uh, it is definitely the do-it-yourself social network.

Corey: Yeah, which is great if I want to talk to a very specific subset, uh, archetype of a person, and absolutely no one else.

Tyler: Fair. Absolutely fair. I'd say, um, the Scribd tech blog, tech.scribd.com, is a good place to find some of the stuff we periodically share. Uh, Mastodon's probably the easiest place to find me, or GitHub; I'm just rtyler on GitHub. Like, for the last 20 years that GitHub's been around, GitHub's been my primary social network, so that's where people can find me.

Corey: Awesome. Mine, uh, historically, my primary social network has basically been Notepad and text files and terrible data security. But, you know, that's beside the point. Thank you so much for taking the time to speak with me. I appreciate it.

Tyler: Thanks, Corey.

Corey: R. Tyler Croy, infrastructure architect at Scribd. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that completely transposes a few numbers, and you'll have no idea what the hell it's gonna cost to retrieve that from Glacier Deep Archive.