Jaz: There's a lot of convenience that comes with cloud, but you definitely pay for it. And you don't necessarily pay for it in the things that you expect to pay for it in. Like, you don't expect, ah, you're gonna charge a markup on this EC2 instance based off of how powerful it is. You end up paying most of it in, like, kind of hidden places.
Justin: Welcome to Fork Around and Find Out, the podcast about building, running, and maintaining software and systems.
Welcome to Fork Around and Find Out, the PLC DID of Is the Website Still Up? I am Justin Garrison, and with me is Autumn Nash, and today we have Jaz, a software engineer at Bluesky. Welcome to the show, Jaz.
Jaz: Hi, glad to be here.
Justin: So excited for you to be here. I have been looking forward to talking about the infrastructure around Bluesky and what you all have been doing for a very long time.
Autumn: Jaz's radio voice just totally kicked Justin's podcast voice.
Justin: Voice is absolutely better than mine. Like,
Autumn: like, as your friend, I want to have your back. That's fire. I'm not sure if I'm going to
Jaz: be demonstrating that for the podcast yet. But, you know.
Autumn: Can we hire you in Skittles?
Jaz: We could, you could DM me, we could talk about it.
Autumn: We'll just, we'll send you Ikea plushies as payment.
Jaz: That's right. Pay me in gum.
Justin: All valid cryptocurrencies for 2025.
Autumn: I guarantee you, if we took Trump coins and Ikea plushies, one has a better resale value.
Justin: Oh, wow. We're, we are three minutes into this episode. Welcome to the show, everyone. It's been a week. It's definitely been a week. Just for context, for anyone listening to this, we are recording this on January 23rd. It is Thursday, still in January.
This episode is coming out in February, second week of February. So I don't know what the future holds, but godspeed to you all. Y'all, uh, it is...
Autumn: we were like 2025 will get better. And then halfway through January, we're like, whoa, whoa. We want a refund. Like,
Jaz: I don't know if the store does that anymore. I think
Autumn: the bad place, what
we just survived tech like recession. And now we just, we don't even know. Okay. We like,
Jaz: we definitely served a lot of video this week. I can tell you that much. I mean.
Justin: Speaking of surviving Bluesky is like a male. It's just, it is, and that is
Autumn: You are saving the world right now. 'Cause, like, I don't even know where to go.
I've deleted my Instagram three times. The only reason why I have a Facebook is because it's so confusing I can't get rid of it. Like, I swear Meta was like, I'm going to make this horrible so they can't delete it. And I'm like, I just won't post and I'll just delete it from my phone. Like, yeah. They had to ask a UI/UX, like, designer how to make it as insufferable as possible.
Like, not to make it better, but how to make it worse.
Jaz: It's a lot of dark patterns out there. Yeah.
Autumn: And then people are like, oh, TikTok is bad. Don't rock TikTok. You know, you shouldn't give your information to foreign... like, okay, but this is, like, okay. This is the Boston Tea Party of data. They were like, okay, you want to take my data?
And, like, you want to take my TikTok and say it's, like, Chinese government ware? We will throw it onto, like, RedNote. Like, it is the Tea Party of data. They were like, F your data rules. And then they gave it to the... there was a video of this woman saying that they told her to verify her identity for fraud
on RedNote, and she was like, I'm giving the Chinese government my ID. What now, U.S. government? And I was like, oh, sweet Lord, what are we doing? Me and Jaz are going to be besties.
Jaz: I don't know. I hope more good places show up.
Autumn: Bluesky is all we got.
Jaz: I think there were four AT Proto-based TikTok clones that were, like, starting up in the past week or two.
Justin: So let's go back a little bit first. How did you get into software infrastructure? What's kind of your background? How did you go from doing something to, like, being part of Bluesky and AT Proto?
Jaz: I started in hardware. I started as like a, as a repair tech, uh, when I was like 14 at a computer repair shop in my local town.
Justin: Support desk life, it is like you are help desk and yeah,
Jaz: yeah, I was very good at taking stuff apart. I wasn't very good at putting things back together. And then as I got older, I got better at putting things back together. Well, I don't know. There's, there's, I feel like there's, I feel like there's a disease you get where you just want to like take everything apart and figure out how it works.
And so I was that kid.
Justin: You have to see the parts. You have to know what's going on.
Jaz: Yeah. Yeah. I didn't, I didn't know how, like, solder joints worked. I learned that the hard way after, like, breaking a few too many solder joints and, like, it's not going back together. Why the hell doesn't it work?
Justin: A new skill today. The oven, to try to solder it again.
Jaz: Yeah. Yeah. And then I evolved from that to doing tech support at a local PC repair shop, and that paid awfully and they billed a lot for my time. So I was like, okay, cool, let me do this independently. Um, so I went solo for a little bit. And then when I was in, like, high school, I got into the hackathon scene in London, in the UK, right after I moved there in high school.
That was really cool. I was going to these hackathons. I was like, well, technically not old enough to go to some of the hackathons so I could win the prizes.
Autumn: Is the food terrible in London? Or is it good?
Jaz: If you get the right food, it's good. British food is bad.
Autumn: Okay.
Jaz: Uh, I probably shouldn't have said that on the podcast, but British food is bad.
They know it. Um,
Autumn: they know it. Just like they know it. It's cool.
Jaz: They know, they know it. They know the good British food is like Nando's, but that's like. South African slash Portuguese slash British. And then obviously there's like really good Indian food and there's really good continental European foods.
If you want, like, good Italian food or good French food, those are some really good eats to get in London. Yeah. So I was in the hackathon scene, started doing software engineering as, like, a part-time thing, I think my junior year of high school into my senior year of high school.
And then I moved back to the U.S. for college, uh, worked through college 39 and a half hours a week doing contracting, and then graduated, uh, early 2020, was thrust into the tech market in the middle of a pandemic. So it was interesting. So I spent some time working at a financial company, um, doing, like, infrastructure for their engineering teams, building a platform as a service on top of Kubernetes.
And then I spent some time working at a social media company, doing infrastructure for their research teams. So, turning research projects into...
Autumn: Was it like a really evil one or just, like, a kind of evil one?
Jaz: It was, yeah, it was at Facebook. I was at, I was at Facebook briefly, for about a year, as a production engineer.
Justin: you got out.
Jaz: So I, I got out. I got out.
Justin: I'm just trying to think how hard it was to get out of the company. Like, this is... Yeah. Yeah.
Jaz: I was working, um, production engineering at Facebook Reality Labs, so I spent time working with a bunch of researchers. (That sounds like a cool job.) Yeah, they were building really cool stuff.
The problem is there were like a few thousand researchers and there were about 20 production engineers. So it was just like
Justin: Meta Quest, like VR? This is like Meta Quest?
Jaz: This is like Meta Horizons. This is all sorts of... this is all the hardware projects they're working on.
Justin: Do you know about the legs on the models?
Autumn: I was, that's what I was going to ask.
I was going to be like, why don't they have hands and legs? Like, do you have the teeth? Like this man was really trying to tell us that we all need to be more masculine and all this stuff. And I was like, bro, you can't build hands. Sit down.
Jaz: I don't know why there are no legs. I do. Yeah, I do know. Like.
There were lots of really cool projects going on around the AR glasses that debuted at a recent Meta event. So the, the really cool, the, like, time machine glasses, the real thick ones, those are awesome. The amount of engineering that went into every single component in that pair of glasses is crazy.
Like, everything in it is custom silicon. Designing that custom silicon and designing the optics, those silicon carbide optics that are, like, actually just, like, a rock that was manufactured specifically to, like, do all these crazy waveguides and stuff, that requires an insane amount of simulation and an insane amount of, like, physics and engineering.
And I, like, sure, helped the team with their simulation cluster or something. I have no idea how the math works, but that was cool stuff to work on.
Justin: And now that I think about it, Jaz, like, I can't see your legs now either. So I don't even know if you have legs.
Autumn: So you can tell us. Point twice.
Jaz: No, I, uh, yeah.
So that was, that was a fun chapter. After that, I kind of, like... I went to a tiny, a tiny six-person startup that was doing, like, solar-powered, cellular camera networks around cities for determining parking occupancy. It was very weird.
Autumn: What made you want to go from Meta to that?
Jaz: I wanted a small, like small team startup vibe thing.
And the CEO was a friend of mine from high school, but they didn't really have any engineers. And I kind of, I built a product stack and burned out pretty quickly and then went to work at Planet Labs where I was for about two years. That was kind of my ethical turning point where I was like, Hey, I want to go build technology that helps the world.
Uh, Planet builds a tiny CubeSat constellation that images the world every day. They sell that imagery to farmers and the agricultural industry and all sorts of, like, NGOs and other, other, uh, organizations so they can get, like, really fast, real-time imagery. My role there was, like, billing infrastructure when I got my foot in the door, and then it turned into, uh, I wrote a charter and built out their internal developer experience team.
But come, you know, 18-plus months into my career at Planet, my friend invites me to Bluesky. Uh, I was, like, user, like, 20,000 or something like that. I check out this cool protocol that they're working on. It's very, very interesting, because they just have a public firehose, and I was like, holy crap, that's awesome.
I've never really seen a public firehose for a social network. So I figure out how to consume the firehose. I noticed, like, hey, there's this Paul guy who's, like, everywhere all the time, responding to everybody. Everyone mentions him.
Autumn: Always the first thing everybody notices.
Jaz: Yeah. So I was like, who's this Paul guy, and how, how much is he mentioned?
So I wrote like,
Justin: he's the MySpace Tom of Bluesky.
Jaz: Yeah. Yeah. I wrote some code and I was like, how often is Paul mentioned? And like, how many different people are talking to Paul? And so that was the initial idea was like tracking how popular Paul was on this platform. And then that evolved into my social graph visualization, which was, Hey, let's graph all of the interactions between users on Bluesky and try to find like clusters of, of new users popping up that have common features and stuff.
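For a sense of what that mention-counting looks like, here is a minimal Go sketch. It assumes post records have already been decoded off the firehose; the facet shapes loosely follow the public app.bsky.richtext.facet lexicon, and all DIDs, names, and the counting logic are illustrative, not Jaz's actual code.

```go
package main

import "fmt"

// Rough shapes based on the app.bsky.richtext.facet lexicon, trimmed for illustration.
type Feature struct {
	Type string `json:"$type"` // "app.bsky.richtext.facet#mention" for mentions
	Did  string `json:"did,omitempty"`
}

type Facet struct {
	Features []Feature `json:"features"`
}

// Post pairs an author DID with the facets of one post record (hypothetical shape).
type Post struct {
	AuthorDid string
	Facets    []Facet
}

// countMentions tallies how many posts mention the target DID and which
// distinct authors did the mentioning.
func countMentions(posts []Post, target string) (mentions int, authors map[string]bool) {
	authors = map[string]bool{}
	for _, p := range posts {
		for _, f := range p.Facets {
			for _, feat := range f.Features {
				if feat.Type == "app.bsky.richtext.facet#mention" && feat.Did == target {
					mentions++
					authors[p.AuthorDid] = true
				}
			}
		}
	}
	return mentions, authors
}

func main() {
	paul := "did:plc:paul-example" // hypothetical DID
	posts := []Post{{
		AuthorDid: "did:plc:alice-example",
		Facets:    []Facet{{Features: []Feature{{Type: "app.bsky.richtext.facet#mention", Did: paul}}}},
	}}
	m, a := countMentions(posts, paul)
	fmt.Printf("%d mentions from %d unique authors\n", m, len(a))
}
```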
Autumn: Very cool hobbies.
Jaz: Thank you. So that was, that was really fun. And I realized I was spending about 30 hours a week on miscellaneous AT Proto stuff and then 40 hours a week at work. And I was like, I definitely like one of these a lot more than the other. So I went to one of the Bluesky user meetups in the Bay Area.
And I met, uh, some members of the team at the time. And they recognized me from the projects I was doing on the network. I was sharing all of this as I was building it, open source, like, hey, check out this cool graph. Oh, look, all these new users showed up and they're from this area and they speak this language, or whatever it is.
We chatted for a couple hours and like, cool. Do you want to like come work here? And I was like, Oh, do I? I was like, yeah, I think I do.
Autumn: That's actually really helpful data for a startup, though. Like, you were doing meaningful work.
Jaz: Yeah, I mean, yeah, at that point I had more dashboards than the company did of, like, what was going on on the network.
I had, like, a better idea of who their users were than they did. Obviously it's evolved a whole lot since then, but I was, I was basically working at the company before I started working at the company, just because everything was open source and all the data was open. You could just do whatever you want.
Autumn: I was gonna say, you just rolled in with insights.
Jaz: Yeah, it was, it was super cool. Like I've never, I've never had like a, an experience like that where you can watch the evolution of a social network, like from basically first principles, totally in the public and totally in the open and just build all sorts of stuff on top of it.
And the developer community around it got really psyched. Yeah. That, and that was back in, I joined the team back in July of 2023. So I've been, been around for about like 18 months now. And it's been an absolutely insane ride. It's felt like it's been a decade.
Justin: You primarily have focused on the infrastructure side of it, right?
Like as far as, yeah,
Jaz: my roles and responsibilities are mostly around infrastructure and scaling. So when I joined, we had around a hundred thousand users. This weekend we'll probably be pushing on 30 million users. So, a pretty significant increase in scale in a little bit over 18 months.
Autumn: We need a HugOps meme for Jaz.
You are the real MVP, because of the amount of people that are social media refugees right now.
Jaz: It's not just me on the, on the infra side of things. We probably have five or six people who are, like, kind of core infrastructure, like, on the on-call rotation now. But back in the day, it was, it was not quite that big.
We were, we were a really tiny team.
Autumn: Five is still a lot for that many users, and then doing mostly on-prem.
Jaz: Yeah, it's, it's a bit crazy. We, we built out our data center locations in, like, November of 2023. And so before that, we were all on cloud. Things were kind of falling over with a hundred thousand users.
Autumn: Oh, that's interesting. So you did start in the cloud. And then,
Justin: Yeah, it all started in the cloud. Like, general overview: what does the infrastructure look like today? As far as I know, some pieces are in the cloud, some on-prem, some in different areas. Like, what is that? How does that break down?
Jaz: So we have three tiers of infrastructure, I guess.
You'd have, like, singleton one-off services, which are kind of smaller, lower-load services that we want a replicated Postgres database for or something. And those we stick in a cloud provider. We have our, like, core data services, which are really high compute scale, really high storage requirements. Uh, and those we run on-prem.
And so we have two, two POPs, two physical POPs that we have our own hardware in, uh, that we co-locate. We, you know, get a cage in a data center somewhere and go throw your servers in it. And then our third tier is kind of like bare metal providers, which is different providers, but they give us basically a full machine in their data center somewhere.
And then we run, like, the PDSs there. So the personal data servers that have all of our users' canonical data on them, those are stored on bare metal through bare metal providers. And that lets us kind of scale those a lot more easily than we can scale our own physical hardware. And then smaller one-off services or things that need to be in the cloud are in the cloud.
And then all of our kind of, like, really high compute-intensive or network-intensive or storage-intensive stuff runs on our own hardware, because bandwidth in a data center is a lot cheaper than bandwidth in a cloud. Storage in a data center is a lot cheaper than storage in the cloud.
Autumn: Do you have any extra backup storage in the cloud just in case things get super crazy?
Jaz: Yeah, so there are all sorts of, like, different tiers of backups based on what kind of data it is and where it is. So, like, canonical data, like your PDS data, is backed up a couple different ways in a couple different places. But, like, our global index of all of the data in the atmosphere, we run two fully independent copies, uh, indexing the atmosphere.
So each data center, uh, fully indexes, um, the firehose on its own. Um, so they both contain two independent sets of the same data. Um, so if there were some kind of outage or anything like that, uh, we have at least a copy of that somewhere. Uh, and we have the ability to shift all of our traffic to one of the data centers so that it can, it can handle the production load.
Autumn: It makes my heart so happy when people use cloud and on prem correctly and don't just think either of them are the end all be all. And when people are redundant properly, like it just makes me so happy.
Jaz: We, we really like commoditized cloud products. So something like block storage is, like, super commoditized.
It's super cheap. There's so many different people that provide it, and the, like, SLAs on it are very industry standard at this point. So it's much easier to get cheap block storage, so we don't mind. Building a petabyte-scale, uh, storage cluster on, like, metal is, is kind of challenging. It's, it's expensive.
It's error-prone. Depending on your latency requirements and stuff, you might need to be running flash for that, in which case it's a lot more expensive. And if you're running hard drives, you have failure rates, which means the bigger your cluster is, the more often you have to send somebody down to go swap hard drives. Whereas block storage is, like,
honestly, really economical up to, I think it's somewhere in the, in the, like, four to five petabyte range, at which point it makes sense to start just running your own storage clusters.
Autumn: But I love that you guys actually did the numbers and you looked at each, you know, like the, all of your storage is very well placed.
Justin: Well, and the, the really funny thing is, I mean, you, you said you started these in 2023. Alright, like end of 2024, you were growing a million users a day. So whatever math you thought you had in 2023 was not the math you were doing in 2024.
Jaz: You would be surprised. We've got some spreadsheets that were written in 2023, very early 2024, and they were wildly ambitious when they were written. They go month by month, like, user numbers. We missed a ton of the marks on them. And then we caught up.
Autumn: Were y'all just predicting the future? Like, did, did someone know if Elon was breaking up or getting back with girlfriends?
Jaz: Our previous infrastructure lead, Jake... Justin, I think you talked to him briefly on the website at some point.
He wrote up this spreadsheet that, like... I think he based it off of Instagram numbers or something. He got, he got a bunch of different numbers from different social medias that, like... they gave you a whole, like, six data points of their user numbers over the course of their, like, 10-year history, and then he extrapolated between them to try and find what successful growth looked like.
And then we built that into the spreadsheet, and then we said, hey, let's plan for success, because if you plan for failure, you're not going to succeed.
Autumn: Which is amazing, because just the, the environment in which Instagram and Facebook and most places in social media grew is not like what's happening right now.
Like, this is a very crazy time in social media.
Jaz: When you, when you have so much market saturation and there's so many incumbents and everybody is already fully subscribed on the social medias that they want to be on, it is so hard to pull people away from the platform that they're on and bring them to something new and show them something new.
And we saw that for six months last year, we had, like, basically flat growth. We were, like, between three and four thousand new users a day for six months. And then in November of 2024, we were doing over a million users a day for three days in a row.
Autumn: We'll never know why in November, specifically, that happened.
Jaz: The whiplash is crazy. You go from like no growth at all. Oh, we, I can't believe we spent so much time focusing on scaling. Why did we waste all that time and money on filling out these data centers?
Autumn: Oh my gosh, like at, in the last six months, was there like, you don't have to give us specifics, obviously we don't need numbers, but like, was there ever a point where you were like, oh my goodness, like what's going on?
Like, or how are we going to sustain this?
Jaz: Brazil was insane. Brazil was, I was...
Autumn: I thought it was going to be another thing, because I, like, watched the Brazil blip, but I guess in the U.S. it didn't make the same... it didn't seem as much.
Jaz: It didn't catch on as much in the U.S., but it was, like, one and a half million users in a weekend, right?
Which for us coming from having like no growth for six months to suddenly picking up a million and a half users in a weekend was, it was like 30 percent growth of our network in like a week. Which was nuts for us. I was like on a plane to London to go see a friend of mine for his birthday. And I like bought the in flight wifi and was like trying to get my VPN to work so that I could like connect to dashboards and everything.
And it was... I was so terrified. I ended up, like, working the entire weekend I was in London, because I was just like, we've never seen any load like this. It was, like, five or six times higher firehose throughput and request throughput than we'd ever seen before.
Justin: You've mentioned a couple of components, and I don't want to go deep into the AT Protocol stuff, but could you just give a general overview?
Like the PDS, the firehose, the AppView, the indexes. All of those have different constraints, and how do they tie together? Or just a general overview of, like, what Bluesky is offering as a service and the bunch of things underneath it.
Jaz: I'll steal the Paulism, which is everybody's a website. So you as a user on BlueSky, every time you like something, every time you create a post, every time you follow somebody.
Uh, every time you block somebody, every time you repost something, you are writing a little document, a JSON document effectively to your website. You're putting a JSON document in your canonical data store that lives on your PDS, on your personal data server. For the vast majority of our users, that means they are writing it to a PDS that we operate, but there are also thousands of independently operated PDSs.
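To make "everybody's a website" concrete, here is a rough Go sketch of the kind of JSON document a post becomes. The field names follow the public app.bsky.feed.post lexicon, but the record URI in the comment is hypothetical, and the exact shape your PDS stores may include more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// PostRecord mirrors the basic shape of an app.bsky.feed.post record: the
// little JSON document your client writes into your repo on your PDS.
type PostRecord struct {
	Type      string `json:"$type"`
	Text      string `json:"text"`
	CreatedAt string `json:"createdAt"`
}

func main() {
	rec := PostRecord{
		Type:      "app.bsky.feed.post",
		Text:      "hello from my PDS",
		CreatedAt: time.Now().UTC().Format(time.RFC3339),
	}
	out, err := json.MarshalIndent(rec, "", "  ")
	if err != nil {
		panic(err)
	}
	// The PDS stores this under a collection and record key, addressable by an
	// at:// URI such as at://did:plc:example/app.bsky.feed.post/3kabc (hypothetical).
	fmt.Println(string(out))
}
```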
Justin: I'm one of them.
Jaz: Yeah, Justin's one of them.
Autumn: He broke himself for a while, and I couldn't reply to him.
Justin: I was broken for a little while, but I'm back. It's been stable for a couple weeks now.
Autumn: It was like he picked on me specifically, Jaz. Like, I couldn't even reply to him. And he kept tagging me in taco and license plate things.
Like, that's just mean. First of all, I'm hungry. And then I couldn't even reply.
Jaz: That's unfortunate. The nature of a distributed network is, is you've got all these documents that you write into your personal data store, whether it's hosted by us or somebody else. They get aggregated into one giant firehose.
So your PDS emits an event stream for all of the repos hosted on it. So for our users, right now it's usually, like, 500,000 users per PDS. And so if you're on, like, Amanita... all of our PDSs are named after mushrooms. Um, so if you're on Amanita, you've got 499,999 of your closest friends on Amanita with you, and every time you post, you are writing to Amanita, to a SQLite database that exists just for you on Amanita. Each user gets one SQLite.
Autumn: Were we, like, database neighbors, Justin? Or do you think we're the same? We were.
Justin: That was... well, so that was the problem, actually, Autumn, because someone else pointed that out to me, where you and I were on the same PDS originally before I migrated off.
Autumn: And then when I migrated off, they were the real MVP.
Justin: I don't remember which one it was, but when I migrated off to my own, my account deactivation didn't fully happen on the hosted PDS. So people on my PDS couldn't reply to me until I went through and did my full deactivation. So it just so happened we were neighbors, and, uh, and then I moved out of the neighborhood.
Autumn: Why'd you do that? Like, just worst friend ever. Like...
Jaz: You and your neighbors can all chat, all you want. You write all of these, these documents to your own little SQLites living on your mushroom with you. And then the mushroom itself, it sequences all of the events for its users. So you and all the other people on there are writing.
Generally, we see somewhere between 5 and 20 events a second per PDS. So all those writes get written into one sequencer database, which is a SQLite as well. Uh, and then once they get sequenced, they get given, like, a sequence number, and they get emitted out of the firehose. So each, each mushroom has its own little firehose.
And then we have something called the relay, which is in the network, that sucks from all of the mushrooms and turns it into a gigantic firehose that does, you know, anywhere from 1,000 to 2,000 events per second these days. That giant firehose, right now, runs in our on-prem for us. It merges all of the disparate event streams into one giant event stream, which makes consuming the network a lot easier and a lot less complex.
And so that one big event stream then gets crawled by firehose consumers, a couple hundred of those. Jetstream is connected to that, which is, like, a lightweight version of the firehose that has a couple hundred consumers that connect to it as well. But everybody, everybody consumes this firehose, and then the firehose has, like, hey, this person created this record with this ID, and here's the content of that record, and then here's a proof of this operation so that you can check and make sure that they actually created this record, like, it's signed with their private key.
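A minimal sketch of what consuming that event stream can look like, using Jetstream, the lightweight JSON firehose Jaz mentions. The endpoint URL, query parameter, and event field names below are assumptions based on Jetstream's public documentation, not Bluesky's internal code, and may differ.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/gorilla/websocket"
)

// Rough shape of a Jetstream commit event; these field names are assumptions
// based on Jetstream's public JSON output, not Bluesky's internal types.
type JetstreamEvent struct {
	Did    string `json:"did"`
	Kind   string `json:"kind"`
	Commit *struct {
		Operation  string          `json:"operation"`
		Collection string          `json:"collection"`
		Rkey       string          `json:"rkey"`
		Record     json.RawMessage `json:"record"`
	} `json:"commit"`
}

func main() {
	// Hypothetical public Jetstream endpoint, filtered to post records only;
	// the real hostname and query parameters may differ.
	url := "wss://jetstream2.us-east.bsky.network/subscribe?wantedCollections=app.bsky.feed.post"
	conn, _, err := websocket.DefaultDialer.Dial(url, nil)
	if err != nil {
		log.Fatalf("dial jetstream: %v", err)
	}
	defer conn.Close()

	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatalf("read: %v", err)
		}
		var ev JetstreamEvent
		if err := json.Unmarshal(msg, &ev); err != nil || ev.Commit == nil {
			continue // skip non-commit events (identity, account, etc.)
		}
		log.Printf("%s %s/%s by %s", ev.Commit.Operation, ev.Commit.Collection, ev.Commit.Rkey, ev.Did)
	}
}
```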
Autumn: Can I just work at Bluesky for like a day and then just re architect all of your architecture with mushrooms, like just little mushroom databases and like just magical streams, you know, like it was just like the fire hose will be like magical and then like it'll be like each thing will be like a mushroom and it'll be adorable.
Jaz: There's a legendary drawing that I found in the developer Discord, the third-party dev Discord. Somebody drew an architectural diagram of Bluesky, but where each node in the network is, like, a forest creature, and gave them, like, very interesting names. And so there, there is, like, kind of headcanon of, oh yeah, this, this component is like an anteater and this component is like a hedgehog and this component, you know, whatever.
Autumn: Each database should be a mushroom or like
Jaz: Each, each... the mycosphere, we call it. Uh, all of the, the PDSs make up the mycosphere. So that gets indexed internally, and then we have a big database in each of the POPs right now that runs Scylla, which is kind of a C++ rewrite of Cassandra. So it's a big NoSQL key-value store.
And that's where we actually persist the global index of data on the network. So your PDS only knows about what its users have.
Autumn: I love watching Scylla and Cassandra fight, and then Scylla's like, C++ is, like, faster because we're compiled, and then Cassandra's like, but we're faster, and they just pretend like it doesn't suck to manage us, and then they fight back and forth, and it's the best nerd fight you've ever seen in your life.
Justin: Is that where the AppView pulls from? It's not going directly from the...
Jaz: Yeah. So the AppView pulls from its local... there's a data service that I wrote called Atlantis, which is our, like, data plane or whatever, that talks to Scylla. That's what writes things into Scylla, that's what reads things out of Scylla. It also handles some, like, caching tiers, it handles some request coalescing, things like that. And so that is where the global index of data is. So when you load your timeline, when you load a thread, when you look at the number of likes on a post, that's all coming out of Scylla, that's coming from our big data store. And then in terms of scale for that, like, the actual amount of data that's on the network is, like, a couple of terabytes.
If you don't include images and you don't include video or anything like that, the actual, like, record data, the JSON, is like a couple terabytes. So it's not huge. The timelines are really big, though. Timelines are a really weird workload, which is, like, every time you post, we send out your post to all the people that follow you.
So if you have 20,000 followers and you post something, we're going to go insert 20,000 references to your post into the timelines of the people that follow you. And then we keep a... (That sounds very complex.) It was a big architectural shift from what we did before, but the timelines themselves, like, the timelines table, is, like, over 100 billion rows.
We trim it so like there's a maximum length of your timeline, but when you have 30 million users and you want to keep like a few thousand timeline items in there that quickly balloons to like hundreds of billions of rows.
Justin: Wasn't there a thing where the Bluesky account had to, like... you'd post to tens of thousands of people and wait five minutes, right, to let that propagate?
Jaz: There was a moment where we had... there was only one work queue, or whatever, for dealing with stuff. And the fan-out job was also in that same work queue. And so, like, you get sharded into a work queue based off of your DID and all that kind of stuff. But the bsky.app account would... it would create a post, and then it would start fanning out the post, and the creation of the next post in the thread would get blocked, because it would be waiting for the fan-out to finish before it would create the next post in the thread.
And now, those are two separate queues, so fanout jobs can happen in the background, and they don't block the, like, persisting of the actual thread post itself.
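A toy sketch of the fix described here: the post-persist work and the fan-out work live on separate queues, so a slow fan-out never blocks the next post in a thread. Every type, queue, and follower lookup below is invented for illustration; the real system shards work by DID and writes timeline rows into Scylla.

```go
package main

import (
	"fmt"
	"sync"
)

// Post and the follower lookup are stand-ins for the real data model.
type Post struct {
	Author string
	Text   string
}

func followersOf(author string) []string {
	return []string{"alice", "bob", "carol"} // pretend follower graph
}

func main() {
	persistQ := make(chan Post, 100) // critical path: write the record itself
	fanoutQ := make(chan Post, 100)  // background: insert into follower timelines
	var wg sync.WaitGroup

	// Persist worker: once the record is stored, hand fan-out to the other queue
	// and immediately move on, so the next post in a thread isn't blocked.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for p := range persistQ {
			fmt.Printf("persisted post by %s: %q\n", p.Author, p.Text)
			fanoutQ <- p
		}
		close(fanoutQ)
	}()

	// Fan-out worker: the slow, high-write-amplification part runs out of band.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for p := range fanoutQ {
			for _, f := range followersOf(p.Author) {
				fmt.Printf("timeline insert: %s gets post by %s\n", f, p.Author)
			}
		}
	}()

	persistQ <- Post{Author: "bsky.app", Text: "post 1 of thread"}
	persistQ <- Post{Author: "bsky.app", Text: "post 2 of thread"}
	close(persistQ)
	wg.Wait()
}
```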
Autumn: Okay, do you have a different flow for users that have a bunch of followers versus users that don't have a bunch? Because there was a certain point in, like, Twitter where they had to re-architect for, like, Justin Bieber versus a regular person, and that's one of my favorite data stories, because it just shows you how scale can just be completely ridiculous.
Like he would get so many followers a day and then when he would tweet it would like mess up everything and it's just so interesting.
Jaz: We haven't done that yet. But that is absolutely... like, a hybrid timeline architecture is absolutely probably where we'll go as we get bigger and bigger. Because right now, every time bsky.app posts a thread, it's getting fanned out to, I think, 22 million people's timelines. That's a lot of writes. And if they post a five-post thread, that's, like, a hundred mil... over a hundred million writes.
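Since Jaz says a hybrid timeline architecture is where they'd probably go rather than something they've built, the sketch below shows only the general pattern: fan out on write for ordinary accounts, skip fan-out for accounts above a follower threshold, and merge their posts in at read time. The threshold, names, and in-memory store are all made up.

```go
package main

import "fmt"

const fanoutThreshold = 100_000 // made-up cutoff for "celebrity" accounts

type Post struct {
	Author string
	Text   string
}

type Store struct {
	timelines     map[string][]Post // pre-materialized per-user timelines (push path)
	celebrityFeed map[string][]Post // recent posts per celebrity (pull path)
	followerCount map[string]int
	followers     map[string][]string
	following     map[string][]string
}

// Write path: small accounts fan out to every follower's timeline;
// celebrities just append to their own feed.
func (s *Store) AddPost(p Post) {
	if s.followerCount[p.Author] >= fanoutThreshold {
		s.celebrityFeed[p.Author] = append(s.celebrityFeed[p.Author], p)
		return
	}
	for _, f := range s.followers[p.Author] {
		s.timelines[f] = append(s.timelines[f], p)
	}
}

// Read path: the pre-built timeline plus an on-the-fly merge of any
// celebrities this user follows. (A real system would merge by timestamp.)
func (s *Store) Timeline(user string) []Post {
	out := append([]Post{}, s.timelines[user]...)
	for _, followed := range s.following[user] {
		if s.followerCount[followed] >= fanoutThreshold {
			out = append(out, s.celebrityFeed[followed]...)
		}
	}
	return out
}

func main() {
	s := &Store{
		timelines:     map[string][]Post{},
		celebrityFeed: map[string][]Post{},
		followerCount: map[string]int{"bsky.app": 22_000_000, "alice": 10},
		followers:     map[string][]string{"alice": {"bob"}},
		following:     map[string][]string{"bob": {"alice", "bsky.app"}},
	}
	s.AddPost(Post{Author: "alice", Text: "hi"})        // fanned out on write
	s.AddPost(Post{Author: "bsky.app", Text: "update"}) // merged on read
	fmt.Println(s.Timeline("bob"))
}
```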
Autumn: The, the guy who wrote Designing...
Justin: Data-Intensive Applications.
Autumn: ...is on Bluesky.
Jaz: Yes.
Autumn: And he's so rad and nice and that, dude, that's my favorite book.
It is.
Jaz: The boar book.
Autumn: But when I found him, I was like, oh my God, you're real.
Jaz: Yeah. Martin is actually a technical advisor of Bluesky.
Autumn: To be, like, so smart, you know... like, you'd think that he would be like, oh, I'm too smart and I won't talk to people. And he's so nice.
Jaz: He's a teacher. I feel like he gets a lot of human interaction.
He's not like locked in a cave doing like research. So I think he ends up interacting with humans a lot more than some, uh, some CS researchers do.
Autumn: Also, I think the way that that book is written, you can almost tell that he must have taught something, because it's much more digestible than a lot of just dense, horrible data books.
Jaz: Martin is great. We meet with him fairly regularly to just talk about, like, issues that we're having.
Autumn: I'm a fan. Okay, tell him I'm a fangirl over all of his data books. And that's my favorite data book. And I talk about it way too much. And people are probably so tired of me bringing up that one book.
Jaz: One of my favorites.
We have, we do have a lot of like internal memes. I think we've shared a couple of them on the network. We do have memes
Autumn: And we need, like, plushie pics, Jaz. I just... Bluesky's. Yeah. You're right. She picks...
Jaz: And, and Designing Data-Intensive Applications memes. Martin... we used to, we'd go to him and we'd ask, like, all these questions, and he'd give us, like, oh yeah, here's a great, here's a great way to solve that problem.
And nowadays we go to him and every time we like, we're like, Hey, we have this problem. And he'd be like, ah, that's a tough one.
And it's like,
Jaz: we're getting out of the realm of easy, of easy answers that are like well explored. And we're into the like, yep, that's, that's a challenge at scale.
Autumn: Can I come be a free technical consultant just so I can talk to Martin and get memes?
Like, can I be paid in memes?
Jaz: You can talk to Martin on the network, and I think Martin also goes to a good number of conferences as well. You could, you could end up at a...
Autumn: I'm gonna not stalk him in a creepy way, but in a very nice, professional way.
Justin: Yeah.
Autumn: Get him to come to one
Justin: ...of the conferences we just did.
With that general overview, I remember back when you were scaling a million users a day, you had to go rack some servers, right? Like, there was a point where you were like, hey, we need to scale up. And, and it's not in the PDSs. With, with millions of users coming, even though that's growing, you can still scale the, the bare metal, because those are rentals.
That's a, it's a provider where you say, give me another one, we'll provision it, it'll come into the network. And then on the other side, there are, like, some cloud services running; like, those skip. But, like, somewhere in there you had to rack. And that's mostly, you said, for the firehose?
Jaz: For that kind of global index. So, the data service that does all the querying to the database, the database cluster itself,
and a couple of other things, like the Discover feed and stuff, that run on-prem. So those all require machines to run on, and we don't have a magic, uh... like, I can't change a number in a Pulumi deploy and then magically have more hardware available in the data center. It's a whole process. You've got to go through the acquisition process.
You've got to find a vendor. You've got to talk to a vendor. You've got to, you know, spend some money on some new machines. They get shipped. You have to go to the data center. You have to receive them. You have to unbox everything, rack it, hook it up, network it, burn it in, provision it, and then you can figure out, all right, how are we going to, like, migrate the workload to this, to this new hardware?
Autumn: That's why I think that, like, you have to have that happy medium between cloud and, like, on-prem. Like, everybody acts like either is some magical solution, and I'm just like, we're going to pretend that we forgot the lead time and all the stuff you have to do to get something on-prem. Like, it is cheaper, and it does need to be used a lot more, because putting everything in the cloud is just not cost-efficient. But I think people forget how long it takes to get stuff on-prem, and then the fact that you have to go fix it when it burns out.
Jaz: There's a lot of convenience that comes with cloud, but you definitely pay for it. And you don't necessarily pay for it in the things that you expect to pay for it in. Like, you don't expect, ah, you're gonna charge a markup on this EC2 instance based off of how powerful it is. You end up paying most of it in, like, kind of hidden places, like, you know, in egress fees or in, like, WAF requests or something like that.
Autumn: You're also kind of beholden to them and their decision-making, you know.
Jaz: Yeah, a lot of a lot of cloud providers haven't really passed down cost savings of like more efficient hardware to consumers. So like the cost of an EC2 instance per like vCore hasn't really, or vCPU hasn't really gone down much over time.
And the number of vCPUs you can pack into a single machine and that you can, the amount of compute you get per watt in a data center has had insane leaps in the past 10 years.
Autumn: I'm really interested to see where that goes, right? Like eventually they're going to have to figure out how to compete with on prem, you know what I mean?
And it's just interesting the way that they've made cuts in certain areas. And I'm like, bro, you're making cuts for the most expensive stuff that you run, but not the stuff that you get for the cheapest, which is very interesting.
Justin: I mean, almost all of them have doubled down on their investments in custom silicon.
And so they all say, like, oh, we're going to... the, the AWS play is Graviton is more efficient per watt. And so you should go to Graviton. You should use our...
Autumn: About, like, the bad place... but Graviton is kind of fire. Now, is that a good excuse, uh, for what we're talking about? No, but I think that is going to be one of the best things that has come out of the bad place in a long time.
Justin: So for you, looking back on these separate places that pieces of infrastructure run, and putting things on-prem, and having to go through that whole "we have to scale this thing up": do you think that was still a good decision?
Jaz: Absolutely. Yeah. I mean, the way that we approached it was, hey, let's build out... let's way overbuild our on-prem solution, and then we'll be
ready for, you know, insane overheads if something crazy happens. And then even now, we, like... we just, we recently finished an expansion in our, in our on-prem POPs. And even that was, like, a preemptive measure. It was a, cool, we're not near the limits of the hardware we have right now, but if we want to keep really healthy overhead in our POPs, we should probably do some expansion.
And so a lot of this comes from, like, planning for a couple orders of magnitude, and then making sure that in the time it would take to grow by a couple orders of magnitude, you, you can get hardware where it needs to be in time.
Autumn: Whatever you're doing, the planning is very well placed. You're doing a great job.
Jaz: A lot of it was kind of scarily instinct. Like, the most recent expansion that we did, I was like, this was after Brazil, you know, we saw
Autumn: Which is such a weird... like, I almost wonder if y'all should just pay, like, Elon at this point, or, like, send him a gift, because, like... every time, like, there for a while, every time he said something stupid or did something stupid, it would just be like, spike.
Like, you could tell, like, you'd just be like, what did Elon do today? 'Cause there's so many new people.
Jaz: You can see them on graphs. They're pretty noticeable and pretty sharp on the graph.
Autumn: Make a meme of the graph. Right. And then put his head on each.
Justin: You're like, we'd mark our, all of our graphs with our deploys.
And instead you have all these marks of, like, news articles.
Autumn: A little blip of, like, the dumb thing he did that day, you know. Like, talked crap to Brazil.
Jaz: So much of our planning and everything is, like... we, we don't have control over how many people are going to decide to use our website today.
Autumn: There was, like, a rumor at Tesla that when his girlfriend changed the color of her hair, or that, like, if they had a fight, like, if they saw them walk out, the handlers of the Elon would, like, panic and then figure out how to make it the least
wild outcome of that. Like, can you imagine? This dude is, like, CEO of a company, like, and they have handlers because they're worried about, like, what will result after this argument or hair color? Like, can you imagine that environment? And you know what I mean? And now it's affecting a whole nother company.
And now we're just like, let's try it with the country. It's going to be great.
Jaz: I mean, well, when, when all of Brazil loses access to Twitter, like, overnight... that was an insane moment. That was like... I think some numbers I can talk about, which are fun numbers, is, like, total request throughput across the PDSs. So, like, that's kind of our, our big "how much load is going on right now" number.
And before Brazil, we were doing, like, three and a half K, 4,000 requests a second peak a day. And then Brazil happened, and we shot to 25,000 requests a second across all of the PDSs. And then in November, we hit our new kind of record, which was, like, 50,000 requests a second. So we're still, like, way above Brazil's peak on, like, a daily basis now.
But it is insane to me that that was, that was like a 10x event for us, which is crazy. And now that has become normal in, like, a few months. It's like, yeah, that's just what we deal with every day. We were running around like chickens with our heads cut off when Brazil happened. And then November came along, and that was, like,
An even worse version of it, it was like four Brazils or something crazy like that. After Brazil happened, we were like, alright, how the heck are we going to plan for a 10x of this? But we, we did everything we could to like, alright, can we prepare for a 10x of what that just was? And now November happened and we're like, alright, how do we prepare, how do we prepare for a 10x of what that was?
Autumn: Okay, like, low-key though, did you think, like, it was going to hit the fan in November? Or, like, did you, like, did you anticipate it at all?
Jaz: Yeah, we were prepared. I don't think we... anybody expected that we were gonna, like, triple our user base in, like, three weeks. We had 10 million users leading up to the election, roughly, right?
And we're... a few months after that, like, I think we hit 25 million users within, like, a month of the election.
Autumn: For the small issues you had, they were very well handled. Like, y'all were just...
Justin: I would... that's actually... so we asked questions, or I asked questions on Bluesky, like, hey, anyone have questions to ask?
And one of them was about incident management. How does that, how does that work? How do you learn from some of those incidents you've been having? Like there's, there's always something going on, um, between all the planning and, and the other, the normal things you have to do to like deploy and make software better.
How do you handle those incidents?
Jaz: A lot of metrics, a lot of dashboards. That's, that's kind of the most important thing, is, like, if you are not measuring something, it is very hard to improve it. And so we lean really heavily into...
Autumn: Can you say that louder for the people in the back, Jaz? Observability and monitoring is important, engineers.
Jaz: If you, if you can't measure it, you, you can't meaningfully improve it. Or at least you can't prove that you improved it. So when things were going crazy in November, we had what I call, like, the 11 days from hell. Which was 11 days of 16 hours a day in a situation room, from, like, the moment you wake up to the moment you go to bed. And then, like, wake up, check some graphs in bed, line is still going up, get ready as quickly as you can, get downstairs, log into the situation room, and figure out what's on fire this morning.
Autumn: Tell me there was coffee.
Jaz: I drink Monster, but yeah, there was, there's, there's a lot of, uh, a lot of...
Autumn: Dang, see, that's why, that's why their infrastructure never goes down.
Jaz: That was, there, there were so many, like, so many different components hit. I guess you would call them like early scaling limits, not that they were at the maximum of their design, but that they've never been pushed that hard before.
And so we were shaking out bugs all over the place, like scaling, scaling issues, or, like, some concurrency bug or something like that, that was falling out from so many different systems all at once. Because when you drive a truck over a bridge, and if the truck is really heavy, and it's, like, too heavy, and you have this really old, like, bolt in the bridge, the bolt could, like, get broken, or shear, or fall off, and that reduces some of the stress of the bridge.
Like, it starts swaying, and then a bolt fires off, and then it stops swaying as much. And that's kind of how you, like, release tension in a bridge when it's under stress. But if you land, like, an AC-130 on the bridge, or you're, like, taking, like, a giant jumbo jet, or, like, some kind of massive 747 landed on the bridge all at once, and a bunch of bolts pop loose all at once, and you're like, oh, crap.
Which one do we go fix first? Which one is like structurally important to the success of the bridge? So when you scale insanely fast in a really short period of time, you have a lot of systems that hit these early limits or that, that shoot these bugs out like bolts off of a bridge. And you have to figure out through your metrics, figure out, okay, which services are okay, which services are not okay.
And then dig into the services that are not okay and figure out, all right, where are we running into problems? One of the craziest issues we had was like everybody's handles started suddenly started becoming invalid because we ran into the limits of public DNS resolvers. We were like hitting Google Public DNS Resolver and Cloudflare's Public DNS Resolver so heavily they started rate limiting us and we just couldn't do DNS queries anymore.
Autumn: Okay, can we just talk though? Like, why is it always DNS? DNS finds new ways to like, just ruin people's lives. Like, it wakes up in the morning and it's like, how can I be difficult in a way that they'll never expect? Like, it's never something that's easily figured out. You gotta go down the whole rabbit hole, figure out some way that you've never heard of before.
Justin's problem was also somehow tied to DNS. Like, it's always, every time. And it's never, like, a normal error that, like, makes you think, okay, it's this. It's always something ridiculous that's just this rabbit hole.
Justin: Every error message in every application for every log everywhere should probably just end with, it might be DNS.
Autumn: No, seriously. It should be like, go hit this line. This thing's mad at you. But also if this fails, is it DNS?
Justin: Segfault. Maybe it's DNS. I don't know.
Jaz: And then Kubernetes was like, hey, what if we put DNS everywhere? What if we wove DNS through the entire stack?
Justin: Actually, that's a good question because you said you were doing Kubernetes at previous startups.
You don't have any Kubernetes in the stack now, right?
Jaz: We have no Kubernetes.
Justin: It's all VMs and it's still containerized?
Jaz: It's containerized. It is containerized, but it is not a lot of VMs, even, honestly. It's just, like, SSH into the box. It's kind of running, you know, Linux right on top of the bare metal, and then it's running Docker.
Justin: So no traditional orchestrator.
Jaz: No, no traditional orchestrator at the moment.
Justin: It's like Ansible jobs, Docker run.
Jaz: Yeah, Ansible jobs, Docker Compose, and a couple of tweaks to make things faster. We're not using, like, the Docker logging, because Docker logging is not very good if you have really, really high-throughput logs.
So we're using svlogd, which is in runit. And so svlogd lets you just log to a directory, and it kind of cycles through files, and then you can use, like, Promtail to scrape those directories. So every container gets its own logging directory, and then it just pipes it to svlogd, and svlogd is really lightweight and it handles all the log management without having to do, like, standard-out piping or anything like that.
Justin: Every user is a website, a SQLite database, and a svlogd.
Jaz: Yeah, exactly. Exactly.
Justin: It's a whole stack right there.
Jaz: It works surprisingly well. Uh, you also want to make sure that you're not, like, doing userspace Docker NAT, because userspace Docker NAT is how you make your high-throughput services be very low-throughput.
Justin: Well, you're not running everything like network hosts though, right?
Jaz: Uh, no. I mean, you can, you can run kernel-level NAT, which is, which is a lot less, uh, messy than user-level NAT for Docker. It's not as CPU-intensive, I guess I would say. Uh, there's less, less packet copying going on. But that's one of the reasons we don't, didn't want to run Kubernetes: we've got these really cool bare metal machines.
We don't want to add so many layers of virtualization on top of them that we lose a lot of the, like, benefit of being close to the metal.
Justin: You're gonna hide all that performance under abstractions.
Jaz: Yeah, yeah, exactly, exactly. Say goodbye to your, your cache locality. Say goodbye to, I don't know, whatever it is you're, you're trying to do, because your, your container is being preempted because the Kubernetes, the kubelet, needs to come in and do something, or whatever it might be.
I mean, you can tune Kubernetes for performance and you can run it in a high-performance way. We don't have the expertise to do that. But what we, we do know is, yeah, you can just...
Justin: And a lot of this... I mean, a lot of the orchestrators, typically you have a dynamic infrastructure, right? Like, you have machines coming and going frequently.
You need to reshuffle things or reallocate things. And in a lot of your case, at least half of your infrastructure is fairly static. It's like, we have a bunch of machines over here that are running PDSs, a bunch of machines over here running all the AppView and database flows and everything. And, and you can define that; that's a spreadsheet.
That's not an orchestrator
Jaz: It's all very static, and, and you buy the capacity when you buy the machines, right? You can use as much or as little of it as you want to; you've already paid for it, basically.
Autumn: Do you think Bluesky will somehow figure out a way to incorporate video and images more? So that way we don't have to go to any of the bad places.
Jaz: I think so.
Yeah. I mean, I think recently we launched video feeds. So feeds can describe themselves as, like, primarily a video feed, and then they'll go into that kind of vertical video-scrolling mode. That was, like, a six-day project by the front-end team. That was actually, like, kind of an insane turnaround on that. So we have a couple of things where we do a very hackathon mindset, and, and we're like, cool, how quickly can we get something that is, like, of our quality standards shipped to production?
When you're at a tiny company, you know, you've got, like, 20-something people and you're dealing with tens of millions of users, there's a lot of priority juggling. And so you've got, like, stuff that's easy to do and stuff that is important. There's stuff that's, like, fast and easy, and stuff that's important.
And if it's in that quadrant, you, you kind of just do it immediately: drop whatever you're doing, go do that thing. And then you have stuff that's, like, a little bit harder to do and it's important, and that's work that you try to schedule. And then you have work that is, stuff that's, like, hard to do and unimportant,
and that's stuff that falls to the, kind of, the bottom of your priority list. And then there's stuff that is easy to do but unimportant, and if you need extra dopamine and there's nothing on the easy-important list to do, you gotta do that stuff.
Justin: Speaking of, of possibly important, I'm going back to some of the questions here.
Someone's asking about, like, expansion outside the U.S. What does that look like in your network, which is mostly static? Are you going to... are you planning on doing some, like, oh, these users really care about data locality, or this country does, so we have to put the PDSs or the whole stack in that environment, in their country, within their borders?
Jaz: I'm not up to date on the legal side of any of that, or, like, the regulatory side of that. From a just purely architectural standpoint, it should be something doable: like, run the PDS in another country, and then your canonical data lives in that country. And then the other side, like, if we wanted to run a POP in another country or something like that, we could, we could go set it up and move our hardware there.
Some countries are easier to do that in than others. And then the connectivity of that country is also important. It's like, cool, can we get a lot of bandwidth cheap? Is it going to reach our customers? There are a couple of considerations that go into where we place infrastructure. Right now, it's mostly in the U.S. just because that's the easiest place to put it. When it comes to delivering, like, images and video, we, we work with a CDN partner, and the CDN, they've got, you know, a whole distributed network of their POPs and their local caches and nodes and stuff.
Justin: Going back to the, the hardware, not going into super specific details, but as far as like, how did you decide what to pick for hardware?
Where were you looking? What were kind of the qualifications?
Jaz: I can talk about, like, the chips and stuff that we're running. We, we wanted to run AMD, because current-generation AMD in, in the data center is just at a scale that it is hard to push Intel to. It runs higher performance per watt, and you just get better density out of them.
That was kind of our decision on AMD versus Intel for that. And also we were very interested in, uh, the X3D, the 3D V-Cache, uh, chips that AMD is coming out with. And so, Genoa-X CPUs: we've got, like, some of our machines are spec'd with two of the 96-core, 192-thread Genoa-X series CPUs that each have 768 megs of L3 cache.
Justin: I mean, you're over 300 cores. Holy crap.
Jaz: Yeah, so it's uh, a gig and a half of uh, L3 cache in a single box across two chips, which is absolutely absurd. Yeah.
Justin: That's more than my first computer's, like, total RAM, and that's cache.
Jaz: Yeah, so you can get insane amounts of cache. You can get these like really, really high core density machines.
You could, you could pack a ton of RAM into a box. Like, if you're, if you're just buying your own box, you can stick a couple of terabytes of RAM into it. You can't get a couple of terabytes of RAM in a cloud VM.
Justin: You can, but you're going to pay for it. I mean, you probably have to like
Jaz: break, like, 16 different pieces of glass and, like, talk to, like, 30 different account reps before they'll let you get, like, a node with two terabytes of RAM in it.
Autumn: Which is where cloud is not fun when like, it's cool when you can get an instance in seconds. It's not when you have to break glass and ask permission.
Jaz: Yeah, we can buy hardware that is very kind of tailored to the workloads that we're doing. So ScyllaDB is a big, distributed, horizontally scalable database.
It's got a shard-per-core architecture, so you can throw a bunch more cores at it and it will just kind of scale horizontally. But what it does want is a lot of RAM and a lot of NVMe. And NVMe is cheap these days. You can get, like, a 15-terabyte enterprise NVMe drive for, like, two grand.
Autumn: Is it as hard to manage as Cassandra is?
Jaz: It's been, when we've been using it correctly, it's been totally quiet and we've had no issues with it. We do have the timelines workload that is doing those like. Many, many, many writes a second to timelines is not the best fit for like an LSM tree with, with size to your compaction. So we've running into performance issues there that were really annoying.
We've got past some of them by kind of segmenting that workload into its own cluster. And now it no longer has an impact on like P99 latencies for every other operation that goes on on the website. Uh, but it was all in one big cluster. I think
Autumn: that's kind of the secret of databases. Cause everyone thinks that no SQL or.
Using one or the other is going to be some sort of magical thing because they think it's not, doesn't have to be a structured or it's not, doesn't have to be like is relational, but they're all you have to write, use the right tool for the job and then the right access patterns and all kinds of stuff.
Justin: I mean, I think the secret of databases is that everyone has to use it wrong the first time, right? And then you figure out, oh, this one's different.
Jaz: There is no database that will support wildly different workloads on the same instance, on the same cluster, basically. That's what we've learned.
You can design your database as heavily as you want to, but if you have a really noisy neighbor, it's gonna thrash your caches and you're gonna have really bad performance, or it's gonna cause a bunch of compactions to kick off, and you're gonna be wasting a bunch of CPU time on compactions that could have been serving requests, and your latencies are gonna be all over the place.
So when we bought hardware, we were like, okay, cool, let's buy hardware to run a Scylla cluster, let's buy hardware to run a couple of really highly concurrent Go processes, and then some more generic hardware to run more generic things, like a bunch of TypeScript containers and stuff like that. So the core data service I was talking to you about in November was running on 16 containers across two physical machines in both of our data centers.
So two machines in each DC, eight containers per DC. Those machines had 384 logical cores with SMT, and so each Go process was getting a couple dozen cores.
Justin: And still, when I think of that scale, you're literally talking about four physical servers. If I wanted to replicate that in a cloud architecture, that is at least 30 VMs somewhere, with a couple of queues and something else, and, like...
Jaz: That complexity, versus four physical servers handling, across all four of them, in the neighborhood of 700,000 requests a second from the AppView and querying a database around four and a half million times a second.
Autumn: Your experience as a hardware engineer and a software engineer really meshes well with working in infrastructure, because if you didn't know hardware as well, you probably wouldn't be able to go and pick the right everything. It seems like you have a really good knack for right-sizing and picking the right things.
And I think people struggle with that so much. They're all tools, right? But how do you go and use that tool efficiently, right? And the fact that you worked with bare metal and you worked with hardware and let's be real, it's easier to figure out cloud because there's a lot more kind of tutorials and information out there to go figure that out, right?
You came with the hard stuff and then you get to meld that together.
Jaz: I feel like a lot of it is instinct at this point, or it's like, I feel like I'm guessing really often. When you are, like, right sizing for hardware, you're never gonna make a decision with as much data as you want. You'll never reach a point where every decision that you make is fully informed, and you're like, Ah, yes, this is clearly the obvious decision because I have all the information I need to make this decision.
So I will just make the correct decision. What you're left with is: what do you know? What do you have experience with? And then, what does your gut say?
Autumn: A lot of times that's almost more important. I've learned through working at different companies that sometimes it's more like what your engineers know and what they're good at and then finding the best tool that they have experience with.
Rather than just picking the best tool; all of that has to be accounted for.
Jaz: Making the decision, and making the correct decision, is hard. Choosing when to make a decision is another really important skill that takes a lot of experience to get, and I don't have a ton of that experience right now.
Jake, our previous infra lead, made a lot of these decisions where I was like, are you sure? Like, I don't know, is this going to work? And a lot of those have very clearly panned out, and I've bowed to his wisdom on a lot of that. And now I'm in the position where I'm like, I hope I know what I'm doing.
Like, I have no idea what I'm doing, but, you know, we're still alive, so I must be doing something right. And choosing when to make a decision is also very important, because delaying decisions until you have more information is good if you really don't have enough information to make the call, but being indecisive can slow you down, or it can cause problems, or it can make more work for you.
And so you have to constantly make this trade-off: should I just make a decision, go with it, and commit to it, because we'll get more done that way if the decision isn't super high stakes? Or, if it's a really high-stakes decision, how do I wait just the right amount of time so that we have enough information but we're also not missing the boat?
Justin: Looking back over the last 18 months, were there any decisions you regret, either ones you made at the wrong time or ones you just decided wrong? I'm just asking because, you know, there's a lot of learning experiences here.
Jaz: Any decisions that I regret? I don't think I can fault any of the major decisions that we've made, because, well, we haven't fallen over. Nobody could have possibly predicted the ridiculous trajectory that we're on, like, except for Jake when he wrote that spreadsheet.
But like,
Autumn: If you could have predicted all of this, then we should pay you for predicting the election and a bunch of other really unstable world events. These have all been very heavy outside influences.
Jaz: I do kind of firmly believe that, from an infrastructure standpoint, we have made the best decision that we could with the information that we had, pretty much across the board.
And with more information, we wouldn't have believed it. If I sent myself back from the future and was like, hey, you have to prepare for this scale, I would have been like, you're insane, get out of here.
Autumn: I saw a post like that on Bluesky today. It was something random about where we are now versus ten years ago, like, if I went back to 2004 and told people what's going on in the future, I'd get put in a mental asylum. And I was like, they're not wrong. Like, they're so not wrong.
Jaz: Back in November of 2023, we re-architected the entire backend. The entire backend was on one big Postgres instance, or, like, a bunch of Postgres replicas; the PDS and the AppView were merged into one big thing. It was all just one giant Postgres serving a hundred thousand users.
We broke those roles apart and then moved to the V2 architecture, which is, hey, Scylla-based: rewrite the entire data schema, build it all from scratch, and design it to support up to 100 million users, which was the goal at the time. And we had 100,000 users, and we were like, cool, we're going to build for three orders of magnitude, with only the information of, you know, operating at 100,000 users.
None of us had any idea what the hell we were doing. This was all way pie-in-the-sky architecture engineering stuff. We got some idea of what it was going to look like, and then I went heads-down for like six weeks, from Christmas to the end of January, and just wrote out our entire new data architecture, then implemented it and got it running on our hardware.
Autumn: I hope you guys are going to a beach in Mexico at some point, because y'all are working some hours.
Jaz: Right before the public launch back in February of last year, five days before that, we silently shifted the entire backend from the cloud, on top of a big Postgres, to running on our own hardware, and nobody noticed. We backfilled all the data, we had it all running for a couple of days before everything switched over, and then we just slowly moved one PDS at a time and pointed it at the new architecture. And so over the course of about an hour, we shifted 100 percent of traffic onto the on-prem loadout. That was the moment where I was like, I can't believe we just did that. We went to a cave and wrote this whole thing.
And then, like, all right, I hope it works, we'll see what happens when it actually gets users on it. And then it just frigging worked. And it was like, you're kidding me. We had like two bugs, a tiny, tiny percentage of people noticed them, and we fixed those within a day or two. And I was like, all right, what's next?
Autumn: I feel like someone tried to explain what an SRE was the other day on Bluesky to people that were not technical. And it's wild, because nobody knows what you're doing until you mess it up, and then they know what you're doing, you know what I mean? So, like, that's such a huge achievement, to do that much of a data switch. And the way you know you did it right is that nobody noticed, you know?
Jaz: Yeah, that was one of the very high-stakes moments. We've had a couple of those since then, like turning on video, which was like, I have no idea. The backend for video is all custom. I wrote our entire video processing pipeline, I architected it and set it up, and it just runs on a bunch of machines that we don't operate. And I was like, I think this should be horizontally scalable. I've run it in Docker Compose on my work machine and scaled it to however many, you know, hits a second, and it worked fine.
It should probably be okay, but our only way of figuring it out was, all right, turn the dial, actually let users use it, and see what happens. And this was right after Brazil. So Brazil happened, we had 10x the number of users we expected to have, and I had been building video for the previous number of users.
But I was like, I want it to be able to scale horizontally to a billion. And then Brazil came on, and Paul was like, can we still do video? And I was like, give me a week. Like, yeah, give me a week, let me update some spreadsheets to figure out what the costs are going to look like.
Give me a week, and then, yeah, let's do video. We had a last-minute architectural change with video as well, and that was insane. It was the morning of the video launch. We had a transcoding partner that was going to do like half of our video encoding for us, and a big chunk of the workflow. We submitted some jobs to their queues that morning through their API, and it took like an hour to process the video, and I was like, what? This was working just fine last night, it was happening in seconds. And they said, oh, you know, there's a really big backlog right now. And I was like, I can't ship that to millions of users. That's not acceptable. People can't upload videos if it's gonna take an hour to process a 60-second video; that makes no sense. So in about 14 hours of insanity, I rewrote their entire part of that stack into the existing job system that I built.
And I was like, cool, I'm just going to replace your product and I'm just going to shove these into an S3 bucket.
Autumn: What kind of Monster do you drink? Goodness.
Justin: Paul drinks Red Bull, doesn't he? It's like between Red Bull and Monster.
Autumn: Paul needs a fridge of Red Bull.
Jaz: I think I ate that night, briefly. Yeah. Me and Divey.
I said to Divey, hey, I think this is how this can work: can you figure out how to get the CDN to front this S3-compatible bucket, this block-store bucket? And then I will do everything I can to get us to encode these HLS streams and get them into that block-storage bucket.
And then hopefully it should just work, maybe. Um, and we literally launched the next day.
Autumn: You're like, oh, Jake did this and like, oh, I didn't do anything big. And I'm like, are you listening to the words that are coming out of your own mouth?
Jaz: It was a lot.
Autumn: It was a bajillion times. And like, it was no big deal though, I just did it with a Monster.
Jaz: The secret to video encoding is everybody's just calling FFmpeg. It doesn't matter how big of a company you are. I mean, maybe if you're at Google scale or something you're not doing it anymore at that point, but so much of it is just, yeah, you're calling FFmpeg.
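As a sketch of what "just calling FFmpeg" tends to look like in practice, here is a minimal Go worker that shells out to ffmpeg to produce an HLS playlist and segments, ready to be pushed into an S3-compatible bucket behind a CDN. The flags, file paths, and upload step are illustrative assumptions, not Bluesky's actual pipeline.

```go
// Minimal sketch of a video job worker that shells out to ffmpeg to produce an
// HLS playlist plus segments, ready to be uploaded to an S3-compatible bucket
// behind a CDN. Flags, paths, and the upload step are illustrative assumptions.
package main

import (
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

func transcodeToHLS(input, outDir string) error {
	cmd := exec.Command("ffmpeg",
		"-i", input, // source upload
		"-c:v", "libx264", // H.264 video
		"-c:a", "aac", // AAC audio
		"-f", "hls", // HLS muxer
		"-hls_time", "4", // roughly 4-second segments
		"-hls_playlist_type", "vod",
		"-hls_segment_filename", filepath.Join(outDir, "seg_%03d.ts"),
		filepath.Join(outDir, "playlist.m3u8"),
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	if err := os.MkdirAll("out", 0o755); err != nil {
		log.Fatal(err)
	}
	if err := transcodeToHLS("upload.mp4", "out"); err != nil {
		log.Fatalf("transcode failed: %v", err)
	}
	// A real pipeline would now push out/playlist.m3u8 and out/seg_*.ts into
	// the block-store bucket and let the CDN serve them.
}
```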
Justin: Disney FFmpegs. It's just legit. Like, yeah, there's some hardware that's specialized for it.
Autumn: It's so phenomenal that Disney didn't fall over on itself. Also, can we talk about the amount of times that we saw the Twitter fail whale in the early Twitter scale days? Y'all are killing it.
Jaz: The secret is we're a distributed system, so we're never fully down.
We only ever have partial outages. We only ever have service degradations. So occasionally the website goes into read-only mode, and you can't like things or anything, and they all get backed up in a queue somewhere, but you can still scroll. You can still scroll. Scrolling's got nine nines.
Jaz: If your system is distributed enough, you're never fully down.
Autumn: Your bugs will always be ten times worse, because you have to figure out where you went wrong, but it'll be up, and it looks great for customers.
Jaz: Exactly. All of your bugs are Heisenbugs.
Jaz: What's next?
Justin: What's next for Bluesky infrastructure? What are you looking at?
Jaz: We just did some hardware scaling, which was exciting.
We're probably going to do some more of that in the future, depending on how growth goes this year. You know, we were at 100,000 users 18 months ago; we're sitting at just shy of 30 million users today. There's a lot of maturing of our data architecture that we have to do. There's a lot of low-hanging fruit in how to do caches better, how to coalesce requests better, how to do hybrid timeline fan-out stuff for celebrities. There are so many different things. If we stretched this past six-month period over the course of two years, it would have gone totally differently.
Everything would have been perfectly smooth. We would have no tech debt. It would have been great, because we would have scaled at a rate where you can see what's going to be a problem slightly ahead of time, anticipate it, and go do something about it. But where we're at now is: problems are either on fire or they're not high enough priority.
That was in November. Now we've bought ourselves some more breathing room, and so I'm starting to look at how we do service discovery. We have a bunch of services that are like, here's a static list of instances to go try to talk to. And if one of those instances goes down and I can't bring it back up, because it had some load-bearing Bloom filters or something like that and we're in peak traffic, everything gets mad.
I have to go redeploy all of the services that talk to it to tell them, hey, don't try to talk to this one. So there's some kind of dynamic configuration and service discovery that we want to get rolling. Lots of caching infrastructure changes. Maybe writing a custom database for timelines. That's one thing that's been on my mind: an LSM tree is not a great fit for this circular-buffer-style timeline, where you've got a fixed length of references you want to put in everybody's timelines.
Then you want to overwrite the oldest one when a new one comes in. It feels a lot like a circular buffer, and I'm like, okay, cool, can we do something with that? Can I go write a database for timelines that is specially built for this workload, really efficient, and scales way farther than I need it to right now? So, yeah, writing some databases. I did that with a graph database last year.
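As a rough illustration of the circular-buffer idea described here, a per-user timeline with a fixed number of slots, where the newest reference overwrites the oldest, could look something like this in Go. This is purely a sketch of the data structure, not the database Bluesky would actually build; a real store would persist something like this per user on disk.

```go
// Rough sketch of the circular-buffer timeline idea: each user's timeline
// holds a fixed number of post references, and a new entry overwrites the
// oldest one. A purpose-built store could lay this out per user instead of
// paying LSM-tree compaction costs for an overwrite-heavy workload.
package main

import "fmt"

const timelineSize = 8 // tiny for the example; real timelines hold far more

type Timeline struct {
	refs [timelineSize]string // post references (AT URIs); oldest gets overwritten
	next int                  // index of the slot to overwrite next
	len  int                  // how many slots are currently filled
}

// Push inserts a new reference, overwriting the oldest once the buffer is full.
func (t *Timeline) Push(uri string) {
	t.refs[t.next] = uri
	t.next = (t.next + 1) % timelineSize
	if t.len < timelineSize {
		t.len++
	}
}

// Newest returns the stored references from newest to oldest.
func (t *Timeline) Newest() []string {
	out := make([]string, 0, t.len)
	for i := 1; i <= t.len; i++ {
		idx := (t.next - i + timelineSize) % timelineSize
		out = append(out, t.refs[idx])
	}
	return out
}

func main() {
	var tl Timeline
	for i := 0; i < 10; i++ {
		tl.Push(fmt.Sprintf("at://did:plc:example/post/%d", i))
	}
	fmt.Println(tl.Newest()) // posts 9 down to 2; posts 0 and 1 were overwritten
}
```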
Justin: Yeah, like that's totally no big deal.
Autumn: Because everybody does that. "I'm just going to change the way that, you know, the AT Protocol and social media do data."
Jaz: Hey, any problem is tackleable if you limit the scope of the problem hard enough.
Autumn: The next time you go for a job interview or write a bio, call us. This is your new resume. Yeah, you're just not doing what you do justice, okay?
Jaz: It's, yeah, I don't know. You wear so many hats on a tiny team that I forget what I do a month afterwards, because the past month is like...
Autumn: Because you slept for, like, an eighth of that month?
Jaz: The past month is like a whole... every month is like, we're in a whole new league. Oh crap, now we're in a whole new league. Oh crap, now we're in a whole new league.
Autumn: And it's like your poor brain hasn't had the time to turn off and, like, register the memory.
Jaz: I took some time off over the winter holidays. I got like a week or two off there, which, uh, gave me some breathing room. I slept for eight hours. It was okay.
Justin: Jaz, thank you so much for coming on the podcast and explaining all of this. The rollercoaster of Bluesky over the last year and a half has been phenomenal. I've been enjoying it thoroughly.
I've been trying to play with the new things you've been putting out, with PDSs and wherever I want to, you know, poke at a firehose and whatnot and see what's going on.
Autumn: We are sorry that Justin does hoodrat stuff with your infrastructure. We apologize.
Justin: I definitely am one of those abusers.
Autumn: Just, like, look, we're going to make a little page where we can send you coffee every time Justin gets a bright idea, and then post about it to encourage other people to get said bright idea and do hoodrat stuff.
Jaz: If a well-intentioned dev can cause issues, then we've got work to do, right?
Autumn: Justin's your chaos engineering. He's your, like, chaos goblin.
Jaz: Retroid is definitely another one of our chaos engineers in the community. If you follow Retroid, he has, since the early days, been helping us find bugs in unlikely places.
That's a way to describe that relationship.
Autumn: That was such a nice way of doing it.
Justin: So everyone, thank you for listening. If you're on Bluesky, go look up Jaz. They're on the network, obviously, very actively posting and sharing their knowledge and everything. That's been fantastic to follow along with. And to everyone that's listening:
Thank you so much. We will talk to you again next week.
Jaz: Thank you for having me.
Justin: Thank you for listening to this episode of Fork Around and Find Out. If you like this show, please consider sharing it with a friend, a coworker, a family member, or even an enemy. However we get the word out about this show, it helps it become sustainable for the long term. If you want to sponsor this show, please go to fafo.fm/sponsor and reach out to us there about what you're interested in sponsoring and how we can help.
We hope your systems stay available and your pagers stay quiet. We'll see you again next time.