Screaming in the Cloud

Jake Gold, Infrastructure Engineer at Bluesky, joins Corey on Screaming in the Cloud to discuss his experience helping to build Bluesky and why he’s so excited about it. Jake and Corey discuss the major differences when building a truly open-source social media platform, and Jake highlights his focus on reliability. Jake explains why he feels downtime can actually be a huge benefit to reliability engineers, and how he views abstractions based on the size of the team he’s working on. Corey and Jake also discuss whether cloud is truly living up to its original promise of lowered costs.

About Jake

Jake Gold leads infrastructure at Bluesky, where the team is developing and deploying the decentralized social media protocol, ATP. Jake has previously managed infrastructure at companies such as Docker and Flipboard, and most recently, he was the founding leader of the Robot Reliability Team at Nuro, an autonomous delivery vehicle company.

Links Referenced:

Bluesky: https://blueskyweb.xyz/
Bluesky waitlist signup: https://bsky.app

What is Screaming in the Cloud?

Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. In case folks have missed this, I spent an inordinate amount of time on Twitter over the last decade or so, to the point where my wife, my business partner, and a couple of friends all went in over the holidays and got me a leather-bound set of books titled The Collected Works of Corey Quinn. It turns out that I have over a million words of shitpost on Twitter. If you’ve also been living in a cave for the last year, you’ll notice that Twitter has basically been bought and driven into the ground by the world’s saddest manchild, so there’s been a bit of a diaspora as far as people trying to figure out where community lives.

Jake Gold is an infrastructure engineer at Bluesky—which I will continue to be mispronouncing as Blue-ski because that’s the kind of person I am—which is, as best I can tell, one of the leading contenders, if not the leading contender to replace what Twitter was for me. Jake, welcome to the show.

Jake: Thanks a lot, Corey. Glad to be here.

Corey: So, there’s a lot of different angles we can take on this. We can talk about the policy side of it, we can talk about social networks and things we learn watching people in large groups with quasi-anonymity, we can talk about all kinds of different nonsense. But I don’t want to do that because I am an old-school Linux systems administrator. And I believe you came from the exact same path, given that as we were making sure that I had, you know, the right person on the show, you came into work at a company after I’d left previously. So, not only are you good at the whole Linux server thing; you also have seen exactly how good I am not at the Linux server thing.

Jake: Well, I don’t remember there being any problems at TrueCar, where you worked before me. But yeah, my background is doing Linux systems administration, which turned into, sort of, Linux programming. And these days, we call it, you know, site reliability engineering. But yeah, I discovered Linux in the late-90s, as a teenager and, you know, installing Slackware on 50 floppy disks and things like that. And I just fell in love with the magic of, like, being able to run a web server, you know? I got a hosting account at, you know, my local ISP, and I was like, how do they do that, right?

And then I figured out how to do it. I ran Apache, and it was like, still one of my core memories of getting, you know, httpd running and being able to access it over the internet and telling my friends on IRC. And so, I’ve done a whole bunch of things since then, but that’s still, like, the part that I love the most.

Corey: The thing that continually surprises me is just when I think I’m out and we’ve moved into a fully modern world where oh, all I do is I write code anymore, which I didn’t realize I was doing until I realized if you call YAML code, you can get away with anything. And I find myself getting dragged back in. It’s the falling back to fundamentals in these weird moments of yes, yes, immutable everything, infrastructure as code, but when the server is misbehaving and you want to log in and get your hands dirty, the skill set rears its head yet again. At least that’s what I’ve been noticing, at least as far as I’ve gone down a number of interesting IoT-based projects lately. Is that something you experience or have you evolved fully and not looked back?

Jake: Yeah. No, what I try to do is on my personal projects, I’ll use all the latest cool, flashy things, any abstraction you want, I’ll try out everything, and then when I do it at work, I kind of have, like, a one or two year, sort of, lagging adoption of technologies, like, when I’ve actually shaken them out in my own stuff, then I use them at work. But yeah, I think one of my favorite quotes is, like, “Programmers first learn the power of abstraction, then they learn the cost of abstraction, and then they’re ready to program.” And that’s how I view infrastructure, very similar thing where, you know, certain abstractions like container orchestration, or you know, things like that can be super powerful if you need them, but like, you know, that’s generally very large companies with lots of teams and things like that. And if you’re not that, it pays dividends to not use overly complicated, overly abstracted things. And so, that tends to be [where 00:04:22] I follow up most of the time.

Corey: I’m sure someone’s going to consider this to be heresy, but if I’m tasked with getting a web application up and running in short order, I’m putting it on an old-school traditional three-tier architecture where you have a database server, a web server or two, and maybe a job server that lives between them. Because is it the hotness? No. Is it going to be resume bait? Not really.

But you know, it’s deterministic as far as where things live. When something breaks, I know where to find it. And you can miss me with the, “Well, that’s not webscale,” response because yeah, by the time I go from getting something up overnight to “this has to serve the entire internet,” there’s probably a number of architectural iterations I’m going to be able to go through. The question is, what am I most comfortable with and what can I get things up and running with that’s tried and tested?

I’m also remarkably conservative on things like databases and file systems because mistakes at that level are absolutely going to show. Now, I don’t know how much you’re able to talk about the Blue-ski infrastructure without getting yelled at by various folks, but how modern versus… reliable—I guess that’s probably a fair axis to put it on: modernity versus reliability—where on that spectrum, does the official Blue-ski infrastructure land these days?

Jake: Yeah. So, I mean, we’re in a fortunate position of being an open-source company working on an open protocol, and so we feel very comfortable talking about basically everything. Yeah, and I’ve talked about this a bit on the app, but the basic idea we have right now is we’re using AWS, we have auto-scaling groups, and those auto-scaling groups are just EC2 instances running Docker CE—the Community Edition—for the runtime and for containers. And then we have a load balancer in front and a Postgres multi-AZ instance in the back on RDS, and it is really, really simple.

And, like, when I talk about the difference between, like, a reliability engineer and a normal software engineer is, software engineers tend to be very feature-focused, you know, they’re adding capabilities to a system. And the goal and the mission of a reliability team is to focus on reliability, right? Like, that’s the primary thing that we’re worried about. So, what I find to be the best resume builder is that I can say with a lot of certainty that if you talk to any teams that I’ve worked on, they will say that the infrastructure I ran was very reliable, it was very secure, and it ended up being very scalable because you know, the way we solve the, sort of, integration thing is you just version your infrastructure, right? And I think this works really well.

You just say, “Hey, this was the way we did it now and we’re going to call that V1. And now we’re going to work on V2. And what should V2 be?” And maybe that does need something more complicated. Maybe you need to bring in Kubernetes, you maybe need to bring in a super-cool reverse proxy that has all sorts of capabilities that your current one doesn’t.

Yeah, but by versioning it, you just—it takes away a lot of the, sort of, interpersonal issues that can happen where, like, “Hey, we’re replacing Jake’s infrastructure with Bob’s infrastructure or whatever.” I just say it’s V1, it’s V2, it’s V3, and then I find that solves a huge number of the problems with that sort of dynamic. But yeah, at Bluesky, like, you know, the big thing that we are focused on is federation is scaling for us because the idea is not for us to run the entire global infrastructure for AT Proto, which is the protocol that Bluesky is based on. The idea is that it’s this big open thing like the web, right? Like, you know, Netscape popularized the web, but they didn’t run every web server, they didn’t run every search engine, right, they didn’t run all the payment stuff. They just did all of the core stuff, you know, they created SSL, right, which became TLS, and they did all the things that were necessary to make the whole system large, federated, and scalable. But they didn’t run it all. And that’s exactly the same goal we have.

Corey: The obvious counterexample is, no, but then you take basically their spiritual successor, which is Google, and they build the security, they build—they run a lot of the servers, they have the search engine, they have the payments infrastructure, and then they turn a lot of it off for fun and… I would say profit, except it’s the exact opposite of that. But I digress. I do have a question for you that I love to throw at people whenever they start talking about how their infrastructure involves auto-scaling. And I found this during the pandemic in that a lot of people believed in their heart-of-hearts that they were auto-scaling, but people lie, mostly to themselves. And you would look at their daily or hourly spend of their infrastructure and their user traffic dropped off a cliff and their spend was so flat you could basically eat off of it and set a table on top of it. If you pull up Cost Explorer and look through your environment, how large are the peaks and valleys over the course of a given day or week cycle?

Jake: Yeah, no, that’s a really good point. I think my basic approach right now is that we’re so small, we don’t really need to optimize very much for cost, you know? We have this sort of base level of traffic and it’s not worth a huge amount of engineering time to do a lot of dynamic scaling and things like that. The main benefit we get from auto-scaling groups is really just doing the refresh to replace all of them, right? So, we’re also doing the immutable server concept, right, which was popularized by Netflix.

And so, that’s what we’re really getting from auto-scaling groups. We’re not even doing dynamic scaling, right? So, it’s not keyed to some metric, you know, the number of instances that we have at the app server layer. But the cool thing is, you can do that when you’re ready for it, right? The big issue is, you know, okay, you’re scaling up your app instances, but is your database scaling up, right, because there’s not a lot of use in having a whole bunch of app servers if the database is overloaded? And that tends to be the bottleneck for, kind of, any complicated kind of application like ours. So, right now, the bill is very flat; you could eat off it, if it wasn’t for the CDN traffic and the load balancer traffic and things like that, which are relatively minor.
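As a rough illustration of the check Corey describes, the Python sketch below compares how much hourly spend swings against how much hourly traffic swings. The function names, the 0.5 tolerance, and the sample numbers are all hypothetical; real figures would come from a billing export and request metrics.

```python
def peak_to_trough(series):
    """Ratio of the largest to the smallest value in a series of hourly readings."""
    lowest = min(series)
    if lowest <= 0:
        raise ValueError("readings must be positive")
    return max(series) / lowest


def looks_auto_scaled(hourly_spend, hourly_requests, tolerance=0.5):
    """Heuristic: spend follows traffic if its swing is at least
    `tolerance` times as large as the traffic swing."""
    spend_swing = peak_to_trough(hourly_spend) - 1
    traffic_swing = peak_to_trough(hourly_requests) - 1
    if traffic_swing == 0:
        return True  # flat traffic, so flat spend is expected
    return spend_swing / traffic_swing >= tolerance


# Hypothetical data: traffic varies 5x over the day.
requests = [200, 150, 100, 300, 500, 450]
flat_spend = [10.0, 10.0, 10.0, 10.1, 10.2, 10.1]   # barely moves
elastic_spend = [4.0, 3.0, 2.0, 6.0, 10.0, 9.0]     # tracks traffic

print(looks_auto_scaled(flat_spend, requests))
print(looks_auto_scaled(elastic_spend, requests))
```

A flat result is not automatically a problem. As Jake notes, at small scale a fixed fleet can be the deliberate choice.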

Corey: I just want to stop for a second and marvel at just how educated that answer was. It’s, I talk to a lot of folks who are early-stage who come and ask me about their AWS bills and what sort of things should they concern themselves with, and my answer tends to surprise them, which is, “You almost certainly should not unless things are bizarre and ridiculous. You are not going to build your way to your next milestone by cutting costs or optimizing your infrastructure.” The one thing that I would make sure to do is plan for a future of success, which means having account segregation where it makes sense, having tags in place so that when, “Huh, this thing’s gotten really expensive. What’s driving all of that?” Can be answered without a six-week research project attached to it.

But those are baseline AWS Hygiene 101. How do I optimize my bill further, usually the right answer is go build. Don’t worry about the small stuff. What’s always disturbing is people have that perspective and they’re spending $300 million a year. But it turns out that not caring about your AWS bill was, in fact, a zero interest rate phenomenon.

Jake: Yeah. So, we do all of those basic things. I think I went a little further than many people would where every single one of our—so we have different projects, right? So, we have the big graph server, which is sort of like the indexer for the whole network, and we have the PDS, which is the Personal Data Server, which is, kind of, where all of people’s actual social data goes, your likes and your posts and things like that. And then we have a dev, staging, sandbox, prod environment for each one of those, right? And there’s more services besides. But the way we have it is those are all in completely separated VPCs with no peering whatsoever between them. They are all on distinct IP addresses, IP ranges, so that we could do VPC peering very easily across all of them.
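The non-overlapping ranges Jake describes can be sketched with Python's standard ipaddress module. The 10.0.0.0/8 supernet, the /16 size, and the environment names are assumptions made for illustration, not Bluesky's actual addressing plan.

```python
import ipaddress
from itertools import combinations


def allocate_vpc_ranges(supernet, names, new_prefix=16):
    """Give each named environment its own subnet carved out of `supernet`."""
    subnets = ipaddress.ip_network(supernet).subnets(new_prefix=new_prefix)
    ranges = dict(zip(names, subnets))
    if len(ranges) < len(names):
        raise ValueError("supernet too small for the requested environments")
    return ranges


def overlapping_pairs(ranges):
    """Return pairs of names whose ranges overlap and so could not be peered."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(ranges.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical layout: two services, four environments each.
envs = [f"{svc}-{env}"
        for svc in ("bgs", "pds")
        for env in ("dev", "staging", "sandbox", "prod")]
ranges = allocate_vpc_ranges("10.0.0.0/8", envs)
print(overlapping_pairs(ranges))  # an empty list means every pair is peerable

# A range picked by hand later on is where clashes usually creep in.
ranges["legacy"] = ipaddress.ip_network("10.3.0.0/17")
print(overlapping_pairs(ranges))
```

Running a check like this before creating a new VPC is cheap compared with the renumbering work that overlapping ranges force later.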

Corey: Ah, that’s someone who’s done data center work before with overlapping IP address ranges and swore, never again.

Jake: Exactly. That is when I had been burned. I have cleaned up my mess and other people’s messes. And there’s nothing less fun than renumbering a large complicated network. But yeah, so once we have all these separate VPCs and so it’s very easy for us to say, hey, we’re going to take this whole stack from here and move it over to a different region, a different provider, you know?

And the other thing that we’re doing is, we’re completely cloud agnostic, right? I really like AWS, I think they are the… the market leader for a reason: they’re very reliable. But we’re building this large federated network, so we’re going to need to place infrastructure in places where AWS doesn’t exist, for example, right? So, we need the ability to take an environment and replicate it in wherever. And of course, they have very good coverage, but there are places they don’t exist. And that’s all made much easier by the fact that we’ve had a very strong separation of concerns.

Corey: I always found it fun that when you had these decentralized projects that were invariably NFT or cryptocurrency-driven over the past, eh, five or six years or so, and then AWS would take a us-east-1 outage in a variety of different and exciting ways, and all these projects would go down hard. It’s, okay, you talk a lot about decentralization for having hard dependencies on one company in one data center, effectively, doing something right. And it becomes a harder problem in the fullness of time. There is the counterargument, in that when us-east-1 is having problems, most of the internet isn’t working, so does your offering need to be up and running at all costs? There are some people for whom that answer is very much, yes. People will die if what we’re running is not up and running. Usually, a social network is not on that list.

Jake: Yeah. One of the things that is surprising, I think, often when I talk about this as a reliability engineer, is that I think people sometimes over-index on downtime, you know? They just, they think it’s a much bigger deal than it is. You know, I’ve worked on systems where there was credit card processing where you’re losing a million dollars a minute or something. And like, in that case, okay, it matters a lot because you can put a real dollar figure on it, but it’s amazing how a few of the bumps in the road we’ve already had with Bluesky have turned into, sort of, fun events, right?

Like, we had a bug in our invite code system where people were getting too many invite codes and it sort of caused a problem, but it was a super fun event. We all think back on it fondly, right? And so, outages are not fun, but they’re not life and death, generally. And if you look at the traffic, usually what happens is after an outage traffic tends to go up. And a lot of the people that joined, they’re just, they’re talking about the fun outage that they missed because they weren’t even on the network, right?

So, it’s like, I also like to remind people that eBay for many years used to have, like, an outage Wednesday, right? Whereas they could put a huge dollar figure on how much money they lost every Wednesday and yet eBay did quite well, right? Like, it’s amazing what you can do if you relax the constraints of downtime a little bit. You can do maintenance things that would be impossible otherwise, which makes the whole thing work better the rest of the time, for example.

Corey: I mean, it’s 2023 and the Social Security Administration’s website still has business hours. They take a nightly four to six-hour maintenance window. It’s like, the last person out of the office turns off the server or something. I imagine some horrifying mainframe job that needs to wind up sweeping after itself or running some compute jobs. But yeah, for a lot of these use cases, that downtime is absolutely acceptable.

I am curious as to… as you just said, you’re building this out with an idea that it runs everywhere. So, you’re on AWS right now because yeah, they are the market leader for a reason. If I’m building something from scratch, I’d be hard-pressed not to pick AWS for a variety of reasons. If I didn’t have cloud expertise, I think I’d be more strongly inclined toward Google, but that’s neither here nor there. But the problem is these large cloud providers have certain economic factors that they all treat similarly since they’re competing with each other, and that causes me to believe things that aren’t necessarily true.

One of those is that egress bandwidth to the internet is very expensive. I’ve worked in data centers. I know how 95th percentile commit bandwidth billing works. It is not overwhelmingly expensive, but you can be forgiven for believing that it is looking at cloud environments. Today, Blue-ski does not support animated GIFs—however you want to mispronounce that word—they don’t support embedded videos, and my immediate thought is, “Oh yeah, those things would be super expensive to wind up sharing.”

I don’t know that that’s true. I don’t get the sense that those are major cost drivers. I think it’s more a matter of complexity than the rest. But how are you making sure that the large cloud provider economic models don’t inherently shape your view of what to build versus what not to build?
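For readers who have not seen it, 95th-percentile billing is commonly computed by sampling throughput every five minutes, discarding the top 5% of samples for the month, and billing on the highest sample that remains. Below is a minimal Python sketch using the nearest-rank method. The five-minute interval and the traffic numbers are assumptions, and individual carrier contracts may differ in the details.

```python
def billable_mbps(samples_mbps, percentile=95):
    """Sort the samples, ignore the top (100 - percentile)%,
    and return the highest sample left (nearest-rank method)."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Ceiling of n * percentile / 100, done in integer arithmetic.
    rank = (len(ordered) * percentile + 99) // 100
    return ordered[rank - 1]


# Hypothetical 30-day month: 8,640 five-minute samples.
# A burst that lasts under 5% of the month does not move the bill...
short_burst = [100] * 8240 + [1000] * 400
# ...but one that lasts longer than 5% of the month does.
long_burst = [100] * 8140 + [1000] * 500

print(billable_mbps(short_burst))
print(billable_mbps(long_burst))
```

This is why a brief traffic spike is cheap under commit billing, while the same spike on a per-gigabyte egress plan is charged in full.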

Jake: Yeah, no, I kind of knew where you’re going as soon as you mentioned that because anyone who’s worked in data centers knows that the bandwidth pricing is out of control. And I think one of the cool things that Cloudflare did is they stopped charging for egress bandwidth in certain scenarios, which is kind of amazing. And I think it’s—the other thing that a lot of people don’t realize is that, you know, these network connections tend to be fully symmetric, right? So, if it’s a gigabit down, it’s also a gigabit up at the same time, right? There’s two gigabits that can be transferred per second.

And then the other thing that I find a little bit frustrating on the public cloud is that they don’t really pass on the compute performance improvements that have happened over the last few years, right? Like computers are really fast, right? So, if you look at a provider like Hetzner, they’re giving you these monster machines for $128 a month or something, right? And then you go and try to buy that same thing on the public, the big cloud providers, and the equivalent is ten times that, right? And then if you add in the bandwidth, it’s another multiple, depending on how much you’re transferring.

Corey: You can get Mac Minis on EC2 now, and you do the math out and the Mac Mini hardware is paid for in the first two or three months of spinning that thing up. And yes, there’s value in AWS’s engineering and being able to map IAM and EBS to it. In some use cases, yeah, it’s well worth having, but not in every case. And the economics get very hard to justify for an awful lot of work cases.

Jake: Yeah, I mean, to your point, though, about, like, limiting product features and things like that, like, one of the goals I have with doing infrastructure at Bluesky is to not let the infrastructure be a limiter on our product decisions. And a lot of that means that we’ll put servers on Hetzner, we’ll colo servers for things like that. I find that there’s a really good hybrid cloud thing where you use AWS or GCP or Azure, and you use them for your most critical things, your relatively low-bandwidth things and the things that need to be the most flexible in terms of region and things like that—and security—and then for these, sort of, bulk services, pushing a lot of video content, right, or pushing a lot of images, those things, you put in a colo somewhere and you have these sort of CDN-like servers. And that kind of gives you the best of both worlds. And so, you know, that’s the approach that we’ll most likely take at Bluesky.

Corey: I want to emphasize something you said a minute ago about Cloudflare, where when they first announced R2, their object store alternative, when it first came out, I did an analysis on this to explain to people just why this was as big as it was. Let’s say you have a one-gigabyte file and it blows up and a million people download it over the course of a month. AWS will come to you with a completely straight face, give you a bill for $65,000 and expect you to pay it. The exact same pattern with R2 in front of it, at the end of the month, you will be faced with a bill for 13 cents rounded up, and you will be expected to pay it, and something like 9 to 12 cents of that initially would have just been the storage cost on S3 and the single egress fee for it. The rest is there is no egress cost tied to it.
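The shape of that comparison can be worked through in a few lines of Python. Every rate below is an assumption chosen for illustration (tiered per-gigabyte egress, flat per-gigabyte monthly storage), not a quote of any provider's current price list, so the direct-download total this sketch computes need not match the figure Corey quotes from memory.

```python
def tiered_cost(total_gb, tiers):
    """Price `total_gb` against (tier_size_gb, rate_per_gb) tiers.
    A tier size of None means 'everything that is left'."""
    remaining = total_gb
    cost = 0.0
    for size_gb, rate in tiers:
        portion = remaining if size_gb is None else min(remaining, size_gb)
        cost += portion * rate
        remaining -= portion
        if remaining <= 0:
            break
    return cost


# ASSUMED rates, for illustration only.
ASSUMED_EGRESS_TIERS = [(10_240, 0.09), (40_960, 0.085),
                        (102_400, 0.07), (None, 0.05)]
ASSUMED_ORIGIN_STORAGE_PER_GB = 0.023   # origin object store, per month
ASSUMED_CACHE_STORAGE_PER_GB = 0.015    # zero-egress store in front, per month

file_gb = 1
downloads = 1_000_000

# Every download served straight from the origin pays egress.
direct = tiered_cost(file_gb * downloads, ASSUMED_EGRESS_TIERS)

# With a zero-egress store in front: keep the file at the origin,
# pull it across once, and keep one copy in the front store.
cached = (file_gb * ASSUMED_ORIGIN_STORAGE_PER_GB
          + tiered_cost(file_gb, ASSUMED_EGRESS_TIERS)
          + file_gb * ASSUMED_CACHE_STORAGE_PER_GB)

print(f"direct: ${direct:,.2f}")
print(f"cached: ${cached:.2f}")
```

Whatever rates are plugged in, the direct bill grows with the number of downloads while the cached bill stays fixed at roughly one copy's storage plus a single transfer.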

Now, is Cloudflare going to let you send petabytes to the internet and not charge you on a bandwidth basis? Probably not. But they’re also going to reach out with an upsell and they’re going to have a conversation with you. “Would you like to transition to our enterprise plan?” Which is a hell of a lot better than, “I got Slashdotted”—or whatever the modern version of that is—“And here’s a surprise bill that’s going to cost as much as a Tesla.”

Jake: Yeah, I mean, I think one of the things that the cloud providers should hopefully eventually do—I hope Cloudflare pushes them in this direction—is to start—the original vision of AWS when I first started using it in 2006 or whenever it launched, was—and they said this—they said they’re going to lower your bill every so often, you know, as Moore’s law makes their bill lower. And that kind of happened a little bit here and there, but it hasn’t happened to the same degree that you know, I think all of us hoped it would. And I would love to see a cloud provider—and you know, Hetzner does this to some degree, but I’d love to see these really big cloud providers that are so great in so many ways, just pass on the savings of technology to the customer so we’ll use more stuff there. I think it’s a very enlightened viewpoint is to just say, “Hey, we’re going to lower the costs, increase the efficiency, and then pass it on to customers, and then they will use more of our services as a result.” And I think Cloudflare is kind of leading the way in there, which I love.

Corey: I do need to add something there—because otherwise we’re going to get letters and I don’t think we want that—where AWS reps will, of course, reach out and say that they have cut prices over a hundred times. And they’re going to ignore the fact that a lot of these were a service you don’t use in a region you couldn’t find on a map if your life depended on it now is going to be 10% less. Great. But let’s look at the general case, where from C3 to C4—if you get the same size instance—it cut the price by a lot. C4 to C5, somewhat. C5 to C6 effectively is no change. And now, from C6 to C7, it is 6% more expensive like for like.

And they’re making noises about price performance is still better, but there are an awful lot of us who say things like, “I need ten of these servers to live over there.” That workload gets more expensive when you start treating it that way. And maybe the price performance is there, maybe it’s not, but it is clear that the bill always goes down is not true.

Jake: Yeah, and I think for certain kinds of organizations, it’s totally fine the way that they do it. They do a pretty good job on price and performance. But for sort of more technical companies—especially—it’s just you can see the gaps there that Hetzner is filling and that colocation is still filling. And I personally, you know, if I didn’t need to do those things, I wouldn’t do them, right? But the fact that you need to do them, I think, says kind of everything.

Corey: There are so many weird AWS billing stories that all distill down to you not knowing this one piece of trivia about how AWS works, either as a system, as a billing construct, or as something else. And there’s a reason this has become my career of tracing these things down. And sometimes I’ll talk to prospective clients, and they’ll say, “Well, what if you don’t discover any misconfigurations like that in our account?” It’s, “Well, you would be the first company I’ve ever seen where that [laugh] was not true.” So honestly, I want to do a case study if we do.

And I’ve never had to write that case study, just because it’s the tax on not having the forcing function of building in data centers. There’s always this idea that in a data center, you’re going to run out of power, space, capacity, at some point and it’s going to force a reckoning. The cloud has what distills down to infinite capacity; they can add it faster than you can fill it. So, at some point it’s always just keep adding more things to it. There’s never a let’s clean out all of the cruft story. And it just accumulates and the bill continues to go up and to the right.

Jake: Yeah, I mean, one of the things that they’ve done so well is handle the provisioning part, right, which is kind of what you’re getting at there. One of the hardest things in the old days, before we all used AWS and GCP, is you’d have to sort of requisition hardware and there’d be this whole process with legal and financing and there’d be this big lag between the time you need a bunch more servers in your data center and when you actually have them, right, and that’s not even counting the time it takes to rack them and get them, you know, on network. The fact that basically, every developer now just gets an unlimited credit card they can just, you know, use, that’s hugely empowering, and it’s for the benefit of the companies they work for almost all the time. But it is an uncapped credit card. I know, they actually support controls and things like that, but in general, the way we treated it—

Corey: Not as much as you would think, as it turns out. But yeah, it’s—yeah, and that’s a problem. Because again, if I want to spin up $65,000 an hour worth of compute right now, the fact that I can do that is massive. The fact that I could do that accidentally when I don’t intend to is also massive.

Jake: Yeah, it’s very easy to think you’re going to spend a certain amount and then oh, traffic’s a lot higher, or, oh, I didn’t realize when you enable that thing, it charges you an extra fee or something like that. So, it’s very opaque. It’s very complicated. All of these things are, you know, the result of just building more and more stuff on top of more and more stuff to support more and more use cases. Which is great, but then it does create this very sort of opaque billing problem, which I think, you know, you’re helping companies solve. And I totally get why they need your help.

Corey: What’s interesting to me about distributed social networks is that I’ve been using Mastodon for a little bit and I’ve started to see some of the challenges around a lot of these things, just from an infrastructure and architecture perspective. Tim Bray, former Distinguished Engineer at AWS, posted a blog post yesterday, and okay, well, if Tim wants to put something up there that he thinks people should read, I advise people generally read it. I have yet to find him wasting my time. And I clicked it and got a, “Server over resource limits.” It’s like wow, you’re very popular. You wound up getting—got effectively Slashdotted.

And he said, “No, no. Whenever I post a link to Mastodon, two thousand instances all hit it at the same time.” And it’s, “Oh, yeah. The hug of death. That becomes a challenge.” Not to mention the fact that, depending upon architecture and preferences that you make, running a Mastodon instance can be extraordinarily expensive in terms of storage, just because it’ll, by default, attempt to cache everything that it encounters for a period of time. And that gets very heavy very quickly. Does the AT Protocol—AT Protocol? I don’t know how you pronounce it officially these days—take into account the challenges of running infrastructures designed for folks who have corporate budgets behind them? Or is that really a future problem for us to worry about when the time comes?

Jake: No, yeah, that’s a core thing that we talked about a lot in the recent, sort of, architecture discussions. I’m going to go back quite a ways, but there were some changes made about six months ago in our thinking, and one of the big things that we wanted to get right was the ability for people to host their own PDS, which is equivalent to, like, hosting a WordPress or something. It’s where you post your content, it’s where you post your likes, and all that kind of thing. We call it your repository or your repo. But that we wanted to make it so that people could self-host that on a, you know, $4-, $5-, or $6-a-month droplet on DigitalOcean or wherever and that not be a problem, not go down when they got a lot of traffic.

And so, the architecture of AT Proto in general, but the Bluesky app on AT Proto is such that you really don’t need a lot of resources. The data is all signed with your cryptographic keys—like, not something you have to worry about as a non-technical user—but all the data is authenticated. That’s what—it’s Authenticated Transfer Protocol. And because of that, it doesn’t matter where you get the data, right? So, we have this idea of this big indexer that’s looking at the entire network called the BGS, the Big Graph Server and you can go to the BGS and get the data that came from somebody’s PDS and it’s just as good as if you got it directly from the PDS. And that makes it highly cacheable, highly conducive to CDNs and things like that. So no, we intend to solve that problem entirely.
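As an aside on why signed data "doesn't matter where you get it": the sketch below is a minimal illustration, not AT Proto's actual scheme. It uses a Lamport one-time signature built from `hashlib` purely because it fits in the Python standard library; the record contents and the "PDS" and "BGS" variable names are hypothetical. The point it shows is that a copy served by a relay verifies exactly like the copy served by the origin, and a modified copy does not.

```python
import hashlib
import secrets


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _bits(message: bytes) -> list:
    # 256 bits of the message digest, most significant bit first.
    return [(byte >> (7 - i)) & 1 for byte in _h(message) for i in range(8)]


def keygen():
    # Private key: 256 pairs of random secrets. Public key: their hashes.
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    public = [(_h(a), _h(b)) for a, b in private]
    return private, public


def sign(private, message: bytes) -> list:
    # Reveal one secret per digest bit. A Lamport key must sign only once.
    return [private[i][bit] for i, bit in enumerate(_bits(message))]


def verify(public, message: bytes, signature: list) -> bool:
    if len(signature) != 256:
        return False
    return all(
        _h(revealed) == public[i][bit]
        for i, (revealed, bit) in enumerate(zip(signature, _bits(message)))
    )


# Hypothetical record: the author signs once, then anyone may serve copies.
private_key, public_key = keygen()
record = b'{"text": "hello from my own PDS"}'
signature = sign(private_key, record)

copy_from_pds = record             # fetched from the origin host
copy_from_bgs = bytes(record)      # fetched from a relay or cache
tampered_copy = record.replace(b"hello", b"howdy")

pds_ok = verify(public_key, copy_from_pds, signature)
bgs_ok = verify(public_key, copy_from_bgs, signature)
tampered_ok = verify(public_key, tampered_copy, signature)
```

Because verification needs only the public key, the message, and the signature, a cache or CDN can sit in the middle without being trusted, which is the property Jake describes.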

Corey: I’m looking forward to seeing how that plays out because the idea of self-hosting always kind of appealed to me when I was younger, which is why when I met my wife, I had a two-bedroom apartment—because I lived in Los Angeles, not San Francisco, and could afford such a thing—and the guest bedroom was always, you know, 10 to 15 degrees warmer than the rest of the apartment because I had a bunch of quote-unquote, “Servers” there, meaning deprecated desktops that my employer had no use for and said, “It’s either going to e-waste or your place if you want some.” And, okay, why not? I’ll build my own cluster at home. And increasingly over time, I found that it got harder and harder to do things that I liked and that made sense. I used to have a partial rack in downtown LA where I ran my own mail server, among other things.

And when I switched to Google for email solutions, I suddenly found that I was spending five bucks a month at the time, instead of the rack rental, and I was spending two hours less a week just fighting spam in a variety of different ways because that is where my technical background lives. Being able to not have to think about problems like that, and just do the fun part was great. But I worry about the centralization that that implies. I was opposed to the idea because I didn’t want to give Google access to all of my mail. And then I checked and something like 43% of the people I was emailing were at Gmail-hosted addresses, so they already had my email anyway. What was I really doing by not engaging with them? I worry that self-hosting is going to become passe, so I love projects that do it in sane and simple ways that don’t require massive amounts of startup capital to get started with.

Jake: Yeah, the account portability feature of AT Proto is super, super core. You can back up all of your data to your phone—the [AT 00:28:36] doesn’t do this yet, but it most likely will in the future—you can back up all of your data to your phone and then you can synchronize it all to another server. So, if for whatever reason, you’re on a PDS instance and it disappears—which is a common problem in the Mastodon world—it’s not really a problem. You just sync all that data to a new PDS and you’re back where you were. You didn’t lose any followers, you didn’t lose any posts, you didn’t lose any likes.

And we’re also making sure that this works for non-technical people. So, you know, you don’t have to host your own PDS, right? That’s something that technical people can self-host if they want to, non-technical people can just get a host from anywhere and it doesn’t really matter where your host is. But we are absolutely trying to avoid the fate of SMTP and, you know, other protocols. The web itself, right, is sort of… it’s hard to launch a search engine because the—first of all, the bar is billions of dollars a year in investment, and a lot of websites will only let you crawl them at a higher rate if you’re actually coming from a Google IP, right? They’re doing reverse DNS lookups, and things like that to verify that you are Google.

And the problem with that is now there’s sort of this centralization with a search engine that can’t be fixed. With AT Proto, it’s much easier to scrape all of the PDSes, right? So, if you want to crawl all the PDSes out on the AT Proto network, they’re designed to be crawled from day one. It’s all structured data, we’re working on, sort of, how you handle rate limits and things like that still, but the idea is it’s very easy to create an index of the entire network, which makes it very easy to create feed generators, search engines, or any other kind of sort of big world networking thing out there. And then without making the PDSes have to be very high power, right? So, they can do low power and still scrapeable, still crawlable.
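The reverse DNS check Jake mentions is usually done as a forward-confirmed reverse lookup: resolve the client IP to a hostname, check the hostname's domain, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below is one way a site might do it; the allowed suffixes, the function name, and the documentation-range IP addresses are assumptions for illustration, and the resolvers are injectable so the example runs without touching the network.

```python
import socket

# Assumed suffixes for illustration; a real site would use the crawler's
# published list.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None,
                        allowed_suffixes=ALLOWED_SUFFIXES):
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, A lookup."""
    # Default to real DNS via the standard library when no resolver is injected.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip).lower().rstrip(".")
        if not hostname.endswith(allowed_suffixes):
            return False
        # The PTR record alone can be forged by whoever controls the IP's
        # reverse zone, so confirm the name resolves back to the same IP.
        return ip in forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False


# Offline demonstration with fake resolver tables (all values hypothetical).
fake_ptr = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "192.0.2.66": "crawl.googlebot.com.evil.example",
    "198.51.100.7": "crawl-spoof.googlebot.com",
}
fake_a = {
    "crawl-192-0-2-10.googlebot.com": ["192.0.2.10"],
    "crawl-spoof.googlebot.com": ["203.0.113.5"],
}


def check(ip):
    return is_verified_crawler(ip, fake_ptr.__getitem__, fake_a.__getitem__)


genuine = check("192.0.2.10")       # PTR and A records agree
wrong_domain = check("192.0.2.66")  # hostname only contains the suffix
forged_ptr = check("198.51.100.7")  # PTR claims the domain, A record disagrees
```

A check like this is what ties higher crawl rates to one operator's address space, which is the centralization Jake contrasts with PDSes that are meant to be crawled by anyone.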

Corey: Yeah, the idea of having portability is super important. Question I’ve got—you know, while I’m talking to you, it’s, we’ll turn this into technical support hour as well because why not—I tend to always historically put my Twitter handle on conference slides. When I had the first template made, I used it as soon as it came in and there was an extra n in the @quinnypig username at the bottom. And of course, someone asked about that during Q&A.

So, the answer I gave was, of course, n+1 redundancy. But great. If I were to have one domain there today and change it tomorrow, is there a redirect option in place where someone could go and find that on Blue-ski, and oh, they’ll get redirected to where I am now. Or is it just one of those 404, sucks to be you moments? Because I can see validity to both.

Jake: Yeah, so the way we handle it right now is if you have a something.bsky.social name and you switch it to your own domain or something like that, we don’t yet forward it from the old .bsky.social name. But that is totally feasible. It’s totally possible. Like, the way that those are stored in what’s called your [DID record 00:31:16] or [DID document 00:31:17] is that there’s, like, a list that currently only has one item in general, but it’s a list of all of your different names, right? So, you could have different domain names, different subdomain names, and they would all point back to the same user. And so yeah, so basically, the idea is that you have these aliases and they will forward to the new one, whatever the current canonical one is.
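The alias list Jake describes can be pictured with a small data structure. This is only a sketch of the idea, assuming the first name in the list is treated as canonical; the class names, the `did:example:alice` identifier, and the handles are hypothetical and are not the real DID document format. As Jake notes, forwarding from an old name was not yet implemented at the time of recording.

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    did: str                                      # stable identifier for the account
    handles: list = field(default_factory=list)   # assumed: first entry is canonical


class Directory:
    """Maps every known handle, old or new, back to one identity."""

    def __init__(self):
        self._by_handle = {}

    def register(self, identity: Identity) -> None:
        for handle in identity.handles:
            self._by_handle[handle.lower()] = identity

    def resolve(self, handle: str):
        """Return (did, canonical_handle), or None for an unknown handle."""
        identity = self._by_handle.get(handle.lower())
        if identity is None:
            return None
        return identity.did, identity.handles[0]


directory = Directory()
directory.register(
    Identity(
        did="did:example:alice",
        handles=["alice.example.com", "alice.bsky.social"],
    )
)

old_name = directory.resolve("alice.bsky.social")
new_name = directory.resolve("alice.example.com")
unknown = directory.resolve("nobody.example.com")
```

Under these assumptions an old handle printed on a conference slide would still resolve, and the lookup would hand back the current canonical name rather than a 404.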

Corey: Excellent. That is something that concerns me because it feels like it’s one of those one-way doors, in the same way that picking an email address was a one-way door. I know people who still pay money to their ancient crappy ISP because they have a few mails that come in once in a while that are super-important. I was fortunate enough to have jumped on the bandwagon early enough that my vanity domain is 22 years old this year. And my email address still works, which, great, every once in a while, I still get stuff to, like, variants of my name I no longer use anymore since 2005. And it’s usually spam, but every once in a blue moon, it’s something important, like, “Hey, I don’t know if you remember me. We went to college together many years ago.” It’s ho-ly crap, the world is smaller than we think.

Jake: Yeah. I mean, I love that we’re using domains, I think that’s one of the greatest decisions we made is… is that you own your own domain. You’re not really stuck in our namespace, right? Like, one of the things with traditional social networks is you’re sort of, their domain.com/yourname, right?

And with the way AT Proto and Bluesky work is, you can go and get a domain name from any registrar, there’s hundreds of them—you know, we like Namecheap, you can go there and you can grab a domain and you can point it to your account. And if you ever don’t like anything, you can change your domain, you can change, you know, which PDS you’re on, it’s all completely controlled by you. And there’s nearly no way we as a company can do anything to change that. Like, that’s all sort of locked into the way that the protocol works, which creates this really great incentive where, you know, if we want to provide you services or somebody else wants to provide you services, they just have to compete on doing a really good job; you’re not locked in. And that’s, like, one of my favorite features of the network.

Corey: I just want to point something out because you mentioned oh, we’re big fans of Namecheap. I am too, for weird half-drunk domain registrations on a lark. Like, “Why am I poor?” It’s like, $3,000 a month of my budget goes to domain purchases, great. But I did a quick whois on the official Bluesky domain and it’s hosted at Route 53, which is Amazon’s, of course, premier database offering.

But I’m a big fan of using an enterprise registrar for enterprise-y things. Wasabi, if I recall correctly, wound up having their primary domain registered through GoDaddy, and the public domain that their bucket equivalent would serve data out of got shut down for 12 hours because some bad actor put something there that shouldn’t have been. And GoDaddy is not an enterprise registrar, despite what they might think—for God’s sake, the word ‘daddy’ is in their name. Do you really think that’s enterprise? Good luck.

So, the fact that you have a responsible company handling these central singular points of failure speaks very well to just your own implementation of these things. Because that’s the sort of thing that everyone figures out the second time.

Jake: Yeah, yeah. I think there’s a big difference between corporate domain registration, and corporate DNS and, like, your personal handle on social networking. I think a lot of the consumer, sort of, domain registries are—registrars—are great for consumers. And I think if you—yeah, you’re running a big corporate domain, you want to make sure it’s, you know, it’s transfer locked and, you know, there’s two-factor authentication and doing all those kinds of things right because that is a single point of failure; you can lose a lot by having your domain taken. So, I completely agree with you on there.

Corey: Oh, absolutely. I am curious about this to see if it’s still the case or not because I haven’t checked this in over a year—and they did fix it. Okay. As of at least when we’re recording this, which is the end of May 2023, Amazon’s Authoritative Name Servers are no longer half at Oracle. Good for them. They now have a bunch of Amazon-specific name servers on them instead of, you know, their competitor that they clearly despise. Good work, good work.

I really want to thank you for taking the time to speak with me about how you’re viewing these things and honestly giving me a chance to go ambling down memory lane. If people want to learn more about what you’re up to, where’s the best place for them to find you?

Jake: Yeah, so I’m on Bluesky. It’s invite only. I apologize for that right now. But if you check out bsky.app, you can see how to sign up for the waitlist, and we are trying to get people on as quickly as possible.

Corey: And I will, of course, be talking to you there and will put links to that in the show notes. Thank you so much for taking the time to speak with me. I really appreciate it.

Jake: Thanks a lot, Corey. It was great.

Corey: Jake Gold, infrastructure engineer at Bluesky, slash Blue-ski. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that will no doubt result in a surprise $60,000 bill after you post it.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
