Screaming in the Cloud

Just because the AWS Cloud hangs above our heads doesn’t mean your bill needs to be just as sky-high. In this Screaming in the Cloud Summer Replay, Corey is joined by Airbnb Staff Software Engineer Melanie Cebula. Her job is to keep the company’s monthly cloud bill low and make sure the cost isn’t just there for a temporary stay. Hear Melanie and Corey chat about the vital role engineers play in helping balance the company books, tricks for optimizing your organization’s cloud spending, how inexperience can have a dangerous effect on cost-cutting, and the growing pains facing today’s world of data infrastructure. We hope you enjoy this trip down memory lane (just be sure you check out on time to avoid any fees).


Show Highlights: 
  • (0:00) Intro to episode
  • (0:27) Backblaze sponsor read
  • (0:54) The role of a Staff Engineer
  • (2:09) Working for a large company reliant on the cloud
  • (3:59) Melanie’s area of expertise
  • (5:58) Efficiently managing AWS bills
  • (11:33) Optimizing cloud spend
  • (14:50) The harmful hesitancy to turn things off
  • (18:17) Inexperience and cost-saving measures
  • (21:17) Firefly sponsor read
  • (21:53) How to avoid snowballing cloud bills
  • (23:40) Kubernetes and cloud billing
  • (27:12) The perks of compounding microservices
  • (29:19) Misconceptions about Kubernetes
  • (31:10) Growing pains of data infrastructure
  • (34:44) Where you can find Melanie


About Melanie Cebula

Melanie Cebula is an expert in Cloud Infrastructure, where she is recognized worldwide for explaining radically new ways of thinking about cloud efficiency and usability. She is an international keynote speaker, presenting complex technical topics to a broad range of audiences, both international and domestic. Melanie is a staff engineer at Airbnb, where she has experience building a scalable modern architecture on top of cloud-native technologies.

Besides her expertise in the online world, Melanie spends her time offline on the “sharp end” of rock climbing. An adventure athlete setting new personal records in challenging conditions, she appreciates all aspects of the journey, including the triumph of reaching ever higher destinations.

On and off the wall, Melanie focuses on building reliability into critical systems, and making informed decisions in difficult situations. In her personal time, Melanie hand whisks matcha tea, enjoys costuming and dancing at EDM festivals, and she is a triplet.


Links Referenced:


Sponsors

What is Screaming in the Cloud?

Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.

Melanie Cebula: [00:00:00] I think what we are seeing is people trying out different sorts of opinionated platforms and building tooling around them. And I think we are moving in the right direction, but we'll see these sorts of waves of complexity until we get there.

Corey Quinn: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Melanie Cebula, staff engineer at Airbnb. Melanie, welcome to the show.

Melanie Cebula: Thanks for having me, Corey.

Corey Quinn: Backblaze B2 Cloud Storage helps you scale applications and deliver services globally. Egress is free up to 3x the amount of data you have stored, and completely free between S3-compatible Backblaze and leading CDN and compute providers like Fastly, Cloudflare, Vultr, and CoreWeave.

Visit backblaze.com to learn more. Backblaze: cloud storage, built better.

Let's start at the very beginning, I guess. What is a staff engineer, and how did you become such a thing? [00:01:00]

Melanie Cebula: To understand a staff engineer, you probably need to understand that there are, like, levels within engineering. So a lot of people, when they're new to engineering, they just know that they're software engineers or developers.

But when you join most large companies, they kind of need a way of tiering people based on experience and the kind of work they do. So there's essentially junior engineer, software engineer, senior, staff, principal, and it kind of goes up from there. It varies from company to company, but essentially, the kind of work that you do at different levels does change a lot, and so it is helpful to frame them differently. Staff engineers usually are given less scoped problems. They're sort of given problem spaces, and they kind of work within that space and provide direction for the company in that space.

Corey Quinn: At most companies that aren't suffering from egregious title inflation, which I can state Airbnb is not, it tends to mean an engineer of a sufficient level of experience and depth where folks are more or less, as you said, trusted to provide not just solutions but insight, and effectively begin to have [00:02:00] insight into strategic-level concerns more so than just what's the most elegant way to write this particular code section.

Is that a somewhat fair assessment?

Melanie Cebula: I think that's fair.

Corey Quinn: Okay, so Airbnb, which we are not here to talk specifically about your environment, but rather in a general sense, is an interesting company because they tend to be, from the world's perspective, a giant company with a massive web presence. But unlike a lot of other folks that I get to talk to on this show, you're not yourselves a cloud provider.

You're not trying to sell any of the infrastructure services that you're using to run your environment. Instead, you're coming at this from the perspective of being a customer, just like most of us tend to be customers, of one or more cloud providers, with the exception being that you just tend to have bigger numbers to deal with in some senses than the rest of us do.

Not me necessarily. I mean, I tend to run Twitter for pets at absolute world spanning scale because I know it's going to take off any day now, but most sensible people don't operate at that level.

Melanie Cebula: [00:03:00] Yeah, so I think what's so empowering about being a user of all these technologies is you can be really pragmatic about how you use things.

You don't have to look at, well, this is what Google does, and this is what this other big company does, or this is what this vendor is pushing this month, this is AWS's newest, latest technology. You look at what you need, what the problems you have are, and what the solutions out there are, and you can actually try them out and find what's the best one, and then you can share that with everyone else. Hey, for our scale, for these kinds of problems we have, we found that this technology works for us. Or, as the case usually is, with a lot of work on our end, we've made this technology work well enough. And I think that's really refreshing.

I love talking to other users of technology and coming to, like, what is actually the best solution for your problem? Because you just don't get that from vendors. They're trying to sell you something.

Corey Quinn: Right, it comes down to questioning the motives of people who are having conversations around specific areas where they have things to offer.

[00:04:00] Apropos of absolutely nothing, what problem areas do you tend to focus on these days?

Melanie Cebula: In the last few years, I've worked a lot on our infrastructure platform: what makes our infrastructure more usable, easier to operate, more functional for developers. And in the past few months, instead, I've started working on cost efficiency and cloud savings.

Corey Quinn: A subject that is near and dear to my heart. When I started my consulting company a few years back, the big question I had was, great, what problem can I solve with the set of engineering skills in my background? But I want it to be an expensive business problem; absolutely nobody wants to see me write code. And, oh yeah, because of some horrible environments I'd worked in previously, I refuse to work in anything that requires me to wake up in the middle of the night, so this has to be restricted to business problems.

The AWS bill was really where I landed on when I was putting all those things together, and for better or worse, it seems to have caught on to the [00:05:00] point where I fail to go out of business pretty consistently every single month.

Melanie Cebula: Yeah, it's a really hard problem, and I do agree. Like, I've had on-call before, and I've worked on different problems, and I think cost is actually a really great problem to work on. And when I started working on cost, too, I was wondering, hey, is this actually a problem worth working on?

Because when you think about some of these other problems, there's more sophistication around companies that have reliability orgs or developer tooling orgs. And most companies just don't have that sophistication when it comes to cloud savings. So it is this, like, new and exciting place. And so I'm happy to be working on it.

And the problems do not stop. So I think it's quite interesting.

Corey Quinn: I would absolutely agree with you. It was one of those things that, when I first got into it, turned out to be surprisingly easy to get into, because when you call yourself an expert on something like the AWS bill, no one is going to challenge you on such a thing; who in the world would ever claim such a thing if it weren't true? And I figured it would be a lot drier and less, you know, technically interesting than it turned out to be.

The more I do this, the more I'm [00:06:00] realizing that it is almost entirely an architecture story past a certain point. It's not about the basic arithmetic story of adding up all the bill items and making sure that the numbers agree. That's arithmetic, and it's not nearly as interesting or, frankly, as challenging a problem.

The part of it that's neat to me is that past a certain point of scale, and that point is not generally in someone's personal test environment, spending significant time and energy on not just reducing the bill, but understanding and allocating portions of the bill to different teams, environments, etc., is something that companies begin to turn their attention to. At a certain point of scale, as it's clear that you folks are at, having an engineer or engineers focusing on that problem makes an awful lot of sense. It seems that some folks try to get there a bit too soon by hiring an engineer to do this who costs more than their entire AWS bill.

That seems like it might be an early optimization, but I'm not one to judge. So my question for you that I want to start with is, what do you see as [00:07:00] being the most interesting thing that you've learned about AWS billing in the last 6 to 12 months?

Melanie Cebula: That is a big question. I'd have to say that the most interesting thing that I have learned has been around architecture and architecting for basically efficient compute and cost savings.

And so, things like, you know, the way that we send data between services and configuring retention for that isn't always straightforward. And another big one has been data transfer. I mean, that's been huge. Like, when people think about availability and being available in multiple zones across multiple data centers around the world, I don't think cost goes into that equation, and what I've found in my own experience so far is that it definitely should, because to solve that problem you're looking at, you know, building enough provisioning into your compute layer.

So, like, having enough so that if one data center goes down, the other ones can spin up fast enough and get that compute in time to handle an outage. You're looking [00:08:00] at changes in your service mesh to send traffic to different sources within the same availability zone. I mean, the list goes on, like Kafka clusters needing to send traffic, like setting one up in every AZ.

And for me, that's just been one of the most fascinating things: when I first started working on cost savings, I didn't think that that much architecture work would be involved. And certainly there's a mixture of things. I've had to build little tools, little scripts, some automation. I've had to do some of the, like, "oh, let's just get rid of that manually." But it goes from these sort of basic easy wins to, like, we need to really rethink this entire piece of infrastructure. And so that's kind of exciting.

Corey Quinn: The hard part about data transfer pricing is that it's inscrutable from the outside, and it's not at all intuitive as far as understanding what makes sense from a billing-logic perspective.

It costs the same, for example, to move data from one availability zone to another as it does from one region to another, for most workloads. There are exceptions to virtually everything that we talk about in this, which is part of what makes this fun. And as a [00:09:00] general rule of thumb, this isn't quite right, but if you're looking at this from a gross estimation perspective, storing data in S3 for one month is roughly the same cost as moving it once between availability zones or between regions.

So if you're passing the same piece of data back and forth four or five times, maybe just store it more than once and stop moving it around. Then there's the processing and reprocessing of data. You talked about Kafka; there's always the challenge that, historically, compression wasn't as great as it is in some of the newer versions. Some of the newer pull requests have merged in new forms of compression that tend to offer a better ratio.

There's a pull request that was merged in somewhat recently where you can query the local follower. But you're right, you have to have things like your service mesh understand that you can now route those things differently, and what your replication factor looks like becomes a challenge. And a lot of it, at least in my experience, has always become a more strategic question, where it's a spectrum that you have to pick a point on that you're going to target between cost [00:10:00] efficiency and durability.

I mean, things are super cheap if you only ever run one of them compared to running three of them. But if you accidentally fat fingered the wrong S3 bucket, you don't have a company anymore in some cases. Aligning business risk and technical risk with something that is cost efficient is a balancing act.

And anyone who tells you that stuff is simple is selling something.

Melanie Cebula: Yes. For me, what I found is it helps to frame the problem almost along a matrix of: well, we could go all the way on making this as cheap as possible. That's not ever anyone's preference. People want some amount of durability and availability and redundancy.

And I think that's great. What I've seen in a lot of strong engineering organizations, and this is normally a good thing, is that engineers really want to do the best thing, at least in my experience. And so they're very optimistic about some of the engineering work they do and how available they want things to be. But, you know, the pricing and some of these things just need to be considered and architected for.

So I don't think you necessarily have to make a choice like, [00:11:00] being this available and this durable is prohibitively expensive, but the ways that people do it naively can be. And so having to think through the ways to solve the problem, I think, is really interesting. And another example of that is EBS versus EC2 costs.

You know, we recently discovered that if you're trying to run a certain kind of job on these instances that need to have, sort of, storage on them, there are multiple ways to solve this problem. And so what I found is what we're really looking at is different ways to solve it with different AWS resources.

And, you know, the pricing can matter on those kinds of things.

Corey Quinn: Oh, absolutely. And there are edge cases that cut people to ribbons all the time. Almost every time you see io1 EBS volumes, my default response is, that's probably not what you mean to be doing. You can get gp2, which is less expensive, to similar performance profiles up to a certain point, but before you hit that, an awful lot of the instances will wind up having instance throughput limits.

And that's the fun part: no matter what AWS service you look at, by and large, there are going to be [00:12:00] interesting ways to optimize once you hit a certain point of scale. The hard part in some cases is finding an environment that's using a particular service in such a way where you get to spend time doing some of those deep dives.

For example, everyone loves to use EC2, or rather they use it. Whether they love it or not is the subject of some debate, but it turns out that Amazon Chime, maybe there's ways to optimize the bill. We wouldn't know. We've never seen an actual customer. So finding things that align with everyone and hitting the big numbers on the bill before working on the smaller ones is generally an approach that I think, for some reason, sails past people because the bill is organized alphabetically.

But we're also seeing that folks tend to wind up getting focused on things that are complicated and interesting to solve from an engineering perspective, rather than step one, turn things you're not using off. Because the cloud is not billing you based upon what you use so much as what you're forgetting to turn off.

But that's not fun or [00:13:00] interesting, so instead we're going to build this custom bot that powers down developer environments out of hours, and that's great. But development in some cases is 3 percent of your spend, and you haven't bought a reserved instance in two years; maybe fix that one first.
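If you do go the "power down dev out of hours" route, the core of such a bot is small. Here is a minimal sketch, assuming boto3 and assuming development instances carry a hypothetical environment=development tag; the tag key, tag value, and region are placeholders, not anything from the episode:

```python
# Hypothetical sketch of an "out of hours" bot that stops (not terminates)
# running development instances. Tag key/value and region are assumptions;
# adapt to however your environment actually labels dev machines.
import boto3

REGION = "us-east-1"        # assumption
TAG_KEY = "environment"     # assumption
TAG_VALUE = "development"   # assumption


def stop_idle_dev_instances():
    ec2 = boto3.client("ec2", region_name=REGION)
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": f"tag:{TAG_KEY}", "Values": [TAG_VALUE]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        # Stop so developers can start them again in the morning.
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids
```

Run on a nightly schedule, something like this only helps as much as the dev fleet is actually a large share of spend, which is exactly the caveat above.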

Melanie Cebula: It is so interesting that you say that because I do know an engineer who has built a bot to spin down development instances.

And actually, I think it was really quite effective, because the development instances were so expensive and a lot of developers were not using them. So in that case, it was a really great tool. What they didn't know is that a lot of the development instances were actually, at one point, spun up as CI jobs.

So someone had an interesting idea where we could run integration tests on these development machines. So they kind of glued everything together, and it didn't work that well, and then they forgot to turn them off, and the bot didn't account for those. So over time, what we saw is that the bot was running, but machines weren't going down, and it was because there were so many of this other kind of machine up. And so it really came back down to: what are you [00:14:00] actually running, or, like, what EC2 instances do you have running that you're not using?

And even in this case, the idea that there's a lot more development instances than we think are being used, it still didn't kind of get root-caused to the right level. But I definitely have found that a lot of the tooling I've built and a lot of the solutions that I worked on, at least initially, were just low-hanging fruit. There are a lot of things that I think companies don't realize they're not using. Like, if they knew they weren't using it, they wouldn't be paying for it. But they just don't know. And I think some costs I've seen around that are S3 buckets not setting lifecycle policies, and, like, you have a lot of data being stored there that is never, ever removed.

And like, you're just not accessing it. You're like, not using it. And another example I've seen is with EBS volumes becoming unattached and then never being cleaned up. And so that also can cost a bit over time.
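Both of those cleanups can be expressed in a few lines. A hedged sketch using boto3: the bucket name, prefix, retention window, and region below are illustrative placeholders, and a real lifecycle change should be reviewed by someone with context before it touches anything like payment-database backups:

```python
# Illustrative only: an S3 lifecycle rule that expires old objects under a
# prefix, plus a scan for unattached ("available") EBS volumes. All names,
# prefixes, and retention windows are placeholders.
import boto3


def add_expiration_rule(bucket: str, prefix: str = "logs/", days: int = 90):
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-old-objects",
                    "Filter": {"Prefix": prefix},
                    "Status": "Enabled",
                    "Expiration": {"Days": days},
                }
            ]
        },
    )


def find_unattached_volumes(region: str = "us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_volumes")
    pages = paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    return [v["VolumeId"] for page in pages for v in page["Volumes"]]
```

The lifecycle rule lets S3 expire old objects on its own; the second function just lists unattached volumes so a human can decide which ones are actually safe to delete.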

Corey Quinn: Oh yeah. Part of the problem too is a lot of tooling in this space claims to solve these problems perfectly.

The challenge is that all of them lack context. "Hey, that data in that [00:15:00] S3 bucket has never been accessed, so we can get rid of it" is probably accurate if you're referring to build logs from four years ago. Probably not if you're referring to the backups of the payment database. So there's always going to be a strange story around what you can figure out programmatically versus what requires in-depth investigation by someone who has the context to see what is happening inside that environment. I think a lot of the spinning things up and never turning them off, in some respects, is a culture problem. First, people are never as excited to clean up after themselves as they are to make a mess.

But in some companies this is worse, where back in the days of data centers, you wanted a new server, great. If you have an IT team that's really on the ball, you can get something racked and ready to go in only six short weeks. So, once you've run your experiment, would you ask them to turn it off?

Absolutely not. If you have to run it again, it'll be six weeks until they wind up getting you another one. So, you keep it around. I've seen some shops where they run idle nonsense like Folding@home on fleets just to [00:16:00] keep utilization up so accounting doesn't bother them. It's really a strange and perverse incentive.

This idea of needing to make sure that people aren't spinning things up unnecessarily can counterintuitively cause more waste than it solves for.

Melanie Cebula: The basic idea is that engineers kind of want to hold on to the things they spin up?

Corey Quinn: They want to hold on to things if it's painful to turn them off. Or rather, if it's painful to get it spun back up.

Where if it takes you three hours of work to get something up and running, once it's up and running, you're going to leave it there because you don't want to go through that process of spinning it up again. Whereas if it's push button and receive this thing that you were using almost with no visible latency, then people are way more willing to turn things off.

Melanie Cebula: Where I have seen hesitancy in turning things off, it generally comes from a state where they know that it's painful to spin up again, and they really don't want to ever do that again. And in cases where people just aren't sure if it's safe, they just don't know, and they're afraid of there being consequences.

And so, like, I [00:17:00] think with our sort of old-school development environments, and that was another problem with that, people didn't want to spin them down, even if they hadn't used them in quite a while, because of the way those machines were configured: they're not using containers, there's a lot of stateful data.

It means that people want to hold on to it because spinning it up again is so painful. And what I found with us moving to a lot of containerized technology and sort of stateless things is that, at least in those cases, spinning things down to, like, sort of the right utilization has been a lot less controversial.

And another strategy has been sort of making it not the developer's problem. So, when you don't have autoscaling and you don't have sophisticated capacity management in place yet, a lot of developers tend to over-provision things, because that's how they handle traffic spikes. They just try to make sure they always have enough compute to handle the traffic spike, but that's not necessarily efficient.

So when you make it not their problem, and you sort of just have them say, okay, well, this service can target 50 percent CPU utilization, let's say, then we can, you know, [00:18:00] behind the abstraction layer sort of spin things up and down, or, in this case, Kubernetes is doing it. You can also use auto scaling groups with EC2 or other solutions.

I have found, in general, trying to get rid of that problem and sort of get rid of the attachment is one way to solve it, but you'll always find cases where you have to kind of deal with that sort of perverse incentive.
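For the EC2 side of that, a target-tracking scaling policy is the simplest version of "make it not the developer's problem." A sketch, assuming boto3 and a hypothetical Auto Scaling group name:

```python
# Hypothetical sketch: a target-tracking policy that keeps an EC2 Auto Scaling
# group near 50% average CPU, so capacity follows traffic instead of being
# permanently over-provisioned. The group name is a placeholder.
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-service-asg",  # placeholder
    PolicyName="target-50-percent-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```

On the Kubernetes side, the Horizontal Pod Autoscaler expresses the same idea at the pod level with a CPU utilization target.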

Corey Quinn: Right. And you also find that there's this idea as well that, oh, we're going to build tooling to solve all of this.

But it turns out that mistakes in turning things off can show up. And the first time in most shops that a cost-savings initiative takes production down, you're often not allowed to try to save money anymore, because, well, we tried that once and it ended badly. There often need to be better safeguards for people trying to dive into these things with the best of intentions, but not the real-world experience, or at least the scars that come from real-world experience, having tried such things in the past.

Melanie Cebula: Yeah, I think if you're an inexperienced shop trying to work in cost savings, you know, be really careful, I guess would be my advice. What you're doing [00:19:00] really with cost savings is you're trying to run, like with compute, you're trying to run things more efficiently.

I mean, there's services and applications that just have never run this hot before, and they might not perform well under those kinds of circumstances. I can imagine cases where you don't think anyone's using this S3 bucket, and you delete the bucket, and, uh, well, now you're in that situation. And so I think when you're looking at cost savings, you are, like, every operation with cost savings is a risky one.

And so for me, taking my reliability background and applying that to this problem has been really helpful. I mean, having runbooks, having operation plans, having the needed metrics and introspection to make these changes. So, like, one of the biggest changes I've made here, at least at Airbnb, was there were places where we just didn't know what was being used.

There just was no observability. And Amazon offers products for this, so, like, S3 object analytics and metrics on usage and things like that. Just enabling those for the buckets where it made sense has been really [00:20:00] helpful. I will say that, like you've said, there's an edge case for everything. So there are certain buckets that have such an extraordinary number of objects that enabling this kind of observability would be very expensive.

And that's the other category I've seen: people not understanding where the bill can become exponential. And so, like, S3 buckets with a lot of objects are one of them.
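As one concrete example of that kind of observability, S3 storage class analysis can be switched on per bucket. This is a sketch only, and just one interpretation of "S3 object analytics"; the bucket name and configuration ID are placeholders, and, as noted above, it's worth checking the cost first on buckets with enormous object counts:

```python
# Illustrative: enable S3 storage class analysis on one bucket so access
# patterns become visible. Bucket and Id are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_analytics_configuration(
    Bucket="example-bucket",          # placeholder
    Id="access-pattern-analysis",     # placeholder
    AnalyticsConfiguration={
        "Id": "access-pattern-analysis",
        # Analyze access patterns; no report export configured here.
        "StorageClassAnalysis": {},
    },
)
```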

Corey Quinn: Oh, yes. I saw one once that was just shy of 300 billion objects in a single bucket. You try and iterate through those, it'll complete two weeks after the earth crashes into the sun.

And their response, when you asked them about that, was, what is this? And they had an answer, too: they tried to build some custom database-style thing. They said, this may not have been the best approach. I'm going to stop you there; it was not. But at some point things become so big that you can't instrument them using traditional methods and have to start looking at new and creative ways.

Things that are super easy when you do a test case on a small handful of resources explode in fire, ruin, and pain when you get to a point of scale.

Melanie Cebula: Yeah, and the other thing I [00:21:00] noticed is, AWS's billing doesn't necessarily have these safeguards out of the box. So, like, one of the first things I would do is implement some safeguards so that you don't kind of shoot yourself in the foot and make it worse. And the other one is to build an understanding of what observability you need and what you can enable. And so that's been helpful.

Corey Quinn: Are you running critical operations in the cloud? Of course you are. Do you have a disaster recovery strategy for your cloud configurations?

Probably not, though your binders will say otherwise. Your DevOps teams invested countless hours on those configs, so don't risk losing them. Firefly makes it easy. They continuously scan your cloud and then back it up using infrastructure as code and, most importantly, enable quick restoration. Because no one cares about backups, they care about restorations and recovery.

Be DR ready with Firefly at firefly.ai.

Something else that I think is not well understood by folks who are used to much smaller environments is that if I were to check my [00:22:00] AWS credentials into GitHub or Jithub, depending upon pronunciation choice, then I would notice that I had done so pretty much immediately, when my $200-a-month bill is now $15,000. Past a certain point of scale, even incredibly hilarious spin-ups of all kinds of instances that are being exploited, or misconfigurations that are causing meteoric bill growth, disappear into the low-level background noise, because it takes a lot to have even a 10 percent shift in big numbers, versus, in my case, if I have a Lambda function get out of hand, I can have a 10 percent shift.

Melanie Cebula: Absolutely, and I think that is what makes cost such a snowballing problem. People don't understand why the bill is increasing at this rate, and it's because the crazier the bill gets, the more things get hidden by the crazy bill, and the harder it gets to sort of go after and fix all these things in a way that's systematic and prevents it from happening again. And what I found is we had to build a lot of custom tooling. And so one of the most important ones is [00:23:00] not necessarily to show alphabetically what's the most expensive or anything like that, but to show the difference. Like, these costs that we have tagged, you know, their delta over the last day or the last three days is this big. And so what's going to be at the top of the list is actually that delta, that pricing change.
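The "surface the delta, not the alphabet" idea can be sketched against the Cost Explorer API. The cost-allocation tag key below is a placeholder, and, as Corey is about to say, check what already exists before building your own version of this:

```python
# A minimal sketch of ranking tagged costs by their day-over-day delta using
# the Cost Explorer API. The tag key is a placeholder.
import boto3
from datetime import date, timedelta


def daily_deltas_by_tag(tag_key: str = "team"):
    ce = boto3.client("ce")
    end = date.today()
    start = end - timedelta(days=2)  # two full days; End is exclusive
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": tag_key}],
    )
    days = resp["ResultsByTime"]

    def costs(day):
        return {
            g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"])
            for g in day["Groups"]
        }

    previous, latest = costs(days[0]), costs(days[-1])
    deltas = {k: latest.get(k, 0.0) - previous.get(k, 0.0) for k in latest}
    # Biggest day-over-day increases first.
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
```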

Corey Quinn: I would like to point out, just for the record at the moment, that in the event that someone else is listening to this and thinking, oh, I'm going to go build some custom cost tooling myself: don't do that. You don't want to do that.

You want to go ahead and see if there's something else out there first before you start building your own things. I promise. Having fallen down that trap myself, please learn from my mistake.

Something I want to talk to you about that is, well, how do I put this in the, whatever the opposite of least confrontational way possible is.

Okay. So at Airbnb, you've run an awful lot of really interesting, well-built, very clearly defined, awesome technologies, and [00:24:00] also Kubernetes. What have you found that makes Kubernetes interesting, if anything, from an AWS billing perspective?

Melanie Cebula: From an AWS billing perspective, what I would say is, when you work with Amazon Web Services, a lot of it is working with the different services that they define, and so their billing can show you how you use their resources.

When you run your own infrastructure on top of EC2, in this case we run our own Kubernetes clusters on top of EC2, they don't get the same insight. And so, you know, when you're looking at costs, what you get is the cost of your Kubernetes clusters. So it's not that helpful to know that this Kubernetes cluster got more expensive from one day to the next day.

What's helpful is to know which namespaces or which services essentially got more expensive, and why. And so, having to do that sort of second-level attribution, as I call it, is necessary to understand your compute costs. And so, I do think there is a sort of lag when you use some of the latest technologies and they're not [00:25:00] necessarily AWS services; you take on a lot of the maintenance of owning that technology and running it.

And then also the cost savings for that technology.

Corey Quinn: Part of the challenge, too, is that folks who are really invested heavily in Kubernetes are trying inherently to solve infrastructure problems or engineering problems, and that's great. No one is setting out to deploy Kubernetes. I could stop that sentence there and probably have a decent argument, but no one is setting out to deploy Kubernetes from a cost optimization or, more importantly, cost allocation perspective.

So, whenever you wind up with a weird billing story on top of Kubernetes, a lot of things weren't done early on, and now there's a bit of a mess, because from the cloud provider's perspective, you have one application that is running on top of a bunch of EC2 instances or otherwise, and that application is called Kubernetes, and it is super weird, because sometimes it does all kinds of weird data transfers, sometimes it beats the crap out of S3, sometimes it winds up having weird disk access [00:26:00] patterns. But figuring out which workload inside of Kubernetes is causing a particular behavior is super weird, almost impossible without an awful lot of custom work. Today, I'm not aware of anything generic that works across the board from that perspective. Are you?

Melanie Cebula: I am not aware.

Corey Quinn: I was hoping you'd have a different answer to that.

Melanie Cebula: Well, I can say that, because of some standardization and implementation details of how we implemented Kubernetes and how we sort of hosted it on AWS, we were able to come up with a strategy for sort of tagging different namespaces and getting them attributed to the right services and the right service owners. You know, I think we did have some insight about standardization and very opinionated usage of Kubernetes. I think if you didn't have that, you would be in a much tougher position.

And I also think that's probably why it is tough to find a solution out there so far. And I think it's because you can use this, like, super-pluggable, flexible [00:27:00] infrastructure a lot of different ways, and so it's hard to build tooling that kind of just works. I mean, like, how you define namespaces, I think, would be really huge for, you know, what is increasing the Kubernetes data transfer costs or whatever it is.

Corey Quinn: A further problem that goes beyond that, too, is every time I've looked at workloads inside of Kubernetes, as we talked about earlier, there's not a lot of zone affinity that is built into this, where if it's going to ask a different microservice, because everything's a microservice, because why shouldn't every outage become a murder mystery instead?

It reaches out to the thing that's defined with no awareness of the fact that that very well might be someplace super expensive versus free. And again, AWS helps with approximately none of this, because data transfer with AWS is super expensive, because bandwidth is a rare and precious thing, unless it's bandwidth into AWS, in which case it's free.

Put all your data there, please. Have you found that there's any answer to that other than just building more intelligent service discovery on top of it and then having to shoehorn it into various apps?

Melanie Cebula: [00:28:00] Well, what I will say is that I think we were probably one of the first companies to truly try to run AZ-aware workloads on Kubernetes on AWS. And so we did run into interesting problems and ways of solving this. So right away, on the orchestration layer, we found bugs in the Kubernetes scheduler that sort of prevented AZ balance from happening. An engineer on one of Airbnb's infra teams actually made some changes to the Kubernetes scheduler upstream.

Once we had that fixed, it was possible to have pods be balanced across AZs, so we could do AZ-aware routing in a way that didn't just, like, blow up one zone because it was not balanced. And then, because we've been working on using Envoy, and Envoy has quite good AZ-aware routing support, we've been able to sort of use that to route traffic.

But there's this other idea in Kubernetes about, like, scheduling preferences. So, like, scheduling pods in such a way [00:29:00] that the same AZ is preferred, but not required. You want to prefer this because it's much cheaper and it's a better solution, but if you choose the required option, what you'll get is actually an outage if you have problems with that AZ.

So that's a little bit in the weeds, but what I found is we had to solve it at multiple layers.
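The "preferred, not required" scheduling idea looks roughly like this as a pod-spec fragment, written here as a Python dict (for example, for the official Kubernetes Python client). The app label is a placeholder, and this is an illustration of the trade-off, not Airbnb's actual configuration:

```python
# Illustrative pod-spec fragment: prefer landing in the same zone as the pods
# of a (hypothetical) service we talk to, without requiring it.
zone_affinity = {
    "affinity": {
        "podAffinity": {
            # "preferred": the scheduler tries to co-locate pods in one zone
            # (cheaper cross-service traffic)...
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {"app": "my-service"}},
                        "topologyKey": "topology.kubernetes.io/zone",
                    },
                }
            ]
            # ...whereas "requiredDuringSchedulingIgnoredDuringExecution"
            # would leave pods unschedulable when that zone has problems,
            # which is the outage scenario described above.
        }
    }
}
```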

Corey Quinn: And that's sort of the problem, because it feels like Kubernetes was sold, perhaps incorrectly, as this idea of, you have silos between dev and ops, but that's okay, because you don't have to have any communication between those groups.

That was never true, but this is one of those areas where that historical separation seems like it's coming home to roost a bit. One of the whole arguments behind containerization early on was that, oh, now you can have developers build their application, they don't have to worry at all about the infrastructure piece.

And then they throw it over the wall, more or less, and let operations take it from there. When you have to build things into applications to be aware of zonal affinity, and the infrastructure absolutely has to be aware of that, it feels like by the time that becomes an [00:30:00] expensive enough problem to start really addressing, there's already a large enough environment that was built without any close coupling between dev and ops to build that type of cost-effective architecture in at the base level.

So it has to almost be patched in after the fact.

Melanie Cebula: Yeah, so I think Kubernetes actually does bill itself as like DevOps empowerment. Which is interesting, like the idea that you can sort of create your own configuration and apply it yourself. And in practice, I'd say Airbnb has had a really strong DevOps culture historically.

So a lot of our engineers are on call for their own services, their own outages, essentially. And so when services do have a configuration problem, like a Kubernetes problem, it is generally that team that is paged. There is a problem, definitely, with the platform that we've built, where if you have a general issue with supporting AZs, that's going to fall to the infrastructure team to solve, really, because at that point it's [00:31:00] just so in the weeds that, like, I think a regular application developer would probably be horrified if you asked them to try to solve AZ-aware routing in Kubernetes for their service.

Corey Quinn: I am the opposite of an application developer, and I'm still horrified at the idea. It's one of those complicated problems with no right answer that also becomes a serious problem when you're trying to have this built out for anything that is not just a toy problem on someone's laptop. Oh wow, because not only do you have to solve for this, but you also have to be able to roll this out to something at significant scale.

And scale adds its own series of problems that tend to be a treasure and a delight for everyone experiencing them for the first time.

Melanie Cebula: Yeah, and I think it's interesting because I think infrastructure is kind of like in a renaissance period right now where people are really excited about all these technologies, but it's just the whole category is kind of immature.

And so there's these growing pains, and, like, I think when you're in my position, you see those growing pains. I think a lot of people are starting [00:32:00] to acknowledge that now. And you can still be really excited about all these technologies and possibly be willing to run them in production. But when we use these kinds of technologies, there will be trade-offs.

There will be growing pains. When there's these paradigm shifts in how infrastructure is used, they come in waves. And so, for example, service mesh, I think, is probably the biggest example. When you are dealing with all of these microservices, these other technologies become kind of crucial to running them at a certain scale.

Corey Quinn: Yeah, there's really not a great series of stories that apply universally yet. And I think you're right, things come in waves, where you wind up with things getting more and more layers of abstraction, the complexity increases to the point where something happens, and that all collapses down on itself into something a human being can understand again.

And then it continues to repeat. It's almost a sawtooth graph of complexity measured over decades. I think that Kubernetes is one of those areas now where it's starting to get more complex than it's worth, where at some point you look at all the different projects that are associated with it under the CNCF and you look around for the [00:33:00] hidden camera, because you're almost positive you're being punked.

Melanie Cebula: A lot of these technologies, I'm really excited by all the development of them, but I can also acknowledge that it's far too complex for not just, like, the average use case, but any use case. No one wants to be running anything this complex. And it's also fair to say that, you know, it would be hard to build something that supports all of these use cases and not end up as complex.

But I think every time we go through this development cycle and iteration, we learn things and we build it better the next time. And so what I'm actually seeing is a proliferation of opinionated platforms being built. So Kubernetes is, you know, actually one of the older ones at this point, although it's surprising to say that, you know, with Borg being its predecessor. And now there's these other opinionated platforms that are really quite new.

I mean, you look at the serverless movement, AWS Lambda, Knative as, like, another iteration on Kubernetes. And so I think what we are seeing is people trying out different sorts of opinionated platforms and building tooling around them. And I think we are moving in the right direction, [00:34:00] but we'll see these sorts of waves of complexity until we get there.

Corey Quinn: I think that's probably a very fair assessment, and I wish I could argue with it, but everything old becomes new again, sooner or later.

Melanie Cebula: Yeah, and I think what we'll find is in some areas we're really insightful, in other areas we just went kind of too far, and we'll course correct over time. Yeah, I think that's what we're seeing right now, is I think people are agreeing that it's quite complex and it has all these implications, all the way down to cost.

And, yeah, I think there'll be, there already is, a lot of development there. I think people are actually hopeful that there will be, like, one platform that comes out and everyone just tells them to use the platform and it's great. I don't actually think that's going to happen. I think what we'll get is a few specialized platforms that are really good at what they do.

So I'm excited for that future.

Corey Quinn: I am too. I'm looking forward to seeing how it shakes out.

Melanie, thank you so much for taking the time to speak with me today. If people want to hear more about what you have to say, where can they find you?

Melanie Cebula: So you can reach out to me at Melanie Cebula on Twitter or on my website.

Corey Quinn: Excellent. We'll throw links to both of those in the show notes. Melanie Cebula, [00:35:00] Staff Engineer at Airbnb. I am Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. If you've hated this podcast, please leave a five-star review on Apple Podcasts, and then leave a comment incorrectly explaining AWS data transfer.