Eswar Bala, Director of Amazon EKS at AWS, joins Corey on Screaming in the Cloud to discuss how and why AWS built a Kubernetes solution, and what customers are looking for out of Amazon EKS. Eswar reveals the concerns he sees from customers about the cost of Kubernetes, as well as the reasons customers adopt EKS over ECS. Eswar gives his reasoning on why he feels Kubernetes is here to stay and not just hype, as well as how AWS is working to reduce the complexity of Kubernetes. Corey and Eswar also explore the competitive landscape of Amazon EKS, and the new product offering from Amazon called Karpenter.

About Eswar

Eswar Bala is a Director of Engineering at Amazon and is responsible for Engineering, Operations, and Product strategy for Amazon Elastic Kubernetes Service (EKS). Eswar leads the Amazon EKS and EKS Anywhere teams that build, operate, and contribute to the services customers and partners use to deploy and operate Kubernetes and Kubernetes applications securely and at scale. With a 20+ year career in software, spanning multimedia, networking and container domains, he has built greenfield teams and launched new products multiple times.

Links Referenced:

Amazon EKS: https://aws.amazon.com/eks/
kubernetesthemuchharderway.com: https://kubernetesthemuchharderway.com
kubernetestheeasyway.com: https://kubernetestheeasyway.com
EKS documentation: https://docs.aws.amazon.com/eks/
EKS newsletter: https://eks.news/
EKS GitHub: https://github.com/aws/eks-distro

What is Screaming in the Cloud?

Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: It’s easy to **BEEP** up on AWS. Especially when you’re managing your cloud environment on your own!
Mission Cloud un **BEEP**s your apps and servers. Whatever you need in AWS, we can do it. Head to missioncloud.com for the AWS expertise you need.

Corey: Welcome to Screaming in the Cloud, I’m Corey Quinn. Today’s promoted guest episode is brought to us by our friends at Amazon. Now, Amazon is many things: they sell underpants, they sell books, they sell books about underpants, and underpants featuring pictures of books, but they also have a minor cloud computing problem. In fact, some people would call them a cloud computing company with a gift shop that’s attached. Now, the problem with wanting to work at a cloud company is that their interviews are super challenging to pass.

If you want to work there but can’t pass the technical interview, for a long time the way to solve that has been, “Ah, we’re going to run Kubernetes so we get to LARP as if we worked at a cloud company but don’t.” Eswar Bala is the Director of Engineering for Amazon EKS and is going to basically suffer my slings and arrows about one of the most complicated, and I would say overwrought, best practices that we’re seeing industry-wide. Eswar, thank you for agreeing to subject yourself to this nonsense.

Eswar: Hey, Corey, thanks for having me here.

Corey: [laugh]. So, I’m a little bit unfair to Kubernetes because I wanted to make fun of it and ignore it. But then I started seeing it in every company that I deal with in one form or another. So yes, I can still sit here and shake my fist at the tide, but it’s turned into, “Old Man Yells at Cloud,” which I’m thrilled to embrace, but everyone’s using it. So, EKS is approaching the five-year mark since it was initially launched. What is EKS other than Amazon’s own flavor of Kubernetes?

Eswar: You know, the best way I can define EKS is, EKS is just Kubernetes. Not Amazon’s version of Kubernetes. It’s just Kubernetes that we get from the community and offer it to customers to make it easier for them to consume. So, EKS. I’ve been with EKS from the very beginning when we thought about offering a managed Kubernetes service in 2017.

And at that point, the goal was to bring Kubernetes to enterprise customers. So, we had many customers telling us that they wanted us to make their life easier by offering a managed version of Kubernetes, which they were actually beginning to adopt at that time, right? So, my goal was to figure out what that service looks like and which customer base we should be targeting the service towards.

Corey: Kelsey Hightower has a fantastic learning tool out there in a GitHub repo called, “Kubernetes the Hard Way,” where he talks you through building the entire thing, start to finish. I wound up forking it and doing that on top of AWS, and you can find that at kubernetesthemuchharderway.com. And that was fun.

And I went through the process and my response at the end was, “Why on earth would anyone ever do this more than once?” And we got that sorted out, but now it’s—customers aren’t really running these things from scratch. It’s like the Linux from Scratch project. Great learning tool; probably don’t run this in production in the same way that you might otherwise because there are better ways to solve for the problems that you will have to solve yourself when you’re building these things from scratch. So, as I look across the ecosystem, it feels like EKS stands in the place of the heavy, undifferentiated lifting of running the Kubernetes control plane so customers functionally don’t have to. Is that an effective summation of this?

Eswar: That is precisely right. And I’m glad you mentioned, “Kubernetes the Hard Way,” I was a big fan of that when it came out. And anyone who did that tutorial, and also your tutorial, “Kubernetes the Harder Way,” would walk away thinking, “Why would I pick this technology when it’s super complicated to set up?” But then you see that customers love Kubernetes and you see that reflected in the adoption, even in the 2016, 2017 timeframe.

And the reason is, it made life easier for application developers in terms of offering web services that they wanted to offer to their customer base. And because of all the features that Kubernetes brought on, application lifecycle management, service discovery, and then it evolved to support various application architectures, right, in terms of stateless services, stateful applications, and even daemon sets, right, like for running your logging and metrics agents. And these are powerful features, at the end of the day, and that’s what drove Kubernetes. But it’s super hard to get going to begin with and then to operate; the day-two operator experience is super complicated.

Corey: And the day one experience is super hard and the day two experience of, “Okay, now I’m running it and something isn’t working the way it used to. Where do I start,” has been just tremendously overwrought. And frankly, more than a little intimidating.

Eswar: Exactly. Right? And that exactly was our opportunity when we started in 2017. And when we started, there was a question of, okay, should we really build a service when you have an existing service like ECS in place? And by the way, like, I did work in ECS before I started working in EKS from the beginning.

So, the answer then was, it was about giving customers what they want. And there’s space for many container orchestration systems, right? ECS was the AWS service at that point in time. And our thinking was, how do we give customers what they wanted? They wanted a Kubernetes solution. Let’s go build that. But we built it in a way that we remove the undifferentiated heavy lifting of managing Kubernetes.

Corey: One of the weird things that I find is that everyone’s using Kubernetes, but I don’t see it in the way that I contextualize the AWS universe, which of course, is on the bill. That’s right. If you don’t charge for something in AWS land, and preferably a fair bit, I don’t tend to know it exists. Like, “What’s an IAM and what might that possibly do?” Always a reassuring thing to hear from someone who’s often called an expert in this space. But you know, if it doesn’t cost money, why do I pay attention to it?

The control plane is what EKS charges for, unless you’re running a bunch of Fargate-managed pods and containers to wind up handling those things. So, it mostly just shows up as an addendum to the actual big, meaty portions of the bill. It just looks like a bunch of EC2 instances with some really weird behavior patterns, particularly with regard to auto-scaling and crosstalk between all of those various nodes. So, it’s a little bit of a murder mystery, figuring out, “So, what’s going on in this environment? Do you folks use containers at all?” And the entire Kubernetes shop is looking at me like, “Are you simple?”

No, it’s just I tend to disregard the lies that customers say, mostly to themselves because everyone has this idea of what’s going on in their environment, but the bill speaks. It’s always been a little bit of an investigation to get to the bottom of anything that involves Kubernetes at significant points of scale.

Eswar: Yeah, you’re right. Like if you look at EKS, right, like, we started with managing the control plane to begin with. And managing the control plane is a drop in the bucket when you actually look at the costs in terms of operating a Kubernetes cluster or running a Kubernetes cluster. When you look at how our customers use it and where they spend most of their cost, it’s about where their applications run; it’s actually the Kubernetes data plane, and the amount of compute and memory that the applications end up using ends up driving 90% of the cost. And beyond that is the storage, beyond that is the networking cost, right, and then after that is the actual control plane cost. So, the problem right now is figuring out, how do we optimize our costs for the application to run on?
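
As a rough illustration of the ordering Eswar describes, the sketch below computes each category's share of a monthly cluster bill. The dollar amounts are hypothetical placeholders, picked only so that compute lands at the 90% he mentions. The control plane line assumes a rate of $0.10 per cluster-hour and a 730-hour month; treat both as assumptions and check current pricing.

```python
# Hypothetical monthly costs (USD) for a single cluster. Only the ordering
# (compute > storage > networking > control plane) comes from the conversation.
HOURS_PER_MONTH = 730          # assumed average month length in hours
CONTROL_PLANE_RATE = 0.10      # assumed USD per cluster-hour

monthly_costs = {
    "compute_and_memory": 18_000.0,   # hypothetical
    "storage": 1_200.0,               # hypothetical
    "networking": 727.0,              # hypothetical
    "control_plane": CONTROL_PLANE_RATE * HOURS_PER_MONTH,
}


def cost_shares(costs):
    """Return each category's fraction of the total bill."""
    total = sum(costs.values())
    return {name: amount / total for name, amount in costs.items()}


# Print categories from largest to smallest share of the bill.
ranked = sorted(cost_shares(monthly_costs).items(), key=lambda kv: kv[1], reverse=True)
for name, share in ranked:
    print(f"{name:<20}{share:7.2%}")
```

With numbers shaped like these, the control plane fee is a fraction of a percent of the bill, which is why optimization effort tends to go to the data plane first.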

Corey: On some level, it requires a little bit of understanding of what’s going on under the hood. There have been a number of cost optimization efforts that have been made in the Kubernetes space, but they tend to focus around stuff that I find relatively, well, I call it banal because it basically is. You’re looking at the idea of, okay, what size instances should you be running, and how well can you fill them and make sure that all the resources per node wind up being taken advantage of? But that’s also something that, I guess from my perspective, isn’t really the interesting architectural point of view. Whether or not you’re running a bunch of small instances or a few big ones or some combination of the two, that doesn’t really move the needle on any architectural shift, whereas ingesting a petabyte a month of data and passing 50 petabytes back and forth between availability zones, that’s where it starts to get really interesting as far as tracking that stuff down.

But what I don’t see is a whole lot of energy or effort being put into that. And I mean, industry-wide, to be clear. I’m not attempting to call out Amazon specifically on this. That’s [laugh] not the direction I’m taking this in. For once. I know, I’m still me. But it seems to be just an industry-wide issue, where zone affinity for Kubernetes has been a very low priority item, even on project roadmaps on the Kubernetes project.

Eswar: Yeah, Kubernetes does provide the ability for customers to restrict their workloads within a particular [unintelligible 00:09:20], right? Like, there are constraints that you can place on your pod specs that end up driving applications towards a particular AZ if they want, right? You’re right, it’s still left to the customers to configure. Just because there’s a configuration available doesn’t mean the customers use it. If it’s not defaulted, most of the time, it’s not picked up.

That’s where it’s important for service providers—like EKS—to offer the ability to not only provide the visibility by means of reporting that’s available using tools like Kubecost and Amazon Billing Explorer but also provide insights and recommendations on what customers can do. I agree that there’s a gap today. For example in EKS, in terms of that. Like, we’re slowly closing that gap and it’s something that we’re actively exploring. How do we provide insights across all the resources customers end up using from within a cluster? That includes not just compute and memory, but also storage and networking, right? And that’s where we are actually moving towards at this point.
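
For readers who have not seen the pod-spec constraints Eswar mentions, here is a minimal sketch of one of them: a `nodeSelector` on the well-known `topology.kubernetes.io/zone` node label, which restricts a pod to nodes in a single availability zone. The pod name, image, and zone are hypothetical, and the helper function is illustrative only. It builds the manifest as a plain Python dictionary and prints it as JSON.

```python
import json

# Well-known Kubernetes node label that carries the node's zone.
ZONE_LABEL = "topology.kubernetes.io/zone"


def pin_pod_to_zone(name, image, zone):
    """Return a Pod manifest that only schedules onto nodes in `zone`.

    Hypothetical helper for illustration; it is not part of any Kubernetes SDK.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # The scheduler only considers nodes whose labels match this selector.
            "nodeSelector": {ZONE_LABEL: zone},
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical workload, image, and zone.
manifest = pin_pod_to_zone("reporting-job", "example.com/reporting:1.0", "us-east-1a")
print(json.dumps(manifest, indent=2))
```

Pinning a workload to one zone can cut cross-zone data transfer, but it also gives up the availability benefit of spreading across zones, which may be part of why it is left as an opt-in setting rather than a default.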

Corey: That’s part of the weird problem I’ve found is that, on some level, you get to play almost data center archaeologist when you start exploring what’s going on in these environments. I found one of the only reliable ways to get answers to some of this stuff has been oral tradition of, “Okay, this Kubernetes cluster just starts hurling massive data quantities at 3 a.m. every day. What’s causing that?” And it leads to, “Oh, no no, have you talked to the data science team,” like, “Oh, you have a data science team. A common AWS billing mistake.” And exploring down that particular path sometimes pays dividends. But there’s no holistic way to solve that globally. Today. I’m optimistic about tomorrow, though.

Eswar: Correct. And that’s where we are spending our efforts right now. For example, we recently launched our partnership with Kubecost, and Kubecost is now available as an add-on from the Marketplace that you can easily install and provision on EKS clusters, for example. And that is a start. And Kubecost is amazing in terms of features, in terms of the insight it offers, right, looking into the compute, the memory, and the optimizations and insights it provides you.

And we are also working with the AWS Cost and Usage Reporting team to provide a native AWS solution for the cost reporting and the insights aspect as well in EKS. And it’s something that we are going to be working really closely to solve the networking gaps in the near future.

Corey: What are you seeing as far as customer concerns go, with regard to cost and Kubernetes? I see some things, but let’s be very clear here, I have a certain subset of the market that I spend an inordinate amount of time speaking to and I always worry that what I’m seeing is not holistically what’s going on in the broader market. What are you seeing customers concerned about?

Eswar: Well, let’s start from the fundamentals here, right? Customers really want to get to market faster, whatever services and applications that they want to offer. And they want to have it cheaper to operate. And if they’re adopting EKS, they want it cheaper to operate in Kubernetes in the cloud. They also want high performance, they also want scalability, and they want security and isolation.

There are so many parameters that they have to deal with before they put their service on the market and continue to operate. And there’s a fundamental tension here, right? Like they want cost efficiency, but they also want to be available in the market quicker and they want performance and availability. Developers have uptime SLOs and SLAs to consider and they want the maximum possible resources. And on the other side, you’ve got financial leaders and the business leaders who want to look at the spending and worry about, like, okay, are we allocating our capital wisely? And are we allocating where it makes sense? And are we doing it in a manner that there’s very little wastage and aligned with our customer use, for example? And this is where the actual problems arise from [unintelligible 00:13:00].

Corey: I want to be very clear that for a long time, one of the most expensive parts about running Kubernetes has not been the infrastructure itself. It’s been the people to run this responsibly, where it’s the day two, day three experience where for an awful lot of companies it’s like, “Oh, we’re moving to Kubernetes because, I don’t know, we read it in an in-flight magazine or something and all the cool kids are doing it,” which honestly during the pandemic is why suddenly everyone started making better IT choices because their execs were not being exposed to airport ads. I digress. The point, though, is that as customers are figuring this stuff out and playing around with it, it’s not sustainable that every company that wants to run Kubernetes can afford a crack SRE team that is individually incredibly expensive and collectively staggeringly so. That, it seems to me, is the real cost: the complexity tied to it.

And EKS has been great in that it abstracts an awful lot of the control plane complexity away. But I still can’t shake the feeling that running Kubernetes is mind-bogglingly complicated. Please argue with me and tell me I’m wrong.

Eswar: No, you’re right. It’s still complicated. And it’s a journey towards reducing the complexity. When we launched EKS, we launched only with managing the control plane to begin with. And that’s where we started, but customers had the complexity of managing the worker nodes.

And then we evolved to manage the Kubernetes worker nodes in terms of two products: we’ve got Managed Node Groups and Fargate. And then customers moved on to installing more agents in their clusters before they actually installed their business applications, things like Cluster Autoscaler, things like Metrics Server, critical components that they have come to rely on, but that don’t drive their business logic directly. They are supporting aspects of driving core business logic.

And that’s how we evolved into managing the add-ons to make life easier for our customers. And it’s a journey where we continue to reduce the complexity, making it easier for customers to adopt Kubernetes. And once you cross that chasm—and we are still trying to cross it—once you cross it, you have the problem of, okay so, adopting Kubernetes is easy. Now, we have to operate it, right, which means that we need to provide better reporting tools, not just for costs, but also for operations. Like, how easy it is for customers to get to the application-level metrics and how easy it is for customers to troubleshoot issues, how easy it is for customers to actually upgrade to newer versions of Kubernetes. All of these challenges come out beyond day one, right? And those are initiatives that we have in flight to make it easier for customers [unintelligible 00:15:39].

Corey: So, one of the things I see when I start going deep into the Kubernetes ecosystem is, well, Kubernetes will go ahead and run the containers for me, but now I need to know what’s going on in various areas around it. One of the big booms in the observability space, in many cases, has come from the fact that you now need to diagnose something in a container you can’t log into and that incidentally stopped existing 20 minutes before you got the alert about the issue, so you’d better hope your telemetry is up to snuff. Now, yes, that does act as a bit of a complexity burden, but on the other side of it, we don’t have to worry about things like failed hard drives taking systems down anymore. That has successfully been abstracted away by Kubernetes, or you know, your cloud provider, but that’s neither here nor there these days. What are you seeing as far as, effectively, the sidecar pattern, for example of, “Oh, you have too many containers and need to manage them? Have you considered running more containers?” Sounds like something a container salesman might say.

Eswar: So, running containers demands that you have really solid observability tooling, things that let you troubleshoot and debug successfully without the need to log into the containers themselves. In fact, that’s an anti-pattern, right? You really don’t want to have the ability to SSH into a particular container, for example. And to be successful at it demands that you publish your metrics and you publish your logs. All of these are things that a developer needs to worry about today in order to adopt containers, for example.

And it's on the service providers to actually make it easier for the developers not to worry about these. And all of these are available automatically when you adopt a Kubernetes service. For example, in EKS, we are working with our managed Prometheus service teams inside Amazon, right—and also CloudWatch teams—to easily enable metrics and logging for customers without having to do a lot of heavy lifting.

Corey: Let’s talk a little bit about the competitive landscape here. One of my biggest competitors in optimizing AWS bills is Microsoft Excel, specifically, people are going to go ahead and run it themselves because, “Eh, hiring someone who’s really good at this, that sounds expensive. We can screw it up for half the cost.” Which is great. It seems to me that one of your biggest competitors is people running their own control plane, on some level.

I don’t tend to accept the narrative that, “Oh, EKS is expensive.” That winds up being, what, 35 bucks or 70 bucks or whatever it is per control plane per cluster on a monthly basis. Okay, yes, that’s expensive if you’re trying to stay completely within a free tier perhaps, but if you’re running anything that’s even slightly revenue-generating or a for-profit company, you will spend far more than that just on people’s time. I have no problems—for once—with the EKS pricing model, start to finish. Good work on that. You’ve successfully nailed it. But are you seeing significant pushback from the industry of, “Nope, we’re going to run our own Kubernetes management system instead because we enjoy pain, corporately speaking”?

Eswar: Actually, we are in a good spot there, right? Like, at this point, customers who choose to run Kubernetes on AWS by themselves and not adopt EKS just fall into one main category, so—or two main categories: number one, they have an existing technical stack built on running Kubernetes by themselves and they’d rather maintain that and not move to EKS. Or they demand certain custom configurations of the Kubernetes control plane that EKS doesn’t support. And those are the only two reasons why we see customers not moving to EKS and preferring to run their own Kubernetes clusters on AWS.

[midroll 00:19:46]

Corey: It really does seem, on some level, like there’s going to be a… I don’t want to say reckoning because that makes it sound vaguely ominous and that’s not the direction that I intend for things to go in, but there has to be some form of collapsing of the complexity that is inherent to all of this because the entire industry has always done that. An analogy that I fall back on because I’ve seen this enough times to have the scars to show for it is that in the ’90s, running a web server took about a week of spare time and an in-depth knowledge of GCC compiler flags. And then it evolved to, ah, I could just unzip a tarball of precompiled stuff, and then RPM or Deb became a thing. And then Yum, or something else, or I guess apt over in Debian land to wind up wrapping around that. And then you had things like Puppet where it was ensure installed. And now it’s docker run.

And today, it’s a checkbox in the S3 console that proceeds to yell at you because you’re making a website public. But that’s neither here nor there. Things don’t get harder with time. But I’ve been surprised by how I haven’t yet seen that sort of geometric complexity collapsing around Kubernetes to make it easier to work with. Is that coming or are we going to have to wait for the next cycle of things?

Eswar: Let me think. I actually don’t have a good answer to that, Corey.

Corey: That’s good, at least, because if you did, I’d be worried that I was just missing something obvious. That’s kind of the entire reason I ask. Like, “Oh, good. I get to talk to smart people and see what they’re picking up on that I’m absolutely missing.” I was hoping you had an answer, but I guess it’s cold comfort that you don’t have one off the top of your head. But man, is it confusing.

Eswar: Yeah. So, there are some discussions in the community out there, right? Like, is Kubernetes the right layer to interact with? And there is some tooling that’s built on top of Kubernetes, for example, Knative that tries to provide a serverless layer on top of Kubernetes, for example. There are also attempts at abstracting Kubernetes completely and providing tooling that just completely removes any sort of Kubernetes API out of the picture and maybe a specific CI/CD-based solution that takes it from the source and deploys the service without even showing you that there’s Kubernetes underneath, right?

All of these are evolutions that are being tested out there in the community. Time will tell whether these end up sticking. But what’s clear here is the gravity around Kubernetes. All sorts of tooling that gets built on top of Kubernetes, all the operators, all sorts of open-source initiatives that are built to run on Kubernetes. For example, Spark, for example, Cassandra, so many of these big, large-scale, open-source solutions are now built to run really well on Kubernetes. And that is the gravity that’s pushing Kubernetes at this point.

Corey: I’m curious to get your take on one other, I would consider interestingly competitive spaces. Now, because I have a domain problem, if you go to kubernetestheeasyway.com, you’ll wind up on the ECS marketing page. That’s right, the worst competition in the world: the people who work down the hall from you.

If someone’s considering using ECS, Elastic Container Service versus EKS, Elastic Kubernetes Service, what is the deciding factor when a customer’s making that determination? And to be clear, I’m not convinced there’s a right or wrong answer. But I am curious to get your take, given that you have a vested interest, but also presumably don’t want to talk complete smack about your colleagues. But feel free to surprise me.

Eswar: Hey, I love ECS, by the way. Like I said, I started my life in AWS in ECS. So look, ECS is a hugely successful container orchestration service. I know we talk a lot about Kubernetes, I know there’s a lot of discussions around Kubernetes, but I would make it a point that, like, ECS is a hugely successful service. Now, what determines which way customers go?

If customers are… if the customer’s tech stack is entirely on AWS, right, they use a lot of AWS services and they want an easy way to get started in the container world that has really tight integration with other AWS services without them having to configure a lot, ECS is the way, right? And customers have actually seen terrific success adopting ECS for that particular use case. Whereas EKS customers, they start with, “Okay, I want an open-source solution. I really love Kubernetes. I lo—or, I have tooling that I really like in the open-source land that really works well with Kubernetes. I’m going to go that way.” And those kinds of customers end up picking EKS.

Corey: I feel like, on some level, Kubernetes has become the default API across a wide variety of environments. AWS obviously, but on-prem, other providers. It seems like even the traditional VPS companies out there that offer just rent-a-server in the cloud somewhere are all also offering, “Oh, and we have a Kubernetes service as well.” I wound up backing a Kickstarter project that runs a Kubernetes cluster with a shared backplane across a variety of Raspberries Pi, for example. And it seems to be almost everywhere you look.

Do you think that there’s some validity to that approach of effectively whatever it is that we’re going to wind up running in the future, it’s going to be done on top of Kubernetes or do you think that that’s mostly hype-driven these days?

Eswar: It’s definitely not hype. Like we see the proof in the kind of adoption we see. It’s becoming the de facto container orchestration API. And with all the tooling, open-source tooling that’s continuing to build on top of Kubernetes, the CNCF tooling ecosystem that’s actually spawned to support Kubernetes adoption, all of this is solid proof that Kubernetes is here to stay and is a really strong, powerful API for customers to adopt.

Corey: So, four years ago, I had a prediction on Twitter, and I said, “In five years, nobody will care about Kubernetes.” And it was in February, I believe, and every year, I wind up updating and incrementing a link to it, like, “Four years to go,” “Three years to go,” and I believe it expires next year. And I have to say, I didn’t really expect when I made that prediction for it to outlive Twitter, but yet, here we are, which is neither here nor there. But I’m curious to get your take on this. But before I wind up just letting you savage the naive interpretation of that, my impression has been that it will not be that Kubernetes has gone away. That is ridiculous. It is clearly in enough places that even if they decided to rip it out now, it would take them ten years, but rather that it’s going to slip below the surface level of awareness.

Once upon a time, there was a whole bunch of energy and drama and debate around the Linux virtual memory management subsystem. And today, there’s, like, a dozen people on the planet who really have to care about that, but for the rest of us, it doesn’t matter anymore. We are so far past having to care about that having any meaningful impact in our day-to-day work that it’s just, it’s the part of the iceberg that’s below the waterline. I think that’s where Kubernetes is heading. Do you agree or disagree? And what do you think about the timeline?

Eswar: I agree with you; that’s a perfect analogy. It’s going to go the way of Linux, right? It’s here to stay; it’s just going to get abstracted out if any of the abstraction efforts are going to stick around. And that’s where we’re testing the waters there. There are many, many open-source initiatives there trying to abstract Kubernetes. All of these are yet to gain ground, but there are some reasonable efforts being made.

And if they are successful, they just end up being a layer on top of Kubernetes. Many of the customers, many of the developers, don’t have to worry about Kubernetes at that point, but a certain subset of us in the tech world will need to deal with Kubernetes, most likely teams like mine that end up managing and operating their Kubernetes clusters.

Corey: So, one last question I have for you is that if there’s one thing that AWS loves, it’s misspelling things. And you have an open-source offering called Karpenter spelled with a K that is an extending of that tradition. What does Karpenter do and why would someone use it?

Eswar: Thank you for that. Karpenter is one of my favorite launches in the last one year.

Corey: Presumably because you were terrible at the spelling bee back when you were a kid. But please tell me more.

Eswar: [laugh]. So, Karpenter is an open-source, flexible, and high-performance cluster auto-scaling solution. So basically, when your cluster needs more capacity to support your workloads, Karpenter automatically scales the capacity as needed. For people that know the Kubernetes space well, there’s an existing component called Cluster Autoscaler that fills this space today. And it’s our take on, okay, so what if we could reimagine the capacity management solution available in Kubernetes? And can we do something better? Especially for cases where we expect terrific performance at scale to enable cost efficiency and optimization use cases for our customers, and most importantly, provide a way for customers not to pre-plan a lot of capacity to begin with.

Corey: This is something we see a lot, in the sense of very bursty workloads where, okay, you’ve got steady-state load. Cool. Buy a bunch of savings plans, get things set up the way you want them, and call it a day. But when it’s bursty, there are challenges with it. Folks love using Spot, but in the event of a sudden capacity shortfall, the question is, can we spin up capacity to backfill it within those two minutes of warning that we have? And if the answer is no, then it becomes a bit of a non-starter.

Customers have had to build an awful lot of those things around EC2 instances that handle a lot of that logic for them in ways that are tuned specifically for their use cases. I’m encouraged to see there’s a Kubernetes story around this that starts to remove some of that challenge from the customer side.

Eswar: Yeah. So, the burstiness is where complexity comes [here 00:29:42], right? Like, many customers, for steady state, they know what their capacity requirements are, they set up the capacity, they can also reason out what is the effective capacity needed for good utilization for economical reasons, and they can actually pre-plan that and set it up. But once burstiness comes in, which inevitably does it at [unintelligible 00:30:05] applications, customers worry about, “Okay, am I going to get the capacity that I need in the time that I need to be able to service my customers? And am I confident of it?”

If I’m not confident, I’m going to actually allocate capacity beforehand, assuming that I’m going to actually get the burst that I needed. Which means, you’re paying for resources that you’re not using at the moment. And the burstiness might happen and then you’re on the hook to actually reduce the capacity for it once the peak subsides at the end of the [day 00:30:36]. And this is a challenging situation. And this is one of the use cases that we targeted Karpenter towards.

Corey: I find that the idea that you’re open-sourcing this is fascinating because of two reasons. One, it does show a willingness to engage with the community that… again, it’s difficult. When you’re a big company, people love to wind up taking issue with almost anything that you do. But for another, it also puts it out in the open, on some level, where, especially when you’re talking about cost optimization and decisions that affect cost, it’s all out in public. So, people can look at this and think, “Wait a minute, it’s not—what is this line of code that means if it’s toward the end of the month, crank it up because we might need to hit our numbers.” Like, there’s nothing like that in there. At least I’m assuming. I’m trusting that other people have read this code because honestly, that seems like a job for people who are better at that than I am. But that does tend to breed a certain element of trust.

Eswar: Right. It’s one of the first things that we thought about when we said okay, so we have some ideas here to actually improve the capacity management solution for Kubernetes. Okay, should we do it out in the open? And the answer was a resounding yes, right? I think there’s a good story here that actually enables not just AWS to offer these ideas out there, right, and we want to bring it to all sorts of Kubernetes customers.

And one of the first things we did is to architecturally figure out all the core business logic of Karpenter, which is, okay, how to schedule better, how quickly to scale, what are the best instance types to pick for this workload. All of that business logic was abstracted out from the actual cloud provider implementation. And the cloud provider implementation is super simple. It’s just creating instances, deleting instances, and describing instances. And it’s something that we baked in from the get-go so it’s easier for other cloud providers to come in and add their support to it. And we as a community actually can take these ideas forward in a much faster way than just AWS doing it.
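
The split Eswar describes, with provider-agnostic scaling logic on one side and a thin cloud provider layer limited to creating, deleting, and describing instances on the other, might look roughly like the Python sketch below. This is an illustration only and not Karpenter's actual code or API: the names `CloudProvider`, `InMemoryProvider`, `Instance`, and `scale_up`, the instance catalog, and the naive "smallest type that fits" heuristic are all hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Instance:
    instance_id: str
    instance_type: str
    cpu: int  # vCPUs offered by this instance


class CloudProvider(ABC):
    """Hypothetical provider interface: create, delete, describe, nothing else."""

    @abstractmethod
    def create(self, instance_type: str) -> Instance: ...

    @abstractmethod
    def delete(self, instance_id: str) -> None: ...

    @abstractmethod
    def describe(self) -> list[Instance]: ...


class InMemoryProvider(CloudProvider):
    """Toy provider standing in for a real cloud API (assumption for the demo)."""

    # Made-up catalog: instance type -> vCPUs.
    CATALOG = {"small": 2, "medium": 4, "large": 8}

    def __init__(self) -> None:
        self._instances: dict[str, Instance] = {}
        self._ids = count(1)

    def create(self, instance_type: str) -> Instance:
        inst = Instance(f"i-{next(self._ids)}", instance_type,
                        self.CATALOG[instance_type])
        self._instances[inst.instance_id] = inst
        return inst

    def delete(self, instance_id: str) -> None:
        self._instances.pop(instance_id, None)

    def describe(self) -> list[Instance]:
        return list(self._instances.values())


def scale_up(provider: CloudProvider, catalog: dict[str, int],
             pending_cpu: int) -> list[Instance]:
    """Provider-agnostic logic: launch instances until pending CPU is covered."""
    created: list[Instance] = []
    remaining = pending_cpu
    while remaining > 0:
        # Prefer the smallest type that covers what is left;
        # if none is big enough, take the largest and loop again.
        fitting = [t for t, cpu in catalog.items() if cpu >= remaining]
        if fitting:
            choice = min(fitting, key=catalog.get)
        else:
            choice = max(catalog, key=catalog.get)
        inst = provider.create(choice)
        created.append(inst)
        remaining -= inst.cpu
    return created
```

In this sketch, supporting another cloud would mean writing one more `CloudProvider` subclass, while `scale_up` stays untouched. That is the property Eswar credits with making it easier for other providers to add support.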

Corey: I really want to thank you for taking the time to speak with me today about all these things. If people want to learn more, where’s the best place for them to find you?

Eswar: The best place to learn about EKS, right, as EKS evolves, is our documentation. We have an EKS newsletter that you can go subscribe to, and you can also find us on GitHub, where we share our product roadmap. So, those are great places to learn about how EKS is evolving and also to share your feedback.

Corey: Which is always great to hear, as opposed to, you know, in the AWS Console, where we live, waiting for you to stumble upon us, which, yeah. No, it’s good to have a lot of different places for people to engage with you. And we’ll put links to that, of course, in the [show notes 00:33:17]. Thank you so much for being so generous with your time. I appreciate it.

Eswar: Corey, really appreciate you having me.

Corey: Eswar Bala, Director of Engineering for Amazon EKS. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice telling me why, when it comes to tracking Kubernetes costs, Microsoft Excel is in fact the superior experience.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
