Screaming in the Cloud

When AWS has a major outage, what actually happens behind the scenes? Ben Hartshorne, a principal engineer at Honeycomb, joins Corey Quinn to discuss a recent AWS outage and how Honeycomb kept customer data safe even when its systems weren't fully working. Ben explains why building services that expect things to break is the only way to survive these outages. He also shares how Honeycomb used its own tools to cut its AWS Lambda costs in half by tracking five different attributes in a spreadsheet and making small changes to all of them.


About Ben Hartshorne:
 
Ben has spent much of his career setting up monitoring systems for startups and now is thrilled to help the industry see a better way. He is always eager to find the right graph to understand a service and will look for every excuse to include a whiteboard in the discussion.

Show highlights: 
(02:41) Two Stories About Cost Optimization
(04:20) Cutting Lambda Costs by 50%
(08:01) Surviving the AWS Outage
(09:20) Preserving Customer Data During the Outage
(13:08) Should You Leave AWS After an Outage?
(15:09) Multi-Region Costs 10x More
(18:10) Vendor Dependencies
(22:06) How LaunchDarkly's SDK Handles Outages
(24:40) Rate Limiting Yourself
(29:00) How Much Instrumentation Is Too Much?
(34:28) Where to Find Ben

Links: 
LinkedIn: https://www.linkedin.com/in/benhartshorne/
GitHub: https://github.com/maplebed

Sponsored by:
duckbillhq.com

What is Screaming in the Cloud?

Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.

Transcript
===

Ben Hartshorne: For all of these dependencies, there are clearly several who have built their systems with this challenge in mind and have a series of different fallbacks. I'll give you a story: we use LaunchDarkly for our feature flagging. Their service was also impacted yesterday. One would think, oh, we need our feature flags in order to boot up.

Well, their SDK is built with the idea that you set your feature flag defaults in code, and if it can't reach their service, it will go ahead and use those. If it can reach the service, great, it will update them. If it can update them once, that's great. If it can connect to the streaming service, even better.

And I think they also have some more bridging in there, but we don't use the more complicated infrastructure. But this idea that they designed the system with the expectation that, in the event of service unavailability, things will continue to work made the recovery process all that much better.

And even when their service was unavailable and ours was still running, the SDK still answers questions in code for the status of all of these flags. It doesn't say, oh, I can't reach my upstream, suddenly I can't give you an answer anymore. No, the SDK is built with that idea of local caching, so it can continue to serve the correct answer, so far as it knew, from whenever it lost its connection.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is one of those folks that I am disappointed I have not had on the show until now, just because I assumed I already had. Ben Hartshorne is a principal engineer at Honeycomb, but oh, so much more than that. Ben, thank you for deigning to join us.

Ben Hartshorne: It's lovely to be here this morning.

Corey: This episode is sponsored in part by my day job, Duckbill. Do you have a horrifying AWS bill? That can mean a lot of things. Predicting what it's going to be, determining what it should be, negotiating your next long-term contract with AWS, or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is.

To learn more, visit duckbillhq.com. Remember, you can't duck the Duckbill bill, which my CEO reliably informs me is absolutely not our slogan. So you gave a talk roughly a month ago at the inaugural FinOps meetup in San Francisco. Give us the high level. What did you talk about?

Ben Hartshorne: Well, I got to talk about two stories.

I love telling stories. I got to talk about two stories of how we used Honeycomb and instrumentation to help optimize our cloud spending, a topic near and dear to your heart, which is what brought me there. We got to look at the overall bill and say, hey, where are some of the big things coming from?

Obviously it's people sending us data and people asking us questions about those data.

Corey: And if they would just stop both of those things, your bill would be so much better.

Ben Hartshorne: It would be so much smaller. So would my salary, unfortunately. So we wanted to reduce some of those costs, but it's a problem that's hard to get into just from a general perspective.

You need to really get in and look at all the details to find out what you're gonna change. So I got to tell two stories about reducing costs: one by switching from AMD to Arm architecture on Amazon, that's the Graviton chipset, which is fantastic, and the other about the amazing power of spreadsheets.

As much as I love graphs, I also love spreadsheets. I'm sorry, it's a personal failing, perhaps.

Corey: It's wild to me how many tools out there do all kinds of business-adjacent things, but somehow never bother to realize that if you can just export a CSV, suddenly you're speaking the language of your ultimate user.

Play with pandas a little bit more and spit out an actual Excel file, and now you're cooking with gas. Mm-hmm.
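[Editor's note: the pandas step Corey mentions is a short one. A minimal sketch, assuming pandas and an Excel writer backend such as openpyxl are installed; the file names are illustrative.]

```python
# Turn an exported CSV into the Excel file the spreadsheet crowd actually wants.
# Requires pandas plus an Excel writer backend (e.g. openpyxl).
import pandas as pd

df = pd.read_csv("cost_export.csv")           # illustrative export from the tool
df.to_excel("cost_export.xlsx", index=False)  # drop the index column in the output
```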

Ben Hartshorne: So the second story is about doing that with Honeycomb: taking a number of different graphs, looking at five different attributes of our Lambda costs and what was going into them, and making changes across all of them in order to accomplish an overall cost reduction of about 50%.

Which is really great. So the story does combine my love of graphs, because we got to see the three lines go down, the power of spreadsheets, and also this idea that you can't just look for one answer to find the solution to your problems around, well, anything really, but especially around reducing costs.

It's going to be a bunch of small things that you can put together into one place.

Corey: There's a lot that's valuable when we start going down that particular path of looking at things through the lens of a particular kind of data that you otherwise wouldn't think to. I maintain that you remain the only customer we have found so far that uses Honeycomb to completely instrument their AWS bill.

We had not seen that before or since. It makes sense for you to do it that way, absolutely. It's a bit of a heavy lift for, shall we say, everyone else.

Ben Hartshorne: And it actually is a bit of a lift for us too. To say we've instrumented the entire bill is a wonderful thing to assert, and, as we've talked about, we use the power of spreadsheets too.

There are some aspects of our AWS spending, and actually really dominant ones, that lend themselves very easily to using Honeycomb. The best example is Lambda, because Lambda is charged on a per-millisecond basis, and our instrumentation is collecting spans and traces about your compute on a per-millisecond basis.

There's a very easy translation there, and so we can get really good insight into which customers are spending how much, or rather, which customers are causing us to spend how much to provide our product to them, and understand how we can balance our development resources to both provide new features and also know when we need to shift and spend our attention managing costs.
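[Editor's note: a minimal sketch of the Lambda-to-cost translation Ben describes, assuming each span already carries the invocation's billed duration and memory size. The field names and the per-GB-second rate are illustrative placeholders, not Honeycomb's schema or AWS's current price.]

```python
# Rough per-customer Lambda cost attribution from span-like records.
from collections import defaultdict

PRICE_PER_GB_SECOND = 0.0000166667  # placeholder rate; check current AWS pricing

def invocation_cost(billed_duration_ms: float, memory_mb: int) -> float:
    # Lambda bills GB-seconds: memory (GB) times billed duration (seconds).
    return (memory_mb / 1024) * (billed_duration_ms / 1000) * PRICE_PER_GB_SECOND

def cost_by_customer(spans):
    totals = defaultdict(float)
    for span in spans:
        totals[span["customer_id"]] += invocation_cost(
            span["billed_duration_ms"], span["memory_mb"]
        )
    return dict(totals)

if __name__ == "__main__":
    spans = [
        {"customer_id": "acme", "billed_duration_ms": 120, "memory_mb": 512},
        {"customer_id": "acme", "billed_duration_ms": 80, "memory_mb": 512},
        {"customer_id": "globex", "billed_duration_ms": 300, "memory_mb": 1024},
    ]
    print(cost_by_customer(spans))  # totals per customer, in dollars
```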

Corey: There's a continuum here, and I think it tends to follow company ethos and company culture, where folks have varying degrees of insight into the factors that drive their cloud spend. You are clearly an observability company; you have been observing your AWS bill for, I would argue, longer than it would've made sense to on some level.

In the very early days you were doing this, and your AWS bill was not the limiting factor to your company's success back then, but you did grow into it. Other folks, even at very large enterprise scale, more or less do this based on vibes. Most folks, I think, tend to fall somewhere in the middle of this, but it's not evenly distributed.

Some teams tend to have very deep insight into what they're doing, and others are, "Amazon bill? You mean the books?" Again, most tend to fall somewhere in the center of that. It's a law of large numbers; everything starts to revert to a mean past a certain point.

Ben Hartshorne: Well, I mean, you wouldn't have a job if they didn't make it a bit of a challenge to do so.

Corey: Or I might have a better job, depending, but we'll see. I do wanna detour a little bit here, because as we record this, it is the day after AWS's big, significant outage. I could really mess with the conspiracy theorists and say it is their first major outage of October 2025, and then people are like, wait, what do you mean?

What do you mean, "World War I"? Same type of approach. But these things do tend to cluster. How was your day yesterday?

Ben Hartshorne: Well, it did start very early. Our service has presence in multiple regions, but we do have our main US instance in Amazon's us-east-1.

And so as things stopped working, a lot of our service stopped working too. I mean, the outage was significant, but it wasn't pervasive. There were still some things that kept functioning, and amazingly, we actually preserved all of the customer telemetry that made it to our front door successfully.

Which is a big deal, because we hate dropping data.

Corey: Yeah, that took some engineering work, and I have to imagine this was also not an accident.

Ben Hartshorne: It was not an accident. Now, their ability to query that data during the outage suffered.

Corey: I'm gonna push back on you on that for a second there.

When AWS's us-east-1, where you have a significant workload, is impacted to this degree, how important is observability? I know that when I've dealt with outages in the past, the first thing you try and figure out is, is it my shitty code or is it a global issue? That's important.

And once you establish it's a global issue, then you can begin the mitigation part of that process. And yes, observability becomes extraordinarily important there for some things, but for others, there's also, at least with the cloud being as big as it is now, some reputational, headline-risk protection here, in that no one is talking about your site going down in some weird way.

Yesterday everyone was talking about AWS going down; they own the reputation of this.

Ben Hartshorne: Yeah, that's true. And also, when a business's customers are asking them, which parts of your service are working, I know AWS is having a thing, how badly is it affecting you, you wanna be able to give them a solid answer.

So our customers were asking us yesterday, hey, are you dropping our data? And we wanted to be able to give them a reasonable answer even in the moment. So yes, we were able to deflect a certain amount of the reputational harm. But at the same time, there are people that have come back and said, well, shouldn't you have done better?

It's important for us to be able to rebuild our business and to move from region to region, and we need you to help us do that too.

Corey: Oh, absolutely. And I actually encountered a lot of this yesterday when I, early in the morning, tried to get a, what was it, a Halloween costume, and Amazon's site was not working properly for some strange reason.

Now, if I read some of the relatively out-of-touch analyses in the mainstream press, that's billions and billions of dollars lost. Therefore, I either went to go get a Halloween costume from another vendor, or I will never wear a Halloween costume this year, better luck in 2026. Neither of those is necessarily true.

Ben Hartshorne: And that's really exactly why we were focused on preserving, successfully storing, our customers' data in the moment. Because when the time comes afterwards, they're like, okay, we said what we said in the moment; now they're asking us, okay, what really happened? That data is invaluable in helping our customers piece together which parts of their services were working and which weren't, and at what times.

Corey: Did you see a drop in telemetry during the outage?

Ben Hartshorne: Yep, for sure.

Corey: Is that because people's systems were down, or is that because their systems could not communicate out?

Ben Hartshorne: Both.

Corey: Excellent.

Ben Hartshorne: We did get some reports from our customers that, specifically, the OpenTelemetry Collector that was gathering the data from their application was unable to successfully send it to Honeycomb.

At the same time, we were not rejecting it, so clearly there were challenges in the path between those two things, whether that was in AWS's network or some other network unable to get to AWS, I dunno. So we definitely saw there were issues of reachability, and undoubtedly there was some data drop there that's completely out of our control. The only part we could say is, once the data got to us, we were able to successfully store it.

So the question is, was it customers' apps going down? Absolutely, many of our customers were down and they were unable to send us any telemetry because their app was offline. But the other side is also true: the ones that were up were having trouble getting to us because of our location in US East.

Corey: Now, to continue reading what the mainstream press had to say about this, does that mean that you are now actively considering evacuating AWS entirely to go to a different provider that can be more reliable, probably building your own data centers?

Ben Hartshorne: Yeah, you know, I've heard people say that's the thing to do these days.

Now, I have helped build data centers in the past.

Corey: As have I. There's a reason that both of us have a job that does not involve that.

Ben Hartshorne: There is. The data centers I built were not as reliable as any of the data centers that are available from our big public cloud providers.

Corey: I would've said, unless you worked at one of those companies building the data centers. And even then, given the time you've been at Honeycomb, I can say with certainty you are not as good at running data centers as they are, because effectively no one is.

This is something that you get to learn about at significant scale. The concern, I see it as one of consolidation, but I've seen too many folks try and go multi-cloud for resilience reasons, and all they've done is add a second single point of failure. So now they're exposed to everyone's outages, and when that happens, their site continues to fall down in different ways as opposed to being more resilient, which takes a hell of a lot more than just picking multiple providers.

Ben Hartshorne: But there is something to be said, though, for looking at a business and saying, okay, what is the cost for us to be single-region versus what is the cost to be fully multi-region, where we can fail over in an instant and nobody notices? Those cost differences are huge. And for most businesses—

Corey: Of course, it's a massive investment.

At least 10x.

Ben Hartshorne: Yeah. So for most businesses, you're not gonna go that far.

Corey: My newsletter publication is entirely bound within us-west-2; that just happened to be for latency purposes, not reliability reasons. But if the region is hard down and I need to send an email newsletter, and it's down for several days, I'm writing that one by hand.

'Cause I've got a different story to tell that week; I don't need it to do the business-as-usual thing. And that's a reflection of architecture and investment decisions reflecting the reality of my business.

Ben Hartshorne: Yes.

And that's exactly where to start. And there are things you can do within a region to add a little bit of resilience to certain services within that region suffering.

So, as an example, I don't remember how many years ago it was, but Amazon had an outage in KMS, the Key Management Service, and that basically made everything stop. You can probably find out exactly when it happened.

Corey: Yes, I'm pulling that up now. Please continue. I'm curious.

Ben Hartshorne: Now they provide a really easy way to replicate all of your keys to another region and a pretty easy way to fail over accessing those keys from one region to another.

So even if you're not gonna be fully multi-region, you can insulate against individual services that might have an incident and prevent one of those services from having an outsized impact on your application. You know, you don't need the keys most of the time, but when you do need them, you kind of need them to start your application.

So if you need to scale up or do something like that and it's not available, you're really out of luck. The thing is, I don't wanna advocate that people try and go fully multi-region, but that's not to say that we abdicate all responsibility for insulating our application from transient outages in our dependencies.
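[Editor's note: a minimal sketch of the KMS insulation Ben describes, using boto3's multi-Region key support. The region names and error handling are illustrative, not a production pattern.]

```python
# Create a multi-Region KMS key, replicate it, and decrypt with a regional fallback.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

PRIMARY_REGION = "us-east-1"    # illustrative choices
FALLBACK_REGION = "us-west-2"

def create_replicated_key() -> str:
    kms = boto3.client("kms", region_name=PRIMARY_REGION)
    key = kms.create_key(MultiRegion=True, Description="app data key")
    key_id = key["KeyMetadata"]["KeyId"]
    kms.replicate_key(KeyId=key_id, ReplicaRegion=FALLBACK_REGION)
    return key_id

def decrypt_with_fallback(ciphertext: bytes) -> bytes:
    # Multi-Region replicas share key material, so either region can decrypt.
    for region in (PRIMARY_REGION, FALLBACK_REGION):
        try:
            kms = boto3.client("kms", region_name=region)
            return kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
        except (ClientError, EndpointConnectionError):
            continue  # that region's KMS is unavailable; try the next one
    raise RuntimeError("KMS unavailable in all configured regions")
```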

Corey: Yeah. To be clear, they did not do a formal writeup on the KMS issue on their, basically, kind of not terrific list of post-event summaries. Things have to be sort of noisy for that to hit. I'm sure yesterday's will wind up on that list; they may even have that up before this thing publishes.

But yeah, they did not put the KMS issue there. You're completely correct. This is the sort of thing of, what is the blast radius of these issues? And I think there's this sense that before we went into the cloud, everything was more reliable, but just the opposite is true.

The difference is that if we were all building our own data centers today, my shitty stuff at Duckbill is down, as it is every random Tuesday, and tomorrow Honeycomb is down because, oops, it turns out you once again forgot to replace a bad hard drive. Cool. But those are not happening at the same time.

When you start with the centralization story, suddenly a disproportionate swath of the world is down simultaneously, and that's where things get weird. It gets even harder, though, because you can test your durability and your resilience as much as you want, but it doesn't account for the challenge of third-party providers on your critical path.

You obviously need to make sure that, in order for Honeycomb to work, Honeycomb itself has to be up. That's sort of step one. But to do that, AWS itself has to be up in certain places. What other vendors factor into this?

Ben Hartshorne: You know, that was, I think, the most interesting part of yesterday's challenge: bringing the service back up.

We do rely on an incredible number of other services. There's some list of all of our vendors that is hundreds long. Now, those are obviously very different parts of the business; they involve, you know, companies we contract with for marketing outreach and for business operations and all of that.

Corey: Right. We use Dropbox here, and if Dropbox is down, that doesn't necessarily impact our ability to serve our customers, but it does mean I need to find a different way, for example, to get the recorded file from this podcast over to my editing team.

Ben Hartshorne: Yeah, so there's the very long list.

And then there's the much, much shorter list of vendors that are really in the critical path, and we have a bunch of those too. We use vendors for feature flagging, for sending email, and for some other forms of telemetry that are destined for other spots. For the most part, when you get that many vendors all relying on each other

and they're all down at once, there's this bootstrapping problem where they're all trying to come back, but they all sort of rely on each other in order to come back successfully. And I think that's part of what made yesterday morning's outage move from roughly, what, midnight to 3:00 AM Pacific all the way through the rest of the day, and still have issues with some companies up until five, six, 7:00 PM.

Corey: This episode is sponsored by my own company, Duckbill. Having trouble with your AWS bill?

Perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where Duckbill comes in to help. Remember, you can't duck the Duckbill bill, which I am reliably informed by my business partner is absolutely not our motto.

To learn more, visit duckbillhq.com. The Google SRE book talked about this, oh geez, when was it, 15 years ago now? Damn near that. At some point when a service goes down and then starts to recover, everything that depends on it will often basically pummel it back into submission trying to talk to the thing.

Like, I remember back when I worked as a senior systems engineer at Media Temple, in the days before GoDaddy bought and then ultimately killed them. I was touring the data center my first week. We had three different facilities; I was in one of them, and I asked, okay, great.

Say I trip over something and hit the emergency power off switch and kill the entire data center. There's an order in which you have to bring things back up in the event of those catastrophic outages; is there a runbook? And of course there was. Great, where is it? Oh, it's in Confluence. Terrific. Where's that?

Oh, in the rack over there. And I looked at the data center manager, and she was delightful and incredibly on point, and she knew exactly where I was going: we're gonna print that out right now. Excellent. That's why you ask; it takes someone who has never seen it before but knows how these things go, walking through it, because you build dependency on top of dependency and you never get the luxury of taking a step back and looking at it with fresh eyes.

But that's what our industry has done. You have your vendors that have their own critical dependencies, and they may or may not have done as good a job as you have of identifying those, and so on and so forth. It's the end of a very long chain that does kind of eat itself at some point.

Ben Hartshorne: Yeah. There are two things that brings to mind. First, we absolutely saw exactly what you're describing yesterday in our traffic patterns, where the volume of incoming traffic would sort of come along and then drop as their services went off, and then it's quiet for a little while, and then we get this huge spike as they're trying to, you know, bring everything back on all at once.

Thankfully those were sort of spread out across our customers, so we didn't have just one enormous spike hit all of our servers, but we did see them on a per-customer basis. It's a very real pattern. But the second one: for all of these dependencies, there are clearly several who have built their systems

with this challenge in mind and have a series of different fallbacks. I'll give you a story: we use LaunchDarkly for our feature flagging. Their service was also impacted yesterday. One would think, oh, we need our feature flags in order to boot up. Well, their SDK is built with the idea that you set your feature flag defaults in code, and if it can't reach their service, it will go ahead and use those.

If it can reach the service, great, it will update them. If it can update them once, that's great. If it can connect to the streaming service, even better. And I think they also have some more bridging in there, but we don't use the more complicated infrastructure. But this idea that they designed the system with the expectation that, in the event of service unavailability, things will continue to work

made the recovery process all that much better. And even when their service was unavailable and ours was still running, the SDK still answers questions in code for the status of all of these flags. It doesn't say, oh, I can't reach my upstream, suddenly I can't give you an answer anymore.

No, the SDK is built with that idea of local caching, so it can continue to serve the correct answer, so far as it knew, from whenever it lost its connection. It means that if they have a transient outage, our stuff doesn't break. And that kind of design really makes recovering from these interdependent outages feasible, in a way that the strict ordering you were describing just makes really difficult.
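[Editor's note: a minimal sketch of the fallback pattern Ben describes: defaults baked into code, a local cache of the last values fetched, and the remote flag service consulted only when reachable. The class and method names are hypothetical stand-ins, not LaunchDarkly's actual SDK API.]

```python
# In-code defaults -> local cache -> remote flag service, in that order of fallback.
import time

DEFAULTS = {"new-query-planner": False, "ingest-compression": True}  # illustrative flags

class FlagClient:
    def __init__(self, remote, defaults=DEFAULTS, refresh_seconds=30):
        self.remote = remote            # hypothetical client for the flag service
        self.defaults = dict(defaults)  # always available, even at cold boot
        self.cache = {}                 # last values successfully fetched
        self.refresh_seconds = refresh_seconds
        self.last_refresh = 0.0

    def _maybe_refresh(self):
        if time.monotonic() - self.last_refresh < self.refresh_seconds:
            return
        try:
            self.cache.update(self.remote.fetch_all_flags())  # hypothetical call
            self.last_refresh = time.monotonic()
        except Exception:
            pass  # flag service unreachable: keep serving cached/default values

    def variation(self, key: str) -> bool:
        self._maybe_refresh()
        if key in self.cache:
            return self.cache[key]            # last known answer from the service
        return self.defaults.get(key, False)  # in-code default as the final fallback
```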

Corey: At least in my case, I have the luxury of knowing these things just because I'm old and I figured this out before it was SRE common knowledge, or SRE was a widely acknowledged thing. Okay, you have a job server that runs cron jobs every day, and it turns out that, oh, you found it missed a cron job.

Oopsy doozy. That's a problem for some of those things. So now you start building in error checking and the rest, and then you do a restore from a three-day-old backup for that thing, and it suddenly thinks it missed all the cron jobs and runs them all, and then hammers some other system to death when it shouldn't.

And you learn iteratively, oh, that's kind of a failure mode. When you start externalizing and hardening APIs, you learn very quickly: everything needs a rate limit, and you need a way to make bad actors stop hammering your endpoints. Not just bad actors, naive ones too.

Ben Hartshorne: And rate limits are a good example, because that is one of the things that did happen yesterday. As people were coming back, we actually wound up needing to rate limit ourselves.

We didn't have to rate limit our customers. So, brief digression here: Honeycomb uses Honeycomb in order to build Honeycomb. We are our own observability vendor. Now, this leads to some obvious challenges in architecture. You know, how do we know we're right?

Well, in the beginning we did have some other services that we'd use to checkpoint our numbers and make sure they were actually correct. But our production instance sits here and serves our customers, and all of its telemetry goes into the next one down the chain. We call that dogfood, because of, you know, the whole phrase of eating your own dog food; drinking your own champagne is the other, more pleasing version.

So from our production, it goes to dogfood. And from dogfood? Well, what's dog food made of? It's made of kibble. So our third environment is called kibble. The dogfood telemetry goes into this third environment, and that third environment, well, we need to know if it's working too, so it feeds back into our production instance.

Each of these instances is emitting telemetry, and we have our rate limiting and our, I'm sorry, our tail-sampling proxy called Refinery that helps us reduce volume, so it's not a positively amplifying cycle. But in this incident yesterday, we started emitting logs that we don't normally emit.

These are coming from some of our SDKs for their services, and so suddenly we started getting two or three or four log entries for every event we were sending, and we did get into this kind of amplifying cycle. So we put a pretty heavy rate limit on the kibble environment in order to squash that traffic and disrupt the cycle, which made it difficult to ensure that kibble was working correctly, but

it was, and that let us make sure the production instance was working all right. But this idea of rate limits being a critical part of maintaining an interconnected stack, in order to suppress these kind of wave-like formations, these oscillations that start growing on each other and amplifying themselves and can take any infrastructure down, and being able to put in, at just the right point, a couple of switches and say, nope, suppress that signal, really made a big difference in our ability to bring back all of the services.
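[Editor's note: a minimal sketch of the kind of switch Ben describes: a per-environment token bucket that sheds excess telemetry to break an amplification loop. The limits and environment names are illustrative; Honeycomb's Refinery does far more, including tail sampling.]

```python
# Per-environment token-bucket rate limiting to suppress an amplifying feedback loop.
import time

class TokenBucket:
    def __init__(self, rate_per_second: float, burst: float):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: drop this event

# Heavy clamp on the environment that was amplifying; normal limit elsewhere.
LIMITS = {
    "kibble": TokenBucket(rate_per_second=50, burst=100),
    "dogfood": TokenBucket(rate_per_second=5000, burst=10000),
}

def forward_event(environment: str, event: dict) -> bool:
    bucket = LIMITS.get(environment)
    if bucket is None or bucket.allow():
        # send_downstream(event)  # hypothetical transport call
        return True
    return False  # suppressed to disrupt the cycle
```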

Corey: I want to pivot to one last topic, because we could talk about this outage for days and hours. But there's something you mentioned you wanted to go into that I wanted to pick a fight with you over, which was how to get people to instrument their applications for observability, so they can understand their applications, their performance, and the rest.

And I'm gonna go with the easy answer, because it's a pain in the ass. Ben, have you tried instrumenting an application that already exists without having to spend a week on it?

Ben Hartshorne: You're not wrong. It's a pain in the ass, and it's getting better. There are lots of ways to make it better. There are packages that do auto-instrumentation.

Corey: Oh yeah, absolutely. In my case, it's Claude Code's problem now. I'm getting another drink.

Ben Hartshorne: You know, you say that in jest, and yet they are actually getting really good.

Yeah.

Corey: No, that's what I've been doing. It works super well. You test it first, obviously, but yeah, YOLO, slam that into production.

Ben Hartshorne: Yeah, the LLMs are actually getting pretty good at understanding where instrumentation can be useful. I say understanding; I put that in air quotes. They're good at finding code that represents a good place to put instrumentation and adding it to your code in the right place.

Corey: I need to take another try one of these days. The last time I played with Honeycomb, I instrumented my home Kubernetes cluster, and I exceeded the limits of the free tier, based on ingest volume, by the second day of every month. That led to either "you have really unfair limits," which I don't believe to be true, or the more insightful question: what the hell is my Kubernetes cluster doing that's that chatty? Since then I've rebuilt the whole thing from scratch, so it's time for me to go back and figure that out.

Ben Hartshorne: Yeah. So, I will say a lot of instrumentation is terrible. A lot of instrumentation is based on this idea that every single signal must be published all the time, and that's not relevant to you as a person running the Kubernetes cluster.

You know, do you need to know every time a local pod checks in to see whether it needs to be evicted? No, you don't. What you're interested in are the types of activities that are relevant to what you need to do as an operator of that cluster. And the same is true of an application.

If you just, to put it in tracing language, put a span on every single function call, you will not have useful traces, because it doesn't map to a useful way of representing your user's journey through your product. So there's definitely some nuance to getting the right level of instrumentation, and I think the right level is not a single place; it's a continuously moving spectrum based on what you're trying to understand about what your application is doing.

So, at least at Honeycomb, we add instrumentation all the time, and we remove instrumentation all the time, because what's relevant to me now as I'm building out this feature is different from what I need to know about that feature once it is fully built and stable and running in a regular workload.

Furthermore, there's what I need as I'm looking at a specific problem or question. We talked about pricing for Lambdas at the beginning of this; there was a time when we really wanted to understand pricing for S3, and that's a struggle. Part of our storage model is that we store our customers' telemetry in S3, in many, many files.

And we put instrumentation around every single S3 access in order to understand both the volume and the latency of those, to see, okay, should we bundle them up or resize them like this, and how does that influence SLOs, and so on. And it's incredibly expensive to do that kind of experiment.

And it's not just expensive in dollars. Adding that level of instrumentation does have an impact on the overall performance of the system; when you're making 10,000 calls to S3 and you add a span around every one, it takes a bit more time. So once we understood the system well enough to make the change we wanted to make, we pulled all that back out.
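[Editor's note: a minimal sketch of that kind of temporary experiment: wrapping each S3 GET in a span to record latency and payload size, using OpenTelemetry's Python API and boto3. The span and attribute names are illustrative, not Honeycomb's actual instrumentation.]

```python
# Wrap every S3 GET in a span to measure per-request latency and payload size.
import boto3
from opentelemetry import trace

tracer = trace.get_tracer("s3-experiment")
s3 = boto3.client("s3")

def get_object_traced(bucket: str, key: str) -> bytes:
    with tracer.start_as_current_span("s3.get_object") as span:
        span.set_attribute("s3.bucket", bucket)
        span.set_attribute("s3.key", key)
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        span.set_attribute("s3.bytes", len(body))
        return body
```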

So for your Kubernetes cluster, you know, maybe it's interesting at the very beginning to look at every single connection that any process might make. But if it's your home cluster, that's not really what you need to know as an operator. So find the right balance there of instrumentation that lets you fulfill the needs of the business, that lets you understand the needs of the operator, in order to best be able to provide the service that this business is providing to its customers.

It's a place somewhere there in the middle, and you're gonna need some people to find it.

Corey: And that's easier said than done for a lot of folks. But you're right, it is getting easier to instrument these things. It is something that is iteratively getting better all the time, to the point where, now, this is an area where AI is surprisingly effective.

It doesn't take a lot to wrap a function call with a decorator.
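[Editor's note: a minimal sketch of that wrapping as a reusable decorator, using OpenTelemetry's Python API; the tracer name and the decorated function are illustrative.]

```python
# A decorator that wraps any function call in a span named after the function.
import functools
from opentelemetry import trace

tracer = trace.get_tracer("app")

def traced(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(func.__qualname__):
            return func(*args, **kwargs)
    return wrapper

@traced
def handle_request(payload: dict) -> dict:
    ...  # illustrative application code
    return payload
```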

Ben Hartshorne: Mm-hmm. It just takes a lot of doing that over and over and over again.

You do a lot of them, and you see what it looks like, and then you see, okay, which of these are actually useful to me now? And we want to be open to that changing, and willing to understand that this is an evolving thing.

And this does actually tie back to one of the core operating principles of modern SaaS architectures: the ability to deploy your code quickly. Because if you're in this cycle of adding and removing instrumentation, and you see a bug, it has to be easy enough to add a little bit more data to get insight into that bug in order to resolve it.

Otherwise no one's gonna do it, and the whole business will suffer for it.

Corey: What is "quickly" to you?

Ben Hartshorne: For "I need to make this change and have it visible in my test environment," a couple of minutes. For "I need to make this change and have it visible running in production," it depends on how frequently the bug comes up, but I'm actually okay with it being about an hour for that kind of turnaround.

I know a lot of people say you should have your code running in 15 minutes. That's great. I know that's out of reach for a lot of people in a lot of industries, so I'm not a hardliner on how quickly it has to be, but it can't be a week. It can barely be a day, because you're gonna wanna do this two or three times in the course of resolving a bug.

And so if it takes too long, you're just really pushing out any ability to respond quickly to a customer.

Corey: I really wanna thank you for taking the time to speak with me about all this. If people wanna learn more, where's the best place for them to go?

Ben Hartshorne: You know, I have backed off of almost all of the platforms on which people carry on conversations on the internet.

Corey: Everyone seems to have done this.

Ben Hartshorne: I did work for Facebook for two and a half years, and...

Corey: Someday I might forgive you.

Ben Hartshorne: Someday I might forgive myself. It was a really different environment, and I could see the allure of the world they're trying to create, and it doesn't match...

Corey: Oh, I interviewed there in 2009. It was incredibly compelling.

Ben Hartshorne: It doesn't match the view that I see of the world we're in. And so I have a presence at Honeycomb.

I do have accounts on all of the major platforms, so you can find me there. There will be links afterwards, I'm sure, but LinkedIn, Bluesky. I dunno, GitHub, is that a social media platform now?

Corey: They wish.

We'll put all this in the show notes; problem solved for us. Thank you so much for taking the time to speak with me.

I appreciate it.

Ben Hartshorne: It's a real pleasure. Thank you.

Corey: Ben Hartshorne is a principal engineer at Honeycomb, one of possibly more than one; it seems to be something you can scale, unlike my nonsense as Chief Cloud Economist at The Duckbill Group. And this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.

Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that won't go through because that platform is down and not accepting comments at this moment.