Tractable

In this episode of Tractable, Kshitij Grover is joined by Charity Majors, CTO of Honeycomb, to discuss Observability 1.0 and 2.0 and the critical shift from fragmented data sources to a unified source of truth for more efficient debugging and problem-solving. From identifying signs of organizations entrenched in '1.0 thinking' to the need for shorter feedback loops, Charity examines the key factors driving the evolution of observability tools and practices.

What is Tractable?

Tractable is a podcast for engineering leaders to talk about the hardest technical problems their orgs are tackling — whether that's scaling products to deal with increased demand, racing towards releases, or pivoting the technical stack to better cater to a new landscape of challenges. Each episode of Tractable is an in-depth exploration of how the core technology underlying the world's fastest-growing companies is built and iterated on.

Tractable is hosted by Kshitij Grover, co-founder and CTO at Orb. Orb is the modern pricing platform which solves your billing needs, from seats to consumption and everything in between.

Kshitij Grover [00:00:05]:
Welcome to another episode of Tractable. I'm your host, Kshitij, cofounder and CTO here at Orb. And today, I'm really excited to have Charity on the podcast. Charity is the CTO of Honeycomb, and Honeycomb is a platform that helps you understand your application in production, sometimes called observability, but I'm sure we'll dive into that today. Honeycomb is used by orgs like Vanguard, Slack, and Intercom. And, as many of you all know, Charity also writes and shares a bunch around engineering management and organizational design in addition to technical topics, so we'll have a ton to talk about. Charity, welcome.

Charity Majors [00:00:38]:
Thank you for having me. We've been trying to make this work for a while now.

Kshitij Grover [00:00:43]:
Yes. I wanna get into the meat of it. I know you have a ton of really well-formed thoughts at Honeycomb around observability, or what the industry at large calls observability. Let's just start with a core thesis that you all have, which is this difference between Observability 1.0 and 2.0. So maybe you can give some context around what that means, and then we can dive into what the precise differences are.

Charity Majors [00:01:08]:
So when we started talking about observability back in 2016, it was a way of differentiating between what we were trying to build and what the rest of the world was doing, which was very rooted in metrics, which have no context, and the monitoring sort of approach, which works really well for systems that fail in predictable ways, which increasingly is not true of the systems that we're working on. And so at this point, I would say Observability 1.0 refers to tools that have many sources of truth. There's a famous definition that observability has three pillars: metrics, logs, and traces. Most people are paying for way more than three pillars. They've usually got RUM tools and APM tools and logging tools. Maybe they've got structured and unstructured logs, and they've got their SLOs, and all of these tools, they're not connected.

Charity Majors [00:01:53]:
The only thing that connects them is the engineer who sits in the middle visually going, that spike looks like that spike, and maybe copy-pasting IDs around from tool to tool. But you've got many sources of truth, nothing ties them together, which means that you have to rely a lot on guessing and intuition and, like, past experience when debugging systems. And Observability 2.0, it's a single source of truth. Arbitrarily wide structured data blobs, logs, whatever you wanna call them, events. But because there's one source of truth with all the shared context, you can derive metrics from them. You can derive traces from them by visualizing them over time. But it's all connected, and you as an engineer can go, here's my SLO. It's burning down at this rate.

Charity Majors [00:02:36]:
Here are the events that are violating the SLO. Here's how they're different from the events that are not violating the SLO. Like, I was just talking to a customer of ours who's just started rolling out front-end observability, and this weeks-long process of identifying latency problems and trying to repro them, literally weeks, has been reduced down to minutes. Because when that connective tissue is there, you can just ask the question. You can just see exactly what's happening. So much of debugging boils down to, here's the thing I care about. I might not know why I care about it.

Charity Majors [00:03:08]:
And in fact, most of the process of debugging is figuring out what's different about this thing that I care about versus everything else that I don't care about.

Kshitij Grover [00:03:18]:
Yeah. That's interesting. I'm wondering, and you mentioned this a little bit, but what are some signs on the ground? So let's say I'm an IC engineer. What are some signs that my org is stuck in this, like, 1.0 style of thinking? You know, no matter what tool I use, what does that look like tactically?

Charity Majors [00:03:34]:
Another huge difference between 1.0 and 2.0 is that 1.0 is very much about operating your code. And it's intrinsically reactive. Right? And 2.0 is very much about how you develop your code. There's so much of what I think of as dark matter in software engineering. It's like, it doesn't seem like we should be moving this slow. Why are we moving so slow? You can't really put your finger on it because you can't see. You know? And, like, when you have an Observability 2.0 mindset and toolkit, you can see where that time is going. You can see exactly what's happening.

Charity Majors [00:04:06]:
And this is something like Plato's allegory of the cave, like trying to explain to someone who's lived in a cave their whole life what it's like outside. There's a bit of, look, you almost have to take it on faith that it is this different. Because vendors have been, let's not say lying, but, like, exaggerating the impact of what they sell you from day one. Right? And so there are very few things in software engineering that actually have this kind of outsized impact. And in my experience, good observability tooling is one of them. Because as managers, directors especially, you're always looking at the feedback loops inside your organization. Because these feedback loops amplify each other. Right? And so if it takes you 15 minutes from the time that you write the code to deploy the code versus two hours.

Charity Majors [00:04:52]:
It's not actually an hour and 45 minutes' worth of difference. It's way more than that because of the amplification effect of it. But in order to have those really tight feedback loops, you have to be able to move with incredible confidence. You have to be able to validate every single step of the way and quickly figure out if it's doing what you expected it to do, if anything else looks weird. And when all that you have are metrics and aggregates and exemplars

Kshitij Grover [00:05:19]:
Mhmm.

Charity Majors [00:05:19]:
You fundamentally can't answer those questions with any actual specificity.

Kshitij Grover [00:05:24]:
Mhmm. How much do you think the existing pattern is an artifact of maybe just slower deploy cycles, or this idea that you package up your code into some binary that goes out a month from now, and all you can collect is some bare telemetry data? Is it just that that pattern is still the style of thinking, and with continuous release cycles, deploying to production many times a day or even every commit, we just haven't fully adapted to that change? Or is there some other reason why this 1.0 style of thinking is so tempting or maybe even just prevalent?

Charity Majors [00:06:00]:
Why is it so tempting or prevalent? My personal philosophy or theory is that it has a lot to do with the fact that we are embodied beings. And as mammals, like, whenever we get scared, our instinct is to freeze and slow down and go, let me get control over things before we start up again. And that kind of instinct is just deadly in software, because in software, speed is safety. And a lot of it has to do with the fact that when you're writing code, you know as much as anyone will ever know about this thing you're trying to build: why you're doing it, how you're doing it, what you tried that worked, what you tried that didn't work, what the variables and functions are named. You have it all mapped out, and how long does that last? An hour or two, maybe? Like, it certainly doesn't survive paging in and out of another project, let alone a month down the line. You're gonna have to learn it all over again. Right? And you're not gonna learn it in the same way. Like, the shorter you can make it, like, ideally, you can have it in your head, instrument your code as you go, keep it in your head while you deploy, and then go and close the feedback loop by looking at that instrumentation that you just wrote and asking yourself, is it doing what I expect it to do? Is anything else weird? That moment right there is your best shot.

Charity Majors [00:07:14]:
Like, there's data that shows that the cost of finding and fixing bugs goes up exponentially from the moment that the bugs are written. You type a bug, you backspace it. Good for you. Right? But finding it right after you put it into production, when you still have a clean baseline, right, before it has become the new baseline, that's your best chance of finding it, while you still have all that context in your head. But all of these tiny little decisions, all these little blockers, little bits of friction, and the drift, and the tests getting slower over time, all of these things conspire to make those loops longer and longer. And then when we have problems, we slow down even more. We're like, let's add a gate. Let's add some approvals.

Charity Majors [00:07:52]:
Let's add some more tests. When, in fact, the most important thing you can do as a leader of software engineers, whether you're an IC or a manager, I don't care, is keep those feedback loops tight. And this is a really daunting thing when you're coming from the other side of things. Right? But the good news is that every step you make towards shortening that feedback loop will also pay off. The feedback loop, it intensifies. It feeds upon itself. And I feel like it's really frustrating as an engineer or a manager who's trying to debug a software engineering culture when it's just fucking broken, because it's broken in so many ways. And you're distracted by all the symptoms and all the ways that it's broken.

Charity Majors [00:08:35]:
But I think that leaders just need to narrow their field to focus on that interval. Right? Mhmm. Compress it. Make it shorter. Do whatever you can. Because so many of these symptoms, you don't need to fight the symptoms. You need to fight the fire, and the fire is how slow everything is. And this is also something where I think that a lot of people have taken these stabs at this and found it really frustrating and hard, and that's because good observability goes hand in glove with this.

Charity Majors [00:09:03]:
The more you can see what you're doing, the faster you can move. And the faster you can move, the better you can see what you're doing.

Kshitij Grover [00:09:10]:
Yeah. It's interesting because I suspect that most organizations don't think of their observability tooling as necessarily a feedback loop. They just think of it as a debugging tool. Right? Exactly. Which is, I'm imagining, most engineers will go to Datadog, or their entry point is, like, PagerDuty into whatever observability tool they're using, because

Charity Majors [00:09:29]:
Yeah.

Kshitij Grover [00:09:29]:
Something is wrong, something is on fire. And I think in other places, you frame this as, like, your relationship with production. Right? Yeah. Is it a relationship where you only look at it when you're stressed out and worried that something is actively breaking? How should that be reframed? What is the right time to be setting up the surfaces in your observability tooling, when it's not just, something is actively affecting a customer?

Charity Majors [00:09:58]:
I feel like we are right now in the early days of what I think of as observability-driven development. We're kind of in the situation where TDD was, like, 15 years ago, where everything is super manual and by hand; it feels a little foreign. We got used to the idea that we didn't know if our code was working or not until the test passed. It's not enough for it to compile or parse.

Kshitij Grover [00:10:17]:
Right.

Charity Majors [00:10:18]:
Until the test passed. And I feel like we're extending that right now, because in reality, you don't know if your code is working or not when the test passes. You know that it's probably mostly logically consistent. You know that you can find the bugs that you've already found in the past. You don't know if you've created any more bugs, and you don't know how users are gonna use it. You don't know that until you've watched it in production. I think that 2, 3, 5 years from now, there's going to be better tooling for this. It's not just gonna be, like, close the loop by manually instrumenting and looking at it.

Charity Majors [00:10:48]:
But that's where we're at. And I think that this next sort of evolution was really kicked off when we started realizing that you have to put engineers on call for their own work. You can't just expect half of the engineers in the world to write the code and the other half to understand the code. This has to, like, coexist in one mind. And so I feel like, you get what I'm saying. I think that there's a lot of fear and dread around production systems, and I totally get it. As an operations engineer, I used to cause a lot of that fear and loathing. Get your ass out of my production system.

Charity Majors [00:11:21]:
It's a huge mistake. Right? It's our job not to make engineers more afraid of production, because their job is ultimately production, but to find ways to welcome them into the fold and to make it safe. To make it safe for your curiosity, to make it safe for you to ask questions and explore. At the end of the day, we all want to do good work. We all want to make sure the code that we write gets used, and you can't do that unless you are intimately familiar with production.

Kshitij Grover [00:11:48]:
Yeah. I think you've written something about this, but I don't remember where you landed on it. One problem that I see a lot, related to how things are performing in production, is that product engineers will architect the feature, write the data model, and then the infra team, the infamous infra team, will be responsible for fixing the queries. Right? And there will be poorly performing queries, or there will just be regressions that have been introduced, but only the infra team is really equipped to go and triage all the slow query logs on the database. And that feels like a similar sort of problem, where the feedback loop just isn't there, and the people who end up being responsible for fixing it aren't the people who wrote the code. Yeah. I wonder if that is because project development life cycles just dictate that people don't incrementally ship enough to see the feedback loop, or there's just some sort of organizational problem where infra as a team that deals with this is just malformed. Right? It's just not a sensical concept.

Charity Majors [00:12:50]:
Yes. The answer is yes to all those things. I think you've seen a lot of innovation in the past few years around pulling together product teams that can own the lifecycle of what they're building, for this reason. But there's also a reality about our tools. I'm always really awkward whenever I talk about things that only Honeycomb can do, because I really want to be talking in a very vendor-neutral way. And I believe that a couple years from now, this will be the way that the industry operates, but there's a bit of a moat. Like, vendors are gonna have to change the way they store things. There's a reality that if you're using tools that are built on top of metrics... and there's two ways that people use the word metric. There's the generic.

Charity Majors [00:13:37]:
It's just a synonym for telemetry. And then there's the metric, which is just a number with some tags appended, and time series databases store no relational data whatsoever. So, like, when it comes to debugging and understanding your code, context is what makes data valuable, and metrics have no context. And so that means that if you're a software engineer and you come along and you look at your instrumentation, you're trying to ask questions, and the data that you need doesn't exist. Which means that the people who are debugging it have to come with this wealth of knowledge of how everything fits together and how high-level code translates into low-level systems resources. And that's the only way that debugging has been possible. Like, I used to hear engineers complaining a lot that by putting them on call for their code, you are asking them to master two disciplines or do two jobs. And they were right. Traditional ops dashboards and tools, your Datadog, your Prometheus, they speak to you in the language of, we've got four types of system memory and this many CPUs. So you're a software engineer, you're shipping your diff, and you're like, what exactly just happened? And the tooling you serve to them needs to speak to them in the language of variables and functions and endpoints, so that they can go, it's timing out talking to that API endpoint, or we're getting packet loss from here

Charity Majors [00:14:55]:
To there. And so anytime you're like, it's not working because people are just too lazy to look at their stuff, or something like that, that's probably not the right answer. Like, the tooling hasn't existed, but also the way that we are structuring our teams hasn't been doing us any favors. So I feel like there's a lot converging right now. I think that Honeycomb sits at the nexus of a bunch of different trends. And I think this is why, if you look at the DORA metrics year over year, you're starting to see that the teams that are doing well are, like, reaching escape velocity.

Charity Majors [00:15:24]:
And everyone else is getting, like, worse and worse over time. And that's because it's hard, and it's all feedback loops, and it's all very self-reinforcing. But it's not hard. Like, this is the tricky part. It's not that this is harder. The way teams are doing it now is the hard way. It's so much easier to work in fast feedback loops when you can see the data.

Charity Majors [00:15:48]:
But there is a bit of... the way through, over the mountain, is not yet reducible to sound bites, and so there's a bit of, like, fighting our way through the forest to get to the promised land. But the promised land is so much easier than the way people are doing the work now.

Kshitij Grover [00:16:06]:
Yeah. I think one of the things you said is that it almost feels like you're writing two different versions of your business logic. Like, you're writing your code and your function calls or whatever, but then you're having to re-derive the importance of the different parts of that code in your observability tooling, and perhaps having to restate the importance of what should be measured and what shouldn't be measured. So one thing I'm wondering is, is part of the thesis that the observability tooling, the thing that you go look at in production, should just naturally fall out of your business logic? Like, how much intentionality should you have to put into setting up this other tool? Or should it just be like, great, I wrote my code, everything is framework-level instrumented, and now I can see it in a tool like Honeycomb?

Charity Majors [00:16:57]:
It shouldn't be difficult. Part of the problem with today's tooling is that it is really hard to instrument your code in a way that will let you understand things, because high cardinality is just prohibitively expensive. So when people say custom metrics... I've been doing this for 8 and a half years, and I just learned this. I always thought, when you say custom metrics, Datadog gives you 100 free custom metrics, I'm like, cool. To me, that sounds like 100 lines of code where you inject a metric with the name of it. Right? That's not what custom metrics means. Custom metrics means cardinality.

Charity Majors [00:17:33]:
You get 100 of cardinality. And cardinality, for those who don't know, means the number of unique items in a set. If you've got a collection of users and their unique IDs, and you've got a million users, you've got a cardinality of a million. There are so many blog posts and talks and everything about how to try to not have cardinality in your data, because it's so freaking expensive. But that is an alley that goes nowhere, because you have to have cardinality to understand your data. High cardinality data is the most identifying, the most interesting... I can't even count the number of times that I have solved some weird-ass bug just because it emitted a weird-ass unique string. If you don't have that high cardinality data, you're screwed. You cannot understand your systems. So, like, I shit-talk metrics a lot, and perhaps unfairly, because metrics are not a bad tool. But if we've learned anything when it comes to data, it's this:

Charity Majors [00:18:31]:
You have to use the right tool for the job, or your costs are just out of control. Metrics are a powerful tool for summarizing vast quantities of data and letting you age it out gracefully while, like, keeping the shape of it and just reducing the granularity. They are not a tool for debugging or understanding your system. Anytime you need a scalpel, metrics are not the tool you can reach for. And yet metrics are what the last three decades' worth of tools on the market have been built off of. And they're just a technology that has reached the end of its usefulness. Also, for a long time, it made sense that we reached for the metric, because the most important thing about our data was how expensive it was to store. RAM was expensive.

Charity Majors [00:19:15]:
Disk was expensive. Everything was so freaking expensive. So we reached for the metric because it was the smallest thing of any value that we could store. But that's no longer true. Storage is cheap. Our attention is costly. And so the number one thing that I'm telling people is, if you wanna move to a world where you can understand your systems and have Observability 2.0 outcomes, take all of this energy and cost you're allocating to trying to get useful stuff out of your metrics-backed data and just reallocate it to structured logs. If you do nothing else, just do that. Because that way, the metrics way, madness lies.
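
To make the "reallocate it to structured logs" advice concrete, here is a minimal sketch of the pattern Charity describes: instead of incrementing a handful of context-free counters, a service emits one wide, structured log event per unit of work. The handler, field names, and values below are hypothetical placeholders, not code from Honeycomb or from any particular logging library.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("request")

def handle_request(path: str, user_id: str, cart_total_cents: int) -> None:
    """Handle one request and emit a single wide, structured event for it."""
    start = time.time()
    # ... business logic would run here ...
    status = 200

    # One JSON blob per request, instead of a pile of disconnected,
    # context-free counters (requests_total, errors_total, a latency histogram...).
    logger.info(json.dumps({
        "timestamp": start,
        "duration_ms": round((time.time() - start) * 1000, 2),
        "endpoint": path,
        "status": status,
        "user_id": user_id,               # high cardinality is fine in an event
        "cart_total_cents": cart_total_cents,
        "build_id": "placeholder-build",  # hypothetical deploy marker
    }))

handle_request("/checkout", "user-12345", 4200)
```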

Kshitij Grover [00:19:54]:
Yeah. It's funny. I'm guessing that there are more VPs of finance that know how custom metrics billing works than engineers because, almost assuredly, they're the ones telling their teams.

Charity Majors [00:20:04]:
I wrote a blog post called The Cost Crisis in Observability Tooling earlier this year. And I don't know how many observability engineering teams I heard from in the aftermath of that who, like, shamefacedly admitted to me that they spend an outright majority of their time on cost management. Not adding value to the business, not building libraries, not helping, not even supporting teammates. Actively trying to manage cost. But mostly by watching for unexpected spikes in cardinality, and then deleting them. That's all they do. It's a full-time job. The amount of time and energy and money we are sinking into this dying technology is, like, mind-boggling. So your whole question was about instrumentation.

Charity Majors [00:20:50]:
Right? And yeah. Knowing how to instrument your code with metrics is hard, even for experts. Like, to some extent, you just gotta put it in and see what comes out, because you can't predict it. It can change out from under you if the data changes. Oops, I thought that was gonna be low cardinality. Now it turns out it's high. Over the weekend, it cost me $50,000.

Charity Majors [00:21:07]:
It's not an uncommon refrain. You have to think a lot about it. You have to, like, constantly delete things. It should be so much easier. It should go something like this: as you're writing code, you know how, like, we all have this sort of meta monitor in our minds? I should add a comment, this is gonna be challenging someday. We should have this same sort of meta thread, just, this might be useful someday.

Charity Majors [00:21:27]:
Toss it in the blob. Right? Oh, shopping cart ID, probably useful someday. Toss it in the blob. User ID, probably useful someday. Toss it in the blob. And so for Honeycomb, we charge by number of events, and we encourage you to make them as wide as possible, because width means context. If you've got an event that's 300 dimensions wide, awesome. They could all be high cardinality.

Charity Majors [00:21:51]:
We don't care. Right? Go wider than that: 500, 1,000, up to the number of Linux file handles that can be open. Like, just toss it in and forget about it. Do not delete that. You don't know what's gonna be useful someday. And it's usually not just that one thing that's useful. It's the conjunction of 5 or 9 or 11 rare things, and that's how you can identify what's wrong, like, immediately. Because this thing I care about is different in twelve ways.

Charity Majors [00:22:17]:
Cool. Now I know what it is. And you don't get that unless you've just been accumulating things that might be useful someday over a period of time. But, no, it shouldn't be hard. You shouldn't be having to think about, what's the format, what is the data type, what's the cardinality of this going to be, what about when I ship it to production, what if the size of the fleet grows, is that going to change the entire model of this thing I've built? All of that is so hard and should not be necessary.
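
A hedged sketch of what "toss it in the blob" can look like in code, here using the OpenTelemetry Python tracing API as one option: every piece of context that might be useful someday gets attached to the current span for that unit of work. The service name, attribute keys, and values are all hypothetical, not a prescribed schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def checkout(user_id: str, cart_id: str, item_count: int, total_cents: int) -> None:
    with tracer.start_as_current_span("checkout") as span:
        # "Toss it in the blob": anything that might be useful someday goes on
        # the one wide event for this unit of work, high cardinality included.
        span.set_attribute("user.id", user_id)
        span.set_attribute("cart.id", cart_id)
        span.set_attribute("cart.item_count", item_count)
        span.set_attribute("cart.total_cents", total_cents)

        # ... payment call would happen here; record its outcome too ...
        span.set_attribute("payment.provider", "placeholder-provider")
        span.set_attribute("payment.latency_ms", 87)

checkout("user-12345", "cart-98765", 3, 4200)
```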

Kshitij Grover [00:22:44]:
Yeah. It seems like an unfortunate chain of events, because not only does it create all this work for cost control, but it also means you have to be super intentional when you're first writing the code, to know what you're gonna care about. So, actually, it's a good segue. Let's talk a little bit about the Honeycomb architecture and system, because I think a lot of engineers, especially if you're coming from an OLTP database, you're like, that must be really expensive. I can't store all these columns. Does Honeycomb have to index over all of them? And so maybe there's this kind of intuitive concern of, I need to conserve the width of my event.

Kshitij Grover [00:23:21]:
Yeah. So, you know, talk a little bit about the column store you've built and how you're able to effectively query over all this data.

Charity Majors [00:23:28]:
This is the beauty and the pain of data. Right? When you find the right data model, problems that were previously, like, intractable, incomprehensible, can become so easy and so fast. Speed is one of the most important things about Honeycomb. We have an SLO for ourselves. I think the 95th percentile query has to be under a second. It's absurdly low. We'll often have the experience where you're writing code, you're instrumenting, and as soon as you, not even deploy, but as soon as you, like, issue an API request, as soon as you change over to the browser and look for it, it's there.

Kshitij Grover [00:24:04]:
Yeah.

Charity Majors [00:24:04]:
It's so fast. And we start to take that for granted, and then we go use some other tool and we're like, oh my god, it takes minutes to show up. What is going on here? But I think that's so important, because you need to be asking questions in a state of flow. Right? Anything that stops you is going to break your concentration. So yeah. Christine and I had worked at Facebook, and we used a tool there called Scuba. And there is a white paper out there on how Scuba was built, and that was the genesis of Honeycomb.

Charity Majors [00:24:29]:
And it's a distributed column store. A column store is the opposite of a relational, row-based database, where everything's stored as rows and you have to build indexes. In a column store, everything is an index; that's just how it is. In the early days of Honeycomb, we stored the data on SSDs, but we pretty quickly ran into exactly the kind of cost problems that you would imagine when data is stored constantly and accessed rarely. We're like, this is not going to be financially feasible for us for a very long time. And at that point, we serverless-ed our database.
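
A toy illustration of the row-store versus column-store distinction, for readers who haven't worked with one; this is illustrative only, not how Honeycomb's storage engine is actually implemented.

```python
# Row-oriented: each record is stored together, so reading one field means
# touching (and deserializing) every row.
rows = [
    {"endpoint": "/checkout", "duration_ms": 212, "user_id": "u1"},
    {"endpoint": "/search",   "duration_ms": 35,  "user_id": "u2"},
    {"endpoint": "/checkout", "duration_ms": 480, "user_id": "u3"},
]

# Column-oriented: each field is stored contiguously, so any single field can
# be scanned or aggregated without reading the rest of the event.
columns = {
    "endpoint":    ["/checkout", "/search", "/checkout"],
    "duration_ms": [212, 35, 480],
    "user_id":     ["u1", "u2", "u3"],
}

# "Everything is an index": a filter is just a scan over one column...
slow = [i for i, d in enumerate(columns["duration_ms"]) if d > 200]
# ...and matching values from any other column are pulled out by position.
slow_endpoints = [columns["endpoint"][i] for i in slow]
print(slow_endpoints)  # ['/checkout', '/checkout']
```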

Charity Majors [00:25:04]:
One of our engineers, Ian Welch, who was actually my first engineering manager when I was, like, 22, developed our early data store. And so when an event hits our API, it gets dropped into Kafka immediately. And then we have, like, pairs of Retriever databases that pull the event out of Kafka. So it gets stored on the SSDs for a short amount of time, but then it very quickly gets aged out to S3 buckets. And our query engine actually runs in Lambda jobs. So when you issue a query, it kicks off a Lambda job, which fans out, pulls the data down from, like, hundreds or thousands of S3 buckets, and then merges it and returns it to you in under a second. So it definitely scales.

Charity Majors [00:25:47]:
We run on the order of tens of millions of writes per second. We have many hundreds of customers, and we effectively run the combined production load of them all. It's quite large, quite scalable.
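
The fan-out-and-merge query pattern described here can be sketched roughly as below. This is an illustrative stand-in, not Honeycomb's Retriever or its Lambda code: the segment names, the stubbed partial aggregate, and the thread pool standing in for Lambda invocations are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical segment list: in a real system these would be column-store files
# in S3, and each worker would be a Lambda invocation rather than a local thread.
SEGMENTS = [f"segment-{i:04d}" for i in range(64)]

def scan_segment(segment: str) -> dict:
    """Scan one segment and return a partial aggregate (stubbed out here)."""
    fake_count = 1000
    fake_max_ms = 50 + (hash(segment) % 500)
    return {"count": fake_count, "max_duration_ms": fake_max_ms}

def run_query(segments: list[str]) -> dict:
    # Fan out: one scan per segment, in parallel.
    with ThreadPoolExecutor(max_workers=16) as pool:
        partials = list(pool.map(scan_segment, segments))
    # Merge: combine the partial aggregates into a single answer.
    return {
        "count": sum(p["count"] for p in partials),
        "max_duration_ms": max(p["max_duration_ms"] for p in partials),
    }

print(run_query(SEGMENTS))
```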

Kshitij Grover [00:26:03]:
And I just wanna highlight how different that approach is from something like a streaming aggregation approach, where you're aggregating the data in-stream for a specific set of percentiles or distributions that you've preconfigured, versus actually querying over the data in S3.

Charity Majors [00:26:19]:
Yes. This difference between write-time aggregation and read-time aggregation is part of the secret sauce. Anytime you're aggregating at write time, you're effectively locking yourselves in to being able to ask only the questions that you predicted upfront that you might want to ask. Right? You're like, okay, I might want to ask what the 99th percentile is, what the max is. You're saying future me is only going to care about these questions, which is what you do with metrics. Right? And part of the Honeycomb model is we store the raw events, and you can ask any question you want of them, which means we do all of the aggregation at read time. So you can slice and dice and combine and recombine. You can zoom in.

Charity Majors [00:26:58]:
You can zoom out. You can go up to SLOs. You can go all the way down to the raw data. You can compare every individual raw request against each other, which is such a paradigm shift. Like, people don't realize how bad they have it because they don't know how good it could be, which to me is a little ironic, because on the business side of the house, they've had nice things for years. Like, you could not run a business without the ability to slice and dice and ask arbitrary questions of your business data. And yet on the systems side, we've just been over here living off starvation rations. We can't pay for wide events.

Charity Majors [00:27:32]:
We can't pay for structured logs. Let's just serve ourselves metrics. It's classic, the cobbler's children have no shoes. Which, honestly, is another part of this that I think is really interesting. My friend Austin, who works on this, has often said, if all that Observability 2.0 does is give us a better, faster monitoring system, it will have failed, because a huge part of making this data valuable is incorporating the business context. All of the really interesting questions you ever want to ask are some combination of system, application, user, and business requirements. Right? And having those siloed off into different teams and different tools just contributes to the dark matter of software engineering. Just this difficulty with reconciling your views of reality before you can even ask or answer the interesting questions.
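
A toy contrast of the write-time versus read-time aggregation trade-off Charity describes above, on synthetic data with hypothetical field names: pre-aggregating keeps only the answers you predicted, while keeping raw events lets you compute whatever turns out to matter later.

```python
import random
import statistics

random.seed(7)

# Raw events, one record per request: the "store the raw events" model.
events = [
    {
        "endpoint": random.choice(["/checkout", "/search", "/login"]),
        "duration_ms": random.lognormvariate(4, 0.6),
        "region": random.choice(["us-east-1", "eu-west-1"]),
    }
    for _ in range(10_000)
]
durations = [e["duration_ms"] for e in events]

# Write-time aggregation: decide up front that only a global p99 matters and
# keep that single number. Any question you didn't predict is gone.
p99_only = statistics.quantiles(durations, n=100)[98]

# Read-time aggregation: keep the events and answer whatever comes up later,
# sliced by dimensions nobody thought to pre-aggregate.
checkout_eu = [e["duration_ms"] for e in events
               if e["endpoint"] == "/checkout" and e["region"] == "eu-west-1"]
p50_checkout_eu = statistics.median(checkout_eu)
p999 = statistics.quantiles(durations, n=1000)[998]
slowest = max(events, key=lambda e: e["duration_ms"])  # drill to the worst request

print(round(p99_only, 1), round(p50_checkout_eu, 1), round(p999, 1), slowest["endpoint"])
```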

Kshitij Grover [00:28:24]:
That makes a lot of sense. I don't know how much you all have thought about the economics of this internally recently, but I'm curious: have things like S3 Express One Zone changed the model at all for your customers? Because that kind of lets you swap in a faster storage layer with really no architecture change. Is that something that people have asked for or you've thought about much?

Charity Majors [00:28:47]:
I haven't thought about it much. I haven't heard anything from the engineering team. Maybe they have, but, I don't think so.

Kshitij Grover [00:28:53]:
Interesting. Yeah. Because I'm just wondering, maybe that's not a bottleneck at all. If your p95 is still less than one second, maybe the storage layer is not the bottleneck.

Charity Majors [00:29:02]:
That actually really surprised us. We approached that as an experiment. We were like, there's no way that S3 is going to be fast enough to meet the kind of needs that we have. No way. And to our shock, it actually was, because it's parallelizable enough. And also because it didn't have exactly the same performance characteristics as the SSDs did. Some of the things that were faster were slower, and some of the things that were slower were faster. And so, all in all, it was a wash.

Kshitij Grover [00:29:32]:
Yeah. And I imagine you have to think carefully about things like the sizes of the files that you're writing, so that you tune the parallelization factor. I imagine some pretty serious innovation in the actual storage format on S3, and maybe the caching in front of it. One thing I'm wondering about: in addition to thinking about the storage layer and the query layer, you've also started introducing things like BubbleUp, right, with this idea of surfacing... I don't know if you would call it categorically AI, and so I'm curious how you would describe it, but surfacing outliers and insights. How do you think about that in relation to the workflows people use Honeycomb for?

Charity Majors [00:30:12]:
I would not call it AI. I would call it straight-up computing. So BubbleUp, for those who don't know: on any graph that you draw in Honeycomb, you can draw a little bubble around any part. If you're like, ah, this is interesting, I want to understand it, or this looks like an outlier, this is a spike, whatever. Draw a little bubble around it, and we instantly compute all of the dimensions inside the bubble versus outside in the baseline, and then we sort and diff them.
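
A toy version of that sort-and-diff idea, to make it concrete; this is an illustrative sketch, not BubbleUp's actual algorithm, and the event fields are hypothetical.

```python
from collections import Counter

def bubble_up(selected, baseline, dimensions):
    """Rank (dimension, value) pairs by how much more common they are in the
    selected events than in the baseline."""
    scored = []
    for dim in dimensions:
        sel = Counter(e.get(dim) for e in selected)
        base = Counter(e.get(dim) for e in baseline)
        for value, count in sel.items():
            sel_frac = count / len(selected)
            base_frac = base.get(value, 0) / max(len(baseline), 1)
            scored.append((dim, value, round(sel_frac - base_frac, 3)))
    return sorted(scored, key=lambda s: s[2], reverse=True)

# Hypothetical example: events inside the bubble (slow requests) vs the baseline.
slow = [{"build_id": "9bf2", "region": "us-east-1"}] * 40
rest = [{"build_id": "8aa1", "region": "us-east-1"}] * 960
print(bubble_up(slow, rest, ["build_id", "region"])[:2])
# build_id=9bf2 floats to the top: that's where the selection differs most.
```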

Charity Majors [00:30:37]:
So you're like, I care about this, and then immediately you're like, ah, the thing I care about is different in x, y, z ways. And that is possible because we store so much context, so we can just compute it. I actually think that all of the, quote, unquote, AI innovation in this space is a hack around the fact that people are not storing the context and can't compute it, so they have to guess about it. And it does not seem very effective to me, honestly. There are some great uses for AI, but they're in the fuzzy, nondeterministic, relational things, not the things you can just calculate. I think it's a little bit cynical, and I have not heard good things from anyone about how well it works.

Kshitij Grover [00:31:22]:
Yeah. It sounds like you're saying AI is, like, an interpolation technique. But if you can just store all the data, you should store all the data, and you should be able to query over it effectively. Right?

Charity Majors [00:31:31]:
I wanna say, there are some really interesting things we can do with AI in this space. We actually have an AI feature called Query Assistant, so you can just ask questions of your data using natural language. I just shipped something, what are the slowest endpoints, what changed? Which is dope.

Charity Majors [00:31:45]:
And I think that most of the really promising areas for AI are those sort of fuzzy, nondeterministic ones, mostly in the realm of, how are my humans interacting with this system? And then how can we use that to sort of synthesize and bring everyone up to the level of the best debugger in every corner? Because while you're developing on a part of the system, you know that part intimately. But when it comes to understanding the system, you have to understand the whole thing. So everyone is expert at interacting with their part of the system; how can we bring that wisdom together for everyone? I think that's really interesting. But the simple stuff, like, I just shipped a change, what happened? You should not need AI to tell you that.

Kshitij Grover [00:32:27]:
I really like that. It sounds like what you're saying is that you wanna lower the barrier to entry for people to interact with the tool with AI, and that could be things like natural language querying. But for the precise things where you're looking for something, that's maybe not a place for AI.

Charity Majors [00:32:44]:
Not the place for AI. False positives are just so expensive.

Kshitij Grover [00:32:49]:
Yeah. Exactly. That makes sense. And actually, that's related to this: we've talked about the architecture of Honeycomb, and it would be useful context to talk about the data model of Honeycomb. Notably, not these three pillars, but what does it look like for someone to use Honeycomb, in the sense of the workflow and the product? If it's not just, oh, I'm having to flip back between traces and metrics and logs and try to manually connect the dots?

Charity Majors [00:33:14]:
That's such a great question. When you're debugging with three pillars, it does look a lot like, okay, I'm getting paged. I'm jumping to my dashboard. I'm gonna start, like, flipping through dashboards, looking for other shapes like that shape. And then when I find something I think might be it, I'm gonna jump to my logging tool and look for the same time range and start looking for strings. Like, you have to know what you're looking for in order to find it, and then you have to jump around from tool to tool. And with Honeycomb, typically you get paged because of an SLO; it's burning down.

Charity Majors [00:33:44]:
So you jump to the SLO page, because the SLO is burning down. You see exactly why it's burning down. You see how the events that are violating the SLO are different from the events that are not violating the SLO. You might click on the top one, and you'll go to a graph of it, and you can be like, show me this over time. It's absurdly simple, and yet it's hard to describe, because it's not a rote set of steps. You look, the data tells you what you look at next, and then you look at the next thing.

Charity Majors [00:34:11]:
And typically, within a few seconds, you've found it. We have this experience over and over again when we're doing a POC or a trial with a prospective customer, where we'll be pairing with them, instrumenting or something, and we'll be like, what's that? They'll be like, what's what? And then five minutes later, they'll have an outage or something, and we're like, oh, it's right there. Or another thing that happens a lot is we start looking at it and we'll find things. We're like, oh, what's happening here? And they're like, oh, shit. And then they're like, oh, we gotta go fix that. And we're like, I hear you, but this has been like this for god knows how long. And if we stop to fix everything we're gonna find, we're never gonna get this rolled out and instrumented.

Charity Majors [00:34:49]:
Because people's aggregates, like, are covering over a multitude of sins. Right? There are so many things broken in your system right now you just don't know about because your data doesn't surface it to you until a customer complains, and then you go waste a day or 2 trying to repro it. But when you have the ability to just see the outliers in your data, suddenly, so many things become so obvious to you.

Kshitij Grover [00:35:12]:
So is it right to think that in the Honeycomb thesis, or way of doing things, a lot of your intentionality is going into the definition of the SLOs? And I almost feel like that makes sense from a macro business perspective.

Charity Majors [00:35:25]:
Absolutely. A big part of the shift from o11y 1.0 to 2.0 is often around that sort of entry point, what you're getting alerted for, what you're getting paged for. Right? Because in the bad old days, we're like, anytime a symptom looks like it's spiking, let's alert someone, wake them up in the middle of the night, which doesn't scale. Right? And it burns people out really fast. And so you want to shift to SLOs so that you're aligning engineering pain with customer pain. You don't wanna wake someone up because something might be about to go wrong.

Charity Majors [00:35:55]:
You want to wake someone up when customers are in pain, and that's it. So you put in the work to define these SLOs. You're like, these are the things that our business is built on. There are lots of things that can happen that someone should get to eventually, but these are the things that our customers will notice and feel that will hurt us as a business. And once we all agree on those things I think of SLOs as like the APIs for engineering teams, because in the absence of these SLOs and agreements, you're all getting up in each other's business about your road maps. Right? You're micromanaging, you're like, but I wanna get on your road map there. If we make these agreements, then I agree to provide this level of service. What I do on the other side of that is my fucking business.

Charity Majors [00:36:39]:
I think a lot of the really important work around this is defining SLOs.
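
As a rough illustration of the SLO-as-entry-point idea, here is a toy error-budget calculation. The threshold, target, and field names are hypothetical, and real SLO tooling typically also tracks burn rate over multiple time windows; the point is simply that paging is tied to customer-visible pain rather than to any spiking symptom.

```python
def slo_status(events, target=0.999, threshold_ms=500.0):
    """Toy SLO check: an event violates the SLO if it errored or exceeded the
    latency threshold; page only when the error budget is actually blown."""
    total = len(events)
    bad = sum(1 for e in events
              if e.get("error") or e.get("duration_ms", 0) > threshold_ms)
    compliance = 1 - bad / total if total else 1.0
    budget = 1 - target                        # allowed fraction of bad events
    budget_used = (bad / total) / budget if total else 0.0
    return {
        "compliance": round(compliance, 4),
        "budget_used": round(budget_used, 2),  # > 1.0 means the SLO is blown
        "page": budget_used > 1.0,             # align paging with customer pain
    }

events = [{"duration_ms": 120}] * 990 + [{"duration_ms": 900, "error": True}] * 10
print(slo_status(events))  # 1% bad against a 0.1% budget: time to page someone
```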

Kshitij Grover [00:36:43]:
And it sounds like that would be the right bridge between the engineering team and maybe the VPs and directors, which is: okay, what are we getting out of this migration to a tool like Honeycomb, adopting this new style of thinking? Yeah. Okay. Maybe we're getting these SLOs defined for the first time ever. Right? We're actually understanding what the contract of our service is.

Charity Majors [00:37:03]:
Yes, 100%. I think that's really meaningful. I think that's really insightful. I think that, for the companies Honeycomb seems to be really successful with at this point, there are two things we really look for. Number one is, does the company have dollar values attached to individual requests? Delivery companies, finance. If they know that their specific outcomes lead to changes in money, up or down, then they're probably really good Honeycomb customers.

Charity Majors [00:37:32]:
Versus ad companies that are just spray and pray; they don't give a shit if an individual ad makes it to an individual eyeball. Not a great Honeycomb customer. And the second thing that predicts a really good Honeycomb customer is a certain level of intentionality. Sometimes we get great customers who are firefighting all day long, but more typically, it's companies that have gotten themselves out of the weeds a bit, such that they're able to plan for the next couple of years. They're, like, looking at what's coming down the pipeline, and those make great Honeycomb customers. A lot of them are the tip of the iceberg of the platform engineering sort of changes, where they're devoting teams of engineers not to owning production directly, but to empowering engineers to own theirs directly.

Charity Majors [00:38:15]:
Those are great Honeycomb customers. And, ultimately, we want to become a great vendor for the teams that are in the bottom 50% of engineering teams. But right now, we're a very small team relative to our competitors. And the ones that make really good Honeycomb customers are the ones that are really anticipating the future and trying to build the right sort of human and technical architecture, like you were saying, to solve future business problems.

Kshitij Grover [00:38:40]:
And it's frustrating, because it's really hard to imagine being able to pop your head out and think about the future when you're spending so much of your time on things like cost control. It's like you need that stuff to be able to do it, which is tricky.

Charity Majors [00:38:53]:
It's a whole chicken and egg thing.

Kshitij Grover [00:38:55]:
Maybe last question. What are you most excited about that Honeycomb's working on now? Or it could be just, industry wide. What do you think is working in terms of the way you're trying to push forward the conversation?

Charity Majors [00:39:07]:
I'm really excited about OpenTelemetry. OpenTelemetry has got a bit of a reputation, like Kubernetes, for being overkill, like, big, kind of bulky. I'm not gonna argue with any of that. But the beautiful thing that OTel does is make it so you instrument your code once, and then you can make vendors compete for your business based on how awesome they are, not on locking you in to their custom instrumentation crap. I'm really excited that the project has achieved liftoff and people are getting on board; it's in a good place. I'm also really excited about Honeycomb's mobile and front-end observability, connecting the front end: you should be able to trace all the way from your device to the back end and back without breaking.
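
A minimal sketch of the "instrument once, let vendors compete" point, using the OpenTelemetry Python SDK; the endpoint and attribute values are placeholders. Switching backends means changing the exporter endpoint and headers (or the standard OTEL_EXPORTER_OTLP_* environment variables), not rewriting the instrumentation itself.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point the OTLP exporter at whichever backend you're trying out; swapping
# vendors is a configuration change, not a re-instrumentation of the code.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://collector.example.com:4317")  # placeholder
))
trace.set_tracer_provider(provider)

# The instrumentation itself is vendor-neutral and written once.
tracer = trace.get_tracer("my-service")
with tracer.start_as_current_span("startup-check") as span:
    span.set_attribute("build.id", "placeholder-build")
```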

Charity Majors [00:39:52]:
I feel like the edges of tools are what create silos way too often. It's like, you've got your view of reality because you use this tool. I've got mine because I use that tool. And so knitting everyone together into a single engineering org with a single source of truth is really exciting to me.

Kshitij Grover [00:40:08]:
Yeah. Awesome. I'm excited both for less vendor lock-in industry-wide and for all the conversations you're spearheading throughout the industry. So it's been really great having you in this conversation and on the podcast, and I appreciate your time. Thank you.

Charity Majors [00:40:23]:
Thank you so much for having me. I'm glad we managed to make this work.