Danyel Fisher is a principal design researcher at Honeycomb.io, makers of observability tooling for engineering and DevOps teams. Prior to joining Honeycomb in May 2018, Danyel worked as a senior researcher at Microsoft for nearly 14 years, with a focus on data visualization. He holds a master's in computer science from UC Berkeley and a PhD in information and computer science from UC Irvine.
Join Corey and Danyel as they talk about the different kinds of research, what the biggest misunderstanding about Danyel’s job is, how figuring out the root cause of an outage is like a murder mystery, how nobody really knows what digital transformation means, how it’s easy to find issues when you start an observability project but how starting such a project is the hardest part, what Honeycomb means by testing in production and why they encourage teams to do that, the difference between conducting research for a juggernaut like Microsoft and an agile startup like Honeycomb, and more.
Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Cloud Economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of Cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
This episode is sponsored by our friends at New Relic. If you’re like most environments, you probably have an incredibly complicated architecture, which means that monitoring it is going to take a dozen different tools. And then we get into the advanced stuff. We all have been there and know that pain, or will learn it shortly, and New Relic wants to change that. They’ve designed everything you need in one platform with pricing that’s simple and straightforward, and that means no more counting hosts. You also can get one user and a hundred gigabytes a month, totally free. To learn more, visit newrelic.com. Observability made simple.
Corey: This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the Enterprise (not the starship). On-prem security doesn’t translate well to cloud or multi-cloud environments, and that’s not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35 percent faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com/trial.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Danyel Fisher, a principal design researcher for Honeycomb.io. Danyel, welcome to the show.
Danyel: Thanks so much, Corey. I'm delighted to be here. Thanks for having me.
Corey: As you should be. It's always a pleasure to talk to me and indulge my ongoing love affair with the sound of my own voice. So, you have a fascinating and somewhat storied past. You got a PhD years ago, and then you spent 13 years at Microsoft Research. And now you're doing design research at Honeycomb, which is basically, if someone were to paint a picture of Microsoft as a company, knowing nothing internal, only external, it is almost the exact opposite of the picture people would paint of Honeycomb, I suspect. Tell me about that.
Danyel: I can see where you're coming from. I actually sort of would draw a continuity between these pieces. So, I left academic research and went to corporate research, where my goal was to publish articles and to create ideas for Microsoft, but now sort of on a theme of the questions that Microsoft was interested in. Over that time, I got really interested in how people interact and work with data, and it became more and more practical. And really, where's a better place to do it than the one place where we're building a tool for people who wake up at three in the morning with their pager going, saying, “Oh, my God, I'm going to analyze some data right now.”
Corey: There's something to be said for being able to, I guess, remove yourself from the immediacy of the on-call response and think about things in a more academic sense, but it sounds like you've sort of brought it back around, then. Tell me more. What aligns, I guess, between the giant enterprise story of research and the relatively small, scrappy startup story for research?
Danyel: Well, what they have in common is that in the end, it's humans using the system. And whether it's someone working in the systems that I built as an academic, or working on Excel, or Power BI, or working inside Honeycomb, they're individual humans, who in the end are sitting there at a screen, trying to understand something about a complex piece of data. And I want to help them know that.
Corey: So, talk to me a little bit about why a company that focuses on observability—well and before they focused on that, they focused on teaching the world what the hell observability was. But beyond that, then it became something else where it's, “Okay. We're a company that focuses on surfacing noise from data, effectively.” We can talk about that a bunch of different ways, cardinality, et cetera, but how does that lend itself to, “You know what we need to hire? That's right: a researcher.” It seems like a strange direction to go in. Help me understand that.
Danyel: So, the background on this actually is remarkable. Charity Majors, our CTO, was off at a conference and apparently was hanging out with a bunch of friends, and said, “You know, we're going to have to solve this data visualization thing someday. Any of you know someone who's on the market for data visualization?” And networks wiggled around, and a friend of a friend of a friend overheard that and referred her to me. So, yeah, I do come from a research background.
Fundamentally, what Honeycomb wanted to accomplish was they realized they had a bunch of interfaces that were wiggly lines. Wiggly lines are great. All of our competitors use wiggly lines, right? When you think APM, you think wiggly lines displays. But we wanted to see if there was a way that we could help users better understand what their systems were doing. And maybe wiggly lines were enough, but maybe they weren't. And you hire a data visualization researcher to sit down with you and talk about what users actually want to understand.
Corey: So, what is it that people misunderstand the most about what you do, whether it's your role, or what you focus on in the world because when you start talking about doing deep research into various aspects of user behavior, data visualization, deep analysis, there's an entire group of people—of which I confess, I'm one of them—who, more or less, our eyes glaze over and we think, “Oh, it's academic time. I'm going to go check Twitter or something.” I suffer from attention span issues from time to time. What do people misunderstand about that?
Danyel: The word ‘researcher’ is heavily overloaded. I mean, overloaded in the object-oriented programming sense of the word. There's a lot of different meanings that it takes on. The academic sense of researcher means somebody who goes off and finds an abstract problem that may or may not have any connection to the real world. The form of researcher that many people are going to be more familiar with in industry is the user researcher: the person who goes out and talks to people so you don't have to.
I happen to be a researcher in user research. I go off and find abstract questions based on working with people. So, the biggest misunderstanding, if you will, is trying to figure out where I fit into that product cycle: I'm often off searching for problems that nobody realized we had.
Corey: It's interesting because companies often spend most of their time trying to solve established and known problems. Your entire role—it sounds like—is, “Let's go find new problems that we didn't realize existed,” which is, first, it tells a great story about a willingness to go on a journey of self-discovery from a corporate sense, but also a, “What, we don't have enough problems, you're going to go borrow some more?”
Danyel: [laugh]. So, let me give you a concrete example if you don't mind.
Corey: Please, by all means.
Danyel: When I first came to Honeycomb, we were a wiggly lines company. We just had the most incredibly powerful wiggly lines ever. So, premise: you've got your system, you've successfully instrumented it with wonderful, wide events, and you've heard other guests before talking about observability and the importance of high-cardinality, wide events. So, you've now got hundreds of dimensions. And the question was rapidly becoming, how do you figure out which dimension you care about?
And so I went out, I watched some of our user interviews, I watched how people were interacting with the system, and heck, I even watched our sales calls. And I saw over and over again, there was this weird guessing-game process. “We've got this weird bump in our data. Where'd that come from? Well, maybe it's the load balancer. I'll go split out by load balancer. No, that doesn't seem to be the factor. Ah, maybe it's the user ID; let's split out by user ID. No, that one doesn't seem to be relevant either. Maybe it's the status code. Aha, the status code explains why that spike’s coming up. Great, now we know where to go next.”
Corey: Oh, every outage inherently becomes a murder mystery at some point. At some point of scale, you have to grow beyond that into the observability space, but from my perspective, it always seemed that, well, we're small-scale enough in most cases that I can just play detective in the logs when I start seeing something weird. It's not the best use of anyone's time, but for a little while, it works; it's just that over time, it winds up being less and less valuable or useful as we go.
Danyel: I'm not convinced it’s a scale problem. I'm convinced that it's a dimensionality problem. If we only give you four dimensions to look at, then you can check all four of them and call it a day. But Honeycomb really values those wide, rich events with hundreds, or—we have several production systems with thousands of columns of data. You're not going to be able to flip through it all.
And we can't parallelize murder mystery solving. But what we can do is use the power of the computer to go out and computationally find out why that spike is different. So, we built a tool that allows you to grab a chunk of data, and the system goes in and figures out what dimensions are most different between the stuff you selected and everything else. And suddenly what was a murder mystery becomes a click, and while that ruins Agatha Christie's job, it makes ours a whole lot better.
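A minimal sketch of the idea Danyel describes here: compare how often each field and value shows up in a selected chunk of events versus everything else, then rank the biggest differences. The field names, sample events, and selection predicate below are hypothetical, and this illustrates the general technique rather than Honeycomb's actual implementation.

```python
from collections import Counter, defaultdict

def rank_differing_dimensions(events, selected):
    """events: iterable of dicts (wide events).
    selected: predicate marking the anomalous chunk, e.g. the spike you highlighted."""
    sel_counts, rest_counts = defaultdict(Counter), defaultdict(Counter)
    n_sel = n_rest = 0
    for event in events:
        if selected(event):
            counts = sel_counts
            n_sel += 1
        else:
            counts = rest_counts
            n_rest += 1
        for field, value in event.items():
            counts[field][value] += 1

    # Score each field/value pair by how different its frequency is inside
    # the selection versus outside it; the biggest gaps float to the top.
    scores = []
    for field in set(sel_counts) | set(rest_counts):
        for value in set(sel_counts[field]) | set(rest_counts[field]):
            p_sel = sel_counts[field][value] / max(n_sel, 1)
            p_rest = rest_counts[field][value] / max(n_rest, 1)
            scores.append((abs(p_sel - p_rest), field, value))
    return sorted(scores, reverse=True)

# Hypothetical events: the slow requests turn out to share status_code=500.
events = [
    {"status_code": 200, "endpoint": "/home", "duration_ms": 12},
    {"status_code": 500, "endpoint": "/cart", "duration_ms": 950},
    {"status_code": 200, "endpoint": "/cart", "duration_ms": 15},
    {"status_code": 500, "endpoint": "/cart", "duration_ms": 870},
]
for score, field, value in rank_differing_dimensions(events, lambda e: e["duration_ms"] > 500)[:3]:
    print(f"{field}={value} differs by {score:.2f}")
```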
Corey: There is something to be said for making outages less exciting. It's always nice to not have to deal with guessing and checking, and, “Oh, my ridiculous theoretical approach was completely wrong, so instead, we're going in a different direction entirely.” It's really nice to not have to play those ridiculous, sarcastic games. The other side of it, though, is how do you get there from here? It feels like it's an impossible journey. The folks who are the success stories have spent many years getting there, and for those of us in roles where we haven't been, it's, “Oh, great. Step one: build an entire nuclear reactor.” It feels like it is this impossible journey. What am I missing?
Danyel: You know, when I started off at Honeycomb, this was one of my fears, too, that we'd be asking people to boil the ocean. You have to go in, take all your microservices and instrument them, and pull in data from everything. And suddenly, it's this huge, intimidating task. Over and over, what I've seen is that one of our customers says, “You know what? I'm just going to instrument one system. Maybe I'll just start feeding in my load balancer logs. Maybe I'll just instrument the user-facing piece that I'm most worried about.”
And we've hit the point where I will regularly get users reporting that they instrumented some subsystem, started running it, and instantly figured out this thing that had been costing them a tremendous amount on their Amazon bill, or that had been holding up their ability to respond to questions, or had been causing tremendous latency. We all have so many monsters hidden under the rug, that as soon as you start looking at it, things start popping out really quickly and you find out where the weak parts of your system are. The challenge isn’t boiling the ocean or doing that huge observability project, it's starting to look.
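As an illustration of how small the “just start feeding in my load balancer logs” starting point can be, here is a rough Python sketch that turns a single load balancer log line into one wide, structured event. The log format, regex, and field names are assumptions made for the example; a real setup would match its own log format and send the resulting event to whatever observability backend it uses.

```python
import re

# Hypothetical nginx-style access log line; the format and field names below
# are assumptions for illustration, not a prescribed or Honeycomb-specific format.
LOG_LINE = '10.0.0.7 - - [12/May/2020:03:14:07 +0000] "GET /cart HTTP/1.1" 500 1024 0.950'
PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status_code>\d+) (?P<bytes_sent>\d+) (?P<duration_s>[\d.]+)'
)

def log_line_to_event(line):
    """Turn one load balancer log line into a single wide, structured event."""
    match = PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Cast numeric fields so they can be aggregated later (counts, P95s, sums).
    event["status_code"] = int(event["status_code"])
    event["bytes_sent"] = int(event["bytes_sent"])
    event["duration_ms"] = float(event.pop("duration_s")) * 1000
    # Anything else known at this point can simply be added as another field.
    event["service_name"] = "edge-load-balancer"
    return event

print(log_line_to_event(LOG_LINE))
```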
Corey: It's one of those stories where it sounds like it's not a destination, it's a journey. This is usually something said by people who have a journey to sell you. It's tricky to get that into something that is palatable for a business, where it's, “Oh. Instead of getting to an outcome, you're just going to continue to have this new process forever.” It feels—to be direct—very similar to digital transformation, which no one actually knows what that means, but we go for it anyway.
Danyel: Wow, harsh but true.
Corey: It's hard to get people to sign up for things like this, believe me; I tried to sell digital transformation. It's harder than it looks.
Danyel: And now we're trying to sell DevOps transformation and SRE transformation. A lot of these things do turn into something of a discipline. And this is kind of true—unfortunately, this is a little bit true here, too. It's not so much that there's a free lunch, it's that we're giving you a tool to better express the questions that you already have. So, I'd prefer to think of it less as selling transformation, and more as an opportunity to start getting to express the things you want to.
I mean, I'd say it sits under the same category, in my mind, as doing things like switching to a typed language or using test-driven development. Allowing you to express the things that you want to be able to express to your system means that later, you're able to find out if you're not doing it, and you're not seeing the thing that you thought you were. But yeah, that's a transformation, and I don't love that we have to ask for that.
Corey: One challenge I've seen across the board with all of the—and I know you're going to argue with me on this—analytics companies. I would consider observability, on some level, to look a lot like analytics, where it all comes down to this: the value you get feels directly correlated with the amount of data you shove into the system. Is that an actual relationship? Is that just my weird perception, given that I tend to look at expensive things all the time, and that's data in almost every case?
Danyel: No, I actually wouldn't quibble with that at all. Our strength is allowing you to express the most interesting parts of your system. If you want to only send in two dimensions of data—“This was a request and it succeeded”—that's fantastic, but you can't ask very interesting questions of that. If you tell me that it's a request, and it did a database call, and the database call succeeded but only after 15 retries, and it was using this particular query, and the index was hit this way—the more that you can tell us, the more that we can help you find what factors were in common. This is the curse of analytics; we like data. On the other hand, I think the positive part is that our ability to help find needles in haystacks actually means we want you to shovel on more hay, as the old joke goes; that's a good thing. I agree that that's a cost center, though.
Corey: Yeah, the counter-argument is, whenever I go into environments where they're trying to optimize their AWS bill, we take a look at things and, “Well, you're storing Apache weblogs from 2008. Is that still something you need?” And the answer is always a, “Oh, absolutely,” coming from the data science team, where there's almost this religious belief that with just the right perspective, those 12-year-old logs are going to magically solve a modern business problem, so you have to pay to keep them around forever. It feels like, on some level, data scientists are constantly in competition to see whether they can cost more than the data that they're storing, and it's always neck and neck at some level.
Danyel: I will admit that for my personal life, that's true. I've got that email archive from, you know, 1994, that I'm still trying to convince myself I'm someday going to dig through and go chase, like, I don't know, the evolution of what the history of spam is. And so it's really vital that I keep every last message. But in reality, for a system like Honeycomb, we want that rich information, but we also know that that failure from six months ago is much less interesting than what just happened right now. And we're optimizing our system around helping you get to things from the last few weeks.
We recently bumped up our default storage for users from a few days to two months. And that's really because we realized that people do want to kind of know what goes back in time, but we want to be able to tell the story of what your last few weeks, what your last few releases looked like, what your last couple tens of thousands of user hits on your site are. You don't need every byte of your log from 2008. And in fact—this one's controversial, but there's a lot of debates about the importance of sampling. Do you sample your data?
Corey: Oh, you never sample your data, as reported by the companies that charge you for ingest. It’s like, “Hm, there seems to be a bit of a conflict of interest there.” You also take a look at some of your lesser-trafficked environments, and it seems that there's a very common story where, “Oh, for the small services that we're shoving this stuff into, 98 percent of those logs are load balancer health checks. Maybe we don't need to send those in.”
Danyel: So, I'm going to break your little rule here and say Honeycomb’s sole source of revenue at this point is ingest price. We charge you by the event that you send us. We don't want to worry about how many fields you're sending us, in fact, because we want to encourage you to send us more, send us richer, send us higher-dimensional. So, we don't charge by the field, we do charge by the number of events that you send us. But that said, we also want to encourage you to sample on your side because we don't need to know that 99 percent of what your site sends us is, “We served a 200 in point one milliseconds.”
Of course you did; that's what your cache is for, that's what the system does, it's fine. You know what? If you send us one in a thousand of those, it will be okay; we'll still have enough to be able to tell that your system is working well. And in fact, if you put a little tag on it that says this is sampled at a rate of one in a thousand, we'll keep that one in a thousand, and when you go to look at your count metrics, and your P95s of duration, and that kind of thing, we'll expand that out and multiply it correctly so that we still show you the data reflected right. On the other hand, the interesting events: that status 400, the 500 internal error, the thing that took more than a quarter second to send back, send us every one of those so we can give you as rich information as possible back about what's going wrong. If you have a sense of what's interesting to you, we want to know it, so that we can help you find that interestingness again.
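A rough Python sketch of the client-side sampling scheme described here: keep every interesting event, keep roughly one in a thousand of the healthy fast responses, and record the sample rate on what you keep so counts can be multiplied back out at query time. The field names, thresholds, and helper functions are hypothetical; this shows the general idea, not Honeycomb's SDK.

```python
import random

BORING_SAMPLE_RATE = 1000  # keep roughly 1 in 1,000 healthy, fast responses

def maybe_sample(event):
    """Return (event, sample_rate) if the event should be sent, or None to drop it."""
    interesting = event["status_code"] >= 400 or event["duration_ms"] > 250
    if interesting:
        return event, 1                   # always send errors and slow requests
    if random.randrange(BORING_SAMPLE_RATE) == 0:
        return event, BORING_SAMPLE_RATE  # a 1-in-1,000 representative of boring traffic
    return None                           # drop the rest at the edge

def estimated_request_count(kept):
    """Weight each kept event by its sample rate to reconstruct the true total."""
    return sum(rate for _event, rate in kept)

# Hypothetical traffic: mostly fast 200s, plus a handful of slow 500s.
traffic = [{"status_code": 200, "duration_ms": 1}] * 100_000 + \
          [{"status_code": 500, "duration_ms": 900}] * 5
kept = [s for s in map(maybe_sample, traffic) if s is not None]
print(len(kept), "events sent;", estimated_request_count(kept), "requests estimated")
```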
Corey: This episode is sponsored in part by our friends at Linode. You might be familiar with Linode; I mean, they’ve been around for almost 20 years. They offer Cloud in a way that makes sense rather than a way that is actively ridiculous by trying to throw everything at a wall and see what sticks. Their pricing winds up being a lot more transparent—not to mention lower—their performance kicks the crap out of most other things in this space, and—my personal favorite—whenever you call them for support, you’ll get a human who’s empowered to fix whatever it is that’s giving you trouble. Visit linode.com/screaminginthecloud to learn more, and get $100 in credit to kick the tires. That’s linode.com/screaminginthecloud.
Corey: No, and as always, it's going to come down to folks having a realistic approach to what they're trying to get out of something. It's hard to wind up saying, “I’m going to go ahead and build out this observability system and instrument everything for it, but it's not going to pay dividends in any real way for six, eight, twelve months.” That takes a bit of a longer-term view. So, I guess part of me is wondering how you demonstrate immediate value to folks who are sufficiently far removed from the day-to-day operational slash practitioner side of things to see that value?
Danyel: Mmm.
Corey: It's the “How do you sell things to the corner offices?” question, really, that I'm talking about here.
Danyel: Mmm. Well, as I was saying before, we've been finding these quick wins in almost every case. We've had times when our sales team is sitting down with someone running an early instrumentation, and suddenly someone pops up and goes, “Oh, crap. That's why that weird thing’s been happening.” And while that may not be the easy big dollars that you can show off to the corner office, at least that does show that you're beginning to understand more about what your system does.
And very quickly, those missed caches, and misconfigured SQL queries, and weird recursive calls, and timeouts that were failing for one percent of customers do begin to add up pretty quickly to unlocking customer value. When you can go a step further and say, “Oh, now that this sporadic error that we don't understand doesn't wake up my people at 3 a.m., people are more willing to take the overnight shift. Morale is increasing, we've been able to control which of our alerts we actually care about.” It feels like it pays off a lot more quickly than the six-to-nine-month range.
Corey: Yeah. That's always the question where, after we spend all this time and energy getting this thing implemented—and frankly, now that I run a company myself, the actual cost of the software is not the expensive part. It's the engineering effort it takes because the time people are spending on getting something like this rolled out is time they're not spending on other things, so there's always going to be an opportunity cost tied to it. It's a form of that terrible total cost of ownership approach where what is the actual cost going to be for this? And you can sit there and wind up in analysis-paralysis territory pretty easily.
But for me, at least, the reason I know that Honeycomb is actually valuable is I've talked to a number of your customers who can very clearly articulate that value of what it is they got out of it, what they hope to get out of it and where the alignments or misalignments are. You have a bunch of happy customers, and, frankly, given how much mud everyone in this industry loves to throw, we'd have heard about it if that weren't true. So, there is clearly something there. So, any of these misapprehensions that I'm talking about here that do not align with reality are purely my own. This is not the industry perspective; it's mine. I should probably point out that while you folks are customers of The Duckbill Group, we are not Honeycomb customers at the moment because it feels like we really don't have a problem of a scale where it makes sense to start instrumenting.
Danyel: This is where my sales team would say we should get you on our free tier. But that's a different conversation.
Corey: Oh, I’m sure it is. There's not nearly as much software as you might think for some of this. It’s, “Well, okay. Can you start having people push a button every time they change tasks?” And yeah, down that path lies madness, time tracking, and the rest? Yee.
Danyel: Oh, God. Yeah, no, I don't think that's what we do, and I don't think that's what you want us to do. But I want to come back to something you were just talking about: enthusiastic customers. As a user researcher, one of my challenges—and this was certainly the case at Microsoft—was, where do I find the people to talk to? And so, we had entire, like, user research departments who had Rolodexes full of people who they'd call up and go ask to please come in to sit in a mirrored room for us.
It absolutely blew my mind to be working for a company where I could throw out a request on Twitter, or pop up in our internal customer Slack and say, “Hey, folks. We're beta’ing a new piece,” or, “I want to talk to someone about stuff,” and get informed, interested users who desperately wanted to talk. And I think that's actually really helped me accelerate what I do for the product because we've got these weirdly passionate users who really want to talk about data analysis more than I think any healthy human should.
Corey: That's part of the challenge, too, on some level is that—how to frame this—there are two approaches you can take towards selling a service or a product to a company. One is the side that I think you and I are both on: cost optimization, reducing time to failure, et cetera. The other side of that coin is improving time-to-market, speeding velocity of feature releases. That has the capability of bringing in orders of magnitude more revenue and visibility on the company than the cost savings approach. To put it more directly, if I can speed up your ability to release something to the market faster than your competition does, you can make way more money than I will ever be able to save you on your AWS bill.
And it feels like there's that trailing function versus the forward-looking function. In hindsight, that's one of the few things I regret about my business model is that it's always an after-the-fact story. It's not a proactive, get people excited about the upside. Insurance folks have the same problem too, by the way. No one's excited to wind up buying a bunch of business insurance, or car insurance, or fire insurance.
Danyel: Right. You alleviate pain rather than bring forward opportunity.
Corey: Exactly.
Danyel: I've been watching our own marketing team… I don't want to say struggle; I'll say pivot around to that. I think when I first came in—and that was about two years ago—the Honeycomb story very much was, we're going to let your ops team sleep through the night, and everyone's going to be less miserable. And that part's great, but the other story that we're beginning to tell much more is that when you have observability into your system, into what the pieces are doing, it's much less scary to deploy. You can start dropping out new things—and our CTO, Charity, loves to talk about testing in production—and what that really means is that you never completely know what's going to happen until you press the button and fire it off. And when you've got a system that's letting you know what things are doing—when you're able to write out your business hypotheses in code, and go look at the monitor, and see whether your system's actually doing the thing that you claimed it was—you feel very free to iterate rapidly and deploy a bunch of new versions. So, that does mean faster time-to-market, and that does mean better iteration.
Corey: You're right. There's definitely a story about, what outcome are companies seriously after? What is it that they care about at this point in time? There's also a cultural shift, at some point, I think. When a company goes from being small and scrappy, and, “We're going to bend the rules and it's fine, move fast break things, et cetera,” to, “We're a large enterprise, and we have significant downside risks, so we have to manage that risk accordingly.”
Left unchecked, at some point, companies become so big that they're unwilling to change anything because that proves too much of a risk to their existing lines of revenue, and long term they wither into irrelevance. I would have put Microsoft in that category once upon a time until the last 10 years have clearly disproven that.
Danyel: I spent a decade at Microsoft, wondering when they were going to accidentally slip into irrelevance and being completely wrong. It was baffling.
Corey: I'd written them off. I mean, honestly, the reason I got into Linux and Unix admin work was because their licensing was such a bear when I was dabbling as a Windows admin that I didn't want to deal with it. And I wrote them off and figured I'd never hear from them again after Linux ate the world. I was wrong on that one, and honestly, they've turned into an admirable company. It's really a strange, strange path.
Danyel: It is. And I definitely—I should be clear, did not leave Microsoft because it wasn't an exciting place or wasn't doing amazing things. But coming back to the iteration speed, the major reason why I did leave Microsoft is because I found that the time lag between great idea and the ability to put it to software was measured in years there. Microsoft Research was not directly connected to product. We'd come up with a cool idea and then we'd go find someone in the product team who could potentially be an advocate for it.
And if we could, then we'd go through this entire process, and a year, or two, or five later, maybe we'd have been able to shift one of those very big, slow-moving battleships. And the biggest difference for me between big corporate life and little tiny startup life was that, in contrast, I came to Honeycomb, and I remember one day having a conversation with someone about an idea I had, and the next day we had a storyboard on the wall, and about two weeks later, our users were giving us feedback on it and it was running in production and the world was touching this new thing. It's like, “Wow, you can just… do it.”
Corey: The world changes, and it's really odd just seeing how all of that plays out, how that manifests. And the things we talk about now versus the things we talked about five or ten years ago, are just worlds apart.
Danyel: Yeah.
Corey: Sometimes. Other times, it's still the exact same thing because there's no newness under the sun. So, before we wind up calling it an episode, what takeaway would you want the audience to have? What lessons have you learned that would have been incredibly valuable if you'd learn them far sooner? Help others learn from your mistakes? What advice do you have for, I guess, the next generation of folks getting into either data analytics, research as a whole, making signal appear from noise in the observability context? Anything, really.
Danyel: That is a fantastic question. I'd love for the answer to be something about data analytics and about understanding your data—and I believe, of course, that that's incredibly important—but I'm not going to surprise you at all by saying that, in the end, the story has always been about humans. And in the last two years, I've had exposure to different ways of human-ing than I had before. I'm sure you saw some of this in your interview with Charity about management.
I've been learning a lot about how to persuade people about ideas and how to present evidence of what makes a strong, and valuable, and doable thing. And those lessons have been career-changing for me. I had a very challenging couple of months at Honeycomb before I learned them and then started going, “Oh, that's how you make an idea persuasive.” And the question that I've been asking myself ever since is, “How do I best make an idea persuasive?” And that's actually my takeaway, because once you know what a persuasive idea is, no matter what your domain is going to be, that's what allows you to get things into other people's hands.
Corey: That's, I think, probably a great place to leave it. If people want to hear more about who you are, what you're up to, what you believe, et cetera. Where can they find you?
Danyel: You can find me, of course, at honeycomb.io/danyel, or my personal site is danyelfisher.info. Or because I was feeling perverse, my Twitter is @fisherdanyel.
Corey: Fantastic. And we'll put links to all of that in the [00:31:07 show notes], of course. Well, thank you so much for taking the time to speak with me today. I really appreciate it.
Danyel: Thank you, Corey. That was great.
Corey: Danyel Fisher, principal design researcher for Honeycomb. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast program of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast program of choice, and tell me why you don't need to get rid of your load balancer logs and should instead save them forever in the most expensive storage possible.
Announcer: This has been this week’s episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com, or wherever fine snark is sold.
This has been a HumblePod production. Stay humble.