Break Things on Purpose

Time for a crossover! Today, Page it to the Limit host Mandi Walls, DevOps Advocate at PagerDuty, joins Julie for a special two-parter. They interview Kolton Andrus, co-founder of Gremlin, and Alex Solomon, co-founder of PagerDuty, who have known each other for a good while. Each shares the origins of his company, both of which grew out of their work at larger organizations. Kolton and Alex reflect on how they identified the space where they could build their companies and on the shift from larger entities to startups. Each of them offers some excellent insight!

Show Notes

In this episode, we cover:
  • 00:00:00 - Intro
  • 00:01:56 - How Alex and Kolton know each other and the beginnings of their companies 
  • 00:10:10 - The change of mindset from Amazon to the smaller scale
  • 00:17:34 - Alex and Kolton’s advice for companies that “can’t be a Netflix or Amazon”
  • 00:22:57 - PagerDuty, Gremlin and Crossovers/Outro
Transcript
Kolton: I was speaking about what I built at Netflix at a conference and I ran into some VCs in the lobby, and we got into a bit of a debate. They were like, “Hey, have you thought about building a company around this?” And I was like, “I have, but I don’t want your money. I’m going to bootstrap it. We’re going to figure it out on our own.” And the debate went back and forth a little bit and ultimately it ended with, “Oh, you have five kids and you live in California? Maybe you should take some money.”

Julie: Welcome to the Break Things on Purpose podcast, a show about chaos, culture, building and breaking things with intention. I’m Julie Gunderson and in this episode, we have Alex Solomon, co-founder of PagerDuty, and Kolton Andrus, co-founder of Gremlin, chatting about everything from founding companies to how to change culture in organizations.

Julie: Hey everybody. Today we’re going to talk about building awesome things with two amazing company co-founders. I’m really excited to be here with Mandi Walls on this crossover episode for Break Things on Purpose and Page it to the Limit. I am Julie Gunderson, Senior Reliability Advocate here over at Gremlin. Mandi?

Mandi: Yeah, I’m Mandi Walls, DevOps Advocate at PagerDuty.

Julie: Excellent. And today we’re going to be talking about everything from reliability, incident management, to building a better internet. Really excited to talk about that. We’re joined by Kolton Andrus, co-founder of Gremlin, and Alex Solomon, co-founder of PagerDuty. So, to get us started, Kolton and Alex, you two have known each other for a little while. Can you kick us off with maybe how you know each other?

Alex: Sure. And thanks for having us on the podcast. So, I think if I remember correctly, I’ve known you, Kolton, since your days at Netflix while PagerDuty was a young startup, maybe less than 20 people. Is that right?

Kolton: Just a touch before I joined Netflix. It was actually at that Velocity Conference; we hung out at that suite. I think that was 2013.

Alex: Yeah, sounds right. That sounds right. And yeah, it’s been how many years? Eight, nine years since? Yeah.

Kolton: Yeah. Alex is being humble. He’s let me bother him for advice a few times along the journey. And we talked about what it was like to start companies. You know, he was in the startup world; I was still in the corporate world when we met back at that suite.

I was debating starting Gremlin at that time, and actually, I went to Netflix and did a couple more years because I didn’t feel I was quite ready. But again, it’s been great that Alex has been willing to give some of his time and help a fellow startup founder with some advice and help along the journey. And so I’ve been fortunate to be able to call on him a few times over the years.

Alex: Yeah, yeah. For sure, for sure. I’m always happy to help.

Julie: That’s great that you have your circle of friends that can help you. And also, you know, Kolton, it sounds like you did your tour of duty at Netflix; Alex, you did a tour of duty at Amazon; you, too, Kolton. What are some of the things that you learned?

Alex: Yeah, good question. For me, when I joined Amazon, it was a stint of almost three years from ’05 to ’08, and I would say I learned a ton. Amazon, it was my first job out of school, and Amazon was truly one of the pioneers of DevOps. They had moved to an environment where their architecture was oriented around services, service-oriented architecture, and they were one of the pioneers of doing that, and moving from a monolith, breaking up a monolith into services. And with that, they also changed the way teams organized, generally oriented around full service-ownership, which is, as an engineer, you own one or more services—your team, rather—owns one or more services, and you’re not just writing code, but you’re also testing yourself. There’s no, like, QA team to throw it to. You are doing deploys to production, and when something breaks, you’re also in charge of maintaining the services in production.

And yeah, if something broke, back then we used pagers: the pager would go off, you’d get paged, and then you’d have to get on it quickly and fix the problem. If you didn’t, it would escalate to your boss. So, I learned that was kind of the new way of working. I guess, in my inexperience, I took it for granted a little bit, in retrospect. It made me a better engineer because it evolved me into a better systems thinker. I wasn’t just thinking about code and how to build a feature, but I was also thinking about, like, how does that system need to work and perform and scale in production, and how does it deal with failures in production?

And it also—my time at Amazon served as inspiration for PagerDuty because in starting a startup, the way we thought about the idea of PagerDuty was by thinking back from our time at Amazon—myself and my other two co-founders, Andrew and Baskar—and we thought about what are useful tools or internal tools that existed at Amazon that we wished existed in the broader world? And we thought about, you know, an internal tool that Amazon developed, which was called the ‘Pager Duty Tool’ because it organized the on-call scheduling and paging and it was attached to the incident—to the ticketing system. So, if there was a SEV 1 or SEV 2 ticket, it would actually page either one team—or lots of teams if it was a major incident that impacted revenue and customers and all that good stuff. So yeah, that’s where we got the inspiration for PagerDuty by carrying the pager and seeing that tool exist within Amazon and realizing, hey, Amazon built this, Google has their own version, Facebook has their own version. It seems like there’s a need here. That’s kind of where that initial germ of an idea came from.
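The paging flow Alex describes (a SEV 1 or SEV 2 ticket pages the owning team, a major incident pages many teams, and an unacknowledged page escalates to the boss) can be sketched roughly as follows. This is only an illustration, not how Amazon's internal tool or PagerDuty actually works: the `Team` and `Ticket` types, the rule that a SEV 1 counts as the "major incident" case, and the 15-minute escalation window are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Team:
    name: str
    on_call: str   # engineer currently carrying the pager
    manager: str   # next level if the page goes unacknowledged


@dataclass
class Ticket:
    severity: int                       # 1 is the most severe
    owning_team: str
    impacted_teams: List[str] = field(default_factory=list)


# Assumption: only SEV 1 and SEV 2 tickets page anyone.
PAGING_SEVERITIES = {1, 2}


def teams_to_page(ticket: Ticket, teams: Dict[str, Team]) -> List[Team]:
    """Return the teams that should be paged for this ticket."""
    if ticket.severity not in PAGING_SEVERITIES:
        return []
    names = [ticket.owning_team]
    # Assumption: a SEV 1 is the "major incident" case and fans out
    # to every impacted team as well as the owner.
    if ticket.severity == 1:
        names += ticket.impacted_teams
    # dict.fromkeys removes duplicates while keeping the original order.
    return [teams[name] for name in dict.fromkeys(names)]


def who_to_page(team: Team, minutes_unacknowledged: int,
                escalate_after: int = 15) -> str:
    """Page the on-call engineer first, then the manager once the
    page has gone unacknowledged for `escalate_after` minutes."""
    if minutes_unacknowledged >= escalate_after:
        return team.manager
    return team.on_call
```

With two hypothetical teams, a SEV 2 ticket owned by "orders" pages only that team's on-call engineer, while a SEV 1 that also impacts "payments" pages both teams.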

Kolton: So much overlap. So much similarity. I came, you know, a couple of years behind you. I was at Amazon 2009 to 2013. And I’d had the opportunity to work for a couple of startups out of college and while I was finishing my education, I’d tasted startup world a little bit.

My funny story I tell there is I turned down my first offer from Amazon to go work for a small startup that I thought was going to be a better deal. Turns out, I was bad at math, and a couple of years later, I went back to Amazon and said, “Hey, would you still like me?” And I ended up on the availability team, and so very much in the heart of what Alex is describing. It was a ‘you build it, you own it, you operate it’ environment. Teams were on call, they got paged, and the rationale was, if you felt the pain of that, then you were going to be motivated to go fix it and ensure that you weren’t feeling that pain.

And so really, again, and I agree, somewhat taken for granted that we really learned best-in-class DevOps and system thinking and distributed system principles, by just virtue of being immersed into it and having to solve the problems that we had to solve at Amazon. We also share a similar story in that there was a tool for paging within Amazon that served as a bit of an inspiration for PagerDuty. Similarly, we built a tool—may or may not have been named Gremlin—within Amazon that helped us to go do this exact type of testing. And it was one part tooling and it was one part evangelism. It was a controversial idea, even at Amazon.

Some teams latched on to it quickly, some teams needed some convincing, but we had that opportunity to go work with those teams and really go develop this concept. It was cool because while Netflix—a lot of folks are familiar with Netflix and Chaos Monkey, this was a couple of years before Chaos Monkey came out. And we went and built something similar to what we built at Gremlin: An API, a front end, a variety of failure modes, to really go help solve a wider breadth of problems. I got to then move into performance, and so I worked on making the website fast, making sure that we were optimizing things. Moved into management.

That was a very useful life experience; it wasn’t the most enjoyable year of my life, but I learned a lot, got a lot done. And then that was the next summer, as I was thinking about what was next, I bumped into Alex. I was really starting to think about founding a company, and there was a big question: Was what we built at Amazon going to be applicable to everyone? Was it going to be useful for everyone? Were they ready for it?

And at the time, I really wasn’t sure. And so I decided to go to Netflix. And that was right after Chaos Monkey had come out, and I thought, “Well, let’s go see—let’s go learn a bit more before we’re ready to take this to market.” And because of that time at Amazon—or at Netflix, I got to see, they had a great start. They had a great culture, people were bought into it, but there was still some room for development on the tooling and on the approach.

And I found myself again, half in the developer mindset, half in the advocacy mindset where I needed to go and improve the tooling to make it safer and more scalable and needed to go out and convince folks or help them do it well. But seeing it work at Amazon, that was great. That was a great learning experience. Seeing it work at Amazon and Netflix, to me said, “Okay, this is something that everyone’s going to need at some point, and so let’s go out and take a stab at it.”

Alex: That’s interesting. I didn’t realize that it came from Amazon. I always thought Chaos Engineering as a concept came from Netflix because that’s where everyone’s—I mean, maybe I’m not the only one, but that’s—that was my impression, so that’s interesting.

Kolton: Well, as you know, Amazon, at times, likes to keep things close to the vest, and if you’re not a principal engineer, you’re not really authorized to go talk about what you’ve done. And that actually led to where my opportunity to start a company came from. I was speaking about what I built at Netflix at a conference and I ran into some VCs in the lobby, and we got into a bit of a debate. They were like, “Hey, have you thought about building a company around this?” And I was like, “I have, but I don’t want your money. I’m going to bootstrap it. We’re going to figure it out on our own.” And the debate went back and forth a little bit and ultimately it ended with, “Oh, you have five kids and you live in California? Maybe you should take some money.”

Mandi: So, what ends up being different? Amazon—I’ve never worked for Amazon, so full disclosure, I went from AOL to Chef, and now I’m at PagerDuty. So, but I know what that environment was like, and I remember the early days, PagerDuty you got started around the same time, like, Fastly and Chef and, like, that sort of generation of startups. And all this stuff that sort of emerged from Amazon, like, what kind of mindset do you—is there a change of mindset when you’re talking to developers and engineers that don’t work for Amazon, looking into Amazon from the outside, you kind of feel like there’s a lot more buy-in for those kinds of tools, and that kind of participation, and that kind of—like we said before, the full service-ownership and all of those attitudes and all that cultural pieces that come along with it, so when you’re taking these sort of practices commercial outside of Amazon, what changes? Like, is there a different messaging? Is there a different sort of relationship you have with the developers that work somewhere else?

Alex: I have some thoughts, and it may not be cohesive, but I’m going to go ahead anyway. Well, one thing that was very interesting from Amazon is that by being a pioneer and being at a scale that’s very significant compared to other companies, they had to invent a lot of the tooling themselves because back in mid-2000s, and beyond, there was no Datadog. There was no AWS; they invented AWS. There wasn’t any of these tools, Kubernetes, and so on, that we take for granted around containers, and even virtual servers were a new thing. And Amazon was actually I think, one of the pioneers of adopting that through open-source rather than through, like, a commercial vendor like VMware, which drove the adoption of virtual everything.

So, that’s one observation is they built their own monitoring, they built their own paging systems. They did not build their own ticketing system, but they might as well have because they took Remedy and customized it so much that it’s almost like building your own. And deployment tools, a lot of this tooling, and I’m sure Kolton, having worked on these teams, would know more about the tooling than I did as just an engineer who was using the tooling. But they had to build and invent their own tools. And I think through that process, they ended up culturally adopting a ‘not invented here’ mindset as well, where they’re, generally speaking, not super friendly towards using a vendor versus doing it themselves.

And I think that may make sense and made a lot of sense because they were at such a scale where there was no vendor that was going to meet their needs. But maybe that doesn’t make as much sense anymore, so that’s maybe a good question for debate. I don’t know, Kolton, if you have any thoughts as well.

Kolton: Yeah, a lot of agreement. I think what was needed, we needed to build those things at Amazon because they embraced that distributed systems, the service-oriented architectures early on, that is a new class of problem. I think in a world where you’re not dealing with the complexity of distributed systems, Chaos Engineering just looks like testing. And that’s fine. If you’re in a monolith and it’s more straightforward, great.

But when you have hundreds of things with all the interconnections and the combinatorial explosion you have with that, the old approach no longer works and you have to find something new. It’s funny you mentioned the tooling. I miss Amazon’s monitoring tooling, it was really good. I miss the first iteration of their pipelines, their CI/CD tooling. It was a great iteration.

And I think that’s really—you get to see that need, and that evolution, that iteration, and a bit of a head start. You asked a bit about what is it like taking that to market? I think one of the things that surprised me a little bit, or I had to learn, is different companies are at different points in their journey, and when you’ve worked at Amazon and Netflix, and you think everybody is further along than they are, at times, it can be a little frustrating, or you have to step back and think about how do you catch somebody up? How do you educate them? How do you get them to the point where they can take advantage of it?

And so that’s, you know, that’s really been the learning for me is we know aspirationally where we want to go—and again, it’s not that Amazon’s perfect; it’s not that Netflix is perfect. People that I talk to tend to deify Netflix engineering, and I think they’ve earned a lot of respect, but the sausage is made the same, fundamentally, at every company. And it can be messy at times, and it’s not always—things don’t always go well, but that opportunity to look at what has gone well, what it should look like, what it could look like really helps you understand what you’re striving for with your customers or with the market as a whole.

Alex: I totally agree with that because those are big learnings for me as well. Like, when you come out of an Amazon, you think that maybe a lot of companies are like Amazon, in that they’re… more like I mentioned: Amazon was a pioneer of service-oriented architecture; a pioneer of DevOps; and you build it, you own it; pioneer of adopting virtual servers and virtual hosting. And you, maybe, generalize and think, you know, other companies are there as well, and that’s not true. There’s a wide variety of maturities and these trends, these big trends like Cloud, like AWS, like virtualization, like containerization, they take ten years to fully mature from the starting point. With the usual adopter curve of very early adopters all the way to, kind of, the big part of the curve.

And by virtue of starting PagerDuty in 2009, we were on the early side of the DevOps wave. And I would say, very fortunate to be in the right place at the right time, riding that wave and riding that trend. And we worked with a lot of customers who wanted to modernize, but the biggest challenge there is, perhaps it’s the people and process problem. If you’re already an established company, and you’ve been around for a while you do things a certain way, and change is hard. And you have to get folks to change and adapt and change their jobs, and change from being a, “sysadmin,” quote-unquote, to an SRE, and learn how to code and use that in your job.

So, that change takes a long time, and companies have taken a long time to do it. And the newer companies and startups will get there from day one because they just adopt the newest thing, the latest and greatest, but the big companies take a while.

Kolton: Yeah, it’s both that thing—people can catch up quicker. It’s not that the gap is as large, and when you get to start fresh, you get to pick up a lot of those principles and be further along, but I want to echo the people, the culture, getting folks to change how they’re doing things, that’s something, especially in our world, where we’re asking folks to think about distributed system testing and cross-team collaboration in a different way, and part of that is a mental journey, just helping folks get over the idea—we have to deal with some misconceptions, folks think chaos has to be random, they think it has to be done in production. That’s not the case. There’s ways to do it in dev and staging, there’s ways to do it that aren’t random that are much safer and more deterministic.

But helping folks get over those misconceptions, helping folks understand how to do it and how to do it well, and then how to measure the outcomes. That’s another thing I think we have that’s a bit tougher in our SRE ops world is oftentimes when we do a great job, it’s the absence of something as opposed to an outcome that we can clearly see. And you have to do more work when you’re proving the absence of something than the converse.
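Kolton's point that chaos experiments do not have to be random or run in production can be made concrete with a small sketch. This is not Gremlin's API. It is a hypothetical wrapper, with made-up names such as `call_with_fault` and `run_experiment`, that fails exactly every third call to a dependency and only when the environment is dev or staging, so the same run always produces the same result.

```python
class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def should_inject(call_number: int, every_nth: int) -> bool:
    """Deterministic schedule: fail exactly every Nth call, no randomness."""
    return every_nth > 0 and call_number % every_nth == 0


def call_with_fault(dependency, call_number: int, environment: str,
                    every_nth: int = 3, allowed=("dev", "staging")):
    """Call the dependency, injecting a fault only in allowed environments."""
    if environment in allowed and should_inject(call_number, every_nth):
        raise InjectedFault(f"injected failure on call {call_number}")
    return dependency()


def run_experiment(environment: str, calls: int = 6):
    """Make a fixed number of calls and record whether each one was
    served live or had to fall back after an injected fault."""
    results = []
    for n in range(1, calls + 1):
        try:
            results.append(call_with_fault(lambda: "live", n, environment))
        except InjectedFault:
            # The behaviour under test: does the caller degrade gracefully?
            results.append("fallback")
    return results
```

Running `run_experiment("staging")` exercises the fallback path on calls 3 and 6 every time, while `run_experiment("production")` never injects anything. Because the schedule is fixed, the outcome is repeatable, which helps with the measurement problem Kolton raises.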

Julie: You know, I think it’s interesting, having worked with both of you when I was at PagerDuty and now at Gremlin, there’s a theme. And so we’ve talked a lot about Amazon and Netflix; one of the things, distinctly, with customers at both companies, is I’ve heard, “But we’re not Amazon and we’re not Netflix.” And that can be a barrier for some companies, especially when we talk about this change, and especially when we talk about very rigid organizations, such as, maybe, FinServ, government, those types of organizations, where they’re more resistant to that, and they say, “Don’t say Amazon. Don’t say Netflix. We’re not those companies. We can’t operate like them.”

I mean, Mandi and I, we were on a call with a customer at one point that said we couldn’t use the term DevOps, we had to call it something different because DevOps just meant too forward-thinking, even though we were talking about the same concepts. So, I guess what I would like to hear from both of you, is what advice would you give to those organizations that say, “Oh, no. We can’t be Netflix and we can’t be Amazon?” Because I think that’s just a fear of change conversation. But I’m curious what your thoughts are.

Alex: Yeah. And I can see why folks are allergic to that because you look at these companies, and they’re, in a lot of ways, so far ahead that you don’t, you know—and if you’re a lower level of maturity, for lack of a better word, you can’t see a path in your head of how do you get from where you are today to becoming more like a Netflix or an Amazon because it’s so different. And it requires a lot of thinking differently. So, I think what I would encourage, and I think this is what you all do really well in terms of advocacy, but what I’d encourage is, like, education and thinking about, like, what’s a small step that you can take today to improve things and to improve your maturity? What’s an on-ramp?

And there’s, you know, lots of ideas there. Like, for example, if we’re talking about modern incident management, if we’re talking Chaos Engineering, if we’re talking about public cloud adoption and any of these trends, DevOps, SRE, et cetera, maybe think about how do you—do you have a new greenfield project, a brand new system that you’re spinning up, how do you do that in a modern way while leaving your existing systems alone to start? Then you learn how to do it and how to operate it and how to build a new service, a new microservice using these new technologies, you build that muscle. You maybe hire some folks who have done it before; that’s always a good way to do it. But start with something greenfield, start small, you don’t have to boil the ocean, you don’t have to do everything at once. And that’s really important.

And then create a plan of taking other systems and migrating them. And maybe some systems don’t make sense to migrate at all because they’re just legacy. You don’t want to put any more investment in them. You just want to run them, they work, leave them alone. And yeah, think about a plan like that. And there’s lots of—now, there’s lots of advice and lots of organizations that are ready and willing to help folks think through these plans and think through this modernization journey.

Kolton: Yeah, I agree with that. It’s daunting to folks that there’s a lot, it’s a big problem to solve. And so, you know, it’d be great if it’s you do X, you get Y, you’re done, but that’s not really the world we live in. And so I agree with that wisdom: Start small. Find the place that you can make an impact, show what it looks like for it to be successful.

One thing I’ve found is when you want to drive bottoms-up consensus, people really want to see the proof, they want to see the outcome. And so that opportunity to sit down with a team that is already on the cutting edge, that is feeling the pain, and helping them find success, whether that’s SRE, DevOps, whether it’s Chaos Engineering, helping them see it, see the outcome, see the value, and then let them tell their organization. We all hear from other folks what we should be doing, and there’s a lot of that information, there’s a lot of that context, and some of it’s noise, and so how we cut through that into what’s useful, becomes part of it. This one to me is funny because we hear a lot, “Hey, we have enough chaos already. We don’t need any more chaos.”

And I get it. It’s funny, but it’s my least favorite joke because, number one, if you have a lot of chaos, then actually you need this today. It’s about removing the chaos, not about adding chaos. The other part of it is it speaks to we need to get better before we’re ready to embrace this. And as somebody that works out regularly, a gym analogy comes to mind.

It’s kind of like your New Year’s, it’s your New Year’s resolution and you say, “Hey, I’m going to lose ten pounds before I start going to the gym.” Well, it’s a little bit backwards. If you want to get the outcome, you have to put in a bit of the work. And actually, the best way to learn how to do it is by doing it, by going out getting a little bit of—you know, you can get help, you can get guidance. That’s why we have companies, we’re here to help people and teach them what we’ve learned, but going out doing a bit of it will help you learn how you can do it better, and better understand your own systems.

Alex: Yeah, I like the workout analogy a lot. I think it’s hard to get started, it’s painful at first. That’s why I like the analogy [laugh]—

Kolton: [laugh].

Alex: —a lot. But it’s a muscle that you need to keep practicing, and it’s easy to lose: you stop doing it, it’s gone. And it’s hard to get back again. So yeah, I like that analogy a lot.

Julie: Well, I like that, too, because that’s something that we talked a lot about for being on call, and understanding how to handle incidents, and building that muscle memory, right, practice. And so there’s a lot of crossover—just like this episode, folks—between both Gremlin and PagerDuty as to how they help organizations be better. And again, going back to building a better internet. I mean, Alex your shirt—which our viewers—or our listeners—can’t see, says, “The world is always on. Let’s keep it this way,” and Kolton, you talk about reliability being no accident.

And so when we talk about the foundations of both of these organizations, it’s about helping engineers be better and make better products. And I’m really excited to learn a little bit more about where you think the future of that can go.

For the second part of this episode, check out the PagerDuty podcast at Page it to the Limit. For links to the Page it to the Limit podcast and to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to Break Things on Purpose on Apple Podcasts, Spotify, or wherever you listen to your favorite podcasts.

Jason: Our theme song is called, “Battle of Pogs” by Komiku, and it’s available on loyaltyfreakmusic.com.

[SPLIT]

Mandi: All right, welcome. This week on Page it to the Limit, we have a crossover episode. If you haven’t heard part one of this episode featuring Kolton Andrus and Alex Solomon, you’ll need to find it. It’s on the Break Things on Purpose podcast from our friends at Gremlin. So, you’ll find that at gremlin.com/podcast. You can listen to that episode and then come back and listen to our episode this week as we join the conversation in progress.

Julie: There’s a lot of crossover—just like this episode, folks—between both Gremlin and PagerDuty as to how they help organizations be better. And again, going back to building a better internet. I mean, Alex your shirt—which our viewers—or our listeners—can’t see, says, “The world is always on. Let’s keep it this way,” and Kolton, you talk about reliability being no accident. And so when we talk about the foundations of both of these organizations, it’s about helping engineers be better and make better products. And I’m really excited to learn a little bit more about where you think the future of that can go.

Kolton: You hit it though. Like, the key to me is I’m an engineer by trade. I felt this pain, I saw value in the solution. I love to joke, I’m a lazy engineer. I don’t like getting woken up in the middle of the night, I’d like my system to just work well, but if I can go save some other people that pain, if I can go help them to more quickly understand, or ramp, or have a better on-call life, have a better work-life balance, that’s something we can do that helps the broader market.

And we do that, as you mentioned, in service of a more reliable internet. The world we live in is online, undoubtedly, after the last couple of years, and it’s only going to be more so. And people’s expectations, if you’re an older person like me, you know, maybe you remember downloading AOL for a couple of hours, or when a web page took a minute to load; people’s expectations are much different now. And that’s why the reliability, the performance, making sure things work when we need them to is critical.

Alex: Absolutely. And I think there’s also a trend that I see and that we’re part of around automation. And automation is a very broad thing, there’s lots of ways that you want to automate manual things, including CI/CD and automated testing and things like that, but I also think about automation in the incident context, like when you have an alert that fires off or you have an incident, you have something like that, can you automate the solution or actually even prevent that alert from going off in the first place by creating a set of little robots that are kind of floating around your system and keeping things running and running well and running reliably? So, I think that’s an exciting trend for us.

Mandi: Oh, definitely on board with automating all the things for sure. So, of the things that you’ve learned, what’s one thing that you wish you had maybe learned earlier? Or if there was like a gem or a nugget for folks that might be thinking about starting their own company around developer tools or this kind of software, is there anything that you can share with them?

Alex: Kolton, you want to go first?

Kolton: Sure, I’ll go first. I was thinking a little bit about this. If I went back—we’ve only been at it about six years, so Alex has the ten-year version. I can give you the five, six-year version. You know, I think coming into it as a technical founder, you have a lot of thoughts about how the world works that you learn are incorrect or incomplete.

It’s easy as an engineer to think that sales is this dirty organization that’s only focused on money, and that’s just not true or fair. They do a lot of hard work. Getting people to do the right thing is tough. Helping with support, with customer success.

Even marketing. Marketing is, you know, to many engineers, not what they would spend their time doing, and yet marketing has really changed in the last 20 years. And so much of marketing now is about sharing information and teaching what we’ve learned as opposed to this old approach of you know, whatever you watched on TV as a kid. So, I think understanding the broader business is important. Understanding the value you’re providing to customers, understanding the relationships you build with those customers and the community as a whole, those are pieces that might be easy to gloss over as an engineer.

Alex: Yeah, and to echo that, I like your point on sales because initially when I first started PagerDuty, I didn’t believe in sales. I thought we wouldn’t need to hire any salespeople. Like, we sell to other engineers, and if they’re anything like me, they don’t want to talk to a salesperson. They want to go on the website, look around, learn, maybe try it out—we had a free trial; we still have a free trial—and put in a credit card and off to the races. And that’s what we did at first, but then it turns out that when doing so, and bringing in customers in that way, there are folks who want to talk to you to make sure that, first of all, you’re a real business, you’re going to be around for a while and it’s not—you know, you’re not going to not be around tomorrow.

And that builds trust being able to talk to someone, to understand, if you have questions, you have someone to ask, and creating that human connection. And I found myself doing that function, like, myself and then realized, there’s not enough time in the day to do this, so I need to hire some folks. And I changed my mind about sales and hired our first two salespeople about two-and-a-half years into PagerDuty. And probably got a little bit lucky because they’re technical engineering background type folks who then went into sales, so they ended up being rockstars. And we instantly saw an increase in revenue with that.

And then maybe another more tactical piece of advice is that you can’t focus on culture too early when starting a company. And so one lesson that we learned the hard way is we hired an engineer that was brilliant, and really smart, but not the best culture fit in terms of, like, working well with others and creating that harmonious team dynamic with their peers. That ended up being an issue. And basically, the takeaway there is don’t hire brilliant but asshole folks because it’s just going to cause a lot of pain, and they’re not going to work out even though they’re really smart, and that’s kind of the reason why you keep them around because you think, well, it’s so hard to hire folks. You can’t let this person go because what are we going to do? But you do have to do it because it’s going to blow up anyways, and it’s going to be worse in the long run.

Kolton: Yeah, hiring and recruiting have their own set of challenges associated with them. And similar to hiring the brilliant jerk, some of the folks that you hire early on aren’t going to be the folks that you have at the end. And that one’s always tough. These are your friends, these are people you work closely with, and as the company grows, and as things change, people’s roles change, and sometimes people choose to leave and that breaks your heart because you’ve invested a lot of time and effort into that relationship. Sometimes you have to break their heart and tell them it’s not the right fit, or things change.

And that’s one that if you’re a founder or you’re part of that early team, you’re going to feel a little bit more than everyone else. I don’t think anything you read on the internet can prepare you for some of those difficult conversations you have to have. And it’s great if everything goes well, and everyone grows at the same rate, everyone can be promoted, and you can have the same team at the end, but that’s not really how things play out in reality.

Julie: It’s interesting that we’re talking about culture, as we heard about last week, on the Break Things on Purpose episode, where we also talked about culture and how organizations struggle with the culture shift with adopting new technologies, new ways of working, new tools. And so what I’m hearing from you is focusing on that when hiring and founding your company is important. We also heard about how that’s important with changing the way that we work. So, if you could give advice to maybe a very established—if you are going to give a piece of advice to Amazon—maybe not Amazon, but an established company—on how to overcome some of those objections to culture change, those fears of adopting new technology. I know people are still afraid of holding a pager and being on call, and I know other people are afraid of chaos as we talk about it and those fears that you’ve mentioned before, Kolton. What would your piece of advice be?

Alex: Yeah, good—great question. This will probably echo what I’ve said earlier, which is when looking to transform, transform culture especially, and people and process, the way I think about it is try to not boil the ocean and start small, and get some early wins. And learn what good looks like. I think that’s really important. It’s this concept of show, don’t tell.

Like, if you want to, you know, you want to change something, you start at the grassroots level, you start small, you start maybe with one or two teams, you try it out, maybe something like I mentioned before, in a greenfield context where you’re doing something brand new and you’re not shackled by legacy systems or anything like that, then you can build something new or that new system using the new technologies that we’re talking about here, whether it’s public cloud, whether it’s containerization and Kubernetes, or whatnot, or serverless, potentially. And as you build it and you learn how to build it and how to operate it, you share those learnings and you start evangelizing within the company.

And that goes to what I was saying with the show, don’t tell, where you’re, like, showing, “Here’s what we did and here’s what we learned. And not everything went swimmingly and here are things that didn’t go so well, and maybe what’s our next step beyond this? Do other folks want to opt in to this kind of new thing that we’re doing?” And I’m sure that’s a good way to get others excited. And if you’re thinking about longer-term, like, how do you transform the entire company, well, this is a good way to start: you start small, you learn how to do it, you learn about what good looks like, you get others excited about it, others opt in, and then at some point through that journey, you start mandating it top-down as well because grassroots is only going to take you so far. And then that’s where you start putting together project plans around, like, how do we get other teams to do it, on a timeline? And when are they going to do it? And how are they going to do it? And then bring everyone along for the journey as well.

Kolton: You’re making this easy for me. I’ll just keep agreeing with you. You hit all the points. Yeah, I mean, on one hand, the engineer in me says, you know, a lot of times when we’re talking about this transformation, it’s not easy, but it’s worth it. There’s a need that we’re trying to solve, there’s a problem we’re trying to solve.

And then in the end, what that becomes is a competitive advantage. The thought that came to mind as Alex was speaking is you need that bottom-up buy-in; you also need that top-down support. And as engineers, we don’t often think about the business impact of what we do. There’s an important element and a message I like to reiterate for all the engineers: think about how the business would value the work you do. Think about how you would quantify the value of the work you do to the business because that’s going to help that upper level, which isn’t feeling the pain day-to-day, understand that what we’re doing is important, and it’s important for the organization.

I think about this a little bit like remote-by-default work. So, when we founded Gremlin, we decided, you know, we didn’t want offices. And six years ago that was a little bit exceptional. Folks were still fundamentally working in an office environment. I’m not here to tell you that remote-by-default is easy, works for everyone, or is the answer.

Actually, what we found is you need a little bit of both. You need to be able to have good tooling so folks can be efficient and effective in their work, but it’s still important to get folks together in person. And magic happens when you get a group of folks in a room and let them brainstorm and collaborate, chat on the way to lunch or on the way to dinner. But I think that’s a good example where we’ve learned over the last couple of years that the old way of doing it was not as effective as it could be. That maybe we don’t need to swing the pendulum entirely the other way, but there’s merit in looking at what the right balance is.

And I think that applies to, you know, incident management, to SRE, to Chaos Engineering. You know, maybe we don’t have to go entirely on the other end of the spectrum for everyone, but are there little—you know, is there an 80/20 solution that gets us a lot of value, that saves a lot of time, that makes us more efficient and effective, without having to rewrite everything from scratch?

Alex: Yeah, I like that a lot. And I think part of it, just to add to that, is make it easy for people to adopt it, too. Like, if you can automate it for folks, “Hey, here’s a Terraform thing where you could just hit a button and it does it for you, here’s some training around how to leverage it, and here’s the easy button for you to adopt.” I think that goes with the technology of adopting, but also the training, also the, you know, how-tos and learnings. That way, it’s not going to be, like, a big painful thing, you can plan for it. And yeah, it’s off to the races from there.

Kolton: I think that’s prudent product advice, as well. Make it easy for people to do the right thing. And I’m sure it’s tricky in your space; it’s really tricky in our space. We’re going out and we’re causing failure, and there’s inadvertent side effects, and you need to understand what’s happening. It’s a little scary, but that’s where we add a lot of value.

We invest a lot of time and effort in how do we make it easy to understand, easy to understand what to expect, and easy to go do and see what happens and see that value? And it sounds easy. You know, “Hey, just make it easy. Just make it simple,” but actually, as we know, it takes so much more effort and work to get it to be that level of simplicity.

Alex: Yeah, making something easy is very, very hard—

Kolton: [laugh].

Julie: —ironically.

Kolton: Yeah. Ironically.

Mandy: Yeah, so what are you excited for in the future? What’s on your horizon that maybe you can share with us that isn’t too, like, top-secret or anything? Or even stuff, maybe, not related to your companies? Like, what are you seeing in the industry that really has you motivated and excited?

Alex: Great question. I think a couple of things come to mind. I already mentioned automation, and we are in the automation space in a couple of different ways, in that we acquired a company called Rundeck over a year ago now, which does runbook automation and just automation in general around something like running a script across a variety of resources. And in the incident context, if an alert fires or an incident fires, it’s that self-healing aspect where you can actually resolve the issue without bothering a human.

There’s two modes to this automation: There’s the kind of full self-healing mode where, you know, something happens and the script just fixes it. And then the second mode is a human is involved, they get paged, and they have a toolbox of things that they can do, that they can easily do. We call that the Iron Man mode, where you’re getting, like, these buttons you can push to actually resolve the problem, but in that case, it’s a type of problem that does require a person to look at it and realize, oh, we should take this action to fix it. So, I’m very excited about the automation and continuing down that path.

And then the other thing that really excites me as well is being able to apply AI and ML to the alerting and incident response and incident management space. Especially our pattern detection, looking for patterns in alerts and incidents, and seeing have we seen this kind of problem before? If so, what happened last time? Who worked on it last time? How did they resolve it last time?

Because, you know, you don’t want to solve the same problems over and over. And that actually ties into automation really nicely as well. That pattern detection, it’s around reducing noise, like, these alerts are not real alerts, they’re false alerts, so let’s reduce them automatically, let’s suppress them, let’s filter them out automatically because the signal to noise is really important. And it’s that pattern detection, so if something major is happening, you can see here’s the blast radius, here’s the services or systems it’s impacting. Oh, we’ve seen something similar before—or we haven’t seen something similar before, it’s something totally brand new—and try to get the right folks involved quickly so that they can understand that blast radius and know how to approach the problem, and resolve it quickly.

Kolton: So, it’s not NFTs as your PagerDuty profile picture?

Alex: [laugh].

Kolton: Because that’s, kind of, what I—no, I’m kidding. I couldn’t help but just like what do I not see—like, I’ve, I’ve tried to think of the best NFT joke I could. That was what I came up with. I agree on the AI/ML stuff. That opportunity to have more data and to be able to do better analysis of it, I’ve written some of that, you know, anomaly detection stuff—and it was a while back; I’m sure it could be done better—that’ll get us to a point.

You know, of course, I’m here to push on the proactive. There’s things we can do beyond just reacting faster that will be helpful. But I think part of that comes from people being comfortable sharing more about their failures. It’s a stigma to fail today, and regardless of whether we’re talking about a world where we’re citing things like blameless postmortems, people still don’t want to talk about their failures, and it’s hard to get that good outage information, it’s hard to get the kind of detail that would let us do better analytics, better automation.

And again, back to the conversation, you know, maybe we know what Amazon and Netflix look like, but for us to create something that will help solve a broader problem, we have to know what pain those companies are feeling; we need to know what troubles they’re hitting. So, I think that’s one thing I’ve been excited about is over the past two years, you’ve seen the focus on reliable, stable systems be much more important. Five years ago, it was, “Get out of my way, I got features to write, we got money to make, we’re not interested in that. If it breaks, we’ll fix it.” And you know, as we’re looking at the future, we’re looking at our bridges, we’re looking at our infrastructure, our transportation, the software we’re writing is going to be critical to the world, and it operating correctly and reliably is going to be critical. And I think what we’ll see is the market and customers are going to catch up to that; that tolerance for failure is going to go down and that willingness to invest in preventing failure is going to go up.

Alex: Yeah, I totally agree with that. One thing I would add is, I think it’s human nature that people don’t want to talk about failures. And this is maybe not going to go away, but there is maybe a middle ground there. I mean, talking about postmortems, especially, like, when a big company has a big outage and it makes the news, it makes Hacker News, et cetera, et cetera, I don’t see that changing, in that companies are going to become radically more transparent, but where I do think there is a middle ground is for your large customers, for your important customers, creating relationships with them and having more transparency in those cases. Maybe you don’t post a full, detailed, nitty-gritty postmortem on a public status page, but what you do do is you talk to your major customers, your important customers, and you give them that deeper view into your systems.

And what’s good about that is that it creates trust, it helps establish and maintain trust when you’re more transparent about problems, especially when you’re taking steps to fix them. And that piece is really important. I mean, trust is, like, at the core of what we do. I have a saying about this—[unintelligible 00:19:31]—but, “Trust is won in droplets and lost in buckets.” So, if you have these outages all the time, or you have major service degradation, it’s easy to lose that trust. So, you want to prevent those, you want to catch them early, you want to create that transparency with your major customers, and you want to keep them in the loop on what’s happening and how you’re preventing these types of issues going forward.

Kolton: Yeah, great thoughts. Totally agree.

Julie: So, for this episode of deep thoughts with Kolton and Alex, [laugh] I want to thank both of you for being here with Mandy and me today. We’re really excited to hear more and to see each of our respective companies grow and change the way people work and make life easier, not just for engineers, but for our customers and everybody that depends on us.

Mandy: Yeah, absolutely. I think it’s good for folks out there to know, you’re not alone. We’re all learning this stuff together. And some folks are a little further down the path, and we’re here to help you learn.

Kolton: Totally. Totally, it’s an opportunity for us to share. Those that are further along can share what they’ve learned; those that are new have some great ideas and suggestions and enthusiasm, and by working together, we all benefit. This is the two plus two equals five, where, by getting together and sharing what we’ve learned and figuring out the best way, no one of us is going to be able to do it, but as a group, we can do it better.

Alex: Yeah. Totally agree. That’s a great closing thought.

Mandy: Well, thanks, folks. Thank you for joining us for another episode of Page it to the Limit. We’re wishing you an uneventful day.

What is Break Things on Purpose?

A podcast about site reliability engineering (SRE); Chaos Engineering; and the people, processes, and tools used to build resilient systems. Sponsored by Gremlin. Find us on Twitter at @BTOPpod.

Kolton: I was speaking about what I built at Netflix at a conference and I ran into some VCs in the lobby, and we got into a bit of a debate. They were like, “Hey, have you thought about building a company around this?” And I was like, “I have, but I don’t want your money. I’m going to bootstrap it. We’re going to figure it out on our own.” And the debate went back and forth a little bit and ultimately it ended with, “Oh, you have five kids and you live in California? Maybe you should take some money.”

Julie: Welcome to the Break Things on Purpose podcast, a show about chaos, culture, building and breaking things with intention. I’m Julie Gunderson and in this episode, we have Alex Solomon, co-founder of PagerDuty, and Kolton Andrus, co-founder of Gremlin, chatting about everything from founding companies to how to change culture in organizations.

Julie: Hey everybody. Today we’re going to talk about building awesome things with two amazing company co-founders. I’m really excited to be here with Mandy Walls on this crossover episode for Break Things on Purpose and Page it to the Limit. I am Julie Gunderson, Senior Reliability Advocate here over at Gremlin. Mandy?

Mandy: Yeah, I’m Mandy Walls, DevOps Advocate at PagerDuty.

Julie: Excellent. And today we’re going to be talking about everything from reliability, incident management, to building a better internet. Really excited to talk about that. We’re joined by Kolton Andrus, co-founder of Gremlin, and Alex Solomon, co-founder of PagerDuty. So, to get us started, Kolton and Alex, you two have known each other for a little while. Can you kick us off with maybe how you know each other?

Alex: Sure. And thanks for having us on the podcast. So, I think if I remember correctly, I’ve known you, Kolton, since your days at Netflix while PagerDuty was a young startup, maybe less than 20 people. Is that right?

Kolton: Just a touch before I joined Netflix. It was actually at that Velocity Conference; we hung out at that suite. I think that was 2013.

Alex: Yeah, sounds right. That sounds right. And yeah, it’s been how many years? Eight, nine years since? Yeah.

Kolton: Yeah. Alex is being humble. He’s let me bother him for advice a few times along the journey. And we talked about what it was like to start companies. You know, he was in the startup world; I was still in the corporate world when we met back at that suite.

I was debating starting Gremlin at that time, and actually, I went to Netflix and did a couple more years because I didn’t feel I was quite ready. But again, it’s been great that Alex has been willing to give some of his time and help a fellow startup founder with some advice and help along the journey. And so I’ve been fortunate to be able to call on him a few times over the years.

Alex: Yeah, yeah. For sure, for sure. I’m always happy to help.

Julie: That’s great that you have your circle of friends that can help you. And also, you know, Kolton, it sounds like you did your tour of duty at Netflix; Alex, you did a tour of duty at Amazon; you, too, Kolton. What are some of the things that you learned?

Alex: Yeah, good question. For me, when I joined Amazon, it was a stint of almost three years from ’05 to ’08, and I would say I learned a ton. Amazon, it was my first job out of school, and Amazon was truly one of the pioneers of DevOps. They had moved to an environment where their architecture was oriented around services, service-oriented architecture, and they were one of the pioneers of doing that, and moving from a monolith, breaking up a monolith into services. And with that, they also changed the way teams organized, generally oriented around full service-ownership, which is, as an engineer, you own one or more services—your team, rather, owns one or more services—and you’re not just writing code, but you’re also testing it yourself. There’s no, like, QA team to throw it to. You are doing deploys to production, and when something breaks, you’re also in charge of maintaining the services in production.

And yeah, if something broke (back then we used pagers), the pager would go off, you’d get paged, and then you’d have to get on it quickly and fix the problem. If you didn’t, it would escalate to your boss. So, I learned that was kind of the new way of working. I guess, in my inexperience, I took it for granted a little bit, in retrospect. It made me a better engineer because it evolved me into a better systems thinker. I wasn’t just thinking about code and how to build a feature, but I was also thinking about, like, how does that system need to work and perform and scale in production, and how does it deal with failures in production?

And it also—my time at Amazon served as inspiration for PagerDuty because in starting a startup, the way we thought about the idea of PagerDuty was by thinking back from our time at Amazon—myself and my other two co-founders, Andrew and Baskar—and we thought about what are useful tools or internal tools that existed at Amazon that we wished existed in the broader world? And we thought about, you know, an internal tool that Amazon developed, which was called the ‘Pager Duty Tool’ because it organized the on-call scheduling and paging and it was attached to the incident—to the ticketing system. So, if there was a SEV 1 or SEV 2 ticket, it would actually page either one team—or lots of teams if it was a major incident that impacted revenue and customers and all that good stuff. So yeah, that’s where we got the inspiration for PagerDuty by carrying the pager and seeing that tool exist within Amazon and realizing, hey, Amazon built this, Google has their own version, Facebook has their own version. It seems like there’s a need here. That’s kind of where that initial germ of an idea came from.

Kolton: So much overlap. So much similarity. I came, you know, a couple of years behind you. I was at Amazon 2009 to 2013. And I’d had the opportunity to work for a couple of startups out of college and while I was finishing my education, I’d tasted startup world a little bit.

My funny story I tell there is I turned down my first offer from Amazon to go work for a small startup that I thought was going to be a better deal. Turns out, I was bad at math, and a couple of years later, I went back to Amazon and said, “Hey, would you still like me?” And I ended up on the availability team, and so very much in the heart of what Alex is describing. It was a ‘you build it, you own it, you operate it’ environment. Teams were on call, they got paged, and the rationale was, if you felt the pain of that, then you were going to be motivated to go fix it and ensure that you weren’t feeling that pain.

And so really, again, and I agree, somewhat taken for granted that we really learned best-in-class DevOps and system thinking and distributed system principles, by just virtue of being immersed into it and having to solve the problems that we had to solve at Amazon. We also share a similar story in that there was a tool for paging within Amazon that served as a bit of an inspiration for PagerDuty. Similarly, we built a tool—may or may not have been named Gremlin—within Amazon that helped us to go do this exact type of testing. And it was one part tooling and it was one part evangelism. It was a controversial idea, even at Amazon.

Some teams latched on to it quickly, some teams needed some convincing, but we had that opportunity to go work with those teams and really go develop this concept. It was cool because while Netflix—a lot of folks are familiar with Netflix and Chaos Monkey, this was a couple of years before Chaos Monkey came out. And we went and built something similar to what we built at Gremlin: An API, a front end, a variety of failure modes, to really go help solve a wider breadth of problems. I got to then move into performance, and so I worked on making the website fast, making sure that we were optimizing things. Moved into management.

That was a very useful life experience; it wasn’t the most enjoyable year of my life, but I learned a lot, got a lot done. And then that was the next summer, as I was thinking about what was next, I bumped into Alex. I was really starting to think about founding a company, and there was a big question: Was what we built at Amazon going to be applicable to everyone? Was it going to be useful for everyone? Were they ready for it?

And at the time, I really wasn’t sure. And so I decided to go to Netflix. And that was right after Chaos Monkey had come out, and I thought, “Well, let’s go see—let’s go learn a bit more before we’re ready to take this to market.” And because of that time at Amazon—or at Netflix, I got to see, they had a great start. They had a great culture, people were bought into it, but there was still some room for development on the tooling and on the approach.

And I found myself again, half in the developer mindset, half in the advocacy mindset, where I needed to go and improve the tooling to make it safer and more scalable, and needed to go out and convince folks or help them do it well. But seeing it work at Amazon, that was great. That was a great learning experience. Seeing it work at Amazon and Netflix, to me, said, “Okay, this is something that everyone’s going to need at some point, and so let’s go out and take a stab at it.”

Alex: That’s interesting. I didn’t realize that it came from Amazon. I always thought Chaos Engineering as a concept came from Netflix because that’s where everyone’s—I mean, maybe I’m not the only one, but that’s—that was my impression, so that’s interesting.

Kolton: Well, as you know, Amazon, at times, likes to keep things close to the vest, and if you’re not a principal engineer, you’re not really authorized to go talk about what you’ve done. And that actually led to where my opportunity to start a company came from. I was speaking about what I built at Netflix at a conference and I ran into some VCs in the lobby, and we got into a bit of a debate. They were like, “Hey, have you thought about building a company around this?” And I was like, “I have, but I don’t want your money. I’m going to bootstrap it. We’re going to figure it out on our own.” And the debate went back and forth a little bit and ultimately it ended with, “Oh, you have five kids and you live in California? Maybe you should take some money.”

Mandi: So, what ends up being different? Amazon—I’ve never worked for Amazon, so full disclosure, I went from AOL to Chef, and now I’m at PagerDuty. So, but I know what that environment was like, and I remember the early days; PagerDuty got started around the same time as, like, Fastly and Chef and, like, that sort of generation of startups. And all this stuff that sort of emerged from Amazon, like, what kind of mindset do you—is there a change of mindset when you’re talking to developers and engineers that don’t work for Amazon? Looking into Amazon from the outside, you kind of feel like there’s a lot more buy-in for those kinds of tools, and that kind of participation, and that kind of—like we said before, the full service-ownership and all of those attitudes and all those cultural pieces that come along with it. So when you’re taking these sort of practices commercial outside of Amazon, what changes? Like, is there a different messaging? Is there a different sort of relationship you have with the developers that work somewhere else?

Alex: I have some thoughts, and they may not be cohesive, but I’m going to go ahead anyway. Well, one thing that was very interesting from Amazon is that by being a pioneer and being at a scale that’s very significant compared to other companies, they had to invent a lot of the tooling themselves because back in the mid-2000s, and beyond, there was no Datadog. There was no AWS; they invented AWS. There weren’t any of these tools, Kubernetes, and so on, that we take for granted around containers, and even virtual servers were a new thing. And Amazon was actually, I think, one of the pioneers of adopting that through open-source rather than through, like, a commercial vendor like VMware, which drove the adoption of virtual everything.

So, that’s one observation is they built their own monitoring, they built their own paging systems. They did not build their own ticketing system, but they might as well have because they took Remedy and customized it so much that it’s almost like building your own. And deployment tools, a lot of this tooling, and I’m sure Kolton, having worked on these teams, would know more about the tooling than I did as just an engineer who was using the tooling. But they had to build and invent their own tools. And I think through that process, they ended up culturally adopting a ‘not invented here’ mindset as well, where they’re, generally speaking, not super friendly towards using a vendor versus doing it themselves.

And I think that may make sense and made a lot of sense because they were at such a scale where there was no vendor that was going to meet their needs. But maybe that doesn’t make as much sense anymore, so that’s maybe a good question for debate. I don’t know, Kolton, if you have any thoughts as well.

Kolton: Yeah, a lot of agreement. I think what was needed, we needed to build those things at Amazon because they embraced that distributed systems, the service-oriented architectures early on, that is a new class of problem. I think in a world where you’re not dealing with the complexity of distributed systems, Chaos Engineering just looks like testing. And that’s fine. If you’re in a monolith and it’s more straightforward, great.

But when you have hundreds of things with all the interconnections and the combinatorial explosion you have with that, the old approach no longer works and you have to find something new. It’s funny you mentioned the tooling. I miss Amazon’s monitoring tooling, it was really good. I miss the first iteration of their pipelines, their CI/CD tooling. It was a great iteration.

And I think that’s really—you get to see that need, and that evolution, that iteration, and a bit of a head start. You asked a bit about what is it like taking that to market? I think one of the things that surprised me a little bit, or I had to learn, is different companies are at different points in their journey, and when you’ve worked at Amazon and Netflix, and you think everybody is further along than they are, at times, it can be a little frustrating, or you have to step back and think about how do you catch somebody up? How do you educate them? How do you get them to the point where they can take advantage of it?

And so that’s, you know, that’s really been the learning for me is we know aspirationally where we want to go—and again, it’s not that Amazon’s perfect; it’s not that Netflix is perfect. People that I talk to tend to deify Netflix engineering, and I think they’ve earned a lot of respect, but the sausage is made the same, fundamentally, at every company. And it can be messy at times, and it’s not always—things don’t always go well, but that opportunity to look at what has gone well, what it should look like, what it could look like really helps you understand what you’re striving for with your customers or with the market as a whole.

Alex: I totally agree with that because that was a big learning for me as well. Like, when you come out of an Amazon, you think that maybe a lot of companies are like Amazon, in that they’re… more like I mentioned: Amazon was a pioneer of service-oriented architecture; a pioneer of DevOps; and you build it, you own it; pioneer of adopting virtual servers and virtual hosting. And you, maybe, generalize and think, you know, other companies are there as well, and that’s not true. There’s a wide variety of maturities and these trends, these big trends like Cloud, like AWS, like virtualization, like containerization, they take ten years to fully mature from the starting point, with the usual adopter curve of very early adopters all the way to, kind of, the big part of the curve.

And by virtue of starting PagerDuty in 2009, we were on the early side of the DevOps wave. And I would say, very fortunate to be in the right place at the right time, riding that wave and riding that trend. And we worked with a lot of customers who wanted to modernize, but the biggest challenge there is, perhaps it’s the people and process problem. If you’re already an established company, and you’ve been around for a while, you do things a certain way, and change is hard. And you have to get folks to change and adapt and change their jobs, and change from being a, “sysadmin,” quote-unquote, to an SRE, and learn how to code and use that in your job.

So, that change takes a long time, and companies have taken a long time to do it. And the newer companies and startups will get there from day one because they just adopt the newest thing, the latest and greatest, but the big companies take a while.

Kolton: Yeah, it’s both that thing—people can catch up quicker. It’s not that the gap is as large, and when you get to start fresh, you get to pick up a lot of those principles and be further along, but I want to echo the people, the culture, getting folks to change how they’re doing things, that’s something, especially in our world, where we’re asking folks to think about distributed system testing and cross-team collaboration in a different way, and part of that is a mental journey, just helping folks get over the idea—we have to deal with some misconceptions, folks think chaos has to be random, they think it has to be done in production. That’s not the case. There’s ways to do it in dev and staging, there’s ways to do it that aren’t random that are much safer and more deterministic.

But helping folks get over those misconceptions, helping folks understand how to do it and how to do it well, and then how to measure the outcomes. That’s another thing I think we have that’s a bit tougher in our SRE ops world is oftentimes when we do a great job, it’s the absence of something as opposed to an outcome that we can clearly see. And you have to do more work when you’re proving the absence of something than the converse.
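[Editor’s sketch, not part of the conversation: Kolton’s point that chaos experiments need not be random or run in production can be illustrated with a small Python example. Everything here is hypothetical and is not Gremlin’s API: the `Experiment` fields, the `ALLOWED_ENVIRONMENTS` guard, and the `inject` and `is_healthy` callbacks are assumptions made for illustration. The idea is an explicit target list, a fixed order, an environment check, and a halt as soon as the system looks unhealthy.]

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption for illustration: only non-production environments are allowed.
ALLOWED_ENVIRONMENTS = {"dev", "staging"}


@dataclass(frozen=True)
class Experiment:
    name: str
    environment: str    # e.g. "dev", "staging", "production"
    targets: List[str]  # explicit targets; nothing is picked at random


def run_experiment(
    experiment: Experiment,
    inject: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> dict:
    """Inject a fault into each target in order, halting if health degrades."""
    if experiment.environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"refusing to run in {experiment.environment!r}")

    injected: List[str] = []
    for target in experiment.targets:  # fixed order makes the run repeatable
        if not is_healthy():
            break  # stop early instead of widening the impact
        inject(target)
        injected.append(target)

    return {"injected": injected, "healthy_at_end": is_healthy()}


# Example run with stand-in callbacks that only record what would happen.
calls: List[str] = []
result = run_experiment(
    Experiment("cache-latency", "staging", ["cache-1", "cache-2"]),
    inject=calls.append,
    is_healthy=lambda: True,
)
```

[Because the targets and their order are fixed, the same experiment can be re-run after a fix to confirm the outcome changed, which also gives a measurable result to point to.]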

Julie: You know, I think it’s interesting, having worked with both of you when I was at PagerDuty and now at Gremlin, there’s a theme. And so we’ve talked a lot about Amazon and Netflix; one of the things, distinctly, with customers at both companies, is I’ve heard, “But we’re not Amazon and we’re not Netflix.” And that can be a barrier for some companies, especially when we talk about this change, and especially when we talk about very rigid organizations, such as, maybe, FinServ, government, those types of organizations, where they’re more resistant to that, and they say, “Don’t say Amazon. Don’t say Netflix. We’re not those companies. We can’t operate like them.”

I mean, Mandy and I, we were on a call with a customer at one point that said we couldn’t use the term DevOps, we had to call it something different because DevOps just meant too forward-thinking, even though we were talking about the same concepts. So, I guess what I would like to hear from both of you, is what advice would you give to those organizations that say, “Oh, no. We can’t be Netflix and we can’t be Amazon?” Because I think that’s just a fear of change conversation. But I’m curious what your thoughts are.

Alex: Yeah. And I can see why folks are allergic to that because you look at these companies, and they’re, in a lot of ways, so far ahead that you don’t, you know—and if you’re a lower level of maturity, for lack of a better word, you can’t see a path in your head of how do you get from where you are today to becoming more like a Netflix or an Amazon because it’s so different. And it requires a lot of thinking differently. So, I think what I would encourage, and I think this is what you all do really well in terms of advocacy, but what I’d encourage is, like, education and thinking about, like, what’s a small step that you can take today to improve things and to improve your maturity? What’s an on-ramp?

And there’s, you know, lots of ideas there. Like, for example, if we’re talking about modern incident management, if we’re talking Chaos Engineering, if we’re talking about public cloud adoption and any of these trends, DevOps, SRE, et cetera, maybe think about how do you—do you have a new greenfield project, a brand new system that you’re spinning up, how do you do that in a modern way while leaving your existing systems alone to start? Then you learn how to do it and how to operate it and how to build a new service, a new microservice using these new technologies, you build that muscle. You maybe hire some folks who have done it before; that’s always a good way to do it. But start with something greenfield, start small, you don’t have to boil the ocean, you don’t have to do everything at once. And that’s really important.

And then create a plan of taking other systems and migrating them. And maybe some systems don’t make sense to migrate at all because they’re just legacy. You don’t want to put any more investment in them. You just want to run them, they work, leave them alone. And yeah, think about a plan like that. And there’s lots of—now, there’s lots of advice and lots of organizations that are ready and willing to help folks think through these plans and think through this modernization journey.

Kolton: Yeah, I agree with that. It’s daunting to folks that there’s a lot, it’s a big problem to solve. And so, you know, it’d be great if it’s you do X, you get Y, you’re done, but that’s not really the world we live in. And so I agree with that wisdom: Start small. Find the place that you can make an impact, show what it looks like for it to be successful.

One thing I’ve found is when you want to drive bottoms-up consensus, people really want to see the proof, they want to see the outcome. And so that opportunity to sit down with a team that is already on the cutting edge, that is feeling the pain, and helping them find success, whether that’s SRE, DevOps, whether it’s Chaos Engineering, helping them see it, see the outcome, see the value, and then let them tell their organization. We all hear from other folks what we should be doing, and there’s a lot of that information, there’s a lot of that context, and some of it’s noise, and so how we cut through that into what’s useful, becomes part of it. This one to me is funny because we hear a lot, “Hey, we have enough chaos already. We don’t need any more chaos.”

And I get it. It’s funny, but it’s my least favorite joke because, number one, if you have a lot of chaos, then actually you need this today. It’s about removing the chaos, not about adding chaos. The other part of it is it speaks to we need to get better before we’re ready to embrace this. And as somebody that works out regularly, a gym analogy comes to mind.

It’s kind of like your New Year’s, it’s your New Year’s resolution and you say, “Hey, I’m going to lose ten pounds before I start going to the gym.” Well, it’s a little bit backwards. If you want to get the outcome, you have to put in a bit of the work. And actually, the best way to learn how to do it is by doing it, by going out getting a little bit of—you know, you can get help, you can get guidance. That’s why we have companies, we’re here to help people and teach them what we’ve learned, but going out doing a bit of it will help you learn how you can do it better, and better understand your own systems.

Alex: Yeah, I like the workout analogy a lot. I think it’s hard to get started, it’s painful at first. That’s why I like the analogy [laugh]—

Kolton: [laugh].

Alex: —a lot. But it’s a muscle that you need to keep practicing, and it’s easy to lose: you stop doing it, it’s gone. And it’s hard to get back again. So yeah, I like that analogy a lot.

Julie: Well, I like that, too, because that’s something that we talked a lot about for being on call, and understanding how to handle incidents, and building that muscle memory, right, practice. And so there’s a lot of crossover—just like this episode, folks—between both Gremlin and PagerDuty as to how they help organizations be better. And again, going back to building a better internet. I mean, Alex your shirt—which our viewers—or our listeners—can’t see, says, “The world is always on. Let’s keep it this way,” and Kolton, you talk about reliability being no accident.

And so when we talk about the foundations of both of these organizations, it’s about helping engineers be better and make better products. And I’m really excited to learn a little bit more about where you think the future of that can go.

For the second part of this episode, check out the PagerDuty podcast at Page it to the Limit. For links to the Page it to the Limit podcast and to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to Break Things on Purpose on Apple Podcasts, Spotify, or wherever you listen to your favorite podcasts.

Jason: Our theme song is called, “Battle of Pogs” by Komiku, and it’s available on loyaltyfreakmusic.com.

[SPLIT]

Mandy: All right, welcome. This week on Page it to the Limit, we have a crossover episode. If you haven’t heard part one of this episode featuring Kolton Andrus and Alex Solomon, you’ll need to find it. It’s on the Break Things on Purpose podcast from our friends at Gremlin. So, you’ll find that at gremlin.com/podcast. You can listen to that episode and then come back and listen to our episode this week as we join the conversation in progress.

Julie: There’s a lot of crossover—just like this episode, folks—between both Gremlin and PagerDuty as to how they help organizations be better. And again, going back to building a better internet. I mean, Alex your shirt—which our viewers—or our listeners—can’t see, says, “The world is always on. Let’s keep it this way,” and Kolton, you talk about reliability being no accident. And so when we talk about the foundations of both of these organizations, it’s about helping engineers be better and make better products. And I’m really excited to learn a little bit more about where you think the future of that can go.

Kolton: You hit it though. Like, the key to me is I’m an engineer by trade. I felt this pain, I saw value in the solution. I love to joke, I’m a lazy engineer. I don’t like getting woken up in the middle of the night, I’d like my system to just work well, but if I can go save some other people that pain, if I can go help them to more quickly understand, or ramp, or have a better on-call life, have a better work-life balance, that’s something we can do that helps the broader market.

And we do that, as you mentioned, in service of a more reliable internet. The world we live in is online, undoubtedly, after the last couple of years, and it’s only going to be more so. And people’s expectations, if you’re an older person like me, you know, maybe you remember downloading AOL for a couple of hours, or when a web page took a minute to load; people’s expectations are much different now. And that’s why the reliability, the performance, making sure things work when we need them to is critical.

Alex: Absolutely. And I think there’s also a trend that I see and that we’re part of around automation. And automation is a very broad thing, there’s lots of ways that you want to automate manual things, including CI/CD and automated testing and things like that, but I also think about automation in the incident context, like when you have an alert that fires off or you have an incident you have something like that, can you automate the solution or actually even prevent that alert from going off in the first place by creating a set of little robots that are kind of floating around your system and keeping things running and running well and running reliably? So, I think that’s an exciting trend for us.

Mandy: Oh, definitely on board with automating all the things for sure. So, of the things that you’ve learned, what’s one thing that you wish you had maybe learned earlier? Or if there was like a gem or a nugget for folks that might be thinking about starting their own company around developer tools or this kind of software, is there anything that you can share with them?

Alex: Kolton, you want to go first?

Kolton: Sure, I’ll go first. I was thinking a little bit about this. If I went back—we’ve only been at it about six years, so Alex has the ten-year version. I can give you the five, six-year version. You know, I think coming into it as a technical founder, you have a lot of thoughts about how the world works that you learn are incorrect or incomplete.

It’s easy as an engineer to think that sales is this dirty organization that’s only focused on money, and that’s just not true or fair. They do a lot of hard work. Getting people to do the right thing is tough. Helping with support, with customer success.

Even marketing. Marketing is, you know, to many engineers, not what they would spend their time doing, and yet marketing has really changed in the last 20 years. And so much of marketing now is about sharing information and teaching what we’ve learned as opposed to this old approach of you know, whatever you watched on TV as a kid. So, I think understanding the broader business is important. Understanding the value you’re providing to customers, understanding the relationships you build with those customers and the community as a whole, those are pieces that might be easy to gloss over as an engineer.

Alex: Yeah, and to echo that, I like your point on sales because initially when I first started PagerDuty, I didn’t believe in sales. I thought we wouldn’t need to hire any salespeople. Like, we sell to other engineers, and if they’re anything like me, they don’t want to talk to a salesperson. They want to go on the website, look around, learn, maybe try it out—we had a free trial; we still have a free trial—and put in a credit card and off to the races. And that’s what we did at first, but then it turns out that when doing so, and winning customers in that way, there are folks who want to talk to you to make sure that, first of all, you’re a real business, you’re going to be around for a while and it’s not—you know, you’re not going to not be around tomorrow.

And that builds trust being able to talk to someone, to understand, if you have questions, you have someone to ask, and creating that human connection. And I found myself doing that function, like, myself and then realized, there’s not enough time in the day to do this, so I need to hire some folks. And I changed my mind about sales and hired our first two salespeople about two-and-a-half years into PagerDuty. And probably got a little bit lucky because they’re technical engineering background type folks who then went into sales, so they ended up being rockstars. And we instantly saw an increase in revenue with that.

And then maybe another more tactical piece of advice is that you can’t focus on culture too early when starting a company. And so one lesson that we learned the hard way is we hired an engineer that was brilliant, and really smart, but not the best culture fit in terms of, like, working well with others and creating that harmonious team dynamic with their peers. That ended up being an issue. And basically, the takeaway there is don’t hire brilliant but asshole folks because it’s just going to cause a lot of pain, and they’re not going to work out even though they’re really smart, and that’s kind of the reason why you keep them around because you think, well, it’s so hard to hire folks. You can’t let this person go because what are we going to do? But you do have to do it because it’s going to blow up anyways, and it’s going to be worse in the long run.

Kolton: Yeah, hiring and recruiting have their own set of challenges associated with them. And similar to hiring the brilliant jerk, some of the folks that you hire early on aren’t going to be the folks that you have at the end. And that one’s always tough. These are your friends, these are people you work closely with, and as the company grows, and as things change, people’s roles change, and sometimes people choose to leave and that breaks your heart because you’ve invested a lot of time and effort into that relationship. Sometimes you have to break their heart and tell them it’s not the right fit, or things change.

And that’s one that if you’re a founder or you’re part of that early team, you’re going to feel a little bit more than everyone else. I don’t think anything you read on the internet can prepare you for some of those difficult conversations you have to have. And it’s great if everything goes well, and everyone grows at the same rate, everyone can be promoted, and you can have the same team at the end, but that’s not really how things play out in reality.

Julie: It’s interesting that we’re talking about culture, as we heard about last week, on the Break Things on Purpose episode, where we also talked about culture and how organizations struggle with the culture shift with adopting new technologies, new ways of working, new tools. And so what I’m hearing from you is focusing on that when hiring and founding your company is important. We also heard about how that’s important with changing the way that we work. So, if you could give an advice to maybe a very established—if you are going to give a piece of advice to Amazon—maybe not Amazon, but an established company—on how to overcome some of those objections to culture change, those fears of adopting new technology. I know people are still afraid of holding a pager and being on call, and I know other people are afraid of chaos as we talk about it and those fears that you’ve mentioned before, Kolton. What would your piece of advice be?

Alex: Yeah, good—great question. This will probably echo what I’ve said earlier, which is when looking to transform, transform culture especially, and people and process, the way I think about it is try to not boil the ocean and start small, and get some early wins. And learn what good looks like. I think that’s really important. It’s this concept of show, don’t tell.

Like, if you want to, you know, you want to change something, you start at the grassroots level, you start small, you start maybe with one or two teams, you try it out, maybe something like I mentioned before, in a greenfield context where you’re doing something brand new and you’re not shackled by legacy systems or anything like that, then you can build something new or that new system using the new technologies that we’re talking about here, whether it’s public cloud, whether it’s containerization and Kubernetes, or whatnot, or serverless, potentially. And as you build it and you learn how to build it and how to operate it, you share those learnings and you start evangelizing within the company.

And that goes to what I was saying with the show, don’t tell, where you’re like showing, “Here’s what we did and here’s what we learned. And not everything went swimmingly and here are things that didn’t go so well, and maybe what’s our next step beyond this? Do other folks want to opt-in to this kind of new thing that we’re doing?” And I’m sure that’s a good way to get others excited. And if you’re thinking about longer-term, like, how do you transform the entire company, well, this is a good way to start: start small, you learn how to do it, you learn about what good looks like, you get others excited about it, others opt-in, and then at some point through that journey, you start mandating it top-down as well because grassroots is only going to take you so far. And then that’s where you start putting together project plans around, like, how do we get other teams to do it, on a timeline? And when are they going to do it? And how are they going to do it? And then bring everyone along for the journey as well.

Kolton: You’re making this easy for me. I’ll just keep agreeing with you. You hit all the points. Yeah, I mean, on one hand, the engineer in me says, you know, a lot of times when we’re talking about this transformation, it’s not easy, but it’s worth it. There’s a need that we’re trying to solve, there’s a problem we’re trying to solve.

And in the end, what that becomes is a competitive advantage. The thought that came to mind as Alex was speaking is you need that bottoms-up buy-in; you also need that top-down support. And as engineers, we don’t often think about the business impact of what we do. There’s an important element and a message I like to reiterate for all the engineers: think about how the business would value the work you do. Think about how you would quantify the value of the work you do to the business because that’s going to help that upper level that isn’t, in the day-to-day, feeling the pain, understand that what we’re doing is important, and it’s important for the organization.

I think about this a little bit like remote-by-default work. So, when we founded Gremlin, we decided you know, we didn’t want offices. And six years ago that was a little bit exceptional. Folks were still fundamentally working in an office environment. I’m not here to tell you that remote-by-default is easy, works for everyone, or is the answer.

Actually, what we found is you need a little bit of both. You need to be able to have good tooling so folks can be efficient and effective in their work, but it’s still important to get folks together in person. And magic happens when you get a group of folks in a room and let them brainstorm and collaborate, chat on the way to lunch or on the way to dinner. But I think that’s a good example where we’ve learned over the last couple of years that the old way of doing it was not as effective as it could be. That maybe we don’t need to swing the pendulum entirely the other way, but there’s merit in looking at what the right balance is.

And I think that applies to, you know, incident management, to SRE, to Chaos Engineering. You know, maybe we don’t have to go entirely on the other end of the spectrum for everyone, but are there little—you know, is there an 80/20 solution that gets us a lot of value, that saves a lot of time, that makes us more efficient and effective, without having to rewrite everything from scratch?

Alex: Yeah, I like that a lot. And I think part of it, just to add to that, is make it easy for people to adopt it, too. Like, if you can automate it for folks, “Hey, here’s a Terraform thing where you could just hit a button and it does it for you, here’s some training around how to leverage it, and here’s the easy button for you to adopt.” I think that goes with the technology of adopting, but also the training, also the, you know, how-tos and learnings. That way, it’s not going to be, like, a big painful thing, you can plan for it. And yeah, it’s off to the races from there.

Kolton: I think that’s prudent product advice, as well. Make it easy for people to do the right thing. And I’m sure it’s tricky in your space; it’s really tricky in our space. We’re going out and we’re causing failure, and there’s inadvertent side effects, and you need to understand what’s happening. It’s a little scary, but that’s where we add a lot of value.

We invest a lot of time and effort in how do we make it easy to understand, easy to understand what to expect, and easy to go do and see what happens and see that value? And it sounds easy. You know, “Hey, just make it easy. Just make it simple,” but actually, as we know, it takes so much more effort and work to get it to be that level of simplicity.

Alex: Yeah, making something easy is very, very hard—

Kolton: [laugh].

Julie: —ironically.

Kolton: Yeah. Ironically.

Mandy: Yeah, so what are you excited for in the future? What’s on your horizon that maybe you can share with us that isn’t too, like, top-secret or anything? Or even stuff, maybe, not related to your companies? Like, what are you seeing in the industry that really has you motivated and excited?

Alex: Great question. I think a couple of things come to mind. I already mentioned automation, and we are in the automation space in a couple of different ways, in that we acquired a company called Rundeck over a year ago now, which does runbook automation and just automation in general around something like running a script across a variety of resources. And in the incident context, if an alert fires or an incident fires, it’s that self-healing aspect where you can actually resolve the issue without bothering a human.

There’s two modes to this automation: There’s the kind of full self-healing mode where, you know, something happens and the script just fixes it. And then the second mode is a human is involved, they get paged, and they have a toolbox of things that they can do, that they can easily do. We call that the Iron Man mode, where you’re getting, like, these buttons you can push to actually resolve the problem, but in that case, it’s a type of problem that does require a person to look at it and realize, oh, we should take this action to fix it. So, I’m very excited about the automation and continuing down that path.
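[Editor’s sketch, not part of the conversation: the two automation modes Alex describes can be illustrated with a short Python example. This is a sketch under stated assumptions and is not the PagerDuty or Rundeck API: the `Responder` class, its `auto_fixes` and `toolbox` registries, and the alert kinds are all hypothetical names chosen for illustration.]

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alert:
    service: str
    kind: str


@dataclass
class Responder:
    # Hypothetical registries keyed by alert kind (assumptions, not a real API).
    auto_fixes: Dict[str, Callable[[Alert], str]] = field(default_factory=dict)
    toolbox: Dict[str, List[str]] = field(default_factory=dict)

    def handle(self, alert: Alert) -> dict:
        fix = self.auto_fixes.get(alert.kind)
        if fix is not None:
            # Mode 1: full self-healing. The script runs and nobody is paged.
            return {"paged": False, "result": fix(alert)}
        # Mode 2: a human is paged and handed a toolbox of ready-made actions.
        return {"paged": True, "actions": self.toolbox.get(alert.kind, [])}


responder = Responder(
    auto_fixes={"disk_full": lambda a: f"rotated logs on {a.service}"},
    toolbox={"high_latency": ["restart workers", "fail over to replica"]},
)
self_healed = responder.handle(Alert("api", "disk_full"))
needs_human = responder.handle(Alert("api", "high_latency"))
```

[The split is a design choice: only well-understood problems get an entry in `auto_fixes`, while anything that needs judgment falls through to a person who picks from the suggested actions.]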

And then the other thing that really excites me as well is being able to apply AI and ML to the alerting and incident response and incident management space. Especially around pattern detection, looking for patterns in alerts and incidents, and seeing have we seen this kind of problem before? If so, what happened last time? Who worked on it last time? How did they resolve it last time?

Because, you know, you don’t want to solve the same problems over and over. And that actually ties into automation really nicely as well. That pattern detection, it’s around reducing noise, like, these alerts are not real alerts, they’re false alerts, so let’s reduce them automatically, let’s suppress them, let’s filter them out automatically because the signal to noise is really important. And it’s that pattern detection, so if something major is happening, you can see here’s the blast radius, here’s the services or systems it’s impacting. Oh, we’ve seen something similar before—or we haven’t seen something similar before, it’s something totally brand new—and try to get the right folks involved quickly so that they can understand that blast radius and know how to approach the problem, and resolve it quickly.
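[Editor’s sketch, not part of the conversation: the noise-reduction and “have we seen this before?” ideas can be illustrated without any machine learning. The Python example below swaps in a plain rule-based stand-in: alerts are fingerprinted by service and kind, repeats inside a time window are suppressed, and past resolutions are looked up by the same fingerprint. The `AlertTriage` class, the five-minute window, and all field names are assumptions for illustration, not how PagerDuty implements it.]

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Fingerprint = Tuple[str, str]


@dataclass(frozen=True)
class Alert:
    service: str
    kind: str
    timestamp: float  # seconds since some fixed starting point


@dataclass(frozen=True)
class PastIncident:
    responder: str
    resolution: str


class AlertTriage:
    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window = window_seconds
        self.last_seen: Dict[Fingerprint, float] = {}
        self.history: Dict[Fingerprint, List[PastIncident]] = {}

    def record_resolution(self, alert: Alert, responder: str, resolution: str) -> None:
        """Remember who fixed this kind of problem and how."""
        key = (alert.service, alert.kind)
        self.history.setdefault(key, []).append(PastIncident(responder, resolution))

    def triage(self, alert: Alert) -> dict:
        key = (alert.service, alert.kind)
        previous = self.last_seen.get(key)
        self.last_seen[key] = alert.timestamp
        if previous is not None and alert.timestamp - previous < self.window:
            # Same fingerprint seen moments ago: treat it as noise.
            return {"suppressed": True, "seen_before": []}
        # New or recurring problem: surface what happened last time, if anything.
        return {"suppressed": False, "seen_before": list(self.history.get(key, []))}


triage = AlertTriage(window_seconds=300.0)
first = triage.triage(Alert("checkout", "timeout", 0.0))
repeat = triage.triage(Alert("checkout", "timeout", 60.0))
triage.record_resolution(Alert("checkout", "timeout", 60.0), "alex", "restarted connection pool")
later = triage.triage(Alert("checkout", "timeout", 1000.0))
```

[A real system would use richer signals than an exact fingerprint match, but the shape is the same: suppress duplicates first, then attach prior context to whatever still reaches a human.]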

Kolton: So, it’s not NFTs as your PagerDuty profile picture?

Alex: [laugh].

Kolton: Because that’s, kind of, what I—no, I’m kidding. I couldn’t help but just like what do I not see—like, I’ve, I’ve tried to think of the best NFT joke I could. That was what I came up with. I agree on the AI/ML stuff. That opportunity to have more data and to be able to do better analysis of it, I’ve written some of that, you know, anomaly detection stuff—and it was a while back; I’m sure it could be done better—that’ll get us to a point.

You know, of course, I’m here to push on the proactive. There’s things we can do beyond just reacting faster that will be helpful. But I think part of that comes from people being comfortable sharing more about their failures. It’s a stigma to fail today, and regardless of whether we’re talking about a world where we’re citing things like blameless postmortems, people still don’t want to talk about their failures, and it’s hard to get that good outage information, it’s hard to get the kind of detail that would let us do better analytics, better automation.

And again, back to the conversation, you know, maybe we know what Amazon and Netflix looks like, but for us to create something that will help solve a broader problem, we have to know what those companies are feeling in pain; we need to know what their troubles are hitting at. So, I think that’s one thing I’ve been excited about is over the past two years, you’ve seen the focus on reliable, stable systems be much more important. Five years ago, it was, “Get out of my way, I got features to write, we got money to make, we’re not interested in that. If it breaks, we’ll fix it.” And you know, as we’re looking at the future, we’re looking at our bridges, we’re looking at our infrastructure, our transportation, the software we’re writing is going to be critical to the world, and it operating correctly and reliably is going to be critical. And I think what we’ll see is the market and customers are going to catch up to that; that tolerance for failure is going to go down and that willingness to invest in preventing failure is going to go up.

Alex: Yeah, I totally agree with that. One thing I would add is, I think it’s human nature that people don’t want to talk about failures. And this is maybe not going to go away, but there is maybe a middle ground there. I mean, talking about postmortems, especially, like, when a big company has a big outage and it makes the news, it makes Hacker News, et cetera, et cetera, I don’t see that changing, in that companies are going to become radically more transparent, but where I do think there is a middle ground is for your large customers, for your important customers, creating relationships with them and having more transparency in those cases. Maybe you don’t post a full, detailed, nitty-gritty postmortem on a public status page, but what you do do is you talk to your major customers, your important customers, and you give them that deeper view into your systems.

And what’s good about that is that it creates trust, it helps establish and maintain trust when you’re more transparent about problems, especially when you’re taking steps to fix them. And that piece is really important. I mean trust is, like, at the core of what we do. I have a saying about this—[unintelligible 00:19:31]—but, “Trust is won in droplets and lost in buckets.” So, if you have these outages all the time, or you have major service degradation, it’s easy to lose that trust. So, you want to prevent those, you want to catch them early, you want to create that transparency with your major customers, and you want to keep them in the loop on what’s happening and how you’re preventing these types of issues going forward.

Kolton: Yeah, great thoughts. Totally agree.

Julie: So, for this episode of deep thoughts with Kolton and Alex, [laugh] I want to thank both of you for being here with Mandy and me today. We’re really excited to hear more and to see each of our respective companies grow and change the way people work and make life easier, not just for engineers, but for our customers and everybody that depends on us.

Mandy: Yeah, absolutely. I think it’s good for folks out there to know, you’re not alone. We’re all learning this stuff together. And some folks are a little further down the path, and we’re here to help you learn.

Kolton: Totally. Totally, it’s an opportunity for us to share. Those that are further along can share what they’ve learned; those that are new or have some great ideas and suggestions and enthusiasm, and by working together, we all benefit. This is the two plus two equals five, where, by getting together and sharing what we’ve learned and figuring out the best way, no one of us is going to be able to do it, but as a group, we can do it better.

Alex: Yeah. Totally agree. That’s a great closing thought.

Mandy: Well, thanks, folks. Thank you for joining us for another episode of Page it to the Limit. We’re wishing you an uneventful day.