Braintrust by Cortex

In this episode of Braintrust, Cortex co-founder and CTO Ganesh Datta sits down with Randy Shoup, SVP of Engineering at Thrive Market. Randy shares lessons from his leadership roles across multiple companies and explains how measurement and transparency can help teams build stronger engineering cultures.

Randy and Ganesh chat about how fear can block progress, why recovery speed matters more than trying to prevent every failure, and how teams improve through steady, incremental gains. They also discuss a few practical ways to build trust around metrics so organizations can use visibility for learning instead of punishment.

What is Braintrust by Cortex?

Candid conversations with the builders shaping the future of engineering.

Braintrust dives into the operational realities of running high-performing engineering organizations, from production readiness and migrations to AI adoption and operational excellence.

Hosted by Ganesh Datta, CTO & Co-founder of Cortex

Randy Shoup (00:00):
The sole goal of the team is to mitigate the failure, whatever it is, and restore service as quickly and as fully as possible. As a service provider, I should prioritize recovering quickly when I do fail as opposed to trying to prevent all failures. Failure happens, and so to the extent that we have limited resources to put on it, we should spend those resources on how can I get things back. Instead of a day, how about an hour? Instead of an hour, how about 10 minutes? Instead of 10 minutes, how about one minute? You're much better off having ten one-minute outages, if you see what I mean, than one one-hour outage.

Ganesh Datta (00:42):
You are listening to Braintrust by Cortex, where we explore how engineering leaders blend AI, platforms, and culture to build high-performing software teams. I'm your host, Ganesh Datta, CTO and co-founder of Cortex, an internal developer portal designed to help engineering teams ship reliable software faster with AI. In each episode we go deep with CTOs, VPs of engineering, and technical leaders who've been in the trenches navigating the tension between speed and quality, building reliability at scale, and figuring out how to lead through major platform shifts. Whether you're running a team of 10 or a thousand, this is your space to learn from people who've made the hard calls and lived to talk about it. Hello, welcome to the Braintrust podcast. I'm Ganesh. I'm one of the co-founders and CTO at Cortex. We help engineering organizations create a culture of reliability through things like scorecards and self-service workflows. Today I have on Randy, who is currently the SVP of Engineering at Thrive Market. Great to have you on.

Randy Shoup (01:45):
Great. Yeah. I'm Randy Shoup. I'm the SVP of Engineering at Thrive Market. As you said, I think we're going to talk about a whole bunch of stuff today, but I used to be VP of engineering for a platform at eBay. I was the chief architect, and I've been an individual contributor for a lot of my career. Worked at Google, eBay, Stitch Fix, WeWork, a whole bunch of places, sometimes as an individual contributor and then most recently as an engineering leader. So yeah, looking forward to the conversation.

Ganesh Datta (02:13):
Very excited to have you on. I think, like you described, having been an IC for such a long time and then in engineering leadership roles for such a long time, you've seen the full end of the spectrum and I think you bring a very interesting perspective. I want to maybe start off way back when in the early days. You have talked in the past about how organizations, especially legacy organizations, sometimes have a resistance to measurement. And we all know you can't improve what you don't measure, but sometimes there's a reluctance to say, maybe we don't want to know what's wrong. Is that true? Where does that come from and how do you coach an organization out of that mindset?

Randy Shoup (02:51):
Yeah, wow. Well, you just described exactly my saga of the Velocity initiative at eBay as chief architect. So yeah, it is not so much legacy. Ron Westrum, who is a sociologist emeritus, has this great framing of organizational culture where he talks about three different kinds, a typology of cultures. This is in all the DevOps literature. So he talks about pathological cultures, which are cultures of fear and threat where novelty is crushed. You shoot the messenger, people live in fear, right? They fear for their jobs or for their reputations or whatever, and that engenders a lot of resistance to change and so on. And we'll talk more about that. That's the answer to your question. There are other kinds of organizations too. Another is bureaucratic, where it's all about rules and standards and we do it this way because we've always done it this way, and those tend to be very rigid, but middling in their output in terms of engineering outcomes, but also middling in terms of their business outcomes.

(03:58):
By the way, as you can imagine, pathological has the worst engineering and the worst business outcomes. Read the Accelerate book to understand why. The third one is the generative culture, which is a culture of performance, of Theory Y versus Theory X, a culture where, when there's a failure, it generates inquiry, you're looking to improve, and there's a culture of improvement and learning. So to answer your question about whether it's often true that legacy organizations resist measurement: yes, but I think that's not their step one. This is particularly true at eBay, but in lots of other similar pathological organizations, there is a culture of fear, and therefore shining a light on what the current situation is in my team is threatening to me because, again, I'm in constant fear of being embarrassed or maybe even let go or something like that.

(04:54):
And so an adaptive reaction, unfortunately, in such a culture of fear is to be very risk averse and also very anti-transparent, very non-transparent. And I think you also asked, isn't that a challenge for making change? Like, oh yeah, it's a big challenge for making change. And so one of the real challenges that my partner Mark Weinberg and I had in terms of doing this Velocity initiative, I'll talk about what we were trying to do maybe in a moment, but for the moment, we were trying to make change across the entire 5,000-engineer organization. And one of the first things we were doing with the teams that we were working with was trying to up their psychological safety or, conversely, reduce their fear of, hey, here are these two fancy people coming in, digging in my area and trying to help me be better.

(05:53):
And we just want you to be better. And again, to your point, the only way to get better is to be very open and honest and straightforward with yourself and others about where you currently are. And by the way, I know you know this, but that's not just true in engineering organizations. With your health, with your diet, with your relationships, with your personal, I dunno, psychology or whatever, all these things are like, there's no way to get better until we are very, I mean, this is feeling very Zen. It really is, right? There's very much an Eastern philosophy idea of, let's deeply understand and be centered and understand the world the way it is, and only then can we move forward.

Ganesh Datta (06:40):
I love that. I think it was either Andreessen or Horowitz who had said something like, as leaders or as founders, your hardest job is to run towards uncomfortable truths. And that's what makes great leadership: you have to face those things head on. Avoiding them is the challenge. I think it is probably very similar where organizations maybe don't want to face that truth, they don't want to really know what's going on because maybe the status quo is okay. We see that even in organizations that maybe I wouldn't describe as pathological, in our own customers, but they're worried about the perception of introducing metrics. They're like, if we measure things, people may think that we're going to use it for all these other reasons. Even if that perception is actually not there, there's a fear that it will create fear, and they see how that might be perceived. And so if you were an engineering leader or even somebody working in engineering operations or developer productivity, how do you introduce this concept into an organization like that where, hey, we are measuring this not for that purpose? Is there a way to broach the subject or introduce it in a way that's psychologically safe?

Randy Shoup (07:47):
Yeah, that's very insightful. I'll just say back the premise, which I agree with, and then we'll go forward from there. So I was saying in a pathological organization, I fear transparency because I don't want people to see what's happening. I fear what would happen if they did. You are also noting that even in the opposite kind of organization, in a generative organization, one correctly might expect that, hey, when we introduce measurement, people are going to feel like we're going to use the measurement for evil. And that is a real fear. Those are two different things. Obviously I know you know that, but we had both those fears. So yeah, your question was how do you deal with it? I mean, I don't think there's any magic, but it really comes down to trust. The Velocity initiative that I'm going to keep referencing in this conversation:

(08:36):
In 2020, I returned to eBay as the chief architect. The problem statement was, eBay's product development and engineering is too slow versus other industry peers and we should be faster. And so we leveraged the DORA metrics, or the Accelerate metrics, call 'em the same thing, the four key metrics, and used that as our measure of software delivery. And so we were doing exactly what you're asking about, which is we were coming to teams, I mean, the hope and the goal was to make everybody better, and everybody wanted that. But to your point, there was a fear that, hey, this chief architect and other fancy folks are coming and measuring things, and is that going to be a problem? Is that going to be used for evil? Again, the best way to demonstrate trust is to, I dunno, start by being vulnerable a little bit, like, we are coming in and, again, being straightforward about that: hi, we are introducing these metrics, here's why we're doing it.

(09:45):
And also we completely understand that it's going to freak you out. So it's building the trust, and it takes time, as it should. The whole point of measuring the teams was to make them better, not to compare them or to make them feel bad, but your team is five units of whatever, and we want it to be 20 units of whatever, and there's no way you can get to 20 without recognizing that you're at five. And so the way that we specifically did that was we did it very transparently. So we had a dashboard that you could drill in and look at every team and how they were performing on all the DORA metrics plus a bunch of other things that we did. But then also we met every week with every team we were working with, and we didn't try to work with all 5,000 at the time.

(10:35):
We had some pilot teams to start with, but we met every week with them. That was both to get their feedback about what we as the platform team could do better for them and what things they were actively struggling with. Okay, you tried a bunch of things this past week, what went well, what went poorly? But also, exactly to your point, and these are my words: how can we tamp down your anxiety that we're going to do this for evil? It's like, "I'm from the government. I'm here to help." And people have to actually believe that, and in order for them to believe that, you have to demonstrate that it's true, if that makes any sense, right? So yeah.

Ganesh Datta (11:17):
That makes a lot of sense. I guess my last question on this note: do you think that every team should be able to see every team's metrics? And the reason I ask that is because I think, again, from a cultural standpoint, sometimes you have folks who are strong believers in, hey, we only want managers to be able to see this, and that will help prevent a culture where, hey, we're using this to measure you or stack rank people or whatnot. I believe sometimes that, hey, by opening it up to everyone, we treat it like observability. It's like, these are the facts. It's not good, it's not bad, it just is. And we all look at it. I dunno if you have an opinion on that.

Randy Shoup (11:56):
No, I mean, I have your opinion. Both. We're adults, which means we have to be able to hold seemingly contradictory ideas in our head. And one of those ideas is we should have full transparency and everybody should be able to see and be inspired by what other teams are doing. And the other thing we need to hold in our head is that will freak some people out. So the trick is, and it's exactly this cultural thing, slowly but surely trying to move from a pathological, fear-based model to a generative, performance-based model. Yeah, I mean, what we did, and this had never been done before, was that every team could see every other team. And what we did not do is go, hey, team with the old legacy stuff, you have two units of goodness, and this team with all the new stuff has 20 units, and aren't they great and aren't you terrible?

(12:55):
It's like, no, the way to think about it honestly is less a comparison between teams and more a comparison within a team over time. A point very well made by Dr. Nicole Forsgren, who's the primary author of Accelerate and all of the DevOps-related research. She and Abi Noda just came out with a book a month ago called Frictionless, which I happen to have on the top of my book stack. And it talks about how to execute these kinds of transformations. And one of the great points they make is exactly this: don't compare team A to team B, but compare team A to team A minus one, I don't know how to say it, team A versus what they were last week, last month, last quarter. The thing I would add to that, again, is that one side of seeing is the fear, which comes from, oh, aren't I going to be compared with these other teams?

(13:54):
And will I sometimes look bad if I'm not the best? Okay, that's a real fear that people have, but that's what leadership is there to do. But the other thing, frankly, which is the flip side of it, is, hey, I get inspired by these other teams. And I guess the other mechanism, sorry to go back to your earlier question, but I think it relates, the other mechanism that I thought was very effective in this Velocity transformation was that we had a weekly team of teams meeting, or scrum of scrums if you like, and it was at least one member from all of the teams that we were working with, so 10 or 15 individual teams. And that hour was all rah rah, and it was like, hey, you guys went from 10 to 12, these people over here went from 20 to 30, what did you guys all do?

(14:44):
Oh, well, we did this and we put this tool into practice and we changed the way we did our standups. And so teams teach each other, and just watching other folks get better sets up a friendly competition. That's not a euphemism, it's genuinely a friendly competition, as distinct from people feeling bad. You asked the question about transparency. I think we want to live in a world where transparency is a net good, and we use it for good. And the only way to start doing that is to just do it. And again, initially people will feel uncomfortable. I get it. There's nothing weird about that. Again, two contradictory ideas in our head, but by not using it for evil, we demonstrate that it's being used for good, if that makes any sense.

Ganesh Datta (15:37):
That makes a lot of sense. And I like the idea of, maybe if you're doing a thing that could be construed as negative, counterbalancing that with a ritual that errs on the side of positivity, like the rah rah meeting or whatever you want to call it.

Randy Shoup (15:50):
I mean, we called it the team of teams, and there's a book called Team of Teams, and we were very much inspired by that book. But yeah, I mean, I guess it makes it sound like it was more intentional on my part than it was, so thank you for that credit. But it was our intuition, and it turned out to be correct, that collectively having teams be able to say openly that they were struggling with something engendered, hey, how can my team help you, as distinct from, you're terrible, and having everybody see that. Do you know what I mean? Hey, you haven't really made as much forward progress as I think you wanted to. What can my team do tomorrow to help you? And it's like failure generates inquiry. It's like opportunities or challenges or impediments, as we used to call them, generate help.

Ganesh Datta (16:43):
It's assuming good intent. It's like most of the time your developers are not trying to be slow on purpose. It's like most developers are craftspeople. Most of them want to ship and do good work.

Randy Shoup (16:56):
I mean, I would replace most with all. I mean, there are people in the world that have bad intent, that's why we have to have security. But I have yet to find, I mean, "I could have done this in an hour, but I intentionally took 10," said no one ever, right? The reason why it takes 10 is the system, not the individual. And the other thing to be said very strongly, and you kind of implied it in your question, but I'll say it explicitly: none of these things should be used to measure individuals. The unit of measurement is the team, full stop, end of story. There is no value in measuring deployment frequency, lead time, certainly change failure rate, or MTTR for an individual. That doesn't tell you anything and would engender a culture of fear. And so it's not about stack ranking. Nothing should be about stack ranking people, but it's definitely not about stack ranking individuals. The unit of production of value is the team, and that's the correct unit to measure.
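
To make the team-as-unit point concrete, here is a minimal sketch of computing the four key metrics for one team. The `Deployment` record, its field names, and the simplified metric definitions (for example, measuring recovery from deploy time to restore time) are all assumptions made for illustration, not something described in the conversation. Note that nothing in the record identifies an individual.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment by a team."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def team_metrics(deployments, window_days):
    """Compute simplified versions of the four key metrics for one team."""
    if not deployments:
        raise ValueError("need at least one deployment in the window")
    failures = [d for d in deployments if d.failed]
    # Lead time: hours from commit to running in production.
    lead_times = [(d.deployed_at - d.committed_at).total_seconds() / 3600
                  for d in deployments]
    # Recovery time: minutes from the failed deploy to restored service.
    recoveries = [(d.restored_at - d.deployed_at).total_seconds() / 60
                  for d in failures if d.restored_at is not None]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "lead_time_hours": mean(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_minutes": mean(recoveries) if recoveries else 0.0,
    }


start = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start, start + timedelta(hours=8), failed=True,
               restored_at=start + timedelta(hours=8, minutes=10)),
]
print(team_metrics(history, window_days=7))
```

Running the same function on the same team's history for successive windows gives the within-team, over-time comparison rather than a ranking across teams.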

Ganesh Datta (18:09):
Absolutely. I want to zoom out now. We've obviously been talking about visibility and measuring things, and one of the things we tend to measure is things like incidents and uptime and reliability and things like that. And you've talked in the past about this idea that incidents are normal, like software fails, things break, shit happens, to put it another way. But what matters is how well you recover from that. Could you expand on that a little bit? What does that mean?

Randy Shoup (18:36):
Yeah, that's great. I would say there are two things that you want out of incidents. In the incident, you want to recover as fast as possible, and I'll put more behind that. And then also John Allspaw, who's wonderful in this area, says incidents are unplanned investments. We've already paid the cost, right? In the downtime, the customers have already paid. So the only way to get an ROI for that is to get some R. Well, we paid the I, so what's the R? The R is the learning. The R is, here are the action items that we're going to do next to make sure it never happens again, to make sure we're more reliable, and so on. Yeah. So maybe direct me a little bit into what you wanted to dig into.

Ganesh Datta (19:19):
Yeah, I just wanted to see, why is the focus on recovery? Why does it matter how well you recover?

Randy Shoup (19:24):
Yeah, absolutely. Yeah. So think about yourself as the customer of some service like electricity. And that's relevant to me because in the two weeks of rain we had over Christmas, I lost power in my house twice. So what I want out of the power company is that they restore my service as quickly as possible, and that is what they should want too. So when you are in an incident, and I've done it for 38 years, once upon a time I had hair, the sole goal of the team is to mitigate the failure, whatever it is, and restore service as quickly and as fully as possible. Sometimes you can remediate, that's even better, but our immediate goal is how can we mitigate and restore services as rapidly as possible? And sometimes, depending on what the failure is, sometimes it's like, okay, spin up more capacity.

(20:26):
Sometimes it's move to some other place in the world. Sometimes it's pull the plug out and restart it, reboot it, or whatever. Many of those are specific solutions to specific incidents I have in the back of my head. But yeah, that's because, again, you're providing a service to your customers. Whether they are other teams at your current company or the customers of your company, it doesn't matter. The goal is to restore service as quickly as possible, and only after that do you work on anything else. And a lot of times that can be something like, again, hey, let's spin up in a new region, double the capacity and just go. Then it's like, okay, great. Now we have time to breathe and now we need to figure out what to do next. And maybe that's not sustainable. And again, this is very obvious, but the analogy to triage in an emergency room is exactly this. We stabilize the patient and do whatever we need to do to get the patient in a stable state. Then the doctors and nurses take a step back and go, alright, what's the proper long-term way to solve whatever they have? Stop the bleeding, and only then do we go, okay, now we have to do this particular operation or whatever. I hope that makes sense.

Ganesh Datta (21:48):
Yeah, that makes a ton of sense. And I think it's a very pragmatic take because I think you have to accept that things will break in real life and things happen. Even in your electricity case, you cannot prevent a tree from falling on a wire somewhere.

Randy Shoup (22:03):
That's where you were going. I'm so sorry.

Ganesh Datta (22:05):
Yeah, I was. No, no, I think you answered it exactly. Yeah.

Randy Shoup (22:09):
Let me answer what you're actually asking. I mean, I answered part of it, but not fully. What you were really asking me is, as a service provider, should I prioritize trying not to fail or should I prioritize recovering quickly when I do fail? And yes, the answer is I should prioritize recovering quickly when I do fail, as opposed to trying to prevent all failures.

(22:33):
That is why, of the four metrics, we do not measure MTBF, mean time between failures. We don't measure how long it has been since the last one, "days since our last industrial accident" or whatever. Instead, we measure MTTR, mean time to recover. There are some quibbles in the world about whether that's a good metric, but let's leave it there for the moment. No one argues in a serious way; everybody agrees that recovery is more important. And again, to your point, we would love to not have failure, but failure happens. And so to the extent that we have limited resources to put on it, which we typically do, we should spend those resources on how can I get things back. Instead of a day, how about an hour? Instead of an hour, how about 10 minutes? Instead of 10 minutes, how about one minute?

(23:25):
And you're much better off having ten one-minute outages, if you see what I mean, than one one-hour outage. So yeah, prioritizing the recovery is absolutely essential. All of this, this discipline is called resilience engineering. And again, it applies to airline safety, it applies to medicine, it applies to software, it applies to fire, all these kinds of things. And everybody has independently invented: don't blame the people, have a retro, prioritize stabilizing the patient, blah, blah, blah, right? I mean, all these different domains that need to be resilient to stuff independently discover essentially the same thing.
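
The arithmetic behind "ten one-minute outages versus one one-hour outage" can be sketched in a few lines. The 30-day reporting window and the function names below are assumptions chosen for illustration; the point is only that total downtime and mean time to recover both fall when each recovery is faster, even though the failure count is higher.

```python
# Illustrative arithmetic only; the 30-day window is an assumed reporting period.
WINDOW_MINUTES = 30 * 24 * 60


def total_downtime(outage_minutes):
    """Sum the downtime across a list of outage durations (in minutes)."""
    return sum(outage_minutes)


def availability(outage_minutes, window_minutes=WINDOW_MINUTES):
    """Fraction of the window during which service was up."""
    return 1 - total_downtime(outage_minutes) / window_minutes


def mean_time_to_recover(outage_minutes):
    """MTTR taken here as the average outage duration."""
    return total_downtime(outage_minutes) / len(outage_minutes)


ten_short = [1] * 10  # ten one-minute outages
one_long = [60]       # one one-hour outage

for label, outages in (("ten one-minute", ten_short), ("one one-hour", one_long)):
    print(f"{label}: downtime={total_downtime(outages)} min, "
          f"MTTR={mean_time_to_recover(outages):.0f} min, "
          f"availability={availability(outages):.4%}")
```

A count of failures alone would rank the ten-outage case as worse, which is one way to see why a recovery-time measure tells a different story than a time-between-failures measure.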

Ganesh Datta (24:16):
And I think that's where SLOs come in, obviously, because you want to be able to define an acceptable level of failure, because I think that's an easy way to counterbalance. Yeah, should I be prioritizing being up or recovering from failure? Well, you can get a more nuanced answer maybe if you've also spent the time to define, well, what is acceptable to my customers in terms of reliability of the platform? Recovery helps meet your SLO targets and stuff as well, but it's maybe an input into this broader equation.

Randy Shoup (24:49):
What you're seeing on my face is thinking about it. So I agree with everything you say about SLOs. I guess I wouldn't immediately tie it only to recovery time. I think you should prioritize recovery totally independent of that. How is our system doing? Are we meeting the service demands, requirements, requests, desires of our customers? And to your point, a lot of times we're a little hand wavy about it and much better is to be disciplined and structured and quantitative about it. SLOs are just doing a lot of work here, but they are just a way of us taking a step back and saying, what does a good customer experience look like? Or maybe more precisely what is the minimum accepted customer experience that we would feel proud about?

Ganesh Datta (25:47):
And that might help you balance, should I invest in the quality of the service? Because in the electricity example, if your electricity was going out every day, you might say, I don't care that it takes you five seconds to get my electricity back. Just keep my power on for a good amount of time, and then go figure out how you're going to fix it.

Randy Shoup (26:02):
And actually, I mean, again, I haven't done this, but I've considered more than once doing battery backup, exactly for that reason. Yeah, my community is actually actively discussing, can we spin up a little local solar,

Ganesh Datta (26:21):
Oh, there you go,

Randy Shoup (26:21):
battery station essentially for our little community, because we can't trust PG&E to keep the lights on or whatever. Anyway. But yeah, no, I mean, I think that's very visceral and very evocative for people. If you think about it, I need the electricity, so what would you want of the power company? And that's what we're supposed to do, essentially, as service providers in software.

Ganesh Datta (26:41):
Exactly. I would love to talk a little bit about how to create a culture around this, a culture of reliability, the culture where an organization is willing to take a step back and say, hey, we are going to go back and think about the root cause or what the investment is and not just run back to the next ticket or whatever that might be. How do you foster this culture of reliability? Is it people, is it management? Is it systems? What goes into that?

Randy Shoup (27:08):
Absolutely. Yes.

(27:10):
Yeah, in no particular order. So what I meant was, it is people and culture, it is organization, it is technology, it is systems, it is all those things. You asked how to foster it, implying that when it doesn't exist, how do we get it the first time? And actually I'm kind of in that situation at Thrive Market, so it's very, very visceral for me, and I have been at other places as well. Interestingly, it's back to your first question. Step one is actually measure, and you measure, of course, because that is the way to figure out whether you've made improvements or not. But also, interestingly, the best first step to generating a culture of caring about a thing is to start measuring that thing. And again, being transparent about the measurement, I have found that's 80% of the battle. This is not your question, but it's similar: how can we get engineering teams to care about cloud costs? Well, a way to do it is to, I don't know, get mad at them and punish them. But 80% of the way there, you can just go with, here are your cloud costs, right? I mean, no joke. No joke.

Ganesh Datta (28:24):
Literally, we've seen this in practice

Randy Shoup (28:27):
And I'm one of these people and there are many like me. Are you kidding me? 50 million? No way. I can make that 40 million, 30 million. There's something about humans in general, but engineers in particular, that look at something like that and we go like, there's no way we do 10 million API requests for that thing. And then it is, okay, but that's really inefficient and blah, blah, blah. So simply making it transparent is step one to the culture. And then step two, or a related step, is what does the organization incentivize? And you can incentivize something in a negative or positive way, and the longer-term way is the positive way: celebrating the teams or the services or whatever you're measuring that have improved their SLOs or their measurements or whatever, incentivizing it through positive reinforcement of the good behaviors, as distinct from the culture of fear, the pathological one, of negatively reinforcing the other ones.

(29:38):
You need a little bit of the stick sometimes, but rarely. The sustainable cultures come from the carrot side. And then the other way to do it, or the other related way to do it, and this is true of bugs or compiler warnings or tests or performance SLOs, is where you start at a particular level, you give yourself a target that's achievable, and I think of it as a ratchet. So okay, we get to this point and we set the SLO at this point, and then we do that and we're very happy with that. Okay, now let's raise it or lower it or whatever's the appropriate thing, make it more challenging, and then do that for a while and then make it more. So that's particularly true with tests and quality, where again, I like to think of it as a ratchet where we have a bug that somehow slipped through our tests.

(30:33):
We didn't test for it or whatever. Okay, great. We write a test that reproduces the bug, we put that in our suite, we fix the bug, all good, the ratchet moves forward, chunk. Okay, we're never going to have that one again. Or at least if we have it, we'll detect it early rather than late. All of these improvement things you can kind of think of in that way. You're just like, what is a mechanism that I could put in place so that we would never slip back? We're at some point and then we move forward, and then how does it not slip back? And SLOs are a way to do it. Tests are a way to do it. Does that make sense?
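
The ratchet idea can be sketched as a floor that only ever moves in the direction of improvement. The `Ratchet` class below is hypothetical, and automatically tightening to the best observed value is one possible design chosen for illustration; a team could just as well raise the target by deliberate, periodic decision, as described above.

```python
class Ratchet:
    """Tracks a 'higher is better' metric and never lets the floor move backwards."""

    def __init__(self, floor):
        self.floor = floor

    def check(self, measured):
        """Return True if the measurement meets the current floor."""
        return measured >= self.floor

    def tighten(self, measured):
        """Raise the floor to the new measurement only if it is an improvement."""
        if measured > self.floor:
            self.floor = measured
        return self.floor


slo = Ratchet(floor=0.990)   # assumed starting target, e.g. 99.0% success rate
slo.tighten(0.995)           # the team improves, so the floor moves up
slo.tighten(0.992)           # a worse period does not lower the floor
print(slo.floor)
print(slo.check(0.992))
```

A regression test added after a bug plays the same role for quality: once it is in the suite, the codebase cannot silently slip back past that point.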

Ganesh Datta (31:10):
Yeah, I love that. I mean, we see that in practice with our customers. We have a product called Scorecards, which basically allows you to define best practices. Production readiness is a great example of this. Here are the things that we consider to be part of a production-ready service: monitors and SLOs and all these things. And what we've seen in practice from our most successful customers is it's gamified. So people generally do levels, so bronze, silver, and gold or whatever, and services meet those standards. But eventually, over time, the stuff that's in gold moves down to silver, moves down to bronze. As more and more teams accomplish those things, it's like, hey, we're all doing those things that we considered aspirational one day, but these are now the basics. And so it's holding ourselves to higher standards. So we kind of see that play out across customers as well.

Randy Shoup (31:53):
Yeah, I love that you say gamified. I say positively reinforced, and we're saying the same thing, right? That's a specific implementation, which actually works really well for human psychology. Yeah, what you're not saying is the bronze people are terrible and they're fired. What you're saying is there's bronze and there's silver, gold, I dunno, platinum, diamond, whatever the credit card companies want to keep doing, black, anyway. But that's a really, really good way of quantifying and making tangible positive reinforcement for good work.

Ganesh Datta (32:31):
And then celebrating it, like you said, calling out those teams. And an advisor made a great analogy the other day, which was like, you want to create a culture of celebrating the first downs, not just the touchdowns. And I was like, yeah, it's a really good way of describing it.

Randy Shoup (32:42):
That's really good. Yeah, it's really good. All these things are games of inches anyway. I mean,

Ganesh Datta (32:46):
Genuinely. Exactly.

Randy Shoup (32:48):
There are very few examples in the world of one-shot, order-of-magnitude improvement. Everything that I'm really proud of, it's actually true, every single thing that I've done that I'm really, really proud of in my career has all been these tiny little games of inches. And this velocity improvement at eBay where we doubled engineering productivity, that's one of them. Earlier in my career at eBay, doing something entirely different, it was improving the ranking function for the search engine and driving hundreds of millions of dollars in additional revenue. Similarly, improving the performance of the search pages, again, driving hundreds of millions of dollars in revenue. And every one of those was like, there wasn't one smoking gun, no silver bullet. It was just simply a game of inches, like making a small improvement here, a small improvement there, a small improvement here. Ratchet, ratchet, ratchet, ratchet. And that's maybe, I think, the only way really to get serious improvement. I'm going to pontificate for one more second and then I'll let you ask your next question. This is exactly Malcolm Gladwell's Outliers, if you've ever heard of that book.

Ganesh Datta (34:03):
I haven't read it. I've heard of it. Yeah.

Randy Shoup (34:05):
It's where the 10,000 hours to become an expert comes from. I mean, the idea is very quick, which is all world-class experts, or at least youth experts come from deliberate practice and constant improvement in the small, right? There's no concept of step change. There's only reinforce, reinforce, reinforce, improve, improve, improve, improve, and that's the way you get to greatness.

Ganesh Datta (34:36):
Yeah, I love that. I mean, thinking about it another way, it could be the idea of compound interest or whatever. It's like you get 1% better, you get 1% better, and actually over time,

Randy Shoup (34:45):
Compound interest. It's exactly compound interest. Yeah. In fact, that's the analogy we use. Well, we're making a lot of investments in improving our platform and foundations at Thrive Market, and that is exactly the analogy, or metaphor rather, that we use with the non-technical executive team. It's like, we are making investments and they will pay off. And it's exactly compound interest, right? Exactly. And we need to make this investment now because we need the 10% per whatever to start accruing now as distinct from later. And
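
The compound-interest analogy is easy to check with a little arithmetic. The sketch below is illustrative only: the 1% and 10% rates echo the figures mentioned in the conversation, while the choice of 52 weekly periods and the helper names are assumptions.

```python
import math


def compounded(rate: float, periods: int) -> float:
    """Growth factor after `periods` improvements of `rate` each."""
    return (1 + rate) ** periods


def periods_to_double(rate: float) -> int:
    """Smallest whole number of periods after which the factor reaches 2x."""
    return math.ceil(math.log(2) / math.log(1 + rate))


# Getting 1% better every week for a year (52 periods is an assumption).
print(round(compounded(0.01, 52), 2))  # 1.68

# "10% per whatever": how many periods until the investment has doubled?
print(periods_to_double(0.10))  # 8

# Starting four periods later leaves you permanently behind by that factor.
print(round(compounded(0.10, 12) / compounded(0.10, 8), 2))  # 1.46
```

The last line is the "now as distinct from later" point: a delayed start never catches up, because the team that started earlier keeps compounding from a higher base.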

Ganesh Datta (35:20):
Yeah, I love it. My final question, I know we're coming up on time. If you have one piece of advice for an engineering leader who's thinking about starting to get their organization to focus on MTTR, what would you recommend? Where do you recommend they start?

Randy Shoup (35:35):
I don't think I have any different advice than the answer I gave earlier, which is make it visible, right? So I mean, you asked how do I get people to care about MTTR? It's like, how do I get people to care about X? And I'm not saying this just because you do measurement, but measuring X and showing and being transparent about X, that is the way to do it. And again, you can get 80% of the way there simply with measuring and not manipulating or encouraging the behavior, if that makes any sense. And then on top of that, it's like, okay, now start to encourage the behaviors that you want. I think that's it. I mean, the other thing, if you're specifically asking about MTTR, it's having the owners of the service carry the pager for the service, right? So that's the other way to do it, or not the other way, the complementary way to do it, which is put the responsibility for the thing, let the benefits and the burdens run together. That's a legal term, but the team that has the ability to influence the reliability of the service should have the responsibility and the accountability for driving that

Ganesh Datta (36:43):
Reliability. You build it, you own it.

Randy Shoup (36:45):
Yeah. Which is exactly SRE, DevOps, you build it, you run it. Exactly.
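
As a starting point for "make it visible", here is a minimal sketch of computing MTTR from an incident log. It reads MTTR as mean time to restore service, and the incident timestamps and the `mean_time_to_restore` helper are invented for illustration; a real team would pull these records from its own incident tooling.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (detected_at, restored_at)

# Hypothetical incident log for one service.
incidents: List[Incident] = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),
    (datetime(2024, 1, 17, 14, 10), datetime(2024, 1, 17, 14, 25)),
    (datetime(2024, 2, 2, 22, 0), datetime(2024, 2, 2, 23, 0)),
]


def mean_time_to_restore(log: List[Incident]) -> timedelta:
    """Average time between detecting an incident and restoring service."""
    if not log:
        raise ValueError("no incidents to average")
    total = sum((restored - detected for detected, restored in log), timedelta())
    return total / len(log)


print(mean_time_to_restore(incidents))  # 0:40:00
```

Publishing a number like this per service, on a regular cadence, is the measurement-first step Randy describes before any active encouragement of behavior.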

Ganesh Datta (36:50):
Yeah. Well, thank you so much, Randy, for joining me on the podcast. This was awesome. I had a great time.

Randy Shoup (36:54):
Happy to be here. Thank you.

Ganesh Datta (37:02):
Thanks so much for listening to this episode of Braintrust. If this resonated with you, do me a favor, share it with another engineering leader who's wrestling with these same challenges. And if you want to continue the conversation or learn more about how we're thinking about internal developer portals at Cortex, reach out to us at cortex.io. Thanks for listening, and we'll catch you on the next one.
