Candid conversations with the builders shaping the future of engineering.
Braintrust dives into the operational realities of running high-performing engineering organizations, from production readiness and migrations to AI adoption and operational excellence.
Hosted by Ganesh Datta, CTO & Co-founder of Cortex
Shawn Burke (00:00):
You really want the meeting to be action driven. You want those things to show up green the next week. And we saw actually amazing progress over 12 months. The conversation might be what's up with your process that you're falling behind on this. The conversation might be, what is the engineering work required to fix this? I think you do really want an outcome because the goal is to come back green the next week. You don't want to be red every week.
Ganesh Datta (00:26):
You're listening to Braintrust by Cortex, where we explore how engineering leaders blend AI, platforms, and culture to build high-performing software teams. I'm your host, Ganesh Datta, CTO and co-founder of Cortex, an engineering operations platform designed to help organizations continuously improve their operational maturity and reduce developer friction. In each episode, we go deep with CTOs, VPs of engineering, and technical leaders who've been in the trenches, navigating the tension between speed and quality, building reliability at scale, and figuring out how to lead through major platform shifts. Whether you're running a team of 10 or a thousand, this is your space to learn from people who've made the hard calls and lived to talk about it. Hey, welcome to the podcast. I'm Ganesh. I'm one of the co-founders and CTO at Cortex. Great to have you here.
Shawn Burke (01:22):
Thanks. Glad to be here. My name is Shawn Burke. I'm a distinguished engineer at Cortex. I've been at Cortex for about a year and a half. Prior to Cortex, I worked at companies like SoFi, Uber, and Microsoft, and so I've seen a few different takes on operational excellence.
Ganesh Datta (01:36):
Cool. Excited to have you here. Just for the crowd, you want to quickly describe what a distinguished engineer does? Sounds very distinguished.
Shawn Burke (01:42):
The joke that I like to say is it's the only title that's an oxymoron.
Ganesh Datta (01:46):
Oh what do you mean by -
Shawn Burke (01:47):
Can you be distinguished and be an engineer? Essentially, in most companies, it's the top end of the individual contributor scale for software engineers. At different companies, it's going to mean different things, but typically distinguished engineers set technical direction. They map out highly technical roadmaps, or in a lot of cases, which have been my roles, they often have responsibility for how the engineering team does its work. So you typically partner with an engineering leader. As a pair, you work on the what, where the engineering leader usually sets the priorities, and then the distinguished engineer will often work on the how: maybe for a particular project, how we're going to build a new thing, or how the development team does its work, specifically what technologies you use, how you make sure that you stay healthy in those technologies, things like that.
Ganesh Datta (02:36):
Well, thanks for sharing. Well, today I want to talk about operational excellence and in particular, I want to dive deep into operational excellence reviews. The nitty-gritty of who's involved, how do you run it, what do you look at, the why and all that stuff. So maybe let's start with defining what operational excellence even means.
Shawn Burke (02:55):
Yeah. So operational excellence, this is from memory, but the definition is typically your ability to drive down your cost for managing your system and to be able to know that your system is healthy in a given state. And so you want to do that because there's a variety of things in engineering that don't fit neatly into actually building the software. How many alerts is it generating? Do you have the right alerts? What are the latencies? What are the availabilities? Things like this. And so your operational excellence process focuses on that sort of health and operations aspect of your services.
Ganesh Datta (03:29):
Why does that matter? Why would an organization care about operational excellence?
Shawn Burke (03:32):
Yeah. So like I said, for two reasons. One is that these things often don't get swept up in typical engineering planning processes. So there's a set of work for having a quality service delivered to customers that doesn't usually show up in regular engineering. And the reason you want to have this is because a lot of times you want to benchmark how services and systems are operating across an organization. And so without an operational excellence review, you typically don't have an opportunity for everybody to sit down and consider these issues and make sure that you have a service that has ... Maybe you have a company that is doing contracts with other companies with SLAs where you have to contractually adhere to certain performance requirements. Making sure that you're doing that is typically part of an operational excellence review. In addition, often you want to make sure you're looking at things like build times and failure rates and just generally the meta process of operating your software.
(04:35):
And so the why for that is that if you don't have a structured process around it, it usually just doesn't happen. These issues get lost in the cracks.
(04:44):
They sort of often suffer from this. You don't see patterns across the organization that all of your services are actually having a lot of latency problems or build system problems or things like that.
Ganesh Datta (04:54):
Is this something that non-technical stakeholders care about? Is this the right lens for them or is this more of the engineering organization being able to introspect and determine if they're doing the right things to deliver software the right way?
Shawn Burke (05:06):
Yeah. I think the outcomes of operational excellence are very impactful for non-technical stakeholders. So for example, if a service is not reliable and it has outages, that is going to affect any business that is running. So that's a thing that usually is focused on by operational excellence. Some aspect of compliance is often handled in operational excellence. And so I think that the services that have a good track record on operational excellence are going to do two things. They're going to deliver a more reliable product for a business and they're typically going to be able to deliver at a higher velocity because they will have fewer things to slow them down: incidents, bugs, and things like this.
Ganesh Datta (05:45):
Yeah. But it sounds more like the things you think about when it comes to operational excellence are the engineering organization's own practices and the outcomes of that is what the business cares about. The product and business stakeholders are not necessarily asking you questions about your operational excellence. They're asking you about those outcomes. Is that right? Yeah.
Shawn Burke (06:04):
Typically, the only thing that you'll hear from the business owners or non-technical people is, "Hey, what was going on with that incident last week?" Or, "Yeah, we've signed an SLA. We have an SLA agreement with some company that uses our service and we are concerned that those SLAs are being breached."
Ganesh Datta (06:19):
So you mentioned the importance of an operational excellence review. What is an operational excellence review?
Shawn Burke (06:24):
Yeah. So typically an operational excellence review is that you sit down with a group of people on your team (and we'll talk about the different sorts of people that might be there) on a regular cadence and go through kind of a scripted set of health metrics for your organization and services. And so before we get to the details of it, I think that it's important to talk about why this really matters. I think the hardest thing in most organizations is driving behaviors. People have a lot of conflicting demands on their time and priorities and they're constantly deciding what to do. And for a leader to say to the organization, "Hey, we need to improve the latency of our services," it might matter for a minute or an hour, but definitely not a day. It just goes away quickly. And the only effective thing I've seen in organizations to drive an outcome is to check that outcome on a repeated basis week to week.
(07:20):
So number one is the regularity: you're having a checkpoint on it at a very regular cadence. Number two is that the people in the room have to be able to drive outcomes. And so you can't have an operational excellence review with all of your junior engineers. You need to have an operational excellence review with the most senior reasonable person to be in there, whether that's your CTO or your VP of engineering. It's very important that the team gets the signal that this is really important and we really care about it. And by having the leader driving that meeting or sitting in on that meeting and being attentive, it really helps reinforce that in the organization. So I think that's really key. Other types of things I've seen that have tried to make this happen just don't work because that's missing. So once you get the right people in the room, you have the leader and then you want a representative set of people that can reason about the systems in detail.
(08:07):
So at my last company, for example, the CTO and the VPs are in the meeting as well as directors, senior directors and staff engineers. So you want to keep the meeting from being too large. And so for your various constituent areas, you want those people there so that they can answer questions of why is this red? And we'll talk about the what's red and green in a second, but you want the right people in there to have the context to be able to talk fluently about what's going on. It's fine for other people to be at the meeting. I think the danger of having a meeting where a bunch of leaders are is that everybody wants to be in that meeting because the leaders are there. So you want to manage that a little bit. So we didn't really have strict rules at my last company.
(08:43):
We just sort of had guidelines that if there was some specific reason for you to be there, that was fine, but generally it was limited attendance. So that's sort of what you do. I think we did it weekly. I think weekly is probably a good cadence for most companies. And so now we can talk about what you actually do in that meeting. There is an ideal state for operational excellence reviews that takes a lot of engineering work to get to, and most people aren't quite there yet and it's often a moving target. What you want is a set of automated checks that produce a report, whether it's a dashboard in Datadog or some document or something that goes through the areas that you care about and highlights them as red or green. The ideal operational excellence review is you all sit down and go through the red areas and then you're done.
(09:29):
And the way that you can structure that is two things. One is that you need to have people own the various areas and figure out what those metrics should be, and then they have to work together on whatever the reporting mechanism is, because if people are going to be responsible for manually pulling together a report every week, the whole thing will fall apart. It has to be automated. So that's really critical. You want to be able to just go through the red areas, and then that report should be the same across as many of your organizations as possible. And then you basically go through the sub areas of your business. Division A, let's look at your stuff. What's going on with your SLO around availability? And we should probably define SLOs, but we'll do that in a second. What's going on with this metric?
(10:14):
What's going on with that metric? You talk about it. The reason that it's so important to do it as a group is a lot of times multiple teams have the same challenges and most organizations are very vertical. Information runs up and down. It doesn't run side to side. And this meeting is a great time for a very constructive discussion where somebody else is like, "Oh, you know what? We've been struggling with that too." And now you might have the seeds of an actually common problem.
Ganesh Datta (10:37):
You mentioned red and green and having the right metrics there. How do you decide what metrics you're looking at?
Shawn Burke (10:41):
Yeah. So let's actually take a step back and talk about SLOs for a quick second and then we'll go back in there. So typically depending on the area, you might have different types of metrics, but for general service health, you have what's called an SLO, which is a service level objective. And a service level objective says that you want to keep ... So you can imagine that a service has good events and bad events and you define what those are. And a good event might be a successful return in less than 200 milliseconds. If you get an error or it takes longer than 200 milliseconds, that's not successful. And so you can aggregate over those and say, because we've all heard about nines, "Once I've defined my SLI, I want three nines. I want 99.9% of the requests to be green, good events, and the remaining 0.1% is your budget for bad events." And so there's a lot of industry knowledge around SLOs, but that's the thing you're going to have to understand to run a proper operational excellence process.
(11:36):
So teams or people go off and, to your question, decide what the metrics should be. And typically what they'll do is they'll look at those services and they'll do the basics, sometimes called golden signals. So they'll look at latency and availability: latency, is it fast enough; availability, is it even accessible to begin with? And then they'll do a third one, which is error rate typically. And so those are the golden signals and typically what they'll do is try and build SLOs around those. But usually there's actually a little bit more too. There's usually some local contextual stuff. So at my last company, for example, we used Cortex scorecards as one of those things. And we would say in an area, what are your SLOs for the things that you've identified as important and how many of your Cortex scorecards are at the highest level, which we called level two.
(12:28):
And then you can do the same operation for other stuff. So it's basically what's important to you and your organization. A lot of times these discussions will uncover new areas to do, but you do want to keep it so you can get through the whole loop in an hour.
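To make the good-event and three-nines arithmetic concrete, here is a minimal Python sketch. It assumes the 200 millisecond example from the conversation; the `Request` type, the function names, and the choice to treat an empty window as green are illustrative assumptions, not anything the speakers describe using.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    ok: bool           # True if the service returned a successful response
    latency_ms: float  # observed latency in milliseconds


LATENCY_THRESHOLD_MS = 200.0  # the 200 ms example from the conversation
SLO_TARGET = 0.999            # "three nines"


def is_good_event(request: Request) -> bool:
    """A good event is a successful response faster than the threshold."""
    return request.ok and request.latency_ms < LATENCY_THRESHOLD_MS


def sli(requests: list[Request]) -> float:
    """Fraction of requests that were good events."""
    if not requests:
        return 1.0  # assumption: no traffic is treated as meeting the objective
    good = sum(1 for r in requests if is_good_event(r))
    return good / len(requests)


def slo_status(requests: list[Request], target: float = SLO_TARGET) -> str:
    """Return "green" when the measured SLI meets the target, else "red"."""
    return "green" if sli(requests) >= target else "red"
```

In a real setup the good and bad counts would more likely come from a monitoring system than from individual request objects; the point is only that red or green falls out of a ratio and a target.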
Ganesh Datta (12:40):
That makes sense. Is it useful to have business level metrics or SLOs in here or is it mostly golden signal type things?
Shawn Burke (12:48):
That's a great question. So SLOs are designed to be from the point of view of the consumer of a service and that consumer might be an upstream service, it might be an external consumer, it doesn't matter. So the goal for SLOs is to often try to wrap up that concern so it is a business concern. And by connecting that to the engineering team, you do create a really nice connection between the business team and the engineering team that's often a little bit abstract for them. So for example, the typical SLO around latency is one that always has a business facing outcome or availability. So if the service is down, it's a bad business outcome. If it's slow, it's a bad business outcome. It's often important when you craft your SLOs to meet with your business team and get what their priorities are and then try to craft common ground that are represented in SLOs.
Ganesh Datta (13:39):
How do you make sure this dashboard or report you're looking at is not inundated with metrics that are just noise?
Shawn Burke (13:46):
Having somebody who owns the operational excellence meeting and can think about, are we getting through it all in an hour, or however much time you spend? Are the conversations focused? Do they get too quickly off into details that are not at the right level? One of the nice things about having the CTO sitting in there is that that kind of level sets what you're going to discuss, because there's a certain set of stuff that they're going to care about, or VPs. And so the inundation problem, I think, is handled by time boxing it; that's what really helps. And so at my last company, what we did is we actually had a Python script that generated the report every week. And so I would, on Monday morning, run a script that would generate a big markdown file. We would post that markdown file to a GitHub repository for operational excellence reports, with the date in the file name, and then everybody would know where to find it, and that worked really well actually.
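A weekly report script along these lines could be sketched as follows. This is not the script described above; the file naming scheme, the input shape, and the function names are assumptions chosen for illustration. Red items are listed first within each area so the meeting can start with what needs attention.

```python
from datetime import date


def report_filename(run_date: date) -> str:
    """Put the ISO date in the file name so each week's report is easy to find."""
    return f"operational-excellence-{run_date.isoformat()}.md"


def render_report(run_date: date, sections: dict[str, dict[str, str]]) -> str:
    """Render markdown; sections maps area name -> {metric name: "red" or "green"}."""
    lines = [f"# Operational excellence report, {run_date.isoformat()}", ""]
    for area, metrics in sections.items():
        lines.append(f"## {area}")
        # Sort red metrics ahead of green ones, then alphabetically by name.
        ordered = sorted(metrics.items(), key=lambda item: (item[1] != "red", item[0]))
        for name, status in ordered:
            lines.append(f"- {status.upper()}: {name}")
        lines.append("")
    return "\n".join(lines)
```

Writing the result to disk is then a one-liner with `pathlib.Path(report_filename(d)).write_text(render_report(d, sections))`, after which the file can be committed to whatever repository holds the reports.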
Ganesh Datta (14:46):
Who should run the operational excellence review? Is that the CTO running it or are they just an attendee and somebody else runs it?
Shawn Burke (14:52):
Yeah, they shouldn't run it. And in my case, we rotated between me and some of the other staff engineers. That worked pretty well. It's kind of good sometimes in some of these regular meetings to have a few different people do it because different people kind of bring a different focus to the meeting a little bit. It shouldn't be the CTO. I think it would be fine if it was, but the person that generates the report and understands how the report is generated and can debug if there's a problem with it is the person that should probably be running the meeting.
Ganesh Datta (15:20):
You mentioned thresholds like red and green. Are there shades of orange in these metrics we should be thinking about or is it just things are good or things are bad?
Shawn Burke (15:27):
It actually depends on the use case, meaning there are some SLOs that are backed by an SLA, for example, that are hard and fast. If you breach those, there's maybe monetary penalties. But in a lot of cases, you want your SLOs to be a little challenging. You want to feel like you can hit them. I think in software systems, the database might be under extra load, and you make a judgment call about whether or not red is that bad or how red it is, and that's all fine. There's another aspect of SLOs called the error budget, which relates to the red and green events we talked about. So if you say you want to have 99.9% green over a period of time, that gives you a number of events that can be red within a window of time, and that's what's called your error budget. And so you can see how fast you're burning it down.
(16:16):
And the reason that's important is that if you're running out of error budget and you are going red on your SLOs, that typically means the team needs to spend more time on operational issues. If you're running really green and you're not burning your error budget, like let's imagine your service is 100% green all the time and you do a lot of traffic. That means that you can add new features to your heart's content. When you start adding new features, I guarantee you that number is going to start to drop and that's when you know how to dial it back and forth a bit.
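The error budget idea reduces to simple arithmetic, sketched below in Python. The function name, the returned keys, and the handling of a zero budget are assumptions made for the example; real tooling typically also tracks how quickly the budget is being consumed over time.

```python
def error_budget(total_events: int, bad_events: int, target: float = 0.999) -> dict[str, float]:
    """Summarize error budget usage for one window of traffic."""
    # With a 99.9% target, 0.1% of events in the window are allowed to be bad.
    allowed_bad = (1.0 - target) * total_events
    if allowed_bad > 0:
        consumed = bad_events / allowed_bad
    else:
        # No budget at all (no traffic, or a 100% target): any bad event overspends it.
        consumed = 0.0 if bad_events == 0 else float("inf")
    return {
        "allowed_bad_events": allowed_bad,
        "remaining_bad_events": allowed_bad - bad_events,
        "fraction_consumed": consumed,
    }
```

For instance, one million requests at three nines allow about 1,000 bad events, so 250 bad events means roughly a quarter of the budget is spent. A fraction above 1 is the signal that the team should shift time toward operational work.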
Ganesh Datta (16:40):
Do you at any point decide that you've been too aggressive with your target? Like you spend six weeks and you're like, it's just red all the time and that's just the state of the world and so we should update our SLO. Does that happen? And if so, how do you decide that?
Shawn Burke (16:55):
Oh yeah, that happens all the time. I can think of a couple different flavors of that happening. One is that with SLOs, you want to set them as low as possible to satisfy your customer and it takes a while for people to learn that lesson. And they learn that lesson because an engineer who feels like their stuff is really great will say, "Let's go for five nines." And they'll quickly discover that the universe doesn't care for that approach and they'll be red all the time. And so it usually takes some learning and some sensibility about what it means for this thing to have failures, and latency failures are not the same as availability failures. With some of those things, if you set your latency goal at 200 milliseconds and you're getting a few at 250, maybe you should just back it off a little bit and think about it from your customer's point of view: maybe it just doesn't matter.
(17:48):
So one category is that you've just been too aggressive and the system can't support it. Each additional nine is exponentially more work to hit and so it's highly costly. That's one of the reasons why you want to have the lowest availability you can get away with, because it gets more expensive. So one flavor is we just set the bar too high. Another flavor which we see commonly is that within a service there is a subset of requests for that service that are slower than others for some unusual reason. You have a customer who's got a lot more data or they're doing a more complex query. And so sometimes you have to work out, does that mean your system is unhealthy or you've got an issue that you need to solve? And so sometimes you need to craft these to try to capture the breadth of your customers' experience and not be too biased by a single one.
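One way to see why each additional nine gets harder is to look at how little room it leaves. The Python sketch below assumes an availability SLO measured over a 30-day window (the window length is an assumption, not something stated in the conversation) and computes the minutes of unavailability each target permits. It shows the shrinking budget only, not the engineering cost of meeting it.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumption: a 30-day window, 43,200 minutes


def allowed_bad_minutes(nines: int, window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of unavailability permitted by a target with the given number of nines."""
    # Two nines is 99%, three nines is 99.9%, and so on: the budget is 1 / 10**nines.
    return window_minutes / 10**nines


for n in range(2, 6):
    print(f"{n} nines: {allowed_bad_minutes(n):.3f} minutes per 30 days")
```

Three nines leaves a little over 43 minutes in 30 days, while five nines leaves under half a minute, which is why a "let's go for five nines" target tends to stay red.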
Ganesh Datta (18:33):
That makes sense. Shifting gears, let's talk about the agenda for one of these. So what do you actually cover? What is the agenda for this meeting? What does good look like? Are you just going through the report? Is there something that happens at the beginning or the end? What is the actual structure of one of these meetings?
Shawn Burke (18:46):
Yeah. So this is kind of N equals one. I think there's a couple ways to do it. That report I mentioned was great because that was the structure of the meeting and our goal was to get through all of the groups in a single meeting every week. Sometimes we did not. And so what we would do is pick up the next time with the team that was next. So you just wouldn't get to them that week, and you'd look, and if nothing's terrible, then that's okay. You would pick up with them. As for the discussion in the meeting, we would craft the report so that the things that were most important were at the top of each section. So each section was by the division or the engineering business unit or whatever.
(19:27):
And so we strove to keep less than 10 business units and less than 10 functional areas within a business unit. And so for example, let's just imagine that you have a business unit called Payments. And so at your company, payments might be a business unit and if your company is doing some sort of online commerce, maybe what do you call it, catalog or like whatever your SKUs are as another unit. But so payments and then within payments, they're going to have subs, they're going to have a set of services and functionalities. And so what we encourage teams to do, we wanted to have 10 or less top level functional areas and within those we wanted like seven or less functional areas because a lot of these systems, they've got multiple areas inside of them. Maybe they've got like the payment system has got inbound and outbound to payment vendors, maybe they've got their own ledger.
(20:19):
So they've got individual components inside. And so the first thing you want to do is create this topology of your organization, which is your organization in the shape of a functional tree effectively. And then we would run that report. We had a YAML file that defined all those, and then each of them had ... I think, before we were able to auto-discover SLOs, you'd just paste in your SLO IDs, like the Datadog IDs, and you might paste in the services that comprise that. We always aspired to make that all 100% automated. But essentially for every functional sub area, you have a list of things that the report is driven off of. That's what tells the report to go do its stuff. And then so we would walk through those in order, but for each division, each of their functional sub areas were listed together on a single graph.
(21:04):
So for the payments team, there would be a single graph that showed, for example, Cortex status and we would have a line for each of the sub areas in there. And so that allows you to drill in and then our reports had a tabular format below the graph so we could always look at the data.
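The functional tree that drives the report might look something like the following Python sketch. The conversation describes a YAML file; to stay dependency-free, the example uses a plain dictionary shaped like what a YAML loader would return. Every area name, SLO ID, and service name here is hypothetical, and the limits are taken from the "10 or less" and "seven or less" guidelines mentioned earlier.

```python
MAX_TOP_LEVEL_AREAS = 10  # "10 or less top level functional areas"
MAX_SUB_AREAS = 7         # "seven or less" functional areas within each

# Hypothetical topology; names and IDs are placeholders, not real identifiers.
TOPOLOGY = {
    "Payments": {
        "Inbound": {"slo_ids": ["slo-inbound-availability"], "services": ["payments-inbound"]},
        "Outbound": {"slo_ids": ["slo-outbound-latency"], "services": ["payments-outbound"]},
        "Ledger": {"slo_ids": [], "services": ["ledger"]},
    },
    "Catalog": {
        "Search": {"slo_ids": ["slo-search-latency"], "services": ["catalog-search"]},
    },
}


def validate(topology: dict) -> list[str]:
    """Return human-readable problems; an empty list means the tree is within limits."""
    problems = []
    if len(topology) > MAX_TOP_LEVEL_AREAS:
        problems.append(f"{len(topology)} top-level areas (limit {MAX_TOP_LEVEL_AREAS})")
    for unit, sub_areas in topology.items():
        if len(sub_areas) > MAX_SUB_AREAS:
            problems.append(f"{unit}: {len(sub_areas)} sub-areas (limit {MAX_SUB_AREAS})")
    return problems


def walk(topology: dict):
    """Yield (unit, sub_area, config) in file order, the order the meeting follows."""
    for unit, sub_areas in topology.items():
        for sub_area, config in sub_areas.items():
            yield unit, sub_area, config
```

A report generator would iterate over `walk(TOPOLOGY)` and look up each listed SLO and service, so adding a new sub area to the review becomes a config change and not a code change.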
Ganesh Datta (21:21):
Should individual teams adopt this practice as well? Is this just an org wide thing that you do at the VP, CTO, staff engineering level, or should individual sub-teams get in the practice of doing their own operational excellence reviews at their individual scale?
Shawn Burke (21:36):
I think teams should have an operational excellence dashboard that they pay attention to as part of whatever their keep-the-lights-on routine is, whether that's their on-call; sometimes on-call is responsible for that. Sometimes teams do have an operational ... I mean, if you're running a very operationally heavy service, if you're running S3, you probably have this meeting. But if you're running a couple of individual services, I don't want to say no to that, Ganesh, but I think it's probably a little heavyweight for smaller teams. You need to have this machinery to really drive it effectively. And I think as we do get better tools that can maybe do this stuff automatically, AI tools and other sorts of service catalogs like Cortex, I think you would be in a place where maybe you could just have that data ready and you could go through and do it.
(22:22):
Yeah.
Ganesh Datta (22:22):
I think a lot of us, including the listeners, have been in these operational excellence meetings where great conversation, lots of discussion and that's it. And you move on and you do this again next week and nothing ever happens. How do you structure this meeting and what happens after a meeting so that something is actually done about these conversations that are happening?
Shawn Burke (22:39):
That is a great question. I think the way that you have to handle that depends on who the folks in your organization are. If you have a TPM, it might make sense to have a TPM sit in on the meeting and then create tickets for all the outcomes. That's effectively what we did: we would create tickets for the outcomes we wanted. What we would do in the meeting, to keep the meeting fluid, is take notes of what these outcomes were and then we would go and make sure the tickets got created. But that is a really important part of it. I think back to what we talked about earlier, having the leader in the meeting is what really reinforces a lot of this stuff because a lot of people in a random meeting might say, "Sure, I'll go fix that," and not do it.
(23:16):
But if the CTO is sitting in, they have a little bit more motivation.
Ganesh Datta (23:19):
Is someone in the meeting responsible for halting the conversation and asking, "Okay, what are we doing about this?" Because these conversations can devolve and you can have really interesting technical conversations, but is that someone's responsibility? Should every conversation end with a, "What are we doing about this?" Or is it okay to just have some open-ended conversations?
Shawn Burke (23:36):
Yeah, it's always okay to have ... I mean, you really want the meeting to be action driven. You want those things to show up green the next week. And we saw actually amazing progress over 12 months from when we started, to things that had been driven to all green on various things. A category I forgot to mention is we also put compliance in there. So vulnerabilities, we had defined an SLA for how old high and critical vulnerabilities could be and so we tracked that kind of stuff in there as well. I think the conversation is basically, are teams trending in the right direction or are they not ... Some things like vulnerabilities churn constantly, so are they not going up, and Jira tickets and things like this. So I think a lot of the conversation might be what's up with your process that you're falling behind on this.
(24:21):
The conversation might be, "What is the engineering work required to fix this?" I think you do really want an outcome because the goal is to come back green the next week. You don't want to be red every week.
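The vulnerability age SLA mentioned above can be checked mechanically. The Python sketch below is illustrative only: the day limits are hypothetical, since the conversation does not give the real numbers, and the `Vulnerability` type is an assumption about what a scanner export might contain.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical limits in days; the conversation does not state the actual SLA.
MAX_AGE_DAYS = {"critical": 7, "high": 30}


@dataclass(frozen=True)
class Vulnerability:
    identifier: str
    severity: str    # e.g. "critical", "high", "medium"
    opened_on: date


def overdue(vulns: list[Vulnerability], today: date) -> list[Vulnerability]:
    """Return tracked vulnerabilities that have been open longer than their limit."""
    late = []
    for v in vulns:
        limit = MAX_AGE_DAYS.get(v.severity)  # severities without a limit are ignored
        if limit is not None and (today - v.opened_on).days > limit:
            late.append(v)
    return late


def vulnerability_status(vulns: list[Vulnerability], today: date) -> str:
    """Red if any high or critical vulnerability is past its limit, else green."""
    return "red" if overdue(vulns, today) else "green"
```

Tracking the count of overdue items week over week, and not just the red or green status, would show whether a team is trending in the right direction.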
Ganesh Datta (24:32):
Does the structure, the metrics or anything like that change now with the increased adoption of AI coding assistants? As people are potentially producing more code, or more code they don't necessarily understand, is it enough to have the same SLOs and golden signals? Do we have to change what we're thinking about?
Shawn Burke (24:49):
I don't think the operational piece of it changes, but I think this is emerging in the industry: keeping an eye on how much code is being generated by AI assistants, I think, is a thing that we probably want to do. And for example, in my own use of these assistants, they often generate a tremendous number of maybe or maybe not helpful unit tests. And I think it's very easy as an engineer just to say, if they're green, great, check them in. And so those are tests that are going to break, maybe falsely, in the future and you want to consider them. So for operational excellence, I think that code review quality would be the place that I would want to lean into there, and have tooling that lets me know that developers actually really closely reviewed AI output. But I think the agentic stuff, on the other side, would really help you power your operational excellence process. Like this giant script that I wrote by hand to generate all this data- Probably a lot easier.
(25:48):
... probably done at lunch while I was just hacking away with Claude, just as easy. So I think some of that tooling is probably going to be a lot easier and I think there's a future world where some of these AI agents could act like a chief of staff and could help me figure out the why behind some of these patterns.
Ganesh Datta (26:07):
Yeah, absolutely. Do you think it's important for operational excellence reviews to then cover things like flaky tests or build times? If coding assistants are potentially increasing the number of tests you're running or writing questionable tests or things like that, should we be looking at more foundational things like that as a way to capture potential breakages in the actual SDLC itself?
Shawn Burke (26:36):
Yeah, absolutely. I mean, I think some companies would consider those to not be operational issues.
(26:41):
I think they fit very nicely in that process. So any metrics about your developers' inner loop, and we define the inner loop as the code, test, debug loop that developers go through. There's lots of science showing that to the extent you can accelerate that loop, you'll see better outcomes in your organization. I think it's absolutely appropriate to keep an eye on that, and things that slow that sort of thing down are slow build pipelines, flaky tests, long build times and things like this. So I think that those things would be 100% appropriate in there. The challenge with those is going to be defining what good and bad looks like at an organizational level because, for build times for example, very often front end code that's in TypeScript doesn't have this problem and the backend code has got a build time problem. So you'd have to manage it, but I think that flaky tests and build success rates are a thing that's pretty standard and that would absolutely apply.
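Flaky tests and build success rates are straightforward to compute once CI results are available. In the Python sketch below, "flaky" is taken to mean a test that both passed and failed on the same commit; that is one common working definition and an assumption here, as are the input shapes and function names.

```python
from collections import defaultdict


def build_success_rate(outcomes: list[bool]) -> float:
    """Fraction of builds that succeeded; 1.0 when there were no builds (assumption)."""
    return sum(outcomes) / len(outcomes) if outcomes else 1.0


def flaky_tests(runs: list[tuple[str, str, bool]]) -> set[str]:
    """Find tests that both passed and failed on the same commit.

    Each run is (test_name, commit_sha, passed). Mixed results on unchanged
    code are the working definition of "flaky" assumed here.
    """
    seen: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        seen[(test_name, commit_sha)].add(passed)
    # A (test, commit) pair with both True and False recorded had mixed results.
    return {test for (test, _), results in seen.items() if len(results) == 2}
```

Because acceptable build times differ between, say, front end and backend code, thresholds for these metrics would likely need to be set per area instead of once for the whole organization.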
Ganesh Datta (27:39):
Last question for me. You mentioned a chief of staff being interesting here. If you did have a chief of staff, what would you have a chief of staff do for you?
Shawn Burke (27:45):
I would want a chief of staff to be able to tell me the why behind the metrics that I'm seeing. So for example, if it was flaky tests, for example, I'd want it to go tell me exactly why tests are failing. Is it hardware failures? Is it timeouts? Is it race conditions? And try to point back to what in my process could be changed so that I would be able to identify those more easily and work them out.
Ganesh Datta (28:11):
To have someone go dig into this stuff or get the why.
Shawn Burke (28:13):
Yeah.
Ganesh Datta (28:14):
Well, thanks so much for coming on. Just to summarize the stuff we talked about: the importance of operational excellence reviews to make sure you're actually delivering the services that you should be; looking at the golden signals; having the right metrics and the right thresholds, using SLOs as a way to do that; having the right people in the room, ideally senior leadership from the CTOs to VPs, to directors, to staff engineers; having a clear agenda, where ideally the agenda is the report, is the structure; and making sure that people are following up and acting on those things. Is there anything else you would add?
Shawn Burke (28:41):
That's a good summary.
Ganesh Datta (28:42):
Sweet. Thank you so much for joining me on the podcast.
Shawn Burke (28:44):
All right, thanks for having me.
Ganesh Datta (28:51):
Thanks so much for listening to this episode of Braintrust. If this resonated with you, do me a favor, share it with another engineering leader who's wrestling with these same challenges. And if you want to continue the conversation or learn more about how we're thinking about engineering operations platforms at Cortex, reach out to us at cortex.io. Thanks for listening and we'll catch you on the next one.