Candid conversations with the builders shaping the future of engineering.
Braintrust dives into the operational realities of running high-performing engineering organizations, from production readiness and migrations to AI adoption and operational excellence.
Hosted by Ganesh Datta, CTO & Co-founder of Cortex
Steve Evans (00:00):
The most important thing you do is hire. If you hire the right people, you can suck at every other aspect of your job and you'll do okay. And then after that, it becomes giving people the right context so that now that you've hired the right people, they have the context to do the right things.
Ganesh Datta (00:18):
You're listening to Braintrust by Cortex, where we explore how engineering leaders blend AI, platforms, and culture to build high-performing software teams. I'm your host, Ganesh Datta, CTO and co-founder of Cortex, an internal developer portal designed to help engineering teams ship reliable software faster with AI. In each episode, we go deep with CTOs, VPs of engineering, and technical leaders who've been in the trenches, navigating the tension between speed and quality, building reliability at scale, and figuring out how to lead through major platform shifts. Whether you're running a team of 10 or a thousand, this is your space to learn from people who've made the hard calls and lived to talk about it. Hey, thanks for joining the Braintrust by Cortex podcast. Excited to dive into some of the nitty-gritty topics of running engineering teams. We have some exciting topics lined up today. I'm Ganesh, one of the co-founders and CTO at Cortex.
(01:17):
Excited to have you on. Do you want to quickly introduce yourself?
Steve Evans (01:19):
Yeah, I'm Steve Evans. I used to be SVP of Engineering at Chegg, Chegg is an EdTech, helps college students. I left there about a month ago and I'm in between and just exploring what's next and talking to a lot of companies. So just talking to you right now.
Ganesh Datta (01:37):
Well, excited to have you on. I know we have some exciting topics lined up. There was a hot take you posted recently about microservices being the trillion dollar mistake. And I can tell you from experience, I started Cortex because I saw my last job heading towards the trillion dollar mistake. I think very similar experience to what you described. Basically, we made it too easy to build microservices because we had a horrible monolith and then we invested in the infrastructure to make it easy to build microservices. Lo and behold, it's so easy, people have built a lot of them. And then you start to realize, well, now we're introducing a whole new class of problems that we didn't have before. Maybe we solved one. We have a different problem. Eventually I realized this is getting out of hand. I started Cortex six years ago to solve that.
(02:21):
Anyway, all that to say, very personally experienced this problem and excited to dive into that. But maybe for listeners, you want to start by describing this trillion dollar problem. What was it about microservices that led you to believe that maybe it was the wrong path to have gone down? Maybe not the wrong path, but
Steve Evans (02:39):
... Yeah. I mean, look, first of all, I like to talk in extremes, so warning. But when I think of microservices, what I was railing against was the micro in microservices. And so really it's where we fell in the pendulum. So if we talk about the monolith, I think it's pretty clear where the challenges came there. We might smell the time, but you got this one big piece of thing. It's hard to scale it. It's brittle. It's hard to deploy, all that kind of stuff. Okay, cool. We came up with this idea of a microservice. And at least my experience was we created this new tool called the microservice, and all of a sudden everything ... Essentially, we gave the team a hammer and everything became a nail. And next thing you know, we have five to 10 microservices per engineer over the course of many years.
(03:41):
And I think on the opposite side of the monolith, you've got what's talking to what, where are the boundaries of an individual microservice? What is the purpose of an individual microservice? How are these things actually scaling? You hear about things like, oh, you want to have things in microservices so you don't have a single point of failure. Well, if in a user experience, this thing talks to this thing that talks to this thing that talks to this thing to fulfill this one user behavior, it's still a single ... One thing fails in that chain, it doesn't matter.
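As a rough sketch of the chain argument above: if one user action needs every service in a serial chain to succeed, and we assume failures are independent, the end-to-end availability is the product of the individual availabilities. The function name and the 99.9% figures below are hypothetical, chosen only for illustration.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a request that needs every service in the chain to succeed.

    Assumes each service fails independently of the others.
    """
    return prod(availabilities)


# Hypothetical numbers: four services in a chain, each up 99.9% of the time.
per_service = [0.999, 0.999, 0.999, 0.999]
end_to_end = chain_availability(per_service)

print(f"best single service: {max(per_service):.4%}")
print(f"end-to-end chain:    {end_to_end:.4%}")
```

Under these assumptions the chain is always less available than its weakest link, so splitting one request path across more services does not by itself remove the single point of failure.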
(04:24):
And so the point of what I posted and then my follow-up the next week was we created the microservice and then the friction point in the development process was creating the new microservice. And so then my infrastructure team did a great job of making it ... Infrastructure team and the developer experience team did a great job of making it super easy to launch a new microservice. And then essentially the developer, they get a user story and they have the choice of, do I go figure out where to put this thing or I just build a new microservice? And one has a cost and the other has zero cost. Well, guess what do you do? You create a new microservice and now ... Okay, so now what was the next issue? Oh, we have all these new microservices coming up, but we don't have the observability stack.
(05:20):
The alerts aren't getting created properly. Oh, cool. We'll automate that. And we just keep solving around this expanding empire of microservices. And it's almost like we became a microservices shop as opposed to serving our customers. And so you just have all this complexity that's not serving the customer, but serving the engineering organization. And again, I'm not suggesting we swing the pendulum back to one monolith, although I would argue in the early days of a company, why not?
(05:57):
I'm really not sure. You can go really, really far on a monolith, but there's somewhere in between. I don't remember what I suggested in my post because I really don't care what we call them, but a small service, a medium
Ganesh Datta (06:15):
Service. Microservice.
Steve Evans (06:17):
But my general thought is services that serve a domain: a service for login, a service for checkout, a service for ... And for whatever, it depends on what your business does. And I've heard other ways to tackle it, like you should split services based on how your product scales, which I think is great if you understand that. I think the reality is most people don't understand that until they're further along in the development process and then you're going back and rearchitecting things. So you might be able to do that retroactively. But I think just being really thoughtful about how you should split your services will get you 90% of the way there.
Ganesh Datta (07:09):
Yeah. To your point, Cortex for that very reason from the early days was a monolith purely, well, A, because we exist to solve those problems.
Steve Evans (07:18):
Serving microservices. Yeah, that's hilarious.
Ganesh Datta (07:21):
Yes, exactly. For all the same reasons that we serve customers with microservices, we were like, "We should not make those same mistakes." And so it's only recently that we started breaking things out of the monolith and trying to be intentional about it.
(07:33):
But yeah, the monolith got us very, very, very far. It works well. And especially if you can use the right patterns and create the right boundaries within your monolith, there's no reason why you can't get some of those same advantages. But that being said, there obviously are costs to monoliths. Like you said, we moved to this microservices world for good reason. Builds take longer and tests take longer to run and CI gets more complicated and deployments are less frequent because it's a giant thing and then you have multiple versions of this thing running. And so naturally we're like, okay, well, one of the things that's causing this is the size of the thing. And so if we make the size of the thing smaller, then tests run faster or whatever and builds are quicker. But it was very much an engineering-led initiative to solve the pain that they were facing with monoliths.
(08:23):
So to this point, it's like, okay, well, you would assume that engineering organizations are now feeling the pain of microservices. Even as the team made it really easy to spin up services, they were like, oh yeah, alerting and monitoring and dealing with incidents was really hard, but yet people still kept doing it. What do you think led to, "Hey, we should just keep doing more microservices versus realizing this was the same moment that we had with monoliths." Was there something underlying to it? You mentioned the ease of getting started, everything feels greenfield. What was the underlying reason that people were like, "I should just create a microservice." What led to that?
Steve Evans (08:56):
I mean, there's a reason people keep doing drugs even when it's ruining their lives. It's that decision in the ... I mean, seriously, it's very difficult. I think that this is one of those things where it's very difficult for the individual developer to do the right thing because the individual developer's being judged on that sprint, not on the health over the course of three years. And so showing up to the sprint demo with, yeah, it's not done, but I've incrementally made the engineering organization a healthier place in some immeasurable way. It just doesn't make sense. So it has to be a movement that is larger than the individual and then it competes against everything else. So I think it's one of those things that it's a lot easier to, like what you guys did, really resist the microservices. You guys were perfectly positioned because you were solving the problem as your purpose.
Ganesh Datta (10:07):
And we provide new service creation as a product.
Steve Evans (10:09):
Yes. Yeah. As a product, as a business, as a reason to exist. And so you were careful not to fall into the trap. Yeah. And so you were very careful not to go too far. Whereas most companies, they're solving a different ... Every other company in the world, bar a few, are solving a different problem. And so they accidentally get over their skis in this regard. And then unwinding that becomes incredibly difficult. When you have the complexity, like, okay, now we're going to spend cycles to arbitrarily wind this back, it becomes really difficult. It's going to take a long time.
Ganesh Datta (10:48):
There's inertia there.
Steve Evans (10:49):
There's inertia and it's just like, oh, we're going to take these five microservices and merge them together. It's got to be opportunistic. You're not going to do that just for fun. You're not going to do that. Doing that in the name of tech debt reduction, I don't think is a great idea. You got to wait for opportunistic reasons to do it.
Ganesh Datta (11:12):
Yeah. And then there's an incentive structure problem where the creation of the microservice the first time around probably looks a lot cooler for your own promotion than like, oh, I merged a bunch of microservices that we spent a ton of time breaking apart back into a bigger thing. And I think to that point, you're talking about how in a sprint demo or something, it's really cool to come in and say, look at all the progress I made on this thing. And maybe on the surface, it looks like you're moving a lot faster. We're shipping all this stuff and so many deploys and whatever. But there must be something that indicates things are not what they seem. Yes, we're shipping a ton of stuff. We're breaking apart the monolith. We're shipping all these microservices. Teams are moving faster. So in your seat, what led you to realize, hey, this is starting to actually be a problem?
(11:58):
Because like you said, the individual team may not realize that in their micro environment.
Steve Evans (12:04):
Well, where you start to see it as a problem is not in the sprints. I think the most common place you start to see it is in the incident, where something went wrong and now you're trying to figure out what. And someone pulls up the service map in your observability platform and it makes the New York City subway system seem simple. And you're like, "But wait a second, this seems more complicated than my business actually is." Yeah. And so then you're sitting there. And keep in mind, I was an engineering leader, I reported to the CTO, I ran a 300 person organization. So by the time I got involved with an incident, actively involved, it was not a good day. It wasn't like it was the first 15 minutes. It had been at least half an hour, usually more like an hour.
(13:00):
And so this was when the team was to the point where they're not a hundred percent sure what's going on. And more and more and more, it was, we've got this mesh of services and they're trying to ... They're essentially, and it's December 16th when we're talking. And so the most obvious analogy, they're unwinding the Christmas tree lights and trying to figure out how this thing works. And so that's an example where you really start to see the cost. I think there's a buried cost that's actually much bigger. I think it's very easy to look at downtime or customer impact of either downtime or service disruption. Degradation.
(13:44):
But I think that the really dangerous cost that gets missed a lot is the drag on developer productivity. So think about when there's an outage and, oh, just pay attention to how much time is spent figuring out the journey that a user is taking through your microservices. Obviously the user's not, but that traces, right? Now think about your developers as they go to develop in your ecosystem. They're doing that every single day. Now, yes, the ones that have been there for six plus months, they're working within their domain, they know it pretty well. Okay. But there's a cognitive tax they're dealing with. You've got a bigger tax you pay when you onboard someone. You've got a tax when someone moves from one area of your engineering organization to another, and then you pay the tax anytime someone tries to work across the organization, figuring this out.
(14:54):
I'm a West Coaster. I can go to New York and within the borough of Manhattan I can get around fairly easily, but there's a pretty big tax if I start to get outside Midtown. All of a sudden I'm like, "I'm pretty uncomfortable." So multiply that by 10 in complexity for some of the service maps I've seen in the microservices environment. And so that's the tax that builds up over time. And so then that's where leadership has to come in and you just have to say, "We are going to do this because it's the right thing to do for the business and we're going to start working our way back in a responsible fashion, even though it's not going to be sexy." I have another LinkedIn post coming out today about how every person should own a dashboard. And I stole this from someone and I really like the concept of being able to quantify a purpose you have into a dashboard, and maybe a dashboard someone could own, and it could be a relatively junior person, is like number of microservices, number of microservices per engineer.
(16:17):
You could break this down into different areas if you're a large organization. And yes, it's very difficult to say, "Oh, in this sprint, I'm going to go tackle this one example." Because that one example is not going to move the needle. But over the course of a year of, "Eh, when we get the chance, we're going to merge this, we're going to merge that, we're going to start to bring this stuff together," you could over time start to make a dent such that in 18 months when you have that outage, imagine removing every other subway station from that New York City subway map. I realize that's not great for a subway system, but my analogies are falling apart here. But for the usability of maintaining the system of microservices architecture, it would significantly improve not only your ability to recover from system impact, but then start to think, oh, how much did this improve our ability to develop and understand and onboard people and work across teams? That's the hidden cost that's very, very difficult to measure.
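The dashboard idea above can be sketched in a few lines. This is only an illustration: the `Area` type, the area names, and the counts are hypothetical, and in practice the numbers would come from whatever service catalog and headcount data an organization already has.

```python
from dataclasses import dataclass


@dataclass
class Area:
    """One slice of the engineering organization (hypothetical shape)."""
    name: str
    engineers: int
    microservices: int


def services_per_engineer(area):
    """Ratio a dashboard owner could track over time for one area."""
    if area.engineers <= 0:
        raise ValueError(f"{area.name}: engineer count must be positive")
    return area.microservices / area.engineers


# Made-up example data, not figures from the conversation.
areas = [
    Area("commerce", engineers=20, microservices=140),
    Area("identity", engineers=10, microservices=25),
    Area("search", engineers=15, microservices=90),
]

# Highest ratio first, so the areas carrying the most services per person stand out.
for area in sorted(areas, key=services_per_engineer, reverse=True):
    print(f"{area.name:<10} {area.microservices:>4} services "
          f"{services_per_engineer(area):5.1f} per engineer")
```

Tracking the ratio per area rather than one organization-wide number makes it easier to see where opportunistic merges would pay off first.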
Ganesh Datta (17:32):
Yeah. The assumption here is that the organization has realized that they've crossed the inflection point where they are paying this hidden tax now and it is worthwhile to move backwards from there. What dashboard are you looking at that tells you, "Hey, we're starting to feel this developer tax"? For example, if you're looking at cycle time, it actually might be reasonable. Your PR cycle time on these micro repos is relatively fast. They're small PRs, people are shipping and the builds are really fast and reviews are easy because it's within the context of a small window. And so your cycle time is maybe really fast. Maybe your single service uptime metrics are looking okay because it's so bounded. You have three feature flags, you can turn things off and on really easily, you have circuit breakers, whatever. So maybe at the micro level, things look relatively healthy.
(18:20):
But to your point, you in your position are kind of seeing the macro tax across the organization where things are slowly screeching to a halt because of the developer, the cognitive tax. And so have you found something that you can look at and say, "I'm starting to see this slow down across the organization," even if teams don't necessarily realize they're feeling that yet?
Steve Evans (18:40):
No, I think that's really hard. And look, I think there are some qualitative things you could look at. For example, if you have a truly greenfield product, how much faster does that move than something in the existing stack? Of course, it's going to be faster, but is it 10X or is it 5X? By what degree? I think you could just look at things like when you onboard a new engineer, how long does it take for that person to come up to speed and why? Is it tooling or is it just understanding a cognitive load? But I think one of the challenges in this space is agentic coding, AI coding has really ... I don't think it's broken measuring engineering productivity, but it has shown us how broken measuring engineering productivity is.
Ganesh Datta (19:41):
Tell me more about that. What's broken? I don't disagree.
Steve Evans (19:45):
Well, I think it's very clear that we were measuring, we have been measuring for a very long time, how fast engineers run on a treadmill, not how far they're going. There's an analogy I like to use when I talk to non-tech people about engineering productivity. I like to talk about how I break the world into miles per gallon and then actual outcome-based. And when I think of miles per gallon, I think of all the traditional DORA metrics, SPACE, doesn't matter. I actually really don't care on the details, but it's cycle times, release frequencies, sprint velocities, change failure rates, all your really standard stuff. God forbid, lines of code or PRs, whatever you want to talk about, but anything in that bucket. And then you've got things on more of the business side that could range from, so for example, on a commerce engineering team, like in a subscription business, you've got payment success rate, which is a super important metric.
(20:57):
It's basically at the end of the month, how many subscribers were able to successfully renew? Not how many people chose to renew, but on your Netflix subscription, at the end of the month, were they able to successfully renew? It's super important because if they weren't, you can't renew them the next month. A great way to erode a subscription business is having a low PSR. So let's use my commerce engineering team because it's super generic. Let's imagine you have a commerce engineering team that deploys eight times a day. Their sprint velocity has been increasing 5% a month for the last 12 months. Their cycle time's really low, their change failure rate's really low, et cetera, et cetera, et cetera. And their PSR has been degrading 1% a quarter for the last two years. That to me is the equivalent. You live in the Bay Area?
Ganesh Datta (21:46):
I grew up there.
Steve Evans (21:48):
Okay. You're in San Diego.
Ganesh Datta (21:49):
San Diego now. Even
Steve Evans (21:49):
Better. San Diego now. You're in San Diego. I'm in Seattle.
Ganesh Datta (21:52):
Yeah.
Steve Evans (21:54):
It's the equivalent of you having a car that gets 60 miles per gallon and you're going to drive to visit me in Seattle, get great gas mileage, but you're heading south.
Ganesh Datta (22:04):
Yeah.
Steve Evans (22:05):
You get great gas mileage, but you're never going to get there.
Ganesh Datta (22:08):
Yeah.
Steve Evans (22:09):
Right?
Ganesh Datta (22:09):
Yeah.
Steve Evans (22:10):
Whereas if I've got a team who's improving their PSR consistently quarter over quarter, they release once a month, they don't even know what their sprint velocity is. They've never heard the term change failure rate, et cetera, et cetera, et cetera. Which team are you going to pick?
Ganesh Datta (22:31):
The one that's moving the business needle.
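To make the payment success rate example concrete, here is a minimal sketch. It assumes PSR is successful renewals divided by attempted renewals, and it reads "degrading 1% a quarter" as a compounding relative decline; the 95% starting point is a hypothetical number, not one given in the conversation.

```python
def payment_success_rate(successful_renewals, attempted_renewals):
    """PSR: share of attempted renewals that actually went through."""
    if attempted_renewals <= 0:
        raise ValueError("attempted_renewals must be positive")
    return successful_renewals / attempted_renewals


def psr_after_decline(starting_psr, quarterly_decline, quarters):
    """PSR after a relative decline that compounds every quarter (an assumption)."""
    return starting_psr * (1 - quarterly_decline) ** quarters


# Hypothetical starting point: 950 of 1,000 renewal attempts succeed.
start = payment_success_rate(950, 1000)

# Two years is eight quarters of a 1% relative decline.
end = psr_after_decline(start, quarterly_decline=0.01, quarters=8)

print(f"PSR at start:      {start:.2%}")
print(f"PSR after 2 years: {end:.2%}")
```

The point of the sketch is that a small quarterly slip compounds into a visible loss of renewing subscribers, and none of it shows up in deploy frequency or sprint velocity.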
Steve Evans (22:33):
And then if you've got a team that has a declining PSR and they look at it and they're like, "Yeah, every time we do a release, we have to roll three out of four back and it's because we have no tests. Our test coverage is horrific and then we're going to go solve for that." That's what you really want. You want someone who can tie their engineering metrics to the business outcomes. That's ultimately what we want. You guys released a report a few weeks ago, a month ago, whatever, and you'll correct me here, but it was basically with agentic coding, the amount of code we're producing has gone up and the amount of failures we have has gone up at the exact same rate or a very similar rate, which I think to me has just shown that our definition of ... You read all the headlines, engineering productivity is up thanks to AI, but you dig just one level under that surface.
(23:24):
And what does that mean? If we're not producing better business outcome, what does it mean that engineering productivity is up?
Ganesh Datta (23:34):
It's funny because if you think about it a little more, it's almost this chicken or the egg problem where you're shipping more and then you're rolling stuff back and then you're shipping more to fix the stuff you rolled back. And so how much of your productivity is actually just you fixing the stuff that you broke in the first place and it's just a never ending testing
Steve Evans (23:49):
Stuff? Oh yeah. That's a really good point. Yeah. So the original question was like, do you have a really good sense of when you pass the inflection point? Well, no, because I don't think as a discipline, as an engineering discipline, we have a really good sense of what ... We have a really good quantitative sense of what good is. We have a really good qualitative sense, but I think there's a gap here between our ability to measure this with high fidelity, high accuracy, and being tied to really strong business outcomes. This is a struggle.
Ganesh Datta (24:29):
Yeah. And I think that the classic failure of so many organizations, especially I think internal facing software engineering teams, is to not think about business outcomes. I talk about this a lot. Platform engineering teams that don't have a North Star business metric are doomed forever to just spin wheels and ship random stuff because they're not tied to a business outcome. And this is just the same thing at a macro level. But the hot take I was going to pose is, is it even worth measuring any of these intermediate metrics? What's the point?
Steve Evans (25:00):
So I think there is. So first of all, I think every engineering team needs to have a business KPI that they are very aware of and tracking. I think some teams, it's very simple, like my PSR example. And if you're a growth engineering team, they're usually very good at it, right? Yeah. And for others, it's a lot fuzzier and they got to work on it. I think there's examples where my developer experience team nailed it. Make it easier to deploy microservices. They saw a business need and they nailed it. Unfortunately, and it was 100% my fault. It just turned out to be the wrong business need or it wasn't a balanced business need. We missed the counterbalance to that exact business need. I do think the engineering metrics, as we talk about them today, have value in the sense of they are useful to understand anomalies and changes like trends.
(26:12):
If I've got one team that releases one fifth of the time as every other team I have, I'd like to understand that. There might be a very good reason why, but why is it? But I do think we need a next generation of understanding.
(26:33):
I have this joke, and this is probably going to get me in trouble, but engineers make things that don't fall down. I remember when we were software developers, not engineers, we upgraded ourselves to engineers at some point, but engineers make things that don't fall down, and that's not quite what we do. I like the aspiration, but to really aspire to that, I think we need to up-level the way we measure ourselves a bit. And I think we need to figure that out. I think there's a huge opportunity there for engineering metrics 2.0.
Ganesh Datta (27:13):
Yeah. Is there value, I guess, in going back much further in the basics? And what I mean by that is basically a lot of the noise that you're describing is in these intermediate metrics, like the cycle times and the PR sizes and the throughput and things like that, which are kind of secondary. They don't really capture business outcomes, but they're not granular enough for a team to act on.
(27:36):
So if you widen that aperture a bit more, you kind of have the raw inputs on one side, which is very much not related to metrics. And it's very much practices and behaviors. And it's like your production readiness type things. It's like when you're going into production, we actually don't care what your uptime is right now. Do you have an uptime monitor? Do you have an SLO defined? Are you doing the things we want you to be doing? Because as an organization, we've agreed that those things will lead to better outcomes. And on the other extreme, it's like, okay, well, actually let's throw away the intermediate metrics and we have this operational excellence thing that most organizations do or want to do, which is one level higher, which is things like incidents and SLOs and things that are capturing customer impact. And then you kind of drill into the bottom.
(28:22):
It's like, okay, well, if our SLOs are degrading across teams, well, that's capturing customer impact. Ideally, you have a payment success rate SLO on that particular team and you realize that it's degrading. Okay, well, now I can go figure out what's going on and I'll go look at incidents or rollbacks or whatever, the intermediate metrics there. And that gets translated down to very clear action items that the teams have to go and follow up with. And so maybe what most of the organizations should be looking at is operational excellence stuff, which is things that organizations have been doing for a while, and then practices, which are kind of things people want to do. And then maybe it's this middle ground of stuff, which are more, you dive into those when you need it, when you have a hypothesis and you're trying to figure out what's going the wrong way or what's trending the wrong way.
(29:07):
Does that work? Am I missing something in terms of the way you might want to structure that as an engineering organization?
Steve Evans (29:13):
I think everything you said makes a lot of sense. I think the really important detail is to be laser focused on the business context. What are you trying to accomplish as a business? And I'll give you an example. In there, you talked about Uptime just as an example. Uptime is not the be all end all for a business. For some businesses, if you're a retail shop, if you're amazon.com and it's between Thanksgiving and Christmas, I'm not sure there's a business, there's a metric ... Well, there's a bunch of dollars and cents metrics, but uptime is a super important metric. When I was at Chegg, I would joke that ... So for quick context, Chegg helped college students study. I joke that on Friday nights we could turn the website off and I'm not sure anyone would notice because the college students that studied on Friday night didn't need help studying as a general rule.
Ganesh Datta (30:19):
That's so true.
Steve Evans (30:20):
No, I'm exaggerating a bit. But honestly, I would take a multi-hour outage on a Friday night in July over a 15-minute outage on a Sunday night in the middle of December during finals, 10 times over, right? Yeah. And so downtime isn't the be-all, end-all. And so then you get into different stages of a company. Before I was at Chegg, I was at a company that did a diagnostic test for cancer patients. Doctors would send us a tissue sample. We would sequence the genome and then we'd send them a report on how they should treat the patient's cancer. And the engineers, they would talk a lot about uptime. And it's like the doctor sends us a tissue sample and then two weeks later we send them a report. You realize that downtime might affect the lab's overtime. It's a COGS issue potentially.
(31:27):
But actually the biggest risk to the business was accuracy of results and patient privacy. Disclose patient data or give the wrong results. Those were the two things that were going to put the business at risk. And so having that business context was super, super important. And so going back to what metrics should we focus on? Should we be laser focused on this? Should we be focused on that? Having your business context, maybe the business is getting ready to sell or to go public. And actually the only thing that matters is the financials and stop eating so many bagels in the break room because we're trying to increase our EBITDA margin or whatever it is. I had times where Chegg grew, I think it was 57% in 2020. And I had people coming to me saying, "I think I found a way to save $300 a month on the AWS bill." I was like, "These are not the problems we're solving right now." And so I also think what we should be measuring as an engineering discipline shifts based on what problems the business is solving, because I have this saying that my team hates now: this is not IT summer camp.
(32:52):
We're not here for kicks and giggles. We're here to produce business outcomes. And if you don't understand what the business outcomes are, then how are you focused on the right engineering metrics?
Ganesh Datta (33:10):
I like to say software is a means to an end. We've talked about high-quality code, whatever that means. Again, it's a means to an end. You could argue that writing maintainable code leads to better business outcomes because of X, Y, and Z, and we can move faster. It's less customer impact. But again, it's a means to an end. It's very much a trade-off. And so I think that totally resonates. I wrote an article I think maybe five years ago about SLOs and the failure modes of SLOs, and it's very much like people just pick the golden signals and they just stick with that and it's like, that's it. But to your point, for most teams, it's not particularly useful. If you own a background system, like a task processing system, actually maybe it doesn't matter if your worker is up or down. What matters is the throughput of your ability to handle those tasks, depending again on the priority of those tasks and what is the desired latency and all that stuff.
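One way to picture a throughput-based SLO for a task processing system, as opposed to an up/down check, is to ask what fraction of time windows met a throughput target. This is a simplified sketch under stated assumptions: the function name, the one-minute windows, and the target values are all hypothetical, and a real SLO would also account for task priority and desired latency as mentioned above.

```python
def throughput_slo_met(completed_per_window, target_per_window, required_fraction):
    """True if enough windows met the throughput target.

    completed_per_window: tasks completed in each fixed window (e.g. one minute).
    target_per_window:    minimum completed tasks for a window to count as good.
    required_fraction:    share of windows that must be good, between 0 and 1.
    """
    if not completed_per_window:
        raise ValueError("need at least one window of data")
    good = sum(1 for done in completed_per_window if done >= target_per_window)
    return good / len(completed_per_window) >= required_fraction


# Made-up sample: tasks completed in four consecutive one-minute windows.
samples = [100, 120, 90, 110]

# Three of four windows reach 100 tasks, so a 75% objective passes and a 90% one fails.
print(throughput_slo_met(samples, target_per_window=100, required_fraction=0.75))
print(throughput_slo_met(samples, target_per_window=100, required_fraction=0.90))
```

Note that a worker restarting inside a window does not fail this check as long as the backlog is still drained at the target rate, which matches the point that worker uptime is not the signal that matters here.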
(34:00):
And so I think that business context is really important. And it's actually a good segue to maybe the final topic I wanted to touch on, which is how do you disseminate business context? You had another post about this, which was we have a-
Steve Evans (34:13):
Are you my one viewer? LinkedIn sends me an email every week: "You had one view." I might be- I now know who it is.
Ganesh Datta (34:19):
Yeah. I might be that one impression. But yeah, it was basically like there's a game of telephone that happens, actually starting from the CEO to the CTO, to the VPs, to the directors, to the managers, to the ICs, and things kind of get lost along the way and you end up solving contrived problems as a result. Maybe for the last couple of minutes here, we can quickly talk about what causes that problem and what have you seen work to help reduce that missed handoff of context along the way and keeping that working?
Steve Evans (34:49):
Well, I'll just start. What causes the problem is everyone doing their job. When you hire a junior engineer straight out of college with a computer science degree, they think their job is to write code, and it's mostly true. And I'm pretty forgiving of that. They're very focused on the technology and they're not very cognizant of the fact that they're part of running a business. And so when the CEO gets up in all hands and talks about how we're closing this deal with such and such customer, and it's a big deal, it's usually in one ear and out the other because they don't care. It's not important to them. It's not going to help them deliver their sprint deliverable. And that's just human nature. And that's the extreme, right? But I think how big is Cortex today, just for context? Employees.
Ganesh Datta (35:46):
About roughly 100.
Steve Evans (35:47):
100 employees. Yeah. And so when I met you guys, you were probably 30. I mean, it might have been you, your dog, and a few other people. So you've gone from 30 to 100. So the number of relationships, it, like, exponentiates. It's one of those problems. So you can imagine when it goes from a hundred to a thousand, it's not 10 times larger. I think it's a thousand times larger, if I'm doing the math right. So this is one of those hockey stick problems. And so in my post, I talked about how ... What happened was someone asked me, how do you disseminate information? And there's a lot of people like, "Oh, you need to have all hands. You need to be communicating a lot." And I think there's value in all hands and stuff like that, but I don't think it conveys context very much, for the reasons that I talked about. And let's just use the example: we're signing this big client.
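Editor's note: Steve hedges his arithmetic here, so a quick check may help. This is a minimal sketch that assumes "relationships" means distinct pairs of people, which gives n(n-1)/2 pairs for n employees. The function name is hypothetical and the headcounts are the ones mentioned in the conversation. Under that assumption, going from a hundred to a thousand people multiplies the pairs by roughly 100 rather than 1,000, which still supports his point that the growth is far faster than headcount.

```python
def pairwise_relationships(headcount: int) -> int:
    """Count distinct two-person pairs in a group of `headcount` people.

    Assumption: a "relationship" is one unordered pair of people.
    """
    return headcount * (headcount - 1) // 2


# Headcounts mentioned in the conversation: 30, 100, and 1,000.
for headcount in (30, 100, 1000):
    print(headcount, pairwise_relationships(headcount))

# How much the pair count grows when headcount goes from 100 to 1,000.
growth = pairwise_relationships(1000) / pairwise_relationships(100)
print(round(growth, 1))
```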
(36:42):
What matters in that context varies greatly between the legal team, the HR team, accounting, engineering, inside of each individual engineering team. Why that matters varies a lot. And the CEO, honestly, doesn't have a clue how that matters to individual teams. And so my advice was essentially along the lines of, okay, so you've got the CEO talking to the CTO, hopefully they're pretty in sync. If not, you've got much bigger problems, like game over. Then you've got a CTO talking to their VPs. That's usually pretty healthy. Again, if not, we've got bigger problems. I find that usually it's either between the VPs and the directors or the directors and the managers where the cascade of context starts to break down. If you start looking at this in reverse, you have an IC that becomes a manager at some point in their lives, and they're really focused on learning how to manage a team, on how their value has shifted away from being able to, hands on keyboard, create results, to essentially, with their mouth, create results: with their influence, with their communicating with others, produce results.
(38:07):
That's essentially what a manager is figuring out how to do. I think it's the, just as an aside, I think it's the toughest career transition that people go through. And then at some point, managers become directors. And I think in most organizations I've seen, we don't do a great job of helping them understand what that means. And because of that, most people have never had a great example of what being an engineering director is, and therefore this problem works its way up.
(38:41):
And so the CTO tells the VPs, "Here's some context. We are signing this major client and the success of this will make or break this company." Let's use Cortex as an example, and I'm going to 100% make this up if any potential investors are listening. We're signing Walmart, and that is going to make or break Cortex. So success of Walmart matters. Walmart has these unique attributes about it we've never seen before. We cannot wait. Going into Walmart and waiting for all these things to become a problem is not an option. We have to get in front of it. That's the conversation at the C-suite level. They all understand that and everyone's going off to get in front of the problems we're going to have with Walmart. Everything from, "Oh, our accounting system doesn't know how to have that many integers in an invoice," to, like, "We've never seen this many microservices in our life," to et cetera, et cetera, et cetera. And then CTO talks to the VPs, the engineering VPs, and they're talking about what challenges they're going to have to work through.
(39:57):
And then VPs talk to the directors and then I find, let's just pretend this is where it breaks down. VPs talk to the directors about this and the directors haven't learned yet because they didn't have great directors when they were managers. They haven't learned yet that their job is to then go talk to their managers to cascade this information, that this is coming. And for example, this is what it means. Okay, commerce managers, this is what we need to go get ready for. This is what we need to start thinking about. Oh, this user story that's coming up. Yeah, we haven't signed with Walmart yet, but Walmart's coming and if it's not Walmart, it's going to be Amazon or it's going to be Kmart or I can't think of other big companies anymore. We need to start thinking like this because the business context has changed.
(40:48):
We're no longer selling to ... We're no longer focused on one billion to 10 billion ARR companies. We're looking at Fortune 10 companies. So our context has to change. And if you essentially think of it as water trickling down, there's blockages. And so what happens is the commerce team, we're really picking on commerce today, they don't get this context. So they keep being like, well, we're never going to have customers that big. So who cares? Oh, descope that. It doesn't matter. Or you've got product that's like, oh, we got to get ready for the Walmarts of the world. Let's start adding this thing. And they keep deprioritizing it. Like, this stuff starts trickling, and they just keep deprioritizing it. And so I think you essentially have to find that spot where the context stops and you've got to unstop it. And you just got to be really explicit with, in this example, the director level, like, a big part of your job is cascading context.
(41:56):
I always say, when you're a people leader, the most important thing you do is hire. If you hire the right people, you can suck at every other aspect of your job and you'll do okay. And then after that, it becomes giving people the right context so that now that you've hired the right people, they have the context to do the right things. And then at our peak, I ran a 300 person org. So figuring out, let's go back to my plumbing example. I want to figure out, did I fix the block? As my org grew, I didn't like that I no longer knew what was going on. And so I started ... And I don't like all hands because once you have an all hands meeting of more than like 30 people, you only have a certain type of person who actually speaks up.
(42:43):
And it's usually the type of person you don't want to have speak up. It's usually the type of person you're like, "I wish you wouldn't have." So I started having round tables of like 10 to 20 people. I'd basically take my org chart, I would sort it by manager, and then I'd just round-robin assign people. And so I would get groups of people that were mixed up by ... It wasn't by function, because if you meet by function, then they want to talk about their function. I don't want to talk about their function. I meet with them as a team or with their leaders if I want to talk about commerce engineering, but I want to talk about what's going on. I mean, I would call these round tables and it was 10 to 20 people, just kind of depended on what was going on.
(43:30):
Ideally in person, but obviously that wasn't always possible. The best versions of these is they would ask me questions and then I could talk about it. And then obviously, kind of like our conversation, I could just kind of talk about whatever I want to talk about regardless of what you ask. But the best versions of these were when everyone started talking to each other. Someone would say, "We have a lot of microservices." Now someone could get my opinion on that, but even better would be four or five people from different teams would start talking about how-
Ganesh Datta (44:01):
Their pain and their...
Steve Evans (44:02):
And their environments get pretty complex. But what would happen is I would have a really good sense of what everyone was feeling and thinking and seeing, but I'd also get a sense of what context people were missing, because I would say like, "... and then when Walmart shows up," and if everyone was shaking their head, I'd be like, "Okay, cool." Or there'd be some people in the room be like, "Walmart?" And then I could pay attention and be like, "Oh, everyone that's under Director Bob has no idea about Walmart. That's a problem." And then you go fix it. You figure out why Bob's not cascading. Or you figure out, "No, Bob's cascading, but Jimmy, manager Jimmy's not." And it actually wasn't Bob, it was Jimmy, whatever, right? Yeah.
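Editor's note: the round-table assignment Steve describes (take the org chart, sort it by manager, then round-robin people into groups) can be sketched in a few lines. This is only an illustration under stated assumptions, not his actual tooling: the function name, the (person, manager) tuple layout, the table_size parameter, and every name in the sample roster are hypothetical.

```python
def assign_round_tables(org_chart, table_size=15):
    """Deal people into mixed round tables.

    org_chart: list of (person, manager) tuples (a hypothetical layout).
    table_size: target number of people per table (assumed parameter).
    """
    # Sort by manager so each manager's reports sit next to each other.
    ordered = sorted(org_chart, key=lambda row: (row[1], row[0]))
    # Ceiling division: enough tables that none exceeds table_size.
    num_tables = max(1, -(-len(ordered) // table_size))
    tables = [[] for _ in range(num_tables)]
    # Round-robin: consecutive people (teammates) land on different tables.
    for index, (person, _manager) in enumerate(ordered):
        tables[index % num_tables].append(person)
    return tables


# Hypothetical six-person roster with three managers.
roster = [
    ("Ana", "Bob"), ("Ben", "Bob"),
    ("Cy", "Jimmy"), ("Di", "Jimmy"),
    ("Eve", "Kim"), ("Fay", "Kim"),
]
tables = assign_round_tables(roster, table_size=3)
print(tables)
```

With this roster each table ends up with one report from each manager. When a team has more people than there are tables, some teammates will still share a table, so the mixing is approximate rather than guaranteed.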
Ganesh Datta (44:49):
I love it. It's very much like a similar approach to what we were talking about earlier with the metrics. It's you start with the business metrics, the thing that you care about. In this case, it's context. And then you kind of measure that and you figure out a way to measure it. That might be your payment success rate, and then you realize, "Oh, well, actually this is trending down. What's going on? Okay, well, we've had incidents of these classes and this is what's causing it." It's basically just a debugging problem, human versus system.
(45:13):
And to your point, it's like as an engineering organization, you would think that we would pick up those lessons. But to your point, it's like your managers are kind of trained to, or they see their success as, well, I was getting graded on how many tickets I pushed out. And so I'm going to now make sure that my team can push out a ton of tickets. And then when you go to become a director, it's like, okay, well, I was being graded on, or they assumed that they were being graded on, how many tickets they were able to make their teams push. And so now the director's like, "How many tickets can I make my managers make their team push?" And so you kind of end up in a state where you don't realize that at a director level, you're starting to really think about it as it's a business level role.
(45:53):
Your job is to make sure you're solving the right business problems and figure out the things that are preventing the business from moving forward. It becomes very little about the tactical versus the business. And the same thing holds true for your staff plus folks. You want your staff plus engineers to be thinking about business problems as well.
Steve Evans (46:11):
I think ultimately you want everyone to. Now I'm pretty forgiving of junior-level folks who are not super business oriented, but I think you need to start. We ask five-year-olds to start to read. We don't ask them to read novels, but we start the process and I think you just need to start introducing, start moving them along. I think to be fair to directors, they start getting introduced to confidential information. And I don't think we do a good enough job of being really explicit: you need to cascade this; this is not to be shared. And I think there's some ... I don't mean to just blame those below me. I think there's some explicitness that needs to be given there as well. But I think in general, just wrapping up this whole concept around metrics and cascading and everything is, I think as engineers, we get very binary about this stuff.
(47:08):
Can I measure this perfectly? Oh, I can't, so I'm not going to. And I think there's a lot of opportunity around qualitative analysis of an engineering organization, engineering health. I think it's a Jeff Bezos quote that says, and I'm going to butcher this, "You should look at the data, but you should listen to the qualitative. You should listen to the sentiment. If the data looks good, but the customers are upset, listen to the customers."
Ganesh Datta (47:34):
Listen to the anecdotes.
Steve Evans (47:36):
Something like that. So yeah, uptime's great, but if everyone's burned out trying to keep the systems up, you got a problem, or whatever version of that it is, right? It's not just purely the ones and the zeros and the metrics. You got a bit of both you got to do.
Ganesh Datta (47:54):
Yeah. And at the end of the day, it all comes down to the business. I think that's the number one takeaway is don't do contrived things for the sake of it and don't optimize for the wrong things. The thing you're optimizing for is the business. But Steve, this was a great conversation. Lots of takeaways. I personally will have a lot of LinkedIn posts, I think, from the stuff that we talked about today. It's going to be a race to see. I appreciate it. It's going to be a race to see who can talk about it first, but I think you have these lined up already, so we'll be rooted to chase. Steve, thanks for hopping on. This was awesome. Really enjoyed the conversation.
Steve Evans (48:27):
Have a good one.
Ganesh Datta (48:28):
You too.
(48:35):
Thanks so much for listening to this episode of Braintrust. If this resonated with you, do me a favor. Share it with another engineering leader who's wrestling with these same challenges. And if you want to continue the conversation or learn more about how we're thinking about internal developer portals at Cortex, reach out to us at cortex.io. Thanks for listening, and we'll catch you on the next one.