Candid conversations with the builders shaping the future of engineering.
Braintrust dives into the operational realities of running high-performing engineering organizations, from production readiness and migrations to AI adoption and operational excellence.
Hosted by Ganesh Datta, CTO & Co-founder of Cortex
Fred Mare:
Expectations from customers these days are extremely high. They want features available. They want features to be well polished. So production readiness is contributing to ensure that if I'm shipping out a new feature or I'm giving a new experience to a customer, it's going to work and it's going to be a seamless experience from day one.
Ganesh Datta:
You're listening to Braintrust by Cortex, where we explore how engineering leaders blend AI, platforms, and culture to build high-performing software teams. I'm your host, Ganesh Datta, CTO and co-founder of Cortex, an internal developer portal designed to help engineering teams ship reliable software faster with AI. In each episode, we go deep with CTOs, VPs of engineering, and technical leaders who've been in the trenches, navigating the tension between speed and quality, building reliability at scale, and figuring out how to lead through major platform shifts. Whether you're running a team of 10 or 1,000, this is your space to learn from people who've made the hard calls and lived to talk about it. Thanks for being on the podcast.
Fred Mare:
Thanks for having me.
Ganesh Datta:
So I'm Ganesh, I'm one of the co-founders and CTO of Cortex. If you want to quickly introduce yourself.
Fred Mare:
I'm Fred Mare. I'm a principal engineer at Xero and I'm from Melbourne, Australia.
Ganesh Datta:
Well, I'm excited to dive into all things production readiness, what that even means. How do you think about it? How do you roll it out? How do you get people to care? So maybe we'll start at the very top. What is production readiness and why do we care about it and why do you care about it?
Fred Mare:
I mean production readiness really starts with one key thing, and it's key to everything we do as an organization and as engineers: the customer, customer focus. So if we think about what we want to ship out to the customer, new features, new experiences, production readiness is the core part of that, and it's really about being in a position to answer the question, am I ready to ship out this feature to my customer? Am I ready? Is it going to be secure? Is it going to be available? Is it going to give a good experience to the customer?
Ganesh Datta:
The other framing I've heard is more of an internally focused approach where it's like, are we ready to support the service? So I guess why do you think about it from a customer lens versus an internal readiness lens?
Fred Mare:
I think the two are intertwined.
Ganesh Datta:
Yup.
Fred Mare:
If you think purely from an internal perspective, you're missing the focus on the customer. It's important that, as a team or a division owning the service, you have a clear indication of what you'd be able to ship out and how confident you are owning and then supporting it. But if you stop at that point and you don't think about when it gets to the customer, I think you're missing an extremely important part of that.
Ganesh Datta:
Yeah, I like that framing. It's almost like the customer lens is a superset of what you might think about otherwise. I guess in practice, what does that look like? What is net new in a production readiness program that is included when there's a customer focus versus not, like SLOs? Is it more of the deliverable on the other side? What does that lens look like?
Fred Mare:
I mean there are key things that would contribute to that, and I think there's not a clear separation between the two, but SLOs, absolutely. The ability to be aware of latency. Probably a good example would be that you're shipping out a new feature. It's available in production, it's live. You want to be aware of whether it's functioning correctly before your customer would know about it. It's about synthetic testing, having those signals to give you an indication that, hold on, I just shipped this out and things are going wrong. I want to pick that up as quickly as possible so that I can solve the problem, whether we roll it back or push out a new fix, to improve that experience before the customer knows about it.
Ganesh Datta:
What does a production readiness program look like? What does it entail? Where does it start? Where does it end? What does it actually mean in practicality?
Fred Mare:
If you think about production readiness, at the core, it is understanding what the expectation is. That can be a standard, for instance an engineering or reliability standard. So you're giving your engineering teams a clear understanding of what the expectation is: when you want to ship something, these are the things you need to adhere to. Principles form part of that, and then once you have that, you want to create visibility. A good analogy is if you're driving a car, you have a dashboard. If you're going to be driving to Australia from New York, which is impossible, you want to make sure that you have either enough fuel or your EV is fully charged, and those indicators give you an indication of where you are now. If you don't have those indicators and you just get in your car and you're going to drive all the way to Melbourne, hopefully you're not going to do that, please don't do that, Ganesh.
If you're on the way and you don't have any indicators, you never know when things are going wrong, and that's where having the confidence to know that everything's fine when I'm starting off, when I'm going to ship something out, comes in. That's really when you've filled up your vehicle and no gauges, no signals are telling you things are going wrong. Once you have that confidence, you're in a position to know that, all right, I'm comfortable shipping that out. So for us, from a production readiness perspective, we look at a confidence score as an example.
Ganesh Datta:
What is a confidence score?
Fred Mare:
Very good question. The confidence score for us is there to give an indicator to our teams. Think about making a small change: you take it from your local environment and you push it into your source control, which normally is just GitHub. Once you've done that and you have a PR open, the confidence score is the ability to go through multiple signals to get an indication of the implications of that change. If you ship this change out to a customer, what will be the implications? Those signals can be about quality from a testing perspective, so using something like SonarQube and test coverage, or they can be around the number of lines of code that you're pushing in as part of that change. Large PRs mean high risk to an extent, so you're going to want a small change, and that can contribute to that confidence score and give a signal to say, "Look, if I take this new change and I deploy it, based on that calculation, the probability of it having an impact on a customer would be quite small."
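To make the idea of a per-change confidence score concrete, here is a minimal sketch in Python. It is not Xero's calculation, which the conversation does not spell out. The signal names, the weights (50/30/20), and the 1,000-line cutoff are all assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChangeSignals:
    """Signals gathered for one pull request (hypothetical fields)."""
    test_coverage: float       # fraction of lines covered, 0.0 to 1.0
    quality_gate_passed: bool  # e.g. outcome of a static-analysis quality gate
    lines_changed: int         # size of the diff


def confidence_score(signals: ChangeSignals) -> float:
    """Combine signals into a 0-100 score. Weights are illustrative assumptions."""
    # Up to 50 points for test coverage, clamped to the valid range.
    coverage_part = 50.0 * max(0.0, min(1.0, signals.test_coverage))
    # 30 points if the quality gate passed, otherwise none.
    quality_part = 30.0 if signals.quality_gate_passed else 0.0
    # Up to 20 points for a small diff: full marks at 0 lines, none at 1000+.
    size_part = 20.0 * max(0.0, 1.0 - signals.lines_changed / 1000.0)
    return round(coverage_part + quality_part + size_part, 1)


small_change = ChangeSignals(test_coverage=0.9, quality_gate_passed=True, lines_changed=50)
large_change = ChangeSignals(test_coverage=0.3, quality_gate_passed=False, lines_changed=2000)
print(confidence_score(small_change), confidence_score(large_change))
```

With these assumed weights, the small, well-tested change scores far higher than the large, poorly tested one, which matches the intuition that large PRs carry more risk.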
Ganesh Datta:
So is it safe to say the confidence score is like a delta in the production readiness for a service or something like that?
Fred Mare:
Yeah, it's the starting point. Once you have a single change, you get to the point where you're going to use your CI/CD pipeline and start deploying that out into an environment. If you want to deploy multiple changes, multiple change scores would then be aggregated for that deployment, and you start bringing those together to be in a position to know that if I take change A and B and put them together, based on the score, I'm still in a good position for that to go out to production.
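The aggregation step can be sketched as follows. The conversation does not say how change scores are combined, so this example assumes a "weakest link" rule, where the deployment is only as trustworthy as its riskiest change. The function name and the rule itself are hypothetical.

```python
def deployment_confidence(change_scores: list[float]) -> float:
    """Aggregate per-change scores (0-100) into one deployment score.

    Assumption: the lowest-scoring change dominates, so we take the minimum.
    An average or a weighted combination would be equally valid choices.
    """
    if not change_scores:
        raise ValueError("a deployment needs at least one change score")
    return min(change_scores)


# Change A and change B bundled into one deployment.
print(deployment_confidence([94.0, 72.5]))
```

Taking the minimum is conservative: one risky change cannot be hidden by several safe ones in the same deployment.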
Ganesh Datta:
I want to get back to how this fits into production readiness, but maybe before we talk about that, if I don't have a production readiness program today or a shared definition of what good looks like and what the expectations are like you said earlier, where do I start? Are there common things that every company or engineering team should be looking at when it comes to production readiness? If not, how do you design the right set of expectations? Where do you start?
Fred Mare:
I think from an engineering perspective, there are key things that we all should be aware of. At Xero, we use a PRC, a production readiness checklist, and we also have operational readiness. So for every change that goes out, we have a clear indication that the practices those standards refer to have been implemented for the service that we are going to be deploying. Key things as part of that: one, we talked about quality. Testing for me personally is probably the most important part of it, to understand if this change goes out and things go wrong, what's the implication for the customer? And then if it does impact the customer, how quickly can I roll back? What is that rollback strategy? And then from a supporting perspective, having runbooks available, and the type of deployment, for instance, canary releases, is that possible?
Feature management is another example, which I'm a big fan of. The expectation is that if you're going to deploy a new change into production, that feature is behind a feature flag, and then progressive rollout is another opportunity to reduce the blast radius of what is going out to production.
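A progressive rollout behind a feature flag can be sketched as below. This is a generic percentage-bucketing approach, not a description of any particular feature-management product. The flag name, user IDs, and the `is_enabled` helper are hypothetical.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage.

    Each (flag, user) pair is hashed into a stable bucket from 0 to 99, so a
    user who sees the feature at 10% still sees it when the rollout grows.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


users = [f"user-{i}" for i in range(1000)]
exposed_at_10 = [u for u in users if is_enabled("new-invoice-editor", u, 10)]
exposed_at_50 = [u for u in users if is_enabled("new-invoice-editor", u, 50)]
print(len(exposed_at_10), len(exposed_at_50))
```

Setting the percentage to 0 acts as a kill switch, which is one way to limit the blast radius without redeploying.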
Ganesh Datta:
So it sounds like maybe distilling it, it's like there's a category of prevention. It's like, "What are the practices we can follow to prevent something from going wrong?" The second category in a production readiness checklist or operational readiness checklist is detection. So if something goes wrong, we want to know about it. From there, it's like mitigation. So if something goes wrong, we know about it, are we able to mitigate it easily?
Fred Mare:
I would say if we go back to the analogy of driving a car all the way to Melbourne: at the beginning you've got your confidence, so you know everything's fine. As you're driving, if you get a sense that things are going wrong, you can make a decision. You can say, "Okay, car's overheating, I can stop." And then there's your ability to reach out, phoning AAA to come around and help you establish what is wrong. A key thing that I wanted to add to that is not all changes are the same, and that's an important thing to call out. So what we do is we look at the product and we really get a sense, and this is where the customer focus comes in again. We really want to understand the jobs to be done.
Ganesh Datta:
By each service.
Fred Mare:
Every service. So from a service perspective, but also the product itself. We want a clear understanding of what those are, and then based on that, we can create a criticality tiering. We can understand how important that job is, or the jobs for that service, and then based on that, we give it a rating, and that contributes to the reliability standards that I was talking about earlier. Where it's a low-risk area, the expectation would be lower in comparison to something that is touching a large percentage of customers.
So we really want to ensure that teams that own highly critical services clearly understand that, from a testing perspective, and test coverage is a debatable thing, you're not going to send something out with 30% test coverage. It also helps as part of our software design reviews to enable the team. Principal engineers have a responsibility to review and really get an understanding of the changes we make, the design and implementation of changes to a service. It's really to understand, if we make change A, what is the impact, and criticality tiering forms a very important part of that.
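One way to picture criticality tiering driving the standard a service must meet is the sketch below. The tier names and coverage thresholds are assumptions for illustration only. The conversation says only that a highly critical service should not ship with 30% test coverage.

```python
from enum import Enum


class Tier(Enum):
    """Hypothetical criticality tiers, from most to least critical."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Assumed minimum test coverage per tier; the real numbers are not in the source.
MIN_COVERAGE = {Tier.HIGH: 0.80, Tier.MEDIUM: 0.60, Tier.LOW: 0.40}


def meets_coverage_standard(tier: Tier, coverage: float) -> bool:
    """Check a service's coverage against the minimum for its tier."""
    return coverage >= MIN_COVERAGE[tier]


print(meets_coverage_standard(Tier.HIGH, 0.30))  # highly critical service, 30% coverage
print(meets_coverage_standard(Tier.LOW, 0.45))   # low-risk service, 45% coverage
```

The same lookup pattern could hold other per-tier expectations, such as whether a runbook or a rollback plan is mandatory.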
Ganesh Datta:
One of the things that this brings up for me, and I think this is more important than ever: has your idea of production readiness changed at all with the advent of AI coding assistants? Do you think about production readiness differently? Is it more important? Is it different? How has it changed, if at all?
Fred Mare:
Look, I mean AI plays an important part. For me, it's not just changing production readiness, it's changing the way we look at engineering as a whole. I think one of the key questions it raises when we're talking about production readiness is, where's the code really coming from? We're using AI assistants, and we're using them from a code review perspective. We're using that AI support to get a clear indication of the change engineer A is making. It's really helping us to create that additional visibility, but we also want to ensure that whatever we're shipping out to our customers is robust and strong and is doing what it's supposed to be doing. So it's bringing an additional level of awareness, but it doesn't change things to the extent where there's a clear separation. If you think about the code that you're shipping, it may be coming from an AI assistant that is giving you a recommendation on a function, it's generating your unit tests for you, and you're very happy about that.
The human factor still comes in: before that goes out to any customer, it is being reviewed. We still want to make sure that the human element is there to make sure that everything we do is being verified, so that when it goes out to production, it's not going to cause a lot of challenges or problems along the way.
Ganesh Datta:
Does that mean that production readiness is more important now than it was in the past? You talk about the human element and I think about maybe developers who own a service, you could say okay, they understand their service, and so maybe you get around some of the requirements for production readiness. Not that you should, but you could argue that maybe people knowing their services intimately could help prevent certain classes of issues. With AI writing more of that code, does the adherence to production readiness process, is that more important now?
Fred Mare:
I don't think so.
Ganesh Datta:
Okay.
Fred Mare:
It might be funny to say no, because it's this massive shift. I think what makes production readiness extremely important is the dependency on the systems that we are building. For me, it's always about the customer. It's the user at the end of the day. If you think about Xero, we're providing services for small businesses, like core accounting, payroll, workforce management, and expenses. If you're a small business and you're already very time poor, you want to feel confident that when you're using a system, it is doing exactly what it's expected to do. So production readiness contributes to that. Expectations from customers these days are extremely high.
They want features available, they want features to be well polished. So production readiness is contributing to ensure that if I'm shipping out a new feature or I'm giving a new experience to a customer, it's going to work and it's going to be a seamless experience from day one. If it doesn't work, then we're in a position where we can roll back quickly and the customer doesn't feel frustrated. AI, AI assistants, agentic AI, all of these things that come into play are really just helping us contribute to that experience, but I don't think they're elevating production readiness to the point where suddenly it's becoming a critical thing. It's probably just another signal that comes into play, and it's something we need to be aware of because it could have an impact, but I think from an engineering practice perspective, things are very similar for me.
Ganesh Datta:
Yeah. When we talk about production readiness, is that a one-time thing? You mentioned that as a principal engineer, you need to review things. Is it that you've been reviewed and you get the badge and now you're ready and you can go, or is it an ongoing process? How often do you look at it? How often do you think about the readiness of a service, of a change, whatever that might be?
Fred Mare:
So we separate the two. We have the software design review, and that happens earlier. We're shifting left, similar to when we talk about quality and testing: we really want to get it closer to the engineer, to the point where they're making the change. If things are not really 100% correct, then we can adjust quite quickly. So that's the first phase of it. Then there's the operational readiness side and that confidence score, from an individual change to when we deploy. For a service, we may have that aggregated view with additional signals, more for getting a sense of history, such as custom metrics if you think about using Cortex as part of that, to bring in that kind of data and really see how things are changing. That is a continuous thing.
If you think about us making changes on a regular basis, the expectation is that we're making a lot of changes. Services are changing the whole time. It's a living thing. So I don't see it as you're making a change today and then six months down the line, you go like, "Wow, I'll look at my confidence score now," or "I think I'm production ready." If you're making a change, it will do that kind of calculation. Every time you deploy, it's really going to give you that sense of confidence that whatever you're going to send out to the customer is working correctly.
Ganesh Datta:
How do you prevent overloading developers with yet one more gate or requirement that they now have to think about, like, "Oh, I have to think about my production readiness and my confidence score and all these things." How do you make sure that it doesn't become a burden for engineers?
Fred Mare:
I think that's where automating it as much as possible comes into play. We've learned from doing reviews, as an example, and that's why we want to bring it closer to the developer. You want to get the engineer or the team to be in a position to say, "Look, this is the change we're making to a service." And between the group that looks at it and the team that owns it, at that point we really want to be in a position to give feedback early.
So we don't go and invest 60% or 80% of time and capacity, and then when we get to that point, they go like, "We need to change it again." And when you think about that confidence score, the reason we're investing in that is to take the heavy lifting away from the team. So the only thing the team has to worry about is that when they're pushing a change, they have the indicator, they have the visibility of the implication of that change. They don't have to go to multiple places to get a sense of it: right, I've got to go to SonarQube and maybe I'll look at some metric, and then I've got to go to another system, or I've got to gather my whole team together and we're going to have a meeting or two meetings on a weekly basis to come together and agree, "Okay, we are ready for production now." We really want to take that away. All the things we do, and that includes the AI contribution from tools that we bring into the space, are really to enable the engineer to focus again on the most important part.
Ganesh Datta:
Is the customer.
Fred Mare:
Is the customer.
Ganesh Datta:
Yeah. How and who should be looking at production readiness? Is it a team's responsibility to continuously review their own services? Is it more of an operational excellence thing where leadership is the one consuming the information? Is it both? Who are the right people to be looking at this? Is it tech leads? What does that demographic look like, I guess?
Fred Mare:
It's not one group per se, but it's at different levels and different times. So it's the engineers closer to the ground, closer to the change we're making. For them, it would be to ensure their confidence in deploying. Then once you start aggregating and bringing all that data together and get some form of history, that's really when it comes to management, your engineering manager or even higher, really getting an indication of the score in that area. So if you start looking at a metric and you see that, for instance, things are going really well and suddenly it's dropping, it can give an indication that we lost focus. It can also give an indication that maybe there's been a vulnerability that comes into play, or we have a dependency on another team in that space that affects what we are in a position to deploy. So I would say it's just different focuses and different uses of that kind of information and those kinds of metrics.
Ganesh Datta:
And different granularity of that information.
Fred Mare:
Absolutely, yeah.
Ganesh Datta:
You mentioned production readiness and operational readiness. That's two different things. What is the difference? Is there a difference?
Fred Mare:
There's probably not a difference in true honesty. Operational is just to be in a position to when you deploy it, then you're going to start using it like it's ready. Production's the same thing. Production readiness is really just to give a sense of when it's in front of the customer. I would say it's probably the same kind of thing.
Ganesh Datta:
Yeah, it's different marketing.
Fred Mare:
Different marketing.
Ganesh Datta:
Yeah, but I do think there's some value in marketing for a reason, right? It's like it frames the way you think about what goes into it, what comes out of it, and so I think calling it production readiness and framing it, and maybe we should call it customer readiness, I don't know.
Fred Mare:
Okay. For me, the reason we're doing all of this is for the customer. If you say to me, "Should we do customer readiness?" Absolutely. Production readiness does give an emphasis. So even when you're talking to engineers, you use the language of, is that in production?
Ganesh Datta:
Yeah.
Fred Mare:
And I think that that's probably an important factor to consider that.
Ganesh Datta:
Yeah, the other thing I want to talk about is where in the process if at all do you gate on this? Do you prevent deployments from happening? If somebody's not meeting their production readiness standards, do you block a change from going out if that change has a low confidence score? Are gates part of this or is it more of a kind of side review process, if you will?
Fred Mare:
I mean there is a check-in point, and a lot of that is to ensure, and we're talking about those checklists, that we're in a position to know that the team has thought about all of the aspects: the ability to roll back, and a really strong awareness of the implication of a change on the customer. If that is not done, there is a check as part of the deployment process to get it out to production that would not allow a change to go through. The confidence score is still something that is not a hard stop. It all depends on the way we're going to look at it down the track.
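The distinction drawn here, a checklist that blocks deployment versus a confidence score that only advises, can be sketched as follows. The checklist item names, the `GateResult` type, and the warning threshold of 60 are hypothetical, and the real pipeline check is not described in the conversation.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    allowed: bool
    blocking: list[str] = field(default_factory=list)  # hard stops
    warnings: list[str] = field(default_factory=list)  # advisory only


def evaluate_gate(checklist: dict[str, bool], confidence: float,
                  warn_below: float = 60.0) -> GateResult:
    """Block on incomplete checklist items; only warn on a low confidence score."""
    blocking = [item for item, done in checklist.items() if not done]
    warnings = []
    if confidence < warn_below:
        warnings.append(f"confidence score {confidence} is below {warn_below}")
    # Only unfinished checklist items can stop the deployment.
    return GateResult(allowed=not blocking, blocking=blocking, warnings=warnings)


checklist = {"rollback_plan": True, "runbook": True, "customer_impact_assessed": False}
result = evaluate_gate(checklist, confidence=45.0)
print(result.allowed, result.blocking, result.warnings)
```

Keeping the score advisory lets a team gather data on how well it predicts problems before deciding whether to turn it into a hard stop.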
Ganesh Datta:
Right now, the way you think about it, why is deployment gated on the production readiness piece, but not on the confidence score? What's different about those, or why are those two things treated differently?
Fred Mare:
I mean the confidence score is quite a new aspect that's coming into play. It probably stopped. If you think about being operationally ready, we have the verification happening at that point. The two will probably come together and the one would contribute to the other, because we also want to ensure, especially with the heavy lifting that I was referring to, that we continuously reduce the investment from an engineering team in that space.
Ganesh Datta:
That makes sense. Do you think about security as being part of production readiness or confidence? You hear sometimes two schools of thought where it's like security is a totally different thing. We have security standards and then we have traditional operational type things. Do you see those as being part of the same program or do you think about them separately?
Fred Mare:
You can't separate the two.
Ganesh Datta:
Yeah, I agree.
Fred Mare:
Yeah, if you're talking about CVEs, or the ability from GitHub, such as Dependabot, to give that visibility, or SonarQube as another aspect from a security perspective, it plays an important part in what we do. I personally can never say we have to separate the two. If you're not thinking about security on a daily basis, then you definitely have a massive gap in how you're thinking about production readiness or shipping out something.
Ganesh Datta:
When you think about the kind of ongoing review of production readiness, we talked about developers looking at the confidence score. We talked about teams consuming some of this information. Is there a different granularity of this information that people should be looking at? Why should a leader care about production readiness? Are they looking at actual production readiness and confidence scores? Are they looking at second-tier metrics like their DORA metrics, or should they care about production readiness and the way it's defined?
Fred Mare:
I think they should be caring about both. I think one is actually contributing to the other, and that's again that indicator. It's a signal. So if you're looking at DORA metrics as an example, say deployment frequency, if you've got a low deployment frequency, there's a high probability that your confidence score is not going to be as great as you would like. It just means that the team is not in a good position, whether because of the CI/CD pipelines, maybe the way they're working, or other aspects, and all of that contributes to the indicator giving you a signal that you need to make some changes. So I think the two work closely together, and you need to look at both of them.
Ganesh Datta:
Yeah. How do you think about what goes into a production readiness checklist in the first place? Because obviously you don't want to overload people with 100 requirements, but you still want some sort of baseline set that people should care about. How do you determine if something should be part of a standard checklist? How do you determine when to remove something? What does that process look like?
Fred Mare:
I mean, at the moment we are looking at a subset of signals, especially on that change confidence score, and those are the most important ones. You can always look at the lessons that we've learned over years of investing in this kind of space, especially when you talk about standards. Standards can really bloat. You start out with good intentions. You think we just want to make sure that we're compliant. You really want to put that in front of a team to say you have to think about these things, and then suddenly when you open your eyes, there are 30 or 40 of these things that a team has to look after. So we're consciously ensuring that, again, we reduce the number, and I think an important aspect of looking at these kinds of things is that you need to be able to automate it. If you can't automate it and you want to introduce it as part of being ready to go to production, you're putting yourself in a very difficult position.
Ganesh Datta:
Yeah. Do you think about and maybe this is a little bit more Cortex specific, but how do you think about the importance if at all of levels to production readiness or the gamification of it? Is that something you guys have adopted or do you think about it more as just a baseline set of requirements that everyone has to follow?
Fred Mare:
If you think about when we're calculating a score for a service, you can gamify it. There's always the opportunity to do that, but you can also bring scorecards into the equation, which are such a great mechanism to create visibility, along with the ability to run an initiative, to really put it out there. If you bring those two together, I guess this is where the engineering manager or the team lead, the owner of the services, really comes into play. They can get a sense of how confident we are to go out to production. If it's low, we can use scorecard initiatives to drive the change, but they can also create visibility for the team: look, at the moment you're at 40 out of a hundred, and what we see is that, for instance, your quality aspects are very low.
So you can use the scorecard to get that visibility, and then you can start investing in that space. You can bring in some support to enable the team, especially if you think about more legacy systems, where maybe we are really starting with low coverage, to help the team work around that and start increasing the level of confidence.
Ganesh Datta:
The last question that I have, the listeners of this podcast are engineering leaders, VPs, CTOs, and similar folks. If you could give them one piece of advice, a lot of people aspire to have great production readiness programs and are maybe early on in the journey. If you give them one piece of advice on how to drive a successful production readiness program if you're just starting off, what would you share?
Fred Mare:
Start small. Don't overload your team and be very clear about what the expectation is. I think as engineers, like I mentioned earlier about the customer expectation, the pace of things that needs to go out, the world is really moving very fast, is to be very selective how you want to approach this, and to see it as a journey and not to come in very hard and cold and say, "I want you to go from zero to 80 in a week's time." I think you're going to overload the team. It's going to have a very negative impact. When we talk about scorecards, that gamifying aspect, really use your tools to encourage your team to grow in that space and just really investing in a team to be able to do that.
Ganesh Datta:
I love it. So just to summarize, have a customer mindset, start small.
Fred Mare:
Customer.
Ganesh Datta:
Automate as much as you can, shift as much of it left as you can.
Fred Mare:
Absolutely.
Ganesh Datta:
And that's a good place to start.
Fred Mare:
That's the one.
Ganesh Datta:
All right, well thank you so much for being on the podcast. Really enjoyed it.
Fred Mare:
Love it. Thank you very much, Ganesh.
Ganesh Datta:
Thanks so much for listening to this episode of Braintrust. If this resonated with you, do me a favor. Share it with another engineering leader who's wrestling with these same challenges, and if you want to continue the conversation or learn more about how we're thinking about internal developer portals at Cortex, reach out to us at cortex.io. Thanks for listening and we'll catch you on the next one.