Braintrust by Cortex

Cortex co-founder and CTO Ganesh Datta sits down with Matt Bailey, DevOps consultant and founder of Merge Ready. Matt shares lessons from helping large regulated organizations in finance, healthcare, and government transform their DevOps practices, and explains why DevOps is an outcome rather than a toolchain.

Matt and Ganesh discuss why compliance can be mostly automated rather than a mandatory bottleneck, how to turn 30-day approval processes into continuous audit readiness through controls as code, and why treating platform teams as product teams drives natural adoption. They also explore decision latency as a core organizational problem, the importance of stakeholder management as a DevOps skill, and how AI agents may shift infrastructure drift management from prevention to embrace.

What is Braintrust by Cortex?

Candid conversations with the builders shaping the future of engineering.

Braintrust dives into the operational realities of running high-performing engineering organizations, from production readiness and migrations to AI adoption and operational excellence.

Hosted by Ganesh Datta, CTO & Co-founder of Cortex

Matt Bailey (00:00):
DevOps is an outcome. It's not a toolchain. It's not the CI/CD platform that you've just implemented because by and large, it doesn't matter what CI/CD platform you pick if your processes are crap or non-existent.

Ganesh Datta (00:17):
You're listening to Braintrust by Cortex, where we explore how engineering leaders blend AI, platforms, and culture to build high-performing software teams. I'm your host, Ganesh Datta, CTO and co-founder of Cortex, an engineering operations platform designed to help organizations continuously improve their operational maturity and reduce developer friction. In each episode, we go deep with CTOs, VPs of engineering, and technical leaders who've been in the trenches, navigating the tension between speed and quality, building reliability at scale, and figuring out how to lead through major platform shifts. Whether you're running a team of 10 or a thousand, this is your space to learn from people who've made the hard calls and lived to talk about it. Hello, welcome to the Braintrust Podcast. I'm Ganesh Datta, the co-founder and CTO at Cortex. We help engineering organizations create a culture of reliability and make sure that they can move fast without increasing the number of incidents.

(01:20):
Today I have on Matt from MergeReady. Great to have you on.

Matt Bailey (01:23):
Hi, Ganesh. Thank you. It's good to be here. Yeah, I'm Matt Bailey. I am a DevOps consultant and founder of MergeReady, a DevOps community and YouTube channel.

Ganesh Datta (01:33):
Very excited to have you on. Before we started recording, we were talking about your unique experience in this field. You have seen "the underbelly of the beast" for a lot of organizations, having been in this consulting capacity and having helped so many organizations with their DevOps transformations. I would love to start there maybe and talk a little bit about what you've seen and what good looks like and what bad looks like. So what makes an enterprise successful in their DevOps transformation and what are the common failure modes that you see?

Matt Bailey (02:05):
Sure. Yeah, it's a great question. I've been working for the past few years primarily in large regulated organizations in finance, in healthcare, even government projects as well. And although they're different industries, they face and you see the same patterns and problems. A lot of what I see is that they are trying to implement DevOps and what they're really doing is implementing a singular tool with some theater around it and not looking at the kind of flow and incentives there are for engineers and engineering teams to deliver value. So it's the same across all of these different industries depending on the organization, but in particular in these organizations, compliance, it's not an option. It's not optional, I should say, but it doesn't have to be the brakes on your software development lifecycle. It can work in tandem with your software development lifecycle and it can make things much, much better.

(03:24):
So that's my take on the underbelly as an overview.

Ganesh Datta (03:29):
I love that. And I think what you call out is a lot of organizations, some tools, specific tools come to mind, but they roll out a new platform as part of their DevOps transformation. But what I always like to think about is, if you're transforming from, excuse my language, shit to shit, it's not really a transformation, you're just doing it in a slightly different way. And I think the whole point of a transformation is that you're trying to unlock some new way of working and new capability. And it could be, "Hey, because of the way we're working, we're only able to deploy once a month." And by undergoing a series of changes to our tooling, our culture, our way of life, if you will, we can go from deploying once a month to many times a day. And that's a real transformation. And I think that's maybe missing from a lot of DevOps transformations.

Matt Bailey (04:16):
Absolutely it is. And you touched upon it there, but I like to say that DevOps is an outcome. It's not a toolchain. It's not the CI/CD platform that you've just implemented because by and large, it doesn't matter what CI/CD platform you pick if your processes are crap or non-existent. So yeah, you really have to make sure that you're putting that effort in to drive the culture of DevOps, to really embrace that as an outcome, which a lot of people don't do. And you have an ecosystem where there's a lot of build it yourself, build versus buy. A lot of these larger organizations tend to see themselves as having a much larger software engineering team, therefore they'll build versus buy, but they still treat their platform teams as kind of second-class citizens. So the platforms, whether they've built them or they're managing something else, they won't treat them as products, when the platform team very much should be treated as a product and managed in the same way because you're trying to deliver value. And that's a common pitfall, that these teams become a ticket queue.

(05:41):
It's just another gate in the way of innovation and improving that culture.

(05:51):
So if you treat platform as a product, the adoption of that platform will come naturally.

Ganesh Datta (05:58):
Because you're actually driving towards some sort of value, like you're iterating it towards something that people actually care about. You said driving towards that outcome, driving towards that culture of DevOps. What does that mean? What is a DevOps culture? What does that look like?

Matt Bailey (06:11):
Yeah, it's an interesting question and one that I've tried to answer on many different occasions. It boils down to breaking down barriers within an organization so that your software development lifecycle, from writing code to production, is seen as seamless and not perceived as an effort or filled with gates. We often talk about gates within CI/CD, security gates and all that kind of stuff, when it should feel more like you're sailing down a nice river and yeah, you may slow down a bit sometimes, but it's a flow, not like a canal with locks and gates and that kind of thing. I think it's hard to know when you have achieved DevOps. It's such a broad term, but the way I would like to measure it, and it's difficult with these larger organizations, is basically measuring your engineering team's happiness. How happy are they within their role?

(07:29):
What does the developer experience look like? Whether that is software engineers, platform engineers, DevOps engineers, SREs, whatever it may be. I don't know how you measure that outside of the ... I prefer the happiness survey as opposed to necessarily things like DORA and looking at metrics, but obviously that kind of stuff is key for stakeholders as well to see that you are improving as you go.

Ganesh Datta (08:05):
Yeah, I love that. It's almost like a state of Zen. Here's kind of a tangential question, but I'm sure one of the listeners is thinking it: we hear a lot about platform engineering now. What is the difference between DevOps and platform engineering, if any?

Matt Bailey (08:19):
It depends on the organization and the definition of roles and teams obviously differs vastly even within the same industries. And I've worked at two different banks, for example, that have had platform teams that do different things. One, I would consider an SRE team who managed the reliability and on-call and that kind of thing. Whereas a DevOps team, I think the ideal, and this is who am I to define these things? But generally speaking, a DevOps team is someone who generally works towards achieving that culture of breaking down barriers through automation, through providing self-service where possible. I see the platform team as almost an attachment to that, if not part of the same team, where it leans more on the software engineering side of things and is more to do with building out platforms internally or expanding on existing platforms. It's a tricky one, but I tend to think of a DevOps engineer as the evolution of what your infrastructure engineers were.

(09:45):
And I think of our platform engineers as a similar persona, but focused more on software engineering to achieve the same goals.

Ganesh Datta (09:57):
That makes a lot of sense. I guess if I were to distill that, it's kind of your platform engineers are building the foundational capabilities that then DevOps teams can use to build the right automation for developers. The platform team might build out the capability to deploy something to Kubernetes in a standardized way. The DevOps team might use that to build and maintain a CI platform and build the right sort of guardrails and automation within your CI set of capabilities and make it easier for developers. So DevOps is bridging the gap between the raw under-the-hood platform and what the developers care about. Is that a reasonable summary of that?

Matt Bailey (10:35):
Yeah, I would say so. Yeah. And I think when it comes to hands-on keyboards, what you would see a DevOps team doing is providing templates in the form of infrastructure as code, providing templates in the form of CI/CD pipelines. That's not to say that the developers and software engineers themselves won't contribute to that because I think internal contribution is key for a DevOps culture, but the accountability and responsibility lies with that DevOps team.

Ganesh Datta (11:08):
I like that. That makes a lot of sense. I want to go back to the idea that we're talking about, this Zen state, specifically in a regulated industry. So I come from a regulated industry as well in my previous job, and developing software in a highly regulated field is quite different. There is a different level of rigor and compliance, like you said, is not optional. It is a thing that every feature, everything you ship has to go through that process. Security is paramount. All those things are true. And so I guess, is this idea of this DevOps flow or this Zen state achievable in a regulated industry? If you have compliance as a requirement, people tend to think of compliance as a required bottleneck. It has to be a choke point, but is that true? Does it have to be that way?

Matt Bailey (11:51):
No. No, it doesn't. Compliance can be entirely ... I don't want to be absolute, but it can be mostly automated. So we've got the tools out there now to enable us to bake in compliance into those templates that we discussed earlier to add those security scanning tools to the best of our ability, streamline approvals and manage change management in an automated way. The tools exist. I think most of these organizations, it's not a DevOps problem, it's decision latency and just not having the ability to decide upon what they're going to do. And if you've worked in a large organization, as you said, in a regulated environment where it can take two or three tickets in ServiceNow, it can take three or four or five or six approvals during those requests to just get a dev server. And in order to raise those tickets, you need to have registered your application in the software catalog and assigned an owner and that required tickets as well.

(13:23):
That's all automatable. There are some aspects of some regulatory standards that require manual approval and documented change management, but if you've got the resilience in place and you're not assigning that to a single person or a single manager and you have contingencies if people aren't there or if there's delays, then I think it's entirely possible to fully automate and streamline these processes. That's not to say I've ever seen that in these large regulated organizations. I think everyone's at some different stage trying to reach that. In smaller organizations, it would be far easier to implement.

Ganesh Datta (14:19):
I love the phrase decision latency. I don't think I've heard that before, but decision latency implies to me that there's a human element. Somebody has to come in and make some sort of decision, but you say that generally this process can be automated. And so what does that actually look like in practice? Can you give an example of a workflow that maybe has historically had a human in the loop or humans in the loop with multiple layers of approvals that can be translated into something automatable? Because those two things sound like they're at odds, but you're saying that they're not.

Matt Bailey (14:47):
Yeah, sure. I can't give specifics, but at one client I was with last year, an adjacent team to mine. So I was on what we would've discussed earlier as the DevOps team and I was working with the platform team who were building this feature. The idea was that, and it goes back to ServiceNow and change management. The idea was that there were several stages during change management where releasing to production would stall completely waiting for approvals. And if you've got more than one approval, you can guarantee that one of them isn't going to be done that day or that week. So the idea was to implement a standard change workflow. So this would be a change where we've seen this type of change before. We understand the risk of this change. We have approved changes in the past with these caveats. Therefore, we are able to, during the CI/CD pipeline, reference our change management system for previous changes and agreed standard changes and therefore automatically approve your ServiceNow requirements and automatically raise the initial change request as well.

(16:20):
And this was using GitLab CI/CD. So the rollout, and it doesn't sound particularly encompassing, right? It sounds, yeah, that's fine. Maybe you get one pipeline a week or something like that, but we're talking an organization with over 15,000 software engineers and we were delivering this in a way that it was completely reusable across all those different teams in the form of compliance frameworks. So integrating all these features with GitLab into this particular workflow, which saved countless days and dollars.
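To make the standard change idea above concrete, here is a minimal sketch of the decision a pipeline step could make. It is an illustration only, not the client's implementation: the `StandardChange` catalogue, the risk scores, and the `decide` function are all hypothetical names, and no real ServiceNow or GitLab API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardChange:
    """A pre-agreed type of change (hypothetical shape, for illustration)."""
    change_type: str
    allowed_environments: frozenset
    max_risk: int  # highest risk score the agreement covers


@dataclass(frozen=True)
class ChangeRequest:
    """What a pipeline run wants to release."""
    change_type: str
    environment: str
    risk: int


# Assumed catalogue. In practice this would be agreed with the change
# management team and read from their system, not hard-coded.
CATALOGUE = [
    StandardChange("app-deploy", frozenset({"staging", "production"}), max_risk=2),
    StandardChange("config-update", frozenset({"staging"}), max_risk=1),
]


def decide(request, catalogue=CATALOGUE):
    """Auto-approve only when a pre-agreed standard change covers the request."""
    for std in catalogue:
        if (
            request.change_type == std.change_type
            and request.environment in std.allowed_environments
            and request.risk <= std.max_risk
        ):
            return "auto-approved"
    # Anything not seen and agreed before still goes to a human.
    return "manual-review"
```

In this sketch a production `app-deploy` with risk 2 is approved automatically, while the same deploy with risk 3, or any change type missing from the catalogue, falls back to manual review. The shared definition of "standard" lives in data that both teams can read and change.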

Ganesh Datta (17:10):
That's really interesting. I mean, it sounds like one of the requirements to automate this is having a shared definition of what standard is. We all have to agree that these types of things are required, these types of things are standard, this is what good looks like, and then finding a way to codify that.

Matt Bailey (17:27):
Yeah. And this comes down to DevOps and I know that sounds silly, but traditionally when people think of DevOps, they think about breaking down that barrier between developers and operations, but that definition has expanded far beyond those two organizational units. Change management teams, they're approachable. They exist. You can message them, you can talk to them and have meetings with them to decide these things. You can talk about their pain points, talk about your pain points and communicate what your desired outcome is and why, what the value is. They can communicate why they need all these things and the cost of not having these things. And together you can come to decisions on what can be automated and how you automate that. And I think that's just generally the culture of DevOps across any organizational units. I mean, we're seeing FinOps when it comes to cost management in the cloud and making sure that budgets are assigned correctly and that you manage all your resources appropriately.

(18:40):
DevOps is far more than just the software engineering part now.

Ganesh Datta (18:44):
Yeah. And I think it's true even like you said earlier about platform engineering teams or just platform teams in general having a product mindset. And I think that's kind of what you're describing: the product mindset is trying to figure out, what is the root problem that we're trying to solve here and what are the blockers that are preventing the user, the end user, whatever from getting to that value? It sounds like maybe an approach like the five whys, if you've heard of it, is kind of what you're looking for here, which is like you continuously ask why until you get to the root cause. It's like, "Oh, we need an approval here, but why?" "Well, we're worried about the risk here." "Okay, why?" You keep asking that and eventually get to like, "Okay, we had this incident three years ago that caused this particular thing.

(19:22):
And so if we can prove that that class of issues is easy to guarantee it won't happen, then we can actually automate this whole chain of things." And so I think that five whys, I think it came from Toyota or something like that way back when, but it sounds like an approach like that is probably what you want if there are human-in-the-loop processes you want to automate.

Matt Bailey (19:41):
Yeah. And it is true that these compliance processes are some of the most difficult to unravel. I've seen at a healthcare company an IDAM process that caused a DevOps server request to take in excess of 30 days because of the various tools and teams that were required to do approvals or just manual steps to create groups in Active Directory to do this, that, and the other. And my God, that diagram, spaghetti doesn't begin to describe it, and all the diagrams are wrong because the processes are far too complicated. I think if your processes have become that difficult to document, they're probably too complex or need reconsideration.

Ganesh Datta (20:41):
Yeah, I like that. There's a customer that we were working with and I remember seeing one of the similar spaghetti diagrams, except it was just to create a repository, and I think they had 14 approval steps just to create a new repository. I'm sure it's like anytime there's a process, there's a reason why. It's not that people put process in for no reason, it's like in the historical context of the memory of the organization, and the memory of the organization is codified in these human processes. And you see that even at startups, over time, the reason process bloats is something went wrong or say there was a failure mode and you're trying to prevent that failure mode with the process, and eventually over time that calcifies, and coming in and changing all that is actually quite difficult because at the end of the day, it's like deep-rooted memory or, if you want to describe it differently, it's the trauma of the organization that they're hiding behind a bunch of process.

(21:35):
I think you have a very interesting perspective in the sense that you've gone into these regulated organizations and I think regulated organizations are particularly challenging in this sense because regulation is a very ... It's a very painful version of this trauma. Compliance for regulated things is a very painful version of this trauma that the organization has. And so those processes that are put in place have even more deep-rooted meaning to the organization. And so you go into these organizations and you convince them, like, "Hey, the way you're doing things right now is not working," and you go and help them transform that. If a leader from a highly regulated organization is listening to this episode, what would you say is a place to start? We're like, "Hey, yes, we are living in this world with calcified process and pain and our developers are unhappy. How do I even start?" Where would you start if you were there in their shoes?

Matt Bailey (22:27):
That's a great question. Start small. You pick a sample team or group and maybe you identify by some metrics that you might have around risk and pick that team as your pilot. This is the biggest risk to our organization because there's been issues with things going into production or whatever it may be. Whatever it is, pick a small team and just start treating your compliance as controls as code, evidence everything and try and be continuous with your compliance so you're always audit ready because you never know. If you're always audit ready, when that day comes, you're already there and you can hand it over, you can just hand over your evidence and whatever it may be. And audit season is a particular pain point, right? So gathering manual evidence kind of shuts down product teams across the board when they're getting Slack messages, "Get me a screenshot of this.

(23:44):
Can you screenshot the CI/CD logs?" The flashbacks. Yeah, yeah,

(23:49):
Yeah. Yeah. Can you add to the spreadsheet the different artifact names that were deployed to production and when they were deployed and when this incident happened? And it just halts everything and the cost of that to an organization is in the tens, if not hundreds of millions, especially in these large regulated organizations, in financial organizations. So try and just codify everything. Make sure you've got your controls as code, your policies, your guardrails, your templates. So perhaps you're using a new platform and you think this platform has the ability to do all that, migrate that team over to it, migrate them, see how it goes, get someone in to help you. There are specialists out there, like myself, who work on these giant digital transformations within these giant organizations and we've had nothing but success after success. So implementing things like GitLab or Harness or whatever it may be as the core of that digital transformation, but then making sure we don't go into that "tools are DevOps" mindset, making sure we keep that mindset present where DevOps is the culture, this tool is here to support us.

(25:20):
And then you can implement those controls as code. You can implement everything as templates, it can all be reusable and then you just migrate. Sounds simple, right?

Ganesh Datta (25:30):
Always sounds simple.

Matt Bailey (25:31):
Yeah. There's no one size fits all for this kind of problem. I don't have a definitive answer of where to start, especially because I don't know the organization that you're operating in or what problems you have, but you can start to look at those problems and go back to your five whys and follow some kind of a framework like that in order to find the lowest-hanging fruit and then pick it.
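As a rough illustration of the "controls as code" and "always audit ready" ideas from this part of the conversation, the sketch below evaluates a couple of controls on every pipeline run and stores the result as a tamper-evident evidence record. Everything in it is assumed for the example: the control names, the shape of the `run` dictionary, and the helper functions are hypothetical rather than taken from any specific platform.

```python
import hashlib
import json


def control_peer_review(run):
    """Control: the change had at least one reviewer."""
    return len(run.get("reviewers", [])) >= 1


def control_security_scan(run):
    """Control: the security scan for this run passed."""
    return run.get("security_scan") == "passed"


# Controls are plain data plus functions, so they can be versioned and reviewed.
CONTROLS = {
    "peer-review": control_peer_review,
    "security-scan": control_security_scan,
}


def digest_of(record):
    """SHA-256 over the record's content, ignoring any existing digest field."""
    body = {key: value for key, value in record.items() if key != "digest"}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def collect_evidence(run, controls=CONTROLS):
    """Evaluate every control for one pipeline run and return an evidence record."""
    record = {
        "artifact": run["artifact"],
        "deployed_at": run["deployed_at"],
        "controls": {name: check(run) for name, check in controls.items()},
    }
    record["digest"] = digest_of(record)
    return record


def is_untampered(record):
    """True when the stored digest still matches the record's content."""
    return record.get("digest") == digest_of(record)
```

If a record like this is written on every deployment, the audit-season request for artifact names and deployment times becomes a query over stored evidence instead of a round of screenshots. A production setup would also need durable, access-controlled storage, which this sketch leaves out.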

Ganesh Datta (26:02):
That's really important. I think three recent episodes have all mentioned start small as a solution for completely unrelated things, by the way. But I think that seems to be a very standard practice for success and it's true. We know that in product management as well, it's like, don't boil the ocean, start small, be iterative. Why would a DevOps transformation or a production readiness program or anything else be any different? You want to start small, you want to prove the value and use that as a beacon to the rest of the organization. "Hey, look, look what we're able to accomplish. Why wouldn't we want more of this?" But on that note, I think I'm sure as somebody who goes into organizations and helps with this transformation, I'm sure you've seen this quite frequently, but the question I just asked earlier assumes that somebody within the organization has already realized, "Hey, this is a problem.

(26:53):
I want to do something about it," and is then trying to figure out where to start. I think a lot of organizations probably haven't even realized yet that there is a problem and I'm sure you talk to these organizations and say like, "Hey, it's holding you back from X. It's holding you back from this DevOps outcome." If you are a rogue platform engineer or DevOps engineer or a leader within an organization who's trying to make this case internally like, "Hey, the status quo is not great. The status quo is holding us back," how would you tell that story? How do you go and convince an organization that maybe hasn't realized that yet, that they should be even thinking about solving this problem in the first place?

Matt Bailey (27:30):
It highlights my belief that one of the core DevOps skills is communication and understanding stakeholder management. Now, a lot of people disagree, but where DevOps does work across a multitude of teams, it actually becomes incredibly important, especially if your role is to break down barriers and make things more successful within the organization. So when dealing with trying to surface pain points in order to influence change, you need to understand who you're talking to and what they perceive as value or what is valuable to them and then you can start to try and record in some form of tangible data what those issues are. So one thing may be that your pipelines are incredibly slow because we only have one Jenkins server that's running everything or something like that. And then your stakeholder may be an engineering manager who has KPIs to do X amount of releases per week or something like that.

(28:58):
It doesn't matter what it is, but you identify where they see value and where they get value and then say, "Well, your KPIs are directly impacted by the speed of this pipeline because if we were able to release quicker, we would be able to iterate over our features faster and deliver more value at the end of the sprint or whenever it may be." I'm not sure that that's a particularly good example, but the core of it is understand what value you need to provide, understand where the stakeholder that you are trying to explain this to sees value, and try and align the two and show how removing a particular pain point would improve both of your lives.

Ganesh Datta (29:48):
That makes a lot of sense. And I think it kind of goes back to the same thing that we were talking about at the beginning, which is focusing on the outcome. And I think if you can tell that story with data or with anecdotes, tying it to some sort of business outcome I think is a way to get the organization to realize that there's a problem. It's like, hey, somebody up the chain is getting pressure to ship more, ship faster, ship better, whatever that is. And if you can tie the limitations in your process and your systems to that outcome, then you can get people to see those specific things and you start small, you start with a small slice, a steel thread or whatever you want to call that. And I think that is evidence that you need for that transformation.

Matt Bailey (30:29):
Yeah. And you also need that evidence to validate for yourself that you are making the correct assumption around something, which is important, that you're actually raising a legitimate concern. But within these organizations, we should caveat all of that with you can't expect this change to come tomorrow. There are budgets, there is sign-off, there is all sorts going on with various different stakeholders that you don't know within an organization. There are a myriad of different things going on. So make sure everything is documented and ready to go as soon as the stakeholder's ready to pick it up and have a look.

Ganesh Datta (31:09):
Yeah. You were mentioning earlier that one of the most important skills in DevOps is stakeholder management. And I think people maybe think about it, especially in large organizations, as politics. You said sign-offs. Sign-offs sometimes imply, oh, there's a different stakeholder somewhere that is going to shoot down this idea for some reason, or there's somebody who's trying to get their claim to fame and some organization they're trying to do their own thing. But if you're able to go back and figure out like, "Hey, what do they really care about?" You were talking about compliance as an example earlier. Hey, why do we have this process? Why does it matter? And if you're able to go and do that with other stakeholders, the whole point of stakeholder management is trying to find wins for the rest of the organization as well that make it possible for you to accomplish what you want as well.

(31:55):
I think thinking about it as politics sometimes, especially in these large organizations, is the wrong way to think about it. They're individuals with their own

Matt Bailey (32:03):
Aspirations,

Ganesh Datta (32:04):
With their own goals, with their own pressures, with their own stressors, ambitions, whatever that might be. And you want to figure out, okay, how do we make this a win that everybody can showcase? And starting small is one way to do that.

Matt Bailey (32:15):
Yeah, absolutely. The perception of politics you need to be careful with because that really just means people's interests that you don't yet know or understand. And that could be, "Oh, I want this tool because it's better and it will be faster." But your manager's saying, "Oh, well, we can't afford that because we don't have the budget." But they are trying to get the budget for another engineer who will just add to that bottleneck. So if you try and understand what's going on, if you can get that sort of information, then you can use that to create yourself a fuller picture of everything that's going on and therefore be able to present a better value proposition for why you should be doing something.

Ganesh Datta (33:01):
I'm sure all the listeners have one thing on their mind and right now it's always AI, and I'm sure that comes up in your conversations as well. The advent of coding assistants and LLMs in every part of the stack today, does that change any of this? Does that change how you think about compliance or DevOps or these outcomes that we've been talking about?

Matt Bailey (33:20):
No, no, I don't think so. I don't think it changes the outcomes. It just changes how we get there. I think there's going to be some really interesting changes in the world of compliance in software engineering with the advent of AI agents. For example, I've already seen a tool called Kosli, which I've spoken about on my channel in the past, who really embrace the evidence and continuous compliance. And I think what they're aiming to do with AI in their platform is have auditors be able to ask their product in plain English about particular things and have it just present the evidence that's needed. So you will have an auditor just talking to a product as opposed to all these developers, for example. So that's one really cool thing about AI and audit times. I think we understand the guardrails required in order to use AI safely within software engineering now.

(34:33):
I think that is propagating around teams. I think having Copilot or whatever it may be assist you in your day-to-day work is becoming more of a norm and the stigma that was around that I think is dissipating when we know we have these guardrails in place to protect us and protect our reputation and our product. I think infrastructure as code is going to change quite drastically as we see AI agents improve and be trusted more. I think the way that drift and infrastructure drift is handled is going to shift. I think we're going to embrace drift far more than we ever did. I have a very controversial take that in fact we will be heading back towards ClickOps in the AWS console and provisioning servers and then it's codified automatically and there's some cool products out there already looking at this kind of mindset where just embrace the mess.

(35:41):
AI can codify it. It's all there. It's all following the guardrails that we need. I think it's going to be an interesting time and I've tried to predict year on year what AI is going to do and bring and how fast it's going to do it and got it wrong every time. So AI is ahead of me usually. So all I can say is I'm excited to see what the truth of it is and how that does play out. And I really hope that it just makes developers' lives easier because at the end of the day, I'm a DevOps engineer, that's my goal.
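For readers who want a picture of what "embracing drift" could look like mechanically, the sketch below compares a declared state with what is actually running and folds the differences back into the declared state when they pass a guardrail. It is a toy model under stated assumptions: resources are plain dictionaries, and `detect_drift`, `adopt_drift`, and the guardrail function are hypothetical names, not any existing product's behavior.

```python
def detect_drift(declared, observed):
    """Return {name: (declared_value, observed_value)} for every difference.

    A value of None means the resource is absent on that side.
    """
    drift = {}
    for name in sorted(declared.keys() | observed.keys()):
        want = declared.get(name)
        have = observed.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift


def adopt_drift(declared, observed, guardrail):
    """Codify observed changes that pass the guardrail and report the rest.

    Returns (updated_declared_state, rejected_changes).
    """
    updated = dict(declared)
    rejected = {}
    for name, (_want, have) in detect_drift(declared, observed).items():
        if have is None:
            # The resource was removed by hand, so the declaration is dropped too.
            updated.pop(name, None)
        elif guardrail(name, have):
            updated[name] = have
        else:
            rejected[name] = have
    return updated, rejected


def only_small_instances(name, resource):
    """Example guardrail (an assumption): allow only t3 instance types."""
    return resource.get("type", "").startswith("t3.")
```

With a declared `web` server of type `t3.small` that someone resized by hand to `t3.large`, `adopt_drift` would update the declaration to match, while a hand-created `x1.32xlarge` cache would be returned in the rejected set for a human to look at. The guardrail is where the organization's policy lives in this sketch.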

Ganesh Datta (36:17):
Be nimble, be ready to adapt. And at the end of the day, focus on the outcome and whether it's AI or current tooling or something else, I think you'll always have the right mindset and I think it's kind of what matters in the end. Matt, it was great to have you on. Thanks so much for sharing your wisdom. You've seen a lot. You've helped so many organizations with what I believe is one of the hardest transformations to do and so we're really honored to have you and your learnings shared on the podcast today. Thanks so much for joining.

Matt Bailey (36:44):
Thank you for having me.

Ganesh Datta (36:51):
Thanks so much for listening to this episode of Braintrust. If this resonated with you, do me a favor, share it with another engineering leader who's wrestling with these same challenges. And if you want to continue the conversation or learn more about how we're thinking about engineering operations platforms at Cortex, reach out to us at cortex.io. Thanks for listening and we'll catch you on the next one.
