Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.
Spencer: By redefining your threshold for what's a disaster, where you're going to have a recovery step and postmortems for all the affected applications, you kind of move that threshold forward. You say, we're going to be able to survive an availability zone going away.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined today by Spencer Kimball, who's the CEO and co-founder of Cockroach Labs. It's been an interesting year in the world of databases, data stores, and, well, just about anything involving data. Spencer, thanks for joining me.
Spencer: Boy, it's a pleasure to be here.
Sponsor: Outages happen – and it’s never good when they do. They severely disrupt your business, cost time and money, and risk sending your customers to the waiting arms of your competition.
But what if you could prevent downtime before it starts?
Enter CockroachDB – the world’s most resilient database. Thanks to its revolutionary distributed SQL architecture, CockroachDB is designed to defy downtime and keep apps online no matter what.
And now, CockroachDB is available pay-as-you-go on AWS Marketplace — making it easier than ever to get started.
Get the resilience you require – WITHOUT the upfront costs. Visit cockroachlabs.com/lastweek to learn more, or get started today on AWS Marketplace.
Corey: Cockroach Labs has been one of those companies that's been around forever on some level. Like, I was hearing about CockroachDB, must have been, oh dear lord, at least 10 years ago, if not longer.
Time has become a flat circle at this point, but it's good to see you folks are still doing well.
Spencer: Well, you know, that's what most startups were encouraged to do in the various boom and bust cycles that we've been a part of, be cockroaches, right? Uh, lose this pretty idea of being a unicorn and get down to basics and survive.
And, yeah, we have been. Your memory is accurate. We've been around for just about 10 years now. It'll be 10 years in February.
Corey: Okay, good, so at least my timing is not that far gone. And you're also one of the vanishingly small number of companies in the tech ecosystem that does not have AI splattered all over every aspect of your web page.
But the last time I said that, it was, well, we actually have a release going out in two days. Oh no. So if I'm just jumping the gun on that and you're about to be AI-splattered, let me retract that.
Spencer: Well, AI is very exciting. And it'll lead to a lot more use cases. But the interesting thing about databases is they're required for every use case, and nothing's changing about that.
We do have some AI capabilities. That's not at all unusual in the database space, but it certainly doesn't define us. We're not a database for AI. Sure, you can use it for AI use cases. In fact, you ought to; we've got some pretty big AI headliners that use Cockroach for their use cases. But we're not chasing the AI puck. The reality is we solve a very, very difficult problem, which is: how do you become the system of record for the most critical data, the metadata that really runs the mission-critical applications that people rely on every day? And that has always been a problem, right? Since these systems were first introduced in the sixties, and it will be a problem in 100 years. It needs to continuously be solved.
Corey: For all the joking I do about using anything as a database if you hold it wrong, the two things I'm extraordinarily conservative about in a technical sense are file systems and databases. Because when you get those wrong, the mistakes show. If you're a big enough company, they will show up in the headlines of the New York Times.
It's one of those problems where, let's be very sure that we know what we're doing on things that can't be trivially fixed with a wave of our hand. "Oh, I just dropped that table. Were you using it for something?" is not a great question to be asked.
Spencer: That's exactly right. I mean, this is a foundational piece of infrastructure, and if you build your house on a bad foundation, the problems start to show up and they don't stop until you don't have a...
Corey: Something that you folks have done relatively recently is very much singing my song. It's fairly common to see companies release state of X reports, and I didn't see this one coming from you folks, though in hindsight I absolutely should have: the State of Resilience 2025, which is honestly like catnip for me at this point. What led to the creation of this thing?
Did someone just say, hey, we should wind up doing this and people might click a link somewhere? Because if so, it worked out super well for you.
Spencer: Well, listen, you have to step back and look about five years into the past. We actually just saw an opportunity about five years ago to release the first of these kinds of annual reports.
It wasn't about resilience at that point in time. It was actually a consequence of us really struggling with the idiosyncrasies of the different hyperscaler cloud providers. We were making Cockroach available as a fully managed cloud service, in addition to many of our customers running it themselves as a self-hosted product.
And in that process, we experienced some pretty dramatic differences between the hardware and the networking and the costs of the different cloud vendors. So we went in there, got much more scientific about it, and started really doing benchmarking with a very database-centric perspective, and the results of that benchmarking were very interesting.
We figured the rest of the industry would be very curious to know what we came up with in terms of, you know, what's the best bang for the buck? What are the most efficient options in the different clouds? In that first report, there was quite a bit of discrepancy and ultimately an arbitrage opportunity for people who were going to select one cloud over another to run database-backed workloads.
And the success of that report encouraged us to do the same one two more times. So we did it for three years, and it was probably one of our highest-performing pieces of content, because it was interesting. What was interesting is that the cloud vendors paid a ton of attention to it as well, and it created quite a brouhaha internally at some of the CSPs.
As a consequence, the actual prices and the differences in performance between the cloud vendors began to diminish over those three years, to the point where in the third report the differences were fairly de minimis. So in the fourth year we decided to do a different report, and that one was actually the State of Multi-Cloud.
So what we were looking at is kind of similar to this most recent one, which, of course, we'll talk quite a bit about. We surveyed a huge number of enterprise businesses out there. We talked to the CIOs and the architects and so forth, and we asked them, you know, what's your stance on multi-cloud?
Are you using a single cloud? Are you still on-premises, in other words, are you hybrid? Do you have two of the hyperscalers? Do you have all three? What are the reasons for that? It was actually pretty eye-opening. We found that in the enterprise segment, most companies were definitively multi-cloud.
They had at least two and often three of the three big hyperscalers. A lot of that was due to just different teams, different times, more permissive attitudes, people kind of running in their own direction. There was M&A, so they acquired companies that used a different cloud than wherever their center of gravity was.
And it's kind of hard to move these applications once they get started.
Corey: It doesn't stop people from trying, not that I ever see it go well, but yeah: we've decided to do this because it's in line with our corporate strategy. Cue four years of screaming and wailing and gnashing of teeth.
Spencer: Yes, that makes a ton of sense.
I mean, I'm happy to be participating now, but this most recent year we decided to focus on what our customers were asking Cockroach to help them solve, and we saw that that was really our biggest differentiator. We're a distributed database that's really cloud-native, and that has some nice advantages.
One of them is scalability. It can get very, very, very large, and that helped some of our big tech companies in particular that had these big use cases and, you know, millions or tens or hundreds of millions of customers. But we also found that resilience is important to all of our customers, and this is another thing that a really distributed, cloud-native architecture can get right in a way that more legacy, monolithic databases don't have as easy a time at.
And so we focused this report on, again, a survey. I think we hit a thousand senior cloud architects, engineering and tech executives; the minimum seniority was vice president. And we looked at North America, EMEA, so, you know, Europe and the Middle East, and APAC. And it was just this year, ending around September 10th.
And boy, the results were surprising and a little eye-watering, I'd say, just in terms of how pervasive the resilience concerns are, the damages resulting from a lack of resilience, the general unpreparedness, and just the DEFCON 1 level of anxiety about where these companies were, how much this stuff was costing, and ultimately what that was going to mean going forward.
Corey: It makes sense. In my experience, people are not going to reach for a distributed database unless resilience is top of mind and they want to avoid those single points of failure. There's also an availability and latency concern for far-flung applications.
You don't get very far down that path unless resilience is top of mind. Anyone running something they care even halfway about wants to make sure it doesn't fall over for no apparent reason, or for bad apparent reasons; that's the thing they care about most. At least in my world. I'm an old, grumpy, washed-up Unix admin who turned into something very weird afterwards, but I was always very scared about making sure the site stays up.
I didn't sleep very well most nights waiting for the pager to go off. This year has had some of the most notable outages. Not to dunk on them unnecessarily, but one of those was the CrowdStrike issue, and I timed that perfectly, because it hit the day I started my six-week sabbatical, so I wasn't around for any of the nonsense, the running around.
I hear about it now, but I was completely off the internet for that entire span of time, and I could not have timed it better, to the point where I'm starting to wonder if people suspect I had a hand in it somewhere. But as best I can tell, it was one of those things with a confluence of complicated factors hitting all at once, like most large outages do these days.
No one acted particularly irresponsibly, and a lot of lessons were learned coming out of it. But no one wants to have to go through something like that if they can possibly avoid it.
Spencer: It's a good thing it didn't happen on the day you were leaving, uh, if you had Delta tickets, because that was a major problem.
Corey: It seems so. It's one of those areas where, whenever you have a plan for disasters and you sit around doing your disaster planning, your tabletop exercises, the one constant I've seen in every outage I've ever worked through has been that you don't quite envision how a disaster is actually going to unfold.
Most people didn't have every Windows computer instantly starting in a crash loop on boot. That just wasn't something people envisioned as being part of what they were defending against. Every issue I've ever seen of any significant scale has taken that form of, oh, in hindsight we should have wondered what if, but we didn't in the right areas.
I'm curious what you found in the report, though, that surprised you the most.
Spencer: Well, I think it was the pervasive nature of the operational resilience concerns. That was by far the most surprising. I will just make a comment on the CrowdStrike outage, though. I think that what it represents is a certain... well, first, maybe it helps to understand CrowdStrike's business model, which is really quite a huge value proposition for the companies that use it.
What they do is they say, okay, we're sort of a one-stop shop for handling all of the compliance and the security applied to the very vast and growing surface area that is threatened by cyberattacks. And if anyone listening to this has ever had to stand up a service in the cloud, the number of hoops you have to jump through is quite intimidating.
And it's only increasing in terms of the scope and the number of boxes you have to check. So that growing complexity of the task is made much more tractable by a product like CrowdStrike, which not only has a huge set of capabilities that address all of those threats but is also constantly being updated to address the evolving threat landscape.
And that's part of what went wrong, right? Like many companies, they were allowing CrowdStrike to automatically update immediately upon releases coming out, instead of letting them bake a little bit and letting somebody else find out the hard way that the update might have a problem in it.
It was kind of a simple programming error, but this is just an example of one of these things where you kind of have to trust a technical monoculture, which here was CrowdStrike's ability to protect these Windows machines from cyber threats. Because if you don't trust somebody else, well, every single company out there has the same problems, and most of them are going to address those problems very poorly compared to trusting CrowdStrike's technical competence and their economies of scale and so forth.
Of course, that same thing applies writ large to the hyperscalers, right? These are massive technical monocultures. And by the way, any one of those three companies, AWS, GCP, and Azure, is better than probably any other company in the world at running secure data centers and services, the whole substrate we call the cloud, the public cloud.
These days, each one represents an exceptionally fine-tuned, expert-level technical monoculture, but nevertheless it's a technical monoculture, right? So if something's wrong with one of these, it can be quite systemic. Just like CrowdStrike, it can be a very simple programming error, which honestly should have been caught, but, you know, S-H-I-T happens, right? Everyone knows that. And when you look at the increasingly complex way that any modern application is deployed, using a bunch of different cloud services put together and so forth, all of those services and pieces of infrastructure rely on trusting that whichever vendor it is is putting things together properly, protecting against cyber threats, dealing with their own lower-level minutiae of managing resilience and scale, and not going down. And you have to put, honestly, tens or hundreds of these things together in a modern service that's being stood up. So the only way to really prepare for the unknown unknowns, like which one of these things is going to fail on you, is diversification.
You know, the companies, for example, that had more than Windows running, and the CrowdStrike thing is just one small example, the ones that had Windows, Mac, and Linux machines certainly didn't have as much of an outage as the organizations that relied only on Windows.
Again, a little bit of a facile example, but it's one of the reasons that companies are eager to embrace a multi-cloud strategy, for example.
Corey: One of the challenges, unless you do that very well, with embracing a multi-cloud strategy to eliminate single points of failure is that you inadvertently introduce additional single points of failure.
Instead of, we want to avoid AWS's issues, so we're going to put everything on Azure for our e-commerce site, except Stripe, which is all in on AWS. So now we're exposed to Azure's issues, and we can't accept anyone's money when AWS goes down as well. A question I have, based upon the audience that you're speaking to when you conduct surveys for this report: is there a sense of safety in numbers?
As in, when the CrowdStrike issue happened, to continue using an easy recent example, the headlines didn't say individual company A or individual company B was having problems. It just boiled down to, computers aren't working super great today and everything's broken. Whereas if people are running their own environments and they experience an outage there, suddenly they're the only folks who are down, versus everyone.
Like, is there a safety in numbers perception?
Spencer: Oh, a hundred percent. I mean, that is one of the big reasons to use the public cloud, right? You're not going to get fired if one of the big hyperscalers has a regional cloud outage, because you're not the only company that went down when, say, US East disappeared from DNS. It was a huge, huge list of companies. Now, the problem with that, of course, is that the safety in numbers really applies to the larger pack of smaller companies. Once you get over a certain size and you have really mission-critical applications and services that consumers rely on, and will bitterly complain about on X.com when the thing goes away, then the safety-in-numbers argument wears a little bit thin, right? So those bigger companies, these enterprises with the mission-critical estates, actually have to think beyond, we can just make safe technology choices and rely on big vendors that are quote-unquote safe choices.
You know, ultimately the best ones (not everyone does it right; as you'll read in this report, I think a lot of companies feel unprepared here), the ones that are leaning forward the most, to use the Crossing the Chasm idea, the innovators and the early adopters, are the kinds of companies that really do embrace multi-cloud, for example, and seek to have that sort of diversification and much more in-depth planning, and adopt the latest infrastructure that looks to exploit the cloud for a higher degree of resilience and, for example, more scalability that's elastic, so that you don't have a success disaster: too many people using your service and creating a denial-of-service kind of condition. So yes, you're totally right. Boy, running an application actively across multiple clouds is not for the faint of heart, right? But it is one of those things that the best companies are actually already starting to do.
And as they sort of pioneer that, companies like Cockroach Labs and the hyperscalers and, I don't know, hundreds if not thousands of other vendors all kind of start to make that easier, right? Just like CrowdStrike, for example, helps companies manage the complexity of all of these different security issues across a big and expanding surface area, companies like Cockroach can help with the database, making it easy, for example, to run and replicate the database actively across multiple cloud vendors. Now, that's not something databases were expected to do 10 years ago. But now that there are some early adopters pushing in that direction, that kind of paves the way for the larger crowd to come along when it becomes more economical and, you know, a lot simpler, where the complexity is sort of transparently handled by the vendor.
Sponsor: Unplanned disruptions to your database can grind your business to a halt, leave users in the lurch, and bruise your reputation. In short: downtime is a killer.
So why not prevent it before it happens with CockroachDB? The world’s most resilient database, with its revolutionary distributed SQL architecture that’s designed to defy downtime and keep your apps online no matter what.
And now, CockroachDB is available pay-as-you-go on AWS Marketplace — making it easier than ever to get started.
Achieve the resilience your enterprise requires, WITHOUT the upfront costs. Visit cockroachlabs.com/lastweek to learn more, or get started today on AWS Marketplace.
Corey: For those who may not be aware, when I'm not talking into a microphone indulging my love affair with the sound of my own voice, I spend my days as a consultant fixing horrifying AWS bills for very large companies. So I have a bias where I tend to come at everything from a cost-first perspective. In theory, I love the idea of replicating databases between providers.
If you're looking at doing something that is genuinely durable and can exist independently on multiple providers simultaneously, then the way the providers charge for data egress seems like the Achilles' heel of the entire endeavor, just because you will pay very dearly for those egress fees across all of the big players.
Spencer: No, you're absolutely correct. I'll give you a couple of takes on that perspective, which is sort of a ground truth; there are mitigations and, ultimately, strategies that transcend the economics problem here. Just in terms of the base reality today: when your mission-critical use case is valuable enough, then you'll pay those egress costs, right?
The economics actually make sense, because the cost of downtime is so extraordinary, and so are the costs to reputation and brand and so forth. For example, let's say you're one of the biggest banks in the world and you have a huge fraction of U.S. retail banking customers. You might very well consider the cost of replicating across cloud vendors and paying those egress fees to be a fair cost-benefit trade.
Corey: Oh yeah, very much so.
Spencer: And to the extent that that actually starts to happen, you know, you can negotiate with the vendors to give you relief from those egress costs.
Corey: That is half of our consulting: negotiating these large contractual vehicles with AWS on behalf of customers.
And yeah, at scale, everything is up for negotiation as it turns out.
Spencer: Absolutely. And then of course there are technical solutions that use other vendors, so you can do these direct connects: things like Equinix and Megaport and so forth. This is also very important if you're going to do something that's hybrid, replicating across private clouds and public clouds and so forth; you really need to think about hooking up essentially your own direct connections, and you can obviate some of those egress costs.
And of course, vendors like Cockroach, in our managed service, can do that. In those kinds of direct-connect scenarios, you actually get a certain amount of bandwidth that can be used, and that becomes quite economical if you fill those pipes. If you over-provision it and you're barely using it, then you might pay more than the egress costs, right?
So there are opportunities there to really mitigate the networking costs. And then, of course, one thing we like to say is that resilience is the new cost efficiency. That kind of goes back to the earlier point of how valuable the use case is and what the consequences are of it going down.
But in this report we just put out on the state of resilience, the numbers are a little eye-watering. I mean, 100 percent of the thousand companies we surveyed reported financial losses due to downtime. So 100 percent; nobody escapes this. Large enterprises lost an average of almost $500,000, so half a million dollars, per incident, and on average there were 86 incidents per year. So as you put your whole foundation in place, certainly as you migrate more legacy use cases or build greenfield kinds of things, it does make sense to think about spending to embrace the innovation that's available and obviate some of these mounting costs.
I think a much worse strategy would be to accept all this new complexity to build the latest and greatest. And by the way, throwing AI into everything is certainly on most people's roadmaps. You've got to get it into this complex ecosystem, you're calling out to these LLMs, everything's expensive.
All kinds of things can break, because you're just increasing the complexity if you don't try to manage that, and really do it on things that aren't just the lift-and-shift of the old stuff while you're bringing in more and more new stuff. In other words, if your foundation isn't improving as you add additions to your house and new stories, you're only going to exacerbate the problem, right?
So you really do have to embrace that. And ultimately the cost savings on this sort of mounting toll of resilience disasters are a good argument to invest a little bit in the short term for a long-term reward.
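To make that "resilience is the new cost efficiency" argument concrete, here is a rough back-of-envelope sketch in Python. The downtime figures are the ones Spencer quotes from the report; the egress rate and cross-cloud replication volume are illustrative assumptions, not numbers from the report or from any provider's price list.

```python
# Back-of-envelope comparison: annual downtime losses (figures quoted in the
# conversation) versus a hypothetical cross-cloud replication egress bill.
incidents_per_year = 86           # average incidents per year, as quoted from the report
cost_per_incident = 500_000       # ~$500K per incident for large enterprises, as quoted
annual_downtime_cost = incidents_per_year * cost_per_incident  # = $43,000,000

egress_rate_per_gb = 0.09         # assumed per-GB internet egress rate, illustrative only
replicated_gb_per_month = 50_000  # assumed cross-cloud replication volume, illustrative only
annual_egress_cost = replicated_gb_per_month * egress_rate_per_gb * 12

print(f"Estimated annual downtime losses: ${annual_downtime_cost:,.0f}")
print(f"Estimated annual cross-cloud egress: ${annual_egress_cost:,.0f}")
```

Under those assumptions, the egress bill is a rounding error next to the downtime losses, which is the cost-benefit trade Spencer describes; with a direct-connect style fixed pipe, the comparison shifts further in replication's favor if you actually fill the bandwidth you provision.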
Corey: It feels like whenever you're talking about operational resiliency, it becomes a full-stack conversation at almost every level. Outages where we had a full DR site ready to go, but we could not make a DNS record change due to the outage in order to point at that DR site, loom large in my memory. Having a database that's able to abstract that away sounds great. An approach that I've seen work from the opposite direction, for some values of work, has been the idea that you handle it at the application layer and then move everything up into code that solves for it.
That sidesteps a bunch of historical problems with databases that don't like to replicate very well, at the cost of introducing a whole bunch more. The takeaway that I took from all of this has been that everything is complicated and no one's happy. We still have outages. We still see a bunch of weird incidents that are difficult, if not impossible, to predict in advance; with the benefit of hindsight, they look blindingly obvious.
It's nice to know that at least the executives at these large companies feel that as well, and that they aren't blaming their people. Like, so what is the reason that you had those outages? Our crappy IT people. I did not see that as a contributing factor in virtually any part of the report that I scanned, but I may have missed that part.
Spencer: I don't think people blame their staffs. I mean, it is an overwhelming challenge. And no matter what you do, whether you're migrating and trying to modernize or you're building from scratch with a best-of-breed selection of technologies, you're going to have new problems. It's kind of like the devil you know versus the devil you don't.
But I think there is an opportunity to make incremental progress that really can address some of the things that are becoming unsupportable, just because you have too many pieces cobbled together. When you were doing everything in your own data center, everything was under your control.
Things changed very slowly. There was one set of concerns. You didn't need this new infrastructure and distributed capabilities and so forth. But as things move into the public cloud, everything is shifting, and all these different connected things are introducing their own points of failure.
You have to kind of move with the times, right? You can't accept that the old way of doing things is the same as the new way, even though the new way is not going to remove all the problems and, in fact, will introduce some things you haven't seen before. There is empirical experience from our customers, at least, that you can move beyond some of the things that are causing an unacceptable number of outages. For example, availability zones going away, or nodes dying, or networks having partial partitions: those are the kinds of things that a distributed architecture can work around.
And also things like disks, like Elastic Block Store in AWS having high-latency events, one in every million writes, right? That might not have been a problem anyone had on their radar, but it sure afflicts you when you're moving a huge application into the public cloud. So how do you deal with that?
Well, on a legacy database you really are kind of stuck on that EBS volume, and if it misbehaves underneath you, your end application is going to experience that pain. But with a distributed architecture, there are all kinds of interesting things you can do with automated failover between multiple EBS volumes, across multiple nodes, across multiple facilities, across regions, and even across cloud vendors.
The right way to think about it is that in this new world, you actually have the opportunity to define what kind of an outage you're looking to survive automatically. So it's kind of like redefining your threshold for what's a disaster, where you're going to have a recovery step and postmortems for all the affected applications; you kind of move that threshold forward.
You say, we're going to be able to survive an availability zone going away, and it's going to have this additional cost; or, we're going to survive an entire region going away. That happens, for whatever reason. Sometimes it's DNS. But once a year you see these things and you see all the companies that are affected, and you can actually have your entire application survive that, if the application is diversified across multiple regions.
And that means your application code is running in those multiple regions, and your database has replicas of the data in those different regions, and the whole thing needs to be tested.
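As a concrete illustration of declaring that survival threshold in the database, here is a minimal sketch using CockroachDB's multi-region SQL, driven from Python over the PostgreSQL wire protocol. The database name, region names, and connection string are placeholders, and it assumes a cluster whose nodes were already started with localities in those three regions.

```python
# Minimal sketch: declaring a survival goal on a CockroachDB database.
# Assumes a multi-region cluster with node localities us-east1, us-west1,
# and europe-west1; connection details below are placeholders.
import psycopg2  # works because CockroachDB speaks the PostgreSQL wire protocol

conn = psycopg2.connect("postgresql://app_user:secret@localhost:26257/bank?sslmode=require")
conn.autocommit = True

with conn.cursor() as cur:
    # Tell the database which regions should hold replicas of its data.
    cur.execute('ALTER DATABASE bank SET PRIMARY REGION "us-east1"')
    cur.execute('ALTER DATABASE bank ADD REGION "us-west1"')
    cur.execute('ALTER DATABASE bank ADD REGION "europe-west1"')
    # Move the disaster threshold: keep serving reads and writes even if an
    # entire region disappears, not just a node or an availability zone.
    cur.execute('ALTER DATABASE bank SURVIVE REGION FAILURE')

conn.close()
```

The failover itself then becomes the database's job rather than a runbook step; proving that the rest of the application behaves when a region actually goes dark is still on you, which is the testing point that comes next.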
Corey: That's the trick.
Spencer: That is the absolute trick. And when you read the report, you'll see that most of those surveyed are not very prepared to handle outages.
It's something like 20 percent of companies reporting being fully prepared, 33 percent had structured response plans, and less than a third regularly conduct failover tests.
Corey: From my perspective, it's always been valuable to run in an active-active configuration if you need both ends to work correctly. Otherwise: we tested it a month ago.
Great. There have been a bunch of changes to your code base since then. Have those been tested? You have further problems dealing with the challenge of knowing when to go ahead and flip the switch: okay, we're seeing some weird stuff, do we activate the DR plan, or do we just hope that it subsides? So much of the fog of an incident is always around what's happening. Is it our code? Is it an infrastructure provider? What is causing this? And do we wind up instigating our DR plan? Because once that's started, it's sometimes very hard to stop or to fail back.
Spencer: That's a huge point, and it's one of the reasons that less than a third regularly conduct failover tests: often, conducting a failover test means that you initiate an outage, because that's how most disaster recovery and, you know, active-active failovers work.
Although, to your point, active-active, sort of like a traditional Oracle GoldenGate setup, does allow you to be testing both ends, your primary and your secondary, so to speak, because they're both actively taking reads and writes and so forth. So you know the participants both work, they're both there, they're both reachable, and so forth.
But if you really wanted to test what happens if, for example, you make it so that one of the locations holding one of those replicas is no longer visible, all kinds of other things can go wrong. Plus, in order to do that, you may actually end up not having the full commit log of transactions in the database replicated.
So you might actually create the conditions for some data regression or even data loss. People are loath to embrace that kind of testing on a regular basis because it can be so disruptive. But you do need to. If you don't turn off one of those data centers, you don't really know how your application might react, or which other components are dependent on that data center that you just totally forgot about.
Someone put in some new message queue thing that was only running in that one place, and now that message queue is down and the whole system backs up. These are inevitable problems if you don't test for them, right? The beauty, and I'll give Cockroach another plug here, of a modern replication configuration like Cockroach's, which is called consensus replication, is that you don't just have a primary and a secondary in an active-active configuration.
You would have three or more replication sites and you only need the majority up. So if you have three, you need two of them to be available. If you have five, you need three of them to be available.
Corey: Odd numbers are very important for this, to avoid split-brain.
Spencer: Exactly. Or, you know, if you have four, you can do four, but that means you don't really get much benefit from it, because you always need three to be up.
You just need the majority. So Cockroach can handle all of those different configurations, but the beauty is you can actually turn off any of these replication sites, whether it's a node, an availability zone, a region, or a cloud vendor, and you have a total expectation that there's not going to be any kind of data loss or data regression or anything.
That's just how the system works. It's not the sort of asynchronous, conflict-resolution-prone, old-fashioned way of doing things. It's a new kind of gold standard that does let you do this testing in situ with very real-world scenarios. And that can, I think, change these statistics for companies, right, that less than a third regularly conduct failover tests, when you need to regularly conduct these failover tests.
And by the way, that's still not going to get you to 100 percent. It just won't. There are things you can't imagine that you wouldn't have tested for, but you can get a lot closer to the hundred percent.
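For readers who want the quorum arithmetic behind that spelled out, here is a small sketch of the majority rule Spencer describes for consensus replication (illustrative only; the function is ours, not an API): with N replication sites you need a simple majority up, so the number of site failures you can tolerate is N minus that majority.

```python
# Quorum arithmetic for consensus replication: a write is durable once a
# majority of replication sites acknowledge it, so the system stays
# available through the loss of any minority of sites.
def quorum(n_sites: int) -> int:
    """Smallest majority of n_sites."""
    return n_sites // 2 + 1

for n in (3, 4, 5):
    q = quorum(n)
    print(f"{n} sites: need {q} up, tolerate {n - q} failure(s)")

# Output:
# 3 sites: need 2 up, tolerate 1 failure(s)
# 4 sites: need 3 up, tolerate 1 failure(s)  <- why 4 buys little over 3
# 5 sites: need 3 up, tolerate 2 failure(s)
```

That last column is the whole argument for odd replica counts: going from three sites to four raises the quorum without letting you lose any more of them.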
Corey: Is there hope? This, I guess, is my last question for you on this, because a recurring theme throughout this report is that folks are worried, folks are concerned about outages, about regulatory pressures, about data requirements.
It feels like fear is sort of an undercurrent running through the industry right now, particularly with regard to operational resiliency. Are we still going to be having this talk in a year or two with nothing having changed, or do you see a light at the end of the tunnel?
Spencer: It's a great question. I think the anxiety is never going to go away. I mean, recalling my reference to the Crossing the Chasm idea, that really is how people adopt new technology, right? You have these innovators and early adopters and then the early majority, and then you're at about the halfway point in the distribution.
On the other side of that, you have the late majority, the late adopters, and the companies that just never change; they're on a zero-maintenance diet. I think you're going to have that inevitably, and the complexity is going to keep increasing, so you're always going to have probably a healthy majority of companies that are behind the curve.
That sounds right, a majority. But it is nevertheless the case that we're in a rapidly evolving landscape: threat landscape, complexity landscape, but also capabilities and potential for new markets and expansion and growth for any one of these companies. It's sort of a mix of exciting and anxiety-inducing.
I think in several years we're gonna be having the same conversation. It won't be the same kinds of technology and the same kinds of threats. Those will have all evolved quite a bit. Meanwhile, the things that we're talking about now will have penetrated deep into that early majority and probably into some of the late majority.
But it will again be these forward-leaning companies that are tackling the newest threats with innovation, whereas most of the rest of the companies out there will kind of be like, huh, that sounds like something that maybe we'll get to, but boy, we're still struggling with last year's problems, or last decade's problems in some cases.
Corey: I do get a strong sense of urgency around this. It feels like there's a growing awareness here where companies will not have the luxury of taking a wait and see approach.
Spencer: There is urgency. We're seeing it across our business. And I think that urgency right now is part carrot and part stick, and that's not quite the right metaphor, but people have an urgency to modernize partly because they want to have all the benefits they think AI is going to bring to their use cases and to their customers, and there's quite a bit of excitement there. And I think there is an increasing degree of urgency, and anxiety, because regulators are starting to look at this with a fairly acute perspective. There's a new regulation over in the EU and in the UK, the Digital Operational Resilience Act, and they're actually looking hard at companies in terms of critical services and infrastructure. So, for example, if you're in banking or in utilities, where people rely on your service, the regulators are starting to assess what your plans are to survive these different kinds of outages I was describing.
Like what happens if your cloud vendor goes away or is deemed unfit for some systemic security or cyber threat. How long does it take you to move your service or to reconstitute in the event of a widespread failure? And those answers right now are not very good across those critical industries. But they're going to get better.
And then that moves the state of the art and moves the regulators' expectations. But of course, it creates a lot of anxiety, and the teeth on these regulations, kind of like the GDPR's, are pretty extreme. So there's a big stick. It's rarely used, but ultimately there's a growing realization of the costs of this complexity and the implications of what that means for society.
That's the perspective these regulations are being fashioned from. And when you're in one of these industries and you've got your budgets and all your interesting new projects to try to grow your market share and so forth, now you've also got a host of new requirements from the regulator.
So it's one thing if you're the kind of company that is fairly on top of the innovation and has made big progress toward migrating your whole estate to more modern technologies and infrastructure, but that's a small fraction of all the companies out there. So yeah, you're right.
There's a lot of anxiety, and I think it's because, in the interest of doing things less expensively and more quickly, building new services more quickly, there's been a lot of additional complexity that leads to failure in unexpected ways. And by the way, AI is just going to make all these things worse, because cyber threats are definitely going to, I think, grow exponentially with the ability to automate.
So I think we're going to see this anxiety continue apace. I don't know if it's necessarily going to get worse, because no matter what the regulatory frameworks require of companies, they can only require so much so quickly, right? They can't sort of break the system's back.
Corey: Companies have always been highly incentivized to avoid outages. If they could be said to have a corporate religion, it's money. They like money. And as you cited, every outage costs them money. They don't want to have those. The question becomes: at what point are the efforts that individual companies are making no longer sufficient?
And I don't necessarily know that there is a good answer to that.
Spencer: No. And like I said, it's rapidly evolving. To your point again about being in a crowd, it's a question of whether you're in the middle of that pack. If you are, you're pretty safe, I'd say. So you have to look at your peers and decide whether you're going to get undue scrutiny that's going to impact your brand or your bottom line, because it's not just regulators, of course, who look at that.
It's your customers. Are they going to go to a competitor if they feel that you're not giving them the kind of service they expect, and the trust they placed in you is being eroded?
Corey: I really want to thank you for taking the time to speak with me about all this. If people want to learn more or get their own copy of the report, where's the best place for them to find it?
Spencer: Let's see, where is that report? Listen, I would go to our website; it's all over that. So it's cockroachlabs.com, and you'll easily be able to find it. It's called the State of Resilience 2025. Of course, you could also just search on Google for it.
Corey: Or we'll just be even easier and put a link to it in the show notes for you.
Spencer: That works too. Thank you for doing that.
Corey: Thank you so much for being so generous with your time. I really appreciate it.
Spencer: My pleasure, Corey. Thank you for having me on.
Corey: Spencer Kimball, CEO and co-founder of Cockroach Labs. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment. And tell us, by the way, in that comment which podcast platform you're using, because that at least is a segment that understands the value of avoiding a monoculture.