Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.
SITC-Omri Sass-002
===
Omri Sass: I would say it's always best effort. We learn based on the knowledge that we have, and when we need to adapt our knowledge, we do our best to adapt it. Our investment in this area is pretty good, and we have people who do ongoing maintenance and continuously look at model improvements.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn, and I am, at long last, thrilled to notice that something in this world exists that I've wanted to exist for a very long time. But we'll get into that. Omri Sass has been at Datadog for something like six years now. Omri, thank you for joining me.
Omri Sass: Uh, thanks for having me, Corey.
Corey: This episode is sponsored in part by my day job, Duckbill. Do you have a horrifying AWS bill? That can mean a lot of things: predicting what it's going to be, determining what it should be, negotiating your next long-term contract with AWS, or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is.
To learn more, visit duckbillhq.com. Remember, you can't duck the Duckbill Bill, which my CEO reliably informs me is absolutely not our slogan. So, you are apparently the mastermind, as it were, behind the recently launched updog.ai. So forgive me: what's Updog?
Omri Sass: You know, I was expecting the conversation to start like that, and I have to say, it's definitely not me being the mastermind there.
I joined Datadog, like you said, about six years ago. This thing has been in the making since before that, some folks would say. I'm happy to share a bit of the history later, but I joined the Applied AI group here at Datadog a couple of months ago, while this project was already ongoing.
So I do have to give credit where credit is due. We have an amazing Applied AI group of data scientists, engineers, and a product manager for whom this has been a passion for quite a while. So I'm not the mastermind, I'm just the pretty face.
Corey: I like it, and it's an impressive beard, I will say. I'm envious; I can't grow one myself. For those who may not be aware of the beautiful thing that is Updog, how do you describe it?
Omri Sass: So, Updog is effectively Downdetector, if you're familiar with that: a way of making sure that very common SaaS providers are actually up. But Updog is powered by telemetry from people who actually use these providers.
It's not a test against all their APIs or anything like that.
Corey: It's also not user-reported, like Downdetector tends to be. And I have to say, it was awfully considerate of you: on the day that we're recording this, for most of the morning, Cloudflare has taken significant outages. In fact, right now as we speak, there's a banner at the top of updog.ai:
"Cloudflare is reporting an outage right now. Updog is not detecting it, as it is on API endpoints that are not watched today. We are working on adding those API endpoints to our watch list." Now, this sounds like a no-brainer. I have been asking various monitoring and observability companies for this since I was a young sysadmin, because when
your website suddenly no longer serves web pages, your big question is: is it my crappy code, or is it a global issue that is affecting everyone? And it's not because of "whose throat do I choke here?" It's "what do I do to get this thing back up?" Because if it's a major provider that has just gone down, there are very few code changes you are going to make on your side that'll bring the site back up.
And in fact, you could conceivably take it from a working state to a non-working state, and now you have two problems. Conversely, if everything else is reporting fine, maybe look at what's going on in your specific environment. For folks who have not lived in the trenches of 3:00 AM pages, that's a huge deal.
It's effectively bifurcating the world in two.
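For readers who want that triage step concrete, here is a minimal sketch of the idea Corey is describing: probe a few well-known reference endpoints to decide whether the problem is local or upstream. The endpoint list, thresholds, and verdict strings are illustrative assumptions, not anything Updog or Datadog actually does.

```python
# Minimal triage sketch: is the outage on my side or the provider's?
# Endpoints and thresholds are illustrative, not anything Updog uses.
import urllib.request
import urllib.error

REFERENCE_ENDPOINTS = [
    "https://www.google.com",
    "https://www.cloudflare.com",
    "https://aws.amazon.com",
]

def reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with any HTTP status at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # Got a response, even an error status -- the network path works.
        return True
    except (urllib.error.URLError, OSError):
        return False

def triage(my_service_url: str) -> str:
    reference_up = sum(reachable(u) for u in REFERENCE_ENDPOINTS)
    if reachable(my_service_url):
        return "your service answers; look elsewhere"
    if reference_up == 0:
        return "nothing is reachable; suspect your own network first"
    if reference_up < len(REFERENCE_ENDPOINTS):
        return "partial reachability; suspect a broad upstream incident"
    return "the internet looks fine; suspect your own code or config"

if __name__ == "__main__":
    print(triage("https://example.com"))
```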
Omri Sass: That is precisely right, and I'll say one of the biggest reasons it took so long, at least for us to get there, is the understanding that we can't just test all of these endpoints, right? There's a reason that, like you mentioned, Downdetector uses user reports.
If we were to run synthetic tests, for example, against all of these endpoints ourselves, and we report something as down, we now need to verify that the thing that's down is the actual website, and not our own testing that's no longer correct.
Corey: And it's worse than that, because take a look at any of them, like take AWS, as I often am forced to do. Okay, well, is EC2 working in us-east-1? That's over a hundred data center facilities.
At that scale, it's not a question of "is it down?" It's a question of "how down is it?" You can have a hundred customers there, and five are saying things are terrific and five are saying they're terrible, and the rest are at varying degrees between those two points, because it's blind people trying to describe an elephant by touch.
Omri Sass: That is exactly right. And rest assured, "asymmetry" is probably the word with the most syllables I'm gonna use today; that's above my IQ grade. But what you just described is exactly the realization that we had about the asymmetry of data.
We have more data than any individual one of our customers, i.e., we have all of the blind people touching the elephant at the same time, and not needing to describe it, right? We have the sense of touch for all of these folks, and what we do is actually look at this data in aggregate and use it to try and understand whether all of these endpoints are up or down.
Now let me try to make that slightly more real. When we started going along this journey, our realization was that when EC2 is down (that's actually the specific example), the load on our Watchdog backend increases significantly, because everyone has a higher error rate, higher latencies, and a drop in throughput. Watchdog is our machine learning anomaly detector.
And so our backend had to compensate for that, and we saw a surge in processing power. So we're like, hey, we're not looking at customer data here, this is purely within our systems, but something is definitely going on in the real world. It's not a byproduct of anything that we're doing. It's not tied to any change that we've made.
It's not anything on our side; our systems are functioning. And through investigating that, we realized that it's actually tied to EC2. And then we started figuring out: wait, what are the most common things that people rely on that are observed with Datadog? If you look at the updog.ai website today, that's also a really easy way to see what people use Datadog to monitor among their third-party dependencies,
'cause it's just the top API endpoints that we observe.
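A back-of-the-envelope sketch of the aggregation idea Omri describes: pool per-customer error rates for one third-party endpoint and flag it when an unusual share of customers are degraded at once. The data shape and thresholds below are assumptions for illustration, not Datadog's actual model.

```python
# Sketch: flag a third-party endpoint as degraded when enough distinct
# customers see elevated error rates at the same time. Thresholds are
# made up for illustration, not Datadog's real logic.
from dataclasses import dataclass

@dataclass
class CustomerSample:
    customer_id: str
    requests: int
    errors: int

def endpoint_looks_down(
    samples: list[CustomerSample],
    per_customer_error_threshold: float = 0.25,
    degraded_share_threshold: float = 0.30,
    min_reporting_customers: int = 20,
) -> bool:
    """True if an unusual share of reporting customers are degraded at once."""
    reporting = [s for s in samples if s.requests > 0]
    if len(reporting) < min_reporting_customers:
        return False  # not enough independent signal to say anything
    degraded = [
        s for s in reporting
        if s.errors / s.requests >= per_customer_error_threshold
    ]
    return len(degraded) / len(reporting) >= degraded_share_threshold
```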
Corey: Well, I'm curious on some level, like, effectively, what I care about on this. Today, for example, there are a bunch of folks that wound up... I'm looking at this now. You're right, you don't have the Cloudflare endpoints themselves, but Adyen is first alphabetically.
They took a dip earlier. AWS took a little bit of one; I'm sure we'll get back to that in a minute or two. PayPal was down the drain. OpenAI had a bunch of issues. X, formerly known as Twitter, is probably the worst of all of these graphs as I look at it. And great, this is a high-level approach to "this is what I care about."
I almost want to take it one level beyond this in either direction. On one side, just give me a traffic light: is something globally messed up right now, like earlier today when multiple services were all down in the dumps? That's what I wanna see at the top, just the "yep, things globally are broken."
Maybe it's a routing convergence issue where a lot of traffic between Oregon and Virginia no longer routes properly. Maybe it's that an auth provider is breaking and everything is out to lunch at that point. You almost just wanna see, at a high level, with no scrolling, above the fold: what's broken right now?
Omri Sass: It's amazing that you should say that. I know some people who work on this; I can file the feature request for you.
Corey: Oh, honestly, yeah. It was either ask you on social media or ask you directly here, in a scenario in which you cannot possibly say no.
Omri Sass: Oh, I'm a mean bastard, but that one's easy for me, 'cause we're already working on it.
Corey: Because then you go the other direction. With AWS right now, each of these little cards is clickable.
So I click on one and it talks about different services: DynamoDB, Elastic Load Balancing, Elasticsearch, so on and so forth. But I'm not seeing, at least at this level, here or anywhere else for that matter, anything that breaks it down by region. And AWS is very good at regional isolation. So I'm of two minds on this.
On the one hand, it's "well, Stockholm is broken, what's going on?" Yeah, there are five customers using that, give or take; I exaggerate. Fine, it's great as a big picture of what is going on, regardless of the intricacies of any given provider's endpoints. On the other hand, it'd be nice to have more granularity.
So I can see both sides.
Omri Sass: We're working towards that as well. The first release of Updog, which I think was fortunate timing for us and for our users, is basically where we are today. We're investing heavily in improving granularity and coverage. So to your point about Cloudflare, we went with the most common APIs
that people actually look at. If you look at some of the error messages that people have been posting on, you know, the airwaves formerly known as Twitter, you'll see that they mention a particular endpoint; I think it's the Cloudflare challenge one. That one is, as far as I know, not a documented API, and I'm sure I'm gonna be corrected about that later, but it's not one that is commonly observed by our users.
And so when we began this journey, we had to go off of available telemetry. Now that it's public, we can take feature requests, we can learn our lessons, we can improve everything that we're doing, which is exactly what we're gonna do. Cloudflare in particular, we're working overtime to make sure that we account for that.
Corey: Yes. People accuse me of being an asshole when I say this, but I'm being completely sincere: I have trouble using a cloud provider where I do not understand how it breaks. And people think, "oh, you think we're gonna go down?" Everything's going to go down. What I wanna know is: what's that going to look like?
Until I saw a number of AWS outages, I did not believe a lot of the assertions they made about hard region separation, because we did see cascading issues in the early days. It turns out what those were is: oh, maybe if you're running everything in Virginia and it goes down, you're not the only person trying to spin up capacity in Oregon at the same time, and now you have a herd-of-elephants problem.
Turns out your DR tests don't account for that.
Omri Sass: Oh, yeah. And I think there are a couple of famous stories about people who realized that they should do DR ahead of other people and basically beat the stampede, so their websites were up during one of these moments, and then everyone else came along and was like, hey, we're out of capacity.
Like, what's going on? So there are a couple of famous stories like that through history. And I'll just say, the example that you just gave, I think, is one that AWS has told quite a bit, and regional isolation today, I think, makes sense. But even if you look at the us-east-1 outage from a couple of weeks ago, a lot of the managed services were down, not necessarily because the service itself was down. You know, I'm not involved, I wasn't there.
But my understanding is that DynamoDB was one of the services that were directly impacted, and a lot of their other managed services use Dynamo under the hood. So the cascading failure would happen even if it's not cascading regionally; it cascades logically between the different services, and they have so many of them.
So, like, who's to know? But we now have this information; we can use it, learn, and adapt our models based on it.
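A toy sketch of what "cascades logically, not regionally" looks like in practice: given a guessed dependency map between managed services, walk the graph to estimate the blast radius when one of them goes down. The edges below are hypothetical, not an authoritative map of AWS internals.

```python
# Toy blast-radius walk over a hypothetical service dependency map.
from collections import deque

# service -> services that depend on it (illustrative edges only)
DEPENDENTS = {
    "dynamodb": ["managed-service-a", "managed-service-b", "console"],
    "managed-service-a": ["downstream-feature-x"],
    "managed-service-b": ["downstream-feature-y"],
}

def blast_radius(root: str) -> set[str]:
    """Services transitively affected when `root` goes down (root excluded)."""
    affected, queue = set(), deque([root])
    while queue:
        svc = queue.popleft()
        for dependent in DEPENDENTS.get(svc, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

print(sorted(blast_radius("dynamodb")))
```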
Corey: What happens... again, I'm sure that you're gonna be nibbled to death by ducks, and we will get there momentarily, but what happens when that learning becomes invalid?
What I mean by that is: when we remember the great S3 apocalypse of 2019, they completely rebuilt that service under the hood, which to my understanding is the third time since it launched in 2005 that they had done that. And everything that you've learned about how it failed back in 2019 is no longer valid.
Or at least an argument could be made in good faith that that is the case.
Omri Sass: So for that, I would say it's always best effort. We learn based on the knowledge that we have, and when we need to adapt our knowledge, we do our best to adapt it. Our investment in this area is pretty good, and we have people who do ongoing maintenance and continuously look at model improvements.
And so if they do something like that, hopefully we'll catch it. And I will say, AWS in particular, but a lot of the providers that you would see on Updog, we have good relationships with them. We would hope that they come to us and tell us, "hey, this thing that you're saying is incorrect," or "you need to update this, let's work on it together."
Corey: Yes, that is what I really want to get into here, because historically, when there have been issues, for a while I redirected gaslighting.me to the AWS official status page, because that's what it felt like back in the days of the green circles everywhere. I had an inline transformer that would downgrade things and be a lot more honest about how these things worked, because
suddenly it would break in a region and it would take a while for them to update it. I understand that: it's a question of how do you determine the impact of an outage, how bad is it, when do you update the global status page versus not? And there are political considerations with this too, I'm not suggesting otherwise. But it gets back to that question of: is something on your side broken, or is it on my side?
And every time this has happened, they get so salty about Downdetector and anything like it, to the point where they've had VPs making publicly disparaging commentary about the approach. I'm not looking for the universal source of truth for things; that's not what I go to Downdetector for.
I'm not pulling up downdetector.com to look for a root cause analysis to inform my learnings about things, especially given that it's just user reports. So I'm sure today they're trotting that line out already, because there was an AWS blip on Downdetector, just because that is what people equate "the internet is broken" with, which is sort of a symptom of their own success. They have been unreasonably annoyed by the existence of these things, but it's the first thing we look for. It's effectively putting a bit of data around anyone I know on the tire fire, formerly known as Twitter, noticing an issue, because suddenly things started breaking and I need to figure out, again: is it me or is it you?
Omri Sass: So, there are two things that you said here. First of all, my go-to is dumpster fire, but I'll take tire fire. You know, I learned something new today.
Corey: Just like the bike shed is full of yaks in need of a shave.
Omri Sass: Oh, wow. I'm taking that as a personal offense to my beard. The other thing that you said here, I think, is a realization that is starting to permeate across the industry.
And that is that some people will use this to measure their SLAs, and they'll use it to go to their account teams and complain, demand credits, and things like that. And that's a valid reason for folks like AWS and other providers to maybe, you know, add a bit of salt to their behavior, just a smidge. But the flip side is that more and more people come to the same realization that you come to in the first moment of responding to an incident, when I'm still orienting myself in the 3:00 AM case. And the worst one is always at 3:00 AM. I'm groggy, I'm still orienting myself,
I'm trying to figure out: what action do I take right now? "Hey, this thing is not even my problem, it comes from somewhere else" is one of the most important learnings that I can grab in that moment so I don't waste time on it, especially given that most people don't think to ask that unless they're very experienced and have gone through these types of issues. Because of that,
we see more and more people who are actually interested in joining this. And when we launched, like on the day of launch, our legal team was basically ready for all the inbound salty, angry emails. We didn't get any. But in the same week that we launched, we had a provider who's not represented on that page reach out to us and say, hey, we're Datadog customers,
like, why aren't we even up there? And we're like, oh wow, come talk to us.
Corey: We didn't know you broke.
Omri Sass: Yeah, exactly.
Corey: This episode is sponsored by my own company, Duckbill. Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where Duckbill comes to help. Remember, you can't duck the Duckbill Bill, which I am reliably informed by my business partner is absolutely not our motto. To learn more, visit duckbillhq.com.
On some level, this is a little bit of "you must be at least this big to wind up appearing here." I know people are listening to this and they're gonna take the wrong lesson away from it.
This is not a marketing opportunity. I'm sorry, but it's not. These are systemically important providers. In fact, I could make an argument about some of the folks that are included: is this really something that needs to be up here? Azure DevOps is an example. Yeah, if you're on Azure, you're used to it breaking periodically.
I'm sorry, but it's true. We talk about not knowing if it's a global problem: there have been multiple occasions where I'm trying to get GitHub Actions to work properly, only to find out that it's GitHub Actions that's broken at the time.
Omri Sass: Totally fair. But we still want to be able to reflect that to you.
Corey: But you also don't wanna turn this into doomscrolling forever. It would almost be interesting to have a frequency algorithm for when something is breaking right now. You sort of have to hope that it's gonna be alphabetically supreme, whereas it would be nice to surface that and not have to scroll forever.
Again, minor stuff. Part of the problem is you don't get a lot of opportunities to test this with wide ranging global outages that impact multiple providers. So make hay while the sun shines.
Omri Sass: Exactly. Like, we got two since we released this, within a very brief window. And, you know, as I say this, I might sound like I have a smile on my face. Obviously, my heart goes out to all the people who have to actually respond to those incidents, and to every one of their users who's having a rough day. But to your point, this is a golden opportunity to learn and to make sure that that knowledge is disseminated and available to everyone as equally as possible.
"Never let a good crisis go to waste" is one of the first adages that I heard our CTO speak when I joined the company, and it's etched into the back of my brain.
Corey: And it's an important thing. You've overshot the thing that I was asking for, because as soon as you start getting too granular, you start to get into "works for me, not for you,"
and it descends into meaningless noise. All I really wanted was a traffic-light webpage you folks put up there, or even a graph. Don't even put the numbers on it; make it logarithmic, control for individual big customers that you have, and just tell me what your alerting volume is right now. That's enough signal to answer
the question that I have. This is superior to that, because, oh, great, now I know whether it's my wifi or whether it's the site that isn't working. Some of these services are reliable enough that if they're not working for me, my first thought is that my local internet isn't working as well as it used to. I mean, Google crossed that boundary a while back.
If google.com doesn't load, it's probably your fault.
Omri Sass: Completely agreed.
Corey: My question, without sounding rude about this, is: why did this take so long to exist as a product?
Omri Sass: When I did kind of the rounds with the team and the director of data science who runs the group, he's an old friend of mine,
we've been working very closely for a long time, I asked the same question, and I heard a really funny bit of history. If you Google "Datadog Pokémon," you may find something kind of funny, where in 2016, two engineers here at Datadog, I think they were playing Pokémon Go, they were playing all over, and they realized that there were a bunch of connectivity issues, and people were literally
They were like playing all over and they'd realized that there were a bunch of connectivity issues and everyone, like, people were literally like. Trying to swap their phones and like tap on it, like figure out if it was the, the phone that was wrong, the wifi that was wrong, uh, or if Pokemon Go itself, like Niantic servers were down.
And they basically built a public Datadog dashboard that kept track of a whole bunch of health measures for the Niantic APIs. They published that, and it made a splash on what was then Twitter; you tell me if it was before or after it turned into a dumpster full of tires on fire. And that kind of idea, like, hey, we can do this type of public service, stuck with the same group.
And then a couple of years later, we released Watchdog, the ML engine that does anomaly detection for us. And then came that realization that I mentioned earlier, where every time there's a major outage with one of the cloud providers, with one of the main SaaS providers out there, we would see load increase.
And ever since then, we worked on refining the model. It started with data that we had, and then it moved to what type of telemetry is the best predictor. We found that if we take fairly naive approaches, it gets noisy, actually really noisy, and we had so many cases where the service isn't in fact down.
It's a one-off, or it's something that changed, either on our end or in a customer's environment, that makes it look like the service is down. So we had to do a lot of refinement, and we ended up building a proprietary model to do this. It's an actual ML model that we built, homegrown; it doesn't use any of the big AI providers, anything like that.
It processes a massive amount of data. I think he threw out the number of petabytes of data it processes in a given time; I don't remember. And we had to build a low-latency pipeline for it, because the last thing that we want is to only show that something is down five minutes after it went down. So there are a bunch of these things where we started building and we're like, oh, this thing seems to work, and when it says something's down, it's mostly down, but not always, and then it's late. So it has to be high reliability. It has to be decoupled from the rest of Datadog, for if we're down. Not that we're ever down.
It's my favorite joke to tell in front of users, 'cause you know, in a room full of SREs, you tell someone "our code is perfect, we never have any issues," and everyone starts laughing.
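One hedged sketch of how the noise-versus-latency trade-off Omri mentions might be handled: only publish "down" after the degraded signal persists across a few consecutive aggregation windows, and clear it as soon as a clean window arrives. The window count is invented for illustration and is not Datadog's actual logic.

```python
# Debounce sketch: trade a little detection latency for fewer false alarms.
class OutageDebouncer:
    def __init__(self, windows_to_confirm: int = 3):
        self.windows_to_confirm = windows_to_confirm
        self.consecutive_degraded = 0
        self.published_down = False

    def observe_window(self, window_degraded: bool) -> bool:
        """Feed one aggregation window; return the published up/down state."""
        if window_degraded:
            self.consecutive_degraded += 1
            if self.consecutive_degraded >= self.windows_to_confirm:
                self.published_down = True
        else:
            # Recover immediately so the "it's back up" signal isn't late.
            self.consecutive_degraded = 0
            self.published_down = False
        return self.published_down
```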
Corey: That's why Datadog needs to be a provider on this. And it easily could: that doesn't need to be a graph, that could just be a static image.
Omri Sass: There we go. We're also working on that, you should know. We try to be fairly transparent. Not the topic of today's conversation, but we have a FinOps tool, Datadog Cloud Cost Management, and we put Datadog costs on there by default at no additional charge. So we do try to make sure that we accept our place in the ecosystem in that way, or try to be humble about our place in the ecosystem.
Corey: Oh, part of the challenge too is that, I would argue, and you can tell me if I'm wrong on this, I don't see a Datadog outage as being a critical path issue. By which I mean, if you folks go dark globally,
no one's website should stop working because "ooh, our telemetry provider is not working, therefore we're going to block on I/O and it's going to take us down." Sure, they're gonna have no idea what their site is doing or if their site is up at all. But you're a second-order effect; you're not critical path.
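A rough sketch of what Corey means by keeping telemetry out of the critical path: emit metrics through a bounded queue drained by a background thread, and shed them when the backend is slow or down rather than blocking requests. The send callable is a stand-in, not any real vendor client.

```python
# Fire-and-forget telemetry: requests never block on the metrics backend.
import queue
import threading

class NonBlockingTelemetry:
    def __init__(self, send, max_buffered: int = 10_000):
        self._queue = queue.Queue(maxsize=max_buffered)
        self._send = send  # stand-in for whatever actually ships the metric
        threading.Thread(target=self._drain, daemon=True).start()

    def emit(self, metric: dict) -> None:
        """Called from request handlers; never blocks, never raises."""
        try:
            self._queue.put_nowait(metric)
        except queue.Full:
            pass  # shed telemetry rather than slow down the request path

    def _drain(self) -> None:
        while True:
            metric = self._queue.get()
            try:
                self._send(metric)  # may be slow or fail; only this thread waits
            except Exception:
                pass  # the provider being down is not our outage
```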
Omri Sass: That is very correct, and to be fair, it's something that allows me to sleep much better at night. But I will say that there are a couple of good examples of customers who use the observability data to gate deployments. So if you practice continuous deployment or continuous integration and you don't have observability, suddenly you need to shut down your ability to deploy code.
And that may not be "hey, the website is down," but it is considered, in our language, a sev-one or sev-two incident, so, like, the worst kinds of incidents. And there are also other companies that are not e-businesses; they have real-world brick and mortar, or, you know, some airlines or things like that, where if they lose observability into some parts of their system, there is a real-world repercussion.
And so we take our own reliability very seriously. And again, one of those adages I've heard our CTO say that have been etched in my brain: our reliability target needs to be as high, so as strict, as our most strict customer's, and that's how we treat it. And I will say our ops game is pretty good.
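A minimal sketch of the deployment-gating pattern Omri describes: refuse to ship if the observability backend is unreachable, because you would be deploying blind. The health URL and check below are placeholders, not a real Datadog API.

```python
# Deploy gate sketch: block the pipeline when telemetry is unavailable.
import sys
import urllib.request
import urllib.error

OBSERVABILITY_HEALTH_URL = "https://observability.example.internal/health"  # placeholder

def observability_is_healthy(url: str, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    if not observability_is_healthy(OBSERVABILITY_HEALTH_URL):
        print("Observability is unavailable; blocking this deploy.")
        sys.exit(1)
    print("Telemetry looks healthy; proceeding with deploy.")
```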
Corey: It has to be, at some point. I do have, again, things that you can bikeshed to death if you need to. For AWS, you're monitoring 12 services. How did you pick those?
Omri Sass: Highest popularity among our users. These are by far the most used AWS services among the Datadog customer base.
Corey: Fascinating. Looking at the list, I'm not entirely surprised by any of these things, with the exception of KMS.
I don't love the fact that it's as popular as it is, but there are ways to use it for free, and it does feel like it's more critical path, so you're gonna see more operational log noise out of it, for lack of a better term. I'm sure that right now, the biggest thing that someone at Amazon is upset about internally is that their service isn't included.
I don't see Bedrock on this. Don't worry, they already have Anthropic's and OpenAI's APIs on here, so they're covered there.
Omri Sass: And again, if anyone from the AWS Bedrock team wants to come and talk to us, they know where to find us. We have good friends on the Bedrock team.
Corey: Oh yeah, I still find it fun as well that there are a couple of these folks where... the first one listed, Adyen, A-D-Y-E-N: off the top of my head, I don't know who that is or what they do. So this is a marketing story here.
Omri Sass: Oh, interesting. Well, good for the Adyen folks.
Corey: And "payments, data, and financial management in one solution." Okay, now we know: it's basically a pipeline for payments.
Omri Sass: Yes. And I'm actually willing to bet you that you have used them. You know, just like Block, formerly Square, they have devices where you tap your credit card to pay. I think, if memory serves, they're more popular in Europe than the US, but I've seen their devices here too.
Corey: They are a Dutch company, which would explain it. It's useful stuff; the world is full of things that we don't use. It's weird, 'cause I'm thinking of this in the context of infrastructure providers, like, "what the hell kind of cloud provider is this?" A highly specific one, thank you for asking.
Omri Sass: Exactly.
Corey: What do you see as coming next for this? I mean, you mentioned that there's the idea of the overall "here's what's broken," and we learn as we go on this, but if you had a magic wand, what would you make it do?
Omri Sass: Well, the easy answer to that is what Corey said. But jokes aside: regional visibility into all the services.
Corey: The counterpoint to that is that global rolling outages are not a thing with AWS. They are frequently a thing with GCP, and with Azure, it's Tuesday, they're probably down somewhere already.
Every provider implements these things somewhat differently. And then you also have the cascade effects: as we saw this morning, when Cloudflare goes down, a lot of API endpoints behind Cloudflare will also go down. If AWS has a bad day, so does half of the internet. There's a strong sense that this is becoming sort of a symptom
of centralization, where... again, reliability is far higher today than it ever has been. The difference is that we've centralized across a few providers, such that when they have a very bad day, it takes everyone down simultaneously, as opposed to your crappy data center is down on Mondays and Tuesdays and mine's down on Wednesdays and Thursdays, and that's just how the internet worked for a long time.
Omri Sass: So I think that there's, let's say, a moment of reckoning here for a lot of companies, the cloud providers included, and many of their customers. You know, folks who maybe never invested in reliability or resilience or disaster recovery, or any flavor of being resilient (I think resilient is probably the best word here)
to any particular outage. Because to your point, while some things are more centralized, right, the main hyperscalers are where we're mostly centralized, a lot of things are significantly more distributed. And that distribution, on the one hand, means we're more resilient in the aggregate, but it's harder to figure out what's actually broken.
And so, on the one hand, I would hope that a bunch of companies that are critical for their users and that have critical infrastructure up in the cloud would remember that the internet is not actually just us-east-1. A lot of folks didn't realize that the cloud was mostly just us-east-1, and we all learned that the hard way a couple of weeks ago. I'd hope they would start to move things to other places, or
build redundancies. And then, you know, maybe the negotiation here is: I'll be down, but I'll be down for not as long as the cloud provider; I can fail over safely or degrade gracefully, or any of these things, a lot of nice-sounding terms that we can throw at it, but that a lot of folks haven't heard of, or decided to
not prioritize, or they don't understand
Corey: what the reality of that looks like, because you can't simulate an S3 outage. Yeah, you can block it from your app. Sure, terrific. You can't figure out how your third-party critical dependencies are going to react, then. And when they all rely on each other, it becomes a very strange mess.
And that's why outages at this scale are unique.
Omri Sass: Yep. Completely agreed.
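A small sketch of "degrade gracefully" in the sense discussed above: wrap a hard third-party dependency in a stale-cache fallback so an upstream outage turns into reduced functionality rather than a full outage. The fetch callable and TTL are assumptions for illustration, not any particular provider's SDK.

```python
# Graceful degradation sketch: serve stale data when the upstream is down.
import time

class GracefulDependency:
    def __init__(self, fetch, stale_ttl_seconds: float = 3600):
        self.fetch = fetch                 # callable that hits the third party
        self.stale_ttl_seconds = stale_ttl_seconds
        self._cache = {}                   # key -> (value, timestamp)

    def get(self, key):
        try:
            value = self.fetch(key)
            self._cache[key] = (value, time.time())
            return value, "fresh"
        except Exception:
            # Upstream is unreachable or erroring: serve stale data if we can.
            cached = self._cache.get(key)
            if cached and time.time() - cached[1] < self.stale_ttl_seconds:
                return cached[0], "stale"
            return None, "unavailable"     # degrade, don't crash the caller
```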
Corey: I want to thank you for taking the time to walk me through the thought process behind this and how it works. If people wanna learn more, where's the best place for them to find you?
Omri Sass: updog.ai. And then after that, at datadoghq.com. And if you're already a Datadog customer, I'm sure your account team knows where to find me.
Corey: You're going to regret saying that, because everyone, everyone is a Datadog customer. Omri, thank you so much for your time. I appreciate it.
Omri Sass: Thanks for having me.
Corey: Omri Sass, Director of Product Management at Datadog.
I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, bewildered comment along the lines of "Date-a-dog? Like Tinder for pets? That's disgusting," showing that you did not get the point.