Real World DevOps

Wherein Corey Quinn and Mike Julian pontificate about the dangers of perfect infrastructure, why multi-cloud is (probably) a dumb idea, and that your biggest risk in a large-scale disaster is your entire team quitting to help your competitor for 10x more money.

Show Notes

About the Guest

Corey is a Cloud Economist at the Quinn Advisory Group. He has a history as an engineering director, public speaker, and cloud architect. Corey specializes in helping companies address horrifying AWS bills, hosts the Screaming in the Cloud podcast, and curates LastWeekinAWS.com, a weekly newsletter summarizing the latest in AWS news, blogs, and tips, sprinkled with snark.

Links Referenced: 


Transcript

Mike: Running infrastructure at scale is hard, it's messy, and it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're gonna talk about the rough edges. We're gonna talk about what it's really like running infrastructure at scale.



Mike: Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly, and author of O'Reilly’s Practical Monitoring.



Mike: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source. They also have a hosted commercial version. You can check all of this out at influxdata.com.



Mike: Hi folks, welcome to the Real World DevOps podcast. I'm here with Corey Quinn the editor of Last Week in AWS. Welcome to the show, Corey.



Corey: Thanks, Mike. It's always a pleasure to hear myself talking.



Mike: I'm sure it is. So for those who don't know, Corey is one of my closest friends. So, this might get a little off the wall and banter-y. But hopefully everyone will enjoy it. So for those who don't have the pleasure of having met Corey yet, Corey, what is it that you do?



Corey: A lot of things is probably the best and most honest response to that. But what I'm best known for is either shit-posting on Twitter and/or writing Last Week in AWS, which is a newsletter that gathers information from the AWS ecosystem every week, discards the crap that no one cares about, takes what's left, and then makes fun of it.



Mike: It is pretty fucking funny. So by day, you are also an AWS consultant?



Corey: Yes, but in a more directed sense. Specifically, I start and I stop professionally at fixing the horrifying AWS bill. It's one of those areas where, as a consultant, I find that I'm more effective when I'm direct. And so I'll aim at a very specific, very expensive problem.


Mike: Absolutely. So you and I were talking a while back, and this is something I have repeatedly run into: isn't Amazon just a single point of failure in my infrastructure? Shouldn't I really be focused on trying to mitigate the failure of AWS going down? What happens if us-east-1 explodes again? Like, my website's offline, and now I have huge problems. So shouldn't I at that point maybe start thinking about multi-region or maybe multi-provider or any number of other really dumb ideas?



Corey: The answer to all of that is generally it depends, which is accurate and completely useless. The fact of the matter is, depending on what your business is and what your constraints look like, you're the best person to wind up saying that this is either an acceptable risk or ultimately something that you absolutely need to address and focus on. The example here is if your application has people's lives depending on it, then yeah, you need to be able to withstand everything up to and possibly including a nuclear event. Whereas if you're running my side project of “Twitter For Pets,” if your site is down for four hours because an AWS region has an issue, maybe it's okay. Maybe the internet is better for it.



Mike: Yeah, that's a good point. It always seemed to me to be an incredible amount of over-engineering, of people trying to get their applications and infrastructure to be completely ... completely capable of surviving basically everything. It's like, "Hey, wait a second. You run a very small social media application that no one's going to care about."



Corey: We saw a fair bit of this a couple years back when the S3 Apocalypse wound up hitting. This is not me trying to bag on Amazon in any meaningful way. They run an incredible amount of complexity at a stupendous scale that boggles the mind. Things break. It's what they do. That's the nature of how working with computers plays out. And there was a knee-jerk reaction that we saw from a lot of infrastructure types after this happened where they immediately want to turn on replication for every S3 bucket, so it's now multi-region in one or more locations. I understand the reflex reaction to that. There's no one better than Ops people at fighting the last war. But there also needs to be a rational, measured response to this.



Corey: One area that seems to get lost along the way somewhere is you're going to be doubling or tripling in some cases your infrastructure costs from a raw perspective. Plus, the additional complexity, plus people's time to set all of that up, plus data transfer to get things from one place to another. Is this a reasonable response for your business constraints or is it a knee-jerk reaction to something that is very unlikely to ever occur again? If we look back, we don't see a litany of individual websites that were called out for things breaking. It was called out as Amazon's failure and things like Instagram were mentioned or American Airlines or Amazon's own status page as examples of things that were impacted. But that was the day the internet broke for a little bit. So, most of us just kind of shrugged and went outside. And eight hours later, things were working again and life went on.



Corey: If you're an ad tech company and you need to be able to sustain that type of outage, then maybe it makes sense to do that. People are not going to come back and look at an ad again. But conversely, I was buying, I think, a pair of socks I want to say five years ago on Amazon. It threw a 500 error and suddenly I'm staring at a dog, which is fascinating. That's amazing. To hear the way some infrastructure people talk about outages, that would mean that I'm down one pair of socks. Every so often in my rotation, I just don't wear socks that day. Here in reality, I just waited a couple of hours and went back and bought socks and everything worked. People do come back. Now there is the counterargument: if this happens every third time I try to purchase something, I'm suddenly spending a lot more money at target.com. There is a reputational risk here. But this doesn't necessarily mean that any downtime is automatically completely unacceptable and you have failed this company.



Mike: Yeah. Absolutely. I was reading a bit of research on this a while back. There's a large online retailer, I forget who it was, but they started looking into ... They saw that they had a whole bunch of instability over the year. They were down and then come back up and down again and all this. Finally, they had all the measurements around it. They had exactly how long they were down, the error rates, things like that as well as the purchases. They started correlating it and realized that the conventional wisdom is that for every hour you're down, the money you lost is directly equal to the money that you would have made normally in an hour. What they found is that no, actually, we lost maybe only 5%, because most people when they hit an error online are just going to come back later. Yeah, you're gonna lose some people forever, but they were probably very tiny impulse purchases anyways. The bulk of your purchases aren't going anywhere. They're just going to come back. If you're running a company that's like retirement planning, if your company provides retirement services, I'm not checking my retirement accounts every day. I'm not even checking them every week. If it's down, I'm just gonna come back next week.
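The arithmetic Mike describes can be sketched in a few lines of Python. This is a rough illustration, not the retailer's actual model: the function names, the $10,000-per-hour store, and the four-hour outage are all made up for the example, and it assumes the "maybe only 5%" figure means 5% of the revenue that would normally have come in during the outage.

```python
def naive_downtime_loss(hourly_revenue, hours_down):
    """Conventional wisdom: every hour down costs a full hour of revenue."""
    return hourly_revenue * hours_down


def observed_downtime_loss(hourly_revenue, hours_down, percent_lost=5):
    """Loss if most buyers come back later and only percent_lost never do.

    percent_lost=5 is taken from the "maybe only 5%" remark; treating it as
    a share of the revenue normally earned during the outage is an assumption.
    """
    return hourly_revenue * hours_down * percent_lost / 100


# Hypothetical numbers: a $10,000/hour store that is down for 4 hours.
naive = naive_downtime_loss(10_000, 4)
observed = observed_downtime_loss(10_000, 4)
print(f"naive estimate: ${naive:,.0f}, with returning buyers: ${observed:,.0f}")
```

The gap between the two numbers is the budget question: multi-region spending only pays for itself against the second figure, not the first.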



Corey: Absolutely. There is some element of needing to be up for the retirement account story, not for the end-user but for how quickly you execute trades and because there's a whole pile of regulation that goes around securities. That tends to turn into something very esoteric and odd. But that isn't generally what people are thinking about in terms of how this winds up working.



Mike: Right.



Corey: There are some things that absolutely need to be up to a higher degree of certainty. A great example of this is the credit card network. If I'm trying to check out at a store and my credit card's declined, yeah, I very well may not come back and ring everything up in two hours. I may just either pay cash or not make the purchase at all. It comes down, again, to what is your business model? What is your constraint that drives your business? It's amazing to me how few DR plans, especially in the context of Cloud, tend to wind up having that taken into account. One area that ... Sorry, one conversation that I tended to have very frequently with various companies back in the days when I was a reasonably crappy employee was DR planning.



Corey: Okay, so we live in San Francisco. We need to have a DR site at least an hour away. Here's where we think we're gonna put it. What do you think? Okay, let me get this straight. The city is in flames, and you're expecting people are going to go and care about keeping the website up for this non-life-critical thing instead of tending to their families? Okay.



Mike: Sounds like a real winner to me.



Corey: Oh, yeah. Absolutely. Okay, let's pretend everyone is just fine. But the entire city is no longer tenable. Okay. But we're hosting this in either Oregon or Virginia or Ohio, not San Francisco. So what sort of event would take all of those things down simultaneously? Their response is pretty much, "Oh, just roll with it. It's happened, but everyone's okay." Okay. So I'm somehow going to continue working here rather than for the company who did worse planning for this magical disaster and is willing to pay me a consulting rate of 10 to 100 times more than I make here to go and fix their stuff instead. Remember, this is at-will employment. Not to sound like I'm being disloyal here, but past a certain point, you can absolutely buy me to go focus on a different problem. What does this look like? Suddenly, I find I'm not invited to those conversations anymore.



Mike: It's amazing to me how many DR plans completely miss the whole ... Your lead engineer gets poached right at the time that everyone else is also down. That's a pretty big problem. I'm pretty sure it's going to happen.



Corey: You can see some of this with hurricanes and people running the bucket lines of fuel up and down stairs and working in racks with things overheating. You've seen it there, where someone's working on something, and someone else will say, "Hey, can we pay you to come help with us for a little bit?" And they do. Usually, it's minor stuff because of the remote hands-style stuff of help me turn these things off, hit those power buttons, help me lift this thing. But that doesn't take much of a leap of imagination to think, "Yeah, can you get our site back up instead? Because you seem really calm, and everyone on our side is panicking." And who knows? Maybe the secret to your success is Quaaludes, because then you never panic about anything ever again. But it at least is a calming influence. People get poached under in extremis circumstances.



Mike: Yeah, it's weird to me that DR plans tend to focus on all the stuff that really doesn't matter and ignore the stuff that is actually likely to happen. Because focusing on the stuff that's really likely and being ... It's hard. You have to be honest about what's actually going to occur, what your business really does, and the value of your company to its users and other stakeholders. That's a hard conversation.



Corey: It is. People ask me this question all the time. When I say I focus on AWS bills, people come back as if they've come up with this brilliant counterpoint. Well, what if Azure or GCP overtakes AWS and drives them into irrelevance? Okay, in this scenario, suddenly everyone is using the other Cloud providers and my knowledge is useless. I've somehow sat here oblivious to this entire change for the 10 to 15 years this tectonic shift would take. Now, AWS has become the new legacy. Everyone's exiting for a variety of reasons. Even with an all-hands-on-deck pace there, you're not gonna see that in much less than eight years. You think I can't come up to speed on a new technology that has somehow taken over the world and caught me napping? Yeah, as an individual consultant, I can pivot a lot more rapidly than these tectonic industry shifts tend to happen. Yeah, I'm absolutely okay with focusing on this.



Corey: We do see that AWS, for example, is almost one of those too big to fail type of scenarios where if they go down even in a region that you think no one is in, a shocking number of sites suddenly have a terrible day. The fact that this happens so infrequently is a testament to their engineering prowess.



Mike: It's weird when people make that argument of what if Amazon just goes away, what if GCP and Azure overtake them? Like, okay, that is an interesting scenario. But how long would it take Amazon's customers to migrate? How long did it take them to migrate to begin with? A lot of these companies are in like year five or six of a migration, and they're still not done.



Corey: Or anywhere near done.



Mike: Right, or anywhere near done. Like, these things are not quick.



Corey: It's easy to make fun of those companies too. It's this legacy 30 year old PHP app or this ancient mainframe. What piece of crap does that run? The answer is the traffic lights or it handles the ACH payments networks that make the American financial system work. These are things that are serious business. This is not the Twitter for Pets style toy application that no one notices nor will long remember the outage that winds up happening if that goes away. This is serious stuff. There's a diligence there that most of us don't have to deal with in most of our environments.



Mike: On that note, that actually brings up a really interesting question to me. If I have something, this legacy, legacy in this sense is, generally I find people are really just talking bad about a thing they don't like to use. For example, I helped a company recently. I say recently. It really was only a couple of months ago with a Rails 2.1 application, because every consultant they talked to said, "No, we're not gonna help you upgrade that. You should just rewrite it." These are also the companies that are on the website saying we never recommend rewrites because it's a bad idea. When you have an application that old, your only choice is to rewrite it, but they don't. Well, how much is this company making? Many, many millions of dollars a year on a Rails 2.1 app. That's just on a small scale. So when you look at companies that are running the credit card networks and the ACH payment systems, there's a staggering amount of money flowing through these. These are not new systems. For that matter, they're not perfect. They're not interesting. Yet, you and I, being in Silicon Valley, we're often running into engineers and companies who are trying to build something perfect. What do you think about that? How does that work out for you? Have you ever seen this actually work out well?



Corey: I don't think it's ever happened for starters, because you get two engineers talking about the proper way to build something, you'll come up with at least three opinions. I would take it a step further, though, and argue that perfect code is a competitive disadvantage. Specifically, you can look at an awful lot of companies that died, because they spent all their time fixing code and refactoring stuff that their customers did not give a toss about versus seeing companies that have succeeded despite the fact that their code is a burning tire fire because they're solving a business need.



Corey: A canonical example of this is Twitter. Twitter had serious problems back in 2012 or so. This was before their current serious problems of providing a safe haven to Nazis. But this was in the days of focusing on ... They themselves were a Ruby on Rails shop. They were down all the time. The fail whale was becoming a recurring thing. But people loved using Twitter. The fact that everyone made fun of them for being down all the time meant that people were using their platform. At that point, a rewrite and focusing on stability absolutely made sense. They moved heaven and earth and pulled it off. It's sort of sad what happened to them from a corporate perspective since then. But from a technology perspective, they built something incredibly resilient that works. That was the right move. But had they done this back when they were just an idea tossing around in Jack Dorsey's complicit head, then there wouldn't have been a Twitter at all, because they would have died on the vine. They were too busy trying to make money and build a viable business. Once that was done, then you have resources to throw at things like technical correctness and rewrites and solving technical problems left and right. But until you get to that point, you don't have any technical problems. You have business problems. Focusing on that is almost never a question of what framework you're writing in, what language you're running in, what version of a thing you're using. There are exceptions. If you disagree vehemently with this, email Mike. But I absolutely stand by the argument that this isn't going to save anyone's business, because you're on an old version of Rails. That's lunacy.



Mike: Yeah, I've seen so many companies ... Granted, I'm on the sidelines here, but I've watched so many companies spend a ton of effort on building infrastructure. These are brand new startups. They don't have a product yet. They're spending all their time building up the latest setup on serverless, on containers, and getting it all exactly how they want. It's like, "Hey, wait a minute, you just spent the past year not building a product." So that entire year is really a year you're never gonna get back. You should have spent it on building a product, even if it would have been shitty infrastructure.



Corey: One story that I like is Ryan Kroonenburg when he was building out A Cloud Guru. He said he had six weeks to build an online school. And the technology that got him there the fastest from his perspective was serverless. He said that's the reason he picked it. He wasn't trying to plant a flag in the ground and say, "This is the technology upon which I'm going to stake my reputation." It was, from his perspective, the best answer to the problem he was facing. I don't know how accurate that was, but it obviously worked out for him. He got the site up and running inside of six weeks. It's now turned into a bit of a runaway success in the Cloud education market. That was ... Assuming that we take that story at face value, that was absolutely the right answer. He wasn't spending his time getting everything running on Kubernetes, because he has a problem that he can't entirely articulate, which seems like the only thing that people throw Kubernetes at.



Mike: Yeah, the problem is I don't have enough Kubernetes on my resume.



Corey: Exactly. Everyone's first project is always their own resume. Everything else is secondary to that. There's a reason that no one's writing stuff in Perl anymore. It's because that's sort of the kiss of death on the resume side. The language is perfectly serviceable-ish.



Mike: Yeah, -ish. It's fun to me when people will ... They're complaining about my choice of I run my entire infrastructure on WordPress. And I don't even run that myself. I pay some other company to run WordPress for me. Why do I use WordPress versus doing this entirely myself? Why don't I even host it on Amazon myself? Well, because I have better things to do. My business is not running WordPress. My business is not running infrastructure. My business is helping people understand monitoring. So, all this comes down to, to quote @mcfunley, where are you spending your innovation tokens?



Corey: Yes. I'm right there with you. I run my quinnadvisory.com website entirely on Squarespace. The reason I have a website at all is because if you don't, people think that you're effectively someone fly-by-night who doesn't know how business works. It's like a business card. It's something that is sort of a token. It's evocative of someone who knows what they're doing. But it generates approximately zero dollars itself.



Mike: Right.



Corey: I wind up paying Squarespace something like $12 or $13 a month to run that thing for me. And I don't think about it again, because it's the best kind of problem, someone else's.



Mike: Yeah, and you and I are some of the most qualified people to run our own infrastructure. We actively attempt to not do it.



Corey: Absolutely. I could set up a whole series of load-balanced EC2 instances. Step one, I would pay at least an order of magnitude more every month than I'm paying for Squarespace. But far more damaging is the fact that I would be spending 30-40 hours a month tinkering with it, trying to optimize things, making sure everything is updated and patched. There is zero business value for me in doing that. Part of this also stems from the fact that I think a lot of us as engineers started off as hobbyists. We did this for the love of it for lack of a better term.



Mike: I know I did.



Corey: Yeah, that's not the only path to get there, by the way. It's just how many of us fell into this market. When we're doing that, we equate the value of our time as zero. It doesn't ... There's no money in a hobby. Yeah, of course we'll spend 20 hours working on something rather than paying someone $200. Once you cross over into doing this professionally, oh by all means, spend the money on someone else. Many times, it's still difficult for us to emerge into that new mindset; we retreat to the idea of devaluing our own time. That's one of the more dangerous things, I think, that we do as a sector.



Mike: To sum up, there's value in what I've been calling good enough engineering. It's just enough engineering, just enough work, to get the business needs solved. Like, if a very tiny PHP application solves your problem, fantastic. Do that. Don't go after some serverless thing. Now, you pointed out that A Cloud Guru actually had the opposite situation where using modern technology was actually the right move. That's great. I wonder how many of those situations are actually there.



Corey: Oh, yeah. There's a lot of them out there like that where I think that if you just focus a little bit on solving a problem in the most expedient way possible, there's always time to go back and fix it later if that's what makes sense for your business. I see this percolating out in different areas too. I'll wind up with engineers who are passionate about finding an extra $100 a month in savings in their AWS bills. My question that I sometimes ask very directly is how much do you make a year. Their response is, "What does that have to do with anything?" The answer is I'm just trying to figure out how many more seconds we have to spend talking about this before the conversation costs more than you'll save this year. And that sometimes triggers a light bulb moment. More often, it causes them to come away with the conviction that I am, in fact, a jerk. But it does wind up at least laying bare the truth that people's time is money. I've never yet seen a company with a couple weird big data science startup exceptions where payroll was less than the infrastructure bill. People are always your most expensive thing.
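Corey's "how many more seconds" question is a small break-even calculation, and it can be sketched as follows. The function name, the 2,080 working hours per year, and the $150,000 salary are assumptions chosen for illustration; none of them come from the conversation, and a fully loaded cost (benefits, overhead, more than one person in the room) would shorten the break-even time further.

```python
WORK_HOURS_PER_YEAR = 2080  # assumption: 40 hours/week * 52 weeks


def break_even_seconds(annual_salary, monthly_savings,
                       work_hours_per_year=WORK_HOURS_PER_YEAR):
    """Seconds of one person's time that cost as much as a year of savings."""
    # What one second of this person's working time costs the company.
    cost_per_second = annual_salary / (work_hours_per_year * 3600)
    # What the optimization is worth over a year.
    annual_savings = monthly_savings * 12
    return annual_savings / cost_per_second


# Hypothetical engineer on $150,000/year chasing $100/month of AWS savings.
seconds = break_even_seconds(annual_salary=150_000, monthly_savings=100)
print(f"break-even after about {seconds / 3600:.1f} hours of one engineer's time")
```

Once the time spent finding and discussing the saving passes that figure, the effort has cost more than it returns that year.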



Mike: Yeah, absolutely. It's fascinating to me that so many startups place the cost of their people at zero. I mean they're paying them. But then when you go to task them with something, you don't think about how much is it going to cost me to do that and how much is it going to cost me to do this other thing like opportunity cost.



Corey: The best kind of cost center, someone else's.



Mike: Right. There are a lot of costs involved here. But, we generally place the cost of an engineer, the cost of people at zero, which is absolutely wild to me. The only people who don't do this are generally C-levels or perhaps, more telling, the financial folks.



Corey: This is probably gonna start a war, so I'm absolutely going to ... In fact, for this one, yeah, you can email me on this one. I'm gonna stand by this. I think that this attitude needs to permeate its way into on call where if you value-



Mike: Absolutely.



Corey: ... people's time at zero and then ask them to just be on call over Thanksgiving, it'll be fine, or, "I know you go home and you have family, but I've decided [inaudible 00:24:35] gonna call you, we expect you to fix it." I feel like the correct starting point to that is the hell I am. Then you're having a discussion and negotiation.



Mike: Right.



Corey: Believe me, I understand the idea of pitching in in the case of emergencies. But I don't know about you, but I don't want to have to think about my business 24/7. I don't want to have to sleep with one eye open. There's a reason that I picked a problem to solve that is strictly an issue during business hours. No one calls me at 3:00 in the morning screaming about their Amazon bill.



Mike: Yep. Even among the doctors I've known, yes, on call is part of their job. It is a core facet of what they're expected to do as in if I call you, you need to answer the phone. But they still have rotations. Like, they still have a period of time where they're expected to be on call. Most importantly, they're paid for the entire time that they're on call whether they take the call or not.



Corey: Let's not kid ourselves either with those doctors as well. People's lives are very often hanging in the balance, not we need to show people ads they don't want to see.



Mike: Yeah. As an SRE, it's not very often that lives are depending on the work we're doing. So, it's a very different sort of situation. Yeah, there's a lot of money depending on what we're doing, but there are no lives. If the site continues to be down or slow for a while, I'm kind of okay with that.



Corey: And if it does come down to a point where there's that much money on the line, well terrific. Then you can probably afford to hire an additional team of people to focus on this so that people can sleep at night.



Mike: Yeah, I completely agree with you. If you would like, you can also email me and complain about it too. This is a point that I actually talked about in my book, Practical Monitoring, where you should absolutely be paying people for on call above and beyond what you're paying them as their salary. The push back I've always heard is, "Oh, well, you're an SRE. You're a DevOps engineer. It is simply expected that on call is part of your job." It's like, "Yeah, but that's not true for doctors and nurses, so why should it be true for us?"



Corey: Exactly. And you take it a step further beyond that. It's we expect that you're going to wind up being on call to wake up when things break. Well, what about the application engineers, the developers that built this? We need to be able to reach those people too. How come they're not on call? Then, cue a lot of stammering and sometimes some half baked idea of trying to get everyone involved in being on call all the time. It just becomes an awful way to live.



Mike: Absolutely.



Corey: Again, I have my own biases in this. One of my first on call jobs was such a terrible rotation that I swore I'd never do it again. The scars of that have carried me a long way. I optimized around building resilient infrastructures so that whenever I got called out of hours it was something interesting, not something banal, not something that was going to annoy the crap out of me. There were always hard rules that went around with it. If it pages me, I'm empowered to do anything and everything it takes to make sure that never happens again. If that means that all I can do is turn the pager off, well okay, here we go.



Mike: Yeah. Completely agreed.



Corey: I should probably warn you that if anyone's listening to this and trying to pattern their own career choices after mine, don't do that. I'm effectively unemployable as a direct result. There's a reason that I run my own company. No one else is willing to tolerate this for more than about 20 minutes at a stretch.



Mike: Yeah, I'm with Corey on that. Corey, you've been doing ... You do a lot of talks, but most recently you've been talking about this idea of ... You titled it the Myth of Cloud Agnosticism, which is a pretty provocative title.



Corey: Mm-hmm.



Mike: What do you actually mean? What is all that about?



Corey: We often have stories where we talk about time to build an application, and we want to be able to deploy that to AWS, to GCP, to Azure. Because I've taken a sudden blow to the head with a pile of money, Oracle Cloud. I want to seamlessly deploy this to all of those locations. In practice, it always turns into a shit carnival because one of two things happens. Either you're actively deploying to all of those things and your pace of innovation is slowed incredibly, because you're only using the primitives that all of those Cloud providers support and it's not a very long list; even managed database services interact very differently on an API level. Or, you're building it on one Cloud but claiming you're able to deploy elsewhere. And that goes back to our DR discussion where, "Okay, it's time for our quarterly DR test," or, "We're gonna do our quarterly GCP deploy," and it fails three minutes in and you iterate forward and again and again and again. You finally get something kind of up and working. You declare it done. You immediately stamp approved in the binder. It goes on the shelf. The next commit breaks it until six months from now when you try it again. If you're not living it and doing active deploys there, you don't actually have something Cloud agnostic. You've just made a bunch of terrible decisions along the way.



Corey: The idea, from my perspective, is to pick a Cloud provider. I promise, I don't care which one you pick. Do whatever makes sense for your business. That requires context I don't have. I have my own opinions and my own preferences. That's not the important part of the story. But the idea now is that you wind up building something like that where you pick your provider, you go all in on what they're offering. You're going to be able to migrate to another provider down the road if you need to. That migration in five years is going to be far less time and effort than you're going to spend maintaining that full level of agnosticism the entire time.



Corey: That also winds up building to a point where you're paying for that option that you won't really use with feature velocity. That becomes effectively a giant waste of everyone's time and energy. There's a bunch of negative patterns that happen as a result of this and very few positive outcomes. Now people are pushing for a multi-Cloud world, largely from the position of third-party vendors: if you're not multi-Cloud, if you're all in on, I don't know, Azure, well, I won't have anything to sell you, because there's no world where you're all in on Azure and you need to buy a NetApp, for example. There needs to be something keeping you with a foot in each world. Because in those transitional painful places, that's where there's business opportunity for a lot of companies out there. There's a lot of incentives going around to make sure that you wind up going into multiple providers. I just find that the wrong answer in most cases. There are exceptions to this, but they're rare.



Mike: I think it's really telling that the only people who are talking about being Cloud agnostic are the ones with a vested interest in making sure that you are deployed into multiple places.



Corey: Let's be fair here. A lot of people are saying you should pick one provider and go all in. Those people all work for a Cloud provider.



Mike: That's also true.



Corey: It's not too big of a secret, which one they think you should go all in on. So, there's bias on both sides of this discussion. I just happen to think that all things being equal, pick a provider and go with it. There's a reason that my nonsense is all built on top of, generally, AWS. There are a couple of exceptions, but that's the horse that I'm riding these days. I've dabbled a little bit with the other providers, but I haven't built anything substantive there, just because that winds up slowing me down to a point where I'd rather focus on solving problems in one particular area.



Mike: Yeah, that's a great point.




Corey: Another point that often comes up in this is business risk. Well, what if the provider does something we hate? What if they become our competitor? Spoiler, if it's Amazon, they're already your competitor. The only business they're not going to compete with is mine, and that's because I make fun of them for a living. The rest of this, they already are there. If that is that big of a concern for you, well, talk to Netflix. One of their biggest competitors is sleep. One of their second biggest competitors is Amazon Prime Instant Video, whatever ridiculous long convoluted name that thing has this week. But they still wind up hosting a lot of their infrastructure on AWS and they're very public about this.



Corey: It tends to be this idea where if there's a business risk and a strategic decision made by your board, great. Talk about it. Understand the cost going in with this and then move forward. If it's that big of a concern, don't go on Amazon at all. Go Azure or go GCP from the get-go. But don't do this crappy thing where you're going to mostly be on Amazon, but be able to deploy somewhere else. If you look at the bills for these places that claim to be Cloud agnostic, they're still, in most cases, 98% on one provider and then 2% of backups or something on another provider instead. It winds up being a very strange story no matter how you unpack it. There's a lot more that goes into this as I build up. Again, I have an entire 45-minute talk on this that I'm not going to rant into a microphone for you at the moment. But, yeah, this is a nuanced issue. And there are exceptions. The thing that wakes humans up at PagerDuty, for example, has to be multi-provider, just because it has to sustain the outage of any provider or region when that's the thing that wakes you up to tell you it's down. But that's not their ad system.



Mike: Right. I remember there's a really interesting story about a show called The Grand Tour, which is the successor to Top Gear on the BBC, where Netflix and Amazon Video were both bidding for it. Amazon won the show. Netflix decided to not go and spend that stupid amount of money for it. Now you have Netflix and Amazon competing directly on providing a show. It's all original content, and Netflix is still running all their infrastructure on the same infrastructure that Amazon's running. It's like this is not necessarily all bad. To them, it's we're competing in one area and absolutely not in others. I'm sure their lawyers and risk assessment people are totally fine with the mitigations they have, which would be absolutely fascinating to see.



Corey: Absolutely. They're never going to be too public about any of this, just because-



Mike: Of course.



Corey: ... it doesn't wind up being ... This is exposing the territory of materially sensitive information. This is not something that people are generally gonna get on stage and talk about.



Mike: Right.



Corey: Half the time that I spend onboarding a new client is making sure all the i's are dotted and t's are crossed as far as things I'm not allowed to talk about publicly about specific clients. "You aren't allowed to tell anyone you work with us" is usually the first bullet point on that list. Okay.



Mike: Yeah, who wants to admit to the public that they have an incredibly large Amazon bill they've been unable to tame?



Corey: Every large company you've ever heard of more or less falls into that because no one talks about it. When it happens to you, you think you're the only one.



Mike: Right. The real answer is that we're all suffering from it.



Corey: Yep, I have solved the same problem with very similar tooling in so many different environments. Because no one will talk about it publicly, there's no consensus around how to do this. Everyone's reinventing the same crappy wheel.



Mike: Yeah. Well, Corey, it has been such a pleasure having you on.



Corey: It really has. I'm delightful.



Mike: Yes, you are. I want to talk about your ... what you do, like the core of your work as a consultant, which is in, as we just mentioned, Amazon billing. There's, I imagine, a lot of people listening that are kind of interested in this. They all have stupid bills. What's something that people can do today or this week to improve their bill dramatically, like to make a real dent in it?



Corey: Sure. This is going to sound obvious, but you'd be surprised how often it isn't. When you look at the bill, it's presented alphabetically. Start with the big numbers. 60% of global spend is on EC2. But the first thing on that bill, in some cases, is Alexa for Business for a couple of bucks. People will start with that as their first question. You're spending x millions of dollars a month. Why is that the first thing you want to talk about? It's alphabetical.
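
A minimal sketch of the "start with the big numbers" advice, for listeners who want to try it on an exported bill. The service names and dollar amounts below are invented for illustration and do not come from any real account, and `rank_by_spend` is a hypothetical helper, not part of any AWS tool. It simply reorders line items by cost instead of by name and shows each item's share of the total.

```python
# Hypothetical line items: names and amounts are made up for illustration only.
bill = {
    "Alexa for Business": 2.00,
    "CloudWatch": 1200.00,
    "EC2": 60000.00,
    "NAT Gateway data processing": 9000.00,
    "S3": 8000.00,
}


def rank_by_spend(line_items):
    """Return (service, cost, share_of_total) tuples, largest cost first."""
    total = sum(line_items.values())
    # Sort by cost, descending, instead of the alphabetical order of the bill.
    ranked = sorted(line_items.items(), key=lambda item: item[1], reverse=True)
    return [
        (service, cost, (cost / total) if total else 0.0)
        for service, cost in ranked
    ]


if __name__ == "__main__":
    for service, cost, share in rank_by_spend(bill):
        print(f"{service:<30} ${cost:>12,.2f} {share:6.1%}")
```

Read top-down, the list puts the multi-thousand-dollar items first and pushes the couple-of-bucks entry to the bottom, which is the order in which the questions are worth asking.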



Mike: Right.



Corey: Start with the big things. Managed NAT gateway data processing, if that's anything substantial, and it can be. Replace that with instances you run yourselves. Make sure you understand how reserved instances work. Make sure that you are buying them appropriately for what you're doing. Any enormous number in your bill that doesn't make sense to you, dig into it and try and understand why. Worst case, reach out to me on Twitter or via email. I'm always willing to have conversations about this stuff. There's very little that's new under the sun in the world of AWS billing and capacity planning. But because no one really focuses on this as a full-time job, everyone gets to go and explore this for themselves. It's a constant journey of exploration where you're not breaking new ground globally. It's just new to you. If you have anything in particular you're curious about, please reach out. I'm always willing to have interesting conversations about this stuff for fun. This is what I do for fun. This is a sad commentary on how I view what my life looks like. But I got into this because I have a passion for it.



Mike: I bet you're a riot at parties.



Corey: You wouldn't believe.



Mike: On that note, where can people find out more about you and your work?



Corey: My professional site is quinnadvisory.com. We can throw a link to that in the show notes. But the newsletter is the more common means of interaction. That is lastweekinaws.com. I'm also at screaminginthecloud.com for my own podcast. And, of course, QuinnyPig on Twitter, where I turn shitposting into an Olympic sport.



Mike: Alright. Well, thank you again for joining me on the show.



Corey: Thank you for having me. It's always a pleasure.



Mike: And thank you. Thank all you folks for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com, on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.

What is Real World DevOps?

I'm setting out to meet interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.