Join me as I continue a new series called Whiteboard Confessional by exploring a time in a previous life when Amazon ElastiCache for Redis caused an outage that led to drama, what it was like to work for someone who can be described as a “metaphor-spewing poet,” how every event and issue makes sense in retrospect, why you should never schedule important maintenance on a weekend, how Amazon ElastiCache for Redis works, the four contributing factors that led to the outage in question, why blameless post mortems are only blameless if you have that kind of culture driven from the top, and more.
In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.
So, in hindsight, what happened makes sense, but at the time, when you’re going through an incident, everything’s cloudy and you’re getting conflicting information, and it’s challenging to figure out exactly what the heck happened. As it turns out, there were several contributing factors, four of them, to be specific. And here’s the gist of what those four were.
Number one, we used Amazon ElastiCache for Redis. Really, we were kind of asking for trouble. Two, as tends to happen with managed services like this, there was a maintenance event that Amazon emailed us about. Given that we weren’t completely irresponsible, we braved the deluge of marketing to that email address, and I’d caught this and scheduled it in the maintenance calendar. In fact, we were specifically allowed to schedule when that maintenance took place. So, we scheduled it for a weekend. In hindsight: mistake. When maintenance like this happens, you want to make sure it takes place when there are people around to keep an eye on things.
Three, the maintenance was supposed to be invisible. The way that Amazon ElastiCache for Redis works is you have clusters, and within a cluster you have a primary and a replica. The way they do maintenance is they update the replica half of the cluster, then fail the cluster over so the replica gets promoted to primary, then update the old primary, which then hangs out as the replica. This had happened successfully, or so we thought, the day before on Saturday, a full day before our customer got the error page that started this exercise. What had really happened was that we’d misconfigured the application to point to the actual primary cluster member rather than the floating endpoint that always redirects to the current primary within that cluster. So, when the maintenance hit, and the primary became the replica, the application was suddenly, and unknowingly, talking to an instance that was read-only. It would still work for anything that was read-based. It wasn’t until it tried to write something that all kinds of problems arose.
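The misconfiguration here was pointing at a single cluster member instead of the failover-following primary endpoint. A minimal sketch of a sanity check for that, in Python: note that the hostname patterns below are assumptions for illustration (ElastiCache node endpoints typically embed a numeric node suffix like `-001`, while the primary endpoint does not), not an official naming contract.

```python
import re

# Assumed-for-illustration heuristic: node endpoints look like
# "mycache-001.<id>.0001.use1.cache.amazonaws.com", while the floating
# primary endpoint ("mycache.<id>.ng.0001.use1.cache.amazonaws.com")
# lacks the "-NNN" node suffix.
NODE_SUFFIX = re.compile(r"-\d{3}\.")

def looks_like_node_endpoint(hostname: str) -> bool:
    """Return True if the hostname appears to name a single cluster
    member rather than the endpoint that follows failovers."""
    return bool(NODE_SUFFIX.search(hostname))

# The mistake from the story: the app pointed at a member directly.
assert looks_like_node_endpoint(
    "mycache-002.abc123.0001.use1.cache.amazonaws.com")
# The correct target: the primary endpoint, which survives promotion.
assert not looks_like_node_endpoint(
    "mycache.abc123.ng.0001.use1.cache.amazonaws.com")
```

A check like this in a deploy-time config validator would have flagged the bad endpoint long before any maintenance window did.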
And that led to contributing factor four. Because reads still worked, our standard monitoring didn’t pick this up. We didn’t have a synthetic test that simulated a login. As a result, the first indication that something was even slightly amiss showed up in the logs when the customer got that failed page, 15 minutes before my metaphor-spewing poet boss called me. So, when explaining this to the business stakeholders during the post mortem, we got to educate them in the art of blamelessness, which you’d think would be a terrific opportunity for someone whose only real skill is spewing metaphor, but of course, he didn’t decide to step up to that plate. Again, terrible boss. So, someone from the product org was sitting there saying, “What you’re telling me is that someone on your team misconfigured—” Okay, slow down, Hasty Pudding. We’re not blaming anyone for this. There were contributing factors, not a root cause. And this is fundamentally a learning opportunity with a lot of areas for improvement. “Okay, so some unnamed engineer screwed up and—” And we went round and round. Normally, an effective boss would have stepped in here, but remember, he only spoke in metaphor. Defending his staff wasn’t speaking in metaphor, so he, of course, chose to remain silent. As it turns out, and as anyone who knows me can attest, I have a few different skills, but a skill that I’m terrible at is shutting up. It turns out that blameless post mortems are only blameless if you have that culture driven from the top, because everyone roundly agreed at the end of that meeting that the way it devolved was certainly my fault.
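The missing synthetic login test is the easy fix here: exercise the full login flow, which touches the write path, rather than only read-based health checks. A minimal sketch of the verdict logic such a probe might use; the `LoginResult` shape, the success criterion, and the verdict strings are all hypothetical, not any specific monitoring product’s API.

```python
from dataclasses import dataclass

# Hypothetical result of one synthetic login attempt against the app.
@dataclass
class LoginResult:
    status: int
    body: str

def classify_login(result: LoginResult) -> str:
    """Turn a synthetic login attempt into a monitoring verdict.
    Logging in creates a session (a write), so a silently read-only
    backing store fails this probe even while plain reads succeed."""
    if result.status == 200 and "session" in result.body:
        return "ok"
    # A failure here pages on-call instead of waiting for a customer.
    return "page-oncall"

assert classify_login(LoginResult(200, '{"session": "abc"}')) == "ok"
assert classify_login(LoginResult(500, "write failed")) == "page-oncall"
```

Run on a few-minute interval, a probe like this would have surfaced the read-only primary within minutes of the failover, not a day later.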
Now, let’s be clear. This was my team that was responsible for the care, feeding, and configuration of this application, and therefore it was my responsibility. Who had misconfigured it is not the relevant part of the story. And even now, I still maintain that it’s not. There were a number of mistakes made across the board, but the buck does stop with me. And there was a chain of events that led to this outage. Our monitoring was insufficient for something this sensitive; an error like that in the logs should have paged me before a walking metaphor called me manually; we should have been testing that whole login flow with synthetic tests; and we should ideally have caught the misconfiguration of pointing the application to the cluster member rather than the cluster itself. But really, the biggest mistake we made across the board was almost certainly using Amazon ElastiCache for Redis. How using something else would have avoided this, I couldn’t possibly begin to say, but when in doubt, as is always a best practice, blame Amazon.
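One more cheap safeguard implied by the chain of events above: a write canary that periodically attempts a harmless write, so a silently read-only primary pages someone instead of waiting for a customer. A sketch under stated assumptions: the stub class below stands in for a real Redis client (redis-py, for instance, raises `redis.exceptions.ReadOnlyError` when a write hits a replica; the canary key name and verdict strings are made up for illustration).

```python
# Stand-in for the exception a Redis client raises on a rejected write.
class ReadOnlyError(Exception):
    pass

class FakeReadOnlyRedis:
    """Stub for a connection that, like a replica, rejects writes
    while happily serving reads."""
    def set(self, key, value):
        raise ReadOnlyError(
            "READONLY You can't write against a read only replica.")

def write_canary(client) -> str:
    """Attempt a harmless write; reads alone can't reveal this fault."""
    try:
        client.set("canary:heartbeat", "ok")
        return "ok"
    except ReadOnlyError:
        return "page-oncall"

assert write_canary(FakeReadOnlyRedis()) == "page-oncall"
```

Scheduled alongside the maintenance window, with humans around to see it fire, this closes the exact gap the outage exposed.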
Announcer: Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @QuinnyPig and let me know what I should talk about next time.
What is AWS Morning Brief?
The latest in AWS news, sprinkled with snark. Posts about AWS come out over sixty times a day. We filter through it all to find the hidden gems, the community contributions--the stuff worth hearing about! Then we summarize it with snark and share it with you--minus the nonsense.