Join me as I continue a new series called Whiteboard Confessional by exploring the time I sent out a newsletter to 18,000 people filled with broken links (yep, it was the other day)—and what I did to fix them without sending out an updated version. In this podcast, I also talk about what my email newsletter architecture looks like, how I use analytics to continuously optimize Last Week in AWS, why not all data is good data, what I am not interested in knowing about my readers, what I did to answer questions that my email marketing platform didn’t answer for me, how that ended up breaking things briefly, how I fixed what was broken, and more.
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.
On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.
On Monday, I sent out a newsletter issue to over 18,000 people where the links didn't work for the first hour and a half. Then they magically started working. Today on the AWS Morning Brief: Whiteboard Confessional, I'm not talking about a particular design pattern, but rather conducting a bit of a postmortem of what exactly broke and why it suddenly started working again an hour and a half later. To send out the Last Week in AWS newsletter, I use a third-party service called ConvertKit that, in turn, wraps itself around SendGrid for actual email delivery. They, in turn, handle an awful lot of the annoying, difficult parts of newsletter management. As a quick example, unsubscribes. If you unsubscribe from my newsletter, which you should never do, I won't email you again. That's because they handle the subscription and unsubscription process.
Now, as another example, when you sign up for the newsletter, you get an email series that tailors itself to a “choose your own platypus” adventure based upon what you select. True story. Their logic engine powers that, too. ConvertKit is awesome for these things, but they do some things that are also kind of crappy. For example, they do a lot of link tracking that is valuable, but it's the creepy kind of link tracking that I don't care about and really don't want. Also, unfortunately, their API isn't really an API so much as it is an attempt at an API that an intern built, because they thought it was something you might enjoy.
I can't create issues via API. I have to generate the HTML and then copy and paste it in like a farm animal. And their statistics and metrics APIs won't tell me the kinds of things I actually care about, but their website will, so they have the data, it just requires an awful lot of clicking and poking. And when I say things I don't care about, let me be specific. Do you know what I don't care about? Whether you personally, dear listener, click on a particular link. I do not care; I don't want to know. That's creepy, it's invasive, and it isn't relevant to you or me in any particular way.
But I do care what all of you click on in aggregate. That informs what I include in the newsletter in the future. For example, I don't care at all about IoT, but you folks sure do. So, I'm including more IoT content as a direct response to what you folks care about. Remember, I also have sponsors in the newsletters, who themselves include links, and want to get a number of people who have clicked on those things. So, it also needs to be unique. I care if a user clicks on a link once, but if they click on it two or three times, I don't want that to increment the counter, so there are a bunch of edge case issues here.
Here are the questions that I need to answer that ConvertKit doesn't let me get at extraordinarily well. First, what were the five most popular links in last week's issue? I also want to know what the top 10 most popular links over the last three months were. That helps me put together the “Best of” issues I'm going to start shipping out in the near future. I also care what links got no clicks, because people just don't care about them or I didn't do a good job of telling the story. It helps me improve the newsletter.
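As a rough illustration of those three questions, here is a minimal Python sketch that works purely on aggregate click counts. The row shape (issue number, URL, unique clicks), the sample data, and the function names are all assumptions made for illustration. None of it reflects ConvertKit's API or the real newsletter data.

```python
from collections import Counter

# Hypothetical aggregate rows: (issue_number, url, unique_clicks).
# No per-reader data is needed to answer any of these questions.
ROWS = [
    (150, "https://example.com/iot", 420),
    (150, "https://example.com/s3", 310),
    (150, "https://example.com/ignored", 0),
    (151, "https://example.com/iot", 200),
    (151, "https://example.com/lambda", 505),
]


def top_links(rows, issues, n):
    """Return the n most-clicked URLs across the given issues, summed per URL."""
    totals = Counter()
    for issue, url, clicks in rows:
        if issue in issues:
            totals[url] += clicks
    return totals.most_common(n)


def unclicked_links(rows, issue):
    """Return the URLs in one issue that nobody clicked."""
    return [url for i, url, clicks in rows if i == issue and clicks == 0]
```

Calling `top_links(ROWS, {150}, 5)` would cover the "last week's issue" question, and passing a larger set of issue numbers with `n=10` would cover the three-month view.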
With respect to sponsors, I care how each individual sponsor performs relative to other sponsors. If one sponsor link gets way fewer clicks, that's useful to me. Since I write a lot of the sponsor copy myself, did I get something wrong? On the other hand, if a sponsored link gets way more clicks than normal, what was different there? I explicitly fight back against clickbait, so outrage generators, like racial slurs injected into the link text, are not permitted. So, when a sponsored link outperforms what I would normally expect, it means that they're telling a story that resonates with the audience, and that is super valuable data. Now, I'll tell you what I built, and what went wrong. After this.
In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.
I built a URL redirector to handle all of these problems plus one more. Namely, I want to be able to have an issue that has gone out with a link in it, but I want to be able to repoint that link after I've already hit send. Why do I care about that? Well, if it turns out that a site is compromised, or throws up a paywall, and makes for a crappy reader experience, I want to be able to redirect that link to somewhere that is less obnoxious for people. It's the kind of problem you don't think about until one day it bites you when a site that you’ve linked against is hosting a hostile ad network and then gets your entire newsletter blacklisted by a whole bunch of spam gateways. You never make that mistake twice.
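To make the repointing idea concrete, here is a minimal sketch that uses a plain dictionary as a stand-in for the redirector's storage. The table layout, the sample URLs, and the `resolve` and `repoint` names are hypothetical. The only point is that the emailed link stays fixed while the stored destination can change.

```python
# Hypothetical in-memory stand-in for the redirector's link table.
# Key: (issue_number, URL as printed in the email).
# Value: where readers are sent right now.
DESTINATIONS = {
    (150, "https://example.com/article"): "https://example.com/article",
}


def resolve(issue, original_url):
    """Return the current destination for a link, or None if it is unknown."""
    return DESTINATIONS.get((issue, original_url))


def repoint(issue, original_url, new_destination):
    """Change where an already-sent link leads; the emailed URL never changes."""
    key = (issue, original_url)
    if key not in DESTINATIONS:
        raise KeyError(f"unknown link: {key}")
    DESTINATIONS[key] = new_destination
```

If the article later grows a paywall, a call like `repoint(150, "https://example.com/article", "https://example.com/mirror")` would send every future click somewhere friendlier without another email going out.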
So, I set out to write some code to handle this very specific problem and its use case. So, I built all of this on top of serverless technologies. Because I'm a zealot, and too cheap to run an EC2 instance, it's a perfect setup, too, for the serverless model. The links get an awful lot of clicks on Monday morning and taper off throughout the day, then there's a long tail ghost town of clicks for the rest of the week. There are some, but not a lot. So, running something full-bore that entire time doesn't make a whole lot of sense. I used the Serverless Framework because that maps to how serverless technologies work in my mental model. It fits my canonical understanding of the serverless state of the world. There's a single function behind an API gateway. I tried to use one of the new HTTP APIs that they announced a few weeks ago, but the Serverless Framework support for that isn't great yet. I wanted to set up a custom domain for this thing, and the serverless plugin for domain management doesn't support that functionality, plus, all of the example docs for Serverless Framework don't leverage the new functionality.
So, my choices were either write a whole bunch of custom code and figure this stuff out myself as I guinea pig my way through it, or use the historical, long-standing API gateway instead. Now, API gateway is overly complex, it's more expensive, and it's super confusing, but I have a bunch of working examples of how to make it go. So, that was awesome. I've already suffered through those slings and arrows once, I now know the path. And this monstrosity went through several iterations. First, it was a stateless service that worked via Lambda@Edge. The problem with a stateless service for things like this is that it will automatically shorten any link that you throw at it. It turns out that if you don't have a list of approved links that it will return redirects for, spammers can and will misuse your link shortener to send spam anywhere, so I had to rebuild it with that viewpoint in mind.
So, it's an API gateway backed by a single Python Lambda function that's less than 100 lines in us-east-1. And when I say less than 100 lines, I'm talking around 60 or so. What it does is it receives an HTTP event from the API gateway. There's a unique, per-user ID that's automatically injected by ConvertKit that I don't ever query for anything, but it's stored as a query parameter in that request. There's also a path parameter that winds up decoding into a dict that contains the URL and an issue number, because a lot of links are going to occur in different issues, and I want to make sure that I get the aggregate numbers correct between issues. It then takes all that data and makes a single call to DynamoDB that does what is basically magic, that outstrips what Route 53, my normal preferred database, is capable of in this use case.
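The request-parsing step might look something like the sketch below. The episode does not say how the path parameter is encoded, so this assumes URL-safe base64 wrapped around JSON. The path parameter name `token` and the query parameter name `subscriber_id` are also placeholders, not the real names. The `pathParameters` and `queryStringParameters` keys are the ones API Gateway's Lambda proxy integration passes in the event.

```python
import base64
import json


def encode_link(url, issue):
    """Pack a URL and issue number into a URL-safe token (assumed encoding)."""
    raw = json.dumps({"url": url, "issue": issue}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_link(token):
    """Reverse encode_link, restoring the base64 padding that was stripped."""
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def parse_event(event):
    """Pull the URL, issue number, and opaque user ID out of a proxy event."""
    # API Gateway sends None rather than {} when there are no parameters.
    path = event.get("pathParameters") or {}
    query = event.get("queryStringParameters") or {}
    link = decode_link(path["token"])
    return {
        "url": link["url"],
        "issue": int(link["issue"]),
        "user_id": query.get("subscriber_id"),
    }
```

The user ID is treated as an opaque string here. It is only ever carried along for de-duplication, which matches the "I don't ever query for anything" description.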
It validates that the URL in question exists in the database for the issue number that's given. If it does, it checks to see whether the per-user ID that is provided has clicked on the link already. If they have, it just returns the URL. If the user hasn't clicked that link yet, it increments the click counter and adds the user ID to an array in the database that's never actually queried. It just winds up acting as a “Yep.” So, it becomes an atomic click. So, you clicking five times only shows up as one for me. Lastly, if the link isn't in the database, it instead returns a redirect to lastweekinAWS.com. All of that is done in a single DynamoDB UpdateItem call that Alex DeBrie helped me with. His DynamoDB book is absolutely fantastic. Go buy it if terrible stories like this, and how to avoid doing things like this, are up your alley.
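The decision logic described above can be modeled in a few lines of plain Python. This is only an in-memory sketch of the behavior, not the actual DynamoDB expression, which the episode does not spell out. The dictionary layout, the function names, and the exact form of the fallback URL are assumptions.

```python
# Assumed form of the fallback; the episode only names the domain.
FALLBACK_URL = "https://lastweekinaws.com/"


def new_link():
    """One stored link: a unique-click counter plus the IDs already counted."""
    return {"clicks": 0, "clickers": set()}


def record_click(table, issue, url, user_id):
    """Return the redirect target, counting each user at most once per link."""
    item = table.get((issue, url))
    if item is None:
        # Unknown link: never redirect to an unapproved destination.
        return FALLBACK_URL
    if user_id not in item["clickers"]:
        # First click from this user: bump the counter and remember the ID.
        item["clicks"] += 1
        item["clickers"].add(user_id)
    return url
```

In DynamoDB itself, the membership check and the increment would need to happen inside one conditional `UpdateItem` request so that the count stays correct under concurrent clicks. This in-memory version only shows which branch leads to which outcome.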
So, all of this works. Everything I've described works. What broke? Well, for starters, it wasn't supposed to go out yet, but when I pushed a small change to fix another bug, this got merged in by mistake. What I was doing was adding a /v1 to the API endpoint for this system, before I rolled it out next week. Because you always want to be able to version your APIs, and assorted nonsense. If I were to eventually redo this in something sensible that a grown-up might use, I can keep the same URL and just add a v2, then route it differently with API gateway, and now I haven't broken any of the old links, because it's obnoxious when you have an old email sitting in your old archive box that you want to click on, and it throws an error. I try to maintain these things in perpetuity.
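The versioning idea can be sketched as a small dispatcher keyed on the leading path segment. In the setup described here, API Gateway does this routing, so the Python below is only a model of the concept. The handler names, the placeholder destinations, and the fallback URL are hypothetical. The response shape (a `statusCode` plus a `Location` header) follows the Lambda proxy integration format.

```python
def redirect(location):
    """Build a 302 response in the Lambda proxy integration format."""
    return {"statusCode": 302, "headers": {"Location": location}, "body": ""}


def handle_v1(rest):
    # Placeholder for the existing lookup logic (hypothetical destination).
    return redirect("https://example.com/v1/" + rest)


def handle_v2(rest):
    # Placeholder for a future rewrite (hypothetical destination).
    return redirect("https://example.com/v2/" + rest)


HANDLERS = {"v1": handle_v1, "v2": handle_v2}


def route(path):
    """Dispatch on the version prefix, e.g. '/v1/abc' goes to handle_v1('abc')."""
    version, _, rest = path.lstrip("/").partition("/")
    handler = HANDLERS.get(version)
    if handler is None:
        # Unknown version: fall back to the site rather than throwing an error.
        return redirect("https://lastweekinaws.com/")
    return handler(rest)
```

Because old emails keep their `/v1/...` links, adding a `v2` entry leaves every previously sent link working.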
Now, because all of this wasn't even set to be deployed, it hadn't been hooked into my test harness, which would have flagged all the broken links when I went to publish this, because I've sent out enough broken links over the years that I want to keep those to a minimum. But this completely slipped through, and gave a false positive across all of the tests, because I am bad at programming. Now, there are still some strange lingering questions as to why this broke, and the various ways that it did, and how it got up in the first place, that I'm still tracking down. But now you at least understand why it broke, how I was able to fix it without sending another newsletter (though I did send one, to tell people that it was fixed and to promise this episode), and, best of all, how to tell a story around something you wrote that broke in an embarrassingly public way and turn it into a podcast episode. Thanks for listening to the Whiteboard Confessional on the AWS Morning Brief.
Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.
Announcer: This has been a HumblePod production. Stay humble.
What is AWS Morning Brief?
The latest in AWS news, sprinkled with snark. Posts about AWS come out over sixty times a day. We filter through it all to find the hidden gems, the community contributions--the stuff worth hearing about! Then we summarize it with snark and share it with you--minus the nonsense.