Adam Hawkins presents the theory and practices behind software delivery excellence. Topics include DevOps, lean, software architecture, continuous delivery, and interviews with industry leaders.
Hello and welcome to Small Batches with me, Adam Hawkins. In each episode, I share a small batch of the theory and practices behind software delivery excellence.
Topics include DevOps, lean, continuous delivery, and conversations with industry leaders. Now, let's begin today's episode.
I've said for a while that Test Driven Development is skill zero for professional software engineers. It's skill zero because it unlocks everything else. This isn't an episode about TDD though. Undoubtedly that will come, so let's move on to skill one for now.
Skill one is continuous delivery. Continuous delivery puts your changes into production as quickly as possible so you can learn and iterate forward. That learning requires skill two: production operations.
The discipline of production operations revolves around understanding the current condition of the system and comparing it against expected targets.
I must pull in a bit of Deming's System of Profound Knowledge now because it's relevant to the discipline.
The first point is Theory of Knowledge: how do we know what we know? The second point is understanding variation: what's the range of acceptable outcomes? The third point is: what's the aim of this system?
So where do we start in knowing the current condition? It starts with the golden signals. If you can only measure four metrics of the system, then focus on these four. They'll point you in the right direction.
Remember this phrase: LETS. It stands for Latency, Errors, Traffic, and Saturation.
Latency is how long it takes for the system to service a request, ideally tracked as a statistical distribution.
Errors are the problems, ideally tracked as sums over time. They indicate the system failed to produce the intended outcome. Think of HTTP 500s.
Traffic is a measure of the flow through your system. This may be the total requests in a time interval or something like requests per second.
Saturation is how full something is, ideally tracked as a percentage. Think of a connection pool, disk usage, or a queue.
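If you want to see the four signals in code, here's a minimal sketch of what a service could track in-process. Everything here, the class name, the fields, the window logic, is hypothetical and illustrative; it's not the API of any particular telemetry tool.

```typescript
// A hypothetical in-process tracker for the four golden signals (LETS).
// Not a real telemetry library; just the shape of the data.
class GoldenSignals {
  private latenciesMs: number[] = []; // Latency: per-request response times
  private errorCount = 0;             // Errors: e.g. HTTP 5xx responses
  private requestCount = 0;           // Traffic: requests in this window
  private inFlight = 0;               // used to derive Saturation

  constructor(private readonly capacity: number) {}

  // Call once per completed request.
  record(durationMs: number, statusCode: number): void {
    this.requestCount += 1;
    this.latenciesMs.push(durationMs);
    if (statusCode >= 500) this.errorCount += 1;
  }

  // Track in-flight work against capacity (e.g. a connection pool).
  acquire(): void { this.inFlight += 1; }
  release(): void { this.inFlight -= 1; }

  // Report LETS for the current window.
  snapshot() {
    const sorted = [...this.latenciesMs].sort((a, b) => a - b);
    return {
      latencyP95Ms: sorted[Math.floor(sorted.length * 0.95)] ?? 0,
      errors: this.errorCount,
      traffic: this.requestCount,
      saturationPct: (this.inFlight / this.capacity) * 100,
    };
  }
}
```

In practice you'd let your telemetry tool do this rather than roll it yourself, but the shape of the data is the same: a distribution, two counters, and a gauge.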
Odds are that any telemetry tool worth its salt will provide three of the four out of the box. The best ones will provide all four.
Operations can correlate these signals to answer the question: how do we know what we know? Here's what a conversation may sound like in the operations review for a web service.
We know the system is operating correctly because traffic is within established bounds, as measured by total HTTP requests. We know the system is operating correctly because the number of errors, as measured by HTTP 5xx responses, is statistically small. We know the system is operating correctly because latency, as measured by server-side response time, is under 100 milliseconds. We know the system is operating correctly because saturation, as measured by the server's incoming connection queue, averages twenty percent.
I repeated the phrase "as measured by" to emphasize the focus on empirical facts. Operations is numbers driven. If you can't attach "as measured by" to what you're saying, then you don't understand it well enough, or don't have enough certainty, to make any assertions about the current condition.
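To make "as measured by" concrete, here's a hedged sketch of what that review could look like if written down as code. The target names, thresholds, and dashboard numbers are all made up for illustration.

```typescript
// A sketch of an "as measured by" check: compare each golden signal
// against an explicit target. Names and thresholds are illustrative only.
interface Target {
  signal: string;
  asMeasuredBy: string; // the empirical source of the number
  withinTarget: (value: number) => boolean;
}

const targets: Target[] = [
  { signal: "traffic",    asMeasuredBy: "total HTTP requests",            withinTarget: v => v > 900 && v < 1500 },
  { signal: "errors",     asMeasuredBy: "HTTP 5xx responses",             withinTarget: v => v < 10 },
  { signal: "latency",    asMeasuredBy: "server-side response time p95",  withinTarget: v => v < 100 },
  { signal: "saturation", asMeasuredBy: "connection queue utilization %", withinTarget: v => v < 50 },
];

function reviewCurrentCondition(current: Record<string, number>): void {
  for (const t of targets) {
    const value = current[t.signal];
    const ok = value !== undefined && t.withinTarget(value);
    console.log(`${t.signal} as measured by ${t.asMeasuredBy}: ${value} -> ${ok ? "within target" : "investigate"}`);
  }
}

// Example: numbers pulled from a dashboard for the review.
reviewCurrentCondition({ traffic: 1200, errors: 3, latency: 82, saturation: 20 });
```

The point is less the code and more the structure: every assertion carries its measurement and its target.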
Everything's great when there are no problems. Something will go wrong. Time for a story.
It's 2:58 AM. Your phone shocks you awake with a page from your PM. It reads "the website is down". No time for grogginess. It's go time.
You put your empirical thinking hat on: "down" as measured by what? Well, you remember the handy phrase LETS for the golden signals: Latency, Errors, Traffic, Saturation.
Traffic is usually a good starting point for these âdownâ scenarios.
You follow your mental block diagram of the whole system. First, you check traffic as measured by HTTP requests and responses at the load balancer. Everything looks good. Traffic is flowing through the load balancer. Next stop: traffic as measured at the app servers. HTTP request and response counts look good. No problem with traffic. Next stop: errors as measured at the load balancer and app servers.
Now it starts to become a bit of a head scratcher. There are no red bars on the charts of 5xxs from the load balancers or application servers. So there's no change in error counts. What's the next golden signal? How about latency?
You pull up a chart of the p50, p90, and p95 latency on HTTP responses across all the app servers. There is a slight uptick in the p95 latency.
The next question is: which endpoints have gotten slower? Time to drill down into the metrics, so you split the chart by API and user-facing responses. No delta on the user-facing latency charts. Then you spot something: p95 latency for API responses has really gone up, though there are no errors. So where is the failure?
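For the curious, here's roughly what that drill-down does under the hood: compute percentiles per endpoint group from raw request records. The record shape and the "/api/" grouping rule are assumptions for the sketch, not how any specific vendor does it.

```typescript
// A sketch of the latency drill-down: p50/p90/p95 per endpoint group
// from raw request records.
interface RequestRecord {
  path: string;
  durationMs: number;
}

function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * p));
  return sorted[idx] ?? 0;
}

function latencyByGroup(records: RequestRecord[]) {
  const groups = new Map<string, number[]>();
  for (const r of records) {
    // Split API traffic from user-facing traffic, as in the story.
    const group = r.path.startsWith("/api/") ? "api" : "user-facing";
    if (!groups.has(group)) groups.set(group, []);
    groups.get(group)!.push(r.durationMs);
  }

  const report: Record<string, { p50: number; p90: number; p95: number }> = {};
  for (const [group, durations] of groups) {
    const sorted = [...durations].sort((a, b) => a - b);
    report[group] = {
      p50: percentile(sorted, 0.5),
      p90: percentile(sorted, 0.9),
      p95: percentile(sorted, 0.95),
    };
  }
  return report;
}
```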
Time to go up a level of abstraction. You think to yourself: what uses this API? Oh right. The fancy single-page JavaScript application.
So you pull up a dashboard of the golden signals coming from the Real User Monitoring application. You discover a red bar chart of errors. The red bars are consistent. They also correlate with the increase in API latency. Now a theory forms in your head. Something is probably timing out somewhere, then something something JavaScript error. Next stop: error logs.
And finally, there it is. The JavaScript app uses the API to fetch all the data needed to render the initial home page. The increase in latency causes a timeout in the client, which creates an uncaught error, which leads to a blank screen. Not really "down" but certainly broken.
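Here's a sketch of that failure mode in client-side code. The endpoint, timeout value, and element id are hypothetical; the point is that an unhandled timeout on the initial fetch leaves the page blank, while a catch renders something useful instead.

```typescript
// A sketch of the failure mode: the initial render depends on one API call,
// and a client-side timeout becomes an uncaught error and a blank page.
async function renderHomePage(): Promise<void> {
  const controller = new AbortController();
  // This is the timeout that starts firing once p95 API latency creeps up.
  const timer = setTimeout(() => controller.abort(), 5_000);

  try {
    const response = await fetch("/api/home-feed", { signal: controller.signal });
    const data = await response.json();
    document.querySelector("#app")!.textContent = JSON.stringify(data);
  } catch (err) {
    // Without this catch the rejection goes unhandled and nothing renders:
    // not really "down", but certainly broken.
    document.querySelector("#app")!.textContent = "Something went wrong. Please retry.";
  } finally {
    clearTimeout(timer);
  }
}
```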
All right, that's all for this batch. Head over to https://SmallBatches.fm/88 for links to recommended self-study on production operations and ways to support the show.
I hope to have you back again for the next episode. So until then, happy shipping!