A brief introduction to canary deploys and how to use them to mitigate risk.
Adam Hawkins presents the theory and practices behind software delivery excellence. Topics include DevOps, lean, software architecture, continuous delivery, and interviews with industry leaders.
Hello and welcome to Small Batches. I'm your host, Adam Hawkins. In each episode I share a small batch of software delivery education. Topics include DevOps, lean, continuous delivery, and conversations with industry leaders. Now, let's begin today's episode.
I keep a jar in my office. The jar is filled with sticky notes for episode topics. I fill the jar when someone mentions something and I think "Oh, that's a good idea!" or if someone specifically requests something. This is my "challenge accepted" jar. If I take something from the jar, then I just gotta sit down and write an episode for the topic. I grab one from the jar when I don't have anything else queued up and I need to get an episode out. Today, I pulled "canary deploys" from the jar. So, here we go.
Canary deploys are fundamentally a risk mitigation strategy. The term "canary" originates from the idiom "canary in the coal mine". The story goes that coal miners would bring caged canaries into the mines with them. The birds would succumb to toxic gases such as carbon monoxide before the miners did, thus alerting the miners to danger. This is effectively an early warning system. The same idea may be applied to software.
Consider the scenario where you're making a risky change to the system. Perhaps it's a feature that will add significant load to the system, possibly causing negative side effects. You don't want to deploy the change all at once because of the risk. So you can lower the risk by using the canary deploy strategy.
Instead of deploying to 100%, deploy to a much smaller portion such as 5% or 10%. This significantly reduces the blast radius of a potential failure. Next, you need to observe system telemetry to check the canary isn't dead or causing problems. If you detect a problem, then delete or roll back the canary. If you don't detect a problem, then proceed normally.
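Here is a minimal sketch of that check in Python. It assumes error rate is the telemetry signal you care about, and the names (Telemetry, canary_verdict, tolerance, min_requests) are hypothetical, made up for illustration. Your own signals and thresholds will differ.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    """Request and error counts observed over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when no traffic has arrived yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    primary: Telemetry,
    canary: Telemetry,
    tolerance: float = 0.01,   # assumed: allowed extra error rate for the canary
    min_requests: int = 100,   # assumed: minimum traffic before judging
) -> str:
    """Return "wait", "rollback", or "proceed" for the canary."""
    if canary.requests < min_requests:
        return "wait"  # not enough data to tell whether the canary is healthy
    if canary.error_rate > primary.error_rate + tolerance:
        return "rollback"  # the canary is noticeably worse than the primary
    return "proceed"


if __name__ == "__main__":
    primary = Telemetry(requests=10_000, errors=50)
    print(canary_verdict(primary, Telemetry(requests=500, errors=30)))
    print(canary_verdict(primary, Telemetry(requests=500, errors=3)))
```

Comparing the canary against the primary, instead of against a fixed number, keeps the check meaningful when the whole system is having a noisy day.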
Canary deploys are a great risk mitigation strategy for stable (in the statistical sense) systems with automated deploy pipelines and robust telemetry. They work especially well with trunk-based development. Imagine a pipeline like this: commit, run tests, build artifacts, push artifacts to a staging environment, run integration tests against staging, then push a canary to production. If monitoring is OK, then delete the canary and mark the build as OK. If not, then delete the canary and mark the build as failed. This strategy will take changes all the way to production for verification with minimal risk. However, it does require a robust automated test suite and automated monitoring to verify the canary.
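That pipeline can be sketched in a few lines of Python. This is only an outline under assumptions: every step is a hypothetical callable you would supply yourself (none of these names come from a real CI tool), and each one returns True on success.

```python
from typing import Callable, List, Tuple

# A named pipeline step that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(
    steps: List[Step],
    deploy_canary: Callable[[], None],
    canary_healthy: Callable[[], bool],
    delete_canary: Callable[[], None],
) -> str:
    """Run the pre-production steps, then verify a canary in production."""
    # Tests, artifact build, staging push, integration tests, and so on.
    for name, step in steps:
        if not step():
            return f"failed: {name}"  # stop before anything reaches production

    deploy_canary()
    try:
        healthy = canary_healthy()  # automated monitoring check
    finally:
        delete_canary()  # the canary is deleted whether it passed or not

    return "ok" if healthy else "failed: canary"


if __name__ == "__main__":
    steps = [
        ("run tests", lambda: True),
        ("build artifacts", lambda: True),
        ("push to staging", lambda: True),
        ("integration tests", lambda: True),
    ]
    result = run_pipeline(
        steps,
        deploy_canary=lambda: print("canary deployed"),
        canary_healthy=lambda: True,
        delete_canary=lambda: print("canary deleted"),
    )
    print(result)
```

Note that the canary is deleted in both outcomes. The only thing that changes is whether the build gets marked OK or failed.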
Coordinating canary deploys is possible with off-the-shelf tooling like Kubernetes and setups like auto-scaling groups behind load balancers.
Canaries in Kubernetes require a Service and two deployments. The selector on the Service should match the pods of both the primary deployment and the canary deployment. Traffic is spread roughly evenly across the matching pods, so you can tune the number of replicas in each deployment to adjust how much traffic the canary receives. When you're done with the experiment, simply delete the canary deployment, then proceed with updating the primary deployment.
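To make the replica math concrete, here is a small Python sketch. It models the labels as plain dictionaries and does not talk to a cluster. The label names ("app", "track") and the helper names are assumptions for illustration, and the traffic estimate assumes requests are spread evenly across pods.

```python
from typing import Dict


def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """A Service selects a pod when every selector label is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def canary_traffic_share(primary_replicas: int, canary_replicas: int) -> float:
    """Approximate share of traffic hitting the canary, assuming even spread."""
    total = primary_replicas + canary_replicas
    return canary_replicas / total if total else 0.0


# Hypothetical labels: both deployments share "app", and differ on "track".
service_selector = {"app": "web"}
primary_pod_labels = {"app": "web", "track": "stable"}
canary_pod_labels = {"app": "web", "track": "canary"}

if __name__ == "__main__":
    # The Service selector leaves out "track", so it matches both sets of pods.
    print(selector_matches(service_selector, primary_pod_labels))
    print(selector_matches(service_selector, canary_pod_labels))
    # 19 primary replicas plus 1 canary replica puts the canary at about 5%.
    print(canary_traffic_share(primary_replicas=19, canary_replicas=1))
```

The same arithmetic explains why small canary percentages need a reasonably large replica count: with only two primary replicas, a single canary pod already takes about a third of the traffic.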
A similar solution applies to auto-scaling groups connected to load balancers. Create two auto-scaling groups behind the load balancer. One group is the primary and the other is the canary. If you want to deploy a canary, then deploy that change to the canary auto-scaling group. Tune the size of the canary auto-scaling group to adjust how much traffic the canary receives. When you're done with the experiment, scale the canary group down to zero, then proceed with updating the primary auto-scaling group to the new version.
One last thing before we wrap up this episode. I must mention blue-green deploys because they are similar to canaries. The blue-green deploy strategy prepares two versions of the running system with the ability to instantly switch between the two.
So if you deploy a change that creates a problem, you can instantly switch back to the previous version, then delete the new version. Conversely, if the new version is OK, then you can delete the previous version. The difference between the canary and blue-green strategies is how much risk is mitigated and when it happens in the process. A blue-green strategy may be more useful for unstable systems where you need a human in the loop to decide whether to proceed or switch back.
Alright, that's all for this batch. Here are some more resources on canary deploys and other deployment strategies.
The first is Michael Nygard's wonderful book "Release It!". The book covers many facets of building reliable production systems, from their internal architecture to their deploy pipelines.
The second is the Small Batches Slack app. The app posts daily snippets defining common terms such as "canary deploys" and many more to your team's Slack channels. The app is currently free in beta, so sign up today.
Find links to "Release It!" and the Small Batches Slack app at smallbatches.fm/64. Well, I hope to have you back again for the next episode. So until then, happy shipping!