Preflight checks prevent deploys into known bad conditions
Adam Hawkins presents the theory and practices behind software delivery excellence. Topics include DevOps, lean, software architecture, continuous delivery, and interviews with industry leaders.
[Launch poll audio clip]
You just heard the NASA flight director run through a launch team poll. The flight director polls each system for a go/no-go. All systems must be go for the launch to continue. This practice is critical because launching under the wrong conditions may be catastrophic. The same practice applies to deployment pipelines.
Deployment pipelines are a series of steps. Each step has three parts:
+ Step 1: Verify preconditions.
+ Step 2: Do the work. This may be initiating a rolling deploy, launching a canary, or running a migration.
+ Step 3: Verify the changes from step 2.
This process mirrors the structure in automated tests: establish preconditions, run code, then assert on results.
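To make that concrete, here's a minimal bash sketch of a three-part step. Everything in it is a hypothetical placeholder; the point is only the shape: precondition check, work, verification.

```bash
# Sketch of a three-part pipeline step. The bin/* commands are
# hypothetical placeholders for your own pipeline's scripts.
set -euo pipefail  # any failing part aborts the pipeline: fail fast

step() {
  local name="$1" precheck="$2" action="$3" postcheck="$4"
  echo "[${name}] step 1: verifying preconditions" && "$precheck"
  echo "[${name}] step 2: doing the work"          && "$action"
  echo "[${name}] step 3: verifying the changes"   && "$postcheck"
}

step deploy bin/preflight bin/rolling-deploy bin/smoke-test
```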
This approach provides three key benefits:
1. First, fail fast. It’s preferable to abort the pipeline early instead of identifying a failure at the end.
2. Second, correctness is baked into the pipeline. There's no need to manually muck around after executing the pipeline to check whether it worked. If the pipeline completes successfully, then you can say confidently that the system is working. That's the same feeling you get from a strong test suite.
3. Third, there's a structure for confidently growing and refactoring the pipeline. This is similar to the "red-green-refactor" loop in TDD. When a regression is identified, it may be rooted out in step one or tested for in step three.
Preflight checks are the first step in a pipeline. They are coarse-grained end-to-end tests that verify the pipeline may proceed. Let's consider a typical application.
Typical applications use a database and some external services. Let's say the application uses PostgreSQL and integrates with Twitter. This implies at least two things:
1. The application can connect and authenticate to the PostgreSQL server.
2. The application can connect and authenticate to Twitter.

If these preconditions are not met, then we know for certain there will be a production outage. The deploy pipeline must prevent shipping code when these preconditions are not met.

These requirements start a list of preflight checks. Let's assume that the connection and authentication information is provided via URL environment variables in accordance with the 12-factor app guidelines.
The first preflight check establishes a database connection to the configured host and port, authenticates as the configured user, then runs a simple query. Ideally, this is executed via application code. If that's not possible, then it may be done with the psql command.
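Here's what the psql fallback might look like. It assumes the connection string lives in a 12-factor style DATABASE_URL variable; that name is an assumption, so use whatever your configuration actually provides.

```bash
# Preflight: database. Assumes a 12-factor style DATABASE_URL
# variable holding the connection string.
set -euo pipefail

# psql connects, authenticates, runs a trivial query, and exits
# non-zero on any failure, which fails the check.
psql "$DATABASE_URL" --command 'SELECT 1;' > /dev/null
echo "preflight: database is go"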
The second preflight check authenticates with Twitter, then executes an operation that's part of the application's normal operation. This may be fetching some tweets or loading a user's profile. Ideally, this is executed via application code using the same libraries (such as an OAuth client) and the same configuration as production. If that's not possible, then it may be done with a few curl commands.
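Here's a hedged curl sketch. The TWITTER_BEARER_TOKEN variable and the v2 tweet-lookup endpoint are assumptions; substitute the auth scheme and operation your application actually uses.

```bash
# Preflight: Twitter. The token variable and endpoint here are
# assumptions; mirror your application's real auth and operations.
set -euo pipefail

# --fail makes curl exit non-zero on HTTP 4xx/5xx, so a rejected
# token or an unreachable service fails the check.
curl --fail --silent --show-error \
  --header "Authorization: Bearer ${TWITTER_BEARER_TOKEN}" \
  'https://api.twitter.com/2/tweets?ids=20' > /dev/null
echo "preflight: twitter is go"
```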
The point of these tests is not specific assertions. The goal is to check if something is wrong. If these tests fail, then something is critically wrong. However, what is wrong is up to developers to suss out. This is simply the nature of end-to-end tests. There are infinite reasons why they can fail but only a single reason why they pass. Thus, the preflight checks should cast the widest net possible.
The importance of preflight checks cannot be overstated. Many developers have not been introduced to this concept, so it's difficult to grok exactly what they're doing. I've seen developers scratch their heads and wonder "why do we even need these?".
I have two answers to that question.
1. First, a real-life example from my work at a previous company. A team member was suspicious when I introduced preflight checks for DB access as I described earlier. They were skeptical of the value and didn't see the point. Fast forward a few months. The team member made a change to the database configuration code. The PR was peer-reviewed and eventually deployed to production. The DB connection preflight check failed because of the code changes, thus preventing a production outage. Afterward, the team member exclaimed "I love preflight checks!" in the deploy Slack channel.
2. Second, I offer the idea of psychological safety. Gene Kim introduced the five ideals in his recent book The Unicorn Project. Psychological safety is the fourth ideal. Put simply, developers need the peace of mind that they may go about their work without fear of negative consequences. Preflight checks ensure that known requirements are met, thus increasing developer confidence that their changes will work correctly. Ultimately, this improves flow and developer happiness. What developer doesn't want more confidence in their changes working correctly in production?
I don’t know any such developer and I don’t think you do either. It’s certainly not you because you’re listening to this podcast!
So here’s the sequence of preflight checks in my deployment pipelines. Allow me to put on my launch director hat for a minute.
- This is the launch director preparing for production deploy. Initiating preflight check sequence. Config?
- Config is go. Static configuration files passed static analysis.
- Credentials?
- Credentials are go for deploy. Validity confirmed.
- DNS?
- DNS is go for deploy. External services are resolvable.
- Traffic?
- Traffic is go for deploy. External services accepting incoming connections.
- Services?
- Go for deploy. End-to-end tests OK.
- Dry run?
- Go for deploy. Dry run completed successfully with the target configuration.
- This is the LD on the deploy channel. Preflight checks are go for deploy. Pipeline is cleared for release in T-minus 5 seconds.
Alright, let me step out of the control room and take the headset off. Now I can elaborate on each check. The order follows the fail-fast approach.
The first check, config, is a sanity check. Many services use static configuration files in JSON, YAML, or TOML. They're usually parsed when starting the application, so the check should parse them with a tool like jq. This check also covers the validity of any certificates.
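As a sketch, assuming JSON files under a hypothetical config/ directory and a TLS certificate at config/tls.crt; point the check at wherever your static configuration actually lives.

```bash
# Preflight: config. The config/ paths are assumptions.
set -euo pipefail

for file in config/*.json; do
  jq empty "$file"  # exits non-zero if the file is not valid JSON
done

# Fail if the certificate is invalid or expires within 24 hours.
openssl x509 -checkend 86400 -noout -in config/tls.crt
echo "preflight: config is go"
```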
The credentials check verifies credentials independently of the external services. Examples include AWS keys, instance profiles, and one-off API keys or tokens. The check should call an "identity" type operation. For example, if AWS access is required, then it can be verified by calling the aws sts get-caller-identity command.
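In bash, that check is a one-liner; the same idea applies to any provider with a "who am I" style call.

```bash
# Preflight: credentials. Verifies the configured AWS keys or
# instance profile without touching any real resources.
set -euo pipefail

aws sts get-caller-identity > /dev/null
echo "preflight: credentials are go"
```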
The DNS check uses the dig command. This verifies resolution of all external service hostnames. Imagine looping over all environment variables ending in _URL, then calling dig with each hostname. This is the first external-service prerequisite.
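Here's a sketch of that loop. The *_URL naming comes from the earlier 12-factor assumption, and the hostname parsing is a rough cut rather than a full URL parser.

```bash
# Preflight: DNS. Assumes external services are configured via
# environment variables ending in _URL.
set -euo pipefail

for var in $(env | awk -F= '/^[A-Z_]+_URL=/ { print $1 }'); do
  url="${!var}"
  # Strip the scheme, any credentials, and the port/path.
  host="$(echo "$url" | sed -E 's|^[a-z+]+://||; s|^[^@/]*@||; s|[:/].*$||')"
  if [ -z "$(dig +short "$host")" ]; then
    echo "preflight: ${var} host ${host} does not resolve" >&2
    exit 1
  fi
done
echo "preflight: dns is go"
```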
Next, the traffic check uses the netcat command. This verifies that the configured hostnames accept traffic on the declared ports. Like the DNS check, imagine looping over the variables, parsing the hostname and port, then calling netcat. If these two checks are go, then the services check follows.
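A sketch of the traffic check, reusing the same loop. The fallback port of 443 is an assumption for URLs that don't declare one; pick whatever matches your services.

```bash
# Preflight: traffic. Probes each configured host and port with
# netcat; assumes a netcat variant that supports -z (zero-I/O mode).
set -euo pipefail

for var in $(env | awk -F= '/^[A-Z_]+_URL=/ { print $1 }'); do
  url="${!var}"
  hostport="$(echo "$url" | sed -E 's|^[a-z+]+://||; s|^[^@/]*@||; s|/.*$||')"
  host="${hostport%%:*}"
  port="${hostport##*:}"
  [ "$port" = "$hostport" ] && port=443  # assumed default port
  nc -z -w 5 "$host" "$port"  # -w 5 bounds the wait at five seconds
done
echo "preflight: traffic is go"
```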
The services check runs the end-to-end tests I described earlier. Just to reiterate, the goal is to verify the service can connect to and use the external service as required for normal operations.
I also described the final dry-run check in a previous episode on the config factor. Learn more at smallbatches.fm/6. This check executes as many code paths in the application as possible, then stops just before accepting incoming traffic. This verifies the application can load the target configuration, meaning nothing is missing, unparseable, or unexpected.
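What that looks like depends entirely on the application. As a sketch, assuming a hypothetical --dry-run flag on your application's entry point:

```bash
# Preflight: dry run. The bin/app entry point and --dry-run flag are
# hypothetical; the idea is to boot with the target configuration and
# exit just before binding to the port.
set -euo pipefail

bin/app --dry-run
echo "preflight: dry run is go"
```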
This checklist hits the 80/20 sweet spot. It eliminates many regressions and will certainly save you when you least expect it.
That completes this batch. Visit smallbatches.fm and subscribe to the show for free.