Software Delivery in Small Batches

Adam presents a series of questions to understand production operations, plus a method to practice: MMIVM. That's Model-Measure-Instrument-Visualize-Monitor.


Creators & Guests

Host
Adam Hawkins
Software Delivery Coach

What is Software Delivery in Small Batches?

Adam Hawkins presents the theory and practices behind software delivery excellence. Topics include DevOps, lean, software architecture, continuous delivery, and interviews with industry leaders.

Hello and welcome to Small Batches with me, Adam Hawkins. I'm your guide to software delivery excellence. In each episode, I share a small batch of the theory and practices along the path. Topics include DevOps, lean, continuous delivery, and conversations with industry leaders. Now, let's begin today's episode.

Software delivery excellence requires a "you build it, you run it" mandate. Running it is crucial because software only provides value in production; just building it is not enough.

Running software in production provides ample feedback opportunities to learn from design choices, expected behavior, and systemic problems. All that learning may be used to improve future development activities.

That right there is the Three Ways of DevOps: flow, feedback, and learning. First, establish the flow of software to production. Second, collect feedback about the process. Third, experiment to improve both of those activities.

Engineers and teams often struggle with the "run it" portion of the mandate. Why? Because running, also known as operating, is a different skill than building software. However, it is a skill like any other. Anyone can learn it with coaching and practice on the gemba.

Deming tells us that practice means nothing without theory.

Speaking of Deming, this is the last week to enter my giveaway for a free copy of John Willis' new book, Deming's Journey to Profound Knowledge. The giveaway ends February 29th. Go to SmallBatches.fm/103 for details on how to enter.

OK, let me channel Deming here. In this episode Iā€™ll share the theory and practices behind understanding production operations.

Let's begin by creating a common mental model for "production operations". The aim is to provide working software to consumers in production. We can achieve that aim by working through a series of questions. The answers create a method for bringing systems under some level of operational control.

Every system must have an aim. Our software has a simple aim: provide value to consumers in production. This is the first question: What is the system supposed to do? Operators must be able to state how the software provides value to consumers and the intended behavior.

If this question cannot be answered, then it's like playing darts without the dartboard. You need the target.

After establishing what the system is supposed to do, ask the second question: How do we know the system is doing what it is supposed to be doing? Complex systems have multiple answers to this question.

Answering this question requires empirical thinking. Your answers should include "as measured by". Here's an example. Consider a travel booking system. One answer to this question may be: "The number of completed bookings, as measured by the total bookings with confirmed payments".
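To make "as measured by" concrete, here is a minimal Python sketch of that booking measurement. The Booking record and its payment_confirmed field are hypothetical; a real system would compute this from its own datastore or telemetry.

```python
from dataclasses import dataclass


@dataclass
class Booking:
    # Hypothetical record shape; a real booking system will differ.
    booking_id: str
    payment_confirmed: bool


def completed_bookings(bookings: list[Booking]) -> int:
    """Completed bookings, as measured by total bookings with confirmed payments."""
    return sum(1 for b in bookings if b.payment_confirmed)


bookings = [
    Booking("b-1", payment_confirmed=True),
    Booking("b-2", payment_confirmed=False),
    Booking("b-3", payment_confirmed=True),
]
print(completed_bookings(bookings))  # prints 2
```

The point is the shape of the answer: a consumer-facing outcome ("completed bookings") paired with a precise, countable definition ("confirmed payments").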

Your answers should use language the consumer understands. Remember the system's aim: provide value to consumers in production. Frame your answers accordingly.

The first two questions build a mental model of system operation. The next step is bringing that mental model into the world. Here's the third question: How do we instrument the system to measure what it's supposed to be doing?

Now we're getting closer to the redwork.

Instrumenting means adding telemetry. Telemetry is typically logs, metrics, traces, and events. We need telemetry to understand what the system is doing at any point in time.

Go and see if the system produces the telemetry to measure what you came up with in question two. If the telemetry is missing, then close the gap.
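Closing the gap can be as small as emitting one structured event at the moment the measured outcome happens. Below is a minimal sketch using Python's standard logging and json modules; the logger name, event name, fields, and the confirm_payment function are all hypothetical, and your log pipeline or metrics library would do the counting.

```python
import json
import logging
import time

# Hypothetical logger name; route it to whatever log pipeline you already run.
logger = logging.getLogger("booking")


def emit_event(event: str, **fields: object) -> str:
    """Emit one structured (JSON) log line and return it.

    A log pipeline can count these lines to produce the measurement
    defined in question two.
    """
    line = json.dumps({"event": event, "ts": time.time(), **fields}, sort_keys=True)
    logger.info(line)
    return line


def confirm_payment(booking_id: str, amount_cents: int) -> None:
    # ... hypothetical payment processing would happen here ...
    # Close the gap: emit the signal the "completed bookings" measurement needs.
    emit_event("booking.payment_confirmed", booking_id=booking_id, amount_cents=amount_cents)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    confirm_payment("b-1", 12900)
```

Structured lines like these are easy to turn into counters later, which keeps the instrumentation tied directly to the measurement rather than to whatever the code happened to log.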

Now you have a model of system operation and the signals to reconcile it. So here's the next question: How do we visualize the telemetry to know the system is doing what it's supposed to be doing?

Answering this question requires visual management with charts and other indicators. The visual management must clearly communicate the intended behavior so there is a call to action when that's not happening. This may be a line chart with a colored horizontal marker for a threshold. If the measurement falls below the threshold, then there's a problem.
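The threshold marker encodes a simple rule that a person reads off the chart. As a sketch, here is that same rule written in Python over a hypothetical hourly series; the values and the threshold of 100 are made up for illustration.

```python
def below_threshold(series: list[tuple[str, float]], threshold: float) -> list[str]:
    """Return the timestamps where the measurement fell below the threshold marker."""
    return [ts for ts, value in series if value < threshold]


# Hypothetical hourly "completed bookings" series and threshold.
hourly_bookings = [("09:00", 120.0), ("10:00", 131.0), ("11:00", 42.0), ("12:00", 118.0)]
THRESHOLD = 100.0
print(below_threshold(hourly_bookings, THRESHOLD))  # prints ['11:00']
```

On a chart, the 11:00 dip would sit visibly under the colored line; that visible gap is the call to action.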

Teams can use their visual management on an ad-hoc, daily, or weekly cadence to go and see if the system is doing what it is supposed to be doing.

The crucial bit here is that system behavior must be visualized. There's a popular quote often attributed to Peter Drucker: "What gets measured gets managed". We can send gigabytes of telemetry to our monitoring system but never look at it. That's measured, not managed: simply waste.

So, I prefer my version: "What gets visualized gets managed."

We're four questions into the cascade. So far, these questions have produced a mental model of system operations, the telemetry to reconcile it, and a manual visual management process to go and see if the system is doing what it's supposed to be doing.

Time for the last question: How do we make it so we're told when the system stops doing what it's supposed to be doing?

Control theory states that a control system must operate twice as fast as the underlying system. The production environment changes multiple times a day. Relying on a manual visual management process on a daily basis is insufficient, let alone weekly or biweekly.

Answering this question requires creating a 24x7 monitoring system that can page engineers when things stop working.
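A monitor is the threshold rule from the visualize step, evaluated by a machine around the clock. Here is a minimal Python sketch, assuming a hypothetical Monitor class and a page callback standing in for a real paging integration; in practice a monitoring system or scheduler would call run_check on a short, fixed interval.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Monitor:
    name: str
    threshold: float
    window: int  # number of most recent data points to evaluate

    def __post_init__(self) -> None:
        if self.window < 1:
            raise ValueError("window must be at least 1")

    def is_alerting(self, values: Sequence[float]) -> bool:
        """Alert only when every point in the recent window is below threshold.

        Requiring the whole window avoids paging on a single noisy data point.
        """
        recent = list(values)[-self.window:]
        return len(recent) == self.window and all(v < self.threshold for v in recent)


def run_check(monitor: Monitor, values: Sequence[float], page: Callable[[str], None]) -> bool:
    """One evaluation; a scheduler would call this around the clock."""
    alerting = monitor.is_alerting(values)
    if alerting:
        page(f"{monitor.name}: below {monitor.threshold} for the last {monitor.window} points")
    return alerting


pages: list[str] = []
bookings_monitor = Monitor("completed-bookings", threshold=100.0, window=3)
run_check(bookings_monitor, [120.0, 90.0, 80.0, 70.0], pages.append)
print(pages)
```

The window is a design choice: a longer window means fewer false pages but slower detection, which is the same speed trade-off the control theory point is about.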

Now I'm going to give you an acronym to internalize these questions and practice working through them. It's M-M-I-V-M.

ā€œMā€ for Model. Create a simplified visual model of the system (such as block diagram with communication paths). Incorporate how consumers use the system.

Check your work: Can I quickly walk another engineer through the diagram, explaining what the system is supposed to be doing and how it's designed?

ā€œMā€ for Measureā€. Ask yourself the question: ā€œHow do I measure what the system is supposed to be doing?ā€. Consider the aggregate system and components in the diagram.

Check your work: I can state the system is working as measured by blank. The blanks are typically the golden signals: latency, errors, traffic, and saturation.
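As a sketch of filling in those blanks, here is one way to summarize a window of requests as the four golden signals in Python. The Request record is hypothetical, p99 via nearest-rank is one common latency choice, and in-flight requests versus capacity is just one assumed proxy for saturation.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    # Hypothetical request record, e.g. parsed from access logs.
    latency_ms: float
    status: int  # HTTP status code


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests: list[Request], window_s: float,
                   in_flight: int, max_in_flight: int) -> dict[str, Optional[float]]:
    """Summarize one time window as the four golden signals."""
    count = len(requests)
    return {
        # Latency: p99 of request durations (None when there was no traffic).
        "latency_p99_ms": percentile([r.latency_ms for r in requests], 99) if count else None,
        # Traffic: requests per second over the window.
        "traffic_rps": count / window_s,
        # Errors: share of requests answered with a 5xx status.
        "error_rate": sum(r.status >= 500 for r in requests) / count if count else 0.0,
        # Saturation: in-flight work versus capacity (an assumed proxy).
        "saturation": in_flight / max_in_flight,
    }


sample = [Request(100.0, 200), Request(120.0, 200), Request(80.0, 503), Request(900.0, 200)]
print(golden_signals(sample, window_s=2.0, in_flight=5, max_in_flight=20))
```

Each key then completes the sentence "the system is working as measured by ...", for example "an error rate below 1%".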

ā€œIā€ for Instrumentā€. Determine how the system produces the telemetry for your measurements. Typical sources are application logs, APM libraries, Cloud Provider telemetry, and custom metrics. This requires a ā€œGo and seeā€ attitude to assess whatā€™s already instrumented in the system and whatā€™s not.

Check your work: I have a link to the source telemetry for each of my measurements or a plan to add it.

ā€œVā€ for Visualizeā€. Visualize the telemetry from the previous step as time series charts. Proper visual management is a whole separate topic, so here some quick tips.

Leverage color. Use blue for traffic and red for errors. Use bar charts for counters. Use line charts for latency. Design the charts to clearly communicate the presence or absence of expected behavior. Use text widgets for instructions on how to read the charts.

Check your work by evaluating each chart with this sentence: The system is working as measured by the behavior on the blank chart. Notice the expected behavior of blank. How well can you fill in those two blanks?

ā€œMā€ is for ā€œMonitorā€. Use the charts from the visualize step to create 24x7 monitors. Monitors will tell you when the system stops working. However, not all monitors are created equal.

Check your work before turning on those monitors: Do I want to wake up at 3 AM to fix this problem, or can it wait until tomorrow? This is urgency. There are only two answers. Proceed accordingly.
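One way to enforce that check is to make urgency a required, two-valued field on every monitor definition. Here is a minimal Python sketch; the MonitorSpec type and the pager and tickets lists are hypothetical stand-ins for a real paging service and ticket queue.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    # Exactly two answers to "do I want to wake up at 3 AM for this?"
    HIGH = "page now"
    LOW = "wait until tomorrow"


@dataclass(frozen=True)
class MonitorSpec:
    name: str
    urgency: Urgency  # decided before the monitor is turned on


def notify(spec: MonitorSpec, message: str, pager: list[str], tickets: list[str]) -> None:
    """Route a firing monitor by its declared urgency."""
    if spec.urgency is Urgency.HIGH:
        pager.append(f"[PAGE] {spec.name}: {message}")  # wakes someone up
    else:
        tickets.append(f"[TICKET] {spec.name}: {message}")  # handled next working day


pager: list[str] = []
tickets: list[str] = []
notify(MonitorSpec("completed-bookings", Urgency.HIGH), "below threshold", pager, tickets)
notify(MonitorSpec("disk-usage", Urgency.LOW), "80% full", pager, tickets)
print(pager, tickets)
```

Because MonitorSpec has no default urgency, nobody can turn on a monitor without answering the 3 AM question first.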

I've covered a lot in this episode, so let's stop here to recap. I shared an exercise for understanding production operations. It's MMIVM: Model, Measure, Instrument, Visualize, Monitor.

This exercise acts as a visual management system for the work itself. I'll explain.

I've seen engineers struggle because they're in "Monitor" without doing the work in "Instrument". The same goes for engineers eagerly jumping into "Visualize" without any understanding of the telemetry or the model behind it. The exercise acts as a way to move the work back to the appropriate step.

Once they're in the appropriate step, the path forward is clear: get to Monitor. First, get to the state of knowing the system is working. Next, be told when the system stops working. Then you can start doing real continuous delivery.

I challenge you to bring this exercise to your teams, especially those with little ops experience or fuzzy ownership. When they get stuck on a step, ask this question: "What's the real challenge here for you?" Start to develop their capabilities from there.

Remember: MMIVM; Model, Measure, Instrument, Visualize, Monitor.

All right, that's all for this batch.

I've purposely used the phrase "understanding production operations" in this episode. This is one of the four pillars in my Small Batches Way study guide. Get the guide to develop your capabilities in modeling, instrumenting, visualizing, and monitoring systems.

Get the guide and other helpful production operations links at SmallBatches.fm/103.

I hope to have you back again for the next episode. So until then, happy shipping.