Software Delivery in Small Batches

Something completely different. This episode is a short story about leadership, communication, and collaboration.


Creators & Guests

Host
Adam Hawkins
Software Delivery Coach

What is Software Delivery in Small Batches?

Adam Hawkins presents the theory and practices behind software delivery excellence. Topics include DevOps, lean, software architecture, continuous delivery, and interviews with industry leaders.

Project Banana
Hello and welcome to Small Batches. I’m your host Adam Hawkins. In each episode, I share a small batch of software delivery education aiming to help you find flow, feedback, and learning in your daily work. Topics include DevOps, lean, continuous delivery, and conversations with industry leaders. Now, let’s begin today’s episode.
This episode is something entirely different. It’s a story about leadership, team communication, and collaboration. I hope you enjoy.
Project Banana
Brian returned to his desk. He had just finished a walk to clear his head in preparation for the work ahead. Tomorrow was launch day. His team had worked for months on Project Banana. He chose “Banana” to keep it playful, even though the project was difficult and complicated. Project Banana was the team’s codename for a massive project to migrate the entire product from one-off service infrastructure to container orchestration.
The team was almost a year into the project. Every few weeks the team executed the carefully choreographed process of establishing a proxy that could split traffic between the old service and the new service running in containers. Brian was happy with the progress to this point. Each migration was successful with a few unexpected kinks, but the team always found a way through to complete the process.
Today, the end was in sight. Just one more piece to go: the monolith. Brian was happy that his team’s work had not negatively impacted any SLOs. He wanted it to stay this way, but the most challenging piece would happen tomorrow.
Brian let out an anxious sigh before he sipped his coffee. The walk cleared out some thoughts, but opened the door to new ones: “What could go wrong?”, “What haven’t we prepared for?”, “Am I really ready for this?” Another sip of coffee and the Latin Stoic model ran through his head: premeditatio malorum.
He knew tomorrow would be stressful. More importantly, he had to know the launch sequence inside and out because tomorrow he would be calling the plays.
A big H1 displayed “Monolith Launch Sequence” on his ultrawide display. The page view counter read: 238. Time for the last mental rehearsal and prepping his own notes for tomorrow.
Brian read through the launch sequence looking for the predeclared abort points and the points of no return. Each of these points provided the team a moment to control the clock and observe the current condition. These pauses would allow the team to sync up, collaborate, discuss the path forward, then commit to the next steps. More importantly, they were already baked into the process, so no one needed to feel off calling a pause.
The ultrawide came in handy. He split the screen 50-50: one side for Confluence and the other for notes in Notion.
Brian tried to write what he would say while leading tomorrow’s launch. “Begin with the end in mind” he thought to himself as he wrote down the first things he’d tell his team.
He reached for his cup, glanced in, lucky to find one last sip inside. He took it as a sign to stop thinking and start writing. The flow came on as the hours passed. Finally content with his notes, he jammed command-Q closing everything in an act of confident completion. Tomorrow was launch day; all that was left was getting a good night’s sleep.
Launch Day
Zoom launched all over the country while the team joined the call.
Marko was the first to join. Marko handled the container orchestration. Lindsey and Diego joined a few minutes later. Lindsey knew the app from frontend to backend. Diego handled the databases and caches.
Diego settled into his chair and pulled the boom arm closer.
“Where’s Jean? She should be here already” Diego asked. He shook his head in embarrassment when he saw “Jean is in the waiting room” at the top of the Zoom window. “Off to a great start” he muttered as he pressed the button to let Jean in.
Jean waved hello to everyone just as her webcam flipped on. “What’s up all?” she prompted the group. “Hopefully the proxy” quipped Lindsey while others chuckled and checked their dashboards. Jean handled the proxy and other infrastructure needed to coordinate the launch.
Jean didn’t take it seriously because she knew Lindsey was just joking around. Everyone was confident enough to engage in banter and friendly teasing.
This wasn’t the team’s first rodeo. In fact, this was the last one in Project Banana. Spirits were high and each person was excited to finally move the monolith onto the container infrastructure.
Brian was the last person to join the call. He said “Thanks for waiting up a few minutes team” while smirking over his mug of freshly brewed coffee. Everyone knew he was delayed making his pour-over coffee. A few minutes of chit chat passed as the team settled in.
Brian hit the hotkey to split the display, putting Zoom on one side and his notes from last night on the other. He began reading his prepared remarks for the first play.
“OK gang today’s the day where we finally ship this thing. We’ve all gone over the launch sequence separately and together as a team. We have escape hatches at known abort points, and planned checkpoints along the way. We will follow the launch sequence to each waypoint using the telemetry from our systems to check the current condition each step of the way. We have multiple waypoints to cross so let’s move deliberately and one step at a time.”
Heads nodded along in Zoom squares.
“So far, we’ve managed to do each of these migrations without major incidents. Let’s aim to keep it that way, so if any of you see a problem then pull the andon cord like we’ve always done. Throw up the siren reaction in Zoom or send an at-channel siren emoji blast in Slack. We won’t continue until the team can swarm on the problem and commit to the next steps. Remember, it’s better to pull the andon cord than proceed in uncertain conditions. I’ll monitor the Zoom call and our Slack channel for andon pulls.”
Brian continued.
“Each one of these migrations has had its own unexpected hiccups. We don’t know what they will be, but we know how to handle them. We stop and swarm the problem. Let’s not be hasty in our actions. It’s better to allow some failures to continue while we investigate what’s happening than take improper action and make the situation worse. We have an error budget for a reason, so let’s use it. I’m holding the pager right now so you’re not interrupted while executing the launch sequence. I know you’ll need to focus on the launch sequence, so I’ll be the goalie.”
He paused for a sip of coffee and to catch his breath and check the time. T minus 10 minutes. Right on time for preflight checks.
Preflight Checks
Marko pulled up the launch sequence on Confluence and scrolled down to the preflight check section. A preprepared table was ready to be filled out. There was a row for each item and a column for each person.
Marko flipped down his headset mic and called out the first check: “VPN?”. Green flag reactions filled the Zoom window. Brian filled in the table on Confluence while Marko called out the remaining items. “All systems go” Marko declared, flipped his mic up, and passed the reins back to Brian.
Brian prepared the last round with the team before committing to the launch sequence.
“Marko, Lindsey, Jean, and Diego, time for final red/yellow/green on launch.”
Green circles came up on all the Zoom squares. The team was ready. The launch sequence was locked in.
“OK team. We’re all green. Time to settle in. Let’s light this candle. Jean, deploy the proxy in preparation for traffic splitting.”
Deploy the Proxy
Jean took control of the screen share. The team watched as she executed the predetermined commands to deploy the proxy. Jean called out “Deploy proxy instance” and Brian noted the time and telemetry on the launch sequence document.
“Adding old system load balancer behind the proxy”, followed by, “connecting container ingress to proxy”. Brian recorded that the team had cleared the first waypoint. Onward to T2. “Jean, time for the next step. Run the acceptance test against the proxy.”
“Roger”, Jean ack’d. Logs streamed across the terminal. Those were just debug logs. All that mattered was the green “Tests passed” at the bottom of the screen. T2 cleared. Brian scrolled the launch sequence to T3.
T3 was the last point before the traffic would split between the old and new systems. The team knew another check was in order. “Everyone, red/yellow/green, let’s see ‘em” instructed Brian. Four green indicators. “Roger Roger”, Brian echoed. “Next stop: T4. We’re going to start splitting traffic between the old and new system using a 95/5 split. I know we’ve done this before, but everyone stay on top of their USE signals. We’re not expecting problems at this point. Pull the andon cord if you spot a problem. The launch sequence calls for running at this traffic split for a 15 minute observation period. If all good after the 15 minutes, we’ll ramp up to 80/20. Jean, you have the controls.”
Brian noted down the completion of T3 on the launch sequence doc. Jean called out the command to first split over only 1% of traffic. Marko ack’d the command as correct and Jean smashed the enter key. Suddenly blue bars appeared on all the charts.
Jean confirmed the expected appearance on her traffic charts. Connection queues looked good. Latencies stable. More importantly, no red bars on the error chart. “Goliath online!” she memed in the chat.
Diego gave a thumbs up on his charts because there was no delta, just as expected. There was no new traffic to the data layer, just traffic coming from different sources.
Marko didn’t need to say anything. The charts did the talking for him. Signals looked good. The increased load showed on the utilization charts and there was an increase in active containers. No sign of exhaustion, there was plenty of capacity on the table. Marko reacted with a confident “Let’s rock” emoji in the chat.
Lindsey plus-one’d Marko’s reaction. No errors from the application side.
“Copy that all, ramping up to a ten percent split as planned,” said Jean. She copied the command from the launch sequence doc and waited for audio confirmation. This time Lindsey gave the go-ahead. Jean pressed one finger on the keyboard and the same blue bars got a whole lot taller. Much more traffic was flowing to the new system.
“OK team, we’re now at T4 in the launch sequence with a ninety-ten split between the two systems. Keep watch on the USE signals and let’s check back in fifteen as planned.” said Brian.
Everyone went on mute as they leaned back to observe what was happening. It didn’t seem like much at the time, but this was the first time the monolith was running on new infrastructure in over five years. They were more curious to see just how the system would behave in production, not concerned with the success of the moment. The last step in Project Banana still had a way to go.
ANDON!
Diego leaned forward in his chair. Something didn’t look right. The queries per second numbers were higher than he remembered. “Blue bars OK” he thought to himself, “now what about the red bars?” His eyes moved to the next portion of the dashboard. No errors or dropped connections. No red bars. Now Diego was fully hunched over his mic. “Lindsey, my traffic numbers are high. What do you see?” he asked cautiously.
Lindsey didn’t notice anything, so she passed the question to Jean. “Jean, how do the traffic numbers look?”
The blue bars grew as Jean watched her dashboard. “Uhh…something is happening. Our traffic volume is increasing.” Now everyone could see the increased volume on their own charts, but no red bars. Error budgets were still OK.
“Let’s sit tight for a few minutes and see if this goes away. Let’s not be over reactionary. We’ve seen this before” Jean said while she motioned for calm with her hands.
Five Slack notifications appeared simultaneously on five different computers. The CMO posted an at-channel message plastered with clapping and tada emojis. A big celebrity had just posted about the company’s products on social media. She was overjoyed at such a fortuitous event. The CEO and a bunch of others reacted with money bags and party parrots.
Butt cheeks clenched on the Project Banana call. The blue bars were growing taller by the second. The team was just beginning the launch sequence on a delicate production migration that was almost a year in the making, just when an unexpected event drove tons of new customers to the site.
Lindsey broke the silence. “OK, wow, what timing right? I know we’ve done the load testing and simulations based on normal traffic levels. The traffic right now is already high, and who knows how long this wave will persist. Just how much difference is there between the current traffic and our tests?”
Jean pulled up her dashboard and looked at a historical comparison. She saw the blue bars growing taller out of the corner of her eye. The customers kept coming, but luckily the errors did not. Jean put her dashboard on the screen share and calmly stated “We’re 4x over our tested levels. We’re in uncharted territory.”
Marko snapped his mic into place just outside his mouth. “I vote we cut off traffic to the new system and send 100% of traffic to the old system. We know the old system can take it. We don’t know if the new system can. We don’t want to risk it.”
Lindsey, Diego, and Jean all put a thumbs up center frame. There was no need for discussion; everyone knew what needed to be done.
Jean copied and pasted the command to engage the escape hatch. Marko ack’d the command and Jean hit the enter key with a quickness. Everyone watched as the blue bars moved from one chart to the other, now the old system was dutifully chugging along handling the unexpected traffic volumes. The situation seemed stable for a moment. There was still one thing left to do.
Jean posted the word “ANDON” with a liberal amount of siren emojis and the all powerful at-channel in the team’s Slack channel.
Brian’s Apple Watch vibrated his arm. He knew that meant only one thing. He double checked the notification. It was an andon pull. He cut his walk short and sprinted back to his desk. Luckily, he was only a minute away.
Control the Clock
Brian returned to his desk as quickly as he could. His Slack was lit up like a Christmas tree. Everyone was pinging everyone about this unexpected free marketing.
He collected himself for a moment, remembered the andon protocol, then rejoined the call.
“Thank you for surfacing this problem Jean.” he said calmly. “Please show me what’s going wrong.”
“Right, let’s go straight to the charts.” Jean responded.
Jean put her dashboard on screen share. They all leaned in to see what Jean was seeing.
“Look at the jump in these traffic volume charts. Check the CMO post in Slack. We’re getting slammed with traffic because a celebrity started mentioning us. Great for the business but a challenge for us.” Jean explained while she motioned to the growing blue bar charts.
Brian gave a visible nod. “Mhmm, I’m with you, please continue.”
“See these traffic charts here Brian? Volumes kept growing. That’s when Lindsey pointed out the problem: these traffic numbers are currently four times higher than our load tests against the new infrastructure.” Jean said.
Jean watched Brian’s face grimace. She had a hunch he knew what happened next, but he asked anyway. “Oh my, so there’s the problem. What did you do next?”
“Well, all of us decided to pull the abort handle as written in our launch plan. We cut off traffic to the new system and sent one hundred percent of traffic to the old system. We know the old system can handle it. Now we’re in the abort state and do not know how to proceed. The traffic volumes are still way higher than we’ve tested for. That puts our error budgets at risk. We simply don’t know what will happen if we dump the load on the new system.” Jean finished. Diego, Marko, and Lindsey all nodded in consensus.
Brian sensed the apprehension in the team. He knew this was a joint decision.
“Alright, this is good. I appreciate all of your focus on maintaining the error budget and care in executing this launch safely. I’m sure the business would happily ride this free press, so we don’t want to cause any problems while it’s happening but we also need to finish this launch.” Brian said, careful to celebrate the team’s work.
Brian knew it was time to pivot into a different mode. He had to set the stage and call the next play.
“So now we’re in the abort state. Lucky for us, this is a stable state. The old system is chugging along. This means we have time to pause, collaborate, and commit to the next step. External conditions changed, so we must change with them. We need to discuss and decide together what that change is. Let’s begin with a failure mode and effects analysis around the high traffic levels. Everyone, please respond with a green/yellow/red on how this will impact the new system. I’m guessing the andon pull came from instincts. Nothing wrong with that, let’s aim for more empirical analysis at this point. I’ll be back in twenty to check the result.” Brian waved a sign off and went off cam.
Brian wanted to give the team free time and space for the failure mode and effects analysis. More importantly, he did not want his presence or reactions influencing their analysis. He hoped this absence would communicate he trusted them. Plus, Brian could use the time to mentally prepare for guiding the team through dialogue towards the next steps. Time to put on the coaching hat and prepare to go to the gemba.
The timer on Brian’s Apple Watch buzzed and prompted him to flip his cam back on.
Collaborate
Brian saw three green and one red. He knew he could work through this. He checked his mental playbook: have the minority speak first. That meant starting with Jean.
Brian inquired, “Jean, I see you responded with red. Can you explain why?”
Jean took a breath and started her analysis.
“I’m worried that if we proceed as planned, then we’ll blow through the connection queues in the proxy, leading to timeout errors, and 500 responses that will eat through our error budget.” she explained.
She zoomed in on the chart and annotated the screen to drive her point home. “See this chart over here. This is queue saturation during the time we only had a small split of traffic. We’re already pushing higher than expected. I think we can assume a linear relationship between traffic and queue saturation, could be worse though. We’re in uncharted territory here.”
Jean watched the reaction on her teammates’ faces. Some heads shook in dismay while others breathed a sigh of relief because like Voldemort, part of the problem was named.
“Our aim is to complete this migration without negatively impacting production. I don’t think we can do that if we follow the launch sequence as planned.” Jean added.
Brian was happy to see one level deeper understanding of the problem condition. Jean responding with red signaled that a team member could confidently raise the issue with the team. It was a sign of how safe Jean felt.
“Yeah, thanks Jean. I see, and the nods amongst the team indicate as well, that this is a definite problem. Anything else behind the red response?” asked Brian.
Jean followed with a deep nod, then kept going. “The proxy is only the first link in the chain. I’m not sure if other parts of the chain can cope with these significantly higher than expected traffic volumes. The proxy is just the first of our problems.”
Lindsey noticed the wheels turning in Marko’s head. It looked like he was doing something else. “Marko, what about utilization and saturation signals from the container orchestration? How big of a problem is this?”
The sound of Marko’s name got his attention and he flipped his mic down. “Um...Yeah I get what you’re saying. Let me think for a second,” he said, and buried his face in his hands.
“Right”, Marko began a few moments later, “we’ve configured the service to scale horizontally based on load. These traffic levels will definitely cause scale outs in the container infrastructure.” He grabbed the screen share and pointed the team to the compute utilization chart.
“Jean’s chart shows we’re still four times higher than anticipated traffic volumes, but we only have a 20% buffer in the infrastructure to handle these scale outs.” The team noticed the line chart hovering below the yellow warning marker.
“Here’s our problem though. We’re migrating to these containers to keep the company on budget. We could handle this traffic volume by scaling out the container infrastructure to match. That’s a no-go because we’ve already maxed the spend on the infrastructure. Completing the migration is the only way to free up money at this point. I’m 90% sure that we’ll increase latency across all the services running in containers. That’s a large blast radius. At that point, we’re not talking about problems in one service, but unpredictable problems in many services.” Marko’s head shook in a mix of fear and disappointment.
He ended with “This is dangerous. I’m going from green to yellow.”
Brian knew this discussion could spiral into problems only. He knew he needed to control the discussion space.
“Yeah, I hear you Marko. This is good. So far we’ve identified two problems ahead of us: queue saturation and container compute capacity. Let’s see what else we can find.” Brian said calmly to the team.
Now best to revisit Lindsey and Diego. They were still green.
“Diego, I see you’re still holding green. You have a different take. I’m curious. Can you explain?”
“Definitely Brian”, Diego responded while he organized his desktop in preparation for a screen share. “Here are the utilization and saturation charts for each database and cache going back six months. Each chart has traffic volume on a separate axis. Check this out,” he said as he annotated each chart with a circle.
“See that we’ve had multiple periods with these higher traffic volumes without saturating the instances. This chart shows the connection queue depth and corresponding traffic volume. We have plenty of headroom even if more celebrities started piling on. We’ve been through this before, so I know we’re good. That’s why I’m holding green.” He closed the screen share. Thumbs up reactions and encouraging nods confirmed everyone felt the same.
“Thanks Diego. This is good news. I appreciate the historical analysis and helpful charts. Lindsey, you’re still showing green. How do you read the situation?” Brian inquired.
Lindsey had been mum this whole time. She understood the problem at a conceptual level, but knew that deploying a solution was outside her capabilities. That didn’t stop her from wanting to help the team.
She considered herself a real programmer. She knew the framework code, the app code, and a bunch of the supporting libraries. She led the team that ensured the app maintained horizontal scalability. In a sense, she listened proudly, knowing the system would work as expected until it smacked up against other constraints.
While everyone was talking, she slapped together a system diagram with sticky notes in Miro. This was her mental model of the system. She put the bang emoji to indicate problems and gathered her thoughts. Lindsey felt this was one of her superpowers: not getting mired in the white box implementation details of everything, but being generalist enough at the black box level to see the bigger picture.
Lindsey put her Miro board on screen share and started to talk. “OK, first here’s my mental model of the system” as she motioned with her mouse pointer.
“We have the proxy in front of both systems that can split traffic to each. The new service is horizontally scaling in container infrastructure based on load. Our container infrastructure has a hard cap on how much compute it can consume according to our budgets.” The team listened intently to the recap, grateful for the simple explanation of the system and problem.
“We know we can’t do anything here.” She pointed at the bang emoji on the container infrastructure sticky. “That takes one variable out of the equation. So we’re left with two variables: the proxy and traffic volumes.” She paused for a moment when Brian interjected with “What are possible countermeasures to the problem?”
Everyone had joined the Miro board. Cursors moved about the screen. Marko flipped down his mic and selected the arrow between the proxy and the new service running in containers. “So if this is a simple matter of high traffic volumes, what if we reduce the traffic?” He made the arrow thinner on the board to drive his point home. “We can control the traffic split with the proxy. This is a key safety mechanism for us, so let’s lean into it. We can slow down the ramp up to control the load.” Marko mimicked a volume knob with his hand, first turning quickly, then turning up more slowly.
Verbal nods filled each team member’s headphones. Brian noticed the team was making progress. There was one possible countermeasure. Progress.
Jean chimed in with her thoughts. “Marko’s idea would work but slowing down the ramp up time would significantly extend the launch sequence beyond our scheduled window. Perhaps that’s OK, but I cannot say so. Also, we’re assuming we cannot spare any money to increase the compute available to our container infrastructure. Are we really certain this is true? Even so, that assumes we can clear traffic through the proxy. We don’t have any countermeasures to that problem yet. Normally, I’d say we just scale out the proxy but we’re low on budget to support that too.” She grimaced while she said it, feeling like her own understanding of the problem had painted her into a corner.
The team’s butt cheeks clenched. They collectively saw the finish line of Project Banana fall below the horizon.
Brian restrained himself from speaking, instead choosing to focus on visualizing the countermeasures on the Miro board for everyone to see. The dialogue had ended. Brian checked his notes for any preplanned advice. He paused to consciously plan his next words.
“Jean, thanks for the analysis. The problem is we don’t have capacity to scale the proxy to support the traffic. You’re saying we can’t scale horizontally. What’s the real challenge for you in scaling vertically?” he asked then let the question sit.
Jean’s head shook with confusion, then her eyes circled around processing a new question. The mental wheels turned.
“Riiiiighhtttt”, she said slowly as the doors opened for her, “I never considered that. I don’t even know what that would look like.”
The team pondered in silence, hoping someone did know what vertically scaling the proxy looked like. Brian waited longer, ensuring he was the last to speak. Then he asked the follow up.
“Well, what do you want out of the proxy?” asked Brian. That question was enough to change the thinking from open-ended to closed-ended for Jean. That she had a clear answer for.
“That’s easy: we need larger connection queues to prevent saturating the proxy.” Jean said confidently.
“OK, Jean what’s the real challenge for you in increasing the connection queues?” Brian asked calmly.
“Oh shit, I honestly don’t know how to do that or where to start.” Jean admitted. Luckily there was a team with diverse expertise to support her.
“Alright team, how can we help Jean?” Brian prompted to the group.
Diego spoke up channeling his internal kernel geek. “Jean, can you show that chart again with the max queue size? I have a hunch that value is lower than what the underlying VM supports.”
Jean threw the dashboard back on the screen share. Diego stroked his chin while pondering the charts. “Hmmm, ya this value does look lower than expected. Can you open the repo for this? I can review the configuration.”
She opened a YAML file. “BINGPOT!” exclaimed Diego as he circled the commented out line. “Classic configuration problem. The defaults are too conservative. Let’s max this out and we should be good to go.”
“Wow! I can’t believe we missed that through all these launches!” Jean said with muffled shock. The team breathed a collective sigh of relief as one red turned to yellow.
Brian saw the beginning of a PDCA loop forming. He knew it was time to grease the wheels. “Nice, great find Diego. So what’s the next step in deploying this countermeasure?” coaxed Brian.
“I’ll commit this change and push it through the pipeline. Then, I’ll check the ulimit settings on the instance against the expected higher value. If all good there, then I’ll promote the change to production. We’ll have a beefier and more powerful proxy ready to go in no time.” explained Diego riding high on his solve.
Brian ack’d with a big thumbs up then refocused the team on the remaining problems. “Diego’s fix is a countermeasure to the problem with the queue saturation in the proxy. That’s one down. We have another problem: the compute capacity. What are possible countermeasures for the traffic arrow Marko mentioned?” prompted Brian.
A sense of confident curiosity returned to the team. Now they were determined to change the last yellow to green.
Lindsey piped up. “There’s another assumption we’re making. Marko’s solution to ramp up traffic more slowly would take more time. That pushes the launch sequence outside our window. I think we’re assuming we can’t get more time because this launch has been baking for so long that it feels set in stone at this point. What if we’re wrong? What if we could get more time? Brian, can we do that?”
The aha moment bells flipped on for everyone.
“Ah yes! I’ll check with management to see if we can get extra time and what the impact may be. Let’s assume we get the extra time. I’ll be back in twenty minutes to review the updated launch sequence.” Brian left the Zoom call to confer with management. Lindsey took lead in updating the launch sequence while Brian was away.
Brian’s face filled the Zoom call. “And?” Lindsey asked as soon as Brian connected to audio. “We’ve got the extra time” Brian said while displaying two big thumbs up center frame followed by clapping and tada reactions.
“How has the launch sequence changed?” Brian asked.
Lindsey took the screen share and split it between two documents. The launch sequence on the left and the launch dashboard on the right.
“The changes are in the early parts of the launch. No abort points have changed. Only our T-markers, checkpoint telemetry, and ramp up time” Lindsey explained. “The biggest change is on our dashboard. It was clear we needed to improve our visibility into the mental model of the system. We’ve added a new section to the dashboard for these key saturation signals alongside the error budget charts. We’ll monitor these charts at each waypoint, along with the predetermined telemetry. We’ve also tweaked the traffic ramping pattern after a few iterations. Initially we planned for a linear ramp up. Now, we’re going to start more conservative then speed up as we gain confidence in the operations of the system. This will keep things safe while keeping us on pace.” The team nodded while Lindsey finished her summary.
Brian was pleased. “Tight. I appreciate the process you used to update the launch sequence. The new visual controls will help us achieve our safety goals. I also see that each of you contributed to the new signals and checkpoints in the sequence. Now it’s time to commit and move forward. Marko, Diego, Lindsey, Jean, red/yellow/green sound off.”
Marko, green. Diego, still green. Lindsey, green. Jean, green.
“Alright team. We’re all green. Time to light this candle. Jean, proceed from T-3. Diego already deployed the updated proxy, so let’s split that traffic. I’m going back to confer with management on the impact of the extended time window. I’ll handle them, you all take Project Banana over the finish line. I’ll be back thirty minutes before the end.” Brian signed off and left the team to execute the revised launch sequence.
Jean took control and began the sequence to split traffic between the old and new system. The team leaned closer to their monitors to see the new blue bars appearing on the traffic charts.
Complete
Brian returned a few hours later as planned. Everyone seemed laid back on the call, a marked difference from the last time he spoke with the team. No one even noticed that he rejoined the call.
“Oh, Brian is back, everyone,” Lindsey spoke up. “We’re just chilling now.”
“What’s the situation now?” asked Brian.
Lindsey pulled the launch dashboard up on the screen and pointed at a chart where the blue bars had disappeared.
“Goliath online!” Lindsey exclaimed, proof that those childhood gaming experiences stick with us well through adulthood and into the work itself. Everyone on the team knew what this meant.
Clapping and goliath reactions poured into the chat. A definite sign of celebration.
“This chart shows the point when there was no traffic at all hitting the old system. The chart next to it shows the traffic to the new system. So yeah, goliath online!” she continued.
“The extra time worked perfectly. That gave us enough time to slow down the ramp up to accommodate the load. Marko made a few tweaks to the container infrastructure during the launch sequence to keep everything running safely.” She directed attention to the error charts.
Marko chimed in after leaning forward from his chair. “Yeah, the slower ramp up gave us time to catch a few problems. I tweaked the resource allocation for other services already running in the cluster to keep up with higher traffic volumes. Diego’s fixes to the proxy held. There were no real problems and our error budgets are still green. As far as we can tell, there was no negative impact to production and we handled the burst in traffic from the unexpected free press.”
Brian felt a sense of completion. The last service had finally been migrated to the new container infrastructure. Project Banana was complete! It was a big deal for the engineering team and the business itself. All the engineers were stoked to use the container tools instead of the hodgepodge of the old system; plus it was cheaper! That meant more money in the company’s pocket to grow the business or invest back into it. Things were definitely looking up.
“Goliath online indeed!” said Brian with a huge smile on his face. “Project Banana is COMPLETE after all the hard work. Today’s launch is a great example of how all of you worked to maintain production safety and collaborate on adjustments to the plan in a stressful environment. Not only did we finish the launch, but we did it while serving huge amounts of traffic! The management team told me we’re well ahead of our quarterly forecasts. Massive win-wins all around!”
“Now there’s one last thing we need to do before we celebrate. I’ll schedule a retrospective for next week, so jot down some ideas while they’re fresh in your head. This is great fuel to help us level up as a team.”
“I also got you all something extra. There’s a $100 celebratory expense waiting for you. Typically, I’d take everyone out for drinks but that doesn’t work in the remote world, so treat you and yours however you like.” Brian ended by clapping for the team.
Complete
The team met again a week later for the planned retrospective. Brian kicked it off with a simple question: “How can we get better?”
Jean knew exactly what she wanted to say. “Well, we don’t need the proxy anymore, but the conservative default may exist in other parts of our infrastructure. Odds are we don’t want that. Plus, we totally missed that in all the launch prep work.”
Diego was nodding along the entire time. “I agree with Jean. She identified two problems: one, that these settings may already be in production; and two, that we don’t want to unexpectedly ship these settings to production.”
Brian knew the team was on the right track, so he stayed quiet and let the dialogue continue.
“How can we identify these settings in production across our system?” asked Lindsey.
Soon enough, various sticky notes were added to the Miro board. The team began voting on the possible solutions. Questions like “What’s the first step with this countermeasure?” and “How confident are we in this countermeasure?” bounced around the group.
Brian saw enough shared understanding forming as the discussion reached diminishing returns. This was the moment to bring the team back together to move from thinking to action by committing to next steps.
“Time out, everyone. There’s great stuff on the board right now. Seems we have aligned on the background to the problem, a clear target condition, and some possible countermeasures. But how well do we really understand the current condition? We need a clear understanding before proceeding. Any volunteers for driving this target condition with an A3?” Brian asked.
Two hands went up. The two people looked at each other. Clearly one person was more interested than the other. One hand went down. Lindsey’s hand was up.
“I can do this, Brian,” Lindsey said. “This is a great opportunity for me to grow my capabilities and learn about the ops-y side of the job. Plus, this should improve things for our team and all the other engineers, right? This sounds like a win-win all around.”
Brian created the epic for Lindsey’s A3 work and the process continued. Brian kicked off the next iteration with a similar question: “Where else can we get better?”
Conclusion
You’ve just completed another episode of Small Batches. I hope you enjoyed this episode as much as I enjoyed writing it.
Most of the examples in this story come from Leadership is Language and The Coaching Habit. There is an episode on The Coaching Habit in the back catalogue. Maybe you can identify some of the essential questions from that book in this story.
Next week, I’ll follow up with a standard episode on Leadership is Language along with some explainers about the communication patterns used in this episode. I loved this book. My copy is covered in highlights.
Anyways, find links for everything on these books and more software delivery education at SmallBatches.fm/80.
I hope to have you back again for the next episode. Until then, happy shipping.