Tiny DevOps

In this episode of Tiny DevOps, guest John Goerzen applies his experience as an amateur pilot to IT risk management.

Show Notes

John Goerzen is a staff engineer at Fastly, and an amateur pilot. In this episode, we talk about some of the parallels between aviation and IT as they relate to risk management, incident response, and the mentalities that can lead to problems. We discuss the concept of an accident chain: the idea that most incidents don't have a single cause, but a long list of contributing causes. We also discuss the importance of blameless postmortems for improving how we respond to failures, and the human aspect of incident prevention.

Resources:
Video: Faulty Assumptions
NASA ASRS reports: Callback
Video series: AOPA Accident Case Studies
PDF: FAA Aeronautical Decision Making

Today's Guest:
John Goerzen
Blog: The Changelog
Mastodon: @jgoerzen@floss.social
Twitter: @jgoerzen

Watch the video of this episode

What is Tiny DevOps?

Solving big problems with small teams

Recording: Ladies and gentlemen, the Tiny DevOps Guy.

Jonathan Hall: Hello, and welcome to episode number two of the Tiny DevOps Podcast. I'm your host, Jonathan Hall, and today I have with me an old friend of mine from, I think, before high school even: John Goerzen. John, would you take a moment and introduce yourself, tell us what you do professionally, why you know anything at all about DevOps?

John Goerzen: Sure. Well, I'm looking forward to this conversation. Right now, I'm a staff engineer at Fastly. We are a CDN, and we power a number of pretty large sites on the internet, but I've been working in the field for 25-plus years, I guess. I've been a Debian developer for about that long. I've been in various roles in SRE, DevOps, IT, and development, both as a technologist and as a manager. Also, I'm a geek. I have a lot of hobbies, probably way too many, everything from aviation to amateur radio to photography and so forth.

Jonathan: Nice. Aviation, that's one of the reasons you're on the show today. You are an amateur pilot. How long have you been flying?

John: About five years now.

Jonathan: You have your own plane?

John: I do.

Jonathan: You took me up in it- it's probably been a couple of years now, before the pandemic and everything. It was a lot of fun. As a self-proclaimed geek and a pilot, what's the geekiest piloty thing- maybe a piece of equipment or something- that you have? Have you built any devices that are in your airplane or anything like that?

John: Yes. Well, I've done some stuff in the hangar. With airplane piston engines, it's really not great for them to start when they're cold, and here in Kansas, it does get quite cold. You want to have them hooked up to some preheat, but you don't want it hooked up permanently, because that can also be damaging. I have a Raspberry Pi in my hangar hooked up to a 4G access point, and I have some ZOA switches and sensors, so I can schedule it to come on in the morning before I go fly and whatever.

The geeky part is the part that's not done yet, which is that the hangar is 10 miles away. I would really like to stop paying that 4G bill every month and set up a long-distance, slow-speed serial link using LoRa or XBee radios. I haven't quite had the time to finish that. I've got the radios, I've got a TCP stack running over both LoRa and XBee, but I haven't actually put it into place yet.
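A minimal sketch of the kind of scheduled engine preheat John describes, assuming a relay wired to a Raspberry Pi GPIO pin rather than the off-the-shelf smart switches he actually mentions; the pin number, timing, and library choice are illustrative assumptions, not his setup.

```python
# Hypothetical sketch of a hangar preheat scheduler on a Raspberry Pi.
# Assumes a relay driving the engine preheater is wired to GPIO pin 17;
# the pin, timing, and use of RPi.GPIO are illustrative assumptions.
import time
from datetime import datetime, timedelta

import RPi.GPIO as GPIO

PREHEAT_PIN = 17      # BCM pin controlling the relay (assumption)
PREHEAT_HOURS = 3     # switch on this many hours before departure

def preheat_until(departure: datetime) -> None:
    """Turn the preheater on PREHEAT_HOURS before departure, off afterward."""
    GPIO.setmode(GPIO.BCM)
    GPIO.setup(PREHEAT_PIN, GPIO.OUT, initial=GPIO.LOW)
    start = departure - timedelta(hours=PREHEAT_HOURS)
    try:
        while datetime.now() < start:
            time.sleep(60)                    # wait for the preheat window
        GPIO.output(PREHEAT_PIN, GPIO.HIGH)   # preheater on
        while datetime.now() < departure:
            time.sleep(60)
    finally:
        GPIO.output(PREHEAT_PIN, GPIO.LOW)    # never leave it on indefinitely
        GPIO.cleanup()

if __name__ == "__main__":
    # Example: preheat for a 9:00 AM departure tomorrow.
    departure = (datetime.now() + timedelta(days=1)).replace(
        hour=9, minute=0, second=0, microsecond=0)
    preheat_until(departure)
```

In practice a cron job or the switch vendor's own scheduling would do the same job; the point is simply that the heat comes on a few hours before a planned flight and never stays on indefinitely.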

Jonathan: That sounds like about three podcasts just to talk about all that. [laughs] I think we'll certify you a geek.

John: Thank you.

[laughter]

Jonathan: To the topic today: we wanted to talk about- or you wanted to talk a little bit about- how aviation and DevOps can be related, or are thought to be related to each other. Before we dive into that, though, flying is widely considered to be the safest form of travel, especially commercial flight, yet it still strikes so much fear in people sometimes. I have good friends, some mutual friends, who are terrified of flying. They don't want to come- I live in Europe, obviously- they don't want to come visit me because they don't want to fly over the ocean.

We know it's not rational, especially those of us who pride ourselves in thinking rationally, we know it's not a rational thought, but sometimes we still feel that way. Have you ever had any close calls?

John: No, I haven't. That actually gets to what we'll be talking about today and it's an intentional choice. We talk in aviation about the accident chain, and that is that there's usually not just one thing that caused the accident, but there is a whole series of things that if you made a different decision, even yesterday, maybe all of this wouldn't have happened. What we try to do is identify issues early to prevent the close calls, because the last thing you want is human performance to be the only thing preventing an accident, because we know that when we're very stressed, we're likely to not perform well anyway.

We still train for emergency situations because sometimes that is just life. In a small plane, the most common ways you get into a bad situation are running out of fuel or mismanaging it, or flying into bad weather. I will pay great attention to weather forecasts, and I will cancel a flight and drive, or fly somewhere else, or fly a different day if there's going to be bad weather.

We don't have a guarantee in life about anything, but there are things we can do to try and minimize our risks so that we don't have-- I don't want to ever have a close call and I want to try and do whatever I can to prevent it.

Jonathan: That's great. Let's just go from there. You already started to tie into how this can relate to the IT profession, but maybe just [unintelligible 00:05:37] of that. You talk about trying to avoid close calls by basically thinking ahead, how does that apply to your job, to DevOps, to IT in general?

John: Maybe let's step back a second, because aviation was not always as safe as it is now. If you look back into, say, the 1920s, '30s, '40s, it was actually very deadly. Then there was an effort to put a lot of research, spanning decades, into figuring out what was going wrong and how we can fix it. I feel like we're at about that point in IT; SolarWinds-type stuff is still happening in our field, but we have new things coming online that are trying to advance safety- just talking about languages, we have Rust and Haskell. Anyway, to tie it into IT, I think there are a few mindsets we can start with. We can expect things to go wrong.

Sometimes we have wishful thinking and we think, oh, everything will always be fine. We should all have learned at the school of hard knocks that that is not the case in computing. Then we want to be able to react with intelligence when things do go wrong. Then we also want to blamelessly analyze failures with an eye to interrupting that accident chain next time, because just like in aviation, in IT you can often go back and say, okay, this time that we deleted our entire customer database, maybe we had some opportunities to put some controls in, or to have different procedures or backups or redundancy or whatever, and we can learn from that and make it better next time. Maybe that's sort of an introduction.

Jonathan: I can think of so many examples. I'm sure you could list thousands, and any listeners can probably think of examples of just the last time something went wrong. It's so easy to point your finger at the guy who hit enter after the rm command, and forget to ask: wait, why was it possible for him to remove the database, or whatever it was they did?

John: Yes, exactly. That's really important. I guess I should say, by the way, that we're going to be talking about incidents where fatalities occurred, and these all represent a human tragedy. I'm going to be talking about them in a clinical, analytical way, not to minimize the tragedy, but we do that because we don't want there to be more tragedies. By the way, the same principle extends to the postmortems that we do in tech, where we generally have to have a little bit of distance from the issue before we can have a really solid analysis.

Anyway, if you go and read NTSB accident reports- and by the way, if you, like, bend the metal on a private plane, the NTSB is often investigating- they will not just say, "Oh, well, it was pilot error." Well, they may say that was the primary cause, but then they'll go and identify contributing factors.

Jonathan: When you say bend the metal, can you elaborate on what that- it sounds like a pilot term maybe?

John: That means literally, if you have a mistake on your landing and like your wing scrapes the ground and bends, that is a-

Jonathan: So literally if metal gets bent in some way.

John: Yes. It was more literal than it might've sounded.

Jonathan: All right. Sorry to interrupt, continue.

John: People are sometimes tempted to talk about pilot error, but you can often go back and say, now, wait a minute. One of the strategies that we use in tech postmortems, and that the NTSB uses too, is you say, "Okay, the cause of this was X." Well, why? You can keep asking why. "Why did the pilot make that error?" Well, they weren't trained for this situation, or they hadn't practiced the situation, or air traffic control told them something that confused them, or whatever. You have to stop some place, but you can keep following this a little bit.

The key is, if you're the guy that deleted the database, if you do a postmortem and you say, "Yes, that was all Joe's fault," you've done it wrong, because you have missed the opportunity to figure out why Joe had the access to type rm -r /. Have we talked about things like: if you're root on a production box, maybe you start every command with a comment character until you're sure it's right; or working in pairs; or having redundancy for these things; and all the rest of the things that can go into it?

This reminds me of something that happened at one of the companies I was working at. We had a credential get checked into a GitHub repository, a very sensitive credential. Now, we believed that it had not actually resulted in compromise, because it was a private GitHub repository, but we decided we were going to follow our process anyway. We took a whole bunch of engineers off of what they were working on- SREs, DevOps people, developers, a whole bunch of people- and we executed our procedure to basically re-key and scrub production.

You could look at it and say, "Boy, was that kind of an overreaction?", or you could say, "Hey, we interrupted the accident chain right there." In fact, you can also imagine what might've happened if we didn't do that. Maybe there was a compromise. Maybe somebody would send malware to customers, whatever. Actually, that's what happened to SolarWinds. We can use those analogies in technology as well.
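As a rough illustration of the kind of automated check that can catch a leaked credential like the one in this story, here is a minimal sketch that scans a repository's full Git history for credential-looking strings. The regex patterns and the scanning approach are assumptions for the example, not a description of the actual process John refers to.

```python
# Hypothetical sketch: scan a repository's entire Git history for strings
# that look like credentials. The patterns are illustrative, not exhaustive,
# and this is not a description of any particular company's tooling.
import re
import subprocess

PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "password assignment": re.compile(r"(?i)password\s*[:=]\s*\S+"),
}

def scan_history(repo_path: str = ".") -> list[tuple[str, str]]:
    """Return (pattern name, matching line) pairs found anywhere in history."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--all"],
        capture_output=True, text=True, errors="replace", check=True,
    ).stdout
    hits = []
    for line in log.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((name, line.strip()))
    return hits

if __name__ == "__main__":
    for name, line in scan_history():
        print(f"possible {name}: {line}")
```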

Jonathan: Have you changed the way you drive since you've become a pilot? [laughs]

John: That is an excellent question. The answer is yes. The car I drive doesn't have a tire pressure sensor, so every time before I get in the car, I do a little preflight and I check my tires to make sure they are all properly inflated. The other thing is after I land-- When you're flying, you're in this big, vast sky. If you pass within half a mile of some other plane, that feels uncomfortably close in a lot of situations. Then I put the plane in the hangar and I drive home, and it feels really uncomfortable that there is a car a few feet away. If the driver sneezes and closes their eyes and veers over, it's game over. [laughs] It just changes your perception of the risks that we take on a daily basis and don't even think about very much.

Jonathan: Exactly. You did this in [unintelligible 00:13:38] or you scrubbed your Git repository. How do you decide? When is it too minor to care about? You made the judgment call there that it was not too minor, but it sounded like maybe it was, maybe it wasn't, and you decided to go with it. How would you decide?

John: That's really the big question, isn't it? Of how do you decide? It's a judgment call every time in tech and in aviation too in some cases. The things we hear about are when things go wrong. We can easily look back and say, "Oh, well, [chuckles] shouldn't have given root to so many people or maybe you shouldn't have set your password to SolarWinds123 in your FTP server and things like this." All the times that people choose right, we don't hear about. In fact, people may start to question, were you being too cautious there because nothing happened and all of that.

You can look at it as a graph. On one axis, you can have the severity of how bad would it be if something went wrong. On the other axis, you can have how likely is it for something to go wrong. In the case of this GitHub situation, it was fairly unlikely that something would go wrong, but the consequences of it going wrong would've been very severe. Of course, in that case, you also don't want to be the company that's like, "Yes, we knew about this, but we didn't think it would be a problem." That's just devastating to your public perception.
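Here is a minimal sketch of the two-axis model John describes, scoring likelihood against severity. The 1-to-5 scales, the thresholds, and the rule that worst-case outcomes get acted on regardless of likelihood are all assumptions for the example, not anything prescribed in the episode.

```python
# Minimal likelihood-times-severity risk matrix, illustrating the two-axis
# model described above. The 1-5 scales, thresholds, and the "catastrophic
# always acts now" rule are arbitrary choices for this example.

LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "almost certain": 5}
SEVERITY = {"negligible": 1, "minor": 2, "moderate": 3, "major": 4, "catastrophic": 5}

def risk_score(likelihood: str, severity: str) -> int:
    """Combine the two axes into a single 1-25 score."""
    return LIKELIHOOD[likelihood] * SEVERITY[severity]

def triage(likelihood: str, severity: str) -> str:
    """Map a likelihood/severity pair onto a response."""
    if SEVERITY[severity] == 5:          # worst-case outcomes get acted on regardless
        return "act immediately"
    score = risk_score(likelihood, severity)
    if score >= 12:
        return "act immediately"
    if score >= 6:
        return "schedule mitigation"
    return "accept and monitor"

# The leaked-credential scenario: unlikely to be exploited, catastrophic if it is.
print(triage("unlikely", "catastrophic"))   # -> "act immediately"
print(triage("possible", "minor"))          # -> "schedule mitigation"
```

In the leaked-credential story above, the likelihood was judged low but the potential severity was catastrophic, which is why the team treated it as an act-now event rather than something to monitor.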

Jonathan: Even if nothing goes wrong, if it gets out somehow during an audit or something, it could still hurt your reputation, right?

John: Yes. Actually, AOPA, which is the Aircraft Owners and Pilots Association, makes a series of videos where they talk about the findings in these NTSB accident reports. They'll pick certain ones that are pretty informative. They're generally pretty accessible to nonpilots as well. They've got several that are really interesting to talk about in terms of how they parallel tech.

In fact, one of my coworkers at Fastly who led our incident response team for a while- she is not a pilot- would read NTSB reports and watch these AOPA videos, because they were so informative to her in terms of how we plan for all of this. There is one in particular where there was a pilot who had a Baron, which is a small twin-engine plane, and he was taking a bunch of people on a flight. He had low fuel and his plane was heavy. Planes have a weight limit, and sometimes, if you've got a lot of people, you don't fill up your tanks so that you stay under that limit.

He took off with weight a little bit above the limit and fuel a little bit below the legal minimum. He calculated, "I'll probably be okay." He took off. As he was getting close to the destination airport, there was bad weather there, and air traffic control vectored him around, so he had to fly a few extra miles. As he was doing that, both engines lost power due to fuel starvation.

He managed to get one engine restarted- this was a very experienced pilot, by the way, with several thousand hours of flying time- but he failed to follow the checklist for dealing with a single engine. He left the plane in a configuration where it had a lot of drag, so the single engine was not able to maintain altitude and the plane crashed. That's a situation where you had a whole accident chain, right? Bad decisions up front to go, and it led to human performance being that last factor: the difference between a close call that was okay and a close call that wasn't.

AOPA had a fantastic video on that. Let's see, I think I have it; it's called Faulty Assumptions. You can think about that. The pilot is thinking, "Well, a little bit overweight, probably okay. A little bit less fuel than the legal limit, probably okay." It turned out he was a little bit wrong about how much fuel he had; he was more under than he thought. When you're flying that fine a line, you can't survive anything going wrong. I'm very much a by-the-book person; I would never have done that.

Jonathan: It sounds like he was mentally thinking there's a margin of error on each of the variables. If I'm a little bit overweight, it's fine. If he had only been a little bit overweight, he probably would've been fine, but then he did the same thing with the amount of fuel, and the same thing with the distance he was traveling and the margins just didn't add up, right?

John: Exactly. That's a mindset that we're sometimes prone to have in IT as well. It's like, "Hey, it's probably okay to do this little thing and run this thing without redundancy for a little while, open some more ports, whatever it might be," but you just have risk accumulating, death by a thousand cuts, sometimes. That's a weakness of that model I just gave you of risk versus danger, because your risk and your danger can both be cumulative. You can look at each thing individually, but if you're not looking at all of them holistically, then you may miss that you've introduced some larger risks that can happen when you combine all of these things together.

Actually, the FAA identifies five hazardous attitudes- this is actually part of private pilot training; everybody has to know how to identify these attitudes- and a lot of them apply to IT as well. There's a sixth that a lot of flight instructors will add. The attitudes are: anti-authority, like, "Don't tell me what to do." In our field it could be, "Oh, the Sarbanes-Oxley rules, those are just best practices. It's okay, we don't have to worry about rotating our keys," whatever. Then impulsivity: just doing it quickly, getting it done. Frankly, in tech, sometimes that's what we have to do.

The big one that gets people is invulnerability: thinking, "It won't happen to us." In fact, SolarWinds was a publicly traded company that didn't have an information security officer, which, for a company that size, is just boggling.

Macho is another one it talks about: thinking, "I can power through this, I can do it." Or resignation: "You know, the servers crashed, what can I do now? I guess we should just go home."

Then in aviation, we add the informal one, which is what [unintelligible 00:21:44] call get-there-itis: "I have got to get home because I've got work tomorrow, and I'm going to just fly even though it's dodgy." We have deadlines, we have product launches; we have that in spades. If we can stop and identify these attitudes, it gives us a fighting chance to take a more logical approach and say, "Okay, we have this deadline and we have some risks." Can we stop and think about whether these risks are at an acceptable level here, rather than just being in a frenzy and going for it?

Jonathan: I think you've instilled, at least in me, a deeper sense of the importance of considering these risks. How do we respond when something does go wrong? What's the healthy approach when somebody did delete the database? I know I shouldn't point fingers now, but what do I do?

John: Well, I'll give you an illustration of what not to do to start with.

Jonathan: Great.

John: This was when I was doing Driver's Ed, probably at the same time and in the same town you were. Our Driver's Ed instructor liked to see how people would handle surprise. One of my classmates was driving, and he let out a very loud and percussive sneeze, and the person driving took her hands off the wheel and covered her face like this.

[laughter]

The first thing we've got to do is fight the desire to panic, right? Because when the chips are down and everything is going bad, the worst thing is to be like, "Oh no, everything's broken. The router is, for some reason, sending all the traffic down link A, and link B is the bigger one and it's not being used. Let's just take link A down and force it to send everything down link B," when you didn't check to see why.

It's very easy to get there, because when you're in that situation of panic, we're already down this accident chain, and we're to the point where a lot of things are relying on human performance, and we're in a place where human performance is going to be impaired because we're all stressed. It's easier said than done to get out of the panic. Ideally, you may even have some runbooks or some preparation for some of these things. In aviation, we have checklists; you do a lot by checklist, and there are some things that we do by memory. If the engine fails, there are certain things that you do by memory, and then you refer to your checklist for what to do in that situation.

By the way, for people that are afraid of flying: an engine failure does not mean an immediate crash, it means your plane turns into a glider. We may have these attitudes of impulsivity come out when we've got a real critical situation, and attitudes of resignation may also come out, but you've got to basically follow basic troubleshooting. A lot of us probably started our careers doing some flavor of tech support. When you do that, you learn very quickly that you need to check and make sure that the cables are plugged in tight, and you need to check and make sure that the CD is in the drive or the USB stick is in the slot, all of these basic things.

Sometimes we may be tempted not to do that when we're in the midst of a crisis, and we may overlook something. You've got to gather your data. You've got to think it through logically, in a compressed timeframe usually. Then hopefully you get things going, and then later- I like to say, never squander a good outage- you ask, what good can come out of this? The way you do that is you have a really solid postmortem process that's blameless. That's a challenge, because even if you have that, people will be skeptical when they come onto the team. It's important that you model-

Jonathan: [crosstalk] that they might get blamed.

John: By the way, in that example I told you about, the GitHub incident, I never knew who it was that did that. Very few people at the company ever did. I am glad about that, because that means people cared about actually making it as blameless as possible, recognizing that, gee, if it was easy for that mistake to happen, maybe we should put some things in place- which we did. Having the postmortem where you figure out not just what happened, but why it happened, and taking that a few levels deep, and then applying that to how we can make things better in the future, because you really want to not have so many incidents.

Jonathan: I remember, actually, with the last couple of teams I joined over the last two or three years, waiting for that first incident to occur on my watch so that I could do a postmortem. I remember one in particular- it was probably three years ago- and I don't remember the details of the incident at all. I just remember us all getting in a room, and everybody not knowing what to do because they'd never done it before, and leaving that room with a list of action items we could do to improve: first, to prevent the incident from occurring again, and second, to help our response to the next incident be faster.

It was such a great feeling afterwards. I think it was a two-hour postmortem- they rarely take that long for me, but the first time through it took a little longer. I just remember wondering how many weeks we'd have to wait for that first incident that anybody cares about.

[laughter]

It didn't take long.

John: Oh no, that's another week with no problems.

[laughter]

Jonathan: In your professional life- we've talked about this already a little bit, but maybe you have a specific example where this has changed the way you approach something. Your GitHub example is great, but it sounds like- and maybe I'm wrong- it sounds like you were more of a bystander there than an active participant. Have you had the opportunity to apply these learnings in your computing career?

John: I've been, for a while, a person who values correctness, and that's probably what's drawn me to languages like Haskell and Rust that have robust type systems and various kinds of correctness guarantees built in. I think this has sharpened that. The other thing it's done is sharpen my understanding of human factors, and that's actually something that also comes into play in an incident.

A good team that's situated to respond to an incident well will do things like telling people: if you're tired, ask for help, and we value that. If you don't know what to do, ask for help, and we value that. If you have an idea and it's different from what everybody else is saying, please say it, and we'll discuss it, and we value that too. Because sometimes you get the situation where it's, like, Saturday morning at 2:00 AM and the on-call person is responding, and you get a lot of people who have been working on it for a few hours and they're tired and they're exhausted. Human performance is terrible in that situation. If it's a tiny team, if you can tell somebody, "Hey, take a 30-minute break, go get some coffee or lie down or whatever," that can be helpful.

If it's a bigger team, maybe you can say, "Okay, let's hand this off to a few other people and you folks get some rest." We're tempted to have this macho attitude of, we'll just push through and we'll figure it out. As technologists, we like to think through things logically, and we like to say, "Okay, the problem needs the same resolution at 2:00 AM as it would at 10:00 AM, and we're the same people, so logically we should be able to solve it the same way." Whereas we know from a lot of research into human factors that this is not really how humans perform.

While it sharpened my existing tendency to value documentation and comments and communication, it also really made me appreciate companies that understand human factors and value them, and understand that a culture that values people taking a break when they need it is a culture that produces better results. You get fewer mistakes and better performance out of people when you acknowledge that they are human.

Jonathan: That sounds like such a great place to work, and I honestly have worked in places that seem to value the opposite attitude. I've worked in a lot of places, especially young startups, where they tend to value the overtime and the young people who have so much energy. They don't have families, so they don't have to go home at five o'clock, and they're trying to milk as much "performance" out of these people as they can. In my view- and I would expect you agree- it doesn't really produce the results they're aiming for.

John: That's absolutely right. It may briefly get more lines of code written, more stories completed, whatever your metric is, but what's going to happen to the quality? How soon is that going to bite you when that lack of quality comes back around? In some cases, if you're making, like, an iPhone game, maybe you can survive that way, but it will bite you in the end eventually, because it's going to lead to burnout.

It's more expensive than people realize to hire and train and get people up to speed. It's very much a short-term mindset. Trying to run a business that way, beyond just, "Okay, the chips are down this week and we need to rekey everything," is probably going to lead to a lot of problems down the road.

Jonathan: Hopefully the managers, the IT managers, the CTOs listening will recognize what you're saying and will start treating their employees with dignity, giving them the time off they need when they're tired. What if one of our listeners, by chance, is on a team where they're just a coder, or just an operations guy, and their boss is demanding extra hours, or isn't giving them this and is just expecting this extra performance? What can you do in that situation? I don't know what answers you have for this, but how can you put the best face on this situation and try to be responsible, even if you're maybe the only voice on your team that cares about this stuff?

John: That's a really good question, and also a very difficult one, because people can be in different places in terms of their financial situation, their ability to find a different job, and how okay they feel with putting their neck out there based on some of those considerations. What I would say in general is, regardless of what's going on around you, you want to be able to go home at the end of the day- and hopefully you can go home at the end of the day.

Jonathan: [laughs] We hope so.

John: [chuckles] You want to be able to go home at the end of the day and feel that you acted with integrity. As far as the things you had control over- which may be small- you acted with integrity and did them well. That doesn't mean that you're going to be able to convince a boss that is firmly in the other camp to change, but it does mean you can at least usually make a concern known, raise it. Ideally you would have it documented in some way in case there are eventual repercussions, whether from a disciplinary procedure from HR because the boss is saying you're being insubordinate, or some lawsuit or whatever; it's good to have documentation of what's going on.

You can try to identify bugs and fix them. You can try to document and comment and do what you're doing well. Of course, the catch is that a lot of those things take extra time. If your boss is just like, "We've got this sprint and we have got to get X, Y, and Z done this week," you're trading off dinner with the family for writing comments and fixing bugs, and that's a difficult thing. To be very honest, there is not always a great answer to that situation.

The answer for some people might be to decide, this is not the place I want to be working and to start looking for some other job. That is not a failure, that is somebody realizing that they care about some things and that they would like to work at a place where there are others that do also.

Jonathan: Great. Good advice. That situation for some people-

John: We can link this back into aviation also. We had this recent helicopter crash where Kobe Bryant died as well as the pilot. I don't think that we have full information on what happened there. We can speculate that maybe the pilot felt some pressure to take the trip because that was a situation where we knew the weather was not great and that was known before they took off. How do you say no to Kobe Bryant?

Even if Kobe didn't specifically say, "Take this flight or you're fired"- which I don't know if he did or not- just the fact that you are the pilot of a helicopter in California and your passenger is an internationally known star who's very wealthy is probably adding some pressure to take this unwise action. We talk a lot in aviation about pushing back on that pressure.

Now, in the case of aviation, the stakes are higher, because the stakes are literally life and death. We have a lot of stories of pilots who did push back on that pressure and view that as a moment when they acted with integrity and possibly saved some lives, including maybe their own. We also have accident reports from when people did not.

Jonathan: Do you ever hear the opposite? People who did not give in to the pressure and then thought, "Maybe I should have taken the flight anyway"? That's-

[crosstalk]

John: Sure, yes. [chuckles] Pilots have some dark humor sometimes.

Jonathan: [laughs]

John: We have a saying for this: "Better to be on the ground wishing you were in the sky than in the sky wishing you were on the ground." In fact, this happened to my family. We were going to fly to Indiana one time, the weather forecast looked kind of dodgy, and I said, "You know what? Let's not do it. We'll drive, 11 or 12 hours versus two or three." We drove almost the whole way there, and I was looking up at the sky, and it's like we could've gone, because things cleared up earlier than forecast. It was sort of borderline, but borderline is not okay to me.

I felt good about the decision, and I didn't get any pressure. My wife is also like, "You'll never get any pressure from me to do anything like that. Always err on the side of safety." So yes, that's common. [chuckles]

Jonathan: I can imagine. I'm sure that's common in IT too. I could probably think of examples from my own life. You took the extra time to write the tests or to double-check things, and then nothing went wrong. Maybe I could have just pushed it out.

John: Let's talk about Y2K. This is the perfect example of that, right? There was a big concern that there were going to be a lot of problems due to Y2K, because people were using two-digit year fields and wrap-around arithmetic and all sorts of things. It was picked up by the media, some of which had a better understanding of what was going on than others. After Y2K happened, there really wasn't a major problem. People were like, "Oh. Well, that was so overblown, wasn't it?" We were all thinking the world was going to end, and then everything was fine, pretty much.
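For readers who haven't run into it, here is a tiny illustration of the two-digit-year arithmetic John mentions, along with the "date windowing" style of fix that was commonly applied; the pivot year is an arbitrary choice for the example.

```python
# Illustration of the two-digit-year arithmetic behind the Y2K problem.
# A record stores the year as "99" or "00"; naive subtraction breaks at rollover.

def years_elapsed_naive(start_yy: str, end_yy: str) -> int:
    """What a lot of pre-Y2K code effectively did with two-digit years."""
    return int(end_yy) - int(start_yy)

def years_elapsed_windowed(start_yy: str, end_yy: str, pivot: int = 70) -> int:
    """A common remediation, 'date windowing': interpret 00-69 as 2000-2069
    and 70-99 as 1970-1999. The pivot of 70 is an arbitrary choice here."""
    def expand(yy: str) -> int:
        y = int(yy)
        return 2000 + y if y < pivot else 1900 + y
    return expand(end_yy) - expand(start_yy)

print(years_elapsed_naive("99", "00"))     # -99: elapsed time goes negative at the century rollover
print(years_elapsed_windowed("99", "00"))  # 1: correct once years are expanded to four digits
```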

Jonathan: There were reports that power would go out and flights would be cancelled. Disaster, right?

John: Yes. The reason everything was fine is because we took it seriously and we fixed all those things. It's not that the problem was ignored; it's that we fixed it. We actually had the same problem with COVID. Here in the United States, in the early days of the pandemic about a year ago, Dr. Fauci was pressed to give an estimate of how many people would die.

He says, "Well, if we do nothing at all, in the United States, maybe two or three million." Then we did do things, we did have restrictions, we started wearing masks, and then people was like, "See? It's all overblown. He said two or three million people would die." I was like, "Well, we took it seriously and we did some things." That is so tough in technology, especially if you're reporting to somebody that's like a CFO or not a technologist, to explain that--

Oh, this even happened to me on a small scale, years ago. We had network equipment that was just in a closet at a company I worked for. I'm like, "Let's put this in a cabinet at least, can we? One that we can lock, so that the secretaries aren't going to bump it and turn things off?" The CFO was like, "Oh, that's like several thousand dollars. That will never happen." A few months later, it happened. The CFO was like, "Why do we have this? Why are we so vulnerable to secretaries taking down our network?" I'm like, "I'll get you a quote to fix that in the morning."

[laughter]

Jonathan: Is there anything else that you think we should hear about? Any last words of wisdom?

John: Yes. One other thing to mention- this probably isn't so much for small teams, but for a little bit larger teams: can you make a way to let people make an anonymous report of something that they're concerned about? Because, like I alluded to, sometimes people may not really buy into the blameless thing, because they've had many instances in their career that tell them that they shouldn't.

We have this in aviation; it's administered by NASA, so it's separate from the FAA. It's called ASRS, which stands for the Aviation Safety Reporting System. Every month, there are about four or five thousand anonymous reports that come into NASA. They have a monthly publication called CALLBACK, which is really fascinating reading, where they talk about some of the reports that are coming in.

Jonathan: If they're anonymous, we don't know who they're from, but I'm assuming they're from people who work in aviation, generally?

John: Pilots, mechanics, other members of the flight crew. Air traffic controllers can too, as well as dispatchers. Basically, people who have some sort of license to be in aviation.

Jonathan: I see.

John: Some companies will have this. The company that I work for has this. Not all. It's just a matter of having a wide-open door as much as you possibly can, and recognizing that some people are fine with being loud about something and other people, just because of who they are or because of their experiences, are not. Anything that you can do to make it easy for somebody to speak up when they have a concern and to take it seriously is a great thing to do.

Jonathan: Good advice. If you're on that team where your boss won't listen to you, at least you can listen to somebody when they talk about-

[laughter]

[unintelligible 00:44:33] All right. John, what are some resources that our listeners can refer to? We'll put links in the show notes, but just briefly, what are some things-- If we want to learn more about, watch some of these videos we're talking about, read some of these reports, where can we learn more?

John: I'll send you links to the AOPA Accident Case Studies video series. There are some great ones. I mentioned the "Faulty Assumptions" one. There's another one called "Traffic Pattern Tragedy," where they talk about task saturation, where you're so busy you miss something blindingly obvious, as well as "Assertiveness," which we've touched on here a little bit.

Setting priorities when you deal with an incident- that's a really good one that people can draw a lot of parallels from. The FAA has a publication called the Pilot's Handbook of Aeronautical Knowledge, and they have a chapter in there called "Aeronautical Decision Making." It talks about these human factors that we've discussed in some more detail. That would probably be instructive for people.

Also, the ASRS CALLBACK newsletter that I just mentioned. It's a monthly publication, it's pretty short, and it's pretty easy reading. That's something else I would recommend to people; just poke around in there. It's fascinating reading for me. It's everything from helicopters to 747s. It's really interesting.

Jonathan: Finally, if people are interested in connecting with you in some way, how can we get a hold of you?

John: I am on social media. I love to push open and open-source social media. The best place is @jgoerzen@floss.social on Mastodon. I am also @jgoerzen on Twitter. I have a blog that is updated irregularly, changelog.complete.org.

[music]

Jonathan: This episode is copyright 2021 by Jonathan Hall. All rights reserved. Find me online at jhall.io.

Theme music is performed by Riley [unintelligible 00:46:47] .

[00:46:50] [END OF AUDIO]