Guest Patric Conant discusses the dangers of ignoring low-level infrastructure during a DevOps transformation, and some simple steps to avoid this common blind spot.
Solving big problems with small teams
Jonathan Hall: Ladies and gentlemen, the Tiny DevOps guy.
[music]
Welcome to episode number one of the Tiny DevOps podcast. I'm Jonathan Hall, your host. Today I have my good friend, Patric Conant, who is going to be talking to us about infrastructure. Patric, why don't you start by introducing yourself a little bit? Tell us what you do professionally, what you do for fun, maybe, and why you know anything about infrastructure.
Patric Conant: Sure. All right. I'm Patric Conant. I am an independent contractor, and most of my work comes from one of the largest IT providers in the world. My day-to-day varies from standing up a very large, very expensive delivery for the company I work for and making it look professional, to holding a customer's hand, sometimes weeks or months into their operation, and showing them how to get the best value out of it. We're the converged solutions group.
We pack up storage, compute (compute being virtualization in most instances, but occasionally you've got application servers), and fabric into a single package that is usually co-located in my customer's environment. These are large organizations that have their own IT staff and management, and our package comes all together. They don't manage it the way they manage the rest of their infrastructure; it's done to our standards and practices, and most upgrades are professional services engagements. My role is very infrastructure-focused, and the teams I work with in these companies range from their architects to their engineers to their operations staff. I see lots of misses in the silos.
Jonathan: Maybe you can give me an example of a mistake you've seen.
Patric: Big organizations probably have, in the last two years, a much bigger and more pronounced cybersecurity organization than they did before. Cybersecurity was probably distributed among operations, engineering, and architecture, and now there's a separate group doing it, maybe two separate groups, a red team and a blue team. This can just cause chaos, as opposed to how you imagine cybersecurity being implemented: okay, we have somebody who goes through and does our internal audits.
We often have policies and procedures that we just can't test until they're in production: who can change which passwords, and whether we make a new account or share a password for these bottom-level infrastructure roles. The bare metal questions become a very tough challenge that really requires a lot of collaboration and, organization-wide, at least a set of goals and a plan.
Jonathan: Okay, to make it really concrete: we're talking about low-level infrastructure stuff, like the passwords to a router or something like that?
Patric: Yes, so my product is bare metal switches that are not managed in the organization's switch infrastructure. We're a separate set of switches, we're an engineered solution. Our upgrades happen in lockstep with our hypervisors, our application servers, our storage, and our switches, but IPMI is probably the--
Jonathan: Define that for those who don't know IPMI.
Patric: --elephant in the room. Baseboard management: a standard from the '90s, when we thought we knew how we would want to manage things, that really hasn't stepped forward much. There are places in the spec that say that you must allow logins with no password in your implementation. Your organization can say, "We don't do that," but the IPMI spec from the '90s says that your implementation has to let it happen. Go ahead.
Jonathan: All right. Let's go back a little bit here and get a big picture of what we're talking about. You see that these companies that you work with have started doing these DevOps transformations or whatever, and these details have just gotten lost. Is that a fair way to look at it?
Patric: Well, yes. The expectation in the DevOps transformation was that the lowest level would be abstracted away. This is tough technical work with the biggest implications. Obviously, somebody who compromises bare metal has done as much as they can do; the only thing that stands between them and everything in your universe is encryption. Very often, we have the bare metal do the decryption, so they may not have the keys, but they can just ask the function to do the work. The pipe dream, the ideal, the hope, is that we're going to automate as much as we can.
I think the flip side of that, the boogeyman under the covers, is the idea that we won't have to worry about that dirty bare metal: slow to implement, hard to troubleshoot, is it a software or a hardware problem. All these things at the bottom of all our stacks, they're going to go away. We're going to push infrastructure as code, and the code isn't going to be dependent on hardware. Fundamentally, it's nice that it can move to new hardware, but that doesn't mean that we don't have low-level practices that we've really got to-- They've got to be organized with the same care and emphasis that our highest-level abstractions are.
Jonathan: I think I see a blind spot frequently in DevOps circles that might be similar to what we're talking about, in a different application. That is that we like to focus on the things that are easy to automate, and for the things that are not easy to automate, we tend to just say, "Yes, we will deal with that." A simple, obvious example that everybody thinks of is UI testing. UI testing is automatable these days to a degree, but it's messy and it's confusing. So we just write the unit tests and leave the UI tests alone. It'll take care of itself. It sounds like maybe you're saying we have the same problem when it comes to this low-level infrastructure?
Patric: Well, absolutely. I think that's the downside to every wave of innovation that comes through. The problems that it solves get solved, and we're like, "Yay," but on the flip side, all the things that it couldn't address are still here. Everybody in technology likes to think that we're not still solving the problems from assembly and COBOL, but some days we literally are.
Jonathan: Once again, I'm going to paint this picture. We have a team that has bought in, they drank the DevOps Kool-Aid, and then they threw the infrastructure baby out with the bathwater, and their servers, their disk arrays, and their switches are rotting with insecure passwords and who knows what other kind of crap. Have you seen a company that didn't do this, that has done it properly? Have you seen an example where this has been done well?
Patric: Unfortunately, nobody calls me for the success story. I'm sure--
Jonathan: If you're listening and you've done this right, please call. We want to hear about it.
Patric: Right, exactly. Unfortunately, I just have the malaria outbreak viewpoint of the world: when I come in to work, we have something to fix.
Jonathan: People don't call me when things are working well, either. For me it's people with organizational problems, but it's the same thing. What can we do? Is it hopeless, or do we have tools we can use to address this problem?
Patric: I think it's a human being problem. It's a social problem. We just have to remember that there are humans on the other side of all these things. We get very excited about automation, and the great thing about automation is you can abuse it and it doesn't complain, but somebody racked and stacked all your equipment, somebody laid out the bottom of your network plan. You may have a completely agile fabric where you can push a button and the whole network can move across equipment, data centers, cloud providers, but somewhere somebody is doing that work, and if we don't forget that, then maybe we can pick up the phone and ask them.
Don't get me wrong, if you're in Amazon, you're not going to call and talk to the guy who racks and stacks your equipment, but all those considerations, all that low-level minutiae, power consumption, weight: somebody very technical and savvy needs to be working on these problems. It can't just have the outsourcing hand wave applied to it and then be like, "This is terrible that none of our stuff works. Those guys are awful." Well, the devil is in the details.
Jonathan: Let's say I come to you. I'm, say, a middle manager at a 50-person company, and we're doing a DevOps transformation. We're planning it for the next, say, year, and I ask you, "Patric, what can I do to make sure that in one year, two years, five years, this isn't biting me in the butt?" Do I need to hire you to come into my company and help me out? What can I do?
Patric: It's just a matter of knowing whose responsibility it is, as far as I know. When we have good people working on tough problems, we get the best results we can hope for, so it comes down to knowing who that person is.
Jonathan: Assign ownership of the server rack and the switches, and make sure somebody has the job and a list somewhere of "do these things."
Patric: Exactly, because they're not impossible tasks, they're just unpleasant. You have to go tell the DevOps team, "We're going to use half our equipment while we work on the other half," and somebody says, "Whoa, that can never happen." We're in an interesting space now; "never" doesn't seem like the right roadmap.
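The "assign an owner and keep a list of things to do" idea from this exchange could be captured in something as small as the sketch below. It is only an illustration: the `Asset` record, the `needs_attention` function, and every asset name, owner, and task in `INVENTORY` are hypothetical, not anything the speakers describe using.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Asset:
    """One piece of low-level infrastructure (hypothetical record layout)."""
    name: str
    owner: Optional[str] = None  # the person responsible for this asset
    recurring_tasks: List[str] = field(default_factory=list)  # the "do these things" list


def needs_attention(assets: List[Asset]) -> List[str]:
    """Return a message for every asset with no owner or no task list."""
    problems = []
    for asset in assets:
        if not asset.owner:
            problems.append(f"{asset.name}: no owner assigned")
        if not asset.recurring_tasks:
            problems.append(f"{asset.name}: no maintenance tasks listed")
    return problems


# Made-up example inventory; names and tasks are placeholders.
INVENTORY = [
    Asset("rack-01-switch-a", owner="alice",
          recurring_tasks=["firmware review", "password rotation"]),
    Asset("rack-01-bmc", owner=None, recurring_tasks=["password rotation"]),
    Asset("disk-array-02", owner="bob"),
]

for message in needs_attention(INVENTORY):
    print(message)
```

The value is less in the code than in the habit: a gap in this list shows up on an ordinary day rather than in the middle of an outage.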
[laughter]
Jonathan: It also sounds like there's some parallels here to another topic that gets a lot of attention. That's just technical debt as it applies to code, and this idea that you write code and it's not done until it's deleted. It sounds like you're saying the same thing that your server rack isn't done until it's unplugged in front of the dumpster and you need to maintain it. You can't just plug it in and walk away.
Patric: Right, and again, it's just like a good systems person and a good infrastructure person, even a good network person in the days of software-defined networking, where we think of the network as more of a configuration than a body. Those people are worth their weight in gold when it's impacting your organization. It's very funny: if you walk in on a happy production Wednesday and say, "These six things really need to happen in your organization," they'll be like, eyebrows raised, "The last time we tried that, it was bad." But if I walk in in the middle of an outage and say, "I need to fix these six things," somebody will be walking up to me with whatever I need so that we can move on. It's the same six things, I'm not--
Jonathan: I get it. [laughs] I think you're right. Before the conversation today we talked briefly about this, and you expressed the concern that DevOps ignores this whole area of infrastructure, and I think you're right. That's not to say that there aren't any teams that do it. I'm sure there are some that do a good job. When I think about DevOps, I'm not thinking about the firmware on my switches or their passwords and stuff like that, but that obviously needs to happen. Now, of course, there are teams that more or less outsource all that. They're using Azure or AWS or something, and they literally don't manage that, but there are many teams that do, and there are many teams that have a hybrid approach of some sort, and you just need to pay attention to these things, right?
Patric: Right. Does that fundamentally work? Can you basically have a contract with a cloud provider that lets them do everything except your Python code?
Jonathan: Well, to an extent, and of course there are drawbacks.
Patric: Certainly.
Jonathan: Vendor lock-in would be an obvious one, [laughs] but it depends on what you're building, of course. Right? I definitely work with companies that have completely outsourced that stuff, and they're just making API calls to Twilio or to S3 or whatever it is, and the app, as far as it's concerned, is essentially a Docker container that could literally move anywhere. In that sense, yes, there are some companies that do that, but I think there are many-- Just a few years ago I was working for a manufacturing company here in the Netherlands where we did use AWS, but we also had our own infrastructure in the building. It was a combination of those two, and if either one of those went down, we would be in trouble. I see that all the time.
Patric: Were they fundamentally redundant? Did you have more capacity in either location than you needed to run your operation?
Jonathan: Not per se. It was that they were doing different things. Our ERP system was in-house and our web front end was on AWS, that sort of thing. They were different functions in different places.
Patric: Sure. A lot of the hybrid cloud implementations I run into are like, "This is the first time we've ever considered that we might have redundancy at the application level, and this is amazing." I was like, "I don't think the cloud is what solves your problem. I think the fact that you never thought about application-level redundancy was what you missed." And elasticity in the cloud is really huge for the feast-or-famine business models.
Jonathan: Well, yes, it's helpful for certain, like you said, business models. It's great for startups where the cost is so low that you might as well pay 15 times the price for somebody's small sliver of a server somewhere else, rather than for a fifth or a smaller portion of a whole server in your rack. Then, like you said, the ability to scale up or down elastically is beneficial for certain business models.
Patric: The next place that I really intersect with DevOps: we've known each other a long time. I have deep experience in system administration, the ops side of DevOps, and I've had four or five DevOps interviews, and they all went like train wrecks. Fundamentally, I was talking to a developer and we just were not talking the same language. We worked together before, and fundamentally what you were looking for was a dev-ops role. We were supporting a service broadly. Because it gets talked about fairly often, I always imagined I would be a shoo-in for some of these lower-level DevOps roles, and then I could spend some time learning whatever the programming language and practices were. We walk in, a technical interview ensues, and we don't find any common ground.
What the person interviewing me thinks is fundamental, I'm slow or rusty on. The things that I could give quick, meaningful answers on, they don't consider a part of their day-to-day concerns. Like I said, I had more of an infrastructure role, so that's understandable, but it's also frustrating to me because some of these roles I considered to be well below my pay grade or skill set, even though I've probably written fewer than 1,000 lines of shell that ended up in a file somewhere and not on a command line.
Jonathan: Yes, I think you're touching on a really valid point. When I see DevOps or site reliability engineering roles, they're usually looking for either developers who know some amount of infrastructure, or infrastructure people who want to become developers or something. They're really looking for a hybrid skill set there. I guess there are two sides to that coin. On the one hand, it makes sense in a way that we want infrastructure people who can, A, interact as developers with the developers and understand their needs and their concerns, because they're supposed to be serving the developer.
Patric: Sure.
Jonathan: And, B, can maybe implement automation. But the flip side of that, and maybe it's the whole topic of our conversation here, is that we're throwing out half of the ecosystem with that. We're discarding the people like you who have excellent operational and infrastructure skills but maybe don't have the dev skills, or they're not as sharp. As a result, we're not hiring people who have your skills. It's not just that we're not hiring you, it's that we don't have your skills on our team now.
Patric: Right. In an interview it came up: "How much production code do you think you have out there?" I was like, "Very little, but there might be some really large organizations still depending on some one-liners that I've written." That's not load-bearing code. That's what grabs something from one side and puts it where the other side can have it. The first time somebody called me five years later about something I composed in 35 minutes with about 200 iterations at a command line, I was like, "I have no earthly idea what this says. What was it we were trying to do?" Generally it's some Bash and a lot of basic UNIX utilities: grep, [unintelligible 00:23:11] sed.
Jonathan: I don't know what else to say. I think we're in agreement on this. I was looking for a fight today, I don't think it's happening.
Patric: We can but I think we largely agree on the process. That's how we came to work together. You were like, I could spend more of my time developing if I hired a sysadmin instead of another developer.
Jonathan: Right.
Patric: Then the sysadmin--
Jonathan: For the listeners who are wondering. Patric and I did work together briefly. What was it, eight years ago? I don't remember. It's been a while.
Patric: It was a long time ago, yes.
Jonathan: But we did work together for a while at a previous company in a different lifetime.
Patric: Yes, and I'd never heard the term site reliability engineer when we started that, but that was very much how I pictured my role at the time, even though I didn't have a name to put to it. Back to the human problem: I often get asked what I know about Docker, Kubernetes, containers. If you tell somebody it's a process with its own IP stack, and that's fundamentally what it is and how it's managed, they're not impressed. As a sysadmin, that's what it is to me. That's how I figure out where the logs are and how to dissect the problems with it. In a good container there's not much to look at. That doesn't start great conversations.
Jonathan: Completely different set of assumptions going into that question, right?
Patric: Right.
Jonathan: Whether you look at it from the application side or from the system side, very different view on the same thing.
Patric: Yes. "What's your background in containers?" Well, the first time I chrooted into a system that I couldn't get to start was probably, I don't know, 20 years ago. Well, to the person I'm talking to, that has nothing to do with Docker infrastructure. I'm like, when we work on something broken, we're working on some pretty small pieces, and we don't work on stuff that doesn't break. That "Yes, it works on my laptop, and I could push it into production exactly like that," that's a stupid cool feature, but as long as it works you don't need somebody to come fix it.
Jonathan: Right.
Patric: Tell me about your thoughts on cattle versus pets.
Jonathan: My thoughts?
Patric: Yes.
Jonathan: That's a good one. Well, for one thing cattle are very smart animals and we should treat them like pets too. We shouldn't just shoot a sick cow. Have you ever met a rancher who would shoot a cow who was sick? So it's a bad analogy.
Patric: Well, he'd certainly isolate the cow real fast.
Jonathan: But that's different, right? That's the same thing: if you have six cats and one is sick, say it has worms, you're going to put that one in a different room too. That's actually treating it like a pet. But that aside, what I like about the so-called cattle analogy is the idea of stateless services, because it forces a sense of simplicity onto your application. I say a sense because it's actually harder to think about, but when you accomplish it, there's a certain sense of simplicity.
Patric: When we say stateless services, we mean that the state of a service shouldn't change the outside world?
Jonathan: Essentially yes.
Patric: Like at any time anything--
Jonathan: [unintelligible 00:27:07] within the application. I saw a tweet recently that the only stateless service is one that only returns true, or something like that. [unintelligible 00:27:12] Not in that sense, obviously, but yes, it doesn't depend on global state. It does depend on the database and things like that.
Patric: Right. When I think of stateless, I think of UDP NFS, and that's usually not what people are talking about when they say stateless, and I'm like--
Jonathan: Stateless in a sense [unintelligible 00:27:30] request.
Patric: There you go. Which is all encapsulated.
Jonathan: Yes. That can be set on either end, but the HTTP protocol has no concept of that.
Patric: You can reboot a box and come back and ask for your session and cause undefined behavior because, yes.
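The distinction the two are circling, a service that keeps session state in its own memory versus one that depends only on the request and an external store, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `StatefulCounter`, `ExternalStore`, and `stateless_handle` are hypothetical names, and the dictionary inside `ExternalStore` merely stands in for a real database.

```python
class StatefulCounter:
    """Keeps per-session state in process memory; a restart loses it."""

    def __init__(self) -> None:
        self.hits = {}

    def handle(self, session_id: str) -> int:
        self.hits[session_id] = self.hits.get(session_id, 0) + 1
        return self.hits[session_id]


class ExternalStore:
    """Stand-in for a database that outlives any one service process."""

    def __init__(self) -> None:
        self.data = {}


def stateless_handle(store: ExternalStore, session_id: str) -> int:
    """Everything needed arrives in the arguments; nothing is kept in the process."""
    count = store.data.get(session_id, 0) + 1
    store.data[session_id] = count
    return count


# Simulate a "reboot" of the stateful service by building a new instance.
service = StatefulCounter()
service.handle("abc")
service.handle("abc")
service = StatefulCounter()          # the box restarted
print(service.handle("abc"))         # the count starts over

# The stateless handler picks up where it left off, because the state
# lives in the store rather than in the process.
store = ExternalStore()
stateless_handle(store, "abc")
stateless_handle(store, "abc")
print(stateless_handle(store, "abc"))
```

In this sense "stateless" does not mean the system has no state. It means any replica, or a freshly restarted one, can serve the next request.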
Jonathan: Yes. I don't know. Do you have any resources for listeners who are interested in exploring this topic further? Whether it be books, speakers, authors.
Patric: Sure.
Jonathan: People who want to improve their infrastructure. Maybe they're concerned about their router in the corner without a password.
Patric: Insecure.org and the Nmap book. Let's have everybody know what the body of an audit looks like. You don't have to be good at it, but I feel like the biggest battle in cybersecurity is that people implementing things have never thought about an attacker's point of view or [unintelligible 00:28:38] point of view. You can learn how to do an Nmap scan in an afternoon. Taking a look at your own infrastructure from that point of view is just a way of looking at things that somebody else has to do, so you should probably be familiar with it when they come and ask you questions about it.
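To give a feel for what looking at your own infrastructure from the outside means, here is a minimal sketch of the simplest kind of check a scanner performs: try to open a TCP connection to each port and see which ones answer. It is not Nmap and does far less than Nmap does. The function name `open_tcp_ports` and the default timeout are assumptions made for this illustration, and it should only be pointed at machines you are responsible for.

```python
import socket
from typing import Iterable, List


def open_tcp_ports(host: str, ports: Iterable[int], timeout: float = 0.5) -> List[int]:
    """Return the ports on `host` that accept a TCP connection.

    A plain connect check: it reports a port as open if the three-way
    handshake completes within `timeout` seconds, and skips it otherwise.
    """
    found = []
    for port in ports:
        try:
            # create_connection raises OSError (refused, unreachable, timed out)
            # when nothing accepts the connection.
            with socket.create_connection((host, port), timeout=timeout):
                found.append(port)
        except OSError:
            pass
    return found
```

Called as `open_tcp_ports("127.0.0.1", [22, 80, 443])`, it returns whichever of those ports are listening on the local machine. A surprise in that list, such as a management interface nobody remembers enabling, is exactly the kind of finding an audit would raise.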
To speak more to the sysadmin-developer rift, I think everybody should probably learn AWK. The book on AWK from the AWK authors is probably 100 pages, still available on Amazon. I think there are some great benefits in learning AWK, in that line-by-line thinking, incredibly small. I challenge you to write an AWK script that can take up six megabytes of memory. Insecure.org and the book on Nmap. Spend a couple of afternoons learning your way around Nmap.
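The line-by-line, small-memory style of processing that Patric recommends here (the AWK model) can be approximated in other languages too. The sketch below sums one whitespace-separated column while holding only the current line in memory, roughly in the spirit of an AWK one-liner such as `awk '{ s += $3 } END { print s }'`. The function name `sum_field` and the sample data are made up for this illustration, and unlike AWK it simply skips fields that are not numeric.

```python
from typing import Iterable


def sum_field(lines: Iterable[str], index: int) -> float:
    """Sum the zero-based `index`-th whitespace-separated field of each line.

    Lines are consumed one at a time, so memory use does not grow with
    the size of the input. Short lines and non-numeric fields are skipped.
    """
    total = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) <= index:
            continue
        try:
            total += float(fields[index])
        except ValueError:
            continue
    return total


# Made-up sample input; a file object or sys.stdin works the same way,
# because both yield one line at a time.
sample = [
    "web01 GET 120",
    "web02 GET 80",
    "malformed line",
    "",
]
print(sum_field(sample, 2))
```

Passing an open file object instead of a list keeps the same property for inputs far larger than memory.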
Jonathan: That's great. We'll have all those resources in the show notes for anybody who's interested in following up. I have two more questions, actually. First, is it time to go from DevOps to DevSecOps to DevSecOpsInfra?
Patric: If there aren't conversations in your organization like that, then I hope you're outsourcing to a great cloud provider or something. I think hardware is dirty and hard and slow. Oh, my God. You can spin up a Docker cluster and, in an hour, go from "What's Docker?" to "I'm doing something." You can spend five hours getting the six machines you decide to put in your lab to POST. There needs to at least be a round table meeting of those groups, but yes, I think, honestly, the most successful DevOps implementations are really InfraDevSecOps. We didn't leave out any important pieces. I think that's a great term, and when people start using it, I'll credit you.
Jonathan: Well, my last question unless you have anything else you want to add, how can people get in touch with you if they're interested in following your tweets or social media or anything? How can we get a hold of you?
Patric: Let's see. How about at Mirage computing on Twitter. I expect to have a YouTube space, and maybe a Twitch stream in the next six months. I'm still kicking around. When I do, I hope to come back and talk about it.
Jonathan: Well, we'll watch your Twitter feed for announcements of a YouTube channel and a Twitch stream. Meanwhile, we'll just click the heart button on all your tweets.
Patric: Thanks. I appreciate it.
Jonathan: Thanks so much, Patric, for joining me today. It's been a great conversation. I hope that it's been beneficial to those of you listening.
Patric: Thanks.
Jonathan: Thank you. This episode is copyright 2021 by Jonathan Hall. All rights reserved by J. Hall. Find me online at jhall.io. Theme music is performed by Writing Day.
[00:32:35] [END OF AUDIO]