Kube Cuddle

In this episode Rich speaks with David Flanagan (AKA Rawkode) from Equinix Metal. Topics include: How David got started writing code, leading development at a media company, the day Lemmy from Motörhead died, using Mesos and Marathon, how he got the idea for Klustered, some tools he’s learned about through his streams, and what he’s learned about troubleshooting Kubernetes.

Show Notes

Thanks for all of the support that the podcast is getting on Patreon. If you’d like to help keep the podcast sustainable for only $2 a month, you can get more info here. Listening is great support too.

David's Twitter

Links:

Walid’s guides on GitHub for the CKA and CKS certifications

k9s

Nova and Stromberg on Klustered

Listener questions from @IworkWithHomer and @ebcarty on Twitter. Thank you!

Logo by the amazing Emily Griffin.

Music by Monplaisir.

Thanks for listening.

★ Support this podcast on Patreon ★

What is Kube Cuddle?

A podcast about Kubernetes, and the people who build and use it.

Rich: Hello, and welcome to Kube Cuddle, a podcast about Kubernetes and the people who build and use it. I'm your host Rich Burroughs. Today I'm speaking with David Flanagan, better known as Rawkode. David is a Senior Developer Advocate at Equinix Metal. Welcome.

David: Hi, there. It's a pleasure to be here.

Rich: It's so nice to have you here. I'm just going to be transparent. We had a failed first attempt at recording the intro. So here we are. We're going to be repeating a bunch of stuff that we just already said, but I wanted to thank you. You're one of the patrons who support the podcast and I really appreciate you signing up to do that.

David: No, not anymore I've canceled it.

Rich: Oh no, you did that while you were offline.

David: When you made me restart. I was like, ah, I'm not doing this anymore.

Rich: Oh, no.

David: No, like it's the least I could do. Cause I really do appreciate how much effort you put into this. I know it's not easy. It's difficult and challenging. And you do this in your spare time. Thank you. It's like I said, it's the least I can do.

Rich: Thanks so much. I really appreciate getting to know you too. I'm such a big fan of your content, and yeah, it's exciting when you're into someone's work and they support you as well. It's a good feeling. So, um, for those of you who are listening, I will put a link in the show notes to the Patreon in case you would like to check that out.

Okay. So, um, can you tell me a little bit about how you got started with computing and ended up in this Kubernetes community?

David: Okay. So, uh, I was terrible at computers all through school. I tried doing computing in high school. I tried doing computing in college. I tried doing computing at university, and dropped out of all of them. It just doesn't really work for me, and I've never really understood why. I think my learning methods are different from maybe the education system.

I'm very much a hacker mentality, whereas I really just need to go and kick the tires on something and break it apart. And that's not the way the education system works. And I spent too much time on a computer in my spare time too. You know, I wasn't, I wasn't, I'm very introverted. I still am. I was gonna say I'm an introverted child, but I'm an introverted adult too.

Uh, I never really went out to play football or sports, or went to high school dances or anything like that. I was much more comfortable just behind a computer screen. So the first way that I got into computers was my parents got me a computer, and there were these Telnet talkers, and you could go onto them and type any name that you wanted. For, like, the first four years of my life online I was known as Giovanni. Um, I liked that, didn't like playing it, but enjoyed watching it in my younger days.
So I picked Giovanni and I could just be that person. And I could talk to people easily, which is not something I can do in real life even today. Even doing this is slightly intimidating, but your calm presence will keep me going. Okay. But yeah, it's just been challenging. Online was easier for me.

And it was through these Telnet talker systems, and curiosity about them, that I was like, I want to change this. I want to be able to have my name in color, or I want this feature from the thing. And open source was a thing in the late nineties still; we had Freshmeat and SourceForge and all these other things.

So there was a system called You Too and the code is still online. That makes me laugh today, looking at it. But I downloaded the code and started learning C and hacking on it until I made this Telnet talker do the thing I wanted it to do, and submitted a patch, and it got accepted and I squealed like a child. I was so excited, and that was the rabbit hole. I just started experimenting with more code and more programming and looking at other languages and decided I wanted a career doing this.

And I didn't have any formal education. So I was just sending off CVs to any company in Scotland that would have me. And I got really lucky in that web development in the late 90s, early 2000s was a hard skill to find people for because the web was so new. And I got a company that, that took a chance on me and I worked there for five years.

And again, a really lucky experience in that I joined a small development team. They didn't have operations, they didn't have support engineers. They didn't have customer engineers. They were just four people trying to build a product that was used by tens of thousands of companies in the UK. So I was responsible for all the code that I wrote, and for getting it into production.

And that led me into the DevOps path, where I started playing with tools like Puppet for remote configuration management. We had servers up and down the country at racetracks, and when something went wrong, I had to jump in the car and drive to the racetrack and try and fix it on site, which was not fun at 2:00 AM in the morning, driving down to England.

That's when I was like, okay, we have a problem, which is that we need to be able to manage this machine remotely. So we want to put SSH on it, but we have to manage that configuration. There's nothing worse than phoning someone at a racetrack and going, can you SSH or log on to this machine or this terminal and install this package, so that I can get onto it? No, of course not. That just led me down the DevOps path, playing with Puppet and then getting frustrated with that, playing with Chef, getting frustrated with that. Many years later, we're lucky to have Ansible and SaltStack and all these other tools, but it just, yeah.

Curiosity was the primary driver and I did well in my career. And ended up leading development at a media company. Our product was Metal Hammer Magazine, Classic Rock Magazine. We had all these massive online websites and we had radio stations. And one of the biggest challenges we had in the organization was how do we scale our website when big news breaks.

And we had a doomsday scenario. The doomsday scenario was, what if Lemmy were to die? And that happened. We had to live through that. We had to scale the whole website. And a very sad day. I was a Motörhead fan, so it hit doubly hard. But we were scaling VMs on AWS. We had done the whole cloud migration thing.

We were, that was supposed to be the dream and AWS has infinite scale. But what they don't tell you is there's still so much more you have to do on top of that to get to that scale.

Rich: Oh, wow. Yeah. The resource limits, eh?

David: It's just all the stuff. You know, because we were using SaltStack at the time, the VM would come online and then SaltStack would bootstrap through user data, but then it would need to speak to the SaltStack system, pull down the state, run the state, get the container images pulled, run the container images. It took five or six minutes, and that was per host. And then you've got the auto-scaling groups, which only go up X amount at a time, based on what you need. Then the kicker: do you know the Elastic Load Balancers on AWS have a hard limit on the scale, and you have to call them to get it removed?

Rich: Yeah. Yeah. That's what I meant

David: All right.

Rich: There's a bunch of those hidden resource limits that you don't know about as an end user necessarily, and often people don't find out about them until they're in that sort of situation where they really need the scale and they suddenly can't. I'm under, I'm under the impression that AWS has gotten a lot better at handling those things, that you can message them now, and they'll resolve it pretty quickly. But it's tough when you've got something to get done and you can't, because you've run into this sort of arbitrary limit that's there.

David: Definitely. This was what, 2013, 2014, around that time. Anyway, our auto scaling group was doing its thing. It was slow, but we were getting nodes. I think we had 30 or 40 different VMs online. The CPU usage was at 20%. I'm like, why is the traffic not working?

And that was the ELB. And I've never trusted Amazon since. But we were challenged there because we wanted to remove that SaltStack step, that bootstrapping. We wanted to speed it all up. And containers were the technology that was just breaking through then, changing that dynamic to where, okay, I just need to install Docker on a host, pull the image and run it. I don't need to do anything else. So we started adopting containers really early, Docker 0.4, 0.5, whatever it was. Really tough back then. It didn't even have a Dockerfile in the first version, so we were still building tarballs, but...

Rich: Wow, that, you were using Docker that early.

David: Well, the Dockerfile didn't come till a bit later, actually. It wasn't one of the first things they shipped. But it definitely made a huge improvement. I still remember my first Dockerfile. It was great fun. Then I was just like, oh, I could do this thing. So that was great. And then that was just down the rabbit hole more, wasn't it? Like, you've got these container things and now I want to be able to orchestrate them, and I don't want to use, like, OpsWorks was a big tool back then.

And AWS, I think started pushing ECS and Kubernetes was this little contender coming out and Mesos and Marathon. Actually I spent too much time playing with Mesos and Marathon, I got completely enamored with it. I don't even remember why I went back to Kubernetes at that point, but something just drew me to it.

And I've been working with Kubernetes now for about five years and it's been fun to see it, at least is definitely, it's such a fast moving project and it really does make people's lives easier these days, I think.

Rich: It's really interesting the Mesos thing, because I think about that time period a lot. And there was a time where you legit could look at Mesos and Kubernetes and be like, Mesos hands down is the way to go. And that just changed so much.

David: Yeah. Sometimes I think we made the wrong decision. I think it was Betamax all over again. Mesos just had so much going for it, with Marathon running on top of it. And I can't remember what the job scheduler was, but there were all these different components that just worked really well, with a nice UI. And then there was Kubernetes, where I had to YAML or JSON all the things, and yeah, I still don't know how Kubernetes won but there's been a [unintelligible].

Rich: It's interesting to me, you mentioned Puppet and I've talked before about the fact that I worked there for a while. I was in the Puppet community as a user, and then I ended up working there as an SRE. I was working on how we used Puppet internally. And it was uh, I don't know, it there was this period of time where people were like, oh, containers are here now and Kubernetes and all these things, and we don't need configuration management anymore.

And then, you end up with a billion YAML files.

David: Yeah, definitely. It's funny how things work out, right? When you look at it, you're like, I don't understand how you got from A to B there, when there is like a nicer path, but things are the way they are. And I don't know, like, I remember my time with Puppet quite fondly. It solved so many problems.

There was nothing like that when it came out and that was great. I literally had it running on every race course. Provisioning my software. Fantastic. And it, I think it probably sparked my interest in other languages. I think the DSL for Puppet was so revealing and so foreign to me as a C developer that I was like, Ooh, what is this thing?

And then, I started playing with a bit of Ruby and Python and just started going, Ooh, all the shiny technology out there. Like I need to go play with more of this.

Rich: I was writing tests for Puppet code. And so I actually wrote more RSpec than I wrote Ruby, back in the day, which is backwards. But um, you know, I think to me the issue with Puppet and Chef was not so much the tools themselves, but the fact that they didn't change rapidly enough with what was going on.

Puppet specifically was very much modeled on managing static infrastructures and that was suddenly not a thing anymore.

David: Yeah, dynamic inventories were coming up hot and fast and people need to, like the software had to adapt to work with that. I think that's why Ansible was just so popular. It worked really well, SaltStack as well, and their ability for the agent, the communication method of SaltStack still amazes me.

I think it's ingenious the way that, you just have the SaltStack master run ZeroMQ and all the minions subscribe to it. And the master only ever has to write messages to the ZeroMQ and then the agent picks it up and puts a response back on it. It's just such a cool way to do it.
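The publish/subscribe pattern David describes can be sketched in a few lines. This is a minimal in-process illustration only: it does not use ZeroMQ or SaltStack, and the `Bus` and `Minion` classes, the job dictionary shape, and the command strings are all hypothetical names chosen for the example. The point it shows is that the master writes each message once and never has to contact minions individually.

```python
class Bus:
    """In-process stand-in for a publish/subscribe message bus (not ZeroMQ itself)."""

    def __init__(self):
        self.subscribers = []  # callbacks registered by minions
        self.returns = []      # responses that minions put back on the bus

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, job):
        # The master writes one message; every subscriber receives it.
        for callback in self.subscribers:
            callback(job)

    def respond(self, minion_id, result):
        self.returns.append((minion_id, result))


class Minion:
    """Hypothetical agent that subscribes to the bus and reports results back."""

    def __init__(self, minion_id, bus):
        self.minion_id = minion_id
        self.bus = bus
        bus.subscribe(self.handle)

    def handle(self, job):
        # Act only on jobs targeted at this minion (or at everyone), then respond.
        if job["target"] in ("*", self.minion_id):
            self.bus.respond(self.minion_id, f"ran {job['command']}")


def demo():
    bus = Bus()
    for name in ("web1", "web2", "db1"):
        Minion(name, bus)
    bus.publish({"target": "*", "command": "apply-state"})
    bus.publish({"target": "db1", "command": "ping"})
    return bus.returns
```

Calling `demo()` returns one response per minion for the broadcast job, followed by a single response from `db1` for the targeted job.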

Rich: You tweeted the other day that you wanted to come on some podcasts and talk about your show Klustered and that's kinda how we got to this point where we're meeting up today. I'm a big fan of the show. I've seen several episodes. I can't always hang in till the end, cause sometimes they go on for a while, but I've...

David: They can get quite lengthy.

Rich: Yeah, I've watched like the first 30, 45 minutes of quite a lot of them. I also have ADHD. And so sometimes sitting still and watching something for a few hours is more than I could do. But it's a fascinating show. For the listeners who might not have seen it, can you just describe what Klustered is about?

David: Yes, definitely. And thank you for taking pity on my tweet and having me on to talk about it. So Klustered is, I had this idea. I just wanted to make learning material that was fun.

And I didn't understand why this didn't exist yet. And I thought we've got so many people trying to adopt Kubernetes and you know, you'll be like me, right. We, we speak to so many developers in the community and we hear all their challenges and we want to do more to help them. But, there's only so many blog posts you can write or so many links to documentation you can give. And I was like, there has to be a more engaging way that people can learn this stuff.

That's not just go RTFM. And I was like, I'm alright at fixing broken Kubernetes clusters. And I know a few other people that are alright at fixing broken Kubernetes clusters. Could I do this on a stage at a conference? And I started thinking about that. And then COVID happened and there were no conferences and the idea disappeared for a long time.

And then, as I started building up my YouTube channel and my live streams, I said, I could do this on a livestream. Like, live coding's painful, but I'm sure I can maybe make this work. And so I just reached out to a few people and I'm like, if I give you a Kubernetes cluster, can you break it for me?

And surprisingly at the start people said yes. And my initial idea, it seems absolutely ridiculous now, but my plan was to go through 10 broken Kubernetes clusters on a one hour stream. And I've…

Rich: Oh wow.

David: And what I've now worked out is that actually, it doesn't matter how superficial or trivial the break on a Kubernetes cluster is.

Being able to pinpoint what is wrong through the symptoms actually takes much more time than fixing the cluster itself. Sometimes you can just get on a cluster and it'll take you thirty minutes just to get a lay of the land and work out what's broken and fix it. It's like a [unintelligible] was never going to happen.

But that got me to try the idea. My first episode was back in January of this year with Walid Shaari. Walid, he does some of the best GitHub repositories for materials for learning CKA, CKAD, and CKS. I'll give you a link for that if you want to put them in the show notes, but I recommend them to everybody. They're just so, so good. I spoke to Walid and was like, do you want to do this thing with me? We're going to get some broken clusters and we're going to go on a stream and we're going to do our best to fix them. And surprisingly, Walid said yes, so there we are.

We go live and I'm like, all right, we have two clusters. And we tried to use tmux. And it was so painful, because nobody ever remembers the tmux shortcuts. And we're trying to, like, split panes and type into shared buffers. And eventually we did fix two broken clusters. It was so much fun. I was actually surprised. Yeah. Just because it turned out that what was important wasn't the breaks. Like, there's the entertainment part, seeing how people demolish these things. And they've been getting creative. Like, we can talk about some of the breaks; this is not a one-line change in the config anymore, although some of them started off that way.

We are talking about kernel hackers going to town on the entire machine to stop you getting anywhere near the cluster. But what turned out to be more interesting was the way that people communicate with each other during the pairing process, and the way that they handle the debugging. Like, some of the highlights are just, oh, there's a cool alias I'd never thought of. Or there's a cool keyboard shortcut I'd never thought of. Or people would just bring on their own little tools. Like, I remember the first time someone brought on k9s and used that. And I was like, whoa, what's this thing? And just seeing the way that people, um, break down the problems and try to work out what's going on, I found really exciting.

Having the opportunity to share that with a broader audience so that we can all learn this stuff together, I couldn't be prouder of the work of Klustered.

Rich: That's fantastic. Yeah, big shout out to Walid. He's also been a supporter of the podcast. He's been on several of the episodes, asking listener questions and all kinds of stuff. And I've definitely seen him in the Klustered chat quite a bit too. I think that it just seems terrifying to me. Like, you mentioned live coding, and live coding is terrifying to me, but this seems like just even more terrifying, right?

Because somebody else has specifically like sabotaged this thing and you have no idea what they've done.

David: Yes. It was terrifying. And it's weird, right? Like, you really get this competitive urge when you're on it, because you really want to fix this thing. But at the same time, people have gone out of their way to show that you can't. So yeah, it's not fear exactly. It's the fear of the unknown, because you have no idea what you're walking into.

Um, and live coding is difficult. And you know what? I used to think that people are going to think I don't know what I'm doing. If I can't remember this command, or if I have to Google something, that's just going to be so embarrassing. But I find that I've actually now got a unique opportunity where we can set some new norms. In my day job, I do Google stuff and I do forget commands.

Why am I worried about doing this on a stream? And with this, I think I just settled into it. I'm like, actually, it's now my duty to show people that we don't have to have all the answers and we don't know all the commands. And we do Google the most ridiculous things on a daily basis, and that's okay. And that kind of subsided the fear a little bit. Now it's just about, okay, let's have some fun, and that's what it's all about.

Rich: One of my favorite examples that comes up from time to time is regex, because I'm one of those people who, like, I'll look at a regex or write one once every five years. And so I'm never doing it frequently enough for it to really stick in my head. But unless you're using it all the time, why would you ever want it to stick in your head?

David: Exactly. I think it's definitely one of those tricky esoteric things. And even if you think you know what you're doing, when you try to actually do it, it always turns out to be wrong anyway. And then there are different flavors and variations of regex. Yeah, I can't stand it.

Rich: So you mentioned that you've learned about some new tools. And I've seen this happen. I've been watching some of the streams where you're like, oh, wow, I didn't know about this thing that somebody mentions that ends up being super helpful. You mentioned k9s. I wondered if there are any other examples you could think of, like tools or commands or things that you learned about.

David: Yeah, definitely. I think I learn something every episode, even if it's just a shortcut or a part of the Linux system that I've never worked with before. But as far as tools, I discovered Teleport for this. And now I run it, I think, on every cluster. It just allows us to have a shared terminal where you and I can both be typing at the same time, if we wanted to, into a machine and trying to work out what's going on.
So Teleport's cool. K9s is awesome. I love seeing people use that. It just makes that whole dance of kubectl, get, delete, get logs, delete, and then edit, it just makes all of that a lot more fluid, a lot easier. Definitely a really cool tool. But I think the most surreal one was when Kris Nova was on my show with Thomas Stromberg.

So Kris Nova, I think we all know, she's a kernel hacker, in security, into eBPF, I think currently at Twilio, and Thomas Stromberg is one of the maintainers of Minikube.
And so putting those two onto an episode together, I was like, I just know this is going to be great, because they've both got so much in-depth knowledge about Kubernetes.
They were both in the Kubernetes team when I was just learning it. Like they were super early. Yeah. It turned out to be an almost magical episode. So Kris Nova went to town on this cluster. Live streamed it. I had to resist watching it to know what was going to happen. But we log onto the system and Thomas was like, all right I'm just gonna gather some facts.

I was like, okay. I'd not seen someone do that before. Typically we try and run kubectl get nodes, but yeah, you're good. You do it your way. And he installed this tool called Sleuth Kit. Have you heard of Sleuth Kit?

Rich: I don't think I have.

David: So it turns out that during Thomas's tenure at Google (I think Thomas was there 10 years or something, almost), he had worked on a number of cyber security and forensic analysis tasks on machines when there was a suspected break-in or breach or a security incident.

Rich: Right.

David: And Sleuth Kit is a tool that can analyze the filesystem and basically give you a snapshot of everything that has changed within a certain time period.
So Thomas jumps onto this machine and installs Sleuth Kit. I'm like, I have no idea what this is. He runs a command, and the next thing I know he's got a directory with diff and patch files for everything that happened in the last 24 hours. He's like, anyway, I'm just going to leave them there in case I need them, and then went on to debug the cluster normally.
And it was just so surreal. Because, one, I had no idea that tool existed. I had no idea that the technology was even possible based on the filesystem. And he just poked at the cluster, worked out what was wrong, and used the Sleuth Kit patch files as a reference. I mean, it all took 50 minutes, I think, to fix the cluster.

It was not a quick fix, there's still a lot of work to be done there.

Um, but just really cool seeing that background of A, working at Google, B, working on forensics analysis of compromised machines and bringing that to a Kubernetes situation where you want to try and work out what went wrong.

And that was really cool. Very cool actually.
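The idea David describes, listing everything on a machine that changed within a time window, can be roughly approximated with modification times. This is only a sketch of the concept, not how Sleuth Kit works: Sleuth Kit is a dedicated forensics toolkit, whereas the hypothetical `files_changed_since` helper below simply trusts `mtime`, which someone with root access could forge.

```python
import os
import time


def files_changed_since(root, window_seconds, now=None):
    """Return paths under root modified within the last window_seconds, oldest first."""
    now = time.time() if now is None else now
    cutoff = now - window_seconds
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                changed.append((mtime, path))
    # Sort by modification time so the result reads as a timeline.
    return [path for _mtime, path in sorted(changed)]
```

For a 24-hour window like the one in the episode, a call might look like `files_changed_since("/etc/kubernetes", 24 * 60 * 60)`, where the directory is just an example of somewhere worth checking.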

Rich: That's super interesting. I didn't know about that either. Of course I know about things like audit logs, but that's such a different thing to actually be diffing the file system.

David: Yeah, I mean, I think this is the thing, right? Is that Thomas knew he was on the episode with Kris and we all know Kris's background. Kris had patched that, had used something called LD_PRELOAD to swap out one of the functions in glibc. So that even if you wanted to do ls on a directory, you wouldn't see the hidden files.

They completely went to town to make this thing hidden within the cluster. But there are tools and methods to get around that, too. And you know what, it just wouldn't have worked if it wasn't for those two. And it was complete serendipity that they happened to be on that stream together, because we learned a lot from Kris and the breaks and the way that she tackled the cluster and all of that.

But then equally we learned a lot in the fixing side, and I think that's, you know, the best episodes are the ones where we have fun with the break. We enjoy it. We have, we get a giggle, we laugh, but at the same time, we learn a lot in the debugging and processing and fixing phase.

Rich: Yeah, I'll have to put a link in the show notes to that episode specifically, and then people can find, um, the rest of your YouTube videos. Besides doing Klustered you do some other things, but oh, I wanted to ask you, I usually save listener questions until the end of the show, but there were some questions specifically about Klustered that someone asked.

So I thought that we would tackle those now. @IworkWithHomer on Twitter asked several, but I'm just gonna tackle a couple of these. What are your favorite and least favorite breaks that you've seen so far?

David: Okay. Least favorite is easy. The Unicode break.

Rich: He predicted you were going to say that.

David: Eh, he predicted right. I used to organize Cloud Native Glasgow and Docker Glasgow. And through my relationships there I met someone called Guy Templeton, who is a container engineer at Skyscanner in Glasgow. And I reached out to Guy and I was like, do you want to come on and join me?

And he's also the co-chair of SIG Autoscaling. So he knows his Kubernetes. And I thought he was going to give me some really cool autoscaling bug or something, a resource management bug. I was like, come on and do Klustered with me. So he got a cluster, he broke it and we're working through it. We're fixing things left right and center.
And there's one thing that's not working. My application can't speak to the backend database. There's something wrong with the DNS discovery in the cluster. So I'm sitting with two terminals open, looking at the CoreDNS config on that side, and I'm looking at an example CoreDNS config on this side.

And I don't see any differences. I'm going up and down and looking and scanning. And I have no idea what's going on. And it turned out that Guy had just found an E that looks like an E but isn't an E. So my CoreDNS was running service discovery for kubernetes.cluster.local, or whatever it was, but with one of those letters changed.
And he told me in the chat, this is what I've done, and I'm looking at them and I still don't see the difference. And there's been a blanket rule ever since that there's no Unicode things here. I think it's a good example of some of the weirder breaks on Klustered.

Like, we do want them to be production-like and real-world so that people can learn all the tools, but we still get a lot of fun and enjoyment from the absolutely absurd. And I don't know where that one lies, because there is the potential to typo, but whether you would typo like that, I don't know.

But that was the most frustrating one. I think it's the first time I've ever audibly yelped on my stream when he told me and I actually seen the letter change.

Rich: I've just flashed back to um, I worked at an internet provider in the late 90s and we had the guy who was our host master, who maintained all the DNS records used this editor called Joe. And he would once in a while, like fat-finger something and embed hidden control characters in the file that would just break everything.

But like from that point on, so like the first 40% of the DNS records would load and then the rest wouldn't and it was just such a nightmare. Oh my gosh.

David: Yeah. Remember when things used to be easier and it used to be backup files with, like, little tildes in the names and stuff, and you could just swap them back? We don't do that anymore. I think we're missing out.

Rich: It's the .bak, those...

David: And I would have .bak David one, David two, David three, David 4056. Like I never deleted those things. And that was my git system back then. Oh Yeah. definitely.
That's my least favorite break. That one was so frustrating. And even when you knew it was there, it was really hard to work it out, and I've got in the habit now, whenever I open files, of searching for the part I expect to see. And if it doesn't show up, then I suspect Unicode. Yeah, that was my least favorite. Um,
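A lookalike character of the kind David describes is hard to spot by eye but easy to find mechanically. The sketch below flags every non-ASCII character in a piece of text and reports its Unicode name. The `find_non_ascii` helper and the sample config line are hypothetical, made up for illustration rather than taken from the actual break, and the check assumes the config is expected to be pure ASCII.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line_no, column, char, unicode_name) for every non-ASCII character."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if ord(char) > 127:
                name = unicodedata.name(char, "UNKNOWN")
                findings.append((line_no, column, char, name))
    return findings


# Hypothetical sample: the second line uses a Cyrillic letter in place of a Latin "e".
sample = "kubernetes cluster.local\nkubernet\u0435s cluster.local\n"
for finding in find_non_ascii(sample):
    print(finding)
```

Searching for the exact string you expect, as David mentions doing, works for the same reason: a byte-for-byte comparison does not care what the glyph looks like.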

Rich: What about favorites?

David: So there've been quite a few interesting ones. I'm not going to pick one person in particular, but a few people have taken the same route. In fact, I'm going to go with two, because one of them is a funny story about Kubernetes, and another one is a funny story about the efforts people go to to break it. The first one is a good thing about Kubernetes. Jason DeTiberus actually deleted the kubelet binary on the machine and stuck an old kubelet in, which seems really harmless. Now it turns out that Jason went back a couple of versions and nothing broke. In fact, he actually had to roll back from Kubernetes 1.19 or 1.20 back to Kubernetes 1.3 before the kubelet would actually break the system. And debugging that was a pain, because to me it looked like the kubelet existed. It had a binary. I could run "--help." It all looked normal. I didn't run "kubelet version" because it just doesn't pop into my head that someone would replace the binary with a really old version. Really painful to debug. Looking at the logs, nothing of course made sense, because the kubelet was trying to be a kubelet and it just doesn't understand the API server, so it's falling over.

I went down the path of trying to debug RBAC and rules and everything. I had no idea what I was doing. So that was a fun one. And I just like how far back he had to go to actually get the kubelet to break the system, which I think is a testament to just how great everyone working on that project is. Now, the other one is, when people don't replace the kubelet with a really old kubelet, they've started patching the kubelet. Like, literally forking the kubelet and modifying the code to return whatever they want.

Rich: Oh wow.

David: Which is really difficult to debug, because 99% of that kubelet is being a kubelet.
And then there's this little tiny 1% that's being a little devil child. And you try to work out what is going on in that system. People have done this with the API server too. 99% of the API server calls all do the right thing, but if you try and update your application, bleh, it just breaks.

And even people replacing kubectl on all of the machines. So I can't even run my commands. It looks normal, but they literally just have it output the same thing, which looks normal again. And I'm like, I'm making changes and I hope to see them in the system and forget it. I think those ones are great just because it takes you so long to work out what's going on.

You can't help but smile when you work it out.

Rich: Yeah, replacing the binary is pretty cruel.

David: Very sneaky. Very sneaky.
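One way to catch the swapped or patched binaries described above is to compare each binary against a checksum you already trust. The sketch below is only an illustration: the helper names `sha256_of` and `find_tampered` are hypothetical, and it assumes you have trusted SHA-256 digests from somewhere outside the cluster, such as the checksums published alongside a release.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_tampered(expected: dict[str, str]) -> list[str]:
    """Return the paths whose on-disk digest differs from the trusted one.

    `expected` maps a binary path to the digest you trust (an assumption:
    you obtained it from outside the possibly broken machine).
    A missing file is reported as suspicious too.
    """
    suspicious = []
    for name, trusted in expected.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != trusted.lower():
            suspicious.append(name)
    return suspicious
```

Called with something like `find_tampered({"/usr/bin/kubelet": "<trusted digest>"})`, it returns the paths that do not match. It would not say what was changed, only that the binary is not the one you expected, which is the hint that was missing in these breaks.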

Rich: What kinds of things have you learned about troubleshooting Kubernetes from, from being on the show?

David: Yes. That's a great question. Really just, Kubernetes is really difficult. Um, you know, we think of it as a system for running distributed applications, but Kubernetes itself is more distributed even than I think we give it credit for, or even realize most of the time. And the API server running might mean that you got half a working system, but the fact that it has to speak to a scheduler, it has to speak to the controller managers. And something you don't realize, like I spent a lot of the last, what is this, August? Eight months going through the static pod manifest for the control plane and there are a lot of flags in there. And being able to disable the pod controller with dashboard controller, I [unintelligible] every single time. And why do we, like, why is that option there?

Why can we disable the pod controller? And uh, all the different port map, you can see, another thing is you can see the legacy and the history of the project as it's changed as well. There are so many parameters in the static pod configurations, for the API server and the kubelet that are just there that don't really do anything anymore, but that kind of have to be there. Like setting the insecure port number to zero and a few other things like this.

That must be really confusing for people that are new to the project, because I see myself doing it, where I'm going into the config and I know something's broken and I'm trying to work out what it is. And they're all just red herrings. They're all things that look suspicious but are completely harmless. And being able to navigate that and understand that has just been a complete chore of trial and error and fire, lots of fire.
So yeah, I can see how difficult this project is for people to pick up, because it's difficult for me as a five or six year person in it now.

Rich: Yeah, I'm under the impression that at this point, there's a high reluctance to introduce breaking changes. So I wonder if any of that is related to the fact that there's this old, legacy, not, not my favorite word, but the fact that there's this old stuff in there that doesn't necessarily fit with what people are doing now, but people are maybe reticent to, to pull it out at the same time?

David: Yeah, definitely. I think that's definitely a big problem. I don't like using the word legacy either, but I think there's a certain application [unintelligible] that is 100% correct. And I think what we also need to understand is that Kubernetes historically has not been an entirely secure system.

Yeah, that's the word. Not been an entirely secure system. And we've seen a lot of effort by a lot of people over the last two years alone to fix that, and that is causing a lot of these flags to be lying around in config files, like the insecure ports. Like, you know, did you know the kubelet used to run an API that anyone could talk to and run containers in your system?

That was just there. That was bound to a public IPv4 address on all your machines. But they are getting better. It leaves both the little funny looking flags on your config files and it can trip you up. But I think I've just, to go back to your question. The thing I've learned is that this just takes a lot of patience. There are a lot of moving components. And that really, just use a managed service. Just use GKE, just use EKS. If you've got a need for bare metal, and I'm supposed to advocate for bare metal, and bare metal is awesome, but you gotta be prepared for all the learnings along the way.

Rich: Yeah, agreed. I mean, there's a, there's a cost, I think with most decisions like that, whether to use a managed Kubernetes or to run your own. It's just all about trade-offs right. And you're just going to need a level of expertise to run it on bare metal that you don't necessarily need to use a managed service.

David: Yeah, if you need complete flexibility in the way that your system is configured, and you want to run custom schedulers, or you want to use custom container runtimes, go the bare metal route. And you're going to have to invest in people that can get this deep understanding. And I hope Klustered helps these people that are in this position. But yeah, it might just be easier to use GKE. In fact people always ask me on Twitter, like, why use bare metal clusters? Why don't you just use GKE? Wouldn't it be a lot easier? And the breaks have just gotten too sophisticated, too complicated. They're just too intricate now that you couldn't, we don't always, in fact it's rare that we see breaks that are purely on the Kubernetes API service. Um, a lot of breaking Kubernetes comes from the underlying host and…

Rich: It seems like the point of those managed services too, is to fence off things. So you can't break them.

David: Yeah, they're supposed to be, uh, child proof in that you can't make any silly mistakes, and they're always going to keep running and it doesn't matter what silly things you're throwing at the API. It's going to try and keep your system online. Whereas in bare metal, you can literally do anything you want.

Rich: So @IworkWithHomer had another question, which is where do you think most people's biggest diagnostic weaknesses are, as in what part of a cluster, if broken, do you think would be the most difficult for people to solve?

David: etcd. Done. Next question.

Rich: Oh, wow.

David: Yeah okay. There's probably two. etcd, I've seen every single episode. If someone breaks etcd we're all straight to the docs. Even just connecting to etcd in a Kubernetes, or at least a kubeadm-configured etcd environment, right? You have to enable the etcd v3 API. You have to point it to all the different keys and the peer keys.
And then eventually you can run an etcdctl command. And it's only then you can run an etcdctl command. You think, how often do you have to do that? Even trying to remember how to get the health of an etcd cluster is, is really painful. In fact the simplest break in the world was by Team Talos against Team Red Hat, and it took Team Red Hat the whole episode just to even work out what was going on.
And I was right there with them, I was just as perplexed. And it turned out all Team Talos had done was say "etcdctl member add" fake IP address. That was it. But that one simple change moved etcd from being a single worker mode to a distributed etcd where it could never, ever get a quorum.

And that node never came online and there was no way to fix it, or there was a way to fix it, but it was extremely complicated. And we ended up just restoring etcd from a backup before the etcd command was added, the fake member was added. Because removing the member was also extremely complicated.

And it's amazing that one change took a team of really smart engineers, and they never even just identified what had happened. That is how, I think that is just how separated we are from etcd because of managed services. And I'm sure there are people that manage etcd in bare metal environments that are working with it day in day out that, sure, child's play. But for most people, etcd is the biggest "Oh please don't be etcd, please don't be etcd."
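The arithmetic behind that break is small enough to sketch. etcd, like other Raft-based stores, needs a majority of its configured members to agree before it can commit anything. The function names below are hypothetical and only illustrate the majority rule; they are not part of etcd or etcdctl.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority of the cluster."""
    return members // 2 + 1


def has_quorum(members: int, reachable: int) -> bool:
    """True when enough configured members are reachable to commit writes."""
    return reachable >= quorum_size(members)


# Before the break: one configured member, and it is healthy.
before = has_quorum(members=1, reachable=1)

# After "member add" with a fake address: two configured members,
# but still only one that can ever respond.
after = has_quorum(members=2, reachable=1)
```

With one member the quorum is one, so a single healthy node is enough. Adding a second member that does not exist raises the quorum to two while the number of reachable members stays at one, so the cluster can no longer commit writes, including the write that would remove the fake member.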

Rich: That's super interesting. I mean, I know there are people who use etcd for other things besides Kubernetes too. And so those folks probably know quite a lot more about it. If they're using it to, you know, as a data store for their applications or other things like that.

David: Oh, yeah, totally. But it's familiarity. If you don't work with these technologies [unintelligible], you know, when you work with Kubernetes, you've already got so much you need to learn and work with to deploy your application. That to a certain point we forget etcd is there, or at least we don't even acknowledge it.

We're too busy thinking about the API server and schedulers and the container runtime and the CNI implementation and our service meshes that we've got running on it, and then certificate management. And then they'll go all over admission controllers and our GitOps controllers. And there's this little etcd component is, this is like, you know, in Star Wars, where they've got that little hole on the ship.

That's the etcd. It's just sitting there waiting for someone to target it. And then the whole thing. And that's just how it feels to me. So yeah, that's the one thing that I think most people are like please don't be etcd. It used to be certs.

Rich: Oh, yeah.

David: But now with kubeadm, you just do "kubeadm renew certs all," and it's just done.
And yeah, I don't think we fear certs anymore. And then there's one more that I'll mention because it is rising in prominence, now. I see a lot more people trying to bring this into Klustered and I'm terrified of it, is eBPF.

Because eBPF is super powerful. It can block syscalls, it can break networking. It can do all that. One I haven't seen yet is any debugging tool, and please somebody go write this, but any debugging tool that can just give me a list of kprobes in the system. So I know that eBPF is running something. Which may be difficult with Cilium on my cluster, of course. But it's really, right now, there's no way to identify if someone has run some eBPF code in your system. If there is, someone please tell me how to do it because more people are using it and I'm getting very scared.

Rich: Duffie Cooley, are you listening?

David: Yes Duffie we need your help.
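As a possible starting point for the kprobe listing David asks for, the Linux kernel documents a debugfs file, `/sys/kernel/debug/kprobes/list`, with one registered probe per line: an address, a type (`k` for kprobe, `r` for kretprobe), and a symbol plus offset, optionally followed by a module name and status markers. The parser below is a minimal sketch under several assumptions: debugfs is mounted, you are root, and the file follows that documented layout. Whether a given eBPF program shows up there depends on how it was attached, so treat an empty list as inconclusive. `Probe` and `parse_kprobes` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path

# Documented location of the kprobes listing (needs debugfs and root).
KPROBES_LIST = Path("/sys/kernel/debug/kprobes/list")

KINDS = {"k": "kprobe", "r": "kretprobe"}


@dataclass
class Probe:
    address: str
    kind: str         # "kprobe", "kretprobe", or the raw letter if unknown
    symbol: str       # symbol+offset, e.g. "vfs_read+0x0"
    extra: list[str]  # module name and status markers, if any


def parse_kprobes(text: str) -> list[Probe]:
    """Parse the text of the kprobes list file into Probe records."""
    probes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        address, kind, symbol = fields[:3]
        probes.append(Probe(address, KINDS.get(kind, kind), symbol, fields[3:]))
    return probes
```

On a live machine this would be used as `parse_kprobes(KPROBES_LIST.read_text())`. It only reports what the kernel lists in that file, so a probe attached some other way would not appear.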

Rich: You, you were talking a little bit about the challenges to you know, a new person coming into Kubernetes. I'm wondering if there's other advice you might have for someone in that position, who's, because there's a lot of past history at this point, right? There's a lot that's happened. And for folks who've been in the community for a number of years, I think they, they may not recognize what the experience is like for somebody who walks in today, at day one.

David: So advice for people that are going to be operating Kubernetes. Do I understand that question right?

Rich: Yeah.

David: Pick up a drinking habit probably would be a good start. Eh uh, but you're going to have to be patient even when you start to understand this system, even when you start to understand the quirks and you start to get a gut feel for it.

And I've noticed that with more recent clusters that, the more breaks I've seen now, like we're on episode 24, over 40 something broken clusters. I'm now starting to speak Kubernetes. Is that a thing? I don't know. I can see things that happen and immediately I'm like, okay, it could be this, it could be this, it could be this.

And that just comes from experience. I genuinely believe that Klustered is a great way to get at it because I'm learning these things by watching it and sometimes partaking. So I would encourage anyone new to operating Kubernetes clusters to give Klustered a chance. Because what you're going to see is experts from all around our community, showing how they would debug a problem.

But yeah, it's patience and experience. I think that's true for probably most things in technology. I wish I was bringing something new to the table here. But, but patience and experience and just persistence are the key to all of this.

Rich: I feel like the certifications could potentially be helpful too. Like, I think that the things that you learn to be able to pass the CKA are things that would probably help somebody quite a bit.

David: Yeah the certifications, the CKA is a great exam and the course materials for that. And Walid's repository, I can't give that enough shout outs. It just, there's so much material in there. And, even Kelsey Hightower's Kubernetes The Hard Way is still being updated. And that walks you through manually building your own Kubernetes cluster.
And again, the more times you do that, the more mistakes you make every time you do it, you're just going to learn and start picking all this stuff up.

Rich: Yeah. So you're, you're doing some other things besides Klustered as well. You have Rawkode Live where you have a lot of folks on. I've been a guest on there and really enjoyed it. You have folks on there, it seems like they're mainly people from projects coming on to talk about what they're working on?

David: Yeah. So it's the most selfish thing I have ever done in my entire life. But remember I said, I was curious? I really want to play with all of these technologies. And I've managed somehow to convince you and many other people to come on and sit down and literally guide me through your technology.

And I say that as a joke, of course, obviously I'm not selfish, I'm just curious. But you know, we have a format I think works really well and I get to play the idiot, which I'm really good at and sit there with my terminal open and have awesome people like yourself that just have all of this knowledge and experience and sit down and really just have a conversation and talk about why this technology is fun and exciting and interesting.

And at the same time, just run through the getting started guide. And it's the commentary of, getting started guide, anyone can do it on their own time. But that component with the commentary from yourself and others, I think this adds so much context and value to the demo that it just elevates that somehow.

I just find them super interesting. And people are really passionate about their projects as well. And we don't get that from documentation, but you know, when you sit there and people are saying, this is why it's this way, this is why we built it this way. This is why we made this decision. All of that just helps paint this big, complete picture.

And anyone that's interested in this technology can hopefully come and watch these episodes and absorb that passion and see how to get started and hopefully spark their curiosity to go play with it for a minute. And I just yeah, again, I just love that I get to sit and do this a couple of times a week and just see all of this cool tech.
When's the last time you looked at the cloud native landscape? I mean, it's probably three times the size since then. And that's assuming you looked at it earlier today, so it's good to [unintelligible].

Rich: Yeah my guess is that you're one of the people out there who probably has exposure to the widest amount of tools, just because of all the different kinds of streaming that you've done.

David: Yeah. I just think it's so difficult these days. The hardest challenge is that of choice. Like, I feel it in the cloud native communities, as we're building all these distributed systems with microservices and we're running on Kubernetes, is that choice is eventually a bad thing.

We get fatigued from making all of these decisions. Like, if you have to sit down and say, well what service mesh are we going to use? Looks like people are using Istio. I heard people say Linkerd is really good. Oh, there's also Gloo or, and then there's Ambassador. And these are all great products and they're all doing something very similar.

And at the end of the day, we just have, there's only so many decisions we can make in a day. And I hope that my show gives people an opportunity to ease that burden. And my dog says hello now, if you can hear.

Rich: Hello. What is your dog's name?

David: Daisy.

Rich: Ah, that's so cute.

You also have a show now on cloudnative.tv.

David: I do. I don't know how I have the time for all of this, I've got to say. But I've been working with a wonderful team at CNCF and the community and, Dan Pop is the co-producer and really driving this project. And he's done a phenomenal job of that. I co-chair this cloudnative.tv with Kat Cosgrove, and the three of us are really just trying to make sure that we're getting more material out there to help people and just deal with all this.

Like I said, cloud native is really, really hard. My, my show is trying to tackle a frustration that again, I have personally. And that is, I want to be able to contribute more code to these projects, but these projects are all so mature and they've got so many people working on them. They're often backed by relatively large organizations and it can be very intimidating for people that want to come and get that first pull request in.
And that's why LGTM is the name of the show. And the idea is to just remove that barrier to entry so that people can come and see, okay, here's the project, here's a code walk through. And then we make a live feature request or a bug fix. And we open a pull request and we test the application. And we just want to make it paint by numbers.
We want anyone to be able to come along and say, I want to contribute to Linkerd. Oh cool. There's a nice 15 minute video with one of the maintainers, literally cloning the project, making a change, pushing it and trying to share as much knowledge as they have about the way the rules are. Like slash commands in a pull request.

I don't know if you've ever tried to contribute to Kubernetes, but even just trying to work out, should I, should I LGTM, even though I'm not a maintainer or should I okay to test, or can I request a test or who do I assign that to? And it's just, you can be overwhelmed so so quickly. And LGTM just wants to help people that want to contribute to these projects make that first contribution.

Rich: That's fantastic. Yeah, the pull request stuff can almost be like a foreign language, if you're not into the conventions that a specific team is using.

David: Yeah, I think these days on some projects writing the code is the easy part. And then there's all these unwritten rules and sometimes they're, they're documented, but all mystery process like CI or GitHub Actions or something like, you know, some projects will not accept a commit that's not formatted with semantic conventions or something like that. I think we can do a better job of helping people get onboarded and be familiar with these projects. So it can feel as if we can just remove a little bit of that fear or worry from contributing and get them on their way. That's all it takes.

Rich: That's fantastic. Um, what are your feelings about the community right now? Like it feels to me like things are still rapidly expanding. I remember there was such a, that huge crowd at KubeCon San Diego. It seems like the community is growing even since then. Is that what you're seeing as well? That there's still a lot of new folks coming in?

David: Oh yeah, definitely. And so many, I think now that it's cloud native and Kubernetes is getting so much adoption and becoming such a staple in IT and technology. Um, But we're now starting to see universities and colleges have programs on teaching on containers, and Kubernetes is becoming part of everything that's being taught as soon as they're 18 or 19 or 20.

And they're starting to break into this industry. Which is just bringing in even more new people. I can't remember specifics to what the KubeCon numbers are, but I think in Europe there was 26,000 virtual attendees and that's just growing year after year. So we're seeing the conferences get bigger and bigger.

We're seeing the number of projects continue to expand and we're seeing more and more people coming into the space, which is great. The more people we have here, the better, and the more people that want to contribute and be involved, great. The more people that just want to sit on the sidelines and use the project, great. Like, you know, it's never a bad thing to have all of these extra people. Um, and you've got wonderful podcasts like yours, where you're having conversations and making it easier for people to understand who's in the space and how they got involved. And, Dan Pop has, and I've got my shows and Cloud Native TV and we've got Saiyam, and look at Kunal. There's just all these amazing people producing content that's, again, making it easier for even more people to come in. And it's a really great time to be in cloud native, I think.

Rich: Yeah. Agreed. So impressed by that whole list of folks that you just mentioned. They're all people that I think are just really killing it. Saiyam is somebody who like, I don't even know how he cranks out the amount of content that he does. You're the same way. Pop is really great.

Kunal as well. It's really I dunno, I find it really inspiring, to look around and see what, what people are accomplishing. The hard part is just trying not to measure yourself against that right, like...

David: I couldn't agree more. It can be intimidating for everyone involved. Definitely. And uh, but what I would encourage anyone listening to do is if you've ever thought about having a podcast or a live stream or producing content, go for it. There's no such thing as too much material to help people because this is really difficult.

And I think we just need more content. We need more docs, we need more guides, we need more tutorials. It doesn't always have to be video, but we just need people to start sharing their experience. I think it has to evolve from, it used to be very conference driven, like we would just go to conferences, we stand on a stage, we try to look impressive and then show some code or some slides. But we have to evolve it and more, yeah, like I said more articles, more videos, more podcasts. We just need more and more of this stuff to make it easier.

Rich: Yeah. I agree completely. And it's interesting because I think sometimes people don't understand that like even the most experienced people that you can think of when you look around in a community can get a talk rejected, right? Like it's not like everybody still always is able to do exactly what they want.

And I think it's a huge advantage to not wait on other people to let you say the things that you want to say, right? Like when you can do like you're saying and hop on YouTube or hop on Twitch, or, start a podcast, you don't need anyone's permission, to get in front of an audience at this point.

David: Yeah, there was a really cool tweet I retweeted the other day and it said if you're producing content to help people learn, we're not competition we're teammates. And I think that is so true. Like we are all producing materials. Like it doesn't matter if this person is getting more views or more subscribers, we're all just trying to make this content accessible to as many people as possible.

And the fact that it's there, then that's a complete bonus. And I lost my train of thought.

Rich: That's okay. I feel the same way, and that's the vibe that I've gotten from, you mentioned Pop, I've had a lot of conversations with him about this stuff, and he's so supportive. Um,

David: He really, really is.

Rich: supportive of this podcast and other things that I do. And I know you two hate each other, but still managed to pretend that you don't or something.

David: Yeah. I can't even remember what side of the fence I'm on anymore, to be honest. Oh, I remember what I was going to say. I can't remember who said it and I wish I could attribute it to somebody, but they said the best engineers are the ones that say "I don't know" the most. And I think the,

Rich: I think I saw that as well.

David: And I think the best people that are speaking at conferences and speaking, or doing content like this, are also the people that are being rejected the most and the people that are just not having content land like you have to accept that there is a rough road ahead and it's going to be difficult. Um, but you're still leading your way.

Rich: I think there's, there's a lot of advantages to, I'm not sure if I talked about this on the podcast before, but I actually got my current job through the podcast. Like the CTO and CEO had heard the podcast and approached me when they saw that I was looking for work. And um, you know, you're somebody who you've done a really great job with branding.

You've got this Rawkode persona that you have, I think, done a fantastic job of getting that name out there. And you have an employer, and from what I've heard, they're fantastic, but you know, you're not reliant on them, to be able to say what you want to say.

David: Yeah, I guess we're, you know, we're, we're continuing to demonstrate involvement in the community. And I think that's really important for [unintelligible]. Like, especially in open source, I think one of the best things that any developer can do is be engaged in and contribute and build relationships because, you know, the people always come first before the technology. And I think that's true no matter what, and the best way is just to talk to people, have conversations, even just ask people how they are. Especially in today's world, where, you know, most of our conversations are through a browser. Like, you know, use that opportunity. Let's say hey, because we're all, we're all dealing with a rough age.

Rich: Yeah, for sure. Um, there were a couple other listener questions from Eric, um, @ebcarty. First, which projects do you feel are addressing interesting challenges in the cloud native space?

David: Yeah, interesting projects. Um, I'm going to give a shout out to Linkerd. I think the service mesh is one of those under utilized things and it's difficult because it was so buzzwordy for awhile. Um, but I think that their approach to it, which was to make it a good developer experience first was totally the right move.

And I don't think we've seen that from Istio and other competitors. And I'm going to do something completely narcissistic and quote myself now. But I said this on Twitter a couple of years ago, and I was so impressed with myself, it stuck with me. Um, but I said a good developer experience is one where a developer is successful with intuition rather than informed decisions.

Rich: Oh, wow. That is really

David: I know I was really impressed with myself, but I think it's just so true. I mean, how many times do you pick up, like Kubernetes, it's not a good developer experience. So you're never going to deploy and operate a Kubernetes cluster with intuition, but the fact that with Linkerd you can make a lot of, you can just go with it. It just does the right thing, I think is super impressive and what we should be striving for across the entire cloud native landscape. Another project I'm so impressed with right now is Redpanda. I don't think anyone's ever heard of it.

Rich: I've heard of it, but I don't I'm I'm not,

David: It is a Kafka compatible server written in C++ with a bit of assembly. Uh, and you know, we're seeing so many people move to, well, I think we're we're past that now and I think we've seen so many organizations now adopt Kafka as almost the status quo.
But we're running it. Uh, you know, it's a heavy JVM application that requires ZooKeeper, it's quite difficult to operate and you know, the JVM in containers is notoriously painful and you stick them in Kubernetes and it gets worse because we've got stateful workloads in Kubernetes. And seeing something like Redpanda come out where you can use blob storage in your cloud provider of choice. It's C++, it runs well in containers, it's Kafka API compatible. It's one of the most impressive projects I've come across in the last few months.

Rich: That is super cool. Um, yeah, I'm also, uh, also really impressed with Linkerd um, I've chatted with William before and he seems great.

David: Yeah. I mean, you deploy with no special configuration and you get TLS across all your services. That is the developer experience that we should be, we should be striving for. Um, also give a shout to, you know, the folks at Ambassador, who are working on Telepresence, and the Tilt dev folks and the Skaffold dev folks. They're really cool projects where they're trying to bring a good developer experience to building your application locally, but working in a remote cluster.

I think that is the biggest hurdle we need to kind of get through in the next couple of years, because as more people adopt Kubernetes and, and the more microservices we have, you can't run all that stuff locally on your own machine.

They have this like shared environment and the tools that are making that possible, Skaffold, Tilt, Telepresence, I'm sure there are others, but those three, I think, are all doing an absolutely amazing job.

And I can't wait to see what new things come out of that too.

Rich: Yeah, that's fantastic. Um, one more question from Eric. Um, is it preferred to run a database in a Kubernetes cluster or consume it as a managed service?

David: Um, I'm, I'm a big advocate of running stateful workloads in Kubernetes, because I think that's where we need to be. Is it as painless and frictionless as I think it should be? Definitely not. Um, if you're in a cloud provider and you've got attachable storage through EBS or anything similar across the providers, then definitely go for it, run it in Kubernetes and you'll probably be quite successful.

StatefulSets have come a long way. Um, but if you're not in that environment, definitely not. Just run it on a machine and take care of it, give it hugs and cuddles every night, tuck it in and read a story and hope that it never leaves you.

Rich: Yeah. Um, I'm sort of, uh, in the same boat on this one, as I am with um, using managed clusters, which is that, you know, if your cloud provider has a good database that fits your use case, you know, just, just, you know,

David: Yeah.

I think databases are one of those things that if, as an organization, if you're going to pay for something, pay for managed database, use a SaaS product, whatever it is. Find the database you like and just give them some money to support that business. Um, and remove a whole bunch of headaches.

Now you can do it on Kubernetes, it's getting easier, but yeah, I'd probably lean to just, here's some money. MongoDB, Postgres, Cockroach, whatever one it is, please make my life easier.

Rich: Yeah. Agreed. Um, well, it's been super fun chatting with you, David. I'm really glad you uh, tweeted that thing the other day. And, and we, uh, got this scheduled. Is there anything else you want to add here?

David: Uh, I hope that Klustered and all the things that I'm doing with Rawkode Live helps people. Um, and if there's anything else I can do to help, I hope people feel that they can approach me. And my DMs are always open and I'm always happy to help anybody no matter what their problem is. So please don’t hesitate.

Rich: That's awesome. I will definitely put links in the show notes to um, all of your fantastic shows and I do really uh, encourage people to check them out. Um, the great thing about something like Rawkode Live right, is that there's a whole bunch of episodes about different tools and you could just like go and look for the tools that you're interested in and, and watch those.

David: Over 150 in one year. Yeah.

Rich: And Klustered really is a blast. Um, I'm kind of terrified about coming on, but I love watching it.

David: Oh, we have to do that. Come on. Has to happen together.

Rich: Someday. Um, uh, you're uh, @rawkode on Twitter, R a w k o d e. And um, like I said, I'll, I'll put those links in the show notes. Do you have anything else coming up that you want to mention?

David: No, I don't think so. Thanks very much, it was a pleasure.

Rich: Kube Cuddle was created and hosted by me, Rich Burroughs. If you enjoyed the podcast, please consider telling a friend. It helps a lot. Big thanks to Emily Griffin who designed the logo. You can find her at daybrighten.com. And thanks to Monplaisir for our music. You can find more of his work at loyaltyfreakmusic.com. Thanks a lot for listening.
