Fork Around And Find Out

Kubernetes isn’t just a platform—it’s a revolution. On this episode of Fork Around and Find Out, Justin and Autumn sit down with Kubernetes co-creator Brian Grant to explore the origins of this game-changing technology. From Google’s internal tooling to the cloud-native juggernaut it is today, Brian takes us behind the scenes of Kubernetes’ evolution, including its roots in Borg and the creation of CNCF.


Brian also opens up about his fascinating career, from debugging GPUs at PeakStream to improving Google’s threading systems. Along the way, he shares his candid thoughts on Terraform, GitOps, and the future of infrastructure management. We’re talking insider stories, tech critiques, and the cyclical nature of trends like AI—all packed into one unmissable episode.


Brian is a visionary who’s shaped the cloud-native ecosystem as we know it. We can’t wait for you to hear his story and insights!


Show Highlights
(0:00) Intro
(0:31) Tremolo Security sponsor read
(2:42) Brian’s background
(6:20) What it’s like working on something great when the timing isn’t right
(9:17) How Brian’s work from the 2000s is still important today
(11:16) Why Brian said ‘yes’ to Google after previously turning them down
(12:59) The history of the FDIV bug
(16:49) What Brian was doing when his old company was bought by Google
(20:51) How Brian’s education helped him get started down this path
(23:47) Brian’s jump from Borg to Kubernetes
(32:27) The effect Kubernetes has had on the landscape of infrastructure and applications
(35:48) Tremolo Security sponsor read
(36:47) Times Brian has been frustrated at how people use Kubernetes
(41:05) The patterns Brian notices thanks to his years in the tech industry
(48:04) What Brian expects to see next in the new wave of Terraform-like infrastructure tools
(54:58) Reflecting on Brian’s serendipitous journey through the tech world
(1:02:18) Where you can find more from Brian


About Brian Grant
Brian Grant is the CTO and co-founder of ConfigHub, pioneering a new approach to provisioning, deploying, and operating cloud applications and infrastructure. As the original lead architect of Kubernetes, Brian created its declarative configuration model (KRM) and tools like kubectl apply and kustomize. With over 30 years in high-performance and distributed computing, he’s held pivotal roles, including tech lead for Google’s Borg platform and founder of the Omega R&D project. A Kubernetes Steering Committee and CNCF Technical Oversight Committee member, Brian also boasts 15+ patents and a Ph.D. in Computer Science, shaping the future of cloud and computing innovation.


Links Referenced


Sponsor
Tremolo: http://fafo.fm/tremolo


Sponsor the FAFO Podcast!
http://fafo.fm/sponsor

Creators & Guests

Host
Autumn Nash
Host
Justin Garrison

What is Fork Around And Find Out?

Fork Around and Find Out is your downtime from uptime. Your break from the pager, and a chance to learn from experts’ successes and failures. We cover state-of-the-art, legacy practices for building, running, and maintaining software and systems.

Brian: From my perspective, one of the things that it did is it created an infrastructure ecosystem that was broader than any single cloud.

Justin: Welcome to Fork Around and Find Out the podcast about building, running, and maintaining software and systems.

Sponsor: Managing role-based access control for Kubernetes isn’t the easiest thing in the world, especially as you have more clusters, and more users, and more services that want to use Kubernetes. OpenUnison helps solve those problems by bringing single sign-on to your Kubernetes clusters. This extends Active Directory, Okta, Azure AD and other sources as your centralized user management for your Kubernetes access control. You can forget managing all those YAML files to give someone access to the cluster, and centrally manage all of their access in one place. This extends to services inside the cluster like Grafana, Argo CD and Argo Workflows. OpenUnison is a great open-source project, but relying on open-source without any support for something as critical as access management may not be the best option. Tremolo Security offers support for OpenUnison and other features around identity and security. Tremolo provides open-source and commercial support for OpenUnison in all of your Kubernetes clusters, whether in the cloud or on-prem. So, check out Tremolo Security for your single sign-on needs in Kubernetes. You can find them at fafo.fm/tremolo. That’s T-R-E-M-O-L-O.

Justin: Welcome to Fork Around and Find Out. On this episode today, we are reaching a quorum. We have three of us. We have Brian Grant. Thank you so much for coming on the show.

Brian: Hi. Thanks for inviting me.

Autumn: It’s so weird that you didn’t say, ‘ship it.’ Like, it’s just, like, we have made—

Justin: We have forked. We forked the—

Autumn: Oh—

Justin: —podcast.

Autumn: —we did fork. Oh, my God.

Justin: [laugh]. This is brand new open-source. It’s a brand new fork. And again, we’re reaching, I don’t know, the [unintelligible 00:02:14] consensus here, Autumn and I have elected Brian as the leader of this episode [laugh]. And so—

Autumn: Poor Brian’s like, “I just got here. How did they, like, put me onto this new responsibility?” Like—

Justin: Look Brian is responsible for that joke, right? I’m putting all the—

Autumn: Oh, you should have seen the dad jokes that happened before you got here, okay? We were trying to figure out the YouTube Short thing, and I was like, “We don’t even have any shorts.” And Justin’s like, “It’s cold. I have pants on.” And I was like, “Oh, my God.”

Justin: For people not familiar, Brian is one of the three founders, or people that started Kubernetes within Google, and—

Brian: Well, actually we had roughly five. So, there was Joe Beda and Brendan Burns on the cloud side, and Tim Hockin and I on the internal infrastructure side.

Autumn: Does that mean that we, like—that people blame you when they scream at Kubernetes?

Brian: Oh, I am to blame for a lot of things in Kubernetes. We could talk about that. Probably more things than any other single person, maybe. Maybe Tim now because I’ve been away for a while.

Justin: Yeah, were you one of the first people that wrote it in Java? Was that the first—

Brian: No, that was the prototype. Brendan—

Justin: Brendan wrote the first one in Java? That sounded scary.

Brian: [Beda 00:03:26] was very early, and worked on the initial implementation. Also Ville Aikas, who’s at Chainguard now. But he didn’t really carry over once we started building it for real. He did something else. And then there was Craig McLuckie on our product side. But when we started, we weren’t in our location. We didn’t have a manager, you know? We just started working on it.

Justin: And you were coming from internal infrastructure. You were doing Borg, and Omega, and everything else inside of Google and kind of shifted—

Autumn: What’s Borg and Omega? You got to, tell—

Justin: That’s what—I was just going to go into that. That was great.

Autumn: Okay good. Because I feel like sometimes when, like, you work in a certain type of software, we forget that everybody doesn’t know, like, necessarily what that is. So, tell us all the things.

Brian: Yeah. So Google, like many large companies and even a lot of small companies, built all of its own infrastructure tooling. So, it had an internal container platform called Borg, which was actually the reason… the motivation for adding cgroups to the kernel, to the Linux kernel. So, that was just being added and rolled out around the time I joined Google in 2007. So, that was just, like, brand new in Borg.

And the reason Google created Borg was they had two previous systems, Babysitter—which Borg ran on when I started on the Borg project—and Work Queue, which ran MapReduce jobs. So, they had one, kind of, batch queuing system and one service running system. And what they found was that they didn’t have enough resources for all the batch workloads. There were a lot of idle resources in the serving workloads, especially during certain parts of day, in certain regions. So, they wanted to make a system that could run both types of workloads to achieve better resource utilization. Resources were scarce for years and years and years, you know, even though they had a vast fleet of machines.

Justin: It’s funny to think you’re like, “Oh yeah, no, we have hundreds of thousands of nodes, and we have not enough resources.” [laugh].

Brian: Yeah. Yeah because the main services were run all the time, they were adding new services, they were moving services that were previously not on Borg onto Borg, they were moving acquisitions onto Borg. There just weren’t enough resources. Borg project kicked off around the beginning of 2004, so before I joined, around the time that Tim joined Google, I think. And—

Justin: Yeah, Tim just hit 20 years. That’s amazing.

Brian: Yeah. Yeah, yeah. So, I was only there 17 years.

Justin: [laugh]. Slacker. Come on, Brian.

Autumn: You said, “Only,” like, it was not that long [laugh].

Justin: But you joined directly to the Borg team?

Brian: No. Actually that was one of the teams that—so I came into an acquisition before. I did something that was not of interest at the time, which, if you read my LinkedIn page, you may know what that is, but I did high performance computing on GPUs way, way too early…

Justin: [laugh].

Brian: Like 2005.

Justin: Nobody cared. That was the problem [laugh].

Autumn: What’s it like working on something and knowing that it’s going to be great, but it not being the right time? Like, is that so frustrating? Because, like—

Brian: Yeah, I’ve done that a few times.

Autumn: You said a few times [laugh]. Not once, but he’s like, I’ve been in that struggle.

Brian: The startup was PeakStream, and the challenge was that there weren’t people who needed extremely high performance computing in something that was not, like, a Cray supercomputer. Because I did supercomputing in the ’90s and worked for a national lab and things like that as well, and you know, they had their own kind of big, metal machines. But there were people who needed high performance computing, but not of that scale or cost. The people who did were willing and able—and able is an important part—to actually hire experts to squeeze every last cycle of whatever chip they were using.

There are also a bunch of other challenges, like—I mean, at the time we—it was before Nvidia launched CUDA. So, Nvidia was working on CUDA. Our founder was from Nvidia. They called it GPGPU back in those days, General Purpose Computing on GPUs. So, that was starting to attract interest, somebody wrote a book.

PeakStream was one of the companies, kind of, starting in that area, one of the earliest. And back in those days, the chips, I mean, they weren’t designed for it. They were designed for graphics, right, so they didn’t really have a normal computing model. They didn’t do IEEE floating point; they did something that was sort of floating point-ish. And they didn’t do integer computation either because the shaders didn’t need it.

So, they didn’t do 32-bit integer computation. They did some simple computations. Like, indexing into memory: normally, the way that works in a CPU is the memory unit has an adder that takes an integer memory address of whatever the word size is on the machine, like, 64 bits these days, and it does an add of an index, an add and a shift to scale the index. So, if you’re loading something that’s four bytes or eight bytes or one byte, you know, it does the shift appropriately and adds the index. So, that’s an integer computation. You get the memory location that you want. These chips didn’t do that. They actually indexed into memory using floating point.
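
For readers who want to see the arithmetic Brian is describing, here is a rough sketch in Go of the add-and-shift address computation a conventional CPU does for an array access; the base address and element size below are made up for illustration.

```go
package main

import "fmt"

// elementAddress shows the integer arithmetic a conventional CPU's memory
// unit does for an array access like a[i]: one shift to scale the index by
// the element size, plus one add onto the base address.
func elementAddress(base, index uint64, log2Size uint) uint64 {
	return base + (index << log2Size)
}

func main() {
	base := uint64(0x1000) // hypothetical start of an array of 8-byte values
	// 8-byte elements -> shift by 3: a[5] lands at 0x1000 + 5*8 = 0x1028.
	fmt.Printf("a[0] at %#x\n", elementAddress(base, 0, 3))
	fmt.Printf("a[5] at %#x\n", elementAddress(base, 5, 3))
}
```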

Justin: And I make—can I just say, like, already, off the bat, we’re less than ten minutes into this episode, and we’ve already explained [laugh], like, deep chip, like, addition, for how these things are working. And also, like, we got our first, like, ‘um, actually,’ which I feel like I need, like, a sound bite for when you were correcting me on the founding of Kubernetes, which was amazing. I love this already. This is going to be such a good show.

Autumn: Well, also that, but like, for context, when you’re talking about this chip work, how long ago was that, Brian?

Brian: This was 2005.

Autumn: And think about how relevant this is today, and how much people want to get all they can out of chips.

Brian: I looked at the APIs for XLA and some of the recent machine-learning GPU interfaces, and they’re very, very eerily similar to what we did [crosstalk 00:09:16].

Autumn: That’s what I’m saying. Like, how crazy—y’all didn’t see Brian’s face when we said, “How many times have you worked on stuff that, you know, was too early,” but talk about how relevant that is into what people are doing right now.

Brian: Yeah. Well, the thing I did before PeakStream was another interesting hardware-software thing, which was a company called Transmeta.

Autumn: Do you sleep, Brian? Have you slept in the last 20 years? Like, what [laugh]—wait, how long have you been in tech, total because you’ve done some things. We’re only ten minutes in.

Brian: More than 30 years.

Justin: You’re the whole dotcom bubble. That’s awesome.

Brian: Well, I went to grad school during the dotcom bubble, mostly, in Seattle. So like, a lot of students were dropping out to go to Amazon and things like that, in the mid-’90s. One of the first Web crawlers was designed by another student at University of Washington. So yeah, I was watching that, and I don’t know, I didn’t really feel the pull of startups at that time, but when I did approach finishing my PhD, I considered both industry research and startups; I didn’t really think about academia. And the startups did appeal to me more. I did turn down a startup you may have heard of—which is Google—when it was 80 people.

Justin: 80 people [laugh]?

Autumn: You turned down Google at 80 people?

Justin: That’s a startup. You don’t want to go there. It’s who knows what the future looks like for that.

Brian: It didn’t work well for me compared to the other search engines, so I was like, I don’t want to move to California.

Justin: Alta Vista was awesome.

Brian: Yeah, I know. Alta Vista was awesome. And I searched for, like, I need a preschool for my daughter. Like, how do I search for that? And I just—it was just awful. So, I was not impressed by that. And I also was interested in the technical area I was working, which was dynamic compilers, which is, you know what Transmeta was all about. PeakStream also used that, a dynamic compiler. So, I built three dynamic compilers in my career: one in grad school; I worked on one at Transmeta, as well as a static compiler; and PeakStream.

Autumn: Because everybody does that on a Tuesday. Like [laugh].

Justin: [laugh].

Autumn: That is awesome.

Justin: So, what finally led you back to Google to say yes the second time?

Brian: I was acquired.

Justin: Oh, right. You’re right, PeakStream was acquired.

Brian: PeakStream was acquired.

Autumn: You were like, “I didn’t even have a choice.”

Justin: So like, now they’re like, “We still want Brian. We’re going to buy the whole company to get you in here.” [laugh].

Autumn: They were going to get you at, like, some point.

Brian: I did the pitch, so you know, it’s not completely involuntary. The other potential acquirer was… well, I won’t say that, but we did have other potential acquirers. But you know, we hadn’t found product-market fit because customers like high-performance trading or seismic analysis, you know, those kinds of customers could hire the high performance computing engineers to actually build what they needed. So, in addition to all the exotic hardware bugs, which I could talk about for a long time if we wanted to do that because that was super fun, but like, the 1U and 2U server boxes would put the cards in an orientation such that the fans on the GPUs would get in the way, and even if they widened the space between the slots, it would then blow into the motherboard and melt it.

Justin: Sure.

Brian: So, like, this is a thing that happened.

Justin: [laugh]. A small problem.

Autumn: What do you mean, just melting [laugh]?

Brian: Yeah. There’s a lot of heat. It would melt solder, it would melt plastic.

Justin: Well, you’re probably at, like, 300 watts, 400 watts of GPU, even more. Like, that heat got to go somewhere.

Brian: Yeah, and the quality was also a problem because, for graphics, they’re like, just most of the pixels need to be right. If one pixel doesn’t compute the right value, make it zero, and it will be black, and nobody will notice in 1/24th of a second. So, their bar for correctness was not the same as Intel. Like, after the FDIV bug, Intel was just, like, super paranoid about correctness. And so, [crosstalk 00:13:03]—

Justin: Oh man, I was just reading about that bug. That was so big. I completely forgot about that.

Autumn: Can we give the listeners some context? What was the FDIV bug, guys?

Brian: The FDIV bug: there was a bug in the floating point division unit where it sometimes would give the wrong result, and that was not considered acceptable.

Justin: The computer didn’t math, and this was a problem. And it was in the chip itself. Actually, there was a Bluesky thread. I will find it, and we will put it in the [show notes 00:13:27]. Because it was an amazing—they had, like, they decapped the chip and looking at the trace, like, here’s where the bug is, physically on the chip.

Autumn: Y’all are missing Justin’s very excited face—

Justin: I love it.

Autumn: —because his face, it’s like Christmas morning, and it’s crazy.

Brian: Yeah, so that was—at Transmeta, we had a lot of that because the industry was undergoing a lot of change. First of all, we were changing everything. We had a new approach. We were doing dynamic binary translation in software from x86 to a custom VLIW.

Justin: Like, not emulation layer? Like…

Brian: In software.

Justin: Okay.

Brian: It was an emulation in the software, in a hidden virtual machine that the end-user could not access.

Justin: What could go wrong?

Brian: Actually, all that worked, awesome.

Justin: [laugh]. The hardware was the problem.

Brian: Well, the industry was transitioning from 130 nanometer to 90 nanometer, which the leakage characteristics just changed dramatically, and from aluminum wires to copper wires. And we changed our fab to TSMC, a little fab that nobody had ever heard of. And month after month, we were looking at these photos of, from an electron scanning microscope, saying, you know, this is the reason the chips don’t work this month. There’s a thing called the vias. So, the chips are multiple layers, alternating silicon and metal, and the metal is the wire layers that connect all the gates together, all the transistors together.

The metal layers all need to be connected because the electricity comes in on the pins on one surface of the chip and needs to flow through all the metal on the chip. So, there’s a thing called vias, which are holes in the chip that the metal needs to drip down through as part of the process of manufacturing these things, at microscopic, like, atomic-level scales. So, there’s all kinds of things in the viscosity of the metal, where, if it’s not exactly right, it won’t go through the hole because it’s so small. So, if you can imagine, like, raindrops collecting on a sheet of plastic, or something like that, and not falling off, kind of like that.

So, we would see these pictures of, oh, this via didn’t go through, that via didn’t go through. Oh, this one actually went through, and splattered across, and shorted a bunch of wires together. So, we had a bunch of photos like that for, I forget how many months, like, six months or something. It was a long time for somebody trying to get a product out. Yeah. So, that was exciting.

Then once we got the chips back for the 90 nanometer generation, which was the second generation chip design. I had just started, fortuitously, the week that project kicked off, so I was there from the beginning on that chip generation. The software was all new, the static compiler was new, the dynamic compiler was new, the boards were new, the chips were new, like, the fab was new, the process was new, like, everything was new. So, of course, nothing worked, right?

Autumn: Yeah. I was going to say this then trying to figure out what’s wrong is just—

Brian: Yeah. So, we had a 24-hour bring-up rotation, so there’s always people in the lab trying to figure out what’s wrong and working around it. So, my part came eventually, after the hardcore bring-up in the lab, where it’s like, well, we don’t have a clock signal. Why don’t we have a clock signal? Well, the phase-locked loop has a problem. Well, what can we do to electrically make the phase-locked loop work?

Once I got to the point where they could kind of run, I got a board on my desk with a socket I could just open and close, and there were balls on the chips, rather than pins, so I could actually just get a tray of chips and slap one in and close the socket and turn it on and try to debug what was going on. Because different chips had different characteristics. Probabilistically, there’s a distribution. If this is not interesting, by the way, you can stop me any time.

Justin: No—

Autumn: No, it’s so interesting.

Justin: I did not expect this to go this direction, and I absolutely love it. But also, we have so much other stuff I want to talk about. This is, like, 20 years ago, and at some point Google bought the company. Why did Google buy it and what were you doing when you joined? Because you said you weren’t on the Borg team originally.

Brian: I honestly, we had the same investors as Google, Kleiner and Sequoia, and actually, when we started PeakStream, I worked in the back of the Sequoia office for a few months before we found our office, and got a company name, and things like that. I was the third engineer there, but not a co-founder. Effectively, there was some hope that maybe the technology could be useful. And actually, my investigation into the data centers in Borg was one of the things that convinced me it was going to be quite challenging, but also we didn’t find a customer within Google for that, for dense floating point computation at that time. Like, the computations were more sparse for the types of things they were doing back then.

So yeah, we spent a few months talking to lots of people in the company and tried to find something useful, but then said, well, it wasn’t going to be actually useful. So, then I pivoted the team that was brought over to focus on something that was a problem, which was Google’s—about half of its code was C++ and half was Java. And this, 17 years ago, it was just at the beginning of the NPTL, the new POSIX threading library in Linux. Before that, there was this thing called Linux Threads that was terrible and not really usable. So, when Google started—and there was no C++ threading standard, right, so you had to write your own threading primitives, effectively, to do stuff. And you know how memory safe C is, right?

So, Google had developed all of its own threading primitives. They were pretty low level though. And the first engineer hired into Google decided, well, we’re scraping the entire web; we need throughput, which was true. And the chips at the time were the Pentium 4, which was what Transmeta was competing against, which—well, anyway, I won’t go back into CPUs, but determination was made that the most efficient thing to do would be to write a single-threaded event loop and run everything that way. And that was true at the time, but very shortly after 1998, when Google started, multi-core happened. Chips changed everything.

Now, multi-threading is good for CPU utilization and latency. Java had a very strong threading model from very early on, so all the Java code was actually in pretty good shape, but the C code was not. There was a lot of single-threaded code in Google, so I started an initiative to fix that. So, one opportunity was the Borg team. The other opportunity was to make everything on Borg run better.

So, I ended up doing the latter. I started a bunch of projects to help roll out the new POSIX threading library across the fleet, to develop some new, easier-to-use threading primitives, and to develop documentation. I mean, back in those days in Google, it’s like, engineering was maybe 10,000. Biggest company I’d ever worked for at the time. You know, I’d done two startups before that. So, I thought, “Oh, man, Google’s so huge.” Little did I know 17 years later, it would be 20 times bigger.

Autumn: It’s so crazy that you were almost the 80th employee now that, like, a thousand that probably seemed so big in context, but—

Brian: Ten thousand.

Autumn: Oh, ten thousand. But then, like, it’s just massive now.

Brian: Hundreds of thousands, yeah. It was big, but in those days, I could do things like having a company-wide tech talk. So, I did. I started an initiative called the multi-core initiative to actually promote threading in C++ and to make it work better. So, we built a multi-threaded [HP 00:20:30] server, which is still in use—HP server—and some threading primitives, and worked on documentation and thread profiling tools, some annotations in the compiler that would, kind of similar to annotations in Java, where you could identify areas that are supposed to be locked and ensure that the [mutex 00:20:49] was used properly, and things like that.

Autumn: This is a random question, but what was your PhD in? Because how did you get started in, like, GPU and, like, these very in-depth—

Brian: My background was systems. So, as an undergrad, I worked on networking and operating systems and supercomputing. And I also started grad school working in the supercomputing area. I did three summers at Lawrence Livermore National Lab. And I worked on the climate model, and some group communication primitives, and porting to MPI, which was brand new back in those days.

I actually went to one of the MPI spec development meetings. That’s the Message Passing Interface: MPI. So, I transitioned into compilers because there was an interesting project doing runtime partial evaluation, and that’s where you take some runtime values in the program and use that to generate specialized code that just works for those values. You know, so in some cases, you could get kind of dramatic speed-ups from the code from doing that. So—

Justin: You’re doing, like, dynamic tuning for the code? Or was it like—

Brian: It’s not tuning; it’s compiling. So, you know, if you have some computation that uses some value like an integer, there are standard compiler—if that’s a constant, like, five, there are standard compiler optimizations, like, constant folding that will take that value and pre-compute any values that can be pre-computed. If that value is input at runtime, you don’t know what the value is, then you just have to generate all the code. And if there are conditionals based on that code, you have to generate branches and evaluate those conditionals, et cetera. If there were certain values, even data structures that were known to be constant, you could potentially do some pretty impressive optimizations, like unrolling loops, which allows you to pre-compute even more things and reduce the amount of code dramatically.

So, the biggest speed-ups we would get were, like, 10x speed-ups from doing that. So, for example, if you had an interpreter, an interpreter would normally have an execution loop where it would read some operation to interpret, you know, dispatch to something to evaluate it. Like, if it’s an add, it will go add, and return back. So, if you actually gave the interpreter a program as a constant value, what could you do? Well, effectively, you can compile it into the code instead of interpreting it, right?

So, that was sort of the most impressive use case. Not super realistic, but there were some cases like that that could be done. So, to do this, what you have to do is analyze where all the values flowed to that you wanted to take advantage of, and split the program into two: one piece would be a compiler that would do all the pre-computation, and the other piece would be the code that would be emitted once you had the values that were pre-computed.
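
As a toy illustration of that split (not Brian’s actual system), here is a sketch in Go: a generic square-and-multiply power function, plus a "specializer" that consumes an exponent known up front, pre-computes the control decisions, and returns residual code that only takes the remaining input. A real partial evaluator would go further and emit fully unrolled machine code.

```go
package main

import "fmt"

// powerGeneric is the unspecialized version: the loop and the conditional
// re-derive everything from n on every single call.
func powerGeneric(x float64, n uint) float64 {
	result := 1.0
	for ; n > 0; n >>= 1 {
		if n&1 == 1 {
			result *= x
		}
		x *= x
	}
	return result
}

// specializePower plays the role of the "compiler" piece: it consumes the
// known exponent once, pre-computing the sequence of control decisions, and
// returns residual code that only needs the remaining input.
func specializePower(n uint) func(float64) float64 {
	var useBit []bool
	for ; n > 0; n >>= 1 {
		useBit = append(useBit, n&1 == 1)
	}
	return func(x float64) float64 {
		result := 1.0
		for _, use := range useBit { // fixed sequence, no dependence on n
			if use {
				result *= x
			}
			x *= x
		}
		return result
	}
}

func main() {
	cube := specializePower(3)                   // specialization happens once here
	fmt.Println(cube(2.0), powerGeneric(2.0, 3)) // both print 8
}
```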

Justin: And over and over again, it seems—I mean, like, your graduate work, your doctorate, all this stuff, like, you’ve been doing this optimization over and over again, and now you’re at Google, you’re doing this C++ multi-core, we got to do this thing. Let’s fast forward to, like, where do you go from, you’re doing Borg stuff to, like, this Kubernetes thing comes out? Like, this is, like, hey, we want to do something else that’s going to—we want to open-source it, we want to do something generalized from the stuff we were doing internally.

Brian: Okay, yeah. I mean, after about a year-and-a-half, I got about as far as I could on making things multi-threaded, so I transitioned to the Borg team in 2009. In short order—Borg was about five years old at that point in time—it was clear that it was being used in ways it wasn’t really designed for. So, I came up with an idea for how to rearchitect it. I started the project called Omega, which was an R&D project to rearchitect it. Worked on that for a while.

After a couple of years, cloud became a priority. I mean, before that, it was not really a priority for the whole company. You know, the Cloud team was a pretty small team in Seattle, and most of Google was down in Mountain View in the Bay Area. Google had App Engine for a couple of years, but kind of core cloud product that people would think about would be the IaaS product, the Google Compute Engine.

Justin: And my understanding was, App Engine was basically, like, a customer front to run jobs on Borg directly, right? It was so restricted that it didn’t have a layer in between.

Brian: Did have a layer.

Justin: Okay, s—

Brian: It had a pretty elaborate layer, actually. And Cloud Run shares some DNA with that.

Justin: But the restrictions behind App Engine were because of that platform layer between because it was just, like, you have to architect your application in a specific way to make it run here, and we will take care of all the infrastructure side of it for you.

Brian: Yeah, a big part of that is—well, there’s multiple parts of it. One is sandboxing, so you can actually run stuff multi-tenant before we had hardware virtualization primitives that could be used for sandboxing, like, in gVisor. I mean, eventually it moved to gVisor. But before there was gVisor, there was, like, a ptrace sandbox or something like that. But all the networking stuff in Google is different and exotic. You know, like things don’t communicate by HTTP. You know, they had some RPC system that sort of predated use of HTTP as a standard networking layer. None of your normal naming, DNS, service discovery, proxying, load balancing, none of that stuff works, right, because they have all their own internal stuff for that.

Justin: Yeah. Yeah, all the internet things are just like, “Ah, that’s not for us.”

Brian: And the compute layer, you know, they did want the sandboxing to be really strong, so, yeah there are a bunch of reasons for the restrictions, but they’re, you know, based on the technology at the time. And it’s before Docker containers, things like that.

Justin: Right. Yeah, but—so you have this research project, basically internal, like, to rearchitect Borg into this Omega thing, and then what happens there? Like, where’s that transition?

Brian: I mean, it turned out to be not worth it and somewhat infeasible to roll it out. We did partially roll out pieces of it, internally, and it kind of made things more complex operationally during the transition, and it just didn’t provide enough value. The install base was really big, and it was growing faster than we could write new code, and that was a time when the Borg ecosystem was just exploding internally, and new things were being added at all the layers at a very rapid pace. And changing the user interface, which was actually one of the most problematic parts of it, was also pretty much a non-starter because there were, like, zillions and zillions of lines of configuration files and about a thousand clients calling the APIs directly. So, it was just too much.

So, some of the ideas were folded back into Borg, like labels and Watch, which, if you know Kubernetes, may sound familiar. And other parts were turned down, but Kubernetes—you know, as cloud became important. GCE, Google Compute Engine, GA’ed at the end of 2013. And that was also when Joe Beda kind of discovered Docker, and said, “Hey, look at this Docker thing.” Management directors and above were kind of trying to figure out, how can we apply our internal infrastructure expertise to cloud, now that it’s becoming a priority?

Brian: So, I shifted off in that direction, and we started exploring, well, there’s this group put together by a couple of directors called the Unified Compute Working Group, and actually the original motivation, nominally, was to produce a cloud platform that Google could actually use itself someday. Because App Engine was considered too restrictive, and Google Compute Engine was VMs, and Google had never used VMs. Like, it just skipped that. It used containers, more or less, processes, Unix processes from the beginning, so there was no way they were going to use VMs. They’re, like, way too inefficient, they’re too opaque, they’re hard to manage; it had to be container-based.

So, you know, some of the original things were, yeah, it should be like, Borg. And I’m like, wait, wait, wait, wait, wait. We just spent years trying to [unintelligible 00:28:31] Borg. Let’s not do it just like Borg. Kubernetes actually ended up being open-source Omega, more or less, based on a lot of the architectural ideas, and some specific features, even, like, scheduling features. So, some of the more unusual terminology was just lifted whole cloth from Omega, like taints and tolerations, for example, as just one example. So, there were a bunch of things from Omega we just simplified.

Justin: Wasn’t the pod aspects in Omega? Like, the grouping?

Brian: It was, yeah. So, the pod was one thing I felt was really important, and I tried to introduce it into Borg around 2012. That was super hard to introduce at the core layer, since the ecosystem was so big. But Borg’s model was, it had a concept called an alloc, which was an array of resources across machines, and the idea was that, you know, it’s kind of your own virtual cluster that you can schedule into.

But nobody, almost nobody, used it that way. What teams did was they had a set of processes they wanted to deploy together, usually an application and a bunch of sidecars for logging and other things, and they wanted those things deployed in sets. So, you know, I talked to the SREs, and they said, “Ah, we just want this.” That led to the concept, which, at the time, was called Scheduling Unit in Omega, and for the experiment in Borg. And that was just what in Borg was a set of tasks from jobs. And tasks weren’t even a first-class resource in Borg. Jobs were the first-class resource, and jobs were arrays of tasks.

So, you had this weird, challenging model where you had an array of resources across machines, and you had multiple arrays of tasks that you wanted to pack into those. So, if you needed to horizontally scale, you needed to grow your alloc first, and then grow your jobs after. And if you wanted to scale down, you had to do it in the opposite order. And technically, you could do it in either order, and things would just go pending and not schedule into the allocs, but that created a lot of confusion, so people tried to avoid that. But the pod or scheduling unit primitive was just a lot easier for how people were using it. I have this set of things. I want those deployed together, just as if they were on one machine. Just do that. If you want to scale, that’s a unit to scale by.
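
To make the pod idea concrete, here is a minimal sketch using the Kubernetes Go API types: an application container and a logging sidecar declared together, so they get scheduled onto the same node and scaled as one unit. The names and images are hypothetical.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// One pod: the application container and its logging sidecar are declared
	// together, so they land on the same node, share fate, and scale as a unit.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "example-app",
			Labels: map[string]string{"app": "example"},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "app", Image: "example.com/app:1.0"},
				{Name: "log-sidecar", Image: "example.com/log-shipper:1.0"},
			},
		},
	}
	fmt.Printf("pod %q has %d containers\n", pod.Name, len(pod.Spec.Containers))
}
```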

Justin: I remember, like, in like, Mesos time, like, it was always like, oh, well, don’t try to schedule things together. Just write better code. And I’m like, that’s not how the real world works. Like, that’s [laugh]—

Brian: Yeah, we had a bunch of cases, like, we have very complex [fat 00:31:01] clients for interacting with storage systems that were just super challenging to rewrite in all the languages. And Google restricted the languages you could use in production. For a long time, it was C++ and Java. Python was added, but it wasn’t as widely used, not for serving workloads anyway. It was used more for tooling. Eventually Go came around, but you know, that was decades later.

But rewriting the Colossus client to interact with the distributed files, for example: you know, if that’s tens of thousands of lines of code, you don’t want to do that multiple times. So, how those things evolved over the years, I mean, eventually there was a way that was created for running those things without the normal sidecar model. They would all do it in the same container, effectively. But there were a bunch of reasons to have sidecars.

Justin: And for anyone that wants to read more about this—I mean, we’ll link your blog posts in the [show notes 00:31:59]—“The Technical History of Kubernetes” is a collection of a lot of your old Twitter threads that gather a lot of these pieces together, which was a great combination of them all in one spot.

Brian: “The Road to 1.0” post also has kind of a different perspective. It was more once the Kubernetes project started, how did it evolve for the first couple of years.

Justin: Now, I’m going to make a jump here. Kubernetes is 10, going to be 11 years old.

Brian: I mean, for me, it is more than 11 years old.

Justin: Yeah, exactly. You’ve been on it for a while. But like, as far as, like, the open-source, the official, you know, stance of it hitting ten in 2024: what has this shift, from what Google was trying to do with Kubernetes initially to the open-sourcing and everyone else also using this in various places, what has that done to the landscape of infrastructure and applications?

Brian: From my perspective, one of the things that it did is it created an infrastructure ecosystem that was broader than any single cloud. Because at the time we started Kubernetes, there was the AWS ecosystem, and that was pretty much it, right? Like, obviously, before GCE was GA’ed, Google had pretty negligible usage and no measurable market share, I think. And that was the time that the Kubernetes project started. Even Azure wasn’t really very, very present.

And even now, ecosystem-wise, I look at Infrastructure as Code tools, for example, there are a bunch that work for AWS only, and there aren’t very many things that work only on the other clouds in the open-source ecosystem, at least. But Kubernetes sort of created its own island, where you could have this rich ecosystem that works pretty much anywhere, it works on-prem, it works on any cloud. People have differing opinions on whether it’s a good thing or a bad thing, but I view it as mostly a good thing that you have a large ecosystem of tools that work everywhere, and that was not the case before. And especially for the people who are on-prem, the thing that was available before was Mesos and OpenStack.

And Mesos, in my opinion, it’s kind of overly complicated. The scheduling model just didn’t work at a theoretical level, and the open-source ecosystem was not as strong. Like, a lot of the big users just built their own frameworks and then open-sourced them, and that’s sort of death to the ecosystem. But, you know, even those who did, the tooling was not compatible across frameworks, so it’s just super fragmented. So, it didn’t really have the potential to grow this sort of ecosystem that Kubernetes did.

And then when we created the CNCF, you know, taking inspiration from what happened in the JavaScript area, where there was the Node.js Foundation, and I forget what the foundation was before they unified, but there was another foundation. And a couple of things like Express went into the Node.js Foundation, but most other projects were not accepted into that foundation, so they had to find a home in some other foundation, and that was really awkward. So, one thing I wanted to do with CNCF was ensure there was a home for all those other projects.

Before CNCF was really ready, the Kubernetes project itself kind of became an umbrella project and took on a bunch of those projects. Like Kubespray, for example, for setting up Kubernetes clusters with Ansible. But, you know, as soon as we created the—initially, it was called Inception, I think, but then, you know, it became the Sandbox—then kind of the doors really opened to all those projects. So, I think that’s been very positive for experimentation and developing of new things. You know, it does give you a paradox of choice, it makes things a little bit hard for figuring out what you should actually use versus what’s available, but overall, I see it as a very healthy development.

Sponsor: Running Kubernetes at scale is challenging. Running Kubernetes at scale securely is even more challenging. Are you struggling with access management and user management? Access management and user management are some of the most important tools that we have today to be able to secure your Kubernetes cluster and protect your infrastructure.

Using Tremolo Security with OpenUnison is the easiest way, whether it be on-prem or in the cloud, to simplify access management to your clusters. It provides single sign-on and helps you with its robust security features to secure your clusters and automate your workflows. So check out Tremolo Security for your single sign-on needs in Kubernetes.

You can find them at fafo.fm/tremolo. That’s T-R-E-M-O-L-O.

Autumn: I feel like you guys did a great job with almost unifying a lot of things and just kind of having, I don’t know—were you and has anybody ever done anything with Kubernetes that you were just, like, almost offended by that it’s so—

Justin: [laugh].

Brian: [laugh].

Autumn: This is your baby. You’ve seen it go from so many—I have, like, three questions, but I want to start here [laugh].

Brian: Well certainly, there were a lot of things I was very—that I didn’t really imagine that I was very happy about. Retail edge was one of these scenarios where I wanted to make sure Kubernetes could scale down. Borg, I think the minimum footprint is, like, 300 machines or something at the time I worked on it, so there’s no way it could scale down to something you could just run. And Mesos kind of had that problem, too. It had a lot of components, multiple stateful components. Cloud Foundry required a bunch of components.

So, I wanted it to be able to scale down to one node, so it just has one stateful component, which is etcd. It doesn’t have, like, a separate message bus, you know, although that was a design that could be considered. But the reason was for doing kind of local development, like, Minikube or Kind type things, mostly. Retail edge was sort of really fun; you know, it’s in every Target store, it’s been on spacecraft, and ships, and all kinds of other places I never really imagined. In terms of offended, you know, I remember one time—

Autumn: Like, have they ever made it, like, overly complicated when you were trying to make it simple or just something that you’re just like, “Dude, I was trying so hard to prevent this.”

Brian: There is. I mean, early on, I was very concerned about fragmentation, which is why I helped create the conformance program. So, all the attempts to sort of fork it and do something a little bit different, and there were some cases like that where some people said, oh, I just want to run the pod executor. I just want to run Kubelet, but I need to make changes. No, no, no. You actually need to make sure that the API works.

When Kubernetes was sort of young and vulnerable, I think that was a big concern I had. Or other cases, like, the Virtual Kubelets, you know, I didn’t want to fork the ecosystems. Like, oh, only certain things work with Virtual Kubelets, or only certain things work with Windows. So, on Virtual Kubelet, I kind of started sketching a bar for what I thought would need to be required for compatibility.

Justin: Minimum Kubelet [laugh].

Autumn: I honestly think that your work in that aspect really shows, though because even when people say that Kubernetes is difficult, there’s a reason why so many people use it because it really does have that whole ecosystem that is really, kind of—I think open-source can be so political, and the fact that there’s so many different projects, but they all kind of align is really impressive. Were you involved in the naming because, like, Kubernetes naming, like, just cracks me up.

Brian: No. Honestly, the naming was outsourced. There’s, like, a search for potential names and a trademark search, and things like that. That aspect is pretty boring.

Justin: Lawyers got involved. And [laugh]—

Brian: Yes.

Autumn: [laugh].

Brian: You know, it couldn't be named what the code name was, so, you know, that was never a contender.

Autumn: But did you have any influence on the fact that it’s Greek, right? Like, all the different—

Brian: I mean, it did start a trend. Istio, for example, for a while, everything was getting a Greek name.

Justin: I now work at a startup that has a Greek name. This is how this works [laugh].

Autumn: That’s what I’m saying. Like, I feel like, just the continuity of the naming started, kind of, a lot of the way that people start choosing to name their open-source projects. And kind of, you almost make sure you could relate the fact that these projects were related by their naming, you know? I thought that was cool. It seems like Kubernetes was the first to really do that.

Brian: Docker did it as well. There were a bunch of shipping analogies and… and Helm sort of followed that pattern.

Justin: I mean, themes were big for any technology. Like, config management, you had Puppet, and you had Chef, and you had all these, like, words that, like, oh, it has to be the cookbook and the—

Brian: In my opinion, Salt took it to an extreme.

Autumn: Kubernetes had so many though. And the fact—

Justin: Salt with the pillars and everything else, yeah, you’re right.

Autumn: [laugh]. That’s true. Okay, so with your experience, right, you’ve gone through the chips, you’ve gone through supercomputing early, you were in the, you know, C and Java, and now, with people wanting to rewrite everything—you saw when they wanted to rewrite everything in Java, right, now, everybody wants to rewrite everything in Rust, right? You saw supercomputing before it was cool, and now everything is chip boom, AI. Are there patterns that you see that, like, either you’re excited about or alarmed about, or is it weird seeing it go from where you started with all these things? And it’s kind of like the same but different?

Brian: [Kind of 00:41:39] same, but different aspect is, you know, I think what keeps software engineers employed, so I can’t argue too much with that, but redoing the same things over and over in slightly different and hopefully better ways is, I think, something that will continue to exist. Like, now, everything with AI, right? So, it’s very reminiscent of the dotcom bubble in that sense, where everything’s like a retail store, but dotcom. Mostly, there were a few big winners there, like, you know, Amazon, eBay, but you know, most of the companies did not succeed. A lot of the kind of existing companies got their act together and put together a web storefront, right, and now that’s easier than ever. So, I think AI will kind of be similar where, you know, there’s a bunch of startups that are experimenting in cases where they are sort of doing something that people already do, but just with AI.

Justin: Sprinkle little AI on it. Yep [laugh].

Brian: That will probably end up being a product feature. In the positive case for them, it will end up being an acquisition that makes it into an existing product. It is super challenging for big companies to innovate, certainly a challenge that Google has, I think. Honestly, Google always had it. So, if you think about what are the big products at Google, a lot of them are acquisitions. Even the things you think of, like, Google is all about ads; I mean, most of that technology is acquisitions.

Justin: Yeah, DoubleClick, and—yeah. I always find it interesting where it’s like, it’s not that you can’t innovate at a large company, it’s that it’s really hard to get that to actually have impact because I know so many cool, innovative internal projects that have been at all these big companies, but the only way they get it to be an impact at the company is they have to leave, go make a startup, and they get bought by the company, and [laugh] now they have a say of like, oh, now it’s the innovative thing that I was doing here ten years ago, but you didn’t believe me.

Autumn: That’s also how we reward certain innovation. Like, people are always trying to figure out the projects that go in their promo doc. And if you don’t reward a certain type of innovation, you’re—

Justin: Yeah.

Autumn: —almost strangling it.

Justin: That system is very rigged for a certain type of innovation.

Autumn: And it’ll be, like, the dumbest projects that they waste the stupidest amount of money on, and it has absolutely no value, and then—when people talk about empire-building, you know what I mean?—and then somebody actually built something that’s helpful and cool, and they have to go [laugh] [unintelligible 00:43:56] and come back. I mean, like, even look at Meta. Like, it most—look at all the acquisitions they’ve done.

Brian: I mean, a lot of times, when these things start, like PeakStream, for example, it’s not clear that something is going to be—whether it’s going to succeed, whether it’s going to be important. It’s a risk, right? Like Nvidia played a really long bet on compute on GPUs. ATI, at the time, decided not to do that, and they ended up getting acquired by AMD. And AMD doubled down on graphics.

And they actually won all the consoles, laptops, mobile phone deals, like, all of them away from Nvidia at that time. For a long time, basically the national labs were the customers of that stuff, but now, it’s everybody, so the long bet has really paid off. But that really requires a lot of faith, I think.

Autumn: It’s crazy how—you know how, at one point, Apple invested in—wait, was it Windows invested in Apple, right? And then how AMD was doing better than Nvidia at one point, you know? Like, just the way that the—just, it’s so hard to know what is going to work out, you know? Like, look at where we’re talking about the dotcom, and remember when we had Rich on Ship It, and he was talking about how huge WebMD was—

Justin: WebMD, yeah.

Autumn: —right? And then we were just talking about Amazon versus eBay. Who even buys stuff on eBay, anymore [laugh]?

Justin: I just bought stuff on eBay. What are you talking about [laugh]?

Autumn: You and, like, five other people [laugh]. You know what I mean? Like, Yahoo was so big, and now nobody uses that, and it’s just crazy. And I feel like I haven’t even been involved in tech that long, and I can’t even imagine the things you’ve seen in 30 years, Brian. Like, you’ve seen it go—

Brian: So, as far as doing things too early, multiple times, Transmeta’s chips were low power, general purpose computing chips, and they went into devices like ultra-light laptops, tablets, wearables, smartphones, in 2000. The year 2000.

Autumn: No way. Did you bet on anything or really believe in anything, and then nobody thought it was cool, and then now you’re like, see, [laugh] like, I told you.

Brian: Well, so in Transmeta, yeah. And I really liked what Transmeta was doing. And that was kind of my dream job because in school I had electrical engineering classes, and computing classes, and things like that, but I started programming when I was ten. The first computer was a kit computer that my dad built, a 6502-based KIM-1 kit computer. And it had no persistent memory, no persistent disks, nothing, and no ROM with firmware.

So, every time you turn on the power, it’s a clean slate. There’s nothing. There’s no assembler. There’s nothing. It just had an LED display and a hex keypad. So, I would have to type in the program from scratch every time you turn on the power. And back in those days, Byte magazine would have 6502 assembly programs, and I would have to manually—

Justin: Flip them all [laugh]?

Brian: —manually assemble them, and type in the hexadecimal machine code and then run it. But anyway, when we got an Apple II, we’d turn on the power, and there would be a prompt, right? There would be a program running, and that was just so amazing for me. So, you know, Transmeta, I really learned, from the time you turn on the power, what happens, how does the computer work? Like, I worked on the code that decompressed the firmware out of the ROM, for example. I worked on frequency-voltage scaling. I worked on the static compiler.

So, we had software TLB handlers that ran through my static compiler. Like, I dealt with things all at, like, this crazy, super low level. If the instructions didn’t get scheduled right, well, the chip had no interlocks. What an interlock does is, if you have one instruction that writes a register, and another instruction that reads from that register, an interlock will stall the CPU pipeline until that register value is written. There’s, like, a scoreboard that keeps track of these things. Transmeta’s chips, in order to be low power, tried to cut circuit count, so they didn’t have interlocks.

Justin: That leads me right into, like, the last thing I want to talk about here because we have this—Kubernetes thing exists, we have this extensible API that you helped make it conformant so it is consistent for everyone in whatever environment they’re in, and in one of the ways that we’ve been seeing with that is this notion of using that API and this notion of control loops to do more infrastructure managements, things like Cloud Connector at GCP, ACK at AWS. And they’re reimplementing some of that, like you mentioned in, like, very cloud specific, like, this is my cloud implementation of this thing because I know the APIs. And in most cases, those are now generated from the APIs, right? Like, we’re not manually writing this stuff out again. Like, with Terraform, we had to do a lot of manual stuff to make providers work. And there’s this new wave of Terraform-like things that are happening, which is also, again, you started taking a risk there and looking into this more, and what do you see coming in that area next?

Brian: Well, for the Kubernetes-based controllers, and in general, what I’ve seen, I came up with the idea for what became Config Connector around the end of 2017, when Kubernetes initially had third-party resources, and then that was redesigned to Custom Resource Definitions, CRDs. CRDs were in beta for a really long time. It had a lot of features that were hard to get to the GA level. But it was starting to become popular at that time. People were writing controllers to manage, like, S3 buckets and individual cloud resources.

I saw it as a way to solve a couple of problems for Google. And Google had a Deployment Manager product that had a bunch of technical and non-technical challenges at the time. Kubernetes and Terraform started at the same time, so Terraform was still pretty early in 2017. You know, Ansible was way more used at that time than Terraform. We did have a team that had started to maintain Terraform, and it had a semi… I would say, semi-automatic ability to generate the Terraform providers from the APIs. And that still remains true. It’s still semi-automatic, it’s not fully automatic.

And I actually wrote a blog post about some of the challenges with APIs that make it hard to automate. And I don’t think Google’s APIs are the only ones that have these issues. Kubernetes was growing a lot by the end of 2017. I think that’s when AWS launched EKS, and VMware, and, you know, pretty much everybody, even Mesosphere, had, like, a Kubernetes product. So, it seemed like with a Kubernetes-centric universe, maybe it would be something you would want to do, and it would provide that more consistent API that you couldn’t get from the providers, so something you could build tooling against.

You know, there are some big Google Cloud customers that adopted it, but overall, not remotely as many as have adopted Terraform. And it’s much less popular, especially for—even amongst GKE customers, it’s not nearly as popular, and most of those platform teams know Terraform. They’re used to Terraform, so they manage infrastructure with Terraform. I think the one potential sweet spot for it is resources that application developers would need to interact with, like a database, or a Redis instance, or a message queue or something like that, from the cloud provider, where you could, in theory, provision it using the same sort of tooling you use to deploy your app. Although, you know, these days—people used to love Kubernetes in the early days, that was always very gratifying.

Some users would say, you know, it changed their lives and things like that. These days, with the larger number of people using it, you get some people who don’t love it as much. You know, anything widely used has that. Terraform has that, too. Helm has that. But yeah, it just hasn’t really materialized, people managing resources there.

Crossplane is probably the most prevalent way, although, you know, not on GCP, because GCP customers want to use something that’s supported and that GCP endorses, and things like that. So, ACK and the Azure Service Operator, I’d be interested to know how many users there are, but I’m just looking at, kind of, social media posts and things like that.

Justin: I feel like it kind of came out of this notion, especially in, like, the serverless world, where once you deploy a Lambda function, you’re like, oh, I need my queuing system, and my S3 bucket, and my database, and I want them all deployed from the same CloudFormation stack. And people were like, oh, I could replicate the same thing with containers, and get that same sort of feeling of, I don’t care about the infrastructure. But someone has to care about how that infrastructure got there, and who runs those controllers, and how they’re authenticated, and where they go. And usually that used to be a service of something like CloudFormation, and now it’s something where, oh, the platform team has to run 87 different controllers for every different connection that we want [laugh] to put in there.

Brian: Right, yeah. And upgrading controllers and CRDs is still pretty challenging. I actually wrote a blog post about using KRM, the Kubernetes Resource Model, for provisioning cloud infrastructure as well. There are a bunch of challenges with using the Kubernetes tooling. Like, a lot of the cloud APIs are designed so that you call one, it gets provisioned, some IP addresses are allocated or something, and you get that back in a result. That may take 20 minutes, it may take a long time. Then you need to take those values and pass them as inputs to another call.

And that requires orchestration at a level that—you know, in Kubernetes, the controllers are all designed so you just apply everything and the controllers sort it out. And if you don’t design your infrastructure controllers to do the same thing, the Kubernetes controller functionality doesn’t actually work. So, like, if you deploy a set of resources with Helm, and you can’t actually provision one thing until the other thing is already provisioned, and your controller doesn’t do the waiting, Helm’s not going to do the waiting. Like, you’re just hosed. So, you could actually do that, you know, if you design the controllers to work like the built-in controllers in Kubernetes.
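
Here is a minimal sketch of the difference being described, with invented resource names and timings: the dependent resource reports “not ready yet” and gets retried, so applying everything at once still converges instead of requiring the user to order the calls.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Two hypothetical resources: a network that eventually yields an IP, and a
// database that can't be created until that IP exists.
type network struct {
	ip    string
	ready bool
}

var (
	net      network
	attempts int
)

// provisionNetwork simulates a slow cloud call that only completes after a
// few reconcile passes.
func provisionNetwork() {
	attempts++
	if attempts >= 3 && !net.ready {
		net = network{ip: "10.0.0.5", ready: true}
	}
}

// provisionDatabase needs the network's output; if it isn't ready yet, it
// reports that so the loop retries later instead of failing outright.
func provisionDatabase() error {
	if !net.ready {
		return errors.New("network not ready, requeue")
	}
	fmt.Printf("creating database reachable at %s\n", net.ip)
	return nil
}

func main() {
	// "Apply everything and let the controllers sort it out": keep reconciling
	// both resources until the dependent one finally succeeds.
	for {
		provisionNetwork()
		if err := provisionDatabase(); err != nil {
			fmt.Println(err)
			time.Sleep(20 * time.Millisecond)
			continue
		}
		break
	}
}
```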

That’s a lot more work because the APIs don’t work that way. If you wrap the Terraform providers, they don’t work that way, right? So, that’s another big layer that you would have to build in your, sort of, meta controller over the underlying controllers to actually make that work. And you know, there ends up being this demand, for the people who do adopt it, to have every infrastructure resource they want to use covered by it. So, all the work just goes into that, and the work doesn’t go into, like, fixing the usability problems. So, I think Crossplane has at least a partial solution to that, but you have to do it in their composition layer, so the user of Crossplane has to specify those dependencies, at least in some cases. That just makes it feel more like Terraform again.

Justin: Yeah. You’re basically just making a new module, right? It’s just, like, a module in a different form.

Brian: Honestly, I don’t think it’s ever going to be dramatically more popular than it is right now to do it that way. There just aren’t enough benefits. There are some benefits, but they are kind of killed by how people use it. So, for example, the composition layer in Crossplane effectively is a templating layer, so now you can’t just go change the managed resources directly because it will create drift with the composition layer. And if you need to template the composition resources using Helm, now you’re storing it in a Git repo in some templating form, Go template format or whatever, and that’s hard to change and hard to write, right, so you can’t build tooling on top of that.

The big benefit of using KRM could be that you could actually build controllers or tooling that actually just automates the generation and editing of those resources for you. The way people use it, they pretty much destroy that potential benefit of using a control plane.
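
A small, hypothetical example of that benefit: when a resource is plain structured data rather than a rendered template, a tool can read it and edit one field directly. The “Database” resource and its fields below are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A resource kept as plain structured data can be read and edited by other
// programs; a Go-template string cannot be safely patched the same way.
func main() {
	resource := []byte(`{
	  "kind": "Database",
	  "metadata": {"name": "orders"},
	  "spec": {"tier": "small", "replicas": 1}
	}`)

	var obj map[string]any
	if err := json.Unmarshal(resource, &obj); err != nil {
		panic(err)
	}

	// An automated tool (an autoscaler, a cost optimizer, a migration script)
	// can change one field without having to understand anyone's template.
	spec := obj["spec"].(map[string]any)
	spec["replicas"] = 3

	out, _ := json.MarshalIndent(obj, "", "  ")
	fmt.Println(string(out))
}
```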

Autumn: I have a question. So, you know how you said that chip job was your dream job at the time, right? What’s it like having a career as long as yours, and doing the things that you’ve done, do you just keep getting the next dream job? And what was your favorite out of those dream jobs, you know?

Brian: Yeah, it was pretty serendipitous. I wish I could say, like, I really planned my career, but I really didn’t. I loved all the jobs. I loved Transmeta and PeakStream. They were amazing and awesome. I learned a lot, and it was very exciting for a while.

And then, you know, Google, working on Borg, and especially Kubernetes—Kubernetes, and CNCF, is definitely the most industry impact anybody could pretty much ask for. So, for the next thing I’m planning to do, that was definitely a consideration when I spent six to nine months deciding what I wanted to do. The opportunity to have industry impact again, will it be as big as Kubernetes? Mmm, maybe not, but it could become—you know, it has that potential.

Justin: We just went down this whole deep path of, if anyone doesn’t know what Crossplane is, and doesn’t know what Kubernetes is, and doesn’t know what ACK and Config Connector are, I’m sorry we didn’t explain that very well. But basically these are all—in the book I wrote at the time, we called it Infrastructure as Software, where it’s basically like Terraform in a for loop that keeps applying something or driving to a state. And what you were describing was all of the pains that I’ve lived over the last decade of trying to template Helm and all these other things of, like, oh well, you know what? Like, at some point templates aren’t good enough, for all the reasons you just—like, the configuration drifts, and the ability to do complex things, and all that stuff just becomes really difficult. But as a user, I just want the template. I just want the, give me some sane defaults, and I just give you a little more data for what I want.

But, in my head, what you just kind of described—and my last question here is, what I’m kind of curious about is, what you were describing, of all of these problems, how that relates to something like System Initiative, where System Initiative took a different approach of it’s not Terraform, it’s a direct model to database, sort of—the UI, the GUI on top of it is a representation of the actual infrastructure, based on the actual API calls and what’s actually in the database. And being able to modify those things directly is one of its strong points, from what I’ve seen. Is that what you’ve seen as well? Is that something that you think is the actual ultimate goal?

Brian: Well, I definitely think that Infrastructure as Code as we know it has reached a dead end, more or less. I think in my entire career of more than 30 years, what we’re doing today feels very similar to what I did in the late-’80s. You have some build-like process that generates some stuff that you apply to some system, and the actual details of the syntax, and the tools, and whatever have changed a little bit, but it feels pretty much the same as what I did in college. So, I understand the reasons for how we got there. It’s pretty expedient.

I don’t mean it in a disparaging way. I actually mean it in a very complimentary way, but Infrastructure as Code tools were easy to build. They really hit a sweet spot in terms of making it easy. For example, Terraform, the orchestration it does is pretty simple, the compilation it does is pretty simple, the model is pretty straightforward. The providers are pretty easy to write.

They don’t ask too much of the provider author. And even for using it, it feels like scripting. Need to provision a few resources? You can write some Terraform. Once you learn the language—except for some baffling decisions, like deleting stuff by default—it works in a mostly predictable way, right? So, it’s pretty expedient, you know, it’s a pretty useful tool. It got pretty far.

But at scale and for some people, it’s not that easy to use. And actually adding that kind of scripting layer on top of the APIs, much like Crossplane and the other [unintelligible 00:59:09]-based tools where people are, you know, using Helm on top of it, compositions on top, kind of takes away the power of the APIs. So, APIs as the source of truth is what enables interoperable ecosystems of clients and tools to interact with those APIs, right? You publish an API, and you can build a GUI on top, and a CLI on top, and automation tools on top, terminal consoles, and all kinds of cool things, ChatOps, whatever, like, you can build all that on top. But if you wrap it and say, “No, no, no, you have to go out to Terraform, and check it into Git, and get it reviewed,” you’re saying, “No, you can’t do that anymore,” right? And I think that’s a huge limitation to what we can do with Infrastructure as Code. And it’s not just Terraform. That’s just the most popular one. The same is true of Pulumi, and anything else out there.

Justin: And that was just some, like, deep-seated, GitOps is not the answer you’re looking for, sort of like, vibes there [laugh].

Brian: Yeah, I think GitOps—I have a couple of blog posts about GitOps. GitOps, I think, solved certain problems. The core benefit that I see from GitOps—I mean, for retail edge, it has a networking benefit, so there’s, like, a specialized benefit there, and if you have a large number of targets, you need something that retries better than a pipeline and stuff like that. But what GitOps does is it creates a one-to-one binding between the resources that are provisioned or created in Kubernetes, if you’re talking about GitOps for Kubernetes, and the source of truth for that configuration, right? So, I think there’s value in that, especially in the world where you’re saying you have to go change that configuration to do anything.

The unidirectionality of it, where if you want to make a change, you have to change your configuration generator, program, or template, or you have to change the input variables, you have to check that into Git, go through your CI pipeline to deploy it, and that is… very restrictive. It’s very slow. It creates a lot of toil. Why do I have to go edit Infrastructure as Code by hand, right? So, different people are exploring different solutions for not writing the Infrastructure as Code by hand, like, you have the Infrastructure from Code tools that are generating it. I don’t really think that’s the answer.
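
A minimal sketch of that unidirectional flow, with the Git repo and the pipeline collapsed into in-memory maps for illustration: the only way to change the running system is to change the “repo” and let the sync loop apply it.

```go
package main

import "fmt"

// repo stands in for the Git repository: the only writable source of truth.
// live stands in for the running system, which is never edited directly.
// Keys and values are invented for illustration.
var (
	repo = map[string]int{"web-replicas": 2}
	live = map[string]int{}
)

// sync is the GitOps loop in miniature: read desired state from the repo and
// make the live system match it. Any change has to land in the repo first
// (edit, commit, pipeline), which is the unidirectional flow being described.
func sync() {
	for key, want := range repo {
		if live[key] != want {
			fmt.Printf("applying %s=%d (was %d)\n", key, want, live[key])
			live[key] = want
		}
	}
}

func main() {
	sync()
	// To scale up, you don't touch the live system; you change the repo and
	// let the next sync apply it.
	repo["web-replicas"] = 5
	sync()
}
```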

You have System Initiative, which is kind of interesting, although kind of challenging to sort of understand exactly what it is. But I do think it’s good that folks are exploring alternatives. I don’t think just kind of building more generation layers that still have the same overall properties, like the unidirectional flow, is going to provide dramatic benefits over what we’re doing now. Like, people are, of course, trying to use AI to generate Terraform and other Infrastructure as Code.

Like, I’ve tried… doing that. It works kind of okay for CloudFormation. Works less okay for Terraform, in my experience. That could be another whole podcast. But I don’t think that ultimately changes sort of the overall math in the equation, right? Like, you still have to have humans that understand it, that can review it, and make sure it’s correct and not hallucinated. And—

Justin: You need some experts that have more context than the system itself. Like, so there’s someone outside of the system that knows, is this safe or the right way to do it.

Autumn: Which, everybody’s plan to automate it is going to make it harder for those humans to have that.

Brian: Yeah. Then you also have to deal with configuration drift, and you know, all the other problems that are kind of independent of the configure—Infrastructure as Code tool that you’re using.

Justin: Brian, this has been awesome. Thank you so much for coming on the show. Where should people find you online if they want to reach out, if they want to ask you more questions, if they want to, I don’t know, like, get in touch?

Brian: Yeah, I a—thanks for having me on. I’m BGrant0607—it’s a trivia question, what the numbers stand for—but on LinkedIn, Twitter, BlueSky, Medium. And I mean, I’m also on Hachyderm and some other Mastodon things, but that seems a lot more fragmented. And also still on Kubernetes Slack and CNCF Slack, as just Brian Grant, I think.

Justin: Well, thanks again so much. Anyone that has questions or wants to reach out, we actually don’t have a Slack instance for Fork Around and Find Out. We’re not doing any, sort of like, real-time chat for this. BlueSky is like—social media is kind of where I’m trying to gravitate towards for these sorts of conversations, if you have other feedback or want to reach out. I don’t want to check another chat system and log into another system for it. Like, I’m already there. Autumn and I are both there. We have the Fork Around and Find Out BlueSky handle which will be posting these episodes, so feel free to leave comments and send us messages on there. And yeah, we will talk to you all again soon.

Justin: Thank you for listening to this episode of Fork Around and Find Out. If you like this show, please consider sharing it with a friend, a coworker, a family member, or even an enemy. However we get the word out about this show helps it to become sustainable for the long-term. If you want to sponsor this show, please go to fafo.fm/sponsor, and reach out to us there about what you’re interested in sponsoring, and how we can help. We hope your systems stay available and your pagers stay quiet. We’ll see you again next time.