Tech on the Rocks

From GPU computing pioneer to Kubernetes architect, Brian Grant takes us on a fascinating journey through his career at the forefront of systems engineering. In this episode, we explore his early work on GPU compilers in the pre-CUDA era, where he tackled unique challenges in high-performance computing when graphics cards weren't yet designed for general computation. Brian then shares insights from his time at Google, where he helped develop Borg and later became the original lead architect of Kubernetes. He explains key architectural decisions that shaped Kubernetes, from its extensible resource model to its approach to service discovery, and why they chose to create a rich set of abstractions rather than a minimal interface. The conversation concludes with Brian's thoughts on standardization challenges in cloud infrastructure and his vision for moving beyond infrastructure as code, offering valuable perspective on both the history and future of distributed systems.

Links:
Brian Grant on LinkedIn

Chapters

00:00 Introduction and Background
03:11 Early Work in High-Performance Computing
06:21 Challenges of Building Compilers for GPUs
13:14 Influential Innovations in Compilers
31:46 The Future of Compilers
33:11 The Rise of Niche Programming Languages
34:01 The Evolution of Google's Borg and Kubernetes
39:06 Challenges of Managing Applications in a Dynamically Scheduled Environment
48:12 The Need for Standardization in Application Interfaces and Management Systems
01:00:55 Driving Network Effects and Creating Cohesive Ecosystems

What is Tech on the Rocks?

Join Kostas and Nitay as they speak with amazingly smart people who are building the next generation of technology, from hardware to cloud compute.

Tech on the Rocks is for people who are curious about the foundations of the tech industry.

Recorded primarily from our offices and homes, but one day we hope to record in a bar somewhere.

Cheers!

Kostas (00:02.018)
Brian, hello and thank you so much for being here with us today, together with Nitay. Can we start by getting a brief introduction? Tell us a few things about yourself.

Brian Grant (00:19.448)
Sure. The thing I'm best known for is I was the original lead architect and API design lead for Kubernetes. I was part of the team that created Kubernetes since before we decided to make it open source, since before it was called Kubernetes at Google. And then as the project evolved, I was...

part of what we call the bootstrap committee and the steering committee once we formalized governance for the project. I was also on the Cloud Native Computing Foundation Technical Oversight Committee, an inaugural member of that committee, building the portfolio of...

projects early on in the history of the foundation. Overall, I was at Google 17 years, and I left recently. Before Kubernetes, I was a tech lead of the control plane for Google's internal container platform, Borg. And then before Google, I worked on kind of more compiler architecture sorts of things. I worked on a

whole-program, link-time optimizer and scheduler at Transmeta, and on high-performance computing on GPUs at PeakStream around 2005.

Kostas (01:48.568)
All right, that's quite an amazing background, I have to say. I think there are many things to talk about here for sure. So let's start. I'd like to start with a question about the past first, and then we'll get into the more recent things you've been working on. So you were doing high-performance computing

long before AI was a thing, or GPUs were anything outside of something that was primarily used for gaming, right? So how was it back then, trying to do that, and what motivated you to do it back then?

Brian Grant (02:31.51)
Yeah, so I actually worked on supercomputing back in the early 90s. I interned at Lawrence Livermore National Laboratory and worked on a variety of high-performance computing projects there, some collective communication algorithm implementations, and I worked on a climate model, traced the performance bottlenecks, which all came down to communication issues as well. And then

when it came to PeakStream, I had been working on dynamic compilers in particular: as part of my PhD, I worked on a runtime partial evaluation system, and at Transmeta I worked on both a static compiler and a dynamic compiler for their custom VLIW. So working on high-performance computing on

GPUs was really interesting to me because it was kind of an exotic architecture to develop a compiler for. It had...

It was also VLIW. It also had SIMD-within-a-word parallelism. It had thread-level parallelism. So it had several different kinds of parallelism and also a complex memory system. That was the time when people were just starting to experiment with what was called GPGPU at the time. And our co-founder was

from Nvidia and had seen kind of what people were trying to do with the GPUs and the kind of performance that could be achieved. So we had some quite impressive performance: calculations would run like 30 to 50 times faster than the Intel CPUs at the time. You know, so some things got quite impressive

Brian Grant (04:32.878)
benefits from that. So we were looking at some seismic calculations. I think synthetic aperture radar was one application, some finance applications. It's kind of interesting. Some of the customers had cases where the code was public domain, but the data was secret and others were the opposite where the data was all public and the calculations were the secret sauce.

But really at that time, we were trying to make high-performance computing on GPUs easier and more accessible by allowing array computations to be expressed in high-level languages, then dynamically creating a graph of the array operations, fusing them together into kernels, and dynamically

generating the GPU programs and running those on the GPUs. But at the time, the people who needed the performance badly enough really wanted to eke out every last flop. So they didn't care as much about usability and cared mostly about performance. And they could hire the...

people with the expertise or who could develop the expertise to actually go do that. So we had trouble finding product market fit at that time for that reason, I would say.
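
A rough sketch of the array-expression approach Brian describes: operations on arrays are recorded into a graph instead of executing eagerly, so a backend could fuse them into kernels and generate device code later. The API below is hypothetical, not PeakStream's actual interface, and the evaluator just interprets the graph on the CPU.

```go
package main

import "fmt"

// Node records one array operation instead of executing it eagerly,
// so a whole expression builds up a graph that a backend could fuse
// into a single kernel before generating GPU code.
type Node struct {
	op     string
	inputs []*Node
	data   []float64 // only leaf nodes carry data
}

func Array(data []float64) *Node { return &Node{op: "const", data: data} }
func Add(a, b *Node) *Node       { return &Node{op: "add", inputs: []*Node{a, b}} }
func Mul(a, b *Node) *Node       { return &Node{op: "mul", inputs: []*Node{a, b}} }

// Eval stands in for "generate and run the fused kernel": here it just
// interprets the graph element-wise on the CPU.
func Eval(n *Node) []float64 {
	switch n.op {
	case "const":
		return n.data
	case "add", "mul":
		a, b := Eval(n.inputs[0]), Eval(n.inputs[1])
		out := make([]float64, len(a))
		for i := range a {
			if n.op == "add" {
				out[i] = a[i] + b[i]
			} else {
				out[i] = a[i] * b[i]
			}
		}
		return out
	}
	return nil
}

func main() {
	x := Array([]float64{1, 2, 3})
	y := Array([]float64{4, 5, 6})
	// x*y + y is only recorded as a graph; nothing runs until Eval.
	z := Add(Mul(x, y), y)
	fmt.Println(Eval(z)) // [8 15 24]
}
```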

Brian Grant (06:10.254)
The problem was quite difficult though. The GPUs were not designed for computation. They didn't do IEEE floating point. They didn't do 32-bit integer computation even. The memory system didn't have read-after-write consistency. There were pretty exotic bugs in how the chips worked. So we had to reverse-engineer how the chips worked and how the code really needed to be generated

in order to work correctly. There was an interesting case with conditionals: because the programs were SIMD, conditionals needed to be converted to predicates. Then operations that were not supposed to execute would have their writebacks to memory or to the register file disabled. But in some cases, the values written to the register file actually mattered for other computations,

even though they were supposed to be disabled. So we would get incorrect answers. So debugging those kinds of issues in collaboration with the chip partner, we were working with ATI at the time. This was all before CUDA was released. That was pretty challenging because we were doing things with the chips that obviously they weren't really designed for. And the...

product cycle was pretty fast-paced. So the chip vendors would release new chips every six months or something like that. So they didn't do the kind of rigorous qualification and verification that Intel did. Intel had a five-year product cycle on its chips, so it had the time. And especially after the FDIV bug, Intel...

Brian Grant (08:05.9)
verified the chips very, very thoroughly to reduce the probability of computation bugs to almost zero. But the GPUs were not like that. They had other priorities, like 3DMark scores and implementing the latest DirectX feature sets and things like that. So as long as they worked well enough for games,

that was kind of considered good enough. So there were interesting rules of thumb, like if something went wrong, it was better for the results to be zero than a non-zero value, because a zero would show up as a black pixel on the screen. And in one frame, a black pixel, nobody would notice, like 1/24th or 1/30th of a second. So that was considered good enough, as long as there weren't too many dead pixels.

But in a computation, like if you're doing a simulation or some important calculation, you actually want the results to be 100% correct. And that was just not the case a lot of the time. So we did have to run our own

tests on the graphics cards to validate that they actually worked, and there was a huge fallout rate from the commercial cards that were actually sold.

But even with the ones that passed, there were really interesting corner cases that we encountered in the sequences of instructions we generated, where we'd have to change the sequence in order to avoid those architecture bugs or implementation bugs.
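
A minimal sketch of the if-conversion Brian describes: every SIMD lane computes both sides of a conditional, and a per-lane predicate decides which result is written back. The Go below only simulates the idea with slices standing in for lanes; the hardware issue he mentions was that values "written" by disabled lanes could still leak into later computations.

```go
package main

import "fmt"

// If-conversion sketch: each lane runs both sides of the conditional,
// and a per-lane predicate selects the result during writeback.
func main() {
	x := []float32{-2, 3, -1, 4} // one element per SIMD lane

	// Scalar view: if x < 0 { y = -x } else { y = x * 2 }
	pred := make([]bool, len(x))
	thenVal := make([]float32, len(x))
	elseVal := make([]float32, len(x))
	y := make([]float32, len(x))

	for lane := range x {
		pred[lane] = x[lane] < 0 // predicate "register" per lane
		thenVal[lane] = -x[lane] // both sides computed unconditionally
		elseVal[lane] = x[lane] * 2
	}
	for lane := range x {
		// Predicated writeback: only the selected result should land in y.
		if pred[lane] {
			y[lane] = thenVal[lane]
		} else {
			y[lane] = elseVal[lane]
		}
	}
	fmt.Println(y) // [2 6 1 8]
}
```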

Nitay (10:04.057)
That's super interesting. There's a lot of interesting bits you said there. It sounds like essentially a lot of what you guys were doing was kind of taking these GPUs and trying to find for them a new use case that the traditional kind of graphics rendering, gaming, et cetera, kind of things hadn't focused on. And so you were doing that while at the same time trying to layer on kind of an easier interface, make it much more accessible and deal with underlying...

chip-level design constraints. And so how did you kind of navigate that? I'm curious to hear, and how did you deal with, you know, some of the IEEE floating point stuff you mentioned, the read-after-write consistency and so forth, and why ultimately did people not seem to care? Like you said something interesting there, where you said that for your use cases it was ultimately about just raw performance, raw computation, and people could put up with the silliest interfaces and so forth.

Brian Grant (10:59.852)
Yeah, I think a lot of the issues with the IEEE floating point, or the exotic architecture bugs, the lack of read-after-write consistency: because we were writing the compiler and the low-level kernels, like we had some folks who would hand-write kernels for

Common linear algebra operations like matrix multiply, LU factorization, things like that. Things you would find in a linear algebra library like a BLAS library. Since we were writing those things, we could avoid the pitfalls and implement workarounds and things of that nature.

So once we figured those things out, it wasn't so much of a problem unless it dramatically degraded performance or something like that. But for all the issues I can remember, we found acceptable workarounds for those things.

Brian Grant (12:04.846)
In terms of the original product conception, as a user of it, I actually found it super usable. And actually, if I look at one of the modern libraries like XLA or something like that, it looks eerily similar to what we built almost 20 years ago. But now I think there's an audience of folks who want to get that level of performance

you know, due to the new machine learning and AI use cases. Back then, it was really before the national labs even really started using GPUs seriously

for high-performance computing. NVIDIA took a long bet that high-performance computing would be a market that paid off, but at the time, it was really just graphics. And in fact, NVIDIA had shifted their focus to implement some new GPUs that would be dual-purpose, both...

graphics and high-performance computing. And that was around the same time AMD was acquiring ATI. And what I remember is ATI actually won

Brian Grant (13:22.604)
an entire generation of console deals, laptops, like all the graphics applications other than the super high-end desktop gaming graphics. All the other graphics applications, you know, ATI actually won those deals,

because they had the right combination of price and performance for those graphics applications. Whereas NVIDIA released CUDA, I think around 2006-ish, maybe, so while we were in the middle of working on our product, and really started to ensure that those high-performance computing applications would work well, address the floating point computation issues and so on, have a model that...

I wouldn't call CUDA a high-level language, but it wasn't assembly level either. So it was a programming model that engineers could use to get the highest amount of performance out of those chips. But it seemed the market maybe wasn't big enough for

just high performance. So I think the dual-purpose approach was a good idea for a long time. Now, you know, with all the AI applications, it probably is a big enough market. But, you know, even when Nvidia was getting started, it wasn't. So from a chip perspective, it made a lot of sense.

Brian Grant (14:58.35)
But yeah, the folks that actually wanted the performance would really, you know... this was a time when people would hand-code kernels in assembly language and things like that. Something else that happened around, I think, 2006 or 2007 was Intel actually increased the performance of the SSE instructions, the SIMD-within-a-word instructions, by a factor of six in one generation. So...

That was a one-time bump, but people didn't know it at the time. Moore's law was kind of reaching its end, but it wasn't quite there yet. It was the beginning of the multi-core era as well. But that dramatic increase in SSE performance reduced the gap between

kind of typical GPU performance on even dense array calculations and what you could get with a CPU. And the GPUs were not designed for things like putting into racks of servers. They were designed for putting into gaming laptops,

the high-end GPUs. So they had like huge heat sinks and huge fans, and the cards were oriented such that, when put into a typical server box, they would like melt the board, just because they weren't designed to fit into that 2U form factor for the machines that were getting built.

So that took a while, I would say, to kind of build out that whole ecosystem. And as a small startup, we couldn't really drive those kinds of changes. NVIDIA could, but we could not. So I think it really took a large enough hardware vendor to kind of drive those bigger changes.

Nitay (16:59.213)
Right. This was the time, if I'm correct, when they went from like SSE1 to SSE2, SSE3, et cetera. Right. I think that's what you're referring to. Right. You said something interesting before about where you were writing dynamic compilers. I think it might be helpful for the audience just to define what is a static versus a dynamic compiler. And in general, I think most folks, at least most folks I know, you know, they think of a compiler as like, okay, I write a bunch of code,

Brian Grant (17:06.07)
Yeah. Yeah.

Nitay (17:26.647)
it turns it into some instructions and that's kind of it. Like, what has changed? Why is it any different, when in reality compilers have actually come a long way in modern times, in terms of single-line compilation and incremental compilation and all these kinds of different things? So I'd love to maybe get a little more color there.

Brian Grant (17:45.868)
Yeah, sure. A static compiler normally is built up of a sequence of translation phases. So you write your program in some high-level language. These days, it could be Go or Rust or... I'm just thinking of the compiled languages as opposed to the interpreted languages, but

I'll get to that. Dynamic compilers are where it becomes a little bit blurry. But back when I was working on it, a lot of the code was written in C or C++, so kind of lower-level, high-level languages. But you run the compiler and it translates that form into some kind of internal representation, like an abstract syntax tree or a graph of

blocks, and then a series of phases run on that and gradually reduce the representation to something closer to machine instructions. And once you get to something close to machine instructions, typically a compiler would generate assembly language, which would have a one-to-one match with the instructions that are actually supported. Then an assembler would take

that assembly level representation and generate the actual machine code that the CPU understands. Usually software is constructed in units of individual files. So you would generate a number of separate object files that have

the machine instructions and some metadata, like symbols that need to be resolved later, for example. Then a linker would take that pile of object files and do that final resolution and construct the executable image. And the executable image is something that the operating system knows how to load into memory. So it's quite a multi-stage

Brian Grant (19:49.272)
process, but at the end of the day, you have something that's fairly close to the metal that the operating system can kind of give to the CPU. And the CPU knows how to execute a sequence of instructions. You know, when it encounters a branch, it can evaluate the branch and branch to a different instruction and so on. A dynamic compiler leaves some of that work until you actually execute the program. So that gets into, you know, interpreted programs. So in a language like Python or

Java or JavaScript. These are interpreted languages, and usually you hand the interpreter the program in the high-level language.

And it parses that language, reduces it to some internal data structure like an abstract syntax tree. And then it will actually walk that tree during execution. What dynamic compilers really started to do, really Java popularized the idea of what we call JIT or just-in-time compilation, which would be: at runtime, when you're executing the program, if you execute a sequence like a function or something enough times,

it would trigger a compilation pass to generate instructions that could then be fed more directly to the CPU. And the amount of speed up you could get, the difference in performance between compiled code and interpreted code is more than 10x.

It could be 20x, 30x; it kind of depends on what the language is, how optimized the interpreter is. Sometimes interpreters compile to bytecode as well to make it more efficient to interpret. It's kind of an intermediate step. But the...

Brian Grant (21:41.974)
reason not everything is typically just-in-time compiled is because the compilation process, that multi-phase process I talked about, is fairly expensive. And the more optimizations you do, the more expensive it is. So if you're gonna run the program many, many times, doing a lot of optimization can pay off; if you may not execute that code as many times, it won't pay off. So that's where

some kind of adaptive technique for determining, you know, is this section of code worth compiling? Is this function worth compiling? Is this class worth compiling? Whatever the unit of compilation is, kind of using runtime execution information to decide is it worth doing or how much time is it worth spending on compiling it?

Like, you might do a simple compilation phase with no optimizations, just generate simple machine code in one pass, if it's executed a few times. But if it ends up being executed a hundred times or a thousand times, you might say, well, it's actually worth running a lot of optimization passes over it to try to eliminate redundant computations, eliminate unused code, things like that, in order to

streamline the performance of that as much as possible. So actually what Transmeta did, in what we called the Code Morphing Software, is it actually took x86 code as the input and did what we call dynamic binary translation to its custom VLIW format. But effectively it was a JIT compiler.

There was an interpreter that would interpret the x86 code instruction by instruction, so like add, load, add, store. But then, if that code was executed enough times, like say 15 times, it didn't have to be a lot, we'd run a dynamic compiler over it. And that dynamic compiler would optimize redundancies. It would do things like register allocation.

Brian Grant (23:54.81)
x86 code back in those days didn't have a lot of registers, so a lot of values were stored and loaded back to and from the stack in x86, or from other parts of memory for variables. And if you touched a value multiple times in an instruction sequence, you'd want to eliminate the redundant loads, for example,

to improve performance. So those kinds of changes, I mean, modulo threading, thread safety, all of that kind of thing. So those kinds of optimizations could be done, and it could improve performance quite dramatically for those instruction sequences.
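
A toy sketch of the hot-spot heuristic described above: interpret until a piece of code has executed enough times, then switch to an optimized version. The threshold and structure here are illustrative only; Transmeta's actual Code Morphing Software generated native VLIW code rather than swapping Go functions.

```go
package main

import "fmt"

// jitFunc models adaptive compilation: run the slow "interpreted" path
// until the execution count crosses a threshold, then build and use a
// "compiled" path. Names and the threshold value are made up.
type jitFunc struct {
	count     int
	threshold int
	interp    func(int) int        // slow path
	compiled  func(int) int        // fast path, built lazily
	compile   func() func(int) int // stands in for the JIT backend
}

func (f *jitFunc) call(arg int) int {
	f.count++
	if f.compiled == nil && f.count >= f.threshold {
		f.compiled = f.compile() // "JIT" the hot function once
	}
	if f.compiled != nil {
		return f.compiled(arg)
	}
	return f.interp(arg)
}

func main() {
	square := &jitFunc{
		threshold: 15, // like the ~15 executions Brian mentions
		interp:    func(n int) int { return n * n },
		compile: func() func(int) int {
			fmt.Println("compiling hot function")
			return func(n int) int { return n * n }
		},
	}
	for i := 0; i < 20; i++ {
		square.call(i)
	}
	fmt.Println("executions:", square.count)
}
```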

Kostas (24:44.024)
Brian, you mentioned that back in the day you were building actual compilers for the GPU. So how is building a compiler for the GPU different from building one for the CPU? Or, to put it this way, how does the hardware target change the way that the compiler is built, if there is a difference there, right? I'd love to hear that because...

Brian Grant (25:06.728)
Yeah, there are a lot of changes. So in grad school, I worked on a dynamic compiler that generated code for a DEC Alpha, an early superscalar chip. And it was actually statically scheduled, so you would only get two instructions executing concurrently if a bunch of rules were followed. I forget what the rules were exactly, but

the chip had a certain number of functional units of different types, like a certain number of ALUs, a certain number of memory units, and so on. So one thing I remember that's pretty simple is that the instruction pairs had to be aligned on an even memory word boundary.

So basically it would fetch both instructions at the same time, as long as they were fetched as a single chunk, basically. So that had to be on a memory line boundary. And...

There could be rules like: you could execute two ALU instructions as a pair, but you couldn't execute two memory instructions as a pair. So you'd do one memory instruction, or two ALU instructions. An ALU instruction would be like an add, a subtract, a multiply, a compare. A memory instruction would be like a load or a store.

So there are a bunch of rules like that. VLIWs are even wider. So you might be able to execute four instructions at a time or eight instructions at a time. But there might be a limited number of functional units. So maybe you could execute up to four ALU instructions and two memory instructions and a branch and one other type of...

Brian Grant (27:00.022)
Miscellaneous instruction or something like that all in one go. So the compiler's job is to kind of figure out how to pack those instruction words. In the case of GPUs, SIMD operations are pretty common. So at multiple different levels. So one might be SIMD within a word. So you might have

very wide registers, and you want to add pairs of numbers where both numbers are stuffed into the same register. So the compiler would have to figure out that, you know, these two 32-bit values can be packed into a 64-bit value, for example. So there can be things like that. Then the scheduler might have to be able to space instructions apart, like

this instruction computes its data value, but that data value isn't available in the next cycle and can only be available three cycles later. So in the subsequent instruction or set of instructions, you don't want to consume the value, because that would either create a stall in the pipeline waiting for that value to be ready, or you might not even get the correct answer, because the value

you expected as a result from a previous computation wasn't available yet. So the register would still have the value of the previous computation. So there are a bunch of complexities like that. GPUs, at the time I was working on them, had a bunch of additional complications, like there were different types of memory. So there was the memory on the graphics card, there was the CPU memory, which would have to be fetched across PCIe at much lower performance. There was also a dedicated on-chip

memory that was not a cache but had cache-like performance, but you couldn't actually read back from that memory in the same pass as you wrote that value. So effectively, it was write-only memory for that given shader execution.

Brian Grant (29:08.62)
GPUs had a number of funky characteristics. So the set of registers is usually what the instructions execute on directly. So if you want to add two values on a RISC-like instruction set, you need to load values from memory into registers. And then you can add two registers and produce the result into another register. And if you want to store that back into memory, a store instruction can take the value from the register and store it back to memory.

Whereas Intel's x86 instruction set was what we call a complex instruction set: their instructions could operate directly on memory or on registers. But if you did operate on memory, then the instructions would be slower. GPUs would have additional complications, like the register file could be dynamically partitioned. So in some cases you would have, you know,

Brian Grant (30:07.278)
say 16 registers available, and in other cases, if you partitioned the registers differently, you could have 64 registers available. So that would actually change what instructions were valid that you could execute, and how memory values had to be loaded into those registers and stored back to memory. And the reason for the dynamic partitioning was thread-level parallelism. So the GPUs could execute, in a SIMD-like style,

multiple distinct threads in parallel operating on different parts of, for graphics, a different part of the image, right? So you would carve up the image into pieces, you would give each piece to a different thread in your shader execution, and that thread would work on that tile effectively. So you would decide, you know, based on how

many threads you thought could efficiently do that computation. Like if you could just carve up the entire image, you would want the maximum amount of threads perhaps. If you had a computation that needed to iterate over

Brian Grant (31:25.742)
parts of the image, then it might make sense to have fewer threads and bigger tiles. So different computations picked a different point in that trade-off space. But effectively, when we compiled the program, we had to understand

which point in the trade-off space made the most sense, and we'd have to make a decision and decide, look, this is how we're gonna partition into threads and partition the register file for that given generated code. So that was one benefit of doing the dynamic compilation for the GPU: we could make those decisions late, you know, at the time when we had more information, like how large the arrays were, for example.

So yeah, it's pretty complicated. There are a lot of different dimensions and a lot of different levels. Like even on a CPU, it can matter what the cache size is. Like, if you're tiling a matrix multiply, you want to fit the smallest tile in the L1 cache, the lowest level of cache.

Kostas (32:11.502)
Okay, that's...

Kostas (32:26.082)
Yeah.

Brian Grant (32:35.522)
But if you have L2 cache or L3 cache, it matters how big those caches are, what the performance difference between those caches is. Is your computation dense or sparse? There are a lot of different factors that determine how that kind of numerical code needs to be written or needs to be compiled.
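
A small illustration of the cache tiling Brian mentions: a blocked matrix multiply that works on BxB tiles so each block's working set can stay resident in a low-level cache. The sizes are made up; real implementations tune the tile size to the actual cache hierarchy.

```go
package main

import "fmt"

// Cache-blocking sketch: multiply NxN matrices in BxB tiles so the data
// touched by the inner loops fits in a small cache. N and B are toy values.
const (
	N = 8
	B = 4 // tile size
)

func matmulTiled(a, b, c *[N][N]float64) {
	for ii := 0; ii < N; ii += B {
		for jj := 0; jj < N; jj += B {
			for kk := 0; kk < N; kk += B {
				// Work on one BxB tile at a time.
				for i := ii; i < ii+B; i++ {
					for j := jj; j < jj+B; j++ {
						sum := c[i][j]
						for k := kk; k < kk+B; k++ {
							sum += a[i][k] * b[k][j]
						}
						c[i][j] = sum
					}
				}
			}
		}
	}
}

func main() {
	var a, b, c [N][N]float64
	for i := 0; i < N; i++ {
		for j := 0; j < N; j++ {
			a[i][j] = 1
			b[i][j] = 1
		}
	}
	matmulTiled(&a, &b, &c)
	fmt.Println(c[0][0]) // 8: each entry is a dot product of length N
}
```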

Kostas (32:56.822)
Yeah, I think we can probably have, I don't know, many episodes talking just about compilers, but one last question about compilers before we move on, because there are also a lot of other very interesting things that we can talk about. So you've seen compilers for a very long time and you've seen their evolution, obviously. So actually it's a question with two sub-questions. The first one is: in your career so far,

seeing the evolution of compilers, if you had to pick one innovation or one part of the progress that compilers have made all these decades they've been out there, what would you pick as the most influential or most important one? And the second is: what do you see as the future of compilers? It's a technology that is obviously

super important, it's kind of the foundations of anything that we do with computers, right? So there's a lot of very smart engineering that has already happened on them, but probably there's still a lot of space there for innovation and solving more problems. So tell us a little bit about like these two questions.

Brian Grant (34:22.222)
Yeah, well, I'll caveat it with: I haven't really worked in that area for about 17 years, so I haven't really been keeping track of it. But around the time that I was working on it, a big innovation was static single assignment form for the intermediate representation. So that really simplified a lot of the standard compiler

optimizations that people were doing at the time. And then LLVM kind of created this reusable compiler framework that made those sorts of standard passes easier to develop. Also, you know, GCC was already pretty popular and targeting, like, every new chip generation that came out for any kind of product, whether it was

Brian Grant (35:19.591)
you know, like a DSP kind of chip for a mobile phone or a chip for something else. I mean, it was a little bit just before the real mobile revolution, but, you know, there were a lot of specialized chips or at least specialized chip functions. Maybe it would be ARM plus, you know, some specialized instructions

or something like that. So having a reusable compiler framework that could target different architectures that were similar but different in some ways, I think, really made a lot of that innovation at the hardware level possible. Otherwise, it'd be too expensive to develop a new compiler from scratch for some new instruction set architecture or something like that. Now we have...

Brian Grant (36:16.216)
programs where just the instruction-level optimization is not the most important thing, but kind of more macro-level optimizations, loop fusion being one example, to get the optimal performance out of one of these high-performance computing kernels. That creates a sort of whole new tier of

Brian Grant (36:44.15)
optimizations that need to be performed.

Brian Grant (36:49.814)
And maybe, you know, new, there's always been a...

Brian Grant (36:59.918)
I've worked mostly on the interface between the compiler and the hardware side, but there's also the interface between the compiler and the language side. And something interesting that's been happening over the past couple of decades, which probably always happened, but it seems to me to have accelerated, is the frequency with which new languages are introduced. So it takes about, I think, 20 years for a language and

its entire ecosystem of tools and libraries and IDE integrations and all these types of things to become mature. I think that's about how long it took Java. Java was challenging because they were also innovating on the just-in-time compiler side. And, you know, I'd say Go is approaching that level of maturity now. Right, it's almost 20 years.

Brian Grant (37:56.558)
But anyway, there are a lot of new languages, Rust, Zig, whatever. So I find that... I think compilers, especially these reusable compiler frameworks, can potentially make it easier to develop certain types of new languages.

Brian Grant (38:21.048)
So there may be a lot more kind of niche languages than there used to be.

Nitay (38:28.089)
Yeah, absolutely. I remember first seeing SSA and things like that as well. It was interesting, actually: the first time, I remember playing with a language called Scala that a lot of people may know. And one of the things that was kind of unique to Scala was this notion of differentiating between variables and what are called vals, where you only assign it once.

And it was interesting because of this realization that really what they did was they took SSA and brought it up to the language level, and people would program in ways such that 95-plus percent of your code, you realized, actually only needed to be assigned once. And so it both made it easier for the compiler to do SSA, but also the code itself actually ended up looking simpler and easier. From a human cognitive level, it actually made the reasoning a lot simpler.

Brian Grant (39:14.252)
Yeah, that's an interesting hybrid between a functional language, in many ways, and an imperative language.

Nitay (39:20.836)
Right.

Nitay (39:24.993)
Exactly. Cool. So going from there, you mentioned you then went to Google, and it sounds like at Google you kind of started out doing a lot of deep performance stuff, multi-threading and so on, and then moved to a lot about kind of how applications and services were managed, all the way to Borg and Kubernetes and so forth. So tell us a bit about the things you saw there and, like, what were the problems that Google saw? Why create Borg, and how did that evolve?

Brian Grant (39:55.95)
Yeah, so that transition... nominally my starter project on Borg was to improve performance by making the control plane more multi-threaded. Google had a lot of single-threaded C++ code because it kind of started in '98, sort of just before multi-core took off and before NPTL. So Linux threads were not very good. There was no C++ threading standard.

So the code was written to be more throughput-oriented event loops. And by the time that I joined the Borg team, I think there were

a lot of eight-core machines in the fleet, and they were moving to 16-core machines at that time. So there was a lot more opportunity for thread-level parallelism and a lot more need for it in order to fully utilize the hardware. As for the Borg control plane itself: because the applications were not

adapting to the increasing core counts as fast, the trend was to run more workloads on the same clusters. And that created more work for the control plane to do. So that was kind of how I got into it. I needed to understand, well, okay, what is the control plane spending time on? I had to look at how people were using the system.

And I found that in many ways, people were using it in ways it wasn't really designed for. So the core of the control plane just scheduled jobs, which were arrays of tasks, like 20 tasks or 40 tasks. Those tasks were kind of what we know now as containers. They weren't Docker containers, but they...

Brian Grant (42:07.2)
ran in cgroups and chroots on the machines in the fleet.

Brian Grant (42:17.058)
There was a lot of functionality that was built outside of the core control plane, like batch scheduling and cron scheduling and MapReduce and these sorts of things. So a lot of these systems had common issues, like they needed to store information

Brian Grant (42:45.432)
as input for that specific system. Like batch: if you wanted to schedule your job as a batch job, queued in time, the batch scheduler might need some additional information, like how urgent it is to run this job, for example. And you could build another control plane with another database for that information and

keep track of it, but people didn't want to do that. So they would find ways or ask for ways to extend the Borg control plane so it could store that information. So it wasn't like in Kubernetes or in the cloud, where there's just a model where you can create additional resources or additional services and represent those and have

those resources stored in a consistent way. It was really not extensible in that way. So that was one observation. Another was that all these systems then would poll for information, because they're all running jobs on Borg. They needed to keep track of:

has the job completed, has the job succeeded, did it fail, or did individual tasks fail, things like that. So they'd be constantly polling for information, and there wasn't a way to notify these higher-level systems when there were changes that were relevant to them. So that was the observation that actually led to the watch API in Kubernetes, and also to the resource model in Kubernetes,

where it was clear that we would need to be able to represent

Brian Grant (44:32.204)
multiple types of control tasks, both kind of the low-level, you know, just execute these containers, and also higher-level control loops of various kinds, whether it was batch scheduling or autoscaling or rolling updates or whatever it was. If you didn't have a way to represent those things, then, you know, it could get pushed to a client in a configuration file somewhere, or

some higher-level system would need to store the information somewhere else, or it'd squirrel it away in some unexpected place in the Borg system. Or you could just end up with a lot of inconsistent styles of APIs or configuration formats, things of that nature. And the whole ecosystem didn't feel very cohesive due to that. So

that was a big part of the inspiration for the Kubernetes design, where we kind of knew that we needed an array of capabilities, even though we wouldn't build it all at the beginning. But there were certain common patterns to the types of information that would be needed, or the types of control plane functionality that would be needed, in order to build up the system in a similar way.
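
For readers who want to see what came out of that observation, here is a minimal client-go sketch of the Kubernetes watch API: instead of polling, a controller receives a stream of change events for a resource. It assumes a reachable cluster and a kubeconfig at the default location; error handling is trimmed for brevity.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config) and build a clientset.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch pods in the "default" namespace instead of polling for changes.
	w, err := clientset.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Printf("%s pod %s/%s phase=%s\n", event.Type, pod.Namespace, pod.Name, pod.Status.Phase)
	}
}
```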

Nitay (45:59.235)
And it seems like, to that point, you know, a lot of the Borg design and then the thinking into Kubernetes, a lot of it was...

providing a rich set of abstractions and core capabilities and infrastructure and control and scheduling and so forth, as opposed to the kind of other model of, like, you the application developer or service builder bring it yourself kind of thing, and we're going to give you the minimal set of interfaces, make it super simple, but then you bring your kitchen sink with you, right? It seems like Borg, or Kubernetes may be the better example, very much took the approach of: we're going to

give you a rich set of different layers and abstractions. And Kubernetes is fairly well known for being fairly kind of bulky, I mean, in a good way, right? Like it brings a lot to the table, but it also kind of necessarily takes a good amount to get going, in that sense. I'd love to hear more, because it sounds like those were conscious decisions that you guys made that said, like, we're gonna package more and more of this into the control plane,

Kubernetes or Borg should own this, whether it be the scheduling, the logging, the, like you said, kind of the watchdog, all these kinds of things. What were those kinds of design decisions like?

Brian Grant (47:16.182)
Yeah, I mean, the original motivation for Borg... Before Borg, Google had two other systems for scheduling workloads. One was called Babysitter, which, when I started on Borg, actually ran the Borg control plane. And that would run services fairly statically on individual machines.

And there was the Work Queue, which would schedule batch jobs like MapReduce on a pool of machines. And those would be kind of time-bounded workloads. So the original idea for Borg was to be able to run both types of workloads on the same system, so that when there was underutilized capacity or unused capacity on the service side, the batch workloads could soak up that capacity

and use those machines, because the fleet was already big enough that idle machines were costing a lot of money. So that was the original motivation. That is also part of the motivation that led to innovations like cgroups, to provide more control, like at the context-switch level in the kernel. We wanted

the latency-sensitive services to be able to be prioritized. Like, if suddenly there was a spike in load, we wanted those to have scheduling priority at the kernel level, and the batch job to be deprioritized and to run more slowly and have less CPU available to it. It turned out for Kubernetes that didn't really matter, because for most people running Kubernetes in the cloud, the cloud could

be elastic in different ways than physical machines were. So, you know, if you had idle cycles, you could remove virtual machines, for example, or create smaller virtual machines. So there are more dimensions of elasticity available in Kubernetes, at least when executing it in the cloud, compared to Borg. I think one of the other changes is, because we did make

Brian Grant (49:36.92)
Kubernetes so extensible, and there are different ways of running it, now with CRDs and operators and all that, people would run smaller clusters so that they could construct them differently or operate them differently. I think AI training is a good example of that, where a training workload is more like a supercomputing workload, in that all the nodes need to be running the same workload without interruption.

Whereas with a typical SaaS kind of workload, people try to design it so everything can be kind of incrementally upgraded in pieces, with everything replicated, so that if you need to take down an instance, you can, because there are enough replicas to pick up the slack. So you want to incrementally update your application, incrementally update the nodes,

and incrementally update everything. And that kind of thing is very disruptive to, like, an AI training workload or even a stateful workload. So people ended up running a lot of smaller clusters that were configured differently or owned by different people or operated differently. So the scheduling aspect, I think in retrospect,

for a lot of users just doesn't matter as much. In Kubernetes, one benefit of scheduling containers onto nodes that I think still matters is the startup time, because you can cache the container images on the nodes. If you need to

restart containers or redeploy containers, being able to kind of schedule those back onto the same nodes can be beneficial. Being able to add nodes and have workloads schedule onto those nodes to get the elasticity I mentioned before is beneficial. I think scheduling is still beneficial, but for different reasons than maybe in an on-prem environment

Brian Grant (51:50.728)
like Borg's.

Brian Grant (51:56.566)
In terms of conscious decisions, I mean, from very early on, we decided to make the scope of Kubernetes broader than Borg, to try to make that ecosystem more homogeneous for one, and also to be kind of sufficient for running an application. So if you just schedule containers dynamically, one problem you run into pretty immediately is: how do you then connect to the container running on a particular node?

So immediately you need some kind of service discovery solution. And you could say, well, we're not going to provide that service discovery solution. Users can just pick one. They can use etcd. They can use Consul. They can use Eureka. That's a user problem; they can solve that themselves. That would be a valid design position to take. But I think we felt like that was too big of a hurdle to run a lot of workloads. If you were

running one container per machine and just load balancing on VMs, you could just use a cloud load balancer. You didn't need a fancy service discovery mechanism. And then to run on a dynamically scheduled system, you would have to change your application. We didn't want users to necessarily have to change their application. So that's why the service mechanism and kube-proxy

were added, so that users wouldn't have to modify their application immediately just to run on Kubernetes. And that's also why eventually we added DNS, so that applications that identified what host they would connect to through DNS could just run on Kubernetes. We wanted pretty much anything that could run in a Docker container

that was statically started on a virtual machine to be able to run on Kubernetes. The DNS has its issues, but between the service proxy and DNS, it did actually make that possible. So there are no dynamically scheduled ports; you didn't have to worry about what port

Brian Grant (54:20.858)
your workload is running on, you can just decide what port it runs on. That makes the system more complex, but it makes the user experience of getting an application up and running a lot simpler, I think.
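
A small sketch of what that buys an application: with the Service mechanism, kube-proxy, and cluster DNS, an unmodified program can reach a backend by a stable name and port, just as it would on a statically configured VM. The service name "backend", the port, and the path below are hypothetical; the DNS form follows the cluster convention service.namespace.svc.cluster.local.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Inside a cluster, kube-proxy plus cluster DNS give a Service a stable
	// name and port, so the app dials it like any other host; no dynamic
	// port lookup or service-discovery client library is required.
	resp, err := http.Get("http://backend.default.svc.cluster.local:8080/healthz")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```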

Nitay (54:43.641)
And you made a few interesting points there, and one in particular that you've kind of highlighted a few times, it seems like, as we've talked, which is kind of people beating up systems or using them in ways that, you know, weren't necessarily intended, or potentially because of, you know, lack of expertise. Like you were saying, when you came in initially, everything was single-threaded. And then you gave the example I liked around kind of distributed systems, right? Like famously, you know, back in the day, to

build a distributed system, you'd have to go read about vector clocks and Paxos and so forth. And then Google came around and said, hey, we got this Chubby thing, which was their Paxos implementation, which is essentially what became Consul and ZooKeeper and so forth, which are Raft-based but related. And they basically said, don't worry about all that stuff. Just call this service. We got it handled. Don't try to break your head around distributed consensus. And so similarly, it sounds like you're saying here with Kubernetes, a lot of things

the end result is, like, you as a user, don't worry about the scheduling. Don't worry about it, just toss it in there and it will most likely just work. And the question that leads me to is: as you kind of think forward, and as you've seen how people use containers and Kubernetes and cluster managers in general and so forth, what should people actually concern themselves with beyond the core business logic of the thing they're trying to implement? Obviously there's always that, but beyond that, like for example, at Google,

were you guys actually trying to get everybody to understand multi-threading at a deep level so that they could utilize it heavily? Or was it, well, you know what, just write all these single-threaded, single-processor things and we'll just schedule more containers, more pods essentially, right? We'll just parallelize them, you don't worry about it. And so what are the things that, you know, as engineers, you would

like to see, or you think Kubernetes should be pushing people to say, this is where you should actually get a lot of expertise and this is where things matter, versus everything else, you know, scheduling and so forth, like: let us handle it, don't worry about it?

Brian Grant (56:49.462)
That's a good question. think one of the...

Brian Grant (56:54.434)
things that's still relatively hard, or at least not standardized... Well, I'd say in general,

something that's not very standardized is how the application interfaces with the management system and how the management system interfaces with the application. So what I mean by that is: a simple thing is, I want to run an application; how do I set some of its configuration parameters?

That might be by environment variables. It might be by configuration file. That configuration file could be TOML or INI, JSON, XML. It could be a lot of different formats. It could be a custom format: Nginx has a custom format, Kafka has a custom format. So that's

not very standardized, and that makes it harder to manage applications. There are a bunch of other interfaces. I think of it as inbound: application configuration is like an inbound interface. There are also outbound interfaces; OpenTelemetry, I think, is a good example, for monitoring information about what's going on with the application. How many requests is an application receiving? What is the average latency of those requests?

Logging is another example that came up earlier. Docker kind of semi-standardized how logging works, not entirely. So that could be formalized a bit more. Readiness and liveness probes, or what people typically call health checks, I think would be another good example. So if a lot of these things were more standardized, or at least there were ways to discover

Brian Grant (58:50.9)
what the application expected or what the application provided, then it would be more straightforward to manage the applications. Right now, it requires a lot of manual configuration to make that stuff work. Or like the application connecting to a backend service like a database or a messaging system or a key-value store or an object store or something like that. The

mechanisms for making that connection are very non-standard. There are mechanisms for doing it. Heroku has a mechanism for injecting that kind of information, like database URL, Redis URL, things like that. I was looking at Radius recently. Crossplane has its own way of doing that.

Of course, everything related to secret management is super painful. How do you manage the secrets? How do you get them into the application?

Brian Grant (59:59.086)
So I think kind of there's a typical set of

Brian Grant (01:00:05.01)
dependencies, I guess, and sort of needs that an application would have running on any system, like Kubernetes or Platform as a Service or some other container platform like ECS or Cloud Run or something like that. But the lack of standardization around these things just puts a lot of burden on the user to figure out how to wire everything together.
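
One of the few de facto conventions in this space is the Heroku/12-factor style of injecting configuration through environment variables. A minimal sketch, assuming the platform sets variables like DATABASE_URL and PORT (these names are conventions, not a formal standard):

```go
package main

import (
	"fmt"
	"os"
)

// getenv reads an environment variable injected by the platform,
// falling back to a local development default when it is unset.
func getenv(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	// The app reads its wiring at startup instead of relying on a
	// vendor-specific configuration file format.
	dbURL := getenv("DATABASE_URL", "postgres://localhost:5432/dev")
	port := getenv("PORT", "8080")
	fmt.Printf("connecting to %s, listening on :%s\n", dbURL, port)
}
```

The point of the sketch is the interface, not the code: if every platform injected connection details the same way, the "wiring" burden Brian describes would mostly disappear.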

Nitay (01:00:36.975)
Why is it that that problem hasn't been solved yet? Or, to put it differently, why are the current approaches doing it the wrong way, if you will, right? Like, to the point of the previous part of the conversation, you know, we've solved complex standards around networking and complex standards around security and logging, things that are deeply technical challenges.

It seems like config is like: I write some key, I write some value, I put it somewhere. Like, what's the big deal, right? Like, yeah, this is in a way kind of the lowest-hanging fruit at this point, it seems like. And so why is it, you think, that that hasn't been solved right, or that people keep kind of bashing their heads with the wrong approaches?

Brian Grant (01:01:16.482)
Well, I think one of the number one challenges is just the lack of consistency, and the existence of many applications already. Like I mentioned, we supported DNS in Kubernetes because a lot of applications use DNS. If there's not a de facto standard already, it becomes hard to

amplify. OpenTelemetry is kind of interesting in that it hit at a time when there was growing interest in observability, partly because of the emergence of Kubernetes and, kind of side by side in CNCF, Prometheus. There became more awareness that there was a need for this new kind of telemetry, so people were instrumenting their applications. The time was kind of ripe for

developing a de facto standard for that. There hasn't been a similar moment in some of these other areas, I don't think. So, you know, there are a bunch of existing applications; different languages have different standard conventions, like property files in Java, or XML from, you know, at least a certain generation of Java applications. Whereas other

languages have their own conventions. There hasn't been... you know, Spring is an example of a framework that had a bunch of its own conventions. Not every language has a framework as dominant as Spring that could sort of drive a de facto standard, and certainly across languages I haven't seen crossover in the way some of these types of things are done. But, you know, now that more and more applications are

running in a dynamically scheduled environment, whether it be Kubernetes or ECS or Nomad or whatnot, maybe there'll be more interest in developing a de facto standard, if there are tools around it that can take advantage of that standardization, if you get that plug-and-play ability.

Brian Grant (01:03:36.398)
So I think containers achieved that. OpenTelemetry is hopefully on its way to achieving that. So yeah, I think you need some sort of window of opportunity, or a driving forcing function, or some other adjacency that actually is able to drive the de facto standard. But in terms of a

Brian Grant (01:04:05.144)
company driving it, I think it's unclear how to monetize it. So I don't know, for that aspect, how likely a company is to drive this standardization. Dapr is kind of interesting, which Diagrid is trying to productize, I believe. And that takes it a bit beyond

just the kind of the minimal information about how to connect to backend services, but actually creating more opinionated interfaces to those backend services.

Brian Grant (01:04:52.087)
Yeah, so I'd say the business challenges and business incentives are maybe more challenging than the technical ones.

Nitay (01:05:03.117)
Yeah, that's an interesting point. So, business challenges around creating a standard, and preventing the old XKCD comic, or whatever it was, that was like, you know...

Brian Grant (01:05:11.278)
Well, there will always be the XKCD "now there are 15 standards" thing, but, you know, you just need the new thing to have enough critical mass that most new people would just say, yeah, you know, I don't have to invent something from scratch, I'll just choose the obvious thing, and there are benefits to adopting that. And it's easy to adopt because my app framework supports it, or

Nitay (01:05:14.957)
Right.

Brian Grant (01:05:38.758)
My other tooling supports it. There's a VS Code plugin, et cetera.

Nitay (01:05:44.867)
And so, to that point, it sounds like it's mostly about the business standards, the incentives, and the plug-and-play integrations with all the existing tooling out there. Or is there anything different, you know, if you were starting from zero today and designing that spec, if you will, anything particularly different that you would do, or anything that, you know, you think is critical in order for that standard to actually take off and be adopted?

Brian Grant (01:06:16.462)
Well, that's a good question. I mean, if I look at Heroku as an example, you know, it has at least some aspects of those types of interfaces, but they were attached to the Heroku platform. So in order to become more broadly adopted, it would have to get buy-in from other platforms or from, you know, more loosely coupled

projects, like from app frameworks and from database vendors and from observability vendors and things like that. So it's a lot of work to create that kind of coalition of aligned interests.

Brian Grant (01:07:06.79)
So we'll have to see what would be the incentive for that kind of coalition, I guess, to emerge. With OpenTelemetry, I think for a lot of the newer observability companies, either you'd have to build a collector that adapted from whatever people were exporting to whatever format you could ingest, or convince customers to

export your ingestion format directly. For any new company, that's a pretty high bar. But if you say, well, all the companies are adopting the same format, then it becomes less of a hurdle to get customers to adopt it. So there's sort of a mutual incentive for that kind of cooperation, I guess.

Nitay (01:08:09.849)
Yeah, creating standards is tough, the enormous kind of alignment and consortiums you have to put together. Cool. Well, I think we're...

Brian Grant (01:08:17.282)
Yeah, I mean, for Kubernetes, you know, the...

Brian Grant (01:08:25.838)
the incentive was really around effectively making it easier to consume infrastructure. So by the end of 2017, I think all the major vendors had a Kubernetes product, because it was popular enough that they saw it as in their interest to make sure that it was easy to adopt. And that in turn made the whole Kubernetes ecosystem larger,

which created more incentive for users to adopt it. So there's this sort of virtuous cycle of network effects. So you kind of need to hit that, find a similar scenario where you can generate those sorts of network effects.

Nitay (01:09:13.881)
Yeah, the network effect is a huge one in terms of getting the standard up and going. And I know there were a lot of interesting kind of strategic decisions at that point, I imagine, as well. Like, you know, I have to imagine Google was seeing AWS take off and the lock-in that you have with, you know, running workloads on EC2 and so forth. And so things like Kubernetes helped kind of

create a layer above that that abstracts away and kind of dissociates you from the low-level cloud resources, and allows you to get more flexibility in terms of where you run your workloads and shifting and moving things around and so forth.

Brian Grant (01:09:53.612)
Yeah. And that was really another reason why we wanted that thicker abstraction as well. Because then, you know, the application developers, or whoever was deploying and managing the applications, could live in, spend more of their time just in Kubernetes, and not be as tightly coupled to an individual cloud provider. Right? So if we didn't have service discovery and load balancing at all in Kubernetes, then

Nitay (01:10:14.701)
Exactly.

Brian Grant (01:10:22.156)
you would fall back to whatever the cloud provider provided and then you wouldn't have the same degree of workload portability.

Nitay (01:10:30.595)
Right, so it quickly became kind of table stakes. Otherwise, you get locked in and you're limited in what you can do.

Brian Grant (01:10:37.612)
Yeah, so Kubernetes is more like a cloud in a box effectively than something that just runs containers.

Nitay (01:10:45.347)
Right. Right.

Very cool. OK, I think we're basically close to wrapping up now. Any last thoughts in terms of where you'd like to see the community or the ecosystem as a whole evolve, what you'd like to see folks work on or shift towards? I know you mentioned some of the standards around configs. Anything else you've been thinking a lot about and would love to see the community dig in on?

Brian Grant (01:11:15.948)
Yeah, what I was talking about before was the interface between applications and the system that manages applications, like Kubernetes or Heroku or whatever, or connections between the applications and infrastructure components like databases and messaging systems and so on. I also think the higher-level

Brian Grant (01:11:42.664)
ways of managing both the applications and the infrastructure that they run on or consume... You know, I think there's definitely room for innovation there. The current kind of best practice is using infrastructure as code, and it feels like we've kind of hit a dead end with infrastructure as code. You know, it makes certain things easier and it makes other things harder. So that's an area that

I'm looking at. Based on my experience with Kubernetes, I think there are opportunities for doing things differently that

Brian Grant (01:12:24.334)
could retain the things that are kind of beneficial from infrastructure as code, but address some of the fundamental issues. I think another interesting effort in that area is System Initiative. They're kind of taking a different look at, you know, core tenets of infrastructure as code, like version control, and sort of rethinking that.