Pop Goes the Stack

GPUs get all the attention, but in inference, the real bottleneck is often memory, specifically the KV cache. In this episode of Pop Goes the Stack, Lori MacVittie sits down with Tim Michels to explain why inference stopped being stateless the moment long contexts, multi-turn conversations, and never-ending agents became normal. That state has to live somewhere, and too often it’s living in the most expensive place in the stack.

Tim breaks down what KV cache actually is by separating inference into its two phases: prefill, where prompts are tokenized and transformed into the internal structures the model needs, and decode, where the response is generated token by token. KV cache is the bridge between them, and keeping it available lets the model skip expensive recomputation and drastically improves time to first token.

From there, the conversation moves into the architectural shift: building a memory hierarchy that offloads cache from GPU HBM to host DRAM, to local SSD, and even to network-attached storage. It’s slower than keeping everything on-GPU, but still faster than starting cold. They also cover semantic caching as an external shortcut, and why routing and load balancing need to become cache-aware, steering users back to the GPU or cluster that already holds their state.

The big takeaway for enterprises is practical: stop accepting “buy more GPUs” as the default plan. KV cache awareness, smarter routing, and storage/network tuning are where the next 2x to 5x efficiency gains are likely to come from, especially as agentic workloads multiply demand.

Creators and Guests

Host
Lori MacVittie
Distinguished Engineer and Chief Evangelist at F5, Lori has more than 25 years of industry experience spanning application development, IT architecture, and network and systems operations. She co-authored the CADD profile for ANSI NCITS 320-1998 and is a prolific author with books spanning security, cloud, and enterprise architecture.
Producer
Tabitha R.R. Powell
Technical Thought Leadership Evangelist producing content that makes complex ideas clear and engaging.
Guest
Tim Michels
F5 Distinguished Engineer and Architect integrating concepts for platform design, hardware acceleration (custom and COTS), and software stack functional delineation and abstraction to create hyper-converged appliances, chassis, and COTS systems.

What is Pop Goes the Stack?

Explore the evolving world of application delivery and security. Each episode will dive into technologies shaping the future of operations, analyze emerging trends, and discuss the impacts of innovations on the tech stack.

Lori MacVittie (00:05.11)
Welcome back to Pop Goes the Stack, the podcast where the future of tech shows up with big promises and no run book. I'm Lori MacVittie and I'm here to read the fine print. So today we wanted to talk about GPUs because everybody else is talking about GPUs and they think that they're the bottleneck for getting performance and scale for inference. And guess what? It's not, actually, it's memory.

And more specifically, it's the KV cache. It's quietly eating your lunch while everybody else is busy counting accelerators. Because here's the thing, inference stopped being stateless the minute we started doing real work with it. Long context, multi-turn conversations, agents that don't know how to quit, all that piles up state and that state has to live somewhere. And right now, too often, it's living in the most expensive place possible.

So in this episode, we've got Tim Michels. Hi, Tim.

Tim Michels (01:09.326)
Good morning, Lori.

Lori MacVittie (01:11.108)
Good morning. To dig into KV cache offload and other offloads. We're going to talk about a few different types to kind of understand what's going on. And we want to talk about it not as a cool optimization, but as the architectural shift it actually is. You know, what happens when you stop treating inference like a request pipeline and start treating it like a stateful system? Because if you don't, you're going to spend a lot of money recomputing yesterday's thoughts.

Now you'll notice Joel Moses is not here today and that's because Joel "OpenClaw" Moses is off in some kind of execution loop and couldn't make it. But he'll be back, don't worry. So it's just you and I, Tim, and let's dive in and start by kind of defining what the heck is KV cache? Everybody's talking about it. What is it actually?

Tim Michels (02:06.03)
Yeah, well, I think in order to understand KV cache you have to get beyond this monolithic understanding of how an LLM actually functions and how it's built. So when you think about an LLM, you send it a prompt, it chews on the prompt a while and it spits out a response. That's the monolithic view. But when you actually go a click down under the covers, it's actually a two-step process.

So the prompt comes in and it gets chopped into tokens and these tokens are processed, vectorized, labeled, however you want to think about it. And this creates the base data structures to then be fed into the second stage, which is the decode, which does a token by token response generation. But that middle thing, the output of the prefill that goes into decode, is called the KV cache. And so that's the thing we're talking about.

And if you have KV cache handy, you can skip the very expensive and slow prefill stage, which is taking prompt text and turning it into vectorized context for the LLM.
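
To make the two phases concrete, here is a minimal sketch in Python of where the KV cache sits. The dimensions, weights, and single-head "attention" are illustrative assumptions rather than any real model: prefill projects the whole prompt into keys and values once, and each decode step only projects the newest token, reusing everything already cached.

```python
import numpy as np

# Toy single-head attention to show where the KV cache lives.
# Dimensions and weights are made up purely for illustration.
D = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = [rng.standard_normal((D, D)) for _ in range(3)]

def prefill(prompt_embeddings):
    """Compute-heavy pass over the whole prompt; its output IS the KV cache."""
    return {"K": prompt_embeddings @ Wk,   # [prompt_len, D]
            "V": prompt_embeddings @ Wv}   # [prompt_len, D]

def decode_step(kv_cache, new_token_embedding):
    """Generate one token: only the newest token is projected and appended.
    Everything already in the cache is reused, never recomputed."""
    kv_cache["K"] = np.vstack([kv_cache["K"], new_token_embedding @ Wk])
    kv_cache["V"] = np.vstack([kv_cache["V"], new_token_embedding @ Wv])
    q = new_token_embedding @ Wq
    scores = q @ kv_cache["K"].T / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv_cache["V"]          # attention output for this step

prompt = rng.standard_normal((1000, D))     # a 1,000-token prompt
cache = prefill(prompt)                     # expensive, compute-bound
for _ in range(3):                          # cheap, memory-bound steps
    out = decode_step(cache, rng.standard_normal((1, D)))
```

If the cache from an earlier turn is still available, the prefill call above is exactly the work that gets skipped.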

Lori MacVittie (03:18.544)
And that's helpful, especially as we start hearing people talk about distributed inference, where they're actually starting to break that up based on that division. So there's the prefill that fills up the KV cache that goes into decode and then gets churned on by the model. Right?

Tim Michels (03:38.724)
Yeah, and an observation people have made to take advantage of that separation is that prefill is computationally bound. It's very heavy in the number of flops required, as opposed to decode, which has to hold the entire KV cache context and operate on it. So it tends to be memory bound. So if you can have dissimilar heterogeneous compute, you can actually say, hey, these nodes with their GPUs are better. They're built for prefill.

Versus this pool of nodes is better and built for decode. And so you can actually get sort of a separation of concerns and target the right infrastructure for each job.
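
A hedged sketch of what that separation of concerns could look like at the scheduler level. The pool names and routing rule below are assumptions made for illustration, not taken from any particular serving framework:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    user: str
    prompt_tokens: int
    kv_cache_ref: Optional[str] = None   # set once prefill has produced a cache

# Hypothetical pools: FLOP-heavy GPUs for prefill, memory-heavy GPUs for decode.
PREFILL_POOL = ["prefill-gpu-0", "prefill-gpu-1"]
DECODE_POOL = ["decode-gpu-0", "decode-gpu-1", "decode-gpu-2"]

def schedule(req: Request) -> str:
    if req.kv_cache_ref is None:
        # No cache yet: the compute-bound prefill pass needs FLOPs.
        return PREFILL_POOL[hash(req.user) % len(PREFILL_POOL)]
    # Cache exists: token-by-token decode mostly needs room to hold the context.
    return DECODE_POOL[hash(req.kv_cache_ref) % len(DECODE_POOL)]

print(schedule(Request("lori", 1200)))                      # lands in the prefill pool
print(schedule(Request("lori", 1200, kv_cache_ref="s42")))  # lands in the decode pool
```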

Lori MacVittie (04:19.646)
And that's good because right now we all know that GPU memory specifically is very, very expensive and it's hard to get your hands on. At least, that's what I hear.

Tim Michels (04:29.156)
Yeah, I think there's two elements. Before I jump into the memory, let's talk about the GPU part, which is, you know, if you just send your prompt and it chooses a random, say a round-robin GPU, when you get there, there's nothing, there's no context. You have to build everything from scratch. And this gives you the longest time to first token. This gives you the most consumption of GPU resources to get your response.

Lori MacVittie (04:33.508)
Okay. All right. Okay.

Tim Michels (04:55.706)
But if you've been there before and you've essentially asked the same query, maybe it's the same query with a different detail added to it, what you really want to do is go back to that same GPU that already has your cache and start with that work already done. So the ability of a load balancing system to understand prompt to GPU assignments so that you're targeting GPUs that might have that cache still available to them, that's a really powerful tool.
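
Here is a minimal sketch of that kind of cache-aware assignment, assuming a hypothetical session-to-GPU affinity table with a TTL, since the cache on a given GPU may have been evicted by the time the user comes back:

```python
import time

CACHE_TTL_SECONDS = 300          # assumed eviction horizon, purely illustrative
GPUS = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]
affinity = {}                    # session_id -> (gpu, last_seen_timestamp)

def route(session_id: str) -> str:
    """Prefer the GPU that likely still holds this session's KV cache;
    otherwise fall back to a plain hash assignment and remember it."""
    entry = affinity.get(session_id)
    if entry and time.time() - entry[1] < CACHE_TTL_SECONDS:
        gpu = entry[0]           # warm: most of the prefill work can be skipped
    else:
        gpu = GPUS[hash(session_id) % len(GPUS)]   # cold: full prefill required
    affinity[session_id] = (gpu, time.time())
    return gpu

print(route("lori-session-1"))   # cold start
print(route("lori-session-1"))   # same GPU again while the cache is still warm
```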

Lori MacVittie (05:24.126)
Ahhh.

Tim Michels (05:24.248)
Now, how does that overcome your problem, right? You're saying, hey, this cache is in HBM memory. This is the smallest, most expensive piece of memory in the system

Lori MacVittie (05:33.33)
Yeah.

Tim Michels
that's right there next to the GPU compute. The way we tackle that is creating a memory hierarchy. This idea that, yes, there's HBM memory. That's amazing, but it's small. Then there's the host memory of the host CPU that's holding that GPU. And it has a much bigger DRAM footprint than the HBM footprint.

Tim Michels (05:53.112)
And then within that node, there's also a local SSD, which is probably measured in many terabytes of storage. And then behind that, we have a fourth tier, which is network attached storage, which can be measured in petabytes. Each of these is colder. Each of them is further away. Each of them is slower to get to. They are all faster than having to recompute.
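
A sketch of that four-tier lookup, with placeholder capacities rather than real measurements: check the warmest tier first, fall back tier by tier, and only recompute prefill if every tier misses.

```python
# Tier order: HBM on the GPU, host DRAM, local SSD, network-attached storage.
# Capacities are illustrative placeholders, not measurements of any real system.
TIERS = [
    {"name": "HBM",  "capacity_gb": 80,        "store": {}},
    {"name": "DRAM", "capacity_gb": 2_000,     "store": {}},
    {"name": "SSD",  "capacity_gb": 30_000,    "store": {}},
    {"name": "NAS",  "capacity_gb": 5_000_000, "store": {}},
]

def fetch_kv_cache(key: str):
    """Check the warmest tier first and fall back tier by tier; each tier is
    slower than the one before it, but every one beats recomputing prefill."""
    for tier in TIERS:
        if key in tier["store"]:
            return tier["store"][key], tier["name"]
    return None, "recompute"                 # total miss: pay for prefill again

def store_kv_cache(key: str, blob: bytes):
    TIERS[0]["store"][key] = blob            # naive placement: hottest tier first

store_kv_cache("lori/prompt-123", b"...imagine 60 GB of keys and values...")
print(fetch_kv_cache("lori/prompt-123"))     # hit in HBM
print(fetch_kv_cache("unknown-session"))     # (None, 'recompute')
```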

Lori MacVittie (06:15.038)
Wow. And that's really what's going on is we're trying to improve the performance and efficiency of these systems because, well, they're really expensive and users don't like to wait for a recipe for, you know, apple crumble. Like, I can't wait, right?

Tim Michels (06:30.896)
Yeah, with time to first token being one of the key measurements, yeah.

Lori MacVittie (06:35.388)
Right, and the way that we've dealt with this through all of the application generations, if you will, we've always gone back to how can we offload or cache some information that will make it faster to do this, to avoid something that is computationally expensive, right? That's one of the reasons that sticky sessions work: the computational expense there is usually, "well, I have to actually negotiate the TCP and I have to open up a socket and there's IO and it's expensive. So let's just keep it open." Right. We kind of cache the session in order to make it faster. And every time a new application type comes out, we see these problems like, huh, this is really, you know, taking a lot of time. How can we optimize it by caching or offloading?

Now KV Cache is one way to do that, but you mentioned semantic caching earlier too as another option for how to actually make this faster. So can you explain that?

Tim Michels (07:38.158)
Right. KV cache is getting into that middle step between the prefill and the decode within the LLM. The alternate view is I'm going to step outside the front door of the LLM. I'm going to observe prompts that go in, capture the replies that come back. And then I'm going to cache that as a semantic cache. If I see another prompt come in that matches something in my cache, I don't bother the LLM, I just respond immediately with the same response because it's the same prompt.

So you assume the response has to be the same, right?
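
A minimal sketch of a semantic cache sitting in front of the model. The exact-match dictionary here is the simplest possible version (production semantic caches typically match on embedding similarity, not identical strings), and call_llm is a stand-in, not a real API:

```python
semantic_cache = {}   # normalized prompt -> previously generated response

def call_llm(prompt: str) -> str:
    # Stand-in for the expensive prefill + decode round trip.
    return f"(model output for: {prompt})"

def answer(prompt: str) -> str:
    # Crude normalization; real semantic caches match on embedding similarity.
    key = " ".join(prompt.lower().split())
    if key in semantic_cache:
        return semantic_cache[key]           # hit: the LLM is never touched
    response = call_llm(prompt)
    semantic_cache[key] = response
    return response

answer("How do I make apple crumble?")    # miss: pays for inference once
answer("how do I make  apple  crumble?")  # hit: answered straight from the cache
```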

Lori MacVittie (08:12.54)
Right, well that's, yeah

Tim Michels
So that's a really powerful technique for leveraging your infrastructure to get much faster response times and much more effective compute through a

Lori MacVittie (08:23.048)
Yeah, I've often, you hear this, you know, every time you say thank you, it's costing whoever is hosting that LLM, it's costing them millions because you say thank you. If there was an intermediary in there, it could just catch the thank you and respond appropriately and never actually incur the cost. Like this seems to be like an obvious solution to me, but call me silly. Maybe I've been in that, you know, app delivery intermediary world too much.

I'm like, that makes sense, right.

Tim Michels
Right.

Lori MacVittie
So similar. Right, being able to intercept and reuse any time. And part of the reason that we need to think about things like this is that the context windows have grown incredibly huge and that gets sent back. And that's causing a lot of this problem, I guess you would say, in the KV cache, because if you have a million token context, right? I mean, that's huge. It's got to process that, right, if you send it back.

Or is only part of it recomputed? So that's the other question. If I'm in the middle of writing a book with an LLM and it's got a lot of this already, how much of that is in the cache? Is it only the new stuff? Is it the old stuff? What's going on in there? How much of that is actually causing the problem?

Tim Michels (09:39.386)
Yeah. Yeah, I think the size of the cache is a big problem. So there's some math that says an individual token in cache represents 500 kilobytes of data. That's one token. And so when we build out, say an average prompt, we quickly get to a size of like 60 gigabytes for one prompt, one KV cache for one prompt. So even when you say, "Hey, you know, this thing has filled up my HBM, I push it off to storage somewhere. Now I want to go get it and I want to get it now, and I'm in a hurry."

Oh, we got to stream a 60 gigabyte storage query back to the GPU systems. The networks aren't necessarily built for these elephant flows, right? We're talking about a low number of queries, but massive transfers. And I think it's breaking networks and it's breaking storage arrays. Their TCP stacks aren't necessarily tuned for this. The intermediate middle boxes aren't tuned for this.

There are problems that show up with these workload patterns.
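
A quick sanity check on the numbers Tim cites, using his 500 KB-per-token and 60 GB-per-prompt figures; the 100 Gb/s link is an assumed example, but it shows why one cache fetch behaves like an elephant flow:

```python
bytes_per_token = 500 * 1024             # ~500 KB of keys/values per token (Tim's figure)
cache_size_gb = 60                       # ~60 GB for one prompt's KV cache (Tim's figure)

tokens = cache_size_gb * 1024**3 / bytes_per_token
print(f"{tokens:,.0f} tokens of context")        # ~126,000 tokens per prompt

# Shipping one such cache over an assumed 100 Gb/s link, ignoring protocol overhead:
seconds = cache_size_gb * 8 / 100
print(f"{seconds:.1f} s per cache fetch")        # ~4.8 s: a classic elephant flow
```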

Lori MacVittie (10:43.943)
I hear RDMA tossed around a lot as solving that back end problem. But then again, I've heard a lot of different, you know, InfiniBand, this, that, right, on the back end to try and solve some of this "we need to store it."

Tim Michels (10:55.076)
Yeah, I think it's, yeah, there is some RDMA, but it's more of a traditional Ethernet networking problem.

Lori MacVittie
Okay.

Tim Michels
And in fact, companies like Nvidia have come up with their NIXL (pronounced "nixle"), which is a networking library. I think it stands for Nvidia Inference Transfer Library. And so this is their mechanism of having an optimized networking stack just for this problem. And so if you can get that stack running on the GPU side and on the storage side, you can get very efficient transfers between the two. Because it's been tuned for these large transfers.

Lori MacVittie (11:30.911)
Well, if it wasn't stateful, you know, if you didn't care and you just answered every prompt cold, right? I mean, why do you need to know my entire history before you answer? Right? And I mean, you do. Right? The idea that these were going to be stateless at all was crazy.

Tim Michels (11:44.228)
But your CFO won't like the answer. If your answer is buy more GPUs, your CFO won't like that answer.

Lori MacVittie (11:52.775)
No, no. Well, they don't like buy anything. You put buy in a sentence and they just immediately go, no, you can't have that. So there's other solutions and one of them, and you mentioned this earlier so now I want to come back to that, is the load balancing, right? The routing, right? As we're starting to look at it, because it's less about, I mean, it is load balancing, but it's more routing to the right model. And you're using different values.

Like traditionally we base it on what, you know, how fast you responded, how many connections you have, what's your health, things like that. Location, you can do sovereignty, all sorts of things. But you're saying we actually need to start thinking about incorporating things like KV cache. What's the queue length? Variables that tell us, you know, how efficient a particular model is going to be at processing this prompt and choose that instead. Yes?

Tim Michels (12:49.7)
Yeah, so if we go to a throwback story, right? Back in the day for web caching, it was about persistence for the shopping cart. So you open a session, you're shopping, you go away. When you come back, if you go to a different server, it knows nothing about your shopping experience. So you need to persist. Your new session needs to know, oh, this is Lori. She was just here 10 minutes ago. Let's take her back to her cart and show her the items she had already set up to buy.

Lori MacVittie (12:59.294)
Yeah.

Tim Michels (13:16.91)
That kind of persistence is similar to the KV cache. You've been here before, you created some state, and now for a better user experience, we need to take you back to where that state is sitting, and that's typically near the GPU or AI cluster that you were on before.

Lori MacVittie (13:34.147)
That requires a little bit of accounting, right? You have to keep track of which user went to which GPU or which model, you know, where is it likely to be? And that actually seems a little easier than keeping track of what's the state of the KV cache and the queue length in every single model instance. It's just like, "Hey, it's Lori. She was on number one before. Let's send her back there."

Tim Michels (13:46.416)
Right.

Tim Michels (14:00.056)
And you can see the similarity to semantic caching because you're sort of keeping track of the same keys, which is the prompt and the user. But now you're just keeping the location as the output of the cache, not the actual response.

Lori MacVittie (14:15.057)
So it's like sending it to a distributed cache. Sort of like, I mean, almost like sharding. Like, oh, this is A to M; it goes over here because that's

Tim Michels (14:22.281)
It is like sharding, yes.

Lori MacVittie
Yeah, oooo. Ah, I caught that. I'm happy with myself this morning now because like I caught that. But those are the kinds of techniques that we have to get to to be able to deal with this because it's not, even if GPUs and the memory associated with them like are suddenly like, hey, it's a nickel, right?

Lori MacVittie (14:44.933)
I mean, they're just easy to get and they're very, very cheap, you still have the problem of performance. That doesn't solve the performance issue and keeping context together. And that's really something for that app delivery tier, if you will, that routing to help manage because the notion that these are going to be stateless is bunk. Right? They are stateful in a sense, and we need to adapt to be able to optimize the entire data flow, if you will.

Tim Michels (15:13.594)
Yeah, I think if

Lori MacVittie (15:14.288)
using different tactics.

Tim Michels
you look at the arc of the AI infrastructure build out, it's been "run and build as fast as we can." And there's no time for optimization. But as we start to run into the power wall, which means you don't have any more gigawatts, the space, you don't have any more data center space, and the dollars, you just run out of your trillions of dollars, we can't build more fast enough. So we have to turn back and say, "Hey, can we get a lot more performance out of what we've already built?" And then I think we slow down and we start to

Tim Michels (15:43.982)
look at these systems and start to tune them. And we'll discover there are 2x, 3x, 5x improvements we can make on our existing infrastructure. And a lot of that revolves around this idea of KV cache, location awareness, semantic caches. This is where we can get high leverage very quickly.

Lori MacVittie (16:04.319)
Cool. We need to get on that because this isn't going away and I think agents are going to exacerbate this problem. Because now you've got, and they need to be fast, right? We expect them to be fast at least, like go do these things and do them quickly because I'm waiting for the answer or a response. And you're going to have the same issues with them, but, at least as we assume at this point, they will be one to many. One user might have many agents running around. So we're actually multiplying our impact on how much is being used. So we need to get really efficient and optimize how that actually happens.

Tim Michels (16:43.908)
Yeah, and the scale of all this is tremendous. If you're building, say, a KV caching solution for 100,000 users in a hyperscaler use case, you're talking hundreds of petabytes of KV cache storage in order to run the four-tier system, right? So how do we hook up all that networking? How do we make all of that work? These are interesting problems that the networking and storage companies are only just now beginning to grapple with.
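
A rough back-of-the-envelope for that scale, reusing the earlier 60 GB-per-prompt figure and assuming, purely for illustration, that each user keeps a couple dozen cached sessions warm across the tiers:

```python
users = 100_000
gb_per_kv_cache = 60          # one prompt's cache, from the earlier figure
retained_per_user = 20        # assumed number of sessions kept warm across the tiers

total_pb = users * gb_per_kv_cache * retained_per_user / 1_000_000
print(f"{total_pb:.0f} PB")   # ~120 PB, i.e. the hundreds-of-petabytes range
```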

Lori MacVittie (17:17.619)
Yeah, and I mean, I hear about the distributed inference and part of me goes, well, that makes sense. And it does, right? And especially as you described it, right? Being able to optimize for the resources and the hardware that you have and build it out that way. But then we're introducing another network hop and another network thing that we have to optimize, make sure it can handle the speed, the bandwidth, right, everything.

And every time we do that, we've seen us devolve into this, you know, SOA microservices. There's just too many. Right? We can't actually like call enough things fast enough. So I think there's a balance there to be very careful about as we start distributing this stuff is that we can't distribute it too far yet because the network still remains kind of, in some cases, a bottleneck in that we just can't move stuff. We can't move data through the system fast enough.

Tim Michels (18:11.972)
Yeah, I mean that drives you to data locality, but again, this is a caching

Lori MacVittie (18:15.365)
Mmm. Yeah.

Tim Michels
story, right?

Lori MacVittie (18:15.365)
Oh, yeah, we're back to caching.

Tim Michels
And let's move the prompt around because it's much smaller than the KV cache.

Lori MacVittie (18:22.003)
I like that. Yes. Move the small things. That's going to be my takeaway. Right? Right now I'm saying that's the takeaway. Right, is you move the smallest things. Reuse as much as you can and store the small things, not the big things.

Tim Michels (18:36.794)
Yeah, and just the economics of this are going to drive incredible progress because the most expensive thing by far is your GPU compute. So we need to be investing in storage and networking solutions and intelligence at the edge of these AI clusters to make the AI clusters more efficient because that's where the money is. And as we start to monetize this AI infrastructure, it's going to be all about efficiency, right, and TCO.

Lori MacVittie (19:00.782)
Right.

Tim Michels (19:04.153)
And so, you know, there will be money spent on storage and network efficiency solutions because the payback of it is enormous.

Lori MacVittie (19:11.507)
Well what, so if you were going to give a takeaway for the enterprise who isn't, right, they're not building these solutions, they're looking for those solutions, what can they do right now to help start being ready to optimize and make the most of what they do have? What can they do?

Tim Michels (19:32.56)
I think if they're building new infrastructure, they need to talk to their partners upfront about efficiency solutions and not just accept the answer of buy more GPUs. That's the wrong answer.

Lori MacVittie (19:46.441)
Hasn't

Tim Michels
If they already have GPUs, I think they can look into retrofitting on the outside to get more out of them. And there's been a lot of examples already in the open community about how this is done and the results have been very encouraging.

Lori MacVittie (20:00.871)
I like that. I mean, boiling that down, the answer is never throw more hardware at it. Like that's just, it's never the best answer. It is an answer. People do it. But it's never the best answer and it's not gonna be the most efficient. So yeah, we're still figuring it out. I mean, that's the honest truth. You know, we've got cache ideas, offloads, network. There's a lot of moving pieces here to get this as it starts to scale out. We're just starting to see the problems that we actually need to fix. So it's coming.

Tim Michels (20:31.997)
And the encouraging thing, Lori, is we've solved a lot of these problems in other domains.

Lori MacVittie (20:37.683)
Yeah. Yeah.

Tim Michels
We just need to bring those thoughts and solutions to this new domain, which is AI compute.

Lori MacVittie (20:41.639)
Right, yeah, I think it is we just needed to see that the problem exists and go, now we need to solve that one. It's a little different, let's do this. So, we will solve it and that's a good thing. That's a good thing. Not today, because that's a wrap for this episode of Pop Goes the Stack. If this episode helped you spot the next production headache early, subscribe before the change window closes.