{"type":"rich","version":"1.0","provider_name":"Transistor","provider_url":"https://transistor.fm","author_name":"Pop Goes the Stack","title":"KV cache is the real inference bottleneck (Not GPUs)","html":"<iframe width=\"100%\" height=\"180\" frameborder=\"no\" scrolling=\"no\" seamless src=\"https://share.transistor.fm/e/7f0e9e41\"></iframe>","width":"100%","height":180,"duration":1269,"description":"GPUs get all the attention, but in inference, the real bottleneck is often memory, specifically the KV cache. In this episode of Pop Goes the Stack, Lori MacVittie sits down with Tim Michels to explain why inference stopped being stateless the moment long contexts, multi-turn conversations, and never-ending agents became normal. That state has to live somewhere, and too often it’s living in the most expensive place in the stack. Tim breaks down what KV cache actually is by separating inference into its two phases: prefill, where prompts are tokenized and transformed into the internal structures the model needs, and decode, where the response is generated token by token. KV cache is the bridge between them, and keeping it available can skip expensive recomputation and drastically improve time to first token. From there, the conversation moves into the architectural shift: building a memory hierarchy that offloads cache from GPU HBM to host DRAM, to local SSD, and even to network-attached storage. It’s slower than keeping everything on-GPU, but still faster than starting cold. They also cover semantic caching as an external shortcut, and why routing and load balancing need to become cache-aware, steering users back to the GPU or cluster that already holds their state. The big takeaway for enterprises is practical: stop accepting “buy more GPUs” as the default plan. 
KV cache awareness, smarter routing, and storage/network tuning are where the next 2x to 5x efficiency gains are likely to come from, especially as agentic workloads multiply demand.","thumbnail_url":"https://img.transistorcdn.com/EOH5giVF50GDCoaIBECLMap8fBWcZH3C5tsFwM0Tn9s/rs:fill:0:0:1/w:400/h:400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS80MGQ2/ZDBjM2JjMmMyZDg0/MGY5ZTEyYTViOTgy/N2RiYS5wbmc.webp","thumbnail_width":300,"thumbnail_height":300}