Explore the evolving world of application delivery and security. Each episode will dive into technologies shaping the future of operations, analyze emerging trends, and discuss the impacts of innovations on the tech stack.
00:00:05:06 - 00:00:22:18
Lori MacVittie
Hey, it's Pop Goes the Stack. The show that dissects emerging tech with all the finesse of a root cause analysis after a 3 a.m. pager alert. I am Lori MacVittie, and we're going to get postmortem today. We've got our co-host, Joel Moses. Hey, Joel.
00:00:22:21 - 00:00:23:11
Joel Moses
Hello. Good to be here.
00:00:23:13 - 00:00:41:23
Lori MacVittie
And today because we're talking about context windows, we've got Vishal Murgai, who is a senior architect in our AI Center of Excellence. That's a mouthful. He focuses on AI and data science. Welcome, Vishal.
00:00:41:25 - 00:00:43:09
Vishal Murgai
Glad to be here. Thank you.
00:00:43:12 - 00:01:14:14
Lori MacVittie
Awesome. Well, this episode is brought to you by Anthropic, as it often is. They lobbed a million-token grenade into the copilot coding wars, and suddenly every startup hawking clever context management looked like it was selling floppy disks in a cloud world. Very bad. But what's really at stake here is context, right? The running memory of a model, what it can see in a single shot.
00:01:14:16 - 00:01:39:09
Lori MacVittie
So it's the difference between feeding a model a sentence versus handing it your entire codebase or compliance manual and customer history all at one time. Now, as enterprises scale out AI and start building it themselves, and we know they are, 80% are running inference themselves, which means they have to manage context. We wanted to bring to light some of the issues, challenges.
00:01:39:14 - 00:01:53:06
Lori MacVittie
What should they be looking for? And you know, how can they understand what context is anyway? Since most of it's hidden behind chat UIs today, for most users. So let's get into it.
00:01:53:09 - 00:02:16:21
Joel Moses
Yeah. Now, Anthropic announced an increase in what's called the context window for its popular Claude Sonnet 4 foundational model back in the middle of August. This was pre-GPT-5 release. And this increase in the context window was not just an incremental one. It jumped from 200,000 to 1 million tokens, which is a pretty big jump in one go.
00:02:16:24 - 00:02:24:19
Joel Moses
Vishal, if you could, explain to our audience what a context window is and why a large context window is important.
00:02:24:21 - 00:02:44:26
Vishal Murgai
Yeah. So let's see, the definition of a context window is the maximum span of text tokens that a large language model can process at once, right? Every prompt, every document, the conversation history, and the instructions have to fit into a single window, right? That's kind of how much you can fit
00:02:44:26 - 00:03:03:12
Vishal Murgai
inside. It used to be like 512 or 1K tokens when we started all this, years ago. Now it's 1 million or more. Pretty good, right? And for our audience, a token is a chunk of text, usually three to four characters, about three quarters of a word typically. So 100 tokens is about 75 words typically.
00:03:03:12 - 00:03:26:11
Vishal Murgai
That's what I've seen from my experience, right. And the 1 million token window is roughly 700 to 750k words, right? It's a huge chunk there. So multiple books' worth of data essentially, right.
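A quick back-of-the-envelope sketch of the token math Vishal describes, in Python. The four-characters-per-token and 0.75-words-per-token ratios are rough heuristics, not exact tokenizer output:

```python
# Rough heuristics: ~4 characters or ~0.75 words per token.
# Real counts depend on the tokenizer; this is only an estimate.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75

def estimate_tokens(text: str) -> int:
    """Estimate token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def tokens_to_words(tokens: int) -> int:
    """Convert a token budget to an approximate word count."""
    return int(tokens * WORDS_PER_TOKEN)

if __name__ == "__main__":
    print(tokens_to_words(100))        # ~75 words
    print(tokens_to_words(1_000_000))  # ~750,000 words, multiple books' worth
```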
Joel Moses
Okay.
Vishal Murgai
And because the model doesn't have infinite memory, it only sees the text in the current window,
00:03:26:11 - 00:03:41:05
Vishal Murgai
right? So that's kind of how it is. Any text that falls outside the window is forgotten, basically. That's kind of the definition of it, right? It's like working memory in the human brain, if you want an analogy, right. So.
00:03:41:05 - 00:03:58:17
Joel Moses
So what approach did people use prior to the availability of larger context windows? I mean, the data that people ask of these models can sometimes exceed the context window parameters. So what approaches did they commonly use to try to analyze all the data that they needed to analyze?
00:03:58:19 - 00:04:16:09
Vishal Murgai
So before all this came into the picture, before these large context windows, we used to do chunking, right? It's a great question to start with. Before we had a million tokens, people still needed the models to reason over big documents, right? So what we essentially
00:04:16:12 - 00:04:38:15
Vishal Murgai
had to do was chunking and retrieval, RAG, if you want to call it that, right. We take a big document, split it into small overlapping pieces, like 500 to 1,000 tokens, right, and create vector embeddings for each of those. Essentially break it down, create vector embeddings for each of those things, and store them in a database, right. At query time,
00:04:38:17 - 00:04:50:25
Vishal Murgai
retrieve only the top-N most relevant chunks, right. That's kind of how we would do it. So that's how it was done before: chunking and then retrieval, that kind of thing.
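A minimal sketch of the chunk-embed-retrieve pattern Vishal outlines. The embed() function here is a random placeholder standing in for whatever embedding model and vector database you would actually use:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: swap in a real model (e.g. a sentence-transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping word windows (roughly 500-1000 tokens each)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

def build_index(doc: str) -> list[tuple[str, np.ndarray]]:
    """Embed every chunk once, up front, and keep (chunk, vector) pairs."""
    return [(c, embed(c)) for c in chunk(doc)]

def retrieve(index: list[tuple[str, np.ndarray]], query: str, top_n: int = 5) -> list[str]:
    """At query time, return only the top-N most relevant chunks by cosine similarity."""
    q = embed(query)
    scored = sorted(index, key=lambda item: float(item[1] @ q), reverse=True)
    return [c for c, _ in scored[:top_n]]
```

Only those few retrieved chunks, not the whole document, are then placed into the model's prompt, which is exactly why anything outside the top-N can be "forgotten."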
00:04:50:28 - 00:05:14:24
Lori MacVittie
That kind of explains why sometimes the AI forgets what you said, because it only pulls the top X, or something dropped out of the context window because it became too full and it forgot your name and calls you something else. Those kinds of issues, right. It can forget important information because it gets pushed out or it's just not as relevant when it did the actual embeddings.
00:05:14:26 - 00:05:16:00
Vishal Murgai
Right.
00:05:16:03 - 00:05:30:24
Joel Moses
Now, it sounds to me like, now that we don't have to chunk necessarily, now that ever larger context windows are available, that has an architectural design effect on AI applications, doesn't it?
Vishal Murgai
Yes.
Joel Moses
What are some of those?
00:05:30:27 - 00:05:51:27
Vishal Murgai
So, okay, it helps a lot, all these large context windows, right. But when you're talking of challenges, the first is governance at scale, right. When you can dump an entire document or an entire code archive in, you risk sensitive or regulated information going into the model,
00:05:51:27 - 00:06:25:06
Vishal Murgai
right. So that's one example, governance at scale. Then there's cost and token efficiency, because we're talking about huge data here, right. If I'm a lazy user, I say, okay, I'll dump everything into this LLM and let it figure it out, right. That increases the cost, and you have to know the cost there, right? And then trajectory and ordering, right, the model has no built-in roadmap for this whole big chunk you gave it. That's the concept of trajectory that's been used recently. You need a roadmap, a guideline, for how
00:06:25:06 - 00:06:51:00
Vishal Murgai
to handle it. If you have 1 million tokens, 750k words, right, how do you even break it down? How do you know what is where, right? So there are trajectory and ordering issues, and then observability and debugging, right, if something goes wrong. Who was able to access what? That's a compliance issue essentially, right. Because somebody might pull up data from sources that were not supposed to be used, say US versus Europe versus Asia data, right.
00:06:51:06 - 00:07:08:05
Vishal Murgai
You don't want to pull that in, right. That's access policy. You have to stick to that particular governance model, right. And then privacy issues, latency, system limits. There are multiple, right? We can go into them one by one, right.
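A sketch of the kind of pre-prompt governance gate Vishal is describing: before documents are packed into a large context, filter them against the caller's region and clearance. The Doc fields and policy rules here are hypothetical, just to show the shape of the check:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    name: str
    region: str          # e.g. "US", "EU", "APAC"
    classification: str  # e.g. "public", "internal", "restricted"
    text: str

# Hypothetical policy: which classifications a caller's clearance may include.
ALLOWED = {
    "standard": {"public", "internal"},
    "privileged": {"public", "internal", "restricted"},
}

def admissible(doc: Doc, caller_region: str, clearance: str) -> bool:
    """Keep a document out of the context unless region and clearance both allow it."""
    return doc.region == caller_region and doc.classification in ALLOWED.get(clearance, set())

def build_context(docs: list[Doc], caller_region: str, clearance: str) -> tuple[str, list[str]]:
    """Assemble the prompt context from allowed docs; return excluded names for audit logs."""
    kept, dropped = [], []
    for d in docs:
        (kept if admissible(d, caller_region, clearance) else dropped).append(d)
    context = "\n\n".join(f"## {d.name}\n{d.text}" for d in kept)
    return context, [d.name for d in dropped]
```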
00:07:08:07 - 00:07:29:12
Joel Moses
So there are some architectural challenges with the approach, and there are architectural benefits as well. Maybe one of the benefits is less reliance on things like RAG, or pulling additional context information in at inference time. You can include more information in one shot. But that comes with things like regulatory compliance concerns.
00:07:29:12 - 00:07:32:17
Joel Moses
It comes with other concerns. Is that what I'm hearing?
00:07:32:19 - 00:07:33:12
Vishal Murgai
Right, yes. Exactly, yes.
00:07:33:14 - 00:07:43:04
Joel Moses
Now, what's useful in a large context window? What can we do now with a larger context window more accurately than we could do before, even by chunking?
00:07:43:06 - 00:08:05:10
Vishal Murgai
So, okay, it's a good question. Since you can fit more context, more tokens, into a single request, you can show it more, right? Instead of chunking. And chunking also led to, you know, context drift, they called it before,
00:08:05:10 - 00:08:21:19
Vishal Murgai
right? Because you have to carry your context from the first request to the next one, going one by one, right? So now cross-document reasoning is possible, right. Previously you could do it, but it had all the issues we just talked about, right.
00:08:21:21 - 00:08:44:12
Vishal Murgai
With chunking, the model would only see a few fragments at once. But with this, you can show it the whole document, right? You can say, you know, compare all 40 vendor contracts, for example, right, and then do a search or apply some template, right? Before the large window came into the picture, you had to chunk it
00:08:44:12 - 00:09:09:23
Vishal Murgai
and then you would always have drift there, that kind of thing. So, second, the narrative can be more cohesive now, right, because entire manuals can go in intact now, right? The model can maintain sequence, caveats, and cross-referencing, right. That's
Joel Moses
I see
Vishal Murgai
and yeah, I think that's it.
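A sketch of the "compare all 40 vendor contracts" pattern: with a large context window you pack every contract into one prompt instead of retrieving fragments. The call_model() function is a stand-in for whichever large-context API you actually use, and the 1M budget reflects the announced limit, not a universal constant:

```python
CONTEXT_BUDGET_TOKENS = 1_000_000  # announced large-context limit; adjust per model

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough 4-chars-per-token heuristic

def call_model(prompt: str) -> str:
    """Placeholder for a real large-context chat/completions call."""
    raise NotImplementedError("wire up your model client here")

def compare_contracts(contracts: dict[str, str], question: str) -> str:
    """Pack every contract into a single prompt, with a safety check on the budget."""
    sections = [f"### Contract: {name}\n{text}" for name, text in contracts.items()]
    prompt = (
        "You are reviewing vendor contracts. Answer using only the documents below, "
        "and cite the contract name for every claim.\n\n"
        + "\n\n".join(sections)
        + f"\n\nQuestion: {question}"
    )
    if estimate_tokens(prompt) > CONTEXT_BUDGET_TOKENS:
        raise ValueError("Contracts exceed the context budget; fall back to chunking/RAG.")
    return call_model(prompt)
```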
00:09:09:26 - 00:09:11:12
Joel Moses
Yeah. So, if I, oh, go ahead Lori.
00:09:11:15 - 00:09:32:29
Lori MacVittie
I was going to say this feels very similar to the early web evolution where, right, at first sessions were passed back and forth, they were stuffed into cookies, into URLs. I mean, they would put them anywhere until they figured out how to get more persistent sessions and store that. And that sounds a lot like context being shared.
00:09:32:29 - 00:09:58:24
Lori MacVittie
But now either the model, or I should say the inference server, can actually maintain a million tokens of that or use external systems, like databases of some sort, to actually persist that longer over time, which of course there are risks. But there are also advantages in how you might build new applications or leverage that in your enterprise.
00:09:58:26 - 00:10:15:21
Vishal Murgai
Right, right. Yeah. If you think back to the early days of the web, as you just mentioned, right, HTTP was stateless. Every request forgot what came before it, right. Then sessions and cookies were invented, as you said, yeah. So this is very similar to that approach, yeah.
Joel Moses
Yeah.
Vishal Murgai
Now we have bigger ones, yeah. That's right, yeah.
00:10:15:24 - 00:10:47:12
Joel Moses
Now when I think of what enterprises develop over the years that increases measurably in size year over year and could present an interesting analysis task for something with a large context window, my mind, because I'm an engineer, immediately goes to codebases of legacy applications. So what benefits does a very large context window have for either code generation or conversion of existing legacy code bases to new programming languages, for example?
00:10:47:14 - 00:11:06:05
Vishal Murgai
Yeah, it's a very deep question. Okay. So, again, I think before you go to advantages, you have to remember that compliance issue, right?
Joel Moses
Oh, absolutely.
Vishal Murgai
If I'm an engineer, right? You give me something to do, I'll say, okay, you gave me this task, so I'll take the whole database and the Git repo, push it to an LLM, and let it answer all the questions.
00:11:06:05 - 00:11:24:09
Vishal Murgai
I can update there and come back, whatever, right. That's intermediary, right. So that's kind of an approach people will do, right. The first thing they'll do, right?
Joel Moses
Right.
Vishal Murgai
But that's, we know, the simple approach. It will work, right, but it will have other side effects, right. Now, going back to your question,
00:11:24:09 - 00:11:45:09
Vishal Murgai
right. So the benefits of this, right? Now you get whole-codebase visibility with this new scheme. Before, you could only fit small files or snippets, and the model would lose track of cross-file references, right? Now, large context lets you drop an entire module or even a whole repo into the model, right,
00:11:45:09 - 00:12:06:13
Vishal Murgai
preserving structure and dependencies, right. And the benefit of that is, you know, the model can generate and refactor code with awareness of the entire system, not just isolated pieces of code or functions, right. So it sees a more cohesive picture there, right.
00:12:06:13 - 00:12:31:21
Vishal Murgai
So, second is cross-language conversion, conversion with context, right? Before, translating a codebase in chunks often led to mismatched APIs or lost business logic, right?
Joel Moses
Right.
Vishal Murgai
But with this one it can see the whole thing, more context, and these things are just getting better. Even tools can connect AI models into the code base.
00:12:31:21 - 00:12:49:28
Vishal Murgai
That's another thing we are seeing. Tools can connect to code bases and IDEs and even commit directly to GitHub. Basically, you give me something to do, I connect to some model, it does all the magic to convert from this language to that language, Python to C, whatever that is,
00:12:49:28 - 00:12:54:10
Vishal Murgai
right. And then commit to this other one, right. So that's the benefit of that, right. So
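A sketch of how a whole repo might be flattened into one large-context request for refactoring or conversion, with an exclusion list applied first in the spirit of the compliance caveat above. Everything here, including the deny-list and the call_model() stub, is illustrative rather than a specific tool's API:

```python
from pathlib import Path

# Hypothetical deny-list: keep secrets and vendored code out of the prompt.
EXCLUDE_PARTS = {".git", "node_modules", ".env", "secrets"}

def collect_sources(repo: Path, suffixes=(".py", ".c", ".h")) -> dict[str, str]:
    """Gather source files, skipping anything under an excluded directory."""
    files = {}
    for path in sorted(repo.rglob("*")):
        if path.suffix in suffixes and not (set(path.parts) & EXCLUDE_PARTS):
            files[str(path.relative_to(repo))] = path.read_text(errors="ignore")
    return files

def build_conversion_prompt(files: dict[str, str], source_lang: str, target_lang: str) -> str:
    """One prompt containing the whole module/repo, so cross-file references survive."""
    body = "\n\n".join(f"--- {name} ---\n{text}" for name, text in files.items())
    return (
        f"Convert this {source_lang} codebase to {target_lang}. "
        "Preserve module structure, public APIs, and inter-file dependencies. "
        "State a trajectory first: architecture summary, then core modules, then the rest.\n\n"
        + body
    )

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your large-context model client here")
```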
00:12:54:12 - 00:13:16:19
Joel Moses
Yeah. And I appreciate you bringing up the challenges of that particular approach before we talked about the benefits. A lot of times when people are uploading things into these models for inferencing, they consider the benefits first and the challenges only after something bad happens.
Vishal Murgai
Basically, yeah.
Joel Moses
And so, you know, there are absolutely risks when you take your entire private code base and upload it, effectively, to these models.
00:13:16:19 - 00:13:25:24
Joel Moses
And one should certainly consider the challenges and possibilities of that ahead of the benefits, is, I think, what you're saying.
00:13:25:24 - 00:13:46:03
Lori MacVittie
We should probably recommend against uploading your entire code base to a public AI anything. Because one, that's probably gonna break company policy, besides just being, right, a bad idea. But, you know, if you wouldn't upload it to a Google doc, don't put it in a public AI. Like, that should be the rule.
00:13:46:06 - 00:14:00:22
Joel Moses
Yeah.
Vishal Murgai
And I think, to add to what Lori said, supposing I'm going to upload this whole code base or whatever document, right? There should be some kind of meter there: this document is so big, this is going to take this long and cost this many tokens, because it's not going to be free, right.
00:14:00:22 - 00:14:25:00
Vishal Murgai
Telling the user or the engineer, okay, the upload is fine, but here's what it will cost. And again, if you're going to do it, give it a trajectory, right. I think that's a beautiful concept they have introduced now. You don't just dump the code, you state the trajectory, right: summarize the architecture, translate core modules,
00:14:25:02 - 00:14:30:26
Vishal Murgai
refactor patterns, all these things, right? If I were doing it, I would do it that way, right.
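A sketch of the pre-upload "meter" Vishal describes: estimate tokens and cost before the request goes out, and flag it if it blows the budget. The per-token price here is a made-up placeholder; substitute your provider's actual rate card:

```python
# Hypothetical pricing; check your provider's real rate card.
PRICE_PER_1K_INPUT_TOKENS = 0.003  # dollars

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough 4-chars-per-token heuristic

def meter_upload(document: str, budget_dollars: float) -> dict:
    """Report estimated tokens and cost so the user can decide before uploading."""
    tokens = estimate_tokens(document)
    cost = tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    return {
        "tokens": tokens,
        "estimated_cost": round(cost, 4),
        "within_budget": cost <= budget_dollars,
    }

if __name__ == "__main__":
    doc = "x" * 4_000_000  # roughly 1M tokens of input
    print(meter_upload(doc, budget_dollars=5.00))
    # -> {'tokens': 1000000, 'estimated_cost': 3.0, 'within_budget': True}
```

This is also the kind of check Lori mentions later: if you build the application yourself, you can embed the budget gate in front of the model call instead of discovering the bill afterward.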
00:14:31:01 - 00:14:51:06
Joel Moses
Yeah, I think that's an important point that you just made there, that the tokens still have cost, right. So just because you can upload things and have more tokens considered in one shot of context, those tokens still cost the same. So it can consume more, but it will still cost you just as much as it would if you were chunking it,
00:14:51:09 - 00:14:55:26
Joel Moses
and inputting all that data and managing the context windows yourself.
00:14:55:28 - 00:15:24:28
Lori MacVittie
Well, isn't that one of the benefits of building out your own AI applications, that you can actually inject things like, hey, you're trying to upload something that's going to cost this much money, and your budget doesn't allow for that, so let's cut it down. Those kinds of functionalities you can embed, which you don't get if you just go to ChatGPT, which happily accepts any size, of course.
00:15:25:00 - 00:15:29:13
Lori MacVittie
Especially if you're on an enterprise plan and paying by the token.
00:15:29:21 - 00:15:52:01
Joel Moses
Yeah. Now, is there anything else that people need to take into account with the availability of ever larger context windows? And secondarily, do you think the context window elements of these foundational models, is this becoming kind of a race between models to provide ever larger context windows?
00:15:52:04 - 00:16:11:11
Vishal Murgai
Yeah. So the thing is, while a million-token context sounds amazing, you can feed a whole document or a code base into it, right, you still have to be careful and mindful of what you do there, right. There are always privacy issues, governance issues. And one thing I didn't bring up was audits,
00:16:11:11 - 00:16:31:12
Vishal Murgai
right, and also explainability, right. If you were converting C code to something, right, or doing some kind of document reference, a Q&A kind of thing, you should be able to cite where it came from. Because often, I've seen there's a positional bias there, right. Suppose you're asking a question, I don't
00:16:31:15 - 00:16:49:11
Vishal Murgai
know, whatever it is. It will often pick the first position, whatever answer comes first. So when something is in the appendix much further down, that can be forgotten, right. Basically, something in the middle will not be considered as much as something at the beginning or the end,
00:16:49:11 - 00:17:10:03
Vishal Murgai
right. So that's one thing I've seen, and it's kind of context decay, or I think they call it recency bias, right. Essentially, even the huge-window models often lose track of content. So that's another thing you have to worry about here, right: cost-quality trade-offs, diminishing returns. Again,
00:17:10:06 - 00:17:32:21
Vishal Murgai
beyond a certain point, what I've seen or read is that past 200 or 300 thousand tokens, added context may not lead to much gain, right? Again, that's from my personal experience; we can get different answers there. And there's the persistent need for RAG, right. RAG remains valuable even with large context windows,
00:17:32:22 - 00:17:50:26
Vishal Murgai
right. Context windows will have finite limits no matter what, 1 million, 2 million, 10 million, there's always some limit, right. So we still need RAG there, right? RAG supports real-time updates and audit trails, and it helps target more selective information, not just the biggest possible dump, right, so.
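A sketch of a simple "needle in a haystack" check for the positional bias Vishal mentions: plant the same fact at the start, middle, and end of a long filler context and see whether the model still answers correctly at each position. The call_model() function is a placeholder for your inference endpoint, and the fact and filler are invented test data:

```python
FACT = "The vendor contract renewal date is 2026-03-31."
QUESTION = "What is the vendor contract renewal date?"
EXPECTED = "2026-03-31"

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def haystack(filler_paragraphs: int = 2000) -> list[str]:
    """Generate neutral filler text to pad the context toward its limit."""
    return [f"Routine log entry {i}: nothing of note happened." for i in range(filler_paragraphs)]

def probe_position(position: str) -> bool:
    """Return True if the model recalls the fact planted at the given position."""
    paras = haystack()
    insert_at = {"start": 0, "middle": len(paras) // 2, "end": len(paras)}[position]
    paras.insert(insert_at, FACT)
    prompt = "\n".join(paras) + f"\n\nQuestion: {QUESTION}"
    return EXPECTED in call_model(prompt)

if __name__ == "__main__":
    for pos in ("start", "middle", "end"):
        print(pos, probe_position(pos))
```

If the middle position fails more often than the ends, you are seeing exactly the decay Vishal describes, and that is a signal to keep RAG in the loop rather than rely on raw window size.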
00:17:50:26 - 00:18:09:12
Joel Moses
Right. One final question before we talk about what we learned today, if you put your prediction hat on, Vishal, and you think about models that are going to be available five years from now, how large of a context window is a model five years from now going to have, do you think?
00:18:09:14 - 00:18:28:23
Vishal Murgai
It's a tough one. The whole race started about three years ago, I would say. We went from like 1K, when we started using all this, to 1 million in three years, right, and we all learned and matured over time. So we started from essentially zero to whatever scale we're at now, right.
00:18:28:25 - 00:18:46:21
Vishal Murgai
So I believe multi-million; it's hard to pick a limit. Maybe 5 to 10 million. Because eventually there'll be diminishing returns, right? You can put everything in there, but it's more about giving it a trajectory, I think. Fine, you can have a much larger window, you can fit the whole codebase,
00:18:46:21 - 00:19:12:21
Vishal Murgai
but, you know, given the geometric growth, from 1K up to 4K and beyond for these models, that's what enabled AI to process entire books and all. The thing is the architecture advances, right? Like, you know, sparse memory, memory layers, expert models are becoming production grade now, right? So if you ask me for a number, about 5 to 10 million, I think,
Joel Moses
Oh, okay.
00:19:12:21 - 00:19:14:03
Vishal Murgai
because after that I don't know, right.
00:19:14:03 - 00:19:34:18
Joel Moses
All right. Well, I for one look forward to my 1 billion token context windows in the near future. You know, one of the things that I learned today is about the race to the ever larger context window. I think this is being considered a competitive advantage from one model offering to another.
00:19:34:21 - 00:19:57:07
Joel Moses
So I noted quite dramatically that GPT-5 has a larger context window than 4.1 did. And I think this is definitely being considered as something competitive by the model providers. I also learned that you still need to take into account all of the challenges of inputting large amounts of data into these models.
00:19:57:09 - 00:19:59:10
Joel Moses
What about you, Lori?
00:19:59:12 - 00:20:28:24
Lori MacVittie
I learned that bigger is better when it comes to actually getting accurate answers from AI. So bigger context windows are leading to better accuracy, but it's also going to cost you. And that there are a lot of privacy and compliance and governance sticky issues that you can run into by oversharing. And maybe big context windows encourage oversharing.
00:20:28:29 - 00:20:57:07
Lori MacVittie
I didn't learn, but I think I learned something I want to dig into, which is how that 1 million token window actually translates into memory and resources needed. Because as those grow, that means that the inference server can actually handle less and less, it sounds like. And I think that's a scale issue in an architecture. So enterprises are going to struggle building these out if they have context windows too big.
00:20:57:07 - 00:21:03:21
Lori MacVittie
So, maybe that's a tunable knob for enterprises as they're building out applications.
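On Lori's question about how a 1M-token window translates into memory: the inference server has to hold a key-value cache that grows linearly with sequence length. A rough sizing sketch, with made-up model dimensions standing in for whatever model you actually serve:

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: 2 (K and V) * layers * heads * head_dim * seq_len * dtype size."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

if __name__ == "__main__":
    # Hypothetical mid-size model: 32 layers, 8 KV heads (grouped-query attention),
    # head_dim 128, fp16 values.
    gib = kv_cache_bytes(seq_len=1_000_000, layers=32, kv_heads=8, head_dim=128) / 2**30
    print(f"~{gib:.0f} GiB of KV cache for one 1M-token request")  # ~122 GiB
```

That single-request footprint is why larger windows reduce how many concurrent requests an inference server can hold, which is the scale and tuning trade-off Lori is pointing at.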
00:21:03:23 - 00:21:24:16
Vishal Murgai
True. And finally, in my opinion, the winners of this race will be those who shape trajectories, right? Those who decide what goes into the model's path, where everything is logged and explained, because in the end it's governance, cost, orchestration of all this, and trust of all the flows. It comes down to one simple question: what do we want the model to work with,
00:21:24:16 - 00:21:36:08
Vishal Murgai
right. And I think it's not just the window, but the things around it, right, trajectory and compliance and all of that, that will finally decide who wins this race, in my opinion.
00:21:36:10 - 00:21:37:15
Joel Moses
Excellent.
00:21:37:17 - 00:21:53:03
Lori MacVittie
Awesome, awesome. Well, that is a wrap for Pop Goes the Stack. Follow the show, so you're on call for our next forensic adventure. Until then, sleep with one eye on the logs and the other on the subscribe button.