Pop Goes the Stack

Traditional performance meant deterministic response times. Identical inputs produced near-identical execution times. Optimizations reduced latency, but variance was minimal. Insert AI inference, and performance engineering flips upside down. Latency depends on model size, tokenization, batching strategies, and generation settings. Identical inputs may produce different response times. The new dimension of performance is variance: not just how fast the system responds, but how response times distribute across requests, how many tokens per second are processed, and how efficient each response is relative to cost.

In this episode of Pop Goes the Stack, Lori MacVittie, Joel Moses, and special guest Nina Forsyth dive into the impact of AI inference on measuring performance. It's time to rethink performance observability, focusing on infrastructure optimization, agent-to-agent interactions, and robust measurement techniques. Listen in to learn how traditional approaches must evolve to manage this multi-dimensional puzzle.

Creators and Guests

Host
Joel Moses
Distinguished Engineer and VP, Strategic Engineer at F5, Joel has over 30 years of industry experience in the cybersecurity and networking fields. He holds several US patents related to encryption techniques.
Host
Lori MacVittie
Distinguished Engineer and Chief Evangelist at F5, Lori has more than 25 years of industry experience spanning application development, IT architecture, and network and systems operations. She co-authored the CADD profile for ANSI NCITS 320-1998 and is a prolific author with books spanning security, cloud, and enterprise architecture.
Guest
Nina Forsyth
Nina has been focused on product innovation and optimization throughout her career. Whether steering a product from concept through launch or measuring the performance of an existing solution, she prioritizes users every step of the way.
Producer
Tabitha R.R. Powell
Technical Thought Leadership Evangelist producing content that makes complex ideas clear and engaging.

What is Pop Goes the Stack?

Explore the evolving world of application delivery and security. Each episode will dive into technologies shaping the future of operations, analyze emerging trends, and discuss the impacts of innovations on the tech stack.

00:00:05:18 - 00:00:26:27
Lori MacVittie
Welcome back to Pop Goes the Stack, where we take the comforting myths of old school performance engineering, hold them up to the light of modern AI systems, and watch them crumble like day-old biscotti. I'm your host, Lori MacVittie. And as almost always, our co-host, Joel Moses is here to help keep us on track.

00:00:27:00 - 00:00:28:10
Joel Moses
Hey, Lori. Good to be here.

00:00:28:13 - 00:00:31:09
Lori MacVittie
Good to be here. You didn't have a choice; we made you.

00:00:31:11 - 00:00:33:15
Joel Moses
No, no, no, I was shanghaied. Definitely.

00:00:33:18 - 00:00:59:13
Lori MacVittie
Yes. It's, it's horrible, huh? Well, once upon a time, performance meant deterministic response times, right? We measure performance. We have to because we have to care about it, because users are impatient. We don't like to wait. But back in the old days--I like that, we're already in the old days, right--identical input, identical execution paths, identical latency.

00:00:59:21 - 00:01:28:17
Lori MacVittie
Universe was nice, orderly, engineers slept well at night, and nobody had to ask what token variance meant. They didn't. They didn't care. But, inference has shown up. And it's rewriting the entire rulebook in crayons. We got no idea. Performance is probabilistic. Latency is wiggling, not jittering, wiggling

Joel Moses
Wiggling.

Lori MacVittie
now. Throughput is swerving. Identical requests walk through different kinds of pipelines and come out wherever the heck they want.

Joel Moses
Yeah.

00:01:28:19 - 00:01:49:28
Lori MacVittie
So today's episode is really about that shift. Last episode we talked about availability and how the definition of that is changing and how we have to change measures. And this time we want to focus on performance because that too is being impacted. And to talk about performance, we brought somebody on who is really well versed in all things performance.

00:01:50:04 - 00:01:52:21
Lori MacVittie
Nina, welcome.

00:01:52:24 - 00:02:04:05
Nina Forsyth
I'm so glad to be here. Thanks for the invite. Yeah, I guess a little bit of background: I used to do all the performance testing at F5 for quite a few years, so, yeah.

Lori MacVittie
Yes.

00:02:04:07 - 00:02:27:19
Joel Moses
Yeah.

Lori MacVittie
Yeah. That's, when I hear performance and figuring out numbers, I'm like, it's got to be Nina because she knows. She's done all sorts of different kinds of performance testing: apps, you know, our products, everything.

Nina Forsyth
Yep.

Lori MacVittie
So she's got a good view on performance and is now working with a lot of the AI systems. So, more on performance, right. So, I mean, let's dig in.

00:02:27:19 - 00:02:31:11
Lori MacVittie
So performance is not what it used to be, right? We're

00:02:31:12 - 00:02:43:28
Nina Forsyth
Yeah, I mean it used to just be about, you know, throughput and connections per second. And it's changed now. The, you know, how fast do you get the response back is not the only thing that matters.

00:02:44:01 - 00:03:08:09
Joel Moses
That's correct. Yeah. And there's a fair measure, especially in these artificial intelligence applications, of non-deterministic performance, even to the same types or subjects of prompts, right. So you can, you know, it's really fun for me to talk to some of the old school veteran system engineers who are used to, you know, spending eight milliseconds of compute and getting eight milliseconds response time out of that.

00:03:08:09 - 00:03:31:08
Joel Moses
Those were the days. But we're beyond that now. With inference systems, the same prompt can take anywhere from 300 milliseconds to 3 seconds. And, you know, it's always tough to talk to folks who are used to the old deterministic ways and the pain in their eyes when they can no longer rely on things to return in predictable time.

00:03:31:10 - 00:03:37:09
Joel Moses
One of them described it to me as a character building moment when he realized that the old tricks didn't work anymore.

Nina Forsyth
Yeah.

00:03:37:11 - 00:04:08:25
Lori MacVittie
It's that predictable piece. I mean, one of the things we saw early on with ChatGPT when it first came out is people's tolerance for a lot of latency went way through the roof. Like, as long as it said, "oh, I'm thinking," right, we reacted in very human ways, like, oh, well, let's give it time. Right? When the same thing for an application, if it wasn't back in the blink of an eye, we'd be throwing our phone across the room and complaining about the internet.

00:04:08:28 - 00:04:17:06
Nina Forsyth
And now we're worried about how much it costs. How much does that little connection cost; how much is this question going to cost me?

00:04:17:09 - 00:04:37:11
Joel Moses
Yeah. I mean, honestly, a lot of us used to brag about, you know, achieving our best 20 millisecond response times, and now with AI systems, it's better if you simply brag about your 95th percentile. Because no one really cares about how fast your fastest response is. They just care how slow the slowest one is.
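
As a rough sketch of the shift Joel describes: summarize a batch of response times by percentile and spread instead of by best case. The latency values below are invented for illustration, not real measurements.

```python
import statistics

# Invented response times (seconds) for the same prompt, illustrating
# the kind of variance discussed here; not real measurements.
latencies = [0.31, 0.45, 2.9, 0.52, 1.7, 0.38, 3.1, 0.61, 0.95, 2.2]

p50 = statistics.median(latencies)
# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
p95 = statistics.quantiles(latencies, n=20)[18]
spread = statistics.stdev(latencies)

print(f"p50={p50:.2f}s  p95={p95:.2f}s  stdev={spread:.2f}s")
# min(latencies) is the old brag; p95 and stdev describe what the
# slowest users actually experience.
```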

00:04:37:14 - 00:05:02:03
Joel Moses
And that matters, especially if you're the one who got the slowest response, right? But non-deterministic performance, you know, it's more about the perception that the performance is faster than it is. These systems don't respond in deterministic time. And so that's why you have the little three dots that precede the response and why systems talk about how they're thinking about a response.

00:05:02:03 - 00:05:12:21
Joel Moses
It's in order to give instantaneous feedback to a user so that they will sit and wait for the tokens to be allocated and finished.

00:05:12:24 - 00:05:35:15
Lori MacVittie
Right. Well, it's also why they stream the response. Those tokens are being generated and it's being streamed back a little bit at a time. But it gives the perception that something is happening. And you can only read so fast anyway, so you're seeing it come out even as it's still working in the background, right? It's not compute the answer and then send it back.

00:05:35:17 - 00:05:57:01
Lori MacVittie
And that's very different when you are maybe the system engineer or on the operations side and you're trying to figure out: what are my KPIs for this? What am I measuring anymore? Like, are users yelling at me? Am I, you know, regexing for angry words to see if they're, you know, happy or not? How do you deal with that?

00:05:57:01 - 00:06:03:05
Lori MacVittie
Because it completely changes everything about how we manage performance.

00:06:03:08 - 00:06:05:26
Joel Moses
Yeah.

Nina Forsyth
Yeah.

Joel Moses
And how much it costs as well.

00:06:05:29 - 00:06:07:23
Lori MacVittie
And so yeah.

00:06:07:25 - 00:06:28:08
Nina Forsyth
Yeah, I think of playing around with a chatbot a little while ago. I asked a question, then asked the same question again, and I got three different answers. And I was like, this is not right. So it's like even that level of performance, like there needs to be guardrails

Joel Moses
Yeah.

Nina Forsyth
to make sure that the answers are not only correct, but, can you imagine getting three different answers and three different costs for the same question?

Joel Moses
Oh, yeah.

00:06:28:08 - 00:06:32:06
Nina Forsyth
That just changes how you look at applications nowadays.

00:06:32:08 - 00:06:50:15
Joel Moses
It does. It also changes how people think about the cost of the application. I once tried to explain tokens per second to our finance team and I said, "think about it like cost per word generated over time." And then they were like, "well, wait a minute, does that mean that when it slows down we pay more?"

00:06:50:17 - 00:07:00:08
Joel Moses
And I said, "yes, that's exactly what it means." And their eyes got so wide, I swear that our GPU budget shrank in real time.

00:07:00:10 - 00:07:24:27
Lori MacVittie
Well it's a hard thing to wrap your head around. It was much easier to go, well, bandwidth, you know, bytes per second, right, these are all very measurable things. We knew the return sizes, so we could optimize for them. How do you optimize for responses? Are we constantly telling the models behind the user's back, you know, "really short answers,"

00:07:24:27 - 00:07:28:27
Lori MacVittie
you know, "one word," you know, "don't go over budget"?

00:07:29:00 - 00:07:55:26
Joel Moses
Well, I mean, I think when you're working with non-deterministic responses from some of these systems, you try to get determinism into the system as much as possible. So, getting rid of delay in the internetworking between GPU hosts has an overall effect in terms of the number of tokens per second that you can process. So anywhere that you can leverage infrastructure determinism to solve for the non-determinism is good.

00:07:55:26 - 00:08:21:13
Joel Moses
So all the old school tricks about lower latency networking and putting things closer to data resources, those tricks actually do work in terms of delivering performance back into these non-deterministic systems. But, you know, it's one of those things where you can kind of optimize these things almost ad infinitum. And, so you have to

00:08:21:17 - 00:08:23:06
Nina Forsyth
So, the olden days still matter.

00:08:23:09 - 00:08:50:24
Joel Moses
The old days still matter. Right. Exactly. So wherever you can move latency and delay out of the system using infrastructure to do that, these systems that perform non-deterministically take advantage of the fact that there's no added delay or added latency along the path. And so, again, it's still worth thinking about determinism, even if you know that you're not going to get it out of the system.

00:08:50:27 - 00:09:13:12
Lori MacVittie
But how do you measure that? I'm going to keep going back to that. How do we measure performance? I mean, Nina, you've developed like thousands of different types of tests to test different types of performance. So, if you had to sit down and test an AI for performance, I mean, what would you start measuring?

00:09:13:12 - 00:09:15:12
Lori MacVittie
Where do you start

Nina Forsyth
I mean,

Lori MacVittie
with something like that to understand

00:09:15:12 - 00:09:20:24
Nina Forsyth
you use AI to measure AI. I mean, wouldn't, you know

Joel Moses
Yeah.

00:09:20:27 - 00:09:22:17
Joel Moses
That's certainly an approach.

00:09:22:17 - 00:09:26:11
Lori MacVittie
Wait, doesn't that double the cost?

00:09:26:11 - 00:09:41:01
Nina Forsyth
Yeah, I mean, I think there's some data modeling you could do though. I mean, I think finding the guardrails for your questions. Like, you know, back in the day, we would do health monitors; you would, you know, check the response to make sure it has the things you expect it to have. And that's another way you could do it,

00:09:41:01 - 00:10:01:18
Nina Forsyth
you know, if you ask a sample question that, you know, maybe is on the cheaper side so it doesn't cost much money, but are the responses what you expect? And the response times that you're getting back, what's the variance like? And I think that's a, you know, a reasonable way to do some data science. But, you know, using AI to check AI is not the worst idea I've had.
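
A minimal sketch of that health-monitor idea, assuming a hypothetical HTTP inference endpoint and response shape: send a cheap canned prompt on a schedule, check the answer still looks right, and track the latency variance rather than just the average.

```python
import time
import statistics
import requests  # assumes the requests library is installed

# Hypothetical inference endpoint and canned probe; adjust for your stack.
ENDPOINT = "http://localhost:8000/v1/chat"
PROBE = {"prompt": "What is 2 + 2?", "max_tokens": 8}  # cheap, known answer
EXPECTED = "4"

samples = []
for _ in range(10):
    start = time.monotonic()
    reply = requests.post(ENDPOINT, json=PROBE, timeout=30).json()
    samples.append(time.monotonic() - start)
    # Old-school health-monitor check: does the response contain what we
    # expect, not just "did it return 200"? (Response field is hypothetical.)
    assert EXPECTED in reply.get("text", ""), "probe answer drifted"
    time.sleep(5)

print(f"mean={statistics.mean(samples):.2f}s  stdev={statistics.stdev(samples):.2f}s")
```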

00:10:01:20 - 00:10:03:13
Joel Moses
That's true. That's true.

00:10:03:16 - 00:10:31:10
Lori MacVittie
But you're still measuring response time, it sounds like. You're still measuring, okay, how long does it take to respond? How long does it take to generate? So almost the time from when it first starts sending tokens to its last token, and going, "okay, this is a number we can actually measure"

Joel Moses
Sure.

Lori MacVittie
and we just can't do "fast," you know, "is it always under five milliseconds?" because we know it won't be.

00:10:31:10 - 00:10:40:18
Lori MacVittie
But we can start looking at--and you brought this up when we, before we started the conversation, Nina--variance, right.

Nina Forsyth
Yep.

00:10:40:21 - 00:11:13:10
Joel Moses
Yeah. I think in the final analysis, observability is the key. Right? Being able to watch not only your infrastructure and how it's performing while delivering the AI application, but the impact that has on the AI delivery itself. So when you're measuring first token to last token there's a time, and if you can compare that to the changes in infrastructure that you propose or simulate, you can actually reduce the time between the first and last tokens by moving data resources closer to the point of inference, for example.
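
One way to capture that first-token-to-last-token window is a small timing wrapper around a streaming response. This is a sketch against any iterable of tokens, not a specific vendor API.

```python
import time

def measure_stream(token_stream):
    """Time-to-first-token and first-to-last-token span for one request.

    token_stream is any iterable that yields tokens as they arrive
    (e.g. a streaming client response); this is a sketch, not a
    specific vendor API.
    """
    start = time.monotonic()
    first = last = None
    count = 0
    for _ in token_stream:
        now = time.monotonic()
        if first is None:
            first = now  # time to first token ends here
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    span = last - first
    return {
        "ttft_s": first - start,      # how long the "thinking" dots blink
        "first_to_last_s": span,      # the window discussed here
        "tokens_per_s": count / span if span else float("inf"),
    }
```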

00:11:13:10 - 00:11:42:10
Joel Moses
So, if you're using RAG, if you're going five hops away and you're injecting an additional 500 milliseconds of latency in doing so, it's going to translate into a slower first-to-last-token time. And so you can actually make changes, small changes--definitely recommended in this type of workload--to infrastructure. And through your monitoring you can correlate an infrastructure latency reduction to a per-token latency reduction.

00:11:42:12 - 00:11:51:09
Joel Moses
It's, again, it's still going to be non-deterministic--that's just how these systems are--but you can actually measure a performance difference there.

00:11:51:11 - 00:11:58:27
Lori MacVittie
Right.

Nina Forsyth
And you're definitely not going to want to look at the latency averaged across requests.

Joel Moses
No.

Nina Forsyth
You want to see what that variance is and you want to see the high end.

00:11:59:00 - 00:12:23:01
Joel Moses
Exactly. And it also means that when you're observing things, you also have to know what the AI system is doing. Is it making RAG queries? Is it touching another AI agent that's out there while composing its response? So it's not just monitoring the infrastructure that you know about around your application; it's also knowing how the AI system might be interconnected to other things.

00:12:23:03 - 00:12:48:03
Joel Moses
Latency, as you know, even in AI systems latency has a magnifying effect. The more latency you inject, the more your performance problems compound, one system over. And that's just performance engineering 101. The measuring points are different and the way we need to monitor and provide observability for these systems is different,

00:12:48:05 - 00:12:50:00
Joel Moses
but the challenge is the same.

00:12:50:02 - 00:13:10:19
Lori MacVittie
Well except that, you know, now if we think about agents, we can't know for sure how many different systems one might be making calls to. With web apps, right, we knew: oh, it calls a database and it calls this other service--there's 2 or 3--okay, we can factor that in. This agent might call five, it might call two, it might call ten.

00:13:10:22 - 00:13:12:26
Joel Moses
That's correct. Yeah, and so you

00:13:12:28 - 00:13:14:07
Lori MacVittie
So, there's a lot of variance there to

00:13:14:13 - 00:13:36:28
Joel Moses
That's correct. And so that's why there's interest in things like MCP gateways and A2A, which give you the ability to monitor agent-to-agent transactions. These sorts of things are newly emerging and they're not necessarily in a lot of the observability systems yet, but it's definitely something to watch for.

00:13:37:01 - 00:14:04:07
Lori MacVittie
That's an interesting thought, that multiple pieces of infrastructure might be needed to manage performance, like restricting tool calls, restricting agent calls, watching and feeding back. Instead of having just one system that might be watching for performance and kind of being the source of truth, you're going to have multiple systems that you're going to have to correlate, kind of like a performance SIEM, right, if you will,

00:14:04:07 - 00:14:07:20
Lori MacVittie
right, is going to grab them all. It's, is that what I'm hearing?

00:14:07:20 - 00:14:28:09
Nina Forsyth
Yeah, and I think you also look at the size of the engine, so like, the size of the LLMs. You might want smaller ones running most of your questions.

Joel Moses
That's right.

Nina Forsyth
And if you do, you know, analysis over time and find that, you know, your small model's not working as well, then you can take time to help train that smaller model in order to, you know, improve performance and cost across the board.

00:14:28:11 - 00:14:33:11
Nina Forsyth
This is like multi-factor performance nowadays. It's pretty complicated.
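
A toy sketch of the routing idea Nina describes, with invented model names and cost figures: send cheap, simple prompts to a small model and escalate the rest. Logging which prompts escalate is the feedback loop for improving the small model over time.

```python
# Invented model names and per-token costs, purely for illustration.
SMALL = {"name": "small-8b", "cost_per_1k_tokens": 0.0002}
LARGE = {"name": "large-70b", "cost_per_1k_tokens": 0.0030}

def pick_model(prompt: str, needs_deep_reasoning: bool) -> dict:
    """Crude router: short prompts that don't need deep reasoning go to
    the small model; everything else escalates to the large one."""
    if len(prompt) < 500 and not needs_deep_reasoning:
        return SMALL
    return LARGE

print(pick_model("What are your support hours?", needs_deep_reasoning=False)["name"])
```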

00:14:33:13 - 00:14:58:21
Joel Moses
Exactly. Yeah, and there are also nerd knobs within those LLMs to turn; things like temperature, things like the bit size of the compiled model. All of these things are things that you can tweak, and they can lead to changes in tokens per second. But they can also reduce accuracy. And so you have to be very careful about making direct changes that affect your performance inside the LLM.
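
For a concrete picture of two of those knobs, here is a hedged sketch using an OpenAI-style client; the model name is a placeholder, and similar parameters exist in most inference APIs.

```python
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Two of the "nerd knobs": temperature trades output variability for
# determinism, and max_tokens caps the first-to-last-token window (and
# the bill). The model name is a placeholder.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize TLS in one sentence."}],
    temperature=0,   # push toward more deterministic output
    max_tokens=64,   # hard cap on generated tokens, i.e. on cost
)
print(response.choices[0].message.content)
```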

00:14:58:27 - 00:15:17:00
Joel Moses
That's why it's probably better to look at the infrastructure surrounding it: find a model that works the way you like it to work and returns the accuracy you like, and then figure out how to optimize everything that surrounds that model. Tweaking the model itself can lead to strange behavior.

00:15:17:03 - 00:15:46:01
Lori MacVittie
That also raises, I mean, is cost the new DDoS, right? Being able, I'm going to change the temperature and give it this complex question so that it gives me a lot of tokens coming back. Like, that's another concern, right; you've got to watch performance in terms of that because it directly relates to cost now in a way that, you know, I mean, bandwidth is cheap, right?

00:15:46:01 - 00:15:57:06
Lori MacVittie
We all know that now. So we don't worry about how much does it cost for every, you know, byte I send out. But we are focused on that with tokens because that is the cost model. So we have to be more careful about that.

00:15:57:09 - 00:15:59:03
Joel Moses
Yeah.

00:15:59:05 - 00:16:31:16
Lori MacVittie
It's definitely related. That's a lot of change. That's what I'm hearing: a lot of change. It's not just that the definition of performance changed; if we have to watch variability instead, if we have to watch time between tokens and not bytes, that means that there's a lot of infrastructure changes that have to be made in order to measure it correctly in the first place, before you can even start optimizing,

Joel Moses
Yeah.

Lori MacVittie
because if you don't know how it's performing, you can't do anything to change it, right?

00:16:31:18 - 00:16:45:26
Joel Moses
Yeah, that's

Nina Forsyth
And you might say, like,

Joel Moses
that's true.

Nina Forsyth
speed might not even matter if the results are not correct. So if you're not managing that, then does any of it really, like, matter? Like, you know, it's so multi-dimensional now, it's really hard to figure out.

00:16:45:28 - 00:17:12:11
Joel Moses
Yeah that's actually what I'm taking away from this discussion today. There is definitely a new definition for performance within AI-driven applications. And I think we used to think about performance like a stopwatch. You know, it's precise, it's predictable. And now it's like a toddler getting ready for bed: sometimes they respond to you right away, other times they take 20 minutes, and sometimes they just start telling you a story about a dinosaur for no reason whatsoever.

00:17:12:13 - 00:17:38:09
Joel Moses
Inference performance is not really about speed. It's about managing variance or variability. It's not how fast is it; it's what mood is it in today and is it at least coming back in a predictable amount of time? Right. It's about maintaining a stable distance between first and last token and not having some that are way out to lunch and others that return extremely quickly.

00:17:38:12 - 00:17:42:03
Joel Moses
The variability is actually the new measure of performance.

00:17:42:05 - 00:17:56:19
Nina Forsyth
So like, if you're an enterprise right now with very little AI experience but you have this problem, what would be the first thing you focus on? I think it would be the infrastructure side: figuring out how infrastructure is behaving and then moving on up the stack.

00:17:56:21 - 00:17:59:06
Joel Moses
Yeah, I think that's a good place to start for sure.

00:17:59:08 - 00:18:24:09
Lori MacVittie
Oh, absolutely. I think there's a lot of different pieces of infrastructure, like I said, that get touched by this that you have to start looking at. How are we going to measure it in observability? And all of these things are not just, it's not just multi-variable, as you mentioned, Nina, it's also multi-measurement, right. The impact is also, right,

00:18:24:15 - 00:18:50:03
Lori MacVittie
it's just expanded, because performance could be indicative of a security problem, right, an attack. It could be indicative of a problem somewhere else in the infrastructure. It's so interrelated with all of these different systems, just the way AI operates, that you have to worry about: okay, I have to know what I'm going to measure, and then I have to start looking at all the other systems it might touch.

00:18:50:03 - 00:19:13:11
Lori MacVittie
It's going to be, it is performance. It might be touching, you know, data stores, but it might be a security thing we have to worry about. And really, maybe this is the thing that breaks silos, because everybody has-. I can dream. Stop laughing at me. I feel bad now. No, okay, there'll still be silos. There'll be new silos.

00:19:13:14 - 00:19:35:21
Lori MacVittie
But there shouldn't. Because it really is that related, right? Performance could be a security issue, it could be a cost issue, it could be an infrastructure issue. It could just be a network issue. And we really have to figure out how to get the right measurements to the right systems and the right people in order to figure it out fast, because it costs money. A lot of money.

00:19:35:23 - 00:19:53:26
Lori MacVittie
Big money.

Nina Forsyth
I think everyone, I mean, if you talk about silos, everyone wants to talk to the data team. So at least we have that silo broken down. Like, I think our own AI data team has been very busy, and they're talking to more people than they had ever imagined. So, I imagine that's sort of helping with silos in some way.

00:19:53:29 - 00:19:54:21
Joel Moses
That's right.

00:19:54:23 - 00:20:18:05
Lori MacVittie
That's true, that's true. Well hey, you know, you got to start somewhere, right?

Nina Forsyth
Yep.

Lori MacVittie
One at a time works. Works. All right. Well, hey, I think that's a wrap for Pop Goes the Stack this week. We can't go back to deterministic compute. That ship has sailed, caught fire, and it's now generating tokens about its feelings as it sinks.

00:20:18:08 - 00:20:29:09
Lori MacVittie
But we can and will move on as AI eats your data center. So be sure to subscribe so you know when its next snack time is scheduled.