Explore the evolving world of application delivery and security. Each episode will dive into technologies shaping the future of operations, analyze emerging trends, and discuss the impacts of innovations on the tech stack.
00:00:04:28 - 00:00:40:21
Lori MacVittie
Welcome back to Pop Goes the Stack, the show where emerging tech gets deployed directly into reality, no staging environment provided. We, we like risk; it's okay. I'm Lori MacVittie, and I'm here to monitor the fallout. So, we've spent the last decade getting very good at measuring things: latency, throughput, error rates, maybe availability if we're feeling ambitious. They're all solid metrics. They're all necessary. And they're all completely insufficient
00:00:40:21 - 00:01:11:07
Lori MacVittie
once you introduce agents into the system. The thing is that once you introduce software that acts on behalf of a user or business or another system, the question shifts from "Is it fast?" to "Is it doing the right thing?" And that's a much harder problem to measure. And since we're introducing agents, they're different than the AI that we've been working with for the last couple of years where we started to get good at measuring that.
00:01:11:07 - 00:01:32:16
Lori MacVittie
And now we've got agents and we have to figure out how to measure that now. It's continually evolving here. So the question is, what's different? How does that impact what we measure? And just as importantly, how do we measure it? Because we know some technologies are just falling by the wayside; they're not working for what we need to do for AI, and
00:01:32:19 - 00:01:42:24
Lori MacVittie
agents are going to complicate it. So in this episode, we've invited our observability guru, Chris Hain. Hi, Chris.
00:01:42:26 - 00:01:44:08
Joel Moses
Welcome.
Chris Hain
How's it going? Thanks for having me.
00:01:44:09 - 00:01:53:22
Lori MacVittie
There you go. To dig into what actually matters when you're measuring agentic systems. And with us, as always, is Joel "OpenClaw"
00:01:53:25 - 00:01:54:10
Joel Moses
Oh, boy.
00:01:54:12 - 00:01:58:04
Lori MacVittie
Moses. Yeah, it comes back to haunt you every time.
00:01:59:12 - 00:02:01:21
Joel Moses
Yeah. Yes, Lori, I, too, like to live dangerously.
00:02:01:22 - 00:02:17:18
Lori MacVittie
You do. You do. So let's kick it off. What is so different about agents, compared to generative AI, that introduces questions about what we're going to measure and how we're going to measure it? So what's different?
00:02:17:20 - 00:02:43:14
Chris Hain
Wow. Big topic. So I mean, with generative AI and its kind of probabilistic nature, like, you probably get something out of it.
Lori MacVittie
Big word.
Chris Hain
It's not deterministic like it used to be, right. Like I could put something in and get the same thing out every time more or less. So, yeah, generative AI kind of blew those assumptions out. I think agentic AI takes it to the whole next level because no longer is it just, you know, one prompt and response
00:02:43:14 - 00:03:03:16
Chris Hain
and what does that, you know, produce. It's now decision chains that are based on those probabilistic responses upstream that flow all the way through the system. So like, yeah, the days of measuring error codes and durations and having that be a good indicator of things being correct are, you know, long past, I think.
00:03:03:19 - 00:03:27:15
Joel Moses
Yeah. It strikes me that an agent is almost tailored to deliver some things that are unexpected. It's not that your agentic AI failed.
Chris Hain
Right.
Joel Moses
Maybe it, through its decision chain, pursued a different KPI than you thought it would. Right? So maybe you're measuring the wrong thing. You're measuring what you thought the outcome was, but it figured a different path.
00:03:27:16 - 00:03:39:10
Joel Moses
It chained together another set of services and you didn't expect that. So how does one get control over that? How does one start to instrument a system that charts its own future?
00:03:39:13 - 00:03:59:12
Chris Hain
Well, I think the great news is we don't have to throw the baby out with the bathwater in this realm. So, you know, the way that distributed services and kind of microservice monitoring typically worked in the kind of traditional space was, if you were well instrumented, you probably had tracing to see what your microservices are calling and all the interactions between them.
00:03:59:12 - 00:04:33:19
Chris Hain
So that technology is still really useful for these agentic chains, right. Like, you can represent a step that an agent is taking as a span. So your initial query to, you know, "hey, go do this thing" is the root span. And then, you know, you've got all these child spans as it's calling tools, doing other actions. Whatever it's doing, it's able to record that data as this representation that you can then reconstruct kind of the, if you've ever gone into the developer tools and looked at the call graph when you open a web page, you get kind of a similar view of the steps the agent's taking.
00:04:33:19 - 00:04:58:06
Chris Hain
And then each of those spans can be kind of decorated with additional metadata, right. So, what model is it calling? How many tokens did it use? What was the exact prompt that was sent to the model to produce the action? All of these things can be attached to those spans and you can get a very rich kind of replay of what did this agent do when I told it to, you know, book me a flight, right.
00:04:58:09 - 00:05:27:18
Joel Moses
Right.
Lori MacVittie
It sounds like basically instrument whatever you can in order to get that. And what I see is the problem is that a lot of this is dependent on the information being sent to the tool. Like if you're calling an MCP something, there's a request that comes from the agent. That call might be executed through code that was generated by the agent and therefore has no instrumentation in it whatsoever.
00:05:27:22 - 00:05:40:19
Lori MacVittie
So,
Chris Hain
Sure.
Lori MacVittie
So how do you deal with that? Is that just a "well, that's why you instrument the endpoint it's calling." Or do you just go, "eh, make it up, whatever. Doesn't matter."
00:05:40:21 - 00:06:00:27
Chris Hain
Yeah I mean that's a good,
Lori MacVittie
Guess.
Chris Hain
Yes. I mean, so there are SDKs maybe you tell that agent, "you got to use this SDK that I happen to know is instrumented to do these things automatically." Right? But it might not. Right? It might choose not to, you know, just ignore that piece of the puzzle. But the good news is, like, you don't have to invent any of these wheels, right?
00:06:00:28 - 00:06:25:07
Chris Hain
Like, there are big industry wide efforts to kind of standardize and align on conventions for not only, you know, how to track these things, but what is tracked. So when you make a, you know, agentic request, there are standards out there. There's one, OpenTelemetry is very popular in the, you know, kind of helping align everybody on both, you know, what signals should look like, but also what's inside of them.
00:06:25:07 - 00:06:45:13
Chris Hain
So there's this thing called semantic conventions, and there's a whole section of it that's very actively being developed around generative AI, and agentic AI in particular. So this would do things like tell you specifically, you know, when you're making a tool call, here's the structure of the thing that goes into it. It's got, you know, these attribute names
00:06:45:13 - 00:07:08:16
Chris Hain
and, you know, here's how you represent the different parameters you sent into the model when you call the LLMs. So all of this is like becoming standardized in a way that could be reused across vendors and across observability tooling. So, you know, a lot of kind of the old days--back, you know, ten years ago--you'd wind up in these silos, right?
00:07:08:18 - 00:07:35:06
Chris Hain
Like, I use Datadog or I use Splunk or whatever, and it was really hard to kind of move between them. Right, you're kind of locked in. With OpenTelemetry we kind of broke ourselves free from those chains. And then we came into this realization that, okay, it's not just enough to standardize on the transport and how you define a signal, you also need to define what the content of those metrics or traces is.
00:07:35:07 - 00:08:07:28
Chris Hain
How do you name, you know, system CPU utilization? Is it CPU utilization or is it system CPU utilization? And, you know, if it's got a hostname attached to it, is that in a field called host or something different. So the same exact things will bite us in agentic AI. But the good news is, we learned enough in the old world that we're kind of getting ahead of that a little bit earlier so that I can pick up, you know, I can call any model from any provider, I can use any kind of framework, and there's a good chance it's already going to have some of these standards implemented. Which means I can put whatever front end in front of
00:08:07:28 - 00:08:14:28
Chris Hain
it. I can switch things out. It just gives you a little more freedom to do what you need to do.
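As a rough sketch of the idea, span attributes for an LLM call might use the `gen_ai.*` names from OpenTelemetry's incubating generative-AI semantic conventions. The attribute names below reflect that spec as best I can tell, but the conventions are actively evolving, so treat them as illustrative and check the current version:

```python
# A tiny checker for whether a span's attributes follow an agreed convention.
# The required set here is illustrative, not the official spec's definition.
GEN_AI_REQUIRED = {"gen_ai.operation.name", "gen_ai.request.model"}

def validate_llm_span_attrs(attrs: dict) -> list[str]:
    """Return the required convention attribute names that are missing."""
    return sorted(GEN_AI_REQUIRED - attrs.keys())

span_attrs = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "example-model",  # hypothetical model name
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 144,
}

print(validate_llm_span_attrs(span_attrs))      # []  (conforms)
print(validate_llm_span_attrs({"model": "x"}))  # flags the non-conventional naming
```

The point of the shared names is exactly what Chris says: any backend or dashboard that understands the convention can consume these spans, regardless of which model, framework, or vendor produced them.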
00:08:15:01 - 00:08:17:12
Joel Moses
Interesting.
Lori MacVittie
Joel's favorite subject... standards!
00:08:17:17 - 00:08:24:16
Joel Moses
Oh, I love standards. I love creating new ones.
00:08:24:19 - 00:08:49:18
Joel Moses
So it does strike me that, you know, we have a pretty good handle on things semantically in the generative AI world. I think the agentic AI world is still kind of figuring out really what it needs to measure. And so some of the
Chris Hain
Yep.
Joel Moses
things that leap to mind for me that I would want to know about if I'm setting up something that does auto-chaining of decisions and runs autonomously.
00:08:49:20 - 00:09:10:09
Joel Moses
One is, I want to know how many decision loops per task, right? How long is it considering and what length of chain is it constructing to accomplish the goal that I've set it to? If suddenly it goes from 5 to 12, first of all, it's more expensive; second of all, why did it go from 5 to 12?
00:09:10:10 - 00:09:34:21
Chris Hain
Right.
Joel Moses
That seems strange. You know, goal misinterpretation rate; that's something that might have a measure to it. Correction loops per task. Human override frequency. How often do I need to say, "whoa whoa whoa whoa, slow down. Don't do that."? And my favorite metric, regrets-per-second.
00:09:34:24 - 00:09:40:02
Chris Hain
Many, many.
Lori MacVittie
Which is closely related to the attempts-to-blackmail-per-second
Joel Moses
Yeah. Oh,
Lori MacVittie
or per-decision-chain.
00:09:40:04 - 00:09:43:21
Joel Moses
no, I wrote that prompt poorly.
Lori MacVittie
Yeah, hmmm.
Joel Moses
Regret, regret.
00:09:43:21 - 00:09:45:04
Lori MacVittie
Yeah
00:09:45:04 - 00:10:09:24
Chris Hain
Yeah. Well, I mean, like it really all gets back to that lack of determinism, right? Like, all of these things are not just signals of, like, is it doing the right thing, like it's... You know, you have to measure a lot more and there's a lot more room for interpretation. Right? So you wouldn't probably put a thumbs up, thumbs down button and consider that telemetry for, you know, a database call or something.
00:10:09:25 - 00:10:29:00
Chris Hain
Right. Like here you've got to like have the humans rate the interaction to some extent. And all of these things have to be accounted for in the data model. So yeah, I mean same thing is like if it's calling, just because it's working today or in one instance of the prompt doesn't mean that it's going to be working tomorrow or the next time you do the exact same thing.
00:10:29:00 - 00:10:47:01
Chris Hain
So having the signals of, like, the last time I asked this, it made seven tool calls and spent a thousand tokens, and being able to kind of compare that to, you know, over time how it's doing is, like, a super useful signal, and none of that really needed to exist in the old days, right?
Joel Moses
Right. Yeah.
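Chris's "seven tool calls, a thousand tokens" comparison can be sketched as a simple baseline check over past runs of the same task. This is a hypothetical drift detector, not part of any particular tool; the numbers and the 1.5x threshold are invented:

```python
from statistics import mean

# Per-run signals for the same task, as pulled from traces of past runs.
history = [
    {"tool_calls": 7, "tokens": 1_000},
    {"tool_calls": 6, "tokens": 950},
    {"tool_calls": 7, "tokens": 1_050},
]

def drifted(run: dict, history: list[dict], factor: float = 1.5) -> list[str]:
    """Flag any signal that exceeded `factor` times its historical mean."""
    flags = []
    for key in run:
        baseline = mean(h[key] for h in history)
        if run[key] > factor * baseline:
            flags.append(key)
    return flags

latest = {"tool_calls": 12, "tokens": 4_000}
print(drifted(latest, history))  # ['tool_calls', 'tokens']
```

A run that jumps from 5 chains to 12, as Joel puts it, shows up here even though every individual call returned a 200 and finished quickly, which is exactly what the old error-code-and-duration signals would miss.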
00:10:47:03 - 00:11:03:27
Lori MacVittie
That's a lot of, a lot of data. Because you're talking about a lot of textual, right, prompts and
Chris Hain
Yep.
Lori MacVittie
right deci-, I mean chains.
Chris Hain
Yep.
Lori MacVittie
This isn't something that's easily represented by: here's a number. Right?
Chris Hain
Exactly.
Lori MacVittie
It's lengthy. So that is
00:11:04:00 - 00:11:06:18
Joel Moses
It could be multimodal data.
Chris Hain
Exactly.
Joel Moses
Could be audio, video, etc.
00:11:06:21 - 00:11:09:03
Lori MacVittie
Yeah. I mean that's going to be expensive.
00:11:09:06 - 00:11:12:03
Chris Hain
Yeah, my picture is now a piece of my, you know, yeah.
00:11:12:03 - 00:11:13:26
Joel Moses
That's exactly correct, yeah.
Lori MacVittie
Yeah.
00:11:13:28 - 00:11:15:07
Lori MacVittie
Exactly. So
00:11:15:09 - 00:11:38:10
Chris Hain
Like logging data, traditionally, would generally be, you know, a short line that tells a human, "hey, this is the spot where something is going weird." Now it's potentially, you know, a gigantic prompt and interaction, and there's all kinds of privacy and, you know, sensitive-data concerns around that. So there's got to be governance in addition to, you know, what you should have been doing anyway, but it probably goes a step further.
00:11:38:10 - 00:11:43:20
Chris Hain
So all of these things are just compounding and making the world an interesting place to live in today.
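The governance point suggests scrubbing captured prompts before they land in storage. A minimal sketch with two illustrative regex patterns; a real policy would cover far more than this, and the patterns here are assumptions, not a vetted PII ruleset:

```python
import re

# Illustrative patterns only: email addresses and card-like digit runs.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scrub(prompt: str) -> str:
    """Replace matches with labeled placeholders before the prompt is logged."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED:{label}]", prompt)
    return prompt

logged = scrub("Book a flight for pat@example.com, card 4111 1111 1111 1111")
print(logged)  # both values replaced with [REDACTED:...] placeholders
```

The trade-off Lori raises still stands: the more aggressively you scrub, the less useful the stored session is for the replay and debugging that the traces were captured for in the first place.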
00:11:43:24 - 00:11:54:00
Joel Moses
Yeah.
Lori MacVittie
Is that int-
Joel Moses
Collecting metrics on semantics is interesting business. I mean, you're essentially measuring things that may not be failing technically, but they
Chris Hain
Right.
Joel Moses
may be succeeding incorrectly, you know.
Chris Hain
Yeah.
00:11:54:01 - 00:11:58:13
Lori MacVittie
Right. Right. It did,
Joel Moses
It's a measure of correctness.
Lori MacVittie
it reached the goal. Yeah, it reached the goal, but
Chris Hain
Yeah, it got there
00:11:58:14 - 00:12:01:20
Chris Hain
but it spent $1,000. And last time I asked it to
Joel Moses
Right.
Chris Hain
it spent five.
00:12:01:21 - 00:12:02:12
Lori MacVittie
Exactly.
00:12:02:13 - 00:12:05:10
Chris Hain
All of that has to be monitored and managed. It's
00:12:05:18 - 00:12:24:27
Lori MacVittie
Right. And how do you measure that? I mean that's, right, because when
Chris Hain
Yeah.
Lori MacVittie
I guess my point is, you know, this is big data that you're having to store somewhere and sift through, which means it's going to lag well behind the actual execution. So there's not going to be a "oh, we'll just reset." You know, you can't stop it in real time that way.
00:12:24:27 - 00:12:31:26
Lori MacVittie
Or can you? Do we need to? That's kind of where I'm thinking like, how do we do this?
00:12:31:28 - 00:12:54:09
Chris Hain
Yeah, I mean, I agree. Like, I think the tooling--so I've been using, like, OpenCode and some of these other things; you mentioned OpenClaw. I don't know how well instrumented OpenClaw is, but OpenCode is very opaque. Like you can't tell what it's doing really. Like it'll tell you on the screen it's doing these things, but you don't really get a good view of how much money did I just spend or how many tokens.
00:12:54:10 - 00:13:15:20
Chris Hain
Right? But like I said, if we start to instrument these things better and start to build the tooling around it, in real time I should have a dashboard that says, "oh, that interaction, that sub-agent call spent $0.50 or 10,000 tokens." And I should be able to see that as I'm watching it do its work and monitor it in real time.
00:13:15:22 - 00:13:31:27
Chris Hain
On the other side is more like the enterprise or like the service backend side where, yeah, you're right, it's not you're not going to be looking at an individual session all the time to try to say, "Oh, no, this is going out of whack. We got to stop that one." It's going to be more holistic.
00:13:31:28 - 00:13:49:12
Chris Hain
Like over time, the interactions are getting better or worse, they're getting more expensive, they're using more tokens, or whatever it is. So yeah, it's different sets of data for different uses, but they can all kind of follow the same patterns and the same, you know, semantic conventions. Right?
Joel Moses
Right.
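The real-time view Chris describes could be sketched as a per-session cost tracker that alerts the moment a cap is crossed, while the aggregate view sums the same records over many sessions. This is a toy illustration; the step names, costs, and the $1.00 cap are invented:

```python
class SessionBudget:
    """Track spend per agent session; flag it the moment a cap is crossed."""

    def __init__(self, cap_dollars: float):
        self.cap = cap_dollars
        self.spent = 0.0
        self.alerts: list[str] = []

    def record(self, step: str, cost: float) -> None:
        self.spent += cost
        if self.spent > self.cap:
            self.alerts.append(
                f"{step}: session at ${self.spent:.2f}, cap ${self.cap:.2f}"
            )

session = SessionBudget(cap_dollars=1.00)
session.record("llm.call", 0.50)
session.record("sub_agent.call", 0.40)
session.record("llm.call", 0.30)  # crosses the $1.00 cap on this step
print(session.alerts)
```

The same per-step records feed both uses Chris distinguishes: the operator watching one session live, and the backend rolling sessions up to ask whether interactions are trending more expensive over time.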
00:13:49:14 - 00:14:14:12
Lori MacVittie
And I keep hearing, right, like frameworks, right, standard ways to do this. So it needs to be instrumented, there needs to be some standard conventions so that you can actually consistently measure over time, all good things. But that presupposes that people are using frameworks that do that, right,
Joel Moses
Sure.
Lori MacVittie
to build them in the first place and that it's implemented these standards.
00:14:14:12 - 00:14:35:02
Lori MacVittie
So in the past, we've always had alternate mechanisms for collecting that data. Say, you know, eBPF in the core system to be
Chris Hain
True.
Lori MacVittie
able to kind of extract it, whether you put it in there or not. So are those kind of things going to be also part of the solution, or is that just a stopgap?
00:14:35:07 - 00:14:57:24
Chris Hain
I think that's a super important part of the solution. In fact, it's like what you mentioned: well, yeah, I instrumented the framework and everything that I told, you know, everything I knew about and that existed before I made the call. But then it goes and writes something that it, you know, didn't bother to instrument. eBPF is a great spot to say, oh, it's making these additional system calls that I don't expect, or it's calling out to these locations that...
00:14:57:24 - 00:15:13:03
Chris Hain
Is it, right? I don't know. So yeah, I think that kind of auto instrumentation and like the lower level system instrumented pieces are going to be increasingly, you know, important. You got
Joel Moses
Yeah.
Chris Hain
Yeah.
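eBPF itself hooks the kernel, but the idea Chris is pointing at, catching activity from code you never instrumented, can be illustrated in user space with Python's built-in audit hooks. A minimal sketch, assuming CPython 3.8+; the file path stands in for whatever the agent-generated code happens to touch:

```python
import os
import sys
import tempfile

# Audit hooks observe interpreter-level events (file opens, socket connects,
# subprocess launches) no matter which code triggered them, so they see
# activity from code the agent generated on the fly.
observed = []

def watch(event: str, args: tuple) -> None:
    if event in {"open", "socket.connect", "subprocess.Popen"}:
        observed.append(event)

sys.addaudithook(watch)  # note: hooks cannot be removed once added

# Pretend this is agent-generated code we never instrumented:
path = os.path.join(tempfile.gettempdir(), "agent_scratch.txt")
open(path, "w").close()

print("open" in observed)  # the hook saw the uninstrumented file access
```

It is a coarse analog, but the design point matches: the instrumentation sits below the code being watched, so the agent cannot opt out of it the way it can ignore an SDK.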
00:15:13:06 - 00:15:36:07
Joel Moses
Well let's add a new wrinkle to this. Obviously metrics that change over time are also an interesting source of data that you may want an agent to analyze to see if it can devise a response. So monitoring, for example, the rise in cost for a particular system and having the agent devise a way to control those costs. So this is not just monitoring agents and agent activity.
00:15:36:10 - 00:15:44:28
Joel Moses
It's also that agents are going to be a tool that is used to make observability actionable. Isn't that the case?
00:15:45:01 - 00:16:03:27
Chris Hain
Yeah. And it goes back to that idea of, you know, we need to capture a lot more of, like, the actual session. Like, a human's probably not going to sift through a million sessions, but you could have that agent go and, you know, look at all of these and tell me where it's going wrong and help me optimize whatever it is that needs to be optimized to make those interactions better.
00:16:03:28 - 00:16:13:24
Chris Hain
Right? That's going to be a huge component of it, I think. So, yeah, again, it's making sure you have the data for the agent to go and take advantage of.
Lori MacVittie
Isn't that like
00:16:13:27 - 00:16:15:15
Joel Moses
And then of course, there's
Lori MacVittie
letting the fox
00:16:15:15 - 00:16:21:03
Lori MacVittie
watch the henhouse, right? Like, hey fox, come watch the henhouse.
Chris Hain
Yeah.
Lori MacVittie
Hey agent, come watch... Like, mmm...
00:16:21:03 - 00:16:23:16
Chris Hain
Hey agent, just fix all my observability problems, please.
00:16:23:16 - 00:16:24:04
Lori MacVittie
Come on.
00:16:24:06 - 00:16:31:22
Lori MacVittie
Yeah. I mean, what if they start collaborating between themselves, like, just say it's right.
Chris Hain
Once they start
Lori MacVittie
just say it's right.
Chris Hain
faking the metrics. Yeah.
Lori MacVittie
Right. Yes. I mean, come on.
00:16:31:24 - 00:16:37:00
Chris Hain
Hey, the easiest way to optimize this is to make the metrics look like it's optimized.
Joel Moses
Ooooo.
00:16:37:03 - 00:16:40:15
Lori MacVittie
Oh. How do, oh, my head hurts.
00:16:40:20 - 00:16:44:28
Chris Hain
Now we need a new security product that...
00:16:45:01 - 00:16:51:07
Joel Moses
My solution is to subtract 100 from the cost metric. Yeah. We're back in
00:16:51:09 - 00:16:53:15
Chris Hain
I need to bring the cost down by 100.
00:16:53:18 - 00:16:59:09
Joel Moses
Yeah. Or my solution is to turn off all instances. There, just zeroed out the cost.
00:16:59:09 - 00:17:02:03
Chris Hain
Oh that's, yeah,
Lori MacVittie
Zero, yes.
Chris Hain
that's happening.
00:17:02:06 - 00:17:13:04
Lori MacVittie
And we laugh and we joke about this. But, I mean, given some of the things we've seen, right, agents do so far, these are all plausible scenarios that we actually
Chris Hain
Oh, 100%.
00:17:13:06 - 00:17:34:15
Lori MacVittie
Yeah. We need to think about them ahead of time to figure it out. It's
Joel Moses
Yeah.
Lori MacVittie
because it is different, right? We can't just say, "oh, the performance is bad." I mean, it was so concrete before, and now we have to actually, actively think about "what could this thing possibly do?" to try and put something in place to counter it.
00:17:34:15 - 00:17:43:22
Lori MacVittie
And that's really hard for us because we're used to managing systems, not basically systems that are kind of like people.
00:17:43:25 - 00:17:45:01
Chris Hain
Yeah.
Lori MacVittie
It's
00:17:45:04 - 00:18:05:07
Joel Moses
Yeah. It's not about just measuring execution success. It's actually about measuring intent alignment. Did it do the thing correctly, not just did it do it. And that's going to be
Lori MacVittie
Yes.
Joel Moses
I bet that's going to be an area of intense research. And I hope standards do develop in the signals collection space for that.
00:18:05:14 - 00:18:21:09
Chris Hain
Yeah, most definitely. I mean, I think that is the key learning: the more we can standardize early on, and the less kind of entropy in all of the world out there, the easier it will be to figure those things out efficiently and make good use of these new technologies. So.
00:18:21:09 - 00:18:24:16
Lori MacVittie
Is that your takeaway for
00:18:24:22 - 00:18:25:06
Chris Hain
I think it is.
00:18:25:06 - 00:18:26:07
Lori MacVittie
the episode.
00:18:26:07 - 00:18:37:04
Chris Hain
I mean, if there's one thing, it's get it instrumented and get it standardized into a nice format that you can rely on, right. So that that for me is it.
Lori MacVittie
Joel?
00:18:37:06 - 00:19:04:09
Joel Moses
Yeah. You know, my takeaway is, you know, that old screenshot that they have of the Microsoft tool where the task failed successfully. That sort of measurement is entirely possible in agentic AI and with good reason. I think that when you're setting yourself up to imbue an agent with some power, create a service account with an attitude basically, that the things that you monitor should not just be:
00:19:04:10 - 00:19:20:01
Joel Moses
is it on or is it off? Is it running? Has it completed? You actually have to measure, is it correct. And that's going to be a struggle for people. And it's going to have to be something the industry will grow into.
00:19:20:03 - 00:19:46:15
Lori MacVittie
Awesome. I think my takeaway is mostly that, you know, you need to be aware that this is a problem, right? The old GI Joe theory, right, knowing is half the battle. Let's, you know, just recognize that this is a problem, that you have to measure things differently, you have to instrument, and then you can actually go about finding solutions. And they're going to be emerging just like the problems are emerging.
00:19:46:15 - 00:19:52:26
Lori MacVittie
So have a little patience but recognize it's going to be different. It's really going to be different.
00:19:53:00 - 00:19:59:09
Chris Hain
Positioning ourselves for success by
Lori MacVittie
Yes.
Chris Hain
making good choices about the things we do know.
00:19:59:12 - 00:20:00:22
Lori MacVittie
That should be on a card somewhere.
00:20:03:14 - 00:20:03:28
Joel Moses
I'd buy the t-shirt.
00:20:03:28 - 00:20:05:22
Lori MacVittie
I would.
00:20:05:26 - 00:20:19:25
Lori MacVittie
I'd buy that T-shirt. All right. That's a wrap then. That's definitely a wrap for Pop Goes the Stack this week. If the hype level dropped a few decibels today, hit subscribe, because we'll keep the noise filter running.