Explore the evolving world of application delivery and security. Each episode will dive into technologies shaping the future of operations, analyze emerging trends, and discuss the impacts of innovations on the tech stack.
00:00:05:04 - 00:00:23:03
Lori MacVittie
Hey, welcome to Pop Goes the Stack, the podcast about emerging tech and the unintended chaos it leaves in its wake. I'm Lori MacVittie, your host, here to translate technology hype into human speak. And to help with that, we've got co-host Joel Moses. Welcome back, Joel.
00:00:23:05 - 00:00:24:07
Joel Moses
Yeah, good to be here, Lori.
00:00:24:09 - 00:00:32:19
Lori MacVittie
All right. And we brought some guests along. You'll notice there are extras. We each brought a plus one. We've got Ken Arora, Distinguished Engineer.
00:00:32:21 - 00:00:33:12
Ken Arora
Hi there.
00:00:33:15 - 00:00:38:20
Lori MacVittie
Hey, Ken. And we've got Kunal Anand, our Chief Product Officer.
00:00:38:22 - 00:00:43:00
Kunal Anand
Hey, everybody. I don't do human speak, I do AI slop.
00:00:43:02 - 00:01:09:01
Lori MacVittie
We're in trouble already. Oh my goodness. So today's episode we wanted to talk about reliability, hence the chaos reference. Well, it used to be boring. And that was actually a good thing, at least in operations. Same input in, same output out. It was predictable. Deterministic systems reward discipline. We like that. That's because if something changed you could trace it,
00:01:09:01 - 00:01:32:29
Lori MacVittie
you could diff it, you could roll it back. So reliability was actually a math problem. It was a good thing, right? It was very well understood. Well, inference is here and it's blowing that up. There is not a good corollary for reliability when it comes to AI inferencing, because slop, apparently, is how we're going to phrase that.
00:01:32:29 - 00:01:52:03
Lori MacVittie
So we wanted to talk about that. How does it change? It needs to change. I'm going to start, I'm going to say it needs to change. You can't measure it the same way. But how does it change and why? And I believe Ken might even argue, "no, it doesn't and we shouldn't." And he's just going to argue because he's Ken.
00:01:52:03 - 00:01:53:25
Lori MacVittie
So,
Ken Arora
I don't argue, I just
Lori MacVittie
take it away.
00:01:53:28 - 00:01:57:25
Ken Arora
just an alternate point of view. Point counterpoint.
00:01:57:27 - 00:01:59:27
Lori MacVittie
That's arguing.
00:02:00:00 - 00:02:28:14
Ken Arora
But you know what, slop seems so pejorative. If
Joel Moses
I agree.
Ken Arora
this were another human you might call it creativity.
Joel Moses
That's right.
Ken Arora
You know, the point is that, you know, one of the really attractive things about GenAI, the reason it's opened up new areas for us, for humans, to interact with computers, is because of its non-determinism, its slop, its creativity. That you can give it the same prompt, ask it the same question,
00:02:28:15 - 00:02:46:18
Ken Arora
it'll give you different answers, different takes on things. And sometimes--not always--sometimes that's the feature, not a bug. And I guess I, you know, I'd want to at least make that point of view known is sometimes it's a feature. Now, where it's not a feature and it's a bug, I think we do need to do things about it.
00:02:46:21 - 00:02:48:08
Ken Arora
I don't know, Joel. Any thoughts on that?
00:02:48:10 - 00:03:07:27
Joel Moses
Well, I agree with you. I think that the term slop is a little pejorative and, actually, what we're dealing with here is just a bit of managed variability. It's the difference between a marching band where every drum beat, every step, every trumpet blast is on time. But you can take the same instruments and you can apply them to a jazz band.
00:03:07:29 - 00:03:29:05
Joel Moses
And what differs is the structure and the timing. And it can be variable and, in some cases, fairly random--a solo as opposed to a group effort. And there's nothing wrong with that. It produces different output. And actually what we value in jazz music is the variability. And so I think that calling it slop is a little pejorative.
00:03:29:06 - 00:03:32:22
Joel Moses
It's actually creativity out of variability.
00:03:32:24 - 00:03:55:21
Lori MacVittie
I'm going to disagree about the nature of that word because I make a specific dish and my son calls it crock slop and he loves it. He's absolutely excited by slop. So, you know, I'm going to say context is important in how you're perceiving the value of that word.
00:03:55:23 - 00:04:10:04
Ken Arora
You know, well, one of those little-known facts--well known here, but maybe not elsewhere--is that Kunal is a bit of a musician, I believe.
Joel Moses
Right.
Ken Arora
And so probably has a take on this music analogy. It's a little bit like jazz, a little bit like riffing. How do you view it? Is there a music analogy there, Kunal?
00:04:10:04 - 00:04:39:09
Kunal Anand
I think there's a spectrum of slop. I think on one side you've got classical, which is like everything preordained, planned down to the note, as Joel is describing. And then on the other side, you get, you know, bands like Phish or like the Grateful Dead, who will just go off and do whatever.
00:04:39:11 - 00:05:19:27
Kunal Anand
And I think there's beauty in both of these things. I mean, when you think about how technology has progressed--and this is more meta--but these accidental discoveries that have led to very big breakthroughs, you couldn't have gotten there just by being very strict. You needed to explore the studio space, right? At the same time, if you are all the way on the opposite side, where it's just maximum openness and maximum creativity that's unconstrained, you don't end up actually, you know, solving the goal. Or you don't end up, in a hill-climbing algorithm, actually climbing the hill, per se.
00:05:20:00 - 00:05:37:09
Kunal Anand
So I think there's a spectrum. And I think there's some use cases that require you to play classical, and then there's some music or some use cases that sort of require you to play jazz or to play, you know, live. You know, I don't know what you want to call Phish anymore today. I don't want to say
00:05:37:09 - 00:05:55:00
Joel Moses
Jam band, I guess.
Kunal Anand
Jam band I guess, I don't know.
Joel Moses
Sure, yeah.
Kunal Anand
Yeah, sure. We should have a separate conversation on jam bands.
Joel Moses
Sure.
Kunal Anand
That's a separate thing there. But I don't know, it's a beautiful time right now, actually, to be making things and building things. But I agree with you. I think this is a really important topic on reliability.
00:05:55:05 - 00:06:16:25
Lori MacVittie
And I think, and you're hitting on it and Ken hit on it before, right, that word "context" is important. Sometimes you want the creativity, you want the jam session, you want the practice, you're composing, you're figuring it out, you're exploring a domain or a space and so you want a lot of variability and a lot of unreliability at that point.
00:06:16:25 - 00:06:39:08
Lori MacVittie
But when you're putting this into the context of something like agentic AI, which a lot of people are, that's more task oriented. I have a thing I need you to do. Variability there can be bad.
Joel Moses
Sure
Lori MacVittie
You don't want to end up, you know, booking flights to the wrong place in that flow. You want consistency, at least in behavior.
00:06:39:10 - 00:06:44:24
Lori MacVittie
And maybe that's where we're headed is behavioral-based reliability.
00:06:44:26 - 00:07:07:02
Joel Moses
Yeah. Definitely semantic consistency is what you're aiming for. So if you're doing something like vibe coding, you want the semantic consistency to have a wide berth to give you lots and lots of ideas. If you have an AI system that is supposed to be a support bot, you probably don't want it to go have an immense semantic corridor to work inside.
00:07:07:02 - 00:07:23:24
Joel Moses
You want to be as precise as possible. Now, there's always going to be variability in these systems. If you ask, you know, if you take out a support ticket as a customer, one day, you know, the system might tell you "sorry for the trouble we'll fix that," the next day, it might say, "have you tried turning it off and on again?"
00:07:23:26 - 00:07:53:16
Joel Moses
That's my classical one. The next day it may offer you coupons for free technical support. And those are all within the confines of semantic consistency. Now, if it says, "well, let's send you to Hawaii," it's not a travel agent. And perhaps you need to analyze what your system is doing there.
Lori MacVittie
Or go to Hawaii.
Joel Moses
On the flip side, you know, I know that our esteemed leader here is a vibe coder, and he wants a fairly wide berth for obtaining ideas from the system, right?
00:07:53:16 - 00:07:55:15
Ken Arora
Right.
00:07:55:15 - 00:08:18:10
Ken Arora
You know, a lot of it depends on context. When you were talking, it made me think a little bit about how sometimes we conflate these two concepts. Consistency: do I get the same answer over and over again, even if only semantically the same? Versus accuracy: is it the right answer?
Joel Moses
Right.
Ken Arora
And those are both important things to think about
00:08:18:10 - 00:08:27:21
Ken Arora
and might have different solutions for each. You can get consistently the wrong answer. If I keep trying to go to New York and you keep sending me to Hawaii, is that a problem?
00:08:27:23 - 00:08:29:09
Joel Moses
Depends.
00:08:29:09 - 00:08:29:21
Lori MacVittie
Ehhh.
00:08:29:23 - 00:08:33:25
Kunal Anand
That sounds like a lifestyle decision, honestly.
00:08:33:27 - 00:08:38:05
Lori MacVittie
You know, was it free? Then it's probably, you know, then I'm okay.
00:08:38:07 - 00:09:01:16
Ken Arora
Yeah. But, you know, I was just going to make that point. And then even context, I mean, context is hyper personal. The vibe coding is a hyper personal thing. Kunal is an experienced programmer. He probably gives it more latitude and might want to explore what patterns are used for this. On the other hand, if I'm using this in the context of, let's say, someone fresh out of school, or I'm just teaching somebody how to program, I'm probably going to give it a more narrow set of things.
00:09:01:24 - 00:09:10:24
Ken Arora
I think earlier you alluded to, Joel, about guardrails. Now those are, I think, important. But how do you do semantic guardrails is an interesting question.
00:09:10:26 - 00:09:21:09
Joel Moses
That's a fantastic question, yeah. I mean, these systems have drift. These systems have variability built into them. Guardrails get pretty tricky when the target keeps moving.
00:09:21:11 - 00:09:53:09
Kunal Anand
So I think there's a place for guardrails, which is, you know, being as close as you can to the model. But then when you start to compose systems and agents on top of these models, I think you have a different technique that you have to go through: evals. And I think when you kind of look at the world today, there's not a lot of people that know how to, like, build these sort of flows and test acceptance criteria or evals around these models before they get their workloads into production.
00:09:53:09 - 00:10:19:01
Kunal Anand
And that's where things kind of go awry. Having built a multi-agentic thing at home, I was building something on top of a visual language model, a VLM--basically a visual model for producing, you know, object classification. And I wanted to take that first output, which was: tell me what's in this photo,
00:10:19:04 - 00:10:53:26
Kunal Anand
and then based on what's in the photo, like go down, you know, a decision tree. But if that first classification is wrong, you're completely off the rails, right?
Joel Moses
Right.
Kunal Anand
And so in order for me to build reliability into the overall system, I had to build all these evals that I wasn't planning to build. Because I was effectively having to wrap every place where the output would go to something that would end up either making a decision or following a decision tree or some complex logic.
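The wrapping pattern Kunal describes--gating a classifier's output before it feeds a decision tree, plus an offline eval against labeled examples--might be sketched roughly like this. All names, labels, and thresholds here are illustrative stand-ins, not the actual system he built:

```python
# Illustrative sketch only: classify_photo, ALLOWED_LABELS, and the
# confidence floor are hypothetical stand-ins for a real VLM pipeline.

ALLOWED_LABELS = {"person", "vehicle", "animal", "other"}
CONFIDENCE_FLOOR = 0.80

def classify_photo(photo_id):
    """Stand-in for the VLM call; returns (label, confidence)."""
    canned = {"p1": ("vehicle", 0.93), "p2": ("person", 0.41)}
    return canned.get(photo_id, ("other", 0.10))

def gated_classify(photo_id):
    """Eval gate: pass output downstream only if it is well-formed and confident."""
    label, conf = classify_photo(photo_id)
    if label not in ALLOWED_LABELS or conf < CONFIDENCE_FLOOR:
        return None  # fall back / escalate instead of entering the decision tree
    return label

def run_eval(golden):
    """Offline eval: accuracy of the gated pipeline against labeled examples."""
    hits = sum(1 for pid, want in golden.items() if gated_classify(pid) == want)
    return hits / len(golden)
```

The point of the gate is that a low-confidence or out-of-vocabulary first answer never reaches the decision tree at all, and the offline eval tells you how often that gate helps before anything ships.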
00:10:53:29 - 00:11:18:20
Kunal Anand
And that was something I was totally unprepared for. And by the way, I think that's a domain that is going to absolutely open up in the next 1 or 2 years, as people start to deploy more and more of these systems and as agentic becomes real. I mean, I think there's a lot of hype around this right now. But as it becomes real, with actual orchestration systems, I wouldn't be surprised if we started to see, you know, bottom-up evals that the community produces.
00:11:18:20 - 00:11:35:13
Kunal Anand
But it's so context specific--and we've said that word a couple of times now--based on the application that you want to embed these agents or these workloads into. So knowing how to build an eval, knowing how to construct it, knowing how to determine whether or not it's a quality eval or if it's just noise in the system, I think that's going to be super duper important.
00:11:35:13 - 00:11:36:01
Joel Moses
Right.
00:11:36:03 - 00:12:01:21
Lori MacVittie
So test is prod and prod is test. I mean, that's
Joel Moses
Yeah.
Lori MacVittie
it's all collapsing into the same thing. Like when you're vibe coding, right, the coding pipeline still exists, but when an agent is vibe coding, now it's in production. The pipeline's moved. It's someplace else. You almost have to collapse this whole thing into the singularity
00:12:01:24 - 00:12:05:13
Lori MacVittie
that will be the data center.
Joel Moses
Yeah.
Lori MacVittie
Which is a scary thought.
00:12:05:15 - 00:12:30:04
Joel Moses
Yeah. We talked extensively about evals in our podcast on availability that just took place. Because, you know, availability is about correctness as well as responsiveness in the PAR triad that Lori has so lovingly crafted about application availability and application delivery in the era of AI. But, yeah, also, you know, correctness is a part of consistency.
00:12:30:04 - 00:12:35:21
Joel Moses
It's a part of the semantic categorization that you have to go to, to ensure reliability. So.
00:12:35:24 - 00:13:02:23
Lori MacVittie
Right, but the problem is that we're moving toward that, away from understanding what correctness means in a traditional world. Now we have inference and we're talking semantic consistency, which means: does it mean the same thing? Did it produce a similar outcome? We don't have methods for that. You know, folks are looking at how do I support all this inferencing and build out infrastructure to support it and to secure it.
00:13:02:25 - 00:13:24:22
Lori MacVittie
And they're saying I don't know how to establish semantic consistency. There's no measures for it. There's no reasonable systems that tell me what that is or what I should be looking for. So they're kind of caught in this, you know, limbo, right? They know they need to do something, but they don't have the tools yet to do it.
00:13:24:22 - 00:13:34:00
Lori MacVittie
So, I think we need to define something. You know, how do you measure semantic consistency then? Like,
Joel Moses
Yeah.
Lori MacVittie
how are we going to do that? That would be the first step.
00:13:34:02 - 00:13:55:06
Ken Arora
So I think this is a really exciting year. There's work going on--since we're talking about inference, right, the overarching topic today--on how do you do it efficiently. So some of these techniques say, "well, I'll ask a bunch of different LLMs, inferencing engines, and then I'll have another inferencing engine, or a set of them, make decisions."
00:13:55:06 - 00:14:17:18
Ken Arora
And you can imagine what that does to latency, cost, power usage, everything else. A different tack--and I'll say, when it's applicable, I would encourage people to think about this--can you place structure on the outputs? Can you make them quantifiable? In which case we can use traditional distance metrics. Can you make them structured? Can you tell an LLM, "summarize this in three bullet points"?
00:14:17:21 - 00:14:39:05
Ken Arora
And now you've got, you know, you are solving a more constrained problem.
Joel Moses
Yeah.
Ken Arora
And that's a bit of tactical advice for people doing this today: if you're worried about this and you want to build guardrails, can you structure the output of the LLM or your agentic AI system to be amenable to the sort of guardrails you're used to building?
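Ken's tactic--constrain the output to a fixed structure, then fall back on traditional distance metrics--could look something like this minimal sketch. The exactly-three-bullets rule and the Jaccard threshold are illustrative assumptions, not a prescribed recipe:

```python
# Rough sketch: reject any output that isn't exactly three bullet points,
# then compare two runs bullet-by-bullet with plain Jaccard word overlap.

def parse_bullets(text, expected=3):
    """Accept the output only if it is exactly `expected` bullet lines."""
    bullets = [ln.lstrip("-* ").strip()
               for ln in text.splitlines()
               if ln.strip().startswith(("-", "*"))]
    return bullets if len(bullets) == expected else None

def jaccard(a, b):
    """Classic set-overlap similarity on lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def consistent(run1, run2, floor=0.5):
    """Two runs 'agree' if each paired bullet overlaps at least `floor`."""
    b1, b2 = parse_bullets(run1), parse_bullets(run2)
    if b1 is None or b2 is None:
        return False  # structural guardrail tripped: wrong shape of output
    return all(jaccard(x, y) >= floor for x, y in zip(b1, b2))
```

The structural check is doing most of the work: once the output has a known shape, "is this run consistent with the last one" becomes an ordinary, quantifiable comparison instead of an open-ended semantic question.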
00:14:39:08 - 00:15:04:01
Kunal Anand
This is an open question: how possible is that when you've got these very large scale foundation models from these large labs, which are so inherently capable of doing so many things, that have their own unique system prompts? And even though you could try to influence the behavior of this thing, it could totally still go off the rails, right?
00:15:04:02 - 00:15:23:14
Kunal Anand
Because you didn't build it. You didn't write it. So how possible is it to do it with one of those? And are we just like a year away or months away from, you know, to your point, like these maybe smaller models or more focused models that aren't necessarily built or trained by these organizations or by people
00:15:23:16 - 00:15:52:07
Kunal Anand
but, you know, they're somewhat general purpose, but they're not too dangerous--they can't hallucinate so crazily, or we've got some improvements elsewhere in the system so they can't go off on their own and do something crazy, and it's just more constrained. And so, like, an inherently less capable model with more constraints versus, you know, a very large frontier foundation model that's, you know, multi-trillion parameters, that can do anything you want.
00:15:52:09 - 00:16:14:24
Joel Moses
That's a really good point. And actually, that brings up something else, which is, people's definitions of reliability and consistency may be different. Like some people may want reliability or consistency in how much they pay for the actions of the AI models. You know, when you have a lot of variability in the system over time, you know, you pay for tokens generated over time.
00:16:14:24 - 00:16:35:24
Joel Moses
You don't pay for the number of calls that you make, period. And so it's like paying for parking by how much coffee you drank while you were parked. So how do you drive consistency into a system that has inherent variability built into it? It's going to have differing costs over time based on how it generates.
00:16:35:26 - 00:16:48:24
Joel Moses
So, if you're trying to, you know, drive reliability into the system, you may want reliable cost over time. So, the
Kunal Anand
Totally.
Joel Moses
honestly the definition of reliability could be different depending on what your target is.
00:16:48:27 - 00:17:14:11
Lori MacVittie
Yeah and
Kunal Anand
Lori.
Lori MacVittie
I was going to say. So some of the data coming back from our annual survey, when we asked about the techniques people are using with those, right--specifically trying to understand what are they doing themselves--and the top answer for how they're basically adapting models was multi-model chaining and orchestration. So Ken's example, we're going to use most of them.
00:17:14:13 - 00:17:42:18
Lori MacVittie
The second top answer was distillation. So they're distilling models. And the third top answer was prompt engineering and templates. And I've heard all three of these kinds of solutions, right. The guardrails are in, right, the prompts, the templates; they're just loaded,
Joel Moses
Sure.
Lori MacVittie
right, at system time. So all of these are in there. So people are already trying to figure out how to make the models do what they want to do, which is really an effort to be more reliable.
00:17:42:18 - 00:17:53:17
Lori MacVittie
Like I want consistency, how do I get that? And we can't rely on the stock model, so we have to do something. And it looks like they're trying all the things right now.
00:17:53:20 - 00:18:14:00
Kunal Anand
There's something you brought up earlier, which is prod. Here's a fun little twist for this conversation: guaranteeing reliability in staging or in a dev environment means something totally different when you get these things into production.
Joel Moses
Sure.
Lori MacVittie
That's true.
Kunal Anand
And so like, how do you do it?
00:18:14:00 - 00:18:39:23
Kunal Anand
And so I think maybe now it makes more sense. Seriously, it makes more sense now. Like, maybe you need to actually have the entirety of the production data set, or the universe of choices that could come in as input into these models. Like, you just need more of that in your dev and staging environments, because your limited data set--what you play with in staging--isn't good enough anymore, because, you know, it's the Wild West.
00:18:39:26 - 00:19:01:08
Lori MacVittie
It's "it works on my machine" now applied to the entire environment. Like, well, it works in my environment; that's your problem over there. You know, I don't think we're going to solve this today, unfortunately. It would be cool if we could. That would be awesome. But I don't see that happening, except that we need to do something.
00:19:01:08 - 00:19:23:14
Lori MacVittie
It is changing and how do people adapt to that? So, what would you leave people with to think about as they're trying to figure out how do I deal with the fact that it's not reliable at the moment? Or enforce that or measure it? All these things. Like it's changing, what do they need to do? What should they take away? Other than slop is good.
00:19:23:16 - 00:19:26:03
Lori MacVittie
I'm just going to throw that out there. Slop is good is a takeaway.
00:19:26:03 - 00:19:50:17
Ken Arora
So I'd start with the takeaway: think about your problem and how much slop--or creativity--you want. How much variability is appropriate for what you're doing? And sometimes it's a feature, and when it's not, I'd say think about how you can constrain the problem. And the tools you have might be the prompt. That's the most common tool.
00:19:50:17 - 00:19:58:05
Ken Arora
And so if you can't play with the models as Kunal said, then constrain the prompt. Make it something more amenable to the guardrails and apply guardrails to that.
00:19:58:07 - 00:20:15:24
Joel Moses
I can't wait till these systems have a slop knob that you can turn up to 11. That would be pretty cool.
Lori MacVittie
All the way up.
Joel Moses
My takeaway is, you know, variability in these systems is actually part of their design and part of the beauty of their design. And so don't think that, you know, same prompt different answer is an unreliable system.
00:20:15:24 - 00:20:19:18
Joel Moses
It's not a bug. It's jazz.
00:20:19:20 - 00:20:24:28
Kunal Anand
I'm gonna go with the Anchorman thing: "60% of the time, it works every time."
00:20:25:00 - 00:20:26:20
Joel Moses
Exactly.
00:20:26:22 - 00:20:46:04
Kunal Anand
No, I think it's more along the lines of what you both said, Ken and Joel. I think: know your problem. There's some places where, you know, shelling out to an AI or prompting an AI makes total sense and you can get away with whatever output comes back. But know your domain.
00:20:46:10 - 00:21:12:21
Kunal Anand
Know your problem. If you are really worried about the boundaries, and you really are worried about hallucinations having a side effect, or having an unintended outcome, or something happening that you're just worried about: apply guardrails. Apply evals to better know what the model is capable of before you get this thing into production. You know, we don't even get to talk about my favorite technique, which is using multiple models with each other to double-check each other's work.
00:21:12:24 - 00:21:19:02
Kunal Anand
That's fun. And so, like, there's all sorts of fun things that you can start to do. But yeah, it's a fun time right now.
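Kunal's favorite technique--multiple models double-checking each other's work--can be sketched as a simple majority vote. The model callables and the quorum value below are stand-ins for illustration, not real model APIs:

```python
# Loose sketch of cross-checking: ask several models the same question and
# only accept an answer that reaches a quorum of agreement.

from collections import Counter

def cross_check(prompt, models, quorum=2):
    """Return an answer only if at least `quorum` models agree on it."""
    answers = [m(prompt) for m in models]
    best, count = Counter(answers).most_common(1)[0]
    return best if count >= quorum else None  # no consensus: escalate to a human
```

With three stand-in models where two agree, the consensus answer passes; with a three-way split, the function returns None and the caller can escalate instead of acting on a single model's output.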
00:21:19:05 - 00:21:40:28
Lori MacVittie
It is fun and it's moving fast. Every time I see the data it's just like, yep, it's rolling along. It's already here; it was here last week. We're on to the next thing. So it's very exciting and it's awesome. And things are changing, which makes it exciting, because we're going to see new ways to do things, new solutions, new approaches, right?
00:21:40:28 - 00:22:09:13
Lori MacVittie
Something new always comes out of that that you didn't think was possible or necessary before. So an AI will probably help find that solution, which is unironically ironic, I guess. But in the meantime, yeah, know what you're doing, right. Define your problem and use the tools that you have at your disposal today, right. Right now nobody can measure reliability, but there are tools to help you intercept,
00:22:09:13 - 00:22:48:05
Lori MacVittie
look at the prompts, look at the responses, to be able to judge it outside the flow. Maybe you can't do it in real time, but you can get an idea, right? Is my system acting like a chat bot, or is it trying to send everybody on a free trip to Hawaii?
Joel Moses
I'm good with that.
Lori MacVittie
And if that's your system, let us know. We want to get in on it. So, that is a wrap for Pop Goes the Stack. If your stack is still upright, celebrate by subscribing and maybe mainlining some caffeine. We'll see you next time for more hype translated into human speak and subtitles will be included for the carnage.