Explore the evolving world of application delivery and security. Each episode will dive into technologies shaping the future of operations, analyze emerging trends, and discuss the impacts of innovations on the tech stack.
00:00:05:11 - 00:00:37:20
Lori MacVittie
Hi. You're listening to Pop Goes the Stack, the podcast that treats emerging tech like an incident report—root cause to be determined. Impact, probably your weekend. Usually is, especially on a Friday. I'm Lori MacVittie, and my co-host Joel Moses is back from his fine-tuning exercise.
Joel Moses
Thanks, Lori.
Lori MacVittie
I can't wait to hear what you've learned while you've been kind of on a break, because today we're going to talk about the newest bright idea in AI.
00:00:37:23 - 00:01:00:12
Lori MacVittie
So, don't pay humans to evaluate model outputs, right? If we're looking for hallucinations, or, you know, has it kind of gone wonky? Eh, don't let humans tell you that; let another model do it. And it's called LLM-as-a-judge. And there was a great paper written about it, where it talks about this thing called preference leakage.
00:01:00:14 - 00:01:25:16
Lori MacVittie
And apparently, when models in the same family look at their baby, they all say, "yes, it's beautiful." Essentially, this is a problem. So to talk about it, we brought Distinguished Engineer Ken Arora back. So, yeah, let's jump in. Let's talk about preference leakage and judges and models and
Joel Moses
Yeah.
Lori MacVittie
you know, should I be scared?
00:01:25:18 - 00:01:40:12
Joel Moses
Well, let's first start by talking about why judgment systems are necessary. Ken, why don't you explain to us, first of all, what LLM judging means? Why is it necessary? What does it do?
00:01:40:15 - 00:02:12:22
Ken Arora
Well, you know, any machine learning system fundamentally needs some sort of feedback, some sort of way of saying, am I doing the right thing? Am I not doing the right thing? And that feedback can come in many forms. As Lori said, sometimes that's humans saying right answer, wrong answer. But humans are expensive, slow, and take a lot of time, all the problems that, you know, organic life forms have. So, one of the ways to deal with that is to say, well, why don't I just have a model do it?
00:02:12:22 - 00:02:30:08
Ken Arora
And there are a lot of approaches that have been used, and I could talk for a while, start a whole podcast, about different ways of doing this. But one of the techniques that people have tried and are using is: let me use another model, right? It's sort of like phone-a-friend. It's like, okay, I think that's the answer,
00:02:30:08 - 00:02:50:25
Ken Arora
let me phone a friend. And therein lies the rub. I'm going to toss this back to Joel, because Joel actually played with this. But the idea they found is, if you phone a friend and that friend is somebody very much like you, like a twin brother, maybe you shouldn't be surprised if that answer has a bit more bias to it than if you phoned somebody random.
00:02:50:28 - 00:03:26:24
Joel Moses
Yeah, I'm going to try to stay a little bit above the philosophical aspects of keeping everything inside the family and creating structures that are essentially in the same family. But it is true. So, whenever you're creating a judgment, whenever you're trying to determine the quality of the output or the quality of the content going in, setting up an LLM that judges the output of something that's being prompted is a way to very quickly figure out whether the content is accurate, whether it's complete, the whole nine yards.
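A minimal sketch of the LLM-as-a-judge pattern Joel describes, assuming only a generic complete(prompt) callable rather than any particular provider SDK; the rubric wording and the 1-to-5 scale are illustrative choices, not anything from the paper or the speakers.

```python
from typing import Callable

JUDGE_RUBRIC = """You are a strict evaluator. Score the RESPONSE to the PROMPT
on a 1-5 scale for factual accuracy and completeness.
Reply with only the integer score.

PROMPT:
{prompt}

RESPONSE:
{response}
"""

def judge_output(prompt: str, response: str,
                 complete: Callable[[str], str]) -> int:
    """Ask a judge LLM to score another model's response (1 = poor, 5 = excellent)."""
    raw = complete(JUDGE_RUBRIC.format(prompt=prompt, response=response))
    try:
        return max(1, min(5, int(raw.strip())))
    except ValueError:
        return 1  # unparseable judgment -> treat as lowest score

# Usage with a stand-in judge; in practice `complete` would wrap a real judge model.
if __name__ == "__main__":
    fake_judge = lambda _: "4"
    print(judge_output("What is 2 + 2?", "4", fake_judge))  # -> 4
```

Which model sits behind that complete callable is exactly where the family-of-the-judge question discussed next comes in.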
00:03:26:27 - 00:03:55:24
Joel Moses
However, this research paper, and we're going to link it below, says that if you use the same basic model or model family—now, I think that's also interesting to note. When I say model family, there are different families, like the ones from OpenAI, the GPT-4o line and its successors, and then there's Gemini and there's Llama—but when you point the model at its counterpart,
00:03:55:24 - 00:04:28:23
Joel Moses
meaning if I use an OpenAI-generated model and I test it, or judge it, using an OpenAI-generated model, there is some inherent bias that makes the output look better than it is. And the paper goes into some detail about how that might be the case. It makes a lot of assertions, but it statistically indicates, through a battery of tests using some very popular LLM benchmarks, that this could possibly be the case.
00:04:29:00 - 00:04:52:04
Joel Moses
And it seems to be isolated to models within the same family. So that's an interesting outcome. Now, how severe is that problem? The paper classifies it as a vulnerability of judgment systems. It also goes on to talk about how this particular vulnerability might be used to determine what system is judging the output of an LLM
00:04:52:06 - 00:05:14:00
Joel Moses
and has some interesting things to say about that. But, in the final analysis, it still indicates that even when judgment systems are used, they tend to improve the output. Right? So I'm not sure you can classify this as a vulnerability. It's more of an odd characteristic of judgment systems. Do you agree?
00:05:14:03 - 00:05:36:15
Ken Arora
Yeah, I think that's true. Sorry, I'll go back to being a little more philosophical, and the analogy here is, maybe this is a symptom, the fact that you have this bias. I like that word better; you know, "preference leakage" sounds like a medical condition.
00:05:38:09 - 00:05:44:15
Lori MacVittie
There is a chemical solution for that, I'm pretty sure.
00:05:44:18 - 00:06:07:09
Ken Arora
But, yeah, this systematic bias. And this isn't just one technique, right? This idea of a panel of experts, a mixture of experts, has been around a long time in AI. You know, I step back and I go, I wonder how much this is also really, in fact, a reflection of the fact that models in a family are often trained in a similar way.
00:06:07:09 - 00:06:29:13
Ken Arora
And their training sets have similar goodness metrics that they're going after. So what you're really seeing is a reflection. It's a symptom of the fact that these models, to anthropomorphize them again, grew up in the same family, in the same culture. So, of course they're going to have the same biases.
00:06:29:13 - 00:06:51:13
Ken Arora
If I grew up in a small town in Texas with a small family, it's going to be very different than if I grew up in, you know, the middle of Delhi with an extended family and 10 million of my best friends living in that same city. And it's going to be reflected in my worldview. And you can say, oh, well, yeah,
00:06:51:13 - 00:06:56:15
Ken Arora
you know, there's bias there. It's reflective of the backstory.
00:06:56:18 - 00:07:25:07
Lori MacVittie
And I'm not sure it's a problem, or that there actually is an answer. Some of this seems like we're searching for an answer to an unsolvable problem, because if I asked Joel, you know, is this correct, and then I asked someone else, is it correct, I can get two different answers. And that's when it's not, to use the word, deterministic. I mean, two plus two is four.
00:07:25:09 - 00:07:50:20
Lori MacVittie
Everybody should give me that answer. But if I ask, you know, is 70 degrees a good temperature? Well, you're going to get different answers, right? So how do you decide if that's correct? Correct for whom? And correct for what?
Ken Arora
Yeah.
Joel Moses
Yeah.
Lori MacVittie
And so that transfers over into even things like, you know, business, where I'm writing an email.
00:07:50:20 - 00:08:12:21
Lori MacVittie
Is that correct? Well, who are you talking to? Who is it coming from? What's the topic? What's the context? There are a hundred different variables that impact whether this is, as you pointed out, a good email. And that's what we're asking it to judge: whether this, right, completely, you know, randomly generated bunch of words that describes something is any good.
00:08:12:28 - 00:08:33:07
Joel Moses
Yeah. Now, I want to be very clear about what this preference leakage concept is and separate it from the concept of data leakage, which is a related but different thing. Data leakage is where you have a training corpus, a bunch of stored data, and you use that to train a model. And then you run an evaluation test set against it.
00:08:33:07 - 00:09:03:04
Joel Moses
And there are a bunch of different evaluation test sets. Then you evaluate the output of this trained model and discover where it is leaking certain information through directly from the training corpus. Okay, so that's one element of leakage. Now, in order to guard against that, sometimes what you can do is generate synthetic data alongside other data and feed that into the training set, which tends to improve its ability to be resistant to data leakage.
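As a rough illustration of the verbatim data-leakage check Joel is distinguishing here, the sketch below flags how much of a model's output reappears word-for-word in the training corpus; the n-gram size and the corpus text are made up for the example, not taken from the paper.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams, used as a crude fingerprint of verbatim text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_leakage(output: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the output's n-grams that appear verbatim in the training corpus.
    High values suggest the model is regurgitating training data."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus_docs))
    return len(out_grams & corpus_grams) / len(out_grams)

# Example: an output that copies a training sentence scores 1.0 (fully verbatim).
corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
print(verbatim_leakage("the quick brown fox jumps over the lazy dog near the river",
                       corpus, n=5))
```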
00:09:03:07 - 00:09:35:17
Joel Moses
But what this paper is saying is that if you use the same model to generate synthetic data to train the model itself, and then you judge it using a system that uses the same model, there's an inherent relatedness bias. It will tend to score higher, because the model that generated the synthetic data it was trained on and the system that is judging the veracity of the output have some sort of connectedness.
00:09:35:20 - 00:10:00:28
Joel Moses
And because of that, the scores look skewed. And that relatedness can sometimes put at risk your ability to defend against things like data leakage, or to detect and stop things like hallucinations, using a judgment-based system. So it's kind of a backside evaluation problem. And that's an interesting concept.
00:10:00:28 - 00:10:21:12
Joel Moses
Like, if you're generating data from a particular model and you're evaluating it using the same model or model family, there's a connection between what generated the data and what is evaluating the data, even though they are not physically connected in any way. And that's a fascinating outcome.
00:10:21:12 - 00:10:25:11
Joel Moses
Although I think it's kind of something that you should expect.
00:10:25:24 - 00:10:37:21
Joel Moses
The main corpus that you're creating the synthetic data from is the same, and what it was trained on is the same, so there will be some sort of correlation there.
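One way to make the correlation Joel and Ken are describing measurable is to score the same outputs with a judge from the generator's own family and with a judge from an unrelated family, then look at the gap. A sketch under those assumptions; the judge callables and the scores below are stand-ins, not results from the paper.

```python
from statistics import mean
from typing import Callable, Sequence

Judge = Callable[[str, str], float]  # (prompt, response) -> score

def family_bias_gap(prompts: Sequence[str], responses: Sequence[str],
                    same_family_judge: Judge, cross_family_judge: Judge) -> float:
    """Average score gap between a judge from the generator's own family and an
    unrelated judge, over the same outputs. A persistently positive gap is the
    kind of same-family preference the paper describes."""
    same = [same_family_judge(p, r) for p, r in zip(prompts, responses)]
    cross = [cross_family_judge(p, r) for p, r in zip(prompts, responses)]
    return mean(same) - mean(cross)

# Usage with stand-in judges; real judges would wrap two different model families.
if __name__ == "__main__":
    prompts = ["Summarize TCP slow start.", "What does a 502 error mean?"]
    responses = ["It ramps the congestion window...",
                 "The upstream returned an invalid response."]
    gap = family_bias_gap(prompts, responses,
                          same_family_judge=lambda p, r: 4.6,   # sibling model, generous
                          cross_family_judge=lambda p, r: 3.9)  # unrelated model, stricter
    print(f"average same-family bonus: {gap:.2f}")
```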
00:10:37:23 - 00:10:57:11
Ken Arora
Yeah. I think you used the right word: even though they're not connected. I'd nuance that; they're not directly connected,
Joel Moses
Directly.
Ken Arora
but they are indirectly connected, because they have that same training set. The synthetic data you generate can only be emitted from the universe of the space that that model is aware of.
00:10:57:13 - 00:11:18:00
Ken Arora
If I go back to that human analogy again: if I grab somebody from the tropics and I ask them questions, odds are that not a lot of the synthetic data that's generated is going to talk about the 17 types of snow that the Eskimo knows about. And so, of course, there's going to be a blind spot there.
00:11:18:00 - 00:11:29:09
Ken Arora
And they're going to answer things. They're going to say, oh, do I need to worry about, I don't know, how I build a roof? And it's like, oh no, roofs are fine, just put on a flat roof, it's fine, because rain will fall right off. Because they're never thinking about snow loads and things like that.
00:11:29:09 - 00:11:36:03
Ken Arora
To push this analogy a bit further than I probably should have. But yeah, it's your universe.
00:11:36:06 - 00:11:58:11
Lori MacVittie
Right? I mean, that's the way that the model, and I hate that we keep saying "model," was trained; it builds weights and relationships from the training data. And those same weights and relationships exist in other models that were trained with that data. So it makes sense that if you, you know, ask the question, get an answer, and then you feed it back and say, hey, is that good?
00:11:58:18 - 00:12:06:16
Lori MacVittie
Well, it's got the same values and weights inside of it. It's going to be like, well, yeah, it's good. I'd say the same thing.
00:12:06:18 - 00:12:27:03
Joel Moses
Yeah. So what do we do about this? I mean, that's the obvious question. You know, it's a common practice inside of engineering, when working with these models, to judge their outputs using evaluation tools, things like LLaMA-Factory, AlpacaEval, Arena-Hard. There's a whole bunch of test kits out there.
00:12:27:05 - 00:12:59:12
Joel Moses
It's becoming a much more common practice in the AI space to look at your models with a critical eye for security purposes. So red teaming of models is increasingly something that security practitioners are used to. I think what this paper is pointing out is a condition where you need to look at the evaluation of the AI model the way a security person would look at defense-in-depth strategies. Meaning you don't necessarily protect using one technology,
00:12:59:12 - 00:13:19:19
Joel Moses
you protect using a variety of technologies, so that if it passes muster with one, it may not pass muster with another: a diversity, so to speak, of security protections. And so I think this is pointing out the need, when you're doing red teaming, to have diversity in the evaluation set.
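A sketch of that defense-in-depth idea applied to judging: pool scores from judges drawn from different model families and escalate to a human when they disagree too much. The family names, scores, and disagreement threshold are illustrative assumptions.

```python
from statistics import mean, pstdev

def panel_judgment(scores_by_family: dict[str, float],
                   disagreement_threshold: float = 0.75) -> dict:
    """Aggregate scores from judge models drawn from different families.
    Wide disagreement is a signal to escalate to a human rather than trust
    any single (possibly related) judge."""
    values = list(scores_by_family.values())
    spread = pstdev(values)
    return {
        "mean_score": mean(values),
        "spread": spread,
        "needs_human_review": spread > disagreement_threshold,
    }

# Example with made-up scores from three hypothetical judge families;
# the wide spread trips the human-review flag.
print(panel_judgment({"family_a": 4.9, "family_b": 3.2, "family_c": 3.0}))
```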
00:13:19:22 - 00:13:38:15
Lori MacVittie
Yeah. It makes sense. If I write something, right, you become so familiar with it that you often miss little things, like typos or little grammar errors, that an editor, which is why we have them, will find for you. Because it's a fresh set of eyes. It's a different perspective.
00:13:38:15 - 00:13:54:07
Lori MacVittie
They haven't looked at it a hundred times. So having some external system, right, look at the outputs and judge them makes absolute sense to me, especially in something as varied as generative AI.
00:13:54:09 - 00:14:15:17
Ken Arora
Yeah, no, totally, absolutely. So, more concretely, I think this says, and again there are some economic considerations to this as well, but if you are developing models, it's important to try to test them, break them, evaluate them in the context of models from different families. I think, right, that seems to be the
00:14:15:22 - 00:14:26:15
Joel Moses
Yeah.
Ken Arora
obvious answer. It's like, well, if you're going to ask about the weather, don't only ask people in Hawaii, or only in San Diego, or only in Minnesota.
00:14:26:18 - 00:14:29:15
Joel Moses
Yeah. Everything looks like snow in Minnesota, doesn't it, Lori?
00:14:29:17 - 00:14:43:29
Lori MacVittie
Well, I don't know. I'm not in Minnesota. I'm in
Joel Moses
Okay.
Lori MacVittie
Wisconsin, which is, like, Minnesota's neighbor. So, yeah, it all looks like snow eventually. And ice, and sleet.
00:14:44:02 - 00:15:03:29
Joel Moses
Good point, good point. So I mean, the obvious question here is, what do we do in response to something like the possibility of preference leakage? There's diversity. But can you think of any other approaches, Ken, that might be useful for judgment systems, which are increasingly popular and being used for security purposes in red teaming models?
00:15:04:01 - 00:15:25:15
Ken Arora
Okay, you've hit me with this out of the blue, so off the top of my head here, I'd say this touches on another thing that I think that same research team talks about, which is hallucinations and what causes them. And there's a school of thought, with some evidence, that says hallucinations are caused like this.
00:15:25:17 - 00:15:50:22
Ken Arora
You know, when you take the SAT, or any standardized test, they kind of take into account that you might randomly guess. If it's multiple choice and you have five choices, you're randomly going to get one in five right if you guess; therefore, the test assumes that and takes it into account. A lot of LLM training doesn't do that, which is, you know, maybe a little surprising. There's a definite penalty: you get zero points for saying, I don't know.
00:15:50:24 - 00:16:12:10
Ken Arora
But if you blindly guess and you're right 10% of the time, it means you'll hallucinate 90% of the time, but you still get 10% of the points you could have gotten. So there's an incentive to guess, and that might cause hallucinations. Anyway, relating this back to preference leakage, I think we probably need to do more, when you train models,
00:16:12:10 - 00:16:30:21
Ken Arora
even within a single family, to encourage them to say, "I don't know, this is outside of my universe of knowledge." And I think that might help. You know, then when you phone a friend, the friend has the freedom to say, "yeah, I don't know either. Maybe you want to call a different model, because I can't evaluate
00:16:30:21 - 00:16:45:20
Ken Arora
whether that's a good answer or not." Because maybe, and here I'm definitely speculating, the areas where preference leakage will give you bad answers are cases where the confidence in the answer is pretty low.
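A back-of-the-envelope version of Ken's standardized-test point, using his right-10%-of-the-time figure; the wrong-answer penalty value is an assumption added for illustration, not something the paper prescribes.

```python
def expected_score(p_correct: float, reward: float = 1.0,
                   wrong_penalty: float = 0.0, abstain_score: float = 0.0) -> float:
    """Expected points for guessing minus the points for abstaining."""
    guess = p_correct * reward + (1 - p_correct) * wrong_penalty
    return guess - abstain_score  # > 0 means guessing beats saying "I don't know"

# Ken's example: right 10% of the time, no penalty for being wrong.
print(expected_score(0.10))                       # 0.10  -> guessing (hallucinating) pays
# Add a wrong-answer penalty and abstaining becomes the rational choice.
print(expected_score(0.10, wrong_penalty=-0.25))  # -0.125 -> better to say "I don't know"
```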
00:16:45:22 - 00:17:07:12
Joel Moses
Got it. So there are definitely implications. And this is definitely an area where we're going to need to watch the technology as it matures. And as judgment systems are used and red teaming systems become more popular, it's going to be more important to know about how those processes actually work, how evaluation processes of models actually do work.
00:17:07:15 - 00:17:27:14
Joel Moses
What are they comprised of? What are they built from? What models do they use, if they use models? Now, Ken, another question, and again, this is ad hoc, and you may or may not have an answer to this, but, you know, there's also a growing popularity of systems that check output in real time.
00:17:27:19 - 00:18:05:09
Joel Moses
And what we're talking about today, the preference leakage, is in evaluation tools, meaning taking a model and battery-testing it. Not looking at prompts in real time or anything like that, but literally testing the input and the output of a model system and seeing if there's any inherent bias there. But for security solutions, for example, that are judging the output and looking for data loss within live output, it's becoming a common theme to use small language models, which are very reactive, very quick, but maybe very specialized, in order to do a quick check of the output of a user prompt.
00:18:05:12 - 00:18:12:09
Joel Moses
Does this research have any bearing on the use of SLMs for protecting data transactions?
00:18:12:11 - 00:18:52:09
Ken Arora
My off-the-top-of-my-head thought on this is: if I believe that some significant fraction of the root cause of this is having a limited universe, well, large language models have a fairly large universe and it's just not perfect; small language models are going to have a much more limited universe. So I would speculate they might be more biased, more susceptible to this preference leakage sort of vulnerability. The way I would tend to mitigate that is maybe having an independent model whose only job is
00:18:52:09 - 00:19:13:07
Ken Arora
to evaluate how much the question is in the domain of the small language model. Because when you build a small language model, you typically know the domain. So maybe you ask a model trained for this purpose, "would this be in the domain of expertise of that small language model?" and use that as a way of judging how much you should trust the answer.
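A rough sketch of the gating Ken is speculating about: only trust the specialized small model's verdict when an independent model agrees the input is inside the SLM's domain. The callables, threshold, and the data-loss example are all stand-ins, not a prescribed design.

```python
from typing import Callable

def gated_slm_check(text: str,
                    slm_flags_data_loss: Callable[[str], bool],
                    in_domain_score: Callable[[str], float],
                    min_domain_confidence: float = 0.7) -> str:
    """Only trust the specialized SLM's judgment when an independent model says
    the text is inside the SLM's known domain; otherwise escalate."""
    if in_domain_score(text) < min_domain_confidence:
        return "escalate"                       # outside the SLM's universe
    return "block" if slm_flags_data_loss(text) else "allow"

# Usage with stand-ins: a pretend DLP-style SLM and a pretend domain checker.
if __name__ == "__main__":
    fake_slm = lambda t: "ssn" in t.lower()      # flags possible data loss
    fake_domain_checker = lambda t: 0.9          # claims the text is in-domain
    print(gated_slm_check("Customer SSN is 123-45-6789", fake_slm, fake_domain_checker))
```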
00:19:13:09 - 00:19:14:01
Joel Moses
Yeah.
00:19:14:03 - 00:19:19:15
Joel Moses
Well, once
Lori MacVittie
I
Joel Moses
once again, there's a ton to learn in this particular space. Any other questions, Lori?
00:19:19:17 - 00:19:44:28
Lori MacVittie
No, I was just going to comment. I mean, we naturally go to security, because I think security is top of mind right now; these are new systems, and there are so many ways to exploit them. But just the definition of, you know, the goodness or correctness you mentioned has significant implications for other areas within the enterprise, like when I want to choose, you know, one of three models that are out there.
00:19:44:28 - 00:20:12:09
Lori MacVittie
Well, what if one is degraded? What if it's not correct anymore? Do I really want to send to it? Right. So the way that we scale and distribute requests, right, app delivery if you will, is definitely impacted by models degrading. So being able to judge correctness is actually very important to just about everything that happens in building out these systems.
00:20:12:09 - 00:20:29:14
Lori MacVittie
So this is a good place to start, and we need to be aware of it, and figure out how we can get better at it, so that we don't, right, end up with a "yeah, it's okay" that turns out not to be okay. We don't want to make those kinds of selections. So it's an important topic.
00:20:29:21 - 00:20:34:11
Ken Arora
Yeah. Well, in a generative model, correctness is security.
00:20:34:14 - 00:20:35:04
Lori MacVittie
Ah, see
00:20:35:06 - 00:20:37:21
Joel Moses
I like that. That's a good way to sum it up.
00:20:37:22 - 00:20:41:19
Lori MacVittie
I would say it's reliability. Should we argue about that? Is it reliability, or is it
00:20:41:19 - 00:20:45:16
Joel Moses
Sounds like another podcast
Ken Arora
Yeah.
Joel Moses
to me, Lori.
00:20:45:19 - 00:21:06:29
Joel Moses
You know, let's talk about some of the things we've learned today. So we've learned about something called preference leakage. And again, the paper will be linked within the body of the description. It's an interesting paper and an interesting concept to think about. I think, instead of classifying this as a vulnerability, I would classify this mainly as a characteristic of models themselves.
00:21:06:29 - 00:21:26:05
Joel Moses
And it's something that we probably need to do a little more thinking about as we look to drive biases out of the systems that improve models and out of the systems that protect models. So when I'm looking for a takeaway from today, I'm thinking about figuring out bias in these systems.
00:21:26:07 - 00:21:30:09
Joel Moses
It makes me look outside of the model itself now.
00:21:30:11 - 00:21:44:22
Ken Arora
Yeah. I guess if I have one takeaway, it would be sort of that, phrased in the more philosophical, human realm: with bias, be careful what you measure, because that's exactly what you'll get.
00:21:44:24 - 00:22:06:26
Joel Moses
Got it.
Lori MacVittie
Wow, that was beautiful, and short and to the point. And I think that's a great way to end this episode. So that's a wrap for Pop Goes the Stack. Subscribe now so you don't miss the next innovation that'll break staging before it breaks production.