Chain of Thought | AI Agents, Infrastructure & Engineering


Show Notes

As AI agents and multimodal models become more prevalent, understanding how to evaluate GenAI is no longer optional – it's essential. 

Generative AI introduces new complexities in assessment compared to traditional software, and this week on Chain of Thought we’re joined by Chip Huyen (Storyteller, Tép Studio) and Vivienne Zhang (Senior Product Manager, Generative AI Software, NVIDIA) for a discussion on AI evaluation best practices.

Before we hear from our guests, Vikram Chatterji (CEO, Galileo) and Conor Bronsdon (Developer Awareness, Galileo) give their takes on the complexities of AI evals and how to overcome them through the use of objective criteria in evaluating open-ended tasks, the role of hallucinations in AI models, and the importance of human-in-the-loop systems.

Afterwards, Chip and Vivienne sit down with Atin Sanyal (Co-Founder & CTO, Galileo) to explore common evaluation approaches, best practices for building frameworks, and implementation lessons. They also discuss the nuances of evaluating AI coding assistants and agentic systems.

Chapters:

00:00 Challenges in Evaluating Generative AI

05:45 Evaluating AI Agents

13:08 Are Hallucinations Bad?

17:12 Human in the Loop Systems

20:49 Panel discussion begins

22:57 Challenges in Evaluating Intelligent Systems

24:37 User Feedback and Iterative Improvement

26:47 Post-Deployment Evaluations and Common Mistakes

28:52 Hallucinations in AI: Definitions and Challenges

34:17 Evaluating AI Coding Assistants

38:15 Agentic Systems: Use Cases and Evaluations

43:00 Trends in AI Models and Hardware

45:42 Future of AI in Enterprises

47:16 Conclusion and Final Thoughts

Follow:

Vikram Chatterji: https://www.linkedin.com/in/vikram-chatterji/

Atin Sanyal: https://www.linkedin.com/in/atinsanyal/

Conor Bronsdon: https://www.linkedin.com/in/conorbronsdon/

Chip Huyen: https://www.linkedin.com/in/chiphuyen/

Vivienne Zhang: https://www.linkedin.com/in/viviennejiaozhang/


Show notes:

Watch all of Productionize 2.0: https://www.galileo.ai/genai-productionize-2-0

What is Chain of Thought | AI Agents, Infrastructure & Engineering?

AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead.

Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes bi-weekly.

Conor Bronsdon is an angel investor in AI and dev tools, Head of Technical Ecosystem at Modular, and previously led growth at AI startups Galileo and LinearB.

Challenges in Evaluating Generative AI
Conor: [00:00:00] Welcome back to Chain of Thought. I'm Conor Bronsdon, Head of Developer Awareness at Galileo. And welcome back, Vikram Chatterji, Co-Founder and CEO at Galileo. How are you doing, Vikram?
Vikram: Thanks, Conor. Doing good. Good to be here. How are you doing?
Conor: I'm good. I'm excited for this conversation, and particularly I'm excited not only to talk to you, but for the second half of this conversation, where our listeners are going to get to hear from some fantastic panelists who joined us for Productionize 2.0 to discuss evaluating GenAI: Vivienne Zhang, Senior Product Manager, Generative AI Software at NVIDIA.
And another ex-NVIDIA person, Chip Huyen. Folks may know her as the author of Designing Machine Learning Systems. She's now at Tép Studio and was formerly a VP at Voltron Data; you've probably seen her on X or Twitter or LinkedIn at some point. We're definitely going to play that conversation later in [00:01:00] the episode.
But before we do, we're going to explore the various methods and techniques used to evaluate GenAI models. It's become a hot topic as we think about not only evaluation intelligence broadly, but also evaluation-driven development and the need to really build from first principles with evaluating and observing agentic systems in mind, particularly given that at times these models can be a bit of a black box, and given the ethical and governance considerations that come in around this beyond simply the practical.
So, Vikram, let's start by talking about this challenge of evaluating generative AI. What makes this evaluation process for gen AI difficult compared to traditional software systems?
Vikram: Yeah, Conor, that's, I think, the question of the hour with most AI companies and AI teams that we work with. There are a couple of things that come to mind, right? Number one, if you look at evaluations in general, it's not a new term for AI, right? But [00:02:00] it was mostly done by data scientists.
So if you talk to a data science team about evaluations, it's very natural to them. There's no other way of doing things. You have to have a training set. You have to have a test set. You have to compare the two over time. You have to, you know, fine-tune with the right kind of data.
It all comes down to looking at the inputs, looking at the outputs. It's science, right? It's experimentation. But if you compare that to traditional software engineering and look at the persona of the people who are building out apps increasingly, it's not just the data scientists anymore. There are like 500,000 data scientists in the world, but there are 30 million developers out there that can now just invoke these LLMs with an API call. So that's where it's a little bit new, and it's tough, because it's introducing some of the concepts of data science to software engineers so that they can experiment. And when you're experimenting rigorously, for software engineers, what that means is now they have to think about a couple of different things. One, what are the different pieces of the puzzle that I need to bring in? [00:03:00] Including the right kind of prompt, the right kind of model, the right kind of, I don't know, the vector store that you want to use, if you want to use one. If you want to bring in an entire orchestration system, like LangChain or LlamaIndex, you want to make that determination quickly. A lot of people don't do that; they just build their own chains. So you have to make all of these determinations for the different parts of your stack, and with agents and multimodal models,
this expands a little bit more. So that's number one: thinking through, what's the assembly that I have, that I'm playing with? That's not necessarily new for software engineers. If you're a software engineer and you're building out an app, you're already thinking about what's the right database, what's the right structure I want to bring in, what are the right function calls. All that stuff is not brand new. What is new is, from there on, when you want to run an application through this, or do a singular run through this compound system that you brought in, all of a sudden you have to think about: what's the right context I can provide this with, if it's a RAG-based system? Now it's not easy, and it's not simple enough [00:04:00] to just point it to a knowledge base and hope that it works. You kind of also have to massage that knowledge base and convert it into a format that is actually consumable by this entire system. So that's net new work that you have to figure out. You have to evaluate that context, and now you have to evaluate the prompt itself. You have to figure out what the responses look like. Now, within all of this, you have to figure out: what's a good test set that you can actually use?
What are the right metrics that you can actually use? All of this is new. The idea of coming up with a high-quality test set, the idea of using specific metrics that don't exist yet; it's not cost and latency anymore, which maybe in the world of Datadog you could just get out of the box, but these metrics don't exist. So what are the right kind of evaluation metrics that I can use? But then again, once I do figure out that something's wrong or right, how do I get into the mindset of rigorously testing these entire pieces out, using an A/B/n comparison over time? And then what do I [00:05:00] do in production? So there's a lot of net new items that folks have to come to terms with, which is what's leading to a lot of noise around, hey, I do evaluations in this way,
and then I do evaluations this other way, I'm using just an LLM as a judge; and somebody else says, I'm just throwing humans at the problem and doing vibe checks and it's fine. And so we see all of these things in the market, versus just coming together and understanding what the true principles are that you could use in order to do this the correct way. Which is why the question that we hear a lot from users is not so much, hey, evaluations are not necessary. It's more that evaluations are just hard, because the process is very difficult to come to terms with.
Evaluating AI Agents
Conor: And you brought up the example of AI agents, and I think that's a great one to illustrate the challenge of effective evaluations within AI, because these systems are really unique. They use LLMs to plan out their [00:06:00] actions, they can take actions via real-world tools and APIs, and often they have multi-turn or multi-agent workflows.
So we're seeing builders want to evaluate three parts of that agentic workflow. The step level: was the right tool chosen and used correctly at each individual point of the workflow? The turn level: were the steps actually performed in the correct order to reach the correct conclusion? And then of course the session level of the workflow:
is that final result accurate? But that's something that's really only started to come into clarity in the last couple of months here. We're seeing this in industry surveys, but even while we're hearing that agents are kind of the hot topic, only 13 percent or so of AI builders right now are already leveraging agents.
Even if we expect to see that percentage massively increase, how many of those folks are actually effectively evaluating their agents? It doesn't seem like that many so far.
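To make that step, turn, and session framing concrete, here is a minimal sketch in Python of how results at those three levels of agent evaluation might be structured and rolled up. The class and field names are illustrative assumptions, not any particular product's schema.

```python
# Illustrative sketch only: three levels of agent evaluation results.
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepEval:
    """One tool call inside a turn: was the right tool used, with valid arguments?"""
    tool_expected: str
    tool_used: str
    args_valid: bool

    @property
    def correct(self) -> bool:
        return self.tool_used == self.tool_expected and self.args_valid


@dataclass
class TurnEval:
    """One turn of the workflow: were the steps the right ones, in a working order?"""
    steps: List[StepEval] = field(default_factory=list)

    @property
    def all_steps_correct(self) -> bool:
        return all(s.correct for s in self.steps)


@dataclass
class SessionEval:
    """The whole session: did the final result actually satisfy the task?"""
    turns: List[TurnEval] = field(default_factory=list)
    final_answer_correct: bool = False

    def summary(self) -> dict:
        steps = [s for t in self.turns for s in t.steps]
        return {
            "step_accuracy": sum(s.correct for s in steps) / max(1, len(steps)),
            "turns_fully_correct": sum(t.all_steps_correct for t in self.turns),
            "final_answer_correct": self.final_answer_correct,
        }
```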
Vikram: Yeah, that's true. What we've been hearing from a lot of these companies and our partners is really fun to watch, because [00:07:00] we've seen a lot of these people go from NLP-based techniques for chatbots, et cetera, to using large language models with prompts, to running RAG-based systems and feeling like, aha, we finally got this, it's kind of working. But then: if only I could take this action with a tool, that would be amazing, instead of it just being this random chatbot. If I could just answer the user's question and also do this thing. And now with agents, they're basically saying, finally, we can do this. Here's an API call I can make, and it can be one of four API calls.
Let's just have the LLM make this API call, right? And that's basically what an agent is. And we're seeing that as they've progressed down this road, with every single added piece of complexity come all these new questions: I'm super excited about this, but wait, how do I know when this goes wrong?
Like, how do I know that it's actually gone wrong? And to your point, it does get into the nuances, right? I keep mentioning how, for all of this, it's very important to have the brass tacks in place. The brass tacks
meaning you've got to have the right kind of logging systems. Even when [00:08:00] you're shipping a product, you have to be able to log everything to know what's going wrong. It's very similar, right? In an agentic system, you've got to log every single part of the puzzle: which API call was made, which ones were not made, why not, and when the API call was made, what happened right after that. And you need to be able to see that in a way which just makes sense for you. That's brass tacks to me.
That's the tracers and loggers. There are 100,000 different kinds of tools out there that can do that. Open source, free, what have you. Very easy to make, but you shouldn't make it yourself; use one of the 100,000 tools out there for that. Tracers and loggers, lots of them. The part which becomes interesting now is, for your specific use case, for your specific agent, how do you figure out what the right kind of metrics are? How do you measure this in an effective way? That's what takes a lot of iteration, and that's the part you're seeing a lot of people trip up on. And how do I know if the agent actually got to completion or not? You can have an out-of-the-box metric, which a lot of folks are maybe going to say that they have, but it takes iteration, because that depends a lot on the specific use case that you're going to [00:09:00] ship. So that's the part where, from a software perspective, from an evaluation perspective, it's very important to nudge the user in the right direction so that they can actually make that determination of what the right metrics are and be able to see that in a very explainable way. As and when these systems get more and more complicated, it's going to become even more important to log the right stuff, to be able to create the right kind of metrics, to test out those kinds of metrics, and then be able to visualize these responses and the traces in an effective way.
Especially as we enter RAG plus agents plus multimodal, it's going to get very interesting.
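As a rough illustration of the "brass tacks" logging Vikram describes, here is a minimal sketch of recording every agent step to a JSONL trace file so failed tool calls can be inspected and replayed later. The field names are assumptions for illustration; dedicated tracing and logging tools provide much richer versions of this.

```python
# Illustrative sketch only: log every LLM call, tool call, and retrieval an agent makes.
import json
import time
import uuid


class AgentTracer:
    def __init__(self, path: str):
        self.path = path
        self.session_id = str(uuid.uuid4())

    def log_step(self, step_type: str, name: str, inputs: dict, output=None, error=None):
        record = {
            "session_id": self.session_id,
            "timestamp": time.time(),
            "step_type": step_type,  # e.g. "llm_call", "tool_call", "retrieval"
            "name": name,            # which model or tool was invoked
            "inputs": inputs,
            "output": output,
            "error": error,          # populated when the call failed, so failures stay visible
        }
        with open(self.path, "a") as f:
            f.write(json.dumps(record, default=str) + "\n")


# Usage: wrap every call the agent makes.
tracer = AgentTracer("agent_trace.jsonl")
tracer.log_step("tool_call", "get_weather", {"city": "Berlin"}, output={"temp_c": 18})
tracer.log_step("tool_call", "book_flight", {"to": "BER"}, error="missing parameter: date")
```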
Conor: Completely agreed. And I think that's why we're starting to hear thought leaders throughout the space think about evaluations as kind of a P0 for building AI applications. And you're going to hear Vivienne and Chip talk about this more later in the episode. Part of the challenge here is that, because we're now leveraging these non-deterministic software systems, it's hard and quite [00:10:00] subjective to understand what good or bad responses can be at times.
So, how can we work to establish more objective criteria for evaluating generative AI, especially when dealing with these open-ended tasks? Like, how can we determine what's good?
Vikram: For determining what's good, it always comes down to having the right kind of guardrails in place. And it depends so much upon your use cases. This is again where we keep trying to educate different teams with this notion that what's good for you might be different from what's good for somebody else,
even for a very similar kind of use case. As an example, for chatbots, for turn-by-turn chat applications, if it's, let's say, based on RAG, there are a lot of these metrics which are now out of the box, right? Like, does the response adhere to the context that was provided or not?
Does the response adhere to the previous responses and inputs from the user in the chat or not? [00:11:00] But, you know, a chatbot for, let's say, buying shoes from Nike is very different from a chatbot that, let's say, a patient is using with a healthcare provider. Right? The risks are different. For Nike, maybe the risk is much lower, and so they don't need to have governance guardrails and compliance guardrails in place. Whereas when you talk to the healthcare provider, there's a lot on the line. The developer over there has to not just think about the specific use case and the nuances of the use case, but also has to think about things like: I have to be HIPAA compliant. What does that mean, and how does that translate into different kinds of guardrails? And maybe I'm working out of the EU, so now I have the EU AI Act guardrails to also bake into place. How do I make that happen? Our general thesis on this has been: list out all of the different kinds of gotchas you can think of from a system perspective that could go wrong, meaning context adherence, prompt quality, et cetera. Come up with that list, come up with a list from a use case perspective of what could go wrong, come [00:12:00] up with a list from a regulatory and compliance perspective. And then, for all of these, try to figure out: can you convert each and every one of them into actual guardrails that you can codify, right?
Whether they're LLM-powered or code-based, bake that into your system for every single developer. A lot of them you might not be able to codify, and so those have to be taken care of through humans in the loop. That's the determination you have to make to figure out how you're going to be attacking the system to answer the simple question of: is this good or not? It's very nuanced.
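As a rough sketch of what codifying those gotchas can look like when a check is code-based rather than LLM-powered, here are two illustrative guardrails and a tiny runner. The rules and domain terms are invented for the example and are nowhere near real HIPAA or EU AI Act compliance; they only show the shape of the idea.

```python
# Illustrative sketch only: turn a list of "gotchas" into codified guardrail checks.
import re


def no_ssn(response: str) -> bool:
    """Compliance-style guardrail: block anything that looks like a US SSN."""
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", response) is None


def stays_on_topic(response: str, allowed_terms: set) -> bool:
    """Crude use-case guardrail: require at least one in-domain term."""
    return any(term in response.lower() for term in allowed_terms)


GUARDRAILS = [
    ("no_ssn", no_ssn),
    ("on_topic", lambda r: stays_on_topic(r, {"appointment", "prescription", "symptom"})),
]


def run_guardrails(response: str) -> list:
    """Return the names of every guardrail the response fails."""
    return [name for name, check in GUARDRAILS if not check(response)]


print(run_guardrails("Your SSN 123-45-6789 is on file."))            # ['no_ssn', 'on_topic']
print(run_guardrails("Your appointment is confirmed for Tuesday."))  # []
```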
Conor: Yeah, and one of the key challenges that we've seen with LLMs and AI more broadly is this idea of hallucinations, where an AI model will produce incorrect or misleading information. And I know it's something Galileo has talked quite a bit about. Anyone who has been familiar with us has probably seen a couple of versions of our Hallucination Index that have come out,
where we've identified which models have tendencies to hallucinate and tracked metrics for them. But as you'll hear later this [00:13:00] episode, in our panel with Chip and Vivienne, they suggest that hallucinations aren't always a bad thing. Can you explain that concept further for our audience?
Are Hallucinations Bad?
Vikram: They're not necessarily a bad thing, in the sense that, again, it depends on the context in which we're saying this, right? It's not a bug, it's obviously a feature. A hallucination is basically something the model is coming up with, and it might not be something that makes a lot of sense, but the model is kind of like an operating system or a brain, quite literally, where it's learning from a bunch of stuff.
And if you're asking a question that's outside of what you've taught it, outside of its comprehension, it's going to say something that doesn't make a lot of sense. Now, when it does that, it's possible that it's actually saying something which is net new. It's a lot like asking a baby a question. And sometimes, you know, kids say the darndest things, where they'll say something and you'll be like, wait, that doesn't make sense, but oh, wait a second, it does. That's beautiful. That's amazing. Why couldn't we think of that? It's because we have a lot of guardrails in place [00:14:00] already, built over the course of time with age.
Right? So that's kind of how I see it. Like poetry, for instance: you need creative freedom there. And oftentimes models will come up with stuff which is really interesting and really poetic, and you might not have been able to come up with that yourself. There, hallucinations are a good thing. And they're definitely a part of how these models think. So then the question becomes: given complete free rein, the models can say anything, which is beautiful and amazing and really interesting and fascinating. But when you get into applications and software development, it's a lot about curtailing risk and minimizing risk, bringing regulations into play, bringing your specific use case elements into play. And that's why hallucinations start to become the bane of your existence. So it becomes a little bit of a chicken and egg at that point, where you want to use this powerful system because it's going to come up with something outside of what you programmed it to do, but at the same time, you want it to follow the rules.
[00:15:00] I am seeing more and more that these creative use cases are few and far between in the enterprise, which is why hallucinations are something that you consider just reining in as much as you can.
Conor: As developers determine what constitutes a hallucination for their application, how do you think they should be thinking about this?
Vikram: Hallucinations are definitely one aspect of looking at whether the response was good or not. And if you dive deeper into that, it could mean a couple of different things. One is, if you're using a closed-book system, like a non-RAG-based system where the response could be anything from anywhere, then it becomes very much about the accuracy of the response. And that accuracy could be from a factual perspective: does this actually make sense? It could also be from a semantic, sense-making perspective: is this even a real English sentence, or is this even a sentence? Those are things you can create [00:16:00] guardrails around. But that's more around the correctness of the response, and the semantics, the toxicity of the response, and things of that nature, and at least the latter is an easier problem to solve. The overall correctness of a response is a little bit harder, because there is no global fact checker you can use, unless you use another powerful LLM in the loop that is trained with data up until the last couple of days, to the present day. But the other kind of hallucination, if let's say you're using a RAG-based system, is when it's hallucinating just because it might be giving out a response that is outside of the syllabus you gave it, right? In which case, if it's only supposed to give you responses about world history and you ask it when Rome was built, and it starts giving you a response about, I don't know, Caesars Palace in Las Vegas, then it's outside of the syllabus you gave it.
It's trying to give the right response, and maybe the response is right, but it's outside of the syllabus. And so that's a hallucination, but it's much more use-case specific. So those, closed book and open [00:17:00] book, are two ways in which we've seen people think about hallucinations at a high level. And again, that's just one form of model response guardrail type that we think developers should think about.
Human in the Loop Systems
Conor: And this, I think, speaks to the need to consider human-in-the-loop systems as well, particularly as we think about these agentic workflows. I know you have some thoughts on human-in-the-loop systems. What's your perspective on them?
Vikram: I think they're necessary. For this one, I generally say never say never, but I feel like they're never going to go away. Just like for software design as well, right? When software applications were being built out initially, the idea was we could just get rid of humans completely in the loop.
It's just if-else statements, et cetera, and we can just launch this application; we don't need humans in the loop for anything. But what we quickly realized was you need quality testing. You need humans to look at the exception cases. You need subject matter experts to understand why things went wrong and fix that. [00:18:00] That is much more pronounced in the world of AI application development, where the opportunities for edge cases are a lot greater, and you need somebody to have that semantic understanding of exactly what the use case is and what the right answer should have been, to be able to help give that response, for any neural network based application.
This is just going to be the case. Even if you fine-tune this LLM with the world's best data, as soon as you put it into production, I can guarantee you all of your biases and all of your best-laid plans are going to completely go off the rails, and you're going to see users come in with all sorts of other kinds of questions. Now you'll need humans in the loop to help figure out how to improve your test set, to figure out where your metrics start failing you and how to start to improve that. Again, humans in the loop are absolutely necessary. I do feel like they should not be the first line of defense; that's too expensive, and that's what we need to move away from. But you do need to make them maybe the second or the third line of defense, and start using them more intelligently versus just throwing the [00:19:00] kitchen sink at them.
That's kind of what I see happening a lot across many AI app developers today, but that's not how they should be used. They're absolutely a necessary part of the entire value chain.
Conor: Fantastic. Vikram, do you have any closing thoughts you want to share with the audience?
Vikram: Evaluation and observability are just really important parts of the application development lifecycle. We are maturing as an ecosystem in terms of how best to evaluate and how best to monitor these applications. What kinds of metrics work?
What kinds of metrics don't work? But remembering that this is a science, and not just classic software engineering, is very important. And so I'm pretty bullish on the idea that software developers are going to align across the board on a very specific way of evaluating these systems, and we'll all be better for it very quickly.
Conor: Fantastic. Vikram, thank you so much. Listeners, stay tuned after a quick break. You'll hear from our panel as they dive more into evaluations in AI.
Unlock [00:20:00] evaluation intelligence for your AI team with Galileo, the industry-leading platform for evaluating GenAI applications.
Trusted by AI leaders like HP, Databricks, Comcast, Twilio, Headspace, CollegeVine, and many more, and powered by our proprietary Luna Evaluation Suite, Galileo enables evaluation-driven development and observability for your AI systems. Learn more at galileo.ai.
Next up, we've got our Practical Lessons from GenAI Evals panel from Productionize 2.0. You'll hear from Chip Huyen, Storyteller at Tép Studio and acclaimed author; Vivienne Zhang, Senior Product Manager, Generative AI Software at NVIDIA; and Atin Sanyal, Co-Founder and CTO at Galileo, as our moderator.
Panel discussion begins
Atin: The topic of today is GenAI evals and any practical lessons we can take away from how the eval landscape has evolved [00:21:00] from 2023 going into 2025. We'll cover a bunch of things, including challenges that we've seen and the various approaches and solutions to evals, how evals can vary from use case to use case,
as well as the role of human in the loop, the hot topic of hallucinations, of course, and some of the newer challenges which have emerged post-production, once you deploy your app, because that's a new muscle that has been built up more recently. But I'll kick things off with a more high-level question. Chip,
this is a question for you: why do you think evaluation is such a big challenge in generative AI? And if you could, make a comparison of evaluations in GenAI to more traditional, non-generative, classic machine learning.
Chip Huyen: How much time do we have? So I do believe that evaluation is one of the biggest, if not the biggest, challenges of GenAI in production.
There are multiple reasons why it's hard. I think one reason is [00:22:00] that the more intelligent AI becomes, the more challenging it is to evaluate its outcome. Almost everyone can tell whether the solution to a first-grade math problem is wrong, but it's really hard to tell whether the solution to a PhD-level math question is wrong.
The second is that GenAI also introduces open-ended outputs. Before, a lot of applications were close-ended, like classification. For spam detection, if you know what spam is, you can compare the output with what's expected and tell whether it's wrong.
But with open-ended output, it's so hard. If you ask the model to summarize a book, the summary might sound very reasonable, but you actually might have to read the entire book to see whether the summary is good or not. I think those are two key reasons.
I mean, I can go on and on, but I would love to hear more from my fellow panelist.
Atin: No, absolutely. Vivienne, I'm curious to get your take on this.
Challenges in Evaluating Intelligent Systems
Vivienne Zhang: Yeah, I actually feel Chip just [00:23:00] said what I wanted to say, but I'm just going to give more examples in the same vein, and hopefully we drive it home.
As the systems become more intelligent, the number of people, or more specifically subject matter experts, who are qualified to do this assessment just becomes fewer. To give you an example, NVIDIA has a copilot to enable our chip designers to design chips faster and better.
And I had to take a look at some of the sample questions and answers, and someone like me who has no hardware design background has no idea what the question is, not even speaking of the answer quality. So if we think about how complex the tasks are that we're now throwing at our GenAI applications, the difficulty of assessment has also increased.
And then the [00:24:00] second point is that, just as Chip said, it's non-deterministic. To give you an example, I think a lot of our users can head to LMSYS or something called RewardBench, which are open leaderboards where you can find samples of A/B answers from models. And you can see that actually a lot of humans don't agree on which answer is better.
There's no single right answer; there could be 10 different correct answers for one question, and you need to design a system to assess that. And that's very difficult.
User Feedback and Iterative Improvement
Atin: Just continuing on that thought, I'm curious to know, from your standpoint, seeing users use NeMo: evaluation was traditionally looked upon as this point-in-time kind of activity, you know, just viewing evals as a statistic or some number that you slap onto a data frame.
In light of that, that has clearly proven to be insufficient. So having [00:25:00] seen users use NeMo, and from your experience, how have you seen folks bridge that gap, and what have you seen users typically do to address it?
Vivienne Zhang: Yeah, I think the point-in-time activity you're talking about is, you know, typically the app developer will curate a dataset, and then they measure the application's performance against that benchmark, and that's it.
By the way, that actually probably sets you ahead of the big crowd already, because I feel a lot of application developers also just do a vibe check and then send it to production. What I think is very, very important is what happens after you release the application. When it's in production, it's almost a different ball game.
The queries you're going to get are much, much more complex and involve a lot of different edge cases. I think what I've seen working is that you simply have to keep iterating. You have to capture the user feedback, the usage data, [00:26:00] and systematically troubleshoot what has tripped up your LLM, and then improve the system.
Again, going back to the ChipNeMo example at NVIDIA, we saw a lot of cases at the beginning, when the copilot was failing, where the answer simply didn't exist in the database. So we needed to go back and improve the data instead of the model or the system. And similarly, we also saw some of the chunks got cut off at odd places, and then we needed to change the chunking strategy.
So I think doing a lot of releases, capturing user feedback, and then systematically improving the system iteratively is super important.
Atin: Yeah, that touches upon the role of human feedback here as well.
Post-Deployment Evaluations and Common Mistakes
Atin: But, you know, shifting gears to talking about post-deployment evaluations. Chip, from your experience, what are the top things that can go wrong when evaluating your AI system [00:27:00] once it's deployed?
Chip Huyen: There are a lot of things that could go wrong with evaluations. And I think one of the number one mistakes I see people make is not clearly defining the guidelines for evaluation criteria.
For example, a lot of people use AI as a judge to evaluate the output of the model, and the result really depends on the judge. When I talk to a team and ask, okay, so what metrics do you use, they always mention coherence, relevancy, and people use these metrics as if they're exact metrics.
But they're actually highly dependent on the model you use and on the prompt you use for the judge. And this is a big mistake I see: the people who maintain those metrics and the people who use the metrics are different people. So if, say, the metric team suddenly changes the prompt, the scores change. And actually, I've gone through a lot of evaluation tools, open-source tools, where I can see the prompts they expose.
A lot of them have typos; almost every single one has [00:28:00] a little typo. And people fix typos over time in those prompts. So I think you need to use the scores, but you can't fully trust them, because those scores themselves change over time: today's 90 percent can be different from yesterday's 90 percent.
So it's very hard. And the other thing is: what does a good or bad evaluation score mean? LinkedIn has a really good case study. They built a bot to help candidates evaluate whether a job is a good fit for them, and they found out that a lot of correct responses are not good responses.
For example, if the candidate asked the bot, hey, am I a good fit for this role, the bot would respond, no, you're a terrible fit. It is a correct response, but it's not a good one. So you have to spend a lot of time curating guidelines for what is good and bad. And I feel like sometimes we just get lazy, because we have to explain what good is, and we just skip that step, hoping AI judges will figure it out, and it doesn't quite go that way.
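One way to contain the judge-prompt drift Chip describes is to version the judge configuration, model, prompt, and guideline together, and store that version alongside every score, so today's 90 percent is only compared with scores from the same judge version. This is a minimal sketch; the prompt wording and model name are only examples, and judge_fn stands in for whatever LLM client you actually use.

```python
# Illustrative sketch only: pin and version an LLM-as-a-judge metric.
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    name: str
    version: str          # bump whenever the model, prompt, or guideline changes
    model: str
    prompt_template: str  # the scoring guideline lives inside the prompt, verbatim


RELEVANCE_JUDGE_V2 = JudgeConfig(
    name="relevance",
    version="2.0",
    model="example-judge-model",
    prompt_template=(
        "Score 1-5 how well the ANSWER addresses the QUESTION.\n"
        "5 = fully addresses it and is helpful; 1 = unrelated.\n"
        "A factually correct but unhelpful answer scores at most 3.\n\n"
        "QUESTION: {question}\nANSWER: {answer}\nScore:"
    ),
)


def judge(cfg: JudgeConfig, judge_fn, question: str, answer: str) -> dict:
    """judge_fn is any callable (model, prompt) -> score; a real one calls an LLM API."""
    prompt = cfg.prompt_template.format(question=question, answer=answer)
    score = judge_fn(cfg.model, prompt)
    # Store the judge version next to the score so runs stay comparable over time.
    return {"metric": cfg.name, "judge_version": cfg.version, "score": score}


# Usage with a stand-in judge function.
print(judge(RELEVANCE_JUDGE_V2, lambda model, prompt: 3,
            "Am I a good fit for this role?", "No, you are a terrible fit."))
```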
Atin: Right. No, totally.
Hallucinations in AI: Definitions and Challenges
Atin: And I guess related to that, hallucination has also sort of become an AI-judge-related metric. I know a lot of folks use [00:29:00] LLMs as judges to detect hallucinations. But Chip, related to your previous point about the kinds of mistakes you could make in these sorts of metrics:
in the context of RAG hallucinations, for example, if I were to build an AI-as-a-judge metric, number one, is that the right technique? Are there alternatives for me to evaluate hallucinations in RAG? And what are some of the limitations of using, say, AI as a judge to detect RAG hallucinations?
Chip Huyen: So when I give talks in person, I usually ask, who thinks hallucination is a big issue, and almost everyone raises a hand. And then I ask the next question, who here can explain what hallucination is? Almost nobody raises their hand. So what is hallucination?
I usually think of it this way: a hallucination is not always a bad thing. I'm very into creative use cases, so I think hallucinations are going to be a really, really great thing for creative use cases. It's a bad thing for use cases that depend on factual consistency.
So hallucinations are bad when the output is factually [00:30:00] inconsistent with what is considered the facts. So it's very important to understand what the facts are. There are still many things we don't understand about hallucinations, but one thing we do know is that models are more likely to hallucinate when they do not have access to the necessary and correct information to answer questions.
I think that's why RAG is so important, so powerful, because RAG was created for knowledge-intensive use cases: it gives the model access to contexts that are supposedly relevant and important for the model to answer questions. So what we're trying to evaluate is, first, whether the context we retrieve is relevant and correct for the question.
And second, based on this context, is the answer relevant and correct, adhering to the retrieved context and to the question? This actually maps onto an existing task in natural language processing; I'm not sure how many people here are familiar with the terms natural language inference or contextual entailment.
With those ideas, given a statement and a [00:31:00] hypothesis, can you derive the hypothesis from the statement? For example, if you have the statement "fruit is good for you," then "mango is good for you" can be derived from it, because a mango is one of the fruits. But "bananas are expensive" is neutral, right?
You can't derive that from the statement. So yes, I do think AI can be very useful for that; I don't think that's strictly how it has to be modeled, but AI in general can be very useful here. You can train a classification model: given this context and given this answer, can the answer be derived from, and is it consistent with, the context?
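A minimal sketch of the entailment-style check Chip outlines, using one publicly available NLI model from the Hugging Face hub: treat the retrieved context as the premise and the generated answer as the hypothesis, and read off the entailment probability. Any NLI checkpoint with contradiction/neutral/entailment labels would do; this illustrates the general idea, not any specific vendor's hallucination metric.

```python
# Illustrative sketch only: NLI-based groundedness check for RAG answers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # labels: 0=contradiction, 1=neutral, 2=entailment
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


def entailment_score(context: str, answer: str) -> float:
    """Probability that the answer is entailed by the context (higher = more grounded)."""
    inputs = tokenizer(context, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return probs[2].item()  # index 2 is the entailment class for this checkpoint


context = "Fruit is good for you."
print(entailment_score(context, "Mango is good for you."))   # expected to score higher
print(entailment_score(context, "Bananas are expensive."))   # expected to score lower
```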
Atin: In fact, I was just going to say that even at Galileo, a lot of the innovations that we've done with our Luna suite of models, our hallucination detection models, essentially take the principle of textual entailment, and we've done a bunch of innovations on top of that to improve the accuracy of the hallucination detection.
So NLI is certainly very core to [00:32:00] how we have defined RAG hallucinations as well. But Vivienne, on the topic of hallucinations, I'm curious: you've worked with users of NeMo, and I'm sure you've had a good broad swath of understanding of different use cases. How have you seen users of NeMo, or just AI practitioners, define hallucinations, to Chip's point?
Especially in the context of enterprise GenAI. I'm just curious to know what users mean when they say hallucinations.
Vivienne Zhang: Yeah, I thought Chip's explanation is super interesting and probably a lot more comprehensive and thoughtful than what I will provide. I think I have a very simplified view representing users, which is the application gives an answer that looks correct, but it's not what the user is looking for, and I will consider that a hallucination.
So Jensen actually gave a very good example at this year's GTC earlier, so I'm going to [00:33:00] stick with the ChipNeMo example. Basically, he asked the ChipNeMo copilot, what is CTL? An answer that is generally accepted is Computation Tree Logic, but that's not the right answer in this context for a chip designer.
The right answer is Compute Trace Library. It's specific to NVIDIA, and it's what the user needs. And can you imagine a new designer going into meetings where everyone is talking about CTL and somehow it just does not make sense? So I think for a naive user, this answer is just as insidious, just as unhelpful, as an answer that's completely wrong.
It can still be extremely confusing. And if you extrapolate to legal, healthcare, finance, many high-stakes use cases, at best it's a poor [00:34:00] user experience, and at worst it can be harmful or even dangerous.
Atin: Absolutely. There are a couple of topics around use cases that I'd love to talk about next.
One is code generation and the other is, um, agentic systems. So let's start with code generation.
Evaluating AI Coding Assistants
Atin: I have a question for you both. We have had models like OpenAI Codex and Code Llama, which are specifically designed for codegen, and they're pretty fantastic models. I'd love both your takes on this: given that these models already generate code, and on top of that there are nuanced codegen use cases like coding copilots, et cetera,
what does evaluating an AI coding assistant really boil down to? What are the top three things? Chip, maybe we can start with you.
Chip Huyen: Coding is one of the popular use cases because it's one of the use cases that's easier to evaluate. When you ask it to generate code, you can evaluate it by functional correctness: [00:35:00] first, whether the code runs, and second, how efficient it is.
For example, if you ask it to generate a SQL query, you can see how fast or slow it is, or how much memory it uses. So I do think that's one reason that makes coding extremely popular. You just evaluate it based on whether it generates code that does the things you want it to do and how efficient it is.
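A minimal sketch of the functional-correctness evaluation Chip describes: check that the generated code runs, check it against known test cases, and time it. The harness below uses plain exec() for brevity, which is unsafe on untrusted code; real code-evaluation harnesses sandbox execution.

```python
# Illustrative sketch only: evaluate generated code for (1) runs, (2) correct, (3) efficient.
import time


def eval_generated_function(code: str, fn_name: str, tests: list) -> dict:
    namespace = {}
    try:
        exec(code, namespace)          # 1) does the code even run and define the function?
        fn = namespace[fn_name]
    except Exception as e:
        return {"runs": False, "error": str(e)}

    passed = 0
    start = time.perf_counter()
    for args, expected in tests:       # 2) does it do what we asked?
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    elapsed = time.perf_counter() - start   # 3) rough efficiency signal
    return {"runs": True, "pass_rate": passed / len(tests), "seconds": elapsed}


generated = "def add(a, b):\n    return a + b\n"
print(eval_generated_function(generated, "add", [((1, 2), 3), ((0, 0), 0)]))
```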
Atin: Absolutely. Vivienne, your take?
Vivienne Zhang: Yeah. What I find very difficult in evaluating AI coding assistants is that it's often not the logic or syntax, or even functional correctness or writing the functions to test it. Where I see LLMs often failing is that they do not even know that certain libraries exist for them to architect a tool to do very complex
tasks. And, you know, there are many standard APIs now for LLMs to do all kinds of stuff, like checking the weather or using Wolfram Alpha to figure out [00:36:00] some calculation tasks. For enterprises, I think a lot of the secret sauce, a lot of the value prop, is often in their proprietary codebases, and you need to be able to give that knowledge to an LLM system and application.
And evaluating that takes a lot of effort, for the reasons we first started with, right? The tasks are increasingly complex, and you need a subject matter expert, in many cases a senior engineer, to do code review, and that is very expensive and time-consuming, and hopefully that can get easier going forward.
Atin: I can add my own anecdotes from our customers and the various practitioners I've worked with, especially in terms of code. We've seen that iteration on the code is not only something you have to constantly do, but it's a big pain to evaluate. What coding assistants get right is general code, like give me a for loop that does X, Y, Z on an array.
Where it [00:37:00] gets tricky is when there's contextual code and you're trying to generate new pieces of code based on that context. There's often utter garbage that will be produced, because there's a very thin line between good code and utter garbage code, which is different from natural language, where you can express one sentence in 50 different ways that would mean the same thing.
One mistake in the code and it's garbage. So the line is very thin. That's where evaluation becomes more of an iterative challenge. Say you missed an import statement: hey, can you add it back? You kind of go step by step, and that's where, at least in my experience, I've not seen one particular metric just work out of the box.
It's more that there's a lot of human in the loop and a lot of iterative, step-by-step getting to the right code for more complex code use cases.
Vivienne Zhang: No, I think that makes a lot of sense. I think the idea of contextual code is something like what I was mentioning: [00:38:00] libraries, existing knowledge of codebases. And that's very different.
Atin: The other thing I wanted to ask about is evals in the new agentic landscape and agentic workflows. So Chip, a question for you.
Agentic Systems: Use Cases and Evaluations
Atin: Essentially, agents are designed to perform and automate tasks and make decisions. What are some examples of agentic applications that you've seen or heard of people building, and what are the critical things to keep in mind from an evaluation standpoint when it comes to agents?
Chip Huyen: I'm really excited about agentic use cases; I'm very bullish on them. The way I see agents, they have two components. One is tools. For example, web search is a tool, a calculator is a tool, and there's a tool that a lot of us are already using, which is a retriever.
So RAG is actually one type of agentic system, but usually for me, agentic means something more [00:39:00] complex, like using a code generator, a more complex workflow. And the second aspect is planning: given a task, constraints, and a set of tools, the agent is supposed to come up with a plan,
a strategy to solve this task. And so, because of my view of agents as based on these two components, I would probably evaluate the agents based on that. For example, if possible, of course, I think we would want to evaluate based on whether it can accomplish the task.
Whether it's achievable within the constraints. For example, if you ask it, hey, plan me a two-week trip with a budget of $5,000, is it able to plan for two weeks and stay within the budget? The other is whether it can call the right tool.
A lot of the time it will try to call an invalid tool, or it will try to use the correct tool but not with the right parameters. Because a lot of the time the model has to infer them: if it calls a function for the weather, it perhaps has to extract [00:40:00] the location and the time.
It has to extract that, and you check whether it's valid. You can also evaluate based on how efficient the plan, the sequence of actions, is: if the sequence of actions is very long, it might go on and on and on and never be able to stop. Whenever I try to evaluate some use case, I would probably try to break down how this application is going to fail,
and try to put a lot of evaluation metrics around the points of failure.
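A minimal sketch of two of the checks Chip lists: validating each tool call against a registry of known tools and required parameters, and flagging plans whose sequence of actions runs past a step budget. The registry, call format, and budget are assumptions for illustration.

```python
# Illustrative sketch only: validate an agent's planned tool calls.
TOOL_REGISTRY = {
    "get_weather": {"required": {"location", "date"}},
    "search_flights": {"required": {"origin", "destination", "date"}},
}
MAX_STEPS = 10  # flag plans that go on and on without stopping


def check_tool_call(call: dict) -> list:
    spec = TOOL_REGISTRY.get(call["tool"])
    if spec is None:
        return [f"invalid tool: {call['tool']}"]                    # wrong tool entirely
    missing = spec["required"] - set(call.get("args", {}))
    if missing:
        return [f"{call['tool']} missing args: {sorted(missing)}"]  # right tool, wrong parameters
    return []


def check_plan(tool_calls: list) -> list:
    errors = [e for call in tool_calls for e in check_tool_call(call)]
    if len(tool_calls) > MAX_STEPS:
        errors.append(f"plan too long: {len(tool_calls)} steps")
    return errors


plan = [
    {"tool": "get_weather", "args": {"location": "Rome"}},   # missing 'date'
    {"tool": "book_hotel", "args": {"city": "Rome"}},        # not a registered tool
]
print(check_plan(plan))
```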
Atin: Yeah, I fully agree, and thanks for that detailed breakdown. You covered, I think, all aspects of an agentic system. Vivienne, on the topic of agents, what Chip described is a fairly complex workflow.
There's a lot of nuance and a lot of subjectivity to, say, tool selection, whether it's right or wrong, and parameters, whether they're right or wrong. Have you seen, number one, frameworks [00:41:00] like NeMo Guardrails being used in agentic systems, and in particular, have you seen any kind of human-in-the-loop story here with agents?
Vivienne Zhang: Yeah, I'm also very excited about agents, and NVIDIA has actually released many agent blueprints, because we believe there are so many use cases that can benefit from them. Specifically to your question of human in the loop, I'm going to give two examples: one we're all very familiar with, and one that's more visionary but hopefully happening in the near future.
For the very familiar one, I think we've all interacted with a customer service chatbot. It will keep handling your queries until it decides it can no longer deal with the complexity of the query, and then it will send the query to a human, and that will be the human in the loop. And as simple as that sounds, doing it properly and correctly and [00:42:00] giving a good user experience is not easy.
And so I think iterating and doing this eval on all the failure points, as Chip was identifying, is super important. And then, to talk about what will hopefully be in the works in the future, I want to go back to the original definition of an agent: it is a person or a party who acts on behalf of another person or group.
And I think the implied definition of the relationship is that the goal is defined and identified by the human, and the means are delegated to the agent. And I think that's an ideal human-in-the-loop vision for me. So I will define all the goals and the parameters, like I want a trip in this way, I want to find a new diet, optimize my schedule, and I want an agent to do all of that for me.
And I'm actually seeing a lot of different small agents, or early agents, being built in this direction, and I would love to talk about this again next [00:43:00] year.
Trends in AI Models and Hardware
Atin: One question for you, Chip. There are some trends that we're seeing on various fronts. The cost of intelligence is coming down:
what a three-digit-billion-parameter model could do, today a single-digit-billion-parameter model can do. The cost of hosting these models is also sort of linearly decreasing, and it's becoming easier to deploy them on-prem; literally, you can download a Llama 3.1
and host it on AWS. All these trends have emerged in 2024. Given these downward trends, making it easier and easier to use a lot of open-source models and host them on more commoditized GPUs, what are your thoughts on the trends going into 2025? And what are you most excited about?
Chip Huyen: So I'm definitely very excited about open-source, smaller models. We're seeing this trend of [00:44:00] models of the same size getting stronger and stronger. If you read the Llama papers, you can see that the Llama 3
models are way, way better than not just the first generation but models ten times bigger, so I'm very excited. So we have very strong, powerful small models, and at the same time NVIDIA, and also a lot of competitors, are doing incredible work in making very powerful chips. So I hope NVIDIA
invests more into consumer chips. For people to be able to host models locally, we do need more powerful consumer hardware. But if NVIDIA is not doing it, I'm sure there are going to be a lot of startups already doing it.
So I do see a lot of on-device and on-prem use being very, very useful, very interesting. I think this trend will be pretty exciting. And then agentic use cases: Vivienne already mentioned some great use cases. I also see, for example, automated sales, automating marketing, market research, coding, [00:45:00] personal assistants.
A lot of them are very, very fun. Another trend, I think, could be a wider range of applications, and maybe new agentic patterns. Vivienne mentioned a few use cases of human in the loop. One thing I'm very excited about is agents where, instead of asking the agent to make the full strategy and plan to solve the task by itself, human domain experts come up with a scaffold of how to solve the task,
and then give the agents tools. So the agents just need to follow the scaffold instead of coming up with the plan from A to Z by themselves. It's basically just super exciting right now.
Atin: Oh, totally. I couldn't agree more.
Maybe a similar question for you, Vivienne, as we end this.
Future of AI in Enterprises
Atin: So you're kind of in the eye of the storm, right? You're on the GenAI teams at NVIDIA, and that's kind of the epicenter of a lot of this innovation. What do you see coming in 2025?
Vivienne Zhang: First of all, I cannot promise the consumer-grade GPU you just requested, but I will [00:46:00] relay this feature request. But yeah, we totally see that. We now have a category of models called Minitron, you know, the little brothers and sisters of Nemotron, and the amount of work they can do is also just really astonishing.
In terms of what I'm really excited about, I think one thing I would love to see more enterprises do is build this hype around new versions of their applications, just like how a lot of B2C chatbots have been doing. With consumer-grade chatbots like Perplexity, Gemini, or GPT-4, we're all very excited whenever they have a new drop of the model.
I think the same reps of collecting user feedback and usage data, iterating, improving, evaluating, and releasing should also be applied to a lot of enterprise use [00:47:00] cases. And I'd love to see Galileo's partners say, oh, we have a new application version coming, and let's all be wowed by its new performance compared to its former version.
And I think we can keep climbing the hill.
Atin: Absolutely, I will certainly be doing that. Amazing.
Conclusion and Final Thoughts
Atin: I mean, this has been a great session. I learned a lot from both of you, Chip as well as Vivienne. Thank you both for joining me. Thank you. Thanks so much.
Conor: That's it for this week. Thank you so much for listening and make sure that you've subscribed to Chain of Thought wherever you get your podcasts. And if you enjoyed this episode, if you're finding the show valuable, please share it with a friend. We're trying to get the word out about the show and it means so much to us when we see our listeners sharing the podcast on social media or when we hear that someone came in and listened to us because they found it from a friend.
Thanks so much for listening and we'll see you next week. [00:48:00]