Chain of Thought | AI Agents, Infrastructure & Engineering

As we enter the era of the AI engineer, the biggest challenge isn't technical - it's a shift in mindset. Hamel Husain, a leading AI consultant and luminary in the eval space, joins the podcast to explore the skills and processes needed to build reliable AI. Hamel explains why many teams relying on vanity dashboards and a "buffet of metrics" experience a false sense of security, which is no substitute for customized evals tailored to domain-specific risks. The solution? A disciplined process of error analysis, grounded in manually looking at data to identify real-world failures This discussion is an essential guide to building the continuous learning loops and "experimentation mindset" required to take AI products from prototype to production with confidence. Listen to learn the playbook for building AI reliability, and derive qualitative insights from log data to build customized quantitative guardrails. Follow the hostsFollow⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ Atin⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Follow⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ Conor⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Follow⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ Vikram⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Follow⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ ⁠⁠⁠⁠Yash⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Follow Today's Guest(s)Connect with Hamel on LinkedInFollow Hamel on X/TwitterCheck out his blog: hamel.devCheck out Galileo⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Try Galileo⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Agent Leaderboard

Show Notes

As we enter the era of the AI engineer, the biggest challenge isn't technical - it's a shift in mindset. Hamel Husain, a leading AI consultant and luminary in the eval space, joins the podcast to explore the skills and processes needed to build reliable AI. 

Hamel explains why many teams relying on vanity dashboards and a "buffet of metrics" experience a false sense of security, which is no substitute for customized evals tailored to domain-specific risks. The solution? A disciplined process of error analysis, grounded in manually looking at data to identify real-world failures 

This discussion is an essential guide to building the continuous learning loops and "experimentation mindset" required to take AI products from prototype to production with confidence. Listen to learn the playbook for building AI reliability, and derive qualitative insights from log data to build customized quantitative guardrails.


Follow the hosts

Follow⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ Atin⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

Follow⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ Conor⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

Follow⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ Vikram⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

Follow⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ ⁠⁠⁠⁠Yash⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠


Follow Today's Guest(s)

Connect with Hamel on LinkedIn

Follow Hamel on X/Twitter

Check out his blog: hamel.dev


Check out Galileo

⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Try Galileo⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Agent Leaderboard

What is Chain of Thought | AI Agents, Infrastructure & Engineering?

AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead.

Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes weekly.

Conor Bronsdon is an angel investor in AI and dev tools, Technical Ecosystem Lead at Modular, and previously led growth at AI startups Galileo and LinearB.

Disclaimer: All views, opinions and statements expressed on this account are solely my own and are made in my personal capacity. They do not reflect, and should not be construed as reflecting, the views, positions, or policies of Modular. This account is not affiliated with, authorized by, or endorsed by Modular in any way.

[0:00] Hamel Husain:
These generic metrics are most of the time not helpful at all. They're way too generic. They don't necessarily correlate with actual failures in your AI product, and they don't actually mean anything. And it's it can be extremely destructive because, like, you know, people get seduced into this. Like, okay. I could just plug in this dashboard to my system and get this dashboard of metrics and can tell me, like, how I'm doing. And then you kind of have this illusion of, oh, like, I checked the box. I'm doing evals, and I am monitoring my system. In reality, like, you're not monitoring anything. Just wasted a whole bunch of time.

[0:35] Conor Bronsdon:
Welcome back to Chain of Thought, everyone. I am your host, Conor Bronson, and joining me is my cohost, Atin Drea Sanyal, co founder and CTO of Galileo. Atin, as always, great to have you behind the mic with me. Always great to be here. Yeah. I'm excited for this conversation because it's gonna be on a few topics you and I are particularly passionate about, and which I hate to tell our audience, but they've probably heard us opine about a few times because we have a special guest joining us who's been at the forefront of applied AI, helping over 30 companies navigate the complexities of building and productionizing their products. Hamel Hussein is an independent AI consultant, a luminary in the eval space,

[1:13] Conor Bronsdon:
and has worked with innovative companies such as Airbnb and GitHub, which included early LLM research, used by OpenAI for code understanding. Hamel, great to see you. Welcome to the show. Thank you for having me. Yeah. It's absolutely a pleasure because we've chatted a couple times about product philosophy, how to approach AI products today, and you're obviously well known for

[1:37] Conor Bronsdon:
your blogs, your courses, and a particular favorite of mine is your field guide to rapidly improving AI products. You get right to the core of what actually makes AI products successful in the real world, And that feels like the perfect place for us to start our conversation. You open that guide with this concept of a tools trap that many companies are falling into. Can you start by giving our audience a bit of an explanation of this idea and why so many smart AI teams are falling into this trap. Yeah.

[2:09] Hamel Husain:
So a lot of times when people think about evals or measuring, you know, the reliability or performance of their LMs in terms of is it doing the right thing for the user, the first thing that a lot of people's minds go to is, like, okay. What tools can I use? Can I just, like, abstract this entire thing away to some tools? Can I just buy tool like, you know, can I make it not my problem?

[2:35] Hamel Husain:
You know? Is there some abstraction or something that I can use to, like, have it so that I don't have to worry about the accuracy of my AI product or the performance of it or if it's doing the right thing. Like, maybe something can just figure it out for me. And, you know, until we get to like AGI or something like that, I don't think that's possible. But, you know, it is where people's mind goes goes towards.

[3:05] Hamel Husain:
And so I think the number one question I get is, hey. What are the tools? And that's the wrong question. The question should be like, hey. What's the right process? Like, how do you how do you evaluate AI? You know, getting into tools later, but what is the right process to go through? Because no matter what tool you use, you have to go through a certain process

[3:31] Hamel Husain:
to evaluate AI correctly.

[3:33] Conor Bronsdon:
Yeah, and, Otten, I know you have a ton of thoughts about that process, and both of you, I've seen, discuss this idea of generic metrics not being enough for many companies. You know, fancy dashboards being a panacea and, you know, a Band Aid solution, not something that actually solves everyone's problems. And this idea in your guide, as you put it, Hamill, creating of a false sense of measurement and progress, or as you described it, that, Atin,

[4:03] Conor Bronsdon:
you know, AI has a measurement problem. Hamill, could you give us an example of how you think vanity metrics have led teams astray? Yeah. So it's really

[4:12] Hamel Husain:
tempting if you're building AI tools, you know, Autin probably can provide more color around this, but I've seen this with other vendors, not not trying to pick on Galileo or really anyone, to be honest, is when you go into a pitch meeting around, hey, like, we can help you with your evals, you know, people wanna see a solution. And they want, you know, it's easy to kind of present a dashboard. It's convincing to some extent, if you don't know any better, to present a dashboard with all a bunch of generic metrics, hallucination score, toxicity score,

[4:51] Hamel Husain:
conciseness score, you name it. It just so happens that these generic metrics are most of the time not helpful at all. You know, they're they're way too generic. They don't necessarily correlate with actual failures in your AI product, and they don't actually mean anything. And it's it can be extremely destructive because, like, you know, people get kinda seduced into this. Like, okay. I could just again, going back to the tools

[5:22] Hamel Husain:
discussion, I can plug in this dashboard to my system. I can get this dashboard of metrics. It can tell me, like, how I'm doing. And then you kind of have this illusion of, oh, like, I checked the box of doing evals, and I am monitoring my system. But in reality, like, you're not monitoring anything. You just wasted a whole bunch of time. You don't really know what your failures are and, like, what the most important things you should be focusing on.

[5:46] Hamel Husain:
And so it just creates a lot of churn, and I think people are getting a lot better with recognizing that now. I think, especially six months ago, it was the cause of almost all of my consulting business, with people that were confused and hit kind of a a roadblock or a wall in terms of, okay, we plugged in this tool. We got this generic dashboard thing, and we don't really know what to do now. No. I I absolutely second what Hamel is saying. In fact, I would go to the extent of saying that this

[6:22] Speaker:
generic metrics problem has existed in machine learning even before GenAI. Even erstwhile ML workflows, you usually have a held out test set, measure F1 scores, and just say the model is good or bad based on that score. Those approaches as well are akin to generic approaches where they treat all kinds of errors the same way. In some situations they're necessary but they're in no way sufficient.

[6:51] Speaker:
And these were also some of the realizations that personally had as well before even starting the company when we built Michelangelo at Uber, there was no one stop metric that would be the panacea for your problems, and the same patterns are emerging again. I'm curious to ask you, Hamel, what kind of patterns you're seeing, but basically, just to take an example,

[7:18] Speaker:
with agents, customized architectures are kind of the way to go. You can build agentic architectures in a million different ways, and customized architectures need customized personalized evals, which also need to evolve as your application grows and evolves and meets the new new kinds of data. So one good question to ask, think, for practitioner, for a developer is rather than, oh, what metrics do I need from a buffet of metrics,

[7:49] Speaker:
rather, what are the pains or potential risks in the workflows of my app? Let's list them down, and then author evals which are customized to those panes, and then constantly monitor those panes because those panes will also evolve, panes as in potential risks and pitfalls in your application, and then accordingly update the set of evals that you're using based on those

[8:15] Speaker:
evolving pitfalls. But I'm curious to know if you've seen similar patterns, Mel. Yeah, so one of the things that's really important

[8:23] Hamel Husain:
to do with evals is to ground it in your failures. So how do you know what your failures are? And like, the thing that we harp on a lot and what we teach in evals and what I write in my blogs constantly is look at your data. But what does look at your data mean? Look at your data so what's behind look at your data is this process called error analysis. And error analysis is

[8:48] Hamel Husain:
has been around for a really long time, even before machine learning. So it's like been around in social sciences. I recently learned it. I thought, you know, the first time I was exposed to it was machine learning, of course. But it is a kind of this process where you go through and you look at data and you take notes about what is going wrong, and you then

[9:10] Hamel Husain:
use those notes and you kind of categorize them. You say, okay, like, what kinds of errors am I seeing? And you can do it starts very simple, like counting those categories and seeing like, okay, what types of errors are happening the most? And then you make a decision like what to prioritize from there. And it's a very powerful technique that most people don't do

[9:34] Hamel Husain:
because no one has taught them to do it, I think. And it's very simple. It's like the most simple kind of thing. We like, you know, We're talking about opening a trace viewer and then writing notes and going through a bunch of traces. And there's some, okay, the same questions always come up, how many traces should I look at? So on and so forth. And there's some useful heuristics. There's this concept from social sciences

[10:01] Hamel Husain:
called theoretical saturation, which just means like, hey, keep looking at traces until you're not learning anything new. So what we teach is, like, try to look at at least a 100 traces just as a heuristic to get people started because they have a lot of anxiety. If you just say theoretical saturation, they get they don't even begin. They just get scared of the whole process. But 100 is, like, concrete number people can know, like, have a goal.

[10:26] Hamel Husain:
And then, like, you know, after you begin, you don't really care about the 100. You're like, oh, I'm learning so much. I think that's the counterintuitive part of, like, going through individual data points and reading what is happening in a focused session provides immense value, and people don't know that until they do it. They're very surprised you know, at the amount of value that it provides. And so that that can inform all of your evals activity.

[10:58] Hamel Husain:
You know, it'll, like, it'll motivate everything, like what you should focus on, what you should write an eval for, etcetera. And it's not really like this error analysis can, like, kind of bucket it into this activity of evals, but it's not even evals. It's, like, just development. So I'll just

[11:19] Speaker:
stop there. No. That's super fascinating. I I actually kind of as you were talking, I'm drawing parallels to certain sort of opinions that we make on the Galileo platform itself, because we are an evals and observability platform, is this new notion of quantitative insights or metrics and qualitative, and the qualitative bit to me sounded very similar to the theoretical saturation workflow that you're describing, which is the error analysis process where it's less about numbers between zero and one measuring low and high, and it's more about

[11:57] Speaker:
more abstract. It's at a much more abstract level where are you achieving what you were set out to do? And along the way, what pitfalls or errors are you seeing? Something we do in Galileo is kind of drive the developer or the user to using what we call LogStream Insights, And LogStream insights are more qualitative insights on hoards of your data, like segments of your

[12:25] Speaker:
long running sessions, whether it's like a chatbot session or any kind of long running agent, we would analyze data in bulk and give you qualitative insights and then try to correlate them to potentially having you build some quantitative measures based on those qualitative insights. And hopefully the more qualitative insights you find, you reach that theoretical saturation

[12:52] Speaker:
that you're talking about. So I can draw a lot of parallels and it's very fascinating to hear kind of the theoretical sort of side of error analysis and the practice of it being much beyond AI and machine learning. I'm curious if the two of you think that part of the reason

[13:12] Conor Bronsdon:
this approach to error analysis hasn't really truly been popularized in current AI development circles is because we've seen this change in persona where most of the people who were doing machine learning work, like, there were engineers involved, but it's a lot of data scientists who have kind of more classically been trained on some of these error analysis techniques. Whereas software debugging

[13:37] Conor Bronsdon:
is a different approach often. And we're now seeing kind of the the marrying of these two approaches with engineers who are now becoming AI engineers and and working very differently and having to transform both how they think about the software they create from deterministic to nondeterministic, and also having to think about their approaches in different ways. Is that what's driving this kind of gap, you think, or is it something different? I think so. Yes. I mean, I I would say the first epoch or phase of AI engineering

[14:08] Hamel Husain:
was very much focused on, okay, like, we need to build stuff. We need to get go to zero to one really fast, and let's see what's possible in a rough sense. And, you know, now that you know, and so it was very much the narrative and, you know, also the truth. Like, you know, one of the most important skill to get started was software engineering, you know, in that. Like, you need to, you know, glue together a lot of things, use APIs,

[14:36] Hamel Husain:
you know, kind of full stack engineering, really important. And when it comes to, okay, like, how do you know that this stochastic system is reliable? That's a whole different skill set that takes time to learn. And there's a very large intersection between machine learning, data science, and the skills you need to do evals often. And the reason you know, I tried to actually

[15:11] Hamel Husain:
see how how much I could get get away with in terms of, like, teaching engineers evals without data science background or, you know, the requisite, let's say, background. And you do hit a limitation really fast. And, like, for example, you know, Srain and I are teaching a class on evals. We've taught over 700 students so far of all different kinds of backgrounds.

[15:38] Hamel Husain:
And, you know, like for example, when we get into building LLM as a judge, what we teach people is like, okay, one of the things that's important with LLM as a judge is that you can trust the LLM as a judge. And to trust LLM as a judge, you have to compare it to some human labels. There's things like the And questions always come up such as hey like why is it okay to sample

[16:08] Hamel Husain:
data how can you know and like we we show people, like, okay. If you wanna know how much noise there is in your judge, you can do stuff like boot strap bootstrap sampling. People don't understand that. They're like, why is it okay to, like, just continue, like, sample a whole bunch of times from a dataset to get the distribution? And so we we found that, like, we almost have to go back to classic statistics and see people that you. Which

[16:31] Hamel Husain:
is not super tractable, to be honest. Like, you know, not in the format of, okay. Let me teach you evals real quick. You can teach fundamentals, and you don't necessarily need all that stuff to get started, but you can you need, like, a fair amount of data literacy. That's one side of the equation. It's, like, statistics, but, also, it's all the analytical tools.

[16:55] Hamel Husain:
Right? So like how do you how do you like dig into data? You know, like let's say we're talking about traces earlier and like clustering those traces or navigating them or analyzing them. Like you want to be able to like really pick at data really fast and just do open ended exploratory analysis on it. And a lot of those data skills come into play again when it comes to, like, digging into a problem.

[17:21] Hamel Husain:
And so, like, you very quickly arrive at the very similar skill set of a machine learning engineer or a data scientist. You don't necessarily you don't need to be training models, but I would argue that you shouldn't be spending most of your time training models anyways. Like, you were looking at, you were doing a lot of error analysis and debugging and whatnot, so

[17:46] Hamel Husain:
that's my spicy hot take perhaps in this podcast.

[17:50] Speaker:
Don't think it's a hot take at all. I think it's a very, very legit take on just the distinction between software engineers and data scientists, and answering that key question. Like in the new world of sort of meshed roles and the AI engineer and what is, you know, mostly, like technical people are kind of undergoing this minor identity crisis. And the answer kind of lies in what you said, which is if you were to cherry pick one skill that's needed

[18:23] Speaker:
for the software engineer to become the AI engineer or, you know, to be efficient in the modern era is really just the skill of understanding data and knowing the difference between good and bad data or how to take bad data and step by step move it to good data and just data literacy is how you put it. Think that is the main skill because there's the other skills which are, you know, knowing the semantics of a decision tree,

[18:50] Speaker:
which is totally commoditized and you don't need to know. You don't even need to know how to train models or fine tune them. But to be able to understand this basic process of comparing an output with a pre generated ground truth, which is either human labeled or synthetic, but just knowing the goods and bads of the practices, that is what data literacy is, and if this skill is adopted

[19:17] Speaker:
by a tier A software engineer, I think they've set themselves up for the future.

[19:22] Hamel Husain:
Definitely, and there's lot of other related skills as well, like designing metrics, and the list goes on, how to tell stories with data, how to have a sense of when your metrics are leading you astray, all the way down to like having good product sense and having that be aligned with with metrics, you know, potentially doing AB tests, the whole suite of things is is important.

[19:49] Hamel Husain:
My friends and I joke that we might have a new, job title coming called AI scientist, but I try not to be the one who is coining you Wait wait a second. Job title. We're talking about AIPMs,

[20:02] Conor Bronsdon:
AI engineers, AI scientists now. Oh, man.

[20:06] Hamel Husain:
You know, there's always every time there's a technological shift of some kind, there is kind of this sort of gravitation towards the idea of a unicorn. So we saw it we saw it actually, like, many times. Like, you know, the most recent time we've seen it is, like, actually in data science itself, where, you know, initially at the outset of the data scientist,

[20:29] Hamel Husain:
we had the person that did everything, software engineering, statistics, DevOps, so on and so forth. I think, like, people realize there's a little bit too much service area, honestly, and then kinda split it into different kinda sub disciplines. Then we may be seeing that with AI engineer, if I were to predict.

[20:48] Conor Bronsdon:
Speaking of AI engineers, I know one of the recommendations that you've made, Hamel, has been that when teams are making AI investments, particularly when AI engineers are helping make their decisions here, it's really important just to have a customized way of viewing their data, not necessarily a complex dashboard, so that they can approach this debugging as error analysis the right way, so they can make decisions in the right way.

[21:14] Conor Bronsdon:
Because as I think as Austin and I have certainly experienced working with folks, it's very easy to overwhelm teams with too much data instead of enabling focus. Why do you think giving everyone an easy way to see what their AI system is doing

[21:30] Hamel Husain:
is more impactful than some of the sophisticated analytics that I think often we're trying to reach for? Yeah. So the guidance there is like, okay. There's a lot of tools out there that provide a good way to get started, like Galileo. Like, you know, you have a a way that you can, like, plug in your AI application and see your traces in a stream, and kind of

[21:53] Hamel Husain:
go through that. A lot a lot of times in, you know, in your applications, there are a lot of domain specific things going on. Like, might have widgets that you're rendering. Your application might be writing emails. You might have external data sources that you need to reference to evaluate particular trace. You might want to have you might want to view the trace in the exact way the user is is seeing it,

[22:27] Hamel Husain:
For example, you might have things in your trace that by default are usually not helpful, but that take up a lot of space in terms of tokens. All kinds of, like, little nuances. So what you want to do is really dial in the data viewing experience so that you can do this error analysis and, like, review lots of data really fast in a way that is very customized to you, that is very contextualized

[22:55] Hamel Husain:
to how you want to see data, all the data you need to see in one place, rendered in exactly the right way. And so the reason that's my that's our advice is just because of AI. Because, like, AI, you can vibe code. So, you know, AI is really good at producing simple applications that can render data and, like, you know, have simple yeah. Simple web applications like render data where you have, like, input fields and stuff like that.

[23:26] Hamel Husain:
That's something that is probably, you know, below the bar where AI can clear clear those tasks very well. And so because of that reality, we recommend that people in a lot of cases like create their own data annotation apps because there's just way too much value to be had relative to the cost of doing so. Isn't that the case like 100% of the time, but it's the case like a lot of times.

[23:55] Conor Bronsdon:
Atsin, I know a big part of our recent product philosophy at Galleo has been to give people more simplified views, whether it's, you know, the graph view or timeline view, which we've kind of designed with the idea of like, okay, let's give them other options to debug agents in particular as we look at these kind of more complex systems, as well as other views that, you know, may or may not be live by the time that this podcast launches.

[24:19] Conor Bronsdon:
And I know this is something you're thinking about a lot this is something you're thinking a lot about too, because as I alluded to, we've kind of had conversations together with AI engineers, I think, just like Hamill has, who are going, hey, I need help focusing here. I'm not really sure what to look at necessarily. I'm not sure where to spend my time in that error analysis.

[24:38] Conor Bronsdon:
What's your philosophy on how to approach this, I guess, observability and focus layer that Hamill is talking about? Yeah. I think beyond the

[24:49] Speaker:
graph views, which is a feature that we offer, features like graph views kind of tend to point to the broader philosophy of giving the right abstractions to the user to be able to kind of do the segmented root causing of these ever growing sophistication in systems which have evolved from simple REG to agentic REG to multi agent. You want to give the users the right abstractions so that they can shine the torch in the right areas,

[25:24] Speaker:
and that's where views like the graph view, session views, interaction views, these come in to be able to give the tools to the user to just be able to root cause effectively. And what that means what that entails is you run your application end to end, and each request may sort of touch certain parts of your application and light up the nodes there. And each request will run through a different

[25:54] Speaker:
sort of path in in in your application, which you can visualize as a dagger or workflow. The first step is to be able to spot the anomaly where kind of the ground level customization on the metrics, as well as the qualitative insights come in, but then these right abstractions and the right views to be able to make sense of, yeah, what's going on. And then there's the data that's associated with, because all this is really is just data flowing through a bunch of nodes and edges.

[26:28] Speaker:
So once you spot the anomaly, you want to look at the data and what, you know, went wrong with that. So simplifying the views around the data is kind of the next step from there.

[26:39] Hamel Husain:
And just to be clear, like, what I described does not add odds with these things and tools. They're just, like, supplementary. Like, I I also always want a trace viewer like the ones in Galileo because this can be, like, a lot faster to search through that and just look at that without you know, sometimes I'm looking sometimes I'm looking for something that maybe by accident wasn't in the annotation tool or something else. So it it is really useful. And also, like, a lot of these platforms, like Galileo, have APIs where you can connect your annotation tool to,

[27:13] Hamel Husain:
and, you know, write data back and forth to it.

[27:16] Conor Bronsdon:
So, you know, that's just something to think about. Yeah. And I think we all agree that's a a great best practice, is to leverage the APIs of whatever evaluation tool you're leveraging. Obviously, we we hope that's GALLAYO. But whichever eval tool you're using, like, using that API to bring that data into other places where you can look at it look at it in different ways,

[27:37] Conor Bronsdon:
and kind of consume that information and highlight it to business users, think is a fantastic thing to do. And, Hamel, I know you've talked about this idea of empowering domain experts who may not be in an eval product every day to add their insights and help improve these nondeterministic systems. How do you think about, you know, writing and iterating on prompts

[28:02] Conor Bronsdon:
with domain experts versus with engineers?

[28:06] Hamel Husain:
Yeah. So one of the biggest failure modes I see, and is also one of the biggest drivers of my consulting business, is people outsourcing evaluations to developers, which is fine if you're building a developer tool where the developer is a domain expert, but usually they're not. And the symptom there are the root cause of people outsourcing eval developers because they're thinking of AI like software engineering. They're like, oh, AI

[28:39] Hamel Husain:
development is a software engineering task. The, you know, the moment you'd say anything about AI development process, they're like, this is outsourced to developers. That turns out that always goes really badly because, yeah, like, you're you're only guessing, you know, and the developers don't have enough context, so you want to involve the domain expert. It's like, you know, if you're

[29:05] Hamel Husain:
working building something for lawyers, you want to involve the lawyer. You wanna involve the the legal expert at some point. And so, you know, when it comes time to doing things like iterating on prompts, you shouldn't have the prompt so removed from the domain expert. The whole point of LLMs is, like, humans can talk to computers. And so if you obfuscate everything so much that the domain expert can't talk to the computer, then you're

[29:31] Hamel Husain:
kind of burning the whole, you know, the value proposition of AI to begin with. Like, because you wanna direct, you know, line of communication between your domain expert and your in, like, what's going into the AI in terms of prompting. And so what I described in that blog post is a lot of like, a good pattern that I've seen work really well is if you have a user facing application,

[29:59] Hamel Husain:
you know, have, like, an admin view where you expose the prompt and allow the person to change the prompt. Even if you don't want the user to change the prompt, you have the like, for your internal purposes, you have an admin view that allows the domain expert to change the prompt and and fiddle with it. It gives them, like, a more direct connection to what exactly is happening rather than, like, having conversate abstract conversations about AI, and it should do this and it should do that. It's really important that they get in there, and they are, like, experimenting.

[30:30] Conor Bronsdon:
Yeah. And I think it very much aligns to what Galileo has done with our continuous learning through human feedback feature, because we feel the same way. You need to leverage this domain expert feedback. You can't simply have it just be the engineers who may be, depending on your business, you know, divorced from the bare metal of what the product's doing. Like, hopefully they are are very aligned to that, but sometimes they have business users who are translating,

[30:55] Conor Bronsdon:
you know, key pieces of that for them, or domain experts who bring a lot of context. And I know it's part of why, especially when we're looking at custom metrics, but all of our metrics, we leverage, you know, feedback from SMEs, you know, whatever type they may may be, can go in and say, okay, like, let me get feedback on these 10 traces and say, hey, this this metric feels a little off, actually. This is pretty accurate, or, you know, here's a little contextual feedback,

[31:18] Conor Bronsdon:
and then use a judge to translate that and apply it and, you know, retune the metrics. It's something we're we're finding a lot of success with. But I I think there's a lot more opportunity to go deeper here, to your point. Like, it feels like too often, even in highly customized evaluation systems for enterprises, we are just scratching the surface of the human context that we can bring in.

[31:41] Conor Bronsdon:
I mean, it's it's a very common problem for many organizations that there is too much tribal knowledge that's not living in documentations, that's not necessarily making its way into systems. And to your point, it's so necessary that we bring that human knowledge into our AI systems because they perform best when they have the data they need. And it can be as simple as

[32:04] Conor Bronsdon:
friction between technical teams and understanding that domain experts have of some of the jargon of your AI systems. Like, you gave this great example in one of your pieces about translating REG to just making sure the model has the right context and really saying, hey, like, let's just put this in a term that anyone can understand, even if they're not deep in AI.

[32:25] Conor Bronsdon:
What's your advice for AI teams who are looking to bridge that gap and really bring their domain experts into the fold, so that they can be part of improving their AI systems and their AI data. Yeah. Let me, like, clarify

[32:40] Hamel Husain:
the last point with some, like, concrete failure modes, like, to look out for. Like, one is okay. There's a there's there's an aspect of, like, a prompt store or, like, a centralized place that you could put prompts, which is fine. But a lot of times what happens is folks don't build, like, properly enough around around that. They don't build, an experimentation

[33:04] Hamel Husain:
environment. And so, like, you have to change the prompt there and then, like, commit it and then wait and then, like, go somewhere else and, like, try something. And that's, like, way too much friction. So that is kind of you know, that prevents the domain expert from experimenting. A lot of tools have prompt playgrounds, which are great. It's a good place to get started. However, most pump playgrounds, they don't have access to your tools and your infrastructure and your application code. So they can't call

[33:33] Hamel Husain:
they can't perform rag, they can't call tools, they can't do all the things that your application is doing. And so, you know, you can't necessarily rely on that either. That's why you need this, like, integrated I forgot what I called it in the in the blog post thing. Called it, like, integrated prompting environment or something. Try to make up a name for Basically it's, you need to be able to play

[33:55] Hamel Husain:
with the prompt in your user facing application directly. Because that's the only pattern, at least that I've seen, that's worked reliably in terms of bringing the domain experts in. Yeah, I'll just add a couple of points here. First, of course the need for

[34:11] Speaker:
easy to, sort of easy to use human feedback is critical, And some of our, yeah, like to your point, Connor, some of our human feedback features, which go much beyond, you know, just offering like binary signals, thumbs up, thumbs down, and the ability to create your own sort of feedback, kinds of feedback becomes important. But to Hamil's other point about just

[34:40] Speaker:
managing the prompts and offering the subject matter experts the ability to tweak the prompts to interact with the app. I think engineering wise, the matter gets a little bit tricky, especially for more sophisticated applications like multi agents where things are not necessarily driven by one prompt, might have a series of prompts which are triggered one after the other, you don't have control over

[35:06] Speaker:
many of them, but more often than not, it is driven by a kind of a seed query, which is kind of the natural language interface to any GenAI app. So the engineering challenge kind of becomes how do you abstract the entire application and make it available in front of the user through a natural language interface, the user being the subject matter expert, not the developer,

[35:32] Speaker:
but being able to actually run the developer's app seamlessly, so that to the SME, it's all about, here's my input. I have pure knowledge about my input and the expected output, but all the machinery in the middle, you should be able to abstract out from me. So the trickiness kind of comes in the fact that, I guess, the challenges around how do you use our APIs and the SDKs and, of course, all the, you know, containerization

[36:01] Speaker:
technology to be able to kind of simulate version of the app, which may be a distributed app. It might be running on, you know, two different availability zones for that matter. It is just software. So I think that's where the the challenge comes, and we are kind of at a point where it's it's doable to simulate sort of singular monolithic applications and make this workflow available to the SME, but it gets challenging when the app

[36:28] Speaker:
itself becomes distributed, and that's where kind of the a lot of engineering

[36:34] Hamel Husain:
innovation is going. Yeah. It's really nontrivial. Like, have to be you have to think, like, you often can't, like, expose everything to the SME. You have to say, is there a high value thing I can expose? And you know, it just if anything else, it just helps give them intuition. So they don't think rag is a very abstract concept or prompt is even an abstract concept.

[36:58] Hamel Husain:
You know, you'll be surprised, like, how many people think prompt is an abstract concept because they say something in a meeting and the expectation is a developer's gonna write the prompt. That's the worst thing that can possibly happen. So whatever way possible you needed to get away from that. And so what I'd love to close the conversation with,

[37:16] Conor Bronsdon:
and Hamel, thank you again so much for joining us. It's been a distinct pleasure having you. It is just some advice. Like, what what would be your summation, your advice to a team that is looking to build their eval system, that is looking to

[37:31] Hamel Husain:
improve their AI product? What would you tell them? Yeah. So the biggest two kind of things that I can think of is like one, error analysis, also known as look at your data. It just it can yeah. It solves so many problems. I was like, maybe 90% of the whole evals process is is like looking at your data. Like, you find so much even before writing evals. Like, just find you'll just find so many bugs, so many things, opportunities for improvement, so on and so forth.

[38:05] Hamel Husain:
And then the next thing that I can think of that makes a huge difference is having the experimentation mindset. And this one you have to cultivate a little bit. There's some talks that I can point you to about how you might, you know, reframe your thinking. I mean, this is something that's innate to machine learning folks and data science folks is, like, you know, you don't have, like, this waterfall

[38:31] Hamel Husain:
chart of, like, how to build a machine learning system. Like, you have to you you have an idea of, different experiments you wanna try. You don't even know it's gonna work. But what you do have is a hypothesis of, like, hey. Like, this might work. This might not work. Let's try this. Let's look at this afterwards. And so you have to reorient a lot of things in order to do that. You have to kind of, you know, have a different language that you talk

[39:01] Hamel Husain:
about within your teams and sort of make sure that you're not don't have those rigid approaches when it comes to this. It's hard yeah. That's probably another podcast, but

[39:16] Conor Bronsdon:
those are my thoughts. I mean, we can definitely have you back for for another conversation, because I think there is so much more we can go into here. Ottin, how about you? Any closing thoughts from your side of the house? Yeah. I would say that, you know, erstwhile,

[39:28] Speaker:
before LLMs, AI was considered garbage in garbage out, and now with LLMs, AI has become software, so software three point zero is AI, and now software is garbage in garbage out. So to Hamil's point, do look at your data because of garbage in garbage out. And secondly, I would say that there's three specific things that I've learned as kind of the layers of AI reliability.

[39:56] Speaker:
The bottom most layer is the kind of the brass tacks, set up basic monitoring traceability. That's just stuff that we've solved before AI happened. Traditional observability is a partially solved problem, and there are certain things that are done well there, adopt those practices. The second layer of the three layers is set up your prompts and your metrics and consider them as your evaluation assets, they're your first class citizens,

[40:26] Speaker:
they will evolve over time, have disciplined versioning, lineage around them, set up a good system there. And the third is the insights layer, which is the whole qualitative insights turned into customized quantitative insights. So if you practice these three things and kind of consider them the three pillars of your AI reliability, you've built a good three sixty evaluations and observability layer

[40:51] Hamel Husain:
in your software. And I'll add one more thing, is take my course. So it's a shameless plug, take the Evals course. It'd be a good way to learn about how to get set up with the Evals.

[41:03] Conor Bronsdon:
And I'll second that and say also check out Hamill on X and on LinkedIn, he shares a lot of fantastic content. We will certainly link both those, in the show notes, yeah, Hamel's blog as well is is a great place to to go learn. Hamel, thank you so much for joining us on the show. It's been a pleasure. Yeah. Thank you. And to our listeners, if you want more fantastic content,

[41:28] Conor Bronsdon:
from Hamil and many other thought leaders, make sure you subscribe to the podcast because we share information from industry experts, perspectives from AI luminaries, and hot takes, plus much more both in the podcasting app of your choice and on YouTube. So whether you wanna watch the conversation, listen in, or check out any of our other content here from Galileo, you can find us all over the Internet. We appreciate your support, and Hamel Ottin, thank you again for joining me today. Thank you. Thank you.