Chain of Thought | AI Agents, Infrastructure & Engineering | Mindset Over Metrics: How to Approach AI Engineering

As we enter the era of the AI engineer, the biggest challenge isn't technical - it's a shift in mindset. Hamel Husain, a leading AI consultant and luminary in the eval space, joins the podcast to explore the skills and processes needed to build reliable AI. Hamel explains why many teams relying on vanity dashboards and a "buffet of metrics" experience a false sense of security, which is no substitute for customized evals tailored to domain-specific risks. The solution? A disciplined process of error analysis, grounded in manually looking at data to identify real-world failures. This discussion is an essential guide to building the continuous learning loops and "experimentation mindset" required to take AI products from prototype to production with confidence. Listen to learn the playbook for building AI reliability, and derive qualitative insights from log data to build customized quantitative guardrails.

Show Notes

Follow the hosts

Follow Atin

Follow Conor

Follow Vikram

Follow Yash

Follow Today's Guest(s)

Connect with Hamel on LinkedIn

Follow Hamel on X/Twitter

Check out his blog: hamel.dev

Check out Galileo

Try Galileo

Agent Leaderboard

What is Chain of Thought | AI Agents, Infrastructure & Engineering?

AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead.

Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes bi-weekly.

Conor Bronsdon is an angel investor in AI and dev tools, Head of Technical Ecosystem at Modular, and previously led growth at AI startups Galileo and LinearB.

00:00:00:00 - 00:00:18:03
Unknown
These generic metrics are most of the time not helpful at all. They're way too generic. They don't necessarily correlate with actual failures in your AI product, and they don't actually mean anything. And it can be extremely destructive because, like, you know, people get seduced into this, like, okay, I could just plug in this dashboard to my system.

00:00:18:03 - 00:00:35:07
Unknown
I can get this dashboard of metrics and it can tell me, like, how I'm doing. And then you kind of have this illusion of, oh, like, I check the box, I'm doing evals and I am monitoring my system. In reality, like, you're not monitoring anything, you just wasted a whole bunch of time.

00:00:35:10 - 00:00:56:21
Unknown
Welcome back to Chain of Thought, everyone. I'm your host, Conor Bronsdon. And joining me is my co-host, Atindriyo Sanyal, co-founder and CTO of Galileo. Atin, as always, great to have you behind the mic with me. Always great to be here. Yeah, I'm excited for this conversation because it's going to be on a few topics you and I are particularly passionate about and which, I hate to tell our audience,

00:00:56:21 - 00:01:19:11
Unknown
they've probably heard us opine about a few times, because we have a special guest joining us who's been at the forefront of applied AI, helping over 30 companies navigate the complexities of building and productionizing their products. Hamel Husain is an independent AI consultant, a luminary in the eval space, and has worked with innovative companies such as Airbnb and GitHub, which included early LLM research

00:01:19:13 - 00:01:45:06
Unknown
used by OpenAI for code understanding. Hamel, great to see you. Welcome to the show. Thank you for having me. Yeah, it's absolutely a pleasure because we've chatted a couple times about product philosophy, how to approach AI products today. And you're obviously well known for your blogs, your courses. And a particular favorite of mine is your field guide to rapidly improving AI products.

00:01:45:08 - 00:02:07:08
Unknown
You get right to the core of what actually makes AI products successful in the real world, and that feels like the perfect place for us to start our conversation. You open that guide with this concept of a tools trap that many companies are falling into. Can you start by giving our audience a bit of an explanation of this idea, and why so many smart AI teams are falling into this trap?

00:02:07:12 - 00:02:30:19
Unknown
Yeah, so a lot of times when people think about evals or measuring, you know, the reliability or performance of their LLMs in terms of, is it doing the right thing for the user? The first thing that a lot of people's minds go to is like, okay, what tools can I use? Can I just, like, abstract this entire thing away to some tools?

00:02:30:21 - 00:02:55:12
Unknown
Can I just buy a tool, like, can I make it not my problem, you know? Is there some abstraction or something that I can use to, like, have it so that I don't have to worry about the accuracy of my AI product or the performance of it, or if it's doing the right thing. Like maybe something can just figure it out for me.

00:02:55:15 - 00:03:13:17
Unknown
And, you know, until we get to, like, AGI or something like that, I don't think that's possible. But, you know, this is where people's minds go. And so I think the number one question I get is, hey, what are the tools? And that's the wrong question. The question should be like, hey, what's the right process?

00:03:13:17 - 00:03:54:17
Unknown
And how do you evaluate AI? You know, getting into tools later, but what is the right process to go through? Because no matter what tool you use, you have to go through a certain process to evaluate AI correctly. Yeah. And Atin, I know you have a ton of thoughts about that process. And both of you I've seen discuss this idea of generic metrics not being enough for many companies, you know, fancy dashboards being a panacea and, you know, a Band-Aid solution, not something that actually solves everyone's problems.

00:03:54:19 - 00:04:18:21
Unknown
And this idea in your guide, as you put it, Hamel, of creating a false sense of measurement and progress, or as you've described it, that often, you know, AI has a measurement problem. Hamel, could you give us an example of how you think vanity metrics have led teams astray? Yeah. So it's really tempting if you're building AI tools and, you know, Atin probably can provide more color around this.

00:04:18:21 - 00:04:44:03
Unknown
But I've seen this with other vendors. Not trying to pick on Galileo, or really anyone, to be honest. When you go into a pitch meeting around, hey, like, we can help you with your evals, you know, people want to see a solution, and, you know, it's easy to kind of present a dashboard. It is convincing to some extent,

00:04:44:03 - 00:05:09:27
Unknown
if you don't know any better, to present a dashboard with a bunch of generic metrics: hallucination score, toxicity score, conciseness score, you name it. It just so happens that these generic metrics are most of the time not helpful at all. You know, they're way too generic. They don't necessarily correlate with actual failures in your AI product.

00:05:09:29 - 00:05:28:19
Unknown
And they don't actually mean anything. And it can be extremely destructive because, like, you know, people get kind of seduced into this, like, okay, I could just, again, going back to the tools discussion, I can plug in this dashboard to my system. I can get this dashboard of metrics and it can tell me, like, how I'm doing.

00:05:28:22 - 00:05:49:23
Unknown
And then you kind of have this illusion of, oh, like I check the box, I'm doing evals and I am monitoring my system. But in reality, like, you're not monitoring anything, you just wasted a whole bunch of time. You don't really know what your failures are and like, what the most important things you should be focusing on. And so, it just creates a lot of churn.

00:05:49:26 - 00:06:15:07
Unknown
And, you know, I think people are getting a lot better with recognizing that now. But I think, you know, especially six months ago, it was the cause of almost all of my consulting business with people that were confused and hit kind of a roadblock or a wall in terms of, okay, we plugged in this tool, we got this generic dashboard thing, and we don't really know what to do now.

00:06:15:13 - 00:06:39:13
Unknown
No, I absolutely second what Hamel is saying. In fact, I would go to the extent of saying that this generic metrics problem has existed in machine learning even before AI, you know, even in erstwhile ML workflows. You usually have a held-out test set, measure F1 scores, and, you know, just say the model is good or bad based on that score.

00:06:39:15 - 00:07:05:23
Unknown
Those approaches as well are akin to generic approaches where they treat all kinds of errors the same way. In some situations they're necessary, but they're in no way sufficient. And these were also some of the realizations that I personally had as well, before even starting the company, when we built Michelangelo at Uber. There was no one-stop metric that would be the pinnacle for your problems.
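
To make the point about aggregate scores concrete, here is a small Python sketch. The labels, predictions, and segment names are invented for illustration and do not come from the conversation. Two hypothetical models get an identical F1 score on a held-out set, yet one misses a positive case in a routine segment and the other misses one in a high-risk segment, which is the kind of difference a single score treats as the same error.

```python
from collections import Counter

def f1_score(y_true, y_pred):
    """Binary F1 from true positives, false positives, and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)

def missed_positives_by_segment(y_true, y_pred, segments):
    """Count false negatives per segment, so errors are no longer pooled."""
    return Counter(
        s for t, p, s in zip(y_true, y_pred, segments) if t == 1 and p == 0
    )

# Hypothetical held-out set: four routine cases, four high-risk cases.
segments = ["routine"] * 4 + ["high_risk"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
model_a = [0, 1, 0, 0, 1, 1, 0, 1]  # misses a routine positive
model_b = [1, 1, 0, 1, 0, 1, 0, 0]  # misses a high-risk positive

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    misses = missed_positives_by_segment(y_true, preds, segments)
    print(name, f1_score(y_true, preds), dict(misses))
```

The per-segment breakdown is only one possible way to split errors; in practice the split would come from the failure categories found while reviewing real data.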

00:07:05:26 - 00:07:38:05
Unknown
And the same patterns are emerging again. I'm curious to ask you, Hamel, what kind of patterns you're seeing. But basically, just to take an example with agents, customized architectures are kind of the way to go. You can build agentic architectures in a million different ways, and customized architectures need customized, personalized evals, which also need to evolve as your application grows and evolves and meets new kinds of data.

00:07:38:08 - 00:08:08:21
Unknown
So one good question to ask, I think, for a practitioner, for a developer, is rather than, oh, what metrics do I need from a buffet of metrics? Rather, what are the pains or potential risks in the workflows of my app? Let's list them down and then author evals which are customized to those pains, and then constantly monitor those pains, because those pains will also evolve. Pains as in potential risks and pitfalls in your application.

00:08:08:24 - 00:08:37:18
Unknown
And then accordingly update the set of evals that you're using based on those evolving pitfalls. But I'm curious to know if you've seen similar patterns, Hamel. Yeah. So one of the things that is really important to do with evals is to ground it in your failures. So how do you know what your failures are? And, like, the thing that we harp on a lot, and what we teach in evals, and what I write in my blogs constantly, is look at your data.

00:08:37:18 - 00:08:58:05
Unknown
But what does "look at your data" mean? So what's behind "look at your data" is this process called error analysis. And error analysis has been around for a really long time, even before machine learning. It's sort of, like, been around in social sciences, I recently learned. I thought, you know, the first time I was exposed to it was machine learning,

00:08:58:05 - 00:09:24:07
Unknown
of course. But it is kind of this process where you go through and you look at data and you take notes about what is going wrong, and you then use those notes and you kind of categorize them and you say, okay, like, what kinds of errors am I seeing? And you can do it. It starts very simple, like counting those categories and seeing, like, okay, what types of errors are happening the most.
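
The counting step described here can be sketched in a few lines of Python. Everything in the example is hypothetical: the trace IDs, the notes, and the category names are made up to show the shape of the data, and a real review would produce its own categories.

```python
from collections import Counter

# Hypothetical review output: one free-form note per trace, plus the
# category assigned to that note afterwards (None means no error found).
annotated_traces = [
    {"trace_id": "t1", "note": "quoted a price that is not in the catalog", "category": "made-up facts"},
    {"trace_id": "t2", "note": "ignored the requested date", "category": "missed constraint"},
    {"trace_id": "t3", "note": "invented a refund policy", "category": "made-up facts"},
    {"trace_id": "t4", "note": "answer was fine", "category": None},
    {"trace_id": "t5", "note": "booked the wrong time zone", "category": "missed constraint"},
    {"trace_id": "t6", "note": "cited a nonexistent order number", "category": "made-up facts"},
]

def count_error_categories(traces):
    """Count how often each error category appears, skipping clean traces."""
    return Counter(t["category"] for t in traces if t["category"] is not None)

counts = count_error_categories(annotated_traces)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

Sorting by frequency gives a first, rough answer to which type of error is happening the most; deciding what to prioritize from there is still a judgment call.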

00:09:24:10 - 00:09:54:28
Unknown
And then you make a decision, like, what to prioritize from there. And it's a very powerful technique that most people don't do, because no one has taught them to do it. I think it's, like, the most simple kind of thing. Like, you know, we're talking about, like, opening a trace viewer and then, like, writing notes and going through a bunch of traces. And, like, you know, the same questions always come up, like, how many traces should I look at?

00:09:55:00 - 00:10:15:04
Unknown
So on and so forth. And there's some useful heuristics. There's this concept from social sciences called theoretical saturation, which means, like, hey, keep looking at traces until you're not learning anything new. So what we teach is, like, try to look at at least 100 traces, just as a heuristic to get people started, because they have a lot of anxiety.

00:10:15:04 - 00:10:43:02
Unknown
If you just say theoretical saturation, they don't even begin. They just get scared of the whole process. But 100 is, like, a concrete number, so people can, like, have a goal. And then, like, you know, after you begin, you don't really care about the 100. You're like, oh, I'm learning so much. I think that's the counterintuitive part, like, going through individual data points and reading what is happening, like, in a focused session provides immense value.
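
One way to turn "keep looking until you're not learning anything new" into something checkable is sketched below in Python. This is an assumption-laden illustration, not a method described in the episode: the `patience` parameter (how many consecutive traces without a new category count as saturation) and the sample data are invented here.

```python
def traces_until_saturation(categories_per_trace, patience=20):
    """Return how many traces were reviewed before `patience` consecutive
    traces surfaced no new error category, or None if that never happened."""
    seen = set()
    since_new = 0
    for reviewed, categories in enumerate(categories_per_trace, start=1):
        new = set(categories) - seen
        if new:
            seen |= new
            since_new = 0
        else:
            since_new += 1
        if since_new >= patience:
            return reviewed
    return None

# Hypothetical review log: the error categories noted on each trace, in order.
review_log = [["made-up facts"], ["missed constraint"], [], ["made-up facts"], [], []]

print(traces_until_saturation(review_log, patience=3))
print(traces_until_saturation(review_log, patience=10))
```

A small `patience` stops early and risks missing rare failures, which is one reason a concrete floor such as the 100 traces mentioned above can be a useful starting goal.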

00:10:43:04 - 00:11:04:24
Unknown
And people don't know that until they do it, and they're very surprised at, you know, the amount of value that it provides. And so that can inform all of your evals activity. You know, it'll motivate everything. Like, what you should focus on, what you should write an eval for, etc.

00:11:04:26 - 00:11:39:27
Unknown
And it's not really like this error analysis is, you know, like, kind of bucketed in this activity of evals. It's not even, you know, it's like just development. So I'll just stop there. No, that's super fascinating. I tell you, kind of as you were talking, I'm drawing parallels to certain sort of opinions that we take on the Galileo platform itself, because we're an evals and observability platform: this new notion of quantitative insights, or metrics, and qualitative insights.

00:11:40:00 - 00:12:07:07
Unknown
And the qualitative bit to me sounded very similar to the theoretical saturation workflow that you're describing, which is the error analysis process, where it's less about numbers between 0 and 1, measuring low and high, and it's at a much more abstract level: are you achieving what you set out to do, and along the way, what pitfalls or errors are you seeing?

00:12:07:10 - 00:12:43:19
Unknown
Something we do in Galileo is kind of drive the developer or the user to using what we call log stream insights. And log stream insights are more qualitative insights on hordes of your data, like segments of your, you know, long-running sessions, whether it's like a chatbot session or any kind of long-running agent. We would analyze data in bulk and give you qualitative insights and then try to correlate them to potentially having you build some quantitative measures based on those qualitative insights.

00:12:43:21 - 00:13:30:07
Unknown
And hopefully, the more qualitative insights you find, you reach that theoretical saturation that you're talking about. So I can draw a lot of parallels. And it's very fascinating to hear kind of the theoretical sort of side of error analysis and the practice of it being much beyond AI and machine learning. I'm curious if the two of you think that part of the reason this approach to error analysis hasn't really, truly been popularized in current AI development circles is because we've seen this change in persona, where most of the people who were doing machine learning work, like, yes, there were engineers involved, but it's a lot of data scientists who have

00:13:30:07 - 00:13:56:11
Unknown
kind of more classically been trained on some of these error analysis techniques, whereas software debugging is a different approach often. And we're now seeing kind of the marrying of these two approaches with engineers who are now becoming AI engineers and working very differently and having to transform both how they think about the software they create from deterministic to non-deterministic, and also having to think about their approaches in different ways.

00:13:56:13 - 00:14:19:21
Unknown
Is that what's driving this kind of gap, you think, or is it something different? I think so, yes. I mean, I would say the first epoch or phase of AI engineering was very much focused on, okay, like, we need to build stuff, we need to go from 0 to 1 really fast, and let's see what's possible in a rough sense.

00:14:19:23 - 00:14:39:08
Unknown
And so it was very much the narrative and, you know, also the truth, like, you know, one of the most important skills to get started was software engineering, you know, in that, like, you need to, you know, glue together a lot of things, use APIs, you know, kind of full stack engineering, really important.

00:14:39:10 - 00:15:09:16
Unknown
And when it comes to, okay, like, how do you know that this stochastic system is reliable? That's a whole different skill set that takes time to learn. And, you know, there's a very large intersection between machine learning, data science, and the skills you need to do evals often. And, you know, I tried to actually

00:15:09:18 - 00:15:36:12
Unknown
see how much I could get away with in terms of, like, teaching engineers evals without a data science background or, you know, the requisite, let's say, background. And you do hit a limitation really fast. And like, for example, you know, teaching a class on evals, we've taught over 700 students so far, all different kinds of backgrounds.

00:15:36:15 - 00:16:07:20
Unknown
And, you know, like, for example, when we get into building LLM as a judge, what we teach people is like, okay, one of the things that is important with the LLM as a judge is that you can trust the LLM as a judge, and to trust the LLM as a judge, you have to compare it to some human labels. And the questions always come up, such as, hey, like, why is it okay to sample data?

00:16:07:22 - 00:16:23:25
Unknown
How can you know? And like, we show people, like, okay, if you want to know how much noise there is in your judge, you can do stuff like bootstrap sampling. People don't understand that. They're like, why is it okay to, like, just continuously sample a whole bunch of times from a data set to get the distribution?
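
For readers who have not seen bootstrap sampling, the idea mentioned here can be sketched in Python as follows. The human and judge labels are fabricated for the example, simple agreement rate is used as the statistic purely for illustration (other measures, such as true positive and true negative rates, may fit better), and the function names and defaults are assumptions of this sketch.

```python
import random

def agreement(pairs):
    """Fraction of examples where the judge label equals the human label."""
    return sum(human == judge for human, judge in pairs) / len(pairs)

def bootstrap_interval(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the agreement rate.

    Each resample draws len(pairs) examples with replacement, which mimics
    collecting a fresh labeled sample of the same size.
    """
    rng = random.Random(seed)
    stats = sorted(
        agreement(rng.choices(pairs, k=len(pairs))) for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical (human_label, judge_label) pairs for 100 traces.
pairs = (
    [(True, True)] * 40
    + [(False, False)] * 35
    + [(True, False)] * 15
    + [(False, True)] * 10
)

low, high = bootstrap_interval(pairs)
print(f"agreement={agreement(pairs):.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

The width of the interval is a rough indication of how much the measured agreement could move if a different set of traces had been labeled, which is the "noise" being asked about.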

00:16:23:27 - 00:16:48:25
Unknown
And so we found that, like, we almost have to go back to classic statistics and teach people that, which is not super tractable, to be honest. Like, you know, not in the format of, okay, let me teach you evals real quick. You can teach fundamentals and you don't necessarily need all that stuff to get started, but you need, like, a fair amount of data literacy.

00:16:48:28 - 00:17:12:11
Unknown
That's one side of the equation, is like statistics, but also it's all the analytical tools, right? So, like, how do you dig into data? You know, like, let's say we're talking about traces earlier and, like, clustering those traces or navigating them or analyzing them. Like, you want to be able to, like, really pick apart data really fast and just do open-ended exploratory analysis on it.

00:17:12:11 - 00:17:37:25
Unknown
And a lot of those data skills come into play again when it comes to, like, digging into a problem. And so, like, you very quickly arrive at a very similar skill set to a machine learning engineer or a data scientist. You don't necessarily need to be training models. But I would argue that you shouldn't always be training models anyways.

00:17:37:28 - 00:18:13:10
Unknown
Like, you know, you were doing a lot of error analysis and debugging and whatnot. So that's my spicy hot take, perhaps, in this podcast. I don't think it's a hot take at all. I think it's a very, very legit take on just the distinction between software engineers and data scientists, and answering that key question, like, in the new world of sort of meshed roles and the AI engineer and what is, you know, mostly like technical people are kind of undergoing this minor identity crisis.

00:18:13:12 - 00:18:38:27
Unknown
And the answer kind of lies in what you said, which is, if you were to cherry-pick one skill that's needed for the software engineer to become the AI engineer or, you know, to be efficient in the modern era, it is really just the skill of understanding data and knowing the difference between good and bad data, or how to take bad data and, step by step, move it to good data.

00:18:39:05 - 00:19:12:17
Unknown
And just data literacy is how you put it. I think that is the main skill, because there's the other skills which are, you know, knowing the semantics of a decision tree, which is totally commoditized. And you don't need to know, you don't even need to know how to train models or fine tune them. But to be able to understand this basic process of, comparing an output with, pre generated ground truth, which, which is either human labeled or synthetic, but just knowing the goods and bads of the practices, that is what data literacy is.

00:19:12:19 - 00:19:48:07
Unknown
And if this skill is adopted by a software engineer, I think they've set themselves up for the future. Definitely. And there's a lot of, like, related skills as well, like designing metrics, and the list goes on. Like, you know, how to, you know, tell stories with data, how to have a sense of, like, when your metrics are leading you astray, all the way down to, like, having good product sense and having that be aligned with metrics, you know, potentially doing A/B tests. The whole suite of things is important.

00:19:48:10 - 00:20:15:21
Unknown
My friends and I joked that we might have a new job title coming called AI scientist, but I try not to be the one who is coining new job titles. Wait a second, AI PMs, AI engineers, AI scientists now? Oh, man. You know, every time there's a technological shift of some kind, there is kind of this sort of gravitation towards the idea of a unicorn.

00:20:15:23 - 00:20:42:10
Unknown
So we saw it. We saw it actually, like, many times. Like, you know, the most recent time we've seen it is, like, actually in data science itself, where, you know, initially, at the outset of the data scientist, we had the person that did everything: software engineering, statistics, DevOps, so on and so forth. I think, like, people realized this is a little bit too much surface area, honestly, and then kind of split it up into different kind of sub-disciplines.

00:20:42:13 - 00:21:12:15
Unknown
And we may be seeing that with AI engineer, if I were to predict. Speaking of AI engineers, I know one of the recommendations that you've made, Hamel, has been that when teams are making AI investments, particularly when engineers are helping make their decisions here, it's really important just to have a customized way of viewing their data, not necessarily a complex dashboard, so that they can approach this debugging and error analysis in the right way, so they can make decisions in the right way.

00:21:12:17 - 00:21:34:09
Unknown
Because, as I have certainly experienced working with folks, it's very easy to overwhelm teams with too much data instead of enabling focus. Why do you think giving everyone an easy way to see what their AI system is doing is more impactful than some of the sophisticated analytics that I think often we're trying to reach for?

00:21:34:12 - 00:22:17:23
Unknown
Yeah, so the guidance there is, like, okay, there's a lot of tools out there that provide a good way to get started, like Galileo. Like, you know, you have a way that you can, like, plug in your AI application and see your traces in a stream and kind of go through them. A lot of times, you know, in your applications, there's a lot of domain-specific things going on. Like, you might have widgets that you're rendering, your application might be writing emails, you might have external data sources that you need to reference to evaluate a particular trace.

00:22:17:25 - 00:22:59:15
Unknown
You might want to view the trace in the exact way the user is seeing it. For example, you might have things in your trace that by default are usually not helpful, but that take up a lot of space in terms of tokens, all kinds of, like, little nuances. So what you want to do is really dial in the data viewing experience so that you can do this error analysis and, like, review lots of data really fast in a way that is very customized to you, that is very contextualized to how you want to see data, all the data you need to see in one place, rendered in

00:22:59:15 - 00:23:24:13
Unknown
exactly the right way. And so the reason that's our advice is because of AI, because, like, with AI you can vibe code. So, you know, AI is really good at producing simple applications that can render data. Yeah, simple web applications, like, render data where you have, like, input fields and stuff like that.
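
As a rough sketch of what "dialing in" a trace view could mean, the Python below renders a trace as plain text and hides step kinds that take up space without helping a reviewer. The trace structure, the step kinds, and the `hide` default are all hypothetical and chosen only for this example; a real annotation app would mirror the application's own data and UI.

```python
def render_trace(trace, hide=("system_prompt", "raw_tool_output")):
    """Render a trace as compact text, skipping step kinds listed in `hide`."""
    lines = [f"Trace {trace['id']}"]
    for step in trace["steps"]:
        if step["kind"] in hide:
            continue  # bulky by default, rarely useful during review
        lines.append(f"[{step['kind']}] {step['text']}")
    return "\n".join(lines)

# Hypothetical trace from an email-writing assistant.
trace = {
    "id": "t42",
    "steps": [
        {"kind": "system_prompt", "text": "You are a helpful assistant... (2,000 tokens)"},
        {"kind": "user", "text": "Draft a reply declining the meeting."},
        {"kind": "raw_tool_output", "text": "{\"calendar\": [...]}"},
        {"kind": "assistant", "text": "Hi Sam, unfortunately I can't make Tuesday."},
    ],
}

print(render_trace(trace))
```

Which fields to hide, and how closely to mimic what the end user saw, depends entirely on the product being reviewed.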

00:23:24:16 - 00:23:50:21
Unknown
That's something that is probably, you know, below the bar where AI can clear those tasks very well. And so because of that reality, we recommend that people in a lot of cases, like, create their own data annotation apps, because there's just way too much value to be had relative to the cost of doing so. It isn't the case, like, 100% of the time,

00:23:50:21 - 00:24:17:07
Unknown
but it's the case, like, a lot of times. And I know a big part of our recent product philosophy at Galileo has been to give people more simplified views, whether it's, you know, the graph view or timeline view, which we've kind of designed with the idea of, like, okay, let's give them other options to debug agents in particular, as we look at these kind of more complex systems, as well as other views that, you know, may or may not be live by the time that this podcast launches.

00:24:17:09 - 00:24:41:15
Unknown
And Atin, I know this is something you're thinking a lot about, too, because, as I alluded to, we've kind of had conversations together with AI engineers, I think, just like Hamel has, who are going, hey, I need help focusing here. I'm not really sure what to look at necessarily. I'm not sure where to spend my time in that error analysis. What's your philosophy on how to approach this,

00:24:41:17 - 00:25:22:14
Unknown
I guess, observability and focus layer that Hamel is talking about? Yeah. I think beyond the graph views, which is a feature that we offer, features like graph views kind of tend to point to the broader philosophy of giving the right abstractions to the user to be able to kind of do the segmented, you know, root causing of, you know, this ever-growing sophistication in systems, which have evolved from simple RAG to agentic RAG to multi-agent. You want to give the users the right abstractions so that they can shine the torch in the right areas.

00:25:22:17 - 00:25:49:23
Unknown
And that's where views like the graph view, session views, interaction views come in, to be able to give the tools to the user to just be able to root cause effectively. And what that means, what that entails, is you run your application end to end, and each request may sort of touch certain parts of your application and light up the nodes there.

00:25:49:26 - 00:26:16:10
Unknown
And each request will run through a different sort of path in your application, which you can visualize as a DAG or workflow. The first step is to be able to spot the anomaly. That's where kind of the ground-level customization on the metrics, as well as the qualitative insights, come in. But then you need these right abstractions and the right views to be able to make sense of,

00:26:16:12 - 00:26:37:15
Unknown
yeah, what's going on. And then there's the data that's associated, because all this really is is just data flowing through a bunch of nodes and edges. So once you spot the anomaly, you want to look at the data and what, you know, went wrong with that. So simplifying the views around the data is kind of the next step from there.
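
The picture of requests lighting up nodes in a workflow can be illustrated with a minimal Python sketch. It is not a description of any Galileo feature: the node names, the per-step `ok` flag, and the idea of ranking nodes by failure rate are assumptions made for this example of narrowing down where to look.

```python
from collections import defaultdict

def failure_rate_by_node(requests):
    """For each node, the share of visits in which that node's step failed.

    Each request is the ordered path it took: a list of (node, ok) pairs.
    """
    visits = defaultdict(int)
    failures = defaultdict(int)
    for path in requests:
        for node, ok in path:
            visits[node] += 1
            if not ok:
                failures[node] += 1
    return {node: failures[node] / visits[node] for node in visits}

# Hypothetical requests through an agent workflow; each takes its own path.
requests = [
    [("router", True), ("retriever", True), ("writer", True)],
    [("router", True), ("retriever", False), ("writer", True)],
    [("router", True), ("calculator", True)],
    [("router", True), ("retriever", False), ("writer", False)],
]

rates = failure_rate_by_node(requests)
worst = max(rates, key=rates.get)
print(worst, rates[worst])
```

Once a suspicious node stands out, the next step described above still applies: open the underlying data for those requests and read what actually went wrong.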

00:26:37:18 - 00:27:02:22
Unknown
And just to be clear, like, what I described, this is not at odds with these things in tools. They're just, like, supplementary. Like, I also always want a trace viewer like the ones in Galileo, because this can be, like, a lot faster to search through that and just look at that. You know, sometimes I'm looking for something that maybe by accident wasn't in the annotation tool or something else.

00:27:02:24 - 00:27:24:01
Unknown
So it is really useful. And also, like, a lot of these platforms like Galileo have APIs where you can connect your annotation tool to and, you know, write data back and forth to it. So, you know, that's just something to think about. Yeah. And I think we all agree that's a great best practice, is to leverage the APIs of whatever evaluation tool you're leveraging.

00:27:24:01 - 00:27:55:13
Unknown
Obviously we hope that's Galileo. But whichever eval tool you're using, like, using that API to bring that data into other places where you can look at it and look at it in different ways, and kind of consume that information and highlight it to business users, I think is a fantastic thing to do. And Hamel, I know you've talked about this idea of empowering domain experts who may not be in our product every day to add their insights and help improve these non-deterministic systems.

00:27:55:16 - 00:28:36:02
Unknown
How do you think about, you know, writing and iterating on prompts with domain experts versus with engineers? Yeah. So one of the biggest failure modes I see, and it is also one of the biggest drivers of my consulting business, is people outsourcing evaluations to developers, which is fine if you're building a developer tool where the developer is the domain expert, but usually they're not. And the root cause of people outsourcing evals to developers is that they're thinking of AI like software engineering.

00:28:36:05 - 00:29:01:06
Unknown
They're like, oh, AI development is a software engineering task. And, you know, the moment you say anything about the AI development process, it's, like, outsourced to developers. And it turns out that always goes really badly because, yeah, like, you're only guessing, you know. The developers don't have enough context. So you want to involve the domain expert.

00:29:01:08 - 00:29:22:03
Unknown
So, like, you know, if you're building something for lawyers, you want to involve a lawyer. You want to involve the legal expert at some point. And so, you know, when it comes time to doing things like iterating on prompts, you shouldn't have the prompt so removed from the domain expert. The whole point of LLMs is, like, humans can talk to computers.

00:29:22:06 - 00:29:46:01
Unknown
And so if you obfuscate everything so much that the domain expert can't talk to the computer, then you're kind of burning the whole value proposition of AI to begin with. You want a direct line of communication between your domain expert and what's going into the AI in terms of prompting.

00:29:46:08 - 00:30:08:09
Unknown
And so what I described in that blog post, a good pattern that I've seen work really well, is this: if you have a user-facing application, have an admin view where you expose the prompt and allow the person to change the prompt. Even if you don't want the user to change the prompt, you have that for your internal purposes.

00:30:08:09 - 00:30:27:21
Unknown
You have an admin view that allows the domain expert to change the prompt and fiddle with it. It gives them a more direct connection to what exactly is happening, rather than having abstract conversations about AI, that it should do this and it should do that. It's really important that they get in there and experiment.
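As a rough illustration of the admin-view pattern described above, here is a minimal Python sketch. It is not taken from the blog post or from any particular tool: `PromptStore`, `DEFAULT_PROMPT`, and the `$question` placeholder are all hypothetical names. The idea is only that the application renders every prompt through one object, so a domain expert's edit in an admin view takes effect in the real request path.

```python
from dataclasses import dataclass, field
from string import Template
from typing import List, Optional

# Hypothetical default prompt; a real application would define its own.
DEFAULT_PROMPT = "You are a legal assistant. Answer the question: $question"


@dataclass
class PromptStore:
    """Holds the default prompt plus an optional override set from an admin view."""

    default: str = DEFAULT_PROMPT
    override: Optional[str] = None
    history: List[str] = field(default_factory=list)

    def set_override(self, text: str) -> None:
        # Keep every attempt so the expert can compare experiments later.
        self.history.append(text)
        self.override = text

    def clear_override(self) -> None:
        # Fall back to the default prompt without losing the history.
        self.override = None

    def render(self, **variables: str) -> str:
        # The application always renders through here, so an override is
        # exercised by the real request path rather than a separate sandbox.
        active = self.override if self.override is not None else self.default
        return Template(active).substitute(**variables)
```

A domain expert's session might then call `store.set_override("Cite the relevant statute. $question")`, try a few real questions in the application, and call `store.clear_override()` to go back to the default.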

00:30:27:24 - 00:30:46:29
Unknown
Yeah. And I think that very much aligns with what Galileo has done with its continuous learning through human feedback feature, because we feel the same way. You need to leverage this domain expert feedback. You can't simply have it be just the engineers, who may be, depending on your business, divorced from the bare metal of what the product's doing.

00:30:46:29 - 00:31:12:23
Unknown
Hopefully they are very aligned to that, but sometimes they have business users who are translating key pieces of that for them, or domain experts who bring a lot of context. And I know it's part of why, especially when we're looking at custom metrics but really with all of our metrics, we leverage feedback from SMEs, whatever type they may be. You can go in and say, okay, let me get feedback on these ten traces, and say, hey, this metric feels a little off.

00:31:12:23 - 00:31:39:09
Unknown
Or, actually, this is pretty accurate. Or here's a little contextual feedback. And then using a judge to translate that, apply it, and retune the metrics is something we're finding a lot of success with. But I think there's a lot more opportunity to go deeper here. To your point, it feels like too often, even in highly customized evaluation systems for enterprises, we are just scratching the surface of the human context that we can bring in.
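One simple way to act on SME feedback over a handful of traces is to measure how often an automated judge agrees with the expert. The Python sketch below is an illustrative assumption, not a description of how Galileo or any other product retunes metrics: `agreement_report` and its pass/fail dictionaries keyed by trace ID are hypothetical.

```python
from typing import Dict, List, Optional, TypedDict


class AgreementReport(TypedDict):
    n: int
    agreement: Optional[float]
    disagreements: List[str]


def agreement_report(
    sme_labels: Dict[str, bool], judge_labels: Dict[str, bool]
) -> AgreementReport:
    """Compare pass/fail labels from an SME and an automated judge.

    Only traces labeled by both sides are compared. The disagreements are
    the traces worth re-reading before adjusting the judge or the metric.
    """
    shared = sorted(set(sme_labels) & set(judge_labels))
    if not shared:
        return {"n": 0, "agreement": None, "disagreements": []}
    disagreements = [t for t in shared if sme_labels[t] != judge_labels[t]]
    agreement = 1 - len(disagreements) / len(shared)
    return {"n": len(shared), "agreement": agreement, "disagreements": disagreements}
```

A low agreement rate on even ten traces suggests the metric does not yet encode what the domain expert cares about.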

00:31:39:12 - 00:31:59:13
Unknown
I mean, it's a very common problem for many organizations that there is too much tribal knowledge that's not living in documentation, that's not necessarily making its way into systems. And to your point, it's so necessary that we bring that human knowledge into our AI systems, because they perform best when they have the data they need.

00:31:59:15 - 00:32:22:29
Unknown
And it can be as simple as friction between technical teams and the understanding that domain experts have of some of the jargon of your AI systems. You gave this great example in one of your pieces about translating RAG to just "making sure the model has the right context," and really saying, hey, let's just put this in a term that anyone can understand, even if they're not deep in AI.

00:32:23:02 - 00:32:51:21
Unknown
What's your advice for AI teams who are looking to bridge that gap and really bring their domain experts into the fold, so that they can be part of improving their AI systems and their AI data? Yeah. Let me clarify that last point with some concrete failure modes to look out for. One is, okay, there's this aspect of a prompt store, a centralized place where you can put prompts, which is fine.

00:32:51:23 - 00:33:17:06
Unknown
But a lot of times what happens is folks don't build properly enough around that. They don't build an experimentation environment. And so you have to change the prompt there, and then commit it, and then wait, and then go somewhere else and try something, and that's way too much friction. So that prevents the domain expert from experimenting.

00:33:17:14 - 00:33:35:24
Unknown
A lot of tools have prompt playgrounds, which are great. It's a good place to get started. However, most prompt playgrounds don't have access to your tools, your infrastructure, and your application code. So they can't perform RAG, they can't call tools, and they can't do all the things that your application is doing.

00:33:35:27 - 00:34:02:23
Unknown
And so you can't necessarily rely on that either. That's why you need this integrated environment. I forgot what I called it in the blog post; I think I called it an integrated prompting environment or something. I tried to make up a name for it. Basically, you need to be able to play with the prompt in your user-facing application directly, because that's the only pattern, at least that I've seen, that's worked reliably in terms of bringing the domain experts in.

00:34:02:29 - 00:34:31:26
Unknown
Yeah, I'll just add a couple of points here. First, of course, the need for easy-to-use human feedback is critical. And, to your point, Conor, some of our human feedback features go much beyond just offering binary signals, thumbs up, thumbs down, and the ability to create your own kinds of feedback becomes important.

00:34:31:29 - 00:35:00:26
Unknown
But to Hamel's other point about managing the prompts and offering the subject matter expert the ability to tweak the prompts to interact with the app, I think engineering-wise the matter gets a little bit tricky, especially for more sophisticated applications like multi-agent systems, where things are not necessarily driven by one prompt. You might have a series of prompts which are triggered one after the other.

00:35:00:28 - 00:35:33:14
Unknown
You don't have control over many of them, but more often than not, it is driven by a kind of seed query, which is the natural language interface to any gen AI app. So the engineering challenge becomes: how do you abstract the entire application and make it available to the user through a natural language interface, the user being the subject matter expert, not the developer, but being able to actually run the developer's app seamlessly?

00:35:33:14 - 00:36:09:07
Unknown
So that to the SME, it's all about: here's my input. I have pure knowledge about my input and the expected output, but all the machinery in the middle, you should be able to abstract out for me. So the trickiness comes in the challenges around how you use our APIs and SDKs and, of course, all the containerization technology to simulate a version of the app, which may be a distributed app. It might be running on two different availability zones, for that matter.

00:36:09:07 - 00:36:32:10
Unknown
It's just software. So I think that's where the challenge comes. And we're at a point where it's doable to simulate singular, monolithic applications and make this workflow available to the SMEs. But it gets challenging when the app itself becomes distributed. And that's where a lot of engineering innovation is going.

00:36:32:17 - 00:36:55:17
Unknown
Yeah, it's really non-trivial. You often can't expose everything to the SME. You have to ask, is there a high-value thing I can expose? And if nothing else, it just helps give them intuition, so they don't think RAG is a very abstract concept, or that a prompt is even an abstract concept.

00:36:55:19 - 00:37:15:08
Unknown
You'll be surprised how many people think a prompt is an abstract concept, because they say something in a meeting and the expectation is that a developer is going to write the prompt. That's the worst thing that can possibly happen. So in whatever way possible, they need to get away from that. And so, what I'd love to close the conversation with, and Hamel, thank you again so much for joining us.

00:37:15:08 - 00:37:42:10
Unknown
It's been a distinct pleasure having you. What I'd love to close with is just some advice. What would be your summation, your advice to a team that is looking to build their eval system, that is looking to improve their AI products? What would you tell them? Yeah. So the two biggest things that I can think of: one is error analysis, also known as look at your data.

00:37:42:13 - 00:38:03:24
Unknown
It just solves so many problems. I would say maybe 90% of the whole evals process is looking at your data. You find so much even before writing evals. You'll just find so many bugs, so many things to optimize and improve, and so on and so forth.
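To make "look at your data" concrete, here is a small Python sketch of the bookkeeping side of error analysis. It assumes a reviewer has already read each trace and jotted down free-form failure labels. The trace IDs and labels are invented for illustration, and `tally_failure_modes` and `failure_rate` are hypothetical helpers rather than part of any library.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical hand-written review notes: one dict per trace a person has read.
reviewed_traces: List[Dict] = [
    {"trace_id": "t1", "failure_modes": ["missing_context"]},
    {"trace_id": "t2", "failure_modes": []},
    {"trace_id": "t3", "failure_modes": ["missing_context", "wrong_tone"]},
    {"trace_id": "t4", "failure_modes": ["tool_call_failed"]},
]


def tally_failure_modes(traces: List[Dict]) -> List[Tuple[str, int]]:
    """Count how many reviewed traces show each failure mode, most common first."""
    counts: Counter = Counter()
    for trace in traces:
        # A set guards against the same label being noted twice on one trace.
        counts.update(set(trace["failure_modes"]))
    return counts.most_common()


def failure_rate(traces: List[Dict]) -> float:
    """Fraction of reviewed traces with at least one noted failure."""
    if not traces:
        return 0.0
    failed = sum(1 for t in traces if t["failure_modes"])
    return failed / len(traces)
```

The most frequent labels from a tally like this are natural candidates for the first customized evals, since they reflect failures actually observed in the data.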

00:38:03:26 - 00:38:36:05
Unknown
And then the next thing that I can think of that makes a huge difference is having an experimentation mindset. And this one you have to cultivate a little bit. There are some talks that I can point you to about how you might reframe your thinking. I mean, this is something that's innate to machine learning folks and data science folks. You don't have this waterfall chart of how to build a machine learning system. You have an idea of different experiments you want to try.

00:38:36:06 - 00:39:01:26
Unknown
You don't even know if it's going to work. But what you do have is a hypothesis: hey, this might work, this might not work. Let's try this. Let's look at this afterwards. And so you have to reorient a lot of things in order to do that. You have to have a different language that you use within your teams.

00:39:01:28 - 00:39:20:23
Unknown
And sort of make sure that you don't have those rigid approaches when it comes to this. It's hard. Yeah, that's probably another podcast. But those are my thoughts. I mean, we can definitely have you back for another conversation. I think there is so much more we can go into here. Atin, how about you?

00:39:20:23 - 00:39:45:02
Unknown
Any closing thoughts from your side of the house? Yeah, I would say that before LLMs, AI was considered garbage in, garbage out. And now with LLMs, AI has become software. So software 3.0 is AI, and now software is garbage in, garbage out. So to Hamel's point, do look at your data, because of garbage in, garbage out.

00:39:45:04 - 00:40:10:25
Unknown
And secondly, I would say that there are three specific things that I've learned as kind of the layers of AI reliability. The bottommost layer is the brass tacks: set up basic monitoring and traceability. That's just stuff that we'd solved before AI happened. Traditional observability is a partially solved problem, and there are certain things that are done well there.

00:40:10:25 - 00:40:36:09
Unknown
Adopt those practices. The second of the three layers is: set up your prompts and your metrics and consider them your evaluation assets. They're your first-class citizens. They will evolve over time. Have disciplined versioning and lineage around them; set up a good system there. And the third is the insights layer, which is the whole idea of qualitative insights turned into customized quantitative insights.
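As one possible reading of "disciplined versioning and lineage" for prompts and metrics, the Python sketch below identifies each version of an asset by a hash of its content and records which version it was derived from. `AssetVersion` and `AssetRegistry` are hypothetical names, and this is an assumption about how such a system could look, not a description of any specific tool.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AssetVersion:
    """One immutable version of a prompt or metric definition."""

    name: str
    content: str
    version_id: str
    parent_id: Optional[str]


class AssetRegistry:
    """In-memory registry that keeps an ordered history per asset name."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[AssetVersion]] = {}

    def register(self, name: str, content: str) -> AssetVersion:
        history = self._versions.setdefault(name, [])
        # The ID is derived from the content, so identical text gets the same ID.
        version_id = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
        if history and history[-1].version_id == version_id:
            # Re-registering unchanged content does not create a new version.
            return history[-1]
        parent_id = history[-1].version_id if history else None
        version = AssetVersion(name, content, version_id, parent_id)
        history.append(version)
        return version

    def lineage(self, name: str) -> List[str]:
        """Version IDs for an asset, oldest first."""
        return [v.version_id for v in self._versions.get(name, [])]
```

Logging the `version_id` next to every trace and eval result makes it possible to tell which prompt or metric definition produced a given number.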

00:40:36:11 - 00:41:00:20
Unknown
So if you practice these three things and consider them the three pillars of your AI reliability, you'll build a good 360 evaluations and observability layer in your software. And I'll add one more thing: take my course. So there's a shameless plug. Take the evals course. It'd be a good way to learn about how to get set up with evals.

00:41:00:23 - 00:41:18:12
Unknown
And I'll second that and say, also check out Hamel on X and on LinkedIn, where he shares a lot of fantastic content. We will certainly link both of those in the show notes. And yeah, Hamel's blog as well is a great place to go learn. Hamel, thank you so much for joining us on the show.

00:41:18:12 - 00:41:41:29
Unknown
It's been a pleasure. Yeah. Thank you. And to our listeners, if you want more fantastic content from Hamel and many other thought leaders, make sure you subscribe to the podcast, because we share information from industry experts, perspectives from AI luminaries, and hot takes, plus much more, both in the podcasting app of your choice and on YouTube.

00:41:41:29 - 00:42:01:28
Unknown
So whether you want to watch the conversation, listen in, or check out any of our other content here from Galileo, you can find us all over the internet. We appreciate your support. And Hamel, Atin, thank you again for joining me today. Thank you. Thank you.



What is Chain of Thought | AI Agents, Infrastructure & Engineering?