High Agency: The Podcast for AI Builders

I recently sat down with Bryan Bischof, AI lead at Hex, to dive deep into how they evaluate LLMs to ship reliable AI agents. Hex has deployed AI assistants that can automatically generate SQL queries, transform data, and create visualizations based on natural language questions. While many teams struggle to get value from LLMs in production, Hex has cracked the code.

In this episode, Bryan shares the hard-won lessons they've learned along the way. We discuss why most teams are approaching LLM evaluation wrong and how Hex's unique framework enabled them to ship with confidence.

Bryan breaks down the key ingredients to Hex's success:
- Choosing the right tools to constrain agent behavior
- Using a reactive DAG to allow humans to course-correct agent plans
- Building granular, user-centric evaluators instead of chasing one "god metric"
- Gating releases on the metrics that matter, not just gaming a score
- Constantly scrutinizing model inputs & outputs to uncover insights

For show notes and a transcript go to:
https://hubs.ly/Q02BdzVP0
-----------------------------------------------------
Humanloop is an Integrated Development Environment for Large Language Models. It enables product teams to develop LLM-based applications that are reliable and scalable. To find out more go to https://hubs.ly/Q02yV72D0

What is High Agency: The Podcast for AI Builders?

High Agency is the podcast for AI builders. If you’re trying to understand how to successfully build AI products with Large Language Models and Generative AI then this podcast is made for you. Each week we interview leaders at companies building on the frontier who have already succeeded with AI in production. We share their stories, lessons and playbooks so you can build more quickly and with confidence.

AI is moving incredibly fast and no one is truly an expert yet. High Agency is for people who are learning by doing and will share knowledge through the community.

Where to find us: https://hubs.ly/Q02z2HR40

[00:00:00] Bryan Bischof: When I talk to AI engineers who come from data science, they're like, I look at the data so much.
I'm looking at the inputs and looking at the outputs. I'm trying to figure out like what's going on all the time. And I think that is the sort of alpha that people with a lot of experience in data science and machine learning come at this problem with is: you will answer most of the hard product questions by looking at the data.
And the data is going to be generated as you go. It is going to be generated by the users, and it's gonna be generated by the agent. And you will generate so much alpha for your own product by staring at that data.
Intro and music
This is High Agency, the podcast for AI builders. I'm Raza Habib.

[00:00:49] Raza Habib: I'm delighted to be joined today by Bryan Bischof, who has an extraordinary CV when it comes to AI, starting from a hardcore maths PhD, coming as a physicist, and then had [00:01:00] data science roles at some of the most interesting companies out there: Stitch Fix, Blue Bottle.
And now the reason he's on the show today is he leads AI at Hex, where he's been bringing LLMs and AI to data science notebooks and to the workflows of data analysts and data scientists. So Bryan, it's a pleasure to have you on the show.
[00:01:16] Bryan Bischof: Thanks a lot. I'm really excited to be here and yeah, excited to chat.
[00:01:20] Raza Habib: Thanks very much. So I'm going to dive straight in, because my first question to you is related to Hex's AI product. Most companies have failed to get AI agents into production. You're one of the only ones that I know who have succeeded here. So to start with, what have you guys figured out that other people got wrong?
[00:01:38] Bryan Bischof: So I think one thing I would say is, we also failed. We have both succeeded and failed at getting agents into production. And maybe that was the first thing that was kind of noteworthy: our very first attempt at agents in prod didn't go super well. We were getting a lot of unexpected behavior and a lot of, sort of, not [00:02:00] quite the death loops that you hear about with some people's agent applications, but we were getting too high of entropy, as I sometimes talk about.
Let me set the stage, actually. So you ask a question, and that's a question about data science. And we want to dispatch a collection of agents to help with different stages of that work. One thing that is quite common at this point (there's even a recent Claude blog post discussing this) is having one agent make a plan, having that plan consist of individual steps, and then having the agents go and work on those individual steps, somewhat in parallel, somewhat in sequence.
That's the agent planning paradigm at this point. And we too started with that paradigm. The challenge is that we allowed for quite high diversity in the plan. And right off the bat, it seemed like it was going really well. We were getting exciting results.
Raza Habib: And can I just ask, when you say high diversity in the [00:03:00] plan, are you using things like function calling to call out to the other sub-steps? And does diversity here just mean many functions, or what does diversity mean here?
Bryan Bischof: Yeah. The initial approach is we would sort of let the agent decide on how it wanted to structure the plan and how it wanted to orchestrate the agents. And we allowed for pretty much anything goes. We were using function calling, but we were using it in a pretty broad sense, exactly to your question.
One of the ways that we constrained that, when we started getting things a little bit more under control, and what I believe led us to success in getting agents into prod, is this precise part here: we were more prescriptive about the types of plans that could be generated. That doesn't mean that we tightened down how much planning could be made up by the agent, but what we did tighten down is all of the specific types of steps that could be executed in the [00:04:00] plan.
Let me give you a precise example. The classic example now of tool use for these LLMs is "What's the weather in San Francisco today?" And so the tool use example is, oh, it's going to make a call to get the weather, and that is going to be the tool use. And then I would say level two of that is "What's the weather near me today?" And the reason that this is level two is because you need a tool to identify where the user's coming from, and you need a tool to use that information to then go make the query about what the weather is. And then you gather those up and give that to the agent response.
So, if you ask the agent, "What's the weather like for me today?", what you want the agent to be able to reason about is that it needs to use both of those tools. And you need to specify that one is a get-user-location tool and the other one is a get-weather tool. Everything sounds great, right? But [00:05:00] how many of these use cases are you going to build specific tools for?
Are you going to build individual tools for like get user location, get weather, get precipitation, get humidity? How many tools are you building and how much does that tightly constrain what your interface to the user like affords? And so you can start to see there almost becomes a one to one mapping.
And it's not quite one to one, but combinatorics aside, there ends up being this mapping between all the functionality that you want to build in, in the planning stage and all the tools that you have to introduce. And so long answer to a short question, how do you think about this in terms of like getting agents into prod?
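As an illustration of the level-one and level-two weather examples above, here is a minimal Python sketch. The two tool functions, the `TOOLS` registry, and `run_plan` are hypothetical names backed by canned data. This is not Hex's implementation, and a real system would let the model pick the plan through function calling rather than receive it as a hard-coded list.

```python
from typing import Callable

# Hypothetical stand-ins for real services; both return canned data.
def get_user_location(user_id: str) -> str:
    """Look up where the user is (stubbed)."""
    return {"alice": "San Francisco"}.get(user_id, "Unknown")

def get_weather(city: str) -> str:
    """Look up the weather for a city (stubbed)."""
    return {"San Francisco": "foggy, 58F"}.get(city, "no data")

# The registry is the complete set of things the agent is allowed to do.
TOOLS: dict[str, Callable[[str], str]] = {
    "get_user_location": get_user_location,
    "get_weather": get_weather,
}

def run_plan(plan: list[str], first_input: str) -> str:
    """Run the named tools in order, feeding each result into the next."""
    value = first_input
    for tool_name in plan:
        if tool_name not in TOOLS:
            raise ValueError(f"unknown tool: {tool_name}")
        value = TOOLS[tool_name](value)
    return value

# Level one: the city is given, so one tool is enough.
level_one = run_plan(["get_weather"], "San Francisco")

# Level two: "near me" needs a location lookup before the weather lookup.
level_two = run_plan(["get_user_location", "get_weather"], "alice")
```

Every new question type (precipitation, humidity, and so on) would need another entry in the registry, which is the near one-to-one mapping between functionality and tools described here.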
You start by asking, what are the core ways that the agent should be able to respond to and interface with our product? Building those into still-general [00:06:00] tools is kind of where we transitioned. So: initial version, quite broad, lots of "it's just going to return a plan." That plan is going to orchestrate down to very specific types of tools that it knows how to use and that are super representative of our capabilities.
[00:06:17] Raza Habib: So it sounds to me like what you're saying is there were two things that you had to figure out. One is what constrained set of, quote unquote, tools, what API interfaces, are we going to expose to the model that are the right ones to give it the maximum power within Hex, or within whatever application someone's building, without constraining it too much.
So there's kind of like this generality trade off between what APIs you actually expose it to. If they're too general, it can go off the rails. But if they're too narrow, you're going to have like lots of them. But I'm kind of curious. One thing that wasn't clear to me is like, how are you actually doing this constraining of the model? Is this prompt engineering or is this, how are you actually doing that?
[00:06:56] Bryan Bischof: Yeah, it's prompt engineering via function calling. We have thought really deeply [00:07:00] about the API for those tools that the agent's going to use, so that when the initial plan is generated, that can yield really powerful prompts to the agents themselves at that dispatch stage. I didn't even think of this. When I was first dreaming up this agent pipeline, I didn't really think very much about the handoff. I was just thinking, oh, we'll do the planning stage, and then we'll do the tool use stage, and everything's going to be great. And then handoff ended up being the very first thing that totally got us.
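The idea of being prescriptive about step types while leaving the plan itself open can be sketched roughly as follows. The `StepType` values, the `Step` fields, and `parse_plan` are assumptions made for illustration; the conversation does not describe Hex's actual tool API.

```python
from dataclasses import dataclass
from enum import Enum

class StepType(Enum):
    # Hypothetical step types, chosen for illustration only.
    SQL = "sql"
    TRANSFORM = "transform"
    CHART = "chart"

@dataclass
class Step:
    type: StepType
    instruction: str  # the prompt handed off to the sub-agent for this step

def parse_plan(raw_plan: list[dict]) -> list[Step]:
    """Validate a planner's raw output against the allowed step types."""
    steps = []
    for item in raw_plan:
        try:
            step_type = StepType(item["type"])
        except ValueError:
            raise ValueError(f"step type not allowed: {item['type']!r}") from None
        steps.append(Step(step_type, item["instruction"]))
    return steps

# The planner is free to choose how many steps and what each one says,
# but every step must be one of the known types.
raw = [
    {"type": "sql", "instruction": "Count orders per product for last month."},
    {"type": "chart", "instruction": "Bar chart of orders per product."},
]
plan = parse_plan(raw)
```

The `instruction` field stands in for the handoff: whatever the planner writes there is all the downstream agent receives, which is one reason the handoff deserves as much design attention as the planning stage.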

[00:07:32] Raza Habib: And this is, this is handoff to like a sub LLM or sub agent. So user query comes in, they say, hey, I want to analyze this data and find out which product was most popular last month. And then the agent's going to go and build a plan, figure out how to dispatch this to like sub agents, and each of those sub agents then has some constrained set of tasks it can do. And eventually you aggregate these all back up into a final result somehow.
[00:07:54] Bryan Bischof: Yes and no. Actually, our agent results we tend to have associated to [00:08:00] explicit Hex functionality. So a lot of these agents are responsible for one part of the magic analysis, as we call it. The magic analysis is a chain of cells. Each of those individual cells is an agent. They're operating both independently and together.
And this raises another technical challenge that we had to work through, and this is more on the engineering side, was how do you infer the state that other agents are going to need access to, to be able to do their work? Some of the prompts in downstream agent requests require information from the original prompt or from an agent response upstream.
So we actually construct a DAG under the hood of what we think the plan infers: in that DAG, which steps require upstream response values. And then we also do things like reactivity. And this is the other thing that was pretty interesting: if things don't go well, if you want [00:09:00] to make a change, and you go upstream in your agent DAG and you make a change to this agent's response, that should percolate through.
And so we also have reactive agent flows. So you can make a change and it percolates all the way through. It re-prompts and it rebuilds.
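A rough sketch of the reactive DAG idea: when one step's response is edited, only the steps downstream of it are re-run, in dependency order. The step names and the dependency map are hypothetical, and the sketch uses Python's standard `graphlib` module rather than anything Hex-specific.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step maps to the set of steps it depends on.
DEPENDS_ON = {
    "load_orders": set(),
    "load_customers": set(),
    "join": {"load_orders", "load_customers"},
    "chart": {"join"},
    "customer_count": {"load_customers"},
}

def downstream_of(changed: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Return every step that directly or transitively depends on `changed`."""
    affected: set[str] = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for step, deps in depends_on.items():
            if current in deps and step not in affected:
                affected.add(step)
                frontier.append(step)
    return affected

def rerun_order(changed: str, depends_on: dict[str, set[str]]) -> list[str]:
    """Steps to re-run after editing `changed`, upstream steps first."""
    stale = downstream_of(changed, depends_on) | {changed}
    order = TopologicalSorter(depends_on).static_order()
    return [step for step in order if step in stale]

# Editing "load_orders" leaves "load_customers" and "customer_count" untouched.
to_rerun = rerun_order("load_orders", DEPENDS_ON)
```

Here `to_rerun` is `["load_orders", "join", "chart"]`: the change percolates through the steps that consume it and nothing else is re-prompted.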
[00:09:16] Raza Habib: Interesting. And did you jump straight to agents, or did you guys try simpler things? Try getting it to work with RAG, single, you know, goes first? What pushed you to take on all this extra complexity? Because it sounds like there's a lot of work involved in trying to get the agents to work.
What's the reward for all of that? You know, what motivates you to do it in the first place?
[00:09:36] Bryan Bischof: I think there's two questions there: what did we do first? And then what was the paradise that we saw in the distance that we were willing to walk through the desert for? On the former, absolutely. And I actually think that was part of why we had any chance of success in the agent pipeline, because
all of the agent capabilities that we have built out started as individual capabilities that the [00:10:00] user can do. So before we build a sub-agent that's able to generate the relevant SQL on the way to answering your question, we first build a capability that generates appropriate SQL. We see how users use it.
We learn from that experience. We learn the connection between the display UX and the generation. We understand a little bit about what needs to be in scope to construct great context. We've learned all those lessons, so when it comes time to say that this is dispatched by the supervising agent, now we get to bootstrap. And that was really, really important in getting us to work relatively quickly. On the "what's the benefit, what's the paradise in the distance," that is more about what the workflow is. So as a data scientist, you often think in cells, you really do. You think, okay, [00:11:00] what do I want to do next?
And that's a step. It's a logical step. But overall, every little individual thing that you do in an analysis adds up to a collection of cells. I can't think of many times when the beginning and end of my entire analysis is just one cell. And so the real question is, if a single question maps to a collection of cells to get to the answer, then that's exactly how magic should feel too.
You should ask a question, and it should be able to generate a sequence of cells to answer that question. One of the reasons we were sort of the first ones to go down this path of, you ask a question and bang, bang, bang, executable cells, even before things like Code Interpreter, is because that is the workflow that data scientists use. One of the reasons that Code Interpreter doesn't feel super valuable for a data scientist is because it tries to jam everything into one cell. By having multiple cells that work together, you get a lot more, I would say, logical [00:12:00] workflow. And that's why it made sense to do it as agents.
[00:12:03] Raza Habib: There's something really interesting about what you're saying there that jumps out to me. When I was chatting to Kai from Ironclad about whether incumbents or startups are more likely to win here, one of the arguments he made to me is that, in some sense, if you have all of the existing workflow tooling, you have those APIs that the humans are using, you now have the building blocks that you can give to an AI agent.
You're in a much stronger position to start building that agent, whereas if you have to start from scratch, you have to invent the right abstractions and the right APIs as well as get the agents to work, simultaneously. And I guess what I'm hearing from you is, well, we knew what the workflow was because we had human data scientists doing it in Hex.
We knew what the right building blocks were because we had humans doing it in Hex, and we took each of those steps and those workflows and then exposed them to an AI agent who was kind of able to take that on. Is that a fair summary?
Bryan Bischof: Absolutely. And I strongly agree with that thesis. Ultimately, when you're trying to build a new product, [00:13:00] one of the questions you have to ask yourself is, who is it for? Okay. Once you've identified who it is for, the next question is basically, how does it serve any purpose for them? The great thing about building on top of a product that is already great for a particular type of user is you know what they do and you know what their pains are. And so you can start addressing those directly. I'll take an example. Let's say you wanted to build a text-to-SQL product. Okay, cool.
Raza Habib: Yeah. You know, people talk about tar pit ideas, and that one feels like one to me, because the number of startups we've had as customers at certain points that have tried that and bounced off for one reason or another (and I'd love to go into the reasons) has been enormously high.
It seems to be one of these ideas: it's very tempting, a lot of people seem to work on it, but it's very hard to succeed at in a general way.
[00:13:51] Bryan Bischof: Yeah. Well, let's just think about it from first principles: what might go wrong? So you think, okay, text to SQL, these agents have seen [00:14:00] tons of SQL. It'll be really good at writing SQL. Okay, neat. Let's assume, just for argument's sake, that GPT-5 is perfect at writing SQL.
It never makes any mistakes. It never hallucinates, and it never writes malformed SQL. It also knows every flavor of SQL out there. Let's just assume that. Okay. So you ask this infinite SQL bot, "How many customers do I have this month?" Okay, well, where are your customers? You're like, oh, sorry, sorry.
My customers are in Dim Customers. And it's like, okay, cool: select count star from Dim Customers. You're like, okay, well, my trial customers are also in there. And it's like, okay, where trial is false. No, that's not the name of the field. The field is "is trial." Oh, sorry, I'll just do that.
And you're like, oh, cool. Wait a minute. I don't care about my customers in this. I really care about the company users and those customers. Okay, where... And then [00:15:00] now you're in this conversational situation where you need to explain over and over again. So the first thing you say is, we need to do RAG. We need to do RAG over the schema. We'll put the whole schema in the context window, or we'll RAG and we'll search for the right tables. Great. Okay. So now you've graduated from a negative-one text-to-SQL startup to a level-zero text-to-SQL startup.
Okay, nice. I'm glad that you're here. Let's keep going. What's next? Now you know what the Dim Customers table is, and you know that there's a column called "is trial" in your Dim Customers table. Cool. Did you remember that before 2020, all of your customers weren't in this table? Because you did a data warehouse migration. That's fine. That's annoying. We shouldn't have done that. But, you know, the data team has a million tickets. There's no way they're going to get to this anytime soon. We'll just add that as part of the prompt. Okay.
But who knows that, [00:16:00] who knows that information? The data team knows that information. Some of the GTM team probably knows that information, but does everyone in the company know that information? You're a level-zero text-to-SQL startup and you're like, all right, all right, I need to accommodate for this.
What we'll do is we'll build a data dictionary. What does the data dictionary have? It's got all this relevant additional information; the whole business is documented there. Okay, cool. We'll feed all that into the model too. Okay, so lots to put in the model now. We've got the table schemas, and we've got all the meta-metadata. It's not even just the metadata from the tables; it's the meta-metadata about what all the data caveats are. I used to make a joke that every table has a caveat.
We're getting close to level one. We're not there yet, though, and the reason is that I don't know very many businesses that have all this shit documented.
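To make the progression from schema-only context to schema plus data dictionary concrete, here is a small sketch of assembling prompt context for a text-to-SQL request. The `dim_customers` table and `is_trial` column echo the conversation; `fct_orders`, the other column names, and the caveat wording are invented for illustration, and a real system would retrieve the relevant tables rather than take them as an argument.

```python
# Hypothetical schema: table name -> column names.
SCHEMA = {
    "dim_customers": ["customer_id", "is_trial", "created_at"],
    "fct_orders": ["order_id", "customer_id", "amount"],
}

# Hypothetical data dictionary: the tacit caveats that are not in the schema.
DATA_DICTIONARY = {
    "dim_customers": [
        "Trial customers are included; filter on is_trial.",
        "Customers from before 2020 are not in this table.",
    ],
}

def build_context(tables: list[str]) -> str:
    """Render the schema and known caveats for the selected tables."""
    lines = []
    for table in tables:
        lines.append(f"TABLE {table} ({', '.join(SCHEMA[table])})")
        for caveat in DATA_DICTIONARY.get(table, []):
            lines.append(f"  CAVEAT: {caveat}")
    return "\n".join(lines)

context = build_context(["dim_customers"])
```

The sketch also shows the limit raised here: the caveats only reach the model if someone has written them down in the first place.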
[00:16:51] Raza Habib: This is exactly the problem that a lot of these startups came up against. Maybe if you're a product person listening to this and you're wondering, what can I take away from this story? I think it's that the AI isn't magic. It can't [00:17:00] do things that humans can't do, in some sense.
And if you had a colleague joining your company who was going to be writing SQL queries against these tables, they would get all of this tacit knowledge from their colleagues. They'd ask questions, you'd explain it to them. Not everything is ever perfectly documented. And so you can't expect an AI system to be able to do that either unless it's somehow going to be able to also get all of this tacit information. And that's why I've only ever seen people get this to work for specific databases. Like you can build it in a, in a database specific way, but making it work generically such that anyone can come and plug in seems to be extremely difficult. But, but it does seem to be something that you guys have got working, which is why it was on my list of questions to ask you, actually.
[00:17:41] Bryan Bischof: And I agree with you. I think to get to level one here, you have to overcome these problems. And so what are some things that you can do? Well, you can make sure that you make documentation easy. We introduced the Data Manager; the Data Manager makes documentation really [00:18:00] easy.
You can make sure to use, we'll say, existing known context about the way people query the data. You can start using latent representations of the relationships between the data. You can start using more nuanced search and retrieval methods that people have known for a while, if they work in RAG and recommendation systems, to get the most appropriate stuff in real time.
That's how you breach, in my opinion, level one of text to SQL. And then obviously you arrive at level one and the very first thing you have to face is latency. Latency is ready and waiting for you at level one.
[00:18:41] Raza Habib: Okay, so this actually segues really nicely into a question that I was going to ask you anyway. So let's jump straight there. I tried out Hex, and one of the things that I was most impressed by is that you're using these agents under the hood. They're dispatching all over the place.
You're calling multiple models. And yet, it's really fast. Like, I put the response in, [00:19:00] and I get the answer back quite quickly. And so, you know, the product is called Magic. Like, it felt like magic. Like, how are you guys doing that? Are you fine tuning? Are you using smaller models? Like, I can but speculate.
But I'm very curious as to how it happened so quickly.
[00:19:14] Bryan Bischof: Yeah. I mean, I think part of it is we are very aggressive about the way that we keep things out of context. I remember like there was a period, like it was the summer of 2023.
[00:19:28] Raza Habib: Out of the model's context, as in you keep the... okay. So basically trying to make sure that you're sending the fewest number of tokens possible to the model at every call.
[00:19:36] Bryan Bischof: Yeah. In summer 2023, I felt like every time I went to an AI meetup, people were talking about more and more shit that they were jamming in the context and more and more excitement about long context windows. And I was like, that's interesting, because most of the conversations we have are about how we can get shit out of context.
Like what we can take out, what we can remove. And it's very much this Michelangelo, carving-a-sculpture thing of, I'm trying to [00:20:00] get rid of all the stone that shouldn't be there. That's very much how I feel about prompt engineering. It is not, what all can I put in there to make it good?
It's, what can I remove that doesn't break the ability for it to get the right answer? And so...
Raza Habib: And what's the process for that? Firstly, where are you guys doing your prompt engineering? Is it in Hex? Is it in a different tool? And do you start with a longer prompt and whittle down? Or, you know, how are you getting to this minimal context that's going to give us the lowest latency possible?
Bryan Bischof: Yeah. I mean, oftentimes we do start big and go small. And so we start by saying, okay, what is all the context that could possibly be available? And by the way, to answer your question, we're using GPT-4 Turbo and we are prototyping in Hex. Hex is a data science notebook. Where else could I ask to possibly be able to quickly iterate on an API, get the responses, make changes, see the output, and sometimes batch that over a collection of inputs?
You asked a little bit about [00:21:00] how that process feels. You do start by saying, okay, if I'm asking this question of magic at this place in the application, these are the things I've done before in my project. What is possible for it to know? And you explore all the possibilities: oh, it could possibly know what variables are in scope. Oh, it could possibly know what packages I've imported. Oh, it could possibly know all the code that I've previously written in this project, or that I've ever written ever, or every single SQL query that's been run against this data warehouse.
This is the universe. What part of the universe should I look in for these aliens? And that's kind of the, you know, you're trying to whittle down, whittle down, whittle down. We do start big and go small, and we do a lot of iteration on building out these capabilities, I would say.
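One way to picture "start big and go small" is an ablation loop: list every piece of context that could be sent, then drop pieces one at a time for as long as a check still passes. Everything named here (`prune_context`, the candidate pieces, `toy_eval`) is hypothetical. In practice the check would be an evaluation run against the model, and the conversation does not say that Hex automates the process this way.

```python
from typing import Callable

def prune_context(
    pieces: dict[str, str],
    still_passes: Callable[[dict[str, str]], bool],
) -> dict[str, str]:
    """Greedily drop context pieces while the evaluation keeps passing."""
    kept = dict(pieces)
    # Try the largest pieces first, since they cost the most tokens.
    for name in sorted(pieces, key=lambda n: len(pieces[n]), reverse=True):
        trial = {k: v for k, v in kept.items() if k != name}
        if still_passes(trial):
            kept = trial
    return kept

# Hypothetical candidate context for one request.
candidates = {
    "variables_in_scope": "df_orders, df_customers",
    "imported_packages": "pandas, altair",
    "all_prior_code": "x = 1\n" * 200,
    "warehouse_query_history": "SELECT 1;\n" * 500,
}

def toy_eval(context: dict[str, str]) -> bool:
    # Stand-in check: pretend the task only needs the variables in scope.
    return "variables_in_scope" in context

minimal = prune_context(candidates, toy_eval)
```

With this toy check, `minimal` keeps only `variables_in_scope`. A greedy pass like this is order-dependent and can miss interactions between pieces, so it is a starting point for inspection rather than a guarantee of the smallest workable context.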
[00:21:46] Raza Habib: But the real secret to the speed, then: it's not that you guys have fine-tuned custom things or are using, like, Mixtral all over the shop. It really is that you've done prompt engineering really cleverly with GPT-4 to send the least amount [00:22:00] of stuff you need to send at every stage.
Is that a fair summary? Okay.
[00:22:03] Bryan Bischof: Yes. We have some future things coming that are a little bit more outside just that one paradigm, in terms of exploring models. And we do use 3.5 when we've identified that it's capable enough.
[00:22:14] Raza Habib: The reason I sort of summarize it that way is because I was almost a little incredulous. It's so fast that I kind of assumed you guys must be fine-tuning or using a smaller model. But I guess I'm just used to interacting with models where people are sending a lot of stuff into the context.
[00:22:28] Bryan Bischof: I think that's part of it. Yeah.
[00:22:30] Raza Habib: Maybe kind of a last question on agents, because I do want to dive into evals with you as well, since I know it's an area that is close to your heart. A common argument that's given for why agents don't work well is, well, you know, we give the agent a task that's not super reliable at each stage, it's maybe 95 percent reliable, and then you're chaining multiple things together that are each 95 percent reliable, and your reliability basically goes down exponentially in the number of steps, whatever the length of this chain is.
[00:23:00] Do you buy that argument? And if you do, then why does it not hit you guys? And if you don't, what are other people getting wrong there?
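The argument in this question is simple arithmetic. A quick sketch, assuming every step succeeds independently with the same probability (an idealization, since real steps are rarely independent):

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step ** steps

for steps in (3, 10, 20):
    print(steps, round(chain_reliability(0.95, steps), 3))
# prints:
# 3 0.857
# 10 0.599
# 20 0.358
```

At 95 percent per step, a three-step chain still succeeds about 86 percent of the time, while a twenty-step chain succeeds barely a third of the time.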
[00:23:09] Bryan Bischof: If your biggest fear is expanding uncertainty as you go deeper into the chain, the very first thing you should do is keep the chain short. If I ask a data scientist an analytics question, what are the core components that the data scientist needs to do?
They need to get some data. They need to transform that data. And they probably need to make a chart. So the first thing that we did to rein in the agents is we asked, how many questions could just be answered with this one set of agent responses? And it turns out, damn, it's pretty impressive
what percentage you can get with just that. And so I don't want to characterize this as, it's simpler than it sounds. But what I do want to characterize is [00:24:00] I am not super bullish on the Devin-style, click go, come back in two days and it's all done. I'm more bullish on the user interacting with the agent pipeline.
The agents are only there to do more things simultaneously and reactively based on one another, but they're all still fundamentally coming from the interactivity paradigm. The user is right there to observe and to reflect. People love to talk about agent reflection. They love to talk about, oh, you have an agent who's supervising, they're just reflecting on the other agents, and they're constantly doing this feedback loop. Well, if you keep those loops tighter and you make it really, really easy for the user to get in there and make the adjustments necessary to get things back on the rails, you have fewer of these death loops.
And you have a lot more, frankly, interaction from the [00:25:00] user. That's great. We've put a lot of effort into making our UX feel very interactive and feel like editing and iterating is super organic. We went through a phase where we were really upset because you couldn't just keep your hands on the keyboard and just keep going. That was frustrating to us from a UX and seamlessness perspective.
[00:25:24] Raza Habib: There's a generalizable UX lesson there. I actually think, you know, a lot of people have made this comment by now, a year and a half on from ChatGPT, but I still think it's underappreciated: one of the things that going from a completion paradigm to a chat paradigm did is it changed the expectation of the user as to whether or not everything should just work first time.
In a chat paradigm, you're kind of expecting to take turns and correct things and be able to iterate on stuff, and in some sense steer the conversation back on rails. Whereas in a sort of one-shot completion paradigm, if the model gets it wrong, you go, oh, well, that sucks. And you go away. [00:26:00] And I guess what I'm hearing is, in a similar vein, if you keep the human closer to the actions of what the agent is doing, such that they're interacting with the outputs more often, you're giving way more opportunity for the humans to steer the agents back on course.
And so, okay, maybe there is this expanding uncertainty or greater probability of failure further down the path, but you give humans more opportunities to fix that. And if you fix the one mistake that would have happened at step 10, now step 13 is fine. And now you've got a very powerful agent.
[00:26:30] Bryan Bischof: And combine that with the reactivity thing that I mentioned before: oh, you see an issue in 10 while 11 and 12 are off doing their work. And you say, hold on, I need to make a change here. You make the change at 10, and then we already know 11 is fine, 12 is fine, 13 is the one that needs to be updated. That's where that...
[00:26:50] Raza Habib: Because you have the DAG structure, the plan affects the outcome. So maybe we can summarize some of the journey you've been on at Hex to get agents to [00:27:00] work. It sounded like there were a few problems that you had to solve to get it to work.
One was choosing carefully the correct set of tools and APIs that you give to the agent, to balance this trade-off between how powerful or general each tool is versus the number of tools you have. And the way that you solved that, and I think the way I've seen other people solve this as well, is to answer the question: what tools are the humans using within our application?
And then mapping that almost one to one, so that the workflows in the existing application provide the scaffolding for what you have. It sounds like building this DAG of the plan and using the reactivity to correct things has been super important. Making the contexts really short so that you keep latency low, and then also keeping the human very close to the steps of the agent in the UX so things can be corrected.
Anything else that you guys think you sort of figured out, any alpha in there that I've not mentioned?
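The reactivity idea described above (change step 10, leave 11 and 12 alone, rerun only 13) can be sketched in a few lines of Python. This is a minimal illustration, not Hex's implementation: the `ReactiveDAG` class, the cell names and the toy functions are all hypothetical, and it assumes cells are added after their parents so that insertion order is a valid execution order.

```python
class ReactiveDAG:
    """Toy reactive DAG: each cell is a function of its upstream cells' values."""

    def __init__(self):
        self.funcs = {}    # cell name -> callable
        self.parents = {}  # cell name -> list of upstream cell names
        self.values = {}   # cell name -> cached output

    def add_cell(self, name, func, parents=()):
        # Assumption: parents are registered first, so dict order is topological.
        missing = [p for p in parents if p not in self.funcs]
        if missing:
            raise ValueError(f"unknown parents: {missing}")
        self.funcs[name] = func
        self.parents[name] = list(parents)
        self._run(name)

    def _run(self, name):
        args = [self.values[p] for p in self.parents[name]]
        self.values[name] = self.funcs[name](*args)

    def update(self, name, func):
        """Replace one cell and rerun only that cell and its descendants."""
        self.funcs[name] = func
        stale = {name}
        recomputed = []
        for cell in self.funcs:  # insertion order == topological order
            if cell in stale or any(p in stale for p in self.parents[cell]):
                stale.add(cell)
                self._run(cell)
                recomputed.append(cell)
        return recomputed


dag = ReactiveDAG()
dag.add_cell("step10", lambda: 2)
dag.add_cell("step11", lambda: 5)
dag.add_cell("step12", lambda x: x + 1, parents=["step11"])
dag.add_cell("step13", lambda x: x * 10, parents=["step10"])

# A human corrects step 10; steps 11 and 12 keep their cached results.
recomputed = dag.update("step10", lambda: 3)
print(recomputed, dag.values["step13"])
```

Only `step10` and `step13` are rerun here, because `step11` and `step12` do not depend on the corrected cell.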
[00:27:57] Bryan Bischof: Yeah, I think the other [00:28:00] exciting sort of realization that we've had is that you'll see things in the literature that you get excited about, like chain of thought. And you think to yourself, okay, everybody's having such success with chain of thought. And so you say, okay, I'm going to put chain of thought in my application. And then immediately you say, actually, that's not really compatible with my situation. And what's been really interesting is how you analogize these results into the setting that you're working in. And so sometimes I find myself with the question, people say like, is Hex using chain of thought?
I'm like, what do you mean by chain of thought? Because if you mean the precise version that's in the paper? No. Are we using tree of thought? No. Are we using chain of density? No. But are we using our own variant of chain of thought? Yes, we are. Is it something that makes sense for [00:29:00] many other people?
Not really. Does it make sense in our paradigm? Yes. It's actually crucial. If I take chain of thought out, our agent pipeline goes significantly down in terms of performance for the planning stage. So it's really interesting how I don't think most of the results that I read in the literature are directly translatable.
There's always this translation layer. And so the other piece of alpha that I feel is just how many times we've had to take existing knowledge and existing resources and really rephrase them and really reframe them. And that's why we have a research part of our organization that's focused on this kind of stuff and translation to make it applied. This is also true of our RAG pipeline.
[00:29:46] Raza Habib: Yeah. And actually one other thing you said that stuck out in my mind was also just how much prompt engineering you had to do, whether that be on the API layers or elsewhere, which was obviously a key part of that success. So you said something just now, though, that I think sort of [00:30:00] takes me into the topic I want to spend a little bit of time on, which is, you know, if you take out chain of thought, the performance drops. And so the natural question is, how do you know?
[00:30:07] Bryan Bischof: Yep. There's a couple of thoughts here. One top of mind is I think every data scientist or ML practitioner feels like evals are a little bit of a new label for an old idea, and I think there can be a little bit of frustration from ML people around evals.
Because we've been writing evals and building data sets to offline evaluate our models forever.
[00:30:30] Raza Habib: Yeah, although I guess one thing I feel is different now compared to before, and tell me if you disagree, is that at least before we were generally doing tasks where it was easier to say there was a right answer, or the generality was sufficiently confined that I could calculate an F1 or a precision or, okay, maybe in some cases I had to do a ROUGE or a BLEU score or something, but rarely did I really feel like the only way to score this was with a subjective judgment, and post LLMs I feel like that much more often.
[00:30:59] Bryan Bischof: [00:31:00] Well, I worked in fashion recommendations. In fashion recommendations, I can promise you the one thing you don't know is, is this a good recommendation for a person? There is no easy way to sit down and say, yeah, this is clearly a good recommendation. I mean, personally, I'll have people send me a piece of clothing and say, isn't this cool?
And I'm like, no, actually I don't like that. And they're like, oh, I'm so surprised you don't like that, I thought I really understood your style. And you're like, nobody understands my style. And the reality is, it's hard. It's really hard. How do you build evals for fashion recommenders?
How do you build evals for coffee recommenders? What about YouTube recommenders? And so I think my rejoinder to people that say, no, it's fundamentally harder, is: yeah, it was always hard. You oversimplified, or you worked downstream of the hard work. And so that's my general feeling on the [00:32:00] topic. I think my more specific feeling on the topic is this isn't necessarily a bad thing. You always have to bite off what you can chew. And so you always have to say, okay, you know what? I can't capture Bryan's style perfectly, but what I can do is understand if I'm in the right neighborhood. Did he click on this item?
Maybe he didn't buy it, but did he click it? I got some attention. Did he buy something that's visually similar? Did he buy something that has this color pattern? What about this silhouette? This is an oversized shirt. If he owns this shirt, maybe he's open to oversized. So we've always been making hedges where we say we will understand if things are good or bad based on analogy, and that is A-OK.
I claim that evals these days are the same, and they should be approached very similarly. I'm really concerned when I [00:33:00] hear people trying to think too holistically about the output, instead of trying to break off little things and assess aspects of the outcome. Let me give you an example. You are designing, you know, your AI agent who's going to talk to your children. Your children are going to ask science questions, and it's going to respond with kid-friendly versions.
Lots of ELI5-type explanations. This is your cool new Y Combinator company. And everybody's really excited. And one of the first things that someone asks you is, whoa, aren't you worried about toxicity? And you're like, oh yeah, I don't want to tell my kids why the sky is blue and then also say some mean things about, you know, people that wear blue clothes. I don't want to be mean to people that wear blue clothes.
Okay, I should put some toxicity evals there. So that's more on the guardrail side. The next question that comes up is, how do you know these answers are correct? They're ELI5, so you [00:34:00] can't just measure them against Wikipedia. A ROUGE score against the Wikipedia article is comically naive.
It's like, let's not do that. Let's try harder. What can we do? Well, the first thing we can do is try to detect some stats. We could have an extractive model that pulls out any statistics that are included in the explanation. And we can see if those are justified by the article, i.e. we can design a binary eval that compares any statistics in the response to statistics that are in the reference article. What else can we do? Well, we can think about what else is great about an ELI5 article. Well, the vocabulary is simple. It has to be for kids. So we could build a model that assesses vocabulary level. That could just be a simple old-school NLP model. And we can just check that that's in the small window that we expect for children of whatever age you're designing for. And these are very convoluted and contrived examples, but [00:35:00] this is the logic that I think goes into building evals.
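The two ELI5 checks described above can be sketched as small binary evaluators. This is a rough illustration only: a regular expression stands in for the extractive model, word length stands in for a real vocabulary-level model, and the function names, thresholds and example texts are all assumptions made for the sketch.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def stats_are_supported(response: str, reference: str) -> bool:
    """Binary eval: every number quoted in the response also appears in the reference."""
    reference_numbers = set(NUMBER.findall(reference))
    return all(n in reference_numbers for n in NUMBER.findall(response))


def vocabulary_is_simple(response: str, max_avg_len: float = 5.0,
                         max_word_len: int = 10) -> bool:
    """Binary eval: crude reading-level proxy based on word length (thresholds are made up)."""
    words = re.findall(r"[A-Za-z']+", response)
    if not words:
        return True
    avg_len = sum(len(w) for w in words) / len(words)
    return avg_len <= max_avg_len and max(len(w) for w in words) <= max_word_len


reference = ("Sunlight reaches Earth in about 8 minutes. Blue light has a wavelength "
             "near 450 nanometers and is scattered more than red light.")
good = ("The sky looks blue because air bounces blue light around more than red light. "
        "Sunlight takes about 8 minutes to reach us.")
wrong_stat = "Sunlight takes 3 minutes to reach us."
too_hard = "Rayleigh scattering preferentially redistributes electromagnetic radiation."

# Each evaluator answers one narrow question; together they form a small suite.
results = {
    "good": (stats_are_supported(good, reference), vocabulary_is_simple(good)),
    "wrong_stat": (stats_are_supported(wrong_stat, reference), vocabulary_is_simple(wrong_stat)),
    "too_hard": (stats_are_supported(too_hard, reference), vocabulary_is_simple(too_hard)),
}
print(results)
```

Each check is deliberately narrow, so a failure points at one specific aspect of the response rather than at the output as a whole.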
[00:35:03] Raza Habib: Right, which is you want to build a family of evaluators that each individually tests one thing that you care about. And in aggregate, they give you an overall picture of how your system is working. You're not relying on any one evaluator to tell you, is the system working well? And that really resonates with me, partly because a lot of these things are going to be in tension with each other, right?
Like, we have people who have evaluators on Humanloop where they're measuring overall performance and also cost and latency, things that are in that mix. Well, you can't have all three go up at the same time. Usually you're trading these things off. And so, yeah, having a family of them simultaneously. Similarly, I think we talk about helpful, harmless, honest kind of trade-offs, right?
With RLHF and things like this. Everyone hates that the models have refusals, or that they're really polite, or they give these condescending lectures. Well, there is a trade-off between the helpful, harmless, and honest sort of boundaries. You can't get all three simultaneously all the [00:36:00] time.
[00:36:00] Bryan Bischof: 100 percent agree. And I think these are Pareto problems, and they will always be Pareto problems.
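One way to make the "Pareto problem" framing concrete is to keep only the configurations that no other configuration beats on every metric at once. The sketch below assumes three metrics (quality, cost, latency) with cost and latency negated so that higher is always better; the configuration names and numbers are invented purely for illustration.

```python
def dominates(a, b):
    """a dominates b if it is at least as good on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    front = set()
    for name, score in candidates.items():
        beaten = any(dominates(other, score)
                     for other_name, other in candidates.items()
                     if other_name != name)
        if not beaten:
            front.add(name)
    return front


# Hypothetical (quality, -cost, -latency) scores; higher is better on each axis.
candidates = {
    "large_model": (0.90, -4.0, -3.0),
    "small_model": (0.80, -1.0, -1.0),
    "small_model_long_prompt": (0.75, -2.0, -1.5),
}
front = pareto_front(candidates)
print(sorted(front))
```

Here the third configuration drops out because the second is better or equal on all three metrics, while the first two remain as a genuine trade-off that someone has to choose between.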
[00:36:05] Raza Habib: And is that how you go about building the evals at Hex then? You have a library of these kind of binary evaluators that are each measuring different categories. And then you're kind of tracking these over time as well.
[00:36:16] Bryan Bischof: I would say it's even a little bit more broad than that, because a lot of our evaluators are not ones that you can apply broadly across, like, hundreds of evals. It's actually that every eval comes with a set of assertions. And some of those assertions are reusable. If you ask the agent, you know, make me a data frame that does X, a very good assertion is: did it generate a data frame, period?
Even before you check if it's the right data frame, is it a data frame? And so there are some reusable ones, but there's a lot of nuance in, is this correct? One of the things that I got introduced to very early when thinking about evals was [00:37:00] execution evaluation, which is: when you're writing SQL or code, executing the result gives you a lot of information about how correct or incorrect it is.
And so we're doing a lot of simulation of what the environment is, and then comparing what the environment looks like before and after, or what the environment looks like with the target code versus what the environment looks like with the agent's prediction.
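Execution evaluation for SQL can be sketched with Python's built-in `sqlite3` module: run the agent's query and a target query against the same small simulated environment, then assert on the results. This is an assumed setup for illustration, not Hex's harness; the table, the queries and the `evaluate` helper are hypothetical, and the target query is assumed to be valid.

```python
import sqlite3

SETUP_SQL = """
CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'west', 5.0), (3, 'east', 7.5);
"""


def run_query(sql, setup_sql=SETUP_SQL):
    """Run sql against a fresh in-memory database; return (rows, error message)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)
    finally:
        conn.close()


def evaluate(agent_sql, target_sql):
    """Two assertions: did the agent's SQL run at all, and does it match the target's rows?"""
    agent_rows, agent_err = run_query(agent_sql)
    target_rows, _ = run_query(target_sql)
    executes = agent_err is None
    # Row order is ignored here; compare unsorted lists if ordering matters.
    matches = (executes and target_rows is not None
               and sorted(agent_rows) == sorted(target_rows))
    return {"executes": executes, "matches_target": matches}


target = "SELECT region, SUM(amount) FROM orders GROUP BY region"
good = evaluate("SELECT region, SUM(amount) AS total FROM orders "
                "GROUP BY region ORDER BY total DESC", target)
wrong = evaluate("SELECT region, COUNT(*) FROM orders GROUP BY region", target)
broken = evaluate("SELECT region, SUM(amount) FROM order GROUP BY region", target)
print(good, wrong, broken)
```

The `executes` check plays the role of the reusable "is it a data frame at all?" assertion, while `matches_target` is the query-specific correctness check.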
[00:37:27] Raza Habib: And what fraction of your time on building these features is spent on the eval part of it?
[00:37:33] Bryan Bischof: It goes up and down. I spend a lot of my personal time on evals, as team lead. And I think the direction that we are coming to is that, in traditional software engineering, you want a clear feature spec. You want to know what to build. The new version of the feature spec for AI features is: these are the evals that I want it to pass.
And it sounds strange. It sounds like some [00:38:00] weird version of test-driven development. But it's so much more powerful than that, because it's: this is what the user asks, and this is what the user gets in response in an ideal case. And then you distill that ideal case down to the important aspects of it, similar to how you do...
[00:38:18] Raza Habib: Right. And I guess that's the right way to think about designing evals: working backwards from, okay, what is the minimum threshold I need on these outputs to usefully do something for a user? Let's figure out what all of those lowest bars are, and then I have a suite of evaluators out of that, almost.
[00:38:35] Bryan Bischof: Exactly.
[00:38:36] Raza Habib: Hearing you talk about building these agents that are able to do data analysis, and Devin and other tools like this, what is all of this going to mean for the future of developers or people like you and me who are machine learning researchers or engineers? What's your vision for how that might play out?
[00:38:52] Bryan Bischof: Yeah. I teach data science at Rutgers, and I sometimes laugh because if my students come out of my class knowing [00:39:00] three things, I think that they should pass the class and I'm totally fine. And the three things are: how to frame a problem, how to frame the data to solve that problem, and how to frame the objective function to solve that problem. If they can do those three things, I don't care if they remember the SciPy syntax. I don't really care if they could draw a causal graph for a particular intervention. And I don't care that much if they remember why you use a two-tailed t-test. I care less about those things.
What I do care about is those framing questions, and that they can do it quickly and that they can do it confidently. The reason I bring this up is because what I find is that for myself and a lot of my friends and peers and colleagues, the part of data science that we love is getting to think deeply about real problems and getting at least closer to answers. Nothing that we're doing with AI [00:40:00] makes that worse. It only makes it better. Distracting me from thinking about how this data should be distributed, and why the distribution looks different than I expect, to write some annoying pandas syntax? Not useful. Like, do I remember the KL divergence formula?
I think I do, but do I remember like well enough to like implement it in code the first try? Probably not. I'm probably going to have to like check a couple of things to nail it. Also, one of my data frames isn't in the right shape. I need to change the shape of that data frame before I treat it as a distribution.
Why is this interesting? It's not. It's never been interesting. What's been interesting is: this distribution is fundamentally different. Are there any factors why? So my claim is that all the work that I'm doing is to make the work more interesting and [00:41:00] not any less important.
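For reference, the KL divergence mentioned above is, for discrete distributions P and Q, D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)). The sketch below is a plain-Python version that also covers the "get it into the right shape first" step by normalizing raw counts; the category counts are invented for illustration.

```python
import math


def normalize(counts):
    """Turn raw counts per category into probabilities that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as dicts."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue            # terms with P(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf     # P puts mass where Q has none
        total += px * math.log(px / qx)
    return total


# Hypothetical counts: what we expected to see versus what we observed.
expected = normalize({"a": 50, "b": 50})
observed = normalize({"a": 90, "b": 10})
kl = kl_divergence(observed, expected)
print(kl)
```

KL divergence is not symmetric, so `kl_divergence(observed, expected)` and `kl_divergence(expected, observed)` generally differ.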
[00:41:02] Raza Habib: Yeah, and maybe another way of arguing this is that we're moving in some sense up the abstraction stack, and we've kind of been doing this forever in programming, right? It used to be assembly, and then you had compiled languages, and now people are in interpreted languages.
And I guess to circle back to the point you were making, the pitch for Hex as the ultimate prompt engineering tool. I think I agree with you for the product you're building, because fundamentally the domain expertise needed for what you're doing is data science and engineering domain expertise.
But what excites me about moving up this abstraction stack, and I think what motivates me at Humanloop, is how can we give non-technical people who are domain experts in their space the right tools to be able to do this as well. If you are the lawyer who's doing prompt engineering, what does that look like, right?
Because you need some way, and it's never been possible for a lawyer to build an AI product before, but all of a sudden it actually is. And I think, yeah, in some [00:42:00] sense, if you take away this kind of detailed stuff of what's the data frame and what's the right syntax for SciPy, and you allow people to operate at the level of, well, what are my goals and how am I framing the problem,
and what does success look like, you massively expand who can be involved in this process.
[00:42:15] Bryan Bischof: Yeah, I do think creating more open doors for people to get involved in this process is going to lead to a lot of success. I think more and more product managers are going to be sort of harangued into writing evals than we would necessarily think a priori.
[00:42:30] Raza Habib: I mean, that's certainly been my experience. It's increasingly PMs who are leading a lot of this work. Maybe one final question that we can end on, which is: I think you've seen in your career a lot of different paradigms, right? So the latest round of LLMs is just one new one, rather than the same shock that it might've been to people who maybe don't have the background.
And so I guess my question is, for people who are new to the space, and maybe LLMs are [00:43:00] most of what they know about machine learning or AI, what tools from before or what concepts might be underappreciated that people should be aware of, that they might be missing?
[00:43:10] Bryan Bischof: When I talk to AI engineers who come from software, I hear them talk a lot about systems and reliability. Building products, you have to care about that stuff. When I talk to AI engineers who come from data science, they're like, I look at the data so much.
I'm always looking at the data, constantly looking at the data. I'm looking at the inputs and looking at the outputs. I'm trying to figure out what's going on all the time. And I think that is the sort of alpha that people with a lot of experience in data science and machine learning come at this problem with: you will answer most of the hard product questions by looking at the data.
And the data is going to be generated as you go. It is going to be generated by the users, and it's [00:44:00] gonna be generated by the agent. And you will generate so much alpha for your own product by staring at that data. I have a friend, Eugene, he spends, like, hours every week just looking at the data of the application that they've built. My team,
we now have a weekly recurring evals party, an evals slash looking-at-the-data party. This was inspired from talking to one of the AI managers at Notion. This is the way: you have to immerse yourself and you have to just get absolutely filthy in that data. That's the alpha, in my opinion, from people like me.
[00:44:38] Raza Habib: Fantastic. Yeah, that really resonates. I mean, it was literally the first thing that we built at Humanloop, logging and observability, right? Because a lot of people dump the logs of their inputs and outputs somewhere with the view that, at some point, we'll make them useful.
But they don't serve it up in a way that actually allows them to frequently incorporate that into their day-to-day work.

[00:45:00] Cool. Well, Bryan, it's been an absolute pleasure. There are a ton of golden nuggets in there for people who are trying to think about building AI products to take away. I'm sure we could have filled a second hour, but I really enjoyed it. So thanks very much.
[00:45:12] Bryan Bischof: Thanks a lot. Thanks for having me. It's super fun.
That's it for today's conversation on High Agency. I'm Raza Habib, and I hope you enjoyed our conversation. If you did enjoy the episode, please take a moment to rate and review us on your favorite podcast platform, like Spotify, Apple Podcasts, or wherever you listen and subscribe.
For extras, show notes and more episodes of High Agency, check out humanloop.com/podcast
