Chain of Thought | AI Agents, Infrastructure & Engineering | Every AI Agent Has an Evaluation Gap

Snorkel CEO Alex Ratner maps the evaluation gap blocking AI agents from real enterprise work, walks through the company's $3M Open Benchmarks Grant, and explains why pure 'environment' vendors don't actually understand how AI works. Chain of Thought is hosted by Conor Bronsdon.

Show Notes

Alex Ratner co-founded Snorkel AI out of Chris Ré's Stanford lab and helped establish data-centric AI as a field. Today, Snorkel is a $1.3B company shipping thousands of data sets and environments a week to frontier labs and vertical AI teams like Harvey.

In this conversation, he argues our ability to build AI agents has outpaced our ability to measure them. That gap is what's keeping most enterprise agents stuck in demo purgatory.

If you can't measure it, you can't improve it. And you can't deploy it.

In this conversation:

The three axes of the evaluation gap: input complexity, autonomy horizon, and output complexity
Big Law Bench: how Snorkel and Harvey benchmarked legal agents on deep-research tasks that take lawyers 10-15 hours
What Snorkel's $3M Open Benchmarks Grant is funding, and why "benchmaxxing" critiques don't kill the case for public benchmarks
Why 40-50% of Snorkel's data work is still review and labeling, even with the best models in the loop
The "expert-agentic" era, where domain expertise (law, finance, coding, even woodworking) is the new bottleneck
Why self-supervision is a dead end outside narrow cases like distillation
The false dichotomy between data and environments, and why pure-environment vendors miss how AI actually works

Chapters

(00:00) Intro: Alex Ratner and Snorkel AI
(02:50) What the evaluation gap actually is
(06:05) Moravec's paradox and the jagged frontier
(08:46) Where AI agents fall down in enterprise work
(10:40) Big Law Bench: benchmarking Harvey's legal agents
(12:00) The three axes: input, autonomy horizon, output
(18:31) Snorkel's $3M Open Benchmarks Grant
(22:33) From "janitorial" to epicenter: 15 years of data-centric AI
(29:26) The expert-agentic data era
(34:54) The false dichotomy between data and environments
(40:05) DoorDash Tasks and expert data at scale

Connect with Alex Ratner:

X/Twitter: https://x.com/ajratner
Snorkel AI: https://snorkel.ai

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

More episodes: https://chainofthought.show

Thanks to Galileo — download their free 165-page guide to mastering multi-agent systems at galileo.ai/mastering-multi-agent-systems

Creators and Guests

Host

Conor Bronsdon

Creator and Host of the Chain of Thought Podcast | Technical Ecosystem Lead at Modular

What is Chain of Thought | AI Agents, Infrastructure & Engineering?

AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead.

Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes weekly.

Conor Bronsdon is an angel investor in AI and dev tools, Technical Ecosystem Lead at Modular, and previously led growth at AI startups Galileo and LinearB.

Disclaimer: All views, opinions and statements expressed on this account are solely my own and are made in my personal capacity. They do not reflect, and should not be construed as reflecting, the views, positions, or policies of my employer. This account is not affiliated with, authorized by, or endorsed by my employer in any way.

FINAL TRANSCRIPT
================
Speakers: Conor Bronsdon, Alex Ratner
Duration: 42:39
Total Words: 7104
Generated: 2026-04-22

---

[0:00] Alex Ratner:
If you want your model to do something different, you don't go and perform brain surgery on the neurons. You don't reinvent the architecture of the algorithm. Let me go and invent a new post-transformer architecture. No, you get data.

[0:19] Conor Bronsdon:
Welcome back to Chain of Thought, everyone. I am your host, Conor Bronsdon. My guest today is Alex Ratner, co-founder and CEO of Snorkel AI. You have maybe heard of him. They have done quite a few cool things over at Snorkel. They actually started as a research project focused on programmatic data labeling. Seven years in, though, the company has evolved considerably, achieving a valuation of, I think, $1.3 billion in 2025. And Alex's team is now doing some of the most interesting work in AI evals, benchmark design, model weights, and data set labeling, including a major Open Benchmarks grant program with partners like Hugging Face, Together AI, and PyTorch. And Alex also did his PhD at Stanford, where he built the Snorkel open source project and helped establish data-centric AI as a field. He, I believe, is an affiliate professor of computer science at the University of Washington. Go Dawgs! That's my alma mater. Alex, welcome to Chain of Thought.

[1:13] Alex Ratner: [OVERLAP]
Well, thanks so much for having me. And I'm an affiliate assistant professor. So I, you

[1:17] Conor Bronsdon: [OVERLAP]
Ah!

[1:18] Alex Ratner: [OVERLAP]
know, in startup world, titles are all made up. In academic world, the titles have a little bit more weight behind them, right? So I,

[1:25] Conor Bronsdon: [OVERLAP]
Very, yeah.

[1:26] Alex Ratner: [OVERLAP]
but I was fortunate enough to get some time there. UW's amazing. And then, you know, Snorkel sucked me into the vortex a couple of years back. But

[1:35] Conor Bronsdon: [OVERLAP]
It makes

[1:35] Alex Ratner: [OVERLAP]
excited

[1:35] Conor Bronsdon: [OVERLAP]
total sense.

[1:35] Alex Ratner: [OVERLAP]
to chat today. Thanks for having me on.

[1:36] Conor Bronsdon:
Yeah, super excited for our conversation. Before I dive too much farther in, though, I do want to do a quick thank you to our presenting sponsors, Galileo. They were recently acquired by Cisco. Congratulations to them on that. Very excited to see how their eval intelligence framework continues to evolve. Check them out at Galileo.ai, and congratulations again to Cisco. You can actually check out a lot of Cisco's strategy and their thought process, I presume, behind this acquisition from my conversation with Jeetu Patel on the podcast last year. Jeetu is CPO and president of Cisco, and I think if you're interested in the long-term thought process of Cisco and its leadership around how AI is taking shape, a lot on inference, a lot on compute crises, that's a great one, plus some great leadership tips. But we definitely want to talk compute and inference today, but I think it's important to stop, talk maybe about what's driving a lot of that inference desire. And that is in large part agentic AI. And there is an evaluation gap that has formed in AI agents. Alex, you recently published a framework for what you're calling the evaluation gap in agentic AI. Walk me through what you're seeing and maybe break down how you're seeing companies fail to accurately evaluate agents in doing real work.

[2:50] Alex Ratner:
Yeah, well, also congrats to Galileo and Cisco. Pretty exciting news and great teams. Yeah, so I think the, I mean, I'll talk about the phrase and then I'll talk, maybe I'll start actually just talking about some of where we think this evaluation gap is

[3:08] Conor Bronsdon: [OVERLAP]
Yeah.

[3:09] Alex Ratner: [OVERLAP]
kind of occurring, just because that's a lot of what I think everyone is curious about and we're super curious about today, including with the Open Benchmarks Grant that you mentioned. I'll start with the phrase. This is just a comment basically, or a hypothesis that I think many share, that our ability to do stuff with LLMs, with AI models and AI agents, has actually significantly, more significantly than people realize sometimes, outpaced our ability to measure their capabilities. And to measure is to define. So I would really say, you know, to both define and measure those capabilities in precise ways. Pardon me. So I think, you know, what does this mean? I mean, if you're kind of messing around with an agent and you're trying kind of, you know, coding projects or, you know, OpenClaw style things, like, I guess, let's say OpenClaw with proper isolation from your bank accounts, you know, it's okay that there's a jagged frontier of capabilities. I think when you get to the enterprise setting, when you get to high impact use cases that have real value in the upward direction and real risk or cost of errors in the downward direction, this evaluation gap potentially stops you from actually doing anything with AI responsibly. And of course, this evaluation gap is also about how you advance AI, whether you're talking about the AI labs that are kind of pushing on the general frontier capabilities, or you're talking about a vertical specific or enterprise specific team trying to build in their swim lane. If you can't measure, you can't improve, right? And you can't deploy safely. So, you know, I think that's something to emphasize. And I think part of why this evaluation gap occurs, you know, there's another phrase that has been used for quite a while now, this idea of a jagged frontier, this idea that the frontier of AI capabilities, kind of where it works and where it doesn't work yet, is somewhat kind of jagged and unpredictable. 
And I think that's actually something that's been true with AI for a long time, but it's just even more true today because the surface area of AI capabilities and the complexity of those capabilities has grown so insanely. There's actually, you know, this old thing called Moravec's paradox, which is this idea that we often conflate kind of what's hard for humans and what's hard for AI. Back in the 80s, one of my favorite stories, and I promise I'll arc this back to the framework, but, you know, back in the 80s, they'd had like one of the first kind of academic summer programs to think about AI, and they had assigned all of what's called computer vision to a single PhD intern for their summer project.

[6:03] Conor Bronsdon: [OVERLAP]
How

[6:03] Alex Ratner: [OVERLAP]
Now,

[6:03] Conor Bronsdon: [OVERLAP]
things

[6:04] Alex Ratner: [OVERLAP]
I'm

[6:04] Conor Bronsdon: [OVERLAP]
change,

[6:04] Alex Ratner: [OVERLAP]
sure, yeah,

[6:04] Conor Bronsdon: [OVERLAP]
man.

[6:05] Alex Ratner:
I mean, that's a multi-billion dollar academic endeavor, research endeavor, industry of computer vision. How do we deal with image and video data automatically? But at the time they figured, hey, that's, even babies can see objects. That must be straightforward. We're gonna go work on long form math and chess and Go and things like that. So this idea that what's easy, what's hard for humans is gonna be hard for AI and vice versa, often leads you astray. We see it today with things like, look at how LLMs have crushed math and coding competition problems that most of us would struggle with. And yet, you can ask a coding agent to do a much more mundane-seeming task, but one that has messier inputs and messier outputs and a longer set of steps required, and it might fall down. And that has a lot to do with how models are trained with reinforcement learning. And there are reasons that you can explain and predict this, but it's counterintuitive, right? So if I can summarize, the fact that AI capabilities have been advancing so rapidly beyond our capability to measure them, the fact that measurement itself is getting so difficult because of how many capabilities there are and how complex they are, and the fact that the frontier capabilities are so jagged and counterintuitive, all means that there's this real problem and real need to get better at doing evaluation. So that's what we mean by the evaluation gap. Now, I'll just pause there for a second, because I went off on a terrific ramble about that. But I could go on and talk about some of the most interesting areas, we think, in terms of addressing that gap and how to address it. But I'll pause for half a second.

[7:57] Conor Bronsdon:
Yeah, first of all, fascinating. Love it. The example of the computer vision student, and that's the person who's holding this whole program together, compared to the multi-billion dollar industry, it might be in the 10s or 20s now, I have no idea.

[8:10] Alex Ratner:
Subtle yeah.

[8:10] Conor Bronsdon:
Today, it could be 100 billion. It is crazy to see that growth. And yet, we did solve chess first. To your point, chess has been kind of solved for a while. It took us a lot longer to get really good image gen, video gen, let alone image recognition. So I think this is a great example here. And yeah, bring it home for me. Where are you seeing those gaps today? Like, what are we getting wrong that we think, you know, humans are great at or terrible at and then assuming a model is going to do? And how does that then apply to, as you put it, the gap that we have here?

[8:46] Alex Ratner:
Yeah, and I mean, this is what fascinates us every day. So I guess at Snorkel,

[8:54] Alex Ratner:
the thing that we do every day is we're building, we're developing data sets and environments for doing evaluation, doing tuning, doing reinforcement learning. And so we work with pretty much all the major frontier labs, we work with vertical AI teams at places like Harvey, and we work with enterprises who are trying to get more specific or use case specific agents to work. We're constantly trying to actually do this. We're producing thousands of data sets and environments every week or every day sometimes for most of the well-known coding agents out there, for example. And most of the data environments we produce are only valuable if they expose gaps where the models are challenged. In fact, sometimes that's the only thing we get paid for. So we're not a staffing firm that's just kind of throwing people over the wall. We're actually working with the research teams and using humans plus technology to probe this boundary of exactly where there are gaps. So I can't share anything specific to customer work because it's obviously data we keep very private. But broad strokes, trying to find this boundary of where does a coding agent, where does a clinical co-pilot, where does a legal deep research tool fall down, and where does it struggle? Finding that frontier is what we do every day. So

[10:28] Conor Bronsdon: [OVERLAP]
So

[10:28] Alex Ratner: [OVERLAP]
taking a step

[10:28] Conor Bronsdon: [OVERLAP]
on

[10:28] Alex Ratner: [OVERLAP]
back

[10:28] Conor Bronsdon: [OVERLAP]
that

[10:29] Alex Ratner: [OVERLAP]
from...

[10:29] Conor Bronsdon: [OVERLAP]
frontier, I have a quick question. So you mentioned Harvey. My understanding is Snorkel worked with Harvey to build a new benchmark. I think it's called Big Law Bench,

[10:39] Alex Ratner: [OVERLAP]
Yes,

[10:39] Conor Bronsdon: [OVERLAP]
particularly for

[10:39] Alex Ratner: [OVERLAP]
yeah, yeah.

[10:40] Conor Bronsdon:
evaluating their work.

[10:41] Alex Ratner:
Yeah. And we're partnering with these teams. This is all credit to the Harvey research team, and our research and data team collaborates with them. But as they published about in their post, this was really looking at more complex kind of agentic settings. And I'll get back to this with kind of a framework of like, where are we seeing the frontier, just super broad strokes, but it was a, you know, it's a legal deep research style set of tasks where, you know, you're trying to answer a complex question with a complex, you know, sometimes multi-page output and tool use is required. So looking up information, looking up citations, et cetera. So, you know, this is, you know, kind of the deep research format of task was the style of task that we were examining in the benchmark, but really looking at it in a really complex domain. You know, these are tasks that might take a human, you know, five, 10, 15 hours to complete, right? With multiple tools, multiple citations, complex reasoning, complex rubrics, or kind of criteria that need to be met to consider the output a success. So that's an example of a kind of benchmark today that reveals a place where even the best LLMs struggle. So taking a step back, what are the bigger kind of axes here? We put out a post along with our Open Benchmarks Grant, where we're not just doing this internally, we're also trying to support and collaborate with and fund some of the great academic and open source teams working on this kind of evaluation problem. We kind of compiled it into three axes that we think describe some of the big forward directions. So number one is environment or input complexity, number two is autonomy horizon, and number three is output complexity. So in a nutshell, and let's take coding, for example, let's take a programming task or software engineering, kind of more to the point of where the axes lead. 
Environment complexity is about how much input you have to take into account, both at the start of a task and as you're doing a task, right? So a simple environment would be, or simple input and environment would be no environment, just here's a competition coding problem, right? Might be one or two sentences. Again, it could be a really, really tough problem, but it's really local context. That's all you need to solve the problem, right? We think one kind of obvious but very critical direction, and messy and broad and challenging, is making the input and the environment much more complex. So imagine the opposite end of that spectrum might be something that a software engineer does at a company where you have to know all these different code bases, internal tools, internal guidelines. You've got JIRA tickets, you've got input on Slack from your PM and business colleagues about design specifications, you have a massive code base to look through. That's an example of a complex input, a complex environment that looks more like the real world, right? So going from a really self-contained written task to a complex environment input that you need to use as context, that we think is one of the most important kind of directions to go. And to be clear, this is the evaluation gap again. People are using agents in these complex settings, but we are not yet really measuring in a scientific way the fullest extent of this kind of axis. So the second axis is autonomy horizon, which is, you know, we could have picked a less fancy name, but really just like number of steps to do a task, right? And then other attributes we kind of lumped into this axis, like non-stationarity, like do the goals change? Our chief scientist, Fred, someone in his lab at University of Wisconsin put out a benchmark that has an awesome name, SlopCodeBench.

[14:43] Conor Bronsdon: [OVERLAP]
Oh,

[14:44] Alex Ratner: [OVERLAP]
This is how,

[14:44] Conor Bronsdon: [OVERLAP]
I've seen that.

[14:45] Alex Ratner:
I mean, this is a great name. Academic projects, see the name Snorkel, should have names like that. But I also do think it kind of almost undersells the

[15:01] Alex Ratner:
scope of what that benchmark is tackling, which is the fact that most people, when they're coding with a coding agent as a co-pilot or kind of moving autonomously as a full agent, they're gonna be iteratively evolving objectives, right? So this second axis is about how, in the simplest setting, here's a coding challenge question, output a single correct answer that we're gonna unit test, or here's a multiple choice medical question or legal question, output A, B, C, or D. Multi-step is one where you have to, you know, have multiple turns with the user to maybe take into account shifting specifications and objectives, multiple steps of reasoning, multiple steps of tool use, et cetera, right? So obviously this kind of like horizon of how many steps are needed is one that many people are talking about. These long horizon tasks are becoming more common, but really measuring it to the length of a real-world task is still at the frontier. And you can see there are very few benchmarks, publicly at least, that test this. Most of them are very much unsaturated. And I think there's a lot left to do to really scientifically and comprehensively probe that one. And the third one is output complexity, which is, you know, when you're outputting a really complex output, it makes the evaluation of that output even more complex. So let's go back to that simple example of a coding challenge question, or let's take any of it, like a math question. Okay, here's a super tough math question. Everything you need to know is one paragraph. So super, no environment, super simple context. And then the output is A, B, C, or D, or it's a single number. Well, that's really easy to check. You just check it against the answer key. But what about a deep legal research output where it's a 30-page PDF with 50 citations? How do you verify if that's good or not? You know, that's tough for humans to do, right? 
So that is another direction, the third of the major directions where we're seeing a real evaluation gap, but also, you know, a lot of work to close that gap. This is where you hear people talk about rubrics that kind of, you know, list out all these different kind of ways to check if the answer was correct, for human or automated validation. This could be, in coding settings, not just unit tests, but also style checks or, you know, milestone objectives, et cetera. So anyway, I went on long, but I guess to summarize, you know, this evaluation gap we think has kind of these three major axes, and this is, you know, complexity of environment input, autonomy horizon, and complexity of output. These are things where, you know, we're improving AI agent capabilities, but we need to catch up on, you know, being able to measure precisely for these types of tasks. And we're behind.
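
A minimal sketch of the rubric idea described above, assuming a rubric is simply a list of weighted pass/fail checks applied to a long-form output. The names `Criterion` and `score_output`, the weights, and the example checks are hypothetical illustrations and are not taken from Snorkel's or Harvey's actual tooling; real rubrics for legal research would rely on expert review or far richer checks than string matching.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric line: a name, a weight, and a pass/fail check (all hypothetical)."""
    name: str
    weight: float
    check: Callable[[str], bool]


def score_output(output: str, rubric: List[Criterion]) -> float:
    """Return the weighted fraction of rubric criteria that the output satisfies."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


# Toy rubric for a short research memo; the checks are deliberately simplistic.
rubric = [
    Criterion("cites at least two sources", 2.0,
              lambda t: len(re.findall(r"\[\d+\]", t)) >= 2),
    Criterion("states a conclusion", 1.0,
              lambda t: "conclusion" in t.lower()),
    Criterion("stays under 200 words", 1.0,
              lambda t: len(t.split()) <= 200),
]

memo = "The clause is likely enforceable [1]. Conclusion: proceed, subject to review [2]."
print(score_output(memo, rubric))
```

Unlike an answer-key comparison, each criterion here can be checked by a human reviewer, a unit test, or an automated grader, which is what makes rubrics a fit for multi-page outputs.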

[17:58] Conor Bronsdon:
Yeah, we need to be able to tell what's working because not only will it inform our ability to actually ship things to production instead of having them be held back, as I'm sure many of us have experienced with agent demos, but also it's going to inform heavily how we build our next layer. How do we iterate? How do we actually explore this space? What do we need to do better? And a big element of Snorkel's approach to this is this Open Benchmarks grant program we've mentioned a couple times. Can you tell us a bit more about what you're looking for from that program and what it is?

[18:31] Alex Ratner:
Yeah, so we're super excited about this. This is

[18:36] Alex Ratner: [OVERLAP]
basically a grants program for teams that are working on these benchmarks in open public settings that address... We're very interested in anything that pushes the frontier on one of these three major axes, but we're also open to all kinds of other ideas that we haven't thought of yet in terms of just building new benchmarks in some open source way. So we put up $3 million to start in terms of initial funding. We're hoping to significantly increase the grant as we proceed this year. And we're super excited. We'll have more announcements soon, but this is the kind of thing that's led us to have great collaborations with the Terminal Bench team on T-Bench 2.0, for example. I guess there are a bunch of others that we'll announce imminently, so I'll hold my tongue there, but just some really, really cool projects. And we think that it's really important to push this, you know, closing the evaluation gap. And one component of that is with these, you know, open public benchmarks that we think really serve as transparent guideposts for the field. Now, I will note, public benchmarks are not the only tool. And you see almost some like, you know, there's almost a backlash against some public benchmarks. And if you've seen the term benchmaxxing, maybe I'm spending way

[20:07] Conor Bronsdon: [OVERLAP]
Oh,

[20:07] Alex Ratner: [OVERLAP]
too much

[20:07] Conor Bronsdon: [OVERLAP]
yes.

[20:08] Alex Ratner: [OVERLAP]
time on the nerdiest

[20:09] Conor Bronsdon: [OVERLAP]
No,

[20:09] Alex Ratner: [OVERLAP]
corner of

[20:09] Conor Bronsdon:
yeah.

[20:10] Alex Ratner:
X. And this idea that people are kind of overfitting to benchmarks and

[20:19] Alex Ratner:
it's like, okay, studying for the test versus actually learning the subject. And look, the thing I would say about that is that, I mean, this should be kind of obvious, but like the fact that people overdo and overfit to something does not diminish its value as a tool. So we are full-throated believers that these benchmarks are critical, especially out in the open, for pushing the field forward. They just obviously can't be the only tool. You need to measure in a number of ways, use case specific, private benchmarks, public benchmarks, but they are a critical piece of publicly measuring and inspecting where does AI work versus not, and where is the next frontier to build along.

[21:48] Conor Bronsdon:
Talk to me about how this benchmark approach, which I agree, I think it's important to have public benchmarks. I love that there is a consortium that is pushing for more open benchmarks. But how does this track back to like the original vision of Snorkel and this idea of data-centric AI? You know, you have this through line I can see of, OK, we want measurement to be clear. We want to enable progress. But when you started, data-centric AI was barely a term. And now you've kind of expanded it. And yet at NeurIPS last year, there was an entire section of the conference floor dedicated to it. What's changed about how the field thinks about data during your time in it? And where do you see the conventional wisdom still being wrong?

[22:33] Alex Ratner:
Yeah. So I guess I'll go back to the beginning. It's been a long journey at Snorkel, and it's only just getting started. I think

[22:44] Alex Ratner:
what we're doing today is

[22:51] Alex Ratner: [OVERLAP]
quite a bit different than what we were doing when we started. The through line is data. I mean, we're quite excited about the kind of work I was describing of all the data environment development and benchmark development. And commercially, or in terms of top line, by end of quarter, we'll grow at about 10x over the last eight or nine months, so we're excited. But it is quite a journey from kind of where we started, which was, to your point, very much around data labeling and this initial idea of data-centric AI. But actually, a lot of that is completely carried through. So let me go back to that. When we started, I mean, we started, this is gonna now make me sound like the least cool of dinosaurs and I'll never get invited to San Francisco AI

[23:38] Conor Bronsdon: [OVERLAP]
You

[23:38] Alex Ratner: [OVERLAP]
parties

[23:38] Conor Bronsdon: [OVERLAP]
did mention

[23:38] Alex Ratner: [OVERLAP]
ever again.

[23:39] Conor Bronsdon: [OVERLAP]
the 80s

[23:39] Alex Ratner: [OVERLAP]
Yeah,

[23:39] Conor Bronsdon: [OVERLAP]
earlier, so...

[23:41] Alex Ratner: [OVERLAP]
I already mentioned the 80s. You know, I had a birthday a month or so back, and my daughter, who's six, guessed that I might be 103. I'm really, you know, which is very cute. So we've been working in this area for almost 15 years. So when we were starting back in our co-founder Chris Ré's lab at Stanford, and this is when I was starting my PhD, the idea of working on something involving data in an AI lab was not at all on the table. I remember one of my buddies, still a great friend, kind of taking me aside and saying, are you sure you want to work on data in an AI lab? You're never going to publish as an academic. Because that's how data was viewed back then. It was, I mean, you know, at the time, you know, we were talking about machine learning, machine learning still technically being the subset of AI that we're all working in. Definitionally, all the people just started saying AI because it's,

[24:43] Conor Bronsdon: [OVERLAP]
Just

[24:43] Alex Ratner: [OVERLAP]
you

[24:43] Conor Bronsdon: [OVERLAP]
don't

[24:43] Alex Ratner: [OVERLAP]
know,

[24:43] Conor Bronsdon: [OVERLAP]
call it that anymore.

[24:45] Alex Ratner: [OVERLAP]
yeah.

[24:45] Conor Bronsdon: [OVERLAP]
Yeah.

[24:46] Alex Ratner:
But, you know, basically that means, you know, automation AI that is learned from data. So obviously data was always important. But the preparation, the creation, the curation, the labeling of that data when we started was viewed as upstream, it was viewed as janitorial, it was viewed as not our problem. And I think there was some work that was starting to emphasize how important it was, like the work of Fei-Fei Li on ImageNet and other data sets. People were beginning to realize, hey, the data actually, like, leaps forward in the data actually move the field more than leaps forward in the algorithm or the model architecture, et cetera. But still, it was not the subject that people thought of as part of AI and part of how you do AI. So we started, you know, and we were always very kind of, you know, quote-unquote, customer-centric. And back then, it was with our academic, you know, medical, you know, nonprofit collaborators at the lab. We started just realizing that every time we come with a new fancy algorithm, a new fancy model architecture, look, I have this nicely theoretically bounded joint Markov inference, something, something. And they'd be like, yeah, we don't really care, but could you help us label the data? Because that's what's blocking us. So that was kind of the first inkling of that. And the more we got in, you know, down that rabbit hole of saying, hey, why don't we try to work on helping with the data problem? And at the same time, also, you know, deep learning was just starting to become a thing. We're seeing these increasingly powerful, but black box model architectures, increasingly powerful, increasingly standardized or black box. We started to come up with this hypothesis that, hey, maybe data is actually gonna shift from being this kind of upstream, not our problem thing, to being the complete epicenter of AI development. And that's kind of where the world is today, more than we ever could have imagined. 
Like 95% of teams, I don't know what the right stat is, but something like that, whether it's at a frontier lab or at an enterprise, in an enterprise use case setting, if you want your model to do something different, you don't go and perform brain surgery on the neurons. You don't reinvent the architecture of the algorithm. You don't say, hey, I want my coding agent to be better at COBOL and hospital systems. Let me go and invent a new post-transformer architecture. No, you get data. You get data to help you measure where it's falling short, and then you get data or environments to help tune it to be better. So this idea of data being the primary tool for developing AI, back then, 15 years ago, it was a very out there concept, but we started talking about it as data-centric AI development. This idea that everything's important, but if deep learning and LLMs and everything that we were seeing germinate starts to take off, our hypothesis was that the data was going to be not just a limiting reagent or bottleneck, but actually the center of how you develop AI. Measure, monitor, improve. And that kind of is where the world's gone. So that's kind of like the through line. And we started off on data labeling. Data labeling is still a huge part of what we do. And that may not sound intuitive right away, but when you're building a

[28:05] Alex Ratner:
coding agent data set, let's say, you have to do a lot of data generation, you have to do a lot of environment generation, you have to do a lot of curation. But almost like 40, 50% of the time that goes into creating a data point or an environment, at least at the quality bar that we hit, which we know is uniquely high, but is the one that's valued by our customers, goes into review and quality control, which is basically labeling. You take all the outputs and you're just trying to check if they meet the unique spec for the benchmark or data set you're building. And so actually, not only has that through line of data being the center of how you build AI kind of carried throughout the 15 years of the academic projects and the last six years of the company, actually, a lot of the labeling work, along with many, many other pieces that go into it, has remained a key part of our data factory, so to speak.

[29:05] Conor Bronsdon:
What does the next build out for that data factory look like? Because I think a lot of folks are familiar with this idea of data labeling, and many of my listeners have probably been quite involved in it. But there is a new era that feels like it's dawning around how we're going to be providing data to the next stage of AI development.

[29:26] Alex Ratner: [OVERLAP]
Yeah, I mean, we've been using this phrase for kind of this next phase, it's kind of like the expert-agentic phase, because if you look at both the capabilities and then the corresponding data that you need to measure, or evaluate, and then improve, tune, reinforce, and learn, et cetera, models in this upcoming setting, a lot of it is about expertise. It's not just 15 seconds per click, query the crowd anymore. It's law, finance, gardening, woodworking, coding. I mean, the surface area is immense, and it's only gonna grow as we try to use AI in literally every part of life to improve and accelerate. But it's some kind of expertise, and it's also increasingly agentic, which is, again, that more complex input, complex output interaction, not just a prompt and a response. So that's the kind of data, and I say data, but I mean data and environments. People talk about RL environments, and I'm kind of blurring those together, because they kind of go together in a payload. I think it's a false dichotomy, which I could separately get into, but I'll say data and environment development. That's kind of where it's aimed. And I'd say at a high level, and this is also a through line throughout all of our last 15 years of research and development, the key idea, and this may sound stupidly simple, is really just this idea that you can't just throw people at it if you wanna hit the scale and quality and complexity that moves the field forward, but you also can't just use push-button automation. You have to have sophisticated ways of having kind of human or expert in the loop, but AI-accelerated processes for all of the core operations of generating data, generating environments, curating those, reviewing those, et cetera. And this is really the core of our research over the last 15 plus years.
And so I know that's a simple idea, but just to double click on this, the way that the data industry worked, and if you look at the legacy players out there that we're replacing in commercial settings, they all started around these tasks that were like, let's build up this massive network of hundreds of thousands or millions of people we can throw a picture of a traffic light at, and they'll label green or red, or pedestrian or car, or this LLM response sounds good or sounds bad. That era is done, that data is not valuable anymore. The frontier for the next five, 10, 15, whatever years, as far as I'll dare to predict in this crazy world, is all about expert-agentic data, where you need, like in the Harvey example, a lawyer to sit down for 10, 15 hours if they're doing it by hand and probe this very counterintuitive, jagged frontier where even the experts may not guess correctly on their own where the model's gonna have trouble, right? So throwing humans at it is just not enough. Even the smartest humans in the world, it just doesn't hit that bar. If you look at the other extreme though, you know, I've worked in the field academically and commercially of synthetic data for, yeah, almost 15 years at this point. And, you know, it never ceases to amaze me how often you still have these kind of like snake-eating-its-own-tail, free-energy-machine pitches of like, oh, we'll just ask the LLM to generate its own data. Self-supervision, we can go into the weeds on this one optionally, but there are valid cases where you just use an LLM to generate data. One of those is called distillation, where you're using a bigger model to train a smaller model. There's an idea, a very old topic called self-supervision, that can help a little bit. But by and large, the basic intuition is like, you can't go, you know, hey, student, here's a subject matter that you have no knowledge in.
Like, maybe I'm gonna get this wrong, because I don't know your background, Conor, but like, hey, Conor, you don't know anything about, like, ancient Sumerian theology, but can you just write yourself a curriculum

[33:50] Conor Bronsdon: [OVERLAP]
You would be surprised, actually.

[33:51] Alex Ratner:
Okay, so I picked wrong, but, you know,

[33:56] Alex Ratner:
okay, Alex, you don't know about ancient Sumerian theology. I have no idea why that popped into my head, and I miraculously got it wrong here. I'm going to ask you more about that at some other time. You know, go write a curriculum, go write some test questions, go teach it to yourself. That's not going to yield good results. So you can't just say, hey, LLM, go generate your own data and teach yourself. You're not going to fill the blind spots. That data does not have value. The data only has value if there's some human expertise, some new expertise getting put in, but you can't just throw humans at it. So the whole game, and this has been our kind of research area from another angle for 15 plus years, is how do you put human experts in the center, but accelerate every single part of this data and environment kind of factory so they can be more effective, higher quality, et cetera. So that's what we work on in a nutshell.

[34:54] Conor Bronsdon:
Well, okay. Before I sidebar us here, you had a couple of really interesting statements that I want to double click on. So one, you mentioned that you perceive there to be a false dichotomy between environment and data in these conversations. Can you tell me more about your viewpoint there?

[35:14] Alex Ratner:
Yeah, I mean, I think,

[35:17] Alex Ratner:
so here's why I think it's a bit of a false dichotomy: first of all, I think, you know, almost everything that we're producing, not everything, but quite a large portion of the kind of frontier of data and environments, is gonna be both. So, you know, we do a lot of volume of stuff where we're not specifically producing an environment, but I just think in general, you know, the more agentic stuff is, meaning the more that you're trying to help with evaluation and tuning of agents that interact with an environment, the more that an environment is just kind of part of it. And when you have an environment, you also have what we traditionally would call data. So let's just use a metaphor that is not really even a metaphor, but just imagine like a standardized test, like the SAT or something. That's where we were like a year or two ago, like multiple choice questions, right? And so what did you need for a full data payload to evaluate or teach a model? You needed a bunch of questions, you needed the answer key. That was kind of it, right? Questions, answers.
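[Editor's sketch] The "questions plus answer key" payload described here can be pictured as exact-match scoring. This is a minimal illustration only; the function name, the question IDs, and the sample answers are hypothetical and not taken from any Snorkel data set.

```python
def exact_match_accuracy(answer_key: dict[str, str], predictions: dict[str, str]) -> float:
    """Fraction of questions where the predicted choice equals the answer key."""
    if not answer_key:
        return 0.0
    correct = sum(
        1
        for question_id, gold in answer_key.items()
        # Missing predictions count as wrong; comparison ignores case and whitespace.
        if predictions.get(question_id, "").strip().upper() == gold.strip().upper()
    )
    return correct / len(answer_key)


# Hypothetical multiple-choice payload: the whole "data set" is questions and answers.
answer_key = {"q1": "B", "q2": "D", "q3": "A"}
predictions = {"q1": "B", "q2": "C", "q3": "a"}
print(exact_match_accuracy(answer_key, predictions))
```

Nothing beyond the key is needed to grade, which is why this style of test has no separate environment, rubric, or verifier.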

[36:28] Alex Ratner:
What the tests look like today is more like, you know, and this is again, this is not even a metaphor, because this is literally what we're producing, but I'll still talk about like the student setting. Like, it's basically these tasks, like go use the library computer to look up whatever references you need, and then write this book report. That's kind of like the metaphor for the Harvey task that we were talking about, right? And so what do you need? What is the payload for one of those task questions now? For either evaluation or for tuning, inclusive of RL? Well, you need an environment, which in this case is the desk with the library computer.

[37:12] Alex Ratner:
That's the environment. But you also need the task itself of what you're trying to write the book report on and what themes you're supposed to be expounding on. That's often called the task, but that was what

[37:26] Conor Bronsdon: [OVERLAP]
Ancient Sumerian theology, obviously.

[37:28] Alex Ratner: [OVERLAP]
Yeah,

[37:28] Conor Bronsdon: [OVERLAP]
Yeah.

[37:29] Alex Ratner:
Yeah, yeah. Ancient Sumerian theologians are like, you know, go answer Conor's question that Alex didn't even parse well enough to regurgitate here about ancient Sumerian theology, and write a report on this with at least five citations. That's what we traditionally call data because that's a task. So you have the environment, which is the library computer and the desk and the pencil. You have the question. You now also need a rubric, which is another thing you see often in the space, which is just the grading guidelines. Maybe you have a reference answer of here's a great book report, but then you need a rubric because you can't just say, does A equal B. You need to actually check, you have to have a detailed grading rubric for how to grade it. And then you need to have graders, which are often called verifiers. So, you know, that's the payload. It's the task of what you're supposed to do. It's the environment within which you do it, including tools and even simulated user personas, if part of the task is going back and forth with a simulated user or with a human. So you need the task, you need the environment, you need potentially the reference answer, you need the rubric or the grading kind of guidelines, and you need the grader or the verifier. That's the full package. So whether you call that an environment, or you call that data plus an environment, or you call it data, like it's kind of semantics.
And so this dichotomy, which I think is kind of marketing, of like, data is dead, now it's all about environments, to me, exposes that. I mean, look, like, I'll try to make this a little spicy, because we're getting towards the end, and we may have already lost half the viewership with the ancient Sumerian detour, like, a lot of the companies out there that are creating an environment, like a CRM clone or something, they don't really understand how AI works. They hack something up, they sell it to 10 labs, they annualize the number, they go raise a Series A, and they say, oh, we're not like data, we're an environment company, because they just don't actually understand how AI works. The reality is the full package, whether you're producing it as a vendor or not, but the full valuable package is the two combined. And candidly, when they're not combined, it's not that useful, because you design the task based on the environment, and you design the environment based on the task if you're doing it right. So we're, you know, maybe it's a little pedantic, but we're big believers that the two kind of are part of one holistic thing that, if you do it right, has to be developed together.
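[Editor's sketch] The five-part payload Alex lists (task, environment, optional reference answer, rubric, grader or verifier) can be pictured as a single record. The sketch below is a rough illustration under stated assumptions, not Snorkel's actual schema: every class name, field name, and the toy citation-counting verifier are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RubricItem:
    description: str  # one grading guideline, in plain language
    weight: float     # relative importance of this guideline


@dataclass
class Environment:
    name: str
    tools: list[str] = field(default_factory=list)          # e.g. the "library computer"
    user_personas: list[str] = field(default_factory=list)  # simulated users, if any


# A verifier scores one answer against one rubric item, returning a value in [0, 1].
Verifier = Callable[[str, RubricItem], float]


@dataclass
class TaskPayload:
    task: str
    environment: Environment
    rubric: list[RubricItem]
    verifier: Verifier
    reference_answer: Optional[str] = None  # optional, as in the conversation

    def grade(self, answer: str) -> float:
        """Weighted average of the verifier's per-item scores."""
        total_weight = sum(item.weight for item in self.rubric)
        if total_weight == 0:
            return 0.0
        scored = sum(item.weight * self.verifier(answer, item) for item in self.rubric)
        return scored / total_weight


def toy_verifier(answer: str, item: RubricItem) -> float:
    """Hypothetical verifier: only knows how to check the citation guideline."""
    if item.description == "at least five citations":
        citations = re.findall(r"\[\d+\]", answer)  # counts markers like [1], [2]
        return 1.0 if len(citations) >= 5 else 0.0
    # Any other guideline would need an expert or model grader; score 0 in this toy.
    return 0.0


payload = TaskPayload(
    task="Write a report on the assigned topic with at least five citations.",
    environment=Environment(name="library desk", tools=["library computer"]),
    rubric=[RubricItem("at least five citations", weight=1.0)],
    verifier=toy_verifier,
)
print(payload.grade("Claim [1], claim [2], claim [3], claim [4], claim [5]."))
```

Keeping the task, environment, and rubric in one record mirrors the point that the task and the environment are designed against each other rather than shipped separately.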

[40:05] Conor Bronsdon: [OVERLAP]
Yeah, I think a good example to mention here is DoorDash. They recently announced in March that they are creating a massive data gathering operation. They have a separate app called Tasks now, where they are paying their dashers to collect real-world audio, video, photo data

[40:25] Alex Ratner: [OVERLAP]
Yep.

[40:25] Conor Bronsdon: [OVERLAP]
to train AI and robotic systems, particularly about how to navigate the real world. And to Alex's point earlier about experts, we may not think of dashers as experts, but if you think about dashers navigating an environment and doing these tasks, they are experts for those tasks. They are experts in navigating an environment that robots have not truly learned how to navigate yet. And so it's going to be really interesting to see how all these different data approaches come together. And I think every company that has scale right now is looking to see, how can I leverage the data, or how can I create better data, to fuel this next layer of innovation and agentic systems? Alex, it's been such a fun conversation. I know we didn't quite get as much time as we hoped for, so hopefully we'll have you back sometime later this year or some other time. But it's been fantastic chatting with you, and I'll make sure we get some time for Ancient Sumerian next time as well, though

[41:17] Alex Ratner: [OVERLAP]
Yeah,

[41:17] Conor Bronsdon: [OVERLAP]
I don't speak the language at all, to be clear.

[41:20] Alex Ratner: [OVERLAP]
Well, Conor, I really appreciate being on and I really appreciate the great conversation. And I have my homework before that next conversation. So I know I got to get to work on.

[41:30] Conor Bronsdon: [OVERLAP]
You might be more of an expert by then. I have limited knowledge, admittedly, so.

[41:35] Alex Ratner:
Well, this is awesome. Thank you so much.

[41:36] Conor Bronsdon: [OVERLAP]
It was my pleasure. And where should folks go to check you out and follow your work or follow Snorkel's work?

[41:42] Alex Ratner: [OVERLAP]
Yeah, I mean, snorkel.ai is the website. You can follow me on X. It will be exactly as nerdy and AI data focused as this, but that also links to some of the work that our research team does out in the open, some of the research work that we collaborate with and fund. So yeah, that'll probably be the best place to look. I'm excited to engage with any

[42:05] Conor Bronsdon: [OVERLAP]
Yeah, hopefully some listeners go create their own open benchmarks and apply for a grant or at least reach out to you on X. And Alex, just thanks again for joining us. Listeners, if you enjoyed this conversation, be sure to like and subscribe wherever you are. And make sure while you're at it that you've subscribed to our newsletter at newsletter.chainofthought.show. Alex, thanks again for coming on. It's been fantastic chatting with you.

[42:28] Alex Ratner: [OVERLAP]
Conor, thanks so much for having me.