Thinking Machines: AI & Philosophy

Talfan Evans is a research engineer at DeepMind, where he focuses on data curation and foundational research for pre-training LLMs and multimodal models like Gemini. I ask Talfan: 
  • Will one model rule them all?
  • What does "high quality data" actually mean in the context of LLM training?
  • Is language model pre-training becoming commoditized?
  • Are companies like Google and OpenAI keeping their AI secrets to themselves?
  • Does the startup or open source community stand a chance next to the giants?
Also check out Talfan's latest paper at DeepMind, Bad Students Make Great Teachers.

Creators & Guests

Host
Daniel Reid Cahn
Founder @ Slingshot - AI for all, not just Goliath

What is Thinking Machines: AI & Philosophy?

“Thinking Machines,” hosted by Daniel Reid Cahn, bridges the worlds of artificial intelligence and philosophy - aimed at technical audiences. Episodes explore how AI challenges our understanding of topics like consciousness, free will, and morality, featuring interviews with leading thinkers, AI leaders, founders, machine learning engineers, and philosophers. Daniel guides listeners through the complex landscape of artificial intelligence, questioning its impact on human knowledge, ethics, and the future.

We talk through the big questions that are bubbling through the AI community, covering topics like "Can AI be Creative?" and "Is the Turing Test outdated?", introduce new concepts to our vocabulary like "human washing," and only occasionally agree with each other.

Daniel is a machine learning engineer who misses his time as a philosopher at King's College London. Daniel is the cofounder and CEO of Slingshot AI, building the foundation model for psychology.

Daniel Reid Cahn:

Welcome back to Thinking Machines, where we talk about AI and philosophy. Today, we're talking to Talfan Evans. Tal did his PhD at UCL, did computer vision research at my alma mater, Imperial, and these days he works at DeepMind on data curation for and pre-training of large language models, which I believe means Gemini. So in other words, he's working on the core, most important bits of Gemini, which is one of the biggest and one of the few large language models out there. Tal has strong intuitions for what it takes to train large language models at scale, so I want to ask him a lot of tough questions.

Daniel Reid Cahn:

So with that, let's begin. Tal, thanks so much for being here.

Talfan Evans:

Hey, Daniel. Thanks for having me.

Daniel Reid Cahn:

Cool. So I think, like, the topic I wanted to focus on for today is a lot of what we've talked about, which is sort of like: are LLMs winner-takes-all? Maybe a good place to start is, do you think that pretraining of large language models is becoming commoditized?

Talfan Evans:

I think that it's natural, as the science advances, that this percolates down from the big companies. And I think it's natural that we're starting to see more and more innovation, not just in pretraining, but across architecture modification, etcetera. I think we're starting to see more of that come from the open source community just because it's such a big field now and so many people are interested in it.

Daniel Reid Cahn:

I mean, it's weird. Like, a few years ago, the general consensus was like, pretraining large language models is insanely hard. It will only ever be done by a very small number of players who have access to the compute, the data, and the techniques. And then Mistral happened. And suddenly these, you know, really smart people, obviously with great experience, but a small group of people, trained a language model that was, like, competitive with the big ones. That was weird.

Talfan Evans:

Yeah. Although I wouldn't discount the experience that they have. You know, they're some of the best researchers at their former organizations. So it was a great team to do that. I would say that it's also natural that the best version of any technology that we develop is always going to be imperfect.

Talfan Evans:

And if you think about how technology gets developed, it's a sort of noisy iteration process. And we often make small gains in small sort of, like, positive steps forward that aren't necessarily the globally optimal way to go about these things. So we often find that we have to hack together bits of bodge or, like, locally optimal modifications. And I think as the technology progresses, these things get ironed out and things tend to become simpler over time. So I think we're right now, you know, in this era of deep learning where there still are a lot of engineering tricks that one might need to know to get these things not just training, but performing really well.

Talfan Evans:

And I would hope that as we understand deep learning better that, yeah, the principles get simpler.

Daniel Reid Cahn:

Okay. So the thought is kind of like we go through this s curve where, like, as we move really quickly, you need a lot more complexity. You need a bigger team. You need people who specialize in a lot of low level different things. And so you just can't really compete with a small team without that technology.

Daniel Reid Cahn:

Over time, we realized most of that didn't really matter. There actually are simpler primitives that we could focus on. And maybe that's what's happening with language models. You hear a lot of, like, simplification. Like, scale is all you need.

Daniel Reid Cahn:

You just need a lot of tokens. They need to be really good. People talk a lot less about these small architecture things. People don't talk so much about FlashAttention right now.

Talfan Evans:

Yeah. Great example. Backpropagation and scaling are two general principles. Those have really simplified the task of training a useful model. You could put together a team of researchers, engineers who were trying to build the most sophisticated AI model known at the time in the eighties, and they would maybe not be using backpropagation at sufficient scale.

Talfan Evans:

And they would be doing a lot more work to get a lot worse product. And I think, you know, finding out about these general principles, I mean, scaling seems intuitive now, but it wasn't even until a couple of years ago that we sort of understood empirically what scaling meant. Yeah. Those are two simplifying principles that are allowing people to do a lot more than they could before we understood those principles.
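
For anyone who wants to see what "understanding empirically what scaling meant" looks like in practice, here is a hedged sketch of a Chinchilla-style parametric scaling law (Hoffmann et al., 2022). The fitted constants below are the approximate published values and are included purely for illustration.

```python
# Hedged sketch of the kind of empirical scaling law being referred to — the
# Chinchilla-style parametric fit L(N, D) = E + A/N^alpha + B/D^beta
# (Hoffmann et al., 2022). Constants are the approximate published fits;
# treat them as illustrative, not gospel.
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted pre-training loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta


# e.g. roughly compare a 7B model on 1.4T tokens vs a 70B model on 140B tokens
print(chinchilla_loss(7e9, 1.4e12))    # smaller model, much more data
print(chinchilla_loss(70e9, 1.4e11))   # bigger model, much less data
```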

Daniel Reid Cahn:

So then we have these principles that are becoming more generalizable. Really smart people are able to do things. I think the other thing that's interesting that's happened is, of course, like, OpenAI locked down. That's not new news, but you know, they locked down their information. They stopped being open.

Daniel Reid Cahn:

They stopped sharing. They stopped open sourcing. And DeepMind, from what I've understood, has done something similar where, like, not just externally, but also internally. I know personally, you don't have to comment on it, but there are a few things where I've, like, asked, you know, how is it that Google has this giant 10-million-token context window model? And, you know, the folks that I tend to talk to at DeepMind say someone else knows, but not me, and I'm not allowed to know because we've locked down that information internally.

Daniel Reid Cahn:

Do you think that is gonna change the game when it comes to this kind of commoditization process?

Talfan Evans:

I think it's definitely true that up to at least recently, there's been a lot of leakage from the big tech companies. And this is even true of OpenAI to some extent at the beginning, where they were publishing a lot of the early research, like on DOTA, for example. I think they stopped leaking much earlier than Google, and I applaud Google for the principles here because, I mean, it's good for humanity to be sharing some of these secrets. I think that the environment has changed since OpenAI released ChatGPT, and now there's a lot more pressure to generate a competitive product. And so I think it's natural that this leakage does get sealed, I guess.

Talfan Evans:

And I think it will change the dynamics a little bit, but I can't see that you can completely remove this.

Daniel Reid Cahn:

I think, like, there are kinda like two mentalities, I think, that affect whether or not you wanna share information, like, just financially. There's kind of, like, the rising tide raises all boats kind of mentality, where basically, like, if you're a company like Facebook, you expect that you're mainly gonna profit from things like ads and social networking connections. And, you know, if you're Meta, like, the big competitive advantage you have is your network effect. Like, everyone is on some sort of Meta product or multiple. And so perhaps, you know, that might explain why Meta is very happy to be open source, because they're not really trying to compete with anyone.

Daniel Reid Cahn:

Like, they're not trying to say, as long as we have the best language models, we'll win. They don't make money on their language models. They make money on their social products. So that would suggest if that claim is true that companies that wanna lock down feel like they are being competitive. Right?

Daniel Reid Cahn:

They are expecting commoditization, and they need to maintain a competitive advantage by keeping secrets. So it's essentially some sort of, like, a win for Gemini is a loss for, you know, ChatGPT, and a win for ChatGPT is a loss for Gemini. If that's true, I think the other, like, related question is, are we gonna move into this commoditization world where we have a lot of language models that are basically interchangeable, and what matters is the product level? Or, you know, on the flip side, the exact opposite.

Daniel Reid Cahn:

This is a winner takes all game, and these companies are competing and work really hard to maintain their secrets so that they can win and take all. Any thoughts?

Talfan Evans:

Yeah. Super interesting points. I mean, Meta is a really interesting case right now. And maybe what you're saying, I think I really agree with this, is that Google and OpenAI are really in direct competition at the minute. I mean, Google's search business is obviously the thing that's most valuable to the company.

Talfan Evans:

And I think when ChatGPT was released, it seemed like a new version of search. Right? I mean, like, generative search is a new way of extracting information that you want given a query. And so the worry maybe is that that does become true and undermines the search business. So from this respect, Google and OpenAI don't really wanna share secrets, and both will suffer if the other's models get better.

Talfan Evans:

Now Meta are interesting because maybe they're not competing on that same business. Meta's business is really around, I guess, yeah, selling ads on the social platform, and maybe they're not directly competing in search. Maybe they don't mind that much if they leak the Llama models and it improves Gemini and ChatGPT. Maybe there's outsized benefit for them from getting community modifications that improve the Llama models.

Daniel Reid Cahn:

There's kind of this funny asymmetry where it seems like Google is the most scared of any of these players because Google's core business is the only one that's really at risk.

Talfan Evans:

I think it's tough to be an incumbent. I think

Daniel Reid Cahn:

it's tough

Talfan Evans:

to be an incumbent. Right? I mean, the incumbent dilemma is really real. You have much more brand risk than a startup. And I think to their credit, what OpenAI have done well is continued to behave like a startup and move like a startup as they've scaled.

Talfan Evans:

So still making, I think, risky moves even though they are now at a much loftier position than they were maybe three years ago.

Daniel Reid Cahn:

Well, maybe they are. I mean, they've also been super locked down, buttoned up. We have no idea what they've done for a long time. They've done basically nothing. GPT-4 came out, then GPT-4V, which, you know, had been a secret for six months, you know, being used, as far as we know.

Daniel Reid Cahn:

So we have no idea what secrets they have right now, but it has been a while since OpenAI has released anything. So you could say it's like a startup. In some ways, it's also just like a top secret military research lab. Right?

Talfan Evans:

I mean, like, maybe I don't think that the startup and the top secret military research lab operate that differently, or should. I think the startup scale may differ: if you're a small startup and you're trying to prove that your product works and that you have a market, then, yeah, you want to release things publicly. But OpenAI have sort of already done this, and so it seems sensible to me that they would be locking everything down and only releasing the choicest bits. So, I mean, like, the Sora release was pretty well timed. I don't think that it completely outshone the Gemini release at the time.

Talfan Evans:

I mean, like, the long context feature of Gemini was, I think, at least as impressive. Although the Sora videos are incredible also.

Daniel Reid Cahn:

I think there's also some, maybe this is bullshit, but I think there's something interesting about, like, OpenAI. There's some way that you can, like, view the strategy in terms of, like, keeping secrets, in terms of releasing or not releasing. There are so many product features that OpenAI could release and hasn't. Right? They did release their Assistants API, which was consumer-ish, you know, nice for consumer stuff.

Daniel Reid Cahn:

But, you know, one thing is, like, in the ecosystem of AI, a lot of people are, like, training smaller models, fine-tuning smaller models, you know, and running inference. OpenAI used to have, like, five model sizes. Now they have two. Like, if they wanted to, one really obvious direction they could go in is just release five smaller models of different sizes, offer fine-tuning of all of them and say, hey, you want cheaper inference, cheaper training?

Daniel Reid Cahn:

Go for it. Use one of our smaller models. And they would destroy, like, a huge segment of the startup market, but they haven't touched that. And I think it makes sense in the lens of, like, OpenAI's goal is to achieve AGI. And if their goal is to achieve AGI, they're not gonna bother with all that kind of bullshit.

Daniel Reid Cahn:

On the other hand, like, no one else has that goal. Like, do you think that there's something legitimate to, like, maybe OpenAI actually is unique and different here? Because, you know, you can actually view their secret keeping and their not releasing consumer features just from the perspective that all they actually care about is AGI.

Talfan Evans:

I mean, I do think they care about developing AGI. And if you listen to Sam Altman, that seems to be true. I think they also care about products and being the most successful and most used foundation models out there. It's an interesting point about the sort of, like, quality versus cost trade off. And they seem to be right now going for that particular point on the Pareto front, which is, you know, absolute best quality in the market.

Talfan Evans:

The question as to whether they could, if they wanted to, just compete across that Pareto front, I think the answer is probably yes, but I wouldn't discount the possibility that just being able to be the best in a market at any one particular point in time requires specialization that isn't actually easy to transfer. So I think the assumption behind "why don't they compete across model scales" is that, oh, well, we have scaling laws, and all it takes is to just spend more time and compute to train bigger models or smaller models and then serve the right one to the user. And I think that, you know, there are probably, like, infrastructural questions there that are really important. So it probably requires a different kind of business to serve a, like, bad model to many, many more clients than it does, like, a very big model to fewer clients.

Talfan Evans:

At the minute, OpenAI are obviously serving a lot of clients. But the other problem here is just, like, the routing problem, which is, I think, something that's super interesting to a lot of people including myself right now. How do you serve the right model to the right clients or the right request? That's not trivial. And this is where, sort of, mixture-of-experts models start to come in, and people are also looking at things like adaptive computation methods.

Talfan Evans:

There was a great paper from DeepMind recently called Mixture-of-Depths. The idea there is, like, your model learns to drop tokens that aren't really essential for inference or training.
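
To make the Mixture-of-Depths idea a little more concrete, here is a minimal, purely illustrative sketch of token-level routing, not the paper's actual implementation: a learned router scores each token, and only the top fraction of tokens per sequence pass through the expensive block.

```python
# Minimal sketch of Mixture-of-Depths-style token routing (illustrative only,
# not the DeepMind implementation). A learned router scores each token, and
# only the top-k tokens per sequence pass through the expensive block; the
# rest skip it unchanged.
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    def __init__(self, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.router = nn.Linear(d_model, 1)        # scalar "does this token deserve compute" score
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.capacity = capacity                   # fraction of tokens that get full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.router(x).squeeze(-1)                 # (batch, seq_len)
        k = max(1, int(self.capacity * x.shape[1]))
        topk = scores.topk(k, dim=-1).indices               # tokens routed through the block
        out = x.clone()                                      # skipped tokens pass straight through
        for b in range(x.shape[0]):
            idx = topk[b].sort().values                      # keep original token order
            processed = self.block(x[b, idx].unsqueeze(0)).squeeze(0)
            # scale by the router score so the routing decision receives a gradient
            out[b, idx] = processed * torch.sigmoid(scores[b, idx]).unsqueeze(-1)
        return out
```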

Daniel Reid Cahn:

So that's like, if you're gonna predict an "of" or a "the," for example, like, it should be really, really easy. You shouldn't put in a ton of compute. Yeah. It's an interesting point. You're saying, like, OpenAI, especially if they care about AGI, then they will focus on the most expensive side of the spectrum.

Daniel Reid Cahn:

And it would be a huge waste of time, like, and potentially a completely different research direction, if they were thinking, how do we make this more scalable, cheaper? Whereas, like, Google might actually wanna do that because Google might actually wanna, you know, do inference at that scale, or other companies might. So it might just be like, no, OpenAI is not the best placed company to do large-scale cheap inference just because it might require totally different research and engineering work.

Talfan Evans:

Yeah. I think that's completely possible. My feeling about OpenAI, you know, knowing nothing about the financial side of the business, is that, yes, they have to have a product, but their bet is that by staying ahead and staying at the bleeding edge, they're going to be able to dominate across, like, the scale by just having a better product overall. And I think, you know, staying ahead is pretty important. And that also aligns with this goal of creating AGI.

Talfan Evans:

Although I would say, you know, this is something that DeepMind are also interested in and have been for a long time. I mean, this is sort of like Demis and co's vision setting up DeepMind.

Daniel Reid Cahn:

So I think, I mean, one really interesting thing: you're talking about, like, OpenAI, their business model. You also jump into the DeepMind AGI thing. But on OpenAI's business model, there is something to be said about, like, they are the biggest model. We can't tell if they're doing optimizations to try to get cheaper inference. We know that there's no public stuff where they're trying to do smaller-model efficient inference, which they could do, but it could be a different trajectory.

Daniel Reid Cahn:

What we can say is their inference is pretty cheap. Like, I think, if I remember correctly, when Mistral released their, like, Medium model, which they were like, it's almost as good as GPT-4, but not quite, they charged higher inference prices than OpenAI does for GPT-4. So their inference costs are still very low. While I think you are right that they're probably not working as much on, like, smaller-model inference, they are making inference really cheap at scale. And that is, like, a pretty freaking big deal, isn't it?

Talfan Evans:

I actually don't know. Are they profitable? Because I would just say it's entirely possible that they're still making a loss on this stuff.

Daniel Reid Cahn:

Like, you think they're just taking a loss on every inference request?

Talfan Evans:

I mean, it's possible. Although, mhmm, you know, you would assume that they'd have to do some rate limiting. I think it seems like maybe they are doing some sort of routing at the minute.

Daniel Reid Cahn:

Routing in the sense of, like, early exit or, like, mixture of depths or something like that.

Talfan Evans:

Like this. I wouldn't be surprised if some of this is going on under the hood.

Daniel Reid Cahn:

They almost certainly do things like speculative decoding. I mean,

Talfan Evans:

I think

Daniel Reid Cahn:

well, no one knows. We're all speculating. But I mean, for them to do inference this cheaply, they must have some pretty awesome things that allow them to do more efficient inference.
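
For context, here is a rough sketch of what vanilla speculative decoding looks like, the general technique rather than anything OpenAI has confirmed using. The `draft_model` and `target_model` names are hypothetical stand-ins assumed to return per-position logits, and this is the simple greedy variant.

```python
# Hedged sketch of vanilla speculative decoding (the general technique, not a
# claim about any provider's stack). A cheap draft model proposes k tokens;
# the large model scores them all in ONE forward pass and keeps the longest
# prefix it agrees with.
import torch


@torch.no_grad()
def speculative_step(draft_model, target_model, prompt_ids: torch.Tensor, k: int = 4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft_ids = prompt_ids
    for _ in range(k):
        logits = draft_model(draft_ids)[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) Target model scores the whole proposed continuation in one parallel pass.
    target_logits = target_model(draft_ids)[:, prompt_ids.shape[1] - 1:-1, :]
    target_choice = target_logits.argmax(dim=-1)            # what the big model would have emitted
    proposed = draft_ids[:, prompt_ids.shape[1]:]

    # 3) Accept the longest prefix where draft and target agree; take the target's
    #    own token at the first disagreement.
    accepted = 0
    for i in range(k):
        if (proposed[:, i] == target_choice[:, i]).all():
            accepted += 1
        else:
            break
    correction = target_choice[:, accepted:accepted + 1]
    return torch.cat([prompt_ids, proposed[:, :accepted], correction], dim=-1)
```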

Talfan Evans:

Yeah. And, obviously, you know, talking about inference costs, it's hard to consider outside of the total costs of, like, what it took to get there. And so I'm actually doubly surprised in OpenAI's case because, of course, they've obviously spent a lot of money getting to the point where they have the best models.

Daniel Reid Cahn:

Okay. So talk me through this. Last time we talked, you were talking to me about, like, how drug prices work in terms of, like, drug discovery. Can you explain that?

Talfan Evans:

Yeah. Exactly. I think the analogy is just that you price something to account for the costs that have gone into developing it. And, like, if you pay X to develop a model, then you need to recover the cost of doing that. And the time over which you want to recover that cost obviously has to be a factor.

Talfan Evans:

And this is why drugs are extremely expensive because pharmaceutical companies take a long time developing these, do all of the r and d, and they actually have a very short time window over which to profit on those drugs. So they end up having to crank the prices up. I'm not saying that's the only reason they do, but it does make sense from this perspective. If you pay the price to develop the technology, you do need to profit from it. And maybe actually OpenAI have been reasonably lean there.

Talfan Evans:

I don't know.

Daniel Reid Cahn:

Yeah. I mean, there's something interesting. It's funny. Like, I think the quote with drugs is like, you're like, why does it cost, you know, $10 for this pill when it only costs, like, a cent to produce? And you're like, oh, well, this one costs only one cent to produce, but the first one cost a billion dollars to produce.

Daniel Reid Cahn:

So there's some element. I think what is interesting is, you know, it doesn't cost a cent for inference. Inference is, like, well, maybe it is a cent, but it's relatively expensive compared to a lot of other, you know, unit-priced things. But there is the expectation that prices will be dominated by amortization of training costs. So basically, training costs are massive.

Daniel Reid Cahn:

You can amortize training costs, which is to say, like, if I have twice as many requests and I want to recover my training costs, you know, in the way I charge, theoretically, I can charge half as much per inference request to recover my costs. So there's some actual, like, winner-takes-all dynamic here. Right? Where, like, if you're the number one player and your model is marginally better than the number two player, you assume the inference costs are roughly equal, and you assume that you both have to charge to recoup the training costs. So now player one can charge half as much as player two if they have twice as many requests.

Daniel Reid Cahn:

So by being the best, you have this phenomenal business model where you don't have to charge very much in order to recoup your training costs because you just get so much more inference. And if you're expecting so much more inference, you can charge less. Charging less leads to more inference. So now you're charging the least for your model size. You know, you are getting the most requests.
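
A quick back-of-the-envelope with entirely made-up numbers illustrates the amortization point: at equal per-request compute cost, the provider with twice the traffic breaks even at a noticeably lower price per call.

```python
# Back-of-the-envelope illustration of the amortization argument, with made-up
# numbers: the provider with more request volume can price each call lower and
# still recoup the same training bill.
def breakeven_price(training_cost: float, requests: float, marginal_cost: float) -> float:
    """Minimum price per request to recover training cost plus per-request compute."""
    return training_cost / requests + marginal_cost


train_cost = 100_000_000          # hypothetical $100M training run
marginal = 0.001                  # hypothetical $0.001 of compute per request

player_1 = breakeven_price(train_cost, requests=20_000_000_000, marginal_cost=marginal)
player_2 = breakeven_price(train_cost, requests=10_000_000_000, marginal_cost=marginal)
print(f"player 1 (2x traffic): ${player_1:.4f}/request")   # ~$0.0060
print(f"player 2:              ${player_2:.4f}/request")   # ~$0.0110
```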

Daniel Reid Cahn:

You're also able to load balance. Let's not forget that. Which is, like, if you have very spotty requests and you want to be ready for, like, I don't know, whatever, a million requests in a second, you need a lot of GPUs on that are totally idle. If you have more requests on average, you can load balance better. So now there are, like, some real opportunities around inference at scale.

Daniel Reid Cahn:

Does this make you question at all the, you know, winner takes all thing?

Talfan Evans:

Well, I think yeah. And we've discussed this briefly recently. Personally, I don't see why there couldn't be a company that fills that gap and say provides sort of liquidity, I guess, across many small model providers. You know, maintains uptime, pays for the GPU or TPU hosting itself, and then does something equivalent to what OpenAI are able to do because of the demand. I don't see a fundamental reason why that couldn't be the case.

Talfan Evans:

I mean, there are also, like, GPU cloud service providers that are thinking about this kind of thing. Ori, Edge, for example, some of the bigger ones. Also, I wouldn't be surprised if they start going deeper into the sort of, like, AI infrastructure as opposed to just staying at the level of serving infrastructure.

Daniel Reid Cahn:

Yeah. I think Fireworks is starting to offer this. Predibase was the one who released the LoRAX library for hosting a lot of fine-tunes. So those two are two of these startups. I mean, there are definitely a lot of model hosting startups, but, like, one big thing is the economy of scale element, which is, like, these companies, like, there's just no way that they're able to load balance in the same way, because, like, relative to the CPU world where, like, a single machine can just do a lot, you know, a single GPU can't do that much.

Daniel Reid Cahn:

Like, a single Llama 70B model needs at least two GPUs, which means just to host a single model, you need to have two GPUs. That means, like, just in terms of raw number of GPUs, if each one of those costs $20,000, you know, like, it's just a huge capital expenditure, and you probably need to be something like cloud size to be able to do really efficient inference. You're probably right in terms of, like, you know, if your costs are at least half as much, then it makes sense. But I definitely have heard repeatedly when I talk to the founders of companies that do this and I'm like, are you scared of OpenAI? And they're like, yes.

Daniel Reid Cahn:

Because it is really hard to compete on prices when they are so freaking cheap, which could be again because they have so much capital that they're willing to take a loss. There are other ones that are weird, like Groq right now. They don't charge for their API at all. DeepInfra also has a free API. That's ridiculous.

Daniel Reid Cahn:

There are two venture-backed startups that are offering free APIs for, like, Llama 70B, I think. Hopefully, they'll shut them down soon enough, you know, or their economics are great. But with OpenAI, like, it is the obvious one. Like, you have these companies that are like, we offer a model that's a hundred times smaller than GPT-4 and half the price. And you're like, what?

Daniel Reid Cahn:

You know?

Talfan Evans:

Yeah. I think: why is it nontrivial to serve a bunch of heterogeneous small models that are different from one another? Why is it nontrivial to serve those with the same efficiency that you can serve a single model, like OpenAI are doing? Okay. So they're not...

Daniel Reid Cahn:

Wait. Just to be clear, OpenAI doesn't just host one model. Like, so OpenAI published something a month ago, I think this was like April 4th, where they wrote, we believe that in the future, the vast majority of organizations will develop customized models that are personalized to their industry, business, or use case. With a variety of techniques available to build a custom model, organizations of all sizes can develop personalized models to realize more meaningful, specific impact from their AI implementations. I think what's really cool about OpenAI isn't this one model to rule them all world that I think everyone else is kind of focused on.

Daniel Reid Cahn:

OpenAI is, like, the only organization that has this fine-tuning API at scale that's awesome, and then hosts these fine-tuned GPT-3.5s quite cheaply, really fast, no cold start time. I think what's most impressive about OpenAI here is not that they host GPT-4 as one model to rule them all at cheap prices. It's that they host these GPT-3.5 fine-tunes so freaking well at such scale for so cheap. You know?

Talfan Evans:

Yeah. I don't see how this is possible, at least if you compare on absolute efficiency with serving a single model that is computing a bunch of different queries. I mean, this just comes out of, okay, you just can't fit multiple models of the same size in memory. Right? So one model is efficient because you can process a batch of queries with the same model footprint in terms of memory.

Talfan Evans:

But as soon as you start hosting a bunch of these smaller models, you obviously need different chips, or many, many models on the same...

Daniel Reid Cahn:

This is the LoRA miracle. Right?

Talfan Evans:

So I think you still need the model-specific LoRAs. Right? So you can't do all that on the same chip. You need different chips.

Daniel Reid Cahn:

You can. As far as I understand, I think, if you check out LoRAX, that's Predibase's implementation, although it sounds like there are a lot, you can actually fit many LoRAs on a single GPU and actually do a single batch across multiple models. I think that is now a thing. And that must be it. It's the only explanation for how OpenAI is so efficient.

Talfan Evans:

Yeah. I guess, yeah, the parameter footprint of the LoRAs is much smaller than the base model, and so you can feasibly fit a bunch of those on. Yeah.
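
Here is a rough, hypothetical sketch of the idea behind multi-LoRA serving systems like LoRAX (illustrative only, not their actual code): one frozen base weight matrix is shared across the whole batch, and each request applies just its own small low-rank correction.

```python
# Illustrative sketch of multi-LoRA batched inference: one frozen base weight
# is shared, each request selects its own small (A, B) adapter pair, and the
# whole batch runs as one shared matmul plus a cheap per-request correction.
import torch


class MultiLoRALinear(torch.nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8, n_adapters: int = 100):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out, bias=False)   # shared, frozen base weights
        self.base.weight.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.randn(n_adapters, d_in, rank) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(n_adapters, rank, d_out))

    def forward(self, x: torch.Tensor, adapter_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in); adapter_ids: (batch,) — each request picks its own fine-tune
        base_out = self.base(x)                                  # one shared matmul for the batch
        A = self.A[adapter_ids]                                  # (batch, d_in, rank)
        B = self.B[adapter_ids]                                  # (batch, rank, d_out)
        lora_out = torch.bmm(torch.bmm(x.unsqueeze(1), A), B).squeeze(1)
        return base_out + lora_out


layer = MultiLoRALinear(d_in=512, d_out=512)
x = torch.randn(4, 512)
ids = torch.tensor([0, 17, 17, 42])          # requests for three different fine-tunes in one batch
y = layer(x, ids)                            # (4, 512)
```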

Daniel Reid Cahn:

I mean, it's still a huge engineering challenge, though. It still requires a huge amount of compute for load balancing. Like, now imagine OpenAI is hosting, whatever it is, 100,000 LoRAs, and they need to think about, like, okay, this GPU can fit a hundred, but which hundred do we host? How do we switch them out? How do we predict which usage is gonna go where so that we're not wasting compute?

Daniel Reid Cahn:

Like, that is a massive engineering challenge, and it's pretty obvious, I think, why it's not easy for any startup to do. Not to say startups can't do it, but it's really freaking hard and crazy impressive that OpenAI have achieved this at scale.
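
To make that scheduling problem concrete, here is a toy least-recently-used adapter cache. It is purely hypothetical; nobody outside these labs knows how the real systems decide which adapters stay resident on a GPU.

```python
# Toy LRU cache for adapters, just to make the scheduling problem concrete
# (purely illustrative). Each GPU holds a fixed budget of adapters; a request
# for a non-resident adapter evicts the least-recently-used one and loads the
# new weights on the slow path.
from collections import OrderedDict


class AdapterCache:
    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self.resident = OrderedDict()            # adapter_id -> weights (most recently used last)

    def get(self, adapter_id: str):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)                # cache hit: refresh recency
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)                # evict least-recently-used adapter
            self.resident[adapter_id] = self.load_from_storage(adapter_id)  # cold load (slow path)
        return self.resident[adapter_id]

    def load_from_storage(self, adapter_id: str):
        return f"weights-for-{adapter_id}"       # stand-in for pulling LoRA weights from object storage
```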

Talfan Evans:

Well, yes. It is pretty impressive, and I'm surprised that they are doing as good a job as they are, if they are, and they're not just, like, covering up massive losses with more funding. Also a possibility. This really sounds like a problem that's, like, perfect for a cloud provider. I mean, this is a pure engineering problem almost.

Talfan Evans:

So, again, I would bet that somebody is going to do this really well. Like, maybe, as you say, Predibase are doing something like this. I would bet that somebody is gonna do this really well as a service, as a cloud provider. And that is, hopefully, the heterogeneous model world.

Daniel Reid Cahn:

Yeah. I mean, I don't know. I just it's funny because, like, Google doesn't offer fine tuning of Gemini. Anthropic doesn't offer fine tuning of Claude. Like, just to say, like, in terms of, like, cloud providers, OpenAI is not a cloud provider, but they sort of are when they're, like, almost all owned by Microsoft.

Daniel Reid Cahn:

I am extremely excited, extremely excited about this heterogeneous model world, and I would absolutely love to see cloud providers make this easy. I would love to see Google Cloud just have an offering, which is just like upload your LoRa and we'll host it. I think, you know, maybe smaller clouds can do this first. But, yeah, I think that's a huge deal. I wanna ask you though, I mean, just because you're in this research space, do you believe in the heterogeneous world?

Daniel Reid Cahn:

Like, do you believe in one model to rule them all? Or do you believe in this foundation model fine tuning world?

Talfan Evans:

Yeah. I think I believe in the heterogeneous model world. And this just feels like it has to be true. If you look in nature, there's specialism, and, you know, there's no free lunch. So what you pay for generality surely is going to mean inefficiency.

Talfan Evans:

And maybe there are, okay, so maybe there are sort of economies of scale that allow you to, I don't know, make up for the inefficiency. But I have to believe, I think, in a sort of capitalistic world, that these inefficiencies are going to be, like, competed on to the point where it's gonna have to be a bunch of people who are serving specific models to specific niches, at least to some extent. Right? There might be some core that is shared, and then some, like, kind of, later layers of your transformer that are actually going to be specialized models, because this is how we think about knowledge.

Talfan Evans:

Right? A lot of knowledge is more general and a lot of knowledge is specific. And there are sort of hierarchies that mean you probably can share a lot of that lower level knowledge and then build domain-specific parts on top of it. I think that you can't escape this.

Talfan Evans:

I think, in AI or in nature.

Daniel Reid Cahn:

I think, like, one of the big jokes that goes around the startup world is, like, you know, you talk to an AI startup and they're like, but what if GPT-7 can just do everything that you're doing without any training? And then, of course, there's the, you know, RAG. RAG, RAG, RAG. RAG can do it. Did you try prompt engineering?

Daniel Reid Cahn:

Did you try RAG? How do you think of this?

Talfan Evans:

Whether I think that RAG will supplant a model that's able to just do anything... I think that, yeah, I mean, in-context learning is pretty powerful, and we're talking about LoRAs here. Right? But you could, in theory, do the same thing with just contextual learning, so in-context learning.

Talfan Evans:

And if RAG starts to work really, really well, then, you know, in-context learning gets much more powerful. It's pretty powerful right now. Obviously, the views of how performant these models are from a sort of theorist's perspective and a customer's perspective are quite different. You know, you look on Twitter, a lot of people complaining about how bad these, like, long context windows are or how finicky RAG is to get to work. Again, I think we'll get to the point where these things work really well.

Talfan Evans:

And maybe that does go in favor of the one model rules them all. But I still think that you're gonna have to do some sort of routing, because you don't want to be paying the cost of that full model every time.
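
For reference, a minimal sketch of the RAG pattern being discussed: retrieve the most similar documents and put them in the prompt so the model answers from context rather than from its weights. The `embed` and `llm` names are hypothetical stand-ins for whatever embedding model and LLM you would actually call.

```python
# Minimal RAG sketch (assumptions: a hypothetical `embed` function returning a
# 1-D vector per text, and an `llm` callable taking a prompt string).
import numpy as np


def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, docs: list[str], k: int = 3) -> list[str]:
    # cosine similarity between the query and every document embedding
    sims = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return [docs[i] for i in np.argsort(-sims)[:k]]


def rag_answer(question: str, docs: list[str], embed, llm) -> str:
    doc_vecs = np.stack([embed(d) for d in docs])
    context = "\n\n".join(retrieve(embed(question), doc_vecs, docs))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```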

Daniel Reid Cahn:

Can I just challenge you? Because I think the idea that it's just a cost thing is a little bit of a weird one. You're saying that in context learning can be really powerful. I mean, like, do you have evidence for that?

Talfan Evans:

Well, can be powerful in the sense that it is useful and can approximate gradient based training. If the gold standard is gradient based training, then that's my definition of powerful.

Daniel Reid Cahn:

Like, do you think that it can approximate gradient based training, you're saying?

Talfan Evans:

I mean, it's clear that few shot learning from in context learning works pretty well in a lot of cases.

Daniel Reid Cahn:

I'm gonna challenge you there. I don't know. I've seen a lot of research, and just experienced firsthand, where few-shot learning sucks. Like, especially where, often, I actually end up taking GPT-4, give it five examples, give it 10, give it 20, give it, you know, some sort of algorithm where you use, like, you know, let's say we're doing classification, we do some embedding similarity to decide which examples to include, random examples.

Daniel Reid Cahn:

Very often you get worse results than not including any examples at all. So I don't think it's obvious that naive few shot learning works, let alone is a good idea. I see it less and less every day. I've seen very little implementation out there in the world for people to do few shot learning as opposed to prompt engineering.

Talfan Evans:

So, I mean, I wonder how your experiments would have turned out if you'd fine-tuned on the same examples. I only mean that in-context learning is good with respect to that sort of, like, gold standard baseline.

Daniel Reid Cahn:

I mean, I hear you. I think, like, the most interesting paper here lately was, I don't know if you saw it, the many-shot in-context learning paper from Google, from DeepMind. I think the paper was basically like, what if you took an insanely large context window model and filled it, basically? So instead of using few-shot learning, you just use an insane context window.

Daniel Reid Cahn:

And you're like, what if we could fit the entire dataset in the context? And there, they found that it does work. Like, it was not few-shot learning. It was many-shot learning. I can see why that might work.

Daniel Reid Cahn:

But I do tend to think, like, partially, few-shot, many-shot learning assumes that your task is one that basically is suitable to this. So one where, like, examples are relatively short, for example, classification. If you are creating something like a conversational agent, or if you're creating, like, a legal research tool, or if you're Adept trying to navigate the Internet, it doesn't even seem reasonable to imagine what it would look like to build an Adept that's not training. You know what I mean? Or to build a Harvey that's not training.

Talfan Evans:

Yeah. I would say that this is mostly around task dissimilarity and whether the thing that you're trying to few-shot learn is roughly within the domain of the training set that you did pretraining on.

Daniel Reid Cahn:

And I

Talfan Evans:

think in a lot of cases, you know, few-shot learning requests probably aren't, or at least it might be true that the task that you're expecting the model to few-shot learn is actually sufficiently far outside that you need more examples. So this example that you gave of giving a document versus, like, you know, 10 examples, that might just be the difference. I mean, those laws apply in gradient-based training as well. You know, these models are fundamentally reproducing, or, like, trying to reproduce, the probability distribution of the training data that they were given. And so oftentimes, it takes a lot of evidence to change their view about how that data is being generated.

Daniel Reid Cahn:

Yeah. But I think, like, some of the most common tasks where people are using RAG right now and such are things like navigating the Internet, or, you know, iterating on, like, I'm gonna give you an image of my screen, you know, rendering some code, then I want you to edit the code and then render the screen again, and repeat that process. That's probably not in the training set. Right? Like, that does seem to me exactly like the kind of thing that's insanely hard to achieve with few-shot or many-shot learning, or Adept just trying to navigate the Internet.

Daniel Reid Cahn:

I mentioned Harvey trying to do, like, legal analysis. Like, there are a lot of these things where it's just pretty different than the general-purpose assistant tasks. So there's something to wonder about. Like, another way to frame this is, like, what if the pre-training data actually did include all of these examples, but the downstream steps, like the fine-tuning, the alignment steps done for a general-purpose model, make it particularly bad at the task you wanted to do. Like, you know, look at a screen, write code, and iterate.

Daniel Reid Cahn:

What are your thoughts?

Talfan Evans:

So you're saying, what about the case where, like, the task that you're interested in is in the pre-training data, but it's not in the fine-tuning data?

Daniel Reid Cahn:

Yeah. Or what if, basically, it's not a general-purpose assistant task?

Talfan Evans:

Yeah. You're always hoping that the task that you're trying to do is going to be easily generalized to. And, you know, yeah, where we are, we are putting a lot of weight on the assumption that these models generalize well. I think this is something that maybe wasn't always really true, even in the era of, like, largish language models in particular. I mean, GPT-2 was obviously groundbreaking, but still probably was struggling with true generalization.

Talfan Evans:

And I think that a lot of people have been critical of large models, but I do think that there is more evidence, as we scale them and get better at training them, that there is more generalization. I think it's hard to find somebody who hasn't been sort of astounded by the responses of an LLM at some point. Also, those same people would have been extremely frustrated at some point. But, you know, we're still relatively early in this journey, I think.

Daniel Reid Cahn:

Can you just quickly define what you mean by generalization?

Talfan Evans:

I mean, generalization is exactly what you were prompting me with, which is: is the data that you're trying to fine-tune on, or is this query, somehow within the domain of the training data that the model has learned on? And whether it generalizes well is really a question of how it interpolates within the data or extrapolates. I think there's, like, some muddiness about the difference between those two terms. But, really, if you overfit to the data, clearly, then when you try to move away from the areas of, like, high density in that probability distribution, the model's gonna assign low probability if it hasn't generalized. So, yeah, it's really about how sensible your model is at guessing what to do when you take directions within or outside of the domain of the training data.

Daniel Reid Cahn:

Yeah. That makes sense. And that's where, like, you know, generalization could be, like, doing math where the exact math problem was not in the training set, which is potentially extrapolation. And then similarly, yeah, if you wanted it to learn instrumental skills, if you wanted it to be able to perform a task, basically, where that task was never exactly going to appear in the training data, I think all of that's generalization.

Daniel Reid Cahn:

What I think is interesting, going back to what you said before, was, like, that humans specialize. We see a lot of specialization in nature. I think what that may imply is just, like, a different thing, not necessarily generalization, but just about: if we bring together multiple tasks, do models get better at all of the tasks? Like, if I take, I don't know, addition problems and multiplication problems, and instead of training a multiplication model and an addition model, I train one on math, does it become better than each of the individual models? If I double the number of weights, will it then do better?

Daniel Reid Cahn:

Because at least it learned one and it learned the other and it learned whatever overlaps. So there are sort of, like, two ways this can go wrong. Right? One is just an interference effect, which is, like, when you train two tasks, it gets worse at both because the two tell you to do different things. And then the other possibility is just the number of weights, which is just kind of like, it could have learned both.

Daniel Reid Cahn:

But if you just tried to teach it one thing, you'd get way more efficiency for a given number of weights, a given amount of compute. Right? Where it might just be that if you really care about this task, you will do better for a fixed amount of compute and data than you will if you try to train one model on everything. Did that make sense?
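
A toy skeleton of that thought experiment, with entirely made-up tasks, model sizes, and training details, just to show what the comparison looks like at a fixed parameter budget.

```python
# Toy skeleton of the thought experiment: at a FIXED parameter budget, does one
# model trained on both tasks beat a specialist? Everything here (tasks, widths,
# training loop) is invented purely to illustrate the comparison.
import torch
import torch.nn as nn


def make_data(op: str, n: int = 2048):
    a, b = torch.randint(0, 50, (n, 1)).float(), torch.randint(0, 50, (n, 1)).float()
    y = a + b if op == "add" else a * b
    return torch.cat([a, b], dim=1), y


def mlp(width: int) -> nn.Module:
    return nn.Sequential(nn.Linear(2, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, 1))


def train(model: nn.Module, tasks, steps: int = 500) -> nn.Module:
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        for x, y in tasks:                       # tasks is a list of (inputs, targets) pairs
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model


add_data, mul_data = make_data("add"), make_data("mul")
specialist = train(mlp(width=64), [add_data])                 # all capacity spent on one task
generalist = train(mlp(width=64), [add_data, mul_data])       # same capacity shared across two tasks
print("specialist add MSE:", nn.functional.mse_loss(specialist(add_data[0]), add_data[1]).item())
print("generalist add MSE:", nn.functional.mse_loss(generalist(add_data[0]), add_data[1]).item())
```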

Talfan Evans:

Yeah. Right. I think we don't know exactly, sort of, like, how compositional generalization is really playing out. This is also a muddy question, but it seems at least partially true that we see this sort of transfer. Models that are trained on code seem to be better at instruction following, for example.

Daniel Reid Cahn:

Yeah. Which is so cool. So I think both of these are possibilities though, in terms, like, it's possible that we benefit from specialization because we just have a finite amount of compute. And if you fine-tune a model on a specific task, it'll do a better job because it's using all of the model's weights on this one task. And we just can't give you a bigger model at this point in time.

Daniel Reid Cahn:

And so if you view, you know, this commoditization, this S-curve, it is possible that we just slow down progress, and therefore specialization is a really phenomenal idea to make economic progress. But that becomes temporary, and then GPT-7 becomes AGI or whatever. It's also possible that interference is real, though. Right? I don't know if anyone really talks about this, but it is possible that, for example, GPT-4 is worse at a task because it's trying to be a general-purpose assistant.

Daniel Reid Cahn:

And therefore, if you just try to dissect and separate out tasks, you end up with higher-performance models, not just for, like, data efficiency, but just because it is hard for a model to simultaneously try to be a helpful assistant and navigate the Internet and think about, like, science, the genome, whatever it might be. Okay. Wrapping up, I guess, do you have any comments on Med-Gemini by any chance?

Talfan Evans:

I don't know much about Med-Gemini. I would say, just to close this, like, one last thing on the question of commoditization. I think a strong factor here is that the more that we train these models, the more it becomes extremely apparent that the quality of the data is extremely important. And I think the commoditization question has to consider: what is the cost of acquiring really, really high quality, clean data for a particular niche, and what is the specialism that is required to do that? There's a really good example in, sort of, like, foundation models for biotech.

Talfan Evans:

If you're a biotech company starting right now, you can build a moat by getting the best data for a specific application or particular drug target. And I think this is something that's going to go in favor of the heterogeneous model world.

Daniel Reid Cahn:

So the opportunity... it's funny. You're the data curation guy, so it makes sense that you'd say, basically, the opportunity for startups that you see is... you know, like, more generally, if you were, like, talking to startups, what would you say, given your experience?

Talfan Evans:

Yeah. I think there's a moat. I think we're learning more and more that if you have a small curated set of data that is, like, highly task-specific, doesn't include noise in the sense of bad labels or, you know, blurry images and things like this, I think that there's an opportunity to build a moat in that specific task domain that maybe it's going to be impractical for, say, OpenAI to build for every domain.
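
As a small illustration of the kind of curation being described (and emphatically not the method from the paper), a heuristic quality filter over a text dataset might look something like this.

```python
# Hedged sketch of a simple curation filter: score each example for quality
# with cheap heuristics and keep only the top fraction before training. Real
# pipelines (including the reference-model approaches in the literature) are
# far more involved; this is just to make "curated, low-noise data" concrete.
def quality_score(example: dict) -> float:
    text = example["text"]
    words = text.split()
    score = min(len(words), 512) / 512                              # very short documents score low
    score -= 0.5 * (text.count("\ufffd") > 0)                       # penalize encoding junk
    score -= 0.5 * (len(set(words)) / max(len(words), 1) < 0.3)     # penalize heavy repetition
    return score


def curate(dataset: list[dict], keep_fraction: float = 0.3) -> list[dict]:
    ranked = sorted(dataset, key=quality_score, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]
```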

Daniel Reid Cahn:

Yeah. I like that. I guess now, anyone listening who knows what we do at Slingshot knows why you and I are talking.

Talfan Evans:

I'm preaching to the choir. Yeah. But no,

Daniel Reid Cahn:

but I guess that's how we met. So for anyone who wants to learn more about data curation, pre-training language models, do you have any, like, resources, books, or papers that you'd recommend?

Talfan Evans:

Yeah. Well, I would, you know, first of all, say read the paper that we just published. It's called Bad Students Make Great Teachers, and there are a lot of great references in there. Even if you don't take anything from the science, there are a lot of great references in there.

Daniel Reid Cahn:

Awesome. Well, thanks so much for sharing that. Great paper. Thanks so much for joining us today. This was awesome.

Daniel Reid Cahn:

I feel like you've got me thinking about some really awesome ideas. This was great.

Talfan Evans:

Yeah. Thanks, Daniel. It was great.

Daniel Reid Cahn:

That was great. I don't know about you, but that left me with a lot of thoughts about opportunities in this space and what it takes to train language models, what business opportunities still exist for startups that aren't OpenAI. Anyway, thanks again for joining me, and we'll see you next week.