TalkRL: The Reinforcement Learning Podcast

Thanks to Professor Silver for permission to record this discussion after his RLC 2024 keynote lecture.   

Recorded at UMass Amherst during RLC 2024.

Due to the live recording environment, audio quality varies.  We publish this audio in its raw form to preserve the authenticity and immediacy of the discussion.   

Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

Speaker 1:

The function approximator is doing some form of unrolled planning internally as it runs. So essentially, it is meta-learning a planning algorithm that it executes during inference. How do you think about that? Because I can imagine that meta-learning a planning algorithm, in the same way that you're meta-learning RL, might actually be a much more powerful technique.

Speaker 2:

Great idea, and I think it should be better. Yes. I do not believe that MCTS by any means is, like, the end point of planning methods. And I find it almost unfortunate that we're still using it after all this time, and that we haven't yet been able to build systems that learn how to search more effectively.

Speaker 2:

We've had, like, some little probes in that direction. Things like MCTSnets were an attempt to learn how to do better. But absolutely, yeah, the future of planning will be systems that learn to plan for themselves.

Speaker 1:

And, I mean, if you have an RNN, all systems are doing this to some extent. Right? Like, any recurrent system is learning some kind of feedback on its actions, its past history. Do you think this is just a matter of balancing compute during training versus inference, where we need to give it more planning compute during inference than we do during training? Or do you think we just need deeper networks so that we can have a deeper unrolled planner?

Speaker 1:

Essentially, I mean, this is kinda what I'm trying to build. I'm curious if you have any intuitions.

Speaker 2:

Yeah. I mean, we faced a lot of those trade-offs ourselves, and I think, you know, it comes down to really understanding the scaling laws. Right? You can have one scaling law for what happens if you increase the size of your network, and another scaling law for what happens if you use the same compute to do more planning steps. And the question is, what's the relative gradient of those scaling laws, and we use that to guide the decision of which is most valuable.
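
To make that concrete, here is a minimal sketch of the kind of comparison being described. The power-law form, the numbers, and the function names are illustrative assumptions, not anything from the talk: fit a curve to performance versus compute for each way of spending compute, then compare the local slopes to decide where the next unit of compute should go.

```python
import numpy as np

def fit_power_law_exponent(compute, score):
    """Fit score ~ a * compute^b in log-log space; return the exponent b (the 'gradient')."""
    b, _log_a = np.polyfit(np.log(compute), np.log(score), 1)
    return b

# Hypothetical measurements: performance when the same compute is spent two different ways.
compute = np.array([1e15, 2e15, 4e15, 8e15])            # total training FLOPs (illustrative)
score_bigger_net = np.array([0.52, 0.58, 0.63, 0.67])   # grow the network, planning fixed
score_more_plan  = np.array([0.52, 0.56, 0.62, 0.69])   # same network, more planning steps

b_net = fit_power_law_exponent(compute, score_bigger_net)
b_plan = fit_power_law_exponent(compute, score_more_plan)

# Spend the next unit of compute on whichever axis currently has the steeper scaling law.
print("scale the network" if b_net > b_plan else "add planning steps")
```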

Speaker 1:

Oh, and you can imagine that it takes longer to meta-learn the planning algorithm, so you might not even discover a good planning algorithm for a while.

Speaker 2:

Yeah.

Speaker 3:

Can I ask you about the second lesson you had? The one where you were talking about delusions and stuff?

Speaker 2:

Yeah.

Speaker 3:

So the takeaway, right, was just, like, bigger model, do more RL, and it'll eventually close lots of those holes. And this is not to say that Go or Atari or whatever is, like, a simple environment, but it is much simpler than something like the real world. And I very much subscribe to the big world hypothesis. Do you think we can expect that even in much, much more complex environments, where we know we're not gonna be able to learn nearly everything?

Speaker 2:

Yeah. I mean, I think we're not going to be able to close every hole, because the world is too big. So, you know, inevitably, there are going to be some inaccuracies in any model that the system builds of how the world works.

Speaker 2:

However, I think it's reasonable to expect that with more RL, we should do better and better at closing the holes. So the system should keep improving. I don't think it will ever reach the point where it has a perfect understanding of every aspect of the world, because that would violate the big world hypothesis.

Speaker 3:

And I guess maybe a better way to ask the question is: do you think more RL and bigger models will lead to more graceful degradation when we encounter those holes?

Speaker 2:

Yes.

Speaker 3:

Right? Okay.

Speaker 2:

I do. Thank you. Okay.

Speaker 4:

I have a question about the maths at the end of the talk. Did the system have any way of explicitly learning abstractions? When the system discovers a lemma, does it memorize the lemma, or is it all implicit?

Speaker 2:

Yeah, it's a good question. So we explored a bunch of methods that tried to do that. I would say most of those have been kind of left for future work.

Speaker 2:

So at the moment, each problem is being solved largely independently from the others. Most of what's happening is being learned in the weights, rather than, I guess, what you're asking about: memorizing particular bits of specific knowledge and being able to pull those in. And that's largely future work, I would say.

Speaker 4:

Yeah. So the system is solving the problem at the level of formal mathematics, step by step, using some kind of learned...

Speaker 2:

Yeah. And I think some of what you're describing happens in the weights. Like, I think when it's seen a large number of examples, each of which requires the same kind of lemma to be used, that gets internalized into the weights. But so far, we didn't have an explicit mechanism for that.

Speaker 2:

Yeah.

Speaker 5:

Thank you. So you mentioned that AlphaProof takes, like, 3 days versus humans taking about 9 hours. I was wondering what it spends the majority of those 3 days on. What are some of the main challenges there, and how do we solve them?

Speaker 2:

I think the 3 days could be dramatically improved. We already have ideas that we think will make it much faster. So I don't think we should over-index on the fact that it took a long time. A lot of this work was put together to try and see if we could do something on this particular IMO, and I would hope that we can do things significantly faster in future competitions or with future methods that we want to work on. In terms of particular challenges, I think the hardest challenge is probably the kind of problem I showed.

Speaker 2:

There are these very open-ended problems that you find in combinatorics. And I would say our system is remarkably consistent in algebra and number theory. But these very open-ended questions touch on a much broader range of possible intuitions than we currently manage to encode into the system. So I think there's a big challenge there to see, you know, can we improve its strength across that dimension.

Speaker 5:

So is the solution kind of showing it more examples of such problems, or, you know, some kind of better search for finding solutions on its own?

Speaker 2:

I think the problem comes at every step. So, actually, for example, in the formalization: formalizing these open-ended problems is also harder. So even for humans, formalizing that particular problem I showed, Turbo the snail, is very hard, actually. It took humans quite a long time. The other ones were fairly straightforward.

Speaker 2:

That one took a long time for humans to formalize. And similarly, it is harder for our system to formalize those kinds of open-ended problems. So I think we probably ended up with fewer good examples in our data. And in addition, they're also harder for a machine to solve and get a handle on.

Speaker 2:

So I think at every level, they're a little bit trickier for the kind of approach we took. So, yeah, it'll be interesting to see what happens in the future.

Speaker 6:

I have a follow-up question. So when we are trying to compare a human with AlphaProof, you were saying the human takes 9 hours to achieve this result, and we achieve it in 3 days. So if you want to have a fair side-by-side comparison, how do you think we can quantify the intelligence of a human in terms of GPU power?

Speaker 4:

Yeah.

Speaker 6:

Let's say in papers, we always talk about the computing resources we used and compare on that basis.

Speaker 2:

Yeah.

Speaker 6:

So how do you think we can make it a fair comparison?

Speaker 2:

I think it's very hard, because they're just 2 completely different computational models, and so it's really hard to compare the amount of computation that the human brain uses with some totally different system: totally different hardware, totally different software. Every layer of the stack is somehow different. So what we've always done historically is chosen to control the only thing which we can control, which is the wall-clock time.

Speaker 2:

So if you really wanted to say, you know, has this been done fairly, I think you would say, well, we should do it in the same amount of time as humans.

Speaker 6:

Give humans 3 days and see what happens.

Speaker 2:

Or, the other way around, we should do this in, you know, 2 times 4 and a half hours, the way that humans do. I think that would be the clearest indication; at the least, what it tells you is that you've got 2 totally different systems. But, yeah, given a fixed time constraint, have we reached the point where machines are able to, in the same amount of time, achieve the same results that a human could? I think that would be a fair comparison. Sorry.

Speaker 2:

Let's give other people a chance to...

Speaker 7:

I wanted to ask you a question about the meta-RL work. So, in thinking about that, I think about no-free-lunch theorems and results that say, oh, you can't get any better than this. I think that's not generally true, but it seems to me that if you're going to make progress on meta-RL, that requires having the system learn something about the kinds of problems we're actually interested in solving. And it seems like that's the kind of thing you're training, which is also similar to LLMs.

Speaker 7:

And I wonder if you think there's any connection between what you learn when you're doing meta-RL, trying to get a good grasp of how the world works for that task, and what you learn when you're training an LLM on all the text that has been written, and whether you think one of those approaches is a better way to get to general reasoning, or if they need to be combined.

Speaker 2:

Yeah. I think they're solving quite different problems, LLMs and this meta-learning approach. But what I would say is that what's beautiful about the meta-learning approach is that it tailors itself to the data. So maybe this algorithm we trained isn't good at a particular type of RL environment. But if you were to train in that kind of RL environment and add it into the mix, it would learn how to get better.

Speaker 2:

It would come up with new algorithms which are tailored to those things. Whereas at the moment, we're in this world where we kind of have humans do that process. We all know as RL practitioners that RL algorithms aren't yet at the point where there's this one-size-fits-all approach; supervised learning has kind of reached that point, but we don't have that in RL. And so as humans, we're continually tailoring the toolkit we have to the problem we're trying to solve. And here we have something which addresses that in 2 ways. First, it's like, well, if we give it a broad enough range of problems, maybe it can find one size fits all, or at least something which takes as context the particular problem it's trying to solve.

Speaker 2:

And the second thing is, you know, if we really are, as humans, prepared to spend more time on this particular problem to do better, well, let's continue to meta-train on that particular problem, and we can take that human labor away and come up with something that's really specifically good at this kind of problem. So I think tailoring to the data is really important, because the data tells you, you know, what you really want. And so, yeah, we can learn a lot that way.

Speaker 7:

Do you think you're gonna be able to find some set of problems that will produce better generalization, or do you think it's a matter of coming up with something that's reasonably general and then just sort of fine-tuning the meta-RL over and over again whenever you get to a new class of problems you wanna solve?

Speaker 2:

I mean, what we've seen, the result we've seen so far, is that as you broaden the set of training environments that we meta-train on, we get strictly improved held-out generalization to other environments.

Speaker 7:

Does it also improve the previous environments that you've worked on? Earlier stuff in the train set?

Speaker 2:

So it at least stays the same and sometimes improves, depending on the environment. So it seems to be, you know, strictly improving.

Speaker 7:

So there's some generalization between sets of problems?

Speaker 2:

Yes. There's generalization. And so the hope would be that now we have a scalable formula that allows us to add more and more problems into the training set and come up with an RL algorithm that's kind of more and more general across those. So I don't wanna claim that, you know, what we did would, for example, work for RLHF or something totally different that's really out of the distribution we used. But maybe if you start to add in these other cases, or do a separate pool of those, or however you want to use it, it gives us a toolkit to learn to do that. Yeah.
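
As a purely illustrative sketch of that "scalable formula": an outer loop that scores a parameterized update rule by its average success across a whole training distribution of environments, so adding environments broadens what the discovered algorithm has to cover. Everything below, the bandit environments, the two-parameter update rule, and the random-search meta-optimization, is an invented toy stand-in, not the actual method from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "environments": 5-armed bandits with different reward means (stand-ins for a
# diverse training distribution).
def make_env(seed):
    r = np.random.default_rng(seed)
    means = r.normal(size=5)
    return (lambda a: means[a] + r.normal(scale=0.1)), means

def inner_train(env, meta_params, steps=200):
    """Run a tiny 'learned' update rule: value estimates adjusted with a meta-learned
    step size and exploration bonus (a made-up parameterization for illustration)."""
    step_size, bonus = meta_params
    q, counts = np.zeros(5), np.zeros(5)
    for _ in range(steps):
        a = int(np.argmax(q + bonus / np.sqrt(counts + 1)))
        reward = env(a)
        counts[a] += 1
        q[a] += step_size * (reward - q[a])
    return q

def meta_objective(meta_params, env_seeds):
    """Average final performance across the training distribution: the quantity the
    outer loop optimizes."""
    total = 0.0
    for seed in env_seeds:
        env, means = make_env(seed)
        q = inner_train(env, meta_params)
        total += means[int(np.argmax(q))]   # true value of the arm the learned rule ends up preferring
    return total / len(env_seeds)

# Crude random-search outer loop; adding more seeds means adding more environments,
# which is the sense in which the formula scales.
env_seeds = range(20)
best = max(((rng.uniform(0.01, 1.0), rng.uniform(0.0, 2.0)) for _ in range(200)),
           key=lambda p: meta_objective(p, env_seeds))
print("meta-learned (step size, bonus):", best)
```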

Speaker 5:

David, can you sort of reflect on the reward-is-enough hypothesis? It sounded like you still very much 100% believe in it. Or has your view changed a little bit in the past couple of years?

Speaker 2:

Yeah. Thank you. So, for those who don't know it, the reward-is-enough hypothesis was a hypothesis we wrote to kind of challenge the community to think about what would be required to get to superhuman intelligence. And the hypothesis is, you know, a challenging one: to say that maybe all we need to do is put a powerful RL agent into a complex environment and ask it to maximize the reward, and maybe that would be sufficient for intelligence in all its complexity and beauty to emerge.

Speaker 2:

So I would say the hypothesis is still a really powerful guiding light, and I use it as one. Now, I don't know how far it will hold, and whether it will hold fully or perfectly. That's kind of not the point of the hypothesis. The point is that it provides a sort of sense of optimism and motivation: inasmuch as this hypothesis holds true, it gives us one really clear strategy to follow that will take us further and further towards AGI, towards superhuman intelligence.

Speaker 2:

And I think that's really exciting and inspirational because, you know, if we don't have that, then we're kind of lost, I think. So having a strategy that may work, or may take us a really long way, is really important. Now, has my opinion changed on it? I would say one thing which has maybe changed is that when we wrote reward is enough, we kind of left room for the idea that there could be powerful priors. But I think the advent of foundation models and LLMs has shown that those powerful priors can really take you a long way.

Speaker 2:

So I kind of want to be more explicit and welcoming to the idea that, you know, you can start with whatever you want, as long as you then do a massive amount of RL on top of it. That can take you really far. So, you know, why not, say, embrace the idea that there are these, yeah, sort of launching points for the agent. Yeah.

Speaker 8:

Great. Another question about

Speaker 2:

Let's sorry, 1 and then 2. Yeah.

Speaker 8:

And, yes, I was wondering how, like, this meta-RL and discovered RL you mentioned can be applied to, let's say, a multi-agent RL problem?

Speaker 2:

Yeah. So we haven't yet tried it on multi-agent RL problems, but I don't think there's anything in the formulation that is specific to single-agent RL. Well, let me say it differently: the simplest and clearest approach to multi-agent RL is to say, well, you just have individual agents, each following their own RL algorithm, and then you could take the meta-objective to be something like the overall success of all agents. Maybe you could do something like that, and I would expect it to learn an approach that's maybe gonna be effective.

Speaker 2:

It's a great idea. We just haven't tried it yet, but I think it would be super interesting to know how that goes. Yeah.
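
To make that suggestion concrete, here is a minimal, purely hypothetical sketch: a couple of agents each run their own copy of a meta-parameterized learning rule in a shared game, and the meta-objective scores the rule by their combined return. The game, the update rule, and its two parameters are invented for illustration; the only point is that nothing in the outer loop is specific to the single-agent case.

```python
import numpy as np

def multi_agent_meta_objective(meta_params, seeds, steps=200):
    """Score a candidate update rule by the overall success of all agents: two agents
    independently run the same (step_size, explore) rule in a shared 3x3 matrix game."""
    step_size, explore = meta_params
    total = 0.0
    for seed in seeds:
        rng = np.random.default_rng(seed)
        payoff = rng.normal(size=(3, 3))        # shared payoff: both agents receive payoff[a1, a2]
        q1, q2 = np.zeros(3), np.zeros(3)       # each agent's independent action values
        ret = 0.0
        for _ in range(steps):
            a1 = rng.integers(3) if rng.random() < explore else int(np.argmax(q1))
            a2 = rng.integers(3) if rng.random() < explore else int(np.argmax(q2))
            r = payoff[a1, a2] + rng.normal(scale=0.1)
            q1[a1] += step_size * (r - q1[a1])  # each agent applies its own copy of the rule
            q2[a2] += step_size * (r - q2[a2])
            ret += r
        total += ret / steps
    return total / len(seeds)

# Example: evaluate one candidate rule across a handful of games; an outer meta-loop
# would search or ascend over (step_size, explore) using this score.
print(multi_agent_meta_objective((0.1, 0.1), seeds=range(10)))
```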