TalkRL: The Reinforcement Learning Podcast

David Silver is a principal research scientist at DeepMind and a professor at University College London. 

This interview was recorded at UMass Amherst during RLC 2024.   

Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

Robin:

I'm here with Professor David Silver of DeepMind, at RLC. Professor Silver, in today's talk you showed a result on discovering better RL algorithms, using RL itself. And at these RL conferences, there's a major focus on new methods, new algorithms. How far away is the day when it's no longer really practical for human RL researchers to improve algorithms themselves, and the focus becomes purely on meta-RL, which presumably only the big labs can participate in? Basically, will small-scale RL research still matter at that point?

David:

Yeah. It's a great question. So what we've done so far is focus on a single piece of the puzzle, which is the update rule that you use in RL. That's really one big piece of the whole algorithm, but it's not the only piece. By the update rule, I mean something like whether you're doing Q-learning or auxiliary tasks or policy gradient, essentially what the loss is: the objective that you try to optimize via gradient descent or so forth. And I think in that space we've found that we can make a lot of headway in applying meta-learning, learning from a set of environments which update rule is most effective, which is an exciting step.

David:

But I don't think we're anywhere near the point where we're putting ourselves out of work, partly because there are these other pieces of the puzzle that work with it: the nature of the function approximator, the nature of the optimizer, how all the pieces are put together, and so forth. But maybe more importantly, the way all of these algorithms work is that they still have some objective. So we are actually learning which algorithm is most effective with respect to some human-designed proxy objective. For that we use something like a vanilla policy gradient algorithm.

David:

So there's still human design in the loop, and we find that if you improve that proxy objective, it still makes the whole meta-gradient algorithm work better. So we're by no means putting ourselves out of work; there's still a lot of room for humans to make things better. And in fact, even the space in which we designed our meta-networks to optimize, a lot of that came from making sure that the meta-network design was able to support all of the different discoveries that humans have made over the years, all the different types of algorithms that we've developed; we wanted to make sure that the meta-network was sufficient to support those. And I think if humans were to make a big breakthrough, maybe we would change the meta-network design to follow. So I think at the moment we're just at the beginning of this phase change, maybe, where meta-learning is starting to be very successful, but it's by no means finished.

David:

It's still a long road ahead until that's the only game in town.
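
To make that structure concrete, here is a minimal sketch of the meta-gradient idea David is describing, written in JAX. The names, shapes, and the tiny learned loss are illustrative assumptions, not DeepMind's actual system: an inner agent is updated by a learned loss whose parameters (`eta`) play the role of the meta-network, and those meta-parameters are trained by differentiating a hand-designed proxy objective (here a vanilla policy-gradient loss) through the inner update.

```python
import jax
import jax.numpy as jnp

def policy_logits(theta, obs):
    # Tiny linear policy: observations -> action logits.
    return obs @ theta

def learned_loss(eta, theta, obs, actions, returns):
    # Stand-in "meta-network": a learned re-weighting of per-step log-prob terms.
    logp = jax.nn.log_softmax(policy_logits(theta, obs))
    chosen = jnp.take_along_axis(logp, actions[:, None], axis=1)[:, 0]
    weights = jnp.tanh(eta[0] * returns + eta[1])  # learned target/weighting (assumed form)
    return -(weights * chosen).mean()

def proxy_objective(theta, obs, actions, returns):
    # Hand-designed proxy: a vanilla policy-gradient objective.
    logp = jax.nn.log_softmax(policy_logits(theta, obs))
    chosen = jnp.take_along_axis(logp, actions[:, None], axis=1)[:, 0]
    return -(returns * chosen).mean()

def meta_objective(eta, theta, batch, lr=0.1):
    obs, actions, returns = batch
    # Inner step: update the agent with the learned loss...
    grads = jax.grad(learned_loss, argnums=1)(eta, theta, obs, actions, returns)
    theta_updated = theta - lr * grads
    # ...then score the updated agent with the human-designed proxy objective.
    return proxy_objective(theta_updated, obs, actions, returns)

key = jax.random.PRNGKey(0)
k_obs, k_act, k_ret = jax.random.split(key, 3)
theta = jnp.zeros((4, 2))        # 4-dim observations, 2 actions (toy sizes)
eta = jnp.array([1.0, 0.0])      # meta-parameters of the learned loss
obs = jax.random.normal(k_obs, (32, 4))
actions = jax.random.randint(k_act, (32,), 0, 2)
returns = jax.random.normal(k_ret, (32,))

# Meta-gradient: differentiate the proxy objective *through* the inner update.
meta_grads = jax.grad(meta_objective)(eta, theta, (obs, actions, returns))
print(meta_grads)
```

In a full system the inner update would run over many environments and many steps, but the key point survives in this toy version: the meta-parameters are improved only insofar as they make the agent better under the human-chosen proxy, which is why improving that proxy still improves the whole discovery loop.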

Robin:

So I understand AlphaFold does not use RL at this point. Is there a path to scaling AlphaFold's protein folding predictions with RL, the way that you spoke about the other challenges?

David:

Yeah. That's a great question. So, actually, I was one of the people involved at the very beginning of the AlphaFold project. What happened was, we'd been very excited; Demis, in particular, was very excited and driven by this question of: could we apply the methods that were so effective in AlphaGo to try and solve the protein folding problem? And so, originally, we were viewing this as an RL problem, with folding as an action, trying to optimize a reward that corresponds to how well folded the whole structure is.

David:

And I have to say that one of my biggest contributions to the project was encouraging the team to stop viewing this as an RL problem. If I'm really honest about it, not all problems are best suited to RL. You really have to recognize the problems which are natively better understood in a different way. And in this case, I think it became clear to me over time that this was better understood and modeled as a supervised learning problem, based on the data that we had. And that led to a lot of rapid progress and, in the end, to the success we had.

David:

Having said that, I think there are hugely important problems in the space of related problems, things like protein design, which are much more suited to RL than to supervised learning. Because now you much more clearly have an action space where we're trying to design drugs. The design space is non-differentiable. All of RL is probably going to be necessary, or at least search is going to be necessary. And so I'd expect there to be much more crossover and fruitful application of RL there.

Robin:

So when I looked at your Google Scholar to prepare for this interview, I noticed that in recent times there are a lot of patent applications and not a lot of papers. Have you changed your focus away from publishing for now? Or is that a temporary thing, or is that a trend for you?

David:

I think publication's really important. The honest answer is that I was quite ill for a while. After COVID, I got long COVID, and I just wasn't super well for a year or two. And so I had to scale back a lot of my activities, and sadly, one of the things which I deprioritized was publishing conference papers.

David:

And I decided, for the amount of time and energy that I had, to really just use that time and energy to focus on the bigger results. I'm feeling much better now; my health is back to 100%. But it was just a shift that happened because of that. And I do think that publication is enormously important for the field, and I've always felt that the work that we do deserves to be understood and shared wherever possible, within the constraints of being in a commercial organization.

David:

But, on the whole, that is and always will be my modus operandi.

Robin:

I'm glad to hear you're better. Going back to AlphaZero: in the paper, there was a curve showing how Elo scaled with training, and it reached a plateau and kind of stayed at that plateau for a very long time. Do you have a sense of what the limiting factor was that determined the height of that plateau?

David:

Yeah. That's a great question. What we saw was that in some games, AlphaZero continued learning indefinitely without any plateau, and in some games it appeared to at least reach something more like a plateau. And it seems like the defining characteristic is whether draws are possible in those games.

David:

So in the game of Go, it's not possible to have a draw, because of the way the game is designed. There's this rule called komi, which means that one player or the other always has to win by at least half a point. In chess, when you start to get to very high-level play, most games are draws. And I think what we're seeing in those AlphaZero plots is the fact that as you get to this very high-level play, you're just getting less feedback out of the system because most of the games are draws. You've got AlphaZero in self-play.

David:

Most of the time, it's going to draw against itself, and so everything just slows down a little bit. And I think there are reasonable mechanisms to get around that, but those just weren't in those plots. But I think that's really in the nature of the problem rather than something fundamental in the algorithm.

Robin:

A lot of grad students listen to this podcast. Do you have any words for the young grad student who's aspiring to get into RL research?

David:

Pick a problem that you really believe in and go for it. And don't be afraid to go for challenging problems. I think it's much better, in my opinion, to try and do something glorious in research and maybe fail, than it is to do something you know is incremental and be guaranteed to succeed. And so if I have one tip, it's something I've always tried to do in my career: find the sweet spot, where you look for problems which are just about within range, not completely beyond possibility, but which are really not straightforward at all. In my mind, I'm quite an optimistic person, but nevertheless I try to choose problems where I believe the chance of success is at most 50%.

David:

But roughly that 50/50 sweet spot I've found to be helpful. And going for that is really helpful because, if you look at the rate of progress in AI, it's fast. And if you're working on something which you think is 50/50 today, it might well be solved by the end of your master's or PhD or whatever you're getting into. So I think that would be my advice: be bold, be ambitious, and just go for it, because you have a chance to do something special with your precious moments on this planet and really work on something that makes a difference.

Robin:

And finally, if young David Silver, doing his PhD, could see all of this today, what would he say, or how would he feel?

David:

Well, I think it's a great question. I was very focused on computer Go in the early days, but at the same time, I always believed that the long-term mission was to try and reach AGI, to get to some kind of level of superhuman intelligence. And so I think I'd be just amazed by how much progress there's been and really excited to know that that was coming down the line. And if you can imagine that the future is going to be much more exciting than where we are today, it's really motivating to know that whatever you do now is going to feed into this world where things are moving so quickly and building on each other so fast.

David:

And I think that would probably be my advice back to myself: just enjoy the moment, because everything moves so quickly. Whatever you do, it feels a long way away at the time, but before you know it, it's here and rapid progress has happened. And building on all the work that people do, it's extremely satisfying if you're in any way able to contribute a small cog to that bigger machine of all the research progress the field is making.

David:

It's an amazing feeling. For those of you who've chosen to focus on reinforcement learning, it's an exciting and ambitious choice, and I just wish you all the best in making it successful.