TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.
Speaker 1:I'm here with David Abel from Google DeepMind. David, can you tell us about the paper that you have at the conference today?
Speaker 2:Yeah. Sure thing. So it's a paper called Three Dogmas of Reinforcement Learning. It's joint work with my amazing collaborators, Mark Ho and Anna Harutyunyan.
Speaker 2:It's a position paper where we argue for some ways we should be thinking differently about the RL paradigm, and we hope it can open up some new research territory for the next couple of years.
Speaker 1:Can you share any highlights that you found at the conference that you really liked?
Speaker 2:Yeah. Definitely. I mean, I think Andy's keynote yesterday was just beautiful. It was amazing to see the story and these anecdotes from his experience with his students in his lab over the decades. And that moment last night when he finished his keynote and everyone gave him a standing ovation, it was quite an emotional and kind of special moment to be here in Amherst, watching Andy, and to see this deep appreciation and value from the community for Andy and all he's done. So that was a really special moment. Yeah.
Speaker 1:I'm here with Kevin Wang from Brown University.
Speaker 3:Yeah. So, at a very high level, it was about tree search algorithms. Like, if you think about Monte Carlo tree search or AlphaZero, but it's a version based on another algorithm called Thinker. And what we wanted to do was get the tree search algorithm to learn by itself how much compute to use.
Speaker 3:So, you know, you don't want your tree search algorithm to take, like, a year before it makes a move in chess, so there's some cost. So what we do is we encode that cost in the rewards, and we let the agent learn by itself to balance the intrinsic benefits of doing more search with this penalty for doing more search that we give it. And it worked really well. So it's able to trade off using compute and getting higher performance much better than the other algorithms.
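To make that idea concrete, here is a minimal sketch of folding a per-simulation compute cost into the reward; it is not Kevin's actual Thinker-based implementation, and the wrapper, its interface, and the cost value are all assumptions for illustration.

```python
# Hypothetical sketch: charge the agent a small reward penalty for each
# tree-search simulation it chooses to run, so it can learn how much
# compute a given decision is actually worth.

class ComputeCostWrapper:
    """Wraps an environment for an agent that reports how many
    simulations it ran before committing to an action."""

    def __init__(self, env, cost_per_simulation=1e-3):
        self.env = env
        self.cost_per_simulation = cost_per_simulation

    def step(self, action, num_simulations):
        obs, reward, done, info = self.env.step(action)
        # Environment reward minus the compute the agent chose to spend.
        penalty = self.cost_per_simulation * num_simulations
        info["compute_penalty"] = penalty
        return obs, reward - penalty, done, info

    def reset(self):
        return self.env.reset()
```

Under this shaping, extra search only pays off when it improves the return by more than its compute cost, which is the trade-off described above.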
Speaker 1:Seems like an awesome idea. And it's a thing that people complain about all the time. Why does everything have to be constant compute? So, that sounds very promising. And can you tell us about anything else at RLC that you really enjoyed?
Speaker 3:Yeah. I mean, it's been a great conference. It's a very small conference, so it's nice to just be able to talk to so many people. It was also nice to get to hear David Silver's keynote, because I feel like we've gotten pretty sparse details on AlphaProof, and I feel like we got some nuggets in the talk today that I don't think have been revealed before.
Speaker 1:I hope he gets to hear about the work that you're doing.
Speaker 3:Yeah. I hope so too.
Speaker 4:Hi. I'm Ashwin Kumar from Washington University in Saint Louis. Yep. I had a paper at the RL Safety Workshop.
Speaker 1:Can you tell us about it?
Speaker 4:Sure. It was about learning fairness in multi-agent resource allocation. So it's called Decaf, a cute name we came up with. The idea is that, you know, in resource allocation, sometimes when you're trying to maximize utilities, you end up with unfair allocations, and it's not straightforward how to optimize in such systems because there's usually some arbitrator who's taking care of the constraints.
Speaker 4:So what we came up with was an algorithm to learn long-term fairness estimates just like we learn utility estimates in RL, and then we show that it works and allows you to trade off between utility and fairness in resource allocation environments in a fairly efficient way. So, you know, it's exciting. I got some great feedback, and I'm looking forward to incorporating that and resubmitting.
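As a rough illustration of that idea, and not the Decaf algorithm itself, one can keep a learned long-term fairness estimate next to the usual utility estimate and score feasible allocations by a weighted combination; the function names and the scalar trade-off weight below are assumptions.

```python
# Rough sketch: score candidate allocations with a learned utility estimate
# and a learned long-term fairness estimate, then pick the best combination.
# Q_utility, Q_fairness, and beta are hypothetical placeholders.

def choose_allocation(candidates, state, Q_utility, Q_fairness, beta=0.5):
    """Pick the allocation that balances utility against long-term fairness.

    candidates: feasible allocations (the arbitrator's constraints are
                assumed to be enforced upstream).
    Q_utility, Q_fairness: callables (state, allocation) -> float, learned
                the same way value estimates are learned in RL.
    beta: trade-off weight between utility and fairness.
    """
    def score(allocation):
        return (1 - beta) * Q_utility(state, allocation) \
            + beta * Q_fairness(state, allocation)

    return max(candidates, key=score)
```

Sweeping beta is one simple way to trace out the utility-fairness trade-off mentioned above.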
Speaker 1:Any other comments on what you particularly enjoyed at RLC?
Speaker 4:The one thing I absolutely enjoyed is the random conversations around the conference, because it's everyone working in the same field. And, well, even within the field there are niches, but each of those niches is interesting to me. So that's what I like the most: hearing diverse opinions. And, you know, of course, the keynotes are amazing.
Speaker 5:Hi. I'm Prabhat Nagarajan. I'm a PhD student at the University of Alberta. I work with Martha White and Marlos there.
Speaker 1:Prabhat, what are you working on these days, and what do you find interesting these days?
Speaker 5:Yeah. These days I've been working on overestimation, studying overestimation back in a lot of the classical deep RL algorithms. I think a lot of the value-based deep RL methods that we're using today still sort of originate from, like, the 2014 to 2018 era. So we're going back and looking at alternative ways to implement some of these classical algorithms, particularly DQN and Double DQN.
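For context, the overestimation he mentions comes from the max in the standard DQN target; Double DQN decouples action selection from action evaluation. A minimal PyTorch-style sketch of the two targets, with hypothetical network handles and batched tensors:

```python
import torch

# q_net and target_net are hypothetical online and target Q-networks;
# next_obs, rewards, dones are batched tensors (dones as 0/1 floats).

def dqn_target(target_net, next_obs, rewards, dones, gamma=0.99):
    # Standard DQN: the same max over the target network both selects and
    # evaluates the next action, which tends to overestimate values.
    next_q = target_net(next_obs).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def double_dqn_target(q_net, target_net, next_obs, rewards, dones, gamma=0.99):
    # Double DQN: the online network selects the action, the target network
    # evaluates it, which reduces the overestimation bias.
    best_actions = q_net(next_obs).argmax(dim=1, keepdim=True)
    next_q = target_net(next_obs).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```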
Speaker 1:Cool. And can you tell us about some things at RLC that you particularly liked?
Speaker 5:Yeah. I think for me, well, I guess it's twofold. One is sort of meeting a lot of old peers, old research friends, and also making a lot of new ones. And I think the highlight for me was probably Andy Barto's talk. I walked away with a lot of historical knowledge that I didn't have before and a ton of references to go back and hunt down, just to be more literate about my RL history, I guess.