TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.
TalkRL.
Speaker 2:TalkRL Podcast is all reinforcement learning, all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at TalkRL Podcast. I'm your host, Robin Chauhan. Okay. We're doing a hot takes episode on what sucks about RL.
Speaker 2:We're at the RL meetup after NeurIPS on Wednesday, here in Vancouver at the Pearl Club. And what sucks about RL?
Speaker 3:My name is Matt Olsen from Intel Labs, and I think RL sucks across the board.
Speaker 2:Tell me why. Why does RL suck?
Speaker 3:It is too hard to train. It is impossible. There are way better ways to get your models to do what you want.
Speaker 2:Okay. What's the better way?
Speaker 3:Supervised fine-tuning, clearer labels, humans in the loop, much better algorithms. Get your humans in there. Stop automating everything with sparse rewards.
Speaker 1:Yeah.
Speaker 3:They're terrible.
Speaker 2:Get humans in those loops, guys. Okay.
Speaker 4:This is Neil Ratzlaff. Tabula rasa RL sucks. You can't learn everything from nothing. Never try it. Nothing works like that.
Speaker 4:Impose some structure. Impose some rules. Do some learning first about the world, then do RL.
Speaker 5:Love it. Love it.
Speaker 2:You have a take? Yeah. Why does RL suck? What about you, what sucks about RL?
Speaker 6:I got excited about RL because I got into it from computational game theory. But then I realized I had only known the elegance of it, and I was just kinda disappointed.
Speaker 2:Awesome. Okay. What sucks about RL?
Speaker 7:So I think that the biggest problem is the value function learning. And particularly, if you learn a value function in simulation and you deploy it on a real robot, then you have a big mismatch.
Speaker 7:And, basically, you lose all the time that you invested in learning the value function when you do the deployment on the real system.
Speaker 2:So the sim-to-real gap?
Speaker 7:Sim to real, but specifically on the value function. And it's also very, very difficult to stabilize the learning, particularly if you have specific policy structures that are not neural networks or something like that.
Speaker 2:Love it. Do you wanna share your name?
Speaker 7:Yes. I'm Davide Tateo, and I'm a research group leader at TU Darmstadt, in the lab of Jan Peters.
Speaker 2:Awesome. Thanks.
Speaker 4:We are trying to solve a problem that's too big, and we refuse to concretize it to make it actually tractable.
Speaker 2:Love it. Do you wanna throw in your name or a fake name?
Speaker 4:Yeah. No. No. I'm Claas, Claas Voelcker from the University of Toronto. Everything.
Speaker 8:The only thing it should be used for is to train LLMs.
Speaker 2:Love it. Okay. And do you wanna throw in your name or a fake name?
Speaker 8:I'm gonna use my adviser's name for this. No. My name is John Doe.
Speaker 2:Do you wanna throw one in?
Speaker 9:What sucks about RL is that we're still doing RL with no prior for the most part. That needs to stop.
Speaker 2:Love it. So there's no tabula rasa?
Speaker 9:No tabula rasa. That's gone.
Speaker 2:And your name was? Seth. Sample efficiency. You wanna say more about that? There isn't enough of it. We need more.
Speaker 7:Hey, guys. That was my take.
Speaker 10:People don't do RL well, so I don't know whether RL sucks or people do.
Speaker 2:So what should they be doing?
Speaker 10:I think we need, like, streamlined experimental pipelines, which are not there. Oh, okay. I mean, they are there, but, like, it's still hard. Like, only a few people can do it well.
Speaker 2:So you mean they're not setting up experiments in a scientific manner? Is that what you're saying?
Speaker 10:There are not enough good pipelines to have, like, validated RL outcomes. So, I mean, it sucks because you can't rely on it, because the experiment pipelines are not good enough. Or they're not good enough to the extent that a large number of people can do it.
Speaker 12:Hyperparameters.
Speaker 2:Is that the only thing?
Speaker 12:No. Well, hyperparameters in general suck. Like, we just shouldn't have them. Just cancel all the hyperparameters.
Speaker 12:Just finding the right ones. Sometimes it's just searching through hyperparameter space to find policies that work.
Speaker 13:I don't know, man. Everything. No, I'm just kidding. Not everything. Hyperparameters.
Speaker 2:Oh, wait. Okay. That's a good one. What sucks about hyperparameters?
Speaker 13:I don't know, it's like getting everything to work is black magic. Like, literally seeds. Not even hyperparameters. It's so brittle sometimes. I do multi-agent RL.
Speaker 13:Literally, I had the exact same hyperparameters and I changed the seed. And it works one time and the other time it collapses.
Speaker 2:That's your first mistake, doing multi-agent.
Speaker 13:Oh, 100%. Yeah. No. Yep. You're right.
Speaker 14:You're right. Yeah.
Speaker 2:Yeah. Yeah. Awesome. Josh McClellan. Thanks, Josh.
Speaker 12:Grayson Brothers at Johns Hopkins Applied Physics Lab. Generalization. Sure. Like, you can train it to do a certain task, and you can give it a bunch of different examples of that task. But as soon as you move it into an area where it sees something out of distribution, or maybe a scenario that it never happened to explore during training, you have no guarantees in terms of what it's gonna do. And depending on your application, if you have an autonomous system, like driving, and it sees something it's never seen in the training data, and you have no guarantee of what it does, that could be dangerous.
Speaker 15:I'm David Beckham. And I think RL sucks because it overfits to spurious correlations.
Speaker 2:Okay. Love it. Is that the only thing that sucks about RL?
Speaker 15:Yeah. That's it.
Speaker 2:Okay. Thanks, David.
Speaker 5:I'm Harley Wiltshire from Mila. And as a distributional RL enthusiast, I'm frustrated that people only care about risk-neutral RL.
Speaker 2:Okay. Love it. Well, who are you and what sucks about RL? Okay. I'm Glen Berseth.
Speaker 2:I'm also, like, a professor at Mila, and almost everything sucks about RL. Love it. Okay. Do you wanna be any more specific than that, since you're an RL professor? It really sucks that RL still doesn't really work in the real world like I want it to.
Speaker 2:Yeah.
Speaker 11:So I'm an R and D scientist at Ubisoft, the video game company. And so what sucks about RL is that it doesn't work in a non-trivial fucking environment. Okay.
Speaker 16:I learned to walk, you know, it took me, I don't know, a thousand trials or something. Not ten fucking billion trials.
Speaker 17:I don't know.
Speaker 16:Have you seen those fucking, like, walking gaits? You know? They're crazy. I walk, you know. I'm a reasonable walker.
Speaker 16:These guys are, like, walking, yeah. I know. I'm Leticia and I work on LLMs, so it's very nice.
Speaker 2:Yeah. Thanks, guys. Anyone else?
Speaker 8:My name is Mateusz Ostaszewski, and nothing. I love RL.
Speaker 2:Okay. You got me. Okay. What sucks about RL?
Speaker 17:Lack of formal guarantees in a lot of circumstances, like safety-critical contexts.
Speaker 2:What sucks about RL?
Speaker 5:There's not enough of it.
Speaker 2:Okay. We should do more RL. Yes. Okay. RL, all the things.
Speaker 2:Everything. Everything. Everything. Every possible thing. Yeah.
Speaker 2:RL. Okay. What about you?
Speaker 9:I mean, it would be cooler if there were more, like, pre-training paradigms for RL, like in these language models. And there's probably some of that with, like, the zero-shot RL stuff. But if they were more generalizable, you could leverage, like, I don't know, existing stuff to learn policies faster.
Speaker 2:Love it. Okay.
Speaker 9:Tabula rasa is a big problem. I feel like it can be, not easily, but, like, there are mechanisms to address it with these, like, multimodal models. I think you can leverage the knowledge that they have to inform your policies, to be smarter than, like, learning randomly from scratch. Yeah.
Speaker 2:You could say, like, maybe animals and humans don't learn completely from scratch, because we've evolved and we have some, like, inbuilt instincts that are like a prior for us.
Speaker 9:Yeah. We can generalize principles. Like, we know how certain items behave, how to manipulate certain things from past experiences, and that should be able to generalize, which is not being done in the way that we train, like, base RL policies right now, I guess.
Speaker 2:Okay. So, what sucks about RL?
Speaker 17:Nothing. RL is just perfect. Everything works all the time. It's flawless. It's totally flawless.
Speaker 17:Yeah. It generalizes.
Speaker 9:It's continually learning and doing things.
Speaker 2:It's just perfect. It works on every seed. Oh, wow. It works on every seed value.
Speaker 17:Yes. Super easy to reproduce. Very stable. Yes. You you you have something that works on one domain and there you go.
Speaker 2:I love it. It's magical. It's like rainbows and unicorns.
Speaker 17:It's so good that you don't even need a test set. I love it.
Speaker 18:I'm Cathy Wu. I'm an associate professor at MIT, working on reinforcement learning for mobility. And reinforcement learning sucks because it doesn't really work.
Speaker 2:I saw you at RLC, I think, on the panel. Okay. Why doesn't it work? And what are we gonna do?
Speaker 18:So it doesn't work because it's super sensitive to all sorts of things. There's been a line of papers since, like, 2018 on this. So, like, sensitive to code-level implementations, hyperparameters, and whatnot. And, like, we also found that it's sensitive to, you know, tweaks to the basic benchmarks, like pendulum: change the mass of the pendulum, the length of the pendulum. Super sensitive.
Speaker 18:So we don't have a full solution, but we're starting to find some ideas around, like, transfer learning, and, like, basically building algorithms on top of unreliable algorithms to make them more reliable. Yeah.
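To make that benchmark-sensitivity point concrete, here is a minimal sketch (not from the episode) of how one might probe a fixed policy on Pendulum-v1 under perturbed dynamics in Gymnasium. The `evaluate` helper and the `zero_torque_policy` stand-in are illustrative assumptions, and the `m`/`l` attributes are internals of the classic-control pendulum implementation rather than a stable public API.

```python
# Illustrative sensitivity probe: score a fixed policy on Pendulum-v1 while
# perturbing the pendulum's mass and length (internals of Gymnasium's
# classic-control implementation; treat these attributes as an assumption).
import gymnasium as gym
import numpy as np

def evaluate(policy, mass, length, episodes=10, seed=0):
    env = gym.make("Pendulum-v1")
    env.unwrapped.m = mass    # default 1.0
    env.unwrapped.l = length  # default 1.0
    returns = []
    for ep in range(episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    env.close()
    return float(np.mean(returns))

# Stand-in for a trained actor; swap in your own policy here.
zero_torque_policy = lambda obs: np.array([0.0], dtype=np.float32)

for m, l in [(1.0, 1.0), (1.5, 1.0), (1.0, 1.5)]:
    print(f"mass={m}, length={l}, avg return={evaluate(zero_torque_policy, m, l):.1f}")
```

Running the same loop with an actual trained policy is one way to see the kind of sensitivity being described: performance on the default dynamics often does not carry over to modest changes in mass or length.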
Speaker 18:What else have you heard?
Speaker 2:I mean, honestly, it's a lot of similar themes. Brittleness, sample inefficiency. Everyone starts tabula rasa. The experimental methods are weak. To me, I think that, like, it doesn't generalize.
Speaker 2:I guess I'm recording this part. I think some of the complaints are actually complaints about deep learning function approximators, and I would like to personally separate those from the complaints about RL itself. I actually blame the deep learning people for not giving us better function approximators. I don't want RL to be blamed for the deep learning deficiencies. That's my hot take.
Speaker 19:I am Ben Ellis from the University of Oxford. And I think the fact that you have to, like, hand-create environments from scratch almost all the time really sucks. I believe you. Yeah.
Speaker 2:It sounds like there's a story behind that. Do you wanna say any more?
Speaker 19:So I think, just in general, it's, like, a big limiting factor that you have to have an environment in the first place, because a lot of the time that will have to be a simulator, and it won't, like, necessarily be that general. So I think, like, you know, Tim Rocktäschel has a talk where he makes this point. And I think that's, like, a clear, really big problem that holds RL back.
Speaker 2:Love it. Yeah.
Speaker 20:My name's Scott Jeen. I'm from the University of Cambridge, UK. I think RL's too elegant.
Speaker 2:I've never heard this one. Okay. Tell us more.
Speaker 20:I think it draws us all in with this elegance, and then we get hit with all the other issues that you probably heard from everyone else.
Speaker 2:It makes you believe in magic, and then you get in there and you're like, it's not.
Speaker 20:A hundred percent. Yeah. We're not pragmatic enough.
Speaker 2:Love it. Hey. Thanks, guys. Yeah. Awesome.
Speaker 2:Yeah. Yeah. It's enforced. Okay. Who are you and what sucks about reinforcement learning?
Speaker 1:I am Vishamir. And I think the thing that sucks about reinforcement learning is the fact that you need to rely on very informative rewards. And that is typically not attainable in the real world.
Speaker 2:It's as if, like, where are these rewards supposed to come from? Do they just fall out of the sky?
Speaker 1:Precisely. I mean, if you already know the answer well enough to specify the reward, you've solved the problem already. So you don't really need to go in a roundabout manner and specify a reward that is going to solve the problem for you. So maybe that's what sucks about RL.
Speaker 14:Love it. Thanks. I'm David, and it's slow environments.
Speaker 2:My name is George, and lack of exploration sucks about reinforcement learning. Love it.
Speaker 11:I'm Brian, and what sucks is that reinforcement learning is mostly used in LLMs right now.
Speaker 2:Yeah. I love it. Okay. Get rid of those LLMs.
Speaker 17:Yeah.
Speaker 1:Get rid of those.
Speaker 2:Let's get back to Atari, people. Yeah. The only thing that actually matters. And, robotics. And robotics.
Speaker 2:Okay. Thank you. Thank you.
Speaker 1:You cannot reset your life.
Speaker 2:Oh, good one. I mean, unless you believe in reincarnation, I guess. Wow.
Speaker 1:In marriage, you will do, actually.
Speaker 11:You'll get to transfer the parameters over after yourself.
Speaker 14:Yeah. My take is, it's too hyperparameter sensitive, and that's why more people aren't using it.
Speaker 2:Brutal. Yeah.
Speaker 5:My name is Sam. And I think a lot of people wanna use reinforcement learning in the real world, but good simulators are hard to find.
Speaker 2:Oh, good one. Good one. Do you have a take?
Speaker 15:I guess so. Okay. So I'm Will, and what sucks about RL is that if you really care about a problem you're solving with RL, if you care that much, you could just handcraft a solution on your own in the amount of time it would take to set everything up and get RL to learn it. Brutal. Annoying.
Speaker 15:Okay. Brutal.
Speaker 14:Well, we're generalists. No.
Speaker 15:I care about the generalist approach. But the reality is, right now, when you wanna solve a real problem, it's really hard to find that special case.
Speaker 1:Yeah. Do you think AlphaGo would have been solved without RL?
Speaker 15:No. No. But think about how much time went into developing AlphaGo. Improving the methods to make it work. That's multiple years.
Speaker 10:But still, what do you think you would have gone for now?
Speaker 5:It might be an exception because it's such a specialized, high-search-benefit problem, but many other situations fit what you said.
Speaker 15:I mean, AlphaGo is kind of an annoying aspect of where RL is right now.
Speaker 2:Now that they've solved Go, they just take the trained AlphaGo policy, put it on any robot, and it's magic. Isn't that how it works?
Speaker 15:The general method actually works surprisingly broadly, but even that required more research to generalize it past, like, the known rules of Go.
Speaker 14:Algorithm you want, and that'll be it. Maybe that's it.
Speaker 15:I'm a believer in RL generality, and I'm here for it. Yeah. Yeah. A hundred percent.
Speaker 5:Yes.
Speaker 15:It's just, like, if I'm trying to sell RL to someone who cares about a real problem, there's a specific niche in which that sale is doable right now. Hopefully, it'll expand.
Speaker 14:Well, I think it will. I think it will. I think, like, the big gains in RL are actually in industrial processes. And those people are just, like, kinda scared. Right?
Speaker 14:They're, like, oh, this is scary. Like, there's a lot of money invested in this. But I think that's where the biggest gains are to be made. And if you can make just, like, a small margin of gain in those really important things, like oil processing, right?
Speaker 14:It'll be huge. Right?
Speaker 2:So
Speaker 15:I heard some good things out of Edmonton from this, actually.
Speaker 14:Yes. Yes. Yes. Specifically, maybe some water treatment, yeah.
Speaker 2:Google or something like that.
Speaker 14:Yeah. Yeah.
Speaker 21:My name is Harshad Fikji. I think optimizing for optimality all the time in RL is a really bad thing. We should be good at a lot of things, and it's fine to be completely suboptimal at all of them. I don't know what the right target is. But, yeah.
Speaker 21:It just should be capable of a lot of things. Yeah.
Speaker 2:Love it. Thanks. It's just too slow. It's just way too slow. Like, too slow to train?
Speaker 1:Low sample efficiency. You need a lot of samples to get any reasonable policy early.
Speaker 2:Brutal. Can't argue with that.
Speaker 14:I mean, my name is Pablo. And what sucks about RL is the HF suffix.
Speaker 2:Oh, all the HF. No more HF.
Speaker 14:I'm sort of joking.
Speaker 2:You have. Okay. So I do the TalkRL podcast, so I'm making an episode on what sucks about RL.