TalkRL: The Reinforcement Learning Podcast

Jacob Beck and Risto Vuorio on their recent Survey of Meta-Reinforcement Learning. Jacob and Risto are Ph.D. students at the Whiteson Research Lab at the University of Oxford.

Featured Reference   

A Survey of Meta-Reinforcement Learning
Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, Shimon Whiteson   

Additional References  

Creators & Guests

Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

[00:00.000 - 00:12.240] Talk RL podcast is all reinforcement learning all the time featuring brilliant guests, both
[00:12.240 - 00:18.280] research and applied. Join the conversation on Twitter at talk RL podcast. I'm your host
[00:18.280 - 00:24.000] Robin Chauhan.
[00:24.000 - 00:29.240] Today we're joined by Jacob Beck and Risto Vuorio, PhD students at the Whiteson Research
[00:29.240 - 00:33.680] Lab, which is at University of Oxford. Thank you so much for joining us today, Jacob and
[00:33.680 - 00:34.680] Risto.
[00:34.680 - 00:36.520] Hey, thanks very much. Yeah, great to be here.
[00:36.520 - 00:41.440] We're here to talk about your new paper, a survey of meta-reinforcement learning. So
[00:41.440 - 00:47.320] we have featured meta-RL on the show in the past, including the head of your lab, Professor
[00:47.320 - 00:53.040] Shimon Whiteson, who covered VariBAD and more. That was episode 15. Sam Ritter in episode
[00:53.040 - 00:59.520] 24, Alexandra Faust in episode 25, Robert Lange in episode 31. Hope I'm not missing
[00:59.520 - 01:04.160] any. So we have touched on this before, but never in this comprehensive way. And this
[01:04.160 - 01:10.160] paper is just really a tour de force. Really excited to get into this. So to start us off,
[01:10.160 - 01:14.560] can you tell us how do you define meta-RL? How do you like to define it?
[01:14.560 - 01:21.280] Yeah, so meta-RL is learning to reinforcement learn at a most basic level. So reinforcement
[01:21.280 - 01:26.200] learning is really slow and simple and inefficient, as we all know. And meta-reinforcement learning
[01:26.200 - 01:30.800] uses this slow reinforcement learning algorithm to learn a fast reinforcement learning algorithm
[01:30.800 - 01:33.480] for a particular domain of problems.
[01:33.480 - 01:39.520] And why is meta-RL so important to you guys that you're willing to put all this work into
[01:39.520 - 01:42.520] producing this giant paper? Why is it important to you and your lab?
[01:42.520 - 01:49.680] Yeah, so as Jake hinted at there, sample efficiency is a big issue in reinforcement
[01:49.680 - 01:57.280] learning. And meta-RL is then a pretty direct way to try to tackle that problem.
[01:57.280 - 02:04.680] So you can train an RL algorithm that then will be more sample efficient in the sort
[02:04.680 - 02:10.000] of test tasks you're interested in, if that makes sense. So I think that's the big motivation.
[02:10.000 - 02:17.740] And then also meta-RL, as a problem, comes up a lot in subtle ways when you're
[02:17.740 - 02:24.600] working in complicated settings otherwise, but maybe we'll get to that as we talk more.
[02:24.600 - 02:30.120] And how does meta-RL relate to, say, auto-RL? Are those two things related?
[02:30.120 - 02:36.080] So we were just talking about this. Auto-RL is any way you can automate a piece of the
[02:36.080 - 02:41.720] RL pipeline. And it could be, you know, learning, it could be heuristics, it could be other
[02:41.720 - 02:49.400] methods. Meta-RL specifically is when you learn to replace a particular component in
[02:49.400 - 02:55.180] the RL algorithm. So it's learning an RL algorithm as opposed to selecting a particular heuristic
[02:55.180 - 03:01.400] to do that. So in that sense, you can view meta-RL as a subset of auto-RL. But meta-RL
[03:01.400 - 03:06.120] is also a problem setting. So as we mentioned in the paper, like a distribution of MDPs
[03:06.120 - 03:10.720] kind of defines the meta-RL problem. And I think auto-RL isn't really a problem setting
[03:10.720 - 03:11.880] in that same sense.
[03:11.880 - 03:17.240] The meta-RL problem setting is really central. And that's kind of where most of this work
[03:17.240 - 03:23.760] comes from as well. So yeah, I feel like auto-RL can handle
[03:23.760 - 03:29.920] any task; it doesn't have to be a particular setting where you
[03:29.920 - 03:30.920] would use it.
[03:30.920 - 03:36.360] Now, to help ground this a little bit, you pointed in your paper to two classic meta-RL
[03:36.360 - 03:41.880] algorithms from back in the early days of deep RL, which is really when I started reading
[03:41.880 - 03:49.760] RL papers. And these two illustrate some really important points that maybe can help us understand
[03:49.760 - 03:53.960] these concepts going forward and for the audience. So you mentioned MAML, that was from Berkeley
[03:53.960 - 04:01.040] back in 2017, and RL squared from back in 2016. And that was a funny one,
[04:01.040 - 04:04.360] because there were two papers that came out almost the same time from OpenAI and DeepMind
[04:04.360 - 04:09.640] with very similar ideas. But can you briefly describe these two, just to
[04:09.640 - 04:14.560] get us grounded in this? What do these algorithms do, how do
[04:14.560 - 04:16.840] they work, and how are they different from each other?
[04:16.840 - 04:23.040] Yeah, so let me start with MAML, and maybe Jake can then explain RL squared. So MAML is
[04:23.040 - 04:29.080] a, I feel like very sort of iconic meta-RL algorithm from, as you said, early days of
[04:29.080 - 04:34.440] meta-RL in the sort of deep RL wave. There are of course earlier works that
[04:34.440 - 04:41.080] do meta-RL from the 90s, but there's been a big jump in popularity
[04:41.080 - 04:46.840] more recently. So with MAML, I think the intuition there is really the
[04:46.840 - 04:54.720] key to it: in deep learning, pre-training is a big thing
[04:54.720 - 05:01.160] people do. You train your convolutional neural net on ImageNet, and then you have
[05:01.160 - 05:06.280] maybe an application where you want to classify produce in the
[05:06.280 - 05:09.880] supermarket or something like that, for which you have way less data. So what
[05:09.880 - 05:14.320] you can do is use the pre-trained model and then fine-tune it on the task you're
[05:14.320 - 05:20.320] interested in. And RL still doesn't have a lot of that, and especially in 2016 didn't
[05:20.320 - 05:28.280] have any of that. So what MAML does is take this question very explicitly:
[05:28.280 - 05:35.400] can we use meta-learning to produce a better initialization, like a pre-trained
[05:35.400 - 05:42.000] network, that's then quick to fine-tune for other tasks? So essentially you
[05:42.000 - 05:49.640] take a big distribution of tasks, and then you train a network using any algorithm of
[05:49.640 - 05:55.040] your choice on those tasks. And then you backpropagate through that learning algorithm
[05:55.040 - 05:59.760] to the initialization, such that the learning algorithm in the middle can make
[05:59.760 - 06:03.080] as fast progress as possible, if that makes sense.
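The backpropagation-through-the-learning-algorithm idea Risto describes can be sketched in a few lines. This is a toy supervised analogue with 1-D quadratic tasks and hand-derived gradients, purely illustrative: the task distribution and all constants are made up, and real MAML-for-RL uses policy gradient inner updates rather than these analytic ones.

```python
import numpy as np

# Toy MAML sketch: each task is "minimize (theta - c)^2" for a task-specific c.
# Inner loop: one gradient step from a shared initialization theta0.
# Outer loop: differentiate the post-adaptation loss THROUGH that inner step
# (done analytically here) and update theta0 itself.

rng = np.random.default_rng(0)
inner_lr, outer_lr = 0.1, 0.05
theta0 = 5.0  # the meta-learned initialization

for _ in range(2000):
    c = rng.uniform(1.0, 3.0)                              # sample a task
    theta_adapted = theta0 - inner_lr * 2 * (theta0 - c)   # inner-loop step
    # Outer gradient: d/d_theta0 of (theta_adapted - c)^2; the chain rule
    # through the inner update contributes the factor (1 - 2 * inner_lr).
    grad_theta0 = 2 * (theta_adapted - c) * (1 - 2 * inner_lr)
    theta0 -= outer_lr * grad_theta0

# theta0 drifts toward the center of the task distribution (about 2.0):
# the initialization from which one gradient step helps most on average.
print(theta0)
```

The same structure, an inner loop you differentiate through and an outer loop over tasks, is what MAML does at scale, just with neural network parameters and policy gradient inner updates.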
[06:03.080 - 06:08.640] So that sounds a bit like a foundation model, but for your specific setting. Is
[06:08.640 - 06:09.640] that similar?
[06:09.640 - 06:14.480] In a sense, yeah, I think so. I mean, I think the mechanics work out quite differently,
[06:14.480 - 06:17.920] but the motivation is definitely there.
[06:17.920 - 06:22.760] Yeah, I think Risto did a good job of summarizing MAML. Most simply put, it's just meta-learning
[06:22.760 - 06:28.200] the initialization. And on the spectrum we talk about in the paper, of generalization
[06:28.200 - 06:33.200] versus specialization, MAML is at one end of the spectrum. So it's just learning an
[06:33.200 - 06:39.880] initialization, and the inner loop, the fast
[06:39.880 - 06:43.440] reinforcement learning algorithm that's actually learned, is all hard-coded other than
[06:43.440 - 06:48.360] the initialization. And so from that, you can get certain properties like generalization
[06:48.360 - 06:51.920] to new tasks that you haven't seen during training. And at the other end of the spectrum,
[06:51.920 - 06:57.160] we have RL squared and L2RL, which are both papers that came out around the same time
[06:57.160 - 07:03.040] doing roughly the same thing. So RL squared, I think, was Duan et al and L2RL was Wang
[07:03.040 - 07:09.320] et al. And the idea more or less in these papers is just that the inner loop, the reinforcement
[07:09.320 - 07:13.480] learning algorithm that you're learning, is entirely a black box. It's just a general
[07:13.480 - 07:18.200] function approximator, so it tends to be a GRU or LSTM. That's kind of at the extreme
[07:18.200 - 07:20.760] other end of the spectrum from MAML.
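The black-box end of the spectrum can be made concrete with a sketch of the RL-squared interface. Everything below is hypothetical (a plain tanh RNN cell standing in for the GRU or LSTM, random untrained weights, greedy actions, no outer-loop training); the point is only what the network sees each step and when the hidden state is reset.

```python
import numpy as np

# RL^2-style agent interface (illustrative): the "fast" RL algorithm is just a
# recurrent network. Its input each step concatenates the observation with the
# previous action and reward, and its hidden state is reset at TASK boundaries
# but carried across episodes of the same task -- that carried state is where
# the learning-to-learn happens.

class RL2Agent:
    def __init__(self, obs_dim, n_actions, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = obs_dim + n_actions + 1   # obs ++ one-hot prev action ++ prev reward
        self.W_in = rng.normal(0, 0.1, (hidden_dim, in_dim))
        self.W_h = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
        self.W_out = rng.normal(0, 0.1, (n_actions, hidden_dim))
        self.n_actions = n_actions
        self.h = np.zeros(hidden_dim)

    def reset_task(self):
        # called when a NEW task is sampled -- NOT between episodes of one task
        self.h = np.zeros_like(self.h)

    def step(self, obs, prev_action, prev_reward):
        a_onehot = np.zeros(self.n_actions)
        if prev_action is not None:
            a_onehot[prev_action] = 1.0
        x = np.concatenate([obs, a_onehot, [prev_reward]])
        self.h = np.tanh(self.W_in @ x + self.W_h @ self.h)  # simple RNN cell
        return int(np.argmax(self.W_out @ self.h))           # greedy, for illustration

agent = RL2Agent(obs_dim=2, n_actions=3)
a = agent.step(np.array([0.5, -0.5]), prev_action=None, prev_reward=0.0)
# an episode of this task ends here; for RL^2 we deliberately do NOT reset agent.h
a = agent.step(np.array([0.1, 0.2]), prev_action=a, prev_reward=1.0)
```

Because the hidden state carries over across episodes of the same task, whatever the agent figured out in episode one is available in episode two; the outer loop trains the weights so those hidden-state dynamics implement a fast RL algorithm.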
[07:20.760 - 07:25.440] And so where would you apply these two different approaches? Or can you talk
[07:25.440 - 07:29.040] about the pros and cons of the spectrum?
[07:29.040 - 07:33.840] MAML really has found more popularity in this sort of few-shot classification
[07:33.840 - 07:40.160] setting more recently; it turns out actually doing MAML for reinforcement learning is really
[07:40.160 - 07:46.160] challenging. So I don't know if it's a big baseline or anything anymore. But
[07:46.160 - 07:52.560] the nice thing about the basic algorithm in MAML is that since the inner loop, the algorithm
[07:52.560 - 07:58.160] you're learning, is usually just a policy gradient reinforcement learning
[07:58.160 - 08:04.880] algorithm, there are some guarantees that it'll do reasonably well on any task you throw
[08:04.880 - 08:11.640] at it. So even if your initialization was kind of off, even if the task distribution
[08:11.640 - 08:17.480] you trained it for isn't the one where you eventually deploy the initialization,
[08:17.480 - 08:23.800] there's still hope that the policy gradient algorithm will recover from that. So
[08:23.800 - 08:31.120] it has somewhat better generalization performance than RL squared
[08:31.120 - 08:32.200] would have.
[08:32.200 - 08:36.560] And I would add to that the horizon also matters, right? So in MAML, you actually are computing
[08:36.560 - 08:40.200] a gradient, you're computing a policy gradient, so you need to collect some data for that.
[08:40.200 - 08:44.960] If your performance matters, you know, from the first time step of training, then RL squared
[08:44.960 - 08:48.400] is kind of more the algorithm you would use.
[08:48.400 - 08:54.800] Did you have a clear idea of how you would do this categorization and how things would
[08:54.800 - 08:59.600] be organized before you did the paper? Or is that something that really came about through
[08:59.600 - 09:03.720] a lot of careful thought and analyzing what you learned through your readings?
[09:03.720 - 09:08.600] Yeah, I mean, we had a few false starts. It was kind of a mess at the beginning. We
[09:08.600 - 09:13.240] started by proposing a taxonomy of meta-learning papers. And then we
[09:13.240 - 09:16.800] quickly realized that the literature just didn't really reflect the taxonomy we had
[09:16.800 - 09:20.360] just sat down and thought of.
[09:20.360 - 09:26.080] So we had to kind of reorganize based on what the literature reflected, so the main clusters
[09:26.080 - 09:30.560] of literature, and then we had to be pretty careful about divvying up work within each
[09:30.560 - 09:31.560] of those.
[09:31.560 - 09:37.240] In retrospect, I think the structure is kind of what you would find from the literature,
[09:37.240 - 09:39.400] but we definitely didn't start from this one.
[09:39.400 - 09:44.640] Cool. So let's get into the different settings that your paper mentions, where meta-RL
[09:44.640 - 09:50.000] can be applied. Do you want to cover the main settings? Can you give us a brief description
[09:50.000 - 09:53.960] or an example of what each setting would look like?
[09:53.960 - 10:02.520] Yeah, so let me get started here. We have two axes along which we distinguish
[10:02.520 - 10:10.640] these meta-RL problems. There's zero or few shots versus many shots. That has
[10:10.640 - 10:15.640] to do with the horizon of the task in the inner loop. So if, as Jake mentioned earlier,
[10:15.640 - 10:21.800] you have something where you want your agent to make
[10:21.800 - 10:27.360] as much progress as possible from the first time step it's deployed in the environment,
[10:27.360 - 10:32.040] then you're in this zero or few-shot regime. And usually those are tasks where
[10:32.040 - 10:39.940] you're also expected to do really well after a small number of steps.
[10:39.940 - 10:45.480] Originally, these were the kinds of things where you have maybe a
[10:45.480 - 10:50.480] MuJoCo environment with a cheetah robot running around and you need to decide
[10:50.480 - 10:55.960] which way to run with the cheetah; that would be sort of a canonical early task from
[10:55.960 - 10:59.800] there. They're more complicated now, but that's roughly the order of
[10:59.800 - 11:08.480] magnitude we're thinking of here. And then many-shot is more about learning a
[11:08.480 - 11:13.360] sort of long-running RL algorithm in the inner loop. So you can think
[11:13.360 - 11:18.840] of it like: you want to meta-learn an algorithm that you can then use to update your policies
[11:18.840 - 11:24.080] 10,000 times. So it could be 10,000 episodes, it could be an
[11:24.080 - 11:32.120] hours- or days-long training run, using the learned reinforcement
[11:32.120 - 11:36.200] learning algorithm, the inner loop of the meta-learning algorithm.
[11:36.200 - 11:40.000] So in that case, you're not worried about performance at the beginning?
[11:40.000 - 11:43.760] Yeah, basically, right, like you would you would assume that you essentially start from
[11:43.760 - 11:47.280] like a really random policy, and then you just try to, of course, you still try to make
[11:47.280 - 11:53.920] as fast progress as possible. But like, if it's if the inner loop is modeled after, let's
[11:53.920 - 11:59.400] say policy gradient algorithm, then you're going to need need some amount of samples just
[11:59.400 - 12:06.240] to get like a reasonable gradient estimates for the update. So it won't get started like
[12:06.240 - 12:08.880] in any kind of zero shots manner for sure.
[12:08.880 - 12:13.360] Okay, so you're not evaluating it right away, the test time doesn't start right away.
[12:13.360 - 12:14.360] Is that what you're saying?
[12:14.360 - 12:22.600] Yeah, usually you would evaluate the policy after hundreds up to thousands
[12:22.600 - 12:25.400] or tens of thousands of updates.
[12:25.400 - 12:28.760] The goal of that setting can be stated as learning a traditional RL algorithm.
[12:28.760 - 12:32.480] And the other axis is whether we're dealing with a single task
[12:32.480 - 12:37.640] or a multitask setting. And this is kind of a trippy thing, I
[12:37.640 - 12:43.520] guess; it isn't something that is super often discussed, especially in
[12:43.520 - 12:49.680] some parts of the meta-RL literature, but the single task case is still very interesting.
[12:49.680 - 12:56.960] And the methods are actually very similar between the many-shot multitask and single task settings.
[12:56.960 - 13:02.240] So where you would have a big distribution of tasks, and then you're
[13:02.240 - 13:07.640] trying to learn that traditional RL algorithm in the inner loop, it turns out you can actually
[13:07.640 - 13:14.840] just grab that meta-learning algorithm and run it on a single RL task
[13:14.840 - 13:19.280] essentially, and still get reasonably good performance. So you can
[13:19.280 - 13:26.840] train agents on Atari, where you actually meta-learned the objective function for
[13:26.840 - 13:33.080] the policy gradient that's then updating your policy, just on that single task. But
[13:33.080 - 13:37.280] oh yeah, and I guess one important thing here is that there really isn't a
[13:37.280 - 13:44.800] few-shot single task setting, because there needs to be some source of transfer.
[13:44.800 - 13:50.320] If you have multiple tasks, then what you do is
[13:50.320 - 13:55.120] train on the distribution of tasks, and then you maybe have a held-out set
[13:55.120 - 14:00.320] of test tasks on which you test whether your learning algorithm works
[14:00.320 - 14:07.680] really well. If you're in the long horizon setting, the many-shot setting, then on
[14:07.680 - 14:15.120] the single task you can compare it to the kind of vanilla RL
[14:15.120 - 14:20.680] algorithm you would run over that. But in a zero-shot single task setting, there
[14:20.680 - 14:25.440] isn't anything you can really test it on,
[14:25.440 - 14:30.760] and there's not enough room for meta-learning. No, it's a pretty difficult
[14:30.760 - 14:35.600] concept to explain. So I think you did a good job. But what you said basically is that right
[14:35.600 - 14:42.160] in the multi task setting, you're transferring from one set of MDPs to a new MDP at test time,
[14:42.160 - 14:46.160] in the single task setting, what you're doing is transferring from one part of a single
[14:46.160 - 14:50.840] lifetime in one MDP to another part of that same lifetime in the same MDP. So you have
[14:50.840 - 14:53.680] to have some notion of time for that to occur over.
[14:53.680 - 14:58.680] Awesome, I actually was going to ask you about that, just like a missing square in the quadrant,
[14:58.680 - 15:04.560] right? And so that totally makes sense. So then you talked about different approaches
[15:04.560 - 15:11.120] for the different settings. Do you want to touch on some of the most important of
[15:11.120 - 15:17.080] those? So I guess, as we mentioned, MAML was kind of the prototypical algorithm in
[15:17.080 - 15:22.960] the PPG setting. But you can also imagine, you can add additional parameters to tune
[15:22.960 - 15:28.040] other than just the initialization. So you can learn the learning rate, you can learn some
[15:28.040 - 15:33.120] kind of a curvature matrix that modifies your gradient, you can learn the whole distribution
[15:33.120 - 15:36.320] for your initialization instead of just a single point estimate for initialization.
[15:36.320 - 15:40.560] And it's kind of a whole family of things that build on MAML. The only thing
[15:40.560 - 15:44.200] that's consistent between them is that the inner loop involves a policy gradient.
[15:44.200 - 15:50.960] So we call those PPG methods or PPG algorithms for parameterized policy gradient. And that's
[15:50.960 - 15:54.560] kind of the first category of methods we talked about in the few shot setting.
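One of those PPG variations, meta-learning the inner-loop learning rate alongside the initialization (in the spirit of Meta-SGD), can be sketched on toy quadratic tasks; the numbers and the analytic gradients here are illustrative, not from the survey.

```python
import numpy as np

# PPG-flavored toy: besides the initialization theta0, meta-learn the inner-loop
# step size alpha itself. Tasks are "minimize (theta - c)^2"; the inner loop is
# still one gradient step, but now both theta0 and alpha are outer-loop parameters.

rng = np.random.default_rng(1)
theta0, alpha = 5.0, 0.01      # meta-learned initialization and inner learning rate
outer_lr = 0.02

for _ in range(4000):
    c = rng.uniform(1.0, 3.0)
    g_inner = 2 * (theta0 - c)                 # inner-loop gradient
    theta_adapted = theta0 - alpha * g_inner   # inner step with learned alpha
    post = 2 * (theta_adapted - c)             # gradient of the post-adaptation loss
    # outer gradients, differentiating through the inner step
    theta0 -= outer_lr * post * (1 - 2 * alpha)
    alpha -= outer_lr * post * (-g_inner)
    alpha = float(np.clip(alpha, 0.0, 0.5))    # keep the learned step size sane

# alpha grows toward 0.5, where one inner step lands exactly on the task optimum
print(theta0, alpha)
```

In this toy the learned step size heads to 0.5, where a single inner step lands exactly on any task's optimum, a small illustration of why learned step sizes and curvature can matter as much as the learned initialization.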
[15:54.560 - 15:58.240] And did you guys coin that phrase, PPG?
[15:58.240 - 15:59.240] We're trying to.
[15:59.240 - 16:00.240] Cool, I like it.
[16:00.240 - 16:05.000] Thank you. So yeah, that's the PPG setting. And then there's
[16:05.000 - 16:10.680] two other main types of methods in the few-shot setting. There's black box. And the main
[16:10.680 - 16:16.000] example, the prototypical algorithm in that setting, would be RL squared. But you can also
[16:16.000 - 16:20.040] replace that black box with many different architectures, transformers, other forms of
[16:20.040 - 16:26.640] attention and other memory mechanisms. And so there's a whole category of black box algorithms.
[16:26.640 - 16:32.800] And then I guess the only one we haven't really touched on yet is task inference methods.
[16:32.800 - 16:38.800] So the idea here is a little more nuanced. But meta-learning, as we mentioned, considers
[16:38.800 - 16:43.920] a distribution of MDPs, also known as tasks. What's different about the meta-learning
[16:43.920 - 16:49.080] setting from the multitask setting is you don't know what task you're in. So you actually
[16:49.080 - 16:53.680] have to explore to gather data to figure out, hey, I'm supposed to
[16:53.680 - 16:56.920] run 10 miles per hour instead of five miles per hour. I'm supposed to navigate to this
[16:56.920 - 17:03.280] part of the maze as opposed to this part of the maze. And you can frame, you know, the
[17:03.280 - 17:07.280] setting as task inference. So I think Humplik et al was one of the early papers that pointed
[17:07.280 - 17:10.480] this out: if you can figure out what task you're in, you've reduced your setting to
[17:10.480 - 17:13.800] the multitask setting, and you've made the problem much easier. And so that kind of gave
[17:13.800 - 17:16.000] rise to a whole bunch of methods around task inference.
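Task inference can be illustrated with a toy where the inference is exact Bayes rather than a learned encoder as in the papers discussed: a two-armed bandit where one arm pays out with probability 0.9 and the other with 0.1, and the only unknown is which is which. The environment, the probabilities, and the 0.95 confidence threshold are all made up for the example.

```python
import numpy as np

# Toy task inference: maintain a posterior over the two possible tasks
# ("arm 0 is good" vs "arm 1 is good"). Explore until confident, then act
# with the inferred task's (here trivial) optimal policy.

def run_trial(good_arm, n_steps=30, seed=0):
    rng = np.random.default_rng(seed)
    log_post = np.log(np.array([0.5, 0.5]))   # uniform prior over the two tasks
    probs = {True: 0.9, False: 0.1}           # payout prob if arm is / isn't good
    total = 0
    for t in range(n_steps):
        belief = np.exp(log_post - log_post.max())
        belief /= belief.sum()
        # explore by alternating arms until the posterior is confident, then exploit
        arm = int(np.argmax(belief)) if belief.max() > 0.95 else t % 2
        r = int(rng.random() < probs[arm == good_arm])
        total += r
        # Bayesian update of the task posterior given (arm, r)
        for task in (0, 1):
            p = probs[arm == task]
            log_post[task] += np.log(p if r else 1 - p)
    return total, int(np.argmax(log_post))

reward, inferred = run_trial(good_arm=1, seed=3)
print(reward, inferred)
```

Once the posterior concentrates, the agent is effectively in the multitask setting: it knows which task it's in and just runs that task's optimal policy, which is exactly the reduction Jacob describes.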
[17:16.000 - 17:21.880] So is that the scenario where you may have a simulator that has a bunch of knobs on it,
[17:21.880 - 17:24.800] and then your agent just has to figure out what the settings on the
[17:24.800 - 17:25.800] knobs are for the environment?
[17:25.800 - 17:28.960] Yeah, I mean, more or less, right, you can consider your environment parameterized by
[17:28.960 - 17:33.480] some context vector, you have to figure out what those parameters are. And then you can
[17:33.480 - 17:36.720] assume maybe that once you know those parameters, you have a reasonable policy already ready
[17:36.720 - 17:41.440] to go for that set of parameters. I think Shimon, maybe even on this podcast at one
[17:41.440 - 17:46.680] point, pointed out that like, if you can figure out what MDP you're in, you don't even need
[17:46.680 - 17:51.000] to do learning anymore, right? You can just do planning. If you know that MDP, you don't
[17:51.000 - 17:52.600] need to do any more learning at that point.
[17:52.600 - 17:55.160] Right, that's a really important observation.
[17:55.160 - 18:03.040] Right. So then we also have the many-shot setting, where I guess
[18:03.040 - 18:09.320] the major categories of things to think about are the single task and multitask
[18:09.320 - 18:20.100] many-shot problems. The methods for both single task and multitask end up being quite similar,
[18:20.100 - 18:24.440] as do the kinds of things that people learn in the inner loop. So okay, let me try to
[18:24.440 - 18:31.820] be clearer about the many-shot setting one more time here. So basically the structure
[18:31.820 - 18:39.440] is that you take your policy gradient algorithm, A2C or whatever, and then you put some
[18:39.440 - 18:44.360] parameters into the loss function there. So maybe you have an intrinsic reward function
[18:44.360 - 18:52.840] or an auxiliary task or something of that flavor, and then you change
[18:52.840 - 18:58.920] the parameters of that loss function with the meta-learner. So there's this sort of
[18:58.920 - 19:04.320] outer loop meta-learner that computes the gradient through the updated policy into the
[19:04.320 - 19:09.800] loss function parameters and changes those so that you get better performance from the
[19:09.800 - 19:20.240] agent. And so this idea applies to both the single task and multitask settings.
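That gradient through the updated policy into the loss-function parameters can be sketched with scalars. This is a toy stand-in rather than a real meta-gradient RL stack: eta plays the role of a learned reward or loss parameter, the "policy" is a single number p, the true objective is just a quadratic, and all the constants are invented.

```python
import numpy as np

# Toy meta-gradient: the inner loop updates a policy parameter p on a LEARNED
# loss (p - eta)^2. The outer loop differentiates the agent's TRUE objective,
# (p_updated - 3)^2, through the inner update and into eta.

rng = np.random.default_rng(0)
inner_lr, outer_lr = 0.25, 0.05
eta = 0.0                                  # learned loss-function parameter

for _ in range(3000):
    p = rng.uniform(0.0, 2.0)              # a policy mid-training, at some random point
    # inner loop: one step on the learned loss (p - eta)^2
    p_new = p - inner_lr * 2 * (p - eta)
    # outer loop: d/d_eta of the true objective (p_new - 3)^2,
    # through the inner update: d p_new / d eta = 2 * inner_lr
    grad_eta = 2 * (p_new - 3.0) * (2 * inner_lr)
    eta -= outer_lr * grad_eta

print(eta)
```

Note that eta settles near 5 even though the true optimum is 3: the learned loss isn't trying to be truthful, only to place the one-step-updated policy as well as possible on average.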
[19:20.240 - 19:26.480] I think one of the important topics there would then be: what is the
[19:26.480 - 19:32.840] algorithm you're building on top of? What is the inner loop base
[19:32.840 - 19:37.360] algorithm, and how are you optimizing it, those kinds of things. And then the
[19:37.360 - 19:44.600] sorts of things that you learn there: intrinsic rewards are pretty big,
[19:44.600 - 19:50.640] auxiliary tasks, and you could have a more general parameterization of the
[19:50.640 - 19:57.480] RL objective function in the inner loop. So there are algorithms that just parameterize
[19:57.480 - 20:04.620] that very generally. And then one other thing people have considered is learning hierarchies.
[20:04.620 - 20:10.240] So hierarchical RL, maybe option discovery, for example, could be done in this
[20:10.240 - 20:15.920] long, many-shot meta-RL setting. When I think of one item on this list, like
[20:15.920 - 20:23.120] intrinsic rewards, I remember when Pathak, you know, came up with this curiosity intrinsic
[20:23.120 - 20:28.320] reward and did that study. And I think his agent had like billions of
[20:28.320 - 20:34.440] steps for the curiosity to really do its thing. And that was not in meta-RL, that was just
[20:34.440 - 20:40.840] in straight RL. So when I think about doing this in a loop, it seems like it could be
[20:40.840 - 20:45.200] maybe massively expensive. How do you think about the cost of these algorithms,
[20:45.200 - 20:52.120] and when it actually makes sense economically, or just when it makes sense to use
[20:52.120 - 20:55.680] these methods, and how expensive are they? Do you have a sense of that?
[20:55.680 - 21:01.360] Yeah, that's a great question. So for the few-shot learning setting, it's not really
[21:01.360 - 21:13.360] hugely different from just training an agent that can generally solve tasks
[21:13.360 - 21:17.800] of that flavor. There's of
[21:17.800 - 21:26.120] course an upfront cost to training the meta-learner, but then at test time
[21:26.120 - 21:30.880] it should be very efficient. I think the big costs come out in the many-shot setting,
[21:30.880 - 21:39.800] where you're trying to train a full RL algorithm in the inner loop, and then just
[21:39.800 - 21:45.800] being able to optimize that can take a whole lot of samples,
[21:45.800 - 21:53.480] for sure. The trick there is that these algorithms can generalize quite a bit.
[21:53.480 - 22:00.240] So there's a paper by, I think, Junhyuk Oh and others from DeepMind, where they
[22:00.240 - 22:05.320] train an inner loop algorithm on essentially grid worlds and
[22:05.320 - 22:11.200] bandits and those kinds of things. So they're training the inner loop objective
[22:11.200 - 22:15.640] on very simple environments. And it still takes a whole lot of samples, it takes,
[22:15.640 - 22:21.580] I think, billions of frames, but in very simple environments, at least. And then they
[22:21.580 - 22:27.840] transfer that and show that it actually can generalize to Atari and produce roughly
[22:27.840 - 22:32.320] original-DQN-level performance there, which is pretty impressive to me. But I mean,
[22:32.320 - 22:40.360] yeah, it's the most expensive Atari agent of that performance level,
[22:40.360 - 22:44.880] for sure. One thing, I don't know if the question was intended to be this specific, but you
[22:44.880 - 22:49.320] mentioned it takes a while for the curiosity-based rewards to do their thing.
[22:49.320 - 22:53.760] Risto knows a lot more about this setting than I do, but my understanding is that
[22:53.760 - 22:57.480] generally for the intrinsic rewards, you don't actually try and meta-learn the propagation
[22:57.480 - 23:02.600] through the critic. So the meta-learned reward would be useful
[23:02.600 - 23:07.040] in the n-step return or the TD lambda estimate, but I don't think you're actually
[23:07.040 - 23:10.960] meta-learning how to propagate that information through the critic. Is that right, Risto?
[23:10.960 - 23:14.880] Would that change the cost too much? I feel like it probably wouldn't. It's just
[23:14.880 - 23:19.560] a little bit more memory cost in the backward pass,
[23:19.560 - 23:21.840] but it doesn't seem critical. I'm not sure.
[23:21.840 - 23:25.360] Sure, sure. But you don't need to do many steps of value iteration to try
[23:25.360 - 23:28.280] and figure out the effects of that through that process. Oh, yeah, of course. No,
[23:28.280 - 23:34.400] no. It's a huge approximation, in all kinds of ways, to compute the
[23:34.400 - 23:40.320] update for your intrinsic rewards. And one critical thing that the algorithms in
[23:40.320 - 23:47.000] that setting often do is that, in some sense, if you're in the many-shot multitask setting,
[23:47.000 - 23:55.160] you want the intrinsic reward, or whatever you're training, to produce the best
[23:55.160 - 24:00.440] end performance of the agent when you train a new agent from scratch using that
[24:00.440 - 24:06.120] learned objective function. However long your training
[24:06.120 - 24:12.080] horizon is, you want the loss function that
[24:12.080 - 24:17.520] produces the best agent at convergence. But of course, backpropagating through the
[24:17.520 - 24:25.060] whole long optimization horizon in the inner loop would be extremely costly. So people
[24:25.060 - 24:29.080] often truncate the optimization; this is essentially truncated backpropagation
[24:29.080 - 24:34.120] through time. So you just consider a tiny window of updates within
[24:34.120 - 24:40.560] that inner loop, and then backpropagate within there to keep the memory costs reasonable.
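That truncation can be sketched with a small scalar toy (all numbers invented): a policy parameter p takes N inner steps on a learned loss (p - eta)^2, the true objective is (p - 3)^2, and the meta-gradient flows only through the last K inner steps because of a stop-gradient before the window.

```python
# Toy truncated backpropagation through time for a meta-learned loss.
# Inner update: p <- p - a * 2 * (p - eta); true objective: (p - 3)^2.
# We run N inner steps but only differentiate through the last K of them,
# treating the state K steps back as a constant (a stop-gradient).

inner_lr, outer_lr = 0.25, 0.1
N, K = 50, 5                               # inner horizon vs backprop window
eta = 0.0                                  # learned loss-function parameter

for _ in range(500):
    p = 0.0
    dp_deta = 0.0                          # running derivative of p w.r.t. eta
    for t in range(N):
        if t == N - K:
            dp_deta = 0.0                  # truncate: stop-gradient K steps back
        p = p - inner_lr * 2 * (p - eta)
        dp_deta = (1 - 2 * inner_lr) * dp_deta + 2 * inner_lr
    grad_eta = 2 * (p - 3.0) * dp_deta     # chain rule through the last K steps only
    eta -= outer_lr * grad_eta

# After many inner steps p is close to eta, so the outer loop drives eta toward 3
print(eta, p)
```

The window keeps memory and compute bounded regardless of the inner horizon; the price, in general, is a biased meta-gradient, though in this particular toy the truncated gradient still points the right way.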
[24:40.560 - 24:46.080] Okay, then, you mentioned in your paper that exploration is clearly a central issue,
[24:46.080 - 24:52.760] especially for few-shot meta-RL. Can you talk about the importance of exploration
[24:52.760 - 24:57.600] in meta-RL and the main methods used for exploration in meta-RL? So exploration is
[24:57.600 - 25:03.960] kind of a central concept that makes meta-RL distinct from just meta-learning in general.
[25:03.960 - 25:07.680] So in meta learning, you might be given a new data set, you have to rapidly adapt to
[25:07.680 - 25:13.360] that new data set. In meta RL, you actually have to collect the new data yourself. And
[25:13.360 - 25:17.120] it might not be clear how to do that or what data you need. So you have to
[25:17.120 - 25:21.240] explore to figure out what task you're in and what data you need to identify the task
[25:21.240 - 25:26.760] itself. And that's kind of one of the central challenges in the few shot meta RL setting.
[25:26.760 - 25:30.800] And here you're talking about exploration at the task level, not at the meta level,
[25:30.800 - 25:34.560] right, meta exploration? Is this something you mentioned in a separate part of the paper?
[25:34.560 - 25:39.080] Yeah, meta exploration was a bit distinct. So that's exploration in the space of exploration
[25:39.080 - 25:43.760] strategies. I don't know if we want to unpack that statement more,
[25:43.760 - 25:49.000] but I guess first, some of the methods that are used for exploration:
[25:49.000 - 25:55.480] end to end learning is common, but it's difficult for really challenging exploration problems.
[25:55.480 - 25:59.560] So you can just do RL squared. Actually, I'm
[25:59.560 - 26:03.940] not sure we've defined inner loop and outer loop so far in this discussion.
[26:03.940 - 26:07.480] When we say inner loop, we mean the reinforcement learning algorithm that you are learning;
[26:07.480 - 26:10.600] when we say outer loop, we mean the reinforcement learning algorithm, the slow one that you're
[26:10.600 - 26:16.500] using to learn the inner loop, which can just be like PPO or something along those lines.
[26:16.500 - 26:20.680] And so you can just use the inner loop as a black box. And that can solve some exploration
[26:20.680 - 26:25.040] problems, but generally more challenging exploration problems won't be solved by RL squared and
[26:25.040 - 26:28.960] things that are just a complete black box. So people have tried building in more structure
[26:28.960 - 26:36.280] for exploration, ranging from posterior sampling to more complicated methods, often using task
[26:36.280 - 26:42.740] inference actually. So we mentioned task inference being this idea that you want to take actions
[26:42.740 - 26:45.700] to identify what task you're in. And often you do need to take actions to
[26:45.700 - 26:50.280] figure out what task you're in. And one
[26:50.280 - 26:54.560] way to do that is by saying, okay, we're going to give the agent a reward
[26:54.560 - 27:00.160] for being able to infer the task. There are some drawbacks to doing that directly, right?
[27:00.160 - 27:03.960] So you might be trying to infer the MDP, which is the transition function and the reward
[27:03.960 - 27:08.600] function. And there might be completely irrelevant information in that exploration process that
[27:08.600 - 27:12.440] you don't need. So if you're trying to figure out what goal to navigate to, let's say in
[27:12.440 - 27:15.800] a kitchen, you're trying to figure out like where the robot should be. And maybe there's
[27:15.800 - 27:18.920] some paintings on the wall. And the paintings on the wall are completely irrelevant to where
[27:18.920 - 27:24.160] you're supposed to be navigating right now to make someone's food. And in that case,
[27:24.160 - 27:27.320] there are also algorithms to tackle that. So one that we like to highlight is DREAM,
[27:27.320 - 27:32.960] from one of our co-authors on the paper, Evan. And there you actually learn what information
[27:32.960 - 27:37.360] is relevant first by doing some pre-training in the multitask setting, you figure out what
[27:37.360 - 27:41.840] information would an optimal policy need, an informed policy, and then you separately
[27:41.840 - 27:48.360] learn an exploration policy to try and uncover the data that allows you to execute the informed
[27:48.360 - 27:49.360] policy.
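The black-box, RL-squared-style setup discussed above can be sketched roughly as follows. This is a hypothetical toy, a two-armed bandit task distribution with a hand-written stand-in for the recurrent policy; the outer-loop training (e.g. PPO on the policy weights) is omitted.

```python
# Toy sketch of the RL^2 interaction structure: the "inner loop" is just the
# forward pass of a memory-based policy, whose hidden state persists across
# steps within a task and is reset only between tasks. The hidden-state
# update is hand-written here to show why memory enables adaptation.
import random

class MemoryPolicy:
    def __init__(self):
        self.hidden = {0: 0.0, 1: 0.0}   # stand-in for an RNN hidden state

    def reset(self):
        self.hidden = {0: 0.0, 1: 0.0}   # reset only when the task changes

    def act(self, rng):
        if rng.random() < 0.2:                        # explore
            return rng.choice([0, 1])
        return max(self.hidden, key=self.hidden.get)  # exploit memory

    def observe(self, action, reward):
        # Fold the latest transition into the hidden state.
        self.hidden[action] += 0.2 * (reward - self.hidden[action])

rng = random.Random(0)
policy = MemoryPolicy()

for task in range(5):
    good_arm = rng.choice([0, 1])   # the task: which bandit arm pays off
    policy.reset()
    rewards = []
    for _ in range(200):            # the agent adapts within the task
        a = policy.act(rng)
        r = 1.0 if a == good_arm else 0.0
        policy.observe(a, r)
        rewards.append(r)
    # By the end of each task, the policy has identified the good arm.
    print(task, sum(rewards[-20:]))
```

In an actual RL-squared agent, the hidden-state update and action choice are a trained RNN, and the slow outer loop adjusts its weights so that this within-task adaptation emerges.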
[27:49.360 - 27:52.400] There's a lot of concepts in this paper, I got to say, compared to the average paper.
[27:52.400 - 27:56.960] I guess that's the nature of survey papers. So I'm really glad you're here to help us
[27:56.960 - 27:57.960] make sense of it.
[27:57.960 - 28:04.720] Yeah. So we talked about a few different exploration methods. One that came from our lab is the
[28:04.720 - 28:08.960] VariBAD paper, which I think you already had Shimon on to talk about as well. It's
[28:08.960 - 28:13.440] a really cool method that allows you to quantify uncertainty in this task inference
[28:13.440 - 28:19.360] that we just mentioned. So what VariBAD does is train a VAE to reconstruct
[28:19.360 - 28:23.960] transitions, and it trains a latent variable, a mean and variance, in order to do that. And
[28:23.960 - 28:29.240] then you condition a policy on the inferred mean and variance. So you're explicitly conditioning
[28:29.240 - 28:33.360] on your uncertainty in the distribution of tasks. And you can actually frame that entire
[28:33.360 - 28:37.720] problem as what's called a BAMDP, or Bayes-Adaptive MDP.
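A minimal sketch of that belief-conditioning idea (not the VariBAD implementation): here an exact Bayesian posterior over two discrete tasks stands in for the learned variational posterior, and the policy conditions on that belief.

```python
# Hypothetical toy: two tasks, with the goal at -1 or at +1. Observations
# are noisy goal readings. The policy acts on the posterior over tasks,
# i.e. on its own uncertainty, which is the BAMDP belief-state idea.
import math

TASKS = [-1.0, +1.0]

def likelihood(obs, goal, sigma=1.0):
    # Gaussian likelihood of an observation given the task's goal.
    return math.exp(-0.5 * ((obs - goal) / sigma) ** 2)

def update_belief(belief, obs):
    # Bayes rule: posterior over tasks after seeing one observation.
    post = [b * likelihood(obs, g) for b, g in zip(belief, TASKS)]
    z = sum(post)
    return [p / z for p in post]

def policy(belief):
    # Belief-conditioned policy: act toward the posterior-mean goal.
    # (VariBAD instead conditions a neural policy on the latent mean/variance.)
    return sum(b * g for b, g in zip(belief, TASKS))

belief = [0.5, 0.5]                  # uniform prior over tasks
for obs in [0.8, 1.2, 0.9]:          # noisy observations from the +1 task
    belief = update_belief(belief, obs)

print(belief[1] > belief[0], round(policy(belief), 3))
```

With a handful of observations, the belief concentrates on the correct task and the belief-conditioned action moves toward that task's goal.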
[28:37.720 - 28:43.560] Yeah. VariBAD's treatment of uncertainty is so cool. That's what makes it really special to me.
[28:43.560 - 28:48.160] And I guess that's the magic of variational inference. Is that right?
[28:48.160 - 28:51.640] Yeah. I mean, it's variational inference, plus conditioning on that uncertainty for the meta
[28:51.640 - 28:55.440] learning, which allows you to learn actually optimal exploration pretty easily.
[28:55.440 - 28:58.160] Cool. Should we move on to supervision?
[28:58.160 - 29:10.080] Yeah, sounds good. Mostly we focus on the case where we have reinforcement learning
[29:10.080 - 29:15.240] in the inner loop and reinforcement learning in the outer loop. And I mean, most of the
[29:15.240 - 29:20.520] meta RL research is also in that setting. But similarly to what has happened with the
[29:20.520 - 29:28.560] term RL, or especially deep RL, which sort of subsumed a lot of other topics that also
[29:28.560 - 29:33.240] are doing some kind of machine learning for control, like imitation learning, for example.
[29:33.240 - 29:37.680] Often people just say RL and sometimes they mean something that's more like imitation
[29:37.680 - 29:42.120] learning. So we have a similar thing happening in meta RL, where there's a lot
[29:42.120 - 29:51.800] of meta imitation learning. I guess
[29:51.800 - 29:57.440] the most direct approach for that would be doing something like MAML for imitation learning.
[29:57.440 - 30:01.680] So it's like imitation learning is sort of just supervised learning of control. And then
[30:01.680 - 30:07.760] you could just take the supervised learning version of MAML to learn a fast imitation
[30:07.760 - 30:12.240] learning initialization. But then you could have all these other variants as well, where
[30:12.240 - 30:21.600] you have, let's say, an imitation learning algorithm, which,
[30:21.600 - 30:28.400] when it's shown a demonstration of a new task, can do that task as quickly as possible. You
[30:28.400 - 30:35.360] could meta train that so that the meta learning algorithm
[30:35.360 - 30:40.240] is still optimizing the reward of the task somehow. If you have access to
[30:40.240 - 30:43.440] the rewards, the outer loop could still be reinforcement learning. So now you have this
[30:43.440 - 30:47.880] setting where you have imitation learning in the inner loop and reinforcement learning
[30:47.880 - 30:54.080] in the outer loop. And then you would test it as an imitation learning algorithm.
[30:54.080 - 31:00.760] So you show it a new demonstration, and you're expecting it to adapt to that as quickly
[31:00.760 - 31:08.320] as possible. And then of course, all the other permutations of that same setting
[31:08.320 - 31:15.600] apply, and people have done research on those. Then unsupervised learning
[31:15.600 - 31:25.480] is also a big topic, and people in meta RL have looked into using unsupervised
[31:25.480 - 31:27.360] learning algorithms in the inner loop.
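The imitation-in-the-inner-loop idea can be sketched with a toy first-order MAML variant; everything here (the 1-D policy, the constants) is a made-up illustration, not code from any of the papers discussed.

```python
# Hypothetical sketch of first-order MAML for imitation learning: the inner
# loop is a supervised imitation step on demonstrations, and the outer loop
# learns an initialization that adapts quickly across tasks. Toy 1-D policy
# a = theta * s.
import random

random.seed(0)

def imitation_loss_grad(theta, demos):
    # Mean squared error between policy actions and expert actions,
    # and its gradient with respect to theta.
    loss = sum((theta * s - a) ** 2 for s, a in demos) / len(demos)
    grad = sum(2 * (theta * s - a) * s for s, a in demos) / len(demos)
    return loss, grad

def adapt(theta, demos, inner_lr=0.3, steps=10):
    # Inner loop: a few imitation gradient steps on this task's demos.
    for _ in range(steps):
        _, g = imitation_loss_grad(theta, demos)
        theta -= inner_lr * g
    return theta

def sample_task():
    # A task is defined by its expert policy a = w * s, with w drawn per task.
    w = random.uniform(-2.0, 2.0)
    return [(s, w * s) for s in (random.uniform(-1, 1) for _ in range(10))]

theta = 0.0
for _ in range(200):                      # outer loop over sampled tasks
    demos = sample_task()
    adapted = adapt(theta, demos)
    # First-order MAML: update the initialization with the post-adaptation
    # gradient (ignoring second-order terms).
    _, g = imitation_loss_grad(adapted, demos)
    theta -= 0.05 * g

# Meta-test: demonstrations of a new task (expert w = 1.5), then adapt.
test_demos = [(s / 5.0, 1.5 * (s / 5.0)) for s in range(-5, 6)]
post_loss, _ = imitation_loss_grad(adapt(theta, test_demos), test_demos)
print(round(post_loss, 4))
```

If rewards were available, the outer-loop update could instead be a policy-gradient step on the adapted policy's return, giving the imitation-inner-loop / RL-outer-loop combination described above.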
[31:27.360 - 31:32.200] Right, so unsupervised in the inner loop could be useful if you just don't have
[31:32.200 - 31:37.840] access to rewards at test time. Some algorithms that do this are Hebbian
[31:37.840 - 31:41.280] learning algorithms. There are a lot of Hebbian learning algorithms that just don't
[31:41.280 - 31:46.200] condition on reward; they're local, and they're unsupervised in their inner loop. But the
[31:46.200 - 31:49.520] outer loop, as we mentioned, still uses reward. So you're still meta learning this with rewards
[31:49.520 - 31:53.840] end to end. I think there are a bunch of other papers aside from
[31:53.840 - 31:57.640] Hebbian learning as well. But the idea there is that you might not have access to rewards
[31:57.640 - 32:02.400] when you actually go to test. There's also unsupervised in the outer loop. So if you're
[32:02.400 - 32:05.340] given one environment, it's kind of like a sandbox you can play with, but you don't
[32:05.340 - 32:09.200] really have any known rewards. You can do some clever things to get a distribution of
[32:09.200 - 32:12.720] reward functions that might prepare you for a reward function you're going to encounter
[32:12.720 - 32:17.760] at test time. So there, during meta training, you create your own distribution of tasks
[32:17.760 - 32:26.160] or your own distribution of reward functions. So I guess that's
[32:26.160 - 32:30.160] unsupervised outer loop and unsupervised inner loop. You can also have a supervised outer
[32:30.160 - 32:34.560] loop where your inner loop is reinforcement learning. And there, the idea is that
[32:34.560 - 32:41.440] reinforcement learning in the outer loop is very slow, it's a very weak supervision,
[32:41.440 - 32:45.960] and the cost of meta training is huge. Right. So we're learning very simple, efficient algorithms
[32:45.960 - 32:50.080] for test time through meta learning, but that blows up the cost of meta training. And if
[32:50.080 - 32:53.920] we can use stronger supervision during meta training, then that can get us huge wins in
[32:53.920 - 32:58.440] terms of sample efficiency. Okay, that part I think I followed. It's kind of like how
[32:58.440 - 33:06.420] many ways can you put together a Lego kit? There's a lot of ways, right? So can we talk
[33:06.420 - 33:12.560] about some of the application areas where meta RL has been important or looks promising
[33:12.560 - 33:18.360] in the future? Yeah, for sure. So I mean, there's a pretty recent paper
[33:18.360 - 33:28.880] by Evan again, where they do meta RL for this really cool code feedback thing. So you have
[33:28.880 - 33:33.880] an online, so this is a very specific thing, but just because it's at the top of
[33:33.880 - 33:42.840] my memory, you have an online coding platform where you go on and learn programming.
[33:42.840 - 33:47.520] And if there's an interactive program you're trying to code there, it's really
[33:47.520 - 33:51.720] hard for the automated toolkit to give you good feedback on that. So what they do is
[33:51.720 - 33:57.840] actually train a meta reinforcement learning agent that provides good
[33:57.840 - 34:06.280] feedback there, because the students' programs make a task distribution,
[34:06.280 - 34:11.680] which you then need to explore efficiently to find what kinds of bugs the students
[34:11.680 - 34:17.840] have figured out to implement there. They actually got pretty promising results
[34:17.840 - 34:26.440] on the benchmark there. And that seems like it could tentatively be deployed
[34:26.440 - 34:31.800] in the real world as well. And maybe we can talk about the other applications we cover
[34:31.800 - 34:32.800] in the paper.
[34:32.800 - 34:38.080] Yeah, we cover a bunch of other ones, but I guess to highlight here, like robot locomotion
[34:38.080 - 34:43.160] is a big one. So there, it's pretty common to try and train in simulation over a distribution
[34:43.160 - 34:47.480] of tasks, and then try and do sim to real transfer to a real robot in the real world,
[34:47.480 - 34:53.400] as opposed to trying to do meta learning on a robot from scratch. And there are some pretty
[34:53.400 - 34:58.600] cool algorithms that have been applied in order to do that, IMPORT by Kamienny et al.,
[34:58.600 - 35:01.720] in particular, being one of them, where you actually do this kind of multitask training
[35:01.720 - 35:06.000] I talked about before, and the task inference that I mentioned before, but you do it simultaneously
[35:06.000 - 35:10.280] while doing meta learning. So you'd have some known information about the environment the
[35:10.280 - 35:17.120] robot's trying to walk in, in your simulator. And maybe we assume that at test time, this
[35:17.120 - 35:22.600] information wouldn't be known, like the exact location of all the rocks and steps, and some
[35:22.600 - 35:26.720] sensory information isn't available to the actual robot in the real world. So what you
[35:26.720 - 35:31.480] can do is have the known representation, some encoding of that; then you have your
[35:31.480 - 35:34.560] inferred representation, some encoding of that; and you try and make these two
[35:34.560 - 35:39.000] things look very similar. And that's been used in a number of robotics papers at the
[35:39.000 - 35:44.000] moment to some pretty cool effect. So I guess in addition to the robot locomotion
[35:44.000 - 35:49.680] problem, one application area we go into in the paper in some detail is the meta learning
[35:49.680 - 35:56.760] for multi agent RL problem. And there, just to summarize concisely, you can view
[35:56.760 - 36:01.040] other agents as defining the task. So if you have a distribution of other agents, that
[36:01.040 - 36:04.520] pretty clearly creates for you a distribution of tasks, and you can directly apply meta
[36:04.520 - 36:09.280] learning. And that enables you both to adapt to novel agents at test
[36:09.280 - 36:14.960] time and to deal with the non stationarity introduced by the adaptation
[36:14.960 - 36:18.200] of other agents. So all the learning other agents are doing can be taken into account
[36:18.200 - 36:25.520] by your meta learning. Your paper also discusses using meta RL with offline data. Can you say
[36:25.520 - 36:32.760] a couple things about that? Yeah, so as I mentioned earlier, meta reinforcement
[36:32.760 - 36:36.520] learning tries to create a sample efficient adaptation algorithm in a few shot setting
[36:36.520 - 36:43.080] anyway. And that shifts a huge amount of the data burden to meta training. So you can
[36:43.080 - 36:49.000] imagine having an offline outer loop. Right. So the meta training, if you're having such
[36:49.000 - 36:52.560] a large meta training burden, you can't really do that directly in the real world. So one
[36:52.560 - 36:56.920] thing you might want to do is have some safe data collection policy to gather a lot of
[36:56.920 - 37:02.160] data for you. And then you can immediately use offline meta RL in the outer loop to try
[37:02.160 - 37:06.440] and train your meta learning algorithm having not actually taken any dangerous actions in
[37:06.440 - 37:13.120] the real world yourself. So that's kind of the offline outer loop idea in meta RL. We also
[37:13.120 - 37:18.040] go into the offline inner loop and different combinations of offline and online inner loop and
[37:18.040 - 37:24.280] offline and online outer loop. But the idea with the offline inner loop is we're already trying to do
[37:24.280 - 37:28.120] few shot learning. So at the limit of this, you're given some
[37:28.120 - 37:31.680] data up front, and you actually never have to do any sort of exploration in your environment;
[37:31.680 - 37:36.160] you can adapt immediately to some data someone hands you at test time without doing any sort
[37:36.160 - 37:41.680] of exploration or any sort of data gathering. So of course, RL is generally framed
[37:41.680 - 37:48.040] in terms of MDPs, Markov decision processes. And in the case of meta RL, can we talk about
[37:48.040 - 37:56.720] the MDP for the outer loop, or the POMDP? What does that MDP look like in terms of the traditional
[37:56.720 - 38:03.520] components of state action and reward? As we mentioned before, meta RL defines a problem
[38:03.520 - 38:07.960] setting. And in this problem setting, there's a distribution of MDPs, which could also be
[38:07.960 - 38:13.480] considered a distribution of tasks. So your outer loop is computed, like, for example,
[38:13.480 - 38:19.360] your return is computed in expectation over this distribution. Instead, you can actually
[38:19.360 - 38:24.220] view this distribution as a single object. In that case, it's a partially observable
[38:24.220 - 38:29.400] Markov decision process, also known as a POMDP. And what's different in a POMDP from an MDP
[38:29.400 - 38:35.400] is that a POMDP has a latent state. So it's something the agent cannot observe. And in this case, the latent
[38:35.400 - 38:39.800] state is exactly the MDP the agent is in at the moment. So your latent
[38:39.800 - 38:44.120] state would include the task identity. And so if you actually were to try and write out
[38:44.120 - 38:49.240] this POMDP, then the transition function would condition on this latent variable, and your reward
[38:49.240 - 38:53.800] function would condition on this latent variable. And then there's just kind of the action space
[38:53.800 - 39:00.960] left to define. The action space is usually assumed to be the same across all these different
[39:00.960 - 39:05.540] MDPs. And so that's usually just the same for the POMDP. But there's also work trying
[39:05.540 - 39:12.280] to loosen that restriction. So someone from our lab, Zheng, has a recent paper trying to generalize
[39:12.280 - 39:17.600] across different robot morphologies with different action spaces. And there he's using hyper
[39:17.600 - 39:22.080] networks, which is also other work we've done in our lab, hypernetworks in meta RL. So
[39:22.080 - 39:26.000] kind of what is held constant and what changes between these: usually the action space is
[39:26.000 - 39:29.640] held constant, the state space is held constant, and then the reward function and the transition
[39:29.640 - 39:33.640] function depend on this latent variable. But you can also try and relax the action space
[39:33.640 - 39:38.880] assumption as well. How practical is this stuff? Like, where is meta RL today? I mean,
[39:38.880 - 39:45.040] you mentioned some application areas, but for, let's say, an RL practitioner,
[39:45.040 - 39:49.160] is meta RL something you really need to understand to do this stuff well? Or is it
[39:49.160 - 39:54.240] kind of exotic still, and more of a forward looking research type thing?
[39:54.240 - 40:01.880] It's definitely more on the forward looking edge of deep RL research, I would say. Like,
[40:01.880 - 40:10.040] the whole idea that you can learn these adaptive agents and cut the computational
[40:10.040 - 40:15.800] cost at test time by doing that, it is very appealing. And it is actually
[40:15.800 - 40:23.480] rooted in a very practical consideration. Like, what if your
[40:23.480 - 40:28.360] robot is deployed in a slightly different environment? You would still want it to
[40:28.360 - 40:37.320] be able to handle that well. But in practice, I think, mostly this is still a little bit
[40:37.320 - 40:43.760] speculative. And then there's also the aspect that, to some
[40:43.760 - 40:49.520] extent, if you're dealing with some new environments where you
[40:49.520 - 40:56.720] need to adapt to get a good policy, oftentimes what you end up doing is just taking a policy
[40:56.720 - 41:03.080] that has memory, so, like, let's say an RNN. So if it
[41:03.080 - 41:09.960] doesn't observe the full state of the environment, it can retain its observations
[41:09.960 - 41:15.360] in the memory, and then figure out the details of the environment as
[41:15.360 - 41:20.040] it goes. And that's essentially RL squared; that's the essence of what
[41:20.040 - 41:26.900] RL squared does. Whether you want to call that meta RL in each instance,
[41:26.900 - 41:30.880] maybe not. And do you really need to know everything about meta RL to actually
[41:30.880 - 41:36.640] do that? Again, maybe not. But in this kind of sense, the ideas
[41:36.640 - 41:44.840] are still fairly pragmatic. And actually, you can often find that the algorithm
[41:44.840 - 41:50.560] ends up behaving in a way that's essentially learned adaptive behavior, which
[41:50.560 - 41:55.080] is what meta RL agents would do. Yeah, I guess to add on to what Risto said, I think
[41:55.080 - 41:59.780] the practicality also depends on which of these kind of clusters you're in that we discussed.
[41:59.780 - 42:05.000] So in the few shot setting, whether or not you call it meta RL, if you're trying
[42:05.000 - 42:09.720] to do sim to real transfer over a distribution of tasks, which generally is meta RL, that's an extremely
[42:09.720 - 42:15.440] practical tool, right? It's very cleanly and directly addressing the sample inefficiency
[42:15.440 - 42:22.440] of reinforcement learning and shifting the entire burden to simulation and meta training.
[42:22.440 - 42:28.440] In the long horizon setting, I'm not so sure there's a practical use at the moment for the
[42:28.440 - 42:33.000] multitask long horizon setting. But the single task long horizon setting seems to have some
[42:33.000 - 42:37.560] practical uses, like hyperparameter tuning; it's a particular way to do AutoRL,
[42:37.560 - 42:41.480] right? Where instead of just using a manually designed algorithm, you're doing it end to
[42:41.480 - 42:45.420] end on the outer loop objective function. And so from that perspective, if you're trying
[42:45.420 - 42:49.280] to tune some hyperparameters of an RL algorithm, it's pretty practical whenever you're trying to
[42:49.280 - 42:56.880] run any RL algorithm. It also is, as Risto said, this emergent thing that a
[42:56.880 - 43:00.560] lot of systems, a lot of generally capable systems, will just have in them, whether or not
[43:00.560 - 43:06.800] you're trying to do meta RL. A lot of systems, like large language models, have
[43:06.800 - 43:12.160] this emergent in-context learning that occurs, even if that wasn't directly trained for.
[43:12.160 - 43:16.320] So in some ways it's very practical, and in other ways it's not very practical, but
[43:16.320 - 43:20.000] it will arise regardless of what we try and do. I know you've already mentioned
[43:20.000 - 43:24.880] a couple, but are there any other specific examples of meta RL algorithms that you're
[43:24.880 - 43:32.040] specifically excited about, or are your favorites? We talked about DREAM and VariBAD.
[43:32.040 - 43:40.280] Yeah, those are definitely really thought provoking. And DREAM,
[43:40.280 - 43:46.240] they actually use DREAM in that code feedback thing. So it turns out it's practical as
[43:46.240 - 43:52.240] well. One algorithm that for me personally has been especially thought provoking and
[43:52.240 - 44:00.480] has impacted my own interests a lot is the learned policy gradient
[44:00.480 - 44:06.720] paper that I hinted at earlier, where they learn the objective function of the inner
[44:06.720 - 44:11.760] loop completely. This is one of the few papers in meta RL that shows this
[44:11.760 - 44:18.680] sort of quite impressive form of transfer, where you train the inner loop on tasks
[44:18.680 - 44:23.760] that don't look anything like the tasks that you see at test time. So in their particular
[44:23.760 - 44:30.360] case, it's grid worlds to Atari. And I find that sort of thought provoking,
[44:30.360 - 44:35.880] even if the algorithm ends up not being super practical. The idea that a
[44:35.880 - 44:42.960] meta learned system really can transfer this way, I think that's an exciting capability
[44:42.960 - 44:51.120] that would be fun to see appear even more in meta RL, and elsewhere as well,
[44:51.120 - 44:52.120] of course.
[44:52.120 - 44:56.760] Yeah, I think that's a great example of a paper for the many shot setting. And
[44:56.760 - 45:01.560] in the few shot setting, as I mentioned, I'm pretty fond of this IMPORT idea that I
[45:01.560 - 45:07.000] think has been pretty useful in robotics as well, where you try and simultaneously learn
[45:07.000 - 45:10.360] the task representation and how to infer the task at the same time.
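That simultaneous learn-the-representation-and-infer-it idea can be sketched roughly as follows; this is a hypothetical, heavily simplified toy in the spirit of the informed-policy setup described above, not the actual IMPORT algorithm.

```python
# Hypothetical toy: during meta training the true task parameter is known
# (privileged info in the simulator), and we train an inference "network"
# so that its encoding of the trajectory matches the known task encoding.
# At test time only the inferred encoding is available.
import random

random.seed(0)

def make_trajectory(task, n=8):
    # Observations correlated with the hidden task parameter, plus noise.
    return [task + random.gauss(0.0, 0.3) for _ in range(n)]

# Inference "network": a linear map from a trajectory feature to an embedding.
w, b = 0.0, 0.0

def infer(traj):
    mean_obs = sum(traj) / len(traj)
    return w * mean_obs + b

for _ in range(500):                       # meta training with known tasks
    task = random.uniform(-1.0, 1.0)
    known_encoding = task                  # privileged task information
    traj = make_trajectory(task)
    err = infer(traj) - known_encoding     # alignment loss: (inferred - known)^2
    mean_obs = sum(traj) / len(traj)
    w -= 0.05 * 2 * err * mean_obs         # gradient step on w
    b -= 0.05 * 2 * err                    # gradient step on b

# Meta test: the known encoding is unavailable; use the inferred one.
test_task = 0.7
estimate = infer(make_trajectory(test_task))
print(abs(estimate - test_task) < 0.5)
```

In the robotics setup described earlier, the "known" encoding would come from simulator-only information (terrain, contacts), and the policy would condition on the inferred encoding at deployment.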
[45:10.360 - 45:17.040] So in regular deep RL, we've seen an explosion of algorithms. But recently, we've seen
[45:17.040 - 45:23.840] Dreamer, and recently DreamerV3 from Danijar Hafner et al., which beats a lot
[45:23.840 - 45:30.480] of algorithms without tuning. And that suggests that maybe some pruning or convergence
[45:30.480 - 45:36.680] of the state of the art family tree of RL algorithms is in order. I mean,
[45:36.680 - 45:42.560] maybe there's some things we don't have to worry about as much. Maybe
[45:42.560 - 45:47.160] we can mentally trim the tree of algorithms we have to keep track of, because Dreamer
[45:47.160 - 45:54.200] is kind of covering a lot of space. Do you see any similar thing being possible
[45:54.200 - 46:01.600] in meta RL, in terms of algorithms being discovered that cover a lot of space? Or is meta RL
[46:01.600 - 46:05.880] somehow different? I mean, it sounds like one of the main things I've gotten from this discussion
[46:05.880 - 46:10.440] is there's just so damn many combinations of things. So many different variations and
[46:10.440 - 46:17.560] settings. Is it different in that way, such
[46:17.560 - 46:21.880] that we should not expect to find some unifying algorithm? Or do you think that may be possible?
[46:21.880 - 46:26.680] I guess DreamerV3 will already solve a huge chunk of meta RL problems. That's already
[46:26.680 - 46:33.760] a good start. But I think that really there are different problem settings with
[46:33.760 - 46:37.920] pretty unique demands in the meta learning setting, right? So if you have a very narrow
[46:37.920 - 46:42.280] distribution of tasks, and you don't have to do much extrapolation to test time, it's
[46:42.280 - 46:48.240] kind of hard to beat a pretty simple task inference method. And on the flip side, if
[46:48.240 - 46:52.440] you need a huge amount of generalization, I'm not sure Dreamer is going to do any
[46:52.440 - 46:57.080] better than actually building in some policy gradient into the inner loop to guarantee
[46:57.080 - 47:02.520] that you have some generalization. So I think, in that sense, it is kind of hard to say there's
[47:02.520 - 47:06.240] going to be one algorithm for all of meta learning, because of the different demands
[47:06.240 - 47:08.600] of each of the different problem settings discussed.
[47:08.600 - 47:17.560] Yeah, I just want to add on to that a little bit, in that the sort of black box style,
[47:17.560 - 47:25.920] RL squared inspired, very general pattern of algorithm seems quite powerful,
[47:25.920 - 47:31.160] and has recently been demonstrated to do well in quite complicated task distributions
[47:31.160 - 47:38.000] as well. So there's definitely some convergence there. But maybe a big
[47:38.000 - 47:43.640] reason why we see so many different kinds of algorithms in meta RL is that it's also
[47:43.640 - 47:49.520] just about learning about the problem and its different features.
[47:49.520 - 47:56.400] You're trying to understand more and uncover the critical bits
[47:56.400 - 48:01.320] of what the challenge is and how we can
[48:01.320 - 48:08.360] address it. So I feel like, of the many, many algorithms that we see, some of
[48:08.360 - 48:12.720] them are just kind of trying to answer a smaller question, rather than
[48:12.720 - 48:18.120] being a real state of the art contender for meta RL. So,
[48:18.120 - 48:22.760] naturally, some of them will kind of fall by the wayside as we go on.
[48:22.760 - 48:29.800] That makes sense. That's all part of the research process, right? So in deep RL, we've seen
[48:29.800 - 48:36.760] that pretty minor changes to an MDP have to be considered as a different task, and
[48:36.760 - 48:42.680] trained agents might no longer perform well with a slightly different MDP.
[48:42.680 - 48:47.120] For example, a robot maybe having slightly longer or slightly shorter legs, or playing
[48:47.120 - 48:55.880] with a blue ball instead of a red ball. And my sense is, for humans, we can generalize
[48:55.880 - 49:00.480] quite well naturally. And so we might not really call that a different task. Basically,
[49:00.480 - 49:08.120] we might not chop up tasks in such a fine way. And I always
[49:08.120 - 49:13.720] think of that as just a property of our current generation of function approximators.
[49:13.720 - 49:18.640] Deep neural networks are very finicky, and they generalize a little bit, but they don't
[49:18.640 - 49:23.440] really extrapolate. They mostly interpolate, the way I understand it. So do you think
[49:23.440 - 49:28.840] that the fact that our current function approximators have limited generalization
[49:28.840 - 49:35.320] forces us to look more towards meta RL? And if we were to somehow come up with
[49:35.320 - 49:39.320] improved function approximators that could maybe generalize a bit better, then we wouldn't
[49:39.320 - 49:44.040] need as much meta RL. Do you think there's any truth to that, or no?
[49:44.040 - 49:48.960] So I think this seems like a distinction between whether we're talking about
[49:48.960 - 49:55.440] the meta RL problem setting or the algorithms for meta RL. Like, if you think of the task
[49:55.440 - 50:03.920] distribution, it's just, you know, a complicated world where your agent
[50:03.920 - 50:10.000] can't know zero shot the expected behavior. So it has to go and explore
[50:10.000 - 50:16.200] the environment somehow, and then do the best it can with the information it has gathered.
[50:16.200 - 50:20.120] And I feel like, you know, that idea is not going to go away. Like, that's sort of
[50:20.120 - 50:27.280] how a lot of the real world works as well. So, in some sense, thinking about
[50:27.280 - 50:32.720] that problem setting seems very relevant going forward, whether or not we're
[50:32.720 - 50:36.520] going to use these specific methods we came up with. That's more of
[50:36.520 - 50:41.320] an open question. And I guess there are some hints that in many cases we can get away
[50:41.320 - 50:44.080] with fairly simple ideas there.
[50:44.080 - 50:47.900] But I don't think it's going to be like, we've come up with some new architecture and magically
[50:47.900 - 50:52.120] we don't need to train to generalize anymore. I think you're still going to have
[50:52.120 - 50:57.840] to train if you want your, you know, universal function approximator to generalize.
[50:57.840 - 51:00.760] I think you're going to have to train over a distribution of tasks intentionally to try
[51:00.760 - 51:06.320] and get that generalization. Whether the task distribution is explicit or implicit, like
[51:06.320 - 51:12.320] in large language models, I think doesn't necessarily matter. But I think
[51:12.320 - 51:18.080] that, you know, expecting some machine learning model to generalize without being
[51:18.080 - 51:22.040] explicitly trained to generalize is kind of asking more than is feasible.
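Risto's framing of the meta-RL problem setting — the agent can't know the right behavior zero-shot, so it must explore, then exploit what it gathered, with performance measured across a distribution of tasks — can be sketched in a toy example. Everything here is illustrative (the bandit tasks and the hand-coded adaptation rule are made up for this sketch, not from the survey):

```python
import random

def sample_task():
    # Hypothetical task distribution: 2-armed bandits whose better arm varies per task.
    good_arm = random.choice([0, 1])
    return lambda arm: 1.0 if arm == good_arm else 0.0

def inner_loop(reward_fn, episodes=10):
    # The agent cannot know the good arm zero-shot: it explores first,
    # then exploits the information it has gathered.
    earned = [0.0, 0.0]
    total = 0.0
    for t in range(episodes):
        arm = t % 2 if t < 4 else earned.index(max(earned))  # explore, then exploit
        r = reward_fn(arm)
        earned[arm] += r
        total += r
    return total

def meta_train(iterations=100):
    # Outer loop: performance is the average return across tasks drawn from
    # the distribution, not performance on any single task.
    return sum(inner_loop(sample_task()) for _ in range(iterations)) / iterations
```

Here the hand-coded explore-then-exploit rule stands in for whatever a meta-learner would discover; in a method like RL squared, that adaptation rule would itself be the thing the outer loop trains.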
[51:22.040 - 51:27.440] All right, we're going to jump to some submitted questions now. These are three questions
[51:27.440 - 51:33.120] from Zohar Rimon, a researcher at Technion. Thank you so much, Zohar, for the questions.
[51:33.120 - 51:37.200] And the first one is, what do you think are the barriers we'll need to tackle to make
[51:37.200 - 51:42.340] meta-RL work on realistic, high-dimensional task distributions?
[51:42.340 - 51:49.160] Great question, Zohar. So yeah, I think the answer is sort of in the question
[51:49.160 - 51:56.280] as well. I believe that the barrier keeping us from generalizing to
[51:56.280 - 52:03.120] these more complex task distributions is really that we don't quite have a good training
[52:03.120 - 52:07.880] task distribution, where we would train the meta-RL agent that would then generalize to
[52:07.880 - 52:13.400] the other tasks. So there have been efforts in this direction, right? There was Meta-World,
[52:13.400 - 52:19.840] which proposed a fairly complicated robotics benchmark with a number of tasks
[52:19.840 - 52:25.000] and a lot of parametric variation within each of them, but still not quite there.
[52:25.000 - 52:31.400] I guess my intuitive answer is that it doesn't have enough categories of
[52:31.400 - 52:38.520] tasks. Then there's also Alchemy, which also didn't catch on, I don't quite remember why,
[52:38.520 - 52:43.760] but that was also trying to pose this complicated task distribution
[52:43.760 - 52:50.800] and see if we can study meta-RL there. And now DeepMind has their XLand, I think it's
[52:50.800 - 52:57.680] called, which seems really cool and has a lot of variety between those tasks. But
[52:57.680 - 53:01.800] I guess the drawback there is that it's closed. So nobody else gets to play around with it
[53:01.800 - 53:08.040] and evaluate whether, you know, you get reasonable
[53:08.040 - 53:13.320] generalization from those tasks. So I would say that we need better training
[53:13.320 - 53:15.280] task distributions for this.
[53:15.280 - 53:20.880] Okay. And then he asks, some meta-RL methods directly approximate the belief, like VariBAD
[53:20.880 - 53:26.240] and PEARL, and some don't, like RL squared. Are there clear benefits for each approach?
[53:26.240 - 53:29.920] I think you guys touched on some of this. Is there anything you want to add to that?
[53:29.920 - 53:34.240] Yeah, I guess if you can directly quantify the uncertainty, it's pretty easy in a lot
[53:34.240 - 53:39.160] of cases to learn an optimal exploration policy, or at least easier. However, if you're doing
[53:39.160 - 53:42.560] task-inference-based methods and you're trying to infer the MDP,
[53:42.560 - 53:46.440] there might be irrelevant information in the MDP that you don't need to learn for the optimal
[53:46.440 - 53:51.720] control policy. So you might just waste your time learning things you don't need to learn.
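Belief-based methods such as VariBAD and PEARL maintain an explicit posterior over the task, which makes the agent's uncertainty directly available for exploration. A minimal sketch of such a belief update for a two-armed bandit — the reward probabilities `p_good` and `p_bad` are made-up task parameters for illustration, not taken from either paper:

```python
def belief_update(prior, arm, reward, p_good=0.9, p_bad=0.1):
    """Posterior probability that arm 0 is the good arm, after one pull.

    prior  -- current probability that arm 0 is the good arm
    arm    -- which arm was pulled (0 or 1)
    reward -- 1 if the pull paid off, else 0
    """
    def likelihood(arm_is_good):
        p = p_good if arm_is_good else p_bad
        return p if reward else 1.0 - p

    l0 = likelihood(arm == 0)  # likelihood of the observation if arm 0 is good
    l1 = likelihood(arm == 1)  # likelihood of the observation if arm 1 is good
    return prior * l0 / (prior * l0 + (1.0 - prior) * l1)

# Starting from total uncertainty, one rewarding pull of arm 0
# shifts the belief sharply toward "arm 0 is the good arm".
b = belief_update(0.5, arm=0, reward=1)  # 0.9
```

A black-box method like RL squared buries this computation in an RNN's hidden state instead, while a task-inference method that tried to reconstruct the full MDP would also model details irrelevant to choosing the arm — Jacob's point about wasted effort.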
[53:51.720 - 53:56.160] And then Zohar asks, would love to hear your thoughts about DeepMind's AdA, that's the
[53:56.160 - 54:00.360] Adaptive Agent. Do you think it will mark a turning point for meta-RL? And
[54:00.360 - 54:05.760] Risto, you just mentioned XLand. I think AdA is based on XLand. Is there anything more
[54:05.760 - 54:06.760] to add there?
[54:06.760 - 54:11.200] Yeah, yeah. I mean, it's really exciting work, actually. I think it's a really
[54:11.200 - 54:16.280] strong demonstration of the kinds of things that you can get a big black-box meta-learner
[54:16.280 - 54:21.980] to do. So you take a big pool of tasks, and then you train this big memory
[54:21.980 - 54:30.040] network policy on it, and it can really generalize in quite impressive ways. But, I mean,
[54:30.040 - 54:34.080] a turning point, I don't know. You know, I think there's always been a contingent of
[54:34.080 - 54:41.640] meta-RL researchers who would have said that, you know, a big recurrent neural network and
[54:41.640 - 54:47.160] a complicated task distribution is kind of all we need. So RL squared, for example, kind
[54:47.160 - 54:53.120] of already starts from that idea. And now it feels a little bit like AdA is figuring
[54:53.120 - 55:00.680] out what you actually need to make that idea really work, and scaling it up. So
[55:00.680 - 55:04.960] I think it remains to be seen whether it's a turning point. To me, it feels like it's
[55:04.960 - 55:14.640] on the continuum, to a large extent, but it is, I guess, at least a
[55:14.640 - 55:19.600] very bright spot in the spectrum right now.
[55:19.600 - 55:24.720] Yeah, I'm not sure it proposed anything novel that we hadn't seen before. I
[55:24.720 - 55:28.640] think, you know, it was a huge distribution of tasks, it talked a lot about using attention
[55:28.640 - 55:32.800] or transformers, or at least some sort of attention in your memory. I think they did
[55:32.800 - 55:37.720] some sort of curriculum design. And I think they did this, like, student-teacher distillation
[55:37.720 - 55:41.580] thing. There was kind of a hodgepodge of ideas, and I'm not sure it really added
[55:41.580 - 55:45.920] too much novel on the method side. But it was definitely a demonstration that, hey, we
[55:45.920 - 55:49.880] can do this really cool thing and
[55:49.880 - 55:56.320] get these really cool generalization capabilities out of a generally capable recurrent agent
[55:56.320 - 56:00.520] over a complex task distribution, as Risto said. So maybe more synthesis of existing
[56:00.520 - 56:05.240] ideas than some very new concepts. Yeah, that sounds about right. So when I was preparing
[56:05.240 - 56:10.680] for this episode, I was looking back at RL squared, and, you know, Ilya Sutskever
[56:10.680 - 56:18.640] was giving a talk about this. And at the time, it was OpenAI Universe, which
[56:18.640 - 56:24.880] was like the Atari Learning Environment, but with way more games. And that was something
[56:24.880 - 56:29.000] that kind of just fell by the wayside back in the day. I guess either we weren't
[56:29.000 - 56:33.880] ready for it, or the meta-learning didn't really happen. Do you guys have any
[56:33.880 - 56:38.440] comments about OpenAI Universe, or what happened back then? I guess, was
[56:38.440 - 56:42.480] RL just not powerful enough for such a task distribution? Yeah, that's a
[56:42.480 - 56:49.080] great question. I remember Universe. I don't actually know what was the specific issue
[56:49.080 - 56:53.160] they ran into with that. But I think what we're finding
[56:53.160 - 56:59.120] here is that designing a task distribution in which you can train
[56:59.120 - 57:03.520] these more capable agents is a really complicated problem. There have
[57:03.520 - 57:09.080] been multiple really high-profile efforts in this direction. And somehow we're still
[57:09.080 - 57:15.640] not really there, I feel like. Or maybe XLand 2.0 is that, but we
[57:15.640 - 57:21.040] don't get to play with it. So I don't know. But yeah, I think it's just a testament
[57:21.040 - 57:27.600] to the complexity of that particular problem, that it's just hard
[57:27.600 - 57:33.880] to come up with really good task distributions for meta-RL. So this was a very long, very
[57:33.880 - 57:42.080] detailed paper, 17 pages of references, actually more than 17. It was absolutely mind-bending,
[57:42.080 - 57:47.680] honestly, reading this and trying to keep track of all these ideas. I'm sure we've
[57:47.680 - 57:52.640] just scratched the surface of it today. But can you tell us a bit about the experience
[57:52.640 - 57:57.240] of writing this paper? I think you mentioned a little bit in the beginning about how
[57:57.240 - 58:01.200] some of your ideas changed as you went through it. But can you talk about the experience
[58:01.200 - 58:04.520] of writing? What's it like writing a survey paper? I can't imagine how much reading you
[58:04.520 - 58:10.160] had to do. Yeah, I think we alluded to this before, but we kind of had a couple of false starts.
[58:10.160 - 58:13.120] We didn't really know what we were doing, right? This was a lot of trial and error on
[58:13.120 - 58:20.240] our part from the very beginning. We kind of sat down and, like, methodically
[58:20.240 - 58:26.320] proposed different ways in which meta-RL algorithms could differ. You know, okay, how
[58:26.320 - 58:29.160] can the inner loop be different? How can the outer loop be different? How can the policy
[58:29.160 - 58:32.760] we're adapting be different? And it turned out that just wasn't at all how the literature
[58:32.760 - 58:37.840] was organized and didn't reflect anything out there in the world. So we had to completely
[58:37.840 - 58:44.840] redesign our framework, which was a big effort. And then after redesigning the framework,
[58:44.840 - 58:48.440] actually keeping track of and organizing people on a project this large was something I'd
[58:48.440 - 58:53.680] never done before. And, you know, I think we had to come up with processes just for
[58:53.680 - 58:58.800] that. And that was pretty difficult. So Risto has multiple spreadsheets where we keep
[58:58.800 - 59:04.920] track of who's assigned to what conference and what paper has been read by whom. And I
[59:04.920 - 59:10.240] think that was a pretty useful tool in and of itself. Yeah, definitely. It turned
[59:10.240 - 59:17.240] into a project management exercise, to a large extent. As much as it was about
[59:17.240 - 59:24.040] writing, it was just, you know, managing the complexity. So in the future, do you think
[59:24.040 - 59:30.400] we will all be using meta-RL algorithms only, or algorithms designed by meta-RL, maybe
[59:30.400 - 59:36.080] I should say? Right now they're generally all hand-designed, as you mentioned
[59:36.080 - 59:41.400] in the paper, hand-engineered. Do you think this is just an early phase, a
[59:41.400 - 59:46.560] pre-industrial-revolution type thing? Well, I wouldn't be surprised, I guess, if every algorithm
[59:46.560 - 59:51.480] had some automatically tuned component, whether that is directly using meta-RL or some other
[59:51.480 - 59:56.840] form of RL. But I would also be surprised if it turned out that the long-horizon multi-task
[59:56.840 - 01:00:00.300] setting wound up giving us something that could beat the state-of-the-art methods we're hand-designing,
[01:00:00.300 - 01:00:06.040] you know, as smart as engineers are ourselves. But that said, I think
[01:00:06.040 - 01:00:10.000] meta-learning, whether explicitly designed as part of the
[01:00:10.000 - 01:00:13.360] problem in the few-shot setting, or as an emergent capability, like in the LLMs we're
[01:00:13.360 - 01:00:18.760] seeing now, I think that's going to be in a lot of products from now into the far future.
[01:00:18.760 - 01:00:24.160] Any comment on that one, Risto? Yeah, yeah, I kind of feel the same. Like,
[01:00:24.160 - 01:00:30.280] there are definitely a lot of people who believe that you can do better. Like, you know,
[01:00:30.280 - 01:00:34.560] learned optimizers and those kinds of things are very relevant here. And I think a lot
[01:00:34.560 - 01:00:41.160] of people are looking into how to actually make those things work.
[01:00:41.160 - 01:00:47.520] That said, I don't think we have anything like that deployed. So maybe we're missing
[01:00:47.520 - 01:00:53.760] some bigger piece of that puzzle. Like, how do we actually get, you know, through
[01:00:53.760 - 01:00:57.900] the uncanny valley to learned optimizers and learned RL algorithms that
[01:00:57.900 - 01:01:02.760] are actually better than human-designed ones? So there's work to do there.
[01:01:02.760 - 01:01:09.240] But I don't really doubt it. I think it's an exciting problem to work on.
[01:01:09.240 - 01:01:13.840] This might be a bit of a tangent. But even if, you know, LPG didn't create a
[01:01:13.840 - 01:01:19.560] state-of-the-art inner loop — you know, it was state of the art circa, like, 2016 or 2013 or something,
[01:01:19.560 - 01:01:24.080] using algorithms in the outer loop that are state of the art now — even if the inner loop wasn't better
[01:01:24.080 - 01:01:28.960] than anything we have lying around, it might be the case that for particular types of problems,
[01:01:28.960 - 01:01:33.520] like if we're trying to meta-learn an offline inner loop, which is a pretty difficult thing
[01:01:33.520 - 01:01:40.640] to manually hand-engineer, or we're trying to meta-learn an inner loop that can deal with
[01:01:40.640 - 01:01:45.640] non-stationarity, so for instance for continual learning, it might be the case that
[01:01:45.640 - 01:01:48.960] meta-learning in the outer loop can produce better learning algorithms there than humans
[01:01:48.960 - 01:01:52.800] can hand-engineer. I think that's kind of yet to be seen.
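The inner-loop/outer-loop split Jacob describes can be made concrete with a deliberately tiny example: the inner loop is an ordinary learner, and the outer loop searches over how that learner learns. Here the "learning algorithm" is just a step size on a toy objective — methods like LPG meta-learn far richer update rules, so this is only a sketch of the structure:

```python
def inner_loop(lr, steps=20):
    # The learning algorithm being meta-learned: gradient descent
    # on f(w) = (w - 3)^2, parameterized by its step size.
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= lr * grad
    return (w - 3.0) ** 2  # final loss; lower means a better learner

def outer_loop(candidates):
    # The meta-learner: pick the inner-loop variant that produces the best
    # learning outcome. Real methods use meta-gradients or evolution here
    # rather than brute-force search over a handful of candidates.
    return min(candidates, key=inner_loop)

best_lr = outer_loop([0.001, 0.01, 0.1, 0.5])  # 0.5 solves this problem in one step
```

The same structure carries over to the harder cases Jacob mentions: if the inner loop must learn offline or under non-stationarity, the outer-loop search is over update rules that humans find difficult to hand-engineer.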
[01:01:52.800 - 01:01:56.000] Is there anything else I should have asked you two today?
[01:01:56.000 - 01:02:03.160] Is ChatGPT conscious? I'm kidding. Yeah, I'm kidding. No, I think we
[01:02:03.160 - 01:02:10.640] covered this pretty extensively. Okay, well, since we're going there, let's
[01:02:10.640 - 01:02:14.600] just take a moment, because I've started to ask people this. What do you guys think
[01:02:14.600 - 01:02:19.800] of AGI? And is meta-RL going to be a key step to getting to AGI?
[01:02:19.800 - 01:02:30.920] Oh, this is risky. I mean, carefully now, Risto. But yeah, I think
[01:02:30.920 - 01:02:36.960] if you want to train agents that can tackle problems in the real
[01:02:36.960 - 01:02:42.440] world, they're going to require some level of adaptive behavior, most likely.
[01:02:42.440 - 01:02:46.980] Well, I mean, maybe you can get around that by doing a really careful
[01:02:46.980 - 01:02:51.320] design of the agent itself and that kind of thing. But probably it's better if you
[01:02:51.320 - 01:02:56.880] can adapt to the environment. So in that sense, this idea of learning to adapt, learning
[01:02:56.880 - 01:03:04.640] agents that can take cues from the environment and act upon them, is really central
[01:03:04.640 - 01:03:14.520] to just deploying stuff in the real world. So again, meta-RL, emergent
[01:03:14.520 - 01:03:20.560] meta-learning, seems important. And on the other hand, we kind of see these
[01:03:20.560 - 01:03:27.440] kinds of meta-learning behaviors come out of things like ChatGPT. It can do
[01:03:27.440 - 01:03:31.240] in-context learning and stuff like that, even though it hasn't been explicitly
[01:03:31.240 - 01:03:40.680] trained on an explicit meta-learning objective. So I would say that
[01:03:40.680 - 01:03:46.760] we're definitely going to see at least emergent meta-reinforcement learning
[01:03:46.760 - 01:03:51.200] in the generally capable agents we're going to be looking at in the future.
[01:03:51.200 - 01:03:56.040] Yeah, I agree with what Risto said, and I should also tread carefully here. And to be clear,
[01:03:56.040 - 01:04:03.880] I was joking about the, you know, consciousness of ChatGPT, lest I be misquoted. But
[01:04:03.880 - 01:04:08.440] I do think that fast learning is one of the major hallmarks of intelligence. And
[01:04:08.440 - 01:04:12.880] so regardless of whether we design that manually or it's an emergent property of our systems,
[01:04:12.880 - 01:04:18.000] fast adaptation, fast learning, will be a part of the generally capable systems going forward.
[01:04:18.000 - 01:04:23.560] So there's a meme that's been popping up every once in a while more recently:
[01:04:23.560 - 01:04:29.680] do we even need RL? Like, Yann LeCun had, you know, a slide not long ago saying, basically,
[01:04:29.680 - 01:04:37.720] let's try to either minimize RL, or use more like shooting methods and learn better
[01:04:37.720 - 01:04:42.960] models, and just not need RL. And we had Aravind Srinivas from OpenAI saying,
[01:04:42.960 - 01:04:48.800] kind of maybe what you were saying, Jake, that emergent RL was just
[01:04:48.800 - 01:04:54.480] happening inside the transformer in, say, Decision Transformer, which isn't really even
[01:04:54.480 - 01:04:59.280] doing RL, it's just supervised learning. So do you guys think that RL is always going
[01:04:59.280 - 01:05:04.960] to be an explicit thing that we do? Or is it just going to be vacuumed up as
[01:05:04.960 - 01:05:10.840] another emergent property of these bigger models, and we don't
[01:05:10.840 - 01:05:12.120] have to worry about RL anymore?
[01:05:12.120 - 01:05:16.680] Well, I guess to start, not to split hairs on definitions of terms, but I guess I'll
[01:05:16.680 - 01:05:22.040] kind of split hairs on definitions of terms. Some people would say, when Yann says
[01:05:22.040 - 01:05:25.640] that RL isn't necessary, we should just do, like, world models, a lot of people would
[01:05:25.640 - 01:05:31.480] call that model-based RL. If I'm not misremembering, I think Michael Littman
[01:05:31.480 - 01:05:35.760] would, you know, definitely fall into that camp. So I think you could
[01:05:35.760 - 01:05:43.240] make a lot of people less upset by not saying that that's not RL. So that's maybe comment number
[01:05:43.240 - 01:05:48.400] one. And comment number two is, we really shouldn't do RL at all if we can avoid it. It's a pretty
[01:05:48.400 - 01:05:53.480] weak form of supervision. And so, you know, we had a small section on supervision;
[01:05:53.480 - 01:05:58.680] if we can at all avoid RL in the outer loop, that's better. And we can still clearly wind
[01:05:58.680 - 01:06:01.040] up with reinforcement learning algorithms in the inner loop.
[01:06:01.040 - 01:06:08.440] Yeah, I'm on the same lines here as Jake, for sure. If you can get away
[01:06:08.440 - 01:06:13.840] without using RL, go do it. It's probably going to be better. But it's hard
[01:06:13.840 - 01:06:21.240] to imagine, like, at least to me, it's not clear how you would solve... Well,
[01:06:21.240 - 01:06:26.880] I don't know what's a concise description of a problem where surely you need RL. But
[01:06:26.880 - 01:06:31.440] there are problems where I have a hard time imagining that you can get around without
[01:06:31.440 - 01:06:34.080] something like RL being actually deployed there.
[01:06:34.080 - 01:06:44.400] Well, ChatGPT uses it, as it's called. I'm invested in it, right? Yeah, exactly.
[01:06:44.400 - 01:06:49.840] So I gotta thank you both. This has been fantastic. Jacob Beck and Risto Vuorio, thanks so much
[01:06:49.840 - 01:06:53.360] for sharing your time and your insight with the TalkRL audience today.
[01:06:53.360 - 01:07:00.720] Awesome. Thanks so much, Robin. Yeah, thank you.