TalkRL podcast is all reinforcement learning, all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at @TalkRLPodcast. I'm your host, Robin Chauhan.
I'm super excited to have our guest today, Danijar Hafner, a PhD candidate at the University of Toronto with Jimmy Ba. He's a visiting student at UC Berkeley with Pieter Abbeel and an intern at DeepMind. And of course, Danijar was our guest before, back in episode 11. Welcome back, Danijar.
Thanks for having me, Robin.
Yeah, our last interview together was honestly one of my favorite interviews that I've ever done. It was an audience favorite and I learned a ton, so I'm super excited about today. You have done a lot of incredible work since the last time we spoke, so let's jump right into it. We're going to start with Dreamer version three. That's "Mastering Diverse Domains through World Models", with yourself as first author, et al. So this is version three of your Dreamer series. We talked about version one back in episode 11. But can you briefly remind us, what is the idea with the Dreamer series overall?
So the idea is that typically reinforcement learning would just try out a lot of different action sequences in the environment. Over time, through trial and error, it would figure out which of them are better and more likely to lead to the goal than others. But that requires a lot of interaction with the environment, which makes it really hard for tasks where it's hard to get data. Take real robots, for example. They run pretty slowly, and you can't just speed them up like a simulator. So the more sample efficient, the more data efficient you can be, the better.
The Dreamer line of work addresses that problem with a pretty intuitive approach: instead of running a ton of trial and error in the real environment, you use the data you get from interacting with the environment to learn a model of the world. Then you can use that model to run a bunch of trial and error in imagination, without having to actually interact with the real environment. The idea of model-based reinforcement learning is old, but it's been pretty challenging to get it to work in practice, especially for complex tasks and environments that have high-dimensional inputs like images. Dreamer learns the world model from images, and actually from pretty much any input you give it. That's part of what we got to in Dreamer v3. The goal really is an algorithm that is very data efficient and that you can just throw at any problem out of the box. The code is open source, and hopefully it'll make reinforcement learning a lot easier to use for people out there.
And you definitely succeeded in that. I see from the paper that it exceeds IMPALA performance while using 130 times fewer environment steps. That is a really big difference. Can you walk us through, at a high level, the progression from version one to two and now to three? Were these different versions about refining your original vision, or were your goals evolving as you went?
The line of work actually started with PlaNet, the Deep Planning Network. Then there was Dreamer one, two, and three. The vision hasn't changed along the way, just the algorithmic details to make it work better and better. I think it's true of so much in AI that the general idea has been out there for a long time. If you think about large language models, for example, sequence-to-sequence has been a big research topic for a long time, and it just incrementally gets better over time.
Sometimes there's a bigger jump, like transformer architectures and so on. And now with models like GPT-4, we've gotten to really good capabilities. But I think it's easy for people to forget how incremental and gradual the progress is, and over how many years it spans. It's been similar with model-based reinforcement learning, or just data-efficient reinforcement learning in general. PlaNet was the first version where we trained these world models on high-dimensional inputs, on videos. And to really make that feasible, you need two things. Well, first of all, at a high level, what do I mean by world model? I mean something that gives the agent a rich perception of the world, some representations that summarize its inputs more compactly and then allow you to predict forward. By predicting forward, you can then do planning with it. It could be a tree search like in AlphaGo, or it could be model predictive control, which is common in robotics. You could use it to do decision-time planning as you interact with the environment. You can also use it to do offline planning, where you just imagine a lot of scenarios, a lot of sequences, with your world model and train a policy on that. Then when you interact with the environment, you just sample from the policy. So what do you need for this to work? Why has this been challenging in the past? Well, first of all, you need a model that's accurate enough to actually get successful planning with it. If it doesn't approximate the environment well, then you're screwed if you're just learning your behaviors from predictions of the model. The second point is that it has to be computationally efficient. That's another theme that reaches throughout AI: we really care about maximizing the compute efficiency of our algorithms.
Say an algorithm is twice as compute efficient as another one: you can just run twice the size of the model, or train on twice as much data, and you'll get way better performance by comparison. In PlaNet, we figured out how to learn this world model. We do that by encoding all your sensory inputs at each time step into a compact representation. There's a stochastic bottleneck there as well. Then we learn to predict the sequence of these compact representations with a GRU conditioned on the sequence of actions. You train that by reconstructing the images at each time step, and also by predicting the rewards. The rewards are important because later you can do planning just in this learned representation space, without having to reconstruct images. Now, the model hasn't actually changed very much since PlaNet, at least at a high level. I tried a lot of different designs, and I keep coming back to this one because it works really well empirically and it's pretty simple.
This is the RSSM component.
This is the RSSM, yeah. Recurrent state space model is what we called it. What has changed totally from PlaNet to the first Dreamer is how we derive behaviors from this model. In PlaNet, we did online planning at each time step as you interact with the environment. That's very computationally expensive, because you have to do a bunch of rollouts with your model to find a good action. And you can't really reuse that computation much for the next time step, because then you'd be too narrowed in on your previous plan, which didn't look far enough into the future. So you end up planning from scratch, doing thousands of model rollouts at every time step as you interact with the environment.
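To make the cost of replanning at every step concrete, here is a minimal sketch of an online planning loop. It is an illustration under stated assumptions, not PlaNet's actual planner: it uses plain random shooting rather than a latent-space search, and `toy_model` is a hypothetical hand-written stand-in for a learned world model. All names and numbers (`plan_first_action`, `horizon`, `candidates`, the target state of 5) are made up for the example.

```python
import random

def toy_model(state, action):
    """Hypothetical stand-in for a learned world model: returns (next_state, reward)."""
    next_state = state + action
    reward = -abs(next_state - 5.0)  # reward is highest when the state is near 5
    return next_state, reward

def plan_first_action(state, horizon=5, candidates=200, rng=None):
    """Random-shooting planner: sample action sequences, roll each one out in
    the model, and return the first action of the best-scoring sequence."""
    rng = rng or random.Random(0)
    best_return, best_action = float("-inf"), 0.0
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s, r = toy_model(s, a)
            total += r
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

def run_episode(steps=10, seed=0):
    """Replan from scratch at every step, executing only the first planned action."""
    rng = random.Random(seed)
    state = 0.0
    for _ in range(steps):
        action = plan_first_action(state, rng=rng)
        # For simplicity the "real" environment here is the same toy model.
        state, _ = toy_model(state, action)
    return state
```

Even in this toy setting, each environment step costs `candidates * horizon` model evaluations (1,000 here), and none of that work carries over to the next step, which is the inefficiency described above.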
That's fine if you're training on one robot, because it does run faster than real time. But it's not really fine if you want to compete with other algorithms on simulated environments where collecting data is really cheap. So in Dreamer, we switched from online planning to doing offline rollouts, starting from states that come from the replay buffer, and using these rollouts to train an actor-critic policy that is fast to sample from. It also has a value function and can therefore take into account rewards that are beyond the planning horizon, or rollout horizon. For Dreamer v1, we focused on continuous control from pixels. It was very data efficient and got high final performance, but it was also pretty narrow, in the sense that we couldn't deal with discrete actions very well and we weren't competitive on standard benchmarks like Atari. So that was the focus of the Dreamer v2 paper: really matching or exceeding the performance on a very popular benchmark that people have tried to improve results on for a long time. We did that by switching to discrete representations in the algorithm, which better match the discrete nature of these games, where you have discrete actions. We also improved the objective function in some ways. And then Dreamer v3 is the next natural step. People were starting to get pretty interested in using Dreamer for all kinds of problems, especially where data efficiency matters and where you have high-dimensional inputs. So I ended up helping a lot of people tune the algorithm and get it to work.
That ended up taking a lot of time for me. I developed all these intuitions: in this type of environment, maybe you need to set your entropy regularizer for the policy a bit higher, because the rewards are pretty sparse and so you need to explore more. And here, the visual complexity of the environment is very low. In Atari, maybe you need to pay attention to individual pixels to get really good performance, so you need to change the world model objective to get really good at reconstruction and not abstract too much away in the representations. Whereas in complex 3D environments, you actually want to abstract quite a bit away so that you can generalize better, and forward prediction becomes easier as well. You don't care about all the texture details in the image. My goal for Dreamer v3 was to automate all of that away: to have an algorithm where the objective functions are robust enough that you can just run it out of the box and it'll give you good performance.
So as an engineer, I love your attention to efficiency. The efficiency of this design is actually really amazing. It's the compute efficiency and also the sample efficiency, in that it requires few samples from the environment. Do you think that you're approaching some kind of limit with these efficiencies? Or can you envision a future Dreamer that may be two or even 20 times more efficient in these senses? Where do you think we are in the long term in terms of efficiency?
Yeah, that's a great question. In Dreamer v3, because the algorithm is so robust, we observed very predictable scaling behavior. Obviously, if you do more gradient steps, that is, if you replay more data from your replay buffer, then you will be more data efficient. At some point, you will start overfitting, and your algorithm will degrade in performance.
And that point is much further down the road than it is for a lot of model-free algorithms. So in a sense, the world model lets you trade off more compute to become more data efficient. Maybe more surprisingly, increasing the size of the model has a very similar effect: you become more data efficient and reach higher final performance as well. That matches what we're seeing for large language models, but it hasn't really been demonstrated very much in RL. It's pretty exciting to be at a point now where we have very predictable scaling capabilities. And I think even today, if we want to be 10 times more data efficient, we can actually do that by increasing the model size, increasing the gradient steps, and just waiting longer for the whole thing to train.
So we featured Jacob Beck and Risto Vuorio recently, and their survey on meta-RL. I believe they mentioned that Dreamer does a good job as a meta-RL agent as well. Does that surprise you at all? Have you used it that way, and do you see Dreamer being applied more in a meta sense?
That's interesting. Do you know what kind of tasks they did with it?
I would have to get back to you on that one.
Yeah. So the model integrates information over time into Markovian states. There's actually a reason we're not using transformers in the world model, even though transformers are everywhere now. The reason is that Markovian states make it much easier to do control, to do RL on top of the representations. It's easier to fit a sequence with a transformer, which doesn't have this recurrent bottleneck to squeeze everything through. But by forcing the model to learn Markovian representations, we're actually offloading a lot of what's challenging about RL to the unsupervised model learning objective. So we don't need rewards to learn which parts of the state are relevant.
So I'm not that surprised that, just because it's a sequence model that integrates information over time, it can do some sort of meta-learning if you feed in rewards as well. This is almost just an emergent property of using sequence models in RL, which we should have been doing for a long time anyway. And if there are any specific capabilities where this model-based approach works much better, other than just being more data efficient than model-free, that'd be pretty interesting to know.
So I noticed you had results in Dreamer v3 for Minecraft, making diamonds in Minecraft, which is, I understand, a very hard exploration problem. Can you tell us about that?
It's actually been an open challenge posed by the research community for a couple of years. There were these competitions at NeurIPS to find algorithms that can perform in this complex environment. Minecraft is 3D, and every episode is procedurally generated, so you never see the same thing twice. And the rewards are very sparse. To get to the diamond, you have to complete a bunch of tasks along the way, and it takes roughly 30,000 time steps to get there. Personally, I didn't think it was possible to do without either human data to guide the agent along the way, like OpenAI did, or at least a very strong intrinsic exploration objective. There are some ideas for that out there, but I don't think there is anything yet that works really well out of the box in something like Minecraft. So, in a sense, it was a pretty long shot. But we had this algorithm, Dreamer v3, that works quite well out of the box. You can't really tune hyperparameters that well on real robots, and it's similar in Minecraft, because it's a pretty complex task and it takes quite a while to train. It ended up taking 17 days for our training runs to finish.
You don't want to tune hyperparameters and fiddle with the algorithm at those timescales. So we just ran it and waited. We were like, okay, let's give it two weeks and see what comes out of it. After two weeks, we already had a couple of diamonds, so we let it run a little bit longer. It was a test of the algorithm, in a sense, and it ended up working out great. We didn't really expect that it was possible with just the robust objective functions and the entropy-regularized policy objective that we use in Dreamer v3.
So do you plan a Dreamer version four and so on? What kind of issues do you think you might tackle with future versions, if there are future versions? Is there a Dreamer v4?
Um, unclear. I think the biggest issue we're facing in RL now is doing temporal abstraction really well. Another issue is that we want to leverage pre-training data to squeeze out more sample efficiency and learn tasks much faster. Some people might wonder, why does sample efficiency matter so much? Can't we just get a lot of data and solve everything that way? I think it's important to point out that there are two reasons sample efficiency matters a lot. The first is that we don't have web-scale datasets of how to make decisions for a lot of decision-making problems. For example, in robotics, or even for language-based assistants and so on, if we want to fine-tune them to become goal oriented, we don't really have huge datasets for that. OpenAI is collecting one, but I doubt they'll share it. So you have to be data efficient, because there just isn't that much data to supervise from. The second part is that we also want our algorithms to adapt quickly, and learning from small amounts of data is the same as adapting quickly. To me, that's the core of intelligence: how can you adapt very quickly?
You want to generalize as far as you can, but at some point you'll be out of the distribution of what you can generalize to, given the data you have. We'll reach that point even with large language models. Then the tricky question is, how do you adapt from there? How do you adapt to new stuff? We want these algorithms to discover new things for us. So in terms of open challenges in RL, the biggest to me seem to be, first, learning abstract events and planning over them to do very long-horizon reasoning, and second, using pre-training data that is available but not that easy to use, like unlabeled videos without actions that you can get from YouTube. Those are the two things I'm focusing on going forward. I don't know if there'll be another general Dreamer algorithm, because it will have these new things built in, and it might deserve a new name at that point.
So let's move on to DayDreamer. That's where you apply Dreamer to physical robots, if I understand. I admit, when I first encountered Dreamer and PlaNet, I assumed something like, oh, maybe that model is good for learning in simulated worlds, but would that kind of model really make sense in the real world? And I guess you answered that question here. So could you tell us about DayDreamer?
DayDreamer uses the Dreamer v3 algorithm, actually a slightly earlier version of that algorithm, since it was published before the Dreamer v3 paper came out. The question is, can we actually run these algorithms on real robots? Do all the sample efficiency improvements we see in simulated environments transfer to something real in the physical world? The whole project actually happened in just a few weeks, because really, we just ran the algorithm on the robot and it worked out of the box.
That was the ultimate test. You can't tune hyperparameters on a real robot very well, because every training run takes multiple hours, things break along the way, and you need people to fix stuff all the time. So the focus of Dreamer v3 on just running out of the box really paid off when we were running on real robots. I would say the tasks that we did there are still fairly simple, so it'd be very exciting to see how much we can scale this up. We trained on visual pick-and-place from sparse rewards, with an arm that picks up balls and places them into a different bin. We also trained the quadruped, the dog robot, from scratch with a manually specified reward function that has three components. We trained it to roll over and then figure out how to stand up and walk, without any simulators, in just one hour. And there were no resets. I mean, sometimes the robot would get too close to the wall and start just trying stuff, so we would pull it back into the middle of the room. But we made sure not to change the joint configuration of the robot. If you had more space, you wouldn't have that issue. And it worked amazingly well. It walked in one hour, and then it's continuously learning in the real world, so you can just start messing with it and see how it adapts. Initially, if you perturb it a little bit, it just falls and struggles to get back up. But within 10 to 15 minutes, it actually learned to withstand our pushes, or to roll over very quickly and get back up on its feet.
That's amazing. So this is really Dreamer out of the box, with no changes? Did you come away from this thinking, maybe there are some things I could tune to make it more robot friendly, or really not even?
The only thing we had to do was to parallelize the gradient updates for the neural net with the data collection on the robot.
So you run those in two different processes, so that doing a gradient step doesn't pause the policy. The policy can run continuously, and you sync the parameters over every couple of seconds. So nothing on the algorithmic side, and a little bit on the software infrastructure side. I think things are starting to become more general on that front as well. And for Dreamer v3, I think we didn't talk about it yet, but the main result in terms of capabilities was to solve the Minecraft diamond challenge from sparse rewards. That also needed a little bit of infrastructure setup and so on.
I played Minecraft a little bit, and I can say I've never made a diamond. We did talk to Jeff Clune in the last episode, who did VPT, the Video PreTraining method, which was OpenAI's approach to learning from YouTube videos of humans playing Minecraft. I understand there was something like 24,000 actions required to create a diamond. So this is quite an exploration challenge.
Okay, let's move to Director. That's "Deep Hierarchical Planning from Pixels", by yourself et al. So what is happening with Director?
I already mentioned briefly that I think one of the big challenges in RL now is to deal with temporal abstraction. It's crazy that our RL algorithms are still doing basically all their reasoning at the timescale of primitive actions. If you think about it, humans can set very long-term goals. It doesn't even sound that long term to us if we say, I just want to go to the grocery store and buy some stuff. But if you think about the millions of muscle commands that have to be executed along the way, it seems completely hopeless to learn that if you're assigning credit and planning only in this low-level action space.
Somehow, we as humans have this ability to identify meaningful events from our raw, high-dimensional sensory inputs. Then we can plan over those things: okay, I have to look up where the grocery store is, I have to go through the door and then open the door, blah, blah, blah, and then drive over there. Our RL algorithms aren't really able to identify these high-level events and plan over them through long horizons. Director is a first step towards that. It uses world models, but also trains a goal-conditioned policy at the low level that learns to go from anywhere to anywhere in state space. Then you can use a high-level policy on top that just directs the low-level policy around, and that's where the name comes from. The high-level policy chooses goals that are either exploratory or helpful for solving the task, that achieve high task reward. The low-level policy, on the other hand, is only trained to reach the goal. It doesn't even know about the task. And that's the ultimate test that this thing is actually working.
So I first encountered this idea in the FeUdal Networks paper from DeepMind a few years ago. I guess they also referenced an earlier feudal reinforcement learning paper. And I wondered, is there an intuitive reason why splitting the agent into a manager and a worker in this way is helpful for learning?
It lets you plan much further into the future at the high level, because it's only planning over the sequence of goals, which change less frequently than the low-level primitive actions. I think there's a lot more to be done. In Director, we're changing the goal every 16 steps, so it can only look 16 times further into the future at the high level. There are multiple benefits to that.
One is that if you plan further into the future, you might just find a better strategy to get to the goal. Especially if the reward is sparse, it might otherwise be out of your visible horizon. Even with a value function, you still have an effective horizon based on the discounting. Moreover, credit assignment becomes easier. Maybe you got to the goal, and it took a sequence of thousands of actions. Which of these actions should I make more likely, and which should I make less likely? It's a lot easier to make that decision at the high level and say, okay, here are the five goals that I chose, and those five goals seem to have been good. Another thing is that you're offloading a lot of the learning of how to get from A to B, the goal-conditioned policy, to an unsupervised objective function. You don't need any rewards to learn a good low-level policy that can reach arbitrary goals in your environment. So you can get away with learning from fewer rewards if you only need them to train the high-level policy.
So I saw your Clockwork VAE, which has more than two levels of temporal abstraction. Do you see future Directors having more levels of management, to expand to higher levels of temporal abstraction?
That's a very good question. I think there are two perspectives on this. One is, yes, you just want a deep hierarchy, and you want that hierarchy to be explicit. Then you can do top-down planning. You plan very far into the future at the high level, but only take, say, 10 steps. At the level below, you plan how to achieve the goals from above. Now you're not planning as far into the future, because the level is less temporally abstract, but you still predict 10 time steps forward. So effectively, you end up with a triangle, where the highest level looks the furthest into the future and the lowest level is quite reactive.
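The horizon arithmetic in this answer can be sketched in a few lines. This is an illustration under assumptions, not something taken from Director or Clockwork VAE: it assumes every level plans the same number of steps ahead and that one step at a level spans that same number of steps at the level below. The function names and the example discount of 0.99 are made up for the sketch; the 1 / (1 - discount) rule of thumb is the usual approximation for how far a discounted value function effectively looks.

```python
def effective_horizon(discount):
    """Rule-of-thumb number of steps a discounted value function 'sees'."""
    return 1.0 / (1.0 - discount)

def level_horizons(steps_per_level, num_levels):
    """Horizon of each level measured in primitive steps, assuming every level
    plans `steps_per_level` steps ahead and one step at level k spans
    `steps_per_level` steps of level k - 1 (level 0 = primitive actions)."""
    return [steps_per_level ** (level + 1) for level in range(num_levels)]

# Three levels that each plan 10 steps ahead: the "triangle" of horizons.
print(level_horizons(10, 3))             # [10, 100, 1000]

# A discount of 0.99 corresponds to an effective horizon of about 100 steps.
print(round(effective_horizon(0.99)))    # 100

# With a new goal every 16 primitive steps (the Director setting mentioned
# above), a high-level plan of the same length covers 16 times more steps.
print(16 * round(effective_horizon(0.99)))  # 1600
```

Under these assumptions, adding a level multiplies the reachable horizon instead of adding to it, which is the appeal of a deeper explicit hierarchy.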
In that sense, you definitely want more than just two levels. But there's another perspective, which is that maybe these hierarchies can be a bit more implicit than that. Maybe we don't want to fix the timescales to be powers of two or something, and maybe we don't want a fixed number of, say, 10 levels. What if we can train dynamics models such that some of the features just change less often than others? Implicitly, this becomes a hierarchy. Because the slow features, the slow dimensions of your representation, change less often, predicting what they will change to effectively gives you a prediction much further into the future. And then there will be some features that represent the high-frequency information in the video, which changes all the time. Those will effectively be the lower levels of your hierarchy. I think it's a big open question how to actually learn that and use it for abstract planning, but it seems like a pretty compelling idea.
So I understand that in Director, goals are states from the environment, or observations. Is that right?
Goals are representations, or states, of the world model. They are recurrent states of the GRU. And because we have a decoder in the world model, we can actually look at them in image space and interpret what the thing is doing, what subgoals it's choosing to decompose some really long-horizon task.
So for humans, when I think of a goal, we think abstractly and partially. When I imagined getting an engineering degree, I didn't really think about the details of the scene when I obtained my certificate. How might we bridge that difference between a partial, abstract goal and these very concrete goals?
Yeah, you're bringing up a really important property that we want in goal spaces for goal-conditioned RL.
And there isn't a great solution for that out there yet. I think it's one of the biggest challenges for goal-conditioned RL and temporally abstract RL: how do you actually want to specify your goals? Language can be a decent approximation to that, and maybe it'll suffice for a couple of years. But really, I think what you would want is a representation that's quite disentangled, where your goal is to change some small aspect of that representation, maybe a couple of its dimensions. So the top-down goal could be something like: here is the feature vector, and here is a mask. The mask is very sparse, so most of the features in the representation I don't care about. But I'm sure there are better ways to do it, and we don't really have working algorithms for it yet. So I'm excited to see what people come up with.
Okay, let's move to your next paper, "Action and Perception as Divergence Minimization", again by yourself et al. I remember seeing this when it came out, I believe in 2020, and I had a sense that you were telling us something really important. But I admit I really didn't understand a lot of the details, so maybe you can help us more today. It seems like a grand unified theory for designing these types of agents, and the scope of all the different types of agents that you've explained in this framework is really quite diverse and amazing.
The big question this is addressing is, what objective function should your agent optimize? The objective function shouldn't just be a reward, because if you want to solve tasks based only on some task-specific reward, then you have to design the reward function. So you still basically have to know what's a good way to solve the task, because your agent won't work just from sparse rewards.
It will need very detailed shaping rewards to actually solve a task, and so we're just offloading the problem, or changing the problem from designing the strategy directly to designing the detailed reward function, and then using RL to fill in the gaps and track that reward function. So we can take a step back and think about, you know, fundamentally what should the reward function be for a general agent? And, you know, there are some ideas out there that say, well, maybe just if you have the right reward function, then everything will be fine, but that's actually not true. So you can have objective functions that only care about the inputs the agent is receiving, like a reward function, right? Reward is basically an input from the environment. It's at least often thought about that way in traditional RL. And you can also have objective functions that depend not just on what the agent is seeing or receiving, but also on the agent's internal variables and its actions. And so now you have a broader class of objective functions, and you can ask the question, well, what is the space of all these possible objective functions? And it turns out we can actually categorize what the unsupervised objective functions are that an agent can optimize. So in addition to rewards or task specific objectives that are inherently narrow in the sense that they only work for a specific domain, and then if you're in a different world, then they wouldn't really make sense there potentially. You always find an environment where this reward function is the opposite of what you'd want to do. So is there something more general? And the answer is yes, there are unsupervised objective functions that make sense in any environment. And it's similar to how when you do object classification, you can learn your representations through some unsupervised objective like CPC or masked autoencoding, something like that. 
And then you can learn your classifier on top of that very quickly from a small amount of supervision. But in the embodied case, for embodied agents, we actually have three different classes of unsupervised objective functions. And all unsupervised objective functions can be put into one of these three categories. And so the first category is to learn representations from your past data. And the most complete way of doing that is to learn a world model, to just model your complete past trajectories. It doesn't have to be through reconstruction. It could be through other ways that end up, given some inductive biases, your model architecture, and so on, maximizing the information that's shared between the past inputs you've received, let's say in your replay buffer, and the representations that are in your agent, right, so you infer representations that are informative of past inputs. Now, that's all you can do if the dataset is fixed and given. But for an embodied agent, you can actually influence your future data distribution. And so you can not only infer representations that are informative of past inputs, you can also steer towards inputs in the future that you expect to be informative of your representations. And so that explains the class of unsupervised exploration objectives, right, get diverse data. And to connect it to the information maximization, basically the question is: here's a potential future trajectory, how much information does that share with my representations? In other words, how much will it tell me about how I should change my representations? And so by collecting a diverse dataset, an agent can set itself up to do better on tasks that it's given later on, because it can already learn a lot about the world.
Similarly, from the data you have, if you learn general representations, they will also make it faster to adapt to a new task later on and help you if the reward is sparse and so on. And then finally, you can maximize mutual information not just between your representations and future inputs, but also between your actions and future inputs. These could be either primitive actions or more abstract actions like goals or skills and so on, latent variables. And now that gives you a form of what's known as empowerment, where you're trying to choose actions that have a certain measurable outcome in the environment, that have high mutual information with what will happen in the future, with your future sensory inputs. And so that category basically means your agent, without any task-specific rewards, can learn how to influence the environment. And so that's the third category of things we can do as an embodied agent to set ourselves up for solving new tasks quickly as they come later on. Yeah, so the three categories are representation learning, in the most complete sense learning world models, exploration, and learning how to influence the environment. And they play together, one benefits the other, right, but they are distinct objective functions. And they actually don't work that well by themselves. For example, if you don't have a diverse dataset, if you just have random actions, you can't learn a very good world model, right? It just won't see interesting enough data to be valid across a wide distribution of states. And if you're learning how to influence the environment, but you're doing that based on narrow data and you're not exploring, that actually has the opposite effect of exploration. And I think people are starting to realize that, especially for skill discovery methods like Variational Intrinsic Control and Diversity Is All You Need, but it's also true for goal-conditioned policies.
And it was an issue in Director that we had to address as well, where initially your replay buffer is not very diverse, and you're training your low-level policy to go to different points in that distribution of states you've already seen, right? So now your low-level policy knows how to go back to the places that it's been to, and it doesn't know how to go to new places. And so you really have to make an effort to get the thing to not lock into the data distribution, but to actually explore new things outside of the distribution. Because otherwise, you'll just collect more of the data you've already seen, your replay buffer fills up with that, and you get locked in more and more into your narrow data distribution. So for example, in Director, there was an issue where an earlier version didn't have the exploration bonus. And it just really liked green walls, because in DMLab, initially it saw a bunch of green walls. And so the policy basically just learned to go to green walls. And then the whole replay buffer would fill up with that, and everything it would ever practice on is going to green walls. And so to get these algorithms to work in practice, you need exploration. And I think people are starting to realize that, just from working on these algorithms. But this framework actually explains it to you from first principles, that you need to combine representation learning, exploration, and learning how to influence the environment to get a general agent. So you mentioned Director, but how does Dreamer fit into this framework? And does this framework suggest to you new types of agents that we haven't seen, or what capabilities will you have now that you have this framework? Can you mix and match agent designs that no one's thought of using this framework? Yes, yes, you can.
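The replay buffer lock-in described in the green walls anecdote above can be illustrated with a toy sketch. This is not Director or DMLab: it assumes a hypothetical 1-D corridor, a goal-reaching policy that always arrives at its goal, and an inverse visit count weighting as a stand-in for an exploration bonus. All names (`run`, `use_bonus`, `counts`) are made up for illustration.

```python
import numpy as np

def run(use_bonus, episodes=1000, n_states=200, seed=0):
    """Toy 1-D corridor; returns how many distinct states were ever visited."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_states)  # visit counts, a stand-in for the replay buffer
    counts[0] = 1.0              # the agent has only seen its start state
    for _ in range(episodes):
        visited = np.flatnonzero(counts)
        if use_bonus:
            # Assumed exploration bonus: prefer goals that were rarely visited.
            weights = 1.0 / counts[visited]
        else:
            # No bonus: goals are drawn like samples from the replay buffer,
            # so frequently seen states are chosen even more often.
            weights = counts[visited]
        goal = rng.choice(visited, p=weights / weights.sum())
        # Assume the goal is reached exactly, then take one random step.
        step = rng.choice([-1, 1])
        final = int(np.clip(goal + step, 0, n_states - 1))
        counts[goal] += 1
        counts[final] += 1
    return int(np.count_nonzero(counts))

coverage_plain = run(use_bonus=False)
coverage_bonus = run(use_bonus=True)
print("distinct states reached without bonus:", coverage_plain)
print("distinct states reached with bonus:   ", coverage_bonus)
```

In this sketch the run without the bonus keeps returning to states near the start, because every visit makes those states more likely to be picked as goals again, while the inverse-count weighting keeps pulling goals toward the frontier.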
So the way Dreamer fits in is that it learns the world model, but it doesn't have any unsupervised exploration or goal-reaching capabilities. Now, we have a paper called Plan2Explore that I think we talked about during the previous episode. Yes. Yeah, that implements the unsupervised exploration idea. And so that's a powerful combination. Now you can run your algorithm on its own, it will learn a model and explore diverse data using that model, and those two objectives will reinforce each other. And eventually you'll end up with a world model that's valid in a lot of different states. And you can very quickly solve new tasks with it, which we showed in the paper at least for fairly simple environments, and I think that's an exciting direction to scale up. And Director builds in the unsupervised way of influencing the future by having a goal-conditioned policy, which I think is the easiest way to implement that part of the framework. And it also builds in exploration at the high level because the manager chooses goals that are exploratory. So Director is the first algorithm that actually combines all three of these aspects in at least one form: it learns a world model, it does unsupervised exploration, and it learns a goal-conditioned policy. And empirically, it works very well on sparse reward tasks. So I think there's a lot of promise there, and we're starting to try it out on robots now. But the design space is much larger than that, right? There are a lot of details in how to implement these different components. And the framework doesn't answer those questions. Those are just empirical questions, and we'll need RL researchers to not fear missing out on large language models and actually solve these really important long-term problems that will bring us towards general AI.
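As a compact summary of the three unsupervised objectives described in this conversation, each can be sketched as a mutual information term. The symbols are assumptions chosen for illustration and are not taken from the paper: $x_{\le t}$ for past inputs, $x_{>t}$ for future inputs, $z$ for the agent's representations, and $a$ for actions (primitive actions, goals, or skills).

```latex
\begin{align*}
\text{Representation learning (world model):} \quad & \max \; I\big(z \,;\, x_{\le t}\big) \\
\text{Exploration:} \quad & \max \; I\big(z \,;\, x_{>t}\big) \\
\text{Influencing the environment (empowerment):} \quad & \max \; I\big(a \,;\, x_{>t}\big)
\end{align*}
```

Read this way, Dreamer addresses the first term, Plan2Explore adds the second, and Director combines all three in at least one form, as described above.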
And yeah, so there's a lot to be done in that space. The framework tells you those are the important things to focus on from first principles: all the things that the agent can do are learning a world model, exploring, and learning how to influence the environment, plus following domain-specific, task-specific preferences, which could come from human feedback and demonstrations and so on. Danijar Hafner, this has been fantastic, as usual. Thanks so much for sharing your time and your insight with the Talk RL audience today, Danijar Hafner. Thanks for having me, Robin. I had a great time and looking forward to the next time.