TalkRL: The Reinforcement Learning Podcast

Danijar Hafner takes us on an odyssey through deep learning & neuroscience, PlaNet, Dreamer, world models, latent dynamics, curious agents, and more!

Show Notes

Danijar Hafner is a PhD student at the University of Toronto, and a student researcher at Google Research, Brain Team and the Vector Institute. He holds a Master of Research from University College London.

Featured References 
  • A deep learning framework for neuroscience
    Blake A. Richards, Timothy P. Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker, Surya Ganguli, Colleen J. Gillon, Danijar Hafner, Adam Kepecs, Nikolaus Kriegeskorte, Peter Latham, Grace W. Lindsay, Kenneth D. Miller, Richard Naud, Christopher C. Pack, Panayiota Poirazi, Pieter Roelfsema, João Sacramento, Andrew Saxe, Benjamin Scellier, Anna C. Schapiro, Walter Senn, Greg Wayne, Daniel Yamins, Friedemann Zenke, Joel Zylberberg, Denis Therien, Konrad P. Kording 
  • Learning Latent Dynamics for Planning from Pixels
    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson 
  • Dream to Control: Learning Behaviors by Latent Imagination
    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi 
  • Planning to Explore via Self-Supervised World Models
    Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak 

Additional References
 
Errata 
  • [Robin] Around 1:37 I say "some ... world models get confused by random noise". I meant "some curiosity formulations", not "world models" 

Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

This is TalkRL Podcast. All reinforcement learning, all the time. Interviews with brilliant
folks from across the world of RL. I'm your host, Robin Chauhan.
Danijar Hafner is a PhD student at the University of Toronto, a student researcher
at Google Brain and the Vector Institute, and holds a master's of research from University
College London. Danijar Hafner, thanks so much for speaking with me.
Hi, Robin. Thanks a lot.
So how do you describe your area of focus?
Yeah, I work in, broadly in artificial intelligence, and that's really where the motivation for
me comes from. Not so much building better applications, but more really understanding
the concepts behind intelligent thinking. And I think machine learning actually gives
us pretty powerful tools that we can use to study at least some of these questions that
we couldn't study directly on a person or in the brain, because it's just so hard to
make measurements. So that motivation led me to machine learning and then more specifically
to reinforcement learning. So a lot of my work is in reinforcement learning, in generative
modeling, learning world models, and exploration.
Can you share with us what your PhD advisors focus on?
Sure. So my main advisor is Jimmy Ba, and I'm also advised by Geoffrey Hinton, and they
both focus on a lot of questions around deep neural networks. So about architectures of
deep neural networks, about optimization, and things you can do with it. So in some
sense, it's great to have an advisor like that, or two advisors like that, because
they have quite broad interests and broad knowledge as well. So I can basically do whatever
I want, and I get good feedback and good advice on those topics.
So the first paper we're going to discuss today, you're a contributing author to a
deep learning framework for neuroscience by Richards et al. So I really don't know anything
about neuroscience. My apologies in advance as I try to follow along with this. But what
is the main idea here?
The reason I think this is a very good paper to start off with is that it really gives
the general framework for us to think about understanding the brain and what it can do
in connection to machine learning. The general idea is that neuroscience has existed for
a really long time, and there's lots of data around, and there are also some theories.
But it's almost at the point where there are lots of small data sets and measurements
that have been made. But we're really, for one, we're limited by the types of experiments
we can run on real subjects, just because it's so hard to look into the brain, basically,
make measurements. There's the skull, and then there's so much going on. It's really
hard to kind of target specific, let's say, neurons that you would want to measure. And
so that's one thing. And the other thing is that there are some kind of general themes
missing. And of course, there are some ideas of general theories that put together all
these experimental results. But it seems like we need some more guiding principles to really
make sense of all of that data and get some frameworks that we can think within. So the
idea of this paper is that we kind of have a similar situation in deep learning, where
we have all these crazy architectures and different loss functions that you can optimize
and different ways to optimize these loss functions. And so this has served us really
well in the deep learning community. There's a loss function, there's a way to optimize
this loss function, and then there's an architectural model to optimize this function. And so in
this paper, we propose this framework as a way to make sense of data in neuroscience.
So how can we draw connections between the two disciplines here?
So this paper talks about these three components, objective functions, which I gather are equivalent
to loss functions, the learning rules, and architectures. Can you say just a
little bit about these three things and maybe contrast how they work in
neuroscience and how we define them in machine learning?
So I'm very much on the machine learning side. And I'm really interested in neuroscience,
but I can speak much better for the machine learning side of things here. And so for example,
let's say you just train some deep neural network on some image classification task.
And so there's some data which often you don't have control over. And then there is an architecture
that would be how many layers you use in your neural network, whether you use any skip connections,
what activation function you want to use, and so on. And then there's a loss function
which in the case of supervised learning is quite simple. It's just maximize the probability
of your supervised outputs that you want the network to predict. But that could be much
more complicated in other scenarios. For example, in unsupervised learning, it's really a field
that's about trying to find out what is a good loss function if you don't know exactly
what you want the model to output precisely. So that's the second component. We have the
architectures, we have the loss functions. And then once you have these two, you've defined
an optimization problem. So find the parameters in this model or in this architecture to optimize
this objective, given some data, and then you have to optimize it. So how do you actually
find the right parameters? And in machine learning, we call that an optimizer, like
stochastic gradient descent or Adam and so on. But in neuroscience, that would be a learning
rule where you write down the dynamics of how do you actually, how do the weights change
from one step to the next, or maybe even continuously over time to make progress on finding better
parameters that maximize the loss function. So you said unsupervised learning is a lot
about figuring out what the loss should be. And that's obviously still an open question.
But would you do you feel like in general, in machine learning, we kind of have these
three things figured out to some degree? That's a really good question. I think we have
really good ways to optimize our networks. So I think the learning rule part is figured
out to at least the level where you can do a lot of things with it. And it's often
not the bottleneck anymore. Of course, there are a lot of people working on developing
better optimizers. And actually, Jimmy works a lot on that as well. And it's
like an interesting field, because when you come up with a better optimizer, then you've
made the lives of thousands of people easier, because now they can all just switch over
to that optimizer, and they will get better results with their machine learning projects.
And that's really the power that comes from a logical framework like this. So the
idea is, if we find good building blocks that we can separate a problem into,
then people can work on them, to some extent, independently of the other building blocks.
So if I want to solve some, if I want to find a better architecture for a specific task,
I don't have to also do research on finding a better optimizer at the same time, or on
finding a better objective function at the same time. So to answer your question, I think
we're in a decent position in terms of the learning rules. I think we're also in a decent
position in terms of the architectures, even though it's probably not as clear yet, just
because it's such a giant design space of how can you build a neural network. One thing
we figured out is that we have a, we have a kind of tool bank of different neural modules
that you can just stack together. And that's a really, really powerful way of thinking
about building an architecture, right? You can have dense layers, fully connected layers
and convolutional layers and attention layers and recurrent layers and so on. You put them
all together and they kind of work in any order more or less. So I think
we can still design much better architectures, especially for specific tasks.
So one big benefit of deep learning is that it kind of applies to everything. Whatever
your prediction problem is, you can use deep learning and you can probably do a pretty
good job at making predictions, but especially when there is very little data, then we have
to be more careful about what architecture we use. And so you basically have to build
priors about the data into the architecture. I think we can still do a much better job there.
For one, for very specific problems, we can find better priors. An example
here is that convolutions work well for images. But then there's still a lot of knowledge
that we intuitively have about natural images that is not captured by a convolutional network.
So, for example, there are objects in the world. And so objects tend to be consistent
in time. They move slowly. It's like some piece of information in my sensory input that
is correlated in space and time. And it can move in time or it can move in space. And
we don't really put these priors into our networks yet. And that's what Jeff has been
working on for a really long time with capsule networks. So there is a spectrum
of how precise you want to tailor something to a task, get really good results on one
task, but then lose some generality. And I think object priors are general enough that
they will be useful for a lot of things. And there are probably some other priors that
we haven't really incorporated well into our architectures yet, like smoothness, for example.
And there is lots of interesting work on Lipschitz neural networks and so on. So I think there's
a very active development on the architecture side. And then to come to the last component
of objectives, I think that's where we have to do the most work and where we're really
early in the process. And so that's what I think is probably the biggest bottleneck of
machine learning and also of understanding intelligent systems better. Finding the right
objective functions, as I said, that's basically, to me, that's basically what unsupervised
learning means as a field at the moment, because some people say, well, it's not really a rigorous,
clearly defined task that you're trying to solve. But to me, that's really the beauty
of it. We don't know yet what is the right mathematical objective that you want to optimize,
so you search for it. And if you find better objective functions, you can learn
better representations, you can describe systems better, and it becomes especially interesting,
not just if you're trying to learn representations, but in the reinforcement learning setting,
where you're not just doing perception, but you're also interacting with the world. And
I think it's not at all clear yet what our agents should optimize for if there are no
rewards around.
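
A minimal sketch of the three components Danijar describes, in a supervised setting (assuming PyTorch; the layer sizes and task are placeholders):

```python
# Three separable components: architecture, objective function, learning rule.
# Illustrative sketch only; assumes PyTorch and a generic image-classification task.
import torch
import torch.nn as nn

# 1. Architecture: how the network is wired (layers, skip connections, activations).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# 2. Objective function: what counts as good; here, maximize label likelihood.
loss_fn = nn.CrossEntropyLoss()

# 3. Learning rule: how the weights change to improve the objective.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(images, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)  # evaluate the objective
    loss.backward()                        # gradients with respect to the weights
    optimizer.step()                       # apply the learning rule
    return loss.item()
```

Swapping any one of the three, a different architecture, a different loss, or a different optimizer, leaves the other two untouched, which is the factorization being discussed.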
That's super interesting. And I've always thought of the deep learning architectures
as very well factored. As you say, we have all these libraries of layers that we can
just drop in. But you help me appreciate the extent to which the other components are also
well factored, which is, I think, a great insight. So for the brain, do we have any idea if we
should expect to find a single type of objective function and a single learning rule? Or could
we imagine there could be many different types of objective functions and learning rules
in different parts of the brain? Is that still a completely open question?
That's a really good question. The theoretical answer is that it doesn't really matter. So
yes, for any system, any system that you can observe, there exists theoretically exactly
one objective function that describes all the behavior of that system. Actually, that's not
quite true: it describes all the behavior of that system that can be described through
an objective function. So in the paper, we talk a bit about this. And it's basically
the idea of the fundamental theorem of vector calculus or the Helmholtz decomposition. And
so the idea is the following. Let's say you're describing a system. It could be a neural
network where the weights change over time in some gigantic space of all the possible
combinations or configurations of the weight vector. Or it could be a very simple system
like a thermostat that just has a sensor and then controls the heating unit. Or it could
be even more complex than a neural network or deep neural network like a system like
the brain. And so all these systems you can view from like a dynamical systems perspective.
There's some state space and every point in that space describes possible configuration
of the system. And at the current time, it's at one point and then it kind of moves around
over time as the brain is doing its computing or as the thermostat is reading new temperature
values and storing some new internal state about them, like a moving average maybe. And
as the weights of our deep neural networks change with every optimization step, they
also move around in the state space. And so when you describe a system like from this
angle, you can view it as a vector field in the state space. If your state description
is complete, then from every state, there is a direction in which you go to get to the
next state. And if you couple that with the external system, then
you really have like a closed system where everything is captured
by your description and basically everything becomes more or less predictable. And for
every point in the configuration space, there is a direction and that gives you the next
point in configuration space. And so when you describe systems like this, you can actually,
you get a vector field. Every point in state space is a direction. That's the vector field.
And you can decompose it into two simpler vector fields. And that works in any case, except
for some maybe degeneracies that are just of theoretical interest. And you can decompose
it into one part that is optimizing something and one part that's not optimizing anything.
So think of the configuration space again, and now plot the heat map over it, which is
the objective function. So some points in weight space give you better value, mean that
your neural network is better at predicting the labels, let's say. And some points mean
that the neural network is worse at predicting the labels. And we can write down our own
cost function there, and then we can implement our own learning rules so that we end up with
the system that seeks out the better regions in the configuration space. But we can use
the same mental picture to describe an existing system that we don't change anymore, that
we don't have control over the dynamics. And so there is still this potential function
or energy function or cost function. Those are all the same things, but just different
fields call them differently. And so when you look at the system, you can wait for a
while, you can observe it, and it's moving around in this configuration space. And it
will be more often in some places and less often in other places. And from that you can
derive a cost function. So what kind of cost function is this system optimizing for? Well,
you just look at what it's doing, and over time you will get an idea of what parts of
the state space it likes and what parts it tries to avoid. And then that's your cost
function. It's just the stationary distribution, the visitation frequency, basically. And so
once you have the visitation frequency of a system, you can describe all of its optimizing
behavior. So you can say, now that I have the cost function, maybe a very simple example
is a person, maybe you have a daily routine and you can be in different rooms of your
house and you can be at work, maybe not at the moment, but at least there are different
rooms at home that you can switch between. And there is some probability of going from
that room to the other room and so on. And if you observe somebody for a while, or maybe
you write down every day what room you've been in for how long, then you get this
kind of cost function that describes you. It's like, oh, the living room is the best,
for example. You spend the most time there. And so once you have this cost function, you
can describe the dynamics. If you give me the cost function, I can basically reverse
engineer you to some extent, too, based on what state space you chose. It's probably
not like the state space always uses some abstraction because you can't go to the kind
of particular level. But let's say it's different rooms and then I can build something that
also seeks out the same, seeks out the rooms with the same preference distribution. Okay.
So that's the optimizing part. And then there is a part to every system that is independent
of the stationary distribution. Well, it's orthogonal to the gradient on the stationary distribution.
So if you give me the distribution over rooms, I can build a new agent that follows the gradient
on this preference distribution, always tries to go towards what is better under the cost
function. But then there are maybe some external perturbations that keep it away from there.
So it has to kind of keep going towards the optimum. But then there's also potentially
a direction that doesn't change the probability. And so that's the direction that's orthogonal
to the gradient on the cost function. So if you think of the cost function as a surface,
like as a hill surface over your configuration space, then you can either go up or you can
walk around the contour lines of this cost function. And so that's the difference between
the divergence part of the vector field that goes up on the cost function and it tries
to concentrate on the optimal points. I guess if it's a cost function, it goes down. If
it's an objective function to maximize, it goes up. And then there's the curl part that
just walks around contour lines. And so it's never optimizing for anything. It always cycles
back after a long time. And so this is all to explain why when you're talking about
something as an optimization problem or you're describing maybe a natural agent trying to
describe intelligence as an optimization, then you will lose this part that doesn't
optimize anything. You will not be able to describe that part. And that's probably fine.
Like maybe we have evolved to be quite efficient. And so maybe we don't do a lot of unnecessary
things that don't actually optimize any objective function. But who knows? Maybe that's on some
level of abstraction that you choose to describe the system. Maybe that's really important
to get something that shows the behaviors that we think of as maybe connected to intelligence.
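
The decomposition being described can be summarized roughly as follows (a sketch of the idea rather than the paper's exact statement): the dynamics over the configuration space split into a part that descends a potential and a part that moves along its level sets.

```latex
\dot{x} \;=\; \underbrace{-\,\nabla E(x)}_{\text{optimizing part}} \;+\; \underbrace{r(x)}_{\text{non-optimizing part}},
\qquad r(x)\cdot\nabla E(x) = 0 .
```

Here $E$ plays the role of the cost function recovered from the stationary (visitation) distribution, and $r$ is the curl-like component that cycles along contour lines without ever changing $E$.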
So is this paper saying that we should look for the components similar to those that we
use in deep learning in the brain? And then maybe vice versa, figure out how to adjust
deep learning to more closely match what we see in brains, so we can use
deep learning to understand brains? Is that close to the message?
Yeah, yeah. So it goes in that direction. I don't think machine learning and neuroscience
have to converge to one thing. We can use different models in machine learning than
the models that might be useful for explaining the brain because there are biological constraints
on the brain. And it's interesting to understand them and understand what kind of ways nature
found around those. But just conceptually speaking, the best models for the type of
computer hardware that we have are probably different. So if your goal is to build an
algorithm that's very good at predicting labels on some data set, then probably like the very
long term solution will be different from the biological solution. Now, that said, at
the moment, we're still quite far away from getting anything close to the intelligence
of a brain, of course. And so I think neuroscience has a lot of potential for helping us with
building better models in machine learning. But the goal doesn't
have to be to end up in the same place for both disciplines. Although I think that
would be interesting. But that's not necessary. And what the paper is saying is we should
use the same framework to break down the problem. And that will help us share insights in both
directions. And as I said earlier, it's really difficult to make measurements in the brain.
And there are a couple of papers from the last few years, where people have studied
deep learning models in a similar way, in terms of analyzing the activations, the way a neuroscientist
would study the brain and found that there are actually really surprisingly strong connections
between how a deep neural network processes some input to solve a prediction task and
what the activations in the brain look like when it tries to solve the same prediction task.
And so there is definitely exchange in both directions. And I think both disciplines can
learn from the other and use tools from there. Because on the other hand, we also have no
idea really how deep neural networks work and why they work. And so maybe some ideas
from neuroscience would help there. And I think the reason you can find these similarities
between models in machine learning and measurements in the brain is that even though the models
are very different in some way, both systems are still trying to solve the same task. And
a lot of the computation needed to solve a task is actually
more about your input data than the architecture you're using to process it. So that's why
I think, I mean, nobody really knows, but my intuition is that probably there are some
constraints on computation in general, on what information do you need to extract from
your input so that later on you can solve a task.
Do you have any comments on how the insights of this paper might relate to reinforcement
learning more specifically than learning in general? This wasn't an RL paper, right?
It was not an RL paper. For me, the biggest takeaway of this kind of perspective
on understanding intelligence, for reinforcement learning,
is that we have to think a lot about what objective functions we should use in reinforcement
learning. Because, I mean, it's always been bothering me that we have a reward signal
that comes from the environment. And that's actually not how reinforcement learning used
to be defined in some of the earlier work, where you would usually, you know, there are
some early papers on the question of where rewards come from. And the way to think about
it really is that there's an environment that gives you sensory inputs, you give it actions,
and it doesn't care about what you're doing, right? Like, why would the environment care?
And then there's an agent, and that agent, you can choose to break that agent down into
two components. And one component gives you the reward as a function of, you know, the
past sequence of inputs and the past sequence of actions. And then there is another component
that tries to maximize this reward. And so that's the kind of classical reinforcement
learning component, where maybe you learn a value function or, you know, there are many
things that you could be doing. And so I think we haven't really spent a lot of time yet
or enough time to understand the first component where actually the reward is being generated.
And if you want to build something that is more intelligent, or closer to maybe an intelligent
being, than the current agents we use in reinforcement learning, then we have to make progress on
that part because there's not really a reward function in the world. There are some candidates
that we can think of maybe, you know, optimizing for survival is good. But then that doesn't
really give you a good idea of the system I want to understand. So I think this optimizing
for survival in some world with like a giant simulation, like an artificial life approach
to building intelligence might work to build something. Like, I mean, we're quite far away
from that. But in principle, it could work. And it might be easier to study the
resulting system than to study a biological system. But it doesn't really answer the question
of how it's doing that. And maybe you don't care about that, you just want to build something
that replicates some aspects of behavior that we see in people. But to me, I actually want
to know what are the components that we're optimizing for, like within one lifetime.
And to get that additional insight, we have to try out different objective functions,
different implementations of this module one in the agent that provides the objective function
to the optimization component. And we have to try them out. And we have to do it in an
environment that probably has to be very complex. And then we can look at the behavior, and
we can see if that's similar in some way to the behaviors we're trying to replicate. And
we're very general, like, people are very general in the sense that there are many different
environments in which we can do something. And so the objective function should also
be general in the sense that it doesn't depend on something like underlying environment state,
like if you want me to move the glass from one side of the table to the other, then maybe
if you have a physics simulator and you know the object ID of the glass and so on, you
can compute a square distance between the position and the goal position. But that's
not the sensory input that the agent gets. And so that's not available if you want a
general implementation of the first component of the agent. So it has to be something that's
only a function of the sensory inputs and past actions and still accounts for interesting
behavior across many different environments.
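
A rough sketch of the two-component agent Danijar describes: a module that generates the reward from sensory inputs and past actions, and a module that maximizes it (the class names and the simple prediction-error bonus are illustrative, not from any of the papers):

```python
# Sketch: an agent split into (1) a component that produces the objective from
# the history of inputs and actions, and (2) a component that maximizes it.
# All names and the prediction-error bonus are illustrative placeholders.

class IntrinsicReward:
    """Component 1: turns past inputs and actions into a reward signal."""
    def __init__(self, dynamics_model):
        self.model = dynamics_model            # assumed to predict the next input

    def __call__(self, history, action, next_obs):
        predicted = self.model.predict(history, action)
        # Example choice: reward surprise, i.e. how badly the agent predicted
        # its own sensory input. Other choices define other kinds of agents.
        return ((predicted - next_obs) ** 2).mean()

class Agent:
    """Component 2 wraps a standard RL learner around the generated reward."""
    def __init__(self, reward_module, rl_learner):
        self.reward_module = reward_module     # generates the objective
        self.rl_learner = rl_learner           # e.g. a value or policy-gradient learner

    def observe(self, history, action, next_obs):
        reward = self.reward_module(history, action, next_obs)
        self.rl_learner.update(history, action, reward, next_obs)
```

Note that nothing here asks the environment for a reward; the objective is a function only of the agent's own inputs and actions, which is the generality requirement described above.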
So are you pointing to intrinsic motivation as a key here?
Yes, yes. That's what the field is often called. And for intrinsic motivation, I think there
are many different ways of how to really evaluate it. And it's very difficult.
And I think it's a good challenge to make progress on. And there are parts of intrinsic
motivation where you're basically trying to be better at solving a particular task. And
so maybe you sum up the intrinsic reward with the extrinsic reward, and you get something
that makes faster learning progress on the task than without the intrinsic motivation.
Another evaluation setting that I really like, that I think we'll come to a bit later in the
podcast is that you explore without any task in mind. And then maybe you can use the data
set that results from that to later on train a new agent on it to solve specific tasks.
You can see how useful this exploration was.
So now let's turn to a set of four papers that are tightly related with each other,
starting with PlaNet, that's Learning Latent Dynamics for Planning from Pixels. Can you
tell us what was the main idea of this PlaNet paper?
The main idea was to learn a dynamics model of the environment that's accurate enough
that you can do reinforcement learning with it. And people have been trying to get model-based
RL to work in various instantiations for a long time. And there has been lots of progress
as well. But it was really almost like a bottleneck where it kind of worked on simple tasks, but
then it didn't really work on harder tasks. And so in practice, people were still using
model-free methods most of the time, even though model-based methods are appealing in
different ways, because for one, it's kind of like a really intuitive thing that you
have a model of the world that lets you predict into the future. I mean, we know that people
can do that. So probably our agents should as well. But then having a world model also
lets you do a lot of things that you couldn't do with a model-free agent. So it's almost
this backlog of research ideas in my head and other people's heads that were blocked
by not having accurate enough world models to implement them. And so that was really
the goal, because I wanted to work on intrinsic motivation. Yeah, you can do better exploration
if you have a world model. And I think we'll talk about this when we get to the disagreement
paper about the retrospective versus expected exploration.
And so I knew that to do that, I really needed world models to work on some tasks, some kind
of tasks that I would be happy with, with high dimensional inputs and so on. And that's
why I started working on learning dynamics models from pixels.
So that's so interesting. So you really are planning multiple papers ahead for where you
want to get to, being strategic about it.
Yes. And maybe not so much a chain of papers, but I always had this goal of building autonomous
agents with intrinsic motivation. And then whenever I start a new project, I reflect
on that and think about what is the limitation? Like, can we do this now? Or is there anything
that's still necessary to solve before we can build that? And it was almost a bit frustrating
in the beginning when I started my masters in London, that I wanted to do this like active
exploration. But there was just no, no accurate dynamics model that I could use for it. And
then people told me, you know, yeah, we all know that this would be cool to have, but
we've been trying for a really long time and it just doesn't work and we don't really know
why. And I thought, okay, well, you know, I'll try. And Tim was
really helpful, Timothy Lillicrap, when he advised the project and my manager at Google
at the time James Davidson was very helpful as well. And we just went through it quite
systematically and we kind of tried for a really long time and eventually it worked.
And I think there isn't even a single thing that I could point to as
the point where it suddenly started to work. I mean, those were
mostly bugs in the implementation where, oh, you know, we normalize the input
twice, and then during evaluation you do a different normalization
than during training, and of course your model doesn't make good predictions. So, mainly,
we had a pretty clear idea of what we wanted to do. I wanted to build this latent dynamics
model because I think a lot of RL work is with low dimensional inputs. It's a bit
too toy. I actually don't even read those papers anymore in most cases. And you can
do quite well with random search and so on. So to me there needs to be some high
dimensional input where representation learning is part of the challenge. And then if
you predict forward, it doesn't really make sense to do that in pixel space from one image
to the next because that gets very expensive and errors can accumulate very quickly. And
it's definitely not what we do. Like when I plan my day, I don't plan
how the activations on my retina change hours from now. It's all in an
abstract space, and it both abstracts in space, into concepts
and so on, and it also abstracts in time. So far we focused on the first aspect
and we're trying to, we're also doing some work on the temporal abstraction, but I think
that's still quite unsolved. Yeah. So at the end we had this kind of clear picture of what
we wanted to do and we didn't actually deviate much from it throughout the project. We just
visualized a lot of metrics and tried to really understand what was going on. And then
we found a lot of bugs that we fixed over time. And then at the end it just worked and
we were quite surprised. So that must've been really satisfying. You worked on this for
a year. And the first thing that jumped out at me from this paper was the efficiency gain.
There was a line that said the data efficiency gain of PlaNet over D4PG was a factor
of 250 times, which is just huge. So was that surprising to you? I guess you'd
been working on it for a year. So by that time you're used to it, but did you expect
anything of that level when you went into this? To be honest, I didn't really care about
data efficiency at all because I just needed a world model to do exploration with. I didn't
really care about it being so much more data efficient, but it turned out to be more data
efficient. And, and of course that's useful for a lot of applications. Like if you want
to use world models for robotics, for example, where environment steps are much more expensive
than in simulation. Then it really matters. So of course we put it in the paper. But it
didn't actually matter to me, and it still doesn't. To add to this,
I think there are multiple reasons it is more data efficient. We don't exactly
know how much each of them contributes. But one reason is that a lot of model free methods
just don't use any specific representation learning. They learn representations just
through the reinforcement learning loss for maybe value learning or policy gradients.
And so the only signal that goes in is like the action that you chose and the reward that
you got. If you think about it, let's say I throw you into an
unknown environment and you have to do well in that environment in some way, maybe try to get food, or there are
specific tasks you want to solve. And if you just imagine that everything
you would learn about this world would just come from the reward and the actions you chose.
That's just insane. That means like, I'm not trying to find any correlations in my input,
for example. I'm not trying to explain what I'm seeing. More mathematically
speaking, there's a lot of information in the images or in the sensory inputs that you
get about the environment. And so you should use that in some explicit way using representation
learning, I think. And it's, this can be quite separate, actually, from the RL algorithm.
So there are a lot of application papers showing that
you can have your RL agent, and then in addition to the policy gradient loss, you
just have a reconstruction loss from maybe some representation higher up
within the network, where you just try to reconstruct your input. And that really helps a lot, even
though it's a very simple thing to do, especially when you have high dimensional input. And
so it's, I think it's perfectly fine to do research on representation learning for control
and like core RL separately. But if you want something that's data efficient, you should
definitely make use of your inputs in some way. And to add to that, the same is true
for world models as well, if you have a specific task, because in principle, you only need
to make accurate predictions of future rewards. And that's enough to get maximum performance
on the task. So in principle, you don't even need to reconstruct your inputs in the world
model. It's just that then you're back to only learning from a very limited learning
signal. And I think there is still some benefit in learning a world model, even without any
explicit representation learning. Because in addition to the representation learning,
you still incorporate some other useful priors into the world model, such that, for example,
that there is a compact activation vector that explains the state of the world
at one point in time. That's a useful prior, right? It means that we have this high
dimensional input. And for the agent, that's this gigantic pixel grid. And it means that
there's a much smaller representation that has to describe everything that the
agent needs to know about the input. And then if you have a dynamics
model, then there needs to be a function that takes this description of one point in time
to the description of the next point in time. And then that has to be enough to predict
the good action at that point in time, or predict the value of the reward. And so this
idea of a hidden Markov model structure is also useful. It's a useful prior. I don't
know exactly how much the representation learning contributes to the data efficiency compared
to just learning a latent-space, compact representation, not of the environment state, but
of the sequence of past inputs to the agent. But for example, that's what MuZero does.
It's not learning a global world model, where the agent learns everything about its inputs.
It's just learning what is necessary to solve the specific task, because all the learning
signal comes from the reward and the value and the policy gradients. But you're
still incorporating at least this one prior of having a compact representation.
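
A minimal sketch of the auxiliary reconstruction idea mentioned above (assuming PyTorch; the encoder, policy head, decoder, and weighting are placeholders, and the policy-gradient term is deliberately simplified):

```python
# Sketch: adding an explicit representation-learning signal to an RL agent by
# reconstructing the observation from an internal feature vector.
# Assumes PyTorch; `encoder`, `policy_head`, and `decoder` are placeholder modules.
import torch
import torch.nn.functional as F

def total_loss(obs, actions, returns, encoder, policy_head, decoder, recon_weight=1.0):
    features = encoder(obs)                       # shared representation
    logits = policy_head(features)                # used by the RL objective
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    rl_loss = -(chosen * returns).mean()          # simple policy-gradient term

    recon = decoder(features)                     # reconstruct the input
    recon_loss = F.mse_loss(recon, obs)           # extra learning signal from pixels

    return rl_loss + recon_weight * recon_loss
```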
So in the PlaNet paper, I think you separate the stochastic and deterministic
components of the state. And can you help us understand why you want to separate those
and then how that separation works?
Yes. So we, when we came up with the model, we basically just tried random things and
we had no idea what we were doing. And this particular combination seemed to work well.
And so afterwards, we tried a lot of other designs and they did not work. And I think
by now I have a bit of a better understanding. Of course, we had some hypotheses of why maybe
the stochastic part helps and the deterministic part helps. But then later on, doing other projects building
on top of this model, we got some more insights of why this might be a particularly
useful way of designing the latent transition function.
And so one point is that if you want a latent dynamics model where, given
the sequence of states, you can predict all the images individually, so there's no skip
connection from one image to the next, let's say, then your sequence of latent states
has to be stochastic in an environment where the agent can't make deterministic predictions.
So that could be either because maybe there's actually noise injected in the simulator in
how the simulator works, or it could be because the agent doesn't know everything about the
world. So it's a partially observable environment and that makes it stochastic from the perspective
of the agent. And so to predict multiple possible futures, you need stochasticity in
your latent state sequence. But if you make it fully stochastic, then you get a typical
state-space model where the hidden state at one step is just the, let's say a Gaussian
where the mean is predicted through some neural network from the last state and the last action.
And the variance is also predicted by the neural network. Then there's a lot of noise
during training. And that noise, technically speaking, adds information
to your state at every time step, but it's not information about the environment. It's
not useful information, and it kind of hides the information that the model has extracted
already. So if you think about it, maybe the agent has seen some images and then it has inferred
the position of objects and put that into the latent state. And now you predict forward
for five time steps, but at every time step you're adding noise to the state, then it
becomes really hard for the model for the agent to preserve information over multiple
time steps. It's just erased after a couple of steps.
And here you're talking about the conditional VAE formulation, is that right?
What is the conditional VAE formulation?
Sorry, I meant, when you're talking about a stochastic model like you are right now,
are you speaking about like a VAE?
Yes. So it's a latent variable model, the way a VAE is a latent variable model.
And we train it the same way a VAE is being trained. So it's the same ELBO objective
function or free energy objective function.
But you don't call it a VAE.
And it has a lot of similarities. So you could see it as a very kind of specific
case of a VAE where instead of having one kind of fixed size representation as your
latent variable, you instead have a sequence, a Markov chain of latent variables. And then
your data is also a sequence of images rather than a single image. So you can think of it
as a sequential VAE.
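
For reference, the generative model being described factorizes over time roughly as (following PlaNet-style notation with latent states $s_t$, actions $a_t$, and image observations $o_t$):

```latex
p(o_{1:T}, s_{1:T} \mid a_{1:T}) \;=\; \prod_{t=1}^{T} p(s_t \mid s_{t-1}, a_{t-1})\; p(o_t \mid s_t)
```

and it is trained like a VAE, with a reconstruction term and a KL term per time step rather than a single code per image.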
So you were describing how the stochastic component cannot capture all the information.
And so that's why you need the deterministic component as well.
So theoretically speaking, the fully stochastic model is general.
So it could learn to set the variance to close to zero for some of the state components.
And that way it would preserve information over many time steps without it getting erased
by noise.
It's just hard to learn. And you don't really get good gradients for learning that because
the optimization process is so noisy. And so you would basically end up with a model
that doesn't learn long-term dependencies in the data well.
And so having a deterministic component is in principle just like setting the variances
to zero for some of the stochastic components in the state, so that you put in the prior
that there are some things that should be preserved over a long time.
So is the idea that in certain areas of the environment, things could be fully or more
so deterministic or more so stochastic? Like do these two components kind of become more
influential or less in certain areas as appropriate?
That's an interesting question. So I think that's basically the same question. But I
like to not think about the implementation of the environment.
So this comes up for exploration as well. But in this case, whether the environment
is more stochastic or less stochastic in some states, it doesn't matter. What matters is
whether it's more or less predictable for the agent. Because the agent doesn't really
know more about the environment than the sequence of its inputs. And it can't make more sense
of them than what its model architecture allows.
So more stochastic, practically what it actually means is that the agent can't model it well.
The agent doesn't know exactly what's going to happen; many possible things
could happen. And that could be because we inject pseudo-random noise into the simulation.
Or it could be just because there are so many visual details. The model is too small to
really make an accurate prediction for some of the more complex parts of the world.
And now to answer your question, the way I think about this latent model now with the
stochastic and the deterministic part is that there is another big benefit of having a stochastic
part. And it's not so much about stochasticity in the data. But it's more about allowing
you to control how much information goes into the deterministic state.
So you can think of this as a deterministic model where at every time step you have a stochastic
variable that lets you add information about the current image into the model. And there's
a KL regularizer that encourages the model to not incorporate that much new information
into the hidden state. But you're still training it to reconstruct all the images. So what
this reconstruction error does together with the KL regularizer is when you want to reconstruct
the image from some particular state, then the model is allowed to look at the image
through the stochastic bottleneck. But it's encouraged not to because of the KL regularizer.
So instead, it would look at all the input information that it has already extracted
from past time steps, because there's no KL regularizer for those. There is, but it already
paid for it. So the model is better off using the deterministic path to look back on time
to get the information from there, as long as it's something that can be predicted from
the past. And I think that encourages the model to learn long term dependencies.
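
Written out, the objective being described is roughly the per-time-step ELBO (a sketch; the papers' exact form also includes a reward predictor):

```latex
\mathcal{L} \;=\; \sum_{t=1}^{T} \Big(
\mathbb{E}_{q}\big[\ln p(o_t \mid s_t)\big]
\;-\; \mathrm{KL}\big[\, q(s_t \mid s_{t-1}, a_{t-1}, o_t) \,\big\|\, p(s_t \mid s_{t-1}, a_{t-1}) \,\big]
\Big)
```

The reconstruction term lets information in through the stochastic bottleneck; the KL term charges for how much new information each step takes in, which is the pressure toward using the deterministic path described here.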
Okay, so maybe I'm misunderstanding a little bit here. But is this model not Markovian?
Does it not look back at only the one step previous state? Or you're saying it's looking
back in time implicitly through the deterministic latent? Is that what you're saying?
Yes, yes, exactly. It's good that you're bringing up this point because
there are different ways to think about the stochastic and deterministic parts in
the model. You can either think of it as a Markovian model, where just some elements
in the state are not stochastic, right? And then your state is basically the concatenation
of deterministic and stochastic state at every time step. Or you can think of it as a non
Markovian model of only the stochastic state. So if you kind of leave the
deterministic part out of your model description, like when you write down a probabilistic
graphical model, and you only write down the stochastic sequence of states, then this deterministic
RNN actually lets the stochastic state at some time step t depend on all the past stochastic
states through this deterministic kind of shortcut. So those are both
valid views. You can say it's a non-Markovian stochastic model, or you could say it's a Markovian
hybrid stochastic-deterministic model. But the second perspective is useful for the implementation
because it means that when you observe a new image, you don't have to go back in time.
You only need the last stochastic and deterministic state and the new image to compute the next
stochastic and deterministic state.
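
A minimal sketch of what one step of such a hybrid (RSSM-style) transition could look like; this is illustrative PyTorch-style code with made-up layer names and sizes, not the released implementation:

```python
# Sketch of one step of a hybrid deterministic/stochastic latent transition.
# Assumes PyTorch; all module names and sizes are illustrative placeholders.
import torch
import torch.nn as nn

class LatentStep(nn.Module):
    def __init__(self, stoch=30, deter=200, action_dim=4, embed_dim=1024):
        super().__init__()
        self.rnn = nn.GRUCell(stoch + action_dim, deter)          # deterministic path
        self.prior_net = nn.Linear(deter, 2 * stoch)               # belief without the image
        self.post_net = nn.Linear(deter + embed_dim, 2 * stoch)    # belief with the image

    def forward(self, prev_stoch, prev_deter, action, obs_embed=None):
        # Deterministic state carries information forward without added noise.
        deter = self.rnn(torch.cat([prev_stoch, action], -1), prev_deter)
        # Prior: belief over the new stochastic state before seeing the image.
        prior_mean, prior_std = self.prior_net(deter).chunk(2, -1)
        if obs_embed is None:            # imagination: no observation available
            mean, std = prior_mean, prior_std
        else:                            # posterior: update the belief with the image
            mean, std = self.post_net(torch.cat([deter, obs_embed], -1)).chunk(2, -1)
        std = torch.nn.functional.softplus(std) + 0.1
        stoch = mean + std * torch.randn_like(std)   # reparameterized sample
        return stoch, deter, (prior_mean, prior_std), (mean, std)
```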
So I was looking a little bit at the code for the RSSM component. And there was a comment
saying that if an observation is present, the posterior latent is computed from both
the hidden state and the observation. So does that mean that when it's imagining
the future, the observation is not available?
Is that what that line means?
Yes, yes, exactly. So you can think of this as the prior and the approximate posterior
in a VAE. Well, the prior and the encoder in a VAE, they both give you distribution
over the latent variable. They are both the belief over the code. But one is a more accurate
belief because it got some context information, in this case, the whole image.
So one is the prior one is the posterior or approximate posterior. And this principle
is more general than that. You could have additional context information. So you could
have the whole context, like, you know, just give it the whole image as
you do in a VAE to try to get the most accurate belief. But you could give it like some information
as well, you could either give it part of the image, like a patch, maybe, or you could
give it some additional kind of context information about the image, like a label, like a class
label for the image. And, you know, what's the belief over the code if I only know it's
a dog? That's going to be a narrower distribution than the prior belief
that doesn't know any context, but it's still going to be a wider distribution than the belief
I get when I condition on the whole image. And so, in a temporal model, something similar
happens where the prior belief over the code at some time step t, there are multiple beliefs
you could have over that, right? If you don't know anything, then that could just be standard
Gaussian, let's say. But in RL, or in the sequence model in general, there is a lot
of context, you know, and that context is basically all the past inputs, but just not
the current one, and of course, not the future ones yet. And so that's the prior
that you need to use, at least when you just write down the standard ELBO objective:
the prior over the code at time step t, the distribution, the belief that doesn't depend
on the current image, should still have access to all the past images. And another way to
view this as a Kalman filter, because basically the model is just a nonlinear, learned Kalman
filter. So, so in a Kalman filter, you also have this temporal prior, which is called
the prediction step that tries to predict the hidden variables without knowing the current
image. And then there's an update step that takes this prior belief, this temporal prior
belief, and updates it to a more precise distribution by looking at the new input, by looking at
the new image. And so we do the same in a sequential VAE.
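
For comparison, the linear-Gaussian special case of this predict/update cycle is the standard Kalman filter (shown only as a reference point; the learned model replaces these linear maps with neural networks):

```latex
\text{Predict:}\quad \hat{x}_{t\mid t-1} = A\,\hat{x}_{t-1\mid t-1} + B\,a_{t-1},
\qquad P_{t\mid t-1} = A\,P_{t-1\mid t-1}A^{\top} + Q
\\[4pt]
\text{Update:}\quad K_t = P_{t\mid t-1}C^{\top}\big(C\,P_{t\mid t-1}C^{\top} + R\big)^{-1},
\qquad \hat{x}_{t\mid t} = \hat{x}_{t\mid t-1} + K_t\big(o_t - C\,\hat{x}_{t\mid t-1}\big)
```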
So is the model aware that when it's imagining future time steps, that it's less certain
about those? In some sense?
Yes, yes. So those are two neural network components, you actually have to learn two
transition functions. One, where you give it the past latent state and
the past action, and you train it to predict the distribution over the next state. And then
another one, where you give it the past state and the past action and the current image,
and then try to predict another distribution. And that will be more precise, a narrower
distribution. And it actually is, when you look at the entropy, because it has
access to more context information. And the way those two are trained
is that during training, you always use the one that can see the data. But the KL regularizer
is from that second distribution to the first. So to the prior transition function. And so
that trains the prior transition function to basically try and predict what the posterior,
the better belief is going to be. But without seeing the new image, so won't be able to
do a perfect job, unless the sequence of inputs is fully deterministic. And this
KL regularizer is actually the only loss term that trains
the prior transition function. And the prior transition function is what you use for forward
imagination when you're just planning into the future, but you don't know the actual
inputs for those time steps. And at the same time, the KL regularizer regularizes the posterior
belief, saying that even though you got to look at the image, don't be overconfident.
Try to still be close to what you would have predicted without seeing this data point.
Try to still be close to the temporal prior.
Can you talk about what range of environments this type of approach is best suited for?
Or the limits on what environments this could be applied to? Well, does it have something
to do with how much stochasticity they have? Or, I mean, it seems like the environments
you use really have a large-dimensional pixel state space. But is that the main area
where this method is useful? Or does it go beyond that?
Yes. So I think the approach is generally useful for a lot of reinforcement learning
setups. There are some applications of reinforcement learning where you don't really have an agent
in that sense, but you're just trying to solve some discrete optimization problem or some
black box optimization problem where you don't get gradients. So in those cases, when
you're trying to, I don't know, maybe predict the proof for a mathematical
problem, I haven't really thought about those problems. But when you
have an agent in an environment, and especially if the environment is partially observed,
so you have to integrate information over time. So for example, an image won't tell
you velocities of objects, it just tells you positions. And then if the field of view is
limited because you're only looking in one direction and you don't see the object in
the other direction, you also have to integrate information over time. And so then this is
a very useful, very useful general approach because you're making use of the factorization
of a partially observable environment. So in some sense, the latent states that you're
learning can be thought of as a replacement of the hidden states of the environment that
the agent doesn't have access to. Now, this is important. The latent states learned by
the agent are not an approximation of the environment state, right? There's no reason
whatsoever to believe that they will become similar in value to whatever the environment
state is. But they are an alternative representation that if the model is trained well, also explains
the same sequence of observations given the same sequence of actions. So it's like an
alternative implementation of the environment if you want. And so that's really powerful
because now you've got a Markov system. So once you have this representation, then you
can even make predictions into the future given actions, you don't need a recurrent
policy anymore. The state is already sufficient. And I think your question also hinted a bit
in the direction of, could we do this for low dimensional inputs, like more typical
for these MuJoCo tasks? And the answer is yes, we have tried that at some point, and
it does work. And it is a bit faster than learning from pixels, but actually not that
much. Yeah, and it works well. And I think Brandon Amos had a paper on differentiable
model predictive control where he does that, and also found that it worked quite well.
But yeah, we had one project where we tried it on low dimensional
states and it worked, but it didn't go anywhere. So yeah, I'm more interested in the pixel
space. And right now I'm trying to scale up these models to more complex environments.
Some of that we had in the follow-up paper for Dreamer.
All right, let's turn to another recent paper of yours, Dream to Control: Learning Behaviors
by Latent Imagination. We got to hear you describe this paper at our December
NeurIPS episode. Can you remind us of the main idea of this paper?
Sure. So one limitation that PlaNet has is, it does learn a quite powerful world
model, but it doesn't make use of it in the most efficient way to derive behaviors. PlaNet
uses an online search at every time step when it interacts with the environment. And
that can be really expensive because you predict forward many trajectories
and then you select the one action that you like the best and you execute it, and you throw
away all this effort and do another search at the next time step.
And so collecting data becomes quite expensive.
So it's doing model predictive control.
Exactly. Yeah. And the second limitation is that in the original PlaNet agent, we don't
learn a value function and there is no temporal abstraction. And that means the agent is only
going to consider rewards within the planning horizon. And you can't increase the planning
horizon infinitely because, for one, eventually your model is going
to make less accurate predictions. But also if you're searching for a longer plan, it's
going to take you longer to find a good plan because the search space got so much bigger.
There's so much more longer plans than there are shorter plans. So, so it's not really
computationally tractable to consider very far delayed rewards that are like a hundred
to hundred time steps into the future. And that's a one way that I thought initially
you could get around that is through temporal abstraction. And I still think that's really
the long-term way to go. But there, we have value functions in reinforcement learning
and they work quite well. So for now we can solve it that way. And so Dreamer is really
a follow-up on planet where we use the same dynamics model, the same world model, but
we're using it in a more clever way to learn how to predict good actions. And there is
a substantial increase in computational performance. So we went down from maybe one day for a million
time steps to like four to five hours. And there's a substantial improvement in the horizon
of how many future rewards the agent considers. And so that leads to much higher empirical
performance as well. And the way we do that is we throw away the model predictive control
part. And instead we have a neural network to predict actions, an actor network, that takes the latent state of the world model and predicts a distribution over the action that hopefully
is best for this state. And then we have a second neural network in the latent space
which predicts the value, the expected sum of future rewards with some discount factor
that the current actor network is thought to achieve from this particular state that
is input to the value network. So with the value function and the actor, you can do an
efficient actor-critic algorithm in latent space. And you can train that from model predictions
independently of the data collection. So you don't have to do any online planning anymore.
Once you have a good actor to collect data, you just run the world model at every step
to get the latent state or to update the latent state from the last step to the next one to
incorporate the new input. And then you just send that to the actor and predict an action
and execute that. And so all the model predictions, or planning if you still want to call it planning, happen offline, independently of the current episode. So in principle you could also distribute
this and run it asynchronously very efficiently. And the way you learn these two components
now, like one thing you could do is you have a world model, you know, it basically defines
a new RL problem. It's an imagination MDP where instead of environment states, you have
these model states and so on. And it predicts rewards as well. So you could throw any model
free RL algorithm at it now and you can solve it without actually causing additional environment
interaction. So we get a very data efficient algorithm. But you can actually, if you're
doing that, you're not really making full use of the world model because we have a neural
network world model so we can actually compute gradients through it. But all the model free
RL algorithms, they are designed for real environments where you can't differentiate
through it. So they don't make use of these gradients. And that's why we can do better
by developing an actor-critic algorithm that's specific for world models. And the algorithm
is actually quite simple. You encode some past data from the replay buffer to get some initial model states. And then you imagine forward a sequence with some imagination horizon, let's say 20 steps, using actions not from the replay buffer, but from the actor network. So the actor is just trying out something in the model world. And then you predict all the corresponding rewards for those states. You predict all the corresponding values as well, based on your current value estimate. And you want to maximize that with respect to the actions, that is, with respect to the parameters of the actor network. So you can actually, very elegantly, compute the gradient of the
sum of future rewards and future values that you can weigh in some way if you want. And
you can compute the derivative of that with respect to the actor parameters just by back
propagating through the multi-step predictions of your model, because it's all neural network components. And there are some stochastic nodes in there, because the model state has a stochastic component, and the actions are also sampled from the actor distribution.
So there are two ways you can deal with it. If it's continuous and can be reparametrized
like a Gaussian, for example, then you just use a reparametrization trick to compute the
gradients through all these steps. And if it's discrete, then you can use straight-through estimation, which is not really the exact gradient, but it still works very well. And once you do that, you know exactly, if you change the actor parameters a little bit, at what rate that is going to increase or decrease the future rewards. So you know how to change the actor network. And then the only thing
that's left is optimizing the value network. And that's just done through simple temporal
difference learning. So the value at one step just should correspond to maybe the reward
plus the value at the next step, or you could actually do a multi-step return. So the value
should correspond to the next 10 rewards plus the 11th value. What we actually do in the paper is a lambda return, which means we take all of these n-step returns for different values of n, so one reward plus the value, two rewards plus the following value, and so on, and we weigh them. But yeah, that's just so we don't have to choose a hyperparameter for n. And it doesn't really matter that much.
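As a minimal sketch of the lambda return just described, assuming the standard recursive formulation G_t = r_t + gamma * ((1 - lambda) * v(s_{t+1}) + lambda * G_{t+1}) with a value bootstrap at the horizon (names and shapes here are illustrative, not the Dreamer source code):

```python
# Minimal sketch of a lambda return over an imagined trajectory.
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """rewards[t] and values[t] along an imagined trajectory of length T;
    values needs one extra element for bootstrapping at the horizon."""
    T = len(rewards)
    returns = np.zeros(T)
    next_return = values[-1]                  # bootstrap with the last value
    for t in reversed(range(T)):
        returns[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        next_return = returns[t]
    return returns

# Usage on a toy imagined rollout of 20 steps:
rewards = np.random.rand(20)
values = np.random.rand(21)
targets = lambda_returns(rewards, values)
# The actor would be trained to increase these targets by backpropagating through
# the world model; the value network would regress them (temporal-difference learning).
```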
So on a high level, this sounds similar to Sutton's Dyna architecture, but Dyna didn't have this notion of gradients; it was independent of what kind of function approximator was used, I think, right?
Yes. Sutton's Dyna, I think basically includes almost all of model based RL. It's a very
general high level perspective, where you have some data from the real environment and
you use that to learn some model of the environment of the data that you got from the environment.
And then you use that model to somehow select an action and then you can execute that in
the real world. And I think the Dyna paper even talks about online planning as well,
but maybe that's a follow up paper. But yeah, in principle, these are all within the category
of Dyna style algorithms.
So you're building on the work you did in PlaNet, and you use the same RSSM deterministic-plus-stochastic type model here. Was the model the same?
Yes, the world model is exactly the same. And for continuous control, we found the world
model still works across like all the 20 continuous control tasks. There are a few more, but we
chose the ones for which the best model free algorithm got non-zero performance because
some of the tasks don't really make sense from pixels. You can't see the things that
are necessary for solving the task. So yeah, the world model just worked for all these
and the improvement comes from the value function and also comes from the actor network, which
can actually learn a better policy than an online planning algorithm can potentially
do because it doesn't assume that the actions are independent in time, for example.
And the actor network also has a lot more optimization steps in total, because for the
online MPC in planet, you can do maybe 10 optimization steps, but then you have to have
an action at the end of the day, because otherwise, if you do too many optimization steps, then
it becomes way too slow to really interact with the environment. Whereas the actor network
in dreamer is shared, there's just one actor network throughout the whole training process
of the agent. So over time, it will get trained much more. And later on, or in addition to
the continuous tasks, we did some discrete tasks in Atari and DeepMind Lab. And we also found that the same world model just works. But we did increase the size of the stochastic and deterministic states. So we just gave the model more capacity. And so I was actually really surprised by that. But what it says is that the PlaNet agent was bottlenecked not by the model, but by the planning part.
Was that surprising to you when you determined the final performance of the Dreamer agent? Or was that what you expected?
No, I was actually quite surprised. So I knew that to do some of the more interesting tasks that I wanted to solve, like exploration eventually, we needed to consider rewards further into the future than 20 steps. So we couldn't use PlaNet out of the box.
And I always thought that, oh, there are probably much bigger problems. And we probably have
to find a better world model. And like, you know, is it even worth focusing on the horizon
problem? Or are there much bigger bottlenecks at the moment?
But it was a kind of almost easy problem to tackle, because there are already solutions
for that with temporal difference learning. And we just kind of applied that to the setting
we were in, where you have a differentiable world model to make efficient use of that.
And I was really surprised how well it worked. And I was also really surprised how that actor
that doesn't do any lookahead while interacting with the environment can do better and even be as data efficient as online model predictive control.
Do you think that Dreamer would do pretty well even if it didn't differentiate through the model? Or maybe that's something in between PlaNet and Dreamer, like the idea of just distilling PlaNet's planning into a policy network, kind of like what AlphaZero would do. That's different from what you did here, though, right? Because you differentiated through the model. Yeah. Would that be a reasonable thing to do? Do you think that would work well here?
Yeah, there's a design space of different algorithms that operate within the world model to derive long-term behaviors, to learn a value function and an actor. And the way AlphaGo does it is basically to just regress past returns with the value function. And you can't really do that with a big replay buffer, because the returns you got in the past depended on the actions that your actor chose in the past. But now your actor is already better, so those returns won't reflect the value of the actor in its current state. But if you make the replay buffer small enough, you're approximately on-policy, and then if you just train it on a lot of data, that can work well. It's just that in the low-data regime that we're in, making your replay buffer small is a bad idea, and it pretty clearly always hurts performance. So we
couldn't really go with this approximate on-policy approach to learn the value function; we needed to do TD learning. And we needed to do it on imagined rollouts, but we can't use the past replay buffer data, because it's too different. So now, to do imagined rollouts, you need a policy to select actions. And as you said, you could in principle use a search to select actions, like a CEM search, let's say, and then distill that, learn a value from it, and then learn an actor network from that. Or you would not learn an actor network at all; if you have a value function, you can just use that during planning and that will be fine. But the problem is you can't really afford to do the CEM search at every time step in imagination for so many imagination trajectories. So that's why we actually ended up abandoning the explicit search and switched to using an actor network. Yeah, and I think your question
was also whether it could work similarly well if we ignore the gradients. And I'm not 100% sure. So what I do know is that once you have the world model, all the training inside the world model just costs you wall-clock time; it doesn't cost you environment interaction. So you could use a less efficient optimization algorithm in imagination and you would get the same data efficiency in the real world. And I don't see a reason why a normal model-free algorithm inside the world model couldn't get to the same final performance as well. But I think it would be computationally more expensive, because you would need more updates. But I haven't tried it.
So let's turn to another
very recent paper of yours, Planning to Explore via Latent Disagreement. Can you tell us what the main idea is with this paper?
Yes, so I was really excited about this paper, because I finally got to the point where I wanted to be about two and a half years ago, when I started to work on PlaNet, which is to do forward-looking exploration. So we solved the world model problem to a sufficient degree, and then we solved the horizon problem to a sufficient degree; that was PlaNet and Dreamer. And then we could finally do exploration with it. And that's the key point of this paper. And there are
a couple of ideas. One is, when you do exploration, you need some measure of novelty that you can optimize for as the intrinsic reward. So we use ensemble disagreement for that, which Deepak Pathak, who was a collaborator on the project, has done a lot of work with, and there are a couple of papers also from other people who show that ensemble disagreement works really well as a novelty signal. And I would even include random network distillation in the category of ensemble disagreement. So that's the source of novelty that gives you the intrinsic reward. But then there's another aspect to the
project, which is that when you do exploration to learn about the environment, and you have novelty as some objective function, then that's a non-stationary objective function. Because every time you interact with the world, you see new data, and that changes your knowledge, and so that changes what you think is novel, like what future inputs will be novel. And so there's a conceptual problem with model-free exploration, because model-free optimization works by training a policy from samples of the real environment. So you have some novelty objective that you want to maximize with your exploration policy. And to do that, you need to draw samples from the environment to improve the policy for that novelty objective. But while you're training the policy, the novelty objective has already changed, because you needed all these samples to train your policy and those samples tell you more about the environment. So in some sense, it doesn't really make that much sense conceptually.
Sorry, is that why a lot of the curiosity formulations just take an incredibly long time, like billions of samples?
Yes, I think that's an important part of it. And I think that you can be much more data efficient by doing forward-looking exploration. And to do forward-looking exploration, you really need a world model. At least I don't see another way of doing it. Because you need to train the policy to maximize the novelty reward without changing the knowledge of the agent, so without causing any additional environment interaction. That way you can actually find the best policy for your current reward and then execute that for maybe one step, or maybe for multiple steps, and then gather some more data, update the model, update your novelty reward, and then optimize the policy again. So you're really doing a lot of compute to decide what is the best action I can choose next. Rather than the model-free approach, where the policy will always lag behind, because it hasn't converged on the novelty reward, but you're already changing the novelty reward.
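A rough sketch of this forward-looking exploration loop; every function here is a hypothetical stand-in for illustration, not the actual implementation from the paper.

```python
# Sketch of forward-looking exploration: optimize the policy fully in imagination
# against the current novelty reward, then take only a little real action.
import numpy as np

rng = np.random.default_rng(0)

def update_world_model(model, replay):
    """Stand-in: would refit the latent dynamics model on the real data so far."""
    return model

def intrinsic_reward(model, latent):
    """Stand-in: would return ensemble disagreement (expected novelty) for a latent state."""
    return rng.random()

def optimize_policy_in_imagination(model, policy, reward_fn):
    """Stand-in: would train the policy on imagined rollouts under reward_fn.
    Crucially, this consumes no new environment data, so the novelty reward
    stays fixed while the policy converges on it."""
    return policy

def act(policy, obs):
    """Stand-in: would query the exploration policy for an action."""
    return rng.normal(size=2)

model, policy, replay, obs = {}, {}, [], np.zeros(4)
for step in range(10):
    model = update_world_model(model, replay)
    policy = optimize_policy_in_imagination(model, policy, intrinsic_reward)
    action = act(policy, obs)        # execute only one (or a few) real steps
    replay.append((obs, action))     # the new data changes what counts as novel
```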
Okay, cool. So could you maybe make crystal clear for us again this distinction between retrospective novelty and expected surprise? And which is the more common case here? I guess retrospective novelty is the more common case, looking at the past literature?
Yes, I would say that's fair to say. So these are the two terms that I like to use to describe these two ways of doing exploration, although both have been done for a long time. The retrospective surprise is what a model-free agent maximizes if it has an intrinsic reward. What it basically does is, you know, in the beginning you don't know anything, so you do random actions, and then you find something that's novel. You simulate an episode, and you predict all the intrinsic rewards for that episode. And in the beginning it will all be novel, because you don't know anything yet. And so then you train your policy, and that basically tells your policy, oh, this was a really good trajectory, because it was very novel. So you're reinforcing the same behavior. And if you were really good at optimizing your policy, and the environment isn't too random, then it would go and realize the same trajectory again. But that's exactly not what you want, because you just went there, so it's not novel anymore. It was novel when you tried it for the first time. And so you do it again, and this time you get a low reward, and so then you encourage the policy to not go there again anymore. So then what does the policy do? It has no idea. It just knows, don't go there. And then it does another random exploration somewhere else, going there a second time to find out it's not novel anymore. In practice, there is more generalization in the network going on and so on, so it's not exactly this. But I think it's a useful mental picture to understand what's really wrong with retrospective exploration. And in contrast to that, there is expected
exploration, or planning to explore, forward-looking exploration, where you use a predictive model of the future to optimize your policy in imagination, so that the policy gets really good at seeking out whatever is novel to the agent at the time you're training it. But since you're training it from imagined rollouts, the training doesn't tell the agent anything new about the environment, and so the intrinsic reward doesn't change. You can really optimize this for much longer, in principle even until your policy converges fully. And then, in the most extreme case, you would just execute one action of that policy, then retrain your world model and so on, retrain your policy in imagination. And then you really get what is most promising to explore next. And you can look into the future and think, oh, if I go here, I don't really know what's going to happen. But of the things that I think might happen, some of them are really interesting because they are really different from everything I've seen so far, and others are not so different from what I've seen so far. And then you can go in a really directed way to the parts of the world that your model expects to be the most interesting. So you maximize the expected information that you imagine you could gain about the environment.
There was a cool paper called Model-Based Active Exploration where they do something
quite similar but on much simpler environments and without any high dimensional inputs. But
they basically learn an ensemble of ten environment models, and then the disagreement between their predictions is the reward. And then they train a Soft Actor-Critic or some other model-free algorithm to maximize this imagined reward on the imagined predictions. So it's also implementing this forward-looking exploration. Now, the challenge we had in addition to that is that we have high-dimensional image inputs. So we can't really afford to do the policy optimization in image space; we have to do it in the latent space. And so we need some way of defining the novelty reward there. And what we did for that is, from every latent state during training, we use an ensemble that tries to regress the observation embedding for the next time step, whatever the conv net produces in terms of features before it goes into the model at the next step. So you get an ensemble of one-step predictors, which is more efficient than actually training multiple RSSM architectures; it's just some feed-forward layers. And that turned out to work really well. For training it, of course, you need the target for the next observation embedding. But for imagination training, you only need the variance, the disagreement, of these ensemble predictors. So you don't need the future observations. You can do it all in the latent space of the world model: you predict a trajectory of states, and then for every state you feed it to all the ensemble predictors and you just compute the disagreement between them.
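A minimal sketch of that disagreement reward, using toy linear stand-ins for what would be learned one-step predictors in latent space (all sizes here are made up):

```python
# Ensemble-disagreement novelty reward computed purely in latent space:
# no future observation is needed, so it also works on imagined states.
import numpy as np

rng = np.random.default_rng(0)
num_models, latent_dim, embed_dim = 5, 16, 32

# Toy ensemble of linear one-step predictors of the next observation embedding.
ensemble = [rng.normal(size=(embed_dim, latent_dim)) for _ in range(num_models)]

def disagreement_reward(latent):
    """Variance across ensemble predictions, averaged over embedding dimensions."""
    preds = np.stack([W @ latent for W in ensemble])   # (num_models, embed_dim)
    return preds.var(axis=0).mean()                    # scalar novelty signal

latent = rng.normal(size=latent_dim)
print(disagreement_reward(latent))
```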
How does this formulation respond to the noisy TV problem, where some world models get confused by sources of random noise?
Yeah. And I'd like to connect this to the earlier point where it's not so much about
whether the environment is stochastic or random or not. So aleatoric uncertainty, or irreducible uncertainty, is not just a property of the environment, whether the screen is unpredictable or not. It's also a property of your agent and the modeling capacities of your agent.
So even if something in principle is perfectly predictable, if your model is too weak, then
it will never learn it. And you don't want to get stuck trying to learn about that forever when you could actually move on to other parts of the world where there's lots of things
that you can learn. So the question of the noisy TV really becomes the question of, how
do I know when I should give up on learning something and move on to the next thing? And
conceptually, I think the answer is you don't really ever know. But the best you can do
is learn things in order of increasing difficulty. Learn the easiest things first, the things
that are easiest for you. And so eventually, you will have learned everything that you
can learn, and then you will be stuck on the next hardest thing, but there is not really
a way to avoid that. So to do that, to have an idea of what you can't learn, you need
a nice model. So if you have a deterministic model, then you have two problems. For one,
it kind of has to explain everything perfectly. And the second is you can't really consider
multiple hypotheses over the models. There's one model, some one point in the weight space
of all possible models. And you don't really know how much certainty you have in that model.
So you don't know how much the uncertainty reduced after you see some new data. But if you have a distribution over models, like a Bayesian neural network or an ensemble, that gives you a bit of an idea of how much you know, for example, what's the disagreement
in your ensemble. But then you also want a way to allow noise in your predictions. For
example, if you try to, let's say, just predict the next observation, to keep it simple, and
from maybe the last input and the action, and you do that with an ensemble of Gaussian
models, then you're allowing some error in the prediction. You're saying, you know, each
model tries to really predict the next input, but with a Gaussian distribution. So it doesn't
have to be perfect. It's trying to get the mean to be the right mean, but then also if
the observation is somewhere else, it's okay, because we're predicting this Gaussian. So we assign some probability to all the next inputs we could get. And so the variance in the output of the Gaussian is basically the amount of noise that you assume there is in the data. And the more noise there is in the data, maybe you should avoid those parts of the environment. And that's what the expected information gain also tells you mathematically.
And intuitively, this works out pretty nicely because you have this ensemble of models,
they all predict the Gaussian over something in the future, let's say the next image. And
even though the next image is a bit random, and maybe inherently stochastic, the means
of your ensemble over time, when you get enough data, they will still converge to the mean
of whatever is the distribution of the next input. And so the ensemble disagreement will
go to zero, even though there is randomness in your inputs. And so you will not be interested
in them anymore. So it's able to model the stochasticity in a way that makes it not curious
about it anymore.
Actually, it's not clear to me how that works. So let's say the agent comes across two displays. One is showing just random Go boards, 30 by 30 Go, or a smaller one, let's say a tic-tac-toe board. And the other one is the same board, but it's being played by experts. And we know they're different, right? We know these two cases are totally different. And we might think that, at least with the simpler game, if we watched it long enough, we could figure it out. But we don't know that at first.
Right. So you have a model that tries to predict the next move in the game, like just tries
to predict the next input to the agent, what it's going to see next. And then you need
multiple models so that you can get an idea of multiple hypotheses of the rules of the
environment. You try to learn the rules of the environment by having a model that from
one go position or image of a go position predicts the next image of a go position.
And so to get uncertainty estimates to do exploration, you need some way of representing your uncertainty, either explicitly, or any other algorithm will do it in some implicit form. So one way to do that is to train multiple environment models. And so then you get an
idea of, well, if they are all the same, then I'm quite certain about what the next outcome
is going to be. If they are all different, I probably have not that good of an idea.
So if you train these in both scenarios, for the random Go board and for the expert Go board, then on the random Go board, the dynamics models are initialized differently in the beginning, so they will predict different things, so your agent will go there for a while. And then over time, all of the models will just learn to predict the mean and maybe the variance of the next image. And the mean image, or the average over the next moves, is probably going to be uniform. So if it's in pixel space, if you're actually looking at the Go board, it would basically be: the stones that are already there will stay there, and all the other empty fields will have an equal chance of getting the next stone. So they will all be a little bit darker or a little bit lighter based on which player is next. Whereas if there were something to predict about the next move, then there would be some fields that are clearly still empty and some fields that have some chance of the stone ending up there. And if you have multiple predictors, then they can all predict this average image. But in the case of the random policy, or in the case of the random board, after a while they will all predict the exact same kind of uniform distribution over possibilities. And if they all predict the uniform distribution over next possibilities, you know that, first of all, your models all agree; they all predict the uniform distribution, so probably the next move is actually uniform. And then you know that there's nothing more to learn, because your ensemble members agree, even though they are not certain what the next outcome is. Whereas on the expert board, it will take much longer for them to agree on what the next move is going to be. And they will only agree by the time that they've actually perfectly reverse-engineered the expert players, to the degree that the model allows them to.
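A tiny numeric illustration of that point (a toy example, not from the paper): if the next input is pure noise, ensemble members fit on enough data converge to the same mean prediction, so their disagreement, and hence the novelty reward, shrinks toward zero even though the data itself stays random.

```python
# Toy demo: ensemble disagreement vanishes with more data, even for random inputs.
import numpy as np

rng = np.random.default_rng(0)

def ensemble_disagreement(num_samples, num_models=5):
    # Each member sees its own bootstrap sample of the random "next inputs"
    # and predicts their mean; disagreement is the variance of those means.
    data = rng.random(num_samples)                 # uniformly random next moves
    means = [rng.choice(data, size=num_samples, replace=True).mean()
             for _ in range(num_models)]
    return np.var(means)

for n in [10, 100, 10_000]:
    print(n, ensemble_disagreement(n))   # disagreement shrinks as data grows
```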
Can you tell us a bit about the process of writing these papers? Like, for example, did the experiments in general work out? How often did they work out the way you expected them to? Are there often dead ends that are reflected in the final papers?
The experiments rarely work out the way you want them to work out. So you need to run
a lot of experiments. And I also want to be very confident in my own algorithm when I write about it, because, for one, it takes some time and effort to write a paper, and that's time where you can't do research. So I only want to do that if I have a result that I'm happy enough with that I'm willing to spend all this time writing the paper and then writing rebuttals for the conference, and then you have to do a poster and maybe a talk and so on. And if you don't really believe in the method, then all of these things are painful. So I don't want to do that. I didn't think that way before grad school, because before grad school you just need to get a paper so you get into a PhD program. But once you're in a PhD program, you have several years and you can think much more long term, and actually follow your interests much more. So I want to be sure that I have something that I also believe in. And that just takes a long time, and you have to run a lot of experiments.
Whatever problem you're studying in the paper, either world modeling or exploration and so
on. There's usually a big design space of ideas you can explore. And I want to, as much as possible, strategically break down this space, test out all these ideas, and get an understanding of which of them work better, or work better in some situations but worse in others, and why. And it's not always easy. For example, we didn't do that
much of that for PlaNet, just because we tried a lot of things and they all just didn't work at all. But I think it would actually be interesting to go back and try to really understand why, for example, the stochastic and deterministic state separation seemed to be so important. So there's a lot of tuning necessary and it takes a long time. And I think it's worth putting in that time, and it's better to have one paper a year that you're really happy with than four papers that don't really help anybody. Does that answer your question?
Yeah, that was great.
Cool.
So do you have any comments on what you think the future of world models looks like?
Yeah. So I think we still need to scale up a bit because reconstructing accurate images
doesn't seem to be the solution, the long-term solution for representation learning, neither
in model-free RL nor in model-based RL. So I think there are better ways of learning
representations, learning latent states, than by reconstructing images. Because if you think
about it, there might be a lot of things in the image that the agent doesn't really care
about. And there may also be a lot of things in the image that are just kind of really
difficult to predict. And my experience so far is that if you can get reconstruction
to work on an environment, then it does really well because you're basically solving a harder
task than you have to. You're trying to predict all your sensory inputs. If you can do that,
then you know everything about the world there is. But if you can't, because the world is
too complex to predict everything accurately in input space, then the agent tends to not
learn a good representation. And so it's not like a kind of graceful failure. And I think
contrastive representation learning is really interesting. There are a couple of empirically very successful methods for static images that I think we can apply to video
sequences for RL. And so we're trying some of that. Another aspect that I think a lot
of RL is still kind of bottlenecked by is temporal abstraction. And I said earlier,
value functions give you some of that because they let you consider rewards into the long
term future. But in a really complex environment, I think it will become intractable to learn
a good value function for everything. And you probably need to do some kind of online
planning just because there are too many possible scenarios that you could imagine to really
be able to learn about all of them. And so what you want to do is do the planning online,
so you only have to do it for the situations that you actually encounter. And to then still
consider long horizons, you need to have temporal abstraction in your world model. So that's
another thing we're trying. And then besides that, I mean, there is a big open space for
objective functions that are enabled through learning accurate world models. And some of
them will benefit from having uncertainty estimates that are more accurate than maybe ensembles, about parts of the world model, so you can explore better. Empowerment is another
interesting objective function that we're studying that becomes much easier to compute
once you have a world model. So in summary, it's scaling up, learning better representations
and finding better objective functions because eventually exploration will become really important
as well.
So back at the NeurIPS 2019 RL workshop poster sessions, I was at David Silver's poster for MuZero. And I asked him how MuZero handled stochasticity, and he told me that it didn't, it used a deterministic model, but he said it could be extended to handle the stochastic case. And I think MuZero builds on the Predictron paper, which does some kind of temporal abstraction. So maybe there's progress being made on the temporal abstraction side.
Yeah, I'm actually not sure if the original Predictron has temporal abstraction in it. But yeah, I think for the stochasticity aspect, it may be more necessary when you're trying to explain more complex data. So if you're trying to explain your inputs, stochasticity becomes more important than if you're just trying to explain future rewards. That's my guess. Also, you have to learn a lot more, of course, if you're trying to model the world rather than the task, but the result is that you get a model that can be useful for a lot of different tasks, and that can be useful for exploration, where you don't have a task at all. There are some recent papers on doing temporal abstraction and some old ones as well, both in model-free and model-based RL. I think there are lots of great ideas, and my guess is that we don't have to invent a crazy fancy method. It's like almost everything in machine learning, where we just have to take something reasonable that seems intuitively correct, and then push it until it either works or we find a reason why it doesn't work. And that hasn't really happened for temporal abstraction in RL yet.
Can you say anything about the research directions that
that you are pursuing going forward?
Yeah, I mean, that overlaps a lot with what I said in response to your question about next steps for world models. But yeah, for me, I'm trying to systematically go through different objective functions for intrinsic motivation now. And besides that, we also want to work on harder tasks. So I need to scale up world models further, so that we can, let's say, train an agent with only an intrinsic motivation to play Minecraft from pixels. That would be great. Awesome. And it builds your house, and it survives, and maybe fights the monsters at night, you know. Because there is such a complexity there, there are so many things you can do. A lot of games are actually easier to explore than you might think. For example, in Mario, you can only walk forward, so it's not that difficult to explore: basically, either you're making progress, you go forward, or you don't. But in an open-world game, there are so many things you can do, and then you get an additional challenge, because once you've explored something, you kind of have to go back and see if there's something else you could have also tried from there. And so that's why I like thinking about doing intrinsic motivation in Minecraft: you have to build tools, and then use these tools to get better materials and make better tools, and then you can bring yourself into a better state for surviving. And so if an agent can actually do all these things, then it must be a very general objective function that can explain all this.
Besides your own work, are there other angles in RL that you find very interesting lately that you might not have mentioned?
Yeah, there's one that I've been thinking about a bit, but not really done anything in, which is external memory, to give agents long-term memory. Because I think temporal abstraction is just one part of the puzzle. You do want to plan into the future on a temporally abstract level, and that gives you a long context from the past as well. But I think you can't keep everything in your working memory at a time. And so it's very natural to think that there could be this external memory module that you can write things into and later query to get back the facts that you need at the moment. So there are a couple of interesting papers on training these modules for RL. And another direction, that's not directly reinforcement learning, is brain-inspired architectures. So I think it would be cool to develop an unsupervised learning algorithm that works in an online setting on high-dimensional inputs. So it can't really do backprop through time; it has to find some other way, because it keeps getting new input. So I think it would be cool to kind of go away from the static image setting into the online streaming setting for representation learning, and potentially explore very basic ideas that people know about computation in the brain, like sparse distributed representations, hierarchies, and so on.
Danijar Hafner, it's been a real treat. Thanks for taking this time and for your patience in teaching us so much. Actually, I've learned so much in this episode, I'm going to listen to it many times. It's been great hearing about your fascinating research, and I can't wait to hear or read about what you come up with next. Thanks for sharing your time and your insight with us, Danijar.
Thanks, Robin. That was a great chat, and I'm looking forward to hearing it.
So that's our episode for today, folks. Be sure to check talkRL.com for more great episodes.