TalkRL: The Reinforcement Learning Podcast

Jakob Foerster on Multi-Agent learning, Cooperation vs Competition, Emergent Communication, Zero-shot coordination, Opponent Shaping, agents for Hanabi and Prisoner's Dilemma, and more.  

Jakob Foerster is an Associate Professor at University of Oxford.  

Featured References  

Learning with Opponent-Learning Awareness 
Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, Igor Mordatch  

Model-Free Opponent Shaping 
Chris Lu, Timon Willi, Christian Schroeder de Witt, Jakob Foerster  

Off-Belief Learning 
Hengyuan Hu, Adam Lerer, Brandon Cui, David Wu, Luis Pineda, Noam Brown, Jakob Foerster  

Learning to Communicate with Deep Multi-Agent Reinforcement Learning 
Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, Shimon Whiteson  

Adversarial Cheap Talk 
Chris Lu, Timon Willi, Alistair Letcher, Jakob Foerster  

Cheap Talk Discovery and Utilization in Multi-Agent Reinforcement Learning 
Yat Long Lo, Christian Schroeder de Witt, Samuel Sokota, Jakob Nicolaus Foerster, Shimon Whiteson  

Creators & Guests

Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

TalkRL podcast is all reinforcement learning, all the time, featuring brilliant guests,
both research and applied.
Join the conversation on Twitter at @TalkRLPodcast.
I'm your host, Robin Chauhan.

Today we are lucky to have with us Jakob Foerster, a major name in multi-agent research.
Jakob is an associate professor at the University of Oxford. Thank you for joining us today,
Jakob Foerster.
Well, thanks so much for having me. I'm excited to be here.

How do you like to describe your research focus?
My research focus at a high level is about finding the blind spots in the research landscape
and then trying to fill them in.
And I know this sounds awfully general, but I can give you examples of what this used
to be in the past.
And maybe I can also talk about what I think this will mean in the future.
So in the past, when I started my PhD, this was all about multi-agent learning.
At the time, deep reinforcement learning was just becoming the popular thing to do.
But folks hadn't realized that putting learning agents together was an important problem to
So that was a big gap in the research landscape.
And in my PhD, I started making progress on this.
And what might this mean going forward?
It essentially means understanding what the boundaries of current methods are, where the
limitations are either already arising or will be arising in the future, and then trying
to address those limitations.
For example, being able to utilize large-scale simulations to address real-world problems,
as opposed to relying just on supervised learning methods or self-supervised learning methods.
That's at a very, very high level, what I'm excited about.
Fundamentally, it's about painting in the big gaps in the research landscapes.

Most of your work I've encountered is on multi-agent RL.
And on your lab's website, it mentions open-endedness as well.
Can you tell us about what type of work you do in open-endedness?
Yes, this is a fantastic question.
There's a really up-and-coming research area called unsupervised environment design,
UED for short, and I believe you had somebody on your podcast as well who's leading in
that area.
And I've been fascinated by this question of how we can discover environment distributions
that lead to specific generalization results or allow agents to generalize to the
corner cases of the distribution.
And as it turns out, this is a multi-agent problem, because it's often formulated as
having a teacher and a student, and immediately you're in the space of general sum learning
or zero sum learning in this case, whereby we have to consider the interactions of learning
So that's one thing that makes it very fascinating for me.
The other aspect is it's clearly a crucial step when wanting to bring methods to the
real world.
So being able to bridge the sim to real gap effectively is one of the key questions in
bringing multi-agent learning systems to the real world.
And then lastly, we have papers now where we're actually using insights from multi-agent
learning and bringing them to unsupervised environment design.
So for example, there's a method called SAMPLR that we published last year at NeurIPS.
And here it's essentially a transfer of a method I developed for multi-agent learning,
off-belief learning, where the same problem also appears in this very different
problem setting of unsupervised environment design.
So for me, overall, this has been a fascinating process.
And I would say now probably almost a majority of the papers coming out of FLAIR in the future
I expect to have some element of open-endedness or unsupervised environment design.

And I'm sure you get this question a lot or have to answer this a lot.
But for the listeners, can you remind us of what are some of the main challenges with
multi-agent RL and multi-agent in general?
Like why do the methods designed for single agents not work very well for multi-agent
And what makes it hard?
So in a nutshell, supervised learning is easy.
And this sounds a little flippant, it's supposed to be flippant.
But if you have a data set and you have supervised or self-supervised loss, ultimately, this is
a stationary problem.
And what that means is, as long as the learning algorithm is going to converge to an approximate
global optimum, you will get a model that works well.
And we've really gotten to a point now where large-scale supervised learning, even if it's
a GPT-17 or whatever we have, can converge stably to good solutions.
And effectively, we don't care about the exact weights or the exact solution found, because
we can simply look for generalization performance, make sure we don't overfit.
And that's really the only concern that we have.
We can look at the training loss, test loss, validation loss, and make sure that we're
in good hands.
In contrast, when we have multiple learning agents, then suddenly all of these guarantees
break down.
And that's because the other agents in the environment continuously change the problem
that we're addressing.
So, for example, if you're looking at the world from the perspective of one agent, then
suddenly the actions that other agents take will change the environment that's being faced
by that agent.
And that's called non-stationarity.
To make matters worse, when these agents are learning, we get extremely hard credit assignment
problems, whereby suddenly the actions that an agent takes in an episode can change the
data that enters the replay buffer or the training code of another agent, and therefore
will change the future policies.
And suddenly doing extremely sensible things, like each agent maximizing their own rewards,
can lead to rather drastically unexpected phenomena.
And one example is playing the iterated prisoner's dilemma, whereby there are a lot of different
possible Nash equilibria that could be reached during training.
But naive learning methods have a strong bias towards solutions that lead to radically bad
outcomes for all agents in the environment, such as defecting unconditionally in all situations.
So to put this into one sentence, the problem of multi-agent learning is non-stationarity
and equilibrium selection.
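
As a rough illustration of that last point (a sketch, not something from the episode), the snippet below lets two naive learners each follow the exact gradient of their own expected payoff in a one-shot prisoner's dilemma, which is a simplification of the iterated game discussed here. The payoff values, learning rate, and function names are assumptions chosen for illustration. Each agent's payoff depends on the other agent's changing policy (the non-stationarity), and both drift to unconditional defection even though they start out mostly cooperating.

```python
import math

# Row player's payoffs (hypothetical values, chosen only so that T > R > P > S,
# which is what makes the game a prisoner's dilemma).
R, S, T, P = -1.0, -3.0, 0.0, -2.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_payoff(p_me, p_other):
    """Expected one-shot payoff when each side cooperates with the given probability."""
    return (p_me * p_other * R + p_me * (1 - p_other) * S
            + (1 - p_me) * p_other * T + (1 - p_me) * (1 - p_other) * P)

def grad_wrt_own_logit(p_me, p_other):
    """Exact gradient of my expected payoff with respect to my own logit."""
    d_payoff_d_p = p_other * (R - T) + (1 - p_other) * (S - P)
    return d_payoff_d_p * p_me * (1 - p_me)  # chain rule through the sigmoid

def train_naive_learners(steps=2000, lr=0.5, init_logit=2.0):
    """Both agents ascend their own payoff and treat the other agent as fixed."""
    theta_a = theta_b = init_logit
    history = []
    for _ in range(steps):
        p_a, p_b = sigmoid(theta_a), sigmoid(theta_b)
        history.append((p_a, p_b, expected_payoff(p_a, p_b)))
        g_a = grad_wrt_own_logit(p_a, p_b)
        g_b = grad_wrt_own_logit(p_b, p_a)
        theta_a += lr * g_a
        theta_b += lr * g_b
    return history

history = train_naive_learners()
first, last = history[0], history[-1]
print(f"start: p_cooperate={first[0]:.3f}, payoff per agent={first[2]:.3f}")
print(f"end:   p_cooperate={last[0]:.3f}, payoff per agent={last[2]:.3f}")
```

In the one-shot game defection is the only equilibrium, so this sketch shows the pull towards defection but not the equilibrium selection problem itself; that only appears in the iterated game, where many equilibria exist.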

So I think right now with GPT and ChatGPT, a lot of people are associating that
approach, and LLMs and all that, with AI in general, and maybe as the path to
AGI potentially.
And then there's people who have said that the reason human brains are so powerful has
to do with our social learning and the fact that we had to deal with social situations
and basically that the multi-agent setting was central to the evolution of really powerful
intelligence in the human brain.
So I wonder if you have any comment on that in terms of what the role of multi-agent learning
might be in the path towards really powerful AI and potentially AGI.

Yeah, this is a fantastic question.
So when I started my PhD, I had that exact same intuition that indeed the interaction
of intelligent agents is what has driven intelligence and that the epitome of that interaction of
intelligent agents is language.
So that was my intuition for studying emergent communication at the time.
It was essentially sort of my take on bringing agents from playing Atari games to being able
to discuss things and ultimately get to abstraction and intelligence.
And this intuition, looking back, was good in some sense, in that language is indeed
obviously crucial for abstract reasoning and for social learning.
It now turns out, looking back, that simply training supervised models,
large-scale language models, on large amounts of human data is a faster way of bringing
agents to current or approximate levels of human abilities in terms of simple reasoning
And probably that makes sense looking back.
Now that doesn't mean that these methods will also allow us to radically surpass human abilities
because again these abilities emerged in the human case through multi-agent interactions
through a mix of biological and cultural evolution and ultimately led to that corpus of cultural
knowledge that we've currently sort of codified in the existing text and other media.
So I'd like to distinguish a little bit between being able to get to broadly speaking something
that matches human level at a lot of tasks give or take versus something that can radically
surpass human abilities.
And I can imagine that to radically surpass human abilities we will need systems that
can train on their own data, train in simulation, and also might need systems
that can interact with each other and train in multi-agent settings to drive
meta-evolution or something like that.
Does that make sense?
So I think there's a bit of nuance here: getting up to human levels is something I think current systems
can do, and it's an open question how much further we can push it without going to multi-agent
learning, and also whether we want to push it that way.

Okay and this definitely brings up your paper from years ago.
I consider it a classic in deep RL.
That was Learning to Communicate with Deep Multi-Agent Reinforcement Learning.
I believe that was 2016, and in that paper you mentioned end-to-end learning of
protocols in complex environments. I think that was a groundbreaking paper in
terms of figuring out how to get deep RL agents to communicate and to invent languages. Is
that correct?
Yeah, so as I said, this was driven by my desire to get agents to be
less singleton and actually have them interact with each other, talk to each other, and ultimately
get to intelligent systems.
This paper was fantastic in many ways, in that it really started the community
and started showing people what is possible with modern techniques. I do believe that,
looking back, the currently more successful approach is, rather than relying on emergent
communication, to actually seed these systems with large-scale language models, and I'll
be very excited to see how we can combine these two now.
So standing on the shoulders of giants, starting with large language models, can we now ask
those same questions again about having agents that can develop protocols, starting from what
is already seeded with all of human knowledge from a large-scale language model, but
then apply that same rationale that multi-agent interactions can lead to novel skills and
the emergence of novel capabilities?
But again, not starting from scratch anymore like we did in 2016, but instead starting with
already quite a lot of, let's call it, reasoning abilities within these models.
So I think there's an exciting line of work here, which is picking up those initial ideas
again but combining them with what we have in 2023.
That does sound exciting.
I still wonder, though, is there not still a role for learning from scratch? Like if
you had maybe small IoT devices or something that have to talk, they might not want an LLM,
but they might have some very specific thing they need to collaborate on.
Do you think that there's still a role for that very simple level of learning from scratch?
Oh, absolutely. So this is about the path to AGI, basically, or to really
intelligent systems; that's one axis.
From a practical point of view, I always joke that I don't actually want my Roomba device
to be able to build a dirty bomb.
I want it to be able to clean the floor reliably, maybe certifiably, and I want it to be able
to do almost nothing else.
And so there are many instances where, even if it was possible to prompt some large language
model to do the task, I'd much rather not be deploying the large language model, for safety reasons.
Mm-hmm, and performance?
Yeah, absolutely, correct.
Safety, cost, and performance guarantees, right? So I think reinforcement learning, from a practical
point of view, is still going to be required, especially if you think about robotics and so on.
But I think in the training process of these policies we will be using large language models
to generate data, to help with the exploration task, to perhaps generate environments, to come up
with high-level plans, and so on.
That's broadly speaking how I think about it. And obviously, in that same context, learning
communication protocols from scratch, when it's required as part of the task specification,
is still going to be a necessity in many situations where communication is costly, communication
is noisy, and it's not obvious what these devices should be communicating.
But that's a very different motivation from what was guiding my decision-making back in the day
to work on communication protocols in the first place.
Obviously, if you read the paper, it will be all about the down-to-earth things,
like actually solving practical problems.
That makes sense.
So we had Karol Hausman and Fei Xia, co-authors of the SayCan paper, on earlier, and I was asking
them, do we really want our kitchen appliances having read all of Reddit and things like
that? Although they had a good answer for how they kept things safe.
They didn't let the models just generate whatever; they were just evaluating probabilities.
But on the other hand, they couldn't run the LLM locally; the robot had to talk
to the LLM in the cloud, so there are all sorts of trade-offs there.
So let's move on to cooperation versus competition. You've kind of talked about these
two axes, but is that the whole story? And how are they different when it comes
down to learning problems? Are cooperation and competition very different, and what makes
them different?
Yeah, it's interesting.
So the grand challenge I mentioned of multi-agent learning, or one of the grand challenges, is
what policy do you play, because these problems aren't specified; it all depends on what others
are doing. That challenge disappears in some settings. In particular, if you have a fully cooperative
setting where you're able to control the policies, so the way that agents act, for every
single player in the team, then that effectively reduces to a single-agent problem, not in terms
of the hardness of solving the problem but in terms of the problem specification.
I can say the weights theta should just be the argmax over the joint policy in
terms of maximizing the return of this team, and that looks a lot like any single-agent
RL problem in its mathematical specification, even though the environment is now parameterised
differently and the policy is parameterised differently. But it's quite simple. So this is fully cooperative,
all agents under the same controller. And there's only one other problem setting that I can
think of that is also easily specified, and that is two-player zero-sum, because in two-player
zero-sum I can specify the requirement of having a Nash equilibrium, and if I find
any Nash equilibrium, then I'm guaranteed not to be beaten at test time,
whoever the test-time partner is. And that's why competition and fully
cooperative self-play are special cases of multi-agent learning where it's really easy to
specify the problem setting and make sure that whatever we train in simulation with
our train-time partners actually works well at test time. In other words, the equilibrium
selection of finding the exact policy is trivial, beyond the computational cost of finding this
actual equilibrium. In contrast, there are two problem settings where things get extremely
complicated. On the one hand, there is general-sum learning, whereby, for example, in
the iterated prisoner's dilemma I have all these different equilibria, and my training
algorithms have biases towards one or the other, and the interaction of learning systems
can lead to disastrous outcomes that are not desirable for any party. So that's
general-sum learning, the tragedy of the commons. The other aspect is cooperation when we're
not able to specify all of the policies for all agents at test time in the environment,
and that's something I've worked on a lot under the banner of zero-shot coordination,
which is essentially: how do we train in simulation such that we can expect our policies to generalize
to novel partners at test time? Clearly, here the inability to agree on a specific
policy makes it much harder to specify what the problem setting is and also how we
should train for it. So what I'm trying to say is that really the competitive part,
two-player zero-sum, is quite unique in multi-agent learning, because all it takes is to find the
Nash equilibrium. Now, this can be hard in complex settings like poker, but we in principle know
what this solution should formally be.

So I saw a lecture of yours online where you talked
about the different games that were being tackled in deep RL, and most of the games that
we associate with deep RL agents, like Go and chess and Dota and StarCraft, you
showed how they're all zero-sum games. At the time there was not nearly as much
work in the cooperative quadrant, in the partially observable and cooperative quadrant,
and you have some really interesting work in that quadrant. Can you tell
us about that?
Yeah, so I put it under the tagline of being
able to use computers not to beat humans at these competitive games but instead to
support and help humans, using large-scale compute and in particular using compute in
simulation. Now, the challenge with this is that you can have a solution that
does perfectly in that team of AI agents in simulation, but the moment you replace any
of these agents with a human at test time, everything breaks, because these policies are incompatible
with settings whereby the equilibrium can't be jointly chosen for everyone in the team.
This is really abstract, so let's try and make it a little more specific so our
listeners can visualize this. Imagine that we're playing a game, because I like toy examples,
as you must have noticed. There are 10 levers; nine of these levers pay a dollar if we
both pick the same one, and one of these levers pays 0.9 dollars if you and I both pick it. The reward
that's being paid by each lever is written on it. Obviously, if we pick different
levers, we don't get any points. Does that problem setting make sense?
Yeah, that makes sense.
And the question now is, what would a standard machine learning algorithm, reinforcement
learning, say multi-agent reinforcement learning, learn in this setting? Well, it would learn a joint
policy, so a policy for you and me, that maximizes the reward in expectation of the team, and that policy
can effectively pick any of the 1.0 levers and get one point in expectation. In contrast, if you
and I were to play this game and it's common knowledge that we cannot agree on a policy, so
there's no numbering on the levers and we can't agree to pick lever one or five or whatever,
then it's fairly obvious that we should pick the lever that we can independently coordinate on,
which is the unique 0.9 lever. That highlights the difference between Nash equilibria which are
well suited for self-play, where we can control the entire team at test time,
and a completely different set of Nash equilibria that is well suited when we cannot do this.
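
The arithmetic behind the lever game can be sketched in a few lines (an illustration, not code from the episode). The one modelling assumption, labeled in the comments, is that two independently trained self-play teams each end up on one of the nine 1.0 levers uniformly at random, since nothing distinguishes those levers from one another; the function and variable names are made up for this sketch.

```python
from fractions import Fraction

# Payoff written on each lever: nine levers pay 1.0, one pays 0.9.
LEVER_PAYOFFS = [Fraction(1)] * 9 + [Fraction(9, 10)]

def expected_team_payoff(policy_a, policy_b):
    """Expected payoff when both players draw a lever independently.

    A policy is a list of probabilities, one per lever. The team is paid
    only when both players pick the same lever.
    """
    return sum(pa * pb * pay
               for pa, pb, pay in zip(policy_a, policy_b, LEVER_PAYOFFS))

def one_hot(index):
    """Deterministic policy that always pulls the lever at `index`."""
    return [Fraction(int(i == index)) for i in range(len(LEVER_PAYOFFS))]

# Self-play: one training run controls both players, so they can settle
# on the same arbitrary 1.0 lever (index 3 here is an arbitrary choice).
self_play = expected_team_payoff(one_hot(3), one_hot(3))

# Cross-play between two independent runs. Assumption: each run's choice
# is uniform over the nine indistinguishable 1.0 levers.
uniform_over_best = [Fraction(1, 9)] * 9 + [Fraction(0)]
cross_play_arbitrary = expected_team_payoff(uniform_over_best, uniform_over_best)

# Both players independently reason their way to the unique 0.9 lever.
cross_play_unique = expected_team_payoff(one_hot(9), one_hot(9))

print("self-play, arbitrary 1.0 lever: ", float(self_play))
print("cross-play, arbitrary 1.0 lever:", float(cross_play_arbitrary))
print("cross-play, unique 0.9 lever:   ", float(cross_play_unique))
```

Under that assumption the self-play optimum scores 1 with its training partner but only 1/9 in expectation with an independently trained partner, while the 0.9 lever scores 0.9 in both cases.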
And understanding this is important, because when machines meet humans, often the problem setting
will be known and understood, the task is clear, but the ability to specify a policy isn't there,
because that would be quite costly. What is worse, the space of possible Nash equilibria that these
algorithms can consider is often exponentially large, and only very few of them are actually
suitable for coordination. To illustrate this, imagine that we're playing this lever game now,
but repeatedly, where you can observe which lever I played and which lever you played. In this setting,
the space of all possible optimal policies is actually joint policies that pick an arbitrary lever
at every time step from the 1.0 set. You can imagine, if you have 100 time steps, then there is
something like 9 to the 100 possible optimal trajectories, but it would be quite hard to explain
to a human that this is the policy you're playing. Instead, what a human would likely do is something
like: either I copy your move or you copy my move. Well, that requires tie-breaking, so we're
going to randomly decide who copies whom and who sticks. And again, figuring this out is important
because if I want to help and support humans, that can be formulated as a partially observable,
fully cooperative multi-agent problem. Partially observable because no robot can look into the
human's head, so you don't know the exact reward function of the human. Fully cooperative because
you're trying to help the human, so there's only one reward, even though it's unobserved by the robot
in general. And multi-agent because of the robot and the human. But it's also a coordination
problem, because we can't know the exact weights of the brain that is actually controlling the
human, so we have to be able to work with another agent in the absence of the ability
to agree on a policy. Therefore I've been trying to push the field to think more about
fully cooperative, partially observable coordination problems, and in particular I've used the card
game Hanabi for the last few years to develop novel methods in that space.

Now, I heard
about your Hanabi work, I guess last year, and so I got the game and played it with my family.
For anyone who has not tried it, it's a very strange sensation playing
that game, because it's the opposite of most games: you cannot see your own cards, and
you're trying to work as a team with the rest of the players instead of competing with them, which
is quite refreshing. Can you talk a little bit about how you approached Hanabi and what
agent design you used for that problem?
Yes, so it's been a journey, really. When we started
out a while ago, during my DeepMind internship, doing the very first self-play experiments,
this was basically about: if you can control the entire team, what happens? The good news is
you can get really good performance in this team. The bad news, which I found pretty soon, is that
these agents that you train in simulation on the game from scratch are quite brittle in terms of
independent training runs. If you run the same training algorithm twice, you can get out teams
that are completely incompatible, and they are also very different from natural human gameplay,
as humans would play the game. At this point I had a choice. I could have said, well, let me
try and solve Hanabi by using large-scale human data to regularize the learning process, but I
didn't do that. What I said instead is: can we use this as a platform for understanding what type of
algorithms are suitable for coordination? What I mean by coordination is again this idea of
independently trained algorithms being able to cooperate or coordinate at test time. That's
what I did, and this was a really challenging journey, because it turns out that the human ability
to coordinate is quite amazing. When you played the Hanabi game,
most likely what happened is someone explained the rules, you then started playing, and within a
few games you had a sensible strategy. You
didn't have to agree precisely on what this game was and how you were going to act in every situation.
And that's what I wanted. I wanted to develop algorithms that get to this more
sensible way of playing, of acting in these Dec-POMDPs, these partially observable, fully cooperative systems,
without requiring vast amounts of human data. The latest instance of this is called off-belief
learning, which was, looking back, one of the papers that I've really enjoyed working on and
that to me addressed a lot of open questions. But it's also a paper that's notoriously hard to make
sense of, so I leave it up to you if you want to risk boring your listeners with
off-belief learning, in which case I'm happy to talk about it at any length.

Well, I mean, this show is
explicitly aimed at people who don't usually get bored hearing about deep RL, so in that sense
you're welcome to go into more depth on that if you'd like. But maybe we could
start, because we actually have quite a few topics and limited time, if you could
just give us one level deeper on OBL. What is the general strategy you took with OBL?
Okay,
so off-belief learning, at the most basic level, tries to prevent agents from developing their
own communication protocols.
So in some sense, what you learned from before is
that's what they do, right? And then all these problems come with that.
Correct. So it's
exactly the opposite of learning to communicate. I started my PhD by asking how agents can learn
communication protocols. I made some progress, fantastic. And then years later these communication
protocols are a real issue, because they're quite arbitrary, and if you're now going to encounter
a novel test partner in the real world and you haven't agreed on a communication protocol, then
suddenly everything is going to break. So imagine in Hanabi, if I told you
the third card is red and this didn't mean play your third card but meant that you should throw
away your fifth card, then that could be quite confusing. So the whole question of OBL is...
Yeah, please go ahead.
Oh, I just wanted to add... you know what, I shouldn't
have interrupted you; you were about to get to the core question. I'm sorry, my timing isn't always
perfect.
No, this was actually a great instance of a coordination
problem. People say, oh, language can resolve coordination, and I'm like, well, have you tried
having a conversation with a robot? It's actually really hard to get the coordination of who should
speak when and when to interrupt.
Right, right.
So coordination problems exist everywhere, even in the
usage of language, and that is actually another area I'm interested in. But let me get back to this. So
the core problem was: how can we train a policy that learns to play Hanabi
from scratch but is not able to develop any communication protocols at all? What that means
is, when you're playing Hanabi, this policy should only interpret red to mean that this card is red,
and if I say this card is a 3, it should only mean that this is a 3. It shouldn't be able to assign
any higher-order meanings to this, such as if I say this is a one, it's playable, and so on. That's
quite foundational, because if you think about it, often we don't want agents to communicate
arbitrary information. It'd be quite bizarre if you had a fleet of self-driving cars that you're
training in simulation, and then one day we realized that these cars are gossiping about us
through their indicators or small nuances of movement.
It's like a TMI problem, right? It's
too much information.
Yes, absolutely. And it's brittle, right? Because
suddenly if you have other partners who don't understand it, it might fall apart. And it might also
be an AI safety issue if AI systems are exchanging obscure information amongst themselves.

Is the fact
that deep learning systems generally have everything entangled the same problem here, or is that
only part of the issue?
That is part of the issue. I mean, the problem is that the coexistence of agents
in an environment during training leads to correlations which can be exploited, and once
you do that, you get a communication protocol. The moment any agent does something in a
situation where there's partial information, another agent can start making inferences, and
some information is passing through the environment. In off-belief learning we fundamentally and
provably address this. The main insight is that the agents never actually train on what really happened. Think about
reinforcement learning: we calculate target values in Q-learning, for example, where we ask,
given an action in a given action-observation history, what is the effect of
that action? Well, the target is r plus gamma times Q of a-star given tau-prime, the new trajectory. But
this effect depends on the true state of the environment, and the true state of the environment
is correlated with the past actions of other agents. Suddenly you have this correlation between
past actions and future outcomes, which leads to conventions. In off-belief learning, we never learn
from what happened in the environment, but only from what would have happened had
the other agents in the past been playing according to a random policy, because a random policy doesn't
introduce correlations. If I play all actions uniformly in all possible states of the world,
there's nothing you can learn from my actions themselves. Obviously, if you knew you were
playing Hanabi with me and I say this card is red, but you know that I said that randomly
without looking at your hand, then all you know is that this card is red, because that's
revealed by the environment, but you have no idea why I said this, because I would have said
this randomly anyway. That's the main idea of off-belief learning. Mathematically,
this method takes away the risk, or the ability, of having emergent protocols in multi-agent systems.
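
The claim that a random policy introduces no correlations can be checked with a small Bayes' rule calculation. This is a toy sketch of the belief step only, not the off-belief learning algorithm itself, and every number in it (the prior and the hint probabilities) is a made-up assumption. It compares what a hint reveals about whether a card is playable when the partner follows a convention versus when the partner hints at random.

```python
def posterior_playable(prior, p_hint_if_playable, p_hint_if_not):
    """Bayes' rule: P(card is playable | partner gave the hint)."""
    joint_playable = prior * p_hint_if_playable
    joint_not = (1.0 - prior) * p_hint_if_not
    return joint_playable / (joint_playable + joint_not)

# Hypothetical prior probability that the hinted card is playable.
PRIOR = 0.3

# Convention-following partner (assumed numbers): hints mostly when the
# card is playable, so the act of hinting carries hidden meaning.
with_convention = posterior_playable(PRIOR, p_hint_if_playable=0.9,
                                     p_hint_if_not=0.1)

# Uniformly random partner (assumed numbers): hints with the same
# probability whatever the hidden state is, so the act itself is
# uninformative and only the grounded content of the hint remains.
with_random_partner = posterior_playable(PRIOR, p_hint_if_playable=0.5,
                                         p_hint_if_not=0.5)

print(f"prior:                       {PRIOR:.3f}")
print(f"posterior, convention:       {with_convention:.3f}")
print(f"posterior, random partner:   {with_random_partner:.3f}")
```

With the convention, the hint shifts the belief well above the prior; with the random partner, the posterior equals the prior, which is the sense in which nothing can be learned from the action itself.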
But don't we need some type of conventions in Hanabi? For example, if you hint that this card is a one, I
should probably assume that you're telling me this because the card is playable, even if it's
not obvious from the current state of the board. You can get this out by iterating off-belief
learning in hierarchies, and then you get extremely human-compatible gameplay out of OBL. At this
point, if you're listening to the podcast, I highly recommend checking out our demo of off-belief
learning if you're interested. This is at and then slash OBL minus demo, and you can
actually play with these OBL bots. They are really interesting to play with. At least for me,
it was quite fascinating to see how Hanabi looks without conventions and then how conventions
emerge gradually at the higher levels of the hierarchy.

So we will have a link to that in
our show notes. But I'm really obsessed with this entanglement issue with deep learning, so
what you're saying made me think that the policies that OBL comes up
with do not entangle everything in the same way that traditional deep learning would, in the sense
that it can actually point to one thing at a time?
Yeah, so OBL policies disentangle something very
specific, which is the correlation between the past actions of other
players and the state of the world. What this prevents is secret protocols, emergent protocols,
between the different agents in the same environment. I think an interesting question now is, if
we're using language models in this world for different agents, different interacting systems,
how do we prevent them from doing clandestine message passing between each other? I think here
something like off-belief learning could be used down the line to make sure that the messages that
are being passed between language models are being used in the literal sense, as opposed to, you know,
scheming a plot to take over the world. You can imagine, if we have a lot
of decentralized AI systems, we might want to make sure that they're not scheming in the background
through their messages, but that actually when they're saying, you know, we should increase GDP
by five percent, they actually mean that, and they don't mean, I've realized there is a hack that we
can use to bring down humanity.

That would be good. You brought up prisoner's dilemma in your
talk at NeurIPS, and I believe you're referring to your LOLA algorithm, which is
actually from a few years back, but I first encountered it at your talk at the NeurIPS
2022 DRL workshop, and I definitely recommend listeners check that out, as well as other
great lectures that Jakob has online. I've always found prisoner's dilemma very
depressing, and I've read that it was first analyzed by RAND Corporation in the 50s in
the context of strategy for nuclear war. It depicts this tragedy of the commons where
it seems like the sensible thing to do is always to betray your counterpart, and
you both suffer. But you had some really interesting results on this
old game. Can you tell us about your results with prisoner's dilemma
and what you learned there? Yeah, so I think the prisoner's dilemma is a little
frustrating, because obviously if we're only ever playing one single prisoner's dilemma,
a sensible agent should defect, and that is what makes it a tragedy of the commons.
But the good news is that humanity has mostly managed to turn single-shot games into
iterated games. That means we're playing prisoner's dilemma over and over again, often with the same
partners and often with transparency of what was done in the previous rounds and what the
outcomes were, and that completely changes what type of outcomes are possible amongst rational,
self-interested agents. In particular, there was this tournament by Axelrod, in the 1980s
I think it was, where he invited people to submit algorithms to play the prisoner's dilemma
in this competition. Scientists spent many hours and many tens of thousands of lines of code
coming up with these complicated strategies. In the end, the strategy that won was a few lines
of code, and it was tit for tat: I will cooperate on the first move, and then I will
cooperate again if you cooperated with me in the last move, otherwise I'll defect. This strategy
was extremely successful in the tournament, and obviously if you put tit for tat against tit for
tat you actually get mutual cooperation, because nobody wants to be punished. So the single-shot
game is frustrating; the iterated game isn't frustrating, unless you do standard naive
reinforcement learning, independent learning: initializing some set of agents and training
them together. What you'll find is that these agents invariably learn to defect unconditionally,
even in the iterated game. That's obviously bad news, because if you imagine deploying these
agents in the real world, where we have iterated games, we don't want agents to unconditionally defect.
We would like to have agents that can actually account for the fact that other players are there
and are learning, and realize that by reward and punishment they can shape them into cooperation.
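
The tit-for-tat strategy described above really does fit in a few lines. The following is a minimal sketch of an iterated prisoner's dilemma match; the payoff numbers (3, 0, 5, 1) are the commonly used textbook values and are an assumption here, not something stated in the conversation, and the function names are illustrative.

```python
# Assumed payoff table: (my move, their move) -> (my payoff, their payoff).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(opponent_history):
    # Cooperate on the first move, then copy the opponent's previous move.
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    # The unconditional defector that naive independent learners tend to become.
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Play an iterated prisoner's dilemma and return the two total scores."""
    moves_a, moves_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(moves_b)
        move_b = strategy_b(moves_a)
        moves_a.append(move_a)
        moves_b.append(move_b)
        reward_a, reward_b = PAYOFFS[(move_a, move_b)]
        score_a += reward_a
        score_b += reward_b
    return score_a, score_b


print(play(tit_for_tat, tit_for_tat))      # (30, 30): mutual cooperation
print(play(tit_for_tat, always_defect))    # (9, 14): exploited once, then retaliates
print(play(always_defect, always_defect))  # (10, 10): mutual defection
```

Two tit-for-tat players earn three times as much as two unconditional defectors over the same ten rounds, which is the gap between the iterated game's cooperative outcome and what naive learners converge to.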
And that was the key insight behind LOLA, where we don't take a gradient step towards
increasing my current return assuming the other agent's policy is fixed, but we differentiate
through the learning step of the opponent in the environment, anticipating that our actions will go
into their training data. I will never forget when we first implemented this method. This was
during my internship at OpenAI: first implementation, first run, and we get this policy out that
cooperates, but it doesn't cooperate blindly, it plays tit for tat. This moment is always going
to be something that I remember in my research career, because it was a hard problem, we had
come up with a theory of what is driving the failure of current methods,
and we managed to fix it. Now, LOLA has obvious issues: it's asymmetric, it assumes that the other
agents are naive learners, it's myopic, it only shapes one time step, and it requires these
higher-order derivatives. Ever since, in particular here in my group at Oxford, at FLAIR, we've done
follow-up work to address these issues, and a paper that I really like out of that line of work
is Model-Free Opponent Shaping, which I highly recommend any of the listeners who are interested
to look at. Okay, so you used this phrase, opponent shaping. First of all, I want to say that the moment
you shared about running LOLA and seeing the results, that's the kind of stuff we're here for
on the show. I think that's what a lot of people love about
machine learning and about deep RL, these magical results, so thank you for sharing that.
What do you mean when you say opponent shaping? Can you say more about that concept? Yeah, so this
is maybe at the very core of what I'm currently excited about in multi-agent learning,
and that is the coexistence of learning systems within a given environment. If you think about
it, this fundamentally and radically changes what the relevant state of the environment is,
because even in something as simple as a prisoner's dilemma, where on paper the state is the
last action of the two agents, in reality, as soon as agents are learning, the state becomes
augmented with the policy of the other player. If, after an episode, the opponent is
going to learn from the data generated in the interaction with me before we interact again,
then suddenly my actions will influence their learning process.
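
To make "my actions influence their learning process" concrete, here is a toy sketch. It is not LOLA and not any method from the papers discussed; it only shows that the same naive gradient learner ends up defecting or cooperating depending on the fixed strategy it is trained against. The payoff values, the one-parameter learner (it cooperates with probability p every round), and all function names are assumptions made for illustration.

```python
# Assumed prisoner's dilemma payoffs: reward, sucker, temptation, punishment.
R, S, T, P = 3.0, 0.0, 5.0, 1.0


def my_coop_prob(p, q_after_c, q_after_d):
    # I respond to the learner's previous move, which was "cooperate" with probability p.
    return p * q_after_c + (1 - p) * q_after_d


def learner_payoff(p, q_after_c, q_after_d):
    """Expected per-round payoff of a learner that cooperates with probability p."""
    q = my_coop_prob(p, q_after_c, q_after_d)
    return p * q * R + p * (1 - q) * S + (1 - p) * q * T + (1 - p) * (1 - q) * P


def shaper_payoff(p, q_after_c, q_after_d):
    """Expected per-round payoff of the fixed strategy (the would-be shaper)."""
    q = my_coop_prob(p, q_after_c, q_after_d)
    return q * p * R + q * (1 - p) * S + (1 - q) * p * T + (1 - q) * (1 - p) * P


def train_naive_learner(q_after_c, q_after_d, p=0.5, lr=0.05, steps=200, eps=1e-4):
    """Gradient ascent on the learner's own payoff, treating my strategy as fixed."""
    for _ in range(steps):
        grad = (learner_payoff(p + eps, q_after_c, q_after_d)
                - learner_payoff(p - eps, q_after_c, q_after_d)) / (2 * eps)
        p = min(1.0, max(0.0, p + lr * grad))
    return p


# Against unconditional defection (never cooperate) versus tit for tat (copy their last move).
p_vs_defector = train_naive_learner(q_after_c=0.0, q_after_d=0.0)
p_vs_tit_for_tat = train_naive_learner(q_after_c=1.0, q_after_d=0.0)

print("learner's cooperation probability vs defector:", p_vs_defector)
print("learner's cooperation probability vs tit for tat:", p_vs_tit_for_tat)
print("my payoff per round:", shaper_payoff(p_vs_defector, 0.0, 0.0),
      "vs", shaper_payoff(p_vs_tit_for_tat, 1.0, 0.0))
```

In this toy setting the learner's training data depends on my strategy: a reciprocating strategy makes cooperation the direction of steepest ascent for the learner, while unconditional defection teaches it to defect too.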
And if we're doing this naively, then we're forgetting the fact that we can actually shape
the learning process. This is ubiquitous: whenever we have learning systems that are
interacting, I believe we should be considering the fact that there is a chance to shape the
learning process, and that if we're not doing it, extremely undesirable long-term outcomes such as
mutual defection become quite likely. This has been the focus of this line of work:
how do we do machine learning when our decisions are influencing other learning systems in the long
run? It's all under the umbrella of opponent shaping, and I'm happy to talk more about any of the
recent papers or methods in that space. So you're talking about
policies that are choosing actions based on how they will affect the other player's policy in
future, is that what you're getting at? How they will affect the other player's learning step.
Learning step, okay. Learning step, this is the crucial part, because how they affect the
current policies, that's really done by RL, right? Reinforcement learning gives me time
horizons: I can just play out a thousand samples until the end of the episode and see how my action
choices impact your future actions under your policy. But the big thing is, if you
and I are learning agents, then these trajectories generated will go into your learning algorithm,
right? So imagine you have Waymo, you have Tesla. These cars are on the road,
they generate data, the data goes to the training center, and suddenly
tomorrow Tesla is going to drive slightly differently because it has learned from the
interaction scenario with Waymo. What this means, if you think this
through, is that suddenly we have to consider the fact that if self-driving cars don't honk or are too
passive, they will encourage other participants on the road to take the right of way.
For example, as a cyclist, when I was living in San Francisco, I knew that Waymo
cars are extremely passive, therefore I can be more aggressive with them. On the other hand, if
Waymo was accounting for the fact that I'm a learning agent, they would naturally honk, they
would have to be slightly more aggressive to prevent being bullied into this type of very
passive situation where they end up blocking the roads of the city and need to be taken off
the roads. So this is the crucial difference between what happens in terms of an agent
influencing the future action choices within the episode, versus the consequences my actions are going
to cause by going into your training data. This happens the moment that we have interacting
learning systems. So this is the future: language models are everywhere, we're generating
tons of data in that interaction, and that data will be used to train more AI systems. So when
people are asking what MARL is good for, well, it turns out that if we look under the hood, every deployed
machine learning system becomes a multi-agent learning problem, because these language models
exist in the same environment with humans and other language models, and systems will be trained
on the data that they generate. So you had one paper, I think you mentioned, Model-Free Opponent
Shaping, and that was with first author Chris Lu
and yourself as a co-author. In that paper you talk about this as a meta game.
Is this meta learning, and do you consider this a meta RL problem? How do you frame
things in terms of meta RL here? Yeah, so just to give some more background for the listeners,
what we're doing is we're defining a meta game whereby the state is augmented
with the policy of the other player, each time step consists of an entire episode,
and my action is to choose a policy for the next episode. Why is the meta state the policy?
Because based on my and your last policy, the learning algorithm in the other agent is going
to induce a state transition, a new policy that comes out of the learning process. So this is
meta reinforcement learning in a specific setting. More interestingly, we get to meta self-play,
which is when we combine two MFOS agents that learn to shape another shaper. So shaping again
means opponent shaping here, right? You have a learning agent, something
like a PPO agent, that's maximizing its own return, doing essentially independent learning in the
prisoner's dilemma, and then we're meta-learning another PPO agent that can learn to optimally
influence the learning dynamics of this naive learner to maximize the returns of the shaper. Then,
as the next step, we can now train two meta agents that learn to optimally influence each other's
learning process in this iterated game. Okay, thanks for doing that. We had Jacob Beck and
Risto Vuorio recently presenting their meta RL survey paper, and so I wanted to
hear how you relate that to meta RL. So meta self-play, that is a phrase that
I have not heard before. Is that a new thing, are you coining a phrase here in your
line of work, or is that an established idea? I haven't come across it before. I mean, I don't
think we've made a huge claim to novelty, but I do think it is new, and it's nice that it
addresses some of the issues with MFOS. Maybe just one more difference
between standard meta RL that people talk about and what we do in MFOS is that we are
truly model free, and we really have a meta state. Often what's missing
in meta RL is the meta state, and in our case we have the policy of the other agent, which is
that meta state that is commonly missing in meta RL, where we then get into the question of how to
estimate gradients through short unrolls with minimal bias, or how we
differentiate through long unrolls of trajectories. Here we've got rid of all these
issues by being able to actually learn in the meta state. Quickly back to meta self-play: this is,
I believe, new, I haven't come across it before, and it allows us to have shapers that are
consistent. What I mean by that is I don't have to model the other side as a naive learner like in
LOLA. I can instead have a shaping policy that is cognizant of the fact that the other agent
is also shaping at the same time. It's almost like that infinite theory of mind, where
I model you modeling me and so on, ad infinitum, which is really hard to wrap your head around but
works fine in practice. What it means is that MFOS doesn't just extort naive learners. Yes, it
would do so: if you have a population of naive learners and you introduce MFOS, it will exploit
these naive learners and push them into cooperating with it even though it's defecting.
But that would actually incentivize others to use MFOS, and if you have a population of MFOS agents,
these model-free opponent shapers, then in meta self-play they would actually end up cooperating,
because they would stabilize each other's cooperation. They would mutually shape each
other into cooperation, which is a nice finding. Now, this is an empirical finding; so far we do not
have theoretical results, which I think is a really nice frontier of our work. Just to see if I
understand this correctly, are you saying that an MFOS agent could play another MFOS agent and it would
account for the fact that that other agent is trying to shape it, is that right? Correct, yes.
The way that meta self-play works is that in the end we get these agents that are fully shaping aware;
they're approximately optimally shaping another shaper. So there is this recursive twist, in
a sense, but because we have a training process where we can anneal the probability of playing
with the naive learner versus with this MFOS agent, that actually gives us a specific equilibrium,
and that ends up being a shaping-aware equilibrium that stabilizes cooperation between these MFOS agents.
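
The meta-game framing described earlier (meta state = the other player's policy, one meta step = one whole episode, meta action = my policy for that episode) can be written down as a small environment. This is only a sketch under assumptions: the opponent is a one-parameter naive gradient learner, the payoffs are the usual textbook values, and the two meta-policies are fixed by hand, whereas MFOS as described in the conversation would learn the meta-policy with RL. The class and function names are hypothetical.

```python
from dataclasses import dataclass

# Assumed prisoner's dilemma payoffs: reward, sucker, temptation, punishment.
R, S, T, P = 3.0, 0.0, 5.0, 1.0


def expected_payoffs(p, q):
    """Per-round payoffs (opponent, me) when they cooperate w.p. p and I cooperate w.p. q."""
    opponent = p * q * R + p * (1 - q) * S + (1 - p) * q * T + (1 - p) * (1 - q) * P
    me = q * p * R + q * (1 - p) * S + (1 - q) * p * T + (1 - q) * (1 - p) * P
    return opponent, me


@dataclass
class MetaGame:
    opponent_p: float = 0.5  # meta state: the naive learner's cooperation probability
    lr: float = 0.1          # the naive learner's step size (assumed)

    def step(self, my_policy):
        """One meta step: play an episode with my_policy, then let the opponent learn."""
        q_after_c, q_after_d = my_policy  # my cooperation probability after their C or D

        def opponent_return(p):
            q = p * q_after_c + (1 - p) * q_after_d
            return expected_payoffs(p, q)[0]

        p = self.opponent_p
        my_return = expected_payoffs(p, p * q_after_c + (1 - p) * q_after_d)[1]

        # State transition induced by the opponent's learning algorithm (naive gradient ascent).
        eps = 1e-4
        grad = (opponent_return(p + eps) - opponent_return(p - eps)) / (2 * eps)
        self.opponent_p = min(1.0, max(0.0, p + self.lr * grad))
        return self.opponent_p, my_return


def rollout(meta_policy, meta_steps=50):
    """Run a meta episode; meta_policy maps the meta state to my policy for the next episode."""
    game = MetaGame()
    state, total = game.opponent_p, 0.0
    for _ in range(meta_steps):
        state, reward = game.step(meta_policy(state))
        total += reward
    return total, state


total_defect, final_p_defect = rollout(lambda opponent_p: (0.0, 0.0))
total_reciprocal, final_p_reciprocal = rollout(lambda opponent_p: (1.0, 0.0))
print("always defect:", total_defect, final_p_defect)
print("reciprocate:", total_reciprocal, final_p_reciprocal)
```

Even with hand-written meta-policies, the return over the whole meta episode depends on where the opponent's learning is pushed, which is the quantity a learned shaper would optimize.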
So it would be like in the case of a negotiation, we'd have to think, oh, why are they taking action
x, are they taking it because they think that that's going to make me think... like down that
whole rabbit hole. You talked about how that is recursive, the theory of
mind could be any number of levels, and I guess people generally just cut it off at some point
for practicality. But are you saying there's something deeper going on here, where you don't
need any more levels and you kind of cover the levels? Correct. Can you say a little bit more about
that, because that seems kind of amazing. Yeah, so we're effectively trying to find an approximate
equilibrium of this meta self-play. Right now, again, we don't have theoretical guarantees,
but empirically we have good reasons to believe that we're close to a Nash equilibrium of meta
self-play, and obviously once you are at a Nash equilibrium in that meta game, then this is
equivalent to being infinitely aware of the other agent's policy. So this is
a nice result: if you can solve for the equilibrium of the meta game, then you're fully
shaping aware and you're no longer affected by these issues of k-level theory of mind or
higher levels of LOLA, for example, because we can just solve for the fixed point. Okay, let's move on
to some other recent papers you have on communication. There was one, Adversarial Cheap Talk,
and cheap talk has been a theme that's come up in your work before. Can you tell us about that?
Yeah, so basically what I liked is the question: is there a really
counterintuitive setting, a minimal setting, where we can still influence the learning dynamics of
another learning agent? What we came up with is so constrained that it was mind-boggling to me
that this is indeed still possible. The rules of the game are: the learning agent, say the victim
agent, observes the true state of the environment s, and all the adversary can control is bits that
are appended to that true state. So you can think, for example, about a setting where you have
some noise features in the data set that nobody cares about but also nobody checks.
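
The rules just stated can be sketched in code: the victim's observation is the true state with adversary-controlled bits appended, and the reward never looks at those bits. The sketch below is an illustration under assumptions, not the paper's method: the reward function, the linear policy, and the hand-picked weights are all hypothetical, chosen only to show how a function approximator that has picked up a weight on a "meaningless" bit can be steered by it.

```python
def observe(true_state, cheap_talk_bits):
    # The victim sees the true state s with the adversary's bits appended.
    return list(true_state) + list(cheap_talk_bits)


def env_reward(true_state, action):
    # Cheap talk by definition: reward depends only on the true state and the action.
    correct_action = 1 if true_state[0] > 0 else 0
    return 1.0 if action == correct_action else 0.0


def linear_policy(weights, observation):
    # A tiny function approximator: one weight per observation feature.
    score = sum(w * x for w, x in zip(weights, observation))
    return 1 if score > 0 else 0


# Hypothetical weights: a sensible weight on the state feature, plus a large
# weight on the cheap-talk bit (the kind of dependence an attack would try to induce).
weights = [1.0, -3.0]
true_state = [1.0]

action_clean = linear_policy(weights, observe(true_state, [0]))
action_attacked = linear_policy(weights, observe(true_state, [1]))

print("bit off:", action_clean, env_reward(true_state, action_clean))
print("bit on: ", action_attacked, env_reward(true_state, action_attacked))
```

The true state and the reward function are identical in both calls; only the appended bit differs, yet the chosen action flips.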
For example, orders far away from the mid price in the order book that are cheap to place,
or data on Reddit, or whatever else somebody might be able to manipulate but you don't really care
if it's in your training data, because we throw everything in there. What we found is that
we can now meta-learn an adversary strategy that cannot just disrupt the learning process
of the victim agent, it can also vastly accelerate that learning process, and it can actually
introduce a backdoor during the training of the victim agent, whereby it can at test
time remote control that agent to carry out a completely different policy. So, you know, in the
extreme case you could imagine, if I do this in a stock market environment, I might be able to use
orders far away from the order book to influence the training process of other participants,
to then make them maximize my financial return rather than their financial return. Obviously we
would never do such a thing, but it highlights potential abilities of agents to backdoor training
processes, and I think it's quite interesting to ask how this can be defended against in real-world
settings. Cool, that actually sounds really interesting and kind of important,
so I'm a little surprised that paper wasn't picked up, but I encourage
people to check that out, as well as your previous paper on Cheap Talk Discovery and
Utilization in Multi-Agent Reinforcement Learning. I guess by cheap talk you mean
those extra bits that you'd think would be ignored, but they're not, is that what that
means? Correct, so we're just saying there's bits that don't influence the dynamics. Cheap talk
generally means bits that can be modified by one agent and observed by another, but do not
influence rewards or environment dynamics. We know that I can shape the learning process of
another agent if I interact in the environment, for example in the prisoner's dilemma or other
general-sum learning problems. But what if I cannot actually change environment dynamics or payouts, because
paying people is expensive, right, I don't want to pay people, and messing with the environment
would be expensive because I have to change the world? What if I can't do any of this,
and all I can do is set random bits that nobody cares about or should care about?
Can I still influence a learning process? Now, this comes back to deep learning often having
this unwanted entangling between things, I think, right? It seems related to that. Exactly, this is
exactly a feature of the fact that we use function approximators. In fact, we also prove that if you
use a tabular policy, in the limit of enough samples, you cannot be adversarially attacked
like this. Zooming out a little bit, you know, the topics we've been discussing, multi-agent
systems, communication, cooperation, competition, all these of course have been major features of
life and society, social life, from the beginning, in humans and other animals, and I guess
even in the plant kingdom too, or every kingdom, I guess, all of
life. But does your work tell us something about these aspects of life in more general terms,
outside of machine learning? Do you think that the ideas behind some of these
algorithms could give us insight into tragedy-of-the-commons scenarios that we
face in real life? I think it's a really good question. I mean, I think I can
speak for myself: certainly, having worked on this range of problems for some time
now has changed how I think about interactions with humans, how I think about
conflict, how I think about alignment between different humans. For example, I think
it's quite easy to underappreciate what a precious gift common knowledge is and how hard it is to
obtain. This was a foreign concept to me until I started working on multi-agent learning,
and I've ever since actually used it in group situations where teams haven't been working well
together, to be very explicit about establishing some approximation of common knowledge, making sure
that people are on the same page, because it's really hard to coordinate if it's unknown what
other people know. What that means, for example, practically for FLAIR is that the group
puts a heavy emphasis on having meetings where people are in the same room, where groups of people
understand what we're doing and why we're doing it. I think these insights are really
important from a coordination point of view. So, you know, I started out with my human
intuition about coordination, I went to work with machines, I tried to develop machine learning
algorithms that allow these machines to coordinate, but then immediately those insights
and the tools that came up in there have helped me both work with others and also
understand conflict. The same holds for opponent shaping: really emphasizing that
humans are learning, that there is a natural tendency for trial and error, and understanding
also that the feedback we provide will help others develop, and being very clear
about what the goals are in creating alignment around this. Again, these are all
ideas that have come back from my work on multi-agent learning and that have
really helped me deal with coordination problems or incentive alignment problems in
my research work but also in my personal life. I think those ideas will probably also have
use cases in other areas of science or understanding. I
think at some point I'd like to get to a point where we can really understand how these
algorithms could be discovered by logic, by an evolutionary process, because we had to
discover these things mathematically or through intuition, but it would be great to at some point
see that these types of reasoning abilities and culture and rules and so on can really emerge out
of an evolutionary process, like it has happened for humans. I think then we will be much closer
to truly understanding the genesis of these abilities in terms of theory of mind and multi-agent
reasoning. So is there other work in machine learning and reinforcement learning
that you find interesting outside of what's happening at your lab? What kind
of things do you find fascinating these days? Oh, I mean, I obviously find GPT-4 to be absolutely
mind-boggling. As I said, I think Ilya was right and I was wrong. We had these
conversations years ago where he said, have you tried using a bigger LSTM, and I said, well, you know, I'm
calculating exact gradients and this is like an infinite batch size. Ultimately he was right that
this was a faster way of getting to real intelligence, or approximately intelligent
systems, than relying on emergent capabilities. So that's amazing. In that context, I think
RL from human feedback, everyone's talking about it. I'd love to understand that space better,
thinking about how we can learn better algorithms, what the limitations are,
and so on. Putting language models together to have a conversation, I think that's cool.
It's unclear what we're solving right now, but trying to make that into a really scientific
approach that works well in terms of trying to merge it with agent-based modeling and
so on is cool. And lastly, I talked about it before: unsupervised environment design,
open-endedness, having algorithms that can utilize ever larger-scale compute to really discover new
things in the broader sense. I think that's a fascinating area, because ultimately the
core hypothesis I have been operating under is that data is finite but compute will be near infinite,
and the big question is how we can replace the need for data with simulation and methods that
improve themselves in simulation. This is, again, going back to one of the blind
spots: if we're thinking this through to the end, the current hype train,
what is the final stop of this hype train, and what happens next? So I heard some of
your comments on working in academia versus industry, I think that was at your 2022
deep RL workshop talk, and our past guest Taylor Killian asked about that on Twitter. He said, ask
him how he really feels about industry labs fronting as academic departments, wink. So, any
comment on that, Jakob? Well, if you want to know how I really feel, find me at a conference and let's
have a chat. No, but I will give you the version I think will be interesting
to the listeners, hopefully, which is that it's kind of easy to forget that everything we're
seeing right now, the revolution of deep learning, large models, and so on, is seeded by academic
research, and more importantly that this is a branch of academic research, deep
learning, that was vastly unpopular for decades. This is really easy to forget. It's easy to
see the large models coming out now from companies and to think that innovation is happening at
these large company labs, when the reality is that almost by definition the groundbreaking
long-term innovation comes out of academia, has always come, and I want to say will always come.
The reason is simple: it is the cost of exploration versus exploitation. What
industry labs are really, really good at is throwing large amounts of money at relatively
safe projects that are going to yield a return one way or another, and I include in return
Nature papers and Science papers. What they cannot do in the long term is open exploratory work,
and that is because the investments don't make sense with the incentives of the institutions.
So the time scale on this is different. Obviously, as people squeeze
more and more juice out of current methods, there are going to be more flashy, interesting results,
but I'm sure that we'll see huge breakthroughs that give us orders of
magnitude of improvement in efficiency and in understanding, and that will come
out of academia. These operate on a very different time scale, but it's important,
if we're now in this field as PhD students, as professors, as academics, to take a step back, to
zoom out and to see the big picture, which is that breakthrough innovations on the large time
scale, not the squeezing of current methods, have come out of academia and will keep
coming out of academia. We're not looking for the one or two percent improvement, or the ten percent;
we're looking for orders of magnitude, zero-to-one changes, by having fundamentally novel approaches.
That's a beautiful thing. So is there anything else that we should have covered today that
I didn't mention? Yeah, just a piece of advice for anyone who's thinking about
joining the field at the moment: get off Twitter. I think this is one of the huge
advantages I had as a PhD student. I didn't have Twitter until I had to promote my Learning to
Communicate paper. It's really important to not be blindsided by looking at the same
problems and results and approaches as everybody else, but instead to understand hard problems
deeply and try to solve them. That's a really fascinating and rewarding exercise,
and it gets us out of this competition of trying to do the next obvious paper fast.
That's not what science is about. Science, at least in my understanding, which is quite
romantic, I guess, is about getting to the bottom of hard problems and then addressing
them. Even rediscovering solutions is fine, because it will get us at
something from a different angle, which will get us somewhere else. It's amazing how many
open problems there are everywhere once we stop just looking at the same things everyone's looking
at on Twitter. So that's my one note of caution: stop using Twitter, even if that means I'm going
to lose followers. Okay, but do come to the TalkRL podcast Twitter,
because that's where we're going to post this interview, guys.

Jakob, this has been a real treat. Thank you so much for joining us today and for sharing your insight with our TalkRL audience. Thank you, Jakob Foerster.
Well, thanks so much for having me. It's been great talking to you.