TalkRL: The Reinforcement Learning Podcast

AI Generating Algos, Learning to play Minecraft with Video PreTraining (VPT), Go-Explore for hard exploration, POET and Open Endedness, AI-GAs and ChatGPT, AGI predictions, and lots more!

Professor Jeff Clune is Associate Professor of Computer Science at University of British Columbia, a Canada CIFAR AI Chair and Faculty Member at Vector Institute, and Senior Research Advisor at DeepMind.

Featured References

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos [ Blog Post ]
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, Jeff Clune

Robots that can adapt like animals
Antoine Cully, Jeff Clune, Danesh Tarapore, Jean-Baptiste Mouret

Illuminating search spaces by mapping elites
Jean-Baptiste Mouret, Jeff Clune

Enhanced POET: Open-Ended Reinforcement Learning through Unbounded Invention of Learning Challenges and their Solutions
Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeff Clune, Kenneth O. Stanley

Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions
Rui Wang, Joel Lehman, Jeff Clune, Kenneth O. Stanley

First return, then explore
Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, Jeff Clune

Creators and Guests

Host

Robin Ranjit Singh Chauhan

🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

I'm very excited to have Professor Jeff Clune with us today. Jeff is an Associate Professor
of Computer Science at the University of British Columbia. He's a Canada CIFAR AI chair and
faculty member at Vector Institute. He's also a Senior Research Advisor at DeepMind. Welcome
to the show. Thanks so much for being here, Jeff Clune.
It's a pleasure to be here. Thank you for inviting me.
So how do you like to describe what it is that you do?
That's a great question. Nobody's ever asked me that before. I think the shortest answer
is that I'm a scientist first. I'm very curious and I want to understand the world. I want
to do excellent experiments and understand the world more. And then I am also a communicator
and I love to share what I have learned or what my colleagues and I have learned with
the world. And that happens through scientific publications, through talks, through podcasts,
but also in the classroom.
So over the years, every so often I come across one of your works and it always strikes me
that you have a really unique approach. It seems like your goals are always very ambitious,
super ambitious, which is really exciting. And then also you seem like you're not afraid
to discretize and hash things, which I think some people shy away from. Another thing I've
noticed is that your style is often very compute heavy, like you even explicitly talk about
algorithms that never end. Do you think that characterization is fair to say?
I totally agree with the first and the third. I love to swing for the fences and I'm most
interested in ambitious ideas that might not even be ready for current technology, but
might have to wait a few years or even decades to be fully realized. And that involves
using as much compute as I can get my hands on. I'm not quite sure what you mean by the second
one though, so I won't commit to that just yet.
Okay, no worries. I was just noticing that some algorithms like Go-Explore and MAP-Elites
use a discretization like discrete buckets or hashing as part of the algorithm, which
seems like it's not all that common and some people shy away from, but you get cool results
by doing that.
I would actually say I don't like those aesthetically. I would try to avoid them. Usually that is
a stand-in for what will work now to prove the other principles we're trying to show,
but ultimately we want to move away from that to something that's more learned end-to-end
and therefore likely to be like continuous vector spaces, latent embeddings, things like
that.
Let's jump into one of your works here, VPT. That was video pre-training, learning to play
Minecraft with video pre-training. I found this work really interesting and fun. It combines
Minecraft and RL in a unique way. Can you give us a really brief rundown on what's going
on in this work?
Yeah, absolutely. But before we talk about VPT, I want to quickly acknowledge that this
was the work of my team at OpenAI. It was a wonderful set of collaborators and everyone
should look at the blog post and or the paper for the full list of amazing collaborators.
I think one of the major lessons from the last, call it eight years of machine learning
is that AI sees further by standing on the shoulders of giant human datasets to borrow
from Newton. We've seen this with GPT and DALL-E and all the other incarnations of this
idea. The idea is that if you can go and pre-train on vast amounts of internet scale data, then
you have a huge leg up when you want to ultimately go solve a particular task like answer questions
or help somebody with customer service or generate an image. Often, the formula is that
you pre-train on the internet, which teaches you how to understand the world, learn features
about the world, et cetera, and then later you fine-tune to a particular task.
What we were seeing is that, hey, if we want to take reinforcement learning to really hard
problems like learning to play Minecraft or learning to use a computer or learning to
clean up a house or find landmines, these are what we call exceptionally hard exploration
problems, which is to say that typical RL is rather unintelligent. It does a whole bunch
of random trials and if it happens to do something that is rewarded, it knows how to lock that
in and do more of that thing. Of course, the problem is in really hard problems, random
exploration taking literally random actions will never solve the problem. For example,
if I want to have a robot that cleans my house or learns to play Minecraft and it just randomly
mashes the buttons or emits motor torques, it's never going to get a "good job" from me
in the house cleaning example, and in Minecraft it's never going to do something like make a
diamond pickaxe. This is a problem that I have focused much of my career on. You mentioned
Go-Explore, which is another algorithm that is in this space. We thought to ourselves,
all right, we could try to follow the historical path, which I have worked on and others have
worked on, which is to create intrinsic motivation or better exploration algorithms, but we also
could just take a page from the modern playbook and say, could we just go and learn from the
internet first how to accomplish lots of things in the world, whether it be in Minecraft or
robotics or whatever. We chose Minecraft as our domain. Once you generically know what
to do in the game, then we might give you a particular challenge, like get a diamond pickaxe,
and it would be really easy. The RL would actually have a chance of learning that because
it mostly knows how to play Minecraft. It just has to figure out what it is that we
specifically want in this particular situation. That's the gist behind VPT. What we said is,
okay, we're going to have an AI model, a large neural network. I think we used half a billion
parameters. It's going to watch YouTube, videos of people playing Minecraft on YouTube. We
went out and we got about eight years of clean video, which is to say video that doesn't
have logos or, say, footage of the person playing layered over the top. It
watches Minecraft videos for about eight years of sequential time. We obviously parallelized
that, and it just knows how to play Minecraft in general. Then at the end, we say, okay,
you're a relatively good Minecraft playing agent. Now we want you to go do a specific
task, which is to get a diamond pickaxe. Now, just to set the context for the listener,
that's a really, really hard task. It takes humans over 20 minutes and about 24,000 actions.
We're talking about actions like moving your mouse, clicking a button, et cetera. This
is orders of magnitude harder than many RL things that we've done in the past. This system
can basically do it. It learns to be able to go get that diamond pickaxe. If you didn't
do the pre-training, you have no hope. Almost no matter how long you ran your algorithm
within reason, even if you had a planet sized computer, you probably wouldn't learn
how to do this thing. That is the high level result. The one thing that I haven't mentioned,
which is essential to this work, is that for things like GPT and DALL-E or music generative
AI that you get on the internet, in almost all of the domains where this has been applied,
the labels come for free. They exist. GPT says, I'm going to read 100 words of an article
and predict the 101st word and then the 102nd word. Those words are right there in the original
article. Same with predicting musical notes or predicting pixels in an image. The difference
in trying to learn to act is that you are watching a video of somebody play Minecraft
and you see the camera feed of what is happening in the world as they act, but you don't see
the actual actions they took. If I watch somebody editing an image in Photoshop, I don't see
the keyboard shortcuts they're hitting. I don't actually know what they're doing with
their mouse, though I might be able to infer it from their cursor. That was an extra challenge
we had to address in the Minecraft case. We actually solved it with a pretty simple technique
that works very well and it works like this. We pay some people to play a couple hundred
or a couple thousand hours of Minecraft, which doesn't cost that much money. For those contractors,
we record the actions that they're taking while they're playing the game. Then we train
a model to look at a video of Minecraft and be able to look at the past and the future
– say the past two seconds and the future two seconds within a video. Its job is to
tell us the action that must have been taken in the middle. This is quite a simple task.
If you are watching Super Mario Brothers or Minecraft and all of a sudden you see the
character jump, you know somebody probably just hit the jump button. That's a pretty
simple task. You don't need a lot of data to learn that task. We just collect a very
small amount of what we call labeled data, which has the actions with the video. This
model has to then say for any little snippet of video, if I see the past and the future,
it's pretty easy for me to guess the action that must have been taken at time t in the
middle. Once I have that model, then I can go to YouTube where I don't have actions
and I just run this labeler and it just tells me all of the actions that were taken at every
time step in the video. Now I have a massive labeled dataset of Minecraft videos and the
actions that were being taken at every time step. Then I can just do standard behavioral
cloning, which is also called imitation learning, which is also exactly what GPT does, which
basically says go from the past, everything we've seen up until now, to what is the next
action I should take and then the next action and then the next action and then you can
roll this out autoregressively in a simulator and boom. That is the general formula in VPT
that allows first a zero-shot model that knows how to play the game. It can go out and do
things like make crafting tables and get wooden tools, but it also has kind of the core knowledge
that once you start doing RL on it, you can actually ask it to go learn to do a diamond
pickaxe and it can do quite well.
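As a rough illustration of the recipe described above (label a little data, train a labeler that sees past and future, pseudo-label the big unlabeled set, then do behavioral cloning), here is a minimal Python sketch on a toy one-dimensional world. Everything in it is a hypothetical stand-in and not the VPT implementation: a "frame" is just an integer position, `expert_action` plays the role of the human players, and simple lookup tables replace the neural networks.

```python
from collections import Counter, defaultdict

GOAL = 5  # hypothetical target position the demonstrators walk toward


def expert_action(pos):
    """Stand-in for a human player: step toward GOAL (-1 or +1), then stay (0)."""
    return (pos < GOAL) - (pos > GOAL)


def record_episode(start, steps=12):
    """Return (frames, actions); a 'frame' in this toy is just the position."""
    frames, actions = [start], []
    for _ in range(steps):
        action = expert_action(frames[-1])
        actions.append(action)
        frames.append(frames[-1] + action)
    return frames, actions


def train_idm(labeled):
    """Non-causal labeler: sees the frame before AND after, predicts the action between."""
    votes = defaultdict(Counter)
    for frames, actions in labeled:
        for t, action in enumerate(actions):
            votes[frames[t + 1] - frames[t]][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}


def pseudo_label(idm, frames):
    """Run the labeler over a video that came without actions."""
    return [idm[frames[t + 1] - frames[t]] for t in range(len(frames) - 1)]


def behavioral_cloning(dataset):
    """Causal policy: maps only what has been seen so far (here, the current frame) to an action."""
    votes = defaultdict(Counter)
    for frames, actions in dataset:
        for t, action in enumerate(actions):
            votes[frames[t]][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in votes.items()}


def rollout(policy, start, steps=20):
    """Act step by step; unseen observations default to 'stay'."""
    pos = start
    for _ in range(steps):
        pos += policy.get(pos, 0)
    return pos


# 1) A small labeled set from paid "contractors".
contractor_data = [record_episode(s) for s in (0, 10)]
idm = train_idm(contractor_data)

# 2) A larger unlabeled set: keep the frames, throw the actions away.
unlabeled_videos = [record_episode(s)[0] for s in range(-10, 21)]
labeled_videos = [(frames, pseudo_label(idm, frames)) for frames in unlabeled_videos]

# 3) Behavioral cloning on the pseudo-labeled data, then act in the "simulator".
policy = behavioral_cloning(labeled_videos)
final_position = rollout(policy, start=-3)
```

In this toy the labeler's job is trivial (the action is the difference between neighboring frames), which mirrors the point made above that inferring the action from past and future needs far less data than learning to play.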
Yeah, I love that result and also how you got there. I mean, it seems like a special
case of Yann LeCun's cake where the bulk of the work was this unsupervised or self-supervised
training, but you had this extra step here. You were able to use all this unlabeled data
and I guess turn it into labeled data. Was that two-stage design obvious to
you from the start, or did you kind of arrive at that after some experimentation and thought?
Yeah, great question. Actually, that idea was the start of the project. Once we were
having a conversation and basically we said, hey, this would probably work. It was like,
that seems like a really powerful idea. Let's go do it. We committed to it and the overall
project took about a year and a half to get it all to actually work. It wasn't like we
set off with the goal to do something vaguely in the space and eventually through trial
and error we got to here. It was more kind of like that is an idea that will scale and
allows us to take the modern unsupervised paradigm and the GPT paradigm, but now take
it to domains where we want to learn how to act, how to control a computer, how to control
a robot, how to control an agent in a virtual world.
I want to give a lot of credit to Bowen Baker because it was in a conversation with him
where we landed on this idea and decided that this would be a really profitable direction
to go in. He's the first author on this paper.
Another thing I thought was really cool here is that you have this non-causal model, kind
of a non-standard type of model instead of a forward RL model that's predicting the next
state or an autoregressive model like in GPT. You had this setup where you were trying
to predict the action in the middle of a stack of frames. That kind of reminded me a bit
of BERT, like filling in a word in the middle. I was just wondering if that idea of having
a non-causal model of filling in the action in the middle was something that you had right
out of the gate or how did you come to that? I remember in the paper reading that you found
that performed better than simply predicting the next action.
This is where it kind of gets a little bit subtle. Because we have this two-stage process,
first we need to get the labels. Then once we have labeled data, we're going to do standard
behavioral cloning. The non-causal model, the model that gets to see the past and the
future and predicts the middle, that is just the action labeler. This is a thing that's
going to be trained to look at a YouTube video and say, hey, I see the last second and the
next second, the action that was taken right in the middle was jump or move the mouse to
the left.
Okay, but is that action labeler network still, is that what you're using as the basis of
your foundation model or no?
No. That just gives us labels for YouTube videos. YouTube videos, we don't know the
actions taken. We call this the inverse dynamics model, the IDM, but you could also call it
the action labeler. It gets to see the past and the future and tell us the action that
must have happened in the middle. Then we go to a YouTube video for which we don't have
actions. We run the labeler and now it gives us what it thinks is the label at every time
step in that video.
Now we have eight years of labeled YouTube videos. We have the video and the corresponding
action or the action we think happened. Now we throw away the data labeler, which is
the non-causal bit. We train now just a normal neural net that just goes from all the video
up until time t and it tells us what action to take at time t. We take that action. It
does not get to see the future. Then we send that action to the simulator. It sends us
back to the next frame of observations. We give it back to the model. It tells us the
next action.
The final model that learned to play Minecraft is a causal model. It is a normal model. To
some extent, we don't know the future. We're actually rolling this out in a real simulator
and so we can't give it the future. It only gets to, like any RL agent, it gets to act
from the past and has to take an action without knowing the future. That non-causal thing
was only to get the labels, to infer the labels and then train the model with behavioral cloning.
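The distinction drawn here comes down to which frames each model is allowed to see. A minimal sketch, assuming a video is a Python list of frames and using hypothetical helper names (`labeler_window`, `policy_context`) that are not from the paper:

```python
def labeler_window(frames, t, radius):
    """Non-causal input for the action labeler (IDM): frames on BOTH sides of time t."""
    return frames[max(0, t - radius): t + radius + 1]


def policy_context(frames, t):
    """Causal input for the final policy: frames up to and including time t, never later."""
    return frames[: t + 1]


frames = list(range(10))  # stand-in for ten video frames
window = labeler_window(frames, t=5, radius=2)
context = policy_context(frames, t=5)
```

The labeler can only run on recorded video, where the future frames already exist; the policy has to work inside a live simulator, where they do not.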
I see. Thanks for clearing that up. I guess I was confused on that point. You use the
phrase AIGA, AI Generating Algorithms. Can you talk about what you mean by that phrase?
Do you consider an AIGA to be an RL algorithm?
An AI Generating Algorithm is basically like a research paradigm. It also ultimately will
be an artifact once we have the first one of them. The general idea is that we're going
to try to have AI that produces better AI. It's a system that will try to go all in on
the philosophy that we should be learning as much of the solution as possible. To set
some context, I think if you look out into machine learning, what most people work on
is what I call the manual path to AI. They're trying to build AI piece by piece. Like, oh,
I think we need a ResNet connection or I think we need layer norm or I think we need the
Adam optimizer. I think we need this kind of attention, that kind of attention. I think
we need to train in these kind of environments. Every paper in ML is like, here is either
a building block that we didn't know we needed before that I think we need or an improved
version of an existing building block. That's what most ML research does. That begs the
question of, one, can we find all these building blocks and the right versions of them? Two,
when are we going to put all these building blocks together into an actual big AGI or
powerful AI? That manual path, I don't think it scales very well. I think if you look at
the history of machine learning, there's a very clear trend, which is that hand design
pipelines are replaced by learned pipelines, entirely learned pipelines, once we have sufficient
compute and data. We've seen that all over the place. We used to hand code vision features
and language features. Now we learn them. We used to hand code architectures. Increasingly,
those are being learned and searched for. We used to hand code RL algorithms. Now we
have meta learning, which sometimes can learn a better algorithm, et cetera. The
AIGA paradigm is kind of like staring that trend directly in the face and saying, hey,
if it's true, then we should be learning all of the problem of how to produce powerful AI.
I think if we want to push on this, in the paper, I actually had three original pillars.
I'll talk to you about a fourth that I added later. The first pillar is that we should
be searching for architectures automatically. The second one is we should be learning the
learning algorithm or meta learning. That is very much an RL problem. Then the third
pillar is automatically generating the right environments and the right training environments,
the right data, however you want to think about that. If you put all those three pieces
together, what you could end up with is something that looks a lot like what happened on Earth,
where you have a very expensive outer loop algorithm. In the case of Earth, it was Darwinian
evolution, but in an AIGA, it almost surely would be something more efficient and look
different. You basically have this outer loop that's searching over the space of agent
architectures, learning algorithms, and generating its problems and data. Ultimately, that thing
itself, even though it's very compute inefficient, will produce a sample efficient AI that is
probably more sample efficient than humans. Darwinian evolution produced you. You're the
most impressive brain we know of on the planet. Increasingly, an AIGA will get better and
better at producing something that ultimately will surpass probably human intelligence.
That's the general idea. I mentioned I added a fourth pillar recently. It's basically
a catalyst that speeds everything up. That is standing on the shoulders of giant human
datasets and using pre-training on human data to basically get this whole process moving
faster. There are some benefits to doing that versus not. We won't get into those details
right now, but that is an extra pillar that I think is important if you want to make this
all happen extremely fast.
Do you consider some of your other works to be concrete steps towards AIGA, or is it more
a distant abstract goal?
Yeah, I think a lot of the work that I have done in the past and that I'm currently doing
is very much in this paradigm. I think VPT, for example, is an example of how you can
use the fourth pillar to dramatically accelerate things. Ultimately, if you wanted to generate
a lot of really complicated training domains like Minecraft and using computers and driving
a car, et cetera, et cetera, it might be very inefficient to have to bootstrap up and learn
how to do that from scratch.
If you can learn from humans and how they play video games and how they drive in driving
simulators and how they use their computers, then you might be able to catch right up to
where humans are and then step ahead off into the future. That could dramatically speed
up AIGAs. That's one example.
Another paper that I'm very proud of that we worked on that is directly in the idea
of AIGAs, especially the pillar three of AI generating algorithms, is our paper on POET.
This was one of the first papers that really put on the map of the community the idea that
we should be automatically generating training environments for RL agents rather than hand
designing them. As you mentioned earlier, we want to ultimately do this in an open-ended
way where the system will just continuously and automatically generate as many novel,
different, interesting training challenges for the agent as possible. The agent has an
infinite curriculum and continues to level up and learn new skills. That's yet another
example. There are many more. I don't want to give you a 15-minute answer, but much of
the work that I've done, I think, slots in nicely into this AIGA paradigm.
Let's talk about quality diversity. Can you mention what you mean by quality diversity?
Can you compare and contrast that maybe with something we might have heard of like population
based training and genetic algorithms?
Before we talk about quality diversity, I want to quickly acknowledge the leadership
of Ken Stanley, Joel Lehman, and Jean-Baptiste Mouret in this area, as well as in the area
of open-ended algorithms, which we will talk about later. In both cases, they've been pioneers,
and I have been fortunate enough to work with them to develop these areas.
Quality diversity algorithms are another area of work that I'm extremely excited about.
For a while there, my colleagues and I were working in a very, very small niche, but now
it's exploded into an area that a lot of people are pushing on and are very interested in,
which is fantastic to see. The general idea is quite simple, but it is very different
from traditional machine learning and optimization. Typically, in machine learning or even RL,
we are trying to solve one problem and we want the best solution for that problem.
For example, we have a robot and we want it to move really fast, and so we just want the
fastest robot.
Quality diversity algorithms, in contrast, they say, you know what I really want? I want
a lot of diverse solutions. I want as many different ways to walk as possible, but for
each one of those, I want them to be really good. That's the quality part. I have quality
and diversity. This is inspired by natural evolution, so evolution
can serve as an example. If you look out into the world, you have ants and three-toed sloths
and jaguars and hawks and humans, and they're so magnificently different from each other,
but each of them is very good at doing what it does. It's very good at making a living,
doing whatever it does. If you go and you look at, for example, trying to say, I want
the fastest organism, if that's your performance criterion, well, if you only optimize for
speed, then maybe you get a cheetah, but you would not get an ant and you wouldn't get
a duck-billed platypus, but we like ants and duck-billed platypuses. They're really interesting.
They're creative solutions, and they could potentially be useful. What we don't want
to do is compete ants and cheetahs on speed because that's silly. You wouldn't even have
an ant, but you do want ants to be fast. You prefer a fast ant to a slow ant. The general
idea behind these algorithms is within each niche or bucket or type of thing, we want
the best there is, but we want as many of these buckets as possible. I want to give
an example of this because we had this paper in Nature in 2015, which I really think demonstrates
the value of these quality diversity algorithms. That paper was led by Antoine Cully and Jean-Baptiste
Mouret with Danesh Tarapore. We have a six-legged robot. We want it to be able
to walk. We also ultimately know that the robot, if it's out in the world
doing something like finding survivors, is probably going to become damaged. We're going to want
it to adapt to damage as fast as possible. What a traditional optimization algorithm
would do would be like, give me the fastest robotic gait. What a quality diversity algorithm
might say is, hey robot, go learn how to walk in as many different ways as possible. Each
one of them, we want it to be pretty good. Imagine you ultimately become damaged. Well,
think about what you would do if you were in a forest and you became injured. What you
wouldn't do is launch an RL algorithm that takes a million different trials of slightly
different versions of the current best thing and try to figure out the best way to walk
despite damage. No. Instead, what you would do is you'd say, all right, I stand up, I
try to walk. Ow, that really hurt. I can't walk the normal way. I'm going to try a totally
different type of gait, which is maybe I'm going to walk on the ball of my foot on the
injured side. You say, ow, that doesn't work. I'll try one more thing. I'll try to walk
on the outside of my foot. You say, oh, that still hurts too much. You say, all right,
whatever. I'm just going to hop out of this forest. Notice that ahead of time you had
practiced how to hop. When we're children, one of the things we love to do is act like
a QD algorithm. We go out and we try to figure out the fastest way to hop on one foot, on
two feet, to walk on our tippy toes, to walk backwards, to walk on all fours, whatever
it is. That's like playing. We're intrinsically motivated effectively like a QD algorithm
to do this. Then once we become injured, we can harness all of that practice and knowledge
to adapt to injury in a really fast way. We did exactly that in the Nature paper. We
had a QD algorithm. It learns to walk in a variety of different ways ahead of time. Once
it becomes damaged, we paired it with an algorithm called Bayesian Optimization that quickly
just basically said, I'll try one type of gait. If that doesn't work, I'll rule out
all of those types of gait. I now know those don't work. I'll try an entirely different
type of gait. I'll just bounce around and try a handful of different gaits until I find
one that works pretty well. Then I'll use that to limp back to the station where I can
get repaired or whatever. What we showed is that on this robot, once it's damaged,
if you run typical RL like PPO or policy gradient or whatever, it's very sample inefficient.
The search space is really large and it takes a ton of trials. If you run Bayesian Optimization
alone, it also doesn't work very well. If you use a quality diversity algorithm and then you do
this Bayesian Optimization thing, then in about six to 12 experiments and about one
to two minutes, the robot can figure out a gait that works despite the damage and it
can walk very, very quickly and soldier on with its mission. This was work with Antoine
Cully and Danesh Tarapore and led by Jean-Baptiste Mouret, a longtime colleague of mine and
one of the founders of quality diversity algorithms, and this algorithm is MAP-Elites, which is
probably the most popular quality diversity algorithm, which he and I coauthored around
the same time. I just think it was a really nice example of the idea that once you have
a big set, a big archive of things that are diverse and high quality, there's so many
cool things you can do with them and this was one example.
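To make the "best in each bucket, as many buckets as possible" idea concrete, here is a minimal MAP-Elites-style loop in Python. It is a sketch under loud assumptions: `evaluate` is a made-up toy function standing in for a robot simulator, the two-number genome doubles as its own behavior descriptor, and the parameter values (`bins`, `sigma`, the 100 random warm-up samples) are arbitrary choices for illustration, not settings from the Nature paper.

```python
import random


def evaluate(genome):
    """Toy stand-in for a robot simulator.

    Returns (quality, descriptor). Descriptor values lie in [0, 1].
    """
    x, y = genome
    quality = -((x - 0.5) ** 2 + (y - 0.5) ** 2)  # hypothetical "walking speed"
    descriptor = (x, y)                            # hypothetical behavior features
    return quality, descriptor


def cell_of(descriptor, bins):
    """Discretize a descriptor in [0, 1]^d into a grid cell (the 'bucket' or niche)."""
    return tuple(min(int(d * bins), bins - 1) for d in descriptor)


def map_elites(iterations=5000, bins=5, sigma=0.1, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (quality, genome): the best solution found in each niche
    for i in range(iterations):
        if archive and i >= 100:
            # Pick a random elite and mutate it, clamping to the valid range.
            _, parent = rng.choice(list(archive.values()))
            genome = tuple(min(1.0, max(0.0, g + rng.gauss(0, sigma))) for g in parent)
        else:
            # Warm-up: sample random genomes to seed the archive.
            genome = (rng.random(), rng.random())
        quality, descriptor = evaluate(genome)
        cell = cell_of(descriptor, bins)
        # Compete only within the same niche: ants against ants, not against cheetahs.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, genome)
    return archive


archive = map_elites(iterations=500)
```

The returned archive is the "big set of diverse, high-quality things" described above. A damage-recovery step would then search over the archive's entries instead of over raw controller parameters.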
I remember that 2015 article in Nature. It was on the cover actually and it was a really
nice cover story, and I understand that MAP-Elites explores this space of attributes, like say
using the example that you gave in your CoRL 2021 talk. I think it was the 2D space of height and
weight. It obviously works well in 2D.
I wonder if you were to scale that up, if we had four or 10 dimensions and we had the
curse of dimensionality making things less tractable, do you foresee ways to extend this
idea to more dimensions?
I do, yeah, great question. First of all, just in terms of the Nature cover, there's
a fun fact, which is that Nature usually wants a nice big picture of a fish or a spider or
whatever it is you studied, and they don't like you to put data on the cover of Nature,
but Jean-Baptiste Mouret had a fun idea, which is that we could use the MAP-Elites grid, literally
the grid of the performance in each cell, as a giant matrix that's really colorful, and
we made that the floor underneath the robot, so we snuck our data onto the cover of Nature,
which we thought was kind of fun.
Going back to your question, this actually gets full circle back to one of the first
things we talked about which is discretization. I am a huge fan of the idea that ultimately
you don't want to pick these dimensions by hand, you'd rather learn them. You could
take really high dimensional data and with a neural net you could learn a low dimensional
space of interestingness and how things can be interestingly different and then you could
do MAP-Elites in a space like that.
In fact, Antoine Cully, the first author of the Nature paper, has a paper that shows
exactly that, a method called AURORA. I think there are other people who have worked on
similar ideas but basically just to give you a broad sketch of how it might work, imagine
you have really high dimensional data, for example like a video of your robot walking
or images or something like that, you could then compress with an autoencoder down to
a small low dimensional projection of your data. You can then use that in that latent
embedding space of the autoencoder like the bottleneck of the autoencoder. You could then
run MAP-Elites by discretizing that space and trying to find, you know, fill as many of
the buckets of that space as possible. Then that gives you new data that you can then
run the autoencoder again, get a new latent embedding space, try to fill it up again and
keep repeating this process. In this way, you don't have to hand design the dimensions
of variation, you could learn them.
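A rough sketch of the "compress, then discretize the latent space" step just described. To keep it short and runnable, this swaps the autoencoder for a linear projection (PCA via NumPy's SVD), which is a much simpler technique than the one in the AURORA paper; the data is random noise standing in for high-dimensional behavior recordings, and all function names are hypothetical.

```python
import numpy as np


def fit_pca(data, dims=2):
    """Linear stand-in for an autoencoder: learn a low-dimensional projection."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:dims]


def encode(data, mean, components):
    """Project high-dimensional behaviors into the learned latent space."""
    return (data - mean) @ components.T


def to_cells(latents, bins=4):
    """Discretize each latent dimension into `bins` buckets for a MAP-Elites grid."""
    lo, hi = latents.min(axis=0), latents.max(axis=0)
    scaled = (latents - lo) / np.where(hi > lo, hi - lo, 1.0)
    return np.minimum((scaled * bins).astype(int), bins - 1)


rng = np.random.default_rng(0)
behaviors = rng.normal(size=(200, 50))  # hypothetical high-dimensional recordings
mean, components = fit_pca(behaviors)
cells = to_cells(encode(behaviors, mean, components))
```

In the loop described above, one would fill the grid given by `cells`, add the newly discovered behaviors to the dataset, refit the encoder, and repeat.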
Another example I'll just throw out there, and I don't think anybody's done this yet,
but you know like modern pre-trained models are another way that you can kind of get a
space of dimensions of variation that would be interesting. Take CLIP, for example.
It knows that robots that walk on two legs are different than robots that walk on four
legs and if suddenly a zebra stands up on its hind legs and it walks around, then it
should give it a different caption. Models like CLIP and GPT and GPT-4, which is multimodal,
almost surely already have many different dimensions of different ways that things can
vary in their latent embedding space. You could literally just steal that space and
run MAP-Elites in that space and say, go get me all the weird robotic gaits that will light
up all these different dimensions of variation. That probably gets you a huge amount of diversity
and in each one of those, make it high quality. Now we've got our QD algorithm and we didn't
have to pick the dimensionality of the space manually. We also allowed it to be very high
dimensional.
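As a rough illustration of the autoencoder-plus-MAP-Elites loop sketched in this answer, here is a minimal toy version in Python. It is not the AURORA implementation: PCA stands in for a neural autoencoder, and the "behaviour data", the fitness function, and every name and constant below are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
GENOME_DIM, OBS_DIM, LATENT_DIM, BINS = 8, 50, 2, 10
PROJ = rng.normal(size=(GENOME_DIM, OBS_DIM))  # fixed toy "simulator"


def observe(genome):
    """Toy stand-in for high-dimensional behaviour data (e.g. video frames)."""
    return np.tanh(genome @ PROJ)


def fitness(genome):
    """Toy quality measure (assumption): prefer small-magnitude genomes."""
    return -float(np.sum(genome ** 2))


def fit_encoder(observations):
    """Linear 'autoencoder' via PCA; a stand-in for a trained neural autoencoder."""
    mean = observations.mean(axis=0)
    _, _, vt = np.linalg.svd(observations - mean, full_matrices=False)
    return mean, vt[:LATENT_DIM].T


def encode(obs, encoder):
    mean, components = encoder
    return (obs - mean) @ components


def cell_of(latent, low, high):
    """Discretize a latent point into a grid cell (the MAP-Elites bucket)."""
    scaled = (latent - low) / (high - low + 1e-9)
    return tuple(np.clip((scaled * BINS).astype(int), 0, BINS - 1))


def rebuild_archive(genomes, encoder):
    """Re-bin known genomes under the current latent space, best one per cell."""
    latents = np.array([encode(observe(g), encoder) for g in genomes])
    low, high = latents.min(axis=0), latents.max(axis=0)
    archive = {}
    for g, z in zip(genomes, latents):
        c = cell_of(z, low, high)
        if c not in archive or fitness(g) > fitness(archive[c]):
            archive[c] = g
    return archive, (low, high)


def run(rounds=3, iters=500):
    genomes = [rng.normal(size=GENOME_DIM) for _ in range(50)]
    archive = {}
    for _ in range(rounds):
        # 1) Learn the dimensions of variation from the data collected so far.
        encoder = fit_encoder(np.array([observe(g) for g in genomes]))
        archive, (low, high) = rebuild_archive(genomes, encoder)
        # 2) Run MAP-Elites in that learned space: mutate an elite, keep the
        #    child if its cell is empty or it beats the current occupant.
        for _ in range(iters):
            keys = list(archive)
            parent = archive[keys[rng.integers(len(keys))]]
            child = parent + 0.3 * rng.normal(size=GENOME_DIM)
            c = cell_of(encode(observe(child), encoder), low, high)
            if c not in archive or fitness(child) > fitness(archive[c]):
                archive[c] = child
        # 3) The filled archive becomes the data for the next encoder.
        genomes = list(archive.values())
    return archive


if __name__ == "__main__":
    final = run()
    print("filled cells:", len(final), "of", BINS ** LATENT_DIM)
```

The same skeleton would apply to the pre-trained-model idea: swap `fit_encoder` and `encode` for a frozen embedding model and skip the retraining step.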
Let's move to Go Explore. If I understand, this is a very high fidelity exploration algorithm.
Could you explain at a high level what the main ideas are here with Go Explore?
Before I explain Go Explore, I want to acknowledge the wonderful colleagues that worked with
me on this paper, who were Joost Huizinga, Adrien Ecoffet, Joel Lehman, and Ken Stanley.
One of the Achilles heels, maybe the Achilles heel, of reinforcement learning is exploration.
If you never happen through random actions to do the thing that is getting rewarded,
then you have no signal for how to get better. It's like playing warmer and colder, but you
never get the answer warmer. You're just cold all the time.
If you go back to the original DQN paper, which is I think the face that launched 10,000
papers in the sense of kicking off the deep reinforcement learning revolution and putting
DeepMind on the map, they did pretty well on a number of Atari games. After that, better
and better algorithms did better and better, but there was one game in which they literally
got zero and that is Montezuma's Revenge because it's very, very difficult to ever get any
reward in that game just through random actions.
For a long time, people held this game, Montezuma's Revenge, up as an example of a hard exploration
problem for which our current algorithms are failing. It became a grand challenge in the
field to see if we could solve this game. While progress was being made on all the other
Atari games, there was a small set of games in the Atari benchmarks, which were these
hard exploration games like Pitfall and Montezuma's Revenge in which we were making very, very
little progress just to set the stage.
The natural thing that people do in reinforcement learning when you have a hard exploration
problem is to say, hey, the extrinsic reward function, I'm never triggering it. I'm never getting
any signal from it because I'm never doing whatever it is that I need to do to get that
reward signal. I should be intrinsically motivated. I should have my own reward for exploring,
for going to new states, doing new things, learning how the world works. There's been
a lot of different methods over the decades in terms of how you might do that and they're
all really interesting and in some problems they've been shown to work better than nothing.
But if you look at this Montezuma's Revenge game, none of those methods were really moving
the needle. In fact, a couple of weeks before Go Explore came out, there was this paper
by OpenAI before I joined OpenAI that created a huge splash because they got all the way
to 11,000 points on Montezuma's Revenge, which was a huge accomplishment. It was a big step
up over everything that had happened before, and it basically gave bonuses for getting
to new states of the game, for seeing new stuff. I spent a lot of time thinking about
why is intrinsic motivation not working better? It works better than nothing, but it doesn't
solve the game. It leaves a lot on the table. I started thinking there's basically like
two pathologies that I think exist at the heart of reinforcement learning even when
it has intrinsic motivation. One of them is what I call detachment. An intrinsically
motivated algorithm might reward you for getting to new places. Imagine you're standing in
a hallway and so at the beginning, you could go left or right. Everything's new and you
get a reward. If I go left for a little while, I consume intrinsic motivation, but then
at some point that agent happens to die or whatever. We restart it back in the middle
of the hallway. It might happen to go right now because there's intrinsic motivation over
there too, but now when you get reset the next time, there's no intrinsic motivation
nearby where you're starting because everything's been consumed. Basically, I've detached from
the promising frontier of exploration. I haven't remembered where I've gone in the past, so
I should go to new places. That was one thing that I think these algorithms weren't doing.
The other thing I think is maybe even more important and I call this derailment. Imagine,
for example, if you were doing rock climbing and you climbed three quarters of the way
up a wall, which is maybe really, really difficult, and you're gobbling up intrinsic motivation
the whole time, you're really excited as an agent, hey, I'm getting to someplace new.
Well, what do we do with that agent? We say, hey, that was really good. Wherever you just
got to, you should go back there. We start them again and we had just told them, good
thing, but now as they're trying to re-climb up that wall, we're like, but we also want
you to go to new places, not exactly back there. We're going to sprinkle in random actions
the whole time you're acting, but if that wall was really hard to climb in the first
place, now I'm trying to do it again. We're basically knocking it off the wall on purpose
by injecting all these random actions or noise into its policy. We're never really letting
it get back to that place. We said we should do things differently. We should adopt this
mentality, which ultimately became the name of the paper in Nature, which is we should
first return, then we should explore. Hey, if you got three quarters of the way up a
rock wall or you got deep into some dungeon or whatever, don't go back there and try to
explore along the way. No, no, no. Just go back there first, and then once you get there,
explore to your heart's content. When we put those two pieces together and
solved both of those pathologies, all of a sudden we had an algorithm called Go Explore,
and on games like Montezuma's Revenge, it got ridiculously high scores, like 18 million,
I think, was our best policy, and we probably could have gotten higher. It beat the human
world record, which was 1.2 million at the time. It blew everything else that had come
before it in terms of RL out of the water, and it did that on many games in Atari. Ultimately,
you could solve the entire Atari benchmark suite. If the listener will indulge me while I
try to explain exactly how Go Explore works very quickly, it's quite simple. We basically
have a first phase where we say, hey, start taking actions in the environment. It could
even be literally take random actions, but every time you get to a new place, a new state,
a new interesting situation in the game, then we'll just save that in an archive, and we'll
also save how you got to that interesting place, and then we have this archive of places
where you've been that are pretty cool. All we'll do now is we'll pull one of those locations
or situations out of the archive, and we'll say, first, go back to that situation, and
we could do that by just saying, replay the moves that got you there, or we train you
to get back there without stumbling or whatever. Once you're back in that situation, then you
can explore from there, and either you could do that with just random actions or a policy
or whatever.
Note how simple it is. If I got three quarters of the way up the wall, I'm just going to go
right back there. Even random actions from there are likely to get to even new places
like a little bit higher up the wall, and then I'll save those places, go back, a little
bit more exploration from there. I could probably go a little further up the wall. Basically,
what you get is this expanding archive of stepping stones or situations that I've been
to that are interesting, from which I can explore and just get another new situation
and another new situation and another new situation.
Once you have that, basically, eventually, you're going to start to discover all these
highly rewarded situations in the game, including maybe how to beat the game. Then you can go
back and say, hey, maybe I've been doing some tricks to help you get back there really easily
without dealing with the fact that sometimes in the game, there might be a lot of noise
and stochasticity. Now that I know what we're trying to do and accomplish in this game,
I'll just train a policy that does that and does it really well, does it in the presence
of noise, and boom. Now I can have a policy that starts from the beginning, only has
access to what anything that just plays the game as normal would have, and it can get extremely high scores.
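The exploration phase described in this answer (keep an archive of interesting situations, first return to one, then explore from it) can be sketched on a toy problem. The following is a minimal illustration, not the code from the paper: the `ChainEnv` environment, its `get_state`/`set_state` methods, uniform cell selection, and the "keep the shortest trajectory" rule are all assumptions chosen to keep the example small. Each exploration burst is only 10 steps long, so the goal at position 30 is unreachable in a single burst from the start; it is only found by repeatedly returning to the frontier.

```python
import random


class ChainEnv:
    """Toy deterministic environment (hypothetical): a chain of cells with a
    sparse reward only at the far end."""

    def __init__(self, length=30):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        """action is -1 (left) or +1 (right); returns (observation, reward)."""
        self.pos = max(0, min(self.length, self.pos + action))
        reward = 1.0 if self.pos == self.length else 0.0
        return self.pos, reward

    def get_state(self):
        return self.pos

    def set_state(self, state):
        self.pos = state


def go_explore(env, iterations=3000, explore_steps=10, seed=0):
    rng = random.Random(seed)
    start = env.reset()
    # archive: cell -> (saved simulator state, actions that reached it)
    archive = {start: (env.get_state(), [])}
    for _ in range(iterations):
        cell = rng.choice(list(archive))
        if cell == env.length:
            continue  # the goal cell is terminal, nothing to explore from
        state, actions = archive[cell]
        env.set_state(state)            # first return (by restoring state)...
        actions = list(actions)
        for _ in range(explore_steps):  # ...then explore with random actions
            action = rng.choice([-1, 1])
            obs, reward = env.step(action)
            actions.append(action)
            # Save new cells, or a shorter way of reaching a known cell.
            if obs not in archive or len(actions) < len(archive[obs][1]):
                archive[obs] = (env.get_state(), list(actions))
            if reward > 0:
                break
    return archive


if __name__ == "__main__":
    env = ChainEnv(length=30)
    archive = go_explore(env)
    print("cells discovered:", len(archive))
    print("goal reached:", env.length in archive)
```

The saved action lists are what a later robustification phase could treat as demonstrations when training a policy that plays from the start under noise.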
I first encountered your name and your work, actually, at the NeurIPS 2018 deep RL workshop
in Montreal. I remember you presenting Go Explore on the main stage there and the score
that you got on Montezuma's Revenge. Then I was actually at the back of the room. I remember
David Silver commenting on some of the assumptions with respect to RL and saving the simulator
state. I recently learned that you've extended the algorithm to have a policy version that
doesn't require a simulator at all. Is that right? Can you talk about the simulator version
versus the policy version?
That's right. I'm glad you were there and I remember David's question. These are some
of the details that I touched a little bit on at a high level, but I can explain in a
little bit more detail. The simplest version of the algorithm, the first version, the one
we presented in the room that day, if we're in a video game and I get to an interesting
situation, then I can literally just save the simulator state. It's the equivalent of
freezing the world when I'm three quarters of the way up the rock wall. Then next time
I play the game, instead of saying, hey, try to climb back to that spot three quarters
of the way up the rock wall and then explore from there, we literally just say, oh,
actually I just have the frozen state of the world where you're three quarters of the way
up the rock wall. I'll just resurrect that. You wake up and you're right there and your
job is to explore from there. That's taking advantage of the fact that a lot of the work
that we're doing in Atari and other simulators allows you to save the state of the world
and resurrect it. We don't even have to really first return. We can just pull something out
of the archive, explore from it, and if we get somewhere new, add it to a growing set
of simulator states. Now, some people thought, hey, that's not the original challenge that
we're trying to solve here. In the real world, you might not be able to do that. We said,
okay, fine. We'll show you that these ideas are really general and really powerful. All
we do is instead of saving the state of the world like three quarters of the way up the
rock wall, we will just train a policy. We will give it the goal of, hey, go to this
place three quarters of the way up the rock wall. It's a goal-conditioned policy. Based
on that goal, it'll be trained just to go back there and get really robust at going
back there. Then we have a higher level controller that says, hey, go to the three quarters of
the way up the rock wall and the goal-conditioned policy goes there. We're not resurrecting
simulator states or making the game deterministic or anything. It deals with all the noise in
the world. Once it's there, then we can give it some other goal, which is explore around
you or maybe generate a goal that's nearby, et cetera.
Once we do that, we now have a version of the algorithm where we're not making the world
easier in phase one. The original version of it, we took advantage of the fact that
simulators can be made deterministic and/or you could resurrect simulator states to just
figure out what the game wants us to do, like how to get a good reward. Then we had a phase
two that made a policy that was a neural net that was robust to noise. In this version
of the algorithm, you never make those simplifying assumptions. You just basically start out
training a policy from the get-go, but it's still got the principles of Go Explore in
there, which is we want you to increasingly first return to a place that we considered
interesting to explore from and then explore from there. If you get to an interesting place,
we'll save that in this archive.
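To make the contrast with state restoring concrete, here is a minimal sketch of the policy-based variant on a toy stochastic environment. It is only an illustration under stated assumptions: `NoisyChainEnv` is hypothetical, and `goal_policy` is a hand-written placeholder standing in for the trained goal-conditioned neural network described above. The point it shows is that returning happens by acting from the start in the noisy world, with no saved simulator states.

```python
import random


class NoisyChainEnv:
    """Toy stochastic chain (hypothetical): each action is flipped with
    probability `slip`, and there is no way to save or restore state."""

    def __init__(self, length=20, slip=0.2, seed=0):
        self.length = length
        self.slip = slip
        self.rng = random.Random(seed)
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        if self.rng.random() < self.slip:
            action = -action
        self.pos = max(0, min(self.length, self.pos + action))
        return self.pos


def goal_policy(obs, goal):
    """Placeholder for a trained goal-conditioned policy: step toward the goal."""
    return 1 if goal > obs else -1


def return_then_explore(env, archive, rng, max_return_steps=200, explore_steps=10):
    # Higher-level controller: pick a previously reached cell as the goal.
    goal = rng.choice(sorted(archive))
    obs = env.reset()
    steps = 0
    # First return: follow the goal-conditioned policy through the noise.
    while obs != goal and steps < max_return_steps:
        obs = env.step(goal_policy(obs, goal))
        steps += 1
    if obs != goal:
        return False  # failed to return; try again on a later iteration
    # Then explore: random actions here, though a learned policy could explore too.
    for _ in range(explore_steps):
        obs = env.step(rng.choice([-1, 1]))
        archive.add(obs)
    return True


def run(iterations=2000, seed=1):
    rng = random.Random(seed)
    env = NoisyChainEnv(seed=seed)
    archive = {0}  # cells reached so far; each one is a possible future goal
    for _ in range(iterations):
        return_then_explore(env, archive, rng)
    return archive


if __name__ == "__main__":
    print("furthest cell reached:", max(run()))
```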
I do want to point out one more thing, which I think is really cool, especially in this
conversation because we've already touched on quality diversity algorithms, including
the MAP-Elites algorithm, which underlies the robotic adaptation thing that I described,
and that is that the principles of quality diversity algorithms are alive and well and
inspired Go Explore. In the robotics case, we wanted as many different robotic gaits
that were each high quality as possible, and then we harnessed that archive, that big library
of different gaits when we had damage. In Go Explore, we're doing the same thing, but
what we're collecting is different trajectories or different policies within one world or
environment or game. What we end up with is the highest quality way to get to a whole bunch
of different situations or states in the game. You have the most efficient way to get three
quarters of the way up the rock wall, the most efficient way to get to the vending machine,
the most efficient way to get to the parking lot. By doing that, you expand out and explore
the space of possibilities, and then eventually you can find something that lights up the
reward function, and then if you want, you can distill a policy that only does that thing
even better.
I just think it's really cool because it's like these principles that we get excited
about in one context end up paying off to solve other really hard challenges in machine
learning again and again and again, which I think is a pretty good sign that these principles
are exciting.
Great. Thanks for explaining that. I'm so glad to hear about the policy version, and
that was the biggest criticism that I heard about this algorithm. Some people, say the more
traditional RL people, were saying it wasn't following the RL rules, but you were able to overcome
that with the policy version. Does the policy version perform well on some of these tasks
too, like Montezuma's Revenge and the difficult problems? Is the policy version doing its
thing?
Yeah, it also does extremely well, and it does way better than all previous algorithms.
It's more compute intensive because you're not taking advantage of saved simulator states, and I have
to train a policy to go three quarters of the way up the rock wall. That's a metaphor.
That's not actually part of the Montezuma game, but yeah, you're not taking advantage
of some of these efficiencies, but yeah, it does extremely well. It gets really, really
high performance on these games, and the cost is just you have to train this policy. There
is one really cool thing I'll mention about the policy version that was not true in the
original version, is because you've trained a policy that can go three quarters of the
way up the rock wall, I'm going to keep using that example. Once it gets three quarters
of the way up the rock wall, it knows how to rock climb really well, and so now it's
going to be better at exploring from that.
We do show in the paper that one benefit you get is more efficient exploration. Even though
the overall algorithm is still more expensive, the exploration part is more efficient because
you're no longer taking random actions to do the explore step of the algorithm. I actually
think that really bodes well for the future. As you train big models on more and more kinds
of domains, they're going to have all sorts of common sense and skill sets and understanding,
especially if you did like VPT ahead of time, for example. Basically, we're going to become
really efficient at exploring, and so you put that together with a powerful first return
then explore, and now my exploration is efficient, and I'm off to the races to solve really,
really difficult hard exploration challenges.
Just going back to that simulator version for a second, does the simulator version...
It might not fit the kosher definition of the RL problem setting, but can it still help
produce real, helpful real world policies in different domains, and what kind of domains
do you see this for helping us in? What kind of domains are most suitable for Go Explore?
Both with VPT and Go Explore, we did some things that helped us get to our final policy,
but the final policy itself just does the canonical thing. You put it in the game, and
it plays from pixels with all the noise in the game. It takes actions just like any other
policy in RL. Some purists were saying, hey, you might have gotten there via a different
path and maybe an easier path, but the final thing still does the task that we wanted solved.
The same is true with Go Explore. It produces a final policy that can play the game from
pixels. If you didn't know how we got it, then you would say, okay, maybe you got this
through traditional RL. It just seems to perform way better. That means that for
any problem that you were trying to solve with RL, if you tried to solve it with Go
Explore and it performs really well in the end, you'd be really happy, and it could go
off and do what you need. To answer your other question, which is, what are the domains
that this really helps on? I would say any hard exploration RL problem, which is to say
almost all of the unsolved RL problems. Maybe that's a little too strong, but many of them.
For example, if you wanted to learn how to drive in a driving simulator, if you wanted
to do robotics, then I think these algorithms could work really well. Let me give you an
example. Imagine that in robotics you want your robot to clean
up your room and get a fire extinguisher and put away the dishes and make you an omelet
and take the trash out and all this stuff. Those are all really, really, really hard
exploration problems if you only know how to reward it once the trash is taken out or
once the room is clean. You could use Go Explore to solve each of these problems. Then you
could train a generalist policy on all these demonstrations of how to do each task to produce
a generalist robot that knows how to do a huge variety of different things. I actually
think this would be really cool. You have a thousand or a million different tasks that
are all hard exploration. Go Explore could efficiently solve them all, give you demonstrations
of how to solve those tasks, and now you could do a huge amount of what you might call pre-training,
which is having a robot hear a task description and then know how to do it. You've got all
of these solution demonstrations and kind of like GPT that knows how to complete a thousand
different types of articles. Now you have a robot that knows how to complete a billion
different kinds of tasks, and it becomes very general at zero-shotting new tasks when asked
to do so. That's one futuristic project that I'd love to see with Go Explore.
Cool. Sounds like hard exploration problems that have really good simulators would be
suitable for Go Explore. Is that fair to say?
I think that's right. I mean, it would be really cool to try to use the principles of
Go Explore in a problem that doesn't have simulators. That would be really futuristic
and ambitious, and that would be a great project for somebody to work on. I haven't seen that
yet, but where it really shines is, like many RL algorithms, when it can take advantage
of lots of computing in a simulator and then learn a lot of stuff and then try to be efficient
about crossing the reality gap.
That seems to be a bigger trend these days where that's becoming very feasible to cross
that gap, which is kind of amazing.
It is true. As we have more and more compute and more and more pre-training and more and
more data augmentation and more and more demonstrations, the reality gap is kind of taking care of itself
in a sense, and that's why I think these algorithms like I just sketched out might just work.
Just Go Explore in a really decent simulator or a good simulator with a lot of compute,
and boom, you might zero-shot transfer to the real world.
You used the phrase open-ended algorithms. Can you touch on how you define that?
This is one of the topics I'm most excited about in machine learning and have been really
since I started in science. To me, one of the most profound mysteries in the world is where
we came from and also where all of the amazing creations of evolution came from. I look out
into the world and I see jaguars and hawks, the human mind, platypuses, whales, ants.
It's this explosion of interestingly different engineering marvels. I like to think about
them as engineering marvels. The system is continuously innovating, right? Like it produced
COVID, it produced new species all the time, and it's constantly surprising us.
That is an example of an algorithm that we would call an open-ended algorithm. You keep
running it and it just keeps surprising you. It keeps innovating, it keeps creating forever.
We have no evidence that it's going to stop. There are other examples of that. Another
one is human culture. Human culture, for example, science or art or literature, it keeps inventing
new challenges and then solving them and the solutions to those challenges become new challenges.
For example, in science, you invent one technology that basically allows you to answer totally
new types of questions or make new innovations, and then there's new sciences to study those
things and the interaction between those things and other things. The system just keeps on
innovating and learning and accumulating knowledge and skills and an expanding archive of wonders.
A quest that I've had is, what are the key ingredients that would allow us to create
a process like that inside of a computer? One, because that would be fascinating, would
allow us to maybe produce AGI and create alien cultures to study and all sorts of cool stuff,
but also because it teaches us about the general properties that are required for a process
to be open-ended, which we still don't really understand. We don't really know why nature
works. I want to contrast those open-ended algorithms with what is traditionally sought
after in machine learning and certainly happens in machine learning. Typically, you want a
certain solution to an optimization problem, whether it be a fast robot or the
right way to schedule your final exams. You run it for as long as you can, and it gets
better and better and better with diminishing returns. Eventually, it's done the best it
can do. If you ran that thing for another billion years, nothing interesting would happen.
What we want is an entirely different paradigm. We want to see, could you create an algorithm
in the words of Jean-Baptiste Mouret, which would be interesting to come back and check
on after a billion years, because evolution is 3.5 billion years and running and still
surprising us. Why can't we do the same thing with a computer algorithm? When I started
my PhD, there was no algorithm I was aware of that was worth running really for more
than about a day, maybe a couple days. As I've gotten later in my career, that went
up to weeks, and now I think we're at about months. If you think about how long you might
it have a GPT or something run for, we're training things on the order of months. Now,
we are throwing more and more compute at them as well, but roughly speaking, we've got nothing
that we would want to run and come back and check on in a billion years. That's the quest
and the challenge. Could we create a system that would forever, as long as we run it,
continue to innovate, delight, and surprise us in the same way as human evolution and human
culture?
Okay, so that part makes sense to me. I guess it reminds me of efforts in things like artificial
life, which have been around for a while. It seems like in some sense, some of the systems,
they don't really have a hard goal. The goal is actually just to create complexity and
discovering or exploring that process of creating complexity. When it comes to evolution, it's
very inefficient. It's not really clear entirely if you could say it's goal-oriented. I wouldn't
say it's goal-oriented. What do you see as the goal of these open-ended learning systems,
and how do you know when you're achieving that goal?
Yeah, it's a great question. To quote Justice Potter Stewart, I don't know how to define
it, but I'll know it when I see it. In fact, you're right that this field has traditionally
been pursued in artificial life, although now it's becoming a mainstream topic in machine
learning and reinforcement learning. I've spent a lot of my career in the artificial
life community when that was the best place to do this work. Now, it's great to be able
to talk about it with a wider set of people. In general, I do agree with you, evolution
doesn't really have a goal. It didn't set out to produce ants or cheetahs or humans.
It just happened to do so. The same thing is going to be true of our algorithms. We're
going to have to create systems that are not seeking a goal. In fact, that's not an accident.
It turns out, and Ken Stanley and Joel Lehman, my dear friends and colleagues, have been pushing
on these ideas for a while, that if you have a goal, that probably prevents you from doing
things that are really interesting because you get stuck on a local optimum or in a dead
end. What you really need to do is abandon the idea of having a particular goal and just
do what we've been talking about with MAP-Elites and Go Explore and quality diversity
algorithms. Just go out and collect a huge diversity of high quality stuff and stepping
stones and that will ultimately unlock tremendous progress. I want to give you an example. Let's
assume, for example, that you wanted to produce a computer and you went back a couple thousand
years to the time of the abacus and you are the king of a kingdom and you say, I will
only fund scientists who make me machines that produce more compute. Well, you might
get an abacus that has longer rods or more beads or maybe it's a three dimensional abacus
or something, but you would never invent the modern computer because to invent the computer
you had to have been working on vacuum tubes and electricity, which were things that were
not invented because they had immediate benefits for computation. That was not in the minds
of the people that invented those things. Here's a line from Ken and Joel's book, to
achieve your greatest ambitions, you must abandon them. You can't have the objective
of producing human level AI when you create your AIGA, your open ended AIGA that can generate
environments forever because if you myopically start giving IQ tests to bacteria, you will
never get human beings. You just have to go explore and collect stepping stones and eventually
you get the stepping stone, which is say the vacuum tube and the electricity and someone's
like, aha, I can put this together and make a computer or in the algorithm, like, hey,
I've created this thing that could do these weird tasks and boom, that turns out to be
the key ingredients to produce something that looks like AGI and has general intelligence.
I don't think we're going to be setting objectives. I think we want to not have objectives, but
as you said, how do I know when I run a system and it's worth running for a lot longer and
how do I know when I shut it down? This is very difficult to measure, but often we can
just look in there and see and if really interesting things are happening, then we know that it's
worth continuing. If I created a simulation and the agents in the system were inventing
society, they were having group meetings, they were electing leaders, or they were inventing
technological innovations and then building on those, or they started inventing educational
systems and they were teaching other AIs. If these kind of things are happening, then
I get pretty excited that I'm on to something. If I just run the system and it's generating
more of the same or it's just generating white noise forever, I'm not interested. We don't
really know how to measure it, but humans are pretty good at evaluating things even
if we can't quantify how to measure them. I think we should bring our human judgment
to bear in this field of research, even though scientists don't love hearing that.
Yeah. Some of the things you're saying are reminding me of Henry Ford saying customers
just wanted faster horses and there's that British TV show Connections where James Burke
was always relating how something would not have been invented without some other very
obscure thing happening. I want to move on a little bit to the topic of the day.
ChatGPT and those types of models are on fire. It seems like they represent a very different
path towards AGI than the AIGA paradigm, or at least it seems that way to me. Do you see
these two types of paths competing or do you see them as complementary in some way? Does
the AIGA concept imply that we want to have an outer loop around GPT-4? But
these models are presumably always going to be towards the edge of the feasible run size
for their day, so that kind of precludes having an outer loop around them. How do you see
the interaction between AIGA and this current paradigm with the large language models and
the massive models and the ChatGPTs of the world?
Yeah. Fantastic question. One that I think about a lot. This is kind of why I added a
fourth pillar to the AIGA paradigm because I think pre-training on human data is effectively
kind of like an orthogonal choice or an optional speed up. You could try to do the whole AIGA
thing and never train on human data, but it probably will take a long time. Interestingly,
it might produce intelligence that looks a lot less like humanity because it wasn't trained
using human data and that might be beneficial once we start to study the space of all possible
intelligences. Let's assume for now that you want to at least produce one really intelligent
thing first before you worry about all possible intelligences. Well, training on human data
is just a huge speed up. Doing something like GPT, training on that human data, I think
it's a fantastic accelerant that kind of immediately catches AI roughly up to human, not human
level intelligence, but gets us much closer than we would without human data. Let's say
it that way. I still think you're going to need a lot of the elements in the AIGA paradigm.
For example, if you want to produce intelligence that can solve problems that humans have never
solved before, like a cure to cancer, for example, then you can't just train on human
data. At least it's unlikely that a model that's trained just on human data would suddenly
know the solution to cancer because humans don't know that solution. It's probably going
to have to become a scientist and start conducting experiments and know which experiments to
conduct and how to learn from them. That starts to look a lot like pillar three. As for pillar two,
which is kind of meta-learning the learning algorithms, GPT itself is an amazing
meta-learner. It's one of the great surprises of the GPT paradigm, but yet I still think
that probably as you have to learn to explore and solve new problems and become a scientist
and conduct these experiments, it's going to benefit from kind of meta-learning things
that are even more advanced than what comes from just training on human data. Then finally,
on the architecture front, you certainly wouldn't be able to do architecture search where each
atom in the search is literally a full GPT run. That's as you said, it's way too expensive.
But I would not at all be surprised and think ultimately it probably will be very important
and beneficial to be doing architecture search in a way that you do experiments just like
machine learning researchers do at small scale, figure out principles, better architectures,
and then you start to test them at larger and larger scales. Eventually, you do a scaling
law kind of thing and you say, hey, I think we should try this big. Then you launch a very
small number of experiments at scale. I still really believe in a system that's automatic
that is using AI to do architecture search to come up with better learning algorithms
via meta-learning, it's automatically generating new training data, and all of this in conjunction
with this pillar four, which is the speed up, which is seeding the whole process with
human data.
To conclude, I think that the approaches are very complementary. GPT alone will not
be enough, but GPT inside of an AIGA might allow AIGAs to actually accomplish their goal
maybe even decades before we would have otherwise if we didn't take advantage of strategies
like GPT.
I heard you once predicted a 30% chance of AGI by 2030. Is that currently your estimate
or can you talk about your timeline for AGI?
Well, I only made that prediction in December. It's currently March. I haven't had too much
of a change of heart in the few short months. I still think that's possible. If anything,
I might be a little bit more aggressive now, but I think right now I'll stick with it.
Obviously, what we're seeing is tremendously powerful systems. I think probably the most
interesting thing now is this: having a discussion about when AGI would come five, four, seven
years ago, you didn't really have to define AGI because the whole enterprise was so hard
that we basically meant that thing that we're not talking about.
Now I think the systems are so good that your actual timeline really will come down to your
definition because if you define it like I have, as something like humans doing 50% or
sorry, AI doing 50% of the work that got paid for in 2023, that definition, it probably
will be something that we'll hit by, say, 2030. I think that's increasingly becoming clear,
but if you have a different definition, like it has to be better than humans at everything
that humans do, well, then suddenly your timeline gets pushed out.
I'll just say we're now within the range at which the definitions matter because we're
going to start crossing through each one of these definitions with rapid speed. I'll even
put out something more. I don't know why nobody has done a Turing test on these current systems.
I actually like the Turing test. If it was done well by a really smart group of people,
I think that it would be hugely informative. I think there's a pretty good chance that
GPT-4, well, let's see. I don't know if it would pass the Turing test, but I think it's
got a pretty good chance and I think GPT-5 probably would.
Already we're seeing one of the definitions we had since the founding of our field being
surpassed. Again, we're back to being within the range where what year you should predict
depends on what your definition is, but no matter what, it's all happening and it's happening
at a ridiculously fast pace. In the span of human history, it's happening in a blink.
Society needs to be really deeply thinking about whether we should do this, how fast
we should do this, how do we do this safely, and what are the consequences of doing this.
One detail from one of your talks, you mentioned that there's evidence that evolution in nature
is better than random. Where's that fact from?
The random part of evolution is the mutation. I mean, you could even argue about that, but
let's just take that as a given. Yeah, you randomly mutate a current thing and that's
not a very smart thing to do, but the non-random part of evolution is then you keep the stuff
that worked really well. If you started off 3.5 billion years ago and you just randomly
sampled genomes, then you'd never get anywhere. It's actually very similar and I've never
made this connection before right now, but it's kind of similar to Go-Explore. Go-Explore
says, hey, you got three quarters of the way up the rock wall. That was really hard. Let's
go there and then we'll do some random exploration from there. That's one version of Go-Explore
and that allows you to get maybe one hold farther up the wall and that was good. Now
you're more than three quarters of the way up the wall, but if you just randomly search
for policies, you'd never end up three quarters of the way up the wall. Basically, like the
cheetah genome is the equivalent of a stepping stone that has been collected. It's like the
equivalent of a policy that can get three quarters of the way up the wall. If I start
there and then I do a little random exploration, I might get something that's better than a
cheetah. That is very, very different than just saying in the space of all possible genomes,
I will randomly generate a genome and I will see if it is good because if you do that,
nothing will even self-replicate because making a machine that can self-replicate is impossibly
hard.
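The contrast drawn above, between sampling genomes at random and randomly mutating something
that already works, can be sketched in a few lines of Python. This is an illustrative toy, not
code from the interview or from the Go-Explore paper: the bit-string genome, the fitness
function that simply counts 1-bits, and the names random_sampling and mutate_and_keep are all
assumptions made for the example.

```python
import random

GENOME_LENGTH = 40  # hypothetical toy genome: a list of 40 bits


def fitness(genome):
    """Toy objective (an assumption): number of 1-bits; the best score is GENOME_LENGTH."""
    return sum(genome)


def random_genome(rng):
    """Sample a genome uniformly from the space of all bit strings."""
    return [rng.randint(0, 1) for _ in range(GENOME_LENGTH)]


def random_sampling(rng, budget):
    """Pure random search: draw independent genomes and report the best fitness seen."""
    best = 0
    for _ in range(budget):
        best = max(best, fitness(random_genome(rng)))
    return best


def mutate_and_keep(rng, budget):
    """Random mutation plus non-random selection.

    Start from one random genome, flip one randomly chosen bit per step,
    and keep the child only if it is at least as fit as its parent.
    """
    current = random_genome(rng)
    for _ in range(budget - 1):
        child = list(current)
        i = rng.randrange(GENOME_LENGTH)
        child[i] = 1 - child[i]  # the random part: a blind mutation
        if fitness(child) >= fitness(current):
            current = child  # the non-random part: keep what worked
    return fitness(current)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
    print("random sampling :", random_sampling(rng, 2000))
    print("mutate and keep :", mutate_and_keep(rng, 2000))
```

With the same evaluation budget, the mutate-and-keep loop builds on each stepping stone it has
already reached, while pure sampling starts from scratch every time. In this toy setting that is
enough for the first to reach the maximum score and the second to fall well short of it.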
You're not saying the random mutations are somehow biased towards beneficial mutations.
That's not what you're saying.
There is a whole line of work on evolvability that actually does argue that it is non-random
and it can do better than random. The first point I want to make is that even if you assume
that the actual mutation operator is random, the algorithm itself is far, far, far, far,
far from random. Now, if you're specifically asking, can you do better than random with
the mutation operator? Yes, you can actually. The way that that works is that you come up
with basically a representation that is intelligent such that when you do random things
to it, good things happen. For example, if you go into the space of GANs and you randomly
sample vector codes, you don't get garbage, you get different faces because the representation
maps noise to the low dimensional manifold of all possible human faces, and only
faces, if it's been trained to generate faces. You could have random mutations
that still make really good changes. Similarly, many people argue and I think there's some
pretty good evidence that evolution is the same. It allows you to randomly change the
genome and then maybe like both of your legs get longer or shorter, not like one. You don't
have like asymmetry. It has kind of captured the regularity that legs should be the same
length and legs and arms should be proportional to each other. Random changes to this representation
produce creatures that have kind of all of their legs and arms shorter or bigger for
the most part. That is not random. That was the spirit in which I'm talking about that.
This goes by the name of evolvability and/or canalization in the evolutionary biology
literature.
Is there anything else I should have asked you today or that you want to share
with our audience?
I guess I'll share this thought with the world. We're wielding tremendous
power as machine learning scientists, as deep reinforcement learning practitioners out there
right now in companies, in labs, in clinics, in startups. People are thinking of ways to
use this technology and how to develop more powerful versions of this technology. I just
think we want to be really conscious of the downstream effects of what we build. I think
we want to make sure that we build things as safely as possible and to help humanity
as much as possible. I think that everybody in our field should do the equivalent of taking
like a Hippocratic oath to try to do good things with this and not do harm. I think
we should just be humble about unknown unknowns and potential downstream effects. We should
try to take very seriously your individual responsibility as somebody who's helping to
build very powerful technology and that we want to try to do it really safely and in
a way that's as beneficial as possible.
Jeff, thanks for taking the time today to share
your thoughts and your insight with TalkRL listeners. Thank you, Professor Jeff Clune.
Thank you very much again for the invitation and thanks to everybody who listened through
the podcast.