I'm very excited to have Professor Jeff Clune with us today. Jeff is an Associate Professor of Computer Science at the University of British Columbia. He's a Canada CIFAR AI Chair and faculty member at the Vector Institute. He's also a Senior Research Advisor at DeepMind. Welcome to the show. Thanks so much for being here, Jeff Clune.

It's a pleasure to be here. Thank you for inviting me.

So how do you like to describe what it is that you do?

That's a great question. Nobody's ever asked me that before. I think the shortest answer is that I'm a scientist first. I'm very curious and I want to understand the world. I want to do excellent experiments and understand the world more. And then I'm also a communicator, and I love to share what I have learned, or what my colleagues and I have learned, with the world. And that happens through scientific publications, through talks, through podcasts, but also in the classroom.

So over the years, every so often I come across one of your works and it always strikes me that you have a really unique approach. It seems like your goals are always very ambitious, super ambitious, which is really exciting. And then also you seem like you're not afraid to discretize and hash things, which I think some people shy away from. Another thing I've noticed is that your style is often very compute heavy; you even explicitly talk about algorithms that never end. Do you think that characterization is fair?

I totally agree with the first and the third. I love to swing for the fences and I'm most interested in ambitious ideas that might not even be ready for current technology, where you might have to wait a few years or even decades for them to be fully realized. And that involves using as much compute as I can get my hands on. I'm not quite sure what you mean by the second one though, so I won't commit to that just yet.

Okay, no worries.
I was just noticing that some algorithms like Go-Explore and MAP-Elites use a discretization, like discrete buckets or hashing, as part of the algorithm, which seems like it's not all that common and some people shy away from, but you get cool results by doing that.

I would actually say I don't like those aesthetically. I would try to avoid them. Usually that is a stand-in for what will work now to prove the other principles we're trying to show, but ultimately we want to move away from that to something that's more learned end-to-end and therefore likely to use things like continuous vector spaces and latent embeddings.

Let's jump into one of your works here, VPT. That was video pre-training, learning to play Minecraft with video pre-training. I found this work really interesting and fun. It combines Minecraft and RL in a unique way. Can you give us a really brief rundown on what's going on in this work?

Yeah, absolutely. But before we talk about VPT, I want to quickly acknowledge that this was the work of my team at OpenAI. It was a wonderful set of collaborators and everyone should look at the blog post and/or the paper for the full list of amazing collaborators. I think one of the major lessons from the last, call it eight years of machine learning is that AI sees further by standing on the shoulders of giant human datasets, to borrow from Newton. We've seen this with GPT and DALL-E and all the other incarnations of this idea. The idea is that if you can go and pre-train on vast amounts of internet-scale data, then you have a huge leg up when you want to ultimately go solve a particular task like answering questions or helping somebody with customer service or generating an image. Often, the formula is that you pre-train on the internet, which teaches you how to understand the world, learn features about the world, et cetera, and then later you fine-tune to a particular task.
What we were seeing is that, hey, if we want to take reinforcement learning to really hard problems like learning to play Minecraft or learning to use a computer or learning to clean up a house or find landmines, these are what we call hard exploration problems. Typical RL is rather unintelligent. It does a whole bunch of random trials and if it happens to do something that is rewarded, it knows how to lock that in and do more of that thing. Of course, the problem is that in really hard problems, random exploration, taking literally random actions, will never solve the problem. For example, if I want to have a robot that cleans my house or learns to play Minecraft and it just randomly mashes the buttons or emits motor torques, it's never going to get a "good job" from me in the house cleaning example, and in Minecraft it's never going to do something like make a diamond pickaxe. This is a problem that I've focused much of my career on. You mentioned Go-Explore, which is another algorithm that is in this space. We thought to ourselves, all right, we could try to follow the historical path, which I have worked on and others have worked on, which is to create intrinsic motivation or better exploration algorithms, but we also could just take a page from the modern playbook and say, could we just go and learn from the internet first how to accomplish lots of things in the world, whether it be in Minecraft or robotics or whatever. We chose Minecraft as our domain. Once the agent generically knows what to do in the game, then we might give it a particular challenge, like get a diamond pickaxe, and it would be really easy. The RL would actually have a chance of learning that because the agent mostly knows how to play Minecraft. It just has to figure out what it is that we specifically want in this particular situation. That's the gist behind VPT. What we said is, okay, we're going to have an AI model, a large neural network.
I think we used half a billion parameters. It's going to watch videos of people playing Minecraft on YouTube. We went out and we got about eight years of clean video, which is to say video that doesn't have logos or, say, footage of the person playing layered over the top. It watches Minecraft videos for about eight years of sequential time. We obviously parallelized that, and afterwards it just knows how to play Minecraft in general. Then at the end, we say, okay, you're a relatively good Minecraft playing agent. Now we want you to go do a specific task, which is to get a diamond pickaxe. Now, just to set the context for the listener, that's a really, really hard task. It takes humans over 20 minutes and about 24,000 actions. We're talking about actions like moving your mouse, clicking a button, et cetera. This is orders of magnitude harder than many RL things that we've done in the past. This system can basically do it. It learns to be able to go get that diamond pickaxe. If you didn't do the pre-training, you'd have no hope. Almost no matter how long you ran your algorithm, within reason, even if you had a planet-sized computer, you probably wouldn't learn how to do this thing. That is the high-level result. The one thing that I haven't mentioned, which is essential to this work, is that for things like GPT and DALL-E or the music generative AI that you get on the internet, in almost all of the domains where this has been applied, the labels come for free. They exist. GPT says, I'm going to read 100 words of an article and predict the 101st word and then the 102nd word. Those words are right there in the original article. Same with predicting musical notes or predicting pixels in an image. The difference in trying to learn to act is that you are watching a video of somebody playing Minecraft and you see the camera feed of what is happening in the world as they act, but you don't see the actual actions they took.
If I watch somebody editing an image in Photoshop, I don't see the keyboard shortcuts they're hitting. I don't actually know what they're doing with their mouse, though I might be able to infer it from their cursor. That was an extra challenge we had to address in the Minecraft case. We actually solved it with a pretty simple technique that works very well, and it works like this. We pay some people to play a couple hundred or a couple thousand hours of Minecraft, which doesn't cost that much money. For those contractors, we record the actions that they're taking while they're playing the game. Then we train a model to look at a video of Minecraft and be able to look at the past and the future, say the past two seconds and the future two seconds within a video. Its job is to tell us the action that must have been taken in the middle. This is quite a simple task. If you are watching Super Mario Brothers or Minecraft and all of a sudden you see the character jump, you know somebody probably just hit the jump button. That's a pretty simple task. You don't need a lot of data to learn that task. We just collect a very small amount of what we call labeled data, which has the actions with the video. This model then has to say, for any little snippet of video, if I see the past and the future, it's pretty easy for me to guess the action that must have been taken at time t in the middle. Once I have that model, then I can go to YouTube, where I don't have actions, and I just run this labeler and it tells me all of the actions that were taken at every time step in the video. Now I have a massive labeled dataset of Minecraft videos and the actions that were being taken at every time step.
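As a rough illustration of the labeling step described above, here is a toy Python sketch. It is not the VPT implementation: the one-dimensional world, the three-action set, and the function names `play`, `train_idm`, and `label_video` are all assumptions made up for this example, and a majority-vote lookup table stands in for the neural network. The point is only the shape of the pipeline: fit an action labeler on a small set of recordings that include actions, then use it to label a much larger set of "videos" that do not.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy action set: each action moves the player along a line.
ACTIONS = {"left": -1, "stay": 0, "right": 1}


def play(num_steps, rng):
    """Simulate a player; return the frames (positions) and the actions taken."""
    frames, actions, position = [0], [], 0
    for _ in range(num_steps):
        name = rng.choice(list(ACTIONS))
        position += ACTIONS[name]
        frames.append(position)
        actions.append(name)
    return frames, actions


def train_idm(frames, actions):
    """Toy action labeler: for each observed change between the frame before
    and the frame after, remember the action most often responsible for it."""
    votes = defaultdict(Counter)
    for t, action in enumerate(actions):
        votes[frames[t + 1] - frames[t]][action] += 1
    return {delta: counts.most_common(1)[0][0] for delta, counts in votes.items()}


def label_video(idm, frames):
    """Infer the action between every pair of neighbouring frames."""
    return [idm[frames[t + 1] - frames[t]] for t in range(len(frames) - 1)]


rng = random.Random(0)

# Small "contractor" dataset: frames recorded together with the true actions.
contractor_frames, contractor_actions = play(200, rng)
idm = train_idm(contractor_frames, contractor_actions)

# Large "YouTube" dataset: we only get to look at the frames.
video_frames, hidden_actions = play(1000, rng)
pseudo_labels = label_video(idm, video_frames)
```

In this toy world the labeler sees one frame on each side of the action, and each frame change maps to exactly one action. The real task is harder, which is why the model described in the conversation looks at a window of past and future frames. The resulting `(video_frames, pseudo_labels)` pairs are what a behavioral cloning step would then train on.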
Then I can just do standard behavioral cloning, which is also called imitation learning, and which is also exactly what GPT does. It basically says: go from the past, everything we've seen up until now, to the next action I should take, and then the next action, and then the next action. Then you can roll this out autoregressively in a simulator and boom. That is the general formula in VPT. It first gives you a zero-shot model that knows how to play the game. It can go out and do things like make crafting tables and get wooden tools, but it also has the core knowledge so that once you start doing RL on it, you can actually ask it to go learn to make a diamond pickaxe and it can do quite well.

Yeah, I love that result and also how you got there. I mean, it seems like a special case of Yann LeCun's cake, where the bulk of the work was this unsupervised or self-supervised training, but you had this extra step here. You were able to use all this unlabeled data and, I guess, turn it into labeled data. Was that two-stage design obvious to you from the start, or did you kind of arrive at that after some experimentation and thought?

Yeah, great question. Actually, that idea was the start of the project. We were having a conversation and basically we said, hey, this would probably work. It was like, that seems like a really powerful idea. Let's go do it. We committed to it and the overall project took about a year and a half to get it all to actually work. It wasn't like we set off with the goal to do something vaguely in the space and eventually through trial and error we got to here. It was more like, that is an idea that will scale and allows us to take the modern unsupervised paradigm and the GPT paradigm, but now take it to domains where we want to learn how to act, how to control a computer, how to control a robot, how to control an agent in a virtual world.
I want to give a lot of credit to Bowen Baker because it was in a conversation with him where we landed on this idea and decided that this would be a really profitable direction to go in. He's the first author on this paper.

Another thing I thought was really cool here is that you have this non-causal model, kind of a non-standard type of model, instead of a forward RL model that's predicting the next state or an autoregressive model like in GPT. You had this setup where you were trying to predict the action in the middle of a stack of frames. That kind of reminded me a bit of BERT, like filling in a word in the middle. I was just wondering if that idea of having a non-causal model filling in the action in the middle was something that you had right out of the gate, or how did you come to that? I remember reading in the paper that you found that performed better than simply predicting the next action.

This is where it gets a little bit subtle. Because we have this two-stage process, first we need to get the labels. Then once we have labeled data, we're going to do standard behavioral cloning. The non-causal model, the model that gets to see the past and the future and predicts the middle, that is just the action labeler. This is a thing that's going to be trained to look at a YouTube video and say, hey, I see the last second and the next second, the action that was taken right in the middle was jump or move the mouse to the left.

Okay, but is that action labeler network what you're using as the basis of your foundation model, or no?

No. That just gives us labels for YouTube videos. For YouTube videos, we don't know the actions taken. We call this the inverse dynamics model, the IDM, but you could also call it the action labeler. It gets to see the past and the future and tell us the action that must have happened in the middle. Then we go to a YouTube video for which we don't have actions.
We run the labeler and now it gives us what it thinks is the label at every time step in that video. Now we have eight years of labeled YouTube videos. We have the video and the corresponding action, or the action we think happened. Now we throw away the data labeler, which is the non-causal bit. We now train just a normal neural net that goes from all the video up until time t and tells us what action to take at time t. It does not get to see the future. We take that action and send it to the simulator. It sends us back the next frame of observations. We give it back to the model. It tells us the next action. The final model that learned to play Minecraft is a causal model. It is a normal model. We don't know the future. We're actually rolling this out in a real simulator and so we can't give it the future. Like any RL agent, it gets to act from the past and has to take an action without knowing the future. That non-causal thing was only to get the labels, to infer the labels, and then train the model with behavioral cloning.

I see. Thanks for clearing that up. I guess I was confused on that point. You use the phrase AI-GA, AI-Generating Algorithms. Can you talk about what you mean by that phrase? Do you consider an AI-GA to be an RL algorithm?

An AI-Generating Algorithm is basically a research paradigm. It also ultimately will be an artifact once we have the first one of them. The general idea is that we're going to try to have AI that produces better AI. It's a system that will try to go all in on the philosophy that we should be learning as much of the solution as possible. To set some context, I think if you look out into machine learning, what most people work on is what I call the manual path to AI. They're trying to build AI piece by piece. Like, oh, I think we need a ResNet connection or I think we need layer norm or I think we need the Adam optimizer.
I think we need this kind of attention, that kind of attention. I think we need to train in these kinds of environments. Every paper in ML is like, here is either a building block that we didn't know we needed before that I think we need, or an improved version of an existing building block. That's what most ML research does. That raises the questions of, one, can we find all these building blocks and the right versions of them? Two, when are we going to put all these building blocks together into an actual big AGI or powerful AI? That manual path, I don't think it scales very well. I think if you look at the history of machine learning, there's a very clear trend, which is that hand-designed pipelines are replaced by learned pipelines, entirely learned pipelines, once we have sufficient compute and data. We've seen that all over the place. We used to hand code vision features and language features. Now we learn them. We used to hand code architectures. Increasingly, those are being learned and searched for. We used to hand code RL algorithms. Now we have meta-learning, which can sometimes learn a better algorithm, et cetera. The AI-GA paradigm is kind of like staring that trend directly in the face and saying, hey, if it's true, then we should be learning how to produce powerful AI, all of it. I think if we want to push on this, in the paper, I actually had three original pillars. I'll talk to you about a fourth that I added later. The first pillar is that we should be searching for architectures automatically. The second one is that we should be learning the learning algorithm, or meta-learning. That is very much an RL problem. Then the third pillar is automatically generating the right training environments, the right data, however you want to think about that.
If you put all those three pieces together, what you could end up with is something that looks a lot like what happened on Earth, where you have a very expensive outer loop algorithm. In the case of Earth, it was Darwinian evolution, but in an AI-GA, it almost surely would be something more efficient and look different. You basically have this outer loop that's searching over the space of agent architectures and learning algorithms, and generating its own problems and data. Ultimately, that thing itself, even though it's very compute inefficient, will produce a sample-efficient AI that is probably more sample efficient than humans. Darwinian evolution produced you. You're the most impressive brain we know of on the planet. Increasingly, an AI-GA will get better and better at producing something that ultimately will probably surpass human intelligence. That's the general idea. I mentioned I added a fourth pillar recently. It's basically a catalyst that speeds everything up. That is standing on the shoulders of giant human datasets and using pre-training on human data to basically get this whole process moving faster. There are some benefits to doing that versus not. We won't get into those details right now, but that is an extra pillar that I think is important if you want to make this all happen extremely fast.

Do you consider some of your other works to be concrete steps towards AI-GA, or is it more a distant abstract goal?

Yeah, I think a lot of the work that I have done in the past and that I'm currently doing is very much in this paradigm. I think VPT, for example, is an example of how you can use the fourth pillar to dramatically accelerate things. Ultimately, if you wanted to generate a lot of really complicated training domains like Minecraft and using computers and driving a car, et cetera, et cetera, it might be very inefficient to have to bootstrap up and learn how to do that from scratch.
If you can learn from humans and how they play video games and how they drive in driving simulators and how they use their computers, then you might be able to catch right up to where humans are and then step ahead off into the future. That could dramatically speed up AI-GAs. That's one example. Another paper that I'm very proud of that we worked on that is directly in the idea of AI-GAs, especially pillar three of AI-generating algorithms, is our paper on POET. This was one of the first papers that really put on the community's map the idea that we should be automatically generating training environments for RL agents rather than hand designing them. As you mentioned earlier, we want to ultimately do this in an open-ended way where the system will just continuously and automatically generate as many novel, different, interesting training challenges for the agent as possible. The agent has an infinite curriculum and continues to level up and learn new skills. That's yet another example. There are many more. I don't want to give you a 15-minute answer, but much of the work that I've done, I think, slots nicely into this AI-GA paradigm.

Let's talk about quality diversity. Can you mention what you mean by quality diversity? Can you compare and contrast that maybe with something we might have heard of, like population-based training and genetic algorithms?

Before we talk about quality diversity, I want to quickly acknowledge the leadership of Ken Stanley, Joel Lehman, and Jean-Baptiste Mouret in this area, as well as in the area of open-ended algorithms, which we will talk about later. In both cases, they've been pioneers, and I have been fortunate enough to work with them to develop these areas. Quality diversity algorithms are another area of work that I'm extremely excited about.
For a while there, my colleagues and I were working in a very, very small niche, but now it's exploded into an area that a lot of people are pushing on and are very interested in, which is fantastic to see. The general idea is quite simple, but it is very different from traditional machine learning and optimization. Typically, in machine learning or even RL, we are trying to solve one problem and we want the best solution for that problem. For example, we have a robot and we want it to move really fast, and so we just want the fastest robot. Quality diversity algorithms, in contrast, say, you know what I really want? I want a lot of diverse solutions. I want as many different ways to walk as possible, but for each one of those, I want them to be really good. That's the quality part. I have quality and diversity. It's inspired by natural evolution, and evolution can serve as an example. If you look out into the world, you have ants and three-toed sloths and jaguars and hawks and humans, and they're so magnificently different from each other, but each of them is very good at doing what it does. It's very good at making a living, doing whatever it does. If you say, for example, I want the fastest organism, if that's your performance criterion, well, if you only optimize for speed, then maybe you get a cheetah, but you would not get an ant and you wouldn't get a duck-billed platypus, but we like ants and duck-billed platypuses. They're really interesting. They're creative solutions, and they could potentially be useful. What we don't want to do is compete ants and cheetahs on speed because that's silly. You wouldn't even have an ant, but you do want ants to be fast. You prefer a fast ant to a slow ant. The general idea behind these algorithms is that within each niche or bucket or type of thing, we want the best there is, but we want as many of these buckets as possible.
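The "best there is in each bucket, and as many buckets as possible" idea can be sketched in a few lines of Python. This is a minimal MAP-Elites-style loop on a made-up toy problem, not the code from any of the papers discussed here: the `evaluate` function, the choice of the first two coordinates as the descriptor, the grid resolution, and the mutation size are all assumptions chosen for illustration.

```python
import random


def evaluate(solution):
    """Hypothetical toy problem. Quality: closer to the origin is better.
    Descriptor: the first two coordinates, each in [-1, 1]."""
    quality = -sum(v * v for v in solution)
    descriptor = (solution[0], solution[1])
    return quality, descriptor


def to_cell(descriptor, bins):
    """Discretize each descriptor dimension in [-1, 1] into `bins` buckets."""
    return tuple(min(bins - 1, int((d + 1.0) / 2.0 * bins)) for d in descriptor)


def map_elites(iterations=5000, dims=4, bins=10, n_random=100, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (quality, solution): the elite of each bucket
    for i in range(iterations):
        if i < n_random:
            # Bootstrap the archive with random solutions.
            child = [rng.uniform(-1.0, 1.0) for _ in range(dims)]
        else:
            # Pick a random elite and mutate it, clamping to the valid range.
            _, parent = rng.choice(list(archive.values()))
            child = [min(1.0, max(-1.0, v + rng.gauss(0.0, 0.1))) for v in parent]
        quality, descriptor = evaluate(child)
        cell = to_cell(descriptor, bins)
        # Compete only within the bucket: a fast ant never races a cheetah.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, child)
    return archive


archive = map_elites()
print(len(archive), "of", 10 * 10, "buckets filled")
```

The only competition happens inside a cell, so a solution that is mediocre overall still survives if it is the best one found for its bucket. The hand-picked, discretized descriptor here is exactly the kind of stand-in that the conversation suggests replacing with learned dimensions.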
I want to give an example of this because we had this paper in Nature in 2015, which I really think demonstrates the value of these quality diversity algorithms. That paper was led by Antoine Cully and Jean-Baptiste Mouret with Danesh Tarapore. We have a six-legged robot. We want it to be able to walk. We also know that the robot, if it's out in the world doing something like finding survivors, is probably going to become damaged. We're going to want it to adapt to damage as fast as possible. What a traditional optimization algorithm would do would be like, give me the fastest robotic gait. What a quality diversity algorithm might say is, hey robot, go learn how to walk in as many different ways as possible. Each one of them, we want it to be pretty good. Imagine you ultimately become damaged. Well, think about what you would do if you were in a forest and you became injured. What you wouldn't do is launch an RL algorithm that takes a million different trials of slightly different versions of the current best thing and tries to figure out the best way to walk despite damage. No. Instead, what you would do is you'd say, all right, I stand up, I try to walk. Ow, that really hurt. I can't walk the normal way. I'm going to try a totally different type of gait, which is maybe I'm going to walk on the ball of my foot on the injured side. You say, ow, that doesn't work. I'll try one more thing. I'll try to walk on the outside of my foot. You say, oh, that still hurts too much. You say, all right, whatever. I'm just going to hop out of this forest. Notice that ahead of time you had practiced how to hop. When we're children, one of the things we love to do is act like a QD algorithm. We go out and we try to figure out the fastest way to hop on one foot, on two feet, to walk on our tippy toes, to walk backwards, to walk on all fours, whatever it is. That's like playing. We're intrinsically motivated, effectively like a QD algorithm, to do this.
Then once we become injured, we can harness all of that practice and knowledge to adapt to injury in a really fast way. We did exactly that in the Nature paper. We had a QD algorithm. It learns to walk in a variety of different ways ahead of time. Once the robot becomes damaged, we paired it with an algorithm called Bayesian Optimization that basically said, I'll try one type of gait. If that doesn't work, I'll rule out all of those types of gait. I now know those don't work. I'll try an entirely different type of gait. I'll just bounce around and try a handful of different gaits until I find one that works pretty well. Then I'll use that to limp back to the station where I can get repaired or whatever. What we showed is that on this robot, once it's damaged, if you run typical RL like PPO or policy gradient or whatever, it's very sample inefficient. The search space is really large and it takes a ton of trials. If you run Bayesian Optimization alone, it also doesn't work very well. If you use a quality diversity algorithm and then you do this Bayesian Optimization thing, then in about six to 12 experiments and about one to two minutes, the robot can figure out a gait that works despite the damage and it can walk very, very quickly and soldier on with its mission. This was work with Antoine Cully and Danesh Tarapore and led by Jean-Baptiste Mouret, a longtime colleague of mine and one of the founders of quality diversity algorithms. The QD algorithm here is MAP-Elites, which is probably the most popular quality diversity algorithm, and which he and I coauthored around the same time. I just think it was a really nice example of the idea that once you have a big set, a big archive of things that are diverse and high quality, there are so many cool things you can do with them, and this was one example.

I remember that 2015 article in Nature.
It was on the cover actually and it was a really nice cover story. I understand that MAP-Elites explores this space of attributes, like say in the example that you gave in your CoRL 2021 talk. It was the 2D space of height and weight, I think. It obviously works well in 2D. I wonder, if you were to scale that up, if we had four or 10 dimensions and we had the curse of dimensionality making things less tractable, do you foresee ways to extend this idea to more dimensions?

I do, yeah, great question. First of all, just in terms of the Nature cover, there's a fun fact, which is that Nature usually wants a nice big picture of a fish or a spider or whatever it is you studied, and they don't like you to put data on the cover of Nature. But Jean-Baptiste Mouret had a fun idea, which is that we could use the MAP-Elites grid, literally the grid of the performance in each cell, as a giant matrix that's really colorful, and we made that the floor underneath the robot. So we snuck our data onto the cover of Nature, which we thought was kind of fun. Going back to your question, this actually comes full circle back to one of the first things we talked about, which is discretization. I am a huge fan of the idea that ultimately you don't want to pick these dimensions by hand, you'd rather learn them. You could take really high dimensional data and with a neural net you could learn a low dimensional space of interestingness and how things can be interestingly different, and then you could do MAP-Elites in a space like that. In fact, Antoine Cully, the first author of the Nature paper, has a paper that shows exactly that, a method called AURORA.
I think there are other people who have worked on similar ideas, but basically, just to give you a broad sketch of how it might work, imagine you have really high dimensional data, for example a video of your robot walking or images or something like that. You could then compress it with an autoencoder down to a small, low dimensional projection of your data. You can then use that latent embedding space of the autoencoder, the bottleneck of the autoencoder. You could then run MAP-Elites by discretizing that space and trying to fill as many of the buckets of that space as possible. Then that gives you new data on which you can train the autoencoder again, get a new latent embedding space, try to fill it up again, and keep repeating this process. In this way, you don't have to hand design the dimensions of variation, you could learn them. Another example I'll just throw out there, and I don't think anybody's done this yet, is that modern pre-trained models are another way that you can get a space of dimensions of variation that would be interesting. Take CLIP, for example. It knows that robots that walk on two legs are different than robots that walk on four legs, and if suddenly a zebra stands up on its hind legs and walks around, then it should give it a different caption. Models like CLIP and GPT and GPT-4, which is multimodal, almost surely already have many different dimensions of ways that things can vary in their latent embedding space. You could literally just steal that space and run MAP-Elites in that space and say, go get me all the weird robotic gaits that will light up all these different dimensions of variation. That probably gets you a huge amount of diversity, and in each one of those, make it high quality. Now we've got our QD algorithm and we didn't have to pick the dimensionality of the space manually. We also allowed it to be very high dimensional.

Let's move to Go-Explore.
If I understand, this is a very high fidelity exploration algorithm. Could you explain at a high level what the main ideas are here with Go-Explore?

Before I explain Go-Explore, I want to acknowledge the wonderful colleagues that worked with me on this paper, who were Joost Huizinga, Adrien Ecoffet, Joel Lehman, and Ken Stanley. One of the Achilles heels, maybe the Achilles heel, of reinforcement learning is exploration. If you never happen through random actions to do the thing that gets rewarded, then you have no signal for how to get better. It's like playing warmer and colder, but you never get the answer warmer. You're just cold all the time. If you go back to the original DQN paper, which is I think the face that launched 10,000 papers in the sense of kicking off the deep reinforcement learning revolution and putting DeepMind on the map, they did pretty well on a number of Atari games. After that, better and better algorithms did better and better, but there was one game in which they literally got zero, and that is Montezuma's Revenge, because it's very, very difficult to ever get any reward in that game just through random actions. For a long time, people held this game, Montezuma's Revenge, up as an example of a hard exploration problem for which our current algorithms are failing. It became a grand challenge in the field to see if we could solve this game. While progress was being made on all the other Atari games, there was a small set of games in the Atari benchmark, these hard exploration games like Pitfall and Montezuma's Revenge, in which we were making very, very little progress. That's just to set the stage. The natural thing that people do in reinforcement learning when you have a hard exploration problem is say, hey, the extrinsic reward function, I'm never triggering it. I'm never getting any signal from it because I'm never doing whatever it is that I need to do to get that reward signal. I should be intrinsically motivated.
I should have my own reward for exploring, for going to new states, doing new things, learning how the world works. There have been a lot of different methods over the decades in terms of how you might do that, and they're all really interesting, and in some problems they've been shown to work better than nothing. But if you look at this Montezuma's Revenge game, none of those methods were really moving the needle. In fact, a couple of weeks before Go Explore came out, there was this paper by OpenAI, before I joined OpenAI, that created a huge splash because they got all the way to 11,000 points on Montezuma's Revenge, which was a huge accomplishment. It was a big step up over everything that had happened before, and it basically gave bonuses for getting to new states of the game, for seeing new stuff. I spent a lot of time thinking about why intrinsic motivation is not working better. It works better than nothing, but it doesn't solve the game. It leaves a lot on the table. I started thinking there are basically two pathologies that I think exist at the heart of reinforcement learning even when it has intrinsic motivation. One of them is what I call detachment. An intrinsically motivated algorithm might reward you for getting to new places. Imagine you're standing in a hallway, and at the beginning, you could go left or right. Everything's new and you get a reward. If I go left for a little while, I consume intrinsic motivation, but then at some point that agent happens to die or whatever. We restart it back in the middle of the hallway. It might happen to go right now because there's intrinsic motivation over there too, but now when you get reset the next time, there's no intrinsic motivation nearby where you're starting because everything's been consumed. Basically, I've detached from the promising frontier of exploration. The algorithm hasn't remembered where it has been in the past, and so where it should return to in order to reach new places.
That was one thing that I think these algorithms weren't doing. The other thing I think is maybe even more important, and I call this derailment. Imagine, for example, that you were rock climbing and you climbed three quarters of the way up a wall, which is maybe really, really difficult, and you're gobbling up intrinsic motivation the whole time. You're really excited as an agent: hey, I'm getting to someplace new. Well, what do we do with that agent? We say, hey, that was really good. Wherever you just got to, you should go back there. We start them again, and we had just told them, good thing, but now as they're trying to re-climb that wall, we're like, but we also want you to go to new places, not exactly back there. We're going to sprinkle in random actions the whole time you're acting. But if that wall was really hard to climb in the first place, and now the agent is trying to do it again, we're basically knocking it off the wall on purpose by injecting all these random actions or noise into its policy. We're never really letting it get back to that place. We said we should do things differently. We should adopt this mentality, which ultimately became the name of the paper in Nature, which is: we should first return, then we should explore. Hey, if you got three quarters of the way up a rock wall or you got deep into some dungeon or whatever, don't go back there and try to explore along the way. No, no, no. Just go back there first, and then once you get there, explore to your heart's content. When we put those two pieces together, when we solved both of those pathologies, all of a sudden we had an algorithm called Go Explore, and on games like Montezuma's Revenge, it got ridiculously high scores, like 18 million, I think, was our best policy, and we probably could have gotten higher. It beat the human world record, which was 1.2 million at the time.
It blew everything else that had come before it in terms of RL out of the water, and it did that on many games in Atari. Ultimately, you could solve the entire Atari benchmark suite. If the listener will indulge me in trying to explain exactly how Go Explore works very quickly, it's quite simple. We basically have a first phase where we say, hey, start taking actions in the environment. It could even be literally random actions, but every time you get to a new place, a new state, a new interesting situation in the game, we'll save that in an archive, and we'll also save how you got to that interesting place, and then we have this archive of places where you've been that are pretty cool. All we'll do now is pull one of those locations or situations out of the archive, and we'll say, first, go back to that situation, and we could do that by just saying, replay the moves that got you there, or by training you to get back there without stumbling or whatever. Once you're back in that situation, then you can explore from there, and you could do that with just random actions or a policy or whatever. Note how simple it is. If I got three quarters of the way up the wall, I'm just going to go right back there. Even random actions from there are likely to get to new places, like a little bit higher up the wall, and then I'll save those places, go back, and do a little bit more exploration from there. I could probably go a little further up the wall. Basically, what you get is this expanding archive of stepping stones, or situations that I've been to that are interesting, from which I can explore and just get another new situation and another new situation and another new situation. Once you have that, basically, eventually, you're going to start to discover all these highly rewarded situations in the game, including maybe how to beat the game.
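The exploration phase just described (keep an archive of situations, pick one, return to it, explore from it, save anything new) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `ClimbEnv` is a made-up deterministic "rock wall" in which exactly one action per height moves the agent up and any other action drops it to the ground, its `save_state` and `restore_state` methods are hypothetical, and cells are picked uniformly at random because the interview does not say how they are chosen.

```python
import random

class ClimbEnv:
    """Toy deterministic wall: one correct action per height, any other action falls to 0."""

    def __init__(self, height=20, n_actions=4):
        self.height = height
        self.n_actions = n_actions
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        if self.pos >= self.height:  # already at the top: nothing more to do
            return self.pos, True
        if action == (7 * self.pos + 3) % self.n_actions:
            self.pos += 1            # the one correct hold at this height
        else:
            self.pos = 0             # any mistake knocks the agent off the wall
        return self.pos, self.pos >= self.height

    def save_state(self):
        return self.pos

    def restore_state(self, state):
        self.pos = state

def go_explore(env, iterations=20000, explore_steps=5, seed=0):
    rng = random.Random(seed)
    # Archive: cell -> saved simulator state plus the actions that reach it from reset.
    archive = {env.reset(): {"state": env.save_state(), "actions": []}}
    for _ in range(iterations):
        cell = rng.choice(list(archive))       # 1. select a cell from the archive
        entry = archive[cell]
        env.restore_state(entry["state"])      # 2. first return (restore, no exploring on the way)
        actions = list(entry["actions"])
        for _ in range(explore_steps):         # 3. then explore with random actions
            action = rng.randrange(env.n_actions)
            obs, done = env.step(action)
            actions.append(action)
            # 4. save new cells, or a shorter way of reaching a known cell
            if obs not in archive or len(actions) < len(archive[obs]["actions"]):
                archive[obs] = {"state": env.save_state(), "actions": list(actions)}
            if done:
                break
    return archive

env = ClimbEnv()
archive = go_explore(env)
print(f"highest cell reached: {max(archive)} of {env.height}")
```

In this toy wall, purely random play from the ground reaches height h with probability (1/4)^h per attempt, while the archive only ever needs one lucky step from the current frontier. Keeping the shortest action sequence per cell also reflects the later point that the archive holds the most efficient known way to reach each situation.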
Then you can go back and say, hey, maybe I've been doing some tricks to help you get back there really easily without dealing with the fact that sometimes in the game, there might be a lot of noise and stochasticity. Now that I know what we're trying to do and accomplish in this game, I'll just train a policy that does that and does it really well, does it in the presence of noise, and boom. Now I can have a policy that starts from the beginning, has access only to what a normal player of the game would have, and can get extremely high scores. I first encountered your name and your work, actually, at the NeurIPS 2018 deep RL workshop in Montreal. I remember you presenting Go Explore on the main stage there and the score that you got on Montezuma's Revenge. I was actually at the back of the room. I remember David Silver commenting on some of the assumptions with respect to RL and saving the simulator state. I recently learned that you've extended the algorithm to have a policy version that doesn't require a simulator at all. Is that right? Can you talk about the simulator version versus the policy version? That's right. I'm glad you were there, and I remember David's question. These are some of the details that I touched on a little bit at a high level, but I can explain in a little bit more detail. In the simplest version of the algorithm, the first version, the one we presented in the room that day, if we're in a video game and I get to an interesting situation, then I can literally just save the simulator state. It's the equivalent of freezing the world when I'm three quarters of the way up the rock wall. Then the next time I play the game, instead of saying, hey, try to climb back to that spot three quarters of the way up the rock wall and then explore from there, we literally just say, oh, actually I have the frozen state of the world where you're three quarters of the way up the rock wall. I'll just resurrect that.
You wake up and you're right there and your job is to explore from there. That's taking advantage of the fact that a lot of the work that we're doing in Atari and other simulators allows you to save the state of the world and resurrect it. We don't even have to really first return. We can just pull something out of the archive, explore from it, and if we get somewhere new, add it to a growing set of simulator states. Now, some people thought, hey, that's not the original challenge that we're trying to solve here. In the real world, you might not be able to do that. We said, okay, fine. We'll show you that these ideas are really general and really powerful. All we do is, instead of saving the state of the world, like three quarters of the way up the rock wall, we will just train a policy. We will give it the goal of, hey, go to this place three quarters of the way up the rock wall. It's a goal-conditioned policy. Based on that goal, it'll be trained just to go back there and get really robust at going back there. Then we have a higher level controller that says, hey, go to the spot three quarters of the way up the rock wall, and the goal-conditioned policy goes there. We're not resurrecting simulator states or making the game deterministic or anything. It deals with all the noise in the world. Once it's there, then we can give it some other goal, which is explore around you, or maybe generate a goal that's nearby, et cetera. Once we do that, we have a version of the algorithm where we're not making the world easier in phase one. In the original version, we took advantage of the fact that simulators can be made deterministic and/or that you can resurrect simulator states, just to figure out what the game wants us to do, like how to get a good reward. Then we had a phase two that made a policy that was a neural net that was robust to noise. In this version of the algorithm, you never make those simplifying assumptions.
You just basically start out training a policy from the get-go, but it's still got the principles of Go Explore in there, which is that we want you to first return to a place that we consider interesting to explore from and then explore from there. If you get to an interesting place, we'll save that in this archive. I do want to point out one more thing, which I think is really cool, especially in this conversation because we've already touched on quality diversity algorithms, including the MAP-Elites algorithm, which underlies the robotic adaptation work that I described, and that is that the principles of quality diversity algorithms are alive and well and inspired Go Explore. In the robotics case, we wanted as many different robotic gaits that were each high quality as possible, and then we harnessed that archive, that big library of different gaits, when we had damage. In Go Explore, we're doing the same thing, but what we're collecting is different trajectories or different policies within one world or environment or game. What we end up with is the highest quality way to get to a whole bunch of different situations or states in the game. You have the most efficient way to get three quarters of the way up the rock wall, the most efficient way to get to the vending machine, the most efficient way to get to the parking lot. By doing that, you expand out and explore the space of possibilities, and then eventually you can find something that lights up the reward function, and then if you want, you can distill a policy that only does that thing even better. I just think it's really cool because these principles that we get excited about in one context end up paying off to solve other really hard challenges in machine learning again and again and again, which I think is a pretty good sign that these principles are exciting. Great. Thanks for explaining that.
I'm so glad to hear about the policy version, and that was the biggest criticism that I heard about this algorithm. Some more traditional RL people were saying it wasn't following the RL rules, but you were able to overcome that with the policy version. Does the policy version perform well on some of these tasks too, like Montezuma's Revenge and the difficult problems? Is the policy version doing its thing? Yeah, it also does extremely well, and it does way better than all previous algorithms. It's more compute intensive because you're not taking advantage of restorable simulator states; you have to train a policy to go three quarters of the way up the rock wall. That's a metaphor. That's not actually part of the Montezuma game, but yeah, you're not taking advantage of some of these efficiencies, and it still does extremely well. It gets really, really high performance on these games, and the cost is just that you have to train this policy. There is one really cool thing I'll mention about the policy version that was not true in the original version: because you've trained a policy that can go three quarters of the way up the rock wall (I'm going to keep using that example), once it gets three quarters of the way up the rock wall, it knows how to rock climb really well, and so now it's going to be better at exploring from there. We do show in the paper that one benefit you get is more efficient exploration. Even though the overall algorithm is still more expensive, the exploration part is more efficient because you're no longer taking random actions to do the explore step of the algorithm. I actually think that really bodes well for the future. As you train big models on more and more kinds of domains, they're going to have all sorts of common sense and skill sets and understanding, especially if you did something like VPT ahead of time, for example.
Basically, we're going to become really efficient at exploring, and so you put that together with a powerful first return then explore, and now my exploration is efficient, and I'm off to the races to solve really, really difficult hard exploration challenges. Just going back to that simulator version for a second, does the simulator version... It might not fit the kosher definition of the RL problem setting, but can it still help produce real, helpful real world policies in different domains, and what kind of domains do you see this for helping us in? What kind of domains are most suitable for Go Explore? Both with VPT and Go Explore, we did some things that helped us get to our final policy, but the final policy itself just does the canonical thing. You put it in the game, and it plays from pixels with all the noise in the game. It takes actions just like any other policy in RL. Some purists were saying, hey, you might have gotten there via a different path and maybe an easier path, but the final thing still does the task that we wanted solved. The same is true with Go Explore. It produces a final policy that can play the game from pixels. If you didn't know how we got it, then you would say, okay, maybe I got this through traditional RL. It just seems to perform way better. That means that you end up with any problem that you were trying to solve with RL. If you tried to solve it with Go Explore, if it performs really well in the end, you'd be really happy, and it could go off and do what you need. To answer your other question, which is like, what are the domains that this really helps on? I would say any hard exploration RL problem, which is to say almost all of the unsolved RL problems. Maybe that's a little too strong, but many of them. For example, if you wanted to learn how to drive in a driving simulator, if you wanted to do robotics, then I think these algorithms could work really well. Let me give you an example. Imagine that in robotics, you train. 
Imagine that you want your robot to clean up your room and get a fire extinguisher and put away the dishes and make you an omelet and take the trash out and all this stuff. Those are all really, really, really hard exploration problems if you only know how to reward it once the trash is taken out or once the room is clean. You could use Go Explore to solve each of these problems. Then you could train a generalist policy on all these demonstrations of how to do each task to produce a generalist robot that knows how to do a huge variety of different things. I actually think this would be really cool. You have a thousand or a million different tasks that are all hard exploration. Go Explore could efficiently solve them all, give you demonstrations of how to solve those tasks, and now you could do a huge amount of what you might call pre-training, which is having a robot hear a task description and then know how to do it. You've got all of these solution demonstrations and kind of like GPT that knows how to complete a thousand different types of articles. Now you have a robot that knows how to complete a billion different kinds of tasks, and it becomes very general at zero-shotting new tasks when asked to do so. That's one futuristic project that I'd love to see with Go Explore. Cool. Sounds like hard exploration problems that have really good simulators would be suitable for Go Explore. Is that fair to say? I think that's right. I mean, it would be really cool to try to use the principles of Go Explore in a problem that doesn't have simulators. That would be really futuristic and ambitious, and that would be a great project for somebody to work on. I haven't seen that yet, but where it really shines is like many RL algorithms, when they could take advantage of lots of computing in a simulator and then learn a lot of stuff and then try to be efficient about crossing the reality gap. 
That seems to be a bigger trend these days where that's becoming very feasible to cross that gap, which is kind of amazing. It is true. As we have more and more compute and more and more pre-training and more and more data augmentation and more and more demonstrations, the reality gap is kind of taking care of itself in a sense, and that's why I think these algorithms like I just sketched out might just work. Just Go Explore in a really decent simulator or a good simulator with a lot of compute, and boom, you might zero-shot transfer to the real world. You used the phrase open-ended algorithms. Can you touch on how you define that? This is one of the topics I'm most excited about in machine learning and have been really since I started in science. To me, one of the most profound mysteries in the world is where we came from and also where all of the amazing creations of evolution came from. I look out into the world and I see jaguars and hawks, the human mind, platypuses, whales, ants. It's this explosion of interestingly different engineering marvels. I like to think about them as engineering marvels. The system is continuously innovating, right? Like it produced COVID, it produced new species all the time, and it's constantly surprising us. That is an example of an algorithm that we would call an open-ended algorithm. You keep running it and it just keeps surprising you. It keeps innovating, it keeps creating forever. We have no evidence that it's going to stop. There are other examples of that. Another one is human culture. Human culture, for example, science or art or literature, it keeps inventing new challenges and then solving them and the solutions to those challenges become new challenges. For example, in science, you invent one technology that basically allows you to answer totally new types of questions or make new innovations, and then there's new sciences to study those things and the interaction between those things and other things. 
The system just keeps on innovating and learning and accumulating knowledge and skills and an expanding archive of wonders. A quest that I've had is, what are the key ingredients that would allow us to create a process like that inside of a computer? One, because that would be fascinating and would allow us to maybe produce AGI and create alien cultures to study and all sorts of cool stuff, but also because it teaches us about the general properties that are required for a process to be open-ended, which we still don't really understand. We don't really know why nature works. I want to contrast those open-ended algorithms with what is traditionally sought after in machine learning and what certainly happens in machine learning. Typically, you want a certain solution to an optimization problem, whether it be a fast robot or the right way to schedule your final exams. You run it for as long as you can, and it gets better and better and better with diminishing returns. Eventually, it's done the best it can do. If you ran that thing for another billion years, nothing interesting would happen. What we want is an entirely different paradigm. We want to see whether you could create an algorithm which, in the words of Jean-Baptiste Mouret, would be interesting to come back and check on after a billion years, because evolution is 3.5 billion years in and running and still surprising us. Why can't we do the same thing with a computer algorithm? When I started my PhD, there was no algorithm I was aware of that was worth running for more than about a day, maybe a couple of days. As I've gotten later in my career, that went up to weeks, and now I think we're at about months. If you think about how long you might have a GPT or something run for, we're training things on the order of months. Now, we are throwing more and more compute at them as well, but roughly speaking, we've got nothing that we would want to run and come back and check on in a billion years.
That's the quest and the challenge. Could we create a system that would forever, as long as we run it, continue to innovate, delight, and surprise us in the same way as natural evolution and human culture? Okay, so that part makes sense to me. I guess it reminds me of efforts in things like artificial life, which have been around for a while. It seems like in some sense, some of these systems don't really have a hard goal. The goal is actually just to create complexity and to discover or explore that process of creating complexity. When it comes to evolution, it's very inefficient. It's not really clear entirely if you could say it's goal-oriented. I wouldn't say it's goal-oriented. What do you see as the goal of these open-ended learning systems, and how do you know when you're achieving that goal? Yeah, it's a great question. To quote Justice Potter Stewart, I don't know how to define it, but I'll know it when I see it. In fact, you're right that this field has traditionally been pursued in artificial life, although now it's becoming a mainstream topic in machine learning and reinforcement learning. I've spent a lot of my career in the artificial life community when that was the best place to do this work. Now, it's great to be able to talk about it with a wider set of people. In general, I do agree with you, evolution doesn't really have a goal. It didn't set out to produce ants or cheetahs or humans. It just happened to do so. The same thing is going to be true of our algorithms. We're going to have to create systems that are not seeking a goal. In fact, that's not an accident. It turns out, and Ken Stanley and Joel Lehman, my dear friends and colleagues, have been pushing on these ideas for a while, that if you have a goal, that probably prevents you from doing things that are really interesting because you get stuck on a local optimum or in a dead end.
What you really need to do is abandon the idea of having a particular goal and just do what we've been talking about with MAP-Elites and Go Explore and quality diversity algorithms. Just go out and collect a huge diversity of high quality stuff and stepping stones, and that will ultimately unlock tremendous progress. I want to give you an example. Let's assume, for example, that you wanted to produce a computer, and you went back a couple thousand years to the time of the abacus, and you are the king of a kingdom and you say, I will only fund scientists who make me machines that produce more compute. Well, you might get an abacus that has longer rods or more beads, or maybe it's a three dimensional abacus or something, but you would never invent the modern computer, because to invent the computer you had to have been working on vacuum tubes and electricity, which were things that were not invented because they had immediate benefits for computation. That was not in the minds of the people that invented those things. Here's a line from Ken and Joel's book: to achieve your greatest ambitions, you must abandon them. You can't have the objective of producing human level AI when you create your AIGA, your open-ended AIGA that can generate environments forever, because if you myopically start giving IQ tests to bacteria, you will never get human beings. You just have to go explore and collect stepping stones, and eventually you get the stepping stone, which is, say, the vacuum tube and the electricity, and someone's like, aha, I can put this together and make a computer. Or in the algorithm, it's like, hey, I've created this thing that can do these weird tasks, and boom, that turns out to be the key ingredient to produce something that looks like AGI and has general intelligence. I don't think we're going to be setting objectives.
I think we want to not have objectives, but as you said, how do I know when I run a system whether it's worth running for a lot longer, and how do I know when to shut it down? This is very difficult to measure, but often we can just look in there and see, and if really interesting things are happening, then we know that it's worth continuing. Say I created a simulation and the agents in the system were inventing society, they were having group meetings, they were electing leaders, or they were inventing technological innovations and then building on those, or they started inventing educational systems and they were teaching other AIs. If these kinds of things are happening, then I get pretty excited that I'm on to something. If I just run the system and it's generating more of the same or it's just generating white noise forever, I'm not interested. We don't really know how to measure it, but humans are pretty good at evaluating things even if we can't quantify how to measure them. I think we should bring our human judgment to bear in this field of research, even though scientists don't love hearing that. Yeah. Some of the things you're saying are reminding me of Henry Ford saying customers just wanted faster horses, and there's that British TV show Connections where James Burke was always relating how something would not have been invented without some other very obscure thing happening. I want to move on a little bit to the topic of the day. ChatGPT and those types of models are on fire. It seems like they represent a very different path towards AGI than the AIGA paradigm, or at least it seems that way to me. Do you see these two types of paths competing or do you see them as complementary in some way? I think the AIGA concept implies that we'd want to have an outer loop around GPT-4, but these models are presumably always going to be towards the edge of the feasible run size for their day, so that kind of precludes having an outer loop around them.
How do you see the interaction between AIGA and this current paradigm with the large language models and the massive models and the chat GBTs of the world? Yeah. Fantastic question. One that I think about a lot. This is kind of why I added a fourth pillar to the AIGA paradigm because I think pre-training on human data is effectively kind of like an orthogonal choice or an optional speed up. You could try to do the whole AIGA thing and never train on human data, but it probably will take a long time. Interestingly, it might produce intelligence that looks a lot less like humanity because it wasn't trained using human data and that might be beneficial once we start to study the space of all possible intelligences. Let's assume for now that you want to at least produce one really intelligent thing first before you worry about all possible intelligences. Well, training on human data is just a huge speed up. Doing something like GPT, training on that human data, I think it's a fantastic accelerant that kind of immediately catches AI roughly up to human, not human level intelligence, but gets us much closer than we would without human data. Let's say it that way. I still think you're going to need a lot of the elements in the AIGA paradigm. For example, if you want to produce intelligence that can solve problems that humans have never solved before, like a cure to cancer, for example, then you can't just train on human data. At least it's unlikely that a model that's trained just on human data would suddenly know the solution to cancer because humans don't know that solution. It's probably going to have to become a scientist and start conducting experiments and know which experiments to conduct and how to learn from them. That starts to be a lot of like pillar three. Pillar two, which is kind of meta-learning the learning algorithms. Similarly, GPT itself is an amazing meta-learner. 
It's one of the great surprises of the GPT paradigm, but I still think that, as it has to learn to explore and solve new problems and become a scientist and conduct these experiments, it's probably going to benefit from meta-learning things that are even more advanced than what comes from just training on human data. Then finally, on the architecture front, you certainly wouldn't be able to do architecture search where each atom in the search is literally a full GPT run. As you said, that's way too expensive. But I would not at all be surprised, and I think ultimately it probably will be very important and beneficial, to be doing architecture search in a way where you do experiments just like machine learning researchers do at small scale, figure out principles and better architectures, and then start to test them at larger and larger scales. Eventually, you do a scaling-law type of thing and you say, hey, I think we should try this big. Then you launch a very small number of experiments at scale. I still really believe in a system that's automatic, that is using AI to do architecture search and to come up with better learning algorithms via meta-learning, that's automatically generating new training data, and all of this in conjunction with this pillar four, which is the speed up, which is seeding the whole process with human data. To conclude, I think that the approaches are very complementary. GPT alone will not be enough, but GPT inside of an AIGA might allow AIGAs to actually accomplish their goal maybe even decades before we would have otherwise if we didn't take advantage of strategies like GPT. I heard you once predicted a 30% chance of AGI by 2030. Is that currently your estimate or can you talk about your timeline for AGI? Well, I only made that prediction in December. It's currently March. I haven't had too much of a change of heart in the few short months. I still think that's possible.
If anything, I might be a little bit more aggressive now, but I think right now I'll stick with it. Obviously, what we're seeing is tremendously powerful systems. I think probably the most interesting thing now is this: when you had a discussion about when AGI would come four, five, seven years ago, you didn't really have to define AGI because the whole enterprise was so hard that we basically meant that thing that we're not talking about. Now I think the systems are so good that your actual timeline really will come down to your definition, because if you define it like I have, as something like AI doing 50% of the work that got paid for in 2023, that definition is probably something that we will hit by, I'd say, 2030. I think that's increasingly becoming clear, but if you have a different definition, like it has to be better than humans at everything that humans do, well, then suddenly your timeline gets pushed out. I'll just say we're now within the range at which the definitions matter, because we're going to start crossing through each one of these definitions with rapid speed. I'll even put out something more. I don't know why nobody has done a Turing test on these current systems. I actually like the Turing test. If it was done well by a really smart group of people, I think that it would be hugely informative. I think there's a pretty good chance that GPT-4, well, let's see. I don't know if it would pass the Turing test, but I think it's got a pretty good chance, and I think GPT-5 probably would. Already we're seeing one of the definitions we've had since the founding of our field being surpassed. Again, we're back to being within the range where your definition determines what year you should predict, but no matter what, it's all happening and it's happening at a ridiculously fast pace. In the span of human history, it's happening in a blink.
Society needs to be really deeply thinking about whether we should do this, how fast we should do this, how do we do this safely, and what are the consequences of doing this. One detail from one of your talks, you mentioned that there's evidence that evolution in nature is better than random. Where's that fact from? The random part of evolution is the mutation. I mean, you could even argue about that, but let's just take that as a given. Yeah, you randomly mutate a current thing and that's not a very smart thing to do, but the non-random part of evolution is that you then keep the stuff that worked really well. If you started off 3.5 billion years ago and you just randomly sampled genomes, then you'd never get anywhere. It's actually very similar, and I've never made this connection before right now, but it's kind of similar to Go Explorer. Go Explorer says, hey, you got three quarters of the way up the rock wall. That was really hard. Let's go there and then we'll do some random exploration from there. That's one version of Go Explorer and that allows you to get maybe one hold farther up the wall and that was good. Now you're more than three quarters of the way up the wall, but if you just randomly search for policies, you'd never end up three quarters of the way up the wall. Basically, the cheetah genome is the equivalent of a stepping stone that has been collected. It's like the equivalent of a policy that can get three quarters of the way up the wall. If I start there and then I do a little random exploration, I might get something that's better than a cheetah. That is very, very different than just saying, in the space of all possible genomes, I will randomly generate a genome and I will see if it is good, because if you do that, nothing will even self-replicate, because making a machine that can self-replicate is impossibly hard. You're not saying the random mutations are somehow biased towards beneficial mutations. That's not what you're saying.
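The contrast drawn above, between sampling whole genomes at random and making random changes to something that already works while keeping the improvements, can be illustrated with a small toy. This is only a sketch under stated assumptions: the bitstring genome, the count-the-ones fitness function, and the names `GENOME_LENGTH`, `BUDGET`, `pure_random_search`, and `mutate_and_keep` are all hypothetical choices made for illustration. It is not Go Explorer and not a model of biological evolution. Both strategies get the same number of fitness evaluations.

```python
import random

GENOME_LENGTH = 200  # hypothetical genome size (number of bits)
BUDGET = 5000        # fitness evaluations allowed per strategy


def fitness(genome):
    """Toy fitness: the number of 1 bits in the genome."""
    return sum(genome)


def random_genome(rng):
    """Sample a genome uniformly from the space of all bitstrings."""
    return [rng.randint(0, 1) for _ in range(GENOME_LENGTH)]


def pure_random_search(rng):
    """Sample BUDGET independent genomes and report the best fitness seen."""
    best = 0
    for _ in range(BUDGET):
        best = max(best, fitness(random_genome(rng)))
    return best


def mutate_and_keep(rng):
    """Random single-bit mutations, keeping a child only if it is no worse.

    The mutation is random; the selection step (keeping what worked)
    is the non-random part.
    """
    current = random_genome(rng)
    current_fit = fitness(current)       # counts as one evaluation
    for _ in range(BUDGET - 1):
        child = list(current)
        i = rng.randrange(GENOME_LENGTH)
        child[i] = 1 - child[i]          # flip one randomly chosen bit
        child_fit = fitness(child)
        if child_fit >= current_fit:     # keep the stepping stone
            current, current_fit = child, child_fit
    return current_fit


def compare(seed=0):
    """Return (best fitness from random sampling, fitness from mutate-and-keep)."""
    rng = random.Random(seed)
    return pure_random_search(rng), mutate_and_keep(rng)


if __name__ == "__main__":
    random_best, evolved_best = compare()
    print("pure random sampling:", random_best)
    print("mutate and keep:     ", evolved_best)
```

In this toy, the mutate-and-keep loop climbs steadily because each kept genome acts as a starting point for the next random change, whereas independent random samples stay clustered near half the maximum fitness.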
There is a whole line of work on evolvability that actually does argue that it is non-random and it can do better than random. The first point I want to make is that even if you assume that the actual mutation operator is random, the algorithm itself is far, far, far, far, far from random. Now, if you're specifically asking, can you do better than random with the mutation operator? Yes, you can actually. The way that that works is that you come up with basically a representation that is intelligent such that when you do random things to it, good things happen. For example, if you go into the space of GANs and you randomly sample vector codes, you don't get garbage, you get different faces, because the representation maps noise to the low-dimensional manifold of all possible human faces, and only faces, if it's been trained to generate faces. You could have random mutations that still make really good changes. Similarly, many people argue, and I think there's some pretty good evidence, that evolution is the same. It allows you to randomly change the genome and then maybe both of your legs get longer or shorter, not just one. You don't have asymmetry. It has kind of captured the regularity that legs should be the same length and legs and arms should be proportional to each other. Random changes to this representation produce creatures that have kind of all of their legs and arms shorter or bigger for the most part. That is not random. That was the spirit in which I'm talking about that. This goes by the name of evolvability and/or canalization in the evolutionary biology literature. Is there anything else I should have asked you today or that you want to share with our audience? I guess I'll share this thought with the world. We're wielding tremendous power as machine learning scientists, as deep reinforcement learning practitioners out there right now in companies, in labs, in clinics, in startups.
People are thinking of ways to use this technology and how to develop more powerful versions of this technology. I just think we want to be really conscious of the downstream effects of what we build. I think we want to make sure that we build things as safely as possible and to help humanity as much as possible. I think that everybody in our field should do the equivalent of taking a Hippocratic oath to try to do good things with this and not do harm. I think we should just be humble about unknown unknowns and potential downstream effects. We should each take very seriously our individual responsibility as somebody who's helping to build very powerful technology, and we want to try to do it really safely and in a way that's as beneficial as possible. Jeff, thanks for taking the time today to share your thoughts and your insight with our listeners. Thank you, Professor Jeff Clune. Thank you very much again for the invitation and thanks to everybody who listened to the podcast.