[00:00.000 - 00:11.120] TalkRL podcast is all reinforcement learning all the time, featuring brilliant guests, [00:11.120 - 00:17.320] both research and applied. Join the conversation on Twitter at @TalkRLPodcast. I'm your host [00:17.320 - 00:23.440] Robin Chauhan. [00:23.440 - 00:27.680] Dr. Natasha Jaques is a senior research scientist at Google Brain, and she was our first guest [00:27.680 - 00:32.440] on the show three and a half years ago on TalkRL episode one. Natasha, I'm super honored [00:32.440 - 00:36.440] and also totally stoked to welcome you back for round two. Thanks for being here today. [00:36.440 - 00:39.960] Well, thank you so much for having me. I'm stoked to be back. [00:39.960 - 00:44.200] So when we did that first interview back in 2019, I remember you were just wrapping up your [00:44.200 - 00:49.960] PhD at MIT. And I can tell you've been super busy and lots of things have been happening [00:49.960 - 00:55.520] in RL and AI in general since then. So can you start us off with like, what do you feel [00:55.520 - 01:00.240] have been like the big exciting advances and trends in your field since you completed your [01:00.240 - 01:01.240] PhD? [01:01.240 - 01:05.440] Yeah, well, I think it's kind of obvious, right? I mean, everyone's obsessed with the [01:05.440 - 01:11.360] progress in large language models that has been happening, you know, ChatGPT, how the [01:11.360 - 01:17.280] API is getting deployed. I think that's kind of the, I mean, image and language models, [01:17.280 - 01:19.440] diffusion models, there's so much going on. [01:19.440 - 01:23.840] Yeah, like you said, all this buzz around ChatGPT, and reinforcement learning from [01:23.840 - 01:28.000] human feedback and the dialogue models in general. And of course, you were really early [01:28.000 - 01:33.280] in that space. And a lot of the key OpenAI papers actually cite your work in this space. [01:33.280 - 01:39.000] And there's a few of them. 
Can you talk a bit about how your work in that area relates [01:39.000 - 01:44.560] to what OpenAI is doing today and what these models are doing today? [01:44.560 - 01:53.040] Sure, yeah. So I guess, like, let me take you back to 2016, when I was thinking about [01:53.040 - 01:57.920] how do you take a pretrained language model, but in that case, I was looking at actually [01:57.920 - 02:04.640] LSTMs, so like, early stuff, and actually fine-tune it with reinforcement learning. And at [02:04.640 - 02:10.000] that time, I was actually looking not at language, per se, but at like, music generation and [02:10.000 - 02:15.600] even generating molecules that might look like drugs. But I think the molecules [02:15.600 - 02:20.280] example is a really good way to see this. So basically, the idea was like, we have a [02:20.280 - 02:25.400] data set of known molecules, so we could train a supervised model on it and have it generate [02:25.400 - 02:30.240] new molecules. But those molecules don't really have like the properties that we want, right? [02:30.240 - 02:34.960] We might want molecules that are more easily able to be synthesized as a drug. So we have [02:34.960 - 02:42.000] scores that are like the synthetic accessibility of the molecule. But neither thing [02:42.000 - 02:45.960] is perfect. If you just train on the data, you don't get optimized molecules. If you [02:45.960 - 02:50.480] just optimize for synthetic accessibility, then you would get molecules that are just [02:50.480 - 02:57.280] like long chains of carbon, right? So they're useless as a drug, for example. So what you [02:57.280 - 03:01.040] can see, like in this problem, you can use like reinforcement learning to optimize [03:01.040 - 03:05.400] for drug likeness or synthetic accessibility, but it's not perfect. The data is not perfect. [03:05.400 - 03:09.600] So how do you combine both? 
So what we ended up proposing was this approach where you pretrain [03:09.600 - 03:14.220] on data, and then you train with RL to optimize some reward, but you minimize [03:14.220 - 03:18.600] the KL divergence from your pretrained policy that you trained on data. So we call that like [03:18.600 - 03:23.080] your pretrained prior. And this approach lets you flexibly combine both supervised [03:23.080 - 03:28.140] learning, get the benefit of the data, and RL, where you kind of optimize within the [03:28.140 - 03:35.080] space of things that are probable in the data distribution for [03:35.080 - 03:39.960] sequences that have high reward. And so you can see how this is obviously related to what's [03:39.960 - 03:45.040] going on with RLHF right now, which is that they pretrain a large language model [03:45.040 - 03:49.320] on a data set. And then they say, let's optimize for human feedback, but we're still going [03:49.320 - 03:53.580] to minimize that KL divergence from that pretrained prior model. So they still end [03:53.580 - 03:58.080] up using that technique. And it turns out to be pretty important to the [03:58.080 - 04:04.760] RLHF framework. But I was also working on RLHF, the idea of like learning [04:04.760 - 04:10.800] from human feedback. In around 2019, we took that same KL control approach. And we actually [04:10.800 - 04:15.880] had dialogue models try to optimize for signals that they got from talking to humans in a [04:15.880 - 04:24.600] conversation. But what we were doing is, instead of having the humans like rate which dialogue [04:24.600 - 04:30.640] entries were good or bad, or do the preference ranking that OpenAI is doing with RLHF, we [04:30.640 - 04:34.720] wanted to learn from implicit signals in the conversation with the humans. So they don't [04:34.720 - 04:38.480] have to go out of their way to provide any extra feedback. 
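Editor's note: the KL-control idea described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the papers discussed: the function name `kl_penalized_reward`, the coefficient `beta`, and the toy probabilities are all made up for the example. It subtracts a sample-based estimate of the KL divergence between the fine-tuned policy and the pretrained prior from the task reward.

```python
import math


def kl_penalized_reward(task_reward, policy_logprobs, prior_logprobs, beta=0.1):
    """Return task_reward minus beta times a single-sample KL estimate.

    policy_logprobs / prior_logprobs hold log-probabilities that the
    fine-tuned policy and the pretrained prior assign to each token of one
    sampled sequence. `beta` (hypothetical name) sets how strongly the
    policy is pulled back toward the prior.
    """
    if len(policy_logprobs) != len(prior_logprobs):
        raise ValueError("both models must score the same tokens")
    # Sum over tokens of log pi(token) - log prior(token): an estimate of
    # KL(policy || prior) based on this one sampled sequence.
    kl_estimate = sum(lp - lq for lp, lq in zip(policy_logprobs, prior_logprobs))
    return task_reward - beta * kl_estimate


# Toy example: the policy is twice as confident as the prior on the first token.
policy = [math.log(0.5), math.log(0.5)]
prior = [math.log(0.25), math.log(0.5)]
print(kl_penalized_reward(1.0, policy, prior, beta=0.1))
```

With `beta = 0` this reduces to plain reward maximization, and a very large `beta` keeps the policy close to the pretrained prior, which is the trade-off the speaker describes.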
What can we get from just [04:38.480 - 04:44.140] the text that they're typing? So we did things like analyze the sentiment of the text. So [04:44.140 - 04:49.320] if the person sounded generally happy, then we would use that as a positive reward signal [04:49.320 - 04:54.240] to train the model. Whereas if they sounded frustrated or confused, that's probably a [04:54.240 - 04:58.240] sign that the model is saying something nonsensical, and we can use that as a negative reward. And [04:58.240 - 05:01.760] so we worked on actually optimizing those kinds of signals with the same technique. [05:01.760 - 05:06.800] I mean, it sounds so much like what ChatGPT is doing. Maybe the function approximator [05:06.800 - 05:10.880] was a bit different. Maybe the way you got the feedback was a bit different, but under [05:10.880 - 05:12.680] the hood, it was really RLHF. [05:12.680 - 05:17.460] Well, there are key differences. So OpenAI is taking a different approach than we did [05:17.460 - 05:23.200] in our 2019 paper on human feedback, where they train this reward model. So we don't [05:23.200 - 05:25.800] do that. So what they're doing is they're saying, we're going to get a bunch of humans [05:25.800 - 05:32.280] to rate which of two outputs is better. And we're going to train a model to approximate [05:32.280 - 05:37.580] those human ratings. And that idea is coming from way earlier, like OpenAI's early work [05:37.580 - 05:42.200] on deep RL from human preferences, if you remember that paper. And in contrast, the [05:42.200 - 05:49.960] stuff I was doing in 2019 was offline RL. So I would use actual human ratings of a specific [05:49.960 - 05:58.640] output, and then train on that as one example of a reward. But I didn't have this generalizable [05:58.640 - 06:02.720] reward model that could be applied across more examples. 
So I think there's a good argument [06:02.720 - 06:07.720] to be made that the approach of training a reward model actually seems to scale pretty well, [06:07.720 - 06:09.920] because you can sample it so many times. [06:09.920 - 06:15.600] Can we talk about also the challenges and limits of this approach? So in the last episode, [06:15.600 - 06:22.800] 38, we featured OpenAI co-founder and inventor of PPO, John Schulman, who did a lot of the [06:22.800 - 06:28.680] RLHF work at OpenAI. And he talked about InstructGPT, the sibling model to ChatGPT, because [06:28.680 - 06:33.120] ChatGPT wasn't released yet. And there is no ChatGPT paper yet. But the paper explained [06:33.120 - 06:37.320] that it required a lot of human feedback. And the instructions for the human ratings [06:37.320 - 06:41.320] were really detailed and super long. And so there was a significant [06:41.320 - 06:46.440] cost in getting all of that human feedback. So I guess I wonder what you think [06:46.440 - 06:53.280] about that? Is that cost going to limit how useful RLHF can be? Or is that not [06:53.280 - 06:55.480] a big deal? Because it's totally worth it? [06:55.480 - 06:59.400] Yeah, I mean, that's a great question. And going back and reading the history of papers [06:59.400 - 07:05.160] they've been doing on RLHF, even before InstructGPT, like in the summarization stuff, it seems [07:05.160 - 07:11.680] like one of the key enablers of getting RLHF to work effectively is actually investing [07:11.680 - 07:16.400] a lot into getting quality human data. 
So they have these two summarization [07:16.400 - 07:20.000] papers where one, I guess, wasn't working that well, and then they have a follow-up where they [07:20.000 - 07:23.840] said, one of the key differences is we just did a better job recruiting raters that were [07:23.840 - 07:28.400] going to agree with the researchers, we were taking a high-touch approach of like, being [07:28.400 - 07:32.520] able to be in a shared Slack group with the raters to answer their questions and make [07:32.520 - 07:37.560] sure they stay aligned. And like that investment in the quality of the data that they collected [07:37.560 - 07:42.440] from humans was key in getting this to work. So it is obviously expensive. But what I was [07:42.440 - 07:48.000] struck by in those papers and also in InstructGPT is that, as you'll notice in InstructGPT, [07:48.000 - 07:55.680] the, what was it, like the 1.3 billion parameter model trained with RLHF is outperforming [07:55.680 - 08:03.200] the 175 billion parameter model trained with supervised learning. So it's like a model 100x the [08:03.200 - 08:08.640] size is outperformed by just doing some of this RLHF. And obviously training [08:08.640 - 08:13.520] a 100x size model with supervised learning is extremely expensive in terms of compute. So [08:13.520 - 08:17.160] I don't think OpenAI released the actual numbers and dollar [08:17.160 - 08:21.320] value that they spent on collecting human data versus like training giant models. But [08:21.320 - 08:26.560] you could make a good case that RLHF actually is cost effective because it could reduce [08:26.560 - 08:28.280] the cost of training larger models. [08:28.280 - 08:32.880] Okay, that part makes sense to me. But then when I think about the, you know, this data [08:32.880 - 08:40.000] set that's been collected, I mean, they're using the data for on-policy training. 
From [08:40.000 - 08:44.920] what I understand, they're using PPO, which is an on-policy method. And on-policy methods, [08:44.920 - 08:51.000] generally, or the way I see them, is you can't reuse the data, because they depend on the [08:51.000 - 08:56.280] data sampled from this model or from a very close-by model. So if you start training on [08:56.280 - 09:00.960] this data, and the model drifts away, then is that data set going to be still useful? [09:00.960 - 09:04.720] Or could it ever be used for another model? Like, are these like disposable [09:04.720 - 09:08.240] data sets that are just only used for that model at one point in time? [09:08.240 - 09:12.360] I wouldn't say it's disposable, like I would still use that data, because the data they [09:12.360 - 09:17.080] actually use is like comparisons of summaries, and then they use it to train the reward model. [09:17.080 - 09:20.920] And so your reward model can be kind of like trained offline in that way and used for your [09:20.920 - 09:27.480] policy. But the actual comparisons they do, from my understanding, is they compare like, [09:27.480 - 09:32.200] not only their current RL model, but they're comparing the supervised baseline, they're [09:32.200 - 09:36.360] comparing the instructions from the data set. So you kind of get this like general property [09:36.360 - 09:41.040] of like, is this summary better than another summary? Right. And I think that's kind of [09:41.040 - 09:45.920] a reusable truth about the data. If you just look at it as general summaries, [09:45.920 - 09:49.840] and this is what makes a high quality summary, then why couldn't that apply across different [09:49.840 - 09:53.800] models? And then those data sets are totally reusable. And maybe we can cost effectively [09:53.800 - 09:56.240] build up these libraries of data sets that way. 
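Editor's note: a rough sketch of how pairwise comparisons can be turned into a training signal for a reward model. The conversation does not specify the loss; the logistic (Bradley-Terry style) loss below is a common choice and is an assumption here, as are the names `preference_loss` and `mean_preference_loss` and the toy scores. The loss only looks at which of two outputs was preferred, not at which model produced them, which is one way to see why such comparison data might be reused.

```python
import math


def preference_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred output wins.

    Computes -log(sigmoid(score_preferred - score_rejected)) in a
    numerically stable way. The scores would come from a reward model;
    here they are plain floats.
    """
    diff = score_preferred - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def mean_preference_loss(scored_pairs):
    """Average the loss over (preferred_score, rejected_score) pairs."""
    if not scored_pairs:
        raise ValueError("need at least one comparison")
    losses = [preference_loss(p, r) for p, r in scored_pairs]
    return sum(losses) / len(losses)


# Toy comparisons; the outputs could come from any mix of models
# (RL policy, supervised baseline, reference text).
pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
print(mean_preference_loss(pairs))
```

Minimizing this loss pushes the reward model to score the human-preferred output higher than the rejected one.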
[09:56.240 - 10:00.600] Yeah, like to put a finer point on it, the data that they use to train their reward [10:00.600 - 10:06.040] model comes from a bunch of models, not just their RL model. So they are using quote [10:06.040 - 10:10.680] unquote, off-policy data to train their reward model. And it's working. [10:10.680 - 10:14.560] The human feedback is like only valid for a limited amount of training. Like John was [10:14.560 - 10:19.280] saying, if you train with that same reward model for too long, your performance ends up [10:19.280 - 10:23.720] falling off at some point. So I guess the implication is that you would have to keep [10:23.720 - 10:27.400] collecting additional human feedback after every stage, like after you've trained to [10:27.400 - 10:31.520] a certain degree, to improve it further might require a whole new data set. We didn't really [10:31.520 - 10:34.720] get into that in the chat with John. But I wonder if you had any comment [10:34.720 - 10:35.720] about that part. [10:35.720 - 10:39.400] I can't say as much for what's going on with OpenAI's work. But I can say I observed [10:39.400 - 10:44.640] this phenomenon in my own work trying to optimize for reward, but still do something probable [10:44.640 - 10:49.320] under the data. And you can definitely sort of over-exploit the reward function. So like [10:49.320 - 10:54.680] when I was training dialogue models, we had this reward function that would reward the [10:54.680 - 11:00.200] dialogue model for having a conversation with a human such that the human seemed [11:00.200 - 11:03.880] to be responding positively, but also that the dialogue model itself was outputting sort [11:03.880 - 11:09.240] of like, high sentiment text and stuff like that. And we had a very limited amount of [11:09.240 - 11:12.880] data. So I think we might have like quickly overfit to the data and the rewards that were [11:12.880 - 11:18.640] in it. 
And what you see is the policy kind of like collapses a little bit. So its [11:18.640 - 11:22.680] objective is to stay within something that's probable under the data distribution, [11:22.680 - 11:27.560] but maximize the reward. RL is ultimately, even though we're using maximum entropy RL, [11:27.560 - 11:33.840] trying to find the optimal policy. So it doesn't really care, like, it ended up [11:33.840 - 11:37.440] having sort of a really restricted set of behaviors where it could get kind of repetitive [11:37.440 - 11:42.720] and sort of exploit the reward function. So our agent with those rewards kind of got overly [11:42.720 - 11:47.520] positive, polite and cheerful. So I always joke that it was like the most Canadian dialogue [11:47.520 - 11:57.160] agent you could train. We can say that because we're two Canadians. Exactly, exactly. But [11:57.160 - 12:02.320] yeah, it was kind of collapsing. Like it came at a cost of like diversity [12:02.320 - 12:06.560] in the text that was output. So I wonder if there's something similar going on with their [12:06.560 - 12:11.800] results about like training too long on the reward model actually leads to diminishing [12:11.800 - 12:19.080] and then eventually like negative returns. And it seems that the reward model isn't perfect. [12:19.080 - 12:22.280] If you look at the accuracy of the reward model on the validation data, it's like in [12:22.280 - 12:27.560] the seventies or something. So it's not perfectly describing what is quality. So if you really [12:27.560 - 12:32.080] overfit to that reward model, it's not clear that it's going to be comprehensive enough [12:32.080 - 12:36.680] to describe good outputs. I gather that like some of your past work in this [12:36.680 - 12:41.000] area was like doing RL at the token level, like considering each token as a separate [12:41.000 - 12:45.320] action, maybe Sequence Tutor and your Way Off-Policy paper. 
Was that how [12:45.320 - 12:50.960] it worked? Was it individual token actions? Yes. But I would mention that so is InstructGPT [12:50.960 - 12:56.080] if you dig into it. So what they end up doing, it's a little easier [12:56.080 - 12:59.640] in policy gradients because you can get the probability of the whole sequence by just [12:59.640 - 13:04.160] summing the probabilities over the individual tokens. But at the end of the day, your loss [13:04.160 - 13:08.640] is still being propagated into your model at the token level by increasing or decreasing [13:08.640 - 13:13.240] token level probabilities. Oh, so you're saying, because the paper says that [13:13.240 - 13:18.160] it framed it as a bandit. And to me, that meant the entire sample, all the tokens together [13:18.160 - 13:22.840] were taken as one action. But you're saying because of the way it's constructed, [13:22.840 - 13:28.240] it still breaks down to the token level probabilities. Yeah, you can write the math as like, reward [13:28.240 - 13:33.800] of the entire sequence, or reward of the entire output, times probability of the entire output. [13:33.800 - 13:37.640] But under the hood, the way you get probability of the entire output is a sum of the token [13:37.640 - 13:42.600] level probabilities. So the way that that's going to actually change the model is to affect [13:42.600 - 13:47.120] token level probabilities. This is why I like having this podcast, because with that question [13:47.120 - 13:51.400] I was for a while like, who's gonna explain this to me? So thank you for clearing [13:51.400 - 13:56.320] that up for me, Natasha. That's really cool. No problem. So does that mean there's no benefit [13:56.320 - 14:00.840] to looking at a token level? Or like, is it always going to be this way? Because like, [14:00.840 - 14:05.280] I think John was saying that it's like more tractable to do it this way as a whole sample. 
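Editor's note: a small sketch of the point about sequence-level versus token-level probabilities. In practice the computation is usually done in log space: the probability of the whole output is the product of the per-token probabilities, so its logarithm is the sum of the per-token log-probabilities. The function names and toy numbers below are assumptions for illustration, and the surrogate shown is plain REINFORCE, not the PPO objective that the conversation attributes to InstructGPT.

```python
import math


def sequence_logprob(token_probs):
    """Log-probability of a whole sequence: the sum of token log-probs."""
    return sum(math.log(p) for p in token_probs)


def reinforce_surrogate(token_probs, sequence_reward):
    """Bandit-style surrogate loss: -reward * log p(whole sequence)."""
    return -sequence_reward * sequence_logprob(token_probs)


def per_token_terms(token_probs, sequence_reward):
    """The same loss split into one term per token.

    Every token's log-probability is scaled by the same sequence-level
    reward, so the update still acts on token-level probabilities.
    """
    return [-sequence_reward * math.log(p) for p in token_probs]


probs = [0.5, 0.25, 0.8]  # toy per-token probabilities of the sampled output
print(reinforce_surrogate(probs, 2.0))
print(sum(per_token_terms(probs, 2.0)))  # same value, written token by token
```

The two printed values agree up to floating-point rounding, which is the sense in which a "one action per sample" framing and a token-level update describe the same computation.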
[14:05.280 - 14:08.920] So what they're actually doing that might be a little bit different than token level [14:08.920 - 14:15.040] RL normally is like, their discount factor is one. So they apply the same reward to all [14:15.040 - 14:20.800] of the tokens in the sequence. And there's no discount where like, [14:20.800 - 14:23.320] earlier in the sequence, you're discounting the reward you're going to get at the end [14:23.320 - 14:26.640] of the sequence or whatever. So that [14:26.640 - 14:30.000] is a difference. That makes sense. It seems to be working well for them. Yeah, because [14:30.000 - 14:33.560] it matters just as much what you say at the end, like if you say not in capital letters, [14:33.560 - 14:35.480] then that's kind of important. [14:35.480 - 14:41.720] Yeah, exactly. And I think in my work, if I recall correctly, we had experimented. So [14:41.720 - 14:46.080] we had rewards that were at the sequence level as well, even at the level [14:46.080 - 14:51.240] of the whole dialogue. So we had stuff about like, how long does the conversation go on, [14:51.240 - 14:56.160] which is of course, across many dialogue turns. And then we had sentence level rewards that [14:56.160 - 15:00.640] were spread equally over the tokens in the sentence. But for something like conversation [15:00.640 - 15:05.240] length, we did have a discount factor. At the beginning, you aren't sure the conversation is going to go [15:05.240 - 15:09.160] on as long as it does. So you discount that reward. But once you're already [15:09.160 - 15:14.120] having a long conversation, then the reward is higher. And it was very difficult to optimize [15:14.120 - 15:17.120] those discounted rewards across the whole conversation. [15:17.120 - 15:19.720] So you combined rewards at different levels? [15:19.720 - 15:20.720] Yeah, yeah. 
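Editor's note: a toy sketch of the difference between a discount factor of one and a discount factor below one when the reward only arrives at the end. The function name `discounted_terminal_reward` and the numbers are assumptions for illustration; the real systems discussed compute returns inside a full RL training loop.

```python
def discounted_terminal_reward(reward, num_steps, gamma):
    """Value of a final-step reward as seen from each earlier step.

    Step t is (num_steps - 1 - t) steps away from the end, so it sees
    reward * gamma ** (num_steps - 1 - t). With gamma == 1 every step
    sees the same reward.
    """
    return [reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


# gamma = 1: the same reward is applied to all tokens in the sequence.
print(discounted_terminal_reward(1.0, 3, 1.0))

# gamma < 1: earlier steps see a smaller reward, as with the
# conversation-length reward, where the start of a dialogue cannot be
# sure how long the conversation will go on.
print(discounted_terminal_reward(1.0, 3, 0.5))
```

The second call shows the credit growing toward the end of the sequence, matching the remark that the reward is higher once a long conversation is already under way.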
[15:20.720 - 15:25.760] Which kind of reminds me of this recursive reward modeling. There was a paper from Leike [15:25.760 - 15:31.600] et al. out of DeepMind, that was in 2018. It seems like the idea here is taking this whole [15:31.600 - 15:37.400] RLHF further and stacking them for more complex domains, where we have models that help the [15:37.400 - 15:43.000] humans provide the human feedback and stacking them up. Do you have any thoughts about recursive [15:43.000 - 15:47.200] reward models? Do you think that's a promising way forward? Or like, are we gonna need that [15:47.200 - 15:48.200] soon? [15:48.200 - 15:52.040] I mean, so my understanding of their example of like a recursive reward model is the user [15:52.040 - 15:56.520] wants to write a fantasy novel, but like writing a whole novel, and then having [15:56.520 - 16:00.160] that evaluated would be very expensive, and you get very little data. So you could have [16:00.160 - 16:07.760] a bunch of RLHF trained assistants that do things like check the grammar or summarize [16:07.760 - 16:11.880] the character development up to this point or something like that. And that can assist [16:11.880 - 16:17.440] the user in doing the task. So I think like, sure, that idea makes sense. [16:17.440 - 16:21.480] If I were to make a company that's helping people write novels, I would do it at that [16:21.480 - 16:27.680] level rather than at the level of the whole novel, right? So that's definitely cool. [16:27.680 - 16:32.400] But in terms of like, pushing forward the boundaries of RLHF, I think what I would bet [16:32.400 - 16:36.180] on, and maybe I'm just biased, because this is literally my own work, but I would still [16:36.180 - 16:42.560] bet on this idea of trying to get other forms of feedback than just like humans comparing [16:42.560 - 16:47.040] two answers and like ranking them. 
So I'm not saying my work is the perfect answer, [16:47.040 - 16:52.200] but we were trying to get this type of implicit signal that you're getting during the interaction [16:52.200 - 16:57.080] all the time. And so, you know, when you're speaking about, oh, RLHF is so expensive to [16:57.080 - 17:03.480] collect the human data. Well, what if you could be getting data for free in a way [17:03.480 - 17:08.120] that's pervasive in your interactions? And so it doesn't cost anything additional to [17:08.120 - 17:12.720] find it. So like, okay, imagine you're using OpenAI Playground or something to play with [17:12.720 - 17:19.560] ChatGPT. How many times did you like rephrase the same prompt until you got some behavior [17:19.560 - 17:24.200] and then stopped? Yeah, they must be like, could that be it? But not yet. Do you think [17:24.200 - 17:28.880] so? I don't know. You would hope so. Because otherwise, how are they going to scale this? [17:28.880 - 17:32.400] Like they also have thumbs up and thumbs down. But they kind of have the [17:32.400 - 17:36.520] limited feedback though, right? And it's not always about whether the sentiment is good. [17:36.520 - 17:42.400] Like you could be wanting to write something scary. Exactly. Yes. Sentiment isn't perfect. [17:42.400 - 17:45.880] You could also look at like, okay, I prompt GPT, I get some output. Like if they had a [17:45.880 - 17:50.360] way to like edit that output in the editor, which I don't actually know if they do in [17:50.360 - 17:54.720] Playground, I have to look at that again. But any edits I made to the text would [17:54.720 - 17:58.800] be a signal that I didn't like it, like I need to fix this. So that could be a signal [17:58.800 - 18:03.000] you could be training on with RLHF. I feel like that's just going to be more scalable. [18:03.000 - 18:06.840] And ultimately, it's not the ground truth of the human rating of quality. 
But what we [18:06.840 - 18:11.000] show in our work is like, even though sentiment is very, and the other stuff, we didn't just [18:11.000 - 18:14.720] use sentiment, we used a bunch of stuff. But even though those are imperfect, and only [18:14.720 - 18:19.720] proxy measures, optimizing for those things still did better than optimizing for the thumbs [18:19.720 - 18:24.160] up thumbs down that we built into the interface, because just no one wants to bother providing [18:24.160 - 18:28.000] that. You have to go out of your way out of the normal interaction that you're trying [18:28.000 - 18:32.880] to use to like sort of altruistically provide this extra feedback and people just don't. [18:32.880 - 18:39.060] So yeah, I think more scalable signals is the right direction. That makes so much sense. [18:39.060 - 18:43.380] Are you up for talking about AGI? [18:43.380 - 18:45.000] Depends what the question is. [18:45.000 - 18:48.680] So first of all, do you think it's something we should be talking about and thinking [18:48.680 - 18:52.960] about these days? Or is it like a distant fantasy that's just not really worth talking [18:52.960 - 18:53.960] about? [18:53.960 - 18:58.000] Oh, man, I always get a little bit frustrated with like AGI conversations, because nobody [18:58.000 - 19:02.380] really knows what they're talking about when they say AGI. Like it's not clear what the [19:02.380 - 19:07.800] definition is. And if you try to pin people down, it can get a little bit circular. So [19:07.800 - 19:12.560] like, you know, I've had people tell me, oh, AGI is coming in five years, right? And I [19:12.560 - 19:17.760] say, okay, well, so how do you reconcile that with the fact that CEOs of self-driving car [19:17.760 - 19:22.880] companies think that fully autonomous self-driving isn't coming for 20 years? Right? 
[19:22.880 - 19:27.040] So if AGI is in five, and then my definition of AGI might be it can do everything a human [19:27.040 - 19:33.560] can do, but better. That doesn't make sense, right? If it can't drive a car, it's not AGI. [19:33.560 - 19:37.440] But then people will say, oh, but it doesn't have to be embodied. And it can still be AGI. [19:37.440 - 19:42.000] And okay, but then what is it doing? Like, it's just such a muddy, muddy concept, [19:42.000 - 19:43.000] right? [19:43.000 - 19:46.840] I've also been in these arguments or discussions. And then in the end, we just realized we have [19:46.840 - 19:51.400] different definitions. And then there's no point in arguing about two words that mean [19:51.400 - 19:52.400] different things. [19:52.400 - 19:59.240] All of that aside, I do think I have been really impressed and even a little bit concerned [19:59.240 - 20:04.160] about the pace of progress. Like stuff is happening so fast that if you want to just [20:04.160 - 20:12.360] define AGI as highly disruptive, fast advancements in AI technology, I think we're already there. [20:12.360 - 20:18.440] Right? Like, look at ChatGPT, right? Universities are having to revise their entire curriculum [20:18.440 - 20:23.320] around writing take-home essays, because you can just get ChatGPT to write you an essay [20:23.320 - 20:28.280] better than an undergrad can. So it's already super disruptive. Like where we are now is [20:28.280 - 20:29.680] already super disruptive. [20:29.680 - 20:35.800] Yeah, it might not be like AGI do all the jobs AGI. But it's general. [20:35.800 - 20:40.160] To me, ChatGPT is the first thing I've seen that really is so general. Like nothing has [20:40.160 - 20:45.280] been that general before, that imagining where that generality could take us in a few years [20:45.280 - 20:49.880] does make me think. Your point about the self-driving vehicles is well taken. 
Like I think [20:49.880 - 20:54.360] everyone recognizes it's been a bit of a shit show with people predicting that it's going [20:54.360 - 20:58.040] to come in two years and three years and it just keeps getting pushed back and the timelines [20:58.040 - 20:59.040] just get longer. [20:59.040 - 21:03.160] I think embodiment is really hard. I think fitting the long tail of stuff in the real [21:03.160 - 21:06.960] world is really hard. So you might have seen this example. I think like Andrej Karpathy [21:06.960 - 21:14.400] talked about it for Tesla, where they had an accident because the car couldn't [21:14.400 - 21:19.480] perceive this thing that happened, which was a semi truck carrying a semi truck carrying [21:19.480 - 21:25.640] a semi truck. So like a truck on a truck on a truck. And they were just like that. I hadn't [21:25.640 - 21:29.400] even seen that before. It wasn't in the support of the training data. And of course we know [21:29.400 - 21:34.280] these models, like if they get off the support of the training data, don't do that well. [21:34.280 - 21:39.280] So how will you ever curate a dataset that's going to cover every single thing in the real [21:39.280 - 21:44.200] world? I would argue that you can't, especially because the real world is non-stationary. [21:44.200 - 21:48.760] It's always changing. So new things are always being introduced. So sort of definitionally, [21:48.760 - 21:54.960] you can't cover everything that might happen in the real world. And so, you know, that's [21:54.960 - 21:58.400] why I'm excited about some of these approaches. 
It sounds like you talked about this on a [21:58.400 - 22:02.760] previous episode, but like, um, I've been working on this like adversarial environment [22:02.760 - 22:07.320] design stuff or unsupervised environment design stuff for RL agents, where you actually try [22:07.320 - 22:13.400] to search for things that can make your model fail and like generate those problems, um, [22:13.400 - 22:18.360] and train on them. And I think that could be an approach that is more tenable than just [22:18.360 - 22:23.560] supervised learning on a limited dataset. Totally. Yeah. We spoke with your colleague, [22:23.560 - 22:28.960] Michael Dennis, who was a co-author of yours on the PAIRED paper. Is that right? Yes. Yeah, [22:28.960 - 22:33.600] exactly. Yeah. And I met him at the poster session at, I think it was ICML. I loved that [22:33.600 - 22:36.560] right away. And then I wasn't surprised at all to find your name on it. I didn't know [22:36.560 - 22:41.160] that at first. That makes total sense. That's exactly the type of thing Natasha would come [22:41.160 - 22:45.680] up with. The idea of embodiment, basically robotics is super hard or anything that has [22:45.680 - 22:51.280] to touch real world sensors. And it seems what ChatGPT has shown us is if we can stay [22:51.280 - 22:57.280] in the abstract world of text, we actually have like magic powers even today in 2022, [22:57.280 - 23:03.240] 2023. Um, we could do a lot with the techniques we already have if we're staying in [23:03.240 - 23:10.040] the world of text and abstract thought and code and, um, abstract symbols [23:10.040 - 23:14.920] basically. 
So maybe it goes back to that point of the real world and robotics [23:14.920 - 23:18.880] just turning out to be the really hard stuff, the animal intelligence being super [23:18.880 - 23:23.080] hard and the abstract thought that we used to think made us so special is turning [23:23.080 - 23:27.640] out to be maybe way easier. We've already solved Go that we thought was impossible not [23:27.640 - 23:33.480] long ago. And, uh, ChatGPT is showing us a level of generality we could [23:33.480 - 23:39.800] not expect from robotics, you know, maybe for ages. Yeah. And I mean, you probably remember [23:39.800 - 23:43.120] the name of this principle better than I do, but it's sort of the principle that, uh, things [23:43.120 - 23:47.080] that are really hard for us to solve, like chess and Go, are actually easy to get [23:47.080 - 23:51.200] AI to solve. Maybe because we have more awareness of the process, but like the most low level [23:51.200 - 23:54.760] stuff about, you know, manipulation, like how do you pick something up with your hand [23:54.760 - 23:59.760] is a very challenging problem. Editor's note: I forgot, so I looked it up afterwards. This [23:59.760 - 24:03.840] is Moravec's paradox. I want to share like my favorite anecdote when thinking about why [24:03.840 - 24:09.440] embodiment is so hard. I've been working on this problem of, um, language conditioned [24:09.440 - 24:13.360] RL agents. So they take a natural language instruction, they try to follow it and do [24:13.360 - 24:18.720] something in the world. Right. And, uh, so in that space, I was reading this [24:18.720 - 24:23.080] paper from DeepMind, which is, uh, Imitating Interactive Intelligence, and they have this [24:23.080 - 24:27.440] sort of simulated world where a robot can walk around and it's kind of like a video [24:27.440 - 24:32.480] game, like a low res video game kind of environment. 
So not super high res visuals, but it can [24:32.480 - 24:36.880] do things like, um, it'll get an instruction, like pick up the orange duck and put it on [24:36.880 - 24:41.720] the bed or pick up the cup and put it on the table or something like that. Right. And they [24:41.720 - 24:46.400] invested like two years. There's a team of 30 people. I heard they spent millions of [24:46.400 - 24:52.960] dollars on this project, right? They collect this massive human dataset of, um, people [24:52.960 - 24:58.160] giving instructions and then trying to follow those instructions in the environment. And [24:58.160 - 25:02.280] the dataset they collect is so massive that I think half of the instructions in the dataset [25:02.280 - 25:06.600] are exact duplicates of each other. So they'd have two copies of, pick up the orange [25:06.600 - 25:11.440] duck and put it on the table or whatever. Um, and they train on this to the best of [25:11.440 - 25:16.000] their ability. And guess what their success rate is in actually following these instructions, [25:16.000 - 25:19.920] like guess what percentage of the time they can successfully follow the instructions in [25:19.920 - 25:24.280] this environment. I'm just trying to take a cue from you. I vaguely remember [25:24.280 - 25:29.640] this paper, but I'm going to guess it was terrible. Like 5%? Not 5%, but it's 50%. 50%. [25:29.640 - 25:34.960] Okay. What do you feel about that number? It is shockingly low, or low for that much [25:34.960 - 25:40.480] investment and for a pretty simple problem. Like it's just surprising that they can't [25:40.480 - 25:45.960] do better. And I think that just illustrates like how hard this is, you know, we've seen that [25:45.960 - 25:49.840] you can tie text and images together pretty effectively. Like we're seeing all of these [25:49.840 - 25:53.000] text-to-image generation models that are compositional. They're beautiful. They're [25:53.000 - 25:58.440] working really well.
Um, so I don't think that's the problem, but just like adding this [25:58.440 - 26:04.200] idea of navigating a physical body in the environment to carry out the task while perceiving [26:04.200 - 26:09.520] vision and linking it to the text just becomes so hard and it's very hard to get anything [26:09.520 - 26:10.520] working. [26:10.520 - 26:15.840] Yeah. 50%. I don't know. It's higher than I thought. But if we look at like, uh, [26:15.840 - 26:20.240] so we talked to Karol Hausman here a few episodes back, who is working on [26:20.240 - 26:27.360] SayCan, which is the kitchen robot that you can give verbal, which becomes textual, instructions [26:27.360 - 26:31.800] and it is using RL and it is actually doing things in a real kitchen, you know, in [26:31.800 - 26:36.600] the real world and sponging things up. And, um, I mean, a few things struck [26:36.600 - 26:40.800] me about that. Like they were doing something that sounds kind of similar to what you're [26:40.800 - 26:46.720] describing, but I was amazed by how much they had to divide up the problem and how [26:46.720 - 26:51.240] much work it was to build all the parts because they had to make separate value functions [26:51.240 - 26:56.400] for all their skills. But I think connecting it to the text seemed to be kind [26:56.400 - 26:57.400] of the easier part. [26:57.400 - 27:03.000] Well, so they actually don't connect text to embodiment, I would argue. [27:03.000 - 27:08.520] So first let me say Karol's an amazing person. He's great. SayCan is so great of a paper [27:08.520 - 27:13.000] that Google is amazingly excited. And I think, so I'm actually doing some work that's like [27:13.000 - 27:18.040] a follow-up to SayCan, and it's literally the most crowded research area I've ever been [27:18.040 - 27:22.700] in. Like there's so many Google interns working on follow-ups to SayCan, like everyone's excited. [27:22.700 - 27:28.240] So it's great work.
So not to trash the work at all, but they actually do separate the [27:28.240 - 27:33.760] problem of understanding the language and doing the embodied tasks almost completely [27:33.760 - 27:38.080] because the understanding of the language is entirely offloaded to a pre-trained large [27:38.080 - 27:44.240] language model. And then for the executing of tasks, you train a bunch of low level [27:44.240 - 27:49.480] robotic policies that are able to like pick something up or do this. And you just select [27:49.480 - 27:55.120] which low level robotics policy to execute based on what looks probable under the language [27:55.120 - 28:00.560] model and what has the highest value estimate for those different policies. But there's [28:00.560 - 28:08.080] no network that's really doing high level language understanding and embodied manipulation [28:08.080 - 28:13.320] at the same time. Yeah. I thought it was innovative how they separated that so they didn't really [28:13.320 - 28:18.560] have to worry about that. They kind of like offloaded that whole problem to the LLM without [28:18.560 - 28:21.940] having the LLM know anything about robotics. It's definitely innovative and it works super [28:21.940 - 28:27.320] well and I think that's why the paper is exciting.
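Editor's note: the skill selection described here can be sketched in a few lines. This is a rough illustration, not the actual SayCan code. The skill names and all the numbers are made up, and it assumes the language model score and the value estimate are simply multiplied together to rank skills.

```python
# Editor's sketch: every skill name and number below is hypothetical.

def select_skill(llm_prob, value_estimate):
    """Pick the low level skill with the highest combined score.

    llm_prob: skill -> how probable the language model finds that skill
              as the next step for the instruction.
    value_estimate: skill -> that skill's own value estimate of succeeding
              from the current state.
    """
    # Assumption of this sketch: the two scores are combined by multiplying.
    scores = {skill: llm_prob[skill] * value_estimate[skill] for skill in llm_prob}
    best = max(scores, key=scores.get)
    return best, scores


# The language model likes "pick up sponge" best, but the robot is far from
# the sponge, so that skill's value estimate is low.
llm_prob = {"pick up sponge": 0.6, "go to table": 0.3, "pick up apple": 0.1}
value_estimate = {"pick up sponge": 0.2, "go to table": 0.9, "pick up apple": 0.8}

best, scores = select_skill(llm_prob, value_estimate)
print(best)
```

Note that neither piece sees the other's input: the language model never looks at the scene, and the value functions never read the instruction, which is the separation being discussed.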
But it's kind of, to me, like I was really [28:27.320 - 28:31.920] excited about this idea of an embodied agent that could really understand language and [28:31.920 - 28:36.440] do embodied stuff at the same time because if you think, okay, talking about what is [28:36.440 - 28:42.440] AGI, if we just use a definition of something that's like the maximally general representation [28:42.440 - 28:48.920] of knowledge, then you should have something that can not only understand text, but understand [28:48.920 - 28:52.520] how the text is mapped to images in the world because that's already going to expand your [28:52.520 - 28:58.400] representation, but understand how that maps to physics and how to navigate the world. [28:58.400 - 29:01.880] And so it'd be so cool if we could have an agent that actually like in the same network [29:01.880 - 29:06.920] is encoding all of those things. This is also just really reminding me of why I really like [29:06.920 - 29:10.640] talking with you, Natasha, because you're so passionate about this stuff. And also you [29:10.640 - 29:17.660] don't pull any punches. You will call a spade a spade no matter what. And you see the big [29:17.660 - 29:24.480] picture and you're so critical and sharp. And that's honestly the spirit that I was [29:24.480 - 29:26.760] looking for with this whole show. [29:26.760 - 29:33.440] I hope I'm not sounding too critical. I mean, I love this work, so. [29:33.440 - 29:38.500] I think my feedback on SayCan on a very high level is that they're depending on the language [29:38.500 - 29:44.200] model to already know what makes sense in that kitchen. But if they were in an untraditional [29:44.200 - 29:47.600] kitchen or they invented a new type of kitchen or they were in some kind of space where the [29:47.600 - 29:52.980] language model didn't really get it, then none of that would work.
They're depending [29:52.980 - 29:57.760] on common sense of the language model to know what order to do things in the kitchen. And [29:57.760 - 29:59.800] they're assuming that common sense is common. [29:59.800 - 30:03.400] Yeah. And it's hard because they're kind of missing this like pragmatics thing too. [30:03.400 - 30:07.680] So humans could give you ambiguous instructions about what to do in the kitchen that could [30:07.680 - 30:14.240] only be resolved by looking around the kitchen. Like if they just said, get me that plate. [30:14.240 - 30:19.240] And there's multiple plates. How do you resolve that? Well, now you might want to use pragmatics [30:19.240 - 30:24.640] about like the plate that's closer to the human or something about like visually assessing [30:24.640 - 30:27.920] the environment and SayCan's not going to be able to do that, right? [30:27.920 - 30:33.140] Well, they had the Inner Monologue addition, which added this idea of having other voices. [30:33.140 - 30:37.200] And so that might be able to, if they had another voice that was like describing what [30:37.200 - 30:42.320] the person's doing or looking at, inject it into the conversation. And Inner Monologue [30:42.320 - 30:46.720] to me seemed very promising. That was the second part of our conversation with Karol [30:46.720 - 30:51.180] and Fei. And that was fascinating to me and a little smooth because this robot has an [30:51.180 - 30:57.800] inner monologue going. But that let them leverage the language model and have a lot more [30:57.800 - 30:59.400] input into it. [30:59.400 - 31:00.400] That's cool. [31:00.400 - 31:01.520] And it seemed like an extensible approach. [31:01.520 - 31:05.320] That's cool. That can be quite promising. I don't know. I still just want to see a model [31:05.320 - 31:10.160] that does vision, text, and embodiment. I'm excited for that when that comes.
[31:10.160 - 31:14.760] I see that you're planning to return to academia as an assistant professor at U Washington, [31:14.760 - 31:15.760] is that right? [31:15.760 - 31:16.760] That's right. [31:16.760 - 31:22.100] Cool. So that's an interesting choice to me after working at leading labs in the industry. [31:22.100 - 31:26.760] And I bet some people might be looking to move the opposite direction, especially a [31:26.760 - 31:32.040] lot of people have talked about the challenges of doing cutting edge AI on academic budgets [31:32.040 - 31:36.960] when more and more of this AI depends on scale. That becomes very expensive. So can you tell [31:36.960 - 31:41.880] us more about the decision? What drew you back to academia? What's your thought process [31:41.880 - 31:42.880] here? [31:42.880 - 31:47.080] Yeah. I mean, so you might think like, if I want to contribute to AI, I need a massive [31:47.080 - 31:52.440] compute budget and I need to be training these large models and how can academics afford [31:52.440 - 31:56.880] that? But what I actually see happening as a result of this is that what's going on in [31:56.880 - 32:02.440] industry is that more and more people and researchers in industry are being encouraged [32:02.440 - 32:08.680] to sort of amalgamate into these large, large teams of 30 or 50 authors where they're all [32:08.680 - 32:14.320] just working on what looks more like a large scale engineering effort to scale up a research [32:14.320 - 32:19.640] idea that's kind of already been proven out. Right? So you'll see like, there's big teams [32:19.640 - 32:25.060] at Google that are now trying to work on RLHF and the RLHF they're doing is very similar [32:25.060 - 32:28.240] to what OpenAI is doing. They're just trying to actually scale it up and write their own [32:28.240 - 32:33.440] version of infrastructure and stuff like that. And I hear the same thing is going on. 
It [32:33.440 - 32:39.080] already was that case at OpenAI where they're a little less focused on publishing, a little [32:39.080 - 32:44.640] more focused on scaling up in big teams. Apparently pressure at DeepMind is doing something similar [32:44.640 - 32:49.960] where if you're pursuing your own little creative research direction, that's going to be less [32:49.960 - 32:55.240] tenable than actually jumping onto a big team and kind of contributing in that way. So if [32:55.240 - 33:01.720] you're interested in doing creative research, novel research that sort of hasn't been proven [33:01.720 - 33:06.160] out already and coming up with new ideas and testing them out, I think there's less room [33:06.160 - 33:11.440] for that in industry right now. And I actually care a lot about research freedom and the [33:11.440 - 33:15.760] ability to kind of like think of a clever idea and test it out myself and see if it's [33:15.760 - 33:20.360] going to work. And I think there's a real role for that. Obviously scaling this stuff [33:20.360 - 33:25.960] up in industry works really well, but what actually works is they do end up using ideas [33:25.960 - 33:31.360] that were innovated in academia and incorporating that into what they're scaling up. So we were [33:31.360 - 33:36.000] talking at the beginning of this podcast about just that idea of doing KL control from your [33:36.000 - 33:40.340] prior is something that I did on a very, very small scale in academia that ends up being [33:40.340 - 33:45.800] useful in the system eventually, right? In the system that gets scaled up. So I see the [33:45.800 - 33:50.720] role of academics to do that same kind of proof of concept work, like discover these [33:50.720 - 33:55.920] new novel research ideas that work and then industry can have the role of scaling them [33:55.920 - 33:59.600] up, right? And so it just depends on what you want to be doing. 
Like, do you want to [33:59.600 - 34:03.680] be on a giant team working on infrastructure or do you want to be doing the kind of more [34:03.680 - 34:08.520] researchy like testing out ideas thing? And for me, I'm much more excited about the latter. [34:08.520 - 34:13.400] That makes total sense. And like, I guess you're getting the credit from the citations [34:13.400 - 34:18.120] from these big papers that really work, but maybe not so much the public credit because [34:18.120 - 34:23.280] like everyone just points to ChatGPT and they think that is AI, like OpenAI invented AI, [34:23.280 - 34:26.480] but they're building on like the shoulders of all these giants from the past, including [34:26.480 - 34:30.000] yourself and all the academics know this, but for the public, it's like, Oh look, they [34:30.000 - 34:31.000] solved AI. [34:31.000 - 34:37.400] That's interesting. Yeah. I mean, I think my objective is more about like, well, [34:37.400 - 34:41.680] I just enjoy the process of like testing out ideas and seeing if they work, but my objective [34:41.680 - 34:47.040] is much more like, did you end up contributing something that was useful rather than did [34:47.040 - 34:49.480] you get the glory? [34:49.480 - 34:54.600] That's very legitimate. Too legit. Okay. So, um, what do you plan to work on at UW? [34:54.600 - 34:58.600] Do you have a clear idea of that or is that something that you'll decide? [34:58.600 - 35:02.080] I do have a clear idea because you kind of, they don't give you the job unless you can [35:02.080 - 35:07.960] kind of sell it and sell what you're going to do. So, um, yeah, I mean the pitch that [35:07.960 - 35:12.360] I was kind of pitching on the faculty job market is like, um, I want to do this thing [35:12.360 - 35:16.960] called social reinforcement learning.
And the idea is what are the benefits you can [35:16.960 - 35:21.640] get in terms of improving AI when you consider the case that you're likely going to be learning [35:21.640 - 35:26.400] in an environment with other intelligent agents. So you can either think about that as like [35:26.400 - 35:30.760] setting up a multi-agent system to make your agent more robust. PAIRED [35:30.760 - 35:35.400] would be in that kind of category of thing. Or you could think about this idea that, you [35:35.400 - 35:38.280] know, for most of what we want AI to do, you might be deployed in environments where there [35:38.280 - 35:42.120] are humans and humans are pretty smart and have a lot of knowledge that might benefit [35:42.120 - 35:47.520] you when you're trying to do a task. So not only thinking about how to flexibly learn [35:47.520 - 35:52.040] from humans, like when I think about social learning, I don't think about just indiscriminately [35:52.040 - 35:58.520] imitating every human, but maybe kind of the human skill of social learning is about identifying [35:58.520 - 36:02.600] which models are actually worth learning from and when you should rely on learning from [36:02.600 - 36:07.160] others versus your independent exploration. So I think that's like a whole set of questions. [36:07.160 - 36:11.960] And then finally, I want to just make AI that's useful for interacting with humans. So, you [36:11.960 - 36:16.680] know, how do you interact with a new human you've never seen before and cooperate with [36:16.680 - 36:20.640] them to solve a task? So kind of the zero shot cooperation problem, how do you perceive [36:20.640 - 36:26.120] what goal they're trying to solve? How do you learn from their feedback? And this is [36:26.120 - 36:30.320] including types of implicit feedback. And then finally, this whole branch of like, how [36:30.320 - 36:34.440] do you communicate with humans in natural language to solve tasks?
So that's why I've [36:34.440 - 36:38.320] been working on this kind of language-conditioned RL, how do you train language models with [36:38.320 - 36:43.160] human feedback, this whole set of things. That's the pitch. [36:43.160 - 36:47.560] Awesome. And they obviously loved it because you're hired. [36:47.560 - 36:51.120] It depends, but yeah, I'm excited. [36:51.120 - 36:55.840] So I mean, it sounds like a lot of stuff that I had to learn as a young person, as an awkward [36:55.840 - 37:03.520] nerdy teen: how to talk to humans. Which human should I imitate? Right? Exactly. And then [37:03.520 - 37:07.800] do you want to talk about some of your recent papers since you've been on last, [37:07.800 - 37:11.880] which is three and a half years ago? I see there on Google Scholar, there's been lots [37:11.880 - 37:16.720] of papers since then with your name on them. But there was a few that we [37:16.720 - 37:22.120] had kind of talked about touching on today, including BASIS and PsiPhi. Should we talk [37:22.120 - 37:23.120] about those? [37:23.120 - 37:27.640] Sure. So I think maybe I'll also add another paper that was like sort of the precursor [37:27.640 - 37:32.280] to PsiPhi from my perspective, really touching on this idea of like, what is social learning [37:32.280 - 37:37.040] versus just like imitation learning versus RL. So I'm really thinking about this problem, [37:37.040 - 37:41.680] like you're in an environment with other agents that might have knowledge that's relevant [37:41.680 - 37:45.680] to the task, but you don't know if they do and they're pursuing self interested goals. [37:45.680 - 37:49.840] So you can think about like an autonomous car on the road. There are other cars that [37:49.840 - 37:53.240] are driving, but some of them are actually bad drivers. So you don't want to sort of [37:53.240 - 37:58.240] indiscriminately imitate, or you're a robot in an office picking up trash.
There are humans [37:58.240 - 38:01.620] that are going about their day. They don't want to stop and sort of explicitly teach [38:01.620 - 38:05.220] you what to do. They're trying to get work done. So how do you benefit from learning [38:05.220 - 38:06.220] from that? [38:06.220 - 38:11.400] So we had a couple of papers on this. The first paper was actually with Kamal Ndousse, [38:11.400 - 38:17.560] who's now at Anthropic. And his paper was looking at do RL agents benefit from social [38:17.560 - 38:22.440] learning by default. So if you're in an environment with another agent that's sort of constantly [38:22.440 - 38:28.480] showing you how to do the task correctly, do you learn any faster than an RL agent that's [38:28.480 - 38:35.680] in an environment by itself? And his conclusion was actually, no, they don't. So default RL [38:35.680 - 38:40.040] agents are actually really bad at social learning. And his work showed that if you just add this [38:40.040 - 38:44.680] auxiliary prediction task, like predicting your own next observation, then you're implicitly [38:44.680 - 38:49.200] modeling what's going on with the other agents in the environment. That makes its way into [38:49.200 - 38:53.920] your representation and you're more able to learn from their behavior. And the cool [38:53.920 - 38:58.240] part about this is, if you actually learn the social learning behavior, like how to [38:58.240 - 39:02.840] learn from other agents in your environment, then you can actually generalize much [39:02.840 - 39:07.680] more effectively to a totally new task that you've never seen before, because you can [39:07.680 - 39:12.560] apply that skill of social learning to master the new task. So you sort of learned how to [39:12.560 - 39:17.160] socially learn. And those social learning agents end up generalizing a lot better than [39:17.160 - 39:21.640] agents that are trained with imitation learning or with RL and generalizing to new tasks.
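Editor's note: the auxiliary prediction task mentioned above amounts to adding an extra term to the training loss. The snippet below is only a sketch of that idea. The squared-error form, the `aux_weight` parameter and the function names are assumptions made for illustration, not details taken from the paper.

```python
# Editor's sketch: the loss form and the weighting are assumptions.

def mean_squared_error(predicted, target):
    """Average squared difference between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def total_loss(rl_loss, predicted_next_obs, actual_next_obs, aux_weight=0.5):
    """Usual RL loss plus a penalty for mispredicting the agent's own next observation."""
    aux_loss = mean_squared_error(predicted_next_obs, actual_next_obs)
    return rl_loss + aux_weight * aux_loss


# If another agent is visible in the observation, predicting the next
# observation well requires modeling where that agent will move.
loss = total_loss(rl_loss=1.0, predicted_next_obs=[0.0, 1.0], actual_next_obs=[1.0, 1.0])
print(loss)
```

The point is that the agent is never told to copy anyone. It is only pushed to predict what it will see next, and the other agents' behavior ends up in its representation as a side effect.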
[39:21.640 - 39:27.120] So I think that's quite exciting. And then PsiPhi-Learning was like a follow-up that [39:27.120 - 39:33.080] does the social learning in a much more effective way. So basically, it's going to be hard to [39:33.080 - 39:37.440] describe. It kind of uses the math of successor features. So it might [39:37.440 - 39:44.800] be a little hard to describe on a podcast, but the idea is you're going to model not [39:44.800 - 39:49.760] only your own policy, but every other agent's policy in the environment in a way that kind [39:49.760 - 39:55.440] of disentangles a representation of the states that they're going to experience from the [39:55.440 - 39:59.880] rewards that they're trying to optimize. So using this like successor representation. [39:59.880 - 40:03.760] And what that lets you do is you can kind of take out the part that models the other [40:03.760 - 40:10.520] agent's rewards and substitute your own reward function in with the other agent's policy. [40:10.520 - 40:14.040] And that lets you compute, hey, if I were to act like the other agent right now, if [40:14.040 - 40:19.020] I were to copy, you know, agent two over here, would I actually get more rewards under my [40:19.020 - 40:25.980] own reward function? And so that lets you just flexibly choose who and what [40:25.980 - 40:31.320] to imitate and when. So at every time step, you can choose to rely on your own policy [40:31.320 - 40:34.320] or you can choose to copy someone else and you can choose who's the most appropriate [40:34.320 - 40:40.240] person to copy.
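Editor's note: here is a toy sketch of the "substitute my own reward function into another agent's policy" idea. It is not the PsiPhi-Learning implementation. It assumes rewards are linear in some features, so the value of acting like agent k is the dot product of that agent's successor features with my own reward weights. Every number and name below is invented for illustration.

```python
# Editor's sketch: all vectors, agent names and actions are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def choose_whom_to_follow(successor_features, my_reward_weights):
    """Return (agent, action, value) with the highest value under MY rewards.

    successor_features: agent -> action -> vector psi(s, a), the expected
        discounted future features if you take `action` in the current state
        and then follow that agent's policy.
    my_reward_weights: vector w such that my reward is features . w
    """
    best = None
    for agent, per_action in successor_features.items():
        for action, psi in per_action.items():
            # Swap in my own reward weights, whatever that agent is optimizing.
            q = dot(psi, my_reward_weights)
            if best is None or q > best[2]:
                best = (agent, action, q)
    return best


my_reward_weights = [1.0, 0.0, -1.0]
successor_features = {
    "self":    {"left": [0.5, 0.5, 0.0], "right": [0.25, 0.0, 0.5]},
    "agent_2": {"left": [2.0, 0.0, 0.5], "right": [0.0, 3.0, 0.0]},
    "agent_3": {"left": [0.0, 1.0, 2.0], "right": [0.5, 0.0, 1.0]},
}

agent, action, q = choose_whom_to_follow(successor_features, my_reward_weights)
print(agent, action, q)
```

In this made-up example, copying agent_2 is worth more under my own rewards than following my own policy, while agent_3 is a bad driver from my point of view and gets ignored. The choice is re-evaluated at every time step.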
And what we show is that that actually gets you better performance than [40:40.240 - 40:44.560] either purely relying on imitation learning, which is going to fail if the other agents [40:44.560 - 40:50.680] are doing bad stuff or purely relying on RL, where you're going to miss out on a bunch [40:50.680 - 40:53.920] of useful behaviors that other agents know how to do if you're just trying to discover [40:53.920 - 40:58.360] everything yourself. So I think that whole direction is actually quite interesting to [40:58.360 - 41:04.520] me. I did skim that paper. And it reminded me of an old multi agent competition [41:04.520 - 41:10.600] I once did, Bomberman. And it was quite challenging to work with these other agents. And it would [41:10.600 - 41:15.440] have been pretty cool to be able to imitate them better. And I could imagine [41:15.440 - 41:20.040] that for humans, we're learning from other people all the time, probably ever [41:20.040 - 41:25.200] since birth. And we haven't really spent as much time thinking about that in AI. [41:25.200 - 41:27.680] That's something I'm really excited about. I don't know if we talked about this last [41:27.680 - 41:33.360] time, but this whole idea that a big component of human intelligence and what sets us apart [41:33.360 - 41:38.680] from other animals or, you know, other forms of intelligence is that we rely so heavily [41:38.680 - 41:43.720] on social learning. Like we discover almost nothing completely independently, like, look [41:43.720 - 41:47.520] at research, right? So much of it is reading what everyone else has done and then making [41:47.520 - 41:53.440] a tiny tweak on top. Right? So it's just that kind of standing on the shoulders [41:53.440 - 41:58.080] of giants, learning from others, I see is really important.
I also see social learning [41:58.080 - 42:01.960] as a path to address this sort of like truck on truck on truck problem we were talking [42:01.960 - 42:08.040] about earlier. Like you kind of need adaptive online generalization to solve some of these [42:08.040 - 42:13.800] safety-critical problems. So imagine I'm a self driving car. And I encounter a [42:13.800 - 42:18.280] situation that I've never seen in my training data, which is like, there's a big flood. [42:18.280 - 42:23.720] And the bridge I'm trying to go under is completely flooded. Right? And if I just drive forward, [42:23.720 - 42:30.080] I can actually destroy my car and put the passengers in danger, right? But the other [42:30.080 - 42:34.200] humans on the road are probably gonna be pretty smart and realize what they should [42:34.200 - 42:38.480] do, or they'll have a better chance of realizing it than me, the self driving car. So maybe [42:38.480 - 42:42.640] I should be at that point actually relying more on social learning to take cues from [42:42.640 - 42:48.160] others and use that as a way to adapt to the situation, rather than just relying [42:48.160 - 42:52.520] on my pre training data. And this isn't just my idea. Like I think Anca Dragan has a nice [42:52.520 - 42:57.840] paper on this. If your self driving car is uncertain, it should be copying [42:57.840 - 43:01.360] other agents. But I think there's something really promising there. [43:01.360 - 43:06.240] Yeah, coming back to that truck on truck on truck, like there's no limit to what things [43:06.240 - 43:11.720] you might stack. I used to live in India and the stuff you would see on a truck in India [43:11.720 - 43:16.520] is just so unpredictable. But the way I recognize what it is, is I look at [43:16.520 - 43:21.200] the lower part of it. And I'm like, Oh, it has truck wheels. No matter what weird [43:21.200 - 43:26.560] thing is on top, that is a truck.
And I think the models that we have right now aren't [43:26.560 - 43:31.680] very good at like, ignoring distracting stuff. That's more a problem with the [43:31.680 - 43:36.080] function approximator. I don't think it's a real RL issue. But, um, that's [43:36.080 - 43:40.800] always disappointed me that we haven't somehow got past that distractor feature. [43:40.800 - 43:46.360] That's a really insightful point. And I think, you know, there's many different things we [43:46.360 - 43:51.040] have to solve with AI. If I'm channeling like Josh Tenenbaum's answer to the problem you [43:51.040 - 43:55.040] just brought up, I mean, he would basically, well, I don't know how good of a job I can [43:55.040 - 43:59.040] do channeling Josh Tenenbaum, but he would say like, we need more symbolic representations [43:59.040 - 44:03.060] where we can generalize representation to understand that like, a truck with hay on [44:03.060 - 44:08.360] it is still fundamentally a truck. Like there's some fundamental characteristics that make [44:08.360 - 44:12.100] the definition of this thing. And we shouldn't be just doing like this purely [44:12.100 - 44:16.920] inductive deep learning thing of like, I've seen a bazillion examples of a truck, and [44:16.920 - 44:20.520] therefore I can recognize a truck. But if it goes out of my distribution, I can't recognize [44:20.520 - 44:27.240] it. I mean, maybe this is the problem of representation. And just to be very like, speculative, [44:27.240 - 44:32.000] I do think there's something promising about models that integrate language, speaking of [44:32.000 - 44:36.120] why I want to put language models into agents that actually like put an actual language [44:36.120 - 44:41.420] representation into an RL agent, like because language is compositional, you get these kind [44:41.420 - 44:44.640] of compositional representations that could potentially help you generalize better.
So [44:44.640 - 44:50.320] like, if you look at like, image and language models, you know, like CLIP, or you look at [44:50.320 - 44:55.440] all these image generation models, we see very strong evidence of compositionality, [44:55.440 - 45:00.960] right? Like you get these prompts that clearly have never been in the training data. And [45:00.960 - 45:05.660] they're able to generate convincing images of them. And I think that's just because language [45:05.660 - 45:11.360] helps you organize your representation in a way that allows you to combine these components. [45:11.360 - 45:15.200] So maybe like a compositional representation of a truck is like, yeah, it's more like, [45:15.200 - 45:18.840] it definitely has to have wheels. But it doesn't matter what it's carrying. [45:18.840 - 45:23.960] This reminds me of a poster I saw at ICML called concept bottleneck model. [45:23.960 - 45:29.720] Oh, yeah. Exactly. I'm doing a concept bottleneck model for multi agent interpretability paper. [45:29.720 - 45:34.520] I think we're going to release it on arXiv very soon. I'm very excited about it. But [45:34.520 - 45:36.160] yeah, it's a cool idea. [45:36.160 - 45:40.440] Great, looking forward to that too. Yeah, I just want to say it's always such a good time [45:40.440 - 45:44.680] chatting with you. It's really enjoyable. I always learn so much. I'm inspired. I can't [45:44.680 - 45:49.280] wait to see what you come up with next. Thanks so much for sharing your time with the [45:49.280 - 46:12.400] Talk RL audience. Thank you so much. I really appreciate being here.