TalkRL: The Reinforcement Learning Podcast

Hear about why OpenAI cites her work in RLHF and dialog models, approaches to rewards in RLHF, ChatGPT, Industry vs Academia, PsiPhi-Learning, AGI and more! 

Dr Natasha Jaques is a Senior Research Scientist at Google Brain.

Featured References

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, Rosalind Picard 

Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E. Turner, Douglas Eck 

PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning
Angelos Filos, Clare Lyle, Yarin Gal, Sergey Levine, Natasha Jaques, Gregory Farquhar 

Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience
Marwa Abdulhai, Natasha Jaques, Sergey Levine 


Additional References  

Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

[00:00.000 - 00:11.120]
TalkRL podcast is all reinforcement learning all the time, featuring brilliant guests,
[00:11.120 - 00:17.320]
both research and applied. Join the conversation on Twitter at @TalkRLPodcast. I'm your host
[00:17.320 - 00:23.440]
Robin Chauhan.
[00:23.440 - 00:27.680]
Dr. Natasha Jaques is a senior research scientist at Google Brain, and she was our first guest
[00:27.680 - 00:32.440]
on the show three and a half years ago on TalkRL episode one. Natasha, I'm super honored
[00:32.440 - 00:36.440]
and also totally stoked to welcome you back for round two. Thanks for being here today.
[00:36.440 - 00:39.960]
Well, thank you so much for having me. I'm stoked to be back.
[00:39.960 - 00:44.200]
So when we did that first interview back in 2019, I remember you were just wrapping up your
[00:44.200 - 00:49.960]
PhD at MIT. And I can tell you've been super busy, and lots of things have been happening
[00:49.960 - 00:55.520]
in RL and AI in general since then. So can you start us off with like, what do you feel
[00:55.520 - 01:00.240]
have been like the big exciting advances and trends in your field since you completed your
[01:00.240 - 01:01.240]
PhD?
[01:01.240 - 01:05.440]
Yeah, well, I think it's kind of obvious, right? I mean, everyone's obsessed with the
[01:05.440 - 01:11.360]
progress in large language models that has been happening, you know, ChatGPT, how the
[01:11.360 - 01:17.280]
API is getting deployed. I mean, image and language models,
[01:17.280 - 01:19.440]
diffusion models, there's so much going on.
[01:19.440 - 01:23.840]
Yeah, like you said, all this buzz around ChatGPT, and reinforcement learning from
[01:23.840 - 01:28.000]
human feedback and the dialogue models in general. And of course, you were really early
[01:28.000 - 01:33.280]
in that space. And a lot of the key OpenAI papers actually cite your work in this space.
[01:33.280 - 01:39.000]
And there are a few of them. Can you talk a bit about how your work in that area relates
[01:39.000 - 01:44.560]
to what OpenAI is doing today and what these models are doing today?
[01:44.560 - 01:53.040]
Sure, yeah. So I guess, like, let me take you back to 2016, when I was thinking about
[01:53.040 - 01:57.920]
how do you take a pre-trained language model, but in that case, I was actually looking at
[01:57.920 - 02:04.640]
LSTMs, so, like, early stuff, and actually fine-tune it with reinforcement learning. And at
[02:04.640 - 02:10.000]
that time, I was actually looking not at language, per se, but at like, music generation and
[02:10.000 - 02:15.600]
even generating molecules that might look like drugs. But I think the molecules
[02:15.600 - 02:20.280]
examples is a really good way to see this. So basically, the idea was like, we have a
[02:20.280 - 02:25.400]
data set of known molecules, so we could train a supervised model on it and have it generate
[02:25.400 - 02:30.240]
new molecules. But those molecules don't really have like the properties that we want, right?
[02:30.240 - 02:34.960]
We might want molecules that are more easily able to be synthesized as a drug. So we have
[02:34.960 - 02:42.000]
scores that are like the synthetic accessibility of the molecule. But neither thing
[02:42.000 - 02:45.960]
is perfect. If you just train on the data, you don't get optimized molecules. If you
[02:45.960 - 02:50.480]
just optimize for synthetic accessibility, then you would get molecules that are just
[02:50.480 - 02:57.280]
like long chains of carbon, right? So they're useless as a drug, for example. So what you
[02:57.280 - 03:01.040]
can see, like in this problem, you can use reinforcement learning to optimize
[03:01.040 - 03:05.400]
for drug likeness or synthetic accessibility, but it's not perfect. The data is not perfect.
[03:05.400 - 03:09.600]
So how do you combine both? So what we ended up proposing was this approach where you pre
[03:09.600 - 03:14.220]
train on data, and then you train with RL to optimize some reward, but you minimize
[03:14.220 - 03:18.600]
the KL divergence from your pre-trained policy that you trained on data. So we call that
[03:18.600 - 03:23.080]
your pre-trained prior. And this approach lets you flexibly combine both supervised
[03:23.080 - 03:28.140]
learning, get the benefit of the data, and RL, where you kind of optimize within the
[03:28.140 - 03:35.080]
space of things that are probable under the data distribution for
[03:35.080 - 03:39.960]
sequences that have high reward. And so you can see how this is obviously related to what's
[03:39.960 - 03:45.040]
going on with RLHF right now, which is that they pre-train a large language model
[03:45.040 - 03:49.320]
on a dataset. And then they say, let's optimize for human feedback, but we're still going
[03:49.320 - 03:53.580]
to minimize the KL divergence from that pre-trained prior model. So they still end
[03:53.580 - 03:58.080]
up using that technique. And it turns out to be pretty important
[03:58.080 - 04:04.760]
to the RLHF framework. But I was also working on RLHF, the idea of learning
[04:04.760 - 04:10.800]
from human feedback. In around 2019, we took that same KL control approach. And we actually
[04:10.800 - 04:15.880]
had dialogue models try to optimize for signals that they got from talking to humans in a
[04:15.880 - 04:24.600]
conversation. But what we were doing is, instead of having the humans rate which dialogue
[04:24.600 - 04:30.640]
entries were good or bad, or do the preference ranking that OpenAI is doing with RLHF, we
[04:30.640 - 04:34.720]
wanted to learn from implicit signals in the conversation with the humans. So they don't
[04:34.720 - 04:38.480]
have to go out of their way to provide any extra feedback. What can we get from just
[04:38.480 - 04:44.140]
the text that they're typing? So we did things like analyze the sentiment of the text. So
[04:44.140 - 04:49.320]
if the person sounded generally happy, then we would use that as a positive reward signal
[04:49.320 - 04:54.240]
to train the model. Whereas if they sounded frustrated or confused, that's probably a
[04:54.240 - 04:58.240]
sign that the model is saying something nonsensical, we can use that as a negative reward. And
[04:58.240 - 05:01.760]
so we worked on actually optimizing those kind of signals with the same technique.
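[Editor's note: for readers who want the KL-control idea above in code, here is a minimal sketch with my own naming; the exact per-token weighting differs between Sequence Tutor, the Way Off-Policy paper, and later RLHF systems.]

```python
def kl_shaped_reward(task_reward, policy_logprob, prior_logprob, beta=0.1):
    """Combine the task reward with a penalty for drifting from the pre-trained prior.

    task_reward: the score you actually care about (drug-likeness, synthetic
        accessibility, implicit human feedback, ...).
    policy_logprob / prior_logprob: log-probability of the sampled sequence under
        the current policy and under the frozen pre-trained model.
    beta: strength of the pull back toward the prior (an arbitrary illustrative value).
    """
    # In expectation, (policy_logprob - prior_logprob) is the KL divergence from the
    # prior, so subtracting it keeps the optimization inside the data distribution.
    return task_reward - beta * (policy_logprob - prior_logprob)
```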
[05:01.760 - 05:06.800]
I mean, it sounds so much like what ChatGPT is doing. Maybe the function approximator
[05:06.800 - 05:10.880]
was a bit different. Maybe the way you got the feedback was a bit different, but under
[05:10.880 - 05:12.680]
the hood, it was really RLHF.
[05:12.680 - 05:17.460]
Well, there are key differences. So OpenAI is taking a different approach than we did
[05:17.460 - 05:23.200]
in our 2019 paper on human feedback, in that they train this reward model. So we don't
[05:23.200 - 05:25.800]
do that. So what they're doing is they're saying, we're going to get a bunch of humans
[05:25.800 - 05:32.280]
to rate, which of two outputs is better. And we're going to train a model to approximate
[05:32.280 - 05:37.580]
those human ratings. And that idea is coming from way earlier, like OpenAI's early work
[05:37.580 - 05:42.200]
on Deep RL from Human Preferences, if you remember that paper. And in contrast, the
[05:42.200 - 05:49.960]
stuff I was doing in 2019 was offline RL. So I would use actual human ratings of a specific
[05:49.960 - 05:58.640]
output, and then train on that as one example of a reward. But I didn't have this generalizable
[05:58.640 - 06:02.720]
reward model that could be applied across more examples. So I think there's a good argument
[06:02.720 - 06:07.720]
to be made that the train-a-reward-model approach actually seems to scale pretty well,
[06:07.720 - 06:09.920]
because you can sample it so many times.
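[Editor's note: a minimal sketch of the pairwise reward-model objective described above, in the usual Bradley-Terry style; reward_model and the tokenized inputs are hypothetical stand-ins, not OpenAI's actual code.]

```python
import torch.nn.functional as F

def preference_loss(reward_model, preferred_ids, rejected_ids):
    """Train a scalar reward model from 'output A was rated better than output B' pairs."""
    r_preferred = reward_model(preferred_ids)  # scalar score for the preferred output
    r_rejected = reward_model(rejected_ids)    # scalar score for the other output
    # Maximize the probability that the preferred output scores higher.
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```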
[06:09.920 - 06:15.600]
Can we talk about also the challenges and limits of this approach? So in the last episode,
[06:15.600 - 06:22.800]
38, we featured OpenAI co-founder and inventor of PPO, John Schulman, who did a lot of the
[06:22.800 - 06:28.680]
RLHF work at OpenAI. And he talked about InstructGPT, the sibling model to ChatGPT, because
[06:28.680 - 06:33.120]
ChatGPT wasn't released yet. And there is no ChatGPT paper yet. But the InstructGPT paper explained
[06:33.120 - 06:37.320]
that it required a lot of human feedback. And the instructions for the human raters
[06:37.320 - 06:41.320]
were really detailed and super long. And so there was a significant
[06:41.320 - 06:46.440]
cost in getting all of that human feedback. So I guess I wonder what you think
[06:46.440 - 06:53.280]
about that. Is that cost going to limit how useful RLHF can be? Or is that not
[06:53.280 - 06:55.480]
a big deal? Because it's totally worth it?
[06:55.480 - 06:59.400]
Yeah, I mean, that's a great question. And going back and reading the history of papers
[06:59.400 - 07:05.160]
they've been doing on RLHF, even before InstructGPT, like in the summarization stuff, it seems
[07:05.160 - 07:11.680]
like one of the key enablers of getting RLHF to work effectively, is actually investing
[07:11.680 - 07:16.400]
a lot into getting quality human data. So they have these two summarization
[07:16.400 - 07:20.000]
papers where one, I guess wasn't working that well, then they have a follow up where they
[07:20.000 - 07:23.840]
said one of the key differences was, we just did a better job recruiting raters that were
[07:23.840 - 07:28.400]
going to agree with the researchers; we were taking a high-touch approach of being
[07:28.400 - 07:32.520]
able to be in a shared Slack group with the raters to answer their questions and make
[07:32.520 - 07:37.560]
sure they stay aligned. And like that investment in the quality of the data that they collected
[07:37.560 - 07:42.440]
from humans was key in getting this to work. So it is obviously expensive. But what I was
[07:42.440 - 07:48.000]
struck by in those papers and also in InstructGPT is that, as you'll notice in Instruct
[07:48.000 - 07:55.680]
GPT, the, what was it, the 1.3-billion-parameter model trained with RLHF is outperforming
[07:55.680 - 08:03.200]
the 175-billion-parameter model trained with supervised learning. So a model 100x the
[08:03.200 - 08:08.640]
size is outperformed by just doing some of this RLHF. And obviously training
[08:08.640 - 08:13.520]
a 100x-size model with supervised learning is extremely expensive in terms of compute. So
[08:13.520 - 08:17.160]
I don't think OpenAI released the actual numbers and dollar
[08:17.160 - 08:21.320]
value that they spent on collecting human data versus like training giant models. But
[08:21.320 - 08:26.560]
you could make a good case that RLHF actually is cost-effective because it could reduce
[08:26.560 - 08:28.280]
the cost of training larger models.
[08:28.280 - 08:32.880]
Okay, that part makes sense to me. But then when I think about the, you know, this data
[08:32.880 - 08:40.000]
set that's been collected, I mean, they're using the data for on-policy training. From
[08:40.000 - 08:44.920]
what I understand, they're using PPO, which is an on-policy method. And on-policy methods,
[08:44.920 - 08:51.000]
generally, or the way I see them, is you can't reuse the data, because they depend on
[08:51.000 - 08:56.280]
data sampled from this model or from a very close-by model. So if you start training on
[08:56.280 - 09:00.960]
this data, and the model drifts away, then is that dataset still going to be useful?
[09:00.960 - 09:04.720]
Or could it ever be used for another model? Like, are these, like, disposable
[09:04.720 - 09:08.240]
datasets that are only used for that model at one point in time?
[09:08.240 - 09:12.360]
I wouldn't say it's disposable, like I would still use that data, because the data they
[09:12.360 - 09:17.080]
actually use is like comparisons of summaries, and then they use it to train the reward model.
[09:17.080 - 09:20.920]
And so your reward model can be kind of like trained offline in that way and used for your
[09:20.920 - 09:27.480]
policy. But the actual comparisons they do, from my understanding, compare
[09:27.480 - 09:32.200]
not only their current RL model, but they're comparing the supervised baseline, they're
[09:32.200 - 09:36.360]
comparing the instructions from the dataset. So you kind of get this general property
[09:36.360 - 09:41.040]
of like, is this summary better than another summary? Right. And I think that's kind of
[09:41.040 - 09:45.920]
a reusable truth about the data. If you just look at these as general summaries,
[09:45.920 - 09:49.840]
and this is what makes a high-quality summary, then why couldn't that apply across different
[09:49.840 - 09:53.800]
models? So those datasets are totally reusable. And maybe we can cost-effectively
[09:53.800 - 09:56.240]
build up these libraries of data sets that way.
[09:56.240 - 10:00.600]
Yeah, to put a finer point on it, the data that they use to train their reward
[10:00.600 - 10:06.040]
model comes from a bunch of models, not just their RL model. So they are using, quote
[10:06.040 - 10:10.680]
unquote, off policy data to train their reward model. And it's working.
[10:10.680 - 10:14.560]
The human feedback is like only valid for a limited amount of training. Like John was
[10:14.560 - 10:19.280]
saying, if you train with that same reward model for too long, your performance ends up
[10:19.280 - 10:23.720]
falling off at some point. So I guess the implication is that you would have to keep
[10:23.720 - 10:27.400]
collecting additional human feedback after every stage, like after you've trained to
[10:27.400 - 10:31.520]
a certain degree, to improve it further might require a whole new dataset. We didn't really
[10:31.520 - 10:34.720]
get into that in the chat with John. But I wonder if you had any comment
[10:34.720 - 10:35.720]
about that part.
[10:35.720 - 10:39.400]
I can't say as much for what's going on with OpenAI's work. But I can say I observed
[10:39.400 - 10:44.640]
this phenomenon in my own work trying to optimize for reward, but still do something probable
[10:44.640 - 10:49.320]
under the data. And you can definitely sort of over exploit the reward function. So like
[10:49.320 - 10:54.680]
when I was training dialogue models, we had this reward function that would reward the
[10:54.680 - 11:00.200]
dialogue model for having a conversation with a human such that the human seemed positive,
[11:00.200 - 11:03.880]
seemed to be responding positively, but also that the dialogue model itself was outputting sort
[11:03.880 - 11:09.240]
of like, high sentiment, text and stuff like that. And we had a very limited amount of
[11:09.240 - 11:12.880]
data. So I think we might have like quickly overfit to the data and the rewards that were
[11:12.880 - 11:18.640]
in it. And what you see is the policy kind of collapse a little bit. So its
[11:18.640 - 11:22.680]
objective is to stay within something that's probable under the data distribution,
[11:22.680 - 11:27.560]
but maximize the reward. But ultimately, even though we're using maximum entropy RL,
[11:27.560 - 11:33.840]
it's trying to find the optimal policy. So it doesn't really care. It ended up
[11:33.840 - 11:37.440]
having sort of a really restricted set of behaviors where it could get kind of repetitive
[11:37.440 - 11:42.720]
and sort of exploit the reward function. So our agent with those rewards kind of got overly
[11:42.720 - 11:47.520]
positive, polite and cheerful. So I always joke that it was like the most Canadian dialogue
[11:47.520 - 11:57.160]
agent you could train. We can say that because we're two Canadians. Exactly, exactly. But
[11:57.160 - 12:02.320]
yeah, it was kind of collapsing. Like, it came at the cost of diversity
[12:02.320 - 12:06.560]
in the text that was output. So I wonder if there's something similar going on with their
[12:06.560 - 12:11.800]
results about like training too long on the reward model actually leads to diminishing
[12:11.800 - 12:19.080]
and then eventually like negative returns. And it seems that the reward model isn't perfect.
[12:19.080 - 12:22.280]
If you look at the accuracy of the reward model on the validation data, it's like in
[12:22.280 - 12:27.560]
the seventies or something. So it's not perfectly describing what is quality. So if you really
[12:27.560 - 12:32.080]
overfit to that reward model, it's not clear that it's going to be comprehensive enough
[12:32.080 - 12:36.680]
to describe good outputs. I gather that some of your past work in this
[12:36.680 - 12:41.000]
area was like doing RL at the token level, like considering each token as a separate
[12:41.000 - 12:45.320]
action, maybe in Sequence Tutor and in your Way Off-Policy paper. Was that how
[12:45.320 - 12:50.960]
it worked? Was it individual token actions? Yes. But I would mention that so is Instruct
[12:50.960 - 12:56.080]
GPT, if you dig into it. So what they end up doing is, it's a little easier
[12:56.080 - 12:59.640]
in policy gradients because you can get the log-probability of the whole sequence by just
[12:59.640 - 13:04.160]
summing the log-probabilities over the individual tokens. But at the end of the day, your loss
[13:04.160 - 13:08.640]
is still being propagated into your model at the token level by increasing or decreasing
[13:08.640 - 13:13.240]
token-level probabilities. Oh, so you're saying, because the paper says that
[13:13.240 - 13:18.160]
it framed it as a bandit. And to me, that meant the entire sample, all the tokens together
[13:18.160 - 13:22.840]
were taken as one action. But you're saying because of the way it's constructed, then
[13:22.840 - 13:28.240]
it still breaks down to token-level probabilities. Yeah, you can write the math as, reward
[13:28.240 - 13:33.800]
of the entire output times the log-probability of the entire output.
[13:33.800 - 13:37.640]
But under the hood, the way you get the log-probability of the entire output is a sum of the token
[13:37.640 - 13:42.600]
level log-probabilities. So the way that's going to actually change the model is to affect
[13:42.600 - 13:47.120]
token-level probabilities. This is why I like having this podcast, because that question
[13:47.120 - 13:51.400]
was, for a while, like, who's gonna explain this to me? So thank you for clearing
[13:51.400 - 13:56.320]
that up for me, Natasha. That's really cool. No problem. So does that mean there's no benefit
[13:56.320 - 14:00.840]
to looking at a token level? Or like, is it always going to be this way? Because like,
[14:00.840 - 14:05.280]
I think John was saying that it's more tractable to do it this way, as a whole sample.
[14:05.280 - 14:08.920]
So what they're actually doing that might be a little bit different than token level
[14:08.920 - 14:15.040]
RL normally is like, their discount factor is one. So they apply the same reward to all
[14:15.040 - 14:20.800]
of the tokens in the sequence. And there's no discounting where, like,
[14:20.800 - 14:23.320]
earlier in the sequence, you're discounting the reward you're going to get at the end
[14:23.320 - 14:26.640]
of the sequence, or whatever. So that
[14:26.640 - 14:30.000]
is a difference. That makes sense. It seems to be working well for them. Yeah, because
[14:30.000 - 14:33.560]
it matters just as much what you say at the end, like if you say NOT in capital letters,
[14:33.560 - 14:35.480]
then that's kind of important.
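[Editor's note: a minimal sketch of the point being made here, written as a plain REINFORCE-style loss with no baseline (InstructGPT itself uses PPO, but the credit-assignment point is the same). The single sequence-level reward still reaches every token, because the sequence log-probability is the sum of the token log-probabilities and the discount is one. Names are hypothetical.]

```python
import torch

def sequence_level_pg_loss(token_logprobs: torch.Tensor, sequence_reward: float):
    """One sampled output, where token_logprobs[t] = log pi(token_t | prefix_t)."""
    sequence_logprob = token_logprobs.sum()  # log-probability of the whole output
    # With a discount of one, the same reward credits every token equally; the
    # gradient of this loss adjusts each token-level probability.
    return -(sequence_reward * sequence_logprob)
```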
[14:35.480 - 14:41.720]
Yeah, exactly. And I think in my work, if I recall correctly, we had experimented. So
[14:41.720 - 14:46.080]
we had rewards that were at the sequence level as well, even at the level
[14:46.080 - 14:51.240]
of the whole dialogue. So we had stuff about like, how long does the conversation go on,
[14:51.240 - 14:56.160]
which is of course, across many dialogue turns. And then we had sentence level rewards that
[14:56.160 - 15:00.640]
were spread equally over the tokens in the sentence. But for something like conversation
[15:00.640 - 15:05.240]
length, we did have a discount factor, you aren't sure the conversation is going to go
[15:05.240 - 15:09.160]
on as long as it is at the beginning. So you discount that reward. But once you're already
[15:09.160 - 15:14.120]
having a long conversation, then the reward is higher. And it was very difficult to optimize
[15:14.120 - 15:17.120]
those discounted rewards across the whole conversation.
[15:17.120 - 15:19.720]
So you combined rewards at different levels?
[15:19.720 - 15:20.720]
Yeah, yeah.
[15:20.720 - 15:25.760]
Which kind of reminds me of this recursive reward modeling. There was a paper from Leike
[15:25.760 - 15:31.600]
et al. out of DeepMind in 2018. It seems like the idea here is taking this whole
[15:31.600 - 15:37.400]
RLHF further and stacking them for more complex domains, where we have models that help the
[15:37.400 - 15:43.000]
humans provide the human feedback and stacking them up. Do you have any thoughts about recursive
[15:43.000 - 15:47.200]
reward models? Do you think that's a promising way forward? Or like, are we gonna need that
[15:47.200 - 15:48.200]
soon?
[15:48.200 - 15:52.040]
I mean, so my understanding of their example of like a recursive reward model is the user
[15:52.040 - 15:56.520]
wants to write a fantasy novel, but, like, writing a whole novel and then having
[15:56.520 - 16:00.160]
that evaluated would be very expensive, and you get very little data. So you could have
[16:00.160 - 16:07.760]
a bunch of RLHF trained assistants that do things like check the grammar or summarize
[16:07.760 - 16:11.880]
the character development up to this point or something like that. And that can assist
[16:11.880 - 16:17.440]
the user in doing the task. So I think like, sure, that idea makes sense. If you want to,
[16:17.440 - 16:21.480]
if I were to make a company that's helping people write novels, I would do it at that
[16:21.480 - 16:27.680]
level rather than at the level of the whole novel, right? So that's definitely cool.
[16:27.680 - 16:32.400]
But in terms of like, pushing forward the boundaries of RLHF, I think what I would bet
[16:32.400 - 16:36.180]
on, and maybe I'm just biased, because this is literally my own work, but I would still
[16:36.180 - 16:42.560]
bet on this idea of trying to get other forms of feedback than just humans comparing
[16:42.560 - 16:47.040]
two answers and ranking them. So I'm not saying my work is the perfect answer,
[16:47.040 - 16:52.200]
but we were trying to get this type of implicit signal that you're getting during the interaction
[16:52.200 - 16:57.080]
all the time. And so, you know, when you're speaking about, oh, RLHF is so expensive to
[16:57.080 - 17:03.480]
collect the human data. Well, what if you could be getting data for free, in a way
[17:03.480 - 17:08.120]
that's pervasive in your interactions? And so it doesn't cost anything additional to
[17:08.120 - 17:12.720]
find it. So, like, okay, imagine you're using the OpenAI Playground or something to play with
[17:12.720 - 17:19.560]
ChatGPT. How many times did you rephrase the same prompt until you got some behavior
[17:19.560 - 17:24.200]
and then stopped? Yeah, they must be like, could that be it? But not yet. Do you think
[17:24.200 - 17:28.880]
so? I don't know. You would hope so. Because otherwise, how are they going to scale this?
[17:28.880 - 17:32.400]
Like, they also have thumbs up and thumbs down. But that's kind of
[17:32.400 - 17:36.520]
limited feedback though, right? And it's not always about whether the sentiment is good.
[17:36.520 - 17:42.400]
Like you could be wanting to write something scary. Exactly. Yes. Sentiment isn't perfect.
[17:42.400 - 17:45.880]
You could also look at like, okay, I prompt GPT, I get some output. Like if they had a
[17:45.880 - 17:50.360]
way to like edit that output in the editor, which I don't actually know if they do in
[17:50.360 - 17:54.720]
the Playground, I have to look at that again. But any edits I made to the text would
[17:54.720 - 17:58.800]
be a signal that I didn't like it, like I need to fix this. So that could be a signal
[17:58.800 - 18:03.000]
you could be training on with RLHF. I feel like that's just going to be more scalable.
[18:03.000 - 18:06.840]
And ultimately, it's not the ground truth of the human rating of quality. But what we
[18:06.840 - 18:11.000]
show in our work is, we didn't just
[18:11.000 - 18:14.720]
use sentiment, we used a bunch of signals, and even though those are all imperfect, and only
[18:14.720 - 18:19.720]
proxy measures, optimizing for those things still did better than optimizing for the thumbs
[18:19.720 - 18:24.160]
up thumbs down that we built into the interface, because just no one wants to bother providing
[18:24.160 - 18:28.000]
that. You have to go out of your way out of the normal interaction that you're trying
[18:28.000 - 18:32.880]
to use to like sort of altruistically provide this extra feedback and people just don't.
[18:32.880 - 18:39.060]
So yeah, I think more scalable signals is the right direction. That makes so much sense.
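[Editor's note: a toy illustration of the implicit-signals idea discussed above, not a formula from any paper; the sentiment scorer, the weights, and the edit and rephrase signals are all hypothetical, and the Way Off-Policy paper combines several such proxies in a more careful way.]

```python
def implicit_reward(user_reply_sentiment: float, user_edited_output: bool,
                    user_rephrased_prompt: bool) -> float:
    """Derive a training reward from signals the user gives off for free.

    user_reply_sentiment: score in [-1, 1] from any off-the-shelf sentiment model.
    user_edited_output: the user had to fix the model's text (negative signal).
    user_rephrased_prompt: the user re-asked the same thing (negative signal).
    """
    reward = user_reply_sentiment
    if user_edited_output:
        reward -= 1.0
    if user_rephrased_prompt:
        reward -= 0.5
    return reward
```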
[18:39.060 - 18:43.380]
Are you up for talking about AGI?
[18:43.380 - 18:45.000]
Depends what the question is.
[18:45.000 - 18:48.680]
So first of all, do you think it's like it's something we should be talking about and thinking
[18:48.680 - 18:52.960]
about these days? Or is it like a distant fantasy? That's just not really worth talking
[18:52.960 - 18:53.960]
about.
[18:53.960 - 18:58.000]
Oh, man, I always get a little bit frustrated with like AGI conversations, because nobody
[18:58.000 - 19:02.380]
really knows what they're talking about when they say AGI. Like it's not clear what the
[19:02.380 - 19:07.800]
definition is. And if you try to pin people down, it can get a little bit circular. So
[19:07.800 - 19:12.560]
like, you know, I've had people tell me, oh, AGI is coming in five years, right? And I
[19:12.560 - 19:17.760]
say, okay, well, so how do you reconcile that with the fact that CEOs of self driving car
[19:17.760 - 19:22.880]
companies think that fully autonomous self-driving isn't coming for 20 years? Right?
[19:22.880 - 19:27.040]
So if AGI is in five, and then my definition of AGI might be it can do everything a human
[19:27.040 - 19:33.560]
can do, but better. That doesn't make sense, right? If it can't drive a car, it's not AGI.
[19:33.560 - 19:37.440]
But then people will say, oh, but it doesn't have to be embodied. And it can still be AGI.
[19:37.440 - 19:42.000]
And okay, but then what is it doing? Like, it's, it's just such a muddy, muddy concept,
[19:42.000 - 19:43.000]
right?
[19:43.000 - 19:46.840]
I've also been in these arguments or discussions. And then in the end, we just realized we have
[19:46.840 - 19:51.400]
different definitions. And then there's no point in arguing about two words that mean
[19:51.400 - 19:52.400]
different things.
[19:52.400 - 19:59.240]
All of that aside, I do think I have been really impressed and even a little bit concerned
[19:59.240 - 20:04.160]
about the pace of progress. Like, stuff is happening so fast that if you want to just
[20:04.160 - 20:12.360]
define AGI as highly disruptive, fast advancements in AI technology, I think we're already there.
[20:12.360 - 20:18.440]
Right? Like, look at ChatGPT, right? Universities are having to revise their entire curriculum
[20:18.440 - 20:23.320]
around writing take-home essays, because you can just get ChatGPT to write you an essay
[20:23.320 - 20:28.280]
better than an undergrad can. So it's already super disruptive. Like where we are now is
[20:28.280 - 20:29.680]
already super disruptive.
[20:29.680 - 20:35.800]
Yeah, it might not be, like, do-all-the-jobs AGI. But it's general. And
[20:35.800 - 20:40.160]
to me, ChatGPT is the first thing I've seen that really is so general. Like nothing has
[20:40.160 - 20:45.280]
been that general before. Imagining where that generality could take us in a few years
[20:45.280 - 20:49.880]
does make me think. Your point about the self-driving vehicles is well taken. Like, I think
[20:49.880 - 20:54.360]
everyone recognizes it's been a bit of a shit show with people predicting that it's going
[20:54.360 - 20:58.040]
to come in two years and three years and it just keeps getting pushed back and the timelines
[20:58.040 - 20:59.040]
just get longer.
[20:59.040 - 21:03.160]
I think embodiment is really hard. I think fitting the long tail of stuff in the real
[21:03.160 - 21:06.960]
world is really hard. So you might have seen this example. I think Andrej Karpathy
[21:06.960 - 21:14.400]
talked about it for Tesla, where they had an accident because the car couldn't
[21:14.400 - 21:19.480]
perceive this thing that happened, which was a semi truck carrying a semi truck carrying
[21:19.480 - 21:25.640]
a semi truck. So, like, a truck on a truck on a truck. And they were just like, I hadn't
[21:25.640 - 21:29.400]
even seen that before. It wasn't in the support of the training data. And of course we know
[21:29.400 - 21:34.280]
these models, like if they get off the support of the training data, don't do that well.
[21:34.280 - 21:39.280]
So how will you ever curate a dataset that's going to cover every single thing in the real
[21:39.280 - 21:44.200]
world? I would argue that you can't, especially because the real world is non-stationary.
[21:44.200 - 21:48.760]
It's always changing. So new things are always being introduced. So sort of definitionally,
[21:48.760 - 21:54.960]
you can't cover everything that might happen in the real world. And so, you know, that's
[21:54.960 - 21:58.400]
why I'm excited about some of these approaches. It sounds like you talked about this on a
[21:58.400 - 22:02.760]
previous episode, but like, um, I've been working on this like adversarial environment
[22:02.760 - 22:07.320]
design stuff or unsupervised environment design stuff for RL agents, where you actually try
[22:07.320 - 22:13.400]
to search for things that can make your model fail and like generate those problems, um,
[22:13.400 - 22:18.360]
and train on them. And I think that could be an approach that is more tenable than just
[22:18.360 - 22:23.560]
supervised learning on a limited dataset. Totally. Yeah. We spoke with your colleague,
[22:23.560 - 22:28.960]
Michael Dennis, who was a co-author of yours on the PAIRED paper. Is that right? Yes. Yeah,
[22:28.960 - 22:33.600]
exactly. Yeah. And I met him at the poster session at, I think it was ICML. I loved that
[22:33.600 - 22:36.560]
right away. And then I wasn't surprised at all to find your name on it. I didn't know
[22:36.560 - 22:41.160]
that at first. That makes total sense. That's exactly the type of thing Natasha would come
[22:41.160 - 22:45.680]
up with. The idea that embodiment, basically robotics, is super hard, or anything that has
[22:45.680 - 22:51.280]
to touch real-world sensors. And it seems what ChatGPT has shown us is, if we can stay
[22:51.280 - 22:57.280]
in the abstract world of text, we actually have like magic powers even today in 2022,
[22:57.280 - 23:03.240]
2023. Um, we could do a lot with the techniques we already have if we stay in
[23:03.240 - 23:10.040]
the world of text and abstract thought, and code, and, um, abstract symbols
[23:10.040 - 23:14.920]
basically. So maybe it goes back to that point of the real world and robotics
[23:14.920 - 23:18.880]
just turning out to be the really hard stuff, the animal intelligence being super
[23:18.880 - 23:23.080]
hard, and the abstract thought that we used to think made us so special is turning
[23:23.080 - 23:27.640]
out to be maybe way easier. We've already solved Go, which we thought was impossible not
[23:27.640 - 23:33.480]
long ago. And, uh, ChatGPT is showing us a level of generality we could
[23:33.480 - 23:39.800]
not expect from robotics, you know, maybe for ages. Yeah. And I mean, you probably remember
[23:39.800 - 23:43.120]
the name of this principle better than I do, but it's sort of the principle that, uh, things
[23:43.120 - 23:47.080]
that are really hard for us to solve, like chess and Go, are actually easy to get
[23:47.080 - 23:51.200]
AI to solve. Maybe because we have more awareness of the process, but like the most low level
[23:51.200 - 23:54.760]
stuff about, you know, manipulation, like how do you pick something up with your hand
[23:54.760 - 23:59.760]
is a very challenging problem. [Editor's note: I forgot, so I looked it up afterwards. This
[23:59.760 - 24:03.840]
is Moravec's paradox.] I want to share my favorite anecdote when thinking about why
[24:03.840 - 24:09.440]
embodiment is so hard. I've been working on this problem of, um, language-conditioned
[24:09.440 - 24:13.360]
RL agents. So they take a natural language instruction, they try to follow it and do
[24:13.360 - 24:18.720]
something in the world. Right. And, uh, so I was in, in that space, I was reading this
[24:18.720 - 24:23.080]
paper from DeepMind, which is, uh, Imitating Interactive Intelligence, and they have this
[24:23.080 - 24:27.440]
sort of simulated world where a robot can walk around and it's kind of like a video
[24:27.440 - 24:32.480]
game, like a low res video game kind of environment. So not super high res visuals, but it can
[24:32.480 - 24:36.880]
do things like, um, it'll get an instruction, like pick up the orange duck and put it on
[24:36.880 - 24:41.720]
the bed or pick up the cup and put it on the table or something like that. Right. And they
[24:41.720 - 24:46.400]
invested like two years. There's a team of 30 people. I heard they spent millions of
[24:46.400 - 24:52.960]
dollars on this project, right? They collect this massive human dataset of, um, people
[24:52.960 - 24:58.160]
giving instructions and then trying to follow those instructions in the environment. And
[24:58.160 - 25:02.280]
the dataset they collect is so massive that I think half of the instructions in the dataset
[25:02.280 - 25:06.600]
are exact duplicates of each other. So they'd have two copies of it, pick up the orange
[25:06.600 - 25:11.440]
duck and put it on the table or whatever. Um, and they train on this to the best of
[25:11.440 - 25:16.000]
their ability. And guess what, their success rate in actually following these instructions,
[25:16.000 - 25:19.920]
like guess what percentage of the time they can successfully follow the instructions in
[25:19.920 - 25:24.280]
this environment. I'm just trying to take a cue from you. I vaguely remember
[25:24.280 - 25:29.640]
this paper, but I'm going to guess it was terrible, like 5%? Not 5%, but it's 50%. 50%?
[25:29.640 - 25:34.960]
Okay. What do you feel about that number? It is shockingly low, or low for that much
[25:34.960 - 25:40.480]
investment and for a pretty simple problem. Like, it's surprising that they can't
[25:40.480 - 25:45.960]
do better. And I think that just illustrates how hard this is. You know, we've seen that
[25:45.960 - 25:49.840]
you can tie a text and images together pretty effectively. Like we're seeing all of these
[25:49.840 - 25:53.000]
texts to image generation models that are compositional. They're beautiful. They're
[25:53.000 - 25:58.440]
working really well. Um, so I don't think that's the problem, but just like adding this
[25:58.440 - 26:04.200]
idea of navigating a physical body in the environment to carry out the task while perceiving
[26:04.200 - 26:09.520]
vision and linking it to the text just becomes so hard and it's very hard to get anything
[26:09.520 - 26:10.520]
working.
[26:10.520 - 26:15.840]
Yeah. 50%. I don't know. It's higher than I thought. But let's see, uh,
[26:15.840 - 26:20.240]
we talked to Karol Hausman here a few episodes back, who is working on SayCan,
[26:20.240 - 26:27.360]
which is the kitchen robot that you can give verbal, which become textual, instructions,
[26:27.360 - 26:31.800]
and it is using RL and it is actually doing things in a real kitchen, you know, in
[26:31.800 - 26:36.600]
the real world, sponging things up. And, um, I mean, a few things struck
[26:36.600 - 26:40.800]
me about that. Like they were doing something that sounds kind of similar to what you're
[26:40.800 - 26:46.720]
describing and, but I was amazed by how much they had to divide up the problem and how
[26:46.720 - 26:51.240]
much work it was to build all the parts because they had to make separate value functions
[26:51.240 - 26:56.400]
for all their skills. And then, but I think connecting it to the text seemed to be kind
[26:56.400 - 26:57.400]
of the easier part.
[26:57.400 - 27:03.000]
Well, they actually don't connect text to embodiment, I would argue.
[27:03.000 - 27:08.520]
So first let me say Karol's an amazing person. He's great. SayCan is so great of a paper
[27:08.520 - 27:13.000]
that Google is amazingly excited about. And I'm actually doing some work that's like
[27:13.000 - 27:18.040]
a follow-up to SayCan, and it's literally the most crowded research area I've ever been
[27:18.040 - 27:22.700]
in. Like, there are so many Google interns working on follow-ups to SayCan; everyone's excited.
[27:22.700 - 27:28.240]
So it's great work. I'm not trashing the work at all, but they actually do separate the
[27:28.240 - 27:33.760]
problem of understanding the language and doing the embodied tasks almost completely
[27:33.760 - 27:38.080]
because the understanding of the language is entirely offloaded to a pre-trained large
[27:38.080 - 27:44.240]
language model. And then for executing tasks, you train a bunch of low-level
[27:44.240 - 27:49.480]
robotic policies that are able to like pick something up or do this. And you just select
[27:49.480 - 27:55.120]
which low level robotics policy to execute based on what looks probable under the language
[27:55.120 - 28:00.560]
model and what has the highest value estimate for those different policies. But there's
[28:00.560 - 28:08.080]
no network that's really doing high level language understanding and embodied manipulation
[28:08.080 - 28:13.320]
at the same time. Yeah. I thought it was innovative how they separated that so they didn't really
[28:13.320 - 28:18.560]
have to worry about that. They kind of like offloaded that whole problem to the LLM without
[28:18.560 - 28:21.940]
having the LLM know anything about robotics. It's definitely innovative and it works super
[28:21.940 - 28:27.320]
well and I think that's why the paper is exciting. But it's kind of, to me, like I was really
[28:27.320 - 28:31.920]
excited about this idea of an embodied agent that could really understand language and
[28:31.920 - 28:36.440]
do embodied stuff at the same time because if you think, okay, talking about what is
[28:36.440 - 28:42.440]
AGI, if we just use a definition of something that's like the maximally general representation
[28:42.440 - 28:48.920]
of knowledge, then you should have something that can not only understand text, but understand
[28:48.920 - 28:52.520]
how the text is mapped to images in the world because that's already going to expand your
[28:52.520 - 28:58.400]
representation, but understand how that maps to physics and how to navigate the world.
[28:58.400 - 29:01.880]
And so it'd be so cool if we could have an agent that actually like in the same network
[29:01.880 - 29:06.920]
is encoding all of those things. This is also just really reminding me of why I really like
[29:06.920 - 29:10.640]
talking with you, Tasha, because you're so passionate about this stuff. And also you
[29:10.640 - 29:17.660]
don't pull any punches. You will call a spade a spade no matter what. And you see the big
[29:17.660 - 29:24.480]
picture and you're so critical and sharp. And that's honestly the spirit that I was
[29:24.480 - 29:26.760]
looking for with this whole show.
[29:26.760 - 29:33.440]
I hope I'm not sounding too critical. I mean, I love this work, so.
[29:33.440 - 29:38.500]
I think my feedback on SayCan at a very high level is that they're depending on the language
[29:38.500 - 29:44.200]
model to already know what makes sense in that kitchen. But if they were in an untraditional
[29:44.200 - 29:47.600]
kitchen or they invented a new type of kitchen or they were in some kind of space where the
[29:47.600 - 29:52.980]
language model didn't really get it, then none of that would work. They're depending
[29:52.980 - 29:57.760]
on common sense of the language model to know what order to do things in the kitchen. And
[29:57.760 - 29:59.800]
they're assuming that common sense is common.
[29:59.800 - 30:03.400]
Yeah. And it's hard because they're kind of missing this like pragmatics thing too.
[30:03.400 - 30:07.680]
So humans could give you ambiguous instructions about what to do in the kitchen that could
[30:07.680 - 30:14.240]
only be resolved by looking around the kitchen. Like if they just said, get me that plate.
[30:14.240 - 30:19.240]
And there's multiple plates. How do you resolve that? Well, now you might want to use pragmatics
[30:19.240 - 30:24.640]
about like the plate that's closer to the human or something about like visually assessing
[30:24.640 - 30:27.920]
the environment, and SayCan's not going to be able to do that, right?
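[Editor's note: a rough sketch of the skill-selection rule described for SayCan above: the frozen language model scores each skill's text description as a plausible next step for the instruction, a learned value function scores whether that skill's low-level policy can succeed from the current state, and the product decides. Function names are hypothetical stand-ins, not the SayCan codebase.]

```python
import math

def select_skill(instruction, state, skills, llm_logprob, skill_value):
    """Pick the low-level skill whose language-model score times value estimate is highest."""
    scores = {
        # llm_logprob: how probable the language model finds this skill as a next step.
        # skill_value: how likely the skill's policy is to succeed from this state.
        skill: math.exp(llm_logprob(instruction, skill)) * skill_value(state, skill)
        for skill in skills
    }
    return max(scores, key=scores.get)
```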
[30:27.920 - 30:33.140]
Well, they had the Inner Monologue addition, which added this idea of having other voices.
[30:33.140 - 30:37.200]
And so that might be able to, if they had another voice that was like describing what
[30:37.200 - 30:42.320]
the person's doing or looking at, inject it into the conversation. And inner monologue
[30:42.320 - 30:46.720]
to me seemed very promising. That was the second part of our conversation with Karol
[30:46.720 - 30:51.180]
and Fei. And that was fascinating to me and a little smooth because this robot has an
[30:51.180 - 30:57.800]
inner monologue going. But that let them leverage the language model and have a lot more
[30:57.800 - 30:59.400]
input into it.
[30:59.400 - 31:00.400]
That's cool.
[31:00.400 - 31:01.520]
And it seemed like an extensible approach.
[31:01.520 - 31:05.320]
That's cool. That can be quite promising. I don't know. I still just want to see a model
[31:05.320 - 31:10.160]
that does vision, text, and embodiment. I'm excited for that when that comes.
[31:10.160 - 31:14.760]
I see that you're planning to return to academia as an assistant professor at the University of Washington,
[31:14.760 - 31:15.760]
is that right?
[31:15.760 - 31:16.760]
That's right.
[31:16.760 - 31:22.100]
Cool. So that's an interesting choice to me after working at leading labs in the industry.
[31:22.100 - 31:26.760]
And I bet some people might be looking to move in the opposite direction, especially since a
[31:26.760 - 31:32.040]
lot of people have talked about the challenges of doing cutting edge AI on academic budgets
[31:32.040 - 31:36.960]
when more and more of this AI depends on scale. That becomes very expensive. So can you tell
[31:36.960 - 31:41.880]
us more about the decision? What drew you back to academia? What's your thought process
[31:41.880 - 31:42.880]
here?
[31:42.880 - 31:47.080]
Yeah. I mean, so you might think like, if I want to contribute to AI, I need a massive
[31:47.080 - 31:52.440]
compute budget and I need to be training these large models and how can academics afford
[31:52.440 - 31:56.880]
that? But what I actually see happening as a result of this is that what's going on in
[31:56.880 - 32:02.440]
industry is that more and more people and researchers in industry are being encouraged
[32:02.440 - 32:08.680]
to sort of amalgamate into these large, large teams of 30 or 50 authors where they're all
[32:08.680 - 32:14.320]
just working on what looks more like a large scale engineering effort to scale up a research
[32:14.320 - 32:19.640]
idea that's kind of already been proven out. Right? So you'll see like, there's big teams
[32:19.640 - 32:25.060]
at Google that are now trying to work on RLHF and the RLHF they're doing is very similar
[32:25.060 - 32:28.240]
to what OpenAI is doing. They're just trying to actually scale it up and write their own
[32:28.240 - 32:33.440]
version of infrastructure and stuff like that. And I hear the same thing is going on. It
[32:33.440 - 32:39.080]
was already the case at OpenAI, where they're a little less focused on publishing, a little
[32:39.080 - 32:44.640]
more focused on scaling up in big teams. Apparently the pressure at DeepMind is similar,
[32:44.640 - 32:49.960]
where if you're pursuing your own little creative research direction, that's going to be less
[32:49.960 - 32:55.240]
tenable than actually jumping onto a big team and kind of contributing in that way. So if
[32:55.240 - 33:01.720]
you're interested in doing creative research, novel research that sort of hasn't been proven
[33:01.720 - 33:06.160]
out already and coming up with new ideas and testing them out, I think there's less room
[33:06.160 - 33:11.440]
for that in industry right now. And I actually care a lot about research freedom and the
[33:11.440 - 33:15.760]
ability to kind of like think of a clever idea and test it out myself and see if it's
[33:15.760 - 33:20.360]
going to work. And I think there's a real role for that. Obviously scaling this stuff
[33:20.360 - 33:25.960]
up in industry works really well, but what actually works is they do end up using ideas
[33:25.960 - 33:31.360]
that were innovated in academia and incorporating that into what they're scaling up. So we were
[33:31.360 - 33:36.000]
talking at the beginning of this podcast about just that idea of doing KL control from your
[33:36.000 - 33:40.340]
prior is something that I did on a very, very small scale in academia that ends up being
[33:40.340 - 33:45.800]
useful in the system eventually, right? In the system that gets scaled up. So I see the
[33:45.800 - 33:50.720]
role of academics to do that same kind of proof of concept work, like discover these
[33:50.720 - 33:55.920]
new novel research ideas that work and then industry can have the role of scaling them
[33:55.920 - 33:59.600]
up, right? And so it just depends on what you want to be doing. Like, do you want to
[33:59.600 - 34:03.680]
be on a giant team working on infrastructure or do you want to be doing the kind of more
[34:03.680 - 34:08.520]
researchy, testing-out-ideas thing? And for me, I'm much more excited about the latter.
[34:08.520 - 34:13.400]
That makes total sense. And like, I guess you're getting the credit from the citations
[34:13.400 - 34:18.120]
from these big papers that really work, but maybe not so much the public credit because
[34:18.120 - 34:23.280]
everyone just points to ChatGPT and thinks that is AI, like OpenAI invented AI,
[34:23.280 - 34:26.480]
but they're building on like the shoulders of all these giants from the past, including
[34:26.480 - 34:30.000]
yourself and all the academics know this, but for the public, it's like, Oh look, they
[34:30.000 - 34:31.000]
solved AI.
[34:31.000 - 34:37.400]
That's interesting. Yeah. I mean, I think my, my objective is more about like, well,
[34:37.400 - 34:41.680]
I just enjoy the process of like testing out ideas and seeing if they work, but my objective
[34:41.680 - 34:47.040]
is much more like, did you end up contributing something that was useful rather than did
[34:47.040 - 34:49.480]
you get the glory?
[34:49.480 - 34:54.600]
That's very legit. Okay. So, um, what do you plan to work on at UW? Have
[34:54.600 - 34:58.600]
you, do you have a clear idea of that or is that something that you'll decide?
[34:58.600 - 35:02.080]
I do have a clear idea because you kind of, they don't give you the job unless you can
[35:02.080 - 35:07.960]
kind of sell it and sell what you're going to do. So, um, yeah, I mean the pitch that
[35:07.960 - 35:12.360]
I was kind of pitching on the faculty job market is like, um, I want to do this thing
[35:12.360 - 35:16.960]
called social reinforcement learning. And the idea is what are the benefits you can
[35:16.960 - 35:21.640]
get in terms of improving AI when you consider the case that you're likely going to be learning
[35:21.640 - 35:26.400]
in an environment with other intelligent agents. So you can either think about that as like
[35:26.400 - 35:30.760]
setting up a multi-agent system to make your agent more robust. So PAIRED
[35:30.760 - 35:35.400]
would be in that kind of category of thing. Or you could think about this idea that, you
[35:35.400 - 35:38.280]
know, for most of what we want AI to do, you might be deployed in environments where there
[35:38.280 - 35:42.120]
are humans and humans are pretty smart and have a lot of knowledge that might benefit
[35:42.120 - 35:47.520]
you when you're trying to do a task. So not only thinking about how to flexibly learn
[35:47.520 - 35:52.040]
from humans, like when I think about social learning, I don't think about just indiscriminately
[35:52.040 - 35:58.520]
imitating every human, but maybe kind of the human skill of social learning is about identifying
[35:58.520 - 36:02.600]
which models are actually worth learning from and when you should rely on learning from
[36:02.600 - 36:07.160]
others versus your independent exploration. So I think that's like a whole set of questions.
[36:07.160 - 36:11.960]
And then finally, I want to just make AI that's useful for interacting with humans. So, you
[36:11.960 - 36:16.680]
know, how do you interact with a new human you've never seen before and cooperate with
[36:16.680 - 36:20.640]
them to solve a task? So kind of the zero shot cooperation problem, how do you perceive
[36:20.640 - 36:26.120]
what goal they're trying to solve? How do you learn from their feedback? And this is
[36:26.120 - 36:30.320]
including types of implicit feedback. And then finally, this whole branch of like, how
[36:30.320 - 36:34.440]
do you communicate with humans in natural language to solve tasks? So that's why I've
[36:34.440 - 36:38.320]
been working on this kind of language-conditioned RL, how do you train language models with
[36:38.320 - 36:43.160]
human feedback, this whole set of things. That's the pitch.
[36:43.160 - 36:47.560]
Awesome. And they obviously loved it because you're hired.
[36:47.560 - 36:51.120]
It depends, but yeah, I'm excited.
[36:51.120 - 36:55.840]
So I mean, it sounds like a lot of stuff that I had to learn as a young person, as an awkward
[36:55.840 - 37:03.520]
nerdy teen: how to talk to humans, which humans should I imitate? Right? Exactly. And then
[37:03.520 - 37:07.800]
do you want to talk about some of your recent papers since you've been on last,
[37:07.800 - 37:11.880]
which was three and a half years ago? I see on Google Scholar there have been lots
[37:11.880 - 37:16.720]
of papers since then with your name on them. But there were a few that we
[37:16.720 - 37:22.120]
had kind of talked about touching on today, including Basis and PsiPhi. Should we talk
[37:22.120 - 37:23.120]
about those?
[37:23.120 - 37:27.640]
Sure. So I think maybe I'll also add another paper that was like sort of the precursor
[37:27.640 - 37:32.280]
to PsiPhi from my perspective, really touching on this idea of, what is social learning
[37:32.280 - 37:37.040]
versus just like imitation learning versus RL. So I'm really thinking about this problem,
[37:37.040 - 37:41.680]
like you're in an environment with other agents that might have knowledge that's relevant
[37:41.680 - 37:45.680]
to the task, but you don't know if they do and they're pursuing self interested goals.
[37:45.680 - 37:49.840]
So you can think about like an autonomous car on the road. There are other cars that
[37:49.840 - 37:53.240]
are driving, but some of them are actually bad drivers. So you don't want to sort of
[37:53.240 - 37:58.240]
indiscriminately imitate or your robot in an office picking up trash. There are humans
[37:58.240 - 38:01.620]
that are going about their day. They don't want to stop and sort of explicitly teach
[38:01.620 - 38:05.220]
you what to do. They're trying to get work done. So how do you benefit from learning
[38:05.220 - 38:06.220]
from that?
[38:06.220 - 38:11.400]
So we had a couple of papers on this. The first paper was actually with Kamal Ndousse,
[38:11.400 - 38:17.560]
who's now at Anthropic. And his paper was looking at, do RL agents benefit from social
[38:17.560 - 38:22.440]
learning by default. So if you're in an environment with another agent that's sort of constantly
[38:22.440 - 38:28.480]
showing you how to do the task correctly, do you learn any faster than an RL agent that's
[38:28.480 - 38:35.680]
in an environment by itself? And his conclusion was actually, no, they don't. So default RL
[38:35.680 - 38:40.040]
agents are actually really bad at social learning. And his work showed that if you just add this
[38:40.040 - 38:44.680]
auxiliary prediction task, like predicting your own next observation, then you're implicitly
[38:44.680 - 38:49.200]
modeling what's going on with the other agents in the environment. That makes its way into
[38:49.200 - 38:53.920]
your representation, and you're more able to learn from their behavior. And the cool
[38:53.920 - 38:58.240]
part about this is, if you actually learn the social learning behavior, like how to
[38:58.240 - 39:02.840]
learn from other agents in your environment, then you can actually generalize much
[39:02.840 - 39:07.680]
more effectively to a totally new task that you've never seen before, because you can
[39:07.680 - 39:12.560]
apply that skill of social learning to master the new task. So you sort of learned how to
[39:12.560 - 39:17.160]
socially learn. And those social learning agents end up generalizing a lot better than
[39:17.160 - 39:21.640]
agents that are trained with imitation learning or with RL and generalizing to new tasks.
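[Editor's note: a minimal sketch of the auxiliary next-observation prediction idea just described, with hypothetical names and an arbitrary weight; the actual architecture and losses in that paper are more involved.]

```python
def social_learning_loss(rl_loss, predicted_next_obs, actual_next_obs, aux_weight=0.1):
    """Add a next-observation prediction term to the usual RL loss.

    Predicting your own next observation forces the representation to model the
    other agents in view, which is what lets their behavior inform the policy.
    """
    prediction_error = ((predicted_next_obs - actual_next_obs) ** 2).mean()
    return rl_loss + aux_weight * prediction_error
```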
[39:21.640 - 39:27.120]
So I think that's quite exciting. And then PsiPhi-Learning was like a follow-up that
[39:27.120 - 39:33.080]
does the social learning in a much more effective way. So basically, it's going to be hard to
[39:33.080 - 39:37.440]
describe. It kind of uses the math of successor features. So it might
[39:37.440 - 39:44.800]
be a little hard to describe on a podcast, but the idea is you're going to model not
[39:44.800 - 39:49.760]
only your own policy, but every other agent's policy in the environment in a way that kind
[39:49.760 - 39:55.440]
of disentangles a representation of the states that they're going to experience from the
[39:55.440 - 39:59.880]
rewards that they're trying to optimize. So using this like successor representation.
[39:59.880 - 40:03.760]
And what that lets you do is you can kind of take out the part that models the other
[40:03.760 - 40:10.520]
agent's rewards and substitute your own reward function in with the other agent's policy.
[40:10.520 - 40:14.040]
And that lets you compute, hey, if I were to act like the other agent right now, if
[40:14.040 - 40:19.020]
I were to copy, you know, agent two over here, would I actually get more rewards under my
[40:19.020 - 40:25.980]
own reward function? And that lets you just flexibly choose who and what
[40:25.980 - 40:31.320]
to imitate and when. So at every time step, you can choose to rely on your own policy
[40:31.320 - 40:34.320]
or you can choose to copy someone else and you can choose who's the most appropriate
[40:34.320 - 40:40.240]
person to copy. And what we show is that that actually gets you better performance than
[40:40.240 - 40:44.560]
either purely relying on imitation learning, which is going to fail if the other agents
[40:44.560 - 40:50.680]
are doing bad stuff, or purely relying on RL, where you're going to miss out on a bunch
[40:50.680 - 40:53.920]
of useful behaviors that other agents know how to do if you're just trying to discover
[40:53.920 - 40:58.360]
everything yourself. So I think that whole direction is actually quite interesting to me.
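A rough sketch of that "who and what to imitate, and when" step, under the assumption that each agent's policy has already been summarized by learned successor features; this is an illustration of the idea, not the PsiPhi-Learning implementation.

```python
# Illustrative sketch (not the PsiPhi-Learning code): each agent's policy is
# summarized by learned successor features psi_j(s, a), and the ego agent scores
# every policy under its OWN reward weights, then follows whichever is best.
import numpy as np

def select_behavior(psi_per_agent: np.ndarray, w_ego: np.ndarray):
    """psi_per_agent: shape (n_agents, n_actions, d), with the ego agent at index 0;
    w_ego: reward weights of shape (d,). Returns (agent_idx, action)."""
    # Q_j(s, a) = psi_j(s, a) . w_ego -- value of acting like agent j, under my reward.
    q_values = psi_per_agent @ w_ego            # shape (n_agents, n_actions)
    best_per_agent = q_values.max(axis=1)       # best action value for each policy
    agent_idx = int(best_per_agent.argmax())    # whom to imitate (0 = rely on myself)
    action = int(q_values[agent_idx].argmax())  # what to do
    return agent_idx, action

# Toy usage: 3 agents (ego + 2 others), 4 actions, 8-dimensional feature vectors.
rng = np.random.default_rng(0)
psi = rng.normal(size=(3, 4, 8))
w_ego = rng.normal(size=8)
print(select_behavior(psi, w_ego))
```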
[40:58.360 - 41:04.520]
I did skim that paper, and it reminded me of an old multi-agent competition
[41:04.520 - 41:10.600]
I once did, Bomberman. And it was quite challenging to work with these other agents. And it would
[41:10.600 - 41:15.440]
have been pretty cool to be able to imitate them better. And I could imagine
[41:15.440 - 41:20.040]
that for humans, we're learning from other people all the time, probably ever
[41:20.040 - 41:25.200]
since birth. And we haven't really spent as much time thinking about that in AI.
[41:25.200 - 41:27.680]
That's something I'm really excited about. I don't know if we talked about this last
[41:27.680 - 41:33.360]
time, but this whole idea that a big component of human intelligence and what sets us apart
[41:33.360 - 41:38.680]
from other animals or, you know, other forms of intelligence is that we rely so heavily
[41:38.680 - 41:43.720]
on social learning. Like we discover almost nothing completely independently, like, look
[41:43.720 - 41:47.520]
at research, right? So much of it is reading what everyone else has done and then making
[41:47.520 - 41:53.440]
a tiny tweak on top. Right? So it's just that kind of standing on the shoulders
[41:53.440 - 41:58.080]
of giants, learning from others, that I see as really important. I also see social learning
[41:58.080 - 42:01.960]
as a path to address this sort of like truck on truck on truck problem we were talking
[42:01.960 - 42:08.040]
about earlier. Like you kind of need adaptive online generalization to solve some of these
[42:08.040 - 42:13.800]
safety-critical problems. So imagine I'm a self-driving car, and I encounter a
[42:13.800 - 42:18.280]
situation that I've never seen in my training data, which is like, there's a big flood.
[42:18.280 - 42:23.720]
And the bridge I'm trying to go under is completely flooded. Right? And if I just drive forward,
[42:23.720 - 42:30.080]
I can actually destroy my car and put the passengers in danger, right? But the other
[42:30.080 - 42:34.200]
humans on the road are probably gonna be pretty smart and realize what they should
[42:34.200 - 42:38.480]
do, or at least have a better chance of realizing it than me, the self-driving car. So maybe
[42:38.480 - 42:42.640]
at that point I should actually be relying more on social learning to take cues from
[42:42.640 - 42:48.160]
others and use that as a way to adapt to the situation, rather than just relying
[42:48.160 - 42:52.520]
on my pre-training data. And this isn't just my idea. Like I think Anca Dragan has a nice
[42:52.520 - 42:57.840]
paper on this: if a self-driving car is uncertain, it should be copying
[42:57.840 - 43:01.360]
other agents. But I think there's something really promising there.
[43:01.360 - 43:06.240]
Yeah, coming back to that truck on truck on truck, like there's no limit to what things
[43:06.240 - 43:11.720]
you might stack. I used to live in India and the stuff you would see on a truck in India
[43:11.720 - 43:16.520]
is just so unpredictable. But the way I recognize what it is, is I look at
[43:16.520 - 43:21.200]
the lower part of it. And I'm like, oh, it has truck wheels. No matter what weird
[43:21.200 - 43:26.560]
thing is on top, that is a truck. And I think the models that we have right now aren't
[43:26.560 - 43:31.680]
very good at like, ignoring distractor stuff. That's more a problem with the
[43:31.680 - 43:36.080]
function approximator. I don't think it's a real RL issue. But that's
[43:36.080 - 43:40.800]
always disappointed me that we haven't somehow got past that distractor feature problem.
[43:40.800 - 43:46.360]
That's a really insightful point. And I think, you know, there's many different things we
[43:46.360 - 43:51.040]
have to solve with AI. If I'm channeling like Josh Tenenbaum's answer to the problem you
[43:51.040 - 43:55.040]
just brought up, I mean, he would basically, well, I don't know how good of a job I can
[43:55.040 - 43:59.040]
do channeling Josh Tenenbaum, but he would say like, we need more symbolic representations
[43:59.040 - 44:03.060]
where we can generalize to understand that like, a truck with hay on
[44:03.060 - 44:08.360]
it is still fundamentally a truck. Like there's some fundamental characteristics that make
[44:08.360 - 44:12.100]
the definition of this thing. And if we're just doing like this purely
[44:12.100 - 44:16.920]
inductive deep learning thing of like, I've seen a bazillion examples of a truck, and
[44:16.920 - 44:20.520]
therefore I can recognize a truck, then if it goes out of my distribution, I can't recognize
[44:20.520 - 44:27.240]
it. I mean, maybe this is the problem of representation. And just to be very like, speculative,
[44:27.240 - 44:32.000]
I do think there's something promising about models that integrate language, speaking of
[44:32.000 - 44:36.120]
why I want to put language models into agents, actually like putting an actual language
[44:36.120 - 44:41.420]
representation into an RL agent. Because language is compositional, you get these kind
[44:41.420 - 44:44.640]
of compositional representations that could potentially help you generalize better. So
[44:44.640 - 44:50.320]
like, if you look at image and language models, you know, like CLIP, or you look at
[44:50.320 - 44:55.440]
all these image generation models, we see very strong evidence of compositionality,
[44:55.440 - 45:00.960]
right? Like you get these prompts that clearly have never been in the training data. And
[45:00.960 - 45:05.660]
they're able to generate convincing images of them. And I think that's just because language
[45:05.660 - 45:11.360]
helps you organize your representation in a way that allows you to combine these components.
[45:11.360 - 45:15.200]
So maybe like a compositional representation of a truck is more like,
[45:15.200 - 45:18.840]
it definitely has to have wheels. But it doesn't matter what it's carrying.
[45:18.840 - 45:23.960]
This reminds me of a poster I saw at ICML on concept bottleneck models.
[45:23.960 - 45:29.720]
Oh, yeah, exactly. I'm doing a concept bottleneck model for multi-agent interpretability paper.
[45:29.720 - 45:34.520]
I think we're going to release it on arXiv very soon. I'm very excited about it. But
[45:34.520 - 45:36.160]
yeah, it's a cool idea.
[45:36.160 - 45:40.440]
Great, looking forward to that too. Yeah, I just want to say it's always such a good time
[45:40.440 - 45:44.680]
chatting with you. It's really enjoyable. I always learn so much. I'm inspired. I can't
[45:44.680 - 45:49.280]
wait to see what you come up with next. Thanks so much for sharing your time with the
[45:49.280 - 46:12.400]
TalkRL audience. Thank you so much. I really appreciate being here.