WEBVTT

NOTE
This file was generated by Descript 

00:00:05.360 --> 00:00:08.780
TalkRL podcast is all
reinforcement learning all the time.

00:00:09.440 --> 00:00:11.870
Featuring brilliant guests,
both research and applied.

00:00:12.620 --> 00:00:15.590
Join the conversation on
Twitter at @TalkRLpodcast.

00:00:16.250 --> 00:00:17.720
I'm your host, Robin Chauhan.

00:00:22.875 --> 00:00:26.744
Rohin Shah is a research scientist
at DeepMind and the editor and main

00:00:26.744 --> 00:00:29.115
contributor of the Alignment Newsletter.

00:00:29.265 --> 00:00:31.544
Thanks so much for
joining us today, Rohin.

00:00:31.784 --> 00:00:32.085
Yeah.

00:00:32.114 --> 00:00:35.085
Thanks for having me, Robin.
Let's get started with, um, how

00:00:35.085 --> 00:00:38.925
do you like to describe your
area of interest? On my website,

00:00:38.925 --> 00:00:43.635
the thing that I say is that I'm
interested in sort of the long-term

00:00:43.635 --> 00:00:48.584
trajectory of AI, because it seems like
AI is becoming more and more capable

00:00:48.584 --> 00:00:53.535
over time with many people thinking that
someday we are going to get to artificial

00:00:53.535 --> 00:00:59.355
general intelligence or AGI, uh, where
AI systems will be able to replace humans

00:00:59.355 --> 00:01:01.485
at most economically valuable tasks.

00:01:01.545 --> 00:01:05.535
And that just seems like such an important
event in the history of humanity.

00:01:06.255 --> 00:01:09.015
Uh, it seems like it would
radically transform the world.

00:01:09.075 --> 00:01:13.455
And so it seems
both important and interesting to

00:01:13.455 --> 00:01:17.985
understand what is going to happen and
to see how we can make that important

00:01:17.985 --> 00:01:21.884
stuff happen better so that we get
good outcomes instead of bad outcomes.

00:01:22.155 --> 00:01:25.515
That's a sort of very general
statement, but I would say that

00:01:25.575 --> 00:01:27.375
that's a pretty big area of interest.

00:01:28.185 --> 00:01:34.605
And then I often spend most of my time
on a particular question within that,

00:01:35.055 --> 00:01:42.735
uh, which is what are the chances that
these AGI systems will be misaligned

00:01:42.735 --> 00:01:45.225
with humanity in the sense that
they will want something other than...

00:01:45.875 --> 00:01:48.785
Uh, they will want to do things other
than what humans want them to do.

00:01:48.815 --> 00:01:50.645
So A, what is the risk of that,

00:01:50.645 --> 00:01:54.725
how can it arise, and B, how can we
prevent that problem from happening?

00:01:54.935 --> 00:01:55.295
Cool.

00:01:55.295 --> 00:01:55.595
Okay.

00:01:55.595 --> 00:01:59.525
So we're going to talk, uh, about some
of this in more general terms later on.

00:01:59.555 --> 00:02:03.485
And, but first let's, let's get
a little more specific about

00:02:03.485 --> 00:02:05.165
some of your recent papers.

00:02:05.195 --> 00:02:08.705
First we have the MineRL BASALT
Competition on Learning from Human

00:02:08.705 --> 00:02:14.675
Feedback, and that was Benchmark for
Agents that Solve Almost Lifelike Tasks.

00:02:14.705 --> 00:02:19.685
So I gather this is based on MineRL,
a Minecraft-based RL environment.

00:02:19.905 --> 00:02:23.765
We saw some competitions on using
that before, but here you're doing

00:02:23.765 --> 00:02:26.165
something different with MineRL.

00:02:26.195 --> 00:02:28.835
Can you tell us about BASALT
and what's the idea here?

00:02:29.045 --> 00:02:34.895
So I think the basic idea is that a
reward function, which is a typical

00:02:35.540 --> 00:02:37.580
tool that you use in
reinforcement learning.

00:02:37.580 --> 00:02:38.360
I'm sure your list...

00:02:38.390 --> 00:02:41.660
I expect your listeners probably
know about that. A reward function,

00:02:41.660 --> 00:02:47.210
if you have to write it down by hand,
is actually a pretty not-great way of

00:02:47.210 --> 00:02:51.530
specifying what you want an AI system to
do. Like, reinforcement learning treats

00:02:51.530 --> 00:02:55.370
that reward function as a specification
of exactly what the optimal behavior

00:02:55.400 --> 00:02:59.270
is to do in every possible circumstance
that could possibly arise. When you

00:02:59.270 --> 00:03:00.380
wrote down that reward function,

00:03:00.380 --> 00:03:03.230
did you think of every possible
situation that could ever possibly

00:03:03.230 --> 00:03:06.770
arise and check whether your reward
function was specifying the correct

00:03:06.770 --> 00:03:08.120
behavior in that situation?

00:03:08.420 --> 00:03:09.800
No, you did not do that.

00:03:09.890 --> 00:03:14.270
And so we already have lots and lots
of examples of cases where people

00:03:14.270 --> 00:03:17.420
like tried to write their, write down
their reward function that they

00:03:17.420 --> 00:03:18.860
thought would lead to good behavior.

00:03:19.160 --> 00:03:21.530
And they actually ran
reinforcement learning or some

00:03:21.530 --> 00:03:24.860
other optimization algorithm with,
uh, with that reward function.

00:03:25.190 --> 00:03:28.550
And the AI found some totally
unexpected solution that did get

00:03:28.550 --> 00:03:31.820
high reward, but didn't do what
the designer wanted it to do.

00:03:32.090 --> 00:03:34.550
And so this motivates the
question, like, all right, how can we

00:03:35.280 --> 00:03:39.299
specify what we want the
agent to do without using

00:03:39.329 --> 00:03:40.950
handwritten reward functions.

00:03:41.340 --> 00:03:46.350
The general class of approaches that has
been developed in response to this is, uh,

00:03:46.380 --> 00:03:52.299
what I call learning from human feedback,
or LfHF. The idea here is that you

00:03:52.350 --> 00:03:56.459
consider some possible situations where
the AI could do things, and then you

00:03:56.459 --> 00:04:02.040
like ask a human, Hey, in these particular
situations, what should the AI system do?

00:04:02.340 --> 00:04:08.160
So you're making more local queries,
um, and, uh, local specifications, rather

00:04:08.160 --> 00:04:11.579
than having to reason about every possible
circumstance that can ever arise.

00:04:12.120 --> 00:04:16.380
And then given all of this human, this,
like, uh, given a large data set of human

00:04:16.380 --> 00:04:22.079
feedback on various situations, uh, you
can then train and, uh, train an agent to

00:04:22.079 --> 00:04:24.180
meet that specification as best as it can.

00:04:24.240 --> 00:04:27.330
So people have been developing these
techniques, and that includes things like

00:04:27.360 --> 00:04:30.240
imitation learning, where you learn
from human demonstrations of how

00:04:30.240 --> 00:04:33.900
to do the task or learning from
comparisons where humans can,

00:04:34.785 --> 00:04:39.555
uh, look at videos of two agents, two
videos of agent behavior, and then say,

00:04:39.855 --> 00:04:43.965
you know, the left one is better than
the right one, or it includes corrections

00:04:43.965 --> 00:04:46.365
where the agent does something and a human is

00:04:46.365 --> 00:04:49.065
like, at this point you should
have, like, taken this other action

00:04:49.075 --> 00:04:50.505
instead, that would have been better.

00:04:50.565 --> 00:04:54.885
These are all ways that you can
use human, uh, human feedback to train

00:04:54.885 --> 00:04:56.895
an agent to do what you want.

00:04:56.955 --> 00:05:00.255
But so people have developed a lot
of algorithms like this, but the

00:05:00.255 --> 00:05:01.815
evaluation of them is kind of ad hoc.

00:05:02.880 --> 00:05:08.340
Um, people just sort of make up some, uh,
new environment to test their method on.

00:05:08.820 --> 00:05:13.380
Uh, they don't really compare on
any like, uh, on, on a standard

00:05:13.380 --> 00:05:15.810
benchmark that everyone is using.

00:05:16.050 --> 00:05:20.640
So the big idea with BASALT was to,
um, was to change that, to actually

00:05:20.640 --> 00:05:26.490
make a benchmark that could reasonably
fairly compare all of these, uh,

00:05:26.670 --> 00:05:28.170
all of these different approaches.

00:05:28.200 --> 00:05:31.740
So we like, we wanted it to mimic
the real-world situation as much as

00:05:31.740 --> 00:05:33.960
possible. In the real-world situation,

00:05:33.960 --> 00:05:37.290
you just have, like, some notion
in your head of what task you

00:05:37.290 --> 00:05:38.640
want your AI system to do.

00:05:39.060 --> 00:05:42.360
And then you have to, you have to take
a learning from human feedback algorithm

00:05:42.360 --> 00:05:44.370
and give it the appropriate feedback.

00:05:44.700 --> 00:05:48.990
So similarly, in this benchmark, we
instantiate the agent in a Minecraft

00:05:48.990 --> 00:05:52.530
world, and then we just tell the
designer, Hey, you've got to train

00:05:52.530 --> 00:05:55.500
your agent to, say, make a waterfall.

00:05:55.530 --> 00:05:58.320
That's one of our tasks, uh,
and then take a picture of it.

00:05:58.350 --> 00:06:00.310
So we just tell the
designers, you have to...

00:06:00.830 --> 00:06:05.150
So now the designer has in their
head, like, a notion of what the agent

00:06:05.150 --> 00:06:08.180
is supposed to do, but there's no
formal specification, no reward

00:06:08.180 --> 00:06:09.230
function, nothing like that.

00:06:09.440 --> 00:06:11.030
So they can then do whatever they want.

00:06:11.030 --> 00:06:14.480
They can write down a reward function by
hand, if that seems like an approach they

00:06:14.480 --> 00:06:16.490
want to do. They can use demonstrations.

00:06:16.490 --> 00:06:18.650
They can use preferences, they
can use corrections, they can

00:06:18.950 --> 00:06:20.510
do active learning and so on.

00:06:20.990 --> 00:06:24.920
Uh, but their job is to like make an
agent that actually does the task.

00:06:25.040 --> 00:06:29.270
Ideally they want to maximize,
uh, performance and minimize costs

00:06:29.300 --> 00:06:32.870
both in terms of compute and in
terms of how much human feedback

00:06:33.140 --> 00:06:34.850
it takes to train the agent.

00:06:35.210 --> 00:06:39.230
So I watched, uh, the presentations
of the top two solutions and

00:06:39.260 --> 00:06:39.740
it seemed like they were

00:06:40.500 --> 00:06:41.610
very different approaches.

00:06:42.270 --> 00:06:46.350
Uh, the first one, Kairos, I would
say seemed like a lot of hand

00:06:46.350 --> 00:06:51.300
engineering, and I think they used 80,000
plus labeled images and built some

00:06:51.330 --> 00:06:52.830
very specific components for this.

00:06:53.070 --> 00:06:56.400
They kind of decomposed the problem, which
I think is a very sensible thing to do.

00:06:56.760 --> 00:07:00.180
But then also, uh, the
second one was Obsidian.

00:07:00.270 --> 00:07:04.770
They produced this inverse Q-learning
method, a new method, which seemed

00:07:04.770 --> 00:07:06.720
like a more general theoretical solution.

00:07:06.780 --> 00:07:09.600
I just wonder if you have any comments
on the different types of solutions

00:07:09.630 --> 00:07:14.250
that came out of this. Are those kind of the
two main classes that you saw, or did

00:07:14.250 --> 00:07:15.990
any classes of solutions surprise you?

00:07:16.230 --> 00:07:18.540
Yeah, I think that's basically right.

00:07:18.720 --> 00:07:22.380
I don't think they were
particularly surprising, in that

00:07:23.065 --> 00:07:27.625
we spent a lot of time making sure
that the tasks can't trivially be

00:07:27.625 --> 00:07:31.675
solved by just doing, um, hand
engineering, like classical programming.

00:07:31.735 --> 00:07:35.725
So even, even the top team
did rely on a behavior-cloned

00:07:35.755 --> 00:07:39.565
navigation policy, uh, that used
a neural network. But it's true,

00:07:39.565 --> 00:07:43.285
they did a bunch of engineering on
top of that, which I think, according

00:07:43.285 --> 00:07:46.345
to me, is just a benefit of this setup.

00:07:46.375 --> 00:07:51.085
It shows you like, Hey, if you're just
actually trying to get good performance,

00:07:51.115 --> 00:07:53.005
do you train a neural network end to end?

00:07:53.005 --> 00:07:57.085
Or do you put in a, or do you put
in domain knowledge and how much

00:07:57.085 --> 00:08:00.625
domain knowledge do you put in
and uh, how, how do you do it?

00:08:00.985 --> 00:08:04.495
And it turns out that in this particular
case, the domain knowledge, well, they

00:08:04.495 --> 00:08:08.905
did end up getting first, but Team
Obsidian was quite close behind.

00:08:08.905 --> 00:08:12.145
So I would say that the two experiences
were actually pretty comparable.

00:08:12.205 --> 00:08:15.895
And I do agree that I would say one is
more of an engineering-y solution,

00:08:15.895 --> 00:08:16.825
and the other one is more of a

00:08:17.580 --> 00:08:18.990
researchy solution.

00:08:19.080 --> 00:08:23.790
So it seems to me like the goals here were
things that could be modeled and learned.

00:08:23.880 --> 00:08:26.880
Like it seems feasible to learn the
concept or to train a network, to

00:08:26.880 --> 00:08:29.790
learn the concept of looking at a
waterfall if you had enough labels.

00:08:30.450 --> 00:08:32.220
And I guess that's what
some contestants did.

00:08:32.250 --> 00:08:36.750
But do you have any comments on if
we were to, to want goals that are

00:08:36.780 --> 00:08:39.030
harder to model than these things?

00:08:39.090 --> 00:08:39.490
I, I'm...

00:08:39.490 --> 00:08:41.940
I was trying to think of examples
and came up with, like, irony

00:08:41.970 --> 00:08:44.580
or dance choreography scoring.

00:08:44.580 --> 00:08:47.190
Like how would you even begin
to, to model those things?

00:08:47.190 --> 00:08:51.900
Do we have to just continue improving
our modeling toolkit so that we can make

00:08:51.900 --> 00:08:54.300
models of these, uh, reward functions?

00:08:54.300 --> 00:08:55.860
Or is there some other strategy?

00:08:56.040 --> 00:08:59.550
Uh, it depends exactly what you mean
by improving the modeling toolkit,

00:08:59.610 --> 00:09:03.450
but basically I think the answer is
yes, but you know, the way that we

00:09:03.450 --> 00:09:07.350
can improve our modeling toolkit, it
may not look like explicit modeling.

00:09:07.680 --> 00:09:11.280
So for example, for irony, I
think you could probably get

00:09:11.730 --> 00:09:15.300
a decent, well, maybe not.

00:09:16.305 --> 00:09:21.224
Uh, it's plausible that you could get
a decent, uh, reward model out of a

00:09:21.224 --> 00:09:25.155
large language model that like does
in fact have the concept of irony.

00:09:25.785 --> 00:09:29.685
Um, if I remember correctly, large
language models are not actually that

00:09:29.685 --> 00:09:32.385
great at humor, so I'm not sure
if they have the concept of irony,

00:09:33.015 --> 00:09:37.064
but I wouldn't be surprised if
further scaling did in fact give

00:09:37.064 --> 00:09:41.505
them a concept of irony, such that we
could use, uh, we could then use them

00:09:41.505 --> 00:09:44.775
to have rewards that involve irony.

00:09:45.084 --> 00:09:47.685
I think that's the same sort
of thing as like waterfall.

00:09:47.714 --> 00:09:52.785
Like I agree that we can learn
the concept of a waterfall,

00:09:53.055 --> 00:09:54.435
but it's not a trivial concept.

00:09:54.435 --> 00:09:57.015
If you asked me to program it
by hand, I would have no idea.

00:09:57.045 --> 00:09:58.155
Like the only input.

00:09:58.185 --> 00:09:58.425
Yeah.

00:09:58.454 --> 00:09:59.954
You get pixels as an input.

00:10:00.464 --> 00:10:02.775
If you're like, here's
a rectangle of pixels.

00:10:03.720 --> 00:10:07.440
Please write a program that
detects the waterfall on there.

00:10:07.440 --> 00:10:09.480
I'm like, oh God, that
sounds really difficult.

00:10:09.480 --> 00:10:13.320
I don't know how to do it, but we
can, if we apply machine learning,

00:10:13.320 --> 00:10:16.650
then like turns out that we can
recognize these sorts of concepts.

00:10:17.430 --> 00:10:21.570
And similarly, I think it's not
going to be like, I definitely

00:10:21.570 --> 00:10:26.610
couldn't write the program, uh,
directly, that can recognize irony.

00:10:26.640 --> 00:10:30.900
But if you do machine learning, if
you use machine learning to model all

00:10:30.900 --> 00:10:34.200
the texts on the internet, uh, the
resulting model does in fact have a

00:10:34.200 --> 00:10:37.530
concept of irony that you can then
try to use in your reward functions.

00:10:37.770 --> 00:10:40.530
And then there's a Twitter
thread related to disinformation.

00:10:40.560 --> 00:10:44.820
And I shared a line from your paper
where you said learning from human

00:10:44.820 --> 00:10:48.120
feedback offers the alternative
of training recommender systems to

00:10:48.120 --> 00:10:51.180
promote content that humans would
predict would improve the user's

00:10:51.180 --> 00:10:53.985
well-being. And I thought that
was a really cool insight.

00:10:54.074 --> 00:10:56.715
Is that something you're interested
in pursuing or are you, you

00:10:56.715 --> 00:10:58.694
see that, uh, being a thing?

00:10:58.875 --> 00:11:02.205
I don't know whether or not it
is actually feasible currently.

00:11:02.265 --> 00:11:06.824
Uh, one thing that needs to be true of
recommender systems is they need to be

00:11:06.824 --> 00:11:08.835
cheap to run because they are being run

00:11:08.895 --> 00:11:13.485
so, so many times every day. I
don't actually know this for a fact.

00:11:13.485 --> 00:11:17.265
I haven't actually done any Fermi
estimates, but my guess would be that

00:11:17.265 --> 00:11:23.715
if you try to actually run GPT-3
on say, um, Facebook posts in order to

00:11:23.715 --> 00:11:27.585
then, uh, to then rank them, I think
that would just be, that would probably

00:11:27.585 --> 00:11:29.685
be prohibitively expensive for Facebook.

00:11:29.745 --> 00:11:34.095
So there's a question of like, can
you get a model that actually makes

00:11:34.095 --> 00:11:39.165
reasonable predictions about the user's
well-being that can also be run

00:11:39.165 --> 00:11:44.865
cheaply enough, that it's not a huge, uh,
expensive cost to whoever is implementing

00:11:44.865 --> 00:11:47.204
the recommendation system? And also,

00:11:47.819 --> 00:11:51.810
does it take a, like, sufficiently small
amount of human feedback that you aren't

00:11:51.810 --> 00:11:55.680
bottlenecked on cost, uh, from, from
the humans providing the feedback?

00:11:55.740 --> 00:12:00.270
And also do we have algorithms
that are good enough to, uh, train

00:12:00.270 --> 00:12:01.710
recommender systems this way?

00:12:01.770 --> 00:12:03.210
I think the answer is plausibly

00:12:03.210 --> 00:12:03.449
yes

00:12:03.449 --> 00:12:04.350
to all of these.

00:12:04.439 --> 00:12:08.130
Uh, I haven't, it's just that I haven't
actually checked myself nor have I even

00:12:08.130 --> 00:12:10.530
like, tried to do any feasibility studies.

00:12:10.860 --> 00:12:11.220
I think

00:12:11.935 --> 00:12:14.815
the line that you're quoting
was more about like, okay,

00:12:14.845 --> 00:12:16.255
why do this research at all?

00:12:16.255 --> 00:12:19.405
And I'm like, well, someday in the
future, this should be possible.

00:12:19.405 --> 00:12:21.895
And I stick by that, like someday
in the future, things will

00:12:21.895 --> 00:12:23.455
become significantly cheaper.

00:12:23.665 --> 00:12:24.865
Learning from human feedback

00:12:24.865 --> 00:12:26.905
algorithms will be a lot better, and so on.

00:12:26.905 --> 00:12:30.835
And then like, it will just totally
make sense to use recommender systems

00:12:30.955 --> 00:12:34.285
trained with human feedback, unless we
found something even better by then.

00:12:34.375 --> 00:12:36.685
It's just not obvious to me
that it is the right choice

00:12:36.685 --> 00:12:37.255
currently.

00:12:37.375 --> 00:12:41.395
I look forward to that and, uh, uh,
I'm really concerned, like many people

00:12:41.395 --> 00:12:45.355
are about the disinformation and the
divisiveness, uh, of social media.

00:12:45.355 --> 00:12:46.375
So that sounds great.

00:12:46.375 --> 00:12:49.015
I think everyone's used to
very cheap reward functions,

00:12:49.755 --> 00:12:51.135
uh, pretty much across the board.

00:12:51.165 --> 00:12:54.405
So I guess what you're kind of pointing
to with these reward functions is

00:12:54.405 --> 00:12:58.425
potentially more expensive to evaluate
reward functions, which maybe

00:12:58.645 --> 00:13:02.475
hasn't been a common thing until now.
Both more expensive reward functions,

00:13:02.475 --> 00:13:06.675
and also the model that you train with
that reward function might be,

00:13:06.735 --> 00:13:10.725
might still be very expensive to do
inference with. Presumably recommender

00:13:10.725 --> 00:13:15.015
systems right now are like compute
these, uh, you know, run a few linear

00:13:15.015 --> 00:13:19.755
time algorithms on the post in order
to like compute a like a hundred or a

00:13:19.755 --> 00:13:23.655
hundred thousand features, then do a dot
product with a hundred thousand weights.

00:13:23.895 --> 00:13:26.085
See which, and then like
rank things in order

00:13:26.880 --> 00:13:27.840
by those numbers.

00:13:28.140 --> 00:13:31.860
And that's like, you know, maybe a
million flops or something, which is

00:13:31.860 --> 00:13:36.870
a tiny, tiny number of flops, whereas
like a forward pass through GPT-3 is

00:13:36.870 --> 00:13:39.810
more, is several hundred billion flops.

00:13:40.320 --> 00:13:46.500
Uh, so that's a, like, uh, 10 to
the five X increase in the amount

00:13:46.500 --> 00:13:47.970
of computation you have to do.

00:13:48.360 --> 00:13:51.690
Uh, actually, no, that's one forward
pass through GPT-3, but there

00:13:51.690 --> 00:13:53.940
are many words in a Facebook post.

00:13:53.970 --> 00:13:57.810
So multiply the 10 to the five by the
number of words in the Facebook post.

00:13:58.320 --> 00:14:02.070
Uh, and now we're at like maybe more
like 10 to the seven times cost increase

00:14:02.130 --> 00:14:05.430
just to do inference, even assuming
you were, you had successfully trained a

00:14:05.430 --> 00:14:07.520
model that could do recommendations.

00:14:07.560 --> 00:14:07.620
Yeah.

00:14:07.650 --> 00:14:12.060
And the end result may be lowering
engagement for the benefit of less

00:14:12.060 --> 00:14:15.750
divisive content, which is maybe not
in the interest of the, of the social

00:14:15.750 --> 00:14:16.860
media companies in the first place.

00:14:17.040 --> 00:14:17.280
Yeah.

00:14:17.310 --> 00:14:20.550
There's also a question of, I
agree, whether the companies will

00:14:20.550 --> 00:14:22.260
want to do this, but I think if,

00:14:22.949 --> 00:14:27.630
I don't know, if we, like, showed that
this was feasible, uh, that would give

00:14:27.810 --> 00:14:32.760
regulators a much more, like, I
think a common problem with regulation

00:14:32.939 --> 00:14:37.770
is that you don't know what to regulate
because there's no alternative on the

00:14:37.770 --> 00:14:39.959
table for what people are already doing.

00:14:40.020 --> 00:14:43.709
And if we were to come to them and
say, look, there's this learning

00:14:43.709 --> 00:14:47.280
from human feedback approach,
we've like, calculated it out.

00:14:47.370 --> 00:14:51.689
This should, this should only increase
costs by two X or maybe, uh, uh,

00:14:51.719 --> 00:14:54.689
yeah, this should, maybe this is
like just the same amount of costs.

00:14:55.260 --> 00:14:59.099
Um, and it shouldn't be too hard for
companies to actually train such a model.

00:14:59.189 --> 00:15:00.660
They've already got
all the infrastructure.

00:15:00.660 --> 00:15:03.359
It should barely be like, I
don't know, a hundred thousand

00:15:03.359 --> 00:15:04.949
dollars to train the model once.

00:15:05.310 --> 00:15:09.510
And like, if you like lay out that
case, I think it's much, I would

00:15:09.510 --> 00:15:12.900
hope at least that it would be a
lot easier for the regulators to be

00:15:12.900 --> 00:15:14.459
like, yes, everyone, you must train

00:15:15.225 --> 00:15:19.905
recommender systems to be optimizing
for what humans would predict is good, as

00:15:19.905 --> 00:15:24.074
opposed to whatever you're doing right
now. That could really change the game.

00:15:24.074 --> 00:15:27.795
And then the bots or the divisive
posters are now trying to game that,

00:15:27.824 --> 00:15:30.675
that new reward function and then
probably find some different strategies.

00:15:31.485 --> 00:15:33.944
Yeah, you might, you might
imagine that you have to like

00:15:33.944 --> 00:15:35.835
keep retraining in order to

00:15:37.095 --> 00:15:40.365
deal with new strategies that are,
uh, that people are finding in

00:15:40.365 --> 00:15:42.915
response. Like, we can do this.

00:15:43.245 --> 00:15:46.905
I don't have any special information
about this from working at

00:15:46.905 --> 00:15:51.555
Google, but I'm told that Google is
actually like pretty good at defeating

00:15:51.855 --> 00:15:55.725
spammers, for example. Like,
in fact, my Gmail spam filter works

00:15:55.845 --> 00:15:59.475
quite well as far as I can tell,
uh, despite the fact that spammers are,

00:16:00.245 --> 00:16:03.305
uh, constantly trying to evade
it. And, well, hopefully we

00:16:03.305 --> 00:16:04.415
could do the same thing here.

00:16:04.505 --> 00:16:04.834
Cool.

00:16:04.865 --> 00:16:05.285
Okay.

00:16:05.314 --> 00:16:07.745
Let's move on to your next
paper, Preferences Implicit

00:16:07.745 --> 00:16:08.795
in the State of the World.

00:16:08.885 --> 00:16:12.454
I understand this paper is closely
related to your dissertation.

00:16:12.574 --> 00:16:14.885
We'll link to your dissertation
in the show notes as well.

00:16:14.944 --> 00:16:18.694
I'm just going to read a quote and I
love how you distilled this key insight.

00:16:18.694 --> 00:16:22.204
You said the key insight of this paper
is that when a robot is deployed in an

00:16:22.204 --> 00:16:25.055
environment that humans have been acting
in, the state of the environment is

00:16:25.055 --> 00:16:26.885
already optimized for what humans want.

00:16:26.915 --> 00:16:31.295
Can you, um, tell us the general idea here
and what do you mean by that statement?

00:16:31.535 --> 00:16:37.985
Maybe like put yourself in the
position of a robot or an AI system

00:16:37.985 --> 00:16:39.545
that knows nothing about the world.

00:16:40.175 --> 00:16:41.314
Maybe it's like, all right, sorry.

00:16:41.964 --> 00:16:44.185
Like it knows the laws
of physics or something.

00:16:44.185 --> 00:16:47.935
It knows that like there's gravity,
it knows that, like, there are solids,

00:16:47.935 --> 00:16:50.964
liquids, and gases.
Liquids, uh, tend to, you know,

00:16:50.964 --> 00:16:54.145
take the shape of the container
that they're in, stuff like that.

00:16:54.655 --> 00:16:58.915
Um, but it doesn't know anything
about humans or maybe like, you know,

00:16:58.915 --> 00:17:04.045
it was, it was, we imagined that
it's sort of like off in other parts

00:17:04.045 --> 00:17:06.504
of the solar system or whatever,
and it hasn't really seen it yet.

00:17:07.135 --> 00:17:10.555
And then it comes to Earth and
it's like, whoa, Earth has these

00:17:10.555 --> 00:17:12.565
like super regular structures.

00:17:12.805 --> 00:17:18.565
There's like these like very,
uh, cuboidal, um, structures with

00:17:18.595 --> 00:17:21.115
glass panes at regular intervals.

00:17:21.355 --> 00:17:26.095
Um, that often seem to have lights inside
of them, even though, even at night

00:17:26.095 --> 00:17:30.295
when there isn't light outside of, uh,
outside of them, this is kind of shocking.

00:17:30.295 --> 00:17:32.935
You, you wouldn't expect this
from a random configuration of

00:17:32.935 --> 00:17:35.335
atoms, um, or something like that.

00:17:36.825 --> 00:17:39.225
There is some sense in which
the order in the world that,

00:17:39.575 --> 00:17:43.305
that we humans have imposed upon
it is, like, extremely surprising

00:17:43.754 --> 00:17:48.254
Um, if you don't know about humans
already being there and what they want.

00:17:48.375 --> 00:17:51.764
So then you can imagine, uh,
asking your AI system, Hey,

00:17:51.764 --> 00:17:53.295
you see a lot of order here.

00:17:53.835 --> 00:17:58.335
Uh, can you like figure out an
explanation for why this order is there?

00:17:58.815 --> 00:18:01.155
Um, perhaps, uh, and then you...

00:18:01.950 --> 00:18:05.130
and maybe you get, and then you give
it the hint of like, look, it's, we're

00:18:05.130 --> 00:18:08.610
going to give you the hint that it was
created by somebody optimizing the world.

00:18:08.850 --> 00:18:11.460
What sort of things might
they have been optimizing for?

00:18:11.550 --> 00:18:15.660
And then you, like, you know, you look
around and you see that, like, oh, liquids,

00:18:15.690 --> 00:18:17.640
they tend to be in these, like, glasses.

00:18:17.670 --> 00:18:21.270
It would be really easy to tip over the
glasses and have all the liquid spill out.

00:18:21.420 --> 00:18:23.400
But like that mostly doesn't happen.

00:18:23.610 --> 00:18:26.520
So people must want to have
their liquids in glasses.

00:18:26.520 --> 00:18:27.810
And probably I shouldn't knock them over.

00:18:28.745 --> 00:18:29.285
Vases,

00:18:29.285 --> 00:18:30.905
they're, like, kind of fragile.

00:18:30.935 --> 00:18:36.275
You could like easily just like move them
a little bit to the, to the left or right.

00:18:36.305 --> 00:18:38.075
And they would like fall down and break.

00:18:38.465 --> 00:18:41.645
Um, and once they are broken,
you can't then reassemble them.

00:18:42.035 --> 00:18:44.015
But nonetheless, they're still not broken.

00:18:44.225 --> 00:18:47.375
So like probably someone like
actively doesn't want them to break

00:18:47.465 --> 00:18:49.325
and is leaving them on the table.

00:18:49.355 --> 00:18:49.565
Yeah.

00:18:49.595 --> 00:18:51.555
So really I would say the idea is,

00:18:52.260 --> 00:18:55.770
the order in the world did not
just happen by random chance.

00:18:55.860 --> 00:18:58.139
It happened because of human optimization.

00:18:58.199 --> 00:19:01.560
And so from looking at the order of
the world, you can figure out what

00:19:01.560 --> 00:19:03.240
the humans were optimizing for.

00:19:03.360 --> 00:19:03.629
Yeah.

00:19:03.659 --> 00:19:05.490
That's the basic idea
underlying the paper.

00:19:05.550 --> 00:19:09.030
So there's some kind of relationship
here to inverse reinforcement

00:19:09.030 --> 00:19:12.899
learning where we're trying to
recover the reward function from,

00:19:12.990 --> 00:19:14.729
from observing an agent's behavior.

00:19:14.729 --> 00:19:16.979
But here you're not observing
the agent's behavior.

00:19:16.979 --> 00:19:17.189
Right.

00:19:17.189 --> 00:19:18.870
So it's not quite inverse RL.

00:19:19.169 --> 00:19:21.449
How would you describe the
relationship between what you're

00:19:21.449 --> 00:19:24.300
doing here and standard inverse RL?

00:19:25.169 --> 00:19:25.469
Yeah.

00:19:25.469 --> 00:19:30.840
So in terms of the formalism, um,
inverse RL says that you

00:19:30.840 --> 00:19:33.060
observe the human's behavior over time.

00:19:33.060 --> 00:19:35.429
So that's the sequence of
states and actions that the

00:19:35.429 --> 00:19:37.229
human took within those states.

00:19:37.770 --> 00:19:39.570
Whereas we're just saying no, no, no.

00:19:39.959 --> 00:19:41.399
We're not watching the human's behavior.

00:19:41.399 --> 00:19:44.820
We're just going to see only the,
the state, the current state.

00:19:44.850 --> 00:19:46.290
That's the only thing that we see.

00:19:46.500 --> 00:19:48.899
And so you can think of this
in the framework of inverse

00:19:48.899 --> 00:19:49.620
reinforcement learning.

00:19:49.620 --> 00:19:50.370
You can think of this as

00:19:51.180 --> 00:19:55.889
either the final state of the
trajectory or a state sampled from

00:19:55.889 --> 00:20:00.690
the stationary distribution, from an
infinitely long trajectory, uh, either

00:20:00.690 --> 00:20:04.350
of those would be reasonable to do, but
you're only observing that one thing

00:20:04.379 --> 00:20:08.070
instead of observing the entire state
action history, um, starting from a

00:20:08.070 --> 00:20:09.899
random initialization of the world.

00:20:09.930 --> 00:20:13.020
But other than that, you just make
that one change and then you run

00:20:13.020 --> 00:20:15.780
through all the same math and you
get a slightly different algorithm.

00:20:15.780 --> 00:20:19.800
And that's basically what we,
uh, did to, to make this paper.
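
NOTE
Editor's sketch of the single-state inference described above.
This is a toy illustration, not the paper's algorithm. The two-state
vase world, the three candidate rewards, the one-step Boltzmann
policy, and the beta and horizon values are all assumptions made up
for this note. It shows that seeing only an intact vase, plus known
dynamics, is enough to favor the hypothesis that the human wants
the vase intact.
```python
import math
# Hypothetical toy world: state 0 = vase intact, state 1 = vase broken.
# Known dynamics: "careless" breaks an intact vase; broken stays broken.
NEXT_STATE = {(0, "careful"): 0, (0, "careless"): 1,
              (1, "careful"): 1, (1, "careless"): 1}
ACTIONS = ("careful", "careless")
# Candidate reward functions, indexed by state (assumed for illustration).
REWARDS = {"wants_intact": (1.0, 0.0),
           "indifferent": (0.0, 0.0),
           "wants_broken": (0.0, 1.0)}
def policy(reward, state, beta=3.0):
    """Noisily rational choice: softmax over the reward of the next state."""
    weights = [math.exp(beta * reward[NEXT_STATE[(state, a)]]) for a in ACTIONS]
    total = sum(weights)
    return [w / total for w in weights]
def final_state_likelihood(reward, observed, horizon=10):
    """P(state after `horizon` steps == observed), starting from an intact vase."""
    dist = [1.0, 0.0]
    for _ in range(horizon):
        new_dist = [0.0, 0.0]
        for state, p_state in enumerate(dist):
            for action, p_action in zip(ACTIONS, policy(reward, state)):
                new_dist[NEXT_STATE[(state, action)]] += p_state * p_action
        dist = new_dist
    return dist[observed]
def posterior(observed, horizon=10):
    """Uniform prior over REWARDS, updated on one observed state."""
    likelihood = {name: final_state_likelihood(r, observed, horizon)
                  for name, r in REWARDS.items()}
    norm = sum(likelihood.values())
    return {name: value / norm for name, value in likelihood.items()}
print(posterior(observed=0))  # most of the mass lands on "wants_intact"
```
The irreversibility of breaking is what makes an intact vase
informative: an indifferent human would almost surely have broken it.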

00:20:20.010 --> 00:20:23.460
So with this approach, I guess
potentially you're opening up a huge

00:20:23.460 --> 00:20:26.970
amount of kind of unsupervised learning
just from observing what's happening.

00:20:27.030 --> 00:20:30.210
And you can kind of almost
do it instantaneously in

00:20:30.210 --> 00:20:31.350
terms of observation, right?

00:20:31.350 --> 00:20:33.960
You don't have to watch billions
of humans for thousands of years.

00:20:34.050 --> 00:20:34.409
Yep.

00:20:34.500 --> 00:20:35.159
That's right.

00:20:35.310 --> 00:20:40.889
Um, it does require that your
AI system knows like the laws of

00:20:40.889 --> 00:20:45.420
physics or as we would call it
in RL, the transition dynamics.

00:20:46.305 --> 00:20:50.175
Or, well, it needs to either know
that, or have some sort of data from

00:20:50.175 --> 00:20:54.225
which it can learn that because if
you're just, if you just look at the

00:20:54.225 --> 00:20:58.035
state of the world and you have no
idea of what the laws of physics are

00:20:58.035 --> 00:21:01.755
or how, how things work at all, you're
not going to be able to figure out

00:21:01.905 --> 00:21:04.035
how it was optimized into this state.

00:21:04.095 --> 00:21:07.575
Like if you want to infer that humans
don't want their vases to be broken.

00:21:08.265 --> 00:21:13.095
It's an important fact, in order to
infer that, that if a vase is broken,

00:21:13.095 --> 00:21:15.105
it's very hard to put it back together.

00:21:15.465 --> 00:21:20.145
And that is a fact about the transition
dynamics, which we assumed by fiat

00:21:20.145 --> 00:21:22.065
that the, that the agent knows.

00:21:22.275 --> 00:21:25.005
But yes, if you had

00:21:25.770 --> 00:21:28.590
enough data, self-supervised
learning could teach

00:21:28.590 --> 00:21:30.150
the agent a bunch of dynamics.

00:21:30.720 --> 00:21:34.800
And also then, and then like also
the agent could go about, go around

00:21:34.800 --> 00:21:39.000
looking at the state of the world,
in theory, it could then, uh, infer

00:21:39.000 --> 00:21:40.530
a lot about what humans care about.

00:21:40.740 --> 00:21:47.400
So I very clearly remember meeting you
at NeurIPS, uh, 2018 Deep RL Workshop

00:21:47.430 --> 00:21:49.020
in Montreal, the poster session.

00:21:49.110 --> 00:21:52.980
And I remember your poster on
this, um, and you showed a dining

00:21:52.980 --> 00:21:54.690
room that was all nicely arranged.

00:21:54.750 --> 00:21:57.960
And, uh, and, and you were saying
how a robot could learn from

00:21:57.960 --> 00:21:59.460
how things are arranged.

00:21:59.460 --> 00:22:03.690
And, and I just want to say, I'll say
this publicly, I didn't understand,

00:22:03.780 --> 00:22:07.980
uh, at that point what, what you
meant or why that could be important.

00:22:08.040 --> 00:22:09.330
Um, and it was so different.

00:22:09.330 --> 00:22:11.640
Your angle was just so different
than everything else that was

00:22:11.640 --> 00:22:13.320
being presented, um, that day.

00:22:13.440 --> 00:22:14.490
And I really didn't get it.

00:22:14.490 --> 00:22:16.800
So I, I, and I'll own that.

00:22:16.860 --> 00:22:18.060
Uh, it was, it was my loss.

00:22:18.090 --> 00:22:19.590
And, uh, so thanks for your patience.

00:22:19.590 --> 00:22:22.350
It only took me three and a half years
or something to get to come around.

00:22:24.065 --> 00:22:24.515
Yeah.

00:22:24.725 --> 00:22:27.065
Uh, sorry I didn't communicate

00:22:27.065 --> 00:22:30.245
it clearer, I suppose. I
don't think it was, no, I don't

00:22:30.245 --> 00:22:31.865
think it was at all on you.

00:22:31.955 --> 00:22:36.425
Um, but I, uh, maybe I just
lacked the background to see why

00:22:36.485 --> 00:22:39.455
I like to understand, um, let,
let me, let me put it this way.

00:22:39.455 --> 00:22:44.015
Like how often do you find people who
have some technical understanding of

00:22:44.015 --> 00:22:49.235
AI, but still, maybe don't appreciate,
uh, some of this line of work, including

00:22:49.235 --> 00:22:50.765
alignment and things like that.

00:22:51.095 --> 00:22:52.175
Is that a common thing?

00:22:52.325 --> 00:22:54.755
I think that's reasonably common.

00:22:54.965 --> 00:22:56.195
And what do you attribute that to?

00:22:56.195 --> 00:22:58.745
Like what's going on there
and is that changing at all?

00:22:58.845 --> 00:23:00.755
Or I think it's pretty interesting.

00:23:00.965 --> 00:23:04.595
I don't think that these people
would say that like, oh, this is

00:23:04.595 --> 00:23:06.925
a boring paper at all, or this is

00:23:07.755 --> 00:23:09.315
an incompetent paper.

00:23:09.405 --> 00:23:14.475
I think they would say yes, the person
who wrote this paper is in fact, has

00:23:14.475 --> 00:23:19.905
in fact done something impressive by
the standards of like, was like, you

00:23:19.905 --> 00:23:23.535
know, did you need to be intelligent and
like, do good math in order to do this?

00:23:23.865 --> 00:23:28.485
I think they are more likely to say
something like, okay, but, so what,

00:23:28.725 --> 00:23:30.645
and that's not entirely unfair.

00:23:30.645 --> 00:23:34.095
Like, you know, it was the
deep RL workshop and here I

00:23:34.095 --> 00:23:36.075
am talking about like, oh yes.

00:23:36.105 --> 00:23:38.415
Imagine that you, like,
know all the dynamics.

00:23:38.475 --> 00:23:41.985
And also you're like only getting
to look at the state of the world.

00:23:42.255 --> 00:23:45.615
Uh, and then you like, think about
how vases can be broken, but then

00:23:45.615 --> 00:23:46.935
they can't be put back together.

00:23:46.935 --> 00:23:49.275
And voila, you've learned that
humans don't like to break vases.

00:23:49.485 --> 00:23:50.175
There's just something

00:23:50.805 --> 00:23:54.585
so different from all of the things
that RL usually focuses on.

00:23:54.735 --> 00:23:55.035
Right?

00:23:55.065 --> 00:23:56.865
Like it doesn't have any
of the buzzwords, right?

00:23:56.865 --> 00:24:00.075
There's no like, you know, deep
learning, there's no exploration.

00:24:00.105 --> 00:24:03.015
There's no, um, uh, there's
no catastrophic forgetting

00:24:03.015 --> 00:24:04.155
no, nothing like that.

00:24:04.275 --> 00:24:08.025
And to be clear, all of those seem
like important things to focus on.

00:24:08.025 --> 00:24:11.475
And I think many of the people who were
at that workshop were focusing on

00:24:11.475 --> 00:24:13.575
those and are doing good work on them.

00:24:13.665 --> 00:24:15.765
Uh, and I'm just doing
something completely different.

00:24:15.765 --> 00:24:19.215
That's like, not all that interesting
to them because they want to

00:24:19.605 --> 00:24:21.135
work on reinforcement learning.

00:24:21.285 --> 00:24:25.875
I think they're making a mistake
in the sense that like AI alignment

00:24:25.875 --> 00:24:29.445
is important and more people should
work on it, but I don't think

00:24:29.445 --> 00:24:31.815
they're making a mistake in that

00:24:32.085 --> 00:24:35.625
they're probably correct about what
does and doesn't interest them.

00:24:35.775 --> 00:24:36.015
Okay.

00:24:36.015 --> 00:24:38.595
Just so I'm clear, I was not
critiquing your math or the

00:24:38.595 --> 00:24:40.125
value of anything you were doing.

00:24:40.205 --> 00:24:43.995
It was just my ability to understand
the importance of this type of work.

00:24:44.085 --> 00:24:46.305
And I didn't think you were. Okay.

00:24:47.300 --> 00:24:51.170
So I will say that that day, when I
first encountered your, your poster,

00:24:51.530 --> 00:24:53.240
I was really hung up on edge cases.

00:24:53.840 --> 00:24:58.399
Uh, like, um, you know, there's in the
world, the robot might observe there's

00:24:58.399 --> 00:25:00.110
hunger and there's traffic accidents.

00:25:00.200 --> 00:25:03.170
And there's things that things
like, like not everything is perfect

00:25:03.170 --> 00:25:06.320
and we don't want the robot to
replicate all these, all these flaws

00:25:06.320 --> 00:25:07.940
in the world or the dining room.

00:25:07.940 --> 00:25:10.280
There might be, you know,
dirty dishes or something.

00:25:10.760 --> 00:25:13.909
And so the world is clearly not
exactly how we want it to be.

00:25:13.909 --> 00:25:19.250
So how, how is that, is that an issue or
is that, is that, uh, is that not an issue

00:25:19.250 --> 00:25:20.780
or is that just not the point of this?

00:25:21.110 --> 00:25:22.610
Uh, not, not addressed here?

00:25:22.790 --> 00:25:24.500
It depends a little bit.

00:25:24.590 --> 00:25:28.430
I think in many cases it's not
an issue if you imagined that the

00:25:28.430 --> 00:25:30.320
robot somehow sees the entire world.

00:25:30.409 --> 00:25:32.480
Um, so for example,
you mentioned hunger.

00:25:32.540 --> 00:25:39.290
Uh, I think the robot would notice
that we do in fact spend a lot of

00:25:39.290 --> 00:25:43.760
effort, making sure that at least
a large number of people don't go hungry.

00:25:44.730 --> 00:25:49.200
We've built these giant vehicles,
both trucks and cargo ships, and

00:25:49.200 --> 00:25:53.280
so on, that move food around in a
way that seems at least somewhat

00:25:53.280 --> 00:25:57.870
optimized to get food to people who
like that food and want to eat it.

00:25:58.140 --> 00:26:00.180
So there's lots of
effort being put into it.

00:26:00.360 --> 00:26:03.270
There's not like the maximum
amount of effort being put in.

00:26:04.080 --> 00:26:07.380
Which I think reflects the fact
that there are things that we

00:26:07.380 --> 00:26:08.760
care about other than food.

00:26:08.910 --> 00:26:11.640
So, so I do think it would
be like, all right, humans

00:26:11.640 --> 00:26:13.620
definitely care about having food.

00:26:13.650 --> 00:26:17.880
I think it might then like if you, if
you use the assumption that we have in

00:26:17.880 --> 00:26:22.980
the paper, which is that humans are, the
humans are noisily rational, then it might

00:26:23.130 --> 00:26:28.350
conclude things like I, uh, yes, Western
countries care about getting food to

00:26:29.010 --> 00:26:33.840
um, Western citizens, to
the citizens of their country.

00:26:34.110 --> 00:26:39.120
And they care a little bit about, uh,
other people having food, but like, not

00:26:39.120 --> 00:26:42.990
that much, it's like a small portion
of their, uh, government's aid budget.

00:26:43.080 --> 00:26:45.960
So like there's a positive weight
on there, but a fairly small weight.

00:26:46.500 --> 00:26:51.750
And that seems like maybe not the
thing that we wanted to learn, but like

00:26:51.810 --> 00:26:57.629
also I think it is in some sense, an
accurate reflection of what Western

00:26:57.690 --> 00:27:01.740
countries care about if you go by their
actions rather than what they say.

00:27:01.980 --> 00:27:02.310
Cool.

00:27:02.310 --> 00:27:02.639
Okay.

00:27:02.639 --> 00:27:07.320
So I, uh, I'm going to move on to
Benefits of Assistance over Reward Learning.

00:27:07.959 --> 00:27:12.610
And this one was absolutely fascinating
to me actually, mind blowing.

00:27:12.610 --> 00:27:15.669
I highly recommend people read
all of these, but, but definitely

00:27:15.669 --> 00:27:18.699
I can point to this one as,
um, something surprising to me.

00:27:18.699 --> 00:27:20.139
So that was you as the first author.

00:27:20.260 --> 00:27:22.449
And, uh, can you share, what
is the, what's the general

00:27:22.449 --> 00:27:24.610
idea of this paper, Rohin?

00:27:24.760 --> 00:27:28.000
I should say that this general
idea was not novel to this paper

00:27:28.000 --> 00:27:30.040
it's been proposed previously.

00:27:30.399 --> 00:27:33.850
I am not going to remember the
paper, but it's by Fern et al.

00:27:33.850 --> 00:27:37.090
It's like, towards a decision-theoretic
model of

00:27:37.090 --> 00:27:38.980
assistance or something like that.

00:27:39.100 --> 00:27:42.429
Um, and then there's also cooperative
inverse reinforcement learning

00:27:42.429 --> 00:27:44.290
from CHAI, where I did my PhD.

00:27:44.350 --> 00:27:48.399
The idea with this paper was just to
take the models that had already

00:27:48.399 --> 00:27:52.870
been proposed in these papers and
explain why they were so nice.

00:27:52.899 --> 00:27:53.560
Why, why?

00:27:53.560 --> 00:27:57.820
I was like particularly keen on
these models as opposed to, um, other

00:27:57.820 --> 00:27:59.439
things that the field could be doing.

00:27:59.620 --> 00:28:00.550
So the idea here

00:28:01.235 --> 00:28:05.525
is that generally we want to build
AI systems that help us do stuff.

00:28:05.705 --> 00:28:09.155
And you could imagine two different
ways that this could be done.

00:28:09.305 --> 00:28:16.115
Uh, first you could imagine a system
that has two separate modules.

00:28:16.475 --> 00:28:19.485
One module is
trying to figure out

00:28:20.240 --> 00:28:23.450
what the humans want, or what the
humans want the system to do.

00:28:23.630 --> 00:28:28.670
And the other module is trying
to then do the things that the first

00:28:28.670 --> 00:28:31.340
module said the people wanted it to do.

00:28:31.550 --> 00:28:34.910
And that's kind of like the, um, when we
talked about learning from human feedback

00:28:34.940 --> 00:28:38.570
earlier on, and modeling reward functions,
is that what that would be, exactly?

00:28:38.660 --> 00:28:40.070
Um, I think that is,

00:28:40.940 --> 00:28:42.950
that's often what
people are thinking about.

00:28:43.370 --> 00:28:48.020
I would make a distinction
between how you train the AI system

00:28:48.140 --> 00:28:50.270
and what the AI system is doing.

00:28:50.330 --> 00:28:53.840
This paper, I would say is more
about what the AI system is doing.

00:28:54.200 --> 00:28:57.020
Whereas the learning from human
feedback stuff is more about,

00:28:57.170 --> 00:28:59.780
um, how you train the system.

00:29:00.200 --> 00:29:00.500
Yeah.

00:29:00.500 --> 00:29:04.129
So in the, what the AI system is
doing framework, I would call this

00:29:04.790 --> 00:29:08.810
value learning or reward learning, and
then the alternative is assistance.

00:29:08.840 --> 00:29:11.660
And so, although there's like some
surface similarities between learning

00:29:11.660 --> 00:29:16.040
from human feedback and reward learning,
it is totally possible to use learning

00:29:16.040 --> 00:29:22.129
from human feedback algorithms to train
an AI system that

00:29:22.129 --> 00:29:23.840
then acts as though

00:29:24.060 --> 00:29:25.220
it is in the assistance paradigm.

00:29:25.220 --> 00:29:28.550
It is also possible to
use learning from human feedback

00:29:28.610 --> 00:29:31.310
approaches to train an AI system

00:29:31.340 --> 00:29:34.550
that then acts as though
it is in

00:29:34.550 --> 00:29:35.900
the reward learning paradigm.

00:29:35.960 --> 00:29:37.420
So that's one distinction.

00:29:38.145 --> 00:29:42.825
To recap, the value learning or
reward learning, uh, side of the two,

00:29:42.825 --> 00:29:45.945
two models is two separate modules.

00:29:46.245 --> 00:29:50.294
One that like figures out what
the humans want and the other that

00:29:50.294 --> 00:29:53.205
then acts to optimize those values.

00:29:53.355 --> 00:29:57.345
And the other side, which, which we
might call assistance is where you

00:29:57.345 --> 00:30:01.245
still have both of those functions, but
they're combined into a single module.

00:30:01.514 --> 00:30:05.925
And the way that you do this is you
have the AI system posit that there

00:30:05.925 --> 00:30:11.504
is some true unknown reward function
theta. Only the human, the human, who

00:30:11.504 --> 00:30:15.735
is a part of the environment, uh,
knows this theta, and their behavior

00:30:15.745 --> 00:30:17.835
depends on what the theta actually is.

00:30:17.895 --> 00:30:20.715
And so now the agent just has to
act in order to maximize

00:30:20.715 --> 00:30:22.425
theta, but it doesn't know theta.

00:30:22.635 --> 00:30:25.544
So it has to like look at how
the human is behaving within the

00:30:25.544 --> 00:30:29.014
environment in order to like, make some
inferences about what theta probably is.

00:30:29.790 --> 00:30:32.820
Uh, and then as it gets more and more
information about theta, that allows

00:30:32.820 --> 00:30:36.660
it to take more and more like, uh,
actions in order to optimize theta.

00:30:37.139 --> 00:30:42.000
But fundamentally this like, uh,
learning about theta is an instrumental

00:30:42.000 --> 00:30:46.770
action that the agent predicts
would be useful for helping it to

00:30:46.830 --> 00:30:48.840
better optimize theta in the future.

00:30:49.200 --> 00:30:54.330
So if I understand correctly, you're
saying assistance is superior because

00:30:54.510 --> 00:30:59.430
it can, the agent can reason about
how to improve its model of, of

00:30:59.430 --> 00:31:03.720
what the human wants, or how do you
describe why, why you get all

00:31:03.720 --> 00:31:05.610
these benefits from assistance.

00:31:05.760 --> 00:31:06.030
Yeah.

00:31:06.030 --> 00:31:08.520
I think the benefits come
more from the fact that these

00:31:08.520 --> 00:31:10.530
two functions are integrated.

00:31:10.560 --> 00:31:14.010
There's the value learning,
uh, the reward learning or

00:31:14.010 --> 00:31:15.570
value learning and the control.

00:31:15.600 --> 00:31:17.250
So like acting to optimize the value.

00:31:18.240 --> 00:31:20.760
So we can think of these
two functions. In assistance,

00:31:20.760 --> 00:31:24.900
they're merged into a single
module that does like nice, good

00:31:24.900 --> 00:31:26.400
Bayesian reasoning about all of it.

00:31:26.970 --> 00:31:30.120
Whereas in the value learning
paradigm, they're separated.

00:31:30.150 --> 00:31:32.940
And it's this integration
that provides the benefits.

00:31:33.060 --> 00:31:37.620
You can make plans, which is
generally the domain of control,

00:31:37.740 --> 00:31:39.630
but those plans can then depend on

00:31:40.725 --> 00:31:44.835
uh, the agent believing that in
the future, it's going to learn

00:31:44.925 --> 00:31:48.915
some more things about the reward
function theta, which would normally

00:31:48.915 --> 00:31:50.415
be the domain of value learning.

00:31:50.685 --> 00:31:56.535
So that's an example where control
is, uh, using information, future

00:31:56.535 --> 00:31:59.655
information about value learning
in order to make its plans.

00:31:59.865 --> 00:32:02.715
Whereas when those two modules
are separated, you can't do that.

00:32:02.835 --> 00:32:08.925
Um, and so like one example that we have
in the paper is, like, you imagine

00:32:08.925 --> 00:32:14.175
that, uh, you've got a robot, uh, who
is asked to cook dinner for Alice.

00:32:14.505 --> 00:32:18.495
Alice is currently... well, not
cook dinner, bake a pie for Alice.

00:32:18.675 --> 00:32:21.735
Um, Alice is currently at the office,
so the robot can't talk to her.

00:32:22.035 --> 00:32:25.515
And unfortunately the robot
doesn't know what kind of pie she

00:32:25.515 --> 00:32:29.925
wants, maybe apple, blueberry, or cherry.
Like, the robot could guess, but

00:32:29.925 --> 00:32:31.365
its guess is not that likely to be right.

00:32:32.625 --> 00:32:36.645
Uh, however, it turns out the, you
know, the, the steps to make the pie

00:32:36.645 --> 00:32:38.535
crusts are the same for all three pies.

00:32:38.985 --> 00:32:42.435
So an assistive robot can reason.

00:32:42.475 --> 00:32:49.995
Hey, uh, my plan is first, make the pie
crust, then wait for Alice to get home.

00:32:50.205 --> 00:32:51.915
Then ask her what filling she wants.

00:32:51.945 --> 00:32:53.205
Then put the filling in.

00:32:54.075 --> 00:32:58.245
And that entire plan consists of both
taking actions on the environment,

00:32:58.275 --> 00:33:02.535
like making the crust and putting
in the filling, and also includes

00:33:02.535 --> 00:33:06.675
things like learn more about
theta by asking Alice a question.

00:33:06.855 --> 00:33:10.545
Um, and so it's like integrating all
of these into a single plan, whereas

00:33:10.545 --> 00:33:14.895
that plan cannot be expressed in
the value learning paradigm. The

00:33:14.895 --> 00:33:16.855
query is an action in the action space.
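
NOTE
Editor's sketch of the pie example above, as a back-of-the-envelope
expected-value comparison. The prior over fillings, the 0/1 reward,
and the wait_cost parameter are hypothetical numbers chosen for
illustration and do not come from the paper. The point is only that
a plan which includes "ask Alice later" can beat committing to the
best guess now.
```python
# Hypothetical belief over theta (Alice's preferred filling).
PRIOR = {"apple": 0.4, "blueberry": 0.3, "cherry": 0.3}
def reward(pie, wanted):
    """1 if Alice gets the pie she wanted, else 0."""
    return 1.0 if pie == wanted else 0.0
def commit_now_value():
    """No chance to ask: bake the most probable pie under the prior."""
    guess = max(PRIOR, key=PRIOR.get)
    return sum(p * reward(guess, wanted) for wanted, p in PRIOR.items())
def crust_then_ask_value(wait_cost=0.1):
    """Assistance-style plan: make the shared crust, wait, ask, then fill."""
    return sum(p * reward(wanted, wanted) for wanted, p in PRIOR.items()) - wait_cost
print(commit_now_value(), crust_then_ask_value())
```
With these made-up numbers, asking wins as long as the cost of
waiting is smaller than the chance that the best guess is wrong.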

00:33:17.730 --> 00:33:22.110
So I, um, I really like the, uh,
you laid out some levels of task

00:33:22.110 --> 00:33:24.420
complexity, and I'm just going to
go through them really briefly.

00:33:24.780 --> 00:33:28.740
You mentioned traditional CS is,
uh, giving instructions to a computer

00:33:28.740 --> 00:33:33.150
on how to perform a task and then
using AI or ML for simpler tasks

00:33:33.150 --> 00:33:35.250
would be specifying what the task is.

00:33:35.940 --> 00:33:38.640
Um, and the machine
figures out how to do it.

00:33:38.640 --> 00:33:40.950
I guess that's standard RL formulation.

00:33:42.270 --> 00:33:45.960
And then, for harder tasks,
specifying the task is difficult.

00:33:45.960 --> 00:33:51.900
So the agent may learn
a reward function from human feedback.

00:33:52.020 --> 00:33:56.430
Um, and then, and then the, and then
you mentioned assistance paradigm as,

00:33:56.520 --> 00:34:00.210
as the next level where the human is
part of the environment and has latent

00:34:00.210 --> 00:34:02.340
goals that the robot does not know.

00:34:02.670 --> 00:34:02.970
Yup.

00:34:03.630 --> 00:34:04.710
How do you see this ladder?

00:34:04.710 --> 00:34:08.730
Like, does this describe, is this a
universal, um, classification scheme?

00:34:08.830 --> 00:34:09.930
Is, is, are we done?

00:34:09.960 --> 00:34:11.010
Is that the highest level?

00:34:11.350 --> 00:34:12.420
Good question.

00:34:14.505 --> 00:34:16.665
I haven't really thought about it before.

00:34:17.114 --> 00:34:23.445
You can imagine a different version of the
highest level, which is like here, we've

00:34:23.445 --> 00:34:29.235
talked about the assistance framing where
you're like, there is some objective, but

00:34:29.235 --> 00:34:32.235
you have to infer it from human feedback.

00:34:32.715 --> 00:34:36.074
There is a different version that
maybe is more in line with the way

00:34:36.074 --> 00:34:39.344
things are going with deep learning
right now, which is more like

00:34:39.495 --> 00:34:41.384
specifying the task is difficult.

00:34:41.384 --> 00:34:44.685
So we're only going to like
evaluate behaviors that the AI

00:34:44.685 --> 00:34:48.585
agent shows and maybe like also
try to find some hypothetical

00:34:48.585 --> 00:34:51.495
behaviors and evaluate those as well.

00:34:51.675 --> 00:34:56.054
Uh, so that's a different way that you
could talk about the highest level

00:34:56.985 --> 00:35:01.725
where you're like evaluating specific
behaviors, rather than trying to specify

00:35:01.725 --> 00:35:04.154
the task across all possible behaviors.

00:35:04.515 --> 00:35:06.255
And then maybe that would
have to be the highest.

00:35:07.065 --> 00:35:11.235
And now you could just keep inventing
new kinds of human feedback inputs,

00:35:11.775 --> 00:35:15.495
uh, and maybe those can be thought of
as higher levels beyond that as well.

00:35:15.765 --> 00:35:20.145
Um, so then, um, one detail I
mentioned, I, I saw in the paper, you

00:35:20.145 --> 00:35:25.125
mentioned that two-phase assistance
is equivalent to reward learning.

00:35:25.275 --> 00:35:28.335
And I, I puzzled over that line
and I couldn't really quite,

00:35:28.365 --> 00:35:29.385
uh, understand what you meant.

00:35:29.385 --> 00:35:30.915
Can you say a little bit more about that?

00:35:30.915 --> 00:35:31.545
What does that mean?

00:35:31.545 --> 00:35:34.875
And how do you conclude that
those two things are equivalent?

00:35:35.295 --> 00:35:35.805
Yeah.

00:35:35.865 --> 00:35:38.115
So there are a fair number
of definitions here.

00:35:38.445 --> 00:35:43.245
I won't, maybe I won't go through
all of it, but just, uh, for, so that

00:35:43.275 --> 00:35:47.865
listeners know we had definition,
we had formal definitions of like

00:35:47.865 --> 00:35:51.615
what counts as assistance and what
counts as reward learning, uh,

00:35:51.675 --> 00:35:53.125
and in the, the reward learning

00:35:54.000 --> 00:35:59.130
case, we imagined, we like
imagined it as, first,

00:35:59.130 --> 00:36:03.420
you have a system that like asks the
human questions, or actually, it doesn't

00:36:03.420 --> 00:36:06.840
have to ask the human questions, but
first we have a system that interacts

00:36:06.840 --> 00:36:11.400
with the human somehow and like develops
a guess of what the reward function is.

00:36:11.790 --> 00:36:15.900
And then, uh, that guess of what the
reward function is, which could be a

00:36:15.900 --> 00:36:21.750
distribution over rewards, is passed
on, uh, to a system that then acts to

00:36:21.750 --> 00:36:25.680
maximize the expected value of the,
sorry, the expected reward, according

00:36:25.680 --> 00:36:27.330
to that distribution over rewards.

00:36:28.080 --> 00:36:28.530
Okay.

00:36:28.560 --> 00:36:28.890
Yeah.

00:36:28.920 --> 00:36:32.280
So once it's done its communication,
it's learned the reward, and in phase

00:36:32.280 --> 00:36:35.760
two, it's not, it doesn't have
any query as action at that point.

00:36:35.790 --> 00:36:36.340
That's what you're saying.

00:36:36.340 --> 00:36:36.870
Exactly.

00:36:37.230 --> 00:36:37.680
Okay, cool.

00:36:37.710 --> 00:36:45.030
Um, and so then the, you know, two phase
is the two phase communicative assistance,

00:36:45.030 --> 00:36:46.650
the two phase and the communicative.

00:36:47.490 --> 00:36:51.720
Both have technical definitions, but they
roughly mean exactly what you would expect

00:36:51.720 --> 00:36:53.580
them to mean in order to make this true.

00:36:54.030 --> 00:36:58.770
Um, so you mentioned three
benefits of using assistance,

00:36:59.040 --> 00:37:00.300
this assistance paradigm.

00:37:00.390 --> 00:37:04.170
Can you briefly explain
what those benefits are?

00:37:04.560 --> 00:37:08.100
The first one, which I already
talked about, um, is plans

00:37:08.100 --> 00:37:09.780
conditional on future feedback.

00:37:10.170 --> 00:37:14.940
So this is the example of where the
robot can make a plan that says,

00:37:15.000 --> 00:37:17.220
Hey, first, I'll make the pie crust.

00:37:17.430 --> 00:37:19.530
Then I'll wait for Alice to
get back from the office.

00:37:19.560 --> 00:37:21.660
Then I'll ask her what filling she wants.

00:37:22.170 --> 00:37:24.570
Then I'll put in the appropriate filling.

00:37:24.570 --> 00:37:25.350
So there,

00:37:25.350 --> 00:37:32.160
the plan was conditional on the answer
that Alice was going to give in the future

00:37:32.460 --> 00:37:34.350
that the robot predicted she would give.

00:37:34.350 --> 00:37:36.720
But like, couldn't actually
ask the question now.

00:37:37.200 --> 00:37:42.000
So that's one thing that, uh, can
be done in the assistance paradigm,

00:37:42.000 --> 00:37:45.990
but not in the, um, value learning
or reward learning paradigm.

00:37:47.805 --> 00:37:54.375
Uh, a second one is what we call
relevance-aware active learning.

00:37:54.884 --> 00:38:02.055
Uh, so active learning is the idea
that instead of the human passively,

00:38:02.085 --> 00:38:05.775
giving the robot, sorry, instead of
the human giving a bunch of information

00:38:05.775 --> 00:38:09.525
to the robot and the robot passively
taking it and using it to update its

00:38:09.525 --> 00:38:14.234
estimate of theta, the robot actively
asks the human questions

00:38:14.505 --> 00:38:19.154
that seem most relevant to updating its
understanding of the reward theta, and

00:38:19.154 --> 00:38:20.984
then the human answers those questions.

00:38:21.165 --> 00:38:24.285
So that's active learning that
can be done in both paradigms.

00:38:24.375 --> 00:38:29.535
The thing that assistance can do is to
have the robot only ask questions that

00:38:29.535 --> 00:38:33.464
are actually relevant for the plans
that it's going to have in the future.

00:38:33.855 --> 00:38:37.815
So to make this point that I might,
you might imagine that like, you

00:38:37.815 --> 00:38:41.625
know, you get a hustled robot, um,
that your hustled robots booting up.

00:38:41.714 --> 00:38:44.795
And if it was in the reward
learning paradigm, it has to

00:38:44.815 --> 00:38:46.245
like figure out theta, right.

00:38:46.995 --> 00:38:48.345
And so it's like, all right.

00:38:48.375 --> 00:38:53.415
Do you tend to like, uh, at what
time do you tend to prefer dinner?

00:38:53.475 --> 00:38:55.305
Um, so I can cook that for you.

00:38:55.305 --> 00:38:58.154
And that's like a pretty reasonable
question and you're like, yeah, I

00:38:58.424 --> 00:39:01.185
usually eat around, um, 7:00 PM.

00:39:01.694 --> 00:39:04.875
Uh, and it asks a few more questions
like this, and later on, it's like,

00:39:05.085 --> 00:39:10.185
well, if you ever wanted to paint your
house, what color should we paint it?

00:39:10.845 --> 00:39:14.835
And you're like, kind of like
a blue, I guess, but like,

00:39:14.865 --> 00:39:16.515
why are you asking me this?

00:39:16.515 --> 00:39:19.755
And then it's like, if aliens
come and invade from Mars,

00:39:20.505 --> 00:39:24.404
where would, what would be your
preferred place to hide in?

00:39:24.404 --> 00:39:27.045
And you're like, why, why
are you asking me this?

00:39:27.315 --> 00:39:30.375
But the thing is like, all of these
questions are in fact relevant for,

00:39:30.585 --> 00:39:32.085
for their reward function data.

00:39:32.835 --> 00:39:39.075
The reason that, like, if this
were a human instead of a robot, the

00:39:39.075 --> 00:39:42.105
reason they wouldn't ask these questions
is because the situations to which

00:39:42.105 --> 00:39:44.565
they're relevant probably don't come up.

00:39:45.105 --> 00:39:50.145
But in order to like, make that
prediction, you need to be talking more

00:39:50.145 --> 00:39:54.944
to the control, uh, submodule, the, with,
uh, the control module, which is a thing

00:39:54.975 --> 00:39:58.755
that the reward learning paradigm
doesn't do. The control submodule is

00:39:58.765 --> 00:40:01.185
the one that's like, all right,
we're gonna take, we're probably

00:40:01.185 --> 00:40:02.625
going to take these sorts of actions.

00:40:02.625 --> 00:40:03.915
That's going to lead to
those kinds of futures.

00:40:04.814 --> 00:40:07.154
And so like, you know, probably
aliens from Mars aren't

00:40:07.194 --> 00:40:08.384
ever going to be relevant.

00:40:08.475 --> 00:40:14.205
So if, if you have this like one unified
system, uh, then it can be like, well,

00:40:14.205 --> 00:40:17.895
okay, I know that, like, aliens from
Mars are probably not going to show

00:40:17.895 --> 00:40:20.234
up, uh, anytime in the near future.

00:40:20.444 --> 00:40:24.075
And I don't need to ask about
those preferences right now.

00:40:24.134 --> 00:40:29.234
If they, if I do find out that aliens
from Mars are likely to land, uh, soon

00:40:29.325 --> 00:40:32.984
then I will ask that question, but I
can leave that to later and not bother,

00:40:33.254 --> 00:40:35.564
um, Alice until that actually happens.

00:40:35.625 --> 00:40:36.915
Um, so that's the second one.

00:40:37.125 --> 00:40:40.964
And then the final one is that.

00:40:41.790 --> 00:40:46.049
You know, so far, I've been talking
about cases where the robot is learning

00:40:46.230 --> 00:40:47.910
by asking the human questions.

00:40:47.910 --> 00:40:50.669
And the human just like gives
answers that are informative

00:40:50.669 --> 00:40:52.020
about the reward function theta.

00:40:52.799 --> 00:40:56.399
Uh, the third one is that, you know, you
don't have to ask the human questions.

00:40:56.430 --> 00:40:59.730
You can also learn from their
behavior just directly while

00:40:59.730 --> 00:41:02.580
they're going about their day
and optimizing their environment.

00:41:03.240 --> 00:41:08.220
A good example of this is like your robot
starts helping out around the kitchen.

00:41:08.370 --> 00:41:12.060
It starts by doing some like very obvious
things like, okay, there are some dirty

00:41:12.060 --> 00:41:14.940
dishes, just put them in the dishwasher.

00:41:15.299 --> 00:41:20.220
Um, meanwhile the human's going around and
like starting to collect the ingredients

00:41:20.220 --> 00:41:25.290
for baking a pie. The robot can see
this, notice that that's, that's the case,

00:41:25.290 --> 00:41:30.210
and then, like, go and get out the, like,
mixing bowl and the egg beater and so on.

00:41:30.839 --> 00:41:31.950
Um, in order to help.

00:41:32.730 --> 00:41:36.810
Uh, like the sort of just like
seeing what the human is up to and

00:41:36.810 --> 00:41:40.260
then like immediately starting to
help with that is the sort of thing

00:41:40.260 --> 00:41:44.700
that you can only, like this is all
happening within a single episode,

00:41:44.700 --> 00:41:46.290
rather than being across episodes.

00:41:46.320 --> 00:41:50.160
The, like, value learning or reward learning
paradigm could do it across episodes where

00:41:50.160 --> 00:41:55.920
like first the robot looks and watches
the human, uh, act in the environment

00:41:55.920 --> 00:41:57.480
to make an entire cake from scratch.

00:41:57.510 --> 00:42:01.590
And then the next time when the robot is
actually in the environment, it goes and helps the

00:42:01.590 --> 00:42:06.810
human out, but in the assistance paradigm,
it can do that learning and help out with

00:42:06.810 --> 00:42:11.730
making the cake within the episode itself,
as long as it has enough understanding

00:42:11.730 --> 00:42:16.140
of how the world works and what theta is
likely to be, uh, in order to actually

00:42:16.200 --> 00:42:19.440
like deduce with enough confidence
that those actions are good to take.

00:42:19.680 --> 00:42:21.960
When you described the robot that
would ask all these irrelevant

00:42:21.960 --> 00:42:23.070
questions, I couldn't help.

00:42:23.120 --> 00:42:24.000
I'm a parent.

00:42:24.060 --> 00:42:25.890
I couldn't help but think,
you know, that's the kind of

00:42:25.890 --> 00:42:26.790
thing a four-year-old would do.

00:42:26.790 --> 00:42:28.470
They'd ask you every random question

00:42:28.890 --> 00:42:29.970
that's not relevant right then.

00:42:29.970 --> 00:42:31.980
And it seems like you're,
you're kind of pointing to a

00:42:31.980 --> 00:42:33.360
more mature type of intelligence.

00:42:34.080 --> 00:42:34.680
Yeah.

00:42:34.950 --> 00:42:35.250
Yeah.

00:42:35.280 --> 00:42:39.960
A lot of this is like, like this, the
entire paper, uh, has this assumption

00:42:39.960 --> 00:42:43.080
of like, we're going to write down
math and then we're going to talk about

00:42:43.110 --> 00:42:44.940
agents that are optimal for that math.

00:42:45.210 --> 00:42:47.670
We're not going to bother thinking
of, we're not going to think

00:42:47.670 --> 00:42:50.190
about like, okay, how do we in
practice get the optimal thing.

00:42:50.190 --> 00:42:53.430
We're just like, is the optimal thing,
actually, the thing that we want.

00:42:53.940 --> 00:42:59.220
Uh, and so one would hope that yes, uh,
if we're assuming the actual optimal

00:42:59.220 --> 00:43:05.160
agent, it should in fact be, um, more
mature than four year olds, one hopes.

00:43:06.480 --> 00:43:10.710
So how do you, um, relate, can you
relate this assistance paradigm

00:43:10.710 --> 00:43:14.010
back to standard inverse RL?

00:43:14.040 --> 00:43:16.260
What is the relationship
between these two paradigms?

00:43:16.590 --> 00:43:17.040
Yeah.

00:43:17.070 --> 00:43:22.620
So inverse RL assumes that, it's an
example of the reward learning paradigm.

00:43:23.520 --> 00:43:27.720
Um, it assumes that you get full
demonstrations of the entire task.

00:43:28.830 --> 00:43:33.180
And then you have, and they're,
like, uh, executed by the

00:43:33.180 --> 00:43:35.850
human teleoperating the robot.

00:43:36.150 --> 00:43:37.230
There's, like, versions of it

00:43:37.230 --> 00:43:42.510
that don't assume the teleoperation
part, but usually that's an assumption.

00:43:43.200 --> 00:43:48.750
And then given the, you know, teleoperated
robot demonstrations of how to

00:43:48.750 --> 00:43:52.830
do the task, the robot is then
supposed to infer what the task actually

00:43:52.830 --> 00:43:57.090
was and then be able to do it itself in
the future without any teleoperation.

00:43:57.750 --> 00:44:01.590
So without uncertainty? Is that true,
that the inverse RL paradigm assumes

00:44:01.590 --> 00:44:03.510
that we're not uncertain in the end?

00:44:04.200 --> 00:44:05.430
Uh, no.

00:44:05.430 --> 00:44:11.610
It doesn't necessarily assume that. I
think in many deep IRL algorithms that

00:44:11.610 --> 00:44:15.120
does end up being an assumption that
they use, but it's not a necessary one.

00:44:15.240 --> 00:44:17.280
Uh, it can still be uncertain.

00:44:17.310 --> 00:44:23.170
And then it would plan, typically with
respect to maximizing the expectation of

00:44:23.895 --> 00:44:28.155
the reward function, although you could
also try to be conservative or

00:44:28.215 --> 00:44:32.625
risk sensitive, and then you would be
max, uh, you, you wouldn't be maximizing

00:44:32.625 --> 00:44:36.735
expected reward and maybe you'd be
maximizing like worst case reward if

00:44:36.735 --> 00:44:40.485
you wanted to be maximally conservative
or something like that, or a fifth

00:44:40.485 --> 00:44:42.285
percentile reward or something like that.

00:44:42.405 --> 00:44:42.645
Yeah.

00:44:42.655 --> 00:44:47.325
So, so there can be uncertainty, but like
the human isn't in the environment and

00:44:47.685 --> 00:44:51.345
there's this episodic assumption where
like the demonstration is one episode

00:44:51.345 --> 00:44:54.555
and then when the robot is acting,
that's a totally different episode.

00:44:54.945 --> 00:44:56.115
And that also isn't true

00:44:56.115 --> 00:45:00.315
in the assistance case. You talk
about active reward learning

00:45:00.375 --> 00:45:01.995
and interactive reward learning.

00:45:02.025 --> 00:45:05.085
Can you help us understand those,
those two phrases and how they differ?

00:45:05.205 --> 00:45:05.475
Yeah.

00:45:05.475 --> 00:45:10.275
So active reward learning is just
when, uh, the robot has the ability,

00:45:10.335 --> 00:45:14.025
like in the reward learning paradigm,
the robot is given the ability to

00:45:14.025 --> 00:45:16.395
ask questions, um, rather than just

00:45:17.205 --> 00:45:19.335
just getting to observe
what the human is doing.

00:45:19.425 --> 00:45:21.765
So hopefully that one
should be relatively clear.

00:45:21.915 --> 00:45:26.925
The interactive reward learning
setting is, uh, it's mostly just

00:45:26.955 --> 00:45:31.065
a thing we made up because it was
a thing that people often brought

00:45:31.065 --> 00:45:32.625
up as like, maybe this will work.

00:45:32.985 --> 00:45:36.495
So we wanted to talk about it and show
why it doesn't, it doesn't in fact work.

00:45:37.005 --> 00:45:40.695
Uh, but the idea there is that you
alternate between you still have

00:45:40.695 --> 00:45:43.905
your two modules, you have one reward
learning module and one control

00:45:43.905 --> 00:45:47.355
module, and they don't talk to each
other, but instead of like just doing

00:45:47.355 --> 00:45:51.195
one reward learning thing, and
then, and then doing control forever,

00:45:51.985 --> 00:45:56.665
you do, like, I don't know, you do 10,
10 steps of reward learning, then 10

00:45:56.665 --> 00:46:00.565
steps of control, then 10 steps of
reward learning, then 10 steps of control.

00:46:00.895 --> 00:46:05.185
And you keep iterating
between the two stages.

00:46:05.395 --> 00:46:09.445
So why is computational complexity
really high for algorithms that

00:46:09.445 --> 00:46:11.365
try to optimize over assistance?

00:46:11.365 --> 00:46:12.445
I think you mentioned that here.

00:46:12.595 --> 00:46:12.835
Yeah.

00:46:12.835 --> 00:46:15.895
So everything I've talked about has
just sort of assumed that the agents are

00:46:15.895 --> 00:46:21.145
optimal by default, but if you think
about it, what the optimal agent has to

00:46:21.145 --> 00:46:26.365
do is it has to, you know, maintain a
probability distribution over all of the

00:46:26.365 --> 00:46:32.275
possible reward functions that Alice could
have and then update it over time as it

00:46:32.275 --> 00:46:34.285
sees more and more of Alice's behavior.

00:46:34.615 --> 00:46:41.485
And as you probably know, full Bayesian
updating over a large list of hypotheses,

00:46:41.575 --> 00:46:45.025
uh, is very computationally intractable.

00:46:45.275 --> 00:46:48.535
Another way of seeing it is
that, you know, if you take this

00:46:48.535 --> 00:46:50.745
assistance paradigm, you can,

00:46:51.629 --> 00:46:55.379
through a relatively simple reduction,
turn it into a partially observable

00:46:55.379 --> 00:46:57.480
Markov decision process, or POMDP.

00:46:58.049 --> 00:47:03.540
The basic idea there is to treat
the reward function theta as, like,

00:47:03.540 --> 00:47:05.339
some unobserved part of the state.

00:47:05.730 --> 00:47:09.060
Uh, and then that reward function
is whatever that unobserved

00:47:09.060 --> 00:47:10.350
part of the state would say.

00:47:11.220 --> 00:47:15.750
Uh, and then the, um, Alice's
behavior is thought of as part of the

00:47:15.750 --> 00:47:18.960
transition dynamics, which depends
on the unobserved part of the state,

00:47:18.960 --> 00:47:20.730
that is, it depends on theta.

00:47:21.660 --> 00:47:26.069
Uh, so that's the rough reduction to
how you phrase assistance as a POMDP.

00:47:26.310 --> 00:47:32.940
Uh, and then POMDPs are known to be
very computationally intractable to solve

00:47:33.390 --> 00:47:36.990
again for basically the same reasons that
I was just saying, which is that like, to

00:47:36.990 --> 00:47:41.730
actually solve them, you need to maintain
a Bayesian, a probability distribution

00:47:41.730 --> 00:47:45.930
over all the, uh, ways that the
unobserved parts of the state could be.

00:47:46.109 --> 00:47:48.450
And that's just
computationally intractable.

00:47:49.305 --> 00:47:52.965
So do you plan to work on this, on
this particular line of work further?

00:47:53.384 --> 00:47:58.785
I think I don't plan to do further
direct research on this myself.

00:47:59.325 --> 00:48:04.904
I still basically agree with the
point of the paper, which is look

00:48:04.904 --> 00:48:08.325
when you're building your AI systems,
they should be reasoning more.

00:48:08.654 --> 00:48:12.015
They should be reasoning in the way
that the assistance paradigm suggests

00:48:12.015 --> 00:48:15.615
where there's like this integrated
reward learning and control,

00:48:15.884 --> 00:48:17.745
and they shouldn't be reasoning

00:48:17.805 --> 00:48:20.865
in the way that the value
learning, uh, paradigm suggests, where

00:48:20.865 --> 00:48:25.125
you first figure out what human
values are and then optimize for them.

00:48:25.245 --> 00:48:31.035
And so I think that point is a
pretty important point and will

00:48:31.065 --> 00:48:36.225
guide how we build AI systems in
the future, or it will guide how,

00:48:36.315 --> 00:48:38.295
what we have our AI systems do.

00:48:38.925 --> 00:48:43.620
And I think I will continue to push for
that point, including in, like, projects

00:48:43.620 --> 00:48:48.180
at DeepMind, but I probably
won't be doing more like technical

00:48:48.180 --> 00:48:53.759
research on the math in those papers
specifically, because I think, like,

00:48:54.150 --> 00:48:58.500
it said the things that I wanted
to say. Uh, there's still more work.

00:48:58.560 --> 00:49:01.529
There's still plenty of work that
one could do such as like trying to

00:49:01.859 --> 00:49:05.759
come up with algorithms to directly
optimize the maths that we wrote down.

00:49:06.180 --> 00:49:09.060
Um, but that seems less
high leverage to me.

00:49:09.420 --> 00:49:09.630
Okay.

00:49:09.630 --> 00:49:13.859
Moving to the next paper, On the
Utility of Learning about Humans for

00:49:13.859 --> 00:49:18.240
Human-AI Coordination, that was Carroll
et al. with yourself as a coauthor.

00:49:18.359 --> 00:49:21.330
Um, can you tell us the
brief, uh, general idea here?

00:49:21.600 --> 00:49:27.270
I think this paper was written
in, in the wake of some pretty

00:49:27.270 --> 00:49:29.609
big successes of self-play.

00:49:30.360 --> 00:49:35.549
Um, so self-play is the algorithm
underlying, well, self-play or, like, very

00:49:35.549 --> 00:49:41.009
similar variants are the, is the
algorithm underlying OpenAI Five,

00:49:41.009 --> 00:49:45.450
which plays Dota, AlphaStar, which
plays StarCraft, AlphaGo and AlphaZero,

00:49:45.450 --> 00:49:47.910
which play, you know, Go, chess, shogi,

00:49:48.480 --> 00:49:52.259
and so on at a superhuman level. These
were like some of the, yeah, some of the

00:49:52.259 --> 00:49:56.820
biggest results in AI around that time
and sort of suggested that like self

00:49:56.820 --> 00:49:58.650
play was going to be a really big thing.

00:49:58.890 --> 00:50:07.065
And the point we were making in this
paper is that self-play works well when

00:50:07.065 --> 00:50:12.825
you have a zero-sum, uh, two-player
zero-sum game, uh, which is a, like,

00:50:12.855 --> 00:50:18.405
perfectly competitive game, uh, because
it's effectively going to cause you to

00:50:18.405 --> 00:50:21.945
explore the full space of strategies,
because it, if you're like playing against

00:50:21.945 --> 00:50:27.945
yourself in a competitive game, if there's
any flaw in your strategy, then gradient

00:50:27.945 --> 00:50:31.725
descent is going to like push you in
the direction of like exploiting that

00:50:31.725 --> 00:50:36.285
flaw because you're, you know, you're
trying to beat the other copy of you.

00:50:36.465 --> 00:50:42.345
So you're always driven to get better.
Uh, in contrast, in common-payoff

00:50:42.375 --> 00:50:46.695
games, which are the most collaborative
games, um, where each agent gets the

00:50:46.695 --> 00:50:51.225
same payoff, no matter what happens,
uh, but the payoffs can be different.

00:50:52.020 --> 00:50:55.560
Uh, you don't have this,
um, similar incentive.

00:50:55.709 --> 00:50:59.430
Uh, you don't have any
incentive to be unexploitable.

00:50:59.549 --> 00:51:05.910
Like all you want is to come up with
some policy that like, if played against

00:51:05.910 --> 00:51:12.899
yourself will get the maximum reward,
but it doesn't really matter if you are,

00:51:13.680 --> 00:51:17.759
if you would, like, play badly with
somebody else, like a human, like if

00:51:17.759 --> 00:51:21.120
that were true, that wouldn't come
up in self play, self play would be

00:51:21.120 --> 00:51:24.630
like, nah, in every single game
you play, you got the maximum reward.

00:51:24.690 --> 00:51:25.740
There's nothing to do here.

00:51:26.069 --> 00:51:30.660
So there's no force that's like
causing you to be robust to all of the

00:51:30.660 --> 00:51:32.850
possible players that you could have.

00:51:32.880 --> 00:51:36.870
Whereas in the competitive game, if
you weren't robust to all of the

00:51:36.870 --> 00:51:40.620
players that could possibly arise,
then you're exploitable in some way.

00:51:40.680 --> 00:51:44.310
And then gradient descent is
incentivized to find that exploit, after

00:51:44.310 --> 00:51:45.690
which you have to become robust to it.

00:51:45.960 --> 00:51:48.870
Is there any way to reformulate it so
that there is that competitive pressure?

00:51:49.140 --> 00:51:50.490
You can actually do this.

00:51:50.490 --> 00:51:54.750
And so I know you've had Michael
Dennis, um, and I think also Natasha

00:51:54.750 --> 00:51:58.950
Jaques on this podcast before,
and both of them are doing work

00:51:58.950 --> 00:52:02.700
that's kind of like this.
Uh, with PAIRED, right?

00:52:02.700 --> 00:52:04.029
That was Jaques and...

00:52:05.760 --> 00:52:06.210
Yeah.

00:52:06.390 --> 00:52:09.330
So the way you do it is you just
say, all right, we're going to make

00:52:09.330 --> 00:52:14.550
the environment, uh, our competitor. The
environment is going to like try and like

00:52:14.790 --> 00:52:20.100
make itself super complicated, uh, in a
way that defeats, uh, whatever policy,

00:52:20.580 --> 00:52:22.620
uh, we were trying to use to coordinate.

00:52:23.130 --> 00:52:28.440
And so then this makes sure that
you have to be robust to whichever

00:52:28.440 --> 00:52:30.300
environment you find yourself in.

00:52:30.480 --> 00:52:33.330
So that's like one way to get
robustness to, well, it's getting

00:52:33.330 --> 00:52:34.740
you robustness to environments.

00:52:34.740 --> 00:52:38.490
It's not necessarily getting
robustness to your partners.

00:52:39.000 --> 00:52:42.780
Um, when, like, if you, for example,
you wanted to cooperate with the

00:52:42.780 --> 00:52:47.100
human, but you could do a similar
thing there where you say we're going

00:52:47.100 --> 00:52:52.830
to also take the partner agent and
we're going to make it be adversarial.

00:52:53.460 --> 00:52:57.240
Now this doesn't work great if you
like, literally make it adversarial

00:52:57.240 --> 00:53:00.960
because sometimes in many like
interesting collaborative games,

00:53:01.859 --> 00:53:05.370
Um, like, like Overcooked, which is
the one that we were studying here.

00:53:05.640 --> 00:53:09.569
If your partner is an adversary,
they can just guarantee

00:53:09.569 --> 00:53:11.040
that you get minimum reward.

00:53:11.220 --> 00:53:13.680
It's not, it's often
not difficult. In this,

00:53:13.710 --> 00:53:17.549
in Overcooked, you just like
stand in front of the station where

00:53:17.549 --> 00:53:21.180
you deliver the dishes that you've
cooked and you just stand there.

00:53:21.450 --> 00:53:23.460
And that's what the adversary does.

00:53:23.460 --> 00:53:27.149
And then the agent is just like,
well, okay, I can make a soup,

00:53:27.180 --> 00:53:28.319
but I can never deliver it.

00:53:28.470 --> 00:53:29.790
I guess I never get the reward.

00:53:30.750 --> 00:53:35.580
Uh, so, so it doesn't quite that like
naive, simple approach doesn't quite

00:53:35.580 --> 00:53:41.399
work, but you can, instead you can
like try to have a, uh, slightly more

00:53:41.399 --> 00:53:46.379
sophisticated method where, you know,
the, instead of being an adversarial

00:53:46.379 --> 00:53:47.790
partner, it's a partner that is

00:53:48.960 --> 00:53:51.240
trying to keep you on the
edge of your abilities.

00:53:51.360 --> 00:53:55.860
And then you like, uh, as you, uh, and
then like, once your agent learns how to

00:53:55.860 --> 00:54:00.390
like, do well with the one, uh, with your
current partner, then like the partner

00:54:00.390 --> 00:54:02.190
tries to make itself a bit harder to do.

00:54:02.550 --> 00:54:02.880
And so on.

00:54:02.880 --> 00:54:09.150
So there, there are a few, there's a few
papers like this that I'm currently failing

00:54:09.150 --> 00:54:11.400
to remember, but, but there are papers

00:54:11.410 --> 00:54:12.840
that try to do this sort of thing.

00:54:13.230 --> 00:54:17.190
I think many of them did end up just
like following, uh, both the self

00:54:17.190 --> 00:54:19.110
play work and this paper of ours.

00:54:19.500 --> 00:54:20.220
So, yeah.

00:54:20.460 --> 00:54:22.350
And basically I think you're right.

00:54:22.530 --> 00:54:26.790
You can in fact do some clever
tricks to make things, uh, to make

00:54:26.790 --> 00:54:28.380
things better and to get around this.

00:54:29.010 --> 00:54:31.890
It's not quite as simple
and elegant as self play.

00:54:31.890 --> 00:54:36.150
And I don't think the results are quite
as good as you get with self play.

00:54:36.150 --> 00:54:38.130
Cause it's still not
exactly the thing that.

00:54:39.060 --> 00:54:43.650
So now we have a contributed question,
which I'm very excited about from, uh, Dr.

00:54:43.650 --> 00:54:48.210
Natasha Jaques, senior research scientist
at Google AI and postdoc at Berkeley.

00:54:48.270 --> 00:54:51.720
And we were lucky to have Natasha
as our guest on episode one.

00:54:51.810 --> 00:54:55.980
So Natasha asks: the most
interesting questions are about why

00:54:56.100 --> 00:54:59.850
interacting with humans is so much
harder slash so different than

00:54:59.850 --> 00:55:01.980
interacting with simulated RL agents.

00:55:02.520 --> 00:55:06.569
So Rohin, what is it about humans
that makes them, um, harder?

00:55:08.234 --> 00:55:13.605
Yeah, there are a bunch of factors here,
maybe the most obvious one and probably

00:55:13.605 --> 00:55:19.185
the biggest one in practice is that you
can't just put humans in your environment

00:55:19.214 --> 00:55:24.375
to do like a million steps of gradient
descent on, uh, which often we do in

00:55:24.375 --> 00:55:26.535
fact do with our simulated RL agents.

00:55:26.775 --> 00:55:31.935
And so like, if you could just somehow
put a human in the loop, uh, in a million

00:55:32.654 --> 00:55:37.275
for a million episodes, maybe then the
resulting agent would in fact, just be

00:55:37.275 --> 00:55:39.134
really good at coordinating with humans.

00:55:39.375 --> 00:55:42.375
In fact, I might like take out
the "maybe" there, and I will, I will

00:55:42.375 --> 00:55:46.185
actually predict that that resulting
agent will be good with humans.

00:55:46.185 --> 00:55:50.355
As long as you had like, uh, like
reasonable diversity of humans, um,

00:55:50.775 --> 00:55:53.145
and that you had to collaborate with.

00:55:53.325 --> 00:55:56.805
So my first and biggest answer is:

00:55:57.765 --> 00:56:00.615
you can't get a lot of data from
humans in the way that you can

00:56:00.615 --> 00:56:04.605
get a lot of data from simulated
RL agents, uh, or equivalently,

00:56:04.605 --> 00:56:09.435
you can't just put the human into the
training loop the way you can put a

00:56:09.435 --> 00:56:11.654
simulated RL agent into the training loop.

00:56:12.435 --> 00:56:14.025
Uh, so that's answer number one.

00:56:14.475 --> 00:56:19.755
And then there is another answer, uh,
which seems significantly less important,

00:56:20.145 --> 00:56:25.755
which is that humans are just not as
are, sorry, are significantly more

00:56:25.755 --> 00:56:27.825
diverse than simulated RL agents.

00:56:27.855 --> 00:56:30.705
Typically humans don't
all act the same way.

00:56:31.005 --> 00:56:33.674
Uh, even an individual human
will act pretty different.

00:56:34.095 --> 00:56:38.025
Um, from one episode to the next
humans will like learn over time.

00:56:38.865 --> 00:56:43.965
Uh, and so not only is their
policy, like, kind of, kind of

00:56:43.965 --> 00:56:48.285
stochastic, but their policy isn't
even stationary. Their policy changes

00:56:48.285 --> 00:56:51.285
over time as they learn how to play
the game and become better at it.

00:56:51.915 --> 00:56:57.075
Um, and that's another thing that RL,
like, usually RL assumes that that

00:56:57.105 --> 00:57:02.745
doesn't, that is not in fact true, that,
like, episodes are drawn IID. Because

00:57:02.745 --> 00:57:07.635
of this, like, nonstationarity and
stochasticity and diversity,

00:57:07.785 --> 00:57:13.635
you would imagine that it, like you have
to get a much more robust policy, uh,

00:57:13.695 --> 00:57:17.445
in order to work with humans instead
of working with simulated RL agents.

00:57:17.505 --> 00:57:22.845
And so that, uh, ends up being, uh,
that ends up being harder to do.

00:57:23.175 --> 00:57:26.655
Sometimes people try to like take
their simulated RL agents and

00:57:26.655 --> 00:57:31.725
like make them more stochastic
to be more similar to humans.

00:57:32.295 --> 00:57:36.465
Um, for example, by like maybe taking a
random action with some small probability.

00:57:37.634 --> 00:57:42.195
And I think usually this
ends up still looking kind of

00:57:42.195 --> 00:57:44.685
like artificial and forced.

00:57:44.775 --> 00:57:48.735
When you like look at the resulting
behavior such that it still doesn't

00:57:48.735 --> 00:57:53.565
require that robust a policy in
order to collaborate well, but those

00:57:53.565 --> 00:57:57.855
agents, um, and humans are just
like more challenging than that.

00:57:58.125 --> 00:57:58.365
Okay.

00:57:58.365 --> 00:58:01.965
Let's briefly move to the next
paper, Evaluating the Robustness

00:58:01.965 --> 00:58:03.165
of Collaborative Agents.

00:58:03.195 --> 00:58:06.105
That was Knott et al. with
yourself as a co-author.

00:58:06.495 --> 00:58:09.465
Can you give us the short version
of what this paper is about?

00:58:09.735 --> 00:58:13.845
Like we just talked about how, in
order to get your agents to work well

00:58:13.845 --> 00:58:17.565
with humans, they need to be, they
need to learn a pretty robust policy.

00:58:18.015 --> 00:58:23.175
And so one way of measuring how good your
agents are at, uh, collaborating with

00:58:23.175 --> 00:58:27.765
humans is, well, you just, like, have them
play with humans and see how well that

00:58:27.765 --> 00:58:30.285
goes, which is a reasonable thing to do.

00:58:31.395 --> 00:58:36.495
Um, and people should definitely do
it, but this paper proposed a like

00:58:36.855 --> 00:58:40.515
maybe simpler and more reproducible
test that you can run more often.

00:58:41.355 --> 00:58:44.895
Um, which is just, I mean, it's
the basic idea from software

00:58:44.904 --> 00:58:46.785
engineering is just a unit test.

00:58:47.415 --> 00:58:50.325
Uh, and so it's a very simple idea.

00:58:50.335 --> 00:58:54.615
The idea is just write some unit tests
for the robustness of your agents. Write

00:58:54.615 --> 00:58:55.815
some cases in which you think the,

00:58:56.580 --> 00:58:57.299
like, correct

00:58:57.299 --> 00:59:02.759
action is unambiguously clear, in cases
that you maybe expect not to come up,

00:59:02.940 --> 00:59:08.640
uh, during, uh, during training and then
just see whether your agent does in fact do

00:59:08.640 --> 00:59:11.040
the right thing, uh, on those inputs.

00:59:11.040 --> 00:59:14.700
And that can give you, like, if your
agent passes all of those tests,

00:59:14.700 --> 00:59:16.560
that's not a guarantee that it's robust.

00:59:17.160 --> 00:59:20.850
Um, but if it fails some of those
tests, then you've definitely

00:59:20.880 --> 00:59:22.590
found some failures of robustness.

00:59:23.160 --> 00:59:29.670
I think in practice, uh, the agents that
we tested all like failed many tests.

00:59:29.700 --> 00:59:32.490
I, yeah, I don't remember
the exact numbers off the

00:59:32.490 --> 00:59:33.540
top of my head, but I think

00:59:34.455 --> 00:59:38.984
some of the better agents were
getting scores of maybe 70%.
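
NOTE
Illustrative sketch only, not code from the paper discussed here. It shows
the unit-test idea described above: hand-written situations with one
unambiguously correct action, run against a policy, reported as pass rates.
Every name below (RobustnessTest, pass_rates, toy_policy, the observation
keys and action strings) is a hypothetical stand-in, and the policy is
assumed to map a single observation to an action, so tests that need memory
would have to pass a history instead.
```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List
@dataclass(frozen=True)
class RobustnessTest:
    name: str
    category: str        # e.g. "state" or "agent_without_memory" (labels are assumptions)
    observation: dict    # hand-written situation, ideally one unlikely to occur in training
    correct_action: str  # the action a human would call unambiguously correct
def pass_rates(policy: Callable[[dict], str], tests: List[RobustnessTest]) -> Dict[str, float]:
    # Fraction of tests passed, overall and per category.
    passed, total = defaultdict(int), defaultdict(int)
    for t in tests:
        ok = policy(t.observation) == t.correct_action
        for key in ("overall", t.category):
            total[key] += 1
            passed[key] += int(ok)
    return {key: passed[key] / total[key] for key in total}
# Toy policy that ignores its partner entirely (purely illustrative).
def toy_policy(obs: dict) -> str:
    return "pick_up" if obs.get("onion_on_floor") else "go_to_pot"
TESTS = [
    RobustnessTest("onion dropped on floor", "state", {"onion_on_floor": True}, "pick_up"),
    RobustnessTest("partner blocks the corridor", "agent_without_memory", {"partner_in_path": True}, "wait"),
]
RATES = pass_rates(toy_policy, TESTS)
```
As in the discussion, a fully passing suite would not guarantee robustness,
while any failing test is a concrete robustness failure worth looking at.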

00:59:39.285 --> 00:59:43.065
Could we kind of say that this
is related to the idea of,

00:59:43.145 --> 00:59:47.984
of sampling from environments
outside of the train distribution?

00:59:48.194 --> 00:59:52.754
Because we think that like in, in,
in samples that are related to the

00:59:52.754 --> 00:59:57.285
distribution, that the agent would,
uh, encounter after it's deployed,

00:59:57.315 --> 01:00:00.105
would you, would you phrase it that
way or is it, is it going in different?

01:00:00.105 --> 01:00:00.225
Yes.

01:00:01.425 --> 01:00:03.825
I think that's pretty close.

01:00:03.855 --> 01:00:07.875
I would say basically everything
about that seems correct,

01:00:07.875 --> 01:00:11.625
except the part where you say,
like, uh, and it's probably going

01:00:11.625 --> 01:00:13.365
to arise in the test distribution.

01:00:13.725 --> 01:00:17.355
I think usually I just wouldn't
even try to like, um, check

01:00:17.355 --> 01:00:20.535
whether or not it would, uh, appear
in the test distribution.

01:00:20.565 --> 01:00:23.265
I just, I guess, like,
that's very hard to do.

01:00:23.265 --> 01:00:27.975
You don't know what's going, like, if
you knew how the test distribution was

01:00:27.975 --> 01:00:30.705
going to look and in what way it was
going to be different from the train

01:00:30.705 --> 01:00:33.915
distribution, then you should just change
your train distribution to be the test

01:00:33.915 --> 01:00:38.325
distribution, but like the fundamental
challenge of robustness is usually that

01:00:38.325 --> 01:00:40.035
you don't know what your test distribution

01:00:40.035 --> 01:00:41.025
is going to look like.

01:00:41.625 --> 01:00:42.615
So I would say it's more:

01:00:43.260 --> 01:00:48.060
we try to deliberately find situations
that are outside the training distribution,

01:00:48.450 --> 01:00:51.960
but where a human would agree that
there's like one unambiguously correct

01:00:51.960 --> 01:00:54.690
answer, um, and test it on those cases.

01:00:54.840 --> 01:00:58.290
Like maybe this will lead us to be too
conservative because like, actually

01:00:58.290 --> 01:01:02.460
the test was in a state that will never
actually come up in the test distribution.

01:01:03.030 --> 01:01:08.250
But given that we, it seems very hard
to know that, I think, um, it's still

01:01:08.250 --> 01:01:11.220
a good idea to be writing these tests
and to take failures fairly seriously.

01:01:12.060 --> 01:01:14.550
And this paper mentions
three types of robustness.

01:01:14.550 --> 01:01:17.550
Can you, um, briefly touch
on, on the three types?

01:01:17.730 --> 01:01:18.090
Yeah.

01:01:18.120 --> 01:01:22.290
So this is basically a categorization
that we found helpful in generating

01:01:22.290 --> 01:01:27.840
the tests, uh, and it's, uh, somewhat
specific to reinforcement learning agents.

01:01:27.960 --> 01:01:34.350
So the three types were state robustness,
which is, um, a case where like, basically

01:01:34.350 --> 01:01:38.820
these are test cases on which the
main thing that you've changed is the

01:01:39.330 --> 01:01:41.940
state in which the agent is operating.

01:01:42.540 --> 01:01:48.000
Then there's agent robustness, which
is, uh, when one of the other agents

01:01:48.000 --> 01:01:52.530
in the environment, uh, exhibits
some behavior that's like, uh,

01:01:52.590 --> 01:01:55.770
unusual and not what you expected.

01:01:56.400 --> 01:02:01.710
And then that can further be,
uh, decomposed into two types. There's

01:02:01.710 --> 01:02:07.710
agent robustness without memory
where, uh, even like where the,

01:02:07.790 --> 01:02:09.570
the test doesn't require the

01:02:10.350 --> 01:02:12.540
AI system to have any memory.

01:02:12.630 --> 01:02:15.060
There's like a correct action

01:02:15.060 --> 01:02:19.230
that seems determinable even if
the system doesn't have a memory.

01:02:19.530 --> 01:02:21.810
Uh, so this might be what you want to use

01:02:21.870 --> 01:02:28.500
if you, for some reason, are using, uh,
an MLP or a CNN as your architecture, and

01:02:28.500 --> 01:02:33.960
then there's agent robustness with memory,
uh, which is where the distribution shift

01:02:33.960 --> 01:02:40.110
happens from, uh, a, uh, partner agent
in the environment doing something

01:02:40.170 --> 01:02:44.250
where you have to actually like,
look at the behavior over time, notice

01:02:44.259 --> 01:02:49.350
that, uh, something is violating what
you expected during training, and then

01:02:49.350 --> 01:02:51.330
take some corrective action as a result.

01:02:51.870 --> 01:02:54.150
Uh, so there you need memory
in order to understand,

01:02:54.945 --> 01:02:59.205
um, how the partner agent is doing
something that wasn't what you expected.

01:02:59.385 --> 01:03:02.895
And then I guess when we're
dealing with a high dimensional

01:03:02.895 --> 01:03:07.095
state, there's just a ridiculous
number of permutations of situations.

01:03:07.185 --> 01:03:11.295
And we've seen in the past that, um, that
deep learning especially can be really

01:03:11.295 --> 01:03:15.225
sensitive to small seemingly meaningless
changes in this high dimensional state.

01:03:15.225 --> 01:03:18.855
So how do we, how, how could we
possibly think about scaling this

01:03:18.855 --> 01:03:22.695
up to a point where, uh, we don't
have to test every single thing.

01:03:22.995 --> 01:03:28.245
I think that basically this particular
approach, you mostly just, shouldn't

01:03:28.245 --> 01:03:29.685
try to scale up in this way.

01:03:30.135 --> 01:03:34.305
It's more meant to be a like first
quick sanity check that is already

01:03:34.305 --> 01:03:39.195
quite hard to pass, uh, for current systems,
where you're talking scores like 70%.

01:03:39.195 --> 01:03:44.295
I think once you get to, like, scores
like 95, 99%, uh, then it's

01:03:44.295 --> 01:03:47.325
like, okay, that's the point to
like, start thinking about scaling

01:03:47.325 --> 01:03:49.065
up. But like, suppose we got there.

01:03:49.875 --> 01:03:51.465
Uh, what do we then do?

01:03:51.945 --> 01:03:57.465
I don't think we really want to scale up
the, like the specific process of: humans

01:03:57.465 --> 01:03:59.805
think of tests, humans write down tests,

01:04:00.165 --> 01:04:02.715
uh, then we like run
those on the AI system.

01:04:03.375 --> 01:04:11.205
I think at that point, uh, we want to
migrate to a more like alignment flavored,

01:04:11.445 --> 01:04:16.245
uh, viewpoint, which I think we were going
to talk about in the near future anyway.

01:04:16.335 --> 01:04:21.315
Uh, but to give, uh, give some
advance, uh, to talk about

01:04:21.315 --> 01:04:22.695
that a little bit in advance.

01:04:23.175 --> 01:04:30.615
I think once we like scale up, we want
to try and find cases where the AI system

01:04:30.825 --> 01:04:33.915
does something bad that it knew was bad.

01:04:34.035 --> 01:04:37.095
It knew that it wasn't the thing
that its designers intended.

01:04:37.185 --> 01:04:40.455
And the reason that this allows you
to scale up is because now you can.

01:04:41.339 --> 01:04:46.710
Go and inspect the AI system and try to
find facts that it knows and like leverage

01:04:46.740 --> 01:04:49.109
those in order to create your test cases.

01:04:49.410 --> 01:04:52.950
And one hopes that the set
of things that the AI knows is,

01:04:52.980 --> 01:04:53.040
yeah,

01:04:53.805 --> 01:04:57.825
still plausibly a very large space, but
hopefully not an exponentially growing

01:04:57.825 --> 01:05:02.865
space, the way the state space is. And
the intuition for why this is okay

01:05:03.195 --> 01:05:09.165
is that, like, yes, the AI system might
end up, may end up having accidents and

01:05:09.165 --> 01:05:13.035
that wouldn't be caught if we were only
looking for cases where the AI system

01:05:13.035 --> 01:05:14.775
made a mistake that it knew was a mistake.

01:05:14.835 --> 01:05:17.985
But like, usually those
things aren't that bad.

01:05:18.165 --> 01:05:22.785
Uh, they can be if your AI system is like
in a nuclear power plant, for example,

01:05:22.815 --> 01:05:28.965
or, uh, in some like, uh, in a weapon
system, perhaps, but like in many cases,

01:05:28.995 --> 01:05:33.465
it's not actually that bad for the,
your AI system to make an accidental error.

01:05:34.400 --> 01:05:39.710
The really bad errors are the ones where
the system is like intentionally making

01:05:39.710 --> 01:05:44.030
an error, uh, or making something that is
bad from the perspective of the designers.

01:05:44.510 --> 01:05:47.210
Those are, those are like
really bad situations and you

01:05:47.210 --> 01:05:48.410
don't want to get into them.

01:05:48.680 --> 01:05:52.010
And so I'm most interested in like
thinking of like how we can avoid that.

01:05:52.130 --> 01:05:55.190
Uh, and so then you can like try
to leverage the agent's knowledge

01:05:55.190 --> 01:05:56.660
to construct inputs that

01:05:56.660 --> 01:05:58.340
you can then test the AI system on.

01:05:58.520 --> 01:06:01.340
So this is a great segue
to the alignment section.

01:06:01.430 --> 01:06:05.120
Um, so how do you define
alignment in AI?

01:06:05.510 --> 01:06:09.710
Maybe I will give you two definitions,
uh, that are like slightly

01:06:09.710 --> 01:06:11.240
different, but mostly the same.

01:06:11.660 --> 01:06:17.120
So one is that an AI system is
misaligned, so, not aligned, uh,

01:06:17.210 --> 01:06:21.980
if it takes actions that it knew,

01:06:22.950 --> 01:06:26.520
uh, were against the
wishes of its designers.

01:06:26.669 --> 01:06:31.290
That's basically the definition that
I was just giving earlier. A different,

01:06:31.410 --> 01:06:37.470
more positive definition of AI alignment
is, uh, is that an AI system is aligned

01:06:37.830 --> 01:06:42.779
if it is trying to do what its, uh,
designers intended for it to do.

01:06:43.410 --> 01:06:48.060
And is there some, um, agreed
upon taxonomy of like top

01:06:48.060 --> 01:06:49.890
level topics in alignment?

01:06:49.980 --> 01:06:55.439
Um, like how does it relate to
concepts like AI safety and human

01:06:55.439 --> 01:06:57.689
feedback, the different things
that we talked about today?

01:06:57.750 --> 01:07:01.020
How do we, how would we, uh, arrange
these at a kind of high level?

01:07:01.410 --> 01:07:05.100
There is definitely not a
canonical taxonomy of topics.

01:07:05.100 --> 01:07:06.779
There's not even a canonical definition.

01:07:07.590 --> 01:07:14.610
So like the one I gave doesn't include
the problem, for example, of how you

01:07:14.610 --> 01:07:19.020
resolve disagreements between humans,
on what the AI system should do.

01:07:19.530 --> 01:07:23.610
It just says, all right, there is
some designers, they wanted something.

01:07:23.730 --> 01:07:26.280
That's what the AI system
is supposed to be doing.

01:07:26.880 --> 01:07:29.430
Uh, and it doesn't talk about
like, all right, the process

01:07:29.430 --> 01:07:32.640
by which those designers decide
what the AI system is intended to do.

01:07:32.640 --> 01:07:35.340
That's like not, not a part of
the problem as I'm defining it.

01:07:35.520 --> 01:07:37.530
It's obviously still an important problem.

01:07:37.710 --> 01:07:42.120
Just like not part of this definition,
uh, as I gave it, but other people

01:07:42.120 --> 01:07:44.160
would say, no, that's a bad definition.

01:07:44.160 --> 01:07:45.480
You should include that problem.

01:07:46.140 --> 01:07:48.330
So there's not even a
canonical definition.

01:07:48.600 --> 01:07:55.350
So I think I will just give you maybe
my taxonomy of alignment topics.

01:07:55.560 --> 01:08:00.420
So in terms of how alignment
relates to AI safety, uh, there's

01:08:00.540 --> 01:08:03.960
this sort of general big picture
question of like, how do we get,

01:08:04.815 --> 01:08:10.485
or, will AI be beneficial for humanity,
which you might call AI safety or

01:08:10.545 --> 01:08:12.495
AI beneficialness or something.

01:08:12.675 --> 01:08:17.685
And then that you can break down into a
few possible, uh, possible categories.

01:08:17.895 --> 01:08:23.654
I quite like the, I'm gonna forget where
the, where I, where this taxonomy comes

01:08:23.654 --> 01:08:30.375
from, but I liked the taxonomy into
accidents, misuse and structural risks.

01:08:31.125 --> 01:08:33.465
So accidents are exactly
what they sound like.

01:08:33.495 --> 01:08:37.455
Accidents happen when an AI system
does something bad and nobody intended

01:08:37.455 --> 01:08:39.435
for the AI system to do that thing.

01:08:39.645 --> 01:08:41.895
Um, misuse is also exactly
what it sounds like.

01:08:41.895 --> 01:08:47.535
It's when, it's when somebody gets an AI
system to do something, and the

01:08:47.535 --> 01:08:51.225
thing that they got the AI system to do was
something that we didn't actually want.

01:08:51.225 --> 01:08:55.005
So think of like terrorists,
um, using AI systems,

01:08:55.785 --> 01:09:02.804
um, to like assassinate people. Uh, and
structural risks are maybe less obvious

01:09:02.804 --> 01:09:06.705
than the previous two, but structural
risks happen when, you know, if, as

01:09:06.705 --> 01:09:12.194
we infuse AI systems into our economy,
do any new sorts of problems arise?

01:09:12.194 --> 01:09:15.585
Do we get into like races
to the bottom on safety?

01:09:15.585 --> 01:09:19.844
Do we get to, do we have like a
whole bunch of increased economic

01:09:19.844 --> 01:09:24.764
competition that causes us to sacrifice
many, to sacrifice many of our

01:09:24.764 --> 01:09:26.625
values in the name of productivity?

01:09:27.104 --> 01:09:28.394
Uh, stuff like that.

01:09:28.604 --> 01:09:34.154
So that's like one starting categorization:
accidents, misuse, structural risk. And

01:09:34.245 --> 01:09:39.335
within accidents you can have, you
can then further separate into,

01:09:40.275 --> 01:09:45.795
uh, accidents where the system knew that
the thing that it was doing was bad and

01:09:45.795 --> 01:09:49.545
accidents where the system didn't know
that the thing that it was doing was bad.

01:09:49.875 --> 01:09:55.125
And the first one is AI alignment,
according to my definition, which

01:09:55.125 --> 01:09:59.205
again is not a canonical definition. I
think it's maybe the most common

01:09:59.205 --> 01:10:01.665
definition, but it's like not canonical.

01:10:02.145 --> 01:10:06.945
So that was like how alignment relates
to AI safety. And then like, how does the

01:10:06.955 --> 01:10:08.715
stuff we've been talking about today

01:10:08.745 --> 01:10:09.975
relate to alignment?

01:10:10.545 --> 01:10:12.395
Again, people will disagree with me on this.

01:10:13.290 --> 01:10:20.099
But according to me, the way to build
aligned AI systems, in the sense

01:10:20.099 --> 01:10:24.839
of AI, uh, systems that don't
take bad actions that they knew were

01:10:24.839 --> 01:10:31.679
bad is that you use a lot of human
feedback to train your AI system to

01:10:32.070 --> 01:10:35.790
where like the human feedback, you
know, it rewards the AI system when

01:10:35.790 --> 01:10:40.230
it does things, stuff that humans want
and, uh, punishes the AI system

01:10:40.320 --> 01:10:43.620
when the system does things that
the human doesn't want. This

01:10:43.620 --> 01:10:45.000
doesn't solve the entire problem.

01:10:45.000 --> 01:10:49.860
You, you basically then just want
to like make your human, the people

01:10:49.860 --> 01:10:53.370
providing your feedback, as powerful as,

01:10:54.195 --> 01:10:56.025
make them as competent as possible.

01:10:56.355 --> 01:10:59.325
So maybe you could do some
interpretability with the model that

01:10:59.325 --> 01:11:03.764
you're training, um, in order to
like, understand how exactly it's like

01:11:03.764 --> 01:11:08.894
reasoning, how it's making decisions,
you can then feed that information to

01:11:09.195 --> 01:11:11.175
the humans who are providing feedback.

01:11:11.175 --> 01:11:16.665
And thus, this can then maybe allow them
to, uh, not just select AI systems that

01:11:16.675 --> 01:11:19.545
get the right outcomes, but now they
can select AI systems that get the

01:11:19.545 --> 01:11:21.465
right outcomes for the right reasons.

01:11:21.795 --> 01:11:23.684
And that can help you get more robustness.

01:11:24.465 --> 01:11:31.065
Uh, you could imagine that you have
some other AI systems that are in

01:11:31.065 --> 01:11:36.165
charge of like finding new hypothetical
inputs on which the system that

01:11:36.165 --> 01:11:38.054
you're training takes a bad action.

01:11:38.175 --> 01:11:41.195
Um, so like this AI system says,
like, here's this hypothetical,

01:11:42.330 --> 01:11:45.419
uh, here's this input on which your
AI system is doing a bad thing.

01:11:45.419 --> 01:11:47.639
And then the humans are
like, oh, that's bad.

01:11:47.700 --> 01:11:52.410
Let's put it in the training data set, um,
and give good feedback on it and so on.

01:11:52.860 --> 01:11:56.820
So then I think BASALT would be
maybe the most obviously connected

01:11:56.820 --> 01:12:01.110
here where it was about how do you just
train anything with human feedback,

01:12:01.110 --> 01:12:03.599
which is obviously a core thing I've
been talking about in this plan.

01:12:03.719 --> 01:12:06.690
Um, Preferences Implicit
in the State of the World,

01:12:07.320 --> 01:12:09.509
it's less clear how that relates here.

01:12:09.509 --> 01:12:14.370
I think that paper makes
more sense in a plan

01:12:14.459 --> 01:12:17.950
that's more like traditional
value alignment, where your AI

01:12:17.950 --> 01:12:23.190
system, like, has an
explicit distribution over theta

01:12:23.219 --> 01:12:25.139
that it's updating by evidence.

01:12:25.620 --> 01:12:30.240
So I think that one is less relevant
to the, to the, to this description.

01:12:30.389 --> 01:12:33.089
The Benefits of Assistance
paper is, I think,

01:12:33.795 --> 01:12:37.245
primarily a statement about
what the AI system should do.

01:12:37.815 --> 01:12:42.375
And so like what we want our human
feedback providers to be doing is to be

01:12:42.375 --> 01:12:49.275
seeing, Hey, is this AI system, like,
thinking about what, uh, what its users

01:12:49.275 --> 01:12:53.054
will want, um, if it's uncertain about
what the users will want, does it like

01:12:53.085 --> 01:12:58.754
ask for clarification or does it just
like guess? Um, we probably want it to ask

01:12:58.754 --> 01:13:02.445
for clarification rather than guessing
if it's a sufficiently important thing.

01:13:02.835 --> 01:13:06.884
Uh, but if it's like some probably
insignificant thing, then it's

01:13:06.884 --> 01:13:08.325
like fine, if it can guess.

01:13:08.804 --> 01:13:13.245
And so through the human feedback
you can then like train a system that's

01:13:13.245 --> 01:13:20.174
being very assistive. The Overcooked
paper, uh, On the Utility

01:13:20.174 --> 01:13:23.565
of Learning about
Humans for Human-AI Coordination,

01:13:24.375 --> 01:13:31.214
uh, that one is, I think, not that
relevant to this plan, unless you

01:13:31.255 --> 01:13:34.964
happened to be building an AI system
that is playing a collaborative game.

01:13:35.205 --> 01:13:41.594
The Evaluating the Robustness paper is,
uh, more relevant in that, like part

01:13:41.594 --> 01:13:45.075
of the thing that these human feedback
providers are going to be doing is to

01:13:45.075 --> 01:13:50.115
like, be constructing these hypothetic,
be constructing inputs on which the

01:13:50.115 --> 01:13:54.915
AI system, uh, behaves badly and then
training the AI system not to behave badly

01:13:54.915 --> 01:13:57.825
on those inputs. Uh, so in that sense,

01:13:57.825 --> 01:14:02.384
it's, uh, it also fits
into this overall story.

01:14:02.655 --> 01:14:03.014
Cool.

01:14:03.014 --> 01:14:03.375
Okay.

01:14:03.375 --> 01:14:06.674
Can you mention a bit about
your alignment newsletter?

01:14:06.855 --> 01:14:11.365
Um, like what, what, how do you, how
do you define that newsletter and

01:14:11.655 --> 01:14:13.155
how did you, how did you start that?

01:14:13.245 --> 01:14:14.594
And what's happening with the newsletter now?

01:14:14.594 --> 01:14:20.775
The alignment newsletter is
supposed to be a weekly newsletter

01:14:20.775 --> 01:14:22.545
that I write that summarizes

01:14:23.670 --> 01:14:26.460
just recent content
relevant to AI alignment.

01:14:27.059 --> 01:14:33.360
It has not been very weekly in the
last couple of months because I've

01:14:33.360 --> 01:14:37.230
been busy, but I do intend to go back
to making it a weekly newsletter.

01:14:38.130 --> 01:14:40.349
I mean, the origin story is kind of funny.

01:14:40.410 --> 01:14:45.240
It was just, we, this was while I
was a PhD student at the Center for

01:14:45.240 --> 01:14:47.490
Human-Compatible AI at UC Berkeley.

01:14:48.360 --> 01:14:52.650
Uh, we were just discussing that, like,
there were a lot of papers that were

01:14:52.650 --> 01:14:58.080
coming out all the time, uh, as people
will probably be familiar with and it

01:14:58.080 --> 01:14:59.490
was hard to keep track of them all.

01:14:59.910 --> 01:15:03.750
Um, and so someone suggested
that, Hey, maybe we should have

01:15:03.780 --> 01:15:06.510
a rotation of people who just.

01:15:07.380 --> 01:15:11.790
Uh, search for all of the new papers
that have arrived in the past week.

01:15:11.790 --> 01:15:14.820
And just send an email out
to everyone just like lists

01:15:14.850 --> 01:15:16.170
giving links to those papers.

01:15:16.170 --> 01:15:18.360
So other people don't have
to do the search themselves.

01:15:18.930 --> 01:15:22.740
And I said like, look, I, you
know, I just do this every week

01:15:22.740 --> 01:15:23.160
anyway.

01:15:23.190 --> 01:15:27.450
I'm just happy to take on this job.
Sending an, uh, sending one email with a

01:15:27.450 --> 01:15:31.530
bunch of links is not hard. Uh, we don't
need to have this rotation of people.

01:15:32.250 --> 01:15:37.350
Um, so I did that internally to CHAI,
uh, then like, you know, a couple of

01:15:37.350 --> 01:15:42.000
weeks later, I like added a sentence
that was telling people, Hey, this is

01:15:42.000 --> 01:15:47.130
what this is like the topic, um, here
is, you know, maybe you should read it

01:15:47.130 --> 01:15:49.740
if you are interested in X, Y, and Z.

01:15:50.550 --> 01:15:53.520
Uh, and so that happened for a while.

01:15:53.520 --> 01:15:55.530
And then I think I started writing

01:15:56.355 --> 01:15:59.415
slightly more extensive summaries
so that people didn't have to read the

01:15:59.415 --> 01:16:04.005
paper, uh, unless it was something they
were particularly interested in, uh, and

01:16:04.005 --> 01:16:07.785
right around that point, people were
like, this is actually quite useful.

01:16:07.995 --> 01:16:09.135
You should make it public.

01:16:09.615 --> 01:16:13.845
Uh, and then I like tested it a bit
more, um, maybe for another, like

01:16:13.995 --> 01:16:16.665
three to four weeks internally to CHAI.

01:16:17.085 --> 01:16:20.325
And then I, um, after
that I released it publicly.

01:16:21.300 --> 01:16:25.020
Uh, it still did undergo
a fair amount of improvement.

01:16:25.020 --> 01:16:30.180
I think maybe after like 10 to 15
newsletters was when it felt more stable.

01:16:30.300 --> 01:16:30.630
Yeah.

01:16:30.660 --> 01:16:34.110
And now it's like, apart from the
fact that I've been too busy to do it

01:16:34.140 --> 01:16:39.300
recently, it's been pretty stable for
the last, I don't know, two years or so.

01:16:40.020 --> 01:16:43.470
Well, uh, to the audience, I
highly recommend the newsletter.

01:16:43.500 --> 01:16:47.610
And, uh, like I mentioned, you know,
when I first met you and heard about

01:16:47.610 --> 01:16:51.120
your alignment newsletter early
on at that point, I really wasn't,

01:16:51.120 --> 01:16:55.380
um, I didn't really appreciate the, the
importance of alignment, uh, issues.

01:16:55.410 --> 01:16:58.800
And, and I gotta say that really
changed for me when I read the

01:16:58.800 --> 01:17:02.640
book Human Compatible by Professor
Stuart Russell, who I gather is

01:17:02.640 --> 01:17:04.260
one of your PhD advisors.

01:17:05.250 --> 01:17:07.950
And so that book really helped
me appreciate the importance

01:17:07.950 --> 01:17:09.480
of alignment related stuff.

01:17:09.510 --> 01:17:13.500
And it was part of the reason that I, that
I sought you out to interview you.

01:17:13.500 --> 01:17:17.160
So I, I'm happy to recommend that,
uh, plug that book to the audience.

01:17:17.730 --> 01:17:19.080
Uh, Professor Russell's awesome.

01:17:19.080 --> 01:17:22.050
And it's a very well-written book
and, uh, and full of great insight.

01:17:22.320 --> 01:17:22.740
Yep.

01:17:22.950 --> 01:17:24.630
I also strongly recommend this book.

01:17:24.690 --> 01:17:28.290
And since we're on the topic of the
alignment newsletter, you can read

01:17:28.290 --> 01:17:32.880
my summary of, uh, Stuart Russell's
book in order to get a sense of

01:17:32.880 --> 01:17:36.750
what it talks about, uh, before
you actually make the commitment of

01:17:36.750 --> 01:17:37.980
actually reading the entire book.

01:17:38.580 --> 01:17:42.690
Um, so you can find that on my
website under, uh, alignment newsletter,

01:17:42.690 --> 01:17:44.040
there's a list of past issues.

01:17:44.925 --> 01:17:47.475
I think this was newsletter edition 69.

01:17:47.625 --> 01:17:49.605
Not totally sure you can check that.

01:17:49.995 --> 01:17:51.135
And what was your website again?

01:17:51.225 --> 01:17:53.475
It's just my first name and last name.

01:17:53.535 --> 01:17:55.155
Rohinshah.com.

01:17:55.365 --> 01:17:56.055
Okay, cool.

01:17:56.055 --> 01:17:59.115
I highly recommend doing
that, um, to the audience.

01:17:59.205 --> 01:18:05.625
And so I wanted to ask you about how,
you know, how alignment work is done.

01:18:05.655 --> 01:18:09.455
So a common pattern that, you know,
we might be familiar with that in,

01:18:09.455 --> 01:18:13.305
in many ML papers is to show a new
method and show some experiments.

01:18:13.905 --> 01:18:17.655
Um, but is alignment, uh, is work in
alignment fundamentally different?

01:18:17.655 --> 01:18:21.135
Like what does the work
entail in, in alignment?

01:18:21.335 --> 01:18:24.705
Is there a lot of thought experiments
or how would you describe that?

01:18:25.005 --> 01:18:27.405
Uh, there's a big variety of things.

01:18:27.435 --> 01:18:33.585
So some alignment work, um, is
in fact pretty similar to, uh,

01:18:33.645 --> 01:18:36.705
existing, uh, to typical ML work.

01:18:37.335 --> 01:18:39.705
Um, so for example, there's
a lot of alignment work

01:18:39.705 --> 01:18:42.675
that's like, can we make
human feedback algorithms better?

01:18:43.815 --> 01:18:49.095
Uh, and you know, you start with
some baseline and some task or

01:18:49.095 --> 01:18:52.635
environment in which you want to
get an AI system to do something.

01:18:53.265 --> 01:18:56.445
And then you like try to improve
upon the baseline, using some

01:18:56.445 --> 01:18:57.555
ideas that you thought about.

01:18:58.335 --> 01:19:01.905
Uh, and like, you know, maybe
it's somewhat different because

01:19:01.905 --> 01:19:03.075
you're using human feedback.

01:19:03.075 --> 01:19:07.545
Whereas typical ML, uh, ML research
doesn't involve human feedback, but

01:19:07.545 --> 01:19:08.775
that's not that big a difference.

01:19:08.805 --> 01:19:10.935
It's still like mostly the same skills.

01:19:11.685 --> 01:19:16.575
Uh, so that's probably the kind that's
closest to existing ML research.

01:19:16.635 --> 01:19:20.955
There's also like a lot of
interpretability work, which again is

01:19:20.955 --> 01:19:25.125
just like working with actual machine
learning models and trying to figure

01:19:25.125 --> 01:19:25.965
out what the heck they're doing.

01:19:26.644 --> 01:19:31.144
That also seems pretty, it's like not the same
thing as, like, get better performance

01:19:31.144 --> 01:19:35.974
on those tasks, but it's still like
pretty similar to the general feel of, like,

01:19:35.974 --> 01:19:38.195
some parts of the, of machine learning.

01:19:38.315 --> 01:19:42.394
So that's like one kind of one
type of alignment research.

01:19:42.514 --> 01:19:46.775
And then there's, you know, on the
complete other side that there is a

01:19:46.775 --> 01:19:52.085
bunch of stuff where you're like, where
you think very abstractly about what

01:19:52.085 --> 01:19:54.155
future AI systems are going to look like.

01:19:54.155 --> 01:20:00.215
So like, maybe you're like, all right,
maybe you think about some story by

01:20:00.215 --> 01:20:03.005
which you might, by which AGI might arise.

01:20:03.065 --> 01:20:07.745
Like we run such and such algorithm,
maybe with some improvements

01:20:07.835 --> 01:20:12.455
in the, in various architectures,
with like such and such data

01:20:12.695 --> 01:20:16.535
and you get a, and it turns out
you can get AGI out of this.

01:20:17.105 --> 01:20:20.465
Uh, then you maybe like think
in this hypothetical, okay,

01:20:21.405 --> 01:20:24.165
uh, does this AGI end
up getting misaligned?

01:20:24.165 --> 01:20:26.505
If so, how, how does it get misaligned?

01:20:26.505 --> 01:20:27.015
If yes,

01:20:27.555 --> 01:20:31.905
um, well, you tell that story and you're
like, okay, now I have a story of like

01:20:31.905 --> 01:20:35.355
how the, uh, AGI system was misaligned.

01:20:35.415 --> 01:20:38.535
What would I need to do in order to
like, prevent this from happening?

01:20:39.045 --> 01:20:43.245
Um, so you can do like pretty elaborate,
uh, conceptual thought experiments.

01:20:43.545 --> 01:20:47.775
I think these are usually good as a
way of ensuring that the things that

01:20:47.775 --> 01:20:50.325
you're working on are actually useful.

01:20:50.475 --> 01:20:55.275
I think there are a few people
who do these sorts of conceptual

01:20:55.275 --> 01:20:57.345
arguments almost always

01:20:57.375 --> 01:21:01.425
and do them well, such that I'm
like, yeah, this stuff they're

01:21:01.425 --> 01:21:05.205
producing, I think is probably
going to matter in the future.

01:21:05.535 --> 01:21:09.645
But I think it's also very easy
to end up not very grounded in

01:21:09.645 --> 01:21:10.695
what's actually going to happen,

01:21:11.355 --> 01:21:14.894
such that you end up saying things that
won't actually be true in the future

01:21:14.985 --> 01:21:19.905
and, notably, like,
there is some reasonably easy to find

01:21:19.905 --> 01:21:22.905
argument today that could convince
you that the things you're saying are

01:21:23.144 --> 01:21:25.365
not going to matter in the future.

01:21:25.575 --> 01:21:28.575
So it's pretty hard to do this
research because of the lack of

01:21:28.665 --> 01:21:30.255
actual empirical feedback loops.

01:21:30.495 --> 01:21:31.875
But I don't think it is doomed.

01:21:32.054 --> 01:21:36.224
Um, I think people do in fact get, um,
some interesting results out of this

01:21:36.224 --> 01:21:40.755
and often the results out of this,
that the best results out of this line

01:21:40.755 --> 01:21:44.175
of work, uh, usually seem better to
me than the results that we get out

01:21:44.445 --> 01:21:46.005
of the empirical line of work.

01:21:46.215 --> 01:21:49.934
So you mentioned your newsletter,
and then there's the Alignment Forum.

01:21:49.965 --> 01:21:53.445
If I understand, that's what,
that sprang out of LessWrong.

01:21:53.505 --> 01:21:54.065
Is that, is that.

01:21:55.065 --> 01:21:57.075
I don't know if I would say
it sprang out of LessWrong.

01:21:57.075 --> 01:22:00.014
It was meant to be at least somewhat
separate from it, but it's definitely

01:22:00.014 --> 01:22:03.254
very, it's definitely affiliated with
LessWrong and like everything on

01:22:03.254 --> 01:22:04.844
it gets cross-posted to LessWrong.

01:22:05.144 --> 01:22:07.245
And so these are pretty
advanced resources.

01:22:07.245 --> 01:22:11.115
I mean, from my point of view, um,
but to the audience who maybe is just

01:22:11.115 --> 01:22:14.865
getting started with these ideas, can
you recommend, uh, you know, a couple

01:22:14.865 --> 01:22:18.134
of resources that might be good for
them to get like an on-ramp for them?

01:22:18.165 --> 01:22:21.315
Um, I guess including
Human Compatible, but anything

01:22:21.315 --> 01:22:22.065
else you'd want to mention?

01:22:22.304 --> 01:22:22.545
Yeah.

01:22:22.545 --> 01:22:24.974
So Human Compatible is a
pretty good suggestion.

01:22:25.065 --> 01:22:27.375
Um, there are other books as well.

01:22:27.405 --> 01:22:31.455
Um, so Superintelligence is
more on the philosophy side.

01:22:31.905 --> 01:22:37.844
Uh, The Alignment Problem by Brian
Christian is less on the like, uh,

01:22:37.875 --> 01:22:41.325
has a little bit less on like what,
what might solutions look like?

01:22:41.405 --> 01:22:44.775
It has more of the like intellectual
history behind how, how these

01:22:44.775 --> 01:22:47.054
concerns started arising. Um, Life

01:22:47.085 --> 01:22:49.214
3.0 by Max Tegmark.

01:22:49.665 --> 01:22:51.884
I don't remember.

01:22:53.130 --> 01:22:54.990
How much it talks about alignment.

01:22:54.990 --> 01:22:57.390
I assume it does a decent amount.

01:22:58.500 --> 01:23:02.340
Uh, but that's, that's another
option. Apart from books,

01:23:02.850 --> 01:23:09.360
I think, so the Alignment Forum
has, um, sequences of blog posts

01:23:09.360 --> 01:23:14.490
that are, that, that don't require
quite as much, um, technical depth.

01:23:14.640 --> 01:23:19.530
So for example, it's got the
value learning sequence, which

01:23:19.530 --> 01:23:24.210
I, well, which I half wrote, half
curated other people's posts.

01:23:24.780 --> 01:23:30.360
Um, so I think that's a good introduction
to some of the ideas in alignment.

01:23:31.200 --> 01:23:34.740
Uh, there's the embedded agency
sequence also on the Alignment

01:23:34.740 --> 01:23:38.820
Forum and the iterated amplification
sequence on the Alignment Forum.

01:23:39.360 --> 01:23:44.130
Oh, there's the, there's an
AGI Safety Fundamentals course.

01:23:44.400 --> 01:23:45.930
And then you can just Google it.

01:23:45.960 --> 01:23:48.510
It has a publicly available curriculum.

01:23:49.290 --> 01:23:53.760
I believe, I think really ignore all
the other suggestions, look at that

01:23:53.760 --> 01:23:56.820
curriculum and then read things on

01:23:56.820 --> 01:23:59.010
there is probably actually my advice.

01:23:59.370 --> 01:24:03.630
Have you seen any, uh, depictions
of, of alignment issues in science

01:24:03.630 --> 01:24:07.950
fiction or, um, these, these ideas
come up for you when you, when

01:24:07.950 --> 01:24:09.720
you watch or read, read sci-fi?

01:24:10.890 --> 01:24:13.050
They definitely come up to some extent.

01:24:13.080 --> 01:24:17.040
I think there are many ways in which
the depictions aren't realistic, but

01:24:17.040 --> 01:24:22.200
like they do come up or I guess even
outside or just, uh, even mythology,

01:24:22.200 --> 01:24:26.280
like the whole Midas touch thing seems
like a perfect example of a misalignment.

01:24:26.460 --> 01:24:26.580
Yeah.

01:24:26.610 --> 01:24:27.060
Yeah.

01:24:27.090 --> 01:24:29.820
The King Midas example is a good example.

01:24:29.820 --> 01:24:30.090
I do.

01:24:31.620 --> 01:24:32.040
Yeah.

01:24:33.060 --> 01:24:33.360
Yeah.

01:24:33.360 --> 01:24:34.530
Those are good examples.

01:24:34.560 --> 01:24:34.770
Yeah.

01:24:34.770 --> 01:24:35.220
That's true.

01:24:35.220 --> 01:24:39.960
If you, if you expand to include
mythology in general, I feel

01:24:39.960 --> 01:24:41.640
like it's probably everywhere.

01:24:42.120 --> 01:24:45.360
Um, especially if you include things
like you asked for something and got

01:24:46.230 --> 01:24:49.019
what you literally asked for,
but not what you actually meant.

01:24:49.349 --> 01:24:50.550
That's really common, isn't it?

01:24:50.760 --> 01:24:51.269
Yeah.

01:24:51.450 --> 01:24:52.080
In stories.

01:24:52.470 --> 01:24:52.890
Yeah.

01:24:52.950 --> 01:24:55.960
I mean, we've got, like, I
could just take any story

01:24:55.960 --> 01:24:58.290
about genies, and it'll probably
have this feature.

01:24:59.370 --> 01:25:03.030
Um, so they really started the, uh,
alignment, uh, literature back then,

01:25:03.030 --> 01:25:08.250
I guess, thousands of years old,
the problem of there are two people.

01:25:08.250 --> 01:25:11.670
One person wants the other person
to do something. That's just like

01:25:11.670 --> 01:25:15.480
a very important, fundamental
problem that you need to deal with.

01:25:15.870 --> 01:25:19.019
There's like tons of stuff also
in economics about this, right?

01:25:19.019 --> 01:25:22.380
The principal-agent problem. And
like the alignment problem is

01:25:22.380 --> 01:25:23.760
not literally the same thing

01:25:24.000 --> 01:25:25.200
as the principal-agent problem.

01:25:25.220 --> 01:25:29.580
It assumes that the agent already has
some motivation, some utility function.

01:25:29.580 --> 01:25:32.490
And you are like trying to incentivize
them to do the things that you want.

01:25:32.790 --> 01:25:34.650
Whereas in AI alignment,
you get to build the

01:25:35.490 --> 01:25:37.080
agent that you're delegating to.

01:25:37.110 --> 01:25:38.910
And so you have more control over it.

01:25:39.330 --> 01:25:44.460
So there are differences, but like
fundamentally the like entity A wants

01:25:44.460 --> 01:25:50.880
entity B to do something for entity A
is like just a super common pattern that

01:25:51.360 --> 01:25:53.519
human society has thought about a lot.

01:25:55.260 --> 01:25:57.630
So we have some more
contributed questions.

01:25:57.690 --> 01:26:02.639
Uh, this is one from Nathan Lambert,
a PhD student at UC Berkeley

01:26:02.670 --> 01:26:04.500
doing research on robot learning.

01:26:04.559 --> 01:26:07.470
And, uh, Nathan was our
guest for episode 19.

01:26:07.500 --> 01:26:11.730
So Nathan says a lot of AI
alignment and AGI safety work

01:26:11.730 --> 01:26:13.650
happens on blog posts and forums.

01:26:13.710 --> 01:26:17.580
Uh, what's the right manner to draw more
attention from the academic community?

01:26:17.790 --> 01:26:18.599
Any comment on that?

01:26:19.019 --> 01:26:26.190
I think, um, I think that this is
basically a reasonable strategy where

01:26:26.190 --> 01:26:33.090
like, by, by doing this work on blog posts
and forums, people can move a lot faster.

01:26:33.945 --> 01:26:39.735
Uh, like ML is pretty good in
that, uh, like relative to other

01:26:39.735 --> 01:26:43.005
academic fields, you know, it doesn't
take years to publish your paper.

01:26:43.005 --> 01:26:45.045
It only takes some months
to publish your paper.

01:26:45.705 --> 01:26:49.635
Uh, but with blog posts and forums, it can
be days to talk about your ideas.

01:26:50.085 --> 01:26:55.515
Um, so you can move a lot faster if
you're trusting in everyone's ability

01:26:55.515 --> 01:26:59.595
to like, understand which work is
good, um, and what to build on.

01:26:59.985 --> 01:27:03.645
Uh, and so that's like, I think the
main benefit of blog posts and forums,

01:27:03.675 --> 01:27:08.565
but then as a result, anyone who isn't
an expert, correctly, doesn't end up

01:27:08.565 --> 01:27:12.165
reading the blog posts and forums,
because there's not, it's a little

01:27:12.165 --> 01:27:16.785
hard if you're not an expert to extract
the signal and ignore the noise.

01:27:17.355 --> 01:27:22.845
So I think then there's like a
separate group of people and not say

01:27:22.845 --> 01:27:25.845
they're not a separate group, but
there's a group of people who then

01:27:25.845 --> 01:27:31.665
takes a bunch of these ideas and then
tries and then converts them into

01:27:33.059 --> 01:27:36.450
more rigorous, uh, and correct

01:27:36.480 --> 01:27:41.460
and academically presented,
um, ideas and, and papers.

01:27:41.519 --> 01:27:46.170
And that's the thing that you can like,
uh, show to the academic community

01:27:46.170 --> 01:27:48.300
in order to draw more attention.

01:27:48.720 --> 01:27:52.200
In fact, we've just been working
on a project along these lines

01:27:52.200 --> 01:27:55.830
at DeepMind, which hopefully we'll
release soon talking about the

01:27:55.830 --> 01:27:58.410
risks from, uh, inner misalignment.

01:27:58.740 --> 01:28:05.670
So yeah, I think roughly my story is you
figure out conceptually what you want

01:28:05.670 --> 01:28:08.400
to do via the blog posts and forums.

01:28:08.639 --> 01:28:14.099
And then you like make it rigorous
and have experiments and like demonstrate

01:28:14.099 --> 01:28:19.559
things with, um, actual examples
instead of hypothetical ones, uh,

01:28:19.620 --> 01:28:21.630
in the format of an academic paper.

01:28:21.660 --> 01:28:26.070
And that's how you then like,
make it, um, credible enough and

01:28:26.070 --> 01:28:28.500
convincing enough to draw attention
from the academic community.

01:28:30.135 --> 01:28:30.525
Great.

01:28:30.525 --> 01:28:34.485
And then Taylor Killian asks
so Taylor's a PhD student at

01:28:34.485 --> 01:28:36.225
U of T and the Vector Institute.

01:28:36.745 --> 01:28:38.685
Taylor was our guest for episode 13.

01:28:39.285 --> 01:28:42.645
And Taylor asks, how can we
approach the alignment problem when

01:28:42.645 --> 01:28:47.625
faced with heterogeneous behavior
from possibly many human actors?

01:28:47.925 --> 01:28:52.665
I think my interpretation of
this question is that, you know,

01:28:52.755 --> 01:28:56.595
humans sometimes disagree on what
things to value and similarly

01:28:56.595 --> 01:29:01.395
disagree on what behaviors they, they
exhibit and want the AI to exhibit.

01:29:02.175 --> 01:29:08.835
Um, so how do you get the AI to decide on
one set of values or one set of behaviors?

01:29:09.465 --> 01:29:16.935
And as I talked about a little bit
before, I mostly just take this question

01:29:16.935 --> 01:29:20.925
and say it is outside of the scope of
the things that I usually think about.

01:29:21.075 --> 01:29:25.095
I'm usually just, I'm usually thinking
about the designers have something

01:29:25.095 --> 01:29:26.775
in mind that they want the system

01:29:27.570 --> 01:29:31.470
to do. Did the AI system actually
do that thing, or at least,

01:29:31.480 --> 01:29:33.000
is it trying to do that thing?

01:29:33.450 --> 01:29:37.260
I do think that this problem is in
fact an important problem, but I think

01:29:37.260 --> 01:29:42.720
what you, the way, what your solution,
like the solutions are probably going

01:29:42.720 --> 01:29:50.550
to be more like political, um, or like
societal rather than technical, where,

01:29:51.030 --> 01:29:55.950
you know, you have to negotiate with
other people to figure out what exactly

01:29:55.950 --> 01:29:58.350
you want your AI systems to be doing.

01:29:58.650 --> 01:30:01.920
And then you like take that, take
that like simple spec and you

01:30:01.920 --> 01:30:03.630
hand it off to the AI designers.

01:30:03.630 --> 01:30:04.980
And then the AI designers
say, all right.

01:30:05.040 --> 01:30:05.460
All right.

01:30:05.490 --> 01:30:08.040
Now we will make an AI
system with the spec.

01:30:08.100 --> 01:30:08.460
Yeah.

01:30:08.520 --> 01:30:13.290
So, so I would say it's like, yeah,
there's a separate problem of like how

01:30:13.290 --> 01:30:17.880
to go from human society to something
that we can put inside of an AI.

01:30:18.300 --> 01:30:21.420
This is like the domain of a
significant portion of social science.

01:30:22.214 --> 01:30:24.855
Uh, and it has technical aspects too.

01:30:24.855 --> 01:30:29.025
So like social choice theory, for
example, I think has at least some

01:30:29.025 --> 01:30:34.875
technical people trying to do a mechanism
design to, to solve these problems.

01:30:34.964 --> 01:30:36.735
And that seems great.

01:30:36.825 --> 01:30:39.044
And people should do that.

01:30:39.195 --> 01:30:40.605
It's a good problem to solve.

01:30:41.115 --> 01:30:44.714
Um, it's unfortunately not one
I have thought about very much,

01:30:44.955 --> 01:30:47.684
but I do feel pretty strongly
about the factorization into

01:30:48.464 --> 01:30:52.394
one part of, you know, one problem,
which is like, figure out what exactly

01:30:52.424 --> 01:30:54.075
you want to put into the AI system.

01:30:54.434 --> 01:30:57.315
And then the other part of the problem,
which I call the alignment problem,

01:30:57.315 --> 01:31:00.674
which is then how do you take that thing
that you want to put into the system

01:31:00.794 --> 01:31:02.625
and actually put it into the AI system.

01:31:03.044 --> 01:31:03.674
Okay, cool.

01:31:03.674 --> 01:31:07.394
And Taylor also asks, how do we
best handle bias when learning

01:31:07.394 --> 01:31:09.644
from human expert demonstrations?

01:31:09.825 --> 01:31:09.915
Okay.

01:31:09.945 --> 01:31:12.315
This is a good question.

01:31:12.375 --> 01:31:16.485
And I would say it's an open
question in the field.

01:31:17.054 --> 01:31:21.794
So I don't have a great answer to it,
but some approaches that people have

01:31:21.794 --> 01:31:27.495
taken, one simple thing is to get a, uh,
get demonstrations from a wide variety

01:31:27.495 --> 01:31:31.455
of humans and hope that to, to the
extent that they're making mistakes,

01:31:31.455 --> 01:31:33.165
some of those mistakes will cancel out.

01:31:33.224 --> 01:31:36.224
You can invest additional effort.

01:31:36.434 --> 01:31:39.105
Like you get a bunch of demonstrations
and then you invest a lot of

01:31:39.105 --> 01:31:43.365
effort into evaluating the quality
of each of those demonstrations.

01:31:43.905 --> 01:31:46.275
And then you can like label
each demonstration with

01:31:46.275 --> 01:31:48.750
like, how high quality it is.

01:31:49.170 --> 01:31:53.220
And then you can design an algorithm that
like takes the quality into account when

01:31:53.220 --> 01:31:56.970
learning, or, I mean, the most simple
thing is you just like discard everything

01:31:56.970 --> 01:31:59.670
that's too low quality and only
keep the high quality ones.

01:32:00.090 --> 01:32:03.540
But, uh, there are some algorithms
that have been proposed that can

01:32:04.050 --> 01:32:07.050
make use of the low quality ones
while still trying to get to the

01:32:07.050 --> 01:32:09.360
performance of the high quality ones.

01:32:09.420 --> 01:32:13.620
Another approach that people have,
um, tried to take is to like, try

01:32:13.620 --> 01:32:18.660
and guess what sorts of biases, um,
are present and then try to build

01:32:18.660 --> 01:32:20.790
algorithms that correct for those biases.

01:32:21.690 --> 01:32:27.810
Uh, so in fact, one of my older papers
looks into an approach, uh, of this form.

01:32:28.260 --> 01:32:28.740
I think.

01:32:29.625 --> 01:32:34.245
Like we did get results that were
better than the baseline, but I don't

01:32:34.245 --> 01:32:35.955
think it was all that promising.

01:32:36.165 --> 01:32:39.525
Uh, so I mostly did not continue
working on that approach.

01:32:39.644 --> 01:32:43.905
So it just seems kind of hard to
like, know exactly which biases,

01:32:43.964 --> 01:32:47.294
uh, are going to happen and to
then correct for all of them.

01:32:47.625 --> 01:32:47.804
Right.

01:32:47.804 --> 01:32:51.014
So those are a few thoughts on
how you can try to handle bias.

01:32:51.014 --> 01:32:53.025
I don't think we know the
best way to do it yet.

01:32:53.325 --> 01:32:53.594
Cool.

01:32:53.594 --> 01:32:54.405
Thanks so much.

01:32:54.554 --> 01:32:58.334
Uh, to Taylor and Nathan and
Natasha for contributed questions.

01:32:58.815 --> 01:33:02.504
Um, you can also contribute questions
to our next, uh, interviews.

01:33:02.535 --> 01:33:05.174
Uh, if you show up on our
Twitter at @TalkRLpodcast.

01:33:06.044 --> 01:33:09.315
So we're just about wrapping up here,
a few more questions for you today.

01:33:09.644 --> 01:33:15.224
Rohin, what would you say is the
holy grail for your line of research?

01:33:15.525 --> 01:33:23.415
I think the holy grail is to
have a procedure for training AI

01:33:23.415 --> 01:33:25.125
systems at particular tasks,

01:33:26.385 --> 01:33:33.885
Um, where we tell them where we can apply
arbitrary human understandable constraints

01:33:33.915 --> 01:33:36.434
to how the system achieves those tasks.

01:33:36.825 --> 01:33:40.605
So for example, we can be like,
we can build an AI assistant that

01:33:40.605 --> 01:33:42.165
schedules your meetings, but

01:33:43.110 --> 01:33:47.070
and, like, but ensure
that it's always very respectful

01:33:47.070 --> 01:33:50.100
when it's talking to other people
in order to schedule your emails.

01:33:50.100 --> 01:33:52.830
And it's never like, you
know, discriminating based on

01:33:52.830 --> 01:33:54.300
sex or something like that.

01:33:54.360 --> 01:33:59.490
Or you can like build an agent that plays
Minecraft and you can just deploy it on

01:33:59.490 --> 01:34:04.560
an entirely new multiplayer server that
includes both humans and AI systems.

01:34:05.040 --> 01:34:08.460
And then you can say, Hey, you should
just go help such and such player

01:34:08.460 --> 01:34:09.900
with whatever it is they want to do.

01:34:09.900 --> 01:34:11.160
And the agent just does that.

01:34:11.160 --> 01:34:15.360
And it like abides by the norms
on that, uh, on the multiplayer

01:34:15.360 --> 01:34:19.620
server that it had joined. Or
you can build a recommender system

01:34:19.740 --> 01:34:25.170
that's just optimizing for what humans
think, uh, is good for recommender

01:34:25.170 --> 01:34:30.420
systems to be doing while, uh, rather
than optimizing for say engagement.

01:34:30.420 --> 01:34:33.300
If we think that engagement is a
bad thing to be optimizing for.

01:34:33.660 --> 01:34:36.600
So how do you see your, uh,
your research career plan?

01:34:36.690 --> 01:34:40.770
Um, do you have a clear roadmap
in mind or are you, uh, doing

01:34:40.770 --> 01:34:41.760
a lot of exploration as you go?

01:34:42.475 --> 01:34:47.155
I think, I feel more like there's
maybe I wouldn't call it a roadmap,

01:34:47.155 --> 01:34:47.725
exactly.

01:34:47.725 --> 01:34:49.585
But there's a clear plan.

01:34:49.795 --> 01:34:54.355
Uh, and the plan is, we talked
a bit about it earlier.

01:34:54.595 --> 01:34:58.975
The plan is roughly train models
using human feedback, and then

01:34:58.975 --> 01:35:02.755
like empower the, the humans
providing the feedback as much as we

01:35:02.755 --> 01:35:06.865
can, um, ideally so that they can know
everything that the model knows and

01:35:06.865 --> 01:35:10.255
select the models that are getting the
right outcomes for the right reasons.

01:35:10.345 --> 01:35:13.225
I'd say like, that's the plan.

01:35:13.675 --> 01:35:16.135
That's like an ideal to which we aspire.

01:35:16.675 --> 01:35:20.394
Uh, we will probably not actually
reach it, knowing everything that

01:35:20.394 --> 01:35:25.015
the model knows is a pretty high
bar and probably we won't get to it.

01:35:25.875 --> 01:35:28.785
But there are like a bunch of
tricks that we can do that get

01:35:28.785 --> 01:35:30.075
us closer and closer to it.

01:35:30.075 --> 01:35:33.195
And the closer we get to it, the
better, the better we're doing.

01:35:34.004 --> 01:35:38.294
Um, and so like, let us find more
and more of those tricks find which

01:35:38.294 --> 01:35:42.134
ones are the best, see how like cost
efficient, how costly they are and so on.

01:35:42.644 --> 01:35:47.054
Um, and ideally this just leads to our,
to a significant improvement in our

01:35:47.054 --> 01:35:48.855
ability to do these things every time.

01:35:49.094 --> 01:35:54.344
Um, I will say though, that it took me
several years to get to this point.

01:35:54.375 --> 01:36:00.105
Like most of the, uh, most of the previous
years of my career have in fact been

01:36:00.464 --> 01:36:05.714
a significant amount of exploration,
uh, which is part of why, like, not all

01:36:05.714 --> 01:36:10.695
of the papers, uh, that we've talked
about so far really fit into the story.

01:36:11.504 --> 01:36:14.474
Is there anything else you want
to mention to our audience today?

01:36:14.474 --> 01:36:14.865
Rohin?

01:36:15.495 --> 01:36:15.974
Yeah.

01:36:16.304 --> 01:36:23.924
Um, so I, I'm probably going to start a
hiring round at DeepMind for my own team.

01:36:25.395 --> 01:36:31.965
Probably sometime in the next month from
the time of recording. Today is March 22nd.

01:36:32.745 --> 01:36:35.085
So yeah, please do apply

01:36:35.085 --> 01:36:37.395
if you're interested in
working on AI alignment.

01:36:37.695 --> 01:36:38.055
Great.

01:36:38.295 --> 01:36:38.715
Dr.

01:36:38.715 --> 01:36:43.155
Rohin Shah, this has been an absolute
pleasure and, and a total honor, by

01:36:43.155 --> 01:36:46.935
the way, I want to thank you on
behalf of myself and our audience.

01:36:47.115 --> 01:36:47.415
Yeah.

01:36:47.445 --> 01:36:48.495
Thanks for having me on.

01:36:48.495 --> 01:36:51.915
It was really fun to actually
go through all of these papers,

01:36:51.945 --> 01:36:54.525
uh, in a single session.

01:36:54.675 --> 01:36:56.235
I don't think I've ever done that before.