TalkRL: The Reinforcement Learning Podcast

DeepMind Research Scientist Dr. Rohin Shah on Value Alignment, Learning from Human Feedback, the Assistance Paradigm, the BASALT MineRL competition, his Alignment Newsletter, and more!

Show Notes

Dr. Rohin Shah is a Research Scientist at DeepMind, and the editor and main contributor of the Alignment Newsletter.

Featured References

The MineRL BASALT Competition on Learning from Human Feedback
Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, Anca Dragan

Preferences Implicit in the State of the World
Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan

Benefits of Assistance over Reward Learning
Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell

On the Utility of Learning about Humans for Human-AI Coordination
Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, Anca Dragan

Evaluating the Robustness of Collaborative Agents
Paul Knott, Micah Carroll, Sam Devlin, Kamil Ciosek, Katja Hofmann, Anca Dragan, Rohin Shah


Additional References

Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

TalkRL podcast is all
reinforcement learning all the time.

Featuring brilliant guests,
both research and applied.

Join the conversation on
Twitter at @TalkRLpodcast.

I'm your host Robin Chauhan.

Rohin Shah is a research scientist at DeepMind and the editor and main contributor of the Alignment Newsletter.

Thanks so much for
joining us today, Rohin.

Yeah.

Thanks for having me, Robin.

Let's get started with, um, how do you like to describe your area of interest?

On my website, the thing that I say is that I'm interested in sort of the long-term

trajectory of AI, because it seems like
AI is becoming more and more capable

over time with many people thinking that
someday we are going to get to artificial

general intelligence or AGI, uh, where
AI systems will be able to replace humans

at most economically valuable tasks.

And that just seems like such an important
event in the history of humanity.

Uh, it seems like it would
radically transform the world.

And so it seems both important and interesting to understand what is going to happen, and to see how we can make that important stuff happen better, so that we get good outcomes instead of bad outcomes.

That's a sort of very general
statement, but I would say that

that's a pretty big area of interest.

And then I often spend most of my time
on a particular question within that,

uh, which is: what are the chances that these AGI systems will be misaligned with humanity, in the sense that they will want to do things other than what humans want them to do?

So (a) what is the risk of that and how can it arise, and (b) how can we prevent that problem from happening?

Cool.

Okay.

So we're going to talk, uh, about some
of this in more general terms later on.

But first let's get a little more specific about some of your recent papers.

First we have the MineRL BASALT Competition on Learning from Human Feedback, and that was a Benchmark for Agents that Solve Almost-Lifelike Tasks.

So I gather this is based on MineRL, a Minecraft-based RL environment. We saw some competitions using that before, but here you're doing something different with MineRL. Can you tell us about BASALT and what's the idea here?

So I think the basic idea is that a reward function, which is a typical tool that you use in reinforcement learning, and which I expect your listeners probably know about, is actually a pretty not-great way of specifying what you want an AI system to do if you have to write it down by hand. Reinforcement learning treats that reward function as a specification of exactly what the optimal behavior is in every possible circumstance that could possibly arise. When you wrote down that reward function, did you think of every possible situation that could ever possibly arise and check whether your reward function was specifying the correct behavior in that situation?

No, you did not do that.

And so we already have lots and lots of examples of cases where people wrote down a reward function that they thought would lead to good behavior, and they actually ran reinforcement learning or some other optimization algorithm with that reward function, and the AI found some totally unexpected solution that did get high reward, but didn't do what the designer wanted it to do.

And so this motivates the question: all right, how can we specify what we want the agent to do without using handwritten reward functions?

The general class of approaches that has been developed in response to this is what I call learning from human feedback, or LfHF. The idea here is that you consider some possible situations where the AI could do things, and then you ask a human: hey, in these particular situations, what should the AI system do? So you're making more local queries and local specifications, rather than having to reason about every possible circumstance that could ever arise. And then, given a large dataset of human feedback on various situations, you can train an agent to meet that specification as best as it can.

So people have been developing these techniques, and this includes things like imitation learning, where you learn from human demonstrations of how to do the task; or learning from comparisons, where humans look at two videos of agent behavior and then say, you know, the left one is better than the right one; or corrections, where the agent does something and the human says, at this point you should have taken this other action instead, that would have been better. These are all ways that you can use human feedback to train an agent to do what you want.
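
To make that concrete, here is a minimal sketch (my own toy example, not code from any of the papers discussed) of the comparison-based variant: a Bradley-Terry style reward model fit to pairwise preferences over trajectories, with made-up feature vectors standing in for trajectory summaries.

```python
# Minimal sketch of learning a reward model from pairwise comparisons.
# Assumption: each trajectory is summarized by a small feature vector; the
# Bradley-Terry model says P(human prefers A over B) = sigmoid(R(A) - R(B)).
import numpy as np

rng = np.random.default_rng(0)

def reward(features, w):
    """Linear reward model over trajectory features."""
    return features @ w

# Toy data: 100 pairs of trajectories, 5 features each, labeled by a
# simulated human who prefers whichever trajectory has higher true reward.
true_w = rng.normal(size=5)
pairs = rng.normal(size=(100, 2, 5))
prefers_first = (reward(pairs[:, 0], true_w) > reward(pairs[:, 1], true_w)).astype(float)

w = np.zeros(5)
lr = 0.1
for _ in range(500):
    diff = reward(pairs[:, 0], w) - reward(pairs[:, 1], w)
    p = 1.0 / (1.0 + np.exp(-diff))                    # P(first preferred)
    grad = ((p - prefers_first)[:, None] * (pairs[:, 0] - pairs[:, 1])).mean(axis=0)
    w -= lr * grad                                     # descend the Bradley-Terry loss

print("correlation with true reward weights:", np.corrcoef(w, true_w)[0, 1])
```

The learned reward model can then be handed to any RL algorithm in place of a hand-written reward function; imitation and corrections plug into the same overall recipe with different loss functions.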

So people have developed a lot of algorithms like this, but the evaluation of them is kind of ad hoc. People just sort of make up some new environment to test their method on; they don't really compare on a standard benchmark that everyone is using.

So the big idea with BASALT was to change that, to actually make a benchmark that could reasonably fairly compare all of these different approaches. We wanted it to mimic the real-world situation as much as possible. In the real-world situation, you just have some notion in your head of what task you want your AI system to do, and then you have to take a learning from human feedback algorithm and give it the appropriate feedback.

So similarly, in this benchmark, we instantiate the agent in a Minecraft world, and then we just tell the designer: hey, you've got to train your agent to, say, make a waterfall, that's one of our tasks, and then take a picture of it. That's all we tell the designers. So now the designer has in their head a notion of what the agent is supposed to do, but there's no formal specification, no reward function, nothing like that.

So they can then do whatever they want.

They can write down a reward function by hand, if that seems like an approach they want to take; they can use demonstrations.

They can use preferences, they
can use corrections, they can

do active learning and so on.

Uh, but their job is to like make an
agent that actually does the task.

Ideally they want to maximize,
uh, performance and minimize costs

both in terms of compute and in
terms of how much human feedback

it takes to train the agent.

So I watched the presentations of the top two solutions, and it seemed like they were very different approaches. The first one, KAIROS, I would say seemed like a lot of hand engineering; I think they used 80,000-plus labeled images and built some very specific components for this.

They kind of decompose the problem, which
I think is a very sensible thing to do.

But then the second one was Obsidian. They produced this inverse Q-learning method, a new method, which seemed like a more general, theoretical solution.

I just wonder if you have any comments on the different types of solutions that came out of this. Were those the two main classes that you saw, or did any classes of solutions surprise you?

Yeah, I think that's basically right. I don't think they were particularly surprising, in that we spent a lot of time making sure that the tasks can't trivially be solved by just doing hand engineering, like classical programming. So even the top team did rely on a behavior-cloned navigation policy that used a neural network, but it's true they did a bunch of engineering on top of that, which according to me is just a benefit of this setup.

It shows you: hey, if you're actually just trying to get good performance, do you train a neural network end to end, or do you put in domain knowledge, and how much domain knowledge do you put in, and how do you do it?

And it turns out that in this particular case, the domain knowledge, well, they did end up getting first, but team Obsidian was quite close behind. So I would say that the two approaches were actually pretty comparable. And I do agree that I would say one is more of an engineering-ish solution and the other one is more of a researchy solution.

So it seems to me like the goals here were
things that could be modeled and learned.

Like, it seems feasible to train a network to learn the concept of looking at a waterfall, given enough labels.

And I guess that's what
some contestants did.

But do you have any comments on if
we were to, to want goals that are

harder to model than these things?

I was trying to think of examples and came up with, like, irony, or dance choreography scoring. Like, how would you even begin to model those things?

Do we have to just continue improving
our modeling toolkit so that we can make

models of these, uh, reward functions?

Or is there some other strategy?

Uh, it depends exactly what you mean
by improving the modeling toolkit,

but basically I think the answer is
yes, but you know, the way that we

can improve our modeling toolkit, it
may not look like explicit modeling.

So for example, for irony, I think you could probably get, well, maybe not; it's plausible that you could get a decent reward model out of a large language model that does in fact have the concept of irony.

If I remember correctly, large language models are not actually that great at humor, so I'm not sure if they have the concept of irony, but I wouldn't be surprised if further scaling did in fact give them a concept of irony, such that we could then use them to have rewards that involve irony.

I think that's the same sort
of thing as like waterfall.

Like I agree that we can learn
the concept of a waterfall,

but it's not a trivial concept.

If you asked me to program it
by hand, I would have no idea.

Like, the only input you get is pixels. If you're like: here's a rectangle of pixels, please write a program that detects the waterfall in there,

I'm like, oh God, that
sounds really difficult.

I don't know how to do it, but we
can, if we apply machine learning,

then like turns out that we can
recognize these sorts of concepts.

And similarly, I definitely couldn't write a program directly that can recognize irony. But if you use machine learning to model all the text on the internet, the resulting model does in fact have a concept of irony that you can then try to use in your reward functions.

And then there's a Twitter
thread related to disinformation.

And I shared a line from your paper where you said learning from human feedback offers the alternative of training recommender systems to promote content that humans predict would improve the user's well-being. And I thought that was a really cool insight.

Is that something you're interested in pursuing, or do you see that being a thing?

I don't know whether or not it
is actually feasible currently.

One thing that needs to be true of recommender systems is they need to be cheap to run, because they are being run so many times every day. I don't actually know this for a fact, I haven't actually done any Fermi estimates, but my guess would be that if you tried to actually run GPT-3 on, say, Facebook posts in order to then rank them, that would probably be prohibitively expensive for Facebook.

So there's a question of: can you get a model that actually makes reasonable predictions about the user's well-being, that can also be run cheaply enough that it's not a huge expense to whoever is implementing the recommendation system? And also, does it take a sufficiently small amount of human feedback that you aren't bottlenecked on the cost of the humans providing the feedback?

And also do we have algorithms
that are good enough to, uh, train

recommender systems this way?

I think the answer is plausibly yes to all of these; it's just that I haven't actually checked myself, nor have I even tried to do any feasibility studies.

I think the line that you're quoting was more about: okay, why do this research at all?

And I'm like, well, someday in the
future, this should be possible.

And I stick by that, like someday
in the future, things will

become significantly cheaper.

Learning from human feedback algorithms will be a lot better, and so on. And then it will just totally make sense to use recommender systems trained with human feedback, unless we've found something even better by then. It's just not obvious to me that it is the right choice currently.

I look forward to that. I'm really concerned, like many people are, about the disinformation and the divisiveness of social media. So that sounds great. I think everyone's used to very cheap reward functions, pretty much across the board.

So I guess what you're kind of pointing to with these reward functions is potentially more-expensive-to-evaluate reward functions, which maybe hasn't been a common thing until now.

Both more expensive reward functions, and also the model that you train with that reward function might still be very expensive to do inference with. Presumably recommender systems right now, you know, run a few linear-time algorithms on the post in order to compute, like, a hundred or a hundred thousand features, then do a dot product with a hundred thousand weights, and then rank things in order by those numbers.

And that's, you know, maybe a million flops or something, which is a tiny, tiny number of flops, whereas a forward pass of GPT-3 is more like several hundred billion flops. So that's, like, a 10-to-the-five-X increase in the amount of computation you have to do. Actually, no, that's one forward pass through GPT-3, but there are many words in a Facebook post, so multiply the 10 to the five by the number of words in the Facebook post. And now we're at maybe more like a 10-to-the-seven-times cost increase just to do inference, even assuming you had successfully trained a model that could do recommendations.
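
For readers who want the arithmetic spelled out, here is the back-of-the-envelope version of that comparison; the specific numbers are rough assumptions in the spirit of the estimate above, not measurements.

```python
# Rough Fermi estimate, not a benchmark. All numbers are assumptions.
linear_ranker_flops = 1e6        # ~1e5 features dot 1e5 weights, plus feature computation
gpt3_flops_per_token = 3.5e11    # roughly 2 * 175B parameters per forward-pass token
tokens_per_post = 100            # assumed length of a typical post

lm_flops_per_post = gpt3_flops_per_token * tokens_per_post
print(f"cost increase: ~{lm_flops_per_post / linear_ranker_flops:.0e}x")  # ~3e7, i.e. ~10^7
```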

Yeah.

And the end result may be lowering engagement for the benefit of less divisive content, which is maybe not in the interest of the social media companies in the first place.

Yeah.

There's also a question, I agree, of whether the companies will want to do this. But I think if we showed that this was feasible, that would give regulators much more to work with. I think a common problem with regulation is that you don't know what to regulate, because there's no alternative on the table to what people are already doing.

And if we were to come to them and say: look, there's this learning from human feedback approach, we've calculated it out, it should only increase costs by two X, or maybe it's just the same amount of cost. And it shouldn't be too hard for companies to actually train such a model; they've already got all the infrastructure. It should only be, like, I don't know, a hundred thousand dollars to train the model once.

And if you lay out that case, I would hope, at least, that it would be a lot easier for the regulators to be like: yes, everyone, you must train recommender systems to optimize for what humans would predict is good, as opposed to whatever you're doing right now.

That could really change the game. And then the bots or the divisive posters are now trying to game that new reward function, and then probably find some different strategies.

Yeah, you might imagine that you have to keep retraining in order to deal with new strategies that people are finding in response. I don't have any special information on this from working at Google, but I'm told that Google is actually pretty good at defeating spammers; for example, my Gmail spam filter works quite well as far as I can tell, despite the fact that spammers are constantly trying to evade it. Hopefully we could do the same thing here.

Cool.

Okay.

Let's move on to your next paper, Preferences Implicit in the State of the World.

I understand this paper is closely
related to your dissertation.

We'll link to your dissertation
in the show notes as well.

I'm just going to read a quote and I
love how you distilled this key insight.

You said the key insight of this paper
is that when a robot is deployed in an

environment that humans have been acting
in, the state of the environment is

already optimized for what humans want.

Can you, um, tell us the general idea here
and what do you mean by that statement?

Maybe, like, put yourself in the position of a robot or an AI system that knows nothing about the world. Or, all right, sorry, it knows the laws of physics or something: it knows that there's gravity, it knows that there are solids, liquids and gases, that liquids tend to take the shape of the container that they're in, stuff like that. But it doesn't know anything about humans. Maybe we imagine that it's sort of off in other parts of the solar system or whatever, and it hasn't really seen Earth yet.

And then it comes to Earth and it's like: whoa, Earth has these super regular structures. There are these very cuboidal structures with glass panes at regular intervals, that often seem to have lights inside of them even at night, when there isn't light outside of them. This is kind of shocking; you wouldn't expect this from a random configuration of atoms, or something like that.

There is some sense in which the order of the world that we humans have imposed upon it is extremely surprising, if you don't know about humans already being there and what they want.

So then you can imagine asking your AI system: hey, you see a lot of order here, can you figure out an explanation for why this order is there? And maybe you give it the hint: look, it was created by somebody optimizing the world; what sort of things might they have been optimizing for?

And then, you know, you look around and you see that, oh, liquids tend to be in these glasses. It would be really easy to tip over the glasses and have all the liquid spill out, but that mostly doesn't happen. So people must want to have their liquids in glasses. And probably I shouldn't knock over vases; they're kind of fragile, you could easily just move them a little bit to the left or right and they would fall down and break. And once they are broken, you can't easily reassemble them. But nonetheless, they're still not broken. So probably someone actively doesn't want them to break and is leaving them on the table.

Yeah.

So really I would say the idea is: the order in the world did not just happen by random chance, it happened because of human optimization. And so from looking at the order of the world, you can figure out what the humans were optimizing for. Yeah, that's the basic idea underlying the paper.

So there's some kind of relationship
here to inverse reinforcement

learning where we're trying to
recover the reward function from,

from observing an agent's behavior.

But here you're not observing the agent's behavior, right? So it's not quite inverse RL. How would you describe the relationship between what you're doing here and standard inverse RL?

Yeah.

So in terms of the formalism, inverse RL says that you observe the human's behavior over time; that's the sequence of states and actions that the human took within those states. Whereas we're just saying: no, no, no, we're not watching the human's behavior, we're just going to see only the current state. That's the only thing that we see. And so in the framework of inverse reinforcement learning, you can think of this as either the final state of the trajectory, or a state sampled from the stationary distribution of an infinitely long trajectory; either of those would be reasonable to do. But you're only observing that one thing, instead of observing the entire state-action history starting from a random initialization of the world. Other than that, you just make that one change, and then you run through all the same math and you get a slightly different algorithm. And that's basically what we did to make this paper.
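
As a rough sketch of what that one change looks like in math (the notation here is mine and may not match the paper exactly): the robot observes only the current state s_0, assumes a noisily rational human acted for T steps to produce it, and asks which reward parameters make that observation likely.

```latex
% Hedged sketch; notation is illustrative, not necessarily the paper's.
% Standard IRL maximizes the likelihood of an observed trajectory under a
% Boltzmann-rational policy \pi_\theta. Here only the final state s_0 is
% observed, so the unobserved past is marginalized out:
\[
p(s_0 \mid \theta) \;=\; \sum_{s_{-T},\, a_{-T},\, \dots,\, a_{-1}}
  p(s_{-T}) \prod_{t=-T}^{-1} \pi_\theta(a_t \mid s_t)\,
  \mathcal{T}(s_{t+1} \mid s_t, a_t),
\qquad
\pi_\theta(a \mid s) \;\propto\; \exp\!\big(Q_\theta(s, a)\big).
\]
% The algorithm then ascends \nabla_\theta \log p(s_0 \mid \theta), using the
% known transition dynamics \mathcal{T}, to find the reward the human was most
% plausibly optimizing.
```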

So with this approach, I guess
potentially you're opening up a huge

amount of kind of unsupervised learning
just from observing what's happening.

And you can kind of almost
do it instantaneously in

terms of observation, right?

You don't have to watch billions
of humans for thousands of years.

Yep.

That's right.

It does require that your AI system knows, like, the laws of physics, or as we would call it in RL, the transition dynamics. Well, it needs to either know that, or have some sort of data from which it can learn that, because if you just look at the state of the world and you have no idea what the laws of physics are or how things work at all, you're not going to be able to figure out how it was optimized into this state.

Like, if you want to infer that humans don't want their vases to be broken, an important fact in order to infer that is that if a vase is broken, it's very hard to put it back together. And that is a fact about the transition dynamics, which we assumed by fiat that the agent knows.

But yes, if you had enough data, self-supervised learning could teach the agent a bunch of dynamics. And then the agent could also go around looking at the state of the world, and in theory it could then infer a lot about what humans care about.

So I very clearly remember meeting you at NeurIPS 2018, at the deep RL workshop in Montreal, in the poster session.

And I remember your poster on
this, um, and you showed a dining

room that was all nicely arranged.

And, uh, and, and you were saying
how a robot could learn from

how things are arranged.

And, and I just want to say, I'll say
this publicly, I didn't understand,

uh, at that point what, what you
meant or why that could be important.

Um, and it was so different.

Your angle was just so different
than everything else that was

being presented, um, that day.

And I really didn't get it.

So I, I, and I'll own that.

Uh, it was, it was my loss.

And, uh, so thanks for your patience.

It only took me three and a half years
or something to get to come around.

Yeah.

Sorry, I didn't communicate it clearly, I suppose.

No, I don't think it was at all on you. Maybe I just lacked the background to see why. Let me put it this way:

Like how often do you find people who
have some technical understanding of

AI, but still, maybe don't appreciate,
uh, some of this line of work, including

alignment and things like that.

Is that a common thing?

I think that's reasonably common.

And what do you attribute that to? Like, what's going on there, and is that changing at all?

I think it's pretty interesting.

I don't think that these people would say, oh, this is a boring paper, or this is an incompetent paper. I think they would say: yes, the person who wrote this paper has in fact done something impressive by the standards of, you know, did you need to be intelligent and do good math in order to do this? I think they are more likely to say something like: okay, but, so what? And that's not entirely unfair.

Like, you know, it was the deep RL workshop, and here I am talking about: oh yes, imagine that you know all the dynamics, and also you're only getting to look at the state of the world, and then you think about how vases can be broken but then can't be put back together, and voila, you've learned that humans don't like to break vases. It's just something so different from all of the things that RL usually focuses on.

Right? Like, it doesn't have any of the usual pieces. There's no, you know, deep learning, there's no exploration, there's no catastrophic forgetting, nothing like that.

And to be clear, all of those seem
like important things to focus on.

And I think many of the people who were at that workshop were focusing on those and are doing good work on them.

Uh, and I'm just doing
something completely different.

That's like, not all that interesting
to them because they want to

work on reinforcement learning.

I think they're making a mistake
in the sense that like AI alignment

is important and more people should work on it, but they're not making a mistake in the sense that they're probably correct about what does and doesn't interest them.

Okay.

Just so I'm clear, I was not
critiquing your math or the

value of anything you were doing.

It was just my ability to understand
the importance of this type of work.

And I didn't think you were.

Okay.

So I will say that that day, when I
first encountered your, your poster,

I was really hung up on edge cases.

Like, you know, in the world the robot might observe, there's hunger and there's traffic accidents, and, like, not everything is perfect, and we don't want the robot to replicate all these flaws. In the world, or the dining room, there might be, you know, dirty dishes or something. And so the world is clearly not exactly how we want it to be. So is that an issue, or is that not an issue, or is that just not the point of this, not addressed here?

It depends a little bit.

I think in many cases it's not
an issue if you imagined that the

robot somehow sees the entire world.

So for example, you mentioned hunger. I think the robot would notice that we do in fact spend a lot of effort making sure that at least a large number of people don't go hungry. We've built these giant vehicles, both trucks and cargo ships and so on, that move food around in a way that seems at least somewhat optimized to get food to people who like that food and want to eat it.

So there's lots of
effort being put into it.

There's not, like, the maximum amount of effort being put in, which I think reflects the fact that there are things that we care about other than food. So I do think it would be like: all right, humans definitely care about having food. Then, if you use the assumption that we have in the paper, which is that humans are noisily rational, it might conclude things like: yes, Western countries care about getting food to the citizens of their country.

And they care a little bit about other people having food, but not that much; it's like a small portion of their government's aid budget. So there's a positive but fairly small weight on that.

And that seems like maybe not the
thing that we wanted to learn, but like

also I think it is in some sense, an
accurate reflection of what Western

countries care about if you go by their
actions rather than what they say.

Cool.

Okay.

So I'm going to move on to Benefits of Assistance over Reward Learning.

And this one was absolutely fascinating
to me actually, mind blowing.

I highly recommend people read
all of these, but, but definitely

I can point to this one as,
um, something surprising to me.

So that was you as the first author.

And can you share what the general idea of this paper is?

I should say that this general idea was not novel to this paper; it's been proposed previously. I am not going to remember the paper exactly, but it's by Fern et al.; it's called A Decision-Theoretic Model of Assistance, or something like that. And then there's also cooperative inverse reinforcement learning, from CHAI, where I did my PhD.

The idea with this paper was just to take the models that had already been proposed in these papers and explain why they were so nice, why I was particularly keen on these models as opposed to other things that the field could be doing.

So the idea here is that generally we want to build AI systems that help us do stuff, and you could imagine two different ways that this could be done. First, you could imagine a system that has two separate modules. One module is trying to figure out what the humans want, or what the humans want the system to do, and the other module is then trying to do the things that the first module said the people wanted it to do.

And is that kind of like what we talked about earlier on with learning from human feedback and modeling reward functions? Is that what that would be, exactly?

I think that is often what people are thinking about. I would make a distinction between how you train the AI system and what the AI system is doing. This paper, I would say, is more about what the AI system is doing, whereas the learning from human feedback stuff is more about how you train the system.

Yeah.

So in the "what the AI system is doing" framework, I would call this value learning or reward learning, and then the alternative is assistance. And so, although there are some surface similarities between learning from human feedback and reward learning, it is totally possible to use learning from human feedback algorithms to train an AI system that then acts as though it is in the assistance paradigm. It is also possible to use learning from human feedback approaches to train an AI system that then acts as though it is in the reward learning paradigm.

So that's one distinction.

To recap, the value learning or reward learning side of the two models has two separate modules: one that figures out what the humans want, and another that then acts to optimize those values. And the other side, which we might call assistance, is where you still have both of those functions, but they're combined into a single module.

And the way that you do this is you have the AI system posit that there is some true unknown reward function theta. Only the human, who is a part of the environment, knows this theta, and their behavior depends on what theta actually is. And so now the robot can just act in order to maximize theta, but it doesn't know theta, so it has to look at how the human is behaving within the environment in order to make some inferences about what theta probably is. And then as it gets more and more information about theta, that allows it to take more and more actions in order to optimize theta. But fundamentally, this learning about theta is an instrumental action that the agent predicts would be useful for helping it better optimize theta in the future.
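
For readers who want the shape of the formalism, here is a rough sketch of an assistance game in the style of cooperative inverse reinforcement learning; the notation is mine, so see the papers above for the precise definitions.

```latex
% Hedged sketch of an assistance game (roughly in the spirit of cooperative IRL).
% A two-agent game between a human H and a robot R:
\[
M \;=\; \langle S,\; A^H,\; A^R,\; \mathcal{T},\; \Theta,\; P(\theta),\; r \rangle,
\]
% where \theta \sim P(\theta) parameterizes the shared reward
% r(s, a^H, a^R; \theta) and is observed by the human but not by the robot.
% Both agents act to maximize
\[
\mathbb{E}\!\left[\sum_t \gamma^t\, r(s_t, a^H_t, a^R_t; \theta)\right],
\]
% so the robot's best policy maintains a belief over \theta, updates it from
% the human's behavior, and gathers information about \theta only when doing
% so helps it earn more reward later.
```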

So if I understand correctly, you're saying assistance is superior because the agent can reason about how to improve its model of what the human wants? Or how would you describe why you get all these benefits from assistance?

Yeah.

I think the benefits come more from the fact that these two functions are integrated. There's the reward learning or value learning, and there's the control, so, acting to optimize the value. We can think of these two functions: in assistance, they're merged into a single module that does nice, proper Bayesian reasoning about all of it, whereas in the value learning paradigm, they're separated. And it's this integration that provides the benefits.

You can make plans, which is generally the domain of control, but those plans can then depend on the agent believing that in the future it's going to learn some more things about the reward function theta, which would normally be the domain of value learning. So that's an example where control is using information about future value learning in order to make its plans, whereas when those two modules are separated, you can't do that.

And so one example that we have in the paper is: you imagine that you've got a robot who is asked to cook dinner for Alice, well, not cook dinner, bake a pie for Alice. Alice is currently at the office, so the robot can't talk to her, and unfortunately the robot doesn't know what kind of pie she wants, maybe apple, blueberry or cherry. The robot could guess, but its guess is not that likely to be right. However, it turns out that the steps to make the pie crust are the same for all three pies.

So an assistive robot can reason: hey, my plan is first make the pie crust, then wait for Alice to get home, then ask her what filling she wants, then put the filling in. And that entire plan consists of both taking actions in the environment, like making the crust and putting in the filling, and also includes things like learning more about theta by asking Alice a question. So it's integrating all of these into a single plan, whereas that plan cannot be expressed in the value learning paradigm, where the query isn't an action in the action space.
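
Here is a tiny worked version of that pie example; all numbers are made up for illustration, with a uniform prior over the three fillings and a small cost for waiting until Alice gets home.

```python
# Toy numbers, purely illustrative.
prior = {"apple": 1 / 3, "blueberry": 1 / 3, "cherry": 1 / 3}
wait_cost = 0.1   # assumed small cost of waiting for Alice before finishing

# Plan 1: commit to a guessed filling right now.
value_guess_now = max(prior.values()) * 1.0            # right 1/3 of the time

# Plan 2: make the crust (useful for every filling), wait, ask, then fill.
# Asking resolves the uncertainty about theta, so the pie is always right.
value_crust_then_ask = sum(p * 1.0 for p in prior.values()) - wait_cost

print(f"guess now: {value_guess_now:.2f}, crust then ask: {value_crust_then_ask:.2f}")
```

The second plan interleaves environment actions (make the crust) with an information-gathering action (ask Alice), which is exactly the kind of plan a pipeline that freezes its reward estimate before acting cannot represent.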

So I really like how you laid out some levels of task complexity, and I'm just going to go through them really briefly. You mentioned traditional CS is giving instructions to the computer on how to perform a task. Then, using AI or ML for simpler tasks would be specifying what the task is, and the machine figures out how to do it; I guess that's the standard RL formulation.

And then, for harder tasks, specifying the task is difficult, so the agent may learn a reward function from human feedback. And then you mentioned the assistance paradigm as the next level, where the human is part of the environment and has latent goals that the robot does not know.

Yup.

How do you see this ladder? Like, is this a universal classification scheme? Are we done? Is that the highest level?

It's a good question. I haven't really thought about it before.

You can imagine a different version of the
highest level, which is like here, we've

talked about the assistance framing where
you're like, there is some objective, but

you have to infer it from human feedback.

There is a different version that
maybe is more in line with the way

things are going with deep learning
right now, which is more like

specifying the task is difficult, so we're only going to evaluate behaviors that the AI agent shows, and maybe also try to find some hypothetical behaviors and evaluate those as well.

So that's a different way that you could talk about that highest level, where you're evaluating specific behaviors rather than trying to specify the task across all possible behaviors. And then maybe that would be the highest level.

And now you could just keep inventing
new kinds of human feedback inputs,

uh, and maybe those can be thought of
as higher levels beyond that as well.

So then, one detail I saw in the paper: you mentioned that two-phase assistance is equivalent to reward learning. I puzzled over that line and I couldn't quite understand what you meant. Can you say a little bit more about that? What does that mean, and how do you conclude that those two things are equivalent?

Yeah.

So there are a fair number of definitions here; maybe I won't go through all of them, but just so that listeners know, we had formal definitions of what counts as assistance and what counts as reward learning. In the reward learning case, we imagined it as: first, you have a system that interacts with the human somehow, it doesn't have to ask the human questions, and develops a guess of what the reward function is. And then that guess of what the reward function is, which could be a distribution over rewards, is passed on to a system that then acts to maximize the expected reward according to that distribution over rewards.

Okay.

Yeah.

So once it's done its communication, it's learned the reward, and in phase two it doesn't have any query actions at that point. That's what you're saying.

Exactly.

Okay, cool.

And so then, you know, it's two-phase communicative assistance; the "two-phase" and the "communicative" both have technical definitions, but they roughly mean exactly what you would expect them to mean, in order to make this true.

Um, so you mentioned three
benefits of using assistance,

this assistance paradigm.

Can you briefly explain
what those benefits are?

The first one, which I already talked about, is plans conditional on future feedback.

So this is the example of where the
robot can make a plan that says,

Hey, first, I'll make the pie crust.

Then I'll wait for Alice to
get back from the office.

Then I'll ask her what filling she wants.

Then I'll put in the appropriate filling.

So there, the plan was conditional on the answer that Alice was going to give in the future, which the robot predicted she would give but couldn't actually ask about now. So that's one thing that can be done in the assistance paradigm, but not in the value learning or reward learning paradigm.

A second one is what we call relevance-aware active learning. Active learning is the idea that, instead of the human giving a bunch of information to the robot and the robot passively taking it and using it to update its estimate of theta, the robot actively asks the human questions that seem most relevant to updating its understanding of the reward theta, and then the human answers those questions. So that's active learning; that can be done in both paradigms. The thing that assistance can do is to have the robot only ask questions that are actually relevant for the plans it's going to have in the future.

So to make this point, you might imagine that, you know, you get a household robot, and your household robot is booting up. If it were in the reward learning paradigm, it has to figure out theta, right? And so it's like: all right, at what time do you tend to prefer dinner, so I can cook that for you? And that's a pretty reasonable question, and you're like: yeah, I usually eat around 7:00 PM. And it asks a few more questions like this, and later on it's like: well, if you ever wanted to paint your house, what color would you paint it? And you're like: kind of a blue, I guess, but why are you asking me this? And then it's like: if aliens come and invade from Mars, what would be your preferred place to hide in?

And you're like, why, why
are you asking me this?

But the thing is, all of these questions are in fact relevant to the reward function theta. If this were a human instead of a robot, the reason they wouldn't ask these questions is because the situations in which they're relevant probably don't come up. But in order to make that prediction, you need to be talking to the control sub-module, which is something the reward learning paradigm doesn't do.

The control module is the one that's like: all right, we're probably going to take these sorts of actions, which are going to lead to those kinds of futures. And so, you know, probably aliens from Mars aren't ever going to be relevant. So if you have this one unified system, then it can be like: well, okay, I know that aliens from Mars are probably not going to show up anytime in the near future,

And I don't need to ask about
those preferences right now.

If they, if I do find out that aliens
from Mars are likely to land, uh, soon

then I will ask that question, but I
can leave that to later and not bother,

um, Alice until that actually happens.

Um, so that's the second one.

And then the final one is that, you know, so far I've been talking about cases where the robot is learning by asking the human questions, and the human just gives answers that are informative about the reward function theta. The third one is that you don't have to ask the human questions; you can also learn from their behavior directly, while they're going about their day and optimizing their environment.

A good example of this is like your robot
starts helping out around the kitchen.

It starts by doing some like very obvious
things like, okay, there is some dirty

dishes, just put them in the dishwasher.

Meanwhile, the human is going around and starting to collect the ingredients for baking a pie. The robot can see this, notice that that's the case, and go and get out the mixing bowl and the egg beater and so on, in order to help.

Like, the sort of thing where you just see what the human is up to and then immediately start helping with it is something that happens within a single episode, rather than across episodes. Value learning or reward learning could do it across episodes, where first the robot looks and watches the human act in the environment to make an entire cake from scratch, and then the next time, when the robot is actually in the environment, it goes and helps the human out. But in the assistance paradigm, it can do that learning and help out with making the cake within the episode itself, as long as it has enough understanding of how the world works and what theta is likely to be, in order to actually deduce this with enough confidence that those actions are good to take.

When you described the robot that would ask all these irrelevant questions, I couldn't help, I'm a parent, I couldn't help thinking, you know, that's the kind of thing a four-year-old would do: ask you every random question that's not relevant right then. And it seems like you're kind of pointing to a more mature type of intelligence.

Yeah.

Yeah.

A lot of this is, like, the entire paper has this assumption of: we're going to write down math, and then we're going to talk about agents that are optimal for that math. We're not going to think about, okay, how do we in practice get the optimal thing; we're just asking, is the optimal thing actually the thing that we want? And so one would hope that, yes, if we're assuming the actual optimal agent, it should in fact be more mature than a four-year-old, one hopes.

So can you relate this assistance paradigm back to standard inverse RL?

What is the relationship
between these two paradigms?

Yeah.

So inverse RL is an example of the reward learning paradigm. It assumes that you get full demonstrations of the entire task, typically executed by the human teleoperating the robot; there are versions of it that don't assume the teleoperation part, but usually that's an assumption. And then, given the teleoperated robot demonstrations of how to do the task, the robot is supposed to infer what the task actually was, and then be able to do it itself in the future without any teleoperation.

So, without uncertainty, is it true that the inverse RL paradigm assumes that we're not uncertain in the end?

No, it doesn't necessarily assume that. I think in many deep IRL algorithms that does end up being an assumption that they use, but it's not a necessary one. It can still be uncertain, and then it would typically plan with respect to maximizing the expectation of the reward function. Although you could also try to be conservative or risk-sensitive, and then you wouldn't be maximizing expected reward; maybe you'd be maximizing, like, worst-case reward if you wanted to be maximally conservative, or fifth-percentile reward, or something like that.

Yeah.

So there can be uncertainty, but the human isn't in the environment, and there's this episodic assumption where the demonstration is one episode, and then when the robot is acting, that's a totally different episode. And that also isn't true in the assistance case.

You talk about active reward learning and interactive reward learning. Can you help us understand those two phrases and how they differ?

Yeah.

So active reward learning is just when, in the reward learning paradigm, the robot is given the ability to ask questions, rather than just getting to observe what the human is doing. So hopefully that one should be relatively clear.

The interactive reward learning setting is mostly just a thing we made up, because it was a thing that people often brought up as "maybe this will work", so we wanted to talk about it and show why it doesn't in fact work. The idea there is that you still have your two modules, one reward learning module and one control module, and they don't talk to each other; but instead of doing the reward learning thing once and then doing control forever, you do, I don't know, 10 steps of reward learning, then 10 steps of control, then 10 steps of reward learning, then 10 steps of control, and you keep alternating between the two stages.

So why is computational complexity
really high for algorithms that

try to optimize over assistance?

I think you mentioned that here.

Yeah.

So everything I've talked about has just sort of assumed that the agents are optimal by default. But if you think about it, what the optimal agent has to do is, you know, maintain a probability distribution over all of the possible reward functions that Alice could have, and then update it over time as it sees more and more of Alice's behavior. And as you probably know, full Bayesian updating over a large set of hypotheses is very computationally intractable.

Another way of seeing it is that if you take this assistance paradigm, you can, through a relatively simple reduction, turn it into a partially observable Markov decision process, or POMDP. The basic idea there is to treat the reward function theta as some unobserved part of the state; the reward function is then whatever that unobserved part of the state would say. And Alice's behavior is thought of as part of the transition dynamics, which depends on the unobserved part of the state, that is, on theta. So that's the rough reduction for how you phrase assistance as a POMDP.

And then POMDPs are known to be very computationally intractable to solve, again for basically the same reasons I was just saying: to actually solve them, you need to maintain a Bayesian probability distribution over all the ways the unobserved parts of the state could be, and that's just computationally intractable.
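
As a rough sketch of that reduction (again, my notation rather than the paper's): fold theta into the hidden part of the state, fold the human's theta-dependent policy into the dynamics, and the intractability shows up in the belief the solver has to carry around.

```latex
% Hedged sketch of the assistance-to-POMDP reduction described above.
% Hidden state: \tilde{s} = (s, \theta), with \theta never observed directly.
% Reward: \tilde{r}(\tilde{s}, a^R) = r(s, a^R; \theta).
% The human's behavior is folded into the transition dynamics:
\[
\tilde{\mathcal{T}}\big((s', \theta) \mid (s, \theta), a^R\big)
  \;=\; \sum_{a^H} \pi^H(a^H \mid s, \theta)\,
        \mathcal{T}(s' \mid s, a^H, a^R).
\]
% Solving the POMDP means maintaining a belief over \theta,
\[
b_{t+1}(\theta) \;\propto\; b_t(\theta)\, \pi^H(a^H_t \mid s_t, \theta),
\]
% and planning over all the beliefs you might reach, which is what makes exact
% solutions computationally intractable in general.
```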

So do you plan to work on this, on
this particular line of work further?

I think I don't plan to do further
direct research on this myself.

I still basically agree with the point of the paper, which is: look, when you're building your AI systems, they should be reasoning more in the way that the assistance paradigm suggests, where there's this integrated reward learning and control, and they shouldn't be reasoning in the way that the value learning paradigm suggests, where you first figure out what human values are and then optimize for them. And so I think that point is a pretty important point and will guide how we build AI systems in the future, or it will guide what we have our AI systems do.

And I think I will continue to push for that point, including in projects at DeepMind, but I probably won't be doing more technical research on the math in those papers specifically, because I think it said the things that I wanted to say. There's still plenty of work that one could do, such as trying to come up with algorithms to directly optimize the math that we wrote down, but that seems less high-leverage to me.

Okay.

Moving to the next paper, On the Utility of Learning about Humans for Human-AI Coordination; that was Carroll et al., with yourself as a coauthor. Can you tell us the brief, general idea here?

I think this paper was written in the wake of some pretty big successes of self-play. So self-play, or very similar variants of it, is the algorithm underlying OpenAI Five, which plays Dota; AlphaStar, which plays StarCraft; and AlphaZero, which plays, you know, Go, chess, shogi and so on at a superhuman level. These were some of the biggest results in AI around that time, and sort of suggested that self-play was going to be a really big thing.

And the point we were making in this paper is that self-play works well when you have a two-player zero-sum game, which is a perfectly competitive game, because it's effectively going to cause you to explore the full space of strategies. If you're playing against yourself in a competitive game and there's any flaw in your strategy, then gradient descent is going to push you in the direction of exploiting that flaw, because you're trying to beat the other copy of you.

So you're always driven to get better. In contrast, in common-payoff games, which are the most collaborative games, where every agent gets the same payoff as the others, though that payoff can differ depending on what happens, you don't have this similar incentive. You don't have any incentive to be unexploitable. All you want is to come up with some policy that, if played against yourself, will get the maximum reward. It doesn't really matter if you would play badly with somebody else, like a human; if that were true, it wouldn't come up in self-play. Self-play would be like: nah, in every single game you played, you got the maximum reward, there's nothing to do here.

So there's no force that's causing you to be robust to all of the possible partners that you could have. Whereas in the competitive game, if you weren't robust to all of the players that could possibly arise, then you're exploitable in some way, and then gradient descent is incentivized to find that exploit, after which you have to become robust to it.
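
A tiny coordination game makes the contrast concrete; this is my own illustrative example, not one from the paper.

```python
# Illustrative common-payoff game: two conventions, A and B. Both players get
# reward 1 if they pick the same convention, 0 otherwise.
import random

random.seed(0)

def payoff(a, b):
    return 1.0 if a == b else 0.0

# Self-play settles on some convention, say A (either one is optimal with itself).
self_play_policy = "A"

# Playing with a copy of yourself always scores 1.0, so training sees nothing
# left to improve and never explores the other convention.
print("self-play score:", payoff(self_play_policy, self_play_policy))

# But human partners may have settled on convention B about half the time.
human_choices = [random.choice(["A", "B"]) for _ in range(1000)]
score = sum(payoff(self_play_policy, h) for h in human_choices) / len(human_choices)
print("score with diverse human partners:", score)   # roughly 0.5

# In a zero-sum game this gap would be an exploitable flaw that training
# pressure would eventually remove; in a common-payoff game, no such pressure exists.
```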

Is there any way to reformulate it so
that there is that competitive pressure?

You can actually do this. And so I know you've had Michael Dennis, and I think also Natasha Jaques, on this podcast before, and both of them are doing work that's kind of like this.

With PAIRED, right? That was Jaques and Dennis.

Yeah.

The way you do it is you just say: all right, we're going to make the environment our competitor. The environment is going to try to make itself super complicated, in a way that defeats whatever policy we were trying to use to coordinate. And so then this makes sure that you have to be robust to whichever environment you find yourself in.

So that's one way to get robustness; well, it's getting you robustness to environments. It's not necessarily getting you robustness to your partners.
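
My rough sketch of the objective behind that idea, in notation that is mine rather than from any specific paper (PAIRED itself uses a more refined, regret-based version):

```latex
% Hedged sketch of adversarial environment design as a minimax problem.
\[
\max_{\pi} \; \min_{\lambda} \; U_\lambda(\pi),
\]
% where \pi is the coordination policy, \lambda parameterizes the environment,
% and U_\lambda(\pi) is the expected return. The environment "competitor" picks
% whichever \lambda is hardest for the current \pi, forcing \pi to be robust to
% environments, though not to different partners.
```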

When, for example, you want to cooperate with a human, you could do a similar thing there, where you say: we're going to also take the partner agent and make it adversarial.

Now, this doesn't work great if you literally make it adversarial, because in many interesting collaborative games, like Overcooked, which is the one that we were studying here, if your partner is an adversary, they can just guarantee that you get minimum reward. It's often not difficult for them to do that.

In Overcooked, you just stand in front of the station where you deliver the dishes that you've cooked, and you just stand there; that's what the adversary does. And then the agent is just like: well, okay, I can make a soup, but I can never deliver it, so I guess I never get the reward.

So that naive, simple approach doesn't quite work, but instead you can try a slightly more sophisticated method where, instead of being an adversarial partner, it's a partner that is trying to keep you on the edge of your abilities. Once your agent learns how to do well with the current partner, the partner tries to make itself a bit harder to play with, and so on.
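
A minimal sketch of that "keep the learner at the edge of its abilities" curriculum, with a toy stand-in for episodes; the numbers and structure here are illustrative, not any specific published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
num_conventions = 6
learner = np.zeros(num_conventions)   # learner's competence with each partner convention
partner_pool = [0]                    # start with a single, easy partner convention

def play_episode(learner, convention):
    # Toy stand-in for an Overcooked-style episode; return in [0, 1].
    return min(1.0, learner[convention])

for step in range(2000):
    convention = int(rng.choice(partner_pool))
    ret = play_episode(learner, convention)
    learner[convention] += 0.05 * (1.0 - ret)        # crude learning update

    # Curriculum: only once the learner handles the current pool well does the
    # partner make itself a bit harder, by introducing a new convention.
    pool_return = np.mean([play_episode(learner, c) for c in partner_pool])
    if pool_return > 0.9 and len(partner_pool) < num_conventions:
        partner_pool.append(len(partner_pool))

print("conventions introduced:", partner_pool)
print("competence per convention:", np.round(learner, 2))
```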

There are a few papers like this that I'm currently failing to remember, but there are papers that tried to do this sort of thing. I think many of them did end up following both the self-play work and that paper of ours. So yeah, basically I think you're right: you can in fact do some clever tricks to make things better and to get around this. It's not quite as simple and elegant as self-play, and I don't think the results are quite as good as what you get with self-play, because it's still not exactly the thing that you want.

So now we have a contributed question, which I'm very excited about, from Dr. Natasha Jaques, senior research scientist at Google AI and postdoc at Berkeley. We were lucky to have Natasha as our guest on episode one. Natasha asks: the most interesting questions are about why interacting with humans is so much harder, slash so different, than interacting with simulated RL agents.

So Rohin, what is it about humans that makes them harder?

Yeah, there are a bunch of factors here. Maybe the most obvious one, and probably the biggest one in practice, is that you can't just put humans in your environment and do a million steps of gradient descent on them, which we often do in fact do with our simulated RL agents. If you could somehow put a human in the loop for a million episodes, maybe then the resulting agent would in fact be really good at coordinating with humans. In fact, I might take out the "maybe" there; I will actually predict that the resulting agent would be good with humans, as long as you had a reasonable diversity of humans to collaborate with.

So my first and biggest answer is: you can't get a lot of data from humans the way you can get a lot of data from simulated RL agents. Or equivalently, you can't just put the human into the training loop the way you can put a simulated RL agent into the training loop. So that's answer number one.

And then there's another answer, which seems significantly less important, which is that humans are significantly more diverse than simulated RL agents. Humans don't all act the same way. Even an individual human will act pretty differently from one episode to the next, and humans will learn over time. So not only is their policy kind of stochastic, their policy isn't even stationary: the policy changes over time as they learn how to play the game and become better at it. That's another thing that RL usually assumes, that episodes are drawn IID, and that is not in fact true here. Because of this non-stationarity and stochasticity and diversity, you would imagine that you have to get a much more robust policy in order to work with humans than to work with simulated RL agents. And so that ends up being harder to do.

Sometimes people try to take their simulated RL agents and make them more stochastic, to be more similar to humans, for example by taking a random action with some small probability. I think usually this still ends up looking kind of artificial and forced when you look at the resulting behavior, such that it still doesn't require that robust a policy in order to collaborate well with those agents, and humans are just more challenging than that.
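
For example, the random-action trick mentioned here might look something like this wrapper; the class and names are illustrative, not from any particular codebase.

```python
import numpy as np

class EpsilonRandomPartner:
    """Wraps a partner policy so it takes a random action with small probability."""

    def __init__(self, base_policy, num_actions, epsilon=0.1, seed=0):
        self.base_policy = base_policy          # callable: observation -> action
        self.num_actions = num_actions
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)

    def act(self, observation):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.num_actions))   # occasional random action
        return self.base_policy(observation)

# Usage: wrap a trained partner before generating training episodes for the learner.
partner = EpsilonRandomPartner(base_policy=lambda obs: 0, num_actions=4, epsilon=0.1)
print([partner.act(None) for _ in range(10)])
```

As noted above, the resulting behaviour still tends to look artificial and forced compared to genuinely diverse, non-stationary human partners.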

Okay.

Let's briefly move to the next paper, Evaluating the Robustness of Collaborative Agents. That was Knott et al., with yourself as a co-author. Can you give us the short version of what this paper is about?

We just talked about how, in order to get your agents to work well with humans, they need to learn a pretty robust policy. And so one way of measuring how good your agents are at collaborating with humans is, well, you just have them play with humans and see how well that goes, which is a reasonable thing to do, and people should definitely do it. But this paper proposed a maybe simpler and more reproducible test that you can run more often.

It's the basic idea from software engineering: it's just a unit test. It's a very simple idea. The idea is just to write some unit tests for the robustness of your agents: write some cases in which you think the correct action is unambiguously clear, cases that you maybe expect not to come up during training, and then just see whether the agent does in fact do the right thing on those inputs. If the agent passes all of those tests, that's not a guarantee that it's robust, but if it fails some of those tests, then you definitely found some failures of robustness. I think in practice the agents that we tested all failed many tests. I don't remember the exact numbers off the top of my head, but I think some of the better agents were getting scores of maybe 70%.
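
A minimal sketch of what such a robustness unit-test harness could look like; the test cases and the agent stub below are hypothetical placeholders, not the paper's actual test suite.

```python
def agent_act(state):
    # Stand-in for the trained agent's policy; replace with the real model.
    return "wait"

# Each test: a state the designer thinks is unambiguous (often one that training
# never produced), plus the set of acceptable actions in that state.
ROBUSTNESS_TESTS = [
    {"name": "partner_blocks_delivery", "state": {"partner": "blocking"},
     "acceptable": {"use_other_counter", "wait"}},
    {"name": "soup_ready_nobody_delivering", "state": {"soup": "ready"},
     "acceptable": {"deliver_soup"}},
]

def run_tests(act_fn, tests):
    failures = [t["name"] for t in tests if act_fn(t["state"]) not in t["acceptable"]]
    score = 1.0 - len(failures) / len(tests)
    return score, failures

score, failures = run_tests(agent_act, ROBUSTNESS_TESTS)
print(f"passed {score:.0%} of tests; failures: {failures}")
# Passing everything is not a guarantee of robustness, but any failure is a
# concrete, reproducible robustness bug.
```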

Could we kind of say that this is related to the idea of sampling from environments outside of the training distribution, because we think these are samples related to the distribution that the agent would encounter after it's deployed? Would you phrase it that way, or is it going in a different direction?

Yes.

I think that's pretty close. I would say basically everything about that seems correct, except the part where you say it's probably going to arise in the test distribution. I usually just wouldn't even try to check whether or not it would appear in the test distribution; I guess that's very hard to do. If you knew how the test distribution was going to look and in what way it was going to be different from the train distribution, then you should just change your train distribution to be the test distribution. The fundamental challenge of robustness is exactly that you don't know what your test distribution is going to look like. So I would say it's more that we try to deliberately find situations that are outside the training distribution, but where a human would agree that there's one unambiguously correct answer, and test the agent on those cases. Maybe this will lead us to be too conservative, because actually the test was in a state that will never come up in the test distribution. But given that that seems very hard to know, I think it's still a good idea to be writing these tests and to take failures fairly seriously.

And this paper mentions three types of robustness. Can you briefly touch on the three types?

Yeah.

So this is basically a categorization that we found helpful in generating the tests, and it's somewhat specific to reinforcement learning agents. The first type is state robustness: these are test cases where the main thing you've changed is the state in which the agent is operating. Then there's agent robustness, which is when one of the other agents in the environment exhibits some behavior that's unusual and not what you expected. That can be further decomposed into two types. There's agent robustness without memory, where the test doesn't require the AI system to have any memory: there's a correct action that seems determinable even if the system doesn't have memory. So this might be what you want to use if, for some reason, you're using an MLP or a CNN as your architecture. And then there's agent robustness with memory, which is where the distribution shift comes from a partner agent in the environment doing something where you have to actually look at the behavior over time, notice that something is violating what you expected during training, and then take some corrective action as a result. So there you need memory in order to understand how the partner agent is doing something that wasn't what you expected.
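
Written down as data, that categorization might look something like the tags below; the field names and example tests are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum, auto

class RobustnessType(Enum):
    STATE = auto()                 # unusual state, partner behaves normally
    AGENT_WITHOUT_MEMORY = auto()  # unusual partner behaviour; the right response is
                                   # clear from the current observation alone
    AGENT_WITH_MEMORY = auto()     # unusual partner behaviour that can only be
                                   # detected by watching behaviour over time

@dataclass
class RobustnessTest:
    name: str
    robustness_type: RobustnessType
    description: str

TESTS = [
    RobustnessTest("ingredient_in_odd_spot", RobustnessType.STATE,
                   "an object appears somewhere training never produced"),
    RobustnessTest("partner_stands_still", RobustnessType.AGENT_WITHOUT_MEMORY,
                   "partner blocks a station; the right move is clear right now"),
    RobustnessTest("partner_switches_strategy", RobustnessType.AGENT_WITH_MEMORY,
                   "partner abandons the usual convention midway; the agent must "
                   "notice the change across timesteps and adapt"),
]

for t in TESTS:
    print(t.robustness_type.name, "-", t.name)
```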

And then I guess when we're dealing with a high-dimensional state, there's just a ridiculous number of permutations of situations. And we've seen in the past that deep learning especially can be really sensitive to small, seemingly meaningless changes in this high-dimensional state. So how could we possibly think about scaling this up to a point where we don't have to test every single thing?

I think you mostly just shouldn't try to scale this particular approach up in that way. It's more meant to be a first quick sanity check that is already quite hard to pass for current systems, where we're talking scores like 70%. Once you get to scores like 95 or 99%, then it's like, okay, that's the point to start thinking about scaling up. But suppose we got there, what do we then do? I don't think we really want to scale up the specific process of humans think of tests, humans write down tests, and then we run those on the AI system. I think at that point we want to migrate to a more alignment-flavored viewpoint, which I think we were going to talk about in the near future anyway.

But to talk about that a little bit in advance: I think once we scale up, we want to try to find cases where the AI system does something bad that it knew was bad, where it knew that it wasn't the thing that its designers intended. The reason this allows you to scale up is because now you can go and inspect the AI system, try to find facts that it knows, and leverage those in order to create your test cases. And one hopes that the set of things the AI knows is still plausibly a very large space, but hopefully not an exponentially growing space the way the state space is. The intuition for why this is okay is that, yes, the AI system may end up having accidents that wouldn't be caught if we were only looking for cases where the AI system made a mistake that it knew was a mistake. But usually those things aren't that bad. They can be, if your AI system is in a nuclear power plant, for example, or in a weapon system perhaps, but in many cases it's not actually that bad for your AI system to make an accidental error. The really bad errors are the ones where the system is intentionally making an error, or doing something that is bad from the perspective of the designers. Those are really bad situations and you don't want to get into them, and so I'm most interested in thinking about how we can avoid that. And so then you can try to leverage the agent's knowledge to construct inputs that you can then test the AI system on.

So this is a great segue to the alignment section. So how do you define alignment in AI?

Maybe I will give you two definitions that are slightly different but mostly the same. One is that an AI system is misaligned, so, not aligned, if it takes actions that it knew were against the wishes of its designers. That's basically the definition I was just giving earlier. A different, more positive definition of AI alignment is that an AI system is aligned if it is trying to do what its designers intended for it to do.

And is there some agreed-upon taxonomy of top-level topics in alignment? Like, how does it relate to concepts like AI safety and human feedback, the different things we talked about today? How would we arrange these at a kind of high level?

There is definitely not a canonical taxonomy of topics. There's not even a canonical definition. The one I gave doesn't include the problem, for example, of how you resolve disagreements between humans on what the AI system should do. It just says, all right, there are some designers, they wanted something, and that's what the AI system is supposed to be doing. It doesn't talk about the process by which those designers decide what the AI system is intended to do. That's not a part of the problem as I'm defining it. It's obviously still an important problem, just not part of this definition as I gave it. But other people would say, no, that's a bad definition, you should include that problem. So there's not even a canonical definition. So I think I will just give you maybe my taxonomy of alignment topics.

So in terms of how alignment relates to AI safety: there's this general big-picture question of how we get AI to be beneficial for humanity, which you might call AI safety or AI beneficialness or something. And that you can break down into a few possible categories. I'm going to forget where this taxonomy comes from, but I quite like the taxonomy into accidents, misuse, and structural risks.

Accidents are exactly what they sound like: accidents happen when an AI system does something bad and nobody intended for that AI system to do that thing. Misuse is also exactly what it sounds like: it's when somebody gets an AI system to do something, and the thing it got the AI system to do was something that we didn't actually want. So think of terrorists using AI assistants to assassinate people. And structural risks are maybe less obvious than the previous two, but structural risks happen when, as we infuse AI systems into our economy, new sorts of problems arise. Do we get into races to the bottom on safety? Do we have a whole bunch of increased economic competition that causes us to sacrifice many of our values in the name of productivity? Stuff like that.

So that's one starting categorization: accidents, misuse, structural risks. And within accidents, you can then further separate into accidents where the system knew that the thing it was doing was bad, and accidents where the system didn't know that the thing it was doing was bad. The first one is AI alignment, according to my definition, which again is not a canonical definition; I think it's maybe the most common definition, but it's not canonical.

So that was how alignment relates to AI safety. Then, how does the stuff we've been talking about today relate to alignment? Again, people will disagree with me on this, but according to me, the way to build aligned AI systems, in the sense of systems that don't take bad actions that they knew were bad, is that you use a lot of human feedback to train your AI system, where the human feedback rewards the AI system when it does stuff that humans want, and punishes the AI system when it does things that the human doesn't want. This doesn't solve the entire problem; you basically then just want to make the people providing your feedback as powerful and as competent as possible.

So maybe you could do some interpretability with the model that you're training, in order to understand how exactly it's reasoning, how it's making decisions. You can then feed that information to the humans who are providing feedback, and this can then maybe allow them to not just select AI systems that get the right outcomes, but to select AI systems that get the right outcomes for the right reasons. And that can help you get more robustness. You could also imagine that you have some other AI systems that are in charge of finding new hypothetical inputs on which the system you're training takes a bad action. So these systems say, here's this hypothetical input on which your AI system is doing a bad thing, and then the humans can be like, oh, that's bad, let's put it in the training dataset and give feedback on it, and so on.
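
As a very rough sketch of that overall loop, here is a toy version in which "human feedback" is simulated, a tiny reward model is fit to it, and candidate behaviours are then selected with the learned reward. Everything here is an illustrative stand-in, not DeepMind's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def human_feedback(features):
    # Stand-in for a human labeller: approves behaviour whose first feature
    # ("did what was asked") outweighs its second ("took a risky shortcut").
    return 1.0 if features[0] > features[1] else 0.0

# Collect labelled behaviour.
behaviours = rng.normal(size=(500, dim))
labels = np.array([human_feedback(b) for b in behaviours])

# Fit a tiny logistic-regression reward model by gradient descent.
w = np.zeros(dim)
for _ in range(2000):
    preds = 1.0 / (1.0 + np.exp(-behaviours @ w))
    w -= 0.1 * behaviours.T @ (preds - labels) / len(labels)

def learned_reward(features):
    return float(features @ w)

# "Policy improvement" in miniature: pick the candidate behaviour the learned
# reward model scores highest. In a real system this signal would train the
# policy, and the feedback providers would also get interpretability tools and
# adversarially-found failure cases to label, as described above.
candidates = rng.normal(size=(10, dim))
best = candidates[int(np.argmax([learned_reward(c) for c in candidates]))]
print("chosen behaviour approved by the simulated human:", bool(human_feedback(best)))
```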

So then I think BASALT would be maybe the most obviously connected here, since it was about how you train anything at all with human feedback, which is obviously a core piece of this plan. Preferences Implicit in the State of the World, it's less clear how that relates here. I think that paper makes more sense in a plan that's more like traditional value alignment, where your AI system maintains an explicit distribution over reward functions that it's updating with evidence. So I think that one is less relevant to this description.

The Benefits of Assistance paper is, I think, primarily a statement about what the AI system should do. So what we want our human feedback providers to be doing is to be seeing: hey, is this AI system thinking about what its users will want? If it's uncertain about what the users will want, does it ask for clarification, or does it just guess? We probably want it to ask for clarification rather than guessing if it's a sufficiently important thing, but if it's some probably insignificant thing, then it's fine if it guesses. And so through the human feedback you can then train a system that's being very assistive. The Overcooked paper, On the Utility of Learning about Humans for Human-AI Coordination, that one is, I think, not that relevant to this plan, unless you happen to be building an AI system that is playing a collaborative game. The Evaluating the Robustness paper is more relevant, in that part of the thing these human feedback providers are going to be doing is constructing inputs on which the AI system behaves badly, and then training the AI system not to behave badly on those inputs. In that sense it also fits into this overall story.

Cool.

Okay.

Can you mention a bit about your Alignment Newsletter? How would you describe the newsletter, how did you start it, and what's happening with it now?

The Alignment Newsletter is supposed to be a weekly newsletter that I write that summarizes recent content relevant to AI alignment. It has not been very weekly in the last couple of months because I've been busy, but I do intend to go back to making it a weekly newsletter.

I mean, the origin story is kind of funny. This was while I was a PhD student at the Center for Human-Compatible AI (CHAI) at UC Berkeley. We were just discussing that there were a lot of papers coming out all the time, as people will probably be familiar with, and it was hard to keep track of them all. And so someone suggested that, hey, maybe we should have a rotation of people who just search for all of the new papers that had arrived in the past week and send an email out to everyone listing links to those papers, so other people don't have to do the search themselves. And I said, look, I just do this every week anyway, I'm happy to take on this job; sending one email with a bunch of links is not hard, we don't need a rotation of people. So I did that internally at CHAI. Then, a couple of weeks later, I added a sentence that told people, hey, this is the topic, and maybe you should read it if you are interested in X, Y, and Z.

And so that happened for a while. And then I think I started writing slightly more extensive summaries, so that people didn't have to read the paper unless it was something they were particularly interested in. Right around that point, people were like, this is actually quite useful, you should make it public. And then I tested it a bit more, maybe for another three to four weeks internally at CHAI, and after that I released it publicly. It still did undergo a fair amount of improvement; I think maybe after 10 to 15 newsletters was when it felt more stable.

Yeah.

And now it's like, apart from the
fact that I've been too busy to do it

recently, it's been pretty stable for
the last, I don't know, two years or so.

Well, to the audience, I highly recommend the newsletter. And like I mentioned, when I first met you and heard about your Alignment Newsletter early on, at that point I didn't really appreciate the importance of alignment issues. And I gotta say, that really changed for me when I read the book Human Compatible by Professor Stuart Russell, who I gather is one of your PhD advisors. That book really helped me appreciate the importance of alignment-related stuff, and it was part of the reason that I sought you out to interview you. So I'm happy to recommend and plug that book to the audience. Professor Russell's awesome, and it's a very well-written book, full of great insight.

Yep.

I also strongly recommend this book. And since we're on the topic of the Alignment Newsletter, you can read my summary of Stuart Russell's book in order to get a sense of what it talks about, before you actually make the commitment of reading the entire book. You can find that on my website; under Alignment Newsletter there's a list of past issues. I think this was newsletter edition 69, but I'm not totally sure, you can check that.

And what was your website again?

It's just my first name and last name: rohinshah.com.

Okay, cool. I highly recommend doing that, to the audience.

And so I wanted to ask you about how alignment work is done. A common pattern that we might be familiar with in many ML papers is to show a new method and show some experiments. But is work in alignment fundamentally different? What does the work entail in alignment? Is there a lot of thought experiments, or how would you describe it?

There's a big variety of things. Some alignment work is in fact pretty similar to typical existing ML work. For example, there's a lot of alignment work that's like, can we make human feedback algorithms better? You start with some baseline and some task or environment in which you want to get an AI system to do something, and then you try to improve upon the baseline using some ideas that you thought about. Maybe it's somewhat different because you're using human feedback, whereas typical ML research doesn't involve human feedback, but that's not that big a difference; it's still mostly the same skills. So that's probably the kind that's closest to existing ML research.

There's also a lot of interpretability work, which again is working with actual machine learning models and trying to figure out what the heck they're doing. That also seems pretty similar; it's not the same thing as getting better performance on some task, but it's still pretty similar to some parts of the general field of machine learning. So that's one type of alignment research.

And then, on the complete other side, there's a bunch of stuff where you think very abstractly about what future AI systems are going to look like. Maybe you think about some story by which AGI might arise: we run such-and-such algorithm, maybe with some improvements and various architectures, with such-and-such data, and it turns out you can get AGI out of this. Then you think, in this hypothetical, okay, does this AGI end up getting misaligned? If so, how does it get misaligned? You tell that story, and then you're like, okay, now I have a story of how the AGI system was misaligned; what would I need to do in order to prevent this from happening? So you can do pretty elaborate conceptual thought experiments.

I think these are usually good as a way of ensuring that the things you're working on are actually useful. I think there are a few people who do these sorts of conceptual arguments almost always, and do them well, such that I'm like, yeah, the stuff they're producing is probably going to matter in the future. But I think it's also very easy to end up not very grounded in what's actually going to happen, such that you end up saying things that won't actually be true in the future, and notably, where there is some reasonably easy-to-find argument today that could convince you that the things you're saying are not going to matter in the future. So it's pretty hard to do this research because of the lack of actual empirical feedback loops. But I don't think that means it's doomed. I think people do in fact get some interesting results out of this, and often the best results out of this line of work seem better to me than the results that we get out of the empirical line of work.

So you mentioned your newsletter, and then there's the Alignment Forum. If I understand it, that sprang out of LessWrong. Is that right?

I don't know if I would say it sprang out of LessWrong. It was meant to be at least somewhat separate from it, but it's definitely affiliated with LessWrong, and everything on it gets cross-posted to LessWrong.

And so these are pretty advanced resources, from my point of view. But for the audience who maybe is just getting started with these ideas, can you recommend a couple of resources that might be a good on-ramp for them? I guess including Human Compatible, but anything else you'd want to mention?

Yeah.

So Human Compatible is a pretty good suggestion. There are other books as well. Superintelligence is more on the philosophy side. The Alignment Problem by Brian Christian has a little bit less on what solutions might look like; it has more of the intellectual history behind how these concerns started arising. Life 3.0 by Max Tegmark, I don't remember how much it talks about alignment, I assume it does a decent amount, but that's another option. Apart from books, the Alignment Forum has sequences of blog posts that don't require quite as much technical depth. For example, it's got the value learning sequence, of which I wrote half and curated other people's posts for the rest, so I think that's a good introduction to some of the ideas in alignment. There's the embedded agency sequence, also on the Alignment Forum, and the iterated amplification sequence on the Alignment Forum. Oh, and there's an AGI safety fundamentals course; you can just Google it, it has a publicly available curriculum, I believe. I think, really, ignore all the other suggestions, look at that curriculum and then read things on there. That's probably actually my advice.

Have you seen any depictions of alignment issues in science fiction? Do these ideas come up for you when you watch or read sci-fi?

They definitely come up to some extent. I think there are many ways in which the depictions aren't realistic, but they do come up. Or I guess even outside sci-fi, even in mythology, the whole Midas touch thing seems like a perfect example of misalignment.

Yeah, the King Midas example is a good example.

Yeah, those are good examples.

Yeah, that's true.

If you expand to include mythology in general, I feel like it's probably everywhere, especially if you include things like: you asked for something and got what you literally asked for, but not what you actually meant.

That's really common in stories, isn't it?

Yeah.

I mean, I could probably take just about any story, and this would feature in it.

So they really started the alignment literature back then, I guess.

It's thousands of years old, the problem of: there are two people, and one person wants the other person to do something. That's just a very important, fundamental problem that you need to deal with. There's tons of stuff in economics about this too, right, the principal-agent problem. The alignment problem is not literally the same thing: in the principal-agent problem, the agent already has some motivation, some utility function, and you're trying to incentivize them to do the things that you want, whereas in AI alignment you get to build the agent that you're delegating to, and so you have more control over it. So there are differences, but fundamentally, "entity A wants entity B to do something for it" is just a super common pattern that human society has thought about a lot.

So we have some more contributed questions. This one is from Nathan Lambert, a PhD student at UC Berkeley doing research on robot learning; Nathan was our guest for episode 19. Nathan says: a lot of AI alignment and AGI safety work happens on blog posts and forums. What's the right manner to draw more attention from the academic community? Any comment on that?

I think this is basically a reasonable strategy, where by doing this work on blog posts and forums, people can move a lot faster. ML is pretty good in that, relative to other academic fields, it doesn't take years to publish your paper, it only takes some months. But with blog posts and forums, it can be days to talk about your ideas. So you can move a lot faster if you're trusting in everyone's ability to understand which work is good and what to build on. That's, I think, the main benefit of blog posts and forums. But then, as a result, anyone who isn't an expert correctly doesn't end up reading the blog posts and forums, because it's a little hard if you're not an expert to extract the signal and ignore the noise. So I think there's then a group of people, I wouldn't say a separate group, who take a bunch of these ideas and convert them into more rigorous, correct, and academically presented ideas and papers. And that's the thing you can show to the academic community in order to draw more attention. In fact, we've just been working on a project along these lines at DeepMind, which hopefully we'll release soon, talking about the risks from inner misalignment. So yeah, I think roughly my story is: you figure out conceptually what you want to do via the blog posts and forums, and then you make it rigorous, with experiments that demonstrate things with actual examples instead of hypothetical ones, in the format of an academic paper. And that's how you make it credible enough and convincing enough to draw attention from the academic community.

Great.

And then Taylor Killian asks a question. Taylor's a PhD student at U of T and the Vector Institute, and was our guest for episode 13. Taylor asks: how can we approach the alignment problem when faced with heterogeneous behavior from possibly many human actors?

My interpretation of this question is that humans sometimes disagree on what things to value, and similarly disagree on what behaviors they exhibit and want the AI to exhibit. So how do you get the AI to decide on one set of values or one set of behaviors? As I talked about a little bit before, I mostly just take this question to be outside the scope of the things I usually think about. I'm usually thinking: the designers have something in mind that they want the system to do. Did the AI system actually do that thing, or at least, is it trying to do that thing?

I do think this is in fact an important problem, but I think the solutions are probably going to be more political or societal rather than technical, where you have to negotiate with other people to figure out what exactly you want your AI systems to be doing, and then you take that spec and hand it off to the AI designers, and then the AI designers say, all right, now we will make an AI system with that spec.

Yeah.

So I would say, yeah, there's a separate problem of how to go from human society to something that we can put inside of an AI. That's the domain of a significant portion of social science, and it has technical aspects too. Social choice theory, for example, I think has at least some technical people trying to do mechanism design to solve these problems. And that seems great, people should do that, it's a good problem to solve. It's unfortunately not one I have thought about very much, but I do feel pretty strongly about the factorization into one problem, which is figuring out what exactly you want to put into the AI system, and then the other part of the problem, which I call the alignment problem, which is how do you take that thing you want to put into the system and actually put it into the AI system.

Okay, cool.

And Taylor also asks: how do we best handle bias when learning from human expert demonstrations?

Okay. This is a good question, and I would say it's an open question in the field.

So I don't have a great answer to it, but there are some approaches that people have taken. One simple thing is to get demonstrations from a wide variety of humans, and hope that, to the extent that they're making mistakes, some of those mistakes will cancel out. You can also invest additional effort: you get a bunch of demonstrations and then you invest a lot of effort into evaluating the quality of each of those demonstrations. Then you can label each demonstration with how high-quality it is, and design an algorithm that takes the quality into account when learning. Or, the simplest thing is you just discard everything that's too low quality and only keep the high-quality ones. But there are some algorithms that have been proposed that can make use of the low-quality ones while still trying to get to the performance of the high-quality ones.
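
A minimal sketch of the quality-labelling idea, as weighted behaviour cloning on a toy tabular problem; the weighting scheme and names are illustrative, not a specific published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 10, 4

# Demonstrations: (state, action) pairs, each tagged with a human quality label in [0, 1].
demos = [
    {"pairs": [(s, s % num_actions) for s in range(num_states)], "quality": 0.9},                 # expert-ish
    {"pairs": [(s, int(rng.integers(num_actions))) for s in range(num_states)], "quality": 0.2},  # sloppy
]

# Weighted behaviour cloning over a tabular policy: count actions per state,
# weighting each demonstration by its quality. Low-quality data still contributes,
# just much less; an alternative is to discard anything below a threshold.
counts = np.zeros((num_states, num_actions))
for demo in demos:
    for state, action in demo["pairs"]:
        counts[state, action] += demo["quality"]

policy = counts / counts.sum(axis=1, keepdims=True)
print("action chosen in state 3:", int(np.argmax(policy[3])))   # follows the high-quality demo
```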

Another approach people have tried is to guess what sorts of biases are present and then try to build algorithms that correct for those biases. In fact, one of my older papers looks into an approach of this form. I think we did get results that were better than the baseline, but I don't think it was all that promising, so I mostly did not continue working on that approach. It just seems kind of hard to know exactly which biases are going to happen and to then correct for all of them.

Right.

So those are a few thoughts on
how you can try to handle bias.

I don't think we know the
best way to do it yet.

Cool. Thanks so much to Taylor and Nathan and Natasha for the contributed questions. You can also contribute questions for our next interviews if you show up on our Twitter at @TalkRLPodcast. So we're just about wrapping up here; a few more questions for you today.

Rohin, what would you say is the holy grail for your line of research?

I think the holy grail is to have a procedure for training AI systems to do particular tasks, where we can apply arbitrary human-understandable constraints to how the system achieves those tasks. So for example, we can build an AI assistant that schedules your meetings, but ensure that it's always very respectful when it's talking to other people in order to schedule your meetings, and that it's never discriminating based on sex or something like that. Or you can build an agent that plays Minecraft, and you can deploy it on an entirely new multiplayer server that includes both humans and AI systems, and then you can say, hey, you should just go help such-and-such player with whatever it is they want to do, and the agent just does that, and it abides by the norms on the multiplayer server that it joined. Or you can build a recommender system that's just optimizing for what humans think is good for recommender systems to be doing, rather than optimizing for, say, engagement, if we think that engagement is a bad thing to be optimizing for.

So how do you see your research career plan? Do you have a clear roadmap in mind, or are you doing a lot of exploration as you go?

I feel like, maybe I wouldn't call it a roadmap exactly, but there's a clear plan. And the plan, we talked a bit about it earlier. The plan is roughly: train models using human feedback, and then empower the humans providing the feedback as much as we can, ideally so that they can know everything that the model knows and select the models that are getting the right outcomes for the right reasons. I'd say that's the plan; that's an ideal to which we aspire. We will probably not actually reach it, knowing everything that the model knows is a pretty high bar and probably we won't get to it, but there are a bunch of tricks we can do that get us closer and closer to it, and the closer we get to it, the better we're doing. So, let us find more and more of those tricks, find which ones are the best, see how cost-efficient they are, how costly they are, and so on. And ideally this just leads to a significant improvement in our ability to do these things over time.

I will say, though, that it took me several years to get to this point. For most of the previous years of my career, I have in fact been doing a significant amount of exploration, which is part of why not all of the papers that we've talked about so far really fit into this story.

Is there anything else you want to mention to our audience today, Rohin?

Yeah, so I'm probably going to start a hiring round at DeepMind for my own team, probably sometime in the next month from the time of recording; today is March 22nd. So yeah, please do apply if you're interested in working on AI alignment.

Great. Dr. Rohin Shah, this has been an absolute pleasure and a total honor. I want to thank you on behalf of myself and our audience.

Yeah, thanks for having me on. It was really fun to actually go through all of these papers in a single session; I don't think I've ever done that before.