[00:00.000 -- 00:01.960]  the answer was affirmative.
[00:01.960 -- 00:05.680]  We can get an agent to basically use a set of tools
[00:05.680 -- 00:06.520]  that we give it.
[00:06.520 -- 00:09.440]  In this case, the browsing commands, like searching.
[00:09.440 -- 00:13.880]  I would say I expect AI to be able to do a better job
[00:13.880 -- 00:16.860]  than humans at most jobs that humans do now,
[00:16.860 -- 00:17.980]  five years or so.
[00:19.660 -- 00:20.500]  Talk RL.
[00:22.660 -- 00:26.700]  Talk RL podcast is all reinforcement learning all the time,
[00:26.700 -- 00:29.900]  featuring brilliant guests, both research and applied.
[00:29.900 -- 00:33.520]  Join the conversation on Twitter at @TalkRLPodcast.
[00:33.520 -- 00:35.160]  I'm your host, Robin Chauhan.
[00:39.500 -- 00:41.900]  John Schulman is a co-founder of OpenAI
[00:41.900 -- 00:44.480]  and a researcher and engineer at OpenAI.
[00:44.480 -- 00:46.360]  He is well known for major contributions
[00:46.360 -- 00:48.440]  to the field of reinforcement learning,
[00:48.440 -- 00:50.760]  including the TRPO algorithm,
[00:50.760 -- 00:52.920]  that's Trust Region Policy Optimization,
[00:52.920 -- 00:56.000]  GAE, Generalized Advantage Estimation.
[00:56.000 -- 00:58.120]  Those are from his UC Berkeley dissertation.
[00:58.120 -- 01:02.080]  And TRPO's descendant, Proximal Policy Optimization, or PPO.
[01:02.080 -- 01:06.040]  His current focus at OpenAI is on RL from human feedback.
[01:06.040 -- 01:08.320]  John, welcome to the show and thanks so much for being here.
[01:08.320 -- 01:09.360]  Thanks a lot for having me.
[01:09.360 -- 01:11.380]  You were literally one of the first people I thought of
[01:11.380 -- 01:13.840]  when I started the show three years back.
[01:13.840 -- 01:14.880]  Thanks, I'm honored.
[01:14.880 -- 01:17.320]  It means a lot to me to have you here today.
[01:17.320 -- 01:20.920]  I definitely remember your nuts and bolts of deep RL video
[01:20.920 -- 01:23.240]  back in the day and watching that multiple times
[01:23.240 -- 01:24.360]  and gaining a lot from that.
[01:24.360 -- 01:26.200]  So I think you helped probably a generation
[01:26.200 -- 01:28.640]  of RL practitioners back then.
[01:28.640 -- 01:31.280]  By the way, there's going to be a reboot
[01:31.280 -- 01:33.360]  of the nuts and bolts presentation.
[01:33.360 -- 01:37.320]  I got invited to give a talk at NeurIPS this year on it.
[01:37.320 -- 01:41.200]  So I'll have to revamp the guidelines and everything.
[01:41.200 -- 01:42.120]  So that'll be fun.
[01:42.120 -- 01:42.960]  Oh, that's awesome.
[01:42.960 -- 01:43.780]  Can't wait for that.
[01:43.780 -- 01:47.240]  So you were clearly one of the earlier pioneers in deep RL.
[01:47.240 -- 01:49.640]  So how did you choose to move your focus to RL
[01:49.640 -- 01:50.800]  from human feedback?
[01:50.800 -- 01:52.560]  And why is that an important problem?
[01:52.560 -- 01:53.740]  Why is that important to you?
[01:53.740 -- 01:57.560]  After GPT-3 was trained, I was blown away by how smart it was.
[01:57.560 -- 02:00.040]  And I realized the next frontier was figuring out
[02:00.040 -- 02:02.000]  how to make language models actually useful.
[02:02.000 -- 02:03.800]  I'm still really interested in RL,
[02:03.800 -- 02:07.400]  but solving RL benchmarks isn't the end of the story.
[02:07.400 -- 02:10.360]  To use your RL algorithm, you need a reward function.
[02:10.360 -- 02:12.680]  But where does the reward function come from?
[02:12.680 -- 02:15.160]  In RL benchmarks, you usually just code up
[02:15.160 -- 02:16.020]  the reward function.
[02:16.020 -- 02:18.320]  But if you're not in a simulator environment,
[02:18.320 -- 02:19.160]  that doesn't work.
[02:19.160 -- 02:23.280]  So what we have to do in any kind of real world use case
[02:23.280 -- 02:25.160]  is have humans look at what the AI did
[02:25.160 -- 02:26.680]  and decide if it was good or bad.
[02:26.680 -- 02:29.200]  So how exactly you define this reward
[02:29.200 -- 02:31.800]  becomes a really challenging and important problem,
[02:31.800 -- 02:34.160]  especially as the tasks get harder to evaluate.
[02:34.160 -- 02:37.240]  Another angle on this is that language models are very smart,
[02:37.240 -- 02:40.400]  but it's hard to get them to do anything useful.
[02:40.400 -- 02:43.200]  A big part of that is they're not necessarily
[02:43.200 -- 02:44.240]  trying to do what you want.
[02:44.240 -- 02:46.400]  They're just trying to imitate the training corpus.
[02:46.400 -- 02:48.440]  So that means there's a big opportunity
[02:48.440 -- 02:50.640]  to improve them a lot by just giving them
[02:50.640 -- 02:51.600]  the right objective.
[02:51.600 -- 02:55.280]  That's what we can do by applying RL to these language
[02:55.280 -- 02:58.560]  models using human feedback to define the reward.
[02:58.560 -- 03:02.560]  Is using human feedback harder or very different in some way
[03:02.560 -- 03:04.360]  than using a synthetic reward?
[03:04.360 -- 03:06.600]  There are a lot of new complications.
[03:06.600 -- 03:09.800]  Now you have to collect a data set dynamically.
[03:09.800 -- 03:12.160]  So you're always in the business of building data
[03:12.160 -- 03:14.720]  sets of human preferences.
[03:14.720 -- 03:17.160]  Often the data quality there matters more
[03:17.160 -- 03:19.320]  than various algorithmic details.
[03:19.320 -- 03:22.440]  And you also have to think a lot about exactly how you're
[03:22.440 -- 03:24.360]  giving the task to the human trainers
[03:24.360 -- 03:25.680]  and various other things that you
[03:25.680 -- 03:27.360]  wouldn't have thought about if you just
[03:27.360 -- 03:29.040]  had a programmatic reward function.
[03:29.040 -- 03:31.080]  Does the difference between human raters
[03:31.080 -- 03:34.200]  or the noisiness of the reward signal cause any problems?
[03:34.200 -- 03:36.640]  I would say the noise, definitely
[03:36.640 -- 03:40.320]  you need to be below some threshold of noise
[03:40.320 -- 03:41.360]  to learn anything.
[03:41.360 -- 03:44.160]  I think, in general, a large noisy data
[03:44.160 -- 03:47.640]  set can be as good as a smaller, clean data set.
[03:47.640 -- 03:50.640]  So actually, noise isn't the thing that worries me the most.
[03:50.640 -- 03:53.600]  It's more that there are sometimes consistent biases
[03:53.600 -- 03:54.680]  that people have.
[03:54.680 -- 03:58.920]  For example, in settings like question answering or settings
[03:58.920 -- 04:02.000]  where you have a model writing some text,
[04:02.000 -- 04:04.160]  often people prefer longer answers.
[04:04.160 -- 04:06.680]  You end up with these very verbose answers.
[04:06.680 -- 04:08.880]  If you're not careful with the instructions, that is.
[04:08.880 -- 04:12.000]  I mean, you can also instruct people, the raters,
[04:12.000 -- 04:14.440]  to reward brevity.
[04:14.440 -- 04:17.200]  But if you're not careful, you can
[04:17.200 -- 04:19.360]  incentivize the wrong kinds of behaviors.
[04:19.360 -- 04:21.480]  So let's move to some of your recent work.
[04:21.480 -- 04:24.640]  First up is WebGPT: Browser-assisted
[04:24.640 -- 04:26.200]  question-answering with human feedback.
[04:26.200 -- 04:30.000]  That's Nakano et al. with yourself as a co-author in 2021.
[04:30.000 -- 04:32.880]  Can you tell us what is the main idea of this paper?
[04:32.880 -- 04:33.880]  What is WebGPT?
[04:33.880 -- 04:37.720]  In WebGPT, we basically took our language models
[04:37.720 -- 04:40.040]  and we hooked them up to a web browser
[04:40.040 -- 04:42.520]  so they could retrieve information from the web.
[04:42.520 -- 04:44.480]  And they can write an answer by summarizing
[04:44.480 -- 04:45.960]  the relevant pages from the web.
[04:45.960 -- 04:48.760]  So that way if you're asking a question about current events
[04:48.760 -- 04:51.520]  or a question that requires some detailed scientific
[04:51.520 -- 04:53.840]  or technical knowledge, this AI can go out
[04:53.840 -- 04:56.680]  and look up the answer, with detailed citations
[04:56.680 -- 04:57.560]  to its sources.
[04:57.560 -- 05:00.320]  So I would say there's kind of two interesting points
[05:00.320 -- 05:01.160]  to this.
[05:01.160 -- 05:03.600]  One is we were exploring whether you could turn language
[05:03.600 -- 05:05.360]  models into a kind of agent.
[05:05.360 -- 05:07.840]  There's a lot of data on the web of different texts
[05:07.840 -- 05:09.920]  that people have written, but there's not a lot of data
[05:09.920 -- 05:13.360]  that shows how to actually do some multi-step process.
[05:13.360 -- 05:15.400]  So it's not that clear a priori
[05:15.400 -- 05:16.880]  whether you can get a language model
[05:16.880 -- 05:19.600]  to actually carry out some iterative process.
[05:19.600 -- 05:22.480]  We just have a lot of data like writing essays
[05:22.480 -- 05:23.960]  and having chats and so forth.
[05:23.960 -- 05:25.840]  So that was one thing we were exploring here.
[05:25.840 -- 05:28.120]  And I think the answer was affirmative.
[05:28.120 -- 05:32.280]  We can get an agent to basically use a set of tools
[05:32.280 -- 05:34.880]  that we give it, in this case, the browsing commands
[05:34.880 -- 05:37.480]  like searching, scrolling, clicking on links.
[05:37.480 -- 05:40.560]  The second theme of this paper was around truthfulness.
[05:40.560 -- 05:44.120]  I mean, a big issue with language models is,
[05:44.120 -- 05:45.600]  I mean, they're not very reliable
[05:45.600 -- 05:47.080]  at giving you true information.
[05:47.080 -- 05:49.680]  They know a vastly superhuman amount,
[05:49.680 -- 05:51.640]  but if you prompt them in the wrong way,
[05:51.640 -- 05:54.520]  they'll just output lots of plausible sounding nonsense.
[05:54.520 -- 05:57.680]  So how to fix that is a big research question
[05:57.680 -- 05:59.800]  or one of the biggest research questions
[05:59.800 -- 06:01.640]  in the world of language models.
[06:01.640 -- 06:03.480]  I think it's gonna be challenging to fully fix it,
[06:03.480 -- 06:06.960]  but I think a big part of the story involves retrieval
[06:06.960 -- 06:10.520]  and having models write answers that contain citations,
[06:10.520 -- 06:12.600]  citations to trusted sources.
[06:12.600 -- 06:14.440]  So a person who's checking over the answer
[06:14.440 -- 06:16.160]  doesn't have to go and try to figure out
[06:16.160 -- 06:18.200]  where the model might've gotten this idea.
[06:18.200 -- 06:20.520]  They can go and directly look at the source
[06:20.520 -- 06:23.280]  and see if it supports the AI's statement.
[06:23.280 -- 06:25.960]  With WebGPT, we just wanted to see
[06:25.960 -- 06:28.520]  if we do give the language model
[06:28.520 -- 06:30.400]  a really flexible interface of the web,
[06:30.400 -- 06:33.240]  can we have it answer hard questions truthfully
[06:34.440 -- 06:36.280]  with the help of all these citations?
[06:36.280 -- 06:38.360]  And it's actually really non-trivial
[06:38.360 -- 06:41.040]  because if you look at the dataset we use,
[06:41.040 -- 06:43.280]  the Reddit Explain Like I'm Five.
[06:43.280 -- 06:44.680]  The questions are really varied,
[06:44.680 -- 06:46.840]  like some of them are about science, history,
[06:46.840 -- 06:49.560]  current events, like our raters didn't necessarily
[06:49.560 -- 06:51.520]  know anything about these topics,
[06:51.520 -- 06:55.760]  but still they had to judge the detailed answers.
[06:55.760 -- 06:57.640]  So it would have been really hard to do it
[06:57.640 -- 06:59.960]  without the supporting citations.
[06:59.960 -- 07:04.000]  So we kind of validated that we could get good feedback
[07:04.000 -- 07:07.440]  in a hard domain like this with the help of citations.
[07:07.440 -- 07:10.680]  Can you talk about where the idea for WebGPT came from?
[07:10.680 -- 07:13.000]  Is that an idea you've had kicking around for a while
[07:13.000 -- 07:15.800]  or was it something that came up recently before the paper?
[07:15.800 -- 07:17.760]  How did that play out?
[07:17.760 -- 07:19.800]  Some of the ideas had been floating around,
[07:19.800 -- 07:22.400]  like we thought that we actually had a project
[07:22.400 -- 07:26.160]  at OpenAI very early on called World of Bits.
[07:26.160 -- 07:28.520]  We were looking at controlling web browsers
[07:28.520 -- 07:31.120]  or doing tasks that involved tasks on the internet
[07:31.120 -- 07:32.360]  with the web browser,
[07:32.360 -- 07:34.520]  but it was way too early at the time.
[07:34.520 -- 07:38.120]  So we kind of abandoned it for a few years.
[07:38.120 -- 07:40.240]  Actually we were trying to, back then we were trying to do it
[07:40.240 -- 07:41.480]  with full visual input.
[07:41.480 -- 07:45.040]  So we thought, yeah, we could give some instructions
[07:45.040 -- 07:48.880]  to the agent, like go and figure out the address
[07:48.880 -- 07:51.000]  of this building or something.
[07:51.000 -- 07:54.000]  The agent would go and search the web
[07:54.000 -- 07:57.000]  or use Google Maps or whatever to figure out the answer.
[07:57.000 -- 07:58.760]  And we were trying to do this all in pixels.
[07:58.760 -- 08:00.640]  That obviously didn't work very well,
[08:00.640 -- 08:03.640]  but now we have these great language models
[08:03.640 -- 08:05.680]  that work on text data.
[08:05.680 -- 08:08.960]  We can also extract the text out of web pages
[08:08.960 -- 08:12.000]  to get most of the information.
[08:12.000 -- 08:15.280]  We can't really interact with a lot of dynamic websites.
[08:15.280 -- 08:16.960]  Yeah, where there's a lot of JavaScript
[08:16.960 -- 08:18.000]  and images and so forth,
[08:18.000 -- 08:19.960]  but as long as it's just browsing
[08:19.960 -- 08:21.760]  and reading texts, we're fine.
[08:21.760 -- 08:24.320]  So yeah, we had good enough models
[08:24.320 -- 08:27.880]  and that made it kind of feasible to revisit this idea
[08:27.880 -- 08:30.960]  of using the internet as an environment.
[08:30.960 -- 08:33.640]  So I would say that was one of the sources
[08:33.640 -- 08:36.760]  of inspiration, that long kind of thread
[08:36.760 -- 08:39.320]  about like using the internet as an environment.
[08:39.320 -- 08:44.320]  Another motivation was just after we started playing
[08:44.680 -- 08:47.920]  with GPT-3, we noticed that it had all these problems
[08:47.920 -- 08:51.400]  with factual accuracy and the reliability
[08:51.400 -- 08:52.920]  of the information it was giving us.
[08:52.920 -- 08:56.280]  So that kind of motivated doing more research
[08:56.280 -- 08:58.960]  on how to make language models more truthful.
[08:58.960 -- 09:01.040]  We were kind of brainstorming what to do there
[09:01.040 -- 09:05.480]  and we went through some docs and eventually decided
[09:05.480 -- 09:07.760]  that we wanted to try some question answering
[09:07.760 -- 09:09.800]  like using the web, looking up knowledge
[09:09.800 -- 09:11.560]  on the web to help answer questions.
[09:11.560 -- 09:12.880]  So actually the original version
[09:12.880 -- 09:15.000]  of the project used trivia questions.
[09:15.000 -- 09:18.400]  So there's this well-known dataset, TriviaQA,
[09:18.400 -- 09:20.080]  that has some basic trivia questions.
[09:20.080 -- 09:23.600]  So we first worked a little bit on that dataset
[09:23.600 -- 09:26.960]  and tried to see if we could boost the model's accuracy
[09:26.960 -- 09:29.840]  by giving it web search.
[09:29.840 -- 09:33.040]  And yeah, that actually worked quite straight.
[09:33.040 -- 09:34.160]  That worked pretty easily.
[09:34.160 -- 09:36.120]  So then we decided to move on
[09:36.120 -- 09:38.080]  to long form question answering.
[09:38.080 -- 09:41.880]  And so that gave us the, that was the project
[09:41.880 -- 09:43.880]  we ended up working on for a while.
[09:43.880 -- 09:47.080]  Seems like you use a few different datasets here
[09:47.080 -- 09:49.800]  and a number of different training methods.
[09:50.760 -- 09:52.600]  I'll just mention the list: behavior cloning,
[09:52.600 -- 09:55.080]  reward modeling, reinforcement learning
[09:55.080 -- 09:56.800]  and rejection sampling.
[09:56.800 -- 10:00.520]  So we were using a fairly standard methodology
[10:00.520 -- 10:03.240]  which was actually adapted from previous work
[10:03.240 -- 10:05.600]  on RL from human preferences.
[10:05.600 -- 10:09.120]  So the pipeline is you first train a model
[10:09.120 -- 10:13.320]  with supervised learning where you have human demonstrators
[10:13.320 -- 10:15.560]  show how to do the task, like show how to map
[10:15.560 -- 10:17.160]  from observations to actions.
[10:17.160 -- 10:19.280]  Yeah, so that's the supervised learning
[10:19.280 -- 10:20.440]  or behavior cloning step.
[10:20.440 -- 10:24.400]  Then we train a reward model or a preference model.
[10:24.400 -- 10:28.320]  It looks at two actions or two trajectories
[10:28.320 -- 10:29.720]  and decides which one is better.
[10:29.720 -- 10:32.640]  In this case, like in a question answering setting
[10:32.640 -- 10:33.880]  you're looking at two answers
[10:33.880 -- 10:35.480]  and deciding which answer is better.
[10:35.480 -- 10:37.440]  And we use that to train a reward model
[10:37.440 -- 10:39.640]  that assigns higher score to the good answers
[10:39.640 -- 10:40.480]  than the bad ones.
[10:40.480 -- 10:41.840]  Then you do reinforcement learning
[10:41.840 -- 10:43.160]  against that reward function.
[10:43.160 -- 10:45.560]  And of course you can iterate these last two steps
[10:45.560 -- 10:46.960]  after you do a little RL.
[10:46.960 -- 10:49.520]  Now you're, you've sort of exploited some of the flaws
[10:49.520 -- 10:52.080]  of the reward model, like, or some of the noise
[10:52.080 -- 10:53.200]  in the reward model.
[10:53.200 -- 10:55.120]  And it's not necessarily accurate
[10:55.120 -- 10:56.760]  on your new distribution of data.
[10:56.760 -- 10:59.040]  You recollect more pairs of samples
[10:59.040 -- 11:01.680]  and refit this preference model.
[11:01.680 -- 11:04.000]  And then you do another iteration of RL.
[11:04.000 -- 11:06.160]  So that's like, that's the whole RL
[11:06.160 -- 11:07.600]  from human feedback pipeline.
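[Editor's note] The reward-modeling step described above, where a model learns to score the preferred answer higher than the rejected one, can be sketched roughly as follows. This is a minimal illustration, not the method used in the paper: the linear reward model, the two toy features, the Bradley-Terry style loss, and the names `reward`, `preference_loss`, and `train` are all assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def reward(weights, features):
    # Hypothetical linear reward model: score = weights . features.
    return sum(w * f for w, f in zip(weights, features))


def preference_loss(weights, preferred, rejected):
    # Negative log-likelihood that the preferred answer wins:
    # -log sigmoid(r_preferred - r_rejected) = softplus(-(margin)).
    margin = reward(weights, preferred) - reward(weights, rejected)
    return softplus(-margin)


def train(comparisons, dim, lr=0.1, epochs=200):
    # Plain stochastic gradient descent over the pairwise comparisons.
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d(loss)/d(margin) = -(1 - sigmoid(margin))
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights


# Toy comparisons (assumed features): [has_supporting_citation, length].
# In each pair the first answer is the one the rater preferred.
comparisons = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([1.0, 0.8], [0.0, 0.1]),
    ([1.0, 0.5], [0.0, 0.5]),
]
weights = train(comparisons, dim=2)
```

After training on this toy data, the learned weights score each preferred answer above its rejected counterpart. In the pipeline described in the interview, a reward model trained on such comparisons would then serve as the reward function for the RL step.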
[11:07.600 -- 11:11.080]  And there's this other idea called rejection sampling
[11:11.080 -- 11:12.400]  or best-of-n sampling.
[11:12.400 -- 11:14.840]  And in general, you can do other kinds of search too.
[11:14.840 -- 11:18.680]  Where instead of doing RL once you have your reward model
[11:18.680 -- 11:21.040]  you can just search against that reward model.
[11:21.040 -- 11:23.440]  So you can take a bunch of, collect a bunch of samples
[11:23.440 -- 11:25.960]  and re-rank them with the reward model
[11:25.960 -- 11:28.960]  and take the best one as your action.
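[Editor's note] The rejection sampling (best-of-n) idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: `sample_answer` and `reward_model` are hypothetical callables standing in for a language model policy and a trained reward model, and the toy versions below exist purely so the example runs.

```python
import random


def best_of_n(question, sample_answer, reward_model, n=16, seed=0):
    # Draw n candidate answers from the policy.
    rng = random.Random(seed)
    candidates = [sample_answer(question, rng) for _ in range(n)]
    # Score every candidate with the reward model and keep the best one.
    scored = [(reward_model(question, answer), answer) for answer in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score


# Toy stand-ins (assumptions, not real models).
def toy_policy(question, rng):
    return rng.choice(["I don't know.", "Paris.", "Paris, according to [1]."])


def toy_reward(question, answer):
    score = 0.0
    if "Paris" in answer:
        score += 1.0  # mentions the expected answer
    if "[1]" in answer:
        score += 0.5  # includes a citation
    return score


answer, score = best_of_n(
    "What is the capital of France?", toy_policy, toy_reward, n=16
)
```

No policy update happens here: the reward model is used only at inference time to re-rank samples, which is the contrast with RL drawn in the interview.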
[11:28.960 -- 11:30.520]  Kind of like MPC?
[11:30.520 -- 11:31.360]  Yeah, exactly.
[11:31.360 -- 11:33.440]  Yeah, it kind of depends exactly
[11:33.440 -- 11:35.640]  what setting you're in, what you can do.
[11:35.640 -- 11:38.400]  If you're in a setting where there's some environment
[11:38.400 -- 11:41.040]  you're interacting with, then you would have to simulate
[11:41.040 -- 11:44.160]  your, you'd have to simulate the dynamics
[11:44.160 -- 11:45.920]  of your environment, which yeah.
[11:45.920 -- 11:47.920]  So that would look kind of like MPC.
[11:47.920 -- 11:51.360]  In our case, we were, the only thing we had to learn
[11:51.360 -- 11:55.080]  a model of was the human preference.
[11:55.080 -- 11:57.480]  So like we're, it's a question answering setting.
[11:57.480 -- 11:59.760]  So it's really like a contextual bandit problem.
[11:59.760 -- 12:02.520]  So it's kind of straightforward to take a bunch of,
[12:02.520 -- 12:04.320]  sample a bunch of actions where each action
[12:04.320 -- 12:06.880]  is a full answer, and re-rank them
[12:06.880 -- 12:11.640]  or search over answers.
[12:11.640 -- 12:13.760]  So in terms of the action space,
[12:13.760 -- 12:16.040]  was it the action space, just the list of commands
[12:16.040 -- 12:17.800]  or is it still generating tokens
[12:17.800 -- 12:20.440]  like a regular generative model?
[12:20.440 -- 12:21.800]  We were generating tokens.
[12:21.800 -- 12:26.800]  We had two phases of like in each episode of the RL tasks.
[12:26.800 -- 12:31.280]  So there was first a browsing phase where the model goes
[12:31.280 -- 12:33.960]  and it issues searches and clicks on things
[12:33.960 -- 12:36.560]  and quotes relevant information.
[12:36.560 -- 12:38.400]  Like if it sees something useful on the page,
[12:38.400 -- 12:40.920]  it'll quote it using this quote command.
[12:40.920 -- 12:44.560]  And then once it's done browsing,
[12:44.560 -- 12:48.480]  it'll issue another command called end browsing
[12:48.480 -- 12:49.920]  and it'll write its answer.
[12:49.920 -- 12:52.120]  That's also expressed in tokens.
[12:52.120 -- 12:55.400]  But really we rolled this all into one big RL task
[12:55.400 -- 12:57.440]  where your episode involves browsing
[12:57.440 -- 12:58.640]  and writing out the answer
[12:58.640 -- 13:01.480]  and it's all one big RL episode.
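[Editor's note] The two-phase episode described above (browse and collect quotes, then end browsing and write the answer) might look roughly like the toy loop below. The command syntax (`search ...`, `quote ...`, `end_browsing`), the `BrowsingState` class, and the dictionary used as a fake search engine are all assumptions for illustration; the interview does not give the actual text format of the commands, and scrolling and link clicking are left out.

```python
from dataclasses import dataclass, field


@dataclass
class BrowsingState:
    current_page: str = ""
    quotes: list = field(default_factory=list)
    done_browsing: bool = False


def step(state, command, search_index):
    # Each command is plain text emitted by the model as tokens.
    name, _, argument = command.partition(" ")
    if name == "search":
        # Toy search: look the query up in a dict of page texts.
        state.current_page = search_index.get(argument, "")
    elif name == "quote":
        # Only keep quotes that actually appear on the current page.
        if argument and argument in state.current_page:
            state.quotes.append(argument)
    elif name == "end_browsing":
        state.done_browsing = True
    else:
        raise ValueError(f"unknown command: {name}")
    return state


def run_episode(policy_outputs, search_index):
    # Phase 1: consume browsing commands until end_browsing.
    state = BrowsingState()
    outputs = iter(policy_outputs)
    for command in outputs:
        step(state, command, search_index)
        if state.done_browsing:
            break
    # Phase 2: the next output is the written answer.
    answer = next(outputs, "")
    return state.quotes, answer


index = {"capital of france": "Paris is the capital of France."}
quotes, answer = run_episode(
    [
        "search capital of france",
        "quote Paris is the capital",
        "end_browsing",
        "Paris [1].",
    ],
    index,
)
```

Both phases sit inside one episode, so a single reward on the final answer can be credited to the browsing actions as well as to the writing.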
[13:01.480 -- 13:02.840]  Did you think this is gonna work well
[13:02.840 -- 13:04.440]  or were you kind of surprised?
[13:04.440 -- 13:06.360]  At the very beginning of the project,
[13:06.360 -- 13:09.000]  we didn't know if it was gonna work or not.
[13:09.000 -- 13:10.920]  Like after we did the initial experiments
[13:10.920 -- 13:12.560]  with TriviaQA,
[13:12.560 -- 13:15.560]  which actually didn't take that long to get running,
[13:15.560 -- 13:19.120]  then it became pretty clear that it would work,
[13:19.120 -- 13:20.640]  that the browsing part worked at least.
[13:20.640 -- 13:22.880]  And we already know that we can get these models
[13:22.880 -- 13:26.760]  to write pretty good long form text with a bunch of,
[13:26.760 -- 13:28.520]  if you give them a bunch of snippets
[13:28.520 -- 13:31.080]  of text that they can cite.
[13:31.080 -- 13:35.400]  So I noticed the human raters' task was quite complicated.
[13:35.400 -- 13:38.200]  It was a long guide and there were many types of feedback
[13:38.200 -- 13:39.040]  that they were giving.
[13:39.040 -- 13:40.440]  But in the end, the paper said
[13:40.440 -- 13:42.720]  that only the final rating was used.
[13:42.720 -- 13:44.640]  So I was just curious if you had any comment about that.
[13:44.640 -- 13:46.040]  Like why do you think maybe the model
[13:46.040 -- 13:47.440]  couldn't use that extra feedback
[13:47.440 -- 13:50.840]  or is this maybe just too much or not enough samples?
[13:50.840 -- 13:55.200]  Yeah, that's been one frustrating finding so far.
[13:55.200 -- 13:58.480]  In that project and also some other projects,
[13:58.480 -- 14:01.480]  we've had the same finding that you have your raters
[14:01.480 -- 14:05.760]  go through this long process for each comparison they do
[14:05.760 -- 14:08.240]  where they're comparing a pair of answers.
[14:08.240 -- 14:10.440]  And then you only use one bit of information
[14:10.440 -- 14:13.080]  from this whole process,
[14:13.080 -- 14:14.720]  which might've taken like half an hour.
[14:14.720 -- 14:15.840]  It seems like it would be better
[14:15.840 -- 14:19.320]  if we were able to extract more information,
[14:19.320 -- 14:21.680]  more about the process they went through
[14:21.680 -- 14:22.920]  in arriving at the answer.
[14:22.920 -- 14:25.040]  So we did collect all sorts of other information
[14:25.040 -- 14:27.160]  like we had them provide ratings
[14:27.160 -- 14:28.600]  along several different axes
[14:28.600 -- 14:32.760]  like coherence and factual accuracy and so forth.
[14:32.760 -- 14:35.960]  But in the end, we didn't really get much of a boost
[14:35.960 -- 14:39.160]  out of using any of this other information.
[14:39.160 -- 14:44.160]  So I'd say it seems like it should be possible to do better.
[14:44.800 -- 14:46.520]  But unfortunately this methodology,
[14:46.520 -- 14:49.840]  which seems kind of dumb so far is hard to beat.
[14:49.840 -- 14:52.760]  And people have tried various other ideas
[14:52.760 -- 14:55.120]  for like how to use human feedback
[14:55.120 -- 14:57.080]  instead of you getting these preference scores,
[14:57.080 -- 14:58.400]  there are various other things you can do.
[14:58.400 -- 15:00.840]  Like you can have them write critiques and edit
[15:00.840 -- 15:03.200]  or maybe edit the responses.
[15:03.200 -- 15:07.080]  Yeah, I think some of these things are also promising.
[15:07.080 -- 15:09.440]  But yeah, this methodology
[15:09.440 -- 15:12.080]  of collecting preference data works well.
[15:12.080 -- 15:15.160]  Yeah, I think it's still an open area of research.
[15:15.160 -- 15:18.280]  Oh yeah, regarding the really long instructions.
[15:18.280 -- 15:20.000]  Yeah, I think for any of these tasks,
[15:20.000 -- 15:24.000]  there is a lot of subtlety in how to do the task properly.
[15:24.000 -- 15:27.800]  And so we ended up adding more and more details
[15:27.800 -- 15:29.640]  of like what do you do in this situation?
[15:29.640 -- 15:30.960]  What do you do in that situation?
[15:30.960 -- 15:33.320]  I think it's starting to get pretty unwieldy
[15:33.320 -- 15:35.760]  with these really long instruction manuals.
[15:35.760 -- 15:39.920]  So there's some promising ideas for how to address this.
[15:39.920 -- 15:42.840]  Like there's a paper from DeepMind recently,
[15:42.840 -- 15:45.920]  Sparrow, that basically broke down the task
[15:45.920 -- 15:48.520]  and they trained, they basically had people look
[15:48.520 -- 15:52.400]  at one aspect of the response at a time.
[15:52.400 -- 15:54.640]  And then they had a way of combining
[15:54.640 -- 15:56.480]  these different rule specific,
[15:56.480 -- 15:58.680]  they would train a bunch of rule specific reward models
[15:58.680 -- 16:00.440]  and then combine them at the end.
[16:00.440 -- 16:02.520]  Yeah, I think there's some other interesting ideas
[16:02.520 -- 16:05.320]  for how to make this process better.
[16:05.320 -- 16:08.480]  So I gather that from your answer about WebGPT
[16:08.480 -- 16:10.720]  and the whole idea of WebGPT is that you want
[16:10.720 -- 16:14.400]  the language model to have access to external knowledge.
[16:14.400 -- 16:17.560]  But I wonder where you think the line should really be
[16:17.560 -- 16:19.680]  in terms of what a language model should know
[16:19.680 -- 16:21.920]  and what the language model should look up
[16:21.920 -- 16:24.240]  and maybe what the language model should not know
[16:24.240 -- 16:25.600]  or not purport to know.
[16:25.600 -- 16:27.120]  Do you have opinions about that?
[16:27.120 -- 16:28.560]  Yeah, let's see.
[16:28.560 -- 16:30.200]  Like some people are advocating
[16:30.200 -- 16:32.480]  for very small language models that have
[16:32.480 -- 16:35.480]  like no external knowledge aside from language,
[16:35.480 -- 16:37.000]  I guess would be the extreme position.
[16:37.000 -- 16:39.680]  And then other people have talked about language models
[16:39.680 -- 16:41.000]  that just know everything
[16:41.000 -- 16:43.440]  as opposed to having an external knowledge source.
[16:43.440 -- 16:45.000]  There's some interesting questions there.
[16:45.000 -- 16:48.440]  So I think it is a little hard to separate knowledge,
[16:48.440 -- 16:51.160]  factual knowledge from understanding.
[16:51.160 -- 16:55.120]  So as humans, we get by like not memorizing
[16:55.120 -- 16:57.560]  all sorts of facts and just knowing
[16:57.560 -- 16:59.720]  that we can look them up if needed.
[16:59.720 -- 17:01.520]  For working on a specific domain,
[17:01.520 -- 17:06.440]  it is useful to like have a lot of facts internalized
[17:06.440 -- 17:08.520]  so that you can recall them very quickly
[17:08.520 -- 17:11.480]  and kind of combine them in your head.
[17:11.480 -- 17:14.840]  So I wouldn't take an extreme position on either side.
[17:14.840 -- 17:18.400]  I would say, I think retrieval is gonna be really useful
[17:19.520 -- 17:22.480]  just at the very least for current events,
[17:22.480 -- 17:26.480]  but also I don't think we wanna try to pack
[17:26.480 -- 17:29.960]  all human knowledge into the weights of a neural net.
[17:29.960 -- 17:32.280]  On the other hand, I think people have had a lot of luck
[17:32.280 -- 17:37.200]  just scaling up models and like as they soak up
[17:37.200 -- 17:40.800]  more factual knowledge, they also get better at reasoning
[17:40.800 -- 17:41.640]  and other things.
[17:41.640 -- 17:44.280]  And I think I haven't seen any demonstrations
[17:44.280 -- 17:48.080]  of tiny models that just do lots of retrieval
[17:48.080 -- 17:50.320]  and save all their weights for reasoning.
[17:50.320 -- 17:53.840]  Yeah, I just haven't seen any evidence of this
[17:53.840 -- 17:57.480]  or I haven't seen any successful attempts at making this.
[17:57.480 -- 17:59.640]  Let's move on to training language models
[17:59.640 -- 18:01.680]  to follow instructions with human feedback.
[18:01.680 -- 18:03.080]  That was Ouyang et al.
[18:03.080 -- 18:05.640]  And that was 2022 with yourself as a co-author.
[18:05.640 -- 18:08.040]  Can you tell us the main idea with this paper?
[18:08.040 -- 18:09.760]  This is the InstructGPT paper.
[18:09.760 -- 18:12.000]  What is InstructGPT and what's going on here?
[18:12.000 -- 18:15.240]  InstructGPT is a language model that's fine-tuned
[18:15.240 -- 18:16.480]  to follow instructions.
[18:16.480 -- 18:19.000]  And it's in fact the one that you can play with
[18:19.000 -- 18:23.280]  if you go to the OpenAI website, you get a big text box
[18:23.280 -- 18:25.920]  and you can write some text and then press the button
[18:25.920 -- 18:27.680]  to generate a completion.
[18:27.680 -- 18:30.240]  So the idea here was, I mean, language models
[18:30.240 -- 18:33.800]  are pretty useful and you can sometimes get them
[18:33.800 -- 18:36.160]  to do what you want by prompting them just right.
[18:36.160 -- 18:39.880]  This idea of few-shot prompting has become pretty popular
[18:39.880 -- 18:41.560]  where you give a few examples,
[18:41.560 -- 18:44.200]  like a few question and answer examples.
[18:44.200 -- 18:45.720]  And then if you ask another question,
[18:45.720 -- 18:48.520]  it'll hopefully provide an answer in the same style.
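[Editor's note] Few-shot prompting, as described above, amounts to placing a handful of worked examples before the new question so that a raw language model continues in the same style. A minimal sketch follows; the `Q:`/`A:` layout and the function name `build_few_shot_prompt` are assumptions chosen for illustration, not a format prescribed in the interview.

```python
def build_few_shot_prompt(examples, question):
    # Each example is a (question, answer) pair shown to the model in full.
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The new question ends with an open "A:" for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of France?")
print(prompt)
```

The resulting string would be passed to a model as the text to complete. An instruction-following model is meant to remove the need for this kind of scaffolding.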
[18:48.520 -- 18:51.600]  So the idea, yeah, so you can get language models
[18:51.600 -- 18:53.240]  to do great things with prompting,
[18:53.240 -- 18:55.240]  but prompting is itself an art
[18:55.240 -- 18:56.480]  and it's tricky to get right.
[18:56.480 -- 18:59.040]  And it's also kind of not necessarily getting
[18:59.040 -- 19:01.600]  the best possible performance out of the model.
[19:01.600 -- 19:03.120]  If you just take a raw language model
[19:03.120 -- 19:06.000]  and you try to talk to it, like you ask it a question,
[19:06.000 -- 19:08.840]  it probably, it doesn't know that it should actually answer
[19:08.840 -- 19:10.560]  that question as well as possible.
[19:10.560 -- 19:13.840]  It, for all it knows, you want it to give a joke answer
[19:13.840 -- 19:15.320]  or a riddle or something.
[19:15.320 -- 19:17.840]  Yeah, so the idea of InstructGPT was,
[19:17.840 -- 19:21.120]  let's make a kind of small change to our language models
[19:21.120 -- 19:22.880]  so that they're much easier to use.
[19:22.880 -- 19:25.360]  In particular, we're gonna train them to,
[19:25.360 -- 19:29.440]  if you have a piece of text where there's an instruction,
[19:29.440 -- 19:32.840]  the model will try to follow that instruction
[19:32.840 -- 19:34.120]  to the best of its abilities.
[19:34.120 -- 19:36.480]  And pretty much anything can be an instruction.
[19:36.480 -- 19:38.760]  Like, the instruction can be
[19:38.760 -- 19:43.760]  to continue a chat or it can be to summarize this text
[19:44.400 -- 19:48.740]  or give me a list of names for my company
[19:48.740 -- 19:50.240]  that sells widgets.
[19:50.240 -- 19:51.680]  Yeah, instructions can be anything
[19:51.680 -- 19:54.960]  and that makes this kind of model very powerful.
[19:54.960 -- 19:56.000]  So that was kind of,
[19:56.000 -- 19:58.120]  that's the idea of an instruction following model.
[19:58.120 -- 19:59.760]  It's like a model that can do anything
[19:59.760 -- 20:01.460]  that you specify with an instruction.
[20:01.460 -- 20:04.000]  And by the way, I wasn't a core contributor to this work.
[20:04.000 -- 20:09.000]  I was more involved with like getting the RL infrastructure
[20:09.360 -- 20:12.280]  and some of the RL training details,
[20:12.280 -- 20:14.440]  like helping out with that stuff.
[20:14.440 -- 20:16.840]  But anyway, yeah, what we did in this project was
[20:16.840 -- 20:20.620]  we ran this whole methodology that I just described
[20:20.620 -- 20:23.160]  of RL from human preferences
[20:23.160 -- 20:24.900]  in this instruction following setting.
[20:24.900 -- 20:28.080]  So we did supervised fine tuning,
[20:28.080 -- 20:30.840]  collected preference data, trained a reward model
[20:30.840 -- 20:33.800]  and then did RL against that reward model.
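The reward-model step in this pipeline can be illustrated with a pairwise preference loss. The conversation does not say which loss was used; the sketch below assumes the common Bradley-Terry style formulation, and the function name and the scalar rewards are hypothetical.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)) for one comparison.

    Assumption: the reward model outputs one scalar per response, and a human
    labeler preferred the "chosen" response over the "rejected" one.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); this form avoids overflow for large |d|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# The loss is small when the reward model agrees with the human label...
print(preference_loss(2.0, 0.0))
# ...and large when it ranks the rejected response higher.
print(preference_loss(0.0, 2.0))
```

Minimizing this loss over many labeled comparisons pushes the reward model to score preferred responses higher; the RL step then optimizes the policy against that learned score.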
[20:33.800 -- 20:36.240]  And one interesting detail is that
[20:36.240 -- 20:40.080]  the initial data was just collected
[20:40.080 -- 20:41.840]  using contractors.
[20:41.840 -- 20:46.840]  At a certain point we had the API, and,
[20:47.040 -- 20:50.520]  I mean, we have this playground on the website,
[20:50.520 -- 20:52.800]  which is where the big text box is,
[20:52.800 -- 20:54.800]  where you can use the model.
[20:54.800 -- 20:57.200]  So we took prompts that people,
[20:57.200 -- 20:59.680]  that users had put into the playground
[20:59.680 -- 21:01.280]  and used those for training,
[21:01.280 -- 21:04.680]  like both to collect preference data and to do RL.
[21:04.680 -- 21:07.040]  So, and this is like,
[21:07.040 -- 21:10.760]  this is disclosed to users pretty prominently.
[21:10.760 -- 21:13.040]  Like when people are using the playground,
[21:13.040 -- 21:15.520]  you get notified that your prompts might be used
[21:15.520 -- 21:16.480]  for the training.
[21:16.480 -- 21:19.120]  And we're also careful to train in such a way
[21:19.120 -- 21:20.860]  that we don't memorize any information
[21:20.860 -- 21:23.080]  that was in the prompts.
[21:23.080 -- 21:24.760]  And explicitly,
[21:24.760 -- 21:27.480]  we have a pretty elaborate process
[21:27.480 -- 21:30.680]  for making sure there's no private information
[21:30.680 -- 21:32.840]  being leaked into the model.
[21:32.840 -- 21:36.960]  But anyway, yeah, that's basically the experimental setup.
[21:36.960 -- 21:39.680]  And the result was that
[21:39.680 -- 21:42.060]  this methodology works quite well.
[21:42.060 -- 21:44.480]  And you get a model that's vastly preferred
[21:44.480 -- 21:48.820]  to the base model on this distribution of realistic prompts
[21:48.820 -- 21:50.880]  that people are giving the model,
[21:50.880 -- 21:53.040]  which often contain instructions.
[21:53.040 -- 21:56.040]  So the raw, like the raw language models
[21:56.040 -- 21:58.760]  generally do a really bad job following instructions.
[21:58.760 -- 22:02.920]  But this RL trained instruction following model
[22:02.920 -- 22:04.120]  is a lot better.
[22:04.120 -- 22:06.440]  And it's something like,
[22:06.440 -- 22:08.220]  if you just calculate how much better,
[22:08.220 -- 22:09.200]  it's something like,
[22:09.200 -- 22:11.800]  it's as good as a model that's a hundred times bigger.
[22:11.800 -- 22:13.200]  That's a lot.
[22:13.200 -- 22:14.040]  Yeah.
[22:14.040 -- 22:15.280]  You wanted the model to be truthful.
[22:15.280 -- 22:17.640]  Is that one of the criteria you wanted?
[22:17.640 -- 22:20.000]  Yeah, truthfulness was one of the criteria.
[22:20.000 -- 22:22.200]  That seems amazing to me that truthfulness
[22:22.200 -- 22:24.080]  is something that it could learn by example.
[22:24.080 -- 22:26.480]  Like does that mean that truthfulness is somehow
[22:26.480 -- 22:28.000]  represented inside the network,
[22:28.000 -- 22:31.240]  because there's no external way for the model to confirm
[22:31.240 -- 22:32.720]  whether something is true or false?
[22:32.720 -- 22:35.440]  So how might it know what is true
[22:35.440 -- 22:37.480]  without any external reference?
[22:37.480 -- 22:38.960]  I think to some extent,
[22:38.960 -- 22:42.420]  there is some internal representation of truthfulness.
[22:42.420 -- 22:43.260]  So I would say,
[22:43.260 -- 22:45.340]  like one way to think about what language models do
[22:45.340 -- 22:48.200]  is they're trained to imitate the whole internet.
[22:48.200 -- 22:50.520]  And the internet is written by lots of different people
[22:50.520 -- 22:52.520]  and has lots of different types of content
[22:52.520 -- 22:57.200]  from fiction to nonfiction to like technical,
[22:57.200 -- 23:00.600]  like detailed technical literature to like jokes
[23:00.600 -- 23:03.400]  and like forum posts, whatever.
[23:03.400 -- 23:07.260]  So the model is basically an ensemble of all these people
[23:07.260 -- 23:08.880]  who wrote stuff on the internet,
[23:08.880 -- 23:11.000]  the raw pre-trained model.
[23:11.000 -- 23:13.080]  When you feed it a prompt,
[23:13.080 -- 23:15.580]  what it's doing internally has to be something like
[23:15.580 -- 23:18.200]  figuring out who wrote this prompt
[23:18.200 -- 23:20.020]  and then trying to continue in that style.
[23:20.020 -- 23:21.880]  So if it thinks it's reading,
[23:21.880 -- 23:26.180]  just reading something on the Wall Street Bets Reddit,
[23:26.180 -- 23:28.440]  it's gonna continue in that style.
[23:28.440 -- 23:30.640]  But if it thinks it's in the New York Times,
[23:30.640 -- 23:33.320]  it's gonna write in a very different way.
[23:33.320 -- 23:38.280]  So effectively, the model must be calculating somewhere,
[23:38.280 -- 23:40.800]  like what style is this or what ensemble,
[23:40.800 -- 23:43.900]  what's the narrower ensemble of styles
[23:43.900 -- 23:46.400]  that I'm trying to imitate now.
[23:46.400 -- 23:48.400]  At the very least, when you do some kind of,
[23:48.400 -- 23:51.080]  when you do training like either supervised fine tuning
[23:51.080 -- 23:52.840]  or RL from human feedback,
[23:52.840 -- 23:55.600]  you can at least like narrow down the set of styles
[23:55.600 -- 23:59.500]  the model is producing and try to imitate like the best
[23:59.500 -- 24:02.680]  or the best person in the training set
[24:02.680 -- 24:04.300]  or the best style in the training set.
[24:04.300 -- 24:06.480]  And obviously best will differ a lot.
[24:06.480 -- 24:09.540]  So what we'll end up with will depend on our instructions.
[24:09.540 -- 24:12.520]  So, I don't know,
[24:12.520 -- 24:15.080]  we might end up with something that's kind of safe,
[24:15.080 -- 24:19.000]  like not too controversial,
[24:19.000 -- 24:21.160]  but a bit corporate;
[24:21.160 -- 24:23.240]  we'll end up with something like that
[24:23.240 -- 24:25.680]  depending on what our instructions are.
[24:25.680 -- 24:27.320]  So at the very least,
[24:27.320 -- 24:29.880]  like we can kind of narrow in on one style
[24:29.880 -- 24:32.160]  instead of having the whole distribution
[24:32.160 -- 24:33.320]  of styles on the internet.
[24:33.320 -- 24:35.780]  I think probably there's more to it than that.
[24:35.780 -- 24:38.140]  Like we're not just learning about style,
[24:38.140 -- 24:40.580]  but the model probably is like internally
[24:40.580 -- 24:42.220]  trying to determine if things are,
[24:42.220 -- 24:44.000]  if statements are true or not,
[24:44.000 -- 24:47.320]  like if the prompt contains incorrect information,
[24:47.320 -- 24:48.980]  because that probably would be useful
[24:48.980 -- 24:51.560]  for determining a likely completion.
[24:51.560 -- 24:53.340]  I'm just talking about the raw pre-trained model.
[24:53.340 -- 24:54.520]  So I think, yeah,
[24:54.520 -- 24:58.180]  I think just the objective of predicting next tokens
[24:58.180 -- 24:59.520]  probably gives you a lot.
[24:59.520 -- 25:02.120]  It forces the model to determine
[25:02.120 -- 25:03.680]  if things are true or not.
[25:03.680 -- 25:05.880]  I think for RL fine tuning,
[25:05.880 -- 25:07.560]  there's a lot more potential for the model
[25:07.560 -- 25:11.900]  to actually like try to output something truthful
[25:11.900 -- 25:14.240]  as opposed to trying to imitate a certain style.
[25:14.240 -- 25:16.120]  Though it's hard to,
[25:16.120 -- 25:18.520]  I guess it would be hard to like determine
[25:18.520 -- 25:21.400]  if that's what the model is actually trying to do.
[25:21.400 -- 25:24.240]  So it's almost like the prompt is guiding the model.
[25:24.240 -- 25:26.720]  It's like, what corner of the internet do we want to,
[25:26.720 -- 25:28.320]  do we want to imitate here?
[25:28.320 -- 25:31.240]  And maybe InstructGPT wants to
[25:31.240 -- 25:33.520]  focus more on the most truthful corners
[25:33.520 -- 25:35.800]  of the internet, or something similar to that.
[25:35.800 -- 25:36.880]  Yeah, I would hope so.
[25:36.880 -- 25:38.680]  At least I think that's a pretty good,
[25:38.680 -- 25:41.360]  though maybe a little simplistic picture of what's going on.
[25:41.360 -- 25:42.200]  At the very least,
[25:42.200 -- 25:44.920]  we should be able to imitate the most truthful corner
[25:44.920 -- 25:45.760]  of the internet.
[25:45.760 -- 25:47.760]  So can you talk about generalization
[25:47.760 -- 25:52.360]  and how does this type of model perform out of distribution?
[25:52.360 -- 25:54.080]  Like, I guess if it sees questions
[25:54.080 -- 25:56.480]  that are a bit different than what it was trained on,
[25:56.480 -- 25:58.040]  what happens if we get a little bit away
[25:58.040 -- 26:00.560]  from the training data with the reward models?
[26:00.560 -- 26:02.320]  I mean, language models in general,
[26:02.320 -- 26:03.840]  generalize surprisingly well.
[26:03.840 -- 26:05.400]  And I would say overall,
[26:05.400 -- 26:07.600]  like these pre-trained models that are trained
[26:07.600 -- 26:09.760]  on super diverse data sets from the internet,
[26:09.760 -- 26:12.920]  they tend to generalize quite well, or surprisingly well,
[26:12.920 -- 26:15.200]  at least it's surprising to those of us
[26:15.200 -- 26:19.000]  who were around for the earlier days of machine learning
[26:19.000 -- 26:22.800]  when everything was trained from scratch and very fragile.
[26:22.800 -- 26:25.640]  For example, if you provide an instruction
[26:25.640 -- 26:29.280]  in some other language, even a fairly rare language,
[26:29.280 -- 26:32.360]  it'll often do a decent job following the instruction,
[26:32.360 -- 26:35.840]  even if there's zero data in the whole
[26:35.840 -- 26:39.360]  instruction-following training process that's in that language.
[26:39.360 -- 26:41.840]  And that's just carryover from the pre-training.
[26:41.840 -- 26:43.960]  So I think generalization,
[26:43.960 -- 26:46.080]  yeah, I think language models generalize quite well.
[26:46.080 -- 26:47.880]  So you asked about reward models.
[26:47.880 -- 26:50.840]  I think one of the tricky pieces about RL
[26:50.840 -- 26:52.400]  from human feedback is how,
[26:52.400 -- 26:53.880]  so you have this reward model
[26:53.880 -- 26:55.480]  and you're actually training against it,
[26:55.480 -- 26:57.880]  meaning you're training your policy to have high reward
[26:57.880 -- 27:01.200]  and it's going to exploit the errors in the reward model.
[27:01.200 -- 27:04.280]  So it's gonna eventually find adversarial examples
[27:04.280 -- 27:05.200]  to the reward model.
[27:05.200 -- 27:07.200]  This is worse than kind of normal
[27:07.200 -- 27:08.640]  out of distribution behavior.
[27:08.640 -- 27:11.480]  It's like targeted out of distribution examples.
[27:11.480 -- 27:13.800]  So there are definitely some challenges
[27:13.800 -- 27:17.400]  around getting reward models to generalize well
[27:17.400 -- 27:20.960]  or generalize as far as possible from the training set.
[27:20.960 -- 27:22.760]  Can these types of agents tell us
[27:22.760 -- 27:26.240]  when they don't know something or is that a hard problem?
[27:26.240 -- 27:28.800]  I'd say sort of, if you ask a question
[27:28.800 -- 27:31.480]  that's kind of in the core of the model's knowledge,
[27:31.480 -- 27:34.160]  it will know the answer and it'll know that it knows.
[27:34.160 -- 27:35.640]  By the way, I'm talking about models
[27:35.640 -- 27:37.240]  like for the instruct model.
[27:37.240 -- 27:40.360]  If you ask it about something that's like very simple
[27:40.360 -- 27:42.160]  at the core of its knowledge,
[27:42.160 -- 27:44.160]  it'll know. And there are certain things
[27:44.160 -- 27:45.920]  that it knows that it doesn't know,
[27:45.920 -- 27:49.240]  like current events where it's been trained
[27:49.240 -- 27:52.840]  to know that it doesn't know certain things in real time.
[27:52.840 -- 27:55.000]  But if you ask it about something
[27:55.000 -- 27:56.760]  that's kind of on the edge of its knowledge,
[27:56.760 -- 27:59.480]  it's gonna have a hard time.
[27:59.480 -- 28:01.640]  It's necessarily gonna be inaccurate.
[28:01.640 -- 28:03.920]  I mean, there have been a couple of papers
[28:03.920 -- 28:04.880]  about this question.
[28:04.880 -- 28:08.080]  So there was a paper from Anthropic recently
[28:08.080 -- 28:09.360]  called Language Models
[28:09.360 -- 28:10.920]  (Mostly) Know What They Know.
[28:10.920 -- 28:15.120]  And there's also a paper from FHI and OpenAI
[28:15.120 -- 28:17.680]  called Teaching Models
[28:17.680 -- 28:20.080]  to Express Their Uncertainty in Words.
[28:20.080 -- 28:22.000]  These language models,
[28:22.000 -- 28:24.160]  as well as a lot of other models in machine learning
[28:24.160 -- 28:26.560]  are trained to maximize likelihood.
[28:26.560 -- 28:28.680]  So maximize log-prob of data.
[28:28.680 -- 28:29.920]  You're already training them
[28:29.920 -- 28:32.480]  to always predict a distribution of outputs.
[28:32.480 -- 28:35.440]  So for language models, given a prefix,
[28:35.440 -- 28:38.920]  it's predicting a distribution over the next token.
[28:38.920 -- 28:41.760]  These predictions for the next token
[28:41.760 -- 28:44.720]  generally are pretty well calibrated.
[28:44.720 -- 28:47.680]  If it puts 80% probability on something,
[28:47.680 -- 28:49.160]  and you look at all the times
[28:49.160 -- 28:51.920]  when it puts 80% probability on something,
[28:51.920 -- 28:54.080]  it's right 80% of the time.
[28:54.080 -- 28:56.400]  That's just a result of the training objective.
[28:56.400 -- 28:59.960]  The training objective strongly incentivizes the model
[28:59.960 -- 29:01.400]  to be calibrated,
[29:01.400 -- 29:05.320]  meaning it has a reasonable estimate of its uncertainty.
[29:05.320 -- 29:07.240]  So at the single token level,
[29:07.240 -- 29:08.960]  models definitely are calibrated.
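The notion of calibration described here ("when it puts 80% probability on something, it's right 80% of the time") can be checked by bucketing predictions by confidence and comparing each bucket's average confidence with its empirical accuracy. The following is a minimal sketch; the function names, the bin count, and the toy data are assumptions made for illustration.

```python
def calibration_table(confidences, correct, num_bins=5):
    """Group predictions into confidence bins.

    Returns one (mean_confidence, accuracy, count) row per non-empty bin.
    """
    bins = [[] for _ in range(num_bins)]
    for p, was_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(p * num_bins), num_bins - 1)
        bins[idx].append((p, was_correct))
    rows = []
    for items in bins:
        if not items:
            continue
        mean_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(1 for _, c in items if c) / len(items)
        rows.append((mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(confidences, correct, num_bins=5):
    """Count-weighted average gap between confidence and accuracy across bins."""
    total = len(confidences)
    rows = calibration_table(confidences, correct, num_bins)
    return sum(n / total * abs(conf - acc) for conf, acc, n in rows)


# Toy data: ten predictions made with 80% confidence, eight of them correct.
confs = [0.8] * 10
hits = [True] * 8 + [False] * 2
print(calibration_table(confs, hits))
print(expected_calibration_error(confs, hits))
```

A well-calibrated predictor gives rows where mean confidence and accuracy roughly match, so the weighted gap is close to zero; an overconfident one shows accuracy well below confidence.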
[29:08.960 -- 29:10.880]  The question is whether they're calibrated on,
[29:10.880 -- 29:14.680]  whether this calibration extends to settings
[29:14.680 -- 29:18.000]  where they are generating multi-token outputs,
[29:18.000 -- 29:20.360]  or whether they can like judge the correctness
[29:20.360 -- 29:22.000]  of some multi-token statement.
[29:22.000 -- 29:25.000]  So I would say since models are calibrated
[29:25.000 -- 29:26.600]  at the single token level,
[29:26.600 -- 29:29.640]  I think that they definitely have the information
[29:29.640 -- 29:32.840]  to be calibrated in these other settings.
[29:32.840 -- 29:35.960]  So that's why I think the problem of models
[29:35.960 -- 29:38.640]  knowing what they know isn't actually that hard,
[29:38.640 -- 29:42.240]  or at least getting a model to express its uncertainty
[29:42.240 -- 29:44.080]  pretty much as well as a human does,
[29:44.080 -- 29:46.560]  doesn't feel like an insurmountable problem,
[29:46.560 -- 29:48.360]  but there are some practical difficulties
[29:48.360 -- 29:50.120]  to getting there.
[29:50.120 -- 29:52.720]  People use the phrase AI alignment in different ways.
[29:52.720 -- 29:54.440]  Can you talk about how you see alignment
[29:54.440 -- 29:57.680]  in your work on RL from human feedback?
[29:57.680 -- 29:59.720]  I think of alignment mostly as the problem
[29:59.720 -- 30:03.560]  of getting the model to try to do the right thing.
[30:03.560 -- 30:05.000]  So we can kind of make a distinction
[30:05.000 -- 30:08.240]  between what the model is capable of doing and what it's trying to do.
[30:08.240 -- 30:10.200]  Like if you just take a raw language model
[30:10.200 -- 30:13.240]  and you ask it a question, like I said before,
[30:13.240 -- 30:14.720]  it doesn't know that you actually wanted it
[30:14.720 -- 30:17.120]  to give the correct answer as opposed to,
[30:17.120 -- 30:20.160]  it might think someone who's not very knowledgeable
[30:20.160 -- 30:21.000]  is answering.
[30:21.000 -- 30:22.480]  By doing some extra training,
[30:22.480 -- 30:24.800]  we can get the model to actually try to do the right thing.
[30:24.800 -- 30:28.680]  And so I would say that that's the main goal of alignment.
[30:28.680 -- 30:31.720]  So there was an OpenAI blog post recently
[30:31.720 -- 30:34.560]  that talked about the sequence in alignment.
[30:34.560 -- 30:38.800]  One was training AI systems using human feedback,
[30:38.800 -- 30:42.800]  two, training AI systems to assist human evaluation,
[30:42.800 -- 30:46.440]  and three, training AI systems to do alignment research.
[30:46.440 -- 30:50.200]  So is your current work mostly about this first item
[30:50.200 -- 30:51.800]  and when and how do you see us
[30:51.800 -- 30:53.440]  getting to these other stages?
[30:53.440 -- 30:56.240]  I'm doing some work now on number two,
[30:56.240 -- 30:58.520]  training AI systems to assist human feedback.
[30:58.520 -- 31:01.760]  I think that sort of becomes increasingly necessary
[31:01.760 -- 31:05.120]  as you start trying to get the systems
[31:05.120 -- 31:06.840]  to solve harder and harder problems.
[31:06.840 -- 31:09.520]  When you have models that are kind of well below human level
[31:09.520 -- 31:12.000]  or maybe at human level at a certain task,
[31:12.000 -- 31:15.080]  it's pretty straightforward to supervise them.
[31:15.080 -- 31:17.200]  But once they're doing things that are very hard
[31:17.200 -- 31:19.480]  or doing things that require a lot
[31:19.480 -- 31:21.960]  of diverse technical knowledge,
[31:21.960 -- 31:24.480]  it becomes pretty hard to provide
[31:24.480 -- 31:26.560]  a useful supervision signal.
[31:26.560 -- 31:29.280]  So we have to start doing things like one model
[31:29.280 -- 31:31.680]  writes an answer to a question
[31:31.680 -- 31:35.320]  and then another model provides a critique of that answer,
[31:35.320 -- 31:36.680]  points out some flaws,
[31:36.680 -- 31:38.880]  and then the human only has to judge
[31:38.880 -- 31:43.120]  the first answer after looking at the critique,
[31:43.120 -- 31:45.440]  meaning basically the critique helps the human
[31:45.440 -- 31:46.520]  assess the answer.
[31:46.520 -- 31:48.840]  So I think that kind of idea
[31:48.840 -- 31:51.000]  is starting to become pretty relevant.
[31:51.000 -- 31:53.560]  Colleagues and I are exploring that kind of idea now.
[31:53.560 -- 31:55.520]  As for assisting alignment research,
[31:55.520 -- 31:56.960]  there's some other work at OpenAI
[31:56.960 -- 31:58.600]  that's starting to explore this.
[31:58.600 -- 32:02.040]  It's also, that's sort of the furthest down the road.
[32:02.040 -- 32:05.080]  So I saw Stuart Russell was on your PhD committee
[32:05.080 -- 32:07.680]  and I really enjoyed his book, Human Compatible.
[32:07.680 -- 32:10.200]  I wonder if you share the idea mentioned in the book
[32:10.200 -- 32:11.880]  that the standard RL framing
[32:11.880 -- 32:14.760]  with this fixed reward signal is problematic
[32:14.760 -- 32:16.360]  and that agents, powerful agents,
[32:16.360 -- 32:18.960]  should try to do what we want
[32:18.960 -- 32:21.880]  and maintain some uncertainty about what it is we want
[32:21.880 -- 32:26.120]  and that agents that are too certain will be problematic.
[32:26.120 -- 32:28.320]  Do you have any thoughts on that idea?
[32:28.320 -- 32:31.560]  Yeah, I totally agree with that idea.
[32:31.560 -- 32:34.120]  So I think first it's really hard to write down
[32:34.120 -- 32:37.560]  a simple reward function that actually captures
[32:37.560 -- 32:41.080]  what we want or what any particular person wants.
[32:41.080 -- 32:43.720]  I can say I want a little more of this
[32:43.720 -- 32:44.880]  or a little more of that,
[32:44.880 -- 32:47.760]  but you wouldn't want to take that to the extreme.
[32:47.760 -- 32:52.600]  If we build agents that try to cater to our wishes,
[32:52.600 -- 32:55.200]  we should make sure
[32:55.200 -- 32:58.240]  they have uncertainty
[32:58.240 -- 33:00.080]  about what we want or what we value.
[33:00.080 -- 33:03.480]  And that'll also cause them to be a little more cautious
[33:03.480 -- 33:07.600]  and say, not disturb anything that might be important to us.
[33:07.600 -- 33:10.600]  So yeah, I agree with that.
[33:10.600 -- 33:13.360]  Stuart Russell gave a very good
[33:13.360 -- 33:17.040]  problem definition of what we want AI to do.
[33:17.040 -- 33:18.440]  Basically,
[33:18.440 -- 33:21.040]  we want to jointly play this game
[33:21.040 -- 33:23.760]  where AI is trying to figure out what we want
[33:23.760 -- 33:24.840]  and then trying to do that.
[33:24.840 -- 33:27.600]  But simultaneously maintaining some uncertainty
[33:27.600 -- 33:28.640]  about what we want.
[33:28.640 -- 33:30.560]  I would say if you start to look
[33:30.560 -- 33:31.920]  at how to get that in practice,
[33:31.920 -- 33:34.400]  it actually looks quite a bit like the kind of RL
[33:34.400 -- 33:37.920]  from human feedback that we're working on at OpenAI
[33:37.920 -- 33:41.280]  and others are working on at other places.
[33:41.280 -- 33:44.720]  I think, yeah, I see what we're doing
[33:44.720 -- 33:47.320]  as a practical implementation
[33:47.320 -- 33:50.720]  of getting towards this behavior that Russell described.
[33:50.720 -- 33:53.160]  Do you think of AGI as an abstract goal
[33:53.160 -- 33:55.560]  or are we gonna see a model come out one day
[33:55.560 -- 33:58.040]  and people are gonna say, oh, that's the first AGI model?
[33:58.040 -- 34:01.640]  Like, what does it have to do for people to say that?
[34:01.640 -- 34:04.920]  I think people will say that many times
[34:04.920 -- 34:07.200]  and then realize that it doesn't quite do everything
[34:07.200 -- 34:08.080]  that you want.
[34:08.080 -- 34:10.600]  I think we're gonna have a long series
[34:10.600 -- 34:14.320]  of models that are superhuman at most things
[34:14.320 -- 34:16.640]  or at a certain class of things,
[34:16.640 -- 34:20.840]  but they also have some failure modes and weaknesses.
[34:20.840 -- 34:24.640]  Like I expect us to see multiple models
[34:24.640 -- 34:26.600]  that are proclaimed as AGI
[34:26.600 -- 34:30.360]  and then only after interacting with them a while,
[34:30.360 -- 34:33.880]  do you realize it's not quite there.
[34:33.880 -- 34:35.520]  What would you say is the relationship
[34:35.520 -- 34:39.760]  between AGI and RL and AGI and these large language models?
[34:39.760 -- 34:41.680]  How do those concepts fit together?
[34:41.680 -- 34:46.680]  I'd say that RL is a useful component of training AGI
[34:47.160 -- 34:49.240]  or an almost essential component.
[34:49.240 -- 34:52.440]  The thing RL lets you do is it lets you optimize
[34:52.440 -- 34:54.960]  any objective for the agent,
[34:54.960 -- 34:59.280]  any objective that is a function of the agent's behavior.
[34:59.280 -- 35:03.720]  So with pre-training, like what we do for language models,
[35:03.720 -- 35:05.760]  you're kind of choosing an objective
[35:05.760 -- 35:09.400]  that lets us do something with all the training data
[35:09.400 -- 35:11.720]  we have, which is all this internet text.
[35:11.720 -- 35:14.200]  So we choose this maximum likelihood objective,
[35:14.200 -- 35:17.000]  which is basically the only, or not the only thing,
[35:17.000 -- 35:20.200]  but it's like a sensible way to absorb all this knowledge.
[35:20.200 -- 35:24.040]  But then if we really want to optimize the agent's behavior
[35:24.040 -- 35:25.440]  for a specific objective,
[35:25.440 -- 35:29.040]  RL is kind of the only framework that lets you do that.
[35:29.960 -- 35:32.240]  Okay, John, we have a few questions from the audience
[35:32.240 -- 35:33.280]  and I'm just going to pick the two
[35:33.280 -- 35:36.240]  that have the highest score in terms of Twitter likes.
[35:36.240 -- 35:40.760]  So the first is from Eric Jang, VP of AI at Halodi Robotics.
[35:40.760 -- 35:43.360]  He asked, RL distributions are non-stationary,
[35:43.360 -- 35:46.080]  making it hard to reason about PPO losses
[35:46.080 -- 35:48.520]  and how that relates to return or generalization.
[35:48.520 -- 35:51.000]  Are there any intermediate plots and visualizations
[35:51.000 -- 35:53.120]  you'd like to generate to debug
[35:53.120 -- 35:56.200]  or incrementally build up a large-scale RL system?
[35:56.200 -- 35:59.760]  Yeah, there are definitely some stats that I look at.
[35:59.760 -- 36:02.640]  So I'll talk about this
[36:02.640 -- 36:07.640]  in the nuts and bolts reboot later this year,
[36:07.760 -- 36:12.760]  but I'd say things like looking at the explained variance
[36:12.800 -- 36:15.320]  of the value function, and looking at
[36:15.320 -- 36:18.120]  how many samples are getting clipped in PPO,
[36:18.120 -- 36:23.120]  and what the KL divergence between the policy before
[36:23.120 -- 36:25.680]  and after the update is, yeah, things like that.
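The three diagnostics mentioned here (explained variance of the value function, the fraction of samples clipped in PPO, and the KL divergence between the old and new policy) can be computed roughly as follows. This is a sketch under stated assumptions: the function names, the clip range of 0.2, and the simple sample-based KL estimator are illustrative choices, not details given in the conversation.

```python
import math


def explained_variance(value_preds, returns):
    """1 - Var[returns - preds] / Var[returns].

    1.0 means the value function predicts returns perfectly; values at or
    below 0 mean it does no better than predicting a constant.
    """
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    var_returns = variance(returns)
    if var_returns == 0:
        return float("nan")  # undefined when all returns are identical
    residuals = [r - v for r, v in zip(returns, value_preds)]
    return 1.0 - variance(residuals) / var_returns


def clip_fraction(logp_new, logp_old, clip_range=0.2):
    """Fraction of samples whose probability ratio leaves [1 - clip, 1 + clip]."""
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    return sum(1 for r in ratios if abs(r - 1.0) > clip_range) / len(ratios)


def approx_kl(logp_new, logp_old):
    """Sample estimate of KL(old || new), using actions drawn from the old policy."""
    return sum(o - n for n, o in zip(logp_new, logp_old)) / len(logp_new)
```

The inputs are per-sample log-probabilities of the actions actually taken, under the policy before and after the update. A clip fraction near 1 or a sharply rising KL would suggest the update step is too aggressive, while low explained variance points at the value function.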
[36:25.680 -- 36:30.640]  And then Ethan Caballero from Mila asks,
[36:30.640 -- 36:33.760]  what is your median estimate for the arrival date of AGI?
[36:33.760 -- 36:37.440]  I think not too far away, but like I said,
[36:37.440 -- 36:39.480]  I expect there to be a lot of false starts.
[36:39.480 -- 36:44.360]  I would say I expect AI to be able to do
[36:44.360 -- 36:46.520]  a better job than humans at most jobs
[36:46.520 -- 36:49.040]  that humans do now, in five years or so.
[36:49.040 -- 36:51.040]  That's not all jobs, but most jobs.
[36:51.040 -- 36:52.680]  For a while, we're gonna discover things
[36:52.680 -- 36:54.080]  that AI is very good at
[36:54.080 -- 36:56.440]  and where we wanna keep humans in control.
[36:56.440 -- 36:59.440]  So I think there'll be some kind of gradual process
[36:59.440 -- 37:01.240]  over the next 10 or 15 years.
[37:01.240 -- 37:02.440]  I've been curious about this.
[37:02.440 -- 37:05.160]  I see that some RL work is patented,
[37:05.160 -- 37:08.800]  but for TRPO or PPO,
[37:08.800 -- 37:10.160]  I could not find patents on these.
[37:10.160 -- 37:13.760]  Are those patent protected at all?
[37:13.760 -- 37:18.320]  Or how do you think of intellectual property protection
[37:18.320 -- 37:19.280]  for that kind of work?
[37:19.280 -- 37:22.120]  I haven't ever looked into patenting anything
[37:22.120 -- 37:25.080]  and OpenAI hasn't either as far as I know.
[37:25.080 -- 37:26.960]  I think the trend over time has been
[37:26.960 -- 37:29.600]  for people to take patents on
[37:29.600 -- 37:31.920]  machine learning algorithms less seriously.
[37:31.920 -- 37:34.520]  There's this algorithm in computer vision called SIFT,
[37:34.520 -- 37:36.960]  which is like this keypoint detector.
[37:36.960 -- 37:38.960]  And this was patented.
[37:38.960 -- 37:42.080]  I think the guy who patented it,
[37:42.080 -- 37:44.680]  he probably made his university some money from the patent,
[37:44.680 -- 37:48.160]  but in the end, all it did was cause people
[37:48.160 -- 37:52.080]  a lot of annoyance because people had to come up
[37:52.080 -- 37:56.280]  with alternative algorithms that had a different acronym
[37:56.280 -- 37:58.240]  and weren't patented.
[37:58.240 -- 38:02.920]  So the OpenCV open source library
[38:02.920 -- 38:05.400]  had to be careful about putting this algorithm
[38:05.400 -- 38:07.960]  in their library because of the patent risks.
[38:07.960 -- 38:11.960]  So I think these
[38:11.960 -- 38:13.920]  patent rights aren't exercised that much.
[38:13.920 -- 38:17.080]  And I think big companies like Google will patent
[38:17.080 -- 38:19.280]  a lot of stuff for defensive reasons.
[38:19.280 -- 38:22.040]  So if they get in some big legal dispute
[38:22.040 -- 38:24.360]  with another company, it can be used
[38:24.360 -- 38:26.520]  as like one of the bargaining chips.
[38:26.520 -- 38:30.440]  But I don't think anyone's gonna get sued
[38:30.440 -- 38:35.320]  for not providing royalties
[38:35.320 -- 38:36.960]  for the use of some algorithm.
[38:36.960 -- 38:40.080]  Okay, and then there's been a ton of work in RL, of course,
[38:40.080 -- 38:43.560]  since you first published TRPO and PPO.
[38:43.560 -- 38:45.200]  But from your point of view,
[38:45.200 -- 38:46.440]  if you had to pick a few highlights
[38:46.440 -- 38:50.360]  in terms of a few important milestones in RL algorithms
[38:50.360 -- 38:51.600]  since PPO came out,
[38:53.120 -- 38:55.080]  and by the way, it's amazing that in 2022,
[38:55.080 -- 38:56.400]  we're still using PPO,
[38:57.520 -- 39:01.000]  I think quite similar to its original form.
[39:01.000 -- 39:01.840]  Is that right?
[39:02.920 -- 39:03.920]  Yeah, pretty much.
[39:03.920 -- 39:06.880]  Yeah, so what would you say are the biggest
[39:06.880 -- 39:09.680]  highlights for you in terms of RL algorithms
[39:09.680 -- 39:11.640]  since you did PPO?
[39:11.640 -- 39:13.440]  Yeah, there's definitely been some interesting stuff.
[39:13.440 -- 39:16.480]  So I think like a little after PPO,
[39:16.480 -- 39:19.120]  there was TD3 and SAC,
[39:19.120 -- 39:23.000]  and those seem like pretty solid value-based methods.
[39:23.000 -- 39:25.320]  That was one development that was interesting.
[39:25.320 -- 39:27.840]  I think, yeah, I thought MuZero
[39:27.840 -- 39:32.840]  and its elaborations, like EfficientZero,
[39:32.840 -- 39:36.840]  were also pretty impressive,
[39:36.840 -- 39:38.960]  in that you can get that good sample efficiency.
[39:38.960 -- 39:41.600]  Both of the things I just mentioned were kind of,
[39:41.600 -- 39:45.000]  well, I don't wanna say mostly on toy tasks or benchmarks
[39:45.000 -- 39:48.120]  because yeah, I'm sure people are doing some real things
[39:48.120 -- 39:49.440]  with these algorithms.
[39:49.440 -- 39:52.040]  Yeah, so I think that stuff was interesting.
[39:52.040 -- 39:56.760]  I think the whole recent
[39:56.760 -- 40:00.360]  surge of interest in offline RL was also notable.
[40:00.360 -- 40:02.480]  I would say the stuff we're doing
[40:02.480 -- 40:06.040]  with RL from human feedback is a kind of offline RL
[40:06.040 -- 40:09.000]  because we have a fixed dataset,
[40:09.000 -- 40:11.640]  a fixed reward modeling dataset,
[40:11.640 -- 40:12.880]  and we're training against that.
[40:12.880 -- 40:14.720]  This is like offline RL,
[40:14.720 -- 40:15.960]  but you're doing it in a different way.
[40:15.960 -- 40:19.640]  You're using an on-policy algorithm with a reward model
[40:19.640 -- 40:23.280]  as opposed to maybe a more typical way to do offline RL,
[40:23.280 -- 40:25.040]  which would be to use an off-policy algorithm.
[40:25.040 -- 40:27.760]  Would that work here or would that not work here?
[40:27.760 -- 40:30.160]  What we're doing here is kind of like model-based RL
[40:30.160 -- 40:33.280]  because the reward model is like a model
[40:33.280 -- 40:35.800]  of the unknown part of the system.
[40:35.800 -- 40:38.920]  So the unknown part of the system here
[40:38.920 -- 40:42.760]  is the human rater, or yeah, the human.
[40:42.760 -- 40:46.880]  It's not the part where the output gets appended to your list of tokens.
[40:46.880 -- 40:48.600]  So this is kind of like the work
[40:48.600 -- 40:51.840]  that takes a dynamics model of the environment
[40:51.840 -- 40:54.240]  and just runs
[40:54.240 -- 40:56.600]  a policy gradient algorithm against it.
[40:56.600 -- 40:57.440]  So it's not like,
[40:57.440 -- 41:00.400]  so the idea of running an online algorithm
[41:00.400 -- 41:03.720]  against a model, that's kind of a well-established idea.
[41:03.720 -- 41:06.800]  Though I would say the papers that previously did this,
[41:06.800 -- 41:08.520]  they were in a pretty different regime.
[41:08.520 -- 41:11.200]  We're in this regime of doing fairly small updates
[41:11.200 -- 41:14.600]  to the policy because we have these awesome pre-trained models
[41:14.600 -- 41:19.000]  and we don't need to actually change them that much.
[41:19.000 -- 41:21.520]  So yeah, we use these online algorithms.
[41:21.520 -- 41:23.760]  I'd say part of the reason why we can get away
[41:23.760 -- 41:28.000]  with using just like an online algorithm
[41:28.000 -- 41:30.480]  is because we've been just looking
[41:30.480 -- 41:32.480]  at a contextual bandit problem.
[41:32.480 -- 41:35.080]  Yeah, because we only have like one time step.
[41:35.080 -- 41:37.840]  Like you get a query and you output a response
[41:37.840 -- 41:40.160]  and then that response gets a reward.
[41:40.160 -- 41:43.120]  So if we had like a multi-step process
[41:43.120 -- 41:48.120]  such as a conversation where you can't assign a reward
[41:48.320 -- 41:50.280]  until the very end of the conversation
[41:50.280 -- 41:54.160]  and or you had some, I don't know, some interaction
[41:54.160 -- 41:57.800]  with like some real world system that's hard to simulate,
[41:57.800 -- 42:00.440]  you wouldn't, then it wouldn't be as straightforward to,
[42:00.440 -- 42:03.760]  you wouldn't be able to use exactly the same methodology.
[42:03.760 -- 42:05.680]  You would probably have to use a,
[42:05.680 -- 42:08.360]  you would have to probably train a Q function
[42:08.360 -- 42:10.600]  or something like that.
[42:10.600 -- 42:13.080]  If you want your method to be sample efficient,
[42:13.080 -- 42:15.640]  you would probably have to do something slightly different.
[42:15.640 -- 42:19.120]  I think we'll have to start exploring this
[42:19.120 -- 42:22.560]  at some point soon, but so far we haven't,
[42:22.560 -- 42:27.480]  at least I haven't seen any cases in like in the domain
[42:27.480 -- 42:29.680]  I'm looking at that require this,
[42:29.680 -- 42:33.480]  but I expect it to be relevant at some point.
[42:33.480 -- 42:37.080]  So we had Aravind Srinivas talking about Decision Transformer
[42:37.080 -- 42:39.360]  on the show recently, that was a great episode.
[42:39.360 -- 42:41.360]  And I see that you were also a co-author
[42:41.360 -- 42:43.920]  on the 2016 RL squared paper.
[42:43.920 -- 42:46.680]  I want to ask you about your thoughts on meta RL.
[42:46.680 -- 42:48.560]  Aravind had some interesting things to say
[42:48.560 -- 42:50.640]  about maybe the idea that a transformer
[42:50.640 -- 42:52.320]  could kind of supersede the need
[42:52.320 -- 42:54.200]  for an RL algorithm altogether.
[42:54.200 -- 42:56.200]  What do you expect from meta RL?
[42:56.200 -- 42:58.600]  Do you expect we'll still be using human-authored
[42:58.600 -- 43:00.600]  RL algorithms in the future?
[43:00.600 -- 43:03.000]  Yeah, that's a pretty bold statement that we don't need,
[43:03.000 -- 43:05.400]  we won't need any RL algorithms anymore.
[43:05.400 -- 43:07.640]  Yeah, since the RL squared paper,
[43:07.640 -- 43:10.920]  people have been talking less about meta learning,
[43:10.920 -- 43:12.400]  as far as I can tell,
[43:12.400 -- 43:15.760]  actually because sequence modeling has gotten so good,
[43:15.760 -- 43:19.680]  like transformer sequence models, that it's kind of clear
[43:19.680 -- 43:21.920]  that meta learning is just a special case of learning.
[43:21.920 -- 43:26.560]  Like it's just like a certain kind of long context learning,
[43:26.560 -- 43:28.720]  learning involving long episodes.
[43:28.720 -- 43:31.120]  And maybe it shouldn't be treated that differently
[43:31.120 -- 43:33.600]  or addressed with special algorithms.
[43:33.600 -- 43:36.760]  I would say, yeah, ideas like Decision Transformer
[43:36.760 -- 43:37.880]  are pretty interesting,
[43:37.880 -- 43:40.520]  where you try to reduce RL to supervised learning.
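
As a rough illustration of what reducing RL to supervised learning can mean, the sketch below relabels logged trajectories with their return-to-go and then fits a predictor from (return-to-go, state) to action. This is a simplification under stated assumptions: Decision Transformer uses a transformer over token sequences, whereas here a lookup table (`fit_table`, a hypothetical helper) stands in for the model, and the toy trajectories are invented for the example.

```python
from collections import Counter, defaultdict

def returns_to_go(rewards):
    """For each time step, sum the rewards from that step to the end."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

def build_dataset(trajectories):
    """Turn (state, action, reward) trajectories into supervised pairs.

    Each input is (return_to_go, state); each label is the logged action.
    """
    data = []
    for traj in trajectories:
        rtgs = returns_to_go([r for _, _, r in traj])
        for (state, action, _), g in zip(traj, rtgs):
            data.append(((g, state), action))
    return data

def fit_table(dataset):
    """Stand-in for a sequence model: most frequent action per input."""
    counts = defaultdict(Counter)
    for x, action in dataset:
        counts[x][action] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}

if __name__ == "__main__":
    # Two invented trajectories: only the second one earns reward.
    trajectories = [
        [("s0", "left", 0.0), ("s1", "left", 0.0)],
        [("s0", "right", 0.0), ("s1", "right", 1.0)],
    ]
    policy = fit_table(build_dataset(trajectories))
    # At test time, condition on the return you want to achieve.
    print(policy[(1.0, "s0")])
```

No policy gradient or value function appears anywhere: the only learning step is fitting labels, and desired behavior is selected by conditioning on a high target return.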
[43:40.520 -- 43:43.800]  It's still not certain exactly how these compare
[43:43.800 -- 43:47.320]  in performance to RL; people have started to analyze
[43:47.320 -- 43:49.280]  that empirically and theoretically.
[43:49.280 -- 43:53.320]  And I would say in practice, sometimes it's better,
[43:53.320 -- 43:55.240]  sometimes it's worse.
[43:55.240 -- 43:57.960]  In my experience, it's been worse on the problems
[43:57.960 -- 44:01.920]  where my colleagues and I have tested it.
[44:01.920 -- 44:05.480]  But yeah, it's definitely an interesting direction.
[44:05.480 -- 44:08.360]  Dr. John Schulman, thank you so much for sharing your time
[44:08.360 -- 44:10.360]  and your insight with the Talk RL audience today.
[44:10.360 -- 44:11.480]  Thanks so much.