John Schulman, OpenAI cofounder and researcher, inventor of PPO/TRPO, talks RL from human feedback, tuning GPT-3 to follow instructions (InstructGPT) and answer long-form questions using the internet (WebGPT), AI alignment, AGI timelines, and more!
TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.
[00:00.000 -- 00:01.960] the answer was affirmative.
[00:01.960 -- 00:05.680] We can get an agent to basically use a set of tools
[00:05.680 -- 00:06.520] that we give it.
[00:06.520 -- 00:09.440] In this case, the browsing commands, like searching.
[00:09.440 -- 00:13.880] I would say I expect AI to be able to do a better job
[00:13.880 -- 00:16.860] than humans at most jobs that humans do now,
[00:16.860 -- 00:17.980] five years or so.
[00:19.660 -- 00:20.500] TalkRL.
[00:22.660 -- 00:26.700] TalkRL podcast is all reinforcement learning, all the time,
[00:26.700 -- 00:29.900] featuring brilliant guests, both research and applied.
[00:29.900 -- 00:33.520] Join the conversation on Twitter at @TalkRLPodcast.
[00:33.520 -- 00:35.160] I'm your host, Robin Chauhan.
[00:39.500 -- 00:41.900] John Schulman is a co-founder of OpenAI
[00:41.900 -- 00:44.480] and a researcher and engineer at OpenAI.
[00:44.480 -- 00:46.360] He is well known for major contributions
[00:46.360 -- 00:48.440] to the field of reinforcement learning,
[00:48.440 -- 00:50.760] including the TRPO algorithm,
[00:50.760 -- 00:52.920] that's Trust Region Policy Optimization,
[00:52.920 -- 00:56.000] GAE, Generalized Advantage Estimation.
[00:56.000 -- 00:58.120] Those are from his UC Berkeley dissertation.
[00:58.120 -- 01:02.080] And TRPO's descendant, Proximal Policy Optimization, or PPO.
[01:02.080 -- 01:06.040] His current focus at OpenAI is on RL from human feedback.
[01:06.040 -- 01:08.320] John, welcome to the show and thanks so much for being here.
[01:08.320 -- 01:09.360] Thanks a lot for having me.
[01:09.360 -- 01:11.380] You were literally one of the first people I thought of
[01:11.380 -- 01:13.840] when I started the show three years back.
[01:13.840 -- 01:14.880] Thanks, I'm honored.
[01:14.880 -- 01:17.320] It means a lot to me to have you here today.
[01:17.320 -- 01:20.920] I definitely remember your Nuts and Bolts of Deep RL video
[01:20.920 -- 01:23.240] back in the day and watching that multiple times
[01:23.240 -- 01:24.360] and gaining a lot from that.
[01:24.360 -- 01:26.200] So I think you helped probably a generation
[01:26.200 -- 01:28.640] of RL practitioners back then.
[01:28.640 -- 01:31.280] By the way, there's going to be a reboot
[01:31.280 -- 01:33.360] of the Nuts and Bolts presentation.
[01:33.360 -- 01:37.320] I got invited to give a talk at NeurIPS this year on it.
[01:37.320 -- 01:41.200] So I'll have to revamp the guidelines and everything.
[01:41.200 -- 01:42.120] So that'll be fun.
[01:42.120 -- 01:42.960] Oh, that's awesome.
[01:42.960 -- 01:43.780] Can't wait for that.
[01:43.780 -- 01:47.240] So you were clearly one of the earlier pioneers in deep RL.
[01:47.240 -- 01:49.640] So how did you choose to move your focus to RL
[01:49.640 -- 01:50.800] from human feedback?
[01:50.800 -- 01:52.560] And why is that an important problem?
[01:52.560 -- 01:53.740] Why is that important to you?
[01:53.740 -- 01:57.560] After GPT-3 was trained, I was blown away by how smart it was.
[01:57.560 -- 02:00.040] And I realized the next frontier was figuring out
[02:00.040 -- 02:02.000] how to make language models actually useful.
[02:02.000 -- 02:03.800] I'm still really interested in RL,
[02:03.800 -- 02:07.400] but solving RL benchmarks isn't the end of the story.
[02:07.400 -- 02:10.360] To use your RL algorithm, you need a reward function.
[02:10.360 -- 02:12.680] But where does the reward function come from?
[02:12.680 -- 02:15.160] In RL benchmarks, you usually just code up
[02:15.160 -- 02:16.020] the reward function.
[02:16.020 -- 02:18.320] But if you're not in a simulator environment,
[02:18.320 -- 02:19.160] that doesn't work.
[02:19.160 -- 02:23.280] So what we have to do in any kind of real world use case
[02:23.280 -- 02:25.160] is have humans look at what the AI did
[02:25.160 -- 02:26.680] and decide if it was good or bad.
[02:26.680 -- 02:29.200] So how exactly you define this reward
[02:29.200 -- 02:31.800] becomes a really challenging and important problem,
[02:31.800 -- 02:34.160] especially as the tasks get harder to evaluate.
[02:34.160 -- 02:37.240] Another angle on this is that language models are very smart,
[02:37.240 -- 02:40.400] but it's hard to get them to do anything useful.
[02:40.400 -- 02:43.200] A big part of that is they're not necessarily
[02:43.200 -- 02:44.240] trying to do what you want.
[02:44.240 -- 02:46.400] They're just trying to imitate the training corpus.
[02:46.400 -- 02:48.440] So that means there's a big opportunity
[02:48.440 -- 02:50.640] to improve them a lot by just giving them
[02:50.640 -- 02:51.600] the right objective.
[02:51.600 -- 02:55.280] That's what we can do by applying RL to these language
[02:55.280 -- 02:58.560] models using human feedback to define the reward.
[02:58.560 -- 03:02.560] Is using human feedback harder or very different in some way
[03:02.560 -- 03:04.360] than using a synthetic reward?
[03:04.360 -- 03:06.600] There are a lot of new complications.
[03:06.600 -- 03:09.800] Now you have to collect a data set dynamically.
[03:09.800 -- 03:12.160] So you're always in the business of building data
[03:12.160 -- 03:14.720] sets of human preferences.
[03:14.720 -- 03:17.160] Often the data quality there matters more
[03:17.160 -- 03:19.320] than various algorithmic details.
[03:19.320 -- 03:22.440] And you also have to think a lot about exactly how you're
[03:22.440 -- 03:24.360] giving the task to the human trainers
[03:24.360 -- 03:25.680] and various other things that you
[03:25.680 -- 03:27.360] wouldn't have thought about if you just
[03:27.360 -- 03:29.040] had a programmatic reward function.
[03:29.040 -- 03:31.080] Does the difference between human raters
[03:31.080 -- 03:34.200] or the noisiness of the reward signal cause any problems?
[03:34.200 -- 03:36.640] I would say the noise, definitely
[03:36.640 -- 03:40.320] you need to be below some threshold of noise
[03:40.320 -- 03:41.360] to learn anything.
[03:41.360 -- 03:44.160] I think, in general, a large noisy data
[03:44.160 -- 03:47.640] set can be as good as a smaller, clean data set.
[03:47.640 -- 03:50.640] So actually, noise isn't the thing that worries me the most.
[03:50.640 -- 03:53.600] It's more that there are sometimes consistent biases
[03:53.600 -- 03:54.680] that people have.
[03:54.680 -- 03:58.920] For example, in settings like question answering or settings
[03:58.920 -- 04:02.000] where you have a model writing some text,
[04:02.000 -- 04:04.160] often people prefer longer answers.
[04:04.160 -- 04:06.680] You end up with these very verbose answers.
[04:06.680 -- 04:08.880] If you're not careful with the instructions, that is.
[04:08.880 -- 04:12.000] I mean, you can also instruct people, the raters,
[04:12.000 -- 04:14.440] to reward brevity.
[04:14.440 -- 04:17.200] But if you're not careful, you can
[04:17.200 -- 04:19.360] incentivize the wrong kinds of behaviors.
[04:19.360 -- 04:21.480] So let's move to some of your recent work.
[04:21.480 -- 04:24.640] First up is WebGPT: Browser-assisted question-
[04:24.640 -- 04:26.200] answering with human feedback.
[04:26.200 -- 04:30.000] That's Nakano et al., with yourself as a co-author, in 2021.
[04:30.000 -- 04:32.880] Can you tell us what is the main idea of this paper?
[04:32.880 -- 04:33.880] What is WebGPT?
[04:33.880 -- 04:37.720] In WebGPT, we basically took our language models
[04:37.720 -- 04:40.040] and we hooked them up to a web browser
[04:40.040 -- 04:42.520] so they could retrieve information from the web.
[04:42.520 -- 04:44.480] And they can write an answer by summarizing
[04:44.480 -- 04:45.960] the relevant pages from the web.
[04:45.960 -- 04:48.760] So that way if you're asking a question about current events
[04:48.760 -- 04:51.520] or a question that requires some detailed scientific
[04:51.520 -- 04:53.840] or technical knowledge, this AI can go out
[04:53.840 -- 04:56.680] and look up the answer and respond with detailed citations
[04:56.680 -- 04:57.560] to its sources.
[04:57.560 -- 05:00.320] So I would say there's kind of two interesting points
[05:00.320 -- 05:01.160] to this.
[05:01.160 -- 05:03.600] One is we were exploring whether you could turn language
[05:03.600 -- 05:05.360] models into a kind of agent.
[05:05.360 -- 05:07.840] There's a lot of data on the web of different texts
[05:07.840 -- 05:09.920] that people have written, but there's not a lot of data
[05:09.920 -- 05:13.360] that shows how to actually do some multi-step process.
[05:13.360 -- 05:15.400] So it's not that clear a priori
[05:15.400 -- 05:16.880] whether you can get a language model
[05:16.880 -- 05:19.600] to actually carry out some iterative process.
[05:19.600 -- 05:22.480] We just have a lot of data like writing essays
[05:22.480 -- 05:23.960] and having chats and so forth.
[05:23.960 -- 05:25.840] So that was one thing we were exploring here.
[05:25.840 -- 05:28.120] And I think the answer was affirmative.
[05:28.120 -- 05:32.280] We can get an agent to basically use a set of tools
[05:32.280 -- 05:34.880] that we give it, in this case, the browsing commands
[05:34.880 -- 05:37.480] like searching, scrolling, clicking on links.
[05:37.480 -- 05:40.560] The second theme of this paper was around truthfulness.
[05:40.560 -- 05:44.120] I mean, a big issue with language models is,
[05:44.120 -- 05:45.600] I mean, they're not very reliable
[05:45.600 -- 05:47.080] at giving you true information.
[05:47.080 -- 05:49.680] They know a vastly superhuman amount,
[05:49.680 -- 05:51.640] but if you prompt them in the wrong way,
[05:51.640 -- 05:54.520] they'll just output lots of plausible sounding nonsense.
[05:54.520 -- 05:57.680] So how to fix that is a big research question
[05:57.680 -- 05:59.800] or one of the biggest research questions
[05:59.800 -- 06:01.640] in the world of language models.
[06:01.640 -- 06:03.480] I think it's gonna be challenging to fully fix it,
[06:03.480 -- 06:06.960] but I think a big part of the story involves retrieval
[06:06.960 -- 06:10.520] and having models write answers that contain citations,
[06:10.520 -- 06:12.600] citations to trusted sources.
[06:12.600 -- 06:14.440] So a person who's checking over the answer
[06:14.440 -- 06:16.160] doesn't have to go and try to figure out
[06:16.160 -- 06:18.200] where the model might've gotten this idea.
[06:18.200 -- 06:20.520] They can go and directly look at the source
[06:20.520 -- 06:23.280] and see if it supports the AI's statement.
[06:23.280 -- 06:25.960] With WebGPT, we just wanted to see
[06:25.960 -- 06:28.520] if we do give the language model
[06:28.520 -- 06:30.400] a really flexible interface to the web,
[06:30.400 -- 06:33.240] can we have it answer hard questions truthfully
[06:34.440 -- 06:36.280] with the help of all these citations?
[06:36.280 -- 06:38.360] And it's actually really non-trivial
[06:38.360 -- 06:41.040] because if you look at the dataset we use,
[06:41.040 -- 06:43.280] Reddit's Explain Like I'm Five (ELI5),
[06:43.280 -- 06:44.680] the questions are really varied,
[06:44.680 -- 06:46.840] like some of them are about science, history,
[06:46.840 -- 06:49.560] current events, like our raters didn't necessarily
[06:49.560 -- 06:51.520] know anything about these topics,
[06:51.520 -- 06:55.760] but still they had to judge the detailed answers.
[06:55.760 -- 06:57.640] So it would have been really hard to do it
[06:57.640 -- 06:59.960] without the supporting citations.
[06:59.960 -- 07:04.000] So we kind of validated that we could get good feedback
[07:04.000 -- 07:07.440] in a hard domain like this with the help of citations.
[07:07.440 -- 07:10.680] Can you talk about where the idea for WebGPT came from?
[07:10.680 -- 07:13.000] Is that an idea you've had kicking around for a while
[07:13.000 -- 07:15.800] or was it something that came up recently before the paper?
[07:15.800 -- 07:17.760] How did that play out?
[07:17.760 -- 07:19.800] Some of the ideas had been floating around,
[07:19.800 -- 07:22.400] like, we actually had a project
[07:22.400 -- 07:26.160] at OpenAI very early on called World of Bits.
[07:26.160 -- 07:28.520] We were looking at controlling web browsers
[07:28.520 -- 07:31.120] or doing tasks on the internet
[07:31.120 -- 07:32.360] with the web browser,
[07:32.360 -- 07:34.520] but it was way too early at the time.
[07:34.520 -- 07:38.120] So we kind of abandoned it for a few years.
[07:38.120 -- 07:40.240] Actually we were trying to, back then we were trying to do it
[07:40.240 -- 07:41.480] with full visual input.
[07:41.480 -- 07:45.040] So we thought, yeah, we could give some instructions
[07:45.040 -- 07:48.880] to the agent, like go and figure out the address
[07:48.880 -- 07:51.000] of this building or something.
[07:51.000 -- 07:54.000] The agent would go and search the web
[07:54.000 -- 07:57.000] or use Google Maps or whatever to figure out the answer.
[07:57.000 -- 07:58.760] And we were trying to do this all in pixels.
[07:58.760 -- 08:00.640] That obviously didn't work very well,
[08:00.640 -- 08:03.640] but now we have these great language models
[08:03.640 -- 08:05.680] that work on text data.
[08:05.680 -- 08:08.960] We can also extract the text out of web pages
[08:08.960 -- 08:12.000] to get most of the information.
[08:12.000 -- 08:15.280] We can't really interact with a lot of dynamic websites.
[08:15.280 -- 08:16.960] Yeah, where there's a lot of JavaScript
[08:16.960 -- 08:18.000] and images and so forth,
[08:18.000 -- 08:19.960] but as long as it's just browsing
[08:19.960 -- 08:21.760] and reading texts, we're fine.
[08:21.760 -- 08:24.320] So yeah, we had good enough models
[08:24.320 -- 08:27.880] and that made it kind of feasible to revisit this idea
[08:27.880 -- 08:30.960] of using the internet as an environment.
[08:30.960 -- 08:33.640] So I would say that was one of the sources
[08:33.640 -- 08:36.760] of inspiration, that long kind of thread
[08:36.760 -- 08:39.320] about like using the internet as an environment.
[08:39.320 -- 08:44.320] Another motivation was just after we started playing
[08:44.680 -- 08:47.920] with GPT-3, we noticed that it had all these problems
[08:47.920 -- 08:51.400] with factual accuracy and the reliability
[08:51.400 -- 08:52.920] of the information it was giving us.
[08:52.920 -- 08:56.280] So that kind of motivated doing more research
[08:56.280 -- 08:58.960] on how to make language models more truthful.
[08:58.960 -- 09:01.040] We were kind of brainstorming what to do there
[09:01.040 -- 09:05.480] and we went through some docs and eventually decided
[09:05.480 -- 09:07.760] that we wanted to try some question answering
[09:07.760 -- 09:09.800] like using the web, looking up knowledge
[09:09.800 -- 09:11.560] on the web to help answer questions.
[09:11.560 -- 09:12.880] So actually the original version
[09:12.880 -- 09:15.000] of the project used trivia questions.
[09:15.000 -- 09:18.400] So there's this well-known dataset, TriviaQA,
[09:18.400 -- 09:20.080] that has some basic trivia questions.
[09:20.080 -- 09:23.600] So we first worked a little bit on that dataset
[09:23.600 -- 09:26.960] and tried to see if we could boost the model's accuracy
[09:26.960 -- 09:29.840] by giving it web search.
[09:29.840 -- 09:33.040] And yeah, that actually worked quite straight.
[09:33.040 -- 09:34.160] That worked pretty easily.
[09:34.160 -- 09:36.120] So then we decided to move on
[09:36.120 -- 09:38.080] to long form question answering.
[09:38.080 -- 09:41.880] And so that gave us the, that was the project
[09:41.880 -- 09:43.880] we ended up working on for a while.
[09:43.880 -- 09:47.080] Seems like you use a few different datasets here
[09:47.080 -- 09:49.800] and a number of different training methods.
[09:50.760 -- 09:52.600] I'll just mention the list: behavior cloning,
[09:52.600 -- 09:55.080] reward modeling, reinforcement learning
[09:55.080 -- 09:56.800] and rejection sampling.
[09:56.800 -- 10:00.520] So we were using a fairly standard methodology
[10:00.520 -- 10:03.240] which was actually adapted from previous work
[10:03.240 -- 10:05.600] on RL from human preferences.
[10:05.600 -- 10:09.120] So the pipeline is you first train a model
[10:09.120 -- 10:13.320] with supervised learning where you have human demonstrators
[10:13.320 -- 10:15.560] show how to do the task, like show how to map
[10:15.560 -- 10:17.160] from observations to actions.
[10:17.160 -- 10:19.280] Yeah, so that's the supervised learning
[10:19.280 -- 10:20.440] or behavior cloning step.
[10:20.440 -- 10:24.400] Then we train a reward model or a preference model.
[10:24.400 -- 10:28.320] It looks at two actions or two trajectories
[10:28.320 -- 10:29.720] and decides which one is better.
[10:29.720 -- 10:32.640] In this case, like in a question answering setting
[10:32.640 -- 10:33.880] you're looking at two answers
[10:33.880 -- 10:35.480] and deciding which answer is better.
[10:35.480 -- 10:37.440] And we use that to train a reward model
[10:37.440 -- 10:39.640] that assigns higher score to the good answers
[10:39.640 -- 10:40.480] than the bad ones.
[10:40.480 -- 10:41.840] Then you do reinforcement learning
[10:41.840 -- 10:43.160] against that reward function.
[10:43.160 -- 10:45.560] And of course you can iterate these last two steps
[10:45.560 -- 10:46.960] after you do a little RL.
[10:46.960 -- 10:49.520] Now you're, you've sort of exploited some of the flaws
[10:49.520 -- 10:52.080] of the reward model, like, or some of the noise
[10:52.080 -- 10:53.200] in the reward model.
[10:53.200 -- 10:55.120] And it's not necessarily accurate
[10:55.120 -- 10:56.760] on your new distribution of data.
[10:56.760 -- 10:59.040] You recollect more pairs of samples
[10:59.040 -- 11:01.680] and refit this preference model.
[11:01.680 -- 11:04.000] And then you do another iteration of RL.
[11:04.000 -- 11:06.160] So that's like, that's the whole RL
[11:06.160 -- 11:07.600] from human feedback pipeline.
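A minimal sketch of the reward-modeling step described above, not the code used at OpenAI. It assumes a linear reward over two made-up answer features and the pairwise logistic (Bradley-Terry style) loss that is a common choice for preference data. The feature names, learning rate, and toy comparisons are all illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Scalar score for one answer: a plain dot product (assumed model)."""
    return sum(w * f for w, f in zip(weights, features))


def preference_loss(weights, preferred, rejected):
    """-log sigmoid(r(preferred) - r(rejected)), written with log1p."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))


def train_reward_model(comparisons, n_features, lr=0.5, epochs=200):
    """Fit weights by stochastic gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is
            # -(1 - sigmoid(margin)) * (preferred - rejected).
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per answer: [factual_accuracy, length].
# Each pair is (answer the rater preferred, answer the rater rejected).
toy_comparisons = [
    ([0.9, 0.2], [0.3, 0.8]),
    ([0.8, 0.5], [0.4, 0.5]),
    ([0.7, 0.1], [0.2, 0.3]),
]

if __name__ == "__main__":
    w = train_reward_model(toy_comparisons, n_features=2)
    for good, bad in toy_comparisons:
        print(reward(w, good), reward(w, bad), preference_loss(w, good, bad))
```

Each comparison contributes only its label (which answer won), which matches the "one bit per comparison" setup discussed later in the conversation.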
[11:07.600 -- 11:11.080] And there's this other idea called rejection sampling
[11:11.080 -- 11:12.400] or best-of-n sampling.
[11:12.400 -- 11:14.840] And in general, you can do other kinds of search too.
[11:14.840 -- 11:18.680] Where instead of doing RL once you have your reward model
[11:18.680 -- 11:21.040] you can just search against that reward model.
[11:21.040 -- 11:23.440] So you can take a bunch of, collect a bunch of samples
[11:23.440 -- 11:25.960] and re-rank them with the reward model
[11:25.960 -- 11:28.960] and take the best one as your action.
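The re-ranking idea above can be sketched in a few lines. This is an illustration under stated assumptions, not the WebGPT implementation: `sample_fn` and `reward_fn` are hypothetical stand-ins for a language-model sampler and a learned reward model, and the toy versions below (canned answers, citation counting) exist only to make the example runnable.

```python
import random


def best_of_n(prompt, sample_fn, reward_fn, n=16, seed=0):
    """Draw n candidate answers and return the one the reward model scores highest."""
    rng = random.Random(seed)
    candidates = [sample_fn(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_fn(prompt, answer))


# Toy stand-ins (assumptions, for illustration only).
CANNED_ANSWERS = [
    "It just is.",
    "Because of scattering in the atmosphere [1].",
    "Shorter wavelengths scatter more strongly [1][2].",
]


def toy_sampler(prompt, rng):
    return rng.choice(CANNED_ANSWERS)


def toy_reward(prompt, answer):
    # Pretend the reward model likes answers with more bracketed citations.
    return answer.count("[")


if __name__ == "__main__":
    print(best_of_n("Why is the sky blue?", toy_sampler, toy_reward, n=8))
```

No policy weights change here; all of the improvement comes from spending more samples at inference time and trusting the reward model to rank them.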
[11:28.960 -- 11:30.520] Kind of like MPC?
[11:30.520 -- 11:31.360] Yeah, exactly.
[11:31.360 -- 11:33.440] Yeah, it kind of depends exactly
[11:33.440 -- 11:35.640] what setting you're in, what you can do.
[11:35.640 -- 11:38.400] If you're in a setting where there's some environment
[11:38.400 -- 11:41.040] you're interacting with, then you would have to simulate
[11:41.040 -- 11:44.160] your, you'd have to simulate the dynamics
[11:44.160 -- 11:45.920] of your environment, which yeah.
[11:45.920 -- 11:47.920] So that would look kind of like MPC.
[11:47.920 -- 11:51.360] In our case, we were, the only thing we had to learn
[11:51.360 -- 11:55.080] a model of was the human preference.
[11:55.080 -- 11:57.480] So like we're, it's a question answering setting.
[11:57.480 -- 11:59.760] So it's really like a contextual bandit problem.
[11:59.760 -- 12:02.520] So it's kind of straightforward to take a bunch of,
[12:02.520 -- 12:04.320] sample a bunch of actions where each action
[12:04.320 -- 12:06.880] is a full answer and re-rank them
[12:06.880 -- 12:11.640] and search over answers.
[12:11.640 -- 12:13.760] So in terms of the action space,
[12:13.760 -- 12:16.040] was the action space just the list of commands
[12:16.040 -- 12:17.800] or is it still generating tokens
[12:17.800 -- 12:20.440] like a regular generative model?
[12:20.440 -- 12:21.800] We were generating tokens.
[12:21.800 -- 12:26.800] We had two phases in each episode of the RL task.
[12:26.800 -- 12:31.280] So there was first a browsing phase where the model goes
[12:31.280 -- 12:33.960] and it issues searches and clicks on things
[12:33.960 -- 12:36.560] and quotes relevant information.
[12:36.560 -- 12:38.400] Like if it sees something useful on the page,
[12:38.400 -- 12:40.920] it'll quote it using this quote command.
[12:40.920 -- 12:44.560] And then once it's done browsing,
[12:44.560 -- 12:48.480] it'll issue another command called end browsing
[12:48.480 -- 12:49.920] and it'll write its answer.
[12:49.920 -- 12:52.120] That's also expressed in tokens.
[12:52.120 -- 12:55.400] But really we rolled this all into one big RL task
[12:55.400 -- 12:57.440] where your episode involves browsing
[12:57.440 -- 12:58.640] and writing out the answer
[12:58.640 -- 13:01.480] and it's all one big RL episode.
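A rough sketch of the two-phase episode described above, where browsing commands and the final answer are all just text emitted by the policy. The command spellings (`search`, `click`, `quote`, `end_browsing`), the class, and the scripted outputs are hypothetical; the real WebGPT command format is not given in this conversation.

```python
from dataclasses import dataclass, field


@dataclass
class BrowsingEpisode:
    question: str
    quotes: list = field(default_factory=list)  # text collected with "quote"
    log: list = field(default_factory=list)     # every policy output, in order
    browsing: bool = True                       # True until "end_browsing"
    answer: str = ""
    done: bool = False

    def step(self, output: str) -> None:
        """Apply one piece of text emitted by the policy."""
        self.log.append(output)
        if not self.browsing:
            # Second phase: the policy's output is the answer itself.
            self.answer = output
            self.done = True
            return
        name, _, arg = output.partition(" ")
        if name == "quote":
            self.quotes.append(arg)
        elif name == "end_browsing":
            self.browsing = False
        # "search", "click" and "scroll" would change the page shown to the
        # policy; this toy environment only records them in the log.


def run_episode(question, policy_outputs):
    """Feed a sequence of policy outputs through one browse-then-answer episode."""
    episode = BrowsingEpisode(question)
    for output in policy_outputs:
        episode.step(output)
        if episode.done:
            break
    return episode


if __name__ == "__main__":
    scripted = [
        "search why is the sky blue",
        "click 0",
        "quote Shorter wavelengths are scattered more strongly.",
        "end_browsing",
        "The sky looks blue because shorter wavelengths scatter more [1].",
    ]
    ep = run_episode("Why is the sky blue?", scripted)
    print(ep.quotes)
    print(ep.answer)
```

Treating the whole log as one trajectory is what lets a single reward on the final answer train both the browsing and the writing.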
[13:01.480 -- 13:02.840] Did you think this is gonna work well
[13:02.840 -- 13:04.440] or were you kind of surprised?
[13:04.440 -- 13:06.360] At the very beginning of the project,
[13:06.360 -- 13:09.000] we didn't know if it was gonna work or not.
[13:09.000 -- 13:10.920] Like after we did the initial experiments
[13:10.920 -- 13:12.560] with TriviaQA,
[13:12.560 -- 13:15.560] which actually didn't take that long to get running,
[13:15.560 -- 13:19.120] then it became pretty clear that it would work,
[13:19.120 -- 13:20.640] that the browsing part worked at least.
[13:20.640 -- 13:22.880] And we already know that we can get these models
[13:22.880 -- 13:26.760] to write pretty good long form text with a bunch of,
[13:26.760 -- 13:28.520] if you give them a bunch of snippets
[13:28.520 -- 13:31.080] of text that they can cite.
[13:31.080 -- 13:35.400] So I noticed the human raters' task was quite complicated.
[13:35.400 -- 13:38.200] It was a long guide and there were many types of feedback
[13:38.200 -- 13:39.040] that they were giving.
[13:39.040 -- 13:40.440] But in the end, the paper said
[13:40.440 -- 13:42.720] that only the final rating was used.
[13:42.720 -- 13:44.640] So I was just curious if you had any comment about that.
[13:44.640 -- 13:46.040] Like why do you think maybe the model
[13:46.040 -- 13:47.440] couldn't use that extra feedback
[13:47.440 -- 13:50.840] or is this maybe just too much or not enough samples?
[13:50.840 -- 13:55.200] Yeah, that's been one frustrating finding so far.
[13:55.200 -- 13:58.480] In that project and also some other projects,
[13:58.480 -- 14:01.480] we've had the same finding that you have your raters
[14:01.480 -- 14:05.760] go through this long process for each comparison they do
[14:05.760 -- 14:08.240] where they're comparing a pair of answers.
[14:08.240 -- 14:10.440] And then you only use one bit of information
[14:10.440 -- 14:13.080] from this whole process,
[14:13.080 -- 14:14.720] which might've taken like half an hour.
[14:14.720 -- 14:15.840] It seems like it would be better
[14:15.840 -- 14:19.320] if we were able to extract more information,
[14:19.320 -- 14:21.680] more about the process they went through
[14:21.680 -- 14:22.920] in arriving at the answer.
[14:22.920 -- 14:25.040] So we did collect all sorts of other information
[14:25.040 -- 14:27.160] like we had them provide ratings
[14:27.160 -- 14:28.600] along several different axes
[14:28.600 -- 14:32.760] like coherence and factual accuracy and so forth.
[14:32.760 -- 14:35.960] But in the end, we didn't really get much of a boost
[14:35.960 -- 14:39.160] out of using any of this other information.
[14:39.160 -- 14:44.160] So I'd say it seems like it should be possible to do better.
[14:44.800 -- 14:46.520] But unfortunately this methodology,
[14:46.520 -- 14:49.840] which seems kind of dumb so far is hard to beat.
[14:49.840 -- 14:52.760] And people have tried various other ideas
[14:52.760 -- 14:55.120] for like how to use human feedback
[14:55.120 -- 14:57.080] instead of getting these preference scores,
[14:57.080 -- 14:58.400] there are various other things you can do.
[14:58.400 -- 15:00.840] Like you can have them write critiques and edit
[15:00.840 -- 15:03.200] or maybe edit the responses.
[15:03.200 -- 15:07.080] Yeah, I think some of these things are also promising.
[15:07.080 -- 15:09.440] But yeah, this methodology
[15:09.440 -- 15:12.080] of collecting preference data works well.
[15:12.080 -- 15:15.160] Yeah, I think it's still an open area of research.
[15:15.160 -- 15:18.280] Oh yeah, regarding the really long instructions.
[15:18.280 -- 15:20.000] Yeah, I think for any of these tasks,
[15:20.000 -- 15:24.000] there is a lot of subtlety in how to do the task properly.
[15:24.000 -- 15:27.800] And so we ended up adding more and more details
[15:27.800 -- 15:29.640] of like what do you do in this situation?
[15:29.640 -- 15:30.960] What do you do in that situation?
[15:30.960 -- 15:33.320] I think it's starting to get pretty unwieldy
[15:33.320 -- 15:35.760] with these really long instruction manuals.
[15:35.760 -- 15:39.920] So there's some promising ideas for how to address this.
[15:39.920 -- 15:42.840] Like there's a paper from DeepMind recently,
[15:42.840 -- 15:45.920] Sparrow, that basically broke down the task
[15:45.920 -- 15:48.520] and they trained, they basically had people look
[15:48.520 -- 15:52.400] at one aspect of the response at a time.
[15:52.400 -- 15:54.640] And then they had a way of combining
[15:54.640 -- 15:56.480] these different rule specific,
[15:56.480 -- 15:58.680] they would train a bunch of rule specific reward models
[15:58.680 -- 16:00.440] and then combine them at the end.
[16:00.440 -- 16:02.520] Yeah, I think there's some other interesting ideas
[16:02.520 -- 16:05.320] for how to make this process better.
[16:05.320 -- 16:08.480] So I gather from your answer about WebGPT
[16:08.480 -- 16:10.720] that the whole idea of WebGPT is that you want
[16:10.720 -- 16:14.400] the language model to have access to external knowledge.
[16:14.400 -- 16:17.560] But I wonder where you think the line should really be
[16:17.560 -- 16:19.680] in terms of what a language model should know
[16:19.680 -- 16:21.920] and what the language model should look up
[16:21.920 -- 16:24.240] and maybe what the language model should not know
[16:24.240 -- 16:25.600] or not purport to know.
[16:25.600 -- 16:27.120] Do you have opinions about that?
[16:27.120 -- 16:28.560] Yeah, let's see.
[16:28.560 -- 16:30.200] Like some people are advocating
[16:30.200 -- 16:32.480] for very small language models that have
[16:32.480 -- 16:35.480] like no external knowledge aside from language,
[16:35.480 -- 16:37.000] I guess would be the extreme position.
[16:37.000 -- 16:39.680] And then other people have talked about language models
[16:39.680 -- 16:41.000] that just know everything
[16:41.000 -- 16:43.440] as opposed to having an external knowledge source.
[16:43.440 -- 16:45.000] There's some interesting questions there.
[16:45.000 -- 16:48.440] So I think it is a little hard to separate knowledge,
[16:48.440 -- 16:51.160] factual knowledge from understanding.
[16:51.160 -- 16:55.120] So as humans, we get by like not memorizing
[16:55.120 -- 16:57.560] all sorts of facts and just knowing
[16:57.560 -- 16:59.720] that we can look them up if needed.
[16:59.720 -- 17:01.520] For working on a specific domain,
[17:01.520 -- 17:06.440] it is useful to like have a lot of facts internalized
[17:06.440 -- 17:08.520] so that you can recall them very quickly
[17:08.520 -- 17:11.480] and kind of combine them in your head.
[17:11.480 -- 17:14.840] So I wouldn't take an extreme position on either side.
[17:14.840 -- 17:18.400] I would say, I think retrieval is gonna be really useful
[17:19.520 -- 17:22.480] just at the very least for current events,
[17:22.480 -- 17:26.480] but also I don't think we wanna try to pack
[17:26.480 -- 17:29.960] all human knowledge into the weights of a neural net.
[17:29.960 -- 17:32.280] On the other hand, I think people have had a lot of luck
[17:32.280 -- 17:37.200] just scaling up models and like as they soak up
[17:37.200 -- 17:40.800] more factual knowledge, they also get better at reasoning
[17:40.800 -- 17:41.640] and other things.
[17:41.640 -- 17:44.280] And I think I haven't seen any demonstrations
[17:44.280 -- 17:48.080] of tiny models that just do lots of retrieval
[17:48.080 -- 17:50.320] and save all their weights for reasoning.
[17:50.320 -- 17:53.840] Yeah, I just haven't seen any evidence of this
[17:53.840 -- 17:57.480] or I haven't seen any successful attempts at making this.
[17:57.480 -- 17:59.640] Let's move on to training language models
[17:59.640 -- 18:01.680] to follow instructions with human feedback.
[18:01.680 -- 18:03.080] That was Ouyang et al.
[18:03.080 -- 18:05.640] And that was 2022 with yourself as a co-author.
[18:05.640 -- 18:08.040] Can you tell us the main idea with this paper?
[18:08.040 -- 18:09.760] This is the InstructGPT paper.
[18:09.760 -- 18:12.000] What is InstructGPT and what's going on here?
[18:12.000 -- 18:15.240] InstructGPT is a language model that's fine-tuned
[18:15.240 -- 18:16.480] to follow instructions.
[18:16.480 -- 18:19.000] And it's in fact the one that you can play with
[18:19.000 -- 18:23.280] if you go to the OpenAI website, you get a big text box
[18:23.280 -- 18:25.920] and you can write some text and then press the button
[18:25.920 -- 18:27.680] to generate a completion.
[18:27.680 -- 18:30.240] So the idea here was, I mean, language models
[18:30.240 -- 18:33.800] are pretty useful and you can sometimes get them
[18:33.800 -- 18:36.160] to do what you want by prompting them just right.
[18:36.160 -- 18:39.880] This idea of few-shot prompting has become pretty popular
[18:39.880 -- 18:41.560] where you give a few examples,
[18:41.560 -- 18:44.200] like a few question and answer examples.
[18:44.200 -- 18:45.720] And then if you ask another question,
[18:45.720 -- 18:48.520] it'll hopefully provide an answer in the same style.
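As a small illustration of the few-shot prompting just described, the sketch below assembles a prompt from a few question-and-answer examples and leaves the last answer open for the model to complete. The `Q:`/`A:` layout and the example pairs are assumptions chosen for readability, not a format prescribed in the conversation.

```python
def build_few_shot_prompt(examples, question):
    """Join (question, answer) examples, then append the new question with an open answer."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    examples = [
        ("What is the capital of France?", "Paris"),
        ("What is the capital of Japan?", "Tokyo"),
    ]
    print(build_few_shot_prompt(examples, "What is the capital of Canada?"))
```

A raw language model sees only text to continue, so the examples are what signal that a short, direct answer is expected.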
[18:48.520 -- 18:51.600] So the idea, yeah, so you can get language models
[18:51.600 -- 18:53.240] to do great things with prompting,
[18:53.240 -- 18:55.240] but prompting is itself an art
[18:55.240 -- 18:56.480] and it's tricky to get right.
[18:56.480 -- 18:59.040] And it's also kind of not necessarily getting
[18:59.040 -- 19:01.600] the best possible performance out of the model.
[19:01.600 -- 19:03.120] If you just take a raw language model
[19:03.120 -- 19:06.000] and you try to talk to it, like you ask it a question,
[19:06.000 -- 19:08.840] it probably, it doesn't know that it should actually answer
[19:08.840 -- 19:10.560] that question as well as possible.
[19:10.560 -- 19:13.840] It, for all it knows, you want it to give a joke answer
[19:13.840 -- 19:15.320] or a riddle or something.
[19:15.320 -- 19:17.840] Yeah, so the idea of InstructGPT was,
[19:17.840 -- 19:21.120] let's make a kind of small change to our language models
[19:21.120 -- 19:22.880] so that they're much easier to use.
[19:22.880 -- 19:25.360] In particular, we're gonna train them to,
[19:25.360 -- 19:29.440] if you have a piece of text where there's an instruction,
[19:29.440 -- 19:32.840] the model will try to follow that instruction
[19:32.840 -- 19:34.120] to the best of its abilities.
[19:34.120 -- 19:36.480] And pretty much anything can be an instruction.
[19:36.480 -- 19:38.760] Like you can have a, the instruction can be
[19:38.760 -- 19:43.760] to continue a chat or it can be to summarize this text
[19:44.400 -- 19:48.740] or give me a list of names for my company
[19:48.740 -- 19:50.240] that sells widgets.
[19:50.240 -- 19:51.680] Yeah, instructions can be anything
[19:51.680 -- 19:54.960] and that makes this kind of model very powerful.
[19:54.960 -- 19:56.000] So that was kind of,
[19:56.000 -- 19:58.120] that's the idea of an instruction following model.
[19:58.120 -- 19:59.760] It's like a model that can do anything
[19:59.760 -- 20:01.460] that you specify with an instruction.
[20:01.460 -- 20:04.000] And by the way, I wasn't a core contributor to this work.
[20:04.000 -- 20:09.000] I was more involved with like getting the RL infrastructure
[20:09.360 -- 20:12.280] and some of the RL training details,
[20:12.280 -- 20:14.440] like helping out with that stuff.
[20:14.440 -- 20:16.840] But anyway, yeah, what we did in this project was
[20:16.840 -- 20:20.620] we ran this whole methodology that I just described
[20:20.620 -- 20:23.160] of RL from human preferences
[20:23.160 -- 20:24.900] in this instruction following setting.
[20:24.900 -- 20:28.080] So we did supervised fine tuning,
[20:28.080 -- 20:30.840] collected preference data, trained a reward model,
[20:30.840 -- 20:33.800] and then did RL against that reward model.
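The reward-model step can be illustrated with a pairwise preference loss. The conversation does not say which loss was used; the sketch below assumes a Bradley-Terry style objective, a common choice for learning from comparisons, and the function name and toy reward values are hypothetical.

```python
import math


def preference_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood that the preferred response wins, assuming
    P(preferred wins) = sigmoid(reward_preferred - reward_rejected)."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy scalar rewards standing in for a reward model's outputs.
tied = preference_loss(0.0, 0.0)       # model has no opinion
agrees = preference_loss(2.0, 0.0)     # model agrees with the human label
disagrees = preference_loss(0.0, 2.0)  # model disagrees with the human label
print(tied, agrees, disagrees)
```

Minimizing this loss over many labeled comparisons pushes the reward model to score human-preferred responses higher; the RL step then optimizes the policy against those scores.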
[20:33.800 -- 20:36.240] And one interesting detail is actually
[20:36.240 -- 20:40.080] whereas the original initial data was just collected
[20:40.080 -- 20:41.840] using contractors.
[20:41.840 -- 20:46.840] At a certain point we had the API and it's got this,
[20:47.040 -- 20:50.520] I mean, we have this playground on the website,
[20:50.520 -- 20:52.800] which is the big text box
[20:52.800 -- 20:54.800] where you can use the model.
[20:54.800 -- 20:57.200] So we took prompts that people,
[20:57.200 -- 20:59.680] that users had put into the playground
[20:59.680 -- 21:01.280] and used those for training,
[21:01.280 -- 21:04.680] like both to collect preference data and to do RL.
[21:04.680 -- 21:07.040] So, and this is like,
[21:07.040 -- 21:10.760] this is disclosed to users pretty prominently.
[21:10.760 -- 21:13.040] Like when people are using the playground,
[21:13.040 -- 21:15.520] you get notified that your prompts might be used
[21:15.520 -- 21:16.480] for the training.
[21:16.480 -- 21:19.120] And we're also careful to train in such a way
[21:19.120 -- 21:20.860] that we don't memorize any information
[21:20.860 -- 21:23.080] that was in the prompts.
[21:23.080 -- 21:24.760] Like, and it's explicit,
[21:24.760 -- 21:27.480] like we have a pretty like elaborate process
[21:27.480 -- 21:30.680] for making sure there's no like private information
[21:30.680 -- 21:32.840] being leaked into the model.
[21:32.840 -- 21:36.960] But anyway, yeah, that's basically the experimental setup.
[21:36.960 -- 21:39.680] And the result was that it works
[21:39.680 -- 21:42.060] like this methodology works quite well.
[21:42.060 -- 21:44.480] And you get a model that's vastly preferred
[21:44.480 -- 21:48.820] to the base model on this distribution of realistic prompts
[21:48.820 -- 21:50.880] that people are giving the model,
[21:50.880 -- 21:53.040] which often contain instructions.
[21:53.040 -- 21:56.040] So the raw, like the raw language models
[21:56.040 -- 21:58.760] generally do a really bad job following instructions.
[21:58.760 -- 22:02.920] But this RL trained instruction following model
[22:02.920 -- 22:04.120] is a lot better.
[22:04.120 -- 22:06.440] And it's something like,
[22:06.440 -- 22:08.220] if you just calculate how much better,
[22:08.220 -- 22:09.200] it's something like,
[22:09.200 -- 22:11.800] it's as good as a model that's a hundred times bigger.
[22:11.800 -- 22:13.200] That's a lot.
[22:13.200 -- 22:14.040] Yeah.
[22:14.040 -- 22:15.280] You wanted the model to be truthful.
[22:15.280 -- 22:17.640] Is that one of the criteria you wanted?
[22:17.640 -- 22:20.000] Yeah, truthfulness was one of the criteria.
[22:20.000 -- 22:22.200] That seems amazing to me that truthfulness
[22:22.200 -- 22:24.080] is something that it could learn by example.
[22:24.080 -- 22:26.480] Like does that mean that truthfulness is somehow
[22:26.480 -- 22:28.000] represented inside the network
[22:28.000 -- 22:31.240] or because there's no external way for the model to confirm
[22:31.240 -- 22:32.720] whether something is true or false?
[22:32.720 -- 22:35.440] So how might it know what is true
[22:35.440 -- 22:37.480] without any external reference?
[22:37.480 -- 22:38.960] I think to some extent,
[22:38.960 -- 22:42.420] there is some internal representation of truthfulness.
[22:42.420 -- 22:43.260] So I would say,
[22:43.260 -- 22:45.340] like one way to think about what language models do
[22:45.340 -- 22:48.200] is they're trained to imitate the whole internet.
[22:48.200 -- 22:50.520] And the internet is written by lots of different people
[22:50.520 -- 22:52.520] and has lots of different types of content
[22:52.520 -- 22:57.200] from fiction to nonfiction to like technical,
[22:57.200 -- 23:00.600] like detailed technical literature to like jokes
[23:00.600 -- 23:03.400] and like forum posts, whatever.
[23:03.400 -- 23:07.260] So the model is basically an ensemble of all these people
[23:07.260 -- 23:08.880] who wrote stuff on the internet,
[23:08.880 -- 23:11.000] the raw pre-trained model.
[23:11.000 -- 23:13.080] When you feed it a prompt,
[23:13.080 -- 23:15.580] what it's doing internally has to be something like
[23:15.580 -- 23:18.200] figuring out who wrote this prompt
[23:18.200 -- 23:20.020] and then trying to continue in that style.
[23:20.020 -- 23:21.880] So if it thinks it's reading,
[23:21.880 -- 23:26.180] just reading something on the Wall Street Bets Reddit,
[23:26.180 -- 23:28.440] it's gonna continue on that style.
[23:28.440 -- 23:30.640] But if it thinks it's in the New York Times,
[23:30.640 -- 23:33.320] it's gonna write in a very different way.
[23:33.320 -- 23:38.280] So effectively, the model must be calculating somewhere,
[23:38.280 -- 23:40.800] like what style is this or what ensemble,
[23:40.800 -- 23:43.900] what's the narrower ensemble of styles
[23:43.900 -- 23:46.400] that I'm trying to imitate now.
[23:46.400 -- 23:48.400] At the very least, when you do some kind of,
[23:48.400 -- 23:51.080] when you do training like either supervised fine tuning
[23:51.080 -- 23:52.840] or RL from human feedback,
[23:52.840 -- 23:55.600] you can at least like narrow down the set of styles
[23:55.600 -- 23:59.500] the model is producing and try to imitate like the best
[23:59.500 -- 24:02.680] or the best person in the training set
[24:02.680 -- 24:04.300] or the best style in the training set.
[24:04.300 -- 24:06.480] And obviously best will differ a lot.
[24:06.480 -- 24:09.540] So what we'll end up with will depend on our instructions.
[24:09.540 -- 24:12.520] So if we tell, I don't know,
[24:12.520 -- 24:15.080] we'll end up with something that has kind of safe,
[24:15.080 -- 24:19.000] like not too controversial,
[24:19.000 -- 24:21.160] but a bit corporate,
[24:21.160 -- 24:23.240] we'll end up with something like that
[24:23.240 -- 24:25.680] depending on what our instructions are.
[24:25.680 -- 24:27.320] So at the very least,
[24:27.320 -- 24:29.880] like we can kind of narrow in on one style
[24:29.880 -- 24:32.160] instead of having the whole distribution
[24:32.160 -- 24:33.320] of styles on the internet.
[24:33.320 -- 24:35.780] I think probably there's more to it than that.
[24:35.780 -- 24:38.140] Like we're not just learning about style,
[24:38.140 -- 24:40.580] but the model probably is like internally
[24:40.580 -- 24:42.220] trying to determine if things are,
[24:42.220 -- 24:44.000] if statements are true or not,
[24:44.000 -- 24:47.320] like if the prompt contains incorrect information,
[24:47.320 -- 24:48.980] because that probably would be useful
[24:48.980 -- 24:51.560] for determining a likely completion.
[24:51.560 -- 24:53.340] I'm just talking about the raw pre-trained model.
[24:53.340 -- 24:54.520] So I think, yeah,
[24:54.520 -- 24:58.180] I think just the objective of predicting next tokens
[24:58.180 -- 24:59.520] probably gives you a lot.
[24:59.520 -- 25:02.120] It forces the model to like to determine
[25:02.120 -- 25:03.680] if things are true or not.
[25:03.680 -- 25:05.880] I think for RL fine tuning,
[25:05.880 -- 25:07.560] there's a lot more potential for the model
[25:07.560 -- 25:11.900] to actually like try to output something truthful
[25:11.900 -- 25:14.240] as opposed to trying to imitate a certain style.
[25:14.240 -- 25:16.120] Though it's hard to,
[25:16.120 -- 25:18.520] I guess it would be hard to like determine
[25:18.520 -- 25:21.400] if that's what the model is actually trying to do.
[25:21.400 -- 25:24.240] So it's almost like the prompt is guiding the model.
[25:24.240 -- 25:26.720] It's like, what corner of the internet do we want to,
[25:26.720 -- 25:28.320] do we want to imitate here?
[25:28.320 -- 25:31.240] And maybe InstructGPT wants to,
[25:31.240 -- 25:33.520] to focus more on the more truthful corners
[25:33.520 -- 25:35.800] of the internet, or something similar to that.
[25:35.800 -- 25:36.880] Yeah, I would hope so.
[25:36.880 -- 25:38.680] At least I think that's a pretty good,
[25:38.680 -- 25:41.360] though maybe a little simplistic picture of what's going on.
[25:41.360 -- 25:42.200] At the very least,
[25:42.200 -- 25:44.920] we should be able to imitate the most truthful corner
[25:44.920 -- 25:45.760] of the internet.
[25:45.760 -- 25:47.760] So can you talk about a generalization
[25:47.760 -- 25:52.360] and how does this type of model perform out of distribution?
[25:52.360 -- 25:54.080] Like, I guess if it sees questions
[25:54.080 -- 25:56.480] that are a bit different than what it was trained on,
[25:56.480 -- 25:58.040] what happens if we get a little bit away
[25:58.040 -- 26:00.560] from the training data with the reward models?
[26:00.560 -- 26:02.320] I mean, language models in general,
[26:02.320 -- 26:03.840] generalize surprisingly well.
[26:03.840 -- 26:05.400] And I would say overall,
[26:05.400 -- 26:07.600] like these pre-trained models that are trained
[26:07.600 -- 26:09.760] on super diverse data sets from the internet,
[26:09.760 -- 26:12.920] they tend to generalize quite well, or surprisingly well,
[26:12.920 -- 26:15.200] at least it's surprising to those of us
[26:15.200 -- 26:19.000] who were around for the earlier days of machine learning
[26:19.000 -- 26:22.800] when everything was trained from scratch and very fragile.
[26:22.800 -- 26:25.640] For example, if you provide an instruction
[26:25.640 -- 26:29.280] in some other language, even a fairly rare language,
[26:29.280 -- 26:32.360] it'll often do a decent job following the instruction,
[26:32.360 -- 26:35.840] even if there's zero data in the whole instruction-following
[26:35.840 -- 26:39.360] training process that's in that language.
[26:39.360 -- 26:41.840] And that's just a carryover from the pre-training.
[26:41.840 -- 26:43.960] So I think generalization,
[26:43.960 -- 26:46.080] yeah, I think language models generalize quite well.
[26:46.080 -- 26:47.880] So you asked about reward models.
[26:47.880 -- 26:50.840] I think one of the tricky pieces about RL
[26:50.840 -- 26:52.400] from human feedback is how,
[26:52.400 -- 26:53.880] so you have this reward model
[26:53.880 -- 26:55.480] and you're actually training against it,
[26:55.480 -- 26:57.880] meaning you're training your policy to have high reward
[26:57.880 -- 27:01.200] and it's going to exploit the errors in the reward model.
[27:01.200 -- 27:04.280] So it's gonna eventually find adversarial examples
[27:04.280 -- 27:05.200] to the reward model.
[27:05.200 -- 27:07.200] This is worse than kind of normal
[27:07.200 -- 27:08.640] out of distribution behavior.
[27:08.640 -- 27:11.480] It's like targeted out of distribution examples.
[27:11.480 -- 27:13.800] So there are definitely some challenges
[27:13.800 -- 27:17.400] around getting reward models to generalize well
[27:17.400 -- 27:20.960] or generalize as far as possible from the training set.
[27:20.960 -- 27:22.760] Can these types of agents tell us
[27:22.760 -- 27:26.240] when they don't know something or is that a hard problem?
[27:26.240 -- 27:28.800] I'd say sort of, if you ask a question
[27:28.800 -- 27:31.480] that's kind of in the core of the model's knowledge,
[27:31.480 -- 27:34.160] it will know the answer and it'll know that it knows.
[27:34.160 -- 27:35.640] By the way, I'm talking about models
[27:35.640 -- 27:37.240] like for the instruct model.
[27:37.240 -- 27:40.360] If you ask it about something that's like very simple
[27:40.360 -- 27:42.160] at the core of its knowledge,
[27:42.160 -- 27:44.160] it'll know if you, there are certain things
[27:44.160 -- 27:45.920] that it knows that it doesn't know,
[27:45.920 -- 27:49.240] like current events where it's been trained
[27:49.240 -- 27:52.840] to know that it doesn't know certain things in real time.
[27:52.840 -- 27:55.000] But if you ask it about something
[27:55.000 -- 27:56.760] that's kind of on the edge of its knowledge,
[27:56.760 -- 27:59.480] it's gonna have a hard time.
[27:59.480 -- 28:01.640] It's necessarily gonna be inaccurate.
[28:01.640 -- 28:03.920] I mean, there have been a couple of papers
[28:03.920 -- 28:04.880] about this question.
[28:04.880 -- 28:08.080] So there was a paper from Anthropic recently
[28:08.080 -- 28:09.360] called Language Models
[28:09.360 -- 28:10.920] (Mostly) Know What They Know.
[28:10.920 -- 28:15.120] And there's also a paper from FHI and OpenAI
[28:15.120 -- 28:17.680] called Getting Language Models
[28:17.680 -- 28:20.080] to Express Their Uncertainty in Words.
[28:20.080 -- 28:22.000] These language models,
[28:22.000 -- 28:24.160] as well as a lot of other models in machine learning
[28:24.160 -- 28:26.560] are trained to maximize likelihood.
[28:26.560 -- 28:28.680] So maximize log-prob of data.
[28:28.680 -- 28:29.920] You're already training them
[28:29.920 -- 28:32.480] to always predict a distribution of outputs.
[28:32.480 -- 28:35.440] So for language models, given a prefix,
[28:35.440 -- 28:38.920] it's predicting a distribution over the next token.
[28:38.920 -- 28:41.760] These predictions for the next token
[28:41.760 -- 28:44.720] generally are pretty well calibrated.
[28:44.720 -- 28:47.680] If it puts 80% probability on something,
[28:47.680 -- 28:49.160] and you look at all the times
[28:49.160 -- 28:51.920] when it puts 80% probability on something,
[28:51.920 -- 28:54.080] it's right 80% of the time.
[28:54.080 -- 28:56.400] That's just a result of the training objective.
[28:56.400 -- 28:59.960] The training objective strongly incentivizes the model
[28:59.960 -- 29:01.400] to be calibrated,
[29:01.400 -- 29:05.320] meaning it has a reasonable estimate of its uncertainty.
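The "right 80% of the time" description corresponds to a standard calibration check: group predictions by stated confidence and compare each group's average confidence with its observed accuracy. The sketch below is a minimal version under assumed inputs; the function name, the bin count, and the toy data are hypothetical and not taken from any of the papers mentioned.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Group predictions by confidence and compare mean confidence with
    the observed accuracy in each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for items in bins:
        if not items:
            continue
        mean_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        table.append((mean_conf, accuracy, len(items)))
    return table


# Toy data: ten predictions made at 0.8 confidence, eight of them correct.
probs = [0.8] * 10
outcomes = [1] * 8 + [0] * 2
table = calibration_table(probs, outcomes)
for mean_conf, accuracy, count in table:
    print(f"confidence={mean_conf:.2f} accuracy={accuracy:.2f} n={count}")
```

A well-calibrated predictor has mean confidence close to accuracy in every bin. For a language model, `probs` would be the probability assigned to the predicted next token and `outcomes` whether that token was correct.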
[29:05.320 -- 29:07.240] So at the single token level,
[29:07.240 -- 29:08.960] models definitely are calibrated.
[29:08.960 -- 29:10.880] The question is whether they're calibrated on,
[29:10.880 -- 29:14.680] whether this calibration extends to settings
[29:14.680 -- 29:18.000] where they are generating multi-token outputs,
[29:18.000 -- 29:20.360] or whether they can like judge the correctness
[29:20.360 -- 29:22.000] of some multi-token statement.
[29:22.000 -- 29:25.000] So I would say since models are calibrated
[29:25.000 -- 29:26.600] at the single token level,
[29:26.600 -- 29:29.640] I think that they definitely have the information
[29:29.640 -- 29:32.840] to be calibrated in these other settings.
[29:32.840 -- 29:35.960] So that's why I think the problem of models
[29:35.960 -- 29:38.640] knowing what they know isn't actually that hard,
[29:38.640 -- 29:42.240] or at least getting a model to express its uncertainty
[29:42.240 -- 29:44.080] pretty much as well as a human does,
[29:44.080 -- 29:46.560] doesn't feel like an insurmountable problem,
[29:46.560 -- 29:48.360] but there are some practical difficulties
[29:48.360 -- 29:50.120] to getting there.
[29:50.120 -- 29:52.720] People use the phrase AI alignment in different ways.
[29:52.720 -- 29:54.440] Can you talk about how you see alignment
[29:54.440 -- 29:57.680] in your work on RL from human feedback?
[29:57.680 -- 29:59.720] I think of alignment mostly as the problem
[29:59.720 -- 30:03.560] of getting the model to try to do the right thing.
[30:03.560 -- 30:05.000] So we can kind of make a distinction
[30:05.000 -- 30:08.240] between what the model is capable of doing.
[30:08.240 -- 30:10.200] Like if you just take a raw language model
[30:10.200 -- 30:13.240] and you ask it a question, like I said before,
[30:13.240 -- 30:14.720] it doesn't know that you actually wanted it
[30:14.720 -- 30:17.120] to give the correct answer as opposed to,
[30:17.120 -- 30:20.160] it might think someone who's not very knowledgeable
[30:20.160 -- 30:21.000] is answering.
[30:21.000 -- 30:22.480] By doing some extra training,
[30:22.480 -- 30:24.800] we can get the model to actually try to do the right thing.
[30:24.800 -- 30:28.680] And so I would say that that's the main goal of alignment.
[30:28.680 -- 30:31.720] So there was an OpenAI blog post recently
[30:31.720 -- 30:34.560] that talked about the sequence in alignment.
[30:34.560 -- 30:38.800] One was training AI systems using human feedback,
[30:38.800 -- 30:42.800] two, training AI systems to assist human evaluation,
[30:42.800 -- 30:46.440] and three, training AI systems to do alignment research.
[30:46.440 -- 30:50.200] So is your current work mostly about this first item
[30:50.200 -- 30:51.800] and when and how do you see us
[30:51.800 -- 30:53.440] getting to these other stages?
[30:53.440 -- 30:56.240] I'm doing some work now on number two,
[30:56.240 -- 30:58.520] training AI systems to assist human feedback.
[30:58.520 -- 31:01.760] I think that sort of becomes increasingly necessary
[31:01.760 -- 31:05.120] as you start trying to get the systems
[31:05.120 -- 31:06.840] to solve harder and harder problems.
[31:06.840 -- 31:09.520] When you have models that are kind of very below human level
[31:09.520 -- 31:12.000] or maybe at human level at a certain task,
[31:12.000 -- 31:15.080] it's pretty straightforward to supervise them.
[31:15.080 -- 31:17.200] But once they're doing things that are very hard
[31:17.200 -- 31:19.480] or doing things that require a lot
[31:19.480 -- 31:21.960] of diverse technical knowledge,
[31:21.960 -- 31:24.480] it becomes pretty hard to provide
[31:24.480 -- 31:26.560] a useful supervision signal.
[31:26.560 -- 31:29.280] So we have to start doing things like one model
[31:29.280 -- 31:31.680] writes an answer to a question
[31:31.680 -- 31:35.320] and then another model provides a critique of that answer,
[31:35.320 -- 31:36.680] points out some flaws,
[31:36.680 -- 31:38.880] and then the human only has to judge
[31:38.880 -- 31:43.120] the first answer after looking at the critique,
[31:43.120 -- 31:45.440] meaning basically the critique helps the human
[31:45.440 -- 31:46.520] assess the answer.
[31:46.520 -- 31:48.840] So I think that kind of idea
[31:48.840 -- 31:51.000] is starting to become pretty relevant.
[31:51.000 -- 31:53.560] Colleagues and I are exploring that kind of idea now.
[31:53.560 -- 31:55.520] As for assisting alignment research,
[31:55.520 -- 31:56.960] there's some other work at OpenAI
[31:56.960 -- 31:58.600] that's starting to explore this.
[31:58.600 -- 32:02.040] It's also, that's sort of the furthest down the road.
[32:02.040 -- 32:05.080] So I saw Stuart Russell was on your PhD committee
[32:05.080 -- 32:07.680] and I really enjoyed his book, Human Compatible.
[32:07.680 -- 32:10.200] I wonder if you share the idea mentioned in the book
[32:10.200 -- 32:11.880] that the standard RL framing
[32:11.880 -- 32:14.760] with this fixed reward signal is problematic
[32:14.760 -- 32:16.360] and that agents, powerful agents,
[32:16.360 -- 32:18.960] should try to do what we want
[32:18.960 -- 32:21.880] and maintain some uncertainty about what it is we want
[32:21.880 -- 32:26.120] and the agents that are too certain will be problematic.
[32:26.120 -- 32:28.320] Do you have any thoughts on that idea?
[32:28.320 -- 32:31.560] Yeah, I totally agree with that idea.
[32:31.560 -- 32:34.120] So I think first it's really hard to write down
[32:34.120 -- 32:37.560] a simple reward function that actually captures
[32:37.560 -- 32:41.080] what we want or what any particular person wants.
[32:41.080 -- 32:43.720] I can say I want a little more of this
[32:43.720 -- 32:44.880] or a little more of that,
[32:44.880 -- 32:47.760] but you wouldn't want to take that to the extreme.
[32:47.760 -- 32:52.600] If we build agents that try to cater to our wishes,
[32:52.600 -- 32:55.200] we should make sure they're,
[32:55.200 -- 32:58.240] like they have a lot of, they have uncertainty
[32:58.240 -- 33:00.080] about what we want or what we value.
[33:00.080 -- 33:03.480] And that'll also cause them to be a little more cautious
[33:03.480 -- 33:07.600] and say, not disturb anything that might be important to us.
[33:07.600 -- 33:10.600] So yeah, I agree with that.
[33:10.600 -- 33:13.360] Like Stuart Russell gave a very good
[33:13.360 -- 33:17.040] like problem definition of what we want AI to do.
[33:17.040 -- 33:18.440] Like we want it to basically,
[33:18.440 -- 33:21.040] we want to jointly like play this game
[33:21.040 -- 33:23.760] where AI is trying to figure out what we want
[33:23.760 -- 33:24.840] and then trying to do that.
[33:24.840 -- 33:27.600] But simultaneously maintaining some uncertainty
[33:27.600 -- 33:28.640] about what we want.
[33:28.640 -- 33:30.560] I would say if you start to look
[33:30.560 -- 33:31.920] at how to get that in practice,
[33:31.920 -- 33:34.400] it actually looks quite a bit like the kind of RL
[33:34.400 -- 33:37.920] from human feedback that we're working on at OpenAI
[33:37.920 -- 33:41.280] and others are working on at other places.
[33:41.280 -- 33:44.720] I think, yeah, I see what we're doing
[33:44.720 -- 33:47.320] as a practical implementation
[33:47.320 -- 33:50.720] of getting towards this behavior that Russell described.
[33:50.720 -- 33:53.160] Do you think of AGI as an abstract goal
[33:53.160 -- 33:55.560] or are we gonna see a model come out one day
[33:55.560 -- 33:58.040] and people are gonna say, oh, that's the first AGI model?
[33:58.040 -- 34:01.640] Like, what does it have to do for people to say that?
[34:01.640 -- 34:04.920] I think people will say that many times
[34:04.920 -- 34:07.200] then realize that it doesn't quite do everything
[34:07.200 -- 34:08.080] that you want.
[34:08.080 -- 34:10.600] I think we're gonna have a lot of like a long series
[34:10.600 -- 34:14.320] of models that are superhuman at most things
[34:14.320 -- 34:16.640] or at a certain class of things,
[34:16.640 -- 34:20.840] but they also have some failure modes and weaknesses.
[34:20.840 -- 34:24.640] Like I expect us to see multiple models
[34:24.640 -- 34:26.600] that are proclaimed as AGI
[34:26.600 -- 34:30.360] and then only after interacting with it a while,
[34:30.360 -- 34:33.880] do you realize it's not quite there.
[34:33.880 -- 34:35.520] What would you say is the relationship
[34:35.520 -- 34:39.760] between AGI and RL and AGI and these large language models?
[34:39.760 -- 34:41.680] How do those concepts fit together?
[34:41.680 -- 34:46.680] I'd say that RL is a useful component of training AGI
[34:47.160 -- 34:49.240] or an almost essential component.
[34:49.240 -- 34:52.440] The thing RL lets you do is it lets you optimize
[34:52.440 -- 34:54.960] any objective for the agents,
[34:54.960 -- 34:59.280] any objective that is a function of the agent's behavior.
[34:59.280 -- 35:03.720] So with pre-training, like what we do for language models,
[35:03.720 -- 35:05.760] you're kind of choosing an objective
[35:05.760 -- 35:09.400] that lets us do something with all the training data
[35:09.400 -- 35:11.720] we have, which is all this internet text.
[35:11.720 -- 35:14.200] So we choose this maximum likelihood objective,
[35:14.200 -- 35:17.000] which is basically the only, or not the only thing,
[35:17.000 -- 35:20.200] but it's like a sensible way to absorb all this knowledge.
[35:20.200 -- 35:24.040] But then if we really want to optimize the agent's behavior
[35:24.040 -- 35:25.440] for a specific objective,
[35:25.440 -- 35:29.040] RL is kind of the only framework that lets you do that.
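The contrast drawn here between the pre-training objective and the RL objective can be written out in standard notation. The symbols are assumptions introduced for illustration: $D$ is the text dataset, $p_\theta$ and $\pi_\theta$ are the same network viewed as a next-token predictor and as a policy, $x$ is a prompt, $y$ a sampled response, and $R$ any scalar function of the behavior.

```latex
% Pre-training: maximum likelihood over a fixed text dataset D
\max_\theta \; \mathbb{E}_{x \sim D}\Big[\sum_{t} \log p_\theta(x_t \mid x_{<t})\Big]

% RL: maximize the expected value of any scalar function R of sampled behavior
J(\theta) = \mathbb{E}_{x \sim D_{\text{prompts}},\; y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]

% Score-function (REINFORCE) gradient: R itself need not be differentiable
\nabla_\theta J(\theta) = \mathbb{E}\big[R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\big]
```

Because the gradient only needs samples and their scores, $R$ can be a learned reward model, a human rating, or any other measurable function of the output.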
[35:29.960 -- 35:32.240] Okay, John, we have a few questions from the audience
[35:32.240 -- 35:33.280] and I'm just going to pick the two
[35:33.280 -- 35:36.240] that have the highest score in terms of Twitter likes.
[35:36.240 -- 35:40.760] So the first is from Eric Jang, VP of AI at Halodi Robotics.
[35:40.760 -- 35:43.360] He asked, RL distributions are non-stationary,
[35:43.360 -- 35:46.080] making it hard to reason about PPO losses
[35:46.080 -- 35:48.520] and how that relates to return or generalization.
[35:48.520 -- 35:51.000] Are there any intermediate plots and visualizations
[35:51.000 -- 35:53.120] you'd like to generate to debug
[35:53.120 -- 35:56.200] or incrementally build up a large scale RL system?
[35:56.200 -- 35:59.760] Yeah, there are definitely some stats that I look at.
[35:59.760 -- 36:02.640] So I will be, I'll talk about this
[36:02.640 -- 36:07.640] in the Nuts and Bolts, like, reboot later this year,
[36:07.760 -- 36:12.760] but I'd say things like looking at the explained variance
[36:12.800 -- 36:15.320] of the value function and looking at the,
[36:15.320 -- 36:18.120] like how many samples are getting clipped in PPO
[36:18.120 -- 36:23.120] and what the KL divergence between the policy before
[36:23.120 -- 36:25.680] and after the update is, yeah, things like that.
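The three statistics mentioned (explained variance of the value function, the fraction of samples clipped in PPO, and the KL divergence between the old and new policy) can be computed from quantities a PPO implementation already has. The sketch below is an assumption about how one might log them, not the speaker's code; the function names, the default `clip_range=0.2`, and the toy numbers are hypothetical.

```python
import math


def explained_variance(value_preds, returns):
    """1 - Var[returns - preds] / Var[returns]. 1.0 means the value function
    predicts returns perfectly; 0.0 is no better than a constant."""
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    var_returns = variance(returns)
    if var_returns == 0:
        return float("nan")
    residuals = [r - v for r, v in zip(returns, value_preds)]
    return 1.0 - variance(residuals) / var_returns


def clip_fraction(old_logps, new_logps, clip_range=0.2):
    """Fraction of samples whose probability ratio new/old falls outside
    [1 - clip_range, 1 + clip_range] (a common logging convention)."""
    ratios = [math.exp(n - o) for o, n in zip(old_logps, new_logps)]
    clipped = [abs(r - 1.0) > clip_range for r in ratios]
    return sum(clipped) / len(clipped)


def approx_kl(old_logps, new_logps):
    """Sample estimate of KL(old || new), using actions drawn from the old policy."""
    return sum(o - n for o, n in zip(old_logps, new_logps)) / len(old_logps)


# Toy batch: log-probs of the taken actions before and after an update.
returns = [1.0, 2.0, 3.0, 4.0]
value_preds = [1.0, 2.0, 3.0, 4.0]
old_logps = [math.log(0.5)] * 4
new_logps = [math.log(0.5), math.log(0.5), math.log(0.7), math.log(0.3)]

ev = explained_variance(value_preds, returns)
cf = clip_fraction(old_logps, new_logps)
kl = approx_kl(old_logps, new_logps)
print(ev, cf, kl)
```

Plotted per update, these give a rough health check: explained variance near or below zero suggests the value function is not learning, and a large clip fraction or KL suggests the policy step is too aggressive.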
[36:25.680 -- 36:30.640] And then Ethan Caballero from Mila asks,
[36:30.640 -- 36:33.760] what is your median estimate for the arrival date of AGI?
[36:33.760 -- 36:37.440] I think not too far away, but like I said,
[36:37.440 -- 36:39.480] I expect there to be a lot of false starts.
[36:39.480 -- 36:44.360] I would say I expect like AI to be able to do better,
[36:44.360 -- 36:46.520] a better job than humans at most jobs
[36:46.520 -- 36:49.040] that humans do now, five years or so.
[36:49.040 -- 36:51.040] That's not all jobs, but most jobs.
[36:51.040 -- 36:52.680] For a while, we're gonna discover things
[36:52.680 -- 36:54.080] that AI is very good at
[36:54.080 -- 36:56.440] and where we wanna keep humans in control.
[36:56.440 -- 36:59.440] So I think there'll be some kind of gradual process
[36:59.440 -- 37:01.240] over the next 10 or 15 years.
[37:01.240 -- 37:02.440] I've been curious about this.
[37:02.440 -- 37:05.160] I see that some RL work is patented,
[37:05.160 -- 37:08.800] but I could not find a TRPO or PPO in,
[37:08.800 -- 37:10.160] I could not find patents on these.
[37:10.160 -- 37:13.760] Are those protected, patent protected at all?
[37:13.760 -- 37:18.320] Or how do you think of intellectual property protection
[37:18.320 -- 37:19.280] for that kind of work?
[37:19.280 -- 37:22.120] I haven't ever looked into patenting anything
[37:22.120 -- 37:25.080] and OpenAI hasn't either as far as I know.
[37:25.080 -- 37:26.960] I think the trend over time has been
[37:26.960 -- 37:29.600] for people to take patents on machine,
[37:29.600 -- 37:31.920] like, machine learning algorithms less seriously.
[37:31.920 -- 37:34.520] There's this algorithm in computer vision called SIFT,
[37:34.520 -- 37:36.960] which is like this keypoint detector.
[37:36.960 -- 37:38.960] And this was patented.
[37:38.960 -- 37:42.080] I think the guy who patented it,
[37:42.080 -- 37:44.680] he probably made his university some money from the patent,
[37:44.680 -- 37:48.160] but in the end, all it did was cause people
[37:48.160 -- 37:52.080] a lot of annoyance because people had to come up
[37:52.080 -- 37:56.280] with alternative algorithms that had a different acronym
[37:56.280 -- 37:58.240] and weren't patented.
[37:58.240 -- 38:02.920] So the OpenCV open source library would have,
[38:02.920 -- 38:05.400] had to be careful about putting this algorithm
[38:05.400 -- 38:07.960] in their library because of the patent risks.
[38:07.960 -- 38:11.960] So I think like these patents aren't,
[38:11.960 -- 38:13.920] patent rights aren't exercised that much.
[38:13.920 -- 38:17.080] And I think big companies like Google will patent
[38:17.080 -- 38:19.280] a lot of stuff for defensive reasons.
[38:19.280 -- 38:22.040] So if they get in some big legal dispute
[38:22.040 -- 38:24.360] with another company, it can be used
[38:24.360 -- 38:26.520] as like one of the bargaining chips.
[38:26.520 -- 38:30.440] But I think, I don't think anyone's gonna like get sued
[38:30.440 -- 38:35.320] for royalties for not providing royalties
[38:35.320 -- 38:36.960] for the use of some algorithm.
[38:36.960 -- 38:40.080] Okay, and then there's been a ton of work in RL, of course,
[38:40.080 -- 38:43.560] since you first published TRPO and PPO.
[38:43.560 -- 38:45.200] But from your point of view,
[38:45.200 -- 38:46.440] if you had to pick a few highlights
[38:46.440 -- 38:50.360] in terms of a few important milestones in RL algorithms
[38:50.360 -- 38:51.600] since PPO came out,
[38:53.120 -- 38:55.080] and by the way, it's amazing that in 2022,
[38:55.080 -- 38:56.400] we're still using PPO,
[38:57.520 -- 39:01.000] I think quite similar to its original form.
[39:01.000 -- 39:01.840] Is that right?
[39:02.920 -- 39:03.920] Yeah, pretty much.
[39:03.920 -- 39:06.880] Yeah, so what would you say are the biggest
[39:06.880 -- 39:09.680] highlights for you in terms of RL algorithms
[39:09.680 -- 39:11.640] since you did PPO?
[39:11.640 -- 39:13.440] Yeah, there's definitely been some interesting stuff.
[39:13.440 -- 39:16.480] So I think like a little after PPO,
[39:16.480 -- 39:19.120] there is TD3 and SAC,
[39:19.120 -- 39:23.000] and those seem like pretty solid value-based methods.
[39:23.000 -- 39:25.320] That was one development that was interesting.
[39:25.320 -- 39:27.840] I think like, yeah, I thought MuZero
[39:27.840 -- 39:32.840] and its elaborations, like EfficientZero,
[39:32.840 -- 39:36.840] were also pretty impressive,
[39:36.840 -- 39:38.960] that you can get that good sample efficiency.
[39:38.960 -- 39:41.600] Both of the things I just mentioned were kind of,
[39:41.600 -- 39:45.000] well, I don't wanna say mostly on toy tasks or benchmarks
[39:45.000 -- 39:48.120] because yeah, I'm sure people are doing some real things
[39:48.120 -- 39:49.440] with these algorithms.
[39:49.440 -- 39:52.040] Yeah, so I think that stuff was interesting.
[39:52.040 -- 39:56.760] I think like the whole recent interest,
[39:56.760 -- 40:00.360] surge of interest in offline RL was also notable.
[40:00.360 -- 40:02.480] I would say the stuff we're doing
[40:02.480 -- 40:06.040] with RL from human feedback is a kind of offline RL
[40:06.040 -- 40:09.000] because we're like, we have a fixed dataset
[40:09.000 -- 40:11.640] and we have a fixed reward modeling dataset
[40:11.640 -- 40:12.880] and we're training against that.
[40:12.880 -- 40:14.720] This is like offline RL,
[40:14.720 -- 40:15.960] but you're doing it in a different way.
[40:15.960 -- 40:19.640] You're using an on-policy algorithm with a reward model
[40:19.640 -- 40:23.280] as opposed to maybe a more typical way to do offline RL
[40:23.280 -- 40:25.040] would be to use an off-policy algorithm.
[40:25.040 -- 40:27.760] Would that work here or would that not work here?
[40:27.760 -- 40:30.160] What we're doing here is kind of like model-based RL
[40:30.160 -- 40:33.280] because the reward model is like a model
[40:33.280 -- 40:35.800] of the unknown part of the system.
[40:35.800 -- 40:38.920] So like the unknown part of the system here
[40:38.920 -- 40:42.760] is the human rater, or yeah, the human.
[40:42.760 -- 40:46.880] It's not the part that appends outputs to your list of tokens.
[40:46.880 -- 40:48.600] So this is kind of like the work
[40:48.600 -- 40:51.840] that takes a dynamics model of the environment
[40:51.840 -- 40:54.240] and just runs
[40:54.240 -- 40:56.600] a policy gradient algorithm against it.
[40:56.600 -- 40:57.440] So it's not like,
[40:57.440 -- 41:00.400] so the idea of running an online algorithm
[41:00.400 -- 41:03.720] against a model, that's kind of a well-established idea.
[41:03.720 -- 41:06.800] Though I would say the papers that previously did this,
[41:06.800 -- 41:08.520] they were in a pretty different regime.
[41:08.520 -- 41:11.200] We're in this regime of doing fairly small updates
[41:11.200 -- 41:14.600] to the policy because we have these awesome pre-trained models
[41:14.600 -- 41:19.000] and we don't need to actually change them that much.
[41:19.000 -- 41:21.520] So yeah, we use these online algorithms.
[41:21.520 -- 41:23.760] I'd say part of the reason why we can get away
[41:23.760 -- 41:28.000] with using just like an online algorithm
[41:28.000 -- 41:30.480] is because we've been just looking
[41:30.480 -- 41:32.480] at a contextual bandit problem.
[41:32.480 -- 41:35.080] Yeah, because we only have like one time step.
[41:35.080 -- 41:37.840] Like you get a query and you output a response
[41:37.840 -- 41:40.160] and then that response gets a reward.
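[Illustrative sketch, not part of the conversation: the one-step setup described here (get a query, output a response, score it with a reward model, update with an on-policy method) can be shown on a toy problem. Everything below is an assumption made for illustration: the table REWARD_MODEL stands in for a learned reward model, the tabular softmax policy stands in for a language model, and the update is plain REINFORCE with a running-average baseline rather than PPO.]

```python
import math
import random

# Hypothetical stand-in for a learned reward model: REWARD_MODEL[query][response].
REWARD_MODEL = [
    [0.1, 0.9, 0.3],  # query 0: response 1 scores highest
    [0.8, 0.2, 0.4],  # query 1: response 0 scores highest
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=20000, lr=0.1, seed=0):
    """On-policy policy-gradient training on a contextual bandit.

    Each step is a complete one-time-step episode: sample a query,
    sample a response from the current policy, score it with the
    reward model, and apply a REINFORCE update.
    """
    rng = random.Random(seed)
    n_queries = len(REWARD_MODEL)
    n_responses = len(REWARD_MODEL[0])
    logits = [[0.0] * n_responses for _ in range(n_queries)]
    baseline = [0.0] * n_queries  # running mean reward per query

    for _ in range(steps):
        q = rng.randrange(n_queries)
        probs = softmax(logits[q])
        a = rng.choices(range(n_responses), weights=probs)[0]
        r = REWARD_MODEL[q][a]

        advantage = r - baseline[q]
        baseline[q] += 0.05 * (r - baseline[q])

        # Gradient of log pi(a | q) w.r.t. logit k is 1[k == a] - pi(k | q).
        for k in range(n_responses):
            grad = (1.0 if k == a else 0.0) - probs[k]
            logits[q][k] += lr * advantage * grad

    return [softmax(row) for row in logits]


if __name__ == "__main__":
    for q, probs in enumerate(train()):
        print(q, [round(p, 3) for p in probs])
```

[Because each episode ends after one response, there is no credit assignment across time steps in this sketch, which is the property the speaker points to.]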
[41:40.160 -- 41:43.120] So if we had like a multi-step process
[41:43.120 -- 41:48.120] such as a conversation where you can't assign a reward
[41:48.320 -- 41:50.280] until the very end of the conversation
[41:50.280 -- 41:54.160] or you had some, I don't know, some interaction
[41:54.160 -- 41:57.800] with like some real world system that's hard to simulate,
[41:57.800 -- 42:00.440] then it wouldn't be as straightforward;
[42:00.440 -- 42:03.760] you wouldn't be able to use exactly the same methodology.
[42:03.760 -- 42:05.680] You would probably have to use a,
[42:05.680 -- 42:08.360] you would have to probably train a Q function
[42:08.360 -- 42:10.600] or something like that.
[42:10.600 -- 42:13.080] If you want your method to be sample efficient,
[42:13.080 -- 42:15.640] you would probably have to do something slightly different.
[42:15.640 -- 42:19.120] I think we'll have to start exploring this
[42:19.120 -- 42:22.560] at some point soon, but so far we haven't,
[42:22.560 -- 42:27.480] at least I haven't seen any cases in like in the domain
[42:27.480 -- 42:29.680] I'm looking at that require this,
[42:29.680 -- 42:33.480] but I expect it to be relevant at some point.
[42:33.480 -- 42:37.080] So we had Aravind Srinivas talking about Decision Transformer
[42:37.080 -- 42:39.360] on the show recently, that was a great episode.
[42:39.360 -- 42:41.360] And I see that you were also a co-author
[42:41.360 -- 42:43.920] on the 2016 RL squared paper.
[42:43.920 -- 42:46.680] I want to ask you your thoughts about meta-RL.
[42:46.680 -- 42:48.560] Aravind had some interesting things to say
[42:48.560 -- 42:50.640] about maybe the idea that a transformer
[42:50.640 -- 42:52.320] could kind of supersede the need
[42:52.320 -- 42:54.200] for an RL algorithm altogether.
[42:54.200 -- 42:56.200] What do you expect from meta RL?
[42:56.200 -- 42:58.600] Do you expect we'll still be using human-authored
[42:58.600 -- 43:00.600] RL algorithms in the future?
[43:00.600 -- 43:03.000] Yeah, that's a pretty bold statement that we don't need,
[43:03.000 -- 43:05.400] we won't need any RL algorithms anymore.
[43:05.400 -- 43:07.640] Yeah, since the RL squared paper,
[43:07.640 -- 43:10.920] people have been talking less about meta learning,
[43:10.920 -- 43:12.400] as far as I can tell,
[43:12.400 -- 43:15.760] actually because sequence modeling has gotten so good,
[43:15.760 -- 43:19.680] like transformer sequence models, that it's kind of clear
[43:19.680 -- 43:21.920] that meta learning is just a special case of learning.
[43:21.920 -- 43:26.560] Like it's just like a certain kind of long context learning,
[43:26.560 -- 43:28.720] learning involving long episodes.
[43:28.720 -- 43:31.120] And maybe it shouldn't be treated that differently
[43:31.120 -- 43:33.600] or addressed with special algorithms.
[43:33.600 -- 43:36.760] I would say, yeah, ideas like Decision Transformer
[43:36.760 -- 43:37.880] are pretty interesting,
[43:37.880 -- 43:40.520] where you try to reduce RL to supervised learning.
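[Illustrative sketch, not part of the conversation: one way to see the reduction of RL to supervised learning is the data preparation used by Decision Transformer-style methods, where each logged step is relabeled with its return-to-go and the action becomes a prediction target. The helper names and the toy trajectory below are hypothetical, and no model is trained here.]

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each step to the end of the episode."""
    out = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    out.reverse()
    return out


def to_supervised_examples(states, actions, rewards):
    """Turn one logged trajectory into (input, target) pairs.

    Input: (return-to-go, state). Target: the action that was taken.
    A sequence model fit to these pairs can later be prompted with a
    high desired return to request high-reward behavior.
    """
    rtg = returns_to_go(rewards)
    return [((g, s), a) for g, s, a in zip(rtg, states, actions)]


if __name__ == "__main__":
    # Toy logged trajectory (made-up values).
    examples = to_supervised_examples(
        states=["s0", "s1", "s2"],
        actions=["left", "right", "left"],
        rewards=[1.0, 0.0, 2.0],
    )
    for inputs, target in examples:
        print(inputs, "->", target)
```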
[43:40.520 -- 43:43.800] It's still not certain exactly how these compare
[43:43.800 -- 43:47.320] in performance to RL, like people have started to analyze
[43:47.320 -- 43:49.280] that empirically and theoretically.
[43:49.280 -- 43:53.320] And I would say in practice, sometimes it's better,
[43:53.320 -- 43:55.240] sometimes it's worse.
[43:55.240 -- 43:57.960] In my experience, like it's been worse on the problems
[43:57.960 -- 44:01.920] that my colleagues and I have, where we've tested it.
[44:01.920 -- 44:05.480] But yeah, it's definitely an interesting direction.
[44:05.480 -- 44:08.360] Dr. John Schulman, thank you so much for sharing your time
[44:08.360 -- 44:10.360] and your insight with the TalkRL audience today.
[44:10.360 -- 44:11.480] Thanks so much.