[00:00.000 -- 00:01.960] the answer was affirmative. [00:01.960 -- 00:05.680] We can get an agent to basically use a set of tools [00:05.680 -- 00:06.520] that we give it. [00:06.520 -- 00:09.440] In this case, the browsing commands, like searching. [00:09.440 -- 00:13.880] I would say I expect AI to be able to do a better job [00:13.880 -- 00:16.860] than humans at most jobs that humans do now, [00:16.860 -- 00:17.980] five years or so. [00:19.660 -- 00:20.500] Talk RL. [00:22.660 -- 00:26.700] Talk RL podcast is all reinforcement learning all the time, [00:26.700 -- 00:29.900] featuring brilliant guests, both research and applied. [00:29.900 -- 00:33.520] Join the conversation on Twitter at Talk RL podcast. [00:33.520 -- 00:35.160] I'm your host, Robin Chauhan. [00:39.500 -- 00:41.900] John Schulman is a co-founder of OpenAI [00:41.900 -- 00:44.480] and a researcher and engineer at OpenAI. [00:44.480 -- 00:46.360] He is well known for major contributions [00:46.360 -- 00:48.440] to the field of reinforcement learning, [00:48.440 -- 00:50.760] including the TRPO algorithm, [00:50.760 -- 00:52.920] that's Trust Region Policy Optimization, [00:52.920 -- 00:56.000] GAE, Generalized Advantage Estimation. [00:56.000 -- 00:58.120] Those are from his UC Berkeley dissertation. [00:58.120 -- 01:02.080] And TRPO's descendant, Proximal Policy Optimization, or PPO. [01:02.080 -- 01:06.040] His current focus at OpenAI is on RL from human feedback. [01:06.040 -- 01:08.320] John, welcome to the show and thanks so much for being here. [01:08.320 -- 01:09.360] Thanks a lot for having me. [01:09.360 -- 01:11.380] You were literally one of the first people I thought of [01:11.380 -- 01:13.840] when I started the show three years back. [01:13.840 -- 01:14.880] Thanks, I'm honored. [01:14.880 -- 01:17.320] It means a lot to me to have you here today.
[01:17.320 -- 01:20.920] I definitely remember your nuts and bolts of deep RL video [01:20.920 -- 01:23.240] back in the day and watching that multiple times [01:23.240 -- 01:24.360] and gaining a lot from that. [01:24.360 -- 01:26.200] So I think you helped probably a generation [01:26.200 -- 01:28.640] of RL practitioners back then. [01:28.640 -- 01:31.280] By the way, there's going to be a reboot [01:31.280 -- 01:33.360] of the nuts and bolts presentation. [01:33.360 -- 01:37.320] I got invited to give a talk at NeurIPS this year on it. [01:37.320 -- 01:41.200] So I'll have to revamp the guidelines and everything. [01:41.200 -- 01:42.120] So that'll be fun. [01:42.120 -- 01:42.960] Oh, that's awesome. [01:42.960 -- 01:43.780] Can't wait for that. [01:43.780 -- 01:47.240] So you were clearly one of the earlier pioneers in deep RL. [01:47.240 -- 01:49.640] So how did you choose to move your focus to RL [01:49.640 -- 01:50.800] from human feedback? [01:50.800 -- 01:52.560] And why is that an important problem? [01:52.560 -- 01:53.740] Why is that important to you? [01:53.740 -- 01:57.560] After GPT-3 was trained, I was blown away by how smart it was. [01:57.560 -- 02:00.040] And I realized the next frontier was figuring out [02:00.040 -- 02:02.000] how to make language models actually useful. [02:02.000 -- 02:03.800] I'm still really interested in RL, [02:03.800 -- 02:07.400] but solving RL benchmarks isn't the end of the story. [02:07.400 -- 02:10.360] To use your RL algorithm, you need a reward function. [02:10.360 -- 02:12.680] But where does the reward function come from? [02:12.680 -- 02:15.160] In RL benchmarks, you usually just code up [02:15.160 -- 02:16.020] the reward function. [02:16.020 -- 02:18.320] But if you're not in a simulator environment, [02:18.320 -- 02:19.160] that doesn't work.
[02:19.160 -- 02:23.280] So what we have to do in any kind of real world use case [02:23.280 -- 02:25.160] is have humans look at what the AI did [02:25.160 -- 02:26.680] and decide if it was good or bad. [02:26.680 -- 02:29.200] So how exactly you define this reward [02:29.200 -- 02:31.800] becomes a really challenging and important problem, [02:31.800 -- 02:34.160] especially as the tasks get harder to evaluate. [02:34.160 -- 02:37.240] Another angle on this is that language models are very smart, [02:37.240 -- 02:40.400] but it's hard to get them to do anything useful. [02:40.400 -- 02:43.200] A big part of that is they're not necessarily [02:43.200 -- 02:44.240] trying to do what you want. [02:44.240 -- 02:46.400] They're just trying to imitate the training corpus. [02:46.400 -- 02:48.440] So that means there's a big opportunity [02:48.440 -- 02:50.640] to improve them a lot by just giving them [02:50.640 -- 02:51.600] the right objective. [02:51.600 -- 02:55.280] That's what we can do by applying RL to these language [02:55.280 -- 02:58.560] models using human feedback to define the reward. [02:58.560 -- 03:02.560] Is using human feedback harder or very different in some way [03:02.560 -- 03:04.360] than using a synthetic reward? [03:04.360 -- 03:06.600] There are a lot of new complications. [03:06.600 -- 03:09.800] Now you have to collect a data set dynamically. [03:09.800 -- 03:12.160] So you're always in the business of building data [03:12.160 -- 03:14.720] sets of human preferences. [03:14.720 -- 03:17.160] Often the data quality there matters more [03:17.160 -- 03:19.320] than various algorithmic details. [03:19.320 -- 03:22.440] And you also have to think a lot about exactly how you're [03:22.440 -- 03:24.360] giving the task to the human trainers [03:24.360 -- 03:25.680] and various other things that you [03:25.680 -- 03:27.360] wouldn't have thought about if you just [03:27.360 -- 03:29.040] had a programmatic reward function. 
[03:29.040 -- 03:31.080] Does the difference between human raters [03:31.080 -- 03:34.200] or the noisiness of the reward signal cause any problems? [03:34.200 -- 03:36.640] I would say the noise, definitely [03:36.640 -- 03:40.320] you need to be below some threshold of noise [03:40.320 -- 03:41.360] to learn anything. [03:41.360 -- 03:44.160] I think, in general, if you have a large noisy data [03:44.160 -- 03:47.640] set that can be as good as a smaller, clean data set. [03:47.640 -- 03:50.640] So actually, noise isn't the thing that worries me the most. [03:50.640 -- 03:53.600] It's more that there are sometimes consistent biases [03:53.600 -- 03:54.680] that people have. [03:54.680 -- 03:58.920] For example, in settings like question answering or settings [03:58.920 -- 04:02.000] where you have a model writing some text, [04:02.000 -- 04:04.160] often people prefer longer answers. [04:04.160 -- 04:06.680] You end up with these very verbose answers. [04:06.680 -- 04:08.880] If you're not careful with the instructions, that is. [04:08.880 -- 04:12.000] I mean, you can also instruct people, the raters, [04:12.000 -- 04:14.440] to reward brevity. [04:14.440 -- 04:17.200] But if you're not careful, you can [04:17.200 -- 04:19.360] incentivize the wrong kinds of behaviors. [04:19.360 -- 04:21.480] So let's move to some of your recent work. [04:21.480 -- 04:24.640] First up is WebGPT, browser assisted question [04:24.640 -- 04:26.200] answering with human feedback. [04:26.200 -- 04:30.000] That's Nakano et al with yourself as a co-author in 2021. [04:30.000 -- 04:32.880] Can you tell us what is the main idea of this paper? [04:32.880 -- 04:33.880] What is WebGPT? [04:33.880 -- 04:37.720] In WebGPT, we basically took our language models [04:37.720 -- 04:40.040] and we hooked them up to a web browser [04:40.040 -- 04:42.520] so they could retrieve information from the web. 
[04:42.520 -- 04:44.480] And they can write an answer by summarizing [04:44.480 -- 04:45.960] the relevant pages from the web. [04:45.960 -- 04:48.760] So that way if you're asking a question about current events [04:48.760 -- 04:51.520] or a question that requires some detailed scientific [04:51.520 -- 04:53.840] or technical knowledge, this AI can go out [04:53.840 -- 04:56.680] and look up the answer and with detailed citations [04:56.680 -- 04:57.560] to its sources. [04:57.560 -- 05:00.320] So I would say there's kind of two interesting points [05:00.320 -- 05:01.160] to this. [05:01.160 -- 05:03.600] One is we were exploring whether you could turn language [05:03.600 -- 05:05.360] models into a kind of agent. [05:05.360 -- 05:07.840] There's a lot of data on the web of different texts [05:07.840 -- 05:09.920] that people have written, but there's not a lot of data [05:09.920 -- 05:13.360] that shows how to actually do some multi-step process. [05:13.360 -- 05:15.400] So it's not that clear a priori [05:15.400 -- 05:16.880] whether you can get a language model [05:16.880 -- 05:19.600] to actually carry out some iterative process. [05:19.600 -- 05:22.480] We just have a lot of data like writing essays [05:22.480 -- 05:23.960] and having chats and so forth. [05:23.960 -- 05:25.840] So that was one thing we were exploring here. [05:25.840 -- 05:28.120] And I think the answer was affirmative. [05:28.120 -- 05:32.280] We can get an agent to basically use a set of tools [05:32.280 -- 05:34.880] that we give it, in this case, the browsing commands [05:34.880 -- 05:37.480] like searching, scrolling, clicking on links. [05:37.480 -- 05:40.560] The second theme of this paper was around truthfulness. [05:40.560 -- 05:44.120] I mean, a big issue with language models is, [05:44.120 -- 05:45.600] I mean, they're not very reliable [05:45.600 -- 05:47.080] at giving you true information. 
[05:47.080 -- 05:49.680] They know a vastly superhuman amount, [05:49.680 -- 05:51.640] but if you prompt them in the wrong way, [05:51.640 -- 05:54.520] they'll just output lots of plausible sounding nonsense. [05:54.520 -- 05:57.680] So how to fix that is a big research question [05:57.680 -- 05:59.800] or one of the biggest research questions [05:59.800 -- 06:01.640] in the world of language models. [06:01.640 -- 06:03.480] I think it's gonna be challenging to fully fix it, [06:03.480 -- 06:06.960] but I think a big part of the story involves retrieval [06:06.960 -- 06:10.520] and having models write answers that contain citations, [06:10.520 -- 06:12.600] citations to trusted sources. [06:12.600 -- 06:14.440] So a person who's checking over the answer [06:14.440 -- 06:16.160] doesn't have to go and try to figure out [06:16.160 -- 06:18.200] where the model might've gotten this idea. [06:18.200 -- 06:20.520] They can go and directly look at the source [06:20.520 -- 06:23.280] and see if it supports the AI's statement. [06:23.280 -- 06:25.960] With WebGPT, we just wanted to see [06:25.960 -- 06:28.520] if we do give the language model [06:28.520 -- 06:30.400] a really flexible interface to the web, [06:30.400 -- 06:33.240] can we have it answer hard questions truthfully [06:34.440 -- 06:36.280] with the help of all these citations? [06:36.280 -- 06:38.360] And it's actually really non-trivial [06:38.360 -- 06:41.040] because if you look at the dataset we use, [06:41.040 -- 06:43.280] the Reddit Explain Like I'm Five (ELI5). [06:43.280 -- 06:44.680] The questions are really varied, [06:44.680 -- 06:46.840] like some of them are about science, history, [06:46.840 -- 06:49.560] current events, like our raters didn't necessarily [06:49.560 -- 06:51.520] know anything about these topics, [06:51.520 -- 06:55.760] but still they had to judge the detailed answers.
[06:55.760 -- 06:57.640] So it would have been really hard to do it [06:57.640 -- 06:59.960] without the supporting citations. [06:59.960 -- 07:04.000] So we kind of validated that we could get good feedback [07:04.000 -- 07:07.440] in a hard domain like this with the help of citations. [07:07.440 -- 07:10.680] Can you talk about where the idea for WebGPT came from? [07:10.680 -- 07:13.000] Is that an idea you've had kicking around for a while [07:13.000 -- 07:15.800] or was it something that came up recently before the paper? [07:15.800 -- 07:17.760] How did that play out? [07:17.760 -- 07:19.800] Some of the ideas had been floating around, [07:19.800 -- 07:22.400] like we thought that we actually had a project [07:22.400 -- 07:26.160] at OpenAI very early on called World of Bits. [07:26.160 -- 07:28.520] We were looking at controlling web browsers [07:28.520 -- 07:31.120] or doing tasks that involved tasks on the internet [07:31.120 -- 07:32.360] with the web browser, [07:32.360 -- 07:34.520] but it was way too early at the time. [07:34.520 -- 07:38.120] So we kind of abandoned it for a few years. [07:38.120 -- 07:40.240] Actually we were trying to, back then we were trying to do it [07:40.240 -- 07:41.480] with full visual input. [07:41.480 -- 07:45.040] So we thought, yeah, we could give some instructions [07:45.040 -- 07:48.880] to the agent, like go and figure out the address [07:48.880 -- 07:51.000] of this building or something. [07:51.000 -- 07:54.000] The agent would go and search the web [07:54.000 -- 07:57.000] or use Google Maps or whatever to figure out the answer. [07:57.000 -- 07:58.760] And we were trying to do this all in pixels. [07:58.760 -- 08:00.640] That obviously didn't work very well, [08:00.640 -- 08:03.640] but now we have these great language models [08:03.640 -- 08:05.680] that work on text data. [08:05.680 -- 08:08.960] We can also extract the text out of web pages [08:08.960 -- 08:12.000] to get most of the information.
[08:12.000 -- 08:15.280] We can't really interact with a lot of dynamic websites. [08:15.280 -- 08:16.960] Yeah, where there's a lot of JavaScript [08:16.960 -- 08:18.000] and images and so forth, [08:18.000 -- 08:19.960] but as long as it's just browsing [08:19.960 -- 08:21.760] and reading text, we're fine. [08:21.760 -- 08:24.320] So yeah, we had good enough models [08:24.320 -- 08:27.880] and that made it kind of feasible to revisit this idea [08:27.880 -- 08:30.960] of using the internet as an environment. [08:30.960 -- 08:33.640] So I would say that was one of the sources [08:33.640 -- 08:36.760] of inspiration, that long kind of thread [08:36.760 -- 08:39.320] about like using the internet as an environment. [08:39.320 -- 08:44.320] Another motivation was just after we started playing [08:44.680 -- 08:47.920] with GPT-3, we noticed that it had all these problems [08:47.920 -- 08:51.400] with factual accuracy and the reliability [08:51.400 -- 08:52.920] of the information it was giving us. [08:52.920 -- 08:56.280] So that kind of motivated doing more research [08:56.280 -- 08:58.960] on how to make language models more truthful. [08:58.960 -- 09:01.040] We were kind of brainstorming what to do there [09:01.040 -- 09:05.480] and we went through some docs and eventually decided [09:05.480 -- 09:07.760] that we wanted to try some question answering [09:07.760 -- 09:09.800] like using the web, looking up knowledge [09:09.800 -- 09:11.560] on the web to help answer questions. [09:11.560 -- 09:12.880] So actually the original version [09:12.880 -- 09:15.000] of the project used trivia questions. [09:15.000 -- 09:18.400] So there's this well-known dataset, TriviaQA, [09:18.400 -- 09:20.080] that has some basic trivia questions. [09:20.080 -- 09:23.600] So we first worked a little bit on that dataset [09:23.600 -- 09:26.960] and tried to see if we could boost the model's accuracy [09:26.960 -- 09:29.840] by giving it web search.
[09:29.840 -- 09:33.040] And yeah, that actually worked quite straight. [09:33.040 -- 09:34.160] That worked pretty easily. [09:34.160 -- 09:36.120] So then we decided to move on [09:36.120 -- 09:38.080] to long form question answering. [09:38.080 -- 09:41.880] And so that gave us the, that was the project [09:41.880 -- 09:43.880] we ended up working on for a while. [09:43.880 -- 09:47.080] Seems like you used a few different datasets here [09:47.080 -- 09:49.800] and a number of different training methods. [09:50.760 -- 09:52.600] I'll just mention the list: behavior cloning, [09:52.600 -- 09:55.080] reward modeling, reinforcement learning [09:55.080 -- 09:56.800] and rejection sampling. [09:56.800 -- 10:00.520] So we were using a fairly standard methodology [10:00.520 -- 10:03.240] which was actually adapted from previous work [10:03.240 -- 10:05.600] on RL from human preferences. [10:05.600 -- 10:09.120] So the pipeline is you first train a model [10:09.120 -- 10:13.320] with supervised learning where you have human demonstrators [10:13.320 -- 10:15.560] show how to do the task, like show how to map [10:15.560 -- 10:17.160] from observations to actions. [10:17.160 -- 10:19.280] Yeah, so that's the supervised learning [10:19.280 -- 10:20.440] or behavior cloning step. [10:20.440 -- 10:24.400] Then we train a reward model or a preference model. [10:24.400 -- 10:28.320] It looks at two actions or two trajectories [10:28.320 -- 10:29.720] and decides which one is better. [10:29.720 -- 10:32.640] In this case, like in a question answering setting [10:32.640 -- 10:33.880] you're looking at two answers [10:33.880 -- 10:35.480] and deciding which answer is better. [10:35.480 -- 10:37.440] And we use that to train a reward model [10:37.440 -- 10:39.640] that assigns a higher score to the good answers [10:39.640 -- 10:40.480] than the bad ones. [10:40.480 -- 10:41.840] Then you do reinforcement learning [10:41.840 -- 10:43.160] against that reward function.
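The reward-model step described here can be sketched as a pairwise loss. This is a minimal illustration only, assuming a Bradley-Terry style objective, which is a common choice for comparison data; the conversation does not say which loss was actually used, and the function names `preference_loss` and `batch_loss` are hypothetical.

```python
import math


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred answer wins under a
    Bradley-Terry model: -log(sigmoid(score_preferred - score_rejected)).

    Assumption: the reward model has already produced one scalar score
    per answer; this function only turns a labeled pair into a loss.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def batch_loss(score_pairs):
    """Average loss over (preferred_score, rejected_score) pairs."""
    return sum(preference_loss(p, r) for p, r in score_pairs) / len(score_pairs)


# The loss is small when the human-preferred answer already scores higher,
# and large when the reward model ranks the pair the wrong way round.
agrees = preference_loss(2.0, 0.0)
disagrees = preference_loss(0.0, 2.0)
print(agrees < disagrees)
```

Minimizing this loss over many human comparisons pushes the score of the preferred answer above the score of the rejected one, which is the "assigns a higher score to the good answers" behavior described above.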
[10:43.160 -- 10:45.560] And of course you can iterate these last two steps [10:45.560 -- 10:46.960] after you do a little RL. [10:46.960 -- 10:49.520] Now you've sort of exploited some of the flaws [10:49.520 -- 10:52.080] of the reward model, like, or some of the noise [10:52.080 -- 10:53.200] in the reward model. [10:53.200 -- 10:55.120] And it's not necessarily accurate [10:55.120 -- 10:56.760] on your new distribution of data. [10:56.760 -- 10:59.040] You recollect more pairs of samples [10:59.040 -- 11:01.680] and refit this preference model. [11:01.680 -- 11:04.000] And then you do another iteration of RL. [11:04.000 -- 11:06.160] So that's like, that's the whole RL [11:06.160 -- 11:07.600] from human feedback pipeline. [11:07.600 -- 11:11.080] And there's this other idea called rejection sampling [11:11.080 -- 11:12.400] or best-of-n sampling. [11:12.400 -- 11:14.840] And in general, you can do other kinds of search too. [11:14.840 -- 11:18.680] Where instead of doing RL once you have your reward model [11:18.680 -- 11:21.040] you can just search against that reward model. [11:21.040 -- 11:23.440] So you can take a bunch of, collect a bunch of samples [11:23.440 -- 11:25.960] and re-rank them with the reward model [11:25.960 -- 11:28.960] and take the best one as your action. [11:28.960 -- 11:30.520] Kind of like MPC? [11:30.520 -- 11:31.360] Yeah, exactly. [11:31.360 -- 11:33.440] Yeah, it kind of depends exactly [11:33.440 -- 11:35.640] what setting you're in, what you can do. [11:35.640 -- 11:38.400] If you're in a setting where there's some environment [11:38.400 -- 11:41.040] you're interacting with, then you would have to simulate [11:41.040 -- 11:44.160] your, you'd have to simulate the dynamics [11:44.160 -- 11:45.920] of your environment, which yeah. [11:45.920 -- 11:47.920] So that would look kind of like MPC.
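The rejection-sampling idea mentioned here (sample several candidates, re-rank with the reward model, keep the best) can be sketched as follows. This is a toy illustration under stated assumptions: `best_of_n`, `toy_sampler` and `toy_reward` are hypothetical names, the sampler stands in for a language model, and the scoring function stands in for a learned reward model.

```python
from typing import Callable, List


def best_of_n(question: str,
              sample_answer: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 4) -> str:
    """Draw n candidate answers and return the one the reward model scores highest."""
    candidates: List[str] = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_model(question, answer))


# Toy stand-ins (hypothetical): a sampler that hands out canned answers in
# order, and a "reward model" that prefers answers carrying a citation marker.
_canned = iter([
    "Short guess.",
    "An answer with a source [1].",
    "Another guess.",
    "Rambling text.",
])


def toy_sampler(question: str) -> str:
    return next(_canned)


def toy_reward(question: str, answer: str) -> float:
    return 1.0 if "[1]" in answer else 0.0


best = best_of_n("Why is the sky blue?", toy_sampler, toy_reward, n=4)
print(best)
```

No policy update happens in this procedure; all of the improvement comes from spending more samples at inference time and trusting the reward model's ranking.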
[11:47.920 -- 11:51.360] In our case, we were, the only thing we had to learn [11:51.360 -- 11:55.080] a model of was the human preference. [11:55.080 -- 11:57.480] So like we're, it's a question answering setting. [11:57.480 -- 11:59.760] So it's really like a contextual bandit problem. [11:59.760 -- 12:02.520] So it's kind of straightforward to take a bunch of, [12:02.520 -- 12:04.320] sample a bunch of actions where each action [12:04.320 -- 12:06.880] is a full answer and re-rank them [12:06.880 -- 12:11.640] or search over answers. [12:11.640 -- 12:13.760] So in terms of the action space, [12:13.760 -- 12:16.040] was the action space just the list of commands [12:16.040 -- 12:17.800] or is it still generating tokens [12:17.800 -- 12:20.440] like a regular generative model? [12:20.440 -- 12:21.800] We were generating tokens. [12:21.800 -- 12:26.800] We had two phases in each episode of the RL task. [12:26.800 -- 12:31.280] So there was first a browsing phase where the model goes [12:31.280 -- 12:33.960] and it issues searches and clicks on things [12:33.960 -- 12:36.560] and quotes relevant information. [12:36.560 -- 12:38.400] Like if it sees something useful on the page, [12:38.400 -- 12:40.920] it'll quote it using this quote command. [12:40.920 -- 12:44.560] And then once it's done browsing, [12:44.560 -- 12:48.480] it'll issue another command called end browsing [12:48.480 -- 12:49.920] and it'll write its answer. [12:49.920 -- 12:52.120] That's also expressed in tokens. [12:52.120 -- 12:55.400] But really we rolled this all into one big RL task [12:55.400 -- 12:57.440] where your episode involves browsing [12:57.440 -- 12:58.640] and writing out the answer [12:58.640 -- 13:01.480] and it's all one big RL episode. [13:01.480 -- 13:02.840] Did you think this is gonna work well [13:02.840 -- 13:04.440] or were you kind of surprised?
[13:04.440 -- 13:06.360] At the very beginning of the project, [13:06.360 -- 13:09.000] we didn't know if it was gonna work or not. [13:09.000 -- 13:10.920] Like after we did the initial experiments [13:10.920 -- 13:12.560] with TriviaQA, [13:12.560 -- 13:15.560] which actually didn't take that long to get running, [13:15.560 -- 13:19.120] then it became pretty clear that it would work, [13:19.120 -- 13:20.640] that the browsing part worked at least. [13:20.640 -- 13:22.880] And we already know that we can get these models [13:22.880 -- 13:26.760] to write pretty good long form text with a bunch of, [13:26.760 -- 13:28.520] if you give them a bunch of snippets [13:28.520 -- 13:31.080] of text that they can cite. [13:31.080 -- 13:35.400] So I noticed the human raters' task was quite complicated. [13:35.400 -- 13:38.200] It was a long guide and there were many types of feedback [13:38.200 -- 13:39.040] that they were giving. [13:39.040 -- 13:40.440] But in the end, the paper said [13:40.440 -- 13:42.720] that only the final rating was used. [13:42.720 -- 13:44.640] So I was just curious if you had any comment about that. [13:44.640 -- 13:46.040] Like why do you think maybe the model [13:46.040 -- 13:47.440] couldn't use that extra feedback [13:47.440 -- 13:50.840] or is this maybe just too much or not enough samples? [13:50.840 -- 13:55.200] Yeah, that's been one frustrating finding so far. [13:55.200 -- 13:58.480] In that project and also some other projects, [13:58.480 -- 14:01.480] we've had the same finding that you have your raters [14:01.480 -- 14:05.760] go through this long process for each comparison they do [14:05.760 -- 14:08.240] where they're comparing a pair of answers. [14:08.240 -- 14:10.440] And then you only use one bit of information [14:10.440 -- 14:13.080] from this whole process, [14:13.080 -- 14:14.720] which might've taken like half an hour.
[14:14.720 -- 14:15.840] It seems like it would be better [14:15.840 -- 14:19.320] if we were able to extract more information, [14:19.320 -- 14:21.680] more about the process they went through [14:21.680 -- 14:22.920] in arriving at the answer. [14:22.920 -- 14:25.040] So we did collect all sorts of other information [14:25.040 -- 14:27.160] like we had them provide ratings [14:27.160 -- 14:28.600] along several different axes [14:28.600 -- 14:32.760] like coherence and factual accuracy and so forth. [14:32.760 -- 14:35.960] But in the end, we didn't really get much of a boost [14:35.960 -- 14:39.160] out of using any of this other information. [14:39.160 -- 14:44.160] So I'd say it seems like it should be possible to do better. [14:44.800 -- 14:46.520] But unfortunately this methodology, [14:46.520 -- 14:49.840] which seems kind of dumb so far is hard to beat. [14:49.840 -- 14:52.760] And people have tried various other ideas [14:52.760 -- 14:55.120] for like how to use human feedback [14:55.120 -- 14:57.080] instead of you getting these preference scores, [14:57.080 -- 14:58.400] there are various other things you can do. [14:58.400 -- 15:00.840] Like you can have them write critiques and edit [15:00.840 -- 15:03.200] or maybe edit the responses. [15:03.200 -- 15:07.080] Yeah, I think some of these things are also promising. [15:07.080 -- 15:09.440] But yeah, this methodology [15:09.440 -- 15:12.080] of collecting preference data works well. [15:12.080 -- 15:15.160] Yeah, I think it's still an open area of research. [15:15.160 -- 15:18.280] Oh yeah, regarding the really long instructions. [15:18.280 -- 15:20.000] Yeah, I think for any of these tasks, [15:20.000 -- 15:24.000] there is a lot of subtlety in how to do the task properly. [15:24.000 -- 15:27.800] And so we ended up adding more and more details [15:27.800 -- 15:29.640] of like what do you do in this situation? [15:29.640 -- 15:30.960] What do you do in that situation? 
[15:30.960 -- 15:33.320] I think it's starting to get pretty unwieldy [15:33.320 -- 15:35.760] with these really long instruction manuals. [15:35.760 -- 15:39.920] So there's some promising ideas for how to address this. [15:39.920 -- 15:42.840] Like there's a paper from DeepMind recently, [15:42.840 -- 15:45.920] Sparrow that used basically broke down the task [15:45.920 -- 15:48.520] and they trained, they basically had people look [15:48.520 -- 15:52.400] at one aspect of the response at a time. [15:52.400 -- 15:54.640] And then they had a way of combining [15:54.640 -- 15:56.480] these different rule specific, [15:56.480 -- 15:58.680] they would train a bunch of rule specific reward models [15:58.680 -- 16:00.440] and then combine them at the end. [16:00.440 -- 16:02.520] Yeah, I think there's some other interesting ideas [16:02.520 -- 16:05.320] for how to make this process better. [16:05.320 -- 16:08.480] So I gather that from your answer about WebGPT [16:08.480 -- 16:10.720] and the whole idea of WebGPT is that you want [16:10.720 -- 16:14.400] the language model to have access to external knowledge. [16:14.400 -- 16:17.560] But I wonder where you think the line should really be [16:17.560 -- 16:19.680] in terms of what a language model should know [16:19.680 -- 16:21.920] and what the language model should look up [16:21.920 -- 16:24.240] and maybe what the language model should not know [16:24.240 -- 16:25.600] or not purport to know. [16:25.600 -- 16:27.120] Do you have opinions about that? [16:27.120 -- 16:28.560] Yeah, let's see. [16:28.560 -- 16:30.200] Like some people are advocating [16:30.200 -- 16:32.480] for very small language models that have [16:32.480 -- 16:35.480] like no external knowledge aside from language, [16:35.480 -- 16:37.000] I guess would be the extreme position. 
[16:37.000 -- 16:39.680] And then other people have talked about language models [16:39.680 -- 16:41.000] that just know everything [16:41.000 -- 16:43.440] as opposed to having an external knowledge source. [16:43.440 -- 16:45.000] There's some interesting questions there. [16:45.000 -- 16:48.440] So I think it is a little hard to separate knowledge, [16:48.440 -- 16:51.160] factual knowledge from understanding. [16:51.160 -- 16:55.120] So as humans, we get by like not memorizing [16:55.120 -- 16:57.560] all sorts of facts and just knowing [16:57.560 -- 16:59.720] that we can look them up if needed. [16:59.720 -- 17:01.520] For working on a specific domain, [17:01.520 -- 17:06.440] it is useful to like have a lot of facts internalized [17:06.440 -- 17:08.520] so that you can recall them very quickly [17:08.520 -- 17:11.480] and kind of combine them in your head. [17:11.480 -- 17:14.840] So I wouldn't take an extreme position on either side. [17:14.840 -- 17:18.400] I would say, I think retrieval is gonna be really useful [17:19.520 -- 17:22.480] just at the very least for current events, [17:22.480 -- 17:26.480] but also I don't think we wanna try to pack [17:26.480 -- 17:29.960] all human knowledge into the weights of a neural net. [17:29.960 -- 17:32.280] On the other hand, I think people have had a lot of luck [17:32.280 -- 17:37.200] just scaling up models and like as they soak up [17:37.200 -- 17:40.800] more factual knowledge, they also get better at reasoning [17:40.800 -- 17:41.640] and other things. [17:41.640 -- 17:44.280] And I think I haven't seen any demonstrations [17:44.280 -- 17:48.080] of tiny models that just do lots of retrieval [17:48.080 -- 17:50.320] and save all their weights for reasoning. [17:50.320 -- 17:53.840] Yeah, I just haven't seen any evidence of this [17:53.840 -- 17:57.480] or I haven't seen any successful attempts at making this. 
[17:57.480 -- 17:59.640] Let's move on to training language models [17:59.640 -- 18:01.680] to follow instructions with human feedback. [18:01.680 -- 18:03.080] That was Ouyang et al. [18:03.080 -- 18:05.640] And that was 2022 with yourself as a co-author. [18:05.640 -- 18:08.040] Can you tell us the main idea with this paper? [18:08.040 -- 18:09.760] This is the InstructGPT paper. [18:09.760 -- 18:12.000] What is InstructGPT and what's going on here? [18:12.000 -- 18:15.240] InstructGPT is a language model that's fine-tuned [18:15.240 -- 18:16.480] to follow instructions. [18:16.480 -- 18:19.000] And it's in fact the one that you can play with [18:19.000 -- 18:23.280] if you go to the OpenAI website, you get a big text box [18:23.280 -- 18:25.920] and you can write some text and then press the button [18:25.920 -- 18:27.680] to generate a completion. [18:27.680 -- 18:30.240] So the idea here was, I mean, language models [18:30.240 -- 18:33.800] are pretty useful and you can sometimes get them [18:33.800 -- 18:36.160] to do what you want by prompting them just right. [18:36.160 -- 18:39.880] This idea of few-shot prompting has become pretty popular [18:39.880 -- 18:41.560] where you give a few examples, [18:41.560 -- 18:44.200] like a few question and answer examples. [18:44.200 -- 18:45.720] And then if you ask another question, [18:45.720 -- 18:48.520] it'll hopefully provide an answer in the same style. [18:48.520 -- 18:51.600] So the idea, yeah, so you can get language models [18:51.600 -- 18:53.240] to do great things with prompting, [18:53.240 -- 18:55.240] but prompting is itself an art [18:55.240 -- 18:56.480] and it's tricky to get right. [18:56.480 -- 18:59.040] And it's also kind of not necessarily getting [18:59.040 -- 19:01.600] the best possible performance out of the model.
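The few-shot prompting described here can be sketched as plain string assembly. This is only an illustration: the `Q:`/`A:` layout and the helper name `build_few_shot_prompt` are assumptions, not a format given in the conversation.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Lay out worked question/answer pairs, then the new question with an
    empty answer slot for the language model to complete."""
    lines: List[str] = []
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
        lines.append("")  # blank line between examples
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    [("What is the capital of France?", "Paris"),
     ("What is the capital of Japan?", "Tokyo")],
    "What is the capital of Canada?",
)
print(prompt)
```

A raw language model continues the text after the final `A:`, so the examples are the only signal telling it what kind of answer is wanted; an instruction-following model aims to remove the need for this scaffolding.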
[19:01.600 -- 19:03.120] If you just take a raw language model [19:03.120 -- 19:06.000] and you try to talk to it, like you ask it a question, [19:06.000 -- 19:08.840] it probably, it doesn't know that it should actually answer [19:08.840 -- 19:10.560] that question as well as possible. [19:10.560 -- 19:13.840] For all it knows, you want it to give a joke answer [19:13.840 -- 19:15.320] or a riddle or something. [19:15.320 -- 19:17.840] Yeah, so the idea of InstructGPT was, [19:17.840 -- 19:21.120] let's make a kind of small change to our language models [19:21.120 -- 19:22.880] so that they're much easier to use. [19:22.880 -- 19:25.360] In particular, we're gonna train them to, [19:25.360 -- 19:29.440] if you have a piece of text where there's an instruction, [19:29.440 -- 19:32.840] the model will try to follow that instruction [19:32.840 -- 19:34.120] to the best of its abilities. [19:34.120 -- 19:36.480] And pretty much anything can be an instruction. [19:36.480 -- 19:38.760] Like you can have a, the instruction can be [19:38.760 -- 19:43.760] to continue a chat or it can be to summarize this text [19:44.400 -- 19:48.740] or give me a list of names for my company [19:48.740 -- 19:50.240] that sells widgets. [19:50.240 -- 19:51.680] Yeah, instructions can be anything [19:51.680 -- 19:54.960] and that makes this kind of model very powerful. [19:54.960 -- 19:56.000] So that was kind of, [19:56.000 -- 19:58.120] that's the idea of an instruction following model. [19:58.120 -- 19:59.760] It's like a model that can do anything [19:59.760 -- 20:01.460] that you specify with an instruction. [20:01.460 -- 20:04.000] And by the way, I wasn't a core contributor to this work. [20:04.000 -- 20:09.000] I was more involved with like getting the RL infrastructure [20:09.360 -- 20:12.280] and some of the RL training details, [20:12.280 -- 20:14.440] like helping out with that stuff.
[20:14.440 -- 20:16.840] But anyway, yeah, what we did in this project was [20:16.840 -- 20:20.620] we ran this whole methodology that I just described [20:20.620 -- 20:23.160] of RL from human preferences [20:23.160 -- 20:24.900] in this instruction-following setting. [20:24.900 -- 20:28.080] So we did supervised fine-tuning, [20:28.080 -- 20:30.840] collected preference data, trained a reward model [20:30.840 -- 20:33.800] and then did RL against that reward model. [20:33.800 -- 20:36.240] And one interesting detail is actually [20:36.240 -- 20:40.080] whereas the original initial data was just collected [20:40.080 -- 20:41.840] using contractors. [20:41.840 -- 20:46.840] At a certain point we had the API and it's got this, [20:47.040 -- 20:50.520] I mean, we have this playground on the website [20:50.520 -- 20:52.800] where this is where the big text box [20:52.800 -- 20:54.800] where you can use the model. [20:54.800 -- 20:57.200] So we took prompts that people, [20:57.200 -- 20:59.680] that users had put into the playground [20:59.680 -- 21:01.280] and used those for training, [21:01.280 -- 21:04.680] like both to collect preference data and to do RL. [21:04.680 -- 21:07.040] So, and this is like, [21:07.040 -- 21:10.760] this is disclosed to users pretty prominently. [21:10.760 -- 21:13.040] Like when people are using the playground, [21:13.040 -- 21:15.520] you get notified that your prompts might be used [21:15.520 -- 21:16.480] for the training. [21:16.480 -- 21:19.120] And we're also careful to train in such a way [21:19.120 -- 21:20.860] that we don't memorize any information [21:20.860 -- 21:23.080] that was in the prompts. [21:23.080 -- 21:24.760] Like, and it explicit, [21:24.760 -- 21:27.480] like we have a pretty like elaborate process [21:27.480 -- 21:30.680] for making sure there's no like private information [21:30.680 -- 21:32.840] being leaked into the model. [21:32.840 -- 21:36.960] But anyway, yeah, that's basically the experimental setup.
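The reward-modeling step in the setup above can be illustrated with a small sketch. The conversation does not say which loss was used, so the pairwise logistic (Bradley-Terry style) loss below is an assumption, a commonly used choice for preference data. The function name and the toy scores are hypothetical.

```python
import numpy as np


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over comparison pairs.

    Assumption: each pair holds the reward model's scalar scores for the
    response a labeler preferred and for the response they rejected.
    """
    diff = np.asarray(reward_preferred, dtype=float) - np.asarray(reward_rejected, dtype=float)
    # -log(sigmoid(d)) equals log(1 + exp(-d)); logaddexp computes it stably.
    return float(np.mean(np.logaddexp(0.0, -diff)))


# Toy scores (made up): the loss shrinks as preferred responses score higher.
print(pairwise_preference_loss([0.0, 0.0], [0.0, 0.0]))  # equal scores give log(2)
print(pairwise_preference_loss([2.0, 3.0], [0.0, 0.0]))
```

Minimizing a loss like this pushes the reward model to score the preferred response above the rejected one. The RL step then trains the policy against those scores.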
[21:36.960 -- 21:39.680] And the result was that it works [21:39.680 -- 21:42.060] like this methodology works quite well. [21:42.060 -- 21:44.480] And you get a model that's vastly preferred [21:44.480 -- 21:48.820] to the base model on this distribution of realistic prompts [21:48.820 -- 21:50.880] that people are giving the model, [21:50.880 -- 21:53.040] which often contain instructions. [21:53.040 -- 21:56.040] So the raw, like the raw language models [21:56.040 -- 21:58.760] generally do a really bad job following instructions. [21:58.760 -- 22:02.920] But this RL-trained instruction-following model [22:02.920 -- 22:04.120] is a lot better. [22:04.120 -- 22:06.440] And it's something like, [22:06.440 -- 22:08.220] if you just calculate how much better, [22:08.220 -- 22:09.200] it's something like, [22:09.200 -- 22:11.800] it's as good as a model that's a hundred times bigger. [22:11.800 -- 22:13.200] That's a lot. [22:13.200 -- 22:14.040] Yeah. [22:14.040 -- 22:15.280] You wanted the model to be truthful. [22:15.280 -- 22:17.640] Is that one of the criteria you wanted? [22:17.640 -- 22:20.000] Yeah, truthfulness was one of the criteria. [22:20.000 -- 22:22.200] That seems amazing to me that truthfulness [22:22.200 -- 22:24.080] is something that it could learn by example. [22:24.080 -- 22:26.480] Like does that mean that truthfulness is somehow [22:26.480 -- 22:28.000] represented inside the network [22:28.000 -- 22:31.240] or because there's no external way for the model to confirm [22:31.240 -- 22:32.720] whether something is true or false? [22:32.720 -- 22:35.440] So how might it know what is true [22:35.440 -- 22:37.480] without any external reference? [22:37.480 -- 22:38.960] I think to some extent, [22:38.960 -- 22:42.420] there is some internal representation of truthfulness.
[22:42.420 -- 22:43.260] So I would say, [22:43.260 -- 22:45.340] like one way to think about what language models do [22:45.340 -- 22:48.200] is they're trained to imitate the whole internet. [22:48.200 -- 22:50.520] And the internet is written by lots of different people [22:50.520 -- 22:52.520] and has lots of different types of content [22:52.520 -- 22:57.200] from fiction to nonfiction to like technical, [22:57.200 -- 23:00.600] like detailed technical literature to like jokes [23:00.600 -- 23:03.400] and like forum posts, whatever. [23:03.400 -- 23:07.260] So the model is basically an ensemble of all these people [23:07.260 -- 23:08.880] who wrote stuff on the internet, [23:08.880 -- 23:11.000] the raw pre-trained model. [23:11.000 -- 23:13.080] When you feed it a prompt, [23:13.080 -- 23:15.580] what it's doing internally has to be something like [23:15.580 -- 23:18.200] figuring out who wrote this prompt [23:18.200 -- 23:20.020] and then trying to continue in that style. [23:20.020 -- 23:21.880] So if it thinks it's reading, [23:21.880 -- 23:26.180] just reading something on the Wall Street Bets Reddit, [23:26.180 -- 23:28.440] it's gonna continue on that style. [23:28.440 -- 23:30.640] But if it thinks it's in the New York Times, [23:30.640 -- 23:33.320] it's gonna write in a very different way. [23:33.320 -- 23:38.280] So effectively, the model must be calculating somewhere, [23:38.280 -- 23:40.800] like what style is this or what ensemble, [23:40.800 -- 23:43.900] what's the narrower ensemble of styles [23:43.900 -- 23:46.400] that I'm trying to imitate now. 
[23:46.400 -- 23:48.400] At the very least, when you do some kind of, [23:48.400 -- 23:51.080] when you do training like either supervised fine-tuning [23:51.080 -- 23:52.840] or RL from human feedback, [23:52.840 -- 23:55.600] you can at least like narrow down the set of styles [23:55.600 -- 23:59.500] the model is producing and try to imitate like the best [23:59.500 -- 24:02.680] or the best person in the training set [24:02.680 -- 24:04.300] or the best style in the training set. [24:04.300 -- 24:06.480] And obviously best will differ a lot. [24:06.480 -- 24:09.540] So what we'll end up with will depend on our instructions. [24:09.540 -- 24:12.520] So if we tell, I don't know, [24:12.520 -- 24:15.080] we'll end up with something that has kind of safe, [24:15.080 -- 24:19.000] like not too controversial, [24:19.000 -- 24:21.160] but a bit corporate, [24:21.160 -- 24:23.240] we'll end up with something like that [24:23.240 -- 24:25.680] depending on what our instructions are. [24:25.680 -- 24:27.320] So at the very least, [24:27.320 -- 24:29.880] like we can kind of narrow in on one style [24:29.880 -- 24:32.160] instead of having the whole distribution [24:32.160 -- 24:33.320] of styles on the internet. [24:33.320 -- 24:35.780] I think probably there's more to it than that. [24:35.780 -- 24:38.140] Like we're not just learning about style, [24:38.140 -- 24:40.580] but the model probably is like internally [24:40.580 -- 24:42.220] trying to determine if things are, [24:42.220 -- 24:44.000] if statements are true or not, [24:44.000 -- 24:47.320] like if the prompt contains incorrect information, [24:47.320 -- 24:48.980] because that probably would be useful [24:48.980 -- 24:51.560] for determining a likely completion. [24:51.560 -- 24:53.340] I'm just talking about the raw pre-trained model. [24:53.340 -- 24:54.520] So I think, yeah, [24:54.520 -- 24:58.180] I think just the objective of predicting next tokens [24:58.180 -- 24:59.520] probably gives you a lot.
[24:59.520 -- 25:02.120] It forces the model to like to determine [25:02.120 -- 25:03.680] if things are true or not. [25:03.680 -- 25:05.880] I think for RL fine-tuning, [25:05.880 -- 25:07.560] there's a lot more potential for the model [25:07.560 -- 25:11.900] to actually like try to output something truthful [25:11.900 -- 25:14.240] as opposed to trying to imitate a certain style. [25:14.240 -- 25:16.120] Though it's hard to, [25:16.120 -- 25:18.520] I guess it would be hard to like determine [25:18.520 -- 25:21.400] if that's what the model is actually trying to do. [25:21.400 -- 25:24.240] So it's almost like the prompt is guiding the model. [25:24.240 -- 25:26.720] It's like, what corner of the internet do we want to, [25:26.720 -- 25:28.320] do we want to imitate here? [25:28.320 -- 25:31.240] And maybe we want to, InstructGPT wants to, [25:31.240 -- 25:33.520] to focus more on the more truthful corners [25:33.520 -- 25:35.800] of the internet and something similar to that. [25:35.800 -- 25:36.880] Yeah, I would hope so. [25:36.880 -- 25:38.680] At least I think that's a pretty good, [25:38.680 -- 25:41.360] though maybe a little simplistic picture of what's going on. [25:41.360 -- 25:42.200] At the very least, [25:42.200 -- 25:44.920] we should be able to imitate the most truthful corner [25:44.920 -- 25:45.760] of the internet. [25:45.760 -- 25:47.760] So can you talk about generalization [25:47.760 -- 25:52.360] and how does this type of model perform out of distribution? [25:52.360 -- 25:54.080] Like, I guess if it sees questions [25:54.080 -- 25:56.480] that are a bit different than what it was trained on, [25:56.480 -- 25:58.040] what happens if we get a little bit away [25:58.040 -- 26:00.560] from the training data with the reward models? [26:00.560 -- 26:02.320] I mean, language models in general, [26:02.320 -- 26:03.840] generalize surprisingly well.
[26:03.840 -- 26:05.400] And I would say overall, [26:05.400 -- 26:07.600] like these pre-trained models that are trained [26:07.600 -- 26:09.760] on super diverse data sets from the internet, [26:09.760 -- 26:12.920] they tend to generalize quite well, or surprisingly well, [26:12.920 -- 26:15.200] at least it's surprising to those of us [26:15.200 -- 26:19.000] who were around for the earlier days of machine learning [26:19.000 -- 26:22.800] when everything was trained from scratch and very fragile. [26:22.800 -- 26:25.640] For example, if you provide an instruction [26:25.640 -- 26:29.280] in some other language, even a fairly rare language, [26:29.280 -- 26:32.360] it'll often do a decent job following the instruction, [26:32.360 -- 26:35.840] even if there's zero data in the whole instruction [26:35.840 -- 26:39.360] following training process that's in that language. [26:39.360 -- 26:41.840] And that's just carryover from the pre-training. [26:41.840 -- 26:43.960] So I think generalization, [26:43.960 -- 26:46.080] yeah, I think language models generalize quite well. [26:46.080 -- 26:47.880] So you asked about reward models. [26:47.880 -- 26:50.840] I think one of the tricky pieces about RL [26:50.840 -- 26:52.400] from human feedback is how, [26:52.400 -- 26:53.880] so you have this reward model [26:53.880 -- 26:55.480] and you're actually training against it, [26:55.480 -- 26:57.880] meaning you're training your policy to have high reward [26:57.880 -- 27:01.200] and it's going to exploit the errors in the reward model. [27:01.200 -- 27:04.280] So it's gonna eventually find adversarial examples [27:04.280 -- 27:05.200] to the reward model. [27:05.200 -- 27:07.200] This is worse than kind of normal [27:07.200 -- 27:08.640] out of distribution behavior. [27:08.640 -- 27:11.480] It's like targeted out of distribution examples.
[27:11.480 -- 27:13.800] So there are definitely some challenges [27:13.800 -- 27:17.400] around getting reward models to generalize well [27:17.400 -- 27:20.960] or generalize as far as possible from the training set. [27:20.960 -- 27:22.760] Can these types of agents tell us [27:22.760 -- 27:26.240] when they don't know something or is that a hard problem? [27:26.240 -- 27:28.800] I'd say sort of, if you ask a question [27:28.800 -- 27:31.480] that's kind of in the core of the model's knowledge, [27:31.480 -- 27:34.160] it will know the answer and it'll know that it knows. [27:34.160 -- 27:35.640] By the way, I'm talking about models [27:35.640 -- 27:37.240] like for the instruct model. [27:37.240 -- 27:40.360] If you ask it about something that's like very simple [27:40.360 -- 27:42.160] at the core of its knowledge, [27:42.160 -- 27:44.160] it'll know if you, there are certain things [27:44.160 -- 27:45.920] that it knows that it doesn't know, [27:45.920 -- 27:49.240] like current events where it's been trained [27:49.240 -- 27:52.840] to know that it doesn't know certain things in real time. [27:52.840 -- 27:55.000] But if you ask it about something [27:55.000 -- 27:56.760] that's kind of on the edge of its knowledge, [27:56.760 -- 27:59.480] it's gonna have a hard time. [27:59.480 -- 28:01.640] It's necessarily gonna be inaccurate. [28:01.640 -- 28:03.920] I mean, there have been a couple of papers [28:03.920 -- 28:04.880] about this question. [28:04.880 -- 28:08.080] So there was a paper from Anthropic recently [28:08.080 -- 28:09.360] called Language Models [28:09.360 -- 28:10.920] (Mostly) Know What They Know. [28:10.920 -- 28:15.120] And there's also a paper from FHI and OpenAI [28:15.120 -- 28:17.680] called Getting Language Models [28:17.680 -- 28:20.080] to Express Their Uncertainty in Words.
[28:20.080 -- 28:22.000] These language models, [28:22.000 -- 28:24.160] as well as a lot of other models in machine learning [28:24.160 -- 28:26.560] are trained to maximize likelihood. [28:26.560 -- 28:28.680] So maximize log-prob of data. [28:28.680 -- 28:29.920] You're already training them [28:29.920 -- 28:32.480] to always predict a distribution of outputs. [28:32.480 -- 28:35.440] So for language models, given a prefix, [28:35.440 -- 28:38.920] it's predicting a distribution over the next token. [28:38.920 -- 28:41.760] These predictions for the next token [28:41.760 -- 28:44.720] generally are pretty well calibrated. [28:44.720 -- 28:47.680] If it puts 80% probability on something, [28:47.680 -- 28:49.160] and you look at all the times [28:49.160 -- 28:51.920] when it puts 80% probability on something, [28:51.920 -- 28:54.080] it's right 80% of the time. [28:54.080 -- 28:56.400] That's just a result of the training objective. [28:56.400 -- 28:59.960] The training objective strongly incentivizes the model [28:59.960 -- 29:01.400] to be calibrated, [29:01.400 -- 29:05.320] meaning it has a reasonable estimate of its uncertainty. [29:05.320 -- 29:07.240] So at the single token level, [29:07.240 -- 29:08.960] models definitely are calibrated. [29:08.960 -- 29:10.880] The question is whether they're calibrated on, [29:10.880 -- 29:14.680] whether this calibration extends to settings [29:14.680 -- 29:18.000] where they are generating multi-token outputs, [29:18.000 -- 29:20.360] or whether they can like judge the correctness [29:20.360 -- 29:22.000] of some multi-token statement. [29:22.000 -- 29:25.000] So I would say since models are calibrated [29:25.000 -- 29:26.600] at the single token level, [29:26.600 -- 29:29.640] I think that they definitely have the information [29:29.640 -- 29:32.840] to be calibrated in these other settings. 
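The calibration claim above ("when it puts 80% probability on something, it's right 80% of the time") can be checked with a binning procedure. The sketch below is one common way to do that, not something described in the conversation. The function names, the choice of five equal-width bins, and the toy inputs are assumptions.

```python
import numpy as np


def calibration_table(confidences, correct, n_bins=5):
    """Group predictions by stated confidence; compare mean confidence to accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0..n_bins-1; a confidence of 1.0 lands in the last bin.
    idx = np.digitize(confidences, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((float(confidences[mask].mean()),
                         float(correct[mask].mean()),
                         int(mask.sum())))
    return rows  # (mean confidence, empirical accuracy, count) per non-empty bin


def expected_calibration_error(confidences, correct, n_bins=5):
    """Count-weighted average gap between confidence and accuracy across bins."""
    rows = calibration_table(confidences, correct, n_bins)
    total = sum(count for _, _, count in rows)
    return sum(count / total * abs(conf - acc) for conf, acc, count in rows)


# Toy data: a model that says 90% but is right only half the time is overconfident.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(overconfident)
```

For single-token predictions, `confidences` would be the probability the model assigned to its top token and `correct` whether that token matched the data. Extending this to multi-token statements is the harder question the speaker raises.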
[29:32.840 -- 29:35.960] So that's why I think the problem of models [29:35.960 -- 29:38.640] knowing what they know isn't actually that hard, [29:38.640 -- 29:42.240] or at least getting a model to express its uncertainty [29:42.240 -- 29:44.080] pretty much as well as a human does, [29:44.080 -- 29:46.560] doesn't feel like a insurmountable problem, [29:46.560 -- 29:48.360] but there are some practical difficulties [29:48.360 -- 29:50.120] to getting there. [29:50.120 -- 29:52.720] People use the phrase AI alignment in different ways. [29:52.720 -- 29:54.440] Can you talk about how you see alignment [29:54.440 -- 29:57.680] in your work on RL from human feedback? [29:57.680 -- 29:59.720] I think of alignment mostly as the problem [29:59.720 -- 30:03.560] of getting the model to try to do the right thing. [30:03.560 -- 30:05.000] So we can kind of make a distinction [30:05.000 -- 30:08.240] between what the model is capable of doing. [30:08.240 -- 30:10.200] Like if you just take a raw language model [30:10.200 -- 30:13.240] and you ask it a question, like I said before, [30:13.240 -- 30:14.720] it doesn't know that you actually wanted [30:14.720 -- 30:17.120] to give the correct answer as opposed to, [30:17.120 -- 30:20.160] it might think someone who's not very knowledgeable [30:20.160 -- 30:21.000] is answering. [30:21.000 -- 30:22.480] By doing some extra training, [30:22.480 -- 30:24.800] we can get the model to actually try to do the right thing. [30:24.800 -- 30:28.680] And so I would say that that's the main goal of alignment. [30:28.680 -- 30:31.720] So there was an OpenAI blog post recently [30:31.720 -- 30:34.560] that talked about the sequence in alignment. [30:34.560 -- 30:38.800] One was training AI systems using human feedback, [30:38.800 -- 30:42.800] two, training AI systems to assist human evaluation, [30:42.800 -- 30:46.440] and three, training AI systems to do alignment research. 
[30:46.440 -- 30:50.200] So is your current work mostly about this first item [30:50.200 -- 30:51.800] and when and how do you see us [30:51.800 -- 30:53.440] getting to these other stages? [30:53.440 -- 30:56.240] I'm doing some work now on number two, [30:56.240 -- 30:58.520] training AI systems to assist human feedback. [30:58.520 -- 31:01.760] I think that sort of becomes increasingly necessary [31:01.760 -- 31:05.120] as you start trying to get the systems [31:05.120 -- 31:06.840] to solve harder and harder problems. [31:06.840 -- 31:09.520] When you have models that are kind of very below human level [31:09.520 -- 31:12.000] or maybe at human level at a certain task, [31:12.000 -- 31:15.080] it's pretty straightforward to supervise them. [31:15.080 -- 31:17.200] But once they're doing things that are very hard [31:17.200 -- 31:19.480] or doing things that require a lot [31:19.480 -- 31:21.960] of diverse technical knowledge, [31:21.960 -- 31:24.480] it becomes pretty hard to provide [31:24.480 -- 31:26.560] a useful supervision signal. [31:26.560 -- 31:29.280] So we have to start doing things like one model [31:29.280 -- 31:31.680] writes an answer to a question [31:31.680 -- 31:35.320] and then another model provides a critique of that answer, [31:35.320 -- 31:36.680] points out some flaws, [31:36.680 -- 31:38.880] and then the human only has to judge [31:38.880 -- 31:43.120] the first answer after looking at the critique, [31:43.120 -- 31:45.440] meaning basically the critique helps the human [31:45.440 -- 31:46.520] assess the answer. [31:46.520 -- 31:48.840] So I think that kind of idea [31:48.840 -- 31:51.000] is starting to become pretty relevant. [31:51.000 -- 31:53.560] Colleagues and I are exploring that kind of idea now. [31:53.560 -- 31:55.520] As for assisting alignment research, [31:55.520 -- 31:56.960] there's some other work at OpenAI [31:56.960 -- 31:58.600] that's starting to explore this. 
[31:58.600 -- 32:02.040] It's also, that's sort of the furthest down the road. [32:02.040 -- 32:05.080] So I saw Stuart Russell was on your PhD committee [32:05.080 -- 32:07.680] and I really enjoyed his book, Human Compatible. [32:07.680 -- 32:10.200] I wonder if you share the idea mentioned in the book [32:10.200 -- 32:11.880] that the standard RL framing [32:11.880 -- 32:14.760] with this fixed reward signal is problematic [32:14.760 -- 32:16.360] and that agents, powerful agents, [32:16.360 -- 32:18.960] should try to do what we want [32:18.960 -- 32:21.880] and maintain some uncertainty about what it is we want [32:21.880 -- 32:26.120] and the agents that are too certain will be problematic. [32:26.120 -- 32:28.320] Do you have any thoughts on that idea? [32:28.320 -- 32:31.560] Yeah, I totally agree with that idea. [32:31.560 -- 32:34.120] So I think first it's really hard to write down [32:34.120 -- 32:37.560] a simple reward function that actually captures [32:37.560 -- 32:41.080] what we want or what any particular person wants. [32:41.080 -- 32:43.720] I can say I want a little more of this [32:43.720 -- 32:44.880] or a little more of that, [32:44.880 -- 32:47.760] but you wouldn't want to take that to the extreme. [32:47.760 -- 32:52.600] If we build agents that try to cater to our wishes, [32:52.600 -- 32:55.200] we should make sure they're, [32:55.200 -- 32:58.240] like they have a lot of, they have uncertainty [32:58.240 -- 33:00.080] about what we want or what we value. [33:00.080 -- 33:03.480] And that'll also cause them to be a little more cautious [33:03.480 -- 33:07.600] and say, not disturb anything that might be important to us. [33:07.600 -- 33:10.600] So yeah, I agree with that. [33:10.600 -- 33:13.360] Like Stuart Russell gave a very good [33:13.360 -- 33:17.040] like problem definition of what we want AI to do. 
[33:17.040 -- 33:18.440] Like we want it to basically, [33:18.440 -- 33:21.040] we want to jointly like play this game [33:21.040 -- 33:23.760] where AI is trying to figure out what we want [33:23.760 -- 33:24.840] and then trying to do that. [33:24.840 -- 33:27.600] But simultaneously maintaining some uncertainty [33:27.600 -- 33:28.640] about what we want. [33:28.640 -- 33:30.560] I would say if you start to look [33:30.560 -- 33:31.920] at how to get that in practice, [33:31.920 -- 33:34.400] it actually looks quite a bit like the kind of RL [33:34.400 -- 33:37.920] from human feedback that we're working on at OpenAI [33:37.920 -- 33:41.280] and others are working on at other places. [33:41.280 -- 33:44.720] I think, yeah, I see what we're doing [33:44.720 -- 33:47.320] as a practical implementation [33:47.320 -- 33:50.720] of getting towards this behavior that Russell described. [33:50.720 -- 33:53.160] Do you think of AGI as an abstract goal [33:53.160 -- 33:55.560] or are we gonna see a model come out one day [33:55.560 -- 33:58.040] and people are gonna say, oh, that's the first AGI model? [33:58.040 -- 34:01.640] Like, what does it have to do for people to say that? [34:01.640 -- 34:04.920] I think people will say that many times [34:04.920 -- 34:07.200] then realize that it doesn't quite do everything [34:07.200 -- 34:08.080] that you want. [34:08.080 -- 34:10.600] I think we're gonna have a lot of like a long series [34:10.600 -- 34:14.320] of models that are superhuman at most things [34:14.320 -- 34:16.640] or at a certain class of things, [34:16.640 -- 34:20.840] but they also have some failure modes and weaknesses. [34:20.840 -- 34:24.640] Like I expect us to see multiple models [34:24.640 -- 34:26.600] that are proclaimed as AGI [34:26.600 -- 34:30.360] and then only after interacting with it a while, [34:30.360 -- 34:33.880] do you realize it's not quite there. 
[34:33.880 -- 34:35.520] What would you say is the relationship [34:35.520 -- 34:39.760] between AGI and RL and AGI and these large language models? [34:39.760 -- 34:41.680] How do those concepts fit together? [34:41.680 -- 34:46.680] I'd say that RL is a useful component of training AGI [34:47.160 -- 34:49.240] or an almost essential component. [34:49.240 -- 34:52.440] The thing RL lets you do is it lets you optimize [34:52.440 -- 34:54.960] any objective for the agents, [34:54.960 -- 34:59.280] any objective that is a function of the agent's behavior. [34:59.280 -- 35:03.720] So with pre-training, like what we do for language models, [35:03.720 -- 35:05.760] you're kind of choosing an objective [35:05.760 -- 35:09.400] that lets us do something with all the training data [35:09.400 -- 35:11.720] we have, which is all this internet text. [35:11.720 -- 35:14.200] So we choose this maximum likelihood objective, [35:14.200 -- 35:17.000] which is basically the only, or not the only thing, [35:17.000 -- 35:20.200] but it's like a sensible way to absorb all this knowledge. [35:20.200 -- 35:24.040] But then if we really want to optimize the agent's behavior [35:24.040 -- 35:25.440] for a specific objective, [35:25.440 -- 35:29.040] RL is kind of the only framework that lets you do that. [35:29.960 -- 35:32.240] Okay, John, we have a few questions from the audience [35:32.240 -- 35:33.280] and I'm just going to pick the two [35:33.280 -- 35:36.240] that have the highest score in terms of Twitter likes. [35:36.240 -- 35:40.760] So the first is from Eric Jang, VP of AI at Halodi Robotics. [35:40.760 -- 35:43.360] He asked, RL distributions are non-stationary, [35:43.360 -- 35:46.080] making it hard to reason about PPO losses [35:46.080 -- 35:48.520] and how that relates to return or generalization.
[35:48.520 -- 35:51.000] Are there any intermediate plots and visualizations [35:51.000 -- 35:53.120] you'd like to generate to debug [35:53.120 -- 35:56.200] or incrementally build up a large scale RL system? [35:56.200 -- 35:59.760] Yeah, there are definitely some stats that I look at. [35:59.760 -- 36:02.640] So I will be, I'll talk about this [36:02.640 -- 36:07.640] in the nuts and bolts like reboot later this year, [36:07.760 -- 36:12.760] but I'd say things like looking at the explained variance [36:12.800 -- 36:15.320] of the value function and looking at the, [36:15.320 -- 36:18.120] like how many samples are getting clipped in PPO [36:18.120 -- 36:23.120] and what the KL divergence between the policy before [36:23.120 -- 36:25.680] and after the update is, yeah, things like that. [36:25.680 -- 36:30.640] And then Ethan Caballero from Mila asks, [36:30.640 -- 36:33.760] what is your median estimate for the arrival date of AGI? [36:33.760 -- 36:37.440] I think not too far away, but like I said, [36:37.440 -- 36:39.480] I expect there to be a lot of false starts. [36:39.480 -- 36:44.360] I would say I expect like AI to be able to do better, [36:44.360 -- 36:46.520] a better job than humans at most jobs [36:46.520 -- 36:49.040] that humans do now, five years or so. [36:49.040 -- 36:51.040] That's not all jobs, but most jobs. [36:51.040 -- 36:52.680] For a while, we're gonna discover things [36:52.680 -- 36:54.080] that AI is very good at [36:54.080 -- 36:56.440] and where we wanna keep humans in control. [36:56.440 -- 36:59.440] So I think there'll be some kind of gradual process [36:59.440 -- 37:01.240] over the next 10 or 15 years. [37:01.240 -- 37:02.440] I've been curious about this. [37:02.440 -- 37:05.160] I see that some RL work is patented, [37:05.160 -- 37:08.800] but I could not find a TRPO or PPO in, [37:08.800 -- 37:10.160] I could not find patents on these. [37:10.160 -- 37:13.760] Are those protected, patent protected at all?
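The three PPO diagnostics named in the answer to the first audience question (explained variance of the value function, the fraction of samples being clipped, and the KL divergence between the policy before and after the update) can be computed from logged arrays. The sketch below is an assumption about how one might compute them. The function names, the default `clip_range=0.2`, and the sample-based KL estimator are choices made here, not details given in the conversation.

```python
import numpy as np


def explained_variance(value_predictions, returns):
    """1 - Var(returns - predictions) / Var(returns); 1.0 means a perfect value fit."""
    preds = np.asarray(value_predictions, dtype=float)
    rets = np.asarray(returns, dtype=float)
    var_returns = np.var(rets)
    if var_returns == 0.0:
        return float("nan")  # undefined when the returns are constant
    return float(1.0 - np.var(rets - preds) / var_returns)


def clip_fraction(logp_new, logp_old, clip_range=0.2):
    """Share of samples whose probability ratio falls outside [1 - eps, 1 + eps].

    This is a common proxy; whether clipping actually changes a sample's
    gradient also depends on the sign of its advantage.
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    return float(np.mean(np.abs(ratio - 1.0) > clip_range))


def approx_kl(logp_new, logp_old):
    """Sample estimate of KL(old || new), using actions drawn from the old policy."""
    return float(np.mean(np.asarray(logp_old, dtype=float) - np.asarray(logp_new, dtype=float)))


# Toy log-probabilities (made up) for three sampled actions.
logp_old = np.log(np.array([0.5, 0.25, 0.25]))
logp_new = np.log(np.array([0.75, 0.25, 0.125]))
print(clip_fraction(logp_new, logp_old), approx_kl(logp_new, logp_old))
```

Plotting these per update gives a rough picture of whether the value function is fitting the returns and whether the policy is changing too quickly.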
[37:13.760 -- 37:18.320] Or how do you think of intellectual property protection [37:18.320 -- 37:19.280] for that kind of work? [37:19.280 -- 37:22.120] I haven't ever looked into patenting anything [37:22.120 -- 37:25.080] and OpenAI hasn't either as far as I know. [37:25.080 -- 37:26.960] I think the trend over time has been [37:26.960 -- 37:29.600] for people to take patents in machine, [37:29.600 -- 37:31.920] like on machine learning algorithms less seriously. [37:31.920 -- 37:34.520] There's this algorithm in computer vision called SIFT, [37:34.520 -- 37:36.960] which is like this keypoint detector. [37:36.960 -- 37:38.960] And this was patented. [37:38.960 -- 37:42.080] I think the guy who patented it, [37:42.080 -- 37:44.680] he probably made his university some money from the patent, [37:44.680 -- 37:48.160] but in the end, all it did was cause people [37:48.160 -- 37:52.080] a lot of annoyance because people had to come up [37:52.080 -- 37:56.280] with alternative algorithms that had a different acronym [37:56.280 -- 37:58.240] and weren't patented. [37:58.240 -- 38:02.920] So the OpenCV open source library would have, [38:02.920 -- 38:05.400] had to be careful about putting this algorithm [38:05.400 -- 38:07.960] in their library because of the patent risks. [38:07.960 -- 38:11.960] So I think like these patents aren't, [38:11.960 -- 38:13.920] patent rights aren't exercised that much. [38:13.920 -- 38:17.080] And I think big companies like Google will patent [38:17.080 -- 38:19.280] a lot of stuff for defensive reasons. [38:19.280 -- 38:22.040] So if they get in some big legal dispute [38:22.040 -- 38:24.360] with another company, it can be used [38:24.360 -- 38:26.520] as like one of the bargaining chips. [38:26.520 -- 38:30.440] But I think, I don't think anyone's gonna like get sued [38:30.440 -- 38:35.320] for royalties for not providing royalties [38:35.320 -- 38:36.960] for the use of some algorithm.
[38:36.960 -- 38:40.080] Okay, and then there's been a ton of work in RL, of course, [38:40.080 -- 38:43.560] since you first published TRPO and PPO. [38:43.560 -- 38:45.200] But from your point of view, [38:45.200 -- 38:46.440] if you had to pick a few highlights [38:46.440 -- 38:50.360] in terms of a few important milestones in RL algorithms [38:50.360 -- 38:51.600] since PPO came out, [38:53.120 -- 38:55.080] and by the way, it's amazing that in 2022, [38:55.080 -- 38:56.400] we're still using PPO, [38:57.520 -- 39:01.000] I think quite similar to its original form. [39:01.000 -- 39:01.840] Is that right? [39:02.920 -- 39:03.920] Yeah, pretty much. [39:03.920 -- 39:06.880] Yeah, so what would you say are the biggest [39:06.880 -- 39:09.680] highlights for you in terms of RL algorithms [39:09.680 -- 39:11.640] since you did PPO? [39:11.640 -- 39:13.440] Yeah, there's definitely been some interesting stuff. [39:13.440 -- 39:16.480] So I think like a little after PPO, [39:16.480 -- 39:19.120] there is TD3 and SAC, [39:19.120 -- 39:23.000] and those seem like pretty solid value-based methods. [39:23.000 -- 39:25.320] That was one development that was interesting. [39:25.320 -- 39:27.840] I think like, yeah, I thought MuZero [39:27.840 -- 39:32.840] and its like elaborations were also, like EfficientZero. [39:32.840 -- 39:36.840] EfficientZero were also pretty impressive [39:36.840 -- 39:38.960] that you can get that good sample efficiency. [39:38.960 -- 39:41.600] Both of the things I just mentioned were kind of, [39:41.600 -- 39:45.000] well, I don't wanna say mostly on toy tasks or benchmarks [39:45.000 -- 39:48.120] because yeah, I'm sure people are doing some real things [39:48.120 -- 39:49.440] with these algorithms. [39:49.440 -- 39:52.040] Yeah, so I think that stuff was interesting. [39:52.040 -- 39:56.760] I think like the whole recent interest, [39:56.760 -- 40:00.360] surge of interest in the offline RL was also notable.
[40:00.360 -- 40:02.480] I would say the stuff we're doing [40:02.480 -- 40:06.040] with RL from human feedback is a kind of offline RL [40:06.040 -- 40:09.000] because we're like, we have a fixed dataset [40:09.000 -- 40:11.640] and we have a fixed reward modeling dataset [40:11.640 -- 40:12.880] and we're training against that. [40:12.880 -- 40:14.720] This is like offline RL, [40:14.720 -- 40:15.960] but you're doing it in a different way. [40:15.960 -- 40:19.640] You're using an on-policy algorithm with a reward model [40:19.640 -- 40:23.280] as opposed to maybe a more typical way to do offline RL [40:23.280 -- 40:25.040] would be use off-policy algorithm. [40:25.040 -- 40:27.760] Would that work here or would that not work here? [40:27.760 -- 40:30.160] What we're doing here is kind of like model-based RL [40:30.160 -- 40:33.280] because the reward model is like a model [40:33.280 -- 40:35.800] of the unknown part of the system. [40:35.800 -- 40:38.920] So like the unknown part of the system here [40:38.920 -- 40:42.760] is the human rater or yeah, the human. [40:42.760 -- 40:46.880] It's not the outputting appending to your list of tokens. [40:46.880 -- 40:48.600] So this is kind of like the work [40:48.600 -- 40:51.840] that's like takes a dynamics model of the environment [40:51.840 -- 40:54.240] and does some kind of just runs [40:54.240 -- 40:56.600] a policy gradient algorithm against it. [40:56.600 -- 40:57.440] So it's not like, [40:57.440 -- 41:00.400] so the idea of running an online algorithm [41:00.400 -- 41:03.720] against a model, that's kind of a well-established idea. [41:03.720 -- 41:06.800] Though I would say the papers that previously did this, [41:06.800 -- 41:08.520] they were in a pretty different regime.
[41:08.520 -- 41:11.200] We're in this regime of doing fairly small updates [41:11.200 -- 41:14.600] to the policy because we have these awesome pre-trained models [41:14.600 -- 41:19.000] and we don't need to actually change them that much. [41:19.000 -- 41:21.520] So yeah, we use these online algorithms. [41:21.520 -- 41:23.760] I'd say part of the reason why we can get away [41:23.760 -- 41:28.000] with using just an online algorithm [41:28.000 -- 41:30.480] is because we've been just looking [41:30.480 -- 41:32.480] at a contextual bandit problem. [41:32.480 -- 41:35.080] Yeah, because we only have one time step. [41:35.080 -- 41:37.840] You get a query and you output a response [41:37.840 -- 41:40.160] and then that response gets a reward. [41:40.160 -- 41:43.120] So if we had a multi-step process, [41:43.120 -- 41:48.120] such as a conversation where you can't assign a reward [41:48.320 -- 41:50.280] until the very end of the conversation, [41:50.280 -- 41:54.160] or you had some, I don't know, some interaction [41:54.160 -- 41:57.800] with some real-world system that's hard to simulate, [41:57.800 -- 42:00.440] then it wouldn't be as straightforward; [42:00.440 -- 42:03.760] you wouldn't be able to use exactly the same methodology. [42:03.760 -- 42:05.680] You would probably have to [42:05.680 -- 42:08.360] train a Q function [42:08.360 -- 42:10.600] or something like that. [42:10.600 -- 42:13.080] If you want your method to be sample efficient, [42:13.080 -- 42:15.640] you would probably have to do something slightly different. [42:15.640 -- 42:19.120] I think we'll have to start exploring this [42:19.120 -- 42:22.560] at some point soon, but so far we haven't, [42:22.560 -- 42:27.480] at least I haven't seen any cases in the domain [42:27.480 -- 42:29.680] I'm looking at that require this, [42:29.680 -- 42:33.480] but I expect it to be relevant at some point.
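The one-step setup described above (a query comes in, the policy outputs a response, and a reward model scores it) can be sketched as a contextual bandit trained with a simple policy-gradient update. This is only a toy illustration under stated assumptions, not the method discussed in the interview: the tabular softmax policy, the hard-coded `reward_model` table, and names such as `train_bandit` are hypothetical, and plain REINFORCE with a running baseline stands in for PPO.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_model(query, response):
    """Hypothetical stand-in for a learned reward model: a fixed score table."""
    table = {0: [0.1, 1.0, 0.3], 1: [0.9, 0.2, 0.4]}
    return table[query][response]


def train_bandit(num_steps=2000, lr=0.5, seed=0):
    """Train a tabular softmax policy on a one-step (contextual bandit) task.

    Each step is a complete episode: sample a query, sample one response
    from the current policy (on-policy), score it with the reward model,
    and apply a REINFORCE update with a per-query running baseline.
    """
    rng = random.Random(seed)
    num_queries, num_responses = 2, 3
    logits = [[0.0] * num_responses for _ in range(num_queries)]
    baseline = [0.0] * num_queries

    for _ in range(num_steps):
        q = rng.randrange(num_queries)
        probs = softmax(logits[q])
        a = rng.choices(range(num_responses), weights=probs)[0]
        r = reward_model(q, a)

        # Advantage relative to the running average reward for this query.
        advantage = r - baseline[q]
        baseline[q] += 0.1 * (r - baseline[q])

        # Gradient of log pi(a | q) w.r.t. the logits is one_hot(a) - probs.
        for i in range(num_responses):
            grad_log_prob = (1.0 if i == a else 0.0) - probs[i]
            logits[q][i] += lr * advantage * grad_log_prob

    return [softmax(row) for row in logits]


if __name__ == "__main__":
    for q, probs in enumerate(train_bandit()):
        print(q, [round(p, 3) for p in probs])
```

Because every episode ends after a single response, there is no credit assignment across time steps, which is why no Q function appears in the sketch. A multi-step setting with the reward arriving only at the end would need something more.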
[42:33.480 -- 42:37.080] So we had Aravind Srinivas talking about Decision Transformer [42:37.080 -- 42:39.360] on the show recently, that was a great episode. [42:39.360 -- 42:41.360] And I see that you were also a co-author [42:41.360 -- 42:43.920] on the 2016 RL squared paper. [42:43.920 -- 42:46.680] I want to ask you your thoughts about meta RL. [42:46.680 -- 42:48.560] Aravind had some interesting things to say [42:48.560 -- 42:50.640] about maybe the idea that a transformer [42:50.640 -- 42:52.320] could kind of supersede the need [42:52.320 -- 42:54.200] for an RL algorithm altogether. [42:54.200 -- 42:56.200] What do you expect from meta RL? [42:56.200 -- 42:58.600] Do you expect we'll still be using human-authored [42:58.600 -- 43:00.600] RL algorithms in the future? [43:00.600 -- 43:03.000] Yeah, that's a pretty bold statement, that [43:03.000 -- 43:05.400] we won't need any RL algorithms anymore. [43:05.400 -- 43:07.640] Yeah, since the RL squared paper, [43:07.640 -- 43:10.920] people have been talking less about meta learning, [43:10.920 -- 43:12.400] as far as I can tell, [43:12.400 -- 43:15.760] actually because sequence modeling has gotten so good, [43:15.760 -- 43:19.680] like transformer sequence models, that it's kind of clear [43:19.680 -- 43:21.920] that meta learning is just a special case of learning. [43:21.920 -- 43:26.560] It's just a certain kind of long context learning, [43:26.560 -- 43:28.720] learning involving long episodes. [43:28.720 -- 43:31.120] And maybe it shouldn't be treated that differently [43:31.120 -- 43:33.600] or addressed with special algorithms. [43:33.600 -- 43:36.760] I would say, yeah, ideas like Decision Transformer [43:36.760 -- 43:37.880] are pretty interesting, [43:37.880 -- 43:40.520] where you try to reduce RL to supervised learning.
[43:40.520 -- 43:43.800] It's still not certain exactly how these compare [43:43.800 -- 43:47.320] in performance to RL; people have started to analyze [43:47.320 -- 43:49.280] that empirically and theoretically. [43:49.280 -- 43:53.320] And I would say in practice, sometimes it's better, [43:53.320 -- 43:55.240] sometimes it's worse. [43:55.240 -- 43:57.960] In my experience, it's been worse on the problems [43:57.960 -- 44:01.920] where my colleagues and I have tested it. [44:01.920 -- 44:05.480] But yeah, it's definitely an interesting direction. [44:05.480 -- 44:08.360] Dr. John Schulman, thank you so much for sharing your time [44:08.360 -- 44:10.360] and your insight with the Talk RL audience today. [44:10.360 -- 44:11.480] Thanks so much.