TalkRL: The Reinforcement Learning Podcast

Arash Ahmadian is a Researcher at Cohere and Cohere For AI focused on preference training of large language models. He's also a researcher at the Vector Institute.

Featured Reference

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker

Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

Robin:

TalkRL Podcast is all reinforcement learning, all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at @TalkRLPodcast. I'm your host, Robin Chauhan. Today I'm here with Arash Ahmadian. Arash is a researcher at Cohere and Cohere For AI.

Robin:

He's focused on preference training of large language models. He's also a researcher at the Vector Institute. Welcome, Arash.

Arash:

Hey, Robin. Thanks for having me. Super excited to be here.

Robin:

Great to have you today. Can you tell us in your own words, what do you focus on exactly?

Arash:

Currently, my main focus is on reinforcement learning from human feedback slash preference training. But previously, I've done some work in efficiency, quantization, and mixture of experts.

Robin:

And is your role at Cohere, is it really research, or do you also do applied work?

Arash:

It's mostly research. So previously, I was at Cohere For AI. I was a research scholar there. And it was throughout my time there that I got exposed to reinforcement learning from human feedback and preference training. And the REINFORCE paper, which we're gonna talk about momentarily, is actually an artifact of that scholarship.

Arash:

Currently I do research mostly in Cohere itself.

Robin:

So, yes, speaking of your recent work, let's get to it. This paper is named Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. That's first author yourself, 2024. So what is the main idea in this paper, Arash?

Arash:

Really, at a high level, the main goal of the paper is to convey the message that optimization for fine tuning language models with reinforcement learning is actually not difficult as an optimization problem. And, you know, methods that are seen as pretty obsolete in deep reinforcement learning, like REINFORCE and vanilla policy gradient, are actually quite applicable in this setting. So it really has a focus on taking a step back and trying to show the fundamental differences between deep RL and RLHF, or fine tuning of large language models, as settings.

Robin:

Now there are many different methods that people have presented and talked about for doing preference training. And we definitely hear about PPO. You're presenting here REINFORCE and what you're calling RLOO, and you can tell us more about that soon. But other things we've heard about: DPO, IPO, RAFT, KTO. Can you give us a general idea of what the different groupings are here? What is different between all these methods?

Robin:

Without getting into all the details, there are kind of a couple of different themes here. How do these methods differ?

Arash:

Starting from all the way at the top, you have offline versus online. And offline just means you don't do any sampling, or you don't get feedback, during your training process. So in my mind, DPO, IPO, they all fall into the offline category, because they all skip the training of the reward model phase altogether. That's offline.

Arash:

And then you get to online, where PPO, REINFORCE, and RLOO, which are really reinforcement learning algorithms, come in. And you have iterative fine tuning methods which are online, for example RAFT. But they don't really do policy gradients. They don't do RL in the traditional sense.

Arash:

They use their reward model as a mechanism to filter their generations throughout training. So really, iterative fine tuning methods trade off the optimization complexity and the tuning complexity of reinforcement learning algorithms, and also some of their practical limitations, for the higher time required to sample generations during training. But, you know, you have PPO, proximal policy optimization, which is sort of the de facto RL algorithm that has made its way from deep RL to RLHF, and REINFORCE and RLOO. RLOO really builds on the REINFORCE estimator itself, but it's closer to RAFT and iterative fine tuning in that it actually leverages multiple samples.

Arash:

So, yeah. I would say RLOO is sort of a mix between iterative fine tuning and reinforcement learning. And you have PPO, which is your typical policy gradient, off policy but online, optimization.

Robin:

So which ones did you focus on in your paper, and what did you find?

Arash:

Well, we focused on a couple of different axes. The main axis was really seeing if PPO was the right tool for doing RL, or doing RLHF. And what I mean by that is, PPO has a lot of bells and whistles, and it has a lot of augmentations built on top of it, or into it, that are designed to prevent instabilities in your optimization procedure. And these are all borrowed from deep RL and, you know, the instabilities and the high variance issues that come with it. And what we posited, and empirically sort of showed, is that you don't need all the complexity of PPO when you're doing RLHF with a pre trained language model.

Arash:

And something like REINFORCE is a much better fit. So that was the first thing that we focused on. And then we also sort of showed that iterative fine tuning methods like RAFT, where you have access to a reward model and you use it as a means to filter online generations throughout your training, sort of fail to make full use of the samples you generate. And they have issues with robustness due to this filtering that they do. And we also show that RLOO, which is REINFORCE leave one out, is basically a more robust and better version of RAFT in the paper.

Arash:

Yeah. We also did benchmark DPO as a baseline, but the focus was more on the online paradigm as opposed to offline.

Robin:

Can you say a little bit more about RLOO? How does that work? REINFORCE leave one out.

Arash:

RLOO stands for REINFORCE leave one out. I know it's not the best name, but in the spirit of going back to basics, we sort of kept the name from where it was proposed originally. But at a high level, again, it builds on the REINFORCE estimator. And the REINFORCE estimator, really at its core, provides a means to do gradient updates if you want to maximize the probability of your policy producing the generations you want. And you have a notion of a baseline when dealing with the REINFORCE estimator.

Arash:

Because it's known to have really high variance, for a variety of reasons. And what baselines do is reduce the variance of the gradient updates but keep them unbiased, so your optimization remains unbiased. What REINFORCE leave one out does is generate multiple samples for a given prompt, look at the gradient updates for each of those samples individually, and use the remaining samples to create this notion of a baseline, to reduce variance but keep the estimator overall unbiased. So you can think about it through the lens of traditional RL, or actor critic algorithms like PPO or A2C. There you have a value network which is meant to reduce variance.

Arash:

And it's sort of similar to a baseline, but there are some differences. Here you don't have an extra network or a learned baseline. You create this on the fly, sample specific baseline by generating multiple samples. And that's pretty effective in practice. And it's also much easier to implement compared to when you have to load up multiple copies of the model.
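As a concrete illustration of that leave-one-out idea, here is a minimal Python sketch of the baseline for k samples of the same prompt. The function and variable names are illustrative, not taken from the paper's code.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for k samples drawn from the same prompt.

    rewards: tensor of shape (k,), one sequence-level reward per sample.
    Each sample's baseline is the mean reward of the other k-1 samples,
    which reduces variance while keeping the estimator unbiased.
    """
    k = rewards.shape[0]
    total = rewards.sum()
    baselines = (total - rewards) / (k - 1)  # mean of the other samples
    return rewards - baselines

# Hypothetical usage with k = 4 completions of one prompt.
rewards = torch.tensor([1.2, 0.3, 0.9, -0.5])
advantages = rloo_advantages(rewards)
# The policy-gradient loss would then be -(advantages.detach() * seq_logprobs).mean(),
# where seq_logprobs is the summed log-probability of each full completion.
```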

Robin:

And can you remind me of the relationship between REINFORCE and vanilla policy gradient?

Arash:

In the old Sutton introduction to RL book, REINFORCE is presented as a policy gradient algorithm in the full fledged MDP, your traditional RL problem setting. But really, REINFORCE is also an estimator given a single generation from your current policy. So in the paper, what we do is distinguish between the REINFORCE estimator itself, and what happens once you actually build it into the RL framework, where you have multiple states, potentially non terminal states that you visit in a trajectory before you get to the terminal state, and you have intermediary rewards, and you can use that REINFORCE estimator to look at partial rewards.

Arash:

And we call that vanilla policy gradient. And the reason this distinction is actually pretty important is because we talk about action modeling: whether you model a single token as an action, or you model the entire generation as a full on single action. So yeah, that's really the main difference.

Arash:

But at their core, they both use the REINFORCE estimator. One, the REINFORCE algorithm as we refer to it, only uses it at the trajectory level. But vanilla policy gradient uses the REINFORCE estimator at partial completions as well.
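Roughly, in common RLHF notation, the distinction looks like this; the symbols here (prompt x, completion y, reward R, baseline b) are shorthand for this illustration rather than the paper's own notation.

```latex
% REINFORCE at the sequence (bandit) level: the whole completion y is one action.
\nabla_\theta J(\theta) \;\approx\; \big(R(x, y) - b\big)\, \nabla_\theta \log \pi_\theta(y \mid x)

% Vanilla policy gradient: the same estimator applied at every partial completion,
% so token y_t is credited with the return from step t onward.
\nabla_\theta J(\theta) \;\approx\; \sum_{t} \big(R_t - b(s_t)\big)\, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})
```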

Robin:

Okay. So I kinda dismissed REINFORCE in the same way I don't pay much attention to tabular RL, because it seemed obsolete to me. But in the paper, you say why REINFORCE is applicable here even though it typically doesn't work well in deep RL, like in robotics or something. Can you explain why it makes sense here but has been dismissed in other areas?

Arash:

Yeah. So really, it's an issue of variance. It's the same old problem of the variance bias trade off when you're trying to estimate the return of the full trajectory. And this is not a new problem. To my knowledge, the first mention of the variance bias trade off for the return estimator is from the Watkins dissertation in 1989, I think, on page 86.

Arash:

There is a brief mention of this trade off. And then there's a couple of works, I think in 2011, with some theoretical work on bounds for the variance and bias trade off. But really, REINFORCE has really high variance, especially when you're starting from a random initialization of your policy, or one that isn't decent. And this high variance is actually pretty detrimental in settings like deep RL, in Atari games and whatnot, and that's the whole idea of why actor critic algorithms exist. So the original policy gradient method was vanilla policy gradient, where you don't do any bootstrapping, as we call it.

Arash:

You don't have a value function which you use to estimate your full trajectory return; you use the full return of the trajectory as you sampled it in your optimization. And you use the learned baseline as a means to reduce variance. You don't use it for bootstrapping. That's why it's called a learned baseline as opposed to a value network, really. But then, for example, you have TD learning.

Arash:

It's the extreme case of doing bootstrapping using your value function. Meaning that you don't use the entire trajectory return in estimating the return itself; you just look, let's say, n steps ahead, then cut it off and use the estimate of the value network to estimate the rest of the return of your trajectory. And this introduces bias. It's a well known issue.

Arash:

This introduces bias. And in the case of RLHF, you don't really need to reduce variance more. You don't need the trade off. You don't need to introduce bias to reduce variance. And introducing bias is actually really detrimental to your optimization performance.

Arash:

And PPO, at its core, like any other actor critic algorithm, does this trade off to an extent. It uses generalized advantage estimation, which has a specific hyperparameter to tune this trade off. And if set accordingly, that advantage estimation actually reduces to REINFORCE with a learned baseline. And we empirically found that this setting actually performs best, which is a return estimator that does not introduce any bias into your optimization, but in theory has high variance.
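The setting Arash is pointing to corresponds to generalized advantage estimation with its mixing parameter at one (with no discounting), where the advantage becomes the full Monte Carlo return minus the learned baseline. A minimal sketch, with illustrative names and not taken from any particular codebase:

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards r_0 .. r_{T-1}
    values:  value estimates V(s_0) .. V(s_{T-1}); V(s_T) is taken as 0.
    lam trades off bias and variance: lam < 1 bootstraps on the value network
    (lower variance, biased), while lam = 1 with gamma = 1 telescopes to the
    full return minus V(s_t), i.e. REINFORCE with a learned baseline (unbiased).
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_value, next_adv = 0.0, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages
```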

Arash:

Now the reason, like I alluded to, why in theory REINFORCE should not be practical, but it is in the case of LLMs and RLHF, is because your initial policy is really strong. You have a pre trained model which is super strong. And then on top of that, in most cases you do SFT, so supervised learning. And this gives you a really good starting point for your RL, which is not really the case in deep RL. It's really rare that you come across policies that are this good at the start.

Arash:

Most of the time you start from a random init. So that's really the main reason why REINFORCE is not only applicable but actually the right tool for the job: because it's unbiased. It doesn't introduce any bias into your optimization.

Robin:

So you've totally removed the critic, and you're saying it's better.

Arash:

Yeah. So for vanilla policy gradient, you have a learned baseline as opposed to a critic. It's not really a value network anymore, so you don't use it as a critic. You just use it to reduce variance, basically.

Arash:

And that's better, because it doesn't introduce bias into your optimization. Now, that's really the main aspect of PPO, because it's an actor critic algorithm. You have other things, like clipping and trust region policy optimization, things that go into PPO. And in providing further empirical evidence that high variance is not really an issue, you can look at, for example, how large your gradient updates get and how many times they are clipped.

Arash:

Because you have this notion of clipping the updates in PPO if they get too large or if they're too off policy. And really, you don't clip that much, which points to the loss landscape being pretty smooth, and you don't need to deal with high variance gradient estimates. And that's why you also don't need to further reduce variance at the cost of introducing bias.
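As a rough illustration of the clipping diagnostic described here, this is a minimal sketch of the standard PPO clipped surrogate together with a clip-fraction measurement. The names and the 0.2 threshold are common defaults for illustration, not specifics from the paper.

```python
import torch

def ppo_clipped_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss plus the fraction of samples that hit the clip.

    logprobs / old_logprobs: log-probabilities of the taken actions under the
    current policy and the policy that generated the samples. A clip fraction
    near zero is the kind of evidence discussed above: updates stay close to
    on-policy and the clipping rarely binds.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()
    clip_fraction = ((ratio - 1.0).abs() > clip_eps).float().mean()
    return loss, clip_fraction
```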

Robin:

So it sounds like you're finding that PPO is overkill for this problem.

Arash:

100%.

Robin:

Does that surprise you? Were you shocked when you found this, or were you totally expecting it? How did you get to the point of asking this question, and how did you feel about it when you saw the answer?

Arash:

How this whole thing really came about was, I was reading the works by OpenAI on summarization from human feedback and WebGPT, where they do mention using PPO for the RLHF stage. But I really didn't understand why, or the reasoning behind using PPO, aside from the fact that it's the de facto RL algorithm borrowed from deep RL. So really the question in my mind was, is PPO the proper way to do RL in language models? Because the setting is really different.

Arash:

Yeah, that's how this whole research direction really started. And honestly, I would be lying if I said the results did not surprise me in the slightest. They did surprise me, but only to a small extent, because with my intuition and some of the earlier experience I had playing around with this stuff, I did expect REINFORCE to outperform PPO. But I did not expect it to outperform it by this much, as reported in the paper, because the results are pretty significant in terms of the gap between PPO and vanilla policy gradient slash REINFORCE.

Robin:

So you talk about the difference between treating the whole completion as a single rewardable action, like in a bandit setting, versus treating each token as a separate action, which is more how we would generally think of multistep RL.

Arash:

Mhmm.

Robin:

Can you say more about that? I mean, don't all the chat systems of today treat it as a bandit?

Arash:

In the RL formulation, you treat each token as an action, in that when you learn a value network, you output a value estimate for the partial completions that result from the tokens you generated; you don't output a single value estimate only for your given prompt. Right? In practice, in how PPO is implemented under the hood and how it's used for RLHF, you treat each token as a single action. And because you have the KL component in the reward, you're able to define partial completion rewards that are not the same throughout your generation.

Arash:

Even though, again, in practice, the reward you get is only attributed to the EOS token at the end of your generation.

Robin:

So I wonder if you could explain that a little more clearly. You're saying that we're only getting rewarded for the whole thing. But somehow, there's still a token by token notion of value. Is that what you're saying?

Robin:

I didn't understand the second part.

Arash:

The reward, so for regularization, we have this KL component, the difference between the log probs of the generation, built into the reward function itself, by definition. So it's r minus beta times the KL. Now that second component is actually token wise, because you take the difference between the log probabilities of each token, but then you sum it up to get the sequence level KL. That's the component which really leads to the single token action modeling being applicable here, because you can define a different trajectory reward given your starting position in your partial completion. Even though your reward model gives the same reward for the entire completion, you have different KL components depending on where you start, because you skip over some tokens.
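A minimal sketch of that shaped reward, assuming the usual RLHF setup where a frozen reference model provides the second set of log probabilities; the names here are illustrative.

```python
import torch

def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Sequence-level reward with the KL penalty folded in, as described above.

    reward: scalar score from the reward model for the full completion.
    policy_logprobs / ref_logprobs: per-token log-probabilities of the sampled
    completion under the trained policy and the frozen reference model.
    The per-token log-ratio is summed into a sequence-level KL estimate,
    and the shaped reward is r - beta * KL.
    """
    token_kl = policy_logprobs - ref_logprobs  # token-wise log-ratio
    sequence_kl = token_kl.sum()               # summed to the sequence level
    return reward - beta * sequence_kl
```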

Arash:

And that's precisely one of the main arguments of the paper: you should treat it as a bandit problem. And in a bandit problem, you should only use the REINFORCE estimator on the entire trajectory return. You just ignore partial completions, and you don't care about the token level or partial completion level reward. Which, really, in practice, PPO and RL as typically used for RLHF don't do; they don't make this distinction.

Arash:

So it is more intuitive to do bandit modeling, because you only attribute a single reward to the entire generation. But in practice, that has not happened, which is precisely why we try to emphasize it in this work.

Robin:

Do you have any thoughts about the whole LLM RLHF human preferences paradigm today, the way the general framework works? Like, do you think we can get further just by making some improvements to it, or are we gonna need to really change to a different framework? And one of the things, like you were saying, you only get the reward at the end of the completion. But I'm sure some people are working on per token rewarding, or

Robin:

Self play, or we've heard rumors about Q Star. Do you have any thoughts about these things, or quite different paradigms?

Arash:

It really depends on the nature of your reward signal and the type of optimization you do. In typical human preference training, you give it the entire sequence and you get a single reward at the end. The idea in the paper, and the position, is that for this you should use a bandit formulation, because it's just less computationally heavy and it's more intuitive. Now, there are other use cases where you actually have access to intermediary rewards. Like, let's say you have code, right, or math.

Arash:

There you can extract intermediate rewards for, let's say, each line of code or each equation written in your proof. There the bandit formulation would probably lag behind. If you wanna keep using the REINFORCE estimator, you would use something like vanilla policy gradient, where you actually take into account the multistate transitions and the whole shebang of having intermediary states before your terminal state. So from that aspect, it's really a question of what you're trying to optimize. What's the objective?

Arash:

And what's the nature of the data? Now, there's been some work, like you mentioned, like DPO, IPO, KTO, and FDPO, where they try to remove the RL components of RLHF. And they really do make the problem simpler, but it's within one problem formulation that they're really applicable. And that's for human preference, where you have the Bradley Terry model and you get a sequence level reward; that's really where they also work best in practice.

Arash:

So yeah, with regards to Q Star type training, again, rumors, I'm also speculating myself. But I think for tasks that require more reasoning and are more complicated than just human preference on the typical datasets that the research community works on, you will need intermediate rewards to enable generalization, like I mentioned in code or math. And for those types of things, something like Q networks, or again vanilla policy gradient or REINFORCE, I think would work pretty well.

Robin:

So do you think your work that we discussed here could impact how preference training is done at Cohere? And can you share anything about that?

Arash:

Yeah, no, I mean, for sure. So, you know, Cohere as a company is always up to date with regards to what's happening in the research field. And this is no exception.

Arash:

This work is indeed public. And even though it was an effort by people within Cohere For AI and Cohere, the findings in it are pretty significant, in that, one, it removes a lot of complexity, which enables, let's say, smoother adoption of algorithms like REINFORCE, RLOO, or vanilla policy gradient into things like product or prod models, as opposed to something like PPO. But also, zooming out to a high level, what this work really tries to emphasize is that the optimization side of RLHF, or learning from an external reward signal where you have a pretrained model, is not difficult. It's really about how you curate the signal, how you actually come up with that signal, and the type of data you have access to.

Arash:

So the focus should really be on reward model training. How to make our reward models more robust, how to make them more accurate and better at generalizing, things of that nature, as opposed to coming up with ways to make the optimization itself, given an out of the box reward model, better. So I think these findings will have an impact for sure at Cohere, and I imagine at other places too, because these are pretty high level statements about the general positioning of RLHF as a paradigm, as opposed to proposing a single method.

Arash:

Yeah. Which I'm quite happy about.

Robin:

I bet. Congrats. This is pretty cool.

Arash:

Yeah, I appreciate it. I'm glad that we could get this work out to clear up some of this, because it really does shake the foundational beliefs in RLHF, things that people have sort of stuck to for the past couple of years just because the first papers that proposed RLHF used them, you know?

Arash:

No one had really taken the time to dig into, okay, why do we use PPO? Is it really necessary? And to try to discern the differences between deep RL and LLMs and RLHF. Yeah.

Robin:

Yeah. Well, that's the reason I wanted to have you on the show today. I noticed the paper when Matthias shared it in a thread on Twitter. Just a note, I heard about your paper in the replies to my replies to Nathan Lambert's tweet on Twitter slash X. Nathan tweeted about the use of REINFORCE in Google Gemma, which I was asking about.

Robin:

And Arash's colleague and coauthor Matthias Gallé at Cohere shared this paper on the thread. Nathan is at Allen AI and he was our guest back in episode 19. So check that out, and check out his Twitter and his newsletter, interconnects.ai. What do you hope or plan to work on in the future?

Arash:

I think the main thing for me now is really digging into the reward modeling aspect, or not even calling it a reward model, but having an external evaluator or an external signal that you can use to align, quote unquote, your language model. So really digging into how we can make these reward models more robust. How can we effectively incorporate continuous, let's say, user data into our reward models? How can we ensure robustness, generalizability, things of that nature? And also, I'm still in Cohere For AI, but, you know, I was a research scholar within Cohere For AI.

Arash:

And there we have a really big focus on multilinguality. You might have heard of the Aya model, which we released pretty recently. It's the best open source multilingual model out there. And I think the intersection of RLHF with multilingual models hasn't really been explored, and it hasn't really been done justice in the research field. So I'm also pretty excited to explore things in that direction, at the intersection of multilingual models and human preferences, or having external signals that you can do RL on.

Arash:

But yeah, you never know what happens. I feel like that's especially been the case for me. Earlier on, I was doing more of the efficiency type of work, quantization and mixture of experts.

Arash:

And now I'm doing reinforcement learning. But yeah, that's what I see for the near term future.

Robin:

So outside your own work, are there other interesting things happening in RL or in preference training that you wanna mention?

Arash:

Yeah. So there's been sort of a plethora of work in the direction of having better evaluators and having a better external signal, like I mentioned. And one paper that did catch my eye was self rewarding language models. I'm not sure if you're familiar with that, but at its core, it's again taking out the separate reward model and treating the policy, slash the general model, as an evaluator on its own. I think paradigms or systems where it's mostly synthetic, but there is a little supervision from an outside source, whether it's human or something else,

Arash:

Like, let's say, a tool or a web browser, I think those are pretty promising. Things like the self rewarding language models paper are a step in the right direction, in that they try to minimize the dependency on human feedback. I'm personally a bit skeptical about the degree to which we can completely remove any outside influence. It doesn't need to be humans.

Arash:

Like, any outside contact with such synthetically driven systems. But I think we can definitely minimize it. Minimize it, but make it count, the minimal interaction with outside tools and outside supervisors. So yeah, I'm honestly pretty excited for that.

Arash:

But also, I really hope to see more research really digging into reward models and the notion of the signal that we're trying to align the language model to. I feel like, as a research field in general with LLMs, that type of work will be less flashy, but I think it will really matter in the long run, in terms of providing a better understanding of RLHF as a paradigm, but also tackling problems which really will make a huge impact, as opposed to things like coming up with a new method to do the optimization. Because really, at the end, data is king, and the data is directly used to train your reward model, or is used to provide that external feedback.

Robin:

And while you're here, is there anything else you wanna mention to the audience?

Arash:

It was pretty great to be on here. I'm super happy that I got the chance to put a spotlight on the high level position of the paper. And I really hope works like this encourage researchers to take a step back and not take things for granted just because the first people who published in a paradigm proposed those methods. But really, take everything step by step and question everything and its validity at a higher level. So yeah, thanks for having me.

Robin:

Thanks so much for being here, Arash Ahmadian. This has been great.

Arash:

Yeah. I know. It's been awesome. Thanks again for having me.