{"type":"rich","version":"1.0","provider_name":"Transistor","provider_url":"https://transistor.fm","author_name":"TalkRL: The Reinforcement Learning Podcast","title":"Arash Ahmadian on Rethinking RLHF","html":"<iframe width=\"100%\" height=\"180\" frameborder=\"no\" scrolling=\"no\" seamless src=\"https://share.transistor.fm/e/e54fabe1\"></iframe>","width":"100%","height":180,"duration":2010,"description":"Arash Ahmadian is a Researcher at Cohere and Cohere For AI focussed on Preference Training of large language models. He’s also a researcher at the Vector Institute of AI. Featured Reference: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker. Additional References: Self-Rewarding Language Models, Yuan et al 2024; Reinforcement Learning: An Introduction, Sutton and Barto 1992; Learning from Delayed Rewards, Chris Watkins 1989; Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Williams 1992","thumbnail_url":"https://img.transistorcdn.com/jXB1-VPK-A9v1epzc4aG4pFxqlvo2vbQ_Ytyuar_gPI/rs:fill:0:0:1/w:400/h:400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS9zaG93/LzIwNDcvMTcwNzk1/NDcxMS1hcnR3b3Jr/LmpwZw.webp","thumbnail_width":300,"thumbnail_height":300}