TalkRL: The Reinforcement Learning Podcast

Posters and Hallway episodes are short interviews and poster summaries, recorded at RLC 2024 in Amherst, MA.

Featuring:  

Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

Speaker 1:

I'm here with David Radke of the Chicago Blackhawks.

Speaker 2:

Yeah. Hi. Thanks for having me. I'm a senior research scientist with the Chicago Blackhawks in the National Hockey League. I'm taking multi-agent AI research into the world of sports.

Speaker 2:

2 years ago, I published a Blue Sky paper at AAMAS, essentially making the argument that what statistics did for striking games such as baseball, multi-agent AI research will be necessary to do for sports like ice hockey, basketball, and soccer.

Speaker 1:

Awesome. And how's RLC going for you, and was there anything of particular interest for you at RLC this year?

Speaker 2:

Yeah. RLC has been great. I think the most interesting part has been the keynotes. There have been a lot of really interesting speakers, and, yeah, the coffee breaks. It's a really good community here, and getting to talk to everybody about the work they're doing has been great.

Speaker 1:

I'm here with Abhishek Naik of the University of Alberta.

Speaker 3:

At a high level, in my PhD I was working on reinforcement learning (duh), but in the non-episodic setting where, you know, life goes on forever: no resets, no episodes, nothing like that, much like how life is for us human beings and other beings living here. So, yeah, most of my work was along those lines, particularly focusing on the average-reward formulation. And the work I'm presenting here is actually about making all discounted methods better, using a very simple idea coming from some average-reward connections. It's, yeah, basically: just learn the average reward. And then it has potentially big implications, in the sense that there is no instability or divergence as you increase your discount factor, because that is something that we see very often in RL.
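
A minimal sketch of the idea described above, assuming a Gymnasium-style environment with discrete states: keep a running estimate of the average reward and subtract it from each observed reward before the usual discounted Q-learning update. The update rules and hyperparameters are illustrative, not the exact algorithm from the poster.

```python
import numpy as np

def centered_q_learning(env, n_states, n_actions, steps=100_000,
                        alpha=0.1, eta=0.01, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with a reward-centering term: a running estimate
    of the average reward (r_bar) is subtracted from each observed reward
    before the discounted TD update."""
    Q = np.zeros((n_states, n_actions))
    r_bar = 0.0                                    # average-reward estimate
    s, _ = env.reset()
    for _ in range(steps):
        # epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        bootstrap = 0.0 if terminated else gamma * np.max(Q[s_next])
        td_error = (r - r_bar) + bootstrap - Q[s, a]   # centered reward: r - r_bar
        Q[s, a] += alpha * td_error
        r_bar += eta * (r - r_bar)                 # track the average reward
        s = s_next
        if terminated or truncated:
            s, _ = env.reset()
    return Q, r_bar
```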

Speaker 3:

Right? And, yeah, many other things. So I'll talk about it during my poster. Yeah. That's all I want to say about it.

Speaker 1:

What are the highlights of RLC for you?

Speaker 3:

Oh, the fact that it is small and focused. You know, if you've been to NeurIPS or ICML, those are, like, there's a warehouse full of people, and, yeah, ten thousand people going around. The scale of it is insane. And this is, you know, barely 500 people. You can almost talk to everyone.

Speaker 3:

And, yeah, the posters are, you know, 10 in a room, so you can actually interact with everyone there. And most of the things are something that you're interested in, so it's been very fruitful. And I love the many conversations that I've had with people here. So, yeah, that's been the highlight for me.

Speaker 4:

So my name is Daphne Cornelisse. I'm a PhD student at NYU. So I'm here presenting our work on human-compatible driving partners through data-regularized self-play RL. In this work, we were motivated by improving the ability of self-driving vehicles to coordinate with humans, which is currently quite challenging.

Speaker 4:

So here, what we did is we basically combined self-play with regularization to create driving agents that are both effective, meaning that they can navigate to particular goal positions, but also human-like, meaning they drive to those goal positions in a human-like way. And we are interested in doing this because, while we have access to a large dataset of human demonstrations, meaning we have static trajectories of human drivers in a simulator, these trajectories don't respond to other agents. So we would like to learn a model based on these human trajectories.

Speaker 1:

Can I ask you a question? So you're making an imitation learning model. Can you explain why you wouldn't do inverse reinforcement learning to understand what the drivers are doing? Or would that not seem tractable here?

Speaker 4:

Yeah. Good question. I think you could use inverse reinforcement learning here, but we went for a two-step method where we first do imitation learning and then guided self-play, because it's kind of challenging to reverse-engineer the rewards, and we are not necessarily interested in what exactly the reward function is. We were more interested in using the human data to create agents that are more human-like. So, yeah, it's not so much about reverse-engineering the exact reward function; it's more about using the human data to create policies that are human-like, so that we can populate our simulators with these human-like driving agents.

Speaker 4:

Yeah. So let me walk you through the setting. In this work, we used Nocturne, which is a multi-agent driving simulator that works with the Waymo Open Motion Dataset. Our environment is partially observable. The agents can see the road map.

Speaker 4:

They can also see other agents, though some agents that are too far away may not be visible to them. And as for the reward function, basically, we want to keep it as simple as possible to let the behavior emerge from the human data. So the reward function is just plus 1 if agents reach the goal position and 0 otherwise.

Speaker 4:

And we divided our evaluation...

Speaker 1:

if they don't crash?

Speaker 4:

Yeah, if they don't crash. So we have two classes of metrics here to evaluate the agents, and now I'm pointing to a picture because I know no one can see this. One class of metrics is about effectiveness, and here we measure, yeah, goal achieved: whether they reach the goal, whether they collide.

Speaker 4:

So if they collide, the agents are removed from the scene, so they cannot achieve their goal anymore. And there are three metrics here: goal achieved, off-road, and vehicle collision, that is, whether they collide with another vehicle, whether they go off the road, or whether they achieve their goal. So if they achieve their goal, that's good, right?

Speaker 4:

They achieved their goal, they have not collided with another agent, and they have not gone off the road. The second class of metrics is what we call realism. The goal of this set of metrics is to evaluate whether these agents are actually human-like. And this is not a super easy question, because you can look at a video and it's relatively easy to say whether this vehicle is human-like or not, but it's another thing to use metrics to easily evaluate agents in this respect.

Speaker 4:

But there are metrics, and the two metrics that we used are the average displacement error, which is just, if you compare the ground-truth trajectories from the human data and the actual trajectories from the agents, what the difference is between the two; and then we can also compare the difference between the actions of the agents. So, yeah, our approach is what we call guided self-play, and it consists of two steps.
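
As a rough illustration of the first realism metric mentioned above, average displacement error is commonly computed as the mean Euclidean distance between logged human positions and the agent's rollout positions at matching timesteps; the array shapes below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def average_displacement_error(human_xy, agent_xy):
    """Mean Euclidean distance between the logged human positions and the
    agent's rollout positions at matching timesteps.
    Both inputs are assumed to be arrays of shape (T, 2) holding (x, y)."""
    human_xy, agent_xy = np.asarray(human_xy), np.asarray(agent_xy)
    return float(np.linalg.norm(human_xy - agent_xy, axis=-1).mean())
```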

Speaker 4:

So the first step is imitation learning. Here, we just collect a set of human trajectories in a simulator, and then we obtain a behavioral cloning policy from this data by maximum likelihood; we minimize the negative log-likelihood. And then the second step is guided self-play, where we essentially just augment the standard PPO objective with a weighted KL divergence term between the current policy, at a certain time step in the rollout, and the statically obtained, learned behavioral cloning policy.
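
Here is a minimal sketch of the kind of objective described above: a clipped PPO surrogate loss with an added, weighted KL-divergence penalty toward a frozen behavioral cloning policy. The function names, interfaces, and the KL weight are illustrative placeholders, not the authors' implementation.

```python
import torch
from torch.distributions import kl_divergence

def guided_ppo_loss(policy, bc_policy, obs, actions, advantages,
                    old_log_probs, clip_eps=0.2, kl_weight=0.1):
    """Clipped PPO surrogate loss plus a weighted KL penalty that pulls the
    learned policy toward a frozen behavioral-cloning policy. `policy(obs)`
    and `bc_policy(obs)` are assumed to return torch.distributions objects."""
    dist = policy(obs)
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    ppo_term = -torch.min(ratio * advantages, clipped * advantages).mean()

    with torch.no_grad():
        bc_dist = bc_policy(obs)          # frozen "anchor" policy from step 1
    kl_term = kl_divergence(dist, bc_dist).mean()

    return ppo_term + kl_weight * kl_term
```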

Speaker 1:

So you're using Nocturne, and I saw on Twitter there's a new, faster framework. Was that from your group?

Speaker 4:

Yeah, yeah. So we now have a GPU-accelerated version of Nocturne; we rewrote Nocturne entirely based on the Madrona framework.

Speaker 4:

So now it's, like, 30 to 40 times faster. Yeah. So if anyone is interested in using our simulator, we suggest using GPUDrive.

Speaker 5:

I'm Shray Bansal. I'm a postdoc at Georgia Tech. My work is titled Reinforcement Learning with Cognitive Bias for Human-AI Ad Hoc Teamwork. Ad hoc teamwork involves agents that are paired together in a fully cooperative task with no prior interaction. The challenge is how we learn to perform this task without actually knowing how the other agent will behave.

Speaker 5:

The prior solutions try to learn adaptable strategies by pairing with a population of agents with different forms of diversity. Our insight is that trying to adapt to all possible behaviors is inefficient, and human decision making is not rational; we can use these systematic biases, in the form of cognitive biases, and incorporate them into reinforcement learning. We model the game as a Markov game, and we incorporate cognitive bias by either modifying the Markov game or modifying the policy space. Our method learns biased agents with self-play and then learns a best-response strategy to a population of these biased agents.
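
A very rough sketch of the "best response to a population of biased partners" structure described above, assuming a Gymnasium-style two-player cooperative environment. `make_env`, the agent interface, and the biased population are hypothetical placeholders, not the authors' code.

```python
import random

def train_best_response(agent, biased_population, make_env, episodes=10_000):
    """Learn a best response against a population of (hypothetically) biased
    self-play partners: each episode samples a partner policy and updates the
    agent while playing the cooperative task with it."""
    for _ in range(episodes):
        partner = random.choice(biased_population)   # sampled biased partner
        env = make_env(partner)                      # two-player cooperative task
        obs, _ = env.reset()
        done = False
        while not done:
            action = agent.act(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            agent.update(obs, action, reward, next_obs)
            obs = next_obs
            done = terminated or truncated
    return agent
```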

Speaker 5:

And what we show in our preliminary results is that our method is able to perform significantly better than other agents that are using best-response models, while using a smaller population size.

Speaker 6:

I am Claas Voelcker. I'm a PhD student at the University of Toronto, and I'm currently presenting our work, Can We Hop in General? What we did is relatively simple: we asked ourselves, okay, if two benchmarks look very similar to us, does it actually mean that algorithms perform similarly well on them?

Speaker 6:

And what we realized is we have a very nice setup, where the MuJoCo locomotion environments that people love to use to test their RL algorithms actually exist in two formats, one published by DeepMind and one by OpenAI. And what we realized is that algorithmic performance doesn't actually translate from one of these benchmarks to the other. Some algorithms just completely fail on one of the variants, which leads us to present our work, which basically argues that we as a community need to start paying much more attention to how we build, design, and select the benchmarks that we test on. Thank you.
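
For readers who want to see the two formats side by side, here is a small sketch that loads the Hopper task in both its Gymnasium (OpenAI lineage) and DeepMind Control Suite forms. It assumes gymnasium with the MuJoCo extra and dm_control are installed; version suffixes may differ on your setup.

```python
import gymnasium as gym
import numpy as np
from dm_control import suite

# OpenAI-lineage Gymnasium Hopper
gym_env = gym.make("Hopper-v4")
obs, _ = gym_env.reset(seed=0)
obs, reward, terminated, truncated, _ = gym_env.step(gym_env.action_space.sample())
print("gymnasium obs shape:", obs.shape, "reward:", reward)

# DeepMind Control Suite Hopper
dm_env = suite.load(domain_name="hopper", task_name="hop")
spec = dm_env.action_spec()
ts = dm_env.reset()
ts = dm_env.step(np.random.uniform(spec.minimum, spec.maximum, size=spec.shape))
print("dm_control obs keys:", list(ts.observation.keys()), "reward:", ts.reward)
```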

Speaker 7:

I'm Brent Venable from the Florida Institute for Human and Machine Cognition, and this work is on cooperative information dissemination with graph-based multi-agent reinforcement learning, in collaboration with my colleagues Raffaele Galliera, Matteo Bassani, and Niranjan Suri, also from IHMC and UWF. In this work, we look at communication networks where nodes are actually moving. They broadcast information to the nodes that are currently in range, and they have to decide whether to broadcast a piece of information or not. Current methods are centralized, so they have to know what the topology of the network looks like. We take a decentralized approach.

Speaker 7:

The first one is where we create an encoding of the two-hop neighborhood of a node. So we compact that information and allow the node to decide, through a DQN, whether it's a good idea to forward the message or not. And the other one is basically a simpler strategy where there is no exchange between the nodes of the representations of their neighborhoods. So if you look at our methods, they do really well. They reach coverage of the dissemination over the graph on the order of 95%.
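
A minimal sketch of the decentralized decision described above: each node compacts its two-hop neighborhood into a fixed-size feature vector, and a small Q-network scores the two actions (drop vs. forward). The features and network sizes are illustrative placeholders, not the paper's architecture.

```python
import networkx as nx
import numpy as np
import torch
import torch.nn as nn

def two_hop_features(graph: nx.Graph, node) -> np.ndarray:
    """Compact, fixed-size summary of a node's two-hop neighborhood
    (placeholder features, not the paper's encoding)."""
    one_hop = set(graph.neighbors(node))
    two_hop = {v for u in one_hop for v in graph.neighbors(u)} - one_hop - {node}
    return np.array([len(one_hop), len(two_hop), graph.degree(node)], dtype=np.float32)

# Small Q-network scoring the two actions: 0 = drop, 1 = forward.
q_net = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 2))

def should_forward(graph: nx.Graph, node) -> bool:
    feats = torch.from_numpy(two_hop_features(graph, node))
    with torch.no_grad():
        q_values = q_net(feats)            # Q(drop), Q(forward)
    return bool(q_values.argmax().item() == 1)
```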

Speaker 7:

They send slightly more messages than the state of the art, but with higher coverage. So this is exciting, and what we're doing next is working on realistic emulations of real tactical networks and complicating the problem even more.