TalkRL: The Reinforcement Learning Podcast

Posters and Hallway episodes are short interviews and poster summaries. Recorded at RLC 2024 in Amherst, MA.


Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

Speaker 1:

Hello. Welcome to all the RL fans around the world. My name is Hector. I traveled all the way across the Atlantic to come present my poster at the RL conference. We studied, with my coauthor Quentin, who is here too, how to extract decision tree policies from deep RL agents, similar to the VIPER algorithm, but we went one step further.

Speaker 1:

We extracted oblique decision trees and converted them to Python programs. We showed in our work that they can get close to the performance of a PPO neural network oracle. We did a short user study to validate how trustworthy and easy to understand a Python program encoding a decision tree is. And we also had a super cool experiment on real agricultural data that was validated by a professional from agriculture. And we managed to explain some agriculture professionals' policies on when they fertilize their soil with some decision trees, which is very encouraging for the future of having RL policies in the real world.
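
For a concrete picture of the general extraction recipe being described (in the spirit of VIPER, not the authors' exact method), here is a minimal Python sketch: roll out a trained neural-network oracle, record its state-action pairs, and fit a decision tree to imitate it. The environment, the stand-in oracle, and the hyperparameters are all hypothetical placeholders.

import gymnasium as gym
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def collect_oracle_data(env, oracle_policy, n_steps=5000):
    """Roll out the neural-network oracle and record (observation, action) pairs."""
    observations, actions = [], []
    obs, _ = env.reset(seed=0)
    for _ in range(n_steps):
        action = oracle_policy(obs)          # e.g. argmax of a trained PPO policy
        observations.append(obs)
        actions.append(action)
        obs, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()
    return np.array(observations), np.array(actions)

# Hypothetical setup: a discrete-action environment and a placeholder oracle.
env = gym.make("CartPole-v1")
oracle_policy = lambda obs: int(obs[2] > 0)  # stand-in for a trained PPO network

X, y = collect_oracle_data(env, oracle_policy)
tree = DecisionTreeClassifier(max_depth=4).fit(X, y)  # the distilled, interpretable policy
print(f"Tree matches the oracle on {tree.score(X, y):.1%} of visited states")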

Speaker 2:

Awesome. Can you say more about the agriculture task? What was the data and what were you trying to do with that?

Speaker 1:

Yes. So the agriculture task is soil fertilizing. What we had access to was what the expert said, some kind of gut feeling from an agriculture expert. So he said, okay, me?

Speaker 1:

Here is how I fertilize my soil. I put some fertilizer on day 0, then I come back a few days later and I fertilize again. Then I come back a few days later and I fertilize again, because this is my gut feeling. So we're like, okay, let's try to see if there is some science or intuition behind your gut feeling as an agriculture expert.

Speaker 1:

So we took this dictionary policy that mapped days since last fertilizer to quantity of fertilizer, and we extracted an oblique decision tree encoding this policy. So we now have a decision tree that performs the exact same actions as the agricultural expert, except that instead of being based on the days since last fertilizing, it is based on the nitrogen level of the soil, the weight of the grains, and the size of the plants. And this tree policy, based on real agricultural features and inputs, has again been validated by an agricultural expert who said, yeah, actually, that makes sense. That is why we fertilize on those days, because of those values.
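
To illustrate what an oblique decision tree converted to a Python program can look like, here is a toy sketch using the agricultural features mentioned above; the coefficients, thresholds, and fertilizer quantities are made up for illustration and are not the extracted policy.

def fertilization_policy(nitrogen_level, grain_weight, plant_size):
    """Toy oblique decision tree rendered as a readable Python program.

    Each split tests a linear combination of features (that is what makes the
    tree 'oblique'); the coefficients and thresholds below are illustrative only.
    """
    if 0.8 * nitrogen_level - 0.3 * plant_size < 12.0:
        if 0.5 * grain_weight + 0.2 * plant_size < 30.0:
            return 40.0   # quantity of fertilizer to apply
        return 20.0
    return 0.0            # soil is rich enough, do not fertilize

print(fertilization_policy(nitrogen_level=10.0, grain_weight=45.0, plant_size=25.0))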

Speaker 3:

My name is Quentin Delfosse, and I'm talking to you today about interpretable concept bottlenecks to align reinforcement learning agents. I'm from the machine learning lab of TU Darmstadt. In this work, basically, what we were doing is trying to find a way to extract interpretable agents and understand their decisions. So we really wanna follow why they actually take the actions they take. And what we're doing is extracting an interpretable agent that was playing Pong.

Speaker 3:

So Pong is a famous game where you have basically 2 paddles. You're controlling your paddle and you're trying to send the ball behind the enemy, basically. And when we were looking at, like, interpretable policies, we were understanding that it's actually not relying on the ball. And we're like, why would a policy that is interpretable not rely on the ball at all? And what was going on actually is that you have a spurious correlation between the enemy and the ball position.

Speaker 3:

The enemy is programmed to follow the ball. So most of the agents are actually gonna focus on following the enemy instead of following the ball. Now, if you change your training environment to a more lazy environment where the enemy is actually not moving anymore, what is happening is that the enemy now is luring your agent into, like, not moving as well. And your agent is now completely unable to play at all. And so if you have interpretable agents, if you have, like, interpretable concepts all the way through, then you can actually prune them out and you can actually correct your agent in a very, very easy way.

Speaker 3:

And we show in this paper that for many, many problems, like sparse-reward reinforcement learning or difficult credit assignment, for example, you're gonna have a very, very easy time correcting those misaligned or problematic agents. Whereas if you have deep agents, you actually cannot correct them, and sometimes you don't even know that they are misaligned.
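
As a rough sketch of the concept-bottleneck idea, here is a toy PyTorch module where the policy only sees a small vector of named concepts, so a spurious concept (such as the enemy paddle's position in Pong) can be masked out after training. The concept names, dimensions, and architecture are hypothetical, not the paper's.

import torch
import torch.nn as nn

CONCEPT_NAMES = ["ball_x", "ball_y", "player_y", "enemy_y"]  # illustrative concepts

class ConceptBottleneckPolicy(nn.Module):
    """Observation -> named concepts -> action logits."""
    def __init__(self, obs_dim=128, n_actions=6):
        super().__init__()
        self.concept_net = nn.Linear(obs_dim, len(CONCEPT_NAMES))
        self.policy_head = nn.Linear(len(CONCEPT_NAMES), n_actions)
        # A mask over concepts lets us prune spurious ones after training.
        self.register_buffer("concept_mask", torch.ones(len(CONCEPT_NAMES)))

    def prune_concept(self, name):
        self.concept_mask[CONCEPT_NAMES.index(name)] = 0.0

    def forward(self, obs):
        concepts = self.concept_net(obs) * self.concept_mask
        return self.policy_head(concepts)

policy = ConceptBottleneckPolicy()
obs = torch.randn(1, 128)
policy.prune_concept("enemy_y")      # stop the agent from relying on the enemy paddle
print(policy(obs).argmax(dim=-1))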

Speaker 4:

Hi. My name is Sonia Johnson-Yu. I am a PhD student at Harvard University, advised by Kanaka Rajan, and my work is titled Understanding Biological Active Sensing Behaviors by Interpreting Learned Artificial Agent Policies. My goal is to understand animal behavior, specifically by creating these agents that act as a proxy for the animal behavior. So in this case I have trained an agent that does active sensing, much like you would see in weakly electric fish.

Speaker 4:

And so, at each time step, this deep RL agent that I've trained is going around this 2D arena, continuous action space, continuous observation space, and it's moving, it's turning, and it's choosing to emit a pulse. And what I'm really interested in understanding is what is going on and how do they decide when to emit these pulses. One of the things that's known in the fish literature is that there's a lot of variability in when they pulse when they're foraging. And so the hope is, by constructing an agent like this, we can understand what's really going on with the fish. And so when we look at the behavior that we see in our agents, 2 behavioral modes emerge.

Speaker 4:

And this is something that we can see both in terms of the pulses that the agents are emitting as well as in terms of the velocity and how vigorously they're moving around. And those two primary behavior modes are resting and active foraging. So when they're resting, generally they're just hanging out and kind of letting there be pretty big gaps between the pulses. And when they're actively foraging, there's lots of rapid-fire pulsing and a lot of swimming around quickly. This is something that is also reflected when you look at the hidden state in the RNN.

Speaker 4:

So the RNN is basically the brain of this agent. And so if we look at the hidden state and we visualize that in latent space, we can actually see this partitioning between these two behavioral modes. So the really cool thing here is that we kind of went in and we discovered that the agent learned to solve this problem of foraging by developing these 2 different behavioral modes.
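
A minimal sketch of the kind of hidden-state analysis described here: collect the RNN's hidden states, project them to two dimensions with PCA, and compare where the two behavioral modes land. The hidden states and mode labels below are random placeholders rather than the actual agent's.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder data: 1000 hidden states of size 64, with a binary mode label
# (0 = resting, 1 = active foraging) that would come from the pulse/velocity analysis.
hidden_states = rng.normal(size=(1000, 64))
mode_labels = rng.integers(0, 2, size=1000)

projected = PCA(n_components=2).fit_transform(hidden_states)

# With real data, plotting `projected` colored by `mode_labels` would show the
# partitioning of the latent space into the two behavioral modes.
for mode in (0, 1):
    centroid = projected[mode_labels == mode].mean(axis=0)
    print(f"mode {mode} centroid in PC space: {centroid}")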

Speaker 5:

Well, hello. I'm Janu Gregori from the Technical University of Darmstadt, and we have built a project called OCAtari, where we want to create an object-centric environment for Atari agents. We actually created this by using the RAM, because we tried it with visual methods and saw, well, the object-centric performance is not that great when you only use visual methods, but we still want to be object centric, since only working with visual recognition doesn't work that well for our neurosymbolic learning. And so we actually used the RAM to reverse engineer where the objects are positioned on the image of the Atari game. And so we actually created an environment that allows us to quickly extract object positions and therefore have very interpretable decision making for the agent, to see why it is moving where it does and what is causing it to actually perform in a certain way.
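
As a rough illustration of the RAM trick (not the OCAtari code itself), here is a sketch that decodes a few object positions from a raw Atari RAM snapshot; the RAM offsets are illustrative placeholders, since the real per-game offsets are reverse engineered.

import numpy as np

# Illustrative RAM offsets for a Pong-like game; the real per-game offsets that
# OCAtari uses are reverse engineered and will differ.
RAM_MAP = {"player_y": 51, "enemy_y": 50, "ball_x": 49, "ball_y": 54}

def extract_objects(ram):
    """Turn a 128-byte Atari RAM snapshot into a dict of object positions."""
    return {name: int(ram[addr]) for name, addr in RAM_MAP.items()}

# In practice the RAM snapshot would come from an ALE environment configured to
# return RAM observations; here we just use a dummy array.
ram = np.zeros(128, dtype=np.uint8)
ram[[49, 50, 51, 54]] = [80, 40, 60, 100]
print(extract_objects(ram))  # object-centric state, no vision model needed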

Speaker 2:

With this environment, can you train an Atari agent more quickly?

Speaker 5:

You definitely can train an Atari agent more quickly, because it takes, like, 1/100th of the time you actually would need if you're using a visual method, which is probably the biggest advantage of using OCAtari compared to the usual visual method. But another advantage would be, for example, that objects overlapping with each other are often weirdly displayed in these visual methods. And if you use OCAtari, you still have a clear image of where the object is, even if it's behind another object in the game.

Speaker 6:

I'm Cam Allen, and this is joint work with Aaron Kirtland and David Tau. It's called Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy. The poster is really about trying to get agents that can determine whether they're in a POMDP or an MDP, basically looking at the world and seeing if they have all the information they need to make decisions. And one thing that you can use to sort of resolve this question is value functions.

Speaker 6:

There's a couple of ways of estimating value functions. And some of them make this assumption that you can do bootstrapping, that you have essentially a Markov state. But not all of the value estimation methods make that assumption. And so you estimate it one way that makes the assumption and one way that doesn't and compare the difference; this is like a TD value function versus a Monte Carlo value function. If you compare the difference between these two value functions, and the difference is nonzero, it'll tell you whether you're in a POMDP or an MDP.

Speaker 6:

So now you have this signal that the agent can leverage to decide which of these two types of observations it has. And not only that, you have a signal that you can minimize to resolve that partial observability. And you can get to the point where you learn a Markov representation for your problem and you can do much better decision making with that.
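
To make the comparison concrete, here is a tiny tabular sketch, on a made-up trajectory, of the signal being described: estimate observation values once with bootstrapped TD and once with a Monte Carlo style estimate, and check whether they disagree. It only illustrates the detection signal, not the paper's method for minimizing it.

import numpy as np

def td_lambda_values(obs_ids, rewards, n_obs, gamma=0.99, lam=0.0, lr=0.05, sweeps=500):
    """Tabular TD(lambda) value estimates indexed by observation (not true state).

    lam=0 gives the bootstrapped TD estimate; lam=1 approximates the Monte Carlo
    estimate. A gap between the two is the lambda discrepancy.
    """
    V = np.zeros(n_obs)
    for _ in range(sweeps):
        traces = np.zeros(n_obs)                 # eligibility traces
        for t, (o, r) in enumerate(zip(obs_ids, rewards)):
            v_next = V[obs_ids[t + 1]] if t + 1 < len(obs_ids) else 0.0
            delta = r + gamma * v_next - V[o]
            traces *= gamma * lam
            traces[o] += 1.0
            V += lr * delta * traces
    return V

# Hypothetical trajectory through 3 observations, where observation 1 aliases
# two different underlying states (the kind of partial observability at issue).
obs_ids = [0, 1, 2, 1, 0]
rewards = [0.0, 0.0, 1.0, 0.0, 0.0]

v_td = td_lambda_values(obs_ids, rewards, n_obs=3, lam=0.0)
v_mc = td_lambda_values(obs_ids, rewards, n_obs=3, lam=1.0)
print(np.abs(v_td - v_mc))  # nonzero entries suggest the observations are not Markov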

Speaker 7:

Cool. Yeah. My name is James Staley. I'm a PhD candidate at Tufts University in the Able Lab, which is a human-robot interaction lab. My work here is in the human feedback and imitation learning track.

Speaker 7:

And, basically, I just took a bunch of demonstrations. I was trying to build out a DreamerV3. I was trying to combine it with something that Hafner did earlier, which is Director, which is a hierarchical version. So I was trying to build that out and see if I could use it for a human-controlled robotic system. And then I was getting really good results, and my adviser was like, oh, you should simplify it and pare it down so you know where these results are coming from.

Speaker 7:

And I went all the way back to just Dreamer, and then Dreamer plus human demonstrations gave surprisingly fast learning. So, actually, I thought I had broken something in the system in a way where I was benefiting, for like a month, but it's something about Dreamer; it just picks up information from human demonstrations extremely well. And so the paper that's in RLC is in these 2 environments. It's in a pin pad environment, which is a discrete-action image environment, but it's a long-horizon task; it's like you have to hit these pads in the right order. And then Memory Maze, which is an exploration and memory task where you can move around this maze and pick up reward.

Speaker 7:

But, with human demonstrations, Dreamer learns these tasks between, like, 4 and 7 times as fast as Dreamer alone, which is not surprising because you're giving it these human demonstrations that contain the sparse reward, but the degree to which it's learning is pretty surprising. Like, it's learning so much faster. So I think it has something to do with, you know, when Dreamer is training, it's using the real-world trajectory, in this case a human demonstration, as sort of a scaffold from which it does its imagined exploration. So it's exploring along these reward-containing trajectories. And since it has those trajectories early, it probably, and this is speculation, but it probably has an easier time forming representations that allow it to attain that reward.
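
A minimal, hypothetical sketch of the mechanism being described: human demonstration episodes are simply added to the world model's replay buffer before (and alongside) the agent's own experience, so early world-model training and imagination already see reward-containing trajectories. None of this is DreamerV3's actual API; the buffer and episode format are placeholders.

import random
from collections import deque

class EpisodeReplay:
    """Toy replay buffer of whole episodes, as a world-model learner might use."""
    def __init__(self, capacity=1000):
        self.episodes = deque(maxlen=capacity)

    def add_episode(self, episode):
        self.episodes.append(episode)       # episode: list of (obs, action, reward) steps

    def sample_sequence(self, length=16):
        episode = random.choice(self.episodes)
        start = random.randrange(max(1, len(episode) - length + 1))
        return episode[start:start + length]

replay = EpisodeReplay()

# Seed the buffer with human demonstrations before any agent experience arrives,
# so early world-model training (and imagination) already sees the sparse reward.
human_demos = [[("obs", "action", 0.0)] * 30 + [("obs", "action", 1.0)] for _ in range(5)]
for demo in human_demos:
    replay.add_episode(demo)

print(len(replay.episodes), "demo episodes available before online data collection")
print(replay.sample_sequence(length=8)[-1])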

Speaker 7:

Yeah. So that's it. It's not really an algorithm contribution or a technical contribution. It's sort of just, like, this is a surprising ability that Dreamer has here. And it goes along with sort of, like, my personal long-term goal. It's great that big models are going to do really well.

Speaker 7:

That's awesome. But it would be really wonderful if you could get one robot and one person to be able to learn a task within, like, a 4 to 8 hour window. And it seems like we're getting really close to that. So this is really nice to find, because it sort of shifts the power more towards the small group or individual, you know, with the robot, instead of having to rely on, say, a foundation model that's trained by Microsoft or by some large company in order to get to, like, local performance.

Speaker 2:

So are you planning to continue along this line, follow up on this work?

Speaker 7:

Yeah. I think it makes sense to turn into a journal paper. It also has something to do with, like, I'll talk about it a little when I present, but there's a lot of pitfalls to using human demonstration data to learn either value functions or policies, because, you know, human data is not Markovian a lot of the time. People aren't acting in a way where they just do what they're doing based on the current state. It's actually in the ACT paper.

Speaker 7:

Or sorry, the ALOHA paper, which, like, that paper is really, really cool if you haven't seen it. They talk a little bit about how human demonstrations are not just state dependent, they're time dependent too. Another way to think about that is, like, you might be thinking about how you spilled coffee on your pants this morning and it might cause you to pause in the middle of doing something, but that's not relevant to the task. That information isn't relevant to the task. So the journal paper, I think, is gonna be a little bit about showing how human demonstrations are different than expert policy demonstrations and why you sort of get to go around these pitfalls of imitation learning and learning from demonstration by using the information that's in a human demonstration to train the world model instead of a policy or value function.

Speaker 7:

Yeah. And then, so that's one; the journal paper will be about that, I think. And then we also really wanna do an in-the-wild study, where we have the arm accepting sort of naive and amateur people's demonstrations, because the same principle should apply, which is that, like, you get to pull out any useful information from the demonstrations without being hampered by incorrect or bad demonstrations.

Speaker 2:

Awesome. Anything else you wanna share with the audience while we're here?

Speaker 7:

The Able Lab at Tufts is great, and my advisor, Elaine Short, is extremely, extremely smart and helpful and wonderful. And it's a good place to apply to do human-focused machine learning and HRI.

Speaker 2:

I'm here with Jonathan Lee. Jonathan, where are you from and what do you do?

Speaker 8:

Hi. I'm Jonathan. I'm from RPI, Rensselaer Polytechnic Institute. I'm a PhD student there. I work on foundation models and decision making. So, trying to basically integrate a lot of the planning and search techniques, such as MCTS, like the ones we saw at this RLC conference, with large language models, to create better agents that are able to, you know, plan and reason better.

Speaker 2:

Awesome. How's RLC going for you?

Speaker 8:

Yeah. RLC is going great. I've been learning a lot. It's been amazing seeing all these academics, you know. Like, I just saw David Silver just now talking about AlphaGo and AlphaProof and kind of the latest methods developed. So it's been really cool, you know, asking questions and meeting all these researchers.