TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.
Speaker 1:So I'm Kris De Asis. I'm a former PhD student with Rich Sutton at the U of A. I'm currently an Openmind Research fellow. And, yeah, most of my work has to do with online reinforcement learning, specifically value-based methods. And my work here is just noticing a little discrepancy with time discretization in the definitions of the return.
Speaker 1:So, like, notably, people would use a pretty common definition of the discrete-time return, but it turns out that this has a little mismatch with the continuous-time return, in that you're doing this weird integral approximation. And it turns out that to fix that, you just multiply everything by gamma, and it works out.
Speaker 2:Awesome. So elegant. And does it change things?

Speaker 1:Well, if you're purely in the discrete-time setting, no. And if you have, like, a fixed interval size when you're doing continuous-time reinforcement learning, it's also just proportional. But the moment you have any stochasticity in your interval sizes, like if you're on a real robotics environment where there's just inherent noise in your time step, then it's no longer proportional, and you have better alignment with the underlying integral return.
Speaker 2:Awesome. Any highlights from RLC?
Speaker 1:Yeah. So I actually met a lot of people who aren't from reinforcement learning and just saw that this conference was happening, and then they were just trying to see what was up. And something that really struck me was that at first, they were kinda skeptical. They were trying to see whether this applied to their field at all or not. But when they watched Andy Barto's talk, like, they got extremely excited.
Speaker 1:They were, like, hyped with everyone else, even the RL people, and that was, like, really amazing to see.

Speaker 2:This seems so fundamental. It's almost surprising to me that, you know, you or the community just discovered this in 2024. Like, how did you come to this?
Speaker 1:Yeah. So this was around the start of my PhD. This was actually from 2019, when I was taking a course on applying reinforcement learning to robotics, and we started working with the integral return. And I sort of got hung up on these time indices, because in the literature, people would start from, like, r of t, and some people would start from r of t plus 1, and I actually tried plugging those values into the integral return, and things weren't lining up. And so I noticed it in that course, but then it was, like, too late to adjust it in my course project.
Speaker 1:But this has always been at the back of my mind, and RLC came around, and this seemed like the venue where this was actually relevant or people would actually care about this.
Speaker 3:Hi. I'm Anna Hachterdan. I'm from the U of A. I'm a master's student working with Martha White on online reinforcement learning. One of the biggest challenges in reinforcement learning is hyperparameters and how to deal with them, and I'm trying to see if there is a way to do it in an online manner.
Speaker 3:Meaning, your agent doesn't have any prior knowledge of the environment and it can't access, like, any kind of simulator. Can it figure out a good set of hyperparameters while learning in the environment itself, without losing what it has learned so far?
Speaker 2:Doesn't sound like the easiest problem I've ever heard of.
Speaker 3:Yes. It's it's it's it's not something that I recommend any master student to tackle. Yes. It's not that easy at all, but it's fun, honestly. And I know it's if it's if I start doing something and I can get something working, it will be hopefully helpful to many people, especially people that want to bring RL to real world.
Speaker 2:Was there anything in particular at RLC that was a highlight for you?
Speaker 3:Oh, I absolutely loved Andy Barto's talk. It was very interesting because I'm a master's student just getting into RL, and then you're listening to the founder of RL talk about the history of it. It was a very, very nice talk, and it was very touching too, because everyone was very excited and respectful, and it was just a great time.
Speaker 4:Hi. My name is Dilip Arumugam. I'm a postdoc at Princeton working with Tom Griffiths, and I think a lot about the intersection of reinforcement learning and information theory as it pertains to generalization, exploration, and credit assignment. The work that I presented at the Finding the Frame workshop was specifically about information theory as it pertains to exploration. So typically in RL, we make this default assumption that an agent always explores to learn what's optimal.
Speaker 4:And so my work thinks about, well, what happens when being optimal is entirely infeasible? How should an agent orient its exploration when π*, the optimal policy, is no longer a realizable goal?
Speaker 2:What were the highlights for you at RLC?
Speaker 4:I really loved Doina's talk. I mean, all the keynote speakers are absolutely amazing and legends for our field, but Doina's talk in particular seemed to strike a lot of chords with me. I see a lot of touch points where ideas from information theory can really help deal with problems like, well, maybe the agent's environment is only Markov in its mind, and also, of course, work on learning targets, which is very much central to work that I've done, but also to work in Ben Van Roy's group, which is where I'm coming from.
Speaker 2:I'm here with Micah Carroll from UC Berkeley. Micah, what are you working on these days?
Speaker 5:Yeah. So I've been working on this problem of changing preferences and AI alignment. And there's this question of, like, okay, if people's notions of optimality change over time or their preferences change over time, what are AI systems supposed to be optimizing in trying to assist the human? And in particular, if you just do standard reinforcement learning in this kind of setting where, in fact, people's reward functions or preferences are changing over time. But it turns out that the optimal thing often can be to try to change people's preferences or reward functions, in order to be easier to satisfy and obtain higher rewards.
Speaker 5:So, for example, in recommender systems, it might be optimal to 1st try to, if you're optimizing for long term engagement, first try to get the user, is, like, addicted to the platform or, like, liking certain types of content that are easier to provide. And then at that point, like, get a lot of engagement, out of them. And this is clearly not a good outcome. So, like, kind of asking this question, like, how do you avoid these type of influence incentives?
Speaker 2:Sounds like a very important and hard problem. So it's in that quadrant, which is great. Can you tell me about any highlights of RLC for yourself?
Speaker 5:I really enjoyed the workshops, and in particular the Finding the Frame workshop. I was also co-organizing the RL Safety workshop, and I really enjoyed the panel there. Also, all the keynotes were really good. So I'm really excited for RLC next year too.
Speaker 5:Yeah.