Sai Krishna Gottipati of AI Redefined on RL for synthesizable drug discovery, Multi-Teacher Self-Play, Cogment framework for realtime multi-actor RL, AI + Chess, and more!
TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.
TalkRL Podcast is all reinforcement learning all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at talkRL podcast. I'm your host, Robin Chauhan. I have a brief message from AnyScale, our sponsor for this episode. Reinforcement learning is gaining traction as a complementary approach to supervised learning, with applications ranging from recommender systems to games to production planning.
Robin:So don't miss Ray Summit, the annual user conference for the Ray open source project, where you can hear how teams at Dow, Verizon, Riot Games, and more are solving their RL challenges with RLlib. That's the Ray ecosystem's open-source library for RL. Ray Summit is happening August 23rd and 24th in San Francisco. You can register at raysummit.org and use the code RAYSUMMIT22RL for a further 25% off the already reduced prices of $100 for keynotes only, or $150 to add a tutorial from Sven. These prices are for the first 25 people to register.
Robin:Now I can say from personal experience, I've used Ray's RLlib, and I have recommended it for consulting clients. It's easy to get started with, but it's also highly scalable and supports a variety of advanced algorithms and settings. Now on to our episode. Sai Krishna Gottipati is an RL researcher at AI Redefined, working on RL, multi-agent RL, and human-in-the-loop learning. He developed the world's first RL-based algorithm for synthesizable drug discovery.
Robin:As a master's student at Mila, he worked on RL for active localization, board games, and character generation. Sai is also an International Master in chess. Sai, thanks for joining us today.
Sai:Thanks for inviting me.
Robin:So can you tell us a bit more about your current focus areas?
Sai:Currently, I'm working mostly on multi-agent RL, human-in-the-loop learning, and some of their applications in industrial settings.
Robin:Can you say a bit about the settings?
Sai:Sure. I think it mostly relates to our product, which is Cogment. It's a multi-actor framework. And by actor, I mean the actor could be an AI agent or a human agent or a heuristic agent and so on.
Sai:So it's especially useful in very complex ecosystems where multiple actors are acting. And we have multiple industrial clients that are using Cogment for their products. For example, we are working on a project with Thales for airport security, where the idea is to defend the airport from incoming drones or any other objects. And we have two teams of drones: one is the team defending the airport, and the other is the one that's trying to attack the airport.
Sai:So as the defense team, we need to develop sophisticated algorithms against different kinds of attacks. And this is where the agents within the defense team should learn to collaborate with each other and simultaneously launch an attack against the offenders. Yeah, that's one of the applications, for example.
Robin:Wow. Okay. So not shying away from the hard problems in multi-agent RL, or the safety-critical settings either?
Sai:Yep.
Robin:Mhmm. That's amazing. Okay, so we're gonna hear more about Cogment later on in our chat, and I look forward to that. So just looking through your background, it seems like you've had a few different chapters in your research career. I see early papers on SLAM and drug discovery, and then computer vision and RL fundamentals. Can you give us a brief idea of how you got to this point?
Sai:During my undergrad, as part of my honors project, I started in the Robotics Research Center; I think it's now called the IIIT Robotics Lab. At that time the lab was working on the Mahindra Rise challenge, which is to design an autonomous driving car for Indian roads. So I started working on various computer vision problems like road sign and traffic signal detection and recognition, pothole and speed breaker recognition, and so on.
Robin:That sounds challenging. I'll just say I learned to drive in New Delhi, and the roads there are quite something. If you wanna test your autonomous driving, that's gotta be a really good place for a lot of tail events.
Sai:Yeah. Yeah. I think Hyderabad roads are even more challenging. So, yeah, this was back in 2015 or 2016. At that time I was still mostly using traditional computer vision techniques, but that's also the time when I slowly got introduced to deep learning.
Sai:Yeah. And then I started using deep learning, first for the recognition part, and then also for the detection part. At the same time, I got an opportunity to work on a research project as well, which is the 3D reconstruction of vehicles. I worked on a very small part of that project at the time, which is keypoint localization. That's when I got introduced to many of the deep learning frameworks of that time.
Sai:I think PyTorch wasn't even released, or not that mature, at that time. I was dealing with Caffe then, and before Caffe I was doing deep learning in MATLAB. So those were fun times. Yeah.
Sai:Towards the end of my undergrad, I got an admit at Mila to work with Liam Paull. Both are robotics labs, basically, and the project I'd be working on wasn't decided yet. I thought I would continue working on some fun and challenging robotics problems. And yeah, I explored a lot of different problems in localization, SLAM and so on, and I finally got to work on the problem of active localization.
Sai:And yeah, I initially tried out the traditional methods for active localization and soon realized that reinforcement learning is a very good fit for this problem. So I started using reinforcement learning for active localization, and that's how I got into reinforcement learning at the same time. Yeah, I think this was the beginning of 2018. I was also taking the reinforcement learning course at Mila, where I got to work on some interesting assignments and projects. And after graduating from Mila, I started working at a drug discovery company, where again reinforcement learning was a very good fit for the problem I was working on then.
Sai:And now I'm at AI Redefined, where I find multi-agent RL and human-in-the-loop learning are more challenging and interesting problems to work on.
Robin:What would you say are the holy grail problems or the long-term goals in your current focus areas?
Sai:I think human-in-the-loop learning is still at a very early stage. We don't even have proper benchmarks at this point. For example, for reinforcement learning, Atari is kind of considered to be a good environment to test different algorithms on, but we don't have any such ideal environment to test human-in-the-loop learning on. I mean, even the metrics that we have to optimize aren't very clear, because it's not just about maximizing a particular reward. We also need to care about the trust factor.
Sai:And I think, as a first very good challenge, we need to develop these benchmarks, develop the environments, and try to optimize for these different metrics. And, yeah, I think in 10 years we could have very good, complex ecosystems where humans and AI agents learn from each other, trust each other, and cooperate with each other.
Robin:Yeah. I mean, the whole idea of a benchmark for human in the loop seems so difficult to execute. Like, how many times can you run it? How much human time is it gonna take? How replicable would the results be with different humans?
Sai:Exactly.
Robin:Do you feel like progress is being made on that question, or are people kind of kicking the can down the road a bit? I've seen some multi-agent RL papers focus on how other RL agents or other types of automated agents will respond or react, but it doesn't seem like there's any clear way to automate a human response. I mean, the whole point is that the human responds very differently than any machine ever would. So how could you ever put that in a loop, in terms of, like, running huge amounts of hyperparameter sweeps or anything like that?
Sai:Yeah. That is a very challenging question. And we are kind of working on a small part of that right now, on the Hanabi project, where we are trying to have humans play alongside other agents and train agents in such a way that they can learn to collaborate with all the humans.
Robin:Okay. And then we're gonna talk about that Hanabi paper in a few minutes. So I just saw an announcement a few days ago that Mila, the research institute in Montreal, and your employer, AI Redefined, have a partnership. Can you say a bit more about AI Redefined and its mission and the partnership with Mila? And what stage is AI Redefined at with its, it sounds like, really ambitious work?
Sai:Yeah. So AI Redefined started out around 2017. It's based in Montreal, and its mission, in my own words, is to develop complex ecosystems where humans and AI agents can, as I was saying, learn from each other, collaborate with each other, and trust each other. Yeah. So that's the grand goal that we have.
Sai:And we are working on multiple projects with Mila researchers. For example, the one with Professor Sarath Chandar's group on Hanabi, and we are looking forward to working on more such projects with other Mila researchers as well and testing out the actual potential of Cogment.
Robin:Awesome. Okay. So let's talk about the Cogment paper. That is "Cogment: Open Source Framework for Distributed Multi-actor Training, Deployment, and Operations," and that's with authors from AI Redefined, yourself as well as co-authors.
Sai:Yeah.
Robin:So you said a little bit about Cogment. So it's really about multi-agent systems. And is it really about learning in real time or inference in real time? Or can you tell us more about the settings that Cogment is best for?
Sai:Yeah. I wouldn't call it a multi-agent system. It's more of a multi-actor system. As I was saying, an actor could be an AI agent or a human or a heuristic agent or basically any other actor.
Sai:It can be used even for normal, simple, single-agent reinforcement learning algorithms, but there, I guess, you won't see any advantages compared to the other existing frameworks. Where Cogment really shines is in these multi-actor systems, because you can have multiple actors simultaneously acting in an environment, doing very different things. For example, imagine all the ways a human can interact with an AI agent, right? An AI agent can reward the human at every time step, or vice versa. Similarly, one agent can set a curriculum and the other agent can follow the curriculum, or you can use even simpler algorithms like behavior cloning. These are all different ways in which a human can interact with the agent.
Sai:Cogment is really suited for all these kinds of use cases. For example, one simple demonstration would be in the case of a simple Gym environment like Lunar Lander, where an AI agent with its randomly initialized policy starts playing the game and a human can intervene at any time step in the middle of the episode. The AI agent can learn both from its own samples and from the human interventions. So instead of continuously interacting with the agent, the human can just sit back and relax and only intervene when they think the agent is acting very stupidly. Right.
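A minimal sketch of the intervention pattern described above, in plain Python with Gymnasium's Lunar Lander. This is not Cogment's actual API, and `human_action` is a hypothetical hook standing in for whatever interface the operator uses; the point is only the control flow: the human may override the agent at any step, and both kinds of transitions land in the replay buffer, tagged by their source.

```python
import gymnasium as gym

env = gym.make("LunarLander-v2")
replay_buffer = []  # (obs, action, reward, next_obs, terminated, from_human)

def human_action(obs):
    """Hypothetical hook: return an action if the operator intervenes, else None."""
    return None  # a real setup would surface the rendered state to the human here

obs, _ = env.reset()
for step in range(1000):
    agent_action = env.action_space.sample()  # stand-in for the learning policy
    override = human_action(obs)
    action = agent_action if override is None else override

    next_obs, reward, terminated, truncated, _ = env.step(action)
    replay_buffer.append((obs, action, reward, next_obs, terminated, override is not None))

    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()

# A learner can now sample from replay_buffer and, if desired, up-weight or
# imitate the transitions flagged as human interventions.
```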
Sai:And I think this is one of the efficient uses of human time. One of the projects was the airport security one that we are working on with Thales, and with Matthew Taylor and Matthew Guzdial from the University of Alberta. The other collaboration we have is with Confiance.ai, which is, I think, kind of a consortium of multiple French industries and labs. And we are working on this specific case of hyperparameter optimization guided by humans. Yeah.
Sai:So basically allowing the humans to explore the space of hyperparameters so that they can end up with the final optimized parameters that they want. One other interesting project is with a major player in training simulation. I think I can't reveal the name, but the project is basically in air traffic controller and pilot training, where you have multiple aerial vehicles that are queued to land at different landing spots or destinations, and then you receive an emergency request from a different pilot. So how should the ATC react, so that they can reroute the existing aerial vehicles and also help this new trainee pilot land safely?
Sai:We also have another collaboration with a renewable energy company, where the goal is basically to manage the energy grid, or to decide when to sell or store the energy in the grid. It's basically an optimization problem with RL, but we could have a human in the loop, with an operator actually controlling the decisions. And you can also have different kinds of risk profiles that are controlled by the humans.
Robin:So how do you think about the use of AI and RL in safety-critical situations? Because it seems especially relevant with the air traffic controller case, I guess, and the power case too.
Sai:Yeah. So I think it's important to have a human in the loop and kind of have the human have the final say in these systems. And, yeah, that's kind of the primary focus at AI Redefined as well.
Robin:Okay. But you think the human and the AI together can make better decisions, and safer decisions, than the human on its own? Is that the goal here?
Sai:Yeah, exactly. I mean, there is some complex planning that needs to be done, which in time-critical situations a human might not be able to do. So the agent will do all that hard work very quickly, and then it will suggest what it thinks is the best action. And if it seems like a sensible action that is not dangerous, then the human can approve that action.
Sai:Basically, based on the human's approval or disapproval, the agent can also learn further from these kinds of feedback. So it would be a continually learning system as well.
Robin:Is it very general, or is it more focused on on-policy, off-policy, offline? Is it like model-based, model-free? Is it all of the above, on the RL
Sai:side? It's all of the above. And especially, we have this thing called retroactive rewards, where even the rewards can be given much later than when the time steps or the episode actually happened. This gives rise to a wide range of applications as well. For example, when an AI agent is acting in an environment, a human might not be as quick to give the reward.
Sai:Right? So it's useful in those cases.
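A toy sketch of the retroactive-reward idea, using a plain in-memory store rather than Cogment's real datastore API: steps are recorded as they happen, and a reward can be attached to any past step later, before the learner reads the trial back as ordinary transitions.

```python
from collections import defaultdict

class TrialStore:
    """Toy store for per-trial steps whose rewards may arrive after the fact."""
    def __init__(self):
        self.steps = defaultdict(list)    # trial_id -> [(obs, action)]
        self.rewards = defaultdict(dict)  # trial_id -> {step_index: reward}

    def record_step(self, trial_id, obs, action):
        self.steps[trial_id].append((obs, action))

    def add_retroactive_reward(self, trial_id, step_index, reward):
        # Called later, e.g. by a human reviewing the episode.
        self.rewards[trial_id][step_index] = reward

    def as_transitions(self, trial_id):
        # Missing rewards default to 0; the learner sees a normal reward sequence.
        return [
            (obs, action, self.rewards[trial_id].get(i, 0.0))
            for i, (obs, action) in enumerate(self.steps[trial_id])
        ]

store = TrialStore()
store.record_step("trial-1", obs=[0.1, 0.2], action=1)
store.record_step("trial-1", obs=[0.3, 0.1], action=0)
store.add_retroactive_reward("trial-1", step_index=0, reward=-1.0)  # feedback given later
print(store.as_transitions("trial-1"))
```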
Robin:And what stage is Cogment at? Is it built on other tools, or is it kind of a greenfield project? Are you extending something here, or is it really starting from scratch?
Sai:It's mostly a greenfield project. It's based on a microservice architecture; I think that's just the concept it follows. There are multiple versions of Cogment. I think the first version came out about a year and a half ago or something.
Sai:Recently we released Cogment 2.0, which is more academically oriented and more friendly to researchers as well. And on top of Cogment, we released something called Cogment Verse, which is a collection of a bunch of reinforcement learning agents and environments, from simple Gym environments to procedural generation environments and so on, so that it would be easy for any academic researcher to get started and do a couple of experiments with Cogment.
Robin:I guess, in the case where a human takes over, are you labeling those samples as expert demonstrations, or are they considered differently?
Sai:Yes. They can be stored in a different replay buffer, or they can be stored in the same replay buffer. It depends on how we code it.
Robin:What is your role in the Cogment project?
Sai:I'm mostly developing on Cogment Verse, which is implementing and benchmarking different reinforcement learning or multi-agent RL algorithms with different kinds of environments. And then we also use Cogment for all of our ongoing research projects.
Robin:Cool. Okay. Do you wanna, move on to the asymmetric self play paper?
Sai:Yeah. Yeah.
Robin:So I think this is a paper from OpenAI that you weren't a coauthor on, but you found it interesting for our discussion.
Sai:I think the idea here is to solve goal-conditioned reinforcement learning. Usually it's a very sparse reward problem, and hence it's a very challenging task to solve. So what these guys do is introduce a new kind of agent; they call them Alice and Bob. Alice is like a teacher agent that acts out in the environment and reaches a particular state.
Sai:And then the Bob agent is supposed to reach that state reached by Alice. This way, the problem of sparsity can be kind of overcome.
Robin:So this paper was "Asymmetric Self-Play for Automatic Goal Discovery in Robotic Manipulation"
Sai:Yeah.
Robin:With authors OpenAI, Matthias Plappert, et al. So, fundamentally, why do you think that splitting the learning problem in this way, using two separate agents, is somehow better? Like, we see different algorithms that split the learning problem in this type of way or in related ways. Why is that somehow better? Is there some reason why it should make sense to do that?
Robin:It's almost like they're setting up a game. Right?
Sai:Yeah. So if a single agent is acting out in the environment, the reward is very sparse, especially in a goal-conditioned environment. I'm thinking of a robotic manipulation task where the end locations have to exactly match. Maybe even after a hundred time steps, you might not be able to reach that location, and it's hard for any typical RL algorithm to learn from such sparse rewards.
Sai:So introducing this new agent will encourage exploration. It will encourage the teacher agent, the Alice agent, to go to places it hasn't been before, because if it's revolving around the same area, then the Bob agent can reach those locations and the teacher will be negatively rewarded. So the teacher is always incentivized to explore more, and consequently the student is incentivized to follow the teacher. I think this way the exploration is much faster, and at the end of the day, the agent can generalize much better, even to unseen goals.
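A compressed sketch of the Alice/Bob loop just described, with made-up `env` and policy interfaces (reset and step here simply return states): Alice rolls out freely, her final state becomes Bob's goal, and the rewards are set so that Alice only scores when Bob fails, which is the pressure that keeps her exploring.

```python
import numpy as np

def rollout(env, policy, goal=None, max_steps=100):
    """Hypothetical helper: run a policy and return the visited states."""
    states = [env.reset()]
    for _ in range(max_steps):
        action = policy(states[-1], goal)
        states.append(env.step(action))
    return states

def asymmetric_selfplay_episode(env, alice, bob, tol=0.05):
    # 1) Alice (the teacher) acts freely; wherever she ends up becomes the goal.
    alice_states = rollout(env, alice, goal=None)
    goal = alice_states[-1]

    # 2) Bob (the student) is asked to reach Alice's final state.
    bob_states = rollout(env, bob, goal=goal)
    bob_succeeded = np.linalg.norm(np.asarray(bob_states[-1]) - np.asarray(goal)) < tol

    # 3) Alice is rewarded only when Bob fails, pushing her toward goals just
    #    beyond Bob's current ability; Bob is rewarded for reaching the goal.
    alice_reward = 0.0 if bob_succeeded else 1.0
    bob_reward = 1.0 if bob_succeeded else 0.0
    return goal, alice_reward, bob_reward, alice_states
```

When Bob fails, the paper additionally trains him with a behavior cloning loss on Alice's trajectory (returned here as `alice_states`), which comes up again a little later in the conversation.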
Robin:But why do you think it works better with two agents? Like, you could imagine another formulation where we just had one agent, and we said, okay, we're gonna give some intrinsic curiosity to this agent, and it's gonna do its best to explore everywhere, and then we're gonna do some kind of hindsight replay thing to say, we'll just pretend you were trying to find these goals. It seems like that maybe could work as well, so why do you think this is better?
Sai:Yeah, those could work as well, but I think one kind of issue or challenge I see with these intrinsic-reward-based methods, information-theoretic rewards, curiosity-based rewards, and so on, is that they don't necessarily align with your actual goal. You are incentivizing the agent to just increase its curiosity or optimize some kind of information-theoretic metric, which might not be relevant to your actual goal of solving a goal-conditioned problem. On the other hand, this teacher-student approach is incentivizing the agent to reach a wide range of goals in a much quicker fashion.
Robin:So, like, the training procedure is closer to the test-time procedure. It seems like the teacher here is training for similar behavior to what we actually wanna see.
Sai:Yep.
Robin:Right. So if it's maybe just using some kind of noisy exploration, then it's not gonna be really optimized for quickly getting to any specific goal, because it never really behaved that way during training time.
Sai:Yep. Correct. Yeah.
Robin:Alright. Well, anything else you wanna say about this paper? I think we've seen that general idea show up a lot of times, in terms of goal selection with a separate agent trying to reach that goal, as a strategy for self-play.
Sai:Yeah. So I think one other interesting thing they did in this paper is add a behavior cloning loss to the student's training. We have seen multiple approaches before where we have a goal-generating agent and another agent that's trying to reach the goal, but these goal-generating agents are usually VAEs or GANs and so on. In the case of this asymmetric self-play paper, the teacher agent also actually acts in the environment and reaches that position.
Sai:What that means for the student agent is that, in case the student finds a goal too hard to reach, the student can actually learn from behavior cloning of the teacher's trajectory. I think that really helped in much faster training.
Robin:But do we have a chicken and egg problem? Like, why? How does the teacher know how to get there? I actually didn't follow that part. How does the teacher know how to get there?
Sai:So initially, the teacher moves completely randomly. Both the teacher and the student agent start out completely random. But once the teacher gets to a certain location and the student fails to reach it for the first time, that's good: the teacher agent gets rewarded. In the second episode, the teacher might reach the same spot, but now the student has learned how to reach that place.
Sai:So the student reaches that goal and the teacher will be negatively rewarded. Now the teacher realizes that, okay, the student can reach these goals, so I should further expand my space, and it's incentivized to explore more.
Robin:So what kind of settings do you think are most suitable for this?
Sai:I'm thinking of real-world applications in the context of industrial robots, for example, kitchen robots or robots in some factory settings and so on. Those manipulator arms have to be trained to reach different kinds of poses. So I think during the training phase, it's ideal if they were trained in this manner: you have one teacher agent trying to reach multiple locations, but you could also have multiple student agents trying to reach the same goal poses.
Robin:Okay. And do you think this really makes sense in simulation and then using sim-to-real, or, like, literally doing all of this IRL, in the real world?
Sai:Yeah. I think that's always a complex question. It depends on the specifics. But, yeah, doing it in simulation first and then sim-to-real, it should work.
Robin:Okay. So let's move on to a paper that you have currently under review at a conference, and I won't say the conference name, but it's a big, well-known conference. The paper is called "Do As You Teach: A Multi-Teacher Approach to Self-Play in Deep Reinforcement Learning." So can you give us the basic idea of what's going on in this paper, Sai?
Sai:Yeah. So we had seen this asymmetric self-play paper, and we implemented it and noticed that it's working well, but not as well as we expected. So we were thinking about what kind of improvements we could make to it. And one issue we noticed is that there is a kind of lack of diversity in how the teacher is setting the goals. It is exploring, but it is exploring mostly in one direction; considering a grid world example, the teacher is setting goals that are still challenging, but only in one direction.
Sai:So I think, yeah, that's the basis for our approach. We believe that we need multiple teachers to set diverse goals, which could also help in faster learning of the student agent and better generalization.
Robin:And where does the stochasticity come from? The randomness in the teachers?
Sai:It's random initialization of the networks, and then they all act differently, because they are incentivized based on whether the student has reached the goal or not.
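A rough sketch of one round of the multi-teacher variant, reusing the hypothetical `rollout` helper from the asymmetric self-play sketch above: the teachers differ only in their random initialization, each is rewarded when the student misses its goal, and the student gets updated on every proposed goal.

```python
import numpy as np

def multi_teacher_round(env, teachers, student, student_update, tol=0.05):
    """One round: every teacher proposes a goal; the student attempts each of them."""
    goals, teacher_rewards = [], []
    for teacher in teachers:                       # independently initialized policies
        teacher_states = rollout(env, teacher, goal=None)
        goals.append(teacher_states[-1])           # the teacher's final state is its goal

    for goal in goals:
        student_states = rollout(env, student, goal=goal)
        reached = np.linalg.norm(np.asarray(student_states[-1]) - np.asarray(goal)) < tol
        teacher_rewards.append(0.0 if reached else 1.0)  # teachers want unreached goals
        student_update(student_states, goal, reached)    # e.g. an off-policy / HER update

    return goals, teacher_rewards
```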
Robin:You could get away with one teacher if the distribution was what you wanted, but you're saying you don't get a diverse distribution from one.
Sai:Yeah. Exactly.
Robin:And I just wonder what the other approach would be like. Is there any alternative way to fix the distribution? Because I think what we're saying is that the distribution from any one teacher is just not evenly distributed. So is there some way to make it evenly distributed, or is there just no way, and this multi-teacher idea is an approach to overcome that problem?
Sai:I mean, we thought of other approaches, for example, adding a diversity-specific metric and so on. But I think those are really dependent on the environment or the particular task at hand and are not really generic, general algorithms. And there are some other ways you could do it. For example, adding only diverse goals to the replay buffer: you let the teacher agent generate all these goals, but only store the goals that are explicitly different from the goals already stored.
Sai:But these are also computationally expensive.
Robin:And how do you measure the difference between goals? Do you have some idea of distance between the goals? Is that in terms of steps to get there? Or how do you think of the difference between goals?
Sai:That's another challenge, actually. You don't have any specific metric or distance between goals. If you're acting in a grid world, then it's clear, but again, it's usually specific to the environment you are acting in, which is why I think this multi-teacher approach is very general. It's not computationally intensive and it gives much better results.
Sai:And it also shows that we are actually generating much more diverse goals.
Robin:And are some of the teachers, winning, like, are the teachers competing amongst themselves too? Like, are there kind of losing teachers and winning teachers?
Sai:It's possible that a particular teacher can get stuck in some kind of local minimum. You have this danger especially in the case of a single teacher, right? It's always possible that it can get stuck somewhere, but using multiple teachers kind of solves this issue as well. It also depends on the complexity of the environment. If the environment is not complex enough, there is no point in having multiple teachers, because all the teachers would be generating goals around the same region, the student would already have reached that region, and the teachers wouldn't get incentivized anymore.
Robin:Well, I love the concept and I love the parallel to the real world. I think of every guest on the show as a teacher to me. I learn from every guest, and it's great to have multiple teachers, because every teacher has their own distribution of areas that they are more interested in. And so getting a diverse scope is actually a really nice treat. So in this paper, there are students, teachers, and I think there are also intern agents.
Robin:Can you tell us about that? What is the intern about? What are their roles?
Sai:Once we let the teacher agents generate these goals and the student learns from them, we also wanted to see if these generated goals are of any use on their own. So we introduced a new agent that we call the intern agent. The intern doesn't have access to the teacher's trajectories, only to the teacher's goals. Essentially, it can't use something like a behavior cloning loss or other imitation learning methods.
Sai:The only way it's allowed to learn is from this curriculum of goals. And we observed that this curriculum of goals set by the teachers is much better than a random set of goals. Also, if you increase the number of teachers, the diversity of the generated goals increases, and it helps the intern learn much faster. I think you can draw a real-life parallel to this one as well: even if you don't have access to the complete lecture, if you just have access to some references and so on, you could still learn from those references, but those references have to be accurate and useful and not just arbitrary.
Robin:So this reminds me of something I love talking about, which is PAIRED. It's a way of doing curriculum design. So is there something similar to PAIRED going on here? Or can you talk about the relationship between those two ideas?
Sai:Yeah, they're very related. Our work can be seen as a specific instance of the broader problem of these emergent ecosystems, where you have one agent, let's again call it a teacher agent, that's generating increasingly complex environments, and the actual reinforcement learning agent has to solve whatever environment the teacher throws at it, right? So we can see this goal-generating teacher and the student agent as a specific instance of that, where instead of generating complex environments, we are restricting the generation to goals inside a specific environment.
Sai:All the algorithms that are applicable to those emergent ecosystems are applicable here as well, broadly speaking. For example, I have seen approaches that use, I think, evolutionary search or genetic algorithms for these kinds of teacher agents.
Robin:How do you represent these goals? Are they just states that you show the agents, like, okay, I want you to get into this state? Or how do you represent the goal?
Sai:Yeah. So we have tried this approach on two environments. One is Fetch, and the other is a custom driving simulator. In both cases, we represent the goal position as x, y, and yeah, we could try other things.
Sai:For example, as a bitmap representation, if it's a grid world kind of setting.
Robin:So states, as opposed to observations? Like, are they robot arms? I think you're talking about a robot arm setting. Is that right?
Sai:Gym, a simple Gym version of that.
Robin:And so in that case, is it using proprioceptive observations, like the state variables of the positions and angles of the arms? Or is it more an observation like an image of the outside of the robot, or what? How does that work?
Sai:No, no. It's not an image. The goal would just be encoded as the goal position that the arm has to reach, like x, y. The actual state is the positions and velocities of the hand.
Robin:I see. So what does the intern add? Is the intern like an additional experiment, or does it actually make the learning better?
Sai:It doesn't add to the actual student-teacher training. It's an additional experiment to show the utility of the goals generated by the teachers.
Robin:So what kind of problems are best suited for this type of approach, do you think?
Sai:So we are essentially solving goal-conditioned RL here, and there are a wide variety of applications for goal-conditioned RL. Yeah, I think, as we were discussing, industrial manipulator robots, or even medical robots and so on.
Robin:Cool. Okay. Do you want to move to the next paper here, continuous coordination? So this paper is from ICML 2021. The title is "Continuous Coordination As a Realistic Scenario for Lifelong Learning."
Robin:And this is Nekoei as first author, plus co-authors.
Sai:No, I wasn't involved when the paper was being published. But this is something I believe could be a good setup for testing the capabilities of Cogment. In their paper, they established this lifelong learning setup with multiple agents. And we are currently working with these authors to have humans in the loop, to have human agents learn to cooperate with the AI agents and vice versa.
Robin:So Hanabi is quite an unusual game, and I think that's why it comes up in these settings. It has some very unusual properties. Can you talk about Hanabi and why it's a good candidate?
Sai:Yeah. It's a very challenging multiplayer, like two to five player, cooperative card game. If humans actually play the game for the first time, they would never win. I've played the game multiple times myself, and every time a player changes, your entire strategy changes and you kind of have to start everything from the beginning, because the players really need to establish some kind of implicit connection or strategy about what they're doing. In the game, basically, every player can see every other player's cards except their own cards.
Sai:And at every time step, you can choose from multiple actions. The cards are numbered 1 to 5 and have a color, and the goal is to play the cards in such a way that they're arranged in increasing order from 1 to 5 across all the colors. So, yeah, it's a very challenging thing to do. You can choose to give out hints to other players, or you can choose to discard a card, or you can choose to play a card.
Sai:There are a very limited number of information tokens, so you can't keep giving hints forever. There's a very limited number of hints that you can give.
Robin:So, I mean, many games, especially card games, have partial information as part of the game, and we have that here too, of course. Why is it different here? What makes the partial information here different than, say, in blackjack or any other game we might play?
Sai:I think the cooperative aspect is the important one here. The goal is for all the players to play collectively so that they either all win or all lose. So this acts like a good benchmark for teaching agents to collaborate with each other, or for bringing humans into the loop and teaching agents to cooperate or collaborate with humans.
Robin:So that is unusual. I think most card games are about each person winning, not collaborating; they're more competitive. I guess there are games like bridge where there are teams, but this idea of all being on the same team while missing this crucial information is really interesting. It also seems to me a bit artificial, in the sense that this game is only fun because you can't say, hey Sai, you're carrying a yellow 2 and a red 3. I'm not allowed to say that to you as part of the rules of the game, but as humans that's trivial. It's a strange situation, because normally our communication is so good, we could just easily clear up the situation and win together. So somehow this game has added this artificial constraint: you cannot communicate freely, you have to really limit your communication bandwidth.
Robin:Couldn't we short circuit the whole challenge just by letting communication flow freely or no?
Sai:No, because in realistic settings, you can of course communicate in natural language, but I think that adds a whole lot of complexity. And at the current state of NLP research, I don't think we can trust those systems too well. So I think that's why it's important to constrain what the agents are allowed to communicate at this point. But given these limited communication capabilities that we deem perfectly safe, can they learn useful cooperative behaviors? That's a very good challenge to have.
Robin:I mean, we don't have to constrain the agents to speak a natural language. Maybe they exchange a vector or something, a learned vector; they could do a learning-to-communicate type thing. But that would be against the rules as well, right? If we let them exchange vectors with each other, then Hanabi doesn't work.
Sai:Yeah. I mean, I think the point of this is to see how well they can learn to cooperate; it's to have challenging cooperation. Again, we can of course change the rules and make it easy, but then it won't be challenging. So yeah, I can explain the concept of this paper.
Sai:So you have the Hanabi game. What these guys do is first train a bunch of self-play agents, around a hundred of them, so that they can get better by playing with themselves. Then they randomly sample a few of these trained agents and make them play with each other so that they can learn cooperative behaviors. And then in the final test phase, they again sample a bunch of agents that weren't chosen before, that did not play with each other before, make them play with each other, and see how well it works in the context of zero-shot coordination. What we are currently trying to do in extending this work is to have a human agent play with this bunch of trained agents.
Sai:And this is not just a challenge for the AI agents; it's also a challenge for the human agents, to learn to cooperate with these trained agents. As the trained agents keep changing, it's important to continuously adapt to your new partners, but also remember how you performed with your old partners; not opponents, but partners.
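A skeleton of the evaluation protocol Sai is describing, with placeholder training and evaluation functions (the real experiments use the Hanabi Learning Environment; the skeleton below is backend-agnostic): train a pool of self-play agents, do cross-play among a sampled subset, then measure zero-shot coordination with partners never seen in that phase. A human partner would slot in wherever `play_together` takes an agent.

```python
import itertools
import random

def train_selfplay_agent(seed):
    """Placeholder: train one Hanabi agent purely by self-play."""
    return {"seed": seed}  # stand-in for trained policy parameters

def play_together(agent_a, agent_b, episodes=100):
    """Placeholder: average Hanabi score when the two agents play as partners."""
    return random.uniform(0, 25)  # stand-in; the real version plays Hanabi games

# Phase 1: a pool of independently trained self-play agents.
pool = [train_selfplay_agent(seed) for seed in range(100)]

# Phase 2: cross-play training among a random subset to learn cooperative behavior.
partners = random.sample(pool, 10)

# Phase 3: zero-shot coordination, evaluated with agents never paired in phase 2.
held_out = [agent for agent in pool if agent not in partners]
zero_shot_scores = [
    play_together(a, b) for a, b in itertools.combinations(random.sample(held_out, 5), 2)
]
print(sum(zero_shot_scores) / len(zero_shot_scores))
```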
Robin:And we saw things like population-based training, which I think was used in StarCraft, where there were many types of strategies, and to keep things from sliding all over in strategy space, they had to nail things down by having all these fixed agents, and then you keep growing the population. So, I mean, it seems like this approach has some things in common with that. Although, I think they went a little further maybe with the population-based training, in terms of really keeping track of which agents were dominating which, and really focusing training on the agents that were still a challenge,
Robin:so that they could get the best ranks possible and be efficient with that. So I wonder, is this the same type of setting? Like, would population-based training also be applicable here? Is this kind of an alternative to that? Or how do you see the relationship between those two things?
Sai:Yeah. Basically, the approaches that were used there can be used here as well. I think Hanabi is basically a simpler version of the problem, where we don't have any of those additional complexities of, let's say, vision or other kinds of representations. The representation here is simple, and the core task is just to learn the ability to cooperate.
Robin:You know, I heard that these types of games require a really good memory. Is that true? Or, that was Stratego, actually. Someone was saying that about Stratego, which is another game with a lot of partial information, and
Sai:Yeah.
Robin:And the comment was about the fact that, well, computers can trivially memorize any amount of data. So does that make these games less interesting for testing algorithms on, because the computer can just remember every hint? Whereas a human might start losing track of the hints over time. Is that a factor here, or not so much?
Sai:So Stratego is basically a two-player game, right? So one could always kind of try to memorize most of the things. Whereas in the case of Hanabi, you have agents that are changing as well, so it's not trivial to memorize. Of course, in the context of self-play it's easy to memorize: if the agent is playing with itself, which is what happens in the first phase of the training, then it's easy to learn.
Sai:But again, that is challenged by the next phase of training, where these agents are made to play alongside other agents. And I think this is where the ability of Cogment really shines: you have multiple actors acting in the environment, and these actors can either be trained agents or human agents, right? So this is one natural fit that we found for Cogment.
Robin:Great. So let's move on to the next paper here, which is "Learning to Navigate the Synthetically Accessible Chemical Space Using Reinforcement Learning," with first authors yourself and Sattarov, and with co-authors. I'm really excited about this paper. I remember it back from ICML, I think it was, and I think that's where I met you
Sai:yeah
Robin:yeah, I mean, I wanted to have you on the show largely because of this paper, and because I thought you were great to talk to and you had such interesting views on the work that you were doing. So, yeah, this is kind of the paper that grabbed my attention. So tell us about this exciting paper. What did you do with this work?
Sai:Yeah. So the challenge was to generate molecules that are actually synthesizable. What people used to do before this paper is this: molecules are usually represented as a string or as a graph, so they used different kinds of GANs, VAEs, or reinforcement-learning-based methods to generate these kinds of graph structures or strings and so on.
Sai:They're obviously optimized for the reward that we wanted, but once these are generated, there is no guarantee that any of them are actually synthesizable. So that is the challenge we were trying to overcome. Our approach was basically that instead of searching in the space of structures, we should actually search in the space of chemical reactions.
Sai:So we start with a bunch of chemical reactants, choose one of them, and make it react with another reactant. You get a product, then choose another reactant, you get one more product, and so on. You repeat this process until you get a satisfying reward, basically optimizing in this particular space.
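A small RDKit sketch of that forward-synthesis loop. The single amide-coupling template and the three-molecule reactant pool are illustrative stand-ins for the 49 reaction templates and the roughly 150,000 Enamine building blocks, the greedy inner loop stands in for the RL agent's choice, and QED stands in for whatever reward is being optimized.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, QED

# Illustrative stand-ins: one amide-coupling template and a tiny reactant pool.
template = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OH].[N;!H0:3]>>[C:1](=[O:2])[N:3]")
reactant_pool = [Chem.MolFromSmiles(s) for s in ["CCN", "NCc1ccccc1", "CC(C)N"]]

state = Chem.MolFromSmiles("OC(=O)c1ccccc1")  # starting molecule: benzoic acid

for step in range(2):  # fixed horizon instead of a learned stopping criterion
    best = None
    for reactant in reactant_pool:  # an RL policy would pick this instead of brute force
        for products in template.RunReactants((state, reactant)):
            product = products[0]
            try:
                Chem.SanitizeMol(product)
            except Exception:
                continue  # skip chemically invalid products
            score = QED.qed(product)  # reward: drug-likeness of the intermediate product
            if best is None or score > best[0]:
                best = (score, product)
    if best is None:
        break  # no template applies to the current state
    state = best[1]
    print(step, round(best[0], 3), Chem.MolToSmiles(state))
```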
Robin:So how does the chemistry part work in terms of having the data in place? Are there databases with these chemicals and these reactions and how they would transform your molecule? Or how does that work?
Sai:Yeah. So for the reactants, there is this database called the Enamine dataset. It contains about 150,000 molecules, so that's our initial starting database. And then for the chemical reactions, we have something called reaction templates, which basically say what the reactive parts in the reactants are and how they react with each other to obtain a particular product corresponding to those reactive parts.
Sai:The carbon chains attached to the rest of the molecule stay the same. SMARTS is the way to represent these, and we have libraries like RDKit that compute most of these things.
Robin:I mean, this is kind of implying a giant search tree. Maybe not that dissimilar from a game tree, but I guess the branching factor is very huge and the depth is very large, so you can't explore the whole tree. Is that the story?
Sai:Exactly. Yeah. You can't use any kind of tree search or other heuristic methods to search this space. That's why we needed reinforcement learning, but even for reinforcement learning, a space of 150,000 reactants is very huge. So first, we choose something called a reaction template.
Sai:There are about 49 of them. And once you choose a specific reaction template, the number of reactants you can choose from decreases from about 150,000 to about 30,000 on average. Again, this is an average; for a specific template, it could be as low as 50 or as high as 100,000, so it really depends. But even to find the reactant in a space of 30,000 reactants is still a very hard task for a reinforcement learning agent.
Sai:So what we did is predict the action in a continuous space and then map it to the discrete space using a k-NN method, just computing the first nearest neighbor. Instead of predicting a discrete number from 1 to 150,000, we predicted the properties of a molecule in a continuous space. And we precomputed all the properties of all these 150,000 reactants beforehand, so that we can directly use the nearest neighbor method to compute the actual reactant that we want.
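A small sketch of that continuous-action-to-reactant mapping: the policy outputs a point in a low-dimensional descriptor space (a made-up 35-dimensional vector here, with random stand-ins for the precomputed descriptors), and a nearest-neighbor lookup turns it into the index of a real reactant, which is what makes a 30,000-way discrete choice tractable.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Precomputed once: descriptor vectors for every candidate reactant
# (random stand-ins here; the paper precomputes real molecular descriptors).
num_reactants, descriptor_dim = 30_000, 35
reactant_descriptors = rng.normal(size=(num_reactants, descriptor_dim))

knn = NearestNeighbors(n_neighbors=1).fit(reactant_descriptors)

def select_reactant(policy_output):
    """Map the policy's continuous action to the index of the closest real reactant."""
    _, idx = knn.kneighbors(policy_output.reshape(1, -1))
    return int(idx[0, 0])

continuous_action = rng.normal(size=descriptor_dim)  # stand-in for the actor network output
print(select_reactant(continuous_action))
```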
Robin:So what is the reward design here?
Sai:Yeah. So the drug discovery community works on a specific set of benchmarks. One of them is called QED, which is basically a drug-likeness score: how likely the molecule you generated is to be a good drug. And then you have the penalized logP score, which is kind of related to water solubility, I believe.
Sai:And then you have other methods. For example, let's say you want to invent a drug to cure HIV. What you do is develop a QSAR model. You know what the HIV target is, and you have an existing, very small database of molecules and how they reacted to that particular HIV target. So you train some model using a supervised method to obtain a reward model. When you get a new molecule, you pass it through this reward model and obtain a particular scalar value.
Sai:These are called QSAR models. And in that paper, we did it against three HIV-related targets.
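For reference, the first two scores are one-liners in RDKit, shown below on aspirin as an arbitrary example. Note that plain Crippen logP is shown rather than the penalized variant (which additionally subtracts synthetic-accessibility and ring-penalty terms), and the QSAR reward is sketched only as a placeholder interface around whatever supervised model was fit to the activity data.

```python
from rdkit import Chem
from rdkit.Chem import QED, Crippen

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an arbitrary example

qed_score = QED.qed(mol)     # drug-likeness score in [0, 1]
logp = Crippen.MolLogP(mol)  # raw logP; penalized logP subtracts SA and ring penalties

def qsar_reward(molecule, model, featurize):
    """Placeholder QSAR reward: a supervised model trained on known activities
    against a specific target (e.g. an HIV target) scores the new molecule."""
    return float(model.predict([featurize(molecule)])[0])

print(round(qed_score, 3), round(logp, 3))
```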
Robin:Okay. So it's based on the experience of how effective past drugs have been?
Sai:Yeah. They're not necessarily drugs, but any kind of molecules, because your training data shouldn't be biased, right? It shouldn't just be populated with only the useful molecules.
Sai:It should also have some useless molecules, so that the score can be predicted accurately.
Robin:So how do you represent the chemicals internally?
Sai:So molecules can be represented in different ways. People who work with SMILES strings represent them as strings, converted to one-hot vectors and then embeddings and so on. In this paper, if I remember correctly, we considered a few representations. There is ECFP4; these are all vectors, and ECFP4 is a vector that contains information about the graph structure of the molecule.
Sai:Then we have something called MACCS keys, which is a binary vector that tells you the presence or absence of different features of the molecule. And then we have a descriptor set that contains several features; I think there were about 200 such features, and we handpicked 35 of them to use as a representation. So we experimented with all these kinds of representations. In the end, what won out was ECFP features as the input, because we want a robust representation as input, and then the 35 handpicked descriptors as the output.
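The two named fingerprints are standard RDKit calls (ECFP4 corresponds to a Morgan fingerprint with radius 2); the handpicked 35-descriptor output representation is specific to the paper and isn't reproduced here.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("CCOC(=O)c1ccccc1")  # ethyl benzoate, as an arbitrary example

ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)  # ECFP4-style bit vector
maccs = MACCSkeys.GenMACCSKeys(mol)                                # 167-bit MACCS keys

print(ecfp4.GetNumOnBits(), maccs.GetNumOnBits())
```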
Robin:So these are established, standard representations? I wonder if you've been following the AlphaFold work at all. I know that was for protein folding, a very different space, but I wonder if you think those two lines of work have something in common or are going to overlap at some point.
Sai:No, I think they are very different approaches. AlphaFold is mostly a supervised learning algorithm. But, yeah, having the ability to predict protein structures has a lot of use cases in drug discovery; I just don't think it's related to this work.
Robin:These drugs are not proteins generally, right? But they could affect proteins?
Sai:Yeah. So they basically react with the proteins. I think the way to see it is that if you have an accurate structure of the protein, then you could probably predict its reactive properties. So this could probably help in the reward function design that we were talking about earlier: instead of just learning from an existing database of how different molecules interacted with a particular protein target, the protein structure could also help in other ways of reward design.
Robin:So I see this paper is steadily accumulating citations. Are people building cool things on top of this that you're aware of?
Sai:Yeah, I think so. What this paper opened up is kind of a new chemical space for people to experiment in, and it may not just be reinforcement learning. I've seen a few papers where people are using genetic algorithms or evolutionary algorithms, instead of RL, for exploring the same kind of chemical space.
Sai:And then people were trying out different representations. I think a graph representation is very attractive, and I've seen one or two papers doing that. People also tried, I think, learning the inverse. We are just doing forward synthesis, right?
Sai:So people also try to do retrosynthesis based on the forward synthesis; they try to train the inverse network as well. And I think a very important challenge is multi-objective optimization, because in drug discovery you don't want to optimize for just one particular score. Your generated molecule should fit a specific profile.
Sai:For example, it should have a particular drug-likeness score, but it should also have particular water solubility levels, and different profile properties that are not harmful to the human body, basically. So it's essentially a multi-objective optimization problem. And I think a couple of papers have started dealing with that, based on this new chemical space.
Robin:Awesome. That must be very gratifying for you to see as a researcher.
Sai:Yeah, definitely. Yes.
Robin:Okay. So coming back to chess, has your chess background, influenced your approach to AI, do you think?
Sai:Not so much, I think. But in general, being a chess player helped, because you can generally do your calculations much faster, or you can kind of visualize proofs without actually putting everything on paper. I think it has helped in that way. Yeah.
Robin:So what about, has AI influenced your approach to chess at all?
Sai:Not so much, I think. I mean, I haven't played many chess tournaments since I started doing AI. I've probably played, like, three or four tournaments.
Robin:So do you find chess AI interesting?
Sai:Yeah. I think a lot of exciting things are happening, especially with these tabula rasa learning systems like AlphaGo, AlphaZero, and so on. This kind of approach existed before and was tried on different games, but to see it work on chess is really exciting. At the end of the day, though, I still see these as only acting like helpers to the Monte Carlo tree search, right?
Sai:The policy networks or the value networks that these algorithms are learning, I think they're only adding extra help to the MCTS. And I think search is still at the core of all these chess engines, as it has been for many decades.
Robin:Do you feel like this generation of AI has solved chess in a sense, or do you think there are more interesting things that we could do in the chess domain or in closely related domains?
Sai:No, no way. I think we are far from saying it's solved, because we still see AlphaZero or Leela Zero making some mistakes, and those mistakes cannot really be explained. So I think it's far from perfect, far from being solved.
Robin:What do you think the reason is that that happens? Like, what do you think is missing in the design?
Sai:Yeah. So I think for any chess engine, it mostly boils down to how much computation, or how many Monte Carlo tree search simulations, you're allowing the engine to have. And despite having all these trained policy and value networks, if you don't allow it to explore far enough, there are still a lot of blind spots. Even if it's foreseeing 25 plies, there could be something on the 26th ply that the engine has missed, probably because the value network failed to predict that something might happen on the next ply. These are still the corner cases. When I observe some engine games, I don't think much has changed.
Sai:There are a lot of interesting games from AlphaZero. It has been very aggressive in some games; there are a lot of sacrifices, and that's very good to watch. But at the same time, it still has the drawbacks that the older AI engines have. For example, in a very closed position, it can't plan properly.
Sai:It just keeps moving the pieces around without a proper long-term plan.
Robin:So it seems to me that AlphaZero can only perform as well as the function approximator is properly approximating the function, and also only as well as the data. So if it hasn't explored certain regions, or if the function approximator doesn't generalize enough or in the right way
Sai:Yeah.
Robin:Then both of those cases are where the corner cases will hit us. I've never been very clear on how perfect a fit the convolutional network really is for this problem. It seems to me it may not be the perfect fit.
Sai:Exactly, I agree. That's another very good question to explore. Unlike other board games like Go, chess has a very interesting representation as well. It has multiple kinds of pieces, so you can't just represent them as numbers on a 2D map.
Sai:So what people do is use something called bitmap representations: each piece type is represented as binary 1s or 0s on its own dedicated two-dimensional map, in a multi-layered three-dimensional structure, right? And I'm still not sure if it's the most optimal representation to have. And on top of that, it's very unclear if the usual convolutional neural networks are suitable for these kinds of representations.
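A small sketch of the piece-plane ("bitmap") encoding being described, using the python-chess library: each piece-type and colour combination gets its own 8x8 binary plane, stacked into the kind of 3D input an AlphaZero-style network consumes (the real input adds history, castling, and side-to-move planes).

```python
import numpy as np
import chess

def board_to_planes(board: chess.Board) -> np.ndarray:
    """Encode a position as 12 binary 8x8 planes: (piece type x colour)."""
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for square, piece in board.piece_map().items():
        plane = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
        row, col = divmod(square, 8)
        planes[plane, row, col] = 1.0
    return planes

planes = board_to_planes(chess.Board())  # starting position
print(planes.shape, int(planes.sum()))   # (12, 8, 8) and 32 pieces on the board
```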
Robin:There's definitely some locality and some spatial component that maybe the CNN is capturing, but also, like, a rook can move all across the board all at once. So it seems like a CNN is not gonna be very suitable for that part. So I do wonder about that. I think in AlphaFold, AlphaFold 1 used some CNNs, and then in AlphaFold 2 they took the CNN out, because the locality restriction of the CNN wasn't helping them; it would restrict the receptive field to the CNN block. So I wonder if that's the case here.
Robin:Yeah. You'll never have enough data if the game is hard enough. So, you know, I guess the challenge is how do you get the network, the function approximator, to generalize without covering every possible position?
Robin:And so then I wonder how to get the inductive bias that we really want, which right now seems very situation-specific; we're designing the inductive bias. Like, I keep going back to AlphaFold because I think it was really interesting. They really baked in a very specific inductive bias after deeply understanding their problem. So a lot of the intelligence is right there in the inductive bias design, in the network design.
Robin:And I think there wasn't much of that in this line of work.
Sai:Yeah. There are a lot of open problems to explore in this. I would really consider it solved if an agent could play without any search: for example, given a position, can a policy network or a value network predict the best move in that position? Which I think is impossible to achieve. Yeah.
Sai:At least not in the next 20, 30 years. I don't think so.
Robin:I mean, you can play AlphaZero in only one-step mode, I guess, without the full tree search, right? And it still has some level of skill, but it's just not that strong. Right?
Sai:Yeah. Its play is very inferior. And in such a case, I think there are too many failure modes that can be exploited.
Robin:So, I mean, it begs the question: why do we even need this type of structure, this tree search, at all? You know, I gave a talk a while ago to a data science group in Vancouver about why DQN for Atari makes sense and why the AlphaZero algorithm makes sense for a situation like Go. And what I was saying, and see if you agree with me, is that the true value function of Go is so bumpy and hard to predict, whereas in Atari the value function is much smoother and easier to predict, so DQN is enough to master that value function. But on the Go side, or maybe on the chess side, the value function changes so much from any small move, the function is so non-smooth, that you have no choice: your function approximator is not strong enough to generalize into the future. So the only choice you have is to simulate into the future and see what the effect is on the value function.
Sai:That's exactly correct. Yeah. Yeah.
Robin:But if we had function approximators that were more powerful, that could model all the complexity of chess and Go, then we wouldn't need the MCTS. But the fact is, you know, the current generation of neural networks doesn't have that property. So maybe it's a failing of the function approximator that we have to make up for with this additional mechanism. Is that how you see it?
Sai:Yeah. I'm still not clear at what point the function approximators would be able to solve that. I don't see that happening anytime in the near future, but that's generally true. Yeah.
Robin:So what do you think about explainability in chess and these types of games? You know, I never got very far at chess, I'm not very good at chess, but I was very interested as a kid, and I remember reading books on chess strategy, and there would be so many recipes; there's a lot to talk about in chess strategy. And people use a lot of metaphors and a lot of generalization as they're talking about strategy, whether it's open and closed positions, the endgame, and this and that. There are all these concepts that we throw around.
Robin:I wonder what you think about explainability in terms of chess AI. Do you think we could ever get to the point where we could have a discussion with a chess AI about strategy, or is that kind of a ridiculous concept?
Sai:I think it can explain why it thinks a particular move is good, but that explanation would still be based on the variations that it's calculating, and not in any natural language, like saying that it somehow sees this doubled pawn structure as good. I don't see that happening anytime soon. But, yeah, that's something that would be useful to have.
Robin:I guess there's all this work now with language models, attaching language models to everything and grounding language models in everything. Do you think if we plugged a large language model into AlphaZero, we could somehow get it to explain, you know, why Sai beat it in the latest round?
Sai:It's a very tough challenge. I don't think current language models are accurate enough to do that. I mean, we'd need a lot of novel data to train such models on, which is not easily accessible, or achievable within a reasonable amount of compute.
Robin:I guess if it read chess books and was able to understand the positions and somehow map them to its existing representation, then maybe we could get somewhere. It's just hard to imagine. But what I've been noticing is that plugging LLMs into different things is working way better than I ever imagined it would. I'm shocked by how often it's working well when there are good people getting it to work.
Sai:Yeah. I never thought about having an agent read chess books. That definitely sounds interesting. Yeah.
Robin:So besides your own work, are there other things happening in RL, or other parts of AI lately, that you find really interesting, Sai?
Sai:Yeah. So these language models are somehow very interesting. They're already working at a very large scale, and I like these ideas on scaling laws as well: what some amount of increased computation or increased network size or increased training data size can do. And there's this latest paper from Google that shows some emergent behavior: up to a point, a language model cannot solve some arithmetic, but if you have more compute and more scale, then the accuracy increases significantly.
Sai:They call these emergent properties, because that particular property of solving that arithmetic did not exist when they had less compute. And I want to see how far increased compute would be useful in reinforcement learning.
Robin:Do you consider yourself in the "scale is all you need" camp?
Sai:It's not all we need, but I think it's something we definitely need. I went to the scaling laws workshop recently, and yeah, it's very exciting. I think many people in that camp also actually believe that scale is not all you need, but it's something that you definitely need.
Robin:So is there anything else that I should have asked you today, or that you wanna share with our TalkRL audience?
Sai:Yeah. Check out Cogment. It's exciting. And, yeah, if you are working on multi-agent RL or human-in-the-loop learning, check out Cogment, and I'm happy to chat more about your ongoing projects on these topics.
Robin:So is it open source anyone can download?
Sai:Yeah. Exactly. And it's easy to get started as well, I believe.
Robin:And we'll have a link in the show notes. But just for the record, where are people getting it?
Sai:It's cogment.ai.
Robin:So, Sai Krishna Gottipati, thank you so much for joining us here at TalkRL and sharing your insights with us today. Thanks so much for taking the time.
Sai:Yeah. Thank you for having me. I think it's my first podcast.