TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.
TalkRL Podcast is all reinforcement learning, all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at @TalkRLPodcast. I'm your host, Robin Chauhan. I'm very glad to introduce our guest today. I'm here with Sharath Chandra Raparthy, who is an AI resident at FAIR at Meta, and he did his master's at Mila.
Robin:Welcome to the show, Sharath.
Sharath:Thank you so much. I'm excited for this.
Robin:We met on Twitter, and I saw your work, your recent paper, which I find very fascinating and am excited to talk about today. It's called Generalization to New Sequential Decision Making Tasks with In-Context Learning, first author yourself, et al. So can you tell our audience, what is the general idea here?
Robin:What are you trying to do and how does it work? And let's do something new today. Let's try to do this description for a lay audience.
Sharath:We are all familiar with large language models, thanks to ChatGPT, LLaMA, and other models which have, you know, kind of taken over over the past one or two years. So the example is very easy. Right? I mean, if you have ever played with LLMs, they are very good instruction followers, in the sense that we can ask one to, let's say, solve a math task by saying, okay, here's a math task.
Sharath:You need to add two numbers. I'm gonna give you some examples on how to do that, and now it's your turn to answer the new question that I'm gonna ask you. For example, I can say, okay, 1 plus 1 equals 2.
Sharath:2 plus 2 equals 4. What is 4 plus 3? And the LLM is very good at this. It kind of understands that, oh, I have to do a math task now, and it is addition of two numbers. I'm given some examples in the context, and I just have to add the last two numbers, and it will answer, you know, 4 plus 3 is 7, for example.
Sharath:So this ability of learning a new task from very few examples, people call this few-shot learning, or a more technical term for it is in-context learning. And LLMs are pretty good at this. Right? And the motivation of this paper is exactly the same, how you can adapt to new tasks. But the setting is different, in the sense that the setting we are considering is a sequential decision making setting where you wanna solve, let's say, a reinforcement learning task, for example, Atari.
Robin:So traditionally, reinforcement learning, which would be applied to Atari or mazes or these types of tasks, takes a huge amount of data before it gets anywhere. So is that the difference here, that we're able to be very sample efficient?
Sharath:Yes. Yes. That's a good point. So traditionally, reinforcement learning agents, as you mentioned, take a lot of time to train, but they also test on the same training environment. Even though you have trained for so long, it can't adapt to a new MDP, for example.
Sharath:And this is one of the drawbacks of traditional RL. And in this paper, we wanna go a bit further and ask the model to generalize to completely unseen tasks. And by that, what I mean is, if you take this benchmark called Procgen, it's by OpenAI, it has 16 diverse environments. What we do is train on, let's say, 12 environments, and we wanna test on the 4 unseen environments.
Sharath:And these environments are completely different, from the perspective of the observations, the action distribution, the reward distribution, and also the transition dynamics. So it's a completely out-of-distribution kind of generalization. And LLMs are pretty good at this, again going back to language models. They are kind of good at adapting to unseen tasks which could be out of distribution. And the way they are doing it is basically by in-context learning.
Sharath:So we wanna do the same thing for the reinforcement learning setup, and we wanna adapt to the same challenging out-of-distribution tasks.
Robin:Is it true that each environment in Procgen itself is full of randomized settings, so that you don't play the same exact level each time? But even then, these are completely different environments. And what you're saying is it's not just that you are mastering one single environment with the randomization that it has, but also being able to go to a brand new game.
Robin:And so what are you providing to the agent to help it perform in the brand new game? You're giving a few examples? Like, how much data is it getting?
Sharath:Yeah. Honestly, it just gets a handful. In the paper, the number of demonstrations that we use is small enough that we can count them. I think the max is 7, for the MiniHack environment experiments that we did in the paper. And for Procgen, we just give 4 demonstrations.
Robin:So this is 4 complete trajectories of an expert doing their best to try to play the game. Is that what it is?
Sharath:It could be complete or it could not be complete, because we are limited by the context length of transformer models. So we just accommodate as much as we can in the context, and then we kind of roll out from there.
Robin:And so, when I think of even one frame in Atari, it's quite a bit of data. So is this context window very large? How much can you fit in there? That seems like a lot larger than the context window for an LLM, for example, because there it's just small tokens.
Sharath:Yeah. That's true. So one thing we did is treat every image as one token, instead of the way a vision transformer patches the images and treats the patches as tokens. We don't do that. And because of that, we save some space in the context memory.
Sharath:So in terms of the number of tokens that we use in the paper, it's around 2,048 tokens. That's the max number of tokens that we use.
Robin:Okay. Now looking back at some related work, you know, we had Aravind Srinivas, now at Perplexity, back in episode 34 with his work on Decision Transformer, with Lili Chen et al. And, of course, they were using a transformer for sequential decision making. But theirs is really framed as supervised learning.
Robin:I don't think they were really doing this in-context thing. And so we saw Multi-Game Decision Transformer, and then we saw Gato and these other things, and there was something in common there. But I think what you're doing is so different because, even though you're using a transformer for RL, here the focus is the in-context part. Is this the first time that's been done?
Sharath:I would say it's not the first time it's been done, because the notion of in-context learning for transformers has been there for a while. But in reinforcement learning, I think there was a concurrent work when we were publishing our paper. It is called Algorithmic Distillation. But they don't test on the completely out-of-distribution tasks that we consider in our setting. And there is also one more paper from DeepMind around the same time.
Sharath:It's called Adaptive Agent, AdA, I guess. And they do the same in-context learning stuff, but they do online reinforcement learning with transformers, and we don't do online reinforcement learning. So there are some differences between the concurrent works, which came around the same time we were doing this research. But, yeah, I think it's still not enough.
Sharath:I think there's definitely more to do here.
Robin:And your paper mentions that there are certain properties you want in the data to make this work. Can you help us understand, what is it that you found in terms of what kind of data is needed to make this actually work?
Sharath:Language as a data modality is very interesting. I'm just going back to LLMs and why in-context learning works for LLMs. Language naturally has very nice data distributional properties. Like, when you plot the word frequencies of a bunch of documents, it follows a Zipfian distribution. And the second property is that it has this notion of burstiness in the data.
Sharath:And by that, what I mean is, take the example of the following sentence: as the chef seasoned the soup, the aroma of herbs and spices filled the kitchen. If you consider that sentence, you can see that most of the words, like chef, seasoned, soup, aroma, herbs, kitchen, etcetera, are closely associated with the topic of, you know, preparing a meal or a soup in this case. Right? So you can imagine these words are kind of clustered in the space.
Sharath:And they are bursty along the temporal axis. And because similar tokens occur together along the temporal axis, that is one of the crucial factors why in-context learning works for transformers when the modality is language. And about these two properties, there was a paper by Stephanie Chan from DeepMind, and it basically says that these two are quite crucial for in-context learning to emerge during training.
Sharath:And and we wanna take, like, when we when we read when we read this paper, we were like, oh my god. This is, like, really cool. And we wanna see whether we can, like, kind of apply these distribution properties for decision making settings and see whether if we can get, like, in context learning emergence just like language. So we take the notion of busting us, which basically means that you want similar, you you want the episodes from the similar similar levels in case of, like, Proggen in the context. And that basically helps the model whenever it is, like, doing, like, attention.
Sharath:It kind of, like, helps the model to, like, look back and identify that, oh, there is a similar example in the context. I can maybe, learn how to solve, you know, the current task by looking back into the context and, you know, kind of solve it, like that. And this kind of forces the model to double up this in context, in context learning ability during the training. And once we get this during the training, we can just, like, expect that it it also works during the test environment.
Robin:Can you say more about burstiness? Because it seemed like a very important part of the recipe to make this work. So you're saying bursty. Of course, we know burstiness in general daily use, but here, I think what you're talking about is, is it right, that the density of samples from the same trajectory needs to be higher?
Robin:Is that what you're saying? And that's in the context, not in the training data but in the context window?
Sharath:Yes. Yes. Exactly. So a transformer takes a sequence as an input. Right?
Sharath:So you wanna make sure that some tokens in that sequence are kind of clustered together, if you plot them in some space. And that is basically called burstiness. And in terms of our work, we introduce this notion of trajectory burstiness, which basically means that within the sequence, if we stack multiple trajectories together, they should be bursty, in the sense that there should be similar trajectories in the context.
Robin:Some kind of minimum density of the related stuff, so the model has something there to latch on to, maybe?
Sharath:Yeah. Exactly.
Robin:What kind of generalization do you observe here? And maybe this is qualitative. Like, is it learning the idea of playing a game? Is it learning the idea that there's some main character that has to be protected? What do you think is being generalized here?
Robin:Or is it really getting most of it from the context itself?
Sharath:Yeah. That's a very good question. That's something we haven't analyzed, in terms of what kind of skill the model is learning during training. But, yeah, I think generalization is mostly through in-context learning, in the sense that, okay, if I have a similar-looking trajectory in the context, then I can probably believe that I can follow that particular example and solve this particular task using it. But, again, there is one more sort of generalization, which is generalization to unseen states.
Sharath:Right? Because these environments are inherently stochastic, there is some sort of noise in the input sequence that you give to the transformer because of the environment stochasticity. And we observe that as we scale the model size and also the data size, the model learns to generalize even to unseen states and still get a good enough score for that particular task. So that is one more kind of generalization. But we haven't really studied generalization from the skill perspective, in the sense of what sort of general skills the model learned during training that are transferable at test time.
Robin:That makes sense. Like, one question at a time. I'm sure there's a lot of science to do here. The paper said that, instead of single-trajectory sequences like in Decision Transformer, you're using multi-trajectory sequences.
Robin:Can you say more about that? Do you mean what is fed into the context?
Sharath:Yeah. Exactly. So I think Decision Transformer uses just a single trajectory. And even then, I think they truncate the trajectory, and the context window is very minimal. And because of that, you can't really get a lot of in-context learning out of it.
Sharath:So what we do is basically stack multiple trajectories together and just pass that as an input to the transformer. That's one innovation, but it's been done in other papers as well. What we add is that you can't just naively stack these. These trajectories should obey some distributional properties, and that's where the burstiness comes into the picture. We only stack trajectories which are from the similar level, so that the model can look back and learn from the context.
Robin:So do the new tasks have a different-shaped action space and observation space as well? Or are they just different types of actions? What kind of differences are supported here?
Sharath:I think the differences are basically in the observations. Like, if you take this example from Procgen, you have, let's say, the Maze task during training as one of the tasks. But the test task could be, let's say, Climber. And Climber has a different set of action sequences that lead to, you know, high-rewarding instances. So you can't just zero-shot transfer to this new environment, because they differ in terms of the observations and the sequence of actions that leads to high reward.
Sharath:And also the transition dynamics is different for these two particular games, and the reward distribution is different for these two games as well. So basically, if you consider these two as MDPs, then the states, the actions, the rewards, and the transition probabilities are all different.
Robin:If I understand correctly, similar shapes in terms of the size of the action and observation space, but completely different MDPs in all the other ways. Does that make sense?
Sharath:Exactly. I mean, in terms of dimensions, it's the same 64 x 64 RGB image, but it looks different, and the action sequences that lead to the optimal behavior are different, and so are the reward functions and transition distributions.
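A short sketch of that point, assuming the procgen package's standard gym registration; this is illustrative, not taken from the paper.

```python
import gym

# Two Procgen games share the same observation and action space *shapes*,
# but are entirely different MDPs.
maze = gym.make("procgen:procgen-maze-v0")
climber = gym.make("procgen:procgen-climber-v0")

print(maze.observation_space, climber.observation_space)  # Box(0, 255, (64, 64, 3), uint8) for both
print(maze.action_space, climber.action_space)            # Discrete(15) for both
# Same shapes, but the reward function, transition dynamics, and the action
# sequences that earn high reward differ completely between the two games.
```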
Robin:Can you tell us some more about the experience of doing this work, like how you came up with the initial vision and how it progressed to how the paper ended up as it is today?
Sharath:That's an interesting question. So when this work started out, it was more of an exploratory research project, because I had just joined as a resident and I was exploring a bunch of things. And I really wanted to study this problem of in-context learning just from a curiosity perspective. And then I thought, okay, in-context learning in LLMs is already well studied, in the sense that there are papers from Anthropic, Google, etcetera, who did really good work, for language at least.
Sharath:Then, because I'm from a different background, I thought maybe it's interesting to study this problem for this particular setting. So, yeah, it basically started out as a pure exploratory, curiosity-driven project. And most of the time was spent on analyzing how exactly to get that in-context learning ability for decision making tasks.
Sharath:And then, like, as I analyze more, I kind of understood what properties that we need, to make this happen. And at the same time, it's it's not an easy problem for RL because, like, I think there is a special there's a section in the paper, where I talk about the limitations of the approach, which is basically most, mostly, like, RL centric, and not really, you know, supervised learning centric. Because, like, these things these are these small things can be, you know, easily I mean, there is at least in supervised learning, the effect of these, like, very small things are not as huge as in reinforcement learning. For example, the environment stochasticity is, like, huge problem in in in RL. And and and because of that, even though if you if you can do, like, very good in context learning, you might, you know, drift away from the context and, and never recover from it.
Sharath:And this is like a very idle kind of problem and not really, like, supervised learning problem. So it was very interesting to, like, understand, all these things, and, kind of come to the understanding that, you know, it's it's very difficult. It's it's not really that easy. And hence, it is like kind of not really mature to expect the same results that we that we have like seen in LLM space.
Robin:Going back to the general idea of in-context learning, I think the first time I encountered the idea was in the GPT-3 paper. I remember the day it came out and being really surprised. And it seems like a completely different paradigm of learning than we've ever seen before. Is that the case?
Sharath:Yeah. Exactly. I think with most of the impressive results that you see today in LLMs, of course scale and the data size all contribute to the emergence of in-context learning. But at the core, transformers do very good in-context learning.
Sharath:And because of that, you know, we see all these impressive results.
Robin:I guess with language models, LLMs, they can read all the text in the world, or at least all the public text. And there's no obvious equivalent for RL. What would that even be?
Sharath:That's actually a good point. So we are kind of data-bounded when it comes to reinforcement learning, because we don't really have Internet-scale datasets to train these RL models. That is one limitation. But I guess, you know, this is where I think more about how we can use other modalities to do reinforcement learning.
Sharath:Right? So for example, there is this, like, very interesting, paper from, it's called Voyager. And it I'm not sure if you're, if you're aware of this paper. It is basically like using GPD 4 and, kind of like doing this open ended exploration, just through in context learning. And and they are not even like it's not it's not like traditional RL kind of actions.
Sharath:Everything is, like, kind of, embedded in the code. Like, it kind of generates a code and that's that's an action, for example. So everything is kind of, converted into language. And since GPTs are, like, very good at this, like, processing language data, you can leverage the power of those models and kind of use that for reinforcement learning tasks. This doesn't require any, like, learning, fancy, weight updates, fine tuning, etcetera.
Sharath:It's purely in context learning. And that is one paper which was like which I which I found, like, really interesting, honestly. We do we already have data maybe maybe in a in a different morality. We just need to be open about the design choices that we kind of make just, like, try and use these, like, super powerful models for doing decision making.
Robin:I do vaguely remember Voyager. I'm just looking it up now. It's kinda depending on the ability to describe a scene in English, which may be hard for some RL tasks. I'm trying to imagine, like, in ant maze, how you would describe the position of the legs. The scope of what RL could do could be so far outside of language, I guess.
Robin:But you're saying it's an inspiration for what is possible. If I understand correctly, your paper doesn't actually do any reinforcement learning. You're solving an RL environment, but actually without using RL, really. Right?
Sharath:Yeah. That's true.
Robin:Although, I guess, you used RL to train the experts, but that's not really strictly necessary. You know, there's been this theme that comes up now and then, where people are saying, oh, is RL really even needed? Like, in LLMs, you know, some people are moving to DPO and more supervised learning approaches. And there's this kind of meme that people have been kicking around, is RL, you know, still relevant?
Robin:And your paper is another example of getting results in RL environments without actually using RL, with all the known complexities and issues around RL. So how do you feel about this? Is this a death knell for RL? Or how do you see the role of RL changing in the future?
Sharath:Yeah. So one thing we need to be very careful about is, I mean, of course, LLMs became mainstream over the past few years, and because of that, everything else might look very tiny. Right? But I think here we are overlooking the things that RL can do beyond LLMs.
Sharath:Sure, in the LLM space, I think there is an RLHF fine-tuning phase, and these days people are replacing that with DPO, etcetera. But beyond the LLM space, I think RL has been proven powerful, let's say, in the drug discovery space, where you can use RL methods to discover new drugs, for example. And I think RL for science as a thing is very powerful, but people are just too bogged down in the LLM space, and maybe they are not seeing its potential beyond LLMs. According to me, it is very powerful.
Sharath:Maybe not in LLMs, maybe it is. But in science, for example, it's very good. There are very good applications for RL for sure.
Robin:Yeah. I'm not being entirely facetious. I mean, obviously, I started a show about reinforcement learning and named it that, so I believe RL has a future. However, it's been surprising to me how the conversation has shifted. I think the idea in the early days seemed to be that deep RL was gonna be, you know, the one true path to AGI, and that seems a little less clear now.
Robin:I mean, it definitely has its role still, but exactly what its role is gonna be, maybe that's a little less clear.
Sharath:Yeah. I do agree regarding that. But it depends on how you even define AGI. Right?
Sharath:If it is just LLMs, with the current kind of stuff we are doing with RL on LLMs, maybe we're already there. Or if it is beyond LLMs, then there is a potential application for RL over there, I guess. All I'm trying to say is it depends exactly on how you define, you know, AGI, which is a more philosophical debate, than
Robin:Of course, Yann LeCun with his cake, with RL being the cherry on top. I think people interpret that cake in very different ways. Some people have interpreted it as, oh, the cherry is just a small part, so it's not important. But if you listen to what Yann LeCun says, he has said that the cherry is actually the main thing we want. And I think, if I understood his argument, it's that if we put enough effort into the self-supervised learning part of the cake, the main part of the cake, then the cherry can be relatively small even though it's very important.
Robin:It can be a very small amount of compute. But I think originally the cake was about how much data goes into each step. And he was saying, oh, the cherry is small because there's only a small amount of reward signal. And then I think people could have conflated that to say, oh, maybe the cherry is not important because it's so small.
Robin:And even he himself has said, we wanna minimize the amount of RL we have to do, but not eliminate it, because I don't think there's anything else that takes its place.
Sharath:Yeah. Definitely. I think eliminating something means you are missing out on an opportunity. Right? So, yeah, I love cherries, honestly.
Sharath:They're very tasty. And I think it is a very big component. It will play a very big role in the coming future for sure. It's just that we need to get the right set of tools to make it happen.
Robin:So Sharath, I was looking at your past work, and, you know, outside of this in-context learning paper, can you tell us briefly about some of the other directions that you worked on in the past? I saw you did, for example, some interesting stuff on generative chemistry and other things.
Sharath:When I was doing my master's at Mila, I started working on GFlowNets and their applications to drug discovery and generative chemistry. So I think that's why I gave you this example of RL being very useful in these kinds of domains, because the kinds of things I worked on, especially with GFlowNets, are very powerful when it comes to discovering new, diverse drugs using an RL kind of approach.
Robin:And for those of us for whom GFlowNets are still a bit mysterious. I know Professor Bengio has invented something really important, and, you know, I've tried to understand it. I can't claim that I do.
Robin:Just briefly for our audience, what is a GFlowNet, and what does it help you do?
Sharath:So imagine, let's say, you want to sample from a complex distribution, which could be intractable, etcetera. Right? GFlowNets basically allow you to do that kind of sampling. So if you imagine a distribution with multiple modes, then GFlowNets find their way to reach those modes and just pick a sample from there and give it back. And because of this impressive ability of GFlowNets to cover all the modes in the distribution and give you diverse samples, it is a very interesting application for drug discovery, because in the drug discovery space you want to generate a diverse set of molecules, so that you can screen them in the downstream testing process.
Sharath:But, yeah, I kind of imagine GFlowNets mostly from the perspective of sampling. They allow you to sample from intractable distributions. And you also avoid the problem of mixing times in sampling. So if you think about traditional Markov chain Monte Carlo sampling techniques, the major drawback is that if the Markov chain has a very high mixing time, it will take a while to sample from the true distribution. But GFlowNets kind of avoid that problem.
Sharath:And you can basically sample from these intractable distributions, get diverse samples, and use them in your downstream tasks. That's how I think of GFlowNets. I'm not an expert on GFlowNets, but I think thinking from this perspective helped me approach the kinds of problems that I've been solving.
Robin:And so did you use them in conjunction with RL?
Sharath:Yeah. So it has some neat connections with RL because of the objective function. So let me give you this example. Right?
Sharath:So imagine a graphical model which has one source state, and there are a couple of leaf nodes. You can call them sink states. Right? I think this is the classical example which even Bengio gives in his talks.
Sharath:Basically, imagine water is flowing from this source state through all the corresponding states, and it will reach the sink states. And all these sink states have a certain reward associated with them. And if you look at the objective function, it is basically that at every node, or at every junction, you wanna make sure that whatever is the inflow to that particular node is equal to the outflow. And if you look at the objective function, this looks very similar to the TD objective.
Sharath:So that's why it has some connections to reinforcement learning. But one very impressive thing about GFlowNets, which maybe you could see as a drawback from an RL point of view, is the diversity aspect. Right? So you won't always end up with a policy which gives you maximum return. You end up with a policy which understands the distribution of the reward and reaches all the modes of the reward distribution, not just one peak, which is like the argmax.
Robin:I see. So it's trying to match the whole distribution with high fidelity. For example, I tried using a VAE to get samples for an RL environment, and, as you know, a VAE is a very primitive way to match a distribution. And so, obviously, it wasn't matching very well.
Robin:And on the other side, I guess, GANs and diffusion models are trying to get excellent samples, but maybe in the case of GANs, at least, not really matching the distribution even though the individual samples are high quality. So are GFlowNets kind of somewhere in between?
Sharath:Yeah. I think I would not compare with VAEs and GANs, because GFlowNets, at least right now, operate when you have discrete state spaces. So that's why you see applications of GFlowNets in drug discovery, because molecules are kind of discrete.
Robin:Oh, it's always discrete? I didn't realize that. Okay. Cool.
Robin:Because discrete is hard in many ways. Right?
Sharath:Yeah. That is true. If you think of the state spaces of molecules and such, it's basically on the order of 10 to the power 9 or something. And exploring that space and sampling high-rewarding, diverse molecules is intractable. That's why I say you can use GFlowNets to sample from intractable distributions, and these samples are often diverse.
Robin:Any other things in the world of RL that you find quite interesting these days, outside of your own work?
Sharath:Again, I wanna go back to GFlowNets, and I wanna say that they're actually powerful, because I saw this recent paper, again from Yoshua's group, where they use the GFlowNet training objective instead of the PPO objective for fine-tuning large language models. And the interesting thing about doing this sort of thing is that you can do very interesting tasks, in the sense that you can condition the language model on some infilling tasks. By that, what I mean is you can prompt the model with the start of a story. Let's say you're doing a story generation task.
Sharath:Right? You know the beginning of the story and also the end of the story, and you want the LLM to generate a diverse set of in-between stories. And you can use GFlowNets to actually do this sort of infilling task. And they show that if you use the GFlowNet training objective and fine-tune, you end up with a diverse set of stories for the same start state and end state. And I find that really interesting, honestly.
Sharath:So I think they are slowly making their way into the LLM space, and I'm really excited to see what they do next.
Robin:Me too. I'm very bullish on Bengio's ideas. Thank you for sharing that insight with us. Is there anything else that you wanna share with the audience today, while you're here?
Sharath:You know, maybe if you don't know GFlowNets, you should go ahead and read about them. And if you haven't read my paper on in-context learning, please read that paper and, you know, reach out to me on Twitter. I'm kind of active over there. Or you can email me if you have any questions.
Robin:We will link to your paper, as well as other resources from this chat, in the episode show notes. This has been so interesting for me. Thank you so much, Sharath Chandra Raparthy, for joining us today and sharing your insight with our audience.
Sharath:Thank you so much.