Misha Laskin: AlphaGo never stopped improving. It became superintelligent, and you could have sunk 10x or 100x more resources into it and gotten an even more superintelligent AlphaGo. So in principle, these systems never stop learning; it's just a matter of how many resources you want to sink into them. Now, with language models and RL, it's still early days. So I don't think we've discovered the sort of maximally scalable blueprint. But there's a foothold.
Steve Hsu: Welcome to Manifold. My guest today is Misha Laskin. Misha has a background in theoretical physics. He has since transitioned to AI research and to founding his own AI companies. And so I think this interview will be especially interesting for people with a physics or academic science background, but also interesting for people who want to understand the current state of AI.
So Misha, welcome to the podcast.
Misha Laskin: Thanks for having me, Steven.
Steve Hsu: So I understand that you were an undergrad at Yale in physics and that you finished a PhD at the University of Chicago. Maybe just tell us what the early-20s Misha thought he was going to do with his life, why you were attracted to physics. Just give us a slice of life for you at that stage.
Misha Laskin: Yeah. I'm Russian Israeli, and when I moved to the States, I got really interested as a teenager in two things: physics and literature, basically. Literature basically because I didn't read very well. And so, you know, it's sort of like eating your vegetables until you like them.
And so at first it was very painful, but then I actually started liking it. And physics - we had this set of the Feynman Lectures in my parents' library. I had a lot of time on my hands, and there was just something aesthetically really beautiful about it: understanding pretty interesting, non-obvious implications about how the world works from some set of first principles, and the ability to explain things very clearly.
I just really enjoyed reading it. And so my initial inclination toward physics began there in high school, and then I basically wanted to do things like that. I wanted to work on impactful science. I think that was the thing - work on impactful science - and I thought that physics would be the place for me.
And so I got very interested in theoretical physics. I did my undergrad in physics and double majored in literature, but professionally I really wanted to go into the scientific realm, and did my PhD at UChicago in theoretical physics - many-body quantum physics. And it was actually a really wonderful time.
I think it's one that I look back on really fondly. It had this feel of almost how I imagined physics 100 years ago, because in theoretical physics - at least then, I don't know what it's like now - you could still do a lot of it on a chalkboard. So it was a lot of sessions with my advisor or some colleagues where we're doing something interesting on the chalkboard. Yeah, it was a really fun time in my life.
Steve Hsu: And I think right after you finished your PhD, you went into Y Combinator and started an AI company? Do I have the details correct?
Misha Laskin: Yeah. So towards the end of my PhD, I had, I would say, a change of heart - not that the stuff I was learning wasn't interesting, but I felt that I had become an expert in this very narrow sliver of science, and while it was aesthetically pretty beautiful and very interesting, it was hard for me to imagine the kind of impact I would have, even if successful, decades from now. Some people go into physics and are able to see, at a young age or older depending on how patient they are, a great amount of impact, and there were scientists like that around me, but maybe I was just impatient.
It was hard for me to imagine waiting decades to know whether the things I was working on would bear fruit or not. And so my personal confidence went down then, because it was sort of a bet that I took - I had a lot of conviction, put almost a decade of my life into it, and then didn't really know what it was that I should do. But I wanted to try doing almost the most practical thing I could imagine. Somehow entering the workforce didn't seem that appealing, and frankly, to a physicist who's trained very theoretically, that is also a bit intimidating.
I hadn't studied CS formally, and so it was all very foreign to me. So I taught myself to code - this was happening towards the end of my physics PhD anyway, where I had to pick up coding for some of the projects I was working on.
And I decided to do almost what I'd call a random walk through startup ideas, without having any internal conviction at the time around what it was that I should build - just a random walk through: is this useful or not to someone else? And I ended up converging on a company that was effectively building inventory prediction systems for retailers.
So, you know, how many items of clothing should you make for your next season, or something like this. And I learned a lot about startups at that point in time. I learned a lot about what I didn't want to do, maybe, during that process. But it was also interesting building useful stuff.
It's just that there are some, I think, fundamental principles around startups - maybe I should have read some of the Paul Graham essays before, but I was reading them as I was building a startup. And I definitely think some of them were true. One of them is the simple notion of having deep empathy for your customer, and sort of loving your customer. And the reality of it is that I didn't have very deep empathy for the people I was trying to help on the retail side. I didn't understand them that way. And so it converged on this consulting business that was generating revenue, but didn't really have a product around it.
And I wasn't particularly fulfilled working on it either.
But at the same time I saw deep learning taking off, and in particular I remember seeing AlphaGo, and that changed something in me. I think of that part of my life as a wandering-through-the-desert kind of part, trying to put the pieces back together and find what is deeply, internally motivating to me.
And when I saw that come out, that to me seemed like this is the impactful science of the time that I live in, that I really want to work on. So I basically dropped everything I was doing and went into a cave where I learned deep learning and reinforcement learning fundamentals.
And that was my first foray into AI.
Steve Hsu: So at that stage, you might have jumped immediately to a company and started working on AI, but you actually went to Berkeley for an academic postdoc. What was the thought process there?
Misha Laskin: I think at that time it was still not obvious that AI was a useful piece of technology. This was a number of years before language models took off, and while there was a lot of industry research happening, the field was more academic in nature, I would say.
And I was considering a few things. I was in the process with OpenAI - they had this fellows program for people coming from a different field to ramp up into AI. And at the same time I got introduced to Pieter Abbeel, an AI researcher and professor at Berkeley who's done a lot of foundational work over the last decade.
And my decision at that point in time was: where do I think I'll learn the most in the shortest period of time? And I thought that as a postdoc I'd basically be able to iterate through a lot of different ideas and have a lot more learning events. Which in retrospect - I mean, I think both options were great.
But at the time, it was just not clear that AI as a commercial industry was the place that would have the most impact. And it seemed like the place where I'd learn the most was still, at the time, in an academic setting.
Steve Hsu: Now, at this point in time, were transformers a big deal already or just starting to be?
Misha Laskin: Some of the foundational transformer papers had come out, but I remember this was right before GPT-2, so it was not obvious. When I joined the lab - again, since I was inspired by AlphaGo, I really wanted to do reinforcement learning, decision making, solving the problem of autonomy - we were just using RNNs and LSTMs for anything that required memory, and oftentimes just MLPs or convnets. So it was very rare to use a transformer in reinforcement learning at all. They were taking off in NLP, but at that time language understanding was just one of several sub-areas of AI - there was computer vision, there were language models, there was reinforcement learning.
And I would say that the thing that was probably most top of mind at the time was reinforcement learning, because it was just coming off the AlphaGo breakthroughs. And it wasn't just AlphaGo - it was a series of papers that got progressively more beautiful in their simplicity, in how many assumptions they removed, and in their power. These were AlphaGo; AlphaGo Zero, which learned the game of Go without any human demonstrations; then AlphaZero, which generalized beyond the game of Go to other games; and finally MuZero, which was an algorithm that learned the rules of the game as well, instead of being given them.
And so reinforcement learning was the thing that was top of mind for the AI community at large, I think. And even though transformers were definitely taking off in NLP, that was just one of the things happening. It was not clear that it would be the big thing.
Steve Hsu: This is jumping ahead a little bit, but do you feel like the field has come full circle now, where RL - because of the recent reasoning models, or the ability of RL to condition these reasoning models - is sort of back at center stage a little bit?
Misha Laskin: I think it's getting there. It's been a pretty interesting turnaround, because after, I'd say, GPT-3, and certainly after ChatGPT, the whole field of reinforcement learning probably took a bit of a backseat. I wouldn't say it became irrelevant, since the workhorse algorithm that powers alignment of these models is RLHF, but RLHF is a pretty weak form of reinforcement learning, and a lot of people questioned to what extent it was even necessary, as opposed to just really high-quality curation of instruction tuning data.
So RL definitely took, I'd say, a backseat. After I worked on Gemini, which is Google's large language model, I realized that the ingredients might be on the table: you have these really general objects that are language models, and there's nothing fundamentally wrong with reinforcement learning.
It's not like there's something we learned that was wrong. It's just that you need a good reward signal to optimize against - that's basically the thing. You need a good task distribution to learn from, and you need a good reward, a way to verify that those tasks are being solved. And so I thought that after we launched Gemini 1.5, it was probably the time to start looking into scaling up reinforcement learning on top of language models. I think that's now come full circle in terms of reasoning models, which is also, I think, one of those non-obvious things. A year ago, or let's say a year and a half ago, reasoning models were just one of the many things that an AI lab pursues.
It was not clear that they would be as powerful as they're starting to get today. So I think we're seeing what is maybe a normal part of these AI waves: the work starts earlier, before it's clear to other people, and while it's happening it's actually not clear that this is going to be the thing that wins, except maybe to a small subset of people who have a lot of conviction and are seeing something that other people aren't.
Steve Hsu: Right. So, coming back to your bio, because I skipped ahead and the audience doesn't really know what happened to you.
You did your postdoc for a couple of years at Berkeley, and then you transitioned to Google DeepMind, if I'm not mistaken, and that's where you worked on Gemini.
Misha Laskin: That's right. Yeah, I joined DeepMind to continue basically scaling up research in reinforcement learning. And again, it was not really to work on language models at the time. I joined a team called the general agents team - it was really to solve the problem of agency and autonomy with reinforcement learning.
This was a team led by Vlad Mnih, who was the first author of the deep Q-networks paper, the paper that basically started the deep reinforcement learning era in 2013. Then what happened - I remember this very vividly - is that I was at NeurIPS in New Orleans, and ChatGPT came out, and that afternoon I was giving a talk and I had some kind of dissociative moment of, why am I saying these words?
Somehow it clicked with me that it was so obvious what the thing that matters is now. So why am I at a conference talking about something that is not this thing? As soon as I got back from NeurIPS, I basically dropped everything I was doing, started working on language models, and joined a project with what was at the time a small group of people that became the reinforcement learning and RLHF team for Gemini.
Steve Hsu: So just to clarify for the audience: you're giving your talk at NeurIPS on some research you've really poured your heart and soul into, but in the back of your mind, are you thinking, scaling transformers as language models is the thing I really should be focused on? Is that explicitly what you were thinking, or...?
Misha Laskin: It was something similar. I was thinking the following.
So the problem with reinforcement learning before language models was that we had developed these extremely powerful algorithms that worked in very narrow domains. You had a superintelligent Go player, but it didn't really generalize to anything - if it generalized at all, it did so in the sense that you had to retrain the entire model for a different domain.
And the problem is that in most domains of interest, it's just impractical to collect the amount of data, and to have the verification signal, that you need in order to get something useful working there. So there was this big, existential, I would say, generalization problem: we had really powerful systems, and we had no idea how to make them general.
And when ChatGPT came out, I played with it that day and it was very clear that the system is very general. It might not have been very capable yet - it wasn't autonomous, and at that time it was a pretty weak chatbot - but it was very general. You could ask it about almost anything, and still can, right?
And it will answer, and sometimes very capably. I remember at the time they shipped a feature that formatted code as almost like blog posts about code - so you can ask it something about code and it writes you a blog post. That was pretty magical. So what I realized is that we were, or at least I was, spinning my wheels trying to solve the generality problem when it had already been solved for us, right?
These language models are very general, and so it was just a different way to approach the problem. And that's what I thought was really interesting.
Steve Hsu: You know, when I was very young, before I actually learned any physics, I read books like Gödel, Escher, Bach, and so I was actually quite interested in AI before I knew any physics. And I always wondered about this problem of how you would instantiate knowledge about the broad world in your AI. At the time there was some huge project - at MIT, I think - where they were just literally typing true sentences into a database in the hope that eventually it would reach some critical threshold and know about the world. And so the big thing they accomplished, which surprised me, was: okay, a transformer trained by next token prediction on trillions of tokens.
Wow. You get a world model that actually knows a lot about the human world, as observed through human writing. And that was a zero-to-one moment that was just shocking. I think people are still probably underappreciating that zero-to-one moment. Historians will look back and go, yeah, that was a discontinuity in this whole thing.
Misha Laskin: Yeah, it's very non-obvious. I mean, it's still not obvious to me why it works. It's magical that it does, because the internet is such a big, messy data set. And when you pre-train a model on it - most consumers certainly never play around with the actual pre-trained model, because it's very user-unfriendly -
but I remember getting access to some pre-trained checkpoints and playing around with them, and if you poke at them the right way, you get to elicit some very interesting answers. And the fact that they have these powerful world models that are then, I would say, very steerable - with instruction tuning and reinforcement learning, you don't need to do it for that many steps before you go from your pre-trained checkpoint to something that is usable to people. That was really interesting as well. There was this whole field in AI before then - and still is now - that I think this was an answer to: meta-learning, which is the notion of, how do I learn very quickly from a very small number of examples?
And there are all sorts of sophisticated algorithms for how to do this. It turned out that the best meta-learner, the best meta-learning algorithm, was just next token prediction on the internet. From a meta-learning perspective, few-shot prompting is basically learning very quickly from a small number of examples.
And that was really surprising.
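A minimal illustration of the few-shot-prompting-as-meta-learning point above: the "task" is specified entirely by examples in the prompt, and a pre-trained model is expected to continue the pattern with no weight updates. The prompt format and the fake_pretrained_model lookup are invented purely so the sketch runs end to end; a real pre-trained model would produce the continuation itself.

```python
# Toy sketch: the task (English-to-French) is defined only by the examples in
# the prompt; a pre-trained LM would continue the pattern without any training.
few_shot_prompt = (
    "English: cheese -> French: fromage\n"
    "English: house -> French: maison\n"
    "English: book -> French:"
)

def fake_pretrained_model(prompt: str) -> str:
    # Stand-in for sampling a continuation from a real pre-trained language model.
    lookup = {"book": " livre"}
    last_word = prompt.rsplit("English: ", 1)[1].split(" ->")[0]
    return lookup.get(last_word, " ???")

print(few_shot_prompt + fake_pretrained_model(few_shot_prompt))
```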
Steve Hsu: Yeah, I think there's just some magic in the idea that you take a big neural net, which is general enough, and you force it to get good at next token prediction, and through that automated process it builds structures within itself that reflect things about the world. A priori I had no idea that was going to happen, but somehow they stumbled on the right way to do it - or a way to do it.
Misha Laskin: Definitely. Yeah, I think it's really magical, and something that was in the back of my mind - though there's a sense in which the magic fades a bit - is that I thought there's only so much information you can extract from the internet. There's only so much you can compress from it, because it's sort of a fixed body of knowledge that's very noisy. At some point I thought we'd hit diminishing returns: you can imagine an infinitely large brain that's soaking up everything on the internet, and that's sort of the max of how well you can do.
And it was just not clear at what point we would get there. At what point will we basically get brains - these neural networks - with sufficient capacity that you've extracted almost everything there is to extract from the internet?
Steve Hsu: Yeah, it's interesting, because you don't want to go to the overfitting limit where you've literally memorized 15 trillion tokens or something. You want some intermediate, where it memorized compressed versions of some stuff, but also built structures that reflect relations between the information that it's seen.
So it just seems very non-trivial to me. I think people with a more theoretical physics bent, in the future when they have lots of these models to experiment with, will probably understand those dynamics better than we understand them now.
Misha Laskin: Definitely. I think another thing that was surprising about this whole pre-training era is that typically in machine learning you have this notion of epochs - you train over your data set multiple times, and you see where your training and validation curves diverge, and that's when you know you're overfitting. But with pre-training, you do less than one epoch, basically - you scan less than the total amount of data on the internet.
So there is overfitting that happens, because sometimes data is duplicated, right? Sometimes it appears twice on the internet - an article might be syndicated or something like this - or things might be quite similar, but generally it's less than an epoch.
Steve Hsu: Do you have a sense of whether we've hit - so, in the scaling relationships that were in various papers like the Chinchilla paper, it looks like if you want to increase model size by an order of magnitude, or the amount of useful compute by an order of magnitude, you need substantially more data - maybe the square root more.
And at least according to those relationships, it looked to me like mid last year that we would run out of data before we ran out of compute or potential model parameter size. Is that correct? Is that what people within, say, Google would say?
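For reference, a rough sketch of the Chinchilla-style relationship being referenced: a parametric loss fit in model size N and training tokens D, with a compute-optimal allocation in which parameters and tokens scale together with compute C. The exponents below are the approximate published fits, quoted loosely rather than exactly.

```latex
% Chinchilla-style parametric loss and compute-optimal allocation (approximate)
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
N_{\mathrm{opt}}(C) \;\propto\; C^{\,a}, \quad
D_{\mathrm{opt}}(C) \;\propto\; C^{\,b}, \quad
a \approx b \approx 0.5
```

With both exponents near 0.5, a 10x increase in compute corresponds to roughly 3x more parameters and roughly 3x more training tokens, and the compute-optimal token budget works out to on the order of 20 tokens per parameter.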
Misha Laskin: I think that's roughly correct - it's been harder to extract significant gains from the pre-training corpus than some people would have predicted. I think there was a sense that you could just keep scaling your model size and get progressively better and better models trained on the same pre-training corpus.
And I think we had already started seeing diminishing returns - and by we, I mean as a field, across multiple of these labs. There was a moment where - well, first, I think there's clearly a lot of practical value to be derived from these models, and there's a lot of stuff that could still be done even if you exhaust the pre-training corpus, around instruction tuning, overall data curation, and just optimizing the architecture.
So even if nothing else had changed, there would probably still be quite a bit of progress. But I think there was this fear of: maybe this is as far as this idea goes. You can make it more efficient, but how do you get substantially more intelligent systems?
I think the North Star is still systems that help you do the work you want to do autonomously. As a scientist, there's a bunch of rote work - coding work, setting up experiments, these kinds of things - that you might want systems to help with, but aspirationally, you also want them to help you discover new knowledge and be sort of a patient collaborator.
And it was less clear how just doing pre-training - pre-training with some alignment - would get us there.
Steve Hsu: Right.
So, I guess you were still at Google when the reasoning work started there - were you involved in that at all?
Misha Laskin: So I was at Google when it started. I wasn't personally involved in the reasoning effort that was there, but I had some colleagues who were, and of course I was working on the infrastructure and methods for RLHF and reward model training. And of course the thing that makes these reasoning models work is that they basically cast learning to reason as a reinforcement learning problem.
So there's definitely collaboration going on there, but I wasn't on the team that was working on reasoning.
Steve Hsu: Right. So the way I explain the advances in reasoning in the last, say, six months is: you get the model, instead of giving you a quick answer, to sort of talk to itself. It learns the behavior of reasoning, and it can do more if it behaves in the reasoning way rather than in the immediate-response mode.
But my mental model of this, which I'd love to hear your thoughts on, is that the pre-trained model is not getting stronger - you're getting it to behave in a different way that's more powerful or more useful, but you haven't really improved the underlying model.
Do you think that's fair?
Misha Laskin: I think it's fair if you're talking about generality. The strength of the pre-training paradigm is that the diversity of data on the internet is just very hard to recreate synthetically. But you are improving the capability of the model and its ability to think depth-wise. So for the distribution of data that you're training on - math or coding or other verifiable data - it does achieve a new capability.
And I think about it very similarly to how large-scale reinforcement learning systems were trained before language models: they typically had an imitation learning component, where you learn from some human data, and then a reinforcement learning component, where you take off from where the human data ends
and have the model self-improve until it becomes superintelligent. That was the blueprint for AlphaGo, AlphaStar, OpenAI's Dota project: imitation learning followed by reinforcement learning. And I think the same thing is playing out now, where you can think of pre-training and instruction tuning as imitation learning. All this data was generated by humans - some of it is now also generated by AIs as synthetic data, but it's primarily human-generated data. And that gives you a starting point where the model has non-trivial reasoning behavior out of the box. It's not that it has no reasoning behavior -
it had non-trivial reasoning behavior, which is this whole line of work around chain-of-thought prompting that preceded reasoning models. Then you put this into an online reinforcement learning loop with a reliable way of verifying the output - verification you can trust. That is to say, if you can't trust your verification, then it can get hacked, and your model won't learn to reason the right way.
But assuming you figure out how to solve this reward hacking problem, then you're reinforcing the good reasoning behavior that's already in the model. And at some point you actually go beyond the distribution of what the model previously knew, and it's just learning new things. I think that's what's happened with these reasoning models: they've learned new things that the pre-trained model did not know.
Steve Hsu: And the net new things are actually learned in the RL phase. So if I give it some math problems, and it's adjusting its parameters in such a way that it can succeed on these math problems, that reinforces maybe its command of change of variables, or some trig identity. Is that a fair way to think about it?
Misha Laskin: Yeah, I think a fair way to think about it is exactly what you said. And again, the AlphaGo analogy holds here: that system, AlphaGo and AlphaGo Zero, learned net new things - net new strategies that were not even in the corpus of things humans knew, like the famous move 37 from AlphaGo. I think something similar is happening here, but maybe less drastic yet. I don't think we've seen anything close to a move 37 for language models - an obvious creation of net new knowledge. But one way to think about reinforcement learning is that you're generating synthetic data, and by having a way of verifying which traces in that synthetic data are good or bad, you're amplifying the good ones and downweighting the bad ones.
And so once your agent accidentally stumbles on a strategy that worked, that thing gets reinforced and internalized - that's where this net new knowledge comes from: first it's an accident, and then it gets internalized into an actual strategy. That's loosely how I think about these things. And we're in an interesting phase now, where reinforcement learning is starting to work again on top of language models, but we're not yet at the AlphaGo moment for language models. There has not been this powerful net new knowledge creation yet.
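To make the amplify-the-good-traces idea concrete, here is a minimal sketch in its simplest form: rejection sampling with a verifier, followed by fine-tuning on the kept traces. This is an illustrative simplification, not the algorithm used by any particular lab; the toy_model, verifier, and data format are all invented for the example. Production systems typically replace the filtering step with an online policy-gradient update, but the core loop - sample traces, verify, reinforce what passed - is the same.

```python
import random

def verifier(question, answer):
    # Toy verifier: check the final numeric answer against a known target.
    return abs(answer - question["target"]) < 1e-6

def sample_traces(model, question, n=8):
    # Stand-in for sampling n reasoning traces from a language model;
    # here "model" is just a function question -> (trace_text, numeric_answer).
    return [model(question) for _ in range(n)]

def collect_reinforcement_data(model, questions, n=8):
    """Keep only the traces the verifier accepts; these become the
    fine-tuning set, amplifying the behaviors that happened to work."""
    kept = []
    for q in questions:
        for trace, answer in sample_traces(model, q, n):
            if verifier(q, answer):
                kept.append({"prompt": q["prompt"], "completion": trace})
    return kept

# Usage with a deliberately noisy toy "model" that is right about 2/3 of the time.
def toy_model(question):
    guess = question["target"] + random.choice([0, 0, 1])
    trace = f"Let me compute... I get {guess}."
    return trace, guess

questions = [{"prompt": "What is 17 + 25?", "target": 42}]
data = collect_reinforcement_data(toy_model, questions)
print(f"Kept {len(data)} verified traces for fine-tuning.")
```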
Steve Hsu: So in the DeepSeek R1 paper, they're very open about what they did. I like reading their paper, because with Google and Gemini, or with OpenAI, I have to always guess what they're doing, but with DeepSeek they're pretty explicit. In that paper there's a plot where the vertical axis is performance on AIME math problems,
and the other axis is maybe RL steps or something. And it looks like the curve is bending over - the rate of increase with training is more dramatic at first, and then it's smaller, and in fact at the end you could guess that it's just fluctuating a little bit.
If it is increasing, it's increasing very slowly. So one interpretation of that graph might be: okay, without improving the base model in some other way that they haven't tried yet, even more continued RL along that direction wouldn't necessarily qualitatively improve the math ability of this model.
Do you think that's plausible, or do you think maybe that's the wrong interpretation of that graph?
Misha Laskin: Well, first of all, the thing that's pretty universal when you look at reinforcement learning curves, with language models or before them, is that they tend to be log-linear. So if they ran the experiment for 10x longer, we may or may not see something different, but let's put it this way:
if the verification was good - if the way of detecting whether the thing was solved correctly was good - and the exploration of the model was decent, that is, it was trying reasonable strategies, then you would get this log-linear behavior where it basically never stops learning.
Now, reinforcement learning algorithms in practice do stop learning at some point, but there are usually ways to overcome that. To give an example, going back to something like AlphaGo: AlphaGo never stopped improving. It became superintelligent, and you could have sunk 10x or 100x more resources into it and gotten an even more superintelligent AlphaGo. So in principle, these systems never stop learning; it's just a matter of how many resources you want to sink into them. Now, with language models and RL, it's still early days. So I don't think we've discovered the sort of maximally scalable blueprint. But there's a foothold.
So to give you a sense, at least the way we see it: even when we look at DeepSeek R1, that is a more powerful algorithm than a normal RLHF algorithm, but it's still actually a fairly weak form of RL. It's what we call single-step reinforcement learning, where you think for a long time and then just generate a solution, and that's basically one step.
But I think the natural evolution of these systems, especially ones that act on your computer, is systems that think and act over multiple steps - you think and act, and think and act, and so forth - and there's this outer loop of credit assignment across the steps. So I think we're just very early on in the story of how reinforcement learning plays out on top of language models. It was similar before: when deep reinforcement learning started, DQN came out in 2013, and the arc to AlphaGo Zero and MuZero didn't end until the late 2010s. So that was at least five years, with a lot of progress being made. I expect something similar to happen here, but on a compressed timeline. The amount of resources going into these things is just much larger, and I think we just move faster now.
There's just a lot more infrastructure. So I suspect that instead of five years, it'll probably be a matter of a couple of years before we see something like a superintelligent language model in some meaningful areas of knowledge work. And that's how I think reinforcement learning is going to play into that.
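As an illustration of the single-step versus multi-step distinction, here is a minimal toy sketch of a think-act loop with an outer credit-assignment step over the steps. The ToyEnv, ToyAgent, and discounting scheme are invented for the example and are not meant to represent any production agent system.

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str
    action: str
    reward: float

class ToyEnv:
    """Toy task: 'type' the word GOAL one letter per step."""
    target = "GOAL"

    def reset(self):
        self.typed = ""
        return self.typed

    def step(self, action):
        correct = len(self.typed) < len(self.target) and action == self.target[len(self.typed)]
        if correct:
            self.typed += action
        done = self.typed == self.target
        return self.typed, (1.0 if correct else 0.0), done

class ToyAgent:
    def think(self, obs):
        return f"I have typed '{obs}'; the next letter of GOAL should come next."

    def act(self, obs, thought):
        return ToyEnv.target[len(obs)] if len(obs) < len(ToyEnv.target) else ""

def run_episode(agent, env, max_steps=8):
    steps, obs = [], env.reset()
    for _ in range(max_steps):
        thought = agent.think(obs)           # internal reasoning
        action = agent.act(obs, thought)     # tool call / click / code edit
        obs, reward, done = env.step(action)
        steps.append(Step(thought, action, reward))
        if done:
            break
    return steps

def discounted_returns(steps, gamma=0.95):
    """Credit assignment: each step is credited with its own reward plus the
    discounted rewards of everything that happened after it."""
    returns, running = [], 0.0
    for step in reversed(steps):
        running = step.reward + gamma * running
        returns.append(running)
    return list(reversed(returns))

episode = run_episode(ToyAgent(), ToyEnv())
print(discounted_returns(episode))
```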
Steve Hsu: Got it. I think I heard you say on another podcast that we were about three years from AGI, and I think maybe what you just said is reiterating that point. So one shoe that I'm expecting to drop - i.e., somebody releasing a model of this type, or a paper - is doing RL where the model is taking actions on someone's computer, or using their browser, or something like this. That's got to be a very fruitful thing, because as you said, it thinks, it takes an action,
it thinks, it takes an action. Maybe it's trying to buy something for you on the internet, and it's going to get feedback on each of those steps. So you could imagine that's a super fruitful trajectory through the RL space, and I'm expecting someone to release a model
that's just extremely good at doing things on Amazon and eBay and a bunch of commercial websites, things like that. So that will probably happen sooner rather than later.
Misha Laskin: It's possible. I think this is kind of an interesting era that we're entering, because a lot of it depends on whether you can operationalize a good enough data distribution that's verifiable for these tasks. That's the big question: with some browser-based things, it's just hard to collect good enough data.
There's no repository of, say, browser-based tasks and rewards at a large scale that's diverse. When we see these reasoning models work, they work because there are diverse data pools of questions and answers for math, and for coding - textbook coding. So we know these systems work when you have that kind of data, but in more practical scenarios where it's harder to access those data pools, you have to get creative: if they don't exist in some easy-to-access format, is there some strategy you can invoke that will get you the data you need in a clever way?
So I think a lot of it depends on how you operationalize data collection, which is definitely a hard thing to do. And it's probably even more obscure than model training: when you read the DeepSeek paper, that is the one thing they don't really tell you anything about, right?
And I think the other thing that's really interesting is that reinforcement learning, let's say, couples to the environment you train it in. Right now we have these reasoning models, but as soon as you have environments with tools - let's say for code editing, or browsers, or other ways to interact with a computer - and you run a reinforcement learning algorithm through that, it gets coupled to those tools.
And so it actually loses generality, right? Unless you train it in some very general reasoning sort of way, it might learn to generalize to some new tools, but the system that's trained coupled to its environment is likely going to achieve much more depth-wise in that environment.
So if you are coupled to, let's say, a coding environment and a browser and some tools for doing science, and you have some way of verifying whether you're correctly answering the kinds of scientific questions scientists care about - so you've solved the data distribution problem - then you train a reinforcement learning algorithm against this environment, and it will really master the tools you gave it in that environment,
but not generalize to, let's say, tools in other environments. So there's an interesting way in which RL methods are coupled to the environments you train them in. And this goes back to what happened with reinforcement learning before language models: those systems were coupled to their environments.
So AlphaStar was coupled to the StarCraft environment, AlphaGo to the Go board. And I think now we'll see products where the reinforcement learning algorithm was trained for some tasks, where the neural network powering it is coupled to the environment it was trained against. An example of that -
I don't know if this is what's actually happening under the hood here, but when I look at a product like OpenAI's Deep Research, which is powered by o3, it makes me think that most likely what happened is that the tools for deep research - like the web browser and the indices they use for the language model to interact with - were taken along with o3, or whatever reasoning model,
and then further trained with reinforcement learning against those tools to get something aligned to them. So I think that's maybe the next limitation, but also maybe a benefit of these systems.
Steve Hsu: So myself and some other theoretical physicists I know have been experimenting with these reasoning models, to see how useful they really are for our kind of research. And one of the things I discovered is that they're quite good at finding stuff and summarizing it. But if I ask one to solve an actual research-level problem, or think about some research-level thing, it'll often come back with something more reflecting the consensus in the literature - which could be wrong if it's a frontier-level question I'm asking.
And then the frustrating part is that if it were a grad student I was talking to at the whiteboard, I could course-correct the grad student, and the grad student would immediately update their neural connections based on what I tell them, and then reason correctly afterward, incorporating that little nudge, that update, that I gave them.
But what's frustrating about the models is that I might discover some faulty reasoning, or even contradictory reasoning, in what it gives back to me, and I point it out to the model, but it can't really update on that - it just continues to give me the same line back. And so that sort of test-time learning or test-time memorization - maybe you saw this Titans paper - to me, that's super interesting. What's the right way for it to be able to actually update itself at test time? Have you thought at all about that kind of thing?
Misha Laskin: Yeah, I think it's an interesting problem. In some sense, these reasoning models inherit the priors of the pre-trained model they were trained on. And again, we have to remember that the reinforcement learning methods they're trained with today are actually a pretty weak, entry-level form of reinforcement learning.
And so this, to me, is again the statement that we're far from the move 37 moment - or maybe not that far, because if it's a matter of two years, that's a matter of perspective. But when I say far, that's kind of what I mean. And it goes back to the data distribution problem.
Like, how often did the model, when it was being trained - with RL or otherwise - see corrections, and what the appropriate responses to those corrections were, and get that reinforced? And I think a little bit of that has happened: these models went from not knowing how to backtrack very well to backtracking.
So now, when you look at R1's chain of thought, you'll see it often says, wait, maybe I should rethink this, and goes back. There are these pivot words like "wait" or "oh, wait" or "hold on", and part of that was probably mixed explicitly into the data distribution - you can run a verifier, especially if you have something that's per-step,
and see when it messes up, then inject an "oh, wait" in there and continue training on that. So I think a lot of that comes down to how you curate the data distribution you train on. But fundamentally, these systems have been trained with pretty weak RL, and so while they've learned some things that were not initially in the distribution of the pre-trained model,
generating new knowledge is very hard. And I actually had a very similar experience to you. I was wondering if it could reproduce my PhD thesis - that's basically what I tried. My PhD thesis took a lot of work, but I can actually summarize the thing that was done fairly quickly.
And if you pose the question the right way - it's a somewhat lengthy derivation, but there are only a couple of key parts that are really tricky. Effectively, my PhD was on studying various characteristics of the fractional quantum Hall effect, doing basically a perturbation-theory-like approximation of the electron density for various fractional Hall states.
And when you expand this thing, the first two moments are really easy to find: the first moment is basically undergrad physics, the second moment is a graduate course in statistical mechanics. And the third moment, which has a very interesting physical constant attached to it - it's kind of a geometric characteristic of the fractional quantum Hall state - was something I had discovered during my PhD.
And that one is not trivial - you need a PhD to solve it. And no matter how I prompted it, it just couldn't get it right. It only got the first two things, which are basically in textbooks, and it was not able to generate that net new knowledge.
Steve Hsu: Yeah, I think that's the current feeling: it is still useful. If I don't know an area and I'm just trying to get a summary of what's already known in the literature, it can deliver that succinctly, but pushing forward is just extremely hard for anything that's not present in a strong way in the existing literature.
Misha Laskin: Yeah, exactly. Let's come back to this two years from now and see what happens. I think, especially in physics and math - theoretical mathematics - we may see the changes there faster than in other fields.
Steve Hsu: Yep. So I know you have a hard stop, and we're about five minutes out from your next meeting, so let me just end with one last question.
So what's special about your perspective on AI, coming from a background in theoretical physics? Is there anything unique about the perspective that you bring? I think I asked John Schulman about this in another interview, but I'm curious what you think.
Misha Laskin: That's a good question. Well, first, physics teaches you that physics is really hard. So when you get into AI, AI is actually a lot easier than physics - at least that was my experience, that picking it up was a lot faster. And so you're unfazed by the mathematics and the things you have to learn. You have to learn how to code, and that's challenging, and becoming a really good engineer is very hard.
But once you go through the physics grinder, I think the willpower you have to learn things is sufficient, so it's all possible. I think something that's special - and it might be obvious to physicists - is trying to understand things from some simple set of first principles, deriving things from there, and looking for simple solutions.
That is not that obvious, or maybe even common, in AI. A common way to write an AI paper - or it was, and I'm sure that's still somewhat true in academia now - is to take an architecture or an algorithm and make it more complex, add complexity to get some performance gains, and then write a paper about that.
And I think that's probably the most common way of operating as a researcher in AI: you take something that exists and push it forward by making it more complex and getting some performance gain. But that's very short-lived. That's the template for writing a paper for a conference, but I'm not sure any impactful papers actually had that template.
And this perspective of coming in and actually trying to simplify things - doing even the simplest thing - and also coming in with a blank slate, having no preconceptions, is very helpful. An example of this: I came into reinforcement learning with basically zero preconceptions around what works and what doesn't.
And this was when people were studying reinforcement learning from pixels - training for robots or video games - and there were all these questions of: RL is great, but it's not data efficient, and sometimes it doesn't work in these pixel-based environments. And one of my first papers was a very simple thing. We just tried, well, what if we basically jitter the images - just random crops, basically - because maybe these systems are just always seeing the same perspective, and so they're kind of memorizing it. I took a good implementation by a colleague of a reinforcement learning algorithm called Soft Actor-Critic and implemented this random cropping, basically this jittering of the camera. And lo and behold, that simple thing outperformed, at the time, basically all the state-of-the-art algorithms that had additional levels of complexity. I'm not saying that was a particularly beautiful idea, or that impactful, but it was just surprising to me that no one had tried this very carefully before.
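A minimal sketch of the kind of random-crop jitter augmentation being described here; the pad size, image shape, and edge-padding choice are illustrative assumptions rather than the exact recipe from that work.

```python
import numpy as np

def random_crop(obs: np.ndarray, pad: int = 4) -> np.ndarray:
    """Pad an image observation by replicating its edges, then crop back to
    the original size at a random offset - a small 'camera jitter'."""
    h, w, c = obs.shape
    padded = np.pad(obs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = np.random.randint(0, 2 * pad + 1)
    left = np.random.randint(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

# Usage: apply a fresh random crop to each pixel observation sampled from the
# replay buffer before feeding it to the actor-critic networks.
obs = np.random.rand(84, 84, 3)   # stand-in for a pixel observation
augmented = random_crop(obs)
assert augmented.shape == obs.shape
```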
And so I think that's an interesting perspective that physicists bring: trying to simplify the problem as much as possible down to its core principles. And in some cases, I would say some of the most impactful work that's come from physicists coming into AI has been the work on scaling laws.
This was taking the perspective of scaling laws that occur at critical temperatures in theoretical physics - around critical points, phase transitions - and noticing that there are these scaling laws with universal characteristics attached to them,
and the perspective that something like that might be happening when you're training these deep learning models. It was not obvious to people at the time. And the folks who led that work on scaling laws at OpenAI were not even former physicists - they were, I think, either very recent former physicists or current physicists at the time.
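Schematically, the analogy being drawn is between power-law scaling of loss with compute (or model and data size) and power-law behavior near a critical point; the symbols below are generic placeholders rather than fitted values.

```latex
% Neural scaling law (schematic)    vs.    critical-point scaling (schematic)
L(C) \;\propto\; C^{-\gamma}
\qquad\longleftrightarrow\qquad
\xi(T) \;\propto\; |T - T_c|^{-\nu}
```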
Steve Hsu: I think Jared still has a position at Hopkins, I'm not sure - so maybe he's technically still a physicist. But hey, I don't want to make you late for your next meeting. I really enjoyed this conversation. Maybe we'll have you back in two years, when we have AGI in our pocket. Thanks so much.
Misha Laskin: Yeah, of course. Thanks for having me, Steve.