Steve Hsu is Professor of Theoretical Physics and Computational Mathematics, Science, and Engineering at Michigan State University. Join him for wide-ranging conversations with leading writers, scientists, technologists, academics, entrepreneurs, investors, and more.
It's not clear whether it's going to be easy to build synthetic data which, when the language model is trained on it, forces the model to really deeply understand how finance operates, the legal system, deep historical analysis, military strategy, et cetera.
So I think the risk in those areas is that synthetic data is not going to be nearly as effective as the first few trillion tokens that were generated by humans writing on those subjects. That is a real risk.
Welcome to Manifold. This is a special episode. I'm recording from Reykjavik, Iceland. I have a few hours before I have to go to the airport. I'm on my way back to Michigan, and I thought I would record some thoughts about this trip. It's been kind of a long trip, about two weeks. I started out first flying to Los Angeles to visit Caltech and give a physics seminar there.
Then I continued on to Frankfurt to attend a meeting of what I call oligarchs, i.e., the people who secretly or not so secretly run the world. And then, on my way back from Frankfurt, I stopped here in Reykjavik to give another physics seminar, and also because of my interest in deCODE, a genetics, or genomics, company, a very famous one if you're in the field, founded over 25 years ago.
And so what I'll do in the next few minutes (this will probably be a relatively short episode of Manifold) is just share some of my thoughts about my trip.
Let me start with Reykjavik. This is my first time in Iceland. As you can see outside, even in the summer, the weather here is not necessarily great.
I sort of had bad luck and it was drizzly and overcast the whole time I've been here. I was hoping to get some bright sunshine. On previous trips where I've been to places like Oslo or Stockholm or even the north of Norway in the summer, I've been very lucky and gotten really bright, warm, sunny weather.
I actually like it when the days are really long. And here the days are really long; the sun doesn't set until 10 or 11 o'clock at night. But because of the cloud cover, it's not as pleasant as those other trips I just mentioned. Still, it's an amazing place. The population here is not even 400,000 people.
And yet they have established a very high-functioning civilization. And I got to see the pinnacle of that civilization, because I visited the physics department at the University of Iceland and also got to talk to people from deCODE Genetics. Let me first talk a little bit about deCODE.
It's a very interesting company. Because people in Iceland are a little bit obsessive about genealogy, deCODE had the ambition, which they achieved, of being able to calculate genotypes for essentially the entire population of Iceland after genotyping only a fraction of it, I think maybe half the population.
But by genotyping half the population and knowing the genealogical tree for the country, they're able to, in a sense, impute the genotypes of everybody in the country. And they've published many, many really impressive papers using this data set.
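To give a cartoon of the idea, and this is just my toy illustration, not deCODE's actual method (their long-range phasing across the national pedigree is far more sophisticated), here's how knowing the family tree lets you fill in an ungenotyped person from genotyped relatives:

```python
# Toy Mendelian imputation at a single biallelic site: if both parents are
# genotyped, a child's expected genotype (alternate-allele count 0, 1, or 2)
# can be estimated without ever measuring the child directly.

def transmit_prob(parent_genotype: int) -> float:
    """Probability a parent transmits the alternate allele, given 0, 1, or 2 copies."""
    return parent_genotype / 2.0

def expected_child_genotype(mother: int, father: int) -> float:
    """Expected alternate-allele count in the child, by Mendelian transmission."""
    return transmit_prob(mother) + transmit_prob(father)

# Example: mother heterozygous (1 copy), father homozygous alternate (2 copies).
print(expected_child_genotype(1, 2))  # -> 1.5 expected copies
```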
The history of the company is quite interesting. It went public quite a long time ago, but I think ultimately wasn't very profitable, in a way that mirrors the trajectory of 23andMe, which I could discuss in a minute. deCODE was then acquired by Amgen, so it's now a division of Amgen. And for a lot of the genetic science developed in the last 20 years or so, the natural place people might expect it to generate profits, revenue, economic returns, is drug discovery.
But most major diseases are not monogenic, or even close to monogenic. The risk an individual carries is typically influenced by hundreds or thousands of different loci, specific genetic variants within the genome. So there isn't necessarily a good drug target, even after you've fully characterized the genetic architecture of the disease.
Nevertheless, it was always an interest of mine to meet people at deCODE, because they were, in a sense, the people who got the ambitious big-data aspect of genomics going, I think, before anybody else. And they have a really amazing building right on the University of Iceland campus. I posted some photos of their complex on my Twitter feed; maybe I'll put a link to that in the show notes.
Now, the other aspect of Reykjavik, which I hadn't thought of at all until I got here, was the World Chess Championship in 1972, where Bobby Fischer defeated Boris Spassky and became, well, maybe not the first world chess champion, but the first American world chess champion of the modern era. In 1972 I was a little kid, and I was actually a chess player at the time, so I remember very vividly the excitement around Bobby Fischer. I don't know if I remember the games being on TV; I think I actually do. But those events, and Bobby Fischer's charisma, really grabbed the attention of Americans and elevated chess from near-total obscurity to, for a brief time, something everyone was interested in.
Then I think interest waned again. And I guess recently among young people, there's been a resurgence of interest in chess.
And so wandering around Reykjavik the past few days made me think about Bobby Fischer, and I went back and watched some old interviews with Bobby on YouTube. There's also a great documentary, I think it's called Bobby Fischer Against the World, and there's a free version of it on YouTube, which I watched again last night. In fact, at the end of that documentary, Bobby Fischer meets Kári Stefánsson, the founder of deCODE Genetics. So there's a fun little coincidence there.
Now, I am interested in chess, and I actually read quite a few books about chess when I was a kid. But I never really wanted to be a chess competitor, because once I realized you had to study openings and memorize a lot of stuff, I felt, and this is a weird thing for a young kid to think about, but I guess I was a bit of a precocious kid, that the ROI on this was not very good. By ROI, I didn't mean fame and fortune, or the chances of becoming a world champion like Bobby Fischer. What I meant was the intellectual ROI. Because in reading these chess books, even though I wasn't a super strong player, I could understand the theory of chess.
So, in other words, even in advanced books, written, for example, by Bobby Fischer (I think I had a book called Bobby Fischer Teaches Chess), I could more or less understand the theory. I didn't necessarily have the deepest intuitions that a long-time player would have, but I could understand the arguments they made about controlling certain places on the board, why certain positions were inferior to others, certain lines of play, the strengths and weaknesses of certain openings, et cetera. So I could understand all of that theory, and I thought: well, if I can already understand the theory, what is the point of investing so many endless hours to become a great player of what is ultimately a completely artificial game in a tiny, artificial, finite world?
And at the same time, I thought, well, if you study something like mathematics or physics, the intellectual ROI is unbounded because there's always more for us to understand about mathematics or about our universe.
So already at that time I made the judgment that physics and mathematics and other intellectual pursuits, and I actually started thinking about artificial intelligence when I was pretty young, were ultimately deeper subjects than chess. So I gave up on chess after a while, even though I still enjoy reading about colorful historical figures like Bobby Fischer, or even occasionally looking at a game.
I had, well, I have an old friend who is also a theoretical physicist. He and I were junior fellows together. His name is Jonathan Yedidia, and he's really a brilliant guy. He was a condensed matter theorist in physics; I think his PhD advisor was the Nobelist Phil Anderson, who contributed to the theory of superconductivity.
And Jonathan was not only a very talented physicist, he was also a very talented chess player. And he actually advocated to me that universities should have departments of chess, that chess was every bit as deep and interesting as some of the other academic disciplines. Maybe not physics and math, but some of the other academic disciplines.
He felt chess was worthy of having a department at a university. So he used to say, oh, they should have a department of chess at Boston University or Harvard, and people could just study chess there. And in a sense, in the old Soviet system they did, because they funded huge amounts of chess activity and chess education, and they funded the activities of their grandmasters.
I guess they had hundreds of grandmasters. So it's not that crazy to think about state support for an activity like chess. And although I sort of agreed with Jonathan that chess is more worthy than some disciplines in the academy, relative to the most impactful disciplines it's probably, at least as I concluded when I was a kid, not as valuable or as deep.
Now, Jonathan, interestingly, left physics and devoted himself to chess for several years, and I believe his top showing was winning, or placing in the top few, in the US championship. At that point he was one of the top, though not the top, US players. But eventually he gave up on chess and became, of all things, an AI researcher, and he's been working in AI research since. So a very interesting story: he's the closest friend I've had who's been a really top-level, say world-class or nearly world-class, chess player.
I just mentioned AI, so let me back up and talk about a panel that I was on at the meeting in Frankfurt.
Also on this panel was Jonathan Ross, who's the CEO of Groq. That's Groq spelled G-R-O-Q, not G-R-O-K; Grok with a K is the Elon Musk AI, which is linked to X. Groq is a company that makes specialized chips that allow very fast inference with large language models. So Jonathan and I were on this panel discussing the limitations of large language models and how we think AI is going to evolve in the next few years.
Obviously he's super interested, because Groq stands to benefit a lot if there's a vast amount of inference being done: these models run really, really fast on their hardware, and their hardware is very energy efficient, computationally efficient, et cetera. So his company could become super valuable based on the demand for inference in the coming years.
Incidentally, Jonathan told me his life story, and he was actually the designer of the TPU, the tensor processing unit. Google, I don't know how many years ago this was, maybe five or seven years ago, switched essentially all their compute over to TPUs, and Jonathan was the guy who got the ball rolling at Google; he was at Google at the time. He has since left Google to found Groq.
But anyway, I'm not supposed to publicly discuss things that were said at this meeting in Frankfurt; it's kind of secretive. But I can discuss things that I said on the panel, which are things I've tweeted about, though I don't know that I've discussed them on another podcast. So let me share them with you now. One has to do with hyperscaling, and what is likely to happen with hyperscaling.
So we have a very remarkable situation right now, which I think most people are not aware of. Most people who follow technology or AI would be aware that these companies are spending a huge amount of money on training more sophisticated models. And there are two separate categories of these companies.
Some are basically big tech monopolies, like Meta or Google, and others are startups, like OpenAI and Anthropic, that actually have to raise money to fund their activities. I think people are aware that this is going on, but people who have not looked at the numbers may not be aware that the capex spent on data centers and NVIDIA chips and computers is in the hundreds of billions of dollars.
So it's approaching something like a percent of our GDP. If US GDP is roughly 30 trillion dollars a year and you take one percent of that, that's about 300 billion a year; it's approaching those kinds of numbers. Now, about training runs: when I quote specific numbers, one of the best places to quote from is the paper Meta published when they released the latest version of Llama, Llama 405B. Meta is particularly open. That's an open-source model: you can download the weights, use it, modify it. Consequently, they're also quite open about their research, how much money they spent, how their training runs went, et cetera, more so than some of the companies that are closed source. So when I give you some benchmark numbers, I'm going to quote from the Meta paper, because they're very clear about it.
So I think they spent something like a hundred million dollars on this training run for Llama 405B. The next scale-up would be roughly an order of magnitude more, so 10 times more expensive. Companies are basically gearing up to spend a billion dollars on the next version of their foundation LLM.
This word hyperscaling refers to empirical relationships between three quantities: the amount of computation you do to train, or optimize, the model; the size of the model, i.e., how many neural connections there are, how many, if you like, floating point 16 or floating point 8 variables actually characterize the internals of the model; and the amount of training data used. There are power-law relationships between these three quantities, established empirically. So if you want to improve the performance of your model optimally, there's a relationship telling you: if I increase the amount of compute I'm willing to pay for by 10x, how much bigger should I make the model, and how much more data should I have available for training, in order to stay on this optimal curve of progress?
And so these companies I mentioned, and others, are trying to hyperscale, to move along this curve, in the hopes that succeeding generations of their large language models will have new emergent capabilities, qualitatively different from what the model could do before, and therefore extremely valuable, maybe even getting to AGI itself.
So in this last training run, Meta spent about a hundred million dollars; the next run will be about a billion dollars of compute. And what's interesting is what the empirical scaling laws say about the amount of data you need, at least for their architecture. I should mention that these scaling laws can be different for different architectures, and I'll get back to that in a moment. But for the Meta people and their specific Llama architecture: if they increase the compute by, say, 10x, they need roughly the square root of that, about 3.2 times, more training data. So a bit more than three times more words, or tokens, in the language of the AI field.
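To make that arithmetic concrete, here's a minimal sketch in Python. The C ≈ 6·N·D rule of thumb for training compute, and the square-root split of extra compute between parameters and tokens, are standard approximations from the scaling-law literature, not Meta's published recipe; the starting numbers are just the round figures quoted in this episode.

```python
import math

# Minimal scaling-law bookkeeping (illustrative assumptions, not any lab's recipe).
# Training FLOPs C ~ 6 * N * D, where N = parameters and D = training tokens.
# A compute increase is split evenly in log space between N and D, so each
# grows like sqrt(compute_factor): this reproduces "10x compute -> ~3.2x data".

def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough total training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

def scale_up(n_params: float, n_tokens: float, compute_factor: float) -> tuple[float, float]:
    """Grow parameters and tokens each by sqrt(compute_factor)."""
    s = math.sqrt(compute_factor)
    return n_params * s, n_tokens * s

n, d = 405e9, 15e12                       # Llama-405B-scale run: 405B params, 15T tokens
print(f"current run: {training_flops(n, d):.1e} FLOPs")
n2, d2 = scale_up(n, d, 10.0)             # the ~10x, roughly billion-dollar next step
print(f"next run: ~{n2/1e9:.0f}B params, ~{d2/1e12:.0f}T tokens")  # ~1281B, ~47T
```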
For Llama 405B they used about 15 trillion tokens in training, so they would need about 50 trillion tokens for the next run. And for reference, the whole internet plus all the volumes of books in our libraries is only roughly three to five trillion tokens, I would say 5 trillion max. So we've already used up essentially all the writing that humans have ever done.
And even for Llama 405B, some of the data, some of those 15 trillion tokens, were probably synthetically generated. By synthetically generated, I mean they used an AI to create sample language that is then used for training. And if you think that sounds a little dangerous, or might lead to unpredictable consequences, I think you're right.
So for this next scaling step: if 5 trillion tokens is everything humans have ever written, and 50 trillion is what they need, then 90 percent of the training data will have to be synthetic.
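Spelled out, with the same illustrative round numbers:

```python
# The "90 percent synthetic" arithmetic, using the round figures quoted above.
human_tokens = 5e12       # rough upper bound on useful human-written text
needed_tokens = 50e12     # ~ sqrt(10) * 15T, rounded up, for the next scaling step
synthetic_share = (needed_tokens - human_tokens) / needed_tokens
print(f"{synthetic_share:.0%} of the next run's data would be synthetic")  # -> 90%
```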
Now, when you talk to the top people at these companies, which I have done, they'll often say something very ambiguous, like: well, we think we have synthetic data under control, we think we know how to do it. But they're usually not very explicit, because how to prepare useful synthetic data for training the next generation of models is obviously something of a trade secret.
And I also think they can't really be entirely sure, because they will not necessarily have gone far enough to where, say, 90 percent of the data is synthetic, versus only 50 percent or less in earlier research. I don't think they can be completely sure that everything is going to work out right.
Now, there are areas where you can imagine synthetic data working really well, and there were some recent papers about this. For example, DeepMind has built a model that is really good at solving International Mathematical Olympiad problems. These are very difficult problems where the solution actually involves a kind of proof.
And they built a model that scores at the silver medal IMO level, which puts it among the top kids on the planet in terms of the ability to do this; it is way better than the average human at this kind of mathematics. And for that they used a lot of synthetic data. One could say that for a very well-defined field like mathematics or physics, where the basic rules are understood and a lot of techniques are well understood, one could just build software which combinatorially combines different axioms and theorems and tricks to produce problems, proofs, and solutions.
These are valid, and they actually exercise those tricks and the concepts behind the axioms. Then, by training the model to predict how the proof goes, or to build its own version of the proof, you're forcing it to build, in its neural network structure, something that encodes deep concepts about mathematics. So you can imagine that in that domain this kind of synthetic data could work. It's a little bit weird, though, because what's happening is that humans have to understand the essence of the math well enough to build a thing, usually not a neural net but some other kind of software, theorem-proving software or combinatorial software, which produces valid new problems and solutions to those problems. But then they have to go through the LLM training step to force the LLM neural net to instantiate, in its internal connections, a quote-unquote understanding of all that.
So there's this weird process where we use computers to make synthetic data, and then we go through the standard gradient-descent training process for large language models to get the LLM to quote-unquote understand the core concepts, tricks, and techniques that were understood and codified by humans first, in building that first software step. It's a very interesting situation.
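As a cartoon of that first software step, and this is my toy illustration, not DeepMind's actual system (which uses formal theorem provers), here's a generator that combinatorially produces valid problem/solution training pairs:

```python
# Toy "synthetic math data" generator: combinatorially emits valid
# (problem, worked solution) pairs that an LLM could then be trained on.
import random

def make_linear_problem(rng: random.Random) -> tuple[str, str]:
    """Generate 'solve a*x + b = c' with a guaranteed integer solution."""
    a = rng.randint(2, 9)
    x = rng.randint(-10, 10)   # choose the answer first, so every pair is valid
    b = rng.randint(-20, 20)
    c = a * x + b
    problem = f"Solve for x: {a}x + {b} = {c}"
    solution = (f"Subtract {b} from both sides: {a}x = {c - b}. "
                f"Divide by {a}: x = {(c - b) // a}.")
    return problem, solution

rng = random.Random(0)
for _ in range(3):
    p, s = make_linear_problem(rng)
    print(p, "->", s)
```

The key property, even in this trivial case, is that the generator only ever emits correct data, so the model being trained on it is pushed toward the underlying rules rather than toward noise.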
Now, the places where the models don't do so well, and where people would like them to do better, are in analyzing things like business problems, finance, legal issues, medicine. Those are a lot fuzzier, less structured, than mathematics or physics. So it's not clear whether it's going to be easy to build synthetic data which, when the language model is trained on it, forces the model to really deeply understand how finance operates, or the legal system, or deep historical analysis, or military strategy, et cetera.
So I think the risk in those areas is that synthetic data is not going to be nearly as effective as the first few trillion tokens that were generated by humans writing on those subjects. That is a real risk.
And so, if the hyperscaling effort fails, and by fails I don't mean the models stop improving entirely, but that they don't improve at the rate predicted by the hyperscaling relationships, then to me the most likely cause will have been the synthetic data problem.
So that's something to keep your eye on.
Now, on the panel we discussed more than just synthetic data; I was the one who kept bringing up that issue. The broader question is whether the current architecture of these large language models, the transformer architecture, is the final word. Is that the architecture that gets us to AGI? There are different positions on this.
One position is that the transformer model is enough, and that just by tweaking it a little bit, and staying on this hyperscaling curve, we can reach something like AGI, or really powerful emergent capabilities, maybe enough that the model can be its own AI scientist and recursively improve itself, build better versions of itself, maybe at a scary gigahertz speed, out of the control of humans.
Right. So we don't know how this is going to play out. I think there are a lot of people who suspect that the transformer architecture is kind of limited, that it's missing certain things. For example, if I show a little kid some information, that kid is able, based on a relatively small amount of data, to actually update their own connections. If I show them a picture of a horse with wings: they're probably familiar with the concept of wings, they know what a horse is, but they've never seen a horse with wings. They then will think about what this winged horse is like. Oh, is it more like a bird?
Does it eat like a horse? Can it really fly? Et cetera. And through that reflection, that self-reflection, the human kid updates their own internal neural connections, and after a while they have a quote-unquote deep understanding of this new concept of a winged horse.
Language models are not doing that. There's the training, which establishes the weights of the model, the connection strengths. After that, you're just using the model. If you're having a long conversation with it, trying to get it to help you solve some problem, it's not updating its weights. It's just giving you immediate responses: you give it a prompt, it gives you an answer; you give it a prompt, it gives you an answer. It's not updating itself or thinking deeply, going away to ponder the issues you've raised or the new information you've given it during the conversation. It's not able to do that.
And there are sort of two approaches to this. One is to embed the language model in a larger software architecture, where some of the information the human interlocutor has given the AI is stored somewhere else and can be made available again to the LLM. The other is to try to incorporate everything I just described inside one giant neural net, which is able to dynamically update its weights as it learns about the world and goes through interactions.
Both of these are works in progress. Our company SuperFocus actually builds the first kind: an externalized software architecture which uses LLMs as subcomponents, but uses more traditional software to do things like store a memory of what was said earlier, or store base information that the model should rely on.
So it's not modifying weights or connection strengths inside a neural network, but it is storing things, and in a sense updating the overall understanding of the AI, which includes the LLMs, without changing the LLM structure itself. What some people would hope to do is build a big neural net, more like a child's brain, capable of doing all the things I just described, but within the neural net architecture itself.
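Here's a minimal sketch, in Python, of that hybrid pattern: a frozen LLM wrapped in ordinary software that keeps memory outside the network. The `llm_complete` function is a hypothetical stand-in for any completion API, and none of this is SuperFocus's actual architecture; it's just an illustration of the general idea.

```python
from dataclasses import dataclass, field

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion API call to a frozen LLM."""
    return "(model answer based on: " + prompt[:40] + "...)"

@dataclass
class HybridAgent:
    facts: list[str] = field(default_factory=list)    # curated base knowledge
    history: list[str] = field(default_factory=list)  # conversation memory

    def remember(self, fact: str) -> None:
        """Store new information outside the net; no weight updates occur."""
        self.facts.append(fact)

    def ask(self, question: str) -> str:
        # Traditional software assembles the context; the LLM itself never changes.
        context = "\n".join(self.facts + self.history[-10:])
        prompt = f"Use only this information:\n{context}\n\nQuestion: {question}"
        answer = llm_complete(prompt)
        self.history.append(f"Q: {question}\nA: {answer}")
        return answer

agent = HybridAgent()
agent.remember("A winged horse eats like a horse but can also fly.")
print(agent.ask("Can a winged horse fly?"))
```

The design point is that all "learning" lives in `facts` and `history`, ordinary data structures, while the neural net's weights stay fixed.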
But I want to caution you that that may rely on multiple research breakthroughs. People don't know how to do it just yet, and it would take a lot of research to get to that point. And research is unpredictable: it might take a decade of research to build that second type of AI that I just described.
I am confident that even with no additional improvement of LLMs, or maybe just one or two more steps on the hyperscaling trajectory, we will be able to build these hybrid systems, which combine LLMs with more traditional, old-time software and computer science techniques. And that thing will look like an AGI to most people. It will be able to do things like solve problems for you and give you advice. In our case, for our company, we're using them for customer support applications. They're not quite good enough to replace a management consultant, or an associate at a law firm, or an MD specialist. They're not good enough to do that yet, and that may be a thing for the future; at the moment I don't think it's really possible. But in a few years, maybe; and in five or ten years, I think, absolutely.
So that's what we discussed on this panel. And you may not be an oligarch, but by listening to this podcast, you can have access to some of the information that these oligarchs have access to.
Let me conclude by talking a little bit about physics.
So really the bookends of this trip were seminars I gave at Caltech and at the University of Iceland on black hole information, quantum black holes. Obviously this is a very specialized topic, so it's not really something I'm going to be able to convey to you, certainly not on a podcast, certainly not without equations.
Now, the slides for my talk are available online; if you look at my Substack or my Twitter feed you can find links, as I occasionally post links to the slides. But it's a very specialized topic, so I'll just say a few words in case people are interested.
So, we are interested in the following question.
I have some compact object. It could be a black hole, or it could just be a rock, and that object, if it has mass or energy, is a source for a gravitational field: its existence, its mass and energy, causes a gravitational field. And the question is this: suppose I am able to make really precise measurements, at the quantum level, on that gravitational field. I might be very far away from this compact source, light years away. But if I could in principle make arbitrarily precise measurements of the gravitational field sourced by the compact object, could I determine the internal quantum state of that object? This is a particular research area that people are interested in.
It's directly relevant to something called the information paradox, or the information problem, which Hawking first proposed almost 50 years ago; I guess he proposed it in 1974, so it's about 50 years old. And it's still unresolved. People are still trying to understand whether the evaporation of a black hole is, to use the fancy terminology of quantum mechanics, unitary: if the black hole is prepared in what's called a pure state, is the radiation it ultimately evaporates into also in a pure state? And that is related to whether we could time-reverse the radiation state and get back the original internal state of the black hole.
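For listeners with a physics background, here is the statement in symbols; this is standard quantum mechanics, just my shorthand for what was said above:

```latex
% Unitarity of black hole evaporation: if the evolution from the initial
% black hole state to the final radiation state is unitary, purity is
% preserved and the process is reversible in principle.
\[
  |\psi_{\mathrm{rad}}\rangle = S\,|\psi_{\mathrm{BH}}\rangle ,
  \qquad S^{\dagger} S = \mathbb{1}
\]
\[
  \Rightarrow\;
  \rho_{\mathrm{rad}} = S\,|\psi_{\mathrm{BH}}\rangle\langle\psi_{\mathrm{BH}}|\,S^{\dagger}
  \ \text{is pure}\ \big(\operatorname{Tr}\rho_{\mathrm{rad}}^{2} = 1\big),
  \qquad
  |\psi_{\mathrm{BH}}\rangle = S^{\dagger}\,|\psi_{\mathrm{rad}}\rangle .
\]
```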
It sounds very esoteric, but the ideas are fundamental to physics. When something falls into a black hole, at least in classical general relativity, it appears that the thing that fell behind the horizon is no longer able to influence the world outside. So in classical gravitational physics we have the concept of a horizon, and the horizon cuts off the ability of things behind it to causally affect things outside the black hole.
It's a very deep concept, maybe the deepest concept that arises in general relativity: the idea of horizons, event horizons. But in quantum mechanics we see something else: information can't be destroyed. If you have a quantum mechanical system and you evolve it forward and backward in time, information is never destroyed.
The state of the quantum system at time zero determines the state at time one, time two, time three. So there's a clash between these two deep properties, one from general relativity and one from quantum mechanics, and that tension was concretized by Hawking in analyzing the evaporation of a black hole.
And to this day people still debate exactly what happens at the fully quantum mechanical level when a black hole evaporates. I had very spirited discussions about this, both at Caltech and here in Iceland. There's a group here in Iceland, and the professor who was my host is a specialist in this area.
He's been working in this area for a long time, and he's a long-time collaborator of Lenny Susskind at Stanford, who has been working on this problem. So, I don't know, perhaps what I just said was not at all comprehensible to people who don't have some pretty high-level training in physics, but I thought I would share it with you.
Today I'm going to get on a flight and go back to Michigan. I've been traveling a lot in the last few months, since the spring semester ended at MSU; I've been all over the place. I can't even count all the different events I've attended or meetings I've had this summer, but it's been a really great summer.
I need to actually go home and recharge a little bit. I'll have a week or ten days before the fall semester starts, and hopefully I can be ready for that.
So thanks for listening. This is a pretty short podcast; I think I've said all the things I wanted to say, and given you a little update, a little color, on what I've been up to for the last few weeks.
Thanks for listening.