Steve discusses Large Language Model AIs such as ChatGPT.
Steve Hsu is Professor of Theoretical Physics and Computational Mathematics, Science, and Engineering at Michigan State University. Join him for wide-ranging conversations with leading writers, scientists, technologists, academics, entrepreneurs, investors, and more.
Welcome to Manifold. This is Steve. Today, it's just going to be me. I have no guest. I'm going to talk about ChatGPT, large language models, and AI.
Let me start by saying a little bit about how large language models work. Some years ago I blogged about word vectors: the idea that one could characterize the space of concepts that human minds work in as a kind of abstract, high-dimensional space.
We know from languages like written Chinese, in which, in a sense, one can interpret the individual characters or ideograms as standing for particular concepts, that human thought can be built up from just a few thousand or so primitive concepts. That is, a kid can start reading the newspaper once they've mastered a few thousand Chinese characters, or maybe even a thousand.
And so we know that although what we do in our heads, combining these concepts, is very complicated, the base set of primitives is relatively small. And so one could describe the space of human concepts in which our minds work as having a dimensionality of perhaps a few thousand, or for a highly educated person, 5,000 or 7,000.
Now, these large language models have been trained to operate in what is effectively a roughly 10,000-dimensional space, which means if you ask the LLM to give you a definition of a fairly esoteric word or concept, such as gradient descent or DNA recombination, it will actually do a good job of that, because it actually understands the primitive concept of DNA or DNA recombination as separate from other nearby concepts.
Now, this embedding space that concepts are mapped into by the LLMs has a certain approximately linear characteristic. So if you take the vector that the word king is mapped to, and you subtract the vector that the word queen is mapped to, the resulting residual vector, the difference between those two vectors, is very similar, almost identical, to the vector you get if you take the vector for man and subtract the vector for woman.
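To make that concrete, here is a minimal sketch in Python of the kind of check you can do yourself, assuming the gensim package and one of its small downloadable sets of pre-trained word vectors; this is just an illustration of the linearity, not how the big LLMs themselves are built.

```python
# Minimal sketch: checking the approximate linearity of word embeddings.
# Assumes the gensim package and its downloadable GloVe vectors are available.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # a small public set of word vectors

king, queen = vectors["king"], vectors["queen"]
man, woman = vectors["man"], vectors["woman"]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# If the concept space is roughly linear, these two difference vectors
# (king minus queen, man minus woman) should point in nearly the same direction.
print(cosine(king - queen, man - woman))
```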
So for people who understand what a linear vector space is, there are aspects of the space of concepts that we use which have an approximately linear characteristic. Now, that linearity isn't going to hold if you start combining concepts into sentences or paragraphs. There, for example, the order of the words, the way that concepts act on each other, is non-linear. But the training of these large language models has produced neural nets with hundreds of billions of connections, which can faithfully map individual words, sentences, paragraphs, even multiple paragraphs, pages of text, into the concept space properly, both mapping forward and mapping back.
And I apologize if this initial introduction to the subject is too abstract for some listeners. If you didn't follow what I just said, you can just forget about it and we'll continue. The practical story is that we have LLMs which can map from natural human language into this abstract concept space and back.
And that means that we have AIs now that understand human language and can express themselves very clearly in human language. These LLMs were built by training very large models on a corpus which, roughly speaking, to a first approximation, contains all the English text available on the internet. And the actual optimization problem, the neural net training, chooses as the objective function the ability of the neural net to complete a sentence.
So given N words in a row from human language, it's trained to be able to predict what the N-plus-first word, the next word, is. So a very simple objective function was used in the training. But in order to become good at solving, at optimizing, that objective function, the neural net has to end up encoding all sorts of aspects of human thought and human language and relationships between concepts. And it even memorizes, in some sense, facts that are contained inside the original training corpus.
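As a toy illustration of that objective, not OpenAI's actual training code, the next-word prediction loss looks roughly like this in PyTorch; the tiny network here is just a stand-in for the real transformer.

```python
# Toy sketch of the next-word training objective: given N tokens of context,
# score the model's prediction of token N+1 with cross-entropy loss.
# The tiny network below is a stand-in for a real transformer.
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 1000, 64, 8

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),
    nn.Linear(embed_dim * context_len, vocab_size),
)

context = torch.randint(0, vocab_size, (1, context_len))  # the N preceding words
next_word = torch.randint(0, vocab_size, (1,))            # the word to be predicted

logits = model(context)                                   # scores over the whole vocabulary
loss = nn.functional.cross_entropy(logits, next_word)
loss.backward()  # gradients from this loss are what tune the billions of connections
```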
When you have hundreds of billions of connections, which are tuned in the training process by trying to do sentence completion on essentially all human-generated text, the neural network ends up, in a way, memorizing. Roughly speaking, I would say at least hundreds of gigabytes of information are somehow hard-coded in the connections. Individual facts are memorized, but the structure of the neural net also allows it to process information the way that a human brain would process language.
So again, apologies if that's a little bit too technical, but I just want to set the stage for the rest of this podcast, which will be much less technical and will focus more on the practical aspects of humans living in a world with good LLMs, because that's the phase that we've now entered.
Probably everyone listening to this podcast is familiar with ChatGPT. I'm guessing almost everybody who's listening has actually spent a little bit of time playing with it. ChatGPT is a specific LLM created by OpenAI. Part of the process of making ChatGPT from the earlier instances of GPT, like GPT-3 and GPT-3.5, involved something called reinforcement learning from human feedback.
So they used a lot of human feedback, based on ratings, say, of query-answer pairs, or on human generation of query-answer pairs, in order to train and refine the neural net so that it can do all the things that ChatGPT can do.
The OpenAI expert on reinforcement learning is one of the founders of OpenAI, John Schulman, and in the show notes I will link to an interview I did with John some years back. John was a physics major at Caltech and then did a PhD at Berkeley in computer science before ultimately co-founding OpenAI. He's one of the RL, or reinforcement learning, gurus who, in my understanding, deserves a lot of credit for the improvements that ChatGPT shows relative to the earlier instances of GPT.
And just a comment for the experts. The main architecture of the neural networks that produce these LLMs is based on the idea of transformers, and I guess the transformer papers are maybe five years old. The first papers that talked about word vectors, mapping out the space of human concepts in an embedded vector space, those are maybe almost ten years old now. So you get a sense of academic or quasi-academic research in AI gradually making its way into industry and then producing these big breakthroughs.
Now, these big breakthroughs were not really possible without huge CapEx, because my understanding is that for training these large models, you end up spending tens of millions of dollars in just raw compute. And so at this point, there are very, very few purely academic research groups, ones without some affiliation with, say, Facebook or Google or OpenAI, that can actually build state-of-the-art LLMs. The one example I'm aware of is at Tsinghua University in Beijing: they have produced a leading-edge language model which I think is actually slightly better on performance metrics than GPT-3, or maybe 3.5.
Now, they were able to do that in collaboration with a well-capitalized startup, which I think was spun out of Tsinghua, so even that wasn't a purely academic research project. So if you're a government policymaker and you're listening to this, you might consider whether the Department of Energy or the National Science Foundation, or the analogous funding agency of another country like the UK or Germany, should pay to create a national LLM so that their researchers are not left out in the cold in this very rapidly advancing field.
Now let's talk about the impact of ChatGPT. It may be that ChatGPT exhibited the fastest product adoption of all time. In some surveys, I think Bloomberg did this one, it was found that 30% of all professionals had actually used ChatGPT, and that was maybe a month ago, or at least a few weeks ago. And that number just keeps going up. There was another survey, I think of high school kids, and something like 80% of high school kids had actually experimented with ChatGPT or maybe even used it to try to do their homework.
So it's clearly an inflection point, and ChatGPT is bound to have a huge impact on society. I was just listening to a podcast; sorry, no, actually it was a kind of video get-together, a forum among a bunch of academics who are not technical: people who teach writing, social scientists, historians. And listening to these people talk about ChatGPT with, honestly, very little technical understanding of things like neural nets or AI was extremely interesting. Even they realized that this is not only going to lead to a new era in the way that they interact with students and the way that students do homework, but even in the way that they themselves write their papers to be submitted to academic journals.
I think one of them said they actually knew of examples where they or their colleagues had used ChatGPT to help them refine papers that, you know, ultimately have now been submitted to journals.
To put it very succinctly, a lot of people, even those who have just spent a little bit of time with ChatGPT, realize that we're entering a new era. It's a new inflection point. And for someone my age who remembers when the internet first had a big impact on our lives, and then later when smartphones did, this feels like a very special moment where we're gonna see some very dramatic changes. I think, for example, the way students do homework a year from now, or six months from now, is gonna be extremely different from the way they did it six months ago. So this is a very exciting time.
Now, one impact of LLMs which isn't widely known to people who aren't software developers is that these LLMs can not only map natural language into a concept space, and then from that concept space back to natural language, or back to a different natural language, say translating between English and Chinese.
They also can take an algorithmic description in natural language and implement it in code, say in Python or C. And that capability has been available for some time now on GitHub. GitHub is a place where developers store their code, or actually work in the GitHub development environment. GitHub was acquired by Microsoft some time ago, and as you know, Microsoft is a partner of OpenAI. In fact, they're one of the biggest funders, in terms of providing financial resources and also compute resources, of OpenAI.
So together they've created something called Copilot, which is available on GitHub. And so if you are writing code on GitHub, you can try things like specifying the algorithm, and maybe even the APIs that you want to use, in natural language, and you've got a good chance that the AI will actually produce usable code from it. And if the code is not completely usable, it'll be almost usable, where you just need to modify it somewhat, but it's still a huge time savings.
So already in that world of software development, I would say LLMs have made a kind of order-one change in the way that many software developers are working, and we're gonna see it more and more. I actually think of ChatGPT and things like it as the wordcel versions of Copilot. So for non-technical people who are mainly just dealing with words, reading words, writing words, summarizing words, that is where ChatGPT and similar LLMs are gonna have a huge impact.
So let me describe the AI landscape a little bit. Right now, if you try to list the set of companies that are capable of building and shipping an LLM similar to ChatGPT, well, you have OpenAI and Microsoft collaborating. I think Microsoft will incorporate ChatGPT-like functionality into its OS eventually, probably as soon as they can get it to a quality level where it looks useful to the average Windows user. Then you'll start to see it in that environment. There's an OpenAI spin-out company called Anthropic, which raised a huge sum of money and has its own ChatGPT-like AI; I think it's called Claude. I think that's still in closed trials, not open to the public yet.
Google and DeepMind, I believe, are every bit as capable of building something like ChatGPT, and probably even something slightly better. The Chinese internet search company Baidu, which is sort of like the Google of China, has announced that in March they're gonna release LLM functionality integrated with their search.
That's the thing over which Google famously declared a, quote, code red, because people who understand traditional search realize that the advent of LLM capabilities is potentially gonna modify very much how humans search for information, and so could potentially jeopardize the existing search engine business model.
So I expect Google and Baidu to be at the forefront of this. There are smaller companies which you would only know about if you're actually doing things with LLMs. There's a company called Hugging Face, for example, which builds sort of smaller, fine-tuned LLMs, but they have a lot of useful functionality. There are other Chinese startups that are building LLMs, for example the one that I mentioned that collaborates with Tsinghua University.
I don't know where Apple actually is in their technical development of LLMs, but they're gonna have to play catch-up, because Google will be able to release this on Android and Microsoft is gonna release it on Windows. So Apple is going to have to have something similar, or comparable capabilities, to release on iOS for the iPhone and also for OS X on the Mac platform. So I think Apple might be the company that's really under the gun here. Meta, Facebook, has been traditionally strong in AI, so I imagine they're gonna do something as well.
So one of the things that I discuss a lot with venture capitalists right now is what this is all gonna look like, what this ecosystem is gonna look like in a few years. And right now, I would say the conventional picture is that it's these big entities with lots of CapEx to spare, who can afford big compute and big models, and who have a lot of expertise in terms of people who are good at implementing algorithms at scale on neural net architectures. There's a limited set of companies like that in the world, maybe no more than a dozen.
And the feeling among a lot of venture capitalists is that those companies are really gonna dominate this space. At least, that's what's happened so far: all the big leaps in LLM capabilities, say from OpenAI and others, have come from companies of that sort. But if you think carefully about this, you can come to a different conclusion. What's the right way to say this? There's a slightly unconventional prediction, which I actually believe, for how this business ecosystem is gonna shake out. And I already see indications of this now.
So I think there's a good chance that there will be an entire ecosystem of competitors trying to sell access through APIs in the cloud to their LLM models. And OpenAI already does this, so many startups are building on top of the OpenAI platform. Anthropic is trying to get to that point. Google may introduce a platform. Baidu might. Hugging Face is already basically earning revenues mainly from startups. So it could be that large LLM capability gets commoditized, in the sense that there'll be at least multiple players, perhaps many players, competing on price and quality to offer LLM integration or LLM access via the cloud. So LLMs may just become part of cloud infrastructure.
And we can already see some trends heading that way. It may go even further as more and more companies catch up with OpenAI in terms of the kind of platform capabilities that they can offer.
Now, an interesting thing that we should take into account when we think about this is that the amount of training data of the type so far used, i.e. large amounts of text scraped off the internet or maybe, you know, from books in libraries, that amount of data is limited. We are not going to get another 10x in human generated high quality text for training purposes.
So the models can get 10 times bigger, right? Instead of a hundred billion connections, you might go to a trillion connections. And the amount of money that you spend on tuning your model can go up by an order of magnitude. But the amount of good training data, at least of the sort that's primarily been used up till now, is not gonna go up by another order of magnitude.
So where is the new training data going to come from in order for the next evolution of this LLM technology? I think the most inexhaustible source of more data is interactions between the LLMs and humans. So that's human feedback training data. And as I already mentioned, our previous guest John Schulman was a leader of the reinforcement learning from human feedback effort that produced ChatGPT.
So the reason I emphasize this point is that the companies operating at the layer above the LLM platforms, the ones that are using the LLM APIs but are building something which is tailored to a particular customer, those entities may be the ones who are getting the most human feedback data and have the tightest connection to the interactions between the AIs and humans.
This sort of stands to reason. If you have a base layer where the big models are living, but then you have an application layer where people are using those big models, but plugging them into products that then serve some particular purpose for humans, it's plausible that the application layer companies are the ones that are gonna have direct access to human feedback data. Tons and tons of human feedback data, which is in principle not exhaustible.
Now, OpenAI may get that through its Microsoft collaboration because everybody using Windows will be generating that data for them. Apple may get it from their phones, computers, desktops and laptops. But there are gonna be other players that are gonna have a slightly more difficult time. Other LLM builders that need to partner probably with some application layer companies to get access to that human feedback data.
So that's an important sort of tactical point to make for people building startups in this space. And again, this may not be that important to the end user. The end user may not care whether the actual software that they're dealing with is something built by an application layer company, which is independent of one of the big LLM builders or not. Maybe the end user doesn't care, but in terms of the internal dynamics of how this industry is gonna develop, I think the point I'm making is important. And it's especially important for startups and for venture investors who are interested in this space.
So, my slightly contrarian take on this, I think, has a decent chance of actually being realized. I'll come back to this a little bit later in the podcast.
Now, let me switch topics a little bit to talk about what I consider to be the major problem with LLMs, with ChatGPT, right now. And that is something called hallucination. As I mentioned, in the basic training of these models, in order to give them the natural language capability, you had to have them look at lots and lots of data scraped from the internet.
And that data contains lots of claims which are not true. So lots of information that the model has ingested isn't actually true. And so people who have used ChatGPT find that it can generate very well-written, plausible answers to queries, but sometimes the facts in those answers are subtly wrong. For example, if you ask it for an Einstein quote, it might give you a quote in which the first sentence is something Einstein said, but the second sentence isn't actually something Einstein said. And if you're not just doing creative writing, but you're actually trying to answer a scientific question, or you're doing a research project for your hedge fund, you don't want the AI or the LLM hallucinating in the middle of your project.
You wanna have some way of preventing these kinds of hallucinations. But of course, as I just said, the hallucination problem is rather endemic; it's built in because of the way the AI is originally trained. And so the question that I've been interested in for the better part of 2022 is: how do you actually minimize or stop hallucination in these models?
And I've been working on this with some collaborators, and we're in the middle of launching a startup, we're still in stealth mode, that actually implements methods to reduce or minimize hallucination. And we think that once hallucination has been minimized, all of a sudden a bunch of narrow applications of these AIs become extremely attractive, and they're huge, in industries that will be revolutionized by these applications.
So how do you stop hallucinations of an LLM? Well, for the methods we've developed, let me give you a very fancy way of thinking about them. The fancy way of thinking about it is that you've built, in software, some kind of wrapper around the LLM. And if you want to think about it like a physicist or an engineer, you could think of it as some kind of non-linear input and output filter.
So based on the query that the human writes, there's some processing that happens after that query, before the prompt is actually sent through to the LLM. And then based on what the LLM gives back, the response of the LLM is itself processed, perhaps iteratively before the final output is written out, so the human sees it.
So there are some extra steps involved, which could be acting on the input, could be acting on the output, could be resubmitting modified output as input to the LLM again. So there are multiple calls to the LLM instead of a single call, with some software intelligence built around it. Ultimately, it's a kind of software wrapper built around the LLM, which potentially makes multiple queries to the LLM.
And the goal is to allow the user to specify a custom or specific corpus of information. All of this filtering and wrapping is meant to force the LLM to answer the query of the human, but using only the information that's in the specified corpus. It's not allowed to use information that it, quote, accidentally memorized during its large language model training.
We want to use the language ability of the LLM, its ability to map from natural language into concept space and back. We want to use that. But we don't want to rely on any facts or knowledge, or potentially wrong knowledge of the world, that it has embedded in its own connections from that original training. We want to force it to check the claims and statements that it makes against the specific corpus.
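In rough pseudocode terms, and this is only a minimal sketch of the general idea rather than our actual implementation, the wrapper looks something like the following. The embedding here is a toy bag-of-words stand-in, and call_llm is a hypothetical placeholder for whatever LLM API you plug in.

```python
# Minimal sketch of "focusing" an LLM on a fixed corpus. The embedding is a toy
# bag-of-words stand-in; call_llm() is a placeholder for a real LLM API client.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model."""
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to whatever LLM you are wrapping."""
    raise NotImplementedError("plug in your LLM API client here")

def focused_answer(question: str, corpus_chunks: list[str]):
    # Input filter: retrieve the corpus chunks most relevant to the question.
    q_vec = embed(question)
    top_chunks = sorted(corpus_chunks, key=lambda c: similarity(embed(c), q_vec), reverse=True)[:5]

    # The prompt instructs the model to use ONLY the supplied corpus material.
    prompt = (
        "Answer the question using ONLY the context below. If the context does not "
        "contain the answer, reply: I don't know.\n\n"
        "Context:\n" + "\n\n".join(top_chunks) + f"\n\nQuestion: {question}\nAnswer:"
    )
    # Output filters could further check the answer against the chunks, possibly
    # resubmitting it to the LLM; here we just return the answer and its sources.
    return call_llm(prompt), top_chunks
```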
Okay, so I've said that in a kind of abstract way. Let me give you a very specific example. One of the projects that we've done at our startup is that we've taken standard textbooks, the most popular textbooks for subjects like Intro to Psychology, World History, Government, Intro to Biology, and we've specified those as the corpus. So it might be a thousand pages or a couple thousand pages; sometimes we use multiple textbooks, some number of thousands of pages of specific corpus. And then when the human asks a question of the AI, we focus the AI on that corpus and force it to answer using only information that's contained in the corpus.
And if the answer to the question is not available from the information in the specified corpus, it's supposed to answer: I don't know, or, I couldn't answer based on what I know from my corpus. Now, the reason that we like textbooks is that the professors who write them, and it's usually teams of professors, have specified at the end of every chapter a bunch of questions which test the student's knowledge of the material once the student has read the chapter. Those are what we call chapter problems. And one of the natural ways to test the focused AI is to focus it on the textbook, then ask it the chapter questions as queries, and check to see whether it's able to correctly answer those chapter questions.
Or you could take chapter questions from another textbook and test the AI, which maybe is focused on textbook A, with questions taken from textbook B, where both are on Introductory Psychology, for example.
So this provides a very natural way of actually testing the performance of this focusing process. And we're currently at a stage where we have what seems to be close to a hundred percent correct responses. Checking this is a little laborious, because a human has to go and look at the question that was taken from the end of the chapter and at the response. As part of the process, the AI will point to, or in a sense footnote or reference, the chunks of information from the corpus that it used to answer the question. So you can check the answer that was ultimately written out for humans to read against the internal sources that it used to answer the question, and ultimately you can check performance.
And so far we are getting pretty close to a hundred percent capability. So it's an existence proof of an AI which can communicate in natural human language and can answer questions based on a specific corpus without hallucinating. I think we've more or less demonstrated that this is possible, and that it's possible to do with very high efficiency and accuracy. We can chunk a corpus and focus the AI on a corpus of maybe up to 10,000 pages very quickly; it takes almost no time. We could even go beyond 10,000 pages, but I think the biggest corpus we've tried is roughly of that order, so you could think of that as 10 giant textbooks or something like that. So I think there's no question about it.
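The testing loop I just described might look roughly like this, reusing the hypothetical focused_answer sketch from above; the human grading step is the part that can't yet be automated away.

```python
# Sketch of the chapter-question evaluation described above. focused_answer() is
# the hypothetical corpus-focused function from the earlier sketch; it returns
# the written answer plus the corpus chunks it relied on.
def evaluate_on_chapter_questions(chapter_questions, corpus_chunks):
    results = []
    for question in chapter_questions:
        answer, cited_chunks = focused_answer(question, corpus_chunks)
        # A human grader then checks the answer against the cited chunks
        # (and against the textbook's own answer key, where one exists).
        results.append({"question": question, "answer": answer, "sources": cited_chunks})
    return results
```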
Now, what does the world look like when this functionality that I just described becomes available? This is different from ChatGPT, because people who are getting ChatGPT to write things for them can't really be sure that everything contained in the output is actually correct, or at least correct according to some defined ground truth. Whereas with these AIs that we've built, you can be confident that what they write back is based on the ground-truth corpus that you define.
So let me describe some possible applications. Well, we talked about textbooks. We're about to launch a website that allows students in all of these introductory courses to learn from their textbook, to learn from an AI that has been focused on their text, and this may radically change-- I think it will radically change-- the way students learn.
These textbooks, for those of us who are older and have never seen a current college or high school level textbook, are huge. It's like a 10-pound brick, you know, a thousand-plus pages that the student may have to carry around, or they just use the electronic version on their laptop.
But most students don't like to read linearly through these textbooks. And for some reason, the professors who write these textbooks have reached a point where they just feel they have to cram every possible thing into the chapter. So the student, if they wanna actually read the whole chapter, reads a ton of extraneous stuff that maybe their professor is not focused on, and is not specifically emphasizing in the class.
So just being able to ask the textbook, could you please summarize what we know about the mating patterns of tropical birds, can produce an answer, assuming that the focusing worked, which is a very nice summary. It's as if your friend who's a good student read the chapter for you and took some good notes for you. And that is possible now.
So I think the way that education based on textbook information happens is gonna change radically. And of course you can go way beyond this. You can easily imagine designing a lesson plan where the professor specifies certain chunks of the book, asks the AI to write summaries, has the students read those summaries, and even asks the AI to generate questions that the students should try to answer. All kinds of things, I think, are gonna change the way students interact with the bodies of knowledge that are covered in their courses.
Now, a general way to think about this is to just say, well, I've built for you an AI that is like a smart assistant. The smart assistant reads infinitely fast, writes infinitely fast, and more or less follows your instructions.
So you can say: assistant, take the last 12 months of The Economist, pull out the articles that were on supply chains in Vietnam, and summarize for me which companies have had the most success moving their supply chains to Vietnam from China. And the AI can do those things. So the assistant with a PhD or a graduate degree that you might have hired to do this, you don't require as many units of that assistant to run your business. The AI can do it.
You can imagine an AI which is built so that the custom corpus is the set of product reviews that you deem very trustworthy, i.e. the set of product reviews from Consumer Reports, New York Times Wirecutter, Wired magazine, CNET. Let's suppose you define that set of product reviews as your corpus.
You can just ask that AI: Hey, I'm shopping for a cell phone for my son. I don't want to spend more than $400, but he wants to be able to play these games, he wants an OLED screen, et cetera. And the AI can just use the last two years of product reviews that you've defined as trustworthy to answer that question.
I personally spend a lot of time, because I guess I'm a frugal guy and performance-oriented, researching stuff before I buy things, but this could cut that time down substantially.
Let me just mention one other application, which is obvious, which is customer service. Suppose you define the corpus to focus your AI on as the product manual for some large-screen TV plus a hundred pages of scripts, human-generated scripts, that the customer service person or AI is supposed to be familiar with. You know, standard questions like, how do I use screen-in-screen? Or, my thing is not connecting to the wifi, et cetera.
And if the AI is forced to focus on that limited corpus and answer all queries from customers using that corpus, then you can be sure that it's not saying weird or crazy or incorrect things to the customer, but it's still able to communicate with the customer in natural language. The customer might formulate the same query in many different ways, but because the AI understands human language, it will recognize that, oh, this is actually question number 242 in my scripts, and it then just gives the right answer.
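As a crude sketch of that matching step, imagine the scripted Q&A pairs stored as a lookup table. The entries below are made up, and a real system would match in the LLM's embedding space so that paraphrased questions still land on the right script entry; simple word overlap just conveys the idea.

```python
# Crude sketch of matching a free-form customer question to a fixed support script.
# The script entries are made up; a real system would match in embedding space.
support_script = {
    "How do I use screen-in-screen?": "Press the PIP button on the remote, then ...",
    "My TV will not connect to the wifi": "Open Settings, choose Network, then ...",
    # ... the rest of the scripted question/answer pairs
}

def lexical_overlap(a: str, b: str) -> int:
    """Very rough similarity: number of shared words between two questions."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def answer_from_script(customer_query: str) -> str:
    # Pick the scripted question closest to what the customer actually asked.
    best_question = max(support_script, key=lambda q: lexical_overlap(q, customer_query))
    return support_script[best_question]
```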
So, again, that is going to take the place of lots and lots of human labor, perhaps in the Philippines or in India. So you can see the impact of these technologies, which I regard as pretty much proven because of what we've done with the textbooks and with other corpora in our startup; we've more or less demonstrated that all this is possible. I anticipate a huge impact on the economy in the next few years as these new technologies get incorporated into standard business practice.
Let me come back to my contrarian thesis: that big LLMs will become a commodity, that there will be competition on price and quality among big LLM builders. We're already at good-enough language capability. In other words, the existing LLMs are already good enough that a human can make themselves understood to the AI and the AI can make itself understood to the human, no problem. So even existing LLMs could be used for the applications that I described, and they may very well just become a cloud-based resource that any programmer can tap into.
The most interesting new form of data, as I've mentioned, is the human feedback. So when the human is interacting with the AI, if there's some miscommunication, if there's some way that humans start to use language, that the AI is not that familiar with, if there's a kind of chain of questions where after the third question, the context should be apparent to the AI from the first question, but it doesn't currently get that. These are all things that you'll discover by using the human feedback data to learn from.
And so I think the future of these capabilities may be strongly dependent on access to human feedback. And so the people that build applications that touch, that interact with, a lot of humans at scale will be in a very special position to further improve the models.
That more or less covers what I wanted to say. This is perhaps a slightly shorter podcast than I usually do, but let me just leave it there. I hope you enjoyed the material that I covered. Thanks.