High Variance is an interview podcast about a world that has become harder to read — more uncertain, more volatile, stranger. Host Danny Buerkli speaks with public intellectuals, entrepreneurs, and technologists to ask what is going on and how we should respond.
Welcome to High Variance, where we explore the strange times we're in. I'm Danny Buerkli, and on this podcast I speak with people who are trying to make sense of it all.
Danny Buerkli:My guest today is Antoine Bosselut. Antoine is an assistant professor at EPFL in Lausanne, Switzerland. He's also one of the creators of Apertus, an open LLM released in September. Antoine, welcome.
Antoine Bosselut:Hey. Thanks for having me. It's great to be here.
Danny Buerkli:Thanks so much for doing this. To start, explain briefly what Apertus is and what distinguishes it from, first, other LLMs and then other, quote, unquote, open LLMs.
Antoine Bosselut:Yeah. As you mentioned, Apertus is an LLM, a large language model: a very large model based on neural networks and deep learning, which has been trained, or pretrained, as we say, on an insane amount of data. In our case, close to 15 trillion tokens. So it's essentially a large language model whose scale rivals some of its better-known counterparts, such as the Llama 3 models or the Qwen 2 models.
Antoine Bosselut:What makes it different in this context, compared to what we often think about with large language models, which is the ChatGPTs and the Claudes and the Geminis of the world, is that Apertus is more like Llama 3 and Qwen 2, in that it's an open model where the weights are actually released online. That allows others to pull those weights from an online repository like Hugging Face or Azure or AWS, run the model locally, and train on top of it to potentially give it new capabilities. It's essentially a much more flexible interface to a model than the chat interfaces that are often used with the frontier organizations' models. This naturally has drawbacks, though, in that those frontier organizations have an entire ecosystem built on top of the language model. There, the language model is more like an engine inside a very large car that is cruising forward. In our case, we just have the large language model, so it's only the engine in that more classic sense.
Antoine Bosselut:And in terms of what makes Apertus stand out relative to models such as Llama and Qwen: it's fully open. We wanted to provide a scientific artifact in addition to a base model that could be used for downstream applications. To do that, we wanted to release all of the data we trained the model on, all 15 trillion tokens. We wanted to release intermediate checkpoints. At Apertus's scale, it's actually the largest model where all of these additional artifacts, such as pretraining data, checkpoints, and evaluation suites, are available to go along with the model weights.
Antoine Bosselut:And there are very few other models with that same level of transparency in their releases, and none of them are at the scale that Apertus is at.
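To make concrete what "pulling the weights and running locally" means in practice, here is a minimal sketch using the Hugging Face transformers library. The repository id below is an assumption for illustration, not a confirmed listing; check the hub for the actual Apertus entry, and note that a model of this size needs a suitable GPU.

```python
# Minimal sketch: download open weights from a public repository and run
# them locally. The repo id below is an assumption; verify it on the hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "swiss-ai/Apertus-8B-Instruct-2509"  # hypothetical listing name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What makes a language model 'fully open'?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```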
Danny Buerkli:Brilliant. But one obvious question is: why bother? Apart from the fact that this is a cool and presumably helpful research artifact, because we presumably want to understand these LLMs better, and to do that we have to have a model we can interrogate. Arguably, the more open and compliant and all the rest of it it is, the better. And also, maybe, because we may not want to rely on the largesse of large multinational companies to provide these models.
Danny Buerkli:We may want to be able, for research purposes, to have one of these ourselves. That's great, and that may already be justification enough. But apart from that, why bother? And why bother with public money?
Antoine Bosselut:There are multiple questions to break down there. In terms of why bother, I'd say there are lots of good reasons. Two that I'll lean on here: first, having access to this type of resource from a research perspective enables us to really expand the number of studies that can actually be performed on these models. There's essentially not all that much research you can do with the frontier models, at least for us as outsiders, beyond talking to them, seeing what they answer, and trying to get something back from that. With open-weight models, you can do a lot more, because you have access to those weights.
Antoine Bosselut:You can adapt them. You can provide different stimuli and see how the internals of the model, the mechanisms inside it, change. You can try to discover circuits that are responsible for particular behaviors. So there's a whole host of things you can do, but you're always limited by the fact that you don't know what the model was trained on. What does it mean for a model to get this much performance on a benchmark if that benchmark may have been in some part of the training data?
Antoine Bosselut:There's this gap between good science that actually allows us to extract insights and the foundation of the experiments we're running, which is an open-weight model for which we don't have that specification. With a fully open model, we can actually do true science on these systems, because we're able to audit the entire training procedure of the model up to that point and say: okay, there's a flaw in my experimentation, because I'm testing for something that, even though I didn't know it at the time, was in the data when the model was pretrained. That in itself is a great value of such a system, and we're not the only ones who have pushed models forward along that principle.
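As an illustration of the kind of audit he describes, here is a toy sketch of a benchmark contamination check against a released pretraining corpus. The n-gram approach and all names are illustrative; a real audit over 15 trillion tokens would use large-scale indexes rather than a Python loop.

```python
# Toy contamination check: does any long word n-gram from a benchmark item
# appear verbatim in the (openly released) pretraining documents?
from typing import Iterable

def ngrams(text: str, n: int = 8) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, corpus: Iterable[str], n: int = 8) -> bool:
    grams = ngrams(benchmark_item, n)
    return any(any(g in doc for g in grams) for doc in corpus)

# Usage with stand-in data: in practice, `corpus` would stream shards of
# the released pretraining data from disk.
corpus = ["... a document pulled from an open pretraining shard ..."]
print(is_contaminated("Which canton is Bern the capital of? Switzerland.", corpus))
```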
Antoine Bosselut:AI2 had the OLMo models. EleutherAI has released a suite of models in previous years. So there are a few organizations operating under these fully open principles. The other reason is, I guess, an effect of the fact that we wanted it to be fully compliant, and we also wanted it to be useful for businesses. We really wanted to release under the Apache 2.0 license, which allows commercial use, and we also wanted to release all of the data that we had trained on. That creates some sticky legal conundrums: if other people are potentially going to use this model to make money, and you've pretrained on a whole bunch of things that make money for other people, that second group is potentially going to want to come after you legally.
Antoine Bosselut:And so we made a lot of decisions, when we chose what data to train on, that make us much more compliant with regulations in Europe, including the EU AI Act and the GDPR, in terms of what sorts of data can make it into the mix. This is quite interesting, because when we were designing this project in the first place and asked organizations what the issue with LLMs was for them today, it didn't have anything to do with performance or capabilities. They're pretty confident they can make something work there.
Antoine Bosselut:But they were afraid of the legal exposure any such model would create for their company if there was any sort of mistake, any sort of safety issue along the way. So for them, knowing that this much stronger data compliance standard was implemented was quite important, and it's one of the things that is very attractive to a lot of the companies we've spoken to. Now, you asked: why use public money to do this? The answer is quite simple: it's super important to do this, particularly to create these more responsible foundation models and potentially have them be quite useful for sovereign innovation ecosystems as well. But there's not actually all that much money to be made in responsible AI at the moment.
Antoine Bosselut:So it's not necessarily an attractive bet for a private company to release a model with all these open artifacts, only to have it potentially be used a bit less because it's not as capable, because of this performance gap. To me, this is the responsibility of the public sector: to make the investment in the foundational technologies that can then enable entire innovation ecosystems on top of them. It's too much money for just a startup, especially one in Europe. But as far as the public sector goes, it's a very important thing to do in order to enable the next stage of actually building the application layer on top of such a model.
Danny Buerkli:Speaking of money, can we put an order of magnitude on this? I imagine the biggest cost block would be the pretraining and post-training runs plus salaries, but the GPU time is presumably the most expensive thing. How much did the whole project cost?
Antoine Bosselut:There are two things there: how much it cost us, and how much it would have cost other people. When it comes to cluster economics, there are different ways of defining these costs. I think all in, if we count up the GPU hours used in this project, it comes out to somewhere around 10 million GPU hours to do all the experimentation, the final pretraining run, the post-training, all of that.
Antoine Bosselut:And that's the majority of the cost. Then, of course, there are the salaries of all the folks who worked on it along the way, but we can probably say that all of those salaries put together come out to somewhere around, let's say, 3 million, as a number. So then you're left with the compute cost. You've got these ten million hours. How do you decide what an hour is worth?
Antoine Bosselut:Well, if you go to your clouds, your Google Clouds and your AWS, you're going to be able to buy a GPU hour. For spot pricing you'll get it for less, but you can't reliably train on spot pricing, so you'd need the full value of that compute. And you're not going to use it for a year, so you're not going to get the discounts that are normally given. So really, you're going to be paying the base rate on that compute, which might be something like $5 to $6 a GPU hour. That would be $50 to $60 million to train on that cloud.
Antoine Bosselut:If you go for a more bare-bones service (I'm not going to advertise any companies here, but there are bare-bones compute services), you can often get something for, like, $2 a GPU hour, but really closer to $2.50 if, once again, you're not getting the bonuses they put in place. So that's $25 million. All of these options are essentially out of reach for a private company that's just hitting the scene. However, we do have a supercomputer in Switzerland with around 11,000 GH200 GPUs, and the base costs of running it, just energy, cooling, and salaries, are a lot less than those amounts. We can probably get away with valuing the compute on this project at closer to somewhere between 5 and 7 million francs.
Antoine Bosselut:So what would have cost a private company $50 million to run, which is completely out of scope, we can do by relying on public investments in infrastructure that have already been made, and that will continue to be made, which allow us to get a much better price point on that compute. So not only is it, I would say, the responsibility of the public sector to do this in order to build that foundation for other companies in Switzerland, in Europe, and in the world to build on top of, it actually just makes a lot more sense economically, given the other investments the public sector has already made.
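As a rough sketch of the cluster economics just described: the first two rates below are the ones he quotes, and the at-cost rate is simply implied by dividing the 5 to 7 million franc valuation by the 10 million GPU hours. None of these are official figures.

```python
# Back-of-the-envelope cost of ~10 million GPU hours under three pricing
# tiers. Rates are taken or derived from the conversation, not price lists.
GPU_HOURS = 10_000_000

rates_per_hour = {
    "hyperscaler, base rate": (5.00, 6.00),   # no spot, no long-term discount
    "bare-bones provider":    (2.00, 2.50),   # without bulk bonuses
    "owned supercomputer":    (0.50, 0.70),   # implied by the 5-7M estimate
}

for tier, (low, high) in rates_per_hour.items():
    cost_low, cost_high = GPU_HOURS * low / 1e6, GPU_HOURS * high / 1e6
    print(f"{tier}: {cost_low:.0f}-{cost_high:.0f} million")
```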
Danny Buerkli:You mentioned the immediate utility to research and the immediate potential utility to firms who worry about legal liability and other issues. I wonder how much there is also an optionality value, because building or growing these systems presumably involves quite a lot of experiential expertise that you can only gain by actually doing it. There's a limit to how much expertise you can amass by only thinking about how you would build an LLM if you were to do it; a lot of expertise comes from actually doing it, and that would translate into some future optionality value. I wonder how you would weigh these two things: the immediate, present-day value versus how much value this potentially generates into the future if we were to continue on this path.
Antoine Bosselut:Yeah. That's an incredible question. There is so much to unpack there. I'll start with a nice little anecdote. When we kicked off this project, I had a few people tell me, this is not what academics do.
Antoine Bosselut:Why why are you trying to do this? You're not you're not gonna be able to publish your typical research papers. And and to be honest, like, I didn't particularly care all that much. It just seemed like a really exciting thing to do and and and something that we needed we needed to do in in in order to to kinda stay competitive in the AI race in Switzerland. But something incredible happened along the way is we just found all of these research problems that aren't really well codified necessarily or that are well codified, but that we don't necessarily attach to the development of these LLMs because it's not necessarily talked about.
Antoine Bosselut:And so one thing that came out of this work is that something like twenty to twenty-five research papers actually did get written. So we did achieve an academic mission in the classical sense: we published research, and we trained a lot of students in a very important technology over the course of this project. The reason for that is that the expertise and understanding of the problems you gain from actually doing the project are really different from what you gain by reading papers by people who do these projects, particularly in the LLM space where, frankly speaking, a lot of what is published on the subject lacks a bit of depth, simply because there is quite a bit of value to that know-how. And this creates a massive future opportunity for the people involved and for the communities where they are, because they gain experience in a technology that's actually quite complex, where the number of people in the world who can actively contribute is on the order of thousands or tens of thousands, rather than millions or tens of millions.
Antoine Bosselut:So there's quite a lot of value to that. And then, when they leave the Apertus ecosystem, or the Swiss AI Initiative, which is the wrapper around it, they go out. They join startups. They join tech companies here in Switzerland. They start their own startups.
Antoine Bosselut:In essence, there's a whole new class of folks going out into the world, taking what they've learned from these projects and the importance of the problems within them, and specifically designing innovation tools to manage and handle those. And that's really where I think a lot of the value comes from. Apertus is a model that we've released; in a few years, it's going to be completely outdated, and we'll be on a completely different class of open model. But the experience and understanding gained by the people who participate in these projects is a really valuable resource to send out into the world, particularly for a small country like Switzerland.
Danny Buerkli:That's excellent. And it confirms that around the launch, a couple of things were not necessarily well understood. One point was that there exist significant positive externalities beyond the artifact itself. The other thing that may not have been well understood is what you said earlier, which is what the artifact actually is. Right?
Danny Buerkli:It's the engine, or the fuel rod, I suppose, rather than the entire power plant. And there's a big difference between
Antoine Bosselut:Mhmm.
Danny Buerkli:these two things. You just mentioned that in a couple of years this model will be obsolete in terms of its performance. I wonder, were we actually just at one incredibly fortuitous point in time, where the compute resources available in the supercomputing center were just right to pull off a model that was open-source state of the art, as it were? And is that now gone? We don't necessarily know exactly how scaling will continue into the future.
Danny Buerkli:But already now, if you quickly compare it to Grok 4, where a couple of numbers are available from Epoch, I think Grok used roughly 60 times the power and 70-odd times the FLOPs. And that's already today; that's not asking what the gap would look like if we did the same exercise again next year or in two years. Is this a repeatable exercise, or was that it, essentially?
Antoine Bosselut:My answer to the first question is yes, and to the second one, no. It is a repeatable exercise, and this is not just it. However, that requires a commitment to keep growing at this sort of scale at the same time. The story behind this is actually quite fascinating whenever I tell it to people. Around the time GPT-3 came out in 2020, the folks at the Swiss Supercomputing Center were just coming up on an infrastructure investment cycle, deciding how they should grow the capabilities of the next supercomputing center.
Antoine Bosselut:And they made this very wise but risky decision: okay, there seems to be something to these LLMs, and we need to provide capacity for this type of research question in the supercomputing center. So they bought 11,000 GPUs with that infrastructure investment, at a price that people today would vomit if they heard about it, given how aggressively good a deal it was. Just to be clear, I believe the biggest cost of running the cluster on a daily basis is depreciation. Nothing else. But that's what enabled the Apertus project: having access to such an infrastructure to do this.
Antoine Bosselut:I talked about the cluster economics of training something for 50 million versus 5 million, depending on whether you actually own the infrastructure and have competent people to operate it. It would be difficult to do that a second time around, I think, just because now everybody knows what the value of a GPU is. I don't think NVIDIA is giving the same discounts it was previously. But one of the points of the Apertus project, and of other projects we have in the Swiss AI Initiative to democratize these foundation models, is really to make clear what can be done when this scale of resources is available for research and development. And yes, we can point to xAI's and OpenAI's and Microsoft's and Amazon's very large clusters.
Antoine Bosselut:But something I think Europe needs to come to terms with is that it doesn't have companies of that scale building their own data centers. So if it actually wants to build an innovation ecosystem in Europe that can perhaps compete with what the big players are doing in the US and China, the only place the investment in data centers can come from is the public sector. And that's an interesting question, because yes, these data centers are supremely expensive, but not for an entire continent to invest in. So the big question, in order to enable this next step, is: can we actually make the investments into the necessary infrastructure to enable this type of development, which in turn would spur an innovation ecosystem around it? There have been some efforts in this area, with the construction of these AI gigafactories that are coming online in the next few years.
Antoine Bosselut:These are absolutely awesome initiatives, and we should continue them and, in fact, expand on them as well.
Danny Buerkli:So what you're saying is, absent that, there's not going to be another training run, because we're tapped out as it stands today.
Antoine Bosselut:Well, I would say that, at the same time, the frontier is plateauing a tiny bit right now. There's a bit less gain to be made in just scaling up compute on the training runs as massively as before. People are conjecturing that you can scale up post-training to the level of compute of pretraining. But in terms of pretraining, yes, we could still have larger models, larger architectures. Data, though, I think is going to become a choke point at some point, although you can repeat the data in a training set to get a bit more bang for your buck.
Antoine Bosselut:But we can still attack these sorts of limits on the open-model side without, I think, too much of an issue. It's the next generation I'm more curious about, where synthetic data really takes off as the primary data of choice for pretraining, and where, for post-training, we really start to scale up even more how much compute is expended. So right now, I think we still have enough juice to go for something like an Apertus 2 or even an Apertus 3. After that, I think it will require larger-scale infrastructure. Really, I'd say the blocker right now is perhaps how much experimentation we can do along the way.
Antoine Bosselut:In terms of the large run itself: on 4,000 GPUs, it took around two months to train the Apertus model. So we could double the scale, and it would take four months on 4,000 GPUs. This is an insane amount of compute to all the academics listening, but it's something that is possible when you have a supercomputer to run these types of large jobs. The question becomes more: how much experimentation can you do ahead of time on this cluster, which is a shared national resource, trying all these little things, and how many scaling laws can you run for all the design decisions you want? I think that's where you're a bit more limited compared to the large tech companies with their massive clusters.
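To see why "two months on 4,000 GPUs" is plausible, here is a back-of-the-envelope training-time estimate using the common C ≈ 6ND FLOPs approximation (N = parameters, D = training tokens). The parameter count, per-GPU throughput, and utilization figure are assumptions for illustration, not Apertus's actual numbers.

```python
# Rough training-time estimate via the common C ~ 6 * N * D FLOPs rule,
# where N = parameters and D = training tokens. All constants illustrative.
def training_days(n_params: float, n_tokens: float, n_gpus: int,
                  peak_flops_per_gpu: float = 1e15,  # ~1 PFLOP/s, BF16-class
                  utilization: float = 0.4) -> float:
    total_flops = 6 * n_params * n_tokens              # fwd + bwd estimate
    throughput = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / throughput / 86_400           # seconds -> days

# A 70B-parameter model on 15T tokens across 4,000 GPUs:
print(f"{training_days(70e9, 15e12, 4_000):.0f} days")   # ~46 days
# Doubling model size roughly doubles the time, as he notes:
print(f"{training_days(140e9, 15e12, 4_000):.0f} days")  # ~91 days
```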
Antoine Bosselut:The other thing to remember about these companies' massive clusters is that part of them is used to deploy the systems after they have been trained. In fact, inference these days is closer to two-thirds, if not 80%, of the cost of a model compared to training. So much of that compute is actually used to serve the models you train to users, and there you have to be hyper-efficient in how you do it, which can also require access to tons of compute in these clusters. So we shouldn't assume that everything in these gigantic clusters being constructed by the big companies is only for training runs and that everybody else is outgunned.
Antoine Bosselut:We have the advantage that we don't actually serve the model afterwards. We put it out in the open, and others take care of that problem using the same commercial cloud infrastructures that are available to tech companies as well.
Danny Buerkli:Right. And from what I understand, it is much easier to serve the model through public cloud infrastructure, as in not publicly funded but publicly available cloud infrastructure, than it is to train it; that is very much not the same infrastructure you need for a training run. One thing I don't understand: when we think about compute as publicly funded infrastructure, which you mentioned earlier, you also mentioned the depreciation aspect, and I'm not quite sure how to think about this. We know how to think about public infrastructure; we've been doing it for a very long time. We know how to build roads and bridges and railways and buildings, and that is, relatively speaking, a well-understood problem, also from a finance perspective.
Danny Buerkli:However, most of that infrastructure doesn't depreciate as fast as a GH200 does. I don't know exactly what the lifespan is at this point, but presumably it is measured in the low single-digit years, or maybe the mid single-digit years if we're lucky. That would obviously imply that this is a recurring investment. It's not the water pipes you lay once and then use for fifty or a hundred years. It requires a whole different level of recurring investment.
Danny Buerkli:What is the correct way to think about that?
Antoine Bosselut:Yeah. The caveat here is that I'm not an accountant by any means, so I'm probably the wrong person; when I talk about depreciation, I repeat what accountants have told me in the past. But it is generally a much faster rate of depreciation.
Antoine Bosselut:That's not necessarily because the resource itself is completely useless after a certain number of years. It's mainly because the resource is completely outdated. You may get four years of use out of a GPU, and depending on NVIDIA's or another company's release cycle, you get essentially three new generations of GPUs in that time, all with way more peak theoretical FLOPs and different ways of connecting the GPUs so you get faster networking. The difference between, let's say, the GPUs coming out today, like the new B300s, and where we were four years ago with A100s that had 40 gigs of memory, is night and day. And I can only imagine where it'll be four years from now.
Antoine Bosselut:With that in mind, that's why there's a much faster depreciation rate on these sorts of chips compared to other public infrastructure. But that doesn't mean the chip all of a sudden becomes useless. We can still use the chips in these systems for, let's say, teaching purposes and provide them to students. Students don't always need the top-of-the-line GPU to understand things like GPU programming or multi-node scaling. There are lots of educational purposes for these resources even afterwards.
Antoine Bosselut:They could even be distributed to build local clusters in lots of environments that never have access to compute, where we want to give students an AI education from much earlier on. These are political decisions that have to be made, but there's lots of opportunity for how this infrastructure can be useful far beyond its depreciation cycle. And in private companies it's even more aggressive: some of the companies that essentially just provide GPUs as a service depreciate the value of a GPU over two years, because they know they need to buy the next generation just to stay competitive.
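A toy illustration of the schedules he mentions, using straight-line depreciation and a purely assumed purchase price:

```python
# Straight-line depreciation: annual write-off under a 4-year (public
# infrastructure) vs. 2-year (GPU-as-a-service) schedule. Price is assumed.
def annual_writeoff(price: float, lifetime_years: int) -> float:
    return price / lifetime_years

gpu_price = 30_000  # illustrative per-GPU cost, not an actual quote
for years in (4, 2):
    print(f"{years}-year schedule: {annual_writeoff(gpu_price, years):,.0f} per year")
```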
Danny Buerkli:When thinking about the performance of the model, I wonder what the binding constraints are. One of the interesting things about what you've done, and there are many, is that because you've trained on, quote unquote, fully compliant data, we can sort of see what the performance penalty might be relative to training on potentially not-so-legal data, which is presumably, or allegedly, what most commercial providers do. But there may be other binding constraints. Right?
Danny Buerkli:Talent may be one; the size of the team, the expertise the team has, the amount of compute we've just spent time talking about. What are those binding constraints, and how big would the performance uplift be, say, if you had a bigger team, or if you were to train on all available data rather than just compliant data, etcetera?
Antoine Bosselut:Yeah. There are a lot of these binding constraints, I would say. The question is what their actual impact is. There are some where we know what we don't know, and others where we don't necessarily know what we don't know. In terms of the compliant data, we've been able to do some studies on this.
Antoine Bosselut:In general, that's a case where we know what we know. If you take public data and apply some of the methods we use to make the data more compliant and to reduce memorization, we can measure, at least at a small scale, what the impact is, and do scaling laws. Let me take a step back and explain how public datasets are usually constructed. Typically, you use a crawling service such as Common Crawl, which takes snapshots, samples of the web, at different points in time. You take all of their historical snapshots, which are terabytes in size, and combine them all together.
Antoine Bosselut:You run a whole pipeline to do things like deduplication, so that you're not always training on the same tokens. You select resources that are in particular languages. You do a lot of quality filtering to throw bad data out. And one of the things you can do at the start is look at the URLs the documents were originally taken from. Most of those locations have something called a robots.txt file.
Antoine Bosselut:What that robots.txt file identifies is whether or not crawlers are allowed to crawl and scrape the data on that web page. Now, Common Crawl respects this automatically, so if a website blocks scraping, it's not going to be in the Common Crawl. However, that only applies at the point in time when the crawler actually goes through the website. So if a website wasn't blocking crawlers back in 2021 but is in 2025, it won't be in the 2025 crawls, but the historical snapshots will still have the content from back then.
Antoine Bosselut:That essentially means there's a gap between the crawling rights granted by content owners online today and what was recorded back before that. Many would say, okay, this doesn't matter, because back then they did not block crawlers. We took a different approach: we retroactively removed all websites from these large datasets that, as of January 2025, when we were doing the data collection, had opted out of crawling.
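A minimal sketch of that retroactive opt-out filter, using Python's standard-library robots.txt parser. It assumes each stored document keeps its source URL; the user-agent string and the keep-on-unreachable policy are illustrative choices, not the actual Apertus pipeline, and a production system would cache robots.txt per domain rather than fetch it per document.

```python
# Sketch: retroactively drop documents from old crawl snapshots whose site
# owners have since opted out of crawling. Policy choices are illustrative.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def blocked_today(doc_url: str, user_agent: str = "CCBot") -> bool:
    """Does the site's *current* robots.txt disallow crawling this URL?"""
    parts = urlparse(doc_url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()                      # fetch today's robots.txt
    except OSError:
        return False                       # unreachable site: keep the doc
    return not parser.can_fetch(user_agent, doc_url)

corpus = [{"url": "https://example.com/article", "text": "..."}]
compliant = [doc for doc in corpus if not blocked_today(doc["url"])]
```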
Antoine Bosselut:There's not that much of a difference on models if we remove this data through the retroactiverobots.txt opt outs. It only removes about 88% of the overall data, but performance wise, we don't actually measure that big of a gap. In terms of knowing what we don't know, there are also other datasets that are not public resources like Common Crawl snapshots, but let's say large dumps of pirated textbooks. You can train on that data if you go and find those pirated textbooks. We did not do that because we did not want to release that publicly and show that we had trained on that and also just because we didn't want to do it simply because it would essentially be stealing the intellectual property of the folks that had created it without having paid for it.
Antoine Bosselut:We know that on certain benchmarks often used to evaluate large language models, such as MMLU, training on these types of pirated resources can give you a pretty substantial boost, between 5 and 10%. Now, whether that's because those textbooks offer a lot of knowledge that is really helpful for models, or just because the MMLU benchmark itself comes from textbooks that might be in that set, is more difficult to measure. But the point is: we know there are measurable differences in performance depending on whether you train on that private data as opposed to only training on public resources. So that's definitely a potential binding constraint. In the long run, respecting data compliance standards can have a measurable impact on benchmarks.
Antoine Bosselut:Now, whether that actually translates into differences in user experience is a different story, and that story is definitely not written yet. It would be interesting to see, if somebody were to design a similar interface to ChatGPT, do similar types of post-training, and put together these many layers on top of the engine itself to actually create a product, whether people would even notice that the model was worse. I'm not completely sure about that.
Danny Buerkli:mentioned earlier, not all not everything that we find out inside companies about how to build these models is published, and the things that are published may be more anemic than you would ideally wish for. And I I believe there was also a shift sort of pre 2023. There was maybe more more things were published versus today. This is possibly an unknown unknown, but how big would the uplift be given so if you hold the data constant, if you hold the compute constant, if you hold the team constant, but if you had access hypothetically to the expertise inside, say, the three large commercial labs, what kind of performance uplift on MMLU or any other benchmark would that translate into? What what would your guess be?
Antoine Bosselut:Is this just in terms of the data itself, or specifically if I had a bunch of ex-OpenAI engineers and researchers on my team rather than
Danny Buerkli:Yes. I meant in terms of process expertise, not in terms of data. We're using the same data, we're using the same compute, but we have more process expertise and experiential expertise in how to build the model. How big a difference would that make?
Antoine Bosselut:I would expect it would make a substantial difference. The thing about these teams, and we mentioned the team earlier, is that it's not often a matter of scaling up the size of the team. Essentially, the more you scale up the team, which you typically do because you want to add more features and functionality and test more things in the model, the more you have an integration nightmare at the end, bringing everything together in the right way. Look at the number of people involved in the Apertus project, which is on the order of hundreds. A lot of what those folks did didn't necessarily make it into the Apertus model itself, because the leads on the team, myself and my incredible colleagues Martin Jaggi and Imanol Schlag, were very rigid about allowing something proposed by a member of the team to actually make it into the final mixture.
Antoine Bosselut:Anytime you bring in something new, it can become a real challenge to make sure it fits nicely with everything else you've done. So lots of things were left on the table even if there was the potential that they would help. My guess is that if you have 30 people from OpenAI or Anthropic or these other companies, you don't need to try out as many of these formulas, because they have a good sense of what already works; they've done it before. And to be fair, this is expertise that is literally valued at billions of dollars, in terms of the packages going out to some of these folks. So yes, I think a sober person would conclude that there's expertise there that can actually be monetized and useful, but that doesn't mean there's nothing you can do with a team composed of a different profile.
Antoine Bosselut:And naturally, because we're a public institution, and an academic one, we have to take a different approach to reaching our goals. We unfortunately can't pay billion-dollar salaries; as a government institution, it's difficult to do that. We're also quite mindful that a big part of our mission is not just producing the artifact, which is one part of it, but really training the people involved in the project. That's super important to us, which means that lots of components are student-led.
Antoine Bosselut:And to be fair, the students involved, at EPFL, ETH Zurich, and other institutions around Switzerland, are absolutely brilliant, but it forces us to have a different model for how we go about designing this type of project.
Danny Buerkli:What was something that you discovered as you were building the model that was surprising to you?
Antoine Bosselut:Oh gosh. The tough part is picking one. Everything was surprising along the way, I would say. I guess one thing was how robust and stable a lot of this is. There's been an incredible amount of work we were able to build on, from pretraining libraries to data cleaning libraries developed by other folks, and it really derisks the entire enterprise quite a bit.
Antoine Bosselut:I have a massive amount of respect for the first people to ever train a language model at this scale, which I think was OpenAI with GPT-3 in 2020, because I can't imagine what it would have been like to do so many of these things for the first time. Luckily for us, we're coming a bit later, and tons of these decisions have already been documented by other open developers, like EleutherAI, like LLM360, like AI2, and we can build on top of what they have done. Whenever we had a design question, it was super helpful to be able to say: oh, these folks did it this way, maybe we can just copy what they've done in this particular setting. That narrows the number of design decisions we have to attack from scratch and lets us focus on the core new objectives we want in the model.
Antoine Bosselut:For us, our big contributions were the data compliance aspect and the large-scale multilinguality, and we were able to dedicate a lot of research resources to attacking those two problems because many other aspects of this development had already been explored, understood, and documented by previous open research groups. So I guess that was a very pleasant surprise in this experience: how much of it ends up being a community effort in the open domain. And this is, I think, one of the promises of why open modeling can keep up with the frontier. I don't want to say that by definition it can't surpass the frontier of the closed models, but that, I think, is a lot harder. Because once you're actually at that frontier, you've got a real journey ahead of you.
Antoine Bosselut:You need to start making every single design decision and testing everything yourself, because you can't rely on what others have done to push you forward. But if you're operating in the open-models ecosystem, you can build on a lot of technologies from others, which are often pretty close to what's going on in the frontier labs as well. And if you can build on what others have done and the efforts they have made, it really derisks the entire enterprise and makes it accessible to larger numbers of groups, who can then contribute back in their own way. One of the things I also like about this is that, even with other people participating, it's a smaller ecosystem that is training open models at this scale. That generally means that what comes out of the research is less noisy and very, very high quality.
Antoine Bosselut:Whereas if you're working at smaller scale and going through all the research papers, 90% of what is published is shown to work pretty well under a specific set of conditions but isn't actually adaptable to many others. So what really surprised me, I guess, was the quality of the open artifacts we could use to derisk our own enterprise and push forth a really solid model.
Danny Buerkli:In terms of public investments and what you would ideally want to see from the political system, not just in Switzerland but at the European level or possibly beyond, what would you wish for?
Antoine Bosselut:There's what I would ideally wish for, and what is possible given other constraints, and those two may differ. But what I would ideally wish for is that we keep being able to design and train fully open models that are, if not at the same level as the frontier, then very close in terms of the gap. And as the frontier keeps expanding, I think we need the capacity to design and train open models at the same level. One example: you're forced to think five years into the future on these questions, because that's the time scale for an infrastructure investment. And as I think about what AI will look like in five years, I'm forced to contend with some of the realities of where we are now, such as the fact that we are running out of human data to train these models on.
Antoine Bosselut:We can repeat data, which makes the models a bit stronger, but it doesn't fundamentally give the system many new capabilities. So synthetic data becomes super promising. Something a colleague of mine often says is: if you don't believe in the power of synthetic data, go spend some time going through a human web crawl for a while, and you'll immediately understand why synthetic data has a ton of promise. Whoever has the best synthetic data in the years to come is probably going to be the entity that can train the best models. And this is interesting, because we know that the best synthetic data tends to come from the biggest, most powerful models.
Antoine Bosselut:And unlike a model you actually deploy for users to interact with, where you need to make it as efficient as possible if you're going to service millions of queries a day, you don't necessarily need to do that with synthetic data. For higher-quality data that you get less of, you might be willing to sacrifice efficiency, just to make sure this resource, which is amortizable across many different future queries, is of very high quality. So for me, despite what we hear about making models smaller to save on energy, to make them more efficient, to make them deployable in more contexts, this still points to a use case where there needs to be the ability to train absolutely massive models that become the data generators for the next generation of smaller models. And if you want to train models at that scale, potentially 10 times larger than the biggest models we have today, think trillions of parameters, tens of trillions, then you are going to need very large-scale infrastructure to train those models on any sort of timeline that makes sense.
Antoine Bosselut:Now, whether that is possible in the current political and economic climate is a different question. But there are people thinking about this, because they understand that this is what is needed to keep building AIs that are as capable as possible, and not to surrender that capability to just a handful of private companies. And like I said, I don't think this is impossible to do, but it requires substantial investment, substantial will, and coordination and collaboration among many different public entities, because it is a massive investment for a single public entity to make, but one that a set of entities could all invest in, perhaps in a public-private partnership. There are multiple ways to go about this.
Antoine Bosselut:The logistics are something to work out. And then the question becomes: is there a will to actually make this type of massive investment in order to enable a more public AI solution that can continue to rival those of the private companies?
Danny Buerkli:It seems an oddly consequential variable, maybe, is the degree to which we can geographically distribute these clusters or not, for political economy reasons, particularly in Europe. If the answer is, well, we can, that opens up a lot of doors and makes these things a lot easier. If the answer is, for technological performance reasons, no, these things have to be colocated, for the most part, physically in one specific spot, then that makes it not impossible, of course, but harder in some way.
Antoine Bosselut:This is a very interesting question, and I think it will be a very important one to tackle over the next few years. You're absolutely right that it is a politically more digestible solution to spread many smaller clusters around many different places. And that's the current European model, by the way, with these various gigafactories, each getting between ten and twenty-five thousand GPUs, and you build 10 to 15 of those. It's a great solution. It sounds super European.
Antoine Bosselut:Spread the love everywhere across the continent. But in terms of the current paradigm of training large models, it's limiting. I don't think we could train a model 100 times the size of Apertus by having full use of 10 supercomputing centers across different countries with the model we have today. So the question becomes: can research, particularly given this new distributed infrastructure, advance enough in the next few years to come up with viable solutions for multi-node, decentralized training across data centers? Because in any model, at a certain point, communication is the bottleneck.
Antoine Bosselut:Even within a single cluster, we're not maximizing all of the compute available on the GPUs at every step. The best training approaches hit close to half usage of the GPU, just because of the communication bottleneck of transferring information between GPUs, since each GPU holds a different part of the neural network and a different part of the data being trained on in a batch. So there's a massive amount of communication happening between the chips themselves. And there are generally two layers of communication: between chips on the same node, as part of the same compute unit, and across nodes.
Antoine Bosselut:Cross-node communication is a lot slower than intra-node communication. And if you now add a third layer of cross-data-center communication, you need to think of a completely new training paradigm to make that happen. There are potential approaches: one is that you train a slightly different model in each of these data centers, and then you find a way to merge them later on. That's one way to potentially do it. The problem with that approach is that you don't actually scale up the size of the model, because you're still individually limited by the size of each data center.
Antoine Bosselut:And in fact, there you have, well, I guess you called it a binding constraint earlier: the smallest data center is the limiting one, because that's the one that can fit the smallest model if you want to do this merge approach. Now, there are probably smarter ways of going about this, but it's not an obvious solution. Today there are research threads that try to do this at much smaller scale: what happens if you have a thousand computers spread out everywhere and you want to do decentralized training of a model? I don't think those types of solutions will necessarily adapt well to the cross-data-center case. So there needs to be a massive amount of research into how we can best use these types of infrastructures.
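A toy sketch of the "train separately, merge later" idea he describes, rendered as plain parameter averaging in PyTorch. This is the simplest possible merge, assuming identical architectures; it illustrates the concept, not a method anyone has actually deployed across data centers.

```python
# Naive weight averaging of models trained in separate data centers.
# Assumes every state dict shares the same architecture and keys.
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Average parameters across models; a sketch of 'merge later'."""
    n = len(state_dicts)
    weights = weights if weights is not None else [1.0 / n] * n
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Usage: three stand-in models (real LLMs would load from checkpoints).
models = [torch.nn.Linear(4, 2) for _ in range(3)]
merged = torch.nn.Linear(4, 2)
merged.load_state_dict(merge_state_dicts([m.state_dict() for m in models]))
```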
Antoine Bosselut:But in the absence of advances there, the only way to scale up models is to have one very large data center in a single place. And that is, politically, a more difficult choice to make across the board, particularly because there's only a small number of places, I think, that could reasonably and efficiently sustain such an infrastructure. You need to find the energy to power it, think five-gigawatt data centers. You could run them on coal, but that's not very European either. So you really need to find the right renewable source of energy, and it's not obvious where that power capacity could be scaled up drastically.
Antoine Bosselut:There's only a few places I think that could pull that off. You'll need to invest in a massive amount of cooling, so you'd probably want a place that also has kind of natural cooling mechanisms to make that happen. So it becomes, like, much less a question of, like, AI training and more a question of what places in the world have natural ecosystems that can sustain this massive type of of infrastructure and its needs without necessarily having that infrastructure disrupt the ecological environment as well.
Danny Buerkli:What should I have asked but didn't?
Antoine Bosselut:I think we hit almost all of the important points. Maybe something always worth coming back to is another base reason why we want to have these models in the first place. We talked about the scientific reasons for having open models and being able to do this audit. We talked about the innovation reasons, people building on top of these open models and what fully open can mean, and the training of the people involved in building them. I think an important one as well is the sovereignty aspect of having these types of models.
Antoine Bosselut:I don't necessarily think that every country and community needs its own model trained from scratch, but I do think it's important that most of these places and environments be represented in how these models are designed and trained. And the best way to be represented is to be a player in that development process, in that development pipeline. You could say you're going to surrender all AI development to a few big technological players just because they have an advantage; you can keep making that argument every ten or so years, whenever there's a new computing revolution. But the truth is that the winners of the next computing revolution are typically the winners of the previous one.
Antoine Bosselut:The biggest players in LLMs today, from Google to Microsoft to Meta, were the ones that invested very heavily in cloud technologies in the early 2010s, where they saw that shift. And all the places that said, well, we're not going to invest in cloud because these big players are already gobbling it up? They were behind when it came to GenAI. And the winners of the cloud revolution, Amazon and Google and Meta, were the winners of the web revolution years earlier. In many cases there are always a few new players that come into the mix; in terms of GenAI, that was OpenAI and Anthropic, for example.
Antoine Bosselut:But the biggest winners of the previous computing revolution continue that dynamic, because they often have so much cash on hand to make investments. So at a certain point, if countries want to build the innovation and sovereign ecosystems to have an impact in a future computing revolution, they almost need to invest in the one happening now, or else they continue to surrender future impact, future growth potential, future tax-base revenue, if you will, to the companies that already exist. And while I know it might seem to many folks that the battle is already lost and these players have already won, I'd say that's not necessarily true. The gap between open models and what can be done in open ecosystems keeps getting smaller compared to what these big companies are doing at the frontier. Public institutions are able to make the kinds of large-scale investments to close that gap as well.
Antoine Bosselut:And right now, we're on the verge of, I would say, a paradigm shift in AI research and development that makes this a good moment to really jump into the mix again. The ChatGPT era that has dominated since 2022 is, I think, on the verge of ending, and a new one is about to start, with a new approach of doing things, likely with these reasoning models and agentic AI. So now is a great time to try to get into the driver's seat of this new development, without necessarily needing to go through the same development history that the others went through.
Danny Buerkli:Brilliant. With that, Antoine, thank you so much.
Antoine Bosselut:Thanks for having me. It was really fun to talk about this stuff.
Danny Buerkli:Thanks for listening to High Variance. You can subscribe to this podcast on Apple Podcasts, Spotify or wherever you get your podcasts. If you like this podcast, please give us a rating and leave a review. This makes a big difference particularly for newer podcasts like this one.