Venture Step Podcast: Dive into the boundless journey of entrepreneurship and the richness of life with "Venture Step Podcast," where we unravel the essence of creating, innovating, and living freely. This show is your gateway to exploring the multifaceted world of entrepreneurship, not just as a career path but as a lifestyle that embraces life's full spectrum of experiences. Each episode of "Venture Step Podcast" invites you to explore new horizons, challenge conventional wisdom, and discover the unlimited potential within and around you.
Dalton Anderson (00:01.796)
Welcome to VentureStep Podcast, where we discuss entrepreneurship, industry trends, and the occasional book review. We're going to be continuing our series on Meta's release of Llama 3.1. Last week, we touched on Meta's AI Studio, where you can make an AI agent. Two weeks ago, we talked about the release of the models and what it means to have an open source foundational model in the market now. Today,
We're going to be breaking down the first half of the research paper that they created, called
"The Llama 3 Herd of Models." The herd of models — that's the exact name of the paper.
In this paper, they break down how they went about the architecture, how they discovered what size model they needed, what tokens they should use — like what is the spread of content: long context, short context, coding, non-coding, multilingual — and how they went about troubleshooting
issues and solving those things, root cause discovery of why their system was erroring out during training phases. And it is quite in depth. It talks about the algorithms used for filtering, and how they went about deduping data. It cites other research papers that they used to solve problems. They cite
Dalton Anderson (01:44.452)
OpenAI several times, they cite other people, and it is quite extensive and quite dense. I read 46 pages, and that includes quite a bit of graphics. So probably, in actual amount, maybe 26 to 30 pages worth of data — or not data, but actual pages, like no images. And it took me about three hours because it is just so dense. It's dense information.
I don't know everything, obviously. And so some things I have to look up, read another paper to understand what they're talking about in this paper. And, you know, you get down this super long rabbit hole of reading other papers that are related to other papers of the paper that you're reading, context-wise.
But today's focus for this first half is going to be a quick review of the herd of llamas and what that is — the different models and their capabilities, the foundational model, like what is a foundational model. And then we're going to be talking about the pre-training and post-training, and then safety. Those are the three things that were touched on. But within that,
there are various topics within each section. And so I would describe these sections to more or less be like 20 chapters of a book. Like one section is like 20 chapters of a book. And this is me cutting everything down into a condensed manner,
instead of.
Dalton Anderson (03:39.63)
trying to populate this podcast with everything that I deemed might be important. I try to cut down as much as possible without losing what we need to have a productive conversation. And I would like to emphasize: read the paper yourself. I'm definitely not an expert. I know I'm not a scientist — I wish I was sometimes — and
I'm sure there are not that many people that, on a Monday night, are like, I'm going to spend three hours to read half this paper. And so other people might find it curious, or might be curious about the paper, or they just might want to know more about these AI models and what's behind the scenes and how this stuff works. Or you're just curious. But I do think that there should be an emphasis on obviously utilizing these models.
And I don't think that you need to have a crazy amount of understanding, but it is nice to have in your back pocket to understand like all the work that goes into creating something like this. And then the thought of someone providing to you all this information on a silver platter in a paper with all the links to the data they use.
not their data, but open source data that they supplemented with their proprietary data that they collected and scraped off the internet. The papers that they used to come to their analysis, the thought process, the way that they troubleshot, the systems that they had to create, and the things they did analysis on — things that they would do in-house versus out-of-house. And then just the thought of having
all that stuff put into a paper for you, so you can study it and apply your own methodologies and use their open source model on top of all this other stuff. I mean, it's just awesome. It's so cool. It's so cool for a company to provide this information for free and allow people to use the model for free.
Dalton Anderson (06:05.602)
I just think it's super cool, and it should be emphasized in this podcast episode:
what Meta is doing
with open sourcing their research and their model is huge. It hasn't been done before at this scale. And it's super cool. And I'm curious how everything plays out, because it's a great model. It's on par with GPT-4. And that being said, it's open source,
and the licensing for this model is
open to everything. So it's free use. It's a free-use license. So that means that you could use it for enterprise. You could copy-paste the code, repackage it as, like, Dalton's LLMs — and there you go, our VentureStep LLMs — and then I can make a business out of it. I could just copy and paste the work, I mean, attribute it to be mine. Or I could change the model just a little bit and make it mine.
Dalton Anderson (07:20.548)
It's just super cool. I'm spending some time, a couple of minutes, to emphasize that before we get into the podcast episode. So now we're transitioning over to the herd of models, which is the name of the paper. The herd of models is three models: the 8, the 70, and the 405 billion. And those are the amounts of parameters. So it'd be 8 billion, 70 billion, and 405
billion parameters. And parameters are — like, call them weights. I mean, you'd have parameter weights, but just think about it as different scales. So if you had a room and you had eight scales lined up, and maybe they're measuring different things on different scales basically, and they all have different weights and importance to the final outcome. And so they have these
eight scales in your room and they are measuring different things. And then the final outcome is the combination of all these different weights. And that's the end result. And that's how I would explain it: they act independently, but their scores are aggregated, I guess — in a very simple way. So it uses the tokens, provides a weight for the context of the tokens, and it could be
that the attention of the model is given to, you know, a sequential token, like a grouped token — there's a whole bunch of different ways the model can give attention to the tokens, a whole bunch of different ones that we'll talk about a little bit more later, and how that was set up.
But just the emphasis is that there are parameters, the model does stuff, and yeah. Okay. So that being said, the models are quite good. The 70 billion and 8 billion are best in class, meaning that they are better than all the other models across the board. The 405 billion, Meta's first foundational model, is
Dalton Anderson (09:41.924)
about, like, you know, second or third to first. The majority of the time it's tying the best models, and then winning or tying. It doesn't lose very much. I think it wins or ties like 70-plus percent of the time, and the rest of the time it loses on these different benchmarks that I'm not really going to touch on, because you could read about them and figure out what they mean. Okay.
So the foundational model that they did, which is the 405, was trained on 15 trillion tokens.
15 trillion tokens is a lot, obviously. And the way that they went about creating, say, a high quality corpus — and a corpus is, just say,
a group of texts, like a book, or it could be many things, but normally it's attributed to unstructured data. And that is — I got a message on WhatsApp. I don't know how to mute that. This is quite annoying.
It pauses the podcast too, maybe. If I mute it — I don't know, it's too much to figure out during this episode. So you might be getting WhatsApp notifications. That's not you, that's me. Sorry about that. Okay, that was quite distracting. Sorry about that. So it's the 15 trillion tokens
Dalton Anderson (11:24.9)
that they trained on. But out of those 15 trillion — they cut it down, like, the 15 trillion is the final set, but they had more. And so they did a deduping process. They did deduping by the URL. They did deduping at the document level, which uses MinHash.
MinHash — mute chat for, WhatsApp chat for eight hours. Cool. Okay. So you shouldn't get any more notification sounds, because I was getting them, and maybe the AI takes that out, I have no idea. We're just going to go with it. So document-level dedup used the MinHash algorithm to dedupe across the entire document, plus line-by-line deduplication.
Their parameter was: if a line appeared more than six times in a bucket of 30 million documents, it got removed,
which is a lot of documents, and that's a really tight parameter, really tight. And then they did domain deduplication, and they also removed domains with known adult content. So if I backtrack a little bit, what they're talking about is: they scoured the internet, they built an HTML scraper that goes onto the internet, scrapes documents and
and text and whatever nonsense you can get a hold of. And then pulls back all the information to Meta. Meta runs it through this text extraction and cleaning methods. And so it takes the unstructured data from the websites. Like think about it as a tree. And this is how it's normally described. Like a tree. And then of that tree, there's different levels within the tree. So
Dalton Anderson (13:22.134)
a tree on like a map, like a family tree, you know, you have your parent, which is the website. And then the website could have different pages on that website. So you might have an about section products or research or whatever. And so these different sections and these different sections have data on them. So, you know, the first part of the data set for one website would be the website URL.
And then from the website URL, it would go to the about section, pull all the data in the about section. Okay, so then in the about section, there's data and then the about section might have different links or whatever link to the about section. So before you know it, you're like six sections in from the about section and then you repeat that same process for different sections of the website. Once you do that, you have this tree like data set.
that's tied like basically like it's like a dictionary, like you have the domain and then the domain is the key to all these other sections and the sections are populated from the domain. So they look at, you know, the URL. Okay. If we have multiple URLs, just throw it out. All right. Let's look at what's in those URLs. If we have repeat information, throw it out. If there is line by line duplication, throw it out. Okay.
The last thing is: are there any domains that have adult content? If so, they throw them out. And so it sounds simple, but it's quite complicated and very time consuming. Once you build it, then it's done, but there are always going to be things that don't get caught, and then you have to edit it. And so depending on how severe the things you don't catch are, you might have to redo your work.
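To make those dedup ideas a little more concrete, here is a minimal sketch of line-level deduplication and MinHash-style near-duplicate detection. It's only an illustration of the general technique, not Meta's actual pipeline — the shingle size, number of hashes, and thresholds are made-up values.

```python
# Minimal sketch of two dedup ideas (not Meta's actual pipeline).
# Line-level dedup: drop lines that appear more than 6 times in a bucket of documents.
# Document-level dedup: compare MinHash signatures to spot near-duplicate documents.
import hashlib
from collections import Counter

def line_dedup(documents, max_count=6):
    """Remove lines that occur more than max_count times across the bucket."""
    counts = Counter(line for doc in documents for line in doc.splitlines())
    return ["\n".join(ln for ln in doc.splitlines() if counts[ln] <= max_count)
            for doc in documents]

def minhash_signature(text, num_hashes=64):
    """Build a MinHash signature from word 3-grams (shingles)."""
    words = text.split()
    shingles = {" ".join(words[i:i + 3]) for i in range(len(words) - 2)} or {text}
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
            for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = ["the cat sat on the mat today",
        "the cat sat on the mat yesterday",
        "a totally different document about llamas"]
sigs = [minhash_signature(d) for d in docs]
print(estimated_similarity(sigs[0], sigs[1]))  # high -> near-duplicates, drop one
print(estimated_similarity(sigs[0], sigs[2]))  # low  -> keep both
```

The real pipeline works at a much larger scale (buckets of 30 million documents), but the mechanic is the same: cheap signatures stand in for full document comparisons.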
And the issue with unstructured data is finding use out of it. And LLMs have done a great job doing that. But to extract and organize unstructured data is quite time consuming. And it takes a lot of time to test and come back and test and come back. it's basically just an iterative process where you
Dalton Anderson (15:43.106)
Do something, test it. that didn't work. Do something again, test it. That was somewhat good. Do something, test it. that looks great. All right, and then you move to the next thing. All right, so now I need to worry about this. And then you repeat that process maybe X amount of times, depending on how complicated it is and how familiar you are with the topic, because different topics have different things that you need to emphasize and.
Dalton Anderson (16:12.226)
Yeah, just it's a repetitive process of tests, repeat, tests, repeat, tests, repeat until you get to the result you're looking for. But aren't a lot of things in life like practice, like you're practicing to be a better speaker. These episodes, I'm doing an episode every week to become a better speaker and a better podcast host and try to learn every week. It's pretty much the same thing. I'm just doing a test, repeat, test, repeat. Some episodes are good. Some episodes are bad. So.
training for a sport, you're not good right away. mean, there's many, there's many things that, that relates to.
But in a general sense, it's quite time consuming and it is complex. Okay, so now that we've talked about how they took out the information that they didn't want, let's talk about the model architecture. So the model architecture mirrors the Llama 2 architecture: they used a
dense transformer architecture. A dense transformer is a fairly simple architecture that uses a self-attention mechanism to allow the model to weigh the importance of different words in the input when generating the output. So maybe I'm not explaining that well.
So it looks at the words in the input, right? And it does an analysis of all the words in the input and then it containerizes or contextualizes, sorry, the words in the input and say that, I don't know, I went to the bank to fish, which is actually an example I did in my AI class.
Dalton Anderson (18:16.97)
where are they going? They're going to the bank. So if I didn't, if I just said, I'm going to the bank, you could be going to the river bank to fish. You could be going to the bank to deposit money. Like which bank are you going to? If I said, I'm going to the bank to fish, then okay. So you're going to the bank to go fishing. That makes sense. And so what I, what it's saying is like, okay, where you're at and where you're going,
it's going to give emphasis to the important words in your input to generate an output. So if I said I'm going to the bank to deposit money, then it's obvious I'm going to the bank. If I'm going to the bank to fish, I'm going to the river bank to fish. But what if I just said, I'm thinking about fishing, and then I ask the AI some questions, and then I
go in there and say, I'm going to head out to the bank?
Dalton Anderson (19:23.336)
If it doesn't know the previous output and input, then it wouldn't know the context of where I'm going. It would just say, okay, like whatever. But since it knows, okay, the output and the input that they emphasized last time, the weights that were used for the last input regarding the fishing part and the next piece with the bank, then it knows, I'm going to the riverbank to fish. And it's like, okay, good luck fishing. Like, hopefully you get something.
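Here's a toy sketch of the self-attention idea behind that bank/fish example: every token's representation gets re-weighted by how strongly it relates to every other token in the input. The embedding size, random weights, and sentence are all made up for illustration — this is not Llama's actual implementation, just the mechanic.

```python
# Toy scaled dot-product self-attention: "bank" can pick up context from "fish".
import numpy as np

np.random.seed(0)
tokens = ["I", "went", "to", "the", "bank", "to", "fish"]
d = 8                                   # toy embedding size
x = np.random.randn(len(tokens), d)     # pretend token embeddings

# In a real transformer, Q, K, V come from learned projection matrices.
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)                                          # token-to-token relevance
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over each row
output = weights @ V                                                   # context-mixed representations

print(weights[tokens.index("bank")].round(2))  # how much "bank" attends to every token, including "fish"
print(output.shape)
```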
And so that was the main part. There's a different approach that they could have used, which is called the mixture of experts, which is basically training each section of the model as an individual expert. So we have, like, a coding expert, we have a text expert, we have an, I don't know, text-to-speech expert. And then you have all these different experts that you combine into one. And then
each expert would handle certain things. So you might have a reasoning expert that does logical problems for you or helps you walk through complicated problems step by step. And the mixture of experts is very powerful and efficient, but the issue with training large models is that it
is difficult to scale up and it is prone to instability.
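For contrast, here's a toy sketch of the mixture-of-experts routing idea that Meta decided against: a small router scores the experts for each token and only the top-scoring experts do any work. All the sizes and weights here are made up; it's just to show the routing mechanic, not any real model.

```python
# Toy mixture-of-experts routing: only the top-k experts run for a given token.
import numpy as np

np.random.seed(1)
d, num_experts, top_k = 8, 4, 2
token = np.random.randn(d)

router_w = np.random.randn(d, num_experts)            # learned in a real model
expert_ws = [np.random.randn(d, d) for _ in range(num_experts)]

logits = token @ router_w
gates = np.exp(logits) / np.exp(logits).sum()          # softmax over experts
chosen = np.argsort(gates)[-top_k:]                    # route to the top-k experts only

output = sum(gates[i] * (token @ expert_ws[i]) for i in chosen)
print(chosen, output.shape)   # only 2 of the 4 experts did any work for this token
```

The appeal is that most parameters sit idle on any given token, which is efficient — but, as the paper notes, routing like this is harder to scale stably than a plain dense transformer.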
Dalton Anderson (21:00.28)
That being said.
Dalton Anderson (21:04.302)
There were some decisions to use the transformer architecture
that allowed
Meta to focus — and this is what they did for Llama 2 as well, just bear that in mind — that allowed Meta to focus more on the training and less on the instability of training a model of this size and scale, because it is already complicated enough. And adding to an already very complex problem is a difficult proposition.
And that's what they emphasize in the paper. They're like managing complexity. They said, Hey, like this is super complex thing that we're, we're trying to do right now. And the easiest thing to do is to try to minimize the level of complexity and to do that, let's pick the transformer architecture. And so the transformer architecture is what they picked versus the mixture of experts.
The transformer architecture is also the architecture that
Dalton Anderson (22:20.58)
OpenAI reportedly uses. And just overall, it allows for less worry. Later in the paper, they talk about how like 78% of the issues they had were hardware issues — things timing out, or not necessarily overclocking, but GPUs failing because they didn't have enough RAM
allocated to them, or just various things that make it super hard. The architecture part is the most difficult, I think, because that's where all the issues are coming from. And so they knew that, okay, from the previous training they had a lot of issues, and from all these other research papers there were a lot of issues with architecture and managing that and making sure that there's low latency and efficiency,
and the ability to
group GPUs for large burst training sets.
To do that at a scale of like 10,000 GPUs is very difficult. And to manage the power to every single GPU and make sure they maintain 700 watts is a complex problem in itself, not even counting that you're managing like 15 trillion tokens. And so what they did makes sense.
Dalton Anderson (23:59.406)
But yeah. Okay. So instead of using the, you know, the mixture of experts, they use the transformer architecture.
how did they find out how big their model should be? So the Meta team provided some documentation regarding the methodology that they utilized. They used scaling laws. And scaling laws are a way for people or companies to understand the optimal size of a model given the available compute resources.
And so it's like a formula you plug in, and you say, okay, I've got this amount of compute available and this is the training set that I have — what should I do? And so the scaling laws determined that, hey, their foundational model should be 405 billion. I think they said 403, but then they put it to 405.
So they're a little bit over the optimal compute size. And then it was similar for the eight and the 70. They were intentionally trained longer, a longer duration than that would be considered compute optimal. But it's like, it's not a surefire way to approach it from what the paper was saying. Like it's like best practice, not really best practice, but it's a guide. Think about it as a guide.
And it's not necessarily a rule that they stick by. And so for their 70 and 8 billion models, they trained them longer than they needed to, just to improve the models for inference — inference speed. So if you train it more than it needs, that training will improve the quality of the outputs you get at a given speed and allow it to
Dalton Anderson (26:10.052)
perform better and faster. That's how I would describe it. Like it performs better and faster. And so it's not necessarily most optimal for your resources.
That's not what the scaling laws are emphasizing. It's emphasizing, okay, what's the optimal size of a model given your compute resources and the amount of time to train it.
So for their 70 and their 8, they trained them longer than they needed to and didn't follow the scaling laws, in order to make the models faster and better. Okay. So that makes sense.
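As a rough illustration of the kind of arithmetic scaling laws involve, here's a back-of-envelope sketch using the common "FLOPs ≈ 6 × parameters × training tokens" rule of thumb. The fitted law in the paper is more involved than this, and the numbers below are just the rough public figures discussed in this episode, not the paper's exact values.

```python
# Back-of-envelope scaling-law arithmetic (rule of thumb, not the paper's fitted law).
params = 405e9     # 405B parameters
tokens = 15e12     # ~15 trillion training tokens

flops = 6 * params * tokens
print(f"{flops:.2e} FLOPs")   # ~3.6e+25, the rough order of magnitude for the flagship run

# Flipping it around: given a compute budget and a tokens-per-parameter ratio,
# what model size does the relation suggest?
budget = flops
ratio = tokens / params                      # tokens per parameter
suggested_params = (budget / (6 * ratio)) ** 0.5
print(f"{suggested_params / 1e9:.0f}B parameters")   # lands back around 400B, consistent with the ~403B figure mentioned
```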
Dalton Anderson (26:52.324)
And they also used the 405 to train the 70 and the 8 billion models. And they did so in a manner of, like, having model-based filtering of the data, splitting the data out and allowing for code and reasoning data, and having multilingual data in there. And they had a determination of
this data mix that they had. And so their final data mix, I'll read it off: 50% of the tokens correspond to general knowledge, 25% of them are mathematical and/or reasoning tokens, 17% are code tokens, and 8% are multilingual tokens. There's a rough sketch of sampling from that mix below.
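Here's a toy sketch of what sampling training documents according to that reported 50/25/17/8 mix could look like. It's purely illustrative — it is not how Meta's data loader actually works.

```python
# Toy sampler that draws training documents according to the reported data mix.
import random

random.seed(42)
data_mix = {
    "general_knowledge": 0.50,
    "math_reasoning":    0.25,
    "code":              0.17,
    "multilingual":      0.08,
}

def sample_bucket(mix):
    """Pick which bucket the next training document comes from."""
    return random.choices(list(mix), weights=list(mix.values()), k=1)[0]

counts = {k: 0 for k in data_mix}
for _ in range(10_000):
    counts[sample_bucket(data_mix)] += 1
print(counts)   # proportions land close to the 50/25/17/8 split
```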
And if you're not familiar with tokens — tokens aren't game tokens that you can turn in at your favorite game store. No, tokens are, I would say, a unique key that stands in for a piece of the input. So think about a token as like a number, right? Like a mathematical token. The reason why these models have tokens is because a model can't read text. So if you feed a model
the word "the," like T-H-E, the model doesn't know what "the" is. So what you do is you tokenize your unstructured data. And once you tokenize your data, it turns your "the" into a number that the model can read. And then the machine learning model will understand that going to the bank or going to the river bank is
something the model can read after it's tokenized, but if you don't tokenize the data, then it doesn't work. So there are different ways to tokenize things, and they have different effects. You can tokenize per letter, you can tokenize per word, you can tokenize per phrase, you can tokenize per file — there are hundreds of ways you can tokenize. And they're better at different things. Like, you know, whether you tokenize per word or letter or phrase
Dalton Anderson (29:20.878)
depends on what you're trying to do, obviously. So it's one of those cliché answers: it depends on what you're trying to do. And so I don't know how they're tokenizing things — it doesn't say in the paper — but I assume that they're doing a variety of different types of tokenization. And I don't know how you tokenize code. I've never done that before. I've only tokenized unstructured data
in the sense of it being words or research papers or, you know, just general stuff like reviews, those kinds of things — or Twitter, like tweets, or YouTube comments, 4chan comments, those kinds of things. I've never tokenized code or these other things. I have tokenized some multilingual comments, but
yeah, as far as how they tokenize mathematical and/or reasoning tokens, I'm not sure. Like if it's a math problem, how do you tokenize a math formula? I don't know. I don't know that one. Sorry.
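For intuition, here's a toy illustration of tokenization at two granularities, character-level and word-level, just to make "the model can't read raw text" concrete. Llama 3 actually uses a subword (BPE-style) tokenizer with a vocabulary of roughly 128K tokens, so this is not that tokenizer — only the general idea.

```python
# Toy tokenization at two granularities: characters vs. whole words.
def build_vocab(units):
    return {u: i for i, u in enumerate(sorted(set(units)))}

text = "I went to the bank to fish"

# Character-level: tiny vocabulary, long token sequences.
char_vocab = build_vocab(text)
char_ids = [char_vocab[c] for c in text]

# Word-level: bigger vocabulary, shorter sequences, struggles with unseen words.
words = text.split()
word_vocab = build_vocab(words)
word_ids = [word_vocab[w] for w in words]

print(len(char_ids), char_ids[:10])
print(len(word_ids), word_ids)
```

Subword tokenizers sit between these two extremes, which is why they're the usual choice for large models.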
But they did talk about some of the challenges that they had with this large language model. And one of the things that they did was grouped query attention, where multiple query heads are grouped to share the same key and value heads.
Dalton Anderson (30:59.502)
This would allow for, hmm.
Dalton Anderson (31:08.632)
there's some computational benefits, it allows for like further.
Dalton Anderson (31:16.484)
I see this.
Dalton Anderson (31:20.056)
It allows for further refinement of what's being attended to, and that refinement gives them either more computational efficiency or easier decoding. And so they used many different kinds of attention. So the model is the — what is it?
Dalton Anderson (31:48.836)
Okay, so it uses...
Dalton Anderson (31:53.796)
just wanna make sure that I find it. So it's an auto-regressive decoder, and it has a dense transformer architecture. So it uses the dense transformer with a self-attention mechanism, and they're using these other mechanisms on top of the self-attention.
So self-attention is what we talked about earlier. It's about relating each word in the input sequence to every other word, allowing the model to capture relationships and dependencies in the input to produce a reasonable,
well-thought-out output. And then there is multi-head attention, which, you know, lets the model attend to the input in several different ways at once. And then on top of that, the query heads are grouped so they share key and value heads, which is the grouped query attention.
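Here's a shape-level sketch of the grouped query attention idea: several query heads share one key/value head, which shrinks the key/value cache compared with standard multi-head attention. The head counts and dimensions below are toy numbers, not Llama 3's actual configuration.

```python
# Toy grouped query attention (GQA): 8 query heads share 2 key/value heads.
import numpy as np

np.random.seed(2)
seq_len, head_dim = 6, 4
n_q_heads, n_kv_heads = 8, 2            # 4 query heads share each KV head
group_size = n_q_heads // n_kv_heads

Q = np.random.randn(n_q_heads, seq_len, head_dim)
K = np.random.randn(n_kv_heads, seq_len, head_dim)   # far fewer K/V heads to cache
V = np.random.randn(n_kv_heads, seq_len, head_dim)

outputs = []
for h in range(n_q_heads):
    kv = h // group_size                              # which shared KV head this query head uses
    scores = Q[h] @ K[kv].T / np.sqrt(head_dim)
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    outputs.append(weights @ V[kv])

print(np.stack(outputs).shape)   # (8, 6, 4): a full set of query heads, but only 2 KV heads stored
```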
And then on top of that, it uses an attention mask. Say that I uploaded several documents — well, this is about the training set, so just bear that in mind; they're training the model. So the attention mask is when you upload — I keep saying "when you upload" because I'm thinking about it from a user perspective, but we've got to think about it as training this model. So in the model we have
many documents and they might be related, right? So one document might be talking about how to train your dog to sit. And the other document might be about like training your dog to sit and fly or something. I don't know. So in this context, it wouldn't allow for the tokens to, or it wouldn't allow the model to
Dalton Anderson (34:03.318)
emphasize information from both documents at the same time. So it wouldn't cross pollinate basically. So it doesn't allow for the model to utilize multiple documents. And it, this is used to improve long contextual questions and answers. So if you're asking about a document, it doesn't automatically go to a separate document that
that might be somewhat related, but not the same thing as what you're talking about. And it prevents this.
Dalton Anderson (34:39.684)
attention across multiple documents in a sequence, if that makes sense. So it prevents many documents from being attended to together and used for the output when they're doing the training — to prevent
Dalton Anderson (34:58.52)
these different documents from bleeding into each other. I was trying to say that without saying "documents" — like, texts, forms, I'm not sure.
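Here's a toy sketch of that masking idea: when two documents are packed into one training sequence, tokens are blocked from attending across the document boundary, on top of the usual causal "don't look ahead" mask. The sequence below is made up purely for illustration.

```python
# Toy document-boundary attention mask for a packed training sequence.
import numpy as np

# Pretend a packed sequence of 6 tokens: first 3 from doc A, last 3 from doc B.
doc_ids = np.array([0, 0, 0, 1, 1, 1])
seq_len = len(doc_ids)

causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))   # can only look backwards
same_doc = doc_ids[:, None] == doc_ids[None, :]             # can only look within your own document
mask = causal & same_doc

print(mask.astype(int))
# Rows 3-5 (doc B) have zeros in columns 0-2, so doc B's tokens never attend to doc A.
```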
Okay. So the next step would be annealing the data, which is the final step of pre-training and fine-tuning. And annealing the data is basically when they slowly starve out the model's learning rate. And so the learning rate might start at, like, say 50, and —
if you don't know what the learning rate is, it's like how fast they allow the model to learn from the training data. And then they test it, and there is a learning rate that's optimal for what you're trying to do. And basically these different learning rates affect the model differently: how fast it learns. A faster rate takes less time to train, but then it might have issues with,
you know, recalling everything. There's — I mean, just this learning rate thing is a complicated topic on its own, and it could be its own podcast episode, so I'm just trying to touch the surface. I understand that if you're listening, you're like, wait, that's not all a learning rate entails. I gotcha. I'm just not trying to go there right now. Okay. So just think about it as how fast the model can learn: a faster learning rate takes less time but
doesn't learn as deeply; a slower learning rate learns more stuff, takes more time. Okay. So you want to find a middle ground, because if you learn too fast, then you're not learning everything, and if you learn too slow, at a certain point you're not going to learn any more. Like if you study a math problem for 20 hours,
Dalton Anderson (37:07.78)
it probably would have been all right to study for two hours on that math problem. Like after, after two hours, you're not learning anymore. And so there's a balance of, okay, like how, how much time should you study on one problem? Okay. So two hours here, an hour here, an hour here, an hour there. Okay. So all in, I'm at like 15 hours instead of spending 20 hours on one problem. And so you want to, you want to find this middle ground between how fast you learn and the amount of time it takes.
and how deeply you learn. That's learning rate in a nutshell. Very, very light summarization of that. But basically they reduce the learning rate gradually until zero.
And I'm pretty new to this topic and I had to look it up,
you run your set through, like your data through, your training data through, and then you then reduce the learning rate to zero.
Dalton Anderson (38:24.958)
And then...
Dalton Anderson (38:32.78)
when you do that.
It curates a data set of exceptionally high quality examples.
Dalton Anderson (38:46.733)
So.
Annealing the data allows for
Dalton Anderson (38:56.236)
subtle adjustments to the parameters based on high quality data. It creates a high quality data set; it finds the high quality data,
Dalton Anderson (39:09.28)
allows for these subtle changes, it provides high quality data, and it
creates a checkpoint. And so this checkpoint would be saved — they'd save the model parameters. So they do this annealing, and then they would save the checkpoint. Like, okay, so they did their training, they'll do a save and they'll save the parameters of the model, and then they'll keep training on some different stuff. And then they would
go do that training, then they would save it and then they would kind of compare the different changes that were made at the checkpoints of the training of the model.
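To make the annealing idea concrete, here's a toy sketch of a learning rate that decays linearly to zero over the final stretch of training on a small high-quality mix, with checkpoints saved along the way. The schedule shape and all the numbers are illustrative, not the paper's actual settings.

```python
# Toy annealing schedule: learning rate decays linearly to zero, with periodic checkpoints.
def annealed_lr(step, total_steps, peak_lr):
    """Learning rate that decays linearly from peak_lr down to zero."""
    return peak_lr * (1 - step / total_steps)

peak_lr, total_steps = 1e-5, 1000
checkpoints = []

for step in range(total_steps + 1):
    lr = annealed_lr(step, total_steps, peak_lr)
    # ... one training step on the high-quality annealing mix would go here ...
    if step % 250 == 0:
        checkpoints.append({"step": step, "lr": round(lr, 8)})   # stand-in for saving model weights

print(checkpoints)
# The saved checkpoints can then be compared or averaged, which is the
# "checkpoint averaging" mentioned in this section.
```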
Yeah, my notes got a little messed up there. I copied and pasted my notes from a separate place where I wrote them down, which was not as organized, and then put them in here — but I copied and pasted them twice. Well anyways, so
what I found pretty interesting was they were talking about the compute budgets. And so they were talking about these things called FLOPs — it's 10 to the 18th and then 10 to the 21st times six, and a FLOP is a floating point operation — which comes out to six sextillion, which is 21 zeros. Which is crazy. Crazy.
Dalton Anderson (40:48.004)
Six, six sextillion is a lot. It's so much, so much. It was hard to — I mean, it's hard for me to even understand what six sextillion is. I had to look it up. I knew it was big, but I was like, 21 zeros, and you look at it against a billion, and a billion is nine, right?
9 zeros versus 21 zeros.
Yeah, so it was crazy absurd.
But anyways, they had that. And so that was basically — at the end, they do this annealing, which is, you know, where they reduce the learning rate to create high quality data. They have the checkpoint averaging, where checkpoints are saved constantly throughout the day across the different types of training runs that they have. They
run all that stuff, save it — but what is running it? And so the compute that they used was 16,000 H100s, which is, I think, like $600 million or $800 million worth of GPUs that they are utilizing to have these AI workflows.
Dalton Anderson (42:18.946)
And the servers are Meta's Grand Teton AI servers. Each server contains eight GPUs and two CPUs. And within a server, the GPUs are connected with NVLink, which is basically a high-speed connection between the GPUs. And then they have a job scheduler,
which is also made by Meta — MAST, which is Meta's global-scale training scheduler. So within just this architectural piece: they open sourced their, what is it, server racks. Like, years ago, a long time ago, there was a company that had a monopoly on server racks and they were absurdly expensive. Meta designed something similar
and then open sourced it to drop the price of all the server racks. So they have the server racks they open sourced and basically designed themselves a long time ago. Their GPUs are from Nvidia, but the servers themselves are Meta's. Their job scheduler is Meta's.
Dalton Anderson (43:44.292)
Crazy. I mean, and then they built the web scraper themselves. Then they have their own file system — their file system is the Tectonic distributed file system, which apparently has a throughput of two to seven trill— not trillion, that'd be crazy — two to seven terabytes per second. And they noted that the
challenge with their file storage system was the requirement of handling the bursty checkpoint nature of the writes to the file system, where a large amount of data needed to be saved quickly, which would saturate the storage system. And so they always did this checkpointing to save the model's progress.
And the checkpointing was, you know, done in case of failures, because there were quite a few — not that many given everything going on, but yeah, quite a few. They have an infographic within the research paper that breaks down the root cause of every single failure, which is pretty interesting. But when you add it all up, it's only a couple hundred.
And granted, they have 16,000 GPUs running. Which — I guess, what did they say? They said each server has eight
GPUs, so
Dalton Anderson (45:33.092)
I mean, I could have done that easily. So they have 2,000 servers. And then they have two CPUs each, so that's another 4,000 CPUs. So they have roughly 20,000 different pieces of hardware that they're managing
Dalton Anderson (45:59.716)
for this whole training piece.
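Here's the quick sanity arithmetic on those hardware numbers, assuming the figures mentioned above (16,000 GPUs, with eight GPUs and two CPUs per server) are right.

```python
# Sanity arithmetic on the hardware counts mentioned above.
gpus = 16_000
gpus_per_server = 8
cpus_per_server = 2

servers = gpus // gpus_per_server      # 2,000 servers
cpus = servers * cpus_per_server       # 4,000 CPUs
print(servers, cpus, gpus + cpus)      # 2000 4000 20000 pieces of hardware
```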
Dalton Anderson (46:06.212)
And then also, the system apparently provides 240 petabytes of storage across 7,500 servers equipped with solid state drives. And by the way, something I didn't know about solid state drives: if your computer has a solid state drive, make sure to charge that computer or turn it on, like, once a month.
If not, your data might get corrupted from your solid state drive because it, the data is stored on the drive and the drive doesn't like write the way that a hard drive does. And so if you don't charge it, your data could like disappear by the way. I didn't know that. So I figured I'd just throw that out there. Cause I learned it last week. I was like, wait, what? That's crazy. What, what, how is it? How is it? How is it a hard drive then? Well, it isn't, well, it's a solid state drive, but
You know what I mean. So just throwing that out there just in case you might need it.
Okay. So we're moving on to the post-training — or actually, the last thing, which I said earlier: 78% of the interruptions were from confirmed or suspected hardware issues, which you can find in section three of the paper.
Okay. So post-training of Llama — post-training of Llama would, you know, refer to the data pruning section, enhancing the code capabilities, the multilingual expert training, and then the challenges of mathematical reasoning and the methodology to improve reasoning, the challenges and solutions in long-context handling, and knowledge probing techniques.
Dalton Anderson (48:07.14)
And then after that we have safety first: ensuring responsible AI. And then we're done. We're almost there. So that one section took us 48 minutes. And so, as I was telling you, this is a seriously dense paper and there's just so much information to go over — and this is after I cut it down. But okay. So, data pruning.
They talked about a couple of parts of their data pruning. So they had topic classification, quality scoring, difficulty scoring, and semantic deduplication. Topic classification is pretty easy, where they just say, okay, this is a math problem, this is mathematical reasoning, this is mathematical reasoning with geometry, mathematical reasoning with trigonometry. And so it allows a better distribution of the data.
Quality scoring used a reward score that would rate the accuracy of instructions or answers to questions, and then examples with higher scores would be fed through. So it would score things — the model would — and then it would be reviewed and then run through. If it was scored high, it would be utilized.
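Here's a minimal sketch of that kind of quality filtering: score each candidate training example and keep only the ones above a threshold. The scoring function below is a hypothetical placeholder heuristic standing in for a real reward model; it is not Meta's actual scorer.

```python
# Minimal reward-score filtering sketch: keep only examples above a quality threshold.
def score_example(example: dict) -> float:
    """Hypothetical reward model stand-in: returns a quality score between 0 and 1."""
    # In a real pipeline this would be a learned model scoring accuracy/helpfulness;
    # here it's just a placeholder heuristic based on answer length.
    return min(1.0, len(example["answer"]) / 200)

def filter_by_quality(examples, threshold=0.8):
    return [ex for ex in examples if score_example(ex) >= threshold]

examples = [
    {"prompt": "Explain MinHash.", "answer": "MinHash approximates Jaccard similarity " * 10},
    {"prompt": "What is 2+2?", "answer": "4"},
]
print(len(filter_by_quality(examples)))   # only the higher-scoring example survives
```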
Ahem.
there were some pieces where you would.
Dalton Anderson (49:43.694)
try to reduce redundancy in the training set and remove things that weren't necessarily important. And so one of the things that they did was remove data that was excessive, like emojis and exclamation points. And I remember when I was first using Llama 2, Llama 2 would really, for some reason, just lay it on you with
emojis and like really kill you with emojis. They loved emojis. And so I think that they got that from
obviously social media, because people use emojis on social media — you know, more than normal. But yeah, it just loved emojis and exclamation points. And so in this case they removed some of that, and they also removed overly apologetic phrases like "I apologize," "I'm sorry." Because it is kind of weird when the model's like, I'm so sorry, my, my king, please, please don't —
Dalton Anderson (50:57.396)
— don't hurt my family, or something like that. I know some of these models can be weird, and they're, like, crazy sorry about little things. It's like, you're blowing this way out of proportion; I guess it's fine, let's just move on, you're making this weird. Okay. So, for enhancing coding capabilities, what I thought was pretty cool is that they made a coding agent that coded, basically, and then fed that data over to
their model. So they created this kind of expert trainer. It was this code expert model that continually trained on a large data set of code. And then this code expert was used to create high quality annotations from human code, or vice versa:
take annotations from humans and then code them up for, you know, fine-tuning. And then they did synthetic data generation for different coding pieces — providing feedback on what was made, creating new code snippets or logical walkthroughs on how to create the code — and just providing
Dalton Anderson (52:28.556)
a way to evaluate itself with its own synthetic data.
And then it would create, like, a problem description. So maybe it takes the code and describes the problem with the code, or takes the problem that someone's having and then writes the code for it. And then it takes the correctness of the code, analyzes that, and
analyzes the code for formatting, for syntax, all sorts of things. And then after all of that, it does self-correction. And then this stuff is fed into Llama — Llama's given a whole bunch of training data from this code expert, because the code expert's super good at coding. And then
Dalton Anderson (53:37.186)
Llama learns how to code. But one of the interesting pieces was that Llama was having issues with code that wasn't as common, obviously. Like, there's a lot of Python, there's a lot of Java out there, but what about a niche language? So their solution was to translate code from those uncommon languages into common languages. And that allowed the model to understand — okay, if you ask about these uncommon languages —
I guess it's not really uncommon, if you did Rust or like...
Dalton Anderson (54:14.116)
Julia. If you did Rust or Julia, it might not know right away. But then if you did this translation — where you translate the uncommon code into common code that it has a good understanding of — then it understands: okay, this is that, and that is this, and then it can write the code for you. Okay.
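Here's a hedged sketch of that translate-rare-languages-into-common-ones idea: run snippets from a less common language through a code model to get a Python version, keep the pairs that pass a basic check, and use them as training data. The `llm_translate` function is a hypothetical stand-in, not an API from the paper.

```python
# Sketch of building training pairs by translating rare-language code into Python.
def llm_translate(source_code: str, source_lang: str, target_lang: str = "Python") -> str:
    """Hypothetical stand-in for a code-expert model call that translates code."""
    # A real implementation would prompt the model; here we just return a marker.
    return f"# translated from {source_lang}\npass"

def compiles(python_code: str) -> bool:
    """Cheap syntax check on the translated Python."""
    try:
        compile(python_code, "<translated>", "exec")
        return True
    except SyntaxError:
        return False

def build_translation_pairs(snippets):
    pairs = []
    for lang, code in snippets:
        translated = llm_translate(code, lang)
        if compiles(translated):                      # keep only pairs that pass a basic check
            pairs.append({"rare_language": lang, "source": code, "python": translated})
    return pairs

print(len(build_translation_pairs([("Julia", "f(x) = x^2")])))   # 1
```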
So then they had multilingual training. And just to be clear, when we're talking about training here, we're talking about post-training. So they're adjusting what was originally trained: they're collecting outputs and understanding what needs to be changed with the model.
Okay, so multilingual.
Dalton Anderson (55:10.154)
They were trying to emphasize they didn't use synthetic data for this multilingual set. They try to use a diverse set of languages and cultures and male and female and in just different backgrounds to allow for a wide range of language inputs. Like, because people might have.
might have knowledge in one language, but might write it separately or speak it separate, different way. I mean, everyone understands this concept of like accents and different vocabularies that are utilized from different areas of the world, blah, blah. And so that they didn't use synthetic data to ensure that they didn't have biases and they wanted to...
reflect real-world use cases and what's going on with the language in that respective place. And just in general, they used multiple languages from various viewpoints to train the multilingual piece. And they did the same thing as with the coding: they made a multilingual expert, and that multilingual expert trained 90% on non-English
languages. And it kind of did the same process as the coding and then trained Llama. And that's how Llama knows how to understand other languages,
if that makes sense. I kind of went into detail on the coding piece because there were some nice bits from the research paper that I could easily recall and that I quite enjoyed. Okay, so the next thing is the challenges of mathematical reasoning. So in here, they talk about mathematical reasoning, and there are a couple of things that are big issues: a lack of
Dalton Anderson (57:23.972)
prompts. There aren't many complex math prompts out there. A lack of ground truth chains of thought, incorrect intermediate steps, teaching tool usage, and the training/inference disparity. Let's break this down. So, lack of prompts:
These mathematical prompts aren't easy to come by because they are complex and the amount of prompts and or formatting for
Dalton Anderson (58:08.682)
math questions greatly decreases as the complexity increases. So they have an inverse relationship with supply. It's difficult to have the model reason through these complex math problems when there isn't that much data to train it on.
It's like the chicken-and-egg thing. It can't really solve problems because of the lack of prompts, and it doesn't have ground truth because of the lack of prompts; but to get more prompts, it can't necessarily make its own prompts, because there's not enough data to train it to make new data. And there's not enough data to train it on to have a good foundation for how to go about these problems. So there are these, you know, issues with
you know, having the right chain of thought and the intermediate steps. And then there is a training/inference disparity, where things that are fine-tuned might not work perfectly in the real world. And then there's this tool usage — teaching tool usage for these large language models, or multi-
modal models, which is what pretty much everything is becoming now. They used to be LLMs, but now they're multimodal. These multimodal models have what's called tool usage. And this is the easiest example I could think of — there are two that are pretty straightforward. The training data set for this is up to, I think, 2023 for Llama 3.1. But you could ask it things that are from the real world.
Like, you can ask it, can you tell me what the stock price of Meta is as of August 2024, August 11th? All right, I guess — let's do the eighth or something like that. If we did the eighth, it would be able to pull back that information. How does it do that? It uses an API from Brave Search.
Dalton Anderson (01:00:28.75)
And I'm not sure why they use Brave, but they didn't really talk about it. Why they use Brave. And so it uses the API, looks up the information and gets the output like, and goes in and goes on websites or whatever it needs to do using Brave.
That's a tool. So it's, it's calling in. It's you're calling the model say like, I need something that's real time. I want the stock price August 8th, 2024. The model recognizes, Hey, I don't have this information. Where can I get it? I can get it from the internet. Okay, cool. So it goes to the internet, pulls it back. Okay. Here's the information. There's another one where for these complex coding or coding problems in mathematical reasoning problems,
that kind of go hand in hand for some of these things like the scientific things, it would pull up, like if you asked it a problem, it would pull up a code interpreter and create code to help solve the problem. So it makes this code interpreter to enhance its reasoning capability.
That's tool usage, you could say — building that functionality to allow the model to better handle mathematical reasoning problems.
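Here's a rough sketch of that tool-use loop: the model decides it needs live data or computation, calls a tool, and folds the result back into its answer. The `brave_search` and `python_interpreter` functions below are hypothetical stand-ins for the real tools, and the keyword routing is a placeholder for the model's own decision-making.

```python
# Sketch of a tool-use loop: decide whether a tool is needed, call it, use the result.
def brave_search(query: str) -> str:
    """Hypothetical wrapper around a web search API (the paper mentions Brave Search)."""
    return f"[search results for: {query}]"

def python_interpreter(code: str) -> str:
    """Hypothetical sandboxed code-execution tool."""
    return "[execution output]"

TOOLS = {"search": brave_search, "run_code": python_interpreter}

def answer(question: str) -> str:
    # Stand-in for the model deciding whether its training data is enough.
    if "stock price" in question or "today" in question:
        observation = TOOLS["search"](question)          # real-time info -> call the search tool
        return f"Based on {observation}, here is the answer..."
    if "solve" in question:
        observation = TOOLS["run_code"]("print(2 + 2)")  # math -> write and run code
        return f"I ran some code and got {observation}."
    return "I can answer that from my training data."

print(answer("What is Meta's stock price today?"))
```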
Dalton Anderson (01:01:45.292)
Okay, so the way that they went about fixing these reasoning issues: for coding, they made their own coding expert. For the mathematical problems, they converted the pre-training data set into, like, a question-answer format. And so they converted all that data
from unstructured, it's still unstructured, but they structured the unstructured data into a question and answer format to create more.
data and to allow the model to easily understand what's going on. So question, answer, question, answer, question, answer, instead of like a big long context.
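Here's a minimal sketch of reformatting raw passages into question/answer pairs, the idea described above. The `llm_make_qa` function is a hypothetical stand-in for a model call; it is not Meta's actual pipeline.

```python
# Sketch of converting raw passages into question/answer training pairs.
def llm_make_qa(passage: str) -> dict:
    """Hypothetical stand-in: ask a model to turn a passage into a Q/A pair."""
    return {
        "question": f"What is the main point of: '{passage[:40]}...'?",
        "answer": passage,
    }

def convert_corpus(passages):
    qa_pairs = []
    for p in passages:
        pair = llm_make_qa(p)
        if pair["question"] and pair["answer"]:   # trivial validity check
            qa_pairs.append(pair)
    return qa_pairs

corpus = ["The derivative of x^2 is 2x because ...",
          "A prime number has exactly two divisors ..."]
print(convert_corpus(corpus)[0]["question"])
```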
Ahem.
They had an improvement to their reasoning: they used the model to train itself by generating a step-by-step solution for each of these prompts, which were then filtered and verified for correctness to create a high quality training set. So they took the lack of prompts, made this question-answer format, then they took those questions and answers,
Dalton Anderson (01:03:14.709)
used Llama to make a step-by-step
solution — like a step-by-step solution — for the prompts that they gave it. And then the humans filtered it
Dalton Anderson (01:03:29.366)
and only kept the high quality ones, which was, you know, manual.
And then they got rid of the ones that obviously weren't that good. And then they opened it up to code and text reasoning. So if it was code, they could use the tool that they created — the code interpreter — at inference, and if it was text reasoning, they were able to just go about solving the problem, like thinking it through.
And then from there, the last thing they did was human feedback on the model: they prompted it to correct its mistakes, improving its ability to learn from its errors. So humans gave it feedback on things that it was wrong about, and then the model was able to correct the output for those things. And then that helped the model understand what was an error and what wasn't, and overall helped the model
understand errors and improve itself in that manner. Okay. So that was that.
So the next challenge that they had to solve was long-context handling — or, I guess, solving for long contextual inputs. And so there is a lack of prompts, obviously. There isn't that much out there; it's difficult to obtain high quality, long-context information
Dalton Anderson (01:05:09.56)
with human annotation because hey, like it's super time consuming to format and to annotate like.
these texts, it's quite time consuming. People don't wanna do it.
So what they did for a long context is they heavily relied on synthetic generation and synthetic generation, they did something similar for
Dalton Anderson (01:05:42.818)
the — my gosh, I got so caught up on this. They did something similar to improving the mathematical reasoning, where they did question-answer, and then they added in summaries. So they did question answering, similar to mathematical reasoning, they did summarization, and then they also did code reasoning. So they basically reused the same approach as
the mathematical reasoning for long context handling. And cause they are kind of similar cause you need a logical thought process to handle mathematical problems. You normally need a logical process to process long context. So they're somewhat, somewhat similar.
Dalton Anderson (01:06:36.515)
They initialized a
Dalton Anderson (01:06:44.164)
They initialized their synthetic generation — you know, as I said before, they used the question answering, summarization, and code reasoning. They tried to balance out short and long contexts, between the synthetic long context and the original short context. And it worked out to be about 0.1%
of both short and long. So there's not that much synthetic long context, but I'm assuming the ones that are long are pretty long, because the context window for Llama 3.1 is 128,000 tokens, I'm pretty sure. So, yeah, it's quite interesting, I think, in that regard. So now we're going to move over to
safety. So we skipped over some stuff, but now we're going to move over to safety. And safety is something that's brought up a lot, right? So these multimodal models are becoming very good in a short amount of time. And so there's a large emphasis on safety, especially when you're doing it open source. So one of the key winners or like, not really a key winner, but
key aspect of what closed models are pushing: like, hey, we're a closed model — if something bad happens, we can just shut it down. But if you're open source and people are going to do bad things, they're going to be able to do bad things, because it's out there. So there's a whole team of people at all these companies, especially at Meta, where they do uplift testing, which is to evaluate whether a new, emerging technology like this
Llama 3.1 model could allow individuals or groups to perform tasks, particularly ones that could cause risk. Not only does it allow them to perform them, but does it increase performance — so, significantly increase their capabilities in those tasks? Can they do that? And they also do red teaming, which is kind of similar to uplift testing, but a red team is just a group of people within the company that are trying to
Dalton Anderson (01:09:09.24)
break the model and in this case, make the model do things that it shouldn't be doing like make chemical, like teach someone how to make a chemical weapon or create like crazy advanced viruses that you could get on someone's computer and run it and take down a system, take down infrastructures, those kinds of things. Make a nuclear weapon.
Those kinds of things are the events that the uplift testers and the red teams try to prevent. And they have a whole bunch of benchmarks that they talk about and you know, the overall findings were, you know, they didn't exhibit any significant vulnerabilities. However, there were some concerns that were identified from internal and external audits. So insecure code generation. So what they mean by
that is: the code that's generated is good, but the code can sometimes be broken into, making models susceptible to malicious code. So large language models are susceptible to those things. So you can, eventually, force the model to do things that it shouldn't be doing.
You can do prompt injection, which is like playing a game with the model to, I guess, game the system and convince it it's not doing something when it actually is. So you kind of force the model to do things it shouldn't be doing, when it normally wouldn't do those things if you
were asking in a straightforward manner. But if you're playing a game with the model, it may not fully understand what the actual outputs are — it just understands it's playing a game with you, and then it's making things that it shouldn't be making.
Dalton Anderson (01:11:17.582)
Ahem.
Dalton Anderson (01:11:23.7)
I think that's it. They have code generation — that being the generation of this malicious code — the prompt injection, and they also have phishing attacks on there.
But they noted that the 70 billion model had a higher success rate when it came to convincing the model to make a phishing attack, versus the 405 billion. And then —
Dalton Anderson (01:12:05.784)
There were some notes on cyber attacks — I'm not too familiar with what was on there, and my notes don't say — but it said that they are capable of network resistance. So it was fine. I mean, if it was something that they were super concerned about, obviously the model wouldn't have been published. So I don't think it's a big deal.
Dalton Anderson (01:12:33.956)
And it was noted that there was insignificant uplift from Llama 3 to 3.1. And so we're good there. So this podcast episode was quite in-depth: we discussed the architecture of Llama 3.1, the problems that they solved, how they went about their pre-training, their post-training, what methodologies were utilized, what algorithms,
Dalton Anderson (01:13:05.349)
overall just different issues that they had.
And it was quite dense and quite complex, which I would definitely agree with you if you're nodding your head right now. My head was a little, little sore after reading all those pages and writing out the notes, which took a long time.
But I hope that when you're discussing these things with yourself internally or with your peers, you have a better understanding on what these models are doing and how are they built under the hood. And when you're doing these certain things, understanding, for these long context handling models, it's trained on question and answering. So try to break it up in question and answer question, question answer format. Okay.
It's trained on summarization. Okay, when I have long context, let's ask for a summary, it's going to do well with a summary, or understand it's trained for code reasoning. So it's trained for managing and reasoning over like large code repositories. Okay, so when I'm asking for like long context, those are the three things that it's trained on doing. So the model should know how to do those things very well.
That is a snippet of information that is very valuable. So understanding how these different kinds of inputs are trained will allow you to better utilize these models in a fashion that you haven't been able to do before, nor have your peers. So you'll have that one step up. And it's also — I find it very interesting, but it also is important, as I said before —
Dalton Anderson (01:14:57.782)
Once you know how it's trained, then you know how to utilize it. So if you're doing the same things it's trained to do, then yeah, you should get pretty good results. And I think that's important, right? At the end of the day, we're all trying to save some time. And so this podcast might not necessarily save that much time, but it might save you some time over time. I don't know. It's hard to say if everyone utilizes this information.
last snippet. And maybe I should have said this in the beginning of the episode. I am at one, no, not one year. I'm at half a year of consistently posting, which is pretty cool. And it seems like there's an uptick in subscribers that I'm getting on YouTube, which is very nice. I was surprised. I was like, wait, did someone like share this post with someone or like what's going on? It's quite confusing why there's so many people subscribing, but I don't mind it. It's pretty cool.
And once again, you know, I appreciate the support that you're providing in, helping me.
share this to the world. And I hope that you find use out of this content. I know I do cause I get to learn a lot and I hope that you're learning with me and that this information is useful like it is to myself. And hopefully in the future, I continue providing useful information, right? That's not a guarantee either. It's not a guarantee that you keep listening. It's not a guarantee that I keep providing useful information.
So we're both, we're both unsure on those fronts, but I do want to thank everyone for continuing to listen in every week. It's very, very nice. And I really appreciate you. And I hope that we can just keep learning together. That being said, what are your favorite parts of the episode? I personally love the architecture part where we go over the different architectures and like why it was the way it was.
Dalton Anderson (01:17:07.284)
and the way that they went about filtering, because I've had to do something at my work that was kind of similar to this — not necessarily as complex — but some of the ways that they filtered out and deduped data are quite interesting, and I think they'll be useful for the job I work at when I have to work on another NLP project. I'll try to utilize some of the approaches that they used, and
Yeah, I just thought it was super cool. And I think that's something that I can apply in my day to day life when I get another project like that. But once again, yeah, comment, comment what you think is, is interesting about this video. And of course, if you want, you could like the video, subscribe, do whatever you do. I mean, it doesn't matter if you, if you don't listen, it doesn't matter if you listen every time. I'm still thankful for every single listener.
And once again,
Have a good day, good morning, a good afternoon, a good evening, wherever you are in this world, I hope that you enjoyed listening today and that you listen again next week. Because next week we're going to be discussing.
Google's.
Dalton Anderson (01:18:30.98)
Pixel, I don't know, Pixel built by Google, built by Google release, made by Google release. I don't know what they called it, whatever. It's the Google showcase with all the products and they're gonna announce some AI stuff and AI integrated products and some just general announcements regarding the Pixel brand and some AI advancements, I'm sure. have a great weekend, week, day, whenever you're listening to this podcast, because the internet has no sense of time.
So you'd be somewhere out there in this world and I hope that you'll be out there again next week. See ya. Bye.