Venture Step

Summary

In this episode, Dalton Anderson discusses the second half of the research paper on Llama 3 by Meta. He focuses on the topics of model safety, red teaming, inference, vision experiments, and speech experiments. The paper provides detailed insights into the challenges faced by Meta and the solutions they implemented. Dalton emphasizes the importance of simplicity in tackling complex problems and highlights the potential of Llama 3 in breaking down language barriers and improving communication across cultures.

Keywords

Llama 3, Meta, research paper, model safety, red teaming, inference, vision experiments, speech experiments, simplicity, language barriers

Takeaways

Llama 3 by Meta is a high-quality foundational model with over 400 billion parameters.
Red teaming is an important aspect of model safety, where the model is intentionally tested for vulnerabilities and weaknesses.
Inference in Llama 3 involves pipeline parallelism and the use of micro-batching to improve throughput and latency.
Vision experiments and speech experiments were conducted to train the model on image, video, and audio data.
Simplicity is key in tackling complex problems, and Meta emphasizes the importance of keeping things simple in their research and implementation.

Sound Bites

"The best approach is normally the one that is easy to implement and yields... that choice is probably the best one."
"Simplicity is the solvent for complexity."

Chapters

00:00 Exploring Model Safety and Red Teaming
20:33 Enhancing Inference and Processing Efficiency
30:17 Unleashing the Potential of Vision Experiments
39:53 Revolutionizing Speech Experiments
48:21 The Power of Simplicity in Problem-Solving

Creators & Guests

Host
Dalton Anderson
I like to explore and build stuff.

What is Venture Step?

Venture Step Podcast: Dive into the boundless journey of entrepreneurship and the richness of life with "Venture Step Podcast," where we unravel the essence of creating, innovating, and living freely. This show is your gateway to exploring the multifaceted world of entrepreneurship, not just as a career path but as a lifestyle that embraces life's full spectrum of experiences. Each episode of "Venture Step Podcast" invites you to explore new horizons, challenge conventional wisdom, and discover the unlimited potential within and around you.

Dalton Anderson (00:01.104)
Welcome to the Venture Step podcast, where we discuss entrepreneurship, industry trends, and the occasional book review. Llama 3 is here. Llama 3 was introduced a couple of weeks ago by Meta. This is their new iteration of Llama, and they came out with their over 400 billion parameter model, which is the first open-source foundational model at this scale. Last couple of weeks we broke down

the first half of their research paper that they published relating to their findings, their architecture, their training data, and overall just different troubleshooting things that they had to solve for while doing something like this at scale. We broke down the first half, and then this week we're gonna break down the second half of the paper, which mostly focuses on how they went about red teaming, some of the things that

they were troubleshooting against and what were some of their solutions. They also introduced some new tools that we'll discuss and how they built those tools. Talking about, you know, quantized models versus non-quantized models. They also discussed the experiments that they had with vision and voice, which aren't current capabilities with Meta AI, but is something that's gonna be on the horizon, I assume so,

if they wrote a whole section in their research paper about vision capabilities and voice capabilities, which I think is pretty interesting. And they also go and break down the type of data that they use, which I thought was interesting as well. This is probably one of the best research papers I have read in my, I guess, research reading experience. And I've had to read a couple hundred research papers throughout my master's and

just in general from different projects I've been trying to do in my free time. So I'm not by any means a research paper critic. I like to read them because I think they're very interesting. But some of them can be a lot of information without a lot of substance. This research paper, herding the llamas, was just jam-packed with quality information.

Dalton Anderson (02:23.184)
I mean, it's, I think, 93 or 92 pages long. But the information, everything is just so high value and quite interesting, of course. And overall I was surprised how detailed they were in breaking down what data they put in, what percentage of their data, the exact issues that they had and the exact solutions they came up with, their training recipe,

their parameter tuning recipe. They lay it all out for you, which is awesome and amazing that they're sharing all this information. Also, before we dive in, I'll do the host intro in one second, but I am kind of sick. So my voice is a little different than it normally sounds. So if you're just tuning in, you're like, why does he sound like he's about to cry this whole episode? Like, who's bullying him? It's not that.

I'm just a little sick. So that's why my voice sounds like that. I promise I don't sound like this all the time, but it's quite funny. Anyways, before we dive in, I'm your host, Dalton Anderson. My background is a bit of a mix of programming, insurance, and data science. Offline, you can find me building my side business, reading a book, or doing something outside like running. You can find the podcast on YouTube if

audio and video is your thing. YouTube is the best place. If you're more of an audio fanatic, you can find the podcast on Spotify, Apple Podcasts, and of course YouTube podcasts, or wherever else you get your podcasts. So as discussed, we're going to break down the research paper in the important sections that they notated. So the first thing that we're going to be talking about in this episode is the model safety, and that entails the red teaming,

prompt attacks, the different tools they made. And then we're going to talk about the inference section. And then we're going to talk about the vision experiments. And then we're going to talk about the speech experiments. And then I'm going to talk about the conclusion section, which I didn't go into too much detail on. I just summarized their point, and it was pretty straightforward what they were trying to say. Okay. So red teaming.

Dalton Anderson (04:47.572)
And if you're not familiar with red teaming, red teaming is like a way of ethically hacking these models. So you're intentionally trying to break the model and get it to do bad things, but you're not trying to do actual bad things. You're trying to find a way to break the model and then harden your defenses. So Meta had independent red teams, which were

not part of Meta, like a third party that was vetted, and they kind of validate Meta's findings. But then Meta had an internal red team, which worked on different things. And one of the biggest things that they tried to do is, one, see if there's any what's called lift. So it's uplift testing. So uplift testing entails:

hey, if you use this model, do you have increased capabilities as an adversary? So if you're a hacker, if you're trying to make chemical weapons, nuclear weapons, explosives, those kinds of things that are classified as weapons. No, not necessarily all of those are weapons of mass destruction, but those are weapons classified by the world as things that can cause many casualties,

but not classified as a weapon of mass destruction, but definitely a weapon of destruction.

So the red teams try to break the model and force Llama, or any of the Llama models, to go against the protections that Meta put into place. And so there's a couple of things that they do and a couple of techniques. I think it's good to talk about it. And I'm familiar with some of them, and you'll see some on social media where, like, the model's doing something that it shouldn't be doing.

Dalton Anderson (06:49.572)
And you'll see it on social media or a media company report, you know, these models aren't safe, it's doing these things, blah, blah. But what they don't include is how they got the model to do that thing. So there's a couple of different approaches. There is a multi-turn approach, which is like, say I tell you to do something and you say no. And then I'm like, okay, and then I ask you a different way and you say no. And I just keep asking you until

eventually you get past the refusal, what's called refusal suppression, where you're temporarily subdued into my request. And that works sometimes. What's better, I think, if you're trying to do these things, is a hypothetical. And you can kind of get the same vibe when you are asking someone, like, hypothetically, how could you do this?

And then they put on their, like, villain hat. And they're all in and they're like, yeah, I would do this and that, and I would X, Y, Z and blah, blah, blah. And you're like, hey, hold on, this is hypothetical, man. This is way too detailed. Sounds like you've been thinking about this for a while. Something like that. So a hypothetical, like, hypothetically, how would you...

A hypothetical situation is like one of those trolley situations: okay, if there's a child that needs to get pushed off the train, like the trolley analogy, pushed off the train to change the tracks, or everyone on the train dies, would you push the kid off? With one of those, normally the model would say, hey,

I don't think taking any human life is appropriate, and so I don't want to answer. But if you keep pushing these hypotheticals... What is really good is to do a multi-leveled prompt attack with many different

Dalton Anderson (09:09.264)
prompt types, like... prompt types, maybe that's the right word for it. So if you had a multi-turn refusal, and then you added in these multi-turn prompt requests, a multi-turn attack with these hypotheticals, and then you added a persona on there, including role play, and then you gradually escalate the violations, the requests for violations, before you know it, you've broken the model

and you've jailbroken it, unintentionally or intentionally jailbroken the model.

But back to our situation, like the model originally would say, hey, I don't believe in taking any life, so I won't push the kid off, I won't do anything. That's not my role. My role is to be open and connect the world to a vast amount of information and be useful or something like that. Whatever the slogan is from that company. And then if you escalate it, keep escalating and then

And then you add in some additional hypotheticals and add in the persona and be like, you know, role play as a human in this situation, what would you do? And eventually it'll be like, yeah, I'd push the kid off, yeah, the kid dies, or something like that. And, it was kind of comical, I actually got reprimanded by my student peers in one of my master's classes

when we had that same trolley problem, where they're like, yeah, if you don't, you know, push this old lady off the train, 20,000 people die. And it was one of those things where it was a case study, a case study class, so you really had a side that you had to pick. And so everyone had to pick a side. And so it's like, you know, if you're on the side of

Dalton Anderson (11:13.072)
pushing the person off to save the 20,000 people, or you're on the side of, like, I think that's wrong, I wouldn't do that. I was the only one who picked to push the person off the train. Everyone else was like, I'll just let everyone die. And the whole class was upset with me. And I was like, what, why are you guys upset? You're literally killing 20,000 people for one person's life. Like, if it was my choice, if I had to do that,

then I would do so.

but I'd rather exchange my own if that is an option. But in that prompt that they're giving you, in that situation, you can't exchange yourself. But yeah, I just think that's a rough thing to say, like one person's life is worth 20,000 people's lives. So anyways, not getting too deep in this hypothetical situation and becoming a philosophical topic. But basically, if you

want to manipulate the model to do things that you shouldn't be doing, then you add in these kinds of different prompt attacks all into one. And the model kind of just slowly breaks down. But one thing I didn't think about and didn't realize was multilingual attacks. So if you combine that plus multilingual, then it's easier to break in. So what they

found, what Meta found from their red team testing, was that if you added in these different-level attacks, you're, not fairly successful, but more successful than if you added in only one attack. If you only did multi-turn, if you only did these hypotheticals, you'd only have a certain level of success. And once the classifier caught on, you really would have a hard time breaking through with one request. But one thing that they did see is, okay,

Dalton Anderson (13:15.92)
if you did multiple different types of prompt attacks, then you did have some success. But one thing they also saw was if it wasn't in the English language, it was a lot easier to break in. So they became very stringent on the other languages that they were utilizing for the model and what you could use as inputs, so what's being offered. And then what they found out was that if you combine these multi-prompt attacks with

multilingual, yeah, then it becomes even easier. So then they rebuilt their Llama Guard, Llama Guard 3. So they had Llama Guard 2, and their Llama Guard 3 is built off the same kind of approach as their Llama Guard 2: it was built off their 8 billion parameter model, and then they condensed it into like 500 million parameters, and they turned it into a classifier

of dangerous information. And so that dangerous information, it was trained on 13 hazard categories. And it is from an AI safety taxonomy, as of 2024. And so it's child sexual exploitation, defamation, elections, hate,

indiscriminate weapons, intellectual property, nonviolent crimes, privacy, sex-related crimes, sexual content,

specialized advice, suicide and self-harm, violence, sorry, violent crimes. And that's it. And so on top of that, in addition to those, they also trained for code interpreter abuse, so like making malicious code. And they also trained on

Dalton Anderson (15:20.388)
like violent images and stuff, which we would see on Meta all the time. And so for a long time, Meta had to pay people to manually take down some of these videos, like of the shootings of the mutilations of people. And so these manual, what is it? What are they?

I don't know, these manual monitors of the platform had deteriorating mental health, of course, after seeing that like 10 hours a day, eight hours a day, for six months. It definitely takes a toll on you. And so for these kinds of things, Meta, I'm assuming, is deploying this as a defense mechanism in conjunction with their already state-of-the-art image classifier

that they already have. So they're pretty familiar with it. And I think that from the results that they showed, it was pretty dang good at it. It was on par with the best, and it was one of the best. It was better than Google's and it was better than GPT-4 Turbo, and they're comparing Gemini 1.5 Pro and I think

Gemini 1.0 Ultra, and then obviously GPT-4. Okay. So that's Llama Guard. And Llama Guard is on the output, right? So they're protecting the output, right? So they've got this model. The model has an input. And then the model processes all the text. And then if it wants to output something that

is potentially dangerous, it's first classified by this classifier. And then if it's not okay, it's fed back through the model like, hey, this isn't okay, we can't send this. It's fed back through,

Dalton Anderson (17:29.858)
and then it will come back to you with a hopefully not dangerous result. But what about dangerous prompts? Like, okay, so what about fixing it before you even get to the problem, preventing the problem from even happening? So Meta came out with Prompt Guard and Code Shield. And those are the two things that they trained a classifier on. They kind of, they did,

sorry, they did training on that malicious code. And then they also trained on, you know, malicious prompts, obviously with the red teaming, with the classifier. They added that in on the front end. So they made Prompt Guard. So Prompt Guard prevents prompts that are potentially dangerous. And so it's the same kind of concept. It's a classifier. This one's a multi-language classifier.

And it looks for different things. The first thing is direct jailbreaks, which is an attempt to deliberately override the model's safety. And then they have

Dalton Anderson (18:45.474)
injection prompts, which would be like third-party data that is put into the context window and is executed by the user, like the user executes these commands for the LLM. So those two are what the classifier is looking for.

And it does fairly well at preventing these inputs from happening. So if you think about the safety architecture of Meta's models, the Llama models, they have Prompt Guard that's on the front part. So Prompt Guard prevents malicious code from being sent to Llama to execute things, and then it prevents malicious prompts from being put through. And then after that, we have the model.

And then after the model, they have an additional classifier that is trained specifically on malicious code, violent data, data related to chemical weapons, nuclear weapons, explosives, those things. And it prevents outputs related to dangerous content like that from being sent to the user. So they have safety on the front end, safety on the back end, which I think is pretty interesting.
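To make that front-end/back-end flow concrete, here is a minimal sketch of the idea. The classifier and model objects are hypothetical stand-ins, not Meta's actual APIs; it just shows a prompt being screened before the model sees it, and the draft output being screened before the user sees it.

```python
# A minimal sketch of the Prompt Guard -> model -> Llama Guard flow described
# above. The classes below are placeholders, not Meta's real interfaces.

class StubClassifier:
    """Placeholder safety classifier; a real one is a fine-tuned LLM."""
    def classify(self, *texts: str) -> str:
        return "SAFE"  # pretend everything is benign for this demo

class StubModel:
    """Placeholder chat model."""
    def generate(self, prompt: str) -> str:
        return f"(model response to: {prompt})"

prompt_guard = StubClassifier()   # front end: screens incoming prompts
llama_guard = StubClassifier()    # back end: screens generated outputs
llama_model = StubModel()

def guarded_chat(user_prompt: str) -> str:
    # Front end: block jailbreak attempts and prompt injections before the
    # model ever sees them.
    if prompt_guard.classify(user_prompt) in {"JAILBREAK", "INJECTION"}:
        return "Sorry, I can't help with that request."

    draft = llama_model.generate(user_prompt)

    # Back end: check the draft output against the hazard taxonomy
    # (violent crimes, indiscriminate weapons, child exploitation, etc.).
    if llama_guard.classify(user_prompt, draft) == "UNSAFE":
        return "Sorry, I can't share that."

    return draft

print(guarded_chat("How do I plan a birthday party?"))
```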

And I liked how they broke it down in the paper. I didn't go into as much detail as I could have, but if you want, there's plenty of other YouTube videos that you could look at. I know that there's, I have to find the link and I'll send it in the show notes, but there's a guy, he does a whole bunch of, academia, oh my gosh, he, I think he works in academia, and he does a whole bunch of

research paper reviews. And so he did a Llama review of the model. And then he also did a review of the herding the llamas paper. And his breakdown, I think, is like five hours. So I don't want to... If I didn't cut it down the way I have, then it just would be crazy long. And I can't do that, obviously. I could, but I don't want to. But

Dalton Anderson (21:04.848)
You get the point: it's condensed for a reason, and there's other resources that I'll share if you want a more elaborate explanation of those things that I was talking about. OK, so then we're moving over to the next section, which is inference. And so they talked about pipeline parallelism. And so one of the issues that they had with the large 405 billion parameter model was that

the model was too big to run on a single machine with eight NVIDIA H100s. I looked up how much memory an H100 has, and it's 80 gigs. So they're talking about this model doesn't fit in 640 gigs of this memory. And so what they did is they

opened up the

inference to be parallel. So they use parallelism. And that's basically like you're running these tasks in parallel on different machines, and it's big in databases. And it's kind of like, okay, you have all these servers, and these servers have this large task. Okay, so say that you have, I mean, it's kind of cool that we could think about this analogy of mowing a lawn. So say

you have to mow a lawn, right? And you have three lawn mowers. Okay, so think about those as like your servers. And your workload is the lawn. Okay. So you could either have one person mow the lawn the whole time, and it would take, say, three hours. Easy math. If you had two people mow the lawn, it would take

Dalton Anderson (23:02.704)
two hours. If you had three people mow the lawn, it would take, say... there's a certain point where you lose, like, you can't break things up into crazy, like, atomic tasks and think it's going to run fast. There's a certain point where the gains drop off. Say that two workers takes two hours, three workers takes an hour forty. Okay. So in this scenario, it might make sense

to run on two servers or two lawnmowers. And then the third lawnmower guy, woman, whatever, whoever, robot, they should work on a different lawn. In this case, a different task. And so that's what they're talking about. It's like they couldn't fit everything on one machine. So they split it up into many machines. And then they enabled this

Dalton Anderson (24:04.602)
Sorry, I have to cough. I'm quite sick, unfortunately.

Dalton Anderson (24:13.348)
So they enabled this pipeline parallelism, where they run inference through the model with micro-batching, with many different machines concurrently across all of the processing stages. And what I kind of talked about two episodes ago was where they were running like mini tasks, like mini mini tasks, on all these different

servers, per se, like these GPUs. And then they are, how do I say, they are checkpointing the model. So they're saving the model at X, they're saving the model at Y, and they're saving the parameters and the data that's fed through at many different points, just in case things blow up. And if it blows up, it isn't as big of a deal.

And another thing is when you're doing this micro-batching and you're having this pipeline parallelism, you're not putting all your eggs in one basket, like in that analogy of lawn mowing. If you only had one person mowing that lawn for that workload, and that workload breaks or that lawn mower breaks or something happens with the code, there's a bug, the whole thing is blown up. So you have to fix the lawn mower, you've got to fix the yard, whatever.

So the whole thing is a wash. But if you have it split into many different pieces, in this case two lawnmowers, the first lawnmower could get some work done and then break, and the second lawnmower could pick up the workload. And so it allows better throughput and latency, but...

it has some, you know, some trade-offs, like the actual latency of the responses is longer, but overall pretty decent. And the solution that they came up with was keeping things simple. Like a lot of times in this paper, they talk about just keeping things simple, like life's complicated enough. Just keep it simple and find a solution and don't overcomplicate things.
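To put the lawn-mowing picture in code, here is a toy sketch of micro-batching through pipeline stages. The stage functions and sizes are made up for illustration; in a real setup each stage lives on its own GPU, so while one stage works on micro-batch 1, the previous stage can already start micro-batch 2.

```python
# Toy sketch of pipeline parallelism with micro-batching. The "stages" are
# plain Python functions standing in for chunks of the model.

from typing import Callable, List

def split_into_micro_batches(batch: List[int], size: int) -> List[List[int]]:
    """Cut one big batch into smaller micro-batches."""
    return [batch[i:i + size] for i in range(0, len(batch), size)]

def run_pipeline(batch: List[int], stages: List[Callable], micro_batch_size: int):
    """Feed each micro-batch through the stages one after another.

    Here the data flow is shown sequentially; a real system overlaps the
    micro-batches across devices to keep every stage busy.
    """
    outputs = []
    for micro_batch in split_into_micro_batches(batch, micro_batch_size):
        x = micro_batch
        for stage in stages:
            x = stage(x)          # hand the partial result to the next stage
        outputs.extend(x)
    return outputs

# Two pretend "model stages": double each value, then add one.
stages = [lambda xs: [x * 2 for x in xs], lambda xs: [x + 1 for x in xs]]
print(run_pipeline(list(range(8)), stages, micro_batch_size=2))
```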

Dalton Anderson (26:36.784)
And this goes to the next point of what they did. They researched the FP, so the floating point, of the model. So you can have the model be in different floating points. A floating point, I would think about it in a simple manner, is the precision of the binary representation of the data, right? So floating point 64 would have

one bit for positive and negative, so that's the sign, and then it has 11 bits representing the exponent, base two, and then it has 52 bits representing just different stuff after the decimal.

And okay, so 64 is more precise. It's referenced as double precision. Eight is less than

64, and so it goes eight, 16, 32, 64. Okay. So what does that mean, 64 versus eight? So 64 is more precise data, but the trade-off of that is it's more data, right? So they're already processing trillions of tokens, trillions of tokens.
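As a quick illustration of that sign/exponent/fraction split and the precision trade-off, here is a small NumPy snippet. Standard NumPy only goes down to float16 (the FP8 formats used for LLM inference live in ML frameworks), but the pattern is the same: fewer bits, coarser stored values.

```python
# How much precision each floating point width keeps, using NumPy.

import numpy as np

value = 1 / 3  # a number with no exact binary representation

for dtype in (np.float64, np.float32, np.float16):
    info = np.finfo(dtype)
    print(
        f"{dtype.__name__:>8}: bits={info.bits:2d}, "
        f"fraction bits={info.nmant:2d}, "
        f"stored value={dtype(value)!r}"
    )

# float64: 64 bits = 1 sign + 11 exponent + 52 fraction ("double precision")
# float32: 32 bits = 1 sign +  8 exponent + 23 fraction
# float16: 16 bits = 1 sign +  5 exponent + 10 fraction
```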

And they also are running into throughput issues, so that's why they found a solution with parallelism, and then they're looking for another way to increase throughput. And so they did a study, and it's kind of a question that was posed by many researchers: like, hey, does it matter, how much does it matter if you reduce your model to

Dalton Anderson (28:33.744)
floating point eight or 16 or 32 or 64, how much of a difference really is it? So they did a study in this paper, and basically their results were that, hey, it's not really that big of a difference. Like, there is a difference for sure, but it's not an overly material difference between, like, an eight and a 16 versus a 64. The 64 is better. But what is important is, okay, if you were looking to download this model and you only had enough space to do, say,

the 8 billion at floating point 64, or the quantized model, the bigger one at floating point eight, it's better to do the bigger model, the one that has more parameters, regardless of what floating point it's at. Because even if the model has less precise data,

they still have these floating point weights that are included that can kind of upscale the floating points, and then obviously the model is more advanced at that point with having more parameters. Okay, and so what they got from that was, hey, it's not that big of a deal. That was the first part. B, they had throughput:

Dalton Anderson (30:25.232)
They improved throughput efficiency in the decoding stage by 50%. So,

Dalton Anderson (30:34.584)
low-precision inference makes large language models more accessible and practical for the real world. And they said, like, hey, it's not that big of a deal. It is non-negligible for a number of workflows. And it can lead to less errors and,

there's... overall, I mean, not less errors, but it leads to the ability to run larger models on fewer resources. There are some issues with certain data types, like dates sometimes have issues getting processed by such a low

floating point model. But other than that, you're good to go. I mean, there are some caveats. It's not perfect, but the gist of it is you do have the ability to run these bigger models at lower floating points. And if you have the choice to run at a lower floating point with a bigger model or a higher floating point with a smaller model, pick the higher

parameter model at the lower floating point.
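As a rough back-of-the-envelope check on why that rule of thumb matters, here is some simple weights-only memory math. This is my own arithmetic using the 8B and 405B sizes and the 80 GB H100 number mentioned earlier, not figures from the paper; real deployments also need memory for activations and the KV cache, so treat these as lower bounds.

```python
# Weights-only memory footprint at different precisions.

BYTES_PER_PARAM = {"fp64": 8, "fp32": 4, "fp16": 2, "fp8": 1}

def weight_memory_gb(num_params: float, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for name, params, precision in [
    ("Llama 3 8B   @ fp64", 8e9, "fp64"),
    ("Llama 3 8B   @ fp16", 8e9, "fp16"),
    ("Llama 3 405B @ fp16", 405e9, "fp16"),
    ("Llama 3 405B @ fp8 ", 405e9, "fp8"),
]:
    print(f"{name}: ~{weight_memory_gb(params, precision):,.0f} GB of weights")

# An 8x H100 node has roughly 8 * 80 = 640 GB of GPU memory, which is why the
# 405B weights at fp16 (~810 GB) don't fit on one machine and fp8 helps.
```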

Man. Okay. So this part of the paper I found super interesting: the video experiments and their audio experiments. And what I did find interesting is that they really broke down the data, and they trained on like ASMR data, which I thought was quite interesting for audio.

Dalton Anderson (32:30.964)
And for video, which we'll talk about right now, they just used a whole bunch of different types of data. They used it in a text prompt format. They would have a text, sorry, a text pair is what it's called: they have an image and then they have the text paired with it. And so, you know, it would be like, all right, a dog jumping for a frisbee,

and then it would have an image of a dog jumping for a Frisbee, or it would be like, putting on a hat, and it'd be a person putting on a hat. And so it's kind of like the alt text of an image, basically. And so what they did was they ran it through their proprietary state-of-the-art image classifier, and then they

used the image classifier to describe the image in that category. And then they ran it through and trained the data with the encoder. And so they have an encoder, they have a processor, and then they have an output, right? But the data was trained on both video and image data. The majority of the video data, which I thought was interesting, was shorter than

one minute. It said, let me see it, the average duration was 21 seconds, the median was 16 seconds, with over 99% of videos being under one minute. And they had a varying degree of resolutions that went from 320p up to 4K, but the majority of the videos were around 720p, and they had varying aspect ratios

between 1:2, 2:1, and 1:1. The median was 1:1.
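Just to picture what those pairs and clips look like as data, here is a hypothetical example of an image-text pair and a short-video record. The field names are invented for illustration, not Meta's actual schema; the numbers echo the stats above.

```python
# Hypothetical records illustrating "alt text"-style image-text pairs and
# the short-video data described in the episode.

image_pair = {
    "image_path": "images/dog_frisbee.jpg",
    "caption": "a dog jumping to catch a frisbee",
}

video_record = {
    "video_path": "videos/putting_on_hat.mp4",
    "caption": "a person putting on a hat",
    "duration_seconds": 21,   # ~21s average, ~16s median per the episode
    "resolution": "720p",     # most videos around 720p, ranging 320p to 4K
    "aspect_ratio": "1:1",    # median aspect ratio
}
```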

Dalton Anderson (34:29.028)
Very interesting that the majority of the videos they're training the model on are under a minute.

For me, that tells me, and it's kind of a shot in the dark here, it's like they're preparing for a change to Instagram reels. So Instagram reels are all under a minute. You can't post a reel... or I think they recently changed it, but the majority of the time, the reels are under a minute. And I think you can post a minute 30 now, with the recent update that they did.

So I stand corrected. You can post reels of more than a minute, but the majority of reels are under one minute. And then you can post over a minute, I think up to a minute 30, if I remember correctly.

Let me look that up.

Let's look that up. How long can a reel...

Dalton Anderson (35:32.805)
be.

What does it say? Okay, so they say 90 seconds. Okay. I was really hoping that it was going to be a minute and a half and hopefully not something like 15 minutes, cause I know I only do these in one take, so I'd look like an idiot. So I'm glad that Instagram stayed true to the word, and true to my word, actually. Okay. So they trained on videos less than a minute,

majority of their videos. And so I assume that majority of their videos are from Facebook and Instagram. Obviously not WhatsApp because they don't have that data. But that's where I assume majority of the videos are coming from. I don't know if they'd have enough and the videos that they chose to supplement with were probably like TikTok videos or YouTube shorts. Cause there's a lot of videos out there.

but they intentionally made all their videos under a minute for their data set. So I just wonder what they're planning on doing with it. I know that you can ask Meta to search for YouTube, not YouTube, sorry, reels, Meta reels, on Meta.ai. You can ask Meta to find reels of places you want to go to, to explore,

or find products or find, you know, shopping ideas, or just different things. Instead of having prompts tell you, hey, like, I have this idea and, you know, can you tell me how to do it, you could also ask it like, hey, can you find videos, like actual reels of people doing these things, so I can see how to do it? And then it's like, yeah, sure, and it'll pull up a whole bunch of videos. I think it was like a carousel of like six reels. And then you could ask for more.

Dalton Anderson (37:31.12)
But I think after you ask for more twice, it makes you make another prompt asking for more. So you could ask for more, click show more, show more, and then it's like, I won't show you any more. And then you just say, hey, I want to view more reels of this topic, and then it'll do it for you again.

Dalton Anderson (37:54.106)
But yeah, so they have the encoder, they've got the image adapter, and they've got the video adapter, and then they kind of, you know, let everything flow. But you gotta think about it, so they've gotta encode the image, they've got to...

Like, under, basically, encoding: you've got to encode the image and, like, turn it into, how do I say, they've got to turn the image into, the easiest way to say it is, these tokens, which are a way for the model to understand mathematically what's going on. And so they encode the image, like the pixels and stuff. And then they

run it through, right? They run it through, and then they run it through the LLM, which is, like, they call it in the research paper, they call it the language backbone. And so the language backbone would process what the image encoder, or not the encoder, but the adapter, gave to the LLM.
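Here is a minimal, hypothetical sketch of that encoder-to-adapter-to-language-backbone hookup, written in PyTorch. It is not Meta's implementation; it just shows the general idea of how a cross-attention adapter lets the text hidden states attend to image tokens.

```python
# Toy cross-attention adapter between image tokens and a language backbone.

import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    def __init__(self, text_dim: int = 512, image_dim: int = 256, heads: int = 8):
        super().__init__()
        self.project = nn.Linear(image_dim, text_dim)  # map image tokens into the text space
        self.cross_attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)

    def forward(self, text_states: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        img = self.project(image_tokens)
        # Text hidden states query the image tokens (keys/values).
        attended, _ = self.cross_attn(query=text_states, key=img, value=img)
        return text_states + attended  # residual: text enriched with visual info

# Toy shapes: 1 sample, 32 text tokens of width 512, 2048 image tokens of width 256.
text_states = torch.randn(1, 32, 512)
image_tokens = torch.randn(1, 2048, 256)   # an image is worth ~2,000 tokens here
fused = CrossAttentionAdapter()(text_states, image_tokens)
print(fused.shape)  # torch.Size([1, 32, 512])
```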

Dalton Anderson (39:05.76)
One issue that they had was, okay, it was quite time-consuming for the model to process, not the image encoder, but images and videos with this architecture. So they changed it to address issues. So one issue they had was like the model scaling. So you could prompt, like, all right, so a text prompt

has on average, they said, like a hundred, 192 tokens, around 200 tokens. They said on average an image had about 2,000 tokens. And so the issue they had was, okay, to process one image versus one token or one text prompt, like, there's such a large difference in data associated with an image than there is with text. So

what was happening was, like, the cross-attention layers, which I discussed two episodes ago, what the model focuses on would basically be the image, but then it was kind of not paying attention to the text. And so what they did was they added sequence parallelization into the image encoder so that,

I would say, each GPU processes the same amount of tokens. So what they had, back to our lawn mowing problem, is when an image is sent and when a text is sent. When, I really just said that an image is about 2,300 tokens, right? On average, that is what Meta said from their research and their data set. So if we had this lawn thing,

we had one big lawn, like a massive lawn, and then we had, I don't know, some townhome lawn that, you know, you just wanted to mow the front of or something like that. And so that wasn't that much work. What was happening was we were only setting one lawnmower to mow the big lawn and then one lawnmower to mow the small lawn, when really the small lawn could be done with clippers, like

Dalton Anderson (41:32.344)
some like hand clippers, like some big ones, I don't know. And then the big lawn could be done by three different lawn mowing people. So we could free up

Dalton Anderson (41:45.29)
resources to work on... not really free up, like, we would split the workload of the available resources so one resource isn't tied up for a long time. So what they did was they sped that up; that sequence parallelization sped up the process of encoding the image, which is like the most computational piece of it. And then they were able to move past to the next part of the architecture,

where they attach on to the large language model.

Dalton Anderson (42:20.496)
If that makes sense, I'm not sure. I thought that was pretty interesting. I don't know if I'm explaining it in a very simple way, but basically instead of putting all the work onto one person, they split the work amongst other people. And in this case, GPUs, the GPUs split the work evenly. Hopefully they get done at the same time and it's overall faster. That, I guess that makes sense. I don't know. It's just me here, so.
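Here is a toy sketch of that "split the tokens evenly across GPUs" idea in plain Python. It only illustrates the even split, assuming the roughly 2,300 image tokens and 200 text tokens mentioned above; real sequence parallelism shards tensors across actual devices.

```python
# Evenly sharding a token sequence across a fixed number of GPUs.

def shard_sequence(tokens: list, num_gpus: int) -> list:
    """Give each GPU a roughly equal slice of the token sequence."""
    per_gpu = (len(tokens) + num_gpus - 1) // num_gpus  # ceiling division
    return [tokens[i * per_gpu:(i + 1) * per_gpu] for i in range(num_gpus)]

image_tokens = list(range(2300))   # ~2,300 tokens for one image, per the episode
text_tokens = list(range(200))     # ~200 tokens for a text prompt

for name, seq in [("image", image_tokens), ("text", text_tokens)]:
    shards = shard_sequence(seq, num_gpus=4)
    print(name, [len(s) for s in shards])
# image [575, 575, 575, 575]  -> no single GPU stuck mowing the whole big lawn
# text  [50, 50, 50, 50]
```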

Sometimes it's difficult to know. Okay, so voice experiments. And so they did some voice experiments, which I thought were pretty cool. One of the things that I thought was most interesting, I mentioned it earlier, was the ASMR training data. They said that they had 230,000 hours of manually transcribed speech recordings that span across 34 languages of ASMR data.

And then they had some AST data.

90K hours. So they've got quite a bit of audio data. I'm not sure where they get all that from. Maybe they just get it from their, I guess, their Facebook offering, with the typically longer videos. I'm not sure. But I thought it was interesting that they use ASMR. So when I think of ASMR, and I might be thinking about it wrong, you might be like, you don't even know what you're talking about.

I think of like ASMR videos where like they're eating or like they're doing something weird. Like they're, I don't know, sipping tea or they're opening a box or something like that. ASMR. I'm, that's a face slap moment. That's how you know I'm sick. So it's ASR, not ASMR. So ASR versus ASMR.

Dalton Anderson (44:23.952)
Oh my gosh. Oh my gosh. That's crazy.

Dalton Anderson (44:35.258)
Meaning?

Dalton Anderson (44:44.453)
Okay, ASR versus ASMR. What is ASR?

Dalton Anderson (44:55.982)
audio. I'm having a hard time getting the answer.

automatic speech recognition. Okay, so it's like

text that was turned into audio or audio, audio, I guess.

Dalton Anderson (45:19.959)
Speech to text.

So it wasn't... actually, I'm kind of confused now. Okay, so let's write this down. So they use the ASR data, which is voices, or voice turned into text. So that's why they have so much transcription. There's so much transcription data of the actual voice data, because they were

Dalton Anderson (45:53.824)
able to map up the voice with the text. But it makes me question the voice, like, is it just a robot that does the voice for a long time, and then...

Dalton Anderson (46:15.842)
I don't know. I'm not sure. I'm all jacked up on this. Sorry. I read that originally as ASMR and I wrote that in my notes as ASMR. And I thought it was so weird that they were using ASMR data. That makes so much more sense that they're not.

But ASR is for the purpose of converting audio data, like voice calls, voice searches on your phone, and podcasts, into a format computers can understand, often readable text. So it's basically voice data encoded, and that's automatic speech recognition, or voice recognition. So it has this training data of

voice data that is transcribed, but they're saying manually transcribed. And then they have additional training data with 90,000 hours of data. They are cooking with all this voice data. I'm not sure where they got it all from.
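For a concrete picture of what ASR does, here is a small sketch using the Hugging Face transformers pipeline with an off-the-shelf Whisper checkpoint as a stand-in. This is not Meta's speech model, and the audio path is a placeholder; it just shows the audio-in, transcript-out mapping, which, reversed, is what an ASR training pair looks like.

```python
# Off-the-shelf automatic speech recognition: audio file in, transcript out.

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# "speech_sample.wav" is a placeholder path to any local audio file.
result = asr("speech_sample.wav")
print(result["text"])  # e.g. "welcome to the venture step podcast ..."

# An ASR training example is essentially the reverse pairing: an audio clip
# plus its (manually) written transcript, e.g.
# {"audio": "clip_0001.wav", "transcript": "hola, ¿cómo estás?", "language": "es"}
```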

Dalton Anderson (47:32.366)
Well, basically they said that the model is multilingual. It can do speech recognition for multiple languages. It has translation. And so they're saying that it has good potential to break down language barriers among people communicating across cultures, which I think is interesting. And basically, they had similar results with their video and

audio stuff... not audio, video and imaging, images, video and image data. They had similar feelings, like, hey, it's pretty good.

It has a good ability to, like, understand and provide, you know, reasonable feedback with the images and videos that you provide. One of the issues with the image and video piece of the research experiments, and the speech experiments, was that it performed badly, or I guess poorly, not badly.

It performed poorly against... not Meta, I mean Meta performed poorly against Google and OpenAI with their safety. So it was easier to manipulate the model to get it to do things it shouldn't be doing. They jailbroke the model with image and video and audio. So they are still

evaluating what they think the best approach should be.

Dalton Anderson (49:22.956)
And I think that's one of the reasons why it hasn't been released yet. And it also might be released in conjunction with a new product release. Like, they might launch new Meta glasses partnered with Ray-Ban, and they might go, like, now you can have video and voice recognition and all the speech stuff, and it's processed at super fast speeds, low latency.

And it's almost like Gemini Live or ChatGPT's live mode, but in your glasses, which I think would be pretty cool, and I would definitely check it out. Okay. So transitioning to, like, the conclusion, how things should be, right? And so their real suggestion with this approach and the model and the things that they learned was, okay, to develop

a high-quality foundational model, there's still a lot of discovery that needs to be done. Like, this isn't a fully baked sector; there are a lot of practices, or said-so practices, that might not necessarily be true. So what they did is they did a lot of discovery and experimentation. But while they were doing that, they focused on high-quality data,

and they focused on scaling their processes and keeping it simple. That's it. That's really the feedback. So when you're doing something complicated in your life and you're trying to have a high-quality effort or input on the situation, and you want to scale that to your team or

other people in your life or just general situations or your business: keep it simple. And we hear that all the time, but for something so complicated...

Dalton Anderson (51:36.27)
The best approach is normally the one that is easy to implement and yields, it might not yield the best outcome, but for outcome to experimentation, to implementation, to production, that choice is probably the best one.

And what Meta did is they laid it all out there, like, does the quantization of models matter that much for the quality of your model? No, it doesn't. And so they're just kind of breaking through each of these barriers. Like, okay, does this actually matter? Or is this approach better than that approach? And they did it in a scientific manner at a small scale. And then they did their analysis and were like, hey, the simple approach works.

Like, it works better than the other one. Let's just do the simple one. And so, for this complex problem, they're chewing away at it little by little with simplicity. Simplicity is the solvent for

complexity. Okay. Well, that was the podcast, and I hope that you enjoyed it and learned some. I definitely learned a lot. I enjoyed reading the paper and creating this content for everyone. If you liked the video, like or subscribe, either or, or give a follow depending on what platform you're on. Really appreciate it. And of course, as always, I'll see you next week.

And wherever you are in this world, whether it be a good morning, good afternoon, or good evening, I hope you have a great day. Talk to you soon. Bye.