Emad
swyx backup pod
SPEAKER 1 0:00:00
I want to send a huge thanks to our friends at AWS for their continued support of the podcast and their sponsorship of our re:Invent 2022 series. You know AWS is a cloud computing leader, but did you realize the company offers a broad array of services and infrastructure at all three layers of the machine learning technology stack? In fact, tens of thousands of customers trust AWS for machine learning and AI services. And the company aims to put ML in the hands of every practitioner with innovative services like Amazon CodeWhisperer, a new ML-powered pair programming tool that helps developers improve productivity by significantly reducing the time to build software applications. To learn more about AWS ML and AI services and how they're helping customers accelerate their machine learning journeys, visit twimlai.com slash go slash AWS ML. All right, everyone, this is Sam Charrington, host of the TWIML AI Podcast. And today I'm coming to you live from the Future Frequency podcast studio at the AWS re:Invent conference here in Las Vegas. And I am joined by Emad Mostaque. Emad is founder and CEO of Stability AI. If this is the first episode of our re:Invent series that you are listening to, don't try adjusting your audio settings. It's definitely me. After a few days here at re:Invent in the dry desert here in Nevada, my voice is on its last legs, but I think we'll make it through this. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. And if you want to check us out in studio, you can bounce over to YouTube for the interview.
SPEAKER 2 0:01:42
Emad, welcome to the podcast. Thanks so much for having me. Super excited to talk to you.
SPEAKER 1 0:01:46
You are of course the founder and CEO of Stability. Stability is the company behind Stable Diffusion, which is a multimodal model that has been getting a lot of fanfare, I think. Welcome. And I'd love to jump in by having you share a little bit about your background.
SPEAKER 2 0:02:02
Yeah, no, I think it's been super interesting. I think Stable Diffusion is kind of a specific text-to-image model. As for me, let's say I started off in computer science at uni,
SPEAKER 1 0:02:10
enterprise developer, and then became a hedge fund manager and one of the largest video game investors in the world and then artificial intelligence. And I was doing that, it was a
SPEAKER 2 0:02:18
lot of fun. And then my son was diagnosed with autism and they said there was no cure or treatment.
SPEAKER 1 0:02:23
So I quit, switched to advising hedge funds and built an AI team to do literature review, all the autism literature, and then biomolecular pathway analysis of neurotransmitters to repurpose drugs to help him out. And it kind of worked. He went to mainstream school and was super happy. That's awesome.
SPEAKER 2 0:02:39
It was kind of cool. Good trade, good trade. Then I went back to the hedge fund world, won some awards. It's boring. Then decided to make the world a better place. So first off,
SPEAKER 1 0:02:47
won the Global Learning XPRIZE. That was a $15 million prize from Elon Musk and Tony Robbins for the first app to teach kids literacy and numeracy without internet. My co-founder and I
SPEAKER 2 0:02:55
have been deploying that around the world. And now we're teaching kids in refugee camps,
SPEAKER 1 0:02:59
literacy and numeracy in 13 months at one hour a day. And we're about to AI the crap out of that.
SPEAKER 2 0:03:04
In 2020-21, I designed and led one of the United Nations AI initiatives against
SPEAKER 1 0:03:10
COVID-19, CAIAC, Collective and Augmented Intelligence Against COVID-19, launched at Stanford, backed by the WHO, UNESCO and the World Bank. And that was really interesting because we were trying
SPEAKER 2 0:03:20
to make the world's knowledge free on COVID-19 with CORD-19. So there's a 500,000-paper dataset,
SPEAKER 1 0:03:26
freely available to everyone. And then use AI to organize it because it's really confusing. During that, lots and lots of interesting tech kind of came through, but I realized
SPEAKER 2 0:03:37
these foundation models are super powerful. You can't have them controlled by any one company. It's bad business and it's not the correct thing ethically. So I thought, let's widen this and create open source foundation models for everyone, because I think it can really advance humanity.
SPEAKER 1 0:03:50
And again, I think it'll be great to see these things proliferate. So we can have an open discussion about it and also have the value created from just these brand new experiences. That's awesome. And when did you get started down that part of the journey?
SPEAKER 2 0:04:03
About two years ago. Stability has been going for about 13 months now.
SPEAKER 1 0:04:07
Yeah. When I think about it, a lot of Stable Diffusion goes back to this latent diffusion paper, which was not even a year ago.
SPEAKER 2 0:04:13
It's not even a year ago. I think the whole thing kind of kicked off with CLIP, released by OpenAI in January of last year. So I actually had COVID during that time, while doing my COVID thing. My daughter came to me and said, dad, you know all that stuff you do, taking all that knowledge and squishing it down to make it useful for everyone. Can you do that with images? It's like, well, we can. So I built a system for her based on VQGAN and CLIP, so an image-generating model, and then CLIP is an image-to-text model. She created like a vision board of everything she wanted, a description of what she wanted to make. And it generated 16 different images. And then she said how each one of those was different, and it changed the latents. And then it generated another 16, another 16, another 16. And then eight hours
SPEAKER 1 0:04:49
later, she made an image that she went on to sell as an NFT for $3,500. Wow. And donated the proceeds to India COVID relief. Okay. I thought it was awesome. She's seven years old. Wow. And then
SPEAKER 2 0:05:01
I was like, this is transformative technology. Image is where it's at. Language, we're already at 85%; we're going to get to 95%. Image, we're at 10%. We are a visual species. Like the easiest
SPEAKER 1 0:05:11
way for us to communicate is what we're doing right now. We're having a nice chat. Then text is the
SPEAKER 2 0:05:15
next hardest. And image, be it images or PowerPoints, is impossible. Let's make it easy. This tech can do that. So we started funding the entire sector, Google Colab notebooks, models, all these kinds of things. Latent diffusion was done by the CompVis lab at the University of Munich,
SPEAKER 1 0:05:31
who led on the Stable Diffusion one as well. Amazing lab, led by Björn Ommer, and the work led by Robin
SPEAKER 2 0:05:37
Rombach, who is one of our lead developers here at Stability. And then there was work by Katherine Crowson, Rivers Have Wings is her Twitter handle, on CLIP-conditioned models and things like that. And the whole community just came together and built really cool stuff. Then you had entities like Midjourney, where we just gave grants for the beta, that started operationalizing it. It's all come together now in the form of Stable Diffusion, which was released on August 23rd. So that was led by the CompVis lab. And then ourselves at Stability, Runway ML, the EleutherAI
SPEAKER 1 0:06:05
community that we kind of helped run, and LAION, we came together to put out 100,000 gigabytes of image-text pairs, 2 billion images, turned into a 2-gigabyte file that runs natively on your MacBook and can create anything. It's kind of insane. And the speed with which it all came together
SPEAKER 2 0:06:22
is mind-boggling. Yeah, like our model was to have a core team and then outside contributors and partners from academia. And then these communities that we kind of built and accelerated. So like tens of thousands of people, from OpenBioML, who are doing protein folding work, to EleutherAI with language models, to Harmonai with audio. And it turned out that's a really good system, just iterate
SPEAKER 1 0:06:42
and experiment with these things at exactly the right time. But now it's progressed. So like when
SPEAKER 2 0:06:46
we started with Stable Diffusion and launched it in August, it was 5.8 seconds for a generation on an A100.
SPEAKER 1 0:06:51
As of yesterday, 0.86 seconds. As of two weeks from now, it'll be 20 times faster with our new distilled models. So you're getting 24-frames-a-second high-resolution image creation, from basic blobs a year ago. I don't think we've ever seen anything move that fast. And the uptake has been crazy. So I
SPEAKER 2 0:07:09
believe on Monday, the number of GitHub stars for stable diffusion overtook Ethereum and Bitcoin.
SPEAKER 1 0:07:15
It's overtaken Kafka, everything else. I think it'll overtake PyTorch and TensorFlow in like a
SPEAKER 2 0:07:20
month or two. And that's since inception. Like over the last month, I think Mastodon has had
SPEAKER 1 0:07:24
6,000 GitHub stars over the last week. Stable Diffusion 2 has had 6,000. Yeah. And Stable Diffusion 2 was
SPEAKER 2 0:07:31
just released this month, right? It was released a week ago, yeah. Time dilation. So Stable Diffusion 1, we kind of used the LAION dataset to create the image model. And then we used
SPEAKER 1 0:07:39
OpenAI's CLIP ViT-L/14 to kind of condition it. So we combined the text model and the image model.
SPEAKER 2 0:07:45
With Stable Diffusion 2, we instead used something called OpenCLIP, run by kind of the LAION charity, whereby we had an open dataset for both. Because OpenAI did amazing work open sourcing CLIP, but we didn't know what data was inside it. So it learned all these concepts. And we're like, how does it know that? And so when we launched Stable Diffusion, it kind of was a collaboration. We had all these questions about attribution, about what's in the dataset, safe for work, not safe for work. But you can't control that if you don't control half the dataset, right? And half the learning. So Stable Diffusion 2 had that, but it also had a better
SPEAKER 1 0:08:13
text encoder model. So now basically it's heading towards photorealism. You can get photorealistic outputs from it if you prompt it right. Yeah. And again, kind of insane. Like you
SPEAKER 2 0:08:21
just see these things generated in a second. You're like, it can be completely artistic or
SPEAKER 1 0:08:25
completely photorealistic. These people do not exist. This landscape or this interior does not
SPEAKER 2 0:08:31
exist. I don't think we've ever actually seen anything like this, because the majority of
SPEAKER 1 0:08:35
humanity doesn't believe they can visually create. Just like before the Gutenberg press, you couldn't write or read. But now hundreds of thousands of developers, I think we've had like
SPEAKER 2 0:08:46
three hundred and eighty thousand developers sign up or something on Hugging Face, and they're now using this to create ridiculous things. And now it gets to real time. What does that even look like when people can just seamlessly communicate visually? Like we can literally, in a few months, a year,
SPEAKER 1 0:09:01
definitely, this podcast, you could generate a live video of it, almost, of all the topics that we're talking about, which is insane. One of the examples that you like to use is killing PowerPoint. So you've got the text. That's where you usually start. And then you go through this long process to make it pretty or engaging. Aesthetic, right? Yeah. Because you know,
SPEAKER 2 0:09:18
what these models do, like these attention-based models, it's interesting how it came around, actually. So with my son, with his autism, autism is kind of a social interaction disorder.
SPEAKER 1 0:09:27
It's caused, in my opinion, largely by a GABA-glutamate imbalance in the brain. So GABA calms
SPEAKER 2 0:09:32
you down, like when you pop a Valium; glutamate excites you. And obviously in our industry, a lot of people
SPEAKER 1 0:09:36
kind of have people they know on the spectrum, or are a bit there themselves, because it lends itself
SPEAKER 2 0:09:41
sometimes, but there's a dual-edged thing. Because of all that stuff, what happens is that there's
SPEAKER 1 0:09:44
too much glutamate. It's like when you're tapping your leg because there's too much going on in your brain. Imagine it was like that all the time. You couldn't think straight. Yeah. And so you can't
SPEAKER 2 0:09:52
form the connections in your brain of, like, a cup meaning cupping your hands, or a drinking cup, or a World Cup.
SPEAKER 1 0:09:58
That's why there's a lot of cases where they can't communicate properly. Addressing those factors
SPEAKER 2 0:10:02
can calm it down. And then you basically start teaching them. Just like when you have a stroke, a cup means this, a cup means that, a cup means that. And they can start talking or progress. With these attention-based models, you've moved from kind of giant extrapolation of data to paying
SPEAKER 1 0:10:16
attention to the most important parts between words and pixels, which is kind of crazy for the
SPEAKER 2 0:10:21
denoising process of diffusion. The latents that are built up there, where it has all the concepts
SPEAKER 1 0:10:25
of a cup, mean that if you have a cup in a sentence, it understands what that is in that context, a World Cup or a cup in the hands, and then it can do these images, which is kind of insane. So it works like that part of the human brain. I think that's what's so exciting.
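For a rough intuition of that attention step, here is a toy sketch in PyTorch of image patches attending to text tokens; the shapes and names are illustrative, not Stable Diffusion's actual code:

```python
# Toy cross-attention: latent image patches (queries) attend to text
# token embeddings (keys/values). Dimensions are illustrative only.
import torch
import torch.nn.functional as F

d = 64                            # shared attention dimension
patches = torch.randn(1, 256, d)  # 16x16 grid of latent patches
tokens = torch.randn(1, 77, d)    # 77 text token embeddings (CLIP-style)

# Each patch scores every word, then mixes in the relevant text features.
weights = F.softmax(patches @ tokens.transpose(1, 2) / d**0.5, dim=-1)
conditioned = weights @ tokens    # shape (1, 256, 64): text-aware patches
```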
SPEAKER 2 0:10:37
That's what lets you have the compression of knowledge. Like I said, 100,000 gigabytes into two gigabytes. It's like we're
SPEAKER 1 0:10:43
Pied Piper from that Silicon Valley HBO show, right? It doesn't make sense. Yeah. But that's because
SPEAKER 2 0:10:50
100,000 gigabytes, 100 terabytes, was our input data, and the output file's two gigs. Yeah. And it's not optimized yet. We reckon we can get that to 400 megabytes. Oh, wow. A 400 megabyte file that
SPEAKER 1 0:11:02
now works on an iPhone that can generate any image in seconds by description. Yeah. And you can go the other way as well. You can take an image and turn it into text. And that text encoding is only a few lines that can generate a high resolution masterpiece. It's insane. It's nuts. And I think
SPEAKER 2 0:11:18
we were kind of a bit misguided by, not misguided, but the focus was on scale-is-all-you-need, 540 billion parameter, trillion parameter large language models. Yeah. Stable Diffusion is 890 million parameters. And this is kind of pointing to something in the future, because OpenAI took GPT-3, 175 billion parameters, and they instructed it. So reinforcement learning from human feedback, by getting annotators to use it and then seeing which neurons kind of lit up, these kind of latent space things. InstructGPT had equivalent performance. I think they probably use a large
SPEAKER 1 0:11:49
version of that at 1.3 billion parameters because kind of you don't need all the information of the world completely to do stuff. You just need some of it. Image models, though, are surprisingly small.
SPEAKER 2 0:12:03
Like the largest we've seen was the 12 billion parameter ruDALL-E model. But now, like I said, we're 900 million parameters, and we've had great success with our 400 million parameter models. Our 4 billion parameter models are better. Actually, the largest is Parti, which was from
SPEAKER 1 0:12:17
Google at 20 billion. We don't know what an optimal dataset is, what an optimal parameter size is for these particular non-text models. Yeah. Text models themselves, text is quite a dense encoding. I think they'll trend larger. But combining these models is going to be super interesting as we move forward. Yeah. So a lot of your efforts thus far have been on shrinking the model to make the performance better, to make it smaller, faster. Do you see a pull towards larger models, or do you think it's a different paradigm altogether where there's not going to be that kind of drive to make
SPEAKER 2 0:12:47
the model bigger and bigger? I think there'll be a mixture of things. Again, like what we saw with
SPEAKER 1 0:12:51
the DeepMind Chinchilla paper was that the scaling laws weren't necessarily appropriate.
SPEAKER 2 0:12:56
So that showed that a 67 billion parameter model trained on five epochs would outperform 175 billion
SPEAKER 1 0:13:03
parameter model. But actually, what it really showed, if you dig into the details, is that data
SPEAKER 2 0:13:07
is what you need. And what does that data look like? We haven't done the proper data augmentation and other studies. But it's also like, you can think of these models like this: Stable Diffusion 1 was a precocious kindergartner. And we taught it the whole internet, so it occasionally turned
SPEAKER 1 0:13:20
out a little bit off in some of the outputs. Yeah. Stable Diffusion 2, you're getting into grade school now. But still super precocious. We made it safer for work and a whole bunch of other things, deduped the datasets. We're still not feeding it the right information. Once we know what information
SPEAKER 2 0:13:34
to feed it, it will make it even better. And I don't think that trends to larger; I think it trends to more efficient. And I think one of these things is accessibility. Because we optimized Stable Diffusion, kind of as a group and collective, to be available on low-energy devices, not just like
SPEAKER 1 0:13:48
3090s or A100s. You can download it on your MacBook right now, and a MacBook M2, as of today, can generate an image of any type in 18 seconds. In a couple of weeks, it'll be less than a second. So you can have PyTorch, you can have JAX or whatever, and you can just start coding.
SPEAKER 2 0:14:05
And so that opens it up to so many people. It's a new type of programming primitive,
SPEAKER 1 0:14:09
this hash file that can create anything. Can you dive into the connection between programming and
SPEAKER 2 0:14:14
Stable Diffusion? So if you think about it, you're creating an experience with your programming, right? And so if you use the diffusers library from Hugging Face, it's like a couple of lines and you can be using Stable Diffusion in a code base. And again, it can run on your MacBook with no internet. Okay. So what type of experiences can you build when you have this verifiable file, words go in, images come out? It opens up a whole world of possibilities.
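For a sense of what "a couple of lines" means here, a minimal sketch against the diffusers API; the checkpoint name and device are assumptions, so substitute whatever model and hardware you actually have:

```python
# Minimal text-to-image with Hugging Face diffusers (checkpoint assumed).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # or "mps" on an Apple Silicon MacBook

# Words go in, images come out.
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```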
SPEAKER 1 0:14:39
It's like an ultra library, in a way. Like the library condensed into an AI model. And we're not really used to that, like BERT and kind of some of these other things, but nothing that has this massive range, shall we say, like 2 billion images, a snapshot of the internet, compressed down. Yeah. You're kind of thinking more broadly. Like, a lot of the conversation about Stable Diffusion today is about art and the creation part of that process. Thinking more broadly about practical applications, and this is maybe getting into something I wanted to speak about later, just where you see the company going, talk about some of the other things that are disrupted beyond just making pretty pictures, arts and crafts, right? Yeah, I mean, I think arts is like,
SPEAKER 2 0:15:18
we think about it as like, artists never make money, right? Unless they do. Like my seven-year-old daughter, she's obviously one of the OGs now in generative art. I actually asked her, why don't
SPEAKER 1 0:15:27
you make any more art? And she's like, well, dad, there's this thing called supply and demand. If I reduce the supply while you make this whole industry, the price for my stuff will go up. So the value goes up, like paying for your own university. The creative industry is worth
SPEAKER 2 0:15:40
hundreds of billions of dollars a year. Video games, 170 billion, like movies are 80 billion.
SPEAKER 1 0:15:46
This will all be disrupted by this technology. If you think about the creation process, like one of our directors, he was doing a shoot with a famous actress.
SPEAKER 2 0:15:55
It was going to be $113,000, just to fly her out and do all this and get all these other people, just for
SPEAKER 1 0:16:00
three days. He fine-tuned a Stable Diffusion model and did it in three hours, 2,000 shots. Photorealistic. Meaning the entire shoot was generated, as opposed to? Yeah, like all the shots,
SPEAKER 2 0:16:12
because there's going to be a shoot to kind of put her in different things as part of the movie kind of process. So concept artists are using this to become more efficient. There's a group,
SPEAKER 1 0:16:20
Corridor Digital. They created Spider-Man: No Way Home, which is like a two-and-a-half-minute trailer in the Into the Spider-Verse style, by having a Spider-Verse model that they
SPEAKER 2 0:16:30
trained on like 100 images. And you can't tell. It's like, wow, this is amazing animation. No, it isn't. They just interpolated every single frame and used Stable Diffusion to kind of do image-to-image. It's the craziest thing. It would have cost millions of dollars before. They did it in like a
SPEAKER 1 0:16:43
few days. So I think media is going to be the first to be disrupted here because that creation process is hard. And now it's easy. I would think industrial design, for example, wouldn't be too far behind like Autodesk. They spend a lot of time thinking about ways to use
SPEAKER 2 0:16:57
machine learning to help designers. Yeah, they've got amazing kind of datasets. You've got the CAD systems of the world that have every single click on design. It can make all of those easier because the system learns. Like, it's a foundational model in some ways, because it's also like a base foundation that you can then train on your own things. And it learns physics and all sorts of other stuff, which is a bit creepy, but it can learn about that specific type of design that you might want to do.
SPEAKER 1 0:17:21
We're working with car manufacturers right now who want to have custom models based on their entire
SPEAKER 2 0:17:25
back catalog. And then they want to iterate and combine different concepts. And then it automatically stitches together these cars and combines them. We also didn't just release the model. We also released an in-painting model. So you could delete parts of a picture and have seamless edits based
SPEAKER 1 0:17:38
on your text conditioning. You've got an image-to-image model that can redefine it into any style. We have a 4x, soon to be 8x, upscaler. That's like enhance, enhance, enhance on a TV show, you know? And all of these are going sub-second now in terms of the speed of iteration on them.
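As an illustration of the inpainting workflow just described, a sketch using the diffusers inpainting pipeline; the checkpoint name and the image and mask files are assumptions:

```python
# Inpainting sketch: repaint only the masked region of a picture,
# guided by a text prompt (checkpoint and file names assumed).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB")  # original picture
mask = Image.open("mask.png").convert("RGB")    # white = region to repaint

result = pipe(
    prompt="a vintage red convertible parked by the curb",
    image=image,
    mask_image=mask,
).images[0]
result.save("edited.png")
```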
SPEAKER 2 0:17:52
So I think creative is the first, but then, like I said, some of these design kind of things. Then it goes into more visual communication, like I said, slides. If you've got an image model combined with a language model combined with a code model, you never need to make a presentation
SPEAKER 1 0:18:04
again. And it understands what aesthetics are. Like, one of the things we did with the Stable Diffusion
SPEAKER 2 0:18:08
thing is that we created a Discord bot where everyone rated the outputs of Stable Diffusion 1.1. And then we used that to filter down our giant 2 billion image dataset into the most aesthetically
SPEAKER 1 0:18:18
pleasing things, using CLIP conditioning on that. And then we trained on that and it became more
SPEAKER 2 0:18:22
aesthetic and pleasing. A bit weird in some ways, but again, these feedback loops become very, very interesting, because to get the wide range of viability on these image models, language models, audio models, and others, the human-in-the-loop factor is essential, because your
SPEAKER 1 0:18:37
typical training data is quite diverse, but you want to customize it to the needs and wants of the humans or the sector or the specificness of that. There are other models out there. You mentioned Midjourney a few times. You mentioned DALL-E. We've talked about performance as kind of a target differentiator. What are some of the other ways that you see Stable Diffusion kind of defining itself relative to the other things that are popping up? Open source will always lag closed source, because they can always just take open source and upgrade it, especially foundation
SPEAKER 2 0:19:05
models, right? I think data is kind of a key thing. So that's been a recurring theme that's
SPEAKER 1 0:19:09
come up in our conversation a lot. The idea of the human-in-the-loop and data, finding the data versus evolving the model, the whole data-centric AI idea. Yeah. So it's kind of a data-centric thing
SPEAKER 2 0:19:19
where, like, if you look at how people adapt to these models right now, they're doing few-shot learning, right? Or they're doing basic fine-tuning. Because there's no point in training your own model, because it's fricking moving so fast. Like we'll have Stable Diffusion version
SPEAKER 1 0:19:30
three in a few months. We had a 20-times speed-up yesterday on the model. This is insane,
SPEAKER 2 0:19:36
these kind of moves. I don't think we've ever seen anything quite this exponential. But what happens then is that if you go via an API, there's only so much you can do. That's what a lot of these
SPEAKER 1 0:19:44
companies do. Or if you go via an interface, like, you know, Midjourney or something like that,
SPEAKER 2 0:19:48
or a DALL-E. If you've got the model yourself, then you can play, you can experiment, you can adapt it. So the language models from the EleutherAI community, GPT-Neo, GPT-J, GPT-NeoX, they're GPT-level models, but only up to 20 billion parameters. They've been downloaded 20 million times by developers.
SPEAKER 1 0:20:02
They don't need to tell anyone. They just get on with things. And so one of the interesting things
SPEAKER 2 0:20:05
for me is that the positioning is the tooling around this. Because once you've got those primitives, you can build stuff around just like you've seen loads of community web UIs and other interfaces to interact with stable diffusion. And you know, for our own company, it's a very simple
SPEAKER 1 0:20:18
thing. This is like a database on steroids. You think about it. Like it's a database that comes
SPEAKER 2 0:20:23
pre-filled with interesting stuff. And that's how most people are using it right now. But soon,
SPEAKER 1 0:20:26
when we upgrade it a few bits and it becomes mature. The main idea is it's kind of a magic box database of images and your query is your prompt. Exactly. It's a data store. Except it's a super-efficient data store. 100,000 gigs to two.
SPEAKER 2 0:20:38
And it can do all sorts of wonderful things. So right now everyone's using like the pre-baked
SPEAKER 1 0:20:43
version, like the lorem ipsum version, right? But then in a few years, everyone will want their own
SPEAKER 2 0:20:47
custom ones. So our business model is very simple. Take the exabytes of data from content companies, convert them into these models and make them useful. Because we think content is turning intelligent. And it goes beyond media companies to bio, pharma and others. And we're probably the only foundation model company building cutting-edge AI that's willing to work with people and go to
SPEAKER 1 0:21:06
the data. So models to the data, I think is a very interesting thing based on open frameworks. So you
SPEAKER 2 0:21:11
don't have the lock-in of some of these other ecosystems. It'll be like, I'll train a model for
SPEAKER 1 0:21:15
you, but you have to be locked into my thing. Yeah. Yeah. One of the things that you mentioned in passing is that you've seen the model learn
SPEAKER 2 0:21:22
physics. What does that mean? So like, if you type in a lady looking across a still lake, it will do her reflection in the water. Raindrops, it gets correct and things like that. And as you train it
SPEAKER 1 0:21:34
more, it learns more and more concepts of how things interact, which again is a bit insane.
SPEAKER 2 0:21:40
Yeah. Like you can show it the sides. You can train it on like an experimental car, like a
SPEAKER 1 0:21:45
Cybertruck. Considering how much effort's gone in, in the visualization community, to trying to get that stuff right.
SPEAKER 2 0:21:52
Exactly. So like you can show it parts of like a Cybertruck. And it doesn't know Cybertruck, say
SPEAKER 1 0:21:59
for instance. And then you can ask it what the back of the Cybertruck looks like, and it will guess and it'll probably get it right. It knows the essence of truckness. So rather than having
SPEAKER 2 0:22:07
these very specific models that learn stuff, you can now have something that can do just about anything in terms of lighting. And you've got prompt-to-prompt, where you can say, make this picture sadder.
SPEAKER 1 0:22:15
Turn them into a clown or a stormtrooper, and it automatically does that. Because it understands
SPEAKER 2 0:22:20
the nature of these things and the physics and balancing of that, which again is kind of insane.
SPEAKER 1 0:22:25
This has big implications for the rendering industry and other things because this is a far more efficient renderer that can do image to image and transform something into something else.
SPEAKER 2 0:22:32
Nobody's quite sure how it works. And I've got theories. This is one of these things with
SPEAKER 1 0:22:36
these foundation models, like they're just an alien type of experience when you first really
SPEAKER 2 0:22:40
start pushing it. Most people are surface level. When you start pushing through it, you're like,
SPEAKER 1 0:22:44
it's really curious that it can do this. It doesn't have agency. It's a two gigabyte file.
SPEAKER 2 0:22:48
But the fact that you can have that compression of models that understands concepts,
SPEAKER 1 0:22:52
it's really interesting. Will that always be a fundamental limiter? Meaning, if you want a quick and dirty approximation, use something like Stable Diffusion, but if you want precise rendering, you have to turn to traditional techniques? I think it's going to be, I always say, part of a process architecture. You shouldn't try to zero-shot everything.
SPEAKER 2 0:23:10
That's the trap people tend to fall into. Like, yeah, instead you want to,
SPEAKER 1 0:23:13
have kind of KNNs or knowledge graphs or retrieval augmented systems or kind of whatever, put it as part of a process pipeline, but definitely quick and dirty. It does very,
SPEAKER 2 0:23:22
very, very well, better than anything. But then I think that also, this is why we have our inpainting and all these other models. It's going to be part of multiple models doing multiple things for multiple purposes. Sometimes there might be a giant model, once you get to a certain stage; other times you might just want to have a quick and dirty 256 by 256 iteration. And so, what we've seen as well, like with Stable Diffusion 2, we actually flattened the latent
SPEAKER 1 0:23:44
space through DDP and also a bunch of other things. So it's more difficult to prompt.
SPEAKER 2 0:23:49
Stable Diffusion 1 was quite easy to prompt. Stable Diffusion 2 is more difficult, but it's got much more fine-grained control. But where we're going, we're not going to use prompts,
SPEAKER 1 0:23:56
because it will learn from... What do you see taking the place of prompts?
SPEAKER 2 0:23:59
Well, I think it will just be a case of like, you will have your own embedding store that points to
SPEAKER 1 0:24:03
points in the latent space and then pulls up like the things that you like most commonly. So it
SPEAKER 2 0:24:08
learns and then kind of there's that interaction between the two things. So embeddings being a
SPEAKER 1 0:24:11
multi-vector representation of kind of what's in there. So I think that people's own context is
SPEAKER 2 0:24:18
important, and AI models really have to understand people's personal context in that, or companies' or other things'. And again, this is the fine-tuning effect, where you can, with a two-gigabyte file,
SPEAKER 1 0:24:27
actually have your own model. And then why do you need to prompt 'trending on ArtStation,
SPEAKER 2 0:24:32
3D, octane render' and all these things, when it learns that that's what you want to have?
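To make the embedding-store idea concrete, a toy sketch; the store contents and the workflow are purely illustrative, not a Stability product:

```python
# Toy "embedding store": cache embeddings of styles a user likes and
# retrieve the closest one, instead of retyping prompt boilerplate.
# Everything here is illustrative; real embeddings would come from a
# text encoder such as CLIP's.
import torch

store = {
    "octane render": torch.randn(768),
    "watercolor": torch.randn(768),
    "studio photo": torch.randn(768),
}

def nearest_style(query: torch.Tensor) -> str:
    # Cosine similarity against every cached style embedding.
    scores = {
        name: torch.cosine_similarity(query, emb, dim=0).item()
        for name, emb in store.items()
    }
    return max(scores, key=scores.get)

print(nearest_style(torch.randn(768)))  # picks the user's usual style
```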
SPEAKER 1 0:24:37
That's the type of style that you like. Having said that, I think prompting is just very difficult. Like, my wife's been trying to prompt me for 16 years and she hasn't quite managed it. Yeah. You've touched on a couple of things, open source versus API, and, very briefly, this idea of customization. And I think, based on stuff that I've heard you talk about in the past, you're very strongly opinionated around the model through which you're delivering the technology, beyond just the technology itself. Can you talk a little bit about your thoughts there
SPEAKER 2 0:25:12
and what's driving that? So I think this is incredibly powerful technology. I think it's one of the big epochal changes in humanity, because you have a model that can do anything and approximate. There are two types of thinking, type one and type two, logical thinking and then
SPEAKER 1 0:25:23
principle-based thinking. This kind of gets to principle-based thinking. We still don't have AI that does good old-fashioned reasoning with logic. This can take leaps. That's what we said, like
SPEAKER 2 0:25:31
quick and dirty approximation can do that. You type it in and you get like a hundred different
SPEAKER 1 0:25:36
images of like a book or a vase or something like that that you can then iterate and improve on.
SPEAKER 2 0:25:41
Just very different experience. So our thing was like, again, put this out as foundation models, like again, benchmark models that people can then develop around because the pace of innovation will outpace anything that's closed, but also addresses things like the digital divide and
SPEAKER 1 0:25:54
minorities and other things. So like with OpenAI and DALL-E 2, they introduced an anti-bias filter,
SPEAKER 2 0:25:59
which automatically for non-gendered words added a gender and a race. So when you type in sumo
SPEAKER 1 0:26:04
wrestler, it will do Indian female sumo wrestler, which I suppose could exist. Probably not many
SPEAKER 2 0:26:10
of them. It's probably not the intent. It's kind of limited. Whereas with our model, what happened is we released it and then a team in Japan created a Japanese text encoder alternative. So 'salaryman', rather than meaning a man with lots of salary, means a very sad man. Kind of these local contexts, these local elements, these local fine-tunes, I think were essential. And also widening the discussion, because a lot of the stuff that occurs with these big powerful models is, we won't release them because we're scared about what's going to happen, because no, no, no, no. So that's fair. That's an opinion. That shouldn't mean that
SPEAKER 1 0:26:42
it shouldn't be available to other people because of the power of this technology.
SPEAKER 2 0:26:44
Because otherwise they'll just go to corporates first and it won't be available, despite the fact
SPEAKER 1 0:26:48
it could uplift them creatively and communicatively and in other things. One way to think about it, and I'm really mean in some discussions sometimes, it's like, why don't you want Indians or Africans to have this technology? Because there's no comeback. You can't say that more education is needed
SPEAKER 2 0:27:00
or it's too dangerous. They're not responsible because the reality is this is technology of
SPEAKER 1 0:27:04
humanity, and it's an echo of what happened with cryptography. We can't let cryptography be open, and the government classified it as a weapon here in the U.S. Because bad guys might use it. But we use it now to protect ourselves as well. Open source will always be more secure. Open source will always be more secure than closed source if the community rallies together. Because what do we run our infrastructure on here at AWS? It's not really Windows servers, is it? It's Linux, and databases like MySQL and things like that. Because the community can come together and build stronger systems and more effective systems. But it's crazy how fast this is going. And so it's a difficult line to walk. Yeah. Yeah. You've mentioned that more recent versions of Stable Diffusion include like safe-for-work filters, that kind of thing. So it sounds like you're thinking about and care about this, and not just putting it out without any kind of controls.
SPEAKER 2 0:27:55
Yeah. So the original version, look, again, it was led by the CompVis lab. And we said very specifically, you guys get to decide and we will advise, because it was an academic endeavor. Even if, of the people, one of them works for us, another one works for Runway, etc.
SPEAKER 1 0:28:10
It's the nature of the thing. And so we're very respectful of kind of the entities that we collaborate with, because it can be a minefield, right? You're not trying to whitewash anything.
SPEAKER 2 0:28:17
So it was released under a CreativeML OpenRAIL license, which is a new type of license from
SPEAKER 1 0:28:21
Hugging Face that says you have to use it ethically. And it had a safety filter, because the decision was made by the developers not to filter the data, so it could be a baseline from which we
SPEAKER 2 0:28:30
could then figure out biases and other things. And the filter removed a lot of nudity and kind of other things, especially because it was accidentally creating it. Stable Diffusion 2 was trained on a
SPEAKER 1 0:28:38
highly safe-for-work dataset. So it's massively more safe. It doesn't have a filter because it doesn't need one. It has some drawbacks, though. One of the things that we saw during the fine-tuning of Stable Diffusion 1 is that people trained it on not-safe-for-work images. The internet is for that, whatever. They fine-tuned it. They took lots of images that were not safe for work. So obviously there was the standard effect of that, because again, they're free to kind of use it as a community. But the side effect is that when you actually used it for safe-for-work prompts, it did amazing photorealistic humans, because it learned about anatomy from these not-safe-for-work images.
SPEAKER 2 0:29:15
It's quite funny. So Stable Diffusion 2 out of the box is a bit less good at anatomy, because
SPEAKER 1 0:29:19
we removed a lot of those things. Not much. And again, we're adding it back in safely. We really
SPEAKER 2 0:29:25
care about that. The other thing that we care about a lot is we view this community as big. We're creating millions, hundreds of millions of artists, but the artists themselves are part of
SPEAKER 1 0:29:32
the community. So the artists were like, why are they using and prompting my name on this? So yeah,
SPEAKER 2 0:29:36
so artists are part of the community. They were asking, can we opt out of the datasets? And some are actually asking, can we opt in, because we're not in the dataset. And so we worked with Spawning and LAION and others on opt-in and opt-out mechanisms, because I think that's the right thing to do.
SPEAKER 1 0:29:50
I think that it's ethical to use web scrapes to create models like this, especially because the diffusion process doesn't create copies or photo matches. It actually learns principles. It's like
SPEAKER 2 0:30:00
a human. At the same time, if people don't want to have their data in the data set, they should
SPEAKER 1 0:30:04
opt out. If they want it in, they should opt in. In fact, we've had thousands of artists sign up for the system. It's been 50-50, opt in and opt out. Which I think is really interesting, and maybe not what some people would expect. Yeah. Yeah. Interesting. Interesting. Maybe shifting gears a little bit to Stability AI as a company, as an organization. I've heard it described variously as an art studio, as something that kind of looks and feels a little bit like a research lab, feels a little bit like a funder of things, a provider of GPUs and instances. How do you describe what it is?
SPEAKER 2 0:30:34
I mean, Stability AI is a platform company. So we're trying to build the layer one for foundation
SPEAKER 1 0:30:39
model AI. And we think the future will be open source on this. So our research lab is researchers who have loads of freedom and they can, in their contracts, open source anything they create. And
SPEAKER 2 0:30:50
there's a revenue share for when we run the models on the API, even if the researchers don't work at
SPEAKER 1 0:30:54
Stability, they still get cut checks. Which I think is a very interesting way of doing things. We've got a product team that takes the open source stuff, just like anyone can, and productizes it into things like DreamStudio. We have DreamStudio Pro coming up,
SPEAKER 2 0:31:06
which is a full enterprise-level piece of software with like 3D, keyframing, animation, video,
SPEAKER 1 0:31:12
audio, everything. We've got a forward deploy team, whereby for our top customers, with the most content that we transformed in foundation models, we're basically embedding
SPEAKER 2 0:31:21
teams inside there and saying, you don't need to build a foundation model team, we're your team,
SPEAKER 1 0:31:25
because we do all the modalities, from text to language to audio. That's something that's super appealing to people. Then we've got an infra team that is supporting our 5,000 to 6,000 A100s, and the infrastructure to scale APIs to billions, supported by Amazon and others. Can you talk a little bit about some of the ways that you engage with enterprises? Like, what are the kinds of things that they want help doing with these models?
SPEAKER 2 0:31:47
So the pace of ML research is literally exponential with a 23-month doubling. It looks crazy. They can't keep on top of this. And there's very few people...
SPEAKER 1 0:31:54
Measured by published papers. Published papers on arXiv, yeah. It's always nice when you see an actually exponential curve. You need an AI to help with that. But when you look at this, they're realizing they need to be on
SPEAKER 2 0:32:06
top of this technology now, and they come to us as almost consultants. It's like a Palantir type
SPEAKER 1 0:32:10
model, where we're like, we'll fine-tune some models for you, and we'll make them usable through DreamStudio. But you shouldn't train your own models now, because the models aren't going to
SPEAKER 2 0:32:19
mature for another year. When that time comes, we will train the models for you, we will fine tune them for you, we'll create custom models for you. That's our highest touch engagement with a
SPEAKER 1 0:32:26
couple of dozen entities. And... When you're telling them they shouldn't train models, are you talking about from scratch? From scratch. Or...
SPEAKER 2 0:32:33
They will be able to eventually, but right now, it's not a sensible thing to train the models
SPEAKER 1 0:32:37
from scratch. Stable Diffusion took 200,000 A100 hours to... Like 600k you spent on it?
SPEAKER 2 0:32:42
Yeah, 600k. We actually spent less because of our discounts, but I can't say what our discounts are.
SPEAKER 1 0:32:47
And you can figure out... Retail?
SPEAKER 2 0:32:48
Yeah, retail. Retail, Stable Diffusion took about 800,000 hours. Retail, OpenCLIP,
SPEAKER 1 0:32:54
because we had to make the CLIP model, was about $5 million. So, you know, these things add up. Yeah.
SPEAKER 2 0:32:58
Quite a large bill. I think that when you kind of look at all of these, now's not the right time
SPEAKER 1 0:33:02
to do big trainings for big companies. Because again, the model architecture is just improving at a ridiculous rate. But then it's going to level off. You can't keep improving forever.
SPEAKER 2 0:33:12
And then that's the right time to train up your own models. So they'll be better than these fine
SPEAKER 1 0:33:15
tune models. But then you have models with multiple modalities. You know, this is part of the
SPEAKER 2 0:33:20
reason we've kind of partnered with SageMaker. Because people need to get used to this technology
SPEAKER 1 0:33:23
now. And they'll have all these different primitives, these different models, they can mix and match to create brand new things going forward. And SageMaker makes it kind of easy to do that. And it makes it easy to address the tail, because apart from the top couple of dozen companies, we just want to have a SaaS solution for everyone else to be able to access, use and modify these models. Following up a little bit on the SageMaker and the AWS announcement, it kind of read as you selected AWS. From my understanding, you've been using AWS to some extent all along.
SPEAKER 2 0:33:51
Yeah, so AWS built the core cluster. And now, you know, we've reached this point. It was originally a
SPEAKER 1 0:33:55
4,000 A100 cluster, which on the public Top500 list would rank at about number 10 among supercomputers in the world, which is kind of insane. So it was a great job building that. But then we had to decide what's next, like managing the resilience and some of
SPEAKER 2 0:34:08
these other things, do we build our own next cluster? Amazon came and they said, let's use the SageMaker service to offer a high level of resilience and optimization. So SageMaker
SPEAKER 1 0:34:17
crew, for example, took our language model GPT-NeoX, again, 20 million downloads of this family.
SPEAKER 2 0:34:24
They went and took the efficiency of 512-A100 training from 103 teraflops per GPU to 163 by optimizing it for the Amazon EFA fabric, and pipeline parallelism and cross-attention and kind of all these things. And that was an amazing thing. So they're helping us optimize our entire
SPEAKER 1 0:34:39
stack from inference to training, through to having resilience, so when GPUs fail, they come back up. And the final part of it was just, how do we make this accessible through SageMaker and services, and the ecosystem they build around that? Now we're going to make our models available on everything. Right. Right. So like today, they became available on the MacBook M1 with native Neural Engine support, one of the first models ever to have that. It's massively spread out. We've got it working on Qualcomm, we're going to work on iPhones, all these things. But Amazon is
SPEAKER 2 0:35:08
a really great partner because they're infrastructure players, one of the biggest cloud providers in the
SPEAKER 1 0:35:11
world. And so that's why we kind of picked them as our preferred partner. Also, super grateful in
SPEAKER 2 0:35:16
that we wouldn't be here if they hadn't built such a freaking enormous cluster and really believed in
SPEAKER 1 0:35:20
us. Because we're only a 13-month-old startup. Yeah. So everything's been in the cloud the entire time. The entire time. We had a machine learning ops team of four people managing 4,000 A100s. Now we're up to nearly 6,000. Was that team managing that cluster kind of bare metal with your own tooling? Or how much of the Amazon tooling have you used? So it was EC2. And then Amazon
SPEAKER 2 0:35:44
has a system called ParallelCluster, with Slurm, that was used to kind of manage it. And so we've been working for the last four, five, six months just constantly improving it together. And again, it's open source. Yeah. If you go to the Stability AI GitHub, you can literally download all our configurations to run your own ParallelCluster on there. And again, this is part of what we really like, the fact that the stack is open source and anyone can take it and build their own
SPEAKER 1 0:36:06
clusters. Maybe not quite to the size that we did, unless you're feeling really punchy. But still,
SPEAKER 2 0:36:13
I think this knowledge and these things should be shared. Because you find that large model training isn't really a science. It's more of an art. Like, one of the most interesting reads you can do is the Facebook OPT-175B logbook for the 175 billion parameter model. They just
SPEAKER 1 0:36:26
try stuff and it often fails. And there's the occasional weird thing, like when Azure customer support, on the 23rd of December, deleted the entire cluster. And you're like, man, I feel for you guys. It's kind of like that. But like I said, this is not just an easy click-and-play kind of thing. These models are difficult to train. The smallest hardware thing can throw
SPEAKER 2 0:36:46
it off. There can be just weird stuff. We're making it up and figuring it out as we go along.
SPEAKER 1 0:36:50
Because remember, transformer architectures are literally only five years old. Yeah. That's incredible. Thinking about open source and the direction broadly that the company is taking, one of the challenges that comes up in open source as it matures is this idea of governance. How do you think about, and maybe it's early to be talking about governing a community that's just months old, but do you have thoughts on how the community governs itself over time? Yeah. So maybe it's a
SPEAKER 2 0:37:14
complicated one, right? AI governance: is it policy-led, is it community-led? Who are the voices at the table? Because there are some important things here. Such powerful technology is going to be essential, I believe, to the future of humanity. So like, for example, EleutherAI is two and a bit years old. That's our language model community, 15,000 people and developers, that we're kind of incubating at the moment. We're going to spin it out into its own separate 501(c)(3). Because it shouldn't be us influencing the direction of open source large language models,
SPEAKER 1 0:37:39
right? It should be a collective effort. But now we're really going through the governance thing and looking at different examples. And the Linux Foundation is an excellent example of that. So PyTorch has just been given to the Linux Foundation. And so we're in talks with them and a whole bunch of others to say, what are the best practices here? And what should it really look like, given the power of these models and some of the decisions you need to make about that? At Stability itself, we're setting up subsidiaries in every country, such that, first off, 10% of our equity in those goes to the kids using our tablets, because I think they should influence it, because that's
SPEAKER 2 0:38:07
the next generation, this AI will be important to them. But then we want this to be independent entities that run the AI for India or Vietnam or kind of Malawi, etc. Because we need to train up
SPEAKER 1 0:38:17
a next generation of people to make those decisions for their own country. Because right now we
SPEAKER 2 0:38:22
have a situation where you've got a few people in San Francisco making decisions on the most
SPEAKER 1 0:38:26
powerful infrastructure in the world for everyone. Because let's not kid ourselves, this AI is infrastructure. It's essential for where we're going to go. And it shouldn't be controlled by
SPEAKER 2 0:38:37
any person or entity. Like, I'm very supportive of the whole ecosystem. The one time I was very
SPEAKER 1 0:38:42
direct, I spoke out against OpenAI. Because for DALL-E 2, they banned Ukrainians from using it. They removed any Ukrainian entities from it as well. And this was during the time when they're being oppressed. I said, basically, you have excluded and removed and deleted and oppressed people, and that is ethically and morally wrong. But it's their prerogative as a private company. And if it wasn't for us, there would be no alternative. And so I literally took Ukrainian
SPEAKER 2 0:39:07
developers whose houses were destroyed and brought them to the UK. And so this is part of the thing as well. If you have control of this artificial intelligence given to an unregulated entity,
SPEAKER 1 0:39:17
like these big companies, they can't help themselves but behave in certain ways because
SPEAKER 2 0:39:20
they can't release it. More than that, they tend to optimize. So I did a lot of counter extremism
SPEAKER 1 0:39:24
work, I'm advising multiple governments. The YouTube algorithm got hijacked by extremists, because the most engaging content was extreme. Again, that's not YouTube's fault. They're full of great people. Ad-driven AI companies, they will use this technology to create the most amazing
SPEAKER 2 0:39:37
manipulative ads. I guess it's not their fault. It's kind of what they are. So regulation needs to come in appropriately. Governance needs to come in appropriately. But we need to educate and widen
SPEAKER 1 0:39:45
the discussion on this. And the only way to do that is open source. Otherwise, it will never happen. And so you will have AI basically being a colonial tool in some ways, with very Western norms, when this is the central infrastructure, like I said, I believe, for everyone. I think the common retort to that is it needs to be controlled because it's so powerful, so dangerous.
SPEAKER 2 0:40:05
Yeah. So who are you to control it? I mean, this is a thing. Like I've heard it
SPEAKER 1 0:40:08
likened to a nuclear weapon. Like it's a nuclear weapon that can allow humans to create visually.
SPEAKER 2 0:40:12
And so you're restricting it. I mean, it comes down to a question. Like, I've asked this and I've
SPEAKER 1 0:40:16
never had an answer. Why don't you want Indians or Africans to have this? And the only answer is because they need more education. So educate them more, because they can't use it responsibly and you can? It's racist. Like, I think fundamentally, if you think about the digital divide, we've seen this with technology being restricted from minority groups and from the rest of the world frequently. It's fundamentally racist, because we think we know better in the West. Whereas the
SPEAKER 2 0:40:39
reality is we don't, because people can take this and extend it. And people are generally good.
SPEAKER 1 0:40:43
People are not bad. And if people are bad as a society, we build systems to regulate that.
SPEAKER 2 0:40:48
So even if they create deepfakes, we build our social networks and others to have curation
SPEAKER 1 0:40:52
mechanisms. We build authenticity schemes like contentauthenticity.org that we back. That sounds like the core of your answer is that the ecosystem will solve the problem. The bad actors come in, they use these tools to cause whatever havoc they'll cause. And then we'll find
SPEAKER 2 0:41:06
a fix. Bad actors have the tools already. They have tens of thousands of A100s and people.
SPEAKER 1 0:41:10
I mean, in a sense, you're the proof point of this, right? OpenAI was keeping DALL-E closed behind APIs and waitlists and things like that. And you came up out of nowhere and released something.
SPEAKER 2 0:41:22
And look, 4chan has had this technology for three months. What have they created? Nothing, right? This isn't going to topple humanity, and more and more people know about it, so we can bring this discussion forward. We took a lot of flak. We had a lot of benefits. But we brought this discussion into the open, and into policy and other fields as well. Like, again, it's my hope that now
SPEAKER 1 0:41:38
we act as a forcing function. So I reckon DALL-E 3 will be open sourced. You know, it's like they open
SPEAKER 2 0:41:43
sourced Whisper. And I think this will be fantastic. Let's bring it out into the open. Because again,
SPEAKER 1 0:41:47
this is foundational infrastructure for extending our abilities. It should not be closed. I don't
SPEAKER 2 0:41:52
believe that it should be free forever. And like, actually, it's not open source, because it doesn't
SPEAKER 1 0:41:58
conform with rule zero of open source in a pure open source way. The CreativeML license is not, because we say you must use it ethically. Do we want to move it to open source? Yes. Under CC BY
SPEAKER 2 0:42:08
or MIT licenses, just like our other models, like our Korean language model, the Polyglot one from EleutherAI, or OpenCLIP, or things like that. But again, this needs to be an open discussion,
SPEAKER 1 0:42:16
I think, rather than who is deciding it. I don't know. If regulators want to come and regulate it,
SPEAKER 2 0:42:21
again, that's a democratic decision. And so I'm a big supporter of democracy and kind of these things. But let's use our institutions and our processes, rather than try to make these decisions
SPEAKER 1 0:42:30
ourselves in closed rooms. Awesome. Well, thanks so much for taking the time to chat. It's been wonderful speaking with you and learning a bit more about what you're up to. It's a pleasure. And I hope your voice recovers soon as well. Nearly done. Nearly done. Thanks so much. Take care. All right, everyone. That's our show for today. To learn more about today's guest or the topics mentioned in this interview, visit twimlai.com. Of course, if you like what you hear on the podcast, please subscribe, rate and review the show on your favorite podcatcher. Thanks so much for listening and catch you next time.