Practical AI

How has AI image generation evolved from blurry outputs to powerful visual intelligence models? Dustin Podell, Co-Founder and Researcher at Black Forest Labs, explains the progression from diffusion to flow matching, how modern image models work, and how they're being used for image editing and practical visual workflows. The conversation also explores the FLUX family of models, running image generation locally, and where visual AI is headed next.

Featuring:

Dustin Podell – LinkedIn
Chris Benson – Website, LinkedIn, Bluesky, GitHub, X
Daniel Whitenack – Website, GitHub, X

Sponsors:

Midwest AI Summit: Join AI practitioners on October 15 in Indianapolis for practical sessions, hands-on discussions, and real-world AI solutions. Use code PracticalAI20 to save 20% on your registration. https://midwestaisummit.com/#tickets
Prediction Guard: A self-hosted AI control plane for running agents in high impact environments. predictionguard.com/practicalai

Upcoming Events:

Register for upcoming webinars here!
Midwest AI Summit 2026

Creators and Guests

Host

Chris Benson

Cohost @ Practical AI Podcast • AI / Autonomy Research Engineer @ Lockheed Martin

Host

Daniel Whitenack

CEO @Prediction Guard & cohost @Practical AI podcast

Guest

Dustin Podell

What is Practical AI?

Making artificial intelligence practical, productive & accessible to everyone. Practical AI is a show in which technology professionals, business people, students, enthusiasts, and expert guests engage in lively discussions about Artificial Intelligence and related topics (Machine Learning, Deep Learning, Neural Networks, GANs, MLOps, AIOps, LLMs & more).

The focus is on productive implementations and real-world scenarios that are accessible to everyone. If you want to keep up with the latest advances in AI, while keeping one foot in the real world, then this is the show for you!

Narrator: 00:01

Welcome to the Practical AI Podcast, where we break down the real-world applications of artificial intelligence and how it's shaping the way we live, work, and create. Our goal is to help make AI technology practical, productive, and accessible to everyone. Whether you're a developer, business leader, or just curious about the tech behind the buzz, you're in the right place. Be sure to connect with us on LinkedIn, X, or Bluesky to stay up to date with episode drops, behind-the-scenes content, and AI insights. You can learn more at practicalai.fm.

Narrator: 00:35

Now onto the show.

Dan: 00:41

Welcome to another episode of the Practical AI Podcast. This is Daniel Whitenack. I am the CEO at Prediction Guard, and I'm joined as always by my cohost, Chris Benson, who is a principal AI and autonomy research engineer. How are you doing, Chris?

Chris: 00:56

Hey. I'm doing great. Can't wait to get into today's conversation. It's gonna be fun.

Dan: 01:01

Yes. Yes. For for an audio podcast, we're gonna talk about a lot of interesting visual things. Maybe before we get started, just just a little teaser. Practical AI is posting some videos on YouTube now.

Dan: 01:18

So if you do consume podcasts that way, you might go check us out on our on our YouTube page. But speaking of images, videos, and more specifically, image generation, really excited to have with us today Dustin Podell, who is cofounder and researcher at Black Forest Labs. Welcome, Dustin.

Dustin: 01:38

Yeah. Thanks for having me, guys. It's, really great to be here.

Dan: 01:41

Yeah. And and I know that Black Forest Labs does more than just kind of, raw image generation. There's a lot of workflow related things, hardware optimization, all sorts of cool stuff you're involved with. But as we get into some of that, I'm wondering if you can just help our audience with a bit of a state of image generation methods and workflows, for the industry. We've talked on the show before about diffusion models, and we'll link some of those episodes in the show notes maybe.

Dan: 02:13

But a lot has happened. Right? There's a lot of people working on a lot of interesting things. And, I'd love to kinda understand, like, over the last year, what are some of those main main points that might be good for people to orient themselves to where things are at now?

Dustin: 02:29

Yeah. Yeah. No. It's a it's a good question. I mean, if I'm allowed to take it even a little bit further back, I mean, the state of

Chris: 02:35

Absolutely.

Dustin: 02:35

Yeah. Yeah. The state of the state of image gen, video gen, generative models in in as a whole has kind of gone crazy, so to speak, in the last, like, three or four years.

Chris: 02:47

Yeah.

Dustin: 02:48

So where are we now? Where we came from about four years ago, where I first kind of entered into the scene, so to speak, is we were at models that were essentially just doing little blobs of color that were kinda related a little bit to where you were with the prompt. You know, okay, a lighthouse on the beach or this and that, and you would get something that, okay, vaguely looks like it, and you would show someone: a lighthouse on the beach? Yeah, I could see it, I guess. And maybe it would interest a few nerdy people. And then now today, we're at the point where, and you've probably been seeing plenty of this stuff online, we're seeing whole short films made entirely with AI generation where certain scenes are almost entirely indistinguishable from reality, so to speak. So I would say we've come quite far, but I will also say the core of the technology hasn't actually changed that much. It's been a pretty nice, like, forward progress.

Dustin: 03:48

I don't wanna, like, dive immediately into anything technical here, but it's it's certainly I mean, for anyone who's been paying attention, I'm sure or or or anyone who really hasn't been paying attention, this probably came a bit out of out of nowhere, so, you know.

Chris: 03:59

Yeah. I was gonna say, I guess, you know, broadly, the attention of the general public is so much on kind of more of the LLM generative world in terms of, you know, that that stuff, and everyone's finally on apps regardless of whether they're technical or not. And so I think a lot of people kind of miss the tremendous advancements you guys are making on that side. And so, like, you know, could could you could you take a quick moment, and now that you've kinda done the highest level, maybe step through some of the things that people may remember, like we talked about Stable Diffusion and things, a couple of those things that have kinda led up to what we're gonna dive into today with some more specifics. And and as Daniel mentioned, we can we can offer some links for past conversations if people wanna dive into those specifically, but that would that would really be interesting to kinda hear distinct points on that timeline

Dustin: 05:39

in 2017, some researchers at Google figured out this very cool thing called the transformer, which I'm sure people have heard about many, many times by now, which... These are what you would call an autoregressive language model. And so what this is is it's predicting data essentially one piece at a time. So it predicts one piece of data, and then it looks,

Chris: 07:33

I love, yeah, I feel like I'm watching a thriller where you're about to reveal the next thing. It's good.

Dustin: 07:40

Yeah, so what we do with diffusion is, I'm just gonna tell practically what we do, and then we can kinda break it down. You could ask me questions.

Chris: 07:46

That's fine.

Dustin: 07:48

So what we do with diffusion is we take a whole continuous medium, and what I mean by continuous medium is the most famous one is images. The world is continuous. There's not you know, it's not that there's, you know, one you know, like a letter is discrete. There's a T, there's an A, a B, a C, a D, an E, and an F. A color, shape, the structure of the world, the color, the light, all of this, these are continuous natural mediums.

Dustin: 08:14

And so what we do is we take one of these continuous natural mediums that that we've now encoded into digital space, so we use RGB. So, you know, okay, 256 values we typically use for each of red, green, and blue, and then we encode this into an image. And then what we do is instead of trying to predict so an autoregressive view might try to predict like pixel by pixel by pixel, instead what we do is we essentially try to remove information, and the way we remove that information is by adding noise. So you imagine you like add a little bit of grain on top of an image, you can still kinda see the image, maybe it's a little grainy, maybe it's a little blurry, oh, you can't quite see it, and you add a little bit more, maybe you see some of the shapes, maybe, okay, there's like, is that a dog? Yeah, you kinda see it, the color starts to fade and then you keep adding and eventually you just get, okay, this is just noise.
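The noising step described here can be sketched in a few lines of Python. This is a minimal illustration, not the schedule any particular model uses: the linear blend between image and Gaussian noise, the six-value "image", and the names `to_unit_range` and `add_noise` are all assumptions made for the example.

```python
import random

def to_unit_range(rgb_bytes):
    """Map 8-bit channel values (0..255) to floats in 0..1."""
    return [v / 255.0 for v in rgb_bytes]

def add_noise(pixels, t, rng):
    """Blend clean pixels with Gaussian noise.

    t = 0.0 returns the clean pixels, t = 1.0 returns pure noise.
    The linear blend is an illustrative assumption, not a schedule
    named in the episode.
    """
    return [(1.0 - t) * p + t * rng.gauss(0.0, 1.0) for p in pixels]

if __name__ == "__main__":
    # A tiny hypothetical "image": two RGB pixels flattened into six channel values.
    clean = to_unit_range([255, 128, 0, 30, 60, 90])
    for t in (0.0, 0.25, 0.5, 1.0):
        noisy = add_noise(clean, t, random.Random(0))
        print(t, [round(v, 2) for v in noisy])
```

As `t` grows, less of the original image survives in the blend; at `t = 1.0` the output no longer depends on the image at all, which is the "this is just noise" end state.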

Dustin: 09:02

I don't see anything in this image at all. And with the diffusion model, what we're trying to do is we're essentially adding a bit of noise and then we're saying, now try to predict what the clean image is essentially for this. So, you know, okay, so we add a little bit of noise, predict the clean, it just has to remove a little bit of noise. You add a lot of noise, and now what it has to do is there's so little information, it has to actually like come up with essentially like how to infill this properly. And then what you do is then in in what we call inference, which is when you actually run the model, you ask for an image.
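One way to picture the training setup is as pairs of (noisy input, clean target) scored with a squared-error loss. The sketch below assumes a linear noise blend and uses a hypothetical `do_nothing_model` that just echoes its input, purely to show that a little noise leaves a small error to fix and a lot of noise leaves a large one. A real model would be a trained network, not this stand-in.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def make_training_example(clean, t, rng):
    """Return (noisy_input, target): the model sees the noisy input and is scored against the clean image."""
    noisy = [(1.0 - t) * p + t * rng.gauss(0.0, 1.0) for p in clean]
    return noisy, clean

def do_nothing_model(noisy, t):
    """Hypothetical stand-in for a network: it simply echoes its input."""
    return list(noisy)

def loss_at(clean, t, seed=0):
    """Loss of the stand-in model on one training example at noise level t."""
    noisy, target = make_training_example(clean, t, random.Random(seed))
    prediction = do_nothing_model(noisy, t)
    return mse(prediction, target)

if __name__ == "__main__":
    clean = [1.0, 0.5, 0.0, 0.1, 0.2, 0.3]  # hypothetical flattened image
    for t in (0.0, 0.1, 0.5, 0.9):
        print(t, round(loss_at(clean, t), 4))
```

Training would adjust a real model so this loss falls at every noise level, including the high-noise levels where it has to invent plausible content rather than clean up grain.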

Dustin: 09:35

What you would do is then you start a fully noisy image and you say, give me a dog, you know, wearing a top hat on the beach. And then essentially what it does is it tries to remove a little bit of that noise to get closer to this idea of a dog in a top hat at a beach. And you might get some structure, it tries to figure it out, it's very coarse, it's very just trying to figure out the shape and then it has a little bit of shape to latch onto and then it does the next step where it removes a little bit more of that. Oh, and it starts to come into view. It starts to you start to get a little bit more in focus, and you do this do do do do do do do and what you see is essentially you completely remove the noise of this image and oh my goodness, now you have this this completely generated image of a dog on the beach with a top hat.
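The step-by-step inference loop can be sketched as follows. The `oracle_denoiser` is a hypothetical stand-in for a perfectly trained, prompt-conditioned model (it simply returns the image the prompt is assumed to describe), and the update rule assumes the linear noise blend used for illustration, so this shows the shape of the loop rather than any real sampler.

```python
import random

def oracle_denoiser(noisy, t, prompt_image):
    """Hypothetical 'perfectly trained' model: always predicts the image the prompt describes."""
    return list(prompt_image)

def sample(prompt_image, steps=8, seed=0):
    """Start from pure noise and remove a little of it at each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in prompt_image]
    trajectory = [x]
    for i in range(steps):
        t = 1.0 - i / steps             # current noise level
        t_next = 1.0 - (i + 1) / steps  # slightly lower noise level
        predicted_clean = oracle_denoiser(x, t, prompt_image)
        # Keep only the fraction of the remaining noise that belongs at t_next.
        x = [c + (t_next / t) * (v - c) for v, c in zip(x, predicted_clean)]
        trajectory.append(x)
    return trajectory

def distance(a, b):
    """Euclidean distance between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

if __name__ == "__main__":
    target = [1.0, 0.5, 0.0, 0.1, 0.2, 0.3]  # hypothetical "dog in a top hat"
    for step, x in enumerate(sample(target)):
        print(step, round(distance(x, target), 4))
```

Each pass brings the sample closer to the predicted clean image, which matches the description of the picture gradually coming into focus.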

Dustin: 10:13

For a long time, this was, I don't wanna say not good. I mean, all models have gotten a lot better, but when it first started out, these looked like blobs of color. It was like these little, like you could barely see it. But fundamentally, what I just described is still the process that's been happening over the last four years from these old image generator models or anyone that maybe tried like DALL-E mini, I mean, the old Stable Diffusion models that we worked on, you know, I don't know, many of the different models that are out there now to these modern video models that are producing whole ten second sequences, fundamentally, it's doing the exact same process of removing this information slowly with this noise, and then just slowly going the other direction, essentially creating info to infill if this makes sense. Pause there

Dan: 10:58

for a second just to kinda like that.

Chris: 11:00

Let me tell you, because as I'm listening, very, very good explanation, probably the best one that I've heard, so you are right on target if you just carry on what you're doing, because I'm I'm enjoying this.

Dan: 11:13

And and you mentioned that when this started, right, you sort of ended with these blobs, etcetera. I think there's a lot of people that have seen very impressive things like the, like you were saying, like whole parts of maybe movies or commercials or something

Dustin: 11:30

Mhmm.

Dan: 11:30

Being generated in this way. What what is the current, I guess, state of the art in terms of, whether it be your your models or others? And I know there's more to talk through as you get in. Like, there's more than just raw image generation. There's how this fits into workflows and and doing certain tasks.

Dan: 11:47

But what is kind of the state of the art in terms of what we're able to generate? Where are the where are the bounds currently and in terms of quality and that sort of thing?

Dustin: 11:56

Sure. So I'll I'll tell you kinda like in the last year, I'll I'll split it up into, let's call it three different categories, but they kinda blend a little bit, which is these continuous mediums I talked about. So I gave the example of a continuous medium with image. Video is obviously an extension of that, adding the the time dimension. But then there's also audio as well, which now a lot of these video models generate audio, but then there's also things like, I don't know, maybe you've heard the the like, Suno or these music models.

Dustin: 12:22

Some of these are also using this exact same technique of removing the information with noise, bringing it back in, it's just a different medium that they're essentially applying to. I'm happy to talk about our models all day. I love talking about our models. But I also wanna be fair to just anyone listening and honest to kind of the state of the world on if you wanna go out and try some nice things out there as well. I will say that before I kinda like dive a little bit into what I think is maybe the best one to recommend, I will make like one statement of that like best quote unquote here is a little bit hard to define sometimes, and this is something we're always like wrestling with is like what makes a best model for someone.

Dustin: 13:04

Because, you know, obviously you when someone uses these models, they're typically prompting or putting in now you can reference images or videos or other sorts of things and kind of treat it a little bit like a more like a like a creative companion. And then what you expect out of this can be very different for different types of people. That said, I would say that if you wanna see kind of the quote unquote state of the art right now, for what I would say is like text to video at the moment, it would be Seedance. So Seedance has done some extremely impressive work with the Seedance 2 model over the over the last year, so they're doing I think it's now 4K generations up to 15 seconds, which is, I mean, very nice quality. It's focused very much on like cinematics.

Dustin: 13:52

I will throw a shout out to anyone that, if anyone from the Sora team ever listens to this, I think Sora 2 is quite a nice model. It didn't rank so high on some of these like leaderboards that we... In the AI community, there's a I'm sure you guys talk about this plenty, like leaderboards, rankings, where models place. I think in the LLM side of things, this is much better covered where there's very clear metrics of how well does it program, how well does it do math. On the more creative side of things with video models, image models, audio models, we typically end up falling down to this single preference type benchmark, which is just like a general preference. Everyone in the world goes and can can vote on, you know, on these leaderboards.

Dustin: 14:30

There's a couple different nice companies and sites that that do this. But I I feel like this kind of removes some of the the specificity that the or the granularity. That's the word I was looking for.
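A single-preference leaderboard of this kind is often built from pairwise votes. The sketch below uses a standard Elo update as one plausible way to do it; whether any specific leaderboard uses Elo, a Bradley-Terry fit, or something else is an assumption here, and the model names and `k` value are made up for the example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def record_vote(ratings, winner, loser, k=32.0):
    """Update two ratings in place after one human preference vote."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta
    return ratings

if __name__ == "__main__":
    # Hypothetical models starting from the same rating.
    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    record_vote(ratings, winner="model_a", loser="model_b")
    print(ratings)
```

Every vote collapses into one number per model, which is the loss of granularity being described: the rating cannot say whether a model won on realism, prompt adherence, or style.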

Sponsor: 14:43

Listen. I've been to an incredible amount of AI events, many of which are good, but many of which are not practical. You know I love practicality. We're on practical AI. And some are just hype focused.

Sponsor: 14:57

Some are sales focused. That's why I'm always eager to share about an event that I truly think is practical and useful for people. That's what I discovered last year at the Midwest AI Summit. And they're gonna have another Midwest AI Summit October 15 in Indianapolis this year, 2026. One of the reasons why I love this event was there was an actual AI engineering lounge where you could sit down and talk through your use cases with actual experienced AI engineers and practitioners to really brainstorm and come back from the event with actual solutions and practicality rather than just a bunch of content and slides.

Sponsor: 15:37

But there were also amazing keynote speakers, speakers from, even that had been on the podcast before, like Rajeev Shah, was at last year's event. I would really recommend that you go to midwestaisummit.com. And, for our listeners, you can actually get 20% off with the code PracticalAI20. So go to midwestaisummit.com. Don't miss your chance to attend this event and get 20% off with the code PracticalAI20.

Chris: 16:12

I was gonna ask you, are the models that you're describing, are they kinda the traditional diffusion models? Or you had mentioned in passing a few minutes ago about flow matching, and so it's just in my head, I'm trying to kinda categorize how do the ones that you're talking about right now, how do they fit in? And and, like, and how what is to go back and pull that up, as you were talking about diffusion and you made the reference to flow matching

Dustin: 16:38

Mhmm.

Chris: 16:38

How do those what is that what is that transition and how do those models that you're talking about now fit into that?

Dustin: 16:44

So I'll I'll I'll first keep it very simple and say all of these models are doing the same process of what I described earlier of adding noise and training the model to then remove this noise. The difference kinda This is where it gets a bit more technical in how we actually approach this. So we used to do more of this process called diffusion, which I I don't I don't I don't honestly know if I wanna get into how how deep and technical this is, but essentially what we've done is we've we've cleaned this process up to what we call flow matching, and flow matching is essentially this very simple process of still doing the same thing, still training the model to remove this noise, but fundamentally what it's learning under the surface is anyone out there who has ever seen a velocity map or a flow map, what this essentially looks like is if you can envision a I almost wanna I

Dan: 17:38

don't know, Mark, you got

Chris: 17:39

the whiteboard behind it. Don't know if want the whiteboard. You're gonna

Dustin: 17:41

have to

Chris: 17:41

For the audio folks only. They're not gonna see it, but that

Dustin: 17:44

was great. Probably about

Chris: 17:46

to turn to the whiteboard. Yeah, it was great. Dustin, if you're on video, you saw it, but Dustin was literally about to turn back to the whiteboard behind him, which I love. If I was in the room, I'd be like, Go, man. Take me there.

Chris: 17:57

Unfortunately, a good bit of the audience won't be able to see us. You're going to have to describe it.

Dustin: 18:02

Yeah, yeah, no Let's keep it purely in the audio space. So I'll draw a picture with my words as best as I can. I spent enough time prompting anyway, so hopefully I can do this.

Chris: 18:13

All good.

Dustin: 18:14

But yeah, essentially the way I like to think about it is if you imagine like a landscape, like I don't know, you can imagine your town or the region you're in, and imagine you're looking at it from like the sky and you see like your house, maybe right in the center of this map. And what you might wanna do Now imagine, okay, there's wind going all over the area, and the thing that you wanna do is you wanna be able to throw like a paper airplane from anywhere in this in this landscape, your city, and you want the wind to carry it so it lands on your house. And functionally, what we're trying to train here is essentially that in a much, much grander hyperdimensional space, where instead of it being your house and instead of it being a wind and a paper airplane, what it is is your house in the scenario is what we would call the manifold of real images. It's essentially the place in I'm trying not to use too many stop me if I use too many words here. Wanna

Narrator: 19:16

say biggest.

Chris: 19:17

I'll I'll ask you, but you're doing fine. Keep going. We're fine on technical. We'll just gonna explain it as we go.

Dustin: 19:21

Yeah. It's it's the place in latent space that would be where the actual real images are. So another way I'll I'll I'll try to try to describe in the simple terms, okay, with your your town here, you're in two d space. You're where you are above an x y and the paper airplane has to fly in this x y coordinate, you land and then you have the x y. In this space, instead of it just being two coordinates, it's enough to have all of the colors of the entire image, so that's why I say it's hyperdimensional.
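To put a rough number on "hyperdimensional": the town map needs two coordinates, while a raw RGB image needs one coordinate per channel value. The arithmetic below assumes plain pixel space at an example resolution of 1024 by 1024. Models that work in a compressed latent space use fewer dimensions, and that size depends on the model, so treat this only as an order-of-magnitude illustration.

```python
def image_dimensions(width, height, channels=3):
    """Number of coordinates needed to describe one raw image."""
    return width * height * channels

def colors_per_pixel(levels_per_channel=256, channels=3):
    """Number of distinct colors a single 8-bit RGB pixel can take."""
    return levels_per_channel ** channels

if __name__ == "__main__":
    print("town map coordinates:", 2)
    print("1024x1024 RGB image coordinates:", image_dimensions(1024, 1024))
    print("colors per pixel:", colors_per_pixel())
```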

Dustin: 19:51

It's the center is where literally real images are. And so what we're doing is instead of it being wind and the rest of your town, imagine the rest of your town here is every image that isn't real. And what that means is noise. If you think about noise as a real image, you can go generate a bunch of noise and save it to your computer, but that noise would exist somewhere in this space of all possible images. They're not real images.

Dustin: 20:14

They're they're noise, but they you can you can save them. You know, they're they're they're RGB values. You have them. And so what we're doing with this this essentially this flow matching is we're training the model like, when we say we're we're training it to remove noise, what's really happening under the surface is we're training this flow map so that we can land anywhere in this field of noise. And then these flows, these winds in our scenario, when you take a step, will take you closer to your house or to the manifold of real images.

Dustin: 20:44

And that's what like as you get closer, it starts to look more and more like a real image. It's it gets blurry. It has this. And then when you actually finally land on this this manifold, boom, you have hopefully a real image, but you'll have some image. You know?

Dustin: 20:57

And then the better the model is trained, the better you have mapped this flow to actually take you to where you want to end up.
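The paper-airplane picture can be turned into a toy in two dimensions. The sketch below assumes the simplest case: a single "house" (one data point), straight-line paths from a random start to the house, and an ideal `wind` function standing in for a trained network. The names `training_pair`, `wind`, and `fly` are made up for the example. Real flow matching learns the velocity field from many images in a far larger space.

```python
import random

def training_pair(house, start, s):
    """Flow-matching style training target on a straight path from start (noise) to house (data).

    s = 0.0 is the start point, s = 1.0 is the house.
    """
    position = tuple((1.0 - s) * a + s * h for a, h in zip(start, house))
    target_velocity = tuple(h - a for a, h in zip(start, house))
    return position, target_velocity

def wind(position, s, house):
    """Ideal wind when there is exactly one house: aim at it, scaled by the time remaining."""
    remaining = 1.0 - s
    return tuple((h - p) / remaining for p, h in zip(position, house))

def fly(start, house, steps=10):
    """Follow the wind with simple Euler steps and return the whole path."""
    position = tuple(start)
    path = [position]
    for k in range(steps):
        s = k / steps
        v = wind(position, s, house)
        position = tuple(p + vi / steps for p, vi in zip(position, v))
        path.append(position)
    return path

if __name__ == "__main__":
    house = (3.0, -2.0)
    rng = random.Random(0)
    for _ in range(3):
        start = (rng.uniform(-10, 10), rng.uniform(-10, 10))
        end = fly(start, house)[-1]
        print([round(c, 3) for c in start], "->", [round(c, 3) for c in end])
```

Wherever the airplane is released, following the wind lands it on the house. The quality of a trained model corresponds to how closely its learned wind matches the ideal one.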

Dan: 21:03

And and I'm assuming yeah. No. I and I'm assuming because you are kind of mapping this, from from where you start to to where you end up with this real image, maybe in a way that's, just to be crude, I guess, less random. I mean, at at the end of the day, all AI and machine learning, like, it's sort of that training process is very much trial and error, but we have a lot of optimizations around it. Right?

Dan: 21:29

So I'm I'm assuming that this then allows you maybe to, I guess, advantage wise shorten the the training or make that more efficient? Or am I am I misconstruing that in some way?

Dustin: 21:44

Yeah. Yeah. I mean, definitely, like, one of the things that's taken us a lot further over the last three to four years is just figuring out a lot of optimizations both in the actual training itself and then also in just, you know, architectures of the actual models have improved. But I will say the like fundamental underlying process of just removing this information and finding a path back to it is still the same process. It's it's just like the car you know, you've been having cars go down the road since the nineteen fifties, and we've made better engines and better safety and all this, but you're still driving down the road.

Dustin: 22:19

Like, that that's yeah.

Dan: 22:20

Yeah. That that makes sense. And now, definitely, I I think that's a good setting in the sense that we've got to a point where, you know, your the the models coming out of Black Forest Labs, the models coming out from other places, the models that are even, like, now in my text messaging. Right? I get an image.

Dan: 22:39

I can immediately remix it with an image model. Right? In in some way. So these things are becoming more embedded in our lives, but I don't know if everyone in the audience, some some might have been along for this ride where I was like, oh, cool. I can generate an image of an astronaut riding a horse on, you know, you know, wherever.

Dan: 23:01

And that doesn't seem that practical to folks. So I'm wondering if you could now kind of given that that foundation that we have and we know sort of where we're oriented in the state of, I love how you put it on your website, Visual Intelligence, which I think is, I I I love that statement because it gets to more like language, although it it's being used a lot in terms of agents and intelligence like you mentioned, it is it is very much a subset of the information that we process as as humans. Right? There's this visual element, there's the audio, etcetera. So, now that we have that foundation, could you help the audience understand some of, I guess, the practicalities and the outworkings of the, like, okay, we can do this now.

Dan: 23:47

So what? So how does that how does that help people in the real world other in ways other than maybe just pure creativity? There's certainly like the the, cinema, like you were saying, that side of things. Not everyone's gonna be generating movies maybe, or maybe more people will, I I guess. But, but, yeah, I I think you're understanding what I'm saying.

Dan: 24:08

Like, where where is this gonna impact where is it impacting me now? Where is it practically going in terms of the application?

Dustin: 24:15

Yeah. Yeah. Abs I'm very happy to talk about this. This is a I think we're going through, like, a very nice transition period right now where we finally get to leverage these for some very useful things. And if I'm allowed to take a little bit of a tangent and kinda Please.

Dustin: 24:27

Lead up to it again.

Chris: 24:28

I yeah. Wherever you wanna go is good. We're all good.

Dustin: 24:30

Yeah. Yeah. So, I mean, I guess I'll take it back to, like, a little bit of a timeline of things. So okay. So so early on, we had these nice models where most people recognize them for you put in a prompt, you get an image out, you put in a prompt, you get a video, maybe you get a song.

Dustin: 24:45

It's just this one way okay, you make something, this appeals to maybe creators or ad agencies or movie, you know, now cinematographers. And and I mean, we love this stuff. We love the creative side of this and and and what it allows because it fundamentally to me, this is like a nice potential communication tool. But then I feel like something changed. I don't wanna say changed, but there was definitely a timeline moment when we started to move into editing.

Dustin: 25:15

So was it two years ago now? A year ago? I don't know, around a year, give or take, we released our first in-context editing model called FLUX.1 Kontext. And on the surface, you would look at this and go, okay, well, this is an image editing model. This is something where you could take a photo, you can clean it up, you can add a hat, you can do silly things, you know, you can whatever you wanna do with it.

Dustin: 25:38

It's a general editing model that's supposed to supposed to do all these interrelational things. And on the surface, that's very cool. It's it seems like another creative thing. But if you think about, like, what's actually going on under the surface for the model to be able to do this, it has to understand a significant amount of relationships in the world and what it means for these like like how these relationships actually interact together. So if I say take a picture of a I don't know, we have like a water glass here on the table, and I take a picture of this water glass and I say to the model, knock the water glass over and show me what happens.

Dustin: 26:11

The model has to understand some part of the actual world. Like, it has to actually model the world in some way to know, okay, it spills over, maybe something gets wet, maybe x y z happens. And essentially what we're we're trying to do is train a model that could do all of these types of relationships. So so the model is learning not just like this one one thing, but just how the world works fundamentally so you can do this editing. Now we can take this a step further and we can look at all the video models that are coming out right now and you can do a very similar thing.

Dustin: 26:41

You can take an init image and say, okay, this person now grabs a fire extinguisher and puts out a fire and it has to actually understand these relationships to do that. So, while maybe this wasn't our original goal way back in the day, we were trying to make very cool models and figure out how to I I think in some sense we wanted to model the world, but maybe we weren't thinking this far ahead. Inherently through this whole process, we've learned to build these models that are developing this, and I'm really trying to avoid the use of the term world model here.

Chris: 27:13

Because I'm gonna go there if you don't. I'm just telling you.

Dustin: 27:17

Yeah, you might notice me trying to skirt this, because I think it's a little overused these days, we can definitely talk about it.

Chris: 27:25

But fundamentally, this is too many people, which is where I was going.

Dustin: 27:28

Yeah. Fundamentally, this is what these models are doing when you train them at scale and you and you really train them to be general and robust is they need to be able to simulate parts of the world to get an output. And on on one end, you can take that and go, okay. Well, this is nice for creativity because you can make a film scene and it looks nice. But on the other side, this is why we're starting to move in this in this era, you know, this this area we're we're calling visual intelligence, and now we're starting to see how we can actually leverage what we're calling not just us calling it, this is this is a general field term, but the representation inside of this model of the world to actually go do practical things.

Dustin: 28:03

Now this isn't to say we're gonna drop the creativity side of things, we're still definitely pushing on that side. This is something that we're all still very fond of, but looking forward, if our models are understanding these relationships and the physics of the world like this, well, this is a great place to put it in say robotics and actually, okay, a robot has this understanding, this model of the world inside of it to go act and take actions with confidence in the world. So I'll leave it there as kind of trying to build up to it, but

Chris: 28:34

Let me ask. I'm just gonna go there because that's kinda you're in the area that I spend all my time, which is, you know, embodied intelligence, robotics, you know, UXVs and stuff like that. That's my world. And so in even within our space, the notion of world models has there's a lot of interpretation. If you get into a meeting with 20 people, there's 20 different definitions, and we have to start off by sorting all that out, you know, in terms of how we're communicating.

Chris: 29:02

As you add in this notion there and of this kind of contextual understanding, you know, that it has a representation. I am curious before we move on. Do you how do you do you and you've mentioned now robotics. Do you see it as the same thing, or is it kind of yet another variation of a world model, you know, using the word and stuff? Like, how how closely do would you would you believe those two to be related, you know, as you're talking about that?

Chris: 29:31

Just because both are big topics, you know, and Sorry.

Dustin: 29:33

Sorry. Can I can I just ask for clarification? You asked, like, how close do I believe, like, our models are

Chris: 29:37

kind of, like,

Dustin: 29:38

a world model or a

Chris: 29:39

Well, like, when you say world model, I I'm just kind of trying to to clarify the same thing I did when there was 20 people in the room and you're asking what they mean by it. Mhmm. And you as you mentioned robotics and stuff, we all need this representation of the world out there so that we can do better about about acknowledging the context of what we're working on, whether it's robotics or in visual intelligence, presumably. Are they I'm just curious. In your mind, do you think that they are very closely related, or are they kinda distinct ideas of what a world model is?

Chris: 30:09

What is your take on that?

Dustin: 30:10

I would say that they're pretty related. I mean, to me, fundamentally, under the surface, what we're doing is trying to teach the model as much as we can about the world that we exist in and then asking it to utilize that in some way. And up to this point, it's basically been mostly through creative mediums. But that representation that we're talking about, and again, I'm trying to avoid this term because, I don't know, it's a little overused, but it is what it is. It is modeling. It's creating a representation of the world we live in, and then it is essentially using that intelligence to be able to act in it.

Dustin: 30:48

So the idea is that if it understands enough of these relationships and the actual, like, physical properties of it, it's a really good foundation to, you know, build and train robotics on top of.

Chris: 30:58

Gotcha.

Dan: 30:58

And I

Dustin: 30:59

Hope that answers that.

Chris: 31:00

That did. No, that was good. Yeah. Thank you for putting up with that.

Chris: 31:05

I was just curious.

Dustin: 31:05

No. Not at

Chris: 31:06

It's not unusual to navigate that. So go ahead, Daniel. Sorry about that.

Dan: 31:11

Yeah. As the non-robotics person in the room, that was really interesting. I appreciate you going into that. And I'm wondering: there's that kind of outworking of some of this, where you are now kind of understanding the context that's in this model, maybe how it's representing the world. There are also things that I've seen just crossing my path, whether it be kind of in ecommerce or, like I say, my phone, my text messaging.

Dan: 31:40

So on the web, where these models are being more and more integrated into workflows, whether that be kind of like "try on these clothes or glasses" or whatever, or creative tools in, like, visual editing platforms. Are you seeing that with your models, in terms of kind of what the state of that is? And I know, I wanna get into your model families here in a second, and they do different things. Right? And I know there are some that are open weight, so you might not know all the ways that they're being used. Right?

Dan: 32:18

But from at least those partners that you're working with, in terms of today, what are some of those creative and maybe most practical uses of these models that you see out there, beyond just kind of the fun image generation side of things?

Dustin: 32:36

Yeah. Yeah. No. Absolutely. I would say, again, I'll come back to the moment we started to get context into the model that wasn't just text. It was a fundamental change in what these models could do and how people actually worked with them.

Dustin: 32:50

Now, our first model, FLUX Kontext, could only take one image reference. It was mostly used as an editing model, but you could reference a picture of a product and then tell it to generate, like, a nice product photography set, and this was very nice. Then, moving on, we introduced the FLUX.2 family and then Klein, the speedier, smaller variant, which now could take many references. And this kind of comes back again to what I was just saying about having this representation of how things relate: the better we actually build this representation, the more interesting the ways you can essentially tie in all of these different things. So, I mean, I'll throw out one of the most commonly used, or I don't wanna say commonly used, but kind of obvious cases: clothing try-on. Like, you know, here's a picture of me, here's a picture of some clothes.

Dustin: 33:44

Could you please, you know, show me what this looks like? I actually did some home decoration earlier this year, just trying to see, like, what different couches and furniture would look like in my place. One of the most interesting uses, I'll say, that kind of stood out to me, that I saw at a hackathon (and I don't know how much this could be used for real planning purposes, but I just thought it was interesting), was someone taking pictures of fire exits in a building and then generating what it would look like if a crowd was trying to leave through that fire exit in an emergency. Brilliant. Yeah. So they could actually gauge, like, what would this look like in an emergency? Like, where is the crowd?

Dustin: 34:27

Like, I don't know. Obviously, there's, you know, a generated component, so you have to take it with a bit of a grain of salt, but you could still get a general idea of what this would look like under this scenario. I thought

Chris: 34:36

Without extremely derailing us, you just sparked a whole bunch of ideas on that in my head. So, like, I'm just like, oh, this is great stuff. Keep going. Sorry.

Sponsor: 34:46

If you've been listening to the show over the past few months, you realize just how transformative agentic AI is, whether that's Claude Code or Hermes Agent or custom built software that you're deploying for operational efficiencies or as new products to your customers. Regardless of your maturity now, this is the world that we're headed towards, this agentic AI world. And there's a lot of security and governance teams that aren't letting these agents go into production because of risks related to agency and autonomy and how do you take care of things like prompt injections or insecure tool usage? There's a lot to take care of, and that's why I'm personally spending my time outside of the show working with an amazing team of AI engineers to build Prediction Guard. Prediction Guard is an AI control plane that you run in your own infrastructure behind your firewall.

Sponsor: 35:44

Developers can build on top of this control plane using everything that they wanna use, OpenAI and Anthropic compatible APIs, MCP servers, frameworks like LangChain, but all of this is plugged into a built in governance harness that enforces your organization's AI policies, and all of that telemetry goes back to your monitoring and alerting systems. I'd encourage you to check out what we're doing at predictionguard.com/practicalai. You can schedule a demo with me and the team and I'd love to get your feedback on what we're doing. So visit us at predictionguard.com/practicalai. That's predictionguard.com/practicalai.

Dan: 36:28

So you mentioned the FLUX family of models. So that's the Black Forest family of models, or at least some of the models that you've worked on. You mentioned a couple by name, but could you give us a tour of the family of models? I know that there are some on Hugging Face. People can find them and look at them. But maybe just give us a little bit of a tour of the model family and then anything that you wanna share about some of the distinctions between the different ones.

Dustin: 37:03

Yeah. Yeah. I mean, I would say fundamentally, at our core, we're still like a research lab that wants to push the boundaries. So with each one of our model releases, we want to try to level up some capability of the model: not just make a general improvement, but really see something new with it. So, I mean, there's not that many models, so I can take you back through a little bit of history. It was around two years ago now that we released our first family of models, which was the FLUX series, and this was the first set of models that we created after we formed the company, after a very fun but intense, I think it was around four- or five-month, sprint to really, like, build our chops here and get something out.

Dustin: 37:48

And with that, we came up with, at the time, this distinction between the models. So we had the FLUX.1 Pro model, which was on our API. We had the FLUX.1 Dev model, which was this commercially licensable model, but the weights were open for people out in the world to use. And then we had the FLUX.1 Schnell model, which was a high-speed, like, step-distilled model that was totally, I don't know if it was MIT or Apache, but it was open to use for whatever purposes you want. Then from there, we worked on our Tools series of models, where we realized, like, we needed a lot more control with the models. And this is where, alongside this work, the Kontext project started: okay, there are people who want a lot more control of the models.

Dustin: 38:42

These models are learning a lot of relationships. How can we leverage this? Which led up to the FLUX Kontext release, which I talked about. This was our big edit release. Everything I mentioned up to this point is still kind of our, like, historical models.

Dustin: 38:52

Now we're getting into our more up-to-date models, although they're getting a little bit older now. But we have, I don't know, something I'm excited about coming. I don't wanna give dates, but soon, hopefully later this summer, and that'll be very exciting to talk about when it comes out, but

Chris: 39:07

Looking forward to it. Yeah. We'll have to get you back on.

Dustin: 39:09

Definitely. Yeah. Happy to come back. But then we got to our FLUX.2 series of models, and FLUX.2 was a very big upgrade for us, where we really pushed the capabilities not just on T2I (text-to-image) but also on editing, and not just on single images: this is where we also introduced, like, the multi-image, kind of omni edit, where you could put in many people, many different items, and have all sorts of relationships. This is where we started to see, you know, not just these kind of more standard advertising use cases, which were still the most common and which we really wanna support, but, like, some more interesting things that people kind of build, you know, things like this fire exit thing or other types of stuff.

Dustin: 39:44

And then right after that, we came to our Klein series of models, which was essentially, like, a size distillation, where we really wanted to pack as much performance as we could into a small model, both for use on our API and also just for open-weight release. We know that there are many people out in the world who use our models locally, and we really wanted to make sure they had something powerful they could use. So it was a text-to-image and editing model. It was a very fast model, and it was pretty small in comparison; you know, it was even smaller than our FLUX.1 series. And then we've done a further speed-up on that with our Klein KV model, which introduced, I believe for the first time, KV caching. I'll just say it's an optimization technique that's very common in the language model world that we were able to bring into the editing world to get a very, very big speed-up on local editing.
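For readers unfamiliar with KV caching, the following is a minimal sketch of the general idea as it is commonly used in language models: the key and value projections of tokens already processed are stored and reused, so each new step only projects the newest token. The conversation does not describe how the Klein KV model applies the technique, so this is not a description of that model. Every name here (`KVCache`, `step_with_cache`, the toy weight matrices) is hypothetical and chosen only for illustration.

```python
import math


def project(x, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(len(matrix[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


class KVCache:
    """Stores key/value projections so earlier tokens are never re-projected."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.keys, self.values = [], []

    def append(self, token):
        self.keys.append(project(token, self.w_k))
        self.values.append(project(token, self.w_v))


def step_with_cache(token, w_q, cache):
    """Project only the newest token, then attend over everything cached."""
    cache.append(token)
    return attend(project(token, w_q), cache.keys, cache.values)


def step_without_cache(tokens, w_q, w_k, w_v):
    """Baseline: re-project every token's key and value at every step."""
    keys = [project(t, w_k) for t in tokens]
    values = [project(t, w_v) for t in tokens]
    return attend(project(tokens[-1], w_q), keys, values)


# Toy 2-dimensional weights and inputs (made up for the example).
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.1], [-0.3, 0.2]]
W_V = [[1.0, 0.0], [0.5, -1.0]]
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache(W_K, W_V)
for n, token in enumerate(TOKENS, start=1):
    cached = step_with_cache(token, W_Q, cache)
    full = step_without_cache(TOKENS[:n], W_Q, W_K, W_V)
    # The cached path gives the same result with fewer projections.
    assert all(math.isclose(a, b) for a, b in zip(cached, full))
```

The saving comes from the projections: without the cache, step n re-projects all n tokens, while the cached path projects one new token per step and reuses the rest.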

Dustin: 40:36

So, for people who wanted to actually use these models locally, but also, you know, we serve this as well. And now, since then, we've done a couple of blog releases, one a fairly fun research blog on something called self flow, which I'm repping partially because I am on there. I'm not the lead author. Our lead author is incredible. He list.

Dustin: 40:57

She's amazing. But then now we're kind of all pushing forward to our next big release, which is hopefully gonna be... I'm quite excited about

Dan: 41:06

That's awesome. And just to follow up on that, I know some people out there, our listeners, are always trying things on, you know, their laptop or wherever they're pulling things down. I know for quite some time, when I tried to access some of these models and run them myself, it was either very difficult to find, you know, with my limited resources, the right kind of configuration to run this, or it was sometimes incredibly slow. So, just to give a sense of some of those, you know, more hardware-optimized models that you mentioned, I think Klein and other models, maybe my question is: can I reasonably run one of these models on my laptop now and create a great image?

Dan: 41:56

Like, what's required here? Certainly, I'm not gonna serve my production web app in an enterprise environment off of my laptop. But just maybe give us a little bit of a sense, because that has changed on the LLM side, right, where that has continually updated. Obviously, the smaller models are not up to the same output quality as the larger models, but you can run a small model, you know, even on a CPU now, at least for some tasks, and that is pretty reasonable. So how has that progression happened on the visual intelligence or image generation side?

Dustin: 42:32

Yeah. I would say, I'll first state that the LLM world definitely has a lot more people working on it, so they definitely have a lot more of this up and down. That said, with our Klein series, I haven't run this personally, so maybe I'm not the best person in the world to speak on this, but I'm 99% certain you could run this on, say, a modern M-series Mac, like a MacBook Pro. As for speed, I don't have any numbers I can quote because I haven't tested this myself. But this is always a big kind of trade-off in this world. You know, even on the language side, a lot of the new big open releases that people are excited about are getting into the hundreds of billions of parameters. And this is why we did the Klein series, for example, because FLUX.2 was a 32B model. It was quite chunky. You could run it on a higher-end local computer, but it was slower to run locally, and you needed some more powerful computing. This is where the Klein series popped in. But it's always a trade-off, because we wanna push performance and make something that we're really proud of, and then we wanna try to bring that back into a smaller, faster scale.

Dustin: 43:43

And this is something we've continually done: with our FLUX.1 series, we had Schnell; with FLUX.2, we had Klein. I don't wanna promise anything going forward, but we definitely wanna keep bringing this bigger power that we generate into these smaller models that people can hopefully run locally.

Chris: 44:00

Absolutely. Super cool. I'm excited to try some of those. As we are starting to wind up, you guys have done some really, really cool work in this space. One of the things we always like to ask as we're closing out these episodes is to get a glimpse into your head, into your thinking about what's to come.

Chris: 44:24

And to frame it a little bit: think about when, you know, you're out of the day's work and you're enjoying your evening and your mind wanders and you're kinda thinking about, like, down the road. You know, where your passion is taking you and what you'd like to see. Can you share a little bit about the future that you would like to create with visual intelligence, in terms of what's next that you guys haven't addressed? I'm not asking for a product release so much, because I know those need to stay in, but kind of just, you know, what is your aspiration? What's your dream in terms of where these kinds of capabilities might ultimately lead?

Chris: 45:07

And I'd love to get your kind of dream context, if you will, as we finish up.

Dustin: 45:15

Sure. If I'm allowed to give two split answers here, there's a

Chris: 45:21

Totally fine.

Dustin: 45:21

There's kind of two areas that constantly sit on the front of my mind. At some point, they'll hopefully become one, but I imagine the next and this is no glimpse into saying what we're doing, but this is stuff I'm personally excited about.

Chris: 45:33

Fair enough.

Dustin: 45:33

Well, I would say the two areas that I'm the most excited about: one is long-context, truly multimodal models. So nowadays, obviously, we're seeing lots of people work with agents constantly. And on our front, where we have these more, like, continuous generative models, we're starting to introduce more and more contextual usage with these references. I'm looking forward to when this kind of bridges a little bit more, when it's not just that, say, the agent calls the generative model, but this becomes more or less one model that not only can, you know, do work in text and language and agents, but actually maybe can think visually as well and generate audio for you, and has the full context of, say, all the stuff you've done over the last few weeks. You know, you don't need to give it the reference, give it the right prompt. It just, like, already has this context and can reference it as needed in this, like, continuous space.

Dustin: 46:34

This excites me a lot. The other side, I'll say, is the real-time stuff. You know, we're seeing a little bit of it now out there. I think that this is, like, very early, and I think it's gonna be very exciting for real-time video, audio, duplex interactions, you know. You see a little bit of this, like, interactive stuff with, like, Genie, where, you know, you play the game. But also you can bridge this back into robotics, where it needs to take in the real world in real time and make decisions.

Chris: 47:04

Yeah. I was gonna say that sounds really familiar in terms of interest on that side of things. Mhmm. So, yeah, really cool. Great conversation, and your lead-in to what Black Forest Labs is doing was also really good contextually, in terms of kind of explaining.

Chris: 47:22

So I hope our audience got a lot out of that. Dustin, thank you very, very much for coming on the show. Great conversation. And as you've hinted, there are things to come, and I'm looking forward to having our next conversation as things move forward a little bit.

Dustin: 47:38

Absolutely. Thank you so much for having me, guys.

Narrator: 47:44

All right. That's our show for this week. If you haven't checked out our website, head to practicalai.fm, and be sure to connect with us on LinkedIn, X, or Bluesky. You'll see us posting insights related to the latest AI developments, and we would love for you to join the conversation. Thanks to our partner Prediction Guard for providing operational support for the show.

Narrator: 48:05

Check them out at predictionguard.com. Also, thanks to Breakmaster Cylinder for the beats and to you for listening. That's all for now, but you'll hear from us again next week.
