What if we could? A podcast exploring this question that drives us.

In this episode of The What if We Could Show, hosts Bob Ullery, Kevin Nuest, and David DeVore delve into the world of generative AI, focusing on diffusion-based image generation with Stable Diffusion and the node-based tool ComfyUI. The discussion begins with an overview of the generative AI ecosystem, highlighting the simplicity and specialized nature of many commercial tools available. Bob emphasizes the fun and creativity unlocked by learning to use these tools effectively, and how accessible it has become for beginners to experiment with different models.

The conversation transitions to Comfy UI, a more advanced platform that allows for intricate control over the generative process through a node-based interface. Bob and Kevin explore the capabilities of Comfy UI, demonstrating how users can craft complex generative workflows, such as image generation with specific characteristics, by linking different nodes for inputs and outputs. They illustrate the process with a practical example of generating an image of a dog on a skateboard, explaining the significance of positive and negative prompts, the selection of models and samplers, and the use of low-rank adaptation (LoRa) models to enhance details.

As the episode progresses, the hosts address the challenges and techniques associated with text generation within images, the potential for creating 3D objects, and the nuances of maintaining character consistency in storytelling or branding. The discussion also covers the practicalities of using Comfy UI for business applications, such as generating branded content and adhering to brand guidelines through generative AI.

Throughout the episode, the hosts share valuable insights into the iterative nature of working with generative AI, the importance of trial and error in achieving desired outcomes, and the future potential of these technologies in various creative and commercial fields. The episode concludes with a reflection on the educational aspects of the discussion and the encouragement for listeners to explore the capabilities of generative AI tools like Comfy UI and Stable Diffusion.

What is "What if we Could?"?

"What if we Could?" is a podcast exploring the question that drives us. We explore the practical application of artificial intelligence, product design, blockchain & AR/VR, and tech alpha in the service of humans.

David DeVore (00:00)
What's up? This is David DeVore with The What if We Could Show and today we're going to do something a little bit different. We have been, um, you know, taking a hard look at what's going on in the world of AI and tech alpha, primarily through the lens of the news. Um, but for our day jobs, we actually do, we were actually, you know, roll up our sleeves and do this stuff. So we thought that.

For a change of pace, what we would do today is we would teach something. And we've been deep in playing around and working inside of Stable Diffusion, and have also found a number of tools and tricks, one of which is called ComfyUI. And for those who are familiar with Stable Diffusion, it is an image generator, a generative image generator.

And of course, if you're familiar with Midjourney, you're familiar with the idea: you put in a description and it gives you something back. What Stable Diffusion and ComfyUI do is give us a lot more control over the images. So if you want an image that is trained on a very specific look or feel, you may use something called a LoRA model.

And if you want multiple images to start to have sort of consistency, then we found ComfyUI to be really, really interesting. So what we're going to do is we're going to have Bob Ullery here take us through it and crack it open. And, you know, Kevin and I will ask lots of good questions along the way. Take it away, Bob.

Bob Ullery (01:25)
Okay. So before we jump into Comfy, I thought I'd just kind of level set the ecosystem a little bit. You mentioned Stable Diffusion. There are many implementations of diffusion-based generation out there, a lot of commercial-grade tools, and they're all really wonderful, even in the sort of nascent state of this industry. Most are very specialized, and I think they take a very simplistic approach, which is great

in terms of driving adoption and getting people using these tools, which are incredible, both in their output and as a ton of fun to use once you, you know, learn how to make them sing a little bit. And usually, when you come to the realization, coming out of something like Midjourney, that you need more control, you find yourself looking at Automatic1111, and that's what's on screen right now.

Under the hood, Automatic1111 and ComfyUI use essentially the exact same Stable Diffusion framework. Automatic1111 is much more of a form-based UI, and so it's really great as a starting point as you start to try out different models. And well, what is a model? A model is really just a state in time of a trained set of weights in a network that represent

the data it was trained on. And so you'll see hundreds, if not thousands, of models that are free to use and download that represent all sorts of different styles. I think of these as really huge models trained specifically to look one particular way, aesthetically. One place I'd point folks to check those out is Civitai, spelled C-I-V-I-T-A-I dot com, a really great marketplace, and it's free,

though they have some premium functions on there. You can even run models directly from Civitai's interface, which is a great starting point. But in general, that's kind of the go-to place to discover new models that might influence the type of vibe you're looking to output in your generations. And Automatic1111 makes it really, really easy to both install directly from Civitai by way of an extension, and then leverage those models within your generations, as I mentioned, in a form-based approach.

And that's kind of what you're looking at here. Here's my positive prompt. Here's kind of what I want to see in the image. And this is my negative prompt, what I definitely do not want to see in the image. The rest is a bit of minutia, the settings and configuration behind the actual models and the components that are in use in this generation. We won't go through everything here in detail. We'll go through a little bit more of these specific inputs on the comfy side. But in general,

If you're looking to get started, a great spot is Automatic1111, and you can run that locally on your machine as well. We started with Automatic1111. We actually did a really huge generative project in the summer last year, sort of drafting off the Pepe frog vibe. We did a collection of 2000 individual Pepe frogs, all with their own unique backstory, and ultimately what we were doing behind the scenes was hitting an Automatic1111 instance

through its API to generate these Pepe frogs as each one came out of the pipeline. But in that effort, we learned a lot and realized we needed a lot more control over these generations for sure. And Automatic1111 can get you there; it's just not intuitive in that way. There is this notion of sending results from this screen over to this tab over here, and then from that tab over to this other tab, and...

It can work that way, just mentally not a great way to visualize a network of things going on. And skip to the prestige at the end, when we talk about doing things like consistent character development, these actually turn into incredibly complex systems of many, many hundreds or even thousands of steps required to achieve the desired result.
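As a concrete illustration of the API-driven approach Bob describes, here is a rough sketch of hitting an Automatic1111 instance over HTTP. It assumes the web UI was launched with the --api flag; the host, prompt, and settings are placeholders, not the actual Pepe pipeline.

```python
# Sketch of calling an Automatic1111 instance through its HTTP API.
# Assumes the web UI was started with --api and is reachable at the URL below;
# the prompt and settings are illustrative only.
import base64
import requests

A1111_URL = "http://127.0.0.1:7860"  # placeholder host

payload = {
    "prompt": "a cartoon frog, comic book style, detailed background",
    "negative_prompt": "blurry, extra limbs, watermark",
    "steps": 25,
    "cfg_scale": 7,
    "width": 1024,
    "height": 1024,
    "seed": -1,        # -1 lets the server pick a random seed
    "batch_size": 1,
}

resp = requests.post(f"{A1111_URL}/sdapi/v1/txt2img", json=payload, timeout=300)
resp.raise_for_status()

# The API returns generated images as base64-encoded strings.
for i, img_b64 in enumerate(resp.json()["images"]):
    with open(f"frog_{i}.png", "wb") as f:
        f.write(base64.b64decode(img_b64))
```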

We are not gonna start here. I know, right? This is ComfyUI. I'm gonna show you a simpler version of this here in a second, but in general, you can think of it as the ability to link different nodes together in the form of inputs and outputs. Each node does its own thing; it's targeted, it's niche, it does one thing really well. What inputs does it need?

Kevin Nuest (05:44)
What is that spaghetti?

Bob Ullery (06:10)
And what is it going to spit out the other side that another node might want to consume so it can do its work? And Comfy is really, really great at really orchestrating this DAG, right? This network graph of actions. And you're not limited or bounded to any particular paradigm. Again, this is a very complex view, but I built this from scratch, right? And sort of move these nodes around and organize them in a way that made sense mentally to me.

Though over time, quite literally, it turns into spaghetti. So my next step here is to clean this up a little bit. We won't start here. I know this is daunting to look at, but over the next, I don't know, 20 or 30 minutes, we'll give you probably the top 10 things you need to know to get started with simple image generation on Comfy, and you can expand your learnings from there. I think one great thing about Comfy as opposed to Automatic1111 is it sort of forces you to get into the nitty gritty.

Which is good if you're really looking to learn how this stuff works. And if you're looking to bring it into your repertoire and create valuable assets, um, it's very advantageous to know, you know, what each dial does in the flow and you can really command it and create those outputs that you're looking to do without a whole lot of iteration and trial and error sitting there for hours or even days until it kicks out the image you were, you were hoping for.

So let's jump in, let's start on a fresh graph. And so a few things, as I drop nodes on this canvas, I'll take a moment, I'll describe what those are, both from like a technical perspective, but also thank you GPT, and explain like I'm five version. So for the non-technical folks where this might sound like gibberish, maybe the ELI5 versions will be a little more helpful.

David DeVore (10:03)
Is there a hosted version of ComfyUI where somebody could run it, or is it purely local?

Bob Ullery (10:11)
There are, yeah. In terms of consumer grade, I'd have to do a Google search, but you could probably look for hosted ComfyUI; I'm assuming there's a few out there. Otherwise, how we run it, since we're doing large volume, is on GPU-based hosting services; RunPod is our sort of go-to. You can sort of think of it like an EC2 box on AWS, or running a server.

That server has a ton of GPU horsepower. It's got a lot of memory, which is also required as you start to inch up into more complex generation. Memory is usually the limiting factor in terms of what you can and can't do, though not always a blocker. What's interesting here too is this notion that you could break up multiple steps in a big ComfyUI workflow into smaller.

sub-workflows and just call those directly and move the outputs in between those, if you run into some of those limitations. But in general, the more memory, the better. And so we would recommend RunPod. There are some hybrid, sort of in-between implementations, Replicate being one of those, where really it's more of the same, but they're not necessarily going to provide you with the ComfyUI interface, though you could run your ComfyUI workflows on Replicate in a similar way.
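For the programmatic, sub-workflow style of running Comfy that Bob mentions, a minimal sketch might look like the following. It assumes a ComfyUI server is reachable at the URL shown and that the workflow was exported in ComfyUI's API format; the node id and file names are illustrative only.

```python
# Minimal sketch of driving a ComfyUI server programmatically.
# Assumes a ComfyUI instance (local or on a GPU host like RunPod) is listening
# at the URL below and that the workflow was saved via "Save (API Format)".
import json
import requests

COMFY_URL = "http://127.0.0.1:8188"  # placeholder host

with open("dog_on_skateboard_api.json") as f:  # placeholder file name
    workflow = json.load(f)

# Tweak an input before queueing, e.g. the positive prompt text.
# "6" and the field names are assumptions about this particular exported graph.
workflow["6"]["inputs"]["text"] = "a dog riding a skateboard, birthday hat"

resp = requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow}, timeout=60)
resp.raise_for_status()
print("queued:", resp.json())  # returns an id you can poll for the finished images
```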

Okay, so let's get started. I mean, the first step of really any generation is deciding which model you want to use to accomplish the aesthetic you're looking for. And as I mentioned before, Civitai is my go-to. Word to the wise here: they have both safe-for-work and not-safe-for-work models, you know what I mean? So make sure your filter is set to safe for work if you are currently in your office browsing Civitai, because there are a lot of not-safe-for-work models here.

Here comes the AI-generated pornography that we're gonna be inundated with on every channel soon enough. But for now, it's also incredibly valuable for benign and family-friendly assets, which is what we're gonna focus on here. So if you wanna look for cool models, go ahead to Civitai, look for your models, and you can kind of scrub through. I've got not-safe-for-work blurred out, thank God. You'll see there's quite a few of them here. Pardon any weird titles; maybe I can just click one and move on

so we don't get that in the screenshot. So this one's called Violent Explosion Style. It is a model just meant to help you generate violent explosions. And you'll see these very, very niche models, which come in incredibly handy when you're trying to do something very specific. Think of it like expertise. If you were creating an action movie and you were maybe designing a poster here, you'd want...

you know, huge explosions and just really action focused assets. And you could apply one or more of these models to achieve that effect. What I'm going to show you here is just going to use a single model. We're not going to get into model merging, but know that you can load many models and sort of merge or average them together before you head out on the generation side. So let's go ahead and pop a checkpoint in. You can double click anywhere on the ComfyUI canvas.

and you can do a quick text search for what you're looking for. There are hundreds, if not thousands of nodes in here, so you sort of need to know what you're looking for. By the way, there is also a Comfy UI Manager down here in your toolkit where you can very easily come in and install custom nodes that do specific things. So this is a great way to sort of understand what you have installed versus not, and look through the documentation of what's in your Comfy UI implementation. But anyhow, let's go ahead and load our checkpoint.

I'm gonna zoom in here. And I've been using a model called Poltergeist a lot lately, which is really cool. It's kind of like a comic book effect. So I'm gonna go ahead and use that. And I pre-installed that through my manager and so now I have access to it. So we have our model and that's not enough to do much with. Explain like I'm five, a checkpoint. You could think of the checkpoint like if you were building a big Lego castle.

And every hour you took a picture of it. So you sort of remember how far you've gotten. And if you have to stop, you can look at the last photo and start building again from there. And so a checkpoint is kind of like a photo for a computer program that's learning. And it's a way for it to remember where it left off. As I mentioned, the checkpoint is a snapshot. So this is the current snapshot of the Poltergeist model. The Poltergeist team that maintains and innovates on that model, doesn't need to start from scratch when they come out with the next version, they would

continue feeding it new training data and continue to expand its knowledge on the current implementation usually as an approach. Now, the next thing I wanna do is I like to apply a LoRa model to this as well. LoRa is short for low rank adaptation. And so ELI5 here is like suppose you had a magic coloring book and that book can change its pictures a little bit to make them better.

So LoRA is like adding new colors to your crayon box that make the pictures look even more amazing, without needing a completely new book from scratch as you go. So we can start with the Poltergeist checkpoint, which we would assume would yield a comic book-like effect, but we can apply a LoRA model to it to achieve things like really excellent hands, specific faces, explosions,

a Tron-like neon vibe. So you can think of these as accoutrements that you might add into your styling. One LoRA that I really like to use a lot, and I'll go ahead and load this, is the Add Detail one. And all this does, under the hood, is add additional noise to our image, which sort of requires our sampler, which we'll get to in a second, to fix it, right? Think of...

the way this stuff works is we start with a very noisy image, like your TV back in the day, with static on the screen. And every iteration, it tries to paint it differently to achieve what it's set out to create in your prompt. And so that's the concept of noise and this notion of iteration through noise.

Kevin Nuest (16:54)
So the more noise, the more variation then, is that the right correlation?

Bob Ullery (16:59)
Yeah, that's right. That's right. And off.

Kevin Nuest (17:02)
Well, counterintuitive, right? The more static you have in there, the more it's gonna try different things to show you different variations of it.

Bob Ullery (17:11)
Yeah, sorry, it's kind of like if you squint your eyes, right? And you're like, I think that's a car, and I think it's red, but is it a red Ford pickup truck? I can't tell. As the model continues to draw, its eyes get more open and that starts to come into focus; it's continually refining those pieces until it achieves its prompt, or what's called conditioning. So what I'm going to do here, uh,

David DeVore (17:35)
That's a good analogy.

Bob Ullery (17:39)
And I'm going to use this Add Detail XL. It's also a trained LoRA model, so it's trained on images with lots of detail, and it's trying to replicate that as an output. So we chain these things together: my checkpoint model that I loaded needs to go into the LoRA, and the CLIP needs to go into the LoRA.

CLIP stands for Contrastive Language-Image Pre-training. It's a model developed by OpenAI that essentially learns visual concepts from natural language descriptions. So it can understand text and images in a manner that's not specifically tailored to any one task, which allows it to be applied to a wide range of things, including guiding generative models like stable diffusion. An ELI5 for that: you can sort of think of it like a really smart robot that can look at a picture and read a sentence to understand if they match. So if you send it a picture of a dog and you say, this is a cat, it knows it won't match. It does not match those things.
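A quick sketch of that dog-versus-cat matching idea, using the openly published CLIP weights through the transformers library; the image path is a placeholder.

```python
# Sketch of the "does this caption match this picture?" idea using OpenAI's CLIP
# weights via the transformers library. The image path is a placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_on_skateboard.png")
captions = ["a photo of a dog", "a photo of a cat"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # similarity of the image to each caption

probs = logits.softmax(dim=1)[0]
for caption, p in zip(captions, probs):
    print(f"{caption}: {p.item():.2%}")  # the dog caption should score far higher
```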

Okay. So we've got our model, which is sort of the macro vibe we're going for. We've got our LoRA, which is the detail vibe we're going for. And now we want to tell it what to make. So what I'll do is I'll bring in a node called CLIP Text Encode, and I'm going to bring the CLIP in again; this is, again, the ability for it to understand the baseline relationship between visual and natural language, but this is where I define what I want it to create. This is what's called a positive condition. So I'm going to create a dog riding a skateboard. And the other nice thing about Comfy is just, you know, color coordination, so you can visually understand what each of these nodes does. Generally, if I'm talking about something positive, like a positive prompt, I'm gonna change that color to green. So when I see it, I know this is the positive side of my prompt, and we can also copy and paste stuff in here. I also need a negative version of this. So I'm gonna go ahead and make a red box, and I'm gonna put a cat. I definitely don't want a cat. So the CLIP comes in and the conditioning comes out, my positive conditioning and my negative conditioning. Ultimately, those are gonna go to my sampler.

A sampler. So I won't give you the ELI5 first; let's give you the other version. You can think of it, particularly in the context of generative models, as a method to generate samples from the model's output distribution. In the case of stable diffusion, samplers are algorithms that guide the denoising process and affect the quality, the diversity, and the characteristics of the generated images. And so the KSampler is actually a ComfyUI-based term. It's just a specific implementation of a sampler. Within it, we're gonna choose the sampler we want to actually use. And there are many and they're all sort of similar, but they trade off between speed and image quality.

The one you'll probably see the most is Euler. DPM, DDIM are also pretty frequently used. And depending on what you have installed, you can kind of cycle through those. Ultimately creating great outputs is a function of trial and error. A lot of these are gonna be related to the model you choose. And so depending on the model you choose for your generation, it's always best to look through the documentation of that model and see what they have optimized it for. Usually Euler works well across many models.

Some may be optimized for something different, so it's good to take a look at that and make sure you align to it. So what we're gonna do here is: our positive prompt condition goes into the positive condition of the KSampler, and our negative goes into the negative condition of the KSampler. Our model, which started here and has now been affected by our LoRA, is an output of that LoRA step. I'm gonna go ahead and chain that over. So now our sampler has our model and our positive and negative conditions.

What it's missing right now is a latent image. Latent, in machine learning terms, has to do with compression and decompression: you can think of it as a representation of data in what's called latent space, and a latent space is a compressed, abstract representation of the data that the model is using internally.

We'll use the latent here to control the dimensions of the output image that we want. Though there's lots of different ways to use latents, I'm going to go with an empty latent image here. I'm going to go ahead and create it, and we'll go like 1024 wide.

Batch size is really how many images I want to generate here. I'm going to connect the latent to my latent image input. And then from here, the KSampler outputs a latent, more specifically a compressed image tensor, I believe, which is a representation of potentially even many images, and we need one more step to turn that into something you can look at.

That step is a VAE Decode; VAE is a variational autoencoder. Let's see, the ELI5: you can think of it like you have a magical shrinking machine, right? It can take a toy and shrink it down to a tiny version, and later you can make it big again.

A VAE is like this, but for pictures or data specifically: it shrinks them down and then can bring them back to normal later. So we need that VAE Decode step between the generation process and the actual image everyone else sees. Now, I need to connect my VAE, my variational autoencoder, and I'll use the one that comes out of the model that I loaded. Another kind of just

quality-of-life helper here in Comfy is, you know, I could do this, connect them across. There's nothing wrong with that, other than as these workflows get really robust and big, it gets a little bit harder to follow your connection lines. So what I like to do is use what's called a reroute. You can just take this and drag it off, click reroute, and you can actually use a lot of these together if you like. That helps me keep my workspace clean so I can easily understand

where these lines are going. And then finally, I wanna see that image. So I'm gonna drag this out. I'm gonna use a preview image here, and that's gonna show me my dog on a skateboard. Hopefully, let's run it.

There we go.

Kevin Nuest (23:14)
Hey, look at that dog on a skateboard.

Bob Ullery (23:17)
Pretty awesome. Now let's mess with it. So dog on a skateboard, no sweat. Maybe he's wearing a birthday hat.

Go ahead and try that.

There he is. And look at that, the words aren't too bad either. That's getting a lot better these days. ComfyUI comes with a ton of nodes specifically for text creation and masking, so you can get perfect text on your image by way of literally inserting it and then repainting over it, versus having the model generate it.

Text generation is getting better and better every single day. And so in my mind, we'll have a model probably this year, I would bet on it, that does text gen with its image gen really robustly and pretty much perfectly. We're kind of already on our way there at this point. Another trick here.

Kevin Nuest (24:26)
And this workflow looks way more approachable than the one that we started with as the example, right? You can see end to end there the nodes doing their individual functions, and you walked through what those functions are. And ultimately you get an image out the other side. Something like this is a much more approachable jumping-off point to then go experiment with, like you're probably about ready to show us some more different things to do. This is definitely...

This is the end to end, how to get an image out the other side. It was great.
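For readers who think in code, the end-to-end graph Bob just built (checkpoint, LoRA, prompts, sampler, empty latent, VAE decode) maps roughly onto the Hugging Face diffusers library as sketched below. The checkpoint and LoRA file names are placeholders, not the exact models used on the show.

```python
# Rough code equivalent of the node graph described above, using diffusers.
# A sketch only: the checkpoint and LoRA file names are placeholders.
import torch
from diffusers import EulerDiscreteScheduler, StableDiffusionXLPipeline

# "Load Checkpoint": an SDXL-class model file downloaded from Civitai or elsewhere.
pipe = StableDiffusionXLPipeline.from_single_file(
    "poltergeist_style.safetensors", torch_dtype=torch.float16
).to("cuda")

# "Load LoRA": a detail-enhancing LoRA applied on top of the base weights.
pipe.load_lora_weights("add_detail_xl.safetensors")
pipe.fuse_lora(lora_scale=0.8)  # LoRA strength knob

# Sampler choice, analogous to picking Euler inside the KSampler node.
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

# Positive / negative conditioning, empty latent size, steps, and CFG.
image = pipe(
    prompt="a dog riding a skateboard",
    negative_prompt="cat",
    width=1024,
    height=1024,
    num_inference_steps=25,
    guidance_scale=7.0,
).images[0]

image.save("dog_on_skateboard.png")  # "Preview/Save Image"
```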

Bob Ullery (25:00)
Yeah, that's right. You know, change the strength of your LoRA, as an example. Here's a good example. So let's start with the concept of a seed. A seed can be fixed, but essentially it provides entropy to the model.

So a random number that if it is truly random, you're going to get a different output every single time. So even if I ran this again, my seed changes. And so I'm going to get a totally different output here.

See? But if I fix my seed, and I'll leave that fixed after the fact, we'll get a starting image, and if I run it again I get the exact same image, no variation. In fact, it's cached. That's good practice for testing changes, right? Everything else stays the same.

If I change one thing in this flow, how is it going to affect my output? The same seed on this checkpoint is going to flow right through there, and that node is not going to affect it. And so now what we can do is really see precisely what our image difference is going to look like. So what I'm going to do, just as a little hack here, is leave my image that was previously generated and remove its connector, so that image stays on screen and everything else stays the same. So the only thing I've changed is that I've added the detail LoRA. That's OK, that's right; it's just telling me that there's no connector to it.

So, very similar dog, right? The LoRA isn't changing the generation process in terms of how it's going to be painted necessarily, just aesthetically how it's gonna look. And so you can see that that detailer gives you amazing features on the face. Cleans up the eyes really well. The tongue looks a little more glossy. The fur is shinier. You get more detail in the skin pattern.

You know, so on and so forth. Especially the, uh, pom on the hat at the top. Ears are better. So that's an interesting way to test out theories as you run different trials of settings. I like that LoRA. The other side here that I'll just dig into, and then we can figure out where we go from there,
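The fixed-seed A/B test Bob describes can be sketched outside of Comfy as well, for example with diffusers; as before, the model and LoRA file names are placeholders.

```python
# Sketch of a fixed-seed A/B test: run the same prompt twice with an identical
# seed, once without and once with the detail LoRA, so the only difference
# between the two outputs is the LoRA itself. File names are placeholders.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "poltergeist_style.safetensors", torch_dtype=torch.float16
).to("cuda")

SEED = 123456789  # arbitrary, but fixed, so the starting noise is identical

def run(label: str) -> None:
    generator = torch.Generator(device="cuda").manual_seed(SEED)
    image = pipe(
        prompt="a dog riding a skateboard, birthday hat",
        negative_prompt="cat",
        generator=generator,
        num_inference_steps=25,
        guidance_scale=7.0,
    ).images[0]
    image.save(f"dog_{label}.png")

run("baseline")                                      # no LoRA
pipe.load_lora_weights("add_detail_xl.safetensors")  # apply the detail LoRA
run("with_detail_lora")                              # same noise, new aesthetics
```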

Kevin Nuest (27:44)
Very cool.

Bob Ullery (27:52)
is the settings on the sampler itself. I'll keep the seed fixed. Steps are really the number of iterations the sampler takes to try to achieve its positive and negative conditioning. You can always overstep it, so it's continually iterating on something that was already good enough, and the results get pretty wild: very contrasty, heavy lines, or an image that ultimately makes no sense at all.

Kevin Nuest (28:11)
Mm.

Bob Ullery (28:21)
And then you've got CFG, and CFG is classifier-free guidance. Think of it like a mood, I guess. It models the distribution of the data in a way that allows you to steer the generation, and so the ELI5 on this is: imagine you have a magic paintbrush.

And you can paint pictures based on your mood. If you're sad, it paints rain. CFG is similar to that. What it really means is: how closely do we want the output to stick to our prompts? Higher numbers, you may think, yield better results. Not always the case. It depends on the model and what you're actually generating. And so both of these things are commonly tinkered with until you get that sort of perfect output you're looking for. I'll usually run steps between 20 and 30, and CFG between five and, I don't know, 14-ish, depending on what I'm trying to generate.
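A small sketch of that steps-and-CFG trial and error as a grid sweep over a fixed seed; the ranges mirror the 20 to 30 steps and 5 to 14 CFG rule of thumb, and the checkpoint file name is a placeholder.

```python
# Sketch of a steps/CFG sweep with a fixed seed, mirroring the trial-and-error
# loop described above. The checkpoint file name is a placeholder.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "poltergeist_style.safetensors", torch_dtype=torch.float16
).to("cuda")

SEED = 42  # fixed, so only steps and CFG differ between the saved images

for steps in (20, 25, 30):
    for cfg in (5.0, 8.0, 11.0, 14.0):
        generator = torch.Generator(device="cuda").manual_seed(SEED)
        image = pipe(
            prompt="a dog riding a skateboard, birthday hat",
            negative_prompt="cat",
            num_inference_steps=steps,   # more steps = more denoising iterations
            guidance_scale=cfg,          # CFG: how closely to follow the prompt
            generator=generator,
        ).images[0]
        image.save(f"dog_steps{steps}_cfg{int(cfg)}.png")
```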

Kevin Nuest (29:31)
Yeah, that makes sense. The steps one's really interesting too from, I had a, I had an art teacher in high school that would do commissioned paintings, uh, and, you know, on, on the side. And so he would work on them from time to time and in his room. And we'd come back like two days later, he's made more progress on this giant mural. It looks awesome. And we're like, that's, you know, give them compliments, give them flowers for it. Right. And then, uh, we'd come back, you know, two more days later and he overstepped. He kept going. He.

added more layers and next thing you know, it ends up with like this super dark, uh, you really can't tell what it is anymore painting and, uh, you know, those that ended up in the trash, uh, effectively. So it makes sense that you keep going too many times, too many passes over it. And it's too far, but you have to find out what is, what is too far. And so it sounds like good range that you found in steps is 20, 20 to 30 gets you, um, gets to there a lot faster.

Bob Ullery (30:26)
Yep. And so here, here's that comparison again. With this model I've sort of trialed it out and it's okay; I'll use 40 here. But usually, part of the reason why some of the dev work takes so long on the Comfy side is you're constantly just tinkering, like minute changes: oh, what if I bring down the CFG by 0.1? What's the effect? And then, you know, extrapolate that out to thousands of little settings and you can really go down that rabbit hole and not find yourself for a couple hours.

Okay, that's sort of image gen in a nutshell. Of course, you can do a lot more in Comfy, everything from frame generation for video gen through frame interpolation to smooth out that video. 3D object generation can happen here in Comfy with a number of new models that have come out. It's really exciting. And so just in general, I've sort of fallen in love with it as a...

as an interface for sort of graph- or network-based programming, which seems to be where everything is headed. It's going to be a tool of the future, and it's easy to extend it and build other interesting things on it.

Kevin Nuest (31:56)
Yeah, the, the really complex one that you showed us that's that you've been, you've spent a lot of time building out. That's got a lot of different nodes into it. It's I liken it to, yeah, that one. I liken it to a classic software development, traditional software development where each one of these nodes or grouping of these nodes would be code. That would be a microservice that does a function.

And then what do you have to do? You have to take, you have to pass data into it to run its function. And then it does what it does. And then it passes it on to the next thing to do whatever the next function is or to display it. And that's really what this is in a very visual way is taking pieces of data and moving them through the functions until you get to where you want to go. And I think it's really neat to be able to hopefully open up access to a lot more people to effectively, yeah, visual programming here with networked.

mindset, and in what kind of problem space will people be able to explore that they otherwise wouldn't have been able to, if this were bits and bytes of lines of code that they would have had to follow along with and try to recreate those functions for each of those microservices. So while it looks daunting, this is an even more approachable version than if it were in, pick your language of code, and think how big of a repo this would be for somebody to go through and try to figure out what's going on.

Bob Ullery (33:26)
It'd be huge. And I was thinking about that yesterday, actually. I spent about three weeks on this, pretty much heads down on it. But had I not had this approach, had this been traditional code or whatever, I think it's three to four years of work.

David DeVore (33:48)
Mm-hmm.

Bob Ullery (33:57)
In this flow, I won't give the detail away because I do think this is a first of its kind, and partly the reason this was a sort of lagged engineering feat in terms of weeks and not hours or days is it hadn't been done before.

Most of the Comfy workflow examples that you'll find online do things like character swaps in an image. Another great site to find some of these examples is comfyworkflows.com. You can literally find something you like, so here's a woman walking on the streets of Tokyo, and you can see that workflow here, download it, and just load it over here and run these things. Really cool.

Kevin Nuest (35:02)
Is that a big download? Is that like gigs of data that you then have to go install?

Bob Ullery (35:05)
Yeah. Well, so I'll answer the full question. The workflow itself is kilobytes, right? It's just JSON, really, outlining the flow itself. When you load it for the first time, Comfy will tell you what you're missing out of the nodes and models they used in it; you might be missing some models. And so within the manager, there's a function for installing missing custom nodes, and it will show you a list of those nodes that you're missing. You can click install next to each, restart your Comfy, and there you go.

Kevin Nuest (35:42)
That's great. While the models and the data that you might need for reference could be really large, and have potentially taken hundreds or millions of hours of training to create, being able to work on and trade the workflows themselves is just a matter of a JSON file, so tiny, so small. That's how websites are rendered.

Every time you load a page, right? It's basically you're, you're going to get some JSON in there, trading data back and forth, like super fast. So that's, that's so cool that this whole complex thing is represented in something so, so quick and tight and easy to share, even if, you know, the models, you have to go get the underlying data, but like you said, that's where things like checkpoints come into play. So you can reference exactly where and what you need, uh, from, from these, these models that have had lots of compute put into them to get them to where they are.

Bob Ullery (36:33)
Yeah. You know, this is kind of cool. The Comfy folks are just in a league of their own, because this has been around for a couple of years now. Building something like this prior to the wave that we're experiencing is pretty incredible and notable, I guess.

When you generate an image in Comfy, it will embed the entire workflow that created it. If you have an image that came from Comfy, you can just drop the image right on the canvas and it will show you the workflow, and you can just run it. Very cool. You know, we, we wanted... yeah, go ahead.

David DeVore (37:08)
Well...

Can I ask a question about that? So you said it's embedded in the image, like if it's a JPG it's embedded in the JPG, in the metadata on the file. Interesting.

Bob Ullery (37:24)
In the metadata in the file, yeah, not the visual side of that image. Yeah, it's kind of neat. So it's a super fun way, especially from like a catalog perspective: you look at these things like, oh, well, what does this one do? I can't remember. Like if you had 50 versions of this thing, how do you know what version one does versus version 40? You can just drop in the image and be like, oh, that's the one, right?

David DeVore (37:31)
Right. Got it.
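A short sketch of pulling that embedded workflow back out of an image with Pillow; ComfyUI's default PNG output carries the graph in its text metadata, and the "workflow" and "prompt" key names here should be treated as assumptions.

```python
# Sketch of reading the workflow embedded in a ComfyUI-generated image.
# The "workflow" and "prompt" metadata keys are the commonly used ones; treat
# them, and the file path, as assumptions.
import json
from PIL import Image

img = Image.open("dog_on_skateboard.png")  # placeholder path

workflow_json = img.info.get("workflow")   # full node graph, as dropped on the canvas
prompt_json = img.info.get("prompt")       # API-format version of the same graph

if workflow_json:
    workflow = json.loads(workflow_json)
    # The UI-format graph typically carries a "nodes" list.
    print(f"{len(workflow.get('nodes', []))} nodes embedded in this image")
else:
    print("No ComfyUI workflow metadata found (was the image re-encoded?)")
```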

Bob Ullery (37:52)
So what we've done here, and what you won't find on most examples, is character consistency at scale. Whether it's a few frames, something sort of self-contained, or an elongated canon, we can't have a person looking totally different throughout the movie or the story, right? They need to stay consistent so that the viewer, the consumer of that story, doesn't get pushed back by the fourth wall and just check out. The examples out there are usually of, here's how to do that with one person in an image, right? And we need more than that.

We need character consistency across a large swath of characters from a library, and that's what we've pulled off here. On the left is our character library. You'll see their headshots; they're not real people. We then either generate a new image given a prompt, like the people here at a baseball game, right? Or we load a picture of a scene and use that, or generate a brand new image of people at a baseball game and use that. In either case, it then runs through the mesh masking node that really took the brunt of these three weeks to build.

That node lets us easily take each individual that's in that starter image and mask them, but each in a different color. So you'll see over here, each character has a representative color that we want. Our node then knows to use those colors to mask those individuals out where they go in the scene. And then, using those colors,

we have a complex process to essentially mask the characters in where they were in the starter image, using their face and their full body images. I've got the second half of this bypassed off; it does a face refinement on each one, more closely considering their starting image and the details of themselves. I don't want to give away too much of the special sauce, but we're pretty much done with it, so we're going to start using it soon.

Kevin Nuest (41:05)
That's awesome. I had a, good.

David DeVore (41:05)
What were you gonna go for, Kevin? I was gonna ask, like, what else... I mean, obviously simple image generation, right, in which case there's a bunch of tools, whether DALL-E or Midjourney or whatnot, for doing that. What are the people who are sort of building in ComfyUI on the edge doing? What are the use cases where somebody's jumping off to Comfy versus, yeah.

Bob Ullery (41:48)
There's probably four that come to mind. Let's see if I can get there, because my memory sucks now and I always forget the fourth thing. The first I would say is consistency. The reason you need this full control is you want consistency. Whether that's at the character level or the brand level, right? For your business, you want these things to be consistent. You don't want them...

David DeVore (42:11)
Mm.

Bob Ullery (42:17)
You don't want to do a one-shot to Midjourney and automatically take whatever comes back, right? So this gives us refinement. So, consistency. The second I would say is video, with Stable Video Diffusion, and okay, that was also like two weeks ago, and every day is a new year in this space.

David DeVore (42:24)
Mm.

Bob Ullery (42:48)
You can do video diffusion in Comfy. And hopefully we'll get access to Sora coming out of OpenAI, which they announced is not coming anytime soon, as of like two days ago. So I have to wait a little bit longer. The third being 3D generation. There are a couple really great new models out that

do pretty good 3D gen. It depends on the mesh and the details of that mesh; usually they're kind of rounded and sort of less complex shapes that are being generated, but they're coming along really nicely. You can generate some pretty amazing 3D assets, take those out of Comfy programmatically, and

David DeVore (43:31)
Hmm.

Bob Ullery (43:46)
I don't know, put them in a 3D environment, whatever suits you. One could be dropping them into an Unreal Engine scene, or just rendering a new image but more in a 3D aesthetic by placing those in Unreal or, pick your poison, three.js and so on. The fourth, I would say, is text. Text is really hard to do, especially in a programmatic way.

There's this notion of overlaying text placeholders, masking those layers of text, and then inpainting them into the background, right? Complex generation, but text itself is simple, just not from a generative side; it's never been there in a one-shot way. So text being that fourth thing, and there's probably a lot more I'm not thinking of, but those are the ones that come to mind.

David DeVore (44:49)
And do you think of text as using the same sort of masking utilities as characters?

Bob Ullery (44:56)
Yeah, yeah.

And by the way, there's tons of great third-party Comfy nodes you can get in here, and there are some great graphics nodes around text. So you could just draw some text on an image: here's the text, here's all the components around it, whatever's of use to you, right? It may be as simple as, hey, let's create baseball cards with these people, right? Put a text overlay there. Just a simple, you know, graphic. That's not really generative. Beyond that, you might want their name tattooed on their neck, right? That's different. That's where masking would come in.
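A minimal sketch of the non-generative overlay case just described, simply drawing text onto a generated image with Pillow; the font, positions, and file names are placeholders, and the generative route would instead mask this region and inpaint around it.

```python
# Sketch of the "just draw the text on the image" approach: a plain, non-
# generative overlay. Font path, positions, and file names are placeholders.
from PIL import Image, ImageDraw, ImageFont

card = Image.open("baseball_player.png")              # placeholder generated image
draw = ImageDraw.Draw(card)
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 48)  # any available TTF works

# Simple caption near the bottom edge, with an outline for readability.
draw.text((40, card.height - 80), "ROOKIE CARD  #07", font=font, fill="white",
          stroke_width=2, stroke_fill="black")
card.save("baseball_card.png")
```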

David DeVore (45:47)
Yeah, cool.

Kevin Nuest (45:50)
I had a question around when we started working, especially when you dove into working with Automatic1111 as an interface to create images from stable diffusion. You found a lot of value and emphasis in using the negative prompt words to get to where you wanted to go. And it looked like, even in the example you showed, you had some probably standard ones that you've copy-pasted

Uh, over, over time, things like, you know, not messed up hands and not missing arms and the, you know, really emphasizing those things to try to get the opposite of that, which is good arms and hands and such working. Yeah. Yep. And so how much using.

Bob Ullery (46:33)
These are the three, the three you need in every negative prompt.

Kevin Nuest (46:45)
ComfyUI, and being able to add in so many more nodes and steps and refinements in there. How much have you found leveraging the negative prompts has changed, or do you kind of have the set default negative prompts like you just showed, and you're really not tweaking those, and it's all about the positive, going back with the refinement models on top of it? Kind of, is there a big difference between Automatic1111 and ComfyUI in the negative prompt value, or is it kind of the same?

Bob Ullery (47:20)
Yeah. I see myself using negative a lot for, like, big categories, I don't know, just things that I generally don't like. When you're generating a lot of runs here, you might take a look and go, oh, I never want to see that again, put it in the negative. And then the other side is, it depends on the model you're using. A great example is this one, Poltergeist. If you look at this, you'll get a good look at what it's really set out to do, which is that comic book style, right? But it's trained on a lot of superheroes. That's not what I'm looking for; I just want the style here. So you'll see in this negative that I call out Wonder Woman, Superman, superhero.

David DeVore (48:09)
Mm.

Bob Ullery (48:17)
on purpose, because the model very much wants to kick out superheroes; it's been trained on Wonder Woman and Superman. Cause once I did that, I don't really see that much anymore.

Kevin Nuest (48:29)
Makes sense.

Bob Ullery (48:33)
And then the other side is prompt-specific stuff too. You know, I might say, yeah, keep the hat on our dog, but maybe we also don't want round wheels. I don't even know what that's gonna do; clearly a skateboard has round wheels. Let's see. Like, will they come back as cubes? Will they be triangles?

That's still pretty round. Here's another trick, the notion of weighting. So, I definitely don't want round wheels; I'll give that a weight. Now round wheels are two times more important than not having a cat.

David DeVore (49:04)
Thank you.

Bob Ullery (49:20)
And so that's just a function of putting parentheses around the term and giving it the weight you want, which is essentially a multiplier. Think of it that way. Or for the wheels, maybe we want less weight: we don't want them, but if they do show up, okay, just don't make them as prominent. So, still round. But I might also say, like...

David DeVore (49:46)
It does. There are no wheels that aren't round. They don't exist; it's not a wheel if it's not round. The poor AI. But you could positive-prompt for square wheels, probably, right? Maybe.

Bob Ullery (49:51)
I guess you're right. Yeah. A wheel has to be, it is a circle here.

Yeah. So I put cube wheels in there. It's starting to take shape. If I overweight that, maybe. Square wheels is probably easier for it to understand.
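For reference, the weighting syntax being discussed looks like this in the prompt boxes; the exact numbers are only illustrations of the multiplier idea.

```python
# The "(term:weight)" syntax used by Automatic1111 and ComfyUI prompt parsers,
# shown as plain strings. Weights above 1 emphasize a term, below 1 soften it;
# the specific numbers here are just illustrations.
positive_prompt = "a dog riding a skateboard, birthday hat, (square wheels:1.4)"

# "(round wheels:2)" makes avoiding round wheels twice as important as the other
# negative terms; "(blurry:0.8)" keeps the term but de-emphasizes it.
negative_prompt = "(round wheels:2), cat, (blurry:0.8)"
```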

Kevin Nuest (50:17)
I mean, this is another good example that all of this is probable statistical math. Every, every bit of this from the model to the prompts going in and, uh, why it takes trial and error to get to a workflow you like and the images you like out of it. Cause this is, this is all randomization math and you're narrowing in the problem space or broadening the problem space of what could come out of that random math.

David DeVore (50:33)
Oh man.

Bob Ullery (50:47)
Yeah, and every small nuance changes the outcome. For the blockchain nerds: if you changed one bit of a transaction on chain, you'd change the whole hash, right? Same deal here. You change one thing, you get totally different network pathing and a totally different output.

For some reason I cannot get square skateboard wheels. I could keep iterating on that. I could also Photoshop some square wheels on here, or find another way to generate some examples of that, and then train a square-skateboard-wheel LoRA model that would help refine specifically just the skateboard wheels into squares.

Kevin Nuest (51:34)
Makes sense. That is a very specific need. There's probably, as Dave called out, there's probably not a lot of models trained out there on square wheels. Not a ton of examples, not in the millions.

Bob Ullery (51:46)
Yeah, let's see. I bet there's some wheels. Here's something called No Wheels XL, trained specifically just on flying cars, it looks like.

David DeVore (51:56)
Nice.

Bob Ullery (52:03)
Yeah, pretty cool. So, you know, it doesn't take a lot to train a LoRA. If you couldn't find good representations of the vehicles you wanted without wheels, you might just go and manually Photoshop those wheels out and then use those as your training data for your LoRA.

David DeVore (52:20)
We should do that on a future episode: go LoRA training. I mean, I think it would be really interesting. I know we've talked primarily about characters and storytelling and whatnot; I think it would be interesting to unpack what this looks like more on the business use case side, right? So like, what would it look like for a business to generate banners with their logo as well as words, right? How do you capture a brand's aesthetic and brand guidelines correctly, and then be able to give them something consistent? That would be fun to do some time on another show. So we're coming up on time here, and this has been super, super great, super educational. Thank you for sharing. I want to... I'm about to go

Bob Ullery (53:07)
totalist.

David DeVore (53:23)
create some dogs on skateboards. And, and yeah, thank you everybody who tuned in and we'll see you guys next time. See ya

Bob Ullery (53:34)
Thanks everybody.

Kevin Nuest (53:35)
Thanks.