Denoised

ComfyUI Explained: How AI Image Generation Actually Works (Step-by-Step)

VP Land Season 4 Episode 46

Curious how AI actually turns text into images? In this episode, Addy and Joey break down the inner workings of AI image generation and explore ComfyUI. We'll explain the core concepts of latent space, diffusion models, and how noise becomes a recognizable image. From basic text-to-image workflows to advanced techniques with LoRAs and image-to-image transformations, discover when to use ComfyUI versus web-based tools like Runway or ChatGPT for your creative projects.


---

The views and opinions expressed in this podcast are the personal views of the hosts and do not necessarily reflect the views or positions of their respective employers or organizations. This show is independently produced by VP Land without the use of any outside company resources, confidential information, or affiliations.

All right. Welcome back to Denoised. So this episode, we're going to take a step back for a broader view. We're going to look at the concepts of how AI image generation works from a high level, and then look at that in a little more detail inside ComfyUI — talk about ComfyUI a bit, look at what a typical ComfyUI workflow looks like, and also how you can use that in the future, as AI tools get more and more advanced, where ComfyUI still comes in handy for different workflows. For sure. At this time, I think we can both agree that ComfyUI is the universal tool across different AI companies, different creators. Yeah, once you get to a certain level and you're like, I need to either have more control or automate things, ComfyUI usually comes into the conversation. But I mean, we've had this conversation before of, you know, with tools like Runway References or Kontext, how useful are LoRAs and stuff like that. We're getting ahead of ourselves. Let's start high level. We'll talk about that a little bit later in the episode. Yeah, this is going to be a fun one. And please engage us with questions and comments, and we're going to get to them as well. Yeah. All right. So let's talk high level first: AI image generation. AI image generation. How does this work? What is happening? It's actually a miracle that it even works, if I go under the hood and sort of explain how it actually works. So one thing that all image generation and video generation relies on is a neural network. And a neural network is what separates computer vision and machine learning from AI. So when people say, well, we've been using AI for years — not exactly. Neural networks, that is a new thing, especially neural networks at scale. When we talk about an image generation model, like the one Midjourney has, or Stable Diffusion, you're talking about billions of nodes inside of a neural network. And a neural network node is nothing complicated. It's actually a very simple mathematical function: a weight times an input plus a bias — w times x plus b. So it's essentially linear algebra. You have a slope, which is the weight, and then you have a bias on that slope, which is the plus b. The thing that makes it incredibly useful is when you do this millions and millions and millions of times — have millions of nodes, and then you go to billions of nodes — then it starts to develop a sort of understanding of big concepts, and not just concepts of what an image should look like, but associating that with words. And that is the magic here. Like, for example, if you have a notion of a beach — regardless of language, you could say it in Mandarin, in Tagalog, in American English — but a beach, that concept is associated in that neural network with that of Malibu, that of, you know, the Cannes Film Festival. Like, beaches everywhere in the world are associated with the word beach and the notion of the word beach. So how is this all built? It's actually built from images, and the way we train them — I have an actual tactile demo. Okay. So this is noise, and it's hard to believe that a neural network can turn this into an image that we can recognize. So it goes from this to this, or it can go in the reverse order — it can actually take this and go back into noise — and it happens through training. The way it does this is it adds noise in steps. So it'll take an input image and then it'll slap on some noise. And you can still kind of see the image there, but this is what it's using for training to understand the denoising process. And it's the training of the denoiser — that's the magic.
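To make the "weight times input plus bias" idea concrete, here's a minimal sketch of a single node in plain NumPy. The numbers are arbitrary stand-ins; a real image model stacks billions of these nodes and learns the weights and biases during training.

```python
import numpy as np

def neuron(x, w, b):
    # One node: a weighted sum of the inputs plus a bias (w . x + b),
    # passed through a simple nonlinearity (ReLU) so stacked nodes can
    # represent more than a straight line.
    return max(0.0, float(np.dot(w, x) + b))

x = np.array([0.2, -0.7, 1.5])   # inputs (e.g. pixel or latent values)
w = np.array([0.8, 0.1, -0.3])   # learned weights (the "slope")
b = 0.05                         # learned bias
print(neuron(x, w, b))           # a single activation flowing to the next layer
```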
So it's taking a nice clean image, adding the noise, and sort of memorizing, at each step, what that level of noise looks like for this image. And it's doing this billions of times. Yeah, and the neural network is able to train itself on what is noise and what is image, which is the crucial part. And as the noise gets thicker and thicker, like here, you can see the job gets harder and harder. Like, you can barely make out the text and our faces there. Yeah. But it is using recognition of features in there. And it is doing it in a mathematical space called latent space. Latent space, all right. Yeah, which is quite different from how we traditionally compute stuff like edge detection or video compression. That stuff happens in the frequency domain, typically using convolutions and cosines and sine waves. This is happening in a space that is multi-dimensional. It's happening in the multiverse. So each vector can have different coordinates on it, right? Because there are different axes in this multi-dimensional space. You know, the way I imagine latent space is you have this gladiator arena, this giant space. And then in there, there are tiny little neighborhoods. Like, you have your Santa Monica, you have your Echo Park, you have your Valley, and each neighborhood represents an idea of a thing that we associate with the real world. Like, Echo Park is just all about, you know, lamps and lights — table lamps and floor lamps. Santa Monica is all about the beach and the sand and the beach umbrella and those things. They're clustered together. So when a text prompt comes in and says, you know, I want to generate a lamp that's sitting on the sand at a beach, then the latent space is pulling from Echo Park, it's pulling from Santa Monica, and guiding that noise into an image. Based on the training data — based on, when you trained it with the image, you also had proper labeling and identification to know that this is the two-guys-on-a-podcast image, or this is a picture of a beach. And so it's trained that way, so then you can generate something. Yeah. A network is only as good as the training data. You hit the nail on the head — the captioning process and the data grooming process are as important as the training process. So you're talking about millions, if not billions, of images. Scale AI is worth billions of dollars. And we're only getting started. So if you follow the big AI companies, you'll see an exponential growth in scale. So for example, GPT-3 to GPT-4, I think, was like a hundred-x scale-up in the size and complexity of the neural network. And they are looking to one-up that one again and again. We're going to get to a point where the notion of things not only exists, but the granularity in which they exist is even finer and finer. Then I could distinguish a beach in Santa Monica versus a beach in Tahiti versus a beach in the south of France, and those things will have individual, very granular differences to them. OK, so to summarize: training — we take a high-quality image with good data captioning, and then it gets noised up step by step. The models are trained so that this level of noise at this step produces this image. So then later on, when we're like, OK, I want to generate an image, we start with a very noisy... You just feed it random noise. You feed it random noise, right. And that's where the seed comes in — the seed is also a thing here. So you need to start with a base noise image, an empty latent image, as it's called in ComfyUI.
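Here's a minimal sketch of the "add noise in steps" idea, using a synthetic image array so it runs on its own. Real diffusion training uses a carefully designed variance schedule rather than this simple linear blend, so treat it as an illustration of the concept, not the actual math any particular model uses.

```python
import numpy as np

def add_noise(img, t, T=1000):
    # Blend a clean image toward pure Gaussian noise as t goes from 0 to T.
    keep = 1.0 - t / T                        # how much of the image survives
    noise = np.random.randn(*img.shape)       # fresh random static
    return keep * img + (1.0 - keep) * noise, noise

img = np.random.rand(64, 64).astype(np.float32)   # stand-in for a real training image
for t in (100, 500, 900):                         # progressively noisier training pairs
    noisy, true_noise = add_noise(img, t)
    # The denoiser is trained on pairs like (noisy, t) -> true_noise,
    # i.e. "given this fog and this step, what noise was added?"
```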
I'm checking because I know in Comfy, you know, you've got to add an empty latent image, which looks like one of those static images that you just showed. Then you give this to the KSampler in ComfyUI, and then, based on your number of steps, it starts denoising that image to create something based on whatever your text prompt was. Correct. Correct. High-level explanation. Yeah. And the training is so computationally expensive. Joey, you're talking about millions of images, but it doesn't go through the neural network once — it'll go through thousands of times, and each time it'll put different weights on each neural network node. And then there is a loss function. When it generates something and we determine it as an error, or another machine determines an error, it goes back in and tries to correct itself and gets better on the next one. A lot of training. All right. So let's look at this in practice in ComfyUI. And ComfyUI is kind of nice, too, because — I mean, it can get very high level and technical, but it also gives you more control. But it's also a good way to visualize what's happening under the hood with a lot of AI models where you're just going to the website and typing something in. This is, in a lot of cases, what's happening behind the scenes. Yeah. ComfyUI is a free platform available on Windows, Mac, and I think Linux as well. We were talking about this — I don't know how they make their money, because this stuff is amazing. I haven't quite figured it out. Yeah, it is free. They've also made it a lot more user friendly, because originally you sort of had to install it from the terminal or from GitHub, and now there is an installer. Yeah. There's a Mac installer, there's a Windows installer. I know we have a video about how to do it, but it's literally just download the installer package and it sets up any Python stuff that you might need. Initially, when I tried to get into this — I'm not the most technical terminal person, so it was a little scary and complicated — but now it's a lot easier to use. You and I, I think we're both visual learners. For me, node-based stuff is just so intuitive. For our viewers out there, if you've used Nuke, if you've used Resolve or Fusion, if you've used Houdini, even Unreal Engine Blueprints, you've already got it down. You know node-based workflows, so you should have no problems. Or even any basic flowchart builder in Whimsical or Miro or something. Yeah, same idea. Exactly. So I loaded up one of their default text-to-image templates. They've got a lot of good templates. This is a good one because it shows all the basic building blocks that go into taking a text prompt and turning it into an image. First up in our workflow, we've got the Load Checkpoint. Yeah. So what are the checkpoints? Yeah, the checkpoint is the model that you're loading. If you're generating an image, you could pick FLUX Dev, FLUX Schnell, Stable Diffusion — there's a ton of models. Yeah, the default one here is Stable Diffusion XL, which I know a lot of people still like. Yeah, it's the most popular one. It's also open source — you run it locally. Yeah. Anyway, I should say, too — I mean, we'll talk about APIs in a second — but the other advantage of Comfy that people like a lot is you can download the models and run them on your local machine. So if you want to experiment or generate stuff, you are generating it on your computer. There's no credits. There's no paying for each generation.
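As a rough sketch of that "loss function, correct itself, get better" loop, here's one training step for a toy denoiser in PyTorch. Everything here is a stand-in — the tiny network, the random "latents" — real models are U-Nets or transformers with billions of parameters, but the shape of the loop is the same: predict the noise, measure the error, nudge every weight.

```python
import torch
import torch.nn as nn

# Toy denoiser: a couple of linear layers standing in for a billion-parameter model.
denoiser = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

clean = torch.randn(8, 64)          # stand-in batch of clean training latents
noise = torch.randn_like(clean)     # the noise we actually added
noisy = clean + noise               # the noised-up training input

predicted_noise = denoiser(noisy)                        # the model's guess
loss = nn.functional.mse_loss(predicted_noise, noise)    # the "loss function"
loss.backward()                     # the error flows back through every node...
optimizer.step()                    # ...and each weight adjusts itself slightly
optimizer.zero_grad()
```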
So it's good for messing around and experimenting. Totally. Yeah. On your hardware. Just one note: you do need an NVIDIA GPU here. You need that, yes, because these things run on CUDA. There are some exceptions — I've run it on my Mac. Yeah, it'll run slower. I almost burnt my legs right then — you're like, oh — with my laptop on my lap and the fans spinning. But yeah, it is possible. But it is more designed to run on NVIDIA cards. Yeah. And I'll just say, if you notice the Load Checkpoint stuff, it's loading a safetensors file. What that is is a prepackaged, precompiled version of the model that is uneditable, and that's how the Load Checkpoint node likes it. So there are two types of files you can download for a model: a diffusers file and a safetensors file. Diffusers are more open to being modified. OK. You can also load those here as well. But the safetensors is the one that's a single file. Safetensors: single file, not modified. You can modify things in the chain that you're building in Comfy, but that model itself is locked down. Everyone's got the same model if you download it. What is — because I've seen this on different models too — FP8, FP16? Yeah, that's floating point 8-bit or floating point 16-bit. It's the precision of the model itself. So all the parameters are either held in a 16-bit floating point, which is just a fancy way of saying a really long decimal number — OK — or 8-bit, which would be half the length of the decimal number. So 16-bit holds more precision. Would you pick one based on your hardware? Because would FP16 need beefier cards? I would say go with FP8 for ideation and stuff like that, and FP16 for the highest quality that you can get. OK. All right. So then we're going out of our checkpoint, our model, and then we've got two CLIPs — CLIP Text Encode. One for our positive prompt and one for negative prompting. OK. So we got the CLIPs. We got the two CLIPs. Because this model also supports positive prompting and negative prompting — saying what you don't want. But not every model does support that. But in this case, it does. Correct. And this is not the same clip that you and I think of, like clipping an image. It's a completely different CLIP. I think it stands for Contrastive Language-Image Pre-training. It's exactly what I just talked about. It's the Silver Lake and the Santa Monica of the latent space. So not just words, but notions of objects and places and experiences can be stored in the latent space, associated with words, right? So the actual word, the actual language, is not as relevant as the fact that you're trying to pull an idea from the latent space. And so this is actually going to take your simple text prompt here and encode it into a vector that goes into the latent space, and then the sampler will use that vector to generate the image. Okay. And I guess at a high level, this is taking the text prompt that you wrote out and turning that text into something that the model can understand and turn into a visual image. Yeah, into latent vectors. OK. And then these are all getting plugged into the KSampler. And the KSampler has sort of been described as the heart of image generation. So we've got the model, our checkpoint, loading into the KSampler; our positive and negative prompts loading into the KSampler; and then a latent image.
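For a sense of what that CLIP Text Encode step produces, here's a sketch using the Hugging Face transformers library to turn a prompt into conditioning vectors. The checkpoint name is just an example of a standard CLIP text encoder (SDXL actually pairs two text encoders, CLIP-L and CLIP-G), so take it as an illustration of the idea rather than exactly what ComfyUI runs under the hood.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "openai/clip-vit-large-patch14"   # example CLIP text encoder
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

prompt = "a table lamp sitting on the sand at a beach, golden hour"
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    vectors = text_encoder(**tokens).last_hidden_state   # shape: [1, 77, 768]

# These per-token vectors are what the sampler is actually guided by;
# the words themselves never reach the diffusion model, only these numbers do.
print(vectors.shape)
```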
The latent image is basically just a blank, empty image at the resolution that we're going to generate at, which here is 1024 by 1024. This one's really important, because a lot of models have specific latent image requirements based on what they were trained on, or what they understand. Exactly, and what they're inherently doing under the hood. For example, Stable Diffusion XL is a 1024 by 1024 model. So if you give it a weird latent image size, it's going to struggle a little bit. So give it its native resolution. So don't do what I first tried to do, where it's like, I want an image at some totally different resolution. Yeah, so there are remedies for that — you can add upscaler models at the end to get to that. Right, but you'd add that afterwards. You wouldn't change your initial latent image, because it's not really trained to do images of that size. OK, so in our KSampler, we do have the seed. And so we talked about this before — this is a randomly generated number, and it's how it generates the very first base layer of noise. Yeah. Think of it as a unique identifier for the noise. Theoretically, theoretically, if you have the same seed and the same prompt and you push it through again, you should get the same image. Right. Theoretically — if every other setting is the same. Right. You should. I mean, I've seen people keep track of seeds, too. If they find a seed works out well for some scene, they might try to use the same seed for other shots in the scene. Yeah. But theoretically, it possibly helps get more consistent shots. It should. Yeah, I mean, it should get you way more consistency than just random seed, random seed, random seed. And then we've got a bunch of other settings. I will say — so, steps. Can you talk about steps? Yeah, so that is the number of steps it's denoising over. Imagine that thick fog of noise broken up into different layers, and then each layer being denoised in turn, right? And in theory, the more steps you have, the better image you should get, because it's doing more. It's diminishing returns. Up to a point. I think I've seen something like 20 to 30 be the normal here, and anything beyond that is just going to be computationally expensive. Like, you're wasting time. If you crank it way up, you're not going to get a better image. You're wasting electricity. You're melting your card more. Yeah. So CFG is Classifier-Free Guidance. The higher this number is, the more tightly you're holding the model to the text prompt, to the CLIP conditioning. So if you have a low CFG value, you're letting the model be more creative, right — come up with the thing that it wants to come up with. But if you have a high CFG, then it's going to really stick to what you prompted. OK. The sampler name — this is the mathematical algorithm that's used under the hood. Euler is the most common. It's the most computationally efficient. I've seen Euler ancestral — I think that's in there. Yeah, ancestral. Yeah. So I see some of that, but generally I don't see this being changed. Yeah. I see different options here — like, if you're an AI scientist, you're asking for more, but Joey and Addy are using Euler. Okay. This is one where, yeah, you pretty much don't need to mess with it too much. Scheduler. Scheduler I'm actually not as familiar with. I usually leave scheduler and denoise alone. The nice thing is, if you hover, they do have pretty good tooltips that pop up: scheduler controls how noise is gradually removed to form the image. Yeah. Probably one where, unless you know what you're doing, just leave it alone. Yeah.
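Those same knobs — seed, steps, CFG, sampler — show up outside ComfyUI too. Here's a sketch using the diffusers library with SDXL as an example; the model ID, prompt, and exact values are just illustrative, and this is a different front end than Comfy's KSampler, but the parameters map one to one.

```python
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",   # the "checkpoint"
    torch_dtype=torch.float16,                    # FP16 precision
).to("cuda")
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)  # "Euler" sampler

generator = torch.Generator("cuda").manual_seed(42)   # the seed: same seed + prompt -> same image

image = pipe(
    prompt="a table lamp on the sand at a beach, golden hour",
    negative_prompt="blurry, low quality",
    num_inference_steps=20,      # "steps"
    guidance_scale=7.0,          # "CFG": higher sticks closer to the prompt
    width=1024, height=1024,     # SDXL's native resolution
    generator=generator,
).images[0]
image.save("beach_lamp.png")
```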
And I just want to say that this entire block, the KSampler block — the mathematical computation is happening in the latent space. So this is not happening in XYZ coordinates. This is not a CG render. This is not anything that you and I can really comprehend in our brains; it is happening in a highly multi-dimensional space that only exists inside of a computer. Like, we don't break the computer open like in Zoolander — the files are in the computer. So why do we use latent space? The short answer is it's compression. Imagine storing a billion images inside of a model and then shipping that model in a safetensors file — you would be in the petabytes, right? Even with image compression. So latent space allows you to preserve all the notions, all the ideas of that image, and each image comes down to kilobytes or bytes, because it's no longer in pixels. It's just this multi-dimensional vector that it exists as. A vector is going to be way lighter than a rendered raster image in pixels, right? So that's why we bother with latent space: you can get to scale and still ship the models over the internet. A gigabyte model is honestly not a big deal, versus a petabyte. Yeah. OK. And so then we have denoise. Which is the name of our show. They are promoting our podcast, so that's very nice of Comfy. Thanks, Comfy. No, this would come into play more if you're doing image to image. Yeah, I believe so. Yeah, generally here, the steps take care of it. I guess it would be the amount of denoising applied — lower values will maintain the structure of the initial image, allowing for image-to-image sampling. So in this case, because you're starting from nothing, we don't want any of that initial noise to stay. But if you're starting from an image-to-image workflow, you would want it. It would control how much of the original image you want versus how much you want it to change. Great. Great. Yeah. We're going to get into image to image in a little bit. Yeah. And then after our KSampler, we've got the VAE Decode. Yes. VAE stands for Variational Autoencoder — and it decodes as well. So this is where you go from the latent space back to pixel space, stuff that we understand. And so this is taking all of the latent stuff, turning it into an image, and then it saves our image. We have a save node, which will save our image. Yeah. Now you're back to a JPEG or whatever. Yeah, a PNG. And this is the other nice thing about Comfy: you can set up batches, and it just saves the images to your computer. If you're working on one of the web portals you're used to, you kind of have to flag which stuff you like and then download it individually later. With Comfy, you're running it on a computer, you designate a folder and a file name prefix, and everything you generate gets saved. You know what I found out recently, too? Any image you generate on Comfy — I think maybe it's only PNG files — if you drag that image into Comfy, it brings up the entire workflow. Oh, and so you can pull it up. If you have an image, or someone sends an image, you can pull up the entire workflow that was used to make that image, which is really awesome. And I know why: under the hood, this whole node-based thing is really just a text file — it's a JSON file. Yeah. Those in VFX are very familiar with JSON files. You can attach the entire JSON file as metadata to the image.
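If you're curious what that looks like on disk, here's a small sketch that reads the embedded metadata out of a ComfyUI-generated PNG with Pillow. The filename is a placeholder, and the exact chunk key names ("workflow" and "prompt") are what Comfy-generated images are commonly reported to use — check one of your own outputs to confirm.

```python
import json
from PIL import Image

img = Image.open("ComfyUI_00001_.png")   # placeholder: any PNG saved by Comfy's Save Image node
metadata = img.info                      # PNG text chunks show up here as plain strings

# Key names commonly seen in Comfy-generated PNGs; treat them as an assumption.
for key in ("workflow", "prompt"):
    if key in metadata:
        data = json.loads(metadata[key])
        print(f"{key}: {len(data)} entries of workflow JSON embedded in the image")
```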
And Comfy will read that metadata and recreate the workflow. But yeah, I did not know that it did that for every image. So if you are sharing images or something and you want to see what the workflow was — it's all in there. It's there. Or if you made the image and you're like, what did I use for that thing? And you didn't save the workflow — you can just drag it in. So I just learned that. That's also very cool. Also, the other cool thing here is Comfy keeps a history of all the input and output images in the local folder where you installed it. So even if you're not saving images, it's saving them for you. Yeah. So yeah, it's very good for just pulling stuff back up again. OK, so now that we've covered basic image generation — text to image — yeah, let's cover some other options. So the two other common workflows would be image to image and LoRAs. First off — we talked about this before — but LoRA, recap: what is a LoRA? Yeah, LoRA is not the name of my girlfriend. L-o-R-A stands for low-rank adaptation. It's essentially — you train a model, a very lightweight model, typically like a hundred megabytes, and you can attach it to a larger model, which is in the gigabytes, right? And the way I imagine it visually: the large model is like the Queen Mary in Long Beach, right? It's a giant cruise ship, or any of your Carnival cruise ships in Miami. And the LoRA is a tugboat. It's a tiny little ship, but it has the power to pull that giant thing to where it needs to go at the dock. And LoRAs never modify the actual model. All a LoRA is doing is attaching itself to the correct attachment points in the big model to influence it significantly. And this could be a style. This could be a face, a logo. Characters — that's probably the number one use. It's like, I want to create Joey every time. You have tons of photos of you. So what you would do is train a LoRA model on Joey, call it Joey, and then attach the Joey LoRA to FLUX, the big actual FLUX model. And then you type "man wearing a suit," and then you tag it with a special trigger word, Joey — whatever I named the LoRA. Yeah. And then it knows that you're now invoking the LoRA. So it'll generate not just a generic man, but you. Right. Right. Yeah. Okay. And to train LoRAs, you just use — Replicate. It's Replicate, not Replit — that's the coding one. Replicate is probably the most popular. There's FluxGym for FLUX; for Stable Diffusion you have Automatic1111 and Kohya. So there are tons of places for you to train LoRAs, usually pretty easy web interfaces where you just give it the images and it'll train and give you the LoRA file. It's not going to be free — I think there's a cost, but it's dirt cheap. It's like a dollar, or a few dollars. I know for $5 you can train a LoRA. And you can train a LoRA in ComfyUI as well, but I don't recommend it, because the online tools are just so much easier. Yeah. All right. So we'll cover the LoRA one first. This is just a general LoRA template inside Comfy. And a lot of the basics are the same. So we have our Load Checkpoint model loader. But now, before we go into our KSampler, we have Load LoRA. Correct. Yeah. So you're loading the checkpoint — which here means the LoRA model, not the main model. So you're going to have two checkpoints here. Load Checkpoint is our main model. OK, correct. So then you're going into the Load LoRA, which is going to take the main model, attach the LoRA to it, and then send that to the KSampler. Which is what's happening here in the Load LoRA node.
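As a sketch of what "attach the small LoRA to the big model" means in code, here's the same idea through the diffusers library, again with SDXL standing in for the base model. The LoRA filename, the trigger word, and the scale value are hypothetical placeholders for whatever you actually trained.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Attach the ~100 MB "tugboat" to the multi-gigabyte base model.
# "joey_lora.safetensors" and the trigger word "joeyvp" are made-up examples.
pipe.load_lora_weights("joey_lora.safetensors")
pipe.fuse_lora(lora_scale=0.8)   # how strongly the LoRA steers the base model

image = pipe(
    prompt="photo of joeyvp man wearing a suit, studio lighting",
    num_inference_steps=20,
    guidance_scale=7.0,
).images[0]
image.save("joey_suit.png")
```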
So you see on the Load Checkpoint node, there's a CLIP output — the yellow line there. So this model also has the CLIP loaded in it, but a lot of times it can be separate from the model. So you can have a Load Checkpoint and a Load CLIP, and you might have to load both. Oh, you might have to load a CLIP. So, as in what — would that be different than our CLIP here, which is just the text prompt that we want? So a lot of times the models generally come with a CLIP loaded on top, so the safetensors file has it in there and you don't even notice it. But for more advanced workflows, you can actually put in a much higher-quality CLIP encoding mechanism. And the Triple CLIP Loader is what I see professionals use a lot of the time — it'll have a CLIP L, a CLIP G, and a T5 XXL clip model. And those three CLIP models work together to give you three different vectors for that text prompt. So what is this doing exactly, and how is this different than our CLIP Text Encode here, which is just our prompt? It's giving you the vector in three different ways. So the prompt is giving you three different results here, and it's just giving the KSampler a more varied environment to build off of. So it's just giving more meat for the generation. OK. But you would still use this in addition to your text prompt that you want to generate. So right now, if you see the Load Checkpoint, that yellow line, the CLIP here — you're getting the CLIP from the safetensors file there, and that's passing from the LoRA into the text encode, into the KSampler. But if you wanted to be a little bit more advanced and give the generation more to work with, you could get a Triple CLIP Loader and then attach that yellow line — yep, right there. Oh, I put it before the LoRA? Before the LoRA, yeah. And then you would go about it that way. OK. You don't actually need that one, because you're not going to use that CLIP. OK. Yeah, got it. OK. Interesting. Good tip. Where would you get these CLIPs? Like, is it something you have to download? Is it just on the...? Great question. So in general, I would say 99% of the stuff that you actually need to download for ComfyUI exists on a repository called Hugging Face. OK. So Hugging Face is like the national database — the international database, if you will — of all of the models you could get. Safetensors are there, diffusion models are there. And that's not the only place. There are a couple of other places as well. If you go to Civitai — oh yeah, they have stuff too. So Civitai specializes in LoRAs. So if you're looking for a specific style — let's say you want an anime style transfer, or you want a character that's consistently a certain ethnicity — you go to Civitai and grab that LoRA. But that LoRA is going to be specific to an image model, right? You can't just use a FLUX LoRA in a Stable Diffusion model. Right. And so here, actually, I pulled this up because I was curious — Comfy does have a good model manager built in as well. Yeah. And I searched for CLIP models. And so here: Stable Cascade from Stability AI, a dedicated CLIP model. So I assume that would go with Stable Cascade. Yeah. So these are probably referencing back to a Hugging Face repository. But the nice thing about Comfy here is it'll download directly into your Comfy local folder, so you don't have to go out to the internet and find the thing. Text Encoder for FLUX FP16, Text Encoder for FLUX FP8 — OK, like what you were saying.
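If you'd rather pull a file down yourself instead of using Comfy's model manager, here's a sketch using the huggingface_hub library. The repo ID and filename are examples of a real SDXL checkpoint at the time of writing, and the local_dir path is an assumption about where your ComfyUI install keeps its checkpoints — adjust both for your setup.

```python
from huggingface_hub import hf_hub_download

# Example repo and file; verify the exact names on Hugging Face before relying on them.
path = hf_hub_download(
    repo_id="stabilityai/stable-diffusion-xl-base-1.0",
    filename="sd_xl_base_1.0.safetensors",
    local_dir="ComfyUI/models/checkpoints",   # assumed ComfyUI folder layout
)
print("Downloaded to:", path)
```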
Yeah, and then the other neat thing here is, if you're starting with Comfy and you've just downloaded a workflow, if you go to the manager there is an Install Missing Nodes button. Yeah. Which is cool, because a lot of times there are nodes that are just doing mathematical functions — cropping an image, clipping an image, whatever — and for that, it'll just find it on the Comfy server and then pull it in here. Yeah. OK. And then as far as everything else in this workflow, it's exactly the same as what we just covered. It goes into the KSampler, it goes into the VAE Decode, and then it saves your image. Yeah. So yeah, we just covered all of this. Now, the fact that you have the LoRA in here is going to significantly impact your image generation, and in your text prompt you're gonna have to add the trigger word for your LoRA. So, like, if we added Joey — if Joey is not in there, it's not gonna invoke the LoRA. All right, next let's go to image to image. Real quick: image to image. So now let's say you have an existing image and you wanna modify that image. Yeah, so this is where I think a lot of professionals will end up, because you'll find that text to image is not gonna get you where you need to go most of the time. In that instance, you already have a good idea of the framing, the blocking, where you want the character, where you want the background to be, what objects. You can do something like sketch it out, or you can go into Unreal, block it with a camera, and take that render. Bring that in as an input to the generation. So here we have an image to image. And our workflow is pretty much the same. We have this blue box, which is to load our image. So we've got a Load Image node. Correct. And see, all that's happening is it's taking the image and encoding it into latent space with the VAE, the variational autoencoder. And then that is going to go ahead and influence the KSampler. So instead of giving it an empty static image, you're giving it an actual image. And also we can see here that the denoise setting that we talked about before was set at 1 when we were doing text to image, because we didn't want any of that starting noise left. But now it's set lower, which will modify the image a lot but keep some of the basics of it. Yeah. And all that happens under the hood is that image actually ends up having structure. Like, if you picture a portrait of a man, you know, you have the shoulders and the head — there's structure to that image, and it's going to use it in the guidance. I also noticed here, too, that the sampler for this is not Euler — it's DPM++ 2M. It's that. I would say if you load one of the default Comfy workflows and the settings are changed, leave them as is to start with, and then maybe modify them to see what they do. They usually set these to whatever the optimal parameters are for the workflow that you're loading. And full disclosure to our viewers: I am not the ComfyUI expert here. I'm just scratching the surface. Yeah. And I think you can say the same. Yeah, I'm just catching up. The AI researchers of the world are the ones using this to its full potential. Yeah, they're really into the mathematics behind this, and stuff that I'm not. But we're bringing you the knowledge that we know. Yeah. Everything. Yeah, this is a good jumping-off point, too, because one of the interesting uses that I've found for Comfy lately — everything we're talking about here is all running locally. So this is kind of either going to kill your machine, or not even be doable to run.
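Here's a sketch of the image-to-image idea outside Comfy, again via diffusers with SDXL as the example base model. The input filename is a hypothetical blockout render, and the strength parameter is diffusers' name for the same knob the denoise setting controls — 0 keeps the input untouched, 1 ignores it entirely.

```python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Hypothetical input: a rough Unreal blockout or sketch with the framing you want.
init_image = load_image("unreal_blockout.png").resize((1024, 1024))

image = pipe(
    prompt="cinematic photo of a man in a car at night, rain on the windows",
    image=init_image,
    strength=0.6,             # like Comfy's denoise: lower keeps more of the input's structure
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("img2img_result.png")
```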
Like, I haven't run any of these as demos, because I'm on a MacBook and it would kind of kill it. But even if you did have your own hardware, or if you just have a MacBook, what I found useful is they have API nodes. At first I thought, oh, that seems like a little bit of overkill, but they're actually really cool. So they have a whole variety — I'll show you in the templates. It'll connect to Runway, it'll connect to Google, it'll connect to any of the main cloud tools that have an API. So they've got connections to LLMs — OpenAI or Gemini. They've got connections to image APIs — FLUX, which you could run locally, or if you don't have the hardware, you could connect to their cloud service and use their cloud servers to generate; Runway, Stability, OpenAI, Ideogram. And then they have video APIs as well, with the main ones — you know, Pika, Runway. Yeah. So how does billing and pricing work? At first I was like, oh, this seems cool, but then do you have to sign up for the APIs and developer accounts for each tool? Not the case. Comfy has a cool setup: you basically just connect your account and you can buy credits. So they're reselling. Yeah. And they're not marking it up. They're just charging you whatever the value is of the nodes. And then the cool thing, too, is it'll tell you — like, this is a Runway text-to-image node, and I think because Runway's pricing is variable, it doesn't have a price here. But on some of these nodes, if it's a few cents a generation, it'll tell you up here exactly how much it's going to cost for each generation before you run it. And you don't have to connect any API keys. You don't have to. You just load this node, and it takes the money out of your account. So it's awesome. Really handy. And then the advantage of using this is you can connect it to your Comfy workflow, but you have the advantage that everything you generate is automatically saved to your computer, so you don't have to use a web portal and download each image. So if you're trying to do a lot of image generation using these models, that helps speed it up a lot. And I've been using a new node, too, where you can do batch generation. So you can give it a text file of a bunch of different prompts and then have it generate each prompt. And so that's kind of handy for just getting a bunch of stuff out to look around and explore and brainstorm rapidly. Also, I think because, like you said, you're on a MacBook — you're not on an NVIDIA GPU — you can still have the flexibility of building your own stuff, but then run it on the cloud. Exactly. You just access cloud hardware to generate stuff. I've been finding this to be pretty handy. That's awesome. I've never actually used the API nodes, but I'm sure this is the way to go with the project that you're on right now. Yeah, exactly. And then — we've talked about Comfy and building out these models — it's still kind of a base question. Like, if you were to compare running this text-to-image workflow using Stable Diffusion or even FLUX with the more accessible tools — like what we have on Runway, like FLUX Kontext on the cloud with references — how useful or how necessary is it still to train a LoRA or build out a complicated workflow, when we have the ChatGPT interface where you can just give it an image of someone and be like, put this person in a car, or Runway References, and you can just do the same thing, conversational, with a single image.
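Here's a rough sketch of that batch-prompt idea driven from a script. ComfyUI can be queued over its local HTTP endpoint, and the sketch assumes you've exported your workflow with "Save (API Format)"; the port, the node id for the positive prompt ("6" here), and the file names are all placeholders you'd need to check against your own export.

```python
import json
import urllib.request

# Workflow exported from ComfyUI via "Save (API Format)" -- a JSON map of node ids to inputs.
with open("workflow_api.json") as f:
    workflow = json.load(f)

# One prompt per line in a plain text file.
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

for prompt in prompts:
    # "6" is a placeholder node id for the positive CLIP Text Encode node;
    # look up the real id in your own exported JSON.
    workflow["6"]["inputs"]["text"] = prompt

    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",            # default local ComfyUI address (assumed)
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)                # queue it; Comfy saves the outputs itself
```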
You don't have to train a whole LoRA. How useful is Comfy in these workflows with the advancement of the other tools that we're seeing? Yeah, great question. I think it all comes down to your use case and what you're trying to do. Every AI creator is going to end up doing a hybrid approach, where they do some of the stuff in Comfy and some of the stuff on the commercially available tools like Runway or Luma, and then they combine the results in their own unique ways. For me, I find that with LoRAs that you build on your own, you have really specific control over the captions, over the training set, and so you directly influence the quality of that LoRA. And then, when you're doing an image-to-image workflow, you're using depth maps, Canny edge detection, all those things. So you have direct influence and control. The ChatGPT stuff is good enough for impressing somebody or doing a social media post. But when you're getting paid, when you're a commercial creator, right — some brand is paying you to make something — you want the highest level of control and consistency. Do you feel like you're obligated, because this looks more complicated, that we have to use this because we've got more nodes? No, I don't think that. I mean, I feel like, yeah, sometimes it's like, oh man, it feels too easy on ChatGPT. Yeah. This should have been more complicated, like, I'm a professional. I'll put it this way — ChatGPT still, when I want to give it a character reference image and just be like, hey, put this person in a car, it still does a really good job. It does a good job. I would say it gets you most of the way there, and sooner or later the stuff that we're building here is gonna get gobbled up by the big commercial services, right? Like FLUX Kontext is now essentially making LoRAs for you, right? And now they do have a version that you can download and run locally. Correct. Which, last time we talked about that, wasn't available yet, but now it is available. Yeah. I would say with the Comfy stuff, if you want to be absolutely cutting edge and have the highest level of control, you're like six months to a year ahead of the commercially available tools. So if you're just not getting the result that you want out of ChatGPT image generation, then come back to Comfy and try it here. Or if I need to do volume, like a lot of stuff. Yeah. I had another workflow idea, too, with the API nodes. You could have one prompt and connect it to every image generator node, and instantly spin it up and see, OK, with this prompt, what am I getting from each tool, and test it out. And to be fair, there are services now that let you do that. So Weavy is one of them, Freepik we talked about, then LTX Studio, Flora AI. There are like five or six node-based ones. The thing is, I kind of wish Flora was a little bit more like Comfy, because in Comfy you can build out your workflow and then hit run and it moves through the entire workflow. Flora was a bit more like you kind of had to run each node step by step. Everybody is after the Comfy business. This is what I mean by that: first of all, they'd all have to make their tools free to match the Comfy business model. They know that professionals want control and consistency and repeatability, and Comfy gives you that. The node-based stuff can be, like you said, just a little bit spaghetti for most people.
So Invoke, Weavy, LTX, Flora — they're all making Comfy-like features that are perhaps a little bit of a simpler user experience. I mean, I feel like a good example of the top level of what you can do with Comfy, and some of the crazy workflows, is to look at Mickmumpitz, who we've talked about before, where he's built crazy, complicated, massive Comfy workflows for character consistency — creating one-shot character sheets. But actually, that is a pretty good use case of Comfy that no other tool really does right now: stylizing renders out of Blender, adding LoRAs to characters. I think Mick does a really good job of combining traditional 3D with new AI rendering. Yeah. So that's something that is quite difficult to do in ChatGPT, right? Like, yeah, you can upload an image from Unreal and da-da-da-da. But then Unreal has something called ComfyUE now, so you can take screenshots from Unreal right into the Comfy instance. Oh, OK. Good to know, because I know Blender has had that workflow for a bit, and I've seen other people connect Cinema 4D and other tools. It's all about speed and throughput. Like, as an artist, you're going to be in charge of a bunch of shots or whatever. Yeah, you can't just go to Runway every time — it's going to be too cumbersome, and the cost, too. Because if you have the hardware and you can just spin this up on your local machine, that can save more money versus having to pay for API credits every time you want to generate. A hundred percent. Yeah. All right. I think that's a place to wrap it up. Links in the video for the Comfy install — we'll put that at the Denoised podcast dot com. But yeah, let us know if you have used Comfy, and where you kind of land on Comfy in this debate, because this is a conversation we keep going back and forth on: how useful is Comfy versus how much better the tools are getting at just being like, I want this — and you tell it what you want, and it does it. So let us know where you land on that. Yeah. Six months from now, this could be a completely useless video. That's how fast this is moving. And also, let us know if this is the type of content you're looking for. We'd love to be able to go another step, a little bit more advanced into this, and give you even more of our knowledge here. Yeah. Yeah. Especially as we experiment more and find out more workflows that are useful. All right. Thanks, everyone. We'll catch you in the next episode.
