Stable Diffusion i2i Comparison: Latent vs ControlNet vs IPAdapter

When I wanted to generate images from images (image-to-image, i2i) with Stable Diffusion, I wasn't sure which method to use, so I generated images with each method and compared the results.

This article focuses on the three main image-to-image methods:

  1. Method 1: Input the image as a Latent Image
  2. Method 2: Use ControlNet
  3. Method 3: Input the image as a prompt using IPAdapter

Method 1: Input the Image as a Latent Image

Overview

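The image is encoded with the VAE to obtain a latent representation (Latent Image), which is then used as the starting point for Stable Diffusion's generation. Set the denoise strength somewhat below 1; at 1 the input latent is effectively overwritten by noise, and the original image has almost no influence on the result.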

The color tone and the rough placement of colors are preserved. The main drawback is that, depending on the prompt and parameters, the result can end up quite different from the original image.
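To get a feel for why the denoise value matters so much, here is a minimal sketch of how a denoise strength could map to a starting noise level. It uses the textbook DDPM forward-noising weights, not ComfyUI's actual sampler code (which works on a sigma schedule). The function names are hypothetical, and the beta range is an assumption based on the commonly used Stable Diffusion v1 "scaled linear" schedule.

```python
import math


def alpha_bar_schedule(num_train_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative products of (1 - beta) for a 'scaled linear' beta schedule.

    The default beta range is an assumption (typical SD v1 values).
    """
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alpha_bars, running = [], 1.0
    for i in range(num_train_steps):
        beta = (lo + (hi - lo) * i / (num_train_steps - 1)) ** 2
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def start_point(denoise, num_train_steps=1000):
    """Return (timestep, signal_weight, noise_weight) for a denoise strength in (0, 1].

    The noised latent would be: signal_weight * latent + noise_weight * gaussian_noise.
    """
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in (0, 1]")
    alpha_bars = alpha_bar_schedule(num_train_steps)
    t = max(1, round(denoise * num_train_steps)) - 1
    return t, math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])


for strength in (0.3, 0.65, 1.0):
    t, signal, noise = start_point(strength)
    print(f"denoise={strength}: start at t={t}, signal={signal:.3f}, noise={noise:.3f}")
```

The lower the denoise value, the larger the signal weight, so more of the original latent survives into sampling. That is consistent with the observation that colors and rough layout carry over while details are free to change.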

In ComfyUI, you can generate images by connecting the nodes as shown below:

Method 1: Input as Latent Image
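Since the screenshot itself is not reproduced here, the following is a rough sketch of what such a graph might look like in ComfyUI's API-style JSON, written as a Python dict. The node class names are ComfyUI built-ins, but the wiring is a guess at a typical latent i2i graph rather than the exact workflow in the screenshot. The file names, seed, steps, cfg, sampler, scheduler, and the `dangling_links` helper are all placeholders or assumptions; only the blank prompt and `denoise: 0.65` come from the settings used in this article.

```python
import json

# Each link is [source_node_id, output_index].
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "flat2DAnimerge.safetensors"}},  # hypothetical file name
    "2": {"class_type": "LoadImage",
          "inputs": {"image": "input.png"}},  # hypothetical file name
    # Encode the input pixels into a latent with the checkpoint's VAE (output 2).
    "3": {"class_type": "VAEEncode",
          "inputs": {"pixels": ["2", 0], "vae": ["1", 2]}},
    # Blank positive and negative prompts.
    "4": {"class_type": "CLIPTextEncode", "inputs": {"text": "", "clip": ["1", 1]}},
    "5": {"class_type": "CLIPTextEncode", "inputs": {"text": "", "clip": ["1", 1]}},
    "6": {"class_type": "KSampler",
          "inputs": {
              "model": ["1", 0],
              "positive": ["4", 0],
              "negative": ["5", 0],
              "latent_image": ["3", 0],  # the encoded image, not an empty latent
              "seed": 0, "steps": 20, "cfg": 7.0,  # assumed values
              "sampler_name": "euler", "scheduler": "normal",  # assumed values
              "denoise": 0.65,
          }},
    "7": {"class_type": "VAEDecode",
          "inputs": {"samples": ["6", 0], "vae": ["1", 2]}},
    "8": {"class_type": "SaveImage",
          "inputs": {"images": ["7", 0], "filename_prefix": "i2i_latent"}},
}


def dangling_links(graph):
    """Return (node_id, input_name) pairs whose link points at a missing node."""
    missing = []
    for node_id, node in graph.items():
        for name, value in node["inputs"].items():
            if isinstance(value, list) and value[0] not in graph:
                missing.append((node_id, name))
    return missing


print(json.dumps(workflow["6"], indent=2))
```

The key difference from a plain text-to-image graph is that the sampler's `latent_image` input comes from `VAEEncode` instead of an empty latent, and `denoise` is set below 1.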

Generation Result

  • prompt: (blank)
  • model: flat2DAnimerge
  • denoise: 0.65
Method 1: Latent Image result

The original image is in the top-left.

Method 2: Use ControlNet

Overview

ControlNet is a method for specifying the composition of the generated image. It takes control data such as line art, OpenPose skeletons, or depth maps as the definition of the composition. For image-to-image, the input image is first converted into one of these representations, which is then passed to ControlNet to guide generation.

With this method the composition closely follows the original image, but the color tone is not preserved; you specify colors through the text prompt instead.
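The loss of color follows from the preprocessing step. As a toy illustration, the sketch below uses a plain Sobel edge detector as a stand-in for a line art preprocessor (the real preprocessors are learned models, so this is only an analogy). The helper names and test images are hypothetical. Two images with the same shape but completely different colors produce the same edge map, so nothing about color reaches ControlNet.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def edge_map(rgb_image, threshold=30.0):
    """Binary Sobel edge map; border pixels are left at 0."""
    gray = to_gray(rgb_image)
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = ((gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1])
                  - (gray[y - 1][x - 1] + 2 * gray[y][x - 1] + gray[y + 1][x - 1]))
            gy = ((gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1])
                  - (gray[y - 1][x - 1] + 2 * gray[y - 1][x] + gray[y - 1][x + 1]))
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def square_on_background(size, fg, bg):
    """A centered square of color fg on a background of color bg."""
    lo, hi = size // 4, size - size // 4
    return [[fg if lo <= x < hi and lo <= y < hi else bg for x in range(size)]
            for y in range(size)]


red_on_white = square_on_background(8, (255, 0, 0), (255, 255, 255))
blue_on_yellow = square_on_background(8, (0, 0, 255), (255, 255, 0))

for row in edge_map(red_on_white):
    print("".join("#" if v else "." for v in row))
print(edge_map(red_on_white) == edge_map(blue_on_yellow))
```

Both inputs reduce to the same outline of a square, which is why the colors of the output have to come from the text prompt or the model's defaults.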

Image-to-image flow with ControlNet

In ComfyUI, the workflow looks like this:

Method 2: ControlNet workflow

For instructions on how to set up ControlNet, please refer to the following article:

ControlNet in ComfyUI: The Basics of Pose-Guided AI Image Generation (ComfyUIでControlNet:姿勢指定した画像生成AIの基本)
blog.otama-playground.com

Generation Results

  • prompt: (blank)
  • model: flat2DAnimerge
Method 2: ControlNet result (lineart)
Method 2: ControlNet result (openpose)
Method 2: ControlNet result (depth)

The original image is in the top-left.

Method 3: Use IPAdapter to Input the Image as a Prompt

Overview

IPAdapter is a method that uses an image as a prompt. An image encoder (a CLIP vision model, typically ViT-based) interprets the image, and that interpretation is passed to the Stable Diffusion model, which lets it generate images similar to the original. Because this path is separate from the Text Encoder, IPAdapter can also be combined with text prompts.
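One way to picture "separate from the Text Encoder" is the decoupled cross-attention idea described for IP-Adapter: the text tokens and the image tokens are attended to separately, and the two results are summed, with a weight on the image branch. The sketch below is a heavily simplified, single-query version in plain Python. The function names and toy vectors are hypothetical, and the tokens are reused directly as keys and values, whereas the real model applies separate learned projections for each branch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-query scaled dot-product attention."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


def decoupled_cross_attention(query, text_tokens, image_tokens, image_weight=1.0):
    """Text attention plus a weighted image attention branch.

    Simplification: tokens serve as both keys and values.
    """
    text_out = attention(query, text_tokens, text_tokens)
    image_out = attention(query, image_tokens, image_tokens)
    return [t + image_weight * i for t, i in zip(text_out, image_out)]


query = [1.0, 0.0]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]    # stand-in for text encoder output
image_tokens = [[0.5, 0.5], [2.0, -1.0]]  # stand-in for image encoder output

text_only = attention(query, text_tokens, text_tokens)
print(text_only)
print(decoupled_cross_attention(query, text_tokens, image_tokens, image_weight=0.0))
print(decoupled_cross_attention(query, text_tokens, image_tokens, image_weight=1.0))
```

With `image_weight=0.0` the output is identical to text-only attention, and raising the weight mixes in more of the image's influence. This is the intuition behind using an image prompt and a text prompt together.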

For instructions on how to use IPAdapter in ComfyUI, see the following article:

[Stable Diffusion] Let's Play with Image Generation AI Using ComfyUI [IPAdapter Edition] (【Stable Diffusion】ComfyUIを使って画像生成AIで遊んでみよう【IPAdapter編】)
blog.otama-playground.com

Generation Result

  • prompt: (blank)
  • model: flat2DAnimerge
Method 3: IPAdapter result

The original image is in the top-left.

Comparison

Here is a summary of the advantages and disadvantages of each method:

| Method | Advantages | Disadvantages |
|---|---|---|
| Method 1: Latent | No additional extensions needed; simple to use | Requires parameter tuning; results fail easily |
| Method 2: ControlNet | Reproduces the composition almost faithfully | Inflexible (especially the line art types); generation is slightly heavier; additional extensions needed |
| Method 3: IPAdapter | Roughly reproduces the composition while leaving room for adjustment; can also be adjusted with text prompts | Reproduction of the original image is inconsistent; additional extensions needed |

Conclusion

For this comparison I ran each method on its own to bring out its characteristics, but it seems they can also be used in combination. I think the best approach is to choose based on which advantages you need for the task at hand.