Stable Diffusion 3 Paper: Moving Beyond UNet to DiT Architecture

The Stable Diffusion 3 research paper was published around May, so I decided to take a look.

Stable Diffusion 3: Research Paper — Stability AI

Following our announcement of the early preview of Stable Diffusion 3, today we are publishing the research paper which outlines the technical details of our upcoming model release, and invite you to sign up for the waitlist to participate in the early preview.

stability.ai

What’s New in the Rectified Flow Model

From what I can tell, the key novelties are roughly the following:

Improved Text Encoder
Improved noise scheduler
- Uses techniques like Logit-Normal and Mode sampling instead of the conventional linear schedule
Introduction of a new Transformer-based architecture (Diffusion Transformer, DiT)
Demonstrated performance exceeding state-of-the-art models in txt2img

Architecture of the Rectified Flow Model

To summarize briefly: the biggest change is that UNet has been replaced by MM-DiT. Everything else is essentially about increasing model size and parameters to improve performance.

Text Encoder

Previous Stable Diffusion

Used a CLIP text encoder (an encoder pre-trained on large-scale image-text pairs) to encode text prompts and feed them into the image generation model.

Stable Diffusion 3

Uses a combination of CLIP and T5 text encoders: CLIP L/14, OpenCLIP bigG/14, and T5-v1.1-XXL. The outputs from each encoder are concatenated and used together.

UNet

Previous Stable Diffusion

Used the UNet architecture as the core of the generative model — a multi-layer convolutional neural network that performs encoding and decoding of the image.

Stable Diffusion 3

Adopts DiT (Diffusion Transformer) instead of UNet. DiT is a Transformer-based model capable of processing both text and image tokens together. Specifically, it uses MM-DiT (Multimodal-DiT) blocks, which process text and image tokens with separate weights while achieving bidirectional information flow between the two.

VAE

Previous Stable Diffusion

Used a VAE (Variational Autoencoder) to compress images into a low-dimensional latent space, then used those latent representations in the generation process. This enabled efficient image generation and high-resolution output.

Stable Diffusion 3

Similarly uses a pre-trained autoencoder, but the key difference is an increased number of dimensions in the latent space. This improved representational capacity leads to better performance.

MM-DiT Processing

The MM-DiT architecture is designed to handle both text and images. Here is a simplified overview of the processing flow.

Pre-processing

1. Image Encoding (image embedding)

The original RGB image is converted into a low-dimensional latent representation using a pre-trained VAE. The resulting latent representation is split into 2×2 patches along with spatial positional encodings.

2. Text Encoding (text embedding)

Text is converted into embedding vectors using pre-trained models (CLIP and T5). The outputs used are CLIP’s output and T5’s final hidden state.

MM-DiT Block

3. Modulation and Linear Transformation

In multimodal models, it’s important to prevent the model from becoming biased toward any single modality. Adjustments are made through normalization and similar operations to keep values in balance.

Modulation: Scales embeddings and adjusts bias, using information from a noise level token to adjust processing.
- Note: I couldn’t find an explicit description of the noise level token in the paper, but it likely corresponds to the noise scheduling during training and indicates how much noise is present at the current step.
Linear: Linearly transforms the embedding vector, compressing its dimensions into a shape suitable for the next input.

4. Concatenation of Latent Representations

The outputs from steps 1 and 2 are concatenated into a single sequence.

5. Joint Attention

The concatenated sequence is fed into QKV Attention, where mutual information is shared between both embeddings to generate an integrated representation.

6. Separation into Text Stream and Image Stream

Since the positions of image and text tokens are clearly defined, the tokens are separated from the attention output back into their original positions.

7. Further Linear Transformation and Modulation

The two streams from step 6 are processed and transformed into a shape suitable for the next MLP. This allows both streams to emphasize their respective features while being adjusted as needed.

Noise Level Token is again used to adjust processing here.

8. MLP

The data after linear transformation and modulation is passed through multiple layers including non-linear transformations for further processing, enabling the learning of more complex features to pass to the next block.

9. Repeat

The entire block process is repeated for as many stacked MM-DiT blocks as there are.

Post-processing

10. Image Generation from Latent Representation

Finally, the latent representation output from the MM-DiT blocks is decoded by the VAE to obtain the image.

Training Method of the Rectified Flow Model (SD3)

Noise Schedule and Sampling Method

Previous Stable Diffusion

Uses a fixed noise schedule during training (e.g., linear or cosine schedule). Learns to reverse the diffusion process — from noise back to data — through the forward diffusion process from data to noise.

Stable Diffusion 3

The Rectified Flow model introduces a new noise schedule aimed at more efficiently sampling noise at specific timesteps. It uses techniques such as Logit-Normal and Mode sampling to adjust the noise scale, improving training effectiveness at intermediate steps.

Performance Evaluation

The model achieves better scores than existing models on metrics like FID and CLIP. Honestly, just looking at the numbers doesn’t tell me much, so I’ll skip the rest of the evaluation details.

Performance comparison table from the SD3 paper

I’ll also quote the generated images from the paper. It appears that prompts written as natural sentences can now be used for generation. Additionally, text is generated without collapse, which is impressive.

Generated image examples from the SD3 paper

Three Model Variants

Stable Diffusion 3 comes in three model variants. The difference between them is likely the number of stacked MM-DiT blocks. The table below shows the parameter counts. The paper mentions 8B parameters for the 38-block case, so Ultra is listed as 38 blocks.

Model	Block Count	Parameters
Small	–	800M
Medium	–	2B
Large	–	4B
Ultra	38	8B

Conclusion

Key points to remember from a user perspective:

Improved text understanding in the Text Encoder
- Can now interpret sentence-style prompts
Text generation without collapse looks achievable
Multimodal (text and image) generation supported by default
Improved generation accuracy
Increased parameters and required models
- Likely to be somewhat heavier and slower

There are still parts around the MM-DiT block I haven’t fully grasped, but I feel like I’ve got a solid external view of the model, so I’ll stop reading here for now.

The model was also released recently. Check the article below where I introduce the procedure for using Stable Diffusion 3.

ComfyUIを使って画像生成AIで遊んでみよう【Stable Diffusion 3編】

blog.otama-playground.com

Other papers I’ve read around Stable Diffusion are summarized in the following article — please check it out if you’re interested.

Stable Diffusionガイド：論文読みリンク集

Stable Diffusion関連の論文解説記事のリンク集。画像生成・動画生成の基礎モデルから応用技術まで論文ベースで解説。

blog.otama-playground.com

Stable Diffusion 3 Paper: Moving Beyond UNet to DiT Architecture

What’s New in the Rectified Flow Model

Architecture of the Rectified Flow Model

Text Encoder

UNet

VAE

MM-DiT Processing

Pre-processing

MM-DiT Block

Post-processing

Training Method of the Rectified Flow Model (SD3)

Noise Schedule and Sampling Method

Performance Evaluation

Three Model Variants

Conclusion

Related Posts

Run on 6GB VRAM! Trying the 'FramePack' Video Gen AI Demo

StableDiffusion i2i Comparison: Latent vs ControlNet vs IPAdapter

Image Generation AI with ComfyUI: Face Detailer Edition

Generate Images from Scribbles with Krita & StableDiffusion

Image Generation AI with ComfyUI: Inpaint Edition

Image Generation AI with ComfyUI: Outpaint Edition

Image Generation AI with ComfyUI: Textual Inversion Edition

SD-Turbo & SDXL-Turbo: Overview and Local Demo Guide

Stable Video Diffusion: High-Precision AI Video Generation

Video Generation AI with ComfyUI: RIFE Edition