Stable Diffusion 3 Paper: Moving Beyond UNet to DiT Architecture

https://stability.ai/news/stable-diffusion-3-research-paper

  • Use sampling techniques such as Logit-Normal and Mode sampling instead of the conventional linear schedule.
  • Scale the embedding and adjust its bias. This processing is adjusted using the information in the noise-level token.
    • Note: I could not find a specific mention of a noise-level token in the paper. It probably corresponds to the noise schedule used during training and indicates how much noise is present at the current step.
  • Linearly transform the embedding vector and compress its dimensions into a shape suitable for the next input.
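The Logit-Normal and Mode sampling mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's training code: the function names `sample_logit_normal`, `mode_map`, and `sample_mode` are hypothetical, the Mode formula is written as I understand it from the paper, and the default values for `mean`, `std`, and `scale` are illustrative assumptions.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw a timestep t in (0, 1): sample z ~ N(mean, std), then apply a sigmoid.

    With mean=0 the samples concentrate around t=0.5 (intermediate noise levels).
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def mode_map(u, scale):
    """Map a uniform value u in [0, 1] to a timestep (assumed form of Mode sampling):

        f(u; s) = 1 - u - s * (cos^2(pi/2 * u) - 1 + u)

    scale=0 reduces to t = 1 - u, which is still uniform on [0, 1].
    """
    return 1.0 - u - scale * (math.cos(math.pi / 2.0 * u) ** 2 - 1.0 + u)


def sample_mode(rng, scale=1.29):
    """Draw a timestep by pushing a uniform sample through mode_map."""
    return mode_map(rng.random(), scale)


if __name__ == "__main__":
    rng = random.Random(0)
    print([round(sample_logit_normal(rng), 3) for _ in range(5)])
    print([round(sample_mode(rng), 3) for _ in range(5)])
```

Both functions return a value between 0 and 1 that would stand in for the timestep of one training example. The only difference from uniform sampling is which noise levels get drawn more often.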

| Model | Depth | Parameters |
| :--- | :--- | :--- |
| Small | – | 800M |
| Medium | – | 2B |
| Large | – | 4B |
| Ultra | 38 | 8B |

Conclusion

From a user's perspective, the points to remember are roughly the following.

  • Text understanding by the text encoders has improved
    • Prompts are interpreted in more depth
  • It seems able to render text in images without the characters collapsing
  • Multimodal (text and image) generation is supported by default
  • Generation accuracy has improved
  • The parameter count has increased, and more model files are required
    • It seems to be somewhat heavier and slower

There are still parts of the MM-DiT block that I have not fully grasped, but I feel I have understood the overall picture, so I will stop reading here for now.

The model appears to have been released just the other day. If you are interested, the article below walks through how to use Stable Diffusion 3.

Let's Play with Image Generation AI Using ComfyUI [Stable Diffusion 3 Edition]

blog.otama-playground.com

The Stable Diffusion papers I have read are collected in the following article. Please make use of it if you are interested.

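Stable Diffusion Guide: Collection of Paper-Reading Links

A collection of links to articles explaining Stable Diffusion-related papers, covering everything from foundation models for image and video generation to applied techniques, all explained from the papers themselves.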

blog.otama-playground.com