https://stability.ai/news/stable-diffusion-3-research-paper

- Use techniques such as Logit-Normal and Mode sampling instead of the conventional linear schedule.
- Scale the embedding and adjust its bias. This processing is adjusted using the information in the noise level token.
- Note: I could not find a specific mention of a noise level token in the paper, but it probably corresponds to the noise schedule used during training and indicates how much noise is present at the current step.
- Apply a linear transformation to the embedding vector and compress its dimension into a shape suitable for the next input.
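The steps listed above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation. The two samplers follow the formulas as I understand them from the SD3 paper (a Gaussian sample passed through a sigmoid for Logit-Normal, and a cosine-based remapping of a uniform sample for Mode sampling). The `modulate` helper assumes a DiT-style form, `norm(x) * (1 + scale) + shift`, which the text above only describes as "scaling and bias adjustment". All function and parameter names (`sample_logit_normal`, `sample_mode`, `modulate`, `linear`, `m`, `s`) are hypothetical.

```python
import math
import random


def sample_logit_normal(m=0.0, s=1.0, rng=random):
    """Draw a timestep t in (0, 1): sample u ~ N(m, s), then apply a sigmoid."""
    u = rng.gauss(m, s)
    return 1.0 / (1.0 + math.exp(-u))


def sample_mode(s=1.0, rng=random):
    """Mode sampling with heavy tails (assumed form):
    t = 1 - u - s * (cos^2(pi/2 * u) - 1 + u), with u ~ U(0, 1).
    s = 0 reduces to plain uniform sampling.
    """
    u = rng.random()
    return 1.0 - u - s * (math.cos(math.pi / 2.0 * u) ** 2 - 1.0 + u)


def modulate(x, shift, scale, eps=1e-6):
    """Assumed adaLN-style modulation of one token vector:
    normalize x, then scale and shift it with values that would be
    derived from the conditioning (noise level) signal.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    return [n * (1.0 + sc) + sh for n, sc, sh in zip(normed, scale, shift)]


def linear(x, weight, bias):
    """Linear projection; weight holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


if __name__ == "__main__":
    rng = random.Random(0)
    print("logit-normal t:", sample_logit_normal(rng=rng))
    print("mode t:", sample_mode(s=1.0, rng=rng))
    token = modulate([1.0, 2.0, 3.0], shift=[0.0] * 3, scale=[0.1] * 3)
    # Project the 3-dimensional token down to 2 dimensions.
    print(linear(token, [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]))
```

Both samplers return a value between 0 and 1 that would then be used as the noise level for a training example. In a real model the `shift` and `scale` vectors would come from a small network applied to the conditioning signal rather than being passed in by hand.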
| Model | Depth | Parameters |
| :--- | :--- | :--- |
| Small | – | 800M |
| Medium | – | 2B |
| Large | – | 4B |
| Ultra | 38 | 8B |
Conclusion
From a user's perspective, the main points to remember are the following.
- The text encoder understands prompts better
- It interprets sentences in more depth
- It appears able to generate text (characters) in images without it breaking down
- Supports multimodal (text and image) generation by default
- Generation accuracy has improved
- The parameter count has increased, and more models are required
- It seems to be somewhat heavier and slower
There are still parts of the MM-DiT block that I have not fully understood, but I feel I have grasped the overall picture, so I will stop reading here for now.
The model appears to have been released just the other day. The article below walks through how to use Stable Diffusion 3, so please have a look if you are interested.
Let's Play with Image Generation AI Using ComfyUI [Stable Diffusion 3 Edition]
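The papers I have read on Stable Diffusion are collected in the following article, so please make use of it if you are interested.
Stable Diffusion Guide: Collection of Paper-Reading Links
A collection of links to articles explaining papers related to Stable Diffusion. It covers everything from foundation models for image and video generation to applied techniques, based on the papers themselves.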









