Textual Inversion Guide: Controlling Stable Diffusion Prompts

This time I will introduce Textual Inversion (embedding). Textual Inversion is a technique to control output of Stable Diffusion from aspect of language, and by using it, it becomes possible to bring output closer to favorite art style or composition. In short, it means it becomes easier to output picture you want to output.

Heating only this, some might wonder what is different from LoRA, and exactly as that, what Textual Inversion and LoRA are trying to do is almost same. What differs is from which angle they try to control, LoRA tries to control output from aspect of architecture, Textual Inversion from aspect of input prompt (more specifically Text Encoder).

In this article I will check about mechanism of Textual Inversion.

What is embedding

Before entering explanation of Textual Inversion, I insert explanation about embedding which becomes important in this paper.

Overview of embedding

Embedding is what converted word (prompt) to mathematical vector, with this base model becomes able to understand meaning of prompt. For example, by converting word “cat” to list of numbers computer can understand, it becomes possible to compare numerically whether it is similar compared to other words.

In Stable Diffusion model, vector output from Text Encoder corresponds to embedding as in figure below.

Processing of embedding

In Stable Diffusion model, embedding is usually generated and processed through following process.

Input of Prompt:
- First, user inputs prompt (instruction) like “Draw picture of cat”.
Processing of Text Encoder:
- Prompt is sent to Text Encoder (e.g. BERT or CLIP). Text Encoder converts words of prompt to list of mathematical numbers, i.e. “Language Vector (embedding)”. Language Vector is thing to make computer easy to understand meaning of words.
Generation of Language Vector:
- Text Encoder uses learned weights (information model learned with many data) to generate embedding reflecting content of prompt. For example if word “cat” is included, number list having features related to that word is generated.
Input to Base Model:
- Generated embedding is input to base model like Stable Diffusion. This base model interprets embedding and generates image based on instruction.

About Textual Inversion

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Our code, data and new words will be available at: https://textual-inversion.github.io

arxiv.org

Overview of Textual Inversion

Textual Inversion is a technique to change direction of generated embedding by training Text Encoder. (Please refer to image below for position of Text Encoder.) In other words, learn new concept or word, generate new language vector matching it, and make it form interpretable by base model. Thus, model becomes able to generate images using new word or concept.

Learning Method of Textual Inversion

Data Collection:
- Collect 3-5 images having specific concept and corresponding text description. For example collect “Portrait of Ieyasu Tokugawa” and its description text.
Embedding Generation:
- Add new token to existing Text Encoder (e.g. CLIP).
- Using pair of image and text, learn embedding of new token.
Embedding Optimization:
- Optimize (Learn) generated embedding to match existing embedding space of model.
- (At this time weights of original model remain frozen, learn only embedding)

Evaluation Result

You can see output changes just by changing image used for learning.

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or, 2022, An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion, https://arxiv.org/abs/2208.01618

Conclusion

Textual Inversion is very useful technology in utilizing Stable Diffusion. By understanding this technique and actually applying, it becomes possible to generate images exactly as you envision. By grasping difference with LoRA and choosing appropriate method, you can obtain result with even higher precision. I am happy if understanding about basic mechanism of Textual Inversion deepened through this article.

Method to try Textual Inversion is explained in following article. Please read here if you want to try.

【Stable Diffusion】ComfyUIを使って画像生成AIで遊んでみよう【Textual Inversion編】

blog.otama-playground.com

Since papers read around Stable Diffusion are summarized in following article, please utilize if interested.

Stable Diffusionガイド：論文読みリンク集

Stable Diffusion関連の論文解説記事のリンク集。画像生成・動画生成の基礎モデルから応用技術まで論文ベースで解説。

blog.otama-playground.com

Textual Inversion Guide: Controlling Stable Diffusion Prompts

What is embedding

Overview of embedding

Processing of embedding

About Textual Inversion

Overview of Textual Inversion

Learning Method of Textual Inversion

Evaluation Result

Conclusion

Related Posts

Video Frame Rate Enhancement: RIFE and its Architecture

Stable Diffusion 3 Paper: Moving Beyond UNet to DiT Architecture

Stream Diffusion: Real-Time Video and Image Generation

IPAdapter Explained: Use Images as Prompts for Stable Diffusion

AnimateDiff: Lightweight Video Extension for Stable Diffusion

ControlNet Basics: Posture Control with Stable Diffusion

What is LoRA? Low-Cost Optimization for Large Models

Stable Diffusion: Understanding the Image Generation Mechanism

Endless Automated Refactoring with Codex and Temporal

Run on 6GB VRAM! Trying the 'FramePack' Video Gen AI Demo