Stable Video Diffusion: High-Precision AI Video Generation

ChatGPT’s video generation is underwhelming: for the most part it just adds narration to a still image. So I’m looking into other ways to generate proper “footage”. There are several options, but this time I’ll try running “Stable Video Diffusion” locally.

Overview of Stable Video Diffusion

Know your opponent before you start. Here is the official description.

stabilityai/stable-video-diffusion-img2vid-xt · Hugging Face

  • Stable Video Diffusion is a latent diffusion model developed by Stability AI that generates a short video from a single image (image-to-video). It produces 25-frame videos at a resolution of 576x1024.
  • According to the user preference study on the model card, raters preferred the video quality of Stable Video Diffusion’s image-to-video output over alternatives such as GEN-2 and PikaLabs.
  • The model only generates short clips (less than 4 seconds), motion may be limited, and people may not be rendered accurately.
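
The "less than 4 seconds" limit follows directly from the frame count: 25 frames do not last long at a low frame rate. Here is a rough sketch of the arithmetic. The 7 fps playback rate is my own assumption for illustration, not a figure from the summary above, so swap in whatever rate you export at.

```shell
# Clip length in seconds = frames / fps.
# clip_seconds is a hypothetical helper; fps=7 below is an assumed playback rate.
clip_seconds() {
  frames=$1
  fps=$2
  awk -v f="$frames" -v r="$fps" 'BEGIN { printf "%.2f\n", f / r }'
}

clip_seconds 25 7   # prints 3.57
```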

Try the Official Demo

Prerequisites

Procedure

I don’t want to pollute my local environment, so I’ll do everything in Docker.

  1. Create docker image
    • Set it up so that the demo starts automatically when the container launches
  2. Launch container (Run demo)

1. Create docker image

First, create a Dockerfile.

FROM python:3.10.14-bookworm
# Hugging Face login token (the build fails if this is not passed with --build-arg)
ARG token=""
# Clone the repository into the working directory
WORKDIR /home
RUN git clone https://github.com/Stability-AI/generative-models.git .
# Install the Python requirements
RUN pip install --no-cache-dir -r ./requirements/pt2.txt
# Install OpenCV via apt to work around a libGL.so error
RUN apt-get update && apt-get -y upgrade && apt-get -y install libopencv-dev
# Register the Hugging Face credentials (the token is required at build time)
RUN huggingface-cli login --token "$token"
# Clean up apt caches
RUN apt-get autoremove -y && \
    apt-get clean && \
    rm -rf /usr/local/src/*
# Command to run when the container starts
CMD ["python", "-u", "-m", "scripts.demo.gradio_app"]

Next, build the image. It takes quite a while, so have some tea while waiting.

# Build the image and tag it svd-image.
# Run this in the directory that contains the Dockerfile created above.
docker image build -t svd-image --build-arg token={HuggingFace_Login_Token} .
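
Since the build fails without a token, it can help to catch a missing one before kicking off a long build. Below is a minimal sketch that wraps the same build command. The function name build_svd_image and the environment variable HF_TOKEN are names I made up for this example. Also keep in mind that a value passed with --build-arg can show up in the image history, so treat the resulting image as private.

```shell
# Sketch: refuse to build when no token is available.
# HF_TOKEN and build_svd_image are hypothetical names used only in this example.
build_svd_image() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export your Hugging Face token first" >&2
    return 1
  fi
  # Same build command as above, with the token taken from the environment
  docker image build -t svd-image --build-arg token="$HF_TOKEN" .
}

# Usage (from the directory that contains the Dockerfile):
#   export HF_TOKEN=...   # your own token
#   build_svd_image
```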

2. Launch container

  • The --gpus option gives the container access to the host’s GPU resources.
  • Startup takes a long time and, for some reason, very little is logged, which is unnerving, but please wait patiently.
  • Once the app has started, open the public URL it displays in a browser to use the demo application.

# Run a container from the image built above
docker run --rm -i --gpus all svd-image

ENJOY!

In my case, all that effort went down the drain with an out-of-memory error… Apparently 8 GB of VRAM plus the 8 GB of RAM assigned to Docker is not enough.
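
To avoid finding this out only after a long build, you can check how much memory is actually available beforehand. The sketch below assumes an NVIDIA GPU with nvidia-smi installed on the host and a running Docker daemon. The helper names are hypothetical. I don't know the real minimum, so it only reports the numbers without judging them.

```shell
# Convert a byte count to whole GiB (integer division).
bytes_to_gib() {
  echo $(( $1 / 1073741824 ))
}

# Memory available to the Docker daemon, in GiB.
docker_mem_gib() {
  bytes_to_gib "$(docker info --format '{{.MemTotal}}')"
}

# Total VRAM in MiB, one line per GPU.
gpu_vram_mib() {
  nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
}

# Usage:
#   docker_mem_gib
#   gpu_vram_mib
```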

[Update: 2024/05/20] I got Stable Video Diffusion running in my environment using the method in the article below, so here is the link.

[Stable Video Diffusion] Let's Play with Video Generation AI Using ComfyUI (blog.otama-playground.com)