[Image Diffusion] Generating Video from Text with Tune-A-Video

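This article introduces Tune-A-Video, a one-shot tuning method for image diffusion models, published in December 2022, that generates video from a text prompt.

We will give it a text prompt and generate a video ourselves.

Everything runs on Google Colab with only a few cells, so please read through to the end.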

In this article

・What Tune-A-Video is

・Setup

・Training

・Inference

1. What Is Tune-A-Video?
2. Setting Up Tune-A-Video
2.1. Setup
2.2. Downloading the Model
3. Training
4. Inference (Running Tune-A-Video)
4.1. prompt = “a teddy bear is surfing”
5. Summary

What Is Tune-A-Video?

Tune-A-Video is a one-shot tuning method for image diffusion models that generates video from a text prompt.

For example, it can generate a video from the text "a panda is surfing".

Note that running Tune-A-Video requires the premium GPU class (A100 40GB) on Google Colab Pro/Pro+.

See the links below for details.

Code: https://github.com/showlab/Tune-A-Video
Paper: https://arxiv.org/abs/2212.11565

Setting Up Tune-A-Video

Setup

From here on, we work in Google Colab.


First, configure the runtime so that a GPU is available.

・Open "Change runtime type" and set "Hardware accelerator" to GPU
・Set the GPU class to Premium
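
Because the premium A100 is a hard requirement here, it can help to confirm which GPU the runtime actually received before going further. The following is a minimal sketch, not part of the original article: `query_gpu` is a hypothetical helper name, and it assumes the `nvidia-smi` command is on the PATH, as it normally is on Colab GPU runtimes.

```python
import shutil
import subprocess


def query_gpu():
    """Return the GPU name and total memory reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


# Prints one line per GPU, or None when no GPU is attached
print(query_gpu())
```

If this prints `None` or a GPU other than an A100, revisit the runtime settings above before starting training.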

Next, mount Google Drive.

from google.colab import drive

# Mount Google Drive so the repository and checkpoints persist across sessions
drive.mount('/content/drive')
%cd /content/drive/MyDrive

Clone the official repository.

!git clone https://github.com/showlab/Tune-A-Video
%cd Tune-A-Video

Install the required libraries.

!pip install -r requirements.txt

Install xformers as well. The prebuilt wheel used here is tagged cp38, so it targets Python 3.8 on Linux x86_64.

!pip install https://github.com/brian6091/xformers-wheels/releases/download/0.0.15.dev0%2B4c06c79/xformers-0.0.15.dev0+4c06c79.d20221205-cp38-cp38-linux_x86_64.whl
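
The wheel file name above carries a `cp38` tag, so pip will refuse it on a runtime whose Python version differs. As a rough illustration (the helper names `wheel_python_tag` and `runtime_python_tag` are hypothetical and not part of the article), the tag can be compared against the running interpreter like this:

```python
import sys


def wheel_python_tag(wheel_filename):
    """Extract the Python tag (e.g. 'cp38') from a wheel file name."""
    stem = wheel_filename[: -len(".whl")] if wheel_filename.endswith(".whl") else wheel_filename
    # Wheel names end with <python tag>-<abi tag>-<platform tag>
    return stem.split("-")[-3]


def runtime_python_tag():
    """Build the CPython tag of the running interpreter, e.g. 'cp38' for Python 3.8."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


wheel = "xformers-0.0.15.dev0+4c06c79.d20221205-cp38-cp38-linux_x86_64.whl"
print("wheel tag:", wheel_python_tag(wheel))
print("runtime tag:", runtime_python_tag())
```

If the two tags differ, this particular wheel will not install and an xformers build matching the runtime would be needed instead.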

That completes the setup.

Downloading the Model

Create a folder named "checkpoints" and download "CompVis/stable-diffusion-v1-4" into it.

import os
os.makedirs("./checkpoints", exist_ok=True)
%cd ./checkpoints

!git lfs install
!git clone https://huggingface.co/CompVis/stable-diffusion-v1-4

%cd ..
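
A partial clone can be hard to notice until training fails, so a quick check of the downloaded folder may be worthwhile. The sketch below is an illustration only: `missing_components` is a hypothetical helper, and the list of expected entries is an assumption based on the usual diffusers-format layout of Stable Diffusion checkpoints.

```python
from pathlib import Path

# Assumed layout: sub-folders of a diffusers-format Stable Diffusion checkpoint
EXPECTED_DIRS = ["scheduler", "text_encoder", "tokenizer", "unet", "vae"]


def missing_components(checkpoint_dir):
    """Return the expected entries that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    if not (root / "model_index.json").is_file():
        missing.append("model_index.json")
    return missing


# An empty list means every expected entry was found
print(missing_components("./checkpoints/stable-diffusion-v1-4"))
```

This only checks that the entries exist. It does not verify that git lfs fetched the actual weight files rather than pointer files.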

Training

Next, run training with the repository's configs/man-surfing.yaml.

%%time

!accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"

In this run, training finished in roughly 8 minutes.
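
In the author's run, the results were written to a timestamped folder under ./outputs (the exact path appears in the inference code). Since the timestamp differs on every run, a small helper can look up the newest folder instead of typing it by hand. This is a sketch under that assumption: `latest_run_dir` is a hypothetical name, and it relies on the timestamp format sorting correctly as plain text.

```python
from pathlib import Path


def latest_run_dir(experiment_dir):
    """Return the most recent timestamped sub-folder of experiment_dir, or None."""
    root = Path(experiment_dir)
    if not root.is_dir():
        return None
    # Names such as 2023-01-31T12-08-55 sort chronologically as strings
    runs = sorted((p for p in root.iterdir() if p.is_dir()), key=lambda p: p.name)
    return runs[-1] if runs else None


# Prints None until a training run has produced output
print(latest_run_dir("./outputs/man-surfing_lr3e-5_seed33"))
```

The returned path can be converted with `str(...)` and used wherever the trained model folder is needed.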

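Inference (Running Tune-A-Video)

Now let's actually create a video from text.

Set "model_id" to the folder where the training results above were saved.

Then describe the video you want to generate in "prompt" and run the cell.

The example below uses "a panda is surfing".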

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

# Folder written by the training run above; replace with your own output path
model_id = "./outputs/man-surfing_lr3e-5_seed33/2023-01-31T12-08-55"
# Base model: use the same Stable Diffusion checkpoint that was downloaded for fine-tuning
pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"

# Load the fine-tuned 3D UNet and plug it into the pipeline in half precision
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")

prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=100, guidance_scale=7.5).videos

# Save the generated frames as an animated GIF
save_videos_grid(video, "outputs/sample2.gif")
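
When trying several prompts, such as the "a teddy bear is surfing" prompt listed in the table of contents, a fixed output name like sample2.gif gets overwritten on each run. One option is to derive the file name from the prompt. The helper below is a hypothetical illustration, not part of the Tune-A-Video repository:

```python
import re


def prompt_to_filename(prompt, out_dir="outputs"):
    """Turn a prompt into a GIF path, e.g. 'a panda is surfing' -> 'outputs/a-panda-is-surfing.gif'."""
    # Collapse every run of characters other than a-z and 0-9 into a single hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    return f"{out_dir}/{slug}.gif"


for p in ["a panda is surfing", "a teddy bear is surfing"]:
    print(prompt_to_filename(p))
```

The resulting path can be passed as the second argument of `save_videos_grid` in place of the fixed file name.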