
VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE

(Demo: ground-truth (GT) clips shown side by side with their VAE reconstructions.)

Yazhou Xing*, Yang Fei*, Yingqing He*†, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen† (*equal contribution, †corresponding author)

A state-of-the-art Video Variational Autoencoder (VAE) designed for high-fidelity video reconstruction. This project leverages cross-modal and joint video-image training to enhance reconstruction quality.


✨ Features

  • High-Fidelity Reconstruction: Achieve superior image and video reconstruction quality.
  • Cross-Modal Reconstruction: Utilize captions to guide the reconstruction process.
  • State-of-the-Art Performance: Set new benchmarks in video reconstruction tasks.

(Table: quantitative comparison with state-of-the-art video VAEs.)

⏰ Todo

  • Release Pretrained Model Weights
  • Release Inference Code
  • Release Training Code

🚀 Get Started

Follow these steps to set up your environment and run the code:

1. Clone the Repository

git clone https://github.com/VideoVerses/VideoVAEPlus.git
cd VideoVAEPlus

2. Set Up the Environment

Create a Conda environment and install dependencies:

conda create --name vae python=3.10 -y
conda activate vae
pip install -r requirements.txt

📦 Pretrained Models

Model Name      Latent Channels   Download Link
sota-4z         4                 Download
sota-4z-text    4                 Download
sota-16z        16                Download
sota-16z-text   16                Download
  • Note: '4z' and '16z' indicate the number of latent channels in the VAE model. Models with 'text' in the name support caption (text) guidance.

📁 Data Preparation

To reconstruct videos and images using our VAE model, organize your data in the following structure:

Videos

Place your videos and optional captions in the examples/videos/gt directory.

Directory Structure:

examples/videos/
├── gt/
│   ├── video1.mp4
│   ├── video1.txt  # Optional caption
│   ├── video2.mp4
│   ├── video2.txt
│   └── ...
└── recon/
    └── (reconstructed videos will be saved here)
  • Captions: For cross-modal reconstruction, include a .txt file with the same name as the video, containing its caption.
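The pairing convention above (same stem, .txt next to .mp4) can be sketched with a small helper. This is a hypothetical illustration, not the repo's actual data loader; the function name collect_samples is our own:

```python
from pathlib import Path

def collect_samples(gt_dir):
    """Pair each video with its optional same-named .txt caption."""
    samples = []
    for video in sorted(Path(gt_dir).glob("*.mp4")):
        caption_file = video.with_suffix(".txt")
        # Caption is optional: None when no .txt file accompanies the video.
        caption = caption_file.read_text().strip() if caption_file.exists() else None
        samples.append((video, caption))
    return samples
```

Videos without a caption file simply get None, so the same directory works for both plain and caption-guided reconstruction.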

Images

Place your images in the examples/images/gt directory.

Directory Structure:

examples/images/
├── gt/
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
└── recon/
    └── (reconstructed images will be saved here)
  • Note: The images dataset does not require captions.

🔧 Inference

Our video VAE supports both image and video reconstruction.

Please ensure that the ckpt_path in all your configuration files is set to the actual path of your checkpoint.
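A quick way to catch a stale ckpt_path before running inference is to scan the configs and check that each referenced checkpoint exists. This is a hypothetical helper (the repo itself parses configs with a YAML loader); it assumes each config contains a plain line like `ckpt_path: /path/to/model.ckpt`:

```python
from pathlib import Path

def check_ckpt_paths(config_dir="configs/inference"):
    """Return (config name, ckpt_path, exists) for every ckpt_path entry
    found in the YAML configs under config_dir."""
    report = []
    for cfg in sorted(Path(config_dir).glob("*.yaml")):
        for line in cfg.read_text().splitlines():
            if line.strip().startswith("ckpt_path:"):
                # Strip the key, surrounding whitespace, and any quotes.
                ckpt = line.split(":", 1)[1].strip().strip("'\"")
                report.append((cfg.name, ckpt, Path(ckpt).is_file()))
    return report
```

Any entry reporting False points at a checkpoint that was not downloaded or whose path was not updated.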

Video Reconstruction

Run video reconstruction using:

bash scripts/run_inference_video.sh

This is equivalent to:

python inference_video.py \
    --data_root 'examples/videos/gt' \
    --out_root 'examples/videos/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --chunk_size 8 \
    --resolution 720 1280
  • If the chunk size is too large, you may run out of GPU memory; in that case, reduce the chunk_size parameter. The chunk_size must be divisible by 4.

  • To enable cross-modal reconstruction using captions, modify config_path to 'configs/config_16z_cap.yaml' for the 16-channel model with caption guidance.
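Chunked inference simply splits the frame sequence into consecutive windows of chunk_size frames. A minimal sketch of that splitting, including the divisible-by-4 constraint noted above (make_chunks is our own name, not a function from the repo):

```python
def make_chunks(num_frames, chunk_size=8):
    """Split frame indices into consecutive chunks for chunked VAE inference.

    Enforces the constraint from the inference script that chunk_size is
    divisible by 4; the final chunk may be shorter than chunk_size.
    """
    if chunk_size % 4 != 0:
        raise ValueError("chunk_size must be divisible by 4")
    return [list(range(i, min(i + chunk_size, num_frames)))
            for i in range(0, num_frames, chunk_size)]
```

Smaller chunks trade throughput for lower peak memory, since fewer frames are held on the GPU at once.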

Image Reconstruction

Run image reconstruction using:

bash scripts/run_inference_image.sh

This is equivalent to:

python inference_image.py \
    --data_root 'examples/images/gt' \
    --out_root 'examples/images/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --batch_size 1
  • Note: The batch size is set to 1 because the images in the example folder have varying resolutions. If all images in a batch share the same resolution, you can increase the batch size to speed up inference.
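If your images have mixed resolutions, one way to still use larger batches is to group them by resolution first. A minimal sketch under that assumption (batches_by_resolution is a hypothetical helper; resolutions are assumed to be known up front, e.g. read via PIL):

```python
from collections import defaultdict

def batches_by_resolution(images, batch_size=4):
    """Yield (resolution, paths) batches where every image in a batch
    shares one (width, height), so they can be stacked into a tensor.

    `images` is an iterable of (path, (width, height)) pairs.
    """
    groups = defaultdict(list)
    for path, size in images:
        groups[size].append(path)
    for size, paths in groups.items():
        for i in range(0, len(paths), batch_size):
            yield size, paths[i:i + batch_size]
```

Each yielded batch is then safe to process with a batch size greater than 1.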

📊 Evaluation

Use the provided scripts to evaluate reconstruction quality using PSNR, SSIM, and LPIPS metrics.
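Of the three metrics, PSNR is the simplest to state: it is a log-scaled ratio of the peak signal value to the mean squared reconstruction error. A minimal reference implementation for pixel values in [0, 1] (the evaluation scripts may normalize differently; SSIM and LPIPS need dedicated libraries and are omitted here):

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    err = mse(a, b)
    return float("inf") if err == 0 else 10.0 * math.log10(max_val ** 2 / err)
```

A perfect reconstruction gives infinite PSNR; typical high-quality video reconstructions score in the tens of dB.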

Evaluate Image Reconstruction

bash scripts/evaluation_image.sh

Evaluate Video Reconstruction

bash scripts/evaluation_video.sh

📝 License

This project is released under the CC BY-NC-ND license.
