Qualitative comparison: Ground Truth (GT) vs. Reconstructed.
Yazhou Xing*, Yang Fei*, Yingqing He*†, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen† (*equal contribution, †corresponding author)
A state-of-the-art Video Variational Autoencoder (VAE) designed for high-fidelity video reconstruction. This project leverages cross-modal and joint video-image training to enhance reconstruction quality.
- High-Fidelity Reconstruction: Achieve superior image and video reconstruction quality.
- Cross-Modal Reconstruction: Utilize captions to guide the reconstruction process.
- State-of-the-Art Performance: Set new benchmarks in video reconstruction tasks.
- Release Pretrained Model Weights
- Release Inference Code
- Release Training Code
Follow these steps to set up your environment and run the code:
```bash
git clone https://github.com/VideoVerses/VideoVAEPlus.git
cd VideoVAEPlus
```
Create a Conda environment and install dependencies:
```bash
conda create --name vae python=3.10 -y
conda activate vae
pip install -r requirements.txt
```
| Model Name | Latent Channels | Download Link |
|---|---|---|
| sota-4z | 4 | Download |
| sota-4z-text | 4 | Download |
| sota-16z | 16 | Download |
| sota-16z-text | 16 | Download |
- Note: '4z' and '16z' indicate the number of latent channels in the VAE model. Models with 'text' support text guidance.
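After downloading, a quick sanity check can confirm the file is readable. The sketch below assumes the weights are a standard PyTorch checkpoint loadable with `torch.load`; the filename is a placeholder for whichever model you downloaded.

```python
import torch

# Placeholder filename -- point this at the checkpoint you actually downloaded.
ckpt = torch.load("sota-16z.ckpt", map_location="cpu")

# Checkpoints are commonly either a plain state dict or a dict with a "state_dict" entry.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"Loaded {len(state_dict)} parameter tensors")
```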
To reconstruct videos and images using our VAE model, organize your data in the following structure:
Place your videos and optional captions in the `examples/videos/gt` directory.
```
examples/videos/
├── gt/
│   ├── video1.mp4
│   ├── video1.txt   # Optional caption
│   ├── video2.mp4
│   ├── video2.txt
│   └── ...
├── recon/
│   └── (reconstructed videos will be saved here)
```
- Captions: For cross-modal reconstruction, include a `.txt` file with the same name as the video containing its caption.
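For illustration, the snippet below (a hypothetical helper, not part of the repository code) walks `examples/videos/gt` and pairs each video with its same-named `.txt` caption when one exists:

```python
from pathlib import Path

def collect_video_caption_pairs(gt_dir="examples/videos/gt"):
    """Pair each .mp4 with a same-named .txt caption, if present."""
    pairs = []
    for video in sorted(Path(gt_dir).glob("*.mp4")):
        caption_file = video.with_suffix(".txt")
        caption = caption_file.read_text().strip() if caption_file.exists() else None
        pairs.append((video, caption))
    return pairs

if __name__ == "__main__":
    for video, caption in collect_video_caption_pairs():
        print(video.name, "->", caption if caption else "(no caption)")
```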
Place your images in the `examples/images/gt` directory.
```
examples/images/
├── gt/
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
├── recon/
│   └── (reconstructed images will be saved here)
```
- Note: The image dataset does not require captions.
Our video VAE supports both image and video reconstruction.
Please ensure that the `ckpt_path` in all your configuration files is set to the actual path of your checkpoint.
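To double-check this before launching inference, a small script like the following can scan the inference configs. It is only a sketch: it assumes each config is a YAML file containing a `ckpt_path` key somewhere in its tree, which may differ from the actual config layout.

```python
import glob
import os

import yaml  # PyYAML

def find_ckpt_paths(node):
    """Recursively yield every 'ckpt_path' value found in a parsed YAML tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "ckpt_path":
                yield value
            else:
                yield from find_ckpt_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_ckpt_paths(item)

for config_file in sorted(glob.glob("configs/inference/*.yaml")):
    with open(config_file) as f:
        config = yaml.safe_load(f)
    for path in find_ckpt_paths(config):
        status = "OK" if path and os.path.isfile(str(path)) else "MISSING"
        print(f"{config_file}: ckpt_path={path} [{status}]")
```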
Run video reconstruction using:
```bash
bash scripts/run_inference_video.sh
```
This is equivalent to:
```bash
python inference_video.py \
    --data_root 'examples/videos/gt' \
    --out_root 'examples/videos/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --chunk_size 8 \
    --resolution 720 1280
```
- If the chunk size is too large, you may encounter memory issues. In this case, reduce the `chunk_size` parameter, and make sure the `chunk_size` remains divisible by 4.
- To enable cross-modal reconstruction using captions, change `config_path` to `'configs/config_16z_cap.yaml'` for the 16-channel model with caption guidance.
Run image reconstruction using:
```bash
bash scripts/run_inference_image.sh
```
This is equivalent to:
```bash
python inference_image.py \
    --data_root 'examples/images/gt' \
    --out_root 'examples/images/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --batch_size 1
```
- Note: The batch size is set to 1 because the images in the example folder have varying resolutions. If you have a batch of images with the same resolution, you can increase the batch size to accelerate inference.
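If you want to confirm that your own images share a resolution before raising `--batch_size`, the sketch below (illustrative only, assumes Pillow is installed) groups the files in `examples/images/gt` by size:

```python
from collections import defaultdict
from pathlib import Path

from PIL import Image  # Pillow

groups = defaultdict(list)
for image_path in sorted(Path("examples/images/gt").iterdir()):
    if image_path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
        with Image.open(image_path) as img:
            groups[img.size].append(image_path.name)

for (width, height), names in groups.items():
    print(f"{width}x{height}: {len(names)} image(s) can share a batch")
```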
Use the provided scripts to evaluate reconstruction quality using PSNR, SSIM, and LPIPS metrics.
```bash
bash scripts/evaluation_image.sh
bash scripts/evaluation_video.sh
```
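For reference, the snippet below shows how PSNR and SSIM can be computed for a single ground-truth/reconstruction pair with scikit-image. It is a sketch of the metrics themselves, not the repository's evaluation script, and the filenames are illustrative; LPIPS additionally requires the `lpips` package.

```python
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Illustrative filenames -- replace with an actual GT/reconstruction pair.
gt = np.array(Image.open("examples/images/gt/image1.jpg").convert("RGB"))
recon = np.array(Image.open("examples/images/recon/image1.jpg").convert("RGB"))

psnr = peak_signal_noise_ratio(gt, recon, data_range=255)
ssim = structural_similarity(gt, recon, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```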
This project is released under the CC-BY-NC-ND license.