Qualitative comparison: Ground Truth (GT) vs. Reconstructed.
Yazhou Xing*, Yang Fei*, Yingqing He*†, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen† (*equal contribution, †corresponding author)
A state-of-the-art Video Variational Autoencoder (VAE) designed for high-fidelity video reconstruction. This project leverages cross-modal and joint video-image training to enhance reconstruction quality.
- High-Fidelity Reconstruction: Achieve superior image and video reconstruction quality.
- Cross-Modal Reconstruction: Utilize captions to guide the reconstruction process.
- State-of-the-Art Performance: Set new benchmarks in video reconstruction tasks.
- Release Pretrained Model Weights
- Release Inference Code
- Release Training Code
Follow these steps to set up your environment and run the code:
```bash
git clone https://github.com/VideoVerses/VideoVAEPlus.git
cd VideoVAEPlus
```
Create a Conda environment and install dependencies:
```bash
conda create --name vae python=3.10 -y
conda activate vae
pip install -r requirements.txt
```
| Model Name | Latent Channels | Download Link |
|---|---|---|
| sota-4z | 4 | Download |
| sota-4z-text | 4 | Download |
| sota-16z | 16 | Download |
| sota-16z-text | 16 | Download |
- Note: '4z' and '16z' indicate the number of latent channels in the VAE model. Models with 'text' support text guidance.
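After downloading, a quick sanity check can confirm the file is readable. The sketch below assumes the weights are a standard PyTorch checkpoint loadable with `torch.load`; the filename is a placeholder for whichever model you downloaded.

```python
import torch

# Placeholder filename -- point this at the checkpoint you actually downloaded.
ckpt = torch.load("sota-16z.ckpt", map_location="cpu")

# Checkpoints are commonly either a plain state dict or a dict with a "state_dict" entry.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"Loaded {len(state_dict)} parameter tensors")
```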
To reconstruct videos and images using our VAE model, organize your data in the following structure:
Place your videos and optional captions in the `examples/videos/gt` directory.
```
examples/videos/
├── gt/
│   ├── video1.mp4
│   ├── video1.txt   # Optional caption
│   ├── video2.mp4
│   ├── video2.txt
│   └── ...
├── recon/
│   └── (reconstructed videos will be saved here)
```
- Captions: For cross-modal reconstruction, include a `.txt` file with the same name as the video containing its caption.
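For illustration, the snippet below (a hypothetical helper, not part of the repository code) walks `examples/videos/gt` and pairs each video with its same-named `.txt` caption when one exists:

```python
from pathlib import Path

def collect_video_caption_pairs(gt_dir="examples/videos/gt"):
    """Pair each .mp4 with a same-named .txt caption, if present."""
    pairs = []
    for video in sorted(Path(gt_dir).glob("*.mp4")):
        caption_file = video.with_suffix(".txt")
        caption = caption_file.read_text().strip() if caption_file.exists() else None
        pairs.append((video, caption))
    return pairs

if __name__ == "__main__":
    for video, caption in collect_video_caption_pairs():
        print(video.name, "->", caption if caption else "(no caption)")
```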
Place your images in the `examples/images/gt` directory.
```
examples/images/
├── gt/
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
├── recon/
│   └── (reconstructed images will be saved here)
```
- Note: The image dataset does not require captions.
Our video VAE supports both image and video reconstruction.
Please ensure that the `ckpt_path` in all your configuration files is set to the actual path of your checkpoint.
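To double-check this before launching inference, a small script like the following can scan the inference configs. It is only a sketch: it assumes each config is a YAML file containing a `ckpt_path` key somewhere in its tree, which may differ from the actual config layout.

```python
import glob
import os

import yaml  # PyYAML

def find_ckpt_paths(node):
    """Recursively yield every 'ckpt_path' value found in a parsed YAML tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "ckpt_path":
                yield value
            else:
                yield from find_ckpt_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_ckpt_paths(item)

for config_file in sorted(glob.glob("configs/inference/*.yaml")):
    with open(config_file) as f:
        config = yaml.safe_load(f)
    for path in find_ckpt_paths(config):
        status = "OK" if path and os.path.isfile(str(path)) else "MISSING"
        print(f"{config_file}: ckpt_path={path} [{status}]")
```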
Run video reconstruction using:
```bash
bash scripts/run_inference_video.sh
```
This is equivalent to:
```bash
python inference_video.py \
    --data_root 'examples/videos/gt' \
    --out_root 'examples/videos/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --chunk_size 8 \
    --resolution 720 1280
```
- If the chunk size is too large, you may encounter memory issues. In this case, reduce the `chunk_size` parameter, and make sure the `chunk_size` remains divisible by 4.
- To enable cross-modal reconstruction using captions, change `config_path` to `'configs/config_16z_cap.yaml'` for the 16-channel model with caption guidance.
Run image reconstruction using:
```bash
bash scripts/run_inference_image.sh
```
This is equivalent to:
```bash
python inference_image.py \
    --data_root 'examples/images/gt' \
    --out_root 'examples/images/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --batch_size 1
```
- Note: The batch size is set to 1 because the images in the example folder have varying resolutions. If you have a batch of images with the same resolution, you can increase the batch size to accelerate inference.
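If you want to confirm that your own images share a resolution before raising `--batch_size`, the sketch below (illustrative only, assumes Pillow is installed) groups the files in `examples/images/gt` by size:

```python
from collections import defaultdict
from pathlib import Path

from PIL import Image  # Pillow

groups = defaultdict(list)
for image_path in sorted(Path("examples/images/gt").iterdir()):
    if image_path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
        with Image.open(image_path) as img:
            groups[img.size].append(image_path.name)

for (width, height), names in groups.items():
    print(f"{width}x{height}: {len(names)} image(s) can share a batch")
```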
Use the provided scripts to evaluate reconstruction quality using PSNR, SSIM, and LPIPS metrics.
```bash
bash scripts/evaluation_image.sh
bash scripts/evaluation_video.sh
```
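For reference, the snippet below shows how PSNR and SSIM can be computed for a single ground-truth/reconstruction pair with scikit-image. It is a sketch of the metrics themselves, not the repository's evaluation script, and the filenames are illustrative; LPIPS additionally requires the `lpips` package.

```python
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Illustrative filenames -- replace with an actual GT/reconstruction pair.
gt = np.array(Image.open("examples/images/gt/image1.jpg").convert("RGB"))
recon = np.array(Image.open("examples/images/recon/image1.jpg").convert("RGB"))

psnr = peak_signal_noise_ratio(gt, recon, data_range=255)
ssim = structural_similarity(gt, recon, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```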
This project is released under the CC-BY-NC-ND license.