We provide benchmark results of spatiotemporal predictive learning (STL) methods on various video prediction datasets. More STL methods will be supported in the future. Issues and PRs are welcome!
**Currently supported spatiotemporal prediction methods**

**Currently supported MetaFormer models for SimVP**
- ViT (ICLR'2021)
- Swin-Transformer (ICCV'2021)
- MLP-Mixer (NeurIPS'2021)
- ConvMixer (Openreview'2021)
- UniFormer (ICLR'2022)
- PoolFormer (CVPR'2022)
- ConvNeXt (CVPR'2022)
- VAN (ArXiv'2022)
- IncepU (SimVP.V1) (CVPR'2022)
- gSTA (SimVP.V2) (ArXiv'2022)
- HorNet (NeurIPS'2022)
- MogaNet (ArXiv'2022)
We provide benchmark results on the popular Moving MNIST dataset. For a fair comparison of different methods, we report final results when models are trained to convergence. Config files are provided in `configs/mmnist`.
Method | Params | FLOPs | FPS | MSE | MAE | SSIM | Download |
---|---|---|---|---|---|---|---|
ConvLSTM-S | 15.0M | 56.8G | 113 | 46.26 | 142.18 | 0.878 | model | log |
ConvLSTM-L | 33.8M | 127.0G | 50 | 29.88 | 95.05 | 0.925 | model | log |
PhyDNet | 3.1M | 15.3G | 182 | 35.68 | 96.70 | 0.917 | model | log |
PredRNN | 23.8M | 116.0G | 54 | 25.04 | 76.26 | 0.944 | model | log |
PredRNN++ | 38.6M | 171.7G | 38 | 22.45 | 69.70 | 0.950 | model | log |
MIM | 38.0M | 179.2G | 37 | 23.66 | 74.37 | 0.946 | model | log |
E3D-LSTM | 51.0M | 298.9G | 18 | 36.19 | 78.64 | 0.932 | model | log |
CrevNet | 5.0M | 270.7G | 10 | 30.15 | 86.28 | 0.935 | model | log |
PredRNN.V2 | 23.9M | 116.6G | 52 | 27.73 | 82.17 | 0.937 | model | log |
SimVP+IncepU | 58.0M | 19.4G | 209 | 26.69 | 77.19 | 0.940 | model | log |
SimVP+gSTA-S | 46.8M | 16.5G | 282 | 15.05 | 49.80 | 0.967 | model | log |
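The MSE/MAE/SSIM columns above follow common video-prediction conventions, but normalization differs across papers (per-frame pixel sum vs. per-pixel mean). Below is a minimal numpy sketch of MSE/MAE/PSNR for a single predicted sequence; the helper name `video_metrics` and the exact normalization are our illustrative assumptions, not necessarily what this repo implements.

```python
import numpy as np

def video_metrics(pred, true):
    """MSE/MAE/PSNR for one predicted sequence (illustrative helper).

    pred, true: (T, H, W) arrays of frames normalized to [0, 1].
    Here MSE/MAE sum over pixels and average over frames (a common
    Moving MNIST convention); PSNR uses the per-pixel mean with an
    assumed data range of 1.0.
    """
    diff = pred - true
    mse = float(np.mean(np.sum(diff ** 2, axis=(1, 2))))
    mae = float(np.mean(np.sum(np.abs(diff), axis=(1, 2))))
    pixel_mse = float(np.mean(diff ** 2))
    psnr = 10.0 * np.log10(1.0 / pixel_mse)
    return mse, mae, psnr
```

SSIM additionally requires a windowed luminance/contrast/structure comparison (e.g., `skimage.metrics.structural_similarity`) and is omitted from this sketch.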
Since the hidden Translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP under 200-epoch and 2000-epoch training schedules. Config files are provided in `configs/mmnist/simvp`.
MetaFormer | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
---|---|---|---|---|---|---|---|---|---|
IncepU (SimVPv1) | 200 epoch | 58.0M | 19.4G | 209 | 32.15 | 89.05 | 0.9268 | 37.97 | model | log |
gSTA (SimVPv2) | 200 epoch | 46.8M | 16.5G | 282 | 26.69 | 77.19 | 0.9402 | 38.3 | model | log |
ViT | 200 epoch | 46.1M | 16.9G | 290 | 35.15 | 95.87 | 0.9139 | 37.79 | model | log |
Swin Transformer | 200 epoch | 46.1M | 16.4G | 294 | 29.70 | 84.05 | 0.9331 | 38.14 | model | log |
Uniformer | 200 epoch | 44.8M | 16.5G | 296 | 30.38 | 85.87 | 0.9308 | 38.11 | model | log |
MLP-Mixer | 200 epoch | 38.2M | 14.7G | 334 | 29.52 | 83.36 | 0.9338 | 38.19 | model | log |
ConvMixer | 200 epoch | 3.9M | 5.5G | 658 | 32.09 | 88.93 | 0.9259 | 37.97 | model | log |
Poolformer | 200 epoch | 37.1M | 14.1G | 341 | 31.79 | 88.48 | 0.9271 | 38.06 | model | log |
ConvNeXt | 200 epoch | 37.3M | 14.1G | 344 | 26.94 | 77.23 | 0.9397 | 38.34 | model | log |
VAN | 200 epoch | 44.5M | 16.0G | 288 | 26.10 | 76.11 | 0.9417 | 38.39 | model | log |
HorNet | 200 epoch | 45.7M | 16.3G | 287 | 29.64 | 83.26 | 0.9331 | 38.16 | model | log |
MogaNet | 200 epoch | 46.8M | 16.5G | 255 | 25.57 | 75.19 | 0.9429 | 38.41 | model | log |
IncepU (SimVPv1) | 2000 epoch | 58.0M | 19.4G | 209 | 21.15 | 64.15 | 0.9536 | 38.81 | model | log |
gSTA (SimVPv2) | 2000 epoch | 46.8M | 16.5G | 282 | 15.05 | 49.80 | 0.9670 | - | model | log |
ViT | 2000 epoch | 46.1M | 16.9G | 290 | 19.74 | 61.65 | 0.9539 | 38.96 | model | log |
Swin Transformer | 2000 epoch | 46.1M | 16.4G | 294 | 19.11 | 59.84 | 0.9584 | 39.03 | model | log |
Uniformer | 2000 epoch | 44.8M | 16.5G | 296 | 18.01 | 57.52 | 0.9609 | 39.11 | model | log |
MLP-Mixer | 2000 epoch | 38.2M | 14.7G | 334 | 18.85 | 59.86 | 0.9589 | 38.98 | model | log |
ConvMixer | 2000 epoch | 3.9M | 5.5G | 658 | 22.30 | 67.37 | 0.9507 | 38.67 | model | log |
Poolformer | 2000 epoch | 37.1M | 14.1G | 341 | 20.96 | 64.31 | 0.9539 | 38.86 | model | log |
ConvNeXt | 2000 epoch | 37.3M | 14.1G | 344 | 17.58 | 55.76 | 0.9617 | 39.19 | model | log |
VAN | 2000 epoch | 44.5M | 16.0G | 288 | 16.21 | 53.57 | 0.9646 | 39.26 | model | log |
HorNet | 2000 epoch | 45.7M | 16.3G | 287 | 17.40 | 55.70 | 0.9624 | 39.19 | model | log |
MogaNet | 2000 epoch | 46.8M | 16.5G | 255 | 15.67 | 51.84 | 0.9661 | 39.35 | model | log |
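The interchangeable-Translator idea behind these results can be illustrated with a generic MetaFormer block: pre-norm token mixing plus a channel-mixing MLP, each wrapped in a residual connection. The numpy sketch below is illustrative only (the names `metaformer_block` and `pool_mixer` are ours, not this repo's API); swapping `mix_tokens` between attention, pooling, or convolution yields ViT-, PoolFormer-, or ConvNeXt-style variants.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over its channel dimension (last axis).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def metaformer_block(x, mix_tokens, w1, w2):
    """Generic MetaFormer block (illustrative, not the repo's API).

    x:          (N, C) array of N tokens with C channels
    mix_tokens: any (N, C) -> (N, C) token mixer (attention, pooling, ...)
    w1, w2:     channel-mixing MLP weights of shapes (C, H) and (H, C)
    """
    # Token-mixing sub-block: pre-norm + residual.
    x = x + mix_tokens(layer_norm(x))
    # Channel-mixing sub-block: two-layer MLP (ReLU here for simplicity).
    h = np.maximum(layer_norm(x) @ w1, 0.0)
    return x + h @ w2

def pool_mixer(t):
    # PoolFormer-style mixer: pooled context minus the token itself.
    return t.mean(axis=0, keepdims=True) - t
```

Only the token mixer changes between the rows of the table above; the surrounding norm/residual/MLP scaffold stays fixed, which is what makes the comparison controlled.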
We provide benchmark results on the KittiCaltech Pedestrian dataset. For a fair comparison of different methods, we report final results when models are trained to convergence. Config files are provided in `configs/kitticaltech`.
Method | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
---|---|---|---|---|---|---|---|---|---|
ConvLSTM-S | 100 epoch | 15.0M | 595.0G | 33 | 139.6 | 1583.3 | 0.9345 | 32.82 | model | log |
E3D-LSTM* | 100 epoch | 54.9M | 1004G | 10 | 203.7 | 1929.7 | 0.9062 | 32.04 | model | log |
MAU | 100 epoch | 24.3M | 172.0G | 16 | 177.8 | 1800.4 | 0.9176 | 32.24 | model | log |
MIM | 100 epoch | 49.2M | 1858G | 39 | 127.3 | 1461.1 | 0.9410 | 33.26 | model | log |
PredRNN | 100 epoch | 23.7M | 1216G | 17 | 130.4 | 1525.5 | 0.9374 | 33.01 | model | log |
PredRNN++ | 100 epoch | 38.5M | 1803G | 12 | 125.5 | 1453.2 | 0.9433 | 33.27 | model | log |
PredRNN.V2 | 100 epoch | 23.8M | 1223G | 52 | 147.8 | 1610.5 | 0.9330 | 32.67 | model | log |
SimVP+IncepU | 100 epoch | 8.6M | 60.6G | 57 | 160.2 | 1690.8 | 0.9338 | 32.48 | model | log |
SimVP+gSTA-S | 100 epoch | 15.6M | 96.3G | 40 | 129.7 | 1507.7 | 0.9454 | 33.05 | model | log |
Since the hidden Translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 100-epoch training. Config files are provided in `configs/kitticaltech/simvp`.
MetaFormer | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
---|---|---|---|---|---|---|---|---|---|
IncepU (SimVPv1) | 100 epoch | 8.6M | 60.6G | 57 | 160.2 | 1690.8 | 0.9338 | 32.48 | model | log |
gSTA (SimVPv2) | 100 epoch | 15.6M | 96.3G | 40 | 129.7 | 1507.7 | 0.9454 | 33.05 | model | log |
ViT* | 100 epoch | 12.7M | 155.0G | 25 | 146.4 | 1615.8 | 0.9379 | 32.58 | model | log |
Swin Transformer | 100 epoch | 15.3M | 95.2G | 49 | 155.2 | 1588.9 | 0.9299 | 32.98 | model | log |
Uniformer* | 100 epoch | 11.8M | 104.0G | 28 | 135.9 | 1534.2 | 0.9393 | 32.94 | model | log |
MLP-Mixer | 100 epoch | 22.2M | 83.5G | 60 | 207.9 | 1835.9 | 0.9133 | 32.37 | model | log |
ConvMixer | 100 epoch | 1.5M | 23.1G | 129 | 174.7 | 1854.3 | 0.9232 | 31.88 | model | log |
Poolformer | 100 epoch | 12.4M | 79.8G | 51 | 153.4 | 1613.5 | 0.9334 | 32.79 | model | log |
ConvNeXt | 100 epoch | 12.5M | 80.2G | 54 | 146.8 | 1630.0 | 0.9336 | 32.58 | model | log |
VAN | 100 epoch | 14.9M | 92.5G | 41 | 132.1 | 1501.5 | 0.9437 | 33.10 | model | log |
HorNet | 100 epoch | 15.3M | 94.4G | 43 | 152.8 | 1637.9 | 0.9365 | 32.70 | model | log |
MogaNet | 100 epoch | 15.6M | 96.2G | 36 | 131.4 | 1512.1 | 0.9442 | 32.93 | model | log |