
# Video Swin Transformer Model Zoo

Video Swin in Keras can be used with multiple backends, i.e. TensorFlow, Torch, and JAX. The input shape is expected to be channel-last, i.e. (depth, height, width, channel).
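As a minimal sketch of the two points above: the backend is chosen via the `KERAS_BACKEND` environment variable (a Keras 3 convention) before Keras is imported, and a single 32-frame 224x224 RGB clip in channel-last layout looks like this (NumPy is used here only to build a placeholder input):

```python
import os

# Select the Keras backend BEFORE importing keras;
# "tensorflow", "torch", and "jax" are all supported.
os.environ["KERAS_BACKEND"] = "tensorflow"

import numpy as np

# Channel-last video input: (batch, depth, height, width, channel).
# One 32-frame 224x224 RGB clip:
clip = np.zeros((1, 32, 224, 224, 3), dtype="float32")
print(clip.shape)  # (1, 32, 224, 224, 3)
```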

Note: When evaluating a video model for classification, multiple clips are sampled from each video, and multiple spatial crops are additionally taken from each clip. This multi-view protocol is the current standard for evaluating on benchmark datasets. Check the official config.

- #Frame = #input_frame x #clip x #crop. The frame interval is 2 when evaluating on benchmark datasets.
- #input_frame is the number of frames fed to the model at test time; for Video Swin, it is 32.
- #crop is the number of spatial crops (e.g., 3 for left/right/center crops).
- #clip is the number of temporal clips (e.g., 4 means sampling four clips with different start indices).
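As a concrete check of the arithmetic above, the 32x4x3 setting used in the tables below works out as follows. The averaging step at the end is illustrative: random placeholder scores stand in for real per-view model outputs, and 400 is Kinetics-400's class count.

```python
import numpy as np

input_frames = 32  # frames per view fed to the model
num_clips = 4      # temporal clips per video
num_crops = 3      # spatial crops per clip (left/right/center)

# #Frame = #input_frame x #clip x #crop
total_frames = input_frames * num_clips * num_crops
views = num_clips * num_crops
print(total_frames, views)  # 384 frames across 12 views

# The final prediction averages per-view scores
# (random placeholders over Kinetics-400's 400 classes).
view_scores = np.random.rand(views, 400)
final_scores = view_scores.mean(axis=0)
print(final_scores.shape)  # (400,)
```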

## Checkpoints

In the training phase, the Video Swin models are initialized with the pretrained weights of image Swin models. In the following tables, IN refers to ImageNet. By default, Video Swin is trained with an input shape of (32, 224, 224, 3).

### Kinetics 400

| Model  | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints    | Config |
| ------ | -------- | ------ | ----- | ----- | -------------- | ------ |
| Swin-T | IN-1K    | 32x4x3 | 78.8  | 93.6  | h5 / h5-no-top | swin-t |
| Swin-S | IN-1K    | 32x4x3 | 80.6  | 94.5  | h5 / h5-no-top | swin-s |
| Swin-B | IN-1K    | 32x4x3 | 80.6  | 94.6  | h5 / h5-no-top | swin-b |
| Swin-B | IN-22K   | 32x4x3 | 82.7  | 95.5  | h5 / h5-no-top | swin-b |

### Kinetics 600

| Model  | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints    | Config |
| ------ | -------- | ------ | ----- | ----- | -------------- | ------ |
| Swin-B | IN-22K   | 32x4x3 | 84.0  | 96.5  | h5 / h5-no-top | swin-b |

### Something-Something V2

| Model  | Pretrain     | #Frame | Top-1 | Top-5 | Checkpoints    | Config |
| ------ | ------------ | ------ | ----- | ----- | -------------- | ------ |
| Swin-B | Kinetics 400 | 32x1x3 | 69.6  | 92.7  | h5 / h5-no-top | swin-b |