Video Swin in Keras can be used with multiple backends, i.e. tensorflow, torch, and jax. The input shape is expected to be channels-last, i.e. `(depth, height, width, channel)`.
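In Keras 3, the backend is chosen via the `KERAS_BACKEND` environment variable, which must be set before `keras` is imported. A minimal sketch of picking a backend and preparing a channels-last video batch (`keras` itself is not imported here; only the input layout is shown):

```python
import os

# Must be set before `import keras`; valid values: "tensorflow", "torch", "jax".
os.environ["KERAS_BACKEND"] = "jax"

import numpy as np

# A batch of 2 videos in channels-last layout:
# (batch, depth, height, width, channel)
videos = np.random.rand(2, 32, 224, 224, 3).astype("float32")
print(videos.shape)  # (2, 32, 224, 224, 3)
```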
Note: when evaluating a video model for classification, multiple clips are sampled from each video, and multiple spatial crops are taken from each clip. To evaluate on a benchmark dataset, this standard protocol should be followed; check the official config. The total number of frames seen per video is

#Frame = #input_frame x #clip x #crop

The frame interval is 2 when evaluating on benchmark datasets. `#input_frame` is the number of frames fed to the model at test time; for Video Swin it is 32. `#crop` is the number of spatial crops (e.g., 3 for left/right/center crops). `#clip` is the number of temporal clips (e.g., 4 means repeated temporal sampling of four clips with different start indices).
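The protocol above can be sketched with NumPy: the model is run once per view (clip x crop), and the per-view scores are averaged before taking the final prediction. `model_fn` here is a random stand-in for a real Video Swin forward pass, and the tiny spatial size keeps the sketch light (real crops are 224x224):

```python
import numpy as np

INPUT_FRAMES, NUM_CLIPS, NUM_CROPS, NUM_CLASSES = 32, 4, 3, 400

# Total frames seen per video: #Frame = #input_frame x #clip x #crop
total_frames = INPUT_FRAMES * NUM_CLIPS * NUM_CROPS
print(total_frames)  # 384

rng = np.random.default_rng(0)

def model_fn(view):
    # Stand-in for the real model: returns per-class scores for one view.
    return rng.standard_normal(NUM_CLASSES)

# One video as NUM_CLIPS x NUM_CROPS views of shape (depth, h, w, c).
views = rng.random((NUM_CLIPS * NUM_CROPS, INPUT_FRAMES, 8, 8, 3))

# Average scores over all 12 views, then predict.
scores = np.mean([model_fn(v) for v in views], axis=0)
prediction = int(np.argmax(scores))
```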
In the training phase, the Video Swin models are initialized with the pretrained weights of the image Swin models. In the tables below, IN refers to ImageNet. By default, Video Swin is trained with an input shape of `(32, 224, 224, 3)`.
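Image Swin embeds 2D patches while Video Swin embeds 3D tubelets, so the 2D patch-embedding kernel has to be inflated along the temporal axis before it can initialize the video model. A sketch of replicate-and-rescale inflation (shapes follow Keras channels-last conv kernels; the temporal patch size of 2 matches Video Swin's default, and this particular inflation scheme is an assumption about the port, not taken from this document):

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, temporal_size=2):
    """Inflate a 2D conv kernel (kh, kw, in_c, out_c) into a 3D conv
    kernel (kt, kh, kw, in_c, out_c) by replicating it along a new
    temporal axis and rescaling so output magnitudes are preserved."""
    kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], temporal_size, axis=0)
    return kernel_3d / temporal_size

# Image Swin patch embedding: 4x4 patches, 3 -> 96 channels.
k2d = np.random.rand(4, 4, 3, 96).astype("float32")
k3d = inflate_2d_kernel(k2d)
print(k3d.shape)  # (2, 4, 4, 3, 96)
```

Dividing by the temporal size means a constant input produces the same activations through the inflated 3D kernel as through the original 2D one.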
**Kinetics-400**

Model | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints | config |
---|---|---|---|---|---|---|
Swin-T | IN-1K | 32x4x3 | 78.8 | 93.6 | h5 / h5-no-top | swin-t |
Swin-S | IN-1K | 32x4x3 | 80.6 | 94.5 | h5 / h5-no-top | swin-s |
Swin-B | IN-1K | 32x4x3 | 80.6 | 94.6 | h5 / h5-no-top | swin-b |
Swin-B | IN-22K | 32x4x3 | 82.7 | 95.5 | h5 / h5-no-top | swin-b |
**Kinetics-600**

Model | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints | config |
---|---|---|---|---|---|---|
Swin-B | IN-22K | 32x4x3 | 84.0 | 96.5 | h5 / h5-no-top | swin-b |
**Something-Something v2**

Model | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints | config |
---|---|---|---|---|---|---|
Swin-B | Kinetics 400 | 32x1x3 | 69.6 | 92.7 | h5 / h5-no-top | swin-b |
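The Top-1 and Top-5 columns above are the standard top-k accuracies. A small NumPy sketch of how they are computed from per-class scores:

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    hits = [label in row for row, label in zip(top_k, labels)]
    return float(np.mean(hits))

# 3 samples, 3 classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.3, 0.2, 0.5]])
labels = np.array([1, 2, 0])
print(top_k_accuracy(scores, labels, 1))  # 1/3: only the first sample is correct
print(top_k_accuracy(scores, labels, 2))  # 2/3: the third sample's label is in its top 2
```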