Video Swin in Keras can be used with multiple backends, i.e. tensorflow, torch, and jax. The input shape is expected to be channels-last, i.e. `(depth, height, width, channel)`.
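In Keras 3, the backend is chosen via the `KERAS_BACKEND` environment variable, which must be set before `keras` is imported. A minimal sketch of picking a backend and preparing a channels-last video batch (`keras` itself is not imported here; only the input layout is shown):

```python
import os

# Must be set before `import keras`; valid values: "tensorflow", "torch", "jax".
os.environ["KERAS_BACKEND"] = "jax"

import numpy as np

# A batch of 2 videos in channels-last layout:
# (batch, depth, height, width, channel)
videos = np.random.rand(2, 32, 224, 224, 3).astype("float32")
print(videos.shape)  # (2, 32, 224, 224, 3)
```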
Note: when evaluating a video model for classification, multiple clips are sampled from each video, and multiple spatial crops are taken from each clip. To evaluate on a benchmark dataset, this standard protocol should be followed; check the official config. The total number of frames seen per video is

#Frame = #input_frame x #clip x #crop

The frame interval is 2 when evaluating on benchmark datasets. `#input_frame` is the number of frames fed to the model at test time; for Video Swin it is 32. `#crop` is the number of spatial crops (e.g., 3 for left/right/center crops). `#clip` is the number of temporal clips (e.g., 4 means repeated temporal sampling of four clips with different start indices).
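The protocol above can be sketched with NumPy: the model is run once per view (clip x crop), and the per-view scores are averaged before taking the final prediction. `model_fn` here is a random stand-in for a real Video Swin forward pass, and the tiny spatial size keeps the sketch light (real crops are 224x224):

```python
import numpy as np

INPUT_FRAMES, NUM_CLIPS, NUM_CROPS, NUM_CLASSES = 32, 4, 3, 400

# Total frames seen per video: #Frame = #input_frame x #clip x #crop
total_frames = INPUT_FRAMES * NUM_CLIPS * NUM_CROPS
print(total_frames)  # 384

rng = np.random.default_rng(0)

def model_fn(view):
    # Stand-in for the real model: returns per-class scores for one view.
    return rng.standard_normal(NUM_CLASSES)

# One video as NUM_CLIPS x NUM_CROPS views of shape (depth, h, w, c).
views = rng.random((NUM_CLIPS * NUM_CROPS, INPUT_FRAMES, 8, 8, 3))

# Average scores over all 12 views, then predict.
scores = np.mean([model_fn(v) for v in views], axis=0)
prediction = int(np.argmax(scores))
```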
In the training phase, the Video Swin models are initialized with the pretrained weights of the image Swin models. In the tables below, IN refers to ImageNet. By default, Video Swin is trained with an input shape of `(32, 224, 224, 3)`.
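Image Swin embeds 2D patches while Video Swin embeds 3D tubelets, so the 2D patch-embedding kernel has to be inflated along the temporal axis before it can initialize the video model. A sketch of replicate-and-rescale inflation (shapes follow Keras channels-last conv kernels; the temporal patch size of 2 matches Video Swin's default, and this particular inflation scheme is an assumption about the port, not taken from this document):

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, temporal_size=2):
    """Inflate a 2D conv kernel (kh, kw, in_c, out_c) into a 3D conv
    kernel (kt, kh, kw, in_c, out_c) by replicating it along a new
    temporal axis and rescaling so output magnitudes are preserved."""
    kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], temporal_size, axis=0)
    return kernel_3d / temporal_size

# Image Swin patch embedding: 4x4 patches, 3 -> 96 channels.
k2d = np.random.rand(4, 4, 3, 96).astype("float32")
k3d = inflate_2d_kernel(k2d)
print(k3d.shape)  # (2, 4, 4, 3, 96)
```

Dividing by the temporal size means a constant input produces the same activations through the inflated 3D kernel as through the original 2D one.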
**Kinetics-400**

Model | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints | config |
---|---|---|---|---|---|---|
Swin-T | IN-1K | 32x4x3 | 78.8 | 93.6 | h5 / h5-no-top | swin-t |
Swin-S | IN-1K | 32x4x3 | 80.6 | 94.5 | h5 / h5-no-top | swin-s |
Swin-B | IN-1K | 32x4x3 | 80.6 | 94.6 | h5 / h5-no-top | swin-b |
Swin-B | IN-22K | 32x4x3 | 82.7 | 95.5 | h5 / h5-no-top | swin-b |
**Kinetics-600**

Model | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints | config |
---|---|---|---|---|---|---|
Swin-B | IN-22K | 32x4x3 | 84.0 | 96.5 | h5 / h5-no-top | swin-b |
**Something-Something v2**

Model | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints | config |
---|---|---|---|---|---|---|
Swin-B | Kinetics 400 | 32x1x3 | 69.6 | 92.7 | h5 / h5-no-top | swin-b |
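The Top-1 and Top-5 columns above are the standard top-k accuracies. A small NumPy sketch of how they are computed from per-class scores:

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    hits = [label in row for row, label in zip(top_k, labels)]
    return float(np.mean(hits))

# 3 samples, 3 classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.3, 0.2, 0.5]])
labels = np.array([1, 2, 0])
print(top_k_accuracy(scores, labels, 1))  # 1/3: only the first sample is correct
print(top_k_accuracy(scores, labels, 2))  # 2/3: the third sample's label is in its top 2
```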