Quo vadis, action recognition? A new model and the Kinetics dataset

João Carreira, Andrew Zisserman (2017)

Key points

New video dataset called Kinetics: more data, more action classes, from challenging YouTube videos (similar to ImageNet for images)
Two-stream inflated 3D ConvNets:
- Filters and pooling kernels of very deep image classification ConvNets expanded into 3D
- Their weights (also inflated) pre-trained on ImageNet give a very valuable initialization!
  - Concatenate N times along time axis, divide by N
  - Growth of filter's time dimension per layer not necessarily as large as the filter's spatial dimension growth: too large = merging of objects, too small = unable to capture scene dynamics
Is kinetics pre-training also useful for other video tasks?
No comprehensive exploration of architectures (such as attention methods)
Regular ConvNet only for action recognition: opening door = closing door --> include LSTM
3D ConvNets: nice hierarchical representation of spatio-temporal data, but many more parameters --> harder to train, pre-training on ImageNet not as effective
Two-stream: capture fine low-level motion and high-level variation using single frame + optical flow vectors passed through ImageNet ConvNet --> high performance while very efficient