João Carreira, Andrew Zisserman (2017)
- New video dataset called Kinetics: more data, more action classes, from challenging YouTube videos (similar to ImageNet for images)
- Two-stream inflated 3D ConvNets:
- Filters and pooling kernels of very deep image classification ConvNets expanded into 3D
- Their weights (also inflated) pre-trained on ImageNet give a very valuable initialization!
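- Inflating a filter: repeat the 2D weights N times along the time axis, then divide by N, so a video of N identical frames gives the same response as the original image filter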
- The temporal receptive field need not grow per layer at the same rate as the spatial one: growing too fast in time = edges of different objects get merged, too slow = unable to capture scene dynamics
- Is Kinetics pre-training also useful for other video tasks?
- No comprehensive exploration of architectures (such as attention methods)
- Regular per-frame ConvNet alone for action recognition ignores temporal order: opening door = closing door --> add an LSTM on top
- 3D ConvNets: nice hierarchical representation of spatio-temporal data, but many more parameters --> harder to train, pre-training on ImageNet not as effective
- Two-stream: capture fine low-level motion and high-level variation by averaging the predictions for a single RGB frame and a stack of optical flow frames, each passed through an ImageNet pre-trained ConvNet --> high performance while very efficient
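
The filter inflation noted above ("concatenate N times along time axis, divide by N") can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names and the `(out_channels, in_channels, height, width)` weight layout are assumptions made for the example. It checks the property the rescaling is meant to preserve: on a clip made of N copies of the same frame, the inflated 3D filter responds exactly like the original 2D filter.

```python
import numpy as np


def inflate_2d_filter(w2d, n_time):
    """Inflate 2D conv weights (out_c, in_c, kh, kw) to 3D (out_c, in_c, n_time, kh, kw).

    The 2D weights are repeated n_time times along a new time axis and
    divided by n_time, so the sum over time reproduces the 2D weights.
    """
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], n_time, axis=2)
    return w3d / n_time


def response_2d(w2d, patch):
    """Filter responses at one spatial location; patch has shape (in_c, kh, kw)."""
    return np.tensordot(w2d, patch, axes=([1, 2, 3], [0, 1, 2]))


def response_3d(w3d, clip):
    """Filter responses at one space-time location; clip has shape (in_c, n_time, kh, kw)."""
    return np.tensordot(w3d, clip, axes=([1, 2, 3, 4], [0, 1, 2, 3]))


# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
n_time = 3
w2d = rng.standard_normal((4, 3, 7, 7))   # 4 filters over 3 input channels
patch = rng.standard_normal((3, 7, 7))    # one image patch

# A "boring" clip: the same frame repeated n_time times.
boring_clip = np.repeat(patch[:, np.newaxis, :, :], n_time, axis=1)

w3d = inflate_2d_filter(w2d, n_time)
same = np.allclose(response_2d(w2d, patch), response_3d(w3d, boring_clip))
print(w3d.shape, same)
```

Only the convolution weights are covered here; pooling kernels are inflated by extending their window in time and carry no weights to rescale.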