Skip to content

Latest commit

 

History

History
15 lines (14 loc) · 1.29 KB

dl_19.md

File metadata and controls

15 lines (14 loc) · 1.29 KB

Quo vadis, action recognition? A new model and the Kinetics dataset

João Carreira, Andrew Zisserman (2017)

Key points

  • New video dataset called Kinetics: more data, more action classes, from challenging YouTube videos (similar to ImageNet for images)
  • Two-stream inflated 3D ConvNets:
    • Filters and pooling kernels of very deep image classification ConvNets expanded into 3D
    • Their weights (also inflated) pre-trained on ImageNet give a very valuable initialization!
      • Concatenate N times along time axis, divide by N
      • Growth of filter's time dimension per layer not necessarily as large as the filter's spatial dimension growth: too large = merging of objects, too small = unable to capture scene dynamics
  • Is kinetics pre-training also useful for other video tasks?
  • No comprehensive exploration of architectures (such as attention methods)
  • Regular ConvNet only for action recognition: opening door = closing door --> include LSTM
  • 3D ConvNets: nice hierarchical representation of spatio-temporal data, but many more parameters --> harder to train, pre-training on ImageNet not as effective
  • Two-stream: capture fine low-level motion and high-level variation using single frame + optical flow vectors passed through ImageNet ConvNet --> high performance while very efficient