Dynamic routing between capsules

Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton (2017)

Key points

CNN: good at detecting features + dealing with translation
- Less good at exploring spatial relationships between features (size, perspective, orientation) + other affine transformations
- May be fooled by a "Picasso face"
Solution: capsules: represent features by vectors that also include e.g. orientation and size next to likelihood
- Activity vector: instantiation parameters (pose, velocity, etc.)
  - Length: probability that the object exists (max 1)
  - Orientation: represents the instantiation parameters
- Better generalization: no separate neurons needed for differently oriented objects (as in CNN) --> number doesn't grow exponentially for more dimensions!
- Also: max pooling: lot of information lost, while capsules keep weighted sum of last layer --> better at dealing with overlap
Dynamic routing-by-agreement: top-down feedback whether or not the input is useful (based on how closely related)
- Backprop still used for training --> slow!
Capsules are good for dealing with segmentation, due to routing-by-agreement
Capsules are equivariant to viewpoint, instead of trying to eliminate viewpoint --> deal with multiple different transformations at the same time
However: capsules can only deal with 1 instance of a class at a specific location in image (crowding)