Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton (2017)
- CNN: good at detecting features + dealing with translation
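- Less good at capturing spatial relationships between features (relative size, perspective, orientation) and at handling other affine transformations
- May be fooled by a "Picasso face" (all the facial features are present, but in the wrong arrangement)
- Solution: capsules, which represent a feature by a vector that encodes properties such as orientation and size in addition to the likelihood that the feature is present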
- Activity vector: instantiation parameters (pose, velocity, etc.)
- Length of the activity vector: probability that the object exists (always below 1)
- Orientation (direction) of the activity vector: represents the instantiation parameters
- Better generalization: no separate neurons needed for differently oriented objects (as in a CNN) --> the number of units doesn't grow exponentially with the number of dimensions!
- Also: max pooling loses a lot of information, while capsules keep a weighted sum of the previous layer's outputs --> better at dealing with overlap
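- Dynamic routing-by-agreement: top-down feedback on whether a lower-level capsule's output is useful to a higher-level capsule, based on how well the lower capsule's prediction agrees (scalar product) with the higher capsule's output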
- Backprop is still used to train the weights (the routing coefficients are computed iteratively in the forward pass) --> slow!
- Capsules are good at segmenting overlapping objects, thanks to routing-by-agreement
- Capsules are equivariant to viewpoint, instead of trying to be invariant to it (which discards viewpoint information) --> can deal with multiple different transformations at the same time
- However: a capsule can represent only 1 instance of its class at a given location in the image (similar to crowding in human vision)
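
To make the "length = probability" and routing-by-agreement points above concrete, here is a minimal pure-Python sketch. It assumes the squashing nonlinearity and the three-iteration routing loop described in the paper. The names `squash`, `softmax`, `route` and `u_hat` are illustrative and not taken from any library, and the prediction vectors are passed in directly rather than produced by learned weight matrices.

```python
import math

def squash(s, eps=1e-9):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    # length becomes |s|^2 / (1 + |s|^2); eps avoids division by zero
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * x for x in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iters=3):
    """Routing-by-agreement (sketch).

    u_hat[i][j] is the prediction vector that lower capsule i makes for
    higher capsule j. Returns (v, c): the higher-capsule activity vectors
    and the coupling coefficients used in the last iteration.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, initially uniform
    for _ in range(num_iters):
        # each lower capsule distributes its output over the higher capsules
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            # weighted sum of the predictions from the previous layer
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # agreement = scalar product between prediction and resulting output
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Two lower capsules agree on higher capsule 0 and contradict each other on capsule 1.
u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, -1.0]],
]
v, c = route(u_hat)
lengths = [math.sqrt(sum(x * x for x in vec)) for vec in v]
```

In this toy input the coupling coefficients shift toward higher capsule 0, where the predictions agree, and its activity vector ends up longer than that of capsule 1, whose contradictory predictions cancel out.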