Object retrieval with large vocabularies and fast spatial matching

James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, Andrew Zisserman (2007)

Key points

Returns ranked list of images containing object specified in query image (region)
Bag-of-visual-words: more words & structure (spatial) than normal BoW, but much noisier
2 improvements:
- In terms of visual vocabulary: flat k-means with approximate nearest-neighbor gives best performance, making use of randomized k-d trees (splits randomly across dimensions with highest variance, which helps to mitigate quantization effects)
- Incorporating spatial information in ranking (deterministic)
Dealing with image variations:
- Estimate transformation between query and target image, taking advantage of the fact that images of landmarks are usually oriented correctly (so no need for in-plane rotations)
- Re-rank based on how well the verified features discriminate