Distributed representations of sentences and documents

Quoc Le, Tomas Mikolov (2014)

Key points

Unsupervised
Represent text by a fixed-length feature vector
- Bag-of-words (BoW): loses the order of words, ignores the semantics of words
Paragraph vector: unsupervised algorithm to solve this
- Consists of word matrix (vectors representing words) and paragraph matrix (vectors of fixed length representing variable-length text pieces)
- Paragraph matrix represents the missing information from the current context (sort of memory of paragraph topic)
To predict words in a paragraph: concatenate paragraph vector (unique to each paragraph) with word vectors (shared between paragraphs)
Trainable using SGD and backprop
Paragraph vector distributed memory (PV-DM): similar to continuous BoW (CBoW), predicts other words in context, semantically similar words are close
Paragraph vector distributed bag of words (PV-DBoW): only paragraph vector, predict words randomly sampled from output paragraph --> needs to store less data!
- Combine with PV-DM to upgrade its performance
- Doesn't suffer from data sparsity and high-D!