Quoc Le, Tomas Mikolov (2014)
- Unsupervised
- Represent text by a fixed-length feature vector
- Bag-of-words (BoW): loses the order of words, ignores the semantics of words
- Paragraph vector: unsupervised algorithm to solve this
- Consists of word matrix (vectors representing words) and paragraph matrix (vectors of fixed length representing variable-length text pieces)
- Paragraph matrix represents the missing information from the current context (sort of memory of paragraph topic)
- To predict words in a paragraph: concatenate paragraph vector (unique to each paragraph) with word vectors (shared between paragraphs)
- Trainable using SGD and backprop
- Paragraph vector distributed memory (PV-DM): similar to continuous BoW (CBoW), predicts other words in context, semantically similar words are close
- Paragraph vector distributed bag of words (PV-DBoW): only paragraph vector, predict words randomly sampled from output paragraph --> needs to store less data!
- Combine with PV-DM to upgrade its performance
- Doesn't suffer from data sparsity and high-D!