Distributed representations of sentences and documents

Quoc Le, Tomas Mikolov (2014)

Key points

  • Unsupervised
  • Represent text by a fixed-length feature vector
    • Bag-of-words (BoW): loses the order of words, ignores the semantics of words
  • Paragraph vector: unsupervised algorithm that learns fixed-length vectors from variable-length text and addresses both BoW weaknesses
    • Consists of a word matrix (vectors representing words) and a paragraph matrix (fixed-length vectors representing variable-length pieces of text)
    • Each paragraph vector captures the information missing from the current context (acts as a memory of the paragraph's topic)
  • To predict words in a paragraph: concatenate (or average) the paragraph vector (unique to each paragraph) with the word vectors (shared between paragraphs)
  • Trainable using SGD and backprop
  • Paragraph vector distributed memory (PV-DM): similar to continuous BoW (CBoW); predicts the next word from the paragraph vector plus the context word vectors; after training, semantically similar words are close in vector space
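  • Paragraph vector distributed bag of words (PV-DBoW): similar to skip-gram; ignores context words in the input and uses only the paragraph vector to predict words randomly sampled from the paragraph --> needs to store less data (only softmax weights, no word vectors)!
    • Combine with PV-DM (use both paragraph vectors together) for more consistent performance
    • Paragraph vectors don't suffer from the data sparsity and high dimensionality of bag-of-n-grams!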
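
The PV-DM points above can be sketched as a toy training loop. This is a minimal illustration in plain Python, not the authors' implementation: the function name `train_pv_dm`, the hyperparameters (`dim`, `window`, `epochs`, `lr`), the full softmax output layer, and the two example paragraphs are all assumptions made here for readability.

```python
import math
import random

def softmax(scores):
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_pv_dm(paragraphs, dim=8, window=2, epochs=200, lr=0.05, seed=0):
    """Toy PV-DM (hypothetical helper): predict the next word from the
    concatenation [paragraph vector ; previous `window` word vectors]."""
    rng = random.Random(seed)
    vocab = sorted({w for p in paragraphs for w in p})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    init = lambda n: [rng.uniform(-0.1, 0.1) for _ in range(n)]
    D = [init(dim) for _ in paragraphs]      # paragraph matrix: one row per paragraph
    W = [init(dim) for _ in vocab]           # word matrix: shared between paragraphs
    in_dim = dim * (1 + window)
    U = [init(in_dim) for _ in vocab]        # softmax weights
    b = [0.0] * V                            # softmax biases
    losses = []
    for _ in range(epochs):
        total = 0.0
        for p_id, words in enumerate(paragraphs):
            ids = [idx[w] for w in words]
            for t in range(window, len(ids)):
                ctx = ids[t - window:t]
                # Concatenate paragraph vector with the context word vectors.
                h = D[p_id] + [x for c in ctx for x in W[c]]
                scores = [sum(u * x for u, x in zip(U[v], h)) + b[v] for v in range(V)]
                probs = softmax(scores)
                total -= math.log(probs[ids[t]])
                # Cross-entropy gradient w.r.t. the scores: probs - one_hot(target).
                err = probs[:]
                err[ids[t]] -= 1.0
                # Backprop into the input before U is modified.
                grad_h = [sum(err[v] * U[v][k] for v in range(V)) for k in range(in_dim)]
                for v in range(V):
                    for k in range(in_dim):
                        U[v][k] -= lr * err[v] * h[k]
                    b[v] -= lr * err[v]
                for k in range(dim):         # SGD step on the paragraph vector
                    D[p_id][k] -= lr * grad_h[k]
                for j, c in enumerate(ctx):  # SGD step on each context word vector
                    off = dim * (1 + j)
                    for k in range(dim):
                        W[c][k] -= lr * grad_h[off + k]
        losses.append(total)
    return D, W, losses

# Made-up example data, only to exercise the sketch.
paragraphs = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
D, W, losses = train_pv_dm(paragraphs)
print(f"loss: first epoch {losses[0]:.2f}, last epoch {losses[-1]:.2f}")
```

Under the same assumptions, a PV-DBoW variant would set `h = D[p_id]` alone and pick the target word at random from the paragraph, so the word matrix `W` would not be needed.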