Our use of NLTK depends on several corpora. To install them, run the following in a Python environment:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
semantic_classifier.py
-
Extract all nouns from the document (noun_extractor.py)
a. Strip all punctuation and non-ASCII characters from each line
b. Tokenize the line
c. Tag the POS of each token
d. Filter out all non-noun tokens
e. Add all noun tokens to a Python dict and track occurrences of each noun
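The sub-steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of noun_extractor.py: the function name `extract_nouns` and its `tokenize` / `pos_tag` parameters are hypothetical, and lowercasing tokens and replacing punctuation with spaces (which splits hyphenated words) are assumptions. When no callables are passed, it falls back to `nltk.word_tokenize` and `nltk.pos_tag`, which need the 'punkt' and 'averaged_perceptron_tagger' downloads.

```python
import re
from collections import Counter


def extract_nouns(lines, tokenize=None, pos_tag=None):
    """Count noun tokens across an iterable of text lines (illustrative sketch)."""
    if tokenize is None or pos_tag is None:
        import nltk  # imported lazily; only needed for the default callables
        tokenize = tokenize or nltk.word_tokenize
        pos_tag = pos_tag or nltk.pos_tag

    counts = Counter()
    for line in lines:
        # a. Drop non-ASCII characters, then replace punctuation with spaces.
        ascii_line = line.encode("ascii", errors="ignore").decode("ascii")
        cleaned = re.sub(r"[^\w\s]", " ", ascii_line)
        # b. Tokenize the line and c. tag each token with its part of speech.
        tagged = pos_tag(tokenize(cleaned))
        # d. Keep Penn Treebank noun tags (NN, NNS, NNP, NNPS) and
        # e. count occurrences (lowercasing is an assumption of this sketch).
        counts.update(tok.lower() for tok, tag in tagged if tag.startswith("NN"))
    return counts
```

Passing custom callables makes it possible to try the counting logic without the NLTK models installed.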
-
Convert nouns to synsets and remove nouns for which no synset exists
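A possible sketch of this step, assuming the noun counts are held in a dict keyed by noun. The name `nouns_to_synsets` and the `lookup` parameter are hypothetical; by default the sketch uses `wordnet.synsets(noun, pos=wordnet.NOUN)` from NLTK and keeps only the first listed sense.

```python
def nouns_to_synsets(noun_counts, lookup=None):
    """Map each noun to its first noun synset, dropping nouns that have none."""
    if lookup is None:
        from nltk.corpus import wordnet as wn  # needs the 'wordnet' download

        def lookup(noun):
            return wn.synsets(noun, pos=wn.NOUN)

    synsets = {}
    for noun in noun_counts:
        senses = lookup(noun)
        if senses:  # nouns with no synset are removed
            synsets[noun] = senses[0]  # first listed sense only
    return synsets
```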
-
Generate the 2D matrix of similarity values
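One way the matrix could be built, assuming the similarity measure is WordNet's `wup_similarity` (the measure quoted elsewhere in this README). The function name and the `similarity` parameter are hypothetical, and the sketch assumes the measure is symmetric and that a missing score (`None`) should count as 0.0.

```python
def similarity_matrix(synsets, similarity=None):
    """Return a symmetric list-of-lists matrix of pairwise similarity values."""
    if similarity is None:
        def similarity(a, b):
            return a.wup_similarity(b)  # NLTK Synset method

    n = len(synsets)
    matrix = [[1.0] * n for _ in range(n)]  # assume self-similarity is 1.0
    for i in range(n):
        for j in range(i + 1, n):
            # Treat a missing score (None) as 0.0; this is an assumption.
            score = similarity(synsets[i], synsets[j]) or 0.0
            matrix[i][j] = matrix[j][i] = score
    return matrix
```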
-
Perform hierarchical clustering
-
Get clusters based on min_size, max_size, and dist parameters.
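The clustering and cluster-selection steps might look like the sketch below. It assumes SciPy is used for the hierarchical clustering, which this README does not confirm, and it assumes that `dist` is the distance at which the dendrogram is cut and that `min_size` / `max_size` bound the number of members per cluster. The function name `cluster_indices` and the default values are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_indices(similarity, dist=0.5, min_size=2, max_size=10, method="complete"):
    """Cluster rows of a similarity matrix; return lists of row indices."""
    # Convert similarity in [0, 1] to distance, then to SciPy's condensed form.
    distances = 1.0 - np.asarray(similarity, dtype=float)
    np.fill_diagonal(distances, 0.0)
    condensed = squareform(distances, checks=False)
    # Build the dendrogram and cut it at distance `dist`.
    tree = linkage(condensed, method=method)
    labels = fcluster(tree, t=dist, criterion="distance")
    # Group row indices by cluster label and keep clusters within the size bounds.
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(int(label), []).append(index)
    return [g for g in groups.values() if min_size <= len(g) <= max_size]
```

The returned indices refer to rows of the similarity matrix, so they can be mapped back to the nouns or synsets in the same order.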
-
Sort clusters by noun occurrence, most frequent first.
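A small sketch of the sorting step, assuming a cluster's frequency is the sum of the occurrence counts of its nouns. That interpretation, and the name `sort_clusters`, are assumptions rather than the documented behaviour.

```python
def sort_clusters(clusters, noun_counts):
    """Order clusters (lists of nouns) by total noun occurrences, highest first."""
    def total(cluster):
        # Nouns missing from the counts contribute 0.
        return sum(noun_counts.get(noun, 0) for noun in cluster)

    return sorted(clusters, key=total, reverse=True)
```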
-
Find the least common ancestor of each cluster of synsets.
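For a cluster with more than two synsets, one possible approach is to fold NLTK's pairwise `Synset.lowest_common_hypernyms` across the cluster, taking the first candidate at each step. This is a sketch under that assumption; the name `common_hypernym` is hypothetical, and because WordNet allows multiple inheritance the pairwise fold may not always give the same result as a true set-wide lowest common ancestor.

```python
def common_hypernym(synsets):
    """Fold a cluster of synsets into one shared hypernym, or None if there is none."""
    ancestor = synsets[0]
    for synset in synsets[1:]:
        candidates = ancestor.lowest_common_hypernyms(synset)
        if not candidates:
            return None  # no shared ancestor was found
        ancestor = candidates[0]  # only the first candidate is kept
    return ancestor
```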
-
Semantic classification vs. what the article is about
-
Large clusters vs. iterative clusters of pairs
The larger the cluster, the more abstract and often less accurate its hypernym. Smaller clusters, especially pairs, yield the most accurate hypernyms but provide less semantic synthesis.
-
Hypernym vs. content
E.g. "photograph" is not clustered with "photography": their wup_similarity is only 0.1176, whereas the wup_similarity of "photograph" and "painting" is 0.705.
-
Incorporate noun counts for assigning 'salience scores' to each hypernym
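One purely hypothetical way to define such a score is the share of all noun occurrences that fall inside the hypernym's cluster. Nothing in this README fixes the formula; the function name and the input shape (a mapping from hypernym name to the nouns in its cluster) are assumptions made for illustration.

```python
def salience_scores(cluster_nouns, noun_counts):
    """Score each hypernym by its cluster's share of all noun occurrences."""
    total = sum(noun_counts.values())
    if total == 0:
        return {name: 0.0 for name in cluster_nouns}
    return {
        name: sum(noun_counts.get(noun, 0) for noun in nouns) / total
        for name, nouns in cluster_nouns.items()
    }
```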
-
NLTK's POS tagger sometimes mis-tags words as nouns. For instance, it tags "tamer" in the following sentence as a noun: "Scientists once thought that some visionary hunter-gatherer nabbed a wolf puppy from its den one day and started raising tamer and tamer wolves".
-
Currently, the algorithm only takes the first synset and first common hypernym
a. The first synset is the most frequently occurring, but it might be the incorrect sense of the noun.
b. A set of synsets might have multiple lowest common hypernyms, some of which may be more accurate than others.
-
How to do evaluation?
-
Morphology: collapse 'photography' and 'photograph'?
-
Methodological limitations
a. Only accounts for nouns
b. Hypernym is not equivalent to 'semantic class' or 'content'
c. A document's complete semantic meaning cannot be fully captured by a set of nouns
-
Discuss the clustering mode, i.e. median vs. complete linkage