Hidden Markov Model Part-of-Speech Tagger

This repository contains the framework for a Hidden Markov Model (HMM) implementation of an English part-of-speech tagger. The training data is a segment of the Wall Street Journal corpus.

Extracting probabilities

Run the following command to extract bigram emission and transition probabilities:

cat training_data | create_2gram_hmm.sh output_2gram_hmm

Run the following command to extract trigram emission and transition probabilities:

cat training_data | create_3gram_hmm.sh output_3gram_hmm l1 l2 l3 unk_prob_file

Sample training data provided in this repository is toy_input under examples/toy.
The smoothing technique used is linear interpolation, where l1, l2, l3 are lambda values.
Probabilities in unk_prob_file are used to account for unknown words. Sample unk_prob_file provided in this repository is toy_unk under examples/toy.

Train and test sets

The data used to train the tagger is wsj_sec0.word_pos
The unk_prob_file used during training is unk_prob_sec22
The test set used to evaluate the tagger is wsj_sec22.word_pos

All of these files can be found under examples.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Hidden Markov Model Part-of-Speech Tagger

Extracting probabilities

Train and test sets

Files

README.md

Latest commit

History

README.md

File metadata and controls

Hidden Markov Model Part-of-Speech Tagger

Extracting probabilities

Train and test sets