Hidden Markov Model Part-of-Speech Tagger
This repository contains the framework for a Hidden Markov Model (HMM) implementation of an English part-of-speech tagger. The training data is a segment of the Wall Street Journal corpus.
Run the following command to extract bigram emission and transition probabilities:
cat training_data | create_2gram_hmm.sh output_2gram_hmm
Run the following command to extract trigram emission and transition probabilities:
cat training_data | create_3gram_hmm.sh output_3gram_hmm l1 l2 l3 unk_prob_file
- Sample
training data
provided in this repository istoy_input
underexamples/toy
. - The smoothing technique used is linear interpolation, where
l1
,l2
,l3
are lambda values. - Probabilities in
unk_prob_file
are used to account for unknown words. Sampleunk_prob_file
provided in this repository istoy_unk
underexamples/toy
.
- The data used to train the tagger is
wsj_sec0.word_pos
- The
unk_prob_file
used during training isunk_prob_sec22
- The test set used to evaluate the tagger is
wsj_sec22.word_pos
All of these files can be found under examples
.