For my code and more details, please see my notebook here.
The CoNLL dataset consists of roughly 10,000 sentences, with each word having a part-of-speech tag assigned to it, like the following:
Token | POS |
---|---|
When | WRB |
bank | NN |
financing | NN |
for | IN |
the | DT |
buy-out | NN |
collapsed | VBD |
last | JJ |
week | NN |
, | , |
so | RB |
did | VBD |
UAL | NNP |
's | POS |
stock | NN |
. | . |
A complete look-up table for each part-of-speech tag can be found here.
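The exact data-loading code lives in the notebook; as a purely illustrative sketch (it is an assumption that the data matches NLTK's CoNLL-2000 release), NLTK's corpus reader exposes exactly this kind of (token, POS) structure:

```python
import nltk

nltk.download("conll2000")  # one-time download of the corpus
from nltk.corpus import conll2000

# Each sentence is a list of (token, POS) pairs
tagged_sents = conll2000.tagged_sents()
print(len(tagged_sents))    # on the order of 10,000 sentences
print(tagged_sents[0][:4])  # e.g. [('Confidence', 'NN'), ('in', 'IN'), ...]
```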
According to Wikipedia, "words that are assigned to the same part of speech generally display similar syntactic behavior (they play similar roles within the grammatical structure of sentences)". This means that the POS of a word depends on its role in the current sentence. Consider the word "right" in the following two sentences: "This is the right (JJ) answer" vs. "You have the right (NN) to remain silent". In isolation, the word cannot be assigned its correct part of speech; only the rest of the sentence disambiguates it. Hence, we want to use whole sequences as the model's input, not just individual words. I decided to go for a very simple approach and use a vanilla RNN, as sketched below.
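To make the "whole sequence in, one output per step out" idea concrete, here is a minimal PyTorch sketch; all sizes are illustrative, not the notebook's actual hyperparameters:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 10_000, 100, 128  # illustrative sizes

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_size, batch_first=True)

# One sentence of 9 word indices; the whole sequence goes in at once
sentence = torch.randint(0, vocab_size, (1, 9))
outputs, _ = rnn(embedding(sentence))
print(outputs.shape)  # torch.Size([1, 9, 128]): one hidden state per time step
```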
An overview of the complete architecture I used can be seen here:
One might wonder what a convolutional layer is doing in this architecture. If you have a look at the output of the RNN model, you see that its size is (sequence_length, hidden_size). However, the final output should be of size (sequence_length, num_classes), so that I end up with a probability for each step in the sequence and for each POS tag. A standard dense layer over the flattened output cannot do this, since it would squeeze the sequence-length dimension. In Keras, there is the Time-Distributed Dense (TDD) layer, which is a dense layer that respects time steps. Unfortunately, PyTorch offers no dedicated implementation of it. However, a convolutional layer can substitute for it: one important characteristic of a Time-Distributed Dense layer is that it applies the same weights at each time step, just like a convolutional layer does with the help of its kernel. Below is a more detailed visualization of what the convolutional layer does.
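As a complement to the visualization, here is a minimal sketch of that substitution; the sizes are illustrative placeholders, not the notebook's actual values:

```python
import torch
import torch.nn as nn

hidden_size, num_classes, seq_len = 128, 44, 9  # illustrative sizes

# kernel_size=1 applies the same weights independently at every time step,
# mimicking Keras' TimeDistributed(Dense(...))
conv = nn.Conv1d(in_channels=hidden_size, out_channels=num_classes, kernel_size=1)

rnn_out = torch.randn(1, seq_len, hidden_size)  # (batch, seq_len, hidden_size)
# Conv1d expects (batch, channels, length), hence the transposes
logits = conv(rnn_out.transpose(1, 2)).transpose(1, 2)
print(logits.shape)  # torch.Size([1, 9, 44]): one score per tag per time step
```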
After training for around 40 epochs, the model achieved 0.8890 and 0.8667 top-1 accuracy on the validation and test set, respectively. The top-1 accuracy for each class looks like this:
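The per-class chart itself is in the notebook; as a rough sketch of how such per-class accuracies can be computed (the helper name and the flat preds/labels tensors are assumptions, not the notebook's code):

```python
import torch

def per_class_accuracy(preds, labels, num_classes):
    """Top-1 accuracy per POS tag, given flat tensors of predicted and gold tag indices."""
    accs = {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # skip tags that never occur in the gold labels
            accs[c] = (preds[mask] == c).float().mean().item()
    return accs

# Hypothetical usage: flatten (batch, seq_len) predictions and gold tags
# preds = logits.argmax(dim=-1).flatten()
# accs = per_class_accuracy(preds, gold_tags.flatten(), num_classes=44)
```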
An in-depth written analysis, as well as further ideas for improvement, can be found in the final sections of my notebook.