Assignment 3: Implementation of a part-of-speech tagger

In this assignment, you will implement a bigram part-of-speech tagger.

Write Python code to solve the tasks described below. Send the code and the answers to the questions by email to Mehdi (mehdi.ghanimifard@gu.se).

NB: submit your answers individually. You are allowed to discuss the assignment with your fellow students, but not to write code together.

Deadline: March 13

Preliminaries

Check the slides on tagging, in particular make sure that you understand how to estimate the emission and transition probabilities (slides 36–39) and how to find the best sequence of tags using the Viterbi algorithm (slides 40–45).

Download this Python file, which contains some code you can start from.

Reading a tagged corpus

Ask the instructor for a password and then get a tagged corpus from this page. It is advisable to select a language that you understand, so that you can analyze the tagger's errors.

These files use a simple column-based text format. The first column contains the words, and the second the part-of-speech tags. Sentences are separated by blank lines. Here is an example of some English text, using the Penn Treebank tagset.

The       DT
rifles    NNS
were      VBD
n't       RB
loaded    VBN
.         .

As        IN
interest  NN
rates     NNS
rose      VBD
,         ,
...

The provided code contains two functions you can use to read the tagged corpus. Have a look at the functions so that you understand what they do.

The output of read_tagged_corpus is a list of tagged sentences. Each tagged sentence is a list of word–tag pairs.

[[('The', 'DT'), ('rifles', 'NNS'), ('were', 'VBD'), ("n't", 'RB'), ('loaded', 'VBN'), ('.', '.')], 
 [('As', 'IN'), ('interest', 'NN'), ('rates', 'NNS'), ('rose', 'VBD'), (',', ','), ...],
 ...]

Divide the list of tagged sentences into a training and a test part.
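For instance, a simple 90/10 split could look like this. The toy corpus below just mimics the format shown above; in your own code you would use the output of read_tagged_corpus, and the split ratio is only an example:

```python
# A minimal sketch of splitting the tagged sentences into training and test
# parts. The 90/10 ratio is arbitrary; choose what suits your corpus size.
tagged_sentences = [
    [('The', 'DT'), ('rifles', 'NNS'), ('were', 'VBD'),
     ("n't", 'RB'), ('loaded', 'VBN'), ('.', '.')],
    [('As', 'IN'), ('interest', 'NN'), ('rates', 'NNS'),
     ('rose', 'VBD'), (',', ',')],
] * 50  # pretend we have 100 sentences

split_point = int(0.9 * len(tagged_sentences))
train_sentences = tagged_sentences[:split_point]
test_sentences = tagged_sentences[split_point:]
```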

A baseline unigram tagger

As you remember from the labs in the NLP course, NLTK contains a wide range of taggers. In this assignment, we will use a baseline tagger: for each word in the sentence being tagged, it outputs the most frequent tag observed with that word in the training corpus. For previously unseen words, it outputs the tag that is most frequent overall.

Take a quick look at the function train_nltk_baseline.
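For comparison, a plain-Python sketch of the same most-frequent-tag idea (this is not the actual NLTK implementation that train_nltk_baseline uses, just an illustration of the logic):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Return a tagging function implementing the most-frequent-tag baseline."""
    tag_counts = Counter()                  # overall tag frequencies
    word_tag_counts = defaultdict(Counter)  # per-word tag frequencies
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[tag] += 1
            word_tag_counts[word][tag] += 1
    default_tag = tag_counts.most_common(1)[0][0]

    def tag_word(word):
        # most frequent tag for a known word, overall most frequent otherwise
        if word in word_tag_counts:
            return word_tag_counts[word].most_common(1)[0][0]
        return default_tag

    return tag_word
```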

Estimating the HMM parameters

You will now implement the bigram HMM tagger. The first task is to estimate the transition and emission probabilities.

def hmm_train_tagger(tagged_sentences):
    estimate the emission and transition probabilities
    return the two probability tables

This function will go through the tagged corpus and estimate the probabilities by counting occurrences of tags, tag–tag pairs, and tag–word pairs. It then returns the two probability dictionaries.

Hint. It is probably useful to introduce special tags to handle the start and end of the sentences.
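One possible way to structure the counting, using such special start and end tags. The tag names '<s>' and '</s>' and the plain maximum-likelihood estimation are just one choice:

```python
from collections import Counter, defaultdict

START, END = '<s>', '</s>'  # assumed names for the special boundary tags

def hmm_train_tagger(tagged_sentences):
    """Estimate transition and emission probabilities by maximum likelihood."""
    transition_counts = defaultdict(Counter)  # transition_counts[prev_tag][tag]
    emission_counts = defaultdict(Counter)    # emission_counts[tag][word]
    for sentence in tagged_sentences:
        prev_tag = START
        for word, tag in sentence:
            transition_counts[prev_tag][tag] += 1
            emission_counts[tag][word] += 1
            prev_tag = tag
        transition_counts[prev_tag][END] += 1  # close the sentence

    def normalize(counts):
        # turn raw counts into conditional probability tables
        tables = {}
        for key, counter in counts.items():
            total = sum(counter.values())
            tables[key] = {item: n / total for item, n in counter.items()}
        return tables

    return normalize(transition_counts), normalize(emission_counts)
```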

Check. According to your estimates, if the previous word is a determiner (DT), what is the probability of the current word being a noun in the singular (NN)? If the current word has the tag NN, what is the probability that it is "help"? (If the language you're working with is not English, then test with some other words and tags.)

Tagging a sentence

To tag a sentence, you need to apply the Viterbi algorithm, and then retrace your steps back to the initial dummy link.

def hmm_tag_sentence(tagger_data, sentence):
    apply the Viterbi algorithm
    retrace your steps
    return the list of tagged words

Implement the Viterbi algorithm, which takes a list of words and outputs the most likely path through the HMM state space. The input to this algorithm is the sentence and the two probability tables that you computed in hmm_train_tagger. Here is some pseudocode to give you an overview of what to do.

def viterbi(tagger_data, sentence):

    make a dummy link with log probability of 0, START tag, and no predecessor
    current list = [ the dummy link ]
    
    for each word in the sentence:
        previous list = current list
        current list = []        
        determine the possible tags for this word

        for each tag of the possible tags:
            make a link representing the highest-scoring path ending with this tag
            add that link to the current list

    end the sequence with a dummy: a link for the highest-scoring path ending with the tag END

Hint. When you determine the possible tags for a word, for efficiency reasons it is recommended that you use only the tags that have been observed in the training corpus for that word. For a word that didn't occur in the training set, you can allow all tags (or a list of open-class tags).

Clarification. What we refer to as a "link" in the pseudocode will probably be implemented as a tuple consisting of 1) the log probability of the tags up to that point; 2) the tag at that position; 3) a reference to a preceding link.
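To make this representation concrete, here is a toy illustration of such link tuples; the probabilities 0.6 and 0.4 are made up for the example:

```python
import math

# A "link" as a tuple: (total log probability, tag, predecessor link).
start_link = (0.0, '<s>', None)                        # the dummy start link
link_dt = (math.log(0.6), 'DT', start_link)            # path <s> -> DT
link_nn = (link_dt[0] + math.log(0.4), 'NN', link_dt)  # path <s> -> DT -> NN

# Following the predecessor references recovers the tag sequence in reverse.
path = []
link = link_nn
while link is not None:
    path.append(link[1])
    link = link[2]
path.reverse()  # path is now ['<s>', 'DT', 'NN']
```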

To find the highest-scoring path ending with a certain tag, use an algorithm such as in this pseudocode:

    determine the emission log probability: the probability that this tag will emit this word
    
    find the previous link that gives the highest total log probability,
    where the total log probability is the sum of
        1) the emission log probability,
        2) the log probability of the transition from the tag of the previous link to the current tag,
        3) the total log probability of the previous link

Hint. What is the emission probability of a word that we haven't seen in the training corpus? The simplest solution is probably that you don't include any emission probability for those words. A better solution might be that you estimate an emission probability for a special token <UNKNOWN>, for instance by looking at the words that occur once in the training corpus. See also the first optional assignment.

Finally, after the Viterbi algorithm has created the dummy end link, you need to collect the tags. You do this by retracing your steps, following the reference from each link to its predecessor:

def retrace(end_link, sentence_length):
    tags = []
    link = predecessor of end_link
    while the tag of the link isn't START:
        add the tag of the link to tags
        link = predecessor of the link
    reverse the list of tags and return it

Alternatively, use recursion instead of reversing the list at the end.
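In Python, the iterative version might look as follows, assuming the tuple representation (log probability, tag, predecessor) from the clarification above and a START constant for the dummy start tag. The sentence_length argument is left out here; it could be used as a sanity check on the result:

```python
START = '<s>'  # assumed name for the dummy start tag

def retrace(end_link):
    """Collect the tags by walking from the dummy end link back to START."""
    tags = []
    link = end_link[2]          # skip the dummy END link itself
    while link[1] != START:
        tags.append(link[1])
        link = link[2]
    tags.reverse()
    return tags
```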

Evaluating the tagger

Run the tagger on the test set and estimate its accuracy. How well does it compare to the baseline tagger?

What is the accuracy on words that you haven't seen in the training set?

Which words are most frequently tagged incorrectly, and why?
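An evaluation helper along these lines could compute both the overall accuracy and the accuracy restricted to unseen words; the function name and the calling convention are just a suggestion:

```python
def evaluate(gold_sentences, predicted_tag_sequences, training_vocabulary):
    """Overall accuracy, plus accuracy restricted to unseen words."""
    correct = total = 0
    unseen_correct = unseen_total = 0
    for gold_sentence, predicted_tags in zip(gold_sentences,
                                             predicted_tag_sequences):
        for (word, gold_tag), predicted_tag in zip(gold_sentence,
                                                   predicted_tags):
            total += 1
            correct += (gold_tag == predicted_tag)
            if word not in training_vocabulary:
                unseen_total += 1
                unseen_correct += (gold_tag == predicted_tag)
    overall = correct / total
    unseen = unseen_correct / unseen_total if unseen_total else None
    return overall, unseen
```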

Further experiments (optional)

Improved handling of unseen words

Typically, the tagging accuracy is much lower for words that we haven't observed in the training set. Try to come up with a method to improve the tagging of these words. This will probably be quite important if you try one of the morphologically more complex languages, such as Hungarian.

One idea might be that you use a Naive Bayes model to estimate the emission probabilities in these cases. This model can then use features like prefixes and suffixes, whether or not a word is capitalized or contains numbers, etc.

A simpler idea could be to convert unseen words to "pseudowords", e.g. CAPITALIZED_WORD if you encounter the unseen word Ashgabat. See figure 2.6 in the notes by Collins.
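A sketch of such a mapping; the class names and the particular patterns below are only examples, and Collins's figure lists a richer set:

```python
def pseudoword(word):
    """Map an unseen word to a coarse word-shape class."""
    if word.isdigit() and len(word) == 4:
        return 'FOUR_DIGITS'        # e.g. years such as "1984"
    if any(ch.isdigit() for ch in word):
        return 'CONTAINS_DIGIT'
    if word[0].isupper():
        return 'CAPITALIZED_WORD'   # e.g. "Ashgabat"
    if word.endswith('ing'):
        return 'ENDS_IN_ING'
    return 'OTHER'
```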

A confidence interval for the tagging accuracy

In assignment 2, we computed the confidence interval for the accuracy of a classifier using the binomial distribution. As we discussed in the lecture, it isn't statistically sound to apply that method when evaluating a tagger because the words (and the tagger's decisions) are not independent.

Implement a bootstrapping method to compute a confidence interval for the tagging accuracy. When you generate the random test sets, you should pick whole sentences, not single words.

It will be useful to precompute as much as possible. You can represent the evaluation of a sentence as a pair (number_of_correct, number_of_words); when you select sentences randomly, you then just add these numbers to the running totals.
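A sketch of the bootstrap, assuming such a list of per-sentence (number_of_correct, number_of_words) pairs; the number of iterations and the confidence level are parameters you can vary:

```python
import random

def bootstrap_interval(sentence_results, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for tagging accuracy.

    sentence_results is a list of (number_of_correct, number_of_words)
    pairs, one per test sentence, as suggested above.
    """
    rng = random.Random(seed)
    accuracies = []
    n = len(sentence_results)
    for _ in range(iterations):
        # resample whole sentences with replacement
        sample = [sentence_results[rng.randrange(n)] for _ in range(n)]
        correct = sum(c for c, _ in sample)
        words = sum(w for _, w in sample)
        accuracies.append(correct / words)
    accuracies.sort()
    lower = accuracies[int(alpha / 2 * iterations)]
    upper = accuracies[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper
```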

Use more context

In a bigram tagger such as yours, the transition probabilities model the relation between the tag of one word and the tag of the previous word. Can we improve the tagging accuracy by using more than one previous tag? Many of the best taggers in fact use this idea: they use two or three previous tags.

If you use more previous tags, you will need to estimate the transition probabilities differently. You will also need to modify your implementation of the Viterbi algorithm: its state space will be more complex, so the tagging will be a bit slower.

If you use more context, there will be a higher risk of data sparsity. You might need to use some sort of smoothing (such as interpolation smoothing) to get reliable transition probabilities.
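For instance, linear interpolation combines the trigram, bigram, and unigram transition estimates with weights that sum to one. The weights below are arbitrary; in practice they would be tuned, e.g. on held-out data:

```python
def interpolated_transition(trigram_p, bigram_p, unigram_p,
                            lambdas=(0.6, 0.3, 0.1)):
    """Linearly interpolate three transition estimates.

    The weights must sum to 1 so the result is still a probability.
    """
    l3, l2, l1 = lambdas
    return l3 * trigram_p + l2 * bigram_p + l1 * unigram_p
```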