In this assignment, you will implement a named entity recognizer using sequence tagging algorithms. You will compare two different approaches to sequence tagging:
For the VG grade, you will also see if you can improve the step-by-step system by replacing the greedy search with a beam search.
Write Python code to implement the algorithm, and investigate how well it works compared to the CRF. Send the code and a document containing your results to the course instructor (richard.johansson -at- gu.se). This assignment is solved individually or in pairs.
Deadline: October 5
Repeat the material from the lecture on structured prediction.
Download and unpack this zip file. The package contains the code you need to get started, and all the corpora you will need. It also contains the CRF software (for Mac, which should work in the lab, but also for Linux and Windows).
You can choose to solve this assignment using English, Spanish, or Dutch corpora.
The corpora we will use were created for the yearly NLP competition ("shared task") organized by the CoNLL conference on machine learning in NLP. The English data comes from the task in 2003, and the Spanish and Dutch data from the task in 2002. The evaluation script we will use was also created for these tasks.
These corpora use names of four different categories: person (PER), location (LOC), organization (ORG), and miscellaneous (MISC). Here is an example sentence showing some names:
[ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].
In the zip file that you downloaded, the data have already been divided into training and test parts. All the files are in the directory called corpora.
So how do we find the interesting groups of words? The solution we will adopt in this exercise is to use the B-I-O encoding to convert the groups into sequences. B-I-O stands for Beginning-Inside-Outside. Here is an example:
United B-ORG
Nations I-ORG
official O
Ekeus B-PER
heads O
for O
Baghdad B-LOC
. O
The meaning of a B-PER tag is "a person name begins here", an I-PER tag means "a person name continues", and an O tag means "no name here". We have now converted the grouping problem into a sequence tagging problem similar to POS tagging.
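To make the conversion concrete, here is a small self-contained sketch of how labeled groups can be turned into a B-I-O sequence. This is not part of the assignment code; the function name and the span representation are invented purely for illustration.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled spans into a B-I-O tag sequence.

    `spans` is a list of (start, end, label) tuples, where start/end
    are token indices and end is exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # a name begins here
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # the name continues
    return tags

tokens = ["United", "Nations", "official", "Ekeus", "heads", "for", "Baghdad", "."]
spans = [(0, 2, "ORG"), (3, 4, "PER"), (6, 7, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-ORG', 'I-ORG', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']
```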
The files contain some additional information in addition to the word strings. For instance, in the English file we have POS tags and noun phrase chunk tags in addition to the words (and the NE tags). The information at one position in a sentence is stored as a row in a text file. The output tag (that we want to predict) is stored in the final column. For instance:
United NNP B-NP B-ORG
Nations NNP I-NP I-ORG
official NN B-NP O
Ekeus NNP B-NP B-PER
heads VBZ O O
for IN O O
Baghdad NNP B-NP B-LOC
. . O O
The first column in the text file contains the words, the second the POS tags, the third the noun chunk labels, and the final column the B-I-O labels that we want to predict. The Dutch corpora don't contain noun chunk labels, and the Spanish corpora don't even have POS tags; instead they have "word shape" features: in short, "Aa" for capitalized words and "a" for non-capitalized words (see here for an explanation).
El Aa O
consejero a O
de a O
Economía Aa B-MISC
Industria Aa I-MISC
Comercio Aa I-MISC
, , O
Manuel Aa B-PER
Amigo Aa I-PER
The file conlleval.perl is a Perl program to evaluate the output of sequence labelers producing the B-I-O format. The output of the automatic tagger should be added as a separate column, after the gold-standard annotation. For instance:
El Aa O O
consejero a O O
de a O B-MISC
Economía Aa B-MISC I-MISC
Industria Aa I-MISC I-MISC
Comercio Aa I-MISC I-MISC
, , O O
Manuel Aa B-PER B-PER
Amigo Aa I-PER I-PER
To run the evaluation tool, we invoke it like this:
perl conlleval.perl < output_file_name
Note: if you are doing this on your own machine, you may have to install Perl.
Here is what the evaluator writes if we run it on the output above:
processed 10 tokens with 2 phrases; found: 2 phrases; correct: 1.
accuracy: 80.00%; precision: 50.00%; recall: 50.00%; FB1: 50.00
MISC: precision: 0.00%; recall: 0.00%; FB1: 0.00 1
PER: precision: 100.00%; recall: 100.00%; FB1: 100.00 1
The evaluator prints precision and recall measures, and their harmonic mean (the F-measure, FB1). We also see the performance for each of the types of names, MISC and PER in this case. Note that this evaluation is quite tough: we get no credit for an almost-correct group.
The file assignment3.py
contains Python code for a program that carries out sequence tagging in a greedy way. For each word in a sentence, it calls a classifier to guess the output label for that word. That decision is then fixed, and will never be revised when we move further into the sentence.
Your task is now to add the missing parts in this software in order to make it work. Look for the text *** YOUR CODE HERE *** to locate the places where you will edit.
Your first task is to collect training examples for the step-by-step
classifier. Go to the function train_ne_tagger
and have
a look. This function will read one sentence at a time, extract
training examples from each sentence, and finally train the classifier.
First, add code that extracts training examples from the sentence S.
For your help, the file feature_extraction.py
contains a
function extract_features
that will extract features
for you. Take a look at it and make sure you understand what it
does. When you call this function, you will need to specify the
position in the sentence from which features will be extracted.
Hint: the feature extraction function also takes an input corresponding to the output tag from the previous step. Get this tag from the output sequence O.
Add all the training examples to the list X, and add the corresponding output tags (from the list O) to the list Y.
Next, add some code to train the step-by-step classifier, and return this classifier from the function. You can use any classifier, for instance something from scikit-learn.
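As an illustration, here is a minimal self-contained sketch of what the collect-and-train step could look like. The `extract_features` below is a simplified stand-in for the helper in feature_extraction.py (whose real signature may differ), the toy sentence replaces the real corpus reading, and LogisticRegression is just one possible classifier choice.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_features(sentence, i, prev_tag):
    """Simplified stand-in for the extract_features helper in
    feature_extraction.py; the real one uses richer features."""
    word = sentence[i]
    return {"word": word, "is_cap": word[0].isupper(), "prev_tag": prev_tag}

# One toy training sentence S with its gold tags O, standing in for
# the sentences read by train_ne_tagger.
S = ["United", "Nations", "official", "Ekeus", "heads", "for", "Baghdad", "."]
O = ["B-ORG", "I-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]

X, Y = [], []
for i in range(len(S)):
    prev_tag = O[i - 1] if i > 0 else "<s>"   # gold tag from the previous step
    X.append(extract_features(S, i, prev_tag))
    Y.append(O[i])

# Any scikit-learn classifier will do; LogisticRegression also supports
# predict_log_proba, which the VG beam-search task needs later.
classifier = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(X, Y)
```

Note that the feature dictionaries have to be vectorized before a scikit-learn classifier can use them; wrapping a DictVectorizer and the classifier in a pipeline is one convenient way to do that.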
Finally, go to the function tag_sentence, which should compute a tag sequence for a given input sentence. Add the missing code to guess the tag sequence.
Hint: here, the previous tag should be your previous guess, not taken from the gold-standard annotation as in the training function.
Hint: if you're using a classifier from scikit-learn, note that the method predict always outputs a vector, even if you are just classifying a single instance. So you may need to do something like classifier.predict(features)[0].
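Putting the hints together, a greedy tag_sentence could look roughly like the sketch below. The feature extractor is again a simplified stand-in, and the toy training data is only there to make the sketch run on its own; your real implementation should reuse the assignment's own helpers.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_features(sentence, i, prev_tag):
    # Simplified stand-in for the real extract_features helper.
    return {"word": sentence[i], "prev_tag": prev_tag}

def tag_sentence(classifier, sentence):
    """Greedy left-to-right tagging: each guess is fixed once made."""
    tags = []
    for i in range(len(sentence)):
        prev_tag = tags[i - 1] if i > 0 else "<s>"   # previous *guess*, not gold
        features = extract_features(sentence, i, prev_tag)
        tags.append(classifier.predict([features])[0])  # predict returns a vector
    return tags

# Toy demonstration: train on one sentence, then tag it again.
S = ["United", "Nations", "official", "Ekeus", "heads", "for", "Baghdad", "."]
O = ["B-ORG", "I-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]
X = [extract_features(S, i, O[i - 1] if i > 0 else "<s>") for i in range(len(S))]
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, O)
print(tag_sentence(clf, S))
```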
Now your program is ready to be trained and tested. If you run the file assignment3.py, the program will use the training corpus to train an NER system. It will then apply the NER system to the test corpus. Finally, the CoNLL evaluation script will be called automatically and the result will be printed. Execute the program and observe the results.
Note: change the variables TRAINSET and TESTSET if you prefer to use the Spanish or Dutch corpora.
Run the software and observe the result. Does it work reasonably well? Which name types seem to be easier to find, and which are harder? Take a look into the output file, look at some of the errors, and see if you can find reasonable explanations for any of them.
Optional task. Can you think of any other features? These can include the strings we find in the rows and columns of the file, but possibly also other things, such as substrings, combinations, or external resources such as name lists. The paper Design Challenges and Misconceptions in Named Entity Recognition by Ratinov and Roth (2009) contains a description of some useful features.
As we saw in the lecture, conditional random fields (CRF) are a popular machine learning model for sequence tagging tasks. The CRF model is basically a generalization of logistic regression to sequences. It uses the Viterbi algorithm instead of the greedy step-by-step algorithm we used.
We will use a CRF implementation
called CRFSGD, which
was built by Léon Bottou.
The program can be found in the directory called crf.
The CRFSGD program uses a special format for telling the program which features to use. You can find this feature specification in the file crf_features.txt. Here is a part of it:
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
... and so on ...

# Bigram
B
This feature definition file is CRFSGD's equivalent of the Python
function extract_features
that you're using in your code.
As you can see, there are two kinds of features: unigram features with a name starting with U, and bigram features with a name starting with B. Unigram features include the current output tag, and bigram features include the current tag and the previous tag. The features also often use some part of the input: that is the %x
part. The numbers appearing after %x
refer to the row and column of the input. A row index of 0 means the current row, -1 the row before, and so on.
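To make this indexing concrete, here is a small Python sketch of how a %x[row,col] reference can be read as an offset into the token table. This is our own illustration, not part of CRFSGD; real CRF tools also substitute a special boundary marker for out-of-range rows, and the exact symbol varies, so we just use a placeholder.

```python
def template_value(rows, i, row_offset, col):
    """Interpret a CRFSGD-style %x[row,col] reference at position i.
    `rows` is the token table: one list of columns per token."""
    j = i + row_offset
    if 0 <= j < len(rows):
        return rows[j][col]
    # Real CRF tools insert a special boundary marker here; the exact
    # symbol varies between tools, so this placeholder is a guess.
    return "_OUT"

# Rows for "United/NNP/B-NP Nations/NNP/I-NP official/NN/B-NP":
rows = [["United", "NNP", "B-NP"],
        ["Nations", "NNP", "I-NP"],
        ["official", "NN", "B-NP"]]
print(template_value(rows, 2, -2, 0))   # U00:%x[-2,0] at position 2 -> United
print(template_value(rows, 2, 0, 1))    # %x[0,1] -> the current POS tag, NN
```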
Open a command prompt (a terminal window on a Mac or a Linux machine, or the Command Prompt on Windows).
Now we will train a CRF with the CRFSGD program. We will train on the file TRAINING_FILE, and the resulting CRF will be written to ner_crf.model. The feature definition file is called crf_features.txt. Copy the feature definition file and the executable file crfasgd from the directory crf into your working directory. Then execute the following line in your terminal window:
./crfasgd -e "perl conlleval.perl -q" ner_crf.model crf_features.txt TRAINING_FILE
The part -e "perl conlleval.perl -q" tells CRFSGD how to run the evaluation script for diagnostics.
The training process will normally take a couple of minutes.
Note: if you are doing this on a Windows machine, copy the files in the directory crf/windows into your working directory, then run crfasgd.exe instead of ./crfasgd. On a Linux machine, copy crfasgd from the crf/linux directory and use that instead of the Mac program.
To run your CRF on a test file, run a command like this:
./crfasgd -t ner_crf.model YOUR_TEST_FILE > crf_output.txt
Take a look at crf_output.txt. You can then evaluate the quality of the output by using the CoNLL evaluation tool:
perl conlleval.perl < crf_output.txt
How well does the CRF work in comparison to the greedy tagging system you implemented?
The sequence tagger we have implemented in Python is greedy: it makes a guess based on what looks best right now, and it can't change its mind as new information comes in. In the VG task, we'll implement an algorithm called beam search that tries to address this problem. The idea is to remember N different tag sequences instead of just one. (The parameter N is called the beam width.)
The beam consists of a list of search items. Each item contains a total score and a sequence of tags. For instance, it could be something such as (-14.5, ['B-ORG', 'I-ORG', 'O']).
Note: for this task, you need to use a classifier that is capable of computing log probabilities, such as LogisticRegression.
Here is the pseudocode of the beam search algorithm. Read through it and make sure you understand all the steps in principle, so that you would be able to explain the algorithm on paper. The step where we compute the log probabilities is explained below.
Make an initial beam. This is a list consisting of a single item:
a total score of 0, and an empty list of tags.
For each position i in the sentence:
    Make a new list for the beam at the current step
    For each item in the beam from the previous step:
        Compute the scores (log probabilities) of all the possible tags at this step
        For each possible tag, add a new item to the beam:
          - its total score is the sum of the scores of the previous item and of the current tag
          - its tag list is the previous list plus the current tag
    Retain the N top-scoring items in the beam
Return the tag list for the top-scoring item in the beam
To compute the log probabilities of the possible tags, we replace the call to the classifier's predict with predict_log_proba. This will give us a vector of log probabilities, so in order to know which number belongs to which tag, we need to get the classifier's list of possible tags (classes_).
Here is an example of how it can look:
(... extract features at this position ...)
logprobs = classifier.predict_log_proba(features)[0]
for tag, logprob in zip(classifier.classes_, logprobs):
    (... do something useful with tag and logprob ...)
Replace the function tag_sentence
with a new function
that uses a beam search. Do you get any improvement over your previous
greedy implementation (which would be equivalent to a beam width of 1)?
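For reference, one way the pseudocode can be turned into Python is sketched below. The stand-in feature extractor and the toy training data are only there to make the sketch self-contained and runnable; your real implementation should reuse the assignment's own helpers and corpus reading.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_features(sentence, i, prev_tag):
    # Simplified stand-in for the real extract_features helper.
    return {"word": sentence[i], "prev_tag": prev_tag}

def beam_tag_sentence(classifier, sentence, beam_width=4):
    """Beam search over tag sequences.
    Each beam item is a pair (total log probability, list of tags)."""
    beam = [(0.0, [])]                    # initial beam: score 0, no tags
    for i in range(len(sentence)):
        new_beam = []
        for score, tags in beam:
            prev_tag = tags[-1] if tags else "<s>"
            features = extract_features(sentence, i, prev_tag)
            logprobs = classifier.predict_log_proba([features])[0]
            for tag, logprob in zip(classifier.classes_, logprobs):
                new_beam.append((score + logprob, tags + [tag]))
        # retain the N top-scoring items
        beam = sorted(new_beam, key=lambda item: item[0], reverse=True)[:beam_width]
    return beam[0][1]                     # tag list of the top-scoring item

# Toy demonstration: train on one sentence, then tag it with the beam search.
S = ["United", "Nations", "official", "Ekeus", "heads", "for", "Baghdad", "."]
O = ["B-ORG", "I-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]
X = [extract_features(S, i, O[i - 1] if i > 0 else "<s>") for i in range(len(S))]
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, O)
print(beam_tag_sentence(clf, S, beam_width=4))
```

With beam_width=1 this reduces to the greedy tagger, so it is easy to compare the two settings directly.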
Discussion question. You might find this algorithm a bit similar to the Viterbi algorithm. Can you try to think of the strengths and weaknesses of the beam search compared to Viterbi?