In this assignment, you will implement a named entity recognizer using sequence tagging algorithms. You will compare two different approaches to sequence tagging:
For the VG grade, you will also see if you can improve the step-by-step system by replacing the greedy search with a beam search.
Write Python code to implement the algorithm, and investigate how well it works compared to the CRF. Send the code and a document containing your results to the course instructor (richard.johansson -at- gu.se). This assignment is solved individually or in pairs.
Deadline: October 5
Repeat the material from the lecture on structured prediction.
Download and unpack this zip file. The package contains the code you need to get started, and all the corpora you will need. It also contains the CRF software (for Mac, which should work in the lab, but also for Linux and Windows).
You can choose to solve this assignment using English, Spanish, or Dutch corpora.
The corpora we will use were created for the yearly NLP competition ("shared task") organized by the CoNLL conference on machine learning in NLP. The English data comes from the task in 2003, and the Spanish and Dutch data from the task in 2002. The evaluation script we will use was also created for these tasks.
These corpora use names of four different categories: person (PER), location (LOC), organization (ORG), and miscellaneous (MISC). Here is an example sentence showing some names:
[ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].
In the zip file that you downloaded, the data have already been divided into training and test parts. All the files are in the directory called corpora.
So how do we find the interesting groups of words? The solution we will adopt in this exercise is to use the B-I-O encoding to convert the groups into sequences. B-I-O stands for Beginning-Inside-Outside. Here is an example:
United B-ORG
Nations I-ORG
official O
Ekeus B-PER
heads O
for O
Baghdad B-LOC
. O
The meaning of a B-PER tag is "a person name begins here", an I-PER tag means "a person name continues", and an O tag means "no name here". We have now converted the grouping problem into a sequence tagging problem similar to POS tagging.
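To make the conversion concrete, here is a small self-contained sketch of how labeled groups can be turned into a B-I-O sequence. This is not part of the assignment code; the function name and the span representation are invented purely for illustration.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled spans into a B-I-O tag sequence.

    `spans` is a list of (start, end, label) tuples, where start/end
    are token indices and end is exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # a name begins here
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # the name continues
    return tags

tokens = ["United", "Nations", "official", "Ekeus", "heads", "for", "Baghdad", "."]
spans = [(0, 2, "ORG"), (3, 4, "PER"), (6, 7, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-ORG', 'I-ORG', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']
```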
The files contain some additional information in addition to the word strings. For instance, in the English file we have POS tags and noun phrase chunk tags in addition to the words (and the NE tags). The information at one position in a sentence is stored as a row in a text file. The output tag (that we want to predict) is stored in the final column. For instance:
United NNP B-NP B-ORG
Nations NNP I-NP I-ORG
official NN B-NP O
Ekeus NNP B-NP B-PER
heads VBZ O O
for IN O O
Baghdad NNP B-NP B-LOC
. . O O
The first column in the text file contains the words, the second the POS tags, the third the noun chunk labels, and the final column the B-I-O labels that we want to predict. The Dutch corpora don't contain noun chunk labels, and the Spanish corpora don't even have POS tags; instead they have "word shape" features: in short, "Aa" for capitalized words and "a" for non-capitalized words (see here for an explanation).
El Aa O
consejero a O
de a O
Economía Aa B-MISC
Industria Aa I-MISC
Comercio Aa I-MISC
, , O
Manuel Aa B-PER
Amigo Aa I-PER
The file conlleval.perl is a Perl program to evaluate the output of sequence labelers producing the B-I-O format. The output of the automatic tagger should be added as a separate column, after the gold-standard annotation. For instance:
El Aa O O
consejero a O O
de a O B-MISC
Economía Aa B-MISC I-MISC
Industria Aa I-MISC I-MISC
Comercio Aa I-MISC I-MISC
, , O O
Manuel Aa B-PER B-PER
Amigo Aa I-PER I-PER
To run the evaluation tool, we invoke it like this:
perl conlleval.perl < output_file_name
Note: if you are doing this on your own machine, you may have to install Perl.
Here is what the evaluator writes if we run it on the output above:
processed 10 tokens with 2 phrases; found: 2 phrases; correct: 1.
accuracy: 80.00%; precision: 50.00%; recall: 50.00%; FB1: 50.00
MISC: precision: 0.00%; recall: 0.00%; FB1: 0.00 1
PER: precision: 100.00%; recall: 100.00%; FB1: 100.00 1
The evaluator prints precision and recall measures, and their harmonic mean (the F-measure, FB1). We also see the performance for each of the types of names, MISC and PER in this case. Note that this evaluation is quite tough: we get no credit for an almost-correct group.
The file assignment3.py
contains Python code for a program that carries out sequence tagging in a greedy way. For each word in a sentence, it calls a classifier to guess the output label for that word. That decision is then fixed, and will never be revised when we move further into the sentence.
Your task is now to add the missing parts in this software in order to make it work. Look for the text *** YOUR CODE HERE *** to locate the places where you will edit.
Your first task is to collect training examples for the step-by-step
classifier. Go to the function train_ne_tagger
and have
a look. This function will read one sentence at a time, extract
training examples from each sentence, and finally train the classifier.
First, add code that extracts training examples from the sentence S.
For your help, the file feature_extraction.py
contains a
function extract_features
that will extract features
for you. Take a look at it and make sure you understand what it
does. When you call this function, you will need to specify the
position in the sentence from which features will be extracted.
Hint: the feature extraction function also takes an input corresponding to the output tag from the previous step. Get this tag from the output sequence O.
Add all the training examples to the list X, and add the corresponding output tags (from the list O) to the list Y.
Next, add some code to train the step-by-step classifier, and return this classifier from the function. You can use any classifier, for instance something from scikit-learn.
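As an illustration, here is a minimal self-contained sketch of what the collect-and-train step could look like. The `extract_features` below is a simplified stand-in for the helper in feature_extraction.py (whose real signature may differ), the toy sentence replaces the real corpus reading, and LogisticRegression is just one possible classifier choice.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_features(sentence, i, prev_tag):
    """Simplified stand-in for the extract_features helper in
    feature_extraction.py; the real one uses richer features."""
    word = sentence[i]
    return {"word": word, "is_cap": word[0].isupper(), "prev_tag": prev_tag}

# One toy training sentence S with its gold tags O, standing in for
# the sentences read by train_ne_tagger.
S = ["United", "Nations", "official", "Ekeus", "heads", "for", "Baghdad", "."]
O = ["B-ORG", "I-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]

X, Y = [], []
for i in range(len(S)):
    prev_tag = O[i - 1] if i > 0 else "<s>"   # gold tag from the previous step
    X.append(extract_features(S, i, prev_tag))
    Y.append(O[i])

# Any scikit-learn classifier will do; LogisticRegression also supports
# predict_log_proba, which the VG beam-search task needs later.
classifier = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(X, Y)
```

Note that the feature dictionaries have to be vectorized before a scikit-learn classifier can use them; wrapping a DictVectorizer and the classifier in a pipeline is one convenient way to do that.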
Finally, go to the function tag_sentence, which should compute a tag sequence for a given input sentence. Add the missing code to guess the tag sequence.
Hint: here, the previous tag should be your previous guess, not taken from the gold-standard annotation as in the training function.
Hint: if you're using a classifier from scikit-learn, note that the method predict always outputs a vector, even if you are just classifying a single instance. So you may need to do something like classifier.predict(features)[0].
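Putting the hints together, a greedy tag_sentence could look roughly like the sketch below. The feature extractor is again a simplified stand-in, and the toy training data is only there to make the sketch run on its own; your real implementation should reuse the assignment's own helpers.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_features(sentence, i, prev_tag):
    # Simplified stand-in for the real extract_features helper.
    return {"word": sentence[i], "prev_tag": prev_tag}

def tag_sentence(classifier, sentence):
    """Greedy left-to-right tagging: each guess is fixed once made."""
    tags = []
    for i in range(len(sentence)):
        prev_tag = tags[i - 1] if i > 0 else "<s>"   # previous *guess*, not gold
        features = extract_features(sentence, i, prev_tag)
        tags.append(classifier.predict([features])[0])  # predict returns a vector
    return tags

# Toy demonstration: train on one sentence, then tag it again.
S = ["United", "Nations", "official", "Ekeus", "heads", "for", "Baghdad", "."]
O = ["B-ORG", "I-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]
X = [extract_features(S, i, O[i - 1] if i > 0 else "<s>") for i in range(len(S))]
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, O)
print(tag_sentence(clf, S))
```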
Now your program is ready to be trained and tested. If you run the file assignment3.py, the program will use the training corpus to train an NER system. It will then apply the NER system to the test corpus. Finally, the CoNLL evaluation script will be called automatically and the result will be printed. Execute the program and observe the results.
Note: change the variables TRAINSET and TESTSET if you prefer to use the Spanish or Dutch corpora.
Run the software and observe the result. Does it work reasonably well? Which name types seem to be easier to find, and which are harder? Take a look into the output file, look at some of the errors, and see if you can find reasonable explanations for any of them.
Optional task. Can you think of any other features? These can include the strings we find in the rows and columns of the file, but possibly also other things, such as substrings, combinations, or external resources such as name lists. The paper Design Challenges and Misconceptions in Named Entity Recognition by Ratinov and Roth (2009) contains a description of some useful features.
As we saw in the lecture, conditional random fields (CRF) are a popular machine learning model for sequence tagging tasks. The CRF model is basically a generalization of logistic regression to sequences. It uses the Viterbi algorithm instead of the greedy step-by-step algorithm we used.
We will use a CRF implementation
called CRFSGD, which
was built by Léon Bottou.
The program can be found in the directory called crf.
The CRFSGD program uses a special format for telling the program which features to use. You can find this feature specification in the file crf_features.txt. Here is a part of it:
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
... and so on ...

# Bigram
B
This feature definition file is CRFSGD's equivalent of the Python
function extract_features
that you're using in your code.
As you can see, there are two kinds of features: unigram features with a name starting with U, and bigram features with a name starting with B. Unigram features include the current output tag, and bigram features include the current tag and the previous tag. The features also often use some part of the input: that is the %x
part. The numbers appearing after %x
refer to the row and column of the input. A row index of 0 means the current row, -1 the row before, and so on.
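To make this indexing concrete, here is a small Python sketch of how a %x[row,col] reference can be read as an offset into the token table. This is our own illustration, not part of CRFSGD; real CRF tools also substitute a special boundary marker for out-of-range rows, and the exact symbol varies, so we just use a placeholder.

```python
def template_value(rows, i, row_offset, col):
    """Interpret a CRFSGD-style %x[row,col] reference at position i.
    `rows` is the token table: one list of columns per token."""
    j = i + row_offset
    if 0 <= j < len(rows):
        return rows[j][col]
    # Real CRF tools insert a special boundary marker here; the exact
    # symbol varies between tools, so this placeholder is a guess.
    return "_OUT"

# Rows for "United/NNP/B-NP Nations/NNP/I-NP official/NN/B-NP":
rows = [["United", "NNP", "B-NP"],
        ["Nations", "NNP", "I-NP"],
        ["official", "NN", "B-NP"]]
print(template_value(rows, 2, -2, 0))   # U00:%x[-2,0] at position 2 -> United
print(template_value(rows, 2, 0, 1))    # %x[0,1] -> the current POS tag, NN
```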
Open a command prompt (a terminal window on a Mac or a Linux machine, or the Command Prompt on Windows).
Now we will train a CRF with the CRFSGD program. We will train on the file TRAINING_FILE, and the resulting CRF will be written to ner_crf.model. The feature definition file is called crf_features.txt. Copy the feature definition file and the executable file crfasgd from the directory crf into your working directory. Then execute the following line in your terminal window:
./crfasgd -e "perl conlleval.perl -q" ner_crf.model crf_features.txt TRAINING_FILE
The part -e "perl conlleval.perl -q" tells CRFSGD how to run the evaluation script for diagnostics.
The training process will normally take a couple of minutes.
Note: if you are doing this on a Windows machine, copy the files in the directory crf/windows into your working directory, then run crfasgd.exe instead of ./crfasgd. On a Linux machine, copy crfasgd from the crf/linux directory and use that instead of the Mac program.
To run your CRF on a test file, run a command like this:
./crfasgd -t ner_crf.model YOUR_TEST_FILE > crf_output.txt
Take a look at crf_output.txt. You can then evaluate the quality of the output by using the CoNLL evaluation tool:
perl conlleval.perl < crf_output.txt
How well does the CRF work in comparison to the greedy tagging system you implemented?
The sequence tagger we have implemented in Python is greedy: it makes a guess based on what looks best right now, and it can't change its mind as new information comes in. In the VG task, we'll implement an algorithm called beam search that tries to address this problem. The idea is to remember N different tag sequences instead of just one. (The parameter N is called the beam width.)
The beam consists of a list of search items. Each item contains a total score and a sequence of tags. For instance, it could be something such as (-14.5, ['B-ORG', 'I-ORG', 'O']).
Note: for this task, you need to use a classifier that is capable of computing log probabilities, such as LogisticRegression.
Here is the pseudocode of the beam search algorithm. Read through it and make sure you understand all the steps in principle, so that you would be able to explain the algorithm on paper. The step where we compute the log probabilities is explained below.
Make an initial beam. This is a list consisting of a single item:
a total score of 0, and an empty list of tags.
For each position i in the sentence:
    Make a new list for the beam at the current step
    For each item in the beam from the previous step:
        Compute the scores (log probabilities) of all the possible tags at this step
        For each possible tag, add a new item to the beam:
          - its total score is the sum of the scores of the previous item and of the current tag
          - its tag list is the previous list plus the current tag
    Retain the N top-scoring items in the beam
Return the tag list for the top-scoring item in the beam
To compute the log probabilities of the possible tags, we replace the call to the classifier's predict with predict_log_proba. This will give us a vector of log probabilities, so in order to know which number belongs to which tag, we need to get the classifier's list of possible tags (classes_).
Here is an example of how it can look:
(... extract features at this position ...)
logprobs = classifier.predict_log_proba(features)[0]
for tag, logprob in zip(classifier.classes_, logprobs):
    (... do something useful with tag and logprob ...)
Replace the function tag_sentence
with a new function
that uses a beam search. Do you get any improvement over your previous
greedy implementation (which would be equivalent to a beam width of 1)?
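For reference, one way the pseudocode can be turned into Python is sketched below. The stand-in feature extractor and the toy training data are only there to make the sketch self-contained and runnable; your real implementation should reuse the assignment's own helpers and corpus reading.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_features(sentence, i, prev_tag):
    # Simplified stand-in for the real extract_features helper.
    return {"word": sentence[i], "prev_tag": prev_tag}

def beam_tag_sentence(classifier, sentence, beam_width=4):
    """Beam search over tag sequences.
    Each beam item is a pair (total log probability, list of tags)."""
    beam = [(0.0, [])]                    # initial beam: score 0, no tags
    for i in range(len(sentence)):
        new_beam = []
        for score, tags in beam:
            prev_tag = tags[-1] if tags else "<s>"
            features = extract_features(sentence, i, prev_tag)
            logprobs = classifier.predict_log_proba([features])[0]
            for tag, logprob in zip(classifier.classes_, logprobs):
                new_beam.append((score + logprob, tags + [tag]))
        # retain the N top-scoring items
        beam = sorted(new_beam, key=lambda item: item[0], reverse=True)[:beam_width]
    return beam[0][1]                     # tag list of the top-scoring item

# Toy demonstration: train on one sentence, then tag it with the beam search.
S = ["United", "Nations", "official", "Ekeus", "heads", "for", "Baghdad", "."]
O = ["B-ORG", "I-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]
X = [extract_features(S, i, O[i - 1] if i > 0 else "<s>") for i in range(len(S))]
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, O)
print(beam_tag_sentence(clf, S, beam_width=4))
```

With beam_width=1 this reduces to the greedy tagger, so it is easy to compare the two settings directly.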
Discussion question. You might find this algorithm a bit similar to the Viterbi algorithm. Can you try to think of the strengths and weaknesses of the beam search compared to Viterbi?