In this assignment, you will implement the perceptron learning algorithm for structured objects. You will use this algorithm to train a dependency parser and a named entity recognizer.
Write Python code to implement the algorithm. Send the code by email to the course instructor (richard.johansson -at- gu.se). Please also include brief answers to the questions below.
This assignment is a bit more free-form than the previous two. This doesn't mean that the amount of code to write is very large, but you'll be expected to assemble a few building blocks on your own. Please ask for help if you get stuck.
Deadline: October 23
NEW: I've made an additional presentation to clarify some details about this assignment.
Review the material from the lecture on structured prediction.
Read the paper Spanning Tree Methods for Discriminative Training of Dependency Parsers by McDonald et al. (2005). We'll use the perceptron algorithm instead of the learning algorithm described in section 3.1, so that section can be skimmed if you prefer.
You may also take a look at the paper Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, Collins (2002). This is a very influential paper and it introduced the perceptron algorithm for structured objects that we're using in this assignment. However, the paper can be a bit harder to read than the McDonald paper, since it's a bit older.
Download this zip file. It contains the parsing and tagging code that you will need, and an evaluation script you'll use for the tagging part.
The code is designed so that you should be able to develop your learning algorithm in a way that is agnostic of the problem it will be used for.
For both parsing and tagging, the code contains three main parts:
- fit and transform, which create the feature mapping and convert a dataset, respectively. There is also the combination fit_transform.
- predict(w, x): finds the highest-scoring parse tree or tag sequence for the sentence x given the weight vector w. This method uses a suitable algorithm to find the top-scoring output: the Eisner algorithm (described in McDonald's paper) for parsing, and the Viterbi algorithm for tagging.
- get_features(x, y): returns a feature representation of the sentence x and a tree or sequence y.
Write Python code to implement the structured perceptron learning algorithm. Write it in a flexible way so that it's not hardwired to any specific task: the only thing it should assume is that there is a "problem definition" as described above, with the methods predict and get_features. The problem definition should probably be given to the algorithm either as a constructor parameter or as an input to the training method.
Hint 1. When creating the weight vector, you need to know how large it should be, that is, how many features are used. You probably need to pass this number as a constructor parameter. The number of features can be determined by calling number_of_features on the vectorizers.
Hint 2. The method get_features in the problem definitions returns the features for the sentence x and a parse tree or sequence y. For practical reasons, this is not a single sparse vector but a matrix (a "list" of vectors). To add them all to your weight vector w, you can use the helper function below. (Here, fv is the matrix returned by get_features, w is your weight vector, and scale is a factor that is +1 if you want to add fv to w and -1 if you want to subtract it.)
def add_sparse_rows_to_dense(fv, w, scale):
    prev = 0
    for ip in fv.indptr:
        w[fv.indices[prev:ip]] += scale * fv.data[prev:ip]
        prev = ip
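As a concrete starting point, here is a minimal sketch of such a task-agnostic trainer. The class name StructuredPerceptron and its constructor signature are my own choices, not part of the provided code; the sketch only relies on the predict/get_features interface described above, and repeats the helper so that it is self-contained.

```python
import numpy as np

def add_sparse_rows_to_dense(fv, w, scale):
    # Helper from the assignment: adds every row of the sparse matrix fv
    # to the dense vector w, multiplied by scale (+1 or -1).
    prev = 0
    for ip in fv.indptr:
        w[fv.indices[prev:ip]] += scale * fv.data[prev:ip]
        prev = ip

class StructuredPerceptron:
    """Task-agnostic structured perceptron.

    `problem` is any object with predict(w, x) and get_features(x, y);
    `n_features` is the weight vector length (e.g. from number_of_features).
    """
    def __init__(self, problem, n_features, n_iter=10):
        self.problem = problem
        self.w = np.zeros(n_features)
        self.n_iter = n_iter

    def fit(self, X, Y):
        for _ in range(self.n_iter):
            for x, y in zip(X, Y):
                g = self.problem.predict(self.w, x)
                if g != y:
                    # Reward the gold-standard output, penalize the guess.
                    add_sparse_rows_to_dense(self.problem.get_features(x, y), self.w, 1.0)
                    add_sparse_rows_to_dense(self.problem.get_features(x, g), self.w, -1.0)
        return self
```

The same object can then be trained with the parsing or the tagging problem definition without changing a line.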
Optional task: The averaged perceptron
Implement the averaged perceptron algorithm (see lecture). What is the effect on the parsing and tagging accuracy?
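One way to sketch the averaging, assuming the same predict/get_features interface as above. The function name train_averaged is my own; for brevity it uses a dense update via fv.sum(axis=0) instead of the sparse helper.

```python
import numpy as np

def train_averaged(problem, X, Y, n_features, n_iter=10):
    # Averaged perceptron: return the average of the weight vector over
    # all training steps rather than its final value, which typically
    # makes the model less sensitive to the last examples seen.
    w = np.zeros(n_features)
    w_sum = np.zeros(n_features)
    for _ in range(n_iter):
        for x, y in zip(X, Y):
            g = problem.predict(w, x)
            if g != y:
                # Dense update for brevity; the sparse helper is faster.
                w += np.asarray(problem.get_features(x, y).sum(axis=0)).ravel()
                w -= np.asarray(problem.get_features(x, g).sum(axis=0)).ravel()
            w_sum += w
    return w_sum / (n_iter * len(X))
```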
Now you are ready to implement the parser. All code that you need is in the file mstparser.py. It contains:
- read_dependency_treebank, which reads the dependency treebank contained in NLTK and converts it into the format we're using in this assignment. It returns two lists: a list X of sentences, and a list Y of corresponding dependency trees.
- ParseVectorizer.
- MSTParsingDefinition.
Now write code to train and evaluate a dependency parser. A typical structure of such a program: read the treebank, create the feature mapping with a ParseVectorizer, train your structured perceptron using the MSTParsingDefinition, and evaluate the predicted trees on held-out data.
Hint 1. Unlike in scikit-learn, the vectorizer methods fit, transform and fit_transform take two inputs: the list X of sentences and the list Y of parse trees.
Hint 2. You can compare a gold-standard parse tree y and a guessed tree g simply by looking at the two lists of integers and counting at how many places they differ. (See read_dependency_treebank for a description of how we represent a tree.) Just remember not to count the dummy token.
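The comparison in Hint 2 can be sketched as a small function. This assumes, as read_dependency_treebank describes, that each tree is a list of head indices whose position 0 is the dummy token:

```python
def attachment_accuracy(gold_trees, guessed_trees):
    # Fraction of tokens whose predicted head matches the gold-standard
    # head, skipping the dummy token at position 0 of every tree.
    correct = 0
    total = 0
    for y, g in zip(gold_trees, guessed_trees):
        for i in range(1, len(y)):
            correct += int(y[i] == g[i])
            total += 1
    return correct / total
```

This slots into your program after you have predicted a tree for every test sentence.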
We will now implement a named-entity tagger. The data we'll use comes from the CoNLL-2003 shared task, a competition in named entity recognition in English and a couple of other languages. Ask the instructor about how to access the data.
Assuming that the parser now runs smoothly, tagging can be done in much the same way. Here is the code that you need.
- read_sentences in the file conll_format.py, which reads sentences from the CoNLL training or testing files. It returns two lists: a list X of sentences, and a list Y of corresponding tag sequences.
- SequenceVectorizer.
- SequenceTaggingDefinition.
Train your named entity tagger and evaluate it on a test set. If the code runs slowly or uses all of the computer's memory, use a smaller dataset.
Hint. The simplest way to evaluate the output is probably to do something similar to what you did for parsing: comparing two lists. (Just note that there is no dummy token here.)
Optional task: better evaluation
You probably evaluated your named entity tagger based on how many tokens it labeled correctly. However, the CoNLL competition did not use this scoring method: instead, it considered the whole named entity mentions. For instance, if the true named entity was Bank of America but a system output just America, the system gets no credit.
The zip file includes the official CoNLL evaluation software: the script conlleval.perl. You can run it from your Python code by calling the function run_evaluator in conll_format.py.
To run it, you need to print the output to a file in the CoNLL column-based format. First convert the output from your tagger (which is probably a list of integers at this point) into a list of tag strings; you can use the method to_tags in SequenceVectorizer for that. Then use write_sentence in conll_format.py to write a sentence to a file in the CoNLL format.
How well does your system work for different classes of named entities?
Instead of using the Viterbi algorithm and the structured perceptron, let's do something simpler: we'll just use a standard classifier to carry out the tagging. The classifier is applied for each token from left to right in a greedy fashion. If you want, you can have a look at the paper Design Challenges and Misconceptions in Named Entity Recognition by Ratinov and Roth (2009), which advocated the use of greedy methods for tagging.
Create a training set by going through the tokens in all training sentences. For each token, you extract a feature set. When you have built the training set, train a classifier. Use the classifier to tag the test sentences.
How well does this system work compared to the structured perceptron tagger, in terms of speed and accuracy?
Hint: you can use the method extract_greedy_features in the class SequenceFeatureExtractor to extract features. Apart from that, you can use standard scikit-learn classes (e.g. DictVectorizer, some classifier).