In this assignment, you will implement the perceptron learning algorithm for structured objects. You will use this algorithm to train a dependency parser and a named entity recognizer.
Write Python code to implement the algorithm. Send the code by email to the course instructor (richard.johansson -at- gu.se). Please also include brief answers to the questions below.
This assignment is a bit more free-form than the previous two. This doesn't mean that the amount of code to write is very large, but you'll be expected to assemble a few building blocks on your own. Please ask for help if you get stuck.
Deadline: October 23
NEW: I've made an additional presentation to clarify some details about this assignment.
Review the material from the lecture on structured prediction.
Read the paper Spanning Tree Methods for Discriminative Training of Dependency Parsers by McDonald et al. (2005). We'll use the perceptron algorithm instead of the learning algorithm described in section 3.1, so that section can be skimmed if you prefer.
You may also take a look at the paper Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, Collins (2002). This is a very influential paper and it introduced the perceptron algorithm for structured objects that we're using in this assignment. However, the paper can be a bit harder to read than the McDonald paper, since it's a bit older.
Download this zip file. It contains the parsing and tagging code that you will need, and an evaluation script you'll use for the tagging part.
The code is designed so that you should be able to develop your learning algorithm in a way that is agnostic of the problem it will be used for.
For both parsing and tagging, the code contains three main parts:
- fit and transform, which create the feature mapping and convert a dataset, respectively. There is also the combination fit_transform.
- predict(w, x): finds the highest-scoring parse tree or tag sequence for the sentence x given the weight vector w. This method uses a suitable algorithm to find the top-scoring output: the Eisner algorithm (described in McDonald's paper) for parsing, and the Viterbi algorithm for tagging.
- get_features(x, y): returns a feature representation of the sentence x and a tree or sequence y.
Write Python code to implement the structured perceptron learning algorithm. Write it in a flexible way so that it's not hardwired to any specific task: the only thing it should assume is that there is a "problem definition" as described above, with the methods predict and get_features. The problem definition should probably be given to the algorithm either as a constructor parameter or as an input to the training method.
Hint 1. When creating the weight vector, you need to know how large it should be, that is, how many features are used. You probably need to pass this number as a constructor parameter. The number of features can be determined by calling number_of_features on the vectorizers.
Hint 2. The method get_features in the problem definitions returns the features for the sentence x and a parse tree or sequence y. For practical reasons, this is not a single sparse vector but a matrix (a "list" of vectors). To add them all to your weight vector w, you can use the helper function below. (Here, fv is the matrix returned by get_features, w is your weight vector, and scale is a factor that is +1 if you want to add fv to w and -1 if you want to subtract it.)
def add_sparse_rows_to_dense(fv, w, scale):
    prev = 0
    for ip in fv.indptr:
        w[fv.indices[prev:ip]] += scale * fv.data[prev:ip]
        prev = ip
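As a concrete starting point, here is a minimal sketch of such a task-agnostic trainer. The class name StructuredPerceptron and its constructor signature are my own choices, not part of the provided code; the sketch only relies on the predict/get_features interface described above, and repeats the helper so that it is self-contained.

```python
import numpy as np

def add_sparse_rows_to_dense(fv, w, scale):
    # Helper from the assignment: adds every row of the sparse matrix fv
    # to the dense vector w, multiplied by scale (+1 or -1).
    prev = 0
    for ip in fv.indptr:
        w[fv.indices[prev:ip]] += scale * fv.data[prev:ip]
        prev = ip

class StructuredPerceptron:
    """Task-agnostic structured perceptron.

    `problem` is any object with predict(w, x) and get_features(x, y);
    `n_features` is the weight vector length (e.g. from number_of_features).
    """
    def __init__(self, problem, n_features, n_iter=10):
        self.problem = problem
        self.w = np.zeros(n_features)
        self.n_iter = n_iter

    def fit(self, X, Y):
        for _ in range(self.n_iter):
            for x, y in zip(X, Y):
                g = self.problem.predict(self.w, x)
                if g != y:
                    # Reward the gold-standard output, penalize the guess.
                    add_sparse_rows_to_dense(self.problem.get_features(x, y), self.w, 1.0)
                    add_sparse_rows_to_dense(self.problem.get_features(x, g), self.w, -1.0)
        return self
```

The same object can then be trained with the parsing or the tagging problem definition without changing a line.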
Optional task: The averaged perceptron
Implement the averaged perceptron algorithm (see lecture). What is the effect on the parsing and tagging accuracy?
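One way to sketch the averaging, assuming the same predict/get_features interface as above. The function name train_averaged is my own; for brevity it uses a dense update via fv.sum(axis=0) instead of the sparse helper.

```python
import numpy as np

def train_averaged(problem, X, Y, n_features, n_iter=10):
    # Averaged perceptron: return the average of the weight vector over
    # all training steps rather than its final value, which typically
    # makes the model less sensitive to the last examples seen.
    w = np.zeros(n_features)
    w_sum = np.zeros(n_features)
    for _ in range(n_iter):
        for x, y in zip(X, Y):
            g = problem.predict(w, x)
            if g != y:
                # Dense update for brevity; the sparse helper is faster.
                w += np.asarray(problem.get_features(x, y).sum(axis=0)).ravel()
                w -= np.asarray(problem.get_features(x, g).sum(axis=0)).ravel()
            w_sum += w
    return w_sum / (n_iter * len(X))
```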
Now you are ready to implement the parser. All code that you need is in the file mstparser.py. It contains:
- read_dependency_treebank, which reads the dependency treebank contained in NLTK and converts it into the format we're using in this assignment. It returns two lists: a list X of sentences, and a list Y of corresponding dependency trees.
- ParseVectorizer.
- MSTParsingDefinition.
Now write code to train and evaluate a dependency parser. A typical structure of such a program: read the treebank, create the feature mapping with a ParseVectorizer, train your structured perceptron using the MSTParsingDefinition, and evaluate the predicted trees on held-out data.
Hint 1. Unlike in scikit-learn, the vectorizer methods fit, transform and fit_transform take two inputs: the list X of sentences and the list Y of parse trees.
Hint 2. You can compare a gold-standard parse tree y and a guessed tree g simply by looking at the two lists of integers and counting at how many places they differ. (See read_dependency_treebank for a description of how we represent a tree.) Just remember not to count the dummy token.
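The comparison in Hint 2 can be sketched as a small function. This assumes, as read_dependency_treebank describes, that each tree is a list of head indices whose position 0 is the dummy token:

```python
def attachment_accuracy(gold_trees, guessed_trees):
    # Fraction of tokens whose predicted head matches the gold-standard
    # head, skipping the dummy token at position 0 of every tree.
    correct = 0
    total = 0
    for y, g in zip(gold_trees, guessed_trees):
        for i in range(1, len(y)):
            correct += int(y[i] == g[i])
            total += 1
    return correct / total
```

This slots into your program after you have predicted a tree for every test sentence.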
We will now implement a named-entity tagger. The data we'll use comes from the CoNLL-2003 shared task, a competition in named entity recognition in English and a couple of other languages. Ask the instructor about how to access the data.
Assuming that the parser now runs smoothly, tagging can be done in much the same way. Here is the code that you need.
- read_sentences in the file conll_format.py, which reads sentences from the CoNLL training or testing files. It returns two lists: a list X of sentences, and a list Y of corresponding tag sequences.
- SequenceVectorizer.
- SequenceTaggingDefinition.
Train your named entity tagger and evaluate it on a test set. If the code runs slowly or uses all of the computer's memory, use a smaller dataset.
Hint. The simplest way to evaluate the output is probably to do something similar to what you did for parsing: comparing two lists. (Just note that there is no dummy token here.)
Optional task: better evaluation
You probably evaluated your named entity tagger based on how many tokens it labeled correctly. However, the CoNLL competition did not use this scoring method: instead, it considered the whole named entity mentions. For instance, if the true named entity was Bank of America but a system output just America, the system gets no credit.
The zip file includes the official CoNLL evaluation software: the script conlleval.perl. You can run it from your Python code by calling the function run_evaluator in conll_format.py.
To run it, you need to print the output to a file in the CoNLL column-based format. First convert the output from your tagger (which is probably a list of integers at this point) into a list of tag strings; you can use the method to_tags in SequenceVectorizer for that. Then use write_sentence in conll_format.py to write a sentence to a file in the CoNLL format.
How well does your system work for different classes of named entities?
Instead of using the Viterbi algorithm and the structured perceptron, let's do something simpler: we'll just use a standard classifier to carry out the tagging. The classifier is applied for each token from left to right in a greedy fashion. If you want, you can have a look at the paper Design Challenges and Misconceptions in Named Entity Recognition by Ratinov and Roth (2009), which advocated the use of greedy methods for tagging.
Create a training set by going through the tokens in all training sentences. For each token, you extract a feature set. When you have built the training set, train a classifier. Use the classifier to tag the test sentences.
How well does this system work compared to the structured perceptron tagger, in terms of speed and accuracy?
Hint: you can use the method extract_greedy_features in the class SequenceFeatureExtractor to extract features. Apart from that, you can use standard scikit-learn classes (e.g. DictVectorizer, some classifier).