Assignment 1: Grammatical function tagging

In this assignment, you will build a machine learning system that assigns grammatical function tags (e.g. subject, object) to edges in a dependency tree.

The aim of the assignment is to give you a feel for the typical workflow of machine learning-based NLP: selecting a classifier, designing features, and analyzing the errors. You will also learn the basics of the scikit-learn library.

Solve the tasks described below and describe what you have done in a brief report. In particular, you need to include answers to each Question. Send the code and the report by email to the course instructor (richard.johansson -at- gu.se). This assignment is solved individually.

Deadline: September 14

Preliminaries

Review the lecture slides on how to use the scikit-learn library.

For inspiration, you may have a look at the paper Better Training for Function Labeling, Chrupała et al. (2007). Just keep in mind that they use a phrase structure representation while we will use dependencies, so your implementation will be a bit different.

Introduction

In dependency syntax, the grammatical structure of a sentence is represented as a tree consisting of edges between the words in the sentence. Each edge is labeled with a grammatical function that specifies how the words are related. The program that we will develop will automatically assign the grammatical function labels to the edges in the trees.

Recently, the Universal Dependencies (UD) project has built a standardized description of grammatical functions. This description has been designed to be cross-lingually applicable, so that grammatical function tags are reused across languages as much as possible. Here is a recent paper that discusses the cross-lingual applicability of UD.

Here is an example of an English sentence annotated according to the UD guidelines:

In this tree, the word She is the nominal active subject (nsubj) of the verb lives.

The UD project distributes treebanks for 18 languages. Download the collection of treebanks (the file ud-treebanks-v1.1.tgz) from this site and decompress the package into your working directory.

For each language, there are three files containing dependency trees: a training file, a development file, and a test file. The files have been split in this way in order to encourage experimental reproducibility: it is easier to compare results if everyone uses the same experimental settings.

Running the software

Download and unpack this zip file containing the code for the assignment. The package contains the following four Python files:

assignment1.py: the main program. You will make some small edits to this program to add code to train and run the function classifier.
feature_extraction.py: contains the feature extraction function. You will edit this file to add more features.
dependency.py: helper code to load the dependency trees in the UD format.
evaluation.py: functions to evaluate the results and print some statistics.
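
To give an idea of what loading a treebank involves, here is a minimal sketch of a CoNLL-U reader. It is not the code in dependency.py: the Token record and the function read_conllu are hypothetical names chosen for illustration, and the two-word example is invented. The sketch relies only on the CoNLL-U layout of ten tab-separated columns per token, with the head index in column 7 and the function label in column 8.

```python
from collections import namedtuple

# Hypothetical token record; the field names are illustrative and need not
# match the ones used in dependency.py.
Token = namedtuple("Token", "id form lemma cpos pos feats head deprel")

def read_conllu(lines):
    """Yield one sentence (a list of Token) at a time from CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # comment line
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges such as "1-2"
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence

# Invented two-token fragment; head 0 marks the root of the tree.
example = [
    "1\tShe\tshe\tPRON\tPRP\t_\t2\tnsubj\t_\t_",
    "2\tlives\tlive\tVERB\tVBZ\t_\t0\troot\t_\t_",
    "",
]
for sent in read_conllu(example):
    for tok in sent:
        print(tok.form, tok.head, tok.deprel)
```

Each token stores the index of its head, so an edge is simply the pair (token, head), and the deprel field is the label the classifier will learn to predict.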

Edit the file assignment1.py and change the variable UD_HOME so that it refers to the directory where you unpacked the UD collection. Run the program and look at the result.

Question: What is the error rate?

Hint: If you want to work with a language other than English, change the function input language from en to something else. The table languages lists the languages available in the UD collection.

This program consists of two main parts: training (in the function train_function_tagger) and evaluation (in evaluate_function_tagger). The training part of the program goes through the training file, and collects examples of edges and their corresponding function tags. It doesn't yet train a classifier: that is what you will implement later.

In the evaluation function, the program reads the development file from the treebank, and tries to guess the grammatical functions of the edges in each tree. It then compares all guessed function labels to the true labels in the treebank, and computes an error rate. Since we haven't yet added a classifier, the program just assigns punct (punctuation) to all edges, so the error rate is quite high.
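
The error rate computation can be sketched as follows. The function error_rate is a hypothetical stand-in for what evaluation.py does, and the label lists are invented; the point is only to show why labeling everything as punct scores poorly.

```python
def error_rate(Y_true, Y_guess):
    """Fraction of edges whose guessed label differs from the true label."""
    if len(Y_true) != len(Y_guess):
        raise ValueError("label lists must have the same length")
    errors = sum(1 for t, g in zip(Y_true, Y_guess) if t != g)
    return errors / len(Y_true)

# Invented gold labels for five edges.
Y_true = ["nsubj", "root", "punct", "dobj", "punct"]

# The baseline used before a classifier is added: always guess punct.
Y_guess = ["punct" for _ in Y_true]

print(error_rate(Y_true, Y_guess))  # 3 of the 5 guesses are wrong
```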

Using a classifier

Edit the file assignment1.py and add code in the function train_function_tagger to train a classifier using scikit-learn. The code used in the lecture can be used as an example. Then uncomment the lines for saving the classifier to a file.

Uncomment the lines in evaluate_function_tagger to load the classifier that was trained in the previous step. Replace the line

Y_guess = [ 'punct' for _ in X ]

with something more useful.
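
As a rough sketch of how training, saving, loading, and prediction might fit together: the example below assumes that each edge is represented as a feature dictionary, and it combines a DictVectorizer with a LinearSVC in a pipeline. That choice of classifier, the toy data, and the in-memory pickle round trip are all assumptions made for illustration; the assignment code saves the classifier to a file, and the lecture code may be organized differently.

```python
import pickle

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: one feature dictionary per edge, plus its function tag.
X_train = [
    {"pos": "PRON", "head_pos": "VERB"},
    {"pos": "NOUN", "head_pos": "VERB"},
    {"pos": "PUNCT", "head_pos": "VERB"},
    {"pos": "DET", "head_pos": "NOUN"},
]
Y_train = ["nsubj", "dobj", "punct", "det"]

# The vectorizer turns dictionaries into sparse vectors; the pipeline keeps
# it together with the classifier so both are saved and loaded as one object.
classifier = make_pipeline(DictVectorizer(), LinearSVC())
classifier.fit(X_train, Y_train)

# Round trip through pickle (kept in memory here instead of a file).
saved = pickle.dumps(classifier)
loaded = pickle.loads(saved)

# Replacing the constant punct guess with the classifier's predictions.
X = [{"pos": "PUNCT", "head_pos": "VERB"}, {"pos": "DET", "head_pos": "NOUN"}]
Y_guess = loaded.predict(X)
print(list(Y_guess))
```

Keeping the vectorizer inside the pipeline avoids a common mistake: fitting a new vectorizer at evaluation time, which would produce feature columns that do not match the ones the classifier was trained on.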

Run the program again.

Question: What is the error rate now?

Going into the details

Enable the function input all_funcs to evaluate_function_tagger and run the program again. We will now see statistics (counts, and evaluation in terms of precision and recall) for all the function tags seen in the treebank.

Question: What function tags can we find successfully? Is there any frequent tag that we can't find? (For your reference, here is a list of the functions used in the UD treebanks.)
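
For reference, per-tag precision and recall can be computed along these lines. The function precision_recall is a hypothetical sketch, not the code in evaluation.py, and the labels are invented. Precision is the share of the guesses for a tag that were correct; recall is the share of the true occurrences of a tag that were found.

```python
from collections import Counter

def precision_recall(Y_true, Y_guess):
    """Return {tag: (precision, recall)} for every tag seen in either list."""
    gold = Counter(Y_true)       # how often each tag truly occurs
    guessed = Counter(Y_guess)   # how often each tag was predicted
    correct = Counter(t for t, g in zip(Y_true, Y_guess) if t == g)
    result = {}
    for tag in sorted(set(gold) | set(guessed)):
        p = correct[tag] / guessed[tag] if guessed[tag] else 0.0
        r = correct[tag] / gold[tag] if gold[tag] else 0.0
        result[tag] = (p, r)
    return result

Y_true = ["nsubj", "dobj", "punct", "nsubj"]
Y_guess = ["nsubj", "nsubj", "punct", "dobj"]
for tag, (p, r) in precision_recall(Y_true, Y_guess).items():
    print(tag, p, r)
```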

Enable the input err_stats and re-run the program. Now, the 10 most frequent errors will be listed. (The left column shows the true function tag and the right column the predicted tag.)

Question: Which are the most common errors made by your system?
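
A listing of this kind can be produced by counting (true tag, predicted tag) pairs for the misclassified edges. The sketch below uses a hypothetical function most_frequent_errors and invented labels; it is not taken from evaluation.py.

```python
from collections import Counter

def most_frequent_errors(Y_true, Y_guess, n=10):
    """Return the n most common (true, predicted) pairs among the errors."""
    errors = Counter((t, g) for t, g in zip(Y_true, Y_guess) if t != g)
    return errors.most_common(n)

Y_true = ["dobj", "dobj", "nsubj", "punct"]
Y_guess = ["nsubj", "nsubj", "dobj", "punct"]
for (true_tag, guessed_tag), count in most_frequent_errors(Y_true, Y_guess):
    print(true_tag, guessed_tag, count)
```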

You can also look directly at the sentences. Comment out the calls to train_function_tagger and evaluate_function_tagger, and uncomment the call to analysis. Insert code in analysis to load the classifier and use it, as you previously did in the evaluation part.

Adding new features

Open the file feature_extraction.py and look at the function extract_features. This function considers a token and extracts a number of features that should hopefully be useful when we classify the function of the edge going into that token. The features are stored in the dictionary x.

Question: Which features does the classifier currently use? Can you explain the results you saw previously?

Based on the problems you saw, and your own linguistic intuitions, add more features to x. (You need to add at least six features to pass the assignment.)
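
To illustrate what adding features to x can look like, here is a sketch that does not use the real token objects from dependency.py. Tokens are represented as plain dictionaries with the hypothetical keys id, form and pos, and the function name extract_features_sketch is likewise invented. The features shown are examples only, not a recommended set.

```python
def extract_features_sketch(token, head):
    """Build a feature dictionary for the edge going into `token`.

    `head` is the token's head, or None if the token is the root.
    """
    x = {}
    x["pos"] = token["pos"]
    # Part-of-speech tag of the head word.
    x["head_pos"] = head["pos"] if head else "ROOT"
    # A combined feature lets a linear classifier react to the pair as a unit.
    x["pos+head_pos"] = x["pos"] + "|" + x["head_pos"]
    # On which side of its head the token stands.
    if head:
        x["side"] = "left" if token["id"] < head["id"] else "right"
    # The lowercased word form itself.
    x["form_lower"] = token["form"].lower()
    return x

she = {"id": 1, "form": "She", "pos": "PRON"}
lives = {"id": 2, "form": "lives", "pos": "VERB"}
print(extract_features_sketch(she, lives))
print(extract_features_sketch(lives, None))
```

String-valued features like these are turned into one binary column per distinct value when the dictionaries are passed through scikit-learn's DictVectorizer.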

Hint: In the feature extraction function, there is already some code that locates some tokens around the current token. Here is a figure exemplifying the meaning of those neighbor tokens:

After you have added a feature, rerun the evaluation and observe its effect on the overall error rate and the individual errors.

Question: What is the effect of each feature? What result do you get in the end?

Selecting a learning algorithm

After you have found a set of useful features, evaluate a number of different machine learning algorithms and see which one gives you the lowest error rate. Here are some possible choices:

perceptron: sklearn.linear_model.Perceptron
Naive Bayes: sklearn.naive_bayes.MultinomialNB
logistic regression: sklearn.linear_model.LogisticRegression
support vector classifier (SVC): sklearn.svm.LinearSVC

Question: Which classifier gives you the best result?
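
One possible way to organize the comparison is a loop over the candidate classifiers, each wrapped in a pipeline with a DictVectorizer. The sketch below uses invented toy data and default parameters throughout, so the numbers it prints say nothing about how the classifiers behave on the treebank.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data standing in for the training and development sets.
X_train = [
    {"pos": "PRON", "head_pos": "VERB"},
    {"pos": "NOUN", "head_pos": "VERB"},
    {"pos": "PUNCT", "head_pos": "VERB"},
    {"pos": "DET", "head_pos": "NOUN"},
    {"pos": "PUNCT", "head_pos": "NOUN"},
    {"pos": "DET", "head_pos": "NOUN"},
]
Y_train = ["nsubj", "dobj", "punct", "det", "punct", "det"]
X_dev = [
    {"pos": "PUNCT", "head_pos": "VERB"},
    {"pos": "DET", "head_pos": "NOUN"},
    {"pos": "PRON", "head_pos": "VERB"},
]
Y_dev = ["punct", "det", "nsubj"]

candidates = {
    "perceptron": Perceptron(),
    "Naive Bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(),
    "SVC": LinearSVC(),
}

results = {}
for name, clf in candidates.items():
    model = make_pipeline(DictVectorizer(), clf)
    model.fit(X_train, Y_train)
    # score() returns accuracy, so the error rate is its complement.
    results[name] = 1.0 - model.score(X_dev, Y_dev)
    print(name, results[name])
```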

Optional task: Explore feature selection algorithms such as SelectKBest. How small can you make the feature set without affecting the quality of the classifier?
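
As a starting point for this task, SelectKBest can be placed between the vectorizer and the classifier in a pipeline. The sketch below assumes the chi-squared score function (chi2), which requires non-negative feature values and therefore suits the binary columns produced by DictVectorizer; the toy data and the value k=3 are arbitrary.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data; vectorizing it gives six binary feature columns.
X_train = [
    {"pos": "PRON", "head_pos": "VERB"},
    {"pos": "NOUN", "head_pos": "VERB"},
    {"pos": "PUNCT", "head_pos": "VERB"},
    {"pos": "DET", "head_pos": "NOUN"},
    {"pos": "PUNCT", "head_pos": "NOUN"},
    {"pos": "DET", "head_pos": "NOUN"},
]
Y_train = ["nsubj", "dobj", "punct", "det", "punct", "det"]

# Keep only the k columns with the highest chi-squared scores.
model = make_pipeline(DictVectorizer(), SelectKBest(chi2, k=3), LinearSVC())
model.fit(X_train, Y_train)

# Inspect which feature columns survived the selection.
vectorizer = model.named_steps["dictvectorizer"]
selector = model.named_steps["selectkbest"]
all_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
kept = [name for name, keep in zip(all_names, selector.get_support()) if keep]
print(kept)
```

Repeating the evaluation for decreasing values of k shows how far the feature set can shrink before the error rate starts to rise.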

Optional task: You may also try out other types of classifiers from scikit-learn's list. Note that some of the algorithms may be quite expensive in terms of time or memory.

Optional task: You can try to tune the parameters of the learning algorithms to improve the performance:

perceptron: the parameter n_iter (called max_iter in more recent scikit-learn releases) controls the number of iterations.
Naive Bayes: the parameter alpha defines the Laplace smoothing constant.
logistic regression and SVC: the parameter C controls the tradeoff between faithfulness to the training set and regularization (keeping the classifier simple). A low C value favors regularization.
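
Tuning a parameter such as C usually amounts to training one classifier per candidate value and comparing the error rates on the development set. Here is a minimal sketch for LinearSVC; the toy data and the grid of C values are invented, and the same loop would work for alpha in Naive Bayes.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data standing in for the training and development sets.
X_train = [
    {"pos": "PRON", "head_pos": "VERB"},
    {"pos": "NOUN", "head_pos": "VERB"},
    {"pos": "PUNCT", "head_pos": "VERB"},
    {"pos": "DET", "head_pos": "NOUN"},
    {"pos": "PUNCT", "head_pos": "NOUN"},
    {"pos": "DET", "head_pos": "NOUN"},
]
Y_train = ["nsubj", "dobj", "punct", "det", "punct", "det"]
X_dev = [{"pos": "PUNCT", "head_pos": "VERB"}, {"pos": "DET", "head_pos": "NOUN"}]
Y_dev = ["punct", "det"]

C_values = [0.01, 0.1, 1.0, 10.0]
errors_by_C = {}
for C in C_values:
    model = make_pipeline(DictVectorizer(), LinearSVC(C=C))
    model.fit(X_train, Y_train)
    errors_by_C[C] = 1.0 - model.score(X_dev, Y_dev)
    print(C, errors_by_C[C])

# Pick the value with the lowest development error (first one on ties).
best_C = min(C_values, key=lambda C: errors_by_C[C])
print("best C:", best_C)
```

Tune on the development set only, so that the test set stays untouched until the final evaluation.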

Final evaluation

After you have optimized your feature set and selected a learning algorithm, remove the input number_of_sentences in the call to the training function. This means that we are using the whole training treebank. Retrain the classifier. (This will take more time than previously. Until now, we used a small subset in order to make the training process faster.)

Question: What error rate do you get now?

Finally, change the function evaluate_function_tagger so that it uses UD's test corpus instead of the development corpus: replace dev with test here:

with open('{0}/UD_{1}/{2}-ud-dev.conllu'.format(UD_HOME,
                                                languages[lang],
                                                lang)) as f:
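
If you prefer not to edit the string by hand each time, the part of the corpus can be made a parameter. The helper treebank_path below is a hypothetical addition, not part of the assignment code, and the one-entry languages table and the directory name are invented for the example; only the path pattern is taken from the snippet above.

```python
def treebank_path(ud_home, languages, lang, part="dev"):
    """Build the path to the train, dev or test file of one treebank."""
    if part not in ("train", "dev", "test"):
        raise ValueError("part must be 'train', 'dev' or 'test'")
    return "{0}/UD_{1}/{2}-ud-{3}.conllu".format(ud_home,
                                                 languages[lang],
                                                 lang,
                                                 part)

# Invented values for illustration.
UD_HOME = "/data/ud-treebanks-v1.1"
languages = {"en": "English"}

print(treebank_path(UD_HOME, languages, "en"))
print(treebank_path(UD_HOME, languages, "en", part="test"))
```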

Rerun the evaluation one last time.

Question: What is your final error rate?