In this assignment, you will build a machine learning system that assigns grammatical function tags (e.g. subject, object) to edges in a dependency tree.
The aim of the assignment is that you should get a feel for the typical workflow of machine learning-based NLP: selecting a classifier, designing features, and analyzing the errors. You will also learn the basics about the scikit-learn library.
Solve the tasks described below and describe what you have done in a brief report. In particular, you need to include answers to each Question. Send the code and the report by email to the course instructor (richard.johansson -at- gu.se). This assignment is solved individually.
Deadline: September 14
Review the lecture slides on how to use the scikit-learn library.
As an inspiration, you may have a look at the paper Better Training for Function Labeling, Chrupała et al. (2007). Just keep in mind that they are using a phrase structure representation while we will use dependencies, so your implementation will be a bit different.
In dependency syntax, the grammatical structure of a sentence is represented as a tree consisting of edges between the words in the sentence. Each edge is labeled with a grammatical function that specifies how the words are related. The program that we will develop will automatically assign the grammatical function labels to the edges in the trees.
Recently, the Universal Dependencies (UD) project has built a standardized description of grammatical functions. This description has been designed to be cross-lingually applicable, so that grammatical function tags are reused across languages as much as possible. Here is a recent paper that discusses the cross-lingual applicability of UD.
Here is an example of an English sentence annotated according to the UD guidelines:
In this tree, the word She is the nominal active subject (nsubj) of the verb lives.
The UD project distributes treebanks for 18 languages. Download the collection of treebanks (the file ud-treebanks-v1.1.tgz) from this site and decompress the package into your working directory.
For each language, there are three files containing dependency trees: a training file, a development file, and a test file. The files have been split in this way in order to encourage experimental reproducibility: it is easier to compare results if everyone uses the same experimental settings.
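To get a feel for what the treebank files contain, here is a minimal sketch of reading the CoNLL-U format, in which each token is one line with ten tab-separated columns. The sample sentence, the function name read_conllu, and the dictionary keys are made up for illustration; the assignment's own dependency.py already handles loading, so you do not need to write this yourself.

```python
# Hypothetical three-token sentence in CoNLL-U format.
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
SAMPLE = (
    "1\tShe\tshe\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tlives\tlive\tVERB\tVBZ\t_\t0\troot\t_\t_\n"
    "3\there\there\tADV\tRB\t_\t2\tadvmod\t_\t_\n"
    "\n"
)


def read_conllu(lines):
    """Yield each sentence as a list of token dictionaries."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # comment line
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword token ranges such as "1-2"
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "pos": cols[3],
            "head": int(cols[6]),   # 0 means the token is the root
            "deprel": cols[7],      # the function tag we want to predict
        })
    if sentence:
        yield sentence


sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["form"], token["head"], token["deprel"])
```

The HEAD column gives the edge and the DEPREL column gives its label, so every token line corresponds to one training or evaluation example.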
Download and unpack this zip file containing the code for the assignment. The package contains the following four Python files:
assignment1.py: the main program. You will make some small edits to this program to add code to train and run the function classifier.
feature_extraction.py: contains the feature extraction function. You will edit this file to add more features.
dependency.py: helper code to load the dependency trees in the UD format.
evaluation.py: functions to evaluate the results and print some statistics.
Edit the file assignment1.py and change the variable UD_HOME so that it refers to the directory where you unpacked the UD collection.
Run the program and look at the result.
Question: What is the error rate?
Hint: If you want to work with a language other than English, change the function input language from en to something else. The table languages lists the languages available in the UD collection.
This program consists of two main parts: training (in the function train_function_tagger) and evaluation (in evaluate_function_tagger).
The training part of the program goes through the training file and collects examples of edges and their corresponding function tags. It doesn't yet train a classifier: that is what you will implement later.
In the evaluation function, the program reads the development file from the treebank and tries to guess the grammatical functions of the edges in each tree. It then compares all guessed function labels to the true labels in the treebank and computes an error rate. Since we haven't yet added a classifier, the program just assigns punct (punctuation) to all edges, so the error rate is quite high.
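The error rate described here is simply the fraction of edges whose guessed label differs from the true one. The following is a minimal sketch with made-up labels; the function name error_rate is an assumption and not necessarily what evaluation.py calls it.

```python
def error_rate(y_true, y_guess):
    """Fraction of edges whose guessed label differs from the true label."""
    errors = sum(1 for t, g in zip(y_true, y_guess) if t != g)
    return errors / len(y_true)


# Toy gold labels for four edges (illustrative only).
Y_true = ['nsubj', 'root', 'advmod', 'punct']

# The baseline in the unmodified program: guess punct for every edge.
Y_guess = ['punct' for _ in Y_true]

baseline = error_rate(Y_true, Y_guess)
print(baseline)  # 0.75: only the one true punct edge is guessed correctly
```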
Edit the file assignment1.py and add code in the function train_function_tagger to train a classifier using scikit-learn. The code used in the lecture can be used as an example. Then uncomment the lines for saving the classifier to a file.
Uncomment the lines in evaluate_function_tagger to load the classifier that was trained in the previous step. Replace the line
Y_guess = [ 'punct' for _ in X ]
with something more useful.
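The train, save, load, and predict cycle described in the last two steps might look roughly like the sketch below. It is not the lecture code: the toy data, the file name function_tagger.pickle, the use of a pipeline, and the choice of Perceptron are all assumptions made for illustration, and the code already present in assignment1.py may organize saving and loading differently.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the feature dictionaries and function tags that the
# training part collects from the treebank.
X_train = [{'pos': 'PRON'}, {'pos': 'VERB'}, {'pos': 'PUNCT'}, {'pos': 'PRON'}]
Y_train = ['nsubj', 'root', 'punct', 'nsubj']

# DictVectorizer turns feature dictionaries into a sparse matrix;
# the classifier is then trained on that matrix.
classifier = make_pipeline(DictVectorizer(), Perceptron())
classifier.fit(X_train, Y_train)

# Save the trained pipeline (hypothetical file name, written to a temp dir).
model_path = os.path.join(tempfile.gettempdir(), 'function_tagger.pickle')
with open(model_path, 'wb') as f:
    pickle.dump(classifier, f)

# Later, in the evaluation step: load the pipeline and predict.
with open(model_path, 'rb') as f:
    loaded = pickle.load(f)

X = [{'pos': 'PRON'}, {'pos': 'PUNCT'}]
Y_guess = loaded.predict(X)
print(list(Y_guess))
```

Keeping the vectorizer and the classifier in one pipeline means that a single pickled object is enough to reproduce the feature encoding at prediction time.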
Run the program again.
Question: What is the error rate now?
Enable the function input all_funcs to evaluate_function_tagger and run the program again. We will now see statistics (counts, and evaluation in terms of precision and recall) for all the function tags seen in the treebank.
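As a reminder of what the per-tag statistics mean, precision and recall can be computed as in the sketch below. The function name and the toy labels are assumptions; the assignment's evaluation.py already prints these numbers for you.

```python
from collections import Counter


def precision_recall(y_true, y_guess):
    """Return {tag: (count, precision, recall)} for every tag seen."""
    correct, guessed, actual = Counter(), Counter(), Counter()
    for t, g in zip(y_true, y_guess):
        actual[t] += 1
        guessed[g] += 1
        if t == g:
            correct[t] += 1
    stats = {}
    for tag in sorted(set(actual) | set(guessed)):
        # Precision: how many of the edges we labeled with this tag were right.
        p = correct[tag] / guessed[tag] if guessed[tag] else 0.0
        # Recall: how many of the edges that truly have this tag we found.
        r = correct[tag] / actual[tag] if actual[tag] else 0.0
        stats[tag] = (actual[tag], p, r)
    return stats


Y_true = ['nsubj', 'nsubj', 'dobj', 'punct']
Y_guess = ['nsubj', 'dobj', 'dobj', 'punct']
stats = precision_recall(Y_true, Y_guess)
for tag, (count, p, r) in stats.items():
    print(tag, count, p, r)
```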
Question: What function tags can we find successfully? Is there any frequent tag that we can't find? (For your reference, here is a list of the functions used in the UD treebanks.)
Enable the input err_stats and re-run the program. Now, the 10 most frequent errors will be listed. (The left column shows the true function tag and the right column the predicted tag.)
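A list of frequent errors like this can be produced by counting (true tag, predicted tag) pairs for the misclassified edges. The sketch below uses toy labels and a hypothetical function name; it is meant to show the idea, not the code in evaluation.py.

```python
from collections import Counter


def most_common_errors(y_true, y_guess, n=10):
    """Count (true, predicted) pairs for wrong guesses, most frequent first."""
    errors = Counter((t, g) for t, g in zip(y_true, y_guess) if t != g)
    return errors.most_common(n)


Y_true = ['nsubj', 'nsubj', 'dobj', 'nmod', 'punct']
Y_guess = ['dobj', 'dobj', 'dobj', 'nsubj', 'punct']

top = most_common_errors(Y_true, Y_guess)
for (true_tag, guessed_tag), count in top:
    print(true_tag, guessed_tag, count)
```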
Question: Which are the most common errors made by your system?
You can also look directly at the sentences. Comment out the calls to train_function_tagger and evaluate_function_tagger, and uncomment the call to analysis. Insert code in analysis to load the classifier and use it, as you previously did in the evaluation part.
Open the file feature_extraction.py and look at the function extract_features. This function considers a token and extracts a number of features that should hopefully be useful when we classify the function of the edge going into that token. The features are stored in the dictionary x.
Question: Which features does the classifier currently use? Can you explain the results you saw previously?
Based on the problems you saw, and your own linguistic intuitions, add more features to x. (You need to add at least six features to pass the assignment.)
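To illustrate the kind of features one might add, here is a sketch that builds a feature dictionary from a token, its head, and its left neighbor. The token representation (a list of dictionaries with form, lemma, pos, and head keys), the function name, and the feature names are all hypothetical; the real extract_features in feature_extraction.py has its own signature and token objects, so treat this only as a source of ideas.

```python
def extract_features_sketch(sentence, i):
    """Build a feature dictionary for the edge going into token i."""
    token = sentence[i]
    x = {}
    x['pos'] = token['pos']
    x['form'] = token['form'].lower()

    head = token['head']  # 1-based index of the head, 0 for the root
    if head == 0:
        x['head_pos'] = 'ROOT'
        x['position'] = 'root'
    else:
        parent = sentence[head - 1]
        x['head_pos'] = parent['pos']
        x['head_lemma'] = parent['lemma']
        # Does the token come before or after its head in the sentence?
        x['position'] = 'before_head' if i + 1 < head else 'after_head'

    # Part-of-speech tag of the token immediately to the left, if any.
    if i > 0:
        x['prev_pos'] = sentence[i - 1]['pos']
    return x


# Made-up example sentence: "She lives here".
sentence = [
    {'form': 'She', 'lemma': 'she', 'pos': 'PRON', 'head': 2},
    {'form': 'lives', 'lemma': 'live', 'pos': 'VERB', 'head': 0},
    {'form': 'here', 'lemma': 'here', 'pos': 'ADV', 'head': 2},
]
features = [extract_features_sketch(sentence, i) for i in range(len(sentence))]
print(features[0])
```

String-valued features like these are expanded into one binary feature per distinct value when the dictionaries are vectorized, so adding a feature is as simple as adding a key to x.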
Hint: In the feature extraction function, there is already some code that locates some tokens around the current token. Here is a figure exemplifying the meaning of those neighbor tokens:
After you have added a feature, rerun the evaluation and observe its effect on the overall error rate and the individual errors.
Question: What is the effect of each feature? What result do you get in the end?
After you have found a set of useful features, evaluate a number of different machine learning algorithms and see which one gives you the highest result. Here are some possible choices:
sklearn.linear_model.Perceptron
sklearn.naive_bayes.MultinomialNB
sklearn.linear_model.LogisticRegression
sklearn.svm.LinearSVC
Question: Which classifier gives you the best result?
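One way to organize such a comparison is to loop over the candidate classifiers and record the error rate of each on the same development data. The sketch below uses tiny made-up training and development sets, so the numbers it prints say nothing about the real task; in the assignment you would plug each classifier into train_function_tagger instead.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for the treebank examples (illustrative only).
X_train = [{'pos': 'PRON', 'head_pos': 'VERB'},
           {'pos': 'VERB', 'head_pos': 'ROOT'},
           {'pos': 'PUNCT', 'head_pos': 'VERB'},
           {'pos': 'NOUN', 'head_pos': 'VERB'},
           {'pos': 'PRON', 'head_pos': 'VERB'},
           {'pos': 'PUNCT', 'head_pos': 'VERB'}]
Y_train = ['nsubj', 'root', 'punct', 'dobj', 'nsubj', 'punct']
X_dev = [{'pos': 'PRON', 'head_pos': 'VERB'},
         {'pos': 'PUNCT', 'head_pos': 'VERB'}]
Y_dev = ['nsubj', 'punct']

candidates = {
    'Perceptron': Perceptron(),
    'MultinomialNB': MultinomialNB(),
    'LogisticRegression': LogisticRegression(),
    'LinearSVC': LinearSVC(),
}

results = {}
for name, clf in candidates.items():
    # Same feature encoding for every classifier, so results are comparable.
    model = make_pipeline(DictVectorizer(), clf)
    model.fit(X_train, Y_train)
    Y_guess = model.predict(X_dev)
    errors = sum(1 for t, g in zip(Y_dev, Y_guess) if t != g)
    results[name] = errors / len(Y_dev)

for name, err in sorted(results.items(), key=lambda item: item[1]):
    print(name, err)
```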
Optional task: Explore feature selection algorithms such as SelectKBest. How small can you make the feature set without affecting the quality of the classifier?
Optional task: You may also try out other types of classifiers from scikit-learn's list. Note that some of the algorithms may be quite expensive in terms of time or memory.
Optional task: You can try to tune the parameters of the learning algorithms to improve the performance:
n_iter (for example in Perceptron; called max_iter in newer scikit-learn releases) controls the number of iterations.
alpha (in MultinomialNB) defines the Laplace smoothing constant.
After you have optimized your feature set and selected a learning algorithm, remove the input number_of_sentences in the call to the training function. This means that we are using the whole training treebank. Retrain the classifier. (This will take more time than previously. Until now, we used a small subset in order to make the training process faster.)
Question: What error rate do you get now?
Finally, change the function evaluate_function_tagger so that it uses UD's test corpus instead of the development corpus: replace dev with test here:
with open('{0}/UD_{1}/{2}-ud-dev.conllu'.format(UD_HOME, languages[lang], lang)) as f:
Rerun the evaluation one last time.
Question: What is your final error rate?