Assignment 2: Classifier implementation

In this assignment, you will implement two algorithms to train classifiers: support vector machine and logistic regression. The main objectives of this assignment are first that you should get some experience of the practical considerations of implementing machine learning algorithms, and secondly that you should get a taste of how a typical machine learning paper looks.

Write Python code to implement the algorithms. Please also try to answer the exercise question. Send the code and the answer to the exercise by email to the lab instructor (luis.nieto.pina -at- svenska.gu.se). This assignment is solved individually.

Deadline: September 24

Preliminaries

Repeat the material from the lecture about linear classifiers (in particular the perceptron code near the end), and the the lecture about SVM and LR.

Have a look at the paper Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, Shalev-Shwartz et al. (2010). You're free to skip the mathematical details, but please read the introduction and make sure you understand the pseudocode in Figure 1, which is what you will implement in this assignment. Here are some clarifications about the pseudocode:

The authors use angle brackets for dot products: <w, x> means w·x
S is the training set and |S| is the size of the training set.
T is the number of steps in the algorithm. (This is a bit different from our perceptron, where we specified the number of times to process the whole training set.)
λ is a regularization parameter that controls the tradeoff between keeping the classifier simple (i.e. the weights small) and fitting the training set. See slide 17 in the lecture.
The optional line is there for theoretical reasons and can be ignored.

In addition to the paper, I have written a separate document that explains the algorithm in more detail, and uses a notation more similar to what we have seen previously.

Exercise question

We are developing a simple system to predict the weather based on our location and the season. Here is a training set.

City:      Month:    | Weather:
--------------------------------
Gothenburg July      | rain
Gothenburg December  | rain
Paris      July      | sun
Paris      December  | rain

We build the system by training a perceptron from scikit-learn on this training set. We then test the classifier on the same set. As expected, the classifier correctly classifies all the instances in this set; after all, it's the same set we used to train it.

Now, we repeat this experiment using another training set. We train the classifier using the following examples and evaluate it using the same examples.

City:      Month:    | Weather:
--------------------------------
Sydney     July      | rain
Sydney     December  | sun
Paris      July      | sun
Paris      December  | rain

When we use this set, for some strange reason our classifier performs poorly! We can't improve it by switching to an SVM. Do you have an idea what's going on?

Here is the Python code if you want to try the experiment yourself.

Introduction

Download and unpack this zip file. The package contains the perceptron code presented during the second lecture (lecture2.py). It also contains a program (doc_classification.py) that trains and evaluates the perceptron implementation using the review dataset that we've used previously.

Run the experiment code and make sure it works on your machine.

Your tasks

Implementing the SVM

Implement the Pegasos algorithm for training SVMs by converting the pseudocode in Figure 1 into proper Python. Test your implementation by using your own classifier instead of the perceptron in doc_clasification.py. It's probably easiest if you start from the code we saw during the lecture, in particular the class called NewPerceptron2.

You can try to find good values for the parameters T and λ. In my experiments, I set T to 10 times the size of the training set (roughly equivalent to 10 perceptron iterations) and λ to 0.01, which gave a classification accuracy that was slightly higher than that of the perceptron.

Logistic regression

As we saw in the lecture, the logistic regression classifier can be stated similarly to the SVM. The difference is just in which loss function is used: the SVM uses the hinge loss and LR the log loss.

Take a look at the table on page 15 in the Pegasos paper. The first line shows the hinge loss and the corresponding update operation that is carried out in each step in the algorithm.

Make a new training algorithm that uses the log loss (the second line) instead of the hinge loss. See how well it works compared to your previous classifier.

Optional tasks

Using sparse vectors

Our implementation is slow and wasteful in terms of memory if we use a lot of different features (which is common in NLP). With sparse vectors, this can be improved. Follow the example of SparsePerceptron in the lecture code and implement sparse versions of the SVM and LR algorithms.

In addition, remove the SelectKBest step from the pipeline and check the difference.

Hint. The helper function sparse_dense_dot(x, w) is used to carry out a dot product between a sparse vector x and a dense vector w, while add_sparse_to_dense(x, w, xw) is the counterpart of the operation w += xw * x.

Speeding up the scaling operation

At some places in the pseudocode, the whole weight vector is scaled by some factor. This can be a slow operation if our training set consists of high-dimensional, sparse vectors. Read section 2.4 about a trick for speeding up the scaling operation. Implement the learning algorithm so that it uses this trick.