Assignment 2: Classifier evaluation

In this assignment, you will evaluate the classifier developed in Assignment 1.

Write Python code to solve the tasks described below. Write a very short report including the plots and your answers to the questions. Send the code and the report by email to Mehdi (mehdi.ghanimifard@gu.se).

NB: submit your answers individually. You are allowed to discuss with your fellow students, but not write code together.

Deadline: March 3

Preliminaries

Check your notes and previous code for computing a confidence interval. You can also look at these slides; slides 29–30 in particular are useful for this assignment.

Here's the instruction lecture for this assignment.

Your tasks

Estimating the accuracy

Train the classifier from Assignment 1 using 80% of the data for training. Estimate the accuracy of the classifier on the remaining 20%, as well as the precision and recall for finding the positive class.
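As a starting point, here is a minimal sketch of the evaluation. The names all_docs (a list of (document, label) pairs), train_classifier, and classify are placeholders for whatever you built in Assignment 1, and 'pos' stands for the positive class label:

import random

# shuffle so that the 80/20 split is random
random.shuffle(all_docs)
split_point = int(0.8 * len(all_docs))
train_docs = all_docs[:split_point]
eval_docs = all_docs[split_point:]

classifier = train_classifier(train_docs)

# tally correct predictions, true positives, false positives, false negatives
n_correct = 0
tp = fp = fn = 0
for doc, gold in eval_docs:
    guess = classify(classifier, doc)
    if guess == gold:
        n_correct += 1
    if guess == 'pos' and gold == 'pos':
        tp += 1
    elif guess == 'pos':
        fp += 1
    elif gold == 'pos':
        fn += 1

accuracy = n_correct / len(eval_docs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)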

Learning curve

As usual, set aside 20% of your data as the evaluation set. Use the remaining data to compute a learning curve: select training sets of increasing sizes, e.g. 10%, 20%, etc. For each training set size, compute the accuracy on the evaluation set. Plot the learning curve, for instance like this:

from matplotlib import pyplot as plt
# sizes: the training set sizes; accuracies: the accuracy measured for each size
plt.plot(sizes, accuracies, 'ro')
plt.show()
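The loop that produces sizes and accuracies could look roughly like this, where compute_accuracy is a hypothetical helper that returns the fraction of correct predictions on eval_docs, along the lines of the sketch above:

sizes = []
accuracies = []
for pct in range(10, 101, 10):
    # train on the first pct% of the training data
    n_train = int(pct / 100 * len(train_docs))
    classifier = train_classifier(train_docs[:n_train])
    sizes.append(n_train)
    accuracies.append(compute_accuracy(classifier, eval_docs))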

Computing a confidence interval for the accuracy

Compute a 95% confidence interval for the accuracy using the method you used in Exercise 2.
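If the method from Exercise 2 was the normal approximation to the binomial distribution, the computation could look like this; treat it as a sketch, and use whichever method you actually applied in the exercise:

import math

z = 1.96  # standard normal quantile for a 95% interval
stderr = math.sqrt(accuracy * (1 - accuracy) / len(eval_docs))
interval = (accuracy - z * stderr, accuracy + z * stderr)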

Cross-validation

Since our estimate of the accuracy is based on a fairly small test set, your confidence interval was probably quite wide. We will now use a trick to get a more reliable estimate and a tighter interval.

In cross-validation, we divide the data into N parts (folds) of equal size. We then carry out N evaluations: each fold serves once as the test set, while the remaining folds form the training set. Finally, we combine the results of the N evaluations. This trick gives us results for the whole dataset, not just a small test set.

Here is a code stub that shows the idea:

for fold_nbr in range(N):
    # boundaries of the current fold
    split_point_1 = int(fold_nbr / N * len(all_docs))
    split_point_2 = int((fold_nbr + 1) / N * len(all_docs))
    # all documents outside the fold form the training set
    train_docs = all_docs[:split_point_1] + all_docs[split_point_2:]
    # the fold itself is the evaluation set
    eval_docs = all_docs[split_point_1:split_point_2]
    ...
    # (train a classifier on train_docs)
    # (evaluate the classifier on eval_docs)
...
# (combine the results)

Implement the cross-validation method. Then estimate the accuracy and compute a new confidence interval. A typical value of N would be between 4 and 10.
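One simple way to combine the results is to pool the number of correct predictions over all folds, so that the accuracy estimate and the interval are based on the full dataset. A sketch, again using the hypothetical train_classifier and classify helpers from above:

import math

n_correct = 0
for fold_nbr in range(N):
    split_point_1 = int(fold_nbr / N * len(all_docs))
    split_point_2 = int((fold_nbr + 1) / N * len(all_docs))
    train_docs = all_docs[:split_point_1] + all_docs[split_point_2:]
    eval_docs = all_docs[split_point_1:split_point_2]
    classifier = train_classifier(train_docs)
    n_correct += sum(1 for doc, gold in eval_docs
                     if classify(classifier, doc) == gold)

accuracy = n_correct / len(all_docs)
stderr = math.sqrt(accuracy * (1 - accuracy) / len(all_docs))
interval = (accuracy - 1.96 * stderr, accuracy + 1.96 * stderr)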

Comparing the accuracy to a given value

Is your classifier's accuracy significantly different from 0.80 with a p-value of at most 0.05? Use the exact binomial test (scipy.stats.binom_test) to find out.
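For example, with n_correct being the number of correctly classified test documents and n_total the size of the test set (note that recent SciPy versions have renamed the function to scipy.stats.binomtest, which returns a result object with a pvalue attribute):

from scipy.stats import binom_test

p_value = binom_test(n_correct, n_total, p=0.80)
if p_value <= 0.05:
    print('the accuracy is significantly different from 0.80')
else:
    print('no significant difference from 0.80')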

Comparing two classifiers

Train and evaluate two different classifiers, e.g. Naive Bayes and perceptron, or Naive Bayes with two different smoothing parameters. (Or use the functions in this file to train a classifier from scikit-learn.)

Carry out a McNemar test to compare the two classifiers (on the 20% test set or with cross-validation). Is the difference between them statistically significant with a p-value of at most 0.05?
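Here is a sketch of an exact McNemar test, built on the binomial test from the previous task. Only the documents on which the two classifiers disagree matter; under the null hypothesis, each classifier "wins" such a disagreement with probability 0.5. The names classifier_a and classifier_b are placeholders for your two trained classifiers:

from scipy.stats import binom_test

# count the documents where exactly one classifier is correct
a_only = b_only = 0
for doc, gold in eval_docs:
    a_correct = classify(classifier_a, doc) == gold
    b_correct = classify(classifier_b, doc) == gold
    if a_correct and not b_correct:
        a_only += 1
    elif b_correct and not a_correct:
        b_only += 1

# exact McNemar test: a binomial test on the disagreements
p_value = binom_test(a_only, a_only + b_only, p=0.5)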

Optional tasks

Domain sensitivity (optional)

Select two topic categories from the set of reviews, e.g. camera and book reviews. Create training and test sets for each of the topics. How much does the accuracy drop when you apply a camera review classifier to a test set of book reviews?

The relation between precision and recall (optional)

Your current classifier assigns the positive class if log P(positive) > log P(negative), that is, if log P(positive) - log P(negative) > 0. If you raise the threshold from 0 to some positive value T, you will increase the precision and lower the recall, since the classifier must be more confident before it assigns the positive class. Conversely, if T is negative, precision will decrease and recall will increase.

Compute the precision and recall for different values of T. Make a precision/recall curve by plotting the measured values, e.g. the precision on the X axis and the recall on the Y axis.
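A sketch, assuming a hypothetical helper score(classifier, doc) that returns log P(positive) - log P(negative) for a document; with extreme thresholds, beware of division by zero when no document is assigned the positive class:

precisions = []
recalls = []
for T in [-5, -2, -1, -0.5, 0, 0.5, 1, 2, 5]:
    tp = fp = fn = 0
    for doc, gold in eval_docs:
        guess_positive = score(classifier, doc) > T
        if guess_positive and gold == 'pos':
            tp += 1
        elif guess_positive:
            fp += 1
        elif gold == 'pos':
            fn += 1
    precisions.append(tp / (tp + fp))
    recalls.append(tp / (tp + fn))

from matplotlib import pyplot as plt
plt.plot(precisions, recalls, 'bo-')
plt.xlabel('precision')
plt.ylabel('recall')
plt.show()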