Assignment 1: Classification with Naive Bayes

In this assignment, you will implement the Naive Bayes classification method and use it for sentiment classification of customer reviews.

Write Python code to solve the tasks described below. Write answers to the discussion points (as a document or as comments in your code). Send the code and the answers by email to Mehdi (mehdi.ghanimifard@gu.se).

NB: submit your answers individually. You are allowed to discuss with your fellow students, but not write code together.

Deadline: February 25

Preliminaries

Read up on classification and the Naive Bayes classifier. For instance, review the slides from the lecture on classification. In particular, the slides about estimation (44–49) will be useful, because they describe what you will do in this assignment.

If you wish, you may also have a look at some academic papers, for instance the ones cited later in this assignment (Blitzer et al., 2007; Pang et al., 2002).

Reading the review data

Download and unpack this zip file. The package contains a corpus (all_sentiment_shuffled.txt) of customer reviews from six of the review topics used in the paper by Blitzer et al. (2007). The data has been formatted so that there is one review per line, and the texts have been tokenized and normalized. Here is an example of a line:

music neg 544.txt i was misled and thought i was buying the entire cd and it contains one song

A line in the file is organized into four whitespace-separated columns: the topic category, the sentiment label (pos or neg), a document identifier, and the tokenized text of the review.

Here is some Python code to read the entire corpus. This code is also available in the file assignment1.py in the package you unpacked.

def read_corpus(corpus_file):
    out = []
    with open(corpus_file) as f:
        for line in f:
            tokens = line.strip().split()
            # tokens[0] is the topic, tokens[1] the sentiment label,
            # tokens[2] the document id; the rest is the review text
            out.append((tokens[1], tokens[3:]))
    return out

We first take away the topic label and the document identifier; you don't need the topic unless you solve the last optional task. Then we split the data into a training part and an evaluation part. For instance, we may use 80% for training and the remainder for evaluation.

all_docs = read_corpus('all_sentiment_shuffled.txt')
split_point = int(0.8*len(all_docs))
train_docs = all_docs[:split_point]
eval_docs = all_docs[split_point:]

Training the Naive Bayes classifier and classifying new documents

Write a Python function that uses a training set of documents to estimate the probabilities in the Naive Bayes model. Return some data structure containing the probabilities. The input parameter of this function should be a list of documents with sentiment labels, i.e. a list of pairs like train_docs above. It could look something like this:

def train_nb(training_documents):
    ...
    (return the data you need to classify new instances)

Then write a Python function that classifies a new document. The inputs are 1) the probabilities returned by the first function; 2) the document to classify, which is a list of tokens.

def classify_nb(classifier_data, document):
    ...
    (return the guess of the classifier)

Hint 1. Here is a toy dataset similar to the example on slides 44–49. This might be useful if you want to check that you estimate the probabilities correctly.

Hint 2. In this assignment, it is acceptable if you assume that we will always use the pos and neg categories. However, it is of course nicer if the possible categories are not hard-coded, especially if you solve the last optional task.
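
For instance, rather than hard-coding the categories, you could collect the set of labels from the training data. A one-line sketch, with training_documents as in train_nb above:

labels = set(label for label, document in training_documents)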

Hint 3. Some sort of smoothing will probably improve your results.
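
To illustrate, here is a minimal sketch of a Laplace-smoothed (add-one) estimate of P(word | class); word_counts, total_count and vocabulary are hypothetical names for counts you would collect from the training documents of one class (Python 3 division assumed):

def smoothed_prob(word, word_counts, total_count, vocabulary):
    # Add-one smoothing: unseen words get a small non-zero probability
    # instead of zero.
    return (word_counts.get(word, 0) + 1) / (total_count + len(vocabulary))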

Hint 4. If you multiply many small probabilities you may run into problems with numeric precision: the probability becomes zero. To handle this problem, I recommend that you compute the logarithms of the probabilities instead of the probabilities. To compute the logarithm in Python, use the function log in the math library. Note that log(P1 * P2) = log(P1) + log(P2), so if you use log probabilities you will sum them instead of multiplying.
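
As a small illustration of the underflow problem and the log trick:

from math import log

probs = [0.001] * 200                    # many small word probabilities

product = 1.0
for p in probs:
    product *= p                         # underflows: product ends up as 0.0

log_score = sum(log(p) for p in probs)   # an ordinary negative number, about -1381.6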

Hint 5. If you prefer an object-oriented coding style, then return an object with a method classify. This method can then be used instead of classify_nb.

In this task, you are supposed to implement all parts of the classification method and you are not allowed to use a Naive Bayes implementation from a library. (This can be done in the fourth optional task instead.) You are allowed to use data structures such as FreqDist from NLTK if you prefer, but this isn't really necessary.

Evaluating the classifier

We will evaluate the classifier carefully in the second assignment. In this assignment, we just compute the accuracy, i.e. the number of correctly classified documents divided by the total number of documents. Write a function that classifies each document in the test set, compares each label to the gold-standard label, and returns the accuracy.

def evaluate_nb(classifier_data, evaluation_documents):
    ...
    (return the accuracy)

What accuracy do you get when evaluating the classifier on the test set?

Error analysis

Find a few misclassified documents and comment on why you think they were hard to classify.

The following tasks are all optional: they are not required for a pass or a VG

(Optional) Print the informative features

Print the features that are most indicative of the positive and the negative categories. For example, lovely and recommend might be strongly positive features, while horrible and waste might be strongly negative.

def print_top_nb(classifier_data, n_highest):
    ...

Hint. When I solved this task, I had to set the Laplace smoothing parameter to quite a high value to get interpretable results.

(Optional) Implement the perceptron learning algorithm

Implement the perceptron algorithm and compare its performance to that of Naive Bayes. You can use the code from the lecture slides or get it from somewhere else, e.g. Wikipedia.

(Optional) Improve the features

Try to think of ways to improve the accuracy by changing the features used for classification.

One possible improvement could be to do something about negation, since negation is clearly not well handled by the bag-of-words approach we are using. Pang et al. (2002) used this trick: they added a special negation marker to all tokens appearing in a sentence after a negation word (not, n't, etc.).
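
For illustration, a rough sketch of this idea could look as follows; the NOT_ prefix and the exact sets of negation words and scope-ending punctuation are my own assumptions, not necessarily what Pang et al. used:

NEGATION_WORDS = {'not', 'no', 'never', "n't"}
END_OF_SCOPE = {'.', ',', '!', '?', ';', ':'}

def mark_negation(tokens):
    # Prefix every token that follows a negation word with NOT_,
    # until a punctuation token ends the negation scope.
    out = []
    negated = False
    for token in tokens:
        if token in NEGATION_WORDS:
            negated = True
            out.append(token)
        elif token in END_OF_SCOPE:
            negated = False
            out.append(token)
        elif negated:
            out.append('NOT_' + token)
        else:
            out.append(token)
    return out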

Many people use bigram features in addition to single-word features. Can you improve your classifier by using bigrams?
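
For example, here is a minimal sketch that adds bigram features to a document, representing each bigram as a single string joined by an underscore (my own representation choice):

def add_bigrams(tokens):
    bigrams = [tokens[i] + '_' + tokens[i + 1] for i in range(len(tokens) - 1)]
    return tokens + bigrams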

(Optional) Train a classifier using scikit-learn or NLTK

The classifiers in NLTK do not work directly with a bag-of-words list of tokens; instead they expect attribute-value maps (feature dicts). We define a little helper function that converts the list of words to a dict where each word is mapped to True.

def bow_to_avmap(doc):
    return dict((w, True) for w in doc)

Now we are ready to train a Naive Bayes classifier using NLTK:

import nltk

def train_nltk_nb(ldocs):
    # NLTK expects a list of (featureset, label) pairs
    nltk_training_set = [(bow_to_avmap(doc), l) for l, doc in ldocs]
    return nltk.NaiveBayesClassifier.train(nltk_training_set)

def classify_nltk(doc, model):
    return model.classify(bow_to_avmap(doc))
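
For example, assuming the train_docs/eval_docs split from above, the NLTK model could be trained and given a quick accuracy check like this:

nltk_model = train_nltk_nb(train_docs)
correct = sum(1 for label, doc in eval_docs
              if classify_nltk(doc, nltk_model) == label)
print(correct / len(eval_docs))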

Here is an example of how you can train a Naive Bayes or perceptron classifier in scikit-learn:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import MultinomialNB

def train_sk(ldocs, use_perceptron=False):
    docs = [bow_to_avmap(doc) for _, doc in ldocs]
    lbls = [l for l, _ in ldocs]
    vec = DictVectorizer()                  # maps feature dicts to sparse vectors
    encoded_docs = vec.fit_transform(docs)
    if use_perceptron:
        classifier = Perceptron()
    else:
        classifier = MultinomialNB()
    classifier.fit(encoded_docs, lbls)
    return vec, classifier

def classify_sk(doc, v_c):
    vec, classifier = v_c
    encoded_doc = vec.transform(bow_to_avmap(doc))
    # predict returns an array with a single element; return the label itself
    return classifier.predict(encoded_doc)[0]
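
Usage is analogous; for instance, again assuming the train_docs/eval_docs split from above:

sk_model = train_sk(train_docs)
correct = sum(1 for label, doc in eval_docs
              if classify_sk(doc, sk_model) == label)
print(correct / len(eval_docs))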

Here is a list of classification methods available in scikit-learn.

(Optional) Implement a six-category classifier

Implement a classifier that guesses the topic category label instead of the sentiment. This is fairly straightforward for Naive Bayes but may require some thinking for the perceptron. Discuss with the instructor if you need a hint. (You'll obviously also need to change read_corpus a little bit.)
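
As a hint for the data reading part only, here is a possible adaptation of read_corpus that keeps the topic label instead of the sentiment label; the function name read_topic_corpus is just a suggestion, and the same file format as above is assumed:

def read_topic_corpus(corpus_file):
    out = []
    with open(corpus_file) as f:
        for line in f:
            tokens = line.strip().split()
            # tokens[0] is the topic label; keep it instead of the sentiment
            out.append((tokens[0], tokens[3:]))
    return out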