VG assignment 1: Topic modeling with LDA

In this assignment, you will use the Latent Dirichlet Allocation (LDA) method for topic modeling. We will use the gensim library by Radim Řehůřek.

Write Python code to solve the tasks described below, and write a report that discusses your results and the questions in the assignment. Send the code and the answers by email to Mehdi (mehdi.ghanimifard@gu.se).

NB: submit your answers individually. You are allowed to discuss with your fellow students, but not write code together.

Deadline: March 28

Preliminaries

Check the lecture about topic modeling with LDA (slides 36–47) and read the popular-scientific article by Blei.

Make sure that gensim works on your machine. If you are in the lab, then try to type import gensim and see that you don't get any error messages. On your own machine, follow the instructions on this page.

Optionally, in particular if you want to implement LDA on your own, you can read more about the mathematical details of Bayesian estimation and Gibbs sampling. Resnik and Hardisty wrote a nice introduction; after reading that, you will be ready to understand LDA with Gibbs sampling in more detail. Heinrich wrote a detailed explanation including pseudocode.

Understanding the LDA probability model

Make sure that you understand the generative probability model in LDA: that is, how we think that the process works that generated the documents. Include a short overview of LDA in your report. In addition to a textual description (written in your own words), you can optionally include equations, pseudocode, or a plate diagram.
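The generative story can be made concrete in a few lines of Python. The following is a toy sketch (the vocabulary, sizes, and hyperparameter values are made up for illustration, not part of the assignment): each topic is a Dirichlet-distributed distribution over words, each document draws its own distribution over topics, and each word is generated by first sampling a topic and then sampling a word from that topic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and model sizes (illustrative values only).
vocab = ['movie', 'film', 'camera', 'lens', 'book', 'read']
n_topics, n_docs, doc_len = 2, 3, 8
alpha, beta = 0.1, 0.01

# Topic-word distributions: one Dirichlet draw per topic.
phi = rng.dirichlet([beta] * len(vocab), size=n_topics)

documents = []
for _ in range(n_docs):
    # Document-topic distribution: one Dirichlet draw per document.
    theta = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)      # sample a topic for this token
        w = rng.choice(len(vocab), p=phi[z])   # sample a word from that topic
        words.append(vocab[w])
    documents.append(words)

print(documents)
```

With small alpha and beta, each document tends to stick to few topics and each topic to few words, which is what makes the inferred topics interpretable.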

Preprocessing a corpus

We start by preparing our corpus. Since we are trying to analyze the documents in terms of topics, it can be useful to preprocess the corpus by removing words that probably are independent of topics, such as common function words ("stop words") and punctuation. It is also advisable to apply a lemmatizer or stemmer.
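As a minimal illustration of this kind of preprocessing, here is a sketch with a tiny hand-written stop-word list; for real use you would take a fuller stop list and a proper lemmatizer or stemmer (e.g. from NLTK):

```python
import re

# A tiny stop-word list for illustration only; use a fuller list in practice.
STOP_WORDS = {'the', 'a', 'an', 'is', 'this', 'with', 'and', 'of', 'it'}

def preprocess(line):
    # Lowercase, keep alphabetic tokens only (drops punctuation), remove stop words.
    tokens = re.findall(r'[a-z]+', line.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess('This is a wonderful action-packed movie!'))
# → ['wonderful', 'action', 'packed', 'movie']
```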

You are free to choose any corpus, but if you select your own corpus you will have to carry out the preprocessing on your own. Your topic models will generally be of higher quality if the document collection is fairly large, though a larger collection will of course also increase processing time.

Here are a few corpora that you can try, which have already been preprocessed.

If you select a large corpus, you can still use a small corpus while developing and then use the large corpus for the end result. On a Linux or Mac machine, here is how you can make a smaller file containing the first 5,000 lines of the corpus:

head -n 5000 YOUR_CORPUS_NAME > SMALL_CORPUS_NAME

Building a topic model with gensim

We start the code by importing the gensim library and telling it to print information messages while training.

import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

A corpus in gensim is an object that can give us one document at a time. (See here for details about how gensim represents corpora.) Here is such a class that is suitable for the simple format of the corpora listed above. It will read one line at a time and then just return a document by splitting the line.

class LineCorpus(gensim.corpora.textcorpus.TextCorpus):
    # A corpus that reads a text file with one document per line.
    def get_texts(self):
        with open(self.input) as f:
            for line in f:
                yield line.split()

Now, we have what we need to train the topic model:

corpus = LineCorpus(NAME_OF_YOUR_CORPUS)

model = gensim.models.LdaModel(corpus,
                               id2word=corpus.dictionary,
                               alpha='auto', 
                               num_topics=10, 
                               passes=5)

The inputs to the LDA training method are:

- corpus: the training corpus;
- id2word: a dictionary that maps word identifiers to word strings;
- alpha: the Dirichlet prior on the document–topic distributions ('auto' lets gensim learn this prior from the data);
- num_topics: the number of topics to extract;
- passes: the number of passes over the corpus during training.

The full documentation of the LdaModel is here.

Finally, you can save the topic model to a file, so that it can be reused later.

model.save(YOUR_RESULT_FILE)

Run the program. The runtime will depend on the size of the corpus and the number of passes. With a small corpus such as the reviews, and with 5 passes, it should take a few minutes. With a larger corpus, it may take one or a few hours.

Inspecting the topics

After you have trained the LDA model and saved it to a file, we can see what topics it found. Recall that in LDA, a topic corresponds to a probability distribution over words.

Make a new Python program that's separate from your previous training program (or a new function in your old program). Add the following line to load the file that you saved previously.

model = gensim.models.LdaModel.load(YOUR_RESULT_FILE)

Now, write code to print the words in each of the topics. Here are some things you might find useful: the model's print_topics method returns each topic as a string of its highest-weighted words, and show_topic returns the (word, probability) pairs for a single topic.

Evaluate the topics qualitatively. Do they seem meaningful and coherent? Do you believe that there are important topics in the corpus that the model hasn't found?

In particular, if you are working with the review corpus, it can be interesting to see to what extent the automatically extracted topics correspond to the review types: books, cameras, DVDs, health products, music, and software.

Analyzing new documents

Assume we have trained a topic model on the review corpus. Let's see what the model thinks of a new document, such as "This is a wonderful action-packed movie with Steven Seagal. Five stars!" We lemmatize the document and remove the stop words; we can then let the LDA model analyze the document in terms of its topic distribution. Here is the code:

doc = ['wonderful', 'action', 'packed', 'movie', 'steven', 'seagal', 'five', 'star']
bow = model.id2word.doc2bow(doc)
topic_analysis = model[bow]

Note that we must first convert from a list of words to a bag-of-words representation: a list of word identifiers with frequencies.

The result topic_analysis is a list of topic numbers with their corresponding probabilities, for instance [(0, 0.194), (1, 0.726), ... ]. Here, topic 1 might be one that assigns high probabilities to words such as movie, film, watch, and scene.
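To pick out the most probable topic from such an analysis, a one-liner suffices (using the made-up probabilities above):

```python
# Example analysis as returned by model[bow]; the numbers are made up.
topic_analysis = [(0, 0.194), (1, 0.726), (2, 0.080)]

# The pair with the highest probability is the document's dominant topic.
best_topic, best_prob = max(topic_analysis, key=lambda pair: pair[1])
print(best_topic, best_prob)  # → 1 0.726
```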

Pick (or write) a number of test documents and print the analysis made by the model in terms of topics, as above. Do the results make sense?

Discussion about evaluation

You have evaluated the results qualitatively. Can you think of an evaluation that would be more objective and that could be computed automatically? (You can get some inspiration from this paper and this one, but you can also try to think of simpler methods.)

Optional assignment: implementing your own Gibbs sampler

Following the pseudocode in Heinrich's introduction (Figure 8 on page 20), write your own implementation of the LDA model. Estimate it on your corpus and see if it comes up with anything similar to gensim. (Note that there might be a difference, since the Gibbs sampling algorithm is randomized.)
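To give an idea of the overall shape of such an implementation, here is a minimal collapsed Gibbs sampler (an unoptimized sketch that only loosely follows Heinrich's pseudocode; treat it as a starting point, not a reference implementation):

```python
import numpy as np

def lda_gibbs(docs, vocab_size, n_topics, alpha=0.1, beta=0.01,
              n_iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns the topic-word count matrix; normalizing its rows gives the
    estimated topic-word distributions."""
    rng = np.random.default_rng(seed)
    # Count matrices: document-topic, topic-word, and per-topic totals.
    ndk = np.zeros((len(docs), n_topics))
    nkw = np.zeros((n_topics, vocab_size))
    nk = np.zeros(n_topics)
    # Random initial topic assignment for every token.
    z = [[rng.integers(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's current assignment from the counts.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Collapsed full conditional over topics (up to normalization).
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw

# Tiny demo: two obvious "topics" (word ids 0-2 vs 3-5).
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
counts = lda_gibbs(docs, vocab_size=6, n_topics=2)
print(counts / counts.sum(axis=1, keepdims=True))  # rows ≈ topic-word distributions
```

Because the sampler is randomized, runs with different seeds can find the topics in a different order or split them differently, which is exactly the caveat mentioned above when comparing against gensim.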