Exercise 2: Estimation

In this exercise, you will estimate occurrence probabilities of words and investigate the reliability of the estimates.

Solve the tasks described below. Write a short report containing your answers, including the plots. Send the report and your Python code by email to the course instructor (richard.johansson -at- gu.se).

NB: submit your answers individually. You may discuss the tasks with your fellow students, but you may not write code together.

Deadline: February 8

Preliminaries

Create a Python file starting with the following imports:

from matplotlib import pyplot as plt
from nltk.corpus import brown
import numpy as np
import scipy.stats

Task 1: Understanding maximum likelihood estimation

Assume that we observe the word the 5 times in a 20-word corpus. You probably remember from Lecture 4 (and the NLP course) how to compute the maximum likelihood estimate of the occurrence probability of the. What is your estimate?

Recall from Lecture 4 that the ML estimate of the occurrence probability p is justified as the value of p that makes the observed data most probable.

To illustrate this idea, we write a function L(p) that represents the probability of our data for a given occurrence probability p.

def L(p):
    corpus_size = 20   # total number of words in our corpus
    count_the = 5      # number of times we observed "the"
    rv = scipy.stats.binom(corpus_size, p)   # binomial: 20 trials, success probability p
    return rv.pmf(count_the)                 # probability of seeing exactly 5 successes
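
In formula form, rv.pmf(5) evaluates the binomial probability mass function, so the code computes

L(p) = C(20, 5) * p^5 * (1 - p)^15,

the probability of observing exactly 5 successes in 20 trials when each trial succeeds with probability p.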

Now, plot the likelihood function L(p) for several values of the probability p.

p_values = np.arange(0, 1, 0.025)     # makes an array [0, 0.025, 0.05, ...]
L_values = [ L(p) for p in p_values ] # compute L(p) for each p we try
plt.plot(p_values, L_values, 'ro')    # 'ro' plots red dots
plt.savefig('likelihood.pdf')         # or plt.show()

Does the plot make sense? In particular, for which value of p does the likelihood seem to be highest, and how does that compare to your estimate from the start of this task?

Task 2: Estimating a probability from a corpus

Write a function mle_unigram_probability(word, corpus) that computes an ML estimate of the probability of observing word by counting how many times it occurs in corpus.

Use your function to estimate the occurrence probabilities of the words in, big, and wampimuk. You can use the Brown corpus included in NLTK. Here is an example of how you can do that:

print(mle_unigram_probability('in', brown.words()))
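
If you are unsure where to start: since the ML estimate here is the relative frequency, a minimal sketch could look like this (treat it as a starting point, not the required solution):

def mle_unigram_probability(word, corpus):
    corpus = list(corpus)        # so len() and count() work on NLTK corpus views too
    count = corpus.count(word)   # number of occurrences of the word
    return count / len(corpus)   # relative frequency = ML estimate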

Task 3: Investigating the stability of the estimate

We will now investigate how much variability there is when you estimate an occurrence probability. First, let's write a function that selects a small part of the Brown corpus:

brown_corpus = list(brown.words())   # the full Brown corpus as a list of words

def brown_part(part_nbr, part_size):
    start = part_nbr*part_size       # first position of this part
    end = (part_nbr+1)*part_size     # first position after this part
    return brown_corpus[start:end]
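
For orientation: the Brown corpus contains roughly 1.16 million words, so even with part_size = 1000 the part numbers 0 through 999 used below all stay inside the corpus. For example, brown_part(3, 10) returns words 30 through 39.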

Write a function many_estimates(word, part_size) that picks 1000 different parts of the Brown corpus, each of size part_size. The function then estimates an occurrence probability from each of the parts, and finally plots a histogram of the 1000 estimates.
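
One possible skeleton, reusing mle_unigram_probability from Task 2 (details such as the number of bins are up to you):

def many_estimates(word, part_size):
    # one ML estimate per corpus part
    estimates = [mle_unigram_probability(word, brown_part(i, part_size))
                 for i in range(1000)]
    plt.hist(estimates, bins=30)           # 30 bins is an arbitrary choice
    plt.title('part size = %s' % part_size)
    plt.show()                             # or plt.savefig(...)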

Call your function many_estimates with the word in and the part sizes 100, 250, 500, and 1000. Inspect the histograms. How is the "width" of the histogram affected by the part size?

Task 4: Computing a confidence interval

Apply the method presented in Lecture 4 to compute a 95% confidence interval for the occurrence probability of in. Use the first 500 words of the Brown corpus (that is, call brown_part(0, 500)).
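
A sketch, assuming the Lecture 4 method is the standard normal-approximation (Wald) interval; if the lecture presented a different formula, use that instead:

part = brown_part(0, 500)
p_hat = mle_unigram_probability('in', part)
n = len(part)
z = scipy.stats.norm.ppf(0.975)          # about 1.96 for a 95% interval
se = (p_hat * (1 - p_hat) / n) ** 0.5    # estimated standard error of p_hat
print(p_hat - z * se, p_hat + z * se)    # lower and upper bounds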

Do you think the lower and upper bounds of the interval are reasonable if you compare them to the corresponding histogram (part size 500) from Task 3?