Assignment 3: Language Identification

Deadline: 22 October, 23.59 CET

Revision: word_tokenize and sent_tokenize are too unreliable for non-English texts. To tokenize into words, use the following for all languages:

words = [token for token in nltk.wordpunct_tokenize(text) if token.isalpha()]

Clarification: you should consider all N-grams of the language profiles when calculating the out-of-place measure (but only the top 300 N-grams of the profile of the unknown text).

How to report the assignment

  • The completed assignment file should be mailed to, with an appropriate subject line. Remember to write your full names in the mail, and as a header in the assignment file (as a doc string).
  • Requirements


    This assignment is based on an article "N-Gram-Based Text Categorization" written by Cavnar et al. You do not have to read the article to do the assignment, but if you are interested, just google the title (the second hit when I looked).

    You will work on language identification. The basic idea is the following: for every language we want to identify, a profile is created. Then, given a text in an unknown language, the profile of that text is compared with the language profiles we previously created, and the languages are ordered with respect to how close they are to the text (languages with no profiles are, of course, impossible to identify correctly).

    A profile is created in the following way:

    Closeness is decided by comparing the profile of an input text with all the language profiles. Only the top 300 N-grams of the text are considered (these are somewhat subject-neutral N-grams). For the language profiles, we consider all N-grams.

    For every N-gram in the top 300 N-grams we calculate the out-of-place, which is the offset of the position of an N-gram of the profile of the text compared to the current language profile. If an N-gram does not occur in the current language profile, it is given an offset of the number of N-grams in the language profile plus one. Summing all offsets gives us a ranking that we use to order the potential languages (lower is closer). Example of offset calculation:

    P1   P2  Out-of-place
    --   --  ------------
    TH   TH      0
    ER   ING     3
    ON   ON      0
    LE   ER      2
    ING  AND     1
    AND  ED      ...
    ...  ...

    Hint when calculating offsets: the built-in function abs(n) removes the sign of number n.
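
    The offset calculation in the example table can be sketched as follows. This is a minimal illustration, not a reference solution: the function name out_of_place and the representation of a profile as a list of N-grams ordered from most to least frequent are assumptions made for this example. P1 is taken to be the language profile and P2 the profile of the text.

```python
def out_of_place(text_profile, language_profile):
    """Sum the rank offsets of the text N-grams in a language profile."""
    # Map every N-gram of the language profile to its rank (0 = most frequent).
    rank = {gram: position for position, gram in enumerate(language_profile)}
    # N-grams missing from the language profile get this fixed penalty.
    missing = len(language_profile) + 1
    total = 0
    for position, gram in enumerate(text_profile):
        if gram in rank:
            total += abs(position - rank[gram])
        else:
            total += missing
    return total


# The six rows shown in the example table.
p1 = ['TH', 'ER', 'ON', 'LE', 'ING', 'AND']   # language profile
p2 = ['TH', 'ING', 'ON', 'ER', 'AND', 'ED']   # profile of the unknown text
distance = out_of_place(p2, p1)
print(distance)  # 0 + 3 + 0 + 2 + 1 + 7 = 13
```

    ED does not occur in P1, so it receives the penalty of six N-grams plus one.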

    The essential function of this assignment is a function able to create N-grams. You are given a function ngram that produces a dictionary of N-grams for a list of words for a specific N. Make sure that you understand what the code is doing.

    def ngram(words, dictionary, n):
        pad = ' '*(n-1)
        for word in words:
            padded_word = '%s%s%s' % (pad,word,pad)
            index = 0
            while index+n <= len(padded_word):
                gram = padded_word[index:index+n]
                if gram in dictionary:
                    dictionary[gram] += 1
                else:
                    dictionary[gram] = 1
                index += 1
        return dictionary
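
    To see what ngram returns, and how its counts could be turned into a ranked profile, consider the sketch below. The helper make_profile, its sizes and limit parameters, and the choice of N = 1 to 3 are hypothetical and chosen only for illustration. ngram is repeated so that the block runs on its own.

```python
def ngram(words, dictionary, n):
    """Count the character N-grams of every word, padded with spaces."""
    pad = ' ' * (n - 1)
    for word in words:
        padded_word = '%s%s%s' % (pad, word, pad)
        index = 0
        while index + n <= len(padded_word):
            gram = padded_word[index:index + n]
            if gram in dictionary:
                dictionary[gram] += 1
            else:
                dictionary[gram] = 1
            index += 1
    return dictionary


def make_profile(words, sizes=(1, 2, 3), limit=None):
    """Return N-grams ordered from most to least frequent (hypothetical helper)."""
    counts = {}
    for n in sizes:
        ngram(words, counts, n)
    # Sort by descending count; ties are broken alphabetically so the
    # order is reproducible.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked if limit is None else ranked[:limit]


bigrams = ngram(['the'], {}, 2)
print(sorted(bigrams.items()))
# [(' t', 1), ('e ', 1), ('he', 1), ('th', 1)]

profile = make_profile(['the', 'then', 'there'], limit=300)
print(profile[:3])
```

    Note that the padding makes word-initial and word-final N-grams such as ' t' and 'e ' distinct from word-internal ones.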

    Use the following example texts (UTF-8 encoded) to create the language profiles: english_text.txt, dutch_text.txt, italian_text.txt, and french_text.txt (retrieved from Gutenberg). This gives us the training data that should be used to create the language profiles.

    training_data = [("English","english_text.txt"),
                     ("Dutch","dutch_text.txt"),
                     ("French","french_text.txt"),
                     ("Italian","italian_text.txt")]

    Actually, these texts are a bit too small to get decent language profiles, but they suffice in this assignment.

    The end goal is to define a top-level function language_classify(filename) that, given the file filename, matches the profile of its content against the language profiles and orders the languages by ranking. The result should be an ordered list of languages, each together with its ranking.
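
    One way the pieces could fit together is sketched below, on in-memory strings instead of files. Everything here is an assumption made for illustration: the names profile and classify_text, the whitespace tokenizer standing in for the nltk call, the N-gram sizes 1 to 3, and the two toy sample sentences. The result holds raw summed offsets, so lower is closer.

```python
from collections import Counter


def profile(text, sizes=(1, 2, 3), limit=None):
    """Rank the character N-grams of a text from most to least frequent."""
    counts = Counter()
    # Stand-in tokenizer for this sketch: whitespace split, alphabetic tokens only.
    words = [token for token in text.split() if token.isalpha()]
    for n in sizes:
        pad = ' ' * (n - 1)
        for word in words:
            padded = pad + word + pad
            counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Ties are broken alphabetically so the ranking is reproducible.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked if limit is None else ranked[:limit]


def classify_text(text, language_texts):
    """Return (language, distance) pairs, closest language first."""
    text_profile = profile(text, limit=300)   # top 300 N-grams of the text
    results = []
    for language, sample in language_texts:
        # All N-grams of the language profile are used.
        rank = {gram: i for i, gram in enumerate(profile(sample))}
        missing = len(rank) + 1
        distance = sum(abs(i - rank[gram]) if gram in rank else missing
                       for i, gram in enumerate(text_profile))
        results.append((language, distance))
    return sorted(results, key=lambda pair: pair[1])


samples = [('English', 'the cat and the dog sat on the mat with the hat'),
           ('Dutch', 'de kat en de hond zaten op de mat met de hoed')]
# Classifying a training sample against itself gives a distance of 0.
ranking = classify_text(samples[0][1], samples)
print(ranking[0])  # ('English', 0)
```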

    The first thing you need to make sure of is that the texts in the training data are classified correctly. Add some test code to your file similar to the code snippet below.

    if __name__ == '__main__':
        for (language,filename) in training_data:
            # Parenthesized single-argument print works in Python 2 and 3.
            print('Current language: %s in file "%s"' % (language,filename))
            for (n,(lang,ranking)) in enumerate(language_classify(filename)):
                print('   %i. %s [normalized rank: %.1f]' % (n+1,lang,ranking))

    $ python
    Current language: English in file "english_text.txt"
       1. English [normalized rank: 1.0]
       2. Dutch [normalized rank: 0.8]
       3. French [normalized rank: 0.7]
       4. Italian [normalized rank: 0.5]
    Current language: Dutch in file "dutch_text.txt"
       1. Dutch [normalized rank: 1.0]
       2. French [normalized rank: 0.7]
       3. English [normalized rank: 0.7]
       4. Italian [normalized rank: 0.6]
    Current language: French in file "french_text.txt"
       1. French [normalized rank: 1.0]
       2. Italian [normalized rank: 0.7]
       3. English [normalized rank: 0.7]
       4. Dutch [normalized rank: 0.6]
    Current language: Italian in file "italian_text.txt"
       1. Italian [normalized rank: 1.0]
       2. French [normalized rank: 0.7]
       3. English [normalized rank: 0.7]
       4. Dutch [normalized rank: 0.5]

    Note that some normalization of the numbers has been added to avoid having to compare large numbers (not required). Here, the highest number is the closest one, and 1.0 is the best possible. Note also that only the differences between the numbers matter, not the numbers themselves.
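
    The normalization used for the output is not specified here. One possibility, shown purely as an assumption, is to divide the summed offsets by the largest sum that could occur and subtract the result from 1, so that an identical profile scores 1.0. The function name normalized_rank and its parameters are hypothetical.

```python
def normalized_rank(distance, text_profile_size, language_profile_size):
    """Map a summed out-of-place value to a score where 1.0 is a perfect match."""
    # Worst case (assumed here): every N-gram of the text profile is missing
    # from the language profile and receives the penalty len(profile) + 1.
    worst = text_profile_size * (language_profile_size + 1)
    if worst == 0:
        return 0.0
    return 1.0 - float(distance) / worst


# Hypothetical distances for a 300-N-gram text profile against a language
# profile with 1000 N-grams.
print('%.1f' % normalized_rank(0, 300, 1000))       # 1.0
print('%.1f' % normalized_rank(150150, 300, 1000))  # 0.5
```

    Because language profiles differ in size, scores normalized this way are not directly comparable to raw sums; either can be used as long as the ordering is consistent.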

    Next, try some unseen text in some of the target languages. Note that one explanation for incorrect classification of unseen text is the small amount of training data.