Assignment 3: Language Identification

Deadline: Friday, October 16, 23.59 CET

How to report the assignment

  • The completed assignment file assign3.py should be mailed to luis.nieto.pina@svenska.gu.se, with an appropriate subject line. Remember to write your full names in the mail, and as a header in the assignment file (as a doc string).
  • Requirements

    Description

    This assignment is based on the article "N-Gram-Based Text Categorization" by Cavnar and Trenkle.

    You will work on language identification. The basic idea is the following: for every language we want to identify, a profile is created. Then, given a text in an unknown language, the profile of that text is compared with the language profiles we previously created, and the languages are ordered with respect to how close they are to the text (languages with no profiles are, of course, impossible to identify correctly).

    A profile is created in the following way:

    Closeness is decided by comparing the profile of an input text with all the language profiles. Only the top 300 N-grams of the text are considered (the most frequent N-grams tend to be subject-neutral). For the language profiles, we consider all N-grams.

    For every N-gram in the top 300 N-grams of the text, we calculate its out-of-place value: the absolute difference between its position in the profile of the text and its position in the current language profile. If an N-gram does not occur in the current language profile, it is given an offset of the number of N-grams in the language profile plus one. Summing all offsets gives us a ranking that we use to order the potential languages (lower is closer). Example of offset calculation, where P1 is a language profile, P2 is the profile of the text, and each offset belongs to the N-gram in the P2 column:

    P1   P2  Out-of-place
    --   --
    TH   TH      0
    ER   ING     3
    ON   ON      0
    LE   ER      2
    ING  AND     1
    AND  ED      ...
    ...  ...
    

    Hint: when calculating offsets, the built-in function abs(n) returns the absolute value of the number n, so e.g. abs(-5) returns 5.

    Note: you should consider all N-grams of the language profiles when calculating the out-of-place (but only the top 300 N-grams of the profile of the unknown text).
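
    The offset calculation described above can be sketched as follows. This is only one possible way to write it, not a reference solution: the function name out_of_place, its parameters, and the representation of a profile as a list of N-grams ordered from most to least frequent are all assumptions made for illustration. The example reads P1 in the table as the language profile and P2 as the profile of the text.

```python
def out_of_place(text_profile, language_profile, top=300):
    """Sum the out-of-place offsets of the top N-grams of a text profile.

    Assumption: both profiles are lists of N-grams, most frequent first.
    """
    # Position of every N-gram in the language profile (all of them are used).
    position = {ngram: i for i, ngram in enumerate(language_profile)}
    # Offset given to an N-gram that is absent from the language profile.
    missing = len(language_profile) + 1
    total = 0
    # Only the top N-grams of the text profile are considered.
    for i, ngram in enumerate(text_profile[:top]):
        if ngram in position:
            total += abs(i - position[ngram])
        else:
            total += missing
    return total


# The six N-grams shown in the example table.
p1 = ['TH', 'ER', 'ON', 'LE', 'ING', 'AND']   # language profile
p2 = ['TH', 'ING', 'ON', 'ER', 'AND', 'ED']   # profile of the text
print(out_of_place(p2, p1))  # 0 + 3 + 0 + 2 + 1 + 7 = 13
```

    With only these six N-grams, ED is missing from P1 and therefore gets the offset 6 + 1 = 7.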

    The essential function of this assignment is a function able to create N-grams. You are given a function collect_ngram_statistics that, for a list of words and a specific N, adds the N-gram counts to a dictionary (the dictionary is updated in place; nothing is returned). Make sure that you understand what the code is doing.

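    def collect_ngram_statistics(words, dictionary, n):
        # Count all character N-grams of length n in words; dictionary is updated in place.
        pad = ' '*(n-1)
        for word in words:
            # Pad with n-1 spaces on both sides so that N-grams at the word edges are counted.
            padded_word = '%s%s%s' % (pad,word,pad)
            index = 0
            # Slide a window of width n over the padded word, one character at a time.
            while index+n <= len(padded_word):
                ngram = padded_word[index:index+n]
                if ngram in dictionary:
                    dictionary[ngram] += 1
                else:
                    dictionary[ngram] = 1
                index += 1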
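
    To see which N-grams the loop above generates, the padding and slicing can be traced for a single word. The snippet below is a small illustration only; the word 'text' and the variable names are chosen for the example.

```python
n = 3
word = 'text'

# Same padding as in collect_ngram_statistics: n-1 spaces on each side.
pad = ' ' * (n - 1)
padded_word = pad + word + pad   # '  text  '

# Every window of width n, moving one character at a time.
ngrams = [padded_word[i:i + n] for i in range(len(padded_word) - n + 1)]
print(ngrams)  # ['  t', ' te', 'tex', 'ext', 'xt ', 't  ']
```

    A word of length k thus contributes k + n - 1 N-grams, including those that record how the word begins and ends.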
    

    Use the following example texts to create the language profiles: english_text.txt, dutch_text.txt, italian_text.txt, and french_text.txt (retrieved from Project Gutenberg). This gives us the training data that should be used to create the language profiles.

    training_data = [("English","english_text.txt"),
                     ("Dutch","dutch_text.txt"),
                     ("French","french_text.txt"),
                     ("Italian","italian_text.txt")]
    

    These texts are actually a bit too small to give decent language profiles, but they suffice for this assignment.
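
    As a rough sketch of how a profile might be built from one of these files: the code below counts N-grams with the same padding as collect_ngram_statistics and orders them by frequency. The names build_profile and read_words, the range N = 1 to 5 (taken from the Cavnar and Trenkle article), the whitespace tokenization, the UTF-8 encoding, and the alphabetical tie-breaking are all assumptions, not requirements of the assignment.

```python
from collections import Counter


def build_profile(words, max_n=5):
    """Return the N-grams of words ordered from most to least frequent.

    Assumption: N runs from 1 to max_n.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        pad = ' ' * (n - 1)
        for word in words:
            padded_word = pad + word + pad
            for i in range(len(padded_word) - n + 1):
                counts[padded_word[i:i + n]] += 1
    # Descending count; ties are broken alphabetically to keep the order stable.
    return sorted(counts, key=lambda ngram: (-counts[ngram], ngram))


def read_words(filename):
    """Read a text file and split it on whitespace (a deliberately naive tokenizer)."""
    with open(filename, encoding='utf-8') as f:
        return f.read().split()


# Tiny in-memory example instead of a training file.
profile = build_profile('the cat and the hat'.split(), max_n=2)
print(profile[:3])  # ['t', 'a', 'h']
```

    For a real language profile one would call something like build_profile(read_words("english_text.txt")).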

    The end goal is to define a top-level function language_classify(filename) that, given the name of a text file, builds a profile of the file's content, matches it against the language profiles, and orders the languages by their rankings. The result should be an ordered list of (language, ranking) pairs.
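
    The final ordering step could look roughly like this, assuming the summed offsets have already been collected in a dictionary that maps each language to its ranking. The helper name rank_languages and the numbers in the example are hypothetical and only serve to show the shape of the result.

```python
def rank_languages(rankings):
    """Order (language, ranking) pairs from closest to farthest (lower is closer)."""
    return sorted(rankings.items(), key=lambda pair: pair[1])


# Made-up summed offsets, for illustration only.
example = {'French': 970, 'English': 410, 'Italian': 1200, 'Dutch': 880}
for position, (lang, ranking) in enumerate(rank_languages(example)):
    print('%i. %s [%i]' % (position + 1, lang, ranking))
```

    The result is a list of tuples, which is the form the test code in this assignment iterates over.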

    The first thing you need to make sure of is that the texts in the training data are classified correctly. Add some test code to your file assign3.py so that it prints the languages and the ranks.

    if __name__ == '__main__':
        for (language,filename) in training_data:
            print('Current language: %s in file "%s"' % (language,filename))
            for (n,(lang,ranking)) in enumerate(language_classify(filename)):
                print('   %i. %s [normalized rank: %.1f]' % (n+1,lang,ranking))
            print()
    

    Optional task. Try to come up with a more interpretable number, for instance a score between 0.0 and 1.0 (where 1.0 corresponds to a perfect language match) instead of printing the ranks.
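
    One possible normalization, given here only as a sketch: divide the summed offsets by the sum that would result if every N-gram of the text were missing from the language profile, and subtract from 1. The function name normalized_score and its parameters are assumptions. The result is clamped at 0.0 because, for a language profile shorter than the text profile, a single offset can exceed the "missing" offset.

```python
def normalized_score(ranking, text_size, profile_size):
    """Map summed out-of-place offsets to a score between 0.0 and 1.0.

    ranking      -- the summed offsets for one language (lower is closer)
    text_size    -- number of text N-grams that were compared (at most 300)
    profile_size -- number of N-grams in the language profile
    """
    # Sum obtained if every text N-gram were missing from the language profile.
    worst = text_size * (profile_size + 1)
    if worst == 0:
        return 0.0
    return max(0.0, 1.0 - ranking / worst)


print(normalized_score(0, 300, 1000))           # 1.0, identical ordering
print(normalized_score(300 * 1001, 300, 1000))  # 0.0, no N-gram in common
```

    Other normalizations are equally valid; this one has the property that a text compared with its own profile scores exactly 1.0.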

    Here is an example of how the output could look. In this case, we are using some sort of normalization as in the optional task.

    $ python assign3.py
    
    Current language: English in file "assign3/english_text.txt"
       1. English [normalized rank: 1.0]
       2. Dutch [normalized rank: 0.9]
       3. French [normalized rank: 0.9]
       4. Italian [normalized rank: 0.8]
    
    Current language: Dutch in file "assign3/dutch_text.txt"
       1. Dutch [normalized rank: 1.0]
       2. English [normalized rank: 0.8]
       3. French [normalized rank: 0.8]
       4. Italian [normalized rank: 0.7]
    
    Current language: French in file "assign3/french_text.txt"
       1. French [normalized rank: 1.0]
       2. Italian [normalized rank: 0.9]
       3. English [normalized rank: 0.9]
       4. Dutch [normalized rank: 0.9]
    
    Current language: Italian in file "assign3/italian_text.txt"
       1. Italian [normalized rank: 1.0]
       2. French [normalized rank: 0.9]
       3. English [normalized rank: 0.9]
       4. Dutch [normalized rank: 0.9]
    

    Next, try some unseen text in some of the target languages. Depending on the text you selected, you may get misclassifications. There are several possible explanations: maybe your text was very short, or just unusual? Maybe our training sets are too small?