Assignment 2: Flesch-Kincaid readability test

Deadline: 7 October, 23.59 CET

How to report the assignment

  • The completed assignment file assign2.py should be emailed to johan.roxendal@svenska.gu.se, with an appropriate subject line. Remember to include your full names both in the email and at the top of the assignment file (as a docstring).
  • Requirements

    Description

    You have just been hired at a company, and on your first day your new boss storms into your office. The boss is depressed: "I'm not able to read the company documents, and our writers don't know how to write more simply. Can you help them, and help me?" "Sure," you say.

    You think about the problem for a while and remember the Flesch-Kincaid readability test, a measure that indicates how difficult a text is to read.
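The handout does not say which variant of the measure to report. As a reference point, the commonly cited Flesch-Kincaid grade level is 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The sketch below computes it from totals you have already counted; the function name fk_grade_level is hypothetical and not part of the assignment skeleton.

```python
def fk_grade_level(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from pre-counted totals (a sketch)."""
    if n_words == 0 or n_sentences == 0:
        raise ValueError("need at least one word and one sentence")
    # float() keeps the division exact under Python 2 as well.
    words_per_sentence = float(n_words) / n_sentences
    syllables_per_word = float(n_syllables) / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

Longer sentences and words with more syllables both push the score up, which is why the report singles out those two things.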

    You look at the measure and have the following idea: given a text, why not present not only the measure itself, but also the ten worst sentences (i.e., the ten longest sentences) and the ten worst words (i.e., the ten words with the most syllables) according to this measure?
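One possible way to pick out the "worst" items is sketched below. It assumes each sentence is already a list of word tokens, and it estimates syllables by counting groups of vowels, which is only a rough heuristic for English (it overcounts silent "e", for example). All names here are hypothetical.

```python
import heapq
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+", re.IGNORECASE)

def count_syllables(word):
    """Rough estimate: one syllable per vowel group, never fewer than 1."""
    return max(1, len(VOWEL_GROUP.findall(word)))

def worst_sentences(sentences, n=10):
    """Return the n longest sentences, where each sentence is a list of words."""
    return heapq.nlargest(n, sentences, key=len)

def worst_words(words, n=10):
    """Return the n distinct words with the highest estimated syllable count."""
    unique = sorted(set(w.lower() for w in words))
    return heapq.nlargest(n, unique, key=count_syllables)
```

Lowercasing and deduplicating first keeps the same word from filling several of the ten slots.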

    Use nltk.sent_tokenize and nltk.word_tokenize (the latter requires a sentence as input). You may not use any other functions in nltk. Note: the measure counts words, not symbols or numbers.
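Because nltk.word_tokenize also returns punctuation and numbers as tokens, some filtering is needed before counting. A minimal sketch, assuming that a "word" is a token made up of letters with optional hyphens (the helper name keep_words is hypothetical):

```python
def keep_words(tokens):
    """Keep only tokens that consist of letters, allowing hyphens."""
    words = []
    for token in tokens:
        # "well-known" counts as a word; ",", "42" and "'s" do not.
        if token.replace("-", "").isalpha():
            words.append(token)
    return words

# In the assignment, tokens would come from nltk.word_tokenize(sentence).
tokens = ["The", "boss", ",", "42", "well-known", "'s", "!"]
print(keep_words(tokens))
```

Whether clitics such as "'s" or "n't" should count as words is a design choice; this sketch drops them.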

    Optional: Define your own word and sentence tokenization. Compare the result of using your tokenization with nltk's tokenizers. Does it make a difference in the output?
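For the optional part, a regular-expression tokenizer is one simple starting point. The sketch below is an assumption about what "your own tokenization" might look like, not a reference solution: it splits sentences after ".", "!" or "?" followed by whitespace, and treats runs of letters (with internal hyphens or apostrophes) as words.

```python
import re

# Split at whitespace that follows sentence-final punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Letters only (no digits or underscores), with optional internal - or '.
WORD = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", re.UNICODE)

def simple_sent_tokenize(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

def simple_word_tokenize(sentence):
    """Return the words in a sentence, skipping numbers and punctuation."""
    return WORD.findall(sentence)
```

This version splits after abbreviations such as "Mr.", which is exactly the kind of difference worth looking for when comparing against nltk's tokenizers.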

    Optional: Generate a prettified report for the writers in HTML. Read an HTML tutorial if you do not already know how HTML works.
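A report page can be built as a single string. The sketch below is one possible layout, assuming Python 3 (where html.escape is available; Python 2 offers cgi.escape instead). The function name render_html and the page structure are hypothetical.

```python
from html import escape  # Python 3 standard library

def render_html(title, score, sentences, words):
    """Return a small HTML page as one string (sentences/words are strings)."""
    # Escape text so characters like < and & show up literally in the page.
    sentence_items = "\n".join(
        "<li>{}</li>".format(escape(s)) for s in sentences)
    word_items = "\n".join(
        "<li>{}</li>".format(escape(w)) for w in words)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n<meta charset=\"utf-8\">\n"
        "<title>{title}</title>\n</head>\n<body>\n"
        "<h1>Readability report: {title}</h1>\n"
        "<p>Score: {score:.2f}</p>\n"
        "<h2>Longest sentences</h2>\n<ol>\n{sentences}\n</ol>\n"
        "<h2>Words with the most syllables</h2>\n<ol>\n{words}\n</ol>\n"
        "</body>\n</html>\n"
    ).format(title=escape(title), score=score,
             sentences=sentence_items, words=word_items)
```

Escaping matters here because the source texts may contain characters that would otherwise be read as markup.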

    To think about: Is the Flesch-Kincaid readability test a good measure of language complexity? Is the measure language-independent?

    Start with the following code (in a file 'assign2.py'):

    """Assignment 2: Flesch-Kincaid readability test
    Name 1: NAME
    Name 2: NAME
    """
    import nltk # word_tokenize and sent_tokenize
    
    def produce_report(filename):
        pass
    

    Test your program with the following text files: sherlock.txt and metamorphosis.txt, downloaded from Project Gutenberg. The files are UTF-8-encoded, so you need to decode the file content before processing.
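One way to handle the decoding is io.open, which behaves the same in Python 2 and Python 3 and returns decoded text when given an encoding. This is a sketch; the helper name read_text is hypothetical.

```python
import io

def read_text(filename):
    """Read a UTF-8 encoded file and return its content as decoded text."""
    with io.open(filename, encoding="utf-8") as f:
        return f.read()
```

A produce_report implementation could call read_text(filename) as its first step and pass the result on to the sentence tokenizer.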

    Example usage:

    >>> import assign2
    >>> print(assign2.produce_report('sherlock.txt'))
        [ prints the report of file 'sherlock.txt' ] 
    
    >>> report = assign2.produce_report('sherlock.txt')
    >>> with open('result.html', mode='w') as f: # write HTML file
    ...     f.write(report)
    ... 
    >>> 
    
    $ open result.html # on a Mac.