Assignment 2: Flesch-Kincaid readability test

Deadline: 7 October, 23.59 CET

How to report the assignment

The completed assignment file, assign2.py, should be mailed to johan.roxendal@svenska.gu.se with an appropriate subject line. Remember to write your full names in the mail and in a docstring at the top of the assignment file.

Requirements

The assignment is done in groups of two.
It is OK to discuss solutions with students outside your group, but copying code counts as cheating (the same goes for code found on the web, of course).
If requested, both members of a group should be able to explain the group's solution.
Always hand in the assignment before the deadline, even if you have not been able to complete it.
If the assignment is incomplete, clearly state what is left to do and what you need help with.
Comment your code.
Use documentation strings for your module and all your functions.

Description

You have just been hired at a company, and on your first day your new boss storms into your office. The boss is depressed and says: "I'm not able to read the company documents, and our writers don't know how to write more simply. Can you help them, and help me?" "Sure," you say.

You think about the problem for a while and remember the Flesch-Kincaid readability test, a measure that indicates how difficult a text is to read.
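The assignment does not say which variant of the test to report. As a starting point, here is a minimal sketch of the two commonly cited formulas, the Flesch reading ease score and the Flesch-Kincaid grade level. The function and parameter names are only illustrative; the three counts are assumed to come from your own tokenization and syllable counting.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch reading ease: higher scores mean easier text."""
    words_per_sentence = float(n_words) / n_sentences
    syllables_per_word = float(n_syllables) / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level: higher scores mean harder text."""
    words_per_sentence = float(n_words) / n_sentences
    syllables_per_word = float(n_syllables) / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

For a text with 100 words, 10 sentences and 150 syllables, the grade level comes out at about 6.0 and the reading ease at about 69.8. Both formulas depend only on average sentence length and average syllables per word, which is why long sentences and long words are the "worst" ones.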

You look at the measure and have the following idea: given a text, why not present not only the measure itself, but also the ten worst sentences (i.e., the ten longest sentences) and the ten worst words (i.e., the ten words with the most syllables) according to this measure?
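One possible way to find the worst sentences and words is sketched below. The syllable counter is an assumption, not part of the assignment: it simply counts groups of consecutive vowels, which is a rough approximation for English (it overcounts words with a silent "e", for example). All names are hypothetical.

```python
import heapq
import re

# Assumption: a syllable is approximated by a run of consecutive vowels.
VOWEL_GROUP = re.compile(r'[aeiouy]+')


def count_syllables(word):
    """Estimate the number of syllables in a word (at least 1)."""
    return max(1, len(VOWEL_GROUP.findall(word.lower())))


def top_longest_sentences(sentences, n=10):
    """Return the n sentences with the most words.

    Each sentence is expected to be a list of word tokens.
    """
    return heapq.nlargest(n, sentences, key=len)


def top_syllable_words(words, n=10):
    """Return the n distinct words with the most estimated syllables."""
    unique_words = set(word.lower() for word in words)
    return heapq.nlargest(n, unique_words, key=count_syllables)
```

heapq.nlargest avoids sorting the whole text when only ten items are needed. Lowercasing the words before ranking keeps "The" and "the" from appearing as two entries.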

Use nltk.sent_tokenize to split the text into sentences and nltk.word_tokenize to split each sentence into words (word_tokenize expects a single sentence as input). You may not use any other functions in nltk. Note: the measure is about words, so do not count symbols or numbers.
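A minimal sketch of how the two tokenizers can be combined while leaving out symbols and numbers. The helper names are hypothetical, and the definition of a "word" (letters only) is an assumption you may want to refine. The tokenizers are passed in as arguments, so the sketch can be tried with any pair of tokenizing functions.

```python
def is_word(token):
    """Return True for tokens made up of letters only.

    Assumption: tokens containing digits or punctuation, such as
    "1887", "," or "n't", do not count as words. Note that this
    simple test also drops hyphenated words.
    """
    return token.isalpha()


def words_per_sentence(text, sent_tokenize, word_tokenize):
    """Split text into sentences, and each sentence into a list of words.

    sent_tokenize maps a text to a list of sentences;
    word_tokenize maps one sentence to a list of tokens.
    """
    return [[token for token in word_tokenize(sentence) if is_word(token)]
            for sentence in sent_tokenize(text)]
```

With nltk the call would look like words_per_sentence(text, nltk.sent_tokenize, nltk.word_tokenize). The result is a list of sentences, each a list of words, which gives you both the sentence count and the word count.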

Optional: Define your own word and sentence tokenizers. Compare the results of your tokenizers with those of nltk's tokenizers. Does it make a difference in the output?
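For the optional part, a very simple regular-expression tokenizer could look like the sketch below. The rules are assumptions chosen for brevity: a sentence ends at ".", "!" or "?" followed by whitespace, and a word is a run of letters, optionally joined by apostrophes or hyphens.

```python
import re

# Assumption: sentences end at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')
# Assumption: a word is letters only, with inner apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*", re.UNICODE)


def simple_sent_tokenize(text):
    """Split a text into sentences at sentence-final punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def simple_word_tokenize(sentence):
    """Return the words of a sentence, skipping numbers and symbols."""
    return WORD.findall(sentence)
```

A tokenizer like this splits "Mr. Holmes came." into two sentences because of the abbreviation, which is exactly the kind of difference worth looking for when you compare the outputs.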

Optional: Generate a prettified report for the writers in HTML. Read an HTML tutorial if you do not already know how HTML works.
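If you try the HTML report, the sketch below shows one way to build the page as a string. The layout, headings and function names are hypothetical. The important detail is that text taken from the input file is escaped, so that characters such as "<" and "&" do not break the page.

```python
try:
    from html import escape   # Python 3
except ImportError:
    from cgi import escape    # Python 2


def html_list(title, items):
    """Return a heading followed by a numbered list of items."""
    lines = ['<h2>%s</h2>' % escape(title), '<ol>']
    lines.extend('<li>%s</li>' % escape(item) for item in items)
    lines.append('</ol>')
    return '\n'.join(lines)


def html_report(title, score, worst_sentences, worst_words):
    """Return a complete HTML page for one analysed text.

    worst_sentences and worst_words are lists of strings.
    """
    body = '\n'.join([
        '<h1>%s</h1>' % escape(title),
        '<p>Flesch-Kincaid score: %.2f</p>' % score,
        html_list('Longest sentences', worst_sentences),
        html_list('Words with most syllables', worst_words),
    ])
    return ('<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8">'
            '<title>%s</title></head>\n<body>\n%s\n</body>\n</html>\n'
            % (escape(title), body))
```

The meta charset tag assumes that the page is written to disk as UTF-8.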

To think about: Is the Flesch-Kincaid readability test a good measure of language complexity? Is the measure language independent?

Start with the following code (in a file 'assign2.py'):

"""Assignment 2: Flesch-Kincaid readability test
Name 1: NAME
Name 2: NAME
"""
import nltk # word_tokenize and sent_tokenize

def produce_report(filename):
    pass

Test your program with the following text files: sherlock.txt and metamorphosis.txt, downloaded from Project Gutenberg. The files are UTF-8-encoded, so you need to decode the file content before processing.
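One way to handle the decoding is to let io.open do it while reading, as in the minimal sketch below. The function name read_text is hypothetical.

```python
import io


def read_text(filename):
    """Return the content of a UTF-8 encoded file as a unicode string."""
    # io.open decodes while reading and works in both Python 2 and 3.
    with io.open(filename, encoding='utf-8') as f:
        return f.read()
```

If a file turns out to start with a byte order mark, the 'utf-8-sig' codec can be used instead; it decodes files with or without one.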

Example usage:

>>> import assign2
>>> print(assign2.produce_report('sherlock.txt'))
    [ prints the report of file 'sherlock.txt' ]

>>> report = assign2.produce_report('sherlock.txt')
>>> with open('result.html', mode='w') as f: # write HTML file
...     f.write(report)
... 
>>> 

$ open result.html # on a Mac.