Machine learning: project

Your work

In the project work, your task will be to

find an interesting task that can be addressed with machine learning methods
find (or annotate) an appropriate dataset for that task
look around for related work
implement software to train some machine learning model for your task
evaluate your system
write up a report about your results
present your work at a seminar
read a report by one of your fellow students and ask questions about it at the seminar

Projects are normally done either individually or in groups of two. If you'd like to work in a larger group, please ask the course instructor first.

The report and the presentation

When writing your report, please use one of these style templates (available for Word and LaTeX). The deadline for submitting the report by email to me is October 26. If the report is in a preliminary state, please submit it anyway.

The reports will be presented at a seminar on October 30, 10.00-12.00 in K333. You should prepare a presentation (ideally a slideshow) of about 10 minutes, and be ready to answer questions about your work. Each student will also be assigned the report of another project to read, and will have to prepare one or more questions for that presentation.

Selecting a topic

I think it's most rewarding for you if you define your own project. However, I understand that it can be hard to come up with an idea, so here are some possible options:

Easy or safe projects

Reader categorization of medical text. Karin has categorized a large set of medical documents in Swedish as being intended for either a specialist or a lay reader. Can you develop a classifier that learns to distinguish the two categories?
Grading learner level. Build a classifier or regression model that determines what proficiency level a language learner would need to understand a given text. This project would involve some interaction with Ildikó.
Word sense disambiguation for a small set of word types. Build classifiers that select the appropriate sense of an ambiguous word such as "line" or "serve".
Adapting a lab assignment. Start from one of the lab assignments and change it to something that is interesting for you. Maybe you'd like a grammatical function tagger or named entity tagger for your language? Or maybe change assignment 3 to a PoS tagger? Or maybe see if you can use some external resource to get better features?
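
To give a feel for how small a first baseline can be, here is a minimal sketch of a bag-of-words perceptron for a two-class text categorization task such as the reader categorization project above. Everything in it is an assumption made for illustration: the function names, the whitespace tokenization, and above all the four toy English documents, which are invented and have nothing to do with Karin's Swedish dataset. Treat it as a starting point, not as the expected solution.

```python
from collections import Counter, defaultdict


def featurize(text):
    """Bag-of-words counts: lowercase the text and split on whitespace."""
    return Counter(text.lower().split())


def train_perceptron(docs, labels, epochs=10):
    """Train a binary perceptron. Labels must be +1 or -1."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            feats = featurize(text)
            score = sum(weights[w] * c for w, c in feats.items())
            # Update only on mistakes (a score of exactly 0 counts as a mistake).
            if y * score <= 0:
                for w, c in feats.items():
                    weights[w] += y * c
    return weights


def predict(weights, text):
    """Return +1 if the weighted word counts sum to a positive score, else -1."""
    score = sum(weights.get(w, 0.0) * c for w, c in featurize(text).items())
    return 1 if score > 0 else -1


# Invented toy data: +1 = written for specialists, -1 = written for lay readers.
TOY_DOCS = [
    "the patient presented with acute myocardial infarction",
    "dosage titration in renal insufficiency",
    "how to take care of a cold at home",
    "tips for a healthy heart and good sleep",
]
TOY_LABELS = [1, 1, -1, -1]

weights = train_perceptron(TOY_DOCS, TOY_LABELS)
print(predict(weights, "acute renal infarction"))
print(predict(weights, "a cold at home"))
```

For a real project you would of course evaluate on held-out documents rather than on toy examples, and probably compare against a stronger learner and better features.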

Harder or riskier projects

Implementing a machine learning algorithm. Select a classification algorithm of your choice and implement it in Python or some other language. Apply it to some suitable classification task. (The difficulty of this project of course depends on which algorithm you select.)
Reproducing the results of your seminar paper. (Again, the difficulty of this will obviously depend on which paper you have selected.)
Maximum Spanning Tree dependency parsing. Implement McDonald's famous dependency parsing method using the structured perceptron as described in the lecture. (This task used to be an assignment in this course, so there is some code you can have for reading the files and for finding the top-scoring parse tree.)
Comparing word representations. It has recently become popular to use representations learned from large corpora as features for various machine learning tasks. Examples of such representations include LDA topics, clusters from Percy Liang's software and vectors from word2vec or gensim. Compare how well these representations can be used for one or more machine learning tasks, e.g. text categorization, named entity recognition, or parsing.
Domain adaptation of sentiment classifiers. The sentiment dataset we have been using consists of six different domains. If you train a sentiment classifier on book reviews, it will perform poorly when applied to reviews of e.g. kitchen appliances. Investigate a method for domain adaptation, and see if you can improve the out-of-domain classification performance.
Machine translation. Find a task in machine translation that can be solved with a machine learning approach. This can include selecting the best output from a set of candidates, or estimating the quality of the output. Prasanth can help a little bit with this.
Building a coreference resolver. Implement a machine learning system to resolve coreference of noun phrases, e.g. the classical approach of Soon et al. or something more recent. You can use a dataset like the CoNLL-2011 shared task set. (For this assignment, I can give you some code to read the data.)
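
For the domain adaptation project in the list above, a sensible first step is to measure the gap you are trying to close: train on one domain, then compare accuracy on held-out data from the same domain with accuracy on another domain. The sketch below does this with a small multinomial Naive Bayes classifier using add-one smoothing. The classifier choice, the function names and the tiny review snippets are all assumptions invented for illustration; they are not the course's sentiment dataset and say nothing about what numbers to expect on it.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for text, y in zip(docs, labels):
        word_counts[y].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for c in sorted(class_counts):
        total_words = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total_docs)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label


def accuracy(model, data):
    """Fraction of (text, label) pairs that the model labels correctly."""
    return sum(predict_nb(model, text) == y for text, y in data) / len(data)


# Invented toy snippets standing in for a source and a target domain.
BOOKS_TRAIN = [
    ("a gripping and moving story", "pos"),
    ("boring plot and flat characters", "neg"),
    ("a wonderful read", "pos"),
    ("dull and predictable", "neg"),
]
BOOKS_TEST = [("moving and wonderful", "pos"), ("boring and dull", "neg")]
KITCHEN_TEST = [("sturdy blender that works great", "pos"), ("broke after a week", "neg")]

model = train_nb([t for t, _ in BOOKS_TRAIN], [y for _, y in BOOKS_TRAIN])
in_domain = accuracy(model, BOOKS_TEST)
out_of_domain = accuracy(model, KITCHEN_TEST)
print("in-domain accuracy:", in_domain)
print("out-of-domain accuracy:", out_of_domain)
```

The kitchen snippets share almost no vocabulary with the book snippets, which is the core of the problem: whatever adaptation method you investigate, report both numbers before and after applying it.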