Machine learning: project
Your work
In the project work, your task will be to
- find an interesting task that can be addressed with machine
learning methods
- find (or annotate) an appropriate dataset for that task
- look around for related work
- implement software to train some machine learning model for
your task
- evaluate your system
- write up a report about your results
- present your work at a seminar
- read a report by one of your fellow students and ask questions
about it at the seminar
Projects are normally done either individually or in
groups of two. If you'd like to work in a larger group, please ask
the course instructor.
The report and the presentation
When writing your report, please use one of
these style templates (available for Word and LaTeX).
The deadline for submitting the report by email to me is October 26. If the report is in a preliminary state, please submit it
anyway.
The reports will be presented at a seminar on October 30,
10.00-12.00 in K333.
You should prepare a presentation (ideally a slideshow) of about 10
minutes, and be ready to
answer questions about your work. Each student will also be assigned
to read the report of another project, and will have to prepare
one or more questions for that presentation.
Selecting a topic
I think it's most rewarding for you if you define your own project.
However, I understand that it can be hard to come up
with an idea, so here are some possible options:
Easy or safe projects
- Reader categorization of medical text. Karin
has categorized a large set of medical documents in Swedish as
being intended for a specialist or layman reader. Can you
develop a classifier that learns to distinguish the two
categories?
- Grading learner level. Build a classifier or regression
model that determines what proficiency level a language learner
would need to understand a given text. This project would involve
some interaction with Ildikó.
- Word sense disambiguation for a small set of word
types. Build classifiers that select the appropriate sense of an
ambiguous word such as line or serve.
- Adapting a lab assignment. Start from one of the lab
assignments and change it to something that is interesting for
you. Maybe you'd like a grammatical function tagger or named entity tagger
for your language? Or maybe change assignment 3 to a PoS tagger? Or
maybe see if you can use some external resource to get better
features?
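To make the classifier-style projects above a little more concrete, here is a minimal sketch of a word sense disambiguation baseline for the word "line". It is only an illustration: the sense labels ("queue", "stroke"), the toy sentences, the window size and the helper names are all made up for this example, and a real project would use an annotated dataset and probably a library classifier instead of this hand-written perceptron.

```python
from collections import defaultdict


def context_features(tokens, target_index, window=3):
    """Bag-of-words features from a window around the ambiguous word."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    return {"ctx=" + tokens[i].lower() for i in range(lo, hi) if i != target_index}


class Perceptron:
    """A plain multiclass perceptron over sets of binary features."""

    def __init__(self):
        self.weights = defaultdict(float)  # keyed by (label, feature)
        self.labels = []

    def score(self, feats, label):
        return sum(self.weights[(label, f)] for f in feats)

    def predict(self, feats):
        # Ties are broken in favour of the first label in sorted order.
        return max(self.labels, key=lambda y: self.score(feats, y))

    def train(self, data, epochs=10):
        self.labels = sorted({y for _, y in data})
        for _ in range(epochs):
            for feats, gold in data:
                guess = self.predict(feats)
                if guess != gold:
                    # Standard perceptron update: promote gold, demote guess.
                    for f in feats:
                        self.weights[(gold, f)] += 1.0
                        self.weights[(guess, f)] -= 1.0


# Toy training set (invented): (tokens, index of "line", sense label).
toy = [
    ("stand in line at the bank".split(), 2, "queue"),
    ("wait in line for tickets".split(), 2, "queue"),
    ("draw a straight line on paper".split(), 3, "stroke"),
    ("draw a line with the pencil".split(), 2, "stroke"),
]
data = [(context_features(t, i), y) for t, i, y in toy]

model = Perceptron()
model.train(data)
train_predictions = [model.predict(feats) for feats, _ in data]
new_prediction = model.predict(context_features("draw a thick line here".split(), 3))
```

The same skeleton (feature extraction, training, prediction) carries over to the other classification ideas in this list; only the features and the labels change.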
Harder or riskier projects
- Implementing a machine learning algorithm. Select a classification
algorithm of your choice and implement it in Python or some other
language. Apply it to some suitable classification task. (The
difficulty of this project of course depends on which
algorithm you select.)
- Reproducing the results of your seminar paper. (Again, the
difficulty of this will obviously depend on which paper you have selected.)
- Maximum Spanning Tree dependency
parsing. Implement McDonald's famous dependency parsing method
using the structured perceptron as described in the lecture. (This task
used to be an assignment in this course, so there is some code you
can have for reading the files and for finding the top-scoring parse tree.)
- Comparing word representations. It has recently become
popular to use representations learned from large corpora as
features for various machine learning tasks. Examples of such
representations include LDA topics, clusters from Percy
Liang's software and vectors from
word2vec or gensim. Compare how well
these representations can be used for one or more machine learning
tasks, e.g. text categorization, named entity recognition, or parsing.
- Domain adaptation of sentiment classifiers. The sentiment
dataset we have been using consists of six different domains. If you
train a book review sentiment classifier, it will perform poorly when
applied to reviews of e.g. kitchen appliances. Investigate a method
for domain adaptation, and see if you can improve the out-of-domain
classification performance.
- Machine translation. Find a task in machine translation
that can be solved with a machine learning approach. This can
include selecting the best output from a set of candidates, or
estimating the quality of the output. Prasanth can
help a little bit with this.
- Building a coreference resolver. Implement a machine learning
system to resolve coreference of noun phrases, e.g. the
classical Soon approach or something
more recent. You can use a dataset like
the CoNLL-2011 shared task set. (For this project,
I can give you some code to read the data.)
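As an illustration of what a domain adaptation method can look like for the sentiment project above, here is a minimal sketch of one well-known option, the feature augmentation trick from Daumé III's "Frustratingly Easy Domain Adaptation". It is not the only method and not necessarily the best one for your data. The function name `augment`, the feature names and the domain labels "books" and "kitchen" are assumptions made for this example.

```python
def augment(features, domain):
    """Copy every feature into a shared version and a domain-specific version.

    A linear classifier trained on the augmented features can put weight on
    the shared copy when a feature behaves the same way in all domains, and
    on the domain-specific copy when it does not.
    """
    out = {}
    for name, value in features.items():
        out["general:" + name] = value
        out[domain + ":" + name] = value
    return out


# Hypothetical bag-of-words features for two reviews from different domains.
book_example = augment({"great": 1.0, "plot": 1.0}, "books")
kitchen_example = augment({"great": 1.0, "blender": 1.0}, "kitchen")

# Only the shared copy of "great" is visible in both domains.
shared = set(book_example) & set(kitchen_example)
```

The augmented dictionaries can be fed to any classifier that accepts sparse named features, so the method needs no change to the learning algorithm itself.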