=========================================
Morphology learning data and program code
=========================================

This directory contains the data and the code for the paper:

"Semi-supervised learning of morphological paradigms and lexicons"
Ahlberg, M., M. Forsberg, and M. Hulden.
EACL 2014

=======
License
=======

The data and code are released under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

http://creativecommons.org/licenses/by-sa/3.0/

=======
General
=======

* Running "make" in the main directory should reproduce the main results in the paper (requires Python 2.7 and Perl 5).
* The foma finite-state toolkit must also be installed and available in the path (http://foma.googlecode.com).

The directories contain the following:

* data/wiktionary-morphology contains the Durrett & DeNero (2013) data set with one minor correction to the Finnish verb infinitive tags. This data set was used for experiments 1 & 2.
* data/wikipedia contains the Wikipedia frequency dumps used for experiment 2.
* data/saldo contains the Swedish tables used in experiment 3.
* paradigms/ contains the precalculated paradigms learned by src/extract.perl, which are used in all experiments. "make" rebuilds them if they are deleted, but this takes a few minutes. See src/extract.perl for the format specification.
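
Before running "make", it may help to confirm that the tools mentioned above are visible in the path. The following is a minimal sketch and not part of the distributed code: the helper name missing_tools and the executable names in REQUIRED_TOOLS are assumptions (your system may call the interpreter "python2.7", for example), and it uses shutil.which, which needs Python 3.3 or later, so it would be run separately from the Python 2.7 code in this directory. It only checks that the executables exist, not their versions.

```python
import shutil

# Assumed executable names, based on the requirements listed above.
# Adjust them to match your system (e.g. "python2.7" instead of "python").
REQUIRED_TOOLS = ["make", "perl", "python", "foma"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found in the path."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found in the path: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

The lookup function is passed as a parameter only so that the check can be exercised without the tools being installed.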

===============================
Table extraction and collapsing
===============================

The paradigm extraction program described in Section 3.2 of the paper is stand-alone code, found in src/extract.perl. It requires the foma finite-state toolkit to be installed in the path, since it in turn calls src/extract.foma. See the program code in extract.perl for documentation.