=========================================
Morphology learning data and program code
=========================================

This directory contains the data and the code for the paper:

"Semi-supervised learning of morphological paradigms and lexicons"
Ahlberg, M., M. Forsberg, and M. Hulden. EACL 2014

=======
License
=======

The data and code are placed in the public domain under the Creative
Commons Attribution-ShareAlike 3.0 Unported license.

http://creativecommons.org/licenses/by-sa/3.0/

=======
General
=======

* The main results in the paper can be reproduced by running "make" in the
  main directory (requires Python 2.7 and Perl 5).

* The foma finite-state toolkit must also be installed and available in the
  path (http://foma.googlecode.com).

- data/wiktionary-morphology contains the Durrett & DeNero (2013) data set
  with one minor correction to the Finnish verb infinitive tags. This data
  set was used for experiments 1 & 2.

- data/wikipedia contains the Wikipedia frequency dumps used for
  experiment 2.

- data/saldo contains the Swedish tables used in experiment 3.

- paradigms/ contains the precalculated paradigms learned by
  src/extract.perl, which are used in all experiments. They are rebuilt by
  "make" if deleted, but the process takes a few minutes. See
  src/extract.perl for the format specification.

===============================
Table extraction and collapsing
===============================

The paradigm extraction program described in section 3.2 of the paper is
stand-alone code and is found in src/extract.perl. It requires the foma
finite-state toolkit to be installed in the path, since it in turn calls
src/extract.foma. See the program code in extract.perl for documentation.
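
Since "make" and src/extract.perl both depend on external tools being in
the path, it can help to check for them before starting a run. The script
below is a minimal sketch of such a check and is not part of this
distribution: the function names find_on_path and missing_tools are
hypothetical, and the tool list (foma, perl, make) is assumed from the
requirements stated under General. It uses only the standard library, so
it should run under Python 2.7 as well as Python 3.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of executable `name`, or None if not found.

    `path` is a search path string in the same format as the PATH
    environment variable; it defaults to the current PATH.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            # Skip empty entries rather than searching the current directory.
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(tools=("foma", "perl", "make"), path=None):
    """Return the names in `tools` that are not found on the search path."""
    return [tool for tool in tools if find_on_path(tool, path) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from path: " + ", ".join(missing))
    else:
        print("All prerequisites found; run 'make' in the main directory.")
```

If the script reports a missing tool, install it or adjust the path
before running "make". The check only confirms that an executable with
the given name exists; it does not verify tool versions.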