=============================
Wiktionary Morphology Dataset
=============================

This directory contains Wiktionary morphology data released to the public as a
supplementary dataset for the paper:

Supervised Learning of Complete Morphological Paradigms
Greg Durrett and John DeNero
NAACL 2013

---------
Changelog
---------

Version 1.1: Updated this README and the filenames for the Finnish data to
accurately reflect that both nouns and adjectives are present.

-------
License
-------

These data are placed in the public domain under the Creative Commons
Attribution-ShareAlike 3.0 Unported license.

http://creativecommons.org/licenses/by-sa/3.0/

-------
General
-------

The data consists of complete inflection tables for a large number of base
forms for five language/part of speech pairs (German nouns, German verbs,
Spanish verbs, Finnish nouns/adjectives, and Finnish verbs). For each base
form, we list the inflected form corresponding to each setting of relevant
morphological attributes for that language (e.g. each German verb lists an
inflected form for the first person singular present indicative).

For Finnish, nouns and adjectives are merged because the same base template
(fi-decl) is used for both nouns and adjectives. Moreover, these two parts of
speech are sensitive to the same morphological features (case and number) and
in some cases may even share morphological inflection rules.

Sizes:
German nouns: 2764 base forms, 8 inflected forms each, 22112 total items
German verbs: 2027 base forms, 27 inflected forms each, 54729 total items
Spanish verbs: 4055 base forms, 57 inflected forms each, 231135 total items
Finnish nouns/adjectives: 40589 base forms, 28 inflected forms each, 1136492 total items
Finnish verb: 7249 base forms, 53 inflected forms each, 384197 total items

Development and test sets are always of size 200, and the remaining forms are
used for the training set.

--------
Creation
--------

Complete inflection tables for nouns and verbs were extracted from the Wiki
markup files that generate the HTML output of Wiktionary.  We developed a
parser for the subset of the Wiki Markup Language (see
http://en.wikipedia.org/wiki/Help:Wiki_markup) relevant to parsing inflection
tables. This domain-specific language has substitution semantics and supports
function application via "templates".  For most languages, repeated function
application yields a useful intermediate result: a labeled list of all
inflected entries in an inflection table, before any HTML formatting is applied
to the result. These lists are exported as the comma-separated value (CSV)
files.

All English Wiktionary data was collected on February 6, 2013.  We apply the
following post-processing rules to the output:
--Items with spaces, hyphens, or colons are removed, including multi-word verbs.
--If multiple inflections are present for a given attribute set, only the first
is retained.
--If multiple sets of inflections are present for a given base form, only the
first is retained.
--Entries that do not appear to be inflected forms (but instead descriptions,
errors, etc.) are filtered.
--Incomplete inflection tables (for instance, nouns that only appear in the
plural form) are filtered.