Alignment of Word Forms

Kimmo Koskenniemi
University of Helsinki

This presentation points out the importance of proper character-by-character alignment of word forms. Alignment is needed when matching different stems of the same lexeme, relating semantically similar words in related languages, normalizing dialectal or old texts, and even when correcting spelling or OCR errors. Characters represent phonemes more or less directly. Forms are often aligned using simple edit distances, but the similarities and differences between phonemes can be described more accurately with phonetic features such as those used in the International Phonetic Alphabet (IPA). Corresponding word forms can be aligned, e.g., using weighted finite-state transducers (WFSTs) that implement distances based on phonetic features. The result is a list of alignments in which the best candidates come first. Choosing an alignment determines the character correspondences that the rules must describe. Given an alignment, phonological two-level rules are easier to discover than conventional string rewrite rules (such as those in XFST or Foma). This is because each two-level rule can be discovered separately, whereas rewrite rules depend on each other and, in particular, on the order in which they are applied. Good alignment is, of course, also vital for statistical methods that describe the character correspondences and, similarly, for methods that find the smallest finite-state transducers which accept the good examples and reject the bad ones. Alignment provides both types of examples, so many types of rules and rule discovery procedures could be used.
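
The idea of replacing a plain edit distance with a feature-based one can be illustrated with a minimal sketch. It is not the WFST implementation referred to above: it uses ordinary dynamic programming, returns only the single best alignment rather than a ranked list, and relies on a small, hypothetical feature table (FEATURES), a hypothetical gap cost (GAP_COST), and a zero symbol (GAP) that are all assumptions made for illustration only.

```python
# Hypothetical, heavily simplified feature sets (assumption; not a full IPA description).
FEATURES = {
    "a": {"vowel", "open", "back"},
    "e": {"vowel", "mid", "front"},
    "i": {"vowel", "close", "front"},
    "o": {"vowel", "mid", "back", "round"},
    "u": {"vowel", "close", "back", "round"},
    "k": {"cons", "stop", "velar"},
    "p": {"cons", "stop", "labial"},
    "t": {"cons", "stop", "alveolar"},
    "d": {"cons", "stop", "alveolar", "voiced"},
    "s": {"cons", "fricative", "alveolar"},
}
GAP = "Ø"        # zero symbol marking an insertion or a deletion (assumption)
GAP_COST = 1.5   # illustrative weight, not taken from the source


def sub_cost(a, b):
    """Cost of letting character a correspond to character b."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 2.0  # unknown symbols: fall back to a flat cost
    if ("vowel" in fa) != ("vowel" in fb):
        return 3.0  # discourage vowel/consonant correspondences
    # Share of features that differ (Jaccard distance), between 0 and 1.
    return len(fa ^ fb) / len(fa | fb)


def align(x, y):
    """Return (total cost, list of character pairs) for one best alignment."""
    n, m = len(x), len(y)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i * GAP_COST, "del"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j * GAP_COST, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j], back[i][j] = min(
                (cost[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]), "sub"),
                (cost[i - 1][j] + GAP_COST, "del"),
                (cost[i][j - 1] + GAP_COST, "ins"),
            )
    # Trace back from the final cell to recover the character pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "sub":
            pairs.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif move == "del":
            pairs.append((x[i - 1], GAP))
            i -= 1
        else:
            pairs.append((GAP, y[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


# Two stems of the Finnish lexeme "katu" (street): katu, kadu-n.
total, pairs = align("katu", "kadu")
print(total, " ".join(f"{a}:{b}" for a, b in pairs))  # 0.25 k:k a:a t:d u:u
```

With these assumed weights, t:d is cheap because the two sounds differ only in voicing, so the correspondence pair t:d falls out of the alignment and could then be described by a rule. When several alignments have the same cost, as with kauppa and kaupa, this sketch returns just one of them, which is exactly the situation where a ranked list of candidates would be more useful.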