Text processing in historical texts

In many text processing tasks, features based on words are crucial. While it may be useful to preprocess the words carefully, e.g. stemming, normalization and stop word removal, and to select word weighting schemes, it is in general quite hard to beat simple bag-of-words baselines in many cases.

However, it is widely observed that purely lexical features lead to brittle NLP systems that are easily overfit to particular domains. Furthermore, in certain text types such as social media, the brittleness is particularly acute due to the very wide variation in style, vocabulary, and spelling. Twitter texts, for instance, are notorious in this respect due to their short format. Problems of variation of style and spelling is also an issue in other nonstandard text types such as historical texts. In NLP systems that process these types of texts, it may be necessary to combine lexical features with more abstract back-off features that generalize beyond the surface strings in order to achieve a higher degree of robustness, and learning methods may need to be adjusted in order to learn models with more redundancy.

A number of issues are problematic for annotation of Old Swedish texts. For example, sentence splitting cannot be handled with standard tools, as sentence boundaries are often not marked by punctuation or uppercase letters. Compared to modern Swedish texts, the Old Swedish texts have a different vocabulary and richer morphology, show a divergent word order, and Latin and German influences. Finally, the lack of a standardized orthography results in a wide variety of spellings for the same word.