Beyond Word N-Grams: Two and a Half Case Studies of Structural Models for Text Classification and Analysis

Caroline Sporleder, University of Trier

Lexical models based on word n-grams have been shown to perform very well on various text classification and analysis tasks, such as author, genre or topic detection. Yet lexical models capture only some aspects of a text; structural properties remain largely unaccounted for. In this talk, I will present two and a half case studies that investigate the usefulness of deeper, more structure-oriented models for text classification and analysis. The first study analyses the structure of pop song lyrics in order to detect their genre and age and to distinguish 'good' from 'bad' lyrics. The second and third studies use social networks to detect the author and genre of works of fiction and to mine historical newspaper archives.