Models

We provide a model for lemmatizing Swedish text according to the SUC3 standard. Note that SUC3 lemmatization does not exactly match the SALDO standard used in our Korp resources.

SUC3 was randomly split into training, validation and test sets (80:10:10). The model was trained for 30 epochs using the default Stanza settings. Its accuracy on the test set is 99.18%.
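For readers who want to prepare a comparable split of their own corpus, the following is a minimal sketch of a random 80:10:10 split. It is not the script that was used for SUC3: the function name split_sentences, the fixed seed and the choice of the sentence as the unit of splitting are all assumptions made for illustration.

```python
import random


def split_sentences(sentences, seed=0, percents=(80, 10, 10)):
    """Shuffle the sentences and cut them into train/dev/test parts.

    Hypothetical helper: the seed and the sentence-level granularity
    are assumptions, not details taken from the SUC3 setup.
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    items = list(sentences)
    # A private Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(items)
    n_train = len(items) * percents[0] // 100
    n_dev = len(items) * percents[1] // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    # Whatever remains after train and dev goes to the test set.
    test = items[n_train + n_dev:]
    return train, dev, test
```

Each part can then be written to its own .conllu file, with a blank line after every sentence.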

Lemmatizing and training

Clone Stanza and install the necessary dependencies. We improved some of the shell scripts that are used to launch Stanza, and we strongly recommend that you download them from here and put them in stanza/scripts (replacing the original scripts if necessary).

Stanza was originally created for parsing UD treebanks, and it assumes that corpus names follow the UD conventions (even if the corpora themselves do not follow the UD annotation scheme). For this reason, your files have to be placed in the folder stanza/corpora/UD_Language-Treebank, where Language is the language name and Treebank is the treebank name (e.g. UD_Swedish-Suc). The files have to be named lang_treebank-ud-set.conllu, where lang is the two-letter language code (e.g. sv), treebank is the treebank name (e.g. suc) and set is train, dev or test (e.g. sv_suc-ud-train.conllu).

Use a Linux-like environment. A GPU is strongly recommended.
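The folder and file naming convention described above can be sketched as a small path builder. The helper corpus_file is hypothetical and not part of Stanza. Lowercasing the treebank name in the file name is inferred from the sv_suc example rather than stated as a rule.

```python
from pathlib import Path


def corpus_file(stanza_root, language, treebank, lang_code, split):
    """Build the path where a CoNLL-U file is expected to live.

    Hypothetical helper that only mirrors the naming convention
    described in this README.
    """
    if split not in ("train", "dev", "test"):
        raise ValueError("split must be 'train', 'dev' or 'test'")
    # Folder: stanza/corpora/UD_Language-Treebank
    folder = Path(stanza_root) / "corpora" / f"UD_{language}-{treebank}"
    # File: lang_treebank-ud-set.conllu (treebank lowercased, as in sv_suc)
    name = f"{lang_code}_{treebank.lower()}-ud-{split}.conllu"
    return folder / name


if __name__ == "__main__":
    for part in ("train", "dev", "test"):
        print(corpus_file("stanza", "Swedish", "Suc", "sv", part))
```

Printing the three paths before copying the files is a cheap way to check that the names match what the scripts will look for.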

Lemmatizing

Unzip the model and place the .pt file in stanza/saved_models/lemma. Then run bash scripts/lemma.sh UD_Swedish-Suc to lemmatize the test set with the pretrained model. The output file will be created in the stanza/corpora folder.
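To inspect the result, the lemmas can be read from the third column (LEMMA) of the CoNLL-U output and compared with the gold file. The sketch below is an illustration only: read_lemmas and lemma_accuracy are hypothetical helpers, this is not necessarily how Stanza computes its own score, and the name of the output file is not specified here, so the functions take already opened lines rather than a path.

```python
def read_lemmas(lines):
    """Collect the LEMMA column (third field) of each word line."""
    lemmas = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        lemmas.append(cols[2])
    return lemmas


def lemma_accuracy(gold_lines, pred_lines):
    """Return the percentage of tokens whose lemmas match exactly."""
    gold = read_lemmas(gold_lines)
    pred = read_lemmas(pred_lines)
    if not gold or len(gold) != len(pred):
        raise ValueError("files must contain the same non-zero number of tokens")
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)
```

With two files on disk, the functions can be called as lemma_accuracy(open(gold_path, encoding="utf-8"), open(pred_path, encoding="utf-8")), where gold_path and pred_path are placeholders for your own file locations.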

Training your own models

Run bash scripts/run_lemma.sh UD_Swedish-Suc gold.