We provide a model that enables lemmatization of Swedish text following the SUC3 standard. Note that SUC3 lemmatization does not exactly match the SALDO standard that is used in our Korp resources.
SUC3 was randomly split into training, validation and test sets (80:10:10). The model was trained for 30 epochs using the default Stanza settings. Accuracy on the test set is 99.18%.
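For readers who want to produce a comparable split of their own data, here is a minimal sketch of a random 80:10:10 partition. The function name split_sentences, the seed value and the integer rounding are assumptions made for illustration; the exact procedure and seed used for the released model are not documented here.

```python
import random


def split_sentences(sentences, seed=0):
    """Shuffle sentences and split them 80:10:10 into train, dev and test.

    Hypothetical helper: the seed and rounding are illustrative choices,
    not the settings used for the released model.
    """
    items = list(sentences)
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(items)
    n_train = len(items) * 80 // 100
    n_dev = len(items) * 10 // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    # Whatever remains after train and dev goes to test.
    test = items[n_train + n_dev:]
    return train, dev, test
```

Split at the sentence level (not the token level) so that no sentence is divided between two sets.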
Clone Stanza and install the necessary dependencies. We improved some of the shell scripts that are used to launch Stanza, and we strongly recommend that you download them from here and put them in stanza/scripts (replacing the original scripts if necessary).
Stanza was originally created for parsing UD treebanks, and it assumes that corpus names follow the UD conventions (even if the corpora do not follow the UD annotation scheme). For this reason, your files have to be placed in the folder stanza/corpora/UD_Language-Treebank, where Language is the language name and Treebank is the treebank name (e.g. UD_Swedish-Suc). The files have to be named lang_treebank-ud-set.conllu, where lang is a two-letter language code (sv), and set is train, dev or test (e.g. sv_suc-ud-train.conllu).
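The naming convention above can be sketched as a small path builder. The helper name conllu_path and its parameters are hypothetical, and lowercasing the treebank name in the file name is an assumption inferred from the sv_suc example.

```python
from pathlib import Path


def conllu_path(language, treebank, lang_code, split, root="stanza/corpora"):
    """Build the expected location of a CoNLL-U file (hypothetical helper)."""
    if split not in {"train", "dev", "test"}:
        raise ValueError(f"split must be train, dev or test, got {split!r}")
    # Folder: UD_Language-Treebank, e.g. UD_Swedish-Suc
    folder = f"UD_{language}-{treebank}"
    # File: lang_treebank-ud-set.conllu, e.g. sv_suc-ud-train.conllu
    # (assumption: the treebank name is lowercased, as in the example)
    filename = f"{lang_code}_{treebank.lower()}-ud-{split}.conllu"
    return Path(root) / folder / filename
```

For example, conllu_path("Swedish", "Suc", "sv", "train") points to stanza/corpora/UD_Swedish-Suc/sv_suc-ud-train.conllu.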
Use a Linux-like environment. GPU is strongly recommended.
Unzip the model and place the .pt file in stanza/saved_models/lemma. Run bash scripts/lemma.sh UD_Swedish-Suc to lemmatize a test set using a pretrained model. The output file will be created in the stanza/corpora folder.
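To check the output against a gold file, one option is to compare the LEMMA column (the third tab-separated field in CoNLL-U) token by token. The sketch below is an assumption about how such a check could look, with hypothetical function names; it is not the evaluation that Stanza itself runs, and it assumes both files contain the same tokens in the same order.

```python
def read_lemmas(lines):
    """Collect the LEMMA column from CoNLL-U lines (hypothetical helper)."""
    lemmas = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        lemmas.append(cols[2])
    return lemmas


def lemma_accuracy(gold_lines, pred_lines):
    """Return the percentage of tokens whose predicted lemma equals the gold lemma."""
    gold = read_lemmas(gold_lines)
    pred = read_lemmas(pred_lines)
    if len(gold) != len(pred):
        raise ValueError("gold and predicted files have different token counts")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)
```

Both arguments can be open file objects, since the functions only iterate over lines.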
Run bash scripts/run_lemma.sh UD_Swedish-Suc gold.