Talbanken is a widely used Swedish treebank; read more about its history and different versions here. This version originated as a copy of TalbankenSTB, but unlike the STB version, it is open to changes and corrections. It is also the version indexed by our search engine Korp. The changes we have made are listed in changelog.txt.
The following layers of annotation were added (or corrected) manually and can be considered gold data: tokenization, sentence segmentation, POS, MSD, dependency syntax (deprel and dephead).
Tokenization, sentence segmentation, POS and MSD follow the SUC format. Syntactic annotation follows the Mamba-Dep format, a conversion to dependency grammar of the MAMBA format used in the original Talbanken76.
Read more about these annotation layers in the documentation for TalbankenSTB or at Joakim Nivre's page: tokenization and sentence segmentation, POS and MSD, dependency syntax.
TalbankenSBX is provided in our standard XML format and in a (pseudo-)CONLLU format, where UPOS is POS in the SUC format, XPOS is POS+MSD, Feats are MSD converted to the UD/CONLLU standard, and Deprel is a Mamba-Dep relation. There are currently no text and SpaceAfter attributes.
You may convert our XML to this format yourself using the script in this repository.
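As a rough illustration of the column layout described above, the following sketch reads (pseudo-)CONLLU text into a list of sentences. It assumes the files use the standard ten tab-separated CONLLU columns and blank lines between sentences; this is not confirmed here, so check the actual files. The function name read_conllu and the sample tokens, tags and relations are invented for the example and are not taken from the treebank.

```python
from typing import Dict, List

# Standard CoNLL-U column order (assumed to apply to the pseudo-CONLLU files).
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Return a list of sentences, each a list of token dicts keyed by column."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Skip comment lines, if any are present.
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


# Invented two-token sample: UPOS holds the SUC POS tag, XPOS holds POS+MSD,
# FEATS holds UD-style features, DEPREL holds a Mamba-Dep-style relation.
SAMPLE = (
    "1\tHon\thon\tPN\tPN.UTR.SIN.DEF.SUB\t"
    "Case=Nom|Definite=Def|Gender=Com|Number=Sing\t2\tSS\t_\t_\n"
    "2\tsover\tsova\tVB\tVB.PRS.AKT\t"
    "Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act\t0\tROOT\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        for token in sentence:
            print(token["form"], token["upos"], token["head"], token["deprel"])
```

Keeping every field as a string avoids guessing at how values such as the head index or an empty "_" should be interpreted; convert them only where a downstream tool needs it.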
We provide two splits of TalbankenSBX. MorphSplit is used for POS-tagging purposes: the treebank is divided into two parts with the same number of sentences (the split is completely random, no blocks are used). One part is used as the development set, the other as the test set (SUC3 is the training set). You may resplit Talbanken yourself using the script in this repository.
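The idea behind a completely random, block-free split into two equal parts can be sketched as follows. This is only an illustration and not the script from the repository; the function name split_in_half and the seed parameter are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_in_half(sentences: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle sentences individually (no blocks) and cut the result in two.

    With an odd number of sentences the second part is one sentence longer.
    The seed is a hypothetical parameter that makes the split reproducible.
    """
    shuffled = list(sentences)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


if __name__ == "__main__":
    dev, test = split_in_half([f"sentence {i}" for i in range(10)], seed=1)
    print(len(dev), len(test))
```

Shuffling single sentences rather than contiguous blocks matches the description above, but it also means that sentences from the same text can end up in both parts.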
SyntSplit is used for dependency parsing: the treebank is divided into training, development and test sets. The training set is the same as the one in TalbankenSTB, whereas dev and test approximate dev and test in the UD version as closely as possible. SyntSplit is provided only in the CONLLU format.