I. IDENTIFYING INFORMATION
Title* Swedish analogy test set v1.0
Subtitle Swedish semantic and syntactic similarity test set
Created by* Tosin Adewumi (tosin.adewumi@ltu.se), ML Group, LTU
Publisher(s)* Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/en/resources/analogy
License(s)* CC BY 4.0
Abstract* The Swedish analogy test set follows the format of the original Google version. However, it is larger and balanced across the 2 major categories, with a total of 20,638 samples: 10,381 semantic and 10,257 syntactic. It is also roughly balanced across the syntactic subsections. There are 5 semantic subsections and 6 syntactic subsections. The dataset was constructed partly from the samples in the English version, with the help of tools dedicated to Swedish translation, and was proof-read for corrections by two native speakers (with a percentage agreement of 98.93%).
Funded by* Vinnova (grant no. 2019-02996)
Cite as [1]
Related datasets Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim).
II. USAGE
Key applications Intrinsic evaluation of Swedish word embeddings
Intended task(s)/usage(s)
Recommended evaluation measures
Dataset function(s) Testing
Recommended split(s) Test set only
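This sheet does not name an evaluation measure. A common choice for analogy sets in the Google format is accuracy under the vector-offset method (often called 3CosAdd): for a sample "a b c d", the word closest to b - a + c, excluding a, b and c, is taken as the prediction and compared with d. The following is a minimal sketch of that procedure, not a recommendation from the dataset authors. The function names, the policy of skipping samples with out-of-vocabulary words, and the toy 2-dimensional vectors are assumptions made for illustration only.

```python
import math
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Sample = Tuple[str, str, str, str]


def _unit(vec: Sequence[float]) -> List[float]:
    """Return vec scaled to unit length (unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def predict_fourth(a: str, b: str, c: str,
                   vectors: Dict[str, Sequence[float]]) -> Optional[str]:
    """Vector-offset prediction: the word most similar to b - a + c."""
    ua, ub, uc = (_unit(vectors[w]) for w in (a, b, c))
    target = [y - x + z for x, y, z in zip(ua, ub, uc)]
    best_word, best_score = None, -math.inf
    for word, vec in vectors.items():
        if word in (a, b, c):  # the three query words are excluded
            continue
        score = sum(t * x for t, x in zip(target, _unit(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def analogy_accuracy(samples: Iterable[Sample],
                     vectors: Dict[str, Sequence[float]]) -> Tuple[float, int]:
    """Return (accuracy, number of samples fully covered by the vocabulary).

    Assumption: samples containing an out-of-vocabulary word are skipped.
    """
    correct = covered = 0
    for a, b, c, d in samples:
        if any(w not in vectors for w in (a, b, c, d)):
            continue
        covered += 1
        if predict_fourth(a, b, c, vectors) == d:
            correct += 1
    return (correct / covered if covered else 0.0), covered


# Toy, hand-made vectors and samples (not taken from the dataset).
toy_vectors = {
    "man": [1.0, 0.0],
    "kvinna": [1.0, 1.0],
    "kung": [3.0, 0.0],
    "drottning": [3.0, 1.0],
    "hund": [-1.0, 0.2],
}
toy_samples = [
    ("man", "kvinna", "kung", "drottning"),
    ("man", "kvinna", "pojke", "flicka"),  # skipped: words missing from toy_vectors
]
accuracy, covered = analogy_accuracy(toy_samples, toy_vectors)
```

With real embeddings, a vectorised nearest-neighbour search would replace the inner loop. Reporting the number of covered samples next to the accuracy makes results comparable across vocabularies of different sizes.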
III. DATA
Primary data* Text
Language* Swedish
Dataset in numbers* Total of 20,638 samples; 10,381 semantic samples and 10,257 syntactic samples
Nature of the content* Each sample contains 2 pairs of words. Hence, there are 4 similar words per line.
Format* One sample per line, following the format of the original Google version. Each sample contains 2 pairs of words, so there are 4 words per line.
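As a rough illustration of the four-words-per-line layout, the sketch below groups samples by subsection. It assumes that subsection headers are lines beginning with ":", as in the original Google analogy file. This sheet only states that the Google format is followed, so that detail should be checked against the actual file. The function name, the subsection names and the example lines are hypothetical and not taken from the dataset.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, str, str, str]


def parse_analogy_lines(lines: Iterable[str]) -> Dict[str, List[Sample]]:
    """Group four-word samples under their subsection header.

    Assumption: a line starting with ':' opens a new subsection,
    as in the original Google analogy file.
    """
    sections: Dict[str, List[Sample]] = {}
    current = "unlabelled"  # used if samples appear before any header
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections.setdefault(current, [])
            continue
        words = line.split()
        if len(words) != 4:
            # Not a 2-pair sample; skipped here, but worth logging in practice.
            continue
        sections.setdefault(current, []).append(
            (words[0], words[1], words[2], words[3])
        )
    return sections


# Hypothetical example lines, made up for illustration.
example_lines = [
    ": capital-common-countries",
    "Stockholm Sverige Oslo Norge",
    "",
    ": gram-plural",
    "bil bilar hus hus",
    "only three words",
]
parsed = parse_analogy_lines(example_lines)
```

For a file on disk, the same function can be called on an open file handle, for example `parse_analogy_lines(open(path, encoding="utf-8"))`, where `path` is wherever the downloaded test set was saved.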
Data source(s)* Partly based on the English version by: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. New additions were made using the following online tools: https://bab.la and https://en.wiktionary.org/wiki/
Data collection method(s)* The set was compiled from a translated part of the English version (Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781), with en.wiktionary.org/wiki as an additional data source. Two native Swedish speakers then proof-read the finished version, and the inter-annotator agreement score was calculated.
Data selection and filtering* Does not apply
Data preprocessing* Does not apply
Data labeling* Does not apply
Annotator characteristics Two Swedish native speakers
IV. ETHICS AND CAVEATS
Ethical considerations
Things to watch out for
V. ABOUT DOCUMENTATION
Data last updated* 2021-05-12
Which changes have been made, compared to the previous version* Some linguistic errors and typos in the previous version have been corrected by Lars Borin and Aleksandrs Berdicevskis
Access to previous versions None
This document created* 2021-05-20, Tosin Adewumi
This document last updated* 2021-05-20, Tosin Adewumi
Where to look for further details [2],[1]
Documentation template version* v1.0
VI. OTHER
Related projects
References [1] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Corpora compared: The case of the Swedish Gigaword & Wikipedia corpora. arXiv preprint arXiv:2011.03281. [2] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Exploring Swedish & English fastText embeddings with the Transformer. arXiv preprint arXiv:2007.16007.