I. IDENTIFYING INFORMATION |
|
Title* |
Swedish analogy test set v1.0
|
Subtitle |
Swedish semantic and syntactic similarity test set
|
Created by* |
Tosin Adewumi (tosin.adewumi@ltu.se), ML Group, LTU
|
Publisher(s)* |
Språkbanken Text (sb-info@svenska.gu.se)
|
Link(s) / permanent identifier(s)* |
https://spraakbanken.gu.se/en/resources/analogy
|
License(s)* |
CC BY 4.0
|
Abstract* |
The Swedish analogy test set follows the format of the original Google version. However, it is larger and balanced across the 2 major categories: it has a total of 20,638 samples, made up of 10,381 semantic and 10,257 syntactic samples. It is also roughly balanced across the syntactic subsections. There are 5 semantic subsections and 6 syntactic subsections. The dataset was constructed partly from the samples in the English version, with the help of tools dedicated to Swedish translation, and it was proof-read for corrections by two native speakers (with a percentage agreement of 98.93%).
|
Funded by* |
Vinnova (grant no. 2019-02996)
|
Cite as |
[1]
|
Related datasets |
Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim).
|
|
|
II. USAGE |
|
Key applications |
Intrinsic evaluation of Swedish word embeddings
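A common way to use an analogy test set for intrinsic evaluation is vector arithmetic: for a sample of 4 words a, b, c, d, the word whose vector is closest to b - a + c (excluding a, b and c) is predicted, and the sample counts as correct if the prediction equals d. The following is a minimal sketch of that idea, not necessarily the procedure used in [1] or [2]. The function names, the 2-dimensional toy vectors and the single toy sample are hypothetical and are not taken from the dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def predict(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def accuracy(vectors, samples):
    """Share of correctly solved samples; samples with unknown words are skipped."""
    known = [s for s in samples if all(w in vectors for w in s)]
    if not known:
        return 0.0
    correct = sum(predict(vectors, a, b, c) == d for a, b, c, d in known)
    return correct / len(known)


# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "kung": [1.0, 1.0],
    "drottning": [1.0, -1.0],
    "man": [0.5, 1.0],
    "kvinna": [0.5, -1.0],
    "hund": [-1.0, 0.1],
}
toy_samples = [("man", "kvinna", "kung", "drottning")]
score = accuracy(toy_vectors, toy_samples)
```

Skipping samples that contain out-of-vocabulary words is one possible design choice; counting them as errors instead gives a stricter score, so reported results should state which convention was used.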
|
Intended task(s)/usage(s) |
|
Recommended evaluation measures |
|
Dataset function(s) |
Testing
|
Recommended split(s) |
Test set only
|
|
|
III. DATA |
|
Primary data* |
Text
|
Language* |
Swedish
|
Dataset in numbers* |
Total of 20,638 samples; 10,381 semantic samples and 10,257 syntactic samples
|
Nature of the content* |
Each sample contains 2 pairs of words that stand in the same semantic or syntactic relation. Hence, there are 4 related words per line.
|
Format* |
Follows the format of the original Google analogy test set: one sample per line, each consisting of 4 words (2 word pairs).
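Reading such a file can be sketched as follows. This assumes that, as in the original Google file, subsection headers are lines starting with a colon and every other non-empty line holds 4 whitespace-separated words; that layout should be checked against the downloaded file. The function name, the section names and the sample lines are hypothetical and are not quoted from the dataset.

```python
from collections import defaultdict


def parse_analogy_lines(lines):
    """Group 4-word samples by subsection header (lines starting with ':')."""
    sections = defaultdict(list)
    current = "unknown"  # used if samples appear before any header
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            continue
        words = line.split()
        if len(words) != 4:
            continue  # ignore malformed lines rather than guessing
        sections[current].append(tuple(words))
    return dict(sections)


# Hypothetical input, for illustration only.
sample_lines = [
    ": section-a",
    "man kvinna kung drottning",
    "",
    ": section-b",
    "bra bättre stor större",
    "ofullständig rad",
]
parsed = parse_analogy_lines(sample_lines)
```

For a real file, the same function can be applied to an open file object, for example `parse_analogy_lines(open(path, encoding="utf-8"))`, where `path` is wherever the dataset was saved.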
|
Data source(s)* |
Partly based on the English version by: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. New additions were made using the following online tools: https://bab.la and https://en.wiktionary.org/wiki/
|
Data collection method(s)* |
The dataset was compiled partly by translating part of the English version (Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.). Two native speakers of Swedish then proof-read the finished version, and the inter-annotator agreement score was calculated. An additional data source is en.wiktionary.org/wiki
|
Data selection and filtering* |
Does not apply
|
Data preprocessing* |
Does not apply
|
Data labeling* |
Does not apply
|
Annotator characteristics |
Two Swedish native speakers
|
|
|
IV. ETHICS AND CAVEATS |
|
Ethical considerations |
|
Things to watch out for |
|
|
|
V. ABOUT DOCUMENTATION |
|
Data last updated* |
2021-05-12
|
Which changes have been made, compared to the previous version* |
Some linguistic errors and typos in the previous version have been corrected by Lars Borin and Aleksandrs Berdicevskis.
|
Access to previous versions |
None
|
This document created* |
2021-05-20, Tosin Adewumi
|
This document last updated* |
2021-05-20, Tosin Adewumi
|
Where to look for further details |
[2],[1]
|
Documentation template version* |
v1.0
|
|
|
VI. OTHER |
|
Related projects |
|
|
|
References |
[1] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Corpora compared: The case of the Swedish Gigaword & Wikipedia corpora. arXiv preprint arXiv:2011.03281.
[2] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Exploring Swedish & English fastText Embeddings with the Transformer. arXiv preprint arXiv:2007.16007.
|