I. IDENTIFYING INFORMATION
Title* SweParaphrase v1.0
Subtitle Sentence-level semantic similarity dataset (a subset of the Swedish STS Benchmark).
Created by* Dana Dannélls (dana.dannells@svenska.gu.se)
Publisher(s)* Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/en/resources/sweparaphrase
License(s)* CC BY 4.0
Abstract* SweParaphrase is a subset of the automatically translated Swedish Semantic Textual Similarity dataset (Isbister and Sahlgren, 2020). It consists of 165 manually corrected Swedish sentence pairs, paired with the original English sentences and their similarity scores, which range from 0 (no meaning overlap) to 5 (meaning equivalence). The scores were taken from the English data, where they were assigned via crowdsourcing on Amazon Mechanical Turk. Each sentence pair belongs to one genre (e.g. news, forums or captions). The task is to determine how similar two sentences are.
Funded by* Vinnova (grant no. 2020-02523)
Cite as
Related datasets Part of the SuperLim collection. Created from the development set of the automatically translated Swedish STS Benchmark (https://github.com/timpal0l/sts-benchmark-swedish). The English source: http://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark.
II. USAGE
Key applications Machine translation, question answering, information retrieval, text classification, semantic parsing, evaluation of language models.
Intended task(s)/usage(s) Given two sentences, determine how similar they are.
Recommended evaluation measures Pearson correlation coefficient or alternative measures (see the sketch at the end of this section).
Dataset function(s) Testing
Recommended split(s) Test data only.
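
For illustration, a minimal sketch of the recommended evaluation, assuming gold similarity scores and model predictions are available as parallel lists (the numbers below are made up; scipy is used for the Pearson correlation):

    from scipy.stats import pearsonr

    # Hypothetical gold similarity scores (0-5) and model predictions.
    gold = [4.5, 2.0, 0.0, 3.75, 5.0]
    pred = [4.1, 2.4, 0.6, 3.90, 4.7]

    # Pearson correlation between gold scores and predictions.
    r, p = pearsonr(gold, pred)
    print(f"Pearson r = {r:.3f} (p = {p:.3g})")
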
III. DATA
Primary data* Text
Language* Swedish
Dataset in numbers* 165 sentence pairs; 3 genres; 9 sources.
Nature of the content* Each pair belongs to one genre (e.g. news, forums or captions) and is linked to a source file (e.g. headlines, answers-forums, images). The English pairs from which the Swedish sentences were translated are also included.
Format*

The downloadable 'sweparaphrase-dev-165.csv' file contains 8 tab-separated columns (a loading sketch in Python follows the list):
(1) Sentence ID from the automatically translated Swedish dataset;
(2) Genre from source (captions, news, forum);
(3) File from source (images, headlines, answers);
(4) and (5) manually corrected Swedish sentence pairs;
(6) Similarity score from source (assigned to the English sentence pairs via crowdsourcing on Amazon Mechanical Turk);
(7) and (8) English sentence pairs from source.
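
A minimal loading sketch, assuming the eight-column tab-separated layout above and a file without a header row (both the no-header assumption and the field names below are ours, not part of the distribution):

    import csv

    # Illustrative field names following the documented column order.
    FIELDS = ["pair_id", "genre", "file", "sentence_1_sv", "sentence_2_sv",
              "score", "sentence_1_en", "sentence_2_en"]

    with open("sweparaphrase-dev-165.csv", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, fieldnames=FIELDS, delimiter="\t")
        for row in reader:
            # Similarity score in [0, 5], inherited from the English source data.
            score = float(row["score"])
            print(f"{score:.2f}\t{row['sentence_1_sv']}\t{row['sentence_2_sv']}")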

Data source(s)*

The original STS benchmark comprises 8628 sentence pairs, collected from SemEval 2012 (task 6), 2014 (task 10), 2015 (task 2), 2016 (task 1), 2017 (task 1) and *SEM 2013.

Data collection method(s)*

Isbister and Sahlgren (2020) [1] machine-translated the complete English STS-B (http://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark#Reference). The original English set was collected from datasets from the SemEval shared tasks.

Data selection and filtering*

This subset is taken from the automatically translated version of STS-B. First, we focused only on the development set. Second, we selected only sentence pairs whose translations were deemed accurate.

Data preprocessing*

English sentence pairs were tab-separated. Large chunks of text appearing after the full stop of a sentence were removed. Similarity scores with more than four decimal places were shortened.

Data labeling*

No additional labeling was added. In the English version, each sentence pair is annotated with a similarity score (0-5). This annotation was done via crowdsourcing on Amazon Mechanical Turk, and the scores were assigned to the source English pairs.

Annotator characteristics

Native speaker of Swedish; fluent non-native speaker of Swedish.

IV. ETHICS AND CAVEATS

Ethical considerations

Things to watch out for

The similarity scores are based on the English data and are not necessarily representative of the Swedish counterparts.

V. ABOUT DOCUMENTATION

Data last updated*

2021-05-31, v1.0

Which changes have been made, compared to the previous version*

This is the first official version.

Access to previous versions

This document created*

2021-05-31, Dana Dannélls

This document last updated*

2021-06-07, Dana Dannélls

Where to look for further details

[1],[2],[3],[4]

Documentation template version*

v1.0

VI. OTHER

Related projects

Language models for Swedish authorities (Vinnova grant no. 2019-02996)

References

[1] Isbister, T. and Sahlgren, M. (2020): Why not simply translate? A first Swedish evaluation benchmark for semantic similarity. Proceedings of the Eighth Swedish Language Technology Conference (SLTC), University of Gothenburg. https://gubox.box.com/v/SLTC-2020-paper-15. The automatically translated dataset: https://svn.spraakdata.gu.se/sb-arkiv/pub/sweparaphrase/stsb-mt-sv.zip

[2] Adesam, Y., Berdicevskis, A., and Morger, F. (2020): SwedishGLUE – Towards a Swedish Test Set for Evaluating Natural Language Understanding Models. University of Gothenburg. https://gupea.ub.gu.se/bitstream/2077/67179/1/gupea_2077_67179_1.pdf

[3] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018): GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. https://arxiv.org/pdf/1804.07461.pdf

[4] Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. (2017): SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). https://www.aclweb.org/anthology/S17-2001.pdf