I. IDENTIFYING INFORMATION
Title* SemEval 2020 Task 1
Subtitle Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection (repackaged for SuperLim)
Created by* Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi
Publisher(s)* Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)* Resource page in the SuperLim collection, containing the SuperLim-style documentation (this document): https://spraakbanken.gu.se/en/resources/semeval2020t1-superlim. Original resource page, containing the data: https://spraakbanken.gu.se/en/resources/semeval2020
License(s)* CC BY 4.0
Abstract* This data collection contains the Swedish test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. It consists of a Swedish text corpus pair (one corpus covering the time period 1790-1830, the other 1895-1903) and 31 lemmas that have been annotated for lexical semantic change between the two corpora.
Funded by* Swedish Research Council (dnr 2018-01184)
Cite as [1], [2]
Related datasets Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim).
II. USAGE
Key applications Machine Learning, Semantic change detection, Evaluation of language models
Intended task(s)/usage(s) Evaluate models on the following tasks: (1) given the two corpora (for time periods t1 and t2) and the set of target words, decide which words lost or gained senses between t1 and t2 and which did not, as annotated by human judges; (2) given the two corpora, rank the set of target words according to their degree of lexical semantic change between t1 and t2, as annotated by human judges. A higher rank means stronger change.
Recommended evaluation measures (1) Accuracy; (2) Spearman's correlation coefficient (see [1] for a possible strategy of dealing with ties)
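The two recommended measures can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring script: the function names are hypothetical, and tied values are given the average of their rank positions, which is one common convention and not necessarily the strategy described in [1].

```python
from math import sqrt


def accuracy(gold, pred):
    """Share of target words whose predicted binary label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions.

    Assumption: average ranking of ties, chosen here for illustration only.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(gold, pred):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(gold) != len(pred) or len(gold) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rg, rp = average_ranks(gold), average_ranks(pred)
    n = len(rg)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / sqrt(var_g * var_p)
```

For subtask (1), gold and pred would hold one binary label per target word; for subtask (2), one change score per target word, in the same word order on both sides.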
Dataset function(s) Testing
Recommended split(s) Test data only
III. DATA
Primary data* Text
Language* Swedish
Dataset in numbers* 31 target words; two corpora (71 million tokens for 1790-1830; 111 million tokens for 1895-1903).
Nature of the content* The list of target words and two corpora
Format* Corpora: plain text. Target word lists: two columns (word and label).
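A two-column target word list of this kind could be read as sketched below. The tab delimiter, the helper name read_targets and the toy entries are assumptions made for illustration; check the actual files for the separator and label type they use.

```python
import csv
import io


def read_targets(handle, delimiter="\t", cast=float):
    """Read 'word<delimiter>label' lines into a dict mapping word to label.

    Assumption: one target word per line, label in the second column.
    """
    targets = {}
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected two columns, got: {row!r}")
        targets[row[0]] = cast(row[1])
    return targets


# Toy input with placeholder words, not taken from the dataset.
sample = io.StringIO("word_a\t1\nword_b\t0\n")
labels = read_targets(sample, cast=int)
```

Passing cast=int suits binary labels and the default cast=float suits graded change scores.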
Data source(s)* See [1]
Data collection method(s)* See [1]
Data selection and filtering* See [1]
Data preprocessing* See [1]
Data labeling* See [1]
Annotator characteristics See [1]
IV. ETHICS AND CAVEATS
Ethical considerations
Things to watch out for
V. ABOUT DOCUMENTATION
Data last updated* 2020-02-19, v1.0
Which changes have been made, compared to the previous version* This is the first official version
Access to previous versions
This document created* 2021-06-15, Aleksandrs Berdicevskis
This document last updated* 2021-06-15, Aleksandrs Berdicevskis
Where to look for further details [1], [2], [3]
Documentation template version* v1.0
VI. OTHER
Related projects
References [1] Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain, 2020. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.semeval-1.1/
[2] Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection, https://zenodo.org/record/3730550#.YMiazfkzZPY
[3] SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection https://competitions.codalab.org/competitions/20948#learn_the_details-data