I. IDENTIFYING INFORMATION |
|
Title* |
SemEval 2020 Task 1
|
Subtitle |
Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection (repackaged for SuperLim)
|
Created by* |
Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi
|
Publisher(s)* |
Språkbanken Text (sb-info@svenska.gu.se)
|
Link(s) / permanent identifier(s)* |
Resource page in the SuperLim collection, containing the SuperLim-style documentation (this document): https://spraakbanken.gu.se/en/resources/semeval2020t1-superlim. Original resource page, containing the data: https://spraakbanken.gu.se/en/resources/semeval2020
|
License(s)* |
CC BY 4.0
|
Abstract* |
This data collection contains the Swedish test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection): a Swedish text corpus pair (one corpus covering the time period 1790-1830, the other 1895-1903) and 31 lemmas that have been annotated for their lexical semantic change between the two corpora.
|
Funded by* |
Swedish Research Council (dnr 2018-01184)
|
Cite as |
[1], [2]
|
Related datasets |
Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim).
|
|
|
II. USAGE |
|
Key applications |
Machine Learning, Semantic change detection, Evaluation of language models
|
Intended task(s)/usage(s) |
Evaluate models on the following tasks: (1) given the two corpora (for time periods t1 and t2) and the set of target words, decide which words lost or gained senses between t1 and t2 and which did not, as annotated by human judges; (2) given the two corpora, rank the set of target words according to their degree of lexical semantic change between t1 and t2, as annotated by human judges. A higher rank means stronger change.
|
Recommended evaluation measures |
(1) Accuracy; (2) Spearman's correlation coefficient (see [1] for a possible strategy of dealing with ties)
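A minimal sketch of how the two measures could be computed, assuming gold labels and predictions are held as dictionaries keyed by target word. The function names and the toy words (`word_a` etc.) are hypothetical, and ties are resolved here with average ranks, which is a common convention and not necessarily the strategy described in [1].

```python
from math import sqrt


def accuracy(gold, pred):
    """Fraction of target words whose predicted binary label equals the gold label."""
    words = list(gold)
    return sum(gold[w] == pred[w] for w in words) / len(words)


def average_ranks(values):
    """Assign ranks 1..n in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(gold, pred):
    """Spearman's correlation, computed as Pearson's correlation over average ranks."""
    words = list(gold)
    x = average_ranks([gold[w] for w in words])
    y = average_ranks([pred[w] for w in words])
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Toy data for illustration only (not taken from the dataset).
gold_binary = {"word_a": 1, "word_b": 0, "word_c": 1, "word_d": 0}
pred_binary = {"word_a": 1, "word_b": 0, "word_c": 0, "word_d": 0}
gold_graded = {"word_a": 0.9, "word_b": 0.1, "word_c": 0.5, "word_d": 0.3}
pred_graded = {"word_a": 0.7, "word_b": 0.2, "word_c": 0.6, "word_d": 0.4}

print("accuracy:", accuracy(gold_binary, pred_binary))
print("spearman:", spearman(gold_graded, pred_graded))
```

Note that `spearman` is undefined (division by zero) when all gold or all predicted scores are identical; a real evaluation script would need to handle that case explicitly.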
|
Dataset function(s) |
Testing
|
Recommended split(s) |
Test data only
|
|
|
III. DATA |
|
Primary data* |
Text
|
Language* |
Swedish
|
Dataset in numbers* |
31 target words; two corpora (71 million tokens for 1790-1830; 111 million tokens for 1895-1903).
|
Nature of the content* |
The list of target words and two corpora
|
Format* |
Corpora: plain text. Target word lists: two columns (word and label).
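A minimal sketch of reading a two-column target word list into a dictionary. It assumes one word per line with the columns separated by a tab and a numeric label; the delimiter, the function name `read_targets`, and the sample content are assumptions, so check them against the actual files before use.

```python
import io


def read_targets(handle, delimiter="\t"):
    """Read 'word<delimiter>label' lines into a dict mapping word -> float label."""
    labels = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split(delimiter)
        if len(parts) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        word, label = parts
        labels[word] = float(label)
    return labels


# Hypothetical file content, for illustration only.
sample = io.StringIO("word_a\t1\nword_b\t0\n")
targets = read_targets(sample)
print(targets)
```

With a real file, the same function can be called on an opened handle, for example `open(path, encoding="utf-8")`, where `path` points to the downloaded target list.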
|
Data source(s)* |
See [1]
|
Data collection method(s)* |
See [1]
|
Data selection and filtering* |
See [1]
|
Data preprocessing* |
See [1]
|
Data labeling* |
See [1]
|
Annotator characteristics |
See [1]
|
|
|
IV. ETHICS AND CAVEATS |
|
Ethical considerations |
|
Things to watch out for |
|
|
|
V. ABOUT DOCUMENTATION |
|
Data last updated* |
2020-02-19, v1.0
|
Which changes have been made, compared to the previous version* |
This is the first official version
|
Access to previous versions |
|
This document created* |
2021-06-15, Aleksandrs Berdicevskis
|
This document last updated* |
2021-06-15, Aleksandrs Berdicevskis
|
Where to look for further details |
[1], [2], [3]
|
Documentation template version* |
v1.0
|
|
|
VI. OTHER |
|
Related projects |
|
|
|
References |
[1] Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain, 2020. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.semeval-1.1/
[2] Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. https://zenodo.org/record/3730550#.YMiazfkzZPY
[3] SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. https://competitions.codalab.org/competitions/20948#learn_the_details-data
|