The Swedish Culturomics Gigaword Corpus
=======================================
Current Web home: https://spraakbanken.gu.se/resource/gigaword
The Swedish Culturomics Gigaword Corpus, a one-billion-word dataset, was
compiled from various sources included in Korp. Korp is a collection of
corpora and tools for processing them, hosted and managed by Språkbanken
(Gothenburg University).
This dataset contains annotated contemporary Swedish text from a balanced mix of
sources and time periods (1950-2015). A table providing an overview of the
sources can be found below.
The dataset is distributed as a series of bzip2-compressed XML files. Each
file contains texts from a single decade and holds up to one million
sentences. The files are sorted into subfolders by decade.
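
As a rough illustration of working with this layout, the following Python
sketch walks the decade subfolders and stream-decompresses a file without
unpacking it to disk. The '.bz2' suffix, the UTF-8 encoding and the function
names are assumptions made for this example; they are not taken from the
distributed code.

```python
import bz2
from pathlib import Path


def iter_corpus_files(corpus_root):
    """Yield (subfolder name, path) for every .bz2 file below corpus_root."""
    # Assumption: the compressed files end in ".bz2" and sit in decade subfolders.
    for path in sorted(Path(corpus_root).rglob("*.bz2")):
        yield path.parent.name, path


def count_lines(path):
    """Stream-decompress one file and count its lines without loading it whole."""
    # Assumption: the XML is UTF-8 encoded.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _line in handle)
```

Reading through bz2.open keeps memory use flat, which matters for files that
hold up to one million sentences each.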
The XML is generally nested: each text element contains sentences, which in
turn contain the individual annotated words.
The text element contains metadata that depends on the source. We have added
two standardised attributes to each text element: a four-digit 'year' attribute
and a 'genre' attribute, which is one of the following:
* fiction
* government
* news
* science
* socialmedia
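
The 'year' and 'genre' attributes make it straightforward to select texts
while streaming through a file. The sketch below is one possible way to do
this with the Python standard library; it is not the extraction code that
ships with the dataset. It assumes the element is literally named 'text', and
the function name and parameters are hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET

GENRES = {"fiction", "government", "news", "science", "socialmedia"}


def iter_texts(path, genre=None, year_range=None):
    """Yield text elements from one bzip2-compressed file, optionally filtered."""
    if genre is not None and genre not in GENRES:
        raise ValueError(f"unknown genre: {genre!r}")
    with bz2.open(path, "rb") as handle:
        # "end" events fire once an element and all of its children are parsed.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "text":  # assumption: the element is named "text"
                continue
            keep = genre is None or elem.get("genre") == genre
            if keep and year_range is not None:
                first, last = year_range
                keep = first <= int(elem.get("year")) <= last
            if keep:
                yield elem
            # Drop children and attributes to limit memory use.
            elem.clear()
```

Because each element is cleared as soon as the generator advances, it should
be processed inside the loop rather than collected into a list first.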
Each word in the dataset is annotated with syntactic, morphological and
semantic information. We provide code to extract the following data:
* Plain (the original text)
* Lemma (each word replaced by its lemma when possible)
* Word sense (as classified by SALDO)
* Lemgram (including part-of-speech)
Whenever a lemma/sense/lemgram doesn't exist, the word itself is substituted
(e.g. for punctuation or unknown words). If more than one lemma/sense/lemgram
is possible, the code will by default output a list of all possibilities.
It is also possible to get only the first possibility.
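
The fallback and selection rules above can be sketched as a small Python
function. This is an illustration only, not the code described in
README.code: the function name is hypothetical, and it assumes that multiple
candidates are stored in a single attribute value separated by '|'
characters.

```python
def resolve(word, annotation, first_only=False):
    """Return the candidate annotations for a word as a list.

    Falls back to the word itself when no lemma/sense/lemgram is present.
    """
    # Assumption: candidates are "|"-delimited, e.g. "|a|b|"; empty parts are dropped.
    candidates = [value for value in (annotation or "").split("|") if value]
    if not candidates:
        return [word]  # nothing annotated: substitute the word
    if first_only:
        return candidates[:1]  # keep only the first possibility
    return candidates
```

Returning a list in every case keeps the caller's handling uniform, whether a
word has zero, one or several candidates.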
Additionally, you can filter on genre, and multi-word expressions can be
contracted. For more information on how to extract the data, see README.code.
TODO: A resource-intensive method is used to find the most likely multi-word
expressions. This means that some very long sentences cannot be analysed in a
reasonable time. Currently, we simply skip these sentences altogether if --mwe
mode is used. In a future version, we will implement some pre-processing with
an alternative heuristic to handle these cases.
An overview of the sources used in the dataset:
!==============================================================================================!
| Source                               | Genre        | Time period |      Tokens |  Sentences |
|----------------------------------------------------------------------------------------------|
| Bonniersromaner                      | Fiction      | 1976-1981   |  10,884,795 |    806,627 |
| Norstedtsromaner                     | Fiction      | 1999        |   2,534,307 |    194,699 |
| SALT svenska-nederländska            | Fiction      | 1980-1989   |   1,335,455 |     96,995 |
| SUC-romaner                          | Fiction      | 1990-1999   |   4,653,784 |    330,127 |
| Smittskydd                           | Government   | 2000-2009   |     691,716 |     41,066 |
| Statens offentliga utredningar       | Government   | 1950-1999   |  50,000,071 |  2,391,382 |
| Svensk författningssamling           | Government   | 1990-1999   |   8,335,298 |    277,030 |
| Svenska partiprogram och valmanifest | Government   | 2000-2009   |     821,777 |     50,684 |
| 8 Sidor                              | News         | 2000-2009   |     678,766 |     59,236 |
| Dagens Nyheter                       | News         | 1987        |   5,122,237 |    364,226 |
| Göteborgsposten                      | News         | 1994-2013   | 271,239,984 | 18,935,974 |
| Press 65-98                          | News         | 1965-1998   |  41,177,162 |  2,891,152 |
| Webbnyheter                          | News         | 2001-2013   | 271,806,921 | 15,112,300 |
| DiabetologNytt                       | Science      | 1996-1999   |     228,398 |     14,129 |
| Forskning & Framsteg                 | Science      | 1990-1999   |     744,000 |     44,538 |
| Humaniora                            | Science      | 2010-2015   |  14,437,043 |    673,820 |
| Läkartidningen                       | Science      | 1996-2005   |  19,471,910 |  1,085,785 |
| Samhällsvetenskap                    | Science      | 2000-2009   |  10,873,267 |    523,102 |
| Svenska Wikipedia                    | Science      | 2015        | 152,333,391 |  5,972,649 |
| Bloggmix                             | Social media | 1998-2015   |  35,253,548 |  2,254,343 |
| Familjeliv                           | Social media | 2000-2015   |  68,011,169 |  4,521,566 |
| Flashback                            | Social media | 2000-2015   |  45,000,152 |  3,095,212 |
!==============================================================================================!
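
To see how the token counts in the table are distributed over genres, they
can be summed with a few lines of Python. The figures below are copied from
the table; the variable and function names are only illustrative.

```python
from collections import defaultdict

# (genre, tokens) pairs copied from the overview table, one per source.
TOKENS = [
    ("Fiction", 10_884_795), ("Fiction", 2_534_307),
    ("Fiction", 1_335_455), ("Fiction", 4_653_784),
    ("Government", 691_716), ("Government", 50_000_071),
    ("Government", 8_335_298), ("Government", 821_777),
    ("News", 678_766), ("News", 5_122_237), ("News", 271_239_984),
    ("News", 41_177_162), ("News", 271_806_921),
    ("Science", 228_398), ("Science", 744_000), ("Science", 14_437_043),
    ("Science", 19_471_910), ("Science", 10_873_267), ("Science", 152_333_391),
    ("Social media", 35_253_548), ("Social media", 68_011_169),
    ("Social media", 45_000_152),
]


def tokens_by_genre(rows):
    """Sum the token counts per genre."""
    totals = defaultdict(int)
    for genre, tokens in rows:
        totals[genre] += tokens
    return dict(totals)


totals = tokens_by_genre(TOKENS)
grand_total = sum(totals.values())
# Print genres from largest to smallest, with their share of all tokens.
for genre, tokens in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{genre:<12} {tokens:>13,} {tokens / grand_total:6.1%}")
print(f"{'Total':<12} {grand_total:>13,}")
```

The grand total comes out slightly above one billion tokens, in line with the
description of the corpus at the top of this file.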