The Swedish Culturomics Gigaword Corpus
=======================================

Current Web home: https://spraakbanken.gu.se/resource/gigaword

The Swedish Culturomics Gigaword Corpus, a one billion word dataset, was
compiled from various sources included in Korp. Korp is a collection of
corpora and tools to process them, hosted and managed by Språkbanken
(Gothenburg University).

This dataset contains annotated contemporary Swedish text from a balanced mix
of sources and time periods (1950-2015). A table providing an overview of the
sources can be found below.

The dataset is distributed as a series of bzip2-compressed XML files. Each
file contains texts from a single decade and holds up to one million
sentences. The files are sorted into subfolders by decade.

The XML generally follows this structure: , , , . The metadata on the text
element varies by source. We have added two standardised attributes to each
text element: a four-digit 'year' attribute and a 'genre' attribute, which is
one of the following:

* fiction
* government
* news
* science
* socialmedia

Each word in the dataset is annotated with syntactic, morphological and
semantic information. We provide code to extract the following data:

* Plain (the original text)
* Lemma (each word replaced by its lemma when possible)
* Word sense (as classified by SALDO)
* Lemgram (including part-of-speech)

Whenever a lemma/sense/lemgram doesn't exist, it is substituted by the word
itself (e.g. for punctuation or unknown words). If more than one
lemma/sense/lemgram is possible, the code by default outputs a list of all
possibilities. It is also possible to get only the first possibility.
Additionally, you can filter on genre, and multi-word expressions can be
contracted.

For more information on how to extract the data, see README.code

TODO: A resource-intensive method is used to find the most likely multi-word
expressions.
This means that some very long sentences cannot be analysed in a reasonable
time. Currently, we simply skip these sentences altogether if --mwe mode is
used. In a future version, we will implement some pre-processing with an
alternative heuristic to handle these cases.

An overview of the sources used in the dataset:

!==============================================================================================!
| Source                               | Genre        | Time period |      Tokens |  Sentences |
|----------------------------------------------------------------------------------------------|
| Bonniersromaner                      | Fiction      | 1976-1981   |  10,884,795 |    806,627 |
| Norstedtsromaner                     | Fiction      | 1999        |   2,534,307 |    194,699 |
| SALT svenska-nederländska            | Fiction      | 1980-1989   |   1,335,455 |     96,995 |
| SUC-romaner                          | Fiction      | 1990-1999   |   4,653,784 |    330,127 |
| Smittskydd                           | Government   | 2000-2009   |     691,716 |     41,066 |
| Statens offentliga utredningar       | Government   | 1950-1999   |  50,000,071 |  2,391,382 |
| Svensk författningssamling           | Government   | 1990-1999   |   8,335,298 |    277,030 |
| Svenska partiprogram och valmanifest | Government   | 2000-2009   |     821,777 |     50,684 |
| 8 Sidor                              | News         | 2000-2009   |     678,766 |     59,236 |
| Dagens Nyheter                       | News         | 1987        |   5,122,237 |    364,226 |
| Göteborgsposten                      | News         | 1994-2013   | 271,239,984 | 18,935,974 |
| Press 65-98                          | News         | 1965-1998   |  41,177,162 |  2,891,152 |
| Webbnyheter                          | News         | 2001-2013   | 271,806,921 | 15,112,300 |
| DiabetologNytt                       | Science      | 1996-1999   |     228,398 |     14,129 |
| Forskning & Framsteg                 | Science      | 1990-1999   |     744,000 |     44,538 |
| Humaniora                            | Science      | 2010-2015   |  14,437,043 |    673,820 |
| Läkartidningen                       | Science      | 1996-2005   |  19,471,910 |  1,085,785 |
| Samhällsvetenskap                    | Science      | 2000-2009   |  10,873,267 |    523,102 |
| Svenska Wikipedia                    | Science      | 2015        | 152,333,391 |  5,972,649 |
| Bloggmix                             | Social media | 1998-2015   |  35,253,548 |  2,254,343 |
| Familjeliv                           | Social media | 2000-2015   |  68,011,169 |  4,521,566 |
| Flashback                            | Social media | 2000-2015   |  45,000,152 |  3,095,212 |
!==============================================================================================!
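
As a rough illustration of the file format described earlier, the following
Python sketch streams one bzip2-compressed XML file and counts its texts per
year and genre. It relies only on the text element and its 'year' and 'genre'
attributes. The function names and the file path in the usage note are
hypothetical, and this is not the extraction code documented in README.code.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def count_texts(source, genres=None):
    """Count <text> elements per (year, genre) pair.

    `source` is a file name or a binary file object containing XML.
    `genres` is an optional set of genre values to keep, e.g. {"news"}.
    """
    counts = Counter()
    # iterparse reads the document incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "text":
            genre = elem.get("genre")
            if genres is None or genre in genres:
                counts[(elem.get("year"), genre)] += 1
            # Drop the children of the finished text to keep memory use low.
            elem.clear()
    return counts


def count_texts_in_file(path, genres=None):
    """Same as count_texts, but decompresses a .bz2 file on the fly."""
    with bz2.open(path, "rb") as handle:
        return count_texts(handle, genres)
```

For example, count_texts_in_file("1980/example.xml.bz2", genres={"news"})
would return a Counter keyed by (year, genre) for the news texts in that
(hypothetical) file.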