The Swedish Culturomics Gigaword Corpus
=======================================
                  Current Web home: https://spraakbanken.gu.se/resource/gigaword

The Swedish Culturomics Gigaword Corpus, a one-billion-word dataset, was
compiled from various sources included in Korp. Korp is a collection of
corpora and tools for processing them, hosted and managed by Språkbanken
(University of Gothenburg): <http://spraakbanken.gu.se/korp/>

This dataset contains annotated contemporary Swedish text from a balanced mix of
sources and time periods (1950-2015). A table providing an overview of the
sources can be found below.

The dataset is distributed as a series of bzip2-compressed XML files. Each file
contains texts from a single decade and holds up to one million sentences. The
files are sorted into subfolders by decade.
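
Because the files are large, it can help to read them as a stream instead of
decompressing them first. The following is a minimal sketch using only the
Python standard library; it is not the code shipped with the dataset (see
README.code for that). The function name iter_sentences is an assumption made
for illustration, and the sketch assumes that each <w> element holds the token
as its text content.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_sentences(path):
    """Yield every sentence in one bzip2-compressed XML file as a list of tokens."""
    # bz2.open decompresses on the fly, so the file is never fully unpacked.
    with bz2.open(path, "rb") as handle:
        # iterparse streams the document; an "end" event fires once an
        # element and all of its children have been read.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "sentence":
                # Assumption: the token is the text content of each <w>.
                yield [w.text or "" for w in elem.iter("w")]
                elem.clear()  # drop the processed <w> children to save memory
```

Looping over iter_sentences(path) for each file in a decade subfolder then
gives one token list per sentence.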

The XML generally has the following nested structure: <corpus>, <text>,
<sentence>, <w>.

The text element contains metadata that depends on the source. We have added two
standardised attributes to each text element: a four-digit 'year' attribute and
a 'genre' attribute, which is one of the following:
* fiction
* government
* news
* science
* socialmedia
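
The 'year' and 'genre' attributes make it straightforward to select or
summarise texts. The sketch below counts sentences per genre and decade. The
SAMPLE document is made up for illustration and only mimics the structure
described above; count_sentences is a hypothetical helper, not part of the
provided code.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature corpus that mimics the <corpus>/<text>/<sentence>/<w> layout.
SAMPLE = b"""<corpus>
<text year="1987" genre="news"><sentence><w>Hej</w><w>.</w></sentence></text>
<text year="1999" genre="fiction"><sentence><w>Hon</w><w>log</w><w>.</w></sentence></text>
<text year="2005" genre="news"><sentence><w>Ja</w><w>.</w></sentence><sentence><w>Nej</w><w>.</w></sentence></text>
</corpus>"""


def count_sentences(source, genre=None):
    """Count sentences per (genre, decade), optionally for one genre only."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "text":
            continue
        text_genre = elem.get("genre")
        # The four-digit year is rounded down to its decade, e.g. 1987 -> 1980.
        decade = int(elem.get("year")) // 10 * 10
        if genre is None or text_genre == genre:
            counts[(text_genre, decade)] += sum(1 for _ in elem.iter("sentence"))
        elem.clear()  # release the finished text before reading the next one
    return counts


print(count_sentences(io.BytesIO(SAMPLE), genre="news"))
```

For a real file, the io.BytesIO object can be replaced by a handle opened with
bz2.open in binary mode.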

Each word in the dataset is annotated with various syntactic, morphological
and semantic information. We provide code to extract the following data:
* Plain (the original text)
* Lemma (each word replaced by its lemma when possible)
* Word sense (as classified by SALDO)
* Lemgram (including part-of-speech)
Whenever a lemma/sense/lemgram does not exist (e.g. for punctuation or unknown
words), it is substituted by the word itself. If more than one lemma/sense/
lemgram is possible, the code outputs a list of all possibilities by default.
It is also possible to output only the first possibility.
Additionally, you can filter on genre, and multi-word expressions can be
contracted. For more information on how to extract the data, see README.code.
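
The substitution and "first possibility" rules can be illustrated with a small
sketch. This is not the extraction code described in README.code. The function
name annotation_values is hypothetical, and the sketch assumes that alternative
values are stored in one attribute separated by "|" (for example "|a|b|"), which
should be checked against the actual files.

```python
def annotation_values(word, raw, first_only=False):
    """Return the annotation candidates for one token, falling back to the word."""
    # Assumption: alternatives are "|"-separated; empty pieces are ignored.
    candidates = [value for value in (raw or "").split("|") if value]
    if not candidates:
        # No lemma/sense/lemgram available: substitute the word itself.
        return [word]
    # Default is every possibility; optionally keep only the first one.
    return candidates[:1] if first_only else candidates
```

With this helper, a token without any annotation yields the token itself, and
passing first_only=True reduces an ambiguous token to a single value.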

TODO: A resource-intensive method is used to find the most likely multi-word
expressions. This means that some very long sentences cannot be analysed in a
reasonable time. Currently, we simply skip these sentences altogether if --mwe
mode is used. In a future version, we will implement some pre-processing with
an alternative heuristic to handle these cases.

An overview of the sources used in the dataset:
!=======================================================!
| Source                                                |
| Genre        | Time period | Tokens      | Sentences  |
|-------------------------------------------------------|
| Bonniersromaner                                       |
| Fiction      |  1976-1981  |  10,884,795 |    806,627 |
|-------------------------------------------------------|
| Norstedtsromaner                                      |
| Fiction      |  1999       |   2,534,307 |    194,699 |
|-------------------------------------------------------|
| SALT svenska-nederländska                             |
| Fiction      |  1980-1989  |   1,335,455 |     96,995 |
|-------------------------------------------------------|
| SUC-romaner                                           |
| Fiction      |  1990-1999  |   4,653,784 |    330,127 |
|-------------------------------------------------------|
| Smittskydd                                            |
| Government   |  2000-2009  |     691,716 |     41,066 |
|-------------------------------------------------------|
| Statens offentliga utredningar                        |
| Government   |  1950-1999  |  50,000,071 |  2,391,382 |
|-------------------------------------------------------|
| Svensk författningssamling                            |
| Government   |  1990-1999  |   8,335,298 |    277,030 |
|-------------------------------------------------------|
| Svenska partiprogram och valmanifest                  |
| Government   |  2000-2009  |     821,777 |     50,684 |
|-------------------------------------------------------|
| 8 Sidor                                               |
| News         |  2000-2009  |     678,766 |     59,236 |
|-------------------------------------------------------|
| Dagens Nyheter                                        |
| News         |  1987       |   5,122,237 |    364,226 |
|-------------------------------------------------------|
| Göteborgsposten                                       |
| News         |  1994-2013  | 271,239,984 | 18,935,974 |
|-------------------------------------------------------|
| Press 65-98                                           |
| News         |  1965-1998  |  41,177,162 |  2,891,152 |
|-------------------------------------------------------|
| Webbnyheter                                           |
| News         |  2001-2013  | 271,806,921 | 15,112,300 |
|-------------------------------------------------------|
| DiabetologNytt                                        |
| Science      |  1996-1999  |     228,398 |     14,129 |
|-------------------------------------------------------|
| Forskning & Framsteg                                  |
| Science      |  1990-1999  |     744,000 |     44,538 |
|-------------------------------------------------------|
| Humaniora                                             |
| Science      |  2010-2015  |  14,437,043 |    673,820 |
|-------------------------------------------------------|
| Läkartidningen                                        |
| Science      |  1996-2005  |  19,471,910 |  1,085,785 |
|-------------------------------------------------------|
| Samhällsvetenskap                                     |
| Science      |  2000-2009  |  10,873,267 |    523,102 |
|-------------------------------------------------------|
| Svenska Wikipedia                                     |
| Science      |  2015       | 152,333,391 |  5,972,649 |
|-------------------------------------------------------|
| Bloggmix                                              |
| Social media |  1998-2015  |  35,253,548 |  2,254,343 |
|-------------------------------------------------------|
| Familjeliv                                            |
| Social media |  2000-2015  |  68,011,169 |  4,521,566 |
|-------------------------------------------------------|
| Flashback                                             |
| Social media |  2000-2015  |  45,000,152 |  3,095,212 |
!=======================================================!