extract_bw.py
contract.py
=============
                    Current Web home: http://spraakbanken.gu.se/folder/subfolder

usage: bw_extract.py [-h] [--mode {plain,lemma,saldo,lex}] [--mwe]
                     [--first-only]
                     [--genre {fiction,government,news,science,socialmedia,all}]
                     outfile

-h:           Shows help for the program.
--mode:       Determines output format. Defaults to 'plain'.
--mwe:        Contracts multi-word expressions. Defaults to False.
--first-only: Outputs only the first lemma/saldo/lex value. Defaults to False.
--genre:      Filters texts on genre. Defaults to 'all'.
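
For readers who want to see how these options fit together, the following is a
minimal argparse sketch that accepts the same arguments as the usage line above.
It is an illustration only, not the program's actual source; the function name
build_parser is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command line (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Extract text from Korp-annotated corpus files."
    )
    parser.add_argument(
        "--mode",
        choices=["plain", "lemma", "saldo", "lex"],
        default="plain",
        help="output format",
    )
    parser.add_argument(
        "--mwe", action="store_true", help="contract multi-word expressions"
    )
    parser.add_argument(
        "--first-only",
        action="store_true",
        help="output only the first lemma/saldo/lex value",
    )
    parser.add_argument(
        "--genre",
        choices=["fiction", "government", "news", "science", "socialmedia", "all"],
        default="all",
        help="filter texts on genre",
    )
    parser.add_argument("outfile", help="file that output is appended to")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--mode", "saldo", "--mwe", "outfile.txt"])
print(args.mode, args.mwe, args.first_only, args.genre, args.outfile)
```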

extract_bw.py is a Python 3 program for extracting data from the Swedish
Culturomics Gigaword Corpus. It should also work with any other corpus from
Språkbanken's Korp annotated resources: <http://spraakbanken.gu.se/eng/resources>.
contract.py is a companion file whose function is to contract multi-word
expressions; it is not intended to be run on its own.

When run, extract_bw.py will attempt to process any file with a name ending in
'xml.bz2' in the current folder and all subfolders. The output will be written
to a file, the name of which must be passed to the program as an argument. The
writing is done in append mode, which means that if the outfile exists it will
not be overwritten, but added to. If you only wish to process a certain decade,
make sure that you run the program from the appropriate subfolder. Note also
that processing the whole corpus can take a long time, especially if you use the
--mwe flag, with which it can take several days on a fast computer.
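
The file discovery and append behaviour described above can be sketched as
follows. This is an assumption about how such a step might look, not the
program's actual code; the helper names find_corpus_files, open_corpus_file and
append_lines are hypothetical.

```python
import bz2
from pathlib import Path


def find_corpus_files(root="."):
    """Return every file under root (recursively) whose name ends in 'xml.bz2'."""
    return sorted(p for p in Path(root).rglob("*xml.bz2") if p.is_file())


def open_corpus_file(path):
    """Open a bz2-compressed XML file as decompressed UTF-8 text."""
    return bz2.open(path, "rt", encoding="utf-8")


def append_lines(outfile, lines):
    """Append lines to outfile; existing content is kept, not overwritten."""
    with open(outfile, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
```

Because the output file is opened in append mode, delete or rename an existing
outfile first if you want a fresh extraction rather than an extended one.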

The output can be written in one of four modes, selected with the --mode flag.
The possible output modes are:

* plain (the original words from the source without any formatting)
* lemma (each word is replaced by its lemma where Korp has found one)
* saldo (each word is replaced by its word sense as classified by SALDO)
* lex (each word is replaced by a lemgram, which contains the part-of-speech
       tag as well as a number signifying the conjugation paradigm)

If the mode is not specified, the program defaults to plain mode. Additionally,
a flag --mwe can be used, which will contract multi-word expressions. In
practice, this means that the whole multi-word expression will be written at the
position of the first word of that expression, while subsequent words in the
same expression are removed from their respective positions. The --mwe flag
cannot be used in plain mode.
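
The contraction rule can be illustrated with a small sketch. This is not the
code in contract.py; the token representation used here (a tuple of the word's
own value, an expression id, and the contracted value) is a hypothetical one
chosen only to show the rule.

```python
def contract_mwe(tokens):
    """Contract multi-word expressions in a sentence.

    tokens is a list of (value, mwe_id, mwe_value) tuples. For ordinary words
    mwe_id and mwe_value are None. Words belonging to the same multi-word
    expression share an mwe_id and carry the contracted value in mwe_value.
    """
    emitted = set()
    result = []
    for value, mwe_id, mwe_value in tokens:
        if mwe_id is None:
            # Ordinary word: keep it where it is.
            result.append(value)
        elif mwe_id not in emitted:
            # First word of an expression: write the whole expression here.
            emitted.add(mwe_id)
            result.append(mwe_value)
        # Later words of an already written expression are dropped.
    return result


sentence = [
    ("höna..1", None, None),
    ("lägga..1", 1, "lägga_ägg..1"),
    ("sig..1", None, None),
    ("ägg..1", 1, "lägga_ägg..1"),
    ("i..2", None, None),
    ("gräs..1", None, None),
    (".", None, None),
]
print(" ".join(contract_mwe(sentence)))
```

Note that the words of an expression need not be adjacent: here "sig" sits
between the two parts and keeps its place after the contracted expression.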

Often, there is more than one lemma/sense/lemgram to choose from. The code
will by default output a list of all possibilities, separated by '|'. You can
choose to only get the first of these possibilities using the flag --first-only.
Note, however, that the first-only choice is only likely to be correct for the
saldo mode (i.e. word senses). For lemma and lex modes, it is generally not a
good idea to use the first-only flag if getting the correct value is important.
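
As a rough sketch of the behaviour described above, the function below joins
all candidate values with '|' or keeps only the first one. The function name
format_values is hypothetical, and falling back to the original word when no
annotation exists is an assumption.

```python
def format_values(word, values, first_only=False):
    """Format the annotation values of one word for output.

    word is the original token, values the list of candidate
    lemmas/senses/lemgrams (possibly empty).
    """
    if not values:
        # Assumption: without an annotation, the original word is written.
        return word
    if first_only:
        return values[0]
    return "|".join(values)


print(format_values("lade", ["lägga..1", "lägga..2", "lägga..3"]))
print(format_values("lade", ["lägga..1", "lägga..2", "lägga..3"], first_only=True))
```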

Use examples:

$ extract_bw.py --mode plain outfile.txt
hönan lade sina ägg i gräset .

$ bw_extract.py --mode lemma outfile.txt
höna lägga sig ägg i gräs .

$ bw_extract.py --mode lemma --mwe outfile.txt
höna lägga_ägg sig i gräs .

$ bw_extract.py --mode saldo outfile.txt
höna..1 lägga..1|lägga..2|lägga..3 sig..1 ägg..1|ägg..2|ägg..3|ägg..4 i..2
gräs..1|gräs..2 .

$ bw_extract.py --mode saldo --first-only outfile.txt
höna..1 lägga..1 sig..1 ägg..1 i..2 gräs..1 .

$ bw_extract.py --mode saldo --mwe outfile.txt
höna..1 lägga_ägg..1 sig..1 i..2 gräs..1|gräs..2 .

$ bw_extract.py --mode saldo --mwe --first-only outfile.txt
höna..1 lägga_ägg..1 sig..1 i..2 gräs..1 .

$ bw_extract.py --mode lex outfile.txt
höna..nn.1 lägga..vb.1 sig..pn.1 ägg..nn.1 i..pp.1 gräs..nn.1 .

$ bw_extract.py --mode lex --mwe outfile.txt
höna..nn.1 lägga_ägg..vbm.1 sig..pn.1 i..pp.1 gräs..nn.1 .

The program can also filter on genre using the --genre flag. The genre
defaults to 'all' but can be any one of the following:

* all
* fiction
* government
* news
* science
* socialmedia

It is currently only possible to filter on one genre (or all) at a time. To
extract several selected genres, the program must be run multiple times.
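
One way to run the program once per genre is a small driver script such as the
sketch below. It is an illustration only: the helper names are hypothetical,
and the script path is supplied by the caller. Since output is written in append
mode, passing the same outfile to every run collects all selected genres in one
file.

```python
import subprocess
import sys


def genre_commands(script, genres, outfile, mode="plain"):
    """Build one command line per genre, all writing to the same outfile."""
    return [
        [sys.executable, script, "--mode", mode, "--genre", genre, outfile]
        for genre in genres
    ]


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = genre_commands("extract_bw.py", ["news", "science"], "outfile.txt", mode="lemma")
for command in commands:
    print(" ".join(command[1:]))
# Call run_all(commands) from the corpus folder to start the extraction.
```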