extract_bw.py
contract.py
=============
Current Web home: http://spraakbanken.gu.se/folder/subfolder

usage: bw_extract.py [-h] [--mode {plain,lemma,saldo,lex}] [--mwe]
                     [--first-only]
                     [--genre {fiction,government,news,science,socialmedia,all}]
                     outfile

-h: Shows help for the program.
--mode: Determines output format. Defaults to 'plain'.
--mwe: Contracts multi-word expressions. Defaults to False.
--first-only: Outputs only the first lemma/saldo/lex value. Defaults to False.
--genre: Filters texts on genre. Defaults to 'all'.

extract_bw.py is a Python 3 program for extracting data from the Swedish Culturomics Gigaword Corpus, but it should also work with any other corpus from Språkbanken's Korp-annotated resources.
contract.py is a companion file whose function is to contract multi-word expressions; it is not intended to be run on its own.

When run, extract_bw.py will attempt to process every file whose name ends in 'xml.bz2' in the current folder and all subfolders. The output is written to a file, whose name must be passed to the program as an argument. The file is opened in append mode, which means that an existing outfile is not overwritten but added to.

If you only wish to process a certain decade, run the program from the appropriate subfolder. Note also that processing the whole corpus can take a long time, especially with the --mwe flag, with which it can take several days on a fast computer.

The output can be in one of four modes, selected with the --mode flag.
The possible output modes are:

* plain (the original words from the source without any formatting)
* lemma (each word is replaced by its lemma where Korp has found one)
* saldo (each word is replaced by its word sense as classified by SALDO)
* lex (each word is replaced by a lemgram, which contains the part-of-speech tag as well as a number signifying the conjugation paradigm)

If the mode is not specified, the program defaults to plain mode.

Additionally, the --mwe flag can be used to contract multi-word expressions. In practice, this means that the whole multi-word expression is written at the position of the first word of that expression, while the subsequent words in the same expression are removed from their respective positions. The --mwe flag cannot be used in plain mode.

Often, there is more than one lemma/sense/lemgram to choose from. By default, the program outputs a list of all possibilities, separated by '|'. You can choose to get only the first of these possibilities with the --first-only flag. Note, however, that the first possibility is only likely to be the correct one in saldo mode (i.e. word senses). For lemma and lex modes, it is generally not a good idea to use --first-only if getting the correct value is important.

Usage examples:

$ extract_bw.py --mode plain outfile.txt
  hönan lade sina ägg i gräset .
$ bw_extract.py --mode lemma outfile.txt
  höna lägga sig ägg i gräs .
$ bw_extract.py --mode lemma outfile.txt
  höna lägga ägg sig i gräs .
$ bw_extract.py --mode saldo outfile.txt
  höna..1 lägga..1|lägga..2|lägga..3 sig..1 ägg..1|ägg..2|ägg..3|ägg..4 i..2 gräs..1|gräs..2 .
$ bw_extract.py --mode saldo --first-only outfile.txt
  höna..1 lägga..1 sig..1 ägg..1 i..2 gräs..1 .
$ bw_extract.py --mode saldo --mwe outfile.txt
  höna..1 lägga_ägg..1 sig..1 i..2 gräs..1|gräs..2 .
$ bw_extract.py --mode saldo --mwe --first-only outfile.txt
  höna..1 lägga_ägg..1 sig..1 i..2 gräs..1 .
$ bw_extract.py --mode lex outfile.txt
  höna..nn.1 lägga..vb.1 sig..pn.1 ägg..nn.1 i..pp.1 gräs..nn.1 .
$ bw_extract.py --mode lex --mwe outfile.txt
  höna..nn.1 lägga_ägg..vbm.1 sig..pn.1 i..pp.1 gräs..nn.1 .

The program can also filter on genre using the --genre flag. The genre defaults to 'all' but can be any one of the following:

* all
* fiction
* government
* news
* science
* socialmedia

It is currently only possible to filter on one genre (or all) at a time. To extract several selected genres, the program must be run multiple times.
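
Since the outfile is written in append mode, one way to collect several genres in a single file is to run the program once per genre with the same outfile. The following is only a minimal sketch of that idea, not part of the program itself. The helper name build_commands is hypothetical, and the sketch assumes that the script is called extract_bw.py, lies in the current folder, and is started with python3; adjust these to your setup.

```python
import shlex

# Genre names accepted by --genre, as listed above ('all' is left out,
# since it already covers every genre in a single run).
GENRES = ("fiction", "government", "news", "science", "socialmedia")


def build_commands(genres, outfile, mode="plain", script="extract_bw.py"):
    """Return one argument list per genre.

    Hypothetical helper: the interpreter name 'python3' and the default
    script name are assumptions, not something the program prescribes.
    """
    commands = []
    for genre in genres:
        if genre not in GENRES:
            raise ValueError(f"unknown genre: {genre!r}")
        commands.append(
            ["python3", script, "--mode", mode, "--genre", genre, outfile]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them. To actually run each
    # one, pass it to subprocess.run(cmd, check=True).
    for cmd in build_commands(["news", "fiction"], "outfile.txt", mode="lemma"):
        print(shlex.join(cmd))
```

Because every run appends to the outfile, start from a new or empty file if the result should contain only the selected genres.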