# NyLLex
This is the companion repo to the NyLLex paper [1]. Note that the resource in this version is larger than the one described in the paper. This is due to two reasons:
1. An increased number of books available for the source material (from 247 to 280)
2. An updated method to filter out bad entries caused by erroneous OCR readings of the source PDFs.

In practice, the number of entries (unique words) in this version is significantly larger, more than double that of the paper version, since entries that appear only once in the source material are no longer discarded. The total frequency count across all entries, however, differs from the paper version by only around 2%.


## Structure
The CSV contains the following headers:
- word: a word in its lemma form
- POS: a part-of-speech tag in the SUC-format
- level1_freq - level6_freq (six headers): the dispersed frequency of the word in the given reading proficiency level
- total_freq: the adjusted frequency for the word across all reading proficiency levels
- n_level1 - n_level6 (six headers): raw frequency of the word in the given reading proficiency level
- n_total: raw frequency for the word across all reading proficiency levels
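
The header layout above can be read with Python's standard `csv` module. The following is a minimal sketch, not part of the original resource: the function name `load_entries` is hypothetical, and it assumes a comma-delimited file whose header names are spelled exactly as in the list above, with numeric values in all frequency columns.

```python
import csv

LEVELS = range(1, 7)  # reading proficiency levels 1 to 6

# Column names as listed in this README; their exact spelling in the
# CSV file is an assumption.
EXPECTED_COLUMNS = (
    ["word", "POS"]
    + [f"level{i}_freq" for i in LEVELS]
    + ["total_freq"]
    + [f"n_level{i}" for i in LEVELS]
    + ["n_total"]
)


def load_entries(handle):
    """Read entries from an open CSV file and convert frequencies to numbers."""
    reader = csv.DictReader(handle)  # assumes comma as delimiter
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    entries = []
    for row in reader:
        entry = {"word": row["word"], "POS": row["POS"]}
        for i in LEVELS:
            entry[f"level{i}_freq"] = float(row[f"level{i}_freq"])
            # Raw counts are assumed to be whole numbers; going through
            # float() also accepts values written like "3.0".
            entry[f"n_level{i}"] = int(float(row[f"n_level{i}"]))
        entry["total_freq"] = float(row["total_freq"])
        entry["n_total"] = int(float(row["n_total"]))
        entries.append(entry)
    return entries
```

To use it, open the CSV with `open(path, newline="", encoding="utf-8")` and pass the file object to `load_entries`, where `path` is wherever the CSV is stored locally. The UTF-8 encoding is also an assumption.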


#### References
[1] Daniel Holmer and Evelina Rennes. 2022. NyLLex: A Novel Resource of Swedish Words Annotated with Reading Proficiency Level. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1326–1331, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.141.pdf