Dataset for Vector space model and the usage patterns of Indonesian denominal verbs

Preface

This is the data repository for the paper accepted for publication in NUSA's special issue on Linguistic studies using large annotated corpora (co-edited by Hiroki Nomoto and David Moeljadi).

How to cite the dataset

If you use any of the dataset in this repository (except for the malindo_dbase, see below), please cite as:

Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Fileset. https://doi.org/10.6084/m9.figshare.8187155.

Alternatively, click on the dark pink Cite button to browse different citation style (default is DataCite).

The malindo_dbase data in this repository is from Nomoto et al. (2018) (cf the GitHub repository). So please also cite their work if you use it for your research:

Nomoto, Hiroki, Hannah Choi, David Moeljadi and Francis Bond. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. Kiyoaki Shirai (ed.) Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36-43.

Tutorial on how to use the data together with the R Notebook for the analyses is available on GitHub and figshare:
Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Software. doi: https://doi.org/10.6084/m9.figshare.9970205

Dataset description

1. Leipzig_w2v_vector_full.bin is the vector space model used in the paper. We built it using wordVectors package (Schmidt & Li 2017) via the MonARCH High Performance Computing Cluster (We thank Philip Chan for his help with access to MonARCH).

2. Files beginning with ngramexmpl_... are data for the n-grams (i.e. words sequence) of verbs discussed in the paper. The files are in tab-separated format.

3. Files beginning with sentence_... are full sentences for the verbs discussed in the paper (in the plain text format and R dataset format [.rds]). Information of the corpus file and sentence number in which the verb is found are included.

4. me_parsed_nountaggedbase (in three different file-formats) contains database of the me- words with noun-tagged root that MorphInd identified to occur in three morphological schemas we focus on (me-, me-/-kan, and me-/-i). The database has columns for the verbs' token frequency in the corpus, root forms, MorphInd parsing output, among others.

5. wordcount_leipzig_allcorpus (in three different file-formats) contains information on the size of each corpus file used in the paper and from which the vector space model is built.

6. wordlist_leipzig_ME_DI_TER_percorpus.tsv is a tab-separated frequency list of words prefixed with me-, di-, and ter- in all thirteen corpus files used. The wordlist is built by first tokenising each corpus file, lowercasing the tokens, and then extracting the words with the corresponding three prefixes using the following regular expressions:
- For me-: ^(?i)(me)([a-z-]{3,})$
- For di-: ^(?i)(di)([a-z-]{3,})$
- For ter-: ^(?i)(ter)([a-z-]{3,})$

7. malindo_dbase is the MALINDO Morphological Dictionary (see above).

References

Schmidt, Ben & Jian Li. 2017. wordVectors: Tools for creating and analyzing vector-space models of texts. R package. http://github.com/bmschmidt/wordVectors.