Dataset for Vector space model and the usage patterns of Indonesian denominal verbs
2019-10-12T03:55:45Z (GMT) by
This is the data repository for the paper accepted for publication in NUSA's special issue on Linguistic studies using large annotated corpora (co-edited by Hiroki Nomoto and David Moeljadi).
How to cite the dataset
If you use, adapt, and/or modify any of the datasets in this repository for your research or teaching purposes (except for malindo_dbase, see below), please cite as:
Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Fileset. https://doi.org/10.6084/m9.figshare.8187155.
Alternatively, click on the dark pink Cite button to browse different citation styles.
The malindo_dbase data in this repository is from Nomoto et al. (2018) (cf. the GitHub repository). Please also cite their work if you use it for your research:
Nomoto, Hiroki, Hannah Choi, David Moeljadi and Francis Bond. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. Kiyoaki Shirai (ed.) Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36-43.
A tutorial on how to use the data, together with the R Markdown Notebook for the analyses, is available on GitHub and figshare:
Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Software. doi: https://doi.org/10.6084/m9.figshare.9970205
1. Leipzig_w2v_vector_full.bin is the vector space model used in the paper. We built it using the wordVectors package (Schmidt & Li 2017) on the MonARCH High Performance Computing Cluster (we thank Philip Chan for his help with access to MonARCH).
2. Files beginning with ngramexmpl_... are data for the n-grams (i.e. word sequences) of the verbs discussed in the paper. The files are in tab-separated format.
3. Files beginning with sentence_... are full sentences for the verbs discussed in the paper (in plain-text format and R dataset format [.rds]). Information on the corpus file and the sentence number in which each verb is found is included.
4. me_parsed_nountaggedbase (in three different file formats) contains a database of the me- words with noun-tagged roots that MorphInd identified as occurring in the three morphological schemas we focus on (me-, me-/-kan, and me-/-i). The database has columns for the verbs' token frequency in the corpus, root forms, and MorphInd parsing output, among others.
5. wordcount_leipzig_allcorpus (in three different file formats) contains information on the size of each corpus file used in the paper and from which the vector space model is built.
6. wordlist_leipzig_ME_DI_TER_percorpus.tsv is a tab-separated frequency list of words prefixed with me-, di-, and ter- in all thirteen corpus files used. The wordlist was built by first tokenising each corpus file, lowercasing the tokens, and then extracting the words with the three corresponding prefixes using the following regular expressions:
- For me-:
- For di-:
- For ter-:
7. malindo_dbase is the MALINDO Morphological Dictionary (see above).
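The steps described for wordlist_leipzig_ME_DI_TER_percorpus.tsv (tokenise, lowercase, extract the prefixed words, count) can be sketched roughly as below. This is a minimal illustration in Python rather than the R workflow used for the paper. The tokeniser, the three regular expressions, the function names, and the sample sentence are all assumptions made for the example; they are not the patterns or code the authors used.

```python
import re
from collections import Counter

# Hypothetical patterns (NOT the regular expressions used for the dataset):
# each matches a lowercase token that starts with the prefix and is
# followed by at least one more letter.
PREFIX_PATTERNS = {
    "me-": re.compile(r"^me[a-z]+$"),
    "di-": re.compile(r"^di[a-z]+$"),
    "ter-": re.compile(r"^ter[a-z]+$"),
}


def tokenise(text):
    """Lowercase the text and split it into alphabetic tokens (naive tokeniser)."""
    return re.findall(r"[a-z]+", text.lower())


def prefixed_wordlist(text):
    """Return a dict mapping each prefix to a Counter of the matching tokens."""
    counts = {prefix: Counter() for prefix in PREFIX_PATTERNS}
    for token in tokenise(text):
        for prefix, pattern in PREFIX_PATTERNS.items():
            if pattern.match(token):
                counts[prefix][token] += 1
    return counts


# Invented sample text standing in for one corpus file.
sample = "Ia menulis surat. Surat itu ditulis kemarin dan tertulis rapi. Ia menulis lagi."
wordlist = prefixed_wordlist(sample)

# Print the result as tab-separated rows: prefix, word, frequency.
for prefix, counter in wordlist.items():
    for word, freq in counter.most_common():
        print(prefix, word, freq, sep="\t")
```

Patterns this simple overmatch: the pronoun dia, for instance, would be counted under di-. A real wordlist built this way would therefore need stricter patterns or later filtering.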
Schmidt, Ben & Jian Li. 2017. wordVectors: Tools for creating and analyzing vector-space models of texts. R package. http://github.com/bmschmidt/wordVectors.
- Digital Humanities
- Data Format
- Data Format not elsewhere classified
- Data Structures
- Database Management
- Natural Language Processing
- Pattern Recognition and Data Mining
- Numerical Computation
- Programming Languages
- Computational Linguistics
- Indonesian Languages
- Language Studies not elsewhere classified
- Linguistic Structures (incl. Grammar, Phonology, Lexicon, Semantics)
- Linguistics not elsewhere classified
CC BY 4.0