The database of English words in Croatian.xlsx

dataset
posted on 2022-06-07, 11:14, authored by Irena Bogunović, Mario Kučić

To build a dataset for training and testing the model, 60,000 words were manually labelled according to language membership by three independent evaluators. An n-gram feature representation was combined with a linear Support Vector Machine (SVM) classifier (Smola & Schölkopf, 2004) to extract English words from the ENGRI corpus (Bogunović & Kučić, 2021; Kučić, 2021). The classifier achieved an F1 score of 0.9669 on the test set. The database contains 9,453 English words together with their absolute and relative frequencies.
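As a rough illustration of two ideas mentioned in the description, the sketch below shows how a word might be turned into n-gram features and how an F1 score is computed from gold and predicted labels. It is not the authors' implementation: the description does not say whether the n-grams are character-level or which lengths were used, so character n-grams of length 1 to 3 with boundary markers are assumed here, and the function names `char_ngrams` and `f1_score` and the labels `"en"` and `"hr"` are hypothetical. Training the linear SVM itself is left out.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Count character n-grams of a word (assumed feature scheme, not from the source).

    The word is lowercased and wrapped in '<' and '>' so that n-grams at the
    start and end of the word are distinguishable from word-internal ones.
    """
    padded = f"<{word.lower()}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def f1_score(y_true, y_pred, positive="en"):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not taken from the dataset.
    gold = ["en", "en", "hr", "hr"]
    predicted = ["en", "hr", "hr", "en"]
    print(sorted(char_ngrams("web", 2, 2)))
    print(f1_score(gold, predicted))
```

In a full pipeline, the n-gram counts would be converted to feature vectors and passed to a linear SVM; the F1 function would then be applied to the classifier's predictions on a held-out test set.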

Funding

English words in Croatian: Identification, affective-semantic norming and investigation into cognitive processing via behavioural and neuroscientific methods

Croatian Science Foundation

