baza_engleskih_rijeci_u_hrvatskome (1).xlsx (383.9 kB)
The database of English words in Croatian.xlsx
To build a dataset for training and testing the model, three independent evaluators manually labelled 60,000 words by language membership. An n-gram feature representation combined with a linear Support Vector Machine (SVM) classifier (Smola & Schölkopf, 2004) was then used to extract English words from the ENGRI corpus (Bogunović & Kučić, 2021; Kučić, 2021). The classifier achieved an F1 score of 0.9669 on the test set. The database contains 9,453 English words together with their absolute and relative frequencies.
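The pieces named in the description (n-gram features, the F1 score, and relative frequencies) can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes character-level n-grams of length 1 to 3 with word-boundary markers, the labels "en" and "hr", and a relative frequency defined as count divided by total corpus tokens; none of these details are stated in the description. The function names are hypothetical, and the SVM training step itself is not shown.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Count character n-grams of a word.

    Assumption: character-level n-grams with '<' and '>' as boundary
    markers; the description does not say which n-gram type was used.
    """
    padded = f"<{word.lower()}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def f1_score(y_true, y_pred, positive="en"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_frequencies(counts, corpus_size):
    """Assumption: relative frequency = absolute count / corpus token count."""
    return {word: count / corpus_size for word, count in counts.items()}
```

In a real pipeline, n-gram counts like these would be turned into feature vectors and passed to a linear SVM implementation from an established library, with F1 computed on held-out labelled words.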
Funding
English words in Croatian: Identification, affective-semantic norming and investigation into cognitive processing via behavioural and neuroscientific methods
Croatian Science Foundation
Categories and keywords
English words, Croatian, languages in contact, Classification algorithms, words extraction, English words database, foreign words, anglicisms, Corpus Compilation, Language, Linguistics, Information Retrieval and Web Search, Natural Language Processing, Analysis of Algorithms and Complexity, English Language, English as a Second Language