Uzbek Cyrillic-Latin-Cyrillic Machine Transliteration

Mansurov, B.; Mansurov, A.

doi:10.6084/m9.figshare.13565057.v2

Version 2 2021-01-13, 22:03

Version 1 2021-01-13, 02:09

preprint

posted on 2021-01-13, 22:03 authored by B. Mansurov, A. Mansurov

In this paper, we introduce a data-driven approach to transliterating Uzbek dictionary words from the Cyrillic script into the Latin script, and vice versa. We heuristically align characters of words in the source script with substrings of the corresponding words in the target script and train a decision tree classifier that learns these alignments. On the test set, our Cyrillic-to-Latin model achieves a character-level micro-averaged F1 score of 0.9992, and our Latin-to-Cyrillic model achieves a score of 0.9959. Our contribution is a novel method of producing machine-transliterated texts for the low-resource Uzbek language.
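
The alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate table `CANDIDATES`, the greedy matching rule, the padding symbol, and the context window size are all assumptions made for illustration, and the table is deliberately incomplete. The sketch pairs each Cyrillic character with a Latin substring and then turns each pair into a (context, label) row of the kind a decision tree classifier could be trained on.

```python
TURNED_COMMA = "\u02bb"  # modifier letter used in Uzbek Latin "o'" and "g'"
PAD = "_"                # hypothetical padding symbol for word boundaries

# Hypothetical, incomplete table of Cyrillic letters whose Latin form
# is not exactly one character. Candidates are tried in order.
CANDIDATES = {
    "ш": ["sh"],
    "ч": ["ch"],
    "ў": ["o" + TURNED_COMMA],
    "ғ": ["g" + TURNED_COMMA],
    "я": ["ya"],
    "ю": ["yu"],
    "ё": ["yo"],
    "е": ["ye", "e"],
    "ь": [""],
}


def align(cyr, lat):
    """Greedily pair each Cyrillic character with a Latin substring."""
    pairs = []
    pos = 0
    for ch in cyr:
        target = None
        for cand in CANDIDATES.get(ch, []):
            if lat.startswith(cand, pos):
                target = cand
                break
        if target is None:
            # Default assumption: one source character maps to one target character.
            target = lat[pos:pos + 1]
        pairs.append((ch, target))
        pos += len(target)
    if pos != len(lat):
        raise ValueError(f"could not align {cyr!r} with {lat!r}")
    return pairs


def context(word, i, window=2):
    """Return the characters around position i, padded at the word edges."""
    padded = PAD * window + word + PAD * window
    return tuple(padded[i:i + 2 * window + 1])


def training_rows(word_pairs, window=2):
    """Build (context, label) rows, one per source character."""
    rows = []
    for cyr, lat in word_pairs:
        for i, (_, target) in enumerate(align(cyr, lat)):
            rows.append((context(cyr, i, window), target))
    return rows


if __name__ == "__main__":
    for features, label in training_rows([("шаҳар", "shahar")]):
        print(features, "->", repr(label))
```

Each row treats the Latin substring as a class label and the surrounding Cyrillic characters as categorical features. After encoding the features numerically (for example, one-hot), such rows could be passed to an off-the-shelf decision tree implementation such as scikit-learn's `DecisionTreeClassifier`; whether this matches the feature set used in the paper is not stated in the abstract.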