Initial Explorations in English to Turkish Statistical Machine Translation

Oflazer, Kemal; Kahlout, Ilknur Durgar-El

doi:10.1184/R1/6377369.v1

journal contribution

posted on 2006-06-01, 00:00 authored by Kemal Oflazer, Ilknur Durgar-El Kahlout

This paper presents preliminary results from, and problems encountered in, developing a statistical machine translation system from English to Turkish. Starting with a baseline word model trained on about 20K aligned sentences, we explore various ways of exploiting morphological structure to improve upon the baseline system. As Turkish is a language with complex agglutinative word structures, we experiment with morphologically segmented and disambiguated versions of the parallel texts in order to uncover relations between morphemes and function words in one language and morphemes and function words in the other, in addition to relations between open-class content words. Morphological segmentation on the Turkish side also conflates the statistics from allomorphs, so that sparseness can be alleviated to a certain extent. We find that this approach, coupled with a simple grouping of the most frequent morphemes and function words on both sides, improves the BLEU score from the baseline of 0.0752 to 0.0913 with the small training data. We close with a discussion of why one should not expect distortion parameters to model word-local morpheme ordering, and of why a new approach to handling complex morphotactics is needed.
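The allomorph conflation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the words have already been segmented into a stem and surface suffixes; the `ALLOMORPHS` table, the abstract labels `+lAr` and `+DA`, the `segment` helper, and the three-word toy corpus are all hypothetical and are not taken from the paper. It only shows how mapping surface variants of one suffix to a single token pools their counts.

```python
from collections import Counter

# Hypothetical allomorph table: surface suffix -> abstract morpheme token.
# Turkish vowel harmony and consonant assimilation give one suffix several
# surface forms; here they are mapped to a single label.
ALLOMORPHS = {
    "lar": "+lAr", "ler": "+lAr",                         # plural
    "da": "+DA", "de": "+DA", "ta": "+DA", "te": "+DA",   # locative
}

def segment(stem, suffixes):
    """Return the stem followed by abstract morpheme tokens."""
    return [stem] + [ALLOMORPHS[s] for s in suffixes]

# Toy pre-segmented words (assumed output of a morphological analyzer):
# evlerde "in the houses", okullarda "in the schools", kitaplar "books".
corpus = [("ev", ["ler", "de"]), ("okul", ["lar", "da"]), ("kitap", ["lar"])]

# Count tokens after conflation: "ler" and "lar" now share one statistic.
counts = Counter(tok for stem, sufs in corpus for tok in segment(stem, sufs))
```

Without conflation, `lar` and `ler` would be counted as separate events; after it, they contribute to the same token, which is the sense in which segmentation can reduce sparseness on small training data.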

History

Publisher Statement

Published in Proceedings of the Workshop on Statistical Machine Translation, pages 7–14, New York City, June 2006.

Date

2006-06-01

Keywords

Statistical Machine Translation, Turkish, CMU Qatar

Licence

CC BY-NC-SA 4.0
