W06-3102.pdf (134.23 kB)

Initial Explorations in English to Turkish Statistical Machine Translation

journal contribution
posted on 2006-06-01, 00:00 authored by Kemal Oflazer, Ilknur Durgar-El Kahlout
This paper presents some very preliminary results for, and problems in, developing a statistical machine translation system from English to Turkish. Starting with a baseline word model trained from about 20K aligned sentences, we explore various ways of exploiting morphological structure to improve upon the baseline system. As Turkish is a language with complex agglutinative word structures, we experiment with morphologically segmented and disambiguated versions of the parallel texts in order to also uncover relations between morphemes and function words in one language and morphemes and function words in the other, in addition to relations between open-class content words. Morphological segmentation on the Turkish side also conflates the statistics from allomorphs so that sparseness can be alleviated to a certain extent. We find that this approach, coupled with a simple grouping of the most frequent morphemes and function words on both sides, improves the BLEU score from the baseline of 0.0752 to 0.0913 with the small training data. We close with a discussion of why one should not expect distortion parameters to model word-local morpheme ordering and why a new approach to handling complex morphotactics is needed.

History

Publisher Statement

Published in Proceedings of the Workshop on Statistical Machine Translation, pages 7–14, New York City, June 2006.

Date

2006-06-01
