Exploring Different Representational Units in English-to-Turkish Statistical Machine Translation
journal contribution
posted on 2007-06-01, 00:00 authored by Kemal Oflazer, Ilknur Durgar-El Kahlout

We investigate different representational
granularities for sub-lexical representation
in statistical machine translation work from
English to Turkish. We find that (i) representing
both Turkish and English at the
morpheme-level but with some selective
morpheme-grouping on the Turkish side of
the training data, (ii) augmenting the training
data with “sentences” comprising only
the content words of the original training
data to bias root word alignment, (iii) reranking
the n-best morpheme-sequence outputs
of the decoder with a word-based language
model, and (iv) using model iteration
all provide a non-trivial improvement over
a fully word-based baseline. Despite our
very limited training data, we improve from
20.22 BLEU points for our simplest model
to 25.08 BLEU points, an improvement
of 4.86 points (24% relative).
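
The reported gain can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `bleu_gain` is hypothetical and not from the paper, and it assumes the relative figure is computed against the 20.22 baseline score.

```python
def bleu_gain(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in BLEU points, relative gain in percent).

    Assumption: the relative gain is measured against the baseline score.
    """
    absolute = improved - baseline
    relative = absolute / baseline * 100.0
    return absolute, relative


# Scores quoted in the abstract: simplest model vs. best model.
absolute, relative = bleu_gain(20.22, 25.08)
print(f"absolute gain: {absolute:.2f} BLEU points")
print(f"relative gain: {relative:.1f}%")
```

Under that assumption the numbers agree with the abstract: a difference of 4.86 points, which is roughly 24% of 20.22.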