Exploring Different Representational Units in English-to-Turkish Statistical Machine Translation

Oflazer, Kemal; Kahlout, Ilknur Durgar-El

doi:10.1184/R1/6377300.v1


journal contribution

posted on 2007-06-01, 00:00, authored by Kemal Oflazer and Ilknur Durgar-El Kahlout

We investigate different representational granularities for sub-lexical representation in statistical machine translation work from English to Turkish. We find that (i) representing both Turkish and English at the morpheme-level but with some selective morpheme-grouping on the Turkish side of the training data, (ii) augmenting the training data with “sentences” comprising only the content words of the original training data to bias root word alignment, (iii) reranking the n-best morpheme-sequence outputs of the decoder with a word-based language model, and (iv) using model iteration all provide a non-trivial improvement over a fully word-based baseline. Despite our very limited training data, we improve from 20.22 BLEU points for our simplest model to 25.08 BLEU points for an improvement of 4.86 points or 24% relative.
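Step (iii) of the abstract, reranking morpheme-sequence outputs with a word-based language model, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, hypothetically, that a token beginning with "+" is a suffix of the preceding token, that each candidate carries a decoder log-score, and that a toy add-one-smoothed unigram model can stand in for a real word-based n-gram model. The names `join_morphemes`, `UnigramWordLM`, `rerank`, and `lm_weight` are illustrative only.

```python
import math
from collections import Counter


def join_morphemes(tokens):
    """Merge morpheme tokens into surface words.

    Assumption (hypothetical convention): a token starting with '+'
    attaches to the preceding token; any other token starts a new word.
    """
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1] += tok[1:]
        else:
            words.append(tok.lstrip("+"))
    return words


class UnigramWordLM:
    """Toy add-one-smoothed unigram model, a stand-in for a word-based n-gram LM."""

    def __init__(self, corpus_sentences):
        self.counts = Counter(w for s in corpus_sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def logprob(self, words):
        denom = self.total + self.vocab
        return sum(math.log((self.counts[w] + 1) / denom) for w in words)


def rerank(nbest, lm, lm_weight=1.0):
    """Rescore (morpheme_tokens, decoder_score) pairs and sort best-first.

    The combined score is decoder_score + lm_weight * word-level LM log-probability.
    """
    scored = []
    for tokens, decoder_score in nbest:
        words = join_morphemes(tokens)
        total = decoder_score + lm_weight * lm.logprob(words)
        scored.append((total, " ".join(words)))
    return sorted(scored, reverse=True)
```

In this sketch, a candidate whose joined morphemes form a word the language model has seen can overtake a candidate with a slightly better decoder score but an implausible morpheme order. In practice the weight `lm_weight` would be tuned on held-out data.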

History

Publisher Statement

Published in Proceedings of the Second Workshop on Statistical Machine Translation, pages 25–32, Prague, June 2007.

Date

2007-06-01

Keywords

Statistical Machine Translation, Alignment, Morphology, Turkish

Licence

CC BY-NC-SA 4.0
