TREF - TRanslation Enhancement Framework for Japanese-English

We present a method for improving existing statistical machine translation methods using a knowledge base compiled from a bilingual corpus as well as sequence alignment and pattern matching techniques from the areas of machine learning and bioinformatics. An alignment algorithm identifies similar sentences, which are then used to construct a better word order for the translation. Our preliminary test results indicate an improvement in translation quality.


I. INTRODUCTION
Machine translation has been an active research area throughout the last 40 years. During this period, many promising concepts were proposed; however, there is still much room for improvement [1]. Especially when translating between languages with radically different surface characteristics, as is the case for Japanese-English, current machine translation techniques tend to produce unsatisfying results. The problems of automated translation between these languages become readily apparent when looking at current Web-based translations, e.g. from www.excite.co.jp/world/english, which is shown in Fig. 1. While the translations of short phrases are of reasonable quality, translation systems struggle with long sentences. This is due to the growing complexity of sentences with increasing length and the vast differences in word and subclause order between these languages. Additionally, the characteristics of the Japanese language pose a great challenge for translation into other languages in general [2], [3]. Those characteristics are:
• two syllabaries and a system of several thousand kanji, i.e. originally Chinese characters with several pronunciations and readings,
• lack of spaces to delimit word boundaries,
• a very high ambiguity in the grammar, as there exist no articles to indicate gender or definiteness,
• the tendency to omit information which can be inferred implicitly,
• sociolinguistic factors, e.g. avoiding direct and decisive expressions for reasons of politeness,
• an extensive system of formality with several levels of politeness forms, honorific expressions, and humble verb forms depending on the social status, relationship and other factors of the people involved.
To overcome those intricacies, we have directed our attention to a new and interdisciplinary approach. We have designed and implemented a method for finding structurally similar sentences with the help of an algorithm usually employed in the field of bioinformatics [4], [5]. The underlying assumption of our approach is that there is a significant overlap between the structure of a sentence and its meaning. In this paper, we show that it is possible to enhance statistical machine translation results using this assumption. The TRanslation Enhancement Framework (TREF) [6] utilizes aligned and clustered sentence pair data to enhance the output of the statistical machine translation system Moses [7].
Though trained for the Japanese-English language pair, the system is modular and flexible. Adjusting or extending it to other languages is a matter of changing implementation details and adding the language-specific resources, such as lexica, parsers, and corpora. It is important to mention, however, that our translation framework is specifically designed for, and well suited to, languages with radically different surface characteristics, e.g. European-Asian language pairs. The rest of this paper is organized as follows: Sect. II reviews the research relevant to our work, before we discuss TREF in Sect. III. Section IV presents our evaluation method and the results, followed by a conclusion and future work in Sect. V.

II. RELATED WORK
The ultimate goal of machine translation, i.e. abolishing language barriers, is presented by [8] in an entertaining narration. This ambitious pursuit of a system which will make a lingua franca unnecessary and enable boundless communication between cultures is not quite yet in the realm of the possible. Nonetheless, research efforts towards this goal have been undertaken. In this section, we outline the research relevant to our work.
Japanese-English language pair. The currently predominant ones are the Tanaka corpus [9], the Jenaad corpus [10], and the Verbmobil treebank [11]. The Verbmobil treebank contains dialogs from telephone conversations in English, German, Japanese, and other languages, collected during the Verbmobil speech recognition research project. The Japanese part contains around 160,000 words of text and is written in Romaji, i.e. the transcription of Japanese script into Roman letters. The Tanaka corpus consists of roughly 180,000 sentences and has a very broad domain. It has been collected over several years from various sources and was compiled by Yasuhito Tanaka in 2001. The Jenaad corpus is a collection of close to 150,000 sentence pairs. Extracted from news articles, it offers a certain consistency in terms of sentence types, while still offering a wide range of vocabulary and a variety of grammatical constructs. Because of these qualities, we have chosen the Jenaad corpus for our work. In addition, it is written in Japanese script, thereby avoiding potential ambiguities of the Romaji transcription.

B. Machine Translation
Research in machine translation has included many different approaches since its beginnings. An overview of different techniques can be obtained from [1]. Their visual classification is exemplified by Vauquois' triangle in Fig. 2 [12]. The historically first method, located at the very top of the triangle, is the interlingua approach. It aims towards a language-independent representation, which mediates between two or more languages. In contrast, statistical machine translation is at the bottom of the triangle, where no intermediate information is considered in the process, and there is a direct mapping from source to target text, depending on previously trained statistical data. A good overview of this technique can be obtained from [13].
Other approaches, which are also described in more detail in [14], are found somewhere between those two extremes, and the advantage of each depends on the demands of the given language pair. The challenges of translating Japanese to English gave birth to the new idea of corpus-based machine translation [15]. Apart from its success in translating between these languages, it further provides the opportunity for enhancing language learning environments by presenting the intermediate steps, i.e. the linguistic analysis of the translation process, to the learner. This was successfully accomplished by [16], [17]. The corpus-based method was quickly adopted by the machine translation community and merged with other techniques, as for example in [18]. Together with the idea of [19], that a mapping of grammatical functions and semantic roles is crucial for the Japanese-English pair, we have decided to mold these ideas into a new approach.
We have chosen a statistical machine translation method for a baseline translation in TREF, since it performs well in terms of translation of individual words and short phrases. It does not attempt to find transition rules for syntactic ordering and therefore provides a good first candidate for the post-editing done by TREF.
Amongst different tools, we have chosen Moses, since it is particularly effective when trained with a sufficiently large bilingual corpus. Moses scores well for structurally similar languages; however, for language pairs like Japanese-English, the word order is disarranged, which significantly lowers the quality of the translation, up to the point where the meaning of the sentence is unrecognizable. Moses does not consider any grammatical rules, so the output is syntactically wrong most of the time. The post-editing and rearranging of the Moses output aims at addressing this problem. Our method finds the correct word order for the translation result and produces a grammatically correct sentence, which conveys the meaning of its English counterpart.

C. Natural Language Processing
To analyze the tokens of our bilingual corpus, we have used the MontyTagger from the MontyLingua project [20] for English, and ChaSen [21] for Japanese. Besides a part-of-speech tagging capability, MontyLingua offers an end-to-end natural language processing toolkit. ChaSen is a high-quality part-of-speech tagger tool for Japanese. Recently, CaboCha [22], a Japanese dependency parser, which offers an even wider spectrum of NLP capabilities, has been developed, and we plan to integrate it into TREF in the near future.

D. Sequence Alignment
The Needleman-Wunsch algorithm for computing similarities in protein building blocks, i.e. amino-acid chains, was published in 1970 [23]. Many derivatives and extensions of this method quickly followed. The basic idea behind this concept was to depict amino-acid chains as strings of alphabetic characters, align them to offer the best match between two strings, and compute a similarity measure [24]. This method was further improved by [25], using a distance measure in conjunction with dynamic programming. Many other research efforts found different distance measures to identify the similarity of sequences. The approach of [26] is generic enough to be extended to the area of machine

III. TREF
An overview of the architecture of TREF is shown in Fig. 3. The PoS-Tagger/Formatting module tokenizes the input sentence and assigns PoS tags in a format which is described below. The sentences in their tokenized format are then aligned with the clustered corpus to find the target structure, which is sent to the Comparison and Merging module. This module takes this input as well as the translation from Moses and improves the translation quality by applying a template approach. The resulting translation can then be evaluated and added to the corpus. Each step is described in detail in the following subsections.
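The data flow described above can be sketched as a composition of four stages. This is only an illustrative outline: the function name `enhance_translation` and the four callables passed to it are hypothetical and do not reflect the actual TREF module interfaces.

```python
from typing import Callable, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (token, PoS tag) pairs


def enhance_translation(
    sentence: str,
    tag: Callable[[str], TaggedSentence],
    find_structures: Callable[[TaggedSentence], List[List[str]]],
    translate: Callable[[str], str],
    merge: Callable[[List[str], str], str],
) -> List[str]:
    """Run the stages in order and return one candidate per structure."""
    tagged = tag(sentence)                # PoS-Tagger/Formatting module
    structures = find_structures(tagged)  # alignment with the clustered corpus
    baseline = translate(sentence)        # baseline translation (Moses in TREF)
    # Comparison and Merging: one candidate per structure that was found
    return [merge(structure, baseline) for structure in structures]


# Toy stand-ins so that the sketch runs; none of them performs real NLP.
candidates = enhance_translation(
    "source sentence",
    tag=lambda s: [(w, "NN") for w in s.split()],
    find_structures=lambda tagged: [[t for _, t in tagged]],
    translate=lambda s: s.upper(),
    merge=lambda structure, baseline: f"{baseline} [{' '.join(structure)}]",
)
```

Passing the stages in as callables mirrors the modularity claimed for the framework: a tagger or translation system for another language pair could be substituted without touching the control flow.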

A. Part-of-Speech Tagging
The input sentence is sent to one of the part-of-speech (PoS) tagger modules, MontyTagger [27] or ChaSen [21]. The result of this process can be seen in Fig. 4 and Fig. 5 for Japanese and English, respectively. The Japanese sentence is written in Roman transcription for the reader's convenience. The tags produced by ChaSen consist of a sentence token, its katakana representation (one of the Japanese syllabaries, which indicates the pronunciation of a kanji), and a numerical representation of the morphological data. The English tags contain the word itself and the PoS tag as an acronym. After each sentence token is assigned a PoS tag, the sentence and its tags are compared with the sentences already stored in a clustered corpus, which is a customized and enriched version of the Jenaad Corpus [10]. We have modified it by removing as much noise as possible, assigning PoS tags to each sentence token, and storing the results in an SQL database. We have kept
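As an illustration of how tagged tokens might be stored and retrieved, the following minimal sketch loads the tagged English sentence of Fig. 5 into a relational table. The table layout and the helper names `store_sentence` and `tag_sequence` are assumptions made for this example, and SQLite is used only as a self-contained stand-in for the SQL database mentioned above.

```python
import sqlite3

# Tokens and PoS tags of the English example sentence in Fig. 5
tokens = "expanded use of coal worsens air pollution".split()
tags = "VBN NN IN NN VBZ NN NN".split()


def store_sentence(conn, sentence_id, tokens, tags):
    """Insert one row per token, keeping its position in the sentence."""
    rows = [
        (sentence_id, position, word, tag)
        for position, (word, tag) in enumerate(zip(tokens, tags))
    ]
    conn.executemany(
        "INSERT INTO token (sentence_id, position, word, tag) VALUES (?, ?, ?, ?)",
        rows,
    )


def tag_sequence(conn, sentence_id):
    """Return the PoS tags of a stored sentence in their original order."""
    cursor = conn.execute(
        "SELECT tag FROM token WHERE sentence_id = ? ORDER BY position",
        (sentence_id,),
    )
    return [row[0] for row in cursor.fetchall()]


# Hypothetical schema; the schema of the actual TREF database is not given here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token (sentence_id INTEGER, position INTEGER, word TEXT, tag TEXT)"
)
store_sentence(conn, 1, tokens, tags)
```

Calling `tag_sequence(conn, 1)` returns the tag sequence of the stored sentence, which is the kind of structural information the alignment step compares.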

B. Aligning and Clustering
In order to identify similar sentences, we have used a slightly modified alignment algorithm from bioinformatics. Instead of aligning protein chains, we align chains of words, i.e. sentences. We have applied relational sequence alignment [4], [5] to obtain clusters of structurally similar sentences. The alignment is done according to the Nienhuys-Cheng distance function.
An example of a distance between the tokens of each sentence is shown in Fig. 6. If both the token and its PoS tag differ, the distance is 1. In the case of a structural match, i.e. the PoS tags agree but the tokens differ, the distance is 0.5, and for a perfect match it is 0. The subsequent distance calculation for an entire sentence is depicted in Fig. 7. Gaps, which are identified and symbolized with (g) in the example, are assigned variable gap penalties. In order to achieve better
d(nn(house), nn(house)) = 0
d(nn(house), nn(office)) = 0.5
d(nn(house), dt(the)) = 1
The similarity measure parameters can be adjusted to fine-tune the result, depending on the text type and text domain. By allowing lower similarity values, a higher number of candidates can be produced, whereas a higher similarity value reduces the number of candidates. This flexibility can be utilized for a language learning application to present an arbitrary number of similar translations to the student. The output is then evaluated by the user and added to the corpus. Once the distances are computed, clusters can be defined by setting a threshold value. This concept is shown in Fig. 8 in a Cartesian coordinate system. Each sentence which has a distance lower than a certain threshold value is assigned to a cluster and is therefore considered structurally similar to the sentences in this cluster.
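The token distances and the threshold-based clustering described in this subsection can be sketched as follows. This is a simplified illustration under several assumptions, not the relational sequence alignment of [4], [5]: it uses a plain Needleman-Wunsch-style dynamic program, a constant gap penalty instead of variable ones, a distance of 1 for tokens whose PoS tags differ even if the words agree, and a greedy clustering that compares each sentence with the first member of every cluster. All function names are hypothetical.

```python
def token_distance(a, b):
    """Distance between two (tag, word) pairs, following the values of Fig. 6."""
    if a == b:
        return 0.0   # perfect match: same tag and same word
    if a[0] == b[0]:
        return 0.5   # structural match: same tag, different word
    return 1.0       # assumption: any tag mismatch counts as a full mismatch


def sentence_distance(s, t, gap=1.0):
    """Global alignment cost of two tagged sentences (constant gap penalty)."""
    rows, cols = len(s) + 1, len(t) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * gap
    for j in range(1, cols):
        cost[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            cost[i][j] = min(
                cost[i - 1][j - 1] + token_distance(s[i - 1], t[j - 1]),
                cost[i - 1][j] + gap,   # gap inserted in t
                cost[i][j - 1] + gap,   # gap inserted in s
            )
    return cost[-1][-1]


def cluster(sentences, threshold):
    """Assign each sentence to the first cluster whose first member is close enough."""
    clusters = []
    for sentence in sentences:
        for members in clusters:
            if sentence_distance(sentence, members[0]) < threshold:
                members.append(sentence)
                break
        else:
            clusters.append([sentence])
    return clusters


# Invented example sentences
a = [("dt", "the"), ("nn", "house"), ("vbz", "is"), ("jj", "big")]
b = [("dt", "the"), ("nn", "office"), ("vbz", "is"), ("jj", "big")]
c = [("prp", "we"), ("vbp", "support"), ("nnp", "lebanon")]
groups = cluster([a, b, c], threshold=1.0)
```

With these invented sentences, `a` and `b` differ only in one noun and end up in the same cluster, while `c` forms a cluster of its own. Raising `threshold` merges more sentences into one cluster, which corresponds to producing a higher number of candidates.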

C. Comparison and Merging
The comparison of the query sentence with the clusters yields several similar structures. At the same time, the query sentence is processed with Moses to obtain a preliminary translation. This translation is then used to fill the template of the structures which have been found in the previous step. Thereby, a certain number of translation candidates is produced. The filling of the structure templates from the aligning step is shown in Fig. 9. In this example, we use the TREF transforms this by filling the structure template into "we welcome Lebanon freed in hostages". As can be seen, some tokens are lost in the process of filling the template, which leaves room for future work and potential for further improvement of the translation quality.
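One simple way to picture the template filling is to treat a structure as a sequence of PoS tags and to fill each slot with the first unused token of the preliminary translation that carries the same tag. The sketch below follows this reading, which is an assumption and not necessarily the merging procedure implemented in TREF. The function `fill_template`, the template, and the PoS tags assigned to the Moses output are hypothetical; only the Moses output itself is taken from the example listed with Fig. 11.

```python
def fill_template(template_tags, tagged_tokens):
    """Fill each template slot with the first unused token that has the slot's tag.

    Returns the filled token list and the tokens that found no slot.
    Slots without a matching token are left out.
    """
    remaining = list(tagged_tokens)
    filled = []
    for slot in template_tags:
        for index, (word, tag) in enumerate(remaining):
            if tag == slot:
                filled.append(word)
                del remaining[index]
                break
    lost = [word for word, _ in remaining]
    return filled, lost


# Moses output from the Fig. 11 example; the tags are assumed for illustration.
moses_output = [
    ("we", "PRP"), ("support", "VBP"), ("in", "IN"),
    ("lebanon", "NNP"), ("reconstruction", "NN"), ("efforts", "NNS"),
]
# Hypothetical structure template retrieved from a similar sentence
template = ["PRP", "VBP", "NNP", "IN", "NN"]

filled, lost = fill_template(template, moses_output)
```

Under these assumptions, `filled` reads "we support lebanon in reconstruction" and `lost` contains "efforts", which illustrates how tokens can drop out when the template has no slot for them.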

D. Web Interface
The clustered corpus of PoS-tagged sentence tokens in several representations, as well as morphological information, is stored in a MySQL database and is accessible through the Django Web framework (http://www.djangoproject.com). In Django, all interactive content as well as settings, modules, and database setup are written in Python, whose powerful string and text manipulation capabilities made it a good candidate for our system. Further, Django provides stable Web development and administrative utilities. In particular, the communication with the database and efficient Web design tools including HTML code inheritance made it an ideal development environment. The structure of the framework is depicted in Fig. 10. From the main site, the user can navigate to the translation module, the sentence pair input, the random sentence output, as well as legends for the PoS tags for English and Japanese. The translation module offers an interface which, upon input of a sentence, sends it to the server and, after the translation process described above, displays the result. The sentence input module takes a sentence pair as input, which is flagged as a new addition and is checked manually before being added to the database. The random sentence output is a first step towards the language learning functionality and outputs a sentence from the database including its translation, its tags, and morphological information. We have created a page for the explanation of PoS tags. The translation of the original Japanese ChaSen tags into English is, to the best of our knowledge, the only English ChaSen PoS-tag legend available.
544 PROCEEDINGS OF THE IMCSIT. VOLUME 5, 2010
The framework is available on the Web server maintained by the authors at https://wloka.dac.univie.ac.at/project/.

E. Showcase
Figure 11 shows an example of the workflow from the input of a sentence to an output of several translation candidates. The input "My name is Yamada." is tagged and compared with the clustered data. The PoS tags for the sentence in this case are: My/POP (personal pronoun), name/NN (noun), is/VBZ (verb), Yamada/NNP (proper noun). The alignment detects sentences in the database that are similar in terms of words and PoS tags (see Fig. 6). The translations of the identified structures are also checked for similarities within other clusters. This step, which we call structure-to-meaning mapping, identifies other structures of potential translation candidates. These structures are sent to the matching and translation step, where the structures and the output from Moses are merged to yield the final output, i.e. the translation candidates.

IV. EVALUATION
To create a testing scenario, we have extracted 1000 out of the total 150,000 sentences from the Jenaad corpus. The remaining 149,000 sentences were used as training data for Moses and for clustering. Due to the long processing time for each sentence, we have decided to analyze fewer sentences in detail instead of using standard scoring tools, such as [28] or [29], which would be more meaningful for larger amounts of output. Moreover, the validity of automated scoring tools of this kind has been criticized by [14], [30]. Hence, our evaluation was done by an expert who judged each translation in four categories: word order, word translations, semantics, and fluency. The categories were equally weighted with a top score of 25 each (see Fig. 12). A total of 40 sample sentences were evaluated, and the statistical significance of the result was assessed with a Wilcoxon signed-rank test [31]. The sentences processed with TREF received a better score, with W = 139 over a sample size of N = 34 and a one-tailed P value of 0.119.
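The reported figures can be checked with a short calculation. The sketch below assumes that W denotes the sum of the signed ranks, that N counts only sentence pairs with a nonzero score difference, and that the P value comes from a normal approximation with a continuity correction; these conventions are not stated above. The function names are hypothetical and the paired scores in the example are invented.

```python
import math


def signed_rank_w(before, after):
    """Sum of signed ranks of the nonzero differences (ties get average ranks)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ordered = sorted(diffs, key=abs)
    w = 0.0
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for d in ordered[i:j]:
            w += average_rank if d > 0 else -average_rank
        i = j
    return w, len(ordered)


def one_tailed_p(w, n):
    """Normal approximation for the signed-rank sum, with continuity correction."""
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    z = (abs(w) - 0.5) / sigma
    p = 0.5 * (1 - math.erf(z / math.sqrt(2)))
    return z, p


# Invented paired scores (baseline vs. enhanced), each out of 100
w_toy, n_toy = signed_rank_w([60, 55, 70, 80], [65, 55, 68, 90])

# Values reported in the evaluation
z, p = one_tailed_p(139, 34)
```

Under these assumptions, W = 139 and N = 34 give z of roughly 1.18 and a one-tailed P of about 0.12, in line with the reported value of 0.119.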

V. CONCLUSION
In this paper, we have described a design for enhancing state-of-the-art machine translation using sequence alignment from the area of bioinformatics, combined with PoS tagging and clustering of a bilingual corpus. Our results suggest that similarities in sentence structure can be used to create templates for translation candidates, in particular for the Japanese-English language pair. We have described our implementation of the system and its Web framework. We have trained the system with the Jenaad Corpus and tested the system for Japanese-English. The evaluation of the system yielded promising results. At the time of writing, TREF is already integrated in another research project focusing on ubiquitous translation and language learning with the help of mobile devices.
For future work, we plan to optimize the parameters in the aligning process to fine-tune the word reordering and to add grammatical parsing steps after the template filling to improve the syntactical correctness of the sentence. An additional dictionary lookup will be integrated to amend word translations which could not be processed by the statistical translation step.
BARTHOLOMÄUS WLOKA, WERNER WINIWARTER: TREF TRANSLATION ENHANCEMENT FRAMEWORK FOR JAPANESE-ENGLISH
We want to extend the language learning aspect of the system to offer a Web-based learning platform and improve the efficiency of the entire system with pre-computing and indexing methods. We plan to incorporate a Japanese dependency parser. The currently active research efforts on the Japanese WordNet [32] and CaboCha [22] are promising candidates for an additional extension for TREF as a language learning platform offering extensive semantic and syntactic information as well as visual representations of vocabulary.
The main research interest of Prof. Winiwarter is human language technology, in particular machine translation and computer-assisted language learning. In addition, he also works on data mining and machine learning, Semantic Web, information retrieval, electronic business, and education systems.

Fig. 1. Example of current Web-based machine translation

石炭の利用拡大は大気汚染をさらに悪化させる
Fig. 4. Tagged Japanese sentence

expanded  use  of  coal  worsens  air  pollution
VBN       NN   IN  NN    VBZ      NN   NN
Fig. 5. Tagged English sentence

Fig. 11. Translation via Clustering

Input sentence: 我々は、レバノンにおける復興努力を支持する。
Correct Translation: We support the efforts of reconstruction in Lebanon.
Moses Translation: we support in lebanon reconstruction efforts .
Enhanced by TREF: we support lebanon in reconstruction .
Word Order, Word Translations

Bartholomaeus Wloka, MSc is a doctoral student at the Department of Scientific Computing, University of Vienna, Austria. He received his BSc degree in 2005 at the University of South Alabama, USA and his MSc degree in 2009 at the University of Freiburg, Germany. His main research interests are human language technology, machine translation and computer-assisted language learning, in particular combined with mobile learning.
Prof. Dr. Werner Winiwarter is the Vice Head of the Department of Scientific Computing, University of Vienna, Austria. He received his MS degree in 1990, his MA degree in 1992, and his PhD degree in 1995, all from the University of Vienna, Austria.