Augmenting Translation Models with Simulated Acoustic Confusions for Improved Spoken Language Translation



Introduction
Spoken language translation (SLT) systems generally consist of two components: (i) an automatic speech recognition (ASR) system that transcribes source language utterances and (ii) a machine translation (MT) system that translates the transcriptions into the target language. These two components are usually developed independently and then combined (Ney, 1999; Matusov et al., 2006; Casacuberta et al., 2008; Zhou, 2013; He and Deng, 2013).
While this architecture is attractive since it relies only on components that are independently useful, such systems face several challenges. First, spoken language tends to be quite different from the highly edited parallel texts that are available to train translation systems. For example, disfluencies, such as repeated words or phrases, restarts, and revisions of content, are frequent in spontaneous speech,1 while these are usually absent in written texts. In addition, ASR outputs typically lack explicit segmentation into sentences, as well as reliable casing and punctuation information, which are crucial for MT and other text-based language processing applications (Ostendorf et al., 2008). Second, ASR systems are imperfect: even high-quality systems make recognition errors, especially on acoustically similar words with similar language model scores, for example morphological substitutions such as confusing bare stem and past tense forms, and on high-frequency short words (function words), which often both lack disambiguating context and are subject to reduced pronunciations (Goldwater et al., 2010).
One would expect that training an MT system on ASR outputs (rather than the usual written-style texts) would improve matters. Unfortunately, there are few corpora of speech paired with text translations into a second language that could be used for this purpose. This has been an incentive for various MT adaptation approaches and the development of speech-input MT systems. MT adaptation has been done via input text pre-processing, by transformation of spoken language (ASR output) into written language (MT input) (Peitz et al., 2012); via decoding ASR n-best lists (Quan et al., 2005), confusion networks (Casacuberta et al., 2008), or lattices (Dyer et al., 2008; Onishi et al., 2010); via additional translation features capturing acoustic information (Zhang et al., 2004); and with methods that follow a paradigm of unified decoding (Zhou et al., 2007; Zhou, 2013). In line with this previous research, we too adapt a standard MT system to speech-input MT, but by altering the translation model itself so it is better able to deal with ASR output (Callison-Burch et al., 2006; Tsvetkov et al., 2013a).
We address speech translation in a resource-deficient scenario, specifically, adapting MT systems to SLT when an ASR system is unavailable. We augment the set of translation options that the decoder rescores with synthetic translation options. These automatically generated translation rules (henceforth synthetic phrases) are noisy variants of observed translation rules with simulated, plausible speech recognition errors (§2). To simulate ASR errors we generate acoustically and distributionally similar phrases for a source (English) phrase with a phonologically motivated algorithm (§4). Likely phonetic substitutions are learned with an unsupervised algorithm that produces clusters of similar phones (§3). We show that MT systems augmented with synthetic phrases increase the coverage of input sequences that can be translated, and yield significant improvements in the quality of translated speech (§6).
This work makes several contributions. Primary is our framework to adapt MT to SLT by populating translation models with synthetic phrases.2 Second, we propose a novel method to generate acoustic confusions that are likely to be encountered in ASR transcription hypotheses. Third, we devise a simple and effective phone clustering algorithm. All the aforementioned algorithms work in a low-resource scenario, without recourse to audio data, speech transcripts, or ASR outputs: our method to predict likely recognition errors uses phonological rather than acoustic information and does not depend on a specific ASR system. Since our source language is English, we operate on the phone level and employ a pronunciation dictionary and a language model, but the algorithm can in principle be applied without a pronunciation dictionary for languages with a phonemic orthography.

Methodology
We adopt a standard ASR-MT cascading approach and then augment translation models with synthetic phrases. Our proposed system architecture is depicted in Figure 1.
Synthetic phrases are generated from entries in the original translation model: phrase translation pairs acquired from parallel data. From the source side of an original phrase pair we generate a list of its plausible misrecognition variants (pseudo-ASR outputs with recognition errors) and add each of them as the source side of a synthetic phrase. For k-best simulated ASR outputs we construct k synthetic phrases: a simulated ASR output on the source side is coupled with its translation, the original target phrase (identical for all k phrases). Synthetic phrases are annotated with the five standard phrasal translation features (forward and reverse phrase and lexical translation probabilities, and the phrase penalty); these are found in the original phrase and remain unchanged. In addition, we add three new features to all phrase pairs, both synthetic and original. First, we add a boolean feature indicating the origin of a phrase: synthetic or original. The two other features correspond to ASR language model scores of the source side: one is the LM score of the synthetic phrase, the other is the score of the phrase from which the source side was generated. We then append the synthetic phrases to the phrase table: k synthetic phrases for each original phrase pair, with eight features attached to each phrase. Example synthetic phrases are shown in Figure 2.

2 We augment phrase tables only with synthetic phrases that capture simulated ASR errors; the methodology that we advocate, however, is applicable to many problems in translation (Tsvetkov et al., 2013a; Ammar et al., 2013; Chahuneau et al., 2013).
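The construction of the augmented entries can be sketched as follows. This is a minimal illustration, not our actual implementation; all function and variable names (and the dummy LM score) are hypothetical:

```python
def make_synthetic_entries(src, tgt, features, misrecognitions, lm_score):
    """Couple each simulated ASR output with the original target phrase.

    `features` holds the five standard phrasal translation features of the
    original pair; they are copied unchanged.  Three new features are then
    appended to every entry: a synthetic-origin flag, the ASR LM score of
    the entry's own source side, and the LM score of the original source.
    """
    entries = [(src, tgt, features + [0.0, lm_score(src), lm_score(src)])]
    for pseudo_asr in misrecognitions:
        entries.append((pseudo_asr, tgt,
                        features + [1.0,                   # synthetic origin
                                    lm_score(pseudo_asr),  # LM score of synthetic source
                                    lm_score(src)]))       # LM score of original source
    return entries

# Toy usage with a dummy LM score (negative length in words):
entries = make_synthetic_entries(
    "tells the", "raconte la", [0.2, 0.1, 0.3, 0.1, 1.0],
    ["tell the", "tells a"], lm_score=lambda s: -len(s.split()))
```

Each resulting entry thus carries exactly eight features, and the original pair receives the same three extra features as its synthetic variants.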

Plausible misrecognition variants
For an input English sequence we generate the top-k pseudo-ASR outputs, which are added as source sides of synthetic phrases. Every ASR output that we simulate is a plausible misrecognition with two distinguishing characteristics: it is acoustically and linguistically confusable with the input sequence. The former corresponds to phonetic similarity, the latter to distributional similarity of the two phrases in a corpus.
Given a reference string w, a word or sequence of words in the source language, we generate the k-best hypotheses v.4 This can be modeled as a weighted finite-state composition:

v = k-best(w ∘ D ∘ T ∘ D⁻¹ ∘ G)

where
• D maps from words to pronunciations,
• T is a phone confusion transducer,
• D⁻¹ maps from pronunciations to words,
• G is an ASR language model.

D maps words to their phonetic representations,5 or multiple representations for words with several pronunciation variants. To create phone confusions, the transducer T maps source to target phone sequences by performing a number of edit operations. The allowed edits are:
• Deletion of a consonant (mapping to ε).
• Doubling of a vowel.
• Substitution of a phone by another phone from the same cluster; the phone clustering algorithm that produced these clusters is detailed in the previous section.

After a series of edit operations, the transducer D⁻¹ maps the new phonetic sequences from pronunciations back to n-grams of words. The k-best variants resulting from the weighted composition are the k-best plausible misrecognitions.

4 The value of k = 12 was determined empirically.
5 Using the CMU pronunciation dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
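The edit operations performed by T can be illustrated directly on phone sequences. The sketch below enumerates sequences one edit away from a pronunciation; the vowel set and confusion clusters are toy assumptions, not the clusters actually learned by our clustering algorithm:

```python
VOWELS = {"EH", "IY", "AA", "AH"}
# Hypothetical confusion clusters of acoustically similar phones:
CLUSTERS = [{"S", "Z"}, {"T", "D"}, {"IY", "IH"}]

def edits(phones):
    """Yield phone sequences exactly one allowed edit away from `phones`."""
    for i, p in enumerate(phones):
        if p not in VOWELS:                       # deletion of a consonant
            yield phones[:i] + phones[i + 1:]
        else:                                     # doubling of a vowel
            yield phones[:i] + [p] + phones[i:]
        for cluster in CLUSTERS:                  # substitution within a cluster
            if p in cluster:
                for q in sorted(cluster - {p}):
                    yield phones[:i] + [q] + phones[i + 1:]

# One-edit neighbors of the pronunciation of "tells" (T EH L Z):
variants = list(edits(["T", "EH", "L", "Z"]))
```

In the full model these edits are compiled into the weighted transducer T, so that edit costs combine with dictionary and language model weights during composition.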
One important property of this method is that it maps words in the decoding vocabulary (41,487 types are possible inputs to the transducer D) into the CMU dictionary vocabulary, which is substantially larger (141,304 types are possible outputs of the transducer D⁻¹). This allows us to generate out-of-vocabulary (OOV) words and phrases, which are not only recognition errors but also plausible variants of different source phrases that can be translated to one target phrase, e.g., verb past tense forms or function words.
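This vocabulary asymmetry can be made concrete with a miniature, hypothetical dictionary (the entries below are stand-ins for the real CMU dictionary): inverting the pronunciation dictionary lets an edited phone sequence surface as a word outside the decoding vocabulary.

```python
# Miniature stand-ins for the real dictionaries:
CMU_DICT = {
    "tells": ("T", "EH", "L", "Z"),
    "tell":  ("T", "EH", "L"),
    "dells": ("D", "EH", "L", "Z"),
}
DECODING_VOCAB = {"tells", "tell"}   # smaller vocabulary feeding transducer D

# Invert the dictionary: pronunciation -> set of words (the role of D⁻¹).
inv = {}
for word, pron in CMU_DICT.items():
    inv.setdefault(pron, set()).add(word)

# A cluster substitution T -> D applied to "tells" yields a pronunciation
# whose word realization lies outside the decoding vocabulary:
edited = ("D", "EH", "L", "Z")
recovered = inv[edited]
oov = recovered - DECODING_VOCAB     # an OOV prediction
```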
Consider the bigram tells the from our synthetic phrase example in Figure 2.

Experimental setups
To establish the effectiveness and robustness of our approach, we conducted two sets of experiments, expASR and expMultilingual, with transcribed and translated TED talks (Cettolo et al., 2012b).6 English is the source language in all the experiments.
In expASR we used tst2011, the official test set of the SLT track of the IWSLT 2011 evaluation campaign on the English-French language pair.7 This test set comprises reference transcriptions of 8 talks (approximately 1.1h of speech, segmented into 818 utterances), 1-best hypotheses from five different ASR systems, a ROVER combination of four systems (Fiscus, 1997), and three sets of lattices produced by the participants of the IWSLT 2011 ASR track.
In this set of experiments we compare the performance of baseline systems to that of systems augmented with synthetic phrases on (1) reference transcriptions, (2) 1-best hypotheses from all released ASR systems, and (3) a set of ASR lattices produced by FBK (Ruiz et al., 2011).8 Experiments with individual systems aim to validate that MT augmented with synthetic phrases can better translate ASR outputs with recognition errors and sequences that were not observed in the MT training data. Consistent performance across different ASR systems is expected if our approach to generating plausible misrecognition variants is universal, rather than biased toward a specific system. Comparing a 1-best system with synthetic phrases to a lattice decoding setup without synthetic phrases should demonstrate whether the n-best plausible misrecognition variants that we generate resemble multiple paths through a lattice.
The purpose of expMultilingual is to show that the translation improvement is consistent across different target languages. This multilingual experiment is interesting because typologically different languages pose different challenges to translation (degree and locality of reordering, morphological richness, etc.). By showing that we improve results across languages (even with the same underlying ASR system), we show that our technique is robust to the different demands that languages place on the translation model. We could not find any publicly available multilingual datasets of translated speech,9 and therefore constructed a new test set.
We use our in-house speech recognizer and evaluate on locally crawled and pre-processed TED audio and text data. We build SLT systems for five target languages: French, German, Russian, Hebrew, and Hindi. Our test systems are thus typologically diverse and trained on corpora of different sizes. We sample a test set of seven talks, representing approximately two hours of English speech, for which we have translations into all five languages;10 the talks are listed in Table 1.
Due to segmentation differences in the released TED (text) corpora and the several automatic preprocessing stages that follow, the numbers of sentences for the same talks are not identical across languages. Therefore, we select the English-French system as an oracle (this is the largest dataset) and first align it with the ASR output. Then we filter the test sets for non-French MT systems, retaining only sentence pairs that are included in the English-French test set. Thus, our test sets for non-French MT systems are smaller, and the source-side sentences of the English-French test set are a superset of the source-side sentences in all five languages. Training, tuning, and test corpora sizes are listed in Table 2.

ASR
The acoustic model is a maximum likelihood system; no speaker adaptation or discriminative training is applied. The acoustic model training data is 186h of Broadcast News-style data. A 5-gram language model with modified Kneser-Ney smoothing is trained with the SRILM toolkit (Stolcke, 2002) on the EPPS, TED, News-Commentary, and Gigaword corpora. The Broadcast News test set contains 4h of audio; we obtain a 25.6% word error rate (WER) on this test set.
We segment the TED test audio by the timestamps at which transcripts appear on the screen. Then, we manually detect and discard noisy hypotheses around segmentation boundaries, and manually align the remaining hypotheses with the references, which form the source side of the English-French MT test set. The resulting test set of 843 hypotheses, sentence-aligned with transcripts, yields 30.7% WER. The higher error rate (relative to the Broadcast News baseline) can be explained by the idiosyncratic nature of the TED genre and the fact that our ASR system was not trained on TED data.
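For reference, WER figures like these are computed as word-level Levenshtein distance divided by reference length. A minimal sketch (not the evaluation code actually used):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

rate = wer("it tells the story", "it tells a story")  # one substitution
```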
For the expASR set of experiments, the ASR outputs and lattices in standard lattice format (SLF) were produced by the participants of the IWSLT 2011 evaluation campaign.

MT
We train and test MT using the TED corpora in all five languages. For French, German, and Russian we use the sentence-aligned training and development sets (without our test talks) released for the IWSLT 2012 evaluation campaign (Cettolo et al., 2012a); we split the Hebrew and Hindi corpora into training and development sets ourselves.11 We split Hebrew and Hindi into sentences with simple heuristics, and then sentence-align them with the Microsoft Bilingual Sentence Aligner (Moore, 2002). Punctuation marks were removed, and the corpora were lowercased and tokenized using the cdec scripts (Dyer et al., 2010).
In all MT experiments, both for sentence and lattice translation, we employ the Moses toolkit (Koehn et al., 2007), implementing the phrase-based statistical MT model (Koehn et al., 2003), and optimize parameters with MERT (Och, 2003). Language models are trained on the training part of each corpus. Results are reported using case-insensitive BLEU with a single reference and no punctuation (Papineni et al., 2002). To verify that our improvements are consistent and not just an effect of optimizer instability (Clark et al., 2011), we train three systems for each MT setup. Statistical significance is measured with the MultEval toolkit.12 Reported BLEU scores are averaged over the three systems.
In the MT adaptation experiments we augment the baseline phrase tables with synthetic phrases. For each entry in the original phrase table we add (at most) the five13 best acoustic confusions, detailed in Section 4.

expASR
We first measure the phrasal coverage of the recognition errors that our technique is able to predict. We compute the number of 1- and 2-gram phrases in the ASR hypotheses from tst2011 that are not in the references: these are ASR errors. Then, we compare their OOV rate in the English-French phrase tables, original vs. synthetic. The purpose of synthetic phrases is to capture misrecognized sequences; ergo, the reduction in the OOV rate of ASR errors in synthetic phrase tables corresponds to the portion of errors that our method was able to predict. Table 4 shows that the OOV rate of n-grams in phrase tables augmented with synthetic phrases drops dramatically, by up to 54%. The consistent reduction across outputs from five different ASR systems confirms that our error-prediction approach is ASR-independent.

Table 4: Phrasal coverage of recognition errors that our technique is able to predict. These are raw counts of 1-gram and 2-gram types that are OOVs in the baseline system and are recovered by our method when we augment the system with plausible misrecognitions. Percentages in parentheses show the OOV rate reduction due to recovered n-grams.
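The coverage measurement above can be sketched as follows; all hypotheses, references, and phrase-table contents below are toy data for illustration, not the actual tst2011 material:

```python
def ngrams(tokens, n):
    """Set of n-gram types in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def oov_rate(error_ngrams, phrase_table_sources):
    """Fraction of error n-grams absent from the phrase table's source side."""
    covered = {tuple(src.split()) for src in phrase_table_sources}
    oov = [g for g in error_ngrams if g not in covered]
    return len(oov) / len(error_ngrams)

hyp = "it tell the story of an airy".split()
ref = "it tells the story of an area".split()
errors = ngrams(hyp, 1) - ngrams(ref, 1)          # unigram ASR errors

baseline  = {"tells", "the", "story"}
augmented = baseline | {"tell", "airy"}           # with synthetic sources
reduction = oov_rate(errors, baseline) - oov_rate(errors, augmented)
```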
Next, we explore the effect of synthetic phrases on translation performance across the different (1-best) ASR outputs. For the references, ASR hypotheses, and ROVERed hypotheses we compare translations produced by MT systems trained with and without synthetic phrases. We detail our findings in Table 5. Improvements in translation are significant for all systems with synthetic phrases. This experiment corroborates the underlying assumption that simulated ASR errors are paired with correct target phrases. Moreover, it supports the claim that incorporating noisier translations in the translation model successfully adapts MT to the SLT scenario and indeed has a positive effect on speech translation. Interestingly, an improvement on reference translations is also observed. We speculate that this stems from better lexical selection due to a smoothing effect that our technique may have.

Finally, we contrast the proposed translation model adaptation approach with the conventional method of lattice translation. We decode the FBK lattices produced for the IWSLT 2011 evaluation campaign and compare the results to the FBK 1-best translation results, which correspond to system1 in Table 5. Table 6 summarizes our main finding: the 1-best system with synthetic phrases significantly outperforms the lattice decoding setup with the baseline translation table.14 The additional small improvement in lattice decoding with synthetic phrases suggests that lattice decoding and phrase table adaptation are two complementary strategies and that their combination is beneficial.

expMultilingual
In the multilingual experiment we train ten MT setups: five baseline setups and five with synthetic phrases, three systems per setup. For each system we compare translations of the reference transcripts and of the ASR hypotheses on the multilingual test set described in Section 6. We evaluate translations produced by MT systems trained with and without synthetic phrases.

Table 7: Comparison of the baseline translation systems with the systems augmented with synthetic phrases. We measure MT performance on the reference transcripts and ASR outputs. Consistent improvements are observed in four out of five languages.
Modest but consistent improvements are observed in four out of five setups with synthetic phrases. Only the French setup yielded a statistically significant improvement (p < .01). However, if we concatenate the outputs of all languages, the improvement in translation of the references becomes statistically significant (p = .03), with the BLEU score averaged over all systems improving from 16.8 for the baseline systems to 17.3 for the adapted MT outputs. While more careful evaluation is required to estimate the effect of acoustic confusions, the accumulated results show that synthetic phrases facilitate MT adaptation to SLT across languages.

Analysis
We conducted a careful manual analysis of actual usages of synthetic phrases in translation. The purpose of this qualitative analysis is to verify that predicted ASR errors are paired with phrases that contribute to a better translation in the target language. Table 8 shows some examples. In the first sentence from the tst2011 test set (output from system 4), the word area was erroneously recognized as airy, which is an OOV word for the baseline system. Our confusion generation algorithm also produced the word airy as a plausible misrecognition variant of the word area and attached it to the correct target phrase zone; this synthetic phrase was selected during decoding, yielding a correct translation of the ASR error. The second example shows similar behavior for the indefinite article a. The third example is taken from the English-Russian system in the multilingual test set. Gauge was produced as a plausible misrecognition variant of age, and was therefore correctly translated (albeit incorrectly inflected) as возраста (age+sg+m+acc). Synthetic phrases were also used in translations containing misrecognized function words, segmentation-related examples, and longer n-grams.

Related work
Predicting ASR errors to improve speech recognition quality has been explored in several previous studies. Jyothi and Fosler-Lussier (2009) develop a weighted finite-state transducer framework for error prediction. They build a confusion-matrix FST between phones to model acoustic errors made by the recognizer. Costs in the confusion matrix combine acoustic variations in the HMM representations of the phones (information from the acoustic model) and word-based phone confusions (information from the pronunciation model).
In their follow-up work, Jyothi and Fosler-Lussier (2010) employ this error-predicting framework to train the parameters of a global linear discriminative language model that improves ASR. Sagae et al. (2012) examined three protocols for 'hallucinating' ASR n-best lists. The first approach generates confusions at the phone level, with a phone-based finite-state transducer that employs real n-best lists produced by the ASR system. The second generates confusions at the word level with an MT-based approach. The third is a phrasal cohorts approach, in which acoustically confusable phrases are extracted from ASR n-best lists based on pivots: identical left and right contexts of a phrase. All three methods were evaluated on the task of ASR improvement through decoding with discriminative language models. Discriminative language models trained on simulated n-best lists produced with the phrasal cohorts method yielded the largest WER reduction on a telephone speech recognition task.
Our approach to generating plausible ASR misrecognitions is similar to previously explored FST-based methods. The fundamental difference, however, is our speech-free phonetic confusion transducer, which does not employ any data extracted from acoustic models or ASR outputs. Simulated ASR errors are typically used to improve ASR applications; to the best of our knowledge, no prior work has integrated ASR errors directly into translation models.

Conclusion
The idea behind the novel ASR error-prediction algorithm that we devise is to identify phonological neighbors with similar distributional properties, i.e., similar-sounding words for which language model probabilities are insufficient for disambiguation. These sequences have been identified as significant contributors to ASR errors (Goldwater et al., 2010). Additional, and even more important, factors that cause recognition errors are disfluencies in speech (Tsvetkov et al., 2013b). In the task of adapting MT to SLT, these and other irregularities can be effectively incorporated in a useful general framework: synthetic phrases that augment phrase tables. Our experiments show that simulated acoustic confusions capture real ASR errors and that the proposed framework effectively exploits them to improve translation.