The impact of Arabic morphological segmentation on broad-coverage English-to-Arabic statistical machine translation

Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of a broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT) system. We explore the largest-to-date set of Arabic segmentation schemes, ranging from full word forms to fully segmented forms, and examine the effects on system performance. Our results show a difference of 2.31 BLEU points, averaged over all test sets, between the best and worst segmentation schemes, indicating that the choice of segmentation scheme has a significant effect on the performance of an English-to-Arabic PBSMT system in a large data scenario. We show that a simple segmentation scheme can perform as well as the best and more complicated segmentation scheme. An in-depth analysis of the effect of segmentation choices on the components of a PBSMT system reveals that text fragmentation has a negative effect on the perplexity of the language models, and that aggressive segmentation can significantly increase the size of the phrase table and the uncertainty in choosing the candidate translation phrases during decoding. An investigation of the output of the different systems reveals the complementary nature of the output and the great potential in combining them.


Introduction
Morphologically rich languages pose a challenge for statistical machine translation (SMT), as these languages possess a large set of morphological features that produce a large number of rich surface forms. This increase in surface forms leads to larger vocabularies and higher sparsity, adversely affecting the performance of SMT systems. The effects of these factors are magnified when translating into a morphologically rich language.
In this work we address the challenge posed by the morphological richness of Arabic in the framework of broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT). We explore the largest-to-date set of Arabic segmentation schemes, ranging from full word forms to fully segmented forms separating every possible Arabic clitic, and we examine the effect on system performance. We conduct an in-depth analysis of the effect of segmentation choices on the different components that make up the PBSMT system, including the language model and the extracted phrase table. We also assess the variation of the Arabic translation output across the different segmentation schemes.
The segmentation schemes are applied in a preprocessing step to both the Arabic side of the training data and the test sets. Twelve different broad-coverage PBSMT systems are trained on the NIST09 Constrained Training Condition Resources (NIST09) data, segmented using these various schemes. The resulting PBSMT systems are evaluated and compared on English-to-Arabic test sets that we construct from existing NIST09 Arabic-to-English test sets. Based on this comparison we identify the best and the worst segmentation schemes and lay out a set of general observations on the effect of splitting off different sets of clitics (affixes) on the performance of a broad-coverage PBSMT system. We also experiment with six different detokenization techniques, of increasing complexity, for recombining the segmented Arabic output.
We then conduct an in-depth analysis of the effect of segmentation on the different components of the PBSMT system by comparing the systems' components along various features defined in this work. We also investigate the variation across the output of the systems trained using the different segmentation schemes.
Previous work that addressed the effect of Arabic rich morphology and tokenization on SMT concentrated on Arabic-to-English machine translation (Lee 2004; Sadat and Habash 2006; Zollmann et al. 2006). However, few works have focused on SMT into Arabic. Sarikaya and Deng (2007) use joint morphological-lexical language models to rerank the output of an English-dialectal Arabic MT system. Research more relevant to our work was done by Badr et al. (2008), who compare a segmented English-to-Arabic system with an unsegmented system and also experiment with a number of detokenization techniques. A more recent work, following in the steps of Badr et al. (2008), was done by El Kholy and Habash (2010a). They experiment with Arabic-side normalization and segmentation, and introduce three additional segmentation schemes. They show that their best segmentation scheme outperforms the best segmentation proposed by Badr et al. (2008).
In contrast with previous works that apply segmentation schemes previously proposed for Arabic-to-English machine translation, we explore the largest-to-date set of Arabic segmentations. Starting from a full word form, we gradually peel off affixes, creating 12 different segmentations. While some of these segmentation schemes were introduced before, others have not been used in any previous work.
Furthermore, previous works applied their Arabic segmentation to a small data scenario of at most 4.5 million words, extrapolating their conclusions to larger data scenarios. In this work we investigate the effect of Arabic segmentation in the framework of a broad-coverage translation system with at least 150M words used as training data. We reveal that in the broad-coverage scenario segmentation schemes behave differently from what has been shown previously for a small data scenario. A simple segmentation that lagged behind in a small data scenario can perform as well as the best and more complicated segmentation scheme. Furthermore, our results demonstrate that the choice of segmentation scheme still has a significant effect on the performance of the PBSMT system in a large data scenario, in contrast to the diminishing effect predicted in previous works.
Finally, while previous works based their conclusions solely on a comparison of the final scores of the different systems, we conduct a deeper investigation and compare the components that make up these systems, providing insight into the reasons behind the differences in system performance.
The remainder of the paper is organized as follows: In Sect. 2 we present some relevant background on Arabic linguistics to motivate the Arabic preprocessing schemes discussed in Sect. 3. The different detokenization schemes are described in Sect. 4. The training and test data are described in Sect. 5, while Sect. 6 describes the experiments and results for all the different segmentation schemes. In Sect. 7 we conduct an analysis of the components making up the different translation systems and investigate the variation in their output. Finally, conclusions and future work are described in Sect. 8.

Arabic morphology and orthography
Arabic is a morphologically rich language with a large set of morphological features that are realized using both concatenative (affixes and stems) and templatic (root and patterns) morphology. Arabic has a set of attachable clitics (affixes), to be distinguished from inflectional features such as gender, number, person, voice, aspect, etc. These clitics attach to the word, increasing the ambiguity of alternative readings. Arabic clitics apply to a word base in a strict order. Table 1 lists the Arabic clitics divided into 4 classes: conjunction proclitics (CONJ+), particle proclitics (PART+), definite article (DET+), and pronominal enclitics (+PRON), which comprise possessive and object pronouns. The first three classes of clitics in Table 1 are given along with their English meaning. The clitics of the fourth class (+PRON) are given followed by O (for object pronoun) or P (for possessive pronoun), followed by their morphological features: person, gender, and number, in this order (Habash and Rambow 2005).
Arabic orthography introduces further challenges, as certain letters in Arabic script are often spelled inconsistently, which leads to an increase in both sparsity (multiple forms of the same word) and ambiguity (same form corresponding to multiple words). One example is the letter Alif, which can appear with Hamza on top or below, and with maddah on top. All these forms are often written as bare Alif. Another example is the two letters Ya and Alif Maqsura, which are often used interchangeably in word-final position. Added to all this is the optionality of diacritics (short vowels) in Arabic script.
This inconsistent variation in raw Arabic text is typically addressed using orthographic normalization, which maps all Alif forms to bare Alif, maps the dotless Ya/Alif Maqsura form to dotted Ya, and deletes diacritics.
El Kholy and Habash (2010a) called this type of orthographic normalization of Arabic text "reduction". This reduction may be acceptable when Arabic is the source language, but is clearly problematic when translating into Arabic. Therefore, we use the "enriched" form of the Arabic raw text throughout this work. In the terminology of El Kholy and Habash (2010a), the enriched form of text uses the correct form of Alif and the right form of Ya and Alif Maqsura in word-final position while omitting all diacritics.
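For contrast with the enriched form used in this work, the "reduction" normalization described above can be illustrated with a minimal sketch. It assumes Unicode Arabic input (not the Buckwalter transliteration used elsewhere in the paper), and the function name `reduce_orthography` is hypothetical; it is not the normalization code used by MADA or by the cited authors.

```python
import re

# Alif with madda above (U+0622), hamza above (U+0623), hamza below (U+0625).
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
BARE_ALIF = "\u0627"
ALIF_MAQSURA = "\u0649"   # dotless Ya / Alif Maqsura
DOTTED_YA = "\u064A"
# Tanween, short vowels, shadda and sukun occupy U+064B..U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")


def reduce_orthography(text: str) -> str:
    """Apply the three reduction steps: drop diacritics, unify Alif, unify Ya."""
    text = DIACRITICS.sub("", text)
    text = ALIF_VARIANTS.sub(BARE_ALIF, text)
    return text.replace(ALIF_MAQSURA, DOTTED_YA)
```

Because the mapping is many-to-one, it cannot be inverted without additional information, which is one way to see why reduced text is problematic as a translation target.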

Arabic preprocessing schemes
We experiment with various Arabic preprocessing schemes by splitting off different subsets of the clitics mentioned in Sect. 2. The raw Arabic text is enriched and tokenized using the Morphological Analysis and Disambiguation for Arabic (MADA) toolkit (Habash and Rambow 2005; Habash 2007). The Arabic tokenization schemes that we experiment with range from coarse segmentation, which uses unsegmented text, to fine segmentation, which splits off all possible clitics. The schemes are described in detail below, from coarse to fine:
• UT: This scheme uses the full (un-tokenized) enriched form of the word (ST in Habash and Sadat 2006). This scheme is used as input to produce the other schemes.
• S0: This scheme splits off the conjunction proclitic w+ (WA in Habash and Sadat 2006).
• S1: This scheme splits off f+ in addition to the w+ split off by S0 (D1 in MADA).
• S2: This scheme splits off all the particle proclitics (PART+) in addition to the clitics split off by S1 (D2 in MADA).
• S3: This scheme splits off all clitics from the (CONJ+) class and all clitics of the (PART+) class except the s+ prefix. It also splits off all the suffixes from the (+PRON) class. This scheme is equivalent to the Penn Arabic Treebank (PATB; Maamouri et al. 2004) tokenization, but to distinguish between the possessive and object pronouns, which have the same surface form, we instead use their morphological features (henceforth, MF form), as given in Table 1 between parentheses.
• S0PR: This scheme splits off all suffixes from the (+PRON) class in addition to the w+ prefix split off by S0. The MF forms of the (+PRON) clitics are used here.
• S4: This scheme splits off all clitics split off by S3 plus the s+ clitic. This scheme is equivalent to the Arabic Treebank: Part 3 v3.2 (ATBv3.2) tokenization. The MF forms of the (+PRON) clitics are used here.
• S5: This scheme splits off all the possible clitics appearing in Table 1. The MF forms of the (+PRON) clitics are used here (D3 in MADA).
We also experiment with a number of variations of these schemes:
• S4SF: Similar to scheme S4 but with the (+PRON) clitics in their surface form.
• S5SF: Similar to scheme S5 but with the (+PRON) clitics in their surface form. This scheme is similar to the main segmentation scheme suggested by Badr et al. (2008).
• S5SFT: Similar to scheme S5 but with the prefixes concatenated together into one prefix. This scheme is similar to the best scheme suggested by Badr et al. (2008).
• S3T: Similar to scheme S3 but with the prefixes concatenated together into one prefix.
Table 2 exemplifies the effect of all the different schemes on the same sentence from the training data.
As can be seen from the example in Table 2, the text's fragmentation increases as we move from coarse to fine tokenization. This increased fragmentation, as we will see in Sect. 4, increases the complexity of recombining the tokens of the Arabic output. However, it also has a positive effect, as it decreases the vocabulary size (word types), which results in lower out-of-vocabulary (OOV) counts on a held-out test set. For each tokenization scheme, Table 3 shows the number of tokens and types on the Arabic side of the training data, and the OOV rate on a held-out set.
The held-out set comprises 728 sentences and 18,277 unsegmented words from the NIST MT02 test set.
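The main scheme definitions in this section can be summarized as a small sketch. It assumes each word has already been analyzed into clitic slots (in this work that analysis comes from MADA); the `Analysis` class, the function names, and the Buckwalter example word are hypothetical illustrations. The sketch emits surface forms only, re-attaches unsplit clitics by plain concatenation, and does not cover the MF forms or the concatenated-prefix variants (S3T, S5SFT).

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Analysis:
    """One analyzed word; clitic slots follow the order CONJ+ PART+ DET+ stem +PRON."""
    stem: str
    conj: Optional[str] = None
    part: Optional[str] = None
    det: Optional[str] = None
    pron: Optional[str] = None


SCHEMES = ("UT", "S0", "S0PR", "S1", "S2", "S3", "S4", "S5")


def clitics_to_split(scheme: str, a: Analysis) -> Set[str]:
    """Return the clitic slots that the given scheme splits off for this word."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme}")
    split: Set[str] = set()
    if scheme == "UT":
        return split
    if scheme in ("S0", "S0PR"):
        if a.conj == "w":            # only the conjunction w+ is split
            split.add("conj")
    else:
        split.add("conj")            # S1 and finer: w+ and f+
    if scheme in ("S2", "S4", "S5") or (scheme == "S3" and a.part != "s"):
        split.add("part")            # S3 keeps the s+ prefix attached
    if scheme == "S5":
        split.add("det")
    if scheme in ("S0PR", "S3", "S4", "S5"):
        split.add("pron")
    return split


def segment(a: Analysis, scheme: str) -> List[str]:
    """Segment one analyzed word; proclitics are marked 'x+', enclitics '+x'."""
    split = clitics_to_split(scheme, a)
    tokens: List[str] = []
    base = ""
    for slot in ("conj", "part", "det"):
        value = getattr(a, slot)
        if value is None:
            continue
        if slot in split and not base:
            tokens.append(value + "+")
        else:
            base += value            # unsplit clitics stay attached to the stem
    base += a.stem
    if a.pron is not None and "pron" in split:
        tokens.extend([base, "+" + a.pron])
    else:
        tokens.append(base + (a.pron or ""))
    return tokens
```

For the hypothetical word `Analysis(stem="yktbwn", conj="w", part="s", pron="hA")`, the sketch keeps the word whole under UT, splits only `w+` under S0, and splits `w+`, `s+` and `+hA` under S4.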

Arabic automatic detokenization
The Arabic output produced by all MT systems trained using the schemes described in Sect. 3, except UT, is segmented and needs to be recombined in order to produce the final Arabic text. We refer to the process of recombining the Arabic output as detokenization.

Challenges of Arabic detokenization
Arabic detokenization is far from being a simple concatenation of the tokens, as several morphological adjustments, driven by morpho-phonological rules, apply to the tokens when they are combined. The first three rows of Table 4 include examples of such morphological adjustments. Another challenging aspect of Arabic detokenization is that in some cases it can be ambiguous, i.e. tokens can be combined into more than one grammatically correct form. Examples of Arabic detokenization ambiguity are given in Table 5. The first column in Table 5 gives the token sequence, while the second column lists all the possible combined forms for this sequence. Each possible combined form is followed by the probability, computed over the training data, of this word being the combined form of the given token sequence. The second line of Table 5 demonstrates that the combined form corresponding to the token sequence can depend on the morphological case of the word base. In this case the word base >bnA "sons" is a noun which can have three cases: nominative, accusative, and genitive.
When a possessive pronoun suffix attaches to >bnA, the case of the noun is marked using three different letters: &, ', and }. However, when no suffix is present, the case marker is a diacritic appearing on the last letter of the noun >bnA. This diacritic is omitted in the Arabic enriched form used here, which creates the ambiguity that we see in the second entry of Table 5.

Detokenization schemes
We experiment with six different detokenization techniques of increasing complexity:
• C: This is the most trivial technique, which just concatenates the tokens of the segmented form together.
• R: This technique uses manually defined morphological adjustment rules to combine the Arabic tokens. Examples of such rules are given in Table 4. We use a script implementing the complete set of morphological adjustment rules as described in El Kholy and Habash (2010b).
• T: Uses a table derived from the Arabic side of the training data to map the segmented form of the word to its original enriched form. If a segmented word has more than one original form, it is mapped to the most frequent one. A segmented word that does not appear in the table is copied to the output as is. For example, in Table 5, the segmented word >bnA +hA is associated with three original forms in the training data with different frequencies (normalized to probabilities). According to the T technique, it will be mapped to >bnA}hA, as it is the form with the highest probability.
• T + C: Similar to the T technique but backs off to the C method when encountering an unknown token sequence.
• T + R: Similar to the T technique but backs off to the R method when encountering an unknown token sequence.
• T + LM + R: In addition to the table used by T + R, this technique also uses a 5-gram language model trained on the full enriched form. The full enriched form of the tokenized input sentence is determined by selecting the FullForm which maximizes: This was implemented using the disambig utility available within the SRILM toolkit (Stolcke 2002).
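The table-based techniques above can be sketched as follows. This is a minimal illustration of T with a pluggable backoff, not the script used in this work: the function names are hypothetical, the default backoff is plain concatenation (giving T + C), and an R-style rule function would have to be supplied separately to obtain T + R. The sketch assumes proclitics are marked with a trailing `+` and enclitics with a leading `+`.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def group_tokens(tokens: List[str]) -> List[List[str]]:
    """Group a token stream into words: 'w+' attaches rightwards, '+hA' leftwards."""
    words: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok.startswith("+") and current:
            current.append(tok)                  # enclitic joins the current word
        elif current and not current[-1].endswith("+"):
            words.append(current)                # previous word is complete
            current = [tok]
        else:
            current.append(tok)                  # first token, or follows a proclitic
    if current:
        words.append(current)
    return words


def concat(group: List[str]) -> str:
    """C technique: drop the clitic markers and concatenate."""
    return "".join(tok.strip("+") for tok in group)


def build_table(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each segmented word to its most frequent original form in training data."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for segmented, original in pairs:
        counts[segmented][original] += 1
    return {seg: c.most_common(1)[0][0] for seg, c in counts.items()}


def detokenize(tokens: List[str], table: Dict[str, str],
               backoff: Callable[[List[str]], str] = concat) -> str:
    """T technique with a backoff for token sequences missing from the table."""
    out = []
    for group in group_tokens(tokens):
        key = " ".join(group)
        out.append(table[key] if key in table else backoff(group))
    return " ".join(out)
```

With a table in which `>bnA +hA` was seen most often as `>bnA}hA` (the counts in any such example are made up), `detokenize` picks that form, while an unseen sequence such as `w+ ktAb` falls through to the backoff.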
To evaluate the detokenization schemes described above, a test set of 50k sentences (∼1.3M words) was randomly selected and removed from the Arabic training corpora. The remaining corpora were used to train the tables for the last four detokenization techniques and the 5-gram language models used by the T + LM + R technique.
Table 6 lists the sentence error rate (SER), as a percentage, of the six detokenization techniques for all the Arabic tokenization schemes that we experiment with. A general theme in Table 6 is that the SER increases as we move from coarse to fine tokenization schemes: the more fragmented the text, the harder it is to recombine. We notice that the SERs for the S3 and S5SF schemes are similar to the SERs of the S3T and S5SFT schemes, respectively. This is because most of the morpho-phonological rules, as discussed in Sect. 4.1, apply to the boundary between the affix and the stem when they are combined. This boundary remains the same when the prefixes are concatenated together.
Going from left to right over the results in Table 6, we notice that the SER drops as the complexity of the detokenization technique increases. However, this drop in SER diminishes as we move up the complexity ladder. The extremely high SER of the C technique demonstrates that detokenization is far from being a simple concatenation of the tokens. From the R column we see that introducing morphological adjustment rules gives a significant improvement over simple concatenation. An additional significant improvement in SER is achieved, especially on fine segmentation, when using tables learned from the data, as in the T technique. In an analysis of the output of the R technique we found that some of the combination errors are caused by tokenization errors introduced by the morphological analyzer. These kinds of errors are fixed by the T method, which demonstrates its ability to cope with errors introduced by the morphological analyzer. Additional improvement in SER is obtained when backing off to the C method, as can be seen from the T + C column in Table 6. Backing off to R, in most cases, gives a minor improvement over backing off to C. Furthermore, using a language model in the detokenization process, as in T + LM + R, gives a very small improvement over the T + R technique. This very small improvement in SER comes at the cost of a 9-fold increase in detokenization time, besides having to load the LM into memory (>1 GB). For these reasons we use the T + R method for detokenizing the output of our SMT systems during evaluation in Sect. 6.
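The SER figures discussed above can be computed along the following lines. This is a minimal sketch under the assumption that a sentence counts as an error whenever its detokenized form differs from the original enriched sentence in at least one word; the function name is hypothetical.

```python
from typing import Sequence


def sentence_error_rate(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of sentences whose detokenized form differs from the reference."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must be parallel")
    if not references:
        return 0.0
    # Compare word sequences so that differences in whitespace alone are ignored.
    errors = sum(h.split() != r.split() for h, r in zip(hypotheses, references))
    return 100.0 * errors / len(references)
```

Since a single wrongly recombined word makes the whole sentence count as an error, SER is a strict measure, which is consistent with the high values reported for the C technique.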

Training and testing data
We use the NIST09 Constrained Training Condition (NIST09) Resources to train and test broad-coverage English-to-Arabic phrase-based statistical machine translation systems.

Training data
The Arabic-English parallel training data available within the NIST09 resources consists of about 5 million sentence pairs with about 150 million and 172 million words on the Arabic and English sides, respectively. The English side of the training corpus was first tokenized using the Stanford English tokenizer and then lowercased. The Arabic side was enriched and the different tokenizations were generated using the Morphological Analysis and Disambiguation for Arabic (MADA) toolkit (Habash and Rambow 2005; Habash 2007). The parallel training corpus was then filtered by first removing sentence pairs longer than 99 words on either side and then deleting unbalanced sentence pairs with a length ratio of more than 4-to-1 in either direction.
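The two filtering steps just described can be sketched as a single predicate. The function name and the handling of empty sides are assumptions made for illustration; lengths are counted in whitespace-separated tokens.

```python
def keep_pair(source: str, target: str, max_len: int = 99, max_ratio: float = 4.0) -> bool:
    """Return True if a sentence pair survives the length and ratio filters."""
    n_src, n_tgt = len(source.split()), len(target.split())
    if n_src == 0 or n_tgt == 0:
        return False                 # assumption: pairs with an empty side are dropped
    if n_src > max_len or n_tgt > max_len:
        return False                 # longer than 99 words on either side
    # Unbalanced pairs: more than 4-to-1 in either direction.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio
```

A pair of 4 and 1 tokens is kept under this reading (the ratio is exactly 4), while a pair of 5 and 1 tokens is removed.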
After preprocessing and filtering, the parallel corpus consisted of 4,867,675 sentence pairs with 152 million words on the English side. The Arabic side of the training corpus is used to train twelve 5-gram language models for the different tokenization schemes using the SRILM toolkit (Stolcke 2002). An additional two 7-gram language models were trained for the S4 and S5 tokenization schemes in order to account for the increase in length of the segmented Arabic. Token and type counts of the Arabic training corpus under the different tokenization schemes are given in Table 3.
The processed and filtered parallel corpus was then aligned using MGIZA++ (Gao and Vogel 2008), an extended and optimized multi-threaded version of GIZA++. The Moses toolkit (Koehn et al. 2007) is then used to symmetrize the alignment using the grow-diag-final-and heuristic and to extract phrases with a maximum length of 7. A distortion model lexically conditioned on both the Arabic phrase and the English phrase is then trained.

Tuning and testing sets
We use existing Arabic-to-English test sets available within the NIST09 resources to construct our English-to-Arabic tuning and test sets. As all NIST09 test sets were intended for use in Arabic-to-English machine translation, each Arabic source sentence is associated with four English references. From such a test set, an English-to-Arabic test or tuning set can be constructed in a number of ways. One possibility is to pair each Arabic source sentence with only one of the four English references, giving us four different single-reference test sets. Alternatively, an English-to-Arabic test set can be constructed by pairing each Arabic source sentence with all four English references, resulting in a single-reference test set four times larger than the test sets constructed in the first way.
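The two construction strategies can be sketched as follows. The function names and the data layout (one list of English references per Arabic sentence) are assumptions made for illustration; each resulting item is an (English source, Arabic reference) pair.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (English source, Arabic reference)


def single_reference_sets(arabic: Sequence[str],
                          english_refs: Sequence[Sequence[str]]) -> List[List[Pair]]:
    """Build one English-to-Arabic set per reference index (four sets for four references)."""
    if not english_refs:
        return []
    n_refs = len(english_refs[0])
    return [[(refs[j], ar) for ar, refs in zip(arabic, english_refs)]
            for j in range(n_refs)]


def pooled_set(arabic: Sequence[str],
               english_refs: Sequence[Sequence[str]]) -> List[Pair]:
    """Pair every English reference with its Arabic sentence, in one larger set."""
    return [(en, ar) for ar, refs in zip(arabic, english_refs) for en in refs]
```

With four references per sentence, the first function yields four sets of the original size, and the second yields one set four times as large in which each Arabic sentence appears four times as a reference.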
Before deciding which of the above techniques to use in constructing the English-to-Arabic tuning set, we tested the effect of these different construction techniques on the overall performance of the PBSMT system. Using the techniques described above, we construct 5 different English-to-Arabic tuning sets using 728 sentences chosen from the NIST09 MT02 test set. The UT system is then tuned on each of the tuning sets and tested on English-to-Arabic test sets constructed from the NIST MT03-MT05 test sets by pairing each Arabic source sentence with the first English reference. We report the results on the MT03-MT05 test sets using the BLEU-4 (Papineni et al. 2002) evaluation metric. All the results are given in Table 7.
UTi is the UT system tuned on a tuning set constructed from MT02 by pairing the Arabic source with the i-th English reference, while UTAll is the UT system tuned on the tuning set constructed by pairing the Arabic source with all four English references. Comparing the performance of the systems UT1-UT4 and UTAll, we notice that there is no significant difference between the scores of UT1, UT3, UT4 and UTAll on MT03-MT05, while UT2 performs the worst, especially on MT04 and All the systems in this work are tested on the MT03-MT05 test sets used in this section. Table 8 includes information about the tuning set and all the test sets, including the number of sentences and tokens, and the division of sentences according to their genres.

Results
We test and compare the performance of twelve PBSMT systems trained using the different tokenization schemes.The systems use the translation, reordering and language models described in Sect. 5.
The decoding weights for these components were optimized for BLEU-4 (Papineni et al. 2002) on the MT02 tuning set using an implementation of the Minimum Error Rate Training procedure (Och 2003). We use the Moses decoder (Koehn et al. 2007) with a distortion window of 6 to decode the MT03, MT04, and MT05 test sets. As discussed in Sect. 4.2, we use the T + R detokenization technique to recombine the Arabic tokens of the different segmentation schemes. The evaluation results reported are all on the detokenized output of the systems, evaluated against unsegmented, enriched, single-reference test sets.
We report the results on all test sets using a number of evaluation metrics including BLEU-4, TER (Snover et al. 2006), and METEOR (Lavie and Denkowski 2009; METEOR v1.2, language-independent version). Table 9 lists the translation results of all the systems on MT03 using all the evaluation metrics.

All statements below about differences in BLEU score were tested for statistical significance using paired bootstrap resampling (Koehn 2004) with a 95% confidence interval. Looking at the results, we see that across all test sets S0/S4/S3 perform best (highlighted in bold), while S2/S5SF (highlighted in italics) perform the worst. The performance of all the other segmentation schemes falls between these two ends.
The difference in translation scores between S0 and S5SF is 2.31 BLEU, −2.28 TER and 1.75 METEOR points averaged over all test sets. This large difference in translation quality indicates that the choice of segmentation scheme has a significant effect on the performance of English-to-Arabic PBSMT systems in a large data scenario. The S4 (ATBv3.2) scheme outperforms S5SFT, the best scheme in Badr et al. (2008), by 2.25 BLEU points averaged over all test sets.
The results also show that the simple segmentation scheme S0, which just splits off w+ (and), can perform as well as the best and more complicated S4 scheme. The simplicity of S0 gives it an advantage over S4, as it can be both generated and recombined with a lower error rate in the tokenization and detokenization processes, respectively, as described in Sect. 4.
Comparing the scores of the different schemes across all test sets, we also make the following observations:
• S1 outperforms S2 on all test sets, which indicates that splitting off the particle proclitics (PART+) can hurt performance.
• The effect of splitting off the (+PRON) suffixes depends on the prefixes that are split off. When the only prefix that is split off is w+, as in S0, splitting off the (+PRON) suffixes in S0PR causes an insignificant drop of 0.15 BLEU points (no change) averaged over all test sets. However, when the prefixes split off are the (PART+) and (CONJ+) clitics, as in S2, splitting off the (+PRON) suffixes as in S3 causes a significant increase of 1.44 BLEU points averaged over all test sets.
• S4 outperforms S5 on all test sets, indicating that splitting off the definite article Al+ hurts performance.
• S3 and S4 perform about the same on all test sets, indicating that splitting off the s+ (will) clitic has no significant effect on the performance of the system.
• Comparing S4 with S4SF and S5 with S5SF, we see that using morphological features instead of the surface form of the suffixes can only benefit the system.
• Concatenating the prefixes together improved the performance of the S5SF scheme by a significant 1.07 BLEU points averaged over all test sets, while in the case of S3 it caused an insignificant drop of 0.16 BLEU points (no change) averaged over all test sets. This indicates that concatenating the prefixes has a positive effect on the most fragmented scheme, S5SF, but that this effect diminishes as the scheme becomes less fragmented, as in the case of S3.
• Comparing the 7-gram S4 and S5 systems (S4-5.7gram) with their 5-gram counterparts (S4-5) on all test sets indicates that using higher-order (>5) n-grams for highly fragmented schemes has no significant effect on the performance of the system.
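The paired bootstrap resampling test used for the significance statements in this section can be sketched as follows, in the spirit of Koehn (2004). The function name, the number of samples, and the generic `metric` argument are assumptions made for illustration. For BLEU, the per-sentence items would be sufficient statistics (n-gram matches and lengths) and `metric` would aggregate them into a corpus-level score; the mean of per-sentence scores in the usage note below is only a stand-in.

```python
import random
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")


def paired_bootstrap(stats_a: Sequence[S], stats_b: Sequence[S],
                     metric: Callable[[Sequence[S]], float],
                     n_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A scores higher than system B.

    stats_a[i] and stats_b[i] must describe the same test sentence i.
    """
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; use the same draw for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if metric([stats_a[i] for i in idx]) > metric([stats_b[i] for i in idx]):
            wins += 1
    return wins / n_samples
```

Under this reading, system A would be declared significantly better than system B at the 95% level when the returned fraction is at least 0.95, for example with `paired_bootstrap(scores_a, scores_b, metric=lambda xs: sum(xs) / len(xs))`.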

Systems comparison
In previous sections we described all the different segmentation schemes and their effect on the final performance of the systems. In this section we conduct an in-depth analysis of the effect of segmentation choices on the different components that make up the PBSMT system, including the language model and the extracted phrase table.
We also assess the variation of the Arabic translation output across the different segmentation schemes.

Language models
The Arabic side of the training corpus for each of the tokenization schemes was used to train twelve 5-gram language models using modified Kneser-Ney smoothing and cutoffs of 1 for orders higher than 2. The size of the training corpus used to build each language model is given in Table 3, Sect. 3. The different language models are compared by computing the n-gram precision (coverage) and perplexity on the Arabic side of the MT03 test set. The n-gram precision is defined as the percentage of n-grams in the test set which appear in the language model. Table 12 lists the size of the MT03 test set and the type/token n-gram precision for all the language models trained using the different segmentation schemes. The perplexity of each language model on the MT03 test set is also given in Table 12.
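The n-gram precision defined above can be computed with a sketch like the following. It assumes the language model's n-grams of the relevant order are available as a set of tuples; the function names are hypothetical, and how the n-grams are read from an actual language model file is left out.

```python
from typing import List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> List[NGram]:
    """All contiguous n-grams of one tokenized sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(test_sentences: Sequence[Sequence[str]],
                    lm_ngrams: Set[NGram], n: int) -> Tuple[float, float]:
    """Return (token precision, type precision) in percent for order n."""
    test = [g for sent in test_sentences for g in ngrams(sent, n)]
    if not test:
        return 0.0, 0.0
    types = set(test)
    token_precision = 100.0 * sum(g in lm_ngrams for g in test) / len(test)
    type_precision = 100.0 * sum(g in lm_ngrams for g in types) / len(types)
    return token_precision, type_precision
```

Token precision counts every occurrence of an n-gram in the test set, whereas type precision counts each distinct n-gram once, so frequent n-grams weigh more heavily in the former.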
Looking at Table 12, we notice that the more fragmented the scheme, the higher the n-gram precision. We also notice that the difference between the n-gram precision of a fine and a coarse scheme becomes more significant for higher-order n-grams. This difference in n-gram precision between coarse and fine segmentations is reflected in the perplexity scores on the test set. The perplexity steadily decreases from 108.682 for the UT scheme down to 33.24 for the most fragmented scheme, S5. However, the n-gram precision and the perplexity were computed over tokens, and the definition of a token varies across the different segmentation schemes. This variation is expressed in the different sizes of the MT03 test set for each scheme, which makes a comparison of the language models based on n-gram precision and perplexity much less meaningful. One way to make the comparison of the different language models' perplexities more meaningful is to use "normalized perplexity" (Kirchhoff et al. 2006).
The normalized perplexity of a k-gram language model on a test set of size M is given in Eq. 1. As we see in Eq. 1, the normalized perplexity differs from the regular perplexity only in the normalization factor. In the case of normalized perplexity, the log likelihood of the data is averaged by dividing it by the number of unsegmented words N in the test set, as opposed to the number of tokens M in the test set. This is done in order to compensate for the fact that perplexity tends to be lower for a text containing more individual units, since the sum of log probabilities is divided by a larger denominator.
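The difference between the two normalizations can be illustrated with a short sketch. It assumes natural-log token probabilities are already available from a language model (toolkits such as SRILM report base-10 logs, which would need converting), and the function name and the probabilities in the usage note are made up for illustration.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float], denominator: int) -> float:
    """exp of the negative total log likelihood divided by the chosen denominator.

    denominator = M (number of tokens)            -> regular perplexity
    denominator = N (number of unsegmented words) -> normalized perplexity
    """
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return math.exp(-sum(token_log_probs) / denominator)
```

As a made-up example, take a text of N = 2 unsegmented words. An unsegmented model assigning each word probability 0.01 and a segmented model splitting the text into M = 4 tokens with probability 0.1 each give the same total likelihood. Regular perplexity nevertheless falls from about 100 to about 10, while normalized perplexity stays at about 100 for both, which is the effect the normalization is meant to remove.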
The normalized perplexities of all language models are given in the last column of Table 12. The normalized perplexities give a totally different picture from the one we got by comparing regular perplexities. We see that normalized perplexities increase as we move from coarse to fine segmentation. The most significant change in normalized perplexity occurs when moving from S4 to S5, where the normalized perplexity increases by 12.79%. As S5 differs from S4 in splitting off an additional prefix, Al+ (the), this large increase in normalized perplexity indicates that splitting off Al+ has a significant negative effect on the language model. The low normalized perplexities that we see in Table 12 for the UT and S0 language models contribute to the fact that coarse segmentation systems can perform as well as systems built using the more complicated schemes. Furthermore, we notice that using morphological features instead of surface forms for the suffixes has no significant effect on the perplexity of the language model, as can be seen from comparing S5 to S5SF and S4 to S4SF. We also notice that the difference in normalized perplexity between the language models of S5SF and S5SFT is 1.33 points, compared to a difference of 0.034 between S3 and S3T. This contributes to the significant difference in performance between the S5SF and S5SFT systems, compared to the much smaller difference between the S3 and S3T systems, in Tables 9, 10, and 11.

Phrase table
The phrase table is one of the most important components of a PBSMT system. In this section we compare and analyze the differences between the phrase tables built and trained on the various segmentation schemes defined in this work.
All the phrase tables are first filtered to the MT03 test set and then contrasted according to several features:
• Average number of target phrases per phrase length: The phrase table entropy (PTE) provides a measure of the amount of uncertainty in choosing a translation, averaged over all phrases in the phrase table. However, it is useful to zoom in on the phrase table entropy and look into the target-side ambiguity of the phrase table for each phrase length. Therefore, we compute the average number of target phrases (ANTP) for each phrase length from 1 to 7 (the maximum phrase length). All the results are given in Table 13.
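The ANTP feature can be sketched as follows. The sketch assumes the phrase table has already been read into (source phrase, target phrase) pairs and that phrases are grouped by the length of the source phrase in tokens; both the grouping and the function name are assumptions made for illustration, not a description of the exact computation behind Table 13.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def antp_by_length(phrase_pairs: Iterable[Tuple[str, str]],
                   max_len: int = 7) -> Dict[int, float]:
    """Average number of distinct target phrases per source phrase, by source length."""
    targets: Dict[str, Set[str]] = defaultdict(set)
    for source, target in phrase_pairs:
        targets[source].add(target)
    totals: Dict[int, int] = defaultdict(int)
    counts: Dict[int, int] = defaultdict(int)
    for source, tgts in targets.items():
        length = len(source.split())
        if 1 <= length <= max_len:
            totals[length] += len(tgts)
            counts[length] += 1
    return {length: totals[length] / counts[length] for length in sorted(counts)}
```

A higher ANTP for a given length means the decoder has, on average, more candidate translations to choose from for source phrases of that length.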
Looking at Table 13, we notice that the number of phrase pairs grows steadily and gradually when moving from the coarse UT to the fine S4SF scheme, while the number of source phrases remains relatively stable. The PTE for these segmentations does not change significantly and remains in the range 3.33-3.41. However, the most significant increase in phrase table size and PTE happens when moving from S4SF to the S5 scheme and its variants S5SF and S5SFT. The size of the phrase table increases by 21.6% relative when comparing S5 to S4, while the number of source phrases decreases by 3.31%. This significant increase in the size of the phrase table, set against a small change in the number of source phrases, adds to the uncertainty in choosing the candidate translation phrases, as can be seen by comparing the PTEs of the two systems: the PTE increases by 10% relative when moving from S4 to S5. A clearer explanation for this increase in PTE can be found by comparing the ANTPs of the S4 and S5 systems. The ANTP of S5 is higher than the ANTP of S4 for short phrases but lower for longer phrases. The ANTP1 of the S5 system is 26.62% higher than the ANTP1 of S4. This difference drops to 14.21% for ANTP2 and 4.73% for ANTP3. The trend reverses for ANTP4 and higher, where the ANTP of S5 becomes lower than that of S4. The ANTP4 of S5 is 3.1% lower than the ANTP4 of S4; this difference grows to 9.72% for ANTP5, 34.62% for ANTP6, and 42.55% for ANTP7. The relatively high PTE and ANTPs of S5 and its variants contribute to the fact that these segmentations are among the worst performing, as seen in Sect. 6.
One reason for the significant difference in phrase table size, PTE, and ANTP between S5 and S4 (and the other schemes) can be found by looking at the set of affixes that these two schemes split off. The only difference between the S4 and S5 schemes is that S5 splits off Al+ (the) in addition to all the affixes split off in S4. From the results discussed above, we conclude that splitting off Al+ causes a significant increase in the size of the phrase table and magnifies the ambiguity and uncertainty inherent in the target-side choice in the phrase table, especially for shorter phrases.
We looked into the phrase tables of both S4 and S5 and found several cases of source phrases for which splitting off Al+ caused an increase in the average number of target phrases. One of the most frequent cases was source phrases with the "noun adjective" POS pattern. In Arabic, the adjective agrees with the noun in definiteness, which is expressed by prefixing Al+ to the word. For example, the expression Al$rq Al>wsT (lit. "the east the middle") "the middle east" could also appear in the indefinite form as $rq >wsT (lit. "east middle") "middle east", but never in the ungrammatical form $rq Al>wsT. However, we found that when splitting off the Al+ prefix as in S5, an Arabic phrase such as $rq Al# >wsT could be extracted from the Arabic text and end up as a target phrase for the English source phrase "middle east". Such cases are frequent and increase the average number of target phrases by introducing ungrammatical target phrases that did not exist in the S4 phrase table, especially for short source phrases (<3).
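The effect described above can be reproduced on the example phrase with a small sketch. It is a deliberate simplification: real phrase extraction only keeps spans that are consistent with the word alignment, whereas this code simply enumerates every contiguous token span. The S5 tokenization "Al# $rq Al# >wsT" is inferred from the "$rq Al# >wsT" example in the text, and the helper names are hypothetical.

```python
def contiguous_phrases(tokens, max_len=7):
    """All distinct contiguous token spans of up to max_len tokens."""
    return {
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1)
    }


def reattach_al(phrase):
    """Undo the Al+ split so S5 candidates can be compared with S4 ones."""
    return phrase.replace("Al# ", "Al")


s4_tokens = "Al$rq Al>wsT".split()       # Al+ kept attached
s5_tokens = "Al# $rq Al# >wsT".split()   # Al+ split off as a separate token

s4_phrases = contiguous_phrases(s4_tokens)
s5_phrases = contiguous_phrases(s5_tokens)

# Spans that become extractable only once Al+ is a separate token.
new_in_s5 = {p for p in s5_phrases if reattach_al(p) not in s4_phrases}

print(len(s4_phrases), len(s5_phrases))
print("$rq Al# >wsT" in new_in_s5)
```

With Al+ attached, the only candidate spans are the two words and the full expression. Once Al+ is a token of its own, spans can start or end between the article and its host word, which is how the mixed-definiteness phrase becomes a candidate.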

Output variation
An important question here is how different the outputs of the PBSMT systems trained with the different segmentation schemes are.
One way to quantify the output variation is to find out how much gain in performance, compared to the best single system, can be achieved by performing an oracle combination over the output of all the systems. Therefore, we conduct an oracle study of system combination.
An oracle combination output was created by selecting, for each input sentence, the output of the system with the highest sentence-level METEOR score. One way to do this oracle combination is to include the output of all the systems built in this work and then evaluate the combined output. However, it is more useful to divide the systems into groups of related systems in order to isolate each group's contribution to the performance of the final combined system. This gives us insight into the variation of the output across the different system groups.
We start by performing an oracle combination on the systems in the first group (G1). Then we gradually add each group to the combined systems. Table 14 lists the five system groups and the names of the systems in each group. The results of the combined systems on MT03, MT04, and MT05 are given in Tables 15, 16, and 17, respectively. The best single system (BSS) for each test set is used as a baseline.
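The oracle procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sentence-level scorer is a simple unigram F1 used as a stand-in for METEOR (it is not METEOR), the group memberships are hypothetical placeholders for those in Table 14, and the sentences are dummy token strings.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Stand-in sentence-level score (NOT METEOR): unigram F1 overlap."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def oracle_combine(outputs, references, score):
    """For each sentence, pick the output of the best-scoring system."""
    combined = []
    for i, ref in enumerate(references):
        best = max(outputs, key=lambda name: score(outputs[name][i], ref))
        combined.append(outputs[best][i])
    return combined


# Dummy data: one list of output sentences per system.
references = ["a b c", "d e f"]
outputs = {
    "S0": ["a b c", "d x y"],
    "S4": ["a x c", "d e f"],
    "S5": ["a x y", "d e y"],
}

# Hypothetical grouping; systems are added to the pool group by group.
groups = [["S0"], ["S4"], ["S5"]]
pool, results = {}, []
for group in groups:
    for name in group:
        pool[name] = outputs[name]
    results.append(oracle_combine(pool, references, unigram_f1))

print(results[-1])
```

Because the pool only grows, the oracle score can never decrease from one group to the next; the size of each increase reflects how much new, complementary output the added group contributes.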
Looking at Tables 15, 16, and 17, we notice a significant improvement in the performance of the oracle combination of all the systems (G5) over the best single system (BSS). The G5 system outperforms the BSS by 7.28 BLEU points averaged over all test sets. This large difference between the combined system and the BSS indicates the complementary nature of the output produced by the systems using different schemes. It also demonstrates the great potential in automatically combining the output of the different systems. These results are consistent with the results of Sadat and Habash (2006), who demonstrate, using oracle combination, the great potential in automatically combining the output of different Arabic-to-English systems that use different Arabic segmentations in a small data scenario.

In this work we investigated the impact of Arabic morphological segmentation on the performance of a broad-coverage English-to-Arabic SMT system. We explored the largest-to-date set of Arabic segmentation schemes, ranging from full word forms to fully segmented forms, and we examined the effects on system performance. Our results show a difference of 2.31 BLEU points averaged over all test sets between the best and worst segmentation schemes, indicating that the choice of segmentation scheme has a significant effect on the performance of English-to-Arabic PBSMT systems in a large data scenario. We also show that a simple segmentation scheme which just splits off the w+ (and) can perform as well as the best and more complicated (ATBv3.2) segmentation scheme.
An in-depth analysis of the effect of segmentation choices on the components that make up a PBSMT system reveals that the normalized perplexities of the language models increase as we move from coarse to fine segmentation. The analysis also shows that aggressive segmentation such as S5, which splits off all possible affixes including Al+ (the), can significantly increase the size of the phrase table and the uncertainty in choosing the candidate translation phrases during decoding, which has a negative effect on machine translation quality. A significant improvement of 7.28 BLEU points averaged over all test sets is achieved over the best single system in an oracle combination of the output of the different systems. This demonstrates the complementary nature of the output and the great potential in automatically combining the output of the different systems.
Following the findings in this work, we plan to experiment with automatic system combination on the output of the systems built here. We also plan to explore whether the current findings extend to English-to-Arabic syntax-based and hierarchical SMT systems.

Table 1
Arabic clitics divided into four classes

Table 6
SER for the different tokenization schemes using the six different detokenization schemes

Table 7
MT05. Therefore, for tuning all the systems built in this work, we use a tuning set constructed from the MT02 test set by pairing each Arabic source sentence with the first English reference.

Table 9
BLEU, TER, and METEOR scores for all the systems on the MT03 test set

Table 12
Number of tokens, type/token n-gram precision, perplexity and normalized perplexity on the MT03 test set for all the language models

• Number of source phrases and phrase pairs: For each scheme we calculate the number of phrase pairs and source phrases. The results are given in the first two columns of Table 13.
• Phrase Table Entropy (PTE): Phrase Table Entropy (Koehn et al. 2009) captures the amount of uncertainty involved in choosing candidate translation phrases. For each source phrase s with a set of possible translations (target sides) in the phrase table T, the phrase entropy of s, PE(s), is defined in Eq. 2. The Phrase Table Entropy is defined as the average of the phrase entropy over all the source phrases in the phrase table. Table 13 gives the phrase table entropy for all schemes.
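The PTE computation can be sketched as follows. Eq. 2 is not reproduced in this part of the text, so the sketch assumes the usual Shannon form, PE(s) = -sum over t of p(t|s) * log2 p(t|s), with p(t|s) taken from the phrase table's translation probability; the log base, the choice of probability feature, and the toy entries are all assumptions.

```python
import math
from collections import defaultdict


def phrase_entropy(probs):
    """Entropy (in bits) of the translation distribution of one source phrase."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def phrase_table_entropy(entries):
    """Average phrase entropy over all distinct source phrases.

    entries: iterable of (source, target, p(target | source)) triples.
    """
    by_source = defaultdict(list)
    for src, _tgt, prob in entries:
        by_source[src].append(prob)
    return sum(phrase_entropy(ps) for ps in by_source.values()) / len(by_source)


# Toy phrase table (illustrative probabilities, not from the paper).
toy_entries = [
    ("middle east", "Al$rq Al>wsT", 0.5),
    ("middle east", "$rq >wsT", 0.5),
    ("east", "Al$rq", 1.0),
]
pte = phrase_table_entropy(toy_entries)
print(pte)
```

A source phrase with a single certain translation contributes zero entropy, while one with two equally likely translations contributes one bit, so the toy table averages to half a bit.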

Table 13
All the features calculated for the different phrase tables of the various segmentation schemes