Investigations of Issues for Using Multiple Acoustic Models to Improve Continuous Speech Recognition

This paper investigates two important issues in constructing and combining ensembles of acoustic models for reducing recognition errors. First, we investigate the applicability of the AnyBoost algorithm to acoustic model training. AnyBoost is a generalized Boosting method that allows an arbitrary loss function to be used as the training criterion for constructing an ensemble of classifiers. We choose the MCE discriminative objective function for our experiments. Initial test results on a real-world meeting recognition corpus show that AnyBoost is a competitive alternative to the standard AdaBoost algorithm. Second, we investigate ROVER-based combination, focusing on the technique for selecting correct hypothesized words from the aligned WTN (Word Transition Network). We propose a neural network based insertion detection and word scoring scheme for this purpose. In our experiments, this approach consistently outperforms the voting technique currently used by ROVER.


Introduction
The past few years have witnessed the success of Boosting algorithms in many research fields, including continuous speech recognition. In Boosting training, a set of recognition models is generated iteratively such that the examples misclassified by the current model are given higher weights in the training of subsequent models. At recognition time, the hypotheses predicted by the individual models are combined to form the final hypothesis using techniques such as majority voting. Most Boosting approaches, such as AdaBoost [1] and LogitBoost [2], can be viewed as special cases of AnyBoost, an abstract algorithm developed in [3] as an endeavor to unify Boosting training into a generalized framework via gradient descent in function space. The advantage of AnyBoost is that it provides a platform that enables us to investigate appropriate loss functions.

MCE (Minimum Classification Error) is a discriminative training method used extensively in continuous speech recognition [4]. The goal of MCE is to increase the separability between the desired and competing classes. In MCE training, model parameters are optimized by minimizing the value of a sigmoid-based differentiable loss function that quantitatively measures the classification error on the training set. The successes of both Boosting and MCE suggest that a combination of these two techniques may have the capability to improve the performance of acoustic modeling. This paper presents an AnyBoost-based training scheme that uses the MCE discriminative criterion for constructing ensembles. Our solution differs from the work of other researchers [5], in which MCE is performed as a separate post-processing module to update the classifiers obtained from Boosting training.

Generating the final hypothesis from the ensemble is another important issue for ensemble based continuous speech recognition. Standard Boosting uses sentence level majority voting to select the most likely hypothesis as the final output. This method ignores important information associated with the individual words in a hypothesis, such as confidence and segmentation. Research has shown that ROVER (Recognizer Output Voting Error Reduction) [6], a word level combination method that integrates such word information, can significantly improve recognition accuracy. However, the present version of ROVER has its own weakness. For example, the voting module adopted by ROVER to search for the best word sequence in the WTN (Word Transition Network) is essentially a linear combination of two types of information: frequency of occurrence and confidence score. In many cases, the correct hypothesized words cannot be found due to the simplicity of this strategy. This paper proposes a neural network based two-level scoring scheme to address this problem. Once the WTN is constructed, we first use a binary classifier to determine whether the WTN node in question corresponds to an insertion error. If not, each word in the node is scored on the basis of a variety of features extracted from multiple information sources, and the word with the highest score is chosen as the decoding result.

AnyBoost with MCE criterion
In this section we investigate the AnyBoost algorithm with an MCE loss function for acoustic model training.

Discriminative Loss Function of MCE
Let $g_y(x; \Lambda)$ denote the discriminant function (log likelihood score) of class $y$ for an input utterance $x$, where $\Lambda$ is the set of acoustic model parameters. For a training example $(x_i, y_i)$, the MCE misclassification measure is defined as

$$d_i(x_i; \Lambda) = -g_{y_i}(x_i; \Lambda) + \frac{1}{\eta}\log\Big[\frac{1}{|Y|-1}\sum_{y \neq y_i} \exp\big(\eta\, g_{y}(x_i; \Lambda)\big)\Big] \qquad (1)$$

where the parameter $\eta$ controls how the competing classes $y \neq y_i$ are weighted. If we create a pseudo class $\bar{y}_i$ to cover all the competing classes, then (1) can be rewritten as

$$d_i(x_i; \Lambda) = -g_{y_i}(x_i; \Lambda) + g_{\bar{y}_i}(x_i; \Lambda). \qquad (2)$$

The misclassification measure is embedded in a sigmoid-based differentiable loss function

$$\ell_i = \frac{1}{1 + \exp(-\rho\, d_i)} \qquad (3)$$

where $\rho$ controls how sharply the sigmoid function changes at the transition point. The overall MCE loss is the sum of the per-example losses over the training set, $L = \sum_i \ell_i$; $\ell_i$ approaches 1 when $x_i$ is misclassified and 0 when it is correctly classified with a large margin.
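As a concrete illustration, here is a minimal sketch (ours, not from the paper; the hypothesis labels and score values are made up) that computes the misclassification measure and the sigmoid loss for one training example:

```python
import numpy as np

def mce_loss(scores, correct, eta=1.0, rho=2.0):
    """Sigmoid MCE loss for one training example.

    scores  : dict mapping class label -> discriminant score g_y(x)
    correct : label of the desired class y_i
    eta     : controls how the competing classes are weighted
    rho     : controls the sharpness of the sigmoid
    """
    g_correct = scores[correct]
    competitors = np.array([g for y, g in scores.items() if y != correct])
    # anti-discriminant of the pseudo competing class (log-sum-exp form of eq. (1))
    g_competing = (1.0 / eta) * np.log(np.mean(np.exp(eta * competitors)))
    d = -g_correct + g_competing            # misclassification measure
    loss = 1.0 / (1.0 + np.exp(-rho * d))   # sigmoid loss, eq. (3)
    return d, loss

# Toy example: three sentence hypotheses with made-up log scores.
d, loss = mce_loss({"hyp_a": -12.0, "hyp_b": -13.5, "hyp_c": -14.0},
                   correct="hyp_a")
print(d, loss)  # d < 0 (correctly classified), loss close to 0
```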

AnyBoost
Traditionally, MCE is only used to train a single classifier. The AnyBoost algorithm provides a general framework for Boosting approaches and allows the use of the MCE criterion to generate ensembles.
Let

$$F_t = \sum_{s=1}^{t} \alpha_s f_s \qquad (4)$$

denote the ensemble of classifiers after the $t$-th component $f_t$ has been learned. The goal of AnyBoost is to iteratively find a new classifier $f_{t+1}$ such that the loss $L(F_t + \alpha_{t+1} f_{t+1})$ decreases, where $\alpha_{t+1}$ denotes the weight of $f_{t+1}$ in the ensemble. Viewed in terms of function space, this is equal to seeking the "direction" $-\nabla L(F_t)$, the negative functional gradient of the loss $L$ at $F_t$. However, in some situations we cannot choose $f_{t+1} = -\nabla L(F_t)$ directly, because the new classifier is constrained to a restricted function class; instead we choose the $f_{t+1}$ that maximizes

$$\big\langle f_{t+1},\; -\nabla L(F_t) \big\rangle, \qquad (5)$$

the inner product of the new classifier with the negative gradient of the loss function. Once $f_{t+1}$ is obtained, its weight $\alpha_{t+1}$ is determined by a line search over $L(F_t + \alpha_{t+1} f_{t+1})$.

For the MCE loss defined in (1)-(3), maximizing the inner product (5) amounts to training the new classifier on a re-weighted training set in which the weight of example $i$ is proportional to the derivative of its loss,

$$D_{t+1}(i) \propto \rho\, \ell_i\,(1 - \ell_i). \qquad (6)$$

Please note that finding an $f_{t+1}$ that maximizes (5) is therefore equivalent to minimizing the weighted classification error $\varepsilon$ under the distribution $D_{t+1}$. The preceding discussion results in the MCE based AnyBoost algorithm, illustrated in Table 1, in which new classifiers are iteratively generated to minimize the weighted error $\varepsilon$.

Table 1 AnyBoost with MCE criterion
Initialization: assign uniform weights $D_1(i)$ to the training examples.
For t = 1 to T:
• Train a new classifier $f_t$ to minimize the weighted error $\varepsilon$ under $D_t$.
• Choose weight $\alpha_t$ for $f_t$ via line search.
• Update the ensemble $F_t = F_{t-1} + \alpha_t f_t$ and recompute the example weights $D_{t+1}(i)$ from the MCE loss.
In traditional AdaBoost training, the condition for increasing the weight of an example is that the example is misclassified, namely $g_{y_i}(x_i) < g_{\bar{y}_i}(x_i)$. In contrast, (6) shows that in AnyBoost with the MCE criterion, the weight of an example is maximized when $g_{y_i}(x_i) \approx g_{\bar{y}_i}(x_i)$, i.e., when the example lies close to the decision boundary, and it decays for examples that are already classified with a large margin or are grossly misclassified.
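For concreteness, the following sketch (our own, assuming generic `train_weighted`, `mce_loss`, and `line_search` callables that are not part of the paper or of Sphinx) shows how the loop in Table 1 and the re-weighting rule (6) fit together:

```python
import numpy as np

def anyboost_mce(data, train_weighted, mce_loss, line_search, T=7, rho=2.0):
    """Sketch of AnyBoost with the MCE criterion (cf. Table 1).

    data           : list of training examples (x_i, y_i)
    train_weighted : callable(data, weights) -> classifier f_t minimizing the weighted error
    mce_loss       : callable(ensemble, alphas, x, y) -> sigmoid MCE loss of the ensemble on (x, y)
    line_search    : callable(ensemble, alphas, f_t, data) -> weight alpha_t
    """
    ensemble, alphas = [], []
    weights = np.ones(len(data)) / len(data)   # initialization: uniform example weights

    for _ in range(T):
        f_t = train_weighted(data, weights)                 # fit a new acoustic model
        alpha_t = line_search(ensemble, alphas, f_t, data)  # weight via line search
        ensemble.append(f_t)
        alphas.append(alpha_t)

        # Re-weight examples: the weight is proportional to the derivative of the
        # sigmoid MCE loss (eq. (6)), which peaks for examples near the decision
        # boundary rather than for every misclassified example as in AdaBoost.
        losses = np.array([mce_loss(ensemble, alphas, x, y) for (x, y) in data])
        weights = rho * losses * (1.0 - losses)
        weights /= weights.sum()

    return ensemble, alphas
```

In practice, `train_weighted` would stand for a full acoustic model training run on the weighted data, which is where almost all of the computational cost lies.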

A new scoring scheme for ROVER
ROVER is a word-level combination approach developed at NIST that aims to reduce the word error rate by exploiting differences in the nature of the errors made by multiple speech recognition systems [6]. ROVER proceeds in two stages. In the first stage, the best word hypotheses produced by the different recognizers are progressively aligned using dynamic programming to build a single composite Word Transition Network (WTN). Once the WTN is generated, each node in the network is evaluated by a voting module to select the best word as the final recognition result. The present version of ROVER uses a simple voting strategy that linearly combines two types of information in word selection: frequency of occurrence and confidence score. The general scoring formula is as follows:
$$\mathrm{Score}(word) = \beta\, N(word) + (1-\beta)\, C(word)$$

where $N(word)$ denotes the normalized frequency of occurrence, $C(word)$ denotes the average or maximum confidence score, and $\beta$ is a parameter trained to balance $N(\cdot)$ and $C(\cdot)$.
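As an illustration, here is a minimal sketch of this voting rule for a single WTN node (the node representation, the "@" null-arc symbol, and the default null-arc confidence are our own simplifications, not ROVER's actual data structures):

```python
def rover_vote(node, beta=0.5, n_systems=5, null_conf=0.7):
    """Standard ROVER-style voting for one WTN node.

    node      : dict mapping word -> (count, avg_confidence); "@" stands for the null arc
    beta      : trade-off between frequency of occurrence and confidence
    n_systems : number of combined recognizers (for frequency normalization)
    null_conf : confidence assigned to null transition arcs
    """
    best_word, best_score = None, float("-inf")
    for word, (count, conf) in node.items():
        n = count / n_systems                   # normalized frequency of occurrence
        c = null_conf if word == "@" else conf  # null arcs carry no word confidence
        score = beta * n + (1.0 - beta) * c
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Toy node where three of five systems hypothesize "meeting".
print(rover_vote({"meeting": (3, 0.82), "meting": (1, 0.40), "@": (1, 0.0)}))
```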
Our preliminary experiments showed that in many cases the correct hypothesized words cannot be found due to the simplicity of this strategy. To address this problem, we propose a neural network based two-level insertion detection and word scoring scheme. Once the WTN is constructed, we first use a binary classifier to determine whether the WTN node in question is an insertion error. If not, each word in the node is scored on the basis of a variety of features extracted from multiple information sources (Table 2).
Table 2 Two-Level Scoring Scheme
For each node in the WTN:
• Compute features for the node and for each word in the node.
• Use the neural network based binary classifier to decide: is the node an insertion error?
• If yes, return null for this node; otherwise,
• Use the neural network based scorer to give each real word a new confidence score.
• Return the word with the highest score.

The implementation of the two-level scheme involves two aspects: the identification of useful features and the training of the neural networks (a code sketch of the full procedure is given at the end of this section). In our experience (and in that of others), good features usually play a critical role in creating a successful system. Our features are based on those used in previous work. The first task is to train a neural network based binary classifier that determines whether a WTN node is an insertion error. Five features are adopted for this task:
• Average frequency of occurrence for real words.
• Average frequency of occurrence for filler words and null arcs.

• Average word level posterior probability for real words.
Word level posterior probability is a feature used extensively in confidence annotation [7]. It measures how likely it is that a particular hypothesized word is a correct recognition result. The value is computed from the word lattice or the N-best list by summing and normalizing the scores of the paths passing through the word in question.
• Average word level posterior probability for filler words and null arcs. A default value is assigned to null transition arcs since they do not have word probabilities.
• Entropy. This feature is designed to measure the degree of confusion within a WTN node. The value is computed from the normalized frequencies of occurrence.

To train the neural network, the class label of each WTN node is set to either 1 or 0, representing whether or not it is an insertion error. The labels can be produced manually or obtained by aligning the WTN with the reference transcripts. The standard backpropagation algorithm is used as the training method, and the hidden layer of the network contains 20 nodes.

The next task is to select the best real word from each WTN node that is not classified as an insertion. This is realized by a neural network based scorer. For each word to be evaluated, the input to this network consists of seven features:
• Frequency of occurrence.
• LM Back-off Mode. This is a language model related feature. For each real word, the value of the LM back-off mode is determined by whether a 1-, 2-, or 3-gram is used to compute its language model score.
• Contextual LM Back-off Mode. The average LM Back-off Mode over the left and right neighbors of the questioned word.
• Utterance level posterior probability. The posterior probability of the sentence hypothesis in which the questioned word occurs.
• Word level posterior probability.
• Frame level posterior probability. This feature originally measures the probability of a word occurring at a given frame [8]. For a word in the WTN node, the feature value is computed by averaging the frame probabilities across all the frames that the word spans.
• Recognizer's word accuracy. The word accuracy of the recognizer that generated the questioned word, computed on the training set.

The neural network is trained in a discriminative way: the objective function, summed over the $I$ WTN nodes participating in the training, pushes the scorer output $s(u_{i,d})$ for the desired word in node $i$ above the output $s(u_{i,c})$ for the second best word competing with it, where $s(\cdot)$ denotes the output value of the neural network scorer and $u_{i,d}$, $u_{i,c}$ denote the corresponding input feature vectors. In our experiments, the neural network scorer has one hidden layer containing 30 nodes, and we use gradient descent to optimize its parameters.
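Tying the two levels together, the sketch below outlines how the scheme of Table 2 could be applied to an aligned WTN. This is a minimal sketch under our own assumptions: `insertion_net` and `scorer_net` stand for the two trained networks, and `node_features`, `word_features`, and the word attributes (`is_filler`, `text`) are hypothetical names introduced only for this example. The `entropy` helper shows how the confusion feature from the node-level feature set can be computed.

```python
import numpy as np

def entropy(counts):
    """Degree of confusion within a WTN node, from normalized frequencies of occurrence."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def decode_wtn(wtn, insertion_net, scorer_net, node_features, word_features):
    """Two-level scoring over an aligned WTN (cf. Table 2).

    wtn           : list of nodes; each node is a list of word hypotheses
    insertion_net : binary classifier; predicts 1 if the node is an insertion error
    scorer_net    : neural network assigning a new confidence score to each real word
    node_features, word_features : feature extraction callables (hypothetical)
    """
    output = []
    for node in wtn:
        # Level 1: skip nodes classified as insertion errors.
        if insertion_net.predict(node_features(node)) == 1:
            continue
        # Level 2: re-score the real words and keep the best one.
        real_words = [w for w in node if not w.is_filler]
        if not real_words:
            continue
        scores = [scorer_net.score(word_features(w, node)) for w in real_words]
        output.append(real_words[int(np.argmax(scores))].text)
    return output
```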

Experiments
Our research was carried out in the context of a continuous speech recognition task in a meeting environment [9]. We selected meetings from the ICSI Bro series: 22 meetings were used as the training set, which contains about 30K transcribed utterances, and the remaining meeting was used as the test set, which contains about 1.2K utterances or 7.5K words. A further 3K utterances were separated from the training set to form a hold-out set for neural network training. The sampling rate is 11025 Hz and the frame rate is 105 frames per second. A 13-dimension MFCC feature vector is computed for each frame and then converted to a 39-dimension acoustic feature vector by adding delta and delta-delta coefficients.

All of our experiments, both training and test, were performed using the Carnegie Mellon Sphinx III system, a fully-continuous HMM recognizer designed for LVCSR. The dictionary adopted in the experiments was drawn from cmudict (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). The context independent phone set for acoustic model training contained 49 basic phonemes. In the context dependent training stage, these phones were expanded to triphones and then tied together to make 2000 senones. A 3-state left-to-right architecture was adopted to model each speech unit, and each state was modeled with a mixture of 32 Gaussians. The language model was trained using the speech transcripts and text data from other available sources, i.e., WSJ.
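As a side note, appending delta and delta-delta coefficients can be done with a simple finite-difference sketch like the one below (our illustration; the actual Sphinx III front end may use a different regression window):

```python
import numpy as np

def add_deltas(mfcc):
    """Append delta and delta-delta coefficients to a (frames x 13) MFCC matrix."""
    delta = np.gradient(mfcc, axis=0)         # first-order time derivative
    delta2 = np.gradient(delta, axis=0)       # second-order time derivative
    return np.hstack([mfcc, delta, delta2])   # (frames x 39) feature matrix

feats = add_deltas(np.random.randn(100, 13))
print(feats.shape)  # (100, 39)
```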

Experiment on AnyBoost
Our first experiment investigated the AnyBoost algorithm with the MCE criterion in acoustic model training. The class space Y is the space of sentence hypotheses: x denotes the sequence of feature vectors for an input utterance, y denotes a sentence hypothesis, and the discriminant score P(y|x) can be derived from the acoustic model scores and the language model scores. The number of acoustic models in the ensemble is set to 7 due to computation concerns. The performance of AnyBoost is compared with standard AdaBoost. Table 3 shows the results, in which T=n denotes the word error rate obtained from the hypothesis combination of n models; we use standard ROVER for the combination. The baseline is 31.42%, which is the word error rate achieved by using a single acoustic model (T=1).
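One common way to approximate such a posterior from an N-best list, sketched here under our own assumptions (log-domain scores, a language model weight `lm_wt`, and a flattening factor `scale`, none of which are specified by the paper), is to normalize the combined scores over the list:

```python
import numpy as np

def nbest_posteriors(am_scores, lm_scores, lm_wt=9.5, scale=0.1):
    """Approximate P(y|x) for each hypothesis in an N-best list.

    am_scores, lm_scores : log acoustic and language model scores per hypothesis
    lm_wt                : language model weight
    scale                : flattening factor applied before normalization
    """
    combined = scale * (np.asarray(am_scores) + lm_wt * np.asarray(lm_scores))
    combined -= combined.max()     # shift for numerical stability
    p = np.exp(combined)
    return p / p.sum()

print(nbest_posteriors([-1200.0, -1210.0, -1230.0], [-35.0, -33.0, -40.0]))
```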

Experiment on ROVER
The second experiment investigates the new two-level scoring scheme for ROVER combination.

Conclusion
This paper investigated two issues in ensemble based continuous speech recognition: how to construct multiple acoustic models and how to combine their hypotheses. We applied the classic MCE discriminative criterion to Boosting training within the framework of the AnyBoost algorithm. In addition, we developed a new two-level insertion detection and word scoring scheme for ROVER combination. The new scheme is conceptually simple and straightforward to implement. Encouraging results are observed in experiments with a real-world meeting recognition corpus.

Figure 1 Comparison of Three Approaches

Table 3 AnyBoost vs. AdaBoost

Table 3 shows that AnyBoost and AdaBoost are in general comparable to each other in the word error rates they achieve. It is worth noting that the performance of AdaBoost starts to degrade when the 6th acoustic model is added for decoding. This suggests that increasing the weights of hard-to-learn examples may to some extent cause problems such as overfitting, especially when these examples are corrupted by noise or mis-transcribed. In contrast, AnyBoost with the MCE criterion demonstrates some advantages in this situation.

Table 4 Standard ROVER (Old) vs. New Scoring Scheme

Table 4 shows the recognition results of standard ROVER (old) and the new approach (new). The acoustic models are those trained in the previous experiment with the AnyBoost algorithm under the MCE criterion. The experimental results demonstrate the effectiveness of the two-level scheme, which consistently outperforms the traditional method for all six ensemble sizes and finally reduces the word error rate from the baseline of 31.42% to 29.24%, a 6.9% relative reduction. The results are also illustrated in Figure 1.