
This paper addresses the issue of learning hidden Markov model (HMM) parameters for speaker-independent continuous speech recognition. Bahl et al. [Bahl 88a] introduced the corrective training algorithm for speaker-dependent isolated word recognition. Their algorithm attempted to improve the recognition accuracy on the training data. In this work, we extend this algorithm to speaker-independent continuous speech recognition. We use cross-validation to increase the effective training size. We also introduce a near-miss sentence hypothesization algorithm for continuous speech training. The combination of these two approaches resulted in over 20% error reductions both with and without grammar.

This research was sponsored by Defense Advanced Research Projects Agency Contract N00039-85-C-0163. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.


Introduction
At present, hidden Markov models (HMMs) constitute the predominant approach to automatic speech recognition. HMM-based systems make certain structural assumptions, and then try to learn two sets of parameters: output probabilities, which represent speech events, and transition probabilities, which represent duration and timescale distortions. Most HMM-based systems use the Baum-Welch (or forward-backward) algorithm [Baum 72, Jelinek 76, Bahl 83], which adjusts the parameters to obtain an approximation to the maximum-likelihood estimates (MLE) of the HMM parameters.
Maximum likelihood estimators have many desirable properties, and many successful systems [Jelinek 85, Chow 87, Lee 89a, Rabiner 88] are based on MLE. However, maximum likelihood estimation has one serious flaw: it assumes that the underlying models are correct. In reality, typical HMMs make extremely inaccurate assumptions about the speech production process. This suggests two avenues of research: (1) attempt to rectify these assumptions, or (2) use new estimation techniques that work well in spite of these inaccurate assumptions. In this paper, we consider the latter approach.
Bahl et al. [Bahl 88a] introduced the corrective training algorithm for HMMs as an alternative to the forward-backward algorithm. While the forward-backward algorithm attempts to increase the probability that the models generated the training data, corrective training attempts to maximize the recognition rate on the training data. This algorithm has two components: (1) error-correction learning, which improves correct words and suppresses misrecognized words, and (2) reinforcement learning, which improves correct words and suppresses near-misses.
Applied to the IBM speaker-dependent, isolated-word office correspondence task, this algorithm reduced the error rate by 16%.
In this study, we extend the corrective and reinforcement learning algorithm to speaker-independent, continuous speech recognition. Speaker independence presents some problems, because corrections appropriate for one speaker may be inappropriate for another. However, with a speaker-independent task, it is possible to collect and use a large training set. More training provides not only improved generalization but also a greater coverage of the vocabulary. We also propose the use of cross-validation to increase the effective training data size used to locate the misrecognitions needed by the correction algorithm. Cross-validation partitions the training data and determines misrecognitions using models trained on different partitions. This simulation of actual recognition leads to more realistic misrecognition hypotheses.
Extension to continuous speech is more problematic. With isolated-word input, both error-correcting and reinforcement training are relatively straightforward, since all errors are simple substitutions. Bahl et al. [Bahl 88a] determined both misrecognized words and near-misses by matching the utterance against the entire vocabulary. However, with continuous speech, the errors include insertions and deletions. Moreover, many substitutions appear as phrase substitutions, such as "home any" for "how many". In general, word boundaries are neither known nor reliably detectable. Without word-boundary information, it would be difficult to suppress misrecognized words and hypothesize near-misses. These problems make reinforcement learning difficult. We propose an algorithm that hypothesizes near-miss sentences for any given sentence. First, a dynamic programming algorithm produces an ordered list of likely phrase substitutions. Then, this list is used to hypothesize the near-miss sentences used in reinforcement learning.
We applied our corrective training procedure to the 997-word DARPA continuous resource management task, using the speaker-independent database. Without a grammar, we obtained a 20.3% error-rate reduction over the standard MLE-trained SPHINX System. With a word-pair grammar, we obtained a 23.4% reduction. These improvements are comparable to the IBM results with speaker-dependent, isolated-word recognition. Thus, we have successfully demonstrated the extensibility and applicability of the corrective training and reinforcement learning algorithm to speaker-independent continuous speech recognition.
In this paper, we first give a brief overview of hidden Markov models in Section 2. A brief description of the SPHINX system, on which this work is based, is presented in Section 3. We present our algorithm in Section 4, and our results in Section 5. Section 6 discusses possibilities for future work and finishes with a brief conclusion.

Hidden Markov Models and Maximum Likelihood Training
Hidden Markov models (HMMs) were first described in the classic paper by Baum [Baum 72]. Shortly afterwards, they were extended to automatic speech recognition independently at CMU [Baker 75] and IBM [Bakis 76, Jelinek 76]. It was only in the past few years, however, that HMMs became the predominant approach to speech recognition, superseding dynamic time warping.
A hidden Markov model is a collection of states connected by transitions. Each transition carries two sets of probabilities: a transition probability, which provides the probability for taking this transition, and an output probability density function (pdf), which defines the conditional probability of emitting each output symbol from a finite alphabet, given that some transition from the state is taken. There are several types of hidden Markov models. In this study, we will assume discrete density HMMs, which are defined by:
• $\{s\}$: a set of states, including an initial state $S_I$ and a final state $S_F$.
• $\{a_{ij}\}$: a set of transitions, where $a_{ij}$ is the probability of taking a transition from state $i$ to state $j$.
• $\{b_{ij}(k)\}$: the output probability matrix, giving the probability of emitting symbol $k$ when taking a transition from state $i$ to state $j$.
The forward-backward algorithm is used to estimate $a$ and $b$. We provide only a simplistic sketch here; details of the algorithm can be found in [Bahl 83, Lee 88a]. The forward-backward algorithm adjusts $a$ and $b$ iteratively. For each iteration, the estimates from the previous iteration are used to count how frequently each symbol is observed for each transition, and how frequently each transition is taken from each state. These counts are then normalized into new parameters. Let $c_{ij}(k)$ represent the frequency (or count) with which the symbol $k$ is observed when the transition from $i$ to $j$ is taken. The new output probability $\bar{b}_{ij}(k)$ is given by the normalized frequency:

$$\bar{b}_{ij}(k) = \frac{c_{ij}(k)}{\sum_{k'=1}^{K} c_{ij}(k')} \qquad (1)$$

Similarly, transition probabilities are re-estimated by normalizing the frequency with which a transition is taken from a particular state:

$$\bar{a}_{ij} = \frac{\sum_{k=1}^{K} c_{ij}(k)}{\sum_{j'} \sum_{k=1}^{K} c_{ij'}(k)} \qquad (2)$$

Baum [Baum 72] showed that re-estimating $a$ and $b$, as shown in equations 1 and 2, will increase the likelihood of generating the training data, unless a local maximum has been reached. Although the forward-backward algorithm guarantees only a local maximum, it efficiently produces an approximation to the maximum-likelihood estimates (MLE) of the HMM parameters.
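The normalization of counts into probabilities described above can be illustrated with a short sketch. It covers only the final normalization step, not the forward-backward pass that produces the counts. The nested-list layout `counts[i][j][k]` and the function name `reestimate` are assumptions made for this example and are not taken from SPHINX.

```python
def reestimate(counts):
    """Turn expected counts into transition and output probabilities.

    counts[i][j][k] is the (possibly fractional) count of symbol k observed
    on the transition from state i to state j. This layout is an assumption
    made for illustration.
    """
    a = [[0.0] * len(rows) for rows in counts]
    b = [[[0.0] * len(row) for row in rows] for rows in counts]
    for i, rows in enumerate(counts):
        # Total count of all symbols on all transitions leaving state i.
        state_total = sum(sum(row) for row in rows)
        for j, row in enumerate(rows):
            # Total count of all symbols on the transition i -> j.
            transition_total = sum(row)
            if state_total > 0:
                a[i][j] = transition_total / state_total
            if transition_total > 0:
                b[i][j] = [c / transition_total for c in row]
    return a, b


# Toy example: two states, two output symbols.
counts = [
    [[3.0, 1.0], [0.0, 4.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
a, b = reestimate(counts)
```

Transitions or states with zero total count are left at probability zero here; a real system would need some form of smoothing or a floor for unseen events.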
In spite of the many advantages of maximum-likelihood estimation, it suffers from a serious problem: it assumes that the underlying models, in this case HMMs, are correct [Brown 87]. However, HMMs are poor models of real speech, due mainly to the Markov independence assumption. With an incorrect model, there is no guarantee that maximum-likelihood estimation will converge to the best values for speech recognition.

The SPHINX Speech Recognition System
Our experiments in this paper were run by modifying an existing speech recognition system, SPHINX [Lee 89a]. SPHINX is a large-vocabulary, speaker-independent, continuous-speech recognition system based on maximum-likelihood HMMs. SPHINX, which uses vector quantized LPC-derived cepstral coefficients in discrete HMMs, is based on phonetic hidden Markov modeling. Each word is represented by a pronunciation network of phones, and the set of sentences accepted by the grammar is represented by a network of words. Recognition in SPHINX is carried out by a Viterbi beam search [Viterbi 67, Schwartz 85]. While these techniques have worked well in speaker-dependent or isolated-word recognition, we have found that they alone are inadequate for our difficult task. It is necessary to improve these techniques to deal with speaker independence and continuous speech.
In order to deal with speaker independence, we experimented with various ways of adding knowledge to SPHINX. The simplest way to add knowledge to HMMs is to add more frame-based parameters. We use three sets of parameters: (1) instantaneous LPC cepstrum coefficients, (2) differenced LPC cepstrum coefficients, and (3) power and differenced power. These parameters are vector quantized separately into three codebooks, each with 256 entries. We found that quantizing these parameters separately both improved recognition accuracy and reduced VQ distortion. We also incorporated a word duration knowledge source into the recognizer.
Two great problems introduced by continuous speech are unclear function words and coarticulation. In order to improve the recognition of function words, we use function-word-dependent phone models, whose parameters depend on the word in which they appear. Since function words occur frequently, these models can be adequately trained. Moreover, since function words appear frequently in any task, they can be trained in a task-independent fashion. Finally, these models are still phone models, so we can interpolate their parameters with context-independent models when their training is insufficient. In order to deal with coarticulation, we introduce the generalized triphone models. Generalized triphone models are similar to context-dependent triphone models [Schwartz 85]. However, instead of modeling all left and right contexts, we use a maximum likelihood clustering procedure to merge similar contexts together. The use of function-word-dependent phone models and generalized triphone models gives us a total of 1076 models.
We applied SPHINX to the 997-word resource management task used by the DARPA projects. We used 4200 sentences produced by 105 speakers for training, and another 150 sentences by 15 different speakers for testing. We obtained word accuracies of 93.7% and 70.6% for grammars with perplexity 60 and 997, respectively. More information about SPHINX can be found in [Lee 88a, Lee 88b, Lee 89a, Lee 89b, Lee 89c].

Corrective and Reinforcement Learning for Speaker-Independent Continuous Speech Recognition

The IBM Corrective Learning Algorithm
In view of the dependence of MLE on the problematic assumptions of HMMs, Bahl et al. [Bahl 88a] proposed the IBM corrective training algorithm for speaker-dependent isolated-word recognition. This procedure, inspired by perceptron models [Minsky 69], attempts to tune the models to minimize recognition errors. This goal has a definite practical appeal, since error rate, not sentence likelihood, is the bottom line for speech recognition.
In order to minimize errors, the algorithm first attempts to recognize each training utterance $u$, representing some word $w$, with some initial model. If $u$ is misrecognized as $\omega$, then the parameters of the system are modified so as to make $w$ more probable and $\omega$ less probable. This is the corrective component of the algorithm. The other component, which we call reinforcement learning, is always activated, whether or not $u$ is recognized correctly. In that case, a list of near-misses $\omega_m$ is identified. Each near-miss is then made less probable with respect to the correct word. Figure 4-1 illustrates this algorithm in detail.
1. Generate an initial set of models from forward-backward training, preserving the counts $c_{ij}(k)$ for the transitions and output symbols.
2. For each training utterance $u$, use the normalized counts to compute $P(u \mid w)$ for the correct word $w$, and $P(u \mid \omega_m)$ for a list of "confusable" words $\omega_m$. This list consists of:
• misrecognitions: if $P(u \mid \omega_m) > P(u \mid w)$
• near-misses: if $\log P(u \mid \omega_m) - \log P(u \mid w) > -\delta$, where $\delta$ is a positive threshold determined a priori.
3. Run the forward-backward algorithm on each utterance $u$, using the model for the correct word $w$ to obtain the counts $c_{ij}^{w}$. Do the same with the model for each $\omega_m$ to obtain the counts $c_{ij}^{\omega_m}$. Then, replace the original counts $c_{ij}$ with $c_{ij} + \gamma (c_{ij}^{w} - c_{ij}^{\omega_m})$, where $\gamma$ is an adjustment factor:
• For misrecognitions, it is set to $\beta$.
• For near-misses, it decreases linearly from $\beta$ to 0 as the difference in log-probabilities decreases from 0 to $-\delta$.
4. Replace any negative counts by a small positive constant, and continue with step 2, until enough iterations have been run.
Error correction occurs in step 3. By adding counts for the correct word ($c_{ij}^{w}$) to the actual counts for the models ($c_{ij}$), the correct word is made more probable. Conversely, by subtracting counts for the confusable words ($c_{ij}^{\omega_m}$) from $c_{ij}$, the confusable words are made less probable. The parameters of this algorithm include:
• $\delta$: the threshold value for determining what constitutes a near-miss.
• $\beta$: the maximum step size in correction. $\gamma$ is directly computed from $\log P(u \mid w)$, $\log P(u \mid \omega_m)$, and $\beta$.
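The adjustment factor and the count update in steps 3 and 4 can be sketched as follows. This is a minimal illustration rather than the IBM implementation: the function names, the dictionary keyed by (state, state, symbol) tuples, and the default floor value are all assumptions introduced for the example.

```python
def adjustment_factor(logp_correct, logp_confusable, beta, delta):
    """Step size for one confusable word, from the two log-probabilities."""
    diff = logp_confusable - logp_correct
    if diff > 0:
        # Misrecognition: the confusable word scored higher than the correct one.
        return beta
    if diff > -delta:
        # Near-miss: linear ramp from beta (diff = 0) down to 0 (diff = -delta).
        return beta * (1.0 + diff / delta)
    return 0.0  # Not confusable enough to trigger any correction.


def correct_counts(counts, counts_correct, counts_confusable, factor, floor=1e-6):
    """Apply c + factor * (c_correct - c_confusable), flooring negative results.

    All three arguments are dicts keyed by (i, j, k); missing keys count as 0.
    The floor value is a hypothetical choice of "small positive constant".
    """
    updated = {}
    for key, count in counts.items():
        value = count + factor * (
            counts_correct.get(key, 0.0) - counts_confusable.get(key, 0.0)
        )
        updated[key] = value if value >= 0 else floor
    return updated
```

For instance, with a step size of 2.0 and a threshold of 4.0, a confusable word whose log-probability is 2.0 below that of the correct word receives a factor of 1.0, halfway down the ramp.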
Since word boundaries are known in their isolated-word task, it was possible to generate $\omega_m$, the list of near-miss words, by matching the utterance with all the words in the vocabulary. Bahl et al. actually used a fast match algorithm [Bahl 88b], which does this efficiently. From this list, misrecognitions and near-misses were determined using the criterion in step 2 of Figure 4-1.
Over a series of four experiments, the IBM corrective training algorithm produced an average error reduction of 16% on test data, and 88% on training data. These experiments successfully demonstrated the feasibility of corrective training for speaker-dependent, isolated-word recognition.
The simplest way to apply IBM's corrective training algorithm to continuous speech is to treat each sentence as a "word." A misrecognition of a sentence can then be corrected by adjusting the counts for the entire sentence. This simple-minded approach has at least two flaws. First, there is no convenient way to produce multiple misrecognitions, so the corrective component would have at most one error per sentence. This provides very little training. Second, this approach does not suggest any good ways of generating near-misses, because there is no readily available list of near-miss sentences. This makes reinforcement learning impossible. In the next two sections, we will introduce techniques that solve both these problems.

Using Cross Validation to Increase Training
The IBM corrective training algorithm uses the same data to train the HMM probabilities and to determine what and how much to correct. However, recognition on training data invariably results in fewer and less realistic errors than does recognition on independent test data. SPHINX makes almost twice as many errors on a test set as on a training set. Thus, correcting on training data will provide only about half as many misrecognitions for correction.
In order to alleviate this problem, we propose the use of cross-validation. First, the training data is divided into two partitions, and HMMs are trained on each partition. Then, HMMs trained from one partition are used to recognize the sentences from the other. Not only will we obtain many more errors this way, but these errors will also be more realistic. We then use these errors to correct the models trained on the entire set. Partitioning for cross-validation no longer makes sense after the first iteration, because the HMMs will have been trained on one partition and corrected on the other. Therefore, in our implementation we use the misrecognized sentences from cross-validation for the first iteration. In later iterations, we reuse them as near-miss sentences.

Reinforcement Learning for Continuous Speech Recognition
In the isolated-word task, near-misses could be found by matching each utterance against every word in the vocabulary. Such an approach is unsuitable for continuous speech, where we need to produce near-miss sentences given a correct sentence. This information is unavailable from a continuous speech recognizer due to pruning. We decompose this problem as follows:
1. Produce a long list of near-miss phrase substitutions, where each phrase may have zero to several words.
2. Use this list to hypothesize near-miss sentences by substituting one or more of the near-miss phrases with their respective replacements.
The next two sections will describe these two components of our reinforcement learning algorithm for continuous speech recognition.

Generating Near-Miss Phrase Substitutions
In this section, we describe our algorithm to generate a list of near-miss, or confusable, phrases. The first issue is the definition of a confusable phrase. It is inadequate to simply model word-for-word substitutions, because errors in continuous speech recognition are rarely so simple. More complex errors have been modeled by scoring routines [Pallett 87] that attempt to compute the error rate of a system, and provide a list of frequent errors. These routines typically consider insertions (ship → the ship) and deletions (a sub → sub) in addition to substitutions. While these three categories are adequate for determining the error rate of a system, they are unsuitable for finding near-miss phrases for two reasons. First, they contain no contextual information ("the" is more likely to be deleted in "list the uttered word" than in "the word was uttered"). Second, phrase substitutions (during that are m, how many → home any) cannot be decomposed into substitutions, insertions, and deletions.
In view of the above, we model system errors as near-miss phrase substitutions. A near-miss phrase substitution is a pair of phrases, where each phrase may have zero or more words. We generate these confusable phrases as follows. First, we use cross-validation recognition to obtain realistic misrecognized sentences. Then, to find plausible phrase errors, each misrecognized sentence is matched against the corresponding correct one by a dynamic programming (DP) algorithm [Aho 74]. We could then define near-miss phrases as phrase alignments that have reasonable costs.
In order to align two sentences, each sentence must first be decomposed into a sequence of comparable units. For example, the scoring programs use words as units. The problem with using words as units is that they have no self-evident distance metric. Scoring programs use a simple distance metric, where if two identical words are aligned, the distance is zero, and substitutions, insertions, and deletions are penalized with some constant distances. This type of distance metric is insensitive to similarities at the subword level, and will result in unreasonable alignments when multiple alignments are possible. Phones are a better unit, because multiple alignments can be resolved using phonetic similarity. However, there is no principled way of incorporating duration into phonetic distances. Also, similarities at the sub-phone level may be useful.
Therefore, we use the smallest unit available to us, and represent each sentence as a sequence of frames, where a frame is represented as an output distribution, $b_{ij}$. To obtain the distribution sequence, we align the actual utterance against the hidden Markov model for the correct sentence using the Viterbi algorithm. This is repeated for the misrecognized sentence. This process converts the correct and misrecognized sentences to their corresponding $\{b_{ij}\}$ sequences.
In order to align two $b_{ij}$ sequences, we need a distance metric between the $b_{ij}$s. We use an information theoretic distance that measures the change in entropy when the two distributions are merged [Lee 88a]. These distances are scaled to lie in [0,1], with a cost of 1 for insertions and deletions.
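One way to realize an entropy-change distance of this kind is sketched below. The exact weighting and scaling used in [Lee 88a] are not given here, so the sketch assumes that the two distributions are merged with equal weight and that entropies are measured in bits. Under those assumptions the distance equals the Jensen-Shannon divergence, which already lies in [0,1]. The function names are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def merge_distance(p, q):
    """Increase in entropy when p and q are merged with equal weight.

    Assumption: equal weights. A count-weighted merge would be needed if the
    two distributions were estimated from different amounts of data.
    """
    merged = [(x + y) / 2.0 for x, y in zip(p, q)]
    change = entropy(merged) - (entropy(p) + entropy(q)) / 2.0
    return max(0.0, change)  # guard against tiny negative rounding errors


identical = merge_distance([0.7, 0.3], [0.7, 0.3])
disjoint = merge_distance([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give a distance of zero, and distributions with no symbols in common give the maximum distance, which matches the unit cost charged for an insertion or deletion.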
Given the $b_{ij}$ sequences for the correct sentence ($C$) and for the misrecognized sentence ($M$), each with length $L$, we are now in a position to generate near-miss phrase substitutions. A phrase substitution can be thought of as a "box", indicated by its coordinates, $(C_i, C_j, M_k, M_l)$. This "box" matches two phrases: (1) frames $i$ to $j$ of the correct $b_{ij}$ sequence, and (2) frames $k$ to $l$ of the misrecognized $b_{ij}$ sequence. The cost of a box can be determined by aligning the two sequences of $b_{ij}$s using a dynamic programming (DP) algorithm [Aho 74] to find the optimal alignment and cost. This cost is defined by the following equations:

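$$\mathrm{Cost}(C_i, C_{i-1}, M_k, M_{k-1}) = 0 \qquad (3)$$

$$\mathrm{Cost}(C_i, C_j, M_k, M_l) = \min \begin{cases} \mathrm{Cost}(C_i, C_{j-1}, M_k, M_l) + 1 \\ \mathrm{Cost}(C_i, C_j, M_k, M_{l-1}) + 1 \\ \mathrm{Cost}(C_i, C_{j-1}, M_k, M_{l-1}) + \mathrm{Dist}(C_j, M_l) \end{cases} \qquad (4)$$

where $\mathrm{Dist}(C_j, M_l)$ is the entropic distance between frame $j$ of the correct sequence and frame $l$ of the misrecognized sequence, and the constant 1 is the cost of an insertion or deletion.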
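The cost of a single box can be computed with a standard edit-distance style dynamic program. The sketch below assumes unit cost for insertions and deletions, as stated earlier, and takes the frame distance as a caller-supplied function `dist`. The function name and the use of plain Python sequences for the frames are assumptions made for illustration.

```python
def alignment_cost(correct, misrecognized, dist, gap_cost=1.0):
    """Minimum cost of aligning two frame sequences.

    dist(c, m) is the distance between one frame of each sequence and is
    expected to lie in [0, 1]; gap_cost is charged per inserted or deleted frame.
    """
    rows, cols = len(correct), len(misrecognized)
    # cost[j][l] is the cost of aligning the first j correct frames
    # with the first l misrecognized frames.
    cost = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(1, rows + 1):
        cost[j][0] = j * gap_cost
    for l in range(1, cols + 1):
        cost[0][l] = l * gap_cost
    for j in range(1, rows + 1):
        for l in range(1, cols + 1):
            cost[j][l] = min(
                cost[j - 1][l] + gap_cost,      # skip a correct frame
                cost[j][l - 1] + gap_cost,      # skip a misrecognized frame
                cost[j - 1][l - 1] + dist(correct[j - 1], misrecognized[l - 1]),
            )
    return cost[rows][cols]


def toy_dist(x, y):
    """Stand-in for the entropic frame distance: 0 if equal, 1 otherwise."""
    return 0.0 if x == y else 1.0
```

With the toy distance, characters can stand in for frames: aligning `"abc"` with `"axc"` costs one substitution, and aligning `"abc"` with `"ac"` costs one deletion.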
For example, the box in Figure 4-2(a) shows the alignment of two entire sentences, which gives us a globally optimal cost, $\mathrm{Cost}(C_1, C_L, M_1, M_L)$. Figure 4-2(b) shows the decomposition of the global box into three boxes. The central box represents the substitution who is in → will wasn't. We consider the central box a near-miss phrase substitution if the total cost of the three boxes is within $\epsilon$ of the globally optimal cost (that of the box in Figure 4-2(a)). In other words, the phrase substitution designated by the central box $(C_i, C_j, M_k, M_l)$ will be considered a near-miss if the following equation is satisfied:

$$\mathrm{Cost}(C_1, C_{i-1}, M_1, M_{k-1}) + \mathrm{Cost}(C_i, C_j, M_k, M_l) + \mathrm{Cost}(C_{j+1}, C_L, M_{l+1}, M_L) \le \mathrm{Cost}(C_1, C_L, M_1, M_L) + \epsilon \qquad (5)$$

This gives us a principled way of finding good near-miss phrases that actually caused the misrecognition. In the example, the central box in Figure 4-2(b) was considered a near-miss, while the central box in Figure 4-2(c) was not. All the near-miss phrase substitutions are saved, along with their cost in the central box, for later use.
The costs of all possible peripheral boxes can be precomputed efficiently, because there are only $C_w \times M_w$ possible end points for the left box, and $C_w \times M_w$ possible begin points for the right box ($C_w$ is the number of words in the correct string, and $M_w$ is that in the misrecognized string). As a byproduct of the DP algorithm, alignment of the entire sentences yields all the possible costs for left boxes. Similarly, aligning the sentences in reverse yields the costs for the possible right boxes. However, there are almost $(C_w \times M_w)^2$ possible central boxes, one for every possible combination of the two peripheral boxes, and DP alignment will have to be run for each combination. Some pruning is needed to contain the central box computation. We use branch-and-bound to discard any box for which the combined peripheral cost already exceeds the globally optimal cost by more than $\epsilon$, that is, any box for which:

$$\mathrm{Cost}(C_1, C_{i-1}, M_1, M_{k-1}) + \mathrm{Cost}(C_{j+1}, C_L, M_{l+1}, M_L) > \mathrm{Cost}(C_1, C_L, M_1, M_L) + \epsilon \qquad (6)$$

In addition, we restrict the substitution phrases to at most three words. We felt that longer phrases would become too specific for the training data.
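The search over central boxes, with peripheral costs taken from one forward and one reversed alignment and with branch-and-bound pruning, might look like the sketch below. It is an illustration under several assumptions: frames are items of plain Python sequences, box coordinates are half-open frame offsets, and `c_bounds` and `m_bounds` are sorted lists of word-boundary offsets (including 0 and the sequence length). None of these names come from the paper.

```python
def alignment_table(correct, misrecognized, dist, gap_cost=1.0):
    """DP table; table[j][l] is the cost of aligning the first j and l frames."""
    rows, cols = len(correct), len(misrecognized)
    table = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(1, rows + 1):
        table[j][0] = j * gap_cost
    for l in range(1, cols + 1):
        table[0][l] = l * gap_cost
    for j in range(1, rows + 1):
        for l in range(1, cols + 1):
            table[j][l] = min(
                table[j - 1][l] + gap_cost,
                table[j][l - 1] + gap_cost,
                table[j - 1][l - 1] + dist(correct[j - 1], misrecognized[l - 1]),
            )
    return table


def near_miss_boxes(correct, misrecognized, c_bounds, m_bounds, dist,
                    epsilon, max_words=3):
    """Return (i, j, k, l, central_cost) for every near-miss central box."""
    n, m = len(correct), len(misrecognized)
    left = alignment_table(correct, misrecognized, dist)               # left boxes
    right = alignment_table(correct[::-1], misrecognized[::-1], dist)  # right boxes
    limit = left[n][m] + epsilon  # globally optimal cost plus the tolerance
    boxes = []
    for a, i in enumerate(c_bounds):
        for j in c_bounds[a:a + max_words + 1]:          # 0..max_words correct words
            for b, k in enumerate(m_bounds):
                for l in m_bounds[b:b + max_words + 1]:  # 0..max_words wrong words
                    if i == j and k == l:
                        continue  # empty-for-empty is not a substitution
                    peripheral = left[i][k] + right[n - j][m - l]
                    if peripheral > limit:
                        continue  # pruned before running the central alignment
                    central = alignment_table(
                        correct[i:j], misrecognized[k:l], dist)[-1][-1]
                    if peripheral + central <= limit:
                        boxes.append((i, j, k, l, central))
    return boxes
```

This sketch also returns boxes in which the two phrases are identical; a real implementation would filter those out by comparing the word strings, and would normally share work between overlapping central alignments.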
We processed 4150 pairs of correct and recognized sentences using the above algorithm, obtaining a list of 13000 phrase substitutions in about 4 hours of CPU time on a Sun-4. As an example of the substitutions produced, the substitutions produced for matching "Who is in west Siberian sea?" with "If will wasn't last Siberian sea" are:

• west → last
• is in → wasn't
• who → will
• who is in → will wasn't
To digress momentarily, we would like to point out that the algorithm described in this section could be modified into a sophisticated scoring algorithm that can give more accurate estimates of error rate and automatically provide a list of errors for analysis [Pallett 87, Hunt 88].

Hypothesizing Near-Miss Sentences
The candidate list produced by the phrase generation algorithm is then used by the sentence hypothesizer, which heuristically hypothesizes likely near-miss sentences. Almost any reasonable method will do, but for completeness we describe our algorithm in Figure 4-3. This algorithm hypothesizes one near-miss sentence given a correct sentence. Since it is nondeterministic, we can iterate it until we have enough near-misses for each correct sentence. We use an average of 6 near-miss sentences per original sentence.
1. Start at the beginning of the correct sentence by setting the current word position to zero.
2. Make a list of possible phrase substitutions starting at the current position.
3. Randomly make a substitution from the list determined in step 2. The probability of making a substitution is a monotonically decreasing function of the cost in the DP process. Making no substitution is allowed, with no cost.
4. If we made a substitution, advance the current position to the end of the substituted phrase. Otherwise, advance the current position by one word.
5. If at end of sentence, stop. Otherwise, go to step 2.
This algorithm is very efficient. We can hypothesize 250 near-miss sentences per second on a Sun-4. The hypotheses generated by the algorithm depend greatly on the phrase substitution list with which it is provided. Using the substitution list derived from recognition without grammar, the hypotheses produced for "who is in west Siberian sea" include:
• who is in were sub area be
• who is in west it Siberian be
• who is in when Siberian same
• who is in when Siberian it sea
On the other hand, using a word-pair-grammar substitution list, we obtained a different list of hypotheses:
• who was the west Siberian sea
• who is in the west Siberian sea
• who list at west Siberian sea
• who has been in west Siberian sea
Without a grammar, many ungrammatical substitutions occur, such as west → when and sea → be. But with a grammar, the hypothesized sentences are more grammatical, with substitutions such as is → was and in → at.
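The five steps of the hypothesizer can be sketched in a few lines. The sketch makes two simplifying assumptions that are not from the paper: the selection weight is taken to be `exp(-cost)` (any monotonically decreasing function would fit the description), and only substitutions whose source phrase has at least one word are considered, so pure insertions are left out. The function name and the `(source, target, cost)` tuple layout are hypothetical.

```python
import math
import random


def hypothesize_near_miss(words, substitutions, rng):
    """Generate one near-miss sentence from a correct sentence.

    words: list of words in the correct sentence.
    substitutions: list of (source_phrase, target_phrase, cost) with phrases
        given as tuples of words; the target may be empty (a deletion).
    rng: a random.Random instance, so that runs can be reproduced.
    """
    result, pos = [], 0                                   # step 1
    while pos < len(words):                               # step 5
        # Step 2: "no substitution" (cost 0) plus every phrase matching here.
        candidates = [(None, 0.0)]
        for source, target, cost in substitutions:
            if source and tuple(words[pos:pos + len(source)]) == tuple(source):
                candidates.append(((source, target), cost))
        # Step 3: random choice, less likely for higher DP cost.
        weights = [math.exp(-cost) for _, cost in candidates]
        choice, _ = rng.choices(candidates, weights=weights, k=1)[0]
        # Step 4: advance past the substituted phrase, or by one word.
        if choice is None:
            result.append(words[pos])
            pos += 1
        else:
            source, target = choice
            result.extend(target)
            pos += len(source)
    return result


substitutions = [
    (("west",), ("last",), 0.5),
    (("is", "in"), ("wasn't",), 0.8),
]
sentence = "who is in west siberian sea".split()
near_miss = hypothesize_near_miss(sentence, substitutions, random.Random(0))
```

Calling the function repeatedly with the same `rng` yields a stream of different hypotheses for the same correct sentence; unchanged copies of the correct sentence would have to be discarded by the caller.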

Final Algorithm Description
Figure 4-4 summarizes our corrective and reinforcement algorithm for continuous speech recognition. There is one implementational detail that we have not covered. A problem we encountered during recognition was that corrective training would make some probabilities too small for test data. As a result, a few sentences could not be recognized. To remedy this problem, we use a large $\beta$ and then smooth the trained parameters with the MLE parameters. For example, with a smoothing weight of 0.2 for the MLE parameters and 0.8 for the new parameters, we ensure that no parameter can shrink by more than 80% from its MLE estimate. This is more sensitive than simply using a smaller $\beta$, or setting a large minimum value for output probabilities.
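The smoothing step is a weighted average of two probability vectors, which can be checked with a small sketch. The function name `smooth` and the list representation are assumptions made for the example; the weights 0.2 and 0.8 are the ones quoted above.

```python
def smooth(mle, corrected, mle_weight=0.2):
    """Weighted average of MLE and corrected probabilities, element by element."""
    return [
        mle_weight * m + (1.0 - mle_weight) * c
        for m, c in zip(mle, corrected)
    ]


mle = [0.5, 0.3, 0.2]
corrected = [0.0, 0.6, 0.4]   # corrective training drove the first entry to zero
smoothed = smooth(mle, corrected)
```

Because the result is a convex combination of two distributions, it still sums to one without renormalization, and each entry is at least 0.2 times its MLE value. Even a probability driven to zero by the correction keeps 20% of its MLE estimate.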
1. Partition the training sentences into two equal sections for cross-validation and train a set of models on each half.
2. Recognize each half with the models trained on the other half (cross-validation).
3. Perform the DP algorithm of Section 4.3.1 on the misrecognized sentences to obtain a list of near-miss phrase substitutions.
4. Iterate the non-deterministic algorithm described in Figure 4-3 to hypothesize a list of near-miss sentences.
5. Generate the list of confusable sentences ($\omega_m$) by combining:
• The actual misrecognitions from the models of the previous iteration.
• The list of hypothesized near-miss sentences in the previous step.
• Sentences from cross-validation, for iterations after the first.
6. Run the forward-backward algorithm on each spoken sentence $u$, using the model for the correct sentence $w$ to obtain the counts $c_{ij}^{w}$. Do the same with each $\omega_m$ to obtain the counts $c_{ij}^{\omega_m}$. Then, replace each original count $c_{ij}$ with $c_{ij} + \gamma (c_{ij}^{w} - c_{ij}^{\omega_m})$, where $\gamma$ is an adjustment factor:
• For misrecognitions, it is set to $\beta$.
• For near-misses, it decreases linearly from $\beta$ to 0 as the difference $\log P(u \mid \omega_m) - \log P(u \mid w)$ decreases from 0 to $-\delta$.
7. Smooth the resulting models with the MLE parameters, using a weighted average.
8. Go to step 3, until a sufficient number of iterations have been run.
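Steps 1 and 2 of the procedure above, the two-fold cross-validation used to collect realistic misrecognitions, can be sketched as follows. The callables `train` and `recognize` are placeholders for a real HMM trainer and recognizer and are purely hypothetical, as is the representation of each training item as an `(audio, transcript)` pair.

```python
def cross_validation_errors(sentences, train, recognize):
    """Collect (transcript, hypothesis) pairs for held-out misrecognitions.

    sentences: list of (audio, transcript) pairs.
    train(pairs) -> models; recognize(models, audio) -> hypothesis string.
    Both callables are supplied by the caller (hypothetical interfaces).
    """
    half = len(sentences) // 2
    first, second = sentences[:half], sentences[half:]
    models_first, models_second = train(first), train(second)
    errors = []
    # Each half is recognized with the models trained on the other half.
    for models, held_out in ((models_first, second), (models_second, first)):
        for audio, transcript in held_out:
            hypothesis = recognize(models, audio)
            if hypothesis != transcript:
                errors.append((transcript, hypothesis))
    return errors
```

The returned pairs are the input to the phrase-substitution step, while the corrections themselves are applied to models trained on the full set.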

Results and Discussion
The algorithm described in Section 4.4 can be used for recognition systems with and without grammar. We applied the algorithm to both cases.
To obtain the baseline MLE results, we trained the SPHINX system using the forward-backward algorithm, as described in Section 3. We applied SPHINX to the 997-word TI Resource Management task [Price 88], using 4150 training sentences (104 speakers) and 150 test sentences (10 speakers). Without a grammar, the perplexity is 997. With the word-pair grammar, which knows only about the legality of pairs of word classes, the perplexity is 60.
Our first experiment was run on the no-grammar recognizer, where the MLE models made 375 errors on test data (an error could be a substitution, a deletion, or an insertion), for an error rate of 29.4%. One iteration of simplistic corrective training, without near-misses or cross-validation, reduces the number of errors on test data to 336. More realistic cross-validation sentences reduce this to 329. Finally, one iteration of the complete algorithm, described in Figure 4-4, with reinforcement learning reduces this number to 316. Thus, after one iteration, our corrective training algorithm achieves a 15.7% error rate reduction on test data. Most of the improvement comes from basic corrective training, although a significant portion comes from enhancements such as cross-validation and reinforcement learning.
We then ran additional iterations of the corrective and reinforcement learning algorithm. For each iteration, we used the system from the previous iteration to produce misrecognitions for corrective training. A different set of near-miss sentence hypotheses was used, without regenerating the near-miss phrase substitutions. In addition, the misrecognitions from cross-validation were repeatedly used. The results are shown in Table 5-2. Also shown are the error rates on the training sentences. Note that the results shown for MLE training were obtained without cross-validation, so that the entries in the first column would be more comparable. We found that after running two iterations, the result improved only negligibly. The final result gives an error rate of 23.4%, for an error rate reduction of 20.3% over the MLE models.
We also applied the corrective training algorithm to the models used by SPHINX's word-pair grammar. We first tried to use the models adjusted with the no-grammar hypotheses, but the error rate was actually greater than with MLE models. This happens because the two grammars make very different errors, which we verified by comparing their near-miss phrase hypotheses. For example, the word-pair grammar recognizer usually does not confuse than with their, but the no-grammar one often does. Adjusting the model parameters that disambiguate these two words might actually hurt the word-pair recognition rate. So, we regenerated the cross-validation data and the phrase substitution list for the word-pair grammar correction. The results are shown in [...] and 63% with grammar. Thus, we have not only demonstrated the extensibility of the corrective training algorithm to speaker-independent continuous speech recognition, but also narrowed the gap between training and testing results through the use of more training and cross-validation.
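The relative reductions quoted for the first no-grammar iteration follow directly from the test-set error counts given above (375, 336, 329, and 316). The short sketch below reproduces that arithmetic; the function name and the labels are chosen for this example only.

```python
def relative_reduction(baseline_errors, new_errors):
    """Fraction of the baseline errors removed by the new system."""
    return (baseline_errors - new_errors) / baseline_errors


baseline = 375  # MLE models, no grammar, test data
stages = {
    "corrective training only": 336,
    "plus cross-validation": 329,
    "plus reinforcement learning": 316,
}
reductions = {
    name: relative_reduction(baseline, errors) for name, errors in stages.items()
}
for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
```

The last entry corresponds to the 15.7% reduction reported after one iteration of the complete algorithm; the first two show how much of that gain is already obtained before reinforcement learning is added.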

Conclusion
Hidden Markov models with maximum-likelihood estimation constitute the predominant approach to automatic speech recognition today. However, maximum likelihood produces inferior results when the underlying models are incorrect, which HMMs obviously are as models of real speech. For this reason, the IBM corrective training algorithm produced good results when applied to the IBM isolated-word, speaker-dependent, office-correspondence task. The basic idea of this algorithm is to modify the HMM parameters so as to maximize the recognition rate on the training set. This is accomplished by making the correct words more probable, and the confusable words less probable. This paper extended this algorithm to continuous, speaker-independent speech recognition.
In order to increase the effective training size, we used cross-validation. In order to extend the algorithm to continuous speech, we introduced an algorithm that hypothesized near-miss sentences. This algorithm has two components: (1) a near-miss phrase substitution generator that used a dynamic programming algorithm to produce a long list of possible phrase substitutions, and (2) a non-deterministic near-miss sentence hypothesizer that used the phrase substitution list to hypothesize possible near-miss sentences from a correct sentence. These enhancements, aided by the use of a large training database, led to error rate reductions of 20.3% without grammar, and 23.4% with a word-pair grammar. One notable finding was that grammar-specific training was necessary, because corrections appropriate for one grammar may be suboptimal, or even harmful, to another grammar. Another finding was that smoothing a heavily-corrected set of parameters with MLE parameters led to the best results.
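The second component, the non-deterministic sentence hypothesizer, can be sketched as a random rewrite of a correct sentence driven by a phrase substitution list. The sketch below is an illustration only: the function name, the `sub_prob` parameter, the example sentence, and the "four" substitution are all hypothetical, and the paper's actual selection strategy is not described in this section.

```python
import random


def hypothesize_near_miss(sentence, substitutions, rng, sub_prob=0.5):
    """Randomly rewrite phrases of a correct sentence using a substitution list.

    sentence      -- list of words in the correct sentence
    substitutions -- dict mapping a phrase (non-empty tuple of words) to a list
                     of confusable phrases (tuples of words)
    rng           -- random.Random instance, so runs are reproducible
    sub_prob      -- hypothetical probability of applying a matching substitution
    """
    result, i = [], 0
    while i < len(sentence):
        # Collect every listed (non-empty) phrase that matches at position i.
        matches = [p for p in substitutions
                   if p and tuple(sentence[i:i + len(p)]) == p]
        if matches and rng.random() < sub_prob:
            phrase = rng.choice(matches)
            result.extend(rng.choice(substitutions[phrase]))
            i += len(phrase)  # skip over the replaced phrase
        else:
            result.append(sentence[i])
            i += 1
    return result


# Made-up substitution list; "for" -> "far" and "than" -> "their" echo the text.
substitutions = {
    ("than",): [("their",)],
    ("for",): [("far",), ("four",)],
}
correct = "show ships larger than mine for today".split()

rng = random.Random(0)
near_misses = set()
for _ in range(20):
    hyp = tuple(hypothesize_near_miss(correct, substitutions, rng))
    if hyp != tuple(correct):  # keep only genuine near-misses
        near_misses.add(hyp)

for hyp in sorted(near_misses):
    print(" ".join(hyp))
```

Because the choice is random, repeated calls yield different near-miss sentences from the same correct sentence, which is the property the text describes as non-deterministic.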
These results clearly demonstrated that the corrective training algorithm is applicable to speaker-independent, continuous speech recognition. The applicability of this algorithm to continuous speech is demonstrated through the use of novel algorithms for near-miss sentence hypothesization. The applicability of this algorithm to speaker-independent recognition is also important, because much more training data can be collected in speaker-independent mode. Through the use of a large multi-speaker database and cross-validation to increase the effective training size, we have narrowed the gap between the results of training and testing data. However, there is still a large difference between training and testing results. In order to further reduce this difference, more training data will be needed, and more efficient techniques must be investigated to deal with the increased training.
There are a number of other directions for future work. The phrase generation method described in this paper relied on having a large training database. This is practical for a 1000-word vocabulary, but would not be for a 20,000-word system. A method to generate confusable phrases given only a grammar and a dictionary would be very useful.
Another research area is whether the corrective training algorithm will be useful when training data is sparse. For example, could we improve a model even when it has not been observed? Positive results in this area will make corrective training an excellent speaker-adaptation algorithm. Otherwise, it will still be a good adaptation algorithm when a reasonable amount of speaker-specific data is available.
Corrective and reinforcement learning is but one innovation that departs from the traditional maximum-likelihood estimation. A few other techniques, such as maximum mutual information estimation [Brown 87] and minimum discrimination information estimation [Ephraim 88], have been proposed. We believe that this is a rich area for further research.

Figure 2-1: A simple hidden Markov model with two states, and two output symbols, A and B.
Bahl et al. [Bahl 88a] found that reinforcement (near-miss) learning aided the convergence of corrective training considerably. For isolated-word tasks, near-miss training is conceptually simple, since the only errors are simple word-for-word substitutions (such as for -> far). To generate near-misses, Bahl et al. used a list of near-miss words produced by a fast-match algorithm [Bahl 88b].

Figure 4-2: Three examples of DP alignment in the near-miss phrase generation. (a) is the globally optimal path, (b) is an example of a good alignment, and (c) is an example of a poor alignment. The quality of the alignment is assessed by comparing the resulting distance with that of the optimal path.
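To make the notion of a DP alignment concrete, the sketch below aligns a correct word sequence with a recognized one and collects the mismatched stretches as candidate phrase substitutions. It is a stand-in, not the paper's method: it uses plain word-level edit distance with unit costs, whereas the distance measure behind Figure 4-2 is not specified in this section. The function names and the example sentences are hypothetical.

```python
def align(ref, hyp):
    """Word-level edit-distance alignment.

    Returns (distance, pairs), where pairs is a list of (ref_word, hyp_word);
    None marks a deletion (no hyp word) or an insertion (no ref word).
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Trace back from the end to recover one optimal path.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


def phrase_substitutions(pairs):
    """Group consecutive mismatched pairs into (ref_phrase, hyp_phrase) tuples."""
    subs, ref_buf, hyp_buf = [], [], []
    for r, h in pairs + [("", "")]:  # sentinel match flushes the last group
        if r == h:
            if ref_buf or hyp_buf:
                subs.append((tuple(ref_buf), tuple(hyp_buf)))
                ref_buf, hyp_buf = [], []
        else:
            if r is not None:
                ref_buf.append(r)
            if h is not None:
                hyp_buf.append(h)
    return subs


# Made-up example: the recognizer drops "the" and confuses "for" with "far".
ref = "show the ships for today".split()
hyp = "show ships far today".split()
distance, pairs = align(ref, hyp)
subs = phrase_substitutions(pairs)
print(distance, subs)
```

Running this over many training sentences and pooling the resulting pairs would give a phrase substitution list of the kind the near-miss generator produces, although the real system presumably also filters poor alignments by their distance from the optimal path, as the caption indicates.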

Table 5-1: Results of maximum-likelihood training vs. one iteration of variants of corrective and reinforcement training.

Table 5-3. As expected, comparable error reduction with the no-grammar recognizer was achieved.