Topic adaptation for language modeling using unnormalized exponential models

We present novel techniques for performing topic adaptation on an n-gram language model. Given training text labeled with topic information, we automatically identify the most relevant topics for new text. We adapt our language model toward these topics using an exponential model, by adjusting the probabilities in our model to agree with those found in the topical subset of the training data. For efficiency, we do not normalize the model; that is, we do not require that the "probabilities" in the language model sum to 1. With these techniques, we were able to achieve a modest reduction in speech recognition word-error rate in the broadcast news domain.


INTRODUCTION
A language model is a probability distribution p(w|h) estimating how frequently a word w occurs given that the history (or previous words in the sentence) is h. Language models have many applications, most notably in speech recognition, where they help disambiguate acoustically ambiguous utterances.
The dominant technology in language modeling is the n-gram model. In speech recognition, typically a single n-gram model (usually a trigram model) is built on the training data. The task of topic adaptation is concerned with identifying the topic of new data and adapting the language model toward that topic. For example, if a speech document is recognized as describing O.J. Simpson's trial, then the probability of the word Kato occurring should be boosted.
There has been much previous work in topic adaptation. (Here, we only discuss research where it is necessary to identify the topic of the current text automatically. This contrasts with the situation where a topic-specific adaptation text is explicitly given, as in Spoke 2 of the 1994 ARPA CSR evaluation [6].) Numerous efforts have demonstrated large improvements in the measure of perplexity [2,4,9]; however, perplexity has been shown to correlate poorly with speech recognition performance. Several papers have reported modest speech recognition word-error rate (WER) improvements of about 0.5% absolute: Sekine and Grishman [14] add ad hoc topic and cache scores to their language model score in log probability space, and Iyer and Ostendorf [3] and Seymore and Rosenfeld [16] use linear interpolation to combine topic n-gram models with a general n-gram model. (This work was supported by the National Security Agency under grants MDA904-96-1-0113 and MDA904-97-1-0006. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. government.)
In this work, we extend the research in [16] by using unnormalized exponential models to combine topic information. In [16], a first-pass transcription hypothesis is generated for each article in the test set using an unadapted trigram model. The twenty most relevant topics for each hypothesis are identified using a Bayes classifier. Then, a trigram model is built for each of these topics using just those articles in the training data labeled with the given topic. (Each article in the training data is manually annotated with topic information.) Finally, these twenty models are linearly interpolated with a trigram model built on the entire training set to yield the language model used for speech recognition.
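As a rough illustration of the topic-identification step, the following sketch ranks topics by the naive-Bayes log-likelihood they assign to a first-pass hypothesis. The add-one smoothing, the toy corpus in the test below, and the vocabulary size are assumptions of this sketch; the paper does not specify its classifier's details.

```python
import math
from collections import Counter

def rank_topics(hypothesis_words, topic_docs, vocab_size, n_best=20):
    """Rank candidate topics by the naive-Bayes log-likelihood they
    assign to a first-pass hypothesis (add-one smoothed unigrams).

    topic_docs maps a topic label to the list of training words
    labeled with that topic; smoothing keeps unseen words from
    zeroing out a topic's score."""
    scores = {}
    for topic, words in topic_docs.items():
        counts = Counter(words)
        total = len(words)
        scores[topic] = sum(
            math.log((counts[w] + 1) / (total + vocab_size))
            for w in hypothesis_words)
    return sorted(scores, key=scores.get, reverse=True)[:n_best]
```

A topic whose training text frequently contains the hypothesis words will dominate the ranking, which is the behavior the twenty-topic selection relies on.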
Recently, there has been evidence that exponential models are superior to linear interpolation in combining multiple information sources [13,5,4]. Exponential models have the following form:

p(w|h) = p_0(w|h) exp(Σ_i λ_i f_i(h, w)) / Z(h)        (1)

where Z(h) = Σ_w exp(Σ_i λ_i f_i(h, w)) p_0(w|h) is a normalization term, p_0(w|h) is a prior probability, the f_i(h, w) are the features of the model, and the λ_i are parameters associated with these features.
As an example, consider the case where we take p_0(w|h) to be a trigram model. If there are no features f_i, then we will simply have p(w|h) = p_0(w|h). However, suppose we want to model the phenomenon that the word Kato is more common when the topic is O.J. Simpson. We can do this by creating a feature

f_1(h, w) = 1 if topic(h) = O.J. Simpson and w = Kato; 0 otherwise

and by setting λ_1 such that e^{λ_1} equals how many times more probable the word Kato becomes. This has the effect of boosting the probability of Kato when the topic is O.J. Simpson (and consequently depressing other probabilities through the normalization term Z(h)), while leaving probabilities unchanged when the topic is not O.J. Simpson. This procedure is the basis of how we perform topic adaptation on n-gram models.
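The single-feature boost can be sketched numerically. The three-word vocabulary, the prior probabilities, and the boost factor e^{λ_1} = 5 below are invented for illustration:

```python
import math

def exp_model_prob(w, h, p0, features, lambdas, vocab):
    """Normalized exponential model, as in equation (1):
    p(w|h) = p0(w|h) * exp(sum_i lambda_i * f_i(h, w)) / Z(h)."""
    def weight(word):
        s = sum(lam * f(h, word) for f, lam in zip(features, lambdas))
        return p0(word, h) * math.exp(s)
    z = sum(weight(word) for word in vocab)  # normalization term Z(h)
    return weight(w) / z

# Hypothetical numbers: boost "kato" by e^lambda_1 = 5 when the topic
# is O.J. Simpson; other probabilities shrink through Z(h).
f1 = lambda h, w: 1.0 if h["topic"] == "oj_simpson" and w == "kato" else 0.0
lam1 = math.log(5.0)
vocab = ["kato", "the", "trial"]
p0 = lambda w, h: {"kato": 0.01, "the": 0.89, "trial": 0.10}[w]

on = exp_model_prob("kato", {"topic": "oj_simpson"}, p0, [f1], [lam1], vocab)
off = exp_model_prob("kato", {"topic": "weather"}, p0, [f1], [lam1], vocab)
```

When the topic is not O.J. Simpson the feature is inactive and the prior is returned unchanged, as the text describes.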
Unfortunately, the evaluation of exponential models is expensive due to the calculation of the normalization factor Z(h); this calculation generally makes exponential models orders of magnitude slower than trigram models. In this research, we omit the normalization term Z(h). As a result, we no longer have probabilities in our model but instead scores, and we can no longer calculate perplexities. On the other hand, our models are virtually as fast as trigram models and can easily be used to calculate WERs in expensive tasks such as lattice rescoring. To prevent scores from rising above 1, we use the following formulation:

p(w|h) = x / (1 + x),  where x = [p_0(w|h) / (1 − p_0(w|h))] exp(Σ_i λ_i f_i(h, w))

The use of the term p_0(w|h)/(1 − p_0(w|h)) instead of p_0(w|h) maintains the property that p(w|h) = p_0(w|h) when there are no features.
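A minimal sketch of this unnormalized score, assuming the logistic formulation above (reconstructed from the text's description of the p_0/(1 − p_0) term):

```python
import math

def unnormalized_score(p0, active_lambdas):
    """Unnormalized 'probability' (a score): no Z(h) is computed.
    The odds term p0/(1 - p0) inside a logistic form keeps the score
    in (0, 1) and reduces to p0 when no features are active."""
    x = (p0 / (1.0 - p0)) * math.exp(sum(active_lambdas))
    return x / (1.0 + x)

baseline = unnormalized_score(0.3, [])               # no active features
boosted = unnormalized_score(0.3, [math.log(10.0)])  # one boosting feature
```

The score is evaluated in constant time per word, which is what makes lattice rescoring nearly as fast as with a plain trigram model.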
We consider three types of exponential features for performing topic adaptation.
We consider features that depress the probabilities of topical words that are off-topic, e.g., the word Kato if the topic is Libya. (We use the term topical to describe a word whose frequency depends strongly on topic, e.g., the word Kato as opposed to the word that.)

We consider features that boost the probabilities of topical words and n-grams when they are on-topic, e.g., the word Kato or the bigram Kato Kaelin if the topic is O.J. Simpson.
We consider features that boost the probabilities of words and n-grams that occur frequently in the current article being evaluated. These features are similar in effect to a language model cache [7].
In the next sections, we discuss each of these feature types in turn.
Our training data consists of 121,000 articles of Broadcast News data containing a total of 130M words, with each article manually labeled with a set of topics. Each article is labeled on average with 3.6 topics, out of a set of about 10,000.

DEPRESSING OFF-TOPIC WORD PROBABILITIES
The frequency of a topical word in off-topic articles will often be much lower than its frequency calculated over the entire training set. For example, in 130M words of Broadcast News text, the word Kato occurs 3111 times, yielding a unigram frequency of about 2.4 × 10^-5. However, 2990 of these occurrences happen within articles labeled with the topic O.J. Simpson, these articles comprising a total of 16M words. Thus, the word Kato has a frequency of only (3111 − 2990)/((130 − 16) × 10^6) ≈ 1.1 × 10^-6 when the topic is not O.J. Simpson, which is more than ten times less than its general frequency.
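The arithmetic above can be checked directly, using the counts quoted in the text:

```python
total_words = 130_000_000    # Broadcast News training text
kato_total = 3111            # occurrences of "Kato" overall
kato_oj = 2990               # occurrences inside O.J. Simpson articles
oj_words = 16_000_000        # words in O.J. Simpson articles

general_freq = kato_total / total_words                             # ~2.4e-5
off_topic_freq = (kato_total - kato_oj) / (total_words - oj_words)  # ~1.1e-6
ratio = general_freq / off_topic_freq                               # > 10
```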
Modeling this phenomenon in an exponential model is fairly straightforward: referring to equation (1), we want to find a factor λ_w for each word w such that e^{λ_w} expresses how much less frequently that word occurs in off-topic text than in general text, i.e.,

e^{λ_w} = p_off-topic(w) / p_0(w)        (2)

The corresponding features f_w are of the form

f_w(h, w') = 1 if w is off-topic with respect to h and w' = w; 0 otherwise        (3)

To calculate p_off-topic(w) for a word w, we need to determine which topics are off- and on-topic with respect to w. One reasonable heuristic for guessing that a topic is on-topic is if the frequency of w in articles labeled with that topic is much higher than its frequency over the entire training set. However, this heuristic is not ideal, as indirect dependencies may exist. For example, if many articles with the topic O.J. Simpson are also labeled with the topic DNA testing (recall that articles usually have multiple topics), then the topic DNA testing may be considered on-topic for the word Kato according to this heuristic. (The text and topic labels were acquired from Primary Source Media.)
A method for modeling these indirect dependencies is to use maximum entropy training for exponential models [1]. Consider a topic unigram model, i.e., a model with features of the form

f_{T,w}(h, w') = 1 if T is among the topics of h and w' = w; 0 otherwise

for each topic T and word w. (For p_0 in equation (1), we use a uniform distribution.) After maximum entropy training, the magnitude of each parameter λ_{T,w} will be, roughly speaking, an indication of how strongly correlated the word w is with topic T, taking into account indirect dependencies. Furthermore, p(w|h) for a history h whose topic set is empty is an estimate of the frequency p_off-topic(w) we need in equation (2).
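As a rough illustration (not the paper's implementation), a GIS-style maximum entropy training loop for a tiny topic unigram model might look as follows. The toy corpus, the per-word background feature, and the fixed iteration count are all assumptions of this sketch, and the slack feature that generalized iterative scaling formally requires for non-constant feature sums is omitted for brevity.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: (topic set, word counts) per article.
articles = [
    ({"oj_simpson"}, {"kato": 5, "trial": 3, "the": 2}),
    ({"weather"},    {"rain": 6, "the": 4}),
    (set(),          {"the": 5, "trial": 2, "rain": 2, "kato": 1}),
]
vocab = ["kato", "trial", "rain", "the"]

# Parameters: a background weight per word (an assumption of this
# sketch) plus a topic weight per observed (topic, word) pair.
lam_word = defaultdict(float)
lam_topic = defaultdict(float)

def weight(w, topics):
    return math.exp(lam_word[w] + sum(lam_topic[(t, w)] for t in topics))

def prob(w, topics):
    z = sum(weight(v, topics) for v in vocab)  # normalization Z(h)
    return weight(w, topics) / z

# Empirical feature counts.
emp_word = defaultdict(float)
emp_topic = defaultdict(float)
for topics, counts in articles:
    for w, c in counts.items():
        emp_word[w] += c
        for t in topics:
            emp_topic[(t, w)] += c

C = 1 + max(len(t) for t, _ in articles)  # max active features per event

for _ in range(50):  # GIS-style updates
    exp_word = defaultdict(float)
    exp_topic = defaultdict(float)
    for topics, counts in articles:
        n = sum(counts.values())
        for w in vocab:
            p = n * prob(w, topics)
            exp_word[w] += p
            for t in topics:
                exp_topic[(t, w)] += p
    for w in vocab:
        lam_word[w] += math.log(emp_word[w] / exp_word[w]) / C
    for key in emp_topic:
        lam_topic[key] += math.log(emp_topic[key] / exp_topic[key]) / C
```

After training, λ_{oj_simpson, kato} is positive (the word is strongly correlated with the topic), and the empty-topic prediction prob("kato", set()) tracks the word's off-topic frequency, as described in the text.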
The complete procedure we used to calculate our off-topic depression factors is as follows: we began with a 51k vocabulary of the most common words in the Broadcast News data. To reduce the number of features in the topic unigram model to a manageable size, we only included the feature f_{T,w} if the word w occurred much more frequently in articles labeled with topic T than in general, according to a χ² test. This process yielded about 200,000 features. Unlike the other exponential models used in this work, the topic unigram model was normalized. We used optimizations as described by Lafferty and Suhm [8] in the maximum entropy training; each iteration took less than 10 minutes on a Pentium II processor. The training yielded positive depression factors for 30,000 words. An excerpt of these factors is displayed in Table 1.
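The χ² feature-selection test can be sketched as a 2×2 contingency statistic over word tokens; the cutoff of 3.84 (p < 0.05 at one degree of freedom) and the toy counts in the test below are assumptions, as the paper does not state its threshold:

```python
def chi_squared_2x2(word_in_topic, topic_total, word_overall, corpus_total):
    """Chi-squared statistic for the 2x2 contingency table
    'token is word w' x 'token lies in an article labeled with topic T'."""
    a = word_in_topic                      # w inside topic-T articles
    b = topic_total - a                    # other words inside topic-T articles
    c = word_overall - a                   # w outside topic-T articles
    d = (corpus_total - topic_total) - c   # other words outside topic-T articles
    n = a + b + c + d
    stat = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        exp = row * col / n
        stat += (obs - exp) ** 2 / exp
    return stat

associated = chi_squared_2x2(50, 1000, 60, 100_000)   # word concentrated in topic
independent = chi_squared_2x2(1, 1000, 60, 100_000)   # word spread evenly
```

A feature f_{T,w} would be kept only when the statistic exceeds the threshold and w is over-represented (not under-represented) in topic T.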
In evaluation, we used the procedure described in Section 1 to find twenty relevant topics for each article. We took a word w to be off-topic if the frequency of w in the training data for each of the twenty topics was not significantly higher than its off-topic unigram probability, according to a χ² test.

BOOSTING ON-TOPIC N -GRAM PROBABILITIES
In boosting the probabilities of words and n-grams that are topical and on-topic, first consider the case where we would like to adapt a language model toward a single topic T. A reasonable procedure would be to set each adapted probability p_adapt(w|h) to the baseline n-gram probability p_0(w|h) unless the topic probability p_T(w|h) is significantly different (e.g., according to a χ² test), in which case the adapted probability should be set to the topic probability. We can take the topic model p_T(w|h) to be an n-gram model built on the training data labeled with topic T.
To perform this adaptation for exponential models, we can first loop through all unigrams w. Whenever p_T(w) is significantly different from p_0(w), we add a feature f_w(h, w') as in equation (3), with λ_w set such that e^{λ_w} = p_T(w)/p_0(w). Then, we loop through all bigrams w_{i−1} w_i, comparing p_T(w_i|w_{i−1}) against p_0(w_i|w_{i−1}) combined with all unigram features created. (In exponential models, an n-gram feature affects all n'-gram probabilities for n' ≥ n.) We can repeat this process for all levels of the n-gram model. However, articles are generally a combination of multiple topics, and it is not clear how to reconcile probabilities in this more complex situation, especially in light of the indirect dependencies mentioned in Section 2. A theoretically motivated method would be to build a maximum entropy topic n-gram model (analogous to the topic unigram model described earlier) and to train this model on the entire training set; however, this would require a stupendous amount of computation.
We instead choose a simple heuristic that can be considered, in spirit, to be a very poor approximation to maximum entropy training. In particular, for each level of our n-gram model, we apply the procedure described previously for adapting to a single topic to each of the topics in turn, except that we only consider probability increases. That is, for each probability p_adapt(w|h), we take the maximal p_T(w|h) over all of the relevant topics T, as long as this probability is significantly higher than the baseline n-gram probability according to a χ² test. Intuitively, we are assuming that the probability of a word or n-gram in the adapted model should be large if it is large in any of the relevant topics.
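The max-over-topics heuristic reduces to a short loop. The ratio-based significance stub below is an assumption standing in for the paper's χ² test:

```python
def adapt_ngram_prob(p0, topic_probs, significantly_higher):
    """Take the maximal topic probability over the relevant topics,
    but only when it passes the significance test against the
    baseline; otherwise keep the baseline n-gram probability."""
    best = p0
    for p_t in topic_probs:
        if p_t > best and significantly_higher(p_t, p0):
            best = p_t
    return best

# Stand-in significance test (assumption): require a 2x ratio.
ratio_test = lambda p, q: p > 2.0 * q
```

The corresponding boosting feature would then carry weight λ = log(p_adapt/p_0), as in the single-topic procedure above.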

Filtering Adaptation Topics
We have found that usually not all of the twenty topics for an article returned by our Bayes classifier are relevant. To select the most relevant topics of the twenty, we build a model for each topic by adapting the general model to just that topic. We calculate the likelihood of the first-pass hypothesis transcription using these models, and use a topic only if its corresponding likelihood is substantially higher (by at least 0.3 bits/word) than the likelihood assigned by the general model. In Table 2, we display the results of this process for an article concerning racial issues between blacks and whites.
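This filtering step can be sketched as a per-word log-likelihood comparison in bits. The unigram stand-ins for the adapted and general models below are invented for illustration; the criterion assumed here is that a relevant topic's model should fit the hypothesis better than the general model by the stated margin:

```python
import math

def filter_topics(hyp_words, general_lm, topic_lms, threshold_bits=0.3):
    """Keep a topic only when its adapted model assigns the first-pass
    hypothesis at least `threshold_bits` more bits per word of
    log-likelihood than the general model does."""
    n = len(hyp_words)
    base = sum(math.log2(general_lm(w)) for w in hyp_words) / n
    kept = []
    for topic, lm in topic_lms.items():
        bits = sum(math.log2(lm(w)) for w in hyp_words) / n
        if bits - base >= threshold_bits:
            kept.append(topic)
    return kept

# Hypothetical unigram stand-ins for the models.
general = lambda w: 0.01
topic_lms = {"A": lambda w: 0.02, "B": lambda w: 0.011, "C": lambda w: 0.005}
kept = filter_topics(["a", "b"], general, topic_lms)
```

Topic A fits the hypothesis one bit per word better than the general model and survives; B's 0.14-bit gain and C's loss do not.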

Boosting Article-Specific n-Gram Probabilities
Table 2: Results of topic filtering by likelihood for an article concerning racial issues between blacks and whites

Cache models attempt to characterize the phenomenon that words and n-grams tend to repeat themselves within articles, by increasing the probabilities of n-grams that have occurred previously in an article [7]. We can place this type of modeling within our adaptation framework by viewing the first-pass hypothesis transcription of an article as another topic adaptation text. We can adapt our language model to this text in the same way that we adapt it to each relevant topic. Words or n-grams that occur surprisingly frequently in the hypothesis will have their probabilities boosted in the adapted language model. In conventional caching, hypotheses are processed beginning-to-end, and all previous words in a hypothesis are assumed to be correct and placed in the cache. In our scheme, the whole article is processed before features are created, and features are created only if they pass a significance test. Thus, it seems likely that our scheme is less susceptible to speech recognition errors.
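The cache-style adaptation above can be sketched as follows. The count and ratio thresholds are assumptions standing in for the paper's significance test, and the toy counts in the test are invented:

```python
import math
from collections import Counter

def article_cache_features(hyp_words, p0_unigram, min_count=3, min_ratio=5.0):
    """Boosting features for words that recur surprisingly often in the
    whole first-pass hypothesis.  A crude count-and-ratio check stands
    in for the paper's significance test (both thresholds are
    assumptions of this sketch)."""
    counts = Counter(hyp_words)
    n = len(hyp_words)
    features = {}
    for w, c in counts.items():
        p_article = c / n
        if c >= min_count and p_article >= min_ratio * p0_unigram(w):
            features[w] = math.log(p_article / p0_unigram(w))  # weight lambda_w
    return features

feats = article_cache_features(
    ["kato"] * 4 + ["the"] * 6,
    lambda w: {"kato": 0.001, "the": 0.6}[w])
```

Because the whole hypothesis is processed before any feature is created, a single misrecognized word is unlikely to pass the frequency test, unlike in a running beginning-to-end cache.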

EXPERIMENTS
In our experiments, we used speech recognition lattices generated by the Sphinx-III system [10] on 20 articles of Broadcast News data (16,700 words). For each article, we first generated a hypothesis using a trigram model built by the CMU language modeling toolkit [11] from our 130M words of training text. The word-error rate of these hypotheses was 30.8%. We found twenty relevant topics for each article using a Bayes classifier on these first-pass hypotheses. In each experiment, word-error rates were calculated through lattice rescoring with the adapted model. The baseline model for adaptation is the trigram model described above.
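WER figures such as the 30.8% baseline come from edit-distance alignment of each hypothesis against the reference transcript; a minimal word-level implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance (substitutions +
    insertions + deletions) divided by the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```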

Depressing Off-Topic Word Probabilities
We investigated whether the depression of off-topic word probabilities alone would improve word-error rate. Using the 30,000 depression features described in Section 2, we found that the WER improved by 0.1% absolute, to 30.7%. To get a detailed view of the variation between the hypothesis generated by the baseline trigram model and the hypothesis generated by the adapted model, we aligned these two hypotheses to find their word differences. We then aligned these differences against the reference transcript to determine how many errors were fixed and how many were created with the adapted model. Over the 16,700 words in the test set, there were 43 word differences between the baseline and adapted hypotheses. Of these 43 differences, 17 were errors fixed in the adapted hypothesis, 5 were errors created, and 21 were errors in both hypotheses.
As an upper bound on the WER reduction of these techniques, Rosenfeld et al. [12,15] estimate that if no out-of-vocabulary errors are introduced, then removing 10,000 words from a large vocabulary improves WER by about 0.2% absolute, so depressing 30,000 words completely and perfectly would lead to a WER improvement of about 0.6%.

Boosting On-Topic and Article-Specific n-Gram Probabilities
In experiments with on-topic and article-specific features, we did not use depression features, as they seemed to have little effect. We performed adaptation with unigram and bigram features. We display the article-by-article error rates of on-topic and article-specific adaptation in Table 3. We achieved our best WER improvement of 0.5% absolute using both adaptations together. Improvements varied widely between articles, with our best article WER improvement being 1.8% absolute, in article A. In the final column of the table, we display the results of adding only unigram adaptation features; bigram features appear to effect a small improvement.
Comparing the baseline and best adaptation hypotheses using the methodology described in Section 4.1, we found that the two hypotheses differed by 854 words. Of these 854 words, 261 were errors fixed by adaptation, 162 were errors created by adaptation, and 431 were errors in both hypotheses.

DISCUSSION
To summarize, we introduced several novel topic adaptation techniques for unnormalized exponential models. The use of unnormalized exponential models has the advantage of efficient computation, while hopefully retaining some of the properties of conventional exponential models. We were able to run lattice rescoring experiments at about 3 times real-time on a Pentium II processor. Because we use unnormalized models, it is not meaningful to calculate perplexity; however, perplexity has been shown to correlate poorly with speech recognition performance.
This work is the first to explicitly model the depression of off-topic word probabilities. We describe how to use maximum entropy training to determine these depression factors. We present a novel implementation of robust caching, which fits in a unified manner within our topic adaptation framework. We describe an effective method for filtering out irrelevant topics using the likelihood of the first-pass transcription. Throughout our work, we use statistical testing to select only those adaptation features that are significant.
We achieved a minimal reduction in WER by depressing off-topic word probabilities, but achieved a modest reduction through boosting on-topic and article-specific n-gram probabilities. Our WER reduction is comparable to the best existing results for this task.

Table 1:
Estimates of how much less frequent words w are when off-topic (i.e., 1/e^{λ_w})

Table 3:
Speech recognition performance for models with on-topic and article-specific n-gram features