Language Modeling for Dialog System

Language modeling for the speech recognizer in a dialog system can take two forms. Human input can be constrained through a directed dialog, allowing the decoder to use a state-specific language model to improve recognition accuracy. Mixed-initiative systems allow human input that, while domain-specific, might not be state-specific. Nevertheless, human input to a mixed-initiative system is for the most part predictable, particularly given information about the immediately preceding system prompt. The work reported in this paper addresses the problem of balancing state-specific and general language modeling in a mixed-initiative dialog system. By incorporating dialog state adaptation of the language model, we have reduced the recognition error rate by 11.5%.


Introduction
Recent advances in speech recognition technology and computer hardware have made it possible to build human-computer spoken dialog systems for a wide variety of applications. However, speech recognition performance is still a bottleneck for these systems [7]. Much research effort has been devoted to detecting and recovering from recognition errors.
In this work, we have tried to improve the recognition performance of the Carnegie Mellon Communicator [11], a telephone-based automated travel agent system, by incorporating dialog state adaptation of the language model. Language modeling for the speech recognizer in a dialog system can take two forms. Human input can be constrained through a directed dialog, allowing the decoder to use a state-specific language model to improve recognition accuracy [6][9]. In this approach, dialog states are used to partition the full set of utterances into subsets, and a standard n-gram language model is trained on each subset. Mixed-initiative dialog systems allow human input that, while domain-specific, might not be state-specific. Nevertheless, human input to a mixed-initiative system is for the most part predictable, particularly given information about the immediately preceding system prompt. In [10], the state-specific language models were interpolated with a general language model using the Viterbi algorithm.
The work reported in this paper addresses the problem of balancing state-specific and general language modeling in a mixed-initiative dialog system. We show that our approach improves system performance, reported in terms of both perplexity and actual recognition word accuracy.

System overview
The dialog system we experiment on is the CMU Communicator, a telephone-based automated travel planning system. Communicator is a mixed-initiative spoken dialog system. In this system, the Sphinx-2 speech recognizer transcribes the user's speech into text and passes it to the Phoenix parser, which generates a semantic interpretation. The dialog manager then decides how to interact with the user and the database. At different states of a dialog, the dialog manager gives different prompts to the user, and the user's response may or may not relate to the system's prompt.
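The turn cycle described above can be summarized in a short sketch. This is a minimal illustration of the data flow only; the function and method names are hypothetical and do not reflect the actual APIs of Sphinx-2, Phoenix, or the Communicator dialog manager.

def handle_turn(audio, recognizer, parser, dialog_manager):
    # Sphinx-2: speech -> text
    text = recognizer.transcribe(audio)
    # Phoenix: text -> semantic interpretation (parse frames)
    frames = parser.parse(text)
    # Dialog manager: decide how to respond (prompt and/or database action)
    prompt = dialog_manager.respond(frames)
    return prompt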

Building state-dependent language models
Because the user's response depends on what the user has just heard, we can define the state as the preceding system prompt (i.e., the natural language generation frame). User utterances are classified into 16 states (Table 1) according to their preceding natural language generation frames. We take the following steps to build the state-dependent language models.

• A general language model is built from the whole corpus, using Katz back-off with Good-Turing discounting. We use the CMU-Cambridge toolkit [1] to build the language model.

• We build a trigram back-off language model for each state. The unigram probabilities are backed off to the unigram probabilities of the general model:

\hat{P}_s(w) = \begin{cases} P_s(w), & w \in V_s \\ \lambda\, P_g(w), & \text{otherwise} \end{cases}

where P_s is the unigram model estimated from the corpus of state s, P_g is the general unigram model, V_s is the vocabulary of the corpus of state s, and \lambda is a normalization factor that makes \hat{P}_s(w) sum to 1.

• Each state-dependent language model is then linearly interpolated with the general language model. The interpolation weights are determined optimally by the EM (Expectation-Maximization [2]) algorithm on separate holdout data; a sketch of this estimation step is given at the end of this section. The interpolated probability of a word is given by

P(w \mid h) = \alpha\, P_s(w \mid h) + \beta\, P_g(w \mid h),

where h is the word history and the interpolation weights \alpha and \beta satisfy \alpha + \beta = 1.

Our corpus is drawn from data collected with the CMU Communicator. The system's log files contain every system output and user input. Data collected from June 1998 to May 1999 is used as development data; it comprises 182K words in 42K utterances. The test data comes from recordings made in June 1999 and comprises 6,289 words in 1,750 utterances. The perplexity and recognition word error rate of the general language model and of the state-dependent language models for each dialog state are shown in Table 1. The results show a strong correlation between the dialog state and users' responses: using the state-dependent language models yields a significant 11.5% reduction in word error rate.
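As a concrete illustration of the interpolation step, below is a minimal sketch of estimating the interpolation weight by EM on holdout data. The representation of the two trained models as probability functions p_state and p_general, and of the holdout data as (word, history) pairs, is our own illustrative choice, not the toolkit's actual interface.

def em_interpolation_weight(heldout, p_state, p_general, n_iter=20):
    """Estimate alpha in P(w|h) = alpha*P_s(w|h) + (1-alpha)*P_g(w|h)
    by EM on holdout data.  heldout is a list of (word, history) pairs;
    p_state(w, h) and p_general(w, h) return model probabilities."""
    alpha = 0.5
    for _ in range(n_iter):
        responsibility = 0.0
        for w, h in heldout:
            ps = alpha * p_state(w, h)             # state-model component
            pg = (1.0 - alpha) * p_general(w, h)   # general-model component
            responsibility += ps / (ps + pg)       # E-step: P(state model | w, h)
        alpha = responsibility / len(heldout)      # M-step: re-estimate the weight
    return alpha

Each iteration cannot decrease the holdout likelihood, so the weight converges; \beta is simply 1 - \alpha.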

Clustering of Utterances
The improvement in section 3 is encouraging. However, it should be possible to further improve the predictive ability of the language model. The following observations can be made about users' language in a particular state:

• Users tend to talk about a number of different topics and will naturally use different language for each topic. If we know the topic of the utterance (within the state), we can more precisely model the language.
• There are some patterns in users' utterances. Within each utterance cluster, word sequences are more predictable.
Given an utterance, its cluster is not known a priori, so we have a probability distribution over the clusters. The probability of an utterance is the weighted sum of the conditional probabilities that the utterance is generated from each cluster:

P(u) = \sum_{i=1}^{c} P(c_i)\, P(u \mid c_i)

where c is the number of clusters. A similar clustering idea was proposed in [12]. However, our approach differs from previous work in that we directly use trigrams instead of unigrams for clustering. Clustering with unigrams cannot model the local regularities of language, whereas some local regularities can be captured by using n-grams with n \geq 2.
The distribution over clusters differs from one dialog state to another. This can be modeled by conditioning the cluster probability on the dialog state:

P(u \mid s) = \sum_{i=1}^{c} P(c_i \mid s)\, P(u \mid c_i)

We want to classify users' utterances into clusters such that the utterances within each cluster are similar, using likelihood as the measure of similarity among the utterances within a cluster. We build c trigram language models, one per cluster, such that the likelihood of the whole data set is maximized. The EM algorithm can be used to find the optimal parameters (sketched below); here, the parameters to be optimized are the trigram probabilities of each cluster model and the cluster prior probabilities.

[Figure: distribution of utterance clusters for each dialog state]

The figure clearly shows that different dialog states have very different cluster distributions. In some states, most utterances belong to a single cluster, while in other states users' language tends to be distributed over many clusters.
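Below is a minimal sketch of the EM loop just described. The helpers train_trigram (a trigram trainer that accepts fractionally weighted utterances) and utt_logprob (an utterance log-probability under a given model) are placeholders for a real back-off trigram implementation; they, like all other names here, are assumptions for illustration.

import math
import random
from collections import defaultdict

def em_cluster(utterances, states, n_clusters, train_trigram, utt_logprob,
               n_iter=10):
    """Soft-cluster utterances with per-cluster trigram models and
    state-conditioned priors P(c|s).  gamma[i][k] holds P(c_k | u_i, s_i)."""
    n = len(utterances)
    gamma = []
    for _ in range(n):                        # random initial soft assignments
        row = [random.random() for _ in range(n_clusters)]
        z = sum(row)
        gamma.append([g / z for g in row])

    for _ in range(n_iter):
        # M-step: train each cluster's trigram model on fractionally
        # weighted utterances
        models = [train_trigram([(u, gamma[i][k])
                                 for i, u in enumerate(utterances)])
                  for k in range(n_clusters)]
        # M-step: re-estimate the state-conditioned priors P(c|s)
        prior = defaultdict(lambda: [0.0] * n_clusters)
        totals = defaultdict(float)
        for i, s in enumerate(states):
            for k in range(n_clusters):
                prior[s][k] += gamma[i][k]
            totals[s] += 1.0
        for s in prior:
            prior[s] = [p / totals[s] for p in prior[s]]
        # E-step: gamma[i][k] is proportional to P(c_k|s_i) * P(u_i|c_k)
        for i, (u, s) in enumerate(zip(utterances, states)):
            log_scores = [math.log(prior[s][k] + 1e-12) + utt_logprob(models[k], u)
                          for k in range(n_clusters)]
            m = max(log_scores)
            weights = [math.exp(ls - m) for ls in log_scores]
            z = sum(weights)
            gamma[i] = [w / z for w in weights]
    return gamma

The final gamma values are exactly the posteriors P(c_i | u, s) used for labeling in the next step.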
After running the above algorithm, we can obtain the probability that an utterance belongs to a cluster given its dialog state:

P(c_i \mid u, s) = \frac{P(c_i \mid s)\, P(u \mid c_i)}{\sum_{j=1}^{c} P(c_j \mid s)\, P(u \mid c_j)}

Each utterance can thus be labeled with the most probable cluster it belongs to. All the utterances are then partitioned into clusters, and a Katz back-off trigram language model is built for each cluster using the CMU-Cambridge toolkit. These cluster language models need to be smoothed, since no single cluster has enough training data. We do the smoothing by interpolating the cluster model with a general language model. There are two ways to interpolate.
One way is to interpolate at the utterance level: the probability of an utterance is the weighted sum of the probabilities calculated using the cluster model and the general model.
The other way is to interpolate at the word level: the probability of a word is the weighted sum of the probabilities calculated using the cluster model and the general model.
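The exact equations for the two schemes are garbled in our source; one plausible formalization, reusing the notation above with P_c denoting the model of the utterance's cluster and \alpha + \beta = 1, is:

% Utterance-level: mix the two models' whole-utterance probabilities
\hat{P}(u) = \alpha \prod_{j} P_{c}(w_j \mid h_j) + \beta \prod_{j} P_g(w_j \mid h_j)

% Word-level: mix the two models word by word, then take the product
\hat{P}(u) = \prod_{j} \bigl( \alpha\, P_{c}(w_j \mid h_j) + \beta\, P_g(w_j \mid h_j) \bigr)

The utterance-level scheme weights each hypothesis by one model or the other as a whole, while the word-level scheme lets the two models share responsibility within a single utterance.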
Again, the EM algorithm is used to estimate the optimal interpolation weights on holdout data.

To compare recognition results, we first generate word lattices using the state-dependent language model; we then rescore each lattice with the state-dependent language model and with the cluster language model, respectively. It turns out that the cluster language model slightly increases the word error rate. [3][4][12] also reported that clustering techniques either degrade or only slightly improve recognition performance. [12] suggested that better smoothing (e.g., Kneser-Ney smoothing) needs to be applied to the cluster language models in order to obtain good performance.
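To make the rescoring comparison concrete, here is a minimal sketch that uses an n-best list as a simplified stand-in for the word lattice (the system itself rescores lattices); the language-model weight and function names are illustrative assumptions.

def rescore(nbest, lm_logprob, lm_weight=9.0):
    # nbest: list of (words, acoustic_logprob) pairs from the first pass.
    # lm_logprob(words): hypothesis score under the new language model
    # (state-dependent or cluster model).  Returns the best word string.
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))[0]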
We have also compared the effect of the n-gram order used for clustering. We cluster the utterances using unigrams, bigrams, and trigrams respectively, and build the corresponding cluster language models. Table 3 shows the resulting performance differences. Using trigrams for clustering gives better performance for the cluster language model.

[Table 3: Performance difference when clustering utterances using different n-grams]

Conclusions
Using state-dependent language models, both the perplexity and the word error rate of speech recognition can be improved significantly. Dynamic switching between the state-dependent language models has been implemented in the Communicator system so that it benefits from the reduced recognition error rate.
The utterance cluster language model does not improve recognition performance; however, better smoothing techniques are expected to improve the cluster models. For the purpose of language modeling, using trigrams to cluster utterances is better than using unigrams.