WHOLE-SENTENCE EXPONENTIAL LANGUAGE MODELS: A VEHICLE FOR LINGUISTIC-STATISTICAL INTEGRATION

We introduce an exponential language model which models a whole sentence or utterance as a single unit. By avoiding the chain rule, the model treats each sentence as a “bag of features”, where features are arbitrary computable properties of the sentence. The new model is computationally more efficient, and more naturally suited to modeling global sentential phenomena, than the conditional exponential (e.g. Maximum Entropy) models proposed to date. Using the model is straightforward. Training the model requires sampling from an exponential distribution. We describe the challenge of applying Markov Chain Monte Carlo (MCMC) and other sampling techniques to natural language, and discuss smoothing and step-size selection. We then present a novel procedure for feature selection, which exploits discrepancies between the existing model and the training corpus. We demonstrate our ideas by constructing and analyzing competitive models in the Switchboard domain, incorporating lexical and syntactic information.

[...] language modeling is devoted to estimating terms of the form $\Pr(w \mid h)$, where $w$ is a word and $h$ is its history.
The application of the chain rule is technically harmless, since it uses an exact equality, not an approximation. This practice is also understandable from a historical perspective (statistical language modeling grew out of the statistical approach to speech recognition, where the search paradigm requires estimating the probability of individual words). Nonetheless, it is not always desirable. Terms like $\Pr(w \mid h)$ may not be the best way to think about estimating $\Pr(s)$:
1. Global sentence information such as grammaticality or semantic coherence is awkward to encode in a conditional framework. Some grammatical structure was captured in the structured language model of [1] and in the conditional exponential model of [2]. But such structure had to be formulated in terms of partial parse trees and left-to-right parse states. Similarly, modeling of semantic coherence was attempted in the conditional exponential model of [3], but had to be restricted to a limited number of pairwise word correlations.
2. External influences on the sentence (for example, the effect of preceding utterances, or dialog level variables) are equally hard to encode efficiently. Furthermore, such influences must be factored into the prediction of every word in the current sentence, causing small but systematic biases in the estimation to be compounded.
3. $\Pr(w \mid h)$ is typically approximated by $\Pr(w_i \mid w_{i-k+1}, \ldots, w_{i-1})$ for some small $k$ (the Markov assumption). Even if such a model is improved by including longer distance information, it still makes many implicit independence assumptions. It is clear from looking at language data that these assumptions are often patently false, and that there are significant global dependencies both within and across sentences.
As a simple example of the limitations of the chain rule approach, consider one aspect of a sentence: its length. In an n-gram based model, the effect of the number of words in the utterance on its probability cannot be modeled directly. Rather, it is an implicit consequence of the n-gram prediction. This is later corrected during speech recognition by a "word insertion penalty," the usefulness of which shows that length is an important feature. However, the word insertion penalty can only model length as a geometric distribution, which does not fit well with empirical data, especially for short utterances.
As an alternative to the conventional conditional formulation, this paper proposes a new exponential language model which directly models the probability of an entire sentence or utterance. The new model is conceptually simpler, and more naturally suited to modeling whole-sentence phenomena, than the conditional exponential models proposed earlier. By avoiding the chain rule, the model treats each sentence or utterance as a "bag of features", where features are arbitrary computable properties of the sentence. The single, universal normalizing constant cannot be computed exactly, but this does not interfere with training (done via sampling) or with use. Using the model is computationally straightforward. Training the model depends crucially on efficient sampling of sentences from an exponential distribution.
In what follows, Section 2 introduces the model and contrasts it with the conditional exponential models proposed to date. Section 3 discusses training the model: it lists several techniques for sampling from exponential distributions, shows how to apply them to the domain of natural language sentences, and compares their relative efficacies. Step-size selection and smoothing are also discussed there. Section 4 describes experiments we performed with this model, incorporating lexical and syntactic information. Section 5 analyzes the results of the experiments, and Section 6 summarizes and discusses our ongoing effort and future directions. Various portions of this work were first described in [4,5,6].

WHOLE SENTENCE EXPONENTIAL MODELS
A whole sentence exponential language model has the form:

$$P(s) = \frac{1}{Z} \cdot P_0(s) \cdot \exp\Big(\sum_i \lambda_i f_i(s)\Big) \qquad (1)$$

where the $\lambda_i$'s are the parameters of the model, $Z$ is a universal normalization constant, and the $f_i(s)$'s are arbitrary computable properties, or features, of the sentence $s$. $P_0(s)$ is any arbitrary initial distribution, sometimes loosely referred to as the "prior". For example, $P_0(s)$ might be the uniform distribution, or else it might be derived (using the chain rule) from a conditional distribution such as an n-gram. The features $\{f_i(s)\}$ are selected by the modeler to capture those aspects of the data they consider appropriate or profitable. These can include conventional n-grams, longer-distance dependencies, global sentence properties, as well as more complex functions based on part-of-speech tagging, parsing, or other types of linguistic processing.

Using the Whole Sentence Model
To use the whole sentence exponential model to estimate the probability of a given sentence $s$, one need only calculate $P_0(s)$ and the values of the various features $f_i(s)$, and use Equation (1). Thus using the model is straightforward and (as long as the features are not too complex) computationally trivial. Because the features could depend on any part of the sentence, they can in general only be computed after the entire sentence is known. Therefore, when used for speech recognition, the model is not suitable for the first pass of the recognizer, and should instead be used to re-score N-best lists.
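As an illustration of how little is needed to use the model for N-best rescoring, the following Python sketch scores a sentence by its unnormalized log probability, i.e. the log of the initial distribution plus the weighted feature sum. All names (`unnormalized_log_score`, `rescore_nbest`, the toy baseline and the toy feature) are hypothetical and not part of the original work; the normalization constant is deliberately left out, since it does not affect the ranking of hypotheses.

```python
from typing import Callable, List, Sequence

Sentence = Sequence[str]


def unnormalized_log_score(
    sentence: Sentence,
    log_p0: Callable[[Sentence], float],
    features: Sequence[Callable[[Sentence], float]],
    lambdas: Sequence[float],
) -> float:
    """Return log P0(s) + sum_i lambda_i * f_i(s), i.e. log P(s) + log Z."""
    return log_p0(sentence) + sum(
        lam * f(sentence) for lam, f in zip(lambdas, features)
    )


def rescore_nbest(hypotheses, log_p0, features, lambdas) -> List[Sentence]:
    """Sort hypotheses from best to worst by unnormalized log score."""
    return sorted(
        hypotheses,
        key=lambda s: unnormalized_log_score(s, log_p0, features, lambdas),
        reverse=True,
    )


# Toy usage: a made-up baseline and a single made-up binary feature.
toy_log_p0 = lambda s: -1.0 * len(s)
has_yes = lambda s: 1.0 if "yes" in s else 0.0
ranked = rescore_nbest(
    [["no", "no"], ["yes", "yes", "ok"]], toy_log_p0, [has_yes], [2.0]
)
```

In a real recognizer the language score would be combined with acoustic scores before sorting; that step is omitted here.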

Whole Sentence Maximum Entropy Models
The term "exponential model" refers to any model of the form (1). A particular type of such a model is the so-called "Maximum Entropy" (ME) model, where the parameters are chosen so that the distribution satisfies certain linear constraints. Specifically, for each feature $f_i(s)$, its expectation under $P(s)$ is constrained to a specific value $K_i$:

$$E_P[f_i] = K_i \qquad (2)$$

These target values are typically set to the expectation of that feature under the empirical distribution $\tilde{P}$ of some training corpus $\{s_1, \ldots, s_N\}$ (for binary features, this simply means their frequency in the corpus). Then, the constraint becomes:

$$\sum_s P(s) f_i(s) = E_{\tilde{P}}[f_i] = \frac{1}{N} \sum_{j=1}^{N} f_i(s_j) \qquad (3)$$

If the constraints (2) are consistent, there exists a unique solution within the exponential family (1) which satisfies them. Among all (not necessarily exponential) solutions to equations (2), the exponential solution is the one closest to the initial distribution $P_0(s)$ in the Kullback-Leibler sense, and is thus called the Minimum Divergence or Minimum Discrimination Information (MDI) solution. If $P_0(s)$ is uniform, this becomes simply the Maximum Entropy (ME) solution. Furthermore, if the feature target values $K_i$ are the empirical expectations over some training corpus (as in equation (3)), the MDI or ME solution is also the Maximum Likelihood solution of the exponential family. For more information, see [7,8,3].
The MDI or ME solution can be found by an iterative procedure such as the Generalized Iterative Scaling (GIS) algorithm [9]. GIS starts with arbitrary $\lambda_i$'s. At each iteration, the algorithm improves the $\{\lambda_i\}$ values by comparing the expectation of each feature under the current $P$ to the target value, and modifying the associated $\lambda_i$. In particular, we take

$$\lambda_i \leftarrow \lambda_i + \frac{1}{F_i} \log \frac{K_i}{E_P[f_i]} \qquad (4)$$

where $F_i$ determines the step size (see Section 3.2).
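The update can be sketched in a few lines of Python. The sketch assumes the usual iterative-scaling form, in which each parameter is incremented by the log ratio of the target value to the current expectation, divided by the step-size term; the function name and argument names are hypothetical, and the expectations are taken as given (in whole-sentence training they would be estimated by sampling).

```python
import math
from typing import List, Sequence


def iterative_scaling_update(
    lambdas: Sequence[float],
    targets: Sequence[float],        # K_i, assumed strictly positive
    expectations: Sequence[float],   # current estimates of E_P[f_i], assumed > 0
    step_sizes: Sequence[float],     # F_i, assumed strictly positive
) -> List[float]:
    """One update: lambda_i <- lambda_i + log(K_i / E_P[f_i]) / F_i."""
    return [
        lam + math.log(k / e) / f
        for lam, k, e, f in zip(lambdas, targets, expectations, step_sizes)
    ]
```

A feature whose expectation already matches its target is left unchanged, since the log ratio is then zero.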
In training a whole-sentence Maximum Entropy model, computing the expectations $E_P[f_i]$ requires a summation over all possible sentences $s$, a clearly infeasible task. Instead, we estimate $E_P[f_i]$ by sampling from the distribution $P(s)$ and using the sample expectation of $f_i$. Sampling from an exponential distribution is a non-trivial task. Efficient sampling is crucial to successful training. Sampling techniques will be discussed in detail in Section 3.1. As will be shown later on in this paper, with these techniques it is possible to train whole-sentence ME models using very large corpora and very many features.

As mentioned earlier, the term "exponential models" refers to all models of the form (1), whereas the term "ME models" refers to exponential models where the parameters are set to satisfy equation (2). Most of this paper deals with ME models. For a different training criterion for exponential models, see Section 2.5.

Comparison to Conditional ME Models
It is instructive to compare the whole-sentence ME model with conditional ME models, which have seen considerable success recently in language modeling [10,11,8,3]. The conditional model usually has the form:

$$P(w \mid h) = \frac{1}{Z(h)} \cdot P_0(w \mid h) \cdot \exp\Big(\sum_i \lambda_i f_i(h, w)\Big)$$

where the features are functions of a specific word-history pair, and so is the baseline $P_0$. More importantly, $Z$ here is not a true constant: it depends on $h$ and thus must be recomputed during training for each word position in the training data. Namely, for each training datapoint $(h, w)$, one must compute

$$Z(h) = \sum_{w' \in V} P_0(w' \mid h) \cdot \exp\Big(\sum_i \lambda_i f_i(h, w')\Big)$$

where $V$ is the vocabulary. This computational burden is quite severe: training a model that incorporates [...]

A final note of comparison: A whole-sentence ME model incorporating the same features as a conditional ME model is in fact not identical to the latter. This is because the training procedure used for conditional ME models restricts the computation of the feature expectations to histories observed in the training data (see [8] or [3, section 4.4]). This biases the solution in an interesting and sometimes appropriate way. For example, consider word trigger features of the form [...]. If the words $m$ and $q$ are correlated in the training data, this will affect the solution of the conditional ME model. In fact, if they are perfectly correlated, always co-occurring in the same sentence, the resulting $\lambda$'s will likely be one half of what their value would have been if only one of the features were used. This is beneficial to the model, since it captures correlations that are likely to recur in new data. However, a whole-sentence ME model incorporating the same features will not use the correlation between $m$ and $q$, unless it is explicitly instructed to do so via a separate feature. This is because the training data is not actually used in whole-sentence ME training, except initially to derive the features' target values.

Normalization and Perplexity
Just as it is infeasible to calculate feature expectations exactly for whole-sentence models, it is equally infeasible to compute the normalization constant

$$Z = \sum_s P_0(s) \cdot \exp\Big(\sum_i \lambda_i f_i(s)\Big)$$

Fortunately, this is not necessary for training: sampling (and hence expectation estimation) can be done without knowing $Z$, as will be shown in Section 3.
Using the model as part of a classifier (e.g., a speech recognizer) does not require knowledge of $Z$ either, because the relative ranking of the different classes is not changed by a single, universal constant. Notice that this is not the case for conditional exponential models.
Nonetheless, at times it may be desirable to approximate $Z$, perhaps in order to compute perplexity. With the whole-sentence model this can be done as follows.
Let $u(s) = \exp\big(\sum_i \lambda_i f_i(s)\big)$ be the unnormalized modification made to the initial model $P_0(s)$. Then $P(s) = \frac{1}{Z} P_0(s)\, u(s)$. By the normalization constraint we have:

$$\sum_s P(s) = \frac{1}{Z} \sum_s P_0(s)\, u(s) = 1$$

From which we get:

$$Z = \sum_s P_0(s)\, u(s) = E_{P_0}[u(s)]$$

Thus $Z$ can be approximated to any desired accuracy from a suitably large sample $S_0$ of sentences drawn from $P_0$. Often, $P_0$ is based on an n-gram. A reasonably efficient sampling technique for n-grams is described later.
To estimate reduction in per-word perplexity ($PP$) over the $P_0$ baseline, let $T$ be a test set, and let $n_s$ and $n_w$ be the number of sentences and words in it, respectively. By definition

$$PP_P(T) = \exp\Big(-\frac{1}{n_w} \sum_{s \in T} \log P(s)\Big)$$

It follows then that the perplexity reduction ratio is:

$$\frac{PP_P(T)}{PP_{P_0}(T)} = \Big(\prod_{s \in T} \frac{P_0(s)}{P(s)}\Big)^{1/n_w} = \left(\frac{Z}{\big(\prod_{s \in T} u(s)\big)^{1/n_s}}\right)^{n_s/n_w}$$

Substituting in the estimation of $Z$, the estimated perplexity reduction is:

$$\frac{PP_P(T)}{PP_{P_0}(T)} \approx \left(\frac{\frac{1}{|S_0|}\sum_{s \in S_0} u(s)}{\big(\prod_{s \in T} u(s)\big)^{1/n_s}}\right)^{n_s/n_w}$$

that is, the ratio of the arithmetic mean of $u$ to its geometric mean, raised to the power $n_s/n_w$, where the arithmetic mean is taken with respect to $P_0$ and the geometric mean with respect to $T$. Interestingly, if $T = S_0$, i.e. if the test set is also sampled from the baseline distribution, it follows from the law of inequality of averages that the new perplexity will always be higher. This, however, is appropriate because any "correction" to an initial probability distribution will assign a lower likelihood (and hence higher perplexity) to data generated by that distribution.
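The estimate of the normalization constant and of the perplexity ratio can be sketched as follows. This is only an illustration under stated assumptions: the unnormalized correction of a sentence is taken to be the exponential of its weighted feature sum, the normalization constant is estimated as the arithmetic mean of that correction over a sample drawn from the baseline, and the word count excludes any end-of-sentence token. All function names are hypothetical.

```python
import math
from typing import Callable, Sequence

Sentence = Sequence[str]


def log_u(sentence, features, lambdas) -> float:
    """Log of the unnormalized correction: sum_i lambda_i * f_i(s)."""
    return sum(lam * f(sentence) for lam, f in zip(lambdas, features))


def estimate_z(baseline_sample, features, lambdas) -> float:
    """Arithmetic mean of the correction over sentences sampled from P0."""
    values = [math.exp(log_u(s, features, lambdas)) for s in baseline_sample]
    return sum(values) / len(values)


def perplexity_ratio(test_set, z_hat, features, lambdas) -> float:
    """Estimated PP(new model) / PP(baseline) on the test set."""
    n_sentences = len(test_set)
    # Assumption: words are counted without an end-of-sentence token.
    n_words = sum(len(s) for s in test_set)
    mean_log_u = sum(log_u(s, features, lambdas) for s in test_set) / n_sentences
    # (arithmetic mean / geometric mean) ** (n_sentences / n_words)
    return math.exp((n_sentences / n_words) * (math.log(z_hat) - mean_log_u))
```

A ratio below one indicates a perplexity reduction over the baseline. When the test set is the baseline sample itself, the ratio cannot fall below one, in line with the inequality of arithmetic and geometric means.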

Discriminative Training
Until now we discussed primarily Maximum Entropy or Maximum Likelihood training. However, whole-sentence exponential models can also be trained to directly minimize the error rate in an application such as speech recognition. This is known as Minimum Classification Error (MCE) training, or Discriminative Training. The log-likelihood of a whole-sentence exponential model with $k$ features is given by

$$\log P(s) = -\log Z + \log P_0(s) + \sum_{i=1}^{k} \lambda_i f_i(s)$$

The last term is a weighted sum, which can be directly (albeit only locally) optimized for MCE using a heuristic grid search such as Powell's algorithm, which searches the space defined by the $\lambda_i$'s. In fact, the second term ($\log P_0(s)$) can also be assigned a weight, and scores not related to language modeling can be added to the mix as well, for joint optimization. This is a generalization of the "language weight" and "word insertion penalty" parameters currently used in speech recognition. For an attempt in this direction, see [12].

Conditional Whole-Sentence Models
So far we have discussed non-conditional models of the form $P(s)$. In order to model cross-sentence effects, one can re-introduce the conditional form of the exponential model, albeit with some modifications. Let $h$ refer to the sentence history, namely the sequence of sentences from the beginning of the document or conversation up to (but not including) the current sentence. The model then becomes:

$$P(s \mid h) = \frac{1}{Z(h)} \cdot P_0(s \mid h) \cdot \exp\Big(\sum_i \lambda_i f_i(h, s)\Big)$$

Although the normalization constant is no longer universal, it is still not needed for N-best rescoring of a speech recognizer's output. This is because rescoring is typically done one sentence at a time, with all competing hypotheses sharing the same sentence history.
We will not pursue models of this type any further in this paper, except to note that they can be used to exploit session-wide information, such as topics and other dialog level features.

Sampling
Since explicit summation over all sentences $s$ is infeasible, we will estimate the expectations $E_P[f_i]$ by sampling. In this section, we describe several statistical techniques for sampling from exponential distributions, and evaluate their efficacy for generating natural language sentences. Gibbs Sampling [13] is a well known technique for sampling from exponential distributions. It was used in [14] to sample from the population of character strings. We will now describe how to use it to generate whole sentences from an unnormalized joint exponential distribution, then present alternative methods which are more efficient in this domain.
To generate a single sentence from $P(s)$, start from any arbitrary sentence $s$, and iterate as follows:
1. Choose a word position $i$ (either randomly or by cycling through all word positions in some order).
2. Let $s_{i,w}$ be the sentence produced by replacing the word in position $i$ in sentence $s$ with the word $w$. For each word $w$ in the vocabulary $V$, calculate $P(s_{i,w})$ (the unknown normalization constant is not needed, since it cancels in the next step).
3. Choose a word at random according to the distribution $P(s_{i,w}) / \sum_{w' \in V} P(s_{i,w'})$. Place that word in position $i$ in the sentence. This constitutes a single step in the random walk in the underlying Markov field.

To allow transitions into sentences of any length, we do the following:
- The end-of-sentence position is also considered for replacement by an ordinary word, which effectively lengthens the sentence by one word.
- When the last word position in the sentence is picked, the end-of-sentence symbol </s> is also considered. If chosen, this effectively shortens the sentence by one word.

After enough iterations of the above procedure, the resulting sentence is guaranteed to be an unbiased sample from the Gibbs distribution $P(s)$.

Generating sample sentences from a Gibbs distribution as described above is quite slow. To speed things up, the following variations are useful:
- Draw the initial sentence $s$ from a "reasonable" distribution, such as a unigram based on the training data, or from $P_0$. This tends to reduce the necessary number of iterations per step.
- For an initial sentence $s$ use the final (or some intermediate) sentence from a previous random walk. This again tends to reduce the necessary number of iterations. However, the resulting sentence may be somewhat correlated with the previous sample.
- At each iteration, consider only a subset of the vocabulary for replacement. Any subset can be chosen as long as the underlying Markov Chain remains ergodic. This trades off the computational cost per iteration against the mixing rate of the Markov chain (that is, the rate at which the random walk converges to the underlying equilibrium distribution).
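A single step of the random walk described above might look as follows in Python. This is a literal, illustrative reading of the procedure rather than a reference implementation: the function name, the uniform choice of position, and the decision never to shorten a one-word sentence are assumptions, and the sketch does not address the finer points of guaranteeing the correct equilibrium distribution when the sentence length changes. `log_score` is assumed to return the unnormalized log probability of a sentence.

```python
import math
import random
from typing import Callable, List, Sequence


def gibbs_step(
    sentence: Sequence[str],
    vocab: Sequence[str],
    log_score: Callable[[List[str]], float],
    rng: random.Random,
) -> List[str]:
    """Resample one position of the sentence, allowing length changes."""
    words = list(sentence)
    # Index len(words) stands for the end-of-sentence position.
    i = rng.randrange(len(words) + 1)
    if i == len(words):
        # Keep </s> here, or replace it by an ordinary word (lengthening).
        candidates = [words] + [words + [w] for w in vocab]
    else:
        candidates = [words[:i] + [w] + words[i + 1:] for w in vocab]
        if i == len(words) - 1 and len(words) > 1:
            # Last word position: </s> is also a candidate (shortening).
            candidates.append(words[:i])
    log_scores = [log_score(c) for c in candidates]
    top = max(log_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in log_scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Every step evaluates the score of one candidate per vocabulary word, which is the cost that makes this sampler slow for large vocabularies.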
Even with these improvements, Gibbs sampling is not the most efficient for this domain, as the probability of a great many sentences must be computed to generate each sample. Metropolis sampling [15], another Markov Chain Monte Carlo technique, appears more appropriate for this situation. An initial sentence is chosen as before. For each chosen word position $i$, a new word $v$ is proposed from a distribution $q(v)$ to replace the original word $w$ in that position, resulting in a proposed new sentence $s_{i,v}$. This new sentence is accepted with probability

$$\min\left(1, \frac{P(s_{i,v})\, q(w)}{P(s)\, q(v)}\right)$$

Otherwise, the original word $w$ is retained. After all word positions have been examined, the resulting sentence is added to the sample, and this process is repeated. The distribution $q(v)$ used to generate new word candidates for each position affects the sampling efficiency; in the experiments reported in this paper, we used a unigram distribution.
As in Gibbs sampling, adapting the Metropolis algorithm to sentences of variable length requires care. In one solution, we pad each sentence with end-of-sentence tokens </s> up to some fixed length. A sentence becomes shorter if the last non-</s> token is changed to </s>, longer if the first </s> token is changed to something else.
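One sweep of word-based Metropolis sampling can be sketched as below. The sketch assumes the standard Metropolis-Hastings acceptance rule for a proposal that does not depend on the current word, works in log space, and for brevity keeps the sentence length fixed (the padding scheme for variable length is not implemented). The function and argument names are hypothetical; `log_score` is assumed to return an unnormalized log probability and `log_q` the log probability of a word under the proposal distribution.

```python
import math
import random
from typing import Callable, List, Sequence


def metropolis_sweep(
    sentence: Sequence[str],
    vocab: Sequence[str],
    log_q: Callable[[str], float],
    log_score: Callable[[List[str]], float],
    rng: random.Random,
) -> List[str]:
    """Visit every word position once, proposing a replacement at each."""
    words = list(sentence)
    proposal_weights = [math.exp(log_q(w)) for w in vocab]
    for i in range(len(words)):
        v = rng.choices(vocab, weights=proposal_weights, k=1)[0]
        if v == words[i]:
            continue  # proposing the current word changes nothing
        proposal = words[:i] + [v] + words[i + 1:]
        # log of [P(proposal) * q(old word)] / [P(current) * q(new word)]
        log_accept = (log_score(proposal) + log_q(words[i])) - (
            log_score(words) + log_q(v)
        )
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            words = proposal
    return words
```

Unlike the Gibbs step, each position requires scoring only one alternative sentence, which is why the method is cheaper per sample.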
In applying Metropolis sampling, instead of replacing a single word at a time it is possible to replace larger units. In particular, in independence sampling we consider replacing the whole sentence in each iteration. For efficiency, the distribution $q(s)$ used to generate new sentence candidates must be similar to the distribution $P(s)$ we are attempting to sample from.
In importance sampling, a sample $\{s_1, \ldots, s_M\}$ is generated according to some sentence distribution $q(s)$, which similarly must be close to $P(s)$ for efficient sampling. To correct the bias introduced by sampling from $q(s)$ instead of from $P(s)$, each sample $s_j$ is counted $P(s_j)/q(s_j)$ times, so that we have

$$E_P[f_i] \approx \frac{\sum_{j=1}^{M} \frac{P(s_j)}{q(s_j)} f_i(s_j)}{\sum_{j=1}^{M} \frac{P(s_j)}{q(s_j)}}$$

Which sampling method is best depends on the nature of $P(s)$ and $q(s)$. We evaluated these methods (except Gibbs sampling, which proved too slow) on some of the models to be described in Section 4. These models employ a trigram as the initial distribution $P_0(s)$. (Generating sentences from an n-gram model can be done quite efficiently: one starts from the beginning-of-sentence symbol, and iteratively generates a single word according to the n-gram model and the specific context, until the end-of-sentence symbol is generated. Generating a single word from an n-gram model requires $O(|V|)$ steps. While this computation is not trivial, it is far more efficient than sampling directly from an exponential distribution.) Therefore, taking $q(s)$ to be a trigram model for independence and importance sampling is very effective.

To measure the effectiveness of the different sampling algorithms, we did the following. Using an exponential model with a baseline trigram trained on 3 million words of Switchboard text [16] and a vocabulary of some 15,000 words, for each of the sampling methods we generated ten independent samples of 100,000 sentences. We estimated the expectations of a set of features on each sample, and calculated the empirical variance in the estimate of these expectations over the ten samples. More efficient sampling algorithms should yield lower variances.

Table 1: Mean and standard deviation (of mean) of feature expectation estimates for sentence-length features for three sampling algorithms (Metropolis, independence, importance) over ten runs.
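The importance-sampling estimate of a feature expectation can be illustrated with a short Python sketch. It assumes the self-normalized form of the estimator (weights divided by their sum), which is the natural choice here because the model probability is known only up to the normalization constant. The names are hypothetical; `log_unnorm_p` returns the unnormalized log probability under the model and `log_q` the log probability under the generating distribution.

```python
import math
from typing import Callable, Sequence

Sentence = Sequence[str]


def importance_expectation(
    sample: Sequence[Sentence],
    feature: Callable[[Sentence], float],
    log_unnorm_p: Callable[[Sentence], float],
    log_q: Callable[[Sentence], float],
) -> float:
    """Estimate E_P[f] from sentences drawn from q, weighting each by P/q."""
    log_weights = [log_unnorm_p(s) - log_q(s) for s in sample]
    top = max(log_weights)  # rescaling cancels in the ratio below
    weights = [math.exp(x - top) for x in log_weights]
    weighted_sum = sum(w * feature(s) for w, s in zip(weights, sample))
    return weighted_sum / sum(weights)
```

When the generating distribution is the initial trigram itself, the weight of a sentence reduces to the exponential of its weighted feature sum, so a single sampled corpus can be re-weighted as the parameters change.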
In our experiments, we found that independence sampling and importance sampling both yielded excellent performance, while word-based Metropolis sampling performed substantially worse. As an example, we estimated expectations for sentence-length features of the form [...] over ten samples of size 100,000. In Table 1, we display the means and standard deviations of the ten expectation estimates for each of the five sentence-length features under three sampling algorithms.

The efficiency of importance and independence sampling depends on the distance between the generating distribution $q(s)$ and the desired distribution $P(s)$. If $q(s) = P_0(s)$, that distance will grow with each training iteration. Once the distance becomes too large, Metropolis sampling can be used for one iteration, say iteration $k$, and the resulting sample retained. Subsequent iterations can re-use that sample via importance or independence sampling with $q(s) = P^{[k]}(s)$, the model at iteration $k$. Note that, even if training were to stop at iteration $k$, $P^{[k]}$ is arguably a better model than the initial model $P_0$, since it has moved considerably (by our assumption) towards satisfying the feature constraints.
Using the techniques we discuss above, training a whole-sentence ME model is feasible even with large corpora and many features. And yet, training time is not negligible. Some ideas for reducing the computational burden which we have not yet explored include:
- Use only rough estimation (i.e. small sample size) in the first few iterations (we only need to know the direction and rough magnitude of the correction to the $\lambda_i$'s); increase sample size when approaching convergence.
- Determine the sample size dynamically, based on the number of times each feature was observed so far.
- Add features gradually (this has already proven itself effective at least anecdotally, as reported in Section 4.1).

Step Size
In GIS, the step size for feature update is inversely related to the number of active features. As sentences typically have many features, this may result in very slow convergence. Improved Iterative Scaling (IIS) [14] uses a larger effective step size than GIS, but requires a great deal more bookkeeping.
However, when feature expectations are near their target value, IIS can be closely approximated with equation (4) where $F_i$ is taken to be a weighted average of the feature sum over all sentences; i.e., if the set of sentences $s$ were finite, we would take

$$F_i = \frac{\sum_s P(s)\, f_i(s) \sum_j f_j(s)}{\sum_s P(s)\, f_i(s)}$$

In our implementation, we approximated $F_i$ by summing only over the sentences in the sample used to calculate expectations. This technique resulted in convergence in all of our experiments.
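The sample-based approximation of the step-size term can be sketched as follows. The sketch assumes that the weighted average in question weights the feature sum of each sentence by the value of the feature being updated (the form obtained from a first-order expansion of the IIS update around the target), and that the sample is drawn from the current model so that all sentences carry equal weight. The function name and the row-per-sentence layout are hypothetical.

```python
from typing import List, Optional, Sequence


def approximate_step_sizes(
    feature_rows: Sequence[Sequence[float]],
) -> List[Optional[float]]:
    """feature_rows[j][i] holds f_i(s_j) for the j-th sampled sentence.

    Returns, for each feature i, the average of the per-sentence feature
    sum weighted by f_i, or None if the feature never fires in the sample.
    """
    n_features = len(feature_rows[0])
    numerators = [0.0] * n_features
    denominators = [0.0] * n_features
    for row in feature_rows:
        feature_sum = sum(row)  # total feature mass of this sentence
        for i, value in enumerate(row):
            numerators[i] += value * feature_sum
            denominators[i] += value
    return [
        num / den if den > 0 else None
        for num, den in zip(numerators, denominators)
    ]
```

With an importance-weighted sample, each row's contribution would additionally be multiplied by that sentence's weight.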

Smoothing
From equation (4) we can see that if $E_{\tilde{P}}[f_i] = 0$ then we will have $\lambda_i \to -\infty$. To smooth our model, we use the fuzzy maximum entropy method first described by [17]: we introduce a Gaussian prior on the $\lambda_i$ values and search for the maximum a posteriori model instead of the maximum likelihood model. This has the effect of changing Equation (2) to

$$E_P[f_i] = K_i - \frac{\lambda_i}{\sigma_i^2}$$

for some suitable variance parameter $\sigma_i^2$. With this technique, we found that over-training (overfitting) was never a problem. For a detailed analysis of this and other smoothing techniques for Maximum Entropy models, see [18].

FEATURE SETS AND EXPERIMENTS
In this section we describe several experiments with the new model, using various feature sets and sampling techniques. We start with the simple reconstruction of n-grams using Gibbs sampling, proceed with longer distance and class based lexical relations using importance sampling, and end with syntactic parse-based features. For subsequent work using semantic features, see [19].

Validation
To test the feasibility of Gibbs sampling and generally validate our approach, we built a whole-sentence ME model using a small (10K word) training set of Broadcast News [20] utterances. We set $P_0(s)$ to be uniform, and used unigram, bigram and trigram features of the form [...]

The features were not introduced all at the same time. Instead, the unigram features were introduced first, and the model was allowed to converge. Next the bigram features were introduced, and the model again allowed to converge. Finally the trigram features were introduced. This resulted in faster convergence than the simultaneous introduction of all feature types. Training was done by Gibbs sampling throughout.
Below we provide sample sentences generated by Gibbs sampling from various stages of the training procedure. Table 2 lists sample sentences generated by the initial model, before any training took place. Since the initial $\lambda_i$'s were all set to zero, this is the uniform model. Tables 3 through 5 list sample sentences generated by the converged model after the introduction of unigram, bigram and trigram features, respectively. It can be seen from the example sentences that the model indeed successfully incorporated the information provided by the respective features.
The model described above incorporates only "conventional" features which are easy to incorporate in a simple conditional language model. This was done for demonstrative purposes only. The model is unaware of the nature or complexity of the features. Arbitrary features can be accommodated with virtually no change in the model structure or the code. As we mentioned earlier, Gibbs sampling turned out to be the least efficient of all sampling techniques we considered. As we will show next, much larger corpora and many more features can be feasibly trained with the more efficient techniques.

Generalized n-grams and Feature Selection
In our next experiment we used a much larger corpus and a richer set of features. Our training data consisted of 2,895,000 words (nearly 187,000 sentences) of Switchboard text (SWB) [16]. First, we constructed a conventional trigram model on this data using a variation of Kneser-Ney smoothing [21], and used it as our initial distribution $P_0(s)$. We then employed features that constrained the frequency of word n-grams (up to n=4), distance-two (i.e. skipping one word) word n-grams (up to n=3) [3], and class n-grams (up to n=5) [22]. We partitioned our vocabulary (of some 15,000 words) into 100, 300, and 1000 classes using the word classing algorithm of [23] on our training data.

To select specific features we devised the following procedure. First, we generated an artificial corpus by sampling from our initial trigram distribution $P_0(s)$. This "trigram corpus" was of the same size as the training corpus. For each n-gram, we compared its count in the "trigram corpus" to that in the training corpus. If these two counts differed significantly (using a $\chi^2$ test), we added the corresponding feature to our model. (The idea of imposing a constraint that is most violated by the current model was first proposed by Robert Mercer, who called it "nailing down".) We tried thresholds on the $\chi^2$ statistic of 500, 200, 100, 30, and 15, resulting in approximately 900, 3,000, 10,000, 20,000 and 52,000 n-gram features, respectively. n-grams with zero counts were considered to have 0.5 counts in this analysis.

In Table 6, we display the n-grams with the highest $\chi^2$ scores. The majority of these n-grams involve a 4-gram or 5-gram that occurs zero times in the training corpus and occurs many times in the trigram corpus. These are clear examples of longer-distance dependencies that are not modeled well with a trigram model. However, the last feature is [...]

For each feature set, we trained the corresponding model after initializing all $\lambda_i$ to 0. We used importance sampling to calculate expectations. However, instead of generating an entirely new sample for each iteration, we generated a single corpus from our initial trigram model, and re-weighted this corpus for each iteration using importance sampling. (This technique may result in mutually inconsistent constraints for rare features, but convergence can still be assured by reducing the step size $F_i$ with each iteration.) We trained each of our feature sets for 50 iterations of iterative scaling; each complete training run took less than three hours on a 200 MHz Pentium Pro computer.
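The count-comparison step of the feature selection procedure can be sketched in Python. The exact form of the test statistic is not given in the text, so the sketch assumes one common choice for comparing two counts from corpora of equal size, namely the squared difference divided by the sum of the counts; this is an assumption, as are the function name and the dictionary-based interface. Zero counts are replaced by 0.5, as described above.

```python
from typing import Dict, Hashable, List, Tuple


def select_features(
    train_counts: Dict[Hashable, int],
    model_counts: Dict[Hashable, int],
    threshold: float,
    zero_count: float = 0.5,
) -> List[Tuple[Hashable, float]]:
    """Return (item, score) pairs whose score exceeds the threshold.

    train_counts: counts in the training corpus.
    model_counts: counts in the corpus sampled from the initial model.
    """
    selected = []
    for item in set(train_counts) | set(model_counts):
        a = train_counts.get(item, 0) or zero_count  # 0 becomes 0.5
        b = model_counts.get(item, 0) or zero_count
        # Assumed statistic for two counts with equal exposure.
        score = (a - b) ** 2 / (a + b)
        if score > threshold:
            selected.append((item, score))
    # Most strongly violated constraints first.
    return sorted(selected, key=lambda pair: pair[1], reverse=True)
```

Lowering the threshold admits more features, mirroring the growth in feature-set size reported for the smaller thresholds.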
We measured the impact of these features by rescoring speech recognition N-best lists ([...]) which were generated by the Janus system [24] on a Switchboard/CallHome test set of 8,300 words. The trigram $P_0(s)$ served as a baseline. For each model, we computed both the top-1 word error rate and the average rank of the least errorful hypothesis. These figures were computed first by combining the new language scores with the existing acoustic scores, and again by considering the language scores only. Results for the three largest feature sets are summarized in Table 7 (for the smaller feature sets improvement was smaller still). While the specific features we selected here made only a small difference in N-best rescoring, they serve to demonstrate the extreme generality of our model: any computable property of the sentence which is currently not adequately modeled can (and should) be added into the model.

Syntactic Features
In the last set of experiments, we used features based on variable-length syntactic categories to improve on an initial trigram model in the Switchboard domain. Our training dataset was the same Switchboard corpus used in Section 4.2.
Due to the often agrammatical nature of Switchboard language (informal, spontaneous telephone conversations), we chose to use a shallow parser that, given an utterance, produces only a flat sequence of syntactic constituents. The syntactic features were then defined in terms of these constituent sequences.

The Shallow Switchboard Parser
The shallow Switchboard parser [25] was designed to parse spontaneous, conversational speech in unrestricted domains. It is very robust and fast for such sentences. First, a series of preprocessing steps are carried out. These include eliminating word repetitions, expanding contractions, and cleaning disfluencies. Next, the parser assigns a part-of-speech tag to each word. For example, the input sentence

Okay I uh you know I think it might be correct

will be processed into

I/NNP think/VBP it/PRPA might/AUX be/VB correct/JJ

As the next step, the parser breaks the preprocessed and tagged sentence into one or more simplex clauses, which are clauses containing an inflected verbal form and a subject. This simplifies the input and makes parsing more robust. In our example above, the parser will generate two simplex clauses:

simplex 1: I/NNP think/VBP
simplex 2: it/PRPA might/AUX be/VB correct/JJ

Finally, with a set of handwritten grammar rules, the parser parses each simplex clause into constituents. The parsing is shallow since it doesn't generate embedded constituents; i.e., the parse tree is flat. In the example, simplex 1 has two constituents:

[...]

and simplex 2 has three constituents:

[ np] ( [NP head] it/PRPA ) [ vb] ( might/AUX [VP head] be/VB ) [ prdadj] ( correct/JJ )
The parser sometimes leaves a few function words (e.g. to, of, in) unparsed in the output. For the purpose of feature selection, we regarded each of these function words as a constituent. Counted this way, there are a total of 110 constituent types.

Feature Types
As mentioned above, the shallow parser breaks an input sentence into one or more simplex clauses, which are then further broken down into flat sequences of constituents. We defined three types of features based solely on the constituent types (i.e., we ignored the identities of words within the constituents): Constituent Sequence features, Constituent Set features, and Constituent Trigram features.

Feature Selection
We followed the procedure described in Section 4.2 to find useful features. We generated an artificial corpus, roughly the same size as the training corpus, by sampling from $P_0(s)$. We ran both corpora through the shallow parser and counted the occurrences of each candidate feature. If the number of times a feature was active in the training corpus differed significantly from that in the artificial corpus, the feature was considered important and was incorporated into the model. Our reasoning was that the difference is due to a deficiency of the initial model $P_0(s)$, and that adding such a feature will fix that deficiency.
We assumed that our features occur independently, and are therefore binomially distributed. More precisely, we had two independent sets of Bernoulli trials. One is the set of $n$ simplex clauses of the training corpus. The other is the set of $m$ simplex clauses of the artificial corpus. Let $x$ be the number of times a feature occurs in the training corpus and $y$ that in the artificial corpus. Let $p_x$ and $p_y$ be the true occurrence probabilities associated with each corpus. We tested the hypothesis $H_0: p_x = p_y$. Approximating the Generalized Likelihood Ratio test, we rejected $H_0$ when the standard score of the difference between the observed occurrence rates $x/n$ and $y/m$ was significant at the chosen confidence level (see, for example, [26, page 335]). We incorporated into our model those features whose $H_0$ was rejected.
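The selection test can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the test statistic is the usual pooled two-proportion standard score (one common large-sample approximation for this kind of test) and that 1.96 serves as the two-tailed 95% critical value. The function names, the dictionary-based counts, and the toy numbers are all hypothetical.

```python
import math


def standard_score(x, n, y, m):
    """Pooled two-proportion z statistic for H0: p_x == p_y.

    x, y: feature counts in the training and artificial corpora.
    n, m: number of simplex clauses (trials) in each corpus.
    """
    p_pool = (x + y) / (n + m)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n + 1.0 / m))
    if se == 0.0:
        # Feature is active in none or in all clauses: no evidence either way.
        return 0.0
    return (x / n - y / m) / se


def select_features(train_counts, artificial_counts, n, m, z_crit=1.96):
    """Return (feature, z) pairs whose H0 is rejected, most significant first."""
    selected = []
    for feat in set(train_counts) | set(artificial_counts):
        x = train_counts.get(feat, 0)
        y = artificial_counts.get(feat, 0)
        z = standard_score(x, n, y, m)
        if abs(z) >= z_crit:  # two-tailed test
            selected.append((feat, z))
    return sorted(selected, key=lambda fz: -abs(fz[1]))


if __name__ == "__main__":
    # Toy counts (assumed, for illustration only).
    train = {"np vb": 300, "np vb np": 100}
    artificial = {"np vb": 150, "np vb np": 105}
    for feat, z in select_features(train, artificial, n=10_000, m=10_000):
        print(f"{feat}: z = {z:.2f}")
```

A positive score means the initial model undergenerates the feature relative to the training corpus; a negative score means it overgenerates it.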

Results
Constituent Sequence features: There were 186,903 candidate features of this type that occurred at least once in the two corpora. Of those, 1,935 showed a significant difference between the two corpora at a 95% confidence level (two-tailed). The most significant feature had a standard score of 21.9 in the test, with $x = 2968$ occurrences in the SWB corpus and $y = 1548$ in the artificial corpus. More interesting is the feature

Perplexity and Word Error Rate
We incorporated the 1953 Constituent Sequence features, 1310 Constituent Set features, and 3535 Constituent Trigram features into a whole-sentence maximum entropy language model, and trained its parameters with the GIS algorithm. The baseline perplexity of a 90,600-word SWB test set calculated under the initial model $P_0$ was 81.37. The perplexity under the new maximum entropy model was estimated as 80.49 ± 0.02, a relative improvement of only 1%.
Next, we tested speech recognition word error rate by N-best list rescoring. A 200-best list with 8,300 words was used. The WER was 36.53% with the initial model and 36.38% with all of the syntactic features added, a mere 0.4% relative improvement.
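As a quick arithmetic check of the relative improvements quoted above, the small helper below (a hypothetical name, not part of the original work) computes the fractional reduction from the reported perplexity and WER figures.

```python
def relative_improvement(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Figures reported in the text.
ppl_gain = relative_improvement(81.37, 80.49)   # perplexity
wer_gain = relative_improvement(36.53, 36.38)   # word error rate (%)

print(f"perplexity: {100 * ppl_gain:.2f}%  WER: {100 * wer_gain:.2f}%")
```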

ANALYSIS
In trying to understand the disappointing results of the last section, we analyzed the likely effect of features on the final model. The upper bound on improvement from a single binary feature $f_i$ is the Kullback-Leibler distance between the true distribution of $f_i$ (as estimated by the empirical distribution $\tilde{P}(f_i)$) and $P(f_i)$ (the distribution of $f_i$ according to the current model) [14, p. 4]. The effect of multiple features is not necessarily additive (in fact, it could be supra- or sub-additive). Nonetheless, the sum of the individual effects may still give some indication of the likely combined effect. For the syntactic features we used, we computed the sum $\sum_i D(\tilde{P}(f_i) \,\|\, P(f_i))$, which translates into an expected perplexity reduction of 0.43% (the per-sentence sum is divided by 10, the average number of words in a sentence, to obtain the per-word effect). The potential impact of these features is apparently very limited. We therefore need to seek features $f$ for which

$$D(\tilde{P}(f) \,\|\, P(f)) = \tilde{P}(f) \log \frac{\tilde{P}(f)}{P(f)} + (1 - \tilde{P}(f)) \log \frac{1 - \tilde{P}(f)}{1 - P(f)}$$

is significantly larger. In the features we used, the log discrepancy between the truth and the model was quite large, but the prevalence of the feature was very small. Thus, we need to concentrate on more common features.
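The trade-off between prevalence and discrepancy can be made concrete with a short sketch. It assumes logarithms are taken base 2 (so the bound is in bits per sentence) and that a per-sentence gain is converted to a per-word perplexity change by dividing by the average sentence length of 10 words mentioned above. The function names and the example probabilities are hypothetical.

```python
import math


def binary_kl_bits(p_emp, p_model):
    """D(p_emp || p_model) in bits for a binary feature.

    p_emp:   empirical probability that the feature is active.
    p_model: probability under the current model (must be in (0, 1)).
    """
    def term(a, b):
        # By convention 0 * log(0 / b) = 0.
        return 0.0 if a == 0.0 else a * math.log2(a / b)

    return term(p_emp, p_model) + term(1.0 - p_emp, 1.0 - p_model)


def perplexity_reduction(total_kl_bits, words_per_sentence=10.0):
    """Fractional perplexity reduction implied by a per-sentence gain in bits."""
    return 1.0 - 2.0 ** (-total_kl_bits / words_per_sentence)


if __name__ == "__main__":
    # A rare feature with a tenfold discrepancy vs. a common feature
    # with a modest discrepancy (toy numbers, for illustration only).
    rare = binary_kl_bits(1e-4, 1e-5)
    common = binary_kl_bits(0.5, 0.4)
    print(f"rare:   {rare:.6f} bits -> {100 * perplexity_reduction(rare):.4f}%")
    print(f"common: {common:.6f} bits -> {100 * perplexity_reduction(common):.4f}%")
```

Under these assumptions the common feature yields a much larger bound than the rare one, even though its log discrepancy is far smaller.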
An ideal feature should occur frequently enough, yet exhibit a significant discrepancy. "Does the sentence make sense to a human reader?" is such a feature (where $\tilde{P}(f) \approx 1$ and $P_0(f) \approx 0$). It is, of course, AI-hard to compute. However, even a rough approximation of it may be quite useful. Based on this analysis, we have subsequently focused our attention on deriving a smaller number of frequent (and likely more complex) features, based on the notion of sentence coherence [27].
Frequent features are also computationally preferable. Because the training bottleneck in whole-sentence ME models is in estimating feature expectations via sampling, the computational cost is determined mostly by how rare the features are and how accurately we want to model them. The more frequent the features, the less the computation. Note that the computational cost of training depends much less on the vocabulary, the amount of training data, or the number of features.
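A rough calculation illustrates why rare features dominate the sampling cost. The sketch below assumes independent samples, for which the estimate of a binary feature's expectation $p$ from $N$ samples has variance $p(1-p)/N$; MCMC samples are correlated, so the figures should be read as optimistic lower bounds. The function name is hypothetical.

```python
import math


def samples_needed(p, rel_err):
    """Independent samples needed so that the relative standard error of the
    estimated expectation of a binary feature with probability p is at most
    rel_err. Derived from sqrt((1 - p) / (p * N)) <= rel_err.
    """
    if not (0.0 < p < 1.0) or rel_err <= 0.0:
        raise ValueError("need 0 < p < 1 and rel_err > 0")
    return math.ceil((1.0 - p) / (p * rel_err ** 2))


if __name__ == "__main__":
    for p in (1e-1, 1e-2, 1e-4):
        print(f"p = {p:g}: about {samples_needed(p, 0.1):,} samples for 10% relative error")
```

Under this assumption the required sample size grows roughly as $1/p$, independently of the vocabulary size or the number of features.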

SUMMARY AND DISCUSSION
We presented an approach to incorporating arbitrary linguistic information into a statistical model of natural language. We described efficient algorithms for constructing whole-sentence ME models, offering solutions to the questions of sampling, step size and smoothing. We demonstrated our approach in two domains, using lexical and syntactic features. We also introduced a procedure for feature selection which seeks and exploits discrepancies between an existing model and the training corpus.
Whole-sentence ME models are more efficient than conditional ME models, and can naturally express sentence-level phenomena. It is our hope that these improvements will break the ME "usability barrier" which has heretofore hindered exploration and integration of multiple knowledge sources. This will hopefully open the floodgates to experimentation, by many researchers, with varied knowledge sources which they believe to carry significant information. Such sources may include:

Distribution of verbs and tenses in the sentence
Various aspects of grammaticality (person agreement, number agreement, parsability, other parser-supplied information)
Semantic coherence
Dialog level information
Prosodic and other time related information (speaking rate, pauses, ...)

Since all knowledge sources are incorporated in a uniform way, a language modeler can focus on which properties of language to model as opposed to how to model them. Attention can thus be shifted to feature induction. Indeed, we have started working on an interactive feature induction methodology, recasting it as a logistic regression problem [27,19]. Taken together, we hope that these efforts will help open the door to "putting language back into language modeling" [28].

1
where the $\lambda_i$'s are the parameters of the model, $Z$ is a universal normalization constant which depends only on the $\{\lambda_i\}$'s, and the $\{f_i(s)\}$'s are arbitrary computable properties

<s> ENOUGH CARE GREG GETTING IF O. ANSWER NEVER </s>
<s> DEATH YOU'VE BOTH THEM RIGHT BACK WELL BOTH </s>
<s> MONTH THAT'S NEWS ANY YOU'VE WROTE MUCH MEAN </s>
<s> A. HONOR WE'VE ME GREG LOOK TODAY N. </s>
<s> DO YOU WANT TO DON'T C. WAS YOU </s>
<s> THE I DO YOU HAVE A A US </s>
<s> BUT A LOS ANGELES ASK C. NEWS ARE </s>
<s> WE WILL YOU HAVE TO BE AGENDA AND </s>
<s> THE WAY IS THE DO YOU THINK ON </s>
<s> WHAT DO YOU HAVE TO LIVE LOS ANGELES </s>
<s> A. B. C. N. N. BUSINESS NEWS TOKYO </s>
<s> BE OF SAYS I'M NOT AT THIS IT </s>
<s> BILL DORMAN BEEN WELL I THINK THE MOST </s>
<s> DO YOU HAVE TO BE IN THE WAY </s>

3. Constituent Trigram features: for any ordered constituent triplet $(c_1, c_2, c_3)$, $f_{c_1 c_2 c_3}(s) = 1$ if and only if sentence $s$ contains that contiguous sequence at least once. Otherwise $f_{c_1 c_2 c_3}(s) = 0$. This set of features resembles traditional class trigram features, except that they correspond to a variable number of words.
is significantly larger. The second term on the right-hand side is usually negligible. The two factors affecting this number are thus the prevalence of the feature ($\tilde{P}(f)$) and the log discrepancy between the truth and the model ($\log \frac{\tilde{P}(f)}{P(f)}$).

Table 2 :
Sentences generated by Gibbs sampling from an initial (untrained) model. Since all
WELL VERY DON'T A ARE NOT LIVE THE </s>
<s> I RIGHT OF NOT SO ANGELES IS DONE </s>
<s> I ARE FOUR THIS KNOW DON'T ABOUT OF </s>
<s> C. GO ARE TO A IT HAD SO </s>
<s> OFF THE MORE JUST POINT WANT MADE WELL </s>
<s>

Table 3 :
Sentences generated by Gibbs sampling from a whole-sentence ME model trained on unigram features only.

Table 6 :
$n$-grams with largest discrepancy (according to the $\chi^2$ statistic) between the training corpus and a trigram-generated corpus of the same size; $n$-grams with a " " token are distance-two $n$-grams; the $w_1$/$w_2$ notation represents a class whose two most frequent members are $w_1$ and $w_2$.

Table 7 :
Top-1 WER and average rank of best hypothesis using varying feature sets.

a class unigram, and indicates that the trigram model overgenerates words from this class. On further examination, the class turned out to contain a large fraction of the rarest words. This indicates that perhaps the smoothing of the trigram model could be improved.
1. Constituent Sequence features: for any constituent sequence $x$ and simplex clause $s$, $f_x(s) = 1$ if and only if the constituent sequence of simplex clause $s$ exactly matches $x$. Otherwise $f_x(s) = 0$. For instance, $f_{\text{np vb}}(\text{I THINK}) = 1$, $f_{\text{np vb prdadj}}(\text{IT MIGHT BE CORRECT}) = 1$, $f_{\text{np vb}}(\text{IT MIGHT BE CORRECT}) = 0$, and so forth.
2. Constituent Set features: for any set $x$ of constituents, $f_x(s) = 1$ if and only if the constituent set of sentence $s$ exactly matches $x$. Otherwise $f_x(s) = 0$. This set of features is a relaxation of Constituent Sequence features, since it doesn't require the position and number of constituents to match exactly. As an example, both $f_{\text{np, vb}}(\text{I LAUGH}) = 1$ and $f_{\text{np, vb}}(\text{I SEE A BIRD}) = 1$, although the constituent sequence of I LAUGH is np vb while that of I SEE A BIRD is np vb np.
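The Constituent Sequence, Constituent Set, and Constituent Trigram feature types can be written as simple indicator functions over a list of constituent labels. The sketch below is illustrative only: it assumes the shallow parser's output has already been reduced to a list of constituent type names (such as "np" or "vb"), and the factory function names are hypothetical.

```python
def constituent_sequence_feature(target_seq):
    """f(s) = 1 iff the constituent sequence of s exactly matches target_seq."""
    target = tuple(target_seq)

    def f(constituents):
        return 1 if tuple(constituents) == target else 0

    return f


def constituent_set_feature(target_set):
    """f(s) = 1 iff the set of constituent types in s equals target_set."""
    target = frozenset(target_set)

    def f(constituents):
        return 1 if frozenset(constituents) == target else 0

    return f


def constituent_trigram_feature(c1, c2, c3):
    """f(s) = 1 iff s contains the contiguous triplet (c1, c2, c3)."""
    target = (c1, c2, c3)

    def f(constituents):
        seq = tuple(constituents)
        # range() is empty when the sequence has fewer than three constituents.
        return 1 if any(seq[i:i + 3] == target for i in range(len(seq) - 2)) else 0

    return f


if __name__ == "__main__":
    i_think = ["np", "vb"]                    # I THINK
    it_might = ["np", "vb", "prdadj"]         # IT MIGHT BE CORRECT
    i_see_a_bird = ["np", "vb", "np"]         # I SEE A BIRD

    f_seq = constituent_sequence_feature(["np", "vb"])
    f_set = constituent_set_feature({"np", "vb"})
    f_tri = constituent_trigram_feature("np", "vb", "np")
    print(f_seq(i_think), f_seq(it_might))
    print(f_set(i_think), f_set(i_see_a_bird))
    print(f_tri(i_see_a_bird), f_tri(i_think))
```

Because each feature ignores word identities, the same indicator fires for every clause or sentence that shares the relevant constituent pattern.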
$x = 0$, $y = 19$. One may suspect that this is where the initial trigram model "makes up" some unlikely phrases. Looking at the 19 simplex clauses confirms this:

Constituent Set features: These features are more general than Constituent Sequence features and thus there are fewer of them. A total of 61,741 candidate Constituent Set features occurred in either corpus, while 1310 showed a significant difference. The one with the most significant