Integrating Intra-Speaker Topic Modeling and Temporal-Based Inter-Speaker Topic Modeling in Random Walk for Improved Multi-Party Meeting Summarization

This paper proposes an improved approach to summarization for spoken multi-party interaction, in which intra-speaker and inter-speaker topics are modeled in a graph constructed with topical relations. Each utterance is represented as a node of the graph, and the edge between two nodes is weighted by the topical similarity between the two utterances, as evaluated by probabilistic latent semantic analysis (PLSA). We model intra-speaker topics by sharing the topics of utterances from the same speaker, and inter-speaker topics by partially sharing the topics of adjacent utterances based on temporal information. For both manual transcripts and ASR output, experiments confirmed the efficacy of combining intra- and inter-speaker topic modeling for summarization.


Introduction
Speech summarization is important [1] because multimedia/spoken documents are more difficult to browse than text, and it has been actively investigated. While most prior work focused primarily on news content, recent effort has increasingly been directed to new domains such as lectures [2,3] and multi-party interaction [4,5,6]. We treat meeting recordings as multi-party interaction and perform extractive summarization on both ASR output and manual transcripts of this corpus [7].
For text summarization, many approaches use graph-based methods that compute the lexical centrality of each utterance to extract summaries [8]. Speech summarization carries intrinsic difficulties due to recognition errors, spontaneous speech effects, and the lack of segmentation. A general approach has been found very successful [9], in which each utterance U = t_1 t_2 ... t_i ... t_n in the document d, represented as a sequence of terms t_i, is given an importance score

    I(U, d) = Σ_{i=1}^{n} [λ_1 s(t_i, d) + λ_2 l(t_i) + λ_3 c(t_i) + λ_4 g(t_i)] + λ_5 b(U),    (1)

where s(t_i, d), l(t_i), c(t_i), and g(t_i) are respectively a statistical measure (such as TF-IDF), a linguistic measure (e.g., different part-of-speech tags are given different weights), a confidence score, and an N-gram score for the term t_i; b(U) is calculated from the grammatical structure of the utterance U; and λ_1, λ_2, λ_3, λ_4, and λ_5 are weighting parameters. For each document, the utterances to be included in the summary are then selected based on this score.
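The weighted combination in (1) can be sketched as follows; the per-term score tables (stat, ling, conf, ngram) and the structure score b are hypothetical stand-ins for s(t_i, d), l(t_i), c(t_i), g(t_i), and b(U):

```python
# Hedged sketch of the importance score I(U, d): a lambda-weighted sum of
# per-term scores plus a structure score for the whole utterance.
def importance(terms, stat, ling, conf, ngram, b, lambdas):
    l1, l2, l3, l4, l5 = lambdas
    # Sum the four weighted term-level measures over all terms of U
    term_part = sum(l1 * stat[t] + l2 * ling[t] + l3 * conf[t] + l4 * ngram[t]
                    for t in terms)
    # Add the weighted utterance-level structure score b(U)
    return term_part + l5 * b
```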
In recent work, we proposed a graphical structure to rescore I(U, d) in (1), modeling the topical coherence between utterances with a random walk within documents [3,5]. Unlike lectures and news, meeting recordings are multi-party interactions, so relations such as the topic distribution within a single speaker or between speakers can be exploited. This paper therefore models intra- and inter-speaker topics together in the graph, by partially sharing topics among utterances from the same speaker or among adjacent utterances, to improve meeting summarization [10].

Proposed Approach
We first preprocess the utterances in all meetings with word stemming and noisy-utterance filtering. We then construct a graph to compute the importance of all utterances, formulating utterance selection as a random walk on a directed graph in which each utterance is a node and edges are weighted by topical similarity. The basic idea is that an utterance similar to more important utterances should itself be more important [3]. We keep only the top N outgoing edges with the highest weights from each node, while considering all incoming edges to each node for importance propagation in the graph. A simplified example of such a graph is shown in Figure 1, in which A_i and B_i are the sets of neighbors of node U_i connected by outgoing and incoming edges, respectively.
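The edge-pruning step above can be sketched as follows, assuming a precomputed pairwise similarity function `sim` (a hypothetical stand-in for the topical similarity defined later):

```python
# Minimal sketch of the graph construction: nodes are utterance indices,
# directed edges are weighted by similarity, and only the top-N outgoing
# edges per node are kept.
def build_graph(n_nodes, sim, top_n):
    edges = {}
    for i in range(n_nodes):
        # Score all candidate outgoing edges from node i
        neighbors = [(sim(i, j), j) for j in range(n_nodes) if j != i]
        neighbors.sort(reverse=True)  # highest similarity first
        # Keep only the top-N outgoing edges (the set A_i)
        edges[i] = {j: w for w, j in neighbors[:top_n]}
    return edges
```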

Parameters from Topic Model
Probabilistic latent semantic analysis (PLSA) [11] has been widely used to analyze the semantics of documents based on a set of latent topics. Given a set of documents {d_j, j = 1, 2, ..., J} and all terms {t_i, i = 1, 2, ..., M} they include, PLSA uses a set of latent topic variables, {T_k, k = 1, 2, ..., K}, to characterize the "term-document" co-occurrence relationships. The PLSA model can be optimized with the EM algorithm by maximizing a likelihood function [11]. We utilize two parameters derived from PLSA, latent topic significance (LTS) and latent topic entropy (LTE) [12]. These parameters can also be computed in a similar way from other topic models such as latent Dirichlet allocation (LDA) [13].
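A compact EM sketch for PLSA on a toy term-document count matrix is shown below; this is an illustration of the standard EM update for the "term-document" co-occurrence model, not the paper's implementation:

```python
import numpy as np

# Toy EM for PLSA on a count matrix n (M terms x J documents), estimating
# P(t_i | T_k) and P(T_k | d_j).
def plsa(n, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    M, J = n.shape
    p_t_T = rng.random((M, K)); p_t_T /= p_t_T.sum(0)  # P(t_i | T_k)
    p_T_d = rng.random((K, J)); p_T_d /= p_T_d.sum(0)  # P(T_k | d_j)
    for _ in range(iters):
        # E-step: posterior P(T_k | t_i, d_j) for every (term, document) pair
        post = p_t_T[:, :, None] * p_T_d[None, :, :]   # shape (M, K, J)
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts
        nk = n[:, None, :] * post                      # expected counts
        p_t_T = nk.sum(2); p_t_T /= p_t_T.sum(0) + 1e-12
        p_T_d = nk.sum(0); p_T_d /= p_T_d.sum(0) + 1e-12
    return p_t_T, p_T_d
```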
Latent Topic Significance (LTS) for a given term t_i with respect to a topic T_k can be defined as

    LTS_{t_i}(T_k) = [Σ_{d_j} n(t_i, d_j) P(T_k | d_j)] / [Σ_{d_j} n(t_i, d_j) (1 − P(T_k | d_j))],    (2)

where n(t_i, d_j) is the occurrence count of term t_i in document d_j. A higher LTS_{t_i}(T_k) thus indicates that the term t_i is more significant for the latent topic T_k. Latent Topic Entropy (LTE) for a given term t_i can be calculated from the topic distribution as

    LTE(t_i) = − Σ_{k=1}^{K} P(T_k | t_i) log P(T_k | t_i),    (3)

where the topic distribution P(T_k | t_i) can be estimated from PLSA. LTE(t_i) measures how strongly the term t_i is focused on a few topics, so a lower latent topic entropy implies that the term carries more topical information.
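The two measures can be sketched as follows; the input names are assumptions (counts[j] = n(t_i, d_j), p_T_d[k][j] = P(T_k | d_j), and p_T_t[k] = P(T_k | t_i) for the term in question):

```python
import math

# Latent topic significance of one term for topic index k: count-weighted
# topic probability over its count-weighted complement.
def lts(counts, p_T_d, k):
    num = sum(c * p_T_d[k][j] for j, c in enumerate(counts))
    den = sum(c * (1.0 - p_T_d[k][j]) for j, c in enumerate(counts))
    return num / den

# Latent topic entropy of one term: entropy of its topic distribution.
def lte(p_T_t):
    return -sum(p * math.log(p) for p in p_T_t if p > 0)
```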

Statistical Measures of a Term
Here the statistical measure of a term t_i, s(t_i, d) in (1), can be defined based on LTE(t_i) in (3) as

    s(t_i, d) = γ · n(t_i, d) / LTE(t_i),    (4)

where n(t_i, d) is the occurrence count of t_i in d and γ is a scaling factor such that 0 ≤ s(t_i, d) ≤ 1, so the score s(t_i, d) is inversely proportional to the latent topic entropy LTE(t_i). Previous work [12] showed that this measure outperformed the very successful "significance score" [9] in speech summarization, and we use the LTE-based statistical measure s(t_i, d) as the baseline here.
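Under the assumption that the measure takes the form γ · n(t_i, d) / LTE(t_i), a one-line sketch:

```python
# LTE-based statistical measure sketch: terms focused on few topics (low
# entropy) receive high scores; gamma is the scaling factor from the text.
def stat_measure(count_in_d, lte_value, gamma):
    return gamma * count_in_d / lte_value
```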

Topical Similarity between Utterances
Within a document d, we first compute the probability that the topic T_k is addressed by an utterance U_i,

    P(T_k | U_i) = Σ_{t ∈ U_i} n(t, U_i) P(T_k | t) / Σ_{t ∈ U_i} n(t, U_i).    (5)

An asymmetric topical similarity Sim(U_i, U_j) for utterances U_i to U_j (with direction U_i → U_j) can then be defined by accumulating LTS_t(T_k) in (2), weighted by P(T_k | U_i), over all terms t in U_j and all latent topics:

    Sim(U_i, U_j) = Σ_{t ∈ U_j} Σ_{k=1}^{K} P(T_k | U_i) LTS_t(T_k).    (6)

The idea is similar to the generative probability in information retrieval; we call this the generative significance of U_i given U_j.
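The accumulation can be sketched as below, assuming precomputed inputs (p_T_Ui[k] = P(T_k | U_i), and lts_table[t][k] = LTS_t(T_k)):

```python
# Asymmetric topical similarity Sim(U_i, U_j): accumulate each topic's
# significance for every term of U_j, weighted by U_i's topic distribution.
def topical_sim(terms_j, p_T_Ui, lts_table):
    return sum(p_T_Ui[k] * lts_table[t][k]
               for t in terms_j
               for k in range(len(p_T_Ui)))
```

Note the asymmetry: the topic distribution comes from U_i while the terms come from U_j, so in general topical_sim(U_i, U_j) ≠ topical_sim(U_j, U_i).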

Intra/Inter-Speaker Topic Modeling
We additionally consider speaker information to model topics more accurately, extending the similarity in (6) to

    Sim'(U_i, U_j) = Sim(U_i, U_j) + w_intra(U_i, U_j) + w_inter(U_i, U_j),    (7)

where w_intra is the topic sharing weight for intra-speaker topic sharing and w_inter is the one for inter-speaker topic sharing, described in Sections 2.4.1 and 2.4.2 respectively.

Intra-Speaker Topic Sharing Weight
We assume that utterances from the same speaker in a dialogue usually focus on similar topics, so that if an utterance is important, the other utterances from the same speaker are more likely to be important as well [5]. We can then estimate Sim'(U_i, U_j) by setting w_intra(U_i, U_j) as

    w_intra(U_i, U_j) = δ,  if U_i, U_j ∈ S_k for some speaker k,    (8)
    w_intra(U_i, U_j) = 0,  otherwise,    (9)

where S_k is the set of all utterances from speaker k and δ is a weighting parameter for modeling the speaker relation. In this way the topics from the same speaker can be partially shared.
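Assuming the piecewise form above (a constant boost δ for same-speaker pairs; the `speaker` list mapping utterance index to speaker ID is a hypothetical input):

```python
# Intra-speaker topic sharing weight sketch: boost pairs of utterances
# produced by the same speaker by a constant delta, otherwise contribute 0.
def w_intra(speaker, i, j, delta):
    return delta if speaker[i] == speaker[j] else 0.0
```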

Inter-Speaker Topic Sharing Weight
Topic transition between adjacent utterances is usually slow, so adjacent utterances should have similar topic distributions [14] even when they are not from the same speaker; we can therefore increase Sim'(U_i, U_j) when U_i and U_j are closer to each other in the dialogue. Thus, we compute the weight for inter-speaker topic sharing as

    w_inter(U_i, U_j) = exp(−(l_i − l_j)^2 / (2σ^2)),    (10)

where l_i is the position of utterance U_i in the dialogue, i.e., U_i is the l_i-th utterance; utterance boundaries are decided by SmartNotes [4]. (10) assumes that topic sharing follows a normal distribution with standard deviation σ: the smaller |l_i − l_j| is, the closer U_i and U_j are to each other and the more likely they are to share topics, so w_inter(U_i, U_j) in (10) is larger. σ is a parameter controlling the topic sharing range, which can be tuned on the dev set. We normalize the similarity over the top N utterances U_k with edges outgoing from U_i (the set A_i) to produce the weight p(i, j) for the edge from U_i to U_j on the graph:

    p(i, j) = Sim'(U_i, U_j) / Σ_{U_k ∈ A_i} Sim'(U_i, U_k).    (11)
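The Gaussian weight in (10) and the outgoing-edge normalization can be sketched as follows; `sims` maps each kept neighbor j in A_i to Sim'(U_i, U_j) (an assumed precomputed input):

```python
import math

# Inter-speaker topic sharing weight: a Gaussian decay in the distance
# between utterance positions l_i and l_j, with sharing range sigma.
def w_inter(l_i, l_j, sigma):
    return math.exp(-((l_i - l_j) ** 2) / (2 * sigma ** 2))

# Normalize the similarities of node i's kept outgoing edges (the set A_i)
# so the edge weights p(i, j) sum to one.
def edge_weights(sims):
    total = sum(sims.values())
    return {j: s / total for j, s in sims.items()}
```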

Random Walk
We use random walk [3,15] to integrate the two types of scores over the graph obtained above. The new score v(i) for node U_i is the interpolation of two scores: the normalized initial importance r(i) of node U_i, and the score contributed by all neighboring nodes U_j of U_i weighted by p(j, i):

    v(i) = (1 − α) r(i) + α Σ_{U_j ∈ B_i} p(j, i) v(j),    (12)

where α is the interpolation weight, B_i is the set of neighbors connected to node U_i via incoming edges, and r(i) is the normalized importance score I(U_i, d) of utterance U_i in (1). (12) can be solved iteratively with an approach very similar to that for the PageRank problem [16]. Let v = [v(i), i = 1, 2, ..., L]^T and r = [r(i), i = 1, 2, ..., L]^T be the column vectors of v(i) and r(i) for all utterances in the document, where L is the total number of utterances in document d and T denotes transpose. (12) then has the vector form

    v = (1 − α) r + α P^T v = ((1 − α) r e^T + α P^T) v = P' v,    (13)

where P is the L × L matrix whose (j, i) entry is p(j, i), and e = [1, 1, ..., 1]^T; the second equality holds because Σ_i v(i) = 1 from (12), so e^T v = 1. It has been shown that the closed-form solution v of (13) is the dominant eigenvector of P' [17], i.e., the eigenvector corresponding to the largest absolute eigenvalue of P'. The solution v(i) can then be obtained.
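The iterative solution of (12) can be sketched as a power iteration; P is row-stochastic over the kept edges and r holds the normalized initial scores (both assumed precomputed):

```python
import numpy as np

# Power-iteration sketch of the random walk: repeatedly apply
# v <- (1 - alpha) * r + alpha * P^T v until convergence.
def random_walk(P, r, alpha=0.9, iters=100):
    v = r.copy()
    for _ in range(iters):
        v = (1 - alpha) * r + alpha * P.T @ v
    return v
```

Because each row of P sums to one and r sums to one, the total mass Σ_i v(i) = 1 is preserved at every iteration, matching the normalization used in the closed-form solution.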

Corpus
The corpus used in this research is a sequence of natural meetings, which featured largely overlapping participant sets and topics of discussion. For each meeting, SmartNotes [4] was used to record both the audio from each participant and his notes. The meetings were transcribed both manually and with a speech recognizer; the word error rate is around 44%. In this paper we use 10 meetings held from April to June of 2006. On average, each meeting contained about 28 minutes of speech. Across these 10 meetings there were 6 unique participants; each meeting featured between 2 and 4 of these participants (average: 3.7). The total number of utterances is 9837 across the 10 meetings. We separate the data into a dev set (2 meetings) and a test set (8 meetings); the dev set is used to tune parameters such as α, σ, and δ.
The reference summaries are given by the set of noteworthy utterances. Two annotators manually labelled the degree of "noteworthiness" (three levels) of each utterance, and we extract the utterances with the top level of "noteworthiness" to form the reference summary of each meeting. In the following experiments, for each meeting we extract utterances amounting to the top 30% of terms as the summary.

Evaluation Metrics
Automated evaluation utilizes the standard DUC evaluation metric ROUGE [18], which measures recall over various n-gram statistics of a system-generated summary against a set of human-generated reference summaries. F-measures for ROUGE-1 (unigram) and ROUGE-L (longest common subsequence) are evaluated in the same way and are used in the following results.
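For illustration, a minimal ROUGE-1 F-measure against a single reference can be sketched as clipped unigram overlap (real evaluations use the ROUGE toolkit with multiple references):

```python
from collections import Counter

# Minimal ROUGE-1 F-measure sketch: clipped unigram overlap between a
# system summary and one reference summary.
def rouge1_f(system_tokens, reference_tokens):
    sys_c, ref_c = Counter(system_tokens), Counter(reference_tokens)
    # Each system unigram is credited at most as often as it occurs in the reference
    overlap = sum(min(c, ref_c[t]) for t, c in sys_c.items())
    p = overlap / max(sum(sys_c.values()), 1)  # precision
    r = overlap / max(sum(ref_c.values()), 1)  # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```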

Results
Table 1 shows the performance achieved by all proposed approaches. Row (a) is the baseline, which uses the LTE-based statistical measure to compute the importance of utterances, I(U, d). Row (b) is the result after applying random walk with only topical similarity. Row (c) additionally includes intra-speaker topic modeling (w_intra ≠ 0, w_inter = 0); row (d) includes inter-speaker topic modeling (w_inter ≠ 0, w_intra = 0). Row (e) is the result of integrating the two types of speaker information (w_intra ≠ 0 and w_inter ≠ 0).
Note that the performance on ASR output is better than on manual transcripts. Because a higher percentage of recognition errors falls on "unimportant" words, utterances containing more errors are less likely to receive high scores, so they tend to be excluded, yielding better summarization results. Some recent works also report better performance on ASR output than on manual transcripts [3,5].

Graph-Based Approach
We can see that the performance after graph-based recomputation (row (b)) is significantly better than the baseline (row (a)) for both ASR and manual transcripts; the improvement for ASR is larger than for manual transcripts.

Effectiveness of Speaker Information Modeling
We find that modeling intra-speaker topics improves the performance (rows (b) and (c)), which means speaker information is useful for modeling topical similarity. The experiments show that intra-speaker modeling helps include the important utterances for both ASR and manual transcripts. We also find that modeling inter-speaker topics alone cannot offer significant improvement for ASR transcripts (rows (b) and (d)), probably because sharing topics with adjacent utterances may decrease centrality, especially for utterances with recognition errors. For manual transcripts, the improvement from inter-speaker topic modeling alone is likewise not significant.
Row (e) is the result of the proposed approach, which integrates intra-speaker and inter-speaker topic modeling into a single graph, considering the two types of relations together. For ASR transcripts, row (e) is better than rows (c) and (d), which means intra-speaker and inter-speaker information cover different types of relations, and their effects are additive. Note that inter-speaker topic modeling alone cannot improve the performance, but integrating it with intra-speaker topic modeling offers better results. The reason may be that intra-speaker topic modeling enhances the centrality of important utterances, while additionally involving inter-speaker topic modeling slightly decreases centrality but successfully smooths topic transitions between adjacent utterances. For manual transcripts, row (e) also performs better by combining the two types of speaker information, and the improvement is larger than for ASR transcripts. Since topical similarity can model the relations accurately in the absence of recognition errors, integrating the two types of speaker information can effectively improve the performance.
In addition, Banerjee and Rudnicky [4] used supervised learning to detect noteworthy utterances in the same corpus, achieving 43% (ASR) and 47% (manual) ROUGE-1. Compared to that, our unsupervised approach performs better, especially for ASR transcripts.

Conclusions
Extensive experiments and evaluation with ROUGE metrics showed that inter- and intra-speaker topics can be modeled together in a single graph, and that random walk can combine the advantages of the two types of speaker information for both ASR and manual transcripts, achieving more than 6% relative improvement.

Figure 1: A simplified example of the graph considered.

Table 1: Maximum relative improvement (RI) with respect to the baseline for all proposed approaches (%).