Two-layer mutually reinforced random walk for improved multi-party meeting summarization

This paper proposes an improved summarization approach for spoken multi-party interaction, in which a two-layer graph with utterance-to-utterance, speaker-to-speaker, and speaker-to-utterance relations is constructed. Each utterance and each speaker are represented as a node in the utterance-layer and speaker-layer of the graph respectively, and the edge between two nodes is weighted by the similarity between the two utterances, the two speakers, or the utterance and the speaker. The relation between utterances is evaluated by lexical similarity via word overlap or by topical similarity via probabilistic latent semantic analysis (PLSA). Through within- and between-layer propagation in the graph, the scores from different layers are mutually reinforced, so that utterances automatically share scores with utterances from the same speaker and with similar utterances. For both ASR output and manual transcripts, experiments confirmed the efficacy of involving speaker information in the two-layer graph for summarization.


INTRODUCTION
Speech summarization is important [1] for spoken and multimedia documents, which are more difficult to browse than text, and it has therefore been investigated in the past. While most work focused primarily on news content, recent effort has increasingly been directed towards new domains such as lectures [2,3] and multi-party interaction [4,5,6]. In this work, we perform extractive summarization on the output of automatic speech recognition (ASR) and the corresponding manual transcripts [7] of multi-party "meeting" recordings.
Many approaches to text summarization focus on graph-based methods that compute the lexical centrality of each utterance in order to extract summaries [8,9]. Speech summarization carries intrinsic difficulties due to recognition errors, spontaneous speech effects, and the lack of sentence segmentation. A general approach has been found to be very successful [10], in which each utterance U = t_1 t_2 ... t_i ... t_n in the document d, represented as a sequence of terms t_i, is given an importance score

I(U, d) = \frac{1}{n} \sum_{i=1}^{n} [\lambda_1 s(t_i, d) + \lambda_2 l(t_i) + \lambda_3 c(t_i) + \lambda_4 g(t_i)] + \lambda_5 b(U),   (1)

where s(t_i, d), l(t_i), c(t_i), and g(t_i) respectively are a statistical measure (such as TF-IDF), a linguistic measure (e.g., different part-of-speech tags are given different weights), a confidence score, and an N-gram score for the term t_i; b(U) is calculated from the grammatical structure of the utterance U, and \lambda_1, \lambda_2, \lambda_3, \lambda_4, and \lambda_5 are weighting parameters. For each document, the utterances to be used in the summary are then selected based on this score.
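The weighted combination in (1) can be sketched in a few lines. In this minimal sketch the term-level measures s(t_i, d), l(t_i), c(t_i), and g(t_i) are supplied as plain dictionaries; the function name, the dictionary interface, and the default uniform weights are illustrative, not from the cited work.

```python
def importance_score(utterance_terms, stat, ling, conf, ngram, b_U,
                     lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Importance score I(U, d) of equation (1) for U = t_1 ... t_n.

    stat, ling, conf, ngram map each term t_i to s(t_i, d), l(t_i),
    c(t_i), and g(t_i); b_U is the grammatical-structure score of the
    whole utterance U. The per-term scores are combined with weights
    lambda_1..lambda_4, averaged over the n terms, then lambda_5 * b(U)
    is added.
    """
    l1, l2, l3, l4, l5 = lambdas
    n = len(utterance_terms)
    per_term = sum(l1 * stat[t] + l2 * ling[t] + l3 * conf[t] + l4 * ngram[t]
                   for t in utterance_terms)
    return per_term / n + l5 * b_U
```

The summarizer then ranks the utterances of each document by this score and keeps the top ones up to the cut-off ratio.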
In recent work, we proposed a graphical structure to rescore I(U, d) in (1) above, which models the topical coherence between utterances using a random-walk process within documents [3,5]. Unlike lecture and news summarization, meeting recordings contain spoken multi-party interaction, so speaker "importance" scores can be added to the estimation of the importance of individual utterances [11]. This paper therefore proposes a two-layer mutually reinforced random walk that computes speaker importance and increases the scores of utterances similar to the utterances from important speakers. It models intra- and inter-speaker topics together in the two-layer graph by automatically propagating scores to utterances from the same speaker or to similar utterances, in order to improve meeting summarization [9,12]. Section 2 describes the construction of the two-layer graph and the algorithms for computing utterance importance, which include between-layer propagation and the integration of within- and between-layer propagation. Section 3 presents the results of the proposed approaches and discusses the differences between the two algorithms and between topical and lexical similarity for both ASR and manual transcripts. Section 4 concludes.

PROPOSED APPROACH
In this paper, we use ASR output and manual transcripts and preprocess both types of text in the same way: we apply word stemming and noise-utterance filtering to the utterances in all meetings, removing utterances with fewer than 3 words. For extractive summarization, we set a cut-off ratio to retain only the most important utterances as the summary of each document, based on the "importance" of each utterance. We thus formulate the utterance selection problem as computing the importance of each utterance. We then construct a two-layer graph to compute the importance of all utterances and speakers in the utterance-layer and speaker-layer respectively. In this two-layer directed graph, each utterance is a node in the utterance-layer, and the edges between utterance nodes are weighted by the topical or lexical similarity described in Section 2.3. Each speaker in the meeting is a node in the speaker-layer, and the edges between speaker nodes are weighted by the speaker-to-speaker relation. The edges between the two layers are weighted by the relation between speakers and utterances.
The basic idea is that an utterance similar to more important utterances should be more important [3,13], so the importance of each utterance considers the scores propagated from other utterances according to the similarity between them. In this approach, the propagated scores are not only based on utterance-to-utterance relation. Instead, the scores integrate three types of relations (utterance-to-utterance, speaker-to-speaker, and utterance-to-speaker) to automatically consider speaker information in the graph. Figure 1 shows a simplified example for such a two-layer graph, in which there are speaker-layer and utterance-layer containing speaker nodes and utterance nodes respectively.

Parameters from Topic Model
Probabilistic latent semantic analysis (PLSA) [14] has been widely used to analyze the semantics of documents based on a set of latent topics. Given a set of documents {d_j, j = 1, 2, ..., J} and all terms {t_i, i = 1, 2, ..., M} they include, PLSA uses a set of latent topic variables, {T_k, k = 1, 2, ..., K}, to characterize the "term-document" co-occurrence relationships. The PLSA model can be optimized using the EM algorithm by maximizing a likelihood function [14]. We utilize two parameters from PLSA, latent topic significance (LTS) and latent topic entropy (LTE) [15]. These parameters can also be computed in a similar way by other topic models, such as latent Dirichlet allocation (LDA) [16].
Latent topic significance (LTS) for a given term t_i with respect to a topic T_k can be defined as

LTS_{t_i}(T_k) = \frac{\sum_{d_j} n(t_i, d_j) P(T_k | d_j)}{\sum_{d_j} n(t_i, d_j) [1 - P(T_k | d_j)]},   (2)

where n(t_i, d_j) is the number of occurrences of the term t_i in the document d_j. Thus, a higher LTS_{t_i}(T_k) indicates that the term t_i is more significant for the latent topic T_k. Latent topic entropy (LTE) for a given term t_i can be calculated from the topic distribution P(T_k | t_i) estimated by PLSA:

LTE(t_i) = -\sum_{k=1}^{K} P(T_k | t_i) \log P(T_k | t_i).   (3)

LTE(t_i) measures how much the term t_i is focused on a few topics, so a lower latent topic entropy implies the term carries more topical information.

[Fig. 1. A simplified example of the two-layer graph considered, where a speaker S_i is represented as a node in the speaker-layer and an utterance U_j is represented as a node in the utterance-layer. There are three different types of edges corresponding to the different relations (utterance-to-utterance, speaker-to-speaker, and utterance-to-speaker). Note that each utterance node has edges to all speaker nodes, not only the node of its own speaker.]

Statistical Measures of a Term
The statistical measure of a term t_i, s(t_i, d) in (1), measures the importance of t_i in the document d, e.g., by TF-IDF. In this work, it is defined based on LTE(t_i) in (3) as

s(t_i, d) = \frac{\gamma}{LTE(t_i)},   (4)

where \gamma is a scaling factor such that s(t_i, d) lies within the interval [0, 1]; the score s(t_i, d) is thus inversely proportional to the latent topic entropy LTE(t_i). This measure outperformed the very successful "significance score" [15,10] in speech summarization, so we use the LTE-based statistical measure s(t_i, d) as our baseline.
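Equations (3) and (4) are straightforward to compute once the topic posteriors P(T_k | t_i) are available. A minimal sketch, where `topic_dist` stands for the vector [P(T_1|t_i), ..., P(T_K|t_i)] (the function names are illustrative):

```python
import math

def latent_topic_entropy(topic_dist):
    """LTE(t_i) = -sum_k P(T_k|t_i) log P(T_k|t_i), as in equation (3).

    Zero-probability topics contribute nothing (lim p->0 of p log p = 0).
    """
    return -sum(p * math.log(p) for p in topic_dist if p > 0)

def lte_term_score(topic_dist, gamma=1.0):
    """LTE-based statistical measure s(t_i, d) = gamma / LTE(t_i) of (4):
    terms concentrated on few topics (low entropy) receive high scores.
    gamma is the scaling factor keeping scores in [0, 1]."""
    return gamma / latent_topic_entropy(topic_dist)
```

For example, a term spread uniformly over two topics has LTE = log 2, while a term concentrated on one topic has LTE near 0 and therefore a much higher score.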

Similarity between Utterances
We compute two different types of similarity between utterances based on topical and lexical distribution.

Topical Similarity via PLSA
Within a document d, we can first compute the probability that the topic T_k is addressed by an utterance U_i,

P(T_k | U_i) = \frac{\sum_{t \in U_i} n(t, U_i) P(T_k | t)}{\sum_{t \in U_i} n(t, U_i)},   (5)

where n(t, U_i) is the number of occurrences of the term t in U_i. An asymmetric topical similarity from U_i to U_j is then defined as

TopicSim(U_i, U_j) = \sum_{t \in U_j} \sum_{k=1}^{K} P(T_k | U_i) \, LTS_t(T_k),   (6)

where the idea is similar to the generative probability in information retrieval. We call it the generative significance of U_i given U_j.
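The topic posterior of an utterance and a generative-probability-style asymmetric score can be sketched as follows. This is only one plausible instantiation: the count-weighted averaging of term posteriors follows (5), but the asymmetric score below uses the log generative probability of U_j's terms under U_i's topic mixture (query-likelihood style, with `topic_term[k][t]` standing for P(t | T_k)) rather than the exact LTS-based form; all names are illustrative.

```python
import math
from collections import Counter

def topic_posterior(utt_terms, term_topic):
    """P(T_k | U_i): count-weighted average of the terms' topic
    posteriors, term_topic[t][k] = P(T_k | t), as in equation (5)."""
    counts = Counter(utt_terms)
    K = len(next(iter(term_topic.values())))
    total = sum(counts.values())
    return [sum(c * term_topic[t][k] for t, c in counts.items()) / total
            for k in range(K)]

def generative_similarity(utt_i, utt_j, term_topic, topic_term):
    """Asymmetric score: log probability of generating U_j's terms from
    U_i's topic mixture (topic_term[k][t] = P(t | T_k)); analogous to
    the query-likelihood generative probability in IR."""
    mix = topic_posterior(utt_i, term_topic)
    score = 0.0
    for t in utt_j:
        p = sum(mix[k] * topic_term[k].get(t, 0.0) for k in range(len(mix)))
        score += math.log(max(p, 1e-12))  # floor avoids log(0)
    return score
```

The score is asymmetric by construction: generating U_j from U_i's mixture generally differs from the reverse direction.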

Lexical Similarity via Word Overlap
Within a document d, the lexical similarity is a measure of word overlap between the utterances U_i and U_j. We compute LexSim(U_i, U_j) as the cosine similarity between the TF-IDF vectors of U_i and U_j, as in the well-known LexRank [8]. Note that LexSim(U_i, U_j) = LexSim(U_j, U_i).
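The TF-IDF cosine similarity can be sketched as below; `idf` maps each term to its inverse document frequency (the helper names are illustrative).

```python
import math
from collections import Counter

def tfidf_vector(terms, idf):
    """Sparse TF-IDF vector of an utterance as a dict term -> tf * idf."""
    tf = Counter(terms)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def lex_sim(utt_i, utt_j, idf):
    """Symmetric cosine similarity between TF-IDF vectors, as in LexRank."""
    vi, vj = tfidf_vector(utt_i, idf), tfidf_vector(utt_j, idf)
    dot = sum(w * vj.get(t, 0.0) for t, w in vi.items())
    ni = math.sqrt(sum(w * w for w in vi.values()))
    nj = math.sqrt(sum(w * w for w in vj.values()))
    return dot / (ni * nj) if ni and nj else 0.0
```

Identical utterances score 1.0, utterances with no shared terms score 0.0, and the measure is symmetric by construction.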

Two-Layer Mutually Reinforced Random Walk
For each document d, we construct a linked two-layer graph G containing utterance set and speaker set to compute the importance of each utterance.
The graph G = (V, E) consists of the node set V = V_U ∪ V_S and the edge set E = E_UU ∪ E_SS ∪ E_US, where E_UU, E_SS, and E_US correspond to the relation between utterances, the relation between speakers, and the relation between utterances and speakers respectively [9]. We compute L_UU = [w_{U_i,U_j}]_{|V_U|×|V_U|}, where w_{U_i,U_j} is the topical or lexical similarity from Section 2.3 (word overlap between utterances may be sparse due to recognition errors, so topical similarity via PLSA can possibly capture more information than lexical similarity). We compute L_SS = [w_{S_i,S_j}]_{|V_S|×|V_S|}, where w_{S_i,S_j} is the cosine similarity between the TF-IDF vectors containing all utterances from speakers S_i and S_j; that is, a speaker node in the graph is represented by all utterances from that speaker. Similarly, L_US = [w_{U_i,S_j}]_{|V_U|×|V_S|}, where w_{U_i,S_j} is the TF-IDF cosine similarity between the utterance vector and the speaker vector. Note that w_{U_i,S_j} can be positive even when U_i is not spoken by S_j, because the utterance and the speaker may share terms. Row-normalization is performed on L_UU, L_SS, L_US, and L_SU [17]; they can be viewed as utterance-to-utterance, speaker-to-speaker, and utterance-to-speaker affinity matrices. Note that L_US is different from L_SU^T because of the row-normalization.
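The shared preprocessing step for all four matrices is row-normalization, a minimal sketch of which is:

```python
import numpy as np

def row_normalize(M):
    """Row-normalize an affinity matrix so that each row sums to 1.

    Rows that are entirely zero are left unchanged rather than divided
    by zero. This is why L_US generally differs from L_SU^T: the two
    matrices share raw weights (transposed) but are normalized along
    different axes.
    """
    M = np.asarray(M, dtype=float)
    sums = M.sum(axis=1, keepdims=True)
    sums[sums == 0.0] = 1.0
    return M / sums

# Illustrative shapes for a meeting with |V_U| utterances, |V_S| speakers:
#   L_UU: |V_U| x |V_U|  utterance-to-utterance (topical or lexical)
#   L_SS: |V_S| x |V_S|  speaker-to-speaker (cosine over speaker vectors)
#   L_US: |V_U| x |V_S|, L_SU: |V_S| x |V_U|  cross-layer affinities
```
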
A traditional random walk integrates the original scores with the scores propagated from other utterance nodes [3,11,18]. The proposed approach additionally considers the speaker information and integrates importance propagated from speaker nodes, modeling intra- and inter-speaker relations automatically. We propose two algorithms, one using scores only from between-layer propagation and the other using scores from both within- and between-layer propagation, described in Sections 2.4.1 and 2.4.2 respectively.

Between-Layer Propagation
Here we use the two-layer mutually reinforced random walk to propagate scores based on external mutual reinforcement between the different layers through the edges E_US.
Let F_U^{(t)} and F_S^{(t)} denote the importance scores of the utterance set V_U and the speaker set V_S in the t-th iteration respectively. In the algorithm, they are interpolations of two scores, the initial importance (F_U^{(0)} and F_S^{(0)}) and the scores propagated from the other layer:

F_U^{(t+1)} = (1 - \alpha) F_U^{(0)} + \alpha L_US F_S^{(t)},
F_S^{(t+1)} = (1 - \alpha) F_S^{(0)} + \alpha L_SU F_U^{(t)}.   (7)

For the utterance set, each utterance combines its initial importance with the scores propagated from the speaker-layer, weighted by the utterance-to-speaker similarity. Similarly, the nodes of the speaker-layer also include the scores propagated from the utterance-layer. F_U^{(t+1)} and F_S^{(t+1)} can thus be mutually updated by the latter parts of (7) iteratively.
The algorithm converges, and the stationary scores then satisfy

F_U^* = (1 - \alpha) F_U^{(0)} + \alpha L_US F_S^*,
F_S^* = (1 - \alpha) F_S^{(0)} + \alpha L_SU F_U^*.   (8)
We can solve for F_U^* as below. Substituting the second line of (8) into the first gives

F_U^* = (1 - \alpha) F_U^{(0)} + \alpha (1 - \alpha) L_US F_S^{(0)} + \alpha^2 L_US L_SU F_U^*.

Since e^T F_U^* = 1, where e = [1, 1, ..., 1]^T, this can be rewritten as

F_U^* = [(1 - \alpha) F_U^{(0)} e^T + \alpha (1 - \alpha) L_US F_S^{(0)} e^T + \alpha^2 L_US L_SU] F_U^* = M_1 F_U^*.   (9)

It has been shown that the closed-form solution F_U^* of (9) is the dominant eigenvector of M_1 [19], i.e., the eigenvector corresponding to the largest absolute eigenvalue of M_1. The solution F_U^* gives the updated importance scores for all utterances. As in the PageRank algorithm [20], the solution can also be obtained by iteratively updating F_U^{(t)} and F_S^{(t)} with (7) until convergence.
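The iterative route to the stationary scores can be sketched in numpy; the interface (score vectors and cross-layer matrices passed as arrays, tolerance-based stopping) is illustrative.

```python
import numpy as np

def between_layer_propagation(F_U0, F_S0, L_US, L_SU, alpha=0.9,
                              tol=1e-10, max_iter=1000):
    """Iterate the between-layer updates of (7): each layer interpolates
    its initial scores with scores propagated from the other layer
    through the row-normalized cross-layer matrices L_US and L_SU."""
    F_U, F_S = F_U0.astype(float), F_S0.astype(float)
    for _ in range(max_iter):
        F_U_new = (1 - alpha) * F_U0 + alpha * L_US @ F_S
        F_S_new = (1 - alpha) * F_S0 + alpha * L_SU @ F_U
        if (np.abs(F_U_new - F_U).sum() < tol
                and np.abs(F_S_new - F_S).sum() < tol):
            return F_U_new, F_S_new
        F_U, F_S = F_U_new, F_S_new
    return F_U, F_S
```

At convergence the returned vectors satisfy the fixed-point equations (8) up to the chosen tolerance; equivalently, one could extract the dominant eigenvector of M_1 directly.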

Integrating Within- and Between-Layer Propagation
Here we use a two-layer mutually reinforced random walk to propagate scores based on both internal importance propagation within the same layer and external mutual reinforcement between the different layers. Each of F_U^{(t+1)} and F_S^{(t+1)} integrates the initial importance with a score that includes within- and between-layer propagation:

F_U^{(t+1)} = (1 - \alpha) F_U^{(0)} + \alpha L_UU L_US F_S^{(t)},
F_S^{(t+1)} = (1 - \alpha) F_S^{(0)} + \alpha L_SS L_SU F_U^{(t)}.   (10)
For the utterance set, L_US F_S^{(t)} is the score propagated from the speaker set weighted by utterance-to-speaker similarity, and these scores are then further propagated along the utterance-to-utterance similarity by L_UU. Compared to Section 2.4.1, the algorithm additionally considers the within-layer relations through L_UU and L_SS. F_U^{(t+1)} can then be updated by the latter part of (10), and likewise F_S^{(t+1)}. Similarly, the algorithm converges to stationary scores satisfying

F_U^* = (1 - \alpha) F_U^{(0)} + \alpha L_UU L_US F_S^*,
F_S^* = (1 - \alpha) F_S^{(0)} + \alpha L_SS L_SU F_U^*.   (11)
Analogously to (9), the closed-form solution F_U^* of (11) is the dominant eigenvector of the corresponding matrix M_2 [19].
For both algorithms, we set F_U^{(0)} to the baseline scores I(U, d) from (1), normalized so that they sum to 1, and F_S^{(0)} = e / |V_S|, which means we assume all speakers in the document have equal importance.
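The second algorithm differs from the first only in the extra within-layer multiplications; a matching numpy sketch, with the same illustrative interface and the initialization described above (uniform speaker scores, normalized baseline utterance scores):

```python
import numpy as np

def within_between_propagation(F_U0, F_S0, L_UU, L_SS, L_US, L_SU,
                               alpha=0.9, tol=1e-10, max_iter=1000):
    """Iterate the updates of (10): scores arriving from the other layer
    (L_US F_S and L_SU F_U) are additionally spread along within-layer
    edges by L_UU and L_SS before interpolation with the initial scores."""
    F_U, F_S = F_U0.astype(float), F_S0.astype(float)
    for _ in range(max_iter):
        F_U_new = (1 - alpha) * F_U0 + alpha * L_UU @ (L_US @ F_S)
        F_S_new = (1 - alpha) * F_S0 + alpha * L_SS @ (L_SU @ F_U)
        if (np.abs(F_U_new - F_U).sum() < tol
                and np.abs(F_S_new - F_S).sum() < tol):
            return F_U_new, F_S_new
        F_U, F_S = F_U_new, F_S_new
    return F_U, F_S

# Initialization as in the text: F_U0 from the normalized baseline I(U, d),
# F_S0 = e / |V_S| (uniform over speakers).
```
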

EXPERIMENTS

Corpus
The corpus used in this research is a sequence of natural meetings featuring largely overlapping participant sets and topics of discussion. For each meeting, SmartNotes [4] was used to record both the audio from each participant and that participant's notes. The meetings were transcribed both manually and with a speech recognizer; the word error rate is around 44%. In this paper we use 10 meetings held from April to June of 2006. On average, each meeting contained about 28 minutes of speech. Across these 10 meetings there were 6 unique participants; each meeting featured between 2 and 4 of these participants (average: 3.7). The total number of utterances is 9,837 across the 10 meetings. We empirically set α = 0.9 for all unsupervised experiments because (1 − α) = 0.1 is a proper damping factor [20,18].
The reference summaries are given by the set of "noteworthy utterances": two annotators manually labelled the degree of "noteworthiness" of each utterance on three levels, and we extract the utterances with the highest level of "noteworthiness" to form the summary of each meeting. In the following experiments, for each meeting we extract about 30% of the number of terms as the summary.
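Given the importance scores, the extraction step simply keeps the highest-scoring utterances until the term budget is reached. A minimal sketch of this cut-off-ratio selection (the greedy interface is illustrative):

```python
def extract_summary(utterances, scores, ratio=0.3):
    """Pick the highest-scoring utterances until the summary reaches
    `ratio` of the document's total term count; return their indices
    in document order."""
    budget = ratio * sum(len(u) for u in utterances)
    picked, used = [], 0
    for idx in sorted(range(len(utterances)), key=lambda i: -scores[i]):
        if used >= budget:
            break
        picked.append(idx)
        used += len(utterances[idx])
    return sorted(picked)
```
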

Evaluation Metrics
Our automated evaluation utilizes the standard DUC (Document Understanding Conference) evaluation metric, ROUGE [21], which measures recall over various n-gram statistics between a system-generated summary and a set of human-generated summaries. We report F-measures for ROUGE-1 (unigram) and ROUGE-L (longest common subsequence), which are evaluated in the same way. Table 1 shows the performance of all proposed approaches. Row (a) is the baseline, which uses the LTE-based statistical measure to compute the utterance importance I(U, d). Row (b) is the result of the well-known LexRank approach [8]. Row (c) is the result of a single-layer random walk using topical similarity for the utterance-to-utterance affinity matrix. Row (d) is the result of the proposed two-layer mutually reinforced random walk (MRRW) using between-layer propagation (BP). Row (e) is the result of the proposed model using within- and between-layer propagation (WBP), with lexical similarity for the utterance-to-utterance relation. Row (f) is the same as row (e) except that it uses topical similarity for the utterance-to-utterance matrix.

Results
Note that the performance on ASR output is better than on manual transcripts. Because a higher percentage of recognition errors falls on "unimportant" words, utterances with incorrectly recognized words are less similar to other utterances and therefore harder to score highly. Utterances with more errors thus tend to be excluded from the summarization results.
Other recent work also shows better performance for ASR than manual transcripts [3,5].

Comparing Baseline and Single-Layer Graph Approaches
We can see that the performance after basic graph-based rescoring (rows (b) and (c)) is significantly better than the baseline (row (a)) for both ASR and manual transcripts. The improvement for ASR is larger than for manual transcripts: ASR output contains recognition errors that make the original scores inaccurate, and the graph approach propagates importance based on the similarity between utterances, which can effectively compensate for these errors. Thus, sharing importance with similar utterances significantly improves the performance with both lexical (row (b)) and topical similarity (row (c)).

Comparing Single-and Two-Layer Graph Approaches
The two-layer graph approaches (rows (d)-(f)) utilize the speaker information by automatically modeling intra- and inter-speaker relations in the graph. We find that the two-layer approaches involving speaker information perform better than the single-layer approaches (rows (b) and (c)), which suggests that utterances from speakers who speak more important utterances tend to be more important themselves [11]. Thus, propagating importance scores among the utterances from the same speaker can improve the results. The experiments show that the two-layer mutually reinforced random walk helps include the important utterances, giving better performance than the traditional single-layer random walk for both ASR and manual transcripts.

Effectiveness of Within-Layer Propagation
Row (d) only uses between-layer propagation, while rows (e) and (f) integrate within- and between-layer propagation. For ASR transcripts, additionally considering within-layer propagation with topical similarity (row (f)) performs better than using between-layer propagation alone (row (d)). For manual transcripts, within-layer propagation with lexical similarity (row (e)) improves on between-layer propagation alone (row (d)). Therefore, within-layer propagation is useful for both ASR and manual transcripts. The reason may be that the two-layer mutually reinforced random walk using only between-layer propagation does not exploit the relations between utterances from different speakers; integrating the different types of relations therefore performs better.

Comparing Lexical and Topical Similarity
We analyze the difference between using lexical and topical similarity for both single-layer and two-layer graph approaches. Comparing traditional LexRank and the random walk with topical similarity (rows (b) and (c)), we find that ASR and manual transcripts behave differently: topical similarity performs better on ASR transcripts but worse on manual transcripts. Due to recognition errors, lexical similarity from word overlap may be noisy and lose information, so topical similarity is the better measure of the relation between utterances in ASR output. Conversely, in the absence of recognition errors, lexical similarity can model the relations accurately, which explains its advantage on manual transcripts.
The two-layer graph approaches (rows (e) and (f)) show the same trend as the single-layer approaches. In conclusion, the topic model helps in modeling the similarity between utterances from imperfect ASR transcripts for all graph-based approaches.

Comparison to Other Approaches
Row (g) shows the result of the approach that integrates intra-speaker and inter-speaker topic modeling into a single-layer graph [11], which requires more than three parameters to control the intra-speaker and inter-speaker topic-sharing weights. In contrast, our proposed approaches automatically model the importance of speakers and utterances by mutually propagating scores between the layers, with only a single parameter α. Table 1 shows that our proposed approaches perform better, because they consider the utterance-to-utterance, speaker-to-speaker, and speaker-to-utterance relations together in a data-driven fashion. The proposed approaches achieve 7.2% and 8.2% relative improvement over the LTE baseline for ASR and manual transcripts respectively. On the same corpus, Banerjee and Rudnicky [4] used supervised learning to detect noteworthy utterances, achieving ROUGE-1 scores of 43% (ASR) and 47% (manual). In comparison, our unsupervised approach gives a significant improvement, especially for ASR transcripts.

CONCLUSIONS AND FUTURE WORK
Extensive experiments and evaluation with ROUGE metrics showed that the two-layer mutually reinforced random walk can model the importance of speakers and utterances in a single graph. Speaker information is automatically incorporated into utterance importance by between-layer propagation. Integrating within- and between-layer propagation models the three types of relations together, achieving about 7.2% and 8.2% relative improvement over the LTE baseline for ASR and manual transcripts respectively. In the future, we plan to model speakers' topics across different documents and to integrate them into a single graph.

ACKNOWLEDGEMENTS
The first author was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080628 to Carnegie Mellon University. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the Institute or the U.S. Department of Education.