Quasi Real-Time Summarization for Consumer Videos

With the widespread availability of video cameras, we face an ever-growing collection of unedited and unstructured video data. Lacking an automatic way to generate summaries, this large collection of consumer videos is tedious and time-consuming to index or search. In this work, we propose online video highlighting, a principled way of generating a short video that summarizes the most important and interesting contents of an unedited and unstructured video, which would be costly, both in time and money, to process manually. Specifically, our method learns a dictionary from the given video using group sparse coding, and updates atoms in the dictionary on-the-fly. A summary video is then generated by combining segments that cannot be sparsely reconstructed using the learned dictionary. The online fashion of our proposed method enables it to process arbitrarily long videos and to start generating summaries before seeing the end of the video. Moreover, the processing time required by our method is close to the original video length, achieving quasi real-time summarization speed. Theoretical analysis, together with experimental results on more than 12 hours of surveillance and YouTube videos, demonstrates the effectiveness of online video highlighting.


Introduction
With the widespread availability, to both consumers and organizations, of low-cost devices capable of high-volume video recording, such as digital cameras on mobile phones, tablets, and, soon, wearable gadgets such as glasses and watches, as well as surveillance cameras and monitoring devices all over the world and in space, we are inundated with billions of hours of video footage every day, potentially containing events, people, and objects of context-dependent and time-space-sensitive interest. However, even to the creators and owners of such data, let alone all the people granted access for various purposes, the contents of these videos remain dark matter in the data universe, because watching the recorded footage in real time, or even at 2x or 4x speed, is hardly possible or enjoyable. It is no surprise that with this increasing body of video data, which is largely left unedited and unstructured, all information therein is like trees falling in the forest: nearly impossible to access unless it has already been seen and indexed, an undertaking too tedious and time-consuming for humans, but an ideal challenge for machine intelligence. In this paper, we refer to such unstructured and unedited videos as consumer videos, in contrast to movies, news, or sports videos, which are often edited by humans or have special structure (such as shots, scenes, etc.). Specifically, we attempt to develop a method that offers the following function and the like: "I only have 1 minute for this hour-long video; tell me where/what to watch." That is, it automatically compiles the most salient and informative portion of the video for users, by scanning through the video stream, in an online fashion, to remove repetitive and uninteresting contents. Our method differs from some previous attempts at video summarization that eliminate the time axis completely and show a synopsis of the video as a few key frames, selected either arbitrarily or
according to some importance criteria [33,11,15]. Such a key-frame representation loses the dynamic aspect of the video and is uninteresting to watch. More importantly, taking mere frames as the unit of content prevents much important information, such as suspicious behaviors, from being recognized automatically by a machine, and therefore compromises the quality of the summary. In contrast, the summary generated by our proposed method is a short video itself, revealing the essence of the original video, just like a "trailer".
We propose onLIne VidEo highLIGHTing (LiveLight), a principled way of online generation of a short video summarizing the most important and interesting contents of a potentially very long video. Specifically, LiveLight scans through the video stream, which is divided temporally into a collection of video segments. After processing the first few segments, it starts to build its own dictionary, which is kept updated and refined afterwards. Given a new video segment, LiveLight attempts to employ its current dictionary to sparsely reconstruct this previously unseen segment, using group sparse coding [3]. A small reconstruction error for the new segment indicates that its content is already well represented in the current dictionary, suggesting that segments with similar contents have been observed in an earlier part of the video. Hence, this segment is excluded from the summary, and the algorithm moves on to the next segment. On the other hand, if the new segment cannot be sparsely reconstructed, i.e., a high reconstruction error is incurred, indicating contents unseen in previous video data, our method incorporates this segment into the summary and updates the dictionary to account for the newly included video data. This process continues until the end of the video is reached. In summary, our method sequentially scans the video stream once, learns a dictionary summarizing the contents seen so far, and updates it upon encountering video data that cannot be explained using the current dictionary. A summary video is then constructed as a combination of two groups of video segments: (1) the first few segments, used to learn the initial dictionary and capturing the background and early contents of the video; and (2) the segments causing dictionary updates, containing unseen and interesting contents. Moreover, as the entire process is carried out online, LiveLight can handle hours of, or even endless, video data, which is ubiquitous among consumer videos.

Related Works
Previous research on video summarization has mainly focused on edited videos, e.g., movies, news, and sports, which are highly structured [21,30]. For example, a movie can be naturally divided into scenes, each formed by one or more shots taking place at the same site, and each shot is further composed of frames with smooth and continuous motion. However, consumer videos lack such structure, often rendering previous research not directly applicable.
Key-frame-based methods compose a video summary as a collection of salient images (key frames) picked from the original video. Various strategies have been studied, including shot boundary detection [9], color histograms [33], motion stability [33], clustering [13], curve splitting [7], and frame self-expressiveness [11]. However, isolated and uncorrelated still images, without smooth temporal continuation, are not best suited to helping a viewer understand the original video. Moreover, [15] proposes a saliency-based method, which trains a linear regression model to predict an importance score for each frame in egocentric videos. However, the special features designed in [15] limit its applicability to videos generated by wearable cameras. Besides picking frames from the original video, methods creating new images not present in the original video have also been studied [1,5,17,24,26,27], where a panoramic image is generated from a few consecutive frames containing some important content. However, the number of consecutive frames that can be used to construct such a panoramic image is limited by occlusion between objects from different frames. Consequently, these approaches generally assume short clips with few objects. Finally, summaries composed of a collection of video segments have been studied for structured videos. Specifically, [23] uses scene boundary detection, dialogue analysis, and color histograms to produce a trailer for a feature film. [2] and [16] extract important segments from sports and news programs by utilizing special characteristics of these videos, including fixed scene structures, dominant locations, and backgrounds. Moreover, [29] and [28] utilize closed captions and speech recognition to transform video summarization into a text summarization problem and generate summaries using natural language processing techniques. However, the large body of consumer videos usually has no such special structure, and often no audio information at all.
Sparse coding [22] has led to state-of-the-art results in several vision tasks, such as image denoising and restoration [10,20], classification [32,31], and anomaly detection [34]. Moreover, brute-force deployment of sparse coding for video summarization, using the entire video as the dictionary and selecting key frames based on the zero patterns of the coding vectors, has recently been attempted [11,6] on short videos (less than a few minutes long). We discuss and compare against such methods later in this paper.

Summary of Contributions
To conclude the introduction, we summarize our main contributions as follows. (1) We propose a principled way of generating a short summary video of a potentially very long video, capturing its most important and interesting contents while eliminating repetitive events, and enabling a viewer to understand the video without watching the entire sequence. (2) We propose an online dictionary update method, enabling our method to generate summaries on-the-fly. (3) We provide theoretical analysis of the proposed method, guaranteeing convergence of the online dictionary update and generalization ability to unseen video segments. (4) We demonstrate the effectiveness of LiveLight on real-world data, including both surveillance videos and YouTube videos, achieving quasi real-time speed on all tested videos.

Online Video Highlighting
Given an unedited and unstructured consumer video, online video highlighting starts with temporal segmentation, breaking the original video into segments. Such temporal segmentation should ensure minimal variation and consistency of objects, view, and dynamics within each segment. Unlike structured videos, where shot boundary detection can be employed for temporal segmentation, most consumer videos have no shot boundaries at all, but instead continuous camera movement. Therefore, we choose to evenly divide the original video into segments, each with a constant length of 50 frames. Such a short temporal length ensures consistency within each segment. These video segments are the base units in LiveLight, in the sense that a few selected ones will compose the final summary video. A key component of LiveLight is the dictionary, which summarizes the contents of the video seen so far. Specifically, a dictionary is initially learned from video segments at the beginning of the input video, using group sparse coding. After dictionary initialization, LiveLight scans through the remaining video segments in temporal order and attempts to reconstruct each segment using the learned dictionary. Segments with reconstruction error higher than a certain threshold are considered to contain interesting contents unprecedented in the previous video, and are included in the summary video. Moreover, the dictionary is updated accordingly to incorporate the newly observed video contents, so that similar segments seen later will incur a much smaller reconstruction error. On the other hand, segments that can be well reconstructed using the current dictionary are excluded from the summary, as a small reconstruction error indicates that their content is already well represented in the dictionary, and hence that segments with similar contents have been observed in an earlier part of the video. Thus, the dictionary represents the knowledge about previously seen video contents, and is updated in an online fashion to incorporate newly observed contents. Algorithm 1 provides the workflow of LiveLight, where X_0 = {X_1, . . ., X_m} is used to learn the initial dictionary with m ≪ K, and ε_0 is a pre-set threshold parameter controlling the length of the summary video.
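The fixed-length temporal segmentation described above can be sketched in a few lines (a minimal illustration; the function name and frame-index representation are ours, not from the paper):

```python
def segment_video(num_frames, seg_len=50):
    """Evenly divide a video of num_frames frames into temporal segments
    of seg_len frames each; the last segment may be shorter."""
    return [(start, min(start + seg_len, num_frames))
            for start in range(0, num_frames, seg_len)]
```

For example, a 3-minute clip at 25 fps (4500 frames) yields 90 two-second segments.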

Algorithm 1 Online Video Highlighting (LiveLight)
input: Video X composed of temporal segments {X_1, . . ., X_K}
output: Short video Z summarizing the most important and interesting contents of X
1: Learn initial dictionary D using X_0 = {X_1, . . ., X_m} via group sparse coding, and initialize Z = X_0
2: for all video segments X_k ∈ {X_{m+1}, . . ., X_K} do
3:   Reconstruct video segment X_k using current dictionary D and compute reconstruction error ε_k
4:   if ε_k > ε_0 then
5:     Z ← Z ∪ X_k; update dictionary D with X_k
6:   end if
7: end for
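The control flow of Algorithm 1 can be sketched as a toy example. Purely for illustration, we stand in for group sparse reconstruction with the distance to the nearest stored feature vector; all names, and the toy error measure, are ours, not the paper's:

```python
import numpy as np

def livelight_sketch(segments, m=2, eps0=1.0):
    """Toy sketch of Algorithm 1's loop: the 'dictionary' is a list of
    stored segment features, and the 'reconstruction error' is the
    distance to the nearest stored feature."""
    dictionary = [np.asarray(s, float) for s in segments[:m]]  # initial dictionary
    summary = list(range(m))                                   # Z = X_0
    for k in range(m, len(segments)):
        x = np.asarray(segments[k], float)
        err = min(np.linalg.norm(x - d) for d in dictionary)   # error eps_k
        if err > eps0:            # unseen content: include it, update dictionary
            summary.append(k)
            dictionary.append(x)
    return summary
```

Running this on five toy "segments" where the third introduces new content keeps segments 0-2 and drops the repetitive ones.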

Video Segment Reconstruction
The basic idea of our approach is to represent the knowledge of previously observed video segments using a learned dictionary D, whose columns (a.k.a. atoms) are bases for reconstructing future video segments. Given the learned dictionary D (details of learning the initial dictionary are provided later in this section), LiveLight attempts to sparsely reconstruct each query video segment using its atoms. Specifically, sparse reconstruction requires both a small reconstruction error and a small footprint on the dictionary, i.e., using as few atoms from the dictionary as possible. Consequently, video summarization is formulated as a sparse coding problem, seeking a linear decomposition of the data using a few elements from a dictionary learned in an online fashion.
We start with the feature representation for video data. Specifically, we adopt a representation based on spatio-temporal cuboids [14,8,12], detecting salient points within the video and describing the local spatio-temporal patch around each detected interest point. Different from optical flow, this representation describes only spatio-temporally salient regions, instead of the entire frame. At the same time, spatio-temporal cuboids are less affected by occlusion, a key difficulty for tracking-trajectory-based representations. Specifically, we adopt the spatio-temporal interest points detected using the method in [8], and describe each detected interest point with a histogram of gradients (HoG) and a histogram of optical flow (HoF). The feature representation for each detected interest point is then obtained by concatenating its HoG and HoF feature vectors. Finally, each video segment is represented as a collection of feature vectors corresponding to the detected interest points, i.e., X_k = {x_1, . . ., x_{n_k}}, where n_k is the number of interest points detected in video segment X_k.
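The per-interest-point descriptor is simply the concatenation of the HoG and HoF histograms, and a segment is the collection of such descriptors. A minimal sketch (helper names are ours; the actual detectors and descriptors come from [8]):

```python
import numpy as np

def interest_point_feature(hog, hof):
    """Descriptor for one interest point: its HoG histogram followed by
    its HoF histogram, concatenated into one vector."""
    return np.concatenate([np.asarray(hog, float), np.asarray(hof, float)])

def segment_representation(hog_hof_pairs):
    """A segment X_k is represented as the collection of descriptors of
    its n_k detected interest points."""
    return [interest_point_feature(h, f) for h, f in hog_hof_pairs]
```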
Different from conventional settings of sparse coding, where the input signal is a vector, the input signal in our problem is a video segment, represented as a group of vectors X_k = {x_1, . . ., x_{n_k}}. Therefore, our goal is to effectively encode groups of instances in terms of a set of dictionary atoms D = {d_j}_{j=1}^{|D|}, where |D| is the size of the dictionary, i.e., the number of atoms in D. Specifically, given the learned dictionary D, LiveLight seeks a sparse reconstruction of the query segment X as follows:

min_A J(X, A, D) = \sum_{i=1}^{|X|} \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \sum_{j=1}^{|D|} \|A^j\|_2    (1)

where A = {α_1, . . ., α_{|X|}}, α_i ∈ R^{|D|} is the reconstruction vector for interest point x_i ∈ X, A^j denotes the j-th row of A (the coefficients associated with atom d_j across all interest points), and |X| is the number of interest points detected within video segment X. The first term in (1) is the reconstruction cost. If video segments similar to X have been observed before, this term should be small, under the assumption that the learned dictionary represents the knowledge in previously seen video data. The second term is the group sparsity regularization. Since the dictionary D is learned to sparsely reconstruct previously seen video segments, if X contains no interesting or unseen contents, it should also be sparsely reconstructible using few atoms of D.
On the other hand, if the contents of X have never been observed in previous video segments, then even though a fairly small reconstruction cost might still be achievable, we would expect the reconstruction to use a large number of dictionary atoms, resulting in dense reconstruction weight vectors. Moreover, the special mixed ℓ1/ℓ2 norm of A used in the second term regularizes the number of dictionary atoms used to reconstruct the entire video segment X. This is preferable to conventional ℓ1 regularization: a simple ℓ1 regularizer only ensures a sparse weight vector for each interest point x_i ∈ X, but different interest points may well have very different footprints on the dictionary, i.e., use very different atoms for their reconstruction. Consequently, the reconstruction of the video segment X could still involve a large number of atoms of D. The ℓ1/ℓ2 regularizer, in contrast, ensures a small footprint for the entire segment X, as all interest points within X are regularized to use the same group of atoms. The trade-off between accurate reconstruction and compact encoding is controlled by the regularization parameter λ. Finally, we denote the value of (1) under the optimal reconstruction matrix A as ε, which is used in Algorithm 1 to decide whether segment X should be incorporated into the summary video. Consequently, LiveLight encapsulates the following intuitions about what a video summary should be. Given a dictionary optimized to sparsely reconstruct previously seen video contents, a new segment exhibiting contents similar to previous video data should be reconstructible from a small number of atoms. On the other hand, a video segment unveiling never-before-seen contents is either not reconstructible from the dictionary of previous video segments with small error, or, even if it is reconstructible, it necessarily builds on a combination of a large number of atoms in the dictionary. Crucial to this technique is the ability to learn a good dictionary of atoms representing the contents of previous video segments, and to update the dictionary online to adapt to the changing content of the video, which we discuss in detail later in this section.
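The advantage of the mixed ℓ1/ℓ2 penalty over a plain ℓ1 penalty can be seen numerically: for the same total coefficient mass, the mixed norm is smaller when all interest points share the same few atoms (rows of A). A small sketch, with one row of A per atom and one column per interest point (the example matrices are ours):

```python
import numpy as np

def mixed_l1l2(A):
    """Mixed l1/l2 norm: sum over atoms (rows of A) of the l2 norm of
    that atom's coefficients across all interest points in the segment."""
    return float(np.sum(np.linalg.norm(A, axis=1)))

# Two codings with the same l1 norm (total coefficient mass 4):
shared = np.array([[1.0, 1.0, 1.0, 1.0],    # all points use atom 0
                   [0.0, 0.0, 0.0, 0.0]])
scatter = np.array([[1.0, 1.0, 0.0, 0.0],   # points split across two atoms
                    [0.0, 0.0, 1.0, 1.0]])
```

Here `mixed_l1l2(shared)` is 2.0 while `mixed_l1l2(scatter)` is about 2.83, so the penalty favors a small shared footprint on the dictionary, exactly the behavior the text describes.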

Optimization
To find the optimal reconstruction vectors {α_i} for the interest points in X, we need to solve problem (1). We employ the alternating direction method of multipliers (ADMM) [4] to carry out this optimization, due to its efficiency. Introducing auxiliary variables Z with the constraint A = Z, ADMM consists of the following iterations: ∀i:

\alpha_i^{(s+1)} = (D^\top D + \rho I)^{-1} \left( D^\top x_i + \rho (z_i^{(s)} - u_i^{(s)}) \right)
Z^{(s+1)} = \arg\min_Z \; \lambda \sum_{j=1}^{|D|} \|Z^j\|_2 + \frac{\rho}{2} \|A^{(s+1)} - Z + U^{(s)}\|_F^2
U^{(s+1)} = U^{(s)} + A^{(s+1)} - Z^{(s+1)}

where ρ > 0 is the ADMM penalty parameter, U is the scaled dual variable, and the Z-update has a closed-form solution via row-wise group soft-thresholding.
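Under the standard splitting A = Z, these iterations admit the closed forms sketched below. This is a sketch in our own notation and parameterization (ρ, iteration count, and all names are ours), not the authors' implementation:

```python
import numpy as np

def group_lasso_admm(D, X, lam=0.1, rho=1.0, iters=300):
    """ADMM sketch for the group-sparse reconstruction step:
    min_A 0.5*||X - D A||_F^2 + lam * sum_j ||A^j||_2, via splitting A = Z.
    D is d x p, X is d x n (one column per interest point)."""
    p, n = D.shape[1], X.shape[1]
    A = np.zeros((p, n)); Z = np.zeros((p, n)); U = np.zeros((p, n))
    G = np.linalg.inv(D.T @ D + rho * np.eye(p))  # cached factor for A-updates
    for _ in range(iters):
        A = G @ (D.T @ X + rho * (Z - U))         # quadratic A-update
        V = A + U
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        shrink = np.maximum(0.0, 1.0 - (lam / rho) / np.maximum(norms, 1e-12))
        Z = shrink * V                            # row-wise group soft-thresholding
        U = U + A - Z                             # dual update
    return Z
```

With an identity dictionary and a signal lying on atom 0 only, the returned codes have an exactly zero row for the unused atom, illustrating the group-sparse footprint.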

Learning Initial Dictionary
In this section, we discuss how to learn an initial dictionary, which is necessary to launch the LiveLight algorithm. Specifically, we would like a learning method that facilitates both the induction of new dictionary atoms and the removal of dictionary atoms with low predictive power. To achieve this goal, we again apply ℓ1/ℓ2 regularization, but this time to the dictionary atoms. The idea of this regularization is that uninformative dictionary atoms are regularized towards 0, effectively removing them from the dictionary. Given the first few video segments X_0 = {X_1, . . ., X_m}, we formulate learning the optimal initial dictionary as follows:

\min_{D, \{A_1, \ldots, A_m\}} \; \sum_{k=1}^{m} J(X_k, A_k, D) + \gamma \sum_{j=1}^{|D|} \|d_j\|_2    (5)

where J(X_k, A_k, D) is the objective function in (1), and γ balances sparse reconstruction quality against dictionary size. Though non-convex in D and {A_1, . . ., A_m} jointly, (5) is convex w.r.t. {A_1, . . ., A_m} when D is fixed, and also convex w.r.t. D with {A_1, . . ., A_m} fixed. A natural solution is to alternate between these two sets of variables, optimizing one while clamping the other. Specifically, with the dictionary D fixed, each A_k ∈ {A_1, . . ., A_m} can be optimized individually, using the optimization method described in the previous section. On the other hand, with {A_1, . . ., A_m} fixed, optimizing the dictionary D can similarly be solved via ADMM.
Due to space limitations, we omit the details of this optimization.
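The pruning effect of the ℓ1/ℓ2 penalty on atoms can be made explicit in one small helper: atoms driven (near) to zero carry no predictive power and are dropped. A sketch (the function name and tolerance are ours):

```python
import numpy as np

def prune_atoms(D, tol=1e-6):
    """Drop dictionary atoms (columns of D) whose l2 norm the atom-wise
    l1/l2 penalty has driven below tol, keeping the dictionary compact."""
    keep = np.linalg.norm(D, axis=0) > tol
    return D[:, keep]
```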

Online Dictionary Update
As LiveLight scans through the video, segments that cannot be sparsely reconstructed using the current dictionary, indicating unseen and interesting contents, are incorporated into the summary video. However, subsequent occurrences of similar contents in later video segments should ideally be excluded. Consequently, it is crucial to update the dictionary so that video segments already included in the summary no longer result in large reconstruction errors. Assume the current version of the summary is Z_t, composed of t video segments {X_k}_{k=1}^{t}; then the optimal dictionary is the solution of the following problem:

\min_D f(D) = \frac{1}{t} \sum_{k=1}^{t} \min_{A_k} J(X_k, A_k, D) + \gamma \sum_{j=1}^{|D|} \|d_j\|_2    (6)

where we need to store the feature representations {X_k}_{k=1}^{t} of all t segments in Z_t. This may not cause a problem for short videos; however, for hours of video, especially surveillance videos running endlessly, storing these feature representations requires huge space. Moreover, solving the above optimization problem from scratch, using alternating optimization for each dictionary update, is extremely time-consuming and would prevent the algorithm from being applicable to real-world consumer videos. Therefore, LiveLight employs online learning for approximate and efficient dictionary updates [19]. Specifically, instead of optimizing the dictionary D and the reconstruction coefficients {A_1, . . ., A_t} simultaneously, LiveLight aggregates the past information computed during the previous steps of the algorithm, namely the reconstruction coefficients {Â_1, . . ., Â_t} computed using previous versions of the dictionary, and only optimizes D in problem (6). Therefore, the online dictionary update seeks to solve the following approximate optimization problem:

\min_D \hat{f}(D) = \frac{1}{t} \sum_{k=1}^{t} J(X_k, \hat{A}_k, D) + \gamma \sum_{j=1}^{|D|} \|d_j\|_2    (7)

It is easy to see that f̂(D) upper bounds f(D) in problem (6). Moreover, the theoretical analysis in the next section guarantees that f̂(D) and f(D) converge to the same limit, and consequently f̂(D) acts as a surrogate for f(D). Moreover, it is easy to show that problem (7) can be equivalently reformulated as follows:

\min_D \; \frac{1}{t} \left( \frac{1}{2} \mathrm{Tr}(D^\top D P_t) - \mathrm{Tr}(D^\top Q_t) \right) + \gamma \sum_{j=1}^{|D|} \|d_j\|_2    (8)

where Tr(·) is the matrix trace, and P_t and Q_t are defined as

P_t = \sum_{k=1}^{t} \hat{A}_k \hat{A}_k^\top, \quad Q_t = \sum_{k=1}^{t} X_k \hat{A}_k^\top

with X_k and Â_k viewed as matrices whose columns are the feature vectors x_i and their reconstruction vectors, respectively. Therefore, there is no need to store {Â_k}_{k=1}^{t} or {X_k}_{k=1}^{t}, as all necessary information is stored in P_t and Q_t. Finally, problem (8) can be efficiently solved using ADMM.
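The sufficient-statistics bookkeeping can be sketched as below. This is a sketch in the spirit of the online dictionary learning recipe of [19]: for simplicity we substitute a projection of each atom onto the unit ball for the paper's γ-penalty on atoms, and all names are ours:

```python
import numpy as np

class OnlineDictUpdater:
    """Online dictionary update via sufficient statistics: P accumulates
    sum_k A_k A_k^T and Q accumulates sum_k X_k A_k^T, so past segments
    never need to be stored."""
    def __init__(self, D):
        self.D = D.astype(float).copy()          # d x p dictionary
        p = self.D.shape[1]
        self.P = np.zeros((p, p))
        self.Q = np.zeros((self.D.shape[0], p))

    def update(self, X, A, passes=5):
        """X: d x n features of the new segment; A: p x n codes computed
        with the previous dictionary."""
        self.P += A @ A.T
        self.Q += X @ A.T
        for _ in range(passes):                  # block coordinate descent on atoms
            for j in range(self.D.shape[1]):
                if self.P[j, j] < 1e-12:
                    continue
                u = self.D[:, j] + (self.Q[:, j] - self.D @ self.P[:, j]) / self.P[j, j]
                # project onto the unit ball (stand-in for the atom penalty)
                self.D[:, j] = u / max(1.0, np.linalg.norm(u))
        return self.D
```

Only P (p x p) and Q (d x p) are kept between segments, so memory use is independent of the number of segments processed.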

Importance of Dictionary
Very recently, there have been attempts to employ the idea of sparse reconstruction for video summarization [11,6]. However, those approaches use the entire video itself as the basis for reconstruction, instead of learning and updating a dictionary as a concise summary of video contents. Using the entire video as the reconstruction basis [11,6] significantly increases optimization complexity and computational time; as shown later in the experiments, the approach in [6] takes nearly 10 times more CPU time than LiveLight on the same videos. Such a heavy computational footprint prevents those approaches from being applied to temporally long consumer videos (indeed, [11,6] only used videos of at most several minutes in their empirical studies). Moreover, [11,6] have to see the entire video before starting to generate a summary, eliminating the possibility of real-time summarization. In contrast, the dictionary learned and updated in LiveLight concisely summarizes the contents of the video seen so far, significantly reduces computational cost, and captures any concept drift in the video stream.

Sanity Check
We use a synthetic video to perform a sanity check on LiveLight. Specifically, we use two types of video sequences from the Weizmann action recognition data set [12], i.e., walk and bend. The synthetic video is constructed by combining 5 walk sequences, followed by 5 bend sequences, and 5 more walk sequences. Details of this synthetic video are shown in Figure 1. LiveLight learns the initial dictionary using the first walk sequence, and carries out reconstruction and online dictionary updates on the remaining 14 sequences. There are 2 clear peaks in Figure 1, corresponding to the third walk sequence, which is the first occurrence of walking from left to right (the first and second sequences both show walking from right to left), and the first bend sequence. Moreover, the reconstruction error for the fourth walk sequence, which also shows walking from left to right, is significantly smaller than that of the third walk sequence, indicating that the dictionary has learned the contents of walking to the right through online dictionary updates. Finally, the last 5 walk sequences all result in small reconstruction errors, even after LiveLight has just observed 5 bend sequences, showing that the dictionary retains its knowledge about walk.

Theoretical Analysis
We first study the convergence property of the online dictionary update. Specifically, we have the following theorem.

Theorem 1. Denote the sequence of dictionaries learned in LiveLight as {D_t}, where D_1 is the initial dictionary. Then f̂(D), defined in (7), is a surrogate function of f(D), defined in (6), satisfying: (1) f̂(D) − f(D) converges to 0 almost surely; (2) D_t obtained by optimizing f̂ is asymptotically close to the set of stationary points of (6) with probability 1.

Theorem 1 guarantees that f̂(D) can be used as a proper surrogate for f(D), so that we can optimize (7) to obtain the optimal dictionary efficiently, instead of solving the much more time-consuming optimization problem (6).
Next, we study the generalization ability of LiveLight on unseen video segments. Specifically, as LiveLight scans through the video sequence, the dictionary is learned and updated using only the video segments seen so far. Consequently, the dictionary is optimized to sparsely reconstruct the contents of seen video segments. It is crucial for LiveLight to also be able to sparsely reconstruct unseen video segments composed of contents similar to those seen before. This property is called generalization ability in statistical machine learning terminology. Specifically:

Theorem 2. Assume data points X (i.e., video segments) are generated from an unknown probability distribution P. Given t observations {X_1, . . ., X_t}, for any dictionary D and any fixed δ > 0, with probability at least 1 − δ,

\mathbb{E}_{X \sim P}[J^*(X, D)] \le \frac{1}{t} \sum_{k=1}^{t} J^*(X_k, D) + \varepsilon(t, \delta)

where J*(X, D) = min_A J(X, A, D) is the minimal reconstruction error for X using dictionary D, as defined in (1), and ε(t, δ) = o(ln t/√t) is a small constant that decreases as t increases.
The above theorem holds for any dictionary D, and in particular for the dictionary learned in LiveLight. Therefore, Theorem 2 guarantees that if the dictionary D has a small reconstruction error on previously seen video segments, it will also incur a small reconstruction error on unseen video segments with similar contents.

Experiments
We test the performance of LiveLight on more than 12 hours of consumer videos, including both YouTube videos and surveillance videos. The 20 videos in our data set span a wide variety of scenarios: indoor and outdoor, moving and still cameras, with and without camera zoom in/out, and with different categories of targets (humans, vehicles, planes, animals, etc.), covering a wide variety of activities and environmental conditions. Details of the data set are provided in Table 1.

Experiment Design and Evaluation
We compare LiveLight with several other methods, including evenly spaced segments, K-means clustering [6] using the same features as our method, and the DSVS algorithm proposed in [6], the state-of-the-art method for video summarization. It is shown in [6] that DSVS already beats the color-histogram-based method [25] and the motion-based method [18]. Parameters for the various algorithms are set such that the length of the generated summary videos is the same as that of the ground truth video. For LiveLight, we fix the number of atoms in the dictionary to 200, though better performance is possible with fine tuning of parameters.
For each video in our data set, three judges selected segments from the original video to compose their preferred version of the summary. The final ground truth is then constructed by pooling together the segments selected by at least two judges. Following [6], to quantitatively determine the overlap between an algorithm-generated summary and the ground truth, both segment content and time differences are considered. Specifically, two video segments must occur within a short period of time (two seconds in our experiments), and must be similar in scene content and motion pattern, to be considered equivalent. Final accuracy is computed as the ratio of segments in the algorithm-generated summary that overlap with the ground truth.
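The temporal part of this matching criterion can be sketched as follows (the content-similarity check is omitted, and the function and parameter names are ours):

```python
def summary_accuracy(pred_times, truth_times, tol=2.0):
    """Fraction of predicted summary segments whose start time lies within
    `tol` seconds of some ground-truth segment (content check omitted)."""
    matched = sum(1 for p in pred_times
                  if any(abs(p - t) <= tol for t in truth_times))
    return matched / max(1, len(pred_times))
```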

Results
From the quantitative comparison provided in Table 2, we make the following observations: (1) LiveLight achieves the highest accuracy on 18 out of 20 videos, and in most cases beats the competing algorithms by a significant margin; (2) on the 5 surveillance videos, both LiveLight and DSVS outperform the other two algorithms, showing the advantage of sparse-reconstruction-based methods for summarizing surveillance videos; (3) averaged across the 20 videos, LiveLight outperforms the state-of-the-art summarization method DSVS by 8%, revealing the advantage of LiveLight.
Besides quantitative measures, we also show the automatically generated summary ("trailer") for the YouTube video PolicePullOver (more summary videos are provided in the supplementary material). As shown in Figure 2, the summary video captures the entire story line of this nearly hour-long video, achieving more than 40x compression in time without losing the semantic understandability of the summary. Moreover, the background in this video involves various cars passing in both directions, and it is interesting that LiveLight is not affected by this background motion.

Time Complexity
LiveLight is implemented in MATLAB 7.12 on a 3.40 GHz Intel Core i7 PC with 16.0 GB of main memory. Table 3 compares the processing time of the various algorithms, yielding the following observations: (1) the last column under LiveLight shows the ratio between its computational time and the video length; for all videos, this ratio is less than 2, and for 6 videos it is even less than 1. Thus, with a MATLAB implementation on a conventional PC, LiveLight already achieves near real-time speed, further revealing its promise for real-world video analysis applications; (2) LiveLight is nearly 10 times faster than DSVS, revealing the advantage of learning and updating the dictionary in an online fashion, instead of using the original video as the basis for sparse reconstruction.

Conclusions
In this paper, we propose LiveLight to generate a short video summarizing the most important and interesting contents of a potentially very long video. LiveLight enables a viewer to understand the video without watching the entire sequence. Theoretical analysis is provided, covering the convergence of the online dictionary update and the generalization ability to unseen video segments. Experiments on real-world surveillance videos and YouTube videos demonstrate the effectiveness and efficiency of LiveLight. The fact that LiveLight is quasi real-time on all tested videos shows its promise for summarizing the huge body of consumer videos.

Table 3. Processing time of LiveLight and competing algorithms (all times shown in minutes). T_video is the length of the original video. T_1 is the time spent generating feature representations in LiveLight, and T_2 is the combined time spent on learning the initial dictionary, video segment reconstruction, and online dictionary updates. T_total = T_1 + T_2 is the total processing time of LiveLight, and Ratio = T_total / T_video for all algorithms.

Figure 2 .
Figure 2. (Best viewed in color and zoomed in.) Some frames of the summary video generated by LiveLight for a YouTube video showing police pulling over a black SUV and making an arrest (frames are organized from left to right, then top to bottom, in temporal order). From the summary video, we can see the following storyline: (1) the police car travels on the highway; (2) the police car pulls over a black SUV; (3) a police officer talks to a passenger in the SUV; (4) two police officers walk up to the SUV and open its passenger-side door; (5) a police officer arrests a man in a white shirt; (6) a police officer talks to a passenger in the SUV again; (7) both the police car and the black SUV pull into highway traffic; (8) the police car follows the black SUV off the highway; (9) both vehicles travel in local traffic; (10) the black SUV pulls into a local community.

Table 1 .
Data set details. The first 15 videos are downloaded from YouTube, and the last 5 videos are from surveillance cameras. Video length (Time) is measured in minutes. CamMo stands for camera motion, and Zoom means camera zoom in/out.

Table 2 .
T is the length (in seconds) of the summary video. LL: LiveLight; ES: evenly spaced segments; CL: K-means clustering; DSVS: sparse reconstruction using the original video as basis [6].