TimeSeer: Scagnostics for High-Dimensional Time Series

We introduce a method (Scagnostic time series) and an application (TimeSeer) for organizing multivariate time series and for guiding interactive exploration through high-dimensional data. The method is based on nine characterizations of the 2D distributions of orthogonal pairwise projections on a set of points in multidimensional Euclidean space. These characterizations include measures such as density, skewness, shape, outliers, and texture. Working directly with these Scagnostic measures, we can locate anomalous or interesting subseries for further analysis. Our application is designed to handle the types of doubly multivariate data series that are often found in security, financial, social, and other sectors.


INTRODUCTION
Suppose we have data consisting of many time series over many variables with many time points. Suppose, further, that we want to identify unusual events at individual time points across all the series. If there is only one time series for each variable, the usual analytic approach to this problem is to perform spectral modeling of cross-covariance functions among the series. A visual analytic analog of this approach is to plot pairs of series and highlight noteworthy features in the pairs that appear to be substantially related. We might see, for example, one series showing average fines for auto speeding each week in a US state and another series showing weekly auto accidents in that state. If we see a rise in auto speeding fines preceding by several weeks a reduction in accidents, we might conclude (with appropriate ceteris-paribus qualification) that a rise in fines may lead to a reduction in accidents.
This traditional approach will not work if there is more than one time series for each variable (more than one state in our example). An alternative, of course, would be to examine all singletons or pairs of individual series for patterns. This alternative does not scale. Assume we have t time points, p variables, and n series which describe a doubly multivariate data series. In this case, we have for each pair of variables at each time point a 2D scatterplot of n points. In our US economy example, we would have for a single year's data 52 time points (each week in a year), 10 variables (one for each economic sector), and 50 series (one for each state). In this rather small example (from our perspective), we have 2,340 scatterplots with 50 points each to examine. We will consider a much larger example in this paper.
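The count quoted above follows from one scatterplot per pair of variables per time point. A minimal sketch of that arithmetic (the function name `scatterplot_count` is ours, introduced only for illustration):

```python
from math import comb

def scatterplot_count(t, p):
    """One 2D scatterplot per unordered pair of variables at each time point."""
    return t * comb(p, 2)

# 52 weekly time points and 10 variables give 52 * 45 scatterplots,
# each holding n = 50 points (one per state).
print(scatterplot_count(t=52, p=10))  # 2340
```

Note that n does not enter the count; it only sets how many points each scatterplot contains.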
If we can solve the multitude-of-scatterplots problem, we gain some important visual-analytic insights. By looking at scatterplots instead of individual series, we can analyze patterns that would not be evident in other types of multivariate time series visualizations. Suppose, for instance, that we have n individuals (including some potential terrorists) interacting at t time points through p different channels (websites, text messages, cell conversations, etc.). Suppose, also, that we are interested in recognizing time points where one or more subgroups of these individuals begin to cluster together in a communication clique. With the right tools, we should be able to identify scatterplots where these clusters are apparent. Furthermore, we should be able to examine with these tools other features, such as outliers or correlations, that characterize conspiratorial interaction. Fig. 1 shows an example where TimeSeer examines the Outlying Scagnostic feature. In particular, the scatterplots with lighter backgrounds are the plots with high Outlying values.
An important aspect of our proposal is that it is firmly grounded in time series methodology. We are not simply looking for unusual scatterplots in a large collection of scatterplots. We are looking at time series of aspects of scatterplots. For example, we can investigate a time series of Clumpiness (cliques) or Outliers (rogues) or Monotonicity (conspirators). We will see in our examples below that these series have coherent behavior in real data that becomes apparent and revealing when viewed with the right tools. Multiple views, interactions, and analytical components have been noted as particularly useful in analyzing time-oriented data [1]. Our system, TimeSeer, not only looks at a data abstraction through aspects of the raw data but also provides multiple views together with filtering, searching, and focusing interactions.
This work is a natural extension of our work on Scagnostics [48], an idea that allows us to characterize the "shape" of scatterplots. In developing a working platform, however, we discovered that we had to design custom tools to deal with the challenges posed by massive time series data sets. We found some clues in related work, but most of what we have developed is, in our understanding, new.
Our contributions in this paper are as follows:
1. We devise a framework for applying Scagnostics in the context of time-varying data analysis. This induces data reduction which allows for fast identification of interesting features such as outliers in high-dimensional data sets.
2. We propose a dissimilarity measure for scatterplots based on their Scagnostics.
3. We design an interactive system for visually mining doubly multivariate data series using multiple visual metaphors in a novel combination.
The paper is structured as follows: we describe related work in the following section. We give an overview of our interactive system, TimeSeer, in Section 3. Section 4 illustrates TimeSeer on real data sets. Finally, Section 5 draws conclusions and indicates future developments.

RELATED WORK
In reviewing related work, we must keep in mind that some approaches that seem visually similar to ours are fundamentally different and some approaches that seem visually quite different nevertheless provided important guidelines for our own development. We begin with our work on Scagnostics.

Scagnostics
In the mid 1980s, John and Paul Tukey proposed an exploratory graphical method called Scagnostics. The Tukeys intended to characterize a collection of 2D scatterplots through a small number of measures of the pattern of points in these plots. These measures included the area of 2D isolevel kernel density contours, the perimeter length of these contours, a nonlinearity measure of association based on principal curves [23], and other statistics. By using these measures, the Tukeys aimed to detect anomalies in density, shape, association, and other features.
We described Scagnostics in a plenary session at the 2003 InfoVis conference. Seo and Shneiderman followed our general description by using ordinary parametric statistics (mean, standard deviation, correlation coefficient, etc.) instead of the kinds of nonparametric measures proposed by the Tukeys [27]. Consequently, we decided to implement the original Tukey idea through nine Scagnostics defined on planar proximity graphs. We gave these measures ordinary names (Outlying, Skewed, Clumpy, Sparse, Striated, Convex, Skinny, Stringy, Monotonic) and presented a scalable program for computing these new graph-theoretic measures [47]. Following this work, Fu [16] extended Scagnostics to 3D and still others used analogs of the word to describe feature-based descriptions for parallel coordinates and pixel displays [14], [39].
Although the original motivation for Scagnostics was to locate interesting scatterplots in a large scatterplot matrix, we soon realized the idea had more general implications. We have argued [48] that Scagnostics should be regarded as a type of projection that enables us to examine features in Scagnostics space and then make inferences about patterns that would not be apparent in the raw data space. In other words, Scagnostics space can serve as a basis for visual analytics much as the complex plane does for spectral analytics, although the Scagnostics projection is not invertible. Our time series platform rests on this fundamental principle.
We now outline the Scagnostic algorithm.

Binning
We begin by normalizing the data to the unit interval and then use a 40 by 40 hexagonal grid [9] to aggregate the points in each scatterplot. If there are more than 250 nonempty cells, we coarsen the grid by half (enlarging the bins) and rebin. We rebin until there are no more than 250 nonempty cells. The choice of bin size is constrained by efficiency (too many bins slow down calculations of the geometric graphs) and sensitivity (too few bins obscure features in the scatterplots). We compute all our measures on the binned points using the counts in each bin as weights. The Scagnostics measures depend on proximity graphs that are all subsets of the Delaunay triangulation: the convex hull, the minimum spanning tree (MST), and the alpha complex [15].
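The rebinning loop can be sketched as follows. This is only an illustration under stated assumptions: it swaps in a square grid for the hexagonal grid used in the text (hexagonal binning needs more machinery), and it reads "by half" as halving the number of bins along each axis. The names `normalize`, `bin_points`, and `adaptive_bin` are ours.

```python
def normalize(points):
    """Rescale x and y separately to the unit interval."""
    xs, ys = zip(*points)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)

    def unit(v, lo, hi):
        # A constant coordinate has no range; place it mid-interval.
        return (v - lo) / (hi - lo) if hi > lo else 0.5

    return [(unit(x, x_lo, x_hi), unit(y, y_lo, y_hi)) for x, y in points]

def bin_points(points, grid):
    """Count points per cell of a grid x grid SQUARE lattice (assumption:
    the paper uses hexagons). Returns {(i, j): count} for nonempty cells."""
    cells = {}
    for x, y in points:
        i = min(int(x * grid), grid - 1)  # clamp x == 1.0 into the last cell
        j = min(int(y * grid), grid - 1)
        cells[(i, j)] = cells.get((i, j), 0) + 1
    return cells

def adaptive_bin(points, grid=40, max_cells=250):
    """Coarsen the grid until at most max_cells cells are nonempty."""
    pts = normalize(points)
    cells = bin_points(pts, grid)
    while len(cells) > max_cells and grid > 1:
        grid //= 2  # assumed reading of "by half"
        cells = bin_points(pts, grid)
    return grid, cells
```

The per-cell counts returned here play the role of the weights mentioned above.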

Deleting Outliers
Before computing the Scagnostics, we delete outliers to improve robustness. We consider an outlier to be a vertex whose adjacent edges in the MST all have a weight (length) greater than $\omega$, where
$\omega = q_{75} + 1.5\,(q_{75} - q_{25})$ (1)
Here $q_{75}$ is the 75th percentile of the MST edge lengths and the expression in the parentheses is the interquartile range of the edge lengths.
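A minimal sketch of this outlier rule, assuming the MST edges are already available as `(u, v, length)` tuples and assuming the cutoff is the upper fence, the 75th percentile plus 1.5 times the interquartile range (the multiplier 1.5 and the linear-interpolation percentile are assumptions of this sketch, as are all function names):

```python
import math

def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100] (one of several conventions)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def outlier_cutoff(lengths):
    """Upper fence on MST edge lengths: q75 + 1.5 * (q75 - q25)."""
    q25, q75 = percentile(lengths, 25), percentile(lengths, 75)
    return q75 + 1.5 * (q75 - q25)

def outlying_vertices(mst_edges):
    """Vertices whose incident MST edges are ALL longer than the cutoff."""
    cutoff = outlier_cutoff([w for _, _, w in mst_edges])
    incident = {}
    for u, v, w in mst_edges:
        incident.setdefault(u, []).append(w)
        incident.setdefault(v, []).append(w)
    return sorted(v for v, ws in incident.items() if all(w > cutoff for w in ws))

# A chain of unit-length edges with one distant leaf attached at vertex 4.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0), (4, 5, 10.0)]
print(outlying_vertices(edges))  # [5]
```

Vertex 4 is not flagged even though it touches the long edge, because it also has a short incident edge.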

Computing Scagnostic Measures
We now present the Scagnostic measures computed on our three geometric graphs. In the formulas below, we use H for the convex hull, A for the alpha hull, and T for the minimum spanning tree. We are interested in assessing three aspects of scattered points: density, shape, and association. Density measures. The following measures detect different aspects of point densities.
Outlying. The Outlying Scagnostic measures the proportion of the total edge length of the minimum spanning tree accounted for by the total length of edges adjacent to outlying points (as defined above). We do this calculation before deleting outliers for the other measures:
$c_{outlying} = \mathrm{length}(T_{outliers}) / \mathrm{length}(T)$ (2)
Skewed. We use two other density measures based on MST edge lengths. The first is a relatively robust measure of skewness in the distribution of edge lengths of the MST:
$q_{skew} = (q_{90} - q_{50}) / (q_{90} - q_{10})$ (3)
Sparse. The second edge-length statistic, Sparse, measures whether points in a 2D scatterplot are confined to a lattice or a small number of locations on the plane. This can happen, for example, when tuples are produced by the product of categorical variables. It can also happen when the number of points is extremely small. We choose the 90th percentile of the distribution of edge lengths in the MST. This is the same value we use for the $\alpha$ statistic:
$c_{sparse} = q_{90}$ (4)
Clumpy. An extremely skewed distribution of MST edge lengths does not necessarily indicate clustering of points. For this, we turn to another measure based on the MST: the RUNT statistic [22]. The runt size of a dendrogram node is the smaller of the number of leaves of each of the two subtrees joined at that node.
Since there is an isomorphism between a single-linkage dendrogram and the MST [19], we can associate a runt size ($r_j$) with each edge ($e_j$) in the MST, as described by Stuetzle [40]. The RUNT graph ($R_j$) corresponding to each edge is the smaller of the two subsets of edges that are still connected to each of the two vertices in $e_j$ after deleting edges in the MST with lengths less than $\mathrm{length}(e_j)$.
The RUNT-based measure responds to clusters with small maximum intracluster distance relative to the length of their nearest neighbor intercluster distance. In the formula below, $j$ runs over all edges in $T$ and $k$ runs over all edges in $R_j$:
$c_{clumpy} = \max_j \left[ 1 - \max_k [\mathrm{length}(e_k)] / \mathrm{length}(e_j) \right]$ (5)
Striated. We define coherence in a set of points as the presence of relatively smooth paths in the minimum spanning tree. Smooth algebraic functions, time series, and curves (e.g., spirals) fit this definition. So do points arranged in flows or vector fields. Another common example is the pattern of parallel lines of points produced by the product of categorical and continuous variables.
We use a measure based on the number of adjacent edges in the MST whose cosine is less than $-0.75$. Let $V^{(2)} \subseteq V$ be the set of all vertices of degree 2 in $V$ and let $I()$ be an indicator function. Then
$c_{striate} = \frac{1}{|V|} \sum_{v \in V^{(2)}} I(\cos \theta_{e(v,a)e(v,b)} < -0.75)$ (6)
Shape measures. The shape of a set of scattered points is our next consideration. We want to detect if a set of scattered points on the plane appears to be connected, convex, and so forth. Of course, scattered points are by definition not these things, so we need additional machinery (based on geometric graphs) to allow us to make such inferences. In particular, we will measure aspects of the convex hull, the alpha hull, and the minimum spanning tree.
Convex. Our convexity measure is based on the ratio of the area of the alpha hull and the area of the convex hull. This ratio will be 1 if the nonconvex hull and the convex hull have identical areas:
$c_{convex} = \mathrm{area}(A) / \mathrm{area}(H)$ (7)
Skinny. The ratio of perimeter to area of a polygon measures, roughly, how skinny it is. We use a corrected and normalized ratio so that a circle yields a value of 0, a square yields 0.12, and a skinny polygon yields a value near one:
$c_{skinny} = 1 - \sqrt{4 \pi \, \mathrm{area}(A)} / \mathrm{perimeter}(A)$ (8)
Stringy. A stringy shape is a skinny shape with no branches. We count vertices of degree 2 in the minimum spanning tree and compare them to the overall number of vertices minus the number of single-degree vertices:
$c_{stringy} = |V^{(2)}| / (|V| - |V^{(1)}|)$ (9)
We cube the Stringy measure to adjust for negative skew in its conditional distribution on n.
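The measures that depend only on MST edge lengths and vertex degrees (Skewed, Sparse, Stringy) can be sketched in a few lines, together with a numeric check of the Skinny normalization. This is a hedged illustration, not the authors' implementation: it builds the MST with an O(n^2) Prim's algorithm on the complete graph instead of deriving it from the Delaunay triangulation, it uses a linear-interpolation percentile, and the Skinny formula is assumed to be 1 - sqrt(4*pi*area)/perimeter, which reproduces the values stated for a circle and a square. All function names are ours.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (u, v, length)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known link from each vertex to the tree
    parent = [-1] * n
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def skewed(lengths):
    q10, q50, q90 = (percentile(lengths, q) for q in (10, 50, 90))
    return (q90 - q50) / (q90 - q10) if q90 > q10 else 0.0  # guard is ours

def sparse(lengths):
    return percentile(lengths, 90)

def stringy(edges, n):
    """Cubed ratio of degree-2 vertices to non-leaf vertices of the MST."""
    degree = [0] * n
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1
    v1, v2 = degree.count(1), degree.count(2)
    return (v2 / (n - v1)) ** 3 if n > v1 else 0.0

def skinny(area, perimeter):
    """Assumed normalization: 0 for a circle, about 0.11 for a square."""
    return 1.0 - math.sqrt(4.0 * math.pi * area) / perimeter
```

For five collinear, evenly spaced points the MST is a single path, so `stringy` returns 1, its maximum.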
Association measure. We are interested in a symmetric and relatively robust measure of association.
Monotonic. We use the squared Spearman correlation coefficient to assess monotonicity in a scatterplot. We square the coefficient to accentuate the large values and to remove the distinction between negative and positive coefficients. We assume investigators are most interested in strong relationships, whether negative or positive:
$c_{monotonic} = r^2_{spearman}$ (10)
This is the only coefficient not based on a subset of the Delaunay graph.
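The Monotonic measure is simple enough to sketch without any geometric graph. The version below computes the Spearman coefficient as the Pearson correlation of average ranks and squares it; the tie handling (average ranks) and the zero returned for a constant variable are assumptions of this sketch, and the function names are ours.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def monotonic(xs, ys):
    """Squared Spearman correlation of the paired values."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return (cov * cov) / (vx * vy) if vx > 0 and vy > 0 else 0.0

# A strictly decreasing relationship is as monotonic as an increasing one.
print(monotonic([1, 2, 3, 4], [10, 8, 5, 1]))  # 1.0
```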

Visualizing Multivariate Time Series
Some have developed viewers for multivariate time series. Theme River [24] was one of the first; it employed kernel smooths of time series, stacking them in a single display. Theme River can be quite effective for displaying up to 20 time series simultaneously, but it is not as useful for displaying raw series. Theme River trades detail for overall impact. Because it smooths and stacks, the absolute levels of the series are difficult to discern. Cleveland has discussed in more detail the problems involved in stacking time series [12] (see also [7]). Wattenberg [45] developed an applet called Name Voyager that presents interactive stacked graphs of raw series with exploratory widgets that allow the manipulation and visualization of multiple series in a single stacked display. With his tool, it is easy to drill-down to an individual series to investigate details. Name Voyager continues to be one of the more popular visualization sites on the Web, perhaps because it is so engaging and easy to use. Other recent multivariate time series viewers include [3], [41], [25], [34], [28], [5].

Time Series Pattern Search
Long time series cannot be visualized on ordinary or megapixel displays. There are not enough pixels to represent each time point in these series. The problem is especially acute in live feeds or streaming data sources because the feeds are effectively infinite [36]. A common remedy is to pan and zoom into "interesting" segments of the series with lensing or other widgets. How do we identify "interesting" segments, however? One popular method is to search for motifs or anomalous patterns in time series using statistical and data mining algorithms [33], [8], [10], [35], [29]. Time Searcher [26] contains widgets that could be effective when paired with these algorithms. Superimposing similar subseries can facilitate within-series comparisons.

Aggregation
One way to deal with multivariate series is to aggregate across similar series. If series are already categorized (within states, countries, economic sectors, hospital patients, etc.), then averaging the series is a possibility. Otherwise, one must use cluster analysis to identify clusters of similar series [37], [42], [21]. Aggregation risks concealment of important features, however. One must be sensitive to outlying series and other anomalies that can bias the aggregation.

TIMESEER
TimeSeer is a platform for visualizing Scagnostic time series. As we indicated before, our model is fundamentally different from other time series visualizations. It is based on the recognition that synchronized multivariate time series have multivariate point distributions at each time point. Our data model is a multivariate generalization of the series models employed in the papers we have reviewed. We have t time points and p variables, resulting in p-multivariate time series. For each variable, however, we have n series, resulting in a doubly multivariate distribution. We have found no visual analytic platform capable of handling this model.
Typical data for this model are: t months, p economic indicators, and n countries; t minutes, p vital signs, and n patients; t trading days, p stock indices, and n markets (exchanges); t seconds, p network protocols, and n nodes. A significant challenge for visual analytics on data like these is scalability. We normally expect t, p, and n to be large. It is not uncommon to find the product of these parameters to be in the tens of thousands. Visualizing them with conventional tools is out of the question.
We will illustrate the features of TimeSeer mainly through examples. In this section, however, we will describe the overall architecture of the system. As we have indicated, our solution to the overall problem is to regard simultaneous time points in this multivariate system as collections of point sets. Characterizing those point sets will allow us to discern patterns that we could not see with conventional time series analytics or statistics. Our system leverages juxtaposition and explicit encoding of relationships in data (through the Scagnostics) as visual comparison strategies [18].
The most obvious benefit of our parameterization is to reduce n to 1. That is, if we can characterize a scatterplot with a single measure (monotonicity, clumpiness, etc.), then we can use methods designed for ordinary multivariate time series. The tradeoff here is, of course, that we might lose detail at a given time point. Our remedy for that tradeoff is to devise a display that incorporates a pixel-scale scatterplot at each time point. An additional way we ameliorate this problem is to provide selection tools to switch easily between different Scagnostics. Our display changes almost instantly when a different Scagnostic is selected for analysis. This feature allows an analyst to focus on a particular aspect of scatterplots without excluding other possibilities. We confine our model at this point to 2D scatterplots. There is nothing preventing us from computing most Scagnostics in higher dimensions, but display issues come into play as the dimensionality increases. We believe that analysts are more familiar with 2D scatterplots than with more exotic displays, but that is a belief that requires testing in the future.
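The reduction of n to 1 can be sketched as a loop over time points and variable pairs. This is an illustration only: the nested-list layout `data[t][v]` (a list of n values for variable v at time t) is an assumed data structure, and `fraction_above_diagonal` is a deliberately trivial placeholder measure, not one of the nine Scagnostics.

```python
from itertools import combinations

def scagnostic_series(data, measure):
    """Collapse each scatterplot to one number.

    data[t][v] holds the n values of variable v at time point t.
    Returns {(p, q): [measure(points at t) for each t]} for every pair p < q.
    """
    n_vars = len(data[0])
    series = {}
    for p, q in combinations(range(n_vars), 2):
        series[(p, q)] = [measure(list(zip(frame[p], frame[q]))) for frame in data]
    return series

def fraction_above_diagonal(points):
    """Placeholder measure (hypothetical): share of points with y > x."""
    return sum(y > x for x, y in points) / len(points)

# t = 2 time points, p = 3 variables, n = 4 series.
data = [
    [[1, 2, 3, 4], [2, 3, 4, 5], [0, 0, 0, 0]],
    [[1, 2, 3, 4], [0, 1, 5, 6], [9, 9, 9, 9]],
]
series = scagnostic_series(data, fraction_above_diagonal)
print(series[(0, 1)])  # [1.0, 0.5]
```

Each value of the result is an ordinary univariate time series of length t, which is what makes standard multivariate time series methods applicable.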

The TimeSeer GUI
The TimeSeer GUI incorporates two major systems: Variable Selection SPLOM and Time Series Viewer. The first enables the analyst to select Scagnostics and then variables. We employ a scatterplot matrix and a novel lensing tool to navigate through the matrix and select cells. The selections made in the SPLOM system direct the visualization in the Time Series Viewer. Fig. 5 shows instances of this tool. The top panel shows an implementation of Table Lens [38], in which a row/column is enlarged and the remaining rows/columns are reduced. The lower two panels show our implementation, which involves a smooth lens so that distant rows/columns are reduced proportionally. Fig. 2 depicts the basic difference between Table Lens and our lensing method. In Table Lens, the Degree Of Interest (DOI) of rows/columns outside the lensing area is uniformly small. In our lensing method, we implement a smooth transition in DOI proportional to the distance of a frame to the lensing frame. This lensing technique is similar to Cartesian Fisheye View [31].
The following algorithm shows how we extend Table Lens to achieve smooth lensing. For simplicity, we explain the algorithm in one dimension, say the X-axis, and on one side of the lensing area, say the right. The 2D smooth lensing can be achieved by applying the same algorithm to both sides and both dimensions.
1. As with Table Lens, we increase the width of the lensed column to $W_{max}$. Let $k$ be the number of columns to the right of the lensed column; the widths $W_i$ of these columns are reduced, as in Table Lens, to a uniformly small width.
2. Now we compute a lensing factor $s = (W_{small} - W_{min}) / ((k - 1)/2)$, where $W_{small}$ is the smallest width (of the farthest column) that we want to lens.
3. The widths of the columns are recomputed by $W_i = W_i + s((k + 1)/2 - i)$ for $i = 1, \ldots, k$.
Fig. 3 shows an example of our lensing method on the X-axis where $W_{max} = 98$, $W_{small} = 34$, $W_{min} = 10$, $k = 7$, and $s = 8$.
All widths are in pixels.
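The lensing factor and the width update can be checked numerically against the Fig. 3 values. The sketch below is hedged: the text does not state the uniform width the columns start from, so `w_uniform` is a hypothetical parameter, and setting it equal to W_small is our reading. Under that reading the farthest column lands exactly on W_min and the total width of the k columns is preserved.

```python
def smooth_lens_widths(k, w_small, w_min, w_uniform):
    """Widths of the k columns on one side of the lensed column.

    s   = (W_small - W_min) / ((k - 1) / 2)
    W_i = W_i + s * ((k + 1) / 2 - i),  i = 1..k (i = 1 is nearest the lens)
    w_uniform is the assumed starting width of every column (hypothetical).
    """
    if k < 2:
        return 0.0, [float(w_uniform)] * k  # nothing to grade with one column
    s = (w_small - w_min) / ((k - 1) / 2)
    widths = [w_uniform + s * ((k + 1) / 2 - i) for i in range(1, k + 1)]
    return s, widths

s, widths = smooth_lens_widths(k=7, w_small=34, w_min=10, w_uniform=34)
print(s)       # 8.0
print(widths)  # [58.0, 50.0, 42.0, 34.0, 26.0, 18.0, 10.0]
```

Because the offsets s((k + 1)/2 - i) sum to zero over i = 1..k, the update redistributes width among the columns without changing their total.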
After we have selected pairs in the SPLOM system, we go to the Time Series Viewer where time series in selected cells are expanded in a full window as depicted in Fig. 7. This main window contains multiple Scagnostic series at the bottom and scatterplots at each timepoint. A variety of pan-and-zoom tools facilitate smooth navigation throughout this window. As depicted in Figs. 7b and 7c, the same smooth lensing technique is applied on the X-axis.
Simple buttons offer alternate views and filtering. The Time Series Viewer is discussed in greater detail in Section 4.3.
We could elaborate on details in this section, but it is easier to understand the workings of this rather large application by looking at real data examples. We will explain technical details as we cover various capabilities of the program.

EXAMPLES
In this section, we use two different data sets to demonstrate the performance of TimeSeer. The first is a series of US Employment data and the second is a series of US Weather data. The US Employment data comprise monthly employment statistics for 50 states over 22 years from 1990 to 2011. The data were retrieved from http://www.bls.gov/. There are 25 variables in the collected data: Total Nonfarm; Construction; Manufacturing; Non-Durable Goods; Trade, and Transportation; Wholesale Trade; Retail Trade; Transportation, and Utilities; Financial Activities; Real Estate and Leasing; Professional and Business; Scientific and Technical; Administrative and Support; Education and Health; Educational Services; Social Assistance; Leisure and Hospitality; Arts and Entertainment; Accommodation and Food; Other Services; Government; Federal Government; State Government; Local Government; and State Employment. For these data, we have 78,600 scatterplots with 50 data points each to examine.
The Weather data comprise hourly meteorological measurements over a year from the Gulf of Maine in 2008. There are 17 variables represented in the data set: current speed, current direction, temperature, East Current Velocity, North Current Velocity, significant wave height, dominant wave period, air temperature, wind speed, wind gust, wind direction, visibility, barometric pressure, water temperature, salinity, sigma-T, and conductivity. Data and variable descriptions can be found at http://gyre.umeoce.maine.edu/buoyhome.php. For these data, we have 50,000 scatterplots with 24 data points (24 hours in a day) each to examine.
We begin with the US Employment data.

Variable Selection SPLOM
With p variables, there are p(p - 1)/2 pairs of variables. To do our analysis, we need to: 1) select the Scagnostic of interest: Outlying, Monotonic, Stringy, Skinny, Sparse, Striated, Convex, Clumpy, or Skewed; 2) select a criterion to order variables in the SPLOM: mean or variance of the Scagnostic series; and 3) select a subset of the scatterplots, either by picking individual frames in the scatterplot matrix or by picking all frames corresponding to a single variable.
The mean and variance of a Scagnostic time series (for a pair of variables) are computed over the time points of that series, as shown in (11) and (12), where T is the number of data points in the time series and p and q are two variables. The Scagnostic mean and variance of a variable p are computed by averaging over all pairs of variables containing p as an element, as shown in (13) and (14), where V is the number of variables. The mean or variance of the variables (depending on which one we have selected) is used to order the variables in the SPLOM.
We offer both mean and variance for ordering Scagnostics series because each captures a different aspect of the Scagnostic process that might interest an analyst. The mean selection offers the opportunity to pick series with extremely high or low series means on a Scagnostic. The variance selection ranks by variability, so that single peaks and valleys in the Scagnostic time series will be more discernible in the main time series window.
Fig. 4a shows the scatterplot matrix for 25 variables in the US Employment data. We have selected the Outlying measure and sorted the variables by their means. In particular, each plot (each pair of variables) is colored by its mean of the selected Scagnostic time series; the embedded small graph shows a thumbnail of the actual Scagnostic time series. At the top of Fig. 4a is the color legend for the mean of the Outlying Scagnostic time series.
We use a Kelvin color temperature scale [32] to encode the range of all possible Outlying mean values with red corresponding to high values of means and green corresponding to low values of means. This range (always within the 0 and 1 interval) is different when we select a different Scagnostic feature. TimeSeer sorts the variables so that low Outlying series are at the bottom and high Outlying series are at the top. Notice that we also color variable names to differentiate and group them by categories and subcategories.
Single plot selection is depicted in Fig. 4a. This mode is invoked by clicking on any of the panes in the scatterplot matrix. This selection mode allows the analyst to investigate specific Scagnostic series that show interesting patterns of behavior among the two featured variables. Single variable selection mode is depicted in Fig. 4b. This mode is invoked by clicking on the angled variable names to the right of the scatterplot matrix diagonal. The figure shows Total Nonfarm selected. Black rectangles are used to denote selected plots. This selection mode allows the analyst to examine all variables paired with a specific variable of interest.
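The ordering criterion used in this section can be sketched as follows. The sketch makes several assumptions that the text does not pin down: the variance is the population variance (divisor T), a variable's score is the plain average over the pairs that contain it, and variables are sorted from high to low. The dictionary layout and all function names are ours.

```python
def series_mean(values):
    return sum(values) / len(values)

def series_variance(values):
    """Population variance (divisor T); the divisor is an assumption."""
    m = series_mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variable_scores(series, n_vars, stat):
    """Average stat(series) over all variable pairs containing each variable."""
    scores = []
    for p in range(n_vars):
        vals = [stat(s) for pair, s in series.items() if p in pair]
        scores.append(sum(vals) / len(vals))
    return scores

def splom_order(series, n_vars, stat=series_mean):
    """Variable indices sorted from highest to lowest score."""
    scores = variable_scores(series, n_vars, stat)
    return sorted(range(n_vars), key=lambda p: scores[p], reverse=True)

# One Scagnostic series per variable pair (toy values, three variables).
outlying = {(0, 1): [0.25, 0.5], (0, 2): [0.5, 1.0], (1, 2): [0.0, 0.25]}
print(splom_order(outlying, n_vars=3))  # [0, 2, 1]
```

Passing `stat=series_variance` switches the ordering to the variance criterion without changing anything else.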

Lensing
With many more variables, the Scagnostic time series will be difficult to discern inside the scatterplot matrix. Consequently, we added zooming and lensing to the matrix. If one hovers over a plot, the graph inside is enlarged. There are two types of lensing. Fig. 5a shows an implementation of Table Lens, and the lower two panels of Fig. 5 show our smooth lensing. Unlike the standard implementation of Table Lens, our smooth lensing offers a smooth transition when we move the mouse over different plots. Therefore, it is easier to keep track of both the whole context (the SPLOM) and the focused area.
After a user has selected variables from the scatterplot matrix, all pairwise combinations are displayed in the controller window, as depicted in Fig. 6 (eight pairs of variables are selected). The other eight Scagnostic measures on each pair of variables are presented on the same row as the selected pair, but they are faded. Each plot is colored by the mean of that feature over a year. Additionally, all pairs are reordered by the selected criterion: the mean of the selected Outlying feature. In Fig. 6, the third pair is brushed, and other features of the third pair are highlighted. We can keep track of the position of the brushed pair in the scatterplot matrix by highlighting the graph inside this selection display. Both views are linked. Moreover, variable names are colored the same way as they appear in the SPLOM. This helps in locating variables in both views.

Time Series Viewer
After we have selected pairs, we go to the Time Series Viewer. There are several ways to visualize multiple time series: small multiples or multiples superimposed with or without lensing. Fig. 7 is an example. The series are built from ones we selected in the controller based on the US Employment data. We selected Monotonic as the Scagnostic for this example, and we chose nine pairs of variables sorted by their means. Notice the slanted orientation of the second variable in each pair. This device helps the viewer to understand which variable is on the X-axis and which variable is on the Y-axis. Fig. 7a shows nine small multiples corresponding to nine pairs. The lensing in the Y-dimension is applied to the first three series, colored red, green, and blue, respectively; the other pairs are colored gray and greatly reduced in size. We also employ a gradient on the lensed series to make the profiles more discernible and to coordinate highlighting with the scatterplots at the top of the window. The larger the series value, the lighter the coloring (like snow on mountains). This use of brightness also facilitates the highlighting of the scatterplots at the top. Each scatterplot corresponds to the appropriate colored Scagnostic series directly below it, and is highlighted with the same brightness as the point on the series.
We can change the number of pairs in the lensing area, or replace them with other pairs from the nonlensing area, by a simple click and drag. Moreover, we can resize the Y-dimension of a Scagnostic time series (in both the lensing and nonlensing areas) with a simple mouse scroll. This helps to fit different numbers of Scagnostic time series into the fixed height of the application window, as users can always go back to the SPLOM to select different pairs of variables.
For the arrangement we have selected, we notice that the Monotonic Scagnostic shows a distinct seasonal pattern with an annual cycle. This is consistent with what we would expect for variables related to farming.
In the US Employment data, there are 262 data points on each Scagnostic time series. As we can see in Fig. 7a, however, the data points are so crowded that we cannot read any details within a season. The common solution in this case is to display a selected season or interval. One limitation of this method is that when we select a season or interval, we lose the overall context of the time series. As a remedy, we chose to implement smooth lensing for the X-dimension. When we lens a season, we can still see what is going on in the other seasons throughout the entire time series. We can simply move the mouse to enlarge a different season. Moreover, smooth lensing allows continuous transition as we move the mouse. Fig. 7b shows a lens applied to the Scagnostic series. Vertical lensing is applied to the first three pairs of variables and horizontal lensing is applied to two seasons (highlighted in Box A). The lensing works over the series as well as the scatterplots, so we are able to investigate individual scatterplots to see the configuration of points that led to the value of the Scagnostic shown in the series. Fig. 7c shows an alternate view of the same series. This view superimposes scatterplots on line graphs of the Scagnostics series. Such an arrangement allows investigation of individual scatterplots without anchoring or reference to a row of scatterplots elsewhere in the window. We believe this layout is useful once interesting segments are found in the series. In any case, toggling between views can be done in an instant.

Filtering, Brushing, and Drill-Down
Information visualization systems should allow one to perform analysis tasks that largely capture people's activities while employing information visualization tools for understanding data [2]. In the rest of this section, we describe four basic analysis tasks implemented in TimeSeer: filtering, brushing, drill-down, and searching.

Filtering
We employ a gradient on the Scagnostic series and the scatterplots at the top to help users locate scatterplots with high Scagnostic values. However, the gradient alone does not let users isolate scatterplots whose Scagnostic values fall in a specific interval (for example, Monotonicity from 0.6 to 0.8). The range sliders on the left of each Scagnostic time series allow users to do that. Fig. 8c shows an example of filters applied to the Outlying series for the three pairs in the lensing area. We are looking for outliers. When the user moves a range slider, a number is displayed to show the current value on the range slider. The filtered-out parts of the time series are faded. In the pairwise view area, the filtered-out distributions are faded so viewers can focus on data distributions with high numbers of outliers.
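The effect of a range slider can be sketched as a simple interval test on one Scagnostic series. The function name filter_by_range and the sample values below are hypothetical; only the interval 0.6 to 0.8 comes from the example in the text.

```python
def filter_by_range(series, low, high):
    """Return one flag per time point: True if the Scagnostic value lies
    in [low, high] and stays visible, False if it would be faded."""
    return [low <= value <= high for value in series]

# Illustrative Monotonic values for four time points (made-up numbers).
monotonic = [0.20, 0.65, 0.80, 0.95]
visible = filter_by_range(monotonic, 0.6, 0.8)
```

A display layer would then draw the flagged time points at full strength and fade the rest, both in the series and in the corresponding scatterplots.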

Brushing
Looking at an interesting distribution, we may want to inspect individual data points for further details. For example, we may want to see which state is the outlier in an Outlying distribution. Or, we may ask where New York is in the overall picture for 2001. Or, we may want to compare Illinois and California in 2011. We implement a brushing tool allowing users to do these things. Fig. 8d illustrates the use of a brush. When we brush a state (a data point in a scatterplot), the state name is displayed in a tool tip, the same state is highlighted in other plots, and a line appears connecting adjacent plots. This reveals, in effect, a spatial-temporal series. We can see the changes in the orientation of the state in scatterplots over time. This kind of detail view provides information that cannot be discerned in the original raw time series. In Fig. 8a, we see a peak in several Scagnostic series. This suggests a time point in 2005 (highlighted in Box A) in which we would expect to see outliers in the relevant scatterplots. We lens this region in Fig. 8b. We see that a period in the Fall has an unusually high peak. This is the precise point where we expect to find an outlier. In Fig. 8c, filtering out low outlying plots allows us to focus on outliers. We now can see clearly the high outlying plots in all selected time series in September of 2005.

Drill-Down
We get our details on demand, as shown in Fig. 8d, by clicking on the red scatterplots. Subsequently, the raw series of State Employment and Accommodation and Food (which are the 1-month net change in employment rate) are displayed in the lower graph on two separate scales (in cyan and pink). By brushing the outlier in the scatterplot of September of 2005, we see the actual line graph for that brushed state in yellow. In this case, the outlier is Louisiana. Hurricane Katrina wreaked havoc on its employment and productivity figures (note the sudden drop in Louisiana's employment rate and in many industries). Notice that Louisiana is also the outlier in the scatterplot of December of 2005, even as it recovered.
In Fig. 8e, we brush another outlier in the State Employment and Accommodation and Food scatterplot, Mississippi. Similarly, Mississippi is also an outlier in the scatterplot of December of 2005. However, Mississippi's situation is different (Mississippi suffered another, even more serious drop in Accommodation and Food in December of 2005). We can obtain that information just by looking at the scatterplots. Further details can be found in the raw data time series.

Searching for Similar Patterns or Interesting Distributions
where S and P are two arrays of nine Scagnostics of the two scatterplots.

Automatic Search for Similar Distributions
In Fig. 9, we show the result of the user selecting a plot from the main screen and requesting a search. TimeSeer searches and plots the top five most similar scatterplots, as characterized by the selected Scagnostic (Outlying in this case). In this example, taken from the US employment data, we have selected a plot with a high outlier on State Employment against Accommodation and Food in September of 2005. In the lower panel, the first plot on the left is the plot we selected, highlighted by a yellow rectangle. The five most similar plots are ordered over the nine Scagnostics (the smaller the index, the more similar). Again, the background of a plot is colored by the time series containing that plot; saturation encodes the value of the feature of interest (the brighter the shade, the more salient the Scagnostic). We can also see the time (at the top of each plot) at which the data distributions happen to be similar. The Scagnostics of the selected plot and the top five plots are also grouped and ordered appropriately. From Fig. 9, we note that Louisiana and Mississippi are outliers in all six plots. Additionally, four out of five similar plots are in the same month (September of 2005). This tells us that Hurricane Katrina affected Louisiana employment in the selected economic sectors.
The last similar plot comes from a different month but has Scagnostics similar to those of State Employment against Accommodation and Food in September of 2005. However, the situation in December of 2005 is different. While Louisiana had recovered in both Employment Rate and Accommodation and Food, Mississippi was still struggling. Fig. 10 shows a similar result for the Weather data. We have selected a plot with a high Stringy and Skinny Scagnostic value on barometric pressure versus air temperature and searched for similar distributions in the same time series. This is a rather fascinating example of an unusual relation between variables that would not be evident in summary statistics such as the Pearson correlation. It is well known that air temperature and barometric pressure are related, but these plots make clear that it is not a simple functional relationship. By searching for Stringy Scagnostics, we see that this dynamic relationship between barometric pressure and air temperature has little error (the strings/paths are quite smooth) but is highly nonlinear (they wind around instead of following a straight line).
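A search of this kind can be sketched as a ranking of nine-element Scagnostic vectors by their dissimilarity to the selected plot. The dissimilarity formula is not reproduced in this section, so the Euclidean distance used below is an assumption, as are the function names dissimilarity and most_similar and the dictionary layout that maps a plot key to its vector.

```python
import math

def dissimilarity(s, p):
    """Assumed measure: Euclidean distance between two arrays of nine
    Scagnostics. The paper's own formula may differ."""
    if len(s) != len(p):
        raise ValueError("Scagnostic arrays must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, p)))

def most_similar(target_key, scagnostics, k=5):
    """Return the keys of the k plots closest to the target plot.

    `scagnostics` maps a plot key, e.g. (time, variable pair), to its
    array of nine Scagnostic values.
    """
    target = scagnostics[target_key]
    candidates = [key for key in scagnostics if key != target_key]
    candidates.sort(key=lambda key: dissimilarity(target, scagnostics[key]))
    return candidates[:k]
```

Ranking on the nine summary values rather than on the raw points is what lets the search return plots from different months and different variable pairs in a single ordered list.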

Manual Search for Similar Distributions
We have devised an annotation that allows a user to search for similar plots manually. The user selects a plot from the main screen and TimeSeer computes the Scagnostic dissimilarity of each plot compared to the selected plot. It then displays this dissimilarity underneath the series. Fig. 11a shows an example. We have selected an Outlying Scagnostic. We also selected the scatterplot in State Employment against Accommodation and Food in September of 2005. The similarity at each time point compared to the selected scatterplot is represented by the saturation (in purple) of the bar under it; the higher the saturation, the more similar the scatterplots. The slider at the bottom is used to filter similarity. Above this slider is the dot histogram showing the similarity distribution of all plots in the six selected time series, colored accordingly. Notably, as we have selected a high outlying plot, the brighter plots tend to appear toward the front of the slider and vice versa.
The user can filter these plots to see only the most similar ones. Fig. 11b shows an example. All plots with a dissimilarity greater than 0.5 have been filtered out (by using the slider). The user can brush on the remaining purple dissimilarity bars or on the dot histogram to check the dissimilarity. When the mouse is over a purple bar or a plot in the dot histogram, a small window appears right below the purple bar. In this window, a new scatterplot and its Scagnostics are plotted next to the selected plot. In the example in Fig. 11b, we can see that the distribution of State Employment against Education and Health in September of 2005 (and its Scagnostic histogram) is similar to the distribution of State Employment against Accommodation and Food (and its Scagnostic histogram) at the same time. Fig. 12 shows a similar result for the Weather data. We have selected a plot with a high Striated and Skinny Scagnostic value on temperature versus salinity on day 64 and searched for similar distributions in the selected time series.
We should note that the time for searching for similar distributions does not depend on the number of data points. The time for comparing two scatterplots is $O(1)$ compared to $O(n)$ because we are searching on nine Scagnostics, not on individual points. Therefore, the time to search for distributions similar to one in a selected plot is $O(t \times p^2)$ instead of $O(t \times p^2 \times n)$. In a real-time application with a huge time series involving multiple variables and many data points at each time point, we can cache Scagnostics at each time point. When we need to find plots having a distribution similar to a target plot, we can exploit our cache. The time saved in this approach is considerable.
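The caching strategy can be sketched as follows. This is an illustration under stated assumptions, not TimeSeer's code: toy_features is a stand-in that returns two means per plot rather than the nine Scagnostics, and the names FeatureCache and search are hypothetical. The point is only that the O(n) pass over the raw points happens once per plot, after which every comparison works on fixed-length vectors.

```python
import math

def toy_features(points):
    """Stand-in for a Scagnostics computation (NOT the real measures).
    Makes one O(n) pass over the points, returns a fixed-length vector."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)

class FeatureCache:
    """Computes the feature vector of each plot at most once."""

    def __init__(self, compute):
        self._compute = compute
        self._store = {}
        self.computations = 0  # how many times raw points were scanned

    def get(self, key, points):
        if key not in self._store:
            self._store[key] = self._compute(points)
            self.computations += 1
        return self._store[key]

def search(target_key, plots, cache):
    """Rank all other plots by distance to the target's cached vector."""
    target = cache.get(target_key, plots[target_key])
    scored = []
    for key, points in plots.items():
        if key == target_key:
            continue
        scored.append((math.dist(target, cache.get(key, points)), key))
    scored.sort()
    return [key for _, key in scored]
```

After the first search has filled the cache, later searches from any target touch only the cached vectors, so their cost no longer grows with the number of points per scatterplot.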

CONCLUSIONS
TimeSeer is a visual analytic tool for analyzing a doubly multivariate time series. It highlights the strength of visual analytics itself because statistical modeling of this type of series is problematic. There are no off-the-shelf algorithms for dealing with the doubly multivariate time series design, even in advanced statistical packages like SAS or R and even in existing visual analytics platforms.
It should be clear that TimeSeer is not a simple application designed for nontechnical users. To leverage its capabilities, a user needs to become familiar with scatterplots, scatterplot matrices, Scagnostics, and multivariate time series. Consequently, we have focused on giving this relatively sophisticated class of analysts a set of tools that enables searches for structure in very high dimensional time series spaces. There are surely ways these tools can be improved, and a study of user interactions can help with this task. Nevertheless, our first challenge has been to devise an interactive platform that can handle huge multivariate collections like the BLS and weather data sets without running out of memory, time, or screen resolution.
One might wonder whether data that fit this model are rare and whether the model itself is esoteric. We believe the opposite is true. We have cited examples in economics, security, medicine, and other fields that indicate how prevalent these types of data are. An exploratory tool that provides an integrated analytic environment for these types of series can make it possible for the first time to examine real-world data sets that arise from massive data collection systems and sensor networks. The use of Scagnostics also provides an ordinary-language descriptor for distinctive patterns in time series. We see the power of this descriptive language when we compare the plots in Figs. 9 and 10. It is appropriate and obvious to characterize the first as Outlying and the second as Stringy.
TimeSeer is sensitive to how data is normalized. We normalize each variable independently to have the range 0 to 1. Variables that change by the same amount on the normalized scale, even with vastly different relative values, produce scatterplots that appear similar, and Scagnostics do not help detect these changes, especially if the changes in absolute values are small.
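A short sketch makes this sensitivity concrete. It assumes plain min-max scaling of each variable; the function name normalize, the handling of constant variables, and the sample values are illustrative assumptions.

```python
def normalize(values):
    """Rescale one variable independently to the range 0 to 1.
    Mapping a constant variable to all zeros is an assumption."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Two made-up variables with very different absolute changes.
small_change = [100, 101, 102]
large_change = [1000, 2000, 3000]

# Both become [0.0, 0.5, 1.0] after normalization, so any scatterplot
# built from them, and any Scagnostic of it, looks the same.
same = normalize(small_change) == normalize(large_change)
```

The two variables become indistinguishable once rescaled, which is why differences of this kind have to be checked against the raw series.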
The enormous compression we achieve by collapsing n to 1 through the use of Scagnostics provides TimeSeer with the scalability to handle huge data sets. Subcomponents of the system can deal with thumbnails rather than multiple raw series.
Finally, we plan to investigate the use of TimeSeer on large security databases to assess the gains we claim for its performance. In addition, we expect to investigate how TimeSeer can be extended to spatial data. Time and space have similar statistical issues when modeling [13], so extending time series analytics to spatial analytics makes sense.
Tuan Nhon Dang received the BSc degree in computer science from the University of Technology, Ho Chi Minh City, Vietnam in 2006, the MSc degree in computer science at the University of Illinois at Chicago, in 2009, and the MSc degree in computing systems at Politecnico di Milano, Italy. Currently, he is working toward the PhD degree in computer science at the University of Illinois at Chicago. His research interests include visual analytics and computer animation.
Anushka Anand received the BSc degree in computer science from the American University of Sharjah, United Arab Emirates and the MSc degree in computer science from the University of Illinois, Chicago. She is working toward the PhD degree in computer science at the University of Illinois, Chicago. Currently, she serves as a student member of the Board of Trustees of the Anita Borg Institute and as a chair of the Chicago Chapter of the ACM. Her research interests include visual analytics and data mining.