Efficient Similarity Search In Sequence Databases

. We propose an indexing method for time sequences for processing similarity queries. We use the Discrete Fourier Transform (DFT) to map time sequences to the frequency domain, the crucial observation being that, for most sequences of practical interest, only the (cid:12)rst few frequencies are strong. Another important observation is Parseval's theorem, which speci(cid:12)es that the Fourier transform preserves the Euclidean distance in the time or frequency domain. Having thus mapped sequences to a lower-dimensionality space by using only the (cid:12)rst few Fourier coe(cid:14)cients, we use R (cid:3) - trees to index the sequences and e(cid:14)ciently answer similarity queries. We provide experimental results which show that our method is superior to search based on sequential scanning. Our experiments show that a few coe(cid:14)cients (1-3) are adequate to provide good performance. The performance gain of our method increases with the number and length of sequences. (cid:3) On sabbatical from the Dept. of Computer Science, University of Maryland, College Park. This research was partially funded by the SystemsResearchCenter (SRC) at the University of Maryland, and by the National Science Foundation under Grant IRI-8958546 (PYI), with matching funds from EMPRESS Software Inc. and Thinking Machines Inc.


Introduction
Sequences constitute a large portion of data stored in computers. There have been several e orts to model timesequenced data, to design languages to query such data, and to develop access structures to e ciently process such queries (see 25] for a bibliography). Most of the work, however, has focussed on \exact" queries. New emerging applications, particularly database mining applications 2], require that databases be enhanced with the capability to process \similarity" queries. The following are some examples of the similarity queries over sequence databases: Identify companies with similar pattern of growth. Determine products with similar selling patterns. Discover stocks with similar movement in stock prices. Find if a musical score is similar to one of the copyrighted scores. Similarity queries can be classi ed into two categories: a. Whole Matching. The sequences to be compared have the same length n.
b. Subsequence Matching. The query sequence is smaller; we look for a subsequence in the large sequence that best matches the query sequence. We concentrate on whole matching, and present an indexing technique that can be used to e ciently process such queries. Within the whole matching case, we consider the following problems: a1. Range Query. Given a query sequence, nd sequences that are similar within distance . a2. All-Pairs Query (or`spatial join').
Given N sequences, nd the pairs of sequences that are within of each other. The parameter is a distance parameter that controls when two sequences should be considered similar. It could be either user-de ned, or determined automatically (eg., =10% of the 'energy' of the query sequence; see Eq. 3 for the de nition of 'energy').
Approximate matching has been attracting increasing interest lately. Motro described a user interface for vague queries 18]. Shasha and Wang 24] proposed an indexing method that uses the triangular inequality and some precomputed distances to prune the search. However, the space overhead of the method seems quadratic on the number of objects, which may make it prohibitive for large databases. Aurenhammer 5] surveyed recent research on Voronoi diagrams, along with their use for nearest neighbor queries. Although Voronoi diagrams work well for approximate matches in 2-dimensional spaces, they need intricate transformations to work for a 3-d space, and they do not work at all for higher dimensionalities. Jagadish 15] suggested using a few minimum bounding rectangles to extract features from shapes and subsequently managing the resulting vectors using a spatial access method, like k-d-B-trees, grid les, etc.
For numerical sequences, we propose extracting k features from every sequence, mapping it to k-dimensional space, and then using a multidimensional index to store and search these points. The multidimensional indexing methods currently in use are R -trees 6] and the rest of the Rtree and k-d-Btree family 12,14,16]; linear quadtrees 22]; and grid- les 19]. There are two subtle problems with this approach that must be addressed: Completeness of feature extraction: How to extract features, and how to guarantee that we do not miss any qualifying object (time sequence, in our case). To guarantee no \false dismissal", objects should be mapped to points in k-dimensional space such that the Euclidean distance in the k-dimensional space is less than or equal to the real distance between the two objects. Dimensionality \curse": Most multidimensional indexing methods scale exponentially for high dimensionalities, eventually reducing to sequential scanning. For linear quadtrees, the e ort is proportional to the hyper surface of the query region 13]; the hyper surface grows exponentially with the dimensionality. Grid les face similar problems, since they require a directory that grows exponentially with the dimensional-ity. The R-tree based methods seem to be most robust for higher dimensions, provided that the fanout of the R-tree nodes remains > 2. Experiments 21] indicate that Rtrees work well for up to 20 dimensions. The feature extraction method should therefore be such that a few features are su cient to di erentiate between objects.
We propose to use the Discrete Fourier Transform 20] for feature extraction. Given a sequence, we transform it from the time domain to the frequency domain. We then index only on the rst few frequencies, dropping all other frequencies. This approach addresses the two problems cited above as follows: Completeness of feature extraction: Parseval's theorem 20], discussed in Section 2, guarantees that the distance between two sequences in the frequency domain is the same as the distance between them in the time domain. Dimensionality curse: As we discuss in subsection 3.3, a large family of interesting sequences exhibit strong amplitudes for the rst few frequencies. Using the rst few frequencies then avoids the dimensionality problem, while still introducing few false hits. The false hits are removed in a post-processing step.
The organization of the rest of the paper is as follows. Section 2 gives some background material on the Discrete Fourier Transform, and introduces Parseval's theorem that provides the basis for the indexing technique we propose. A resume of our indexing technique is given in Section 3.
We also justify our choice of similarity measure and the selection of DFT for feature extraction in this section. Section 4 contains performance experiments that empirically show the e ectiveness of our technique. We conclude with a summary in Section 5.
The signalx can be recovered by the inverse transform: x t = 1= p n n?1 X f=0 X f exp (j2 ft=n) t = 0; 1; : : :; n?1(2) X f is a complex number (with the exception of X 0 , which is a real, if the signalx is real). There are some minor discrepancies among books: some de ne X f = 1=n P n?1 t=0 : : : or X f = P n?1 t=0 : : :. We have followed the definition in (Eq 1), for it simpli es the upcoming Parseval's theorem (Eq 4).
ax t ]() aX f ] (8) Also, a shift in the time domain changes only the phase of the Fourier coe cients, but not the amplitude.
x t?t0 ]() X f exp (2 ft 0 j=n)] (9) Given the above, Parseval's theorem gives kx ?ỹ k 2 kX ?Ỹ k 2 (10) The latter implies that the Euclidean distance between two signalsx andỹ in the time domain is the same as their Euclidean distance in the frequency domain.
We believe that for a large number of time sequences of practical interest, there will be a few frequencies with high amplitudes. Thus, if we index only on the rst few frequencies, we shall have few false hits. This is a key observation for our proposed method.

Proposed Technique
We propose using the square root of the sum of squared di erences as the distance function between two sequences.
Speci cally, the distance D(x;ỹ) between two sequencesx andỹ is the square root of the energy of the di erence: If this distance is below a user-de ned threshold , we say that the two sequences are similar.
The importance of Parseval's theorem (Eq 4) is that it allows to translate the query from the time domain to the frequency domain. Coupled with the conjecture that few Fourier coe cients are enough, it allows us to build an effective index with a low dimensionality.
The following is a resume of our proposed technique: 1. Obtain the coe cients of the Discrete Fourier Transforms of every sequence in the database.

Build a multidimensional index us-
ing the rst f c Fourier coe cients, where f c stands for`cut-o frequency'. Thus, each sequence becomes a point in a 2f c -dimensional space (recall that the Fourier coefcients are complex numbers). We discuss in subsection 3.3 why f c can be taken to be small (< 5). As discussed earlier, we recommend the R -trees as the indexing structure, since it has been shown to work well for at least up to 20 dimensions 21]. This index will be called`F-index' henceforth. 3. For a range query, obtain the rst f c Fourier coe cients of the query sequence. Use the F -index to retrieve the set of matching sequences that are at most distance away from the query sequence. 4. For an all-pairs query, we do a spatial join using the F -index. The result of the join will be a superset of the answer set. 5. The actual answer set is obtained in a post-processing step in which the actual distance between two sequences is computed in the time domain and only those within distance are accepted.
The`completeness' of this method is based on the following lemma: We only give the proof for range queries; the proof for`all-pairs' queries is very similar. Suppose we want all sequencesx that are similar to a query sequenceq, within distance , i.e.: D(x;q) (12) (14) Keeping only the rst f c < n coecients, we have Thus, equation (14) implies the following condition fc?1 X f=0 jX f ? Q f j 2 2 (16) In other terms, the condition of (Eq. 16) will retrieve allX that are in the answer, plus some false hits. Thus, our index acts as a lter that returns a superset of the answer set.

Choice of Similarity Measure
The similarity measure is clearly application-dependent. Several similarity measures have been proposed, for In fact, the Euclidean distance is the optimal distance measure for estimation 11], if signals are corrupted by Gaussian, additive noise. Thus, ifq is our query andx is a corrupted version of it in the database, a searching method using the Euclidean distance should produce good results.
A valuable feature of the Euclidean distance is that it is preserved under orthonormal transforms. Other distance functions, like the L p norms L p (x;ỹ) = ( X jx t ? y t j p ) 1=p (17) do not have this property, unless p = 2 (because L 2 Euclidean distance).

Using DFT
Having decided on the Euclidean distance as the distance measure, we would like a transform that (a) preserves the distance (b) is easy to compute and (c) concentrates the energy of the signal in few coe cients. The distance-preservation requirement is met by any orthonormal transform 10], DFT being one of them. Orthonormal transforms form two classes: (1) the data-dependent ones, like the Karhunen-Loeve (K-L) transform, which need all the data signals to determine the transformation matrix and (2) the data-independent ones, like the DFT, Discrete Cosine (DCT), Harr, or wavelet transform, where the transformation matrix is determined a-priori.
The data-dependent transforms can be ne-tuned to the speci c data set, and therefore they can achieve better performance, concentrating the energy into fewer features in the feature vector. Their drawback is that, if the data set evolves over time, e.g., a recomputation of the transformation matrix may be required to avoid performance degradation, requiring expensive data reorganization. We, therefore, favor data-independent transforms.
Among them, we have chosen the DFT because it is the most well known, its code is readily available and it does a good job of concentrating the energy in the rst few coe cients, as we shall see next. In addition, the DFT has the attractive property that the amplitude of the Fourier coe cients is invariant under shifts (Eq. 9). Thus, using Fourier transforms for feature extraction has the potential that our technique can be extended to nding similar sequences ignoring shifts.
Note that our approach can be applied with any orthonormal transform. In fact, our response time will improve with the ability of the transform to concentrate the energy: the fewer the coefcients that contain most of the energy, the faster our response time. Thus, the performance results presented next are just pessimistic bounds; better transforms will achieve even better response times.

Using Few Fourier Coe cients for Indexing
Using a small value for the number of Fourier coe cients retained f c does not a ect the correctness | the F -index is a lter that returns a superset of the answer set. However, our proposed technique will not be very e ective if the choice of a small f c results in a large number of false hits. The worst-case signal for our method is white noise, where each value x t is completely independent of its neighbors x t?1 , x t+1 . The energy spectrum of white noise follows O(f 0 ) 23], that is, it has the same energy in every frequency. This is bad for the F -index, because it implies that all the frequencies are equally important. However, we have strong reasons to believe that real signals have a skewed energy spectrum. For example, random walks (also known as brown noise or brownian walks) exhibit an energy spectrum of O(f ?2 ) 23], and therefore an amplitude spectrum of O(f ?1 ). Stock movements and exchange rates have been successfully modeled as random walks (e.g., 8,17]).
Using the data set available through ftp from s .santafe.edu, we show in 1] that the Fourier transform of the movement of the exchange rate between the Swiss franc and the US dollar follows closely the same 1=f behavior as for a random walk.
Our mathematical argument for keeping the rst few Fourier coecients agrees with the intuitive argument of the Dow Jones theory for stock price movement (see, for example , 9]). This theory tries to detect primary and secondary trends in the stock market movement, and ignores minor trends. Primary trends are de ned as changes that are larger than 20%, typically lasting more than a year; secondary trends show 1/3-2/3 relative change over primary trends, with a typical duration of a few months; minor trends last roughly a week. From the above de nitions, we conclude that primary and secondary trends correspond to strong, low frequency signals while minor trends correspond to weak, high frequency signals. Thus, the primary and secondary trends are exactly the ones that our method will automatically choose for indexing.
In addition to stock movements and exchange rates, it is believed that several families of real signals are not white noise. For example, 2-d signals, like photographs, are far from white noise, exhibiting a few strong coe cients in the lower spatial frequencies. The JPEG image compression standard 26] exactly exploits this phenomenon, e ectively ignoring the highfrequency components of the Discrete Cosine Transform, which is closely related to the Fourier transform. If the image consisted of white noise, no compression would be possible at all. Birkho 's theory 23] claims that interesting' signals, such as musical scores and other works of art, consist of pink noise, whose energy spectrum follows O(f ?1 ). The argument of the theory is that white noise with O(f 0 ) energy spectrum is completely unpredictable, while brown noise with O(f ?2 ) energy spectrum is too predictable and therefore boring. The energy spectrum of pink noise lies inbetween. Signals with pink noise also have their energy concentrated in the rst few frequencies (but not as few as in the random walk). In addition to the above, there is another group of signals, called black noise 23]. Their energy spectrum follow O(f ?b ), b > 2, which is even more skewed than the spectrum of the brown noise. Such signals model successfully, for example, the water level of rivers as they vary over time 17].

Performance Experiments
To determine the e ectiveness of our proposed method (the F -index method), we compared it to a sequential scanning method. We used the Rtree for the index. For range queries, the sequential scanning method computes the distance between the query sequence and each data sequence. In our e ort to do the best possible implementation for the sequential scanning, we stop the test as soon as the square of the distance exceeds 2 , and we declare the two sequences to be dissimilar. Thus, a data sequence is fully scanned only if it is similar to the query sequence. For`all-pairs' queries, each sequence in the database is tested against every other sequence, for a total of N (N ? 1)=2 tests.
We investigated the following questions in these experiments: How to choose the number of Fourier coe cients to be retained (cuto frequency f c ) in the F -index method. A larger f c reduces the false hits but at the same time increases the dimensionality of the Rtree, and hence the search time.
How does the search time grow as a function of number of sequences in the database? How does the length n of the sequences a ect the performance?

Experimental setup
We generated synthetic sequences for the experiments. Each sequencex = x t ] was a random walk: x t = x t?1 + z t (18) where z t (t = 1; 2; : : :) are independent, identically distributed (IID) random variables. For implementation convenience, each z t variable is uniformly distributed in the range (-500, 500). The probability distribution of each z t is immaterial; the results would be the same had we chosen a gaussian distribution, or a fair, random coin. For each set S of N sequences, queries were generated by creating a distorted copy x t ] of each sequence x t ] in S. This was accomplished by adding a small amount of noise to every x t , i.e., x t = x t + p w t (19) where p=0.05 and w t (t = 1; 2; : : :) are IID random variables, each following a uniform distribution in the range (-500,500).
Let Q be the set of distorted sequences, which we shall use as queries.
For range queries, we search S for sequences within distance for every distorted sequence in Q. For allpairs queries, we concatenate S and Q, and ask for all sequence pairs within distance. The execution time for the F -index method includes both the search time in the R -tree and the postprocessing time.     We repeated each experiment 10 times by generating 10 sequence sets with di erent seeds, and averaged the execution times from these repetitions. Table 1 summarizes the parameters of the experiments. As the number of Fourier coe cients (`cut-o frequency' f c ) increases, the dimensionality of the R -tree increases. Recall that each Fourier coe cient, being a complex number, increases the dimensionality of the R -tree by 2. The increase in dimensionality results in better index selectivity, which gives fewer false hits. This reduction in false hits is re ected in the post-processing time, which decreases with the cut-o frequency. However, the time to search the R -tree increases with the dimensionality, because the fanout is smaller, and the tree is taller. Figures 3 and 4 are in complete agreement with the above intuitive arguments.
Given the trade-o between the tree-search time and the post-processing time, it is natural to expect that there is an`optimal' f c . Indeed, the total execution time of our method shows such a minimum, as illustrated in Figures 1 and 2. Notice that this minimum is rather at, and, more importantly, it occurs for small values of the cut-o frequency f c . This experiment con rms our early conjecture that we can e ectively use a small number of Fourier coe cients for indexing sequences.
For the rest of the experiments, we kept f c =2 Fourier coe cients for indexing, resulting in a 4-dimensional R -tree.

Varying the number of sequences in the database
The next experiment compares the F -index method with the sequential scanning method for increasing number of sequences in the database. Figures 5 and 6 show the execution time per query for range and all-pairs queries respectively, for di erent values of the number of sequences in jSj. Clearly, the F -index method outperforms the sequential scanning. As the number of sequences increases, the gain of the F -index method increases, making this method even more attractive for large databases. 4.4 Varying the length of sequences First, we show the results for range queries. We varied the length of sequences, keeping the number sequences in S xed to 400. The distance parameter was set to (1000 n) 1=2 (where n is the length of a sequence). Figure 7 shows the execution time per query for range queries for di erent sequence lengths. The gain of the F -index method increases with n. Figure 8 shows the results of the experiments for all-pairs queries. The trends are similar with the ones for range queries.

Discussion
The major conclusions from our experiments are: The minimum in the execution time for both range and`all-pairs' queries is achieved for a small number of Fourier coe cients (f c = 1 ? 3). Moreover, the minimum is rather at, which implies that a sub-optimal choice for f c will give search time that is close to the minimum. Increasing the number of sequences in the database results in higher gains for our method.
Increasing the length of the sequences n also results in higher gains for our method.
Thus, the experiments show that the proposed F -index method achieves increasingly better performance, as the volume of the data increases.
Finally, we should mention that we also examined whether a`naive' feature extraction method would work as well. For example, consider a method that keeps the rst few values of each time sequence, and indexes on them. We carried out an experiment in which we indexed on the rst 10 values of each time sequence. The performance of this method was very poor compared to the F -index method; there were many false hits, resulting in a large postprocessing time. Judging that further details are of little interest, we omit the experimental results.

Summary
We proposed a method to index time sequences for similarity searching. The major highlights of this method are: The use of an orthonormal transform, and speci cally, the Discrete Fourier Transform, to extract features from a sequence. The attractive property of the DFT is that the Euclidean distance in the time domain is preserved in the frequency domain, thanks to Parseval's theorem. Thus, the DFT ful lls the \completeness of feature extraction" criterion. In addition, the DFT is fast to compute ( O(n logn) ). The recognition that a large family of sequences have only a few (f c ) strong Fourier coe cients. For example, random walks, stock price movements, exchange rates, exhibit an amplitude spectrum of O(1=f). Ignoring the weak coe cients, we introduce a few false hits, but no false dismissals. The importance of this observation is that it avoids the \dimensionality curse" at the expense of a modest post-processing cost. Keeping the rst f c coe cients, each sequence becomes a point in a 2f c {dimensional space (recall that the Fourier coe cients are complex numbers). The use of spatial access methods, and speci cally R -trees, to index those points. We believe that R -trees are more robust than their competitors, for medium dimensionalities.
Extensive empirical evaluation demonstrated the e ectiveness of the proposed method. We generated random walks, which model well stock price movements. The conclusions from our experiments are the following: (a) the execution time of our method shows a rather at minimum for a small cut-o frequency (f c 1-3) (b) compared to sequential scanning, our method achieves better gains with increasing number of sequences and increasing length. Thus, our method will be more and more attractive, as the volume of the database increases to Gigabytes and Terabytes.
Although we have made certain choices (Euclidean distance between sequences in time domain for similarity measure, DFT for feature extraction, and R tree for maintaining indexes), our technique can be trivially adapted for any similarity measure that can be expressed as the Euclidean distance between feature vectors in some feature space any distance-preserving (eg., orthonormal) transform (the more the energy concentrated on few coe cients, the faster our response time) any multi-dimensional index that performs well for the number of features used for indexing. Future work could examine the following issues Examination of other orthonormal transformations, in addition to the Discrete Fourier Transform. Extensions of our approach to 2-d and higher-dimensionality signals (e.g., images), in addition to 1-d signals (time sequences) that we have examined.
The work reported in this paper has been done in the context of the Quest project 2] at the IBM Almaden Research Center. In Quest, we are exploring the various aspects of the database mining problem. Besides the problem of queries over large sequences, some other problems that we have looked into include the enhancement of the database capability with the classi cation queries 3] and with \what goes together" kinds of association queries 4]. The eventual goal is to build an experimental system that can be used for mining rules embedded in massive databases. We believe that database mining is an important application area, combining commercial interest with intriguing theoretical questions.