A signature technique for similarity-based queries

Jagadish et al. (see Proc. ACM SIGACT-SIGMOD-SIGART PODS, p. 36-45, 1995) developed a general framework for posing queries based on similarity. The framework enables a formal definition of the notion of similarity for an application domain of choice, and then its use in queries to perform similarity-based search. We adapt this framework to the specialized domain of real-valued sequences, although some of the ideas we present are applicable to other types of data as well. In particular, we focus on whole-match queries, in which the user specifies the whole sequence. Similarity-based search can be computationally very expensive, and the cost depends heavily on the length of the sequences being compared. To make such similarity testing feasible on large data sets, we propose the use of a signature-based technique. In a nutshell, our approach is to "shrink" the data sequences into signatures, and search the signatures instead of the real sequences, with further comparison required only when a possible match is indicated. Being shorter, signatures can usually be compared much faster than the original sequences. In addition, signatures are usually easier to index. For such a signature-based technique to be effective, one has to ensure that (1) the signature comparison is fast, and (2) the signature comparison gives few false alarms and no false dismissals. We obtain measures of goodness for our technique, and illustrate it with two very different examples.


Introduction
Sequences of real-valued data arise in many applications ranging from the stock market to electro-cardiograms. Often, it is of interest to locate sequences that are similar to a specified query sequence. The notion of similarity is application dependent, and even within a single application, may vary from one query to the next.
Work in this area is usually specific to one particular domain and uses one specific notion of similarity. For example, Faloutsos et al. [6, 1] studied the problem of searching a database of time sequences for sequences similar to a given one. They reduced sequences to points in a low-dimensional space by using Fourier transforms and used the Euclidean distance in this space to measure similarity. This notion of similarity is extended in [21] by allowing a class of transformations, including moving average and time warping, to be applied to sequences before computing the Euclidean distance. Retrieval by similarity has also been studied in the context of image retrieval [15], genome/protein matching [3, 12] and text string searching [31]. Other authors have recently studied models and languages for databases containing sequences (e.g., [5], [13], [10], [22], [26]), but without taking into account notions of similarity or approximation.
In a previous paper [16], Jagadish et al. developed a general framework for posing queries based on similarity. The framework enables a formal definition of the notion of similarity for an application domain of choice, and then its use in queries to perform similarity-based search. In this paper, we adapt this framework to the specialized domain of real-valued sequences, although some of the ideas we present are applicable to other types of data as well. In particular, we focus on whole-match queries. By whole-match query we mean the case where the user has to specify the whole sequence (e.g., in a collection of 2-second voice clips with the phrase "good-morning", find the ones that are similar to my own utterance). (This work was partially supported by the Institute of Systems Research and by the National Science Foundation under Grants No. EEC-94-02384, IRI-9205273 and IRI-9625428.)
Similarity-based search can be computationally very expensive. The computation cost depends heavily on the length of sequences being compared. To make such similarity testing feasible on large data sets, we propose the use of a signature-based technique. In a nutshell, our approach is to "shrink" the data sequences into signatures, and search the signatures instead of the real sequences, with further comparison being required only when a possible match is indicated. Being shorter, signatures can usually be compared much faster than the original sequences. In addition, signatures are usually easier to index. For such a signature-based technique to be effective one has to assure that (1) the signature comparison is fast, and (2) the signature comparison gives few false alarms, and no false dismissals. We study these issues below, and present conditions under which these requirements are satisfied. Our goal is to show that this general framework fits many real-life applications and leads to efficient searching. The techniques we suggest have been implemented and tested. At least in one application of interest, these techniques did lead to a significant improvement in performance.

Basics
In this section we present the basic framework on sequences, similarity measurements, and signature extraction.

Sequences, Distance Functions, and Transformation Languages
Real-valued sequences, like stock-market or electro-cardiogram data, can be viewed as strings of numbers. For example, a possible data sequence could be the string x = {10.2, 12.5, 3.0}.
We use the following notational conventions: x_i denotes the i-th entry of the sequence x; x_{i:j} denotes the sub-sequence {x_i, x_{i+1}, ..., x_j} of the sequence x. Following the framework of [16], the dissimilarity between two objects can be measured as the cost of transforming one into the other by means of a transformation sequence selected from a transformation language T. Thus the distance between two sequences measures the cost to transform the first sequence to the second, or both to a common, third sequence, given an application-dependent set of allowable transformations and their associated costs. Given a set of transformations T and a transformation T ∈ T, and a sequence x in some set of sequences S (e.g., S = R^n, n = 1, 2, ...), T(x) is the sequence in S that results from applying transformation T to x. The cost of this transformation is cost(T). We extend [16] by allowing the possibility that, after all allowable transformations are exhausted, the two sequences are still different, in which case we measure the distance between the transformed strings using a traditional distance function, denoted by D_0(), such as the Euclidean distance or the city-block (Manhattan, or L_1) distance function. The D_0() distance will be called the base distance. The distance between two strings x, y is defined as the cost of transforming each of the strings to two strings that are as close as possible in base distance, plus the base distance between the transformed strings. Formally,

D(x, y) = min_{T_1, T_2 ∈ T} ( cost(T_1) + cost(T_2) + D_0(T_1(x), T_2(y)) )    (1)

Often, the allowable transformations consist of a sequence of basic building blocks.
In this case, let T_0 be the set of these basic, atomic transformations. For example, for the string-editing distance, the set of atomic transformations could be T_0 = { 'insert', 'delete', 'substitute' }. If a composite transformation is an allowed sequence of such atomic transformations, with cost equal to the sum of the individual costs, then we can express the distance function recursively as follows:

D(x, y) = min { min_{T_1, T_2 ∈ T_0} ( cost(T_1) + cost(T_2) + D(T_1(x), T_2(y)) ),  D_0(x, y) }

As we show in the Appendix, several practical distance functions follow this model. The Euclidean distance readily obeys the model, if no transformations are allowed and D_0() is the Euclidean distance.

Definition 2.1 A base distance function D_0() is said to be additive if, for sequences x, y of equal length l, D_0(x, y) = Σ_{i=1..l} d(x_i, y_i), where d() is some non-negative function; D_0(x, y) is undefined otherwise.
We require in this paper that the base distance function used be additive. This is not an onerous requirement, since every practical example we are aware of satisfies it. Observe that one cannot compute the base distance between two sequences of unequal length.
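As an illustrative sketch (ours, not from the paper), an additive base distance is just a per-position sum; the squared Euclidean case uses d(a, b) = (a - b)² and the city-block case uses d(a, b) = |a - b|:

```python
# Sketch: additive base distances D_0 as per-position sums (illustrative only).

def additive_distance(x, y, d):
    """D_0(x, y) = sum_i d(x_i, y_i); undefined (error) for unequal lengths."""
    if len(x) != len(y):
        raise ValueError("additive base distance is undefined for unequal lengths")
    return sum(d(a, b) for a, b in zip(x, y))

def euclidean_sq(x, y):          # squared Euclidean distance
    return additive_distance(x, y, lambda a, b: (a - b) ** 2)

def city_block(x, y):            # Manhattan (L_1) distance
    return additive_distance(x, y, lambda a, b: abs(a - b))

print(euclidean_sq([10.2, 12.5, 3.0], [10.2, 12.5, 4.0]))  # 1.0
print(city_block([1, 2, 3], [2, 2, 5]))                    # 3
```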

Signatures
Given a database containing sequences and a query sequence to be matched within a certain distance, a naive evaluation strategy is to iterate over the sequences in the database and for each one compute the distance from the given sequence. The complexity of each such test is determined by the length of the sequence and the notion of similarity being used (that is, the class of transformations allowed). Since individual sequences in the comparison can often be large, approximate matching can be computationally intensive.
We wish to reduce the computation cost by using short representative signatures to perform the matching instead of the real sequences. Signatures are scanned sequentially and matched against the signature of the given query. Due to their small size, this scan can take place orders of magnitude faster than a full scan and match on the entire database. In some applications, if the signatures are short enough, it may even be possible to build index structures on the signatures.
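The shrink-and-search idea can be sketched as a filter-and-refine loop. In this toy version (ours, not the paper's algorithm) the signature is simply the sequence mean; since n·|mean(x) − mean(y)| lower-bounds the city-block distance Σ|x_i − y_i|, discarding on the signature test causes no false dismissals:

```python
# Sketch of "shrink-and-search" with a toy lower-bounding signature (illustrative).
# Signature = sequence mean; n*|mean(x)-mean(y)| <= sum(|x_i - y_i|),
# so a signature-based discard is always safe (no false dismissals).

def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def sig(x):
    return sum(x) / len(x)

def search(database, query, eps):
    qs, n, hits = sig(query), len(query), []
    for seq in database:
        if n * abs(sig(seq) - qs) > eps:   # cheap signature test: safe discard
            continue
        if city_block(seq, query) <= eps:  # full comparison only for candidates
            hits.append(seq)
    return hits

db = [[1, 2, 3], [10, 10, 10], [1, 2, 4]]
print(search(db, [1, 2, 3], eps=1.5))  # [[1, 2, 3], [1, 2, 4]]
```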
A signature is a word in a selected description language. We associate a deterministic Turing machine T_L with a given description language L. We say that a sequence x is (exactly) represented by a word w in a description language L if x is the output of T_L on the input w. Note that no two sequences are represented by the same word; we often refer to the sequence represented by word w as seq(w). We extend the seq mapping from words to sub-words by saying that a subsequence x' of x is represented by a sub-word w' of w when x' is the output of T_L on w'. A sequence may be represented in many different ways in a given description language. Even when a sequence does not have a compact signature, it may be possible to use a compact signature that represents a "similar" sequence. An example of such an approximate signature is the representation of a sequence by its first few Fourier coefficients [1].
Definition 2.2 Given a sequence x, a base-distance measure D_0(), and a description language L, the ε-complexity of x is the smallest integer K such that there exists a word w_x ∈ L with |w_x| = K and D_0(x, seq(w_x)) ≤ ε. If there is no such word in L, the ε-complexity of x is undefined. Such a word w_x, which is not necessarily unique, is called an ε-signature of x in L.
Although not every sequence will have an exact representation in the description language L, we will choose L and ε in each application to ensure that every sequence has an ε-signature. For brevity we will omit the ε and just use the word signature. A signature of a sequence x will be denoted by w_x. Sequences that have an exact representation in a description language L will be called canonical. For a signature w, the sequence seq(w) is the canonical sequence represented by w. If w_x is a signature of x, the canonical sequence represented by w_x will be called a canonical form of x and denoted x̂. In general, finding a good representation for a sequence is difficult. Given a description language L (with associated Turing machine T_L), a base distance function D_0, a set of transformations T, a distance bound ε, a sequence s, and a number k, we call the problem of testing whether s has ε-complexity k the signature testing problem.
Note that the problem of determining the Kolmogorov complexity of a sequence [17] is a special case of this problem, from which it follows that:

Theorem 2.1 The signature testing problem is undecidable.
The undecidability comes from the power of having an arbitrary Turing machine T_L to compute sequences from descriptions; in practice, people do not use arbitrary description languages. They use Fourier transforms, piecewise linear approximations, regular expressions, etc. For such languages it is easy to devise a simple grammar to determine whether a description word is valid, in the sense that it describes some sequence. And for such words the mapping usually takes time that is at most linear in the length of the sequence. Figure 1 gives an example, where the transformation T is what we call a "regional add": a regional add transformation R<i, j, δ> of magnitude δ adds δ to every entry of the sequence, starting from position i until position j, included. We assume that the description language L uses piecewise constant (i.e., zero-th order polynomial) approximations to obtain canonical sequences. Specifically, Figure 1 shows (a) a sequence (light line) and (the canonical representation of) its signature (bold line), and (b) the effect of a transformation ('regional add') on the sequence and the signature; the heavy line with the double arrows stands for the 'jump' of magnitude δ at position i.

Figure 1: (a) A sample sequence and its signature (piecewise flat approximation); (b) a 'regional add' transformation of magnitude δ on the interval i-j, and its effect on the sequence and its signature.


Similarity Retrieval: Conditions for Efficiency and Correctness

Lower Bounding
We need to ensure that the process of matching signatures of the database strings against the query signature does not lead to false dismissals. For this we show that, under realistic conditions, the distance between two signatures provides a lower bound on the distance between the pairs of strings that map to them. Thus, if two signatures are far apart, we know the corresponding strings must also be far apart.
Definition 3.1 Let C_L be the set of all canonical sequences with respect to language L.

We need to know how the transformations in T distort distances, and to find bounds on these distortions. First, we define the maximum distortion of the distance between a sequence x and any of its canonical forms x̂ that can be introduced by a transformation T ∈ T.

Definition 3.2 Given a set S of sequences, a transformation language T, a description language L, and a base distance function D_0(), let x ∈ S and T ∈ T. We define the transformed signature error for x and T to be

K_{x,T} = max { D_0(T(x), T(x̂)) | x̂ is a canonical form of x }    (4)

Let

K_x = max_{T ∈ T} K_{x,T}    (5)

be the maximum distortion that any transformation can introduce between a sequence x and any of its canonical forms x̂. Finally, we maximize over all sequences in S, letting K = max_{x ∈ S} K_x.

Theorem 3.1 If the base distance measure D_0() satisfies the triangle inequality, then

D(x, y) ≥ D(x̂, ŷ) − K_x − K_y ≥ D(x̂, ŷ) − 2K

The proof is omitted.
In other words, one can find the distance between (canonical representations of) signatures and use it to bound the distance between the original sequences, even if the transforms to be applied in the matching process are different in the two cases.
An issue that now arises is that T(x̂) might not be a canonical form, that is, T(x̂) ∉ C_L. For example, if L keeps the first few DFT coefficients of the sequence x, no canonical form x̂ can have high frequencies; thus, if we apply to a canonical form x̂ a transformation T that introduces high frequencies (e.g., a regional add with a large "jump"), the result T(x̂) cannot possibly belong to C_L. It is desirable to find a related transform T', so that T'(x̂) ∈ C_L, while not too far away from T(x̂), and with cost similar to the cost of T. When this holds, we say the description language is correspondence-bounded with respect to the transformation language. More precisely:

Definition 3.3 A description language L is said to be Δ-correspondence-bounded with respect to a transformation language T if there is a constant Δ such that for every pair of transformations T_1 and T_2 in T and for every pair of canonical sequences x̂ and ŷ in C_L there exist two other transformations T'_1 and T'_2 in T such that T'_1(x̂), T'_2(ŷ) are canonical sequences and

D_0(T'_1(x̂), T'_2(ŷ)) + cost(T'_1) + cost(T'_2) ≤ D_0(T_1(x̂), T_2(ŷ)) + cost(T_1) + cost(T_2) + Δ    (8)

The quantity Δ is called the correspondence error bound.
Definition 3.4 A description language L is said to be closed with respect to a transformation language T if for all T ∈ T and all w ∈ L we have that T(seq(w)) ∈ C_L. That is, all transformations in T map canonical sequences to canonical sequences.
So, we have bounds on the two sources of error: the error introduced by matching signatures instead of the original sequences, bounded by K, and the error introduced by using transformations that preserve canonical forms instead of the transformations that we would use on the original strings; this one is bounded by Δ.
In the rest of this extended abstract we consider only cases where Δ is zero; the case of arbitrary Δ is considered in the full version of the paper [7], and we omit it here for lack of space. This leaves us with two tasks: one is to compute the distance between (the canonical representations of) two signatures, by looking only at the signatures; the other is to compute the bound K, for specified transformation and description languages. We pursue both in turn.

Match E ort
The reason to use signatures is that the comparisons of query and data can proceed rapidly, much faster than if the longer actual sequences were to be compared. Is this always true?
All that one can say in general is that it is asymptotically no more expensive to compute the distance between two sequences represented as signatures than to compute the distance between the original sequences themselves. The reason is that the complexity of obtaining the distance between two sequences is at least linear in the length of the sequences, since an additive distance function will at least require reading each point in the sequence once. The complexity of expanding a signature into a full sequence is also typically proportional to the length of the full sequence.
Of course, the whole point of using signatures is that these comparisons be significantly faster. Ideally, we would like comparisons to require time that is a function only of the length of the signature, independent of the length of the original sequence and of the canonical representation of the signature.
Definition 3.5 We say that a description language L, closed w.r.t. T, is T-compare-polynomial if the distance between (the canonical sequences of) two signatures can be computed in time polynomial in the length of the signatures; that is, for all w_x, w_y in L, D(x̂, ŷ) can be computed in time polynomial in the lengths of w_x and w_y.
Note that, in the presence of transformations, the distance between two sequences may be hard to compute, even in the case of fully expanded sequences. For carefully chosen transformation languages, this computation can be done in polynomial time. Consequently, achieving polynomial-time computation of the distance between two signatures is a good objective. We present such a case below. We need the following auxiliary definitions.
It is often the case that a signature w_x of a sequence x is a list of numbers w_1 w_2 ... w_s. For example, a signature extraction algorithm could replace every 10 samples of x with their average: w_1 = avg(x_1, x_2, ..., x_10), w_2 = avg(x_11, ..., x_20), etc. The value of each w_i in this example depends only on 10 contiguous symbols of x. In general, if every symbol w_i of the signature depends exclusively on a small subsequence of x, then the description language L is called modular. The formal definition is as follows:

Definition 3.6 A description language L is said to be modular if there is a function h_L such that for every sequence x and every signature w_x of x, there exist subsequences x^1, ..., x^m of x, with x = x^1 ... x^m, such that for each symbol w_{x,i} of w_x, h_L(x^i) = w_{x,i}. We define the module bound to be the length of the longest such subsequence of x.
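The block-averaging example can be made concrete; this toy extractor (ours, illustrative) makes each signature symbol depend on exactly one contiguous module of x:

```python
# Sketch: a modular signature extractor (block averages), as in the example above.

def block_average_signature(x, block=10):
    """Each signature symbol w_i is the average of one contiguous block of x
    (so the module bound equals the block size)."""
    return [sum(x[i:i + block]) / len(x[i:i + block])
            for i in range(0, len(x), block)]

x = list(range(1, 21))                 # the sequence 1, 2, ..., 20
print(block_average_signature(x))     # [5.5, 15.5]
```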
Definition 3.7 A transformation language T is said to be local if for any two sequences s and s' that agree on the i-th symbol, T(s) and T(s') also agree on the i-th symbol. That is, the value of T(s)_i depends only on the value of s_i. Furthermore, the cost cost(T) is the sum of the costs of all transformations T_i that transform the i-th symbol of their input as T does and leave the rest of the input unchanged.
We are now ready to present the theorem. The proof (omitted) relies on the ability to compute the distance between pairs of (transformed) subsequences represented in the modular description language in time independent of sequence length, and then uses dynamic programming to deal with overlaps of subsequences and constraints on transformations.
Theorem 3.2 If T is a local transformation language, and L is a modular description language closed w.r.t. T, then L is T-compare-polynomial.
A special case of particular interest is when the transformation language specified is empty. In this case, we use the names simple-compare-linear/polynomial/exponential, etc. For example, a Fourier series description of a sequence, with an inverse Fourier transform as the signature-inverting function, is simple-compare-linear for the Euclidean (L_2) distance measure (because "energy" is preserved in the transform domain), but is not simple-compare-polynomial for other distance measures. A piecewise linear description of a sequence, with a zero-order or first-order interpolation as the inverse, is simple-compare-linear for all L_p distance measures. In fact we can show the following:

Theorem 3.3 Every modular description language is simple-compare-linear.
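The piecewise constant case can be made concrete: for two signatures stored as (value, duration) pairs with equal total duration, the L_1 base distance between their canonical sequences follows from a single merge pass over the segments, in time linear in the signature lengths. A sketch (ours, not the paper's code):

```python
# Sketch: L_1 distance between two piecewise-constant canonical sequences,
# computed directly from their (value, duration) signatures by merging segments.

def l1_from_signatures(wx, wy):
    """wx, wy: lists of (value, duration) pairs; total durations must be equal."""
    i = j = 0
    (vx, dx), (vy, dy) = wx[0], wy[0]
    total = 0.0
    while True:
        step = min(dx, dy)              # overlap of the two current segments
        total += abs(vx - vy) * step
        dx -= step
        dy -= step
        if dx == 0:
            i += 1
            if i == len(wx):
                break                   # equal totals => wy is exhausted too
            vx, dx = wx[i]
        if dy == 0:
            j += 1
            if j < len(wy):
                vy, dy = wy[j]
    return total

# Canonical sequences: (1,1,1,1,3,3) vs (2,2,2,0,0,0)  =>  3 + 1 + 6 = 10
print(l1_from_signatures([(1.0, 4), (3.0, 2)], [(2.0, 3), (0.0, 3)]))  # 10.0
```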

Finding the Bounds
On the basis of the previous section, the basic question to ask now is how we find the bound K. (Recall that we focus in this extended abstract on the case where Δ = 0.) We show in this section that for selected classes of transformations and description languages, such a bound K can indeed be found.
It is often the case that a transform T cannot amplify an existing difference too much. For example, it may be the case that, if two sequences x and y differ by ρ, any transform T ∈ T can amplify this difference only by a predictable amount. Formally, for a given transformation language T, let f() be a function such that if

D_0(x, y) = ρ    (9)

then

D_0(T(x), T(y)) ≤ f(ρ)   for all T ∈ T    (10)

It is easy to see the following.

Theorem 3.4 Let f() be a function such that for every two sequences x, y and every transformation T ∈ T, D_0(T(x), T(y)) ≤ f(D_0(x, y)). Then the transformed signature error bound is K ≤ f(ε).

In particular, if f() is the identity function, that is, the base distance is invariant under the same transformation, then K = ε. This is the case for all add (to the y-axis) transformations and L_p distance measures. On the other hand, for a uniform scaling transformation, f() multiplies its argument by the scaling factor.
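The identity-function case is easy to check numerically: applying the same add transformation to both sequences leaves any L_p base distance unchanged. A quick sketch (ours, illustrative):

```python
# Sketch: an add transformation applied to both sequences leaves L_p unchanged,
# so f is the identity for these transformations and K = eps.

def regional_add(x, i, j, delta):
    """Add delta to entries i..j (inclusive, 0-based) of x."""
    return [v + delta if i <= k <= j else v for k, v in enumerate(x)]

def lp(x, y, p=2):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = [1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0]
before = lp(x, y)
after = lp(regional_add(x, 1, 2, 5.0), regional_add(y, 1, 2, 5.0))
print(abs(before - after) < 1e-12)   # True: distance is invariant
```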

A Comprehensive Example
To place all the concepts of the preceding sections in perspective, we work through an example in this section. We consider a transformation language that allows "regional adds". In other words, we permit the sequence level to shift abruptly. There is a cost CostOfTransform associated with each such shift in level. Such distance functions with regional adds have been used in the past [27]; other distance functions go even further, including time-shifts, scaling, etc. [2]. Note that no straightforward base distance function can accommodate such changes. Therefore most of the currently published retrieval techniques cannot be used effectively.
Such "regional adds" often occur in sequences as a result of sudden changes in environment or other catastrophes. One is often interested in finding sequences that are similar, modulo a few such level shifts. For example, consider companies X and Y, whose stocks move similarly because the companies belong to the same market segment. Suppose that something unexpected happens to company X only (e.g., it wins a major contract), and this unexpected change boosts the stock price by, say, δ. If we could factor out this "catastrophe", the two stock prices would look very similar. Based on this example, we show how our approach works.

Problem definition - our input
Suppose that a domain expert, trying to take these "catastrophe" events into account, furnishes us with the following distance function D(): for two sequences x and y, their (squared) distance is the sum of squared errors plus the cost of "catastrophes", after the optimal number of "catastrophes" has been placed at the optimal points, with cost δ² for each "catastrophe" of magnitude δ.
This is the only input to us; it is up to us to decide how to bring this problem within our framework, which description language L to choose, how to obtain signatures, and which signature-to-sequence function seq() to choose. However, if we manage to do all that, we will have (a) a potentially fast access method ("shrink-and-search") and (b) the guarantee that our method will not have false dismissals.
Thus, our set of atomic transformations is T_0 = { R<i, j, δ> | 1 ≤ i ≤ j ≤ n, δ ∈ R }.
As the seq(w) function, which operates on a word w and generates the canonical representation, we select piecewise constant interpolation. Thus, for each pair (v, d) of the word w, seq() will "stutter" the value v d times.

Transformations Over Signatures Before we can compute distances between signatures, we have to agree upon what the equivalent of our transformation is in the signature domain. If we allow steps at any arbitrary position, including points in the middle of substrings represented by single symbols in the signature, then the resulting transformed sequence will have a different number of piecewise constant regions, and will not be a canonical sequence of the description language L we selected. Instead, we require steps to be added only at substring boundaries. With this restriction to the transformation language, our description language is closed w.r.t. the transformation language.
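A minimal sketch (ours, illustrative) of the restricted transformation in the signature domain: a step confined to segment boundaries just shifts whole (value, duration) pairs, so the result is again a valid signature.

```python
# Sketch: regional add restricted to segment boundaries of a piecewise-constant
# signature; the output is again a list of (value, duration) pairs, i.e. canonical.

def regional_add_on_signature(w, seg_i, seg_j, delta):
    """Add delta to segments seg_i..seg_j (inclusive) of signature w."""
    return [(v + delta, d) if seg_i <= k <= seg_j else (v, d)
            for k, (v, d) in enumerate(w)]

w = [(10.0, 5), (12.0, 3), (9.0, 4)]
print(regional_add_on_signature(w, 1, 2, 2.5))  # [(10.0, 5), (14.5, 3), (11.5, 4)]
```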
Bounding Error and Match Effort It is not hard to show [7] that the correspondence error introduced here is 0 (the proof is omitted for lack of space). Our framework also satisfies the requirements of Thm 3.4. Therefore the transformed signature error bound, K, is simply ε.
Since Δ is zero, and K is ε, performing the match with signatures is no worse than performing the match with the canonical sequences represented by the signatures.
The description language we have here is modular, the transformation language is local, and the description language is closed with respect to the (restricted) transformation language. Therefore, all conditions of Thm 3.2 are satisfied. Thus we know that we can match in the signature domain, in time that is polynomial in the signature length.

The Steps To Follow
Signature Extraction We choose to work with fixed-length signatures of length s. With the above choices of L and seq(), the problem of finding the signature of a sequence x is the classic problem of piecewise constant sequence approximation, constrained on the number of segments s, where the cost function is the sum of squared errors. The solution is based on dynamic programming; the details are omitted for brevity. We can show that the running time of the algorithm is O(n²s).
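The constrained approximation can be sketched with the textbook dynamic program. This version (ours, illustrative) returns only the optimal sum of squared errors for at most s segments; prefix sums give O(1) segment costs, for O(n²s) time overall:

```python
# Sketch: optimal piecewise-constant approximation with at most s segments,
# minimizing the sum of squared errors (classic dynamic program, O(n^2 * s)).

def best_sse(x, s):
    n = len(x)
    ps = [0.0] * (n + 1)    # prefix sums of x
    ps2 = [0.0] * (n + 1)   # prefix sums of x^2
    for i, v in enumerate(x):
        ps[i + 1] = ps[i] + v
        ps2[i + 1] = ps2[i] + v * v

    def sse(a, b):          # squared error of segment x[a:b] against its mean
        m = b - a
        seg_sum = ps[b] - ps[a]
        return (ps2[b] - ps2[a]) - seg_sum * seg_sum / m

    INF = float("inf")
    # dp[k][i] = best error for the first i points using exactly k segments
    dp = [[INF] * (n + 1) for _ in range(s + 1)]
    dp[0][0] = 0.0
    for k in range(1, s + 1):
        for i in range(k, n + 1):
            dp[k][i] = min(dp[k - 1][a] + sse(a, i) for a in range(k - 1, i))
    return min(dp[k][n] for k in range(1, s + 1))

print(best_sse([1.0, 1.0, 1.0, 5.0, 5.0, 5.0], 2))  # 0.0 (two flat segments fit exactly)
```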
Comparing Full Sequences A dynamic programming algorithm suggests itself, and indeed one has been proposed for a variant of this problem in [27]. Eq. 12 is already in a recursive form; the dynamic programming works as follows. There is one "stage" in the algorithm for each discrete sample point. At each stage, a step has to be taken, and its duration p_e has to be selected; no additional steps are permitted until the end of this duration. The magnitude of the step can then be obtained by solving a very simple optimization: for our setting (the Euclidean distance and the cost of a step being δ²), we have to maximize a quadratic polynomial in δ. Further transitions then compute the distance between sequences, taking into account the chosen step.
Thus, there are n stages in a dynamic programming algorithm to compute the distance between two sequences of length n. At each stage, there are O(n) possible states. Computing the cost of a transition (based on the optimization performed for the step size) is O(1), by careful book-keeping: we can keep the partial sum Σ_{i=1..p_e−1} from the previous step, and update it in constant time. Multiplying these, the algorithm has complexity O(n²).
Computing Distance Between Signatures For this specific case, one can work through the details of the dynamic programming formulation to show that the time required is actually O((2s)²), where 2s is the sum of the lengths of the query and data signatures.

Experimental Veri cation
We implemented our searching method on a database of stock price movements, from ftp://ftp.ai.mit.edu/pub/stocks/results. We used 7 stocks; for each stock, we took its first n days (n=150-300), and considered the closing prices only. As queries we used the very same stocks. We used fixed-length signatures of length s, and implemented the searching algorithms in nawk. We always 'diff'ed the output of the two methods, to verify that there were no false dismissals.
Our first set of experiments was to find a good value for the signature length s. Figure 2(a) shows the results: it gives the logarithm of the response time of our method, as a function of the parameter s, for several values of the tolerance ε, for n=150-day-long stocks. We varied s from 10 to 50. For a careful choice of s, the proposed method achieves 50% savings in response time, and these savings increase with the length n of the sequences. Notice that (a) the higher the tolerance, the less useful the signatures, as expected, and (b) for small tolerances, the optimal value of s is approximately 30. We conjecture that the optimal value of s is related to the square root of the length n. We also plot the response time of the 'naive' method, which was around 370 seconds; the specific plot corresponds to ε=4. As expected, it is roughly constant: s and the tolerance have no effect on its strategy. Figure 2(b) examines how the speedup scales for longer sequences. It gives the ratio of response times (our method over the naive one) as a function of the sequence length n. We chose the tolerance ε=0, and s to be 30, 40 and 50, respectively, for n=150, 212 and 300. Notice that the gains of our method increase drastically with the sequence length.

Another Example
Consider uniform scaling as the similarity transform of interest. That is, T = { T_a | a ∈ R } where T_a(x) = ax; a constant scale factor a can be used to multiply all observations in one sequence to better make it match the other. Simple calculus shows the desired scale factor to be a = Σ_i x_i y_i / Σ_i x_i². The cost cost(T_a) is determined by the application: it could be zero, or a², or |log|a||. Consider a description language L, obtained as the Fourier coefficients of the original sequence. A short signature is obtained by discarding the higher-frequency coefficients. The error in the signature is the sum of squares of the truncated coefficients; this error can be bounded by retaining enough coefficients.
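The closed-form scale factor follows from minimizing Σ_i (a·x_i − y_i)² over a; a quick sketch (ours, illustrative):

```python
# Sketch: optimal uniform scaling factor a = sum(x_i * y_i) / sum(x_i^2),
# obtained by minimizing the squared error sum((a*x_i - y_i)^2) over a.

def best_scale(x, y):
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]          # exactly 2 * x
print(best_scale(x, y))      # 2.0
```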
The equivalent transformation in the signature domain is also multiplication by the same scaling factor as in the original sequence domain. The optimum scaling factor can be computed in the signature domain using the same calculus formulation as in the original domain. The error is bounded by the product of the error before scaling and the scaling coefficient; therefore K = a·ε, where a is the scaling factor. Since the transformations in the signature domain are identical to those in the full sequence domain, we know that the correspondence error is identically zero. This is not a modular description language, yet signature matching is possible in polynomial time, through a parametric optimization.


Related Work

Approximate Matching
Approximate matching methods for numerical time sequences ('signals') include the work on voice matching (see [20] for a textbook), where time-warping is considered. When the distance is the Euclidean metric, we have proposed an indexing method using the first few Discrete Fourier Transform (DFT) coefficients, for matching full sequences [1], as well as for sub-pattern matching [6]. This technique has been extended by Goldin and Kanellakis [11] for matching time sequences, so that it allows for shifts and scalings.
Approximate matching of signals in general is discussed in [30], [28], with a recent survey in [4]. There, the idea is to allow some elastic deformations (i.e., space warpings) before matching the two signals. Signals can be, e.g., 2-d gray-scale images, or 3-d MRI brain scans.
Closely related is the work on string matching. An excellent starting point is the book by Sankoff and Kruskal [25], which examines strings, signals and DNA molecules, along with popular distance functions. The survey by Hall and Dowling [14] examines matching of typed English strings, along with the basic dynamic-programming algorithm that computes the editing distance. The book by Frakes and Baeza-Yates [9] examines Information Retrieval applications, including approximate matching there.

Distance Metrics
We list some popular distance functions. They are all encompassed within our framework, and they use zero or more of the following transformations. The transformations accept a sequence as input and return another sequence. They also take some parameters, within angle-brackets (<>):

drop<p>(s): drops s[p] and shifts the subsequent elements left, to close the gap.
stutter<p>(s): repeats s[p] once, and shifts the subsequent elements to the right.
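A minimal sketch of these two transformations on Python lists (0-indexed here, whereas the text uses 1-based positions):

```python
def drop(s, p):
    # drop<p>(s): remove the element at position p and
    # shift the later elements left to close the gap.
    return s[:p] + s[p+1:]

def stutter(s, p):
    # stutter<p>(s): repeat the element at position p once,
    # shifting the later elements to the right.
    return s[:p+1] + s[p:]
```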
L_p Metrics and Euclidean distance: For two sequences x = x_1, ..., x_n and y = y_1, ..., y_n, this distance is defined by the formula

D_p(x, y) = Σ_{i=1...n} |x[i] − y[i]|^p    (13)

For p = 1 the L_p metric reduces to the 'Manhattan' or 'city-block' distance; for p = 2 it becomes the popular Euclidean distance.
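Equation (13) translates directly into code (a sketch; `lp_distance` is an illustrative name):

```python
def lp_distance(x, y, p):
    # D_p(x, y) = sum over i of |x[i] - y[i]|^p, as in equation (13),
    # for two equal-length sequences x and y.
    return sum(abs(a - b) ** p for a, b in zip(x, y))
```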
Editing distance in strings: This is the minimum number of insertions, deletions and substitutions that are needed to transform a string s into another string t [25,18]:

D(s, t) = min {
  cost(Del(t[1])) + D(s, Rest(t)),
  cost(Del(s[1])) + D(Rest(s), t),
  cost(Sub(s[1], t[1])) + D(Rest(s), Rest(t))
}    (14)

where Del(t[1]) (Del(s[1])) stands for deleting the first character of t (s), Sub(s[1], t[1]) for substituting the first character of s by the first character of t, cost() is the cost of the deletion/substitution, and Rest(t) (Rest(s)) is the string t (s) without its first character.
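The recursion in equation (14) can be memoized; a sketch with unit costs (and zero cost for substituting identical characters, so that the result counts edit operations):

```python
from functools import lru_cache

def edit_distance(s, t, del_cost=1, sub_cost=1):
    @lru_cache(maxsize=None)
    def d(i, j):
        # Base cases: delete whatever remains of the other string.
        if i == len(s):
            return (len(t) - j) * del_cost
        if j == len(t):
            return (len(s) - i) * del_cost
        return min(
            del_cost + d(i, j + 1),               # Del(t[1])
            del_cost + d(i + 1, j),               # Del(s[1])
            (0 if s[i] == t[j] else sub_cost)
                + d(i + 1, j + 1),                # Sub(s[1], t[1])
        )
    return d(0, 0)
```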
Time-warping distance: As shown in Table 1,

D(x, y) = D_0(Head(x), Head(y)) + min {
  D(x, Rest(y)),        /* x-stutter */
  D(Rest(x), y),        /* y-stutter */
  D(Rest(x), Rest(y))   /* no stutter */
}    (15)

where Head(x) returns the first element of x, and Rest(x) returns the remainder. Table 1 shows that the above distance is a special case of our framework, by setting (a) the 'basic' transformation language T_0 to consist of only the stutter transformation, with cost = 0, and (b) the D_0() distance function to be the L_1 metric, that is, the city-block distance. Figure 3 shows two time sequences, before and after the time-warping. The sequences are mixtures of similar harmonics: x(t) = 10 sin(0.5t) + 5 sin(0.25t) and y(t) = 11 sin(0.55t) + 4.5 sin(0.26t), respectively.

In our main example, we used sub-sampling with piece-wise constant interpolation to depict a sequence. Additional signature languages include any piece-wise polynomial functions (see Sidiropoulos [27] for a survey of optimal algorithms to achieve such approximations), as well as techniques from Digital Signal Processing (DSP) [19] for optimal function approximation. The most popular techniques there include the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT) (which is the basis of the JPEG image compression standard [29]), and, recently, the very promising Discrete Wavelet Transform (DWT) [23,24].
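The time-warping recursion of equation (15), with the city-block metric as D_0, can be sketched as a memoized function on tuples (an illustrative implementation, not the paper's):

```python
from functools import lru_cache

def warping_distance(x, y):
    # Equation (15): pay |Head(x) - Head(y)|, then take the cheaper
    # of stuttering x, stuttering y, or advancing both sequences.
    @lru_cache(maxsize=None)
    def d(i, j):
        cost = abs(x[i] - y[j])
        if i == len(x) - 1 and j == len(y) - 1:
            return cost
        options = []
        if j + 1 < len(y):
            options.append(d(i, j + 1))      # x-stutter
        if i + 1 < len(x):
            options.append(d(i + 1, j))      # y-stutter
        if i + 1 < len(x) and j + 1 < len(y):
            options.append(d(i + 1, j + 1))  # no stutter
        return cost + min(options)
    return d(0, 0)
```

Stuttering carries zero transformation cost here, matching setting (a) above; only the accumulated city-block differences contribute.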
The DWT is, in principle, a 'short-window' Fourier transform; the major difference is that the length of the window takes several values (typically, powers of 2). Thus, it leads to multi-resolution analysis, which seems promising for real signals.
Natural signals seem to require few wavelet coefficients to be described with small error [8]; this is exactly the reason that a wavelet decomposition is useful for compression, feature extraction and searching. Figure 4 shows some of the basis functions (numbers 1, 5, 9 and 17) for the well-known Daubechies-4 DWT. The basis functions are translations or dilations of each other: e.g., #5 is a dilation of #9, which is a dilation of #17, etc.

Conclusion
In this paper we described a generic signature-based technique that can be used effectively for retrieval based on many different, application-specific notions of similarity. For a variety of general conditions, we obtained measures of goodness for our technique. We illustrated our technique with two very different examples. While the work in this paper focused on sequence data, we believe that the basic framework developed here is equally applicable to other contexts such as image, video, or text data.