The optimal selection of secondary indices for files

We consider the problem of finding an optimal set of indices for a file. A general model for a file is assumed together with a probabilistic model of the transactions conducted with it: Queries, Updates, Insertions and Deletions. It is shown that all the information assumed for each attribute can be condensed into two parameters and that properties of the optimal solution can be derived from this condensed information. An algorithm to find the optimal set of indices based on these properties is exhibited


NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS:
The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or other reproductions of copyrighted material.Any copying of this document without permission of its author may be prohibited by law.

INTRODUCTION
We consider here the problem of selecting a set of attributes for which a secondary index should be provided in order to minimize the expected cost per transaction conducted with a file.
A proper solution to this problem has to consider the system in which the file will exist as well as characteristics such as accessing mechanisms to it and statistical properties of the transactions that are conducted with the file.Some systems which have been implemented use the simple approach of providing indices for all attributes [2] while others do not provide indices at all [6], For these systems the optimization problem does not exist.
•Between these two extreme approaches there are systems like XRM [S] which allow the user to specify which domains of the relation should be indexed.Some high level languages have been designed [4] for which the system, while in the process of performing a transaction, creates temporary indices for some attributes.After the transaction is completed, the user is informed of the newly created indices and may then choose to keep them or delete them.Systems such as these and others for which on line transactions may be conducted with a file by users whose demands on the file change in time are best suited for solutions as described here.After collecting statistical properties of the transaction that the current set of users conducts with the file, the system may compute an optimal set of indices to minimize transaction processing time.Then, it creates its optimal index set throwing away those indices which contribute to increase transaction time and creating indices which help to decrease transaction time.Since this overhead processing may be appreciable, the system will probably perform this updating only at some fixed intervals of time.
Recently, a number of studies have appeared in the literature which consider this problem.Some of them have taken an empirical approach [9] while others [7], [10], [14] attempt to formalize the problem to obtain an analytic solutioa Since the data base usually exists in a complex environment there are innumerable factors that will influence the index selection problem and, in order to hope for a solution, some assumptions must be made.In [7] the restriction was made that transactions could only specify one attribute value in their specification part while in [14] the transactions were reduced to queries and updates only and the statistical properties considered were minimal.We will now present a model which encompasses a variety of situations by allowing great flexibility in the specification of the statistical properties of the transactions as well as on the storage organization and retrieval mechanism for the indices.The inclusion or exclusion of an attribute in an optimal set of indices will depend on two parameters which are derived for each attribute when the specific properties mentioned above are known to the system.By studying properties of the optimal solution we are able to describe an efficient algorithm which makes use of these pairs of parameters to determine the optimal solution.

SECTION 1
In this section we present the model of the file, the assumptions on storage organization and retrieval, and the types of transactions conducted with it.
We will assume a relational model for a data base [5] and we will consider the problem of index selection when there is a single relation in the data base (the results shown here can be directly extended to a multi-relational data base provided we assume independence between them. The file F will consist of a set of N vectors (or records) v = (v j,v2,.»,vm)where each Vj € Aj, the j-th attribute.Thus F £ A1xA2x...xAm.We also assume the existence of atoms for.the domains [15].Each attribute will then be a finite set whose elements can appear in a transaction.
We will assume that the use of secondary indices is the only mechanism that will be used to facilitate the search for records in the file.Thus, we will consider here that the file is randomly stored in secondary memory and is not clustered according to any criteria.The approach of clustering records has been studied in the literature [11] but cannot be taken when we assume that the data base is used by several users with different requirements on it.Thus, the time required to retrieve n records will be the time to bring the pages which contain them.Since the records are assumed to be randomly distributed and that selection of an optimal set of indices will result in a small expected number of records to be accessed when processing a transaction, this time will be cn, for some constant c which measures the time required to bring a page to main memory.
Since the method for storing and retrieving indices is a fundamental characteristic of a system using secondary indices for processing transactions we will not impose any restriction on it.The only structure that we impose is that, if an index Ij for the j-th attribute exists, knowledge of a value a i Aj will produce a list of pointers to records (which we will in the sequel refer to as tid for tuple identifier) in time f(j,a) for some function f.For example, if the indices have the usual two level hierarchical organization and are stored on a disk, f(j,a) could be taken to be cj*C2(3j(a)N where Cj is the constant time required to find the head of the appropriate list of tid's, c2 is the transfer time per tid and j3j(a) gives the proportion of the total number N of records that have the value a as its j-th attribute.
Let N be the size of the file.Even though deletions and insertions will be allowed, we assume that the size of the file remains fixed throughout the period between successive computations of optimal sets of indices.For each attribute Aj we will assume a distribution /3j which gives the proportion 0j(a) of records in the file having a particular value a d Aj as its j-th attribute.(One simplifying assumption that sometimes is made is to consider /3j a constant function.) There will be four types of transactions conducted with the file: Queries, Updates, Insertions and Deletions.It is frequently the case that an Update or a Deletion is specified in two components: a selection component which determines a set of records to be processed, and an action component which, in the case of Updates determines how each record in the set is to be updated, and in the case of a Deletion is null, simply meaning that.that set of records which was found is to be removed from the file.As will be discussed below, finding the relevant records implies a number of actions.After they are done, we can no longer assume that any part of an index is in main memory so that the updating on the indices that results due to an Update or a Deletion can be assumed to be independent of the processing of the selection part of the corresponding transaction.For this reason we do not include a selection part in the formal specification of an Update or Deletioa A Query Q will be specified as Q = (qi,q2)->qm) where qj t Aj U {x}, x is a symbol not in any set Aj.If qj * Aj, we say that the i-th attribute is specified in the query.Note that a query to our model can result from a true query in the system, or an update or a deletion as discussed above.An actual query is sometimes accompanied by an output part which specifies that some attributes of the set of records found to satisfy the query are to be displayed or further processed.These operations take time independent of the set of attributes for which an index exists and thus can be safely ignored for our analysis, S We will assume, as is done in most systems, that a query is processed as follows: 1.
For each specified attribute j, for which an index exists, a list Lj is found containing all tids of records whose j-th attribute have the specified value.

2.
From all lists Lj, an intersection list L is formed L = fl Lj.This list contains all tids of records all of whose attributes for which an index exists and which are specified in the query have the specified value.These records will be referred to as partially qualified records.

3,
All partially qualified records are brought to main memory where the values of those specified attributes for which there is no index are checked Those records which are found not to satisfy the query are disregarded (these are sometimes referred to as false drops) and a list of qualifying records (or their tids) is obtained which can be further processed.As was mentioned before any further operation is independent of the chosen index set and will not be considered any further.
Let ctj(a) be the probability that the j-th attribute is specified in a query to have the value a d Aj.For convenience we will denote as pj the probability that the j-th attribute is specified in a query.Thus, The expected cost to process step 1 above is given by I I a:(a)f(j,a) j<D a<Aj J where D is the set of attributes for which an index exists.The cost of step 2 can be considered negligible as compared to the cost of 1 and 3.The reason for this is that the intersection list can be constructed in main memory and any processing time spent here is small compared with the cost of interacting with secondary storage.It is not hard to see that the expected number of tids in the resulting intersection list L is given by Step 3, as explained at the beginning of this section is proportional to the length of L, c|L|.The cost of removing the false drops to form the final list can be considered, as in step 2, to be negligible since it is done in main memory.Thus the expected cost to process a query is given by An Update U will be specified as U = (v:uj,u2,...,um) where each Uj £ Aj U {x} and v d F. The intended meaning for an update is that record v (which can be identified by its tid) has to be updated on those specified attributes Aj (i.e., those for which Uj / x) to have the new value Uj.
To process an Update, one has to (1) retrieve the record v, update its specified attributes, and store it back again and ( 2) update all relevant indices.The time required to perform (1) is independent of the existing set of indices and thus we do not consider it.
Updating the indices requires reading and writing back the bucket(#) for the old value of the attribute and doing the same for the new value.We will assume that both these operations are performed even if these buckets coincide.(As it turns out, consideration of this fact would only result in a slightly more complicated expression for the parameter K(i) associated to the i-th attribute (see Section 2), but does not otherwise change the nature of the algorithm to find an optimal set of indices.) Let f'(j,a) be the time required to read and write back the "a" bucket for the j-th index.Then, the expected cost of an update on the j-th index is &j*J W 1 a?A?i ,a ' ' ,(i ' a) * '' m where the first term inside the square bracket is the cost of updating the old bucket (assuming that the distribution of the number of tids on all buckets of the j-th index is given by (3j) and the second term is the cost of updating the new bucket.Also we assume a probability Pj(b) that the value b < Aj for the j-th attribute will be specified in an Update.Combining all expressions obtained above, we get that the expected cost per transaction is given by: E(D) = rqCg • ryCy • rjCj + rjCj where rq, ry, rj and rj are respectively, the probabilities that the transaction is a Query, an Update, an Insertion or a Deletion.This expression can also be written as , where jCD The problem of finding an optimal set of indices can be now formally defined as that of finding a smallest set D £ M = {l,2,...,m} which minimizes the above expression for E(D).

SECTION 2 Analysis of the Cost Function
A straightforward evaluation of E(D) for all subsets of M would certainly solve the optimization problem.We are interested in finding algorithms which take less than 2 m to obtain the optimal set.Another approach would be to construct chains of sets D0=4>,Dj,D2,...,D^, with each Dj+j a superset of Dj obtained by adding one more element of find a collection of locally optimal sets D (i.e., all sets D such that, for all j, E(D) < E(D U {j}) and E(D) < E(D -{j}).In general, by following this procedure we may not find the optimal solution among the collection of sets found.For example, assume an optimization problem There is a simple condition on a cost function, which we call the regularity condition, which suffices to guarantee that the above procedure will obtain the optimal solution among the collection of sets which finds.The condition states that, if while performing the procedure, a set D is reached and there is an index j i D which increases the cost function, then the index j can be ignored in any subsequent search from D. Formally, we have: Definition U Let E be a cost function defined on subsets of a set S of points.Let A(D,j) = E(D U{j}) -E(D).Then E is said to be regular if A(D',j) > A(D,j) for any point j and sets D, D' for which DeD' and j 4 D'.
Note that if E is regular and for some D and j i D, A(Dj) > 0 then, for all D £ D' with j { D' we have A(D',j) > 0.
The following lemma states that, for a regular function, the above procedure succeeds in obtaining an optimal solution.
Lemma J_: Let E be a regular cost function and D a locally optimal set (i.This assumption is justified since p^ = 0 or |Ak| = 1 imply F(k) = 0 and thus A(D,k) > 0 so the k-th attribute would not be part of any optimal solution.In the sequel, we will assume this to hold for all attributes).
Other classes of functions satisfying Lemma 2 (and thus the regularity condition) have appeared in the literature.In [3] the following cost function is studied in connection with an optimal allocation of copies of files in an information network with n nodes: where Uk depends only k<D K i=l 1 K on parameters associated to the k-th node in the network and Gj(D) « Xj min d^ where Xj is a constant associated with the i-th node and d^ k€D is a cost associated to the link between nodes i and k.E(D) is the cost associated to selecting the set D of nodes as information storage nodes.A result like Lemma 1 but specialized to this function was obtained.Since where x*'y-ifx>y then x-y else 0 , it follows that this cost function satisfies the conditions of Lemma 2. This implies that it is regular and Lemma 1 holds.Thus, Theorem 1 in [3] is obtained as a special case of Lemma 2.

SECTION 3
Since our cost function is regular we know that a depth first search as described in Section 2 will find the optimal solution.In this section we will show that we do not need to examine all possible nodes which could be reached during an unrestricted search.Thus the time required to find the optimal set will be reduced.This result will be obtained by characterizing properties of the optimal solutioa Definition: Let Aj,A2,...,Am be the set of attributes for our file.For each Aj, define a tuple (F(j),K(j)) where F(j) is defined as in the proof of Theorem 1 and K(j) = H(j)/F(j) and H(j) has also been defined.Thus we get a set S of m vectors, each having two components.
Let Sj = (F(i),K(i)), and sj = (F(j),K(j)) be two.suchvectors.A partial order < can be defined as follows: Sj < Sj iff F(i) < F(j) and K(i) > K(j).If Sj < Sj then Sj > Sj.A decreasing chain of points in S is a sequence sj,S2,...,sn such that Sj > S2 > S3 > ... > sn.The following theorems characterize the set of points in an optimal solution.
Theorem 2: Let S be a set of points as above.If Sj < S belongs to an optimal solution , then all points Sj * Sj such that Sj > Sj are also contained in an optimal solution.
Since Sj > Sj we have F(i) > F(j) and K(i) < K(j) (but Sj ^ sj so F(i) > F(j) or K(i) < K(j)).Thus, if A(D,j) < 0 (i.e., K(j) -G(D) < 0), then A(D) < 0, while if A(D,j) = 0, (i.e., K(j) = G(D)) then A(D) < 0 (if K(i) ^ K(j)) or A(D) = 0 (if K(i) = K(j)).In any case, the theorem is proved.| Theorem 2 says that if an optimal solution contains a point s, then all points in a decreasing chain ending in s are also part of an optimal solution.Using this fact, the search for an optimal solution can be organized as follows: Partition S into the smallest set of disjoint descending chains.Let w be the number of such chains.The set of candidates to be adjoined to a current partial solution D is obtained by considering the subset of independent points among the set of points which are maximal in each chain.Thus, at each step of the search, at most w points have to be considered.If mpr^-inr^ are the number of points in each chain, the maximal number of sets examined during the entire search will be less than (1 +mj)(1 +m2) ... (1 +mw) < (1 +m/w) w .Assuming that all points have different components, the value of w turns up to be the length of a longest increasing sequence in a permutation of m elements.There is no known expression for the average of this quantity but empirical studies [1] have shown that the asymptotic behavior is 2m®-^.(Recently, Steckin has shown [13] that this average is bounded above by em^).So, an upper bound for the average number of sets examined, assuming all permutations being equally likely is (1 +m 0 -5 /2) 2m05 = O(2 m0 -5 '°g m ).
The partial order defined in the definition above, has induced a precedence in the order in which points have to be examined.Theorem 2 established that this precedence was partial as nothing was established for independent points.Theorem 3 will provide conditions under which a precedence can be established for these independent pairs of points.Notice that if a precedence could be established for all independent pairs, then a total precedence would exist and a linear scan would determine the optimal set.Theorem 3: Let i, j be two independent points in S such that F(i) < F(j) and K(i) < K(j).If A(D,i) < A(D,j) for some D then A(D',i) < A(D\j) for any superset D' of D.
Proof: As in the proof of Theorem 2, we have for D' => D, Thus A(D') < A(D) < 0 as was to be shown.I Theorem 3 says that, if while performing the depth first search procedure, whenever two points in an independent set can be chosen to be included in a set D and the one with smaller value of the F function is preferable to the other (i.e., it decreases more the value of the cost function), it remains preferable at any later stage of the search, which extends the current set D. Thus, at some point during a partial search we may discover a precedence between two points in an independent set Using these results we may give the following informal description of an algorithm to find the optimal set of indices from a set S specified by tuples (F(i),K(i)).The algorithm keeps track of the precedence that exists between points.
1. Initialization (Dg is the current choice for global optimum, ogt is the lowest value of E(D0) obtained so far, R is a pushdown stack whose entries are pairs Dj,Pj, where Dj s S, Pj is a directed graph with at most |S| points): D0 «-<t >, opt = oo.Define an initial directed graph Pjnjt as follows: Nodes are all points i i S such that A(<j>,i) < 0. (Points i with A(4>,0 > 0 are never included in an optimal solution so they need not be considered.)Node i is directed to j if either F(i) £ F(j) and K(i) < K(j) (thus i > j and Theorem 2 applies) or F(i) < F(j), K(i) < K(j) and A(<fr,i) < A(*,j) (by Theorem 3).Let R *-(4>,Pinit).then (local optimal found is best so far) begin opt +-E(D); D0 <-D end else begin (let Source (P) be the set of all nodes in P having no ingoing directed edges.Note that Source (P) is an independent set.Let P' be the graph obtained by augmenting P by joining i i Source (P) to j i Source (P) whenever K(i) < K(j) and A(D,i) < A(DJ)).
For each i i Source (P), let R <-(D U {i},P'); (push the stack) end; end; As was mentioned above, an upper bound on the asymptotic average running time of a deterministic version of this algorithm is 0(2 m ^ '°8 m ) which is a big improvement over 2 m obtained by enumeratioa Empirical studies with it have shown that even this reduced upper bound is still much higher than the actual number of nodes visited.

CONCLUSIONS
The problem of index optimization has been solved under very general assumptions and properties of the optimal solution have been found which allows the existence of an efficient algorithm to determine the solution.It is easy to see that previously reported methods for solving this problem ( [7], [12], [14]) are special cases of the results shown here.

Thus, if we
have a set D of indices, the expected cost of an Update is given by CM = I I [VfiXa) • P:(a)] f(j,a) where u j(D a(Aj J J J V: = I P:(b) is the probability that the j-th attribute is specified in an update.J bcAj J # A bucket is the set of all tids of tuples having the same value on an indexed attribute.Insertions: An insertion is specified as I(v).It requires insertion of the record itself plus the updating of all indices.As was the case for updates the first cost is independent of the index set.Assuming a distribution of values given by 0j, the cost of the second component is j<D aCAj J Deletion: A deletion T(v) of the record v requires a similar set of operations as an insertion and the expression for the resulting cost is the same, Cj = Cj.
e., E(D U {k}) > E(D) and E(D -{k}) > E(D)).Then E(D0) > E(Dj) > ... > E(Dn) for all sequences of sets D0,Dj,...,Dn « D satisfying |Dj| « i. Proof: We have to show that for any subset D' of D, E(D' U {j}) < E(D') for all j < D-D f or equivalently, A(D\j) < 0. Assume the contrary, and consider the set D w = D -{j}.Clearly, D' £ D".By assumption, A(D',j) > 0, which implies, since E is regular, that A(D M ,j) > 0. This contradicts the fact that D is a local optimum.| Lemma 1 states that if D is a local optimum, it will be found by the procedure described above because for an^ chain 4> = D0,Dj,...,Dn = D with |Dj| = i we have that E(D0),E(Dj ^...jECDp) is a nonincreasing sequence.(We note'here that an analogous proof shows that Lemma 1 also holds if the sequence of sets is decreasing, i.e., S = D0,Dj,...,Dn satisfying |Dj| « |S| -i, so, in particular, a search which starts from S will also find the global optimum.There is a class of cost functions which includes our particular cost function, which are regular.They are characterized in the next definition and lemma.Definition 2: Let K be a function which maps subsets of S to a totally ordered domain with order relation given by < .Then K is said to be monotone nonincreasing (mni) if D c D' implies K(D') < K(D).Lemma 2: Let E be a cost function such that A(D,j) = E(D U {j}) -E(D) can be written as A(D,j) = A(j) -B(D,j) where, for each fixed j, B(D,j) is mni.Then E is regular.Proof: Let D', D be subsets with D £ D' and let j be a point j i D\ We have, by definition, A(D\j) -A(D,j) = [A(j) -B(D',j)] -[A(j)-B(D,j)] = B(D,j) -B(D\j).Since, for fixed j, B(D,j) is mni then D £ D' implies B(D',j) < B(D,j).So, A(D',j) > A(D,j) as required | Our first result shows that the cost function we are dealing with is regular.Theorem 1; Let E(D) -I H(j) + G(D) as defined in (1).Then E is regular.Proof: By definition, A(D,j) = H(j) • G(D U {j}) -G(D) = H(j) -(Pj -I a:(a)a;(a)) G(D) = H(j) -F(j)G(D) where a>^ /A j F(j) « p: -I cc:(a)a:(a).Since G is clearly mni, E is regular, by Lemma 2. | Since I a^(a) = pk, we have that 0 < F(k) < 1 (assuming > 0 and |A^| > 1. a(Ak Proof:Let Sj d D', an optimal solution.For a given pair Sj, Sj, letA(D) = A(D,i) -A(D,j).(Here, A(D,i) stands for E(D U {sj}) -E(D)).Assume sf / D\ Consider the set D = D' -{sp.Since D' is an optimal solution, A(D,j) < 0. Claim: It suffices to show that A(D,j) < 0 =*> A(D) < 0 and A(D,j) = 0 => A(D) < 0. This follows because, if A(D,j) < 0 then A(D) < 0 so that A(D,i) < A(D,j).Since we have assumed Sj i D', we get E(D U {SJ}) < E(D'), contradicting the optimality of D'.Thus Sj £ D', an optimal set.If, on the other hand, A(D,j) = 0 then A(D) < 0 and so, A(D,i) < 0. If A(D,i) < 0 we get a contradiction as before, and so Sj £ D', an optimal solution.Finally, if A(D,i) « 0, E(D U {SJ}) = E(D') which means D U {SJ} is also an optimal set, which again, proves the theorem.To see why the claim holds, A(D) = F