Succinct BWT-based Sequence Prediction



Introduction
Sequences of symbols (strings) are a type of data found in many domains. For instance, they can represent sequences of words in a text, events in a business process log, purchases made by customers, or points of interest visited by tourists. An important task in data mining is sequence prediction. Given a multiset of training strings (or sequences) D = {x_1, ..., x_d} defined over a finite ordered alphabet of symbols, sequence prediction consists of predicting the next symbol of the prefix of an unknown query sequence Q. The underlying assumption is that all the strings are created by the same underlying process. To perform sequence prediction, a predictor is first trained using the training strings; the predictor can then perform predictions.
Various sequence prediction models have been proposed, with various characteristics. They have been used in many domains to perform tasks such as predicting heart failure [18], human activities [19] and webpage prefetching [6]. Although numerous prediction models have been proposed, many are lossy models [3,9,15,16,21]. In other words, they discard information from the training sequences to build small models. The drawback of this approach is that information may be missing when it is time to make a prediction, which can result in low prediction accuracy [7]. Some models such as DG [15] also adopt simplifying assumptions, such as that each symbol of a string depends only on the previous one. This assumption often does not hold in real-life applications.
The aforementioned limitations of lossy predictors have recently been addressed by lossless models, which keep all information about the training sequences in memory to perform more accurate predictions. The assumption is that a lossless model should be more accurate because it can use all the available information to make each prediction. One of the best models of this type is CPT [7], which was later extended as CPT+ [6]. These models store training sequences in a trie-based structure, and were shown to be more accurate than multiple state-of-the-art lossy models. However, CPT/CPT+ have several important drawbacks:

- To perform a prediction, the CPT/CPT+ models use the bag-of-words model, which does not consider the order between symbols. But for some domains, the order is important.

- The CPT/CPT+ models require choosing several dataset-specific parameters. The prediction accuracy can vary greatly depending on how these parameters are set. Setting them is not trivial and requires background knowledge or a trial-and-error search for optimal parameter settings.

- All lossless predictors end up storing the entire training data in main memory, so it is essential that a lossless predictor stores it space-efficiently. We use the following variables to denote the size of the sequence database D: d is the number of sequences, M is the total length of all the sequences and σ is the alphabet size. We note that the information-theoretic lower bound for storing D is M log σ bits in the worst case. On the other hand:
  • CPT+ uses σ bit-strings of length d to represent the sets of symbols contained in each sequence. This alone takes dσ bits, which can be much larger than M log σ bits if σ is large.
  • CPT+ stores the training dataset in a trie. In the worst case, there can be Ω(M) trie nodes, and each trie node contains three (64-bit) pointers, a significant overhead.
  • CPT+ uses ideas such as Patricia compression and replacing frequently occurring subsequences by a single symbol to try to minimise the number of trie nodes [6]. However, success is unpredictable, and the frequent pattern mining slows down the training phase.

- During the prediction phase, given a query Q of k symbols, CPT+ performs several bitwise ANDs of up to k bit-strings of length d each to find sequences containing a subset of the symbols in Q. The number of such subsets can be as large as 2^k. In practice, many fewer than 2^k combinations are tried, and the constants in the O(·) notation are small. However, as we show, the query time of CPT+ grows linearly with d.
This paper addresses the drawbacks of the CPT/CPT+ models by proposing a novel sequence predictor named SuBSeq. This model adopts the succinct wavelet tree data structure and the Burrows-Wheeler Transform to store training sequences in a very compact way, while still allowing fast access to the training sequences for prediction. An experimental evaluation shows that SuBSeq has a very low and predictable memory consumption (the space usage varies between 1.6 and 2.2 times the binary size of D) and excellent accuracy when compared to state-of-the-art predictors on real datasets. Last but not least, SuBSeq is largely parameter-free.
The rest of this paper is organized as follows. Section 2 introduces preliminaries about sequence prediction. Section 3 presents the proposed SuBSeq predictor. Section 4 presents the performance evaluation. Finally, a conclusion is drawn and future work is discussed.

Preliminaries
Strings. A string X is a sequence of |X| = n symbols drawn from a constant ordered alphabet of size σ. For i = 0, ..., n−1 we write X[i..n−1] to denote the suffix of X of length n−i, that is X[i]X[i+1]...X[n−1]. We will often refer to the suffix X[i..n−1] simply as "suffix i". Similarly, we write X[0..i] to denote the prefix of X of length i+1. We write X[i..j] to represent the substring X[i]X[i+1]...X[j] of X that starts at position i and ends at position j.
In this paper we consider a multiset D = {x_1, ..., x_d} of d training strings. We represent D as a single string by concatenating the strings of D into one string D = x_1$x_2$...$x_d, using a special symbol $ to delineate the individual strings, which does not occur in any string x_i. We let M = |D| denote the length of D.
Suffix Arrays. We make use of several standard data structures built from D. The first of these is the suffix array [10], denoted SA, which is an array SA[0..M−1] containing the starting positions of the M suffixes of D in lexicographical order: suffix SA[0] is the lexicographically smallest, followed by suffix SA[1], and so on. The Burrows-Wheeler Transform (BWT) of D is the string L defined by L[i] = D[SA[i]−1], where D[−1] is taken to be the last symbol of D.

Backward Search. The FM-index is a compressed text index (see [13]) that consists of two main components: a wavelet tree built from the BWT string L, and an array C of σ integers such that C[c] gives the total number of symbols in L that are less than symbol c. Searching with an FM-index is based on a procedure called backward search, which finds the range of SA containing all suffixes that begin with a given query pattern Q. This range then contains the positions of the occurrences of Q in D. Figure 2 shows how backward search is used for counting the number of occurrences (the count query). In the algorithm, C[c] is the position of the first occurrence of the symbol c in F, the first column of the matrix of sorted suffixes, and the function rank_L is defined as rank_L(c, j) = |{i | i < j and L[i] = c}|. The main difference between the members of the FM-family is how they implement the rank_L function. The best ones use wavelet trees.
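To make the count query concrete, the following minimal sketch runs backward search over an FM-index built with the sdsl-lite library. The example text and all variable names are illustrative only; they are not part of SuBSeq.

#include <sdsl/suffix_arrays.hpp>
#include <iostream>
#include <string>

int main() {
    sdsl::csa_wt<> csa;                        // FM-index backed by a wavelet tree
    sdsl::construct_im(csa, "abracadabra", 1); // build in memory; 1 = byte alphabet
    std::string q = "bra";
    uint64_t b = 0, e = csa.size() - 1;        // SA range [b, e], initially all suffixes
    // Process the pattern right to left, shrinking the SA range at each symbol.
    for (auto it = q.rbegin(); it != q.rend(); ++it)
        if (sdsl::backward_search(csa, b, e, (unsigned char)*it, b, e) == 0)
            break;                             // pattern does not occur in D
    std::cout << "occurrences: " << (e >= b ? e - b + 1 : 0) << std::endl;
}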
Wavelet Tree. The wavelet tree [12] of a string D over an alphabet Σ is a binary tree with leaves labelled by the symbols of Σ. Each node v is associated with the subsequence of D consisting of those symbols that appear in the subtree rooted at v. The associated strings are not stored; instead each internal node v stores a bitvector B(v) that tells, for each symbol in the associated string, whether it belongs to the left or right subtree of v. In a wavelet tree the total length of the bitvectors is |D|⌈log |Σ|⌉ bits, which is exactly the length of D in bits using the standard representation.
A rank query rank_D(c, r) over a wavelet tree is evaluated by a traversal from the root to the leaf labelled by c. Wavelet trees answer rank queries in O(log σ) time. A similar procedure enables one to access a given symbol D[i] in O(log σ) time, or to enumerate all the distinct symbols in a range of the string, as well as compute the frequency of each of those symbols. Wavelet trees answer these distinct(i, j) queries in O(k log σ) time, where k is the number of distinct symbols in D[i..j]. Wavelet trees also support the query select(c, i) in O(log σ) time, which returns the position of the ith occurrence of symbol c in D. The queries rank, select, access, and distinct involve rank (or select) queries over the bitvectors stored on the root-to-leaf path. There are many data structures for representing bitvectors so that rank and select queries can be answered in constant time [14,17]. These data structures are a standard component in succinct data structure design. Recent experimental studies of these bitvectors can be found in [5,8].
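For illustration, the following sketch (again assuming sdsl-lite; the toy integer sequence is arbitrary) exercises the rank, select and access queries on a wavelet tree over an integer alphabet, as would be needed for a BWT string over a large alphabet.

#include <sdsl/wavelet_trees.hpp>
#include <iostream>

int main() {
    uint64_t vals[] = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
    sdsl::int_vector<> seq(11);
    for (size_t i = 0; i < seq.size(); ++i) seq[i] = vals[i];
    sdsl::wt_int<> wt;
    sdsl::construct_im(wt, seq);               // build the wavelet tree in memory
    std::cout << wt.rank(7, 5)   << "\n";      // occurrences of 5 in seq[0..6] -> 1
    std::cout << wt.select(2, 5) << "\n";      // position of the 2nd 5 -> 8
    std::cout << wt[4]           << "\n";      // access: seq[4] -> 5
}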

Succinct BWT-based Sequence prediction model
The Succinct BWT-based Sequence prediction model (SuBSeq) is a new lossless predictor. Its main distinctive characteristics are that (1) it efficiently stores the entire input training data without any loss; (2) it fetches training sequences similar to a given sequence (query prefix); (3) it does not depend on any parameter fine-tuning in order to be accurate; and (4) it takes into account the item order of a given query prefix. The latter is the key difference from the CPT+ prediction model: CPT+ searches for sequences using the bag-of-words model, which does not take the order of the items of a prefix into account when matching it against the training data (an aspect that can be important for some application domains, as discussed).

Algorithm description
The SuBSeq prediction algorithm consists of two main phases: the training phase and the ready-for-prediction phase. A multiset D of training sequences is given as input. During the training phase, SuBSeq uses D to produce the FM-index and stores the BWT in memory using a wavelet tree. During the ready-for-prediction phase, SuBSeq is ready to answer query prefixes. The answers that SuBSeq returns can further be evaluated against the query suffix (see Section 4.2).
For every query prefix, SuBSeq tries to give an answer by finding similar sequences in its training data. This is done through the given query prefix and a generated collection of sub-queries. Because SuBSeq can only locate exact matches of a given pattern in its training data, a mechanism is needed that expands the coverage of the prediction model to more of the training data; the collection of sub-queries plays this role. Every sub-query is derived from the initial query prefix by allowing deletion and substitution operations: deletions are always at the start of the (sub-)query, and substitutions are limited to two.

After SuBSeq has found the similar sequences, it uses them to produce possible answers and eventually orders them according to a weight. Possible answers are produced from the consequents of the similar sequences. The consequent of a similar sequence s is the subsequence of s from the item common to both s and the current (sub-)query up to the last item of s. For SuBSeq we use consequents of length up to two items. Every time a (sub-)query is used to find similar training sequences, we collect consequents. The items of the consequents are put into a Frequency Array and are ordered by a weight; the final prediction answer is the item in the array with the highest weight value. The final answer is given either (a) when SuBSeq has collected all possible consequents for the initial query prefix and all of its sub-queries, or (b) when a confidence threshold is met.
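The following sketch enumerates a query prefix and its sub-queries under the rules described above (front deletions, at most two substitutions). The WILDCARD marker, the names and the enumeration order are our own illustration; SuBSeq's actual generation code may differ.

#include <vector>
#include <cstdint>
#include <cstddef>

const uint64_t WILDCARD = ~0ULL;  // hypothetical substitution marker

std::vector<std::vector<uint64_t>> sub_queries(const std::vector<uint64_t>& q)
{
    std::vector<std::vector<uint64_t>> out;
    for (size_t del = 0; del < q.size(); ++del) {           // drop del front items
        std::vector<uint64_t> base(q.begin() + del, q.end());
        out.push_back(base);                                // no substitution
        for (size_t i = 0; i < base.size(); ++i) {          // one substitution
            std::vector<uint64_t> s1 = base;
            s1[i] = WILDCARD;
            out.push_back(s1);
            for (size_t j = i + 1; j < base.size(); ++j) {  // two substitutions
                std::vector<uint64_t> s2 = s1;
                s2[j] = WILDCARD;
                out.push_back(s2);
            }
        }
    }
    return out;
}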
Finally, when an item of a consequent is inserted into the Frequency Array, it is assigned a weight value; if the item already exists in the array, the new value is added to the old value. The weight formula is defined as w = y/Y + (2 − sub)/2 + 1 + r, where y is the sub-query length, Y is the initial query length, sub is the number of substitutions, and r = 1/(index + 1), with index denoting the position of the item in the consequent.
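A minimal sketch of this weighting and of the Frequency Array update follows; the container and function names are ours, not SuBSeq's (see github.com/rafkt/SUBSEQ for the actual code).

#include <unordered_map>
#include <vector>
#include <cstddef>

using Symbol = unsigned long;

// w = y/Y + (2 - sub)/2 + 1 + r, with r = 1/(index + 1).
double consequent_weight(std::size_t y,     // length of the sub-query
                         std::size_t Y,     // length of the initial query prefix
                         std::size_t sub,   // number of substitutions used (<= 2)
                         std::size_t index) // position of the item in the consequent
{
    double r = 1.0 / (index + 1);
    return double(y) / Y + (2.0 - sub) / 2.0 + 1.0 + r;
}

// Accumulate a consequent's items into the Frequency Array; an item that
// already exists has the new weight added onto its old value.
void add_consequent(std::unordered_map<Symbol, double>& freq,
                    const std::vector<Symbol>& consequent,
                    std::size_t y, std::size_t Y, std::size_t sub)
{
    for (std::size_t i = 0; i < consequent.size(); ++i)
        freq[consequent[i]] += consequent_weight(y, Y, sub, i);
}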
The backwardSearch can be implemented by tweaking FM-Count (see Figure 2) to return the range (b, e) after processing one query item at a time.
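A sketch of this per-item step, assuming a wavelet tree wt over the BWT string L, the C array of Section 2, and a half-open range convention of our choosing:

#include <sdsl/wavelet_trees.hpp>
#include <vector>
#include <cstdint>

// One backwardSearch step: extend the current match on the left by symbol c.
// (b, e) is the current SA range, half-open, over the BWT stored in wt.
// Returns false if the extended pattern no longer occurs.
bool backward_step(const sdsl::wt_int<>& wt, const std::vector<uint64_t>& C,
                   uint64_t c, uint64_t& b, uint64_t& e)
{
    b = C[c] + wt.rank(b, c);  // rank_L(c, b)
    e = C[c] + wt.rank(e, c);  // rank_L(c, e)
    return b < e;              // non-empty range iff the pattern still occurs
}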
The forwardSearch does the opposite of the backwardSearch: for a given index i into SA, it gives the index of the suffix that starts one symbol later in D (i.e., it maps suffix SA[i] to suffix SA[i]+1), which allows a matched training sequence to be read forwards from the match.

The neighbourExpansion constitutes the key function of our prediction model. Using the FM-index, one can only find exact matches for a given pattern. This creates a twofold issue: (1) there is no way to locate similar training sequences; (2) in sequence prediction, searching only for exact matches usually does not give enough coverage (if any) for confident predictions. The main idea of neighbour expansion is that, for a given query prefix, a normal backwardSearch is performed if the prefix has no substitutions in place; for every substitution that it meets, it recursively expands to all possible symbols that might occur there. Taking our previous example of sub-queries, Q_3, assume that each of the symbols {a, b, c, d} appears before [c, d] in the training data; this can be discovered with a distinct call over a range of L. Then Q_3 is expanded to [a, a, c, d], [a, b, c, d], [a, c, c, d] and [a, d, c, d], each of which undergoes a normal backwardSearch.
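A recursive sketch of neighbour expansion under the same assumptions as the previous sketch, with substitution positions pre-marked by a hypothetical WILDCARD value; the distinct symbols of a BWT range are enumerated with sdsl-lite's interval_symbols, which also returns the ranks needed for each branch:

#include <sdsl/wavelet_trees.hpp>
#include <vector>
#include <utility>
#include <cstdint>

const uint64_t WILDCARD = ~0ULL;  // hypothetical substitution marker

// Process the (sub-)query right to left; at a WILDCARD, branch on every
// distinct symbol in the current BWT range (the distinct query of Sec. 2).
// Initial call: neighbour_expand(wt, C, q, q.size(), 0, wt.size(), out);
void neighbour_expand(const sdsl::wt_int<>& wt, const std::vector<uint64_t>& C,
                      const std::vector<uint64_t>& q, size_t pos,
                      uint64_t b, uint64_t e,
                      std::vector<std::pair<uint64_t, uint64_t>>& out)
{
    if (pos == 0) { out.emplace_back(b, e); return; }  // whole (sub-)query matched
    uint64_t c = q[pos - 1];
    if (c != WILDCARD) {                               // ordinary backward step
        b = C[c] + wt.rank(b, c);
        e = C[c] + wt.rank(e, c);
        if (b < e) neighbour_expand(wt, C, q, pos - 1, b, e, out);
        return;
    }
    // Substitution position: enumerate the k distinct symbols in BWT[b..e).
    uint64_t k;
    std::vector<uint64_t> cs(wt.sigma), rb(wt.sigma), re(wt.sigma);
    wt.interval_symbols(b, e, k, cs, rb, re);          // symbols and their ranks
    for (uint64_t t = 0; t < k; ++t)                   // one branch per symbol
        neighbour_expand(wt, C, q, pos - 1,
                         C[cs[t]] + rb[t], C[cs[t]] + re[t], out);
}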
The getConsequents utilises the forwardSearch routine to obtain the consequents for the ranges acquired through the neighbourExpansion. Expanded sub-queries that result in patterns that have already been used are excluded. We do this by utilising a bit-vector of length M: every index of a successful neighbourExpansion range is a set bit in the bit-vector. Thus, consequents from sub-queries that have been used before are not re-used, and only new consequent information is added to the Frequency Array.
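A sketch of this deduplication, assuming sdsl-lite's bit_vector; extract_consequent is a hypothetical helper and the surrounding logic is elided:

#include <sdsl/bit_vectors.hpp>
#include <cstdint>

sdsl::bit_vector used;  // initialised once as used = sdsl::bit_vector(M, 0);

// sa_pos is an SA index inside a range returned by neighbourExpansion.
void collect(uint64_t sa_pos)
{
    if (used[sa_pos]) return;      // this position has already contributed
    used[sa_pos] = 1;
    // extract_consequent(sa_pos); // hypothetical: read forwards with forwardSearch
}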
A C++ implementation of our prediction model can be found at github.com/rafkt/SUBSEQ.

Evaluation
This section covers the experimental environment, our experimental aims, the competitors to our prediction model, and finally a discussion of the accuracy and performance evaluation. Full details about our experimental data and results can be found at github.com/rafkt/SUBSEQ.

Experimental Setup
Environment. Experiments were performed under macOS 10.14.1 on an Intel Core i7 (4 cores, 256KB L2 per core, 8MB L3), with 32GB of DDR3 1867MHz RAM and an 8.0 GT/s link speed SSD. The lossless predictors CPT+ and CPT were run using the IPredict framework [6] under Java version 1.8.0_112 with JIT enabled, which compiles the bytecode into native machine code, allowing a fair comparison with native implementations. The SuBSeq predictor was compiled under clang-1000.11.45.5, while the SPiCe baseline [1] was compiled and run under Python 2.7.10. We used the sdsl-lite library [4] to implement SuBSeq.
Aims. To measure and compare different prediction models in terms of their accuracy and their performance. Performance is measured in terms of the execution time a prediction model needs to train itself, the execution time it needs to answer a complete testing set, and the memory it occupies after the training phase is complete.
Data. For our experiments we used datasets with various characteristics from the SPMF library. In addition, we used synthetic data generated by the IBM QUEST data generator [20].

Accuracy of prediction
Each dataset is read into memory and then split into a training set and a testing set using k-fold cross-validation. Once a predictor has been trained, each sequence of the testing set is split into two parts, the query prefix and the query suffix; the size of each can be set through a parameter in advance. The trained prediction model is then asked to give an answer for every prefix in the testing set. A prediction answer for a query prefix is counted as accurate if it appears within the query suffix. The accuracy rate is the ratio of accurate predictions to the total number of test sequences. Each prediction model has been trained and tested using k-fold cross-validation with k = 14 to obtain a low variance for each run.
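For clarity, the following sketch computes the accuracy rate as just described; Sequence, the split lengths and predictor.predict are hypothetical stand-ins for the framework's actual interfaces:

#include <vector>
#include <algorithm>
#include <cstddef>

using Sequence = std::vector<unsigned long>;

// Assumes prefix_len + suffix_len <= s.size() for every test sequence.
template <class Predictor>
double accuracy(Predictor& predictor,
                const std::vector<Sequence>& test_set,
                std::size_t prefix_len, std::size_t suffix_len)
{
    std::size_t correct = 0;
    for (const Sequence& s : test_set) {
        Sequence prefix(s.begin(), s.begin() + prefix_len);
        Sequence suffix(s.begin() + prefix_len,
                        s.begin() + prefix_len + suffix_len);
        unsigned long answer = predictor.predict(prefix);  // hypothetical API
        if (std::find(suffix.begin(), suffix.end(), answer) != suffix.end())
            ++correct;                                     // hit inside the suffix
    }
    return double(correct) / test_set.size();
}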
Accuracy results are shown in Table 1. Our prediction model provides better accuracy than any other lossy predictor for the SIGN, KOSARAK and FIFA datasets. At the same time, we can observe that SuBSeq has better overall accuracy than any predictor for MSNBC and BIBLE CHAR. Moreover, if we take into consideration the accuracy variation of CPT+ (shown in the CPT+ column of Table 1 as a [min-max] range) over its different possible parameter settings, then SuBSeq provides better overall accuracy for KOSARAK and FIFA as well. Thus, CPT+ becomes less competitive when it is not finely tuned, making SuBSeq more attractive.

The memory of SuBSeq was measured using the relevant API of the sdsl library; the memory of the other predictors was measured through IPredict. We compared the different prediction models through the ratio of their memory usage to the binary size of the training set. As shown in Table 2, SuBSeq is the most consistent and most memory-efficient prediction model: its memory usage averages at most 2.2 times the binary size of the input training set. Prediction models like TDAG and CPT+ appear to be highly inconsistent: TDAG can use between 70 and 2500 times the input binary size, and CPT+ between 0.5 and 80 times, indicating unpredictable performance.

The running time of SuBSeq was directly compared to that of CPT+ for various datasets (Figure 3c) with respect to the testing phase (and training phase). The evaluation also included input data of increasing σ, n and d generated with the QUEST generator. The results show competitive and consistent performance for SuBSeq in comparison to CPT+.

Optimisation discussion
Our current implementation of SuBSeq is not yet fully optimised. Experimental evaluation showed that 90% of the time SuBSeq needs to answer a query is spent on neighbour expansion. Further experiments revealed that most of this time goes into rank calls in the wavelet tree. Thus, preventing neighbour expansion from performing excessive rank calls would improve the speed of SuBSeq on datasets with a large σ. Figure 3c shows that for a dataset like KOSARAK (σ = 654,987), SuBSeq's performance is less competitive. One way to minimise excessive rank calls is to store (and retrieve) each rank result in (and from) a trie-based data structure.

Conclusion
Lossless sequence predictors are often very accurate but can consume a large amount of memory. To address this issue, this paper presented a novel predictor named SuBSeq that is lossless and utilizes the succinct wavelet tree data structure and the Burrows-Wheeler Transform to compactly store and efficiently access training sequences for prediction. Experimental results have shown that SuBSeq has a very low and predictable memory consumption (varying between 1.6 and 2.2 times the binary size of D) and excellent accuracy in comparison to state-of-the-art predictors on real datasets. Moreover, SuBSeq is mostly parameter-free. Future work includes optimising SuBSeq's neighbour expansion along with its overall speed performance.