GaKCo: a Fast Gapped k-mer string Kernel using Counting - supplementary information, code and data

This data record consists of supplementary information files and all datasets and code used in the ECML PKDD 2017 paper <b>GaKCo: a Fast Gapped k-mer string Kernel using Counting</b> (README included).<div><br></div><div><div>GaKCo is a fast and naturally parallelizable algorithm for gapped k-mer based string kernel calculation. GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm scales to larger dictionary sizes and larger numbers of mismatches.</div><div><br></div><div><b>GaKCo_ECML17_Supplementary.pdf</b> - presents schematics of the GaKCo algorithm, a formal proof of the Hamming distance property, a justification of GaKCo's sort-and-count method, connections to previous studies, details of the datasets, an overview of the empirical performance of GaKCo versus neural networks, and other related experiments.</div><div><br></div><div><b>data.zip</b> - contains 38 training and testing nucleotide and peptide datasets in <b>.fasta</b> format: a text-based format in which base pairs or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. Protein, DNA and text dictionaries are provided in <b>.txt</b> format. Both formats can be opened with any text editor.</div><div><br></div><div><div><b>Datasets for GaKCo:</b> </div><div>We perform 19 different classification tasks to evaluate the performance of GaKCo. 
These tasks fall into three categories: (1) TF binding site prediction (DNA dataset), (2) remote protein homology prediction (protein dataset), and (3) character-based English text classification (text dataset).</div><div><br></div><div><b>code.zip</b> - contains C++ source files (<b>.cpp</b>, <b>.h</b>) and a bash script (<b>.sh</b>) to compile GaKCo using g++ with OpenMP, and to obtain kernel output from the data files and dictionaries above. See below for further detail.<br></div><div><br></div><div>Compiling GaKCo (with OpenMP): </div><div>```</div><div>g++ -c GaKCo.cpp -o GaKCo -fopenmp</div><div>```</div><div>To get kernel output: </div><div>```</div><div>GaKCo </div><div>#User options: > 5, =(0,..,g-1), = 0 for single-thread/ 1 for multithread</div><div>```</div><div>Bash script to run end-to-end kernel calculation:</div><div>```</div><div>processing.sh</div><div>```</div><div><br></div><div><br></div><div><b>Background</b></div><div>String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slowly when we increase the dictionary size (<i>Σ</i>) or allow more mismatches (M). This is because the current gk-SK uses a trie-based algorithm to calculate the co-occurrence of mismatched substrings, resulting in a time cost proportional to <i>Σ</i><sup>M</sup>. We propose a fast algorithm for calculating the Gapped k-mer Kernel using Counting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger <i>Σ</i> and M, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the <i>Σ</i><sup>M</sup> term that slows down the trie-based approach. 
Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and is faster by factors of 2, 100, and 4 on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets), respectively.<br></div></div></div>
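<div><br></div><div>To make the counting idea in the Background concrete, the following is a minimal C++ sketch of a gapped k-mer kernel computed by counting features in an associative array (<b>std::map</b>). It is not the code shipped in code.zip and it does not implement GaKCo's sort-and-count or cumulative counting steps: it naively enumerates every choice of k kept positions inside each window of length g, so it is only suitable for small g. The function names <b>countGappedKmers</b> and <b>gappedKmerKernel</b>, the '*' wildcard, and the parameter names are assumptions made for illustration.</div>

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not taken from code.zip): count every gapped k-mer of
// `seq`. A feature is a length-g window in which k positions keep their
// character and the remaining g - k positions become the wildcard '*'.
// Assumes '*' does not occur in the sequence alphabet.
std::map<std::string, long> countGappedKmers(const std::string& seq,
                                             std::size_t g, std::size_t k) {
    assert(g > 0 && g < 32 && k <= g);   // masks are stored in 32-bit integers
    std::map<std::string, long> counts;  // associative array: feature -> count
    if (seq.size() < g) return counts;   // no full window fits

    // Every g-bit mask with exactly k bits set is one choice of kept positions.
    std::vector<unsigned> masks;
    for (unsigned mask = 0; mask < (1u << g); ++mask) {
        if (std::bitset<32>(mask).count() == k) masks.push_back(mask);
    }

    // Slide a window of length g and emit one feature per mask.
    for (std::size_t start = 0; start + g <= seq.size(); ++start) {
        for (unsigned mask : masks) {
            std::string key(g, '*');
            for (std::size_t i = 0; i < g; ++i) {
                if (mask & (1u << i)) key[i] = seq[start + i];
            }
            ++counts[key];
        }
    }
    return counts;
}

// Hypothetical helper: kernel value as the inner product of the two
// gapped k-mer count vectors, i.e. the sum of count_x * count_y over
// features that occur in both sequences.
long gappedKmerKernel(const std::string& x, const std::string& y,
                      std::size_t g, std::size_t k) {
    const std::map<std::string, long> cx = countGappedKmers(x, g, k);
    const std::map<std::string, long> cy = countGappedKmers(y, g, k);
    long total = 0;
    for (const auto& entry : cx) {
        const auto it = cy.find(entry.first);
        if (it != cy.end()) total += entry.second * it->second;
    }
    return total;
}
```

<div>For example, under these assumptions <b>gappedKmerKernel("ACGT", "ACCT", 3, 2)</b> returns 2, because the two sequences share only the features AC* and C*T. The number of masks grows as the binomial coefficient of g and k, which is the cost that a counting scheme over sorted substrings is meant to keep manageable.</div>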