Additional file 2 of A method for identification of highly conserved elements and evolutionary analysis of superphylum Alveolata

Presents in detail all clusters found in the final graph by our algorithm. The clusters are ordered by their numbers in column A; the first cluster is a giant one (shown partially). The first line of each cluster is marked with fixed numbers in columns C–D and contains the number of vertices (words) of that cluster in column E. Each subsequent line corresponds to a word and contains the following data in columns A–J: the cluster number (A), the number of species in the cluster (B), the vertex degree (C), the vertex density, i.e., the number of graph parts this vertex is connected to (D), the species name (E), the sequence name (F), the start position of the word in the sequence (G), the word length (H), the DNA strand indicator (I), and the word itself (J). The part of the word shown in capital letters corresponds to the intersection of all words merged at this vertex (a group); lowercase letters correspond to the union of those words. If the word overlaps with both a gene and its coding sequence (CDS) according to the genome annotation available in GenBank, the word corresponds to a protein. In such cases, the gene data, including the protein description, are shown in columns K–O, and the CDS data in columns P–R. If the word overlaps with a gene but not with its CDS, it belongs to an untranslated region of the gene, such as an intron; in this case, only the gene data are shown. If a word is a fragment of a known non-protein-coding RNA according to the Rfam database, columns S–AB contain the RNA name and other data. The clusters that correspond to untranslated regions or unknown HCEs are highlighted in gold or blue, respectively, in column A. (XLSX 10204 kb)
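The column layout described above can be illustrated with a minimal Python sketch that maps the cells of one word line (columns A–J) to named fields and extracts the capitalized part of the word. This is not part of the published method. The names `WordRow`, `parse_word_row` and `conserved_core` are hypothetical, the example values are invented, and the sketch assumes that cluster header lines have already been filtered out and that the capital letters form one contiguous run.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class WordRow:
    """One word line of the table, columns A-J (field names are hypothetical)."""
    cluster: int     # A: cluster number
    n_species: int   # B: number of species in the cluster
    degree: int      # C: vertex degree
    density: int     # D: number of graph parts the vertex is connected to
    species: str     # E: species name
    sequence: str    # F: sequence name
    start: int       # G: start position of the word in the sequence
    length: int      # H: word length
    strand: str      # I: DNA strand indicator (kept as text; encoding not assumed)
    word: str        # J: the word itself


def parse_word_row(cells: Sequence[object]) -> WordRow:
    """Build a WordRow from the first ten cells of a row; later columns are ignored."""
    if len(cells) < 10:
        raise ValueError("expected at least 10 cells (columns A-J)")
    a, b, c, d, e, f, g, h, i, j = cells[:10]
    return WordRow(int(a), int(b), int(c), int(d),
                   str(e), str(f), int(g), int(h), str(i), str(j))


def conserved_core(word: str) -> Optional[Tuple[int, str]]:
    """Return (offset, text) of the first run of capital letters, or None.

    Assumption: the capitalized part (the intersection of the merged words)
    is a single contiguous run inside the lowercase union.
    """
    match = re.search(r"[A-Z]+", word)
    return (match.start(), match.group()) if match else None


# Invented example row; the values are not taken from the actual file.
row = parse_word_row([2, 5, 3, 1, "species_X", "seq_1", 1200, 13, "+",
                      "acgtTTGACCAtg"])
print(conserved_core(row.word))  # (4, 'TTGACCA')
```

The cells may come from any spreadsheet reader. Columns K–AB (gene, CDS and Rfam data) are deliberately left out of the sketch because their individual fields are not listed in the description.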