Prediction of potential barcoding sites on ITS1 by wavelet transform

For sustainable development, biodiversity conservation and life-quality improvement must be simultaneously considered. Molecular techniques have greatly impacted biotechnology. These methods have, in particular, improved the capability to investigate the fine differences among organisms and, as a consequence, to better investigate the effects on environmental factors on them. We propose an approach to support the optimal selection of molecular probes for barcoding application in many biotechnological fields. The aim of our work is specificity maximization. To this purpose, we have integrated a filter system based on wavelet transforms with biological knowledge about the sequence proneness to mutation and post-translational modification. Specifically, we have tested the proposed method on ITS1 sequences that are a region of the rRNA locus. Our analysis has shown the presence of other local relative stable conformations in addition to known cleavage site. Their characteristics differ within the group of mammals selected for our analysis. These variations could be used to design new species-specific barcoding probes or other quick molecular screening tools.


Introduction
Sustainable development is one of the major present-time challenges. In this perspective, biodiversity conservation and life-quality improvement must be simultaneously considered. The capability to rapidly detect, not only on human but also in crops and domestic animals, the changes induced by environmental stress factors is fundamental to predict the long-term sustainability. In a human health perspective, great importance have acquired the rapid screening of drug-resistant pathogens, genetically modified organisms in the food and chronic exposure to environmental stress factors have acquired great importance for human health.
The new molecular techniques have deeply impacted many biotechnological fields allowing to make available several biocontaminant-specific screening tools. The reliability of these systems is often constrained by several factors, such as molecular cross-reactivity or lack of tool sensitivity. Many new tools are founded on nucleic acid analysis (Meng et al., 2015). The maximal specificity of nucleotide probes is mandatory to maximize the screening efficiency. International research consortia are aiming to identify species-specific probes such as DNA barcoding (Scheffers, Joppa, Pimm, & Laurance, 2012). The current process to select the probe is mainly based on individual expertise or on computational systems that do not always take into account sequence variations. We have followed an approach aiming to support the design of application-speci-fic molecular sensors that include two additional genomic characteristics: the single nucleotide variations and methylation proneness. Taking the previous constrains into account, we have selected the locus coding ribosomal RNA (rRNA) as a test application for our proposed method.
The rRNA is one of the most evolutionary conserved gene loci (Coleman, 2015) because ribosomes are functional machineries widespread in all biological systems. Ribosome has a very complex structure and its assembly process requires a stringent of control. The ribosome structure is composed of two subunits: the large subunit (LSU) and the small subunit (SSU). The rRNA genes are transcribed in a long and complex messenger. In a simplified view, its architecture contains a 5′ externally transcribed sequence (5′ETS), two internal transcribed sequences (ITS) indicated as ITS1 and ITS2, and a 3′ externally transcribed sequence (3′ETS). These domains disentangle the regions' coding for 18S, 5.8S, and 28S ribosomal proteins. The sequence that contains all these domains is called 47S rRNA.
The region, embedding rRNA genes for SSU, is flanked downstream by ITS1 domain. The LSU's genes are located downstream this region. The rRNA showed a different evolutionary speed in the coding genes compared with transcribed spacer regions. Their degree of conservation is strongly dependent on their secondary structure, related to their compositional characteristics ( Figure 1). In eukaryotes, the ITSs, mainly ITS2, present a large conformational variability that is used as an evolutionary marker (Wolf, 2005). The length of ITS1s, in the majority of eukaryotes, is less than 500 bp, whereas the mammalian ITS1 sequences largely deviate from the one of other vertebrates. This study is focused on a set of mammalian ITS1s that are characterized by a length of about 1 kb. From the structural point of view, ITS1s are characterized by the presence of several local conformations that seem to be equally spaced (Coleman, 2015). Both ITSs are fundamental elements for the regulation of the enzymatic activity necessary to process the whole rRNA messenger. Several biochemical studies have confirmed the existence of enzymatic cleavage sites in these regions; the majority of the findings about the kinetic of cleavage were obtained from ITS2, and only a relative limited number is obtained from ITS1. Each ITS domain appears to have a specific combination of cleavage sites. In our analysis, we have investigated the ITS1, taking into account the relative lack of knowledge about the whole profile of potential functional sites, in order to predict the existence of additional nucleotide features that can be involved in other control mechanisms of rRNA processing. It is well known that ITS1 has, near to its 5′ end, a well-structured stem-loop that could be involved in functional recognition of mRNA binding domain (Lebaron et al., 2012). The most current data about the ITS1 architecture and function are inferred from yeast analysis. It is very plausible than higher vertebrates could have some small different locally ordered structures that could have a role in the enzymatic processing. The identification of these differences is important to improve the design of new screening tools. Experimental findings have confirmed the presence of an AC-rich region encompassed between the first and second local ordered conformation, starting form 5′end. A similar spacer domain is also present in the region between the second and third local conformation. It is important to underline the presence of additional putative biological functional regions in the ITS1 (Preti et al., 2013). Different bioinformatics tools are available for functional nucleotide feature characterization. Among them the signal processing of biosequence was initially applied (Felsenstein, Sawyer, & Kochin, 1982) mainly to predict boundary regions between exons and introns (Cheever, Searls, Karunaratne, & Overton, 1989) or nucleosome position (Uberbacher, Harp, & Bunick, 1988). The advances of sequencing projects disclosed new possibilities of applications of digital signal processing (DSP) methods to address the new requirements deriving from high-throughput techniques.

Methods
We propose an integrated method combining signal processing with structural and biological information. In our analysis, we correlated the nucleotide domains, identified by wavelets, with their mutational proneness and conformational properties. The analytical protocol requires three different steps: (a) prediction of mutational load, (b) DSP phase, and (c) determination of conformational properties.

Structural bioinformatics analysis of ITS1
A survey of currently available genomic sequencescombined with a literature search and with the outcome of microarray analysishas allowed to identify the rRNA as the best element to apply a DSP approach for the identification of species-specific domains that can be used to design new highly efficient molecular probes. We have downloaded the available mammalian ITS1 sequences from NCBI; Table 1 summarizes the information about the data-set that we have considered.
In order to obtain a structural landscape of ITS1, we have applied specific tools to obtain compositional properties and to predict the folding of ITS1. In addition, we have estimated the average mutational pressure by Kimura two-parameter distance (K2P) because it allows to separately consider transitional (Purine → Purine, Pyrimidine → Pyrimidine) and transversional events (Purine → Pyrimidine and vice versa) (Kimura, 1980). We have calculated K2P distance by Equation (1), where P is the transition probability and Q is the transversion probability. In our analysis, we have estimated the <K2P> for each mammal. This evolutionary measure was used as a parameter to select and rank the species-specific nucleotide domains filtered by wavelets. The retrieved sequences were checked for their position with respect to CpG-rich domains (CpG island). This information is valuable in the perspective to correlate the screened potential signals with methylation sensitivity. In order to relate the outcome of DSP analysis to a biological function, we searched the restriction enzyme recognition sites located in each ITS1. On the basis that these sequences are displaced in GC-rich domains (CpG islands), we have selected some enzymes that are capable to cut in these regions. Table 2 shows the enzyme acronyms with their specific cognate sites. The search of recognition sites was performed on the corresponding DNA of each ITS1 sequence.
The restriction enzyme database contains several hundreds of methylation-sensitive enzymes. We have selected some of them because the rRNA gene is characterized by a high content of CG. The compositional characteristics suggest the existence of possible sites for these kinds of restriction enzymes (Type II endonucleases). The restriction enzymes have a bacterial origin but they are commonly used to obtain a genomic map by their enzymatic cleavage (restriction map). We would underline that two enzymes in the list (MseI and DraI) seem to recognize an AT-rich site.
The functionality of a nucleotide domain is strongly related to its conformational properties, and for this reason, we have also predicted the folding of the regions identified by wavelets by using MFOLD.

DSP in bioinformatics
The search of functional motifs in nucleic acid sequences is one of the more ancient bioinformatics activities (Staden, 1994). Different computational algorithms were applied and, among them, the DSP was one of the first used. The DSP are mainly applied on the DNA analysis to find periodicities (Frenkel & Korotkov, 2008), to detect nucleotide repeats (Sharma, Issac, Raghava, & Ramaswamy, 2004) and to perform the exon/intron discrimination (Hall & Stern, 2004). The application of DSP methods to a nucleotide sequence is based on the translation of the bases into a numerical form that can be thought as a sampled numerical signal. To make this translation, normally a number is associated to each single nucleotide. Several criteria have been proposed in the past to allow the process the signal and decode the hidden information contained inside it. The most frequently used methods are based on representations of nucleotides by lexicographic order (A = 1, C = 2, T = 3, G = 4) or by binary codes based on the specific kind of nucleotide and on some of its properties, such as weakness or strength and number of H-bonds (Kwan & Arniker, 2009).
The Fourier transform methods are the more extensively used in these applications; conversely, the wavelets are less common. DSP was used, on nucleic acids, in a 'discriminative perspective', whereas it has been used to a lesser extent in an evolutionary perspective. We propose an application of wavelets to decipher the fine structure of mammals ITS1, in order to screen the differences among the species that can be used to design RNA-based probes for different diagnostic purposes. In this application, we have also chosen a coding method based on physical properties of nucleotide: the EIIP (Veljković & Lalović, 1973). Unlike lexicographic, ordinal or binary coding, the electron-ion interaction potential (EIIP) (Nair & Sreenadhan, 2006) takes into account the energy properties of the considered atoms. EIIP is the average energy of the electron located within the electron cloud of specific atoms of the nucleotide and it is based on the resonant recognition model (RRM) (Cosic, 1994;Pirogova, Simon, & Cosic, 2003), that takes into account mathematical and physical aspects of the nucleotide sequence. In a biology perspective, EIIP is more meaningful because it takes directly into account a physical property, while lexicographic methods are only based on representation of the presence and location of each nucleotide in the sequence. For these reasons, we feel that it is a very promising method for sequences conversion and processing. Specifically, EIIP is based on the average energy states of all valence electrons in each nucleotide, and the numerical sequence is obtained replacing each nucleotide in the sequence with the EIIP value for that nucleotide. EIIP for organic molecules can be determined by Equation (2).
where Z * is the average quasi-valence number determined by where n i is the number of atoms of the ith atomic component, Z i is the valence number of the ith atomic component, N is the total number of atoms, and m is the number of atomic components in molecule (Lalović & Veljković, 1990). Differently from the majority of bioinformatics application of DSP, we preferred to analyze the differences at dinucleotide level. The dinucleotides are the basic elements required for molecular recognition proteins as well other nucleic acids (Gupta & Gribskov, 2011). It is well known that recognition processes can be affected by the 'wobbling' pairing (Das & Duncan Lyngdoh, 2013;Mondal, Mukherjee, Halder, Bhattacharyya, & Woodson, 2015). This phenomenon was initially investigated in the tRNA codon-anticodon pairing. The wobbling seems to play a role in structural stability of other RNAs (Varani & McClain, 2000). In addition to this structural information, we have also considered the dinucleotide because of the need to estimate mutational pressure taking transition and transversion into account (Allen & Whelan, 2014). From a DSP perspective, the use of dinucleotides coding allows to improve the signal-to-noise ratio. It is important to point out that two sequences may have the same nucleobase composition, but they may differ in terms of dinucleotide frequency; these differences are shown in Tables 3 and 4. For our analysis, we have used the EIIP. The value in Rydberg (Ry) are listed in Table 3. The dinucleotide's EIIP is estimated by the sum of all atoms in the two nucleobases that form the dimer; therefore, the ones that are formed by the same bases have equal EIIPs (i.e. CA and AC; CU and UC).

Wavelet analysis
Wavelet transforms are defined as mathematical functions that provide a signal decomposition on a set of basis function derived from dilatation, contractions, and shift of a function called mother wavelet ψ that has to satisfy R w t ð Þdt ¼ 0. The functions obtained from the mother wavelet have a frequency-dependent width; they are narrow at high frequencies and broad at low frequencies. Using WT, it is possible to partition data into frequency components, and then study each component with a resolution matched to its scale. WT has the ability to detect and characterize singularities, which usually are shortlived components of a signal.
The WT of a signal x(t) is defined by where b is the space parameter and a is the scale parameter. Unlike Fourier transforms, WT transforms introduce a location information conversely to Fourier transform, with the capability of perform a space-scale analysis expanding signals in terms of wavelets. WT automatically provide the time (or spatial resolution) window: window width changes with increasing scale. Analyzing a signal with a wide window, it is possible to recognize bigger features, whereas smaller features became apparent by narrow windows.
Since the EIIP values are all positive, the sequence transformation introduces an intrinsic continuum component. In order to eliminate this component that does not carry any frequency information, we have filtered the numerical sequence before applying WT.
For the analysis reported here, we have used the Mexican hat mother wavelet because in order to best capture peak changes in the width and height, the mother wavelet should have the basic features of a peak. Mexican hat wavelet provides the best match doing the best results (Yanglan, Guobing, Jihong, & Guangwei, 2014). The Mexican hat mother wavelet is defined as  Figures 2 and 3.

Results
The previously described analytical protocol was applied on mammalian ITS1 sequences to test its validity. Here, we separately analyzed the results obtained from structural genomic analysis and DSP. The complete framework of biological signals located in ITS1 region is far from being fully analyzed. We retrieved all the available vertebrate ITS1 sequences in spite of that we focused this test analysis only those mammalian ones because they have a comparable length. In fact, the complete set of vertebrate sequences shows a wide variation in length that can influence the performance of DSP method. We guess that these differences could affect the wavelets discriminating capability in the multisequence comparison in which is necessary use zero value to adequate the shortest to the longest. The multisequence comparison by wavelets does not allow to correlate mutational propensity because the excess of null value can bias the estimate. The cladogram in Figure 4 shows result of mutational distance among the mammals in our test set and the average K2P distance, which summarized the relative mutational propensity of ITS1 for each mammal, are shown in Table 5.
In order to evaluate how mutational pressure can influence the functionality of the ITS1 domains, we have analyzed the predicted profile of restriction enzyme activity. The summary of this analysis is shown in Table 5.
The two enzymes, marked by red bold characters, are the most methylation-specific ones. In the first place, it is important to point out that the enzyme Bisl (Hsieh, Xiao, O'loane, & Xu, 2000) is able to recognize and operate an asymmetric cleavage of 5-methylcytosines displaced on both strands.
In order to check if there are differences in methylation level, we have searched the sites for another methylation-specific enzyme: GalI (Abdurashitov, Chernukhin, Gonchar, & Degtyarev, 2009). GalI, in spite of BsiI, is able to operate a symmetric cut of 5-methylcytosine facing each other on two strands.
Interestingly, the high number of cognate-sites by GalI is a clue evidence of a higher methylation level in primates and bovine ITS1 with respect to ITS1 of rodents. A general survey of the table indicates the species-specific differences in the distribution of enzymatic recognition sites that can be related to differences in the wavelets spectrograms.
This table supports the comprehension of spectrograms obtained by wavelets. We show the human and water buffalo spectrograms (Figures 2 and 3; Please see supplementary materials for the other analyzed sequences).
For each spectrogram, we have extracted the regions where the highest energy is shown. It can be noted that scale 25 is the one that intercepts the maximum number of regions with a significant length (not very narrow, not very broad). The regions for human and water buffalo are reported in Tables 3 and 4, respectively.
The WT have filtered the sequences but, as any other signal processing method, they require a further analysis to assign a biological meaning to the extracted  nucleotide features. Using MFOLD, we have associated the DSP outcome to the presence of a local ordered structure such as hairpins or stem-loops. Tables 6 and 7 report the folding in dot-bracket notation.

Discussion
The ribosome is the key functional element of translational process. The locus coding rRNA has a complex structure, and its assembly is strictly controlled. A small change, affecting the ribosome processing, can have dramatic consequences in protein production.
There are many evidences that ribosomal alterations are often related to the emergence of pathological conditions. The ITS regions are removed by enzymatic cleavage but their role in the whole mechanism of ribosome assembly is not yet completely clarified (Tafforeau et al., 2013). Specifically, the possible functional role of ITS1 is still unclear (Woolford, 2015). Some experimental results indicate enzymatic sites in ITS1 as critical element in the 18s-RNA processing though the D-cleavage site that is in the boundary region between 18s RNA and ITS1. The second cleavage site (site 2) is the interior of ITS1 and its role is related to the processing of rRNA. The current knowledge suggests that ITS1 is important in the selection of the pre-rRNA pathway.
This choice is probably controlled by other molecular elements such as ncRNAs or RNA processing enzymes. The presence of additional regulatory sites in ITS1 is not yet completely unraveled. One of the major interesting results of our wavelets analysis is the possibility to obtain a complete profile of all potential functional sites displaced along the sequence of ITS1. The method appears to be capable to recognize not only canonical   .))))))).′ −14.90 cleavage sites (Mullineux & Lafontaine, 2012), but also additional domains that could have a functional role. These nucleotide features, characterized by EIIPs, probably represent recognition sites, which are required for quality control of rRNA processing. Taking into account the current available knowledge, we can suggest that these domains may be required by non-coding RNA or that they may be regions in which the rRNA messenger is post-transcriptionally modified. These predictions may be a valuable support for the design of wet lab experiments for their functional validation. It is important to point out that the majority of molecular biology validation experiments to study rRNA processing are based on PCR-based techniques, primarily addressed to sequence hybridization. Hybridization experiments seem do not have a unique approach to design the probes. This could be related to the compositional context in which ITS1 is embedded. All ITSs, we have analyzed, are located in a CpG island that can increase the difficulty to design hybridization probes. The prediction of sensitivity to the action of methylation endonuclease, combined with the estimation of mutational pressure, can help to bypass these compositional constrains in species-specific probe design. A brief survey of available literature indicates the need to develop a more standardized design protocol. The emerging relevance of ITS1 in different biotechnological fields (Yeh, Tseng, Chang, Wu, & Tsai, 2014) encourages the development and application of more rational methods to design probes, especially in the case of a high compositional bias (Cross & Bird, 1995). Even though some cleavage sites, embedded in ITS1, have been experimentally identified, it is reasonable to consider the possibility that other recognition sites not yet validated are also be present. The high content of CG is a limiting factor for experimental screening for two different reasons: (a) the difficulty to design efficient amplification probes; and (b) the possibility that a probe with high content of GC could trigger cellular responses. In order to develop molecular screening tools that overcome the constrains above, we reckon that our proposed approach can be a valuable support to design synthetic oligomers (aptamers) (Roncancio et al., 2014) for different biotechnological applications, such as screening of OGM, food quality control, pathogen identification, and human non-infectious diseases.

Supplementary material
The supplementary material for this paper is available online at http://dx.doi.10.1080/07391102.2015.1056550.