Universal full-length nucleosome mapping sequence probe

For the computational sequence-directed mapping of nucleosomes, knowledge of the nucleosome positioning motifs (10–11 base long sequences) and of the respective matrices of bendability is not sufficient, since there is no justified way to fuse these motifs into one continuous nucleosome DNA sequence. The discovery of strong nucleosome (SN) DNA sequences, with visible sequence periodicity, allows derivation of the full-length nucleosome DNA bendability pattern as a matrix or a consensus sequence. The SN sequences of three species (A. thaliana, C. elegans, and H. sapiens) are aligned (512 sequences for each species), and long (115 dinucleotides) matrices of bendability are derived for the species. The matrices have a strong common property: alternation of runs of purine–purine (RR) and pyrimidine–pyrimidine (YY) dinucleotides, with an average period of 10.4 bases. On this basis, the universal [R,Y] consensus of the nucleosome DNA sequence is derived, with exactly defined positions of the respective penta- and hexamers RRRRR, RRRRRR, YYYYY, and YYYYYY.


Introduction
Deformational properties of DNA are reflected in the periodical appearance of various dinucleotide elements along chromatin DNA (Trifonov & Sussman, 1980), reflecting the same periodicity within individual nucleosomes. Thirty-three years of studies on the periodical DNA sequence pattern, also called the nucleosome positioning pattern (reviewed in Trifonov, 2011), converged to a universal RRRRRYYYYY sequence (Frenkel, Bettecken, & Trifonov, 2011; Mengeritsky & Trifonov, 1983; Salih, Salih, & Trifonov, 2008; Rapoport, Frenkel, & Trifonov, 2011; Salih, Tripathi, & Trifonov, in press). Imperfect repeats of the sequence appear in the nucleosome DNA with the non-integer period of 10.4 bases (reviewed in Salih et al., in press), so that actual distances between like-named dinucleotides in adjacent periods are either 10 or 11 bases. A consensus sequence of the nucleosome DNA, (R5-6Y5-6)11, where 11 stands for the number of DNA periods in contact with the histone octamer (Trifonov, 2011), should consist, therefore, of intermingled periods of lengths 10 and 11 bases, together giving the average period of 10.4 bases. The same holds for the matrix of nucleosome DNA bendability (Gabdank, Barash, & Trifonov, 2010a, 2010b; Mengeritsky & Trifonov, 1983), where the occurrences of all 16 dinucleotides are taken into account, each preferentially located at its respective position within the 10.4 base period. In the study of Gabdank et al. (2010a, 2010b), a full-length matrix, the first of its kind, is derived from nucleosome DNA sequences of C. elegans. It was preceded, though, by an incomplete AA/TT-only matrix (Ioshikhes, Bolshoy, Derenshteyn, Borodovsky, & Trifonov, 1996). In the Gabdank matrix, the 10 and 11 base sections are intermingled in an essentially arbitrary, though symmetrical, order.
The exact distribution of the dinucleotides and the exact consensus sequence of the nucleosome DNA can only be derived by alignment of a sufficiently large number of full-length nucleosome DNA sequences. This task is fulfilled in the current study, where the strong nucleosome (SN) DNA sequences, those with visible sequence periodicity (Salih et al., in press), are used for the derivation of the full-length nucleosome positioning probe. The idea of the possible existence of strongly periodical natural sequences followed from an elegant earlier study of Lowary and Widom (1998), in which it was experimentally demonstrated that the random DNA sequences with the highest affinity to histone octamers are characterized by 10-11 base sequence periodicity. They suggested the sequence (CTAGxxxxxx)n, or (AGxxxxxxCT)n, as a possible nucleosome positioning motif. In Salih et al. (in press), this motif was extended to (AGAGGCCTCT)n on the basis of the original data of Lowary and Widom (1998). Notably, in the [R,Y] alphabet the motif is (RRRRRYYYYY)n. In this work, the SN sequences from A. thaliana, C. elegans, and H. sapiens (1536 sequences in total) are aligned, resulting in a full-length [R,Y] consensus. The nucleosomes can be detected in any long sequence of interest by simple matching of the sequence to this standard.
Results and discussion

Full-length matrices of bendability derived from SN sequences
Alignment (see Methods) of 512 SNs of A. thaliana and of the same number of SN sequences of C. elegans and H. sapiens, chromosomes 1-5 (Salih et al., in press), resulted in very similar distributions. One of them, for A. thaliana, is shown in Figure 1, where the graphs (lines of the matrix) are sorted by the amplitude of 10.4 base oscillations. The matrices for the two other species are deposited in the Supplement (Figures S1(a) and S1(b)). (The matrices in numerical form can be provided by request.) The full-length matrix of the nucleosome DNA bendability in Figure 1 is presented in complementarily symmetrized form, which results from summing together the 5′→3′ Watson strands and 5′→3′ Crick strands, since both represent the nucleosome which harbors the respective nucleosome DNA. Such symmetry is also required by the dyad symmetry of DNA (Mengeritsky & Trifonov, 1983; Trifonov, 2010). The component Watson strand and Crick strand matrices have only minute differences (not shown). The AA and TT dinucleotides are the main contributors of the matrix. Reading only topmost matrix elements consecutively results in the following linear form of the matrix (consensus sequence for the full-length 116 base nucleosome probe for A. thaliana):
TAAAAATTTTTAAAAATTTTTTAAAAATTTTTAAAAATTTTTAAAAAATTTTTAAAAATTTTTAAAAATTTTTTAAAAATTTTTAAAAATTTTTAAAAAATTTTTAAAAATTTTTA
Inspection of the matrix reveals that every 10-11 base period is essentially the same, with the same dinucleotide distributions. The periods of the probe above are identical as well, except for two extra A bases (in the 5th and 10th periods) and two extra T bases (in the 2nd and 7th periods). This near identity of the periods is not a trivial result, since some positions within the nucleosome DNA could systematically deviate from the repeating AAAAATTTTT consensus.
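The properties stated for the A. thaliana probe (116 bases, complementary symmetry, and four 11-base periods among the eleven) can be checked directly on the printed sequence. The following is a minimal Python sketch; the helper names (`reverse_complement`, `run_lengths`, `period_lengths`) are illustrative and not taken from the paper.

```python
# Consensus probe for A. thaliana as printed in the text (spaces removed).
PROBE_AT = (
    "TAAAAATTTTTAAAAATTTTTTAAAAA"
    "TTTTTAAAAATTTTTAAAAAATTTTTAAAAA"
    "TTTTTAAAAATTTTTTAAAAATTTTTAAAAA"
    "TTTTTAAAAAATTTTTAAAAATTTTTA"
)

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the complementary strand read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def run_lengths(seq):
    """Return [(base, length), ...] for consecutive runs of identical bases."""
    runs = []
    for base in seq:
        if runs and runs[-1][0] == base:
            runs[-1][1] += 1
        else:
            runs.append([base, 1])
    return [(base, length) for base, length in runs]


def period_lengths(seq):
    """Lengths of the A-run + T-run periods, ignoring the single flanking bases."""
    runs = run_lengths(seq)[1:-1]  # drop the leading T and the trailing A
    return [runs[i][1] + runs[i + 1][1] for i in range(0, len(runs), 2)]


print(len(PROBE_AT))                             # 116
print(PROBE_AT == reverse_complement(PROBE_AT))  # True
print(period_lengths(PROBE_AT))  # [10, 11, 10, 10, 11, 10, 11, 10, 10, 11, 10]
```

The 11-base periods are the 2nd, 5th, 7th, and 10th, in agreement with the extra T and A bases noted in the text.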

Matrices of bendability in [R,Y] alphabet
Although somewhat different in details, the matrices for the three species have a common dominant oscillation of AA and TT dinucleotides with the consensus period AAAAATTTTT, confirming earlier results for the same species, obtained from the respective genomic sequences by Shannon N-gram extension (Rapoport et al., 2011). As the typical matrix shown in Figure 1 demonstrates, the RR dinucleotides AG, GA, and GG display oscillations nearly in phase with AA, while CT, TC, and CC oscillate in phase with TT. Such behavior was first observed by Mengeritsky and Trifonov (1983), suggesting the universal pattern (RRRRRYYYYY)n.
The same three full-length matrices, but in the [R,Y] alphabet, are shown in Figure 2. They clearly demonstrate oscillations of RR and YY dinucleotides as the main feature common to all three species. That is, the consensus (RRRRRYYYYY)n is confirmed.
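The reduction from the four-letter to the [R,Y] alphabet can be illustrated with a few lines of Python. This is only a sketch; the function names `to_ry` and `ry_dinucleotide_counts` are hypothetical. It shows that the periodical motifs quoted in this paper all collapse to the same RRRRRYYYYY pattern, and that the 16 dinucleotides fall into the four groups RR, RY, YR, and YY.

```python
# Purines (A, G) map to R; pyrimidines (C, T) map to Y.
TO_RY = str.maketrans("AGCT", "RRYY")


def to_ry(seq):
    """Translate a DNA sequence into the two-letter [R,Y] alphabet."""
    return seq.upper().translate(TO_RY)


def ry_dinucleotide_counts(seq):
    """Count RR, RY, YR, and YY dinucleotides along a sequence."""
    ry = to_ry(seq)
    counts = {"RR": 0, "RY": 0, "YR": 0, "YY": 0}
    for i in range(len(ry) - 1):
        pair = ry[i:i + 2]
        if pair in counts:  # skips pairs containing N or other symbols
            counts[pair] += 1
    return counts


for motif in ("AAAAATTTTT", "GGGGGCCCCC", "GGAAATTTCC", "AGAGGCCTCT"):
    print(motif, to_ry(motif))  # each motif prints RRRRRYYYYY

print(ry_dinucleotide_counts("AGAGGCCTCT"))
# {'RR': 4, 'RY': 1, 'YR': 0, 'YY': 4}
```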

Linear [R,Y] nucleosome mapping probe
In order to derive a more exact linear form, reflecting the average 10.4-base rather than 10-base periodicity, we combined the plots for the three species in Figure 3(A). The plot of the difference RR-YY (Figure 3(B)) allows reading the exact consensus. Indeed, while positive peaks correspond to runs of R and negative ones to runs of Y, the crossings with the x-axis give the sequence coordinates of the transition dinucleotides RY and YR. These are -52, -42, -31, -21, -10, 0, 10, 21, 31, 42, and 52 for RY, and 5, 16, 26, 36, and 47 for YR. The exact positions for YR dinucleotides are confirmed by the maxima in the YR plot (Figure 3). [...] periods may be anything between 10 and 11 bases, that is, by as much as one base off (33°-36°) (Ong, Richmond, & Davey, 2007). It is also understood that small changes in the [R,Y] consensus, like a change of R5Y6 to R6Y5 or of Y5R6 to Y6R5 here and there, would lead to only a minor change in the match of a given sequence to the probe.
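As a quick arithmetic check, the RY coordinates listed above are separated by 10 or 11 bases, and the ten intervals between -52 and 52 average exactly 10.4 bases. A minimal Python sketch (the variable and function names are illustrative):

```python
# Transition coordinates as listed in the text.
RY_POSITIONS = [-52, -42, -31, -21, -10, 0, 10, 21, 31, 42, 52]
YR_POSITIONS = [5, 16, 26, 36, 47]


def spacings(positions):
    """Distances between consecutive transition coordinates."""
    return [b - a for a, b in zip(positions, positions[1:])]


ry_spacings = spacings(RY_POSITIONS)
print(ry_spacings)                          # [10, 11, 10, 11, 10, 10, 11, 10, 11, 10]
print(sum(ry_spacings) / len(ry_spacings))  # 10.4
print(spacings(YR_POSITIONS))               # [11, 10, 10, 11]
```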

Advantages of using the [R,Y] consensus probe for nucleosome mapping
The full-length matrix of nucleosome DNA bendability may take several forms. First, the ideal matrix would list energy values for all 16 dinucleotides in all positions along the nucleosome DNA. Such a matrix, unfortunately, will become available, perhaps, only decades from now. No reliable deformation energy parameters are known today even for a single specific base-pair stack. The full-length dinucleotide frequency matrices, such as the ones in this work and earlier versions (Gabdank et al., 2010a, 2010b; Ioshikhes et al., 1996), are useful approximations, since these take into account the species-specific dinucleotide frequencies.
The mapping of the nucleosomes using such matrices as mapping probes is at the moment the only way to computationally locate the nucleosomes on the sequences with single-base resolution (Hapala & Trifonov, 2011). There is one seemingly insurmountable problem: individual nucleosome sequences may have their individual biases in the choice of a repertoire of periodically placed dinucleotides. In other words, the individual nucleosome would need its own specific matrix to adequately evaluate the nucleosome-forming quality of the sequence, which becomes impractical. The composition-specific matrices for different isochores of the same species are a good example. The species- and composition-specific biases of individual nucleosome DNA sequences are leveled off if the 16 different dinucleotides are combined in RR, YY, RY, and YR groups. Indeed, it has been known since 1983 (Mengeritsky & Trifonov, 1983) that the consensus repeating motif in the nucleosome DNA is (RRRRRYYYYY)n, rather than just (AAAAATTTTT)n. This is confirmed in later sequence analysis studies (Salih et al., 2008, in press) and strongly supported by the deformational properties of various [R,Y] type base-pair stacks of DNA (Cui & Zhurkin, 2010; Trifonov, 2010). This justifies the aim of this work: derivation of the full-length [R,Y] sequence consensus of the nucleosome DNA. The consensus of 116 R and Y bases, as above, can serve as the nucleosome DNA sequence probe, equally sensitive to the AAAAATTTTT and GGGGGCCCCC periodical motifs, as in A. thaliana (Salih et al., in press) and in H3 isochores (Salih et al., in press), respectively, as well as to any other RRRRRYYYYY motif, like the consensus GGAAATTTCC for C. elegans (Gabdank, Barash, & Trifonov, 2009). The [R,Y] sequence probe derived in this work is superior to the full-length matrix of bendability for C. elegans hitherto used for mapping purposes in various sequences. The [R,Y] probe is free of specific C. elegans dinucleotide biases. In particular, it does not give any special preference to CG dinucleotides.
It may well be, however, that certain dinucleotides of the RR and YY families still do have their specific preferences to some positions within the repeating RRRRRYYYYY motif due to their individual deformation properties. Future studies may clarify this point.

Implementation of the [R,Y] probe for the nucleosome mapping
Usage of the [R,Y] consensus sequence rather than the whole 16-line matrix, as in Gabdank et al. (2010a, 2010b), for the mapping is both justified and more convenient. It is justified because the sequence in the [R,Y] alphabet is sensitive neither to species-specific nor to individual compositional biases. It is equally applicable to A + T-rich and to G + C-rich sequences, since the RRRRRYYYYY pattern is common for all (Rapoport et al., 2011; Salih et al., in press). It is more convenient, as the mapping is reduced to counting simple dinucleotide matches to the probe at every sequence position. The maximal match is 115, according to the size of the probe; the statistical match is 115/4 ≈ 29. The values in this range, from 29 to 115, provide a simple measure of the nucleosome "strength". The strongest nucleosomes of C. elegans, for example, are characterized by a match value of 91 (Salih & Trifonov, in press).
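The match counting described above can be sketched in Python as follows. This is not the authors' code. As a stand-in for the [R,Y] probe, the sketch assumes the A. thaliana consensus printed in the Results, translated to R and Y; the names `match_profile` and `PROBE_RY` are hypothetical. Each of the 115 probe dinucleotides is compared with the dinucleotide at the same offset in the sequence window; with four possible [R,Y] dinucleotides, a random window is expected to match at about 115/4 ≈ 29 positions.

```python
TO_RY = str.maketrans("AGCT", "RRYY")  # purines -> R, pyrimidines -> Y

# Stand-in probe (assumption): the A. thaliana consensus from the Results, in R/Y form.
PROBE_RY = (
    "TAAAAATTTTTAAAAATTTTTTAAAAA"
    "TTTTTAAAAATTTTTAAAAAATTTTTAAAAA"
    "TTTTTAAAAATTTTTTAAAAATTTTTAAAAA"
    "TTTTTAAAAAATTTTTAAAAATTTTTA"
).translate(TO_RY)


def dinucleotides(seq):
    """Overlapping dinucleotides of a sequence, in order."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


PROBE_DINUCS = dinucleotides(PROBE_RY)  # 115 dinucleotides


def match_profile(sequence):
    """Match count (0..115) of the probe at every start position of `sequence`."""
    ry = sequence.upper().translate(TO_RY)
    width = len(PROBE_RY)
    scores = []
    for start in range(len(ry) - width + 1):
        window = dinucleotides(ry[start:start + width])
        scores.append(sum(w == p for w, p in zip(window, PROBE_DINUCS)))
    return scores


# Toy usage: a G/C version of the probe embedded between 20-base flanks.
gc_nucleosome = PROBE_RY.replace("R", "G").replace("Y", "C")
test_seq = "ACGT" * 5 + gc_nucleosome + "ACGT" * 5
scores = match_profile(test_seq)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores[best])  # 20 115
```

The G + C-only toy sequence reaches the maximal match of 115, which illustrates the point made above that a probe in the [R,Y] alphabet is indifferent to base composition.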
It is important to keep in mind that the sequence is not the only determinant of the nucleosome position (Struhl & Segal, 2013;Valouev et al., 2011). It is, clearly, a contributor, as the nucleosomes do display the sequence periodicity (Trifonov & Sussman, 1980), especially stronger nucleosomes (Lowary & Widom, 1998). Other important actors in the game are transcription factors, barriers for nucleosome sliding, and proteins involved in chromatin remodeling (Teif & Rippe, 2009). Any sequence-directed nucleosome mapping procedure only offers the sequence component of the positioning.
The particular choice of RR and YY dinucleotides by different species depends on many factors, G + C content first of all. Follow-up studies may reveal further details of the universal pattern. A publicly accessible server for mapping with the [R,Y] probe is under construction. The code for the mapping can also be provided on request.

Sequences of SNs
The sequences of A. thaliana SNs and C. elegans SNs are taken from the paper of Salih et al. (in press). The 512 sequences of SNs of H. sapiens (chromosomes 1-5) were generated anew, using the same "magic distance" technique (ibid.), that is, by selecting DNA fragments with the maximal number of distances between like-named dinucleotides that are multiples of the period of 10.4 bases. Human genome sequences are taken from ftp.ncbi.nih.gov/genomes/H_sapiens.
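The "magic distance" technique is specified in Salih et al. (in press), not here. The Python sketch below is only one possible reading of the one-sentence summary above, under explicit assumptions: multiples of 10.4 are rounded to the nearest integer, only the first 11 multiples are used, and every pair of identical dinucleotides at such a distance counts once. The names `MAGIC_DISTANCES` and `magic_score` are hypothetical.

```python
PERIOD = 10.4

# Assumption: "multiples of 10.4" are rounded to whole-base distances.
MAGIC_DISTANCES = [round(PERIOD * k) for k in range(1, 12)]
print(MAGIC_DISTANCES)  # [10, 21, 31, 42, 52, 62, 73, 83, 94, 104, 114]


def magic_score(fragment):
    """Number of like-named dinucleotide pairs separated by a magic distance."""
    dinucs = [fragment[i:i + 2] for i in range(len(fragment) - 1)]
    score = 0
    for d in MAGIC_DISTANCES:
        score += sum(dinucs[i] == dinucs[i + d] for i in range(len(dinucs) - d))
    return score


# Toy comparison: a 10-base periodic fragment against a 4-base periodic one.
periodic = ("AAAAATTTTT" * 12)[:116]
aperiodic = "ACGT" * 29
print(magic_score(periodic) > magic_score(aperiodic))  # True
```

Under this reading, fragments would be ranked by `magic_score` along the genome and the top-scoring ones retained; the actual selection criteria may differ.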

Alignment of SN DNA sequences
The sequences of SNs of the three species have visible periodicity of obviously the same motif AAAAATTTTT (ibid.). Therefore, the choice of the first and subsequent sequences to which the other sequences should be consecutively aligned, is not as crucial, as for the nucleosome DNA sequences with weak, hidden periodicity (Ioshikhes et al., 1996). Nevertheless, the precaution was taken, to ensure that none of the sequences of the whole ensemble would have an advantage and excessive influence on the outcome. Each of the three sets of the SN sequences for three species was treated separately, thus, generating species-specific nucleosome DNA bendability matrices.
Definitions: the 2^0-matrix, of 16 lines (dinucleotides) in 115 positions, represents a single SN sequence and consists of matrix elements 0 and 1.
The 2^n-matrix consists of 16 lines (dinucleotides) and up to 153 columns: 115 initial ones and additional flanking columns, originally empty. Each 2^n-matrix is the result of alignment of two 2^(n-1)-matrices and corresponds to 2^n aligned sequences.
Flanking columns are introduced to accommodate ±15 base shifts during the pairwise alignments of the 2^n-matrices.
Alignment is the result of sliding one 2^n-matrix along another one and calculating sums of products of corresponding matrix elements in register. When the position with the maximal sum of products is reached, the matching elements of the two matrices are summed, generating the 2^(n+1)-matrix.
The alignments were conducted in the following steps:
(1) Pairwise alignment of the SN sequences, in arbitrary order, resulting in 256 2^1-matrices.
(2) Consecutive pairwise alignments of 2^n-matrices, generating 2^(n+1)-matrices, until exhaustion of all 512 (2^9) sequences. The last two matrices aligned are 2^8-matrices.
(3) After each alignment, the zero flanking columns are removed.
All computations have been conducted on Matlab.
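The hierarchical alignment described above can be sketched as follows. The sketch is in Python rather than Matlab and is not the authors' code. It rests on several assumptions: the merged matrix grows to the union of the two aligned spans instead of using pre-allocated flanking columns (so there are no empty flanks to remove), ties between shifts are broken in favour of the smallest absolute shift, and the number of input sequences is a power of two. All function names are hypothetical.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 matrix lines
ROW = {d: i for i, d in enumerate(DINUCS)}
MAX_SHIFT = 15  # +/-15 base shifts, as in the text


def one_hot(seq):
    """2^0-matrix: 16 lines x (len(seq) - 1) dinucleotide positions of 0/1."""
    matrix = [[0] * (len(seq) - 1) for _ in DINUCS]
    for i in range(len(seq) - 1):
        matrix[ROW[seq[i:i + 2]]][i] = 1
    return matrix


def overlap_score(a, b, shift):
    """Sum of products a[r][j + shift] * b[r][j] over columns in register."""
    width_a, width_b = len(a[0]), len(b[0])
    lo, hi = max(0, -shift), min(width_b, width_a - shift)
    return sum(ra[j + shift] * rb[j] for ra, rb in zip(a, b) for j in range(lo, hi))


def merge(a, b):
    """Slide b along a, pick the best shift, and sum the two matrices there."""
    shifts = range(-MAX_SHIFT, MAX_SHIFT + 1)
    # Ties are broken in favour of the smallest absolute shift (assumption).
    best = max(shifts, key=lambda s: (overlap_score(a, b, s), -abs(s)))
    width_a, width_b = len(a[0]), len(b[0])
    left = min(0, best)
    right = max(width_a, width_b + best)
    merged = [[0] * (right - left) for _ in DINUCS]
    for r in range(len(DINUCS)):
        for j in range(width_a):
            merged[r][j - left] += a[r][j]
        for j in range(width_b):
            merged[r][j + best - left] += b[r][j]
    return merged, best


def align_all(sequences):
    """Hierarchical pairwise alignment; expects a power-of-two number of sequences."""
    level = [one_hot(s) for s in sequences]
    while len(level) > 1:
        level = [merge(level[i], level[i + 1])[0] for i in range(0, len(level), 2)]
    return level[0]


# Toy usage: four overlapping windows of one made-up sequence.
full = "ACGTTGCAAGGCTTAACCGGTATGCCATAG"
windows = [full[0:24], full[3:27], full[1:25], full[5:29]]
merged, shift = merge(one_hot(windows[0]), one_hot(windows[1]))
final = align_all(windows)
print(shift, len(merged[0]), sum(map(sum, final)))  # 3 26 92
```

In the toy example the second window starts three bases downstream of the first, and the alignment recovers this offset; the final matrix contains all 4 × 23 dinucleotide counts.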

Supplementary material
The supplementary material for this paper is available online at http://dx.doi.org/10.1080/07391102.2014.891262.