Visible periodicity of strong nucleosome DNA sequences

Fifteen years ago, Lowary and Widom assembled nucleosomes on synthetic random sequence DNA molecules, selected the strongest nucleosomes and discovered that the TA dinucleotides in these strong nucleosome sequences often appear at 10–11 bases from one another or at distances which are multiples of this period. We repeated this experiment computationally, on large ensembles of natural genomic sequences, by selecting the strongest nucleosomes – i.e. those in which like-named dinucleotides are most often separated by multiples of 10.4 bases, the structural and sequence period of nucleosome DNA. The analysis confirmed the periodicity of TA dinucleotides in the strong nucleosomes and revealed other periodic sequence elements as well, notably the classical AA and TT dinucleotides. The matrices of DNA bendability and their simple linear forms – nucleosome positioning motifs – are calculated from the strong nucleosome DNA sequences. The motifs are in full accord with nucleosome positioning sequences derived earlier, thus confirming that the new technique does detect strong nucleosomes. Species- and isochore-specific variations of the matrices and of the positioning motifs are demonstrated. The strong nucleosome DNA sequences manifest the strongest nucleosome positioning sequence signals observed so far, showing the dinucleotide periodicities in directly observable rather than hidden form.

Here, we suggest a novel approach: computational extraction of the positioning pattern(s) from strong nucleosome DNA sequences in which the periodicity would be obvious, rather than hidden. Strong nucleosomes are understood here as those formed on strongly periodic sequences. One experimental example is the nucleosome-forming sequence '601', selected from a random pool, with periodically positioned TA dinucleotides (Lowary & Widom, 1998; Vasudevan, Chua, & Davey, 2010).
As a simple, sensitive measure of sequence periodicity, we use in this work the occurrence of various dinucleotides at distances that are multiples of the nucleosome DNA period. In an ideally periodic nucleosome DNA sequence, every dinucleotide is separated from all other like-named dinucleotides by distances of 10.4•n bases. Such a sequence would therefore show, in the positional autocorrelation functions, the maximal possible count of such distances. Strong nucleosome DNA sequences that are not exactly ideal would also have high counts. The search for strong nucleosome DNA sequences in genomes is thus reduced to calculating such counts for every fragment of length 115 bases (Trifonov, 2011) in the genome sequences and selecting the winners.
Various estimates of the nucleosome DNA period converge to values in the interval 10.36-10.40 base pairs (Bettecken & Trifonov, 2009; Cohanim, Kashi, & Trifonov, 2006; Prunell et al., 1979; Trifonov & Bettecken, 1979; Ulanovsky & Trifonov, 1983; Winter, Song, Mukherjee, Furey, & Crawford, 2013). We have therefore chosen the period 10.4 to derive the multiples. The closest integers to the multiples are 10 bases (10.4), 21 (20.8), 31 (31.2), 42 (41.6), 52 (52), 62 (62.4), 73 (72.8), 83 (83.2), 94 (93.6), 104 (104) and 114 bases (114.4), within the span of 115 bases involved in contact with the histone octamer (Trifonov, 2011). It is worth noting that, strictly speaking, frequent occurrence of these distances does not necessarily imply sequence periodicity. However, since the 'magic' distances are characteristic of nucleosome DNA, and no other sequence signals unrelated to nucleosomes are known to display such distances, there is a good chance that this simple technique can indeed extract the strong nucleosome DNA sequences. Importantly, since the procedure counts the magic distances for all 16 dinucleotides, there is no a priori bias in the choice of the most significant ones, so none of the signal components should be missed.
Using the counts of the 10.4•n base distances between identical dinucleotides, for all 16 of them, as a measure of relevant sequence periodicity, we scanned various genomic sequences in search of those sequence fragments of length 115 bases which manifest the highest scores. These would represent the strongest nucleosomes. As the data described below demonstrate, the identified strong sequences of Arabidopsis thaliana, Caenorhabditis elegans and Homo sapiens show all major nucleosome DNA sequence characteristics revealed in earlier work (Frenkel, Trifonov, Volkovich, & Bettecken, 2011; Gabdank et al., 2009; Rapoport et al., 2011), and thus the calculated strong nucleosome DNA patterns are indeed representative of strong nucleosomes of the respective species. Using only strong sequences for derivation of the nucleosome sequence patterns significantly reduces the noise level, making the derived patterns more reliable. The periodical strong nucleosome DNA sequences extracted in this work demonstrate, for the first time, visible rather than hidden nucleosome DNA sequence periodicity.
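The rounding of the multiples of the period to the 'magic' distances can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name and its parameters are our own, not part of the original procedure.

```python
def magic_distances(period=10.4, span=115):
    """Nearest-integer multiples of `period` that fit within `span` bases."""
    distances = []
    n = 1
    while round(period * n) < span:
        distances.append(round(period * n))
        n += 1
    return distances


print(magic_distances())
# [10, 21, 31, 42, 52, 62, 73, 83, 94, 104, 114]
```

Passing another value of `period` gives the corresponding distances for alternative periods.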

Extraction of strong nucleosome sequences
Given a (long) DNA sequence, we screen it with an overlapping sliding window of length 115 bases (sliding in steps of 10 bases, instead of 1, to speed up the calculation). For each fragment (skipping fragments that contain simple tandem repeats, like … ATATAT …) we calculate the autocorrelation function in the domain [1…115], and take the sum of the function's values at the magic distances 10.4•n (i.e. 10, 21, 31, 42, …, 114) as the score of this specific DNA fragment. Finally, we rank all the fragments according to their scores and choose all fragments above some cut-off threshold (see Section 2.4 for the determination of the threshold). Where fragments overlap, the strongest one is taken. The fragments selected by the procedure are called 'strong nucleosome DNA sequences'.
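The scanning procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the score is taken to be the number of pairs of identical dinucleotides separated by a magic distance, the tandem-repeat filter is omitted, and all function names are hypothetical.

```python
MAGIC = (10, 21, 31, 42, 52, 62, 73, 83, 94, 104, 114)


def magic_score(fragment, distances=MAGIC):
    """Count pairs of identical dinucleotides separated by a magic distance."""
    dinucs = [fragment[i:i + 2] for i in range(len(fragment) - 1)]
    score = 0
    for d in distances:
        for i in range(len(dinucs) - d):
            if dinucs[i] == dinucs[i + d]:
                score += 1
    return score


def scan(sequence, window=115, step=10):
    """Yield (start, score) for every window of the sliding scan."""
    for start in range(0, len(sequence) - window + 1, step):
        yield start, magic_score(sequence[start:start + window])


def select_strong(sequence, threshold, window=115, step=10):
    """Keep windows scoring above `threshold`; of overlapping windows,
    keep only the strongest (assumed greedy resolution of overlaps)."""
    hits = sorted(
        ((score, start) for start, score in scan(sequence, window, step)
         if score > threshold),
        reverse=True,
    )
    chosen = []
    for score, start in hits:
        if all(abs(start - other) >= window for _, other in chosen):
            chosen.append((score, start))
    return sorted((start, score) for score, start in chosen)
```

A real run would also need the filter for simple tandem repeats mentioned in the text, whose exact criteria are not given here.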

Determination of strong nucleosome's cut-off threshold
Using appropriately generated random sequences, one can evaluate the score cut-off threshold. In this context, the null hypothesis H0 would be that 'Random sequences of base composition similar to that of the DNA sequence in question do not contain strong nucleosomes'. We therefore use the following algorithm: (1) generate many random sequences (say 1000 sequences of 1 million bases each) according to some base composition distribution, (2) for each sequence, independently, apply the algorithm described in the previous section to calculate the highest scoring fragment and (3) choose the maximum score of the highest scoring fragments over all sequences as the cut-off threshold.
The estimated thresholds range between ~90 for the H3 isochore of H. sapiens and ~120 for A. thaliana (with significance level 0.001). With these thresholds, all accepted sequences display clearly visible sequence periodicity.
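The three steps of the threshold estimation can be sketched as below. The sketch assumes that the random sequences are drawn base by base, independently, from a given base composition (the text only says 'according to some base composition distribution'), and it repeats a compact version of the fragment score so that it is self-contained. All names are hypothetical.

```python
import random

MAGIC = (10, 21, 31, 42, 52, 62, 73, 83, 94, 104, 114)


def magic_score(fragment, distances=MAGIC):
    """Pairs of identical dinucleotides at magic distances (assumed score)."""
    dinucs = [fragment[i:i + 2] for i in range(len(fragment) - 1)]
    return sum(
        dinucs[i] == dinucs[i + d]
        for d in distances
        for i in range(len(dinucs) - d)
    )


def top_score(sequence, window=115, step=10):
    """Step (2): score of the highest scoring fragment in one sequence."""
    return max(
        magic_score(sequence[s:s + window])
        for s in range(0, len(sequence) - window + 1, step)
    )


def estimate_threshold(composition, n_sequences=1000, length=1_000_000, seed=0):
    """Steps (1) and (3): maximum of the top scores over random sequences."""
    rng = random.Random(seed)
    bases = list(composition)
    weights = [composition[b] for b in bases]
    best = 0
    for _ in range(n_sequences):
        seq = "".join(rng.choices(bases, weights=weights, k=length))
        best = max(best, top_score(seq))
    return best
```

With the default sizes (1000 sequences of 1 million bases) this pure-Python version would be slow; it is meant to show the logic, for example `estimate_threshold({"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}, n_sequences=5, length=10_000)` with an arbitrary illustrative composition.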

2.5. Derivation of optimal and near optimal linear motifs from nucleosome DNA bendability matrix
The bendability matrix (16 × 10 elements) contains the information that describes deformational preferences of the 16 dinucleotides in various positions along one period (10 bases) of the nucleosome DNA sequence. However, it is also convenient to look at a simpler representation (at the cost of losing some information): a linear sequence motif which fits the matrix best. Reading the best motif from the matrix requires solving the optimization problem of finding the legal path with the largest sum of weights along that path.

Definitions
Path is a list of 10 dinucleotides chosen from the matrix, one entry per column.
Legal path is a path for which every two adjacent entries (dinucleotides) overlap (e.g. 'AT' and 'TG').
Finding the highest scoring legal path is equivalent to finding the dominant (optimal) linear motif corresponding to a given matrix of bendability. This is solved by simple Dynamic Programming (DP) (Dreyfus & Law, 1976). The same holds for the problem of close-to-optimal motifs, which is solved by the modified version of DP suggested by Waterman & Byers (1985).
To verify, additionally, the validity of the chosen period, we conducted the same calculations as described below, using magic distances for other periods, from 6 to 15 bases, and deriving the corresponding sequence motifs. The linear motifs obtained for various periods are, essentially, shorter and longer versions of the standard (see below) 10-base long motif AAAAATTTTT for the period 10.4 bases. For example, for the periods 10, 10.1, 10.2, 10.3 and 10.5 the resulting motif is the same AAAAATTTTT, and for the period 11 bases the resulting motif is AAAAGGGTTTT. The occurrence of the 11-base motif is substantially lower than that of A5T5.
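The search for the highest scoring legal path can be sketched as a standard column-by-column dynamic programme. The representation of the matrix as a dictionary mapping each dinucleotide to its list of column weights is our assumption, and the near-optimal variant of Waterman & Byers is not implemented here.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def best_legal_path(matrix):
    """Return (score, motif) of the highest scoring legal path.

    `matrix` maps dinucleotides to lists of column weights of equal length;
    dinucleotides missing from `matrix` are given zero weight (assumption).
    """
    n_cols = len(next(iter(matrix.values())))
    weights = {d: list(matrix.get(d, [0] * n_cols)) for d in DINUCLEOTIDES}
    # best[d] = (score, motif) of the best legal path ending with d
    best = {d: (weights[d][0], d) for d in DINUCLEOTIDES}
    for col in range(1, n_cols):
        step = {}
        for d in DINUCLEOTIDES:
            # a legal predecessor ends with the base that d starts with
            prev_score, prev_motif = max(
                (best[p] for p in DINUCLEOTIDES if p[1] == d[0]),
                key=lambda item: item[0],
            )
            step[d] = (prev_score + weights[d][col], prev_motif + d[1])
        best = step
    return max(best.values(), key=lambda item: item[0])


toy = {"TA": [5, 0, 0], "AG": [0, 4, 0], "GA": [0, 0, 3], "CC": [0, 10, 0]}
print(best_legal_path(toy))
# (12, 'TAGA')
```

In the toy matrix the heaviest single element (CC, weight 10) is not part of the best path, because no legal path through it reaches the sum 12 of TA, AG and GA.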

3.2. Strong nucleosome-forming DNA sequences extracted from random mixture (Lowary & Widom, 1998)
The SELEX collection of the clones corresponding to strong nucleosomes contains both Watson strands of the clones and some Crick strands. We have taken only one strand of each nucleosome DNA of the collection, ending with 54 non-redundant sequences of lengths 66-100 bases.
As the distance analysis demonstrates, the major heterodinucleotide contributors to the 10.4•n distances in the strong nucleosome DNA subset of SELEX sequences are TA dinucleotides (Table 1, first column), which confirms the original observation (Lowary & Widom, 1998). Following the technique of reconstruction of the nucleosome DNA bendability matrix from its incomplete forms (Gabdank et al., 2009), we combined all occurrences of TA at a distance of 10 bases (48 TANNNNNNNNTA sequence fragments in total within the 54 sequences of the dataset) and derived the matrix of bendability (Figure 1), where the observed frequencies of various dinucleotides in various positions (phases) of the nucleosome DNA period are shown. Taking the frequent homodinucleotides AA or TT for similar analysis (i.e. AANNNNNNNNAA and TTNNNNNNNNTT fragments) would smear the final pattern, as these dinucleotides often appear as runs AAA … and TTT … As this precious ensemble of strong nucleosome DNA sequences is rather small, the statistics of the occurrences of the dinucleotides in the matrix are poor. Nevertheless, two continuous dominant (consensus) sequences, TAGAGTGGCTTA (sum of scores 154) and TAGAGGCCTCTA (sum of scores 151), are generated by DP (Waterman & Byers, 1985), which selects from all possible continuous sequences the ones with the highest sums of scores. Almost the same result can be obtained manually. Indeed, the dinucleotides TA (column 0), AG (highest score in column 1), GA (column 2) and AG (column 3) make the continuous linear motif TAGAG. Similarly, by fusing the neighbouring elements in the two last columns, one derives the motif CTA, resulting in TAGAGNNNNCTA. Manual reconstruction of the full-length continuous motif meets uncertainties, so the reconstruction has to be made by optimization via DP, which completes the partial motif above: TAGAGGCCTCTA, the second highest scoring pattern derived by the DP.
This pattern can be considered as consensus sequence of the whole collection of the one period long TANNNNNNNNTA fragments.
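The collection of TANNNNNNNNTA fragments and the counting of dinucleotides by position can be sketched as follows. This is a simplified illustration with raw occurrence counts and hypothetical names; any normalisation used for the published matrices is not reproduced.

```python
from collections import defaultdict


def bendability_counts(sequences, anchor="TA", spacing=10):
    """Count dinucleotides by offset inside anchor-N8-anchor fragments.

    Returns a dict: dinucleotide -> list of counts for offsets 0..spacing.
    """
    counts = defaultdict(lambda: [0] * (spacing + 1))
    for seq in sequences:
        for i in range(len(seq) - spacing - 1):
            same_start = seq[i:i + 2] == anchor
            same_end = seq[i + spacing:i + spacing + 2] == anchor
            if same_start and same_end:
                fragment = seq[i:i + spacing + 2]  # 12 bases
                for k in range(spacing + 1):
                    counts[fragment[k:k + 2]][k] += 1
    return dict(counts)


matrix = bendability_counts(["TAGAGGCCTCTA"])
print(matrix["AG"])
# [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```

The resulting dictionary has one row per observed dinucleotide and 11 columns (offsets 0 to 10), with the anchor dinucleotide occupying the first and last columns.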
The complementary TAG and CTA are the most prominent elements of the above pattern, as follows from the matrix. This was also noted by Lowary & Widom (1998), who suggested the periodical nucleosome positioning motif CTAGNNNNNNCTAGNNNNNN … Remarkably, the self-complementary sequence TAGAGGCCTCTA not only agrees with the earlier results, the alternation of runs of purines and pyrimidines (YRRRRRYYYYYR) in the nucleosome DNA (Mengeritsky & Trifonov, 1983; Rapoport et al., 2011; Salih et al., 2008; Trifonov, 2010), but also strictly conforms to the complementary symmetry of the bendability pattern, which follows from DNA duplex symmetry (Trifonov, 2010).
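Both properties of TAGAGGCCTCTA mentioned above, self-complementarity and the purine/pyrimidine alternation, are easy to check. The helper names below are ours and serve only as illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
PURINE_PYRIMIDINE = str.maketrans("ACGT", "RYRY")  # A, G -> R; C, T -> Y


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def to_ry(seq):
    """Rewrite a sequence in the binary purine (R) / pyrimidine (Y) alphabet."""
    return seq.translate(PURINE_PYRIMIDINE)


motif = "TAGAGGCCTCTA"
print(reverse_complement(motif) == motif)  # True
print(to_ry(motif))                        # YRRRRRYYYYYR
```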

Strong nucleosome sequences of A. thaliana
In Figure 2, a randomly chosen subset of strong nucleosome DNA sequences extracted from the A. thaliana genome is shown. Simple manual alignment reveals the somewhat distorted, but clearly periodic character of the sequences. The periodically repeated runs of adenines are highlighted (in red), as well as the periodically reappearing TA dinucleotides in the same sequences (in green). Note that periodicity as such was not searched for by the procedure, which only counted the occurrences of 'magic' distances of 10.4•n bases. The procedure, therefore, attests to the fact that the magic distances appear indeed to be delivered by the 10.4•n periodic DNA sequences characteristic of nucleosome DNA.
Figure 2 presents the nucleosome DNA sequence periodicity in the most obvious way. Even the strongest hitherto known periodical nucleosome DNA sequence, the clone 601 (Lowary & Widom, 1998), does not show a visible period, and special means are required to detect the periodicity, as illustrated well by the whole history of the studies on nucleosome DNA periodicity (Trifonov, 2011), starting with the positional autocorrelation analyses (Trifonov & Sussman, 1980). The community of chromatin researchers is generally not familiar with weak signals buried in noise; its members may accept the existence of the periodicity but remain skeptical. The clearly periodic nature of the strong nucleosome sequences, as in Figure 2, would perhaps be more convincing than the whole arsenal of indirect demonstrations of the periodicity. It is clear, however, that even the strong sequences are not ideal repetitions of any pattern, i.e. there are no ideal strong nucleosome sequences in the genome of A. thaliana, nor in other genomes. Such ideal nucleosome sequences, the strongest ones, if they exist at all, would be a very rare exception, as they would perhaps create severe obstacles for transcription and replication. The obvious similarity of the sequences above, however, suggests the existence of some ideal (consensus) pattern with repeating runs of A and of T. Mere inspection of the sequences suggests that the most prominent frequently occurring sequence motif is … TTTAAA … Since the heterodinucleotide TA is the major contributor to the magic distances in A. thaliana (Table 1), we collected all occurrences of the TANNNNNNNNTA sequences and derived the matrix of bendability for A. thaliana (Figure 3). The strongest elements in the left part of the matrix (columns 1-4) are AA dinucleotides (TAAAAA), while in the right half (columns 7-10) these are rather TT dinucleotides (TTTTTA). This suggests the linear form of the matrix (consensus sequence): TAAAAATTTTTA.
The highest scoring pattern derived by the DP procedure applied to the matrix is, indeed, TAAAAATTTTTA. The motifs with A4 or A6 instead of A5 are the next highest scoring sequences.
The dominant pattern, TAAAAATTTTTA is a good match to the motifs obtained earlier for A. thaliana genome and other sequences (Rapoport et al., 2011), by other approaches.

Matrix of bendability for genome of C. elegans
In Figure S1 (supplement), the matrix is shown, based on the starting pattern ATNNNNNNNNAT, since the AT heterodinucleotide is the major contributor to the observed 10.4•n distances in the C. elegans genome (Table 1). The linear motif (one complete 10-base period) derived from the elements of the matrix, in the same way as for A. thaliana, is ATTTTTAAAAAT, or in RY-central form TAAAAATTTTTA, i.e. identical to the strongest pattern in A. thaliana. Interestingly, the next strongest motifs according to DP analysis of the C. elegans matrix are very similar to the dominant one: ATTTTCAAAAAT, ATTTTAAAAAAT, ATTTTTGAAAAT, ATTTTTTAAAAT, ATTTTGAAAAAT, ATTTTTCAAAAT and ATTTTCGAAAAT.
Comparison of the dominant pattern TAAAAATTTTTA (in RY-central form) with the one calculated earlier by a different technique (Gabdank et al., 2009), CGRAAATTTYCG, shows an 80% match (per 10 bases). The difference, CG instead of TA, can be attributed to different ways of evaluating the 'strengths' of various dinucleotides contributing to the patterns. In the signal regeneration approach (ibid.), the measure of strength was the selectivity of a given dinucleotide for its best position within the period. In that work, the dinucleotides CG and AT appeared as the strongest. In the current work, it is rather the absolute contribution of the dinucleotides to the 10.4•n distances that is selected. Only eventual rigorous energy calculations of the importance of various dinucleotides for DNA bending in the nucleosomes may provide more appropriate measures. In the meantime, the most frequently used (consensus) pattern, i.e. TAAAAATTTTTA in the case of C. elegans, seems to be the right choice, though the CGRAAATTTYCG pattern can be read from the matrix (Figure S1) as well. The CGAAAATTTTCG motif is the eighth strongest motif (see above) calculated by DP from the C. elegans matrix (Figure S1). Both the TA and CG dinucleotides show the highest preference for the central YR position of the matrix. That is, both C. elegans patterns above can be used for nucleosome mapping, and both satisfy the most general pattern YRRRRRYYYYYR (Mengeritsky & Trifonov, 1983; Rapoport et al., 2011; Salih et al., 2008; Trifonov, 2010).
To characterize quantitatively the contributions of all dinucleotides to DNA bendability in the nucleosome, the matrix description is more appropriate, since the columns of the matrix also contain other elements in addition to the dominant ones. Those which appear in the same column have the same rotational setting in the DNA molecule on the surface of the histone octamer. The simple linear presentations of the matrices of DNA bendability, the consensus motifs, are convenient for quick comparisons.
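The conversion between the two 12-base presentations of the same 10-base period (for example ATTTTTAAAAAT and TAAAAATTTTTA) is a cyclic shift by half a period. A small sketch, with a hypothetical helper name and the assumption that the 12-base form is one 10-base period plus its first two bases repeated:

```python
def recenter(motif12, shift=5):
    """Cyclically shift the 10-base period of a 12-base motif so that the
    central dinucleotide moves to the ends (assumes bases 11-12 repeat 1-2)."""
    period = motif12[:10]
    rotated = period[shift:] + period[:shift]
    return rotated + rotated[:2]


print(recenter("ATTTTTAAAAAT"))  # TAAAAATTTTTA
print(recenter("ATTTTCGAAAAT"))  # CGAAAATTTTCG
```

Applying the shift twice returns the original presentation.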
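The 80% figure can be reproduced by comparing the first 10 bases of the two 12-base motifs position by position. The sketch below assumes that the degenerate symbols R (A or G) and Y (C or T) count as matches with any base they include; the function name is ours.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}


def match_fraction(motif_a, motif_b, length=10):
    """Fraction of the first `length` positions at which two motifs are
    compatible, treating R, Y and N as sets of allowed bases."""
    hits = sum(
        1
        for a, b in zip(motif_a[:length], motif_b[:length])
        if set(IUPAC[a]) & set(IUPAC[b])
    )
    return hits / length


print(match_fraction("TAAAAATTTTTA", "CGRAAATTTYCG"))  # 0.8
```

The two mismatches are the leading dinucleotide, TA versus CG.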

H. sapiens
For derivation of the matrix of bendability, the sequences of chromosomes 1 to 5 are taken, from which 4193 strong nucleosome sequences are extracted. The heterodinucleotide found to contribute most to the magic distances is AT (Table 1). After collecting all occurrences of ATNNNNNNNNAT in the strong nucleosome DNA sequences of H. sapiens, the matrix of bendability is derived as shown in Figure S2. The strongest elements of the matrix (TT in columns 1-4, and AA in columns 6-9) make the highest scoring one-period motif, already familiar, ATTTTTAAAAAT, or in AT-central form: TAAAAATTTTTA. The DP generates the same pattern. It coincides with the motif derived for H. sapiens earlier (Rapoport et al., 2011). The next strongest motifs, by DP, contain runs of T and A of lengths 3-7 (7-3) instead of T5 and A5.

Isochores
The motifs described above, all TAAAAATTTTTA for three different genomes, essentially represent the prevailing A + T rich isochores L1 and L2 of these genomes (Cammarano, Costantini, & Bernardi, 2009; Costantini et al., 2006; Zhang & Zhang, 2004). Other isochores are expected to generate different patterns, with dinucleotides containing G and C. However, the bendability matrix calculated for the isochores H1 boils down to the same dominant TAAAAATTTTTA pattern as for isochores L1 and L2 (not shown). This is due to the higher than 50% A + T composition of the isochores L1-H1 (all < 46% G + C). The matrix for strong nucleosome sequences of isochores H2 (46-53% G + C) is shown in Figure S3. The linear pattern derived from this matrix is different. The major dinucleotide contributor in this case is CA, and the DP calculation generates from the matrix in Figure S3 the predominant motif CAAAACCCCCCA, showing the alternation of runs of purines and pyrimidines. The standard R5Y5 variant CAAAAACCCCCA shows a very high DP score as well. It can also be read directly from the matrix. The complementary symmetry in the four-letter alphabet is not observed in this case, since the CA element itself is not symmetrical. If, however, the strongest symmetrical contributor is taken, AT in this case (Table 1), the DP also generates the symmetrical pattern ATTTTTAAAAAT. The difference from the above C-rich motifs apparently reflects different local sequence environments (composition) of the motif-forming CA and AT dinucleotides.
In Figure 4, the matrix for strong nucleosome sequences of isochores H3 (> 53% G + C) is shown. The linear patterns derived from this matrix are substantially different: GCCCCCCGGGGC and, of nearly the same strength, GCCCCCGGGGGC, the latter also showing complementary symmetry, in addition to the standard alternation of runs of R and Y.

Nucleosome positioning sequences, various levels of their description
The most appropriate description of the positioning pattern is a listing of which dinucleotide elements have physical (deformational) preference for which positions along the period of the DNA in the nucleosome, and how strong (selective) the preference is. This is called the matrix of DNA bendability (Gabdank et al., 2009; Mengeritsky & Trifonov, 1983; Trifonov, 1980). Currently, the ultimate final matrix of bendability, general for all cases, is unavailable. Rather, there are several more or less similar matrices, different for different species and for DNA with different nucleotide composition (this work). The matrices described in this paper are typical examples. The matrix of nucleosome DNA bendability can be expressed as frequencies of occurrence of various dinucleotides, as in the examples, or as a matrix of energy parameters and values of deformational preferences of the dinucleotides. This last form of the bendability matrix can at the moment only be imagined, as the preferences have not yet been evaluated to build a set of 160 values (16 dinucleotide elements, 10 positions along the period). Rigorous estimation of the values is rather problematic (e.g. Cui & Zhurkin, 2010; Tolstorukov, Colasanti, McCandlish, Olson, & Zhurkin, 2007).
A simplified presentation of the matrix of bendability is the continuous sequence motif, 10 bases long, derived by connecting the topmost dinucleotides of each column: the consensus motif (one-period-long consensus or dominant nucleosome positioning sequence). In the current and in earlier papers these motifs are given rather as 12-base sequences, with identical dinucleotides (e.g. YR) at both ends of the period.
As Figure 2 illustrates, the consensus nucleosome positioning pattern, TAAAAATTTTTA in the case of A. thaliana, is not readily seen as such even in the strongest nucleosome DNA sequences. It is compromised in almost every period by bases which do not belong to the consensus, an obvious noise in the pattern. In typical nucleosome DNA with only marginal periodicity, the noise actually dominates. However, the signal in these sequences is present: the sequence elements of the consensus, appearing here and there in phase with the hidden repeat, collectively result in easier deformation in a certain direction and in the recognition of the DNA by the histone octamers. In other words, one does not have to have an exact match of the sequence to the ideal repetition of the consensus. Even a few matching signal dinucleotides may be sufficient for accurate recognition, i.e. nucleosome positioning. Now, with the consensus patterns available, the critical amount of the signal elements can be experimentally determined.
An even simpler presentation of the matrix of bendability is the linear form YRRRRRYYYYYR, in a binary alphabet. Its advantage is its universality, as all the consensus linear patterns described above match the form. It can be applied to any sequence for preliminary mapping of the nucleosomes. More accurate evaluation of the nucleosome strengths can be achieved by taking into account the dinucleotide composition, i.e. using the full matrix of bendability for the given species or composition.
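A preliminary scan with the binary form might look like the sketch below, which simply counts, for every 12-base window, the positions agreeing with YRRRRRYYYYYR. This scoring scheme is our simplification for illustration and is not the mapping procedure of the cited works.

```python
PURINE_PYRIMIDINE = str.maketrans("ACGT", "RYRY")  # A, G -> R; C, T -> Y
PATTERN = "YRRRRRYYYYYR"


def ry_profile(seq, pattern=PATTERN):
    """Number of positions matching `pattern` for each window start."""
    ry = seq.upper().translate(PURINE_PYRIMIDINE)
    k = len(pattern)
    return [
        sum(a == b for a, b in zip(ry[i:i + k], pattern))
        for i in range(len(ry) - k + 1)
    ]


profile = ry_profile("GGTAAAAATTTTTACC")
print(profile.index(max(profile)), max(profile))
# 2 12
```

In the example the full match (12 of 12 positions) is found at offset 2, where the embedded TAAAAATTTTTA starts.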

Optimal positions of various dinucleotides within nucleosome DNA period
The dinucleotide stacks YR•YR are located at the local axes of the minor grooves oriented toward the histone octamer (orientation 'IN') (Cui & Zhurkin, 2010; Vasudevan et al., 2010). The stacks RY•RY, half a period away, are oriented 'OUT'. Accordingly, all YR dinucleotides (TA, CG, TG and CA) should concentrate in the central columns of RYNNNNNNNNRY matrices, which is the case, with only occasional deviations due to statistical reasons (see Figures S1 and S2). Similarly, in YRNNNNNNNNYR matrices the dinucleotides AT, GC, AC and GT are found preferably in central or close-to-central positions (Figure 3). All the RR dinucleotides concentrate in one half-period, while all YY dinucleotides are found in the other half-period of the positioning pattern. This has a general explanation based on the asymmetry of deformation of the RR•YY stacks (Trifonov, 2010). Some of the particular RR and YY dinucleotides show more specific positional preferences within the period, as e.g. in the case of strong SELEX nucleosomes (Figure 1). Interestingly, all the matrices of bendability demonstrate substantially stronger elements close to the pattern-forming 'dominant' dinucleotides, with rather uncertain preferences around the central positions. This probably reflects the avoidance of excessively stable nucleosomes, with all key elements of the matrix equally present.

Towards high resolution mapping of the nucleosomes on any sequence of interest
The simplest and most general nucleosome positioning periodical pattern would be (RRRRRYYYYY)n, with occasional extension of the 10-base sequence repeat to 11 bases, to satisfy the observed average period of 10.4 bases of DNA in the nucleosome, as in (Gabdank, Barash, & Trifonov, 2010a, 2010b). More accurate mapping would be based on the periodically repeated matrices of bendability, specific for a given sequence type (species or isochores). One obvious way to derive such full-length matrices is alignment of the strong nucleosome DNA sequences and derivation of the 16 × 114 elements (values) of the matrix, corresponding to respective orientations of 16 dinucleotide stacks in the 115-base long nucleosome DNA (work in progress). In the meantime, one can use instead the matrix for C. elegans (Gabdank et al., 2009), the first of its kind, which already generates single-base resolution maps (http://www.cs.bgu.ac.il/~nucleom) to study the orientation of important sites in DNA on the surface of the histone octamers (Hapala & Trifonov, 2011).

Conclusions
Computational extraction of strong potential nucleosome DNA sequences from genomes demonstrates for the first time that the strong nucleosome sequences possess obvious periodicity, rather than the merely hidden periodicity indirectly demonstrated in earlier studies. Species-specific matrices of bendability are derived, with respective dominant nucleosome positioning sequence motifs. The universal nucleosome positioning motif (RRRRRYYYYY)n is confirmed.