Information Measure for Long-Range Correlated Sequences: the Case of the 24 Human Chromosomes

A new approach to estimate the Shannon entropy of a long-range correlated sequence is proposed. The entropy is written as the sum of two terms corresponding respectively to power-law (ordered) and exponentially (disordered) distributed blocks (clusters). The approach is illustrated on the 24 human chromosome sequences by taking the nucleotide composition as the relevant information to be encoded/decoded. Interestingly, the nucleotide composition of the ordered clusters is found, on the average, comparable to that of the whole analysed sequence, while that of the disordered clusters fluctuates. From the information theory standpoint, this means that the power-law correlated clusters carry the same information as the whole analysed sequence. Furthermore, the fluctuations of the nucleotide composition of the disordered clusters are linked to relevant biological properties, such as segmental duplications and gene density.

Complex systems are probed by observing a relevant quantity over a certain temporal or spatial range, yielding long-range correlated sequences or arrays, with the remarkable feature of displaying 'ordered' patterns, which emerge from the seemingly random structure. The degree of 'order' is intrinsically linked to the information embedded in the patterns, whose extraction and quantification might add clues to many complex phenomena 1-12 . In this work, an information measure for long-range correlated sequences, worked out from a partition of the sequence into clusters according to the method proposed in 8,9 , is put forward. The clusters are characterized by their length ℓ, duration t and area A, obeying power-law probability distributions with a cross-over to an exponential decay at large size. The probability distribution function of the lengths is considered to estimate the Shannon (block) entropy S(ℓ) of the clusters. The entropy can be written as the sum of three terms, respectively a constant, a logarithmic and a linear function of the cluster length. The clusters with a dominant logarithmic entropy term are power-law correlated and correspond to 'ordered' structures, while those with a dominant linear term are exponentially distributed and correspond to 'disordered' structures. The information measure is illustrated by analyzing the 24 nucleotide sequences of the human chromosomes. Each sequence is first mapped to a fractional Brownian walk (the so-called DNA walk). Then, the probability distribution function P(ℓ) and the entropy S(ℓ) of the DNA clusters are estimated by adopting the proposed approach.
It is worth recalling that the investigation of the block entropy of a signal was originally motivated by cryptography. Claude Shannon's attempt was aimed at encoding information in ways that still allowed recovery by the receiver, the main question to be answered being: 'How can the signal be compressed into elementary messages which still contain the relevant information to be communicated?'. The approach proposed in this work represents a possible answer. Furthermore, this question recalls the concept of Kolmogorov complexity KC(ℓ), which quantifies the interplay of randomness/determinism of the strings output by a computational program. The Kolmogorov complexity is quantified in terms of the minimal length of the program that can still generate a random string. It can be demonstrated that the length of the program, which is defined case-by-case in the specific computational framework, is comparable to the length of the string plus a constant, and varies as the logarithm of the length of the string itself.
From the information theory standpoint, the present work shows that, by taking the nucleotide composition of the whole sequence as the relevant information to be transmitted from the source to the receiver, the whole sequence is encoded in blocks (clusters), which are able to transmit the same information as the whole sequence if they are power-law correlated. Specifically, it is shown that the power-law correlated clusters are characterized by a nucleotide content, purine-pyrimidine pairs (GC)% and (AT)%, on the average equal to the value of the whole chromosome sequence under analysis. Conversely, the exponentially correlated clusters are characterized by a percentage of purine-pyrimidine pairs exhibiting fluctuations around the value taken by the whole sequence. Interestingly, the standard deviation of the cluster composition fluctuations for each of the 24 chromosomes is correlated with biologically relevant properties, such as duplication frequency and gene density. It is worth remarking that the nucleotide composition is taken as a case study for the illustration of the implementation and meaning of the proposed entropy measure, but it is not the only biologically relevant information carried by a DNA sequence.

Results
The entropy of a sequence, coded in blocks, has been extensively studied since its introduction by Shannon (see e.g. 2-5 , and Refs. therein). The practical application of the Shannon entropy concept requires a symbolic representation of the data, obtained by a suitable partition transforming the continuous phase-space into disjoint sets. As discussed in 5 , the construction of the optimal partition is not a trivial task, being crucial to effectively discriminate between randomness and determinism of the encoded/decoded data. The method commonly adopted for partitioning a sequence and estimating its entropy is based on a uniform division into blocks of equal length ℓ; the entropy is then estimated over subsequent partitions corresponding to different block lengths ℓ. The novelty of the present work resides in the partitioning method, which directly yields power-law or exponentially distributed blocks (clusters). This is a major advantage, as it allows one to straightforwardly separate the sets of inherently correlated/uncorrelated blocks along the sequence. A random sequence y(x) can be partitioned into elementary clusters by the intersection with the moving average ỹ_n(x), where n is the size of the moving window. The clusters correspond to the regions bounded by y(x) and ỹ_n(x) between two subsequent crossing points x_c(i) and x_c(i+1) 8 . The intersection between y(x) and ỹ_n(x) produces a generating partition, yielding different sequences of clusters for different values of n. The probability distribution function P(ℓ, n) of the lengths ℓ for each n can be obtained by counting the clusters N_1(ℓ, n), N_2(ℓ, n), …, N_i(ℓ, n) with lengths ℓ_1, ℓ_2, …, ℓ_i respectively. By doing so, one obtains 8 :

P(ℓ, n) ∼ ℓ^(−D) F(ℓ, n),     (1)

where D = 2 − H, with D and H indicating respectively the fractal dimension and the Hurst exponent of the sequence.
The exponent H is widely used for quantifying long-range (power-law decaying) as opposed to short-range (exponentially decaying) correlations in many complex systems. The Hurst exponent has been estimated for the 24 chromosome sequences, as reported in the 3rd column of Table 1. The occurrence of long-range correlations means that the nucleotides are organized along the sequences in a similar way, a fact that can be defined as compositional self-similarity of the chromosomes. The function F(ℓ, n) in equation (1) can be taken of the form:

F(ℓ, n) ∝ exp(−ℓ/n),     (2)

where F(ℓ, n) accounts for the drop-off of P(ℓ, n) due to the finiteness of n when ℓ ≳ n. The quantity m(ℓ, n) ∝ ℓ^D exp(ℓ/n) is proportional to the size of the subsets spanned by the random walkers, which ranges from a line proportional to ℓ for H = 1 to a square proportional to ℓ² for H = 0, for n > ℓ. The probability distribution function P(ℓ, n) is shown in Fig. 1 for a wide range of n values, estimated for a long-range correlated series with Hurst exponent H ≃ 0.6. For n → 1, the lengths ℓ of the elementary clusters are centered around a single value. When n increases, a broader range of lengths is obtained and, consequently, P(ℓ, n) spreads over all values.
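As an illustration of the partition just described, the sketch below generates clusters from the intersections of a walk with its moving average and builds the empirical cluster-length histogram underlying P(ℓ, n). The trailing moving average, the crossing detection and the plain random walk used as test input (H = 0.5, a stand-in for a fractional Brownian walk) are simplifying assumptions made here, not the exact estimator of Refs. 8, 9.

```python
import numpy as np

def cluster_lengths(y, n):
    """Cluster lengths from the crossings of the walk y with its
    moving average of window n (trailing average, an assumption)."""
    kernel = np.ones(n) / n
    y_ma = np.convolve(y, kernel, mode="valid")
    y_al = y[n - 1:]                      # align y with the moving average
    sign = np.sign(y_al - y_ma)
    sign[sign == 0] = 1                   # resolve exact touches upward
    crossings = np.flatnonzero(np.diff(sign) != 0)
    return np.diff(crossings)             # lengths between adjacent crossings

def length_distribution(y, n, bins):
    """Empirical probability distribution P(l, n) of the cluster lengths."""
    lengths = cluster_lengths(y, n)
    hist, edges = np.histogram(lengths, bins=bins, density=True)
    return hist, edges

# Example: plain random walk (H = 0.5) as a stand-in input.
rng = np.random.default_rng(0)
walk = np.cumsum(rng.choice([-1, 1], size=100_000))
hist, edges = length_distribution(walk, n=100, bins=50)
```

For a long-range correlated walk, the histogram of the lengths is expected to decay as ℓ^(−D) up to ℓ ≈ n, with an exponential drop-off beyond, as in equations (1) and (2).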

www.nature.com/scientificreports
The Shannon entropy is defined as 2-5 :

S(ℓ, n) = −Σ P(ℓ, n) log P(ℓ, n),     (3)

where the sum is performed over the number of elementary clusters with length ℓ obtained by the intersection with the moving average for each n. This number ranges from 1 to m(ℓ, n) − 1, depending on how many clusters are generated by the intersection with the moving average. The value 1 is obtained when only one cluster with length ℓ is found in the partition. As already noted, the standard method for partitioning a sequence and estimating its entropy is to split the sequence into a set of disjoint blocks of equal length ℓ. Conversely, in the present work, the intersections of the sequence with the moving average generate a set of disjoint blocks with a broad distribution of lengths ℓ, corresponding respectively to power-law or exponential correlation. This particular partition retains the determinism/randomness of the blocks by simply varying n, an aspect intimately related to the Kolmogorov complexity concept.
By using equations (1) and (3), the cluster entropy writes:

S(ℓ, n) = −log[ℓ^(−D) F(ℓ, n)] + const,     (4)

which, after taking into account equation (2), becomes:

S(ℓ, n) = S₀ + log ℓ^D + ℓ/n,     (5)

where S₀ is a constant, log ℓ^D is related to the term ℓ^(−D) and ℓ/n is related to the term F(ℓ, n). To clarify the meaning of the terms appearing in equation (5), it is worth remarking that, for isolated systems, the entropy increase dS is related to the irreversible processes spontaneously occurring within the system. The entropy tends to a constant value as a stationary state is asymptotically reached (dS ≥ 0). For open systems interacting with their environment, the increase is given by a term dS_int, due to the irreversible processes spontaneously occurring within the system, and a term dS_ext, due to the irreversible processes arising through the external interactions. The term log ℓ^D in equation (5) should be interpreted as the intrinsic entropy S_int. It is indeed independent of n, i.e. independent of the method used for partitioning the sequence, which plays here the role of the external interaction. The logarithmic term is of the form of a Boltzmann entropy S = log V, where V is the maximum volume occupied by the isolated system. The quantity ℓ^D corresponds to the volume occupied by the random walker. Whenever ℓ could reach the maximum size L of the sequence, the second term on the right-hand side would write log L^D. The term ℓ/n in equation (5) represents the excess entropy S_ext introduced by the partition process. It comes into play when the sequence is partitioned into clusters and depends on n. Fig. 2 shows the entropy S(ℓ, n) evaluated by using the probability distribution P(ℓ, n) plotted in Fig. 1. One can note that S(ℓ, n) increases logarithmically as log ℓ^D and is n-invariant for small values of ℓ, while it increases as a linear function at larger ℓ, as expected according to equation (5). Clusters with lengths ℓ larger than n are not power-law correlated, due to the finite-size effects introduced by the window n.
Hence, they are characterized by a value of the entropy exceeding the curve log ℓ^D, which corresponds to power-law correlated clusters. It is worth remarking that clusters with a given length ℓ can be generated by different values of the window n. For example, clusters with ℓ = 2500 have entropies corresponding to the point A (for n = 1000) or A′ (for n = 3000 and n = 10000), as shown in Fig. 2. One can observe that A′ corresponds to power-law correlated (ordered) clusters, since A′ lies on the curve log ℓ^D. Conversely, the point A does not correspond to power-law correlated clusters, since A lies on the curve ℓ/n, which originates from the term F(ℓ, n). In other words, clusters with lengths shorter than n are ordered (long-range correlated), whereas clusters with lengths larger than n are disordered (exponentially correlated).
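The shape of the entropy curves just described can be reproduced, up to an additive constant, from the empirical length histogram via S(ℓ, n) ≈ −log P(ℓ, n). The sketch below is a minimal estimate under that approximation; the logarithmic binning, the synthetic geometric lengths used as input and the function name are choices made here for illustration, not part of the original method.

```python
import numpy as np

def cluster_entropy(lengths, n_bins=30):
    """Estimate S(l, n) ~ -log P(l, n) from empirical cluster lengths
    (additive constants aside); logarithmic bins cover the several
    decades of l over which the scaling is expected to hold."""
    lengths = np.asarray(lengths, dtype=float)
    bins = np.logspace(0, np.log10(lengths.max() + 1), n_bins)
    hist, edges = np.histogram(lengths, bins=bins, density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    mask = hist > 0
    return centers[mask], -np.log(hist[mask])   # (l, S(l, n)) pairs

# Synthetic example: exponentially distributed lengths, mimicking
# the disordered (large-l) regime.
lens = np.random.default_rng(1).geometric(1 / 200, size=20_000)
ell, S = cluster_entropy(lens)
```

Plotted against ℓ, the small-ℓ part of such a curve should follow S₀ + log ℓ^D for a power-law correlated walk, while the large-ℓ part rises linearly with slope of order 1/n, as in equation (5).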
To gain further insight into the meaning of the terms appearing in equation (5), the source entropy rate s is calculated for the entropy S(ℓ, n). The source entropy rate is a measure of the excess randomness and increases as the block coding process becomes noisier. By using the definition and equation (5), the source entropy rate writes:

s = lim(ℓ → ∞) ∂S(ℓ, n)/∂ℓ = 1/n.     (6)

The excess randomness of the clusters is found to be inversely proportional to n and thus becomes negligible in the limit n → ∞. This clearly occurs in the curves of Fig. 2, where one can note that higher entropy rates correspond to steeper slopes of the linear term ℓ/n (smaller n values).

Discussion
In this section, the information measure is implemented on the 24 human chromosomes, mapped to fractional Brownian walks (mapping details are described in Methods). The nucleotide composition of the DNA sequence is taken as the relevant information quantity to be encoded by the source and decoded by the receiver. It is well-established that the two strands of DNA are held together by hydrogen bonds between complementary bases: two bonds for the AT pair and three bonds for the GC pair, which is therefore stronger. The existence of GC-rich and GC-poor segments may play different roles in biological processes such as duplication, segmentation and unzipping [13][14][15] .
Nonuniformity of nucleotide composition within genomes was revealed several decades ago by thermal melting and gradient centrifugation. On the basis of findings concerning buoyant densities of melted DNA fragments, a theory for the structure of the genomes of warm-blooded vertebrates, known as the isochore theory, was put forward [16][17][18][19] . Isochores were defined as genomic segments that are fairly homogeneous in their guanine and cytosine (GC) composition.
Though it is widely accepted that the human genome contains large regions of distinctive GC content, the availability of fully sequenced DNA or RNA molecules allows one to accurately investigate the local structure by statistical methods. The development of efficient algorithms achieving deep and accurate description of the complex genomic architecture is thus a timely endeavour [20][21][22][23][24][25][26][27][28][29][30] .
The chromosomes can be mapped to numeric sequences according to different approaches. In this work, the DNA is first mapped (as detailed in the section Methods) to a random walk, then the clusters are generated as described in the previous section. Once the clusters have been generated, one can answer the question 'How much of the relevant information is still contained in the clusters?'. The answer is obtained by counting the ATGC bases for each cluster and plotting the percentage as a function of the cluster length. In Figs. 3-8, the nucleotide compositions are plotted as a function of the cluster length ℓ for n = 2, n = 4 and n = 10. The range of n values used in this work varied from 2 to 10,000. One can observe that the nucleotide count is roughly constant for clusters having length comparable to or shorter than n. This means that ordered DNA clusters with constant nucleotide composition are found when the entropy varies as the logarithm of ℓ. For cluster lengths ℓ larger than n, the power-law correlation breaks down with the onset of exponentially correlated ('disordered') clusters. An even more interesting result is that the amplitude of the fluctuations is not constant, as it takes a characteristic value for each chromosome. One can note from the data plotted in Figs. 3-8 that the fluctuations of the cluster composition are very small, for example, in chromosomes 8, 9, 17 and Y. Conversely, they are quite large for chromosomes 14, 15 and X. It should be remarked that Figs. 3-8 show the nucleotide composition of the ordered-disordered clusters. These plots are related to the entropy of the blocks if one bears in mind the original aim of Shannon's work. The estimate of the block entropy was originally motivated by the attempt at encoding information in ways that still allow recovery of the relevant information by the receiver.
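The cluster-by-cluster base count used in Figs. 3-8 amounts to the following sketch. `cluster_composition` is a hypothetical helper written here for illustration: it takes the DNA string together with the cluster boundary positions (in the actual analysis these come from the moving-average crossings) and returns the ATGC percentages of each cluster.

```python
def cluster_composition(seq, boundaries):
    """Percentage of each base within every cluster of a DNA string.
    `boundaries` are the positions delimiting the clusters; each
    cluster spans [boundaries[i], boundaries[i+1])."""
    comps = []
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        block = seq[start:stop]
        if not block:
            continue
        comps.append({base: 100.0 * block.count(base) / len(block)
                      for base in "ATGC"})
    return comps

# Toy example with hand-picked boundaries: three clusters of 4 bases.
comps = cluster_composition("ATGCGGCCATAT", [0, 4, 8, 12])
# comps[0] holds the A/T/G/C percentages of the first cluster, etc.
```

Plotting each percentage against the cluster length ℓ, for several windows n, reproduces the kind of scatter shown in Figs. 3-8.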
In other words, the main question raised by Claude Shannon is: 'How can the signal be compressed into elementary messages (blocks) which still contain the relevant information to be communicated?'. The approach proposed in this work answers this question. The DNA sequence is encoded in short messages (clusters) able to transmit the same information as the whole sequence (from which they were cut out) only if they are power-law correlated. In this manuscript, the information considered relevant to the receiver is the nucleotide composition, which, of course, is not the only choice for the relevant information to be transmitted, as other characteristic features might be interesting as well. It is also discussed to what extent the nucleotide composition of the clusters fluctuates; the fluctuations are quantified by the standard deviation σ_C (Table 2). The correlation of σ_C with biological features characteristic of each chromosome, such as length, gene density, inter-chromosomal duplications, intra-chromosomal duplications and local ATGC composition (data taken from Refs. 14, 15), has been considered. The correlation coefficients r_C are shown in Table 3. Negative correlations between σ_C and intra- and inter-chromosomal duplications are found. Conversely, strong positive correlations are observed between σ_C and AT-rich regions. These findings might point to the important result that the cluster fluctuations are fingerprints of recent segmental duplications.

Methods
A DNA sequence is composed of four nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). The first step of the analysis consists in the conversion of the four-letter genome alphabet into a numerical format. There are several ways of mapping a DNA sequence to a walk: one-dimensional up to four-dimensional, real or complex representations. As the proposed Shannon entropy measure applies to one-dimensional sequences, the present discussion is limited to the one-dimensional real representation of the four nucleotide bases. The sequence of the nucleotide bases is mapped according to the following rule: if the base is a purine (A, G), it is mapped to +1; otherwise, if the base is a pyrimidine (C, T), it is mapped to −1 (Fig. 9). The sequence of +1 and −1 is summed and a random walk y(x) (DNA walk) is obtained. This coding rule is preferable, as it keeps the nonstationarity of the series at a minimum. Large nonstationarity of the numerical series might be an issue when long-range correlations are to be investigated. The average concentrations of A and T are about 0.30, those of G and C about 0.20. The concentrations of purines (A + G) and pyrimidines (C + T) are very close to 0.50 along the sequence. Therefore, coding purines and pyrimidines to +1 and −1 guarantees a high degree of symmetry of the numerical series. Conversely, an asymmetric coding rule would amplify the strong variations of the local density distribution of the bases along the sequences, giving rise to higher nonstationarity of the corresponding random walk.
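The coding rule above can be sketched in a few lines; `dna_walk` is an illustrative name, not code from the original work.

```python
import numpy as np

def dna_walk(seq):
    """Map purines (A, G) to +1 and pyrimidines (C, T) to -1, then
    cumulatively sum the steps to obtain the DNA walk y(x)."""
    steps = np.array([+1 if base in "AG" else -1 for base in seq.upper()])
    return np.cumsum(steps)

y = dna_walk("ATGCGGTA")
# steps: +1 -1 +1 -1 +1 +1 -1 +1  →  y = [1, 0, 1, 0, 1, 2, 1, 2]
```

Because purines and pyrimidines each make up about half of the sequence, the resulting walk stays roughly balanced around zero, which is the symmetry property the text argues for.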
The function ỹ_n(x) is calculated for the DNA walk with different values of the window n. The intersection between y(x) and ỹ_n(x) yields a set of clusters, which correspond to the segments between two adjacent intersections of y(x) and ỹ_n(x). Since each cluster of the DNA walk corresponds to a cluster of ATGC nucleotides, the number of nucleotides can be counted and plotted as a function of the length ℓ for each cluster. In Figs. 3-8 the nucleotide composition of the clusters as a function of the length ℓ is shown for the 24 human chromosomes. The clusters have been cut out of the first 10 MBases of each chromosome sequence (Table 1). The method proposed here has, however, been implemented on several sequences with different lengths, varying from 10^5 to 10^7. This range takes into account, on the one hand, that a scaling law is sound when it is observed over at least three decades on a logarithmic scale and, on the other hand, the computational time and complexity. One can note that the average composition of the power-law correlated clusters is comparable with the composition of the whole analysed sequence. For example, the nucleotide composition of the power-law correlated clusters of chromosome 1 should be confronted with the data reported in the 8th, 9th, 10th and 11th columns of Table 1 for the same chromosome, while the standard deviation is reported in Table 2. The statistical robustness of the method has been checked by estimating the correlation coefficient r_C between the variance and other biological parameters of the sequences (Table 3).
One common problem in data mining is the statistical validation of the model envisioned to describe data structures and patterns. For small quantities of data, the error is estimated on the entire sample set. For large data sets, more sophisticated cross-validation methods have been developed to quantify the performance of algorithms and models over disjoint subsets. Depending upon the criterion used to split the data, the process of training and validation across disjoint sets is named random, k-fold or leave-one-out 31 . In particular, leave-one-out is the degenerate case of k-fold cross-validation, with only one disjoint subset (k = 1), and is particularly useful for very sparse datasets with few samples, though its error might be larger than the error of the estimates themselves and the computation time might be quite long. As the analysed dataset (the 24 genomic sequences) is large enough, random and k-fold cross-validation can be used, with the advantage of higher accuracy and speed of the estimates. In the Supplementary Tables S1-S6, the average values and variances of the nucleotide contents obtained over three disjoint data sets are reported for the 24 chromosomes. For each subset, when the parameter n is varied, clusters of any length are generated at random positions of the sequence, allowing one to estimate the average composition and the statistical errors at different positions along the sequence. For each set, the standard deviations are also reported in the Supplementary Tables S1-S6. Finally, we note that the Hurst exponent for the 24 chromosomes is reported in the 3rd column of Table 1. As one can see, the value of the exponent H is higher than 0.5, implying that a positive correlation (persistence) exists among the nucleotides. The values of the Hurst exponents have been obtained by using the method described in Refs. 8-10. The sequences used in this analysis were retrieved from the NCBI ftp server (ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/).

Table 3 | Correlation r_C of the cluster fluctuations for the first (M_1), the second (M_2) and the third (M_3) disjoint sets of the 24 human chromosome sequences. The fluctuations are anticorrelated with length, gene density, inter-chromosomal and intra-chromosomal segmental duplications, while they exhibit a positive correlation with the AT-rich regions. Very little correlation is found with the GC-rich regions and global AT composition. Length values are shown in the 2nd column of Table 1. Gene density data are taken from Refs. 14, 15. Inter- and intra-chromosomal duplication data are taken from Ref. 14.

Base compositions are shown in Table 1 (respectively 4th-7th columns for the whole sequence and 8th-11th columns for the first 10 MBases) and in Tables S1-S6 of the Supplementary.
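The disjoint-subset estimates reported in the Supplementary Tables can be sketched as follows. Splitting the sequence into k contiguous equal-length chunks and using the GC fraction as the statistic are simplifying assumptions made here for illustration; k = 3 mirrors the three disjoint sets M_1, M_2, M_3.

```python
import numpy as np

def kfold_composition_stats(seq, k=3):
    """Mean and standard deviation of the GC fraction over k disjoint,
    contiguous, equal-length subsets of the sequence (a sketch of the
    k-fold style validation described in the text)."""
    usable = len(seq) - len(seq) % k          # drop the trailing remainder
    chunks = [seq[i * usable // k:(i + 1) * usable // k] for i in range(k)]
    gc = np.array([(c.count("G") + c.count("C")) / len(c) for c in chunks])
    return gc.mean(), gc.std()

# Toy example: 18 bases split into three disjoint sets of 6.
mean_gc, std_gc = kfold_composition_stats("ATGCGGCCATATGGCCAT", k=3)
```

Comparing the per-subset averages and their spread against the whole-sequence value is the kind of consistency check reported in Tables S1-S6.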