Information Measure for Long-Range Correlated Sequences: the Case of the 24 Human Chromosomes

Carbone, A.

doi:10.1038/srep02721


Article
Open access
Published: 23 September 2013


A. Carbone^1,2,3

Scientific Reports volume 3, Article number: 2721 (2013) Cite this article


Abstract

A new approach to estimate the Shannon entropy of a long-range correlated sequence is proposed. The entropy is written as the sum of two terms corresponding respectively to power-law (ordered) and exponentially (disordered) distributed blocks (clusters). The approach is illustrated on the 24 human chromosome sequences by taking the nucleotide composition as the relevant information to be encoded/decoded. Interestingly, the nucleotide composition of the ordered clusters is found, on the average, comparable to the one of the whole analyzed sequence, while that of the disordered clusters fluctuates. From the information theory standpoint, this means that the power-law correlated clusters carry the same information of the whole analysed sequence. Furthermore, the fluctuations of the nucleotide composition of the disordered clusters are linked to relevant biological properties, such as segmental duplications and gene density.


Introduction

Complex systems are probed by observing a relevant quantity over a certain temporal or spatial range, yielding long-range correlated sequences or arrays, with the remarkable feature of displaying ‘ordered’ patterns, which emerge from the seemingly random structure. The degree of ‘order’ is intrinsically linked to the information embedded in the patterns, whose extraction and quantification might add clues to many complex phenomena^{1,2,3,4,5,6,7,8,9,10,11,12}.

In this work, an information measure for long-range correlated sequences, worked out from a partition of the sequence into clusters according to the method proposed in^8,9, is put forward. The clusters are characterized by their length ℓ, duration τ and area, which obey power-law probability distributions with a cross-over to an exponential decay at large size. The probability distribution function of the lengths is used to estimate the Shannon (block) entropy S(ℓ) of the clusters. The entropy can be written as the sum of three terms, which are respectively constant, logarithmic and linear functions of the cluster length. The clusters with a dominant logarithmic term of the entropy are power-law correlated and correspond to ‘ordered’ structures, while those with a dominant linear term are exponentially distributed and correspond to ‘disordered’ structures. The information measure is illustrated by analyzing the 24 nucleotide sequences of the human chromosomes. Each sequence is first mapped to a fractional Brownian walk (the so-called DNA walk). Then, the probability distribution function P(ℓ) and the entropy S(ℓ) of the DNA clusters are estimated by adopting the proposed approach.

It is worth recalling that the investigation of the block entropy of a signal was originally motivated by cryptography. Claude Shannon's attempt was aimed at encoding information in ways that still allowed recovery by the receiver, the main question to be answered being: ‘How can the signal be compressed into elementary messages which still contain the relevant information to be communicated?’. The approach proposed in this work represents a possible answer. Furthermore, this question recalls the concept of Kolmogorov complexity KC(ℓ), which quantifies the interplay of randomness and determinism in the strings output by a computational program. The Kolmogorov complexity is quantified in terms of the minimal length of the program that can still generate a random string. It can be demonstrated that the length of the program, which is defined case-by-case in the specific computational framework, is comparable to the length of the string plus a constant and varies as the logarithm of the length of the string itself.

From the information theory standpoint, the present work shows that, by taking the nucleotide composition of the whole sequence as the relevant information to be transmitted from the source to the receiver, the whole sequence is encoded in blocks (clusters), which are able to transmit the same information as the whole sequence if they are power-law correlated. Specifically, it is shown that the power-law correlated clusters are characterized by a nucleotide content, purine-pyrimidine pairs (GC)% and (AT)%, on average equal to the value of the whole chromosome sequence under analysis. Conversely, the exponentially correlated clusters are characterized by a percentage of purine-pyrimidine pairs exhibiting fluctuations around the value taken by the whole sequence. Interestingly, the standard deviation of the cluster composition fluctuations for each of the 24 chromosomes is correlated to biologically relevant properties, such as duplication frequency and gene density. It is worth remarking that the nucleotide composition is taken as a case study to illustrate the implementation and meaning of the proposed entropy measure, but it is not the only biologically relevant information carried by a DNA sequence.

Results

The entropy of a sequence, coded in blocks, has been extensively studied since its introduction by Shannon (see e.g.^2,3,4,5 and Refs. therein). The practical application of the Shannon entropy concept requires a symbolic representation of the data, obtained by a suitable partition transforming the continuous phase-space into disjoint sets. As discussed in⁵, the construction of the optimal partition is not a trivial task, and it is crucial for effectively discriminating between randomness and determinism of the encoded/decoded data. The method commonly adopted for partitioning a sequence and estimating its entropy is based on a uniform division into blocks having equal length ℓ. The entropy is then estimated over subsequent partitions corresponding to different block lengths ℓ. The novelty of the present work resides in the method used for partitioning the sequence, which directly yields power-law or exponentially distributed blocks (clusters). This is a major advantage, as it allows one to straightforwardly separate the sets of inherently correlated and uncorrelated blocks along the sequence.

A random sequence y(x) can be partitioned into elementary clusters by its intersection with the moving average ỹ_n(x) = (1/n) Σ_{k=0}^{n−1} y(x − k), where n is the size of the moving window. The clusters correspond to the regions bounded by y(x) and ỹ_n(x) between two subsequent crossing points x_c(i) and x_c(i + 1)⁸. The intersection between y(x) and ỹ_n(x) produces a generating partition, yielding different sequences of clusters for different values of n. The probability distribution function P(ℓ, n) of the lengths ℓ for each n can be obtained by counting the clusters with lengths ℓ₁, ℓ₂, …, ℓ_i. By doing so, one obtains⁸:

$$P(\ell, n) \sim \ell^{-D}\, \mathcal{F}(\ell, n) \tag{1}$$

where D = 2 − H and H indicate respectively the fractal dimension and the Hurst exponent of the sequence. The exponent H is widely used for quantifying long-range (power-law decaying) correlations as opposed to short-range (exponentially decaying) correlations in many complex systems. The Hurst exponent has been estimated for the 24 chromosome sequences, as reported in the 3rd column of Table 1. The occurrence of long-range correlations means that the nucleotides are organized along the sequences in a similar way, a fact that can be defined as compositional self-similarity of the chromosomes. The function 𝓕(ℓ, n) in equation (1) can be taken of the form:

$$\mathcal{F}(\ell, n) \equiv \exp(-\ell/n) \tag{2}$$

which accounts for the drop-off of P(ℓ, n) due to the finiteness of n when ℓ ≫ n. The quantity μ(ℓ, n) ~ ℓ^D exp(ℓ/n) is proportional to the size of the subsets spanned by the random walkers, which ranges from a line proportional to ℓ for H = 1 to a square proportional to ℓ² for H = 0 for n > ℓ. The probability distribution function P(ℓ, n) is shown in Fig. 1 for a wide range of n values, estimated for a long-range correlated series with Hurst exponent H ≈ 0.6. For n → 1, the lengths ℓ of the elementary clusters are centered around a single value. When n increases, a broader range of lengths is obtained and, consequently, P(ℓ, n) spreads over all values.
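The partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the moving average is taken as a backward window of n points, a crossing is detected as a sign change of y(x) minus the moving average, and the function names (`moving_average`, `cluster_lengths`, `length_distribution`) are hypothetical.

```python
from collections import Counter


def moving_average(y, n):
    """Backward moving average over n points.

    The k-th returned value corresponds to position x = k + n - 1.
    """
    window_sum = sum(y[:n])
    averages = [window_sum / n]
    for x in range(n, len(y)):
        window_sum += y[x] - y[x - n]
        averages.append(window_sum / n)
    return averages


def cluster_lengths(y, n):
    """Lengths between consecutive crossings of y(x) and its moving average."""
    averages = moving_average(y, n)
    # True where the sequence lies above its moving average (assumed convention).
    above = [y[k + n - 1] > averages[k] for k in range(len(averages))]
    crossings = [k for k in range(1, len(above)) if above[k] != above[k - 1]]
    return [b - a for a, b in zip(crossings, crossings[1:])]


def length_distribution(lengths):
    """Empirical estimate of P(l, n): relative frequency of each cluster length."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {length: count / total for length, count in sorted(counts.items())}


# Small deterministic example: a triangular wave crossed by a 2-point average.
wave = [0, 1, 2, 1, 0, -1, -2, -1, 0, 1, 2]
lengths = cluster_lengths(wave, 2)
distribution = length_distribution(lengths)
```

Repeating the call for several values of n on the same sequence gives one empirical distribution per window, which is the family of curves the text refers to in Fig. 1.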

Table 1 Nucleotide Composition. Length L (2^nd column), Hurst exponent H (3^rd column), base composition (% of ATCG, 4^th–7^th columns) of the 24 chromosome whole sequences. Average nucleotide composition (% of the ATCG, 8^th–11^th columns) of the clusters, estimated according to the proposed method with n = 4 over the first 10MBases of the 24 chromosome sequences. In particular, the data in the 8^th–11^th columns correspond to the plots shown in the middle panels of Figs. 3,4,5,6,7,8 for each chromosome. In Tables S1–S6 of Supplementary Information, further results, estimated over different data sets with different values of n, are reported
**Figure 3**
Cluster Composition.
Base composition (% of A (blue) T (red) C (blue) G (red) nucleotides) of the clusters in the human chromosomes 1, 2, 3, 4. For each chromosome, the plots refer to windows n = 2, n = 4, n = 10. Data refers to the first 10Mbases of each chromosome. See Tables S1–S6 of Supplementary Information for further estimates.
**Figure 4**
Same as Fig. 3 but for the chromosomes 5, 6, 7, 8.
**Figure 5**
Same as Fig. 3 but for the chromosomes 9, 10, 11, 12.
**Figure 6**
Same as Fig. 3 but for the chromosomes 13, 14, 15, 16.
**Figure 7**
Same as Fig. 3 but for the chromosomes 17, 18, 19, 20.
**Figure 8**
Same as Fig. 3 but for the chromosomes 21, 22, X, Y.

The Shannon entropy is defined as^2,3,4,5:

$$S(\ell, n) = -\sum P(\ell, n) \log P(\ell, n) \tag{3}$$

where the sum is performed over the number of elementary clusters with length ℓ obtained by the intersection with the moving average for each n. This number ranges from 1 to μ(ℓ, n)^–1 depending on how many clusters are generated by the intersection with the moving average. The value 1 is obtained when only one cluster with length ℓ is found in the partition. As already noted, the standard method for partitioning a sequence and estimating its entropy is to split the sequence into a set of disjoint blocks with equal length ℓ. Conversely, in the present work, the intersections of the sequence with the moving average generate a set of disjoint blocks with a broad distribution of lengths ℓ, corresponding respectively to power-law or exponential correlation. This particular partition retains the determinism/randomness of the blocks by simply varying n, an aspect intimately related to the Kolmogorov complexity concept.

By using equations (1) and (3), the cluster entropy reads:

$$S(\ell, n) = S_0 + \log \ell^{D} - \log \mathcal{F}(\ell, n) \tag{4}$$

where 𝓕(ℓ, n) is the function appearing in equation (1). After taking into account the exponential form of 𝓕(ℓ, n) given in equation (2), this becomes:

$$S(\ell, n) = S_0 + \log \ell^{D} + \frac{\ell}{n} \tag{5}$$

where S₀ is a constant, log ℓ^D is related to the term ℓ^–D and ℓ/n is related to the term exp(−ℓ/n).

To clarify the meaning of the terms appearing in equation (5), it is worth remarking that, for isolated systems, the entropy increase dS is related to the irreversible processes spontaneously occurring within the system. The entropy tends to a constant value as a stationary state is asymptotically reached (dS ≥ 0). For open systems interacting with their environment, the increase is given by a term dS_int, due to the irreversible processes spontaneously occurring within the system, and a term dS_ext, due to the irreversible processes arising through the external interactions. The term log ℓ^D in equation (5) should be interpreted as the intrinsic entropy S_int. It is indeed independent of n, i.e. it is independent of the method used for partitioning the sequence, which plays here the role of the external interaction. The logarithmic term is of the form of a Boltzmann entropy S = log Ω, where Ω is the maximum volume occupied by the isolated system. The quantity ℓ^D corresponds to the volume occupied by the random walker. If ℓ could reach the maximum size L of the sequence, the second term on the right side would read log L^D. The term ℓ/n in equation (5) represents the excess entropy S_ext introduced by the partition process. It comes into play when the sequence is partitioned into clusters and depends on n.

Fig. 2 shows the entropy S(ℓ, n) evaluated by using the probability distribution P(ℓ, n) plotted in Fig. 1. One can note that S(ℓ, n) increases logarithmically as log ℓ^D and is n-invariant for small values of ℓ, while it increases as a linear function at larger ℓ, as expected according to equation (5). Clusters with lengths ℓ larger than n are not power-law correlated, due to the finite-size effects introduced by the window n. Hence, they are characterized by a value of the entropy exceeding the curve log ℓ^D, which corresponds to power-law correlated clusters. It is worth remarking that clusters with a given length ℓ can be generated by different values of the window n. For example, clusters with ℓ = 2500 have entropies corresponding to the point A (for n = 1000) or A″ (for n = 3000 and n = 10000), as shown in Fig. 2. One can observe that A″ corresponds to power-law correlated (ordered) clusters, since A″ lies on the curve log ℓ^D. Conversely, the point A does not correspond to power-law correlated clusters, since A lies on the curve ℓ/n, which originates from the term exp(−ℓ/n). In other words, clusters with lengths shorter than n are ordered (long-range correlated), whereas clusters with lengths larger than n are disordered (exponentially correlated).
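As a rough numerical illustration of the two contributions, the sketch below evaluates the logarithmic term log ℓ^D and the excess term ℓ/n for ℓ = 2500 and the three windows quoted above. Several choices here are assumptions rather than values from the source: natural logarithms, S₀ = 0, unit prefactors for both terms, and H = 0.6 (the value quoted for Fig. 1). With unit prefactors only the trend in n is meaningful, not the absolute position of the points A and A″ in Fig. 2. The function name `cluster_entropy` is hypothetical.

```python
import math


def cluster_entropy(length, n, hurst, s0=0.0):
    """Return (total, logarithmic term, excess term) for one cluster length.

    Assumes S = S0 + D*log(length) + length/n with D = 2 - H,
    natural logarithms and unit prefactors (illustrative choices).
    """
    fractal_dimension = 2.0 - hurst
    log_term = fractal_dimension * math.log(length)
    excess_term = length / n
    return s0 + log_term + excess_term, log_term, excess_term


HURST = 0.6  # assumed value, taken from the synthetic series of Fig. 1
rows = {n: cluster_entropy(2500, n, HURST) for n in (1000, 3000, 10000)}

for n, (total, log_term, excess_term) in rows.items():
    print(f"n={n:>5}  log term={log_term:.3f}  excess={excess_term:.3f}  S={total:.3f}")
```

The logarithmic term is the same for all three windows, while the excess term shrinks as n grows, which is the n-dependence the paragraph above attributes to the partition process.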

To gain further insight into the meaning of the terms appearing in equation (5), the source entropy rate s is calculated for the entropy S(ℓ, n). The source entropy rate is a measure of the excess randomness and increases as the block coding process becomes noisier. By using the definition s ≡ lim_{ℓ→∞} S(ℓ, n)/ℓ and equation (5), the source entropy rate reads:

$$s \sim \frac{1}{n} \tag{6}$$

The excess randomness of the clusters is found to be inversely proportional to n and, thus, becomes negligible in the limit of n → ∞. This clearly occurs in the curves of Fig. 2, where one can note that higher entropy rates correspond to steeper slopes of the linear term ℓ/n (smaller n values).
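The inverse proportionality to n can be checked in one line, assuming the usual block-entropy definition of the source entropy rate as the large-ℓ limit of S(ℓ, n)/ℓ and the three-term form of equation (5), S₀ + log ℓ^D + ℓ/n:

$$s = \lim_{\ell \to \infty} \frac{S(\ell, n)}{\ell} = \lim_{\ell \to \infty} \left( \frac{S_0}{\ell} + \frac{D \log \ell}{\ell} + \frac{1}{n} \right) = \frac{1}{n}$$

The constant and logarithmic contributions vanish in the limit because both grow more slowly than ℓ, so only the linear term introduced by the finite window survives.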

Discussion

In this section, the information measure is implemented on the 24 human chromosomes, mapped to fractional Brownian walks (mapping details are described in Methods). The nucleotide composition of the DNA sequence is taken as the relevant information quantity to be encoded by the source and decoded by the receiver.

It is well established that the two strands of DNA are held together by hydrogen bonds between complementary bases: two bonds for the AT pair and three bonds for the GC pair, which is therefore stronger. The existence of GC-rich and GC-poor segments may play different roles in biological processes such as duplication, segmentation and unzipping^13,14,15.

Nonuniformity of nucleotides composition within genomes was revealed several decades ago by thermal melting and gradient centrifugation. On the basis of findings concerning buoyant densities of melted DNA fragments, a theory for the structure of genomes of warm-blooded vertebrates known as the isochores theory was put forward^16,17,18,19. Isochores were defined as genomic segments that are fairly homogeneous in their guanine and cytosine (GC) composition.

Though it is widely accepted that the human genome contains large regions of distinctive GC content, the availability of fully sequenced DNA or RNA molecules allows one to accurately investigate the local structure by statistical methods. The development of efficient algorithms achieving deep and accurate description of the complex genomic architecture is thus a timely endeavour^{20,21,22,23,24,25,26,27,28,29,30}.

The chromosomes can be mapped to numeric sequences according to different approaches. In this work, the DNA is first mapped to a random walk (as detailed in the Methods section), then the clusters are generated as described in the previous section. Once the clusters have been generated, one can answer the question ‘How much of the relevant information is still contained in the clusters?’. The answer is obtained by counting the A, T, G and C bases in each cluster and plotting their percentages as a function of the cluster length. In Figs. 3,4,5,6,7,8, the nucleotide compositions are plotted as a function of the cluster length ℓ for n = 2, n = 4 and n = 10. The range of n values used in this work varied from 2 to 10,000. One can observe that the nucleotide count is roughly constant for clusters having length comparable to or shorter than n. This means that ordered DNA clusters with constant nucleotide composition are found when the entropy varies as the logarithm of ℓ. For cluster lengths ℓ larger than n, the power-law correlation breaks down with the onset of exponentially correlated clusters (‘disordered’ clusters). An even more interesting result is that the amplitude of the fluctuations is not the same across chromosomes, as it takes a characteristic value for each chromosome. One can note from the data plotted in Figs. 3,4,5,6,7,8 that the fluctuations of the cluster composition are very small, for example, in chromosomes 8, 9, 17 and Y. Conversely, they are quite large for chromosomes 14, 15 and X. It should be remarked that Figs. 3,4,5,6,7,8 show the nucleotide composition of the ordered and disordered clusters. These plots are related to the entropy of the blocks if one bears in mind the original aim of Shannon's work. The estimate of the block entropy was originally motivated by the attempt at encoding information in ways that still allow recovery of the relevant information by the receiver.
In other words, the main question raised by Claude Shannon is: “How can the signal be compressed into elementary messages (blocks) which still contain the relevant information to be communicated?”. The approach proposed in this work answers this question. The DNA sequence is encoded in short messages (clusters) able to transmit the same information as the whole sequence (from which they were cut out) only if they are power-law correlated. In this manuscript, the information considered relevant to the receiver is the nucleotide composition, which, of course, is not the only choice for the relevant information to be transmitted, as other characteristic features might be interesting as well. It is also discussed to what extent nucleotide fluctuations, characterizing the exponentially correlated clusters of each chromosome, might be linked to features relevant to biological processes. To this purpose, the standard deviation σ_C of the fluctuations has been calculated for the nucleotide composition ATGC of the clusters (values are reported in Table 2). The correlations of σ_C with biological features characteristic of each chromosome, such as length, gene density, inter-chromosomal duplications, intra-chromosomal duplications and local ATGC composition (data taken from Refs. 14, 15), have been considered. The correlation coefficients ρ_C are shown in Table 3. Negative correlations between σ_C and intra- and inter-chromosomal duplications are found. Conversely, strong positive correlations are observed between σ_C and AT-rich regions. These findings might point to the important result that the cluster fluctuations are fingerprints of recent segmental duplications.
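A correlation coefficient of this kind can be computed as sketched below. The text does not state which estimator was used for ρ_C, so the Pearson coefficient is an assumption here, and the function name `pearson` is hypothetical. The two short lists are made-up placeholder numbers that only demonstrate the call; they are not values from Table 2 or from Refs. 14, 15.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Placeholder values, invented for illustration only (one entry per chromosome
# would be used in practice): cluster-composition standard deviations and a
# per-chromosome feature such as a duplication percentage.
sigma_toy = [0.8, 1.1, 0.5, 1.6]
feature_toy = [3.0, 2.1, 4.2, 1.0]
rho_toy = pearson(sigma_toy, feature_toy)
```

A negative `rho_toy` here simply reflects how the placeholder lists were chosen; the sign and size of the actual coefficients are those reported in Table 3.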

Table 2 Standard deviation of the cluster nucleotide composition. Standard deviations refer to the average values (% of the ATCG, 8^th–11^th columns), estimated according to the proposed method with n = 4 over the first 10MBases of the 24 chromosome sequences. Standard deviations can be appreciated in the middle panel plots of Figs. 3,4,5,6,7,8 for each chromosome. In Tables S1–S6 of Supplementary Information, further values over different chromosome sets and with different values of n are reported


Table 3 Correlation ρ_C of the cluster fluctuations for the first (M₁), the second (M₂) and the third (M₃) disjoint sets of the 24 human chromosome sequences. The fluctuations are anticorrelated with length, gene density, inter-chromosomal and intra-chromosomal segmental duplications, while they exhibit a positive correlation with the AT-rich regions. Very little correlation is found with the GC-rich regions and global AT composition. Length values are shown in the 2^nd column of Table 1. Gene density data are taken from Refs. 14, 15. Inter- and intra-chromosomal duplications data are taken from Ref. 14. Base compositions are shown in Table 1 (respectively 4^th–7^th columns for the whole sequence, 8^th–11^th columns for the first 10MBases and in Tables S1–S6 of the Supplementary Information)


Methods

A DNA sequence is composed of four nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). The first step of the analysis consists in the conversion of the four-letter genome alphabet into a numerical format. There are several ways of mapping a DNA sequence to a walk: one-dimensional up to four-dimensional, real or complex representations. As the proposed Shannon entropy measure applies to one-dimensional sequences, the present discussion is limited to the one-dimensional real representation of the four nucleotide bases. The sequence of the nucleotide bases is mapped according to the following rule: if the base is a purine (A,G), the base is mapped to +1, otherwise if the base is a pyrimidine (C,T), the base is mapped to –1 (Fig. 9). The sequence of +1 and –1 is summed and a random walk y(x) (DNA walk) is obtained. This coding rule is preferable, as it keeps the nonstationarity of the series at a minimum. Large nonstationarity of the numerical series might be an issue when long-range correlations are to be investigated. The average concentrations of A and T are about 0.30, those of G and C are about 0.20. The concentrations of purines (A + G) and pyrimidines (C + T) are very close to 0.50 along the sequence. Therefore, coding of purines and pyrimidines to +1 and −1 guarantees a high degree of symmetry of the numerical series. Conversely, an asymmetric coding rule would amplify the strong variations of the local density distribution of the bases along the sequences, giving rise to higher nonstationarity of the corresponding random walk.
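The purine/pyrimidine coding rule can be sketched as follows. The function name `dna_walk` is hypothetical, and the handling of symbols other than A, C, G and T (for example the ambiguity code N found in genome files) is an assumption: here they are simply skipped, which the text does not specify.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def dna_walk(sequence):
    """Map a nucleotide string to a one-dimensional DNA walk y(x).

    Purines (A, G) contribute +1 and pyrimidines (C, T) contribute -1;
    the walk is the running sum of these steps.
    """
    walk = []
    position = 0
    for base in sequence.upper():
        if base in PURINES:
            position += 1
        elif base in PYRIMIDINES:
            position -= 1
        else:
            continue  # assumption: ignore N and any other non-ACGT symbol
        walk.append(position)
    return walk


example_walk = dna_walk("AGCTTA")  # steps +1, +1, -1, -1, -1, +1
```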

The moving average ỹ_n(x) is calculated for the DNA walk with different values of the window n. The intersection between y(x) and ỹ_n(x) yields a set of clusters, which correspond to the segments between two adjacent intersections of y(x) and ỹ_n(x). Since each cluster of the DNA walk corresponds to a cluster of ATGC nucleotides, the number of nucleotides can be counted and plotted as a function of the length ℓ for each cluster. In Figs. 3,4,5,6,7,8 the nucleotide composition of the clusters as a function of the length ℓ is shown for the 24 human chromosomes. The clusters have been cut out of 10⁶ bases of each chromosome at once. To be statistically meaningful, the analysis needs to operate over subsequences having the same length (note that the 24 human chromosomes have different lengths L, 2nd column of Table 1). The method proposed here has, however, been implemented on several sequences with different lengths (lengths varying from 10⁵ to 10⁷ bases have been considered in this study). This range takes into account, on the one hand, that a scaling law is sound when it is observed over at least three decades of a logarithmic scale and, on the other hand, the computational time and complexity. One can note that the average composition of the power-law correlated clusters is comparable with the composition of the whole sequence of the analysed data. For example, the nucleotide composition of the power-law correlated clusters of chromosome 1 should be compared with the data reported in the 8th to 11th columns of Table 1 for the same chromosome, while the standard deviation is reported in Table 2. The statistical robustness of the method has been checked by estimating the correlation coefficient ρ_C between the variance and other biological parameters of the sequences (Table 3).
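The counting step can be illustrated with a compact, self-contained sketch. It is not the author's code: the backward moving average, the sign-change crossing criterion, the skipping of non-ACGT symbols and the name `cluster_compositions` are all assumptions made for the example.

```python
from collections import Counter


def cluster_compositions(sequence, n):
    """Return a list of (length, {base: percent}) for each DNA-walk cluster.

    A cluster is the stretch of bases between two consecutive crossings of
    the DNA walk and its n-point backward moving average.
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]

    # DNA walk: purines +1, pyrimidines -1.
    walk, position = [], 0
    for base in bases:
        position += 1 if base in "AG" else -1
        walk.append(position)
    if len(walk) < n:
        return []

    # For every x >= n - 1, record whether the walk lies above its average.
    above = []
    window_sum = sum(walk[:n])
    for x in range(n - 1, len(walk)):
        if x >= n:
            window_sum += walk[x] - walk[x - n]
        above.append(walk[x] > window_sum / n)

    # Crossing positions, expressed as indices into the base sequence.
    crossings = [k + n - 1 for k in range(1, len(above)) if above[k] != above[k - 1]]

    result = []
    for start, end in zip(crossings, crossings[1:]):
        counts = Counter(bases[start:end])
        length = end - start
        result.append((length, {b: 100.0 * counts[b] / length for b in "ACGT"}))
    return result


clusters = cluster_compositions("AAGGCCTTAAGG", 2)
```

Plotting the percentages against the cluster length, for each window n, would give a scatter of the kind described for Figs. 3 to 8.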

One common problem in data mining is the statistical validation of the model envisioned to describe data structures and patterns. For small quantities of data, the error is estimated on the entire sample set. For large data sets, more sophisticated cross-validation methods have been developed to quantify the performance of algorithms and models over disjoint subsets. Depending upon the criterion used to split the data, the process of training and validation across disjoint sets is named random, k-fold or leave-one-out³¹. In particular, leave-one-out is the limiting case of k-fold cross-validation in which k equals the number of samples, so that each held-out subset contains a single sample. It is particularly useful for very sparse datasets with few samples, though its error might be larger than the error of the estimates themselves and the computation time might be quite long. As the analysed dataset (the 24 genomic sequences) is large enough, random and k-fold cross-validation can be used, with the advantage of higher accuracy and speed of the estimates. In the Supplementary Tables S1–S6, the average values and variances of the nucleotide contents obtained over three disjoint data sets are reported for the 24 chromosomes. For each subset, when the parameter n is varied, clusters of any length are generated at random positions of the sequence, which allows the average composition and the statistical errors to be estimated at different positions along the sequence. For each set the standard deviations are also reported in the Supplementary Tables S1–S6.

Finally, we note that the Hurst exponent for the 24 chromosomes is reported in the 3rd column of Table 1. As one can see, the value of the exponent H is higher than 0.5, implying that a positive correlation (persistence) exists among the nucleotides. The values of the Hurst exponents have been obtained by using the method described in Refs. 8,9,10.
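For orientation, a detrending-moving-average style estimate of H can be sketched as below. This is a simplified illustration and not necessarily the exact procedure of Refs. 8,9,10: it assumes that the root-mean-square deviation of the walk from its n-point backward moving average scales as n^H, and it fits the slope on a log-log scale by ordinary least squares. The function name `dma_hurst`, the window list and the synthetic test walk are all choices made for the example.

```python
import math
import random


def dma_hurst(y, windows):
    """Estimate H from the scaling of the deviation from the moving average."""
    log_n, log_sigma = [], []
    for n in windows:
        window_sum = sum(y[:n])
        squared_sum, count = 0.0, 0
        for x in range(n - 1, len(y)):
            if x >= n:
                window_sum += y[x] - y[x - n]
            squared_sum += (y[x] - window_sum / n) ** 2
            count += 1
        log_n.append(math.log(n))
        log_sigma.append(0.5 * math.log(squared_sum / count))

    # Least-squares slope of log(sigma) versus log(n).
    mean_x = sum(log_n) / len(log_n)
    mean_y = sum(log_sigma) / len(log_sigma)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_n, log_sigma))
    denominator = sum((a - mean_x) ** 2 for a in log_n)
    return numerator / denominator


# Synthetic uncorrelated walk (not genomic data): H should come out near 0.5.
random.seed(0)
position, walk = 0, []
for _ in range(20000):
    position += random.choice((-1, 1))
    walk.append(position)

hurst_estimate = dma_hurst(walk, [10, 20, 50, 100, 200])
```

Applied to a DNA walk instead of the synthetic one, a slope above 0.5 would correspond to the persistence mentioned in the paragraph above.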

The sequences used in this analysis were retrieved from the NCBI ftp server (ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/).

References

1. Scheffer, M. et al. Early-warning signals for critical transitions. Nature 461, 53–59 (2009).
2. Crutchfield, J. P. Between Order and Chaos. Nat. Phys. 8, 17–24 (2012).
3. Wang, C. & Huberman, B. A. How Random are Online Social Interactions? Sci. Rep. 2, 633 (2012).
4. Grassberger, P. & Procaccia, I. Characterization of strange attractors. Phys. Rev. Lett. 50, 346–349 (1983).
5. Steur, R., Molgedey, L., Ebeling, W. & Jimenez-Montano, M. A. Entropy and optimal partition for data analysis. Eur. Phys. J. B 19, 265–269 (2001).
6. Bose, R. & Hamacher, K. Alternate entropy measure for assessing volatility in financial markets. Phys. Rev. E 86, 056112 (2012).
7. Shalizi, C. R., Shalizi, K. L. & Haslinger, R. Quantifying Self-Organization with Optimal Predictors. Phys. Rev. Lett. 93, 118701 (2004).
8. Carbone, A., Castelli, G. & Stanley, H. E. Analysis of clusters formed by the moving average of a long-range correlated time series. Phys. Rev. E 69, 026105 (2004).
9. Carbone, A. & Stanley, H. E. Scaling properties and entropy of long-range correlated time series. Physica A 384, 21 (2007).
10. Carbone, A. Algorithm to estimate the Hurst exponent of high-dimensional fractals. Phys. Rev. E 76, 056703 (2007).
11. Türk, C., Carbone, A. & Chiaia, B. M. Fractal heterogeneous media. Phys. Rev. E 81, 026706 (2010).
12. Shao, Y. et al. Comparing the performance of FA, DFA and DMA using different synthetic long-range correlated time series. Sci. Rep. 2, 835 (2012).
13. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
14. Bailey, J. A. et al. Recent Segmental Duplications in the Human Genome. Science 297, 1003–7 (2002).
15. Deloukas, P. et al. A Physical Map of 30,000 Human Genes. Science 282, 744–746 (1998).
16. Lee, W. et al. A high-resolution atlas of nucleosome occupancy in yeast. Nature Genetics 39, 1235–1244 (2007).
17. Bernardi, G. The neoselectionist theory of genome evolution. Proc. Natl. Acad. Sci. U.S.A. 104, 8385–8390 (2007).
18. Costantini, M., Clay, O., Auletta, F. & Bernardi, G. An isochore map of human chromosomes. Genome Research 16, 536–41 (2006).
19. Clay, O. Standard deviations and correlations of GC levels in DNA sequences. Gene 276, 33–38 (2001).
20. Cohen, N., Dagan, T., Stone, L. & Graur, D. GC composition of the human genome: in search of isochores. Mol. Biol. Evol. 22, 1260–72 (2005).
21. Versteeg, R. et al. The human transcriptome map reveals extremes in gene density, intron length, GC content and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 13, 1998–2004 (2003).
22. Emanuel, M. et al. The physics behind the larger scale organization of DNA in eukaryotes. Phys. Biol. 6, 025008–019 (2009).
23. Vaillant, C., Audit, B. & Arneodo, A. Experiments confirm the influence of genome long-range correlations on nucleosome positioning. Phys. Rev. Lett. 99, 218103–107 (2007).
24. Li, W. Delineating relative homogeneous GC domains in DNA sequences. Gene 276, 57–72 (2001).
25. Salerno, W., Havlak, P. & Miller, J. Scale-invariant structure of whole-genome intersections and alignments. Proc. Natl. Acad. Sci. U.S.A. 103, 13121–5 (2006).
26. Peng, C. K. et al. Long-range correlation in nucleotide sequences. Nature 356, 168–170 (1992).
27. Roman-Roldan, R., Bernaola-Galvan, P. & Oliver, J. L. Compositional segmentation and long-range fractal correlation in DNA sequences. Phys. Rev. E 53, 5181–5189 (1996).
28. Hameister, J., Helm, W. E., Hütt, M. T. & Dehnert, M. Advances in Data Analysis, Data Handling and Business Intelligence. 627–637 (Springer, Berlin Heidelberg, 2010).
29. Bose, R. & Chouhan, S. Super-information: A novel measure of information useful for DNA sequences. Phys. Rev. E 83, 051918 (2011).
30. Akhter, S. et al. Applying Shannon information theory to bacterial and phage genomes and metagenomes. Sci. Rep. 3, 1033 (2013).
31. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 241–254 (Springer, Berlin Heidelberg, 2009).


Author information

Authors and Affiliations

Politecnico di Torino, Italy
A. Carbone
ISC-CNR, Unità Università ‘La Sapienza’ di Roma, Italy
A. Carbone
ETH Zurich, Switzerland
A. Carbone

Authors

A. Carbone

Ethics declarations

Competing interests

The author declares no competing financial interests.

Electronic supplementary material

Supplementary Information

Information measure for long range correlated sequence

Rights and permissions

This work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/


About this article

Cite this article

Carbone, A. Information Measure for Long-Range Correlated Sequences: the Case of the 24 Human Chromosomes. Sci Rep 3, 2721 (2013). https://doi.org/10.1038/srep02721


Received: 25 March 2013
Accepted: 04 September 2013
Published: 23 September 2013
DOI: https://doi.org/10.1038/srep02721
