The common origin of symmetry and structure in genetic sequences

When exploring the statistical properties of genetic sequences, two main features stand out: the existence of non-random structures at various scales (e.g., long-range correlations) and the presence of symmetries (e.g., Chargaff parity rules). In recent decades, numerous studies have investigated the origin and significance of each of these features separately. Here we show that both symmetry and structure have to be considered as the outcome of the same biological processes, whose cumulative effect can be quantitatively measured on extant genomes. We present a novel analysis (based on a minimal model) that not only explains and reproduces previous observations but also predicts the existence of a nested hierarchy of symmetries emerging at different structural scales. Our genome-wide analysis of H. sapiens confirms the theoretical predictions.


Notation and model properties
To fix notation, we describe our model and its statistical properties as follows:
• The full sequence is built by concatenating $r$ macro-structures: $s = m_1 m_2 \cdots m_r$.
• A macro-structure $m$ is built by concatenating $m$ domains: $m = d^{m}_{1} d^{m}_{2} \cdots d^{m}_{m}$.
• The average domain length is denoted by $L_D$, and the average macro-structure length is denoted by $L_M$. The total length of the sequence is $N$.
• A domain $d^{m}$ in the macro-structure $m$ is a finite-size realisation of a process chosen between two symmetrically related process types: $C_m$ and $\hat{C}_m$. We use the notation $d \in C$ to indicate that $d$ is generated by the process of type $C$.
• For a given observable $X$, we denote by $f_C(X)$ the limiting relative frequency of occurrence of $X$ in a domain of type $C$. Recall that the definition of symmetrically related processes (of the same macro-structure) imposes that, for every choice of $X$,
$$f_{C_m}(X) = f_{\hat{C}_m}(\hat{X}),$$
where $\hat{X}$ denotes the reverse complement of $X$. In principle, different macro-structures have different process-type statistics.
• We denote by $\pi_m(\{c, l\})$ the relative frequency of a cluster of length $l$ of domains of the same type $c$.
• In each macro-structure, the probability distribution of domain sizes is denoted by $p_m(l)$.
• We do not enforce any prescription for concatenating domains in a macro-structure (the ordering is determined by $\pi$), apart from the following properties:
- For every cluster length $l$, $\pi_m(\{C_m, l\}) = \pi_m(\{\hat{C}_m, l\})$. This ensures that the structural statistics of the ordering of the two symmetrically coupled domain types is unbiased.
- There exists a scale $L_S$ beyond which correlations in domain ordering can be neglected. $L_S$ is thus the average size of clusters of domains of the same type.
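The model described above can be illustrated with a small generator. The following is only a minimal sketch under simplifying assumptions that are not part of the original model: domains have a fixed length (so $p_m(l)$ is a delta function), clusters contain a fixed number of domains, and the process of type $C$ is an i.i.d. nucleotide source. A domain of the symmetrically related type is obtained as the reverse complement of a realisation of $C$. All function and parameter names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Return the reverse complement of a nucleotide string."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def sample_domain(length, weights, rng):
    """Toy type-C domain: i.i.d. nucleotides drawn with the given A, C, G, T weights."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def build_macrostructure(n_clusters, domains_per_cluster, domain_length, weights, rng):
    """Concatenate clusters of same-type domains.

    Each cluster is of type C or of the symmetrically related type with
    probability 1/2, so the ordering of the two types is unbiased.
    """
    domains = []
    for _ in range(n_clusters):
        inverted = rng.random() < 0.5  # type of this cluster
        for _ in range(domains_per_cluster):
            domain = sample_domain(domain_length, weights, rng)
            domains.append(reverse_complement(domain) if inverted else domain)
    return "".join(domains)


def build_sequence(macro_weights, n_clusters, domains_per_cluster, domain_length, seed=0):
    """Concatenate one macro-structure per entry of macro_weights."""
    rng = random.Random(seed)
    return "".join(
        build_macrostructure(n_clusters, domains_per_cluster, domain_length, w, rng)
        for w in macro_weights
    )
```

In this sketch `domain_length` plays the role of $L_D$, `domains_per_cluster * domain_length` that of $L_S$, and the length of one macro-structure that of $L_M$; each entry of `macro_weights` gives one macro-structure its own process statistics.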

Derivation of symmetries
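We start by showing the validity of the extended Chargaff symmetry $P(X) = P(\hat{X})$ for short oligonucleotides $X$ of length $\ell_X$. The count of $X$ in the full sequence is dominated by occurrences of $X$ that do not overlap different macro-structures. Within each macro-structure the two symmetrically related process types occupy, on average, the same fraction of sites, so the frequencies $f_{C_m}(X)$ and $f_{\hat{C}_m}(X) = f_{C_m}(\hat{X})$ enter the count with equal weight, and thus we conclude that (with $N' = N - \ell_X$)
$$P(X) = \frac{\#(X)}{N'} \simeq \frac{\#(\hat{X})}{N'} = P(\hat{X}). \qquad (2)$$

We now show the validity of the nested hierarchy of symmetries discussed in the main paper.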
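The extended Chargaff symmetry can be checked numerically on any sequence by comparing the frequency of each word with that of its reverse complement. The sketch below is an illustration, not the analysis pipeline of the paper; the function names are hypothetical, and frequencies are normalised by the number of positions at which a word of length `k` can start.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Return the reverse complement of a nucleotide string."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def word_frequencies(seq, k):
    """Relative frequency of every word of length k occurring in seq."""
    if len(seq) < k:
        raise ValueError("sequence shorter than word length")
    total = len(seq) - k + 1  # number of starting positions
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: count / total for word, count in counts.items()}


def chargaff_gap(seq, k):
    """Largest |P(X) - P(reverse complement of X)| over words X of length k."""
    freq = word_frequencies(seq, k)
    return max(
        abs(p - freq.get(reverse_complement(word), 0.0))
        for word, p in freq.items()
    )
```

A sequence that equals its own reverse complement has a gap of exactly zero for every `k`, while a homopolymer such as `"AAAA"` has the maximal gap of 1 for `k = 1`; real or simulated genomes are expected to fall in between, with small gaps for short words.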
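We focus on observables of the form $Y = (X_A, X_B; \ell)$, where $X_A$ and $X_B$ are oligonucleotides of size much smaller than the typical domain size $L_D$ and $\ell$ is their separation along the sequence. We always approximate the counts of $X$ inside a domain of type $C$ and of length $l$ by $l \cdot f_C(X)$. Define the decomposition
$$\#(Y) = \#_{\mathrm{same}}(Y) + \#_{\mathrm{diff}}(Y) + \#_{\mathrm{bound}}(Y), \qquad (3)$$
where the first term counts occurrences in which $X_A$ and $X_B$ lie in the same domain, the second those in which they lie in different domains, and the last those in which $X_A$ or $X_B$ overlaps a boundary between domains. As we will consider only the case $l_{X_A}, l_{X_B} \ll L_D$, we neglect the last term.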
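We can now investigate and single out the main contributions to the overall count $\#(Y)$ at different scales:
- ($\ell \ll L_D$): At these scales the following sum dominates,
$$\#(Y) \simeq \sum_{m} \Big[ f_{C_m}(Y) \sum_{d \in C_m} l_d + f_{\hat{C}_m}(Y) \sum_{d \in \hat{C}_m} l_d \Big],$$
where $l_d$ is the length of domain $d$. We conclude that $\#(X_A, X_B, \ell) \simeq \#(\hat{X}_B, \hat{X}_A, \ell)$ at these scales, and thus symmetry S1 is valid. This can also be derived directly from equation (2).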
For $\ell \gg L_D$, $X_A$ and $X_B$ typically lie in different domains and therefore the second term in equation (3) dominates. The counts will be estimated as the product of the probabilities of $X_A$ and $X_B$ because each domain is an independent realisation. At different scales there are different relationships between the domains in which $X_A$ and $X_B$ typically lie, leading to the following cases:
- ($L_D \ll \ell < L_S$): At these scales the sum is dominated by counts of $Y$ inside a cluster of domains of the same type. Each cluster of type $c$ contributes to the counts of $Y$ with a term proportional to $f_c(X_A)\, f_c(X_B)$, which is unchanged when $X_A$ and $X_B$ are exchanged. We conclude that $\#(X_A, X_B, \ell) \simeq \#(X_B, X_A, \ell)$ at these scales, and thus symmetry S2 (and S1) is valid. If the processes are such that correlations inside domains vanish at a scale smaller than the realisation of the process, we consider this shorter correlation scale to be the effective domain size $L_D$, and S2 sets in at this shorter scale.
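- ($L_S \ll \ell \ll L_M$): At these scales the sum is dominated by $X_A$ and $X_B$ lying in different clusters $i$ and $j > i$ of the same macro-structure, each pair contributing a term proportional to $f_{c_i}(X_A)\, f_{c_j}(X_B)$, where $c_i$ is the type of cluster $i$. Since the types of distinct clusters are uncorrelated and unbiased, each of $c_i$ and $c_j$ takes the two symmetrically related values with equal weight. The counts are therefore unchanged when $X_A$ or $X_B$ is separately replaced by its reverse complement, or when the two are exchanged, at these scales, and thus symmetry S3 (and S2, S1 and S4) is valid.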
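- ($\ell \gg L_M$): At these scales the sum is dominated by counts where $X_A$ and $X_B$ are in different macro-structures:
$$\#(Y) \simeq \sum_{m} \sum_{n>m} q_{m,n}(\ell)\, \tfrac{1}{2}\big[f_{C_m}(X_A) + f_{\hat{C}_m}(X_A)\big]\, \tfrac{1}{2}\big[f_{C_n}(X_B) + f_{\hat{C}_n}(X_B)\big],$$
where $q_{m,n}(\ell)$ counts the pairs of sites separated by $\ell$ that lie in macro-structures $m$ and $n$, respectively.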
We conclude that $\#(X_A, X_B, \ell) \simeq \#(\hat{X}_A, X_B, \ell) \simeq \#(X_A, \hat{X}_B, \ell) \simeq \#(\hat{X}_A, \hat{X}_B, \ell)$ and thus symmetry S4 is valid.
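The scale dependence of the symmetries can be explored by counting co-occurrences directly. The sketch below is a minimal illustration with hypothetical function names. It assumes that the separation is measured as the number of sites between the end of $X_A$ and the start of $X_B$ (the `gap` parameter), a convention chosen here because it is preserved under reverse complementation; the source may use a different convention.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Return the reverse complement of a nucleotide string."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def count_pairs(seq, x_a, x_b, gap):
    """Count positions where x_a is followed, after `gap` sites, by x_b."""
    a, b = len(x_a), len(x_b)
    span = a + gap + b  # total footprint of the pattern
    count = 0
    for i in range(len(seq) - span + 1):
        if seq.startswith(x_a, i) and seq.startswith(x_b, i + a + gap):
            count += 1
    return count


def symmetry_counts(seq, x_a, x_b, gap):
    """Counts of (x_a, x_b) and of its reverse-complement partners."""
    rc_a, rc_b = reverse_complement(x_a), reverse_complement(x_b)
    return {
        "(A, B)": count_pairs(seq, x_a, x_b, gap),
        "(rcB, rcA)": count_pairs(seq, rc_b, rc_a, gap),  # reverse complement of the whole pattern
        "(rcA, B)": count_pairs(seq, rc_a, x_b, gap),
        "(A, rcB)": count_pairs(seq, x_a, rc_b, gap),
        "(rcA, rcB)": count_pairs(seq, rc_a, rc_b, gap),
    }
```

Evaluating `symmetry_counts` over a range of `gap` values, and comparing the entries after normalisation, gives a rough picture of which counts agree at which separations.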