Treatises on biological evolution reflect a conflict between the relative roles played by contingency and necessity1. An important tradition in evolutionary biology, based on a large amount of empirical evidence, considers that contingency marks the dynamics of evolution in a way that makes it unpredictable1,2,3. The trend towards the appearance of increasing complexity falls within the frame of contingent evolution insofar as it is inevitable given that, passively, we can expect that sooner or later more complex entities will evolve from the original, simpler entities. This is what Gould2 referred to as ‘the passive tendency towards complexity marked by the minimum initial complexity wall’.

A central task for those studying complexity is to provide an accurate measure to ascertain whether there is a trend of increasing complexity3,4. In fact, a necessary condition for progressive and open-ended evolution is the existence of a metric that increases with the evolutionary age of the corresponding organisms4,5. We suggest that we can find such metrics in the genomes6. Genomes probably provide the best record of the biological history of a species. Not only do they enable us to reconstruct their phylogenetic relationships but they also contain information gained from their continuous biotic and environmental interactions over time6,7,8. Standard genome parameters such as genome size, number of genes, and gene components (i.e., introns, exons) are insufficient indicators of genome complexity because they only partially capture the historical information encoded in a genome9,10. We suggest here that metrics unassociated with biological functions may improve our measurements of genome sequence complexity. However, some metrics that have been previously applied to genomes are too broad, and not all of them accurately capture all the necessary information gleaned from a genome during its evolutionary history6,11. For example, algorithmic complexity12,13 is inconveniently maximized for randomness, and the effective complexity of Gell-Mann and Lloyd14 is recommended for collections or ensembles of sequences, but in several cases, such as genome sequences, it is not clear how to define an appropriate ensemble. Likewise, those metrics based on mutual information or statistical dependence15,16 also quantify the complexity of sequence ensembles rather than the complexity of a single sequence.

Here we consider six metrics that are calculated on individual genomes. The first four metrics are based on the Sequence Compositional Complexity (SCC) derived from a four-symbol DNA sequence or the binary sequences resulting from grouping the four nucleotides into S(C,G) versus W(A,T) or R(A,G) versus Y (T,C), or K(A,C) versus M(T,G), thus obtaining SCCSW, SCCRY and SCCKM metrics, respectively17. These four metrics increase with the number of parts (i.e. compositional domains) as well as the length and compositional differences among them found in a genome sequence by a segmentation algorithm. These metrics parallel the concept of ‘pure complexity’ of McShea18 and McShea and Brandon3, where complexity is more closely related to the number of part types of an individual than with the number of functions.

The fifth metric we used is the Biobit (BB), a metric based on the difference between the maximum entropy for a k-mer of a random genome of the same length as the genome under consideration and the entropy of that genome for such a k-mer19. Lastly, we used the Genomic Signature (GS), also a k-mer-based metric, which is the value corresponding to the k-mer that maximizes the difference between observed and expected equi-frequent classes of mers. GS is based on the relative abundances of short oligonucleotides20 and chaos game representation applied to genomes21,22.
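Both BB and GS start from the k-mer composition of a single genome. As a rough illustration only, the sketch below computes the Shannon entropy of the observed k-mer distribution and its gap from a simple upper bound. The function names and the bound `min(2k, log2(positions))` are assumptions made for this example; this is not the BB or GS formula of the cited works.

```python
from collections import Counter
from math import log2


def kmer_entropy(seq, k):
    """Shannon entropy (bits) of the overlapping k-mer distribution of seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def entropy_gap(seq, k):
    """Gap between a simple entropy upper bound and the observed k-mer entropy.

    Assumption for illustration: the bound is min(2k, log2(number of k-mer
    positions)), i.e. all 4**k words equally likely, capped by sequence length.
    """
    positions = len(seq) - k + 1
    max_entropy = min(2 * k, log2(positions))
    return max_entropy - kmer_entropy(seq, k)


if __name__ == "__main__":
    toy = "ACGT" * 4
    print(kmer_entropy(toy, 1), entropy_gap(toy, 1))
```

A sequence whose k-mers are all equally frequent has a gap of zero under this bound, whereas a homopolymer has the largest possible gap.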

We tested the above-mentioned metrics by analyzing the genome evolution of an ancient and diverse group of organisms: the phylum Cyanobacteria. These microorganisms played a fundamental role in the evolution of life on Earth. The fossil record shows that they were present 2.0 billion years ago (Bya) and molecular clock analyses indicate that the phylum originated over 2.5 Bya23,24. By releasing oxygen through photosynthesis, Cyanobacteria caused the Great Oxidation Event about 2.3 Bya and changed the history of life on Earth25. The oxidation of the environment allowed the evolution of complex multicellular life forms26, leading to the origin of eukaryotic crown groups including plants and animals27. As is well known, Cyanobacteria were also the progenitors of plastids through symbiosis with ancient eukaryotes28.

Cyanobacteria are morphologically diverse. The group was traditionally classified into five subsections according to several biological criteria29,30. Unicellular cyanobacteria are classified in subsections I and II, depending on their mode of reproduction. Those from subsection I (Chroococcales) divide only by binary fission, while those from subsection II (Pleurocapsales) can also reproduce by multiple fission, giving rise to small cells called baeocytes. Filamentous cyanobacteria are classified into subsections III, IV, and V. Those from subsection III (Oscillatoriales) are composed only of vegetative cells that reproduce by intercalary division. Cyanobacteria from subsections IV and V are capable of cell differentiation, producing trichomes composed of vegetative cells and heterocysts for nitrogen fixation. In addition, some members also produce hormogonia for dispersal and akinetes for dormancy. Members of subsection IV (Nostocales) always divide in a plane at right angles to the long axis of the trichome, while those from subsection V (Stigonematales) may also divide in planes parallel to the long axis of the trichome.

Of the above subsections of Cyanobacteria, only Stigonematales are monophyletic24,31. More recent classification schemes using phylogenetic analysis from 31 conserved protein sequences divide Cyanobacteria into nine different groups32. These include Gloeobacterales, Synechococcales, Oscillatoriales, Chroococcales, Pleurocapsales, Spirulinales, Rubidibacter/Halothece, Chroococcidiopsidales, and Nostocales. Of these groups, Chroococcales, Oscillatoriales, and Synechococcales are not monophyletic. This classification scheme attempts to reconcile phylogenetic history with several aspects of morphology and cytology. Other phylogenetic analyses based on 31 concatenated conserved proteins divide cyanobacteria into seven groups33. These groups are named from A to G (groups B and C are further subdivided into B1, B2 and B3 and C1, C2 and C3) and all of them are monophyletic. Furthermore, taxon addition and subtraction analyses on a concatenated dataset of 137 conserved proteins and two rRNA sequences allowed the identification of long-branch attraction artifacts34. The resulting tree was used to classify cyanobacteria into 6 monophyletic groups, corresponding to some of the A to G lineages. Finally, phylogenetic analysis on a concatenated dataset of 43 proteins from 208 taxa recovered all A–G groups and revealed the existence of novel monophyletic lineages located at the base of the tree35. Clearly, the taxonomy and evolution of Cyanobacteria are active areas of research. The classification of Cyanobacteria is likely to change in the near future as more lineages are sequenced and analyzed.

In this study, we test whether there is a statistically and phylogenetically supported driven tendency towards increasing genome sequence complexity in the evolution of Cyanobacteria as reflected by some of their metrics of genomic complexity.

Results

Complexity measures throughout Cyanobacteria phylogeny

The values of the four SCCs, BB and GS metrics as well as three standard genome parameters (Genome size, %GC and No. of genes) (see “Methods” section) for 91 Cyanobacterial genomes are reported in Table S1. Figure 1 shows a maximum likelihood phylogeny of Cyanobacteria whose branch lengths are proportional to the number of amino acid substitutions (see “Methods” section). The phylogeny is in general agreement with the previous analyses24,31,32.

Figure 1

Phylogeny of Cyanobacteria with the metrics of sequence complexity and genome parameters for each chosen genome. The values of metrics and parameters are proportional to circle size. The colored species correspond to four monophyletic sub-clades that were used to test evidence of a driven trend for each sub-clade (see also Fig. S2).

Phylogenetic signal

All metrics of complexity and genome parameters showed a significant phylogenetic signal (Table 1), indicating that in all cases the genomes of closely related cyanobacterial species tend to resemble each other more than two randomly selected genomes do (Fig. 1). However, the magnitude of the phylogenetic signal differs across metrics and parameters, with %GC and GS showing the highest values, which probably reflects different forces shaping the evolution of all these metrics and parameters (see “Discussion” section).

Table 1 Phylogenetic signals (K) of metrics of genome sequence complexity and genome parameters.

Phylogenetic correlations

To gain a better understanding of the metrics, after correcting for the phylogenetic signal, we evaluated how they correlate with each other and, particularly, with the genomic parameters (Table 2). It is worth noting that two metrics, SCC and SCCRY, correlate with other metrics and parameters (six correlations each), accounting for 43% of all significant correlations.

Table 2 Phylogenetic Pearson correlations (r) among metrics of genome complexity and genome parameters.

Ridge regression of complexity metrics versus distance from the root

Using ridge regression of genome complexity metrics and genome parameters versus distance from the root, we studied whether there is evidence of evolutionary trends and detected interesting patterns (Fig. 2). Of the complexity metrics, four out of six show a statistically significant positive trend (SCC, SCCSW, SCCRY and GS), indicating that complexity, as determined by our proposed criteria, has increased with the distance from the root. In contrast, SCCKM shows no trend and BB reveals a significant negative trend. Interestingly, genome parameters show no evidence of any evolutionary trend.

Figure 2

Phylogenetic trends of genomic complexity metrics (a) and standard genome parameters (b). The estimated genomic value for each tip (red circles) or node (white circles) in the phylogenetic tree is regressed against its evolutionary age (i.e., distance from the root). The statistical significance of the regression is tested against 10,000 slopes obtained under simulated Brownian evolution. The slopes and their P values are shown in Table S2.
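The logic of testing a slope against slopes obtained under simulated Brownian evolution can be sketched as follows. This is a simplified stand-in, not the ridge-regression procedure used in the study: it uses ordinary least squares, a toy tree given as a hypothetical `{child: (parent, branch_length)}` dictionary listed with parents before children, and a user-supplied Brownian rate `sigma` that in practice would have to be estimated from the data.

```python
import random
from math import sqrt


def root_distances(tree, root):
    """Distance from the root for every node; tree maps child -> (parent, length)."""
    dist = {root: 0.0}
    for child, (parent, length) in tree.items():  # parents must precede children
        dist[child] = dist[parent] + length
    return dist


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def simulate_brownian(tree, root, rng, sigma=1.0):
    """Trait values evolved by Brownian motion along the branches of the tree."""
    value = {root: 0.0}
    for child, (parent, length) in tree.items():
        value[child] = value[parent] + rng.gauss(0.0, sigma * sqrt(length))
    return value


def slope_p_value(tree, root, observed, n_sim=1000, sigma=1.0, seed=0):
    """Observed slope and the fraction of Brownian slopes at least as extreme."""
    rng = random.Random(seed)
    dist = root_distances(tree, root)
    nodes = list(dist)
    xs = [dist[n] for n in nodes]
    obs = ols_slope(xs, [observed[n] for n in nodes])
    hits = 0
    for _ in range(n_sim):
        sim = simulate_brownian(tree, root, rng, sigma)
        if abs(ols_slope(xs, [sim[n] for n in nodes])) >= abs(obs):
            hits += 1
    return obs, hits / n_sim


if __name__ == "__main__":
    toy_tree = {"A": ("root", 1.0), "B": ("root", 1.0),
                "A1": ("A", 1.0), "A2": ("A", 2.0)}
    trait = {n: 1000.0 * d for n, d in root_distances(toy_tree, "root").items()}
    print(slope_p_value(toy_tree, "root", trait))
```

The design choice to compare against simulated slopes rather than a parametric null reflects the fact that values at related nodes are not independent, so the usual regression P value would be unreliable.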

Driven trends in Cyanobacteria

A critical question regarding trends is whether they are passive or driven. To evaluate this, we applied three types of tests (see “Methods” section for a detailed description): the minimum (with three types of proofs), the ancestor–descendant, and the subclade (with two types of proofs) tests.

Regarding the first proof of the minimum test (i.e., skewness), we observed that all metrics and parameters (except SCC and %GC) exhibit significant and positive skewness for the entire phylum (D’Agostino–Pearson test, n = 91; Table 3), which supports a left wall for these metrics and parameters that is compatible with either a passive or a driven trend. Nevertheless, if the minimum value of a given metric or parameter increases with evolutionary time, then the trend is probably driven. To test this, we took as the minimum the estimated value of the most basal clade, xb, for each metric/parameter (Fig. 1). However, as can be observed (Fig. 3), there are values both lower and higher than the one corresponding to the basal clade for every metric/parameter. It is therefore necessary to study in greater depth the distribution of lower and higher values with respect to xb in order to find evidence of a driven trend. To this end, we carried out the second proof of the minimum test, where we measured |xd − xb|, the absolute difference between descendant clades and the most basal clade. Table 3 shows whether there is a statistical difference (Chi-square test) between the numbers of clades (179 in total) that are higher or lower than the basal clade, xb. As can be observed, all the tests are significant, with four metrics (SCC, SCCRY, SCCKM and BB) and two parameters (Genome size and No. of genes) showing a significant increase in the metric/parameter with respect to the corresponding basal values. In contrast, two metrics (SCCSW and GS) and one parameter (%GC) present a significant decrease. Finally, employing a Student’s t-test (third proof of the minimum test), we tested whether there is a statistical difference between the average values of the absolute difference (|xd − xb|) for clades higher and lower than xb. It can be observed (Table 3) that three metrics (SCCSW, SCCRY and SCCKM) show a statistical difference in favor of a higher value than xb and one metric (GS) and the three parameters (Genome size, %GC content, and No. of genes) present a statistical difference in favor of a lower value than xb.
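The second proof of the minimum test amounts to counting clades above and below xb and asking whether the split departs from chance. A minimal sketch is given below, assuming a 50:50 expected split (the text does not state the expected frequencies) and using hypothetical function and variable names.

```python
from math import erfc, sqrt


def higher_lower_chi2(values, xb):
    """Chi-square goodness-of-fit test of the counts above vs. below xb.

    Assumption for illustration: under the null, half of the clades lie
    above xb and half below. Values equal to xb are ignored.
    """
    higher = sum(v > xb for v in values)
    lower = sum(v < xb for v in values)
    n = higher + lower
    if n == 0:
        raise ValueError("no values differ from xb")
    expected = n / 2
    chi2 = ((higher - expected) ** 2 + (lower - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return higher, lower, chi2, p_value


if __name__ == "__main__":
    clade_values = [1.0] * 30 + [-1.0] * 10  # made-up values, xb = 0
    print(higher_lower_chi2(clade_values, 0.0))
```

With one degree of freedom the P value has a closed form in terms of the complementary error function, so no statistics library is needed for this case.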

Table 3 Proofs of the minimum test.
Figure 3

Distribution of metrics and parameters according to root-to-tip distance. The interior dashed line corresponds to the value of the basal clade, xb. The histograms that appear above each figure correspond to the number of accumulated values of metrics and parameters (regardless of the age) ranging from lower (left) to higher (right) values than xb.

Regarding the ancestor–descendant test (see “Methods” section for a detailed description), we tabulated the derived clades for all possible nodes and whether they present a higher, lower, or equal value of the metric/parameter than the ancestral clade corresponding to each node. In order to avoid bias due to proximity to the putative left wall, McShea36 recommended applying the test only to those clades where both ancestor and descendant are higher than the average value of the metric/parameter. As can be observed (Fisher exact test, Table 4), this stringent test shows that metrics SCC and GS and the three genome parameters are in favor of a driven trend. The ancestor–descendant proof can be visualized on the phylogeny of the Cyanobacteria for each metric/parameter by estimating the values of internal nodes using a maximum likelihood function and interpolating the value along each edge (see “Methods” section). Figure 4 shows the mapping corresponding to the SCC metric, where the driven positive trend of this metric can be clearly appreciated (see Fig. S1 for the mapping of the rest of metrics/parameters).
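The tabulation step of the ancestor–descendant test can be sketched as below. The input format (a list of ancestor/descendant value pairs) and the function name are assumptions made for this example, and the subsequent Fisher exact test is not reproduced because the layout of its contingency table is not given here.

```python
def ancestor_descendant_counts(pairs, threshold=None):
    """Count descendants with a higher, lower, or equal value than their ancestor.

    pairs: iterable of (ancestor_value, descendant_value).
    threshold: if given, keep only pairs where both values exceed it, which
    mimics restricting the test to clades far from the putative left wall.
    """
    counts = {"higher": 0, "lower": 0, "equal": 0}
    for ancestor, descendant in pairs:
        if threshold is not None and not (ancestor > threshold and descendant > threshold):
            continue
        if descendant > ancestor:
            counts["higher"] += 1
        elif descendant < ancestor:
            counts["lower"] += 1
        else:
            counts["equal"] += 1
    return counts


if __name__ == "__main__":
    toy_pairs = [(1, 2), (2, 1), (3, 4), (4, 4), (5, 6)]  # made-up values
    print(ancestor_descendant_counts(toy_pairs))
    print(ancestor_descendant_counts(toy_pairs, threshold=2.5))
```

In a driven trend, increases should still outnumber decreases after the threshold filter is applied, because the excess of increases is then not attributable to the lower bound.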

Table 4 Ancestor–descendant test.
Figure 4

Mapping of the SCC complexity metric on the Cyanobacteria tree.

Finally, the last test applied is the sub-clade test, with the two associated proofs. In the first proof, we tested whether the trend observed at the phylum level is also observed in four selected monophyletic clades; in the second, we applied the skewness test both to the entire phylum (results are given in Table 3) and to the chosen sub-clades. We have chosen four monophyletic clades formed by clusters 97, 132, 162, and 172 that harbor 18, 22, 11, and 8 species, respectively (four colors in Fig. 1 and Fig. S2). Clade 97 is formed by Synechococcus, Prochlorococcus, and Cyanobium; clade 132 corresponds to Nostocales (subsections IV and V); clade 162 contains Cyanothece and Microcystis; and clade 172, among others, contains Geminocystis and Cyanobacterium. The most relevant result found was that some metrics of genome complexity show statistically significant positive trends (SCC, SCCRY, and GS) and some others show negative trends (SCCSW and SCCKM), whereas the genome parameters do not show any positive trends (Table S2; Fig. S3). Thus, we keep SCC, SCCRY and GS as the metrics showing positive trends at both levels of phylogenetic resolution.

Regarding the second proof for the sub-clade test, we have examined if the monophyletic sub-clade drawn from the right tail of the entire distribution should have a statistically significant average higher value than the one corresponding to the entire phylum. Regarding the skewness of the phylum (Table 3), we observe that all metrics (except SCC and %GC) exhibit significant and positive skewness. However, this test of skewness cannot be applied to the four chosen monophyletic sub-clades either because (a) the average value (median) of a given metric/parameter for each sub-clade was lower than the median of the phylum (16 cases out of 36) or, (b) there was no statistical evidence (the remaining 20 cases) of a higher median (Mood’s median test) of a given metric/parameter for each sub-clade than the median of the entire phylum (see Table S3).

In summary, the overall results obtained in relation to the evidence found for a trend in a given metric or parameter, i.e., the phylogenetic signal, the number of significant correlations against the rest of metrics/parameters, as well as whether the trend is driven or not (Table 5) show that SCC, SCCRY and to a lower extent GS present the highest scores, and can thus be considered metrics evidencing progressive evolution of Cyanobacteria.

Table 5 Summary of the results for each sequence complexity metric and genome parameter.

Discussion

Genomes probably provide the best record of the biological history of species. Not only do they enable us to reconstruct their phylogenetic relationships but they also contain information gained from their continuous biotic and environmental interactions over time6,8. This information is an elusive but crucial component of the genome, whose study as a whole deserves deeper attention because it holds clues to answer many biological questions, particularly those of an evolutionary nature.

The genome has distinct layers of information encoded in DNA sequences10,37. The most well-known are those involved in biological function, such as the typical genome division into coding and non-coding parts or the differential conservation shown by distinct codon positions due to the differential evolutionary constraints acting within genes38,39,40. In the present study, we intend to capture or approximate the genome information held in these layers using certain metrics (collectively named ‘genome complexity metrics’) to determine whether they show phylogenetic signals and indicate some kind of evolutionary trend. To do so, we use a group of organisms with a long phylogenetic history: the phylum Cyanobacteria. SCC accounts for the global compositional complexity of a DNA sequence encoded by the four nucleotides (A, T, C, and G) and shares similarity with McShea’s18 operational definition of biological complexity, or the degree to which the parts of a morphological structure differ from each other. SCCSW may account for the complexity due to the partition of the genome into GC-rich and GC-poor segments (e.g., the isochores), which are known to be associated with many functionally relevant properties such as gene density, gene length, retrotransposon density, or recombination frequency41,42,43,44,45,46. Thus, SCCSW might capture the genome information gained throughout evolution by the selective forces acting on these important functional elements. On the other hand, SCCRY accounts for the complexity due to the partition of the genome into segments of different purine/pyrimidine richness. Such strand asymmetries are less directly related to biological function, but this alphabet has been useful to uncover long-range correlations and analyze the evolution of fractal structure in the genomes47,48,49. 
Recently, a connection has been found between strand symmetry and the repetitive action of transposable elements during evolution37 (see also Koonin50 and his concept of ‘fuzzy meaning’ of sequences). The partition given by SCCKM has not been associated with any biological function. Finally, GS and BB explore the maximum deviation for a given k-mer between a real and a random genome. GS directly compares the observed distribution of k-mer classes of a real genome with respect to that corresponding to a random one. On the other hand, by calculating the entropy differences between both groups, BB measures the relative entropic and anti-entropic fraction of a real genome19.

From a population genetics perspective, cyanobacteria can be considered prototypical bacterial species whose populations are evolving under high effective population sizes51, with intermediate mutation rates between those of RNA viruses (higher mutation rate) and lower or higher eukaryotes (lower mutation rates)52. Therefore, natural selection is expected to play a major role in the evolution of these organisms. Irrespective of whether mutations (or any source of genetic novelty) are deleterious or beneficial, their destiny will be dictated by the deterministic action of purifying or positive selection, respectively53,54. This observation is highly pertinent when it comes to appropriately interpreting the phylogenetic signals observed in the metrics of complexity measures and genome parameters following the in silico evolutionary processes described by Revell et al.55. Considering, thus, that selection is a key force in the evolution of Cyanobacteria, most of the K-values estimated for the metrics may reflect the action of purifying or stabilizing selection, particularly those that are below 1 (all metrics and parameters, except GS and %GC). K from GS is 1, which could be interpreted either as a random drift effect or, more convincingly for this type of organism, as fluctuating selection for a relatively high rate of movement of the optimum55. Finally, K associated with %GC is much higher than one, which can also be interpreted as the result of an evolutionary process with heterogeneous peak shifts.

Importantly, our study of the evolutionary trends in Cyanobacteria by means of ridge regression found clear differences between metrics of complexity and genome parameters. Four metrics (SCC, SCCRY, SCCSW, and GS) indicate changes toward higher complexity in more evolved clades (long-branch distance with respect to the root of the tree), while SCCKM does not show any signs of a trend and BB shows a negative trend. However, the genome parameters show no evidence of any trend (Fig. 2). These results are reinforced when comparatively analyzing trends between metrics and parameters at a lower phylogenetic resolution (i.e. in monophyletic subclades, Tables S2 and S3 and Fig. S3). Although metrics used in this work capture different aspects of the evolution of genome sequence complexity in Cyanobacteria (positive trends in SCC, SCCRY, and GS versus negative trends in SCCSW and SCCKM), the genome parameters never present any positive trends (Fig. S2 and Table S2). In that respect, although some metrics capture increasing sequence complexity, genome parameters do not.

It is worth noting that the metrics used to measure sequence complexity and the associated positive driven trends have captured something different from functional comparative genomics in Cyanobacteria. One interesting case is the comparison between those Cyanobacteria species that are multicellular and develop heterocysts or akinetes and those that do not develop such traits. We tested this by considering which of the species chosen in our data set have heterocysts versus no heterocysts and akinetes versus no akinetes (Table S1). The presence of heterocysts or akinetes could be taken as evidence of higher complexity compared with their absence. We carried out a test for each one of the metrics and genome parameters to see whether the heterocyst or akinete groups showed statistically significantly higher values than the non-heterocyst or non-akinete groups, respectively (Table S1). No statistically significant differences were found for any metric (except for SCCKM between akinete vs non-akinete, Mann–Whitney test, P < 0.05). However, when comparing the average values corresponding to genome parameters (genome size, gene number and %GC), we repeatedly observed that species with heterocysts or akinetes showed a statistically significant higher genome size, higher gene number, and lower %GC (Mann–Whitney test, P < 0.05). From a functional point of view, the standard genome parameters differentiate multicellular cyanobacteria from the rest, which is not the case for the metrics, particularly those showing a consistent positive driven trend (i.e., SCC, GS). These metrics are capturing something different in the genomic sequence. Take, for instance, the three species (see Fig. 4) that present the highest SCC values: Cyanobacterium stanieri, C. aponinum, and Trichodesmium erythraeum. They present a larger distance from the root, even larger than that of the SynPro clade (see Fig. 1). None of these three species, nor any member of the SynPro clade, has heterocysts or akinetes, and all appear to present a larger distance from the root than those species harboring these traits. It is clear, then, that the positive trend we have detected is reflecting something different. We speculate that the species showing a larger distance from the root may be more evolvable than those that present a shorter distance to it.

It is interesting, on the other hand, to point out the process of selection and genome streamlining of Synechococcus and Prochlorococcus in clade 97 (SynPro clade), giving rise to more evolved, shorter genomes, which are AT-rich and show a lower number of genes than the rest of Cyanobacteria (Table S1). As can be observed, there are statistically significant negative trends in the three genome parameters but also positive trends of the SCC (Fig. 4) and SCCRY metrics (Fig. S2 and Table S2). Therefore, genome reduction in this clade does not imply loss of genome complexity; on the contrary, our study shows that this clade also has a highly complex genome sequence56. It is also interesting to compare this specialist clade with generalist ones, like Microcystis sp. (Figs. 1, 4). The genus Microcystis appears to be older than the SynPro clade. Both, however, have neither heterocysts nor akinetes (as examples of complex functionality; i.e., multicellularity) but, in general, show a higher SCC or GS metric than the multicellular species. The higher SCC values that we observed in the SynPro clade indicate a higher intra-genome compositional diversity in these species (i.e., a higher number of compositional segments and/or higher compositional differences among them). In the same way that a high rate of genetic variability promotes a higher evolvability57, it can also be considered that both groups have developed a higher capacity to evolve, captured by some of the metrics that we have studied. Moreover, genome reduction and specialization in the SynPro clade, as already stated, are apparently not equivalent to the loss of genome sequence complexity.

In summary, considering that selection is a major driver in the evolution of Cyanobacteria, the observed positive trends towards increasing sequence complexity captured by the SCC, SCCRY, and GS metrics cannot be explained, contrary to what Gould2 holds, as a passive tendency to increase. The three tests carried out to determine whether positive trends are passive or driven show that the positive trend is driven and is likely due to the action of natural selection, although we have not tested for selection directly. Several of the metrics gathered in this study confirm this trend in the case of the evolutionary history of Cyanobacteria.

Methods

Phylogenetic analysis

Ninety-one complete and nearly complete cyanobacterial genomes were downloaded from GenBank and annotated using Prokka58 (Table S1). To infer a phylogenomic tree, we first identified the set of homologous gene families conserved among Cyanobacteria (core genome) using the get_homologues.pl pipeline59. For this, we used the BDBH and OMCL methodologies within get_homologues.pl with the following parameters: a threshold e-value ≤ \(10^{-10}\) for BLAST searches; a minimum percent amino acid identity > 30% between query and subject sequences; and, for OMCL, an inflation parameter (I) of 2.0. The consensus core-genome was inferred by the intersection of BDBH and OMCL gene families. To select high-quality phylogenetic markers from the core-genome (i.e. those gene families not showing recombination and/or horizontal gene transfer), we used the software package get_phylomarkers60. By this procedure, we obtained an alignment of 96 top markers comprising 36,760 amino acids. Clustal-Omega was used to align the protein sequences61. The multiple alignment was curated by eliminating uninformative sites and misaligned positions with Gblocks62. Finally, a maximum likelihood phylogeny was reconstructed using PhyML63 with the LG model + I (estimation of invariant sites) + G (gamma distribution) as selected by ProtTest364. The root was located on the branch connecting both Gloeobacter spp. to the rest of the cyanobacteria. This location of the root is based on cytological evidence (for instance, Gloeobacter spp. lack thylakoids) as well as phylogenetic and molecular clock analyses32,33,34,65.

Genome sequence complexity metrics

SCC

Sequence Compositional Complexity of genomes was calculated by using a two-step process. We first obtained the non-overlapping compositional domains comprising the genome sequence, and then applied an entropic complexity measurement able to account for the heterogeneity of such compositional domains. The compositional domains of a given genome sequence are obtained through a segmentation algorithm66 that uses the Jensen-Shannon entropic divergence67,68 to split the sequence (and, iteratively, the sub-sequences) into non-overlapping compositional domains which, at a given statistical significance, s, are homogeneous and compositionally different from the neighboring domains. It is worth mentioning that the segmentation algorithm we used, and hence the SCC complexity values derived from it, are invariant to sequence orientation, as Shannon entropy is invariant under symbol interchange.
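The core of one segmentation step, finding the cut point that maximizes the Jensen-Shannon divergence between the two resulting subsequences, can be sketched as follows. This is a naive quadratic-time illustration with hypothetical function names. It omits the significance test at level s and the recursion over sub-sequences, and it is not the implementation distributed by the authors.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy (bits) of the symbol frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def js_divergence(left, right):
    """Length-weighted Jensen-Shannon divergence between two adjacent subsequences."""
    n = len(left) + len(right)
    return (shannon_entropy(left + right)
            - len(left) / n * shannon_entropy(left)
            - len(right) / n * shannon_entropy(right))


def best_split(seq, min_len=1):
    """Return (position, divergence) of the single cut maximizing the divergence."""
    best_pos, best_div = None, -1.0
    for pos in range(min_len, len(seq) - min_len + 1):
        div = js_divergence(seq[:pos], seq[pos:])
        if div > best_div:
            best_pos, best_div = pos, div
    return best_pos, best_div


if __name__ == "__main__":
    toy = "A" * 10 + "G" * 10  # two perfectly homogeneous toy domains
    print(best_split(toy))
```

In a full segmentation, a cut would be accepted only if its divergence is significant at level s, and the same search would then be repeated within each resulting subsequence.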

Note also that the statistical significance level, s, is the probability that the difference between each pair of adjacent domains is not due to statistical fluctuations. By changing this parameter one can obtain the underlying distribution of segment lengths and nucleotide compositions at different levels of detail69, thus fulfilling one of the key requirements for complexity measures14. Improvements to this segmentation algorithm also allow the segmentation of long-range correlated sequences70. Full details of the segmentation algorithm have been published elsewhere71,72. Implementation details, as well as source code and executable binaries for different operating systems, can be downloaded from: https://github.com/bioinfoUGR/segment and https://github.com/bioinfoUGR/isofinder.

Once a genome sequence was segmented into n compositional domains, we computed SCC as:

$$ SCC = H(S) - \sum_{i = 1}^{n} \frac{G_{i}}{G} H(S_{i}) $$

where S denotes the whole genome and G its length, and Gi is the length of the i-th domain, Si. \(H(\cdot) = - \sum f \log_{2} f\) is the Shannon entropy of the distribution of relative frequencies of symbol occurrences, f, in the corresponding (sub)sequence17. It should be noted that the above expression is the same as the one used in the segmentation process, where it is applied to the two tentative new subsequences (n = 2) obtained at each step. Thus, the two parts of the SCC computation are based on the same theoretical background.
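As a worked illustration of the SCC expression, the sketch below evaluates it for a list of compositional domains given as strings. The function names and the toy domains are assumptions made for this example; in practice the domains would come from the segmentation step.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy (bits) of the symbol frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def scc(domains):
    """SCC = H(S) - sum_i (G_i / G) * H(S_i) for a list of domain strings S_i."""
    whole = "".join(domains)
    total_length = len(whole)
    within = sum(len(d) / total_length * shannon_entropy(d) for d in domains)
    return shannon_entropy(whole) - within


if __name__ == "__main__":
    # Four homogeneous toy domains: H(S) = 2 bits and every H(S_i) = 0.
    print(scc(["AAAA", "CCCC", "GGGG", "TTTT"]))
    # Two compositionally identical domains contribute no complexity.
    print(scc(["ACGT", "ACGT"]))
```

The two toy cases bracket the behavior of the metric: SCC is largest when domains are internally homogeneous but differ from one another, and it vanishes when all domains share the composition of the whole sequence.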

We applied the above two-step procedure to each of the entire four-symbol cyanobacterial genomes, thus obtaining an SCC value for each of them. In addition, we applied the same procedure to the binary sequences resulting from grouping the four nucleotides into S(C,G) versus W(A,T), R(A,G) versus Y(T,C), or K(G,T) versus M(A,C), thus obtaining the SCCSW, SCCRY and SCCKM metrics, respectively. These three additional metrics are partial complexities that provide views of genome complexity complementary to that obtained with the four-symbol sequence71,72.
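The recoding into binary sequences can be sketched as follows. This is a minimal example, not the authors' implementation: the `GROUPINGS` table and the `to_binary` helper are hypothetical names, and the choice of which class maps to `1` is arbitrary. The KM grouping separates {A, C} from {G, T}.

```python
# Two-class groupings of the four nucleotides (first class -> "1", second -> "0").
GROUPINGS = {
    "SW": ("CG", "AT"),
    "RY": ("AG", "CT"),
    "KM": ("GT", "AC"),
}


def to_binary(seq, grouping):
    """Recode a nucleotide sequence into a two-symbol sequence.

    Symbols other than A, C, G and T (e.g. N) are left unchanged.
    """
    first, second = GROUPINGS[grouping]
    table = str.maketrans(first + second, "1" * len(first) + "0" * len(second))
    return seq.upper().translate(table)


sw = to_binary("ACGT", "SW")
ry = to_binary("ACGT", "RY")
km = to_binary("ACGT", "KM")
```

Because Shannon entropy is invariant under symbol interchange, swapping the `1` and `0` labels would not change the resulting complexity values.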

Additional details on the segmentation carried out in Cyanobacteria are provided through the UCSC Genome Browser. Genome maps of the compositional segments obtained for each cyanobacterial genome analyzed in this paper can be found at the following link: https://sites.google.com/go.ugr.es/oliver/databases/dna-compositional-segments/cyanobacteria?authuser=0. Note that, once in the UCSC Genome Browser, the user can obtain a complete list of segment coordinates for each genome in plain text by clicking on Tools: Table Browser.

BB

Biobit (BB) is an informative measure of genome complexity: a generalized logistic map that balances the entropic and anti-entropic components of genomes and appears to be related to their evolutionary dynamics. BB compares a genome of a given length with random genomes of the same length to establish a measure of its complexity. More precisely, BB is a metric of genome sequence complexity derived from the comparison between the entropy of a random genome, evaluated at the k-mer length that maximizes it, and the corresponding entropy of the real genome of the same length19. The authors demonstrated that the entropy of a real genome of length G, E2L(G), takes a value between the minimum, L(G) = log4(G), and the maximum, 2L(G) = 2log4(G). They also define and measure two additional components of a real genome, which they call entropic (E(G)) and anti-entropic (A(G)), such that A(G) + E(G) = L(G). These components are given by E(G) = E2L(G) − L(G) and A(G) = 2L(G) − E2L(G), respectively. The BB of a genome, BB(G), is a non-linear combination of the entropic and anti-entropic components given by:

$$ BB\left( G \right) = \sqrt {L\left( G \right)} \sqrt {\frac{A\left( G \right)}{L\left( G \right)}} \left( {1 - 2\frac{A\left( G \right)}{L\left( G \right)}} \right)^{3} , $$

where \(\frac{A\left( G \right)}{{L\left( G \right)}}\) is the anti-entropic fraction of the genome and \( 1 - 2\frac{A\left( G \right)}{{L\left( G \right)}}\) is the corresponding entropic fraction. Both components vary between 0 and 1. Implementation details, as well as source code, can be downloaded from https://www.uv.es/~varnau/adn/Biobit32B.c.
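The BB expression can be evaluated with a few lines of Python once the entropy of the real genome is known. This is a sketch under stated assumptions and is unrelated to the C program linked above: the function name `biobit` is hypothetical, and E2L(G) is taken as a precomputed input (`e2l`) because its estimation from k-mer counts is not reproduced here.

```python
import math


def biobit(e2l, genome_length):
    """BB(G) = sqrt(L) * sqrt(A / L) * (1 - 2 * A / L) ** 3.

    e2l is the (precomputed) entropy E2L(G) of the real genome and must lie
    between L(G) and 2 * L(G), with L(G) = log4(G).
    """
    lg = math.log2(genome_length) / 2      # log4(G) = log2(G) / 2
    if not lg <= e2l <= 2 * lg:
        raise ValueError("e2l must lie between L(G) and 2 * L(G)")
    anti = 2 * lg - e2l                    # anti-entropic component A(G)
    fraction = anti / lg                   # anti-entropic fraction A(G) / L(G)
    return math.sqrt(lg) * math.sqrt(fraction) * (1 - 2 * fraction) ** 3


# Hypothetical values: a genome of 4**10 bases with E2L(G) = 19.
bb = biobit(19.0, 4 ** 10)
```

A genome whose entropy reaches the random maximum, E2L(G) = 2L(G), has A(G) = 0 and therefore BB(G) = 0.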

GS

The Chaos Game Representation (CGR)21,22 is an image derived from a genome in which each point corresponds to a k-mer at a given level of analysis. If the genome sequence is a random collection of bases, the CGR will be a uniformly filled square image. On the basis of the CGR built for a particular genome, we define a corresponding Genomic Signature (GS), a numerical value obtained for a particular k-mer level by comparing, point by point, the CGRs of a real genome and a random genome of the same length. To make the images comparable, their pixel values are normalized. The size of the images generated depends on the k-mer length used: for a given k there are \(4^{k}\) different words, and the corresponding image has \(4^{k}\) pixels. Building a table of the frequency of each k-mer minus its expected frequency in a random genome is equivalent to taking the difference between the CGR images of a real and a random genome. If G is the size of the genome to analyze, the expected value (EV) for a given k-mer is \(EV = (G - k + 1)/4^{k}\). This value is used to normalize to 1 the values of the k-mers obtained for each of the genomes analyzed. We then define the GS as:

$$ GS = \max_{k} \sum_{i = 1}^{4^{k}} \left| {\frac{P_{i}}{EV} - 1} \right| $$

where Pi is the relative frequency of the k-mer i. Implementation details, as well as source code, can be downloaded from https://www.uv.es/~varnau/adn/word_chaos_GS.c.
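A minimal Python sketch of the GS computation is given below. It is unrelated to the C program linked above and makes three assumptions: the sequence contains only A, C, G and T; Pi is read as the observed number of occurrences of word i, so that the ratio Pi/EV is dimensionless; and the range of k values is supplied by the caller. The function names are hypothetical.

```python
from collections import Counter


def kmer_deviation(seq, k):
    """Sum over all 4**k words of |count / EV - 1|, with EV = (G - k + 1) / 4**k."""
    n_windows = len(seq) - k + 1
    expected = n_windows / 4 ** k
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    observed = sum(abs(c / expected - 1) for c in counts.values())
    absent = 4 ** k - len(counts)   # each unseen word contributes |0 - 1| = 1
    return observed + absent


def genomic_signature(seq, k_values):
    """GS: the largest deviation over the k-mer lengths examined."""
    return max(kmer_deviation(seq, k) for k in k_values)


# Hypothetical toy sequences.
uniform = kmer_deviation("ACGT", 1)
skewed = genomic_signature("AAAA", [1, 2])
```

A sequence in which every word occurs exactly as often as expected yields a deviation of zero, whereas a homopolymer concentrates all counts on a single word and gives a large value.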

Standard genome parameters

Finally, we have also included three standard genome parameters: genome size, %GC and number of genes.

Phylogenetic signal

We used the phylogenetic tree of Cyanobacteria to test for a phylogenetic signal in the genome complexity metrics and genome parameters using the K statistic of Blomberg et al.73, as implemented in the picante package for R74. K ranges from 0 to ∞. K values significantly higher than zero indicate the presence of a phylogenetic signal or, in other words, that closely related species resemble each other in the studied trait more than expected by chance. K = 1 is the value expected under Brownian evolution.

Phylogenetic correlations

We examined the correlation between genome parameters and metrics of genome complexity after correcting for the phylogenetic signal. The Pearson r value between two variables was computed from their phylogenetic trait variance–covariance matrix, and its significance was tested against a t-distribution with n − 2 degrees of freedom. We used the R code provided by Liam Revell to perform Pearson correlation with phylogenetic data (https://blog.phytools.org/2017/08/pearson-correlation-with-phylogenetic.html). The P value obtained with this procedure is the same as that provided by a phylogenetic generalized least squares model. As we ran multiple phylogenetic correlations, we corrected the P values using the false discovery rate.
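The text does not state which false discovery rate procedure was applied. The sketch below assumes the Benjamini-Hochberg step-up adjustment, which is one common choice; the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical P values from four phylogenetic correlations.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Each raw P value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw values increase.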

Evolutionary trends

We tested for an evolutionary trend in the genomic complexity measures and genome parameters by fitting a ridge regression of each of these genomic values against tip-to-root or node-to-root distances. The search.trend function in the RRphylo package75 performs a phylogenetic ridge regression between the trait values of the tips/nodes of a phylogenetic tree and their distance to the root. The values of traits (in our case, genomic complexity and genome parameters) on internal nodes of the tree were reconstructed by the RRphylo package by applying a ridge regression for continuous ancestral character estimation, as explained in76. Similar to other ancestral reconstruction methods, ancestral states are calculated as a weighted average of the tip values while taking into account the phylogenetic correlation structure of the data. However, ridge regression accounts for varying rates of evolution in different regions of the tree and estimates them simultaneously with the ancestral characters. The significance of the ridge regression slope was tested against 10,000 slopes obtained after simulating a simple (i.e., no-trend) Brownian evolution of the trait in our phylogenetic tree75.

Continuous character mapping

We used two functions (contMap and fastAnc) from the phytools R package77. The contMap R function allows plotting a tree with a mapped continuous character, such as any of our complexity measures. Mapping is accomplished by estimating states at internal nodes using maximum likelihood with the function fastAnc and interpolating the states along each edge using Equation 2 of78.

Testing trends: passive or driven

To unravel whether the positive trends are passive or driven we have applied three types of tests, called the minimum, the ancestor–descendant and the subclade test, respectively3,36. These tests are well known in paleontology and evolutionary biology and, to the best of our knowledge, this is the first time they have been applied to genome evolutionary analyses. To gain a better understanding of the positive trends we have also applied those tests for comparative purposes to the metrics and genome parameters that do not show evidence of such a positive evolutionary trend.

Minimum test

Regarding the minimum test, we applied three types of proofs. The first one evaluates whether a positive skewness of the entire phylum supports the existence of a left wall. If the minimum value of a given metric or parameter delimiting the left wall increases with evolutionary time, the trend is probably driven. To evaluate this, we considered as the minimum the estimated value of the most basal clade, xb, for each metric/parameter (Fig. 1). In the second proof of the minimum test, we measured |xd − xb|, the absolute difference between the descendant clades and the most basal clade, to see whether there is a statistical difference between the clades that are higher and those that are lower than the basal clade, xb. Finally, the third proof of the minimum test examines whether the average absolute difference (|xd − xb|) of a given metric or parameter differs statistically between the clades higher and lower than xb.
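The quantities used in these proofs can be sketched as follows. The skewness estimator is not specified in the text, so the moment-based coefficient is assumed here; the function names and the example values are hypothetical, and the statistical comparison of the two groups is not shown.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (an assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def split_by_basal(clade_values, basal_value):
    """Absolute differences |xd - xb| for clades above and below xb."""
    above = [v - basal_value for v in clade_values if v > basal_value]
    below = [basal_value - v for v in clade_values if v < basal_value]
    return above, below


# Hypothetical metric values for descendant clades, with xb = 2.0.
skew = sample_skewness([1.0, 1.0, 1.0, 5.0])
above, below = split_by_basal([1.0, 3.0, 5.0, 2.0], 2.0)
```

The lengths of `above` and `below` correspond to the counts compared in the second proof, and their means to the averages compared in the third.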

The ancestor–descendant test

According to Gould2, the ancestor–descendant test is the most appropriate one to discover whether positive trends are passive or driven. McShea36 indicates that in a passive system increases and decreases should be equally frequent, whereas in a driven trend increases should occur more often than decreases. To test this, we tabulated, for all possible nodes, whether the derived clades present a higher, lower, or equal value of the metric/parameter than the ancestral clade corresponding to each node. To avoid bias due to proximity to the putative left wall, McShea36 recommends applying the test only to those clades where both ancestor and descendant are higher than the average value of the metric/parameter.
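The tabulation can be sketched in Python as shown below. The text does not name the statistical test used to compare increases and decreases, so an exact two-sided sign test is assumed here; the function names, the `threshold` argument and the example pairs are hypothetical.

```python
from math import comb


def tabulate_changes(pairs, threshold=None):
    """Count increases, decreases and ties over (ancestor, descendant) pairs.

    If threshold is given, only pairs in which both values exceed it are kept.
    """
    increases = decreases = ties = 0
    for ancestor, descendant in pairs:
        if threshold is not None and not (ancestor > threshold and descendant > threshold):
            continue
        if descendant > ancestor:
            increases += 1
        elif descendant < ancestor:
            decreases += 1
        else:
            ties += 1
    return increases, decreases, ties


def sign_test_p(increases, decreases):
    """Exact two-sided binomial P value under equal odds of increase and decrease."""
    n = increases + decreases
    if n == 0:
        return 1.0
    k = max(increases, decreases)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical (ancestor, descendant) values of one metric.
pairs = [(1, 2), (2, 3), (3, 2), (2, 2), (0.5, 4)]
all_counts = tabulate_changes(pairs)
filtered_counts = tabulate_changes(pairs, threshold=1.5)
```

Passing the average value of the metric as `threshold` restricts the tabulation to pairs lying away from the putative left wall. Ties are reported but excluded from the sign test.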

The sub-clade test

The final test applied is the sub-clade test. According to McShea18, if the parent distribution is skewed (see histograms of Fig. 3; Table 3) and the distribution of a sub-clade drawn from the right tail is also skewed, the system is probably driven. For this test, we applied two types of proofs. First, we tested whether the trend observed at the phylum level is also observed in four selected monophyletic clades (colored species in Fig. 1). Second, we applied the skewness test proposed by McShea18 to the entire phylum. For this second proof, we followed the criterion given by McShea36 whereby the monophyletic sub-clade drawn from the right tail of the entire distribution should have an average (median) value that is significantly higher than that of the entire phylum.

Basic statistical analyses and graphs were performed using Origin (OriginLab Corporation, Northampton, MA, USA) and R (R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/).