Abstract
Progressive evolution, or the tendency towards increasing complexity, is a controversial issue in biology whose resolution requires a proper measurement of complexity. Genomes are the best entities to address this challenge, as they encode the historical information of a species’ biotic and environmental interactions. As a case study, we have measured genome sequence complexity in the ancient phylum Cyanobacteria. To arrive at an appropriate measure of genome sequence complexity, we have chosen metrics that do not decipher biological functionality but that show strong phylogenetic signal. Using a ridge regression of those metrics against root-to-tip distance, we detected positive trends towards higher complexity in three of them. Lastly, we applied three standard tests to detect whether progressive evolution is passive or driven: the minimum, ancestor–descendant, and sub-clade tests. These results provide evidence for driven progressive evolution at the genome level in the phylum Cyanobacteria.
Treatises on biological evolution reflect a conflict between the relative roles played by contingency and necessity1. An important tradition in evolutionary biology, based on a large amount of empirical evidence, considers that contingency marks the dynamics of evolution in a way that makes it unpredictable1,2,3. The trend towards the appearance of increasing complexity falls within the frame of contingent evolution insofar as it is inevitable given that, passively, we can expect that sooner or later more complex entities will evolve from the original, simpler entities. This is what Gould2 referred to as ‘the passive tendency towards complexity marked by the minimum initial complexity wall’.
A central task for those studying complexity is to provide an accurate measure to ascertain whether there is a trend of increasing complexity3,4. In fact, a necessary condition for progressive and open-ended evolution is the existence of a metric that increases with the evolutionary age of the corresponding organisms4,5. We suggest that such metrics can be found in genomes6. Genomes probably provide the best record of the biological history of a species. Not only do they enable us to reconstruct phylogenetic relationships, but they also contain information gained from continuous biotic and environmental interactions over time6,7,8. Standard genome parameters such as genome size, number of genes, and gene components (i.e., introns, exons) are insufficient indicators of genome complexity because they only partially capture the historical information encoded in a genome9,10. We suggest here that metrics unassociated with biological functions may improve our measurements of genome sequence complexity. However, some metrics that have been previously applied to genomes are too broad, and not all of them accurately capture the information accumulated by a genome during its evolutionary history6,11. For example, algorithmic complexity12,13 is inconveniently maximized for randomness, and the effective complexity of Gell-Mann and Lloyd14 is recommended for collections or ensembles of sequences, but in several cases, such as that of genome sequences, it is not clear how to define an appropriate ensemble. Likewise, metrics based on mutual information or statistical dependence15,16 also quantify the complexity of sequence ensembles rather than the complexity of a single sequence.
Here we consider six metrics that are calculated on individual genomes. The first four metrics are based on the Sequence Compositional Complexity (SCC), derived either from the four-symbol DNA sequence or from the binary sequences resulting from grouping the four nucleotides into S(C,G) versus W(A,T), R(A,G) versus Y(T,C), or K(A,C) versus M(T,G), thus obtaining the SCC, SCCSW, SCCRY and SCCKM metrics, respectively17. These four metrics increase with the number of parts (i.e., compositional domains) found in a genome sequence by a segmentation algorithm, as well as with the length of those parts and the compositional differences among them. These metrics parallel the concept of ‘pure complexity’ of McShea18 and McShea and Brandon3, where complexity is more closely related to the number of part types of an individual than to the number of functions.
The fifth metric we used is the Biobit (BB), a metric based on the difference between the maximum entropy for a k-mer of a random genome of the same length as the genome under consideration and the entropy of that genome for such a k-mer19. Lastly, we used the Genomic Signature (GS), also a k-mer-based metric, which is the value corresponding to the k-mer that maximizes the difference between observed and expected equi-frequent classes of mers. GS is based on the relative abundances of short oligonucleotides20 and chaos game representation applied to genomes21,22.
We tested the above-mentioned metrics by analyzing the genome evolution of an ancient and diverse group of organisms: the phylum Cyanobacteria. These microorganisms played a fundamental role in the evolution of life on Earth. The fossil record shows that they were present 2.0 billion years ago (Bya) and molecular clock analyses indicate that the phylum originated over 2.5 Bya23,24. By releasing oxygen through photosynthesis, Cyanobacteria caused the Great Oxidation Event about 2.3 Bya and changed the history of life on Earth25. The oxidation of the environment allowed the evolution of complex multicellular life forms26, leading to the origin of eukaryotic crown groups including plants and animals27. As is well known, Cyanobacteria were also the progenitors of plastids through symbiosis with ancient eukaryotes28.
Cyanobacteria are morphologically diverse. The group was traditionally classified into five subsections according to several biological criteria29,30. Unicellular cyanobacteria are classified in subsections I and II, depending on their mode of reproduction. Those from subsection I (Chroococcales) divide only by binary fission, while those from subsection II (Pleurocapsales) are also capable of reproducing by multiple fission, giving rise to small cells called baeocytes. Filamentous cyanobacteria are classified into subsections III, IV, and V. Those from subsection III (Oscillatoriales) are composed only of vegetative cells that reproduce by intercalary division. Cyanobacteria from subsections IV and V are capable of cell differentiation, producing trichomes composed of vegetative cells and heterocysts for nitrogen fixation. In addition, some members also produce hormogonia for dispersal and akinetes for dormancy. Members of subsection IV (Nostocales) always divide in a plane at right angles to the long axis of the trichome, while those from subsection V (Stigonematales) may also divide in planes parallel to the long axis of the trichome.
Of the above subsections of Cyanobacteria, only Stigonematales are monophyletic24,31. More recent classification schemes using phylogenetic analysis of 31 conserved protein sequences divide Cyanobacteria into nine different groups32. These include Gloeobacterales, Synechococcales, Oscillatoriales, Chroococcales, Pleurocapsales, Spirulinales, Rubidibacter/Halothece, Chroococcidiopsidales, and Nostocales. Of these groups, Chroococcales, Oscillatoriales, and Synechococcales are not monophyletic. This classification scheme attempts to reconcile phylogenetic history with several aspects of morphology and cytology. Other phylogenetic analyses based on 31 concatenated conserved proteins divide cyanobacteria into seven groups33. These groups are named from A to G (groups B and C are further subdivided into B1, B2 and B3 and C1, C2 and C3) and all of them are monophyletic. Furthermore, taxon addition and subtraction analyses on a concatenated dataset of 137 conserved proteins and two rRNA sequences allowed the identification of long-branch attraction artifacts34. The resulting tree was used to classify cyanobacteria into six monophyletic groups, corresponding to some of the A to G lineages. Finally, phylogenetic analysis of a concatenated dataset of 43 proteins from 208 taxa recovered all A–G groups and revealed the existence of novel monophyletic lineages located at the base of the tree35. Clearly, the taxonomy and evolution of Cyanobacteria are active areas of research. The classification of Cyanobacteria is likely to change in the near future as more lineages are sequenced and analyzed.
In this study, we test whether there is a statistically and phylogenetically supported driven tendency towards increasing genome sequence complexity in the evolution of Cyanobacteria as reflected by some of their metrics of genomic complexity.
Results
Complexity measures throughout Cyanobacteria phylogeny
The values of the four SCCs, BB and GS metrics as well as three standard genome parameters (Genome size, %GC and No. of genes) (see “Methods” section) for 91 Cyanobacterial genomes are reported in Table S1. Figure 1 shows a maximum likelihood phylogeny of Cyanobacteria whose branch lengths are proportional to the number of amino acid substitutions (see “Methods” section). The phylogeny is in general agreement with the previous analyses24,31,32.
Phylogenetic signal
All metrics of complexity and genome parameters showed a significant phylogenetic signal (Table 1), indicating that in all cases the genomes of closely related cyanobacterial species tend to resemble each other more than two randomly selected genomes do (Fig. 1). However, the magnitude of the phylogenetic signal differs across metrics and parameters, with %GC and GS showing the highest values, which probably reflects different forces shaping the evolution of these metrics and parameters (see “Discussion” section).
Phylogenetic correlations
To gain a better understanding of the metrics, after correcting for phylogenetic signal, we evaluated how they correlate with each other and, particularly, with the genomic parameters (Table 2). It is worth noting that two metrics, SCC and SCCRY, correlate with other metrics and parameters (six correlations each), accounting for 43% of all significant correlations.
Ridge regression of complexity metrics versus distance from the root
Using ridge regression of genome complexity metrics and genome parameters versus distance from the root, we studied whether there is evidence of evolutionary trends and detected distinct patterns (Fig. 2). Of the complexity metrics, four out of six show a statistically significant positive trend (SCC, SCCSW, SCCRY and GS), indicating that complexity, as determined by our proposed criteria, has increased with the distance from the root. In contrast, SCCKM shows no trend and BB reveals a significant negative trend. Interestingly, genome parameters show no evidence of any evolutionary trend.
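As a rough illustration of what regressing a metric on root-to-tip distance involves, the following minimal sketch fits an ordinary (non-phylogenetic) ridge regression with a single predictor. It is not the procedure used in this study: the function name `ridge_fit`, the penalty value, and the toy numbers are all assumptions made for illustration, and the penalty is applied to the slope only.

```python
def ridge_fit(x, y, lam):
    """Fit y = intercept + slope * x with an L2 penalty `lam` on the slope only."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered sums of squares and cross-products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # The penalty shrinks the slope towards zero; lam = 0 is ordinary least squares.
    slope = sxy / (sxx + lam)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical root-to-tip distances and metric values (not data from the study).
distances = [0.0, 1.0, 2.0, 3.0]
metric = [1.0, 3.0, 5.0, 7.0]

ols_intercept, ols_slope = ridge_fit(distances, metric, lam=0.0)
shrunk_intercept, shrunk_slope = ridge_fit(distances, metric, lam=5.0)
```

A positive fitted slope corresponds to a metric that increases with distance from the root; a larger penalty trades some bias in the slope for lower variance.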
Driven trends in Cyanobacteria
A critical question regarding trends is whether they are passive or driven. To evaluate this, we applied three types of tests (see “Methods” section for a detailed description): the minimum test (with three types of proofs), the ancestor–descendant test, and the sub-clade test (with two types of proofs).
Regarding the first proof of the minimum test (i.e., skewness), we observed that all metrics and parameters (except SCC and %GC) exhibit significant positive skewness for the entire phylum (D’Agostino–Pearson test, n = 91; Table 3), which supports a left wall for these metrics and parameters that is compatible with either a passive or a driven trend. Nevertheless, if the minimum value of a given metric or parameter increases with evolutionary time, then the trend is probably driven. To test this, we took as the minimum the estimated value of the most basal clade, xb, for each metric/parameter (Fig. 1). However, as can be observed (Fig. 3), for every metric/parameter there are values both lower and higher than the one corresponding to the basal clade. It is therefore necessary to study in greater depth the distribution of lower and higher values with respect to xb in order to find evidence of a driven trend. To this end, we carried out the second proof of the minimum test, where we measured |xd − xb|, the absolute difference between each descendant clade and the most basal clade. Table 3 shows whether there is a statistical difference (Chi-square test) between the clades (179 in total) that are higher and those that are lower than the basal clade, xb. As can be observed, all the tests are significant, with four metrics (SCC, SCCRY, SCCKM and BB) and two parameters (Genome size and No. of genes) showing a significant increase in the metric/parameter with respect to the corresponding basal values. In contrast, two metrics (SCCSW and GS) and one parameter (%GC) present a significant decrease. Finally, employing a Student’s t-test (third proof of the minimum test), we tested whether, for a given metric or parameter, the average absolute difference (|xd − xb|) differs statistically between the clades with values higher than xb and those with values lower than xb. Three metrics (SCCSW, SCCRY and SCCKM) show a statistical difference in favor of a higher value than xb, whereas one metric (GS) and the three parameters (Genome size, %GC, and No. of genes) present a statistical difference in favor of a lower value than xb (Table 3).
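The second proof of the minimum test can be pictured as a comparison of two counts. The sketch below is one possible reading, not the authors' code: it assumes that the Chi-square test compares the number of clades above and below xb against an equal split (the text does not state the expected counts), it ignores ties, and the function name and the toy values are hypothetical.

```python
import math


def higher_lower_chi_square(clade_values, basal_value):
    """Chi-square goodness-of-fit of higher/lower counts against a 50/50 split."""
    higher = sum(1 for v in clade_values if v > basal_value)
    lower = sum(1 for v in clade_values if v < basal_value)
    total = higher + lower  # clades equal to the basal value are ignored
    if total == 0:
        raise ValueError("no clade differs from the basal value")
    expected = total / 2  # assumption: equal expected counts under the null
    chi2 = ((higher - expected) ** 2 + (lower - expected) ** 2) / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return higher, lower, chi2, p_value


# Hypothetical clade estimates and basal value (not data from the study).
values = [5, 6, 7, 8, 9, 10, 11, 12, 1, 2]
n_higher, n_lower, chi2, p_value = higher_lower_chi_square(values, basal_value=3)
```

An excess of clades above xb yields a large statistic; the direction of the excess has to be read from the two counts, since the statistic itself is unsigned.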
Regarding the ancestor–descendant test (see “Methods” section for a detailed description), we tabulated the derived clades for all possible nodes and whether they present a higher, lower, or equal value of the metric/parameter than the ancestral clade corresponding to each node. In order to avoid bias due to proximity to the putative left wall, McShea36 recommended applying the test only to those clades where both ancestor and descendant are higher than the average value of the metric/parameter. As can be observed (Fisher exact test, Table 4), this stringent test shows that the SCC and GS metrics and the three genome parameters are in favor of a driven trend. The ancestor–descendant proof can be visualized on the phylogeny of the Cyanobacteria for each metric/parameter by estimating the values of internal nodes using a maximum likelihood function and interpolating the value along each edge (see “Methods” section). Figure 4 shows the mapping corresponding to the SCC metric, where the driven positive trend of this metric can be clearly appreciated (see Fig. S1 for the mapping of the rest of the metrics/parameters).
Finally, the last test applied is the sub-clade test, with its two associated proofs. In the first proof, we tested whether the trend observed at the phylum level is also observed in four selected monophyletic clades; in the second, we applied the skewness test both to the entire phylum (results are given in Table 3) and to the chosen sub-clades. We chose four monophyletic clades formed by clusters 97, 132, 162, and 172, which harbor 18, 22, 11, and 8 species, respectively (shown in four colors in Fig. 1 and Fig. S2). Clade 97 is formed by Synechococcus, Prochlorococcus, and Cyanobium; clade 132 corresponds to Nostocales (subsections IV and V); clade 162 contains Cyanothece and Microcystis; and clade 172 contains, among others, Geminocystis and Cyanobacterium. The most relevant result was that some metrics of genome complexity show statistically significant positive trends (SCC, SCCRY, and GS) and some others show negative trends (SCCSW and SCCKM), whereas the genome parameters do not show any positive trends (Table S2; Fig. S3). Thus, we keep SCC, SCCRY and GS as the metrics showing positive trends at both levels of phylogenetic resolution.
Regarding the second proof for the sub-clade test, we have examined if the monophyletic sub-clade drawn from the right tail of the entire distribution should have a statistically significant average higher value than the one corresponding to the entire phylum. Regarding the skewness of the phylum (Table 3), we observe that all metrics (except SCC and %GC) exhibit significant and positive skewness. However, this test of skewness cannot be applied to the four chosen monophyletic sub-clades either because (a) the average value (median) of a given metric/parameter for each sub-clade was lower than the median of the phylum (16 cases out of 36) or, (b) there was no statistical evidence (the remaining 20 cases) of a higher median (Mood’s median test) of a given metric/parameter for each sub-clade than the median of the entire phylum (see Table S3).
In summary, the overall results obtained in relation to the evidence found for a trend in a given metric or parameter, i.e., the phylogenetic signal, the number of significant correlations against the rest of metrics/parameters, as well as whether the trend is driven or not (Table 5) show that SCC, SCCRY and to a lower extent GS present the highest scores, and can thus be considered metrics evidencing progressive evolution of Cyanobacteria.
Discussion
Genomes probably provide the best record of the biological history of species. Not only do they enable us to reconstruct their phylogenetic relationships but they also contain information gained from their continuous biotic and environmental interactions over time6,8. This information is an elusive but crucial component of the genome, whose study as a whole deserves deeper attention because it holds clues to answer many biological questions, particularly those of an evolutionary nature.
The genome has distinct layers of information encoded in DNA sequences10,37. The best known are those involved in biological function, such as the typical genome division into coding and non-coding parts or the differential conservation shown by distinct codon positions due to the differential evolutionary constraints acting within genes38,39,40. In the present study, we intend to capture or approximate the genome information held in these layers using certain metrics (collectively named ‘genome complexity metrics’) to determine whether they show phylogenetic signals and indicate some kind of evolutionary trend. To do so, we use a group of organisms with a long phylogenetic history: the phylum Cyanobacteria. SCC accounts for the global compositional complexity of a DNA sequence encoded by the four nucleotides (A, T, C, and G) and shares similarity with McShea’s18 operational definition of biological complexity, or the degree to which the parts of a morphological structure differ from each other. SCCSW may account for the complexity due to the partition of the genome into GC-rich and GC-poor segments (e.g., the isochores), which are known to be associated with many functionally relevant properties such as gene density, gene length, retrotransposon density, or recombination frequency41,42,43,44,45,46. Thus, SCCSW might capture the genome information gained throughout evolution by the selective forces acting on these important functional elements. On the other hand, SCCRY accounts for the complexity due to the partition of the genome into segments of different purine/pyrimidine richness. Such strand asymmetries are less directly related to biological function, but this alphabet has been useful to uncover long-range correlations and analyze the evolution of fractal structure in the genomes47,48,49. Recently, a connection has been found between strand symmetry and the repetitive action of transposable elements during evolution37 (see also Koonin50 and his concept of ‘fuzzy meaning’ of sequences). The partition given by SCCKM has not been associated with any biological function. Finally, GS and BB explore the maximum deviation for a given k-mer between a real and a random genome. GS directly compares the observed distribution of k-mer classes of a real genome with that corresponding to a random one. BB, in turn, by calculating the entropy differences between both groups, measures the relative entropic and anti-entropic fractions of a real genome19.
From a population genetics perspective, cyanobacteria can be considered prototypical bacterial species whose populations are evolving under high effective population sizes51, with mutation rates intermediate between those of RNA viruses (higher mutation rate) and lower or higher eukaryotes (lower mutation rates)52. Therefore, natural selection is expected to play a major role in the evolution of these organisms. Irrespective of whether mutations (or any source of genetic novelty) are deleterious or beneficial, their fate will be dictated by the deterministic action of purifying or positive selection, respectively53,54. This observation is highly pertinent when it comes to appropriately interpreting the phylogenetic signals observed in the complexity metrics and genome parameters following the in silico evolutionary processes described by Revell et al.55. Considering, therefore, that selection is a key force in the evolution of Cyanobacteria, most of the K-values estimated for the metrics may reflect the action of purifying or stabilizing selection, particularly those that are below 1 (all metrics and parameters, except GS and %GC). K for GS is 1, which could be interpreted either as a random drift effect or, more convincingly for this type of organism, as fluctuating selection with a relatively high rate of movement of the optimum55. Finally, K associated with %GC is much higher than 1, which can also be interpreted as the result of an evolutionary process with heterogeneous peak shifts.
Importantly, our study of the evolutionary trends in Cyanobacteria by means of ridge regression found clear differences between metrics of complexity and genome parameters. Four metrics (SCC, SCCRY, SCCSW, and GS) indicate changes toward higher complexity in more evolved clades (long-branch distance with respect to the root of the tree), while SCCKM does not show any signs of a trend and BB shows a negative trend. However, the genome parameters show no evidence of any trend (Fig. 2). These results are reinforced when comparatively analyzing trends between metrics and parameters at a lower phylogenetic resolution (i.e. in monophyletic subclades, Tables S2 and S3 and Fig. S3). Although metrics used in this work capture different aspects of the evolution of genome sequence complexity in Cyanobacteria (positive trends in SCC, SCCRY, and GS versus negative trends in SCCSW and SCCKM), the genome parameters never present any positive trends (Fig. S2 and Table S2). In that respect, although some metrics capture increasing sequence complexity, genome parameters do not.
It is worth noting that the sequence complexity metrics and the associated positive driven trends have captured something different from functional comparative genomics in Cyanobacteria. One interesting case is the comparison between those cyanobacterial species that are multicellular and develop heterocysts or akinetes and those that do not develop such traits. We tested this by recording which of the species in our data set have heterocysts versus no heterocysts and akinetes versus no akinetes (Table S1). The presence of heterocysts or akinetes could be taken as evidence of higher complexity relative to their absence. For each of the metrics and genome parameters, we tested whether the heterocyst or akinete groups showed a statistically significant higher value than the non-heterocyst or non-akinete groups, respectively (Table S1). No statistically significant differences were found for any metric (except for SCCKM between akinete and non-akinete groups; Mann–Whitney test, P < 0.05). However, when comparing the average values of the genome parameters (genome size, gene number and %GC), we repeatedly observed that species with heterocysts or akinetes showed a statistically significant higher genome size, higher gene number, and lower %GC (Mann–Whitney test, P < 0.05). From a functional point of view, the standard genome parameters differentiate multicellular cyanobacteria from the rest, which is not the case for the metrics, particularly those showing a consistent positive driven trend (i.e., SCC, GS). These metrics are capturing something different in the genomic sequence. Take, for instance, the three species (see Fig. 4) that present the highest SCC values: Cyanobacterium stanieri, C. aponinum, and Trichodesmium erythraeum. They present a larger distance from the root, even larger than that of the SynPro clade (see Fig. 1). None of these three species, nor any member of the SynPro clade, has heterocysts or akinetes, and all appear to present a larger distance from the root than those species harboring these traits. It is clear, then, that the positive trend we have detected reflects something different. We speculate that the species showing a larger distance from the root may be more evolvable than those that present a shorter distance to it.
It is interesting, on the other hand, to point out the process of selection and genome streamlining of Synechococcus and Prochlorococcus in clade 97 (SynPro clade), giving rise to more evolved, shorter genomes, which are AT-rich and show a lower number of genes than the rest of Cyanobacteria (Table S1). As can be observed, there are statistically significant negative trends in the three genome parameters but also positive trends in the SCC (Fig. 4) and SCCRY metrics (Fig. S2 and Table S2). Therefore, genome reduction in this clade does not imply loss of genome complexity; on the contrary, our study shows that this clade also has a highly complex genome sequence56. It is also interesting to compare this specialist clade with generalist ones, like Microcystis sp. (Figs. 1, 4). The genus Microcystis appears to be older than the SynPro clade. Both, however, have neither heterocysts nor akinetes (as examples of complex functionality; i.e., multicellularity) but, in general, show higher SCC or GS values than the multicellular species. The higher SCC values that we observed in the SynPro clade indicate a higher intra-genome compositional diversity in these species (i.e., a higher number of compositional segments and/or higher compositional differences among them). In the same way that a high rate of genetic variability promotes higher evolvability57, both groups may be considered to have developed a higher capacity to evolve, which is captured by some of the metrics that we have studied. Genome reduction and specialization in the SynPro clade, as already stated, is therefore not equivalent to the loss of genome sequence complexity.
In summary, considering that selection is a major driver in the evolution of Cyanobacteria, the observed positive trends towards increasing sequence complexity captured by the SCC, SCCRY, and GS metrics cannot be explained by the passive tendency to increase proposed by Gould2. The three tests carried out to determine whether positive trends are passive or driven show that the positive trend is driven and is likely due to the action of natural selection, although we have not tested for selection directly. Several of the metrics gathered in this study confirm this trend in the evolutionary history of Cyanobacteria.
Methods
Phylogenetic analysis
Ninety-one complete and nearly complete cyanobacterial genomes were downloaded from GenBank and annotated using Prokka58 (Table S1). To infer a phylogenomic tree, we first identified the set of homologous gene families conserved among Cyanobacteria (core genome) using the get_homologues.pl pipeline59. For this, we used the BDBH and OMCL methodologies within get_homologues.pl with the following parameters: a threshold e-value ≤ \(10^{-10}\) for BLAST searches; a minimum percent amino acid identity > 30% between query and subject sequences; and, for OMCL, an inflation parameter (I) of 2.0. The consensus core genome was inferred as the intersection of the BDBH and OMCL gene families. To select high-quality phylogenetic markers from the core genome (i.e., those gene families not showing recombination and/or horizontal gene transfer), we used the software package get_phylomarkers60. By this procedure, we obtained an alignment of 96 top markers comprising 36,760 amino acids. Clustal-Omega was used to align the protein sequences61. The multiple alignment was curated by eliminating uninformative sites and misaligned positions with Gblocks62. Finally, a maximum likelihood phylogeny was reconstructed using PhyML63 with the LG model + I (estimation of invariant sites) + G (gamma distribution), as selected by ProtTest364. The root was located on the branch connecting both Gloeobacter spp. to the rest of the cyanobacteria. This location of the root is based on cytological evidence (for instance, Gloeobacter spp. lack thylakoids) as well as phylogenetic and molecular clock analyses32,33,34,65.
Genome sequence complexity metrics
SCC
Sequence Compositional Complexity of genomes was calculated by using a two-step process. We first obtained the non-overlapping compositional domains comprising the genome sequence, and then applied an entropic complexity measurement able to account for the heterogeneity of such compositional domains. The compositional domains of a given genome sequence are obtained through a segmentation algorithm66 that uses the Jensen-Shannon entropic divergence67,68 to split the sequence (and, iteratively, the resulting sub-sequences) into non-overlapping compositional domains which, at a given statistical significance, s, are homogeneous and compositionally different from the neighboring domains. It is worth mentioning that the segmentation algorithm we used, and hence the SCC complexity values derived from it, are invariant to sequence orientation, as Shannon entropy is invariant under symbol interchange.
Note also that the statistical significance level s is the probability that the difference between each pair of adjacent domains is not due to statistical fluctuations. By changing this parameter one can obtain the underlying distribution of segment lengths and nucleotide compositions at different levels of detail69, thus fulfilling one of the key requirements for complexity measures14. Improvements to this segmentation algorithm also allow the segmentation of long-range correlated sequences70. Full details of the segmentation algorithm have been published elsewhere71,72. Implementation details, as well as source code and executable binaries for different operating systems, can be downloaded from: https://github.com/bioinfoUGR/segment and https://github.com/bioinfoUGR/isofinder.
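To make the splitting criterion concrete, the following minimal sketch finds the single cut point that maximizes the Jensen-Shannon divergence between the two resulting sub-sequences. It illustrates one step only and is not the published implementation: the significance test against s and the recursive application to sub-sequences are omitted, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol frequencies in `seq`."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def js_divergence(left, right):
    """Length-weighted Jensen-Shannon divergence between two sub-sequences."""
    n = len(left) + len(right)
    return (shannon_entropy(left + right)
            - (len(left) / n) * shannon_entropy(left)
            - (len(right) / n) * shannon_entropy(right))


def best_split(seq):
    """Return the cut position with the highest divergence, and that divergence."""
    best_pos, best_div = None, -1.0
    for pos in range(1, len(seq)):  # naive scan; recomputes entropies at each cut
        div = js_divergence(seq[:pos], seq[pos:])
        if div > best_div:
            best_pos, best_div = pos, div
    return best_pos, best_div


position, divergence = best_split("AAAAACCCCC")
```

In a full segmentation, the cut would be accepted only if its divergence is significant at level s, and the procedure would then be repeated on each of the two sub-sequences.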
Once a genome sequence was segmented into n compositional domains, we computed SCC as:
\[ SCC = H\left( S \right) - \sum\limits_{i = 1}^{n} \frac{G_{i}}{G} H\left( S_{i} \right) \]
where S denotes the whole genome and G its length, and Gi is the length of the i-th domain, Si. \(H\left( \cdot \right) = - \sum f\log_{2} f\) is the Shannon entropy of the distribution of relative frequencies of symbol occurrences, f, in the corresponding (sub)sequence17. It should be noted that the above expression is the same as that used in the segmentation process, where it is applied to the two tentative new subsequences (n = 2) to be obtained in each step. Thus, the two parts of the SCC computation are based on the same theoretical background.
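As a worked illustration of the second step, the sketch below computes a complexity value from a list of already-segmented domains. It assumes that SCC takes the Jensen-Shannon form described in the text, that is, the entropy of the whole sequence minus the length-weighted entropies of its domains; the function names and the toy domains are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol frequencies in `seq`."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def scc(domains):
    """H(S) minus the sum over domains of (G_i / G) * H(S_i)."""
    whole = "".join(domains)
    total_length = len(whole)
    weighted = sum(len(d) / total_length * shannon_entropy(d) for d in domains)
    return shannon_entropy(whole) - weighted


# Two internally homogeneous but mutually different domains.
two_domains = scc(["AAAA", "CCCC"])
# The same sequence treated as a single domain.
one_domain = scc(["AAAACCCC"])
```

The value is zero when the sequence is treated as a single domain and grows as the domains become more numerous and more different from one another, which matches the behavior described for the four SCC metrics.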
We apply the above two-step procedure to each of the entire four-symbol cyanobacterial genomes, thus obtaining a SCC complexity value for each of them. In addition, we also apply the same procedure to the binary sequences resulting from grouping the four nucleotides into S(C,G) versus W(A,T) or R(A,G) versus Y (T,C), or K(A,C) versus M(T,G), then obtaining SCCSW, SCCRY and SCCKM metrics, respectively. These three additional metrics are partial complexities that provide complementary views of genome complexity to that obtained with the four-symbol sequence71,72.
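The binary recoding that precedes the computation of the three partial complexities can be sketched as a simple symbol mapping. The dictionary below follows the groupings and labels exactly as given in the text; the names `GROUPINGS` and `recode` are hypothetical, and ambiguous bases such as N are not handled.

```python
# Two-letter alphabets, with group labels as given in the text.
GROUPINGS = {
    "SW": {"C": "S", "G": "S", "A": "W", "T": "W"},
    "RY": {"A": "R", "G": "R", "T": "Y", "C": "Y"},
    "KM": {"A": "K", "C": "K", "T": "M", "G": "M"},
}


def recode(seq, grouping):
    """Map a four-symbol DNA sequence onto one of the binary alphabets."""
    table = GROUPINGS[grouping]
    # A KeyError is raised for any symbol other than A, C, G or T.
    return "".join(table[base] for base in seq.upper())


sw = recode("ACGT", "SW")
ry = recode("ACGT", "RY")
km = recode("ACGT", "KM")
```

Each recoded sequence would then go through the same segmentation and entropy steps as the four-symbol sequence.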
We provided additional details on the segmentation carried out in Cyanobacteria by using the UCSC Genome Browser. Genome maps of the compositional segments obtained for each Cyanobacteria genome analyzed in this paper can be found at the following link: https://sites.google.com/go.ugr.es/oliver/databases/dna-compositional-segments/cyanobacteria?authuser=0. Note that, once at UCSC Genome Browser, the user can obtain a complete list of segment coordinates for each genome in plain text by clicking on Tools: Table Browser.
BB. Biobit (BB) is a measure of genome complexity, defined as a generalized logistic map that balances the entropic and anti-entropic components of genomes and appears to be related to their evolutionary dynamics. BB compares a genome of length G with random genomes of the same length to establish a measure of its complexity. More precisely, BB is derived from the comparison between the entropy of a random genome, computed for the k-mer length that yields the maximum entropy, and the corresponding entropy of the real genome of the same length19. The authors demonstrated that the entropy of a real genome of length G, E2L(G), takes a value between the minimum, L(G) = log4(G), and the maximum, 2L(G) = 2log4(G). They also define two additional components of a real genome, which they call entropic (E(G)) and anti-entropic (A(G)), such that A(G) + E(G) = L(G). These components are given by E(G) = E2L(G) − L(G) and A(G) = 2L(G) − E2L(G), respectively. The BB of a genome (BB(G)) is a non-linear combination of the entropic and anti-entropic components given by:
where \(\frac{A\left( G \right)}{{L\left( G \right)}}\) is the anti-entropic fraction of the genome and \( 1 - 2\frac{A\left( G \right)}{{L\left( G \right)}}\) is the corresponding entropic fraction. Both components vary between 0 and 1. Implementation details, as well as source codes, can be downloaded from https://www.uv.es/~varnau/adn/Biobit32B.c.
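The entropic and anti-entropic components that enter BB can be sketched as follows. This is an illustration under stated assumptions, not the Biobit32B.c implementation: the k-mer entropy is measured in bits, the word length is taken as k = 2L(G) rounded to the nearest integer, and the entropic component is taken as E(G) = E2L(G) − L(G), which is what the constraint A(G) + E(G) = L(G) implies. The function name is hypothetical, and the sketch stops at the components without evaluating BB itself.

```python
from collections import Counter
from math import log2


def entropic_components(genome):
    """Return L(G), k, E2L(G), A(G) and E(G) for one sequence."""
    G = len(genome)
    L = log2(G) / 2                  # L(G) = log4(G)
    k = max(1, round(2 * L))         # assumed rounding of k = 2 L(G)
    total = G - k + 1                # number of overlapping k-mers
    counts = Counter(genome[i:i + k] for i in range(total))
    e2l = -sum(c / total * log2(c / total) for c in counts.values())
    return {
        "L": L,
        "k": k,
        "E2L": e2l,
        "A": 2 * L - e2l,            # anti-entropic component
        "E": e2l - L,                # entropic component
    }
```

On very short toy sequences the lower bound L(G) reported for real genomes need not hold, so E(G) can come out negative there; the identity A(G) + E(G) = L(G) holds by construction.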
GS
The Chaos Game Representation (CGR)21,22 is an image derived from a genome in which each point corresponds to a k-mer at a given level of analysis. If the genome sequence is a random collection of bases, the CGR will be a uniformly filled square image. On the basis of the CGR built for a particular genome, we define a corresponding Genomic Signature (GS), a numerical value obtained for a particular k-mer level by comparing, point by point, the CGRs of a real genome and a random genome of the same length. To make them comparable, the pixel values of the images are normalized. As stated, the size of the images depends on the k-mer used: for a given k, there are \(4^{k}\) different words and the corresponding image has \(4^{k}\) pixels. Building a table of the frequency of each k-mer minus its expected frequency in a random genome is equivalent to taking the difference between the CGR images of a real and a random genome. If G is the size of the genome to analyze, the expected value (EV) for a given k-mer is EV = (G − k + 1)/\(4^{k}\). This value is used to normalize to 1 the k-mer values obtained for each of the genomes analyzed. We then define the GS as:
where Pi is the relative frequency of the k-mer i. Implementation details, as well as source codes, can be downloaded from https://www.uv.es/~varnau/adn/word_chaos_GS.c.
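The normalization step that precedes the GS can be sketched in Python. The example only builds the k-mer table normalized by the expected value EV = (G − k + 1)/4^k; it does not evaluate the GS itself, and the function name is hypothetical. Words containing symbols other than A, C, G and T are simply ignored, which is an assumption of this sketch.

```python
from collections import Counter
from itertools import product


def normalized_kmer_frequencies(genome, k):
    """Observed count of every k-mer divided by its expected count EV."""
    G = len(genome)
    expected = (G - k + 1) / 4 ** k          # EV for a random genome
    counts = Counter(genome[i:i + k] for i in range(G - k + 1))
    words = ("".join(w) for w in product("ACGT", repeat=k))
    return {w: counts.get(w, 0) / expected for w in words}
```

With this normalization a value of 1 means that the word occurs exactly as often as expected in a random sequence of the same length, so the table averages to 1 whenever the genome contains only the four standard bases.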
Standard genome parameters
Finally, we have also included three standard genome parameters: genome size, %GC and number of genes.
Phylogenetic signal
We used the phylogenetic tree of Cyanobacteria to test for phylogenetic signal in the genome complexity metrics and genome parameters using the K statistic of Blomberg et al.73, as implemented in the picante package for R74. K ranges from 0 to ∞. K values significantly higher than zero indicate the presence of a phylogenetic signal; in other words, closely related species resemble each other in the studied trait more than expected by chance. K = 1 is the value expected under Brownian evolution.
Phylogenetic correlations
We examined the correlation between genome parameters and metrics of genome complexity after correcting for phylogenetic signal. The Pearson r value between two variables was computed from their phylogenetic trait variance–covariance matrix, and its significance was tested against a t-distribution with n − 2 degrees of freedom. We used the R code provided by Liam Revell to perform Pearson correlation with phylogenetic data (https://blog.phytools.org/2017/08/pearson-correlation-with-phylogenetic.html). The P value obtained with this procedure is the same as that provided by a phylogenetic generalized least squares model. As we ran multiple phylogenetic correlations, we corrected the P values using the false discovery rate.
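The multiple-testing correction can be illustrated with a short Python sketch. The text does not name the specific false discovery rate procedure, so the Benjamini–Hochberg step-up adjustment used here is an assumption; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A correlation would then be reported as significant when its adjusted P value falls below the chosen threshold.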
Evolutionary trends
We tested the existence of an evolutionary trend in the genomic complexity measures and genome parameters by fitting a ridge regression of each of these genomic values against tip-to-root or node-to-root distances. The search.trend function in the RRphylo package75 performs a phylogenetic ridge regression between the trait values of the tips/nodes of a phylogenetic tree and their distance to the root. The values of traits (in our case, genomic complexity and genome parameters) on internal nodes of the tree were reconstructed by the RRphylo package by applying a ridge regression for continuous ancestral character estimation, as explained in76. Similar to other ancestral reconstruction methods, ancestral states are calculated as a weighted average of the tip values while taking into account the phylogenetic correlation structure of the data. However, ridge regression accounts for varying rates of evolution in different regions of the tree and estimates them with ancestral characters simultaneously. The significance of the ridge regression slope was tested against 10,000 slopes obtained after simulating a simple (i.e., no-trend) Brownian evolution of the trait in our phylogenetic tree75.
Continuous character mapping
We used two functions (contMap and fastAnc) from the phytools R package77. The contMap R function allows plotting a tree with a mapped continuous character, such as any of our complexity measures. Mapping is accomplished by estimating states at internal nodes using maximum likelihood with the function fastAnc and interpolating the states along each edge using Equation 2 of78.
Testing trends: passive or driven
To unravel whether the positive trends are passive or driven we have applied three types of tests, called the minimum, the ancestor–descendant and the subclade test, respectively3,36. These tests are well known in paleontology and evolutionary biology and, to the best of our knowledge, this is the first time they have been applied to genome evolutionary analyses. To gain a better understanding of the positive trends we have also applied those tests for comparative purposes to the metrics and genome parameters that do not show evidence of such a positive evolutionary trend.
Minimum test
Regarding the minimum test, we applied three types of proofs. The first evaluates whether a positive skewness of the entire phylum supports the existence of a left wall. If the minimum value of a given metric or parameter delimiting the left wall increases with evolutionary time, the trend is probably driven. To evaluate this, we took as the minimum the estimated value of the most basal clade, xb, for each metric/parameter (Fig. 1). In the second proof, we measured |xd − xb|, the absolute difference between each descendant clade and the most basal clade, to determine whether there is a statistical difference between the clades with values higher than xb and those with values lower than xb. Finally, the third proof examines whether the average absolute difference (|xd − xb|) of a given metric or parameter differs statistically between clades with values higher and lower than xb.
The ancestor–descendant test
According to Gould2, the ancestor–descendant test is the most appropriate one for discovering whether positive trends are passive or driven. McShea36 indicates that in a passive system increases and decreases should be equally frequent, whereas in a driven trend increases should occur more often. To test this, we tabulated, for every node, whether each derived clade presents a higher, lower, or equal value of the metric/parameter than the ancestral clade corresponding to that node. To avoid bias due to proximity to the putative left wall, McShea36 recommends applying the test only to those clades where both ancestor and descendant are higher than the average value of the metric/parameter.
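The tabulation of increases and decreases can be sketched in Python. The text does not state which statistic was used to compare the two counts, so the exact two-sided sign (binomial) test below is an assumption; the function name and the `floor` argument, which keeps only pairs where both values exceed a threshold such as the overall mean, are hypothetical.

```python
from math import comb


def ancestor_descendant_test(pairs, floor=None):
    """Count increases and decreases over (ancestor, descendant) pairs.

    Returns (increases, decreases, p), where p is a two-sided exact
    sign-test P value under equal odds; ties are excluded.
    """
    pairs = list(pairs)
    if floor is not None:
        # Keep only pairs lying away from the putative left wall.
        pairs = [(a, d) for a, d in pairs if a > floor and d > floor]
    increases = sum(d > a for a, d in pairs)
    decreases = sum(d < a for a, d in pairs)
    n = increases + decreases
    k = max(increases, decreases)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return increases, decreases, min(1.0, 2 * tail)
```

A clear excess of increases with a small P value would be read as pointing towards a driven trend, and roughly equal counts towards a passive one.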
The sub-clade test
The final test applied is the sub-clade test. According to McShea18, if the parent distribution is skewed (see histograms of Fig. 3; Table 3) and a sub-clade drawn from the right tail is also skewed, the system is probably driven. For this test, we applied two types of proofs. First, we tested whether the trend observed at the phylum level is also observed in four selected monophyletic clades (colored species in Fig. 1). Second, we applied the skewness test proposed by McShea18 to the entire phylum. For this second proof, we followed the criterion given by McShea36, whereby the monophyletic sub-clade drawn from the right tail of the entire distribution should have an average (median) value that is significantly higher than that of the entire phylum.
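The skewness comparison between a parent distribution and a sub-clade can be illustrated with a small Python sketch. The text does not specify which skewness estimator was used, so the moment-based coefficient g1 = m3 / m2^(3/2) is an assumption, and the function name is hypothetical.

```python
from statistics import fmean


def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data: no asymmetry to measure
    return m3 / m2 ** 1.5
```

Applied to the trait values of the whole phylum and then to those of a sub-clade drawn from the right tail, a positive value in both cases is the pattern described above as suggestive of a driven system.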
Basic statistical analyses and graphs were performed using Origin (OriginLab Corporation, Northampton, MA, USA) and R (R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/).
Data availability
All data generated or analysed during this study are included in this published article (and its Supplementary Information files).
References
Moya, A. The Calculus of Life (Springer, New York, 2014).
Gould, S. J. Full House: The Spread of Excellence from Plato to Darwin (Harmony Books, New York, 1996).
McShea, D. W. & Brandon, R. N. Biology’s First Law (Chicago University Press, Chicago, 2010).
Day, T. Computability, Gödel’s incompleteness theorem, and an inherent limit on the predictability of evolution. J. R. Soc. Interface 9, 624–639 (2012).
Corominas-Murtra, B., Seoane, L. F. & Solé, R. Zipf’s Law, unbounded complexity and open-ended evolution. J. R. Soc. Interface 15, 20180395 (2018).
Adami, C. What is complexity?. BioEssays 24, 1085–1094 (2002).
Adami, C. What is information?. Philos. Trans. R. Soc. A 374, 20150230 (2016).
Krakauer, D. C. Darwinian demons, evolutionary complexity, and information maximization. Chaos 21, 037110 (2011).
Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the Human Genome. Science 326, 289–294 (2009).
Dekker, J. et al. The 4D nucleome project. Nature 549, 219–226 (2017).
Zurek, W. H. (ed.) Complexity, Entropy and the Physics of Information (Addison-Wesley Press, Cambridge, 1990).
Chaitin, G. J. Algorithmic information theory. IBM J. Res. Dev. 21, 350–359 (1977).
Li, M. & Vitányi, P. An Introduction to Kolmogorov Complexity and its Applications (Springer, New York, 2008).
Gell-Mann, M. & Lloyd, S. Information measures, effective complexity, and total information. Complexity 2, 44–52 (1996).
Grassberger, P. Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys. 25, 907–938 (1986).
Adami, C. & Cerf, N. J. Physical complexity of symbolic sequences. Phys. D Nonlinear Phenom. 137, 62–69 (2000).
Román-Roldán, R., Bernaola-Galván, P. & Oliver, J. L. Sequence compositional complexity of DNA through an entropic segmentation method. Phys. Rev. Lett. 80, 1344–1347 (1998).
McShea, D. W. Evolutionary change in the morphological complexity of the mammalian vertebral column. Evolution 47, 730–740 (1993).
Bonnici, V. & Manca, V. Informational laws of genome structures. Sci. Rep. 6, 28840 (2016).
Karlin, S. & Ladunga, I. Comparisons of eukaryotic genomic sequences. Proc. Natl. Acad. Sci. U. S. A. 91, 12832–12836 (1994).
Jeffrey, H. J. Chaos game representation of gene structure. Nucleic Acids Res. 18, 2163–2170 (1990).
Almeida, J. S., Carriço, J. A., Maretzek, A., Noble, P. A. & Fletcher, M. Analysis of genomic sequences by Chaos Game Representation. Bioinformatics 17, 429–437 (2001).
Sergeev, V. N., Gerasimenko, L. M. & Zavarzin, G. A. The Proterozoic history and present state of Cyanobacteria. Microbiology 71, 623–637 (2002).
Schirrmeister, B. E., De Vos, J. M., Antonelli, A. & Bagheri, H. C. Evolution of multicellularity coincided with increased diversification of Cyanobacteria and the Great Oxidation Event. Proc. Natl. Acad. Sci. U. S. A. 110, 1791–1796 (2013).
Bekker, A. et al. Dating the rise of atmospheric oxygen. Nature 427, 117–120 (2004).
Hedges, S. B., Blair, J. E., Venturi, M. L. & Shoe, J. L. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evol. Biol. 4, 2 (2004).
Knoll, A. H. Paleobiological perspectives on early microbial evolution. Cold Spring Harb. Perspect. Biol. 7, 1–17 (2015).
Sagan, L. On the origin of mitosing cells. J. Theor. Biol. 14, 225–274 (1967).
Rippka, R., Deruelles, J. & Waterbury, J. B. Generic assignments, strain histories and properties of pure cultures of Cyanobacteria. J. Gen. Microbiol. 111, 1–61 (1979).
Rippka, R. Recognition and Identification of Cyanobacteria. Methods Enzymol. 167, 28–67 (1988).
Dagan, T. et al. Genomes of Stigonematalean Cyanobacteria (subsection V) and the evolution of oxygenic photosynthesis from prokaryotes to plastids. Genome Biol. Evol. 5, 31–44 (2013).
Komárek, J., Kaštovský, J., Mareš, J. & Johansen, J. R. Taxonomic classification of cyanoprokaryotes (cyanobacterial genera), using a polyphasic approach. Preslia 86, 295–335 (2014).
Shih, P. M. et al. Improving the coverage of the cyanobacterial phylum using diversity-driven genome sequencing. Proc. Natl. Acad. Sci. U. S. A. 110, 1053–1058 (2013).
Uyeda, J. C., Harmon, L. J. & Blank, C. E. A comprehensive study of cyanobacterial morphological and ecological evolutionary dynamics through deep geologic time. PLoS ONE 11, e0162539 (2016).
Will, S. E. et al. Day and night: Metabolic profiles and evolutionary relationships of six axenic non-marine cyanobacteria. Genome Biol. Evol. 11, 270–294 (2019).
McShea, D. W. Mechanisms of large-scale evolutionary trends. Evolution 48, 1747–1763 (1994).
Cristadoro, G., Degli Esposti, M. & Altmann, E. G. The common origin of symmetry and structure in genetic sequences. Sci. Rep. 8, 15817 (2018).
Ikemura, T. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2, 13–34 (1985).
Sueoka, N. Directional mutation pressure, selective constraints, and genetic equilibria. J. Mol. Evol. 34, 95–114 (1992).
Bernardi, G. Structural and Evolutionary Genomics. Natural Selection in Genome Evolution (Elsevier, Amsterdam, 2004).
Bernardi, G. et al. The mosaic genome of warm-blooded vertebrates. Science 228, 953–958 (1985).
Mouchiroud, D., Gautier, C. & Bernardi, G. The compositional distribution of coding sequences and DNA molecules in humans and murids. J. Mol. Evol. 27, 311–320 (1988).
Zoubak, S., Clay, O. & Bernardi, G. The gene distribution of the human genome. Gene 174, 95–102 (1996).
Oliver, J. L., Carpena, P., Hackenberg, M. & Bernaola-Galván, P. IsoFinder: computational prediction of isochores in genome sequences. Nucleic Acids Res. 32(Suppl_2), W287–W292 (2004).
Bernardi, G. Chromosome architecture and genome organization. PLoS ONE 10, e0143739 (2015).
Jabbari, K. & Bernardi, G. An isochore framework underlies chromatin architecture. PLoS ONE 12, e0168023 (2017).
Li, W. & Kaneko, K. DNA correlations. Nature 360, 635–636 (1992).
Peng, C. K. et al. Long-range correlations in nucleotide sequences. Nature 356, 168–170 (1992).
Voss, R. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68, 3805–3808 (1992).
Koonin, E. V. The meaning of biological information. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 374, 20150065 (2016).
Lynch, M. & Conery, J. S. The origins of genome complexity. Science 302, 1401–1404 (2003).
Gago, S., Elena, S. F., Flores, R. & Sanjuán, R. Extremely high mutation rate of a hammerhead viroid. Science 323, 1308 (2009).
Lynch, M. L. The frailty of adaptive hypotheses for the origins of organismal complexity. Proc. Natl. Acad. Sci. USA 104(Suppl 1), 8597–8604 (2007).
Koonin, E. V. Splendor and misery of adaptation, or the importance of neutral null for understanding evolution. BMC Biol. 14, 114 (2016).
Revell, L. J., Harmon, L. J. & Collar, D. C. Phylogenetic signal, evolutionary process, and rate. Syst. Biol. 57, 591–601 (2008).
Batut, B., Knibbe, C., Marais, G. & Daubin, V. Reductive genome evolution at both ends of the bacterial population size spectrum. Nat. Rev. Microbiol. 12, 841–850 (2014).
Payne, J. L. & Wagner, A. The causes of evolvability and their evolution. Nat. Rev. Genet. 20, 24–38 (2019).
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
Contreras-Moreira, B. & Vinuesa, P. GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl. Environ. Microbiol. 79, 7696–7701 (2013).
Vinuesa, P., Ochoa-Sánchez, L. E. & Contreras-Moreira, B. GET_PHYLOMARKERS, a software package to select optimal orthologous clusters for phylogenomics and inferring pan-genome phylogenies, used for a critical geno-taxonomic revision of the genus Stenotrophomonas. Front. Microbiol. 9, 771 (2018).
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).
Talavera, G. & Castresana, J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 56, 564–577 (2007).
Guindon, S. & Gascuel, O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704 (2003).
Darriba, D., Taboada, G. L., Doallo, R. & Posada, D. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27, 1164–1165 (2011).
Di Rienzi, S. C. et al. The human gut and groundwater harbor non-photosynthetic bacteria belonging to a new candidate phylum sibling to Cyanobacteria. eLife 2, e01102. https://doi.org/10.7554/eLife.01102 (2013).
Bernaola-Galván, P., Román-Roldán, R. & Oliver, J. L. Compositional segmentation and long-range fractal correlations in DNA sequences. Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top. 53, 5181–5189 (1996).
Grosse, I. et al. Analysis of symbolic sequences using the Jensen–Shannon divergence. Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top. 65, 041905 (2002).
Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37, 145–151 (1991).
Bernaola-Galván, P. et al. Segmentation of time series with long-range fractal correlations. Eur. Phys. J. B 85, 211 (2012).
Oliver, J. L., Román-Roldán, R., Pérez, J. & Bernaola-Galván, P. SEGMENT: Identifying compositional domains in DNA sequences. Bioinformatics 15, 974–979 (1999).
Bernaola-Galván, P., Oliver, J. L. & Román-Roldán, R. Decomposition of DNA sequence complexity. Phys. Rev. Lett. 83, 3336–3339 (1999).
Bernaola-Galván, P., Oliver, J. L., Carpena, P., Clay, O. & Bernardi, G. Quantifying intrachromosomal GC heterogeneity in prokaryotic genomes. Gene 333, 121–133 (2004).
Blomberg, S. P., Garland, T. & Ives, A. R. Testing for phylogenetic signal in comparative data: behavioral traits are more labile. Evolution 57, 717–745 (2003).
Kembel, S. W. et al. Picante: R tools for integrating phylogenies and ecology. Bioinformatics 26, 1463–1464 (2010).
Castiglione, S. et al. Simultaneous detection of macroevolutionary patterns in phenotypic means and rate of change with and within phylogenetic trees including extinct species. PLoS ONE 14, e0210101 (2019).
Kratsch, C. & McHardy, A. C. RidgeRace: Ridge regression for continuous ancestral character estimation on phylogenetic trees. Bioinformatics 30, 527–533 (2014).
Revell, L. J. phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol. Evol. 3, 217–223 (2012).
Felsenstein, J. Phylogenies and the comparative method. Am. Nat. 125, 1–15 (1985).
Acknowledgements
This project was funded by grants from the Spanish Ministry of Science, Innovation and Universities (formerly the Spanish Ministry of Economy and Competitiveness) to A.M. (Project SAF2015-65878-R), J.L.O. (Project AGL2017-88702-C2-2-R) and A.L. (Project PGC2018-099344-B-I00), and by a grant from Generalitat Valenciana to A.M. (Project Prometeo/2018/A/133), and was co-financed by the European Regional Development Fund (ERDF). This project was also supported by a Fulbright fellowship (Spanish Ministry of Science, Innovation and Universities) to A.M. for a sabbatical leave at Harvard University. The authors thank Fernando Baquero, Mitchell Distin and Guillermo Ponz for critical reading of the manuscript.
Author information
Contributions
A.M., J.L.O., M.V. and L.D. designed research; A.M., J.L.O., M.V., L.D., V.A., P.B., R.dlF., W.D., C.G., F.M.G., A.L., R.L. and R.R. performed research; A.M., J.L.O., M.V., L.D., V.A., P.B., W.D. and R.R. analysed data; A.M., J.L.O., M.V. and L.D. wrote the paper. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Moya, A., Oliver, J.L., Verdú, M. et al. Driven progressive evolution of genome sequence complexity in Cyanobacteria. Sci Rep 10, 19073 (2020). https://doi.org/10.1038/s41598-020-76014-4
DOI: https://doi.org/10.1038/s41598-020-76014-4