Phylogenetic analysis of mutational robustness based on codon usage supports that the standard genetic code does not prefer extreme environments

The mutational robustness of the genetic code is rarely discussed in the context of biological diversity, such as codon usage and related factors, often considered as independent of the actual organism’s proteome. Here we put the living beings back to picture and use distortion as a metric of mutational robustness. Distortion estimates the expected severities of non-synonymous mutations measuring it by amino acid physicochemical properties and weighting for codon usage. Using the biological variance of codon frequencies, we interpret the mutational robustness of the standard genetic code with regards to their corresponding environments and genomic compositions (GC-content). Employing phylogenetic analyses, we show that coding fidelity in physicochemical properties can deteriorate with codon usages adapted to extreme environments and these putative effects are not the artefacts of phylogenetic bias. High temperature environments select for codon usages with decreased mutational robustness of hydrophobic, volumetric, and isoelectric properties. Selection at high saline concentrations also leads to reduced fidelity in polar and isoelectric patterns. These show that the genetic code performs best with mesophilic codon usages, strengthening the view that LUCA or its ancestors preferred lower temperature environments. Taxonomic implications, such as rooting the tree of life, are also discussed.

The origin of translational apparatus and the genetic code is amongst the greatest conundrums of Life 1 . Its fundamental challenge is to uncover the constraints, historical accidents, and evolutionary driving forces that could have shaped the standard codon table. The current views propose the general mechanisms of (1) stereochemical affinity between codons and attributed amino acids (stereochemical theory 2,3 ), (2) coevolution between the biosynthetic paths of amino acids and cognate codons (coevolution theory 4,5 ), and (3) minimization of translation errors (adaptive, physicochemical or error minimization theory 6,7 ) as possible explanations for the overall structure of the standard genetic code. Here, we focus and expand on the third one.
Although these existing hypotheses for the development of the genetic code are still hotly debated (including other theories, see other reviews 8,9 for a recent overview of the field), the scientific community tends to agree on that the code is robust to mutations because its structure reduces the deleterious effects of translational errors. The error minimization theory proposes that the universal genetic code was shaped under selective forces that made the code at least partly optimized for fidelity in the physicochemical properties of mutated amino acids (or such aspects played a role during its development). With a few notable exceptions showing that the code can be locally improved with codon reassignments 10,11 , several studies have pointed out that the overwhelming portion of random alternative genetic code structures have inferior error capacities, leading to the argument Methods Data. For codon usage profiles, we used the Uniprot Reference Proteome database within UniProtKB 35 . A Python script was used on the corresponding mRNAs to calculate nucleobase and codon distributions along with the distortion measures for each organism. This data was cross-referenced with optimal environmental conditions via NCBI Taxonomy ID-s 36 . The environmental data for optimal growth temperature, pH, and salt concentration were obtained from the BacDive database 37 . S16 rRNAs were retrieved from the 16S RefSeq Nucleotide 38 sequence records. Sequences were aligned using MUSCLE 39 . The final dataset contained 64 taxa (8 archaeal and 56 bacterial), representative of the molecular diversity in each domain.
Phylogenetic tree construction. We used Beast v1.10.4 40 for a Bayesian analysis. The S16 rRNA phylogenetic tree was built from the 64-sequence alignment with a GTR 41 model, a gamma law with eight categories and an estimated proportion of invariant sites, using default priors, and an uncorrelated relaxed clock. Chains were run for 50,000,000 generations and samples were collected in each 1000 generations. The analysis in Tracer 42 demonstrated a good mixing of the chains. A burn-in of 5,000,000 samples was discarded, and a maximum clade credibility tree was computed from the remaining samples (Supplementary Data S1 online).
Distortion as a measure of mutational robustness. In order to provide a practical measure for the error-rate of mistranslation and point-mutations, the information theoretic concept of distortion (Eq. 1) 33,34 is used to estimate the average effect (cost per symbol) of mutations given a source distribution of codons and the uncertainty of the code resulting from noise, i.e. the probability of codon c i mutating into c j (see next section about background mutation model). Another important element of distortion is the distortion matrix, which reflects the underlying genetic code and corresponding physicochemical properties. This distortion matrix is essentially identical with matrices widely used in other studies, summarizing the errors of one amino acid mutating into another 6,7 . Distortion matrix with elements d(aa i ,aa j ) specifies the distortion associated with mistaking the encoded symbol aa i (amino acid) in the source (X) and reproducing it as aa j in the reproduced copy (Y). We define d aa i , aa j = 0 if aa i = aa j , that is, c i and c j codes for the same amino acid.
Distinct distortion matrices were defined to provide different physicochemical measures of robustness. We decided to use properties made available by Haig and Hurst 7 . These include polar requirement, hydropathy, molecular volume and isoelectric point, yielding four different measures of code performance, denoted as D Hyd , D Pol , D Vol , D pI (Supplementary Data S2 online).
Background mutation model. The conditional probabilities P(Y = c j | X = c i ) are the result of random mutations appearing in the genome and describe the chance of codon c i mutating into codon c j . In order to approximate these probabilities, a simple background mutation model is required describing the generalized mechanism for spontaneous DNA mutations. However, we must estimate the raw, a priori performance of the www.nature.com/scientificreports/ genetic code without natural selection introducing additional bias. We introduce a simplified model of random amino acid mutations (Eqs.2-4), which essentially invokes Kimura's two parameter model 43 . Here, κ denotes the transition/transversion rate ratio, otherwise known as ti/tv ratio; p ti is the probability of transition, p tv is the probability of transversion, and µ is the mutation rate. The inherent structure of the genetic code defines the probability of which codon i mutates into codon j given that a transition or transversion occurs; these are denoted by terms P(c i → c j | ti) and P(c i → c j | tv), respectively.
Expected proteomic distortions were then calculated for each taxon's codon composition. Since our goal is a comparative analysis between taxa, the effect of µ is unimportant. Our preliminary work showed that although ti/tv ratio has a quantitative impact on distortion, the qualitative outcomes remain robust (Radványi and Kun, submitted). Our calculations concerning the distortion values were restricted to κ = 2.5 (roughly 71% of mutations are transitions), approximating ratios encountered by studies of genome-wide and intronic sequences 30,44 . Such regions ought to represent a more relaxed state of selection against mutations, hence providing a closer estimate of the background ratio of ti/tv prior to selection.

Comparative analysis.
To correct for non-independence because of common ancestry of species, we performed Phylogenetic Generalized Least Squares (PGLS) models on the combined data. The analyses were carried out with the ape 45 , phytools 46 , caper 47 and geiger 48 packages in RStudio v3.5 49 . Response variables D Hyd , D Pol , D Vol and D pI were modelled separately using the available environmental variables as predictors. The genomic content of guanine and cytosine (GC-content) was also used as predictor since preliminary studies have shown its predominant effect on the codon and amino acid composition of proteomes 21,26,27 . Model assumptions were checked; no violations were apparent. Based on Akaike information criterion (AIC) values, we apply a Brownian model; further branch length transformations and correlation structures did not result in generally better fits.

Results
In all four cases of physicochemical distortions, the PGLS resulted in significant models. In the obtained phylogenetic tree, deeper phylogenetic topologies are well supported by posterior values (Fig. 1). The results of the PGLS regressions are shown on Fig. 2, including partial regression lines. The standardized partial coefficients (where variables were Z-transformed prior to analysis) can be found as Supplementary Table S1 online. The highest proportion of explained variances is found in the case of distortion of hydrophobic properties (R 2 = 0.633; F 4,59 = 25.492; p = 2.726 × 10 −12 ). The second highest proportion is explained for the distortion calculated for isoelectric points (R 2 = 0.397; F 4,59 = 9.689; p = 4.311 × 10 −6 ), followed by that of the volumetric distortion (R 2 = 0.361; F 4,59 = 8.314; p = 2.180 × 10 −5 ). The least amount of variance was found in the case of distortion in polar requirements (R 2 = 0.249; F 4,59 = 4.880; p = 1.816 × 10 −3 ).
The effect of GC-content. GC-content has a significant positive effect on hydrophobic distortion (β = 0.237; t = 9.464; p = 1.945 × 10 −13 ). A less dominant, but significant negative effect is encountered in the distortion of isoelectric properties (β = − 0.078; t = − 2.147; p = 0.036). In other words, increasing the GC-content of the coding region results in lowered accuracy for hydrophobic traits, but the maintenance of molecular patterns related to isoelectric points becomes easier.
These effects translate to a general decrease in the expected physicochemical fidelity both in thermophiles and halophiles. In other words, such extremophilic codon usages could decrease the chance of preserving the respective physicochemical patterns with an occurring mutation, diminishing their mutational robustness.
The effect of ambient pH remains less conclusive. Although there is a significant positive effect on distortion in polar requirements (β = 0.006; t = 2.206; p = 0.031), other properties are not shown to be significantly influenced. Evidence of substantial selection on proteins in extreme acidophiles or alkaliphiles is sparse. This may be attributed to the relative invariance of intracellular pH regardless of the ambient environment 50 . We note, however, that our sample provides only a narrow pH range, and bias in codon usage have been recently noted 16 .
Model robustness to ti/tv-ratio. To verify the robustness of these effects, we extended our analysis to a wider range of ti/tv-ratios (κ = 2.5-10; Supplementary Figure S1 online). The quantitative coefficients of the tested factors change asymptotically and remain significant. The signs of the significant coefficients do not change. The only exception is the GC-content where its negative impact on isoelectric properties becomes non-significant www.nature.com/scientificreports/ (κ > 2.5); therefore, only its positive effect on hydrophobic distortion is confirmed. Thus, we may conclude that the effects of environmental selection are robust on a broader range of ti/tv-ratios, and our interpretations concerning the impact of environmental selection, especially the distortive effect at high temperature, remains valid.

Discussion
The optimality of the genetic code should be discussed in the context of different gradients, such as environmental selection or GC-content. Their possible repercussions on codon distribution will fundamentally impact the expected errors made by the code. In order to study this question, we have applied a previously developed minimalistic background mutation model for the generalised mechanism of emerging point-mutations. Then, we calculated distortions for codon distributions encountered in different taxa with known environmental requirements. Distortion measures were based on different physicochemical properties: hydropathy, polar requirement, volume, and isoelectric point. Next, in order to account for phylogenetic non-independence, we performed four distinct Phylogenetic Generalized Least Squares (PGLS) regressions on these physicochemical distortion measures using a 16S rRNA tree (Fig. 1). One of the putative predictors was the GC-composition of the coding region of the genome, based on a www.nature.com/scientificreports/ large number of studies supporting its predominant effect on the amino acid composition of proteomes 26,51 . The other group of predictors included environmental variables, which can select for characteristic codon or amino acid compositions of proteomes: temperature, salt concentration, and pH optima.
The key insight provided in this study is that adaptations to certain extreme environments and GC-bias seem to have drastic effects on the physicochemical fidelity of translation and the severity of incidental mutations, and these putative effects are not the artefacts of phylogenetic bias.
While we have included only a limited number of taxa based on the availability of their proteomes and the environment they live in, they can be considered representative. Any phylogeny is constrained by what we know at the moment about the diversity of life on Earth. Current diversity is only a subset of the diversity that ever existed (which we need to keep in mind when we want to infer past events based on characteristics of current species), and metagenomics has repeatedly demonstrated that we know only a fraction of the current diversity. Metagenome studies discovered a previously unknown diversity of microbes 52,53 . Quite some of the microbial dark matter 54 were first assigned to novel clades distinct from the established great groups of Bacteria and Archaea. Nowadays it seems, that the truly novel clades are fewer 54,55 . A recent catalogue of microbes, incorporating more than 50 thousand new metagenome-assembled genomes, concluded that the majority of deep-branching lineages (lineages that would be represented on the level of phylum) are represented by current genome sequences 56 ; the Candidate Phyla Radiation 57 could be merged into one monophyletic phylum 58 . Consequently, even a limited sample of taxa from all great branches of microbes can be considered representative.

Substitution biases might be caused by aversions of certain physicochemical distortions.
We have shown that high GC-content is expected to increase hydropathic distortion. Only this effect of nucleotide composition is robust to ti/tv-ratio, which would predict an AT-bias for the highest fidelity in hydrophobic attributes, pointing towards a general substitution trend that resembles preliminary observations of AT-biased substitution patterns [30][31][32] . Hydrophobic patterns are regarded as a primary force of protein folding 59-62 , further www.nature.com/scientificreports/ supported by the fact that secondary structures can be described and predicted along these properties 63 . Therefore, A/T-mutations should be more likely to fix due to their lower risk of jeopardizing the structure.

Environmental selection influences the mutational robustness of the genetic code.
With regards to environmental gradients, we show that selection in halophiles and thermophiles results in codon usage profiles generally worse for maintaining physicochemical patterns. Despite the observed adjustments against such mutations 16 the conservation of the required physicochemical properties tends to be more unreliable and the average effects of incidental mutations are expected to be more severe and deleterious. Here, we have demonstrated reduced fidelity in hydrophobic, volumetric, and isoelectric patterns of thermophiles, as well as the vulnerability of halophiles to mutations disturbing polar and isoelectric arrangements in proteins. This is a possible explanation of why these extremophiles possess remarkably low mutation and substitution rates 64-68 : it is a straightforward result of avoiding harsh fitness costs, especially if we consider higher importance of hydrophobic interactions and salt bridges in thermophilic proteins 18,69 , as well as the central role of polar properties in halophiles 17,70 .
We conclude that such low mutation rates and strong selection patterns encountered in extremophiles are caused by the inefficiency of the genetic code, as it seems especially ill-suited for extremophile codon usages with regards to mutational robustness, which also means that evolvability diminishes with the employment of standard genetic code, casting doubts on what role extreme environments could have played at development of the codon mapping.
Implications of the mesophile optimality of the genetic code and codon usage. Along with our estimated phylogeny, the majority of influential phylogenies have also provided an intuitive evidence of an extremophile LUCA, by placing thermophiles as the most basal groups [71][72][73][74] . This observation has long facilitated the somewhat overreaching logic that the cradle of life, including the development of the genetic code, was always associated with "infernal" environments of the Hadean Earth. At the same time, the genetic code is usually considered as a near-optimal, robust mapping that is able to partially maximize the fidelity of translation, since the majority, but not all 10,11 , of alternative codes falls short of such error capacity 6,7 .
Having these facts put together, our study implies a contradiction between these two notions: The fidelity of the genetic code is expected to decrease with higher temperatures. This not only collides with earlier reports supporting the extremophilic nature of the code [75][76][77] , but also points out that claims between its error-minimization and thermophilic origin seem non-compatible: if "(and oh what a big if)" 78 the genetic code evolved in order to be optimal for physicochemical properties, then it is more likely to finish its emergence among milder conditions.
A mesophile optimality of the genetic code could be still compatible with phylogenies placing thermophiles at basal locations near LUCA. The evolution of the genetic code preceded LUCA. A well-thought-out rooting of the tree of life puts the bacterial clade Chloroflexi closest to the root 55 , and it has thermophilic members. Chloroflexi are photosynthetic bacteria, meaning that LUCA was an autotroph. But, the first cell was, by necessity, heterotrophic, i.e. dependent on the environment for organic building blocks; the fully fledged photosynthesis can evolve only later. This means that inferences about LUCA does not help us understand the environment in which the first cell or the organism inventing the standard genetic code thrived.
But there is no need to accept a thermophilic LUCA. Both rRNA and protein sequences indicate that hyperthermophilic features of Bacteria and Archaea are parallel adaptations, while their ancestors could have been mesophilic or only slightly thermophilic [79][80][81] . Furthermore, the idea of a thermophile LUCA comes from accepting the root of the universal tree of life to lay between Bacteria and Archaea. Cavalier-Smith has argued for quite some time, that the root lies within Gram negative bacteria, and Archaea and Eukaryotes (compromising the clade Neomura) are derived from Gram positive bacteria 82,83 . Archaea are the exemplars of extremophiles, but if their extremophilic characteristic is derived 84 , then there is no need for LUCA to be an extremophile. Indeed, among the Chloroflexi, there are mesophilic members, so even that does not contradict the proposition. Our own results presented here also strengthen the view that LUCA was a mesophile 85 . Outlook. Our work supports the expanded view on the optimality of the genetic code by involving codon usage 13,14 . The "mesophilic genetic code" hypothesis demands further research. Psychrophiles and organisms preferring extreme pH conditions could not be sufficiently represented in our current analysis, and albeit linear responses had a good fit in our case, increasing the environmental ranges should lead to more accurate estimations of translational preference. It must be also emphasized that our interpretation of codon usage analysis remains conditional on the assumption of error minimization. It does not weaken the case of other dominant theories of the evolution of the genetic code. The evolution of genetic code is likely to be a result of multiple driving forces and cannot be understood solely by natural selection increasing mutational robustness 86 .
Relying on simple codon usage data have disadvantages. Thermophiles and halophiles possess elevated rates of horizontal gene transfer [87][88][89] . The codon bias of recently acquired, not completely adapted genes originally hosted by non-extremophilic hosts can interfere with the analysis, thus a later focus on core genes is needed. Upcoming studies are also yet to address the effect of mRNA expression patterns. Here, the implicit assumption is that proteins are expressed on the same level in each proteome. However, difference in expression has clear evolutional implications 90 that are also related to thermophilic properties 91 and misfold chance 92 . We believe that these observations can be incorporated into theory.
On the other hand, our result should not be taken as a direct confirmation of the universal genetic code being an inefficient mapping at extreme conditions. To assess that point better, it would demand us to gather data about the robustness of alternative codes and the proportion of variants where codon usage response to extreme environments can increase fidelity. www.nature.com/scientificreports/ There is another point where biological codon usage data fails to elaborate. Simply extending the measure of distortion to the domain of alternative codes poses a paramount challenge. The currently known codon usage profiles cover only the standard genetic code (and alternative genetic code variants to some extent 14 ). As codon frequencies and their environmental response already carry the inherent effect of the standard genetic code, their variance cannot be directly applied to randomized codes. This warrants further investigation into the physicochemical requirements of extreme habitat conditions as the likely causes of characteristic amino acid compositions that influence codon distribution.

Data availability
All data generated or analysed during this study are included in this published article (and its Supplementary  Information files).