Codon-specic Ramachandran plots show amino acid backbone conformation depends on identity of the translated codon.

Codon-specific Ramachandran plots show amino acid backbone conformation depends on identity


Introduction
One of the most critical cellular processes is the decoding of genetic information into functional proteins.Transfer RNA (tRNA) molecules recognize codons of the messenger RNA (mRNA) sequence as it passes through the ribosome and deliver specific amino acids sequentially for addition to the growing peptide chain.61 codons map to 20 amino acids, meaning that most amino acids are encoded by more than one, synonymous, codon.Once considered a silent redundancy of the genetic code, synonymous coding is now known to be functionally important, subject to evolutionary selective pressure and clearly associated with disease 1,2,3,4,5 .Changes in synonymous coding can alter mRNA splicing, mRNA folding and stability 6,7,8 , and can affect translational speed and accuracy and the conformation of the translated protein 9,10,11,12 .
Numerous studies have shown that changes in the rhythm of translation can alter the kinetics of cotranslational folding and so, the global conformation of the final protein product 13,14 .Translation rate is affected by synonymous codon usage which alters mRNA structure and tRNA abundance, the latter coevolving with codon bias 15,16,17,18,18,20 .This mechanism provides an indirect association between codon usage and global protein structure.Nevertheless, whether and how synonymous variants of a gene will alter the conformation of the final folded protein is still poorly predictable and additionally the literature is riddled with reports of single synonymous mutations causing measurable functional effects that are not well-explained by current mechanisms ,21,22,23,24 .Together this suggests that we are far from fully understanding the role of codon usage in orchestrating protein folding.
To the best of our knowledge, no studies have investigated whether the specific backbone torsion of an amino acid is associated with the synonymous codon from which it was translated.To probe for such a direct and local association, we developed a method for estimating and comparing codonspecific backbone dihedral angle distributions, which we term codon-specific Ramachandran plots.
Comparing these distributions for pairs of synonymous codons, statistically significant differences are observed, indicating a dependence between the codon identity and the backbone dihedral angle of the amino acid into which that codon is translated.

Data collection and codon assignment
Investigating dependence between codon identity and the protein backbone structure is not readily achievable since, regrettably, there is no annotation within the Protein Data Bank (PDB) for the actual genetic template used in producing the protein for crystallization.Automatic assignment of codon identity to each position in a protein structure is a prerequisite to calculate codon-specific Ramachandran plots.It is imperative to stress that any method used for large scale codon reassignment will carry an inherent limitation of being contaminated with uncertainty and error.The main reason is that codon optimization is very commonly used to improve heterologous gene expression 25 , especially in structure determination which necessitates the production of large amounts of soluble protein.There is not one common approach to codon optimization, and the choice of method often depends on trial and error 26,27 .
The procedure for computing and comparing codon-specific comparing backbone dihedral angle distributions is displayed in Fig. 1 and detailed in the Methods.Briefly, high-resolution PDB structures are retrieved, structures are filtered to remove homology bias and the resulting proteins are grouped Querying the PDB for high resolution (<1.8Å), high quality (Rfree < 24%) X-ray crystal structures of E. coli proteins expressed in E.coli (A), out of which unique chains were extracted (B).To ensure the protein set was non-redundant, pairwise sequence alignment scores were calculated between every pair of unique sequences (C).A farthest point sampling procedure was then employed to produce a sub-set of structures with normalized pairwise similarity not exceeding 0.7 (D).Structures were then grouped according to their unique Uniprot identifier.Genetic sequences were retrieved from ENA records cross-referenced by Uniprot (E), adopting a conservative approach: locations having more than one genetic variant for a specific residue are excluded from further analysis (F).For each group, a single protein record was generated with each point in the amino acid sequence annotated with the j, y backbone dihedral angles averaged over all the structures in the record, the codon, and DSSP secondary structure assignment (G).The final data set included 1343 protein chains.We estimated the codon distributions from their samples using kernel density estimation (KDE) on a torus with a Gaussian kernel width of 2°.We used a bootstrapresampling scheme to estimate multiple realizations of these codon specific distributions.p-values were calculated via permutation test on the L1 distance between the estimated densities (steps H-J); the rejection threshold (p=0.019) was established by Benjamini-Hochberg multiple hypothesis correction with the false discovery rate set to q=0.05 (K).according to their unique Uniprot entry.For each position in a protein chain, the backbone dihedral angles, j and y, are calculated; if multiple PDB structures are available, the angles are averaged.
Alongside this precise structural information, DSSP secondary structure is designated, and codons are assigned according to the genetic sequences obtained from ENA records cross-referenced in the Uniprot entry.Only locations with unambiguously assigned secondary structures and codons are retained.
We used only well-fitted X-ray crystal structures having a resolution higher than 1.8Å, as a recent study considering alternate backbone conformations found resolutions better than 2.0Å useful for such purposes 28 .To limit codon assignment errors from including codon optimized genes, we selected only structures of E. coli proteins expressed in E. coli, the most common expression system in the PDB.We purposely did not include all natively expressed proteins from other species, since codon biases differ Out of the two codons GTA and GTT translating valine, GTA has 8% lower propensity for strands and 9.4% higher propensity for helices.Propensities are manifested through the relative weights of the corresponding modes in the Ramachandran plot, which is visible in the marginal distributions of the dihedral angle j plotted here.When conditioned by secondary structure (i.e., restricted to a specific mode), the distributions of the two synonymous codons become indistinguishable.Kernel density estimates are shown with the shaded regions denoting 10%-90% confidence intervals calculated on 1000 random bootstraps.Codon-specific backbone angle distributions are significantly distinct within the β-sheet mode.
Synonymous codons are known to have distinct propensities to different secondary structures 30,31,32,33 , which is manifested as different probabilities of the corresponding modes in the full codon-specific Ramachandran plots (Fig. 2).The difference in propensity for the main two, α and β, secondary structure modes might therefore dominate the difference between the codon-specific Ramachandran plots of synonymous codons.To factor out this effect, we conditioned the dihedral angle distribution on the secondary structure, effectively restricting it either to the distinct β-sheet or α-helix modes.
Select examples of the resulting codon-specific Ramachandran plots, conditioned by these modes, are shown in Fig. 3, while the full set is provided in Supplementary Figs.1-2.Visually, it is evident that synonymous codons of some amino acids have clearly distinguishable distribution shapes especially in the β-mode.

Cys
Ile Thr Val To quantify those differences and their significance, we used a distribution-free two-sample permutation test with the L1 distance between KDEs serving as the test statistic, and assigned p-values to each synonymous codon pair with respect to the null hypothesis that the two codons have the same underlying distribution of backbone angles.To determine the p-value threshold for statistical significance in a setting where multiple hypotheses are considered together, we employed the Benjamini-Hochberg correction with false discovery rate set to 0.05.This process is shown schematically in Fig. 1 and detailed in the Methods.Matrices visualizing the distances between select pairs of synonymous codon distributions are shown alongside the contour plots in Fig. 3 and for all synonymous codon groups in Supplementary Figs.3-4.
Note that alongside the 87 synonymous pairs, we also included the 61 comparisons of each codon to itself.The latter served as a control, and indeed, the null hypotheses were not rejected for any of the same codon pairs in either of the secondary structures.No synonymous pairs were rejected in comparisons of the distributions of the α-mode, however when comparing distributions for the βmode, 57 of the 87 synonymous pairs were rejected.
It is not surprising that α-helices, being less flexible than β-sheets 34,35 , display less variability in codonspecific Ramachandran plots.The Ramachandran plot defines a richer range of structural contexts than the discrete categories available in DSSP annotation 36 , especially in the β-mode.It is therefore possible that some of the differences we observe between codon-specific dihedral angle distributions in the β-mode are attributable to codon preferences for finer secondary structure categories such as parallel and antiparallel β-sheets.However, it should be noted that we used a strict conditioning by the secondary structure, taking only the DSSP annotation 37 E (extended strand -β-sheet in parallel and/or anti-parallel sheet conformation with minimum length of 2 residues) for the β-mode, and H (α-helix -a 4 turn helix with minimum length of 4 residues), for the α-mode.

Distances between dihedral angle distributions of synonymous codons hint at a correlation to features of the translation process
Our findings remain silent regarding the origin of the observed differences in synonymous codon backbone dihedral angle distributions; in particular, the causation direction cannot be established unambiguously.It is tempting to speculate, however, that the translation process plays an active role in the observed effect.To illustrate this speculation, we considered how two features of the translation machinery correlate to the calculated distances between backbone dihedral angle distributions of synonymous codon pairs.Firstly, we demonstrate that the difference in the codonspecific translation speed between a pair of synonymous codons appears to positively correlate to the distance between their dihedral angle distributions (Fig. 4, left).Although ribosome profiling has facilitated measurement of translation speed to exquisite single-codon resolution in human and yeast cells, the application to bacteria has been more problematic 38 .We used the data from Chevance et al. who developed an in vivo bacterial genetic assay for measuring ribosomal speed independent of the stability of the mRNA transcript or the translated protein product 39 .Note that in order to limit confounding factors, we considered only pairs of codons being translated by the same tRNA.
In a second illustration, we identified codons translated unambiguously by a single tRNA, following Bjork et al. 40 , and grouped codon pairs as being translated by either the same or different tRNA molecules.Fig. 4 (right) shows that synonymous codon pairs translated by different tRNAs tend to have a larger distance between their backbone dihedral angle distributions.
While these two trends can by no means be conclusive, they suggest the potential value of the proposed methods in analyzing relations between synonymous coding and the features of the translation process.

Discussion
Codon-specific Ramachandran plots and their comparative analysis could serve as a useful, quantitative tool in future studies looking at the association between coding and local protein structure.It is likely that codon-specific backbone dihedral angle distributions will show even more significant variations when extended to pairs or triplets.Codon pair usage bias has been observed in E. coli 41 and in human disease 42,43 .It has been suggested that codon translation efficiency is modulated by adjacent single nucleotides 44 , that codon pair order significantly affects translation speed 39 , and, more recently, the case for a genetic code formed by codons triplets has been argued 24 .
The main challenge in extending this method to codon pairs or longer tuples is the relative scarceness of data and the need to compare multi-dimensional density functions characterizing the backbone structure of a tuple of amino acids.Extending the analysis to other expression systems faces a similar data scarceness challenge and considering genes from various source organisms expressed in E. coli, either to probe for evolutionary distinctions between species or to overcome data scarceness when probing codon pairs in a hypothesized translation dependent mechanism, is burdened by the uncertainty associated with codon (re) assignment.The latter will be overcome when structural databases start annotating the exact genetic source used for producing protein which is crucial, given the ever-amounting evidence for the critical functional importance of codon usage We hope that the associations between synonymous coding and local backbone conformation revealed through codon-specific Ramachandran plots will spark new investigations which should directly probe the possible causal relationships that might underpin these observations.The implications of an active, coding-dependent process would be tremendous, necessitating an immediate rethink as to how we manipulate the genetic code through codon optimization.This question could not be timelier, as mRNA vaccines are taking centre stage in global medicine.It would also affect how we define the role of synonymous variants in health and disease and could open a new window into understanding protein folding in general.Regardless of the causal direction of the observed dependence between coding and local structure can potentially improve protein prediction algorithms since in such a task the causal relationship between the two is superfluous.

Data collection
Protein structure data is collected from the Protein Data Bank (PDB) 45 through a structured query against the search API.We queried for structures meeting the following criteria: (i) Method: X-Ray Diffraction; (ii) X-Ray Resolution: Less than or equal to 1.8 Å; (iii) Rfree: Less than or equal to 0.24, (iv) Expression system contains the phrase "Escherichia Coli" and (v) Source organism taxonomy ID equal to 562 (Escherichia Coli).
Each query result contains a list of PDB IDs with an entity number, e.g., 1ABC:1 matching the query criteria.Each entity within a PDB structure corresponds to one or more identical polypeptide chains which exist in the structure.Note that a structure may have more than one unique entity (e.g., 1ABC:1 and 1ABC:2) in which case we would obtain both.For each unique entity ID, we select the first of its matching chains using lexicographic order.
Next, we query the PDB's entry data API to obtain a mapping from the specific chain to a list of Uniprot 46 IDs.Whilst most chains map to a single Uniprot ID, there are cases where a chain is chimeric i.e. contains sections from multiple different proteins.We discard such chimeric chains and keep only chains which map to a unique Uniprot ID.We align the protein sequence of each chain to the Uniprot record sequence to provide a Uniprot index for each residue in the PDB chain.We used the same pairwise alignment algorithm as described in 1.3.
We then remove homology bias using the procedure described in 1.2.For all the remaining PDB chains, we calculate the backbone angles (, ) and use DSSP 37 to assign a secondary structure per residue.
Finally, we assign each residue with a codon using the method described in 1.

The result of this
process is what we call a Protein Record for each PDB chain.The Protein Record contains, per residue: corresponding Uniprot ID and residue index, torsion angles (, ), secondary structure and codon.

Redundancy filtering
To remove homology bias from our data, we performed a filtering step.First, each pair of Uniprot sequences are aligned using the BioPython 47 software package, with a match score of 1 and all penalty scores set to zero.Thus, we obtain an alignment score  !" ≥ 0 between every pair of Uniprot sequences , .We then calculate normalized scores, .  !!  "" .
Note that by definition 0 ≤ ! " ≤ 1 and ! != 1 for every , .In other words, this normalization ensures that the self-alignment score is 1 and all other scores are normalized the be in [0,1], regardless of the sequence lengths or the alignment penalty values.This normalization also makes it simple to choose a similarity cutoff threshold, since the threshold is chosen in the fixed range [0,1] where 1 equates to an exact match and 0 to a complete mismatch.We chose a normalized similarity threshold of  = 0.7.
Using the normalized alignment scores we then employ a farthest-first traversal procedure to sort the Uniprot sequences: the first sequence is selected arbitrarily, and each successively selected sequence is such that it has the lowest maximum normalized alignment score between itself and all previouslyselected sequences.Formally, denote by  and  the sets of selected and un-selected sequences, respectively.We initialize to  = {0} and  = {1,2, … ,  − 1} where  is the number of Uniprot sequences.At each step  of this traversal, for each unselected sequence  ∈ , we calculate its greatest similarity to any of the so-far selected sequences, We then choose the sequence which has the lowest maximal similarity to the selected sequences, i.e., we add to .We stop the procedure once  # () >  for the sequence  that was selected at step , and retain  as the output filtered set of Uniprot sequences.This ensures that no two sequences in the selected set have a normalized similarity score greater than .After performing this procedure, we keep in our dataset only PDB chains that were mapped to one of the Uniprot sequences in .Any PDB chain mapped to a Uniprot sequence from  is discarded from analysis.Note that we keep all chains from different PDB structures that correspond to the same selected Uniprot sequence in order to aggregate their backbone angles as explained in 1.4.

Codon assignment
Since the exact genetic sequence of the protein is not annotated in the PDB we assigned codons from the native sequence.Given a PDB chain, we obtain its unique Uniprot ID from the previous step.We query Uniprot to obtain all cross-referenced IDs to the European Nucleotide Archive (ENA) 48 .From the ENA database, we obtain all available genetic sequences for the specific protein and translate each genetic sequence to an amino-acid sequence using the standard genetic code table, and perform pairwise sequence alignment between the PDB chain's amino-acid sequence and the translated genetic sequences.The alignment is performed using the BioPython 47 implementation of the Gotoh global alignment algorithm 49 .We used BLOSUM80 as the substitution matrix for the alignment, a gapopening penalty of −10 and a gap-extension penalty of −0.5.
Following the pairwise alignment of the amino acid sequence to all translated genetic sequences, we obtain the aligned codons from each sequence and assign them to corresponding residues from the PDB chain.This process yields zero or more assigned codons per residue in the PDB chain.In cases where there is more than one codon (i.e., different genetic sequences contributed different codons), we consider the assignment ambiguous and exclude that residue from further analysis.

Angle aggregation
Since some proteins have been characterized by multiple crystal structures, there residues from different PDB chains which to the same Uniprot ID and location in the Uniprot sequence.For example, in our dataset, the residues 1SEH:A:42, 1RNJ:A:42 and 2HRM:A:42 were all aligned to the Uniprot ID and index P06968:41.We consider such cases as different realizations of the same protein residue and aggregate the backbone angles from such residues, to obtain an "average" measurement of their backbone angles.
When aggregating the angles, we must account for the fact that a torsion angle pair  = (, ) is defined on a torus (i.e. the domain  ( ×  ( where  ( is a circle).Intuitively, each angle naturally "wraps-around" at ±180 ∘ , and the space spanned by two such angles is a torus.Thus, taking a simple average of each angle separately would not be correct.Instead we use the torus-mean function defined in 1.7.1.

Backbone angle distribution distance
We seek to measure the distance between distributions of backbone torsion angles of synonymous codons in -helix and -sheet secondary structure modes.Denote (|, ) the distribution of backbone angles  of codon  in secondary structure .We denote the distance between the backbone angle distributions of two synonymous codons  and ′ in secondary structure  as (, ′)| and estimate them between all pairs of synonymous codons.We include all cases of  = ′ as controls.The distance metric we used was the L1 distance, We also experimented with the L2 and smoothed Wasserstein distances, however empirical tests showed that the L1 metric provided the highest statistical power of all three at reasonable computational costs.
Although the underlying backbone angle distributions (|, ) are unknown, we observe torsion angles sampled from them, since we can assign a codon and secondary structure per residue.This provides us a finite sample { !∼ (⋅ |, )} ! for each codon  and secondary structure , from which we can estimate the unknown distribution.We use these samples to fit a kernel-density estimate (KDE),  Z (|, ), of each distribution, as explained in 1.7.3.The distance metric  ( (, ′)| is then calculated on the KDEs of each synonymous codon pair.Since the KDEs are discrete, the integration above becomes a sum, , where  is the number of KDE bins in each direction and  # " ,# != ] # " ,  # !^ are discrete evenlysampled grid points.We then use permutation-based hypothesis testing to determine whether the distance we obtained supports the (alternative) hypothesis that the codons have a significantly different distribution, as explained in 1.6.

Preventing bias due to secondary structure preference
A crucial point is that we calculate the distance between synonymous codon distributions separately for each secondary structure.This is done in order to avoid biasing the distance due to the fact that synonymous codons have different propensities for different secondary structures.
To see why this is a problem, consider that the the distribution of a codon's backbone angles in both helix and sheet together can be written as a bi-modal mixture of the separate distributions: where we denote  1 and  2 as -helix and -sheet, respectively, and  1 and  2 are the the mixture coefficients representing the prevalence of the codon in each of the secondary structures.
Now assume there exist synonymous codons , ′ which have exactly the same backbone angle distributions in both helix and sheet, but each has distinct mixture coefficients.Two such codons will clearly manifest the same angles in helices and sheets, so we would like to measure a distance of zero between their backbone angle distributions (theoretically, in the limit of infinite samples).However, measuring the distance between the codons' distributions on data from both secondary structures together will in-effect be greater than zero (even in the limit of infinite samples), because it would measure the difference between their propensities for the secondary structures, which will be captured as the height of each mode in their bi-modal distributions.Thus, by comparing codon distribution after conditioning on each secondary structure separately, we avoid biasing our distance measurements by this difference of propensity for secondary structures.

Detecting synonymous codons with different angle distributions
Faced with finite-sample estimations of codon backbone angle distributions,  Z (|, ), we aim to determine whether there exist pairs of synonymous codons (, ′) for which the underlying distributions, (|, ), are different.The challenge, of course, is that any difference we see in terms of the measured distance between estimated distributions,  Z ( (, ′)|, could be due to chance, arising from the the availability of only finite data.Thus, our approach is to determine whether the distance we measured is large enough such that the probability of obtaining such a distance by chance from identical underlying distributions is extremely small.
For every pair of synonymous codons and secondary structure (, ′)| (where we allow  = ′), we define a null hypothesis, which states that they have identical underlying backbone angle distributions: We used permutation-based hypothesis testing 50 to obtain valid p-values for each of these null hypotheses without the need to make assumptions about the backbone angle distributions (|, ) or the distribution of the distance metric  ( (, ′)| under the null.The permutation testing procedure is detailed in 1.7.4.We thus obtain, per secondary structure, a total of 148 p-values: 61 for identical codons,  = ′, and an additional 87 for non-idential but synonymous codons,  ≠ ′.
With an hypothesis test, one usually seeks to seeks to control the probability of rejecting the null hypothesis when it is true (type-I error) by choosing a pre-defined significance level such as  = 0.05 and rejecting when  < .However, in this case we are in a multiple-hypothesis setting, where we consider multiple null-hypotheses simultaneously and seek to determine which of them can be rejected.It is therefore necessary to control the type-I error for the whole set of simultaneous hypothesis tests, not individually per test.Otherwise, for a fixed , the chance of any type-I error increases with the number of hypotheses, so it is no longer controlled by .To address the multiplehypothesis setting, we used the Benjamini-Hochberg method 51 .Using this approach, a significance threshold is calculated dynamically from the set of all obtained p-values, in a way which controls the False-Discovery Rate (FDR) for the entire set of tests (instead of the type-I error of each individual test).The method allows us to specify the FDR-control parameter, , and ensures that that over repeated trials the expected value of the proportion between false discoveries (i.e.false rejections of the null hypotheses) and total discoveries (all rejections of null hypotheses) will be .

Preventing bias due to sample size differences
Synonymous codons are not equally prevalent; some are more abundantly found than others, and in some cases there is a substantial difference in their abundance.This translates into vastly different sample sizes we collect for the backbone angles of each codon.Some rare codons in our dataset had almost two orders of magnitude less data compared to their more abundant synonymous counterparts.
This creates a challenge for comparing distributions estimated from finite samples.Estimated distributions can seem very distinct when using vastly different sample sizes to estimate them, even when the samples come from the same underlying distribution.One way to account for this would be to cross-validate the KDE kernel bandwidth and choose an appropriate value for each sample size.This is challenging however, since we would need to separately cross-validate for all codons, some with very limited data.
Instead, we opted to use a single kernel bandwidth, but fix the sample size for each set of synonymous comparisons.For each amino acid , and per secondary structure , we choose a single sample size  , that will be used to estimate the distributions of all its synonymous codons.This single sample size is chosen to be the minimum of the sample sizes from all of its codons.Due to computational constraints, we also set an upper limit  :;< to the sample size for each estimated distribution.Thus, the sample size for all codons of amino acid  was calculated as  , = min l :;< , min 5∈ `5, am, where  5, is the sample size for codon  in secondary structure .
Choosing the smallest sample size mitigates sample-size bias in the estimation of the codon backbone angle distributions.However, this approach causes potential loss of data: in a group of synonymous codons, having one very rare codon would mean that we also use a limited number of samples from the more abundant codons.In order to address the sample size bias on one hand, while exploiting all available data on the other hand, we employed bootstrapped-sampling 52 scheme on top of the distribution estimation and comparison.For each codon  ∈ , we estimate its distribution  times from  , samples drawn with replacement from its collected data.This gives us access to at most  ⋅  , samples from each codon  ∈ , instead of only  , .We then compare  pairs of distributions for each synonymous codon pair , ′ ∈  using the permutation test, and use the results of all permutations in all bootstrap iterations to calculate the p-value of (, ′).
Statistical tests were performed with  = 25 bootstrap iterations with  = 200 permutations each for a total of 5000 permutations used for p-value calculation.We used  :;< = 200 for all comparisons and set an FDR threshold of  = 0.05.

Full procedure
The procedure for comparing synonymous codon backbone angle distributions and then selecting which codon pairs have significantly different distributions is described below.
For each synonymous codon pair, (, ′) and secondary structure , we calculate a p-value with respect to the null hypothesis  3,(5,5&)| , i.e. that they come from the same underlying distribution: 1.For  ∈ {1, … , }: a. Sample  , observations randomly from  and from ′ (each with replacement).
b. Denote the sampled observations from  and ′ as  and ′ respectively.For each secondary structure , we calculate the significance threshold based on the Benjamini-Hochberg method as follows: 1. Denote `!, a !0( ? the set of  = 148 p-values obtained from all pairwise comparisons of synonymous codons in secondary structure . 2. Sort the p-values and denote  (!), the -th sorted p-value.
3. Calculate the threshold p-value index for an FDR of , which is the largest p-value smaller than the adaptive threshold of  ⋅ /: 4. Set the adaptive significance threshold:  ?=  (! # ), .
Finally, the set of synonymous codon pairs corresponding to the rejected null hypotheses are deemed to have significantly different backbone angle distributions.

Mathematical Tools
1.7.1 Torus mean.Given a set of  points on a torus { !} !0( @ where  != ( !,  ! ) ∈  ( ×  ( , we would like to calculate the mean of these points,  ‾ in a way which accounts for the wrap-around of each angle at ±180 ∘ .We define function which approximates a centroid on a torus, by calculating the average angle with circular wrapping in each direction separately.We denote this function as  ‾ = torm({ !}).For example, if  ( = (170,170) and  A = (−170, −130) then we expect torm({ ( ,  A }) = (180, −160).We define the function as follows where atan2(, ) is a signed version of arctan(/) which uses the sign of both arguments to unambiguously recover the sign of the original angle  such that  = sin and  = cos.Such a test allows one to determine whether there is sufficient evidence to reject the null hypothesis, while limiting the chance of a type-I error (false positive, or rejecting  3 when it is true) to be at most 0 <  ≪ 1. Denote by (, ) ∈ ℝ a test statistic of our choosing, which numerically summarizes the differences between  and , such that the smaller the value of (, ), the more  and  are deemed similar.Further denote by ̂= (, ) the value of this test statistic when evaluated on the samples at hand.To perform the hypothesis test, a p-value is calculated, which is the probability of obtaining a result at least as large as ̂ under the assumption that  3 is true:  = Pr[ ≥ | 3 ].The null hypothesis  3 is then rejected if  < , thereby limiting the probability of type-I error to be .
A point of difficulty with this approach is that calculating the p-value requires access to the distribution of the test statistic under the null.Specifically, one needs to know the CDF of the random variable | 3 , which can typically only be known by making some assumptions about the data (e.g., that it follows a known distribution) and choosing a test-statistic for which the distribution of | 3 can be computed analytically under the assumptions about the data (such as the t-statistic).
For the application of comparing codon backbone angle distributions, it is crucial to avoid any assumptions about the data distributions from which the samples are obtained.Any such assumption would be unfounded, and may bias our test to the point of making it useless.Moreover, we require the ability to choose any test-statistic , because this allows us to compare multiple options and choose a highly-discriminative test-statistic for the type of data at hand.
The key observation behind this approach is that under the null, we can treat  and  as labels which are randomly assigned to observations from the same data distribution.Therefore, by permuting the labels and calculating the permuted test-statistic, we are obtaining samples of | 3 .If  3 is indeed true, we expect that ̂≈ #, thereby yielding  ≈ 0.5 as  → ∞.Conversely, if  3 is false, we would expect that ̂> #, and then  → 0 as  → ∞.In practice, the number of permutations  is limited by computational constraints.Nevertheless, since the smallest p-value which can be obtained is  :FG = 1/(1 + ), we know an upper limit for the number of necessary permutations for a given significance level (in case of a single test).

Figure 1 -
Figure 1 -Data Collection and Analysis.Querying the PDB for high resolution (<1.8Å), high quality (Rfree < 24%) X-ray crystal structures of E. coli proteins expressed in E.coli (A), out of which unique chains were extracted (B).To ensure the protein set was non-redundant, pairwise sequence alignment scores were calculated between every pair of unique sequences (C).A farthest point sampling procedure was then employed to produce a sub-set of structures with normalized pairwise similarity not exceeding 0.7 (D).Structures were then grouped according to their unique Uniprot identifier.Genetic sequences were retrieved from ENA records cross-referenced by Uniprot (E), adopting a conservative approach: locations having more than one genetic variant for a specific residue are excluded from further analysis (F).For each group, a single protein record was generated with each point in the amino acid sequence annotated with the j, y backbone dihedral angles averaged over all the structures in the record, the codon, and DSSP secondary structure assignment (G).The final data set included 1343 protein chains.We estimated the codon distributions from their samples using kernel density estimation (KDE) on a torus with a Gaussian kernel width of 2°.We used a bootstrapresampling scheme to estimate multiple realizations of these codon specific distributions.p-values were calculated via permutation test on the L1 distance between the estimated densities (steps H-J); the rejection threshold (p=0.019) was established by Benjamini-Hochberg multiple hypothesis correction with the false discovery rate set to q=0.05 (K).

Figure 2 -
Figure 2 -Different propensities for secondary structures of synonymous codons are manifested in the dihedral angle distribution.Out of the two codons GTA and GTT translating valine, GTA has 8% lower propensity for strands and 9.4% higher propensity for helices.Propensities are manifested through the relative weights of the corresponding modes in the Ramachandran plot, which is visible in the marginal distributions of the dihedral angle j plotted here.When conditioned by secondary structure (i.e., restricted to a specific mode), the distributions of the two synonymous codons become indistinguishable.Kernel density estimates are shown with the shaded regions denoting 10%-90% confidence intervals calculated on 1000 random bootstraps.

6 between organisms 29
and such generalization could obfuscate the sought for associations between coding and structure.

Figure 3 -
Figure 3 -Codon-specific Ramachandran plots of select amino acids and distances between them.Shown left-to-right are cysteine, isoleucine, threonine, and valine.Contour plots depict the level lines containing 10%, 50% and 90% of the probability mass.Shaded regions represent 10%-90% confidence intervals calculated on 1000 random bootstraps.The β-sheet (top) and a-helix (bottom) modes are depicted.The matrices show normalized L1 distances between pairs of codon-specific Ramachandran plots in the two secondary structures.Pink dots indicate pairs with significantly different dihedral angle distributions.

Figure 4 -
Figure 4 -Distances between codon-specific Ramachandran plots are related to parameters of the translation process.Left: The absolute difference in the relative translation speed as a function of the distance between backbone dihedral angle distributions for pairs of codons translated unambiguously by the same single tRNA.The two quantities are positively correlated (r 2 =0.6).Translation speed data and confidence intervals are reproduced from Chevance et al. (2014).Right: Pairwise distances between backbone dihedral angle distributions of codons translated unambiguously by the same tRNA (green) or two distinct tRNAs (red), sorted in ascending order (left) and as cumulative histograms (right).Non-cognate codon pairs tend to exhibit a significantly bigger distance.Horizontal lines indicate means.In both plots, the normalized L1 distances are reported with the ±s confidence intervals calculated on 1000 bootstraps.
c. Apply permutation test procedure (1.7.4) on  and ′ for  permutations.The teststatistic (, ) first computes the KDEs of  and , then calcultates the L1 distance between them.d.Denote by  = the number of times the base metric no greater than the permuted metric in the current permutation test.2. Calculate the p-value with respect to  3,(5,5&)| :

Figure 5 -
Figure5-Example of test statistic distribution in the permutation test.Codon pairs are compared using the L1 distance statistic between their dihedral angle KDEs in the β-sheet secondary structure mode.For each pair, depicted is the 1-cumulative distribution function (CDF) of the difference between the L1 distance between the pair of KDEs and one between the pair of KDEs constructed with permuted labels.The intersection of 1-CDF with the vertical axis yields the p-value of the test.When comparing a codon to itself (I-ATT, I-ATT), the null hypothesis holds, and the difference is expected to be positive half of the times (p-value»0.5).The indistinguishable pair I-ATT, I-ATC produces a high pvalue, while the more clearly distinguishable pair I-ATT, I-ATA yield a very low p-value.Two nonsynonymous codons (I-ATT, A-GCG) appear perfectly distinguishable.Distributions were calculated using 100 bootstrap samples with 200 permutations in each.

4 Permutation-based two-sample hypothesis test.
Given two points on the torus,  ( = ( ( ,  ( ) and  A = ( A ,  A ), we would like to measure the distance, in angles between these points.Calculating a simple Euclidean distance doesn't account for the fact that each angle wraps-around at ±180 ∘ .For example, the Euclidean distance between  ( = (170,0) and  ( = (−160,0) would produce 330 ∘ , while it's simple to see that when considering the wrap-around at 180 ∘ , the distance should be 30 and  is a constant factor which normalizes the KDE so that it sums to one.The KDE was evaluated on a discrete grid of size 128 × 128, which corresponds to a bin width of 360/128 ≈ 2.8 ∘ .By applying the kernel to the wrap-around distance, we correctly account for the distance on the torus between each sample and each grid point.We used a simple univariate Gaussian Instead, we used a fixed bandwidth for all KDEs, and made sure to always compare KDEs calculated from the same number of samples.Given two statistical samples,  = { !} !0( @ $ and  = { !} !0( @ % containing  B and  C observations respectively, we wish to test whether the observations in both samples were obtained from the same underlying data distribution.A powerful and well-known approach to do this, is by conducting a two-sample statistical hypothesis test, with the null hypothesis that  and  are sampled from the same distribution, i.e.,  3 :  B () =  C ().
537.2Torusdistance.∘.We therefore define the torus distance function as follows:tord( ( ,  A ) = .arccosAcos((−A)+ arccos A cos( ( −  A ).1.7.3 Kernel density estimation.We used two-dimensional kernel density estimation (KDE)53in order to estimate backbone angle distributions from finite samples.Given samples { !} !0( @ from torsion angles of a codon  in secondary structure , we calculate]tord(,  !)^,where  represents points on a discrete grid,  is a scalar kernel function, tord(⋅,⋅) is the torus wraparound distance defined in 1.7.2, kernel, () = exp(− A /2 A ), with a variance of  = 2 (equivalent to the kernel bandwidth).We did not adjust the kernel bandwidth for each estimated distribution according to the number of data points.