Deciphering the rationale behind specific codon usage pattern in extremophiles

Protein stability is affected at different hierarchies – gene, RNA, amino acid sequence and structure. Gene is the first level which contributes via varying codon compositions. Codon selectivity of an organism differs with normal and extremophilic milieu. The present work attempts at detailing the codon usage pattern of six extremophilic classes and their harmony. Homologous gene datasets of thermophile-mesophile, psychrophile-mesophile, thermophile-psychrophile, acidophile-alkaliphile, halophile-nonhalophile and barophile-nonbarophile were analysed for filtering statistically significant attributes. Relative abundance analysis, 1–9 scale ranking, nucleotide compositions, attribute weighting and machine learning algorithms were employed to arrive at findings. AGG in thermophiles and barophiles, CAA in mesophiles and psychrophiles, TGG in acidophiles, GAG in alkaliphiles and GAC in halophiles had highest preference. Preference of GC-rich and G/C-ending codons were observed in halophiles and barophiles whereas, a decreasing trend was reflected in psychrophiles and alkaliphiles. GC-rich codons were found to decrease and G/C-ending codons increased in thermophiles whereas, acidophiles showed equal contents of GC-rich and G/C-ending codons. Codon usage patterns exhibited harmony among different extremophiles and has been detailed. However, the codon attribute preferences and their selectivity of extremophiles varied in comparison to non-extremophiles. The finding can be instrumental in codon optimization application for heterologous expression of extremophilic proteins.


Analysing relative abundance of codons among extremophiles and non-extremophiles.
Relative abundance of statistically significant codons was calculated to understand codon frequency. The relative abundance was either positive or negative (Fig. 1A-F). The positive relative abundance of codons showed higher preference towards extremophiles and vice versa for a negative relative abundance. The analysis revealed positive relative abundance of 8 codons for thermophiles in T-M; 16 for psychrophiles in P-M; 16 for thermophiles and 26 for psychrophiles in T-P; 28 for acidophiles and 21 for alkaliphiles in A-B; 18 for halophiles H-Nh; and 10 for barophiles B-Nb. Comprehensively, in T-M dataset, the codons like ATA (Ile), AGG (Arg), CTC (Leu), AGA (Arg), GAA (Glu), CTT (Leu), etc. had higher abundance and CAA (Gln) had lowest abundance in thermophiles. In P-M dataset, CAA (Gln) had highest abundance and GAC (Asp) had lowest abundance in psychrophiles.

Analysis of AT-or GC-rich and A/T-or G/C-ending codons.
The characteristics of extremophilic codons were enumerated by analysing the nucleotide composition of the significant codons. Such studies have not been taken up till date. The aforementioned codons showing positive relative abundance to extremophilicity were taken into account for analysing AT-or GC-rich and A/T-or G/C-ending codons. The preferred codons were counted for their nucleotide composition analysis for AT-rich or GC-rich codon and wobble base analysis for A/T-or G/C-ending codons. The statistical analysis of relative nucleotide composition of codons showed a decreasing trend of AT-rich codons -psychrophiles (63.3%) > alkaliphiles (57.2%) > thermophiles (52.9%). Analysis of GC-rich codons showed decreasing trend in barophiles (60%) > halophiles (55.5%). Acidophiles showed equal proportion of AT-rich and GC-rich codons (Fig. 3A). The A/T-or G/C-ending codon analysis revealed that psychrophiles and alkaliphiles preferred A/T ending codons whereas, thermophiles, halophiles and barophiles preferred G/C-ending codons (Fig. 3B). Similar to AT-rich and GC-rich codon analysis, acidophiles had also showed equal proportion of A/T-ending and G/C-ending codons. The results of AT-or GC-rich and A/T-or G/C-ending codons of all groups of extremophiles corroborated with each other except that of thermophiles. Thermophiles have higher priorities of AT-rich codons but they prefer upto 60% of G/C-base at wobble position. Such statistical analysis is imperative for expanding the understanding of nucleotide composition of codon usage patterns and codon adaptations in different classes of extremophiles.    generation. The present work applied it for prediction of extremophile and non-extremophile CDS on the basis of selected significant codons. This software integrates all types of machine learning schemes for both unsupervised clustering algorithms (k-means; k-medoids; SVC, support vector clustering; DBSCAN, density-based spatial clustering of applications with noise; and EMC, expectation maximization clustering) and supervised learning algorithms such as k-NN, k-nearest neighbour; Naïve Bayes; logistic regression; SVM, support vector machine; decision trees; and, ANN, artificial neural network. The performance of these machine learning classifiers were optimized by testing varied parameters (information gain, gain ratio, Gini index, accuracy, dot kernels, radial kernels, polynomial kernels, sigmoid kernels, anova kernels, C-SVC, nu-SVC, etc.) specific to individual applied algorithm. The prediction of these algorithms was validated by 70% testing and 30% training datasets. To distinguish the importance of codons in extremophiles, the datasets were independently subjected to 11 different attribute weighting algorithms ( Table 2). The analysis was performed to enumerate the number of weighting algorithms that weighed the statistically significant codons ≥0.5 (each codon was weighted in the range of 0 to 1 by these algorithms). For instance, CAA of T-M was weighted by 10 algorithms out of 11. Similarly, AGA codon in P-M was weighted by 8 algorithms; CAA in T-P by 10 algorithms; TGG in A-B by 9 algorithms; GAC in H-Nh by 11 algorithms; and, AGG and AAG in B-Nb were weighted equally by 10 algorithms. These weighted codons have indicated some significance for extremophilicity but could not express any preference towards either. The present finding corroborated with earlier results of relative abundance and 1-9 scale ranking analysis for most weighted codons. Further, the datasets were subjected to unsupervised and supervised learning algorithms. The applied unsupervised clustering algorithms performed the task of dividing the labelled CDS into extremophile and non-extremophile clusters (Supplementary Table S7). The clustering analysis of k-means, k-means (kernel), k-medoids and EMC could partly cluster labelled CDS into distinct groups. For example, T-M dataset was analysed by k-means algorithm and it contained 232 CDS (or 116 pairs) distributed to cluster 0 (179 CDS) and cluster 1 (53 CDS). The 179 CDS of cluster 0 were classified as 94 thermophilic and 85 mesophilic. The remaining 53 CDS in cluster 1 were classified as 22 thermophilic and 31 mesophilic. Similar result was obtained in other datasets as well. On the other hand, DBSCAN and SVC were completely unsuccessful in clustering labelled CDS of all the comparing datasets. The reason for failure could be inappropriate choice of minimum number of data-points required 17 .

Analysis of variation in
Supervised learning analysis showed all the model generation algorithms gave different accuracy of prediction in different datasets (Table 3). Only best machine learning models with highest prediction accuracy were selected for interpretation of adaptable codons enlisted in Table 3  Decision Tree and Random Forest with four classification criteria (information gain, gain ratio, gini index and accuracy) better classified codon datasets with good accuracy percentage. However, CHAID (chi-squared automatic interaction detection), ID3 (iterative dichotomiser 3) and weight-based parallel decision tree model failed to classify codon datasets, since they generated trees without roots and leaves hence, discarded. The best and most accurate trees were selected and their discrimination rules are shown in Table 4 and detailed in Supplementary Figures S1-S6. Using information gain criterion decision tree for T-M, P-M and T-P gave accuracy of 78.57%, 75.00% and 92.65% respectively. In T-M and P-M, CAA (Gln) is the selection criterion for mesophiles and psychrophiles when its percentage is above 1.866% and 4.092%, respectively. Correspondingly, CAA >1.056% in T-P comparison is the selection criterion for psychrophiles. The percentage occurrence of CAA (Gln) ≤1.866 in T-M whereas, in T-P, CAA ≤1.056% indicates thermophilic category. Therefore, CAA codon is highly preferred in mesophiles and psychrophiles and less preferred in thermophiles 18,19 . Further, in A-B dataset, Random Forest (Gini index) gave performance accuracy of 80.77% for classification of codons of acidophiles and alkaliphiles. The tree depicted the occurrence percentage of GAG (Glu) >4.202% and AAG (Lys) >5.007% in alkaliphilic proteins whereas, the occurrence percentage of GAG (Glu) ≤4.202%, CTC (Leu) >2.705% and GAT (Asp) ≤5.524% in acidophilic proteins. In H-Nh, Decision Tree (gain ratio) gave highest accuracy of 85.00% and showed that GAC (Asp) is the selection criterion when its frequency >8.861% for halophilic genes whereas, the combination of percentage occurrence of GAC ≤8.861% and AGG (Arg) >1.441% for non-halophilic genes. Finally, in B-Nb, the Random Forest (gini index) gave the highest accuracy of 96.55% for codon classification prevalent in barophiles and non-barophiles. It depicted that when composition of AGG (Arg) >3.007% and ATA (Ile) >3.553% in a gene, it codes for barophilic proteins, while when the composition of AGG (Arg) ≤3.007%, TAC (Tyr) ≤2.105% and AGT (Ser) >1.200%, it codes for non-barophilic proteins.

Discussion
The selection of synonymous codons in extremophiles is by mutational bias, dominant effect of nucleotide composition and dependency on the surrounding milieu [20][21][22] . Codon usage affects the patterns of amino acid 23 , regulates protein structure and function by affecting translation elongation speed in the eukaryotic systems as Drosophila 24 and Neurospora 25 . Protein structures of extremophiles prefer increased non-covalent interactions to maintain activity at high temperature, pH and pressure 26 . This can be attributed to increased usage of bulky and charged amino acids associated to the higher percentage of their corresponding codons in the gene. For instance, halophilic proteins are characterized by increased negative surface charge due to increased acidic amino acid  as Asp leading to higher percentages of GAC codon 27 . Expanding the horizon of adaptability from structure to codon usage in protein extremostability is the intent of the present work. The GC-content variations in all the classes of extremophilic genomes has been deduced by Chakravorty et al. 28 . The study indicates, in spite of the variation observed in each extremophilic class the basis of extreme-stability selection based only on GC-content could be ambiguous. Hence, additional basis of selection needs to be carried out. Analysis of AT-or GC-rich and A/T-or G/C-ending codons could be another endorsive support. Earlier reports show that the variations in nucleotide composition leads to change in patterns of codon usage indirectly affecting thermostability 29 32 . This suggests that the thermophiles have AT-rich bases at first two base positions of codons and the third position is usually occupied by G/C-base. The present study also enumerates nucleotide composition for most extremophiles as halophiles, acidophile, alkaliphile and barophiles which has not been documented earlier. Genome of alkaliphilic bacterium Bacillus halodurans was observed to have less GC-content, hence poor usage of GC-rich codons 33 . In correspondence to thermophiles, barophiles also showed a higher usability of GC-rich as well as G/C-ending codons than AT-rich and A/T-ending codons suggesting that these codons make the genome and proteome more robust and tolerant 34 . In halophiles, the preferred codons were relatively more GC-rich and GC-ending but their codon preferences varied amongst other extremophiles 14 .
Comparative codon usage analysis in thermophiles, mesophiles and psychrophiles showed a decreased preference of AGG (Arg) codon and increased preference of CAA (Gln) from thermophiles to mesophiles to psychrophiles. This could be due to increased usage of AGR codons and decreased usage of CGN codons for Arg in thermophiles proven by Van der Linden and de Farias (2006) 35 . The reason could be if the second nucleotide 'G' of CGN is mutated to ' A' then it codes for histidine (CAT and CAC) and glutamine (CAA and CAG) which is detrimental for thermostability 9 . The preference of CAA codon showed deleterious effects since it codes for thermolabile residue i.e. glutamine which is prone to spontaneous deamidation and results into cleavage of peptide bonds at elevated temperature 36 . Suggesting, CAA codon is significantly preferred in psychrophiles and mesophiles rather than thermophiles. Therefore, nature selects an alternative approach to sustain thermostability by AGR    Table 4. Summary of decision tree prediction on extremophile datasets with their criteria chosen and best discriminatory rule for classification of codons.
SCIentIfIC RePoRts | (2018) 8:15548 | DOI:10.1038/s41598-018-33476-x (AGA and AGG) codon bias for arginine. The AGR codons have roles in protecting thermostability by usage of Arg 9,32,35 . Liu et al. (2012) also reported that purine-rich codon usage such as AGR (Arg) have positive correlation with optimum growth temperature of organism 37 . Codons such as ATA (Ile), CTC (Leu), AGA (Arg), GAA (Glu), CTT (Leu), etc. also showed abundance in thermophiles since they get translated to amino acids that enhances hydrophobic interactions and surface charges 38 . Codon adaptability of barophiles has been scantily reported. The comparative analysis of barophiles and non-barophiles showed AGG (Arg) had higher priority and CAA (Gln) had lowest indicating common codon usage patterns of thermophiles and barophiles 39 . Di Giulio (2005) divulged that GC-ending codons were significant in barophiles especially AGG that codes for arginine which frequently occurred in barophiles 40

Creation of comparative datasets and enumeration of statistically significant codons.
To study codon usage patterns, gene CDS of extremophiles were comparatively analysed with their non-extremophilic homologous counterparts. Six groups of extremophiles were searched with various extremophilic keywords in PubMed-NCBI. Protein sequences were collected from UniprotKB. Acidophilic proteins (pH ≤ 6) and alkaliphilic proteins (pH ≥ 8) were searched from BRENDA. The homologous non-extremophilic counteparts were chosen by BLAST, ClustalW, K-align, Parallel PRRN and CLUSS2. Six comparative non redundant datasets of CDS were created (T-M, P-M, T-P, A-B, H-Nh and B-Nb) from EMBL-EBI-ENA database (Tables S1-S6). Percentage of 64 codons were calculated and normalized. Non-parametric two-sample Kolmogorov-Smirnov test was employed to enumerate the statistically significant codons with p-value < 0.05.

Relative abundance analysis of codons.
Individual dataset was utilized for enumeration of relative abundance of significant codons for understanding the occurrence preference. The weighted average differences were first calculated for each significant codon corresponding to extremophile and non-extremophile which was found to be either positive or negative. The relative abundance of a codon was calculated using a derived equation (1): where, β rel , relative abundance of a codon in a comparing datasets; α e , weighted average of a codon in extremophile dataset; α ne , weighted average of the same codon in non-extremophile dataset; α max , maximum of weighted average differences in all the statistically significant codons. Then, the derived mathematical expressions for α e , α ne and α max were incorporated in the following equations (2, 3 and 4): where, (α e ) i , statistically significant codon of i th genes in extremophile dataset; (α ne ) i , statistically significant codon (same) of i th genes in non-extremophile dataset; N, total protein pairs in the comparing dataset; α j , weighted average difference of codon from extremophile dataset and non-extremophile dataset and M, total number of significant codons in the comparing datasets.
Prioritizing the codons to understand their preference in extremophiles. The significant codons of each extremophile class were ranked in 1-9 scale according to their contribution towards extremophilicity. The generated weighted average of each codon was normalized by taking ratio of codon of extremophile and non-extremophile counterpart. The ratio weights were considered as normalized weight and were further used for deriving their 1 to 9 interval scale weight. All the ratio weights were scaled down to a 1-9 rank using a generated python script (Supplementary Table S9) which uses the following equation (5): where W i is the derived weight in the 1 to 9 scale of any i th significant codon in any of the comparing dataset, i = 1, .., n where n is the number of statistically significant codon; ξ i is the value of the weight for i th significant codon, α is the minimum value in the weight for codon feature and β is the maximum value in the weight of codon feature. This gave the relative importance of each feature. Analysing data-points of highest and lowest ranked codon. In the section "Prioritizing the codons to understand their preference in extremophiles", the resulted highest and lowest ranked significant codons of each datasets were used for data-points analysis by plotting their percentage score in their respective CDS. The data-points analysis was carried out for the highest and lowest ranked codon. It was estimated by normalizing with the data-points having the maximum value to have scores in the range of 0-1. Further, the data-points of highest and lowest ranked codons were separately graphically represented for each comparing dataset.

Analysis of AT-or GC-rich and
Finding codon harmony among extremophiles. The harmony in codon usage among six studied groups was analysed. On the basis of relative abundance and 1-9 scale ranking of significant codons, the positively contributing codons from the datasets were classified among six types of extremophiles to decipher codon harmony.
Generation of machine learning models to classify and predict extremophilic codons. Machine learning algorithms were used to predict, classify and generate models for extremophilic codon usages by attribute weighting, unsupervised and supervised machine learning. The datasets were subjected to test these algorithms using Rapid Miner version 5.3.000. The prediction of these algorithms were validated by 70% testing and 30% training datasets 43 . The employed approaches classified binary datasets on the basis of their discriminating codons. Eleven different algorithms (SVM; Principle Component Analysis; Correlation, Deviation, Chi squared statistic, Gini index, Information gain, Information gain ratio, Uncertainty, Relief and Rule) were applied independently on the datasets and weigh the codons in a range of 0-1. The codon attributes with weight ≥0.5 were selected for analysing codon preference. The datasets were further subjected to unsupervised and supervised learning algorithms since attribute weighting is insufficient in generating models for codon usage pattern. The unsupervised clustering algorithms group the similar data-points and dissimilar data-points into separate clusters according to various criteria 44 . Six unsupervised clustering algorithms (k-Means, k-Means (kernel), k-Medoids, SVC, DBSCAN and EMC) were applied separately on datasets. Unsupervised methods fail to correctly cluster data-points and get the accurate model, making supervised algorithms a necessity. In supervised learning (Lazy modelling (k-NN, Naïve Bayes), logistic regression, SVM, decision trees and ANN) training instances labelled appropriately were applied. Logistic Regression and SVM models were generated through kernel function parameters such as dot, radial, polynomial, sigmoid and anova kernels. Four tree induction models such as Decision Tree, Decision Stump, Random Tree and Random Forest (generate trees up to 500) were applied for classification of datasets using four criteria (Gini Index, Information Gain, Gain Ratio and Accuracy) 45 . Additionally, CHAID, ID3 and weight-based parallel decision tree model was also run with aforementioned 11 different attribute weighting criteria. Finally, best tree induction models with highest prediction accuracy were SCIentIfIC RePoRts | (2018) 8:15548 | DOI:10.1038/s41598-018-33476-x selected for interpretation of adaptable codons. Furthermore, the feed-forward neural networks were employed on the datasets that were trained by a back propagation algorithm (such as multi-layer perceptron). The parameters described for neural networks are the size of all hidden layers. The number of nodes and neurons were chosen with an interval of 10 specified as hidden layer size. The accuracy of prediction was obtained for each supervised learning method for categorization of codon features into two labelled attributes of extremophile and non-extremophile dataset.