Introduction

Depression is acknowledged as a worldwide major public health concern by numerous international agencies and national governments1. According to the World Health Organization in 2016, depression accounts for 10% of the non-fatal disease burden worldwide2. It has a hereditary element, and can result from genetic and environmental influences. Depression represents a complex polygenic and multifactorial disorder where many genetic variants, each with a small or unnoticeable impact, combine to contribute to the resulting phenotype3. Genome-wide association studies (GWAS) have identified 178 genetic risk loci and 223 independently significant SNPs4. There are almost 1500 symptom combinations that fulfil the diagnostic criteria for depression, and any two patients with a depressive disorder may well have no symptoms in common5. Gender also has an impact, and women are nearly twice as likely as men to be diagnosed with depression. In this light, a greater genetic understanding of depression is needed to help achieve improvements in diagnosis and treatment6.

Convergent preclinical and clinical research data have revealed significant correlations among stress, depression, and epigenetic abnormalities. Depressive disorders are widespread, disabling, and costly illnesses that are linked to a decreased role in functioning and quality of life and an increase in medical comorbidity and mortality7. Numerous studies on depression have focused on mutations and the genetic composition of genes. In contrast, there has been minimal analysis of the codon usage bias (CUB) of genes associated with depression. CUB is the unequal use of synonymous codons of an amino acid in which some codons are utilized more often than others. Hence CUB analysis can prove valuable in aiding our understanding of molecular biology, genetics, and functional regulation of gene expression. Computational evaluation on codon bias has been of recent research interest to determine the role of codon preference in disorders with a genetic component, such as in anxiety, Alzheimer's disease, and others.

There are 61 codons that encode amino acids and, excluding methionine and tryptophan, each amino acid is encoded by two or more codons, which are called synonymous codons. Codons encode a total of 20 amino acids, and it is now well-established that synonymous codon usage is not random8. Although the amino acid sequence is not altered by a synonymous change, mRNA secondary structure and stability are affected9, as is the usage of the cognate tRNA. As a result, these alterations, previously thought to be phenotypically silent and frequently overlooked in investigations of human genetic diversity, are gaining the scientific community's attention as a reason behind several medical disorders. These synonymous codon changes may significantly alter gene expression levels10. Stop codon readthrough (SCR), for example, is a known phenomenon where translation is continued beyond the stop codon, and protein isoforms are generated. SCR is found to be associated with the codon context, and UGA is the leakiest stop codon11. In the context of physiological consequence, for the water channel Aquaporin 4 (AQP4), agents that stimulate an unusual SCR event were found to mediate improved Aβ clearance and, thereby, provide insight as well as a new potential therapeutic strategy for Alzheimer’s disease12. Rare codons can cause ribosomes to pause on an mRNA during translation and mediate premature chain termination. Indeed, some genetic conditions, like cystic fibrosis, may arise from incorrect stop codons in genes13. Bias in codon usage impacts mRNA stability and translation fidelity14. In the light of these facts, we hypothesize that CUB in depression-associated genes may, in part, underpin disease expression. A greater understanding of these patterns may help define new potential targets and/or markers for human disorders9,10, such as depression.

In this regard, whereas various studies have appraised point mutations and variants of genes involved in depression, to our knowledge, no study has yet been conducted on codon pattern analysis of such genes. Therefore, in the present study, our primary goal was to evaluate the codon preference of depression-associated genes. Additionally, skew, neutrality, parity, protein properties, gene expression, codon pair, and codon context analyses were also performed. Our overall analysis aids in revealing different molecular patterns in the depression-associated genes to help expose their molecular signatures.

Results

Result of pathway analysis

Pathway analysis for the envisaged genes was conducted through the PANTHER knowledgebase to understand the involvement of genes in various vital pathways. A total of 12 pathways were assigned to the 18 genes: 5-Hydroxytryptamine biosynthesis, 5HT1 type receptor-mediated signaling pathway, 5HT2 type receptor-mediated signaling pathway, 5HT3 type receptor-mediated signaling pathway, 5HT4 type receptor-mediated signaling pathway, Adrenaline and noradrenaline biosynthesis, Bupropion degradation, Dopamine receptor-mediated signaling pathway, Heterotrimeric G-protein signaling pathway-Gi alpha and Gs alpha mediated pathway, Huntington disease, Metabotropic glutamate receptor group II pathway and Nicotine degradation. Pathway analysis shows that these genes are mainly associated with signal transduction and metabolic processes.

Compositional analysis

Depression-related genetic tests were searched in the Genetic Testing Registry (GTR) of the National Center for Biotechnology Information. The tests gtr/tests/508961 by Assurex Health Inc, gtr/tests/569407 by Genomind (Professional PGx Express CORE Anxiety & Depression), and gtr/tests/579485 by Intergen Genetic Diagnosis and Research Centre presented a panel of 18 genes that are evaluated for the presence of depressive disorders. Different gene genotypes are available based on the SNPs; however, we accessed only the ‘reference’ coding gene sequences from the NCBI nucleotide database. Although a larger number of genes is preferable to support statistical analyses, this was the total number of genes available in the accessible panels targeted to a depression diagnosis and, hence, 18 gene sequences were obtained (for specifics, see Table 1).

Table 1 Depression associated genes evaluated for codon pattern analysis: their regular functions and roles during depression along with their modulated expression and SNP data.

Our compositional analysis of genes involved in depression revealed that GC3 content, which is an indicator of codon bias52, was highest amongst all other compositional parameters. Average %A, %C, %T and %G composition was 24.39%, 26.17%, 23.66% and 25.75%, respectively. In occurrence, these nucleotides appear in the order of %C > %G > %A > %T. At codon position one nucleotide composition %T1 (18.67%), at codon position two %G2 (17.82%) and at the third codon position %A3 (16.42%) were least, and %GC3 content varied between 41.80% and 83.82%.
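The positional composition values reported above can be illustrated with a minimal sketch. It assumes an in-frame coding sequence made up only of A, C, G and T, and it assumes that %GC12 is taken as the mean of the GC percentages at codon positions one and two; the function name `positional_gc` and the toy sequence are hypothetical and are not taken from the study.

```python
def positional_gc(cds: str) -> dict:
    """Return overall %GC, %GC12 and %GC3 for an in-frame coding sequence."""
    seq = cds.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("sequence length must be a non-zero multiple of 3")

    def gc_percent(bases: str) -> float:
        # Percentage of G or C among the given bases.
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    # Slice out the first, second and third codon positions.
    pos1, pos2, pos3 = seq[0::3], seq[1::3], seq[2::3]
    return {
        "GC": gc_percent(seq),
        # Assumption: GC12 is the average of the position-1 and position-2 values.
        "GC12": (gc_percent(pos1) + gc_percent(pos2)) / 2,
        "GC3": gc_percent(pos3),
    }


# Toy sequence (hypothetical): ATG GCC CTG TAA
example = positional_gc("ATGGCCCTGTAA")
```

Whether stop codons or the single-codon amino acids are excluded before computing %GC3 differs between tools; this sketch counts every codon it is given.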

GC content (GC12 and GC3) effects on gene length

The coding-sequence lengths possess an evolutionary meaning in relation to GC content compositional variations in DNA. An analysis of the genome database revealed a richness of GC in the longest coding sequences in vertebrates and prokaryotes, with the additional observation that the shorter versions of these are GC poor53. A Pearson correlation coefficient (r) was obtained based on the linear correlation between the two data sets. This analysis revealed a lack of correlation between length and GC components %GC12 and %GC3, which indicated no dependency of %GC content on lengths of genes. A trend was observed that among all 18 evaluated genes, most of the genes had a size between 1350 and 1650 bp. Furthermore, in all the genes, %GC3 content was higher than %GC12. Gene lengths were normalized by dividing them by 100 to be comparable with the percent GC composition. A depiction of normalized gene length and %GC3 content is given in Fig. 1. To evaluate correlation trends between length and %GC content, we additionally appraised the correlation between the adjusted length and %GC content of a set of 62 housekeeping genes. We found that length negatively correlates with %GC3 (Pearson correlation coefficient r = -0.263, p < 0.05) in housekeeping genes (Supplementary Table S1).

Figure 1

Length vs %GC3 content in depression (top) and housekeeping (bottom) genes.

Dinucleotide ratio analysis

Dinucleotides CpG, GpT, and TpA were either underrepresented or randomly presented (odds ratio < 1.6) in all the genes envisaged. On the other hand, ApG, CpT, GpA, and TpG dinucleotides were either overrepresented or randomly presented (odds ratio > 1.6).
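As an illustration of how a dinucleotide odds ratio is commonly obtained, the sketch below divides the observed frequency of each dinucleotide by the product of the frequencies of its two constituent nucleotides. This is one widely used formulation and is assumed here, since the exact procedure is not spelled out in this section; the function name is hypothetical, and the thresholds for calling a dinucleotide over- or underrepresented are deliberately left out.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq: str) -> dict:
    """Observed/expected ratio for every dinucleotide present in seq."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono = Counter(seq)
    # Overlapping dinucleotides: positions (0,1), (1,2), ...
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for pair, count in di.items():
        f_xy = count / (n - 1)      # observed dinucleotide frequency
        f_x = mono[pair[0]] / n     # frequency of the first nucleotide
        f_y = mono[pair[1]] / n     # frequency of the second nucleotide
        ratios[pair] = f_xy / (f_x * f_y)
    return ratios


# Toy input (hypothetical), not one of the 18 genes.
toy_ratios = dinucleotide_odds_ratios("AACC")
```

A ratio near 1 means the dinucleotide occurs about as often as its base composition predicts.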

RSCU analysis shows preference of GC ending codons

The overall RSCU analysis revealed that GC ending codons were preferred over AT ending codons. CTG and GTG codons were the most overrepresented codons, whereas TTA, GTA, ATA, CTA, CGT, ACG, GCG, CCG, and TCG codons were the most underrepresented codons (Fig. 2). RSCU values of depression associated genes are shown in Table 2. To determine the correlation trends between length and %GC content, we further sought a correlation between adjusted length and %GC content of a set of 62 housekeeping genes. Also, we compared RSCU values of depression-associated genes with the RSCU values of housekeeping genes, and, based on t-test, it was evident that codon usage was significantly different (t = 3.58, p < 0.0001) for codon GTA. In addition to this, codons GTG, CCC, GAT, and GAC also differed at a 10% significance level (Table 3).
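For readers unfamiliar with the measure, RSCU is the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally. The sketch below is a minimal illustration under the standard genetic code; it excludes methionine, tryptophan and stop codons, leaving 59 codons. The function name and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict:
    """RSCU per codon for amino acids with at least two synonymous codons."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Group synonymous codons, skipping stops (*), Met (M) and Trp (W).
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":
            families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in codons:
            # observed count / (family total / family size)
            values[c] = counts[c] * len(codons) / total
    return values


# Toy sequence (hypothetical): three CTG codons and one TTA codon.
toy_rscu = rscu("CTGCTGCTGTTA")
```

An RSCU of 1 indicates no bias for that codon; values above or below 1 indicate more or less frequent use than expected under equal usage.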

Figure 2

RSCU values of different codons in the 18 depression-associated genes show an underrepresentation of A/T ending codons.

Table 2 RSCU values of individual genes.
Table 3 The t-test analysis between RSCU values of depression and housekeeping genes with 1000 bootstrap value, wherein iteratively resampling a dataset with replacement is involved.

Relationship between codon bias, nucleotide skews and gene length

CUB had a significant positive association (r = 0.863, p < 0.001) with the length of proteins. We also investigated the relationship between protein length and protein expression level, but no correlation was observed. Nucleotide disproportion is referred to as skew. Various skews, including AT skew, GC skew, purine skew, pyrimidine skew, keto skew, and amino skew, are available to assess the effects of nucleotide disproportion on any parameter under consideration. Herein, we compared the effects of various skews on CUB, and found that only the pyrimidine, amino and keto skews had a significant positive correlation with scaled chi-square (SCS) values (r = 0.767, p < 0.05; r = 0.756, p < 0.01; and r = 0.793, p < 0.01, respectively; Spearman correlation “r” with Bonferroni correction). Different nucleotide skew values are given in Table 4.
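The six skews can be sketched as simple normalized differences between nucleotide counts. The definitions below follow one common convention and are an assumption, because the formulas and sign conventions used for Table 4 are not given in this section; the function name is hypothetical.

```python
def nucleotide_skews(seq: str) -> dict:
    """Six nucleotide skews, each computed as (x - y) / (x + y)."""
    s = seq.upper()
    a, t, g, c = (s.count(b) for b in "ATGC")

    def skew(x: int, y: int):
        # None when both nucleotides are absent (ratio undefined).
        return (x - y) / (x + y) if (x + y) else None

    # Assumed conventions; other sources may flip the signs.
    return {
        "AT": skew(a, t),
        "GC": skew(g, c),
        "purine": skew(a, g),       # A versus G
        "pyrimidine": skew(t, c),   # T versus C
        "keto": skew(g, t),         # G versus T
        "amino": skew(a, c),        # A versus C
    }


# Toy input (hypothetical).
toy_skews = nucleotide_skews("AAATGC")
```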

Table 4 Nucleotide skew in relation to the 18 depression associated genes.

CUB and gene expression profiling

Codon adaptation index (CAI) is used as a quantitative method of predicting the level of expression of a gene based on its codon sequence54. In the study of Sahoo et al.55, critical analysis of predicted highly expressed (PHE) genes in Arabidopsis thaliana was performed by considering the expression data from Gene Expression Omnibus (GEO) datasets, where protein expression levels are quantified by RMA (Relative Molecular Abundance) signal intensity. The linear Pearson correlation coefficient between RMA and CAI showed a statistically significant correlation (r = 0.47, p < 0.05). In another experiment conducted by Guimaraes et al.56, protein abundance (PA) was measured for > 800 genes in. CAI was found to be significantly correlated with PA after controlling for mRNA abundance (r = 0.3526, P ≤ 0.001). The above examples indicate that CAI might be conveniently used as a surrogate for protein expression. Thus, we used CAI values (calculated through the CAIcal server, developed by Puigbo and colleagues (2008)) as expression data for depression genes, to correlate with their respective gene lengths57.
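To make the index concrete, CAI is the geometric mean of the relative adaptiveness weights of the codons in a gene, where each weight is the usage of a codon in a reference set relative to the most used synonymous codon. The sketch below assumes the weights are already available as a dictionary; the weights shown are hypothetical illustration values, and this is not the CAIcal implementation.

```python
import math


def cai(cds: str, weights: dict) -> float:
    """Geometric mean of the weights of the codons found in cds.

    Codons without a weight (for example ATG, TGG and stop codons)
    are skipped.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no codon in the sequence has a weight")
    return math.exp(sum(logs) / len(logs))


# Hypothetical weights for two leucine codons only.
toy_weights = {"CTG": 1.0, "TTA": 0.25}
toy_cai = cai("CTGTTA", toy_weights)
```

Weights must be strictly positive; tools differ in how they handle codons that never occur in the reference set.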

The CAI values of the genes associated with depression displayed values ranging from 0.713 (UGT2B15) to 0.85 (CYP1A2). The CAI value has a significant negative association with the SCS value (r = − 0.910, p < 0.001), and this indicates that in highly expressed genes, low codon bias is present58. A higher CAI indicated a relatively high protein expression level. Most of the AT ending codons have a significantly negative relationship with CAI, except for GTA, CGT, GCT (bearing no relationship with CAI). In contrast, most GC ending codons had a significant positive relationship with CAI, except for GTC, CTC, ACG, and TCG (with no relationship with CAI). The only exception was codon TTG that had a significant negative relationship with CAI.

Codon context analysis revealed a context between stop codon UGA and other amino acid encoding codons

Whereas codon bias is the preferential use of particular codons, codon context refers to the occurrence of sequential pairs of codons in a gene59. Codon context, additionally, is a feature that influences gene expression independently of codon bias60. In this light, codon context analysis was undertaken on the 18 genes associated with depression. The trend for codon context variation is depicted as a matrix of 64 × 64 codons. The total number of codon pairs observed in the 18 genes is 2047. As illustrated in Fig. 3, highly used codon pairs are displayed in green, whereas lesser-used codon pairs are presented in red. The rows display 5’ codons, whereas the columns display 3’ codons (Fig. 3). It is clear from the Figure that stop codon UAG exhibited high context with many of the amino acid encoding codons. In addition, all kinds of contexts (positive, negative and no context) were observed between the codons of the envisaged genes.
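The raw input to a codon context matrix is a count of adjacent codon pairs, with the 5' codon indexing the row and the 3' codon indexing the column. The sketch below only tallies these pairs; it does not reproduce the statistical residuals that decide the colors in Fig. 3, and the function name and toy sequence are hypothetical.

```python
from collections import Counter


def codon_pair_counts(cds: str) -> Counter:
    """Count adjacent (5' codon, 3' codon) pairs in an in-frame sequence."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Pair each codon with the codon immediately downstream of it.
    return Counter(zip(codons, codons[1:]))


# Toy sequence (hypothetical): CTG CTG GTG GTG TGA
toy_pairs = codon_pair_counts("CTGCTGGTGGTGTGA")
```

Summing such counters over all genes gives the observed cell values of the 64 × 64 matrix; pairs never seen simply stay at zero.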

Figure 3

Codon context analysis for depression-associated genes. The green color portrays highly used codon pairs, whereas red represents lesser-used codon pairs. A pink color depicts a null usage of codons. Codon UGA and UAG were found paired with some specific codons. Statistically insignificant values are depicted as black.

Arginine or proline initiated codon pairs are abundant

Of the top 15 overrepresented codon pairs, only two contained CpG or TpA. Of the 540 rare codon pairs (absent codon pairs are excluded), the largest share, 75 codon pairs, was arginine initiated, followed by 65 codon pairs that were proline initiated. Methionine-initiated codon pairs were the rarest (9 only). Among the 15 most preferred codon pairs, the largest group (4 pairs) was leucine initiated (Table 5). These results indicate a distinct pattern of codon pair preference or avoidance due to multiple evolutionary forces acting on depression-associated genes.

Table 5 Codon context analysis for top 15 overrepresented and rare codon pairs.

Nucleotide disproportion influence on protein indices

We envisaged six nucleotide skews, namely AT skew, GC skew, purine skew, pyrimidine skew, keto skew, and amino skew. We performed Pearson linear correlation analysis between the nucleotide skews and protein properties to determine whether nucleotide disproportion influences physical protein properties (Table 6). Amino skew did not correlate with any of the protein properties envisaged. The results are suggestive of the effect of nucleotide disproportion on protein properties.

Table 6 Evaluation between nucleotide skew and protein properties.

Translation selection P2 is suggestive of a role of selectional forces

Translation selection (P2) values indicate the binding strength between the codon and anticodon. This was determined using the values of WWC, SSC, WWU, and SSU using the average RSCU values, and a value of 1.01 indicates strong selectional forces behind it.

Neutrality analysis confirms major role of selectional forces

Regression analysis between %GC3 and %GC12 provided a slope value of 0.3276, which indicated that relative neutrality was 32.76% and the relative constraint was 67.24% (Fig. 4A). This signifies that selectional force (67.24%) was dominant over mutational force (32.76%). The graph also indicates that %GC3 is responsible for 71.7% of the variation in %GC12. Additionally, %GC12 and %GC3 are significantly positively correlated (r = 0.846, p < 0.001).
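The arithmetic behind a neutrality plot can be sketched as an ordinary least-squares fit of %GC12 on %GC3, with the slope read as relative neutrality and one minus the slope as relative constraint. The sketch below uses made-up toy values rather than the data of the 18 genes, and the function name is hypothetical.

```python
def neutrality_fit(gc3: list, gc12: list) -> tuple:
    """Least-squares slope and R^2 for GC12 regressed on GC3."""
    n = len(gc3)
    if n < 2 or n != len(gc12):
        raise ValueError("need two equally long lists with at least two points")
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    syy = sum((y - mean_y) ** 2 for y in gc12)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared


# Toy values (hypothetical), perfectly linear for easy checking.
toy_slope, toy_r2 = neutrality_fit([40.0, 60.0, 80.0], [45.0, 50.0, 55.0])

# Reading a slope such as the reported 0.3276:
neutrality_pct = 100 * 0.3276        # mutational contribution
constraint_pct = 100 * (1 - 0.3276)  # selectional contribution
```

The reported values are mutually consistent: a correlation of 0.846 squared gives about 0.716, matching the stated 71.7% of explained variation.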

Figure 4

(A) Regression analysis between average %GC content at codon position one and two (%GC12) and %GC content at the third codon position (%GC3). (B) Parity plot comprising GC bias (G3/(G3 + C3)) on the abscissa and AT bias (A3/(A3 + T3)) on the ordinate. (C) ENc-GC3 analysis showing presence of data points below the expected Nc curve depicting prevalence of selection force. (D) Regression between CAI and ENc (effective number of codons) revealed that 81.81% variations in CAI are attributed to ENc and thus on codon bias.

Parity analysis revealed preference of T and C over A and G nucleotides

Parity analysis determines the bias between A/T and C/G at the third codon position. At the center of the plot, where both axis values are 0.5, A = T and C = G. In the present study, the average position was x = 0.469 ± 0.050 (AT bias) and y = 0.439 ± 0.054 (GC bias). A bias value of less than 0.5 indicates a preference for pyrimidines over purines61. Herein, our analysis indicated that thymine is preferred over adenine, and that cytosine is preferred over guanine (Fig. 4B).
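The two coordinates of a parity plot are simple ratios of third-position counts, A3/(A3 + T3) and G3/(G3 + C3), so a value of exactly 0.5 means the two nucleotides are equally frequent. The sketch below is a minimal illustration with a hypothetical function name and toy sequence.

```python
def parity_biases(cds: str) -> tuple:
    """Return (AT bias, GC bias) at the third codon position."""
    third = cds.upper()[2::3]  # every third nucleotide, starting at position 3
    a3, t3 = third.count("A"), third.count("T")
    g3, c3 = third.count("G"), third.count("C")
    # None when the denominator is zero (ratio undefined).
    at_bias = a3 / (a3 + t3) if (a3 + t3) else None
    gc_bias = g3 / (g3 + c3) if (g3 + c3) else None
    return at_bias, gc_bias


# Toy sequence (hypothetical): GCA GCT GGG GGC, third positions A, T, G, C.
toy_at_bias, toy_gc_bias = parity_biases("GCAGCTGGGGGC")
```

Values below 0.5 on either ratio correspond to the pyrimidine (T or C) being more frequent than the purine (A or G) at the third position.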

Relationship of codon bias with %GC3 content and gene expression

An ENc (effective number of codons) versus GC3 plot is generally used to study the effect of %GC3 composition, which is suggestive of both a mutational force and compositional parameter on codon bias. In the event that codon choice is constrained by mutational force alone, all the data points will lie on or just below the GC3 curve, whereas in the case of an operating selection force, the data points are well below the GC3 curve62. In the present study, only a few points were present near the curve. The rest of the data points are present below, suggesting selection force as a dominant force in shaping codon usage in depression-associated genes (Fig. 4C). Furthermore, we investigated the effect of codon bias on gene expression by regressing them. Since ENc is the non-directional measure of codon bias, a negative correlation between them (Pearson correlation r = − 0.904, p < 0.0001) indicates that gene expression also increases with increasing codon bias. Overall, 81.81% variation in gene expression is attributed to codon bias (Fig. 4D).
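The expected curve in an ENc versus GC3 plot is usually drawn from Wright's approximation, ENc = 2 + s + 29 / (s² + (1 − s)²), where s is the GC3 fraction. The sketch below assumes that this standard formula underlies the curve in Fig. 4C, which this section does not state explicitly; the function name is hypothetical.

```python
def expected_enc(gc3_fraction: float) -> float:
    """Expected ENc under GC3 compositional constraint alone."""
    s = gc3_fraction
    if not 0.0 <= s <= 1.0:
        raise ValueError("GC3 must be given as a fraction between 0 and 1")
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


# Expected values over the GC3 range reported for the 18 genes
# (41.80% to 83.82%, converted to fractions).
enc_at_low_gc3 = expected_enc(0.4180)
enc_at_high_gc3 = expected_enc(0.8382)
```

A gene whose observed ENc lies well below the value returned for its GC3 has stronger codon bias than composition alone would predict.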

Effects of mutation pressure on codon composition is highest for G and least for T nucleotide

Mutation at the third position of a codon often does not change the amino acid encoded by it, and this position is therefore called the silent position of the codon because of the redundancy of the code. Nevertheless, this position is affected by mutational force since, here, mutation changes the nucleotide but not the meaning of the codon. The effect of mutational force on composition was 92.55%, 84.28%, 88.9%, and 93.25% for nucleotides A, T, C, and G, respectively (Fig. 5). In this regard, it is clear from Fig. 5 that mutational forces on the G nucleotide contributed the most in relation to determining its composition (93.25%), whereas mutational forces on nucleotide T contributed least towards determining its composition (84.28%).

Figure 5

Regression analysis between overall nucleotide content and content at the third codon position. Panel (A): A3 and A; panel (B): T3 and T; panel (C): C3 and C; panel (D): G3 and G.

Principal component analysis

Principal component analysis was undertaken using the RSCU values of 59 codons. Figure 6 represents the correspondence analysis and reveals that the first two axes account for significant variation (50.46% and 10.88%, respectively). The third and fourth axes account for 6.64% and 5.78% variation, respectively, and the combined contribution of the first four axes is 73.76%. Based on the loading values, codons AGA, CTG, CGC, and GGA influence CUB the most in depression-associated genes. The first and second principal component (PC1 and PC2) scores of different genes are provided in Fig. 6.

Figure 6

PCA analysis for 18 depression associated genes. Orange and green bars depict the loading scores of PC1 and PC2 for the genes.

Discussion

Depression is a disorder with a wide range of symptoms. In evaluating patients with depression, GWAS has revealed a high degree of polygenicity that underlies the mental illness and related complex phenotypes, and has discovered that many SNPs with relatively small effect size, when combined, potentially contribute to phenotype development4. Polygenicity includes some genetic heterogeneity; affected people may have different combinations of risk alleles, and unaffected people will also carry many of these variants. Depression is clearly a heterogeneous condition, as evidenced by the fact that two people can be diagnosed with depression but have no common symptoms. Added to this, neurodegenerative disorders too63 can potentially contribute to depression64.

In this light, various studies have been undertaken to understand the physiology and genetics behind depression. To our knowledge, however, no previous work has described the compositional features and codon usage patterns of genes associated with depression. Hence, the present research focuses on the codon usage of genes associated with depression. Our evaluation used a panel of 18 genes that have been associated with depression (Table 1). Although this number is not optimal and can be considered by some to be undersized for statistical analyses, it is the maximum number of genes available for depression detection from the NCBI gene testing registry. The products of genes are involved in multiple biological functions and pathways (given in Table 1), and altered expression levels or SNPs can lead to various genotypes that result in diseased conditions or different response to medications.

Nucleotide composition is imperative in knowing the codon usage since many of the parameters associated with codon usage indices, including nucleotide skew, neutrality, and parity plots, are composition dependent. Compositional analysis revealed that %C occurrence was highest, with the lowest occurrence of %T. The %GC3 content was the most variable compositional parameter and varied between 41.80 and 83.82%.

CAI is a measure of gene expression level, and this measure compares the codon composition of a gene with a reference set of genes65. Our study found a range of CAI values between 0.713 (UGT2B15) and 0.85 (CYP1A2). In Escherichia coli (E. coli), which has long been regarded as a model organism in the study of CUB, the highest CAI value of 0.85 was for the lpp gene, one of the most abundant genes, encoding an outer membrane lipoprotein66. Hence, it can be speculated that the CAI value of 0.85 (CYP1A2 gene) in our depression study is likely also associated with high-level expression. The relationship between the CAI and expression value can be better understood in the light of an experiment conducted by Dos Reis et al.58, who distributed the E. coli genes into three groups based on codon usage and expression level data obtained from microarray experiments. They found a positive relationship between the CAI value and expression level in one of these groups. In another group, the genes with low CAI were highly expressed, which contradicts the set paradigm of CUB, where optimal codon usage leads to higher CAI. However, the results are still explainable based on the mutation-selection balance hypothesis of codon usage. High CAI values were also obtained in the present study, indicating a higher expression level. However, other dynamic factors, including mutational-selectional balance, could be contributing factors to the expression. CAI is associated with compositional constraints and can potentially show all relationships (negative, positive, and no correlation). Hence it can be inferred from this study that the gene expression level depends on the base composition. Such a phenomenon could reflect the compositional pressure on CUB, which ultimately drives gene expression.

Our view is supported by the results of Sahoo et al.67, who described the critical role of codon composition in regulating the gene expression profile in the Arabidopsis thaliana genome (a small plant from the mustard family native to Eurasia and Africa) based on the score of modified relative codon bias. A study by Franzo and colleagues68, likewise, demonstrated that CUB is highly affected by nucleotide composition in an evaluation of an infectious bronchitis virus. The genes associated with depression showed an interesting pattern related to nucleotide composition and CUB. After comparing compositional constraint relationships with SCS, one of the measures of CUB, we found a negative relationship of SCS with G nucleotides (overall %G, %G2, %G3 and %GC2) only. This signifies the importance of the G nucleotide in determining codon usage.

Codon usage bias is affected by several factors, and gene length is one of them. Based on a study on codon usage in 8,133, 1,550, and 2,917 genes from Caenorhabditis elegans, Drosophila melanogaster and Arabidopsis thaliana, respectively, a significant negative association between codon usage and protein length was reported69. On the other hand, Eyre-Walker70 found a positive association between codon usage and gene length, suggesting selection against missense errors in E. coli. In this light, it can be inferred that length can have both positive and negative impacts on CUB, depending on the model organism under evaluation. CUB and protein length were positively correlated with GC3 content and the correlation was stronger for %GC12 content in all the proteins envisaged, without any exception. Our results agree with Khandia et al.71, who found that in all the proteins whose size ranged between 150 and 3000 amino acids in a study focused on primary immunodeficiency and cancer, GC12 content was lower than GC3, without any exception.

In our current study, dinucleotides CpG, GpT, and TpA were underrepresented, whereas ApG, CpT, GpA, and TpG were overrepresented. In the human ORFome (open reading frames within a genome), CpG and TpA dinucleotides show the highest level of suppression, and GpT is the third of those with the lowest abundance72. Thus, it appears that depression-related gene sets also follow the common trend of odds ratio present in human ORFome. CpG dinucleotides occur at a low frequency in the human genome, and this is attributed to a higher mutation rate of 5-methylated CpG to TpG, and, as a result, the TpG dinucleotide is increased73. Contrary to the results of Kunec and Osterrieder72 and to ours, Franzo et al.68 found an overrepresentation of GpT dinucleotide. ApG, CpT, GpA, and TpG overrepresentation partially concord with Franzo et al.68, who reported ApG and TpG dinucleotide pairs overrepresented in the whole-genome, and CpT in the polyprotein region only in infectious bronchitis virus. Such results suggest that the odds ratio might serve as a molecular signature74.

RSCU analysis indicated that GC ending codons were preferred over AT ending codons; however, parity analysis indicated that T and C nucleotides are preferred over A and G nucleotides. In accordance with the results of nucleotide analysis, codons encompassing TpA and CpG dinucleotides (TTA, GTA, ATA, CTA, CGT, ACG, GCG, CCG, TCG) were underrepresented. The overrepresentation of CTG and GTG codons observed in the present study matches the results of Khandia et al.71, who found overrepresentation of CTG and GTG in 78.33% and 68.33% of genes common to primary immunodeficiency and cancer, respectively. This abundance of CTG and GTG codons might have come from the conversion of CpG to the TpG dinucleotide, an integral part of the CTG and GTG codons. Such results suggest that RSCU bias is the result of dinucleotide bias72, which in turn is a consequence of intrinsic characteristics and evolutionary forces like selection and mutation75.

The codons also influence the gene expression level, and it was observed that most AT-ending codons have a negative association with CAI. In contrast, most GC ending codons have a positive association with CAI. The only exception to this was the codon TTG, which is negatively associated with CAI. The two codons, AGG and TTG, behave differently in the human genome. When the other C and G ending codons are decreased, these two increase76, which is probably why they are inversely affected by CAI.

Compositional properties affect codon usage and nucleotide disproportion too. Nucleotide disproportion (skews) also affects CUB and, in the Nipah virus, an association between CUB and nucleotide skew similarly has been reported77. We found CUB becomes affected by purine skew. Various skews significantly affected different protein indices, also suggestive of the role of compositional constraints on the physical properties of proteins. In mitochondrial NADH dehydrogenase genes (ND genes, encoding for respiratory complexes) of Amphibia, amino skew, purine skew, and keto skew showed a significant correlation with ENc, thereby demonstrating that skewness can potentially affect the CUB78. In the genes associated with depression, %GC12 and %GC3 are found to be significantly positively correlated (r = 0.846, p < 0.001), and this correlation is suggestive of the role of mutational force in shaping codon usage79.

CUB and codon context bias are important parameters to be considered during heterologous protein expression80. In our study, it was evident that a few of the codons remain minimally used, and this is in accord with the studies of Chakraborty et al.81 on codon context in leukemia-associated genes. The identical codon pairs GTG-GTG and CTG-CTG were the most favored codon pairs in the depression-associated gene set. Here, co-tRNA and identical codon pairing help conserve resources and enhance translational efficacy by up to 30%82.

In the present study, we performed gene correlation analysis to determine whether the genes involved in similar functions share similar attributes or not. Gene correlation analysis was undertaken based on RSCU to determine whether genes have a similar kind of codon usage or not. The data indicated that all the 18 genes evaluated displayed similar codon choices, as evidenced by the positive relationship among all the genes in the study. However, the correlation value varied at different levels, and few genes did not display correlation. When the gene correlation was studied at the protein indices level, all genes were found positively correlated except for the CYP3A4 gene, which showed no correlation with any of the genes. Such analysis helps determine how genes involved in one kind of ailment may be similar based on different parameters, and we found similarities between them based on RSCU and protein indices.

Translational selection (P2) refers to the strength of the binding force between the codon and anticodon, and indicates selectional pressure. In the four cotton species (G. arboreum, G. raimondii, G. hirsutum and G. barbadense), P2 values were more than 0.5. In this light, our result indicates the dominant role of selection over mutation pressure in the codons’ usage83.

Upon evaluating the effects of mutational forces on overall nucleotide composition, it was evident that mutational pressure affected nucleotides A and G to a similar extent (92.55% and 93.25%, respectively), whereas nucleotide T was least affected (84.28%). Principal component analysis indicated that the codon usage by genes is majorly influenced by G and C ending codons. Overall analysis revealed the importance of compositional, mutational, and selectional pressure. However, the role of selection pressure was dominant over the others84. There are a few striking similarities in neurobiological alterations between depressive disorders and neurodegeneration, as in Alzheimer’s, Parkinson’s, and Huntington’s disease64. In the study of Khandia et al.63, codon pattern studies in neurodegeneration-related gene sets have been undertaken with minor overlap, in which gene composition, dinucleotide analysis, RSCU, CAI, and different protein indices were evaluated. In the future, parameters like codon pair occurrence, codon context, and the effects of codon bias on gene expression might be investigated in such genes.

The present study investigates different molecular patterns and relative synonymous codon usage in 18 depression-associated genes. Of these 18 genes, nine showed modulated gene expression during the depressive state: BDNF, COMT, CYP2C9, CYP3A4, HTR2A, SLC6A4, and MTHFR showed reduced expression, while UGT1A1 and CYP2C19 showed enhanced expression. For the other genes, the genotypes (related to SNPs) associated with depression or with the response to depression therapy could not be included in the study, because the SNPs responsible for depression might be present in promoter, repeat, exon, intron, or leader sequences85, whereas the analysis of codon usage, codon pairs, CAI, and other patterns applies only to protein-encoding sequences. Consequently, we acquired only the coding sequences of the studied genes, which were available as RefSeqGene in the NCBI database. For the seven genes whose expression is downregulated during depression, the expression level might theoretically be corrected by introducing a copy of the gene (for example, using currently employed gene therapy methods such as CRISPR-Cas) in which codons with lower RSCU values are replaced by synonymous codons with higher RSCU values. The resulting enhancement of gene expression might be predicted using the CAI. The current study thereby opens potential new hypotheses and avenues for future research.

Conclusion

In the CUB evaluation of depression-associated genes, compositional analysis revealed that %C was the highest, followed by %G, %A, and %T. Among all compositional constraints, %GC3 was the most variable. All 18 genes included in the study had high CAI values, indicating high-level gene expression. Additionally, within the present study, the gene expression level was driven by compositional constraints. Interestingly, CUB in depression-linked genes is associated solely with the overall G nucleotide composition and the G composition at the second and third codon positions, pointing to the effect of the G nucleotide compositional constraint on CUB. Codon bias was positively correlated with the length of the gene, indicating increased bias with the length of the protein. CpG, GpT, and TpA dinucleotides were underrepresented, while ApG, CpT, GpA, and TpG dinucleotides were overrepresented. The dinucleotide pattern was reflected in the RSCU values of codons: all CpG- and TpA-containing codons have low RSCU values and are underrepresented. Likewise, the overrepresented dinucleotide TpG is reflected in the overrepresented codons CTG and GTG. Among the nucleotide skews evaluated, purine skew was found to affect CUB. A highly significant positive relationship between GC3 and GC12 indicated the role of mutational force in shaping codon usage. The neutrality plot exhibited the prominent role of the selection force in shaping codon utilization. The parity plot results further supported this notion, showing that T and C nucleotides are preferred over A and G nucleotides. Based on translational selection (P2) analysis, it could be inferred that the genes had low codon bias. Gene correlation analysis based on RSCU revealed a variable degree of positive correlation among genes showing a similar codon usage pattern, which the PCA further established. All the genes clustered together, indicating a similar codon choice.
Codon context analysis revealed an abundance of the identical codon pairs GTG-GTG and CTG-CTG, which enhance translational rates and result from selection forces. Based on the study, a synthetic construct could potentially be designed using the information on relative synonymous codon usage, codon bias, codon pair bias, and CAI. Such a construct might help modulate gene expression. For example, for the seven genes studied here that are downregulated during depression, restoring an overexpressing copy within the body through gene therapy might potentially curb the ailment; this provides a hypothesis and a potential avenue for future research.

Material and methods

Pathway analysis

For the studied genes, pathway analysis was conducted using the PANTHER knowledgebase. The database provides comprehensive information regarding the evolution of protein-coding gene families. The database was accessed at https://www.pantherdb.org/.

Compositional analysis (overall and at various positions of codon)

A panel of 18 gene sequences targeted to a depression diagnosis was available from the Genetic Testing Registry, National Center for Biotechnology Information Search database (gtr/tests/508,961 by Assurex Health Inc, gtr/tests/569,407 by Genomind Professional PGx Express CORE Anxiety & Depression, and gtr/tests/579,485 by Intergen Genetic Diagnosis and Research Centre). Each of the genes could have different isoforms/genotypes; hence, we acquired the 'reference' gene sequences (RefSeqGene) from the National Center for Biotechnology Information Search database, selected the feature 'CDS', converted it into 'fasta format', and used it for further studies. Information regarding these genes is given in Table 1.

The composition of nucleotides affects various codon usage parameters. The overall composition of individual nucleotides and their composition at each of the three codon positions were determined for these 18 genes using the CAIcal software57. The average of the GC composition at the first (%GC1) and second (%GC2) codon positions, viz. %GC12, and the GC composition at the third position (%GC3) were used in the neutrality analysis. The %AT and %GC compositions at the third codon position were used in the parity analysis.
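
As an illustration of the quantities named above, the following minimal Python sketch computes overall and position-specific nucleotide percentages for a single coding sequence. It is not the CAIcal implementation; the function name `composition` and the dictionary keys are assumptions made for this example.

```python
from collections import Counter


def composition(cds):
    """Return overall and per-codon-position nucleotide percentages of a CDS.

    Hypothetical helper for illustration only; not the CAIcal code.
    """
    cds = cds.upper().replace("U", "T")
    if not cds or len(cds) % 3 != 0:
        raise ValueError("CDS must be non-empty with a length divisible by 3")

    result = {}
    overall = Counter(cds)
    for base in "ACGT":
        result[f"%{base}"] = 100.0 * overall[base] / len(cds)
    result["%GC"] = result["%G"] + result["%C"]

    # Slice out the first, second and third codon positions in turn.
    for pos in range(3):
        site = cds[pos::3]
        counts = Counter(site)
        for base in "ACGT":
            result[f"%{base}{pos + 1}"] = 100.0 * counts[base] / len(site)
        result[f"%GC{pos + 1}"] = 100.0 * (counts["G"] + counts["C"]) / len(site)

    # %GC12 is the mean of %GC1 and %GC2, as used in the neutrality analysis.
    result["%GC12"] = (result["%GC1"] + result["%GC2"]) / 2
    return result
```

Calling `composition` on each of the 18 coding sequences would give the %GC12 and %GC3 pairs needed for the neutrality analysis and the third-position values needed for the parity analysis.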

Dinucleotide odds ratio analysis

The odds ratio of a dinucleotide is the ratio between its observed frequency and the frequency expected from the frequencies of its two constituent nucleotides. An odds ratio below 0.73 is indicative of under-representation, whereas values above 1.23 indicate over-representation of a dinucleotide pair62.
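
The calculation can be sketched as follows, assuming that the expected frequency of a dinucleotide XY is the product of the frequencies of X and Y in the same sequence. The function names are hypothetical, and the cut-offs are simply the values quoted in the text above.

```python
from collections import Counter
from itertools import product


def dinucleotide_odds_ratios(seq):
    """Observed/expected frequency ratio for each of the 16 dinucleotides."""
    seq = seq.upper().replace("U", "T")
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    # Overlapping dinucleotides: positions (0,1), (1,2), (2,3), ...
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_pairs = len(seq) - 1

    ratios = {}
    for x, y in product("ACGT", repeat=2):
        expected = base_freq[x] * base_freq[y]
        observed = pairs[x + y] / n_pairs
        # The ratio is undefined when one of the bases never occurs.
        ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios


def classify(ratio, low=0.73, high=1.23):
    """Label an odds ratio using the thresholds quoted in the text."""
    if ratio < low:
        return "under"
    if ratio > high:
        return "over"
    return "neutral"
```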

Relative synonymous codon usage (RSCU) analysis

The RSCU of a codon is the ratio of its observed frequency to the frequency expected if all synonymous codons of that amino acid were used equally, and is calculated using the formula

$$RSCU_{ij} = \frac{X_{ij}}{\frac{1}{n_{i}}\sum\nolimits_{j = 1}^{n_{i}} X_{ij}}$$

where Xij stands for the frequency of the jth codon for the ith amino acid and ni is the number of synonymous codons for the ith amino acid (ith codon family).

Codons with RSCU values below 0.6 are considered underrepresented, and codons with RSCU values above 1.6 are considered overrepresented86.
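
A minimal Python sketch of the RSCU formula is given below. It assumes the standard genetic code (NCBI translation table 1); the names `CODON_TABLE`, `FAMILIES` and `rscu` are introduced only for this example, and stop codons are ignored.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group the sense codons into synonymous families, one per amino acid.
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def rscu(cds):
    """Return {codon: RSCU} for every amino acid present in the CDS."""
    cds = cds.upper().replace("U", "T")
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    values = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this gene
        expected = total / len(family)  # (1/n_i) * sum_j X_ij
        for c in family:
            values[c] = counts[c] / expected
    return values
```

For a sequence encoding four leucines as CTG, CTG, CTG and CTA, the expected count per leucine codon is 4/6, so CTG receives an RSCU of 4.5 and CTA an RSCU of 1.5.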

Determination of scaled Chi square value (SCS)

The SCS, unlike the effective number of codons (ENc), is a directional measure of CUB87. SCS values were calculated for each of the genes implicated in depression.
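
The text does not spell out the SCS formula. The sketch below assumes one commonly used definition: a chi-square statistic for the deviation from equal usage within each synonymous family, summed over families and divided by the number of codons analysed (Met and Trp excluded). Both this definition and the names used here are assumptions, not a description of the calculation actually performed in the study.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def scaled_chi_square(cds):
    """Chi-square deviation from equal synonymous usage, scaled by codon count."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))

    chi_square = 0.0
    n_codons = 0
    for family in FAMILIES.values():
        if len(family) < 2:
            continue  # Met and Trp have no synonymous alternatives
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        expected = total / len(family)
        chi_square += sum((counts[c] - expected) ** 2 / expected for c in family)
        n_codons += total

    if n_codons == 0:
        raise ValueError("no codons with synonymous alternatives found")
    return chi_square / n_codons
```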

Codon adaptation index (CAI)

CAI is a measure of CUB and helps determine the gene expression level. The CAI value ranges between 0 and 1, and the higher the value, the higher the expression65. CAI values are adjusted in the synthetic biology approach to obtain maximum expression level.
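
For illustration, CAI can be sketched as the geometric mean of the relative adaptiveness values of the codons in a gene, where the relative adaptiveness of a codon is its count in a reference set divided by the count of the most used synonymous codon. The reference set is not specified in the text, so the one used in the usage note is hypothetical. Replacing zero counts with a pseudo-count of 0.5 is also an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def _codons(cds):
    cds = cds.upper().replace("U", "T")
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]


def relative_adaptiveness(reference_cds):
    """Weights w = count / max count within each synonymous family."""
    counts = Counter()
    for cds in reference_cds:
        counts.update(_codons(cds))
    weights = {}
    for family in FAMILIES.values():
        if len(family) < 2:
            continue  # Met and Trp are excluded from CAI
        # Assumption: unseen codons get a pseudo-count of 0.5.
        adjusted = {c: counts[c] or 0.5 for c in family}
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of all scored codons in the CDS."""
    logs = [math.log(weights[c]) for c in _codons(cds) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))
```

With the toy reference `["CTGCTGCTGCTA"]`, a gene made only of CTG codons scores 1.0, whereas replacing one of two CTG codons with CTA lowers the score to the square root of 1/3.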

Skew calculation

Skew, herein, is the disproportionate use of nucleotides. Asymmetrically biased nucleotide usage arises from asymmetric replication of the leading and lagging strands88. AT skew, GC skew, purine skew, pyrimidine skew, keto skew, and amino skew were determined.
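
The skew formulas are not written out in the text. The sketch below assumes the frequently used definitions AT skew = (A − T)/(A + T), GC skew = (G − C)/(G + C), purine skew = (A − G)/(A + G), pyrimidine skew = (T − C)/(T + C), keto skew = (G − T)/(G + T) and amino skew = (A − C)/(A + C). The sign conventions in particular are assumptions, and the function names are hypothetical.

```python
def _skew(a, b):
    """(a - b) / (a + b), or NaN when both counts are zero."""
    return (a - b) / (a + b) if (a + b) else float("nan")


def nucleotide_skews(seq):
    """Return the six skews of a nucleotide sequence (assumed definitions)."""
    seq = seq.upper().replace("U", "T")
    a, c, g, t = (seq.count(base) for base in "ACGT")
    return {
        "AT": _skew(a, t),
        "GC": _skew(g, c),
        "purine": _skew(a, g),       # A versus G
        "pyrimidine": _skew(t, c),   # T versus C
        "keto": _skew(g, t),         # G versus T
        "amino": _skew(a, c),        # A versus C
    }
```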

Estimation of physical properties of protein

The isoelectric point (pI), instability index, aliphatic index, hydrophobicity, frequency of acidic, basic, and neutral amino acids, GRAVY, and AROMA are the physicochemical properties of a protein that were assessed in the present study to evaluate the effects of various parameters on protein properties. Theoretical pI (PI), instability index (II), aliphatic index (AI) and hydrophobicity (HY) were computed using the ProtParam tool (ExPASy)89. The frequency of acidic, basic, and neutral amino acids was determined using the Peptide2 tool available at Peptide 2.0 Inc.
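
Two of these indices are simple enough to sketch directly. The example below assumes the Kyte-Doolittle hydropathy scale for GRAVY and the usual aliphatic index formula, X(Ala) + 2.9 X(Val) + 3.9 [X(Ile) + X(Leu)], with X in mole percent. It is an illustration only and does not reproduce the ProtParam tool; the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Grand average of hydropathy: mean hydropathy value per residue."""
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def aliphatic_index(protein):
    """Relative volume occupied by the aliphatic side chains A, V, I and L."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty sequence")

    def mole_percent(aa):
        return 100.0 * protein.count(aa) / len(protein)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))
```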

Regression analysis

A regression analysis between %GC3 and %GC12 defines the relative magnitude of mutational and selection forces. A slope close to 1 indicates that mutational force is the main influence on codon usage, whereas a slope close to 0 indicates that selection dominates90. Likewise, a perfect correlation between GC12 and GC3, with a slope near 1, indicates that mutational force is the dominant one91.
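
The regression itself is an ordinary least-squares fit with %GC3 on the x-axis and %GC12 on the y-axis. A minimal sketch is shown below; the function name and the numbers in the usage note are hypothetical and are not data from this study.

```python
def neutrality_regression(gc3, gc12):
    """Least-squares slope and intercept of %GC12 regressed on %GC3."""
    if len(gc3) != len(gc12) or len(gc3) < 2:
        raise ValueError("need two equally long lists with at least two genes")
    n = len(gc3)
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    if sxx == 0:
        raise ValueError("%GC3 values are all identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For example, `neutrality_regression([40, 50, 60, 70], [44, 46, 48, 50])` returns a slope of 0.2. Under the common reading of a neutrality plot, such a slope would suggest that mutation pressure accounts for roughly 20% of the effect and selection for the remainder.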

Parity analysis

A parity rule 2 (PR2) bias indicates the bias between A and T and between G and C at the third position of the codon. A parity plot is made by plotting the AT bias [A3/(A3 + T3)] as the ordinate and the GC bias [G3/(G3 + C3)] as the abscissa79,92.
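
The two coordinates of a gene in the parity plot can be computed as in the following sketch, which uses every third-position nucleotide of the coding sequence. Whether the original analysis restricted the calculation to particular codon families is not stated, so that choice, like the function name, is an assumption.

```python
from collections import Counter


def pr2_coordinates(cds):
    """Return (GC bias, AT bias) = (G3/(G3+C3), A3/(A3+T3)) for a CDS."""
    cds = cds.upper().replace("U", "T")
    third = Counter(cds[2::3])  # nucleotides at the third codon position
    gc_total = third["G"] + third["C"]
    at_total = third["A"] + third["T"]
    if gc_total == 0 or at_total == 0:
        raise ValueError("bias undefined: missing G/C or A/T at third positions")
    gc_bias = third["G"] / gc_total  # abscissa (x-axis)
    at_bias = third["A"] / at_total  # ordinate (y-axis)
    return gc_bias, at_bias
```

A gene located at (0.5, 0.5) uses A and T, and G and C, equally at the third codon position.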

Translational selection

The P2 analysis indicates the strength of codon-anticodon interaction and indicates translation efficacy when information of a preferred codon set is unknown83.

Translational selection P2 was calculated using the formula:

$${\text{P}}2 = \frac{{\text{WWC}} + {\text{SSU}}}{{\text{WWY}} + {\text{SSY}}}$$

where W = A or U, S = C or G, and Y = C or U.

Moreover, any values above 0.5 indicate a bias favoring translational selection93.
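
The P2 formula can be sketched by counting codons that match each pattern, with U written as T for DNA sequences. The function name is hypothetical, and codons that match neither WWY nor SSY are simply ignored.

```python
W = set("AT")  # weak bases: A or U (T in DNA)
S = set("CG")  # strong bases: C or G
Y = set("CT")  # pyrimidines: C or U (T in DNA)


def p2(cds):
    """P2 = (WWC + SSU) / (WWY + SSY), counted over the codons of a CDS."""
    cds = cds.upper().replace("U", "T")
    numerator = denominator = 0
    for i in range(0, len(cds) - 2, 3):
        first, second, third = cds[i:i + 3]
        is_ww = first in W and second in W
        is_ss = first in S and second in S
        if not (is_ww or is_ss) or third not in Y:
            continue  # codon is neither WWY nor SSY
        denominator += 1
        if (is_ww and third == "C") or (is_ss and third == "T"):
            numerator += 1  # WWC or SSU
    if denominator == 0:
        raise ValueError("no WWY or SSY codons in sequence")
    return numerator / denominator
```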

Codon context analysis

It was first observed in prokaryotic genes that codons and codon pairs exhibit a bias in occurrence94. Another study observed that codon pairs also influence the rate of translation: overrepresented codon pairs are translated at a slower speed than underrepresented codon pairs. The phenomenon is related to the compatibility of adjacent tRNA isoacceptor molecules present on ribosomes participating in translation. Such results suggest that the frequency of one codon has co-evolved with that of the next codon, together with structural compatibility and tRNA isoacceptor abundance, as a means to control translation rates95. Furthermore, codon pair optimization and deoptimization have been shown to affect translation efficiency in several experiments, underlining the importance of codon context bias96,97. In the present study, we performed codon context analysis using the Anaconda 2 software.
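
The basic counting step behind a codon context analysis can be sketched as below. This only tallies adjacent codon pairs and extracts the identical ones; it does not reproduce the statistical treatment of the Anaconda software, and the function names are hypothetical.

```python
from collections import Counter


def codon_pair_counts(cds):
    """Count every pair of adjacent codons (codon i, codon i + 1) in a CDS."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return Counter(zip(codons, codons[1:]))


def identical_pairs(pair_counts):
    """Keep only pairs in which the same codon occurs twice in a row."""
    return {pair: n for pair, n in pair_counts.items() if pair[0] == pair[1]}
```

For the toy sequence `GTGGTGCTGCTGGTG`, the identical pairs found are GTG-GTG and CTG-CTG, each occurring once.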

Statistical analysis

Statistical analyses, such as Pearson correlation, regression analysis, and principal component analysis, were undertaken using PAST4 software. Standard calculations, such as the additions and subtractions used in the skew and other analyses, were performed in Microsoft Office 2010.
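
As an illustration of the correlation step, the Pearson coefficient between two equally long vectors, for example the RSCU values of two genes over the same codons, can be sketched as follows. This is not the PAST4 implementation, and the function name is hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```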