Introduction

Since most proteins need to be folded to function, protein stability is one of the most basic properties of a protein. The protein stability discussed herein primarily refers to the thermodynamic stability of a protein, which determines whether the protein is in a naturally folded configuration or a denatured (unfolded or extended) state. Protein stability is a fundamental property that affects protein configuration, activity and regulation. It plays an essential role in evolution, a variety of diseases and industrial applications1,2,3,4. The most common cause of monogenic diseases is single-nucleotide variation (SNV) leading to amino acid substitutions. These missense variants can have a strong effect on the stability of a protein, leading to detrimental changes to protein function. Loss of protein stability is a major contributor to this single-gene disease1. More and more attention has been paid in the past few decades to understand the biological principles of protein stability5,6. Accurately predicting protein stability through theoretical and experimental methods is crucial for academic research and industrial applications.

Androgens have a wide range of physiological effects on male reproductive and non-reproductive systems at different stages of development7,8,9. During the fetal period, androgens are primarily responsible for sex differentiation by masculinizing the Wolff tube and external genitalia9. During puberty, androgens regulate the growth and function of the male reproductive system9. In adults, androgens play key role in regulating behavior, spermatogenesis and bone metabolism9. Androgens mediate their actions primarily via the androgen receptor (AR), a ligand-dependent nuclear transcription factor expressed in primary/secondary sex organs8,9. AR is also expressed in non-genital organs such as the skeletal muscle, skin, adrenal gland, kidney and nervous system8,9.

AR is an extensively studied steroid receptors. AR mutations have been identified in various diseases, including hereditary diseases such as androgen insensitivity syndrome (AIS)10,11, spinal and bulbar muscular atrophy (SBMA)12,13 and benign prostatic hyperplasia14,15. AR also plays an important role in the development and metastasis of several hormone-related cancers, including prostate cancer12, breast cancer16,17, liver cancer18,19,20.

The AR contains three major functional domains: (1) the N-terminal domain (NTD) comprises an activation function 1 (AF-1) region, (2) DNA binding domain (DBD) and (3) the C-terminal ligand binding domain (LBD) comprise an AF-2 region8,21. The primary mechanism of action for AR is to directly regulate gene transcription. Androgen binds to the AR, leading to conformational change of AR, dissociation of heat shock proteins, driving the interaction between the N- and C-terminus of AR, and importin-α binds AR to transport AR into the nucleus22. In the nucleus, the AR dimerizes and binds to androgen response elements (ARES) in the promoter region of the target genes23. AR interact with additional proteins in the nucleus, causing the transcription of specific target genes to be up- or down-regulated. Notable target genes for AR are insulin-like growth factor I receptor (IGF-1R)24, prostate-specific antigen (PSA)25,26, and transmembrane protease serine 2 (TMPRSS2)27,28,29.

Although more than 1,000 AR mutations have been identified and these mutants are pathologically important12, few theoretical studies have investigated the impact of mutations on AR protein folding stability in disease and its relationship with the phenotype of the patients. Several algorithms have been developed to predict the effect of mutations on protein stability30,31,32,33,34,35. Notable algorithms include FoldX36, Dmutant37, I-Mutant2.038, CUPSAT39, Eris40 and STRUM41. Compared with other methods, FoldX performs well and is the most commonly used protein stability prediction algorithm42. Folding free energy reflects the overall protein stability, and changes in protein stability due to naturally occurring missense mutations often cause disease42. This work calculates the folding free energy of AR variants by FoldX and measures the guanidine hydrochloride (GdmHCl) denaturation curves of different mutants, trying to establish correlation between protein stability and patient phenotype. By correlating the patient's phenotype with changes in AR stability, this study may prove to be diagnostic and/or predictive tools for assessing the effects of mutations on disease outcome.

Results and discussion

Computational pipeline for disease-associated androgen receptor (AR) variants

In order to study the pathogenic pattern of AR mutants in related diseases, our computational analysis consist of the following steps as in Fig. 1: (1) extracted AR variations from the following four databases: ARDB (The androgen receptor gene mutations database: 2012 update; https://androgendb.mcgill.ca), HGMD (Human Gene Mutation Database, 2015), Cosmic (2018.10.01), 1,000 genome; (2) analyzed mutation frequency, protein stability and relative surface accessibility (RSA) of these variants; (3) summarized the pathogenic mechanism of AR mutants in related diseases.

Figure 1
figure 1

Flowchart of the computational analysis pipeline for disease-associated androgen receptor (AR) variants.

AR structure

The NTD domain accounts for more than half of the entire AR protein as in Fig. 2 (amino acids 1–558). AR NTD contains polyglutamine (ploy-Q) and polyglycine (poly-G) repeats, and the length of these two repeats is highly variable in the human population43,44. The latest human AR reference gene sequence (NM_000044.2) encodes a protein of 920 amino acids in length (instead of the previous 919). Because the reference length of ploy-Q is replaced by 23 instead of the original 21, and the reference length of poly-G is changed from 24 to 23. The length of the ploy-Q repeat is inversely proportional to the AR transcriptional activity, and the longer the polyglutamine repeat, the smaller the AR transcriptional activity45. The AR NTD is an intrinsically disordered proteins (IDP) that lacks a stable structure in aqueous solution (Fig. 2)46,47. AR NTD undergoes conformational changes when interacting with DNA and/or target proteins and in the presence of structurally stable solutes46,47. The plasticity of the AR NTD structure allows it to interact with many structurally distinct proteins (e.g., P160 family coactivator, transcription factor IIF) and intramolecularly interact with C terminal LBD domain (N/C interaction of AR)48,49,50,51.

Figure 2
figure 2

AR domains and structure. Blue and red spheres represent N-terminal and C-terminal, respectively.

AR DBD (amino acids 560–620) consists of two zinc-coordinated modules, and it is highly conserved among steroid hormone receptors (Fig. 2). DBD selectively binds to the androgen response elements (ARES) on the promoter, activating specific AR target genes, such as TMPRSS2, PSA and IGF-1R52. AR contains two NLS sequences-one in the DBD domain, the other one in the hinge region between DBD and LBD. NLS consists of two basic amino acid clusters separated by ten residues (617-RKCYEAGMTLGARKLKK-634). The binding of androgens to AR induces the exposure of NLS and result in nuclear import of AR by binding to the importin-α53.

The crystal structure of AR LBD (residues 691–920) was first well characterized in 200054. Subsequently, many related complex structures were deposited to the RCSB PDB (The Research Collaboratory for Structural Bioinformatics Protein Data Bank). AR LBD consists of 11 α-helices and 4 short β-strands, forming a three-layer anti-parallel α-helical sandwich fold, which is characteristic of AR LBD (Fig. 2)54.

AR mutations analysis

Relationship between AR mutations and diseases

1,110 mutations were found in the AR gene, of which 905 were possible loss-of-function alterations and caused androgen insensitivity syndrome (AIS) or related with AIS. Different AR mutations impair androgen-dependent male sexual differentiation to varying degrees12. Severe androgen insensitivity (AI) produces an external female phenotype. Partial AI produce a range of external genital phenotypes, from near-normal females to normal or near-normal males. It has been suggested that the clinical severity of AI be divided into three levels: complete, partial (when there is significant external genital ambiguity), and mild (for the least severe form). In addition, there are 4 AR loss-of-function mutations associated with premature ovarian failure (POF)12.

There are also 168 AR mutations that are possible gain-of-function alterations found in prostate cancer tissues. Aside from skin cancer, prostate cancer is the most common cancer among men and is the second leading cause of cancer-related death in the United States55. In prostate cancer, AR mutation may reduce the specificity and selectivity of its binding ligands, thereby being activated by a wider range of ligands such as adrenal androgens, estrogens, progesterone and antiandrogens. These gain-of-function AR mutations in prostate cancer tissue may be responsible for the failure of prior anti-androgen therapy. In addition, in spinal cord and bulbar muscular atrophy (SBMA, also known as Kennedy's disease), poly-Q amplified AR protein is toxic to motor neurons to some extent by gain-of-function56.

AR mutations statistics

The AR NTD is highly conserved with relatively few deleterious mutations, and only 8% of the residues in the domain are found to have mutations, including 230 variants (Fig. 3). Most mutations are nonsense mutations or frameshift mutations result in premature termination of the translation. A significantly higher percentage (27%) of residue mutations were found in the DBD domain, including 146 variants (Fig. 3). Mutations in DBD domain are primarily SNVs, result in defects in the DNA binding/dimerization activity of the protein and impaired or absent transcription activity of AR. In the hinge region of the AR, only 8% of the residues are found to have mutations. Some variations in this region have no significant adverse effects on AR function. In LBD domain, 56% of the residues contained mutations, including 692 variants (Fig. 3).

Figure 3
figure 3

Analysis of AR mutations from three datasets.

The effect of disease-related AR SNV on protein stability

Nonsense or frameshift variants cause large changes to the encoded protein and are therefore usually functionally damaging. However, missense variants (single-residue variants, SRV), in which one amino acid is replaced by another amino acid, account for more than 40% of the unique variants observed in the Exome Aggregation Consortium database57, and their phenotypic consequences are often hard to predict. It has been found in cell experiments that many SRV have only a small effect on protein function. Analysis of high-throughput data across multiple proteins has shown that about two-thirds of SRV have only a small effect on protein function58. However, some SRV are severely harmful and cause complete loss of function. In a clinical setting, it would be useful to have reliable methods and sufficient data for interpretation of SRV and an accurate classification of whether they are pathogenic or benign.

From the HGMD dataset, we downloaded a collection of 337 single residue mutations in the AR that are known to be associated with human inherited disease. All mutations in this set were found in patients with AIS and were associated with loss of AR function. From the Cosmic dataset, we downloaded 323 single residue mutations in the AR that are known to be associated with human cancers, the vast majority of which were associated with prostate cancer. From the 1,000 genome dataset, we downloaded 39 single residue variations in the AR that are not significantly associated with human diseases. The Venn diagram below shows the relations between these three AR mutation sets (Fig. 4).

Figure 4
figure 4

Venn diagram of the SRV variants in HGMD, COSMIC and 1,000 Genome datasets.

The effect of disease-related AR mutants on protein folding stability

FoldX, the most commonly used protein stability prediction algorithm, can estimate the effect of SRV on protein stability based on the three-dimensional structure of the protein. So far, only the DBD and LBD domain in AR has crystal structure available, so we selected SRV of the DBD and LBD domain for analysis. We applied FoldX to analyze single residue variations in the AR DBD and LBD domain to estimate changes in protein folding free energy caused by these variations. We found that the predicted folding free energy change ΔΔG of human inherited disease-associated AR mutations (HGMD) was significantly higher (Fig. 5). The ΔΔG caused by the AR polymorphism (1,000 genome) that are not significantly related to the diseases are mainly around 1 kcal/mol. The average ΔΔG of tumor-associated AR somatic mutants is slightly lower than that caused by polymorphisms. However, mutations associated with human inherited diseases tend to have a significantly greater impact on protein stability than polymorphic or tumor-associated somatic mutants, with an average ΔΔG > 2 kcal/mol (Fig. 5).

Figure 5
figure 5

The predicted folding free energy change of AR LBD single residue variations calculated by FoldX. The statistical difference between HGMD, Cosmic and 1,000 genome groups was measured. The significance of statistical difference was calculated by paired two-side Student’s t-test (**p < 0.01, ***p < 0.001).

Structurally unstable or misfolded proteins may form toxic aggregates or inclusions. Organisms control protein quality by refolding or degrading of these unstable or misfolded proteins59. Most intracellular protein degradation occurs via the ubiquitin–proteasome system (UPS) or autophagy–lysosomal pathway (ALP). In a folded protein, degradation signals are usually buried in the protein. When a protein is partially or fully unfolded, one or more degrons of the protein may be exposed. E3 ubiquitin ligase scans cells for such degradation signals, binds substrates, and promotes substrate ubiquitination and degradation by the 26S proteasome. A recent study on Lynch syndrome-associated MSH2 mutations found that, as little as 3 kcal/mol was sufficient to trigger protein degradation60. The ALP is usually responsible for the degradation of highly misfolded and insoluble protein aggregates.

Correlation between protein folding stability of AR mutation and AIS patient phenotype

We further analyzed the correlation between the clinical severity of the AIS patient and the folding stability of the related AR mutation. AR mutants that cause severe androgen insensitivity (CAIS) tend to have a significantly greater impact on protein stability compared to partial AI (PAIS). PAIS AR mutants compared to polymorphisms tend to have a slight greater impact on protein stability (Fig. 6).

Figure 6
figure 6

Correlation between protein folding stability of AR mutation and AIS patient phenotype. The statistical difference between CAIS, PAIS and 1,000 genome groups was measured. The significance of statistical difference was calculated by paired two-side Student’s t-test (**p < 0.01).

Expression and purification of AR-LBD WT and mutants

Besides the wide-type (WT) AR, we randomly select two AR mutants that cause CAIS (W752R, L813F) and two mutants that cause PAIS (I738T, C807Y). The above proteins with N-terminal his-tag were expressed in Escherichia coli and purified by Ni–NTA affinity column and following Superdex-75 gel filtration column (Fig. 7). Compared with other mutants, we found that the I738T mutant had small effect on its solubility (Fig. 7), in agreement with FoldX prediction that the I738T has a mild effect on protein folding stability (ΔΔG = 1.4 kcal/mol). FoldX predicts that ΔΔG of C807Y, W752R and L813 mutants are 6.2, 3.6 and 6.4 kcal/mol, respectively. Our experiments show that the solubilities of these three mutants are much lower than the WT (Fig. 7). We proposed that the poor solubilities of these three mutants may due to their decreased folding stabilities, therefore leading to high tendency to precipitate as inclusion bodies (data not shown). As will be shown later by GdmHCl-induced denaturation experiments, these three mutations indeed have severely problems in terms of folding stability.

Figure 7
figure 7

The expression and purification of his tag AR-LBD WT and I738T, C807Y, W752R, L813 mutant proteins. (A) Gel filtration profiles of AR-LBD WT and mutant proteins. AR-LBD WT and mutant protein was eluted at a peak of 250 mL in Superdex-75 column. (BF) Coomassie blue stained SDS gel of the pooled fractions of AR-LBD WT and mutant proteins.

GdmHCl-induced chemical denaturation

In the room termperature, the AR-LBD WT and its mutants have similar CD spectra (200–250 nm, Fig. 8A), possessing the characteristic of α-helical proteins. We first tried to measure the temperature denaturation curves of AR-LBD WT and its mutants. However, since some mutants were observed to aggregate after elevating temperature, the measurement of the temperature-dependent folding stability by CD spectra looks unreliable. Therefore, we measured the stability of AR-LBD and its mutants by chemical denaturation.

Figure 8
figure 8

GdmHCl induced AR-LBD WT and mutant protein denaturation. (A) Circular dichroism spectra of AR-LBD WT and mutants at 25 ℃. (BF) Changes of ellipticities of AR-LBD WT and mutants with increasing concentration of GdmHCl at 25 ℃. (G) The fitted folding free energies of AR-LBD WT and mutants using the two-state model by the linear extrapolation method.

Figure 8B–F shows the change of CD spectrum of AR-LBD WT and its mutant proteins with different concentrations of GdmHCl. Because GdmHCl itself has absorption below 210 nm, so we used the change of CD ellipticity at 222 nm to measure the fraction of folded protein and calculate the folding free energy of AR-LBD variants.

AR-LBD WT is the most stable sample in this test. The transition midpoint (C1/2) of AR-LBD WT is at 3.0 M GdmHCl, which is higher than those C1/2 of disease mutants (1.9–2.2 M GdmHCl, Table 1).

Table 1 Measured folding free energy of AR_LBD WT and mutants.

We further calculated the folding free energies of AR-LBD variants based on the GdmHCl denaturation experiment. The linear relationship between ΔG and GdmHCl concentration is shown in Fig. 8G. The intercept (ΔG0) at zero GdmHCl concentration gives the extrapolated folding free energy at physiological condition. The slope m represents the sensitivity of protein stability to denaturant. These values are given in Table 1.

The fitted folding free energy ΔG0 of AR-LBD WT is − 5.3 kcal/mol, while those of I738T, C807Y, W752R and L813F mutants are higher (less stable) by 2.3–3.8 kcal/mol (ΔΔG). W752R and L813F, mutations of CAIS, compared to PAIS mutations I738T and C807Y, have greater changes in folding free energy. We proposed that AR mutants with less stabilities can lead to more severe androgen insensitivity syndrome. In summary, the experimental results are consistent with the FoldX prediction results.

The m value can be interpreted by the free energy change of the folded and unfolded protein being transferred from water to 1 M GdmHCl61. The m values of I738T, C807Y, W752R and L813F mutants were 1.3, 1.1, 0.9 and 0.6, respectively, which were lower than the m value of AR-LBD WT (1.7). The decrease of the m value indicates two possibilities: (1) the mutation increases the solvent accessibility of the hydrophobic residue in the folded state. (2) The deletion of some hydrogen bonds in the mutant will weaken the hydrophobic interaction of certain residues, thereby increasing its accessibility or sensitivity to GdmHCl. The decreasing trend of m value of mutants is also related to the ΔΔG of these mutant proteins. The L813F mutant with the smallest m value possessed the greatest decrease in protein stability (ΔΔG = 3.8 kcal/mol). The I738T with the smallest m value reduction has the smallest stability reduction (ΔΔG = 2.3 kcal/mol).

Conclusion and outlook

In summary, for the first time through computer-based analysis, we found that mutations associated with human inherited diseases tend to have a significantly greater impact on protein stability than polymorphisms. Using computer-based analysis and GdmHCl denaturation experiments, we found that changes in protein folding stability are correlated with patient phenotypes. The change of folding free energy of the AR-LBD mutants predicted by FoldX are consistent with the measured folding free energy changes by GdmHCl denaturation experiments, which further supports the reliability of our conclusion. Therefore, this paper clearly demonstrates the importance of AR protein stability in the pathogenicity of hereditary diseases (AIS), and provides reference for clinic diagnosis.

In addition to predicting pathogenicity and improving diagnosis, this AR protein stability studies provide new opportunities for the treatment of AIS. Many pathogenic variations might be adequately functional, but are degraded by protein quality control system due to mild instability62. For these pathogenic variations, it may be possible to rescue protein function by preventing their recognition or degradation by the protein quality control system62. If the degradation of these protein variants is inhibited, it is possible to avoid pathogenicity. In addition, some pathogenic variations may be so unstable that even inhibition of their degradation is not sufficient to rescue their stability and function. These variants can restore protein function by stabilizing the protein with small molecules that bind directly to the unstable protein variants63. Small molecules compounds have been shown to rescue function of pathogenic variations, such as in mutants CFTR64 and TP5365.

Methods

Data set

We extracted AR variations from the following four databases: ARDB (The androgen receptor gene mutations database: 2012 update; https://androgendb.mcgill.ca), HGMD (Human Gene Mutation Database, 2015), Cosmic (2018.10.01), 1,000 genome;

FoldX A structure-based method for the prediction of free energy changes upon protein variations. Here we used FoldX that exploits both sequence and structural information to predict the protein stability changes upon single point mutation. When predicting the ΔΔG associated with a variation, positive value indicates that the protein is destabilized, and a negative value indicates that the protein is stable.

Construction of AR mutants, protein overexpression and purification

The sequence of human AR cDNA (NCBI accessiong number: CCDS14387.1) encoding 664-920aa was cloned into pET28a expression vector. Site-directed mutagenesis was prepared using the procedure provided by the QuickChange site-directed mutagenesis kit. All constructs were verified by DNA sequencing. In the expression of WT AR and mutants, when the OD600nm of the culture reached 0.5, DHT (dihydrotestosterone) was added with the final concentration of 30 μM. After 15 min, 0.1 mM IPTG was added to induce protein expression at 16 °C overnight. The cells were collected by centrifugation at 5,000g for 10 min at 4 °C. The centrifuged cells were then resuspended and sonicated in 20 mM Tris buffer containing 500 mM NaCl and 50 mM imidazole (pH8.0). Debris was removed by centrifugation at 20,000g for 30 min. The supernatant was loaded into a Ni–NTA column and the desired fraction was eluted with 300 mM imidazole. The eluent was loaded into a Superdex 75 column equilibrated with a buffer (20 mM Tris–HCl, pH 7.5, 200 mM NaCl, 1 mM EDTA, 10 mM 2-ME). The collected fractions were concentrated to about 0.5–2 mg/ml for CD measurement. The protein concentration was determined by using the calculated extinction coefficient at 280 nm.

Circular dichroism (CD) measurements

CD measurement is performed on a Chirascan spectrometer. All CD measurements were performed in buffer (20 mM Tris–HCl, pH 7.5, 200 mM NaCl, 1 mM EDTA, 10 mM 2-ME). For the GdmHCl induced denaturation experiments, 70 μg/ml of AR-LBD WT and mutant proteins with different GdmHCl concentrations were prepared and equilibrated at 25 °C for 1 h. The CD signal was measured at a path of 0.1 cm, and three independent measurement results were averaged.

Two-state analysis of GdmHCl denaturation

A two-state hypothesis was used to fit the denaturation curve of GdmHCl. The folding free energy of AR-LBD WT and mutant proteins without GdmHCl is estimated by a linear extrapolation:

$$\alpha_{i} = \frac{{\left[ {\theta_{i} \left] { {-} } \right[\theta_{{\text{U}}} } \right]}}{{\left[ {\theta_{{\text{F}}} \left] { - } \right[\theta_{{\text{U}}} } \right]}}$$

where [θi] is the ellipticity at the ith gdmHCl concentration, [θF] is the ellipticity of the protein completely folded, [θU] is the ellipticity of the protein in 5 M GdmHCl. It is assumed that the protein in 5 M GdmHCl has been fully unfolded.

$$K_{i} = \frac{{{\upalpha }_{i} }}{{1 - {\upalpha }_{i} }}$$

where Ki is the folding constant of the monomer protein at the ith GdmHCl concentration, which can be calculated by the folding fraction αi. The free energy of protein folding can be estimated by the following equation:

$$\Delta G_{{\text{F}}} = - RT{\ln}K_{{\text{F}}}$$

where R is the gas constant, T is the absolute temperature, and KF is the folding constant of monomer protein, which can be calculated by the function KF = [F]/[U], where [F] and [U] represent respectively folded and unfolded fractions.

$$\Delta Gi = \Delta G_{0} + m\left[ {{\text{GdmHCl}}} \right]$$

The free energy of protein folding is a linear function of GdmHCl concentration, where ΔGi is the free energy of protein at the ith GdmHCl concentration, and ΔG0 is the free energy of protein folding without GdmHCl.