Computational analysis of androgen receptor (AR) variants to decipher the relationship between protein stability and related-diseases

Although more than 1,000 androgen receptor (AR) mutations have been identified and these mutants are pathologically important, few theoretical studies have investigated the role of AR protein folding stability in disease and its relationship with the phenotype of the patients. Here, we extracted AR variant data from four databases: ARDB, HGMD, Cosmic, and 1,000 genome. 905 androgen insensitivity syndrome (AIS)-associated loss-of-function mutants and 168 prostate cancer-associated gain-of-function mutants in AR were found. We analyzed the effect of single-residue variation on the folding stability of AR by FoldX and guanidine hydrochloride denaturation experiment, and found that genetic disease-associated mutations tend to have a significantly greater effect on protein stability than gene polymorphisms. Moreover, AR mutants in complete androgen insensitivity syndrome (CAIS) tend to have a greater effect on protein stability than in partial androgen insensitive syndrome (PAIS). This study, by linking disease phenotypes to changes in AR stability, demonstrates the importance of protein stability in the pathogenesis of hereditary disease.

Since most proteins need to be folded to function, protein stability is one of the most basic properties of a protein. The protein stability discussed herein primarily refers to the thermodynamic stability of a protein, which determines whether the protein is in a naturally folded configuration or a denatured (unfolded or extended) state. Protein stability is a fundamental property that affects protein configuration, activity and regulation. It plays an essential role in evolution, a variety of diseases and industrial applications [1][2][3][4] . The most common cause of monogenic diseases is single-nucleotide variation (SNV) leading to amino acid substitutions. These missense variants can have a strong effect on the stability of a protein, leading to detrimental changes to protein function. Loss of protein stability is a major contributor to this single-gene disease 1 . More and more attention has been paid in the past few decades to understand the biological principles of protein stability 5,6 . Accurately predicting protein stability through theoretical and experimental methods is crucial for academic research and industrial applications.
Androgens have a wide range of physiological effects on male reproductive and non-reproductive systems at different stages of development [7][8][9] . During the fetal period, androgens are primarily responsible for sex differentiation by masculinizing the Wolff tube and external genitalia 9 . During puberty, androgens regulate the growth and function of the male reproductive system 9 . In adults, androgens play key role in regulating behavior, spermatogenesis and bone metabolism 9 . Androgens mediate their actions primarily via the androgen receptor Results and discussion computational pipeline for disease-associated androgen receptor (AR) variants. In order to study the pathogenic pattern of AR mutants in related diseases, our computational analysis consist of the following steps as in  www.nature.com/scientificreports/ relative surface accessibility (RSA) of these variants; (3) summarized the pathogenic mechanism of AR mutants in related diseases.
AR structure. The NTD domain accounts for more than half of the entire AR protein as in Fig. 2 (amino acids 1-558). AR NTD contains polyglutamine (ploy-Q) and polyglycine (poly-G) repeats, and the length of these two repeats is highly variable in the human population 43,44 . The latest human AR reference gene sequence (NM_000044.2) encodes a protein of 920 amino acids in length (instead of the previous 919). Because the reference length of ploy-Q is replaced by 23 instead of the original 21, and the reference length of poly-G is changed from 24 to 23. The length of the ploy-Q repeat is inversely proportional to the AR transcriptional activity, and the longer the polyglutamine repeat, the smaller the AR transcriptional activity 45 . The AR NTD is an intrinsically disordered proteins (IDP) that lacks a stable structure in aqueous solution ( Fig. 2) 46,47 . AR NTD undergoes conformational changes when interacting with DNA and/or target proteins and in the presence of structurally stable solutes 46,47 . The plasticity of the AR NTD structure allows it to interact with many structurally distinct proteins (e.g., P160 family coactivator, transcription factor IIF) and intramolecularly interact with C terminal LBD domain (N/C interaction of AR) [48][49][50][51] . AR DBD (amino acids 560-620) consists of two zinc-coordinated modules, and it is highly conserved among steroid hormone receptors (Fig. 2). DBD selectively binds to the androgen response elements (ARES) on the promoter, activating specific AR target genes, such as TMPRSS2, PSA and IGF-1R 52 . AR contains two NLS sequences-one in the DBD domain, the other one in the hinge region between DBD and LBD. NLS consists of two basic amino acid clusters separated by ten residues (617-RKCYEAGMTLGARKLKK-634). The binding of androgens to AR induces the exposure of NLS and result in nuclear import of AR by binding to the importin-α 53 .
The crystal structure of AR LBD (residues 691-920) was first well characterized in 2000 54 . Subsequently, many related complex structures were deposited to the RCSB PDB (The Research Collaboratory for Structural Bioinformatics Protein Data Bank). AR LBD consists of 11 α-helices and 4 short β-strands, forming a three-layer anti-parallel α-helical sandwich fold, which is characteristic of AR LBD (Fig. 2) 54 .

AR mutations analysis.
Relationship between AR mutations and diseases. 1,110 mutations were found in the AR gene, of which 905 were possible loss-of-function alterations and caused androgen insensitivity syndrome (AIS) or related with AIS. Different AR mutations impair androgen-dependent male sexual differentiation to varying degrees 12 . Severe androgen insensitivity (AI) produces an external female phenotype. Partial AI produce a range of external genital phenotypes, from near-normal females to normal or near-normal males. It has been suggested that the clinical severity of AI be divided into three levels: complete, partial (when there is significant external genital ambiguity), and mild (for the least severe form). In addition, there are 4 AR loss-offunction mutations associated with premature ovarian failure (POF) 12 .
There are also 168 AR mutations that are possible gain-of-function alterations found in prostate cancer tissues. Aside from skin cancer, prostate cancer is the most common cancer among men and is the second leading cause of cancer-related death in the United States 55 . In prostate cancer, AR mutation may reduce the specificity and selectivity of its binding ligands, thereby being activated by a wider range of ligands such as adrenal androgens, estrogens, progesterone and antiandrogens. These gain-of-function AR mutations in prostate cancer tissue may be responsible for the failure of prior anti-androgen therapy. In addition, in spinal cord and bulbar muscular atrophy (SBMA, also known as Kennedy's disease), poly-Q amplified AR protein is toxic to motor neurons to some extent by gain-of-function 56 . The effect of disease-related AR SNV on protein stability. Nonsense or frameshift variants cause large changes to the encoded protein and are therefore usually functionally damaging. However, missense variants (single-residue variants, SRV), in which one amino acid is replaced by another amino acid, account for more than 40% of the unique variants observed in the Exome Aggregation Consortium database 57 , and their phenotypic consequences are often hard to predict. It has been found in cell experiments that many SRV have only a small effect on protein function. Analysis of high-throughput data across multiple proteins has shown that about two-thirds of SRV have only a small effect on protein function 58 . However, some SRV are severely harmful and cause complete loss of function. In a clinical setting, it would be useful to have reliable methods and sufficient data for interpretation of SRV and an accurate classification of whether they are pathogenic or benign. From the HGMD dataset, we downloaded a collection of 337 single residue mutations in the AR that are known to be associated with human inherited disease. All mutations in this set were found in patients with AIS and were associated with loss of AR function. From the Cosmic dataset, we downloaded 323 single residue mutations in the AR that are known to be associated with human cancers, the vast majority of which were associated with prostate cancer. From the 1,000 genome dataset, we downloaded 39 single residue variations in the AR that are not significantly associated with human diseases. The Venn diagram below shows the relations between these three AR mutation sets (Fig. 4).
The effect of disease-related AR mutants on protein folding stability. FoldX, the most commonly used protein stability prediction algorithm, can estimate the effect of SRV on protein stability based on the three-dimensional structure of the protein. So far, only the DBD and LBD domain in AR has crystal structure available, so we selected SRV of the DBD and LBD domain for analysis. We applied FoldX to analyze single residue variations in the AR DBD and LBD domain to estimate changes in protein folding free energy caused by these variations. We found that the predicted folding free energy change ΔΔG of human inherited disease-associated AR mutations (HGMD) was significantly higher (Fig. 5). The ΔΔG caused by the AR polymorphism (1,000 genome) that are not significantly related to the diseases are mainly around 1 kcal/mol. The average ΔΔG of tumor-associated www.nature.com/scientificreports/ AR somatic mutants is slightly lower than that caused by polymorphisms. However, mutations associated with human inherited diseases tend to have a significantly greater impact on protein stability than polymorphic or tumor-associated somatic mutants, with an average ΔΔG > 2 kcal/mol (Fig. 5).
Structurally unstable or misfolded proteins may form toxic aggregates or inclusions. Organisms control protein quality by refolding or degrading of these unstable or misfolded proteins 59 . Most intracellular protein degradation occurs via the ubiquitin-proteasome system (UPS) or autophagy-lysosomal pathway (ALP). In a folded protein, degradation signals are usually buried in the protein. When a protein is partially or fully unfolded, one or more degrons of the protein may be exposed. E3 ubiquitin ligase scans cells for such degradation signals, binds substrates, and promotes substrate ubiquitination and degradation by the 26S proteasome. A recent study on Lynch syndrome-associated MSH2 mutations found that, as little as 3 kcal/mol was sufficient to trigger protein degradation 60 . The ALP is usually responsible for the degradation of highly misfolded and insoluble protein aggregates.
Correlation between protein folding stability of AR mutation and AIS patient phenotype. We further analyzed the correlation between the clinical severity of the AIS patient and the folding stability of the related AR mutation. AR mutants that cause severe androgen insensitivity (CAIS) tend to have a significantly greater impact on protein stability compared to partial AI (PAIS). PAIS AR mutants compared to polymorphisms tend to have a slight greater impact on protein stability (Fig. 6).

Expression and purification of AR-LBD WT and mutants.
Besides the wide-type (WT) AR, we randomly select two AR mutants that cause CAIS (W752R, L813F) and two mutants that cause PAIS (I738T, C807Y). The above proteins with N-terminal his-tag were expressed in Escherichia coli and purified by Ni-NTA affinity column and following Superdex-75 gel filtration column (Fig. 7). Compared with other mutants, we found that the I738T mutant had small effect on its solubility (Fig. 7), in agreement with FoldX prediction that the I738T has a mild  www.nature.com/scientificreports/ effect on protein folding stability (ΔΔG = 1.4 kcal/mol). FoldX predicts that ΔΔG of C807Y, W752R and L813 mutants are 6.2, 3.6 and 6.4 kcal/mol, respectively. Our experiments show that the solubilities of these three mutants are much lower than the WT (Fig. 7). We proposed that the poor solubilities of these three mutants may due to their decreased folding stabilities, therefore leading to high tendency to precipitate as inclusion bodies Figure 6. Correlation between protein folding stability of AR mutation and AIS patient phenotype. The statistical difference between CAIS, PAIS and 1,000 genome groups was measured. The significance of statistical difference was calculated by paired two-side Student's t-test (**p < 0.01). www.nature.com/scientificreports/ (data not shown). As will be shown later by GdmHCl-induced denaturation experiments, these three mutations indeed have severely problems in terms of folding stability.
GdmHCl-induced chemical denaturation. In the room termperature, the AR-LBD WT and its mutants have similar CD spectra (200-250 nm, Fig. 8A), possessing the characteristic of α-helical proteins. We first tried to measure the temperature denaturation curves of AR-LBD WT and its mutants. However, since some mutants were observed to aggregate after elevating temperature, the measurement of the temperature-dependent folding stability by CD spectra looks unreliable. Therefore, we measured the stability of AR-LBD and its mutants by chemical denaturation. Figure 8B-F shows the change of CD spectrum of AR-LBD WT and its mutant proteins with different concentrations of GdmHCl. Because GdmHCl itself has absorption below 210 nm, so we used the change of CD ellipticity at 222 nm to measure the fraction of folded protein and calculate the folding free energy of AR-LBD variants.
AR-LBD WT is the most stable sample in this test. The transition midpoint (C 1/2 ) of AR-LBD WT is at 3.0 M GdmHCl, which is higher than those C 1/2 of disease mutants (1.9-2.2 M GdmHCl, Table 1).
We further calculated the folding free energies of AR-LBD variants based on the GdmHCl denaturation experiment. The linear relationship between ΔG and GdmHCl concentration is shown in Fig. 8G. The intercept www.nature.com/scientificreports/ (ΔG 0 ) at zero GdmHCl concentration gives the extrapolated folding free energy at physiological condition. The slope m represents the sensitivity of protein stability to denaturant. These values are given in Table 1.
The fitted folding free energy ΔG 0 of AR-LBD WT is − 5.3 kcal/mol, while those of I738T, C807Y, W752R and L813F mutants are higher (less stable) by 2.3-3.8 kcal/mol (ΔΔG). W752R and L813F, mutations of CAIS, compared to PAIS mutations I738T and C807Y, have greater changes in folding free energy. We proposed that AR mutants with less stabilities can lead to more severe androgen insensitivity syndrome. In summary, the experimental results are consistent with the FoldX prediction results.
The m value can be interpreted by the free energy change of the folded and unfolded protein being transferred from water to 1 M GdmHCl 61 . The m values of I738T, C807Y, W752R and L813F mutants were 1.3, 1.1, 0.9 and 0.6, respectively, which were lower than the m value of AR-LBD WT (1.7). The decrease of the m value indicates two possibilities: (1) the mutation increases the solvent accessibility of the hydrophobic residue in the folded state. (2) The deletion of some hydrogen bonds in the mutant will weaken the hydrophobic interaction of certain residues, thereby increasing its accessibility or sensitivity to GdmHCl. The decreasing trend of m value of mutants is also related to the ΔΔG of these mutant proteins. The L813F mutant with the smallest m value possessed the greatest decrease in protein stability (ΔΔG = 3.8 kcal/mol). The I738T with the smallest m value reduction has the smallest stability reduction (ΔΔG = 2.3 kcal/mol).

conclusion and outlook
In summary, for the first time through computer-based analysis, we found that mutations associated with human inherited diseases tend to have a significantly greater impact on protein stability than polymorphisms. Using computer-based analysis and GdmHCl denaturation experiments, we found that changes in protein folding stability are correlated with patient phenotypes. The change of folding free energy of the AR-LBD mutants predicted by FoldX are consistent with the measured folding free energy changes by GdmHCl denaturation experiments, which further supports the reliability of our conclusion. Therefore, this paper clearly demonstrates the importance of AR protein stability in the pathogenicity of hereditary diseases (AIS), and provides reference for clinic diagnosis.
In addition to predicting pathogenicity and improving diagnosis, this AR protein stability studies provide new opportunities for the treatment of AIS. Many pathogenic variations might be adequately functional, but are degraded by protein quality control system due to mild instability 62 . For these pathogenic variations, it may be possible to rescue protein function by preventing their recognition or degradation by the protein quality control system 62 . If the degradation of these protein variants is inhibited, it is possible to avoid pathogenicity. In addition, some pathogenic variations may be so unstable that even inhibition of their degradation is not sufficient to rescue their stability and function. These variants can restore protein function by stabilizing the protein with small molecules that bind directly to the unstable protein variants 63 . Small molecules compounds have been shown to rescue function of pathogenic variations, such as in mutants CFTR 64  FoldX A structure-based method for the prediction of free energy changes upon protein variations. Here we used FoldX that exploits both sequence and structural information to predict the protein stability changes upon single point mutation. When predicting the ΔΔG associated with a variation, positive value indicates that the protein is destabilized, and a negative value indicates that the protein is stable.
Construction of AR mutants, protein overexpression and purification. The sequence of human AR cDNA (NCBI accessiong number: CCDS14387.1) encoding 664-920aa was cloned into pET28a expression vector. Site-directed mutagenesis was prepared using the procedure provided by the QuickChange site-directed mutagenesis kit. All constructs were verified by DNA sequencing. In the expression of WT AR and mutants, when the OD 600nm of the culture reached 0.5, DHT (dihydrotestosterone) was added with the final concentration of 30 μM. After 15 min, 0.1 mM IPTG was added to induce protein expression at 16 °C overnight. The cells were collected by centrifugation at 5,000g for 10 min at 4 °C. The centrifuged cells were then resuspended and sonicated in 20 mM Tris buffer containing 500 mM NaCl and 50 mM imidazole (pH8.0). Debris was removed Table 1. Measured folding free energy of AR_LBD WT and mutants. Measured C 1/2 [GdmHCl] , folding free energy (ΔG 0 ), standard error of folding free energy, change of folding free energy (∆∆G), m value, and standard error of m value of AR_LBD WT and mutants.

AR-LBD
Disease m value C www.nature.com/scientificreports/ by centrifugation at 20,000g for 30 min. The supernatant was loaded into a Ni-NTA column and the desired fraction was eluted with 300 mM imidazole. The eluent was loaded into a Superdex 75 column equilibrated with a buffer (20 mM Tris-HCl, pH 7.5, 200 mM NaCl, 1 mM EDTA, 10 mM 2-ME). The collected fractions were concentrated to about 0.5-2 mg/ml for CD measurement. The protein concentration was determined by using the calculated extinction coefficient at 280 nm. circular dichroism (cD) measurements. CD measurement is performed on a Chirascan spectrometer.
All CD measurements were performed in buffer (20 mM Tris-HCl, pH 7.5, 200 mM NaCl, 1 mM EDTA, 10 mM 2-ME). For the GdmHCl induced denaturation experiments, 70 μg/ml of AR-LBD WT and mutant proteins with different GdmHCl concentrations were prepared and equilibrated at 25 °C for 1 h. The CD signal was measured at a path of 0.1 cm, and three independent measurement results were averaged.
two-state analysis of GdmHcl denaturation. A two-state hypothesis was used to fit the denaturation curve of GdmHCl. The folding free energy of AR-LBD WT and mutant proteins without GdmHCl is estimated by a linear extrapolation: where [θ i ] is the ellipticity at the ith gdmHCl concentration, [θ F ] is the ellipticity of the protein completely folded, [θ U ] is the ellipticity of the protein in 5 M GdmHCl. It is assumed that the protein in 5 M GdmHCl has been fully unfolded.
where K i is the folding constant of the monomer protein at the ith GdmHCl concentration, which can be calculated by the folding fraction α i . The free energy of protein folding can be estimated by the following equation: where R is the gas constant, T is the absolute temperature, and K F is the folding constant of monomer protein, which can be calculated by the function Scientific RepoRtS | (2020) 10:12101 | https://doi.org/10.1038/s41598-020-68731-7 www.nature.com/scientificreports/