Structural genomics approach to investigate deleterious impact of nsSNPs in conserved telomere maintenance component 1

Conserved telomere maintenance component 1 (CTC1) is an important component of the CST (CTC1-STN1-TEN1) complex, involved in maintaining the stability of telomeric DNA. Several non-synonymous single-nucleotide polymorphisms (nsSNPs) in CTC1 have been reported to cause Coats plus syndrome and Dyskeratosis congenital diseases. Here, we have performed sequence and structure analyses of nsSNPs of CTC1 using state-of-the-art computational methods. The structure-based study focuses on the C-terminal OB-fold region of CTC1. There are 11 pathogenic mutations identified, and detailed structural analyses were performed. These mutations cause a significant disruption of noncovalent interactions, which may be a possible reason for CTC1 instability and consequent diseases. To see the impact of such mutations on the protein conformation, all-atom molecular dynamics (MD) simulations of CTC1-wild-type (WT) and two of the selected mutations, R806C and R806L for 200 ns, were carried out. A significant conformational change in the structure of the R806C mutant was observed. This study provides a valuable direction to understand the molecular basis of CTC1 dysfunction in disease progression, including Coats plus syndrome.

. Schematic diagram of the shelterin and CST complexes bound to telomeric DNA. TRF1 telomeric repeat binding factor 1; TRF2 telomeric repeat binding factor 2; TIN2 TRF1 interacting nuclear factor 2; TPP1 protein encoded by adrenocortical dysplasia homolog (ACD) gene; POT1 protection of telomerase; CTC1 conserved telomere maintenance component 1; STN1 suppressor of CDC thirteen homolog; TEN1 telomere length regulation protein TEN1 homolog. Prediction of deleterious mutations using sequence-based tools. Deleterious or damaging SNPs in CTC1 were predicted through various tools that are available through different public domains. A brief description of the tools and methods used in sequence-based predictions is given below: SIFT. Sorting Intolerant from Tolerant (SIFT) (http:// sift. jcvi. org/) algorithm is used to determine whether the non-synonymous amino acid substitutions are deleterious or not based on sequence homology and physical properties of the amino acid. If the SIFT score is less than or equal to 0.05, then the mutation is not tolerable 48 . A total of 971 nsSNPs were retrieved for the human CTC1 protein. The effect of these nsSNPs on the protein was predicted using the SIFT tool.
PolyPhen-2. Polymorphism phenotyping-2 (PolyPhen-2) (http:// genet ics. bwh. harva rd. edu/ pph2/) is a sequence-based tool that accepts FASTA file format as input 49 . This tool considers the comparative and physical properties and estimates the damaging probability of the amino acid substitutions. It gives the Position-Specific Independent Count (PSIC) score for the mutant and then calculates the score deviation with the wild-type (WT). If the PSIC score is greater than 0.09, then the non-synonymous mutation is predicted as a deleterious mutation.
PROVEAN. Protein variation effect analyzer (PROVEAN) (http:// prove an. jcvi. org/) was used to identify the damaging missense mutations of CTC1 protein. It estimates the consequence of the mutations on the functionality of the protein 50 . PROVEAN score of less than − 2.5 for an nsSNP is considered deleterious, whereas nsSNPs with a score greater than − 2.5 are considered neutral. All the 971 missense mutations of the Human CTC1 protein were analyzed by the PROVEAN tool.
Mutation assessor. Mutation Assessor (http:// mutat ionas sessor. org/ r3/) is a sequence-based tool that predicts the functional impact of an nsSNP on protein. The mutation assessor result is based on multiple sequence alignment and evolutionarily conserved residues 51 . The tool takes UniProt protein accession or NCBI Refseq protein ID as input for protein sequence and subsequently classifies the mutations as medium, low, or neutral for deleterious effects. The mutation assessor gives FI score for every non-synonymous mutation. If the FI score is more significant than 2.00, then the mutation is considered deleterious.
PON-P2. PON-P2 (http:// struc ture. bmc. lu. se/ PON-P2/) is a machine learning-based classifier for the classification of amino acid substitutions on human proteins 52  www.nature.com/scientificreports/ Prediction of the destabilizing nsSNPs using structure-based tools. The tools and methods used in structure-based predictions are described below: STRUM. STRUM (https:// zhang lab. ccmb. med. umich. edu/ STRUM/) is a tool which calculates the change in ΔΔG between WT and mutant protein 53 . 3D models are generated starting from WT protein by the iterative threading assembly refinement (I-TASSER) simulations and gradient boosting regression method. Physics and knowledge-based energy functions are incorporated on the structures modelled by I-TASSER and used to train STRUM models. One of the unique features of STRUM is that it combines various methods like multiple sequence alignment; some structural profile scores and gives the sequence profile score, which shows the probability of the given amino acid at a mutant position being found in the ensemble of homologous proteins. This tool accepts the FASTA format file as well as the PDB files as an input format. A mutation is destabilizing if the STRUM score, i.e., ΔΔG, is > 0.
MAESTROweb. MAESTRO (https:// pbwww. che. sbg. ac. at/ maest ro/ web) is a multi-agent stability prediction tool that estimates the free energy change on protein unfolding. It calculates the impact of a point mutation on the stability of the protein by calculating the free energy change (ΔG) between the WT and the mutant protein.
MAESTRO takes PDB coordinate files as an input and applies machine learning techniques to calculate the Gibbs free energy change. The quality of the prediction decreases when modelled structures are used as input files. If the score for a mutation is less than 0, then the mutation alters the stability of the protein 54 .
SDM2. Site-Directed Mutator (SDM2) (http:// marid. bioc. cam. ac. uk/ sdm2) calculates the change in protein stability between the WT and the mutant protein 55 . It takes the PDB coordinate file as input and uses environment-specific amino acid substitution tables to estimate the point mutation's protein stability. The updated version of environment-specific amino acid substitution tables is based on new parameters like packing density and residue length. The tool was tested with 2690 amino acid substitutions from 132 different 3D structures of proteins. If the ΔΔG is > 0 for a given non-synonymous SNP, SDM2 predicts it as a destabilizing nsSNP.
mCSM. mCSM is a novel tool to evaluate non-synonymous mutations that uses a graph-based approach to predict destabilizing mutations. The predictive models are trained with the environment derived from the atomic distance patterns of different residues. According to mCSM, the mutational impact of each amino acid residue is linked with atomic distance patterns surrounding that residue. These distant patterns describe the mutated residue's nature in the WT protein. This tool gives a better understanding of the mutations associated with diseases for a range of proteins. A mutation has an altering effect on a protein structure if the mCSM score (ΔΔG) is less than 0 56 .
DUET. DUET (http:// biosig. unime lb. edu. au/ duet/) is an integrated tool to study the effect of nsSNPs on protein stability. It uses Support Vector Machine and integrates the scores of mCSM and SDM and gives the combined value of ΔΔG. DUET combines secondary structure and pharmacophore vector (used by mCSM) and residue relative solvent accessibility (used by SDM) and integrates it with supervised learning. The accuracy of the combined method is verified with the experimental thermodynamic data present in the training dataset. The input for DUET is the PDB structure file along with single point mutation, and this tool gives DUET score as well as mCSM and SDM score in the result 57 .
Identification of Pathogenic nsSNPs. PMut. PMut (http:// mmb. irbba rcelo na. org/ PMut) is used to predict the nsSNPs which are associated with the disease phenotype. The manually curated datasets obtained from Swiss-Prot is used to train the neural network-based classifier of PMut and predicted physicochemical properties and sequence conservation are used as prominent features. The updated version has enabled people to generate their predictors for specific families of proteins. It also gives access to the repository of the preestimated predictions. PMut score for a pathogenic single point mutation is more significant than 0.05 58 .
MutPred2. MutPred2 (http:// mutpr ed. mutdb. org) is a web-based tool that classifies an amino acid substitution as disease-causing or neutral. MutPred2 uses a machine learning-based approach to predict the pathogenicity of a mutation and gives the molecular mechanism of pathogenicity. It also predicts the impact of a mutation in 50 different protein properties. If the MutPred2 score is greater than 0.5 then the mutation is pathogenic 59 .
Analysis of packing density and accessible surface area. Apart from predicting deleterious mutations, the SDM2 tool also calculates the relative side-chain solvent accessibility (RSA), residue depth and residue-occluded packing density (OSP) for the WT and mutant proteins. The newly updated version of SDM2 includes these features mentioned above. It uses environment-specific amino acid substitution tables to estimate residue depth, RSA and OSP for the protein variants. Lee and Richards's method has been used to calculate RSA. For the analysis of structural stability, residue depth and OSP are also very prominent properties of the protein structure 60 .
Analysis of aggregation propensity. Solubility  Analysis of noncovalent interactions. The Arpeggio server calculates the number of interatomic interactions of a protein structure 62 . About 15 types of interatomic interactions can be calculated by Arpeggio. PDB format structure files can be submitted to this server for interaction analysis. Arpeggio assigns atom types to each atom using OpenBabel via SMARTS queries. It finds the nearest-neighbor atoms within a 5Å radial cutoff and a structural interaction fingerprint (SIFt) is given to each pairwise interatomic contact. It gives the number of interactions and provides downloadable tabular data showing different covalent and noncovalent interactions. The interactions can also be visualized through this tool.
Analysis of conserved residues. ConSurf tool (https:// consu rf. tau. ac. il/) was used to determine the degree of conservation of residues in a specific position using multiple sequence alignment 63 . The amino acid conservation is vital to understand evolution and the function and structure of a protein. The ConSurf tool uses the empirical Bayesian method or maximum likelihood (ML) to calculate the degree of conservation of each residue. The ConSurf score ranges between 1 and 9, where 1 is the score for most minor conserved positions, 5 is the score for intermediate conserved positions, and nine is for highly conserved positions. The ConSurf-DB also stores the pre-estimated scores for known PDB structures. The buried residues with a high degree of conservation are considered structural residues and the exposed residues with a high degree of conservation are considered functional residues. The computational approach and tools used in the mutational analysis are represented in Fig. 3.

MD simulations.
All-atom MD simulation for 200 ns was carried out on CTC1-WT, R806C, and R806L.
MD simulation was performed under explicit solvent conditions at 300 K using the GROMACS 5.1.5 package while utilizing the GROMOS96 43a1 force field as described earlier [64][65][66] . The solvation was carried out in a cubic box filled with water with a dimension of 10 Å. Appropriate numbers of Na + and Cl − ions were added to all three systems for neutralization while utilizing the genion tool in Gromacs. Energy minimization was carried out using 1500 steps of the steepest descent method to remove any steric clashes in the systems. Equilibration was carried out at 300 K using the two-step ensemble process (NVT and NPT) for 100 ps. The final MD run on each system was carried for 200 ns and a leap-frog integrator was used for the production of the time-evolution trajectories. A multi-tier approach was employed to identify the structural and functional consequences of the non-synonymous mutations on the CTC1 gene. Sequence-based and structure-based approaches have been employed to www.nature.com/scientificreports/ obtain the high confidence deleterious nsSNPs. All the nsSNPs were subjected to sequence-based analysis using five web-based tools: SIFT, PolyPhen2, PROVEAN, Mutation assessor, and PON-P2. For the structure-based approach, STRUM, MAESTRO-web, SDM2, mCSM and DUET were used to analyze the 126 nsSNPs in the OB-fold region of CTC1. Disease phenotype identification of the high confidence nsSNPs had been made using MutPred2 and PMut web server. Figure 3 depicts an overview of all the computational approaches used in this study. We have also employed other approaches, namely analysis of packing density and accessible surface area, analysis of aggregation propensity and degree of amino acid conservation of the OB-fold of the CTC1 protein.

Result and discussion
Identification of deleterious nsSNPs using sequence and structure-based approaches. Multiple tools were used to predict deleterious mutations. Because using a single tool can provide some false positives. The use of more tools can eliminate false predictions, and it may provide more accurate results. For the sequence-based approach, SIFT, PolyPhen2, PROVEAN, Mutation assessor and PON-P2 were used (Fig. 4). The SIFT web tool considers the physical properties and classifies the nsSNPs into tolerated (non-damaging) and intolerant (damaging) mutations. A higher value of tolerance index implies a low functional impact of a mutation on the protein and vice versa 48 . PolyPhen2 is another tool that also uses the amino acid sequence to determine damaging mutations. PolyPhen2 quantifies the non-synonymous mutations in three categories: possibly damaging (score > 0.2 and < 0.96), probably damaging (score > 0.96), and benign (score < 0.2). To improve the confidence level, three other tools PROVEAN, Mutation assessor and PON-P2 were used. PROVEAN uses a clustering approach where BLAST hits from the query sequence are used to form clusters, of which around 30 are chosen to generate a prediction. A delta alignment score is calculated for each sequence in a cluster, and the score is averaged to generate a final and default PROVEAN score. PROVEAN score less than − 2.5 signifies a deleterious mutation. The mutation assessor uses conservation scores. Conserved regions are determined from multiple sequence alignment of the query sequences. A conservation score is generated for each region, which characterizes the functional impact of the substitution. A score of more than 3.5 is considered deleterious, a score between 2.0 and 3.5 is probably deleterious, and a score between 0.5 and 2.0 is considered normal. PON-P2 classifies the amino acid variants into pathogenic, neutral and unknown categories using evolutionary conservation and physical and biochemical properties of amino acid. The disease-causing mutation also alters the stability of a protein. A protein is either in a folded or unfolded form. In thermodynamics, the energy difference (Gibbs free energy) between folded and unfolded (G u ) protein can be calculated as ΔG = G u − G f . The change of protein stability and free energy landscape is calculated as ΔΔG = G m − G w , where G m is the mutant protein and G w is the WT protein. More negative ΔΔG value implies more stabilizing mutation and a positive ΔΔG value depicts destabilizing mutations 53 .
Our study has used five different structure-based stability predictors: STRUM, MAESTROweb, SDM2, mCSM and DUET. All of the tools use the PDB structure file of the WT protein as an input. Using the atomic coordinates, they determines the stability of the variants by calculating the folding free energy. Most of these tools use a machine learning-based approach combining various functional genomics approaches and estimate impact of mutations on the structure and stability of protein 41,67,68 .  Table S1). Out of the 126 mutations which lie in the C-terminal OB-fold of the protein predicted deleterious mutations by SIFT, PolyPhen2, PROVEAN, PON-P2 and Mutation Assessor were 38 (30.16%), 30 (31.25%), 53 (42.06%), 3 (2.39%) and 73 (57.94%), respectively (Fig. 5).
This study only focuses on the OB-fold region of human CTC1 protein, further analysis was done only for the missense mutations in this region. Out of the 126 nsSNPs of hCTC1 OB structure-based prediction by STRUM, MAESTROweb, SDM2, mCSM and DUET showed 125 (99.20%), 108 (85.71%), 81 (64.29%), 113 (89.68%) and 94 (74.6%) missense mutations as destabilizing mutations (Supplementary Table S2, Fig. 6). For further study, we have only collected those mutations which are predicted to be deleterious by three different sequence-based tools and four different structure-based tools to increase the confidence level. After filtering out by this approach, 75 (59.52%) mutations were collected are predicted as deleterious and destabilizing by both sequence-based and structure-based approaches. This 75 nsSNPs were then analyzed for disease phenotype association.  www.nature.com/scientificreports/ Identification of pathogenic nsSNPs. We have predicted the disease association of non-synonymous mutations using the PMut and MutPred web servers. These two methods find the disease phenotypes and classify the mutations into pathogenic or benign based on the pathogenicity score. Out of a total of 75 high confidence nsSNPs obtained from sequence and structure-based analysis, PMut and MutPred predicted 12 (16%) and 23 (30.67%) nsSNPs as pathogenic, respectively. Out of 75 high confidence nsSNPs, only 11 mutations (S730R, S730G, R731W, R744G, G767R, F800C, R806C, R806L, W807C, R818L, and L860P) were identified as pathogenic from both the disease phenotype prediction tool. The further study focuses on these 11 mutations out of the 75 mutations (Supplementary Table S3-S4). MutPred predicted the pathogenicity of these variations. According to MutPred, the mutations (S730R, S730G, R806C, R806L, W807C, and R818L) showed loss of strand. The mutations S730R, R731W and R744G show gain of helix, strand and loop. The variations F806C, R806C, R806L, W807C, and R818L also alters the ordered interface of CTC1 protein. G767 shows a gain of ADPribosylation at that position.

Analysis of evolutionarily conserved residues.
Analyzing the conservation of amino acid residue in the protein structure can understand the importance of an amino acid residue and discloses its localized evolution 42,69 . The structural integrity of a protein also depends on the conserved residues. The tendency of an amino acid to mutate depends upon the degree of conservation. The OB-fold region of the human CTC1 protein was analyzed with the ConSurf tool for obtaining the degree of conservation of the residues. The ConSurf analysis revealed that amino acid residues between 728 to 745, 792 to 820 and 850 to 861 are highly conserved than the other residues. It was also revealed that most of the highly conserved residues are buried (Fig. 7).
Analysis of aggregation propensity. The function of a protein is greatly influenced by its solubility.
Insoluble parts of a protein try to form aggregates which can cause diseases like Alzheimer's, amyloidosis, and Parkinson's diseases 70,71 . SODA (Solubility based on Disorder and Aggregation) was used to calculate the solubility of the protein variants to find the disease association 61 . SODA calculates the aggregation, disorder, helix, and strand propensity which arise due to the mutations. Out of the 11 mutations obtained from disease phenotype prediction, five nsSNPs decrease the solubility of the protein, whereas the other six mutations increase the solubility of the protein ( Table 1).

Analysis of noncovalent interactions. Previous reports on the mutational analysis demonstrated that
the effect of nsSNPs on the stability of the protein depends upon the changes in hydrophobic contacts. We have calculated the van der Waals, hydrogen bonding, electrostatic, and hydrophobic interactions in WT CTC1 www.nature.com/scientificreports/ and its mutants with the Arpeggio web server's help. An increase and decrease in the number of bonds show mutations in the local and global environment. The 11 nsSNPs found to be pathogenic were analyzed using the Arpeggio webserver (Table 2). Finally, out of 11 mutations, based on the solubility, aggregation propensity, SODA score and other harmful properties, R806C and R806L along with WT CTC1 were selected for MD simulation study. Both R806C and R806L showed less solubility, high aggregation propensity and were predicted to be deleterious by all of the structure-based tools with the SODA score of − 40.10 and − 47.17, respectively. Protein misfolding and aggregation are involved in many disease progression, including neurodegenerative diseases [72][73][74][75][76][77][78] . MutPred2 analysis also showed that both the mutations contribute to loss of strand and alters the ordered interface of CTC1. MD simulations. MD simulation was utilized to study the disorderly impact of R806C and R806L on CTC1 conformation. The superimposed structural snapshots of CTC-WT, R806C and R806L taken at every 50 ns during the simulation are shown in Fig. 8. No significant difference is observed in mutants' structures when compared with WT except the loop regions, which become more flexible during the simulation in R806C. Interestingly, loss of the N-terminal helix was observed in both mutants during the early phase of the simulation, while it seems to disappear after 100 ns in the case of WT. Decrement was observed in the secondary structure content of mutants as compared to WT. A significant change was observed in the initial and final conformation of all three systems with RMSD calculated in PyMOL as 1.58 Å, 1.96 Å and 1.43 Å for CTC1-WT, R806C and R806L, respectively (Fig. 8).
The time evaluation RMSD plot CTC1-WT, R806C, and R806L is represented in Fig. 9A. The result indicates a significant shift in the mutants' RMSD, especially in the case of R806C. We did not find any shift transition in the case of R806L while comparing with WT (Fig. 9A). It was observed that R806L was more stable and showed compactness with RMSD below ~ 3 Å, closely followed by WT (~ 3.5 Å RMSD). For R806C, the average RMSD was observed to be ~ 4 Å and the structure showing the unstable distribution of RMSD throughout the trajectory. The RMSD of R806C showed a sharp shift up to 6.5 Å suggesting unfolding transition of the CTC1 conformation upon mutation. During the RMSF analysis, the residual fluctuation pattern was similar in CTC1-WT and R806L, with two significant peaks in the 780-790 and 845-855 aa regions. While in the case of R806C, the www.nature.com/scientificreports/ residual fluctuation showed a higher tendency with a few random peaks in the 770-790, 820-830 and 865-875 aa regions (Fig. 9B).
To further investigate the compactness and stability of CTC1-WT and its mutants, we have calculated the Radius of gyration (R g ) of all three systems and plotted as a function of time (Fig. 9C). Here we found that there is no significant difference in the average R g values of WT, R806C and R806L. However, a slight increment can be observed after 160 ns in R g in the case of R806C, suggesting a loss in compactness, as RMSD suggested (Fig. 9C).

Conclusion
SNPs are considered as one of the most frequent genetic variants associated with several human diseases. Extensive analysis of SNPs can give insights to understand disease-causing mechanisms and help find effective treatments of diseases. In the present study, we have analyzed the nsSNPs of the CTC1, specifically the C-terminal OB-fold region. Sequence and structure-based analysis have shown that 126 mutations present in the C-terminal OB-fold of CTC1 where 75 mutations were found to be deleterious and destabilizing. A pathogenicity study revealed that 11 out of all the mutations are pathogenic. Although the RMSD calculation could not give any conclusive result, aggregation propensity analysis showed that almost 45% of the pathogenic mutations present in the C-terminal OB-fold of CTC1 tend to form aggregates or become less soluble. Pathogenicity of these mutations may occurred due to structural changes caused by the gain or loss of noncovalent intramolecular forces.  www.nature.com/scientificreports/ MD simulation analyses, especially RMSD indicated a significant conformational loss in CTC1 protein structure due to R806C mutation. The results provide an in-depth understanding of the conformational behavior of CTC1 upon mutations. This study offers a detailed insight to understand the pathogenic nsSNPs of C-terminal OB-fold of CTC1 and the possible consequences of these mutations. These insights may be further used to build therapeutic strategies to cure CTC1 associated diseases.