Phenotype Prediction of Pathogenic Nonsynonymous Single Nucleotide Polymorphisms in WFS1

Wolfram syndrome (WS) is a rare, progressive, neurodegenerative disorder that has an autosomal recessive pattern of inheritance. The gene for WS, wolfram syndrome 1 gene (WFS1), is located on human chromosome 4p16.1 and encodes a transmembrane protein. To date, approximately 230 mutations in WFS1 have been confirmed, in which nonsynonymous single nucleotide polymorphisms (nsSNPs) are the most common forms of genetic variation. Nonetheless, there is poor knowledge on the relationship between SNP genotype and phenotype in other nsSNPs of the WFS1 gene. Here, we analysed 395 nsSNPs associated with the WFS1 gene using different computational methods and identified 20 nsSNPs to be potentially pathogenic. Furthermore, to identify the amino acid distributions and significances of pathogenic nsSNPs in the protein of WFS1, its transmembrane domain was constructed by the TMHMM server, which suggested that mutations outside of the TMhelix could have more effects on protein function. The predicted pathogenic mutations for the nsSNPs of the WFS1 gene provide an excellent guide for screening pathogenic mutations.

For further study, we used PhD-SNP and MutPred to investigate whether these 156 filtered deleterious and damaging nsSNPs were associated with disease. PhD-SNP is optimised to classify disease-causing point mutations from the given datasets, and MutPred is also a web application tool developed to classify an AAS as either disease-associated or neutral in humans but also predicts the molecular cause of disease/deleterious AASs. Of the 156 nsSNPs, 97 diseased-associated nsSNPs were predicted by PhD-SNP and 91 nsSNPs were predicted to be disease-associated by MutPred tools. But it is worth noting that some of the 28 mutations with scores of 0.000 for SIFT and 1.000 for Polyphen-2 in Table 1 like P346L, Y351C, G834S or L842F were not predicted as diseased-associated by both PhD-SNP and MutPred, this might be because the loci of these amino acid were conserved, but the mutants on these loci could not cause the molecular changes or affect the whole protein structure. Finally, 70 nsSNPs were predicted to be diseased-associated using both PhD-SNP and MutPred, in which the numbers of mutations predicted as very confident hypotheses, confident hypotheses and actionable hypotheses were 16, 33 and 21, respectively. The most common changes in the molecular mechanisms in the mutants predicted by MutPred were gains or losses of helixes and sheets. Representative diseased-associated nsSNPs and the corresponding AAS of nsSNPs in the WFS1 gene are provided in Table 2. After inspecting these mutations in their reference sources, most of the nsSNPs predicted have also been reported, demonstrating that the nsSNPs predicted were credible from multiple computational methods. Finally, we predicted 20 mutations (F329I, S353C, R375H, R375C, E394K, F439C, R517P, L594R, P607L, S662P, T665I, R732C,  R732H, G736D, Y739D, C742R, R832C, R859W, R868C and A874T) to be potentially pathogenic mutations, and 50 other mutations had been previously published or cited (Table 2).
Additionally, to better understand how the pathogenic nsSNPs affect protein conformation and result in disease states, we constructed wild type and mutant proteins via the Robetta and SWISS-MODEL tools (Fig. 1, Supplementary file 1-4). And the geometric evaluations of the modeled 3D structure were performed using PROCHECK by calculating the Ramachandran plot (Fig. 2). The wild type protein showed 99.4% of residues in most favoured and allowed region and the overall average of G factors was 0.27 which showed the structure was usual. In this step, we randomly selected three predicted nsSNPs (P292S, S443I and G695V) that have been reported to be pathogenic 6  between the wild type and mutant proteins. We observed that after mutation, not only did the amino acid change, but it also affected the entire protein structure. All of the three protein structures (P292S, S443I and G695V) representing different mutations gained or lost some α -helixes, suggesting a potential molecular mechanism resulting in WS.

Amino acid distribution in the transmembrane domain.
To elucidate the amino acid distributions and significances of predicted pathogenic nsSNPs in wolframin, we constructed its transmembrane domain using the TMHMM server v2.0 (Fig. 3). In this analysis, the transmembrane domain of wolframin was divided into 9 TMhelixes, with each TMhelix being approximately 23-amino acids long. Except for the third and seventh TMhelix, 18 pathogenic mutations were distributed across the other seven TMhelixes, accounting for 25.71% of all 70 pathogenic mutations, of which 13 were previously known. Notably, most pathogenic mutations in our study were not located in the transmembrane domain but in the C-terminal domain of wolframin (Table 3). In all 70 pathogenic mutations, approximately 52 were not located in the TMhelix (74.29%), 39 of which were located in the C-terminal domain. Thirty-seven pathogenic mutations have been previously reported in the 52 mutations not located in the TMhelix, and only 15 mutations were predicted to be potentially pathogenic.

Discussion
WS is a rare autosomal recessive disorder with a number of loss-of-function mutations of the WFS1, both within and between most affected patients/families. Wide tissue distribution of wolframin and many mutations in WFS1 resulting in WS may contribute to different phenotypes. Growing evidences have presented many clinical signs and possible correlations between the genotype and the development of the neurologic manifestations, the age at onset of diabetes mellitus, hearing defects, and diabetes insipidus in WS on the cohort of WS patients 22,23 . So far, although a large number of variants of the WFS1 gene have been identified, novel mutations are continuously found in this gene. Furthermore, the pathogenic role of different mutations, polymorphisms and sequencing variants of the gene remains largely unknown. Phenotypic prediction of the effects of nsSNPs might identify meaningful changes in genes that alter protein function to induce phenotypic consequences. The sheer number of SNPs in online databases provides an abundant resource to predict the phenotypic effects of nsSNPs, and known pathogenic mutations from the literature provide us an opportunity to inspect prediction accuracy, which indicates whether the relationships between nsSNP prediction results and known pathogenic mutations are confirmed by in vivo and in vitro experiments.
In the present study, we predicted 20 potentially pathogenic mutations and 50 known pathogenic mutations using in silico methods, and combined the results of the most common changes by MutPred and the predictions of the three protein structures by the SWISS-MODEL to determine that the most probable mutational effects causing WS might be the gains or losses of α -helixes. It is worth to consider that some predicted pathogenic nsSNPs have been confirmed by in vitro functional studies and genetic analysis for WS families, which could indirectly verify the accuracy of our methods. For example, p.P724L(c.2171C> T) and p.G695V(c.2084G> T) of WFS1 have been reported to lead to WS and which cause the formation of detergent-insoluble aggregates of wolframin when was expressed in COS-7 cells 24 ; the p.A684V(c.2051C> T) and p.L511P (c.1532T> C) were ectopically expressed in HEK293 cells which showed reduced protein levels compared to wild type wolframin, strongly indicating that the mutation is disease-causing 21,25 . Meanwhile, by direct DNA sequencing and linkage analysis, p.L804P (c.2411T> C) and p.R859P (c.2576G> C) were identified after screening the entire coding region of the WFS1 gene in a Chinese WS family and in a US family with the nonsyndromic hearing loss, respectively 26,27 . WFS1 spanning approximately 33.4 kb of genomic DNA, consists of eight exons and produces a peptide product which is 890-amino acid long (wolframin). The amino acid distribution results of wolframin suggest that wolframin contains 9 transmembrane domains. These results are consistent with the previous research which provides experimental evidence that wolframin contains 9 transmembrane segments and is embedded in the membrane in an Ncyt/Clum topology 15 . However, the prediction for wolframin available at UniProt database gives 11 transmembrane domains (http://www.uniprot.org/uniprot/O76024) (Table 4), and the difference between the two predicted results was mainly in the TMhelix 5, TMhelix 6 and TMhelix 11. In our result, the 493-515 amino acids are located in TMhelix5; while in UniProt, this region has been divided into TMhelix 5 and TMhelix 6 domains, respectively; the 653-890 amino acids have also been predicted as two TMhelixes in the same way in the UniProt. With reference to most researches, the wolframin were considered as 9 transmembrane domains with some evidences, and this is due to the differences in the execution of algorithm. Additionally, our results also indicate that the mutations outside of the TMhelix could have more pronounced functional effects, especially in the C-terminal with 39 predicted mutations. Many of the reported missense mutations are located in the C-terminal hydrophilic part of the protein 15 , and the experiments also support these predictions. Just as de Heredia et al. found that besides the transmembrane domains, the mutations identified in WS patients also concentrate in the last 100 amino acids in the C-terminal 1 . Using yeast two-hybrid  Table 2. Diseased-associated nsSNPs of WFS1 predicted using the PhD-SNP and MutPred servers.

*
In the SNP ID column, the nsSNPs with the prefix "rs" are from dbSNP, and those with the prefix "CM" and "WFS1_" are from HGMD and Locus Specific Database, respectively, and the remaining with no SNP ID are in the Deafness Variation Database.The nsSNPs highlighted in bold are potential pathogenic nsSNPs which have not been reported. analysis, Zatyka et al. identified that the C-terminal domain of wolframin, which is positioned in the ER lumen, bound the C-terminal domain (amino acids 652-890) of the ER-localized Na + /K + ATPase beta-1 subunit (ATP1B1) 28 . And the Na + /K + ATPase deficiency has a crucial role in apoptosis and in neural degenerative disease which can be induced by mutations in WFS1, leading to the development of WS 29 .
In summary, we used extensive functional and structural level analyses to predict potentially pathogenic mutations for nsSNPs in the WFS1 gene and analysed the amino acid distributions of wolframin to provide a guide for screening pathogenic mutations and investigating the function of wolframin.   Furthermore, we provide information for predicting the effects of nsSNPs in genes encoding transmembrane proteins and for further research in variant effect prediction.

Materials and Methods
Dataset collection. NsSNP datasets of the WFS1 gene were obtained from the NCBI dbSNP database (http://www.ncbi.nlm.nih.gov/projects/SNP/) 30  Filtering and mining of nsSNPs. Because SNPs from the databases were not initially nsSNPs, we needed to perform some manual filtering. In this process, we eliminated SNPs in 3′ or 5′ UTRs and synonymous SNPs. For prediction and analysis, SNP ID, gene name, protein accession, amino acid residue 1 (wild type), amino acid position, and amino acid residue 2 (missense) for all nsSNPs were collected from the NCBI dbSNP database, HGMD, and Deafness Variation Databases.
Predicting the phenotype of nsSNPs with the SIFT and PolyPhen-2 tools. After filtering the nsSNPs, we predicted their functional effects with the SIFT (http://sift-dna.org) and PolyPhen-2 (http:// genetics.bwh.harvard.edu/pph2/) tools. In SIFT server, a highly conserved position is more likely to be deleterious with a SIFT score < 0.05, whereas a tolerant mutation will have a SIFT score > 0.05 32,33 .
PolyPhen-2 extracts various sequence-and structure-based features of the substitution site and inputs them into a probabilistic classifier based on a given AAS and protein accession. The mutation is appraised qualitatively, as benign, possibly damaging, or most likely damaging 34 .
Identifying disease-associated nsSNPs using the PhD-SNP and MutPred tools. PhD  in a 20-element vector that is − 1 in position relative to the wild type residue, 1 in the position relative to the mutant residues and 0 in the remaining 18 positions. Next, a second 20-element vector encoding the sequence environment is constructed to report the occurrence of residues in a window of 19 residues around the mutated residue. With this supervised learning approach, a given mutation is classified as disease or neutral 35,36 . MutPred is based on SIFT scores, the gain or loss of 14 different structural and functional properties. Two important scores are contained in the output of MutPred: a general score (g), and top 5 property score (p). The general score (g) indicates the probability that the AAS is deleterious/disease-associated, whereas the top 5 property score (p) is the P-value that indicates whether certain structural and functional properties are affected. The combinations of high general scores and low property scores are referred to as actionable hypotheses, confident hypotheses, and very confident hypotheses 37 . Protein structure prediction of pathogenic nsSNPs via Robetta and SWISS-MODEL tools. As the structure of wolframin is not available and there is not suitable template for modelling, so we choose the Robetta server (http://robetta.bakerlab.org/) to construct the protein structure. The Robetta server is a full chain protein structure prediction server for ab initio and comparative modeling, and the SWISS-MODEL (http://swissmodel.expasy.org/) is a fully automated, dedicated protein structure homology-modelling server 38,39 . The amino acid sequence of wolframin was retrieved from NCBI (accession number: NP_005996.2). 3D-structure of wolframin was performed using Robetta server. And the mutant proteins were constructed by SWISS-MODEL with the template performed using Robetta server (Sup.file S). The quality of the modelled structure of native and mutant protein was evaluated by the PROCHECK (http://services.mbi.ucla.edu/SAVES/). Analysis of the transmembrane domain by the TMHMM server v2.0. TMHMM server v2.0 (http://www.cbs.dtu.dk/services/TMHMM/), based on a hidden Markov model (HMM) with an architecture that corresponds closely to the biological system, is a membrane protein topology prediction method. Compared with other servers, TMHMM server v2.0, which is thought to be currently the best  performing transmembrane prediction program, can model and predict the location and orientation of alpha helices in membrane-spanning proteins with high accuracy 40 .