Abstract
G-protein-coupled receptors (GPCRs) are found in a wide range of organisms and are central to a cellular signaling network that regulates many basic physiological processes. GPCRs are the focus of a significant amount of current pharmaceutical research because they play a key role in many diseases. In this paper, we predict the functional nonsynonymous single nucleotide polymorphisms (nsSNPs) in human GPCRs by defining optimal attributes and using a decision tree method. The predictive power of each attribute was evaluated. An optimal subset of attributes was obtained using the decision tree method combined with a genetic search algorithm. The subset contains both sequence-based and structure-based information: the conservation score, the location of the mutation, the BLOSUM62 substitution matrix score, the hydrophobicity change, the solvent accessibility, and the buried charge. Seven important rules were derived from the decision tree. A total of 166 functional nsSNPs in human GPCRs from dbSNP were predicted using the optimal attributes subset.
Introduction
G-protein-coupled receptors (GPCRs) are the largest protein superfamily in most mammalian genomes. Despite their great diversity in terms of sequence composition, all GPCRs share a common protein structure. An N-terminal extracellular domain of variable length is followed by seven transmembrane (TM) helices, which are connected by three intracellular loops (ICL) and three extracellular loops (ECL); the last helix terminates in a C-terminal intracellular domain (Gether 2000). These receptors are plasma membrane-bound and can respond to a large number of extracellular signals from nucleotides, peptides, amines and hormones (Sakmar 1998). Upon recognition of these ligands, GPCRs act through G proteins in signaling pathways that influence almost all physiological functions. As such, pharmacologic agents that antagonize GPCR-mediated signaling are common. Indeed, more than one-third of all known small-molecule drugs target GPCRs (Marinissen and Gutkind 2001; Howard et al. 2001). Thus, small genomic-level differences in GPCRs may explain why individuals respond differently to the same drug, and they can be used to tailor drugs to an individual’s genetic makeup (Drysdale et al. 2000; Phillips et al. 2001; Roses 2004).
Single nucleotide polymorphisms (SNPs) are defined as single base variations in sequence that occur at a frequency of at least 1% and may directly explain the pathogenesis of disease. SNPs in protein-coding exons are classified as synonymous or nonsynonymous (nsSNPs) according to whether or not they alter the protein sequence. Some nsSNPs can affect gene function through their effects on the structure or function of the encoded protein. Several recent studies have investigated how to determine whether nsSNPs are functional or neutral using protein sequence and structural information. Empirical rules identifying detrimental nsSNPs were derived based on structural information (Wang and Moult 2001). An algorithm named SIFT, which is based on sequence conservation and scores from position-specific scoring matrices, was developed to identify amino acid changes that are likely to affect the function of a protein (Ng and Henikoff 2001). Structural and sequence information was then combined and used to predict functional nsSNPs (Chasman and Adams 2001; Sunyaev et al. 2001; Ramensky et al. 2002; Saunders and Baker 2002; Krishnan and Westhead 2003; Bao and Cui 2005); however, bovine rhodopsin (BR) is the only GPCR whose three-dimensional structure has been resolved (Palczewski et al. 2000). Balasubramanian et al. (2005) predicted disease-causing nsSNPs in GPCRs based on sequence information using logistic regression methods. However, some nsSNPs influence protein function without causing inherited diseases. Here, we aimed to predict the functional nsSNPs in human GPCRs from dbSNP based on the optimal sequence and structural information using a decision tree method.
Materials and methods
Datasets
The GPCRDB has an extensive collection of point mutations that have been compiled from the literature using the MuteXt automated extraction method (Horn et al. 2003, 2004). We analyzed the functions of these mutated residues and collected those mutations that changed receptor function or structure in order to include them in our training set.
To derive a dataset of neutral mutations, proteins that have >90% sequence identity to the target GPCRs were extracted from GPCRDB. For each target protein, only one ortholog was chosen from each species based on the best match to the target protein, and multiple sequence alignments (MSA) were performed on the target protein and its homologs. Amino acid variations at any position in the MSA were considered to be neutral variations (Sunyaev et al. 2001; Balasubramanian et al. 2005). The logic behind this assumption is that variations in highly homologous sequences between species are generally neutral and are highly unlikely to be deleterious, because detrimental changes will be selectively removed during the course of evolution. There may be cases, however, in which a variation is functional in one species but not in others. In total, the training dataset contained 750 functional mutations and 1,345 neutral changes in 72 receptors.
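The collection of neutral variations from an alignment can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the target and its orthologs are already aligned to equal length with "-" as the gap character, and the function name neutral_variations and the example sequences are hypothetical.

```python
def neutral_variations(target, orthologs):
    """List (position, target_residue, ortholog_residue) differences.

    Positions are 1-based and count only non-gap residues of the target,
    so they index the target protein rather than the alignment column.
    """
    variations = set()
    position = 0
    for column, target_residue in enumerate(target):
        if target_residue == "-":
            continue  # insertion relative to the target: no target position
        position += 1
        for sequence in orthologs:
            residue = sequence[column]
            if residue != "-" and residue != target_residue:
                variations.add((position, target_residue, residue))
    return sorted(variations)


# Hypothetical aligned sequences, for illustration only.
example = neutral_variations("MK-LV", ["MR-LV", "MKALI"])
print(example)
```

Each returned triple corresponds to one amino acid change that would enter the training set as a "no effect" example under the assumption described above.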
The nsSNPs in human GPCRs with known ligands were extracted from dbSNP. First, the corresponding gene of each human GPCR was found in Swiss-Prot, and then dbSNP was searched for nsSNPs using the name of each gene. The opsins, olfactory and taste receptors were excluded since they are not drug targets. In all, 519 nsSNPs were identified.
Attributes
The sequence-based and structure-based attributes of amino acid polymorphisms that may serve as generalized predictors of effects on function were chosen from the literature. The sequence-based attributes that were used in the prediction of functional nsSNPs were the sequence conservation score at the mutated position, the physicochemical changes (mass, hydrophobicity, volume) between the wild-type residues and mutated residues, and substitution matrix scores from the BLOSUM62 and PAM120 matrices. Three structure-based attributes were also considered: the location of the mutation (whether or not it was in the TM regions), the solvent accessibility, and the buried charge.
The sequence conservation score was calculated in two steps using the software program AL2CO (Pei and Grishin 2001). First, an independent count-based weighting scheme was used to estimate the amino acid frequencies. The conservation score was then calculated from these frequencies using an entropy-based method (Shannon 1948). The MSA files of the subtype families containing the target proteins were extracted from the GPCRDB. Each family has a different number of receptors, and as the number of sequences in a family varies, the level of conservation of each position changes, and thus the average conservation score changes (Armon et al. 2001). In order to diminish this effect, the conservation score was normalized to a z-score, which was calculated by subtracting the mean conservation score from the conservation score and dividing by the standard deviation.
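The two steps described above, an entropy-based score per alignment position followed by z-score normalization, can be illustrated with a short sketch. It is a simplification rather than a reimplementation of AL2CO: residue frequencies are unweighted counts (the independent-count weighting scheme is omitted), gaps are ignored, and the population standard deviation is used because the text does not say which variant was applied. The function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def column_conservation(column):
    """Negative Shannon entropy of one alignment column (gaps ignored).

    The value is 0 for a fully conserved column and becomes more
    negative as the column becomes more variable.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return sum((n / total) * math.log(n / total) for n in counts.values())


def zscore_conservation(alignment):
    """Conservation score of every column, normalized to z-scores."""
    columns = list(zip(*alignment))  # assumes equal-length aligned sequences
    raw = [column_conservation(column) for column in columns]
    mu, sigma = mean(raw), pstdev(raw)
    if sigma == 0:
        return [0.0 for _ in raw]
    return [(score - mu) / sigma for score in raw]


# Hypothetical three-sequence alignment, for illustration only.
scores = zscore_conservation(["ACD", "ACE", "AGF"])
print(scores)
```

With this sign convention a high z-score marks a position that is more conserved than the family average, which matches the way conservation scores are interpreted in the Results.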
The hydrophobicity of the amino acids was evaluated using the Kyte–Doolittle hydrophobicity scale (Kyte and Doolittle 1982). Average residue volumes were taken from Harpaz et al. (1994). The mass, hydrophobicity and volume changes were taken as the absolute values of the differences between the wild-type residue and the mutated residue. The TM regions of a protein were taken from the Swiss-Prot database entry for each protein. If no information was available, the TMHMM program was used to predict the TM regions of the receptors (Krogh et al. 2001). The location attribute was set to 1 if the wild-type residue lay in a TM region and to 0 otherwise. The solvent accessibility was predicted by PHD (Rost and Sander 1993, 1994). Relative accessibility was grouped into three states: buried (<9%), intermediate (9–36%) and exposed (≥36%). The wild-type residue was deemed to be a buried charge if it was K, R, D, E or H and its solvent accessibility was in the buried state (Krishnan and Westhead 2003).
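The attribute definitions above translate almost directly into code. The sketch below is illustrative only: the function names and the representation of TM segments as (start, end) pairs are assumptions, and a relative accessibility of exactly 36% is assigned to the exposed state because the stated ranges overlap at that value.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGED_RESIDUES = set("KRDEH")


def hydrophobicity_change(wild_type, mutant):
    """Absolute difference in hydropathy between the two residues."""
    return abs(KYTE_DOOLITTLE[wild_type] - KYTE_DOOLITTLE[mutant])


def accessibility_state(relative_accessibility):
    """Map relative solvent accessibility (percent) to one of three states."""
    if relative_accessibility < 9:
        return "buried"
    if relative_accessibility < 36:
        return "intermediate"
    return "exposed"  # assumption: 36% itself counts as exposed


def is_buried_charge(wild_type, relative_accessibility):
    """True if the wild-type residue is K, R, D, E or H and is buried."""
    return (wild_type in CHARGED_RESIDUES
            and accessibility_state(relative_accessibility) == "buried")


def location_flag(position, tm_segments):
    """1 if the residue position falls inside any TM segment, else 0."""
    return int(any(start <= position <= end for start, end in tm_segments))
```

For example, an isoleucine-to-arginine substitution gives the largest possible hydropathy change on this scale, and an aspartate with 3% relative accessibility would be flagged as a buried charge.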
Decision tree
Decision tree learning is a means of approximating discrete-valued target functions in which the learned function is represented by a decision tree. It has been shown to perform well in homogeneous cross-validated training datasets (Krishnan and Westhead 2003). Here we used the C4.5 decision tree algorithm developed by Quinlan (1993), run as the J48 decision tree classifier in the Weka machine learning workbench (Witten and Frank 2000; Frank et al. 2004). The default set of parameters and tenfold cross-validation were used in the predictions. The decision tree not only provides a prediction but also yields an estimate of the probability that a prediction from the rule is correct. Each rule was derived from the training dataset, and the estimated accuracy was used to assign a confidence level to the prediction. Rules with estimated accuracies of x% were taken to have a confidence level of x/100. Another measure of a rule was its “cover,” the number of mutations in the training dataset that conform to the rule. A rule with a small cover is met by only a few mutations in the training dataset and is therefore not representative. In this paper, we used 30 as the cover threshold.
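The notions of confidence and cover can be made concrete with a small sketch. It is an assumption-laden illustration: the Rule class, the example rules and their counts are hypothetical, and the confidence is computed as the raw fraction of correctly classified training mutations, which stands in for the estimated accuracy reported by the decision tree software.

```python
from dataclasses import dataclass

COVER_THRESHOLD = 30  # cover threshold stated in the text


@dataclass
class Rule:
    description: str  # human-readable condition of the rule
    cover: int        # training mutations that conform to the rule (> 0)
    correct: int      # conforming mutations that the rule classifies correctly

    @property
    def confidence(self) -> float:
        """Accuracy of x% expressed as a confidence level of x/100."""
        return self.correct / self.cover


def important_rules(rules, threshold=COVER_THRESHOLD):
    """Keep only the rules whose cover exceeds the threshold."""
    return [rule for rule in rules if rule.cover > threshold]


# Hypothetical rules and counts, for illustration only.
rules = [
    Rule("hypothetical rule A", cover=200, correct=192),
    Rule("hypothetical rule B", cover=12, correct=12),
]
kept = important_rules(rules)
for rule in kept:
    print(rule.description, rule.cover, round(rule.confidence, 2))
```

Rule B is perfectly accurate on the training data but is discarded, since a rule met by only 12 mutations says little about mutations in general.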
Optimize attributes set
The attributes mentioned above have been shown to be related to functional mutations. Combining all of these attributes may result in redundant descriptions of each polymorphism and reduce prediction quality (Dobson et al. 2006). Therefore, attribute selection was an indispensable step before prediction. Here, optimization means finding the combination of attributes that maximizes the prediction accuracy. The optimized attributes subset was obtained using wrapper-based attribute selection with J48 as the learning method, combined with the genetic search method with default option settings. The genetic search algorithm was initialized with a population size of 20, and 50 generations were evaluated.
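A wrapper-based genetic search over attribute subsets can be sketched as below. This is a generic genetic algorithm written for illustration, not the Weka implementation: the selection scheme (keep the better half, plus the best individual unchanged), the crossover and mutation rates, and all names are assumptions. Only the population size of 20 and the 50 generations come from the text. In the study the score of a subset would be the cross-validated performance of J48 on that subset; here a toy scoring function takes its place.

```python
import random


def genetic_attribute_search(n_attributes, score, population_size=20,
                             generations=50, crossover_rate=0.6,
                             mutation_rate=0.033, seed=1):
    """Search bit masks (1 = attribute used) for a high-scoring subset."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_attributes))
                  for _ in range(population_size)]
    best = max(population, key=score)
    history = [score(best)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: population_size // 2]
        children = [best]  # elitism: the best subset always survives
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_attributes - 1)  # one-point crossover
                child = mother[:cut] + father[cut:]
            else:
                child = mother
            # flip each bit independently with a small probability
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        population = children
        best = max(population, key=score)
        history.append(score(best))
    return best, history


# Toy stand-in for the wrapper score: agreement with a hypothetical mask.
TOY_TARGET = (1, 0, 1, 1, 0, 0, 1, 1, 0)


def toy_score(mask):
    return sum(a == b for a, b in zip(mask, TOY_TARGET))


best_mask, best_history = genetic_attribute_search(9, toy_score)
print(best_mask, best_history[-1])
```

Because every candidate subset requires retraining and cross-validating the classifier, the wrapper approach is costly, which is why a stochastic search over the 2^9 possible subsets of nine attributes is a reasonable alternative to exhaustive enumeration when more attributes are involved.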
Evaluation of the prediction accuracy
The mutations were classified as “effect” or “no effect.” Mutations in the “effect” class influence the structure or function of the protein, which means that they are functional mutations. Because the training dataset contained more neutral mutations than functional mutations, the Matthews correlation coefficient (MCC) was used to evaluate the performance (Matthews 1985):
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP is the number of true positives, FN is the number of false negatives, TN is the number of true negatives and FP is the number of false positives. When there is an obvious disparity between the numbers of positive and negative samples, MCC is usually a better evaluation criterion than the overall accuracy. MCC combines both sensitivity and specificity into one measure, and its values lie in the range of −1 to 1. A value of 1 means perfect prediction, while a value of 0 means that the predictions are no better than random assignment.
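The MCC can be computed directly from the four counts of a confusion matrix. The sketch below uses the standard definition of the coefficient; returning 0 when the denominator vanishes is a common convention and an assumption here, and the function name and example counts are hypothetical.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Happens when a whole row or column of the matrix is empty,
        # e.g. when every mutation is assigned to the same class.
        return 0.0
    return (tp * tn - fp * fn) / denominator


perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=50, fn=50)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
one_class = matthews_corrcoef(tp=0, tn=100, fp=0, fn=40)
print(perfect, inverted, chance, one_class)
```

The last case shows why accuracy alone can mislead on an unbalanced dataset: a classifier that labels every mutation as neutral is correct for 100 of 140 examples, yet its MCC is 0.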
Statistics
Statistical analysis of the distribution of each attribute for functional mutations and neutral mutations was performed using the chi-squared test.
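As an illustration of the test, the Pearson chi-squared statistic for a contingency table of mutation class against attribute category can be computed as follows. The sketch assumes that every row and column total is nonzero, returns only the statistic and the degrees of freedom (not a P value), and uses hypothetical counts.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a table.

    `table` is a list of rows of observed counts, e.g. one row per
    mutation class and one column per attribute category.
    """
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(column_totals) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts: rows are functional / neutral mutations,
# columns are two categories of an attribute.
chi2, dof = chi_squared_statistic([[10, 20], [20, 10]])
print(round(chi2, 4), dof)
```

The statistic is then compared with the chi-squared distribution for the given degrees of freedom to obtain the P value.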
Results
Predictive powers of individual attributes
The prediction performance of each attribute was assessed using the decision tree method. Except for the solvent accessibility, all other attributes played a role in predicting whether an nsSNP has an effect on protein function or not. When solvent accessibility was used as a single attribute in the prediction, the MCC was 0. In contrast, the conservation score, whose MCC reached 0.68, was found to be the best discriminator of functional versus neutral variations. The MCCs of the location, mass change, and volume change attributes were higher than 0.4, but these achieved less prediction accuracy than the conservation score. Other attributes, such as PAM120 and BLOSUM62 substitution matrices, hydrophobicity change, and buried charge, had poor predictive performance (Table 1).
Distribution of functional and neutral mutations
The distribution of attribute values for functional mutations was significantly different from that of neutral mutations (Fig. 1). Approximately 62.67% of the functional mutations had conservation scores of >0.5, whereas only 2.15% of the neutral mutations had conservation scores of >0.5. For those mutations with a conservation score <−0.5, only 17.6% were functional mutations, whereas 81.86% were neutral. For those mutations with a conservation score of between −0.5 and 0.5, functional mutations were only 3.74% more than neutral (Fig. 1a).
When the change in hydrophobicity between the wild-type and mutated residues was >3, 50.27% of the mutations were functional compared with 21.27% neutral mutations. When the hydrophobicity change was <3, there were more neutral mutations than functional ones (Fig. 1b). The distributions of the mass and volume changes were similar to that of the hydrophobicity change. When volume changes were >60 or mass changes were >40, there were more functional mutations than neutral mutations (Fig. 1c,d). These data indicate that dramatic changes in physicochemical properties tend to change the structure or function of the protein, and thus the mutations would be functional.
The nature of the amino acid changes was assessed using BLOSUM62 and PAM120 substitution matrices, since these two matrices are widely used and robust. A total of 67.74% of the functional mutations have BLOSUM62 scores of <−1, and only 27.52% of the neutral variations have BLOSUM62 scores of <−1 (Fig. 1e). The distribution of the PAM120 substitution matrix score is similar to that of the BLOSUM62 results (Fig. 1f). For these methods, the smaller the score, the higher the probability that a mutation is functional. We found that 47.06% of the functional mutations and 14.35% of the neutral changes have a PAM120 score of <−2, while 25.46% of the functional mutations and 57.84% of the neutral changes have a PAM120 score of >0.
TM regions contained 77.87% of functional mutations and 24.98% of neutral mutations, while extracellular and intracellular domains contained 22.13% of functional and 75.02% of neutral mutations (Fig. 1g). The distributions of buried charge and solvent accessibility for functional and neutral mutations were significantly different (χ² = 33.78, P < 0.01 and χ² = 51.49, P < 0.01, respectively), although the differences were not as pronounced as those for the other attributes (Fig. 1h,i).
Optimal attributes subset
During the attribute selection process, six attributes were found to be the optimal attributes subset: the conservation score, the BLOSUM62 substitution matrix score, the location, the solvent accessibility, the buried charge, and the hydrophobicity change. The prediction performance of this optimal attributes subset was compared with four different attribute sets: all attributes, sequence-based attributes, structure-based attributes, and conservation score alone (Table 1). The MCC of the optimal attributes set (0.81) was the highest among them. Sequence-based attributes (even using just the conservation score) were better than the structure-based ones. When all attributes were combined, the prediction accuracy was improved compared with those of sequence-based or structure-based attributes alone.
Rules for predicting functional nsSNPs
The decision tree method can produce intelligible rules and attach a confidence level to each rule. Seven important rules with covers of >30 were obtained (Table 2). These rules were used to predict functional mutations, and they conveniently discriminate functional nsSNPs from neutral mutations. For example, according to Rule 1, if the conservation score of an nsSNP was less than or equal to −0.343, and it was located in the extracellular or intracellular domains, then the probability that this nsSNP is neutral would be 0.96.
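Rule 1 as quoted above can be written as a simple check. Only this rule is shown because the remaining rules are given in Table 2; the function name and the return convention are assumptions made for illustration.

```python
def apply_rule_1(conservation_score, in_tm_region):
    """Apply Rule 1 from the text to a single nsSNP.

    Returns ("no effect", 0.96) when the rule fires, and None when the
    rule does not apply and another rule must decide.
    """
    if conservation_score <= -0.343 and not in_tm_region:
        return ("no effect", 0.96)
    return None


print(apply_rule_1(-0.8, in_tm_region=False))
print(apply_rule_1(-0.8, in_tm_region=True))
print(apply_rule_1(0.9, in_tm_region=False))
```

A poorly conserved position in a loop or terminal domain is thus called neutral with a confidence of 0.96, whereas the same conservation score inside a TM helix is left to the other rules.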
Functional nsSNPs in human GPCRs
We collected 519 nsSNPs from dbSNP, and 166 of these (32%) were predicted to be functional using the optimal attributes set (Table 3). Analysis of these nsSNPs in GPCRs will provide the basis for assessing susceptibility to diseases and designing individualized therapy.
Discussion
In the present study, both sequence-based and structure-based attributes were used to predict functional nsSNPs in human GPCRs. Since only one GPCR structure is known, we used predicted structure information instead of the actual structure. A conservation score that is based on evolutionary selection information was found to be the best single predictor for discriminating functional mutations from neutral variations. A high conservation score means that there is selective pressure to maintain these residues during evolution, and therefore these are likely to be important to the structure and function of the protein. The mutations that occur at these conserved sites are often functional mutations. A change in the physicochemical properties of a residue influences the structure or stability of the protein and changes its function only indirectly, and so these attributes have only moderate prediction power. Substitution matrices, which consider only the likelihood of the substitution in all proteins at all positions, can also be useful, albeit with lower prediction accuracy (Yue and Moult 2006). We found that functional mutations are overrepresented in the TM regions and are underrepresented in the extracellular and intracellular domains. This implies that changes in TM regions may directly affect either the structure or function of the receptor. Mutations in TM regions could abrogate or diminish the activity of the protein when a ligand-binding site is affected. Alternatively, a mutation in a TM region could compromise the protein’s structural integrity by having an effect on helix–helix packing interactions. In general, structure-based attributes had poorer predictive power than sequence-based attributes. The MCC of solvent accessibility was zero, which means that its predictions were no better than random assignment, and this was the worst predictor among all the attributes when it was used alone.
Combining the attributes can greatly improve the prediction accuracy. Though conservation score was the most powerful predictor, the MCC increased to 0.22 when it was combined with other sequence-based attributes. When all nine attributes were used in a prediction, the accuracy was improved when compared with the sequence-based attributes alone. We also found the proposed structural information to be useful in prediction. It is likely that most mutations that affect protein function actually affect it indirectly through changes in structural stability. However, simply taking all the attributes together did not achieve the best performance. We found that the optimal attributes subset only requires six attributes—the conservation score, the BLOSUM62 substitution matrix score, the location, the solvent accessibility, the buried charge, and the hydrophobicity change. The combination of these six attributes had an MCC that was 0.03 higher than that of all nine attributes. The optimal subset includes both sequence-based and structure-based attributes. Moreover, it is interesting to see that the optimal attributes subset did not consist of the six best predictors when each was assessed by itself. The predictabilities of some inferior attributes, such as solvent accessibility and buried charge, were increased when used in combination.
Seven important rules with cover >30 were derived from the decision tree. Based on these rules, functional nsSNPs can be distinguished from neutral nsSNPs intuitively, provided that the attribute values of the nsSNPs are available, without any need for complex training or testing processes.
In summary, combining sequence-based and structure-based information will improve the prediction performance, but the optimal attributes subset was not simply a combination of the attributes. With the optimal attributes subset, a total of 166 functional nsSNPs were predicted. Given the important roles of GPCRs in many physiological processes and their pharmaceutical relevance as drug targets, further investigation of these nsSNPs will be very useful for elucidating disease pathogenesis mechanisms and drug efficacy issues.
References
Armon A, Graur D, Ben-Tal N (2001) Consurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J Mol Biol 307:447–463
Balasubramanian S, Xia Y, Freinkman E, Gerstein M (2005) Sequence variation in G-protein-coupled receptors: analysis of single nucleotide polymorphisms. Nucleic Acids Res 33:1710–1721
Bao L, Cui Y (2005) Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics 21:2185–2190
Chasman D, Adams RM (2001) Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J Mol Biol 307:683–706
Dobson RJ, Munroe PB, Caulfield MJ, Saqi MA (2006) Predicting deleterious nsSNPs: an analysis of sequence and structural attributes. BMC Bioinf 7:217–235
Drysdale CM, McGraw DW, Stack CB, Stephens JC, Judson RS, Nandabalan K et al. (2000) Complex promoter and coding region beta 2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc Natl Acad Sci USA 97:10483–10488
Frank E, Hall M, Trigg L, Holmes G, Witten IH (2004) Data mining in bioinformatics using Weka. Bioinformatics 20:2479–2481
Gether U (2000) Uncovering molecular mechanisms involved in activation of G protein-coupled receptors. Endocr Rev 21:90–113
Harpaz Y, Gerstein M, Chothia C (1994) Volume changes on protein folding. Structure 2:641–649
Horn F, Bettler E, Oliveira L, Campagne F, Cohen FE, Vriend G (2003) GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res 31:294–297
Horn F, Lau AL, Cohen FE (2004) Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics 20:557–568
Howard AD, McAllister G, Feighner SD, Liu Q, Nargund RP, Van der Ploeg LH et al. (2001) Orphan G-protein-coupled receptors and natural ligand discovery. Trends Pharmacol Sci 22:132–140
Krishnan VG, Westhead DR (2003) A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics 19:2199–2209
Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567–580
Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157:105–132
Marinissen MJ, Gutkind JS (2001) G protein-coupled receptors and signaling networks: emerging paradigms. Trends Pharmacol Sci 22:368–376
Matthews BW (1985) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405:442–451
Ng PC, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11:863–874
Palczewski K, Kumasaka T, Hori T, Behnke CA, Motoshima H, Fox BA et al. (2000) Crystal structure of rhodopsin: a G protein-coupled receptor. Science 289:739–745
Pei J, Grishin NV (2001) AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 17:700–712
Phillips KA, Veenstra DL, Oren E, Lee JK, Sadee W (2001) Potential role of pharmacogenomics in reducing adverse drug reactions: a systematic review. JAMA 286:2270–2279
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco, CA
Ramensky V, Bork P, Sunyaev S (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res 30:3894–3900
Roses AD (2004) Pharmacogenetics and drug development: the path to safer and more effective drugs. Nat Rev Genet 5:645–656
Rost B, Sander C (1993) Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proc Natl Acad Sci USA 90:7558–7562
Rost B, Sander C (1994) Conservation and prediction of solvent accessibility in protein families. Proteins 9:56–68
Sakmar TP (1998) Rhodopsin: a prototypical G protein-coupled receptor. Prog Nucleic Acid Res Mol Biol 59:1–34
Saunders CT, Baker D (2002) Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J Mol Biol 322:891–901
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423; 623–656
Sunyaev S, Ramensky V, Koch I, Lathe W III, Kondrashov AS, Bork P (2001) Prediction of deleterious human alleles. Hum Mol Genet 10:591–597
Wang Z, Moult J (2001) SNPs, protein structure, and disease. Hum Mutat 17:263–270
Witten I, Frank E (2000) Data mining: practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco, CA
Yue P, Moult J (2006) Identification and analysis of deleterious human SNPs. J Mol Biol 356:1236–1274
Cite this article
Xue, D., Yin, J., Tan, M. et al. Prediction of functional nonsynonymous single nucleotide polymorphisms in human G-protein-coupled receptors. J Hum Genet 53, 379–389 (2008). https://doi.org/10.1007/s10038-008-0260-8
DOI: https://doi.org/10.1007/s10038-008-0260-8