Introduction

G-protein-coupled receptors (GPCRs) form the largest protein superfamily in most mammalian genomes. Despite their great sequence diversity, all GPCRs share a common architecture: an N-terminal extracellular domain of variable length is followed by seven transmembrane (TM) helices connected by three intracellular loops (ICL) and three extracellular loops (ECL), and the last helix leads into a C-terminal intracellular domain (Gether 2000). These receptors are bound to the plasma membrane and respond to a wide range of extracellular signals, including nucleotides, peptides, amines and hormones (Sakmar 1998). Upon recognition of these ligands, GPCRs act through G proteins in signaling pathways that influence almost all physiological functions. As such, pharmacologic agents that antagonize GPCR-mediated signaling are common; indeed, more than one-third of all known small-molecule drugs target GPCRs (Marinissen and Gutkind 2001; Howard et al. 2001). Small genomic-level differences in GPCRs may therefore explain why individuals respond differently to the same drug, and they could be used to tailor drugs to an individual’s genetic makeup (Drysdale et al. 2000; Phillips et al. 2001; Roses 2004).

Single nucleotide polymorphisms (SNPs) are defined as single base variations in sequence that occur at a frequency of at least 1%, and they may directly explain the pathogenesis of disease. SNPs in protein-coding exons are classified as synonymous or nonsynonymous (nsSNPs) according to whether or not they alter the protein sequence. Some nsSNPs affect gene function through their effects on the structure or function of the encoded protein. Several recent studies have investigated how to determine whether an nsSNP is functional or neutral using protein sequence and structural information. Empirical rules for identifying detrimental nsSNPs were derived from structural information (Wang and Moult 2001). An algorithm named SIFT, which is based on sequence conservation and scores from position-specific scoring matrices, was developed to identify amino acid changes that are likely to affect the function of a protein (Ng and Henikoff 2001). Structural and sequence information was then combined and used to predict functional nsSNPs (Chasman and Adams 2001; Sunyaev et al. 2001; Ramensky et al. 2002; Saunders and Baker 2002; Krishnan and Westhead 2003; Bao and Cui 2005). Such structure-based approaches are of limited use for GPCRs, however, because bovine rhodopsin (BR) is the only GPCR whose three-dimensional structure has been resolved (Palczewski et al. 2000). Balasubramanian et al. (2005) predicted disease-causing nsSNPs in GPCRs from sequence information using logistic regression methods. However, some nsSNPs influence protein function without causing inherited diseases. Here, we aimed to predict the functional nsSNPs in human GPCRs from dbSNP based on the optimal sequence and structural information using a decision tree method.

Materials and methods

Datasets

The GPCRDB has an extensive collection of point mutations that have been compiled from the literature using the MuteXt automated extraction method (Horn et al. 2003, 2004). We analyzed the functions of these mutated residues and collected those mutations that changed receptor function or structure in order to include them in our training set.

To derive a dataset of neutral mutations, proteins with >90% sequence identity to the target GPCRs were extracted from GPCRDB. For each target protein, only one ortholog was chosen from each species based on the best match to the target protein, and a multiple sequence alignment (MSA) of the target protein and its homologs was constructed. Amino acid variations at any position in the MSA were considered to be neutral variations (Sunyaev et al. 2001; Balasubramanian et al. 2005). The logic behind this assumption is that variations between highly homologous sequences from different species are generally neutral and highly unlikely to be deleterious, because detrimental changes are selectively removed during the course of evolution. There may be exceptions, however, in which a variation is functional in one species but not in others. In total, the training dataset contained 750 functional mutations and 1,345 neutral changes in 72 receptors.
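The extraction of neutral variations from an alignment can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pipeline used here: the function name, the toy sequences and the handling of gaps are hypothetical, and the orthologs are presumed to have been aligned and filtered already (>90% identity, one per species).

```python
def neutral_variations(target, orthologs, gap="-"):
    """Collect (position, wild_type, variant) substitutions seen in aligned orthologs.

    `target` and every ortholog are rows of the same alignment (equal length).
    Positions are numbered along the ungapped target sequence, starting at 1.
    """
    variations = set()
    pos = 0
    for col, wt in enumerate(target):
        if wt == gap:
            continue  # column is absent from the target protein
        pos += 1
        for seq in orthologs:
            aa = seq[col]
            if aa != gap and aa != wt:
                variations.add((pos, wt, aa))
    return sorted(variations)


# Toy alignment (hypothetical sequences, for illustration only)
target = "MKT-LLV"
orthologs = ["MKTALLV", "MRT-LIV"]
variants = neutral_variations(target, orthologs)
print(variants)
```

Each tuple returned by this sketch would then be labeled as a neutral change in the training set.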

The nsSNPs in human GPCRs with known ligands were extracted from dbSNP. First, the gene corresponding to each human GPCR was identified in Swiss-Prot, and dbSNP was then searched for nsSNPs by gene name. The opsins, olfactory receptors and taste receptors were excluded since they are not drug targets. In all, 519 nsSNPs were identified.

Attributes

The sequence-based and structure-based attributes of amino acid polymorphisms that may serve as generalized predictors of effects on function were chosen from the literature. The sequence-based attributes used in the prediction of functional nsSNPs were the sequence conservation score at the mutated position, the physicochemical changes (mass, hydrophobicity, volume) between the wild-type and mutated residues, and substitution scores from the BLOSUM62 and PAM120 matrices. Three structure-based attributes were also considered: the location of the mutation (whether or not it was in a TM region), the solvent accessibility, and the buried charge.

The sequence conservation score was calculated in two steps using the software program AL2CO (Pei and Grishin 2001). First, an independent count-based weighting scheme was used to estimate the amino acid frequencies. The conservation score was then calculated from these frequencies using an entropy-based method (Shannon 1948). The MSA files of the subtype families containing the target proteins were extracted from the GPCRDB. Because the number of sequences differs between families, the level of conservation at each position, and thus the average conservation score, also differs between families (Armon et al. 2001). To diminish this effect, the conservation score was normalized to a z-score, which was calculated by subtracting the mean conservation score from the conservation score and dividing by the standard deviation.
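The two-step calculation and the z-score normalization can be sketched as below. This is a simplified stand-in, not AL2CO itself: it uses unweighted residue frequencies instead of the independent-count weighting scheme, reports the negative Shannon entropy so that higher values mean stronger conservation, and uses the population standard deviation. All function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_conservation(column, gap="-"):
    """Negative Shannon entropy of one alignment column (higher = more conserved)."""
    residues = [aa for aa in column if aa != gap]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(residues).values())
    return -entropy


def z_scores(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


# Toy alignment (hypothetical sequences, for illustration only)
alignment = ["MKTL", "MKSL", "MRAL", "MKTI"]
columns = list(zip(*alignment))
raw = [column_conservation(col) for col in columns]
normalized = z_scores(raw)
```

After normalization, scores from families of different sizes are on a common scale with mean 0 and standard deviation 1 within each alignment.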

The hydrophobicity of the amino acids was evaluated using the Kyte–Doolittle hydrophobicity scale (Kyte and Doolittle 1982). Average residue volumes were taken from Harpaz et al. (1994). The mass, hydrophobicity and volume changes were calculated as the absolute value of the difference between the wild-type residue and the mutated residue. The TM regions of a protein were taken from the Swiss-Prot database entry for each protein. If no information was available, the TMHMM program was used to predict the TM regions of the receptors (Krogh et al. 2001). The location attribute was set to 1 if the wild-type residue was in a TM region and to 0 otherwise. The solvent accessibility was predicted by PHD (Rost and Sander 1993, 1994). Relative accessibility was grouped into three states: buried (<9%), intermediate (9–36%) and exposed (≥36%). The wild-type residue was deemed to be a buried charge if it was K, R, D, E or H and its solvent accessibility was in the buried state (Krishnan and Westhead 2003).
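Several of these attribute definitions can be written down directly, as in the following sketch. The function names are hypothetical, the accessibility and TM segments are assumed to come from an external predictor or database entry, and a relative accessibility of exactly 36% is assigned to the exposed state as an assumption. Mass and volume changes would follow the same pattern as the hydrophobicity change, given the corresponding per-residue tables.

```python
# Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGED = set("KRDEH")


def hydrophobicity_change(wt, mut):
    """Absolute difference in hydropathy between wild-type and mutated residue."""
    return abs(KD[wt] - KD[mut])


def accessibility_state(rel_acc):
    """Map relative accessibility (%) to one of the three states."""
    if rel_acc < 9:
        return "buried"
    if rel_acc < 36:
        return "intermediate"
    return "exposed"  # assumption: exactly 36% counts as exposed


def in_tm(position, tm_segments):
    """Return 1 if the position lies in any (start, end) TM segment, else 0."""
    return int(any(start <= position <= end for start, end in tm_segments))


def buried_charge(wt, rel_acc):
    """Return 1 if the wild-type residue is K, R, D, E or H and is buried."""
    return int(wt in CHARGED and accessibility_state(rel_acc) == "buried")


# Hypothetical example: R -> L at position 135, TM segment 120-145, 5% accessible
features = {
    "hydrophobicity_change": hydrophobicity_change("R", "L"),
    "location": in_tm(135, [(120, 145)]),
    "buried_charge": buried_charge("R", 5.0),
}
```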

Decision tree

Decision tree learning is a means of approximating discrete-valued target functions in which the learned function is represented by a decision tree. It has been shown to perform well in homogeneous cross-validated training datasets (Krishnan and Westhead 2003). Here we used the C4.5 decision tree algorithm developed by Quinlan (1993), as implemented in the J48 decision tree classifier of the Weka machine learning workbench (Witten and Frank 2000; Frank et al. 2004). The default set of parameters and tenfold cross-validation were used in the predictions. The decision tree not only provides a prediction but also yields an estimate of the probability that a prediction from a rule is correct. Each rule was derived from the training dataset, and its estimated accuracy was used to assign a confidence level to the prediction: a rule with an estimated accuracy of x% was taken to have a confidence level of x/100. Another measure of a rule was its “cover,” the number of mutations in the training dataset that conform to the rule. If the cover of a rule is too small, only a few mutations in the training dataset meet the rule, and so the rule is not representative. In this paper, we used 30 as the cover threshold.
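The two rule measures can be illustrated with a short sketch. It is a simplification under stated assumptions: confidence is taken here as the raw fraction of covered training examples whose label matches the rule, which may differ from the accuracy estimate reported by the classifier, and the rule, its cutoff and the toy examples are hypothetical.

```python
COVER_THRESHOLD = 30  # cover threshold stated in the text


def rule_stats(condition, predicted_class, examples):
    """Return (cover, confidence) of a rule on labeled training examples.

    cover:      number of examples that satisfy the rule's conditions
    confidence: fraction of covered examples whose label is predicted_class
    """
    matched = [ex for ex in examples if condition(ex)]
    cover = len(matched)
    if cover == 0:
        return 0, 0.0
    correct = sum(1 for ex in matched if ex["label"] == predicted_class)
    return cover, correct / cover


def low_conservation_outside_tm(ex):
    """Hypothetical rule: poorly conserved position outside the TM regions."""
    return ex["conservation"] <= -0.5 and ex["in_tm"] == 0


# Toy training examples (hypothetical, for illustration only)
examples = (
    [{"conservation": -1.0, "in_tm": 0, "label": "neutral"}] * 38
    + [{"conservation": -1.0, "in_tm": 0, "label": "functional"}] * 2
    + [{"conservation": 1.2, "in_tm": 1, "label": "functional"}] * 10
)
cover, confidence = rule_stats(low_conservation_outside_tm, "neutral", examples)
is_representative = cover > COVER_THRESHOLD
```

In this toy case the rule covers 40 examples, 38 of which are neutral, so it would pass the cover threshold with a confidence of 0.95.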

Optimization of the attribute set

The attributes mentioned above have all been reported to be related to functional mutations. Combining all of these attributes may result in redundant descriptions of each polymorphism and cause a reduction in prediction quality (Dobson et al. 2006). Therefore, attribute selection was an indispensable step before prediction. Here, optimization means finding the combination of attributes that maximizes the prediction accuracy. The optimized attribute subset was obtained using wrapper-based attribute selection with J48 as the learning method, combined with the genetic search method with default option settings. The genetic search algorithm was initialized with a population size of 20, and 50 generations were then evaluated.
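The idea of a wrapper combined with a genetic search can be sketched in a few lines. This is a generic genetic algorithm over attribute masks, not the Weka implementation: the selection scheme, crossover rate, mutation rate and seed are illustrative assumptions, and `toy_score` is a hypothetical stand-in for the cross-validated performance of the classifier trained on the selected attributes.

```python
import random


def genetic_search(n_attributes, evaluate, pop_size=20, generations=50,
                   crossover_rate=0.6, mutation_rate=0.05, seed=0):
    """Search attribute subsets (tuples of booleans) that maximize evaluate(mask)."""
    rng = random.Random(seed)
    population = [tuple(rng.random() < 0.5 for _ in range(n_attributes))
                  for _ in range(pop_size)]
    best = max(population, key=evaluate)
    for _ in range(generations):
        scores = [evaluate(ind) for ind in population]

        def pick():
            # Binary tournament selection: the better of two random individuals
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        next_pop = [best]  # elitism: always keep the best subset found so far
        while len(next_pop) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_attributes)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1
            # Flip each bit independently with probability mutation_rate
            child = tuple(bit != (rng.random() < mutation_rate) for bit in child)
            next_pop.append(child)
        population = next_pop
        best = max(population, key=evaluate)
    return best


USEFUL = {0, 2, 5}  # hypothetical indices of informative attributes


def toy_score(mask):
    """Stand-in for wrapper evaluation: reward useful attributes, penalize others."""
    chosen = {i for i, on in enumerate(mask) if on}
    return len(chosen & USEFUL) - 0.1 * len(chosen - USEFUL)


best_mask = genetic_search(9, toy_score)
selected = [i for i, on in enumerate(best_mask) if on]
```

In a real wrapper, each call to the scoring function would retrain and cross-validate the decision tree on the candidate subset, which is why the population size and number of generations determine the cost of the search.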

Evaluation of the prediction accuracy

The mutations were classified as “effect” or “no effect.” Mutations in the “effect” class influence the structure or function of the protein, which means that they are functional mutations. Because the training dataset contained more neutral mutations than functional mutations, the Matthews correlation coefficient (MCC) was used to evaluate the performance (Matthews 1985):

$$ \text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TN} + \text{FN})(\text{TN} + \text{FP})(\text{TP} + \text{FN})(\text{TP} + \text{FP})}} $$

where TP is the number of true positives, FN is the number of false negatives, TN is the number of true negatives and FP is the number of false positives. When there is an obvious disparity between the numbers of positive and negative samples, MCC is usually a better evaluation criterion than the overall accuracy. MCC combines both sensitivity and specificity into one measure, and its values lie in the range of −1 to 1. A value of 1 means perfect prediction, while a value of 0 means that the prediction is no better than random assignment.
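As a worked illustration of the formula, the MCC can be computed from the four counts as follows. The function name is hypothetical, and returning 0 when the denominator vanishes is a common convention adopted here as an assumption; the formula itself is undefined in that case.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denom == 0:
        # Assumption: report 0 when a whole row or column of the matrix is empty
        return 0.0
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=50, tn=50, fp=0, fn=0)       # every prediction correct
inverted = mcc(tp=0, tn=0, fp=50, fn=50)      # every prediction wrong
uninformative = mcc(tp=25, tn=25, fp=25, fn=25)
one_class = mcc(tp=0, tn=1345, fp=0, fn=750)  # everything predicted "no effect"
```

Under this convention, a classifier that assigns every mutation to a single class scores 0, even though its overall accuracy on an unbalanced dataset can look respectable.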

Statistics

Statistical analysis of the distribution of each attribute for functional mutations and neutral mutations was performed using the chi-squared test.
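The test statistic can be illustrated for the location attribute. The sketch below computes Pearson's chi-squared statistic for a contingency table; the counts are not taken from the raw data but are reconstructed by rounding the percentages reported in the Results (77.87% of 750 functional and 24.98% of 1,345 neutral mutations in TM regions), so they should be read as an approximation. The function name is hypothetical.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: functional, neutral; columns: in TM region, outside TM region.
# Counts reconstructed from the reported percentages (an approximation).
location_table = [[584, 166], [336, 1009]]
stat, df = chi_squared(location_table)
print(round(stat, 2), df)
```

With these reconstructed counts, the statistic agrees closely with the value reported for the location attribute (χ² = 546.78, 1 df).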

Results

Predictive powers of individual attributes

The prediction performance of each attribute was assessed using the decision tree method. All attributes except the solvent accessibility played a role in predicting whether or not an nsSNP has an effect on protein function. When solvent accessibility was used as a single attribute in the prediction, the MCC was 0. In contrast, the conservation score, whose MCC reached 0.68, was found to be the best discriminator of functional versus neutral variations. The MCCs of the location, mass change, and volume change attributes were higher than 0.4, but these attributes were less accurate predictors than the conservation score. Other attributes, such as the PAM120 and BLOSUM62 substitution matrix scores, hydrophobicity change, and buried charge, had poor predictive performance (Table 1).

Table 1 Prediction performances of attributes and attribute sets obtained using the decision tree method

Distribution of functional and neutral mutations

The distribution of attribute values for functional mutations was significantly different from that of neutral mutations (Fig. 1). Approximately 62.67% of the functional mutations had conservation scores of >0.5, whereas only 2.15% of the neutral mutations did. Conversely, only 17.6% of the functional mutations had conservation scores of <−0.5, compared with 81.86% of the neutral mutations. For conservation scores between −0.5 and 0.5, the proportion of functional mutations exceeded that of neutral mutations by only 3.74 percentage points (Fig. 1a).

Fig. 1a–i The distribution of attributes for functional mutations and neutral mutations. The shaded bars represent the functional mutations and the white bars the neutral mutations. Attributes: a conservation score (χ² = 1099.05, P < 0.01, 7 df), b hydrophobicity change (χ² = 208.37, P < 0.01, 7 df), c volume change (χ² = 212.64, P < 0.01, 6 df), d mass change (χ² = 211.83, P < 0.01, 5 df), e BLOSUM62 score (χ² = 281.85, P < 0.01, 7 df), f PAM120 (χ² = 314.47, P < 0.01, 5 df), g location (χ² = 546.78, P < 0.01, 1 df), h solvent accessibility (χ² = 51.49, P < 0.01, 2 df), and i buried charge (χ² = 33.78, P < 0.01, 1 df)

The hydrophobicity change between the wild-type and mutated residues was >3 for 50.27% of the functional mutations, compared with 21.27% of the neutral mutations. For hydrophobicity changes of <3, neutral mutations were proportionally more frequent than functional mutations (Fig. 1b). The distributions of the mass and volume changes were similar to that of the hydrophobicity change. When volume changes were >60 or mass changes were >40, there were more functional mutations than neutral mutations (Fig. 1c,d). These data indicate that dramatic changes in physicochemical properties tend to change the structure or function of the protein, and thus such mutations tend to be functional.

The nature of the amino acid changes was assessed using the BLOSUM62 and PAM120 substitution matrices, since these two matrices are widely used and robust. A total of 67.74% of the functional mutations had BLOSUM62 scores of <−1, compared with only 27.52% of the neutral variations (Fig. 1e). The distribution of the PAM120 substitution matrix scores was similar to that of the BLOSUM62 scores (Fig. 1f). For both matrices, the smaller the score, the higher the probability that a mutation is functional. We found that 47.06% of the functional mutations and 14.35% of the neutral changes had a PAM120 score of <−2, while 25.46% of the functional mutations and 57.84% of the neutral changes had a PAM120 score of >0.

TM regions contained 77.87% of the functional mutations and 24.98% of the neutral mutations, while the extracellular and intracellular domains contained 22.13% of the functional and 75.02% of the neutral mutations (Fig. 1g). The distributions of buried charge and solvent accessibility for functional and neutral mutations were also significantly different (χ² = 33.78, P < 0.01 and χ² = 51.49, P < 0.01, respectively), although the difference was not as pronounced as for the other attributes (Fig. 1h,i).

Optimal attributes subset

During the attribute selection process, a subset of six attributes was found to be optimal: the conservation score, the BLOSUM62 substitution matrix score, the location, the solvent accessibility, the buried charge, and the hydrophobicity change. The prediction performance of this optimal subset was compared with that of four other attribute sets: all attributes, sequence-based attributes, structure-based attributes, and the conservation score alone (Table 1). The MCC of the optimal attribute set (0.81) was the highest among them. Sequence-based attributes (even just the conservation score) performed better than the structure-based ones. When all attributes were combined, the prediction accuracy was improved compared with that of the sequence-based or structure-based attributes alone.

Rules for predicting functional nsSNPs

The decision tree method can produce intelligible rules and attach a confidence level to each rule. Seven important rules with covers of >30 were obtained (Table 2). These rules were used to predict functional mutations, and they provide a convenient way to discriminate functional nsSNPs from neutral mutations. For example, according to Rule 1, if the conservation score of an nsSNP is less than or equal to −0.343 and the nsSNP is located in the extracellular or intracellular domains, then the probability that this nsSNP is neutral is 0.96.
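Rule 1 as stated above can be expressed directly in code. The sketch encodes only this one rule, since the remaining rules are given in Table 2; the function name and the choice to return None when the rule does not apply are assumptions made for illustration.

```python
def rule_1(conservation_z, in_tm):
    """Apply Rule 1; return (label, confidence), or None if the rule does not apply.

    conservation_z: normalized conservation score at the mutated position
    in_tm:          1 if the residue lies in a TM region, 0 otherwise
    """
    if conservation_z <= -0.343 and not in_tm:
        return "neutral", 0.96
    return None


# Hypothetical nsSNPs
loop_variant = rule_1(conservation_z=-0.8, in_tm=0)  # rule applies
tm_variant = rule_1(conservation_z=-0.8, in_tm=1)    # rule does not apply
```

An nsSNP for which the function returns None would be passed on to the other rules of the tree.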

Table 2 Rules derived from the decision tree with the optimized attribute set

Functional nsSNPs in human GPCRs

We collected 519 nsSNPs from dbSNP, and 166 of these (32%) were predicted to be functional using the optimal attributes set (Table 3). Analysis of these nsSNPs in GPCRs will provide the basis for assessing susceptibility to diseases and designing individualized therapy.

Table 3 Predicted functional nsSNPs in human GPCRs, obtained with the optimized attribute set using the decision tree method

Discussion

In the present study, both sequence-based and structure-based attributes were used to predict functional nsSNPs in human GPCRs. Since only one GPCR structure is known, we used predicted structural information instead of actual structures. The conservation score, which reflects evolutionary selection, was found to be the best single predictor for discriminating functional mutations from neutral variations. A high conservation score means that there is selective pressure to maintain a residue during evolution, so the residue is likely to be important to the structure or function of the protein, and mutations at such conserved sites are often functional. Changes in the physicochemical properties of residues influence the structure or stability of a protein and thereby change its function only indirectly, which may explain why these attributes had only moderate predictive power. Substitution matrices, which consider only the likelihood of a substitution across all proteins and all positions, can also be useful, albeit with lower prediction accuracy (Yue and Moult 2006). We found that functional mutations are overrepresented in the TM regions and underrepresented in the extracellular and intracellular domains. This implies that changes in TM regions may directly affect either the structure or the function of the receptor. Mutations in TM regions could abrogate or diminish the activity of the protein when a ligand-binding site is affected. Alternatively, a mutation in a TM region could compromise the protein’s structural integrity by affecting helix–helix packing interactions. In general, structure-based attributes had poorer predictive power than sequence-based attributes. The MCC of solvent accessibility was zero, meaning that its predictions were no better than random, and it was the worst predictor among all the attributes when used alone.

Combining the attributes can greatly improve the prediction accuracy. Although the conservation score was the most powerful single predictor, the MCC increased when it was combined with the other sequence-based attributes (Table 1). When all nine attributes were used in a prediction, the accuracy was improved compared with that of the sequence-based attributes alone. We also found the predicted structural information to be useful in prediction. It is likely that most mutations that affect protein function actually affect it indirectly through changes in structural stability. However, simply taking all the attributes together did not achieve the best performance. We found that the optimal attribute subset requires only six attributes: the conservation score, the BLOSUM62 substitution matrix score, the location, the solvent accessibility, the buried charge, and the hydrophobicity change. The combination of these six attributes had an MCC that was 0.03 higher than that of all nine attributes. The optimal subset includes both sequence-based and structure-based attributes. Moreover, it is interesting that the optimal subset did not consist of the six best predictors when each was assessed by itself. The predictive power of some individually weak attributes, such as solvent accessibility and buried charge, increased when they were used in combination.

Seven important rules with cover >30 were derived from the decision tree. Based on these rules, functional nsSNPs can be distinguished intuitively from neutral nsSNPs as long as the attribute values of the nsSNPs are available, with no need for any complex training or testing process.

In summary, combining sequence-based and structure-based information improves the prediction performance, but the optimal attribute subset was not simply the combination of all attributes. With the optimal attribute subset, a total of 166 functional nsSNPs were predicted. Given the important roles of GPCRs in many physiological processes and their pharmaceutical relevance as drug targets, further investigation of these nsSNPs should be very useful for elucidating the mechanisms of disease pathogenesis and differences in drug efficacy.