Introduction

The neurotransmitter serotonin (5-hydroxytryptamine, 5-HT) mediates many physiological responses and pathological processes in both the peripheral and central nervous systems and has diverse effects on appetite, sleep and general metabolism [1]. Dysfunction of serotonergic neurotransmission is implicated in many disorders such as depression, anxiety and migraine [2]. Serotonin exerts its functions primarily by interacting with different types of serotonin receptors (5-HT receptors) [3–6]. The 5-HT receptors are a subfamily of G protein-coupled receptors (GPCRs), with the exception of the 5-HT3 subtype, which is a ligand-gated ion channel [7]. At least 14 distinct 5-HT receptors have been identified to date, and they can be divided into seven subtypes (5-HT1–5-HT7) based on molecular cloning, amino acid sequence, pharmacological properties and signal transduction [8]. The 5-HT1A receptor, which is mainly distributed in the frontal cortex, septum, amygdala, hippocampus and hypothalamus [9, 10], is one of the best-characterized members of the family and is a crucial modulator of serotonergic signaling in the central nervous system [11].

Accumulating results indicate that the 5-HT1A receptor participates in the regulation of various physiological and pathophysiological processes such as psychosis, cognition, feeding/satiety, temperature regulation, depression, anxiety, sleep, pain perception and sexual activity [12, 13]. It has become one of the most attractive targets for the development of drugs to treat numerous neurological and psychiatric disorders. Currently, five drugs primarily targeting this receptor have been launched, and dozens of others are in various clinical stages. Among the launched drugs, buspirone was the earliest 5-HT1A receptor agonist, launched in 1985 by Bristol-Myers Squibb (BMS) for the management of anxiety disorders. Although no 5-HT1A antagonists have been launched, previous studies have shown that 5-HT1A receptor antagonists may be useful in the treatment of Alzheimer's disease and other cognitive disorders [14]. In view of the significant differences in physiological function between 5-HT1A receptor agonists and antagonists, identifying the agonistic or antagonistic properties of 5-HT1A receptor ligands has become an important issue for drug development.

Some experimental methods have been established to identify the function of known ligands [15, 16]. However, these methods are time-consuming or expensive. Furthermore, the assays cannot be performed unless the compound is available. Therefore, a reliable computational model would be beneficial to accurately predict the physiological function of a compound before it is synthesized. To the best of our knowledge, no such model has been reported to date.

In this study, a genetic algorithm-optimized support vector machine (GA-SVM) method was adopted to construct a computational model for identifying agonists and antagonists of the 5-HT1A receptor, using 259 agonists and antagonists collected from the literature. The constructed SVM model displayed high predictive accuracy for both the training and test sets. Application of the model to an external dataset of 25 recently reported ligands showed that the predictions were in good agreement with the reported biological functions of these ligands, demonstrating that the model is reliable for identifying agonists and antagonists of the 5-HT1A receptor. This approach may also be employed to construct models that predict agonists or antagonists of other GPCR members.

Materials and methods

Data sets

A total of 259 5-HT1A receptor ligands, comprising 137 agonists and 122 antagonists from diverse structural classes such as aminotetralins, indolylalkylamines, ergolines, aporphines, arylpiperazines and aryloxyalkylamines, were collected from previous studies [1, 15–47] (Tables S1 and S2 in Supplementary Information). Because we aimed to build a binary classifier, partial agonists of the 5-HT1A receptor were classified as agonists. When different research groups reported conflicting biological data for the same compound, we used the latest results or the results from a research group with a long history of studying 5-HT1A receptor ligands as our raw data. All 259 ligands of known function were randomly divided into training and test sets at a ratio of 4:1 (207:52) (Table 1). The training set was used to develop the prediction model, whereas the test set was used to assess the performance of the generated model. The structures of these compounds were created and optimized using Sybyl 6.8 [48].
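The random 4:1 partition can be illustrated with a minimal sketch. The compound names and the fixed random seed below are placeholders introduced for illustration only; the actual assignment of ligands to the two sets is the one given in the Supplementary Information.

```python
import random

def split_dataset(compound_ids, n_train, seed=0):
    """Shuffle compound identifiers and split them into training and test lists."""
    rng = random.Random(seed)          # fixed seed (an assumption) for reproducibility
    shuffled = list(compound_ids)      # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder identifiers standing in for the 259 collected ligands.
ligands = [f"ligand_{i:03d}" for i in range(1, 260)]
train_ids, test_ids = split_dataset(ligands, n_train=207)
print(len(train_ids), len(test_ids))   # 207 52
```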

Table 1 Number of agonists and antagonists in training set and test set.

Support vector machine

The support vector machine (SVM), which was originally developed by Vapnik et al, is based on the structural risk minimization principle from statistical learning theory and is a supervised learning method that can be applied to classification and regression [49]. Simply speaking, an SVM model is constructed from a given set of training inputs belonging to two different classes, and the model is then used to predict the class of a new input. Each input is regarded as a multi-dimensional vector, and the goal is to determine a hyperplane that separates the inputs of the two classes, which are agonists and antagonists in this study. The Library for Support Vector Machines (LIBSVM 2.89) was employed in this study [50].

There are two parameters, C and r, that must be carefully adjusted to develop a robust SVM model. C is a global parameter that regulates the trade-off between maximization of the margin and minimization of the training error. Small C values emphasize the margin and tolerate outliers in the training set, whereas large C values may lead to overfitting of the training set. The parameter r is the width parameter of the radial basis function (RBF), which is the kernel function used in this study [51]. Here, we optimized the values of C and r using our in-house method (detailed below) to build the best classification model.
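To make the role of r concrete, the following sketch evaluates an RBF kernel in the form used by LIBSVM, exp(−r·‖x − y‖²), under the assumption that r corresponds to the parameter usually written γ. The function name and the example vectors are illustrative.

```python
import math

def rbf_kernel(x, y, r):
    """RBF kernel exp(-r * ||x - y||^2) between two descriptor vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-r * sq_dist)

x = [0.2, -0.5, 1.0]
y = [0.1, -0.4, 0.7]
# A larger r makes the similarity decay faster with distance.
for r in (0.25, 1.0, 4.0):
    print(r, rbf_kernel(x, y, r))
```

Identical vectors always give a kernel value of 1, and a larger r narrows the neighbourhood within which two compounds are treated as similar.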

Molecular descriptors

Molecular descriptors are generally used for quantitative representation of the structural and physicochemical features of compounds [52, 53]. Based on the 3D structure of each compound, 292 molecular descriptors, including topological, graph-theoretical, quantum-chemical and electro-topological state (E-state) descriptors, were calculated using Discovery Studio 2.1 [54]. In addition, the value of every descriptor was scaled to [-1, 1] (see Supplementary Information, Excel S1).
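Scaling a descriptor to [-1, 1] can be sketched as a linear min-max transformation. This is an assumption about how the scaling was done; the function name, the handling of constant descriptors and the example values are illustrative.

```python
def scale_column(values, lo=-1.0, hi=1.0):
    """Linearly rescale one descriptor column to the interval [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant descriptor carries no information; map it to the midpoint.
        return [(lo + hi) / 2.0 for _ in values]
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]

# Hypothetical values of a single descriptor for three compounds.
print(scale_column([2.0, 4.0, 6.0]))   # [-1.0, 0.0, 1.0]
```

In practice the minimum and maximum would be taken from the training set and reused for the test and external compounds, so that all sets share one scale.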

Feature selection and parameter optimization

Usually, only a few of the calculated molecular descriptors are essential for developing an SVM model. To select the most important descriptors and optimize the model parameters simultaneously, an in-house program based on a genetic algorithm (GA) was coded in our laboratory [55]. A GA randomly initializes a population of solutions and then improves it through repeated operations of mutation, crossover and selection. Each possible solution is referred to as a chromosome, which consists of two parts: the feature mask and the SVM parameters (C and r). Each bit of the feature mask is 0 or 1, where 0 indicates that the corresponding descriptor is discarded and 1 indicates that it is kept. Although C and r are real numbers, only specific discrete values were considered in this study: C and r were represented as 2^m and 2^n, where m and n are integers.
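The chromosome layout described above can be sketched as follows. The in-house program itself is not reproduced here, so the exponent bounds, the mutation rate and the single-point crossover scheme are all assumptions, and the fitness evaluation (training an SVM on the selected descriptors) is omitted.

```python
import random

N_DESCRIPTORS = 292          # number of calculated descriptors
EXP_MIN, EXP_MAX = -10, 10   # assumed bounds for the integer exponents m and n

def random_chromosome(rng):
    """Create a random chromosome: one mask bit per descriptor plus m and n."""
    return {
        "mask": [rng.randint(0, 1) for _ in range(N_DESCRIPTORS)],
        "m": rng.randint(EXP_MIN, EXP_MAX),
        "n": rng.randint(EXP_MIN, EXP_MAX),
    }

def decode(chrom):
    """Return the indices of kept descriptors and the values C = 2^m, r = 2^n."""
    kept = [i for i, bit in enumerate(chrom["mask"]) if bit == 1]
    return kept, 2.0 ** chrom["m"], 2.0 ** chrom["n"]

def mutate(chrom, rng, rate=0.01):
    """Flip each mask bit with probability `rate`; occasionally redraw m or n."""
    mask = [1 - bit if rng.random() < rate else bit for bit in chrom["mask"]]
    m = rng.randint(EXP_MIN, EXP_MAX) if rng.random() < rate else chrom["m"]
    n = rng.randint(EXP_MIN, EXP_MAX) if rng.random() < rate else chrom["n"]
    return {"mask": mask, "m": m, "n": n}

def crossover(parent_a, parent_b, rng):
    """Single-point crossover on the mask; m from one parent, n from the other."""
    point = rng.randint(1, len(parent_a["mask"]) - 1)
    return {
        "mask": parent_a["mask"][:point] + parent_b["mask"][point:],
        "m": parent_a["m"],
        "n": parent_b["n"],
    }

rng = random.Random(42)      # fixed seed, for a repeatable illustration only
parent_a, parent_b = random_chromosome(rng), random_chromosome(rng)
child = mutate(crossover(parent_a, parent_b, rng), rng)
kept, C, r = decode(child)
print(len(kept), C, r)
```

Encoding the exponents rather than C and r themselves keeps the search space small and searches both parameters on a logarithmic scale.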

When developing the classification model, 5-fold cross-validation was adopted to assess the reliability of the statistical models. The training set of 207 ligands was randomly split into five subsets of approximately equal size. In each validation round, one subset was used for testing while the remaining four were used for training the model. This process was repeated five times so that each subset was used for prediction exactly once.
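The fold construction can be sketched as below. The shuffling seed and the round-robin assignment of shuffled indices to folds are assumptions; any scheme that yields five roughly equal, non-overlapping subsets would serve the same purpose.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train_idx, test_idx

sizes = [len(test_idx) for _, test_idx in k_fold_indices(207)]
print(sizes)   # [42, 42, 41, 41, 41]
```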

Model validation

After the model was built, we adopted different means to evaluate its performance. The receiver operating characteristic (ROC) curve is generally used to assess the classification power of computational models [56, 57]. To plot a ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are required. TPR, which is also called sensitivity and is calculated with equation (1), is the fraction of all positive inputs that are correctly predicted as positive. In contrast, FPR is the fraction of all negative inputs that are incorrectly predicted as positive. FPR is equal to (1 − specificity), where the specificity is calculated with equation (2).

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (1)$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (2)$$

In these equations, TP represents the number of correctly predicted agonists, TN represents the number of correctly predicted antagonists, FP represents the number of antagonists that are incorrectly predicted as agonists and FN represents the number of agonists that are incorrectly predicted as antagonists.

The quality of our SVM model was also measured using the Matthews correlation coefficient (MCC), which is defined by equation (3) [58, 59]. It returns a value between −1 (worst model) and 1 (perfect model), while 0 represents a random model.

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (3)$$

To fully examine the performance of the developed model, the overall accuracy was calculated using equation (4).

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$
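The four measures can be computed directly from the confusion-matrix counts. The sketch below assumes the standard definitions of sensitivity, specificity, MCC and overall accuracy, with agonists as the positive class. The function name and the counts in the example are illustrative and are not taken from Table 3.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, MCC and accuracy from confusion counts.

    Assumes both classes are present, i.e. tp + fn > 0 and tn + fp > 0.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts: 50 agonists (40 found) and 50 antagonists (45 found).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```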

Results

Feature selection and model performance

Using our in-house feature selection program, 13 descriptors were finally selected (Table 2), which were roughly divided into five classes. The optimized values of C and r were both 1 in the SVM model, with a cross-validation r² of 0.826. As shown in Table 3 (before refinement), the overall predictive accuracy, sensitivity and specificity for the training and test sets were all higher than 0.8, indicating that the developed model was reliable and robust. Indeed, the calculated MCC was 0.783 for the model.

Table 2 List of the 13 optimized molecular descriptors used in the SVM model, with their descriptions and classes.
Table 3 Values of accuracy, sensitivity and specificity before and after refinement.

To view the results more intuitively, the quality of the model was illustrated using ROC plots (Figure 1). The (0,1) point in the upper left corner of the ROC space represents 100% sensitivity (no false negatives) and 100% specificity (no false positives). A random classifier gives a diagonal line (the so-called line of no-discrimination) from the bottom left to the top right corner. Finally, we calculated the area under the ROC curve (AUC). The AUC values for the training and test sets were 0.883 and 0.906, respectively.
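How a ROC curve and its AUC arise from classifier scores can be sketched as follows. The threshold sweep and trapezoidal integration are a common way to compute them and are an assumption here, not a description of the software used for Figure 1. The labels and scores are made up for illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    labels: 1 for positives (agonists), 0 for negatives (antagonists).
    scores: higher values mean the classifier favours the positive class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last member of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))   # 0.8889
```

In this toy example eight of the nine positive-negative pairs are ranked correctly, which is why the AUC equals 8/9.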

Figure 1

ROC curves for the training set (A) and test set (B). The solid line represents the ROC curve, and the area under the curve is characterized by the AUC, whose values for the training and test sets are 0.883 and 0.906, respectively. The dashed line is the diagonal, which describes a random model.

Model refinement

To further improve the performance of the developed model, the probability estimate factor was used as a criterion to remove ambiguous compounds, defined as those with a factor of less than 0.7. Two probability estimates, namely the agonist probability estimate and the antagonist probability estimate, were calculated for each compound, and their sum was equal to 1. The larger of the two is regarded as the probability estimate factor of the compound. Figure 2 shows the probability estimate distribution for compounds in the training and test sets. In the refined datasets, 195 compounds remained in the training set and 41 compounds remained in the test set, demonstrating that the majority of the compounds had a probability estimate higher than 0.7. We then reevaluated our model using the refined datasets, which yielded improved accuracy, sensitivity and specificity (>0.9, after refinement in Table 3).
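The refinement step amounts to a simple filter on the larger of the two class probabilities. In the sketch below, the compound names and probabilities are invented for illustration, and the function names are hypothetical. Compounds with a factor of exactly 0.7 are kept, since only those below the threshold are removed.

```python
def probability_estimate_factor(p_agonist):
    """Return the larger of the agonist and antagonist probability estimates."""
    return max(p_agonist, 1.0 - p_agonist)

def refine(predictions, threshold=0.7):
    """Split (name, agonist probability) pairs into kept and removed compounds."""
    kept, removed = [], []
    for name, p_agonist in predictions:
        if probability_estimate_factor(p_agonist) >= threshold:
            kept.append(name)
        else:
            removed.append(name)
    return kept, removed

# Invented agonist probability estimates for four hypothetical compounds.
predictions = [("cmpd_a", 0.95), ("cmpd_b", 0.62), ("cmpd_c", 0.12), ("cmpd_d", 0.45)]
kept, removed = refine(predictions)
print(kept)      # ['cmpd_a', 'cmpd_c']
print(removed)   # ['cmpd_b', 'cmpd_d']
```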

Figure 2

Probability estimates for the training set (white) and test set (black). Only 12 of the 207 ligands in the training set have a probability estimate lower than 0.7, so the threshold of the probability estimate factor was set to 0.7. Moreover, the probability estimates of 11 ligands in the test set are below this threshold, indicating that the predicted results for these ligands may be unreliable.

Application of the SVM model to an external dataset

To validate the reliability of our SVM model for identifying agonists and antagonists of the 5-HT1A receptor, we applied the model to an external dataset of 25 ligands collected from very recently published literature. Overall, 15 compounds were predicted to be antagonists and 3 compounds were predicted to be agonists with probability estimates higher than 0.7 (Table 4).

Table 4 Detailed predicted results of 25 external compounds.

Discussion

As shown in Table 2, 13 molecular descriptors were selected as the most relevant for discriminating between 5-HT1A receptor agonists and antagonists, including VAMP/AM1 semi-empirical quantum-chemical, electro-topological state (E-state), molecular property and shadow index descriptors. Several of the selected descriptors reflect structural information that is closely related to the function of these ligands. For example, the E-state indices have been shown to be efficient descriptors of the affinity of 5-HT1A receptor antagonists [60]. Another important descriptor, the “number of surface points with positive electrostatic potential”, is consistent with the observation that most agonists and antagonists of the 5-HT1A receptor are positively charged. None of the descriptors alone could completely describe the differences between agonists and antagonists; however, their collective use yielded a more accurate model. Thus, a group of diverse, comprehensive and representative descriptors was used to develop a powerful SVM model that could effectively distinguish agonists from antagonists of the 5-HT1A receptor.

Another highlight of our study was the consideration of the probability estimate factor while establishing the SVM model. For example, a compound predicted to be an agonist with a probability estimate of 0.9 is more likely to be an agonist than one with a probability estimate of 0.6, and the same is true for antagonists. Based on the probability estimate distribution for the compounds of the training set, the threshold of the probability estimate was set to 0.7 for more reliable classification. The compounds with a probability estimate of less than 0.7 were then removed from the training and test sets. The predictive accuracy for the refined datasets increased significantly, especially for the test set, from 0.865 to 0.927. These results demonstrate an improved predictive power after introducing the probability estimate factor. Moreover, the differences in accuracy, sensitivity and specificity between the training and test sets were much smaller after refinement than before, suggesting that a more balanced model between the training and test sets was achieved.

The probability estimate is an important parameter for judging the reliability of a predicted result. For instance, HT01–HT10, a series of carboxamide and sulfonamide alkyl [61], were predicted to be antagonists with high probability. Indeed, they are structurally similar to WAY-100635 (Figure 3), a well-known antagonist of the 5-HT1A receptor, indicating that HT01–HT10 are likely to function as antagonists of the 5-HT1A receptor. These results show that our SVM model can guide the exploration of agonists and antagonists of the 5-HT1A receptor. In addition, a group of newly discovered N-phenylpiperazine derivatives, HT11–HT17, were predicted to function as agonists or antagonists with high probability estimates. For example, compound HT13 was predicted by our model to be an agonist, and another research group has also reported that it stimulates 5-HT1A receptor activity like an agonist [62], indicating that structurally similar compounds may play different roles in regulating the function of the receptor.

Figure 3

Chemical structures of WAY-100635 (A) and compounds HT01–HT10 (B). As shown, they are structurally similar, all of them containing a [4-(2-methoxyphenyl)-1-piperazinyl]ethyl structural fragment. WAY-100635 is believed to act as a selective 5-HT1A receptor antagonist, so compounds HT01–HT10 may also be antagonists of the 5-HT1A receptor.

On the other hand, compounds with only moderate or poor 5-HT1A receptor affinity in biological tests were predicted to be binders with lower probability estimates. For instance, the probability estimates of compounds HT18–HT22, which are weaker binders [63], were approximately 0.7. Therefore, further biological research on these compounds may not be urgent. Similarly, the predicted results for compounds HT23–HT25, which had lower probability estimates, are contrary to their known biological functions. In fact, they have been shown to be weak agonists or partial agonists of the 5-HT1A receptor [63]. These conflicts indicate that our SVM model, like other computational models, may have a limited domain of applicability in chemical space and requires further optimization.

In summary, based on 13 molecular descriptors derived from previously known agonists and antagonists, we developed a robust SVM model with high predictive capability. Moreover, the predictive accuracy for the training and test sets (especially for the test set) increased significantly when only the compounds with a probability estimate higher than 0.7 were considered. We then applied the model to an external dataset for validation, which confirmed that our GA-SVM method is effective for the classification of agonists and antagonists of the 5-HT1A receptor. The strategy and methods used in this study may be extended to other GPCR members.

Author contribution

Prof Wei-liang ZHU designed and supervised the research and revised the manuscript. Xue-lian ZHU performed the research, analyzed data and wrote the manuscript. Prof He-yao WANG and Zhi-jian XU helped with parts of the research design. Hai-yan CAI, Yong WANG, and Prof Ao ZHANG helped to perform the research and revise the manuscript.