Resistance gene identification from Larimichthys crocea with machine learning techniques

The research on resistance genes (R-gene) plays a vital role in bioinformatics as it has the capability of coping with adverse changes in the external environment, which can form the corresponding resistance protein by transcription and translation. It is meaningful to identify and predict R-gene of Larimichthys crocea (L.Crocea). It is friendly for breeding and the marine environment as well. Large amounts of L.Crocea’s immune mechanisms have been explored by biological methods. However, much about them is still unclear. In order to break the limited understanding of the L.Crocea’s immune mechanisms and to detect new R-gene and R-gene-like genes, this paper came up with a more useful combination prediction method, which is to extract and classify the feature of available genomic data by machine learning. The effectiveness of feature extraction and classification methods to identify potential novel R-gene was evaluated, and different statistical analyzes were utilized to explore the reliability of prediction method, which can help us further understand the immune mechanisms of L.Crocea against pathogens. In this paper, a webserver called LCRG-Pred is available at http://server.malab.cn/rg_lc/.

Considering these and other similarities, as a solution, machine leaning was used to model all reviewed resistance genes in all species, and the model was evaluated and applied to identify L.Crocea for novel R-gene.
This paper aims to identify and analyze the resistance genes of Larimichthys crocea so as to improve its own immune system to fight against the invasion of pathogens. In view of the specific functional classes of proteins with common structure and physical-chemical characteristics, we extract feature information from all known R-gene sequences with machine learning methods, and classification algorithms are adopted for identification of the gene fragment separately. Potential rules of the sequences could be acquired by studying the reviewed sequences, and the same properties were able to confirm by using the classifier model we obtained to classify the unknown sequence. Moreover, different feature extraction methods and classification methods were compared, and the results and differences of the prediction are discussed and analyzed. In addition, the quality of the prediction was verified. The main flowchart of the process is given in Fig. 1. In short, experiments demonstrate that the proposed methods, especially the SVMProt-RF by using SVM-Prot 28,29 combined with Random forest, could be utilized for the prediction of novel R-gene.

Results
Comparative Analysis. Sampling method Comparative Analysis. Firstly, on the basis of SVM-Prot feature method, we compared the performance of original samples (Ω 0riglR−g ) and samples after two sampling strategies (Ω tr and Ω wtr ) separately under Random forest classifier, where all other parameters are the same. Table 1 shows the results based on three different sampling methods. As we can see, given that the number of non-R-gene is greater than R-gene, it makes no sense if R-gene was classified as non-R-gene, though it gets higher accuracy. Besides, weighted random sampling contributes to the best result, which is good for establishment of a better performance classifier.
Multi-Classifier Comparative Analysis. In order to demonstrate the validity of the classification results of R-gene sequence in the Random forest algorithm, we compare the results of Ω tr treated by SVM-Prot feature under different classifiers. To get the objective evaluation, we adopt both test set Ω test and 10-fold cross-validation to verify the classification effect, as shown in Table 2 and Fig. 2. Visibly, the results of Random forest, LibD3C 30 , Bagging, Gradient Boosting Decision Tree (GBDT) and RandomSubSpace algorithm we obtained are better than others,   Table 2. In view of the performance of classifier, the sensitivity of J48, KNN-IB1, Random tree, GBDT and SMO are all less than 72%, that is, the model is less than 72% for classifying R-gene correctly, even if the total accuracy of some of these methods is very high. Besides, the sensitivities of Bayes Network, Naive Bayes, and LibSVM are higher than 80%, but their low specificities result in a serious false positive problem when identifying the R-gene. Different from the above classifiers, Random forest, LibD3C, AdaboostM1, bagging and RandomSubSpace with the guarantee of high sensitivity have an acceptable specificity. In addition, Random forest and LibD3C work better considering the Mcc, total accuracy and ROC Area. Furthermore, for the time consumed, LibD3C is 36 times more than Random forest with the same parameters. For the test set, KNN-IB1 achieved a higher accuracy rate of 77.5998% while Random forest 69.347%, as can be seen in Fig. 2, which can only indicate that KNN-IB1 has a higher classification accuracy of non-R-gene. Therefore, the function of Random forest classifier shows better with comprehensive consideration.
Multi-Feature Comparative Analysis. In this section, feature extraction methods are compared in our experiment on the basis of Random forest classifier, including the 188-D constructed from SVM-Port features, Pse-AAC 31 features and 473-D features, as shown in Table 3. The strengths of the 188-D feature extraction algorithm is obvious, which obtains higher accuracy as well as higher sensitivity and specificity, better than the other two feature extraction algorithms. The second part of Table 3  Identification R-gene from Larimichthys crocea. To get a better understanding of Larimichthys crocea immune system for future breeding and disease prevention, an effective support and recognition of the resistance genes of L.Crocea is particularly crucial. In our experiments, a combined classification model was developed by identifying all reviewed R-gene, and it was applied to screen the R-gene of L.Crocea. As for the selection of the original data of prediction model, we used the protein sequence coded by R-gene based on the following  conditions: R-gene expresses the resistance function through the protein product directly; protein sequence consists of 20 amino acid with abundant physicochemical properties, while nucleotide sequence consists of only 4 elements, which is not conducive to the feature extraction. Here, we obtained multiple hybrid prediction models with higher accuracy after a series of comparison as demonstrated before. Ω LC (sequence of L.Crocea) was predicted based on these models. A comparison was made between SVMProt-RF method and others as well. Figure 3 gives the results of the prediction. As we can see, 64.64% R-gene existent in the sequences of L.Crocea while 61.01%, 61.12%, 61.68%, 39.74%, 65.16%, 52.70% and 43.20% were respectively obtained in others. Furthermore, Table 4 shows the prediction results of Ω LC applied by Ω 0riglR−g model, Ω tr and Ω wtr model, their prediction results taking up 45.30%, 64.64% and 54.39% respectively. A comparative table of SVMProt-RF and NBSpred prediction is given in the Table 5, since there exist obvious similarities of pathogen-associated molecular patterns (PAMPs) in animals and plants, especially the plant receptors resembling mammalian Toll-like receptors (TLR) or cytoplasmic nucleotide-binding oligomerization domain leucine-rich repeat (LRR) proteins 26 , and NBSpred is a web server for predicting nucleotide binding site lucine-rich repeat proteins (NBS-LRR) of plant 24 . SVM method is used to extract features of datasets by calculating six compositional attributes, including amino acid frequency, dipeptide frequencies, tripeptide frequencies, multiplet frequencies and hydrophobicity composition 24 . Total, 9801 sequences are identified as R-gene and R-gene-like genes through SVMProt-RF. NBSPred only detected 2.544% sequences as R-gene from L.Crocea dataset. Distinct differences remain in plants and vertebrates, such as plants do not own specific immunity and cannot produce immunizations because they lack circulatory blood system like an animal. So, we can find that one prediction model can identify R-gene of plants accurately but fails to predict R-gene of L.Crocea.

Discussion
In this paper, after comparison among different feature extraction methods and classification algorithms, the SVM-Prot feature extract method and random forests classification algorithm were combined (SVMProt-RF)   to preliminarily mine the resistance gene of the whole protein data, which proves to achieve the best results. And further screening was conducted on the acquired resistance gene to determine the relationship between the candidate sequence and the resistance trait. The work was divided into the following parts: the establishment of resistance data sets, the feature extraction, the sampling of imbalanced data sets and the comparison of resistance genes classification models. In comparison with other previously mentioned works and methods, we can reach the conclusion that our methods have the following advantages: (1) It reduce the redundancy of R-gene samples, and optimize efficiency by keeping the original data information.
(2) Feature extraction based on datasets that contains resistance genes of all reviewed species and the prediction of R-gene of L.Crocea are more accurate. (3) Compared with other classifiers, the result of SVMProt-RF method associated with weight random-sampling shows that the model has a better sensitivity and specificity, and better adaptability to identify R-gene. (4) It Can be used to predict the resistance genes of more candidate sequences, and verify the correlation between them with biological experiment.
The establishment of the model is of great significance for the subsequent resistance gene discovery and its evolution, regulation and pathway analysis. What's more, for the immune system-related genes of Larimichthys crocea, further exploration is still required.

Method
Data preprocessing. The original R-gene sequences were retrieved from Uniprot database 32 , which has been reviewed by experimentation. The dataset is composed of 13,959 sequences that contains all species like zoon, plants and fungi, denoted as Ω 0riglR−g . Each R-gene class, nevertheless, contains a lot of duplicate sequences that cause excessive redundancy. Therefore, CD-HIT was utilized to remove redundancy in positive dataset, which has been used in the realm of bioinformatics 33,34 . Considering the following algorithm: First, sort out all sequences according to their length; then form the classes by sequentially processing the length sequence. If the similarity of new sequence was higher than the existing class in threshold, the new sequence was added to this class, otherwise make it as a new class. Finally, 6720 R-gene were obtained with similarity below 70% after CD-HIT, denoted as Ω R−g : The negative sample was acquired from PFAM families due to the intimate relationship between R-gene and its protein sequence. No-duplicates PFAM of R-gene were removed from the whole PFAM families database. We got negative families here, and the longest sequence of proteins was fetched in each negative families. 10028 non-R-gene sequences were involved, denoted as Ω NR−g . Thus the training dataset Ω is denoted as follows: where Ω contains a total of 16,748 sequences. The prediction datasets of Larimichthys crocea that consist of 18,018 sequences are collected from Uniprot database 32 as well. To describe it simply, we denoted it as Ω LC .

Feature extraction algorithm. SVM-Prot features. SVM-Prot is a web server for protein classification. It
constructs 188-D features for protein sequences description and classification 28,29 . The features have been applied successfully in several protein identification works, such as cytokines 35,36 and enzymes 37,38 . The extracted features include hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility 28 . For each of these 8 types of physical-chemical properties, some feature groups were designed to describe global information of protein sequences. These feature groups contain composition (C), transition (T) and distribution (D) 14,28 . C expresses a percentage of the amino acids of particular property over total amino acid sequence. T is the frequency of amino acids of particular property that are intimately next to another amino acid of particular property. D depicts the position of amino acids of particular property in their sequences. Thus, the dimension of each feature vector is 21 (denoted as D eachV ). In addition, considering amino acid composition (denoted as H acc ), the protein structure is composed of 20 amino acids:

Table 5. Comparison of SVMProt-RM and NBSPred prediction for R-gene of L.Crocea.
where L is the number of features. The features of Ω and Ω LC were extracted. Table 6 shows a part of the results of PSBA1 R-gene in Acaryochloris marina.
Pseudo amino acid composition features. Pseudo amino acid composition features (Pse-AAC) 41 as an efficient computation tool has been diffusely leveraged for protein sequences in predicting protein structures and functions 31,41 , as well as DNA and RNA sequences 42 . To describe it distinctly, we assume a R-gene sequence R, expressed as: here, L denotes the length of the sequence and r i (i = 1, 2, … , L is the position of residue in R. Besides, given the different amphiphilic features of proteins, the Pse-AAC feature of R can be defined as the following vector 41,42 : The classifiers tend to have a higher recognition rate for the majority class, which make it hard to identify the minority class correctly 44,45 . What we want is to eliminate the over fitting problem caused by unbalanced data. The commonly used method is sampling 46 , including under-sampling and over-sampling.
Since it is easy to obtain reviewed R-gene but not the non-R-gene, which incurs serious class imbalance problem and affects the performance of the classifier, two sampling methods are used in this paper to find out the best performance. One is random-under-sampling. The balance of the train sets is realized by random sampling of large class set, where the number of large class sets equals the small class sets. Here we get 6720 sequences each for Ω NR−g and Ω R−g as train sets, denoted as Ω tr , and 3308 negative sequences remain as test sets Ω test . Another method we applied is weighted random sampling 47 , balancing the dataset by adding different weights to the unbalanced samples. Seeing that the ratio about Ω R−g and Ω NR−g is approximately equal to 7:10, weight factor 10 and 7 were added to the Ω R−g and Ω NR−g separately, so 16748 train sets were obtained, denoted as Ω wtr .
Classifier selection and tools. Random forest. Random forest is a kind of classifier which is trained and predicted by a number of trees, as proposed by Leo Breiman 48 . Numerous advantages have been listed than other algorithms, including noise-ability, avoiding over-fitting, being able to handle high dimensional (feature) data and etc. The essence in this algorithm is an improvement based on the decision tree. An object can be categorized into a class, when the class follows the principle of the judgment based on every decision tree in the forest. The classification ability of the single tree would be marginal, but the probability of being classified properly is greatly enhanced after random generation of a large number of decision trees. In this study, R-gene is a binary classification, so all decision trees are binary tree.
WEKA. WEKA is one of the well-known data mining platform (http://www.cs.waikato.ac.nz/ml/weka/) that are utilized for data analysis and model prediction. Several machine learning algorithms were gathered as tools. Cross-validation is provided by WEKA. In this study, we utilize its classification function to establish a model of Ω tr , and its test sets Ω test to verify the precision of the model. Thirteen classifiers are selected for this paper.