A computational method for predicting nucleocapsid protein in retroviruses

Nucleocapsid protein (NC) in the group-specific antigen (gag) of retrovirus is essential in the interactions of most retroviral gag proteins with RNAs. Computational method to predict NCs would benefit subsequent structure analysis and functional study on them. However, no computational method to predict the exact locations of NCs in retroviruses has been proposed yet. The wide range of length variation of NCs also increases the difficulties. In this paper, a computational method to identify NCs in retroviruses is proposed. All available retrovirus sequences with NC annotations were collected from NCBI. Models based on random forest (RF) and weighted support vector machine (WSVM) were built to predict initiation and termination sites of NCs. Factor analysis scales of generalized amino acid information along with position weight matrix were utilized to generate the feature space. Homology based gene prediction methods were also compared and integrated to bring out better predicting performance. Candidate initiation and termination sites predicted were then combined and screened according to their intervals, decision values and alignment scores. All available gag sequences without NC annotations were scanned with the model to detect putative NCs. Geometric means of sensitivity and specificity generated from prediction of initiation and termination sites under fivefold cross-validation are 0.9900 and 0.9548 respectively. 90.91% of all the collected retrovirus sequences with NC annotations could be predicted totally correct by the model combining WSVM, RF and simple alignment. The composite model performs better than the simplex ones. 235 putative NCs in unannotated gags were detected by the model. Our prediction method performs well on NC recognition and could also be expanded to solve other gene prediction problems, especially those whose training samples have large length variations.

www.nature.com/scientificreports/ that could predict the precise locations of NCs' initiation and termination sites has been proposed yet. Besides, length of NCs from different retroviridae genera varies from 48 to 126aa according to records in National Center for Biotechnology Information (NCBI), thus gene prediction methods for genes with conserved lengths are not applicable. Furthermore, classical database search tools [13][14][15] couldn't achieve satisfying results in prediction of retrovirus genes with large length variation 16 . Therefore, there is an urgent need to come up with a computational model for NC prediction. In this paper, computational models to identify NCs from retroviruses were proposed. All available annotated NC sequences in retroviruses were collected for the training and testing process. Position weight matrix (PWM) along with all six parameters of factor analysis scales of generalized amino acid information (FASGAI) 17 were used to generate the feature space for NC prediction. The initiation and termination sites of NCs were separately predicted and combined together afterwards to acquire high prediction accuracy when dealing with sequences that are poorly conserved in their lengths. Their performance was tested by fivefold cross validation test. A composite ab initio model to predict intact NCs from genetic sequences was then proposed. It performs better than the simplex ones. All of the 6651 available gag sequences without NC annotations were scanned with the composite model and 282 putative NCs in them were found.

Materials and methods
NC collection. All available amino acid sequences of retroviruses with their NCs annotated based on experimental evidences were collected from NCBI at http:// www. ncbi. nlm. nih. gov. There are 77 of such sequences in total. Among them, 4 of them are beta-retrovirus, 13 of them are gamma-retrovirus, 9 of them are from deltaretrovirus, 2 of them are epsilon-retrovirus and the other 49 of them are lentivirus. All these sequences were used for the following training and testing process. All of them are with intact NC structures (please refer to S1 File for details).

Separate prediction of NC boundaries.
Traditional gene predicting methods could performance well when predicting gene sequences with fixed lengths. However, when it comes to gene sequences with large length variations, such methods may lose effectiveness or even feasibility. This might be because the constant dimension of feature space used in traditional methods couldn't represent features of such genes properly. The lengths of annotated NCs in retroviruses range from 48 to 126aa, so an approach to revise the traditional gene predicting methods to fit the NC predicting problem is needed.
Our predicting method focuses more on the border areas adjacent to the start and end of NCs instead of interior areas away from them, for the former contain more effective information for gene prediction and are usually more conservative. The fixed length flanking residues of the initiation site and termination site were predicted to deduce the precise locations of the start and end of NCs. Initiation site and termination site were predicted separately, and the sequence between them were regarded as a candidate NC sequence only when it's length is reasonable. Then the most probable candidate NC was singled out among all candidate NCs to be the final putative NC according to the decision value and alignment score involving it. This technique brings out both feasibility and high accuracy.
Sample preparation. Two sets of training samples for initiation and termination sites prediction respectively were built separately. The training samples for initiation sites could be denoted as: Similarly, the training samples for termination sites could be denoted as: Here, I p denotes a positive training sample of initiation site generated from a gag sequence, I n denotes a negative training sample of initiation site. Similarly, T p denotes a positive training sample of termination site and T n denotes a negative training sample of termination site. Init(NC) and Term(NC) represent the true initiation site and termination site of a NC sequence. osi and ost are randomly generated offsets added to initiation and termination site locations respectively to generate negative samples. L is and L ts denote the length of initiation samples and termination samples.
We generated the negative sample set with a size 5 times as large as the positive sample set and took the imbalanced sample sets problem into our consideration in the modelling process, to overcome the difficulty of the lack of positive training samples.

Feature selection.
A hybrid feature space construction approach was proposed by combining position characteristics and physicochemical properties of sequences.
(1) I p = gag(i : i + L is − 1) www.nature.com/scientificreports/ Position characteristics. The widely recognized PWM 18 was applied to extract the position characteristic of sequences. By aligning residues starting from initiation sites or ending at termination sites of positive NC sequences, PWMs are defined as follow: Here, f kj stands for the absolute frequency of amino acid k in the jth position of N aligned sequences of length l , j ∈ (1, ..., l) , k is the set of amino acids, b k = 1/|k| (|k|=20 for amino acids, so b k = 0.05).
After generating the PWM, the position characteristic of any l-aa-long sequence V was extracted by the following mapping method. Each amino acid of V was assigned with its corresponding value in the matrix according to its position. Then a l-dimension-vector V Pos was generated to represent the position characteristic of the original l-aa-long sequence: where j ∈ (1, ..., l) , k = V j .
Physicochemical properties. All 6 parameters of the FASGAI 19 were selected to extract the physicochemical properties of sequences (Please refer to S2 File for details of FASGAI). FASGAI involves hydrophobicity, alpha and turn propensities, bulky properties, compositional characteristics, local flexibility, and electronic properties derived from 335 property parameters of 20-coded amino acids. Thus when dealing with an l-aa-long sequence, the sequence was mapped into a 6 × l matrix to represent its physicochemical properties.
After combining the position characteristics and physicochemical properties, a feature space with (1 + 6) × l features in total was established for the l-aa-long sequence.
Binary classifiers. In our previous study, three binary classifiers based on different principles were applied to the same feature space to test and compare their predicting abilities: weighted support vector machine (WSVM), random forest (RF) and weighted extreme learning machine (WELM). And we found that the combination of the first two of them could generate the best predicting performance 20 . Prediction models based on WSVM and RF were separately built to predict the initiation and termination sites of NCs.
Finding candidate NCs. After the probable NC start and end locations were predicted, a combination method to combine them is required. As there may be several possible NC start and end combination pairs in one unannotated gag sequence, it is necessary to dispose all the less probable putative combinations and leave the most probable one as the final prediction result. The details of such ruling out strategy are shown as follow: Step 1: Keep all the putative NC boundary pairs generated from RF models which have interval distance within the range of NC sequence lengths as candidate NC boundary pairs. For the mth and nth amino acids in a gag sequence, the amino acid pair (m, n) is a candidate NC pair only when it satisfies: where NC min and NC max are the minimum and maximum lengths of annotated NCs respectively, e min and e max are natural numbers and act as the relaxation parameters for the minimum and maximum NC lengths respectively. L is and L ts denotes the length of initiation samples and termination samples.C RFI and C RFT are Boolean variables, their values indicate whether the mth and nth amino acids of gag sequence S are candidate initiation site and termination site respectively according to the prediction results from random forest models.
Step 2: Calculate the products of decision values of initiation and termination sites of all candidate NC boundary pairs sorted out in step 1. Then keep the candidate boundary pair with the largest product as the putative NC (A decision value is generated from WSVM models according to the distance of a sample to the classification hyper plane. The prediction result is more likely to be positive when the decision value is larger, vice versa.). Consider amino acid pair (m, n) as a putative NC pair only when it also satisfies: where D WSVMI (S, m, L is ) and D WSVMT (S, n, L ts ) are decision values assigned to the mth and the nth amino acids of gag sequence S after computation of the WSVM models. (m, n) also satisfies the constraints in (6).
Combination with homology based method. After the screening process, the putative NCs generated by WSVM & RF models are compared with putative NCs generated from homology based methods. The results are then combined together to enhance the prediction performance. First we introduce a simple alignment (SA). Thus the locations of the putative initiation site and termination site are shown as follow: www.nature.com/scientificreports/ Here max D P = max D WSVMI (S, m, L is ) · D WSVMT (S, n, L ts ) ( m and n also satisfy the constraints shown in (6), max A P = max A I (S, p, L is ) · A T (S, q, L ts ) , subject to Here A I (S, p, L is ) is the maximum alignment score generated from a L is long subsequence starting from the pth amino acid of gag sequence S after comparing it with all the positive training samples of initiation sites. Analogously, A T (S, q, L is ) is the maximum alignment score of a L ts long subsequence ending the qth amino acid of S . The alignment function Align calculates the total number of identical amino acids at the same locations in two sequences with equal length. Since the products of decision values of totally correct boundary pairs are close to 1 but couldn't reach it, while the products of alignment scores have a maximum value of 1, parameter α is introduced to balance the two kinds of maximum products for fair comparisons ( α = 0.95 here).
Since the alignment technique here is rather simple, a revision could be done to enhance the performance of combination with homology based method. The widely used bioinformatics tool for sequence searching: Basic Local Alignment Search Tool (BLAST) is used to replace the original simple alignment. Take the unannotated gag sequence S as the query sequence, and take all the positive NC sequences in the training set as the subject sequences. Then the NC sequence that could produce the most significant alignment results indicates the area most likely to be an NC in S . The locations of the putative initiation site and termination site after combination with BLAST (blastp here since the sequences are protein sequences) are shown as follow: subject to Here min B E is the minimum E-value produced by blastp. The subsequence between the pth and the qth amino acid is the corresponding area that produces the minimum E-value. β is the threshold value that determines the selection of prediction results.
Performance assessment. fivefold cross-validation was employed to assess the performance of the WSVM and RF models predicting the initiation sites and termination sites in this paper.
G − mean under fivefold cross-validation was selected as the major performance evaluation measure. It also provide the basis for parameter selection of models. S n , S p , ACC and MCC were also calculated as a supplemental reference.
where true positive ( TP ) and false negative ( FN ) are the number of positive samples that are predicted to be positive and negative respectively. Analogously, true negative ( TN ) and false positive ( FP ) are used to denote the number of negative samples that are predicted to be negative and positive respectively.
Among these evaluation measures, G − mean and MCC are better at providing a comprehensive view of the prediction performance, especially with our training dataset which has quantity imbalance between positive and negative data.
As with the performance assessment on prediction of entire NC proteins, leave-one-out cross-validation is applied. Each turn we pick out one gag sequence with NC annotation as the testing sequence and leave all others as the source of training samples. Then the above process is repeated until all sequences have been left out for a time as the testing sequence. The reason for not applying fivefold cross-validation here is to rule out random factors as much as possible, since fivefold cross-validation could generate different partition of datasets which may cause fluctuations in prediction performance. Such fluctuations could undermine the cogency of performance comparison between different methods. The prediction accuracy of the initiation sites, termination sites and entire NCs were calculated and compared.

Detecting putative NCs in gags.
When the NC predicting models are eventually built, the models could be used to search for more putative NCs in unannotated gags. A fixed length sliding window is used to "scan" the unannotated gag sequences to find candidate NC initiation and termination sites. L is and L ts were set to (8) A I (S, p, L is ) = maxAlign(S(p, p + 1, ..., p + L is − 1), I p )/L is A T (S, q, L is ) = maxAlign(S(q − L ts + 1, q − L ts , ..., q), T p )/L ts . www.nature.com/scientificreports/ equate with the length of the sliding window for convenience. The "WSVM & RF + SA" approach is adopted in the detecting process for speed and convenience. When P M is larger than the threshold β , its corresponding candidate NC boundary pair is predicted as a putative NC.

Results
Predicting Performance of the method. Prediction models based on strategy described above were built. Their effectiveness was also tested and shown below. (Prediction source code is available at SourceForge, with the download URL: https:// sourc eforge. net/ proje cts/ ncpre dicti on/ files/ NCpre dicti on. zip/ downl oad).

Accuracy of the prediction of NC initiation and termination sites.
The performance of the prediction models aimed at recognizing NC initiation and termination sites based on WSVM and RF were tested by fivefold cross-validation and shown in Table 1. From Table 1, we can find that the G − mean values of initiation sites and termination sites are above 0.9900 and 0.9548 respectively. The MCC values of initiation sites and termination sites are above 0.9735 and 0.9179 respectively. It indicates that both WSVM and RF models could generate satisfying results.
Accuracy of the prediction of NC. All of the 77 retrovirus sequences collected with intact NC structures were used to test the predicting performance of different methods. Leave-one-out cross-validation is applied here to rule out random factors. We tested the performance of WSVM & RF, SA, blastp and their different combinations (please refer to S3 File for more details). The accuracy amount and rate of the prediction of initiation site, termination site and entire NC are shown in Table 2. We can find that the combination of machine learning methods and homology based methods could bring about better performance (prediction results of blastp is available at SourceForge, with the download URL: https:// sourc eforge. net/ proje cts/ leave oneou tblas tp/ files/ Blast leave oneout. zip/ downl oad). It is also worth mentioning that the "WSVM & RF + SA" method performs better when there is P M ≥ β ( β = 0.82 here), which indicates that such method is reliable in detecting NCs in unannotated gags. A selfconsistency test was also performed with the "WSVM & RF + SA" method, 90.91% of the NCs could be predicted totally correct, the others are predicted with only slight deviations (please refer to S4 File for more details).
Putative NCs. All of the 6041 available unannotated gag sequences were scanned with the "WSVM & RF + SA" model and 235 putative NCs in them were found (please refer to S5 File for more details, the putative NCs are marked in red).

Discussion
Conservative property of NC boundaries. Motifs of sequences adjacent to origins and terminals of NCs in ERVs were generated based on WebLogo version 2.8.2 (http:// weblo go. berke ley. edu/ logo. cgi) and shown in Fig. 1. From Fig. 1, we can find that sequences adjacent to NC boundaries are quite conservative. This explains why satisfying predicting results could be generated from models built on starts and ends of NC. www.nature.com/scientificreports/ Feature importances. The random forest classifier with its associated gini feature importance, allows for an explicit feature elimination 21 . Thus random forest classifier is utilized to calculate the feature importances of the FASGAI amino acid information. The feature importances of the 6 factors of FASGAI is shown in Fig. 2.
Deep learning algorithm. Along with the rapid development of deep learning these years, it is natural to try to use deep learning algorithms such as convolutional neural network (CNN) to solve the prediction of NCs. A CNN model was also built to solve the problem. The optimized model structure is shown in Fig. 3. The model contains only 6 convolutional layers, thus could be considered as a relatively simple CNN. However, the performance of the CNN model is not comparable with the"WSVM & RF + SA" approach, even though its training is much more time consuming. The detailed results were shown in Table 3, from which we could find that Sn rises along with the increase of fold number, while still not comparable with the Sn generated by WSVM or RF (shown in Table 1). The reason of this phenomenon is that deep learning algorithms contains more parameters to be iterated during the training process, but in this case, the sample size is not enough for the sufficient training of the parameters, so the"WSVM & RF + SA" approach suits better.

Optimization of model parameters.
Model parameters should be optimized until the model could bring out the best predicting performance. As with the WSVM & RF models, we adopted grid search to traverse the parameter space. The parameters that could bring out the highest value of G-mean were considered as the optimized combination of parameters. Since rerunning the model with one set of parameter combination for several times would compensate random factors with each other, another loop was added to the program to rule out arbitrary and capricious behaviours. As with the size of the sliding window in the putative NC detection process, the predicting performance of the "WSVM & RF + SA" method with different window lengths is tested and briefly shown in Table 4 (please refer to S6 File for more details). 16 was found to be an optimized value.  www.nature.com/scientificreports/ Evolutionary relationship analyses. Evolutionary analyses were conducted in MEGA7 22 . The evolutionary history was inferred using the Neighbor-Joining method 23 . The evolutionary distances were computed using the Poisson correction method 24 and are in the units of the number of amino acid substitutions per site. A comparison result between homology of NCs within genera and that of inter-genera is given in Fig. 4. It is obvious that NCs in the same genus are more homologous than that from different genera. This is identical with    www.nature.com/scientificreports/ expectation and implies that genus-specified NC prediction methods could be brought up in the future to further enhance predicting performance when more annotated NCs are accumulated.
Future outlook. The co-evolving information in the protein sequences is also verified to be useful for capturing the characteristics of proteins sequences [25][26][27] . Although these attempts were generally made in the area of protein-protein interactions (PPIs) instead of the prediction of functional elements, their basic idea to utilize coevolving information do provide some enlightenment in the process of feature engineering, which might benefit us in our future research. Moreover, when more annotated NC sequences accumulate, the performance of deep learning algorithms could be improved since there would be enough information for the parameter iteration.

Data availability
All data generated or analysed during this study are included in this published article (and its Supplementary  Information files).