iRSpot-DACC: a computational predictor for recombination hot/cold spots identification based on dinucleotide-based auto-cross covariance

Meiotic recombination is unevenly distributed across the genome. Genomic regions that exhibit relatively high frequencies of recombination are called hotspots, whereas those with relatively low frequencies of recombination are called coldspots. Hotspots and coldspots therefore provide useful information for studying the mechanism of recombination. In this study, we propose a computational predictor called iRSpot-DACC to predict hot/cold spots across the yeast genome. It combines Support Vector Machines (SVMs) with a feature called dinucleotide-based auto-cross covariance (DACC), which is able to incorporate the global sequence-order information and fifteen local DNA properties into the predictor. Combined with Principal Component Analysis (PCA), its performance is further improved. Experimental results on a benchmark dataset show that iRSpot-DACC achieves an accuracy of 82.7%, outperforming some highly related methods.

DNA sequences. Wu et al. 15 proposed an SVM model based on genomic and epigenomic features to predict meiotic recombination hotspots in human and mouse. Chen et al. 16 presented an SVM model based on pseudo dinucleotide composition. Wang et al. 17 proposed a method based on gapped k-mers. Most of these predictors considered only the local sequence-order information, while little global sequence-order information was taken into account. However, many studies have shown that the global sequence-order information has strong discriminative power in bioinformatics tasks. Therefore, the global sequence-order factor should be incorporated into a predictor. Unfortunately, this is not an easy job, because DNA sequences differ in length.
To address this problem, a feature called dinucleotide-based auto-cross covariance (DACC) 18 is applied to recombination hot/cold spot identification; it is able to incorporate the global sequence-order effects of the DNA sequences into the predictor. Combined with Support Vector Machines (SVMs), a predictor called iRSpot-DACC is proposed. In order to further improve its predictive performance and reduce its computational cost, Principal Component Analysis (PCA) 19 is then adopted. Experimental results on a benchmark dataset demonstrate that the proposed method outperforms some highly related models, including IDQD 1 and iRSpot-PseDNC 16 .

Results
Influence of parameters on the predictive performance of iRSpot-DACC. In iRSpot-DACC, the parameter lag, which is the distance between two dinucleotides, affects the predictive performance. In the current study, lag is optimized via 5-fold cross validation. The influence of lag on the performance of iRSpot-DACC is shown in Fig. 1, from which we can see that the best performance is achieved when lag = 6, although this parameter has only a small impact on the performance. DACC is the combination of Dinucleotide-based auto covariance (DAC) and Dinucleotide-based cross covariance (DCC) (cf. section Material and Methods). With this parameter setting, the lengths of the feature vectors for DAC and DCC are 15 × 6 = 90 and 15 × 14 × 6 = 1260, respectively. Therefore, the dimension of DACC is 90 + 1260 = 1350.
The computational performance of iRSpot-DACC can be further improved by using PCA. In order to further improve the predictive performance of iRSpot-DACC and reduce its computational cost, Principal Component Analysis (PCA) 19 is employed.
There is a parameter w (cf. Eq. (18)) in PCA, which has an impact on both the predictive accuracy and the dimension of the feature vectors. Therefore, we optimize this parameter using 5-fold cross validation. The results show that iRSpot-DACC-PCA (iRSpot-DACC combined with PCA) achieves the best performance when w = 0.99. Its performance is shown in Table 1, from which we can see that iRSpot-DACC-PCA outperforms iRSpot-DACC. The dimension of the feature vectors of iRSpot-DACC-PCA is 173, which is significantly smaller than the original dimension of iRSpot-DACC (1350). Therefore, both the predictive accuracy and the computational cost of iRSpot-DACC are improved by using PCA.
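How the threshold w determines the reduced dimension can be illustrated with a short sketch. It assumes that w is the cumulative ratio of the largest eigenvalues of the covariance matrix to the sum of all eigenvalues; the function name `n_components_for_ratio` and the toy eigenvalues are hypothetical and are not taken from the study.

```python
def n_components_for_ratio(eigenvalues, w):
    """Return the smallest l whose top-l eigenvalues reach the cumulative ratio w.

    Assumption: w = (sum of the l largest eigenvalues) / (sum of all eigenvalues).
    """
    if not 0.0 < w <= 1.0:
        raise ValueError("w must lie in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    total = sum(ordered)
    if total <= 0.0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for l, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= w:
            return l
    return len(ordered)


# Toy eigenvalues (made up for illustration only).
toy_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
selected = n_components_for_ratio(toy_eigenvalues, 0.85)
```

Under this reading, a larger w keeps more components, so w trades the size of the feature vector against the amount of variance retained.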

Discriminative visualization and interpretation.
In order to further explore the discriminative power of the features and their biological meaning, we calculate the discriminative weight vector according to the study 20 . The feature discriminative weight vector W can be formulated as:
$$W = A\,M$$
where A is the 1 × N vector of weights of the training samples obtained from the SVM training process; M is the N × j feature matrix of the benchmark dataset used in the current study; N is the number of DNA sequences in the training dataset; and j is the dimension of the feature vector. Therefore, W is a 1 × j vector, and each element in it represents the discriminative power of the corresponding feature. The feature discriminative weight vector with 1350 features (cf. section Results) is depicted in Fig. 2, in which darker spots represent stronger discriminative power than lighter spots. From Fig. 2 we can see that the top three discriminative features are DAC(2, 3), DCC(2, 8, 3) and DCC(2, 15, 1). All three features are derived from the same property, F-tilt (μ = 2), which suggests the importance of this property. The top ten discriminative features are listed in Table 2. From this table, we can draw several conclusions. First, the correlation between the property F-roll (μ = 1) and several other properties shows strong discriminative power for identifying recombination hot/cold spots. Second, the correlation between F-tilt (μ = 2) and other properties, including itself, also shows strong discriminative power. Third, the features computed at a distance between two dinucleotides of 1, 2, 3 or 5 are important for identifying hot/cold spots.
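The weight computation described above can be sketched as follows, assuming that W is the product of the 1 × N sample-weight vector A and the N × j feature matrix M. The helper names and the toy numbers are hypothetical; in practice A would come from the trained SVM.

```python
def discriminative_weights(A, M):
    """Compute W = A * M for a 1 x N weight vector A and an N x j matrix M."""
    if len(A) != len(M):
        raise ValueError("A needs one weight per row (training sample) of M")
    n_features = len(M[0])
    return [sum(A[n] * M[n][d] for n in range(len(A))) for d in range(n_features)]


def top_features(W, k):
    """Indices of the k features with the largest absolute weight."""
    return sorted(range(len(W)), key=lambda d: abs(W[d]), reverse=True)[:k]


# Toy example: three training samples, two features (values are made up).
A = [0.5, -1.0, 2.0]
M = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W = discriminative_weights(A, M)
```

Ranking by absolute value is one possible reading of "discriminative power"; the source does not state how the ranking in Table 2 was derived from W.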
Comparison with other related predictors. The proposed methods iRSpot-DACC and iRSpot-DACC-PCA are compared with two existing methods for hot/cold spot identification, IDQD 1 and iRSpot-PseDNC 16 . The results of the various methods on the benchmark dataset S are shown in Table 1.
According to Table 1, iRSpot-DACC outperforms the two methods IDQD 1 and iRSpot-PseDNC 16 . Furthermore, iRSpot-DACC-PCA outperforms iRSpot-DACC by adopting Principal Component Analysis (PCA). The main reasons are as follows: IDQD 1 only considers the local sequence-order information, and iRSpot-PseDNC 16 improves on it by incorporating global sequence-order information. However, iRSpot-DACC not only incorporates the global sequence-order information but also incorporates more DNA properties into the feature vectors. Therefore, we conclude that iRSpot-DACC would be a useful tool for hot/cold spot identification.

Discussion
In this study, we propose a computational method called iRSpot-DACC for yeast hot/cold spot identification. The method incorporates long-range or global sequence-order information. The results show that iRSpot-DACC outperforms other state-of-the-art predictors. Furthermore, iRSpot-DACC incorporates the correlations between different dinucleotide DNA properties. Another important advantage of our approach derives from PCA (principal component analysis) 21 , which not only improves the predictive accuracy but also reduces the computational cost. It can be expected that DACC would be a powerful feature extraction method that can be applied to other tasks in the field of bioinformatics, such as DNA-binding protein identification 22 , protein fold prediction 23,24 , cytokine detection 25,26 , protein-protein interaction site prediction 27 , tumor classification and analysis 28 , etc. Moreover, since a publicly accessible web server is beneficial for developing more useful predictors, we will make efforts in our future work to develop a web server for the method proposed in this paper. Furthermore, we will apply other advanced machine learning techniques, such as deep learning and neural networks [29][30][31][32] , to establish more accurate predictors for hot spot identification.

Material and Methods
Benchmark Dataset. The benchmark dataset used in this study was constructed by Jiang et al. 13 , and contains 490 hotspots and 591 coldspots. For more detailed information about this benchmark dataset, please refer to 13 . The benchmark dataset for the current study can be expressed as:
$$S = S^{+} \cup S^{-}$$
where S + is the set of recombination hotspots, S − is the set of recombination coldspots, and ∪ is a mathematical operator representing "union".
Dinucleotide-based auto-cross covariance (DACC). As described above, the global sequence-order information shows strong discriminative power for identifying recombination hot/cold spots. Therefore, it is crucial to incorporate the global sequence-order information into our model. In order to deal with this problem, a feature called Dinucleotide-based auto-cross covariance (DACC) 18 is adopted, which incorporates global sequence-order information along DNA sequences. DACC is the combination of Dinucleotide-based auto covariance (DAC) and Dinucleotide-based cross covariance (DCC). Next, we will introduce DAC and DCC in turn. Consider a DNA sequence D:
$$D = R_1 R_2 R_3 \cdots R_L$$
where L is the length of the DNA sequence, R 1 is the nucleic acid residue at the first position in the sequence, R 2 is the nucleic acid residue at the second position, and so forth.
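The DAC 18,33,34 represents the correlation of one DNA local property between two dinucleotides at a distance of lag in the sequence. DAC can be calculated by:
$$\mathrm{DAC}(\mu, lag) = \frac{1}{L - lag - 1} \sum_{i=1}^{L-lag-1} \left( P_{\mu}(R_i R_{i+1}) - \bar{P}_{\mu} \right) \left( P_{\mu}(R_{i+lag} R_{i+lag+1}) - \bar{P}_{\mu} \right)$$
where μ is the index of the dinucleotide local property; L represents the DNA sequence length; $P_{\mu}(R_i R_{i+1})$ is the value of the dinucleotide $R_i R_{i+1}$ at position i for the local property index μ; and $\bar{P}_{\mu}$ is the average value of $P_{\mu}(R_i R_{i+1})$ over the DNA sequence, which can be calculated as:
$$\bar{P}_{\mu} = \frac{1}{L - 1} \sum_{j=1}^{L-1} P_{\mu}(R_j R_{j+1}) \qquad (6)$$
The DCC represents the correlation of two different DNA local properties between two dinucleotides at a distance of lag in the sequence. DCC can be calculated by:
$$\mathrm{DCC}(\mu_1, \mu_2, lag) = \frac{1}{L - lag - 1} \sum_{i=1}^{L-lag-1} \left( P_{\mu_1}(R_i R_{i+1}) - \bar{P}_{\mu_1} \right) \left( P_{\mu_2}(R_{i+lag} R_{i+lag+1}) - \bar{P}_{\mu_2} \right)$$
where $\mu_1$ and $\mu_2$ are two different property indices; L represents the DNA sequence length; $P_{\mu_1}(R_i R_{i+1})$ ($P_{\mu_2}(R_i R_{i+1})$) is the numerical value of the dinucleotide $R_i R_{i+1}$ at position i for the property index $\mu_1$ ($\mu_2$); and $\bar{P}_{\mu_1}$ ($\bar{P}_{\mu_2}$) is the average value for the property index $\mu_1$ ($\mu_2$) along the whole sequence, which has the same form as Eq. (6). In this way, the length of the DCC feature vector is N × (N − 1) × LAG, where N is the number of dinucleotide properties used in this study and LAG is the maximum value of lag (lag = 1, 2, …, LAG). The processes for generating the feature vectors of DAC and DCC are presented in Fig. 3(a,b), respectively.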
In this study, fifteen properties from 36 are used. Their values are listed in Table 3.
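The construction of a DACC feature vector can be illustrated with a minimal sketch. It assumes the usual auto- and cross-covariance forms, normalised by L − lag − 1, and an input sequence made of the upper-case letters A, C, G and T. The function name `dacc` and the two property tables are hypothetical: the tables are made-up stand-ins and are not the fifteen properties of Table 3.

```python
from itertools import product

BASES = "ACGT"

# Two made-up dinucleotide "properties" for illustration only.
GC_CONTENT = {a + b: float((a in "GC") + (b in "GC")) for a, b in product(BASES, repeat=2)}
PURINE_COUNT = {a + b: float((a in "AG") + (b in "AG")) for a, b in product(BASES, repeat=2)}


def dacc(seq, props, max_lag):
    """Return the DAC features followed by the DCC features of one sequence.

    props is a list of dicts, one per property, mapping a dinucleotide to a value.
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]  # L - 1 dinucleotides
    n_pairs = len(pairs)
    if not 1 <= max_lag < n_pairs:
        raise ValueError("max_lag must satisfy 1 <= max_lag < L - 1")

    # Centre every property profile on its own mean over the sequence.
    centered = []
    for table in props:
        values = [table[p] for p in pairs]
        mean = sum(values) / n_pairs
        centered.append([v - mean for v in values])

    def covariance(a, b, lag):
        terms = n_pairs - lag  # equals L - lag - 1
        return sum(a[i] * b[i + lag] for i in range(terms)) / terms

    lags = range(1, max_lag + 1)
    dac = [covariance(c, c, lag) for c in centered for lag in lags]
    dcc = [covariance(centered[u], centered[v], lag)
           for u in range(len(props)) for v in range(len(props)) if u != v
           for lag in lags]
    return dac + dcc


features = dacc("ACGTTGCAAGCTAGGCTA", [GC_CONTENT, PURINE_COUNT], 3)
```

With N properties and a maximum lag of LAG, the vector has N × LAG + N × (N − 1) × LAG entries regardless of the sequence length, which is what allows sequences of different lengths to be compared.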

Support vector machine (SVM). Support Vector Machine (SVM) is a pattern recognition technique introduced by Vapnik 37 , which has been employed for many computational tasks in bioinformatics [38][39][40][41] . It seeks an optimal hyperplane by transforming the original feature space into a high-dimensional vector space to achieve classification.
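The implicit mapping into a high-dimensional space is usually realised through a kernel function. As a minimal sketch, assuming the widely used radial basis function (RBF) form K(x, y) = exp(−γ‖x − y‖²), the kernel value between two feature vectors can be computed as follows. The function name `rbf_kernel` and the example vectors are illustrative only.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give 1.0; the value decays as the vectors move apart.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

A larger γ makes the kernel decay faster with distance, which is why γ is normally tuned together with the regularization parameter C.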
In the current study, the ANACONDA package (http://www.continuum.io/) is adopted, which contains an implementation of SVM. The selected kernel function is the radial basis function (RBF), which is defined as:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$
Two parameters, the regularization parameter C and the kernel width parameter γ, are optimized on the dataset by using a grid tool provided by ANACONDA. In the current study, the values of the two parameters are shown below:

Principal Component Analysis (PCA). Feature selection is able to remove noise and thus improve the classification performance 42 . In order to reduce redundant information, in this study we adopt Principal Component Analysis (PCA) 19 to reduce the dimension of the original feature vectors. It reduces the dimension of the feature vectors by projecting the feature space onto a smaller subspace that still represents the dataset well. Suppose the original feature space of iRSpot-DACC is represented as:
$$X = \begin{bmatrix} e_{11} & e_{12} & \cdots & e_{1k} \\ e_{21} & e_{22} & \cdots & e_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ e_{N1} & e_{N2} & \cdots & e_{Nk} \end{bmatrix} \qquad (11)$$
where N is the number of training samples and k is the dimension of the feature vectors. Then, the average of every dimension of X can be expressed as:
$$\bar{e}_j = \frac{1}{N} \sum_{i=1}^{N} e_{ij}, \quad j = 1, 2, \ldots, k \qquad (12)$$
where N and k have the same meaning as in Eq. (11). Therefore, the matrix composed of the mean vectors for every dimension in X can be represented as:
$$\bar{X} = \begin{bmatrix} \bar{e}_1 & \bar{e}_2 & \cdots & \bar{e}_k \\ \vdots & \vdots & \ddots & \vdots \\ \bar{e}_1 & \bar{e}_2 & \cdots & \bar{e}_k \end{bmatrix} \qquad (13)$$
where $\bar{e}_j$, the mean of the elements $e_{ij}$ in column j of X, can be acquired from Eq. (12). Then, the covariance matrix Cov(X) and its eigenvalues can be calculated, and the eigenvalues, arranged in descending order, can be represented as:
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \qquad (14)$$
Next, the l eigenvectors with the largest eigenvalues are chosen to form a matrix, which can be represented as:
$$W = \left[ v_1, v_2, \ldots, v_l \right] \qquad (15)$$
where each column $v_i$ represents an eigenvector, and the corresponding eigenvalues can be represented as:
$$\lambda_1, \lambda_2, \ldots, \lambda_l \qquad (16)$$
Finally, the new subspace M can be calculated by
$$M = XW \qquad (17)$$
Therefore, the dimension of the feature space is reduced from k to l. The values of k and l have been discussed in section Results.
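The PCA steps described above can be illustrated with a small dependency-free sketch. It makes several assumptions that the text does not fix: the covariance matrix uses the 1/(N − 1) scaling, the eigenvectors are found by power iteration with deflation (a stand-in for a library eigen-solver), and the projection is applied to X exactly as written in Eq. (17). All function names and the toy data are hypothetical.

```python
import math


def covariance(X):
    """Sample covariance matrix (k x k) of the columns of X, scaled by 1/(N - 1)."""
    n, k = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in X]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(k)]
            for a in range(k)]


def top_eigenpairs(C, l, iters=1000):
    """Return the l leading (eigenvalue, eigenvector) pairs of a symmetric matrix C."""
    k = len(C)
    C = [row[:] for row in C]  # work on a copy because deflation modifies it
    pairs = []
    for _ in range(l):
        v = [1.0 + 0.1 * i for i in range(k)]  # arbitrary non-zero start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(k)) for a in range(k)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # remaining matrix is numerically zero
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * C[a][b] * v[b] for a in range(k) for b in range(k))
        pairs.append((lam, v))
        for a in range(k):  # deflate: remove the component that was just found
            for b in range(k):
                C[a][b] -= lam * v[a] * v[b]
    return pairs


def project(X, eigenvectors):
    """M = X W, where the columns of W are the chosen eigenvectors."""
    return [[sum(x * e for x, e in zip(row, vec)) for vec in eigenvectors] for row in X]


# Toy data with most of its variance along the first axis (values are made up).
X = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
pairs = top_eigenpairs(covariance(X), 2)
M = project(X, [pairs[0][1]])  # keep only the leading component
```

For feature matrices of realistic size, a numerical library would be the practical choice; the sketch only makes the order of the operations explicit.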
The selection of principal components is based on the cumulative weight ratio w:
$$w = \frac{\sum_{i=1}^{l} \lambda_i}{\sum_{i=1}^{k} \lambda_i} \qquad (18)$$
where $\lambda_i$ is the i-th largest eigenvalue of the covariance matrix, l is the number of selected principal components and k is the original dimension of the feature vectors. The values of w and l have been discussed in section Results.

Jackknife test. In statistical prediction, three cross-validation methods, namely the independent dataset test, the sub-sampling (or K-fold cross-validation) test and the jackknife test, are often used to measure the performance of a predictor [43][44][45] . Among the three methods, the jackknife test is deemed the most objective, and it has therefore been widely adopted by researchers to evaluate the performance of various classifiers. Accordingly, in the current study the jackknife test is adopted to measure the performance of iRSpot-DACC and iRSpot-DACC-PCA. In the jackknife test, each sequence in the benchmark dataset is selected in turn as the test sample, and the remaining sequences are used as training samples. In Eq. (19), TP, FP, TN and FN represent the numbers of true positives, false positives, true negatives and false negatives, respectively.
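The jackknife procedure and the confusion-matrix counts can be sketched as follows. Two things here are assumptions rather than details taken from the study: the nearest-centroid classifier is only a stand-in for the SVM, and the four metrics computed (sensitivity, specificity, accuracy and Matthews correlation coefficient) are the ones commonly derived from TP, FP, TN and FN. All names and data are hypothetical.

```python
import math


def jackknife_counts(samples, labels, fit):
    """Leave each sample out once, train on the rest, and count TP, FP, TN, FN."""
    tp = fp = tn = fn = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = fit(train_x, train_y)(samples[i])
        if labels[i] == 1:
            tp, fn = (tp + 1, fn) if predicted == 1 else (tp, fn + 1)
        else:
            fp, tn = (fp + 1, tn) if predicted == 1 else (fp, tn + 1)
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from the four counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + fp + tn + fn),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def fit_nearest_centroid(xs, ys):
    """Stand-in classifier (not the SVM used in the study)."""
    def centroid(cls):
        points = [x for x, y in zip(xs, ys) if y == cls]
        return [sum(c) / len(points) for c in zip(*points)]
    c_pos, c_neg = centroid(1), centroid(0)

    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        return 1 if d_pos <= d_neg else 0
    return predict


# Toy one-dimensional data: label 1 plays the role of a hotspot, 0 of a coldspot.
samples = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
labels = [0, 0, 0, 1, 1, 1]
counts = jackknife_counts(samples, labels, fit_nearest_centroid)
scores = metrics(*counts)
```

Because every sample is held out exactly once, the jackknife test yields a single deterministic result for a given dataset, unlike a randomly partitioned K-fold test.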