i6mA-CNN: a convolution based computational approach towards identification of DNA N6-methyladenine sites in rice genome

DNA N6-methylation (6mA) in Adenine nucleotide is a post replication modification and is responsible for many biological functions. Experimental methods for genome wide 6mA site detection is an expensive and manual labour intensive process. Automated and accurate computational methods can help to identify 6mA sites in long genomes saving significant time and money. Our study develops a convolutional neural network based tool i6mA-CNN capable of identifying 6mA sites in the rice genome. Our model coordinates among multiple types of features such as PseAAC inspired customized feature vector, multiple one hot representations and dinucleotide physicochemical properties. It achieves area under the receiver operating characteristic curve of 0.98 with an overall accuracy of 0.94 using 5 fold cross validation on benchmark dataset. Finally, we evaluate our model on two other plant genome 6mA site identification datasets besides rice. Results suggest that our proposed tool is able to generalize its ability of 6mA site identification on plant genomes irrespective of plant species. Web tool for this research can be found at: https://cutt.ly/Co6KuWG. Supplementary data (benchmark dataset, independent test dataset, comparison purpose dataset, trained model, physicochemical property values, attention mechanism details for motif finding) are available at https://cutt.ly/PpDdeDH.


Introduction
N4-methylcytosine, N6-methyladenine and 5-methylcytosine exist in DNA of different species (Feng et al. (2019a)).DNA N 6 -methylation is a part of epigenetics where methyl groups are transferred to DNA molecules (von Meyenn et al. (2016)).It is one kind of post replication modification.Such modification has been identified in three kingdoms of life such as Bacteria, Archaea, and Eukaryotes (O'Brown and Greer ( 2016)).N 6methylation occurs when methyl group is transferred to the 6 th position of purine ring of Adenine nucleotide.Such modification is responsible for discrimination between original DNA strand and newly synthesized DNA strand, gene transcription regulation, replication and repair (Wion and Casadesús (2006)).
Laboratory based experimental methods such as sequencing (SMRT-seq, MeDIP-seq) (Flusberg et al. (2010)), methylated DNA Immunoprecipitation (Pomraning et al. (2009)), capillary electrophoresis and laser-induced fluorescence (Krais et al. (2010)) have been proposed to identify N 6 -methyladenine (6mA) sites in the genome.6mA profile of rice has been obtained recently using mass spectrometry analysis and 6mA Immunoprecipitation followed by sequencing (IP-seq) (Zhou et al. (2018a)).Genome wide identification of 6mA sites is labor intensive and costly using such experimental methods.Computational experiment based pLogo plot (O'shea et al. (2013)) on benchmark dataset regarding rice 6mA site identification shows nucleotide distribution to be different near 6mA and non-6mA sites (Hao et al. (2019)).So, accurate computational methods can be effective for automatic 6mA site identification in rice genome.Feng et al. (2019b) established iDNA6mA-PseKNC tool for N 6methyladenine site identification on Mus musculus genome dataset containing 1934 samples in each class.They used custom feature vector for each sample inspired by PseAAC (Shen and Chou (2008)).They used LibSVM package 3.18 for their classification task.This same package was used in i6mA-Pred tool for the same task in rice genome dataset containing 880 samples in each class using similar feature vector by Chen et al. (2019).They used Maximum Relevance Maximum Distance (MRMD) method (Chen et al. (2018)) along with Incremental Feature Selection (IFS) for limiting feature space.This same dataset was used to train and build iDNA6mA tool which utilized one hot matrix as sample feature by Tahir et al. (2019).They used sequential one dimensional convolutional neural network architecture for classification.Kong and Zhang (2019) used the same benchmark rice genome dataset to build i6mA-DNCP tool.They used di-nucleotide composition frequency and dinucleotide based DNA properties as sample features.TreeBagging algorithm was used as their classifier.They implemented heuristic based DNA property selection for selecting four dinucleotide based DNA properties for their classification.Hao et al. (2019) developed iDNA6mA-Rice tool using a rice genome dataset containing 1,54,000 samples in each of the two classes.They used one hot vector based linear feature space with Random Forest algorithm.
The main limitation of these methods is that they require single feature vector for training and validation purpose irrespective of the heterogeneity of the features they are using.Moreover, these proposed algorithms are unable to use two dimensional matrix oriented features which limits local and sequential pattern detection capability.
We propose i6mA-CNN, a one dimensional CNN based branched classifier which can identify 6mA sites in plant genome.We work with four different types of feature matrices used for DNA sequence representation -PseAAC inspired customized feature matrix, monomer one hot matrix, dimer one hot matrix and dimer physicochemical properties.We use correlation based dimer physicochemical property selection technique in order to remove unnecessary and/ or redundant property features.Our proposed model is able to coordinate between these heterogeneous feature matrices and learn distinguishing features between 6mA and non-6mA sites successfully.Accurate experimental results on a large benchmark dataset and success on independent test datasets show the effectiveness of our proposed tool.

Benchmark Dataset Construction
Gene Expression Omnibus (GEO) database maintained by National Center for Biotechnology Information (NCBI) (Long et al. (2018)) has a total of 2,65,290 6mA site containing sequences each of 41 base pair (bp) length with 6mA site at the center.CD-HIT program (Li and Godzik (2006)) with 80% (following similar research conducted by Hao et al. (2019)) similarity threshold has been implemented on these sequences in order to avoid redundancy bias which resulted in 1,54,000 mA site containing sequences.This is our positive set of samples.Same number of negative samples (no 6mA site) have been obtained from NCBI to formulate a class balanced dataset following some specific criteria.Each 41 bp long negative sequence has Adenine at the center, where the center is not methylated (proven by experiment).6mA sites have a tendency of occurring near GAGG rich sequences (Zhou et al. (2018b)).In order to avoid such bias while classification, negative sequences rich with GAGG motifs were considered only to make negative data more objective.

Sequence Representation
A sample sequence of our dataset looks as follows: where, L is the length of the DNA sequence, which is 41 in our case and N i ∈ {A, T, C, G}.With appropriate mathematical representation of these sequences, deep learning models can learn class distinguishing features from the sequence local and global patterns (Umarov and Solovyev (2017)).It is clear from Figure 1 that our model is capable of coordinating between multiple types of feature vectors and matrices using branch like structure which has been explained in detail in Subsection 2.3.We use the following sequence representation features in our model:

PseAAC Inspired Feature Representation
PseAAC (Pseudo Amino Acid Composition) has shown success in dealing with protein sequences (Zhong and Zhou (2014); Zhou and Zhong (2016)).Inspired by such success, PseKNC (Pseudo K-tuple Nucleotide Composition) has been introduced for dealing with DNA/RNA sequences in computational biology (Chen et al. (2014(Chen et al. ( , 2015))).The general form of PseKNC has been designed in such a way that their values will depend on user specified feature extraction techniques from DNA samples.
The four types of nucleotides of DNA can be classified into three categories of attributes.These attributes along with representation code have been provided in Table 1.The categories are as follows: Ring number: A and G have two rings, whereas C and T have only one ring.Chemical functionality: A and C are from amino group, whereas G and T are from keto group.Hydrogen bonding: C and G are bonded to each other with 3 hydrogen bonds (strong), whereas A and T with only two (weak).
The above mentioned properties have significant impact on biological functions as well (Chou (1988)).The i th nucleotide N i of a sequence can be represented by (x i , y i , z i ), where x i , y i and z i represent ring structure, functional group and hydrogen bonding code (see Table 1) of nucleotide N i , respectively.Besides these three properties, we consider lingering density of each nucleotide to preserve sequence oriented features.Lingering density of nucleotide N i is defined below: Here, L i is the length of the substring up to nucleotide N i and So, for each nucleotide, we have four properties to calculate.Since each sequence in our dataset is 41 bp long, we have a 41 × 4 dimensional feature matrix for each sequence.We illustrate this four property representation scheme on a toy example sequence "ACGGA" in Table 2.The resultant matrix representation is: , 1, 0, 1], [0, 1, 1, 0.5], [1, 0, 1, 0.33], [1, 0, 1, 0.5], [1, 1, 0, 0.4]] 2.2.2 Monomer One Hot Representation Tahir et al. (2019) developed tool iDNA6mA based on monomer one hot encoding.Their one dimensional convolutional neural network (1D CNN) was quite successful in class discrimination based on one hot encoding based mathematical representation.Each branch of our designed model is also 1D CNN oriented.So, we have used this particular feature.There are four types of nucleotides in DNA sequence and each of them is represented by a four size one hot vector.A, T, C and G are represented by (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1), respectively.So, a 41 bp long sequence is represented by a 41 × 4 dimensional matrix.

(c) Dimer one hot representation
There is statistically significant difference (considering p-value of 0.05) between 6mA and non-6mA site containing samples in terms of dinucleotide composition frequencies (Kong and Zhang (2019) This is certainly a large matrix.If the dinucleotide physicochemical property set contains features which include redundant or less important information in terms of our classification task, it will lead to overfitting problem (Hawkins (2004)).The steps of dinucleotide physicochemical property (DPP) selection are as follows: • We use Logistic regression based model for 6mA site identification task using one DPP at a time.• We note the 5 fold cross validation Mathew's correlation coefficient (MCC) score for each DPP and select the the top 20 DPPs according to obtained MCC score.The higher the MCC, the better is the DPP for our identification task.It is to note that MCC score is better than F1 score and accuracy for binary classification evaluation (Chicco and Jurman (2020)).• We select the topmost DPP.Now we want to remove redundant information.Each DPP is represented by a 16 size vector as there are 16 possible dimers.DPPs which have a high positive or negative correlation coefficient represent the same type of information.We A 1 1 0 2/5 = 0.4 (positive correlation) or less than -0.9 (negative correlation) with the topmost DPP.
Thus we select nine DPPs out of 90 which include Roll rise, major Groove distance, twist stiffness, protein induced deformability, mobility to bend towards major groove, melting temperature, Hartman trans free energy, protein DNA twist and tip.

Model Architecture
Our .Each branch of the architecture learns class distinguishing features regarding its corresponding input feature matrix independent of the other three branches.The monomer one hot representation of the leftmost branch represents raw representation of sample sequence.Model parameters of this branch learn from this raw representation free from any feature specific knowledge bias.This branch has the potential of learning those class specific hidden features which may not be apparent to humans.The second branch which has dimer one hot representation as input helps the model learn statistically significant differences between 6mA and non-6mA site containing samples in terms of dinucleotide pattern and frequency of occurrence.Such difference has been pointed out by Kong and Zhang (2019).The third branch uses dimer physicochemical property based representation.These properties played an important role and DNase I hypersensitive sites identification (Zhang et al. (2019)).This third branch facilitates learning the interaction between these property values in a sequence order basis.The last branch is based on learning from ring structure, functional group, hydrogen bonding and lingering density based nucleotide features.These features have impact on DNA biological functions (Chou (1988); Chou and Mao (1988)).This last branch parameters learn to identify patterns from these sequence coupled features.
We need to have a way to combine these branches for a mutual classification decision depending on the four heterogeneous feature matrices.The convolutional part of each branch i returns a matrix of dimension m i × n i .We flatten and convert this matrix into a linear vector.We pass this vector through a dense node having Sigmoid activation function, which outputs a value in the range of 0 to 1.This single value obtained from the i th branch is its representative for 6mA identification task.We have total four values in a range of 0 to 1 from the four branches, which we concatenate and turn into a four size vector.This four size vector is the representative vector of all four branches.The final dense layer with Sigmoid activation function provides the output value depending on the four values of the concatenated vector.If the output value is greater than 0.5, then we consider the sample sequence to have 6mA site.Else, we decide that there is no 6 mA site in the input sequence.

Performance Evaluation and Model Selection
Accuracy (Acc), sensitivity (Sn), specificity (Sp) and Mathew's correlation coefficient (MCC) have been used as evaluation metrics which are described as follows: Here, TP, FP, TN and FN represent true positive, false positive, true negative and false negative, respectively.We consider the 6mA site containing samples as positive class samples.Apart from the above four metrics, we have also used area under receiver operating characteristic (auROC) curve for performance evaluation.This metric provides classifier performance irrespective of provided threshold for class discrimination to the classifier.This score has an impact on classification confidence value for each sample as well.Acc, Sn, Sp and auROC lie in the range [0, 1] while for MCC score, the range is [-1, +1].Higher value indicates better classification ability.
We have tuned and selected our model hyperparameters using 5 fold cross validation on our benchmark dataset.We took a local search oriented approach where we went in uphill direction in terms of obtained MCC value.The tuned model hyperparameters are as follows: • Convolution layer: Change in number of convolution layers in each branch, convolution filter number and kernel size in each convolution layer • Learning rate: Experimentation with different learning rates such as 0.01, 0.001, 0.0001, 0.00001 and adaptive learning rate scheme • Dropout rate: Testing out dropout rate of 0.1, 0.3 and 0.5.

Results and Discussion
i6mA-Pred (Chen et al. (2019)), iDNA6mA-Rice (Hao et al. (2019)), iDNA6mA (Tahir et al. (2019)) and i6mA-DNCP (Kong and Zhang (2019)) are some of the state-of-art tools which can identify 6mA sites in rice genome.For comparison purpose of our proposed tool with these stateof-the-art tools, consistency in dataset usage and evaluation method are necessary.
The benchmark dataset that we have used for this research has also been used for constructing the latest rice genome i6mA site identification tool iDNA6mA-Rice (Hao et al. (2019)).Tools for the same task have been developed by Chen et al. (2019); Tahir et al. (2019); Kong and Zhang (2019), though they used a different dataset for benchmarking which contains 880 samples in each of the two classes.They used either Jackknife testing or 10 fold cross validation for algorithm and model selection.For comparison purpose, we also test our custom model on the same 880 sample per class dataset using similar validation methods.This comparison purpose dataset has been obtained from NCBI GEO under accession number GSE103145.All sequences with 6mA site in the center and with modification score greater than 30 (according to Methylome Analysis Technical Note, 30 is the minimum score to call a nucleotide as modified) have been considered as positive samples.Each sample sequence is 41 bp long.CD-HIT (Li and Godzik (2006)) with 60% similarity threshold has been applied on the chosen positive sequences in order to remove redundancy.Thus the 880 positive samples were obtained.The 880 negative samples of this dataset have been obtained in a manner similar to our negative benchmark dataset samples.Detailed performance comparison between the tools used for 6mA site identification has been provided in Table 3.All the tool results except for our proposed tool i6mA-CNN have been obtained from corresponding research articles (Chen et al. (2019); Hao et al. (2019); Tahir et al. (2019); Kong and Zhang (2019)).Our proposed tool clearly outperforms all other tools for similar task by all five performance metrics.Tool i6mA-CNN achieves a high 5 fold cross validation auROC score of 0.98 on our benchmark dataset.ROC curve has been shown in Figure 2.
Independent test set performance of a tool can help in proving the generalization ability of that tool.Following procedures similar to our benchmark dataset construction, we have constructed two independent test datasets in order to evaluate the robustness of our proposed tool in terms of N 6 -Methyladenine site identification in plant genome.Fragaria Vesca (Wild strawberry) and Arabidopsis thaliana (cress weed plant) 6mA site containing sequences have been obtained from the MDR database (Liu et al. (2019)) and from NCBI GEO with accession number GSE81597  4. Results show the robustness of the proposed tool for 6mA site identification in plant genome.Our tool has also been tested on Mus musculus (house mouse) genome dataset.It was used for constructing iDNA6mA-PseKNC tool (Feng et al. (2019b)).The 6mA site containing positive samples were extracted from MethSMRT database (Ye et al. (2016)).Our tool achieved less than 50% test accuracy on this particular dataset.This proves that 6mA site containing region pattern of plants are different from that of animals.
Our model is a CNN based branched model with no time stamp oriented sequential information processing capability.There are reasons why we did not use recurrent layers which utilize time stamp information while working with sequence type data.First, it is not possible to provide multiple feature representations simultaneously in each time stamp input in recurrent neural networks.Second, each time stamp of such networks only accept linear vector representation which would limit our current ability of using matrices.Third, recurrent neural networks such as RNN, GRU or LSTM give importance to positional information (Yin et al. (2017)).But classification of genome related sample sequence often depends on the presence of potential motif type patterns irrespective of their position in the sequence (Zhou et al. (2018b)).Finally, training and validation using recurrent layer based networks are costly because of sequential type of computation.1D CNN based model having multiple convolution layers have the capability of encoding and learning both short and long range patterns.Parallel processing is possible in such networks which allow fast computation.Our customized model architecture is able to work with multiple heterogeneous features simultaneously and is able to utilize the benefits of feature matrix representation.Because of such inherent characteristics, our proposed tool is fast, accurate and shows performance superior to the remaining state-of-the-art tools for 6mA site identification in rice genome and also generalizes well to other plant genomes for the same task.We have found potential motifs which our deep learning based model actively searches for in order to perform 6mA site identification in rice genome.Motifs are class specific substrings found frequently in the sample sequences of that class.Following the procedure shown in the research by Amin et al. (2019), we have customized an attention model (details of this model have been provided as supplementary information) based on our proposed CNN model provided in Figure 1 with a view to identifying the potential motifs.A list of potential motifs found for 6mA site identification on our benchmark dataset have been provided in Table 5.The top motif that we have found is ATAT.It was found to be active 5.2% of the time during 6mA site identification.The pLogo plot constructed by Hao et al. (2019) showed the significant difference of nucleotide distribution in postive and negative samples.This analysis has not been repeated in our current work.

Conclusion
Tool i6mA-CNN is the outcome of the current research aimed at 6mA site identification in rice genome.The combination of four different feature matrices and redundant feature reduction have facilitated our customized CNN based architecture to be robust and accurate.Experimental results on two rice genome datasets and two other independent plant genome datasets related to 6mA site identification show the effectiveness of our tool.Proposed i6mA-CNN is expected to be an effective tool for automation

Fig. 1 .
Fig. 1.Proposed Model Architecture model architecture has been shown in Figure 1.There are four branches for four different feature matrices (described in detail in Subsection Sequence Representation).Each branch has three 1D convolution layers put sequentially.The kernel size and filter number of each convolution layer have been shown in the figure.1D CNN plays important role in position independent local feature extraction (Chen et al. (2017); Amin et al. (2019)) et al.inDNA specific classification and identification tasks such as promoter classification(Amin et al. (2019)), 6mA site identification(Kong and Zhang (2019)), recombination spot identification(Zhang and Kong (2019))

•
Optimizer: Model training with adaptive moment estimation (ADAM), stochastic gradient descent (SGD) and Nesterov accelerated gradient (NAG) optimizer.• Activation function: Experimentation with Relu, Tanh and LeakyRelu activation function in initial convolution layers As shown in Figure 1, Each branch has three convolution layers.These three consecutive layers contain kernel size of 5, 3 and 3, whereas the number of filters are 128, 64 and 32, respectively.Learning rate of 0.0001 has been used in order to avoid divergence while training.Dropout of 0.5 has been used for overfitting avoidance.Addition of dropout changes network connectivity on each training epoch and thus helps model to understand the training sample features instead of memorizing them (Srivastava et al. (2014)).ADAM has been chosen as the optimizer for model weight update during training.Each intermediate convolution layer output goes through Relu activation function which is very popular for being effective and simple (Agarap (2018)).

Table 1 :
Nucleotide categorical attribute-wise encoding ).There are sixteen possible unique dimers such as AA, AT, AC, …in DNA sequences.We represent each dimer by a 16 size one hot vector.1D CNN based feature extraction technique is able to learn from class distinguishing dimer patterns which include difference in frequency.In a 41 length sequence, there are 40 overlapping dimers such as N 1 N 2 , N 2 N 3 , N 3 N 4 , ….So, each sample of our dataset is represented by a 40×16 dimensional matrix.
(Chen et al. (2014)chemical Property Based RepresentationAccording toCheng et al. (1985), Adenine methylation may have impact on DNA structure and its stability.We have used the 90 dinucleotide physicochemical properties used by PseKNC tool(Chen et al. (2014)).So, each dimer of our sample sequence is represented by a 90 size Zscore normalized vector.As a result, each 41 bp long sample has been represented by a 40 × 90 dimensional matrix.

Table 2 :
Four property representation scheme detail of "ACGGA" sequence Pos

Table 5 :
Potential motifs for N 6 -Methyladenine site identification

Table 3 :
i6mA-CNN performance comparison with state-of-the-art tools.Best performance results have been marked in bold.Dataset 1 is the 880 sample per class dataset used for comparison purpose, while Dataset 2 is our benchmark dataset with 1,54,000 samples in each class in epigenetics related research.Future research may aim at constructing a tool that can perform 6mA site identification in both plant and animal genome effectively.