Inherently interpretable position-aware convolutional motif kernel networks for biological sequencing data

Artificial neural networks show promising performance in detecting correlations within data that are associated with specific outcomes. However, the black-box nature of such models can hinder knowledge advancement in research fields by obscuring the decision process and preventing scientists from fully conceptualizing predicted outcomes. Furthermore, domain experts like healthcare providers need explainable predictions to assess whether a predicted outcome can be trusted in high-stakes scenarios and to help them integrate a model into their own routine. Therefore, interpretable models play a crucial role in the incorporation of machine learning into high-stakes scenarios like healthcare. In this paper we introduce Convolutional Motif Kernel Networks, a neural network architecture that learns a feature representation within a subspace of the reproducing kernel Hilbert space of the position-aware motif kernel function. The resulting model allows prediction outcomes to be directly interpreted and evaluated by providing a biologically and medically meaningful explanation without the need for additional post-hoc analysis. We show that our model is able to learn robustly on small datasets and reaches state-of-the-art performance on relevant healthcare prediction tasks. Our proposed method can be applied to DNA and protein sequences. Furthermore, we show that the proposed method learns biologically meaningful concepts directly from data using an end-to-end learning scheme.


Setting the Scaling Parameter
Given a sequence x with length |x|, we found empirically that setting β = |x|^2/10 compensates for the transformation of sequence positions as introduced in the manuscript (see Figure 1).
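The compensation effect can be checked numerically. The sketch below assumes that a sequence position p is projected onto the upper half of the unit circle via the angle πp/|x| (the exact projection formula from the main manuscript is not restated here, so this mapping is an assumption); with β = |x|^2/10 the scaled squared distance between two projected positions then depends only on the positional offset, not on the sequence length:

```python
import math

def project(p, length):
    """Map sequence position p to the upper half of the unit circle.
    NOTE: this projection formula is an assumption for illustration;
    see the main manuscript for the exact definition."""
    angle = math.pi * p / length
    return (math.cos(angle), math.sin(angle))

def scaled_sq_dist(p, q, length):
    """beta * ||t_p - t_q||^2 with beta = |x|^2 / 10."""
    beta = length ** 2 / 10
    tp, tq = project(p, length), project(q, length)
    return beta * ((tp[0] - tq[0]) ** 2 + (tp[1] - tq[1]) ** 2)

# For a fixed positional offset (here 3), the scaled squared distance
# is nearly identical across very different sequence lengths:
for length in (100, 1000, 10000):
    print(scaled_sq_dist(50, 53, length))
```

For small angular differences, ||t_p - t_q||^2 ≈ (π(p-q)/|x|)^2, so multiplying by β = |x|^2/10 cancels the length dependence.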

Normalized Position Frequency Matrix
A set of biological sequences of length k can easily be transformed into a position frequency matrix (PFM) by counting the occurrences of each nucleotide or amino acid from an alphabet A at each position and constructing a matrix M = (m_ij), i = 1, ..., |A|; j = 1, ..., k, where m_ij is equal to the number of occurrences of nucleotide or amino acid i at position j. The PFM can then be easily transformed into a normalized position frequency matrix (nPFM) M_norm = (m_norm,ij), i = 1, ..., |A|; j = 1, ..., k, by dividing each entry of every column m_j = (m_1j, ..., m_|A|j)^T of M by the ℓ2-norm of the respective column, i.e., m_norm,ij = m_ij / ||m_j||_2.
Using this definition of an nPFM, the set of all k-mers is a subset of the set of all sequence motifs. Considering a k-mer as a special motif, each column of the corresponding nPFM has exactly one element set to one while all other elements are set to zero.
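The construction above can be sketched in a few lines (a minimal illustration of the definition; function and variable names are our own):

```python
import math

def npfm(sequences, alphabet="ACGT"):
    """Build a normalized position frequency matrix (nPFM) from
    equal-length sequences: count residues per position, then scale
    every column to unit l2-norm."""
    k = len(sequences[0])
    counts = [[0.0] * k for _ in alphabet]
    for seq in sequences:
        for j, letter in enumerate(seq):
            counts[alphabet.index(letter)][j] += 1.0
    for j in range(k):  # column-wise l2 normalization
        norm = math.sqrt(sum(counts[i][j] ** 2 for i in range(len(alphabet))))
        for i in range(len(alphabet)):
            counts[i][j] /= norm
    return counts

# A single k-mer yields an nPFM with exactly one 1 per column:
m = npfm(["ACG"])
```

As stated above, feeding a single k-mer into this construction produces exactly one non-zero entry (equal to one) per column.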
Figure 1: Comparison of the exponential term of the oligo kernel with the position kernel as introduced in Eq. (1) of the main manuscript. Six different sequence lengths were tested: 10, 100, 1000, 10 000, 100 000, and 1 000 000. The scaling parameter β of the position kernel was set to |x|^2/10, where |x| denotes the corresponding sequence length. As can be seen in the plots, the chosen parameter value compensates for the reduced absolute distances of sequence positions, and for their dependence on the sequence length, caused by the projection onto the unit circle introduced in the main manuscript.

Orthogonal Projection onto the Subspace E
We denote a motif position pair by y = (ω, p). Then the explicit parametrization derived by Mairal in Appendix A of [3] also holds for our kernel. This can be shown by simple calculus using Mairal's definition: Then the following holds: Here, K_ZZ = (K_0(z_i, z_j)), i, j = 1, ..., n, is the Gram matrix formed over the anchor points z_1, ..., z_n, K_ZZ^{-1/2} is the (pseudo-)inverse square root of the Gram matrix, and K_Z((ω, p)) = (K_0(z_1, (ω, p)), ..., K_0(z_n, (ω, p)))^T. K_0 is the kernel function introduced in the main manuscript. From here, the argument proceeds as in the appendix of [3].
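The parametrization can be sketched numerically. The snippet below is a generic Nyström-style implementation of the projection coordinates K_ZZ^{-1/2} K_Z(y), not the authors' exact code; with this parametrization, the inner product of two projected points reproduces K_Z(y)^T K_ZZ^{-1} K_Z(y'), as in Mairal's derivation:

```python
import numpy as np

def projection_weights(K_zz):
    """(Pseudo-)inverse square root of the Gram matrix over the anchor
    points, computed via eigendecomposition; near-zero eigenvalues are
    zeroed out to obtain the pseudo-inverse."""
    eigvals, eigvecs = np.linalg.eigh(K_zz)
    inv_sqrt = np.where(eigvals > 1e-10,
                        1.0 / np.sqrt(np.clip(eigvals, 1e-10, None)),
                        0.0)
    return eigvecs @ np.diag(inv_sqrt) @ eigvecs.T

def project_onto_span(K_zz, k_zy):
    """Coordinates of the orthogonal projection of phi(y) onto the span
    of the anchor points: K_zz^{-1/2} K_z(y)."""
    return projection_weights(K_zz) @ k_zy
```

For an anchor point z_i itself, k_zy is simply the i-th column of the Gram matrix, and the projected inner products recover the original kernel values.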

Deriving the Position-Aware Motif Kernel from Motif Functions
For a sequence x over an alphabet A and a motif χ ∈ R^{|A|k}, we define the motif function as with and position variable t ∈ R^2.
Consider two sequences x and x' of the same length, i.e., |x| = |x'|. The motif kernel between them is given as the inner product of the associated motif functions φ_x and φ_x': with ω_p, ω_q ∈ R_+^{|A|k} and p, q from the upper half of the unit circle. The above derivation uses the fact that the variables χ ∈ R^{|A|k} and t ∈ R^2 are not restricted and thus (using the substitutions y = χ, γ = α, δ = ω_p, ε = ω_q, N = |A|k and y = t, γ = β/(2σ^2), δ = p, ε = q, N = 2, respectively) can be removed via the Gaussian integral

∫ exp(−a y_i^2 + b y_i + c) dy_i = sqrt(π/a) · exp(b^2/(4a) + c)   (7)

with a = 2γ, b = 2γ(δ_i + ε_i), and c = −γ(δ_i^2 + ε_i^2). The last equality in Eq. (5) results from the normalization of the nPFM as defined in Eq. (1), i.e., the column-wise unit ℓ2-norm demanded earlier in Section 5.1 of the main manuscript. Consider two normalized position frequency matrices (nPFMs) A and B, with |A| being the size of the alphabet over which the motif is created and k being the length of the motif. One can define two vectors a and b given as flattened nPFMs A and B, i.e., the columns are concatenated to convert the matrices into vectors. Utilizing that each column of an nPFM has unit ℓ2-norm by definition, the following equality is obtained:

HIVdb

HIVdb is one of the largest publicly available databases containing resistance information against antiretroviral drugs that are used for the treatment of an HIV infection or acquired immunodeficiency syndrome (AIDS). We used available information for eight PI drugs, six NRTI drugs, and four NNRTI drugs (data was downloaded on 14 April 2021). The PI dataset contained 819 isolates, each represented by an amino acid sequence of length 99. The NRTI dataset contained 489 isolates, each represented by an amino acid sequence of length 240. Finally, the NNRTI dataset contained 583 isolates, each represented by an amino acid sequence of length 240. For each isolate, available drug resistance information for each drug of the corresponding type was given as a change in fold resistance compared to the wild type.

HIVdb stores the isolate information as a tab-separated table. Each position is represented either by a '-', if the amino acid is the same as in the wild type, or by the mutation at that position. Furthermore, there are columns for each drug with the fold resistance of the isolate compared to the wild type. We provide a script that translates these tables into a FASTA file that can be used directly as input for the provided DataSet object. HIVdb provides two fold resistance thresholds for each of the drugs within the database. These thresholds are used by our preparation script to assign each isolate to either the 'low resistance', 'medium resistance', or 'high resistance' class. The DataSet object combines the 'medium resistance' and 'high resistance' classes into the 'resistant' class on data access.
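The per-position encoding can be turned back into a full amino acid sequence as sketched below. This is a simplified illustration of what our preparation script does, not the script itself; the function name is our own, and amino acid mixtures or indels are ignored:

```python
def isolate_to_sequence(row, wild_type):
    """Rebuild the full amino acid sequence of an isolate from HIVdb's
    per-position encoding: '-' keeps the wild-type residue, anything
    else is the observed mutation at that position."""
    return "".join(wt if entry == "-" else entry
                   for entry, wt in zip(row, wild_type))

# Example: positions 1 and 3 mutated, position 2 wild type.
seq = isolate_to_sequence(["V", "-", "L"], "PQR")
```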
Table 1 provides detailed information about sample count, class distribution, best-performing parameters, and classification performance for each of the tested drugs. The number of isolates for each of the drugs differs from the total number of isolates, since not all isolates had fold resistance information for every drug. We excluded drugs with fewer than 100 isolates from all further evaluations. Each of the drugs is indicated using the official three-letter abbreviation; the corresponding complete names and trade names are written below.

Splice Site Benchmarks
The NN269 benchmark consists of 1324 true targets for both the acceptor and donor datasets, collected from 269 human genes. The acceptor sequences consist of 90 nucleotides with the splice site acceptor dimer AG at positions 69 to 70. The donor sequences consist of 15 nucleotides with the splice site donor dimer GT at positions 8 to 9. Furthermore, the benchmark includes 5553 acceptor decoys and 4922 donor decoys. Each decoy sequence has the splice site dimers (AG for acceptor sites, GT for donor sites) at the same positions as the true targets. The DGSplicer benchmark consists of 2380 true acceptor targets and 2379 true donor targets that were extracted from 462 multi-exon human genes. The acceptor sequences have a length of 36 nucleotides with the dimer AG at positions 26 to 27. The donor sequences have a length of 18 nucleotides with the dimer GT at positions 10 to 11. As with NN269, decoy sequences are included in the benchmark, i.e., 400 314 pseudo acceptor sites and 282 955 pseudo donor sites. Both benchmarks are split into training and test sets. The detailed statistics about the used benchmarks for the prediction of splice sites can be found in Table 2. To deal with the huge imbalance between positive and negative samples within the DGSplicer benchmark, we randomly under-sampled the negative samples in the training sets to achieve a class ratio of N_p/N_rn = 0.25, where N_p is the number of samples in the positive class and N_rn the number of samples in the negative class after resampling.
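The under-sampling step can be sketched as follows (a minimal illustration of the resampling logic; function names are our own, and the actual data loading is omitted):

```python
import random

def undersample_negatives(positives, negatives, ratio=0.25, seed=0):
    """Randomly under-sample the negative class so that
    N_p / N_rn = ratio, i.e., N_rn = N_p / ratio. The positive class
    is kept unchanged."""
    n_keep = int(len(positives) / ratio)
    rng = random.Random(seed)
    return positives, rng.sample(negatives, min(n_keep, len(negatives)))

# Toy example mirroring the DGSplicer imbalance: 100 positives and a
# large pool of negatives are reduced to a 0.25 class ratio.
pos = ["p"] * 100
neg = ["n"] * 5000
pos_out, neg_out = undersample_negatives(pos, neg)
```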

CMKN
CMKN models have 5 hyperparameters associated with the motif kernel layer that need to be optimized.These include the k-mer length k, the motif comparison parameter α, the position scaling parameter β, the positional uncertainty parameter σ, and the number of anchor points.
We used prior (biological) knowledge to fix three of the hyperparameters. This allowed us to reduce computation time and lower the CO2 footprint of our experiments. We fixed the motif length k to 1, since considering longer motifs offers no advantage in the case of resistance mutations against antiretroviral drugs: previously conducted research suggests that drug resistance is caused by exchanging single amino acids, which justifies setting the motif length to 1. Additionally, we fixed α to 1.0 to ensure that the impact of inexact motif matching is not reduced. This takes into account the fact that some positions can harbor several different resistance-causing mutations and that the exchange of a single amino acid can cause a drug resistance. Therefore, inexact motif matching contains valuable information, which justifies the value we used for α. Lastly, the position scaling parameter β was fixed to |x|^2/10, where |x| denotes the length of the input sequences. This value compensates for the transformation of the sequence positions as described in Section 2.
The other two hyperparameters, the positional uncertainty σ and the number of anchor points, were optimized using a simple grid search. We used a small but sensible number of values for the grid to keep the CO2 footprint of our experiments as low as possible. The following values were used for the grid search: The different choices for the number of anchors for PI and NRTI/NNRTI drugs were made to account for the different sequence lengths. The optimal parameter choice for each of the drugs can be found in Table 1.
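The grid search itself is a simple exhaustive loop over all value combinations. The sketch below uses hypothetical grid values purely for illustration (the actual grids are those listed above), and a toy scoring function stands in for cross-validated training:

```python
from itertools import product

# Hypothetical grid values -- illustration only, not the values used
# in the experiments.
sigma_grid = [1.0, 2.0, 4.0]
anchor_grid = [50, 100]

def grid_search(train_and_eval):
    """Exhaustive search over (sigma, n_anchors); train_and_eval is
    expected to return a score to maximize."""
    return max(product(sigma_grid, anchor_grid),
               key=lambda cfg: train_and_eval(*cfg))

# Toy scoring function peaking at sigma = 2.0 with more anchors better.
best_cfg = grid_search(lambda s, a: -(s - 2.0) ** 2 + a / 1000)
```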
All CMKN models were trained using PyTorch's implementation of the ADAM algorithm and the class-balanced loss introduced in [1]. Training lasted for 200 epochs and no early stopping was used. The learning rate started at 0.1 and was dynamically adjusted using the ReduceLROnPlateau method.
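The dynamic learning-rate adjustment mirrors the logic of PyTorch's torch.optim.lr_scheduler.ReduceLROnPlateau, sketched below in plain Python. The reduction factor and patience shown are illustrative defaults, not necessarily the values used in the experiments:

```python
class PlateauScheduler:
    """Reduce the learning rate by `factor` whenever the monitored loss
    has not improved for more than `patience` consecutive epochs
    (a plain-Python sketch of ReduceLROnPlateau's core behavior)."""

    def __init__(self, lr=0.1, factor=0.1, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:          # improvement: reset the counter
            self.best = loss
            self.bad_epochs = 0
        else:                         # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```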
The value combination that maximized the majority of the four performance measures (accuracy, F1 score, auROC, and MCC) was selected for further analysis.

Models used for Comparison
We used standard parameters for the random forest classifiers trained on our HIVdb datasets, matching the parameters used in [6]. The number of trees was set to 500, while the number of features considered when searching for the best split was set to the square root of the total number of features. For all other parameters, the default values of the sklearn.ensemble.RandomForestClassifier class were used. This is consistent with the training procedure used in [6]. The input sequences were ordinally encoded before passing them to the classifier, i.e., each letter was replaced by an integer reflecting the position of the letter in the alphabet.
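The ordinal encoding can be sketched in one line (a minimal illustration; the amino acid alphabet shown is an assumption):

```python
def ordinal_encode(sequence, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Replace each letter by its index in the alphabet, as done before
    passing sequences to the RF and polynomial-kernel SVM classifiers."""
    return [alphabet.index(letter) for letter in sequence]

encoded = ordinal_encode("ACD")
```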
For support vector machines (SVMs) with polynomial kernel, we performed a grid search to optimize the degree of the polynomial kernel and the regularization parameter. The grid consisted of the following values: Similar to the CMKN models, the parameter combination that maximized the majority of the four performance measures (accuracy, F1 score, auROC, and MCC) was selected for further analysis. We used the same ordinal encoding as in the RF experiment to encode the input for the SVMs. For the model selection of SVMs with oligo kernel, we used the input encoding described in [4]. The hyperparameters of this method are the k-mer length k, the positional uncertainty parameter σ, and the regularization parameter C. As for CMKNs, k was fixed to 1 for biological reasons. The other two hyperparameters were optimized with the following grid: The optimal parameters for the SVM models were: The CNN architecture used for the DRM coverage comparison in Section 9 was first published in [7]. First, we tried to train CNN models on our datasets using the R script provided by the authors but, unfortunately, the script exited with errors and was not usable. Instead of debugging the provided script, which would pose the risk of changing the training environment compared to the original publication, we decided to implement the same architecture using PyTorch. This course of action provided the additional benefit that the training of both neural network architectures, CNN and CMKN, used the same framework and followed the same standards, which improves the comparability of the methods. The first layer was an embedding layer that took the nominally encoded sequences and learned a three-dimensional embedding. Afterwards, two one-dimensional convolutional layers with ReLU activation were used. Both convolutional layers had a kernel size of 9 with 32 filters. A one-dimensional MaxPooling layer with a pooling size of 5 was used between the two convolutional layers. The flattened
output of the second convolutional layer was fed into fully-connected layers for the classification. CNN models were trained using PyTorch's implementation of the ADAM algorithm and the class-balanced loss introduced in [1]. Training lasted for 200 epochs and no early stopping was used. The learning rate started at 0.1 and was dynamically adjusted using the ReduceLROnPlateau method. We did not include the performance of the CNN architecture by Steiner et al. in our main manuscript due to its non-competitive performance on our datasets (i.e., the models reached a mean MCC of 0.1). However, we include the description here since Steiner's models are used in Section 9. The CNN results shown in the main manuscript were achieved with a network resulting from a simple ablation test, i.e., we replaced the kernel layer of a CMKN model with a standard convolutional layer with the same parameters.
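The layer dimensions of the re-implemented architecture can be verified with a small shape calculation, assuming stride 1 and no padding for the convolutions (an illustrative sketch; the sizes of the fully-connected layers are not restated here):

```python
def conv1d_out(length, kernel_size, stride=1):
    """Output length of a 1-d convolution without padding."""
    return (length - kernel_size) // stride + 1

def flattened_size(seq_len, n_filters=32, kernel_size=9, pool_size=5):
    """Trace a sequence through conv -> maxpool -> conv and return the
    flattened feature count fed into the fully-connected layers."""
    length = conv1d_out(seq_len, kernel_size)   # first conv layer
    length = length // pool_size                # max pooling
    length = conv1d_out(length, kernel_size)    # second conv layer
    return length * n_filters

# For the length-99 PI sequences: 99 -> 91 -> 18 -> 10, so 10 * 32
# = 320 flattened features.
features = flattened_size(99)
```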

CMKN
In contrast to the HIV experiments, we fixed only two hyperparameters in the CMKN models used for splice site prediction: α was fixed to 1 and β was fixed as described in Section 2. The other hyperparameters were optimized with a grid search using the following values: The optimal parameter choice for each splice site dataset can be found in Table 2.
All CMKN models were trained using PyTorch's implementation of the ADAM algorithm and the class-balanced loss introduced in [1]. Training lasted for 200 epochs and no early stopping was used. The learning rate started at 0.1 and was dynamically adjusted using the ReduceLROnPlateau method.

Comparing the Interpretation Capabilities of CMKNs
Comparing the quality of the interpretation capabilities of different methods is a difficult problem, due to the ethical and philosophical aspects of interpretation, which are hard to express mathematically [2,5]. Nevertheless, we performed quantitative comparisons of the global interpretation between the inherently interpretable methods (i.e., CMKNs and oligo kernel SVMs) and published post-hoc interpretations of non-interpretable methods (impurity-based feature importance for RFs [6] and permutation feature importance analysis for CNNs [7]) on the HIV prediction task. For all methods, the 20 most important positions were identified and the percentage coverage of drug resistance mutation positions (DRMs) by these 20 positions was calculated. The results can be found in Figure 2. We found that CMKNs performed similarly to post-hoc methods applied to CNNs and RFs when used to identify DRMs. All three methods significantly outperformed oligo kernel SVMs. However, the post-hoc methods were limited to identifying positions without compositional information about the mutations. This leads to an incomplete picture of the biological process and therefore decreases the utility of the interpretation provided by post-hoc methods. Only CMKNs and oligo kernel SVMs were able to provide the complete picture with positional and compositional information of the drug resistance mutations. We used the procedure described by Meinicke and colleagues in the original manuscript to visualize the learned oligo functions from the trained SVMs [4]. The result is presented in Figure 3.
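The coverage metric described above can be sketched as follows (a minimal illustration; the position lists in the example are made up and do not correspond to real DRM data):

```python
def drm_coverage(important_positions, drm_positions, top_n=20):
    """Percentage of known drug resistance mutation positions (DRMs)
    covered by the top_n most important positions reported by a model.
    important_positions is assumed to be sorted by decreasing
    importance."""
    top = set(important_positions[:top_n])
    hits = sum(1 for p in drm_positions if p in top)
    return 100.0 * hits / len(drm_positions)

# Example: 3 of 4 hypothetical DRM positions appear among the top
# positions reported by a model.
cov = drm_coverage([10, 46, 54, 82, 90], [10, 46, 82, 99], top_n=5)
```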
Due to the fact that the oligo kernel is limited to discrete k-mers, the compositional variability of DRMs is not identified by oligo kernel SVMs. For identified DRMs, the visualization focuses on single amino acids.

Figure 2: Number of identified drug resistance mutation positions (DRMs) by different models. Post-hoc methods were needed to calculate feature importances for some models: for convolutional neural network (CNN) classifiers, we applied a permutation feature importance analysis, and for the random forest (RF) classifiers, we calculated impurity-based feature importances. Position importances were directly accessible for CMKNs and oligo kernel SVMs (SVM oligo). CMKNs performed similarly to CNNs and RFs. All three significantly outperformed SVM oligo. The center line of each box indicates the median. The height of the boxes represents the interquartile range (IQR), with the upper and lower whiskers set to 1.5 times the IQR. Outliers are depicted by circles.

Figure 3: Interpretation of the support vector machines utilizing the oligo kernel (SVM oligo). Details about the calculation of these matrices can be found in the original manuscript [4]. SVM oligo had difficulties identifying drug resistance mutation positions (DRMs) and, due to its limitation to k-mers, focused on single amino acids for each DRM.

Table 1:
Overview of the HIVdb datasets. For each drug, basic dataset statistics are displayed as well as the best values for the positional uncertainty parameter σ and the number of anchor points. Furthermore, mean values and standard deviations for four performance measures (accuracy, F1 score, area under the receiver operating characteristic curve (auROC), and Matthew's correlation coefficient (MCC)), achieved on validation sets during a stratified 5-fold cross-validation, are shown. Drugs with low sample counts were removed from the analysis. Drug resistance is given as fold resistance compared to the wild type; we used thresholds provided by HIVdb to convert fold resistance values into the discrete classes 'susceptible' and 'resistant'.

Table 2:
Overview of the benchmarks used to evaluate CMKN on the splice site prediction task.
N_p denotes the number of samples in the positive class and N_rn the number of samples in the negative class after resampling. Class ratios in validation and test sets were kept unchanged.