Supplementary Figure 2: Kernel selection and data integration. | Nature Biotechnology


From: Prediction of potent shRNAs with a sequential classification algorithm


(a) Schematic of the first support vector machine (SVM) classifier that serves to eliminate non-functional sequences and prioritize shRNAs that are likely to be potent.

(b) Schematic of the kernel representation used by SplashRNA. A weighted degree kernel is calculated across the entire guide sequence, while two spectrum kernels are calculated across nucleotides 1-15 and 16-22, respectively.
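As an illustration of the kernel layout in (b), the following is a minimal Python sketch, not the SplashRNA implementation. It uses the textbook definitions of the spectrum kernel (shared k-mer counts) and the weighted degree kernel (position-specific substring matches with the standard decaying weights). The k-mer length, the degree, the unweighted sum used to combine the three kernels, and all function names are assumptions made for illustration; the legend does not specify them.

```python
from collections import Counter


def spectrum_kernel(x, y, k=3):
    """Spectrum kernel: sum over shared k-mers of count_x * count_y (position-independent)."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[u] * cy[u] for u in cx)


def weighted_degree_kernel(x, y, degree=3):
    """Weighted degree kernel: counts substrings of length 1..degree that match
    at the same position, weighted by beta_d = 2(D - d + 1) / (D(D + 1))."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1))
        total += beta * matches
    return total


def combined_kernel(x, y, k=3, degree=3):
    """Assumed combination: WDK over positions 1-22 plus spectrum kernels
    over positions 1-15 and 16-22 (simple unweighted sum)."""
    return (weighted_degree_kernel(x[:22], y[:22], degree)
            + spectrum_kernel(x[:15], y[:15], k)
            + spectrum_kernel(x[15:22], y[15:22], k))
```

The weighted degree term rewards matches at identical positions along the guide, while the two spectrum terms capture k-mer content within each region regardless of position.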

(c) TILE score distribution (Online Methods). We set the potency threshold separating the negative class from the positive class at the minimum between the two modes of the distribution (green line; for thresholds see Supplementary Table 1).
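One way to locate a minimum between two modes, sketched here under stated assumptions: the scores have already been binned into a histogram, the two tallest local peaks are taken as the modes, and the threshold is the center of the lowest bin between them. The function name and the peak-finding rule are hypothetical; the legend does not describe how the minimum was computed, and real data would typically need smoothing first.

```python
def valley_threshold(counts, centers):
    """Return the bin center with the lowest count between the two tallest local peaks.

    counts  -- histogram bin counts (assumed already smoothed if noisy)
    centers -- bin centers, same length as counts
    """
    n = len(counts)
    # A bin is a peak if it is strictly higher than its left neighbour and
    # at least as high as its right neighbour (edges compare on one side only).
    peaks = [i for i in range(n)
             if (i == 0 or counts[i] > counts[i - 1])
             and (i == n - 1 or counts[i] >= counts[i + 1])]
    if len(peaks) < 2:
        raise ValueError("histogram does not have two modes")
    # Keep the two tallest peaks, ordered left to right.
    lo, hi = sorted(sorted(peaks, key=lambda i: counts[i], reverse=True)[:2])
    # Lowest bin strictly between the two modes.
    valley = min(range(lo + 1, hi), key=lambda i: counts[i])
    return centers[valley]
```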

(d) Testing of multiple kernel combinations in a leave-one-gene-out nested cross-validation setting on the TILE data set found that the combination of a weighted degree kernel over positions 1-22 and two spectrum kernels over positions 1-15 and 16-22 (All_kernels) yields the best performance. Spec1 is a spectrum kernel over positions 1-15. Spec2 is a spectrum kernel over positions 16-22. Spec1_spec2 is a combination of spec1 and spec2. Wdk is a weighted degree kernel over positions 1-22. Wdk_spec1 is a combination of wdk and spec1. Wdk_spec2 is a combination of wdk and spec2. All_kernels is a combination of wdk, spec1 and spec2.
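The leave-one-gene-out nested cross-validation scheme in (d) can be sketched as follows. This is a generic illustration, not the authors' pipeline: each outer fold holds out all shRNAs targeting one gene, and an inner leave-one-gene-out loop over the remaining genes selects a hyperparameter. The names `leave_one_gene_out`, `nested_logo`, `fit_and_score` and `param_grid` are hypothetical, and `fit_and_score` stands in for whatever model training and evaluation is used.

```python
def leave_one_gene_out(genes):
    """Yield (held_out_gene, train_indices, test_indices), one fold per gene."""
    for held_out in sorted(set(genes)):
        train = [i for i, g in enumerate(genes) if g != held_out]
        test = [i for i, g in enumerate(genes) if g == held_out]
        yield held_out, train, test


def nested_logo(genes, fit_and_score, param_grid):
    """Nested leave-one-gene-out CV.

    fit_and_score(train_idx, test_idx, param) is a user-supplied (hypothetical)
    callable that trains on train_idx and returns a score on test_idx.
    Returns a dict mapping each held-out gene to its outer-fold score.
    """
    outer_scores = {}
    for gene, train, test in leave_one_gene_out(genes):
        train_genes = [genes[i] for i in train]

        def inner_score(param):
            # Inner folds are built from the outer training set only,
            # so the held-out gene never influences parameter selection.
            scores = [
                fit_and_score([train[i] for i in in_tr],
                              [train[i] for i in in_te], param)
                for _, in_tr, in_te in leave_one_gene_out(train_genes)
            ]
            return sum(scores) / len(scores)

        best = max(param_grid, key=inner_score)
        outer_scores[gene] = fit_and_score(train, test, best)
    return outer_scores
```

Splitting by gene rather than by individual shRNA keeps sequences that target the same transcript from appearing in both the training and the test fold.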

(e) M1 score distribution (Supplementary Table 1, Online Methods). Cutoffs (green lines) were calculated by fitting Gaussian distributions to the modes and setting thresholds at 5% false positive rate (FPR) and 5% false negative rate (FNR).
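The cutoff calculation in (e) can be illustrated with a short Python sketch. It assumes that the scores belonging to each mode have already been separated, that each mode is fitted by its sample mean and standard deviation, and that the 5% FPR cutoff is the 95th percentile of the negative-mode Gaussian while the 5% FNR cutoff is the 5th percentile of the positive-mode Gaussian. The function name and this reading of the two rates are assumptions; the legend does not give the fitting procedure.

```python
from statistics import NormalDist


def gaussian_cutoffs(neg_scores, pos_scores, fpr=0.05, fnr=0.05):
    """Fit one Gaussian per mode and return (fpr_cutoff, fnr_cutoff).

    fpr_cutoff -- score exceeded by only `fpr` of the fitted negative mode
    fnr_cutoff -- score that only `fnr` of the fitted positive mode falls below
    """
    neg = NormalDist.from_samples(neg_scores)  # sample mean and sample stdev
    pos = NormalDist.from_samples(pos_scores)
    fpr_cutoff = neg.inv_cdf(1.0 - fpr)
    fnr_cutoff = pos.inv_cdf(fnr)
    return fpr_cutoff, fnr_cutoff
```

For example, with a negative mode fitted as mean 0 and standard deviation 1, the 5% FPR cutoff falls at about 1.645 standard deviations above the negative mean.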

(f) Incorporation of M1 positives, negatives or both into the TILE training set was tested in a nested leave-one-gene-out cross-validation setting. Inclusion of M1 negatives reduced performance on the TILE data set, whereas inclusion of the M1 positives alone improved performance. Note: TILE+M1pos = SplashmiR-30, the miR-30 classifier.

(g) Score distribution for the shERWOOD miR-30 set (Supplementary Table 1, Online Methods). We set the threshold at an arbitrary cutoff of zero (green line).

(h) Incorporation of M1 positives into the TILE training set improved performance on the external shERWOOD data set. Note: TILE+M1pos = SplashmiR-30, the miR-30 classifier.
