Prediction of potent shRNAs with a sequential classification algorithm


We present SplashRNA, a sequential classifier to predict potent microRNA-based short hairpin RNAs (shRNAs). Trained on published and novel data sets, SplashRNA outperforms previous algorithms and reliably predicts the most efficient shRNAs for a given gene. Combined with an optimized miR-E backbone, >90% of high-scoring SplashRNA predictions trigger >85% protein knockdown when expressed from a single genomic integration. SplashRNA can significantly improve the accuracy of loss-of-function genetics studies and facilitates the generation of compact shRNA libraries.


Figure 1: Computational modeling of advancements in shRNA technology.
Figure 2: Benchmarking SplashRNA prediction performance.

Accession codes


Gene Expression Omnibus




We thank J.A. Doudna, G.J. Hannon, L.E. Dow and S.N. Floor for continuous support and valuable discussions. We gratefully acknowledge assistance and support from A. Banito, V. Sridhar, L. Faletti, C.C. Chen and S. Tian. C.F. was supported in part by a K99/R00 Pathway to Independence Award (K99GM118909) from the National Institutes of Health (NIH), National Institute of General Medical Sciences (NIGMS). C.F. is a founder of Mirimus Inc., a company that develops RNAi-based reagents and transgenic mice. This work was also supported in part by grant CA013106 (S.W.L.). S.W.L. is a founder and member of the scientific advisory board of Mirimus Inc., the Geoffrey Beene Chair of Cancer Biology at MSKCC and an investigator of the Howard Hughes Medical Institute. J.Z. is a member of the scientific advisory board, and P.K.P. is a founder and employee of Mirimus Inc. C.S.L. was supported in part by NHGRI U01 grants HG007033 and HG007893 and NCI U01 grant CA164190. A375 cells were a kind gift from Neal Rosen, MSKCC.

Author information




R.P., L.F., C.S.L. and C.F. conceived and designed the study, and developed the data integration framework. R.P., L.F., and C.W. built the algorithm, and carried out the model training and computational validation. C.-H.H., N.S., D.-Y.L., Y.G., P.K.P., D.F.T., T.H., J.Z., S.W.L. and C.F. generated the biological data sets and validated knockdown potency. R.P., L.F., C.W. and V.T.S. built the web page. V.T. and G.R. assisted with study design and advised on algorithmic development. Q.X. and R.J.G. helped with validation of predictions. R.P., L.F., C.-H.H., T.H., J.Z., S.W.L., C.S.L. and C.F. analyzed data and wrote the manuscript.

Corresponding authors

Correspondence to Christina S Leslie or Christof Fellmann.

Ethics declarations

Competing interests

C.F. is a founder of Mirimus Inc., a company that develops RNAi-based reagents and transgenic mice. S.W.L. is a founder and member of the scientific advisory board of Mirimus Inc. J.Z. is a member of the scientific advisory board of Mirimus Inc. P.K.P. is a founder and employee of Mirimus Inc. R.P. and L.F. have filed intellectual property on SplashRNA.

Integrated supplementary information

Supplementary Figure 1 Data set generation.

(a-f) Generation of the M1 (miR-30, 20,400 shRNAs) Sensor assay data set (Supplementary Table 2, Online Methods). (a) Schematic of our previously published Sensor assay that enables large-scale functional assessment of shRNA potency (Online Methods). (b) Library complexity over Sensor assay sort cycles. Shown are normalized read numbers (parts per million, ppm) in both duplicates for each shRNA represented within the initial libraries (Vector) and the pools after the indicated sorts (Sort 3, 5). (c) Correlation of reads per shRNA between the two replicates before sorting (left panel), after Sort 5 (middle panel) and between the initial and endpoint population (right panel; shown for one representative replicate). r, Pearson correlation coefficient. (d) Correlation of Sensor score and reads per shRNA in the vector libraries, showing that the score is independent of the initial shRNA representation. r, Pearson correlation coefficient. (e) Enrichment or depletion of 17 control shRNAs after Sort 5. All controls have been used in previous Sensor assays (e.g. TILE, mRas + hRAS) and are classified into a strong, intermediate and weak class according to their knockdown potency assessed by immunoblotting. (f) Rank correlation of 325 performance control shRNAs. 65 shRNAs per gene targeting mouse Bcl2, Kras, Mcl1, Myc and Trp53 that had previously been tested as part of the TILE data set were chosen as supplemental controls to assess Sensor assay performance for weak, intermediate and strong shRNAs. The individual shRNA ranks between TILE and M1 were highly correlated (325 shRNAs, Spearman rank correlation coefficient rho: 0.63; gene-specific correlation coefficients are also reported), even though the TILE and M1 data sets were generated several years apart, using mostly different equipment, reagents and operators. (g) Generation of the miR-E reporter assay data set (Supplementary Table 2, Online Methods). Normalized reporter knockdown values of miR-E shRNAs assessed one-by-one in an RNAi reporter assay. The shRNAs were tested in 42 individual batches, each including several control shRNAs for data scaling (miR-E Ren.713, miR-30 Pten.1524) and quality control (miR-E Pten.1523, miR-E Pten.1524). Background fluorescence of the parental chicken cell line (ERC) and maximal fluorescence of the batch-specific reporter cell line (ERC cells expressing the shRNA target reporter) were also measured. All shRNAs were grouped into either a positive or negative class. A threshold value of 80 was chosen as a cutoff, based on the performance of miR-30 Pten.1524 and miR-E Ren.713. (h) Nucleotide representation of positive shRNAs from the indicated data sets. Shown are the nucleotides one to eight of the guide strand (starting in the center), including the entire seed region. Unbiased TILE (miR-30) set, showing a diversified nucleotide composition (left panel). Preselected M1 (miR-30, DSIR + Sensor rules selected) set, showing a biased nucleotide representation (middle panel). Preselected miR-E + UltramiR set, showing a different nucleotide bias due to the altered shRNA backbone. More shRNAs starting with a C were found to be potent (compared to TILE, p = 0.002, Fisher’s exact test), indicating less restrictive sequence requirements when using the miR-E backbone.

Supplementary Figure 2 Kernel selection and data integration.

(a) Schematic of the first support vector machine (SVM) classifier that serves to eliminate non-functional sequences and prioritize shRNAs that are likely to be potent. (b) Schematic of the kernel representation used by SplashRNA. A weighted degree kernel is calculated across the entire guide sequence, while two spectrum kernels are calculated across nucleotides 1-15 and 16-22, respectively. (c) TILE score distribution (Online Methods). We set a potency threshold separating the negative from the positive class at the minimal point between the two modes of the distribution (green line, for thresholds see Supplementary Table 1). (d) Testing of multiple kernel combinations in a leave-one-gene-out nested cross-validation setting on the TILE data set found that the combination of a weighted degree kernel over positions 1-22 and two spectrum kernels at positions 1-15 and 16-22 (allKernels) yields the best performance. Spec1 is a spectrum kernel over positions 1-15. Spec2 is a spectrum kernel over positions 16-22. Spec1_spec2 is a combination of spec1 and spec2. Wdk is a weighted degree kernel over positions 1-22. Wdk_spec1 is a combination of wdk and spec1. Wdk_spec2 is a combination of wdk and spec2. All_kernels is a combination of wdk, spec1 and spec2. (e) M1 score distribution (Supplementary Table 1, Online Methods). Cutoffs (green lines) were calculated by fitting Gaussian distributions to the modes and setting thresholds at 5% false positive rate (FPR) and 5% false negative rate (FNR). (f) Incorporation of M1 positives, negatives or both into the TILE training set was tested in a nested leave-one-gene-out cross-validation setting. Inclusion of M1 negatives deteriorated performance on the TILE data set, whereas inclusion of the M1 positives alone improved performance. Note: TILE+M1pos = SplashmiR-30, the miR-30 classifier. (g) Score distribution for the shERWOOD miR-30 set (Supplementary Table 1, Online Methods). We set the threshold at an arbitrary cutoff of zero (green line). (h) Incorporation of M1 positives into the TILE training set improved performance on the external shERWOOD data set. Note: TILE+M1pos = SplashmiR-30, the miR-30 classifier.
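
The kernel representation in panels (b) and (d) can be illustrated with a minimal Python sketch. It is not the SplashRNA implementation: the k-mer length `k`, the maximum degree `max_degree`, the unweighted sum used to combine the three kernels, and the example sequences are all assumptions made for illustration. Only the position ranges (1-22, 1-15 and 16-22) come from the legend above.

```python
from collections import Counter


def spectrum_kernel(x, y, k=3):
    """Count k-mers shared by x and y, with multiplicity, ignoring position."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[m] * cy[m] for m in cx)


def weighted_degree_kernel(x, y, max_degree=3):
    """Count k-mers (k = 1..max_degree) that match at the same position.

    Uses the commonly cited weights beta_k = 2(d - k + 1) / (d(d + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    d = max_degree
    total = 0.0
    for k in range(1, d + 1):
        beta = 2.0 * (d - k + 1) / (d * (d + 1))
        matches = sum(x[i:i + k] == y[i:i + k] for i in range(len(x) - k + 1))
        total += beta * matches
    return total


def combined_kernel(x, y, k=3, max_degree=3):
    """Assumed combination: plain sum of the three kernels on 22-nt guides."""
    return (weighted_degree_kernel(x[:22], y[:22], max_degree)  # positions 1-22
            + spectrum_kernel(x[:15], y[:15], k)                # positions 1-15
            + spectrum_kernel(x[15:22], y[15:22], k))           # positions 16-22


# Made-up 22-nt guide sequences, for illustration only.
guide_a = "UUAGCAUGACUGACUAGCUAGC"
guide_b = "UUAGCAUGACUGACUAGCUUGC"
similarity = combined_kernel(guide_a, guide_b)
```

A sum of valid kernels is itself a valid kernel, which is why the combined value could be passed to an SVM as a precomputed similarity.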

Supplementary Figure 3 Calibration of the sequential SVM classifier SplashRNA.

(a) Precision-recall trade-off between the two classifiers SplashmiR-30 and SplashmiR-E. Selection of alpha (α) and theta (θ) hyperparameters leads to varied performance (area under the precision-recall curve, auPR) on the TILE miR-30 (x-axis) and miR-E + UltramiR (y-axis) sets. Each line represents a setting of alpha; points on the line represent distinct theta values. The circle indicates the alpha and theta choices for the final sequential classifier (SplashRNA: α = 0.6, θ = 1.1). The dashed line represents the performance of the convex linear classifier without a threshold at every alpha. Note that the performance of a sequential classifier equals or exceeds that of a linear combination since one can set the threshold (θ) to a small enough value such that all examples are evaluated by both classifiers. (b) Performance on the TILE set, varying the value for theta with alpha set to 0.6. The inset shows a zoom-in on the first 15% of the precision-recall curve. (c) Performance on the miR-E + UltramiR set, varying the value for theta with alpha set to 0.6.
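
The sequential scheme in panel (a) can be sketched as follows. This is a hedged illustration, not the published scoring rule. It assumes that θ is applied to the first classifier's score, that α weights the first classifier and 1 − α the second, and that an early-rejected sequence keeps its first-stage score. The function name `sequential_score` and the toy classifiers are hypothetical; only α = 0.6 and θ = 1.1 come from the legend.

```python
import math


def sequential_score(seq, first_clf, second_clf, alpha=0.6, theta=1.1):
    """Score seq with a two-stage cascade.

    Returns (score, passed_first_stage). The second classifier is only
    evaluated when the first-stage score reaches the threshold theta.
    """
    s1 = first_clf(seq)
    if s1 < theta:
        # Assumption: rejected sequences keep their first-stage score.
        return s1, False
    s2 = second_clf(seq)
    # Assumption: convex combination with alpha on the first classifier.
    return alpha * s1 + (1.0 - alpha) * s2, True


# Toy stand-ins for the two classifiers (hypothetical, for illustration only).
def toy_first(seq):
    return 2.0 if seq.startswith("U") else 0.5


def toy_second(seq):
    return 1.0


cascade = sequential_score("UUAGC", toy_first, toy_second)

# With theta = -inf every sequence reaches the second stage, so the cascade
# reduces to the plain convex combination (the dashed line in panel a).
linear = sequential_score("CUAGC", toy_first, toy_second, theta=-math.inf)
```

Setting θ very low recovers the linear combination as a special case, which matches the legend's argument that the sequential classifier can do no worse than the linear one.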

Supplementary Figure 4 Prediction performance of SplashRNA.

(a) Precision-recall curves on the TILE data set, comparing leave-one-gene-out nested cross-validation predictions from SplashRNA (auPR: 0.696) and SplashmiR-30 (auPR: 0.699) against the alternative prediction tools DSIR (auPR: 0.594), seqScore (auPR: 0.526) and miR_Scan (auPR: 0.449). (b) Score distribution of the mRas + hRAS set (DSIR + Sensor rules selected). The green line indicates the threshold (Online Methods, Supplementary Table 1). (c) Prediction performance comparison of the indicated algorithms on the external mRas + hRAS Sensor data set (Supplementary Table 1). SplashRNA outperformed the other algorithms. (d) Score distributions of the miR-E and UltramiR data sets. For the miR-E set, the threshold was set to 80 (green line, Online Methods). The UltramiR set represents the distribution of log depletion scores of shRNAs tested in a cell-viability screen (Supplementary Table 1). (e) SplashRNA and DSIR based re-ranking of shERWOOD selected UltramiR shRNAs targeting essential genes that were tested in a cell-viability screen. X-axis: mean SplashRNA or DSIR score for equally sized groups (purple and blue dots, 20 groups) of 39 shRNAs each. Y-axis: Percent of shRNAs in each group that were potent (Online Methods). SplashRNA and DSIR were compared against the published minimum (Min), median (Med) and maximum (Max) shERWOOD algorithm performance on the same data set (green-brown dots). (f) Retrospective potency prediction of shRNAs from a large-scale essential genes RNAi screen. The biological screen used 20-25 miR-E-like shRNAs per gene to identify essential genes. shRNA potency was quantified by assessing their log fold changes (Online Methods). For each of the top 50 essential genes, all tested algorithms selected their top and bottom five sequences by prediction score. Log fold changes for all selected shRNAs across the 50 genes were compared. SplashRNA achieved the most significant discrimination between top and bottom predictions (p = 1.8e-11, one-sided Wilcoxon rank sum test). seqScore (p = 2.3e-5) was used to generate the initial library of approximately 25 shRNAs per gene. (g) Retrospective potency prediction of shRNAs from a large-scale toxin resistance and sensitivity RNAi screen. The biological screen used 25 miR-E-like shRNAs per gene to identify resistance and sensitivity genes. shRNA potency was quantified by assessing their log fold changes (Online Methods). For each of the top 20 sensitivity genes, all tested algorithms selected their top and bottom five sequences by prediction score. Log fold changes for all selected shRNAs across the 20 genes were compared. SplashRNA was the only algorithm to achieve significant discrimination between the top and bottom predictions at p < 0.01 (p = 4.8e-4, one-sided Wilcoxon rank sum test). Of note, SplashRNA also outperformed the other algorithms when selecting smaller or larger numbers of top sensitivity genes from the biological screen (data not shown). seqScore was used to generate the initial library of approximately 25 shRNAs per gene.
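
The auPR values reported in panel (a) summarize how well a score ranks potent shRNAs ahead of weak ones. The sketch below uses average precision, one common estimator of the area under the precision-recall curve. Whether the authors used this estimator or a different interpolation is not stated here, so treat the choice as an assumption. The function name and example values are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision of a ranking.

    scores: prediction scores, higher means predicted more potent.
    labels: truthy for potent (positive) shRNAs, falsy otherwise.
    Ties are broken by input order (Python's sort is stable).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Made-up scores and potency labels, for illustration only.
example_scores = [0.9, 0.8, 0.7, 0.6]
example_labels = [1, 0, 1, 0]
example_ap = average_precision(example_scores, example_labels)
```

A perfect ranking, with every potent shRNA scored above every weak one, gives a value of 1.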

Supplementary Figure 5 Transcript selection.

(a) Distribution of shRNA potency in functionally distinct transcript regions. Shown is the potency distribution of shRNAs in the unbiased TILE data set that target the 5’UTR, CDS or 3’UTR. Since these shRNAs were evaluated using the Sensor assay, their targets are not subject to alternative cleavage and polyadenylation (ApA) and/or splicing events. (b) AU content of potent and weak miR-30 shRNAs from the unbiased TILE set. Potent shRNAs tend to have a higher proportion of A/U nucleotides (p < 2.2e-16, two-sided Kolmogorov-Smirnov test). (c) AU content of functionally distinct transcript regions in the human genome. Shown are the AU densities in 5’UTR, CDS and 3’UTR. (d) AU content in mouse transcripts. (e) Alternative cleavage and polyadenylation (ApA) prevents potent shRNAs from inhibiting their putative target gene. Immunoblotting of Pten in NIH/3T3s transduced at single-copy with LEPG expressing the indicated shRNAs. Nine top predictions targeting the CDS or the 3’UTR after early ApA sites were compared alongside controls for their ability to suppress mouse Pten. Actb was used as loading control. (f) Comparison of knockdown efficiency and annotation of ApA sites. Shown are potent Pten shRNA predictions and their position (start, end) on the mouse genome (mm9). KD indicates a qualitative degree of the knockdown observed in immunoblotting analyses of NIH/3T3s (e). ApA indicates previously published positions on the mouse genome (mm9) of ApA sites (alternative 3’ ends) identified in NIH/3T3 and mouse ES cells by 3P-Seq. 2P-Seq shows the quantification of transcript expression levels measured by 2P-Seq. All shRNAs and ApA sites are ordered according to their position along the mouse genome.
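
The AU content compared in panels (b) to (d) is the fraction of A and U nucleotides in a sequence. A minimal sketch is shown below. It assumes that DNA-style input is acceptable and treats T as U, and the function name `au_content` is hypothetical.

```python
def au_content(seq):
    """Fraction of A/U nucleotides in seq (T is treated as U)."""
    seq = seq.upper().replace("T", "U")
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "AU" for base in seq) / len(seq)


# Made-up guide sequence, for illustration only.
example_au = au_content("UUAGCAUGACUGACUAGCUAGC")
```

Applied to the guides in a potent set and a weak set, this yields two distributions that could then be compared with a two-sample test such as the Kolmogorov-Smirnov test named in panel (b).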

Supplementary Figure 6 Extensive validation of de novo SplashRNA predictions.

(a-f) Western blot validation of de novo SplashRNA predictions. All shRNAs were expressed using LEPG at single-copy conditions. β-Actin (Actb, ACTB) was used for normalization. (a) Immunoblotting of Pbrm1 in NIH/3T3s (median KD: 97%, median SplashRNA score: 1.7). (b) Immunoblotting of Rela in NIH/3T3s (median KD: 90%, median SplashRNA score: 1.1). (c) Immunoblotting of Bcl2l11 in NIH/3T3s (median KD: 97%, median SplashRNA score: 0.7). (d) Immunoblotting of Axin1 in NIH/3T3s (median KD: 95%, median SplashRNA score: 1.3). (e) Schematic of the multiple human NF2 transcript variants. NF2 has nine variants with an intersection of only 198 nucleotides, excluding the 5’UTR, rendering the prediction task especially difficult due to limited sequence space. (f) Predicting miR-E shRNAs for extremely short transcripts. Immunoblotting of NF2 in A375s transduced with the indicated shRNAs targeting all nine NF2 variants (median KD: 89%, median SplashRNA score: 0.6). (g) Comparison of SplashRNA and DSIR predictions against CRISPR-Cas9 mediated suppression of Cd9 in mouse embryonic fibroblasts (MEFs). Shown are normalized (relative to the indicated controls) median anti-Cd9-APC fluorescence intensities of RRT-MEFs and CRT-MEFs expressing the indicated shRNAs or sgRNAs (Online Methods). The six top-scoring predictions from DSIR + Sensor rules (DSIR) or SplashRNA (ordered according to their respective scores) were compared to six sgRNA sequences (Supplementary Table 2). *, Cd9.1137 is the top prediction from both algorithms and was plotted twice for clarity. While DSIR predictions triggered Cd9 knockdown with variable efficacy, SplashRNA predictions consistently induced strong Cd9 suppression, closely approaching knockout conditions. (h) Transfer function of SplashRNA score versus protein knockdown for all 62 de novo predicted shRNAs validated by immunofluorescence (Supplementary Table 2). Green triangles indicate the minimum knockdown for 80% of the predictions for a given SplashRNA score bin. Bins were defined to have a width of 0.5 with the leftmost bin starting at 0.25. For the bin centered on SplashRNA score = 1, 80% of predictions showed at least 86% protein knockdown. The expected knockdown for the top 80% of predictions (e.g. 4/5 shRNAs) increases with the SplashRNA score. Overall, 91% of predictions with a SplashRNA score >1 showed more than 85% protein knockdown. (i) Uncropped images of Pten (Figure 2d) and Bap1 (Figure 2e) western blots, and their respective β-Actin controls. Pten predicted molecular weight (MW): 47 kDa; MW validated by Cell Signaling Technology: 54 kDa. Bap1 predicted MW: 80 kDa; MW validated by Bethyl Laboratories: 80-95 kDa. β-Actin MW validated by Sigma-Aldrich: 42 kDa.
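
The binning behind the green triangles in panel (h) can be sketched as below. The bin width of 0.5, the left edge of 0.25 and the 80% fraction come from the legend. The half-open bin boundaries, the handling of scores below 0.25, the function name and the example values are assumptions made for illustration.

```python
import math
from collections import defaultdict


def knockdown_floor_by_bin(scores, knockdowns, start=0.25, width=0.5, fraction=0.8):
    """Per score bin, return the knockdown reached by the top `fraction` of shRNAs.

    Returns a dict mapping bin center to the minimum knockdown among the
    best `fraction` of predictions in that bin.
    """
    bins = defaultdict(list)
    for score, kd in zip(scores, knockdowns):
        if score < start:
            continue  # assumption: scores left of the first bin are ignored
        idx = math.floor((score - start) / width)  # half-open bins [lo, hi)
        bins[idx].append(kd)
    floors = {}
    for idx in sorted(bins):
        ranked = sorted(bins[idx], reverse=True)
        # Small epsilon guards against floating-point overshoot in the product.
        n_keep = max(1, math.ceil(fraction * len(ranked) - 1e-9))
        center = start + (idx + 0.5) * width
        floors[center] = ranked[n_keep - 1]
    return floors


# Made-up scores and knockdown percentages, for illustration only.
example_floors = knockdown_floor_by_bin(
    [0.8, 0.9, 1.0, 1.1, 1.2],
    [95, 90, 86, 99, 40],
)
```

In this toy input all five scores fall into the bin centered on 1.0, and four of five shRNAs reach at least 86% knockdown, so the floor for that bin is 86.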

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6 and Supplementary Table 1 (PDF 2720 kb)

Supplementary Table 2

Novel datasets and sequences of validated shRNAs (XLSX 4243 kb)

Supplementary Table 3

Genome-wide SplashRNA predictions for all human and mouse protein coding genes. (XLSX 25766 kb)

Supplementary Code

Source code that implements the main SplashRNA algorithm (ZIP 2201 kb)


Cite this article

Pelossof, R., Fairchild, L., Huang, C. et al. Prediction of potent shRNAs with a sequential classification algorithm. Nat Biotechnol 35, 350–353 (2017).
