Brief Communication

Deep-learning augmented RNA-seq analysis of transcript splicing

Nature Methods volume 16, pages 307–310 (2019)

Abstract

A major limitation of RNA sequencing (RNA-seq) analysis of alternative splicing is its reliance on high sequencing coverage. We report DARTS (https://github.com/Xinglab/DARTS), a computational framework that integrates deep-learning-based predictions with empirical RNA-seq evidence to infer differential alternative splicing between biological samples. DARTS leverages public RNA-seq big data to provide a knowledge base of splicing regulation via deep learning, thereby helping researchers better characterize alternative splicing using RNA-seq datasets even with modest coverage.

Code availability

The DARTS program, trained model parameters, and predictive features are provided at GitHub (https://github.com/Xinglab/DARTS).

Data availability

The RNA-seq data that support the findings of the deep learning models are available from the ENCODE project (https://www.encodeproject.org/) and the Roadmap Epigenomics project (http://www.roadmapepigenomics.org/). The H358 time-course RNA-seq data were downloaded from GEO accession GSE75492. The PC3E-GS689 RNA-seq data and RASL-seq data can be accessed from GEO under accession GSE112037.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

We thank X.-D. Fu (UCSD) for the RASL oligos and advice on RASL-seq. This study is supported by National Institutes of Health grants (R01GM088342, R01GM117624, U01HG007912, and U01CA233074 to Y.X.). Z.Z. is partially supported by a UCLA Dissertation Year Fellowship.

Author information

Author notes

  1. These authors contributed equally: Zijun Zhang, Zhicheng Pan.

Affiliations

  1. Bioinformatics Interdepartmental Graduate Program, University of California, Los Angeles, Los Angeles, CA, USA

    Zijun Zhang, Zhicheng Pan & Yi Xing
  2. Department of Microbiology, Immunology & Molecular Genetics, University of California, Los Angeles, Los Angeles, CA, USA

    Yi Ying, Zhijie Xie, Samir Adhikari, Douglas L. Black & Yi Xing
  3. Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA

    Samir Adhikari & Yi Xing
  4. Department of Molecular and Medical Pharmacology, University of California, Los Angeles, Los Angeles, CA, USA

    John Phillips
  5. Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA

    Russ P. Carstens
  6. Department of Statistics, University of California, Los Angeles, Los Angeles, CA, USA

    Yingnian Wu
  7. Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA

    Yi Xing

Authors

  1. Zijun Zhang

  2. Zhicheng Pan

  3. Yi Ying

  4. Zhijie Xie

  5. Samir Adhikari

  6. John Phillips

  7. Russ P. Carstens

  8. Douglas L. Black

  9. Yingnian Wu

  10. Yi Xing

Contributions

Z.Z. and Y.X. conceived the study; Z.Z., Y.W., and Y.X. designed the research; Z.Z., Z.P., Y.Y., S.A., and J.P. performed the research; Z.X., R.P.C., and D.L.B. contributed analytic tools; Z.Z. and Y.X. analyzed data; and Z.Z. and Y.X. wrote the paper with input from all other authors.

Competing interests

Y.X. and D.L.B. are scientific cofounders of Panorama Medicine. Z.Z. and Y.X. are in the process of filing a patent application for DARTS.

Corresponding author

Correspondence to Yi Xing.

Integrated supplementary information

  1. Supplementary Figure 1 Schematic overview of the DARTS DNN model.

    The DARTS DNN model consists of four hidden layers and 7,923,402 parameters. Batch normalization and drop-out of hidden variables are implemented during training to mitigate overfitting.

  2. Supplementary Figure 2 Performance comparison of DARTS BHT(flat), MISO, and MATS using simulated RNA-seq data generated by Flux simulator.

    We derived the transcriptome profiles from a real RNA-seq dataset with widespread splicing changes (E-MTAB-1147; knockdown of splicing factor HNRNPC in the HeLa cell line) and plugged them into Flux simulator as ground truth to simulate RNA-seq reads. a,b, We then computed (a) AUROC and (b) AUPR for each statistical method by labeling the exon-skipping events with ground-truth \(|\Delta\psi| > 0.05\) as positive and \(|\Delta\psi| \le 0.05\) as negative (details are provided in the Supplementary Notes). DARTS BHT(flat) performed favorably compared with MISO and MATS.

  3. Supplementary Figure 3 Performance comparison of DARTS BHT(flat) with replicates versus DARTS BHT(flat) on pooled data and rMATS with replicates on n = 3,000 independent simulated events.

    We fixed the total RNA-seq read counts (coverage per replicate × number of replicates) while varying the number of replicates (K), within-group variance (sigma), and whether there is one outlier sample. The replicate DARTS model (rDARTS) outperforms DARTS on pooled data when there are outlier samples (b,e) or when the within-group variance is large (c,f).

  4. Supplementary Figure 4 The performance of the DARTS DNN during cross-validation and testing as training progressed.

    The maximum AUROC was 0.97 during cross-validation and 0.86 during testing.

  5. Supplementary Figure 5 Relationship of DARTS posterior, prior, and the amount of observed RNA-seq read counts.

    For a fixed absolute PSI difference between the two conditions (i.e., the effect size δ), we calculated the posterior probability \(P(|\delta| > 0.05 \mid I, S)\) from simulated data by varying the prior probability and the amount of read counts. The prior’s effect on the DARTS posterior diminished when the number of observed read counts was large (>100) and/or with large effect sizes (δ = 0.3). For events with moderate or low numbers of read counts, a strong informative prior improves the inference accuracy.

  6. Supplementary Figure 6 Application of the DARTS DNN to different classes of alternative splicing patterns.

    a,c,e, The performance of the DARTS DNN during cross-validation and testing as training progressed for alternative 5′ splice sites (A5SS), alternative 3′ splice sites (A3SS), and retained introns (RI) as measured by AUROC. b,d,f, Comparison of the DARTS DNN with baseline methods using independent leave-out datasets. The DARTS DNN outperforms baseline methods trained on individual leave-out datasets by a large margin. Note that in these analyses the DARTS DNN is trained using combined ENCODE and Roadmap RNA-seq datasets, with certain pairwise comparisons held out for benchmarking as independent leave-out datasets.

  7. Supplementary Figure 7 An example of the DARTS DNN prediction for the PLEKHA1 gene in the H358 EMT time-course RNA-seq data.

    The genome browser view represents aggregated RNA-seq signals from three biological replicates. The DARTS DNN score for this exon is 0.94 in day 5 versus day 0, increasing the posterior probability of differential splicing to 0.73 over 0.42 when using RNA-seq data alone. The differential splicing pattern of this exon was apparent throughout the time course and was previously validated by RT-PCR.

  8. Supplementary Figure 8 Meta-exon motif analysis of the ESRP motif.

    a,b, For the comparison of day 6 versus day 0 on the H358 time-course RNA-seq data, we calculated ESRP motif scores for (a) all DARTS BHT(flat) significant events and (b) DARTS BHT(flat) significant events that become insignificant in DARTS BHT(info). The latter set of events does not have enrichment of the ESRP motif.

  9. Supplementary Figure 9 Characteristics of the DARTS DNN predicted events.

    a,b, The cumulative distribution functions of (a) gene expression levels (TPM values) and (b) RNA-seq read coverage for DARTS-DNN-predicted differential events and RNA-seq differential events. The DARTS-DNN-predicted differential events come from genes with significantly lower expression levels and have significantly lower RNA-seq read coverage than the RNA-seq differential events (two-sided Wilcoxon test). c, DARTS BHT(info) outperforms baseline methods that use RNA-seq data alone to call differential splicing (DARTS BHT(flat), rMATS, and SUPPA2), as benchmarked using ground truth defined by RASL-seq. d, DARTS BHT(info) outperforms baseline methods at different FPR thresholds for DARTS-DNN-predicted differential events (n represents the number of alternative splicing events), with the maximum gain observed for the most confidently predicted events with FPR = 1%.

  10. Supplementary Figure 10 Ranking by DARTS BHT on simulated data when using different t1 and t2 values.

    The results of DARTS BHT are robust to different choices of parameters, especially for the inference of differential alternative splicing events (upper right corner in each panel).
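
The benchmark described for Supplementary Figure 2 labels events with ground-truth |Δψ| > 0.05 as positive and the rest as negative, then scores each method by AUROC and AUPR. The sketch below shows one way such a calculation could be set up in plain Python. It is not taken from the DARTS code base: the function names, the rank-based AUROC estimator, and the use of average precision as the AUPR estimate are all assumptions made for illustration, and the example values are invented.

```python
def label_events(true_delta_psi, cutoff=0.05):
    """Label an event 1 (differential) if |delta psi| > cutoff, else 0."""
    return [1 if abs(d) > cutoff else 0 for d in true_delta_psi]


def auroc(scores, labels):
    """Rank-based AUROC: the fraction of (positive, negative) pairs in which
    the positive event has the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Average precision, one common estimate of the area under the
    precision-recall curve (ties in scores are not treated specially)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / hits


# Invented example: ground-truth delta psi per event and a method's scores.
true_delta_psi = [0.30, -0.12, 0.02, 0.00, 0.08, -0.04]
method_scores = [0.95, 0.80, 0.40, 0.10, 0.35, 0.60]

labels = label_events(true_delta_psi)
roc = auroc(method_scores, labels)
ap = average_precision(method_scores, labels)
print(labels, round(roc, 3), round(ap, 3))
```

For large event sets, a library routine such as scikit-learn's `roc_auc_score` would normally replace the quadratic pairwise loop used here.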

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–10, Supplementary Notes

  2. Reporting Summary

  3. Supplementary Table 1

    A complete list of cis RNA sequence features and trans RNA-binding protein features used in the DARTS DNN model. The cis RNA sequence features for each type of alternative splicing pattern (skipped exons, alternative 5′ splice sites, alternative 3′ splice sites, retained introns) are shown in separate tabs.

  4. Supplementary Table 2

    Statistics of the significant differential and unchanged alternative splicing events called by DARTS BHT(flat) in ENCODE RBP knockdown RNA-seq experiments used to train the DARTS DNN model.

  5. Supplementary Table 3

    Performance of DARTS DNN, DNN, logistic regression, and random forest with ENCODE leave-out datasets.

  6. Supplementary Table 4

    Performance of DARTS BHT(info) and DARTS BHT(flat) in cell-type-specific splicing analysis using individual replicates of ENCODE HepG2 versus K562 RNA-seq data.

  7. Supplementary Table 5

    Statistics of the significant differential and unchanged alternative splicing events called by DARTS BHT(flat) in pairwise comparisons of Roadmap RNA-seq data used for training the DARTS DNN model.

  8. Supplementary Table 6

    A complete list of DARTS-DNN-predicted differential (n = 52) and unchanged (n = 29) alternative splicing events in the comparison of GS689 versus PC3E. Prior probability predicted by the DARTS DNN, posterior probability of DARTS BHT(flat) and DARTS BHT(info), and P values and FDR from rMATS are reported.
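
The prior and posterior probabilities reported above, and the behavior summarized for Supplementary Figure 5, can be illustrated with a toy calculation. The sketch below is a simplified stand-in, not the DARTS BHT model: it assumes binomial read counts in each condition, a two-component normal mixture prior on δ whose "differential" weight plays the role of the prior probability, and a coarse grid approximation. The function name, the standard deviations `sd_null` and `sd_diff`, and the read counts are hypothetical.

```python
import math


def normal_pdf(x, sd):
    """Density of a zero-mean normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior_differential(i1, s1, i2, s2, prior, cutoff=0.05,
                           sd_null=0.03, sd_diff=0.3, steps=99):
    """Grid approximation of P(|delta| > cutoff | I, S) in a toy model.

    i1, s1 / i2, s2: inclusion and skipping read counts in conditions 1 and 2.
    prior: prior probability that the event is differential (assumed to be
    the weight of the wide mixture component; a hypothetical choice).
    """
    n = steps + 1
    cells = []  # (delta, narrow prior weight, wide prior weight, log-likelihood)
    for k1 in range(1, n):
        psi1 = k1 / n
        for k2 in range(1, n):
            psi2 = k2 / n
            delta = (k2 - k1) / n
            loglik = (i1 * math.log(psi1) + s1 * math.log(1 - psi1)
                      + i2 * math.log(psi2) + s2 * math.log(1 - psi2))
            cells.append((delta, normal_pdf(delta, sd_null),
                          normal_pdf(delta, sd_diff), loglik))

    # Normalize each prior component over the grid so that `prior` is
    # exactly the prior weight of the differential component.
    z_null = sum(c[1] for c in cells)
    z_diff = sum(c[2] for c in cells)
    max_ll = max(c[3] for c in cells)  # subtract to avoid underflow

    num = 0.0
    den = 0.0
    for delta, w_null, w_diff, loglik in cells:
        w = ((1 - prior) * w_null / z_null + prior * w_diff / z_diff)
        w *= math.exp(loglik - max_ll)
        den += w
        if abs(delta) > cutoff:
            num += w
    return num / den


# Low coverage: the prior strongly shapes the posterior.
low_weak = posterior_differential(3, 1, 2, 2, prior=0.1)
low_strong = posterior_differential(3, 1, 2, 2, prior=0.9)

# High coverage with a large effect: the data dominate the prior.
high_weak = posterior_differential(90, 10, 50, 50, prior=0.1)
high_strong = posterior_differential(90, 10, 50, 50, prior=0.9)

print(low_weak, low_strong, high_weak, high_strong)
```

In this toy setting the two low-coverage posteriors differ substantially while the two high-coverage posteriors nearly coincide, which matches the qualitative pattern described for Supplementary Figure 5.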

About this article

DOI

https://doi.org/10.1038/s41592-019-0351-9