Deep-learning augmented RNA-seq analysis of transcript splicing

Zhang, Zijun; Pan, Zhicheng; Ying, Yi; Xie, Zhijie; Adhikari, Samir; Phillips, John; Carstens, Russ P.; Black, Douglas L.; Wu, Yingnian; Xing, Yi

doi:10.1038/s41592-019-0351-9

Brief Communication
Published: 25 March 2019

Zijun Zhang^1,na1 (ORCID: orcid.org/0000-0003-2016-8014),
Zhicheng Pan^1,na1,
Yi Ying^2,
Zhijie Xie^2,
Samir Adhikari^2,3,
John Phillips^4,
Russ P. Carstens^5,
Douglas L. Black^2,
Yingnian Wu^6 &
Yi Xing^1,2,3,7 (ORCID: orcid.org/0000-0001-9257-7613)

Nature Methods volume 16, pages 307–310 (2019)

Abstract

A major limitation of RNA sequencing (RNA-seq) analysis of alternative splicing is its reliance on high sequencing coverage. We report DARTS (https://github.com/Xinglab/DARTS), a computational framework that integrates deep-learning-based predictions with empirical RNA-seq evidence to infer differential alternative splicing between biological samples. DARTS leverages public RNA-seq big data to provide a knowledge base of splicing regulation via deep learning, thereby helping researchers better characterize alternative splicing using RNA-seq datasets even with modest coverage.

**Fig. 1: The DARTS computational framework.**

**Fig. 2: Performance evaluation of the DARTS BHT framework, and the influence of training datasets on the performance of the DARTS DNN.**

**Fig. 3: DARTS analysis of alternative splicing during the EMT.**

Code availability

The DARTS program, trained model parameters, and predictive features are provided at GitHub (https://github.com/Xinglab/DARTS).

Data availability

The RNA-seq data that support the findings of the deep learning models are available from the ENCODE project (https://www.encodeproject.org/) and the Roadmap Epigenomics project (http://www.roadmapepigenomics.org/). The H358 time-course RNA-seq data were downloaded from GEO accession GSE75492. The PC3E-GS689 RNA-seq data and RASL-seq data can be accessed from GEO under accession GSE112037.

Acknowledgements

We thank X.-D. Fu (UCSD) for the RASL oligos and advice on RASL-seq. This study is supported by National Institutes of Health grants (R01GM088342, R01GM117624, U01HG007912, and U01CA233074 to Y.X.). Z.Z. is partially supported by a UCLA Dissertation Year Fellowship.

Author information

These authors contributed equally: Zijun Zhang, Zhicheng Pan.

Authors and Affiliations

Bioinformatics Interdepartmental Graduate Program, University of California, Los Angeles, Los Angeles, CA, USA
Zijun Zhang, Zhicheng Pan & Yi Xing
Department of Microbiology, Immunology & Molecular Genetics, University of California, Los Angeles, Los Angeles, CA, USA
Yi Ying, Zhijie Xie, Samir Adhikari, Douglas L. Black & Yi Xing
Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
Samir Adhikari & Yi Xing
Department of Molecular and Medical Pharmacology, University of California, Los Angeles, Los Angeles, CA, USA
John Phillips
Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Russ P. Carstens
Department of Statistics, University of California, Los Angeles, Los Angeles, CA, USA
Yingnian Wu
Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA
Yi Xing

Contributions

Z.Z. and Y.X. conceived the study; Z.Z., Y.W., and Y.X. designed the research; Z.Z., Z.P., Y.Y., S.A., and J.P. performed the research; Z.X., R.P.C., and D.L.B. contributed analytic tools; Z.Z. and Y.X. analyzed data; and Z.Z. and Y.X. wrote the paper with input from all other authors.

Corresponding author

Correspondence to Yi Xing.

Ethics declarations

Competing interests

Y.X. and D.L.B. are scientific cofounders of Panorama Medicine. Z.Z. and Y.X. are in the process of filing a patent application for DARTS.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Schematic overview of the DARTS DNN model.

The DARTS DNN model consists of four hidden layers and 7,923,402 parameters. Batch normalization and drop-out of hidden variables are implemented during training to mitigate overfitting.
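As a sanity check on the stated size, the parameter count of a fully connected network can be tallied by hand. The sketch below assumes an input of 5,922 features, hidden widths of 1,200, 500, 300, and 200, a 2-unit output, and two trainable batch-normalization parameters (scale and shift) per hidden unit. None of these widths is given in this legend; they are assumptions, chosen here because they reproduce the 7,923,402 total. Drop-out contributes no trainable parameters.

```python
def dense_param_count(n_inputs, hidden_widths, n_outputs, batch_norm=True):
    """Count trainable parameters of a fully connected network."""
    total = 0
    fan_in = n_inputs
    for width in hidden_widths:
        total += fan_in * width + width  # weight matrix plus bias vector
        if batch_norm:
            total += 2 * width           # per-unit scale and shift
        fan_in = width
    total += fan_in * n_outputs + n_outputs  # output layer
    return total


# Assumed architecture (not stated in the legend above).
ASSUMED_INPUTS = 5922
ASSUMED_HIDDEN = [1200, 500, 300, 200]
ASSUMED_OUTPUTS = 2

total_params = dense_param_count(ASSUMED_INPUTS, ASSUMED_HIDDEN, ASSUMED_OUTPUTS)
print(total_params)  # 7923402
```

Under these assumed widths, the first layer alone accounts for about 7.1 million of the parameters, so the model size is dominated by the input feature dimension.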

Supplementary Figure 2 Performance comparison of DARTS BHT(flat), MISO, and MATS using simulated RNA-seq data generated by Flux simulator.

We derived the transcriptome profiles from a real RNA-seq dataset with widespread splicing changes (E-MTAB-1147; knockdown of splicing factor HNRNPC in the HeLa cell line) and plugged them into Flux simulator as ground truth to simulate RNA-seq reads. a,b, Then we computed (a) AUROC and (b) AUPR for each statistical method by labeling the exon-skipping events with ground-truth \(|\Delta\psi| > 0.05\) as positive and \(|\Delta\psi| \le 0.05\) as negative (details are provided in the Supplementary Notes). DARTS BHT(flat) performed favorably compared with MISO and MATS.
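The labeling rule in this legend can be illustrated with a small, self-contained sketch. It thresholds ground-truth ΔPSI values at 0.05 to obtain binary labels and computes AUROC from average ranks (the Mann-Whitney formulation). The function names and the six toy events are hypothetical and are not taken from the DARTS code or data.

```python
def label_events(true_delta_psi, cutoff=0.05):
    """Positive if |delta PSI| exceeds the cutoff, negative otherwise."""
    return [abs(d) > cutoff for d in true_delta_psi]


def auroc(labels, scores):
    """Rank-based AUROC; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy example: six simulated events with known delta PSI and method scores.
toy_delta_psi = [0.30, -0.12, 0.02, 0.00, 0.07, -0.01]
toy_scores = [0.95, 0.80, 0.40, 0.10, 0.35, 0.20]
toy_labels = label_events(toy_delta_psi)
toy_auroc = auroc(toy_labels, toy_scores)  # 8 of 9 positive/negative pairs ranked correctly
```

In the toy data, the only misordered pair is the positive event scored 0.35 against the negative event scored 0.40, which gives an AUROC of 8/9.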

Supplementary Figure 3 Performance comparison of DARTS BHT(flat) with replicates versus DARTS BHT(flat) on pooled data and rMATS with replicates on n = 3,000 independent simulated events.

We fixed the total RNA-seq read counts (coverage per replicate × number of replicates) while varying the number of replicates (K), within-group variance (sigma), and whether there is one outlier sample. The replicate DARTS model (rDARTS) outperforms DARTS on pooled data when there are outlier samples (b,e) or when the within-group variance is large (c,f).

Supplementary Figure 4 The performance of the DARTS DNN during cross-validation and testing as training progressed.

The maximum AUROC was 0.97 during cross-validation and 0.86 during testing.

Supplementary Figure 5 Relationship of DARTS posterior, prior, and the amount of observed RNA-seq read counts.

For a fixed absolute PSI difference between the two conditions (i.e., the effect size δ), we calculated the posterior probability \(P(|\delta| > 0.05 \mid I, S)\) from simulated data by varying the prior probability and the amount of read counts. The prior’s effect on the DARTS posterior diminished when the number of observed read counts was large (>100) and/or with large effect sizes (δ = 0.3). For events with moderate or low numbers of read counts, a strong informative prior improves the inference accuracy.
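The interplay between prior and read counts can be reproduced qualitatively with a toy grid approximation. The sketch below is not the DARTS BHT implementation. It assumes a binomial likelihood on inclusion (I) and skipping (S) read counts, a uniform prior on the first condition's PSI, and a two-component normal mixture prior on δ whose mixing weight plays the role of the prior probability of differential splicing. The standard deviations (0.3 and 0.03) and all function names are illustrative assumptions.

```python
import math


def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_differential(inc1, skp1, inc2, skp2, prior_diff,
                           sd_diff=0.3, sd_null=0.03, cutoff=0.05, n_grid=201):
    """Grid approximation of P(|delta| > cutoff | I, S) under the toy model."""
    # Midpoint grid on (0, 1); with n_grid = 201 no grid delta equals the cutoff.
    psi = [(i + 0.5) / n_grid for i in range(n_grid)]
    log_psi = [math.log(p) for p in psi]
    log_one_minus = [math.log(1.0 - p) for p in psi]
    log_weights, beyond_cutoff = [], []
    for i in range(n_grid):
        loglik1 = inc1 * log_psi[i] + skp1 * log_one_minus[i]
        for j in range(n_grid):
            delta = (j - i) / n_grid  # psi2 - psi1
            prior = (prior_diff * normal_pdf(delta, sd_diff)
                     + (1.0 - prior_diff) * normal_pdf(delta, sd_null))
            loglik2 = inc2 * log_psi[j] + skp2 * log_one_minus[j]
            log_weights.append(loglik1 + loglik2 + math.log(prior))
            beyond_cutoff.append(abs(delta) > cutoff)
    peak = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_weights]
    mass_beyond = sum(w for w, b in zip(weights, beyond_cutoff) if b)
    return mass_beyond / sum(weights)


# Low coverage (20 reads per condition): the prior moves the posterior.
low_cov_weak_prior = posterior_differential(8, 12, 13, 7, prior_diff=0.1)
low_cov_strong_prior = posterior_differential(8, 12, 13, 7, prior_diff=0.9)

# High coverage (1,000 reads per condition, delta near 0.3): the data dominate.
high_cov_weak_prior = posterior_differential(400, 600, 700, 300, prior_diff=0.1)
high_cov_strong_prior = posterior_differential(400, 600, 700, 300, prior_diff=0.9)
```

Under these assumptions, the two low-coverage posteriors differ noticeably, whereas the two high-coverage posteriors are both close to 1, mirroring the trend described in the legend.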

Supplementary Figure 6 Application of the DARTS DNN to different classes of alternative splicing patterns.

a,c,e, The performance of the DARTS DNN during cross-validation and testing as training progressed for alternative 5′ splice sites (A5SS), alternative 3′ splice sites (A3SS), and retained introns (RI) as measured by AUROC. b,d,f, Comparison of the DARTS DNN with baseline methods using independent leave-out datasets. The DARTS DNN outperforms baseline methods trained on individual leave-out datasets by a large margin. Note that in these analyses the DARTS DNN is trained using combined ENCODE and Roadmap RNA-seq datasets, with certain pairwise comparisons held out for benchmarking as independent leave-out datasets.

Supplementary Figure 7 An example of the DARTS DNN prediction for the PLEKHA1 gene in the H358 EMT time-course RNA-seq data.

The genome browser view represents aggregated RNA-seq signals from three biological replicates. The DARTS DNN score for this exon is 0.94 in day 5 versus day 0, increasing the posterior probability of differential splicing to 0.73 over 0.42 when using RNA-seq data alone. The differential splicing pattern of this exon was apparent throughout the time course and was previously validated by RT-PCR.

Supplementary Figure 8 Meta-exon motif analysis of the ESRP motif.

a,b, For the comparison of day 6 versus day 0 on the H358 time-course RNA-seq data, we calculated ESRP motif scores for (a) all DARTS BHT(flat) significant events and (b) DARTS BHT(flat) significant events that become insignificant in DARTS BHT(info). The latter set of events does not have enrichment of the ESRP motif.

Supplementary Figure 9 Characteristics of the DARTS DNN predicted events.

a,b, The cumulative density function of (a) gene expression levels (TPM values) and (b) RNA-seq read coverage for DARTS-DNN-predicted differential events and RNA-seq differential events. The DARTS-DNN-predicted differential events are from genes with significantly lower expression levels and have significantly lower RNA-seq read coverage compared with that of RNA-seq differential events (two-sided Wilcoxon test). c, DARTS BHT(info) outperforms baseline methods that use RNA-seq data alone to call differential splicing (DARTS BHT(flat), rMATS, and SUPPA2), as benchmarked using ground truth defined by RASL-seq. d, DARTS BHT(info) outperforms baseline methods at different FPR thresholds for DARTS-DNN-predicted differential events (n represents the number of alternative splicing events), with the maximum gain observed for the most confidently predicted events with FPR = 1%.

Supplementary Figure 10 Ranking by DARTS BHT on simulated data when using different t1 and t2 values.

The results of DARTS BHT are robust to different choices of parameters, especially for the inference of differential alternative splicing events (upper right corner in each panel).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10, Supplementary Notes

Reporting Summary

Supplementary Table 1

A complete list of cis RNA sequence features and trans RNA-binding protein features used in the DARTS DNN model. The cis RNA sequence features for each type of alternative splicing pattern (skipped exons, alternative 5′ splice sites, alternative 3′ splice sites, retained introns) are shown in separate tabs.

Supplementary Table 2

Statistics of the significant differential and unchanged alternative splicing events called by DARTS BHT(flat) in ENCODE RBP knockdown RNA-seq experiments used to train the DARTS DNN model.

Supplementary Table 3

Performance of DARTS DNN, DNN, logistic regression, and random forest with ENCODE leave-out datasets.

Supplementary Table 4

Performance of DARTS BHT(info) and DARTS BHT(flat) in cell-type-specific splicing analysis using individual replicates of ENCODE HepG2 versus K562 RNA-seq data.

Supplementary Table 5

Statistics of the significant differential and unchanged alternative splicing events called by DARTS BHT(flat) in pairwise comparisons of Roadmap RNA-seq data used for training the DARTS DNN model.

Supplementary Table 6

A complete list of DARTS-DNN-predicted differential (n = 52) and unchanged (n = 29) alternative splicing events in the comparison of GS689 versus PC3E. Prior probability predicted by the DARTS DNN, posterior probability of DARTS BHT(flat) and DARTS BHT(info), and P values and FDR from rMATS are reported.

About this article

Cite this article

Zhang, Z., Pan, Z., Ying, Y. et al. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat Methods 16, 307–310 (2019). https://doi.org/10.1038/s41592-019-0351-9

Received: 05 September 2018
Accepted: 01 February 2019
Published: 25 March 2019
Issue Date: April 2019
DOI: https://doi.org/10.1038/s41592-019-0351-9
