  • Protocol

Detection of aberrant gene expression events in RNA sequencing data

Abstract

RNA sequencing (RNA-seq) has emerged as a powerful approach to discover disease-causing gene regulatory defects in individuals affected by genetically undiagnosed rare disorders. Pioneering studies have shown that RNA-seq could increase the diagnosis rates over DNA sequencing alone by 8–36%, depending on the disease entity and tissue probed. To accelerate adoption of RNA-seq by human genetics centers, detailed analysis protocols are now needed. We present a step-by-step protocol that details how to robustly detect aberrant expression levels, aberrant splicing and mono-allelic expression in RNA-seq data using dedicated statistical methods. We describe how to generate and assess quality control plots and interpret the analysis results. The protocol is based on the detection of RNA outliers pipeline (DROP), a modular computational workflow that integrates all the analysis steps, can leverage parallel computing infrastructures and generates browsable web page reports.
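To make the idea of an expression outlier concrete, the following is a minimal sketch of a naive z-score screen on a gene-by-sample count matrix. It is not the statistical model used by DROP, which relies on dedicated methods; the function names, the threshold of 3 and the toy counts are all assumptions made up for illustration.

```python
import math
from statistics import mean, stdev


def expression_zscores(counts):
    """Map each gene to per-sample z-scores of log2(count + 1)."""
    zscores = {}
    for gene, values in counts.items():
        logged = [math.log2(v + 1) for v in values]
        mu, sd = mean(logged), stdev(logged)
        # A gene with identical values in all samples cannot have outliers.
        zscores[gene] = [0.0 if sd == 0 else (x - mu) / sd for x in logged]
    return zscores


def call_outliers(counts, samples, threshold=3.0):
    """Return (gene, sample, z) triples with |z| >= threshold."""
    calls = []
    for gene, zs in expression_zscores(counts).items():
        for sample, z in zip(samples, zs):
            if abs(z) >= threshold:
                calls.append((gene, sample, z))
    return calls


# Toy data (made up): 20 samples, two genes; sample S07 has a collapsed GENE_A count.
samples = [f"S{i:02d}" for i in range(20)]
gene_a = [90 + (i % 5) * 5 for i in range(20)]
gene_a[7] = 4
gene_b = [50 + (i % 4) * 3 for i in range(20)]
toy_counts = {"GENE_A": gene_a, "GENE_B": gene_b}

for gene, sample, z in call_outliers(toy_counts, samples):
    print(f"{gene}\t{sample}\tz = {z:.2f}")
```

Because the outlier itself inflates the standard deviation, a z-score computed this way can never exceed (n − 1)/√n in absolute value for n samples, so a cohort of 10 samples cannot reach a threshold of 3 at all. This is one reason a naive screen is only a teaching device and why cohort size matters for outlier detection.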

Fig. 1: Workflow overview.
Fig. 2: The aberrant expression module.
Fig. 3: The aberrant splicing module.
Fig. 4: MAE module.
Fig. 5: Downstream analysis of outlier results.

Data availability

A subset of the Geuvadis dataset (ref. 15) comprising 100 samples (Supplementary Data 1) was used to test and demonstrate the workflow. The original dataset is accessible without restriction at https://www.internationalgenome.org/data-portal/data-collection/geuvadis. The analyses performed in the ‘Dataset design’ section used the GTEx and Kremer et al. (ref. 7) datasets. The GTEx dataset was downloaded from the GTEx Portal on 12 June 2017, under the dbGaP accession number phs000424.v6.p1. The count matrices from the Kremer et al. dataset are available on Zenodo (https://doi.org/10.5281/zenodo.3887451).

Code availability

DROP, including a small demo dataset (10 samples, restricted to chromosome 21), is publicly available at https://github.com/gagneurlab/drop under the MIT license. The current version is 0.9.2, which is archived at https://doi.org/10.5281/zenodo.4106177. All plots, results and analyses of the test dataset can be found at https://www.cmm.in.tum.de/public/paper/drop_analysis/webDir/html/drop_analysis_index.html.

References

  1. Bamshad, M. J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat. Rev. Genet. 12, 745–755 (2011).

  2. Yang, Y. et al. Clinical whole-exome sequencing for the diagnosis of Mendelian disorders. N. Engl. J. Med. 369, 1502–1511 (2013).

  3. Taylor, J. C. et al. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat. Genet. 47, 717–726 (2015).

  4. Lionel, A. C. et al. Improved diagnostic yield compared with targeted gene sequencing panels suggests a role for whole-genome sequencing as a first-tier genetic test. Genet. Med. 20, 435–443 (2018).

  5. Chong, J. X. et al. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 97, 199–215 (2015).

  6. Cooper, G. M. Parlez-vous VUS? Genome Res. 25, 1423–1426 (2015).

  7. Kremer, L. S. et al. Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat. Commun. 8, 15824 (2017).

  8. Cummings, B. B. et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, eaal5209 (2017).

  9. Frésard, L. et al. Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts. Nat. Med. 25, 911–919 (2019).

  10. Gonorazky, H. D. et al. Expanding the boundaries of RNA sequencing as a diagnostic tool for rare Mendelian disease. Am. J. Hum. Genet. 104, 466–483 (2019).

  11. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  12. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

  13. Murdock, D. R. et al. Transcriptome-directed analysis for Mendelian disease diagnosis overcomes limitations of conventional genomic testing. J. Clin. Investig. https://doi.org/10.1172/JCI141500 (2020).

  14. Koster, J. & Rahmann, S. Snakemake–a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).

  15. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).

  16. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

  17. Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 50, 151–158 (2018).

  18. Brechtmann, F. et al. OUTRIDER: a statistical method for detecting aberrantly expressed genes in RNA sequencing data. Am. J. Hum. Genet. 103, 907–917 (2018).

  19. Mertes, C. et al. Detection of aberrant splicing events in RNA-Seq data with FRASER. Preprint at bioRxiv https://doi.org/10.1101/2019.12.18.866830 (2019).

  20. Köhler, S. et al. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 47, D1018–D1027 (2019).

  21. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).

  22. Papatheodorou, I. et al. Expression Atlas: gene and protein expression across multiple studies and organisms. Nucleic Acids Res. 46, D246–D251 (2018).

  23. Aicher, J. K., Jewell, P., Vaquero-Garcia, J., Barash, Y. & Bhoj, E. J. Mapping RNA splicing variations in clinically accessible and nonaccessible tissues to facilitate Mendelian disease diagnosis using RNA-seq. Genet. Med. 22, 1181–1190 (2020).

  24. Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–507 (2012).

  25. Scotti, M. M. & Swanson, M. S. RNA mis-splicing in disease. Nat. Rev. Genet. 17, 19–32 (2016).

  26. Singh, R. K. & Cooper, T. A. Pre-mRNA splicing in disease and therapeutics. Trends Mol. Med. 18, 472–482 (2012).

  27. Cheng, J. et al. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol. 20, 48 (2019).

  28. Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).

  29. Lee, H. et al. Diagnostic utility of transcriptome sequencing for rare Mendelian diseases. Genet. Med. 22, 490–499 (2019).

  30. Gonorazky, H. et al. RNAseq analysis for the diagnosis of muscular dystrophy. Ann. Clin. Transl. Neurol. 3, 55–60 (2016).

  31. Kernohan, K. D. et al. Whole-transcriptome sequencing in blood provides a diagnosis of spinal muscular atrophy with progressive myoclonic epilepsy. Hum. Mutat. 38, 611–614 (2017).

  32. Hamanaka, K. et al. RNA sequencing solved the most common but unrecognized NEB pathogenic variant in Japanese nemaline myopathy. Genet. Med. 21, 1629–1638 (2019).

  33. Wang, K. et al. Whole-genome DNA/RNA sequencing identifies truncating mutations in RBCK1 in a novel Mendelian disease with neuromuscular and cardiac involvement. Genome Med. 5, 67 (2013).

  34. Pervouchine, D. D., Knowles, D. G. & Guigo, R. Intron-centric estimation of alternative splicing from RNA-seq data. Bioinformatics 29, 273–274 (2013).

  35. Kapustin, Y. et al. Cryptic splice sites and split genes. Nucleic Acids Res. 39, 5837–5844 (2011).

  36. Mohammadi, P. et al. Genetic regulatory variation in populations informs transcriptome analysis in rare disease. Science 366, 351–356 (2019).

  37. Albers, C. A. et al. Compound inheritance of a low-frequency regulatory SNP and a rare null mutation in exon-junction complex subunit RBM8A causes TAR syndrome. Nat. Genet. 44, 435–439 (2012).

  38. van Haelst, M. M. et al. Further confirmation of the MED13L haploinsufficiency syndrome. Eur. J. Hum. Genet. 23, 135–138 (2015).

  39. Lindstrand, A. et al. Different mutations in PDE4D associated with developmental disorders with mirror phenotypes. J. Med. Genet. 51, 45–54 (2014).

  40. ’t Hoen, P. A. C. et al. Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol. 31, 1015–1022 (2013).

  41. Lee, S. et al. NGSCheckMate: software for validating sample identity in next-generation sequencing studies within and across data types. Nucleic Acids Res. 45, e103–e103 (2017).

  42. Castel, S. E., Mohammadi, P., Chung, W. K., Shen, Y. & Lappalainen, T. Rare variant phasing and haplotypic expression from RNA sequencing with phASER. Nat. Commun. 7, 12817 (2016).

  43. Mitelman, F., Johansson, B. & Mertens, F. The impact of translocations and gene fusions on cancer causation. Nat. Rev. Cancer 7, 233–245 (2007).

  44. Dai, X., Theobard, R., Cheng, H., Xing, M. & Zhang, J. Fusion genes: a promising tool combating against cancer. Biochim. Biophys. Acta Rev. Cancer 1869, 149–160 (2018).

  45. van Heesch, S. et al. Genomic and functional overlap between somatic and germline chromosomal rearrangements. Cell Rep. 9, 2001–2010 (2014).

  46. Oliver, G. R. et al. A tailored approach to fusion transcript identification increases diagnosis of rare inherited disease. PLoS One 14, e0223337 (2019).

  47. Tian, L. et al. CICERO: a versatile method for detecting complex and diverse driver fusions using cancer RNA sequencing data. Genome Biol. 21, 126 (2020).

  48. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

  49. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).

  50. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  51. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

  52. Van der Auwera, G. A. et al. in Current Protocols in Bioinformatics 11.10.1–11.10.33 (Wiley, 2013).

  53. Li, H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 27, 718–719 (2011).

  54. McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016).

  55. Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47, D853–D858 (2019).

  56. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).

  57. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 3 (2011).

  58. Ben-Kiki, O. & Evans, C. YAML Ain’t Markup Language (YAML™) Version 1.2. 80 https://yaml.org/spec/1.2/spec.html (2009).

  59. Anders, S., Pyl, P. T. & Huber, W. HTSeq-a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).

  60. McCarthy, D. J., Chen, Y. & Smyth, G. K. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 40, 4288–4297 (2012).

  61. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  62. Katz, Y. et al. Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics 31, 2400–2402 (2015).

  63. Amberger, J. S., Bocchini, C. A., Scott, A. F. & Hamosh, A. OMIM.org: leveraging knowledge across phenotype–gene relationships. Nucleic Acids Res. 47, D1038–D1043 (2019).

  64. Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–423 (2015).

Acknowledgements

The authors thank all the users who helped with their feedback during the revision, especially D. R. Murdock. We also thank C. Andrade for helping with the DROP logo, as well as the members of the Gagneur lab for input. The Bavaria California Technology Center supported C.M. through a fellowship. The German Bundesministerium für Bildung und Forschung (BMBF) supported the study through the e:Med Networking fonds AbCD-Net (FKZ 01ZX1706A to V.A.Y., C.M., and J.G.), the German Network for Mitochondrial Disorders (mitoNET; 01GM1113C to H.P.), the E-Rare project GENOMIT (01GM1920A to M.G. and H.P.), the Medical Informatics Initiative CORD-MI (Collaboration on Rare Diseases) to V.A.Y., and the ERA PerMed project PerMiM (01KU2016A to H.P. and J.G.). The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS.

Author information

Contributions

V.A.Y., C.M., M.F.M., and J.G. participated in the design of the workflow. V.A.Y., C.M., M.F.M., D.K-A., I.F.S., and P.F.G. contributed to the computational workflow. L.F. implemented the candidate prioritization workflow. L.W. designed and implemented wBuild. V.A.Y. and J.G. wrote the manuscript with the help of L.F., D.K-A., M.G., I.F.S., and H.P. C.M., H.P., and J.G. supervised the research. All authors revised the manuscript.

Corresponding author

Correspondence to Julien Gagneur.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Protocols thanks Anna Esteve-Codina and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Key references using this protocol

Kremer, L. et al. Nat. Commun. 8, 15824 (2017): https://doi.org/10.1038/ncomms15824

Murdock, D. R. et al. J. Clin. Invest. (2020): https://doi.org/10.1172/JCI141500

Key data used in this protocol

Kremer, L. et al. Nat. Commun. 8, 15824 (2017): https://doi.org/10.1038/ncomms15824

GTEx Consortium. Nature 550, 7675 (2017): https://doi.org/10.1038/nature24277

Lappalainen, T. et al. Nature 501, 7468 (2013): https://doi.org/10.1038/nature12531

Supplementary information

Supplementary Information

Supplementary Figs. 1–8 and Supplementary Methods.

Reporting Summary

Supplementary Data 1

Sample annotation of the test dataset.

Supplementary Data 2

Configuration file for the test dataset.

Cite this article

Yépez, V.A., Mertes, C., Müller, M.F. et al. Detection of aberrant gene expression events in RNA sequencing data. Nat Protoc 16, 1276–1296 (2021). https://doi.org/10.1038/s41596-020-00462-5

  • DOI: https://doi.org/10.1038/s41596-020-00462-5