RNA sequencing (RNA-seq) has emerged as a powerful approach to discover disease-causing gene regulatory defects in individuals affected by genetically undiagnosed rare disorders. Pioneering studies have shown that RNA-seq could increase the diagnosis rates over DNA sequencing alone by 8–36%, depending on the disease entity and tissue probed. To accelerate adoption of RNA-seq by human genetics centers, detailed analysis protocols are now needed. We present a step-by-step protocol that details how to robustly detect aberrant expression levels, aberrant splicing and mono-allelic expression in RNA-seq data using dedicated statistical methods. We describe how to generate and assess quality control plots and interpret the analysis results. The protocol is based on the detection of RNA outliers pipeline (DROP), a modular computational workflow that integrates all the analysis steps, can leverage parallel computing infrastructures and generates browsable web page reports.
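To make the idea of an expression outlier concrete, the following is a minimal sketch under stated assumptions. It is not the statistical method used by DROP: it swaps in a plain robust z-score (median and median absolute deviation on log-transformed counts) as a simplified stand-in. The function name `expression_outliers`, the cutoff of 3 and the toy counts are all hypothetical, and the counts are assumed to be already normalized for sequencing depth.

```python
import math
from statistics import median


def expression_outliers(counts, samples, z_cutoff=3.0):
    """Flag (gene, sample) pairs whose log-expression deviates from the cohort.

    counts:  dict mapping gene -> list of depth-normalized counts,
             ordered like `samples` (an assumption of this sketch).
    Returns a list of (gene, sample, robust_z) tuples.
    """
    hits = []
    for gene, values in counts.items():
        # Log-transform with a pseudocount so that zero counts are defined.
        logged = [math.log2(v + 1.0) for v in values]
        center = median(logged)
        mad = median(abs(x - center) for x in logged)
        if mad == 0:
            continue  # no spread across samples, so no z-score can be formed
        # 1.4826 rescales the MAD to match the standard deviation under normality.
        scale = 1.4826 * mad
        for sample, x in zip(samples, logged):
            z = (x - center) / scale
            if abs(z) >= z_cutoff:
                hits.append((gene, sample, round(z, 2)))
    return hits


# Toy cohort of eight samples; GENE_A is strongly reduced in sample S7.
samples = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
counts = {
    "GENE_A": [100, 110, 95, 105, 98, 102, 5, 101],
    "GENE_B": [50, 52, 48, 51, 49, 50, 53, 47],
}
hits = expression_outliers(counts, samples)
print(hits)
```

On this toy input only the GENE_A and S7 pair is reported, with a negative score indicating underexpression. The median and MAD are used instead of the mean and standard deviation because, in a small cohort, a single extreme sample inflates the standard deviation enough to mask itself.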
A subset of the Geuvadis dataset (ref. 15) comprising 100 samples (Supplementary Data 1) was used to test and demonstrate the workflow. The original dataset is accessible without restriction at https://www.internationalgenome.org/data-portal/data-collection/geuvadis. The analyses performed in the ‘Dataset design’ section used the GTEx and Kremer et al. (ref. 9) datasets. The GTEx dataset was downloaded from the GTEx Portal on 12 June 2017, under the dbGaP accession number phs000424.v6.p1. The count matrices from the Kremer et al. dataset are available on Zenodo (https://doi.org/10.5281/zenodo.3887451).
DROP, including a small demo dataset of 10 samples restricted to chromosome 21, is publicly available at https://github.com/gagneurlab/drop under the MIT license. The current version is 0.9.2, which is archived at https://doi.org/10.5281/zenodo.4106177. All the plots, results, and analyses of the test dataset can be found at https://www.cmm.in.tum.de/public/paper/drop_analysis/webDir/html/drop_analysis_index.html.
Bamshad, M. J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat. Rev. Genet. 12, 745–755 (2011).
Yang, Y. et al. Clinical whole-exome sequencing for the diagnosis of Mendelian disorders. N. Engl. J. Med. 369, 1502–1511 (2013).
Taylor, J. C. et al. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat. Genet. 47, 717–726 (2015).
Lionel, A. C. et al. Improved diagnostic yield compared with targeted gene sequencing panels suggests a role for whole-genome sequencing as a first-tier genetic test. Genet. Med. 20, 435–443 (2018).
Chong, J. X. et al. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 97, 199–215 (2015).
Cooper, G. M. Parlez-vous VUS? Genome Res. 25, 1423–1426 (2015).
Kremer, L. S. et al. Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat. Commun. 8, 15824 (2017).
Cummings, B. B. et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, eaal5209 (2017).
Frésard, L. et al. Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts. Nat. Med. 25, 911–919 (2019).
Gonorazky, H. D. et al. Expanding the boundaries of RNA sequencing as a diagnostic tool for rare Mendelian disease. Am. J. Hum. Genet. 104, 466–483 (2019).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Murdock, D. R. et al. Transcriptome-directed analysis for Mendelian disease diagnosis overcomes limitations of conventional genomic testing. J. Clin. Investig. https://doi.org/10.1172/JCI141500 (2020).
Koster, J. & Rahmann, S. Snakemake–a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).
Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 50, 151–158 (2018).
Brechtmann, F. et al. OUTRIDER: a statistical method for detecting aberrantly expressed genes in RNA sequencing data. Am. J. Hum. Genet. 103, 907–917 (2018).
Mertes, C. et al. Detection of aberrant splicing events in RNA-Seq data with FRASER. Preprint at bioRxiv https://doi.org/10.1101/2019.12.18.866830 (2019).
Köhler, S. et al. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 47, D1018–D1027 (2019).
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Papatheodorou, I. et al. Expression Atlas: gene and protein expression across multiple studies and organisms. Nucleic Acids Res. 46, D246–D251 (2018).
Aicher, J. K., Jewell, P., Vaquero-Garcia, J., Barash, Y. & Bhoj, E. J. Mapping RNA splicing variations in clinically accessible and nonaccessible tissues to facilitate Mendelian disease diagnosis using RNA-seq. Genet. Med. 22, 1181–1190 (2020).
Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–507 (2012).
Scotti, M. M. & Swanson, M. S. RNA mis-splicing in disease. Nat. Rev. Genet. 17, 19–32 (2016).
Singh, R. K. & Cooper, T. A. Pre-mRNA splicing in disease and therapeutics. Trends Mol. Med. 18, 472–482 (2012).
Cheng, J. et al. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol. 20, 48 (2019).
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).
Lee, H. et al. Diagnostic utility of transcriptome sequencing for rare Mendelian diseases. Genet. Med. 22, 490–499 (2019).
Gonorazky, H. et al. RNAseq analysis for the diagnosis of muscular dystrophy. Ann. Clin. Transl. Neurol. 3, 55–60 (2016).
Kernohan, K. D. et al. Whole-transcriptome sequencing in blood provides a diagnosis of spinal muscular atrophy with progressive myoclonic epilepsy. Hum. Mutat. 38, 611–614 (2017).
Hamanaka, K. et al. RNA sequencing solved the most common but unrecognized NEB pathogenic variant in Japanese nemaline myopathy. Genet. Med. 21, 1629–1638 (2019).
Wang, K. et al. Whole-genome DNA/RNA sequencing identifies truncating mutations in RBCK1 in a novel Mendelian disease with neuromuscular and cardiac involvement. Genome Med. 5, 67 (2013).
Pervouchine, D. D., Knowles, D. G. & Guigo, R. Intron-centric estimation of alternative splicing from RNA-seq data. Bioinformatics 29, 273–274 (2013).
Kapustin, Y. et al. Cryptic splice sites and split genes. Nucleic Acids Res. 39, 5837–5844 (2011).
Mohammadi, P. et al. Genetic regulatory variation in populations informs transcriptome analysis in rare disease. Science 366, 351–356 (2019).
Albers, C. A. et al. Compound inheritance of a low-frequency regulatory SNP and a rare null mutation in exon-junction complex subunit RBM8A causes TAR syndrome. Nat. Genet. 44, 435–439 (2012).
van Haelst, M. M. et al. Further confirmation of the MED13L haploinsufficiency syndrome. Eur. J. Hum. Genet. 23, 135–138 (2015).
Lindstrand, A. et al. Different mutations in PDE4D associated with developmental disorders with mirror phenotypes. J. Med. Genet. 51, 45–54 (2014).
’t Hoen, P. A. C. et al. Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol. 31, 1015–1022 (2013).
Lee, S. et al. NGSCheckMate: software for validating sample identity in next-generation sequencing studies within and across data types. Nucleic Acids Res. 45, e103 (2017).
Castel, S. E., Mohammadi, P., Chung, W. K., Shen, Y. & Lappalainen, T. Rare variant phasing and haplotypic expression from RNA sequencing with phASER. Nat. Commun. 7, 12817 (2016).
Mitelman, F., Johansson, B. & Mertens, F. The impact of translocations and gene fusions on cancer causation. Nat. Rev. Cancer 7, 233–245 (2007).
Dai, X., Theobard, R., Cheng, H., Xing, M. & Zhang, J. Fusion genes: a promising tool combating against cancer. Biochim. Biophys. Acta Rev. Cancer 1869, 149–160 (2018).
van Heesch, S. et al. Genomic and functional overlap between somatic and germline chromosomal rearrangements. Cell Rep. 9, 2001–2010 (2014).
Oliver, G. R. et al. A tailored approach to fusion transcript identification increases diagnosis of rare inherited disease. PLoS One 14, e0223337 (2019).
Tian, L. et al. CICERO: a versatile method for detecting complex and diverse driver fusions using cancer RNA sequencing data. Genome Biol. 21, 126 (2020).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Van der Auwera, G. A. et al. in Current Protocols in Bioinformatics 11.10.1–11.10.33 (Wiley, 2013).
Li, H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 27, 718–719 (2011).
McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47, D853–D858 (2019).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 3 (2011).
Ben-Kiki, O. & Evans, C. YAML Ain’t Markup Language (YAML™) Version 1.2. https://yaml.org/spec/1.2/spec.html (2009).
Anders, S., Pyl, P. T. & Huber, W. HTSeq-a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
McCarthy, D. J., Chen, Y. & Smyth, G. K. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 40, 4288–4297 (2012).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Katz, Y. et al. Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics 31, 2400–2402 (2015).
Amberger, J. S., Bocchini, C. A., Scott, A. F. & Hamosh, A. OMIM.org: leveraging knowledge across phenotype–gene relationships. Nucleic Acids Res. 47, D1038–D1043 (2019).
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–423 (2015).
The authors thank all the users who helped with their feedback during the revision, especially D. R. Murdock. We also thank C. Andrade for helping with the DROP logo, as well as the members of the Gagneur lab for input. The Bavaria California Technology Center supported C.M. through a fellowship. The German Bundesministerium für Bildung und Forschung (BMBF) supported the study through the e:Med Networking fonds AbCD-Net (FKZ 01ZX1706A to V.A.Y., C.M., and J.G.), the German Network for Mitochondrial Disorders (mitoNET; 01GM1113C to H.P.), the E-Rare project GENOMIT (01GM1920A to M.G. and H.P.), the Medical Informatics Initiative CORD-MI (Collaboration on Rare Diseases) to V.A.Y., and the ERA PerMed project PerMiM (01KU2016A to H.P. and J.G.). The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS.
The authors declare no competing interests.
Peer review information: Nature Protocols thanks Anna Esteve-Codina and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Key references using this protocol
Kremer, L. et al. Nat. Commun. 8, 15824 (2017): https://doi.org/10.1038/ncomms15824
Murdock, D. R. et al. J. Clin. Invest. (2020): https://doi.org/10.1172/JCI141500
Key data used in this protocol
Kremer, L. et al. Nat. Commun. 8, 15824 (2017): https://doi.org/10.1038/ncomms15824
GTEx Consortium. Nature 550, 204–213 (2017): https://doi.org/10.1038/nature24277
Lappalainen, T. et al. Nature 501, 506–511 (2013): https://doi.org/10.1038/nature12531
About this article
Cite this article
Yépez, V.A., Mertes, C., Müller, M.F. et al. Detection of aberrant gene expression events in RNA sequencing data. Nat Protoc 16, 1276–1296 (2021). https://doi.org/10.1038/s41596-020-00462-5