S-CAP extends pathogenicity prediction to genetic variants that affect RNA splicing

Jagadeesh, Karthik A.; Paggi, Joseph M.; Ye, James S.; Stenson, Peter D.; Cooper, David N.; Bernstein, Jonathan A.; Bejerano, Gill

doi:10.1038/s41588-019-0348-4

Technical Report
Published: 25 February 2019

Karthik A. Jagadeesh¹, Joseph M. Paggi¹, James S. Ye², Peter D. Stenson³, David N. Cooper³ (ORCID: orcid.org/0000-0002-8943-8484), Jonathan A. Bernstein⁴ & Gill Bejerano¹,⁴,⁵,⁶ (ORCID: orcid.org/0000-0001-5179-3635)

Nature Genetics volume 51, pages 755–763 (2019)

Abstract

Exome analysis of patients with a likely monogenic disease does not identify a causal variant in over half of cases. Splice-disrupting mutations make up the second largest class of known disease-causing mutations. Each individual (singleton) exome harbors over 500 rare variants of unknown significance (VUS) in the splicing region. The existing relevant pathogenicity prediction tools tackle all non-coding variants as one amorphic class and/or are not calibrated for the high sensitivity required for clinical use. Here we calibrate seven such tools and devise a novel tool called Splicing Clinically Applicable Pathogenicity prediction (S-CAP) that is over twice as powerful as all previous tools, removing 41% of patient VUS at 95% sensitivity. We show that S-CAP does this by using its own features and not via meta-prediction over previous tools, and that splicing pathogenicity prediction is distinct from predicting molecular splicing changes. S-CAP is an important step on the path to deriving non-coding causal diagnoses.
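Calibrating a tool for a fixed sensitivity, as described in the abstract, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names and the toy scores are assumptions, and "removing" a VUS is read here as the variant scoring below the calibrated threshold.

```python
import math

def threshold_at_sensitivity(pathogenic_scores, sensitivity=0.95):
    """Largest threshold t such that at least `sensitivity` of the known
    pathogenic scores satisfy score >= t (hypothetical helper)."""
    ranked = sorted(pathogenic_scores, reverse=True)
    needed = math.ceil(sensitivity * len(ranked))  # how many must pass
    return ranked[needed - 1]

def fraction_removed(vus_scores, threshold):
    """Share of VUS that score below the threshold and would be set aside."""
    return sum(1 for s in vus_scores if s < threshold) / len(vus_scores)

# Toy data (illustrative only, not from the paper).
pathogenic = [i / 20 for i in range(1, 21)]   # 20 known pathogenic scores
patient_vus = [0.01, 0.02, 0.5, 0.08, 0.9]    # scores of one patient's VUS

cutoff = threshold_at_sensitivity(pathogenic)
removed = fraction_removed(patient_vus, cutoff)
```

The threshold is chosen on labeled pathogenic variants only; the fraction of VUS removed is then a property of how the unlabeled patient variants fall around that threshold.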

**Fig. 1: Distribution of rare non-coding variants in the splicing region.**

**Fig. 2: Overall area under the ROC curve per region for splicing pathogenicity classification.**

**Fig. 3: Overall performance on patient data.**

Code availability

The S-CAP code is publicly available at Bitbucket (see URLs) for non-commercial purposes.

Data availability

S-CAP scores for all rare variants in the predefined splicing region in the human genome, along with the source code repository, training and test set variant IDs, and final trained models for the S-CAP classifier, will be made available via the S-CAP website on publication.

Acknowledgements

We thank R. Dror for support, as well as A. Wenger, M. Berger, J. Birgmeier and all members of the Bejerano laboratory for useful discussions, feedback, and advice. P.D.S. and D.N.C. receive financial support from Qiagen through a license agreement with Cardiff University. This work was funded in part by a Stanford Graduate Fellowship and a Computational and Evolutionary Genomics Fellowship to K.A.J., a Stanford Graduate Fellowship to J.M.P., and Stanford Pediatrics Department, DARPA, a Packard Foundation and Microsoft Faculty Fellowships to G.B.

Author information

These authors contributed equally: Karthik A. Jagadeesh, Joseph M. Paggi.

Authors and Affiliations

Department of Computer Science, Stanford University, Stanford, CA, USA
Karthik A. Jagadeesh, Joseph M. Paggi & Gill Bejerano
Department of Biology, Stanford University, Stanford, CA, USA
James S. Ye
Institute of Medical Genetics, Cardiff University, Cardiff, UK
Peter D. Stenson & David N. Cooper
Department of Pediatrics, Stanford University, Stanford, CA, USA
Jonathan A. Bernstein & Gill Bejerano
Department of Developmental Biology, Stanford University, Stanford, CA, USA
Gill Bejerano
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Gill Bejerano

Contributions

K.A.J., J.M.P., and G.B. designed the study. K.A.J., J.M.P., and J.S.Y. developed the features, trained the model, and evaluated the results. P.D.S. and D.N.C. curated the HGMD data. J.A.B. provided feedback on the patient exome cases. K.A.J., J.M.P., and G.B. wrote the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Gill Bejerano.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Framework for training and evaluating six pathogenicity models.

The splicing region is split into 6 independent regions as defined in Fig. 1c, and a separate model is trained for variants residing in each region. Given a set of variants to be scored, we calculate the S-CAP score for each variant by using the corresponding model associated with the region where the variant is found.
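The per-region dispatch described in this legend can be sketched as below. This is a hedged illustration: the region labels follow the six regions named in the Supplementary Figure 4 legend, while the data layout, the `score_variants` function, and the toy models are assumptions made for the example and are not taken from the S-CAP code.

```python
from typing import Callable, Dict, List, Tuple

# Region labels as named in the supplementary legends.
REGIONS = ("3' intronic", "3' core", "exonic",
           "5' core", "5' extended", "5' intronic")

Features = Dict[str, float]          # feature name -> value (illustrative)
Model = Callable[[Features], float]  # a trained model maps features to a score

def score_variants(variants: List[Tuple[str, Features]],
                   models: Dict[str, Model]) -> List[float]:
    """Score each variant with the model trained for its region."""
    scores = []
    for region, features in variants:
        if region not in models:
            raise KeyError(f"no model trained for region {region!r}")
        scores.append(models[region](features))
    return scores

# Toy stand-in models (assumption): each scales one feature by a fixed weight.
weights = (0.5, 1.0, 0.25, 1.0, 0.5, 0.25)
toy_models = {region: (lambda f, w=w: w * f["conservation"])
              for region, w in zip(REGIONS, weights)}

example = [("exonic", {"conservation": 0.8}),
           ("3' core", {"conservation": 0.8})]
example_scores = score_variants(example, toy_models)
```

Keeping one model per region means a score is only comparable with other scores from the same region, so any threshold would also be chosen per region.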

Supplementary Figure 2 Overall precision-recall curve per region for splicing-pathogenicity classification.

The size of training and testing data for each model is specified in Table 3. Precision measures the number of correctly classified pathogenic variants over the total number of variants classified as pathogenic. This measure reflects the class imbalance, as there are over two orders of magnitude more benign variants than pathogenic variants. S-CAP improves on existing metrics over the precision-recall curve in all splicing regions.
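The effect of class imbalance on precision can be seen in a small worked example. The counts and scores below are invented for illustration (2 pathogenic versus 200 benign variants, a 1:100 ratio) and the function name is an assumption.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when variants with score >= threshold are
    called pathogenic (label 1 = pathogenic, 0 = benign)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: 2 pathogenic and 200 benign variants.
labels = [1, 1] + [0] * 200
# Both pathogenic variants score high; 4 of the 200 benign ones do too.
scores = [0.9, 0.7] + [0.8] * 4 + [0.1] * 196

precision, recall = precision_recall(labels, scores, threshold=0.5)
```

Here only 2% of benign variants are misclassified and every pathogenic variant is found, yet two thirds of the variants called pathogenic are benign, which is why a precision-recall curve is informative alongside a ROC curve when benign variants dominate.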

Supplementary Figure 3 Performance of S-CAP compared with MutPred Splice.

MutPred Splice is a computational method for predicting the pathogenicity of exonic synonymous variants. MutPred Splice was trained by its authors using a subset of the pathogenic data used to train/test S-CAP. As a result, we need to independently test MutPred Splice on a set of variants that was not used in its training. This test set comprises 78 rare synonymous variants from HGMD added to the database in 2013 or later and 10,000 rare synonymous variants present in healthy controls in ExAC. On this set S-CAP improves on MutPred Splice by 5.7% when comparing the overall AUC. S-CAP performs especially well in the high sensitivity domain and improves on the MutPred Splice hsr-AUC by 204%.

Supplementary Figure 4 S-CAP performance in the high-sensitivity region.

The size of training and testing data for each model is specified in Table 3. The hsr-AUC curve is formed by subsetting the overall AUC to just the region where pathogenic variants are correctly classified over 95% of the time. An hsr-AUC curve is calculated for each of the regions as defined in Fig. 1c. S-CAP improves on the next best method’s hsr-AUC by 185% in the 3′ intronic region (a), 40.7% in the 3′ core sites (b), 224% in the exonic region (c), 31.6% in the 5′ core sites (d), 72.7% in the 5′ extended region (e) and 120% in the 5′ intronic region (f). None of the existing methods consistently outperforms the others.
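One way to compute an area restricted to the high-sensitivity region is sketched below. The legend does not state how the area is normalized, so the choice here (the area between the ROC curve and the 95% sensitivity line, divided by the maximum attainable area) is an assumption, as are the function names. Both classes are assumed to be present in the input.

```python
def roc_points(labels, scores):
    """ROC points (false-positive rate, sensitivity), strictest threshold first."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume every variant tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def hsr_auc(labels, scores, min_sensitivity=0.95):
    """Area between the ROC curve and the line sensitivity = min_sensitivity,
    normalized so that a perfect classifier scores 1.0 (assumed convention)."""
    pts = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if y1 <= min_sensitivity:
            continue  # segment lies at or below the sensitivity floor
        if y0 < min_sensitivity:
            # Clip the segment start to the point where it crosses the floor.
            t = (min_sensitivity - y0) / (y1 - y0)
            x0, y0 = x0 + t * (x1 - x0), min_sensitivity
        # Trapezoid of the height above the floor.
        area += (x1 - x0) * ((y0 - min_sensitivity) + (y1 - min_sensitivity)) / 2
    return area / (1.0 - min_sensitivity)

perfect = hsr_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
partial = hsr_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Under this convention a classifier no better than chance scores far below 0.5, so values are not directly comparable with an overall AUC.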

Supplementary Figure 5 S-CAP performance in 3′-core and 5′-core regions without zygosity feature.

The size of training and testing data for each model is specified in Table 3 in the rows corresponding to the 3′ Core and 5′ Core regions. Our model without the zygosity feature outperforms existing metrics in the 3′ core and 5′ core regions by up to 23.4% in AUC and by up to 36% in hsr-AUC. The performance in the 3′ core region is slightly better than in the 5′ core region. Incorporating the variant zygosity feature boosts classification performance by 5.4% (Fig. 2, Supplementary Fig. 4).

Supplementary Figure 6 Performance on recessive and dominant classes.

The sizes of the training data for the 3′ Core and 5′ Core models are specified in Table 3. The 3′ Core Dominant and Recessive models are evaluated over 787 and 131 pathogenic variants and 8,421 and 91 benign variants, respectively. The 5′ Core Dominant and Recessive models are evaluated over 938 and 218 pathogenic variants and 9,085 and 103 benign variants, respectively. The distribution of the underlying features is dramatically different for dominant and recessive variants. This results in a large difference in performance when classifying recessive and benign variants in the core splice site regions. In the 3′ core region, S-CAP achieves an AUC of 0.818 on dominant tagged variants (a) and of 0.862 on recessive tagged variants (b). There is a similar performance difference in the 5′ core region, where S-CAP achieves an AUC of 0.809 on dominant tagged variants (c) and of 0.875 on recessive tagged variants (d). In the high-sensitivity region, S-CAP achieves an hsr-AUC of 0.270 on dominant tagged variants and of 0.491 on recessive tagged variants in the 3′ core region (e, f), and of 0.229 on dominant tagged variants and of 0.510 on recessive tagged variants in the 5′ core region (g, h). S-CAP AUC and hsr-AUC are consistently better than those of all other tools, and no existing method consistently outperforms the others.

Supplementary Figure 7 S-CAP maintains a 41% decrease in VUS over 1000 Genomes superpopulations.

Each boxplot is defined so that the centerline is at the median, the bottom of the box marks the 25th percentile, the top of the box marks the 75th percentile, and the whiskers extend to 1.5 times the interquartile range (IQR). S-CAP performance in reducing VUS is consistent across five superpopulations from the 1000 Genomes Project despite the European bias in the gnomAD control database. EUR, European (n = 503). AFR, African (n = 661). SAS, South Asian (n = 489). AMR, Admixed American (n = 347). EAS, East Asian (n = 504).
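The boxplot elements defined in this legend can be computed as in the sketch below. Two details are assumptions, since the legend does not specify them: quartiles use Python's "inclusive" interpolation method, and whiskers follow the common Tukey convention of ending at the most extreme data point within 1.5 × IQR of the box. The input values are invented for illustration.

```python
import statistics

def boxplot_stats(values):
    """Median, quartiles, whisker ends, and outliers for one boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

# Toy values (illustrative only), with one clear outlier.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Different plotting libraries use slightly different quartile definitions, so box edges can shift a little between tools for small samples.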

Supplementary Figure 8 S-CAP test-set performance resembles performance during cross-validation.

The cross-validation performance is measured over the benign and pathogenic training variants described in column 1 of Table 3 and is reported as the mean performance over all 5 folds. We find that performance is similar between cross-validation and the reported test set in all 6 regions.
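Reporting cross-validation performance as the mean over 5 folds can be sketched as follows. Everything model-specific here is a stand-in: the round-robin fold assignment, the one-feature threshold "model", and accuracy as the metric are assumptions chosen to keep the example self-contained, not the procedure used for S-CAP.

```python
from statistics import mean

def k_fold_indices(n, k=5):
    """Yield (train, test) index lists; item i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def cross_validate(xs, ys, fit, evaluate, k=5):
    """Train on k-1 folds, evaluate on the held-out fold, average the results."""
    fold_scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_scores.append(
            evaluate(model, [xs[i] for i in test], [ys[i] for i in test]))
    return mean(fold_scores), fold_scores

def fit_threshold(xs, ys):
    """Toy model: the midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2

def accuracy(threshold, xs, ys):
    """Fraction of held-out items on the correct side of the threshold."""
    return mean(1.0 if (x >= threshold) == (y == 1) else 0.0
                for x, y in zip(xs, ys))

# Toy data: one integer feature, 10 benign (0) then 10 pathogenic (1).
xs = list(range(20))
ys = [0] * 10 + [1] * 10
cv_mean, cv_folds = cross_validate(xs, ys, fit_threshold, accuracy)
```

Comparing `cv_mean` with the score on a separate held-out test set is the check this legend describes: a large gap between the two would suggest overfitting to the training variants.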

Supplementary information

Supplementary Information

Supplementary Figures 1–8, Supplementary Tables 1–5 and Supplementary Note

Reporting Summary

About this article

Cite this article

Jagadeesh, K.A., Paggi, J.M., Ye, J.S. et al. S-CAP extends pathogenicity prediction to genetic variants that affect RNA splicing. Nat Genet 51, 755–763 (2019). https://doi.org/10.1038/s41588-019-0348-4

Received: 07 March 2018
Accepted: 10 January 2019
Published: 25 February 2019
Issue Date: April 2019
DOI: https://doi.org/10.1038/s41588-019-0348-4
