Predicting effects of noncoding variants with deep learning–based sequence model

Abstract

Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning–based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.

Figure 1
Figure 2: The deep-learning model accurately predicts chromatin features from sequence with single-nucleotide sensitivity.
Figure 3: Sequence-based prioritization of functional noncoding variants.

Acknowledgements

This work was primarily supported by US National Institutes of Health (NIH) grants R01 GM071966 and R01 HG005998 to O.G.T. This work was supported in part by the US National Science Foundation (NSF) CAREER award (DBI-0546275), NIH award T32 HG003284 and NIH grant P50 GM071508. O.G.T. is supported by the Genetic Networks program of the Canadian Institute for Advanced Research (CIFAR). We acknowledge the TIGRESS high-performance computer center at Princeton University for computational resource support. We are grateful to all Troyanskaya laboratory members for valuable discussions.

Author information

Contributions

J.Z. designed the study, with input from O.G.T. J.Z. developed the method and analyzed the results. O.G.T. supervised the study. J.Z. and O.G.T. wrote the paper.

Corresponding author

Correspondence to Olga G Troyanskaya.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Performance comparison of DeepSEA models trained with different context sequence lengths

DeepSEA models with the same architecture as described in the Online Methods were trained on 200-bp, 500-bp and 1,000-bp input sequences, and the AUCs of all chromatin features are shown as box plots. Although the chromatin feature labels were always determined from the central 200-bp region, increasing the context sequence length significantly improved model performance (P < 2.2e-16 by Wilcoxon signed-rank test between any pair of models).
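
The relationship between the fixed 200-bp label region and the longer context windows can be illustrated with a small sketch. This is not the authors' code: the function name `context_window`, the 0-based half-open coordinates and the requirement of equal flanks on both sides are assumptions made for illustration.

```python
from typing import Tuple

LABEL_BIN = 200  # bp; per the legend, labels come from the central 200-bp region


def context_window(bin_start: int, context_len: int) -> Tuple[int, int]:
    """Return (start, end) of a context window centered on a 200-bp label bin.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    `bin_start` is the start coordinate of the 200-bp label bin.
    """
    if context_len < LABEL_BIN or (context_len - LABEL_BIN) % 2:
        raise ValueError("context_len must be >= 200 and leave equal flanks")
    flank = (context_len - LABEL_BIN) // 2
    return bin_start - flank, bin_start + LABEL_BIN + flank


# The three input lengths compared in the figure, for a hypothetical bin.
for length in (200, 500, 1000):
    print(length, context_window(10_000, length))
```

In this sketch the label bin never moves; only the flanking sequence handed to the model grows, which is the variable the figure compares.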

Supplementary Figure 2 Performance comparison with gkm-SVM

The deep convolutional network model outperformed the gapped k-mer SVM (gkm-SVM) on transcription factor binding prediction. The deep convolutional network achieved a higher area under the receiver operating characteristic curve (AUC) for almost all transcription factors (left panel). The gkm-SVM did not gain performance from increasing the size of the context sequences (right panel).

Supplementary Figure 3 In silico saturated mutagenesis analysis for identifying predictive sequence features

Predictive sequence features can be identified by computationally mutating each base and analyzing the effect on the predicted binding probability. Each column in a heatmap represents a base position in the sequence. The three rows represent the three possible base substitutions, following the A>G>C>T order from bottom to top. For example, if the original sequence has base G, then the three rows represent C, T and A from bottom to top. The heatmap shows the log2 fold change of odds (odds are computed from probability as P/(1 – P)); yellow indicates an increase in binding and blue indicates a decrease. Each sequence example is shown in two panels. The first (top) panel shows the ‘mutation scanning’ results for the whole 1,000-bp sequence. The second (bottom) panel focuses on the central 200 bp in order to show the actual nucleotide sequences. Many of the sequence elements identified are consistent with canonical motifs such as TTGCTCAA for CEBPB, TGATAA for GATA1, GTAAATA for FOXA1 and GTACATA for FOXA2. The four example sequences shown in this figure are centered on the SNPs chr1:109817590 G>T, chr16:209709 T>C, chr10:23508363 A>G and chr16:52599188 C>T, respectively.
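
The mutation scan described in this legend can be sketched in a few lines. This is a minimal illustration, not the DeepSEA implementation: `toy_predict` is a hypothetical stand-in for a trained model, and the helper names are invented for the example. The sketch also assumes predicted probabilities lie strictly between 0 and 1, since the odds are undefined at the extremes.

```python
import math

BASES = "ACGT"


def log2_odds_fold_change(p_ref: float, p_mut: float) -> float:
    """log2 of odds(mutant) / odds(reference), with odds = P / (1 - P)."""
    odds_ref = p_ref / (1.0 - p_ref)
    odds_mut = p_mut / (1.0 - p_mut)
    return math.log2(odds_mut / odds_ref)


def saturated_mutagenesis(seq: str, predict) -> dict:
    """Score every single-base substitution of `seq`.

    Returns {(position, alt_base): log2 fold change of odds}. `predict` maps
    a sequence to a probability in (0, 1).
    """
    p_ref = predict(seq)
    effects = {}
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue  # only the three possible substitutions per position
            mutated = seq[:i] + alt + seq[i + 1:]
            effects[(i, alt)] = log2_odds_fold_change(p_ref, predict(mutated))
    return effects


def toy_predict(seq: str) -> float:
    # Hypothetical model: "binding" is likely only if TGATAA is present.
    return 0.9 if "TGATAA" in seq else 0.1


effects = saturated_mutagenesis("CCTGATAACC", toy_predict)
# Substitutions inside the motif give negative values (decreased binding);
# substitutions in the flanks give 0 in this toy model.
print(effects[(2, "A")], effects[(0, "A")])
```

Arranging the resulting values as a 3 × L grid, one column per position, gives the kind of heatmap the legend describes.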

Supplementary Figure 4 DeepSEA accurately predicted histone QTL effects

DeepSEA histone mark classifiers accurately predicted allele-specific effects on the histone marks H3K4me3 and H3K27ac, that is, which allele carries more of the histone mark. The top prediction accuracies are over 0.9 for both marks. The predictions were evaluated with histone mark QTLs identified at FDR < 0.1 in Yoruba lymphoblastoid cell lines (ref. 1). The margin shown on the x axis is the threshold on the difference in predicted probability between the two alleles that is used to classify a prediction as high confidence. Performance is measured by the accuracy of the predictions above the threshold (y axis).

1. McVicker, G. et al. Science 342, 747–749 (2013).
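
The margin-based evaluation in this legend can be sketched as follows. The record layout, the function name `accuracy_at_margin` and the strict `>` comparison against the margin are assumptions of this example, and the numbers are made up for illustration rather than taken from the paper.

```python
def accuracy_at_margin(records, margin):
    """Accuracy among predictions whose allele difference exceeds `margin`.

    `records` holds tuples (p_allele_a, p_allele_b, a_has_more_mark), where
    the first two entries are predicted probabilities for the two alleles and
    the last is the observed QTL direction. Returns (accuracy, n_called);
    accuracy is None when no prediction clears the margin.
    """
    correct = 0
    called = 0
    for p_a, p_b, a_has_more_mark in records:
        diff = p_a - p_b
        if abs(diff) <= margin:
            continue  # not a high-confidence prediction at this margin
        called += 1
        correct += (diff > 0) == a_has_more_mark
    return (correct / called if called else None), called


# Made-up example records, not data from the study.
records = [
    (0.80, 0.30, True),
    (0.55, 0.50, False),
    (0.20, 0.60, False),
    (0.45, 0.40, True),
    (0.70, 0.10, False),
]
for margin in (0.0, 0.1, 0.9):
    print(margin, accuracy_at_margin(records, margin))
```

Sweeping the margin from 0 upward traces out an accuracy curve of the kind plotted in the figure, with fewer variants called at each step.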

Supplementary Figure 5 Flow diagram for DeepSEA functional SNP prioritization

For each input variant, DeepSEA computes 1,842 features: 1,838 predicted chromatin effect features and 4 evolutionary conservation features. The predicted chromatin effect features comprise the absolute difference and the relative difference between the predicted probabilities for the reference and alternative sequences, for each TF, DNase and histone chromatin feature. Evolutionary conservation scores based on multispecies genome alignments were retrieved for the variant positions. The absolute value of each feature is taken and then scaled to mean 0 and variance 1 before being provided as input to the classifier.
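
A rough sketch of this feature construction is given below. Two points are assumptions rather than statements from the legend: the relative difference is taken here as a difference of log odds, and scaling uses the population standard deviation across the variants supplied. The function names are hypothetical. With two features per chromatin profile, 1,838 chromatin effect features correspond to 919 profiles; the example uses only 3 to stay readable.

```python
import math
from statistics import mean, pstdev


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def variant_features(p_ref, p_alt, conservation):
    """Build one feature vector for a variant.

    p_ref, p_alt: predicted probabilities per chromatin feature for the
    reference and alternative sequences. conservation: conservation scores at
    the variant position. Every feature is returned as an absolute value.
    """
    abs_diff = [r - a for r, a in zip(p_ref, p_alt)]
    # Assumption: "relative difference" modeled as a log-odds difference.
    rel_diff = [logit(r) - logit(a) for r, a in zip(p_ref, p_alt)]
    return [abs(x) for x in abs_diff + rel_diff + list(conservation)]


def standardize_columns(rows):
    """Scale each feature (column) to mean 0 and variance 1 across variants."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


# Made-up probabilities and conservation scores for three variants.
rows = [
    variant_features([0.9, 0.2, 0.5], [0.6, 0.2, 0.7], [1.2, -0.3, 0.0, 2.5]),
    variant_features([0.4, 0.3, 0.5], [0.5, 0.1, 0.5], [0.1, 0.8, 0.4, -1.0]),
    variant_features([0.7, 0.6, 0.2], [0.7, 0.9, 0.3], [3.0, 0.2, 0.9, 0.5]),
]
scaled = standardize_columns(rows)
print(len(rows[0]), [round(x, 3) for x in scaled[0]])
```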

Supplementary Figure 6 DeepSEA functional significance score prioritizes functional noncoding variants with high performance

The DeepSEA functional significance score measures the overall significance of the predicted chromatin effects and evolutionary conservation scores. It is unsupervised and therefore not biased toward any training set of functional variant annotations (see Online Methods). Notably, the DeepSEA functional significance score still surpassed the performance of previous methods even though no supervised training was used (compare to Fig. 3). Performance was measured by the area under the receiver operating characteristic curve (AUC). The x axis shows the average distance of each negative-variant group to the nearest positive variant. The ‘all’ negative-variant groups are randomly selected negative 1000 Genomes SNPs.
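
For readers unfamiliar with the metric, AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive variant scores higher than a randomly chosen negative one. The sketch below uses made-up scores and a simple quadratic-time loop, which is an illustrative choice and not the evaluation code used in the study.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores for positive (functional) and negative variants.
print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of positive from negative variants.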

Supplementary Figure 7 Dissecting DeepSEA functional SNP prioritization performance with subsets of input features

The performance of the DeepSEA functional SNP prioritization models on HGMD regulatory mutations, noncoding eQTLs and noncoding trait-associated (GWAS) SNPs was analyzed by comparison with models trained on only the predicted chromatin effect features or only the evolutionary conservation features. Performance was measured by the area under the receiver operating characteristic curve (AUC). The x axis shows the average distance of each negative-variant group to the nearest positive variant. The ‘all’ negative-variant groups are randomly selected negative 1000 Genomes SNPs.

Supplementary Figure 8 DeepSEA-based classifier prioritized functionally annotated indels with high performance

Prioritization performance for HGMD regulatory indels was evaluated against groups of negative 1000 Genomes indels at different distances from the positive indels (average distance shown on the x axis). Performance was measured by the area under the receiver operating characteristic curve (AUC). The prioritization model was trained on HGMD regulatory single-nucleotide substitutions against negative variants at an average distance of 1,200 bp.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–8 and Supplementary Note (PDF 794 kb)

Supplementary Table 1

List of all publicly available chromatin feature profile files used for training DeepSEA (XLSX 22 kb)

Supplementary Table 2

DeepSEA prediction performance for each transcription factor, DNase I hypersensitive site, and histone mark profile (XLSX 79 kb)

Supplementary Table 3

Sequence-based allele-specific DNase I hypersensitivity predictions for allele-imbalanced variants called from Digital Genomic Footprinting DNase-seq data (CSV 4745 kb)

Supplementary Table 4

Allele-imbalance DNase I hypersensitivity prediction performance for 35 cell types (XLSX 13 kb)

Supplementary Table 5

DeepSEA functional variant prioritization model predictions for noncoding GRASP eQTLs and negative variants sets (CSV 63482 kb)

Supplementary Table 6

DeepSEA functional variant prioritization model predictions for noncoding GWAS Catalog SNPs and negative variant sets. (CSV 56051 kb)

Supplementary Table 7

Feature rankings for noncoding functional variant prioritization tasks. (XLSX 841 kb)

Cite this article

Zhou, J., Troyanskaya, O. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods 12, 931–934 (2015). https://doi.org/10.1038/nmeth.3547
