Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning–based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.
This work was primarily supported by US National Institutes of Health (NIH) grants R01 GM071966 and R01 HG005998 to O.G.T. This work was supported in part by the US National Science Foundation (NSF) CAREER award (DBI-0546275), NIH award T32 HG003284 and NIH grant P50 GM071508. O.G.T. is supported by the Genetic Networks program of the Canadian Institute for Advanced Research (CIFAR). We acknowledge the TIGRESS high-performance computer center at Princeton University for computational resource support. We are grateful to all Troyanskaya laboratory members for valuable discussions.
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Performance comparison of DeepSEA models trained with different context sequence lengths
DeepSEA models with the same architecture as described in the Online Methods were trained on 200 bp, 500 bp and 1000 bp input sequences, and the AUCs of all chromatin features are shown as box plots. Although the chromatin feature labels were always determined from the central 200 bp region, increasing the context sequence length significantly improved model performance (P < 2.2e-16 by Wilcoxon signed-rank test between any pair of models).
The deep convolutional network model outperformed the gapped k-mer SVM (gkm-SVM) on transcription factor binding prediction. The deep convolutional network achieved a higher area under the receiver operating characteristic curve (AUC) for almost all transcription factors (left panel). The gapped k-mer SVM did not gain performance from increasing the size of the context sequence (right panel).
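The AUC used throughout these captions can be computed directly from its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The following is a minimal sketch under that definition; the function name `roc_auc` and the example scores are illustrative and are not taken from the DeepSEA code.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for five hypothetical examples (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
print(roc_auc(scores, labels))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a sort-based rank formula gives the same value for large data sets.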
Supplementary Figure 3 In silico saturated mutagenesis analysis for identifying predictive sequence features
Predictive sequence features can be identified by computationally mutating each base and analyzing the effect on the predicted binding probability. Each column in a heatmap represents a base position in the sequence. The three rows represent the three possible base substitutions following A>G>C>T order from bottom to top. For example, if the original sequence has base G, then the three rows represent C, T, A from bottom to top. The heatmap shows the log2 fold change of odds, where odds are computed from probability as P/(1 – P); yellow indicates an increase in binding and blue indicates a decrease in binding. Each sequence example is shown in two panels. The first (top) panel shows the ‘mutation scanning’ results on the whole 1000 bp sequence. The second (bottom) panel focuses on the central 200 bp in order to show the actual nucleotide sequences. Many of the sequence elements identified are consistent with canonical motifs such as TTGCTCAA for CEBPB, TGATAA for GATA1, GTAAATA for FOXA1 and GTACATA for FOXA2. The four example sequences shown in this figure are centered on the SNPs chr1:109817590 G>T, chr16:209709 T>C, chr10:23508363 A>G and chr16:52599188 C>T, respectively.
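The mutation scanning procedure in this caption can be sketched in a few lines. This is a minimal illustration, not the DeepSEA implementation: `predict_prob` is a hypothetical callable standing in for a trained model that maps a sequence to a binding probability, and `toy_predict` is a made-up scorer that only checks for the TGATAA motif mentioned above.

```python
import math

BASES = "AGCT"  # same A>G>C>T ordering as mentioned in the caption


def log2_odds_fold_change(p_ref, p_alt):
    """log2 of the odds ratio, with odds computed as P / (1 - P)."""
    odds_ref = p_ref / (1.0 - p_ref)
    odds_alt = p_alt / (1.0 - p_alt)
    return math.log2(odds_alt / odds_ref)


def saturated_mutagenesis(seq, predict_prob):
    """Score every single-base substitution of `seq`.

    Returns a dict mapping (position, alternative base) to the log2 fold
    change of odds relative to the unmutated sequence.
    """
    p_ref = predict_prob(seq)
    effects = {}
    for i, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutated = seq[:i] + alt + seq[i + 1:]
            effects[(i, alt)] = log2_odds_fold_change(p_ref, predict_prob(mutated))
    return effects


def toy_predict(seq):
    # Hypothetical stand-in for a trained model (assumption, not DeepSEA):
    # high probability only when the GATA1-like motif is present.
    return 0.9 if "TGATAA" in seq else 0.1


effects = saturated_mutagenesis("CCTGATAACC", toy_predict)
# Substitutions inside the motif give negative values (loss of binding);
# substitutions in the flanks leave the prediction unchanged.
```

With a real model, each sequence of length L requires 3L additional predictions, so batching the mutated sequences would matter in practice.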
DeepSEA histone mark classifiers accurately predicted allele-specific effects on the histone marks H3K4me3 and H3K27ac, that is, which allele carries more of the histone mark. The top prediction accuracies exceed 0.9 for both marks. The predictions were evaluated against histone mark QTLs identified with FDR < 0.1 in Yoruba lymphoblastoid cell lines (ref. 1). The margin shown on the x axis is the threshold on the predicted probability difference between the two alleles used to select high-confidence predictions. Performance is measured as the accuracy of the predictions above the threshold (y axis).
1. McVicker, G. et al. Science 342, 747–749 (2013).
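The margin-based evaluation described for the histone mark classifiers can be illustrated as follows. This is a sketch under stated assumptions: the function name, the strict `>` comparison against the margin, and the example numbers are all illustrative and are not taken from the paper.

```python
def accuracy_above_margin(p_allele1, p_allele2, allele1_is_higher, margin):
    """Accuracy among variants whose predicted probability difference exceeds `margin`.

    p_allele1, p_allele2: predicted probabilities for the two alleles.
    allele1_is_higher:    observed truth, True if allele 1 carries more of the mark.
    Returns (accuracy, n_confident); accuracy is None if no variant passes.
    """
    correct = 0
    n_confident = 0
    for p1, p2, truth in zip(p_allele1, p_allele2, allele1_is_higher):
        diff = p1 - p2
        if abs(diff) <= margin:
            continue  # below the margin: not a high-confidence prediction
        n_confident += 1
        if (diff > 0) == truth:
            correct += 1
    if n_confident == 0:
        return None, 0
    return correct / n_confident, n_confident


# Made-up predictions for four hypothetical variants.
p1 = [0.80, 0.30, 0.55, 0.10]
p2 = [0.20, 0.60, 0.50, 0.45]
truth = [True, False, False, True]

for margin in (0.0, 0.1, 0.5):
    print(margin, accuracy_above_margin(p1, p2, truth, margin))
```

Raising the margin trades coverage for confidence: fewer variants are scored, but those that remain have larger predicted allelic differences.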
For each input variant, DeepSEA computes 1842 features, comprising 1838 predicted chromatin effect features and 4 evolutionary conservation features. The predicted chromatin effect features are the absolute difference and the relative difference between the predicted probabilities for the reference and alternative sequences, computed for each TF / DNase / histone chromatin feature. Evolutionary conservation scores based on multi-species genome alignments were retrieved for the variant positions. The absolute value of each feature is taken, and the result is then scaled to mean 0 and variance 1 before being provided as input to the classifier.
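The feature construction described above can be sketched as follows. Two points are assumptions made for illustration: the caption does not define "relative difference", so it is taken here to be the difference of log odds, and the scaling uses the population variance. The function names and the example probabilities are hypothetical, and the conservation scores are omitted.

```python
import math
from statistics import mean, pstdev


def chromatin_effect_features(p_ref, p_alt):
    """Two features per chromatin profile, both as absolute values.

    abs_diff: |P_ref - P_alt|
    rel_diff: |logit(P_ref) - logit(P_alt)|  (assumed definition of relative difference)
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    abs_diff = [abs(r - a) for r, a in zip(p_ref, p_alt)]
    rel_diff = [abs(logit(r) - logit(a)) for r, a in zip(p_ref, p_alt)]
    return abs_diff + rel_diff


def standardize_columns(rows):
    """Scale each feature column to mean 0 and variance 1."""
    scaled_cols = []
    for col in zip(*rows):
        mu, sd = mean(col), pstdev(col)
        # A constant column carries no information; map it to zeros.
        scaled_cols.append([(x - mu) / sd if sd > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*scaled_cols)]


# Made-up (reference, alternative) predictions for three hypothetical
# variants across two chromatin profiles.
variants = [
    ([0.90, 0.20], [0.50, 0.25]),
    ([0.10, 0.60], [0.12, 0.10]),
    ([0.40, 0.40], [0.40, 0.80]),
]
features = [chromatin_effect_features(ref, alt) for ref, alt in variants]
scaled = standardize_columns(features)
```

In a real pipeline the scaling parameters would be estimated on the training variants and reused unchanged for any new variant.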
Supplementary Figure 6 DeepSEA functional significance score prioritizes functional noncoding variants with high performance
The DeepSEA functional significance score measures the overall significance of predicted chromatin effects and evolutionary conservation scores. It is unsupervised and therefore not biased toward any training set of functional variant annotations (see Online Methods). Notably, the DeepSEA functional significance score still surpassed the performance of previous methods even though no supervised training was used (compare to Fig. 3). Performance was measured by the area under the receiver operating characteristic curve (AUC). The x axis shows the average distance of each negative-variant group to the nearest positive variant. The ‘all’ negative-variant groups are randomly selected negative 1000 Genomes SNPs.
Supplementary Figure 7 Dissecting DeepSEA functional SNP prioritization performance with subsets of input features
The performance of DeepSEA functional SNP prioritization models on HGMD regulatory mutations, noncoding eQTLs and noncoding trait-associated (GWAS) SNPs was analyzed by comparison with models trained on only the predicted chromatin effect features or only the evolutionary conservation features. Performance was measured by the area under the receiver operating characteristic curve (AUC). The x axis shows the average distance of each negative-variant group to the nearest positive variant. The ‘all’ negative-variant groups are randomly selected negative 1000 Genomes SNPs.
Supplementary Figure 8 DeepSEA-based classifier prioritized functionally annotated indels with high performance
Prioritization performance for HGMD regulatory indels was evaluated against groups of negative 1000 Genomes indels at different distances from the positive indels (average distance shown on the x axis). Performance was measured by the area under the receiver operating characteristic curve (AUC). The prioritization model was trained on HGMD regulatory single-nucleotide substitution mutations against negative variants with an average distance of 1200 bp.
Supplementary Figures 1–8 and Supplementary Note (PDF 794 kb)
List of all publicly available chromatin feature profile files used for training DeepSEA (XLSX 22 kb)
DeepSEA prediction performance for each transcription factor, DNase I hypersensitive site, and histone mark profile (XLSX 79 kb)
Sequence based allele specific DNase I hypersensitivity predictions for allele imbalanced variants called from Digital Genomic Footprinting DNase-seq data (CSV 4745 kb)
Allele-imbalance DNase I hypersensitivity prediction performance for 35 cell types (XLSX 13 kb)
DeepSEA functional variant prioritization model predictions for noncoding GRASP eQTLs and negative variants sets (CSV 63482 kb)
DeepSEA functional variant prioritization model predictions for noncoding GWAS Catalog SNPs and negative variant sets (CSV 56051 kb)
Feature rankings for noncoding functional variant prioritization tasks (XLSX 841 kb)
Cite this article
Zhou, J., Troyanskaya, O. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods 12, 931–934 (2015). https://doi.org/10.1038/nmeth.3547