Most variants implicated in common human disease by genome-wide association studies (GWAS) lie in noncoding sequence intervals. Despite the suggestion that regulatory element disruption represents a common theme, identifying causal risk variants within implicated genomic regions remains a major challenge. Here we present a new sequence-based computational method to predict the effect of regulatory variation, using a classifier (gkm-SVM) that encodes cell type–specific regulatory sequence vocabularies. The induced change in the gkm-SVM score, deltaSVM, quantifies the effect of variants. We show that deltaSVM accurately predicts the impact of SNPs on DNase I sensitivity in their native genomic contexts and accurately predicts the results of dense mutagenesis of several enhancers in reporter assays. Previously validated GWAS SNPs yield large deltaSVM scores, and we predict new risk-conferring SNPs for several autoimmune diseases. Thus, deltaSVM provides a powerful computational approach to systematically identify functional regulatory variants.
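The idea of deltaSVM as an induced change in score can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sequence score is a plain sum of per-k-mer weights, and the names `score`, `delta_svm` and the toy weight table are hypothetical. Under that assumption only the k-mers overlapping the variant contribute to the difference.

```python
def score(seq, weights, k):
    """Sum the weights of every k-mer in seq (k-mers missing from the table count as 0)."""
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


def delta_svm(ref_seq, pos, alt_base, weights, k):
    """Score change caused by substituting alt_base at 0-based position pos.

    Assumption: the score is additive over k-mers, so it is enough to
    rescore the window of k-mers that overlap the variant position.
    """
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    start = max(0, pos - k + 1)
    end = min(len(ref_seq), pos + k)
    return score(alt_seq[start:end], weights, k) - score(ref_seq[start:end], weights, k)


# Toy weights for 3-mers (hypothetical values, chosen only for illustration).
toy_weights = {"ACG": 1.0, "CGT": 0.5, "ATG": -2.0}
print(delta_svm("AACGTT", 2, "T", toy_weights, 3))
```

A negative value here means the alternate allele lowers the toy score relative to the reference allele; the sign convention (alternate minus reference) is also an assumption of this sketch.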
Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367 (2009).
Maurano, M.T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 95, 535–552 (2014).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Ritchie, G.R.S., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).
Hardison, R.C. & Taylor, J. Genomic approaches towards finding cis-regulatory modules in animals. Nat. Rev. Genet. 13, 469–483 (2012).
Ghandi, M., Lee, D., Mohammad-Noori, M. & Beer, M.A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 10, e1003711 (2014).
ENCODE Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Bernstein, B.E. et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. 28, 1045–1048 (2010).
Lee, D., Karchin, R. & Beer, M.A. Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 21, 2167–2180 (2011).
Fletez-Brant, C., Lee, D., McCallion, A.S. & Beer, M.A. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets. Nucleic Acids Res. 41, W544–W556 (2013).
Gorkin, D.U. et al. Integration of ChIP-seq and machine learning reveals enhancers and a predictive regulatory sequence vocabulary in melanocytes. Genome Res. 22, 2290–2301 (2012).
Degner, J.F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390–394 (2012).
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
Davydov, E.V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).
Lee, D. & Beer, M.A. in Genome Analysis: Current Procedures and Applications (ed. Poptsova, M.S.) 101–120 (Horizon Scientific Press, 2014).
Ghandi, M., Mohammad-Noori, M. & Beer, M.A. Robust k-mer frequency estimation using gapped k-mers. J. Math. Biol. 69, 469–500 (2014).
Pickrell, J.K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768–772 (2010).
Murisier, F., Guichard, S. & Beermann, F. A conserved transcriptional enhancer that specifies Tyrp1 expression to melanocytes. Dev. Biol. 298, 644–655 (2006).
Murisier, F., Guichard, S. & Beermann, F. The tyrosinase enhancer is activated by Sox10 and Mitf in mouse melanocytes. Pigment Cell Res. 20, 173–184 (2007).
Patwardhan, R.P. et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nat. Biotechnol. 30, 265–270 (2012).
Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
Kheradpour, P. et al. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res. 23, 800–811 (2013).
Huang, Q. et al. A prostate cancer susceptibility allele at 6q22 increases RFX6 expression by modulating HOXB13 chromatin binding. Nat. Genet. 46, 126–135 (2014).
Bauer, D.E. et al. An erythroid enhancer of BCL11A subject to genetic variation determines fetal hemoglobin level. Science 342, 253–257 (2013).
Musunuru, K. et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466, 714–719 (2010).
Farh, K.K.-H. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015).
Jin, Y. et al. Genome-wide association analyses identify 13 new susceptibility loci for generalized vitiligo. Nat. Genet. 44, 676–680 (2012).
Barrett, J.C. et al. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat. Genet. 41, 703–707 (2009).
Franke, A. et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn's disease susceptibility loci. Nat. Genet. 42, 1118–1125 (2010).
Barrett, J.C. et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat. Genet. 40, 955–962 (2008).
International Multiple Sclerosis Genetics Consortium. Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat. Genet. 45, 1353–1360 (2013).
Dubois, P.C.A. et al. Multiple common variants for celiac disease influencing immune gene expression. Nat. Genet. 42, 295–302 (2010).
Parkes, M. et al. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nat. Genet. 39, 830–832 (2007).
Hinds, D.A. et al. A genome-wide association meta-analysis of self-reported allergy identifies shared and allergy-specific susceptibility loci. Nat. Genet. 45, 907–911 (2013).
Mells, G.F. et al. Genome-wide association study identifies 12 new susceptibility loci for primary biliary cirrhosis. Nat. Genet. 43, 329–332 (2011).
Trynka, G. et al. Dense genotyping identifies and localizes multiple common and rare variant association signals in celiac disease. Nat. Genet. 43, 1193–1201 (2011).
Eyre, S. et al. High-density genetic mapping identifies new susceptibility loci for rheumatoid arthritis. Nat. Genet. 44, 1336–1340 (2012).
Cooper, J.D. et al. Seven newly identified loci for autoimmune thyroid disease. Hum. Mol. Genet. 21, 5202–5208 (2012).
Gourraud, P.-A. et al. A genome-wide association study of brain lesion distribution in multiple sclerosis. Brain 136, 1012–1024 (2013).
Liu, J.Z. et al. Dense fine-mapping study identifies new susceptibility loci for primary biliary cirrhosis. Nat. Genet. 44, 1137–1141 (2012).
Zhang, Y. et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
Heintzman, N.D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007).
Heintzman, N.D. et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459, 108–112 (2009).
This research was supported in part by US National Institutes of Health grant R01 NS62972 to A.S.M. and by grant R01 HG007348 to M.A.B.
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Correlation of deltaSVM and dsQTL effect size drops with increasing distance between dsQTL SNPs and the center of the associated DNase I–sensitive regions.
The original set of dsQTLs was defined as SNPs within ±1,000 bp of covarying hypersensitive regions (Degner et al., ref. 13). We find that deltaSVM is consistent with dsQTL effect size (beta) only when we constrain the set of dsQTLs to be within 200 bp of the modulated DHS region: (a) 0–50 bp (red), (b) 50–200 bp (green), (c) 200–500 bp and (d) 500–1,000 bp. Thus, our analysis is consistent with a local mechanism of action for dsQTLs.
Supplementary Figure 2 Bases predicted to reduce the activity of functional regions are evolutionarily constrained.
We calculated the average deltaSVM scores of all three possible mutations at each base within LCL GM12878 DHSs and compared the conservation (phyloP) for bases causing (a) negative (red), (b) neutral (gray) and (c) positive (blue) deltaSVM-predicted impact (the top 1% of bases with negative deltaSVM, 1% of bases with deltaSVM near 0 and the top 1% of bases with positive deltaSVM; n = 63,123). (d) Differential distributions relative to bases with neutral deltaSVM. Bases with negative or positive deltaSVM are more conserved than those with neutral deltaSVM; P < 1 × 10^−300 (below machine precision) and P < 1 × 10^−14 (Kolmogorov-Smirnov test), respectively. Also, bases with negative deltaSVM are much more conserved than those with positive deltaSVM (average phyloP of 1.00 versus 0.20, P < 1 × 10^−300).
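The per-base averaging described in this caption can be sketched as below. This is an illustrative sketch only: `mean_delta_per_base` and the `delta_fn` callback (any function returning a score change for a substitution) are hypothetical names, and the toy scoring function stands in for a trained model.

```python
def mean_delta_per_base(seq, delta_fn):
    """For each position, average delta_fn over the three possible substitutions.

    delta_fn(seq, pos, alt_base) is assumed to return the predicted score
    change for replacing the base at pos with alt_base.
    """
    averages = []
    for pos, ref_base in enumerate(seq):
        alts = [b for b in "ACGT" if b != ref_base]  # the three non-reference bases
        averages.append(sum(delta_fn(seq, pos, b) for b in alts) / len(alts))
    return averages


# Toy stand-in for a model: the "effect" depends only on the substituted base.
toy_effect = {"A": 0.0, "C": 3.0, "G": 6.0, "T": 9.0}
print(mean_delta_per_base("AC", lambda seq, pos, alt: toy_effect[alt]))
```

The resulting per-base averages could then be ranked to pick the most negative, near-zero and most positive bases for comparison against a conservation track.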
Supplementary Figure 3 Correlations of deltaSVM and in vivo mutation effect size in the ALDOB enhancer using an aggregate model.
We averaged the deltaSVM scores of all three possible mutations at each base and compared them with the expression changes from the univariate model reported by Patwardhan et al. (ref. 22).
Supplementary Figure 4 High-confidence predicted causal SNPs in loci associated with autoimmune disease.
The significance of the maximum of Abs(deltaSVM) depends on the number of flanking candidate causal SNPs. Sampling of random SNPs scored with the TH1 gkm-SVM yielded the solid curves for the top 2% of all loci and the mean, with standard deviation shown (dashed line). Seventeen of the 413 immune-associated loci exceed the 2% threshold, whereas 8 would be expected by chance.
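The chance expectation quoted above follows from simple arithmetic: 2% of 413 loci is about 8. The sketch below reproduces that figure and, under the added assumption (not stated in the source) that loci exceed the threshold independently, computes a binomial tail probability for observing 17 or more. The function name `binom_tail` is hypothetical.

```python
from math import comb


def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the lower tail."""
    lower = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - lower


n_loci = 413      # immune-associated loci (from the caption)
threshold = 0.02  # top 2% cutoff (from the caption)
observed = 17     # loci exceeding the cutoff (from the caption)

expected = n_loci * threshold  # about 8.26, i.e. roughly 8 loci by chance
tail = binom_tail(n_loci, threshold, observed)  # assumes independent loci
print(expected, tail)
```

The independence assumption is a simplification; the sampling procedure described in the caption accounts for the number of flanking candidate SNPs per locus, which this calculation does not.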
Supplementary Figure 5 Precision of deltaSVM prediction of dsQTLs as a function of gkm-SVM feature length.
As in Figure 2e, with varying (l, k) values (where l is the total k-mer length and k is the number of ungapped positions). Precision improves as k is increased, but gapped k-mer performance is always better than that of ungapped k-mers (where l = k). For this large training set (44,768 sequences), (11, 7) is slightly better than the default (10, 6), but for smaller training sets our default feature set (10, 6) is recommended.
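To make the (l, k) notation concrete, the sketch below enumerates the gapped k-mers contained in a single l-mer by choosing which k of the l positions stay informative. It is an illustration of the feature definition only, not the gkm-SVM kernel computation; the function name `gapped_kmers` and the use of "N" as the gap symbol are assumptions.

```python
from itertools import combinations
from math import comb


def gapped_kmers(lmer, k, gap="N"):
    """Yield every gapped k-mer of an l-mer: keep k positions, mask the other l - k."""
    for keep in combinations(range(len(lmer)), k):
        kept = set(keep)
        yield "".join(c if i in kept else gap for i, c in enumerate(lmer))


print(list(gapped_kmers("ACG", 2)))  # the three ways to keep 2 of 3 positions
print(comb(10, 6), comb(11, 7))      # gapped features per l-mer for (10, 6) and (11, 7)
```

Each l-mer therefore contributes C(l, k) gapped features, so moving from (10, 6) to (11, 7) raises the count per l-mer from 210 to 330, which is one way to see why the larger setting may need more training data.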
Supplementary Figure 6 Constraining distance to a TSS in the negative set does not affect the precision of deltaSVM prediction of dsQTLs.
In Figure 2, the gkm-SVM was trained on a negative sequence set matched for GC distribution and repeat fraction, but distance to a TSS was unconstrained. We generated an additional negative set that matched the GC distribution in the GM12878 positive set (a) but also matched the distribution of distance to a TSS of the positive set (b). As shown in c and d, using a gkm-SVM trained on this TSS-matched negative set does not affect performance in predicting dsQTLs.
Supplementary Figure 7 Constraining distance to a TSS or LD for negative dsQTL control SNPs does not affect the precision of deltaSVM prediction of dsQTLs.
In Figure 2, deltaSVM predictions were tested on the positive dsQTLs and a 50 times larger set of negative dsQTL control SNPs selected at random from the genome. Here we constrain the distance to a TSS for negative dsQTL control SNPs to match the distribution of distance to a TSS for the positive dsQTLs. This set is already matched to the positive dsQTL set in terms of the number of SNPs in strong LD (a). Further constraining distance to a TSS (b) does not affect performance in predicting dsQTLs relative to either negative set (c,d).
Supplementary Figures 1–7. (PDF 303 kb)
All predictions of the 574 dsQTL SNPs and the 27,735 control SNPs. (XLSX 3759 kb)
deltaSVM predictions of all possible point mutations in the Tyr and Tyrp1 enhancers. (XLSX 58 kb)
Experimental validation results of randomly selected deltaSVM predictions from the Tyr and Tyrp1 enhancers. (XLSX 11 kb)
deltaSVM predictions of all possible single point mutations in the ALDOB enhancer and the corresponding in vivo effect size. (XLSX 48 kb)
deltaSVM predictions of mutations in K562 and HepG2 enhancers. (XLSX 23 kb)
deltaSVM predictions of all 3,113 autoimmune disease–associated SNPs. (XLSX 155 kb)
About this article
Cite this article
Lee, D., Gorkin, D., Baker, M. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet 47, 955–961 (2015). https://doi.org/10.1038/ng.3331