Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.
Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367 (2009).
Gulko, B., Hubisz, M.J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
Wang, Y. et al. Sequencing and comparative analysis of a conserved syntenic segment in the Solanaceae. Genetics 180, 391–408 (2008).
Lindblad-Toh, K. et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011).
Haudry, A. et al. An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat. Genet. 45, 891–898 (2013).
ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).
Gerstein, M.B. et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330, 1775–1787 (2010).
Roy, S. et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787–1797 (2010).
Dunham, I. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Ritchie, G.R.S., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).
Shihab, H.A. et al. An integrative approach to predicting the functional effects of noncoding and coding sequence variation. Bioinformatics 31, 1536–1543 (2015).
Alipanahi, B., Delong, A., Weirauch, M.T. & Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Zhou, J. & Troyanskaya, O.G. Predicting effects of noncoding variants with deep-learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Kelley, D.R., Snoek, J. & Rinn, J.L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Quang, D., Chen, Y. & Xie, X. DANN: a deep-learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31, 761–763 (2015).
Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).
Gronau, I., Arbiza, L., Mohammed, J. & Siepel, A. Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Mol. Biol. Evol. 30, 1159–1171 (2013).
Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).
Stenson, P.D. et al. The human gene mutation database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing, and personalized genomic medicine. Hum. Genet. 133, 1–9 (2013).
Landrum, M.J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).
Pollard, K.S., Hubisz, M.J., Rosenbloom, K.R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
Rumelhart, D., Hinton, G. & Williams, R. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
Zerbino, D.R., Wilder, S.P., Johnson, N., Juettemann, T. & Flicek, P.R. The Ensembl Regulatory Build. Genome Biol. 16, 56 (2015).
Gaffney, D.J., Blekhman, R. & Majewski, J. Selective constraints in experimentally defined primate regulatory regions. PLoS Genet. 4, e1000157 (2008).
Chiaromonte, F. et al. The share of human genomic DNA under selection estimated from human–mouse genomic alignments. Cold Spring Harb. Symp. Quant. Biol. 68, 245–254 (2003).
Meader, S.J., Ponting, C.P. & Lunter, G. Massive turnover of functional sequence in human and other mammalian genomes. Genome Res. 20, 1335–1343 (2010).
Rands, C.M., Meader, S., Ponting, C.P. & Lunter, G. 8.2% of the human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genet. 10, e1004525 (2014).
Lesurf, R. et al. ORegAnno 3.0: a community-driven resource for curated regulatory annotation. Nucleic Acids Res. 44, D126–D132 (2016).
Davydov, E.V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP. PLoS Comput. Biol. 6, e1001025 (2010).
Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J.D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48, 214–220 (2016).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Core, L.J. et al. Analysis of nascent RNA identifies a unified architecture of transcription initiation regions at mammalian promoters and enhancers. Nat. Genet. 46, 1311–1320 (2014).
Andersson, R., Sandelin, A. & Danko, C.G. A unified architecture of transcriptional regulatory elements. Trends Genet. 31, 426–433 (2015).
Duret, L. & Mouchiroud, D. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17, 68–74 (2000).
Drummond, D.A., Bloom, J.D., Adami, C., Wilke, C.O. & Arnold, F.H. Why highly expressed proteins evolve slowly. Proc. Natl. Acad. Sci. USA 102, 14338–14343 (2005).
Kosiol, C. et al. Patterns of positive selection in six mammalian genomes. PLoS Genet. 4, e1000144 (2008).
Voight, B.F., Kudaravalli, S., Wen, X. & Pritchard, J.K. A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).
Villar, D. et al. Enhancer evolution across 20 mammalian species. Cell 160, 554–566 (2015).
Rao, S.S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Guo, Y. et al. CRISPR inversion of CTCF sites alters genome topology and enhancer–promoter function. Cell 162, 900–910 (2015).
Tang, Z. et al. CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell 163, 1611–1627 (2015).
Wunderlich, Z. et al. Kruppel expression levels are maintained through compensatory evolution of shadow enhancers. Cell Rep. 12, 1740–1747 (2015).
Garber, M. et al. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics 25, i54–i62 (2009).
Dousse, A., Junier, T. & Zdobnov, E.M. CEGA—a catalog of conserved elements from genomic alignments. Nucleic Acids Res. 44, D96–D100 (2016).
Kent, W.J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
Dubchak, I. et al. Whole-genome rVISTA: a tool to determine enrichment of transcription factor binding sites in gene promoters from transcriptomic data. Bioinformatics 29, 2059–2061 (2013).
Pachkov, M., Balwierz, P.J., Arnold, P., Ozonov, E. & van Nimwegen, E. SwissRegulon, a database of genome-wide annotations of regulatory sites: recent updates. Nucleic Acids Res. 41, D214–D220 (2013).
Xiong, H.Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2014).
Vlachos, I.S. et al. DIANA-TarBase v7.0: indexing more than half a million experimentally supported miRNA:mRNA interactions. Nucleic Acids Res. 43, D153–D159 (2015).
Harrow, J. et al. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Gompertz, B. On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies. Philos. Trans. R. Soc. Lond. 115, 513–583 (1825).
Duchi, J., Hazan, E. & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Kim, S. ppcor: an R package for a fast calculation to semi-partial correlation coefficients. Commun. Stat. Appl. Methods 22, 665–674 (2015).
DeLong, E.R., DeLong, D.M. & Clarke-Pearson, D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
We thank I. Gronau for comments on the manuscript and members of the Siepel laboratory for helpful discussions. This research was supported by the US National Institutes of Health (NIH) grants GM102192 (A.S.) and HG008901 (A.S.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
The authors declare no competing financial interests.
Integrated supplementary information
(a) Correlation at all scored genomic positions. (b) Correlation at sites in mammalian phastCons elements.
Supplementary Figure 2 Prediction power of various computational methods for distinguishing curated transcription factor binding sites (TFBS) from likely non-TFBSs, described as receiver operating characteristic (ROC) curves.
Results are shown for the (a) "matched TSS" and (b) "matched region" schemes for pairing positive and negative examples (see Methods). We considered all TFBSs in the ORegAnno database (ref. 14) that were associated with the hg19 assembly, pooling the data for all TFs and merging overlapping binding sites (7,369 TFBSs in total). The negative controls were matched by distance or region to the pooled set.
(a) LINSIGHT highlights a disease variant in a promoter region of the TDO2 gene from the HGMD database (CR045670). The whole region, even though it is only a few hundred base pairs away from the TSS of the TDO2 gene, is not well conserved, as is evident from the low phastCons scores. In contrast, LINSIGHT predicts that dozens of bases in this region, including the variant CR045670, are under constraint because they overlap with predicted TFBSs, e.g., rVISTA TFBSs and SwissRegulon TFBSs. The variant CR045670 is supported by both the rVISTA and SwissRegulon databases but does not have high conservation scores, which highlights the importance of integrating a large number of complementary genomic features. (b) LINSIGHT highlights a splicing variant (rs173356864) related to Hurler syndrome (ref. 17). Even though this variant is very close to an essential splice site of the IDUA gene, it is not highlighted by phyloP or GERP++. In contrast, LINSIGHT is able to identify it because it integrates a large number of features, including SPIDEX and phastCons, both of which support the significance of this variant. The SPIDEX track shows the maximum absolute SPIDEX score (absolute z-score) over the three alternative variants at each position. Because LINSIGHT is trained on noncoding sequences, its scores are undefined in coding regions.
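The SPIDEX track described in panel (b) reduces the per-variant scores at a position to a single value. A minimal sketch of that reduction follows, assuming a hypothetical mapping from alternative allele to z-score; the data layout and the numbers are illustrative and are not taken from SPIDEX itself.

```python
def max_abs_zscore(z_by_alt):
    """Return the largest absolute z-score among the alternative alleles.

    z_by_alt is a hypothetical dict such as {"A": -2.4, "C": 0.3, "T": 1.1},
    one entry per alternative allele at a single genomic position.
    """
    if not z_by_alt:
        raise ValueError("no scores available at this position")
    return max(abs(z) for z in z_by_alt.values())


# Illustrative (made-up) z-scores for the three alternative alleles at one site.
position_scores = {"A": -2.4, "C": 0.3, "T": 1.1}
track_value = max_abs_zscore(position_scores)
```

Taking the absolute value first means that strongly negative and strongly positive predicted splicing effects are treated alike.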
Supplementary Figure 4 Genomic distributions of noncoding disease variants in the ClinVar (left) and HGMD (right) data sets.
See Methods for definitions of Promoter, Splicing, UTR, and Other genomic regions.
Supplementary Figure 5 Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.
Plots are similar to Figure 3 except that power is quantified using the Area Under the Precision-Recall Curve (AUPRC) statistic.
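For readers unfamiliar with the AUPRC statistic, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. It is an illustration only: the caption does not say which estimator was used, and the function name and inputs are assumptions.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision values at each positive example.

    scores: higher means more likely to be a positive (e.g. disease variant).
    labels: 1 for positive examples, 0 for negative examples.
    Tied scores are ranked in input order (Python's sort is stable).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == 1)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos
```

Unlike the AUC, this statistic depends on the ratio of positives to negatives, which is one reason to report both.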
Supplementary Figure 6 Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.
Plots are similar to Figure 3 except that singleton variants in the 1000 Genomes Project phase 3 data set were used as negative examples (ref. 18).
Supplementary Figure 7 Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.
Plots are similar to Supplementary Figure 6 except that power is quantified using the Area Under the Precision-Recall Curve (AUPRC) statistic.
Supplementary Figure 8 Contributions of various genomic features to the identification of disease-associated variants from HGMD and ClinVar.
The contribution of each class of genomic features is measured as the average reduction in the area under the curve (AUC) statistic resulting from the removal of those features. Results are shown for three matching schemes for positive and negative examples, and for variants in 1-kb promoters (n = 478), proximal to splice sites (n = 65), in UTRs (n = 424), and all other variants (n = 615). The numbers of positives and negatives were matched by random subsampling, which was performed 100 times to calculate the average reduction in the AUC statistic. Error bars represent ± 1 standard deviation.
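The procedure in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the "ablated" scores stand for predictions from a model trained without one feature class, and the AUC is computed from the Mann-Whitney statistic.

```python
import random
from statistics import mean, stdev


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_reduction(pos_full, neg_full, pos_ablated, neg_ablated, n_rep=100, seed=0):
    """Mean and SD of AUC(full model) - AUC(ablated model) over matched subsamples.

    pos_full[i] and pos_ablated[i] must score the same variant (likewise for
    negatives). Each replicate draws equal numbers of positives and negatives.
    """
    rng = random.Random(seed)
    k = min(len(pos_full), len(neg_full))
    diffs = []
    for _ in range(n_rep):
        pos_idx = rng.sample(range(len(pos_full)), k)
        neg_idx = rng.sample(range(len(neg_full)), k)
        full = auc([pos_full[i] for i in pos_idx], [neg_full[i] for i in neg_idx])
        ablated = auc([pos_ablated[i] for i in pos_idx], [neg_ablated[i] for i in neg_idx])
        diffs.append(full - ablated)
    return mean(diffs), stdev(diffs)
```

The returned standard deviation corresponds to the error bars; a larger mean reduction indicates that the removed feature class contributed more to prediction in that genomic region.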
Plots are similar to Figure 4 except that no regional features were used in the training of LINSIGHT. (a) Probability of fitness consequences for mutations in enhancers (measured by average LINSIGHT score) is positively correlated with the number of cell types in which each enhancer is active (Spearman's rank correlation coefficient ρ = 0.253; two-tailed p-value < 10^−15). Results are shown for 29,303 enhancers in 69 cell types. (b) Probability of fitness consequences for mutations in enhancers is positively correlated with probability of fitness consequences for mutations in associated promoters (Spearman's rank correlation coefficient ρ = 0.156; two-tailed p-value < 10^−15). Results are shown for 25,067 enhancer-promoter pairs.
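Spearman's rank correlation, the statistic reported in both panels, is the Pearson correlation of the ranks of the two variables. A minimal sketch follows, assuming hypothetical per-enhancer inputs (for example, the number of cell types in which an enhancer is active and its average score); it does not compute the p-value.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks enter the calculation, the statistic is insensitive to the skewed distribution of scores and to any monotone rescaling of either variable.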
About this article
Cite this article
Huang, Y., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat Genet 49, 618–624 (2017). https://doi.org/10.1038/ng.3810