Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data

Huang, Yi-Fei; Gulko, Brad; Siepel, Adam

doi:10.1038/ng.3810

Technical Report
Published: 13 March 2017

Nature Genetics volume 49, pages 618–624 (2017)

Abstract

Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.
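The generalized linear model component mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the link function, the feature set, or the fitted weights, so everything below (the logistic link, the function name `glm_probability`, and the example numbers) is an assumption chosen for illustration, not the published LINSIGHT parameterization.

```python
import math

def glm_probability(features, weights, intercept):
    """Logistic-link GLM: map one site's feature vector to a value in (0, 1).

    The logistic link is an assumption made for this sketch.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    linear = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical binary features for one noncoding site, e.g. "overlaps a
# predicted TFBS" and "lies in a conserved element", with made-up weights.
site_features = [1.0, 0.0]
example_weights = [1.2, 2.0]
p_site = glm_probability(site_features, example_weights, intercept=-3.0)
```

In a model of this general shape, the weights would be fitted jointly with the evolutionary model rather than set by hand; the sketch only shows how a linear combination of features becomes a per-site probability.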

**Figure 1: Conceptual overview of LINSIGHT.**

**Figure 2: Summary of LINSIGHT scores across the noncoding human genome (3.001 billion nucleotide sites).**

**Figure 3: Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.**

**Figure 4: Evolutionary constraints on enhancers.**

Acknowledgements

We thank I. Gronau for comments on the manuscript and members of the Siepel laboratory for helpful discussions. This research was supported by the US National Institutes of Health (NIH) grants GM102192 (A.S.) and HG008901 (A.S.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Author information

Authors and Affiliations

Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
Yi-Fei Huang, Brad Gulko & Adam Siepel
Graduate Field of Computer Science, Cornell University, Ithaca, New York, USA
Brad Gulko

Contributions

Y.-F.H. and A.S. conceived and designed the study; Y.-F.H. designed and implemented the LINSIGHT method; Y.-F.H. and B.G. analyzed the data; A.S. supervised the research; Y.-F.H. and A.S. wrote the manuscript with review and feedback from B.G.

Corresponding author

Correspondence to Adam Siepel.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Spearman's correlation coefficients (ρ) for all pairs of scores considered.

(a) Correlation at all scored genomic positions. (b) Correlation at sites in mammalian phastCons elements.

Supplementary Figure 2 Prediction power of various computational methods for distinguishing curated transcription factor binding sites (TFBS) from likely non-TFBSs, described as receiver operating characteristic (ROC) curves.

Results are shown for the (a) "matched TSS" and (b) "matched region" schemes for pairing positive and negative examples (see Methods). We considered all TFBSs in the ORegAnno database¹⁴ that were associated with the hg19 assembly, pooling the data for all TFs and merging overlapping binding sites (7,369 TFBSs in total). The negative controls were matched by distance or region to the pooled set.

Supplementary Figure 3 Additional examples of known disease variants detected by LINSIGHT.

(a) LINSIGHT highlights a disease variant (CR045670) from the HGMD database in a promoter region of the TDO2 gene. Although the whole region is only a few hundred base pairs from the TSS of TDO2, it is not well conserved, as is evident from the low phastCons scores. In contrast, LINSIGHT predicts that dozens of bases in this region, including the variant CR045670, are under constraint because they overlap with predicted TFBSs, e.g., rVISTA TFBSs and SwissRegulon TFBSs. The variant CR045670 is supported by both the rVISTA and SwissRegulon databases but does not have high conservation scores, which highlights the importance of integrating a large number of complementary genomic features. (b) LINSIGHT highlights a splicing variant (rs173356864) related to Hurler syndrome¹⁷. Even though this variant is very close to an essential splice site of the IDUA gene, it is not highlighted by phyloP or GERP++. In contrast, LINSIGHT identifies it because it integrates a large number of features, including SPIDEX and phastCons, both of which support the significance of this variant. The SPIDEX track shows the maximum absolute SPIDEX score (absolute z-score) over the three alternative variants at each position. Because LINSIGHT is trained on noncoding sequences, its scores are undefined in coding regions.
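The SPIDEX track in panel (b) reduces several per-variant z-scores to one value per position. A minimal sketch of that reduction follows; the input layout (a mapping from alternative allele to z-score) and the example numbers are hypothetical and are not taken from the SPIDEX data.

```python
def max_abs_zscore(z_by_allele):
    """Return the largest absolute z-score across the alternative alleles at one position."""
    if not z_by_allele:
        raise ValueError("need at least one alternative allele")
    return max(abs(z) for z in z_by_allele.values())

# Hypothetical z-scores for the three possible substitutions at a reference "A".
example_scores = {"C": -2.7, "G": 0.4, "T": 1.9}
track_value = max_abs_zscore(example_scores)
```

Taking the absolute value first means that a strongly negative score counts as much as a strongly positive one, which is why the track is described in terms of absolute z-scores.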

Supplementary Figure 4 Genomic distributions of noncoding disease variants in the ClinVar (left) and HGMD (right) data sets.

See Methods for definitions of Promoter, Splicing, UTR, and Other genomic regions.

Supplementary Figure 5 Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.

Plots are similar to Figure 3 except that power is quantified using the Area Under the Precision-Recall Curve (AUPRC) statistic.
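For readers unfamiliar with the AUPRC statistic, the sketch below computes it as average precision, a common step-wise estimate of the area under the precision-recall curve. The estimator actually used for this figure is not stated here, so this choice, the function name, and the toy data are assumptions.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (average precision).

    scores: higher means more likely positive; labels: 1 for positive, 0 for negative.
    Tied scores are handled as a single threshold, so input order does not matter.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    area = 0.0
    tp = fp = 0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every example that shares the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return area

# Toy example: two disease variants (label 1) and two controls (label 0).
toy_ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Unlike the AUC, this statistic depends on the ratio of positives to negatives, so values are only comparable between methods evaluated on the same example set.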

Supplementary Figure 6 Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.

Plots are similar to Figure 3 except that singleton variants in the 1000 Genomes Project phase 3 data set were used as negative examples¹⁸.

Supplementary Figure 7 Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.

Plots are similar to Supplementary Figure 6 except that power is quantified using the Area Under the Precision-Recall Curve (AUPRC) statistic.

Supplementary Figure 8 Contributions of various genomic features to the identification of disease-associated variants from HGMD and ClinVar.

The contribution of each class of genomic features is measured as the average reduction in the area under the curve (AUC) statistic resulting from the removal of those features. Results are shown for three matching schemes for positive and negative examples, and for variants in 1-kb promoters (n = 478), proximal to splice sites (n = 65), in UTRs (n = 424), and all other variants (n = 615). The numbers of positives and negatives were matched by random subsampling, which was performed 100 times to calculate the average reduction of the AUC statistic. Error bars represent ±1 standard deviation.
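The ablation procedure in this caption can be sketched as follows, assuming per-variant scores are already available from a model trained with and without one feature class. The function names, the balancing rule (subsample both classes to the size of the smaller one), and the toy data are assumptions; the caption specifies only that subsampling was repeated 100 times.

```python
import random
import statistics

def auc(pos, neg):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_reduction(full, ablated, labels, n_rep=100, seed=0):
    """Mean and SD of AUC(full) - AUC(ablated) over balanced random subsamples.

    full / ablated: per-variant scores with and without one feature class.
    labels: 1 for disease variants, 0 for negative controls.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos_idx), len(neg_idx))
    drops = []
    for _ in range(n_rep):
        # Draw equally many positives and negatives for this replicate.
        p = rng.sample(pos_idx, k)
        n = rng.sample(neg_idx, k)
        a_full = auc([full[i] for i in p], [full[i] for i in n])
        a_abl = auc([ablated[i] for i in p], [ablated[i] for i in n])
        drops.append(a_full - a_abl)
    return statistics.mean(drops), statistics.stdev(drops)

# Toy data: the full model separates the classes, the ablated model does not.
toy_labels = [1, 1, 0, 0, 0, 0]
toy_full = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4]
toy_ablated = [0.5] * 6
mean_drop, sd_drop = auc_reduction(toy_full, toy_ablated, toy_labels)
```

The returned standard deviation corresponds to the kind of error bar shown in the figure, computed over the subsampling replicates.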

Supplementary Figure 9 Evolutionary constraints on enhancers.

Plots are similar to Figure 4 except that no regional features were used in the training of LINSIGHT. (a) Probability of fitness consequences for mutations in enhancers (measured by average LINSIGHT score) is positively correlated with the number of cell types in which each enhancer is active (Spearman's rank correlation coefficient ρ = 0.253; two-tailed p-value < 10⁻¹⁵). Results are shown for 29,303 enhancers in 69 cell types. (b) Probability of fitness consequences for mutations in enhancers is positively correlated with probability of fitness consequences for mutations in associated promoters (Spearman's rank correlation coefficient ρ = 0.156; two-tailed p-value < 10⁻¹⁵). Results are shown for 25,067 enhancer-promoter pairs.
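Both panels report Spearman's rank correlation coefficient. A minimal sketch of that statistic, computed as the Pearson correlation of average ranks, is given below; the function names and the example values are hypothetical and do not reproduce the enhancer data.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: number of cell types in which each enhancer is active,
# paired with that enhancer's average score.
n_cell_types = [1, 2, 5, 12, 30]
mean_scores = [0.04, 0.06, 0.05, 0.09, 0.15]
rho = spearman_rho(n_cell_types, mean_scores)
```

Because the statistic uses ranks only, it captures any monotonic relationship and is insensitive to the skewed distribution of per-enhancer scores.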

Supplementary Figure 10 Distribution of fitness consequences of mutations at 8,082 tissue-specific enhancers (measured by average LINSIGHT score per enhancer) across 41 tissue types.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10, Supplementary Tables 1–6 and Supplementary Note (PDF 2737 kb)

About this article

Cite this article

Huang, YF., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat Genet 49, 618–624 (2017). https://doi.org/10.1038/ng.3810

Received: 15 August 2016
Accepted: 13 February 2017
Published: 13 March 2017
Issue Date: April 2017
DOI: https://doi.org/10.1038/ng.3810
