Technical Report

Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data

Nature Genetics volume 49, pages 618–624 (2017)

Abstract

Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.
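To make the phrase "generalized linear model for functional genomic data" concrete, the following is a minimal sketch of a generic GLM with a logistic link that maps per-site annotation features to a value in (0, 1). It is an illustration only, not the LINSIGHT implementation: the feature names, weights, bias, and the function names `sigmoid` and `site_score` are all hypothetical, and the probabilistic model of molecular evolution that LINSIGHT couples to the linear model is not shown.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic link: maps any real number into (0, 1), computed stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def site_score(features, weights, bias):
    """Linear predictor over per-site features, passed through the logistic link."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical binary annotations for one site, e.g.
# [conserved element, open chromatin, repeat]. Values are made up.
weights = [1.5, 0.8, -0.4]
bias = -2.0

annotated = site_score([1, 1, 0], weights, bias)
unannotated = site_score([0, 0, 0], weights, bias)
print(f"annotated site:   {annotated:.3f}")
print(f"unannotated site: {unannotated:.3f}")
```

In this toy setup a site carrying positively weighted annotations scores higher than a site with none. In a real model the weights would be estimated from data rather than set by hand.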

Acknowledgements

We thank I. Gronau for comments on the manuscript and members of the Siepel laboratory for helpful discussions. This research was supported by the US National Institutes of Health (NIH) grants GM102192 (A.S.) and HG008901 (A.S.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Author information

Affiliations

  1. Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.

    • Yi-Fei Huang
    • Brad Gulko
    • Adam Siepel
  2. Graduate Field of Computer Science, Cornell University, Ithaca, New York, USA.

    • Brad Gulko

Contributions

Y.-F.H. and A.S. conceived and designed the study; Y.-F.H. designed and implemented the LINSIGHT method; Y.-F.H. and B.G. analyzed the data; A.S. supervised the research; Y.-F.H. and A.S. wrote the manuscript with review and feedback from B.G.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Adam Siepel.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–10, Supplementary Tables 1–6 and Supplementary Note

About this article

DOI

https://doi.org/10.1038/ng.3810
