Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data

Abstract

Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.

Figure 1: Conceptual overview of LINSIGHT.
Figure 2: Summary of LINSIGHT scores across the noncoding human genome (3.001 billion nucleotide sites).
Figure 3: Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.
Figure 4: Evolutionary constraints on enhancers.

References

  1. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
  2. Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367 (2009).
  3. Gulko, B., Hubisz, M.J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).
  4. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
  5. Wang, Y. et al. Sequencing and comparative analysis of a conserved syntenic segment in the Solanaceae. Genetics 180, 391–408 (2008).
  6. Lindblad-Toh, K. et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011).
  7. Haudry, A. et al. An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat. Genet. 45, 891–898 (2013).
  8. ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).
  9. Gerstein, M.B. et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330, 1775–1787 (2010).
  10. Roy, S. et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787–1797 (2010).
  11. Dunham, I. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  12. Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
  13. Ritchie, G.R.S., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).
  14. Shihab, H.A. et al. An integrative approach to predicting the functional effects of noncoding and coding sequence variation. Bioinformatics 31, 1536–1543 (2015).
  15. Alipanahi, B., Delong, A., Weirauch, M.T. & Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
  16. Zhou, J. & Troyanskaya, O.G. Predicting effects of noncoding variants with deep-learning-based sequence model. Nat. Methods 12, 931–934 (2015).
  17. Kelley, D.R., Snoek, J. & Rinn, J.L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
  18. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
  19. Quang, D., Chen, Y. & Xie, X. DANN: a deep-learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31, 761–763 (2015).
  20. Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).
  21. Gronau, I., Arbiza, L., Mohammed, J. & Siepel, A. Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Mol. Biol. Evol. 30, 1159–1171 (2013).
  22. Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).
  23. Stenson, P.D. et al. The human gene mutation database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing, and personalized genomic medicine. Hum. Genet. 133, 1–9 (2013).
  24. Landrum, M.J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).
  25. Pollard, K.S., Hubisz, M.J., Rosenbloom, K.R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
  26. Rumelhart, D., Hinton, G. & Williams, R. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
  27. Zerbino, D.R., Wilder, S.P., Johnson, N., Juettemann, T. & Flicek, P.R. The Ensembl Regulatory Build. Genome Biol. 16, 56 (2015).
  28. Gaffney, D.J., Blekhman, R. & Majewski, J. Selective constraints in experimentally defined primate regulatory regions. PLoS Genet. 4, e1000157 (2008).
  29. Chiaromonte, F. et al. The share of human genomic DNA under selection estimated from human–mouse genomic alignments. Cold Spring Harb. Symp. Quant. Biol. 68, 245–254 (2003).
  30. Meader, S.J., Ponting, C.P. & Lunter, G. Massive turnover of functional sequence in human and other mammalian genomes. Genome Res. 20, 1335–1343 (2010).
  31. Rands, C.M., Meader, S., Ponting, C.P. & Lunter, G. 8.2% of the human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genet. 10, e1004525 (2014).
  32. Lesurf, R. et al. ORegAnno 3.0: a community-driven resource for curated regulatory annotation. Nucleic Acids Res. 44, D126–D132 (2016).
  33. Davydov, E.V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP. PLOS Comput. Biol. 6, e1001025 (2010).
  34. Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J.D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48, 214–220 (2016).
  35. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
  36. Core, L.J. et al. Analysis of nascent RNA identifies a unified architecture of transcription initiation regions at mammalian promoters and enhancers. Nat. Genet. 46, 1311–1320 (2014).
  37. Andersson, R., Sandelin, A. & Danko, C.G. A unified architecture of transcriptional regulatory elements. Trends Genet. 31, 426–433 (2015).
  38. Duret, L. & Mouchiroud, D. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17, 68–74 (2000).
  39. Drummond, D.A., Bloom, J.D., Adami, C., Wilke, C.O. & Arnold, F.H. Why highly expressed proteins evolve slowly. Proc. Natl. Acad. Sci. USA 102, 14338–14343 (2005).
  40. Kosiol, C. et al. Patterns of positive selection in six mammalian genomes. PLoS Genet. 4, e1000144 (2008).
  41. Voight, B.F., Kudaravalli, S., Wen, X. & Pritchard, J.K. A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).
  42. Villar, D. et al. Enhancer evolution across 20 mammalian species. Cell 160, 554–566 (2015).
  43. Rao, S.S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
  44. Guo, Y. et al. CRISPR inversion of CTCF sites alters genome topology and enhancer–promoter function. Cell 162, 900–910 (2015).
  45. Tang, Z. et al. CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell 163, 1611–1627 (2015).
  46. Wunderlich, Z. et al. Kruppel expression levels are maintained through compensatory evolution of shadow enhancers. Cell Rep. 12, 1740–1747 (2015).
  47. Garber, M. et al. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics 25, i54–i62 (2009).
  48. Dousse, A., Junier, T. & Zdobnov, E.M. CEGA—a catalog of conserved elements from genomic alignments. Nucleic Acids Res. 44, D96–D100 (2016).
  49. Kent, W.J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
  50. Dubchak, I. et al. Whole-genome rVISTA: a tool to determine enrichment of transcription factor binding sites in gene promoters from transcriptomic data. Bioinformatics 29, 2059–2061 (2013).
  51. Pachkov, M., Balwierz, P.J., Arnold, P., Ozonov, E. & van Nimwegen, E. SwissRegulon, a database of genome-wide annotations of regulatory sites: recent updates. Nucleic Acids Res. 41, D214–D220 (2013).
  52. Xiong, H.Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2014).
  53. Vlachos, I.S. et al. DIANA-TarBase v7.0: indexing more than half a million experimentally supported miRNA:mRNA interactions. Nucleic Acids Res. 43, D153–D159 (2015).
  54. Harrow, J. et al. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 22, 1760–1774 (2012).
  55. Gompertz, B. On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies. Philos. Trans. R. Soc. Lond. 115, 513–583 (1825).
  56. Duchi, J., Hazan, E. & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011).
  57. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
  58. Kim, S. ppcor: an R package for a fast calculation to semi-partial correlation coefficients. Commun. Stat. Appl. Methods 22, 665–674 (2015).
  59. DeLong, E.R., DeLong, D.M. & Clarke-Pearson, D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).

Acknowledgements

We thank I. Gronau for comments on the manuscript and members of the Siepel laboratory for helpful discussions. This research was supported by the US National Institutes of Health (NIH) grants GM102192 (A.S.) and HG008901 (A.S.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Author information

Contributions

Y.-F.H. and A.S. conceived and designed the study; Y.-F.H. designed and implemented the LINSIGHT method; Y.-F.H. and B.G. analyzed the data; A.S. supervised the research; Y.-F.H. and A.S. wrote the manuscript with review and feedback from B.G.

Corresponding author

Correspondence to Adam Siepel.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Spearman's correlation coefficients (ρ) for all pairs of scores considered.

(a) Correlation at all scored genomic positions. (b) Correlation at sites in mammalian phastCons elements.

Supplementary Figure 2 Prediction power of various computational methods for distinguishing curated transcription factor binding sites (TFBS) from likely non-TFBSs, described as receiver operating characteristic (ROC) curves.

Results are shown for the (a) "matched TSS" and (b) "matched region" schemes for pairing positive and negative examples (see Methods). We considered all TFBSs in the ORegAnno database¹⁴ that were associated with the hg19 assembly, pooling the data for all TFs and merging overlapping binding sites (7,369 TFBSs in total). The negative controls were matched by distance or region to the pooled set.

Supplementary Figure 3 Additional examples of known disease variants detected by LINSIGHT.

(a) LINSIGHT highlights a disease variant in a promoter region of the TDO2 gene from the HGMD database (CR045670). The whole region, even though it is only a few hundred base pairs away from the TSS of the TDO2 gene, is not well conserved, as is evident from the low phastCons scores. In contrast, LINSIGHT predicts that dozens of bases in this region, including the variant CR045670, are under constraint because they overlap with predicted TFBSs, e.g., rVISTA TFBSs and SwissRegulon TFBSs. The variant CR045670 is supported by both the rVISTA and SwissRegulon databases but does not have high conservation scores, which highlights the importance of integrating a large number of complementary genomic features. (b) LINSIGHT highlights a splicing variant (rs173356864) related to Hurler syndrome¹⁷. Even though this variant is very close to an essential splice site of the IDUA gene, it is not highlighted by phyloP and GERP++. In contrast, LINSIGHT is able to identify it because LINSIGHT integrates a large number of features, including SPIDEX and phastCons, both of which support the significance of this variant. The SPIDEX track shows the maximum of the absolute SPIDEX scores (absolute z-scores) over all three alternative variants at a position. Because LINSIGHT is trained on noncoding sequences, its scores are undefined in coding regions.

Supplementary Figure 4 Genomic distributions of noncoding disease variants in the ClinVar (left) and HGMD (right) data sets.

See Methods for definitions of Promoter, Splicing, UTR, and Other genomic regions.

Supplementary Figure 5 Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.

Plots are similar to Figure 3 except that power is quantified using the Area Under the Precision-Recall Curve (AUPRC) statistic.
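
For readers unfamiliar with the two summary statistics used in Figure 3 and this figure, the following is a minimal, dependency-free sketch of how an area under the ROC curve and an area under the precision-recall curve can be computed from labels and scores. It is not the authors' evaluation code: the function names are hypothetical, the AUPRC is approximated by the common step-wise "average precision" estimator, and tied scores in that estimator are simply ordered as the sort leaves them.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise AUPRC estimate: mean precision at the rank of each positive."""
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision when this positive is recalled
    return total / hits


# Toy data (illustrative only): 1 = disease variant, 0 = control variant.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
toy_auprc = average_precision(toy_labels, toy_scores)
```

The AUPRC is generally more sensitive than the AUC to how the top-ranked predictions behave, which is one reason both statistics are often reported side by side.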

Supplementary Figure 6 Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.

Plots are similar to Figure 3 except that singleton variants in the 1000 Genomes Project phase 3 data set were used as negative examples¹⁸.

Supplementary Figure 7 Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects.

Plots are similar to Supplementary Figure 6 except that power is quantified using the Area Under the Precision-Recall Curve (AUPRC) statistic.

Supplementary Figure 8 Contributions of various genomic features to the identification of disease-associated variants from HGMD and ClinVar.

The contribution of each class of genomic features is measured as the average reduction in the area under the curve (AUC) statistic resulting from the removal of those features. Results are shown for three matching schemes for positive and negative examples, and for variants in 1-kb promoters (n = 478), proximal to splicing sites (n = 65), in UTRs (n = 424), and all other variants (n = 615). The numbers of positives and negatives were matched by random subsampling, which was performed 100 times to calculate the average reduction of the AUC statistic. Error bars represent ± 1 standard deviation.
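
The ablation procedure described in this caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: `score` is a hypothetical stand-in for a trained model (here just a sum of feature values), the caption does not say whether the model was refit after removing features, and all names are invented for the example.

```python
import random
from statistics import mean, stdev


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def score(example, feature_names):
    """Hypothetical scoring rule: sum of the selected feature values."""
    return sum(example[name] for name in feature_names)


def auc_reduction(positives, negatives, all_features, removed, n_rep=100, seed=0):
    """Mean and s.d. of (full AUC - ablated AUC) over matched subsamples."""
    rng = random.Random(seed)
    kept = [name for name in all_features if name not in removed]
    n = min(len(positives), len(negatives))  # match the two class sizes
    drops = []
    for _ in range(n_rep):
        pos = rng.sample(positives, n)
        neg = rng.sample(negatives, n)
        full = auc([score(e, all_features) for e in pos],
                   [score(e, all_features) for e in neg])
        ablated = auc([score(e, kept) for e in pos],
                      [score(e, kept) for e in neg])
        drops.append(full - ablated)
    return mean(drops), stdev(drops)


# Toy data (illustrative only): "cons" separates the classes, "noise" does not.
toy_pos = [{"cons": 1.0, "noise": 0.0} for _ in range(5)]
toy_neg = [{"cons": 0.0, "noise": 0.0} for _ in range(8)]
drop_cons = auc_reduction(toy_pos, toy_neg, ["cons", "noise"], {"cons"})
drop_noise = auc_reduction(toy_pos, toy_neg, ["cons", "noise"], {"noise"})
```

In this toy setting, removing the informative feature lowers the AUC from 1.0 to 0.5, while removing the uninformative one leaves it unchanged.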

Supplementary Figure 9 Evolutionary constraints on enhancers.

Plots are similar to Figure 4 except that no regional features were used in the training of LINSIGHT. (a) Probability of fitness consequences for mutations in enhancers (measured by average LINSIGHT score) is positively correlated with the number of cell types in which each enhancer is active (Spearman's rank correlation coefficient ρ = 0.253; two-tailed p-value < 10⁻¹⁵). Results are shown for 29,303 enhancers in 69 cell types. (b) Probability of fitness consequences for mutations in enhancers is positively correlated with probability of fitness consequences for mutations in associated promoters (Spearman's rank correlation coefficient ρ = 0.156; two-tailed p-value < 10⁻¹⁵). Results are shown for 25,067 enhancer-promoter pairs.
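
As a reminder of what the Spearman coefficients above measure, here is a minimal sketch that computes ρ as the Pearson correlation of average ranks, which handles tied values such as repeated cell-type counts. The function names and the toy numbers are assumptions made for illustration; they are not taken from the paper, and no p-value is computed.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (illustrative only): number of active cell types per enhancer
# and a made-up average score per enhancer.
toy_cell_types = [1, 2, 3, 4]
toy_mean_score = [0.05, 0.09, 0.07, 0.12]
toy_rho = spearman_rho(toy_cell_types, toy_mean_score)
```

Because ρ depends only on ranks, it captures any monotone relationship between enhancer breadth of activity and score, not just a linear one.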

Supplementary Figure 10 Distribution of fitness consequences of mutations at 8,082 tissue-specific enhancers (measured by average LINSIGHT score per enhancer) across 41 tissue types.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10, Supplementary Tables 1–6 and Supplementary Note (PDF 2737 kb)

Cite this article

Huang, Y., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat Genet 49, 618–624 (2017). https://doi.org/10.1038/ng.3810
