

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk

Abstract

Key challenges for human genetics, precision medicine and evolutionary biology include deciphering the regulatory code of gene expression and understanding the transcriptional effects of genome variation. However, this is extremely difficult because of the enormous scale of the noncoding mutation space. We developed a deep learning–based framework, ExPecto, that can accurately predict, ab initio from a DNA sequence, the tissue-specific transcriptional effects of mutations, including those that are rare or that have not been observed. We prioritized causal variants within disease- or trait-associated loci from all publicly available genome-wide association studies and experimentally validated predictions for four immune-related diseases. By exploiting the scalability of ExPecto, we characterized the regulatory mutation space for human RNA polymerase II–transcribed genes by in silico saturation mutagenesis and profiled >140 million promoter-proximal mutations. This enables probing of evolutionary constraints on gene expression and ab initio prediction of mutation disease effects, making ExPecto an end-to-end computational framework for the in silico prediction of expression and disease risk.
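
To make the idea of in silico saturation mutagenesis concrete, the following is a minimal Python sketch that enumerates every possible single-nucleotide substitution in a short sequence and scores each mutant against the reference. It is not the ExPecto implementation: the function names are hypothetical, and `toy_score` (GC fraction) is only a placeholder assumption standing in for a trained sequence-to-expression model.

```python
from typing import Callable, Iterator, List, Tuple

BASES = "ACGT"


def saturation_mutagenesis(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    possible single-nucleotide substitution in `seq` (0-based positions)."""
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained expression model: GC fraction."""
    return sum(base in "GC" for base in seq) / len(seq)


def predicted_effects(
    seq: str, score: Callable[[str], float] = toy_score
) -> List[Tuple[int, str, str, float]]:
    """Return (pos, ref, alt, score(mutant) - score(reference)) for each
    substitution, i.e. the predicted effect of the mutation under `score`."""
    baseline = score(seq.upper())
    return [
        (pos, ref, alt, score(mutant) - baseline)
        for pos, ref, alt, mutant in saturation_mutagenesis(seq)
    ]


# A sequence of length L containing only A/C/G/T has 3 * L substitutions.
effects = predicted_effects("ACGT")
for pos, ref, alt, delta in effects:
    print(f"{pos}\t{ref}>{alt}\t{delta:+.2f}")
```

Because the number of substitutions grows only linearly with sequence length (three per position), exhaustive enumeration of this kind is feasible at scale provided that scoring each mutant is cheap; swapping `toy_score` for a real model is the only change the sketch would need.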

Fig. 1: Deep learning–based sequence model accurately predicts cell-type-specific gene expression.
Fig. 2: Tissue-specific prediction of expression-altering variations.
Fig. 3: Prioritized putative causal variants from GWAS loci with expression effect prediction.
Fig. 4: Variation potential is predictive of gene regulatory specificity, activation status and evolutionary constraints.
Fig. 5: Ab initio prediction of allele-specific disease risk integrating predicted expression effects and inferred evolutionary constraints.

Acknowledgements

The authors acknowledge all members of the Troyanskaya lab for helpful discussions. This work is supported by NIH grants R01HG005998, U54HL117798 and R01GM071966, HHS grant HHSN272201000054C and Simons Foundation grant 395506. The authors are pleased to acknowledge that a substantial portion of the work in this paper was performed at the TIGRESS high-performance computer center at Princeton University, which is jointly supported by the Princeton Institute for Computational Science and Engineering and the Princeton University Office of Information Technology’s Research Computing department. O.G.T. is a CIFAR fellow.

Author information

Contributions

J.Z. and O.G.T. conceived and designed the study; J.Z. developed the computational methods and performed the analyses; C.L.T. designed and performed experimental studies; K.Y., K.M.C. and A.K.W. developed the ExPecto web server; J.Z., C.L.T. and O.G.T. wrote the manuscript.

Corresponding author

Correspondence to Olga G. Troyanskaya.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–18, Supplementary Table 3 and Supplementary Note

Reporting Summary

Supplementary Table 1

Top model-specific regulatory sequence features for tissue/cell-type-specific ExPecto expression models

Supplementary Table 2

Prioritized putative causal variants from GWAS trait/disease-associated loci

Supplementary Table 4

Synthesized DNA sequences for luciferase assays

Supplementary Data 1

Human population common and rare variants with strong predicted expression effects. The genome build version is hg19

Supplementary Data 2

Tissue/cell-type-specific gene evolutionary constraint directionality scores and the inferred directional constraint probabilities

Supplementary Data 3

GWAS disease risk allele predictions from only sequence based on inferred evolutionary constraint violations

Supplementary Data 4

List of all transcription factors, histone marks and DNase profiles used to train sequence representation models for ExPecto

About this article

Cite this article

Zhou, J., Theesfeld, C.L., Yao, K. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet 50, 1171–1179 (2018). https://doi.org/10.1038/s41588-018-0160-6

  • DOI: https://doi.org/10.1038/s41588-018-0160-6
