
Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk



Key challenges for human genetics, precision medicine and evolutionary biology include deciphering the regulatory code of gene expression and understanding the transcriptional effects of genome variation. However, this is extremely difficult because of the enormous scale of the noncoding mutation space. We developed a deep learning–based framework, ExPecto, that can accurately predict, ab initio from a DNA sequence, the tissue-specific transcriptional effects of mutations, including those that are rare or that have not been observed. We prioritized causal variants within disease- or trait-associated loci from all publicly available genome-wide association studies and experimentally validated predictions for four immune-related diseases. By exploiting the scalability of ExPecto, we characterized the regulatory mutation space for human RNA polymerase II–transcribed genes by in silico saturation mutagenesis and profiled >140 million promoter-proximal mutations. This enables probing of evolutionary constraints on gene expression and ab initio prediction of mutation disease effects, making ExPecto an end-to-end computational framework for the in silico prediction of expression and disease risk.
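As a rough illustration of what in silico saturation mutagenesis involves, the sketch below enumerates every single-base substitution of a sequence and scores each one against the unmutated baseline. It is a minimal sketch under stated assumptions, not the ExPecto implementation: the names `saturation_mutants`, `variant_effects` and `toy_predict` are hypothetical, and `toy_predict` is a placeholder scorer (GC fraction) standing in for a trained sequence-to-expression model.

```python
from typing import Callable, Iterator, List, Tuple

BASES = "ACGT"


def saturation_mutants(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref, alt, mutated_sequence) for every single-base substitution."""
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


def variant_effects(
    seq: str, predict: Callable[[str], float]
) -> List[Tuple[int, str, str, float]]:
    """Score each mutant as predict(mutant) - predict(reference)."""
    baseline = predict(seq.upper())
    return [
        (pos, ref, alt, predict(mutant) - baseline)
        for pos, ref, alt, mutant in saturation_mutants(seq)
    ]


def toy_predict(seq: str) -> float:
    """Placeholder scorer (GC fraction); an assumption, NOT the ExPecto model."""
    return sum(base in "GC" for base in seq) / len(seq)


# A 4-base sequence has 4 positions x 3 alternative bases = 12 mutants.
effects = variant_effects("ACGT", toy_predict)
```

Because the number of mutants grows only linearly with sequence length (three per position), this kind of exhaustive scan stays tractable at scale as long as the predictor itself is fast to evaluate.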





Acknowledgements

The authors acknowledge all members of the Troyanskaya lab for helpful discussions. This work is supported by NIH grants R01HG005998, U54HL117798 and R01GM071966, HHS grant HHSN272201000054C and Simons Foundation grant 395506. The authors are pleased to acknowledge that a substantial portion of the work in this paper was performed at the TIGRESS high-performance computer center at Princeton University, which is jointly supported by the Princeton Institute for Computational Science and Engineering and the Princeton University Office of Information Technology’s Research Computing department. O.G.T. is a CIFAR fellow.

Author information

J.Z. and O.G.T. conceived and designed the study; J.Z. developed the computational methods and performed the analyses; C.L.T. designed and performed experimental studies; K.Y., K.M.C. and A.K.W. developed the ExPecto web server; J.Z., C.L.T. and O.G.T. wrote the manuscript.

Correspondence to Olga G. Troyanskaya.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–18, Supplementary Table 3 and Supplementary Note

Reporting Summary

Supplementary Table 1

Top model-specific regulatory sequence features for tissue/cell-type-specific ExPecto expression models

Supplementary Table 2

Prioritized putative causal variants from GWAS trait/disease-associated loci

Supplementary Table 4

Synthesized DNA sequences for luciferase assays

Supplementary Data 1

Human population common and rare variants with strong predicted expression effects. The genome build version is hg19

Supplementary Data 2

Tissue/cell-type-specific gene evolutionary constraint directionality scores and the inferred directional constraint probabilities

Supplementary Data 3

GWAS disease risk allele predictions from only sequence based on inferred evolutionary constraint violations

Supplementary Data 4

List of all transcription factors, histone marks and DNase profiles used to train sequence representation models for ExPecto


Fig. 1: Deep learning–based sequence model accurately predicts cell-type-specific gene expression.
Fig. 2: Tissue-specific prediction of expression-altering variations.
Fig. 3: Prioritized putative causal variants from GWAS loci with expression effect prediction.
Fig. 4: Variation potential is predictive of gene regulatory specificity, activation status and evolutionary constraints.
Fig. 5: Ab initio prediction of allele-specific disease risk integrating predicted expression effects and inferred evolutionary constraints.