Coding variants represent many of the strongest associations between genotype and phenotype; however, they exhibit inter-individual differences in effect, termed ‘variable penetrance’. Here, we study how cis-regulatory variation modifies the penetrance of coding variants. Using functional genomic and genetic data from the Genotype-Tissue Expression Project (GTEx), we observed that in the general population, purifying selection has depleted haplotype combinations predicted to increase pathogenic coding variant penetrance. Conversely, in cancer and autism patients, we observed an enrichment of penetrance-increasing haplotype configurations for pathogenic variants in disease-implicated genes, providing evidence that regulatory haplotype configuration of coding variants affects disease risk. Finally, we experimentally validated this model by editing a Mendelian single-nucleotide polymorphism (SNP) using CRISPR/Cas9 on distinct expression haplotypes with the transcriptome as a phenotypic readout. Our results demonstrate that joint regulatory and coding variant effects are an important part of the genetic architecture of human traits and contribute to modified penetrance of disease-causing variants.


Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




Acknowledgements

We would like to thank members of the Lappalainen laboratory for discussion surrounding the project, and both Kristin Ardlie and Sampsa Hautaniemi, who supervised F.A. and A.C., respectively. We thank the GTEx donors for their contributions to science, the GTEx Laboratory, Data Analysis, and Coordinating Center (LDACC), and the GTEx analysis working group (AWG) for their work in generating the resource. In particular, we would like to thank Ayellet Segre and Xiao Li at the Broad for their work in performing WGS variant calling and phasing of GTEx v7 data. The GTEx Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health (NIH) and by the National Cancer Institute (NCI), the National Human Genome Research Institute (NHGRI), the National Heart, Lung, and Blood Institute (NHLBI), the National Institute on Drug Abuse (NIDA), the National Institute of Mental Health (NIMH) and the National Institute of Neurological Disorders and Stroke (NINDS). We also acknowledge the families at the participating SSC sites, the principal investigators at each site, the coordinators and staff at the SSC sites, the SFARI staff and the UMass Amherst Community Campaign (UMACC). Funds for the SSC were provided by the Simons Foundation. Additionally, we acknowledge the contribution of TCGA specimen donors and The Cancer Genome Atlas Research Network for their analyses. Funds for the TCGA were provided by the National Cancer Institute (NCI) and NHGRI. S.E.C. was supported by NHGRI grant 1K99HG009916-01; T.L. and S.E.C. were supported by NIGMS grant R01GM122924 and NIMH grant R01MH101814; T.L., S.E.C. and P.M. were supported by NIH contract HHSN2682010000029C; T.L. and P.M. were supported by NIMH grant R01MH106842; and T.L. was supported by NIH grants UM1HG008901 and 1U24DK112331. A.C. was supported by the Cancer Society of Finland and Academy of Finland grant 284598.

Author information


  1. New York Genome Center, New York, NY, USA

    • Stephane E. Castel
    • Alejandra Cervera
    • Pejman Mohammadi
    • Aaron Wolman
    • Ivan Iossifov
    • Ana Vasileva
    • Tuuli Lappalainen
  2. Department of Systems Biology, Columbia University, New York, NY, USA

    • Stephane E. Castel
    • Pejman Mohammadi
    • Ana Vasileva
    • Tuuli Lappalainen
  3. Research Programs Unit, Genome-Scale Biology & Medicine, Department of Biochemistry and Developmental Biology, Faculty of Medicine, University of Helsinki, Helsinki, Finland

    • Alejandra Cervera
  4. Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA

    • Pejman Mohammadi
  5. The Scripps Translational Science Institute, La Jolla, CA, USA

    • Pejman Mohammadi
  6. Broad Institute of MIT and Harvard, Cambridge, USA

    • François Aguet
  7. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain

    • Ferran Reverter
    • Roderic Guigo
  8. Universitat Pompeu Fabra (UPF), Barcelona, Spain

    • Roderic Guigo
  9. Cold Spring Harbor Laboratory, New York, NY, USA

    • Ivan Iossifov




Contributions

S.E.C. and T.L. designed the study and wrote the manuscript. S.E.C., A.V., and T.L. designed analyses and experiments. S.E.C., A.C., F.A., A.W., and A.V. performed analyses and experiments. P.M. aided development of the test for regulatory modifiers of penetrance. F.R. and R.G. provided and assisted in analysis of GTEx PSI data. I.I. provided and assisted in the analysis of SSC data.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Stephane E. Castel or Tuuli Lappalainen.

Integrated supplementary information

  1. Supplementary Figure 1 Illustration of the key features of the model of joint effects of regulatory and coding variants on functional gene dosage and selection.

    Under the model, regulatory variation altering functional gene dosage is particularly important in loss-of-function heterozygotes, where the dosage of functional protein is already reduced to half. Our general assumption is that common regulatory variants typically have such low effects on gene dosage that, in the absence of coding variants, they do not cause severe disease or substantial reduction of fitness. Accordingly, in this example, under an additive model of gene expression, the more highly expressed eQTL allele increases expression by 1.25× and disease risk increases non-linearly with decreasing gene dosage; there are potentially large disease risk differences for loss-of-function heterozygotes depending on eQTL haplotype configuration. This results in purifying selection acting more strongly against haplotype configurations that decrease functional gene dosage, while acting more weakly on those that increase functional gene dosage. At the population level, this differential strength of purifying selection would result in haplotype configurations that increase functional gene dosage being present at higher frequencies than those that decrease dosage. We note that, while we believe that this general model is plausible for many genes with dosage sensitivity, other scenarios are likely to exist, and, for example, fully recessive genes or gain-of-function coding variants would not follow this model. Future work and larger data sets are needed to elucidate the full picture of the relative importance of different types of joint effects of regulatory and coding variants.

  2. Supplementary Figure 2 Using GTEx allelic expression and percent spliced in to estimate the penetrance of coding variants at the individual level.

    a,b, Differences in allelic expression between rare (AF < 1%), potentially pathogenic (CADD > 15, n = 644,052 independent ASE measurements) (a) and benign (CADD < 15, n = 368,715 independent ASE measurements) (b) missense variants and allele frequency-matched synonymous variants across GTEx tissues. A negative difference indicates reduced expression of missense variants compared with synonymous controls. Bars show the 95% confidence interval of the difference of the means between missense and synonymous variants. Storey’s π1 indicates the estimated proportion of true positives across the GTEx tissues. c, Box plots of mean aFC across each GTEx tissue for pathogenic and benign missense variants defined using PolyPhen (where pathogenic is probably damaging or possibly damaging). d, Comparison of allele frequency-matched synonymous-variant-normalized allelic expression between pathogenic and benign missense variants for each of the sets of variants analyzed in Fig. 2. e, Left, illustration of PSI z-score calculation from PSI measurements for the exon HsaEX0054530 in GTEx whole blood. Right, histogram of all absolute exon PSI z-scores for coding variants across GTEx whole blood, with the 90th percentile shown. Coding variants with an absolute PSI z-score greater than the 90th percentile were considered to be in exons with substantial variation in PSI compared with the population. P-values were calculated using a two-sided paired Wilcoxon signed rank test. For box plots: bottom whisker, Q1 − 1.5 × IQR; top whisker, Q3 + 1.5 × IQR; box, IQR; center, median; outliers are not plotted for ease of viewing.

  3. Supplementary Figure 3 Test for regulatory modifiers of coding variant penetrance using phased genetic data.

    a, As input, the test takes the phased genotypes of coding variants and the eQTL for that gene. For each individual heterozygous for a coding variant, a binary measure is produced to indicate whether the major (wild-type) allele is on the more weakly expressed eQTL haplotype. b, Across a population of individuals, the null expectation is that the observed haplotype configurations are a random sampling of all possible configurations, and thus, the proportion of observed major alleles on the more weakly expressed haplotype is equal to the frequency of the more weakly expressed haplotype in the population. The diagram depicts a single gene example, but observations are aggregated across genes, and the difference between the observed frequency of major alleles on the more weakly expressed haplotype and the frequency of the more weakly expressed haplotype across those genes is calculated. c, Results of a test performed on simulated haplotype data from 1,000 individuals at 500 genes with 1,000 replicates using a frequency for the more weakly expressed haplotype of 50% and coding variant frequencies observed in GTEx, across a range of genes exhibiting joint effects between regulatory and coding variants and effect size. The simulated effect size is described by the x axis in terms of the percentage of observed haplotype configurations that decrease penetrance. d, Power to detect significant (α = 0.05) regulatory modifiers of penetrance from simulation data in c is robust across a range of effect sizes. e, Comparison of P-values calculated using either the bootstrap approach or the Poisson binomial distribution from 1,000 simulations of 1,000 haplotypes generated under the null hypothesis shows that they are extremely similar. The equality line is shown in red, and the Pearson correlation R and slope are listed in the panel. See the Methods for more information.

  4. Supplementary Figure 4 Correlation between regulatory coding variant haplotype configuration and both eQTL effect size and gene conservation.

    a,b, Results are shown for rare pathogenic (MAF < 1%, CADD > 15, n = 132,788 variant expression haplotype configurations; left) and rare benign (MAF < 1%, CADD < 15, n = 204,395 variant expression haplotype configurations; right) coding variants separately. For each correlation, the Spearman rho and P-value are listed above the plot. The red line shows the linear regression fit, and the blue line shows the LOESS regression fit with a span of 0.25. Data were grouped into 100 equally sized bins ordered by either |eQTL effect size| (a) or gene conservation (b). See ‘Gene sets’ in the Methods for a description of how gene conservation was calculated.

  5. Supplementary Figure 5 Gene-level metrics of common (MAF > 5%) regulatory and rare (MAF < 1%) coding variant haplotypes.

    a–d, Haplotypes were generated using potentially pathogenic (CADD > 15) or benign (CADD < 15) rare coding variants and the top cross-tissue GTEx v6p eQTLs to define more highly and weakly expressed haplotypes. Histograms are shown of the more highly expressed haplotype Ω, the number of coding variants with haplotype data, the mean coding variant frequency, and the number of haplotypes observed at the gene level for haplotypes from 620 phased and imputed GTEx v7 whole genomes (a), 615 phased and imputed TCGA germline whole genomes (b), and phased and imputed array and whole exome data from 1,712 SSC unaffected siblings (c) and 2,304 SSC probands (d). Differences in the frequency distribution of the more highly expressed haplotype between populations result from differences in eQTL allele frequency. This is not expected to cause systematic bias in our test of modified penetrance shown in Supplementary Fig. 3.

  6. Supplementary Figure 6 Analysis of joint eQTL and coding variant haplotype configurations in disease cohorts.

    a, Analysis of rare benign variants in disease-implicated genes. b, Analysis of rare pathogenic variants in control genes selected to have within ±5% the number of coding variants, coding variant frequency and number of eQTL coding variant haplotypes as disease genes. The same number of haplotypes that were used in the disease gene analysis (Fig. 4) were sampled from the pool of haplotypes in control genes. c, Analysis of rare pathogenic variants in disease-implicated genes after removing individuals with Asian ancestry from the TCGA cohort. Median estimates and 95% confidence intervals were generated using 100,000 bootstraps, and two-sided empirical P-values were generated from these confidence intervals and combined between cohorts using Fisher’s method to produce meta P-values (*P < 0.05, **P < 0.01). See ‘Gene sets’ in the Methods for description of the gene sets used, Supplementary Fig. 5 for description of the eQTL coding variant haplotypes used for the analysis and Supplementary Table 3 for the full table of results, including individual cohort-level P-values.

  7. Supplementary Figure 7 eQTL-haplotype-aware editing of a Mendelian SNP in 293T cells using the transcriptome as a phenotypic readout.

    a, HaploReg v3 annotations of the top causal variant candidates for the FLCN eQTL. Highlighted in blue is variant rs1708629, which we consider to be the most likely causal variant, as it is annotated as promoter and enhancer in multiple Roadmap Epigenomic cell types, is bound to numerous proteins, and contains many protein-binding motifs. The r² value between the highlighted variant and other variants is listed. b, Visualization of eQTL SNP (rs1708629) genotyping reads, edited SNP (rs199643834) genotyping reads from representative monoallelic and single-copy clones, and targeted RNA-seq reads from representative clones with low SNP expression (snpLOW) and high SNP expression (snpHIGH). Ratios of the reference allele and alternative allele in targeted DNA and RNA sequencing are indicated. A full list of primers used for sequencing can be found in Supplementary Table 8. c, Volcano plot of differential expression analysis comparing two clones that are monoallelic for the edited SNP versus four wild-type clones. d, Illustration of haplotype configurations for the clonal cell lines analyzed.

  8. Supplementary Figure 8 Comparison of absolute distance between coding variants and eQTL variants for analyzed gene sets.

    a, Comparison of median pathogenic versus benign coding variant distances for each of the gene sets considered in Fig. 3. b, Comparison of median pathogenic versus benign coding variant distances from cases (TCGA/SSC probands) and controls (GTEx/SSC siblings) for each of the disease gene sets considered in Fig. 4. P-values were calculated using a two-sided Wilcoxon signed rank test. For box plots: bottom whisker, Q1 − 1.5 × IQR; top whisker, Q3 + 1.5 × IQR; box, IQR; center, median; outliers are not plotted for ease of viewing.
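
The test outlined in the legend of Supplementary Figure 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The genotype encoding (phased strings such as "0|1", with "0" as the major coding allele), the function names, and the two-sided P-value obtained by doubling the smaller Poisson binomial tail are all choices made for this example. Each observation is one heterozygous carrier of a coding variant, paired with the population frequency of the more weakly expressed eQTL haplotype for that gene.

```python
from typing import Dict, List, Sequence, Tuple


def major_on_low_haplotype(coding_gt: str, eqtl_gt: str, low_allele: str) -> bool:
    """Binary measure for one individual (hypothetical encoding).

    coding_gt and eqtl_gt are phased genotypes such as "0|1"; in coding_gt,
    "0" is assumed to be the major (wild-type) allele. low_allele is the
    eQTL allele on the more weakly expressed haplotype.
    """
    coding = coding_gt.split("|")
    eqtl = eqtl_gt.split("|")
    if sorted(coding) != ["0", "1"]:
        raise ValueError("individual must be heterozygous for the coding variant")
    hap = coding.index("0")  # haplotype carrying the major coding allele
    return eqtl[hap] == low_allele


def poisson_binomial_pmf(probs: Sequence[float]) -> List[float]:
    """Exact distribution of a sum of independent Bernoulli(p_i) draws."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this draw is 0
            nxt[k + 1] += mass * p       # this draw is 1
        pmf = nxt
    return pmf


def modifier_test(observations: Sequence[Tuple[bool, float]]) -> Dict[str, float]:
    """observations: (major_on_low, p_low) pairs aggregated across genes.

    Under the null, each major allele sits on the more weakly expressed
    haplotype with probability p_low for its gene.
    """
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    observed = sum(1 for flag, _ in observations if flag)
    probs = [p for _, p in observations]
    expected = sum(probs)
    pmf = poisson_binomial_pmf(probs)
    upper = sum(pmf[observed:])        # P(X >= observed)
    lower = sum(pmf[: observed + 1])   # P(X <= observed)
    # Assumption: two-sided P-value as twice the smaller tail, capped at 1.
    p_value = min(1.0, 2.0 * min(upper, lower))
    return {"delta": (observed - expected) / n, "p_value": p_value}
```

For example, `modifier_test([(True, 0.5)] * 10)` gives a delta of 0.5 and a P-value of 2/1024. A positive delta means that major alleles are found on the more weakly expressed haplotype more often than expected, which under the model of Supplementary Figure 1 corresponds to configurations that lower functional gene dosage.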

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–8

  2. Reporting Summary

  3. Supplementary Table 1

    Top cross-tissue GTEx v6p eQTLs per gene

  4. Supplementary Table 2

    TCGA individuals and respective cancer types used for analysis

  5. Supplementary Table 3

    Disease cohort-level results of the test for regulatory modifiers of penetrance

  6. Supplementary Table 4

    Results from targeted sequencing of edited 293T monoclonal cell lines

  7. Supplementary Table 5

    Genes with expression affected by the FLCN Mendelian disease SNP

  8. Supplementary Table 6

    Pathway-based gene set enrichment analysis of genes affected by the FLCN Mendelian disease SNP

  9. Supplementary Table 7

    Inferred ancestry of individuals in GTEx v7 and TCGA cohorts used for analysis

  10. Supplementary Table 8

    Oligonucleotides used in this study

About this article

Publication history