Estimating the selective effects of heterozygous protein-truncating variants from human exome data

Abstract

The evolutionary cost of gene loss is a central question in genetics and has been investigated in model organisms and human cell lines (refs. 1–3). In humans, tolerance of the loss of one or both functional copies of a gene is related to the gene's causal role in disease. However, estimates of the selection and dominance coefficients in humans have been elusive. Here we analyze exome sequence data from 60,706 individuals (ref. 4) to make genome-wide estimates of selection against heterozygous loss of gene function. Using this distribution of selection coefficients for heterozygous protein-truncating variants (PTVs), we provide corresponding Bayesian estimates for individual genes. We find that genes under the strongest selection are enriched in embryonic lethal mouse knockouts, Mendelian disease-associated genes, and regulators of transcription. Screening by essentiality, we find a large set of genes under strong selection that are likely to have crucial functions but have not yet been thoroughly characterized.

Figure 1: Inferred distribution of fitness effects for heterozygous loss of gene function.
Figure 2: Separation of disease-associated genes and clinical cases by mode of inheritance.
Figure 3: Enrichments of shet in known haploinsufficient disease-associated genes of high confidence (ClinGen Dosage Sensitivity Project).
Figure 4: Distribution of shet values for phenotypes in known disease-associated genes and clinical cases.
Figure 5: Gene essentiality in mice and cells by shet bin.
Figure 6: Protein pathways and protein–protein interactions, as a percentage of the associated developmental genes in each shet bin.

References

  1. Mukai, T., Chigusa, S.I., Mettler, L.E. & Crow, J.F. Mutation rate and dominance of genes affecting viability in Drosophila melanogaster. Genetics 72, 335–355 (1972).
  2. Deng, H.W. & Lynch, M. Estimation of deleterious-mutation parameters in natural populations. Genetics 144, 349–360 (1996).
  3. Wang, T. et al. Identification and characterization of essential genes in the human genome. Science 350, 1096–1101 (2015).
  4. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
  5. Williamson, S.H. et al. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. USA 102, 7882–7887 (2005).
  6. Boyko, A.R. et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 4, e1000083 (2008).
  7. Kryukov, G.V., Pennacchio, L.A. & Sunyaev, S.R. Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am. J. Hum. Genet. 80, 727–739 (2007).
  8. Kryukov, G.V., Shpunt, A., Stamatoyannopoulos, J.A. & Sunyaev, S.R. Power of deep, all-exon resequencing for discovery of human trait genes. Proc. Natl. Acad. Sci. USA 106, 3871–3876 (2009).
  9. Eyre-Walker, A. & Keightley, P.D. The distribution of fitness effects of new mutations. Nat. Rev. Genet. 8, 610–618 (2007).
  10. Do, R. et al. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat. Genet. 47, 126–131 (2015).
  11. Fu, W., Gittelman, R.M., Bamshad, M.J. & Akey, J.M. Characteristics of neutral and deleterious protein-coding variation among individuals and populations. Am. J. Hum. Genet. 95, 421–436 (2014).
  12. Lohmueller, K.E. The distribution of deleterious genetic variation in human populations. Curr. Opin. Genet. Dev. 29, 139–146 (2014).
  13. Gravel, S. When is selection effective? Genetics 203, 451–462 (2016).
  14. Williamson, S., Fledel-Alon, A. & Bustamante, C.D. Population genetics of polymorphism and divergence for diploid selection models with arbitrary dominance. Genetics 168, 463–475 (2004).
  15. Balick, D.J., Do, R., Cassa, C.A., Reich, D. & Sunyaev, S.R. Dominance of deleterious alleles controls the response to a population bottleneck. PLoS Genet. 11, e1005436 (2015).
  16. Simons, Y.B., Turchin, M.C., Pritchard, J.K. & Sella, G. The deleterious mutation load is insensitive to recent population history. Nat. Genet. 46, 220–224 (2014).
  17. MacArthur, D.G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012).
  18. Samocha, K.E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
  19. Francioli, L.C. et al. Genome-wide patterns and properties of de novo mutations in humans. Nat. Genet. 47, 822–826 (2015).
  20. Solomon, B.D., Nguyen, A.-D., Bear, K.A. & Wolfsberg, T.G. Clinical genomic database. Proc. Natl. Acad. Sci. USA 110, 9851–9855 (2013).
  21. Yang, Y. et al. Molecular findings among patients referred for clinical whole-exome sequencing. JAMA 312, 1870–1879 (2014).
  22. Lee, H. et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA 312, 1880–1887 (2014).
  23. Saleheen, D. et al. Human knockouts in a cohort with a high rate of consanguinity. Preprint at bioRxiv http://dx.doi.org/10.1101/031518 (2015).
  24. Koscielny, G. et al. The International Mouse Phenotyping Consortium Web Portal, a unified point of access for knockout mice and related phenotyping data. Nucleic Acids Res. 42, D802–D809 (2014).
  25. Georgi, B., Voight, B.F. & Bućan, M. From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLoS Genet. 9, e1003484 (2013).
  26. Roessler, E. et al. Mutations in the human Sonic Hedgehog gene cause holoprosencephaly. Nat. Genet. 14, 357–360 (1996).
  27. Kang, S., Graham, J.M., Olney, A.H. & Biesecker, L.G. GLI3 frameshift mutations cause autosomal dominant Pallister–Hall syndrome. Nat. Genet. 15, 266–268 (1997).
  28. Vortkamp, A., Gessler, M. & Grzeschik, K.H. GLI3 zinc-finger gene interrupted by translocations in Greig syndrome families. Nature 352, 539–540 (1991).
  29. Wild, A. et al. Point mutations in human GLI3 cause Greig syndrome. Hum. Mol. Genet. 6, 1979–1984 (1997).
  30. Roessler, E. et al. Loss-of-function mutations in the human GLI2 gene are associated with pituitary anomalies and holoprosencephaly-like features. Proc. Natl. Acad. Sci. USA 100, 13424–13429 (2003).
  31. Chiang, C. et al. Cyclopia and defective axial patterning in mice lacking Sonic hedgehog gene function. Nature 383, 407–413 (1996).
  32. Hui, C.C. & Joyner, A.L. A mouse model of Greig cephalopolysyndactyly syndrome: the extra-toesJ mutation contains an intragenic deletion of the Gli3 gene. Nat. Genet. 3, 241–246 (1993).
  33. Mo, R. et al. Specific and redundant functions of Gli2 and Gli3 zinc finger genes in skeletal patterning and development. Development 124, 113–123 (1997).
  34. Huang, D.W., Sherman, B.T. & Lempicki, R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2008).
  35. Seidman, J.G. & Seidman, C. Transcription factor haploinsufficiency: when half a loaf is not enough. J. Clin. Invest. 109, 451–455 (2002).
  36. NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 41, D8–D20 (2013).
  37. Raychaudhuri, S. et al. Identifying relationships among genomic disease regions: predicting genes at pathogenic SNP associations and rare deletions. PLoS Genet. 5, e1000534 (2009).
  38. Agrawal, A.F. & Whitlock, M.C. Inferences about the distribution of dominance drawn from yeast gene knockout data. Genetics 187, 553–566 (2011).
  39. Simmons, M.J. & Crow, J.F. Mutations affecting fitness in Drosophila populations. Annu. Rev. Genet. 11, 49–78 (1977).
  40. Wright, S. Evolution in Mendelian populations. Bull. Math. Biol. 52, 241–295 (1990).
  41. Petrovski, S. et al. The intolerance of regulatory sequence to genetic variation predicts gene dosage sensitivity. PLoS Genet. 11, e1005492 (2015).
  42. Kiezun, A. et al. Exome sequencing and the genetic basis of complex traits. Nat. Genet. 44, 623–630 (2012).
  43. Li, W.H. & Nei, M. Total number of individuals affected by a single deleterious mutation in a finite population. Am. J. Hum. Genet. 24, 667–679 (1972).
  44. Li, W.H. The first arrival time and mean age of a deleterious mutant gene in a finite population. Am. J. Hum. Genet. 27, 274–286 (1975).
  45. Maruyama, T. The age of a rare mutant gene in a large population. Am. J. Hum. Genet. 26, 669–673 (1974).
  46. Maruyama, T. The age of an allele in a finite population. Genet. Res. 23, 137–143 (1974).
  47. Messer, P.W. SLiM: simulating evolution with selection and linkage. Genetics 194, 1037–1039 (2013).
  48. Tennessen, J.A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012).
  49. Wang, S.R. et al. Simulation of Finnish population history, guided by empirical genetic data, to assess power of rare-variant tests in Finland. Am. J. Hum. Genet. 94, 710–720 (2014).
  50. Huttlin, E.L. et al. The BioPlex Network: a systematic exploration of the human interactome. Cell 162, 425–440 (2015).
  51. Ayadi, A. et al. Mouse large-scale phenotyping initiatives: overview of the European Mouse Disease Clinic (EUMODIC) and of the Wellcome Trust Sanger Institute Mouse Genetics Project. Mamm. Genome 23, 600–610 (2012).

Acknowledgements

We thank I. Adzhubei, K. Karczewski, E. Minikel, and A. Kondrashov for helpful advice. This work was supported by US National Institutes of Health (NIH) grants HG007229 (C.A.C.), GM078598 (S.R.S., D.M.J., D.J.B.), and MH101244 (S.R.S., D.W.).

Author information

Contributions

The overall concept and approach were conceived and developed by C.A.C., D.R.B., and S.R.S. Implementation, data analysis, and interpretation were conducted by D.W., C.A.C., D.J.B., D.M.J., and D.N. Data sets and advice were provided by D.G.M., M.J.D., K.E.S., and A.O'D.-L. The article was written by C.A.C. and S.R.S. with contributions from D.W. and D.J.B. All authors read and discussed the manuscript.

Corresponding authors

Correspondence to David R Beier or Shamil R Sunyaev.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Population genetics simulations of model assumptions.

To validate the assumption that estimates of selection can be made under mutation-selection balance independent of demography or population size for variants under sufficiently strong selection (Methods), we used SLiM 2.0 to conduct forward population genetics simulations. We compare the theoretical mutation load (defined as the sum of PTV allele frequencies, calculated as U/shet) with the simulated mutation load in four groups (African, Non-Finnish European, Finnish, and Combined). The combined group includes pooled site frequency spectra from African, Non-Finnish European, and Finnish populations in the proportions represented in the ExAC dataset. Simulations were run for shet ∈ {−5×10⁻², −5×10⁻³, −5×10⁻⁴, −5×10⁻⁵, −5×10⁻⁶}, shown from left to right on the x-axis, with μ = 2×10⁻⁸, each gene 100 base pairs long, and U = 2×10⁻⁶ for all simulations. Plotted points are mean values across 10,000 replicates. The simulations support our assumption of mutation-selection balance (with no appreciable effect from drift) in the strong selection regime (|shet| > 1×10⁻³), which appears to be appropriate for PTVs even in the case of the Finnish population, which underwent a recent bottleneck and a subsequent population expansion.
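
The theoretical load in this comparison can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' SLiM simulation: it evaluates U/|shet| for the selection coefficients listed in the caption, and it checks the value against a simple deterministic recursion with no drift and no demography. The recursion and the function names are assumptions introduced here for illustration only.

```python
def theoretical_load(U, s_het):
    """Summed PTV allele frequency expected under mutation-selection balance."""
    return U / abs(s_het)


def deterministic_load(U, s_het, generations=20000):
    """Iterate x' = x * (1 - |s_het|) + U.

    Assumption for illustration: infinite population and rare alleles, so
    selection removes a fraction |s_het| of copies per generation and
    mutation adds U new copies.
    """
    s = abs(s_het)
    x = 0.0
    for _ in range(generations):
        x = x * (1.0 - s) + U
    return x


U = 2e-6  # per-gene PTV mutation rate used in the caption
s_values = [-5e-2, -5e-3, -5e-4, -5e-5, -5e-6]
loads = {s: theoretical_load(U, s) for s in s_values}

for s, load in loads.items():
    print(f"s_het = {s:.0e}   U/|s_het| = {load:.2e}")
```

In this toy recursion the strongest coefficients converge to U/|shet| within a few hundred generations, whereas the weakest ones need far longer than the 20,000 generations iterated here.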

Supplementary Figure 2 ROC curve for mode of inheritance gene classifier.

We train a Naïve Bayes classifier to predict the mode of inheritance in a set of solved clinical exome sequencing cases from Baylor College of Medicine (N=283 cases) and UCLA (N=176 cases). Using data from UCLA as the training dataset, we are able to cross-predict the mode of inheritance in separately ascertained Baylor cases with classification accuracy of 88.0%, sensitivity of 86.1%, specificity of 90.2%, and an AUC of 0.931. Genes that were related to diagnosis in both clinics (overlapping genes) were removed from the larger Baylor set.
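
For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from a two-class confusion matrix. The counts in the example are hypothetical and are not taken from the Baylor or UCLA data; the function name is also an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    tp/fn: positive-class cases predicted correctly/incorrectly.
    tn/fp: negative-class cases predicted correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(metrics)
```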

Supplementary Figure 3 Association of shet estimates with known disease genes.

Proportion of genes listed to have a disease association in the Human Gene Mutation Database, and number of disease associations related to each gene in OMIM MorbidMap, in each shet decile. Each bin is expected to contain 10% of all covered genes, ordered from greatest to smallest shet values, in bins 1 through 10, respectively.

Supplementary Figure 4 Enrichment in germline cancer predisposition genes.

In a large screen of germline cancer predisposition genes in the Pediatric Cancer Genome Project (PCGP), the enrichment of variants in pediatric cancer cases is measured over individuals in ExAC. Genes with greater enrichment of variants in cancer cases over ExAC are correlated with higher selection coefficients. Data are separated by shet bins on a log scale. Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.

Supplementary Figure 5 Enrichments of shet in de novo variants from autism spectrum disorder (ASD) case and control trios.

In a set of de novo ASD case (N=2,939) and control (N=1,429) trios, shet estimates can help discriminate between all protein-coding variants, protein-truncating variants (including all frameshift, nonsense, and essential splice site variants), and individually for nonsense, frameshift, and missense variants which are predicted to be PolyPhen-2 damaging. Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.

Supplementary Figure 6 Association of shet estimates with PubMed gene score.

[a] The average PubMed gene score is calculated by shet decile. Estimates of selection (shet) are positively correlated with the average PubMed gene score. Each bin contains 10% of all covered genes, ordered from greatest to smallest shet values, in bins 1 through 10, respectively. [b] The PubMed gene score is significantly positively correlated with shet (p<0.0001) using a logarithmic model (y = 4.557*log(shet) + 44.449) with R² = 0.00409.

Supplementary Figure 7 Most and least published genes from top shet decile.

The proportion of annotations related to genes with the fewest and most publications in Entrez Gene. From the set of genes under the strongest selection (top 10% of shet values), we create two sets of 250 genes. The first set of genes has the fewest publications associated with each gene, as defined by our PubMed gene score ( Methods ), and the second set has the greatest number of associated publications. Between the two groups, we compare the shet values, number of protein-protein interactions, viability of orthologous mouse knockouts (IMPC), and cell essentiality assays (KBM-7 CRISPR score and Gene Trap Score). These results suggest that the genes in the least published set are similar to those in the most published set, and are also potentially important developmental genes.

Supplementary Figure 8 Relationship between gene mutation rate and selection.

Relationship between the estimate of local mutation rate, U, and the naïve estimator for heterozygous selection against PTVs, ν/n = NU/n, for all 17,199 genes. Light green dots represent genes with n/N > 0.001 (1,201), which we omit in the inference of the distribution P(shet). Light gray dots represent the genes used in the inference with n > 0 (14,274), while dark blue dots correspond to those with n = 0 (1,724). The latter were assigned a fixed selection coefficient estimate of 1 for illustration purposes. We computed the mean U in logarithmic bins of ν/n for the range 0.00003 < ν/n ≤ 0.012, and for the last bin from all genes with ν/n > 0.012, including those with n = 0 (large gray dots). Error bars denote s.e.m. The slight positive correlation between U and selection strength motivates the division of the data set into terciles of U and separate estimation of the parameters of the distribution of selection coefficients in each.
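
The naïve estimator and the gene categories described in this caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name and the example values of N, U, and n are hypothetical, and N is simply treated as the sample-size quantity that appears in n/N.

```python
def naive_s_het(n, N, U, max_freq=0.001):
    """Naive estimate nu/n = N*U/n of heterozygous selection against PTVs.

    Returns None for genes with n/N above max_freq (omitted from the
    inference in the caption) and 1.0 for genes with n == 0 (the fixed
    value the caption assigns for illustration purposes).
    """
    if n / N > max_freq:
        return None
    if n == 0:
        return 1.0
    return N * U / n


# Hypothetical example values.
N = 60706   # sample-size term
U = 1e-6    # per-gene PTV mutation rate
for n in (0, 5, 100):
    print(n, naive_s_het(n, N, U))
```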

Supplementary Figure 9 Fit to the observed distribution of PTV counts.

Fitted distribution P(n) (black dots) from maximum likelihood fit to the observed distribution Q(n) (histogram) of PTV counts n across 15,998 considered genes divided into terciles according to mutation rate U, assuming shet~IG

Supplementary Figure 10 Inferred distribution of fitness effects for heterozygous loss of gene function in non-Finnish Europeans.

We separately repeated the inference procedure for P(shet) using data from a single population group, Non-Finnish Europeans (NFE, N=33,370, as annotated by ExAC), and generated a corresponding set of shet estimates. The inferred parameters are very similar to those from the larger sample. Parameters were estimated from a maximum likelihood fit to the observed distribution of PTV counts n across genes with X = n/N < 0.001 in the set of non-Finnish Europeans (16,279 genes), assuming shet~IG(α,β) in terciles of the mutation rate U. Parameter estimates are (α₁, β₁) = (0.093, 0.0068), (α₂, β₂) = (0.046, 0.0110), and (α₃, β₃) = (0.078, 0.0183). Shown is the mixture distribution of the three components with equal weights.
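
The equal-weight mixture described in this caption can be sketched in Python. The sketch rests on an assumption that is not stated in the caption: IG is read as an inverse Gaussian density with α as its mean and β as its shape parameter. The tercile pairing of the three reported estimates is likewise inferred from the caption. If the Methods define IG(α,β) differently, the density function would need to change accordingly.

```python
import math

# (alpha, beta) estimates reported in the caption, one pair per U tercile.
NFE_PARAMS = [(0.093, 0.0068), (0.046, 0.0110), (0.078, 0.0183)]


def inverse_gaussian_pdf(s, mean, shape):
    """Inverse Gaussian density (assumed parameterization: mean, shape)."""
    if s <= 0:
        return 0.0
    norm = math.sqrt(shape / (2.0 * math.pi * s ** 3))
    return norm * math.exp(-shape * (s - mean) ** 2 / (2.0 * mean ** 2 * s))


def mixture_pdf(s, params=NFE_PARAMS):
    """Equal-weight mixture of the per-tercile components."""
    return sum(inverse_gaussian_pdf(s, a, b) for a, b in params) / len(params)


for s in (0.001, 0.01, 0.1):
    print(f"s_het = {s:<6} mixture density = {mixture_pdf(s):.3f}")
```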

Supplementary Figure 11 Inferred distribution of fitness effects for heterozygous loss of gene function when excluding Finnish individuals.

We re-generated estimates of the distribution of heterozygous selection coefficients shet using the set of PTVs from all individuals in ExAC (N=60,706) and the set that excludes all Finnish individuals (N=57,399), using ExAC version 0.3.1 with LOFTEE annotations. Parameters were estimated from a maximum likelihood fit to the observed distribution of PTV counts n across genes with n/N < 0.001, assuming shet~IG(α,β). We find no substantial difference in the estimation of the prior for the distribution of selection coefficients in the ExAC sample that excludes Finnish individuals.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11, Supplementary Table 3 and Supplementary Note. (PDF 3379 kb)

Supplementary Table 1

Distribution of shet estimates. We provide shet estimates in Supplementary Table 1. This file includes the mean of the posterior distribution (Eq. 7) for each gene as well as the upper and lower bounds of the 95% credibility interval for each gene estimate. Credibility intervals have a precision of 10⁻³ where shet > 0.005 and 10⁻⁵ otherwise. (XLSX 1814 kb)
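
A per-gene posterior mean of this kind can be approximated on a grid. The sketch below is not Eq. 7 itself, which is not reproduced on this page. It assumes a Poisson likelihood for the observed PTV count n with expectation ν/shet (ν = NU), an inverse Gaussian prior parameterized by mean and shape, and an upper bound of 1 on shet. All function names, grid settings, and example inputs are hypothetical.

```python
import math


def ig_pdf(s, mean, shape):
    """Inverse Gaussian prior density (assumed parameterization)."""
    norm = math.sqrt(shape / (2.0 * math.pi * s ** 3))
    return norm * math.exp(-shape * (s - mean) ** 2 / (2.0 * mean ** 2 * s))


def posterior_mean(n, nu, mean, shape, grid_size=2000, s_min=1e-5, s_max=1.0):
    """Grid approximation of E[s_het | n] on a log-spaced grid."""
    log_min, log_max = math.log(s_min), math.log(s_max)
    numerator = 0.0
    denominator = 0.0
    for i in range(grid_size):
        s = math.exp(log_min + (log_max - log_min) * (i + 0.5) / grid_size)
        lam = nu / s  # assumed Poisson expectation of the PTV count
        log_like = n * math.log(lam) - lam - math.lgamma(n + 1)
        # Extra factor s: ds = s * d(log s) on a log-spaced grid.
        weight = math.exp(log_like) * ig_pdf(s, mean, shape) * s
        numerator += s * weight
        denominator += weight
    return numerator / denominator


# Hypothetical gene: nu = N*U = 0.06, prior with mean 0.093 and shape 0.0068.
few_ptvs = posterior_mean(n=0, nu=0.06, mean=0.093, shape=0.0068)
many_ptvs = posterior_mean(n=20, nu=0.06, mean=0.093, shape=0.0068)
print(few_ptvs, many_ptvs)
```

Under these assumptions, a gene with no observed PTVs receives a larger posterior mean than an otherwise identical gene with many observed PTVs, which matches the direction of the naïve estimator ν/n.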

Supplementary Table 2

Predicted mode of inheritance for each gene. For each gene, we generate a probability of mode of inheritance (either autosomal dominant or autosomal recessive). Estimates are generated using a logistic regression, trained on the full set of labeled case examples from two clinical exome sequencing programs (Baylor and UCLA; refs. 21,22). These estimates are applicable to the interpretation of genes in cases ascertained similarly to those in these two clinical exome sequencing programs. (XLSX 579 kb)

Supplementary Table 4

Most published and least published genes from top shet decile. Full annotations for the PubMed gene score in the top shet decile, for the genes with the top 250 and bottom 250 PubMed gene scores. From the set of genes under the strongest selection (top 10% of shet values), we create two sets of 250 genes. We then annotated these lists with the results from neutrally ascertained screens of gene importance and gene essentiality. We summarize these screens using a heuristic score. (XLSX 60 kb)

Supplementary Table 5

Functional analysis terms from DAVID. We include the results of GO term enrichment screening from DAVID that reach Bonferroni corrected significance in genes with shet > 0.15, shet > 0.25 and shet > 0.5. (XLSX 185 kb)

Supplementary Table 6

Functional analysis clusters from DAVID. We include the results of functional cluster enrichment screening from DAVID that reach Bonferroni corrected significance in genes with shet > 0.15, shet > 0.25 and shet > 0.5. (XLSX 198 kb)

About this article

Cite this article

Cassa, C., Weghorn, D., Balick, D. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat Genet 49, 806–810 (2017). https://doi.org/10.1038/ng.3831
