Genomics: Analysis of protein-coding genetic variation in 60,706 humans

Journal name: Nature
Volume: 536
Pages: 285–291
DOI: 10.1038/nature19057

Abstract

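Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries, generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation. We identified 3,230 genes with near-complete depletion of predicted protein-truncating variants; 72% of these genes have no currently established human disease phenotype. Finally, we show that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.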
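
The allele-frequency filtering of candidate disease-causing variants mentioned in the abstract (and detailed in the Figure 4a caption) can be illustrated with a minimal Python sketch. It is not the ExAC pipeline: the `Variant` layout, the population labels, the example counts, and the function names are all hypothetical, and "popmax" is assumed here to mean the highest allele frequency observed in any single population. The 0.1% threshold is the one quoted in the Figure 4a caption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Variant:
    """A candidate variant with per-population counts (hypothetical layout)."""
    name: str
    allele_count: Dict[str, int]   # population -> alternate allele count (AC)
    allele_number: Dict[str, int]  # population -> total alleles genotyped (AN)


def global_af(v: Variant) -> float:
    """Allele frequency pooled over all populations."""
    total_an = sum(v.allele_number.values())
    return sum(v.allele_count.values()) / total_an if total_an else 0.0


def popmax_af(v: Variant) -> float:
    """Highest allele frequency seen in any single population (assumed definition)."""
    freqs = [
        v.allele_count.get(pop, 0) / an
        for pop, an in v.allele_number.items()
        if an > 0
    ]
    return max(freqs, default=0.0)


def filter_rare(variants: List[Variant], threshold: float = 0.001,
                use_popmax: bool = True) -> List[Variant]:
    """Keep variants whose reference-panel frequency is below the threshold."""
    af = popmax_af if use_popmax else global_af
    return [v for v in variants if af(v) < threshold]


# Made-up counts: var_b is rare overall but not rare in the smaller population.
candidates = [
    Variant("var_a", {"EUR": 1, "SAS": 0}, {"EUR": 60000, "SAS": 10000}),
    Variant("var_b", {"EUR": 2, "SAS": 40}, {"EUR": 60000, "SAS": 10000}),
]

kept_global = [v.name for v in filter_rare(candidates, use_popmax=False)]
kept_popmax = [v.name for v in filter_rare(candidates, use_popmax=True)]
print(kept_global, kept_popmax)
```

With these made-up counts, the pooled frequency of `var_b` is 42/70,000 (0.06%), so a global filter keeps it, whereas its frequency in the smaller population is 40/10,000 (0.4%), so a popmax filter removes it. This is the unequal-sampling situation in which the caption reports that popmax filtering has more power.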

At a glance

Figures

  1. Figure 1: Patterns of genetic variation in 60,706 humans.

    a, The size and diversity of public reference exome data sets. ExAC exceeds previous data sets in size for all studied populations. b, Principal component analysis (PCA) dividing ExAC individuals into five continental populations. PC2 and PC3 are shown; additional PCs are in Extended Data Fig. 5a. c, The allele frequency spectrum of ExAC highlights that the majority of genetic variants are rare and novel (absent from prior databases of genetic variation, such as dbSNP). d, The proportion of possible variation observed by mutational context and functional class. Over half of all possible CpG transitions are observed. Error bars represent standard error of the mean. e, f, The number (e), and frequency distribution (proportion singleton; f) of indels, by size. Compared to in-frame indels, frameshift variants are less common (have a higher proportion of singletons, a proxy for predicted deleteriousness on gene product). Error bars indicate 95% confidence intervals.

  2. Figure 2: Mutational recurrence at large sample sizes.

    a, Proportion of validated de novo variants from two external data sets that are independently found in ExAC, separated by functional class and mutational context. Error bars represent standard error of the mean. Colours are consistent in a–d. b, Number of unique variants observed, by mutational context, as a function of number of individuals (downsampled from ExAC). CpG transitions, the most likely mutational event, begin reaching saturation at ~20,000 individuals. c, The site frequency spectrum is shown for each mutational context. d, For doubletons (variants with an allele count (AC) of 2), mutation rate is positively correlated with the likelihood of being found in two individuals of different continental populations. e, The mutability-adjusted proportion of singletons (MAPS) is shown across functional classes. Error bars represent standard error of the mean of the proportion of singletons.

  3. Figure 3: Quantifying intolerance to functional variation in genes and gene sets.

    a, Histograms of constraint Z scores for 18,225 genes. This measure of departure of number of variants from expectation is normally distributed for synonymous variants, but right-shifted (higher constraint) for missense and protein-truncating variants (PTVs), indicating that more genes are intolerant to these classes of variation. b, The proportion of genes that are very probably intolerant of loss-of-function variation (pLI ≥ 0.9) is highest for ClinGen haploinsufficient (HI) genes, and stratifies by the severity and age of onset of the haploinsufficient phenotype. Genes essential in cell culture and dominant disease genes are likewise enriched for intolerant genes, whereas recessive disease genes and olfactory receptors have fewer intolerant genes. Black error bars indicate 95% confidence intervals. c, Synonymous Z scores show no correlation with the number of tissues in which a gene is expressed, but the most missense- and PTV-constrained genes tend to be expressed in more tissues. Thick black bars indicate the first to third quartiles, with the white circle marking the median. d, Highly missense- and PTV-constrained genes are less likely to have eQTLs discovered in GTEx than the average gene. Shaded regions around the lines indicate 95% confidence intervals. e, Highly missense- and PTV-constrained genes are more likely to be adjacent to genome-wide association study (GWAS) signals than the average gene. Shaded regions around the lines indicate 95% confidence intervals. f, MAPS (Fig. 2e) is shown for each functional category, broken down by constraint score bins as shown. Missense and PTV constraint score bins provide information about natural selection at least partially orthogonal to MAPS, PolyPhen, and CADD scores, indicating that this metric should be useful in identifying variants associated with deleterious phenotypes. Shaded regions around the lines indicate 95% confidence intervals. For panels a, c–f, variants are coloured with synonymous in grey, missense in orange, and protein-truncating in maroon.

  4. Figure 4: Filtering for Mendelian variant discovery.

    a, Predicted missense and protein-truncating variants in 500 randomly chosen ExAC individuals were filtered based on allele frequency (AF) information from ESP, or from the remaining ExAC individuals. At a 0.1% allele frequency filter, ExAC provides greater power to remove candidate variants, leaving an average of 154 variants for analysis, compared to 1,090 after filtering against ESP. Popmax allele frequency also provides greater power than global allele frequency, particularly when populations are unequally sampled. b, Estimates of allele frequency in Europeans based on ESP are more precise at higher allele frequencies. Sampling variance and ascertainment bias make allele frequency estimates unreliable, posing problems for Mendelian variant filtration. 69% of ESP European singletons are not seen a second time in ExAC (tall bar at left), illustrating the dangers of filtering on very low allele counts. c, Allele frequency spectrum of disease-causing variants in the Human Gene Mutation Database (HGMD) and/or pathogenic or probable pathogenic variants in ClinVar for well-characterized autosomal dominant and autosomal recessive disease genes (ref. 28). Most are not found in ExAC; however, many of the reportedly pathogenic variants found in ExAC are at too high a frequency to be consistent with disease prevalence and penetrance. d, Literature review of variants with >1% global allele frequency or >1% Latin American or South Asian population allele frequency confirmed there is insufficient evidence for pathogenicity for the majority of these variants. Variants were reclassified by American College of Medical Genetics and Genomics (ACMG) guidelines (ref. 24).

  5. Figure 5: Protein-truncating variation in ExAC.

    a, The average ExAC individual has 85 heterozygous and 35 homozygous protein-truncating variants (PTVs), of which 18 and 0.19 are rare (<1% allele frequency), respectively. Error bars represent standard deviation. b, Breakdown of PTVs per individual (a) by popmax allele frequency bin. Across all populations, most PTVs found in a given individual are common (>5% allele frequency). c, d, Number of genes with at least one PTV (c), or at least one homozygous PTV (d), as a function of number of individuals, downsampled from ExAC. The South Asian population is broken down by consanguinity (inbreeding coefficient, F). At 60,000 individuals for ExAC, the plots in c and d extend to 15,750 genes with at least one PTV and 1,550 genes with at least one homozygous PTV. Dotted line represents all ExAC samples.

  6. Extended Data Fig. 1: The effect of recurrence across different mutation and functional classes.

    a, TiTv (transition to transversion) ratio of synonymous variants at downsampled intervals of ExAC. The TiTv is relatively stable at previous sample sizes (<5,000), but changes drastically at larger sample sizes. b, For synonymous doubleton variants, mutability of each trinucleotide context is correlated with mean Euclidean distance of individuals that share the doubleton. Transversion (red), and non-CpG transition (green) doubletons are more likely to be found in closer PCA space (more similar ancestries) than CpG transitions (blue). c, The proportion singleton among various functional categories. The functional category stop lost has a higher singleton rate than nonsense. Error bars represent standard error of the mean. d, Among synonymous variants, mutability of each trinucleotide context is correlated with proportion singleton, suggesting CpG transitions (blue) are more likely to have multiple independent origins driving their allele frequency up. e, The proportion singleton metric from c, broken down by transversions, non-CpG transitions, and CpG variants. Notably, there is a wide variation in singleton rates among mutational contexts in functional classes, and there are no stop-lost (variants that result in the loss of a stop codon) CpG transitions. Error bars represent standard error of the mean.

  7. Extended Data Fig. 2: Multi-nucleotide variants discovered in the ExAC data set.

    a, Number of MNPs per impact on the variant interpretation. b, Distribution of the number of MNPs per sample where phasing changes interpretation, separated by allele frequency. Common >1%, rare <1%. MNPs comprised of a rare and common allele are considered rare as this defines the frequency of the MNP.

  8. Extended Data Fig. 3: Relationships between depth and observed versus expected variants, as well as correlations between observed and expected variant counts for synonymous, missense, and protein-truncating.

    a, The relationship between the median depth of exons (bins of 2) and the sum of all observed synonymous variants in those exons divided by the sum of all expected synonymous variants. The curve was used to determine the appropriate depth adjustment for expected variant counts. For the rest of the panels, the correlation between the depth-adjusted expected variant counts and observed counts is depicted for synonymous (b), missense (c), and protein-truncating (d). The black line indicates a perfect correlation (slope = 1). Axes have been trimmed to remove TTN.

  9. Extended Data Fig. 4: Number of protein-truncating variants in constrained genes per individual by allele frequency bin.

    Equivalent to Fig. 5b limited to constrained (pLI ≥ 0.9) genes.

  10. Extended Data Fig. 5: Principal component analysis (PCA) and key metrics used to filter samples.

    a, Principal component analysis using a set of 5,400 common exome SNPs. Individuals are coloured by their distance from each of the population cluster centres using the first 4 principal components. b, The metrics number of variants, TiTv, alternate heterozygous/homozygous (HetHom) ratio and indel (InsDel) ratio. Populations are Latino (red), African (purple), European (blue), South Asian (yellow) and East Asian (green).
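
The per-sample metrics named in the Extended Data Fig. 5b caption (number of variants, TiTv, HetHom ratio and InsDel ratio) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ExAC quality-control code: the `(ref, alt, genotype)` tuple layout and the function names are hypothetical, HetHom is assumed to be heterozygous calls divided by homozygous-alternate calls, and InsDel is assumed to be insertions divided by deletions.

```python
import math
from typing import Dict, Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# One variant call in one sample: (ref allele, alt allele, "het" or "hom").
Call = Tuple[str, str, str]


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ratio(numerator: int, denominator: int) -> float:
    """Plain ratio; NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def sample_metrics(calls: Iterable[Call]) -> Dict[str, float]:
    """Summarise one sample's calls into the four metrics from the caption."""
    n = ti = tv = het = hom = ins = dele = 0
    for ref, alt, genotype in calls:
        n += 1
        if genotype == "het":
            het += 1
        elif genotype == "hom":
            hom += 1
        if len(ref) == 1 and len(alt) == 1:
            # Single-nucleotide substitution: transition or transversion.
            if is_transition(ref, alt):
                ti += 1
            else:
                tv += 1
        elif len(alt) > len(ref):
            ins += 1
        elif len(alt) < len(ref):
            dele += 1
        # Equal-length multi-base changes count only toward n_variants.
    return {
        "n_variants": n,
        "titv": ratio(ti, tv),
        "het_hom": ratio(het, hom),
        "ins_del": ratio(ins, dele),
    }


# Made-up calls for a single sample.
example_calls = [
    ("A", "G", "het"), ("C", "T", "het"), ("T", "C", "het"),
    ("A", "C", "hom"), ("G", "GA", "het"), ("CT", "C", "hom"),
]
print(sample_metrics(example_calls))
```

Samples whose metrics fall far from those of the rest of their population are the natural candidates for exclusion. The caption does not state the thresholds that were used, so none are assumed here.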

References

  1. Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2012)
  2. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015)
  3. Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011)
  4. Stoneking, M. & Krause, J. Learning about human population history from ancient and modern genomes. Nature Rev. Genet. 12, 603–614 (2011)
  5. MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012)
  6. Bamshad, M. J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nature Rev. Genet. 12, 745–755 (2011)
  7. MacArthur, D. G. et al. Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469–476 (2014)
  8. The Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015)
  9. Fromer, M. et al. De novo mutations in schizophrenia implicate synaptic networks. Nature 506, 179–184 (2014)
  10. Cooper, D. N. & Youssoufian, H. The CpG dinucleotide and human genetic disease. Hum. Genet. 78, 151–155 (1988)
  11. Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nature Genet. 46, 944–950 (2014)
  12. Tennessen, J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012)
  13. Gudbjartsson, D. F. et al. Large-scale whole-genome sequencing of the Icelandic population. Nature Genet. 47, 435–444 (2015)
  14. Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013)
  15. Vicoso, B. & Charlesworth, B. Evolution on the X chromosome: unusual patterns and processes. Nature Rev. Genet. 7, 645–653 (2006)
  16. Jeong, H., Mason, S. P., Barabási, A. L. & Oltvai, Z. N. Lethality and centrality in protein networks. Nature 411, 41–42 (2001)
  17. Goh, K.-I. et al. The human disease network. Proc. Natl Acad. Sci. USA 104, 8685–8690 (2007)
  18. Rolland, T. et al. A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014)
  19. Itan, Y. et al. The human gene damage index as a gene-level approach to prioritizing exome variants. Proc. Natl Acad. Sci. USA 112, 13615–13620 (2015)
  20. The GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015)
  21. Bell, C. J. et al. Carrier testing for severe childhood recessive diseases by next-generation sequencing. Sci. Transl. Med. 3, 65ra4 (2011)
  22. Xue, Y. et al. Deleterious- and disease-allele prevalence in healthy individuals: insights from current predictions, mutation databases, and population-scale resequencing. Am. J. Hum. Genet. 91, 1022–1032 (2012)
  23. Piton, A., Redin, C. & Mandel, J.-L. XLID-causing mutations and associated genes challenged in light of data from large-scale human exome sequencing. Am. J. Hum. Genet. 93, 368–383 (2013)
  24. Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–423 (2015)
  25. Chagnon, P. et al. A missense mutation (R565W) in Cirhin (FLJ14728) in North American Indian childhood cirrhosis. Am. J. Hum. Genet. 71, 1443–1449 (2002)
  26. Stenson, P. D. et al. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum. Genet. 133, 1–9 (2014)
  27. Dewey, F. E. et al. Sequence to medical phenotypes: a framework for interpretation of human whole genome DNA sequence data. PLoS Genet. 11, e1005496 (2015)
  28. Blekhman, R. et al. Natural selection on genes that underlie human disease susceptibility. Curr. Biol. 18, 883–889 (2008)
  29. Minikel, E. V. et al. Quantifying prion disease penetrance using large population control cohorts. Sci. Transl. Med. 8, 322ra9 (2016)
  30. Chong, J. X. et al. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 97, 199–215 (2015)
  31. Kathiresan, S. Developing medicines that mimic the natural successes of the human genome: lessons from NPC1L1, HMGCR, PCSK9, APOC3, and CETP. J. Am. Coll. Cardiol. 65, 1562–1566 (2015)
  32. Lim, E. T. et al. Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS Genet. 10, e1004494 (2014)
  33. Ruderfer, D. M. et al. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes. Nature Genet. http://dx.doi.org/10.1038/ng.3638 (2016)
  34. Sulem, P. et al. Identification of a large set of rare complete human knockouts. Nature Genet. 47, 448–452 (2015)
  35. Narasimhan, V. M. et al. Health and population effects of rare gene knockouts in adult humans with related parents. Science http://dx.doi.org/10.1126/science.aac8624 (2016)
  36. Saleheen, D. et al. Human knockouts in a cohort with a high rate of consanguinity. Preprint at bioRxiv http://dx.doi.org/10.1101/031518 (2015)
  37. Freischmidt, A. et al. Haploinsufficiency of TBK1 causes familial ALS and fronto-temporal dementia. Nature Neurosci. 18, 631–636 (2015)
  38. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genet. 43, 491–498 (2011)
  39. Voight, B. F. et al. The Metabochip, a custom genotyping array for genetic studies of metabolic, cardiovascular, and anthropometric traits. PLoS Genet. 8, e1002793 (2012)
  40. Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnol. 32, 246–251 (2014)


Author information

  1. These authors contributed equally to this work.

    • Konrad J. Karczewski,
    • Eric V. Minikel &
    • Kaitlin E. Samocha

Affiliations

  1. Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA

    • Monkol Lek,
    • Konrad J. Karczewski,
    • Eric V. Minikel,
    • Kaitlin E. Samocha,
    • Anne H. O'Donnell-Luria,
    • Andrew J. Hill,
    • Beryl B. Cummings,
    • Taru Tukiainen,
    • Jack A. Kosmicki,
    • Laramie E. Duncan,
    • Karol Estrada,
    • Fengmei Zhao,
    • Emma Pierce-Hoffman,
    • Menachem Fromer,
    • Jackie Goldstein,
    • Daniel Howrigan,
    • Brett P. Thomas,
    • Benjamin M. Neale,
    • Aarno Palotie,
    • Mark J. Daly &
    • Daniel G. MacArthur
  2. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA

    • Monkol Lek,
    • Konrad J. Karczewski,
    • Eric V. Minikel,
    • Kaitlin E. Samocha,
    • Eric Banks,
    • Timothy Fennell,
    • Anne H. O'Donnell-Luria,
    • James S. Ware,
    • Andrew J. Hill,
    • Beryl B. Cummings,
    • Taru Tukiainen,
    • Daniel P. Birnbaum,
    • Jack A. Kosmicki,
    • Laramie E. Duncan,
    • Karol Estrada,
    • Fengmei Zhao,
    • James Zou,
    • Emma Pierce-Hoffman,
    • Jason Flannick,
    • Jackie Goldstein,
    • Namrata Gupta,
    • Daniel Howrigan,
    • Mitja I. Kurki,
    • Pradeep Natarajan,
    • Gina M. Peloso,
    • Manuel A. Rivas,
    • Christine Stevens,
    • Brett P. Thomas,
    • Ben Weisburd,
    • David M. Altshuler,
    • Stacey Donnelly,
    • Jose C. Florez,
    • Stacey B. Gabriel,
    • Sekar Kathiresan,
    • Benjamin M. Neale,
    • Aarno Palotie,
    • Jeremiah M. Scharf,
    • Mark J. Daly &
    • Daniel G. MacArthur
  3. School of Paediatrics and Child Health, University of Sydney, Sydney, New South Wales 2145, Australia

    • Monkol Lek
  4. Institute for Neuroscience and Muscle Research, Children's Hospital at Westmead, Sydney, New South Wales 2145, Australia

    • Monkol Lek
  5. Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, Massachusetts 02115, USA

    • Eric V. Minikel,
    • Kaitlin E. Samocha,
    • Beryl B. Cummings &
    • Aarno Palotie
  6. Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA

    • Kaitlin E. Samocha,
    • Jack A. Kosmicki,
    • Laramie E. Duncan,
    • Menachem Fromer,
    • Jackie Goldstein,
    • Daniel Howrigan,
    • Samuel A. Rose,
    • Dongmei Yu,
    • Steven McCarroll,
    • Benjamin M. Neale,
    • Jeremiah M. Scharf &
    • Mark J. Daly
  7. Division of Genetics and Genomics, Boston Children's Hospital, Boston, Massachusetts 02115, USA

    • Anne H. O'Donnell-Luria
  8. Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA

    • James S. Ware &
    • Steven McCarroll
  9. National Heart and Lung Institute, Imperial College London, London SW7 2AZ, UK

    • James S. Ware
  10. NIHR Royal Brompton Cardiovascular Biomedical Research Unit, Royal Brompton Hospital, London SW3 6NP, UK

    • James S. Ware
  11. MRC Clinical Sciences Centre, Imperial College London, London SW7 2AZ, UK

    • James S. Ware
  12. Genome Sciences, University of Washington, Seattle, Washington 98195, USA

    • Andrew J. Hill
  13. Program in Bioinformatics and Integrative Genomics, Harvard Medical School, Boston, Massachusetts 02115, USA

    • Jack A. Kosmicki
  14. Mouse Genome Informatics, Jackson Laboratory, Bar Harbor, Maine 04609, USA

    • Joanne Berghout
  15. Center for Biomedical Informatics and Biostatistics, University of Arizona, Tucson, Arizona 85721, USA

    • Joanne Berghout
  16. Institute of Medical Genetics, Cardiff University, Cardiff CF10 3XQ, UK

    • David N. Cooper &
    • Peter D. Stenson
  17. Google, Mountain View, California 94043, USA

    • Nicole Deflaux
  18. Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA

    • Mark DePristo,
    • Laura Gauthier,
    • Adam Kiezun,
    • Ami Levy Moonshine,
    • Ryan Poplin,
    • Valentin Ruano-Rubio,
    • Khalid Shakir,
    • Grace Tiao &
    • Gad Getz
  19. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA

    • Ron Do,
    • Menachem Fromer,
    • Douglas M. Ruderfer,
    • Shaun M. Purcell &
    • Pamela Sklar
  20. Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA

    • Ron Do,
    • Menachem Fromer,
    • Douglas M. Ruderfer,
    • Shaun M. Purcell &
    • Pamela Sklar
  21. The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA

    • Ron Do
  22. The Center for Statistical Genetics, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA

    • Ron Do
  23. Department of Molecular Biology, Massachusetts General Hospital, Boston, Massachusetts 02114, USA

    • Jason Flannick
  24. Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA

    • Menachem Fromer,
    • Douglas M. Ruderfer,
    • Shaun M. Purcell &
    • Pamela Sklar
  25. Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA

    • Mitja I. Kurki,
    • Dongmei Yu &
    • Jeremiah M. Scharf
  26. Harvard Medical School, Boston, Massachusetts 02115, USA

    • Pradeep Natarajan,
    • Jose C. Florez,
    • Gad Getz &
    • Sekar Kathiresan
  27. Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts 02114, USA

    • Pradeep Natarajan,
    • Gina M. Peloso,
    • Dongmei Yu,
    • Jose C. Florez,
    • Sekar Kathiresan &
    • Jeremiah M. Scharf
  28. Cardiovascular Research Center, Massachusetts General Hospital, Boston, Massachusetts 02114, USA

    • Pradeep Natarajan,
    • Gina M. Peloso &
    • Sekar Kathiresan
  29. Immunogenomics and Metabolic Disease Laboratory, Instituto Nacional de Medicina Genómica, Mexico City 14610, Mexico

    • Lorena Orozco
  30. Molecular Biology and Genomic Medicine Unit, Instituto Nacional de Ciencias Médicas y Nutrición, Mexico City 14080, Mexico

    • Maria T. Tusie-Luna
  31. Samsung Advanced Institute for Health Sciences and Technology (SAIHST), Sungkyunkwan University, Samsung Medical Center, Seoul, South Korea

    • Hong-Hee Won
  32. Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts 02114, USA

    • Dongmei Yu &
    • Jeremiah M. Scharf
  33. Vertex Pharmaceuticals, Boston, Massachusetts 02210, USA

    • David M. Altshuler
  34. Department of Cardiology, University Hospital, 43100 Parma, Italy

    • Diego Ardissino
  35. Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA

    • Michael Boehnke
  36. Department of Public Health and Primary Care, Strangeways Research Laboratory, Cambridge CB1 8RN, UK

    • John Danesh
  37. Cardiovascular Epidemiology and Genetics, Hospital del Mar Medical Research Institute, 08003 Barcelona, Spain

    • Roberto Elosua
  38. Department of Pathology and Cancer Center, Massachusetts General Hospital, Boston, Massachusetts 02114, USA

    • Gad Getz
  39. Psychiatric Genetic Epidemiology & Neurobiology Laboratory, State University of New York, Upstate Medical University, Syracuse, New York 13210, USA

    • Stephen J. Glatt
  40. Department of Psychiatry and Behavioral Sciences, State University of New York, Upstate Medical University, Syracuse, New York 13210, USA

    • Stephen J. Glatt
  41. Department of Neuroscience and Physiology, State University of New York, Upstate Medical University, Syracuse, New York 13210, USA

    • Stephen J. Glatt
  42. Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, SE-171 77 Stockholm, Sweden

    • Christina M. Hultman
  43. Department of Medicine, University of Eastern Finland and Kuopio University Hospital, 70211 Kuopio, Finland

    • Markku Laakso
  44. Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX1 2JD, UK

    • Mark I. McCarthy &
    • Hugh C. Watkins
  45. Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford OX1 2JD, UK

    • Mark I. McCarthy
  46. Oxford NIHR Biomedical Research Centre, Oxford University Hospitals Foundation Trust, Oxford OX1 2JD, UK

    • Mark I. McCarthy
  47. Inflammatory Bowel Disease and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California 90048, USA

    • Dermot McGovern
  48. Atherogenomics Laboratory, University of Ottawa Heart Institute, Ottawa, Ontario K1Y 4W7, Canada

    • Ruth McPherson
  49. Institute for Molecular Medicine Finland (FIMM), University of Helsinki, 00100 Helsinki, Finland

    • Aarno Palotie
  50. Department of Biostatistics and Epidemiology, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA

    • Danish Saleheen
  51. Department of Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA

    • Danish Saleheen
  52. Center for Non-Communicable Diseases, Karachi, Pakistan

    • Danish Saleheen
  53. Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA

    • Pamela Sklar
  54. Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA

    • Pamela Sklar
  55. Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599, USA

    • Patrick F. Sullivan
  56. Department of Medical Epidemiology and Biostatistics, Karolinska Institutet SE-171 77 Stockholm, Sweden

    • Patrick F. Sullivan
  57. Department of Public Health, University of Helsinki, 00100 Helsinki, Finland

    • Jaakko Tuomilehto
  58. Department of Psychiatry, University of California, San Diego, California 92093, USA

    • Ming T. Tsuang
  59. Radcliffe Department of Medicine, University of Oxford, Oxford OX1 2JD, UK

    • Hugh C. Watkins
  60. Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, Mississippi 39216, USA

    • James G. Wilson

Consortia

  1. Exome Aggregation Consortium

  2. A list of participants and their affiliations appears in the Supplementary Information

Contributions

M.Le., K.J.K., E.V.M., K.E.S., E.B., T.F., A.H.O., J.S.W., A.J.H., B.B.C., T.T., D.P.B., J.A.K., L.E.D., K.E., F.Z., J.Z., E.P., M.J.D. and D.G.M. contributed to the analysis and writing of the manuscript. M.Le., E.B., T.F., K.J.K., E.V.M., F.Z., D.P.B., J.B., D.N.C., N.D., M.D., R.D., J.F., M.F., L.G., J.G., N.G., D.H., A.K., M.I.K., A.L.M., P.N., L.O., G.M.P., R.P., M.A.R., V.R., S.A.R., D.M.R., K.S., P.D.S., C.S., B.P.T., G.T., M.T.T., B.W., H.W., D.Y., S.B.G., M.J.D. and D.G.M. contributed to the production of the ExAC data set. D.M.A., D.A., M.B., J.D., S.D., R.E., J.C.F., S.B.G., G.G., S.J.G., C.M.H., S.K., M.La., S.M., M.I.M., D.M., R.M., B.M.N., A.P., S.M.P., D.S., J.M.S., P.S., P.F.S., J.T., M.T.T., H.C.W., J.G.W., M.J.D. and D.G.M. contributed to the design and conduct of the various exome sequencing studies and review of the manuscript.

Competing financial interests

P.F.S. is a scientific advisor to Pfizer.

Corresponding author

Correspondence to:

The ExAC data set is publicly available at http://exac.broadinstitute.org.

Reviewer Information: Nature thanks L. Biesecker, J. Shendure and the other anonymous reviewer(s) for their contribution to the peer review of this work.



Supplementary information

PDF files

  1. Supplementary Information (6.3 MB)

    This file contains Supplementary Text and Data, Supplementary References, Supplementary Tables 1-5, 7-8, 10-12, 14-15, 17-18, 21-25, (see separate zipped file for Tables 6, 9, 13, 16, 19 and 20) and Supplementary Figures 1-5 - see Contents on pages 8-9 for more details.

Zip files

  1. Supplementary Tables (5.7 MB)

    This zipped file contains Supplementary Tables 6, 9, 13, 16, 19 and 20.
