Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation. In particular, we identify 3,230 genes with near-complete depletion of predicted protein-truncating variants, 72% of which have no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human ‘knockout’ variants in protein-coding genes.
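The filtering use case mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes a hypothetical in-memory lookup of reference-panel allele frequencies keyed by a variant string, and an illustrative frequency cut-off of 0.1%; the function name, the key format and the threshold are all assumptions made for this example, not values or interfaces taken from the paper.

```python
def filter_candidates(candidates, reference_af, max_af=0.001):
    """Keep candidate variants that are absent from, or rare in, a reference panel.

    candidates   -- list of variant identifiers (hypothetical "chrom:pos:ref>alt" strings)
    reference_af -- dict mapping variant identifier to reference allele frequency
    max_af       -- illustrative cut-off; an assumption, not a value from the paper
    """
    kept = []
    for variant in candidates:
        # A variant missing from the reference panel is treated as frequency 0.
        af = reference_af.get(variant, 0.0)
        if af <= max_af:
            kept.append(variant)
    return kept


# Toy data: one common variant, one rare variant, one variant unseen in the panel.
candidates = ["1:1000:A>G", "2:2000:C>T", "3:3000:G>A"]
reference_af = {"1:1000:A>G": 0.05, "2:2000:C>T": 0.0002}
rare = filter_candidates(candidates, reference_af)
```

In this toy example the common variant is discarded and the other two are retained. A real analysis would choose the cut-off from the prevalence and inheritance model of the disease under study, which this sketch does not attempt.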
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: The effect of recurrence across different mutation and functional classes. (344 KB)
a, TiTv (transition to transversion) ratio of synonymous variants at downsampled intervals of ExAC. The TiTv is relatively stable at previously available sample sizes (<5,000), but changes drastically at larger sample sizes. b, For synonymous doubleton variants, the mutability of each trinucleotide context is correlated with the mean Euclidean distance between the individuals that share the doubleton. Transversion (red) and non-CpG transition (green) doubletons are more likely than CpG transition (blue) doubletons to be shared by individuals that lie close together in PCA space (more similar ancestries). c, The proportion of singletons among various functional categories. The stop-lost category has a higher singleton rate than the nonsense category. Error bars represent standard error of the mean. d, Among synonymous variants, the mutability of each trinucleotide context is correlated with the proportion of singletons, suggesting that CpG transitions (blue) are more likely to have multiple independent origins that drive their allele frequency up. e, The proportion-singleton metric from c, broken down by transversions, non-CpG transitions and CpG transitions. Notably, singleton rates vary widely among mutational contexts within functional classes, and there are no stop-lost (variants that result in the loss of a stop codon) CpG transitions. Error bars represent standard error of the mean.
- Extended Data Figure 2: Multi-nucleotide variants discovered in the ExAC data set. (152 KB)
a, Number of multi-nucleotide polymorphisms (MNPs) by their impact on variant interpretation. b, Distribution of the number of MNPs per sample where phasing changes interpretation, separated by allele frequency (common, >1%; rare, <1%). MNPs composed of one rare and one common allele are considered rare, as the rare allele defines the frequency of the MNP.
- Extended Data Figure 3: Relationships between depth and observed versus expected variants, as well as correlations between observed and expected variant counts for synonymous, missense, and protein-truncating variants. (188 KB)
a, The relationship between the median depth of exons (bins of 2) and the sum of all observed synonymous variants in those exons divided by the sum of all expected synonymous variants. The curve was used to determine the appropriate depth adjustment for expected variant counts. The remaining panels show the correlation between the depth-adjusted expected variant counts and the observed variant counts for synonymous (b), missense (c), and protein-truncating (d) variants. The black line indicates a perfect correlation (slope = 1). Axes have been trimmed to remove TTN.
- Extended Data Figure 4: Number of protein-truncating variants in constrained genes per individual by allele frequency bin. (110 KB)
Equivalent to Fig. 5b, but limited to constrained (pLI ≥ 0.9) genes.
- Extended Data Figure 5: Principal component analysis (PCA) and key metrics used to filter samples. (505 KB)
a, Principal component analysis using a set of 5,400 common exome SNPs. Individuals are coloured by their distance from each of the population cluster centres using the first 4 principal components. b, The sample-level metrics used for filtering: number of variants, TiTv, alternate heterozygous/homozygous (HetHom) ratio and insertion/deletion (InsDel) ratio. Populations are Latino (red), African (purple), European (blue), South Asian (yellow) and East Asian (green).
- Supplementary Information (6.3 MB)
This file contains Supplementary Text and Data, Supplementary References, Supplementary Tables 1-5, 7-8, 10-12, 14-15, 17-18 and 21-25 (see separate zipped file for Tables 6, 9, 13, 16, 19 and 20) and Supplementary Figures 1-5. See Contents on pages 8-9 for more details.
- Supplementary Tables (5.7 MB)
This zipped file contains Supplementary Tables 6, 9, 13, 16, 19 and 20.
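Two of the summary metrics named in the Extended Data Figure 1 legend, the TiTv ratio and the proportion of singletons, can be illustrated with a minimal sketch. It assumes each variant is a single-nucleotide change represented as a hypothetical `(ref, alt, allele_count)` tuple; the function names and the toy data are illustrative assumptions and do not reproduce the authors' pipeline.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes;
# every other single-nucleotide change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(variants):
    """Return the transition-to-transversion ratio of a list of SNVs."""
    n_transitions = sum(1 for ref, alt, _ in variants if (ref, alt) in TRANSITIONS)
    n_transversions = len(variants) - n_transitions
    if n_transversions == 0:
        raise ValueError("TiTv is undefined when there are no transversions")
    return n_transitions / n_transversions


def proportion_singleton(variants):
    """Return the fraction of variants observed exactly once (allele count 1)."""
    if not variants:
        raise ValueError("proportion singleton is undefined for an empty set")
    n_singletons = sum(1 for _, _, allele_count in variants if allele_count == 1)
    return n_singletons / len(variants)


# Toy data (hypothetical): (reference base, alternate base, allele count).
toy_variants = [
    ("A", "G", 1),   # transition, singleton
    ("C", "T", 12),  # transition
    ("G", "A", 1),   # transition, singleton
    ("C", "A", 1),   # transversion, singleton
    ("T", "G", 3),   # transversion
    ("T", "C", 2),   # transition
]

toy_titv = titv_ratio(toy_variants)
toy_singleton_fraction = proportion_singleton(toy_variants)
```

Applying both functions to subsets of variants (for example, per functional category or per mutational context) gives the kind of breakdown described in panels c and e. The error bars and downsampling shown in the figure are outside the scope of this sketch.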