The exome is the portion of the genome that encodes proteins. Aggregation of 60,706 human exome sequences from 14 studies provides in-depth insight into genetic variation in humans. See Article p.285
Just seven years ago, my colleagues and I reported the protein-coding DNA sequences, called exomes, of 12 individuals (ref. 1), among the first to be produced with a new generation of sequencing technologies (ref. 2). Exome sequencing is much less expensive than whole-genome sequencing and, for cancers and Mendelian disorders (the latter caused by mutations in single genes), there is much more disease-associated genetic variation in the exome than in the rest of the genome. On page 285, the Exome Aggregation Consortium (ExAC) and collaborators (ref. 3) report the exome sequences of 60,706 individuals, collected from diverse studies: a venture roughly 5,000 times larger than our initial study.
The current work highlights the pace at which human genetics is being scaled up. The project is almost ten times bigger than the Exome Sequencing Project (ESP) reported in 2013 (ref. 4), which was an important forerunner of ExAC. Indeed, this may be the deepest dive into the well of human genetic variation so far.
The study and accompanying database are noteworthy on several counts. The first is the sheer number of individuals sequenced and the depth of coverage, that is, the number of times each nucleotide in each individual's exome was sequenced. In the recently completed 1000 Genomes Project, 2,504 genomes were shallowly sequenced (ref. 5), a cost-saving strategy that favours the discovery of common over rare genetic variation. By contrast, each exome in ExAC has been sequenced deeply. Consequently, even genetic variants observed in just one individual can be confidently considered to be real (Fig. 1).
More than half of the approximately 7.5 million variants found by ExAC are seen only once. But collectively, they occur at a remarkably high density: at one out of every eight sites in the exome. For each gene, the authors contrasted the expected and observed numbers of variants that cause the production of truncated proteins, to search for regions containing lower-than-predicted levels of protein-truncating variants. This allowed them to identify several thousand genes that are highly sensitive to such variants, that is, unable to function normally after loss of one copy of the gene, even if the other copy is intact. Most of these genes have not yet been associated with disease, but their mutation probably leads to embryonic death or strongly affects fitness in some other way. These genes are also intolerant of variants in regulatory DNA sequences that markedly alter levels of RNA synthesis from the gene (ref. 6), and are more likely than other genes to be implicated in genome-wide association studies of common disease.
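The comparison of observed and expected counts can be illustrated with a minimal sketch. It is not the consortium's statistical model: it simply assumes that, without selection, the number of protein-truncating variants in a gene follows a Poisson distribution with the expected count as its mean. The gene names and counts are hypothetical.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """Return P(X <= k) for X ~ Poisson(lam), by direct summation."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
        total += term
    return total


def truncation_depletion(observed: int, expected: float) -> tuple[float, float]:
    """Return (observed/expected ratio, probability of seeing this few or fewer)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected, poisson_cdf(observed, expected)


# Hypothetical per-gene counts: (observed, expected) protein-truncating variants.
example_genes = {"GENE_A": (1, 25.0), "GENE_B": (18, 20.0)}

for name, (observed_n, expected_n) in example_genes.items():
    ratio, p_value = truncation_depletion(observed_n, expected_n)
    print(f"{name}: obs/exp = {ratio:.2f}, P(X <= observed) = {p_value:.2e}")
```

Under these assumptions, a gene with far fewer truncating variants than expected (the hypothetical GENE_A) has a low ratio and a very small tail probability, whereas a gene close to expectation (GENE_B) does not stand out.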
The second noteworthy achievement of the research is that it provides a glimpse of the bottom of the well of genetic variation in humans. In human genetics, it is generally assumed that when the same variant is found in more than one individual, it arose once in an ancestor shared by those individuals, rather than through independent mutations of the same site. However, at a particular class of site, called CpG dinucleotides, the researchers make a convincing case that variants observed in multiple individuals often reflect mutational recurrence.
In support of their assertion, the researchers find that discovery rates for new CpG dinucleotide mutations decrease in samples larger than 20,000 individuals. This provides further evidence that the size of the ExAC cohort is sufficiently large that we are beginning to saturate this class of human genetic variation, at least within the exome. It is worth noting, however, that CpG dinucleotides have a highly elevated mutation rate in human genomes, making the number of samples needed to observe such saturation much lower than for other kinds of variants. Nonetheless, this exciting finding presages what lies ahead, as larger aggregate analyses of exomes and genomes are performed.
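Why a class of sites with a higher mutation rate saturates at smaller sample sizes can be seen in a toy model. The sketch below assumes that each sampled individual independently carries a variant at a given site with a fixed probability `q`. That deliberately ignores shared ancestry, and the two values of `q` are hypothetical, chosen only so that one class is ten times more mutable than the other.

```python
def fraction_discovered(n_samples: int, q: float) -> float:
    """Expected fraction of sites seen as variant at least once in n_samples."""
    return 1.0 - (1.0 - q) ** n_samples


def new_fraction_from_next_sample(n_samples: int, q: float) -> float:
    """Expected fraction of sites first seen as variant in sample n_samples + 1."""
    return q * (1.0 - q) ** n_samples


# Hypothetical per-sample, per-site variant probabilities (toy values).
Q_HIGH_RATE = 1e-4  # stands in for a highly mutable class such as CpG sites
Q_LOW_RATE = 1e-5   # stands in for other kinds of sites

for n in (1_000, 20_000, 60_000):
    high = fraction_discovered(n, Q_HIGH_RATE)
    low = fraction_discovered(n, Q_LOW_RATE)
    print(f"n = {n:>6}: high-rate sites {high:.1%}, low-rate sites {low:.1%}")
```

In this model the yield of new variant sites per added sample falls steadily as the sample grows, and it falls soonest for the high-rate class, which is the qualitative pattern described above.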
Third, ExAC promotes the discovery of genes involved in rare diseases. In 2009, my group and others showed how exome sequencing could be used to identify Mendelian-disease genes or to diagnose Mendelian disease (refs 1, 7, 8). Because there are tens of thousands of genetic variants in an exome, these strategies depended on effectively filtering out common variants, which are not likely to cause Mendelian disorders. At that time, databases of common variants were uneven and of suspect quality. Although ESP greatly improved the situation by uniformly and systematically cataloguing both common and rare variants across the exome (ref. 4), ExAC is an order of magnitude larger, and so enables better filtering. This is especially relevant for exome sequencing of non-European, non-African-American individuals, because ExAC provides greater sampling of individuals from outside the United States than ESP does.
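The filtering step can be sketched as follows. This is a minimal illustration rather than any published pipeline: the `Variant` record, the variant identifiers, the counts and the 0.1% frequency cut-off are all hypothetical. A candidate absent from the reference catalogue is represented here by an allele count of zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A candidate variant with its counts in a reference catalogue (hypothetical)."""
    variant_id: str
    allele_count: int   # copies of the variant allele observed in the catalogue
    allele_number: int  # chromosomes successfully genotyped at this site

    @property
    def allele_frequency(self) -> float:
        if self.allele_number == 0:
            return 0.0  # no reference data: treat as unobserved
        return self.allele_count / self.allele_number


def filter_rare(variants: list[Variant], max_af: float = 0.001) -> list[Variant]:
    """Keep only variants at or below the chosen reference allele frequency."""
    return [v for v in variants if v.allele_frequency <= max_af]


# Hypothetical candidates from one patient's exome.
candidates = [
    Variant("var_1", 0, 120_000),      # never seen in the catalogue
    Variant("var_2", 3, 120_000),      # very rare
    Variant("var_3", 6_000, 120_000),  # common (5%), unlikely to be causal
]

for v in filter_rare(candidates):
    print(v.variant_id, f"{v.allele_frequency:.6f}")
```

A larger and more diverse catalogue makes the allele frequency of each candidate more reliable, which is why the size and ancestry composition of the reference panel matter for this step.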
On a related point, the study finds that hundreds of variants previously claimed to cause Mendelian disorders occur at implausibly high frequencies. The authors therefore suggest that they be reclassified as benign. A related study (ref. 9) shows how ExAC may also force a reassessment of whether some genes are involved at all in particular rare disorders. There is little doubt that ExAC will both refine and accelerate Mendelian-gene discovery and clinical genetics.
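One way to see why a frequency can be "implausibly high" is a back-of-the-envelope bound. The sketch below is not the criterion used in the study. It assumes a dominant disorder and makes the most generous allowance, namely that a single variant accounts for every case. Carriers then make up about twice the allele frequency, so the allele frequency cannot exceed the prevalence divided by twice the penetrance. The prevalence and frequencies shown are hypothetical.

```python
def max_plausible_allele_frequency(prevalence: float, penetrance: float = 1.0) -> float:
    """Upper bound on allele frequency for a dominant variant causing the disorder.

    Assumes carriers are about 2 * allele frequency of the population and that
    this one variant could explain every case (a deliberately generous bound).
    """
    if not 0.0 < penetrance <= 1.0:
        raise ValueError("penetrance must be in (0, 1]")
    if prevalence < 0.0:
        raise ValueError("prevalence must be non-negative")
    return prevalence / (2.0 * penetrance)


def is_implausibly_common(allele_frequency: float, prevalence: float,
                          penetrance: float = 1.0) -> bool:
    """True if the variant is too common to cause the disorder under the bound."""
    return allele_frequency > max_plausible_allele_frequency(prevalence, penetrance)


# Hypothetical disorder affecting 1 in 10,000 people, fully penetrant.
PREVALENCE = 1e-4
print(is_implausibly_common(0.01, PREVALENCE))  # variant carried at 1% frequency
print(is_implausibly_common(1e-6, PREVALENCE))  # variant seen in 1 per million alleles
```

Lower penetrance raises the bound, so a flagged variant is a prompt for review under stated assumptions rather than an automatic reclassification.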
Finally, the consortium's approach to data aggregation and sharing is admirable. ExAC is both a technical and political achievement, requiring wrangling not only of data but also of investigators, consents and more from 14 studies — most of which were directed at the genetics of various common diseases.
An ongoing challenge in genomics is balancing the privacy rights of human participants with a strong tradition of promptly and openly sharing data. Building on the precedent of ESP, ExAC strikes this balance by publicly releasing aggregate analyses (a catalogue of variants and the frequencies at which they arise) but not data about associated traits or other individual-level information, although raw data for many studies in ExAC are theoretically accessible through restricted databases. In this way, the study maximizes benefit while minimizing harm. These data have already been available on a terrifically intuitive website for nearly two years (http://exac.broadinstitute.org/), and the site has accrued more than 4 million page views.
If there is one take-home message, it is that there is incredible value in aggregating sequencing data across genomic studies. As the exomes aggregated by ExAC represent just a small fraction of the human samples that have been subjected to exome or genome sequencing so far, we can and should do better. In the coming decade, the number of human genomes that will be sequenced in some manner will grow to at least tens of millions and, by the end of this century, perhaps even billions. The beginnings of saturation seen here with CpG dinucleotides may eventually be observed deeply and at every site, providing a nucleotide-level footprint of the human genome.
About this article
Genome Biology (2016)