The goal of the 1000 Genomes Project1 is to find most of the variants in the human genome that have a frequency of at least 1% in the populations studied. The consortium of researchers participating in the project now reports the results of its pilot phase (page 1061 of this issue2).

But first let's take a step back. A decade ago, the reference copy of the human genome was sequenced3,4. Although that project is undoubtedly one of the greatest scientific achievements of our time, its potential societal impact will be fully realized only if genomic regions that are responsible for various traits of medical importance, such as response to a drug or susceptibility to a disease, can be identified. After the initial sequencing of the human genome, therefore, a second phase of human genomics emerged, focusing on identifying genomic variations responsible for hereditary diseases and other medically relevant traits. Such genome-wide association studies (GWAS) are based on examining the genomes of thousands of individuals for correlations between the presence of genomic variants and the trait of interest.

Many successes have come out of GWAS5,6, but there has also been some disappointment that perhaps the pickings from these studies have been too slim7. For instance, although certain disorders — including obesity, diabetes and cardiovascular disease — are known to have a strong genetic component, their associated genomic variants detected through GWAS cannot explain most of the experimentally identified genetic effects found in affected families. Human geneticists call this problem the 'missing heritability'7.

There are many possible explanations for the missing heritability, the most popular being the effect of rare variants. GWAS are based on examining a battery of different variants across the genome. Until recently, however, the cost of including both common and rare variants in such studies was prohibitively high, pushing the focus towards identifying common variants that occur at a relatively high frequency in the population. Consequently, if many rare variants, rather than a few common ones, are responsible for a disease, the rare variants would have been missed in most GWAS.

An obvious solution to this problem is to sequence whole genomes. But this is easier said than done: GWAS require sample sizes in the thousands, making whole-genome sequencing of every participant extremely expensive. However, computational-biology studies have provided a crucial insight that is helping to pave the way for more comprehensive genomic studies. The idea is that if most variants, both common and rare, can be characterized through whole-genome sequencing of just a few individuals, then only a relatively small battery of variants needs to be typed in the remaining individuals of a genome-wide association study; the full pattern of variants in those individuals can then be inferred computationally from the few whole-genome sequences.

Sceptics may find this notion — using the data from some individuals to 'invent' data for others — alarming. But if done correctly, this method, called imputation8, can significantly increase the statistical power of GWAS (Fig. 1). This idea is one of the main motivating forces behind the 1000 Genomes Project.

Figure 1: Gene sequencing by imputation.

On the basis of the pattern in a set of reference sequences, the missing nucleotides (indicated by question marks) in a new data set can be imputed. For example, because all sequences in the reference data with a G and a T in the first and third positions, respectively, have an A in the second position, the missing nucleotide in the first sequence of the new data is likely to be an A. Imputation methods are an integral component of the paper2 reporting the pilot phase of the 1000 Genomes Project.
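To make the figure's logic concrete, here is a minimal sketch of imputation by exact matching and majority vote over a reference panel. It is purely illustrative: the data are the toy haplotypes of Figure 1, and real imputation tools, including those used in the pilot project, rely on far more sophisticated probabilistic haplotype models.

```python
# Minimal illustration of imputation: fill a missing allele by majority
# vote among reference haplotypes that agree at every observed position.
# Real tools use hidden-Markov haplotype models; this is only a sketch,
# and it assumes at least one matching reference haplotype exists.
from collections import Counter

def impute(reference, observed):
    """reference: list of complete haplotype strings, e.g. 'GAT'.
    observed: haplotype with '?' at missing sites, e.g. 'G?T'.
    Returns the haplotype with each '?' filled by the most common
    allele among the matching reference haplotypes."""
    # Keep only reference haplotypes consistent with the observed alleles.
    matches = [h for h in reference
               if all(o in ('?', r) for o, r in zip(observed, h))]
    filled = []
    for i, allele in enumerate(observed):
        if allele != '?':
            filled.append(allele)
        else:
            # Majority vote at the missing position.
            counts = Counter(h[i] for h in matches)
            filled.append(counts.most_common(1)[0][0])
    return ''.join(filled)

# The toy example of Figure 1: every reference haplotype with a G in the
# first position and a T in the third carries an A in the middle, so the
# incomplete sequence 'G?T' is imputed as 'GAT'.
reference_panel = ['GAT', 'GAT', 'CCT', 'GAT', 'CCA']
print(impute(reference_panel, 'G?T'))  # -> GAT
```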

In the pilot phase of the project2, the authors used several techniques to sequence the whole genomes of 179 individuals. They thereby generated a catalogue of 8 million previously unknown variants affecting single nucleotides — the building blocks of DNA — and around 1 million small insertions or deletions of DNA. The study also presents several new methods for analysing genomic data. For example, it convincingly shows that imputation methods can significantly increase the power of GWAS.

New technologies also allow the protein-coding sequences (exons) within genes to be sequenced specifically. The vast majority of genomic DNA falls outside genes, but many of the most important variants are thought to be located within exons. Exon sequencing therefore provides a cost-effective method for identifying most of the functional variants. The consortium2 reports exon sequences of 697 individuals from different ethnic groups.

Apart from exon sequencing, another way to contain the cost of sequencing-based GWAS is to sequence genomes at only low coverage. This means that, for each individual, only a limited amount of randomly distributed DNA is sequenced. Although each position in the genome is, on average, sequenced several times with this approach, any particular genomic region may still lack data. Low coverage was, in fact, the approach taken for the whole-genome sequencing of the 179 individuals2.
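A back-of-the-envelope calculation shows why such gaps are unavoidable. If reads land approximately uniformly at random, the depth at any given site is roughly Poisson-distributed around the average coverage, so the fraction of sites receiving no reads at all is about exp(-d) for mean depth d. The sketch below evaluates this idealized model, which ignores mapping biases and uneven coverage, for a few depths:

```python
# Under an idealized model, read depth at a site is Poisson(mean_depth),
# so the chance that a site receives no reads at all is exp(-mean_depth).
import math

for mean_depth in (2, 4, 6, 8):
    p_missing = math.exp(-mean_depth)              # P(depth == 0)
    p_single = mean_depth * math.exp(-mean_depth)  # P(depth == 1)
    print(f"{mean_depth}x coverage: {p_missing:.1%} of sites unsequenced, "
          f"{p_single:.1%} covered by a single read")
```

At 4x mean depth, for example, nearly 2% of sites receive no reads and a further 7% only a single read, which is why statistical methods are needed to fill the gaps.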

A disadvantage of low-coverage sequencing is a higher error rate, but this too can be reduced using imputation methods. Indeed, when supplemented with such methods, the consortium's low-coverage data achieved an overall error rate of only 1–3%. Imputation-based methods may therefore also be the key to maximizing the utility of low-coverage sequencing data. Characterizing variants at heterozygous sites, which carry two different versions of the DNA sequence (one inherited from each parent), is more difficult: for these sites, the error rate in the present study varied between 5% and 30%, depending on the frequency of the variant.
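The difficulty at heterozygous sites has a simple probabilistic core: with only a handful of reads at a site, all of them may by chance sample the same chromosome copy, so a heterozygote masquerades as a homozygote. The following sketch computes genotype likelihoods under an idealized binomial read model with a fixed per-base error rate; the 1% error rate and the read counts are illustrative, and real genotype callers are considerably more elaborate.

```python
# Why heterozygotes are hard at low coverage: with k reads at a site, a
# true heterozygote shows only one of its two alleles with probability
# about 2 * 0.5**k, so shallow data often miss the second allele.
# Idealized binomial model with a fixed per-base error rate.
from math import comb

def genotype_likelihoods(n_ref, n_alt, err=0.01):
    """Likelihood of the read counts under each possible genotype.
    n_ref, n_alt: reads supporting the reference / alternative allele."""
    n = n_ref + n_alt
    def binom(p):  # P(n_alt alt reads out of n | per-read alt prob p)
        return comb(n, n_alt) * p**n_alt * (1 - p)**(n - n_alt)
    return {
        'hom_ref': binom(err),      # alt reads arise only from error
        'het':     binom(0.5),      # each read samples either chromosome
        'hom_alt': binom(1 - err),
    }

# Two reads, both showing the reference allele: a heterozygote still
# explains these data 25% of the time, so the call is ambiguous.
print(genotype_likelihoods(n_ref=2, n_alt=0))
```

With so little evidence per site, a confident call is impossible without borrowing information, via imputation, from other individuals in the sample.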

Given the declining cost of DNA sequencing, future discoveries in human genomics are more likely to be based on a combination of exon sequencing and low-coverage whole-genome sequencing than on more traditional genotyping techniques. Such sequencing gives access to rare and novel variants; it is also better suited to identifying DNA insertions and deletions and, more generally, to detecting less-common variants that affect only a single nucleotide.

The remaining question is how to accommodate errors in low-coverage sequencing, because an error rate of even a few per cent can lead to drastically reduced power if not accounted for appropriately9. Statistical methods that incorporate high error rates will be an essential component of future genomic-sequencing efforts. But no matter which protocol is used, the focus of the third phase of human genomics will clearly be on whole-genome sequencing.
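To illustrate the kind of statistical machinery this will require: one widely used way to fold genotype uncertainty into an association test, standard in the imputation literature rather than specific to the consortium's paper, is to avoid hard genotype calls altogether. Each individual contributes the expected count of the alternative allele (the 'dosage') under their posterior genotype probabilities, and the trait is then tested against that dosage. A minimal sketch with hypothetical numbers:

```python
# Sketch of an association test that propagates genotype uncertainty:
# instead of a hard call, each individual contributes the expected
# allele count (dosage) under their posterior genotype probabilities.
# All data below are hypothetical and purely illustrative.

def dosage(post):
    """Expected count of the alternative allele.
    post: (P(hom_ref), P(het), P(hom_alt)), summing to 1."""
    return 0 * post[0] + 1 * post[1] + 2 * post[2]

def correlation(xs, ys):
    """Pearson correlation between dosages and trait values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Posterior genotype probabilities (e.g. from imputation) and a trait.
posteriors = [(0.9, 0.1, 0.0), (0.2, 0.7, 0.1),
              (0.0, 0.3, 0.7), (0.8, 0.2, 0.0)]
trait = [1.2, 2.1, 3.4, 0.9]
dosages = [dosage(p) for p in posteriors]
print(correlation(dosages, trait))
```

Because the dosage shrinks towards the mean when a genotype is uncertain, individuals with noisy, low-coverage data are automatically down-weighted rather than contributing confidently wrong calls.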