The human genome comprises three billion bases, of which millions differ between any two genomes from different people. The 1000 Genomes Project, which was set up in 2008, aimed to study this variation in at least 1,000 people, and in doing so to provide a solid foundation on which to build an understanding of genetic variation in the human population. Early work from the pilot phase of the project provided preliminary data sets and gave the first details about methods for analysing these data. In this issue of Nature, the 1000 Genomes Project publishes its final two papers1,2, which analyse 2,504 genomes from 26 populations, and provide the most comprehensive view of global human variation so far.
In the first paper (page 68), the 1000 Genomes Project Consortium1 focuses on relatively simple, short variations that affect up to 500 bases. As expected, the vast majority of variants affect only one base. However, the paper also outlines small but complex changes that were not studied in previous analyses. In the second paper (page 75), Sudmant et al.2 explore more-complex changes that affect larger portions of the chromosome. These structural variants, which can be up to 500,000 bases long, are analysed in much greater detail than had been possible previously, thanks to improvements in genome sequencing over the past decade.
Sudmant et al. show that structural variants are abundant in human genomes and arise from a series of evolutionary processes that are more complex than had been thought3. The authors compare the effect of these variants on gene expression with the effect of variants in which one base is substituted for another, known as single nucleotide polymorphisms (SNPs). They report that the structural variants have a disproportionate impact on gene expression, given their relatively low numbers in the genome compared with SNPs.
Both studies make huge strides in increasing the accuracy and sensitivity of DNA sequencing, particularly in identifying mutations that result in the insertion or deletion of bases (known as indels) and so shift the position of each subsequent base up or down the sequence. This advance allowed the consortium to correlate the presence of indels at specific positions with the more-humdrum SNPs.
These correlative data sets have benefits for genome-wide association studies (GWAS), which compare genomes from large cohorts of people to identify variants that are associated with disease or other traits. The correlated data sets from the 1000 Genomes Project Consortium1 allow researchers to infer a large complement of variants, including indels and structural variants, in panels of people for whom only a small subset of SNPs have been analysed, using partial sequencing techniques such as genotyping arrays. Because genotyping arrays are cheap, the ability to infer variation allows researchers to focus on increasing sample sizes, a crucial next step in improving our understanding of the genetics of disease. Furthermore, panels inferred in this way should enable the identification of disease-associated variants that occur at substantially lower frequencies than can be identified by GWAS alone (the consortium identified variants that are present in as little as about 0.5% of the population with European ancestry, whereas these rare variants are found only sparsely in direct genotyping studies). The use of such panels will be particularly effective when combined with other whole-genome-sequencing data sets, such as that of the UK10K Consortium4, which is studying the genetic code of 10,000 people in fine detail.
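The idea of inferring unmeasured variants from measured SNPs can be illustrated with a toy example. The following Python sketch is not the consortium's method: real imputation tools rely on haplotype-based statistical models, whereas this sketch simply records, for each SNP genotype, the indel genotype most often seen alongside it in a fully sequenced reference panel. All genotype values and function names are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors.

    Genotypes are coded as counts of the non-reference allele (0, 1 or 2).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a variant with no variation carries no information
    return cov * cov / (var_x * var_y)


def build_lookup(panel_snp, panel_indel):
    """For each SNP genotype, find the indel genotype most often seen with it."""
    counts = defaultdict(Counter)
    for snp, indel in zip(panel_snp, panel_indel):
        counts[snp][indel] += 1
    return {snp: c.most_common(1)[0][0] for snp, c in counts.items()}


def impute(array_snp, lookup):
    """Infer indel genotypes for people typed only at the SNP (None if unseen)."""
    return [lookup.get(g) for g in array_snp]


# Hypothetical reference panel: eight fully sequenced people.
panel_snp = [0, 0, 1, 1, 2, 2, 0, 1]
panel_indel = [0, 0, 1, 1, 2, 2, 0, 0]

# Hypothetical array cohort: only the SNP was genotyped.
array_snp = [2, 0, 1]

lookup = build_lookup(panel_snp, panel_indel)
ld = r_squared(panel_snp, panel_indel)
inferred_indel = impute(array_snp, lookup)
```

The stronger the correlation between the SNP and the indel in the reference panel (a value of `ld` close to 1), the more reliable such an inference is likely to be in the array cohort.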
Sudmant and colleagues also find that some structural variants occur in genomic regions that have been previously associated with complex traits or disease. The mechanisms by which these variants underpin disease risk remain largely unexplored, but the authors' data set provides the starting point for further mechanistic studies. Thus, the new data are expected to facilitate future exploration of rare structural variants that have previously been largely inaccessible.
Another major advance on previous phases of the project is the broad sampling of populations (Fig. 1). Sequences have been obtained from people in five continental regions (East and South Asia, Europe, Africa and the Americas). As expected, given the sub-Saharan origin of modern humans, the 1000 Genomes Project Consortium finds that most of the world's variation between humans occurs in sub-Saharan populations. The papers' repository of variants thus provides a much richer view than previous, Euro-centric data sets5 as to what constitutes normal variation in humans, and will enable cost-effective genetic studies in sub-Saharan populations. Indeed, such studies are already under way, designed to make use of data from the 1000 Genomes Project. Knowing how genetic variation differs between people from different continents also informs our understanding of both recent human evolution and medicine. The latter is particularly pertinent in cosmopolitan cities, where clinicians are increasingly assessing people from many ethnic backgrounds.
The papers also sequence admixed populations, in which two previously separate populations have become mixed — for example, African American populations, which have African, European and Native American genetic heritage. Sequencing admixed populations is important because, for instance, it can help us to understand genetic variation in populations for whom few genomes have been sequenced, such as Native Americans. There are many admixed populations worldwide, including African American, Afro-Caribbean, Hispanic and North African populations, and the current studies lay the foundation for analysing and using genetic information from these groups.
Consistent with the aims of the 1000 Genomes Project, the current data sets and analyses have been openly released and have already been used in thousands of publications, ensuring that they will have a lasting impact. It is to the credit of the project that the people who donated DNA consented to the full release of their genetic data with the understanding that no other associated information, for example about health problems, would be collected. However, such completely accessible genome data sets are likely to become a minority, because there is a growing shift towards genomic data sets that are clinically annotated and so cannot normally be freely distributed. National laws and ethical standards mean that these data sets sometimes come with complex restrictions, even for use in research.
In this new world of genome sharing, baseline genetic data on human populations will still be needed, and such data will be much more useful when openly released. The International Genome Sample Resource6, which was created earlier this year, provides a coordination centre for open data sets. However, given that controlled-access data are likely to become the norm, strategies that make it easier to use and reuse clinical data sets will be hugely beneficial. The nascent Global Alliance for Genomics and Health7 provides an international framework for discussion about these complex issues.
The future of human population genetics is both rosy, with many more data being produced, and complex, with more-involved ethics and more-strictly controlled access likely to be required. The 1000 Genomes Project has delivered the data and the methodological foundation for this future.