A paper that analysed genetic variants in 14,000 people to identify disease-associated regions set the standard for collaborative genome-wide association studies and provided methodological advances whose effects are still felt today.
Ten years ago this month, Nature published a landmark study1 that compared the frequencies of hundreds of thousands of common genetic variants (polymorphisms) at single nucleotides in people with and without seven diseases, to look for variants associated with each disease. Such genome-wide association studies (GWAS) provide an agnostic way to identify these variants, unfettered by prevailing — and potentially incorrect — assumptions about which genomic regions are important in disease biology. The study, by the Wellcome Trust Case Control Consortium (WTCCC), set the standard for this field of research, and nearly 3,000 GWAS have since been published.
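The core comparison behind a GWAS can be illustrated for a single variant. The sketch below is a minimal illustration, not the WTCCC's own analysis pipeline: it assumes a simple Pearson chi-square test on a 2×2 table of allele counts, and the function name and the counts in the usage example are hypothetical. A real study repeats a test of this kind across hundreds of thousands of variants and corrects for the resulting multiple comparisons.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Hypothetical helper for illustration only. Returns the chi-square
    statistic, its p-value and the allelic odds ratio. Assumes every
    count is greater than zero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele frequency were equal in both groups.
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-square distribution with 1 d.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio


# Made-up counts: 2,000 cases and 3,000 controls, two alleles per person,
# with the variant allele at 35% frequency in cases and 30% in controls.
chi2, p_value, odds_ratio = allelic_association(1400, 2600, 1800, 4200)
print(f"chi2={chi2:.2f}, p={p_value:.2e}, OR={odds_ratio:.2f}")
```

The group sizes in the usage example mirror the 2,000 cases per disease and 3,000 shared controls described in the study; the allele frequencies are invented for illustration.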
Before the advent of GWAS, few genetic regions associated with disease had been reliably identified, and some researchers despaired of ever finding reproducible associations for most heritable diseases2. GWAS burst onto the scene in 2005, with the demonstration of a surprising association between the complement factor H gene, which was known for its role in immune regulation, and age-related macular degeneration, a leading cause of blindness3. Since then, GWAS have provided many more unexpected insights. One of the great early surprises of GWAS findings, for example, was that less than 10% of disease associations lie in protein-coding regions of the genome4. Another surprise has been the identification of specific regions associated with multiple, seemingly disparate diseases, such as polymorphisms in the gene CDKN2A/B, which are associated with coronary heart disease, type 2 diabetes and melanoma (the most serious form of skin cancer)5.
What made the WTCCC paper special was its large sample size and its pursuit of seven very different diseases — 2,000 cases each of bipolar disorder, coronary heart disease, Crohn's disease, high blood pressure, rheumatoid arthritis and diabetes types 1 and 2, compared with a shared set of 3,000 controls. In addition, the project involved more than 50 research groups across the United Kingdom. Persuading these groups to work collaboratively, hold their individual publications until the group-wide paper was published, and share their data openly with the scientific community was a masterwork of diplomacy, for which the study's organizers richly deserve commendation and gratitude.
By simultaneously studying diseases with differing aetiologies and genetic contributions, the consortium hoped to gain insight into not only the specific genetic architecture of each disease (the number of contributing genes and the sizes of their effects), but also differences between them. The researchers also aimed to address methodological issues to improve the reproducibility of genetic-association studies. The WTCCC achieved these aims and more, and the work's immediate impact was recognized by the consortium being chosen as Scientific American's research leader of the year6 and the article being lauded as The Lancet's paper of the year7.
The study revealed 24 statistically significant associations between diseases and specific single nucleotide polymorphisms (SNPs). In addition, it identified a host of other signals at lesser significance levels that were subsequently shown to harbour reproducible associations in larger studies. The only disease for which no associations were found was high blood pressure — but this was later explained by the discovery that the genetic architecture of this disease differs from that of the other six diseases analysed, involving many variants that each have a small effect. Such variants are detectable in much larger GWAS, and more than 100 regions associated with high blood pressure have since been identified8.
In terms of methodology, the WTCCC made valuable advances in genotype calling — a method used by researchers to discern which genetic variants each individual has at a particular site (their genotype)9. The authors also developed and disseminated methods for imputing non-genotyped variants, which lie between the SNPs assayed in a given study. By developing new algorithms and methods, they improved researchers' ability to reliably identify genotypes, reduce calling errors and infer with high probability SNPs that had not been assayed, and to combine data sets gained from GWAS that analysed different sets of variants9, increasing the power to detect rare disease-associated variants.
The consortium also demonstrated that using a common set of controls across multiple studies is a robust and efficient approach, and one that the team's members expanded further, using individuals studied for one disease as controls for another. The study revealed a previously unsuspected degree of geographic differentiation across the United Kingdom for 13 SNPs (for example, there was a north-to-south difference in the frequency of a variant in the gene TLR1 that might have a role in leprosy and tuberculosis). Finally, their work demonstrated empirically the power of increasing sample sizes to detect a greater number of disease-associated SNPs, and served as a cogent reminder that even more associations could be expected if studies were performed using samples that were larger still.
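The link between sample size and the number of detectable associations can be made concrete with a rough power calculation. The sketch below is a minimal illustration under stated assumptions, not a method taken from the WTCCC paper: it uses a normal approximation to a two-sided test of a difference in allele frequency, and the function name, the allele frequencies and the significance threshold in the usage example are all hypothetical.

```python
from statistics import NormalDist


def approx_power(freq_controls, freq_cases, n_cases, n_controls, alpha):
    """Approximate power of a two-sided two-proportion z-test on allele frequencies.

    Hypothetical helper for illustration only. Uses the unpooled standard
    error and assumes both frequencies lie strictly between 0 and 1.
    """
    std_normal = NormalDist()
    # Each diploid individual contributes two alleles.
    alleles_cases = 2 * n_cases
    alleles_controls = 2 * n_controls
    se = (
        freq_cases * (1 - freq_cases) / alleles_cases
        + freq_controls * (1 - freq_controls) / alleles_controls
    ) ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(freq_cases - freq_controls) / se
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Made-up effect: variant allele at 30% in controls and 33% in cases,
# tested at an illustrative threshold of alpha = 5e-7.
for scale in (1, 2, 5, 10):
    power = approx_power(0.30, 0.33, 2000 * scale, 3000 * scale, alpha=5e-7)
    print(f"{2000 * scale:>6} cases, {3000 * scale:>6} controls: power = {power:.3f}")
```

Under these assumptions, a modest frequency difference that is unlikely to be detected with 2,000 cases and 3,000 controls becomes increasingly likely to be detected as both groups grow, which is the qualitative pattern the consortium demonstrated empirically.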
And the larger sample sizes came! Inspired by the WTCCC, international consortia rapidly formed to pool data. Sample sizes well into the tens of thousands became routine (Fig. 1), and at least 30 GWAS exceeding 100,000 individuals are now available online (www.ebi.ac.uk/gwas). The first study involving roughly 500,000 participants will soon be released (www.ukbiobank.ac.uk).
The WTCCC also helped to propel a revolution in data distribution. The study was one of the first GWAS to provide information about each participant's genotype and associated traits for use by the scientific community. Although access to these data was subsequently controlled to ensure participant confidentiality10, the tradition of open data-sharing and collaboration pioneered by the WTCCC has continued.
Where do we go from here? The flood of GWAS continues unabated, despite predictions that it was a transitional technology that would soon be supplanted by techniques to sequence either entire genomes or all protein-coding regions. Such sequencing studies have certainly identified many rare variants and polymorphisms involving more than a single nucleotide (such as inserted or deleted sections of chromosomes), which GWAS have difficulty detecting. But the low cost and straightforward analytics of GWAS seem likely to ensure its longevity.
A crucial gap in the GWAS spectrum remains to be filled, because ancestrally diverse, non-European populations have been appallingly under-studied11. Notable GWAS in these populations include studies of cardiac conduction in African Americans12 and sleep apnoea in Hispanic and Latino Americans13. One of the next steps will be to identify associations in under-studied populations such as those in Africa and Latin America, and in isolated and indigenous peoples such as those in the Arctic, Pacific islands and Americas. Another outstanding opportunity lies in studies of adverse reactions to drug or other treatments, in which effect sizes are often large and may be directly relevant to clinical care14.
Despite the thousands of studies and millions of genomes examined, associations identified by GWAS still explain only a small fraction of the heritability of complex diseases, and the overwhelming majority fall in regions of the genome that have no known function4. These gaps in knowledge are major challenges and must be overcome if we are to develop effective treatments and improve clinical care15. Ten years on, we are clearly on the right path, as set for us by the WTCCC. But, as with everything in science, the more we know, the more we have to learn.
1. The Wellcome Trust Case Control Consortium. Nature 447, 661–678 (2007).
2. Hirschhorn, J. N., Lohmueller, K., Byrne, E. & Hirschhorn, K. Genet. Med. 4, 45–61 (2002).
3. Klein, R. J. et al. Science 308, 385–389 (2005).
4. Hindorff, L. A. et al. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).
5. Manolio, T. A., Brooks, L. D. & Collins, F. S. J. Clin. Invest. 118, 1590–1605 (2008).
6. Mossman, K. Sci. Am. 298, 42 (2008).
7. Summerskill, W. Lancet 371, 370–371 (2008).
8. Warren, H. R. et al. Nature Genet. 49, 403–415 (2017).
9. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. Nature Genet. 39, 906–913 (2007).
10. Homer, N. et al. PLoS Genet. 4, e1000167 (2008).
11. Popejoy, A. B. & Fullerton, S. M. Nature 538, 161–164 (2016).
12. Evans, D. S. et al. Hum. Mol. Genet. 25, 4350–4368 (2016).
13. Cade, B. E. et al. Am. J. Respir. Crit. Care Med. 194, 886–897 (2016).
14. Chan, S. L., Jin, S., Loh, M. & Brunham, L. R. Pharmacogenomics 16, 1161–1178 (2015).
15. Price, A. L., Spencer, C. C. A. & Donnelly, P. Proc. R. Soc. B 282, 20151684 (2015).