Most common diseases arise from interaction between multiple genetic variations and factors such as diet. Studies of such diseases that exploit the rich data on variation in the human genome are just beginning.
The results of the first genome-wide-association (GWA) surveys of common diseases are trickling out. This trickle will soon be a flood of data, much anticipated but challenging to interpret. These initial studies will calibrate our expectations for future investigations, and help to establish the principles for how they are best reported.
On page 881 of this issue, Sladek et al.1 report the results of such a survey of type 2 diabetes*. It is the largest GWA study so far, and tackles a very common disease that is rising in prevalence throughout the world. More than one in three Americans born in 2000 will develop type 2 diabetes, and its rise is particularly rapid in populations that have recently adopted Western lifestyles; hence the efforts to understand the interplay between genetic and environmental risk factors in generating the high frequency of the disease. Sladek et al. contribute to these efforts. They demonstrate an unequivocal association between type 2 diabetes and a previously identified2 genetic locus (TCF7L2), and substantial, but preliminary, evidence for several new loci.
To evaluate GWA studies, we must revise our notion that a discovery in human genetics consists of identifying 'the gene' for a disease. This notion derives from investigations of rare diseases, which could, in a single study, be associated definitively with mutations in a single gene. Such mutations could be predicted to devastate the function of the proteins encoded by the gene, and thus to cause disease. Yet the 'genetic architecture' of common diseases, such as asthma and depression as well as diabetes, is not built on such obvious deleterious mutations. Rather, it arises from the combined increase in disease risk generated by an unknown number of genetic variants, some of which might not encode proteins, and are thus difficult to identify. In this situation, false positive results are to be expected, and statistically significant results in one study need to be replicated. Even after a gene locus is unequivocally implicated in disease susceptibility, it remains difficult to prove which associated variant is responsible.
For context, it is useful to consider two choices that Sladek et al. made: the strategy for evaluating genome-wide genetic variation, and the number of individuals to genotype. All recent GWA studies assay genome variation using single nucleotide polymorphisms (SNPs); these base substitutions comprise most of the variation in the human genome, and current technology permits economical genotyping of hundreds of thousands of SNPs in a single experiment. Although some studies have successfully used SNP sets consisting only of variants within genes3,4, GWA studies aspire to survey all variation in the genome, including non-coding regions. Exhaustive surveys are not yet feasible, but genotyping a subset of the known variable sites may suffice.
Locations in the genome that are separated by a small number of base pairs are often in 'linkage disequilibrium': that is, there is a substantial association between the variants at the two loci. This phenomenon allows us to survey variation across the genome fairly precisely, simply by genotyping a subset of polymorphic loci. GWA studies rely on the assumption that linkage disequilibrium enables one SNP to act as a marker for association to other sequence variants in that region. The GWA studies carried out thus far differ in terms of the number and criteria for selection of the genotyped SNPs: some use SNPs chosen to be evenly physically spaced5,6,7, whereas others8 choose SNPs to maximize the detection of linkage disequilibrium, based on data from the International HapMap Project9. Sladek et al. used a marker set based on HapMap linkage-disequilibrium data, supplemented by a gene-centric SNP set; their combined marker set (about 400,000 SNPs) provides the highest-resolution survey of genomic variation of any GWA study so far.
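As a rough illustration of what an association between the variants at two loci means, the sketch below computes two standard linkage-disequilibrium measures, D and r², from counts of the four possible two-SNP haplotypes. The function name and the example counts are hypothetical; they are not taken from Sladek et al. or from the HapMap data.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) for two biallelic loci from haplotype counts.

    n_AB is the number of haplotypes carrying allele A at the first
    locus and allele B at the second, and so on for the other three.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A
    p_B = (n_AB + n_aB) / total    # frequency of allele B

    # D is the excess of the observed A-B haplotype frequency over the
    # frequency expected if the two loci were independent.
    D = p_AB - p_A * p_B

    # r2 rescales D squared by the allele-frequency variances, so it
    # lies between 0 (independent) and 1 (perfectly correlated).
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else float("nan")
    return D, r2


# Hypothetical counts: the two SNPs always travel together.
print(ld_stats(50, 0, 0, 50))
# Hypothetical counts: the two SNPs are statistically independent.
print(ld_stats(25, 25, 25, 25))
```

When r² between a genotyped SNP and an ungenotyped variant is close to 1, the genotyped SNP carries nearly all the information about the other variant, which is the sense in which one SNP can act as a marker for its neighbours.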
In deciding how many individuals to genotype, recent studies follow either a one-stage8 or a two-stage5,6,7 design: in the first case, statistical significance is sought within the genotyped sample (and other samples may be used for replication); in the second, SNPs passing a loose significance threshold in the GWA study are genotyped in a follow-up sample, in which significance is then sought. The choice of study design usually rests on estimates of the power to detect effects of a magnitude considered realistic for the trait in question.
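To make the idea of a power estimate concrete, here is a minimal sketch that approximates the power of a simple case-control comparison of allele frequencies at a single SNP, using a normal approximation to a two-proportion test. This is one textbook approximation, not the calculation used in any of the studies cited; the function names, allele frequency, odds ratio and sample sizes are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def approx_power(n_cases, n_controls, control_freq, odds_ratio, alpha):
    """Approximate power of a two-sided allelic test at level alpha.

    Each individual contributes two alleles, so the effective sample
    sizes are 2 * n_cases and 2 * n_controls. The rejection region in
    the wrong direction is ignored, which is a common simplification.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    se = sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)


# Hypothetical scenario: risk-allele frequency 0.3 in controls,
# allelic odds ratio 1.3, and a stringent per-SNP threshold.
for n in (700, 2750):
    print(n, approx_power(n, n, 0.3, 1.3, 1e-6))
```

Under these assumed inputs, power rises steeply with the number of individuals genotyped, which is why the expected effect size weighs so heavily on the choice between designs.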
Sladek et al.1 adopted a two-stage design: in the first stage, they conducted the GWA study in a total sample of about 1,400 cases and controls (the largest sample in a GWA study so far). In the second stage, they genotyped the SNPs showing evidence of association in the GWA study in a new sample (about 5,500 total cases and controls) to test for significant associations. The results now published refer only to the follow-up of the most promising SNPs from the first stage: a more comprehensive second stage is still under way.
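The arithmetic behind a two-stage design can be sketched as follows. The marker count of about 400,000 comes from the description of the combined SNP set above; the first-stage threshold and the overall error rate are hypothetical values picked for illustration, and the Bonferroni correction is used here only as a simple, conservative stand-in for whatever significance criteria a particular study adopts.

```python
def expected_null_hits(n_snps, stage1_alpha):
    """Expected number of unassociated SNPs passing stage 1 by chance."""
    return n_snps * stage1_alpha


def bonferroni_threshold(family_alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate
    at or below family_alpha across n_tests tests."""
    return family_alpha / n_tests


n_snps = 400_000       # approximate size of the combined marker set
stage1_alpha = 1e-4    # hypothetical loose first-stage threshold
family_alpha = 0.05    # hypothetical overall error rate

# Chance findings expected to be carried into the follow-up sample.
carried = expected_null_hits(n_snps, stage1_alpha)
print("expected chance hits after stage 1:", carried)

# Threshold if significance were sought in a single stage over all SNPs.
print("one-stage threshold:", bonferroni_threshold(family_alpha, n_snps))

# Threshold in the follow-up if only about that many SNPs are retested.
print("stage-2 threshold:", bonferroni_threshold(family_alpha, round(carried)))
```

The point of the sketch is that a loose first-stage threshold deliberately lets through a manageable number of chance findings, which the larger follow-up sample is then expected to weed out.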
Sladek et al. identify strong associations for three novel loci (one detected in the linkage-disequilibrium-based marker set and two detected in both marker sets). Although more loci may emerge from the complete two-stage analysis, publication of these initial results provides the opportunity for swift replication (or not) by other research groups, using independent samples, as exemplified by the case of TCF7L2. Its association with type 2 diabetes was reported last year2, and has already been replicated in at least 20 independent studies. In one- or two-stage studies, care must be taken that the specific definition of disease adopted in the follow-up (or replication) samples is comparable to the definition used in the original GWA samples. In this respect it is noteworthy, but of unclear significance, that Sladek et al. used more stringent inclusion criteria in the GWA sample than in the follow-up sample.
The identification of a few significant disease associations represents only one outcome of Sladek and colleagues' study. GWA studies should be evaluated primarily from an epidemiological standpoint, focused not just on what new disease-susceptibility genes they propose, but on how they advance our understanding of the composition of genetic risk in the population. Sladek et al. take a first step towards such understanding, presenting an evaluation of what proportion of the disease cases can be attributed to variation in the loci they identify as significant in their second-stage analysis. As several additional GWA studies of type 2 diabetes will shortly report their results, we may soon be able to estimate the number — and location in the genome — of the genetic variants that are the main contributors to diabetes susceptibility, at least in some populations.
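One common way to express the proportion of disease cases attributable to a risk factor is the population attributable fraction. The sketch below uses Levin's formula, which takes the population frequency of the risk genotype and its relative risk; it is offered as an assumption about how such a quantity can be computed, not as the method Sladek et al. used, and the input values are hypothetical.

```python
def attributable_fraction(exposure_freq, relative_risk):
    """Levin's population attributable fraction.

    exposure_freq: proportion of the population carrying the risk genotype.
    relative_risk: disease risk in carriers divided by risk in non-carriers.
    """
    excess = exposure_freq * (relative_risk - 1)
    return excess / (1 + excess)


# Hypothetical variant: carried by 30% of the population, relative risk 1.5.
print(attributable_fraction(0.3, 1.5))
# A variant with no effect on risk accounts for no cases.
print(attributable_fraction(0.3, 1.0))
```

Because the fraction depends on frequency as well as effect size, a common variant with a modest relative risk can account for more cases in the population than a rare variant with a large one.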
The results of the first GWA studies may also reveal the degree of genome coverage provided by the chosen SNP panels. Reassuringly, both of the linkage-disequilibrium-based studies reported so far1,8 were able to replicate one known locus. On the other hand, the most significant new locus in both cases was identified by only a single SNP, suggesting that even the dense marker sets employed in these studies provide insufficient coverage for detection of all important loci10. In certain populations that are of recent origin and that have remained isolated, linkage disequilibrium is more extensive than in the populations used in the GWA studies so far11; the first GWA surveys in such groups will be watched closely for evidence that they permit more complete coverage using comparable marker sets. Similarly, the results of GWA studies, now under way, that are using even larger numbers of SNPs, are much anticipated.
A final observation is that for both of the linkage-disequilibrium-based studies reported so far (for type 2 diabetes1 and inflammatory bowel disease8), the most significant novel associations were to a variant predicted to alter the protein product encoded by a gene (termed a non-synonymous coding SNP), and thus possibly to have a strong functional effect. Furthermore, in both cases the disease is associated with the more common variant at these loci, suggesting that the less common variant may offer protection against developing the disease. That diseases are associated with such common non-synonymous SNPs suggests that these variants may have offered an evolutionary advantage in previous environments. Clearly, one should not generalize from a sample size of two. Nevertheless, these findings underscore how GWA studies may not only deliver 'new' genes10, but permit advances in our understanding of how human evolution has 'made' the diseases that are common today.
*This article and the paper concerned1 were published online on 11 February 2007.