With the recent deluge of mega-biobank data, it is time to revisit what constitutes “replication” for genome-wide association studies. Many replication samples are unavailable or underpowered; therefore, alternatives beyond strict statistical replication are needed until the required resources become available.
Since the first published genome-wide association study (GWAS) in 2005¹, a guiding principle in research conduct and interpretation has been that the strength and generalizability of GWAS findings rely upon reproducibility, grounded in strong independent statistical replication. This principle was highlighted in a seminal 2007 paper by Chanock et al.² (NCI-NHGRI Working Group on Replication in Association Studies) regarding reproducibility for genotype–phenotype associations. The first GWAS in 2005 contained 96 cases and 50 controls¹, and the Chanock et al. article was published in the same Nature issue as the Wellcome Trust Case Control Consortium’s landmark study containing 14,000 cases with 3,000 common controls³, the largest GWAS at the time. By contrast, this year we have seen the first published GWAS of >1 million participants⁴ as data from several large mega-biobanks become available. While several recommendations from Chanock et al. continue to hold true, four specific points merit further consideration in the current era. These points focus on (1) replication sample size, (2) access to independent datasets for replication, (3) use of similar populations for replication, and (4) the rationale for selecting replication SNPs (see Box 1). It is timely to revisit this subject in the context of the vast advances in the last 11 years, focusing on the unique challenges for replication that large mega-biobanks present due to their size, phenotype-specificity, and population diversity. In this context, we define a mega-biobank as a study with phenotype and genotype data on >100,000 individuals, and we use the term to refer to the study rather than to the physical sample repository. As researchers strive to achieve the largest sample sizes possible and investigate new unique phenotypes, this Comment aims to revisit the basis for strict statistical replication as a mandatory requirement for publications with discovery sample sizes in the hundreds of thousands.
Two recent publications in Nature Communications provide insights into a few of these issues. Verweij et al. and Ramirez et al. both report genetic variants associated with measures of heart rate response and recovery after exercise⁵,⁶ based on GWAS using UK Biobank data. Verweij et al. used the full dataset for discovery and did not provide replication. Ramirez et al. divided the sample into a discovery and a replication set, but additionally analyzed all individuals together. A comparison of methodologies is reported in Table 1 and a comparison of locus discovery in Fig. 1. A direct comparison of results is difficult because the sample sizes differ as a result of differences in data cleaning techniques, regression models, and methodology, but overall the findings presented in the two manuscripts overlapped substantially. Here I consider these publications in the context of the four points mentioned above:
Replication sample size: The UK Biobank is the largest exercise ECG dataset with genetic data in the world, and as such there is no reasonably sized external replication cohort available. This will continue to be a problem for specialized or difficult-to-measure traits, which may be available in very few individual studies. In addition, any attempt at replication would necessarily involve meta-analysis of numerous much smaller studies and would therefore have decreased power.
Independent datasets for replication: While Ramirez et al. split the data into discovery and replication sets, only half of the loci that achieved genome-wide significance in the full combined dataset (discovery + replication) reached genome-wide significance in the discovery set, and many of the others did not surpass the modest cut-off of p < 1 × 10⁻⁶ to advance to replication. These include loci (such as ACHE and CHRM2) that were also deemed significant in Verweij et al. and had previously been associated with resting heart rate in an independent dataset⁷. While other factors may have contributed to the attenuation of significance in the discovery set, such as the use of a model adjusting for resting heart rate, these signals were present in the full dataset. Many of the loci found only in Verweij et al. were associated with heart rate recovery at earlier time points than those explored by Ramirez et al., which may explain the lack of significant association in the latter.
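The loss of power from splitting a single cohort can be illustrated with a rough calculation. The sketch below is not taken from either study: it assumes a quantitative trait, a single SNP explaining a fraction `r2` of trait variance, and a normal approximation to the association test statistic. The function name `gwas_power`, the sample size, the effect size, and the 50/50 split are all hypothetical values chosen for illustration only.

```python
from statistics import NormalDist


def gwas_power(n, r2, alpha):
    """Approximate power of a two-sided z-test for one SNP.

    n     : number of individuals (hypothetical)
    r2    : fraction of trait variance explained by the SNP (hypothetical)
    alpha : significance threshold (e.g. 5e-8 for genome-wide significance)
    """
    norm = NormalDist()
    # Expected z-score under the alternative (normal approximation).
    expected_z = (n * r2 / (1.0 - r2)) ** 0.5
    # Two-sided critical value for the chosen threshold.
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the observed z-score falls beyond either critical value.
    return norm.cdf(expected_z - z_crit) + norm.cdf(-expected_z - z_crit)


if __name__ == "__main__":
    n_full = 60_000   # assumed total sample size, illustration only
    r2 = 0.0005       # assumed variance explained by the SNP

    print("full sample, p < 5e-8:", round(gwas_power(n_full, r2, 5e-8), 3))
    print("half sample, p < 5e-8:", round(gwas_power(n_full // 2, r2, 5e-8), 3))
    print("half sample, p < 1e-6:", round(gwas_power(n_full // 2, r2, 1e-6), 3))
```

Under these assumptions, a variant with roughly even odds of detection in the full sample is much less likely to reach genome-wide significance in a discovery half, and relaxing the discovery threshold to p < 1 × 10⁻⁶ recovers only part of that loss. The exact figures depend entirely on the assumed effect size and split.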
Similar population for replication: Despite the fact that most genome-wide studies have been conducted in populations of predominantly European ancestry (like the UK Biobank population), the unique exercise test phenotype used by these publications has not been widely collected in other genomic studies. This further illustrates that research on “boutique” phenotypes will continue to be problematic, although some may soon be available for extraction from electronic health record data in ongoing mega-biobank studies like the US Department of Veterans Affairs Million Veteran Program and the All of Us Research Program. This issue is compounded in studies of populations of non-European ancestry, as there are currently few options for replication of common phenotypes, let alone rarer ones. While many new initiatives, including the All of Us Research Program, are underway to recruit populations that are underrepresented in biomedical research, there will be a continued GWAS publication bias due to the lack of available replication data until these new efforts are established. This bias will result from (a) lack of publication, or publication in lower-tier journals, since replication is often required for publication, or (b) a perceived lack of scientific rigor of these studies, since replication via GWAS has become the gold standard in the field.
Rationale for selecting replication SNPs: The authors of these studies were resourceful in using available databases to further investigate regions of interest, since direct GWAS replication was not available. Both studies performed conditional analyses to determine independent common variants to take forward for investigation, and both sought evidence of association of these SNPs with correlated traits, as well as with a broad spectrum of disease outcomes. Additionally, both studies sought further supporting evidence for possible biological mechanisms by using publicly available databases to assess functional annotation, eQTL colocalization, or overlap with sites of chromatin interaction or accessibility for SNPs of interest, as well as by performing pathway analysis. While each of these methods has its limitations, these orthogonal biological lines of evidence to explore the likelihood of association should be considered in the same vein as statistical replication.
In summary, the Ramirez et al. and Verweij et al. studies, while using the same dataset, provide different insights into the genetics governing heart rate response to, and recovery after, exercise. Because their phenotype definitions and modeling differ, they answer different questions. Ramirez et al. accounted for resting heart rate and therefore may find signals that are more specific to exercise in general, whereas the multiple time points investigated by Verweij et al. provide insight into which genes may be important at different stages of recovery post-exercise. In addition to addressing questions about replication strategies in mega-biobanks, these studies also give insight into the opportunities that arise when multiple researchers tackle similar questions in publicly available data, since each team will have its own approach to data cleaning, analysis, and interpretation, and these approaches can be complementary.
Ultimately, GWAS findings are hypothesis-generating, providing strong evidence for statistical correlation but not causation; therefore, functional and interventional studies in animal models and humans will always be required to determine biological mechanisms. With the sample sizes generated by these large mega-biobanks, in combination with the rapid development of large publicly available functional data, for common variants we may have moved beyond the era where strict statistical replication via GWAS is always required for publication, and additional sources of information may be taken into account when prioritizing loci for further study. This is not to say that replication should not be sought; however, while we await evidence from appropriately powered, diverse cohorts, this may be an interim silver-standard solution. Rare variants present their own challenges for replication and should be treated with greater caution so that we do not revert to the many false-positive associations reported during the “candidate gene” era that sparked the Chanock et al. paper.
In addition to a call for larger study populations focused on traditionally underrepresented populations, I would also advocate for greater integration of the excellent functional databases and tools, as well as further collaboration and crosstalk between statistical/population geneticists and molecular biologists to dig further into underlying biological mechanisms. In the 11 years since the Chanock et al. paper, there have been striking advances not only in the population genomic data available, but also in the sensitivity and specificity of wet-lab techniques to investigate specific variants, genes, and tissues, complemented by an explosion in the catalog of available functional databases. With the integration of these amazing resources into our research pipeline, who knows what discoveries the next decade will bring.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.