Examining the current standards for genetic discovery and replication in the era of mega-biobanks

Huffman, J. E.

doi:10.1038/s41467-018-07348-x

Comment
Open access
Published: 29 November 2018

J. E. Huffman ORCID: orcid.org/0000-0002-9672-2491¹

Nature Communications volume 9, Article number: 5054 (2018)

Abstract

With the recent deluge of mega-biobank data, it is time to revisit what constitutes “replication” for genome-wide association studies. Because many replication samples are unavailable or underpowered, alternatives beyond strict statistical replication are needed until the required resources become available.

Introduction

Since the first published genome-wide association study (GWAS) in 2005¹, a guiding principle in research conduct and interpretation has been that the strength and generalizability of GWAS findings rely upon reproducibility, grounded in strong independent statistical replication. This principle was highlighted in a seminal 2007 paper by Chanock et al.² (NCI-NHGRI Working Group on Replication in Association Studies) regarding reproducibility for genotype–phenotype associations. The first GWAS in 2005 contained 96 cases and 50 controls¹, and the Chanock et al. article was published in the same Nature issue as the Wellcome Trust Case Control Consortium’s landmark study containing 14,000 cases with 3,000 common controls³, the largest GWAS at the time. By contrast, this year we have seen the first published GWAS of >1 million participants⁴ as data from several large mega-biobanks become available. While several recommendations from Chanock et al. continue to hold true, four specific points merit further consideration in the current era. These points focus on (1) replication sample size, (2) access to independent datasets for replication, (3) use of similar populations for replication, and (4) the rationale for selecting replication SNPs (see Box 1). It is timely to revisit this subject in the context of the vast advances in the last 11 years, focusing on the unique challenges for replication that large mega-biobanks present due to their size, phenotype-specificity, and population diversity. In this context, a mega-biobank is defined as a study with phenotype and genotype data on >100,000 individuals, and the term refers to the study rather than to the physical sample repository. As researchers strive to achieve the largest sample sizes possible and investigate new unique phenotypes, this Comment aims to revisit the basis for strict statistical replication as a mandatory requirement for publications with discovery sample sizes in the hundreds of thousands.

Two recent publications in Nature Communications provide insights into a few of these issues. Verweij et al. and Ramirez et al. both report genetic variants associated with measures of heart rate response and recovery after exercise⁵,⁶ based on GWAS using UK Biobank data. Verweij et al. used the full dataset for discovery and did not provide replication. Ramirez et al. divided the sample into a discovery set and a replication set, but additionally analyzed all individuals together. A comparison of methodologies is reported in Table 1 and a comparison of locus discovery in Fig. 1. A direct comparison of results is difficult because the sample sizes differ as a result of differences in data cleaning techniques, regression models, and methodology, but overall the findings presented in the two manuscripts overlapped substantially. I here consider these publications in the context of the four points mentioned above:

(1) Replication sample size: The UK Biobank exercise ECG data are the largest such dataset with genetic data in the world, and as such there is no reasonably sized external replication cohort available. This will continue to be a problem with specialized or difficult-to-measure traits, which may be available in very few individual studies. In addition, any attempts at replication would necessarily involve meta-analysis of numerous much smaller studies and would therefore have decreased power.
(2) Independent datasets for replication: While Ramirez et al. split the data into discovery and replication sets, only half of the loci that achieved genome-wide significance in the full combined dataset (discovery + replication) reached genome-wide significance in the discovery set, and many of these did not surpass the modest cut-off of p < 1 × 10⁻⁶ to advance to replication. These include loci (such as ACHE and CHRM2) which were also deemed significant in Verweij et al. and had previously been associated with resting heart rate in an independent dataset⁷. While other factors may have contributed to the attenuation of significance in the discovery set, such as the use of a model adjusting for resting heart rate, these signals were present in the full dataset. Many of the loci found only in Verweij et al. were associated with heart rate recovery at earlier time points than those explored by Ramirez et al., which may explain the lack of significant association in the latter.
(3) Similar population for replication: Although most genome-wide studies have been conducted in populations of predominantly European ancestry (like the UK Biobank population), the unique exercise test phenotype used by these publications has not been widely collected in other genomic studies. This further illustrates that research on “boutique” phenotypes will continue to be problematic to replicate, although some such phenotypes may soon be available for extraction from electronic health record data in ongoing mega-biobank studies like the US Department of Veterans Affairs Million Veteran Program and the All of Us Research Program. This issue is compounded in studies of populations of non-European ancestry, as there are currently few options for replication of common phenotypes, let alone rarer ones. While many new initiatives, including the All of Us Research Program, are underway to recruit populations that are underrepresented in biomedical research, there will be a continued GWAS publication bias due to the lack of available replication data until these new efforts are established. This bias will result from (a) lack of publication, or publication in lower tier journals, since replication is often required for publication, or (b) a perceived lack of scientific rigor of these studies, since replication via GWAS has become the gold standard in the field.
(4) Rationale for selecting replication SNPs: The authors of these studies were resourceful in using available databases to further investigate regions of interest, since direct GWAS replication was not available. Both studies performed conditional analyses to determine independent common variants to take forward for investigation, and both sought evidence of association of these SNPs with correlated traits, as well as with a broad spectrum of disease outcomes. Additionally, both studies sought further supporting evidence for possible biological mechanisms by using publicly available databases to assess functional annotation, eQTL colocalization, or overlap with sites of chromatin interaction or accessibility for SNPs of interest, as well as by performing pathway analysis. While each of these methods has its limitations, these orthogonal biological lines of evidence to explore the likelihood of association should be considered in the same vein as statistical replication.
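
The cost of splitting one sample, raised in point (2), can be illustrated with a rough power calculation. The sketch below is not taken from either study. It assumes a quantitative trait, a normal approximation to the 1-df association test, a hypothetical cohort of 60,000 individuals, and a hypothetical variant explaining 0.05% of trait variance. Only the p < 1 × 10⁻⁶ advancement cut-off and the p < 5 × 10⁻⁸ genome-wide threshold come from the text; the nominal 0.05 replication threshold is an assumption.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gwas_power(n, variance_explained, alpha):
    """Approximate power of a two-sided 1-df association test.

    Assumption: the z statistic is roughly Normal(sqrt(n * q2), 1), where q2
    is the (small) fraction of trait variance explained by the variant.
    """
    shift = (n * variance_explained) ** 0.5
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical inputs, chosen only for illustration.
N_TOTAL = 60_000   # whole cohort
Q2 = 0.0005        # variant explains 0.05% of trait variance

# Design A: analyse the whole cohort at genome-wide significance.
power_full = gwas_power(N_TOTAL, Q2, 5e-8)

# Design B: split in half and test the discovery half alone.
power_discovery_gws = gwas_power(N_TOTAL // 2, Q2, 5e-8)
power_discovery_suggestive = gwas_power(N_TOTAL // 2, Q2, 1e-6)

# Two-stage power: pass the suggestive cut-off in discovery, then reach
# nominal significance (assumed alpha = 0.05) in the independent other half.
power_two_stage = power_discovery_suggestive * gwas_power(N_TOTAL // 2, Q2, 0.05)

print(f"full cohort, p < 5e-8:          {power_full:.2f}")
print(f"discovery half, p < 5e-8:       {power_discovery_gws:.2f}")
print(f"discovery half, p < 1e-6:       {power_discovery_suggestive:.2f}")
print(f"two-stage (discovery + replic): {power_two_stage:.2f}")
```

Under these assumptions the full-cohort analysis has markedly higher power than either the discovery half alone or the two-stage design, which is consistent with loci being visible in the combined data but failing to advance from the discovery set.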

Table 1 Methodology comparison between Ramirez et al. and Verweij et al. for genetic analysis of heart rate increase and recovery in response to exercise in UK Biobank.

In summary, the Ramirez et al. and Verweij et al. studies, while using the same dataset, provide different insights into the genetics governing heart rate response to, and recovery after, exercise. Because their phenotype definitions and modeling differ, they answer different questions. Ramirez et al. accounted for resting heart rate and therefore may find signals that are more specific to exercise in general, whereas the multiple time points investigated by Verweij et al. provide insight into which genes may be important at different stages of recovery post-exercise. In addition to addressing questions about replication strategies in mega-biobanks, these studies also give insight into the opportunities created when multiple researchers tackle similar questions in publicly available data, since each team will have its own approach to data cleaning, analysis, and interpretation, and these approaches can be complementary.

Ultimately, GWAS findings are hypothesis-generating, providing strong evidence for statistical correlation but not causation; therefore functional and interventional studies in animal models and humans will always be required to determine biological mechanisms. With the sample sizes generated by these large mega-biobanks, in combination with the rapid development of large publicly available functional data, for common variants we may have moved beyond the era where strict statistical replication via GWAS is always required for publication, and additional sources of information may be taken into account when prioritizing loci for further study. This is not to say that replication should not be sought; however, while we await evidence from appropriately powered, diverse cohorts, this may be an interim silver-standard solution. Rare variants present their own challenges for replication and should be treated with greater caution so that we do not revert to the many false positive associations reported during the “candidate gene” era that sparked the Chanock et al. paper.

In addition to a call for larger study populations focused on traditionally underrepresented populations, I would also advocate for greater integration of the excellent functional databases and tools, as well as further collaboration and crosstalk between statistical/population geneticists and molecular biologists to dig further into underlying biological mechanisms. In the 11 years since the Chanock et al. paper, there have been striking advances not only in the population genomic data available, but also in the sensitivity and specificity of wet-lab techniques to investigate specific variants, genes, and tissues, complemented by an explosion in the catalog of available functional databases. With the integration of these amazing resources into our research pipeline, who knows what discoveries the next decade will bring.

Box 1 Discussion of points to revisit from Chanock et al. in the context of mega-biobanks

(1) “Replication studies should be of sufficient sample size to convincingly distinguish the proposed effect from no effect”. Determination of the proposed effect may become difficult if the discovery population consists of >500,000 individuals, particularly if the variant to be replicated is rare. In addition, achieving a sufficiently large replication sample may require a meta-analysis of many smaller studies, with an accompanying decrease in power due to heterogeneity in sample make-up and phenotyping methods. Finally, since each mega-biobank was designed independently, there are some study phenotypes that are not available in large numbers in other studies.
(2) “Replication should preferably be conducted in independent data sets to avoid the tendency to split one well-powered study into two less conclusive ones”. While large mega-biobanks are well-powered to discover common variant associations even when split into a discovery and replication set, they offer an additional advantage in the power they afford to discover rare variant associations. Such associations may be difficult to discover and replicate using split data sets. Also, although genetic data may be split into discovery and replication sets prior to association analysis, the phenotype and genotype data will have been collected, processed, and quality controlled together; it can therefore be argued that the replication set is not truly independent.
(3) “A similar population should be studied and notable differences between the populations studied in the initial and attempted replication studies should be described”. Recent reports have highlighted the pressing need for genome-wide studies to focus on more diverse participants⁸. Many of the large mega-biobanks are population-specific: for example, UK Biobank⁹ is largely white British (European descent), BioBank Japan¹⁰ contains Japanese individuals, and the Million Veteran Program¹¹ is mainly male and contains, in addition to participants of European descent, large numbers of African Americans and Hispanic Americans. Despite the large sample sizes of mega-biobanks, this heterogeneity in itself can create issues for replication, particularly in studies seeking to replicate findings from similar non-European populations.
(4) “A strong rationale should be provided for selecting SNPs to be replicated from the initial study, including linkage-disequilibrium structure, putative functional data or published literature.” While some recent papers have addressed significance thresholds for use in large updated imputation panels and sequencing projects, it is not immediately clear what threshold should be used for rare variants or for admixed populations, where the linkage-disequilibrium thresholds may be very different from the white, common variant data which we are used to studying. Until now, p < 5 × 10⁻⁸ has been accepted as the genome-wide threshold for significance¹²,¹³. Recently, papers have suggested thresholds from p < 1 × 10⁻⁸ to p < 1 × 10⁻⁹ based on method of genotype ascertainment, genetic ancestry, and variant frequency¹⁴,¹⁵. Neither addressed this question in the context of very large sample sizes, like those observed in large mega-biobanks. Additionally, the impact of each variant is not fully understood, particularly if it has a regulatory effect on the surrounding genic landscape. Even if an association can be assigned to a gene, functional information may not be readily available for all genes or may be incomplete. Therefore, lack of functional information may not be the best criterion for deciding whether to move a variant forward for replication.
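
The thresholds quoted in point (4) can be related to one another with simple Bonferroni arithmetic. The sketch below assumes a family-wise error rate of 0.05 and the common reading of p < 5 × 10⁻⁸ as a correction for roughly one million independent common-variant tests. Both figures are assumptions made for illustration rather than statements from the text, and the function names are hypothetical.

```python
FAMILY_WISE_ALPHA = 0.05  # assumed family-wise error rate


def bonferroni_threshold(n_independent_tests, alpha=FAMILY_WISE_ALPHA):
    """Per-variant p-value threshold for a given number of independent tests."""
    return alpha / n_independent_tests


def implied_independent_tests(p_threshold, alpha=FAMILY_WISE_ALPHA):
    """Number of independent tests that a given threshold corrects for."""
    return alpha / p_threshold


# Assumed ~1 million independent common-variant tests.
conventional = bonferroni_threshold(1_000_000)

# Effective test counts implied by the thresholds mentioned in the text.
implied = {p: implied_independent_tests(p) for p in (5e-8, 1e-8, 1e-9)}

print(f"threshold for 1e6 tests: {conventional:.1e}")
for p, n_tests in implied.items():
    print(f"p < {p:.0e} corrects for about {n_tests:,.0f} independent tests")
```

Read this way, moving from 5 × 10⁻⁸ to 1 × 10⁻⁹ corresponds to a fifty-fold increase in the effective number of independent tests, which is one way to picture the extra burden of rarer variants and populations with shorter-range linkage disequilibrium.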

References

1. Klein, R. J. et al. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389 (2005).
2. Chanock, S. J. et al. Replicating genotype-phenotype associations. Nature 447, 655–660 (2007).
3. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
4. Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat. Genet. 50, 1112–1121 (2018).
5. Verweij, N., van de Vegte, Y. J. & van der Harst, P. Genetic study links components of the autonomous nervous system to heart-rate profile during exercise. Nat. Commun. 9, 898 (2018).
6. Ramirez, J. et al. Thirty loci identified for heart rate response to exercise and recovery implicate autonomic nervous system. Nat. Commun. 9, 1947 (2018).
7. den Hoed, M. et al. Identification of heart rate-associated loci and their effects on cardiac conduction and rhythm disorders. Nat. Genet. 45, 621–631 (2013).
8. Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature 538, 161–164 (2016).
9. Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
10. Nagai, A. et al. Overview of the BioBank Japan Project: study design and profile. J. Epidemiol. 27, S2–S8 (2017).
11. Gaziano, J. M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
12. Dudbridge, F. & Gusnanto, A. Estimation of significance thresholds for genomewide association scans. Genet. Epidemiol. 32, 227–234 (2008).
13. Pe’er, I., Yelensky, R., Altshuler, D. & Daly, M. J. Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genet. Epidemiol. 32, 381–385 (2008).
14. Pulit, S. L., de With, S. A. & de Bakker, P. I. Resetting the bar: statistical significance in whole-genome sequencing-based association studies of global populations. Genet. Epidemiol. 41, 145–151 (2017).
15. Wu, Y., Zheng, Z., Visscher, P. M. & Yang, J. Quantifying the mapping precision of genome-wide association studies using whole-genome sequencing data. Genome Biol. 18, 86 (2017).

Author information

Authors and Affiliations

Center for Population Genomics, MAVERIC, VA Boston Healthcare System, Boston, MA, 02130, USA
J. E. Huffman

Contributions

J.E.H. conceived of, researched, and wrote this piece in consultation with the journal editors.

Corresponding author

Correspondence to J. E. Huffman.

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Huffman, J.E. Examining the current standards for genetic discovery and replication in the era of mega-biobanks. Nat Commun 9, 5054 (2018). https://doi.org/10.1038/s41467-018-07348-x

Received: 25 May 2018
Accepted: 26 October 2018
Published: 29 November 2018
DOI: https://doi.org/10.1038/s41467-018-07348-x
