Introduction

Traditional data sharing processes require researchers to download copies of data to their own systems. More recently, health research is shifting to use Trusted Research Environments (TREs), such as the All of Us Researcher Workbench (AoU RW) and the UK Biobank Research Analysis Platform (UKB RAP), for large-scale clinical and genomic data-sharing and analysis1,2,3,4. In general, a TRE is a secure computing environment which provides approved researchers with tools to access and analyze sensitive health data. TREs offer many benefits, including (1) increased protection of study participant data, (2) decreased barriers to access and analyze data, (3) lower cost of shared data storage, and (4) increased collaboration across the scientific community5,6,7. The positive impact of TREs is clear, as is their potential to facilitate population- and global-scale health research8,9.

For many important reasons, including participant data privacy, trust and security, TREs often implement a variety of policy and technological safeguards. For example, data that reside in an enclave may not be allowed to leave the environment in non-aggregated form10,11. Researchers wishing to safely and appropriately analyze data across different TREs face technological hurdles and policy requirements to do so12. Several approaches to data analysis across enclaves have been proposed. These include meta-analysis, whereby researchers perform analyses in separate TREs and then meta-analyze de-identified results outside of an enclave, and pooled analysis, whereby researchers create and analyze merged data within a single enclave (Fig. 1). Each approach has advantages and limitations. All approaches to cross-analysis benefit from improved harmonization and standardization of data, policies, and working environments8,13. Together with the broader research community, data providers play a critical role in charting approved paths to cross-analysis and disseminating this information broadly. This paper describes approaches to cross-analyze All of Us and UK Biobank data, and discusses the benefits and limitations of each approach with respect to cost, complexity, and scientific utility (Supplementary Fig. 1).

Fig. 1: Outline of steps in the meta- and pooled analyses for All of Us and UK Biobank cross-cohort analysis.
figure 1

Researchers analyzing data across TREs, using either meta-analysis or a pooled approach, must negotiate policy requirements and technical hurdles. Bold outline is used for computational steps where data merging occurs. Top: Computational steps involved in meta-analysis, many of which are duplicated. Bottom: Computational steps involved in pooled analysis, where each distinct step is performed only once. All of Us, the All of Us logo, and “The Future of Health Begins with You” are service marks of the U.S. Department of Health and Human Services.

Specifically, a genome-wide association study (GWAS) was used to explore cross-analysis of UK Biobank and All of Us data, as it is a standard analytical approach that benefits significantly from the boost in power obtained from increased sample size14,15. Additionally, methods for meta-analysis and pooled GWAS are well developed16. Circulating lipid concentrations were chosen as the target phenotype to enable validation of the two approaches by replicating well-established genetic associations. The work presented here is the result of collaboration between the All of Us and UK Biobank programs intended to build and describe research resources rather than discover novel associations.

Results

We performed a genome-wide association study on circulating lipid levels involving All of Us whole genome sequence data and UK Biobank whole exome sequence data twice: (1) by meta-analyzing GWAS results from separate TREs and (2) by analyzing pooled data in a single TRE. The goals, recruitment methods, scientific rationale and genomic data for All of Us and UK Biobank have been described previously1,2. In All of Us, we leveraged 98,622 whole genome sequenced samples alongside 200,643 whole exome sequenced samples from the UK Biobank. Although whole genome sequence data are available for UK Biobank, pooled analysis would require the data to be moved to a common enclave, which is not permitted by its access policy. The 200K exome release from UK Biobank was therefore explicitly chosen for use in this project because it was the last release of individual-level UK Biobank sequence data permitted to be analyzed outside of the UKB RAP, and therefore available for use in both pooled and meta-analyses performed on the AoU RW. Since our project was focused on comparing the computational approaches rather than on discovering new associations, maximal sample sizes were not needed.

The meta-analysis

For the meta-analysis, GWAS of lipid levels were performed separately in the All of Us and UK Biobank TREs (see supplement for further details). Phenotypes were prepared separately. We curated lipid phenotypes (high-density lipoprotein cholesterol: HDL-C, low-density lipoprotein cholesterol: LDL-C, total cholesterol: TC, triglycerides: TG) using the cohort builder tool within the AoU RW. We obtained phenotype information on one or more lipid measurements from electronic health records for 37,754 All of Us participants with available whole genome sequence data. In the UK Biobank, one or more lipid measurements from systematic central laboratory assays were available for 190,982 participants with exome sequence data17. Covariate information (age, sex at birth, self-reported race) and data on lipid-lowering medication for the corresponding samples were extracted from All of Us survey and electronic health record data and UK Biobank self-reported data. The lipid phenotypes were adjusted for statin medication18,19 and normalized (see supplement).

A GWAS was performed in each cohort separately using REGENIE20 on the subset of variants within the UK Biobank exonic capture regions (Fig. 2). In each TRE, we retained variants with allele count (AC) >= 6, since variants with an exceptionally low allele count are not considered by the analysis method, obtaining 1,699,534 biallelic exonic variants from All of Us and 2,158,225 from the UK Biobank. After applying variant quality control to filter out low-quality variants from the subset of samples in the lipids cohort, a single-variant GWAS for the LDL-C phenotype was performed with 789,179 variants from the All of Us cohort and, separately, with 2,037,169 variants from the UK Biobank cohort. Each set of results was then downloaded; before dissemination, results must be filtered to remove variants with AC < 40 in accordance with the All of Us Data and Statistics Dissemination Policy, which disallows disclosure of group counts under 20 because a given individual could carry two copies of a single allele10. All of Us does permit researchers to request an exception to this policy through the program's Resource Access Board, and such an exception was granted for the results in this study. Finally, we meta-analyzed variants by combining the summary statistics from both studies using an inverse variance-weighted fixed effects method implemented in METAL21. A total of 490 variants from 321 loci (r2 = 0.5) were significantly associated (p < 5E-08) with LDL-C (Fig. 3b, Supplementary Data 1).
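The dissemination step can be sketched in a few lines of Python; this is an illustrative toy, and the row structure, `AC` field name, and variant identifiers are hypothetical rather than the actual All of Us export format:

```python
# Hypothetical sketch of the pre-dissemination filter: rows whose allele
# count falls below 40 (20 participants x 2 alleles each) are withheld.
def filter_for_dissemination(rows, min_ac=40):
    """Keep only summary-statistic rows with allele count >= min_ac."""
    return [row for row in rows if row["AC"] >= min_ac]

results = [
    {"variant": "1-55505647-G-T", "AC": 128, "beta": -0.42},
    {"variant": "1-55512222-C-A", "AC": 12, "beta": 0.91},  # withheld
]
releasable = filter_for_dissemination(results)
print([r["variant"] for r in releasable])  # → ['1-55505647-G-T']
```

In practice, a filter of this kind would be applied to the summary statistics before download from the enclave, unless an exception has been granted.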

Fig. 2: Flow diagram highlighting the number of variants and sequenced samples retained at each stage of the meta- and pooled analyses.
figure 2

Whole Genome Sequencing, WGS. Whole Exome Sequencing, WES. Minor Allele Count, MAC.

Fig. 3: GWAS phenotype and results.
figure 3

a Participant LDL-C levels for each cohort, before (left) and after (right) adjusting for statin use. The black center line denotes the median value (50th percentile), while the boxes contain the 25th to 75th percentiles of data. The black whiskers mark the 5th and 95th percentiles, and values beyond these upper and lower bounds are considered outliers, marked with black dots. Note that a few very high outliers were filtered to improve readability of the plot. b Meta analysis results for LDL-C GWAS on merged exonic variants. c Pooled results for LDL-C GWAS on merged exonic variants. Both replicate known gene associations.

The pooled analysis

For the pooled analysis, data from the UK Biobank were copied into the AoU RW for cross-analysis with data from All of Us. Phenotypes were prepared as previously described and merged into a single table. Genomic data were prepared by merging variants for all available samples from the UK Biobank and All of Us cohorts into a single genomic data set (Fig. 2). Biallelic variants were retained only if present in both cohorts, to avoid the clear batch effect of a variant present in only one cohort. We obtained 2,715,453 biallelic exonic variants for the pooled analysis after subsetting to UK Biobank exonic capture regions and filtering to allele count (AC) >= 6, since variants with an exceptionally low allele count are not considered by the analysis method. After applying variant quality control to filter out low-quality variants from the subset of samples in the lipids cohort, a single-variant GWAS was performed with 2,135,845 merged variants in the pooled cohort for each of the lipid phenotypes. Cohort source (either All of Us or UK Biobank) was included as an additional covariate to mitigate potential batch effects from the different sequencing approaches and informatics pipelines used in All of Us and UK Biobank (see supplement). A total of 464 variants from 284 loci (r2 = 0.5) were significantly associated (p < 5E-08) with the LDL-C phenotype (Fig. 3c, Supplementary Data 2).
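The variant-merging logic can be sketched as follows; this is an illustrative toy in Python rather than the actual Hail pipeline, and the variant identifiers and allele counts are invented for the example. It also shows how pooling allele counts before the AC >= 6 filter can retain a variant that fails the per-cohort filter in each TRE:

```python
# Illustrative sketch: intersect biallelic variants present in both
# cohorts (the batch-effect guard), then apply the AC >= 6 filter to the
# pooled allele counts. A variant with AC 4 in each cohort fails the
# per-cohort filter used in the meta-analysis but is "rescued" here.
def pooled_variants(ac_aou, ac_ukb, min_ac=6):
    shared = ac_aou.keys() & ac_ukb.keys()
    return {v for v in shared if ac_aou[v] + ac_ukb[v] >= min_ac}

ac_aou = {"19-44908684-T-C": 4, "2-21006288-G-A": 50, "1-1000-A-G": 9}
ac_ukb = {"19-44908684-T-C": 4, "2-21006288-G-A": 80}
print(sorted(pooled_variants(ac_aou, ac_ukb)))
# → ['19-44908684-T-C', '2-21006288-G-A']
```

Here `1-1000-A-G` is dropped despite an adequate allele count because it is private to one cohort, while `19-44908684-T-C` survives only because the counts are pooled.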

Scientific differences between pooled and meta-analyses

We sought to test whether important scientific differences exist between our pooled and meta-analyses. We first investigated how the analytical approach impacted the identification of variants significantly associated with our phenotypes of interest. Most of the significant variants identified by either method were previously reported to be associated with plasma lipids in external datasets (Supplementary Data 1 and 2). Of the novel significant variants, most were short insertions/deletions, which were largely excluded from prior efforts. Gene prioritization of the GWAS results fine-mapped variants to genes important to lipid biology, including APOE, APOA2, LDLR, PCSK9, CETP, APOA5 and APOB, each with a top-20 prioritization score. We then tested the extent to which each approach replicates known associations by comparing our lipid GWAS results with the two previously published datasets containing the largest amount of exome- and genome-sequencing lipid association data22,23. The Selvaraj study includes diverse individuals from an external TOPMed cohort. The Hindy study included ~40,000 individuals from the UK Biobank (partially overlapping with our UK Biobank dataset) as well as ~170,000 other individuals, most of whom were of European ancestry. Effect sizes from both of our analyses are highly correlated with the two previously published standards (Fig. 4b). Analytical approach had little impact on either the number of significant SNPs or the concordance (R2) of associations in common with the Selvaraj study. When compared with the Hindy study, an average of ~3 more genome-wide significant SNPs were retained with the pooled analysis (Supplementary Fig. 10); however, the concordance (R2) was slightly lower for all lipid phenotypes using the pooled approach (Fig. 4b). We next examined whether the pooled analysis includes a broader total set of variants than the meta-analysis.
A total of 1,496,404 variants were present only in the pooled analysis, most of which had lower minor allele frequencies (Fig. 4a).

Fig. 4: Scientific differences in pooled and meta-analyses.
figure 4

a Examination of variants included only in the pooled analysis. b Comparison of lipid GWAS results against two previously published reference datasets: Hindy22 and Selvaraj23. HDL high-density lipoprotein cholesterol, LDL low-density lipoprotein cholesterol, TC total cholesterol, TG triglycerides. c Bar chart of ancestry proportions across all methods with the variant results meeting genome-wide significance superimposed. Here, AFR, AMR, EAS, NFE, and SAS indicate African, American, East Asian, Non-Finnish European, and South Asian ancestry groups, respectively.

Next, we tested how the analytical approach impacted the ancestry frequency distributions of significant variants. We obtained ancestry data from gnomAD and referenced the popmax ancestry information24. Of the 490 significant variants from the meta-analysis and 464 from the pooled analysis, 400 were common to both. These shared variants spanned different ancestral groups: 16% African, 13% American, 26% Non-Finnish European, and 22% each from the East Asian and South Asian groups (Fig. 4c, Supplementary Data 3). Around 90 variants were identified as genome-wide significant in the meta-analysis but not in the pooled analysis, whereas 64 variants were significant in the pooled analysis but not in the meta-analysis. Some variants significant in only one method were near but below the significance cutoff in the other, or were excluded from one analysis by AC filtering or variant QC (Supplementary Figs. 8 and 9). We identified two low-frequency variants (AF < 0.01; rs72646508 and rs145777339) from the meta-analysis and six from the pooled analysis, drawn from the American and African ancestral groups (Table 1). Because the All of Us cohort is enriched for samples of American (Hispanic) and African ancestry, we were able to identify multiple variants unique to these ancestral groups using the pooled approach; among the ancestry-specific variants from the pooled analysis, five rare variants were specific to African ancestry and one to American ancestry. We also observed that the 64 variants uniquely significant in the pooled analysis had more deleterious CADD scores (Phred scores >= 20) than those uniquely significant in the meta-analysis (p-value 0.02), with much of the signal observed in the American ancestral group (p-value 0.09). These pooled-analysis variants (Phred scores >= 20) were rare, present in non-European ancestral groups, and carried severe functional consequences, including missense, frameshift and stop-gain mutations.

Table 1 Rare variants uniquely significant in either meta-analysis or pooled analysis

Cost and complexity differences between pooled and meta-analyses

Cost and complexity are critical considerations impacting the use and usability of large-scale biomedical research data. We evaluated analysis complexity by examining the number of discrete computational steps required to complete a lipid GWAS (Fig. 1). The number of arrows (where each arrow represents an input or output of a computational step) required for the meta- and pooled analyses were 32 and 19, respectively. The increased complexity of the meta-analytical approach is primarily attributable to the duplication of computational steps within each silo. Extending this model to a theoretical analysis of N datasets siloed in N distinct TREs, the number of arrows required to complete the GWAS scales linearly with the number of siloed TREs, at a ~4x faster rate for the meta-analysis than for the pooled analysis (see supplement).
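The scaling argument can be sketched with a simple linear model; the coefficients below are assumptions chosen only to reproduce the two-TRE arrow counts reported here (32 and 19), and the supplement gives the actual derivation:

```python
# Illustrative linear complexity model; the coefficients are assumptions
# chosen to match the reported two-TRE arrow counts (32 meta, 19 pooled).
def meta_arrows(n_tres):
    return 13 * n_tres + 6   # per-silo steps are duplicated in every TRE

def pooled_arrows(n_tres):
    return 3 * n_tres + 13   # only data extraction repeats per TRE

assert meta_arrows(2) == 32 and pooled_arrows(2) == 19
# Per additional TRE, meta-analysis adds 13 arrows versus 3 for pooled,
# i.e. roughly a 4x faster growth rate.
print(meta_arrows(5), pooled_arrows(5))  # → 71 28
```

Under this model the gap widens quickly: by five TREs the meta-analysis workflow requires well over twice as many inputs and outputs as the pooled workflow.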

Additionally, we report the cost comparison of the meta- versus pooled analyses. There are two aspects to the overall cost: (1) cloud resource utilization (including the cost of data storage and cloud compute), and (2) the person-time needed to perform and review the results of each step. For cloud data storage costs, the respective TREs assume the considerable cost of hosting the primary formats of the genomic data, freeing researchers of this cost burden. Cloud compute costs are tool dependent. For analysis steps involving R, PLINK, or REGENIE the cloud compute resource costs are quite low, on the order of cents to a few dollars. Analysis steps involving Hail, by comparison, incur increased cloud compute cost. Hail processes data in a parallel fashion, leading to reduced wall-clock time to complete large-scale analyses. Hail is particularly useful whenever there does not already exist an optimized, purpose-built tool to perform the exact genomic data transformation needed. The primary cost driver for the meta-analysis was the Hail processing needed to extract relevant All of Us data from a Hail matrix table to create a BGEN file for use with REGENIE ($220). The primary cost driver for the pooled analysis was the Hail processing needed to merge the UK Biobank and All of Us variant data ($360).

Person-time is highly dependent on the researcher’s familiarity with the datasets, methods, tools, and TRE capabilities. We found the amount of person-time for the meta-analyses was roughly twice that required for the pooled analyses. The person-time savings gained from harmonizing, manipulating, and visualizing pooled data within a single analysis environment outweighed the cost of the additional steps required to merge the phenotype and genomic data.

Discussion

We present two potential methods for the cross-analysis of UK Biobank and All of Us data, using lipid GWAS as a case study in computational approaches to analysis across TREs. Specifically, we examined scientific and technical differences between meta-analysis of data in separate TRE silos and pooled analysis of data in a single TRE. In each analysis we controlled for potential batch effects by including the source cohort as a covariate and limiting both pooled and meta-analyses to the subset of variants common to both the All of Us and UK Biobank cohorts. Each approach successfully replicated known genetic associations with plasma lipids. For both approaches, effect sizes found for each lipid trait are highly correlated with previously published studies. However, we did note several important scientific differences. First, pooled analysis enabled 1,496,404 additional variants to be included in the GWAS compared with meta-analysis. Most of these variants were of lower minor allele frequency, so this difference may be attributed to the fact that merging the two cohorts prior to applying the AC >= 6 filter “rescued” rarer variants. Because variants with an exceptionally low allele count are not considered by the analysis method, the smaller overall number of variants retained for meta-analysis may negatively impact analyses of rare diseases or rare variants. In these cases, a pooled approach may be preferred.

Second, the analytical approach impacted the number and ancestry frequency distributions of variants significantly associated with our phenotype of interest. We report 490 variants significantly associated with LDL-C from meta-analysis of GWAS performed separately in the All of Us and UK Biobank TREs. In comparison, we found 464 variants significantly associated with LDL-C from pooled analysis of All of Us genome and UK Biobank exome sequencing data. We noted that approximately 20% of the variants significant in only one of the two analyses were most prevalent in individuals of non-European, non-Asian ancestry. Prior foundational work has demonstrated that, given otherwise equivalent datasets, pooled and meta-analyses generate theoretically and empirically equivalent results25,26. However, real-world experience, as illustrated above and by others27,28,29, has identified numerous differences between cohorts, including phenotype ascertainment, genetic ancestry and population structure. It is therefore not surprising that these two analytical approaches yielded scientifically similar, but not identical, results. This has important implications for studying genetic variants in diverse individuals.

In addition to the scientific differences considered above, researchers seeking to analyze data across TREs face significant technical hurdles. Both complexity and cost scale with the number of data enclaves cross-analyzed. The pooled GWAS approach described was the less complex of the two investigated, requiring almost half as many discrete computational steps as meta-analysis. While analysis steps are displayed in a logical order in Fig. 1, many steps are run multiple times as an analyst becomes familiar with the datasets and capabilities of the respective TREs. The number of computational steps involved in meta-analysis grows at a ~4x faster rate than for pooled analysis, and therefore there is a significant increase in meta-analysis cost associated with the person-time required to develop and debug an analysis. That increased cost is high for two TREs, and becomes even more significant as the number of TREs increases, which is expected as the amount of valuable global data grows.

This study found several capabilities provided by existing TREs that facilitated cross-cohort analysis, and that, if adopted by future TREs, would ease incorporation of more data into future analyses. These include: (1) maintaining a single centrally funded copy of data that can be accessed in-place by researchers, (2) providing robust, integrated research support, and (3) providing access to flexible, scalable infrastructure and tools suited to large-scale data analysis (Table 2).

Table 2 Important capabilities and opportunities to consider for improved cross-cohort analysis

In addition, this study identified many opportunities to improve the support for cross-analysis in current and future TREs, including both technical and policy considerations (Table 2). In a meta-analysis, TRE technical differences (such as differences in user interfaces, analytical tools, supported programming languages, acceptable mechanisms for data access, acceptable mechanisms for data output, and methods for organizing and orchestrating an analysis) are considerable hurdles. The activation energy just to “get started” in multiple TREs is high. Our study team found it challenging to manage multiple copies of code in separate TREs. Data harmonization, a critical and time-consuming step, becomes much more tedious and error prone when one cannot view and visualize the row-level data together. Many common analytical tasks, including creating a simple box-and-whisker comparison plot like the one in Fig. 3a, are infeasible with aggregate data. Improved harmonization and standardization of data, policies, and working environments across TREs can help reduce this burden.

Policy decisions rest on complex rationales that attempt to balance participant privacy, data security, scientific utility, and data-sharing goals, and they have significant practical impact on cross-analysis. Policy changes that enable researchers to cross-analyze pooled data in one or more mutually trusted TREs would be a powerful step toward improved data usability and increased researcher productivity. The additional friction incurred when performing data harmonization for the meta-analysis could be reduced if TREs had reciprocal policies that permitted some participant-level data, such as phenotypes, to be securely transferred between them. This middle-ground approach may be a compromise that increases data usability in a manner respectful of the current myriad of genomic data sharing policy and governance issues.

The analyses and results in this paper have several limitations. First, cross-analyses were limited to the All of Us whole genome sequence and UK Biobank whole exome data available at the time of this study and meeting the TRE policy constraints. As noted previously, these data were generated using different sequencing methods and informatics pipelines. Future cross-analyses may be improved by further harmonizing the approaches and joint-calling pipelines used to generate these data. The primary goal of this work was to build and describe approved paths for cross-analysis to encourage use by the broader scientific community. As such, the case study selected for cross-analysis was intentionally limited to common variants associated with well-studied lipid phenotypes. Future cross-analyses of All of Us and UK Biobank data exploring rare variants and novel associations are likely to have greater scientific impact, and potentially to surface greater sensitivity to methodological differences. Finally, this study was limited to the cross-analysis of data residing in two enclaves. Future work is needed to expand these approaches to cross-analysis of data residing in three or more enclaves.

Early paths for cross-analysis of population-scale clinical and genomic data are clear. Program leaders, data providers, policy groups, and TRE developers have a shared responsibility to ensure data assets generated from public funding yield maximal scientific benefit while continuing to balance and honor participants as partners in research programs. Thoughtful approaches to reducing barriers for efficient data access and analysis across large programs can increase the power of discovery while preserving participant trust. Data providers could consider providing mirrored copies of the data in multiple clouds to better enable pooled analyses. Additionally, and consistent with many existing efforts at federated analysis, data generators can further harmonize and standardize methods to avoid the need for downstream researchers to re-align and re-call genomic data. This study reinforces the need to reduce friction in cross-analysis to fully realize the potential of global-scale health research.

Methods

Cohorts

The UK Biobank (UKB) is a population-based cohort of approximately 500,000 participants recruited from 2006 to 2010, with existing genomic and longitudinal phenotypic data. Baseline assessments were conducted at 22 assessment centers across the United Kingdom, with sample collections including blood-derived DNA. Secondary use of these data was approved by the Massachusetts General Hospital Institutional Review Board (protocol 2021P002228) and was facilitated through UK Biobank application 7089. The All of Us research program recruited individuals who have been, and continue to be, underrepresented in biomedical research due to limited access to healthcare. The first release of genomic data included approximately 98,000 individuals who completed electronic consent modules and health questionnaires upon enrollment. Approval to use the dataset for program operational demonstration projects was obtained from the All of Us Institutional Review Board.

Genotypes

Whole exome sequencing (WES) data from the 200K exome release constitute the most recent release of genomic data permitted by UK Biobank policy to be analyzed outside of the UK Biobank Research Analysis Platform (RAP). The 200K exome release includes approximately 10 million exonic variants, with >95% of targeted bases covered at a depth of 20X or greater. On both the All of Us Researcher Workbench (AoU RW) and the UK Biobank RAP, the genotypes were filtered to include only variants within the exome capture region with an alternate allele count of 6 or more. Whole genome sequencing (WGS) data from the All of Us alpha3 release were available as a Hail matrix table on the AoU RW. The alpha3 genotypes were filtered to include only variants within the same exome capture region with an alternate allele count of 6 or more. As initial quality control, variants with a Hardy-Weinberg equilibrium exact test p-value below 1e-15 or missing call rates exceeding 10% were removed. QC also checked for samples with missing call rates exceeding 10%, but none were found. To mitigate batch effects, in the pooled analysis the prepared genotypes were filtered to include only those variants found in both cohorts, and in the meta-analysis the results were filtered to include only those variants found in both cohorts.
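The missing-call-rate portion of this QC can be sketched as follows; the data layout and names are illustrative (in practice this filtering runs on Hail matrix tables), and the Hardy-Weinberg exact test is omitted for brevity:

```python
# Minimal sketch of the missingness QC: genotypes is a {variant: [calls]}
# mapping with None marking a missing call (names are illustrative).
def variant_missingness(calls):
    """Fraction of samples with a missing call at this variant."""
    return sum(c is None for c in calls) / len(calls)

def qc_filter(genotypes, max_missing=0.10):
    """Drop variants whose missing call rate exceeds 10%."""
    return {v: calls for v, calls in genotypes.items()
            if variant_missingness(calls) <= max_missing}

genotypes = {
    "varA": [0, 1, 2, 0, 1, 0, 2, 1, 0, 1],           # 0% missing: kept
    "varB": [0, None, None, 1, 2, 0, 1, None, 0, 1],  # 30% missing: dropped
}
print(sorted(qc_filter(genotypes)))  # → ['varA']
```

The sample-level check described above is the same computation applied column-wise, i.e. the fraction of missing calls per sample across all variants.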

Phenotypes

The primary outcomes in this study were LDL cholesterol (LDL-C), HDL cholesterol (HDL-C), total cholesterol (TC) and triglycerides (TG). We curated and harmonized the lipid measurements and statin drug exposures for both UK Biobank and All of Us from the phenotype resources of these cohorts. LDL-C was either directly measured or calculated by the Friedewald equation when triglycerides were <400 mg/dL. Given the average effect of lipid-lowering medicines, when lipid-lowering medicines were present we adjusted total cholesterol by dividing by 0.8 and LDL-C by dividing by 0.7; triglycerides were natural log transformed for analysis. The lipid phenotypes were then adjusted for the covariates, and the residuals were inverse rank normalized and scaled by the standard deviation. We included PC1-10, age, age² and sex at birth as covariates in our study. To mitigate batch effects, for the pooled analysis we also included a ‘cohort’ covariate.
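A minimal sketch of this phenotype preparation, under stated assumptions: the statin adjustments and log transform follow the text above, the Blom-style rank offset of 0.375 is an assumption (the supplement specifies the exact procedure), and the covariate-residualization step is omitted:

```python
import math
from statistics import NormalDist

# Statin adjustment per the text: TC / 0.8 and LDL-C / 0.7 for statin
# users; triglycerides are natural log transformed regardless.
def adjust_for_statins(tc, ldl, tg, on_statin):
    if on_statin:
        tc, ldl = tc / 0.8, ldl / 0.7
    return tc, ldl, math.log(tg)

# Inverse rank normalization: map ranks to standard-normal quantiles.
# The offset c = 0.375 (Blom) is an assumption, not taken from the paper.
def inverse_rank_normalize(values, c=0.375):
    n = len(values)
    rank = {i: r for r, i in enumerate(
        sorted(range(n), key=values.__getitem__), start=1)}
    nd = NormalDist()
    return [nd.inv_cdf((rank[i] - c) / (n - 2 * c + 1)) for i in range(n)]

tc, ldl, log_tg = adjust_for_statins(160.0, 90.0, 120.0, on_statin=True)
print(round(tc), round(ldl))  # → 200 129
```

After normalization the phenotype is approximately standard normal, so effect sizes are expressed in standard-deviation units.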

Statistical analysis

Single-variant genome-wide association studies (GWAS) were carried out using REGENIE v2.2.4. We implemented REGENIE step 1 null model generation using quality-controlled variants with a minor allele count (MAC) of at least 100. We applied the leave-one-chromosome-out (LOCO) method for GWAS while adjusting for the covariates stated above. We applied variant and sample missingness thresholds of 10%, followed by a Hardy-Weinberg equilibrium p-value threshold of 1 × 10−15, for both step 1 and the genome-wide association tests. We carried out meta-analysis of the siloed GWAS results from each cohort using the METAL package with the Standard Error scheme, which weights effect size estimates by the inverse of their variance (the squared standard error). The UKB siloed analysis was carried out on the UKB RAP, and the All of Us siloed analysis and the pooled analysis were carried out on the AoU RW. All steps were implemented in R or Python notebooks. Complete details on the various steps carried out in the project are provided in the supplementary information.
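The inverse-variance-weighted fixed-effects combination underlying METAL's Standard Error scheme can be sketched as follows; the per-cohort effect estimates are hypothetical:

```python
import math

# Inverse-variance-weighted fixed-effects meta-analysis: each cohort's
# effect estimate is weighted by the inverse of its squared standard
# error, and the combined z-score yields a two-sided p-value.
def ivw_meta(betas, ses):
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return beta, se, p

# Hypothetical per-cohort LDL-C estimates for one variant.
beta, se, p = ivw_meta(betas=[-0.40, -0.35], ses=[0.08, 0.05])
print(round(beta, 3), round(se, 3))  # → -0.364 0.042
```

Because the weights sum, the combined standard error is smaller than either input, which is the source of the power gain from meta-analysis when the siloed estimates agree.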

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.