Demonstrating paths for unlocking the value of cloud genomics through cross cohort analysis

Recently, large-scale genomic projects such as All of Us and the UK Biobank have introduced a new research paradigm in which data are stored centrally in cloud-based Trusted Research Environments (TREs). To characterize the advantages and drawbacks of different TRE attributes in facilitating cross-cohort analysis, we conduct a Genome-Wide Association Study of standard lipid measures using two approaches: meta-analysis and pooled analysis. Comparison of full summary data from both approaches with an external study shows strong correlation of known loci with lipid levels (R2 ~ 83–97%). Importantly, 90 variants meet the significance threshold only in the meta-analysis and 64 variants are significant only in the pooled analysis, with approximately 20% of variants in each of those groups being most prevalent in non-European, non-Asian ancestry individuals. These findings have important implications, as technical and policy choices lead to cross-cohort analyses generating similar but not identical results, particularly for non-European ancestral populations.

Team. Hail 0.2.13-81ab564db2b4. https://github.com/hail-is/hail/releases/tag/0.2.13) and emitted as a matrix table similarly filtered to exonic capture regions. The data were then further filtered via plink to include only variants with an alternate allele count of 6 or more.
The All of Us whole genome sequence alpha 3 release is the first release of genomic data available for the All of Us cohort. All of Us policy requires that these data be used only within the All of Us Researcher Workbench. Whole genome sequence (WGS) data were generated for consented All of Us participants via a College of American Pathologists (CAP) / Clinical Laboratory Improvement Amendments (CLIA) validated pipeline. A count-called SNP/Indel joint call set was generated from WGS data according to GATK Best Practices. WGS data were provided in pVCF and Hail matrix table formats. For the meta-analyses, the matrix table was filtered to variants within the UK Biobank exonic capture regions and to variants with an alternate allele count of 6 or more. For the pooled analyses, the matrix table was similarly filtered to exonic capture regions. The data were then further filtered via plink to include only variants with an alternate allele count of 6 or more. We were not able to use the provided VCF filter flags for filtering, since each cohort used different flagging criteria (Supplementary Fig. 3).
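The two filters described above can be illustrated with a minimal plain-Python sketch. It is not the Hail or plink code used in the study. The capture intervals, the `Variant` type, and all helper names are hypothetical, and the threshold of 6 is read here as an alternate allele count, which is an assumption.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int               # 1-based position
    ref: str
    alt: str
    alt_allele_count: int  # number of alternate alleles observed in the cohort


# Hypothetical capture intervals: chromosome -> list of (start, end), 1-based inclusive.
CAPTURE_REGIONS = {"chr1": [(100, 200), (500, 650)]}

# Assumption: the threshold of 6 is an alternate allele count.
MIN_ALT_ALLELE_COUNT = 6


def in_capture_region(v, regions=CAPTURE_REGIONS):
    """True if the variant position falls inside any interval on its chromosome."""
    return any(start <= v.pos <= end for start, end in regions.get(v.chrom, []))


def keep_variant(v):
    """Apply both filters: capture-region membership and minimum allele count."""
    return in_capture_region(v) and v.alt_allele_count >= MIN_ALT_ALLELE_COUNT


variants = [
    Variant("chr1", 150, "A", "G", 12),  # inside a region, common enough: kept
    Variant("chr1", 150, "A", "T", 3),   # inside a region, too rare: dropped
    Variant("chr1", 300, "C", "T", 40),  # outside all regions: dropped
    Variant("chr2", 150, "G", "A", 9),   # chromosome has no regions: dropped
]
kept = [v for v in variants if keep_variant(v)]
```

In the pooled workflow the allele count filter is applied after the cohorts are merged, so the counts would be computed over the combined samples rather than per cohort.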
Supplementary Fig. 3. The UK Biobank and All of Us VCF data used different soft thresholds in the VCF filter field; therefore, we were not able to use these precomputed results in our filtering.
For the pooled analysis, biallelic variants were merged if these values were identical: [chrom, pos, ref, alt] (Supplementary Fig. 4a and 4b). More specifically, the variants in the prepared UK Biobank and All of Us matrix tables were split into biallelic variants using the Hail method split_multi_hts, and then an inner join of the variants was performed via the Hail method union_cols. For full details, please see 03_merge_variants.ipynb. Variants were removed from use in downstream analyses, such as principal component analysis and REGENIE, if the Hardy-Weinberg equilibrium exact test p-value was below 1e-15 or the missing call rate exceeded 10%. Samples were removed if their missing call rate exceeded 10%, but no samples in UK Biobank or All of Us exceeded this missingness threshold. We did not apply other variant QC criteria, such as call quality thresholds, because determining equivalent thresholds for DeepVariant+GLnexus variants versus DRAGEN variants is non-trivial due to their differences in accuracy [2].
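The merge key and the variant QC thresholds can be sketched in plain Python as follows. This illustrates the logic only and is not the Hail code from the study's notebook. The dictionaries and function names are hypothetical, and the Hardy-Weinberg p-values are assumed to be computed elsewhere.

```python
HWE_P_MIN = 1e-15        # variants with an HWE exact test p-value below this are removed
MAX_MISSING_RATE = 0.10  # variants whose missing call rate exceeds this are removed


def variant_key(v):
    """Biallelic variants match only if all four fields are identical."""
    return (v["chrom"], v["pos"], v["ref"], v["alt"])


def shared_variant_keys(cohort_a, cohort_b):
    """Inner join on the variant key: keep keys present in both cohorts."""
    keys_b = {variant_key(v) for v in cohort_b}
    return [variant_key(v) for v in cohort_a if variant_key(v) in keys_b]


def passes_variant_qc(hwe_p_value, missing_call_rate):
    """True if the variant survives both QC thresholds."""
    return hwe_p_value >= HWE_P_MIN and missing_call_rate <= MAX_MISSING_RATE


# Toy variant lists standing in for the two prepared matrix tables.
ukb = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T"},
    {"chrom": "chr2", "pos": 5, "ref": "C", "alt": "T"},
]
aou = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 5, "ref": "C", "alt": "G"},
]
shared = shared_variant_keys(ukb, aou)
```

Here only the chr1:100 A>G variant is shared. The chr2 variants sit at the same position but differ in the alternate allele, so they do not match, which is why multiallelic sites need to be split before the join.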

Phenotype preparation
Blood lipids including low-density lipoprotein cholesterol (LDL-C), total cholesterol (TC), high-density lipoprotein cholesterol (HDL-C) and triglycerides (TG) were used as the primary phenotypes in this study. We curated and harmonized the lipid measurements and statin drug exposures for both UK Biobank and All of Us from the phenotype resources of these cohorts.

Lipid levels
For UK Biobank, study-specific blood serum lipid assays were performed systematically in its central laboratories [3]. The first instance of each lipid measurement was used: LDL-C (data-field-ID: 30780), TC (data-field-ID: 30690), HDL-C (data-field-ID: 30760) and TG (data-field-ID: 30870). Most participants (N=190,982) in the 200k exome release had at least one non-null lipid value and were therefore included in the analysis. The lipid measurements were converted from mmol/L to mg/dL by multiplying TG values by 88.57 and the other lipid measurements by 38.67 [4].
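As a worked illustration of the unit conversion, the following minimal sketch applies the two factors stated above. The function and dictionary names are hypothetical and the input values are made up for the example.

```python
# Conversion factors from mmol/L to mg/dL, as stated in the text.
MG_DL_PER_MMOL_L = {
    "LDL-C": 38.67,
    "TC": 38.67,
    "HDL-C": 38.67,
    "TG": 88.57,
}


def mmol_per_l_to_mg_per_dl(lipid, value_mmol_per_l):
    """Convert one lipid measurement from mmol/L to mg/dL."""
    return value_mmol_per_l * MG_DL_PER_MMOL_L[lipid]


ldl_mg_dl = mmol_per_l_to_mg_per_dl("LDL-C", 3.0)  # 3.0 * 38.67, about 116.01 mg/dL
tg_mg_dl = mmol_per_l_to_mg_per_dl("TG", 1.5)      # 1.5 * 88.57, about 132.86 mg/dL
```

Triglycerides use a different factor from the cholesterol measures, so the conversion has to be keyed on the lipid type rather than applied uniformly.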
For All of Us, no study-specific blood serum lipid assays were available at the time of this analysis, so we instead used lipid measurements from Electronic Health Records (EHR). Of the 98,622 WGS samples, 37,754 participants had at least one type of lipid measurement. The most recent measurement value for each of the four lipid types was used. Note that a person's measurements for the four different lipid types, if available, may have occurred on different dates. To maximize the number of All of Us genomes we were able to include in this study, we collapsed several related OMOP measurement concepts and several OMOP unit concepts (Supplementary Fig. 5). This included use of measurements with no unit specified when the data distribution for that measurement appeared empirically to be in mg/dL.
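Selecting the most recent value per lipid type can be sketched as below. This is a simplified plain-Python illustration, not the study's OMOP query. The record layout and names are hypothetical, and the measurements are assumed to be already collapsed to one concept and one unit per lipid type.

```python
from datetime import date


def most_recent_per_lipid(records):
    """Keep the latest measurement for each (person, lipid type) pair.

    records: iterable of (person_id, lipid, measurement_date, value_mg_dl).
    Returns a dict mapping (person_id, lipid) -> (measurement_date, value_mg_dl).
    """
    latest = {}
    for person_id, lipid, measured_on, value in records:
        key = (person_id, lipid)
        if key not in latest or measured_on > latest[key][0]:
            latest[key] = (measured_on, value)
    return latest


# Made-up EHR rows for two participants.
records = [
    ("p1", "LDL-C", date(2015, 3, 1), 130.0),
    ("p1", "LDL-C", date(2019, 6, 12), 110.0),
    ("p1", "HDL-C", date(2017, 1, 20), 55.0),
    ("p2", "TG", date(2018, 9, 5), 150.0),
]
latest = most_recent_per_lipid(records)
```

Because the selection is made per lipid type, participant p1 ends up with an LDL-C value from 2019 and an HDL-C value from 2017, matching the note above that the four measurements may come from different dates.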
Supplementary Fig. 4a. Characteristics of variants found in both datasets, versus those found only in UK Biobank or All of Us. The maximum value for allele number on the y-axis is determined by cohort size. The pooled exonic variants consist of common and rare variants. Most exonic variants found in UK Biobank only were very rare. Most exonic variants found in All of Us only were either very rare or had a very low allele number. Note that variant QC and AC filtering had not yet occurred for the data shown in these plots. From the All of Us VCF filter field values for these variants, most of the common variants found in All of Us only were of low quality and would have eventually been filtered out during variant QC, if they had been included.
Supplementary Fig. 4b. gnomAD popmax allele frequencies of variants found in both datasets, versus those found only in UK Biobank or All of Us. The pooled exonic variants show a clear pattern with gnomAD population maximum allele frequencies. Less alignment along the diagonal is shown for UK Biobank only and All of Us only exonic variants. Note that variant QC and AC filtering had not yet occurred for the data shown in these plots. From the All of Us VCF filter field values for these variants, most of the common variants found in All of Us only were of low quality and would have eventually been filtered out during variant QC, if they had been included.