Deep coverage whole genome sequences and plasma lipoprotein(a) in individuals of European and African ancestries

Lipoprotein(a), Lp(a), is a modified low-density lipoprotein particle where apolipoprotein(a) (protein product of the LPA gene) is covalently attached to apolipoprotein B. Lp(a) is a highly heritable, causal risk factor for cardiovascular diseases and varies in concentrations across ancestries. To comprehensively delineate the inherited basis for plasma Lp(a), we performed deep-coverage whole genome sequencing in 8,392 individuals of European and African American ancestries. Through whole genome variant discovery and direct genotyping of all structural variants overlapping LPA, we quantified the 5.5kb kringle IV-2 copy number (KIV2-CN), a known LPA structural polymorphism, and developed a model for its imputation. Through common variant analysis, we discovered a novel locus (SORT1) associated with Lp(a)-cholesterol, and also genetic modifiers of KIV2-CN. Furthermore, in contrast to previous GWAS studies, we explain most of the heritability of Lp(a), observing Lp(a) to be 85% heritable among African Americans and 75% among Europeans, yet with notable inter-ethnic heterogeneity. Through analyses of aggregates of rare coding and non-coding variants with Lp(a)-cholesterol, we found the only genome-wide significant signal to be at a non-coding SLC22A3 intronic window also previously described to be associated with Lp(a); however, this association was mitigated by adjustment with KIV2-CN. Finally, using an additional imputation dataset (N=27,344), we performed Mendelian randomization of LPA variant classes, finding that genetically regulated Lp(a) is more strongly associated with incident cardiovascular diseases than directly measured Lp(a), and is significantly associated with measures of subclinical atherosclerosis in African Americans.

Genome STRiP copy number estimates using ddPCR 23 , which establishes general accuracy for the quantified absolute copy number. To evaluate the precision of our KIV2-CN estimates, we utilized 123 pairs of siblings from JHS that were confidently identical-by-descent at both LPA 1Mb window haplotypes (genotype concordance > 99%), and found a robust correlation between sibling pair KIV2 copy number estimates (r 2 =0.989) (Supplementary Fig. 7a-d).
LPA locus variants, namely rs3798220 and rs10455872, have been previously associated with KIV2-CN 14,15 . In the FIN WGS, these two SNPs account for 12% of the variance of directly genotyped KIV2-CN. To improve KIV2-CN estimation from SNPs, we developed an imputation model using 2,215 FIN with WGS and applied it to impute KIV2-CN in the 27,344 FIN with array-derived genotypes. In the FIN WGS, we applied the least absolute shrinkage and selection operator (LASSO) across high-quality (imputation quality > 0.8) variants with minor allele frequency (MAF) > 0.1% available in the FIN imputation dataset in a 4MB window around LPA, which yielded a 61-variant model to impute KIV2-CN (Supplemental Fig. 8a). To understand the relative importance of each of these 61 variants, a random forest model was applied (Fig. 2b,  Supplementary Fig. 8b). Our model ascribed greatest importance to rs10455872, a previously described SNP associated with KIV2-CN 14,15 . The full 61-variant model in our validation dataset explained 60% of variation in genotyped KIV2-CN (Supplemental Table 5, Supplemental Fig.  6c, Fig. 3b). While low frequency loss-of-function variants have been observed by us and others 24,25 within LPA, removal of these carriers did not significantly alter the relationship between KIV2-CN and Lp(a) across all individuals (P=0.48).
We sought to also determine whether combinations of summed KIV2-CN alleles equivalent to the same total had the same relationship with KIV2-CN. We observed that the relationship of homozygous KIV2-CN alleles (from 59 FIN individuals 95% homozygous-by-descent at the LPA locus) to Lp(a) was similar to the remaining association observed across all others (P = 0.21).
The lead SORT1 locus variant, rs12740374, has been previously causally associated with LDL cholesterol 26 . Here, Lp(a)-C association for rs12740374 was not substantially altered conditioned on either LDL cholesterol (Fig. 4a) or apolipoprotein B (Supplementary Fig. 12). Common variants at CETP are associated with HDL cholesterol 27 and the lead CETP locus variant for Lp(a)-C, rs247616, is no longer significant after conditioning on HDL cholesterol (Supplementary Fig. 13). Lp(a)-C is strongly associated with HDL cholesterol (B = 0.41 SD Lp(a)-C/SD HDL, P =2.9x10 -191 ); notably, HDL and Lp(a) particles have similar densities potentially influencing Lp(a)-C measurement accuracy 28 . Finally, rs7412 (APOE p. Arg176Cys), denoting the major APOE2 polymorphism, has been previously associated with LDL cholesterol 29 and recently with Lp(a) in a meta-analysis 11 . The association of rs7412 with Lp(a) is diminished when conditioning on LDL cholesterol but remains strongly associated (before conditioning: B = -0.25 SD, P = 1x10 -23 , after conditioning: B = -0.18 SD, P = 5x10 -16 ) (Fig. 4b).
Next, we compared inter-ethnic effects of LPA locus variants attaining sub-threshold significance (P < 1x10 -4 ) in either ethnicity for Lp(a) and Lp(a)-C. Spearman rank correlation of genetic effects between the two ethnicities for Lp(a)-C was 0.38 and for Lp(a) 0.16 ( Supplementary  Fig. 16a,b). Moderately associated (P < 1x10 -2 ) LPA locus variants largely private in African Americans (FIN MAF <0.1%) had larger absolute effects across MAFs compared to such variants observed in both ethnicities (P = 3x10 -32 ) (Supplementary Fig. 17a,b). In comparing betas from genome-wide significant variants in African Americans with betas from the same variants in Europeans (Fig. 4c), we found the strongest inter-ethnic heterogeneity (HetP = 9.8x10 -64 ) at an LPAL2 intronic variant at the LPA locus (rs192873801, MAF 2.8% in JHS and 2.7% in FIN) with strongly divergent effects between the two ethnicities: +0.80 SD in JHS (P = 3.8x10 -32 ) and -0.61 SD in FIN (P = 2.0x10 -35 ) Supplementary Fig. 18. We noted these variants to be on separate haplotypes for JHS and FIN (Supplementary Fig. 19). Notably, the LPA lossof-function variant rs41272114, shows similarly strong effects in both ethnicities (HetP > 0.05).

Common variant association: KIV2-CN modifier analyses
To determine if there are variants that influence the relationship between KIV2-CN and Lp(a)-C or Lp(a) concentrations, we performed variant-by-KIV2-CN interaction analyses at a 4MB window around LPA. We identified three independent modifier variants at this locus which influenced the relationship between KIV2-CN and Lp(a)-C (rs13192132, P = 1.73x10 -15 , rs1810126, P = 6.84x10 -14 , rs1740445, P = 6.35x10 -9 ) (Fig. 5) and were consistent across ethnicities (Supplementary Table 11, Supplementary Fig. 20a,b). Sensitivity analyses of interactions was performed to assess for confounding from 1) haplotype effects and 2) single variants tagged through LD 32,33 . All three variants show association with Lp(a)-C individually (P < 0.05), but are not correlated with KIV2-CN genotype (Pearson correlation r 2 < 0.1) ( Supplementary Table 12). Furthermore, interaction associations persisted after conditioning on variants independently associated with Lp(a)-C (Supplementary Table 13).

Rare variant analysis: Coding and non-coding burden tests
Rare and low-frequency disruptive coding variants within LPA have been previously associated with Lp(a) 24,25 . Here, we performed two coding rare variant analyses studies (RVAS) aggregating rare (MAF <1%) variants which were 1) LOF or missense deleterious by in-silico prediction tools 35 , or 2) non-synonymous, within their respective genes, and performed association with Lp(a)-C, adjusting for KIV2-CN. All analyses were done separately for JHS and EST and metaanalyzed. While no genes reached significance in either analysis after accounting for multiplehypothesis testing, we observed suggestive evidence for LPA in both coding RVAS tests (P = 7x10 -4 for LOF and missense deleterious mutations, 1x10 -4 for non-synonymous mutations) (Supplementary Tables 18-19, Supplementary Fig. 22a,b).
We also interrogated whether there was evidence of rare, non-coding variants aggregated within regulatory sequences uniquely detected by WGS that influence Lp(a)-C. We performed three non-coding RVAS using the variant groupings described in the Methods along with Roadmap epigenome data 34 from adult liver, the main tissue where LPA is expressed (Supplementary Fig.  23, Supplementary Fig. 24). The only genome-wide significant association was for an intron of SLC22A3 at 6:160851000-160854000 with Lp(a)-C (P = 4.5x10 -8 ) ( Supplementary Tables 20-25). Similarly, rare variants in a putative regulatory domain of SLC22A3 were recently shown to be associated with Lp(a) in a sliding window analysis using low-coverage whole genomes 36 . However, we found that conditioning on LPA's KIV2-CN, 128 kb away, mitigated the observed association (P = 4.3x10 -3 , Supplementary Tables 20-21). Upon conditioning on KIV2-CN, while no sliding windows reached statistical significance, the top window was 6:160,939,500-160,942,500 (P = 1.6x10 -4 ), 13kb downstream of the LPA transcription end site and overlapping three annotated ORegAnno 37 CTCF binding sites (Fig. 6).
Interrogation of rare enhancer variants predicted to influence LPA expression in liver 38 showed nominal evidence of association with Lp(a)-C before (P = 5x10 -5 ) and after (P = 1x10 -3 ) conditioning on KIV2-CN (Fig. 6, Supplementary Fig. 25). However, other putative genelinked rare enhancer variants at the LPA locus, including the aforementioned SLC22A3 ( Supplementary Fig. 26), also demonstrate nominal associations, highlighting current challenges in both mapping associated regulatory elements to causal genes through in silico approaches and discerning the relative impacts of potentially pleiotropic regulatory elements.

Mendelian randomization
Genetic variation at the LPA locus is an optimal instrument for Mendelian randomization (MR) as it strongly and specifically influences circulating Lp(a) levels. Past studies have performed Lp(a) MR across clinical and metabolic traits using genetic risk scores comprised of between 1-18 variants 14,39,40 . Here, we performed MR using three different genetic instruments per cohort to distinguish variant classes influencing Lp(a) phenotypes: 1) an expanded genetic risk score, "GRS," comprised of the sum of the KIV2-CN-adjusted variant effects from LD-pruned variants in a ~4MB window around LPA with sub-threshold significance (P < 1x10 -4 ); 2) a "KIV2-CN" score using the directly genotyped or imputed KIV2-CN; and 3) a combined "GRS+KIV2-CN" score combining scores from (1) and (2). Each genetic instrument was normalized such that 1 unit increase in the score was equal to 1SD increase in Lp(a) (or Lp(a)-C). In African Americans, 235 variants were used towards the Lp(a) GRS and 39 towards the Lp(a)-C GRS (Supplementary Table 26). In Europeans, 399 variants were used towards the Lp(a) GRS and 49 towards the Lp(a)-C GRS (Supplementary Table 27-28). The GRS+KIV2-CN score explains 45-49% of Lp(a) variance and 20% of Lp(a)-C variance (Supplementary Fig 27,  Supplementary Table 29).
Association of GRS+KIV2-CN with 10 incident clinical events from the FIN imputation dataset (N=27,344) (Fig. 7a, Supplementary Table 30) demonstrated anticipated associations for incident cardiovascular diseases (HR 1.18/Lp(a) SD, P = 1x10 -5 ), comprising incident myocardial infarction (HR 1.23/Lp(a) SD, P = 8x10 -4 ), coronary heart disease (CHD) (HR 1.25/Lp(a) SD, P = 7x10 -7 ), and stroke (HR 1.27/Lp(a) SD, P = 1x10 -3 ). For given effect on Lp(a), the GRS had a larger effect on incident cardiovascular risk (HR 1.30/Lp(a) SD, P = 6x10 -8 ) than KIV2-CN (HR 1.03/Lp(a) SD, P = 0.17). While the KIV2-CN score alone was not as strongly associated with cardiovascular outcomes (P > 0.05), its estimated effect with incident MI (HR = 1.16) was similar to recent estimations in a MI case-control analysis 14 . Thus, power for MR using the KIV2-CN instrument may be hindered due to a limited number of incident MI cases and modest effect conferred by KIV2-CN. These results suggest that knowledge of LPA variant class genotypes may provide additional information on cardiovascular risk beyond circulating Lp(a) levels.
To determine whether LPA genomic variants influence the accumulation of subclinical cardiovascular atherosclerosis, we associated both the Lp(a) and Lp(a)-C genetic instruments with computed tomography-derived measures of atherosclerosis in the coronary arteries (CAC) and abdominal aorta (AAC) in 1,701 African Americans from JHS without prevalent clinical cardiovascular disease (Supplementary Table 31, Fig. 7b). Here, the comprehensive (GRS+KIV2-CN) genetic instruments for both Lp(a) and Lp(a)-C demonstrated association with subclinical atherosclerosis with similar standardized effects for both CAC and AAC: Lp(a) (CAC: B = 0.097, P = 7.6x10 -4 ; AAC: B = 0.092, P = 2.7x10 -3 ), and Lp(a)-C (CAC: B = 0.14, P = 4.6x10 -3 ; AAC: B = 0.12, P = 6.4x10 -3 ). Notably, this is the first known demonstration of LPA genomic variants affecting atherosclerotic risk in African Americans. A prior study of African Americans from the Dallas Heart Study found no association between sub-clinical measures of atherosclerosis, such as coronary calcium, and Lp(a) phenotype 41 . Compared to prior work, our power is optimized with larger sample size and genetic instrument for causal inference.

DISCUSSION:
We characterized the genetic architecture of Lp(a) and Lp(a)-C using deep coverage WGS in 8,392 Europeans and African Americans across allele frequencies and classes. While we observe that Lp(a) is highly heritable in Europeans and African Americans, distinct and common genetic determinants influence concentrations. Using a comprehensive genetic instrument that separately imputes apo(a) isoform, we show that knowledge of LPA genotypes can better inform incident cardiovascular disease risk prediction than just knowledge of Lp(a) biomarker level.
These observations permit several conclusions. First, through whole-genome sequencing and imputation, we observe substantial genetic heritability of Lp(a) -85% (SE 5%) in African Americans and 75% (SE 6%) in Europeans. We leverage this observation to systematically dissect the heritable components of Lp(a) across two ethnicities. Through single variant analysis, we find a novel locus for Lp(a)-C, SORT1, whereby the top variant (rs12740374) reduces plasma Lp(a)-C concentrations in both ethnicities and is independent of LDL cholesterol levels, thereby providing evidence for the sortilin receptor as a novel component in Lp(a)-C metabolism. Through genetic modifier analysis, we find evidence of three loci which affect the relationship between KIV2-CN and Lp(a)-C similarly across both ethnicities. We replicate evidence supporting rare coding variation at LPA influencing Lp(a); however, observed associations of aggregates of rare non-coding variation appeared to be largely explained by LPA structural variation, namely KIV2-CN.
Second, we observed high heritability in diverse ethnicities despite notable inter-ethnic differences in circulating biomarker concentrations. Upon finding that similar Lp(a) effect sizes are conferred per KIV2 copy in African Americans and Europeans, we delved further into KIV2independent effects conferred by variants at the LPA locus. Among distinct sequence variation, we notably observed an LPAL2 intronic variant with significant yet opposing effects in each ethnicity, likely indicating influences from haplotype structure or gene-environment interactions. Altogether, LPA locus variants largely private to African Americans (FIN MAF < 0.1%) confer significantly greater absolute effect on standardized Lp(a) levels than variants observed in both ethnicities.
Third, WGS enables the detection of relevant genomic variants for Lp(a) which cannot be detected via WES or genotyping arrays. Furthermore, knowledge of such variants, given differential effects on circulating Lp(a) and differential effects on incident cardiovascular events, provides additional information regarding cardiovascular disease risk beyond circulating Lp(a).
It should be noted that several limitations to this work exist. First, we estimate total KIV2-CN, but individuals may have different KIV2-CN alleles on each chromosome 42 . Our CNV analysis of next-generation sequencing data relies on aggregate depth of coverage for genotyping, precluding our ability to determine allelic KIV2-CN. However, despite this, sensitivity analyses suggest that the sum of KIV2-CN alleles may similarly associate with Lp(a) across varied KIV2-CN allele combinations. Additionally, the strongest SNP in our KIV2-CN imputation model is rs10455872, whose association with KIV2-CN has been well-described previously 17 , and our KIV2-CN estimate is robustly associated with Lp(a) phenotypes as expected. Second, we only assess one non-European cohort; however, it has been observed that there are distinct Lp(a) distributions in other ethnicities which may uncover additional loci and sources of genetic heterogeneity. Third, while in silico prediction tools for non-coding regions identify putative regulatory sequence, they are limited in their ability to 1) determine disruptive mutations, and 2) link regulatory regions to genes.
In summary, we characterize the shared and unique genetic determinants of Lp(a) using whole genome sequences in African Americans and Europeans. Additional knowledge of the complement of these determinants better informs cardiovascular disease risk prediction than biomarker alone.    results from (1) aggregating rare variants in adult liver enhancers or promoters and strong DHS (P < 10 -10 ) within 3kb sliding windows, and (2) aggregating rare variants in liver enhancers grouped to LPA via the "By expression" in-silico prediction method. In the top two panels, each red diamond represents the meta-analyzed mixed-model SKAT P-value with Lp(a)-C of rare (MAF < 1%), non-coding variants overlapping liver enhancer or promoter annotations in strong DHS (P(DHS) < 1e-10) grouped in a 3kb window, before adjusting for KIV2-CN (top, "Original") and after adjusting for KIV2-CN (bottom, "Conditioned on KIV2-CN"). The horizontal red lines denote the genome-wide Bonferroni significance threshold given the number of unique windows analyzed. The horizontal gray lines denote the Bonferroni significance threshold within this 1MB region around LPA. The regions incorporated into the "By Expression" grouping to LPA are shown in aqua, along with the respective associations of rare non-coding variants in these regions before and after conditioning on KIV2-CN. Random Forest Importance  TOPMED phase 1 BAM files were harmonized by the TOPMed Informatics Research Center (Center for Statistical Genetics, University of Michigan, Hyun Min Kang, Tom Blackwell and Goncalo Abecasis). In brief, sequence data were received from each sequencing center in the form of bam files mapped to the 1000 Genomes hs37d5 build 37 decoy reference sequence.
Processing was coordinated and managed by the 'GotCloud' processing pipeline 43 . Samples with DNA contamination > 3% (estimated using verifyBamId software 44 ) and <95% of the genome covered at least 10x were filtered out. The JHS WGS used for analysis are from the "freeze 3a" genotype callsets of the variant calling pipeline performed using the software tools in the following repository: https://github.com/statgen/topmed_freeze3_calling, with variant detection performed by vt discover2 software tool 45 .
WGS for FINRISK and the Estonian Biobank were performed using the Illumina HiSeqX platform at the Broad Institute of Harvard and MIT (Cambridge, MA). Libraries were normalized to 1.7nM, constructed, and sequenced on the Illumina HiSeqX with the use of 151-bp paired-end reads for WGS and output was processed by Picard to generate aligned BAM files (to hg19) 46,47 . Variants were discovered using the Geome Analysis Tookit (GATK) v3 HaplotypeCaller according to Best Practices 48 . Finland and Estonia WGS samples were jointly called.
Whole genome sequence quality control Sample quality control: The following three approaches were used by the TOPMed Genetic Analysis Center to identify and resolve sample identity issues in JHS: (1) concordance between annotated sex and biological sex inferred from the WGS data, (2) concordance between prior SNP array genotypes and WGSderived genotypes, and (3) comparisons of observed and expected relatedness from pedigrees. Turbidity produced by the antigen-antibody complexes was measured using the Roche Modular P Chemistry Analyzer. In FIN, Lp(a) was measured from serum stored at -70°C using a commercially available latex immunoassay on an Architect c8000 system (Quantia Lp(a), Abbott Diagnostics).

Lipid phenotypes
Lp(a)-C and Lp(a) were inverse-rank normalized separately by cohort for analysis.

Conventional lipids
Conventional lipoprotein cholesterols (HDL, LDL, TG, Total Cholesterol) and proteins (ApoB, ApoAI) were measured in EST and JHS by the VAP assay (where LDL refers to directly measured LDL, and not calculated). In FIN, these lipoproteins were measured via NMR as described in the Mendelian randomization methods below. In FIN, LDL cholesterol was either calculated by the Friedwald equation when triglycerides were <400 mg/dl or directly measured. Given the average effect of statins, when statins were present, total cholesterol was adjusted by dividing by 0.8 and LDL cholesterol by dividing by 0.7, as previously done 55 . All lipids were inverse-rank normalized separately by cohort in analysis.

KIV2-CN estimation from WGS data.
Genome STRiP 21 version 2.00.1710 was used to estimate KIV2-CN in the LPA gene. Specifically, we ran Genome STRiP read-depth genotyping on the hg19 interval 6:161032614-161067851 using the following custom settings to capture an aggregate read depth signal over every base position: -P depth.minimumMappingQuality:0, without specifying any of the usual genome masks.
After genotyping, we estimated the number of KIV2 protein domains from the raw copy number estimate by dividing the VCF genotype field CNF by the info field GSM1 and then estimating the KIV2 copy number by KIV2-CN = (CNF / GSM1) * 6.354 -0.708 where 6.354 is derived from the number of full copies of the repeating unit represented on the hg19 reference genome and -0.708 is to adjust to the KIV2 units as visualized in Supplementary  Fig. 6A, removing the outermost flanking exons that are part of the KIV1 and KIV3 (which are picked up in Genome STRiP due to their homology with the exons within the KIV2 domain).

Evaluation of KIV2-CN precision.
To evaluate the precision of our measurements of KIV2 copy number, we utilized 123 pairs of siblings from JHS that were confidently IBD2 (identical by descent on both haplotypes) at the LPA locus. To identify these sibling pairs, we interrogated the hg19 interval 6:160,450,001-161,590,000 (0.5 Mb upstream and downstream of the LPA gene) and computed the concordance of SNP genotypes in this interval between all sequenced sibling pairs. We classified all sibling pairs with less than 1% genotype discordance as confidently IBD2 at the LPA locus and compared IBD2 sibling KIV2-CNs.

KIV2-CN Imputation.
We split the FIN WGS into one training dataset comprised of two thirds of the samples (1477 samples) and one validation dataset (738 samples), and used the least absolute shrinkage and selection operator (LASSO), a machine-learning regression analysis method, using variants in a 4MB window around LPA imputed with high-quality (imputation quality > 0.8) and MAF > 0.001 in the FIN dataset. After applying 10-fold cross validation to find the optimal lambda (degree of shrinkage), the LASSO model selected 61 variants which minimized the mean squared error (Supplemental Fig. 6A). These 61 variants were also used in a random forest model, reiterating the exponential decay of mean-squared error as the number of variants in the model reaches 61 (Supplemental Fig. 6B), and finding the relative importance of each variant in the model.

Principle component analysis (PCA)
To visualize PCs across all 3 cohorts against each other, a panel of approximately 16,000 ancestry informative markers 56 (AIMs) identified across six continental populations 57 was chosen to derive principal components (PCs) of ancestry for all samples that passed quality control. Principal component analysis was performed using EIGENSTRAT, using suggested quality control criteria 58 (Supplementary Fig. 3). Separately, within-cohort PCA was performed for use as covariates in analysis.

Variant annotation
Variants were annotated with Hail 49 using annotations from Ensembl's Variant Effect Predictor (VEP), ascribing the most severe, canonical consequence and gene to each variant 59 . For noncoding regions in adult liver cells (E066), we used the Reg2Map HoneyBadger2-intersect 34 at strong (P < 1x10 -10 ) DNase I hypersensitive regions (https://personal.broadinstitute.org/meuleman/reg2map/HoneyBadger2-intersect_release/). Variants overlapping putative enhancers and promoters from the 25-state chromatin model 34 at this link were annotated and used in the single variant results annotations (Supplementary  Tables 7-8) as well as grouping rare variants in the "sliding window" and "by distance" noncoding rare variant studies.

Single variant association.
Single variant analysis for EST and JHS WGS was performed using Hail's linear mixed model regression 49 for associating each variant site with inverse normal transformed Lp(a) and Lp(a)-C within each cohort. All analyses were adjusted for KIV2-CN, age, sex, and an empirically derived kinship matrix to account for both familial and more distant relatedness 60 . To create the kinship matrix, regions of high-complexity known to have high LD were removed (as in the EPACTS make-kin --remove-complex flag); these regions included: 5:44000000-52000000, 6:24000000-36000000, 8:8000000-12000000, 11:42000000-58000000, and 17:40000000-43000000. Ten-fold random down-sampling of variants was performed to further reduce variant counts for fast processing-time.
For the FIN imputation dataset, single variant analysis was performed using SNPTEST (v2.5.2), using KIV2-CN, age, sex, fasting > 10hr, and adding PC1-10 as covariates to account for population structure due to absence of kinship matrix.
To ensure robust results, we only performed single variant analysis for variants with a MAF > 0.001 within either cohort. Summary statistics for JHS and FIN for Lp(a) and JHS and EST for Lp(a)-C, for the corresponding inverse-rank normalized phenotypes, were meta-analyzed across cohorts using METAL 61 , while also calculating heterogeneity statistics. Statistical significance alpha of 5x10 -8 was used for these analyses.
Additionally, for the LPA locus, iterative conditional association analysis was performed by cohort. Iterative conditioning was performed until P > 5x10 -8 was attained.

Heritability analyses.
Heritability analyses in EST WGS (for Lp(a)-C) and JHS WGS (for both Lp(a) and Lp(a)-C) were performed using Hail's linear mixed model regression heritability estimate 49 , described here https://hail.is/hail/hail.VariantDataset.html?highlight=lmm#hail.VariantDataset.lmmreg. Several filters were applied before variants were used in the kinship matrix. First, genome-wide variants underwent two-fold LD pruning as previously described via BOLT-REML 62 , using variants with MAF > 0.001 and missingness < 1% with maximum LD r 2 = 0.9 (PLINK 63 commands used: -maf 0.001 --geno 0.01 --indep-pairwise 50 5 0.9). Regions of high-complexity were removed as previously described for single variant analysis. Ten-fold random down-sampling of variants was performed to further reduce variant counts for feasible analysis processing-time. For the heritability estimates provided, 6,370,696 variants were used towards the kinship matrix in EST Lp(a)-C analysis, 1,897,407 variants in JHS Lp(a)-C analysis, and 1,894,291 variants in the JHS Lp(a) analysis. Baseline covariates used in the model, performed separately by cohort, included age, sex, fasting > 10h, and for EST, sequencing batch. A separate heritability estimate was also derived additionally conditioning on KIV2-CN.
For the FIN imputation dataset, variants were similarly limited, filtering to variants with MAF > 0.001, imputation quality > 0.8, and applying two-fold LD-pruning and removal of complex regions as described above (though the ten-fold down-sampling was not applied to keep the variant count on the same order of magnitude as in the WGS heritability analyses). A total of 3,088,864 variants were used towards heritability analysis, which was performed using BOLT-REML. Covariates used in the analysis included age, sex, fasting > 10hr, and PC1-10. A separate heritability estimate was also derived additionally conditioning on KIV2-CN. For Lp(a), heritability analysis additionally conditioning on both KIV2-CN and the KIV2-CN-independent GRS using in Mendelian randomization was performed. BOLT-REML was also applied towards the Lp(a) heritability analysis in JHS, arriving at the same heritability estimates as Hail (data not shown).

KIV2-CN modifier analysis.
Variant-by-KIV2-CN interaction analysis in the WGS was performed at a ~4MB window (6:158532140-162664257) around LPA to identify variants which modify the relationship between directly genotyped KIV2-CN and Lp(a)-C (for EST and JHS) and Lp(a) (for JHS only). Variants with minor allele count > 20 (by cohort) were included in analyses. The following interaction model was performed: Lp(a)-C ~ KIV2-CN + Variant + KIV2-CN x Variant + covariates Where the interaction effect and p-value corresponds to the term: "KIV2-CN x Variant". Cohortspecific analyses were performed and for Lp(a)-C, EST and JHS interaction results were metaanalyzed using METAL 61 . Using the full interaction results, three top modifier variants were identified (rs13192132, rs1810126, and rs1740445) that were genome-wide significant upon meta-analysis (P < 5x10 -8 ), in linkage equilibrium (r 2 < 0.1) across both ethnic backgrounds, and had replicating interaction effect directions in both ethnicities. To determine the cohort-specific Bonferroni significance threshold, LD clumping was performed on the full interaction results separately by cohort using the following PLINK 63 flags: --clump-kb 500 --clump-p1 1 --clump-p2 1 --clump-r2 0.25. In JHS, 1373 LD-pruned variants were identified, leading to a significance threshold of P = 3.64x10 -5 . In EST, 566 LD-pruned variants were identified, leading to a significance threshold of P = 8.83x10 -5 . Clumped variants with interaction p-values surpassing the Bonferroni threshold are provided by cohort and phenotype in Supplementary Table 14. For each of these variants, the statistics for the KIV2-CN term, the Variant term, and the KIV2-CN x Variant term from the interaction model, and the variant associations with KIV2-CN and Lp(a)-C (or Lp(a)) are provided in Supplementary Tables 15-17 (where each Supplementary Table  reflects a different cohort and phenotype combination). Overlap with methylation and acetylation marks was visualized using data from Roadmap for E066 adult liver cells at http://egg2.wustl.edu/roadmap/data/byFileType/alignments/consolidated/. Liver ATAC-seq data was downloaded from the ENCODE data portal (accession ENCFF893CSN). FASTQ files were adapter-trimmed and aligned to hg19 with bowtie2, and duplicates reads and reads with MAPQ < 30 were removed.
Previous publications of variant-by-variant interactions have recommended performing sensitivity analyses to ensure significant interactions identified are not 1) due to the variants being in LD on the same haplotype and 2) mitigated by a separate third variant which explains the entire association 32,33 . In particular, the most recent study by Fish et al. 28 recommended that variant-by-variant interactions be performed using un-correlated variants (LD r 2 < 0.6). Thus, we checked the correlation of each of the three top identified variants with KIV2-CN by cohort (Supplementary Table 12), finding that these variants are indeed not correlated with KIV2-CN (Pearson correlation r 2 < 0.1). Furthermore, variants not associated (P > 0.05) with the phenotype are suggested to be removed, under the hypothesis that they may represent weak marginal effects from a true underlying interaction. Indeed, our three top Lp(a)-C interaction variants are all individually associated with Lp(a)-C (Supplementary Table 12). Lastly, conditional analysis has been suggested to ensure that the interaction model is not mitigated by a separate third variant that explains the interaction. Thus, we performed conditional analysis on the top 3 interaction models, conditioning on the previously identified variants from single variant analysis (reported in Supplementary Table 9) found to be conditionally independently associated with Lp(a)-C in each cohort. As seen in Supplementary Table 13, conditional analysis does not fully mitigate any of the identified interaction associations.

Rare variant association analyses (RVAS).
Coding and non-coding grouping schemes Please refer to the Supplementary Text for details on the coding and non-coding grouping schemes used.

Statistical Analysis
We tested the association of the aggregate of the aforementioned groupings with each lipid trait using the mixed-model Sequence Kernal Association Test (SKAT) implementation in EPACTS to account for bidirectional effects. 60 Analyses were adjusted for age, sex, fasting > 10hr, sequencing batch (just used in Estonia), and empiric kinship. Groups with at least 2 rare variants and combined MAF > 0.001 across all aggregated variants in a given cohort were included in meta-analysis. P values were meta-analyzed using Fisher's method. Statistical significance for each RVAS test was based on the number of groups tested and is provided in the headers of Supplementary Tables 16-25.

Mendelian randomization Genetic Instruments
We developed three genetic instruments per cohort. The first instrument used was a genetic risk score, "GRS," comprised of variants in a ~4MB window around LPA (6:158532140-162664257) with sub-threshold significance (p-value < 1x10 -4 ), using variant effect sizes from the KIV2-CN conditioned single variant analysis and performing LD clumping in plink using the following parameters: --clump-kb 500 --clump-p1 0.0001 --clump-p2 1 --clump-r2 0.25. This resulted in 399 variants for Lp(a) GRS in FIN, 235 variants for Lp(a) GRS in JHS, 39 variants for Lp(a)-C GRS in JHS, and 49 variants for Lp(a)-C GRS in EST (Supplementary Table 26-28). The second instrument used was a "KIV2-CN" score using the directly genotyped or imputed KIV2-CN. The third instrument used was a combined "GRS+KIV2-CN" score combining scores from (1) and (2). Each of the three scores were inverse rank normalized and adjusted such that 1 unit increase in the score is equal to 1SD increase in Lp(a) (or Lp(a)-C, depending on how the instrument was adjusted). The multiplicative factors used to adjust each score are provided in Supplementary Table 29. Note: the genetic instruments for EST were not used in MR but are provided as a reference for European Lp(a)-C genetic instruments to compare with those from the JHS African Americans.

Phenotypes
Please refer to Supplemental Text for details on incident events and subclinical measures used.

Statistical analyses
For incident clinical events, a cox proportional hazards test was performed, finding the association between each incident event and each of the genetic instruments as well as observational Lp(a). For the quantitative subclinical measures, linear regression was performed, finding the association between each inverse-rank normalized phenotype and each of the genetic instruments as well as inverse-rank normalized Lp(a) and Lp(a)-C (where available). Covariates used in all analyses included the first 5 principal components of genetic ancestry, age, sex, if the individual was fasting > 10hr. To find the p-value of difference between GRS and KIV2-CN or the genetic instrument and observational phenotype, the following was used: P(difference)=2*pnorm(-abs(Z)), where Z=b1−b2/(√((SEb1)^2+(SEb2)^2)) Statistical significance was defined for the 9 FIN incident clinical events and 2 JHS subclinical atherosclerosis traits using a Bonferroni significance threshold was based on the number of phenotypes (P = 0.0056 and 0.025, respectively).