Comprehensive genetic analysis of the human lipidome identifies loci associated with lipid homeostasis with links to coronary artery disease

We integrated lipidomics and genomics to unravel the genetic architecture of lipid metabolism and identify genetic variants associated with lipid species putatively in the mechanistic pathway for coronary artery disease (CAD). We quantified 596 lipid species in serum from 4,492 individuals from the Busselton Health Study. The discovery GWAS identified 3,361 independent lipid-loci associations, involving 667 genomic regions (479 previously unreported), with validation in two independent cohorts. A meta-analysis revealed an additional 70 independent genomic regions associated with lipid species. We identified 134 lipid endophenotypes for CAD associated with 186 genomic loci. Associations between independent lipid-loci with coronary atherosclerosis were assessed in ∼456,000 individuals from the UK Biobank. Of the 53 lipid-loci that showed evidence of association (P < 1 × 10−3), 43 loci were associated with at least one lipid endophenotype. These findings illustrate the value of integrative biology to investigate the aetiology of atherosclerosis and CAD, with implications for other complex diseases.

L ipids comprise thousands of individual species, spanning many classes and subclasses. Genome-wide association studies (GWAS) of lipid species can provide novel insights into human physiology, inborn errors of metabolism and mechanisms for complex traits and diseases. Dyslipidaemia, a broad term for disordered lipid and lipoprotein, is a major risk factor for atherosclerotic cardiovascular disease and a therapeutic target for the primary and secondary prevention of coronary artery disease (CAD) 1,2 . Defined by elevated low-density lipoprotein (LDL) cholesterol and triglycerides with decreased high-density lipoprotein (HDL) cholesterol -these 'clinical lipid' measures provide only a partial view of the complex lipoprotein structures and their metabolism. Lipidomic technologies can now measure hundreds of individual molecular lipid species that make up the human lipidome, providing a more complete snapshot of the underlying lipid metabolism occurring within an individual.
Genome-wide association studies have uncovered thousands of genetic variants linked to traditional clinical lipids (LDL-cholesterol, HDL-cholesterol, triglycerides) 3,4 . Genes implicated at these loci show functional links between lipid levels and CAD 5 . The human lipidome is heritable and predictive of CAD, furthering our understanding of the biology of CAD 6 . The individual lipid species that make up the lipidome are biologically simpler measures that may reside closer to the causal action of genes, making them valuable endophenotypes for gene identification. Genetic interrogation of the human lipidome may therefore reveal further genetic variants that play a role in lipid metabolism and CAD.
Compared with other complex traits, relatively few genomic loci have been associated with lipid species in GWAS of the human serum/plasma lipidome [7][8][9][10][11][12][13][14][15][16][17] , although these studies have generally interrogated a restricted subset of lipid species. The serum lipidome is complex and consists of many isobaric and isomeric species that share elemental composition but are structurally distinct. Existing lipidomic studies often employ techniques that provide poor resolution of these species, limiting their biological interpretation. We have recently expanded our lipidomic platform to better characterise isomeric lipid species, now measuring 596 lipids from 33 classes 18 . Our methodology focuses on the precise measurement of a broad number of lipid and lipidlike compounds, utilising extensive chromatographic separation.
Here, we report a GWAS of 596 targeted lipid species (across 33 lipid classes) in an Australian population-based cohort of 4492 individuals, validation of significant loci in two independent cohorts and a meta-analysis of all results. Using robust procedures, we disentangle the genetic effects of lipid species from lipoproteins. Integration of multiple datasets, including expression quantitative trait loci (eQTL), methylation QTL (meQTL), and protein QTL (pQTL), and in-depth analysis of significant loci highlights putative susceptibility genes for CAD. We demonstrate robust associations between lipid species and CAD using genetic correlations, polygenic risk scores and phenotypic associations. Many lipid-associated loci show pleiotropy with CAD in colocalisation analysis. Assessment of loci with coronary atherosclerosis in 456,486 UK Biobank participants reveals genetic associations, independent of clinical lipid measures.

Results
Lipidomic profiling. We measured 596 individual lipid species within 33 lipid classes, covering the major glycerophospholipid, sphingolipid, glycerolipid, sterol, and fatty acyl classes in serum and plasma samples from three independent cohorts (Supplementary Table 1, Supplementary Data 1, 2). Assay performance was monitored using pooled plasma quality control samples, enabling the determination of coefficient of variation (%CV) values for each lipid class and species. In the Busselton Health Study (BHS) discovery cohort, the median %CV was 8.6% with 570 (95.6%) lipid species showing a %CV less than 20%. All lipids were measured in every individual, with the exception of three values which were below the limit of detection. The lipidomic analysis of the Australian Imaging, Biomarker, and Lifestyle (AIBL) and Alzheimer's Disease Neuroimaging Initiative (ADNI) validation cohorts showed similar assay performance 19 .
Discovery of genome-wide association study of the human serum lipidome. We performed a GWAS of the human serum lipidome (Fig. 1) in the BHS discovery cohort (4492 individuals of European ancestry) followed by validation against a meta-analysis of the two validation cohorts (ADNI and AIBL; 670 and 895 individuals of European ancestry, respectively). We further performed a discovery meta-analysis of all three studies. All summary-level statistics are available at our PheWeb 20 data portal (https://metabolomics.baker.edu.au/).
SNP-lipid species associations are largely independent of clinical lipid measures. We performed an additional GWAS, adjusting for clinical lipids (total cholesterol, HDL-cholesterol, triglycerides), to identify SNP-lipid species associations independent of clinical lipid traits (Adjusted Discovery GWAS). The median genomic inflation factors were 1.01 (range: 0.99-1.03), and 1.01 (range: 1.00-1.03) for lipid species and classes, respectively; with heritability estimates moderately correlated (r = 0.51) with lambda estimates, for each of the lipid species and classes ( Supplementary  Fig. 2b). Adjustment for clinical lipids identified 2424 independent SNP-lipid species associations, and 124 independent SNP-lipid class associations (Supplementary Data 6). There were 1545 SNP-lipid species and 72 SNP-lipid class associations that were significant in both the unadjusted and the adjusted analyses, with an r 2 between beta coefficients of 0.93 (Fig. 3). Adjustment for clinical lipids identified an additional 879 significant SNP-lipid species associations, for 387 lipid species. However, 726 SNP-lipid species associations previously associated in the unadjusted analysis, fell below our significance threshold. Approximately 24% of these lipid species are members of the cholesteryl ester (n = 93), and  Data 6). We also identified an additional 52 significant SNP-lipid class associations, particularly for trihexosylceramide (6 associations) and hexosylceramide (6 associations) classes. However, 60 SNP-lipid class associations fell below our significance threshold, with the classes diacylglycerol, G M3 ganglioside, lysophosphatidylcholine, lysoalkenylphosphatidylethanolamine, phosphatidylcholine, alkylphosphatidylethanolamine, alkenylphosphatidylethanolamine, phosphatidylserine, sphingomyelin, and triacylglycerol no longer associated (P < 5.0 × 10 −8 ) with any genetic variants.
Results from multi-trait conditional and joint (mtCOJO; Supplementary Data 3, 4) analyses using clinical lipid traits (total cholesterol, HDL-cholesterol, triglycerides) GWAS results from the UK Biobank, to minimise the risk of pleiotropy/collider bias introduced by heritable covariates, were largely consistent with those of the clinical lipidadjusted analysis (r 2 of beta coefficients = 0.91, Supplementary Fig. 3).
A comparison of the clinical lipid-adjusted Z-scores and mtCOJO Zscores identified three gene regions (APOE, FADS1/FADS2/FADS3, TMEM229B/PLEKHH1) with substantial differences (P < 1.0 × 10 −4 ) indicating the possibility of biased effect measures for the adjusted analyses in these regions. Overall, results were overwhelmingly consistent between mtCOJO and clinical lipid-adjusted analyses.
Conditional analysis (sequentially conditioning on the lead SNP) identified 386 secondary signals (across both unadjusted and clinical lipid-adjusted analyses), associated with 163 lipid species/classes (Supplementary Data 7). Two gene regions, LIPC and ATP10D, each contained five independent signals (P CONDITIONAL < 5.0 × 10 −8 ). The LIPC genomic region was strongly associated with phosphatidylethanolamine species and class, while ATP10D was associated with hexosylceramide species and class. The SPTLC3 region harboured four independent signals, strongly associated with sphingolipids containing a d19:1 sphingoid base. Associations validated in independent cohorts. For each lipid, significantly associated SNPs were linkage disequilibrium (LD)clumped to remove variants in LD (r 2 > 0.1). We assessed whether the 2411 independent lipid species/class associations identified in the BHS discovery cohort (unadjusted) were validated within a combined ADNI and AIBL validation cohort meta-analysis (Validation meta-analysis). There were 273 SNP-lipid associations not available for validation in the meta-analysis, either due to lipids not available in the ADNI and AIBL cohorts; missing SNPs (and proxies) on the imputation panel; or monomorphic/ very-low-frequency MAF in ADNI/AIBL. Therefore, we attempted to validate the remaining 2137 significant SNP-lipid associations. We considered a SNP-lipid association to be validated if (i) the SNP was significantly associated (P < 5 × 10 −8 ) in the unadjusted BHS discovery GWAS; (ii) the direction of effect was concordant between the validation meta-analysis and the BHS discovery analysis; and (iii) the association was nominally significant (P < 0.05; less conservative) or reached the Bonferroni significance threshold (P < 2.34 × 10 −5 ) in the validation metaanalysis. We identified 1474 (69.2%) SNP-lipid associations that reached nominal significance (P < 0.05), and 644 (30.1%) reaching Bonferroni-corrected significance (Supplementary Data 8). Almost all associations (>99%) had the same direction of effect, with a very strong correlation between validation meta-analysis and significant (P < 5 × 10 −8 ) discovery effect sizes (r 2 = 0.53 overall, and r 2 = 0.80 for SNPs with MAF > 0.05 in the BHS; Supplementary Fig. 4).
Discovery meta-analysis. At a stringent significance threshold of P < 3.47 × 10 −10 (5 × 10 −8 /144 effective lipid dimensions), the meta-analysis of all three studies identified 65,563 significant SNP-lipid associations (Supplementary Data 9), involving 499 lipid species/classes and 7600 SNPs. We identified 5658 new associations not observed in the BHS discovery GWAS alone, involving 352 lipids and 2914 SNPs. The majority of these (n = 5543; 98%) showed some evidence of association in the BHS discovery GWAS (5 × 10 −8 < P < 5 × 10 −4 ). However, 89 associations were not nominally significant (P > 0.05) in the BHS discovery GWAS, indicating that the effects observed in the metaanalysis were largely due to the AIBL and ADNI samples.
Defining independent loci and genes controlling lipid homeostasis. For each lipid, significantly associated SNPs were LDclumped to remove variants in LD (r 2 > 0.1). Lead variants from the BHS discovery GWAS (adjusted and unadjusted) and conditional analyses, were clumped if the index SNPs were in linkage disequilibrium (r 2 > 0.1). We identified 3361 independent locilipid associations, involving 610 lipid species/classes, each associated with between 1 and 30 independent SNPs. To identify genomic regions associated with lipid metabolism, a single dataset was produced by identifying the smallest P-value for each SNP across all lipids and analyses. LD-clumping of this dataset resulted in 667 independent genomic regions (Supplementary Data 10; filtered by column 'Lead SNP in BHS GWAS'). This procedure was repeated, including SNP-lipid associations passing our discovery meta-analysis significance threshold (P < 3.47 × 10 −10 ), resulting in 682 independent genomic regions (Supplementary Data 10; filtered by column 'Lead SNP in Discovery-Meta analysis'), 612 of which overlap with those identified in BHS alone (737 in total). The variants within a genomic region and the lipids associated with those variants are collectively termed a genetically influenced lipotype.
Identification of candidate genes within loci. Using the prioritisation of candidate causal Genes at Molecular QTLs (ProGeM) framework 22 to prioritise candidate causal genes, biologically plausible genes were identified in 573 of the 737 genomic regions (Supplementary Data 10-12), with an overlap of 498 genomic regions between genetic-based (bottom-up) and biological knowledge (top-down) based approaches. A total of 2321 SNPgene pairs were identified, where the gene has previously been implicated in the regulation of metabolism or a molecular phenotype (Fig. 4a). Of these genes, 970 (41.8%) are present in lipidmetabolism-specific databases.
Comparing our lead SNPs and proxies against previously published eQTL associations, 2058 SNP-gene pairs were identified (Fig. 4b). Published meQTL associations revealed 879 SNPgene pairs, 587 (66.8%) of which replicated eQTL associations. In contrast to eQTL and meQTL, the overlap of published pQTL associations was much less evident, with only 16 SNP-gene pairs identified (Fig. 4c). In total, 18 SNP-gene pairs were identified with evidence from the closest gene, protein consequences, eQTL and meQTL. The overlap of top-down and bottom-up candidates supported the annotation of 1031 SNP-gene pairs.
Most SNP-lipid species associations have not been previously reported. For each of the 737 lead variants, we assessed whether they (or their proxies) had been previously reported as being associated with any lipid or metabolite. From 35 previous metabolomic/lipidomic studies (Supplementary Table 2), 228 lead variants (31%) had been reported as associating with a lipid or metabolite, resulting in 509 unreported genetically influenced lipotypes (Supplementary Data 13).
Genetically influenced lipotypes overlap with coronary artery disease and cardiovascular disease-related loci. We looked at the overlap between 10 hard cardiovascular disease (CVD) endpoints from the GWAS Catalog and the lead SNP (or proxy) from each of the 737 regions, identifying a total of 23 lead SNPs, or their proxies, associated (P < 5 × 10 −8 ) with 10 hard CVD endpoints (Supplementary Data 14). The most frequently overlapping GWAS Catalog hard CVD endpoints were CAD (n = 14 SNPs), CVD (n = 10 SNPs), coronary artery calcification (n = 8 SNPs), and myocardial infarction (n = 8 SNPs). Three additional lead SNPs were associated with CAD in the CARDIoGRAMplusC4D and UK Biobank meta-analysis. Eighty-four lead SNPs were associated with 101 CVD-related traits, including chronic kidney disease (n = 18,) C-reactive protein (n = 14), metabolic syndrome (n = 12), body mass index (n = 8), and systolic blood pressure (n = 4). As expected, lead SNPs frequently overlapped with 186 lipid-related traits, with 99 lead SNPs or proxies observed in the GWAS Catalog.
Next, we estimated the genetic correlation between lipid species/classes and CAD. Using linkage disequilibrium score  Overall, using a significance threshold of P < 0.05, we identified 134 lipid species/classes that were significantly associated in each of the three analyses-association with incident CVD (phenotypic), CAD polygenic risk (PRS), and genetic correlation. Importantly, these lipid species/classes showed concordant directions of effects in all three analyses, defining these lipid species/classes as lipid endophenotypes for CAD.
Co-localisation analysis identified shared causal variants for coronary artery disease. We performed pairwise co-localisation analysis, within each QTL, between lipid species and CAD to assess whether they share common variants (Supplementary Data 16). We identified evidence of 43 shared variants for CAD and any lipid species (Table 1; Supplementary Note 1; Fig. 6). The strongest evidence was between CE(18:1) and CAD at the APOE P P P Fig. 5 Genetic and phenotypic associations of the lipidome with coronary artery disease. Forest plots of lipid-coronary artery disease; circles represent effect sizes and horizontal bars represent ±standard errors. a Phenotypic associations (logistic regression; two-sided) between lipid species and incident coronary artery disease in the BHS cohort (551 cases and 3703 controls), adjusted for age, sex, and the first 10 genomic principal components. b Association of lipid species with polygenic risk for coronary artery disease. Individuals in the discovery cohort (n = 4492) were assessed for risk using the metaGRS polygenic score, consisting of ∼1.7 million genetic variants. Linear regressions (two-sided) were performed to test the association between an individual's polygenic score and lipid species concentrations, adjusting for age, sex, and the 10 first principal components. c Genetic correlations of lipid species (n = 4492) against coronary artery disease (meta-analysis of CARDIoGRAMplusC4D and UK Biobank; 122,733 cases and 424,528 controls), performed with Linkage Disequilibrium Score Regression (LDSC; v1.0.1). Nominally significant and Benjamini-Hochberg corrected significance is indicated by light-and dark-grey circles, respectively. The 10 most significant lipid species are highlighted in blue, red, or green. Genetically influenced lipotypes were associated with coronary atherosclerosis in the UK Biobank. To further define pleiotropic effects between lipid species and CAD, we performed association analysis of 737 lead SNPs and coronary atherosclerosis in 456,486 participants of the UK Biobank (Supplementary Data 17). Eleven of the lipid-associated SNPs had genome-wide significant (P < 5 × 10 −8 ) associations with coronary atherosclerosis. Adjustment for clinical lipids (total cholesterol, HDL-cholesterol, triglycerides) increased this number to 17; however, adjustment for clinical lipids using mtCOJO, which is free of the bias introduced by heritable covariates, resulted in only 14 associations with coronary atherosclerosis. Importantly, 11 of these associations were sub-genome-wide significant in the initial analysis, suggesting the presence of strong pleiotropy in these regions. After comparing effect estimates between the standard GWAS and mtCOJO clinical lipid-adjusted analysis, eight lead SNPs (with P < 5 × 10 −8 in the standard GWAS) showed the opposite directions of associations. These regions contain prototypical lipid/lipoprotein regulating genes, such as APOE, CETP, LDLR, and PCSK9. Interestingly, for all lead SNPs with marginal association with coronary atherosclerosis (P < 1.0 × 10 −3 ; with and without conditioning on clinical lipids), 43 (81%) were associated with lipid endophenotypes for CAD.

Discussion
By integrative analysis of the human lipidome and CAD phenotypes, we have identified candidate risk genes for CAD, providing evidence for the role of these lipid species in the development of CAD. Our high-resolution genome-wide association analyses of the human lipidome have identified 737 independent genomic regions associated with lipid metabolism, of which 509 represent genetic loci not previously associated with lipid metabolism. This is a substantial increase over previous studies with similar or larger sample sizes 7,10,24 . Our expanded lipidomic platform utilises extensive chromatographic separation to increase the diversity of measured lipid species and distinguish lipid isomers and isotopes over those measured in previous studies. Combined with the extended pedigree study design of the BHS, we identify many rare/low-frequency variants with large effect sizes.
The majority (69.2%) of the 2137 SNP-lipid associations identified in our discovery GWAS were validated in a metaanalysis of two independent cohorts. Adjustment for clinical lipids (both as standard covariates and mtCOJO analysis), confirmed that the majority of SNP-lipid associations observed were not acting directly through clinical lipids (i.e. associations were not the result of mediated pleiotropy). Discovery meta-analysis of all three studies identified an additional 5658 SNP-lipid associations (from 122 loci)-involving 352 lipid species-that were not identified in the BHS discovery GWAS alone. Overall, nearly all lipid species (95%) had at least one genome-wide significant SNP association, highlighting the genetic contribution to lipid metabolism and homeostasis.
We identified 134 lipid species/classes showing consistent and significant associations with CAD when assessed with genetic correlation, phenotypic association, and PRS association. These lipids are potential endophenotypes for CAD, which can facilitate the identification of susceptibility genes. Of those loci associated with this subset of lipids, we identified 32 regions with evidence of shared genetic effects (co-localisation) with lipids and CAD. We assessed the association of lipid-loci with coronary atherosclerosis in~456,000 individuals of the UK Biobank, considering the independence of clinical lipid traits. A total of 53 loci showed evidence of association (P < 1 × 10 −3 ) in at least one analysis. Of these, 43 loci were associated with at least one of the 134 lipid species identified above.
Our lipidomic profiling provided improved resolution and precision in the measurement of lipid species. Prior studies examined lipid phenotypes that were mixtures of similar, but distinct species; lacked structural characterisation of lipid species, or were contaminated through isotopic overlap. Many of the associations between lipid species and prototypical lipid regulating genes observed in our study-such as FADS1/FADS2, APOE, and LDLR-have been reported in earlier GWAS [7][8][9][10][11][12][13][14][15]17,24 . With our expanded lipidomic profile, we have built on these earlier studies, identifying many new loci associated with lipid species and classes. Previous studies, containing mis-annotation of lipid species, report associations between SNPs in the FADS region and sphingomyelin species as containing a mono-unsaturated (16:1, 18:1, or 20:1) n-acyl chain 8,12 . Here, we show the associations of sphingomyelins with SNPs in the FADS gene region are disproportional with species containing the d18:2 sphingoid base. This is supported by recent experimental evidence, suggesting FADS3 is a ceramide-specific desaturase, targeting the sphingoid bases 25,26 . Early dogma suggested the dominant isoform of sphingomyelins was d18:1 leading to the aforementioned annotations (i.e. SM(d18:1/16:1)). However, chromatographic separation and characterisation identify the predominant species as SM(d18:2/16:0) 18 . While these associations are not novel per se, the additional specificity of our lipidomics methodology extends across all lipid species and classes, leading to greater confidence in defining true relationships.
We also observed strong associations between specific sphingolipid isoforms and variants in the SPTLC3 gene region. Serine palmitoyltransferase long chain base subunits (SPTLC) are a series of enzymes responsible for the de novo synthesis of sphingolipids through condensation of serine with palmitoyl-CoA. Three mammalian isoforms have been identified (SPTLC1-3), which form a heterodimer in situ, of which SPTLC1 is requisite for function 27 . The subunit SPTLC3 was discovered more recently and was thought to facilitate the synthesis of shorter-chain sphingolipids 28 . However, we identify strong associations of SNPs in the SPTLC3 gene region with atypical sphingolipids, containing a d19:1 sphingoid base (Supplementary Data 4). This supports the recent report that SPTLC3 has broader substrate specificity, with capacity to metabolise branched isomers of palmitate (anteiso-branched-C16) 27 leading to the synthesis of d19:1 sphingoid bases. The atypical structure of these sphingolipids has previously led to mis-annotation resulting in reported associations of the SPTLC3 gene with hydroxylated sphingomyelins 10,13,14 , when hydroxylated sphingomyelins in the n-acyl chain are unlikely to exist in human plasma 29 .
Many genes associated with CAD risk were identified as also associated with lipid species and classes, including HMGCR, PCSK9, and LDLR (Table 1), thereby providing new avenues for investigation into mechanistic pathways. We also provide new evidence to support potential roles for genes not reaching genomewide significance and identify possible mechanisms linking these genes to CAD; we identified strong associations between ten independent signals in the LIPC/ALDH1A2/AQP9 gene region with phosphatidylethanolamine, lysophosphatidylethanolamine, and phosphatidylglycerol lipid species independent of clinical lipids. Two lead variants were associated with functional consequences, including a start loss for gene ALDH1A2 and a missense variant for   (Fig. 7a). Visual comparison of regional association plots and SNP effect scatter plot supports consistent effects (Figs. 7b, d). We selected independent SNPs (r 2 < 0.05) in the LIPC gene region associated with the phosphatidylethanolamine class and assessed the similarity of effects with CAD (Fig. 7d). Inverse-variance weighted metaanalysis of SNP effects using Generalised Summary-data-based Mendelian Randomisation (GSMR) support strong pleiotropy consistent with a causal relationship (Fig. 7e). Angiopoietin-like 3 (ANGPTL3) has been implicated in CAD risk, with a deficiency being associated with cardioprotective effects [32][33][34][35] . ANGPTL3 acts as an inhibitor to two other lipases, lipoprotein lipase (LPL)-a rate-limiting enzyme in the clearance of triglyceride-rich lipoproteins-and the phospholipase endothelial lipase (LIPG) 36 . Indeed, loss of function mutations in the ANGPTL3 gene has been linked to hypolipidemia 34 . Most previous research has focused on the lipoprotein modulating effects of ANGPTL3 through LPL. However, a recent Mendelian Randomization analysis, using NMR lipoprotein profiling, revealed a divergence in the metabolic effects of genetic variants in ANGPTL3 and LPL 37 . We recently identified a rare frameshift deletion (rs398122988) associated with decreased ANGPTL3 protein levels in extended Mexican American families 38 ; the variant was also associated with a~1.3 standard deviation decrease in phosphatidylinositol species. In this study, we validate this observation, with SNPs in the ANGPTL3 gene region associated with a decrease in phosphatidylinositol species, again these associations persisted even after adjustment for clinical lipids (total cholesterol, HDL-cholesterol, triglycerides). Interestingly, we also observe associations of phosphatidylinositol species with SNPs in the LIPG region, suggesting a larger metabolic effect of the ANGPTL3-LIPG pathway, at least in fasting subjects. Commonly, phosphatidylinositol species have been studied for their intracellular messaging roles following phosphorylation of the inositol ring by kinases, including PI-3-kinase, which lead to downstream cardio-metabolic effects 39 . However, the role of phosphatidylinositol species in CVD risk is still largely unknown. We have previously observed the change in the ratio of phosphatidylinositol to phosphatidylcholine species as a predictor of CVD risk reduction from statin treatment 40 . Further work is now required to unravel the role of phosphatidylinositol in mediating the effect of these genes on CVD risk.
Limitations to the study warrant mention. First, our samples were restricted to individuals with European ancestry, complicating generalisability to individuals of non-European ancestry. Previous studies 24,41,42 have shown conservation of lipidmetabolism genetics across different ancestries; however, future studies in non-European ancestry individuals are required. Second, adjustment for many combinations of lipid-lowering medications and doses is not practical. As a majority of lipid-lowering medications were statins and the assumption that medication dose was titrated, a single lipid species/class correction was applied to all individuals taking these medications. However, as only 2% of the BHS discovery cohort were taking lipid-lowering medications, the putative impact is unlikely to be large. A larger proportion of the two validation samples were taking lipidlowering medications (ADNI: 49%; AIBL: 22%). Nonetheless, a substantial number of our associations were validated; therefore, the single adjustment was also unlikely to have greatly affected our results. Third, we did not have access to an independent validation sample for our discovery meta-analysis. We consider the discovery meta-analysis to be exploratory, with the potential to provide evidence of associations that can be followed up in future studies. Finally, lipidomic profiling was performed on serum in the discovery BHS and validation ADNI cohorts, whereas the validation study AIBL was plasma. While the absolute concentration of some blood metabolites may differ between plasma and serum, measurements are generally highly correlated between matrices 43 . We have previously shown lipid associations are consistent between serum and plasma 19 .
In summary, using our expanded lipidomic profiling platform, we have investigated the largest number of targeted lipid species in a GWAS, and have reported significant genetic associations with lipid species that have not previously been reported in any genetic association studies to date. Our strategy to use lipid species as endophenotypes in the search for CVD genes is the tip of the iceberg. We have previously reported phenotypic associations of lipid species with other complex traits, including diabetes 44 , Alzheimer's disease 19 , and atrial fibrillation 45 ; we believe the same integrative genomics approach may now be used to elucidate the mechanistic underpinnings of lipid metabolism in these and other complex diseases. These data now represent a valuable resource for the future exploration of the genetic analysis of the lipidome to identify lipid metabolic pathways and regulatory genes associated with complex disease and identify new therapeutic targets. To this end we provide all summary statistics and an online searchable resource of association plots of lipid species and classes with genetic variants and regional association plots with individual lipid species and classes (https://metabolomics.baker.edu.au/).

Methods
Study populations. Participants in the discovery cohort (n = 4492) were all participants of the 1994/95 survey of the long-running epidemiological study, the BHS, for whom genome-wide SNP data, extensive longitudinal phenotype data, and blood serum were available. The BHS is a community-based study in Western Australia that includes both related and unrelated individuals (predominantly of European ancestry) and has been described in more detail elsewhere [46][47][48] . Informed consent was obtained from all participants and the 1994/95 health survey was approved by the University of Western Australia Human Research Ethics Committee (UWA HREC). The current study was also approved by UWA HREC (RA/4/1/7894) and the Western Australian Department of Health HREC (RGS03656).
The two validation cohorts used in this study were the AIBL study 49 and the ADNI study 50 ; both of which were established to discover biomarkers, health and lifestyle factors for the development, early detection, and tracking of Alzheimer's disease. The AIBL study is a longitudinal study which recruited 1112 individuals aged over 60 years within Australia. Time points for blood/data collection were every 18 months from baseline. For each individual, lipidomic data obtained from the earliest blood collection was used. At baseline, 768 individuals were characterised as cognitively normal, 133 with mild cognitive impairment and 211 with Alzheimer's disease. The ADNI study is a longitudinal study, starting in 2004 and recruited 800 individuals at baseline, from sites across the United States of America and Canada. Serum samples obtained at baseline were analysed. Study data analysed here were obtained from the ADNI database, which is available online (http://adni.loni.usc.edu/). For the lipidomics analysis, the AIBL study was deemed low risk (The Alfred Ethics Committee; Project 183/19), and the ADNI study was deemed 'research not involving human subjects' (Duke Institute review board; ID:Pro00053208).
Lipidomic profiling. Targeted lipidomic profiling was performed using liquid chromatography coupled electrospray ionisation-tandem mass spectrometry from fasting blood serum (BHS discovery), fasting blood plasma (AIBL validation), and a combination of fasting and non-fasting blood serum (ADNI validation; 90% fasting, 10% non-fasting). We quantified 596 lipid species (from 33 lipid classes) in the BHS discovery cohort, 573 lipid species (from 32 lipid classes) in the validation AIBL cohort, and 581 lipid species (from 32 lipid classes) in the validation ADNI cohort. Due to strict quality control, lipid species may be removed from a dataset and typically represent very low abundant species and/or those requiring nearoptimal chromatographic separation. All lipid classes were consistent across the studies, except for the Oxidised sterol ester which was only available in the discovery BHS cohort. Overall, 596 lipid species were quantified; 570 of which were quantified within all three cohorts; five lipid species were present only within BHS and ADNI; and 21 lipid species were present only in the BHS cohort (Supplementary Data 1, 2).
Lipidomic profiling of each cohort was performed using the standardised methodology described by Huynh et al. 18 . Lipidomic profiling has been described previously for BHS 6 and ADNI/AIBL 19 . Briefly, 10 μL of serum/plasma was spiked with an internal standard mix (Supplementary Data 1) and lipid species were isolated using a single-phase butanol:methanol (1:1; BuOH:MeOH) extraction 51 . Analysis of serum/plasma extracts was performed on an Agilent 6490 QqQ mass spectrometer with an Agilent 1290 series HPLC, as previously described. Mass spectrometry settings and transitions for each lipid class are shown in Supplementary Data 1. A total of 497 transitions, representing 596 lipid species (BHS discovery), 573 lipid species (AIBL validation), and 581 lipid species (ADNI validation), were measured using dynamic multiple reaction monitoring (dMRM), where data was collected during a retention time window specific to each lipid species. Raw mass spectrometry data were analysed using MassHunter Quant B08 (Agilent Technologies).
Data integration and cleaning. Lipid concentrations were calculated by relating the area under the chromatographic peak, for each lipid species, to the corresponding internal standard. Correction factors were applied to adjust for differences in response factors, where these were known 18 . In-house pipelines were used for quality control and filtering of lipid concentrations. Across the entire BHS dataset, only three missing values were evident. Lipids below the limit of detection (missing values) were imputed to half the minimum observed value. To remove technical batch variation, the lipid data in each analytical batch (approximately 486 samples per batch) was aligned to the median value in pooled plasma quality control samples included in each analytical run. Unwanted variation in the discovery cohort was identified using a modified remove unwanted variation-2 (RUV-2) approach 52 . In brief, lipid data were residualised in a linear mixed model, against age, sex, body mass index (BMI), clinical lipids and the genetic relatedness matrix (described below) as the random effects. Principal component analysis was performed on the residualised data. The first two components showed clear trends along with samples in collection order. Therefore, variation associated with these first two principal components was removed from the original dataset. Lipid class totals were generated by summing the concentration of the individual species within each class. Validation cohorts were processed in a similar manner.
Phenotypic variables. Details of the BHS data collection have been published previously 53 . Serum cholesterol and triglycerides were calculated by standard enzymatic methods on a Hitachi 747 (Roche Diagnostics, Sydney, Australia) from fasting blood collected in 1994/95. HDL-cholesterol was determined on a serum supernatant after polyethylene glycol precipitation using an enzymatic cholesterol assay and LDL-cholesterol was estimated using the Friedewald formula 54 . Height and weight (used to calculate BMI) were collected from participants at the time of the interview (1994/95). The use of lipid-lowering medication was recorded at the time of the interview (1994/95). Diagnosis of incident CAD was defined as either hospitalisation or death due to CAD (ICD9: 410-414; ICD10: I20-I25) after the blood collection date (and until June 2015). Hospitalisations and deaths were identified from the Western Australian Department of Health Hospital Morbidity Data Collection and Death Registrations.
Medication usage adjustment. For individuals taking lipid-lowering medication (BHS, n = 108; AIBL, n = 198; ADNI, n = 328), lipid species and clinical lipid concentrations were adjusted using previously identified effects of lipid-lowering medication. Changes in lipid species and clinical lipids following one year of statin use were calculated from a placebo randomised controlled trial (LIPID study; n = 4991) 40 . To calculate correction factors 55 , lipid measures were centred and scaled by the mean and standard deviation of baseline measures (prior to statin usage), and the change in lipid abundance was calculated and regressed on age, sex, BMI, and statin usage. Statin usage beta coefficients (effect of the lipid-lowering medication) were added to standardised lipid species concentrations of the individuals taking lipid-lowering medication in the current study. For lipid species present in both this study and the LIPID study (overlap of 314 lipid species), species-specific correction factors were calculated. For those lipid species not measured in the LIPID study (n = 282), class-specific correction factors were used in place of species-specific correction factors i.e. a ceramide-specific correction factor (average beta coefficient of overlapping ceramide species) was used for ceramide species not measured in the LIPID study. Due to the large proportion of ADNI participants taking lipid-lowering medication, we performed a sensitivity analysis, comparing the above correction against residualising lipid concentrations adjusting for medication usage as a covariate (Supplementary Note 2).
Genotyping and imputation. For the BHS discovery cohort, genotyping was performed on the Illumina Human 610 K Quad-Bead Chip (Illumina Inc., San Diego, CA, USA) at the Centre National de Genotypage in Paris, France (n = 1468), and on the Illumina 660 W Quad Array Bead Chip (Illumina Inc., San Diego, CA, USA) at the PathWest Laboratory Medicine WA (Nedlands, WA, Australia (n = 3428). Complete linkage clustering based on pairwise identity by state distance in PLINK 56 showed no batch effects, therefore the batches were merged. Standard genotype data quality control was performed as described previously 48 . Briefly, individuals were excluded if: >3% of SNP data were missing (n = 11), reported sex did not match genotyped sex (n = 48), duplicates (n = 123), missing phenotype data (n = 11), or >5 standard deviations above/below mean heterozygosity (n = 28). Individuals with non-European ancestry (n = 4) were also excluded. To prepare genotype data for imputation, SNPs were excluded if: call rates <95%, minor allele count <10, deviations from HWE (P < 5.0 × 10 −4 ), no matching Haplotype Reference Consortium (HRC) reference panel SNP, palindromic (A/T, G/C) SNPs with MAF greater than 0.4 from the HRC (n = 5), and SNPs with >0.2 MAF difference compared to HRC (n = 150). After quality control, SNP data was available for 513,634 SNPs. Imputation was performed to the HRC reference panel using the Michigan Imputation Server 57 . Following imputation, 39,117,105 SNPs were available for analysis. We excluded variants if the number of copies of the minor allele <5 or if imputation quality (r 2 ) <0.3. This resulted in 13,887,524 variants available for analysis.
Genotyping in AIBL was performed on the Infinium OmniExpressExome array (Illumina, Inc., San Diego, CA) 60 . Quality control procedures were performed in Plink 56 . After removing individuals with ambiguous sex, Plink was used to remove individuals with call rate <0.90; SNPs were removed if call rate <0.95, HWE test P < 1.0 × 10 −4 , or MAF < 0.05. SNPs were flipped to the positive strand before imputation to the 1000 Genomes Phase 3 reference panel using the Michigan Imputation Server 57 (using Minimac 4). Both the AIBL and ADNI validation cohorts were restricted to individuals of non-Hispanic European ancestry, based on projection onto the 1000 Genomes reference panel.
Genetic relatedness matrix. The discovery sample, BHS, used in this study consisted of related and unrelated individuals; therefore, all analyses included a genetic relatedness matrix. Twenty-two genetic relatedness matrices were calculated. First, a hard-call set of imputed SNPs was created in Plink (i.e. SNP genotypes were called if SNP imputation quality r 2 > 0.8 and if genotype probability >0.9). The HLA region on chromosome 6 was also excluded. SNPs were then pruned in Plink using 'indep-pairwise 500 50 0.3' [window of size 500, moving 50 SNPs along each time, removing variants with r 2 > 0.3] to create a set of 486,553 independent SNPs. Twenty-two genetic relatedness matrices were created (using the option 'gk 1' which specifies a centred relatedness matrix), with each omitting one chromosome, in GEMMA 61 .
Statistical analysis. Genome-wide association analyses for the 596 lipid species and 33 lipid classes in the discovery cohort were performed using imputed genotype dosages in linear mixed models, as implemented in GEMMA 61 . To avoid proximal contamination, analyses were performed using genetic relatedness matrices implementing a leave-one-chromosome out scheme. Analyses were performed using rank-based inverse normal transformed residuals, after adjustment by age, sex, age 2 , age*sex, age 2 *sex, and the first 10 principal components (generated from Eigenstrat) 62,63 .
Validation cohorts, ADNI and AIBL, were analysed using an additive linear model, as implemented in Plink 56 . Analyses were performed using rank-based inverse normal transformed residuals, after adjustment by age, sex, age 2 , age*sex, age 2 *sex, study-specific covariates (including fasting status for ADNI) and a number of principal components deemed sufficient to capture population structure. Meta-analysis between all three studies was performed using an inversevariance weighted fixed-effects model, as implemented in METAL 64 . Due to the correlation between lipid species, the effective number of tests was calculated as the number of principal components required to explain at least 95% variance of the lipidome (144 components).
Statistical significance was defined using the standard genome-wide significance (P < 5 × 10 −8 ) in the BHS discovery analysis, P < 0.05 in AIBL/ADNI validation, and P < 3.47 × 10 −10 in the three-study meta-analysis (5 × 10 −8 /144 lipid dimensions; Bonferroni correction using the effective number of tests). A more stringent threshold was used for the meta-analysis due to the lack of validation samples available.
For each lipid, significantly associated SNPs were LD-clumped (r 2 > 0.1) using correlation measures obtained from 10,000 unrelated individuals from the UK Biobank, the 1000 Genomes, or the BHS. A singular dataset was created by retrieving the smallest P-value across all analyses. This dataset was LD-clumped (r 2 > 0.1) to determine the number of independent genomic regions. For each locus, a regional association plot was produced using LocusZoom 65 .
Detection of distinct association signals. Conditional analysis was performed to detect independent association signals at each genome-wide significant loci using GEMMA. For each lipid, we iteratively clumped regions within a 2 Mb window centred on the lead SNP until no more genome-wide significant associations were left. Regions with overlapping windows were merged. Conditional analysis was iteratively performed, including the lead variant as a covariate until no more conditionally independent signals (P < 5 × 10 −8 ) remained.
Assessment of effects of clinical lipid trait adjustment. Within the discovery cohort, to determine whether SNP-lipid associations were independent of clinical lipid traits (total cholesterol, HDL-cholesterol, triglycerides), all SNPs were tested with and without adjustment for clinical lipid traits. We compared loci effect sizes between analyses run with and without clinical lipid adjustment using a pooled standard deviation t-test (Supplementary Note 3). Bonferroni adjustment (0.05/ number of loci) was used to identify loci which differed substantially following adjustment. As adjusting for heritable covariates can introduce collider bias 66 , we further validated these using multi-trait conditional and joint analysis (mtCOJO) 67 , conditioning on GWAS summary-level data for clinical lipids obtained from the UK Biobank 68 .
Putative causal genes, for each loci, were identified using a slightly modified approach to that previously described (ProGeM) 22 . For the bottom-up approach, the three closest protein coding genes (within a 1 Mb window) were identified, for each lead SNP. Genes were noted if a lead SNP or its proxies were annotated by SNPEff as missense, start loss, stop gain, or with an annotation impact as High. As performed by ProGeM, the top-down analysis reports genes within 500 kb of the lead SNP that are present in a curated database of known metabolic-related genes. A list of primary candidates was generated based on the overlap of top-down and bottom-up genes.
Overlap of lead variants with cardiovascular disease-related loci. To assess whether our lead SNPs were previously associated with CVD-related traits, we performed a look-up within the GWAS Catalog v1.02 (release 2020-08-26) 78 of 10 hard CVD endpoints, 72 CVD-related traits, and 141 lipid-related traits. We also performed a look-up against a meta-analysis of CAD between CARDIo-GRAMplusC4D and UK Biobank 79 .
Associations of lipid species with coronary artery disease and coronary artery disease polygenic risk. Within the discovery cohort, the association of lipid species with incident CAD was assessed using logistic regression, adjusting for age, sex, and the first 10 genomic principal components. Prevalent CAD cases were removed prior to analysis; defined as individuals hospitalised with CAD between the start of the Hospital Morbidity Data Collection (1970), and an individual's serum collection date. Incident CAD events (CAD hospitalisations or death) were included up to the end of follow-up (July 2015). Results are displayed as log-odds ratios.
Polygenic risk for CAD was calculated for each individual in the discovery cohort using the metaGRS polygenic score, consisting of~1.7 million genetic variants 23 . Linear regression in R was performed to test the association between an individual's polygenic score and lipid species concentrations, adjusting for age, sex, and the 10 first principal components.
Genetic correlations. Genetic correlations of lipid species against CAD were assessed using Linkage Disequilibrium Score Regression (v1.0.1) 80 . Regression weights and scores were obtained from 1000 Genomes European data, as previously described 81 . Summary statistics from all datasets were restricted to SNPs from the HapMap 3 panel, with 1000 Genomes European MAF greater than 5%. Where available, SNPs were filtered to an imputation quality r 2 > 0.9. Similarly, SNPs were removed if the reported MAF deviated from 1000 Genomes European MAF by greater than 0.1. Summary statistics for CAD were obtained from the meta-analysis of CARDIoGRAMplusC4D and UK Biobank by van der Harst and Verweij 79 . Due to no overlapping samples between BHS and other summary results, the genetic covariance intercept was constrained to 0.
Co-localisation analysis. Co-localisation between lipid species genome-wide significant loci and CAD was performed using the R package COLOC 82 . For each loci, all variants within a 400 kb window centred on the lead SNP were selected. Priors were kept at default settings. Evidence for shared variants was determined as the posterior probability of both traits containing causal variants in the region (H3 + H4 > 0.8) and a larger probability of a shared variant (H4/H3 > 10). Sensitivity analysis for regions with shared variants is shown in Supplementary Note 1.
Association of loci with coronary atherosclerosis in the UK Biobank. Lead SNPs (or proxies) were tested for association with coronary atherosclerosis in the UK Biobank. In a subset of white, British individuals (n = 456,486), electronic health records (updated 14 th December 2020) were converted into PheCodes 83,84 using the R package PheWAS 85 . Coronary atherosclerosis (phecode 411.4) was exported for genome-wide association analysis. FastGWA 86 was used to assess the association of lipid-loci with these phenotypes, adjusting for age, sex, age 2 , age*sex, age 2 *sex, the first 20 principal components as provided by the UK Biobank, and the genetic relatedness matrix as the random effect. The analysis was repeated, additionally adjusting for clinical lipids (total cholesterol, HDL-cholesterol, triglycerides; measurements obtained from the first available blood collection). Individuals with missing values were excluded from the analysis. As clinical lipids are heritable, mtCOJO analysis was also performed using GWAS summary statistics obtained above.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.