Genome-wide Association Study Of Plasma Proteins Identifies Putatively Causal Genes, Proteins, And Pathways For Cardiovascular Disease

Identifying genetic variants associated with circulating protein concentrations (pQTLs) and integrating them with variants from genome-wide association studies (GWAS) may illuminate the proteome’s causal role in disease and bridge a GWAS knowledge gap for hitherto unexplained SNP-disease associations. We conducted GWAS of 71 high-value proteins for cardiovascular disease in 6,861 Framingham Heart Study participants followed by external replication. We comprehensively mapped thousands of pQTLs, including functional annotations and clinical-trait associations, and created an integrated plasma-protein-QTL searchable database. We next identified 15 proteins with pQTLs coinciding with coronary heart disease (CHD)-related variants from GWAS or tested causal for CHD by Mendelian randomization; most of these proteins were associated with new-onset cardiovascular disease events in Framingham participants with long-term follow-up. Identifying pQTLs and integrating them with GWAS results yields insights into genes, proteins, and pathways that may be causally associated with disease and can serve as therapeutic targets for treatment and prevention.


Introduction
Considerable progress has been made in identifying genetic underpinnings of coronary heart disease (CHD), [1][2][3][4] which remains the leading cause of death worldwide. 5 Proteins are the functional products of the genome and serve as critical factors for biological processes involved in health and disease as well as primary drug targets. Numerous proteins have been reported to be associated with CHD; it is often difficult, however, to establish with certainty whether CHD-associated proteins are causally related to risk or simply represent downstream markers of disease-related processes. Identifying genetic variants associated with protein levels (protein quantitative trait loci; pQTLs), characterizing pQTLs that also are associated with CHD from genome-wide association studies (GWAS), and inferring causality may provide novel insights into the roles of genetic variants, genes, and the proteins they code in the pathogenesis of CHD. To date, most pQTL studies [6][7][8][9][10][11][12][13][14][15] have been based on small sample sizes or did not conduct prospective testing of associations between protein levels and clinical disease.
To address a GWAS knowledge gap for genetic variants of unknown relevance to CHD, we conducted a multistage study ( Figure 1) consisting of GWAS of high-value cardiovascular disease (CVD) plasma proteins that were measured in Framingham Heart Study (FHS) participants, followed by external replication in participants from the Cooperative Health Research in the Region of Augsburg (KORA) F4 study 12 and from other protein GWAS. We integrated pQTLs with genetic variants from CHD GWAS databases [1][2][3][4] and employed Mendelian randomization (MR) 16 to reveal proteins with potentially causal effects on CHD. Last, we tested proteins for association with new-onset CHD events in FHS participants with long-term follow-up. We hypothesized that a strategy of protein GWAS followed by causal testing and prospective association with CHD outcomes would identify putatively causal genes, proteins, and pathways for CHD and highlight novel targets for its prevention and treatment.

Findings
Discovery Set: Seventy-one proteins, selected a priori based on prior evidence of association with CVD, were measured in 7,333 FHS participants (Table S1). The sample size available for GWAS was up to 6,861 participants (mean age 50 years, 53% women); clinical characteristics of the discovery sample are summarized in Table S2. pQTL Mapping: With a GWAS sample size of ~6,800 participants and a significance threshold of p<5x10 -8 , our study had 80% power to detect a pQTL that explained ≥ 0.6% of variance in protein levels (Table S3). We identified 1,793 insertion/deletion polymorphisms for 57 proteins (Table S4) and 20,495 pQTLs with Reference SNP cluster IDs for 60 proteins (Table S5) (Table S7). Thirty-four pQTLs were rare variants (minor allele frequency<1%) associated with 18 proteins (Table S8).
The effect sizes and the proportion of inter-individual variation explained by some pQTLs were large. For example, cis-pQTL rs941590, is a missense variant that explained 32% of inter-individual variation in SERPINA10 levels ( Figure S1) and was previously reported to be associated with family history of venous thrombosis. 18 Three proteins (PON1, GRN, and LPA) had pQTLs that explained 10-30% of variation in protein levels. Minor allele frequency was inversely correlated with effect size, but not with proportion of variance explained. In general, cis-pQTLs and missense variants had larger effect sizes and explained a greater proportion of the variation in protein levels than did trans-pQTLs and non-coding variants, respectively ( Figure 3).
External Replication: Among our 60 proteins linked to 130 sentinel pQTLs, 46 proteins (associated with 105 sentinel pQTLs) were measured in the KORA F4 study 12 or in prior GWAS. For each pQTL locus identified in the FHS, the pQTL with the lowest proteinassociation p-value was selected as the sentinel pQTL for external replication. Thirty-six proteins, encompassing 82 sentinel pQTLs, were measured in KORA (21 cis-pQTLs and 61 trans-pQTLs); 23 additional loci were evaluated for replication in other GWAS.
Tissue enrichment analysis revealed that pQTL-mapped genes are highly expressed in monocytes (p=2.38x10 -4 ) and hepatocytes (p=4.55x10 -5 ). We employed Functional Mapping and Annotation 21 of GWAS (FUMA; http://fuma.ctglab.nl) to generate detailed annotations of pQTLs for each protein (regional plot of each pQTL locus, functional categorization of pQTL SNPs, gene mapping, and pathway enrichment analyses) that are provided in Figure S3. These annotations revealed that pQTLs often reside in active regulatory regions and are frequently located in intergenic and intronic regions. Proteinspecific Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses of the corresponding pQTLs revealed a preponderance of pathways concordant with the function of the studied protein.
Moreover, we identified pQTLs associated with expression of the corresponding protein-coding genes for 15 proteins, suggesting that many pQTLs affect circulating protein levels by regulating blood cell gene expression (Table S15).
Clinical Annotation: We found that 58 missense pQTLs (from the Exome Chip) were linked to clinical disorders in the NCBI ClinVar 24 database (Table S16). We identified examples where the missense pQTL and its associated protein are both linked to CHD-related traits. For example, rs2228671, a missense variant in the LDL-receptor gene (LDLR), was associated in our pQTL analysis with circulating APOB levels, the major lipoprotein of LDL particles, and was previously reported to be pathogenic for familial hypercholesterolemia (FH), a monogenic disorder of LDL cholesterol. 25 Additionally, for several variants reported to be benign in ClinVar, we demonstrated associations with the disease-relevant protein, suggesting that they have clinical consequences.
We found the ABO locus to have links to CHD through five circulating proteins (MCAM, sICAM1, GMP140, sGP130, REG1A). ABO blood type has long been linked to CVD risk, including in the FHS, 26 additional reports have linked the ABO locus to CVD via coagulation pathways. 27,28 ABO locus-related proteins, identified in our study, are involved in inflammatory pathways, including interleukin and interferon signaling ( Figure 2). The multi-protein association of this locus may be driven by the general function of ABO as a glycosyltransferase. Causal Testing: We applied MR testing to infer the causal association between protein levels and CHD for all proteins having cis-pQTLs and those with at least four cis-or trans-pQTL loci that coincided with CHD-associated SNPs from GWAS. 1 MR causally implicated LPA, REG1A, MCAM, and SAA1 via cis-pQTLs as instrument variables (p<0.05; Table S18). For 11 proteins with pQTLs that coincided with SNPs from CHD GWAS and had at least four non-redundant cis-or trans-pQTLs, we conducted MR analyses using a multi-SNP approach, implemented in MRbase, 30 which revealed causal CHD associations for APOB (p=0.0005) and GRN (p=7.0x10 -5 ).
Protein Associations with Clinical Outcomes: For 15 proteins ( Figure 4) with pQTLs that coincided with CHD GWAS SNPs or tested positive in MR analyses at p<0.05 we tested the associations of protein levels with a) major CHD (recognized myocardial infarction or CHD death; n=213 events) and b) CVD death (fatal CHD or death due to stroke, peripheral arterial disease, heart failure, or other CVD causes; n=199 events) occurring during a median follow-up of 14.3 years (25th percentile 11.4, 75th percentile 15.2 years) among 3,520 FHS participants age ≥ 50 years. Twelve of the 14 proteins with pQTLs that coincided with CHD GWAS SNPs were associated (nominal p<0.05) with incident CHD or CVD death ( Table 2). After adjusting for multiple testing of 15 proteins (p<0.05/15 = p<3.3x10 -3 ), nine proteins remained associated with incident events. Four (REG1A, SAA1, APOB, and GRN) of the six proteins that tested causal for CHD by MR (at p<0.05) were associated with CHD/CVD outcomes (at p<0.05). Two proteins (AGP1 and HPX) that tested marginally positive for CHD risk in MR analysis (0.05<p<0.10) were associated with new-onset CHD events (p=3.5x10 -8 for AGP1 and p=2.0x10 -5 for HPX).
The protein effect sizes on CHD predicted from MR were consistent with the observed prospective protein-CHD associations ( Figure 5).
Novel Proteins and Pathways Implicated in CHD: Whereas LPA and APOB can be viewed as positive controls because prior studies identified them as causal for CHD, [31][32][33][34] four proteins that were causal for CHD in MR are novel, including REG1A, MCAM, SAA1, and GRN (Table S18). REG1A is a protein secreted by the pancreas and may be related to islet cell regeneration and diabetogenesis, potentially contributing to increased atherogenic risk of diabetes. 35 REG1A levels were positively associated with CHD and CVD outcomes in our protein-trait association analyses (Table 2) it explained 15% of variance in GRN levels and was associated with CHD at p<4.6x10 -23 in prior GWAS. 1 Previous studies reported that rs12740374 affects expression levels of the SORT1 gene in human hepatocytes, which in turn regulate LDL-cholesterol levels. 46,47 Our longitudinal analyses revealed association of GRN with CVD death in FHS participant (p=0.001; Table 2). Molecular QTL browser: Our pQTL resource is accessible through the NCBI Molecular QTL Browser (ftp://ftp.ncbi.nlm.nih.gov/eqtl/original_submissions/FHS_pQTLs/; a link to the browser will be sent to reviewers under separate cover), which serves as a data 1 resource for associations between genetic variants and molecular phenotypes. The browser links our pQTL results to eQTLs and other molecular resources via a userfriendly interface. Users can browse and search results and specify p-value cutoffs and other data filters. The Molecular QTL Browser also permits users to conduct targeted studies of specific genes based on prior evidence. The integrated data resource enables searches across datasets and filtering by functional annotation and genomic position.

Discussion
Using a multistage strategy, we discovered thousands of pQTLs associated with scores of proteins that were selected a priori as high-value plasma proteins for CVD.
Integration of pQTLs with CHD GWAS revealed 14 proteins with pQTLs that coincide with CHD SNPs (Table 1) and MR analyses identified six proteins with evidence for causal effects on CHD (Table S18), four of which were novel (REG1A, MCAM, SAA1, and GRN). Furthermore, most of these proteins were associated with new-onset CHD or CVD events in FHS participants with long-term follow-up (Table 2). Our strategy connected pQTLs, genes, the proteins they code, and CHD risk ( Figure 4) and highlighted a comprehensive approach to bridge the GWAS knowledge gap for genetic variants that have no links to disease via known mechanisms.
We acknowledge several limitations of our study. Participants were of European ancestry; consequently, the results may not be directly generalizable to populations with different genetic backgrounds. Although our sample size for GWAS was large, our ability to detect pQTLs and to test them for causality using MR was limited by power. Protein levels were measured in whole blood and may not reflect tissue-specific patterns of expression.
To our knowledge, this is the largest sample size pQTL study with well-powered discovery, independent external replication, CHD causal testing, and confirmatory prospective protein-CHD outcome findings. We provide a large and comprehensive compilation of pQTLs as a resource for other researchers (via the NCBI Molecular QTL Browser) and provide evidence that an integrated genomic approach can identify proteins with putatively causal effects on CHD. Although some of our causally-implicated proteins may act through classic CHD risk factors and known pathways, many do not and thus represent attractive candidate targets for drug development. Additional studies are needed to elucidate the mechanisms by which such proteins alter CHD risk as well as trials to confirm our MR prediction that perturbing these pathways can prevent CHD events. Taken together, the genetic variants associated with circulating protein levels in this study shed new light on genes, proteins, and pathways contributing to the pathogenesis of CHD, which could have profound implications for the treatment and prevention of the leading cause of death worldwide. Abbreviations: CHD = coronary heart disease; GWAS = genome-wide association study; pQTL = protein quantitative trait locus (i.e. genetic variant associated with protein level) 2 0  The study consisted of seven steps: 1) selection and measurement of 71 high-value plasma proteins for atherosclerotic CVD via multiplex immunoassays in 7,333 FHS participants, 2) GWAS of the 71 proteins in 6,861 FHS participants to identify genome-wide significant pQTLs, 3) functional enrichment analyses of the identified pQTLs, 4) independent external replication of the sentinel pQTLs in KORA and other previous GWAS, 5) integrated analysis to pQTLs that coincide with CHD SNPs from GWAS, 6) identification of causal proteins for CHD using a Mendelian randomization approach, 7) association analysis of proteins from steps 5 and 6 with risk for incident CHD death and CVD death in 3,520 FHS participants age 50 years or older with available long-term follow-up.
Abbreviations: CHD = coronary heart disease; CVD = cardiovascular disease; FHS = Framingham Heart Study; GWAS = genomewide association study; KORA = Cooperative Health Research in the Region of Augsburg Study; pQTL = protein quantitative trait locus (i.e. genetic variant associated with protein level); SNP = single nucleotide polymorphism 2 5 Loci containing pQTLs previously identified in GWAS to be associated with CHD are written in red text. Proteins with genome-wide significant pQTLs are listed in the right semicircle. The following three conditions are summarized for each protein: 1) The corresponding protein-coding gene is linked to a potential CHD pathway in previous GWAS (above the black line).
2) The corresponding protein-coding gene is a known drug target (green text). 3) GO biological processes for the protein-coding gene (green box denotes lipid metabolism pathways, blue box denotes inflammatory/immune response pathways, yellow box denotes coagulation/platelet/hemostasis pathways, and gray box denotes other pathways not included in the three most common, previously listed pathways). A single primary GO process was chosen when the protein-coding gene was included in multiple pathways.
Abbreviations: CHD = coronary heart disease; GO = Gene Ontology; GWAS = genome-wide association study; pQTL = protein quantitative trait locus (i.e. genetic variant associated with protein level); SNP = single nucleotide polymorphism 2 8    . pQTL-Protein-Coronary Heart Disease Network Network of proteins and significant pQTLs with annotated gene loci for the pQTLs that are also GWAS risk SNPs for CHD (see Table  1). For proteins with multiple pQTLs that coincide with coronary heart disease GWAS SNPs, the pQTL with the lowest p-value of association with its corresponding protein level is shown. The following two conditions are summarized: 1) Proteins that tested causal for CHD in Mendelian randomization (p<0.05). 2) Proteins associated with new-onset CHD (p<0.05) in 3,520 Framingham Heart Study participants age 50 years or older with long-term follow-up. Proteins in green fulfill neither condition 1 nor 2; proteins in blue fulfill condition 1; proteins in red fulfill condition 2; proteins in purple fulfill conditions 1 and 2. The pQTL rs2523535 for sRAGE was reported to be associated with CHD (p=8.1x10-10) in a Japanese GWAS (PMID 21971053).
Abbreviations: CHD = coronary heart disease; FHS = Framingham Heart Study; MR = Mendelian randomization; pQTL = protein quantitative trait locus (i.e. genetic variant associated with protein level) 3 2 For a genetic locus with multiple pQTLs in LD (i.e., LD r 2 >0.2), we selected the pQTL with the lowest p-value to represent the sentinel pQTL for that locus.

Figure 5. Comparison of Protein Effects on Coronary Heart Disease from Mendelian Randomization
For KORA, linear regression models were performed on the follow-up SNPs using R version 3.1.3. 64 Associations between inverse-normalized protein levels and imputed dosages were tested using linear additive genetic regression models adjusted for age, sex, and body mass index. 12 eQTL Mapping: We used linear mixed effects models, accounting for familial relationships using "PEDIGREEMM" in R, 64 to assess associations between ~8.5 million 1000G SNPs that were additively coded and expression levels of 17,873 transcripts. 23 Models were adjusted for age, sex, platelet count, differential white cell count (percentages of lymphocyte, monocyte, eosinophil, and basophil), and for 20 PEER factors 65,66 to reduce confounding due to unmeasured factors. The criteria used to define cis and trans effects for pQTLs were also applied to eQTLs. A false discovery rate (FDR) threshold of 0.05 was applied separately for cis-and trans-eQTLs.
Mendelian randomization: We used an MR approach to test for causal associations between protein biomarkers and CHD risk. The sentinel cis-pQTL for each protein, based on lowest p-value of association in either 1000G GWAS or Exome Chip analysis, was selected as the instrumental variable (IV) for its perspective protein in MR analysis.
Based on the association between the sentinel cis-pQTL and CHD in prior GWAS, 1 a putative causal effect of one standard error difference in inverse-rank normalized protein level on CHD was calculated as the per risk allele effect on CHD risk dependent on the per risk allele effect on one standard error difference in inverse-rank normalized protein level (the Wald ratio test). 67 For proteins with suggestive single cis-pQTL results (0.05<p<0.1) and with additional non-redundant cis-pQTLs, we conducted single-locus multi-SNP MR. Low-level correlation (LD r 2 <0.2) between variants in the genetic risk score was adjusted for in MR analysis using the method developed by Burgess et al. 68 Similarly, for proteins with pQTLs that shared genetic signals with CHD from GWAS, we conducted multi-SNP MR using MRbase 30 when there were at least four nonredundant pQTL loci.
Associations of protein levels with CVD: To analyze associations between plasma protein levels and MI/CHD death and CVD death in FHS participants, protein biomarkers were rank-normalized. Cox proportional hazard models were used to predict MI/CHD death 1 and CVD death for each biomarker, adjusting for age and sex. Participants younger than 50 years of age at baseline were excluded from outcome analyses due to a paucity of events in this age group. In addition, participants with prevalent MI/CHD or CVD at baseline were excluded from analyses of incident events, leaving a final sample size of 3,520 FHS participants.
Independent External Replication: After merging our 1000G and Exome Chip GWAS results, the pQTL with the lowest p-value of association at each genetic locus was selected as the sentinel pQTL. We conducted independent external replication of our sentinel pQTLs in the KORA study 12 and in other protein GWAS. Out of the 60 proteins with pQTL SNPs in the FHS, replication was conducted for 47 proteins from discovery that also were measured in KORA or other studies. The sentinel pQTL at each genetic locus in the FHS was determined to be successfully validated if its corresponding 1000Gimputed genotype or strong proxy (LD r 2 >0.8) in KORA was also a significant pQTL for the corresponding protein and if directionality of pQTL-protein association was preserved. Statistical significance was defined as a p-value <0.05/n (n was the number of pQTLs that were studied in KORA).