Proteome-wide mendelian randomization identifies causal plasma proteins in venous thromboembolism development

Genome-wide association studies (GWAS) have identified numerous risk loci for venous thromboembolism (VTE), but it is challenging to decipher the underlying mechanisms. We employed an integrative analytical pipeline to transform genetic associations to identify novel plasma proteins for VTE. Proteome-wide association studies (PWAS) were determined by functional summary-based imputation leveraging data from a genome-wide association analysis (14,429 VTE patients, 267,037 controls), blood proteomes (1348 cases), followed by Mendelian randomization, Bayesian colocalization, protein-protein interaction, and pathway enrichment analysis. Twenty genetically regulated circulating protein abundances (F2, F11, ABO, PLCG2, LRP4, PLEK, KLKB1, PROC, KNG1, THBS2, SERPINA1, RARRES2, CEL, GP6, SERPINE2, SERPINA10, OBP2B, EFEMP1, F5, and MSR1) were associated with VTE. Of these 13 proteins demonstrated Mendelian randomized correlations. Six proteins (F2, F11, PLEK, SERPINA1, RARRES2, and SERPINE2) had strong support in colocalization analysis. Utilizing multidimensional data, this study suggests PLEK, SERPINA1, and SERPINE2 as compelling proteins that may provide key hints for future research and possible diagnostic and therapeutic targets for VTE.


INTRODUCTION
Venous thromboembolism (VTE), including deep vein thrombosis (DVT) and pulmonary thromboembolism (PTE), is the third most common life-threatening cardiovascular disease after myocardial infarction and stroke.The global incidence rate of VTE is estimated to range between 115 and 269 per 100,000 and mortality rates related to VTE is estimated to range between 9.4 and 32.3 per 100,000 [1][2][3][4].VTE is a complex disease caused by a combination of genetic predisposing factors and acquired risk factors.Additionally, more than 60% of the variation in susceptibility to common thrombosis is attributable to genetic factors [5,6].
Genome-wide association study (GWAS) is a research method based on linkage disequilibrium in a population and uses singlenucleotide polymorphisms (SNPs) as markers to search for genetic factors associated with complex diseases, which can reveal the genetic mechanisms related to the occurrence, development, and treatment of diseases [7] in a comprehensive manner.In recent years, GWAS has been applied to uncover the genetic etiology of VTE [8][9][10][11], and several SNPs and genes have been identified to be related to its pathogenesis.For instance, coagulation factors including coagulation factor II (F2), coagulation factor V (F5), and coagulation factor XI (F11) are typical factors participating in the coagulation process, while protein C (PROC) plays a role in anticoagulation.Glycoprotein 6 (GP6) and phospholipase C gamma 2 (PLCG2) are related to platelet generation and regulation [9,12].
Since the results of GWAS are in the form of SNPs, the isolated outcomes of GWAS are difficult to reflect the impact on genes or proteins.Recently, a novel analytical method called proteomewide association studies (PWAS) was developed to clarify how proteins involve in the occurrence and development of diseases [13].Previous PWAS were mostly conducted on nervous system diseases such as depression, lacunar stroke, Alzheimer's disease, and post-traumatic stress disorder [13][14][15][16] because these studies used tissue-specific proteins instead of blood proteins [17,18].Recent release of data on human plasma proteomes [19] enables the explorations of the associations of proteins and the risk of blood diseases, such as VTE.
In this study, we performed PWAS combining the data from GWAS and protein quantitative trait locus (pQTL) to investigate the proteins deserved further investigation as diagnostic and therapeutic targets for VTE.Mendelian randomization (MR) analysis and Bayesian colocalization analysis were also conducted to clarify the causal relationship between the discovered proteins and VTE pathogenesis.Protein-protein interaction (PPI), and pathway enrichment were applied to explore the potential mechanisms of candidate proteins.
For UKB data, we performed variant level quality control by criteria of MAF > 0.01, genotype missingness <0.02, and pHWE >1 × 10 -10 on 96 million imputed variants.We then obtained 8,473,913 variants for downstream analysis.All genotyped variants passing quality control on autosomal chromosomes were tested for association with VTE through logistic regression adjusting for age, sex, and top ten principal components using PLINK [21].

Human blood proteome reference weight for PWAS
We next obtained whole-blood pQTL data from the Atherosclerosis Risk in Communities (ARIC) study [19] including 1348 cis-heritable plasma proteins from 7213 European Americans (EA) to match the GWAS datasets.Proteomic profiling was performed using the SomaScan technology using the v.4.1 platform.Genotyping was conducted using the Infinium Multi-Ethnic Global BeadChip array (Illumina, GenomeStudio) and imputed to the TOPMed reference panel (Freeze 5 on GRCh38).These pQTL data were used as reference weights for subsequent PWAS analysis.

PWAS
Using the FUSION pipeline (http://gusevlab.org/projects/fusion/),we integrated the GWAS summary statistics with the reference human plasma proteomes (ARIC study [19]) to perform the PWAS analysis [22].We calculated the VTE genetic effect (PWAS z-score) and combined it with the pre-calculated plasma proteome reference weight (z-score × proteome weight) to evaluate the effects of significant SNPs in the GWAS on the protein abundance.Finally, FUSION identified candidate genes associated with VTE regulating the abundance of proteins in the plasma.To control the potential effect of multiple testing on the study results, the false discovery rate (FDR) of P value < 0.05 was used as the significance threshold in our PWAS analysis.

MR analysis
MR was used to verify whether PWAS-significant genes were associated with VTE via their cis-regulated plasma protein abundance [23,24].In the MR analysis, protein-relevant SNPs were used as instrumental variables (IVs) to test the causal effect of the exposure (protein expression) on the outcome (VTE).For MR analysis, the inverse variance weighted (IVW) method [25] was used as the main MR analyses.The MR-Egger [26] method was used to detect directional pleiotropy according to the intercept of weighted linear regression of the SNP-outcome coefficients on SNP-exposure coefficients.We used the default parameters and P value < 0.0025 was set as the significance level (0.05/20 = 0.0025).

Bayesian colocalization analysis
We applied COLOC to assess the probability of the same variant being responsible for both changing VTE risk and protein expression [27].We used the FUSION parameter to perform colocalization based on the GWAS and pQTL data.We used the default COLOC priors of p1 = 10 −4 , p2 = 10 −4 , and p12 = 10 −5 , where p1 is the probability that a given variant is associated with VTE, p2 is the probability that a given variant is a significant pQTL, and p12 is the probability that a given variant is significant in both GWAS and pQTL.A posterior colocalization probability (PP4) of 80% was used to denote a shared causal signal.The regional association plots were generated by the R package "LocusCompareR" [28].

Protein-protein interaction and pathway enrichment analysis
We used STRING, a web platform to investigate networks among the 20 significant proteins based on PPI [29].Additionally, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment assays were performed to conduct gene set enrichment analysis on genes within a PPI network using the R package "clusterProfiler" [30].

Expression analysis of candidate proteins
GSE19151 [31] (https://www.ncbi.nlm.nih.gov/geo/,accessed on 5 February 2023) provided proteins expression in healthy and VTE blood tissues.We verified whether candidate proteins were differentially expressed in VTE patients compared with healthy controls.GSE19151 contained 95 blood tissue samples from Caucasians (47 VTE patients and 48 healthy controls).Differentially expression between VTE patients and healthy controls were screened with P < 0.0083 (0.05/6 = 0.0083).

Protein-protein interaction and pathway enrichment analysis
We used the STRING database to investigate the connectivity among the 20 VTE-related proteins from the PWAS and found a protein community based on PPIs.A module is a set of proteins that are more connected to one another than they are to other groups of proteins.The module included F2, F5, F11, PROC, SERPINA1, SERPINE2, KLKB1, and KNG1 (Fig. 4, and Supplementary Table 6).

Examination of expression in VTE patients
To further confirm the dysregulation of these proteins in VTE, we analyzed blood tissues from VTE patients and healthy control in GSE19151.Gene expression analysis showed that three out of six VTE-related proteins, PLEK, SERPINA1, and SERPINE2, had significantly lower expression levels in the VTE group compared with healthy control group (p < 0.001) (Fig. 5A-C), while the expression level of RARRES2, F2, and F11 was no significant between the VTE group and healthy control group (p > 0.05).

Diagnosis performance of candidate plasma biomarkers
We inspected the individual diagnostic performance of PLEK, SERPINA1, SERPINE2, RARRES2, F2 and F11 between the VTE patients and healthy controls.In Fig. 5D, the area under the ROC curve (AUC) of SERPINE2, SERPINA1, and PLEK were 0.83, 0.74, and 0.71, respectively those of other proteins were in the range of 0.52-0.56 in the VTE diagnosis.SERPINE2, SERPINA1, and PLEK showed good performance for the VTE diagnosis.

Significance of the protein findings
Our research identified the lowest P values for the SNPs within 1 Mb of each of these 20 genes using the summary statistics from the UKB.We determined that ten genes (PROC, KNG1, THBS2, SERPINA1, RARRES2, GP6, SERPINE2, SERPINA10, EFEMP1, and MSR1)  (Supplementary Table 8), implying that these genes could not be significant in GWAS of VTE.These findings are consistent with observations from other PWAS studies that found the novel genes from regions below genome-wide significant P values [32][33][34].
A cross-ancestry investigation of VTE genomic predictors conducted by Thibord et al revealed some protein encoding genes related to VTE risk, some of which are supported by our study, including findings on F2, F11, GP6, PLEK, PROS1, and MSR1 [10].Furthermore, we performed COLOC analysis to assess the probability of some SNPs might be responsible for VTE by changing particular protein expression, which makes outcomes more robust to some extent.
In our analysis, we identified some classical VTE-proteins, such as F2 and F11 [35].F2 is also known as prothrombin, which is an indispensable key factor in both endogenous and exogenous coagulation pathways and possesses procoagulant, anticoagulant, and antifibrinolytic activities.The G20210A variant (rs1799963) of the F2 gene is common in the Caucasian population.It can promote an increase in prothrombin expression and activity and the synthesis of prothrombin, and thus increase the risk of thrombosis [36].F11 affects VTE as a part of the extrinsic coagulation pathway.Interestingly, as another well-known gene, the F5 was only significant in our PWAS study.Maybe the result of the MR study is caused by weak IVs for F5.
Furthermore, we also discovered four novel proteins in VTE including SERPINA1, SERPINE2, and PLEK, which are the most meaningful findings in our research.Some existing evidence has implied their potential relationship with VTE.Genetic variations of the SERPINA1 gene rs28929474 are associated with the risk of VTE [37,38].The genetic variant of the SERPINA1 rs2749527 has also been reported to influence plasma cortisol levels [39,40] and an MR analysis discovered that higher plasma cortisol levels were associated with a reduced risk of VTE [41].SERPINE2 has been discovered to function in many vascular disorders, such as atherosclerosis and restenosis [42].It differs from conventional thrombosis-related factors as it exists on the surface of most cells but is barely expressed in plasma.SERPINE2 can play an inhibitory role in the coagulation system as well as in the fibrinolytic system [43,44].As a result, it is a significant regulator of hemostasis, thrombosis, and vascular disorders, although its function in VTE has not been clarified [45].The PLEK gene is an active factor in VTE, and it was confirmed in our research as well.A large GWAS metaanalysis found that PLEK rs1867312 was an independent genetic risk signal for VTE [8].Additionally, the transcribed protein of PLEK, pleckstrin, is found in platelets and is involved with platelet biology [46][47][48].Our study has several advantages.First, PWAS of VTE was conducted using the human proteome and summary statistics from the UKB, a large population-based prospective study with deep genetic and phenotypic data.Second, we performed the PWAS and verified the risk proteins with independent MR validation analysis.Third, based on Bayesian colocalization used to estimate the probability that two associated signals were observed at a particular site with a common causal variant, we confirmed the pathogenetic proteins (F2, F11, PLEK, SERPINA1, RARRES2, and SERPINE2) of VTE.Finally, PWAS could detect proteins which could be ignored by GWAS.
Our research also has some limitations.First, the impact of ethnic differences on the genetic architecture of VTE cannot be ignored.The current proteomic information was from the European population; thus, the applicability of this study to other populations needs to be discussed.Second, this study only used GWAS and pQTL for PWAS analysis and barely obtained results at the level of protein, which might have minimized the robustness of our conclusions.We could further screen and explore through TWAS to achieve a more complete understanding of the molecular mechanism involved in VTE.Finally, based on our PWAS conclusion, we only performed MR, COLOC, PPI, and functional analysis to verify their rationality, and more diverse and multi-level exploration should be carried out for verification.

Fig. 1
Fig. 1 Manhattan plot for the VTE PWAS integrating the VTE GWAS (N = 281,466) with the ARIC proteomes (n = 1348).Each point represents a single test of association between a gene and VTE ordered by genomic position on the x-axis and the association strength on the y-axis as the -log 10 P value of a z-score test.The PWAS identified 20 genes whose cis-regulated plasma protein abundances were associated with VTE at FDR p < 0.05.The red horizontal line reflects the significant threshold of FDR p < 0.05

Fig. 2
Fig. 2 Result of Mendelian randomization (MR) between PWAS-significant proteins and the risk of VTE.Among these 20 proteins, 13 were consistent with being causal based on MR.IVW inverse variance weighted

Fig. 5
Fig. 5 Differentially expressed of (A) PLEK, (B) SERPINE2, and (C) SERPINA1 in VTE patients and healthy controls.(D) Receiver operating characteristic (ROC) results of different proteins between the VTE and healthy controls

Table 1 .
The results of the PWAS of VTE, followed by Mendelian randomization and COLOC