Introduction

Plasma proteins play important roles in numerous biological pathways, contribute to risk for many diseases and have long been used for clinical risk assessment, diagnosis, prognosis and evaluation of treatment efficacy. Protein levels used as a quantitative trait in genome-wide association studies (GWAS) can act as an intermediate phenotype that functionally links genetic variation to disease-predisposing factors and then to complex disease end points1,2. Therefore, studies that link genetic variants with protein traits may provide a means to reveal the underlying mechanisms of the GWAS findings.

Previous case-control studies have associated many loci with various complex diseases. Unfortunately the effect sizes of genetic associations with complex disorders are generally small and the functional information on the underlying biological processes is often unclear or absent, which complicates the interpretation of the results. As a result, the focus of GWAS is now shifting increasingly away from studying associations with disease end points and toward associations with intermediate traits that are known risk factors for disease3,4,5.

A previous study used GWAS data and various commercially available enzyme-linked immunosorbent assay (ELISA) kits to find genetic variants associated with plasma or serum levels of 42 different proteins (such as interleukin 18, insulin and leptin) implicated in various complex diseases (such as lupus, diabetes and obesity)6. They identified several GWAS hits that could help in understanding the biology of those complex traits6. Recent technological developments have made possible the quantification of multiple proteins in a single analytical procedure, allowing both broader and deeper molecular profiling of large cohorts2,7,8,9,10. Genetic analyses of these data have discovered numerous genomic regions associated with clinically relevant proteins, with recent large-scale proteome analyses having identified many loci associated with serum and plasma concentrations of individual proteins2,7,8,9,10. Nevertheless, our understanding of the genetic basis and pathophysiological impact of variations in protein levels remains far from complete. Most of these studies limited analyses to cis variants or focused on candidate regions rather than genome-wide scans2,7,8,9. Recent research suggests the importance of investigating protein phenotypes beyond those used in traditional genetic studies10.

Here we present the results of an unbiased large genetic investigation of protein phenotypes in 818 unrelated individuals from the Washington University Knight Alzheimer’s Disease Research Center (KADRC) and Alzheimer’s Disease Neuroimaging Initiative (ADNI) who were analyzed for both genome-wide SNP genotypes and for 146 phenotypic measures obtained from multi-analyte panels (Human DiscoveryMAP) of human plasma samples.

Results

Before any genetic analyses we performed extensive quality control (QC) on the genotype and phenotype data. After log transformation and standardization (see Methods) we confirmed that the protein levels followed a normal distribution. We also tested the correlation between the analyte values and covariates such as age, gender and Alzheimer’s disease (AD) status (Supplementary Tables S1 and S2). Age, gender, disease status, study and principal component factors (PCs) accounting for population stratification were included as covariates.
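
The transformation step can be sketched in a few lines of Python. This is only an illustration: the base of the logarithm and the scaling to unit standard deviation are assumptions (the text specifies only a log transformation and a mean of zero), outlier removal is not shown, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev

def log_standardize(values):
    """Log-transform positive analyte values, then center and scale them.

    Assumptions: base-10 logarithm and scaling to a sample SD of 1.
    """
    logged = [math.log10(v) for v in values]
    mu = mean(logged)
    sd = stdev(logged)
    return [(x - mu) / sd for x in logged]

# Toy values (not study data): logs are 0, 1, 2, so z-scores are -1, 0, 1.
z_scores = log_standardize([1.0, 10.0, 100.0])
print(z_scores)
```

Because a z-score is unchanged by rescaling, the choice of logarithm base does not affect the standardized values.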

We performed a one-stage rather than a two-stage GWAS because 1) genome-wide data were available for all samples and 2) jointly analyzing the data from both stages of a two-stage GWAS has been shown to almost always provide greater power to detect genetic associations than analyzing the groups separately, even though a more stringent significance threshold is required11. To maximize statistical power, we therefore combined the two datasets and performed a joint one-stage GWAS with all 818 individuals from ADNI and KADRC (characteristics shown in Table 1). To verify our results, we followed up with additional analyses stratified by study and meta-analyzed the results from each dataset for each analyte; the p-values from the meta-analyses were similar to the joint GWAS p-values (Supplementary Table S3). To avoid spurious associations, we considered a single nucleotide polymorphism (SNP) a real signal only if its genome-wide significant association in the joint analysis met two additional criteria: 1) the SNP association had to be consistent between the two series, in the same direction and with similar effect size, which represents an internal replication (Supplementary Table S3) and 2) the association could not be confounded by AD status, a concern because both cohorts come from AD studies. In addition to using AD status as a covariate in our initial analyses, we performed separate GWAS on cases and controls and found no difference in effect size or direction, indicating that the associations found in the combined GWAS were not confounded by AD status (Supplementary Table S4).
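
For readers unfamiliar with combining per-study results, the following Python sketch shows one common way to meta-analyze two stratified estimates. The meta-analysis method is not specified in this section, so fixed-effect inverse-variance weighting is an assumption made for illustration, and the input values are hypothetical rather than taken from the study.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis.

    betas: per-study effect estimates; ses: their standard errors.
    Returns the pooled beta, its standard error and a two-sided p-value
    from the normal approximation.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p

# Hypothetical estimates for one SNP-analyte pair in two datasets.
beta, se, p = inverse_variance_meta(betas=[-0.50, -0.46], ses=[0.08, 0.10])
print(beta, se, p)
```

The pooled estimate lies between the two study estimates and is pulled toward the one with the smaller standard error.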

Table 1 Characteristics of ADNI and KADRC cohorts.

We decided to use the common threshold for genome-wide significance (p < 5.0 × 10−8) instead of p < 3.42 × 10−10 (Bonferroni multiple test correction taking into account SNPs and phenotypes) because the latter assumes that all the analytes are independent and uncorrelated. However, there is extensive evidence that this is not the case and in a recent study we demonstrated that some analytes are highly correlated12. Additionally, five of the associations in this study in the p = 5 × 10−8–3.42 × 10−10 range have been previously reported and others are located in receptors and genes known to regulate levels of the analyte (Table 2 and Supplementary Table S5), which indicates that these are real signals. We also found complex loci and potential pleiotropic effects, which support the evidence that not all of the SNPs and analytes act independently of others. These findings suggest that a multiple test correction threshold of p < 3.42 × 10−10 would be too stringent. For this reason we decided to report all the loci with p < 5.0 × 10−8, but we also highlight in Table 2 those that pass the p < 3.42 × 10−10 threshold.
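
The stricter threshold is consistent with dividing the conventional genome-wide threshold by the number of analytes that passed QC. A minimal Python sketch of that arithmetic, assuming all 146 analytes are treated as independent tests (the function and variable names are illustrative):

```python
# Conventional genome-wide significance threshold used in the text.
GENOME_WIDE_ALPHA = 5.0e-8
# Number of analytes that passed QC (see Methods).
N_ANALYTES = 146

def bonferroni_threshold(alpha, n_tests):
    """Divide alpha by the number of tests assumed to be independent."""
    return alpha / n_tests

study_wide = bonferroni_threshold(GENOME_WIDE_ALPHA, N_ANALYTES)
print(f"{study_wide:.2e}")  # 3.42e-10
```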

Table 2 Genome-wide significant results (cis = within 1MB of gene encoding protein).

Genome-wide association study results

After performing the linear regression with each analyte as a phenotype, there were 56 genome-wide significant loci for 47 analytes (Table 2). Twenty-eight of these associations have been reported in the literature previously and 28 (50%) were novel. Thirty-two of the 56 associations (9 novel) pass the p < 3.42 × 10−10 threshold.

Previously reported findings

Twenty-eight of our genome-wide signals replicated associations reported by 14 different genetic studies of plasma or serum protein levels in humans (Table 2 and Supplementary Table S5)6,9,13,14,15,16,17,18,19,20,21,22,23,24. Six of our most significant SNPs were identical to the previously reported SNPs and the remaining SNPs were in linkage disequilibrium (LD) with reported SNPs (Supplementary Table S5). Twenty-three of these 28 genome-wide loci had p < 3.42 × 10−10 in our study and five others were in the p = 5 × 10−8 to 3.42 × 10−10 range, indicating that signals in this range in our study constitute strong associations. Twenty-three of these previously reported loci are in cis (within 1MB of the gene that encodes the protein) and five are in trans (Table 2, Fig. 1 and Supplementary Figs S1–26). Twelve (52%) of the cis effects are coding variants (nine missense) and four of the trans effects are coding variants (three missense; Table 2). None of the trans effects are located in untranslated regions (UTR) but two of the analytes had cis effects that are in the UTR (CD40: 5′ UTR of CD40 and HCC4: 3′ UTR of CCL16). All of the trans effects are located within genes (four coding, one intronic) that have interactions with the analyte that are not known or well understood (Table 2). However, our results and the previously published studies suggest that these trans loci play an important role in regulating the levels of CA19-9, CEA, CRP, SELE and ACE in plasma15,17,21,24. More interestingly, some of these loci, like ABO or FUT2, are genome-wide significant for more than one analyte, which also indicates that these may constitute master regulatory signals (see Potential pleiotropy).
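
The cis/trans labels used above follow a simple positional rule. A minimal Python sketch, assuming the 1 MB window is measured from the boundaries of the encoding gene (the text does not state the reference point) and using made-up coordinates purely for illustration:

```python
CIS_WINDOW_BP = 1_000_000  # "within 1MB of the gene that encodes the protein"

def classify_effect(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                    window=CIS_WINDOW_BP):
    """Label a SNP-protein association as 'cis' or 'trans'.

    A SNP is cis if it lies on the same chromosome as the encoding gene
    and within `window` base pairs of the gene boundaries.
    """
    if snp_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= snp_pos <= gene_end + window:
        return "cis"
    return "trans"

# Hypothetical coordinates, not taken from the study.
print(classify_effect("17", 64_500_000, "17", 64_208_000, 64_225_000))
print(classify_effect("9", 136_100_000, "17", 64_208_000, 64_225_000))
```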

Figure 1

Manhattan and regional plots for associations with plasma levels of ApoH.

(a) Manhattan plot of −log10 p-values for association with plasma levels of ApoH; (b) Regional plot for genome-wide significant association on chromosome 17 with ApoH plasma levels; (c) Regional plot for genome-wide significant association on chromosome 18 with ApoH plasma levels; (d) Regional plot for genome-wide significant association on chromosome 9 with ApoH plasma levels; (e) Regional plot for genome-wide significant association on chromosome 17 after conditioning for rs2873966; (f) Regional plot for genome-wide significant association on chromosome 17 after conditioning for rs2873966 and rs17690171.

Novel findings

We found 28 loci associated with 25 analytes that have not been reported previously (Table 2, Fig. 1 and Supplementary Figs S8, S12, S21, S22, S27–36). Of these novel associations nine pass the p < 3.42 × 10−10 threshold (Table 2). All of the associations were highly consistent (same effect size or beta) between the two datasets which represents an internal replication (Supplementary Table S3) and were not confounded by AD status (Supplementary Table S4).

Five of our 28 novel findings were cis effects (one coding variant and four intergenic; Table 2): 1) rs926144, which is 29.7 KB from SERPINA1, was significantly associated with plasma levels of AAT (p = 4.71 × 10−12; Supplementary Fig. S27), 2) rs2015086, within 2 KB upstream of CCL18, was significantly associated with MIP1a levels in plasma (p = 2.56 × 10−15; Supplementary Fig. S38), 3) a missense variant in AGER (rs2070600) was associated with plasma RAGE levels (p = 1.86 × 10−11; Supplementary Fig. S40), 4) rs646776, which is located 33.7 KB from SORT1, was significantly associated with Sortilin plasma levels (p = 2.20 × 10−9; Supplementary Fig. S41) and 5) rs409336, located 3.7 KB from CXCL5, was significantly associated with ENA78 plasma levels (p = 1.11 × 10−8; Supplementary Fig. S30).

Twenty-three of our 28 novel findings were trans effects. Twelve analytes were associated with loci that contained only intergenic SNPs and eleven analytes (ANG2, BLC, CEA, F7, FGF4, GROa, MIP1b, MMP7, RAGE, THPO, TNC) were associated with SNPs in intronic regions in gene-rich areas. Interestingly, some of these loci contain intronic SNPs that are likely to be regulatory based on RegulomeDB25: SCARA5 (associated with TNC levels) and PARVG (associated with ANG2 levels) contain SNPs with RegulomeDB25 scores lower than 3 (Supplementary Table S6). Plasma MIP1b levels were also associated with a locus that contains SNPs that are likely to be regulatory. We found that rs145617407, located in an intron of CCR3, was significantly associated with MIP1b levels in plasma (p = 2.58 × 10−10) and this SNP is located less than 119 KB from CCR5, which is the receptor for CCL4/MIP1b (Supplementary Fig. S21).

GWAS Conditional on top hits revealed additional signals within same loci

We then performed conditional analyses to determine whether more than one signal exists at the same locus. When we added the most significant SNP to the linear regression model, five analytes (ApoH, CA19-9, FetuinA, IL6r and LPa) still showed independent and genome-wide significant SNPs at the same locus (Fig. 1, Table 3 and Supplementary Fig. S5, S13, S17 and S19). It is interesting to note that four of the five complex loci we found were in cis with the respective protein whereas the FUT2/FUT6/FUT3 locus was associated with CA19-9 plasma levels. Since we decided to use the traditional genome-wide p-value threshold (p < 5 × 10−8) for the conditional analyses, we may be missing some additional independent signals.
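
The conditioning step can be illustrated with a small pure-Python sketch. It assumes additive genotype coding (0, 1 or 2 copies of an allele), omits the other covariates and the p-value calculation for brevity, and uses toy data; the function names are illustrative and this is not the software procedure used in the study.

```python
def _residuals(y, x):
    """Residuals of y after least-squares regression on an intercept and x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [(yi - my) - slope * (xi - mx) for xi, yi in zip(x, y)]

def conditional_beta(phenotype, snp, top_snp):
    """Effect of `snp` on `phenotype` after adjusting for `top_snp`.

    By the Frisch-Waugh-Lovell theorem this equals the coefficient of `snp`
    in the regression phenotype ~ intercept + top_snp + snp.
    """
    ry = _residuals(phenotype, top_snp)
    rx = _residuals(snp, top_snp)
    return sum(a * b for a, b in zip(rx, ry)) / sum(b * b for b in rx)

# Toy genotypes (not study data) for the top SNP and a second SNP.
top = [0, 1, 2, 0, 1, 2, 0, 1]
second = [0, 0, 1, 1, 2, 2, 1, 0]
# Phenotype with two independent effects versus one driven by the top SNP only.
two_signals = [1 + 0.5 * t + 0.3 * s for t, s in zip(top, second)]
one_signal = [1 + 0.5 * t for t in top]
print(conditional_beta(two_signals, second, top))
print(conditional_beta(one_signal, second, top))
```

In the first toy phenotype the second SNP keeps its own effect after conditioning, whereas in the second phenotype its adjusted effect vanishes, which is the distinction the conditional analysis is designed to draw.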

Table 3 Plasma analyte levels associated with multiple loci.

After conditioning on rs2070633, located in an AHSG intron, we found that rs4917, a missense variant also located in AHSG, was still significantly associated with plasma levels of FetuinA (p = 7.27 × 10−9, original p = 2.61 × 10−42; Table 3 and Supplementary Fig. S13). After conditioning on both SNPs no additional signals were found. An intronic variant in IL6R, rs7526131, was still significantly associated with IL6r plasma levels after conditioning on rs12126142, also located in an intron of IL6R (p = 1.43 × 10−10, original p = 4.47 × 10−72; Table 3 and Supplementary Fig. S17). Plasma levels of LPa were significantly associated with rs783147, located in an intron of PLG 506 KB from LPA and after conditioning on this SNP an intronic variant of SLC22A1 approximately 0.4 MB from LPA (rs783147), was still significantly associated with LPa levels (p = 1.64 × 10−9, original p = 9.86 × 10−9; Table 3 and Supplementary Fig. S19).

We found two analytes (ApoH and CA19-9) for which the genome-wide significant locus contained up to three independent signals (Fig. 1, Table 3 and Supplementary Fig. S5). All three signals in the ApoH analyses contained missense variants located in APOH (rs52797880: I141T, p = 1.57 × 10−12; rs1801690: W335S, p = 5.15 × 10−9, original p = 2.77 × 10−11; rs8178847: R154H, p = 2.20 × 10−12, original p = 1.57 × 10−12; Fig. 1). As reported above, the initial signal in the CA19-9 GWAS contained a missense variant in FUT2. After conditioning on the most significant SNP (rs485073, p = 2.12 × 10−23) from the CA19-9 GWAS, the new signal contained a synonymous variant located in FUT6 (rs112313064, p = 3.79 × 10−26, original p = 7.46 × 10−23) and conditioning on the two SNPs resulted in a separate signal upstream of FUT3 (rs2306969, p = 2.78 × 10−9, original p = 6.11 × 10−23; Supplementary Fig. S5). All of these results indicate that these protein levels are highly regulated and that different and independent regulation mechanisms, even at the same locus, are in place: some mechanisms may act by affecting cleavage or receptor binding (non-synonymous variants) and others by regulating gene expression (non-coding variants).

Potential pleiotropy

In addition to finding that some proteins have complex regulation within the structural gene (or a different gene in the case of CA19-9), we also found potentially pleiotropic effects with one gene affecting more than one protein. Potential pleiotropic effects were found for three groups of analyte/associations even though the analyte levels were not correlated: ABO associated with plasma levels of SELE, ACE and vWF (p = 1.01 × 10−52, beta = –0.882; p = 1.90 × 10−8, beta = –0.352; p = 8.87 × 10−8, beta = 0.253 respectively; Table 4 and Fig. 2). ABO has been previously reported to be associated with ACE activity26 and SELE plasma and serum levels15,17. ABO has also been associated with vWF plasma levels and although the locus did not reach genome-wide significance in our analysis it was very close27.

Table 4 Potential pleiotropic associations.
Figure 2

Manhattan and regional plots for pleiotropic ABO variant associations with plasma levels of ACE, SELE and vWF.

(a) Manhattan plot of −log10 p-values for association with plasma levels of ACE; (b) Regional plot for genome-wide significant associations in ABO locus with ACE plasma levels; (c) Manhattan plot of −log10 p-values for association with plasma levels of SELE; (d) Regional plot for genome-wide significant associations in ABO locus with SELE plasma levels; (e) Manhattan plot of −log10 p-values for association with plasma levels of vWF; (f) Regional plot for associations in ABO locus with vWF plasma levels; rs687289 was close to genome-wide significance (p = 8.87 × 10−8).

FUT2 was associated with plasma levels of CA19-9 and CEA (p = 2.12 × 10−23, beta = −0.509; p = 4.07 × 10−16, beta = −0.406 respectively; Table 4 and Supplementary Fig. S5, S8); and the APOE region was associated not only with plasma levels of ApoE but also CRP (p = 2.76 × 10−26, beta = −0.594; p = 6.69 × 10−9, beta = −0.354 respectively; Table 4 and Supplementary Fig. S4, S10).

Interestingly none of these analyte pairs or trios are highly correlated (r < 0.25; Table 4), which again supports the idea that these loci (ABO, FUT2 and APOE-TOMM40 region) are truly master-regulatory regions, that protein levels are highly and complexly regulated and that studying the genetic architecture of biological traits can lead to a deeper knowledge of the biological processes.

Impact of these findings with complex diseases

Of the 56 loci that we found associated with plasma protein levels, 46 loci have also been reported to be associated with complex traits and diseases including coronary artery disease (ACE and SELE), stroke (ACE and SELE), various cancers (ACE, CA19-9, CEA, RAGE and SELE), age-related macular degeneration (ApoE, CFHR1 and CRP), periodontitis (ApoH), multiple sclerosis (BLC and CD40), inflammatory bowel disease (CD40 and ENA78) and Type 2 diabetes (IL13, MCSF and RAGE) (Table 5; see supplementary results for a complete description). As an example, the AGER variant rs2070600, which in our study was associated with plasma RAGE levels (p = 1.86 × 10−11) has been reported to be associated with pulmonary function28. A recent study of RAGE plasma levels suggests they are a promising biomarker for acute respiratory distress syndrome, supporting our hypothesis29.

Table 5 Joint GWAS top SNPs/genes related to disease based on NHGRI catalog.

Similarly our genetic analysis for BLC revealed a significant association with SNPs located in DDAH1 (rs7541151, p = 6.44 × 10−9; Table 2), a gene that has been associated with multiple sclerosis (MS). Interestingly BLC levels have recently been reported to be different between patients with MS and controls30, which further supports BLC as a potential biomarker.

Since levels of CD40 in plasma were associated with the CD40 locus and CD40 variants have been associated with MS in three independent GWAS30,31,32, we hypothesized that plasma levels of CD40 may also be associated with MS status. As a proof of concept, we used a Quantikine sandwich ELISA kit (R&D Systems cat #DCCD40) to measure plasma levels of CD40 in 20 individuals with relapsing remitting MS in remission at time of plasma collection (8 male, 12 female; mean age = 44.45 ± 15.51 years) and 20 healthy controls (8 male, 12 female; mean age = 41.84 ± 11.52 years; Supplementary Table S7). We used linear regression to determine if log values of plasma CD40 levels were significantly different between MS cases and controls, with age and gender as covariates. We found plasma levels of CD40 were significantly higher in MS cases (753.26 ± 235.71 pg/mL) than controls (603.02 ± 139.01 pg/mL; p = 0.041, beta = −1.837; Fig. 3), supporting our hypothesis.

Figure 3

Plasma levels of CD40 in MS cases versus controls.

More than half of the loci associated with plasma protein levels in our study have been previously reported to be associated with various complex diseases. Based on the current knowledge for RAGE and BLC and on the concept of Mendelian randomization, we hypothesize that these protein levels constitute informative biomarkers for these complex traits, although additional studies would be necessary to validate this hypothesis. More detailed information about potential novel biomarkers for complex traits is included in Supplementary Results and analyte abbreviations with full names are in Supplementary Table S8.

Discussion

GWAS of complex traits have been very successful in identifying novel loci associated with those traits, but these studies require extremely large sample sizes and in some cases it is difficult to interpret the results because the associations are with surrogate tag SNPs which may not be the causal SNPs. Many loci contain multiple genes which also makes it difficult to determine the causal gene or variant. Additionally some loci are located in non-protein coding regions where functional effects are poorly understood. Genetic analyses of biological traits may provide more power than traditional GWAS and may be more informative about the biological effects for specific loci. Using a more unbiased approach than previous genetic studies, we were able to replicate many previously reported associations with various plasma protein levels and uncover several novel associations that could warrant further research. The results from our careful analyses suggest that even though we utilized two datasets from Alzheimer’s disease studies there was no confounding effect due to disease status or dataset. Combining datasets from high-throughput technologies that deliver genome-wide genetic data and quantification of protein levels in a single procedure provides a great deal of power to analyses that may help researchers understand the biology of complex traits including the complex loci involved and pleiotropic effects.

Our results clearly indicate that the protein levels are highly and complexly regulated. We found master regulatory regions (pleiotropic; Table 4, Fig. 2 and Supplementary Fig. S4, S10) as well as several independent regulatory elements in the same locus for the same proteins (Table 3, Fig. 1 and Supplementary Fig. S5, S13, S17 and S19). We found protein levels associated with variants in or near the gene coding that protein (cis effects) as well as variants located elsewhere in the genome (trans effects) demonstrating that protein levels are not only affected by the genes that encode the protein but also by interaction with other proteins as in the case of ABO or FUT2 (Table 4).

Interestingly, we found that for almost half of the cis effects (13 out of 28), the association could be explained by a coding variant but for the trans effects most of the loci (24 out of 28) only contain regulatory variants (Table 2). Although these non-coding signals could be synthetic associations driven by low-frequency variants, our results and those recently published by ENCODE and the GTEx consortium suggest that those associations are likely to affect gene expression33,34. For the same reason, an association in cis (more frequently due to a non-synonymous variant) is more likely to show a larger effect size and to be easier to identify in a genetic study than a trans signal, which is more likely to affect gene expression through regulation.

Table 2 shows that most of the trans effects associated with plasma protein levels had less significant p-values and lower betas than most of the cis effects. This could explain why only five of the trans effects we found were previously reported while the other 23 were novel. It is of vital importance to identify trans effects because that will help us to identify novel biological interactions and pathways. Of the 28 trans effects we found in our study, only one corresponded to a protein that constituted the receptor of the studied analyte or a gene known to interact directly with the analyte (rs145617407 located less than 119 KB from CCR5 which is the receptor for CCL4/MIP1b)35. However, the fact that the associations of SELE, ACE and vWF with the ABO locus or CA19-9 and CEA with FUT2 have been identified in other studies indicates that these signals are real and some of these novel loci may be implicated in regulating the levels of one or more proteins. Additional work is needed because currently it is not clear how ABO regulates plasma levels of SELE, ACE and vWF or how FUT2 regulates CEA and CA19-9 levels. For the novel loci this can be more complicated because several signals are located in very gene-rich regions and several genes could drive the association (Fig. 1 and Supplementary Fig. S1, S6, S8, S10, S21, S24, S28, S29, S33, S36–S37, S42, S44, S46).

Another important aspect of this study is its implications for complex traits. Proteins play a key role in many complex traits, so understanding the genetic variations associated with protein levels is important in understanding the biological basis of these traits. We used the concepts of Mendelian randomization, our data and the data from the NHGRI GWAS catalog to identify genetic regions that are genome-wide significant for various analyte levels as well as previously associated with complex traits. While most of these loci have been associated with complex traits, the associations of most of the plasma analytes with the complex traits have not been previously reported. Our results suggest that some of these plasma protein levels could be novel biomarkers or even endophenotypes for these complex traits.

One example, supported by previous research, shows how our approach can inform the understanding of potential pleiotropic effects in promising biomarkers for complex diseases: rs485073 in FUT2 was associated in our study with plasma levels of both CEA and CA19-9, which are only weakly correlated in plasma (r = 0.166, p = 2.98 × 10−6). This potential pleiotropy strongly suggests that rs485073 is part of a master regulatory region. Plasma levels of CEA and CA19-9 could therefore be important for understanding gastric cancer, because FUT2 variants have also been associated with gastric cancer risk36. This is further supported by the fact that both CEA and CA19-9 have been reported as FDA-approved biomarkers for other types of cancer37.

We found several promising plasma biomarkers for complex traits including IL13, ENA78, BLC and CD40. Based on our results, plasma levels of IL13 may be informative in Type 2 diabetes research. We found rs7433647, located near UBE2E2, was associated with IL13 plasma levels (p = 1.21 × 10−8). UBE2E2 has previously been associated with Type 2 diabetes in a large GWAS meta-analysis of more than 26,000 cases and 83,000 controls with varied ancestry38. A recent study using a mouse model for Type 2 diabetes suggests that expression of IL13 plays a key role in adipose tissue inflammation and insulin resistance, further supporting the idea that IL13 levels may be important in studying Type 2 diabetes39. ENA78/CXCL5 expression is elevated in the inflamed tissues of patients with rheumatoid arthritis, ulcerative colitis and Crohn’s disease40,41. Several studies have reported association of CXCL5 variants with inflammatory bowel disease and metabolite levels42,43. In our study rs409336, near the CXCL5 gene, showed the strongest effect on plasma ENA78/CXCL5 levels. Because of the similarity in genetic influences on ENA78/CXCL5 levels and inflammatory bowel disease, it is possible that these traits share a common pathophysiological pathway and our findings support further investigation of the involvement of ENA78/CXCL5 in the etiology of inflammatory bowel disease.

We found two promising plasma protein biomarkers for MS: BLC and CD40. In our study rs7541151 in DDAH1 was associated with plasma BLC levels. DDAH1 is responsible for the degradation of ADMA into citrulline and dimethylamine and previous studies showed an association of DDAH1 variants with MS and ADMA levels30,44. Previous studies indicate that CSF levels of BLC/CXCL13 may be an informative biomarker for studying treatment effects in MS45,46,47. Our results indicate plasma BLC/CXCL13 levels may be informative as well. The CD40 locus has been associated with MS30,31,32 but our study appears to be the first to associate CD40 plasma levels with CD40 variants. Plasma levels of CD40 have not previously been reported as a potential biomarker for MS, but our preliminary data suggest that they may be one. Although we did find a significant difference in CD40 levels in plasma between MS cases and controls, our sample size was small and only contained patients in remission, so it would be prudent to evaluate a larger, more varied cohort to determine the possible utility of plasma levels of CD40 as an MS biomarker.

Methods

Ethics Statement

The Institutional Review Board (IRB) at the Washington University School of Medicine in Saint Louis approved the study. Research was carried out in accordance with the approved protocol. A written informed consent was obtained from participants and their family members by the Clinical Core of the Charles F. and Joanne Knight Alzheimer’s Disease Research Center (Knight-ADRC). The approval number for the Knight-ADRC Genetics Core family studies is 93-0006. The MS and control patients have signed the consent for the MS repository, approval number 201104379.

Cohort descriptions

Demographics of the samples included in this manuscript are reported in Table 1.

Washington University Knight Alzheimer’s Disease Research Center (KADRC) cohort

The KADRC sample included 124 AD cases and 188 cognitively normal controls. These individuals were evaluated by Clinical Core personnel of Washington University. Cases received a clinical diagnosis of Alzheimer’s disease in accordance with standard criteria and dementia severity was determined using the Clinical Dementia Rating (CDR)48. Plasma from all KADRC samples was collected in the morning after an overnight fast, immediately centrifuged and stored at −80°C until assayed according to standard procedures49.

Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort

The ADNI sample included 434 AD cases and 72 cognitively normal controls. Data used in the preparation of this article were obtained from the ADNI database (http://adni.loni.usc.edu/). See Supplementary Methods for further information about ADNI’s methods and for up-to-date information see http://www.adni-info.org/. Plasma was collected in the morning after an overnight fast, immediately centrifuged and stored at −80°C until assayed as described previously9. Genetic and phenotypic data for 506 samples were available for this study.

Genotyping and Quality Control

The ADNI protocol for collecting genomic DNA samples has been previously described50. All ADNI samples were genotyped using the Illumina Human610-Quad BeadChip, which contains over 600,000 SNP markers. KADRC samples were genotyped with the Human610-Quad BeadChip or the Omniexpress chip51. Prior to association analysis, all samples and genotypes underwent stringent QC. Genotype data were cleaned using PLINK v1.07 (http://pngu.mgh.harvard.edu/purcell/plink/)52 by applying a minimum call rate for SNPs and individuals (98%) and a minimum minor allele frequency (MAF) of 0.02. SNPs not in Hardy-Weinberg equilibrium (P < 1 × 10−6) were excluded. Gender identification was verified by analysis of X-chromosome SNPs. We tested for unanticipated duplicates and cryptic relatedness (Pihat ≥ 0.5) using pairwise genome-wide estimates of proportion identity-by-descent using PLINK v1.07 (http://pngu.mgh.harvard.edu/purcell/plink/)52. When a pair of identical samples or a pair of samples with cryptic relatedness was identified, the sample with the higher number of SNPs that passed QC was prioritized. EIGENSTRAT53 was used for each cohort separately to calculate principal component factors for each sample and confirm the ethnicity of the samples. The 1000 genomes data (June 2011 release) and BEAGLE v3.3.154 were used to impute up to 6 million SNPs. SNPs with a BEAGLE R2 < 0.3, a MAF < 0.025, a call rate lower than 95%, a Gprobs score lower than 0.90 and those out of Hardy-Weinberg equilibrium (p < 1 × 10−5) were removed. After imputation, 5,815,690 SNPs passed the QC process.
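
The call-rate and allele-frequency filters for genotyped SNPs can be expressed compactly. The Python sketch below is an illustration rather than the software used in the study; it assumes genotypes coded as 0, 1 or 2 copies of one allele with `None` for a missing call, and it leaves out the Hardy-Weinberg, sex and relatedness checks. All function names are hypothetical.

```python
def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)

def passes_snp_qc(genotypes, min_call_rate=0.98, min_maf=0.02):
    """Apply the 98% call rate and MAF 0.02 filters described above."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)

# Toy SNPs (not study data).
common_snp = [0] * 90 + [1] * 10             # MAF 0.05, fully called
rare_snp = [0] * 99 + [1]                    # MAF 0.005
sparse_snp = [0] * 80 + [1] * 15 + [None] * 5  # call rate 0.95
print(passes_snp_qc(common_snp), passes_snp_qc(rare_snp), passes_snp_qc(sparse_snp))
```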

Assessment of Analyte Profiles and Quality Control

A set of 0.5 mL EDTA plasma samples from ADNI and KADRC participants was selected and shipped to Myriad Rules Based Medicine, Inc. (Myriad RBM, Austin, TX). A set of 190 protein levels from plasma for each selected individual was measured by multiplex immunoassay on the Human DiscoveryMAP panel v1.0 (https://rbm.myriad.com/products-services/humanmap-services/human-discoverymap/) using the Luminex100 platform by RBM. Samples with more than 10% of missing data across analytes were removed, then analytes were excluded if they had missing data for 10% of the samples or values were below the detection limit, in either of the studies. After the QC step, a total of 146 analytes were included in each dataset of the present study.
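
The two-step missingness filter can be sketched as follows. This Python example is illustrative only: it assumes missing values are stored as `None`, applies the analyte cut-off as "more than 10%" (the text leaves the exact boundary open), and does not model the detection-limit rule or the requirement that an analyte pass in both studies.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    values = list(values)
    return sum(v is None for v in values) / len(values)

def filter_missing(table, max_missing=0.10):
    """Drop sparse samples first, then sparse analytes.

    table: dict mapping sample ID to a dict of analyte name -> value or None.
    """
    samples = {sid: row for sid, row in table.items()
               if missing_fraction(row.values()) <= max_missing}
    analytes = sorted({name for row in samples.values() for name in row})
    kept = [a for a in analytes
            if missing_fraction(row.get(a) for row in samples.values()) <= max_missing]
    return {sid: {a: row.get(a) for a in kept} for sid, row in samples.items()}

# Toy table (not study data): 11 samples, 10 analytes.
names = [f"a{i}" for i in range(10)]
table = {f"s{i}": {a: 1.0 for a in names} for i in range(10)}
table["s0"]["a9"] = None          # a9 missing in 2 of 10 retained samples
table["s1"]["a9"] = None
table["s_bad"] = {a: 1.0 for a in names}
table["s_bad"]["a0"] = None       # s_bad is missing 2 of 10 analytes
table["s_bad"]["a1"] = None
cleaned = filter_missing(table)
print(len(cleaned), sorted(next(iter(cleaned.values()))))
```

The order matters: removing sparse samples first means analyte missingness is computed only over samples that are retained.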

Statistical analyses

For each study, prior to the analyses, all analyte values were log-transformed and standardized so that the mean for each analyte was equal to zero, and outliers were removed as previously described12,51,55,56,57,58,59. Log-transformed, standardized values were tested for significant deviations from a normal distribution using the Shapiro-Wilk test. We performed a single variant analysis for each analyte using PLINK v1.9 (http://pngu.mgh.harvard.edu/purcell/plink/)52, including age, gender, AD status and the first 2 principal components as covariates. The significance threshold for the joint analyses was defined as p < 5.0 × 10⁻⁸, the commonly used threshold thought to be appropriate for the likely number of independent tests with Bonferroni correction. To approximate an internal replication, all SNPs that passed the genome-wide significance threshold had to pass the threshold p < 0.05 in single variant analyses of the individual datasets and had to have similar effect sizes in the same direction. To ensure that results were not confounded by AD status, single variant analyses were performed on all of the AD cases from both datasets separately from all of the controls from both datasets. All genome-wide significant SNPs from the joint analyses also had to have similar effect sizes in the same direction in the case-control stratified analyses. QQ plots were generated for each analysis to illustrate the distribution of the observed and expected p-values for all eligible SNPs60. Regional plots showing LD and the location of nearby genes were generated for the top ranking SNPs for each analyte using LocusZoom v1.1, build hg19/1000 Genomes Mar 2012 EUR (http://csg.sph.umich.edu/locuszoom/)61. If more than one significant SNP clustered at a locus, the SNP with the smallest p-value was reported as the sentinel marker.
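Two steps from the paragraph above, the analyte transformation and the internal-replication filter, can be sketched as follows. Both functions are hypothetical illustrations, not the code used in the study. The text does not state the log base or whether values were scaled to unit variance, so base 10 and z-scoring are assumptions here; "similar effect sizes" is not quantified in the text, so the sketch checks only the direction of effect.

```python
import math
import statistics

GENOME_WIDE_P = 5.0e-8  # joint-analysis threshold from the text
NOMINAL_P = 0.05        # per-dataset threshold from the text


def log_standardize(values):
    """Log-transform positive values, then centre them on a mean of zero.

    Assumptions: base-10 logarithm and division by the sample standard
    deviation; the source only states that the mean was set to zero.
    """
    logged = [math.log10(v) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)
    return [(x - mean) / sd for x in logged]


def passes_internal_replication(joint_p, dataset_results):
    """Apply the significance and direction criteria to one SNP.

    dataset_results: list of (beta, p_value) tuples, one per dataset.
    """
    if joint_p >= GENOME_WIDE_P:
        return False
    if any(p >= NOMINAL_P for _, p in dataset_results):
        return False
    # All datasets must agree on the sign of the effect estimate.
    directions = {beta > 0 for beta, _ in dataset_results}
    return len(directions) == 1
```

The same check could be applied to the case-only and control-only results by passing those two `(beta, p_value)` pairs instead of the per-dataset ones, dropping the nominal p-value requirement if only direction and effect size are of interest.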
All analyses were performed using BEAGLE v3.3.154, EIGENSTRAT53, SAS v9.2 for Linux (copyright © 2008 by SAS Institute Inc) and PLINK v1.07 and v1.9 (http://pngu.mgh.harvard.edu/purcell/plink/)52 software.

Meta-analyses

We performed the single variant analyses as described above for ADNI and KADRC separately. We used METAL (version released 2011-03-25, http://www.sph.umich.edu/csg/abecasis/Metal/index.html)62 to perform meta-analyses of the two datasets for each analyte by combining p-values across studies, weighting each study by its sample size.
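Combining p-values with sample-size weights corresponds to a weighted Z-score approach: each study's p-value is converted to a signed Z-score, weighted by the square root of its sample size, and the weighted sum is rescaled. The sketch below illustrates that calculation only; it is not METAL itself, the function name is hypothetical, and the sample sizes in the usage note are invented.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def sample_size_weighted_meta(studies):
    """Combine two-sided p-values across studies, weighting by sample size.

    studies: iterable of (p_value, effect_direction, n), where
             effect_direction is positive or negative (only the sign is
             used) and n is the study sample size.
    Returns (z_meta, p_meta).
    """
    weighted_sum = 0.0
    sum_sq_weights = 0.0
    for p_value, effect_direction, n in studies:
        # Two-sided p-value -> |Z|, then restore the sign of the effect.
        z = -_STD_NORMAL.inv_cdf(p_value / 2.0)
        if effect_direction < 0:
            z = -z
        weight = sqrt(n)
        weighted_sum += z * weight
        sum_sq_weights += weight * weight
    z_meta = weighted_sum / sqrt(sum_sq_weights)
    p_meta = 2.0 * _STD_NORMAL.cdf(-abs(z_meta))
    return z_meta, p_meta
```

For example, `sample_size_weighted_meta([(0.01, +1, 500), (0.01, +1, 300)])` yields a combined p-value smaller than either input, whereas flipping the sign of one study pulls the combined Z-score toward zero.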

Conditional analyses

To identify additional independent signals in a locus we conducted conditional analyses. We performed a series of sequential conditional analyses by adding the most strongly associated SNP into the regression model as a covariate and testing all remaining regional SNPs for association. This approach was used to determine additional secondary signals and was performed by adding SNPs one at a time until no remaining SNP reached significance. Consistent with the locus-specific analysis, statistical significance for the conditional analysis was defined at p < 5.0 × 10⁻⁸.
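The sequential procedure above amounts to a simple loop around the association test. The sketch below shows only that control flow: `run_association` is a hypothetical callable standing in for the regression (run in PLINK in the study), which takes the SNPs currently used as covariates and returns a p-value for each remaining regional SNP.

```python
def sequential_conditional_analysis(run_association, threshold=5.0e-8):
    """Return the independent signals at a locus, strongest first.

    run_association: callable taking a tuple of conditioning SNP IDs and
        returning a dict mapping each tested SNP ID to its p-value from a
        model that includes the conditioning SNPs as covariates.
    """
    independent = []
    while True:
        p_values = run_association(tuple(independent))
        # Ignore SNPs that are already in the model as covariates.
        candidates = {
            snp: p for snp, p in p_values.items() if snp not in independent
        }
        if not candidates:
            break
        top_snp = min(candidates, key=candidates.get)
        if candidates[top_snp] >= threshold:
            break  # no remaining SNP reaches significance
        independent.append(top_snp)
    return independent
```

The first iteration runs with no conditioning SNPs, so the first element of the returned list is the sentinel marker of the unconditional analysis and any later elements are secondary signals.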

Annotation of GWAS hits

All significant GWAS SNPs were taken forward for functional annotation. We used SNPnexus (http://www.snp-nexus.org), build GRCh37/hg1963, and ANNOVAR version 2015-03-2264 to perform SNP annotation and to identify putative functional SNPs. All significant GWAS SNPs were also examined for potential regulatory functions using RegulomeDB (http://regulome.stanford.edu/)25. We searched the National Human Genome Research Institute’s (NHGRI) catalog of genome-wide association studies to identify SNP-trait associations for selected analytes.

Additional Information

How to cite this article: Deming, Y. et al. Genetic studies of plasma analytes identify novel potential biomarkers for several complex traits. Sci. Rep. 6, 18092; doi: 10.1038/srep18092 (2016).