Strong effects of genetic and lifestyle factors on biomarker variation and use of personalized cutoffs

Ideal biomarkers used for disease diagnosis should display deviating levels in affected individuals only and be robust to factors unrelated to the disease. Here we show the impact of genetic, clinical and lifestyle factors on circulating levels of 92 protein biomarkers for cancer and inflammation, using a population-based cohort of 1,005 individuals. For 75% of the biomarkers, the levels are significantly heritable and genome-wide association studies identifies 16 novel loci and replicate 2 previously known loci with strong effects on one or several of the biomarkers with P-values down to 4.4 × 10−58. Integrative analysis attributes as much as 56.3% of the observed variance to non-disease factors. We propose that information on the biomarker-specific profile of major genetic, clinical and lifestyle factors should be used to establish personalized clinical cutoffs, and that this would increase the sensitivity of using biomarkers for prediction of clinical end points.

A number of protein biomarkers are used for diagnosis and management of cancers and other diseases. Examples include prostate-specific antigen 1 used to screen for prostate cancer, the ovarian cancer-related tumour marker CA125 (ref. 2) and IL-6, which is a drug target in rheumatoid arthritis (RA) 3 . Ideal biomarkers for early diagnosis should be uniquely present, or overexpressed, in the malignant tumour or blood and not influenced by confounding factors. Most current biomarkers have a function in the normal cell, taking part in, for example, signalling pathways, controlling growth, apoptosis and/ or inflammation 4 . These are not uniquely expressed in the malignant tissue and their expression level is affected by a number of factors, such as the individual's genetic and physical constitution, lifestyle and medication. A detailed understanding of potential confounding factors and their effect size is therefore a necessary prerequisite in the evaluation of the rapidly growing number of candidate biomarkers 5 . The discovery of putative biomarkers for early identification and management of cancer has been greatly facilitated by high-throughput, genome-wide assays. Gene expression analyses have discovered numerous genes that are differentially expressed between malignant and benign tissues 6 , but few have proven suitable as biomarkers, mainly because the mRNA levels do not correlate well with protein abundance 7 . Large-scale studies of protein abundance, on the other hand, have been hampered by lack of high-throughput methods. High-resolution mass spectrometry can be used to examine the underlying genetic contribution to profiles of circulating proteins and effect of environmental covariates, but the resolution is limited by the peptide spectra used and detection sensitivity 8 . An alternative is to use antibody-based measurements, which targets individual, preselected, sets of proteins.
Here we aim to understand the factors that influence normal variation in plasma levels of established and potential biomarkers for cancer, autoimmune diseases and inflammation with the specific goal to facilitate the establishment of individualized clinical cutoffs. To this end, we use the highly sensitive and specific proximity extension assay (PEA) 9 to estimate the abundance of 92 established or potential biomarkers in plasma from 1,005 individuals from a longitudinal cross-sectional population-based study in Sweden. The biomarkers we analyse here constitute a research panel directed against multiple cancers and also contain proteins implicated in autoimmune diseases such as RA and Graves' disease. PEA combines two dedicated antibodies with a real-time quantitative PCR (qPCR) reaction to achieve high specificity and a wide dynamic range. This technology can be multiplexed without introducing crosstalk, while still maintaining its high specificity and sensitivity 10 . We first determine the effect of a wide range of clinical variables and lifestyle factors, including age, sex, blood pressure, blood group or body mass index (BMI), medication and smoking, on biomarker levels. Then we study the heritability of each biomarker, and by using high-resolution genetic singlenulecotide polymorphism (SNP) array data and whole-exome sequencing, we perform a genome-wide association study (GWAS) for each biomarker. This study is the first to measure biomarker abundance on a large scale using a single technology in a general population, in order to identify contributing factors to normal variation. To our knowledge, only one previous study has studied the genetic association of multiple proteins in the general population. Melzer et al. 11 investigated 42 proteins using a variety of assays, prohibiting protein-to-protein comparisons, and also did not investigate the protein-specific profiles of covariates. Here by integration of genetic, clinical and lifestyle data, we identify the set of biomarker-specific factors that can be used to determine appropriate individual clinical cutoffs, and thereby enable a more efficient use of each biomarker in personalized cancer management.

Results
Biomarker measurements. The abundance of 92 proteins (Supplementary Data 1), representing a panel of established and potential biomarkers for cancer and inflammation, were measured in blood plasma of 1,005 individuals from the Northern Sweden Population Health Study (NSPHS), using PEA and qPCR. A total of 77 of the proteins had levels above the detection limit in at least 80% of our samples, with 91.3% (70,651 of 77,385) of qPCR reactions being successful. In the remaining 15 proteins, 96.8% (14,598 of 15,075) of the protein levels were below the detection limit. Also, 96.5% (970 of 1,005) of our samples passed quality control on an individual level. The abundance and distribution of the normalized measurements (delta delta Cq (ddCq)-values) of all the proteins in all samples are illustrated in Fig. 1a, with estimates under the detection limits coloured white. Details on normalization and initial quality control are given below in the Methods section. The proteins with little or no measurable abundance in our samples were: stromelysin-1, GM-CSF (granulocyte-macrophage colony-stimulating factor), estrogen receptor, CA242 (cancer antigen 242), IL-2 (interleukin-2), epiregulin, betacellulin, IL-4, interferon-g, IL-7, TNF (tumour necrosis factor), CEA (carcinoembryonic antigen-related cell adhesion molecule 5), MYD88 (myeloid differentiation primary response protein MyD88), mucin-16 and regenerating islet-derived protein 4. It cannot be ruled out that storage-time and protein degradation could be an influencing factor for these 15 proteins, and previous studies have quantified this specifically for CEA 12 .
Epidemiological associations. To study the effect of clinical and lifestyle factors, we selected 158 phenotypic covariates, including age, sex, blood pressure, BMI, tobacco use, medication, lifestyle (occupation) and sample collection round (2006 or 2009) from the comprehensive set of clinical data available for NSPHS. A multiple linear regression model showed a total of 18 phenotypic covariates to have a significant effect (P-valueo0.05, Bonferroni adjusted) on one or more of 52 of the 77 proteins (Table 1). Factors such as age or weight influenced a broad range of proteins, whereas medication affected specific proteins (Fig. 1b). Notably, smoking affected two proteins, WFDC2 (WAP four-disulfide core domain protein 2) and IL-12, whereas the traditional Swedish moist tobacco product, 'snus' did not have any significant effects, in line with a previous study on effects of tobacco use on DNA methylation 13 . We also found large effects (nominal P-value ranging from 1.8 Â 10 À 4 to 2.3 Â 10 À 7) of ABO blood group on three proteins; E-selectin, PECAM-1 (platelet endothelial cell adhesion molecule) and TIE2 (angiopoietin-1 receptor). The connection between E-selectin and blood groups is known 14,15 , but the effect on PECAM-1 and TIE2 has not been described previously. The medication in NHPHS had been investigated using a questionnaire and the reported medications were annotated using the Anatomical Therapeutic Chemical (ATC) classification system. Among the commonly used medications, dihydropyridine derivatives (ATC: C08CA, 54 users), often used to treat hypertension, were correlated to increased IL-6 levels, whereas glucocorticoids (ATC: R03BA, 26 users) lowered both Basigin and hepatocyte growth factor (HGF) receptor levels. Apart from C08CA, no other hypertensive treatment was correlated with high IL-6 levels. Interestingly, the usage of selective b-2-adrenoreceptor agonist (ATC: R03AC, 13 users), which is commonly found in asthma inhalators, decreased the level of circulating vascular endothelial growth factor D (VEGF-D), which is implicated in the metastasis of non-small lung cancer 16,17 . A detailed description of all investigated covariates and their association with protein levels is given in Supplementary Data 2. The largest fraction of variance explained by a single clinical or environmental covariate was age, which accounted for 27% of the variation seen for WFDC2. The influence on WFDC2 of age and smoking has previously been reported 18 , but we found that the fraction of variance explained by smoking in our data to be only 1.7%, which is much less than for systolic blood pressure (14.3%) or loop-diuretics (ATC: C03CA, plain sulfonamides, 7.2%). However, these covariates are not necessarily independent as blood pressure and use of medication is related to age. Correlations between biomarkers. Inter-biomarker correlation was investigated using abundance levels adjusted for significant clinical and lifestyle covariates. These was then rank-transformed into normally distributed values and used to identify 12 pairs with a Spearman's Rho R 2 greater than 0.5 (Fig. 1c). The highest correlation was found between CASP-3 (caspase-3) and CD69 (early activation antigen CD69; R 2 ¼ 0.85). CASP-3 was also highly correlated with epidermal growth factor (R 2 ¼ 0.81), which in turn was highly correlated with CD69 (R 2 ¼ 0.78). The strong correlation between some of the biomarkers does not appear to be reflected at the transcription levels. For instance, the Illumina Body Map [19][20][21] suggests that CD69 and caspase-3 both are expressed in leukocytes, lymph nodes and adrenal glands (for example, 3 of 16 investigated tissues). In data from leukocytes of 80 controls 22, there was only a weak correlation between the expression levels of CD69 and CASP-3 (R 2 ¼ 0.13), suggesting that the high correlation observed at the protein level is either because of post transcriptional regulation, for example, epigenetic regulation, or owing to expression patterns in distinct cell types. Several of the 12 pairs that were highly correlated were proteins with similar functions, such as C-X-C motif chemokine (CXCL)-9, -10, -11, and TNF-R1 and TNF-R2, whereas in other cases apparently unrelated proteins were highly correlated. These correlations may reflect as yet unknown patterns of co-regulation, and bring into question their value as independent biomarkers.
Heritability and genetic association. All 970 individuals' samples that passed the quality control (QC) were used to estimate the heritability for the 77 proteins with measurable levels by evaluating the co-segregation of the protein levels with the relatedness among individuals using a polygenic model (see Methods for details). In 75% (58 out of 77) of the proteins, the levels were found to be heritable (Bonferroni-adjusted P-value o0.05), with heritability ranging from 0.19 to 0.78 and the highest heritability for CCL24 (C-C motif chemokine 24; Supplementary Data 2). Thus, for a majority of the protein biomarkers, circulating levels are significantly affected by the individual's genetic constitution. To determine the nature of the genetic effects on protein abundance, we performed association analyses using over 4.8M SNPs and INDELs identified by direct genotyping and whole-exome sequencing, followed by highquality imputation. In this analysis, each of the 77 proteins was adjusted for the significant clinical and lifestyle variables (Table 1) and the samples were split into a discovery and a replication cohort based on sample collection round (see Methods for details). In the discovery phase, we identified 15 proteins with genome-wide significant hits (nominal P-value down to 1.1 Â 10 À 40 , Table 2), employing a Bonferroni-corrected P-value cutoff of 0.05. Of these, 14 had at least one replicated association (nominal P-value down to 1.1 Â 10 À 20 , Table 2). In all, 175 genome-wide significant hits were detected in the discovery phase, out of which 101 replicated. A combined analysis of all individuals revealed a total of 226 genome-wide significant hits in 14 proteins, with P-values down to 4.4 Â 10 À 58 , and a single marker explaining as much as 26.6% of the phenotypic variation seen after adjusting for the significant clinical and lifestyle factors ( Table 2). A detailed description of each of the 226 hits, including overlaps with previous associations with any phenotype or trait, is given in Supplementary Data 3. IL-6RA (IL-6 receptor subunit alpha) showed the strongest association and the association was caused by one or very few SNPs located in the gene that encodes the respective protein, similar to the case for the majority of the biomarkers (Fig. 2a). Conditioning on the top-hit revealed that four of the proteins, CCL24, MIC-A  Age  24  Up: ADM, CTSD, CXCL9, CXCL10, CXCL11, ErbB4, FAS, Flt3L, GDF-15, WFDC2, HGF, IL-8, IL-2-RA,  KLK6, hK11, MCP-1, OPG, PSA, TF, TNF-R1, TNF- (major histocompatibility complex class I polypeptide-related sequence A), CXCL5 and Ep-CAM (epithelial cell adhesion molecule), had hits independent of the highest-ranking SNP (Table 3). For CXCL5 (Fig. 2b) and Ep-CAM, the second SNP was located on a different chromosome, whereas for CCL24 ( Fig. 2c) and MIC-A, the second SNPs were located close (o40 kb, Table 3) to the first hit. The second SNP for Ep-CAM explained 6.5% of the variance of the unadjusted phenotype, as compared with 4.9% for the top-ranking SNP. For the other three proteins, the fraction of variance explained by the second-ranking SNPs was small compared with the top-ranking SNP. We compared our 226 hits with eQTLs as reported by the NCBI's eQTL database and found overlapping SNPs in 11 cases. These were reported for IL-17RB (one SNP) and CCL24 (one SNP) in liver 23 and for MIC-A (nine SNPs) in lymphoblastoids 24 (Supplementary Data 3). As expression is cell type specific and eQTL studies only exist for a limited set of tissues, the number of SNPs found here to be eQTL is likely to be an underestimate. For two of the proteins (CCL19 and E-selectin), the genome-wide significant hits were located at other loci than the one coding for the protein ( Table 3). The top hits for CCL19 were located in the major histocompatibility complex class II gene cluster, encoding molecules present on antigen-presenting cells and B-cell lymphocytes. CCL19 is a chemokine implicated in inflammatory and immunological responses, but also in normal lymphocyte recirculation and homing. Higher serum levels of CCL19 have been associated with poor prognostics of AIDS patients 25 . For E-selectin, the circulating level is known to be affected by ABO blood group.
Here, even after correction for blood group at the A/B/0-level, the top hits in the GWAS were located within the ABO-gene, determining the blood group (Fig. 2d), with our top hit (rs507666) being a perfect tag SNP for the A1 subtype 26 , suggesting that the specification of the A group into A1 and A2 is involved. Our dependency of the E-selectin levels on ABO status is consistent with the pattern described earlier 15 , where individuals with the O blood type have the highest levels. This is in contrast to the patterns for TIE2 and PECAM-1, where individuals carrying the B or AB blood group have the highest values ( Supplementary Fig. 1). For the other proteins (Ep-CAM, CCL19 and CXCL5), we found no evidence such as eQTLs or common pathways linking the loci that did not code the protein to the gene coding the protein. In summary, for a large number of the biomarkers, significant genetic effects on protein levels could be identified.
Personalized biomarker-specific covariate profiles. The relative importance of individual genetic, clinical and lifestyle factors on the abundance differed dramatically between the 77 biomarkers ( Fig. 3a). Some biomarkers were affected by strong genetic factors, whereas others mainly by environmental or clinical factors. These variables are not always independent, such as blood pressure and use of medication, which are both related to age. This can be seen in that the total fraction of observed variance, as determined by a combined model including all 158 covariates plus the top-ranking SNP and the top-ranking SNP from the   (Fig. 3b), whereas the sum of the explained variance by individual covariates in some cases reached above 1 (Fig. 3a). Figure 3 illustrates the main results of the study. Most of the 77 biomarkers showed large variation in abundance between individuals but they differed considerably with regard to the specific genetic, clinical or lifestyle factors involved. At one extreme, IL-6RA levels were affected most strongly by the individuals' genotype and only a very small fraction of the variance was explained by other covariates, with BMI being the strongest (1.0%). Even assuming that all 158 covariates, besides the top ranking SNP, contributed independent effects, the sum of the fraction of the variance of IL-6RA levels explained by these factors was less (21.0%) than the single genetic effect (21.3%). At the other end of the spectrum, HGF did not show a significant heritability, and none of the genetic markers reached genome-wide significance. However, 17 other covariates were nominally significant (P-valueo0.05) for HGF and 4 (weight, sample round, systolic blood pressure and age) remained significant after correction for multiple testing. These covariates accounted for 3.3%, 2.9%, 14.1% and 19.3%, respectively, of the percentage-of-variance-explained. In addition, the use of platelet aggregation inhibitors (ATC: B01AC) and loop-diuretics (ATC: C03CA) explained 5.7% and 7.1% of the variance observed in the unadjusted ddCq-values, respectively, whereas the top ranking SNP only accounted for 1.6%. In the middle part of the distribution in Fig. 3a, we find biomarkers that were less affected by the genetic, clinical or environmental factors studied, possibly reflecting limited non-disease-related variability. Information on the set of important variables for each biomarker can be used to reduce the non-disease-related variation. For instance, soluble CXCL10, which shows elevated levels in patients with a number of autoimmune-related diseases 27 , has previously been shown to be associated with systolic blood pressure 28 . Here, we confirm the correlation with systolic blood pressure, which explains 5.6% or the variability, but we also found a significant correlation with age (9.0% of variability) and a very strong effect of genetic variants (35.4% of variability). Stratifying individuals on age did not appreciable reduce the range of variability (Fig. 3c). However, stratifying on the basis of the genotype at the top hit (rs11548618) had a considerable effect on reducing the variability (Fig. 3d). In the case of CCL24, the carriers of the reference allele of rs6946822 had a level 209% (linearized ddCq) of the average value of the homozygote carriers of the alternative allele (Fig. 3e). The effect of medication on the abundance can be demonstrated by IL-6 ( Fig. 3f), where the distribution of protein level was clearly shifted upwards with the use of dihydropyridine derivatives (ATC: C08CA) found in drugs prescribed for treatment of hypertension or angina pectoris. Interestingly, this was the only hypertension medication that is correlated with higher IL-6 levels, and neither angiotensin-converting-enzyme (ACE)-inhibitors (ATC: C09AA), selective b-blocker agents (ATC: C07AB) nor a combination of these mediate this effect (Fig. 3f). This implies that detailed medication information may be needed for proper use of this biomarker.
Availability. Full summary statistics of the combined results from the 14 GWAS's with genome-wide significant hits are available from doi:10.5879/BILS/g000001.

Discussion
We have shown that for 72 of the 77 biomarkers studied, the circulating plasma levels are strongly associated with genetic, zrs11465293 was not discovered in the unconditional discovery-replication analysis and was subsequently rerun in a conditional discovery-replication analysis resulting in the P-values 3.9 Â 10 À 9 and 6.7 Â 10 À 5 for the discovery and replication cohorts, respectively. yImputed. ||When the top SNP(s) was imputed, the top ranking genotyped SNP was also included in the table. zIn perfect LD (R 2 ¼ 1) with top-snp in our cohort.
clinical or lifestyle factors. Most biomarkers are highly heritable, and for 14 biomarkers, we identified strong genetic associations, with the top SNP explaining as much as 36% of the variability in protein abundance between individuals. For these biomarkers, stratifying patients based on their genotype may dramatically enhance the ability to detect deviations from normal circulating levels. A number of non-genetic factors also show a strong effect on biomarker levels, with age, systolic blood pressure and weight affecting a large number of the biomarkers. As cancer incidences increase with age 29, so does the use of prescribed medications (Spearman's rho, R 2 ¼ 0.29, Supplementary Fig. 2). Interestingly, we identified medication as an important clinical variable that should be considered when using the biomarkers for diagnosis or risk prediction. For instance, Basigin expression has been associated with shorter survival and proposed as a biomarker for adjuvant therapy in colorectal cancer 30 . Our analysis did not show any significant association of Basigin levels with covariates such as anthropometrics, age, sex or smoking. However, the use of glucocorticoids commonly found in inhalators used to treat asthma-related conditions, decreased circulating levels of Basigin thereby possibly masking the need for adjuvant treatment. Our results indicate that when using Basigin as a biomarker in an ageing population, medication history and dosage should be taken into account in order to establish an appropriate clinical cutoff. Another example is the IL-6 and IL-6 receptor (IL6-RA), where we confirm the strong effect of the genetic constitution on the circulating IL6-RA levels. We also show that medications used to treat, for example, hypertension such as dihydropyridine derivatives, but not ACE-inhibitors or selective b-blockers agents, cause or maintain an increase in the inflammatory response cascade via high IL-6 levels. The IL-6 signalling is important in the pathogenesis of several autoimmune and chronic inflammatory diseases 31   ARTICLE the inflammatory response 32 . In clinical practice, only two-thirds of the patients treated with these drugs respond to the treatment and factors such as age and medical history have been shown to be predictors of remission and response in RA patients 33 . Future investigations are clearly needed to specifically address the longterm effects of commonly used medications in this perspective. In a clinical context, circulating levels of CXCL10 have been estimated to 120 ± 83 pg ml À 1 in patients diagnosed with Graves' Disease as compared with 72 ± 32 pg ml À 1 in controls 34 ; an average increase of 67% not taking genetic and non-genetic covariates into account. By comparison, the average increase in individuals in our study carrying the reference genotype for rs11548618 was 178% (linearized ddCq) of the level in heterozygous individuals, clearly illustrating the relative importance of carrier genotype versus to disease state on biomarker levels. Previous efforts 35,36 have identified genetic susceptibility loci for Graves' disease but none of these overlap with the loci associated with the CXCL10 levels, suggesting that in this case the causal effects of the disease are not directly linked to the biomarker levels. Another strong genetic effect was observed for CCL24, where carriers of the reference allele of rs6946822 have a level 209% (linearized ddCq) of the average value of the homozygous carriers of the alternative allele (Fig. 3e). The worldwide minor allele frequency of rs6946822 is listed in NCBI's short genetic variation database (dbSNP) as 0.46, implying that every fifth individual will be homozygote, similar in frequency to the individuals who smoke in the United States today 37 , demonstrating the large, common genetic effects on biomarker variation found in the population today.
We also find biomarkers that are not significantly affected by any of the variables examined, rendering them less susceptible to variability induced by non-disease-related factors. Although we have investigated a large number of genetic, clinical and lifestyle factors, they altogether explain at most 56% of the variation in biomarker levels between individuals. The remaining variance must reflect other factors, or non-additive interaction between some of the factors studied, and their identification could further increase the utility of biomarkers by reducing sources of variation unrelated to disease state. For example, CCL24 had a heritability of 0.78, indicating that additional genetic loci might affect protein levels. For 15 of the biomarkers, the vast majority of abundances were below the detection limits in our cohort. Several of these could represent ideal biomarkers without major presence in normal plasma and thus with no influencing genetic or lifestyle factors. Among these was for instance mucin-16 (or CA125) that is used clinically as a test for ovarian cancer 38 and also potential biomarkers such as regenerating islet-derived protein-4 that has been proposed as a biomarker for pancreatic ductal adenocarcinoma 39 .
This study identifies several previously unknown genetic and lifestyle factors influencing the circulating plasma levels of disease biomarkers in a population-based cohort, but has its limitations. First, we have a relatively small sample size (N ¼ 1005) for genetic association studies. Despite this fact, we identify and replicate 12 novel associations of large effect on the disease biomarkers. Although large GWAS consortia have identified hundreds of genetic variants associated with variation in disease-related phenotypes 40,41 , most of these SNPs are common and have such small effect sizes that they are not clinically useful.
Personalized cancer medicine is on a trajectory from long awaited promise to existing reality, with clinical applications for a small number of cancers with directed treatments. In chronic myelogenous leukemia, patients with a specific translocation respond well to treatment with a tyrosine-kinase inhibitor blocking an enzyme that in turns triggers signalling cascades 42 . Also, patients with non-small-cell lung cancer and a gene-fusion mutation have higher drug response rates than those lacking this gene fusion 43 . However, the number of cancer biomarkers in clinical use is still limited. In the set of biomarkers studied here, we identified a surprisingly strong genetic effect on some biomarkers after correcting for clinical (medication) and lifestyle variables. Likewise, other biomarkers were strongly affected by environmental lifestyle or clinical factors. Genotyping of selected polymorphisms with a strong effect on abundance appears to be crucial for about 20% of the biomarkers in our study, whereas lifestyle and medication are important covariates for the majority. In the daily clinical routine, we envision that analysis of broad-spectrum biomarkers could be used as a follow-up analysis for patients, or for screening of risk groups. Our analysis indicate that such tests would be accompanied by collecting additional relevant information such as anthropometrics, medication and genotyping of specific polymorphisms known to affect the baseline of these biomarkers. The clinical laboratory that performs the biomarker analysis would have documentation on which cofactors that significantly influence the baseline levels, and could advise the physician on how to interpret the outcome of the test. Our results imply that using biomarker-specific covariate profiles will make it possible to determine more precise, individualized, clinical cutoff levels. This in term could lead to a more efficient use of protein biomarkers for early detection of abnormal levels and for increased sensitivity and specificity in disease diagnosis. By employing biomarker-specific profiles of covariates it will be possible to fully harness the potential of existing and novel biomarkers for disease diagnosis and management.

Methods
Samples. The NSPHS was initiated in 2006 to provide a health survey of the population in the parish of Karesuando, county of Norrbotten, Sweden, and to study the medical consequences of lifestyle and genetics. This parish has about 1,500 inhabitants who meet the eligibility criteria in terms of age (Z15 years), of which 719 individuals participated in the study (KA06 cohort). As a second phase of the NSPHS, another 350 individuals from a neighbouring village (Soppero) were recruited in 2009 (KA09 cohort). For each participant in the NSPHS, blood samples were taken (serum and plasma) and stored at À 70°C on site. Both the 2006 and 2009 samples used in this study have undergone two freeze-thaw cycles before the measurements carried out here. DNA has been extracted for genetic analyses and detailed descriptions of this study have been published elsewhere [44][45][46] . A questionnaire was used to collect data on medications and lifestyle. The questionnaire was filled in at the local health-care centre in the presence of the local district nurse. Notably, around 15% of the participants of the study adhere to a traditional lifestyle based on reindeer heading and crafts. Differences in, for example, diet in this group compared with the group with a lifestyle typical of more industrialized regions have been shown to increase levels of circulating blood lipids 47, which motivates to include the traditional lifestyle adherence as a covariate.
Ethical considerations. The NSPHS study was approved by the local ethics committee at the University of Uppsala (Regionala Etikprövningsnämnden, Uppsala, 2005:325) in compliance with the Declaration of Helsinki 48 . All participants gave their written informed consent to the study including the examination of environmental and genetic causes of disease. In cases where the participant was not of age, a legal guardian signed additionally. The procedure that was used to obtain informed consent and the respective informed consent form has recently been discussed in light of present ethical guidelines 49 .
Multiplexed PEA. Protein levels in plasma were analysed using the Olink Proseek Multiplex Oncology I 96 Â 96 kit and quantified by real-time PCR using the Fluidigm BioMark HD real-time PCR platform as described earlier 10 . In brief, for each measured protein, a pair of oligonucleotide-labelled antibodies probes bind to the targeted protein, and if the two probes are in close proximity, a PCR target sequence is formed by a proximity-dependent DNA polymerization event and the resulting sequence is subsequently detected and quantified using standard real-time PCR. Each plate contains 96 wells whereof 92 are samples, 1 is a negative control and 3 are positive controls (spiked in IL-6, IL-8 and VEGF-A). Each sample is also spiked in with two incubation controls (green fluorescent protein and phycoerythrin), one extension control and one detection control. These controls are used to determine the lower detection limit (negative control) and to normalize ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms5684 the measurements into ddCq values according to the following formulae where À ð Þ dCq analyte ¼ Cq analyte À Cq Extension Control ð2Þ and dCq blank is a per-assay value defined by the manufacturer to give a positive log2-scale. The ddCq values were then log2-transformed for subsequent analysis.
Each PEA measurement has a specified lower detection limit calculated based on negative controls that are included in each run and measurements below this limit were removed from further analysis. Individual samples where at least one of the internal controls contained an outlier value (n ¼ 35) or where too many (475%) measurements were below detection limits in any PEA (n ¼ 1) were also excluded from further analyses (total n ¼ 35 of 1005, 3.5%). We wanted at least 200 observations per protein above detection limit in order to conduct the downstream statistical analyses and therefore proteins with fewer observations were excluded from further analyses. After individual and protein quality control, 77 proteins measured in 970 individuals remained. Out of the removed proteins, seven proteins (betacellulin, epiregulin, IL-2, CA242, estrogen receptor, G-CSF and stromelysin-1) had 100% of measurements below detection limit. Uniprot recommended short names have been used throughout when these are available otherwise; the assay manufacturers' abbreviations have been used. All assay characteristics including detection limits and measurements of assay performance and validations are available from the manufacturer's webpage (http://www.olink.com/products/ proseek-multiplex/downloads/data-packages).
Genotype data. The reference haplotypes were created using in-house R-scripts, PLINK (v1.07) and phased using SHAPEIT (v2.r) 54 . Data were then imputed for the two cohorts separately using IMPUTE2 (v2.3.0) with a pre-phasing approach 55 . The input data were phased chromosome-wise using SHAPEIT (v2.r). In addition to our in-house panel, we also utilized the 1,000 ABO blood group assignment. We assigned blood groups according to the ABO-system to our samples based on their genetic status of four genotyped SNPs (rs505922, rs8176746, rs8176704 and rs574347) in the region of the ABO gene. These four SNPs allow for accurate assignment of both the A/B/O groups and subtyping of A into A1 and A2 and subtyping of O into O01 and O02 (ref. 58).
Using this approach, we successfully assigned blood groups to 97.9% of our samples.
Statistic analyses. All statistical analysis was conducted in R 59 and illustrations were produced using R and the Circos software 60 . Correlation between proteins and relevant variables was calculated separately for each measured protein by fitting a generalized linear model using the 'glm' function including all covariates simultaneously. The significance of the each covariate's contribution to the total variance was estimated using an analysis of variance approach as implemented by the 'anova.glm' function on the resulting generalized linear model. Covariates were considered significant for a specific protein if their Bonferroniadjusted P-values were below 0.05 (P-valueo3.16 Â 10 À 4 , 0.05/158). Each PEA measurement was individually adjusted for significant covariates (Supplementary Data 2) and rank-transformed to normality by using the 'rntransform' function available from the R-package GenABEL (v1.6.7) 61 . Correlations between pairs of PEA measurements were carried out, on the adjusted and rank-transformed values, using the 'cor' function applying Spearman's Rho statistics on pairwise complete observations. The NSPHS is a population-based study and includes many relatives and special care has to be attributed to avoid relational biases. Therefore, all genetic association calculations were carried out using the GenABEL or ProbABEL 61 software suites, which has been developed to enable statistical analyses of genetic data of related individuals. These packages includes functions for estimating the narrow-sense heritability (h 2 ) and performing genetic association analyses 62 by adjusting for pedigree structure. In brief, the heritability of each trait (protein abundance) is estimated using a polygenic model as implemented by the 'polygenic' method in the GenABEL R-package 61 . This heritability estimate represents the variance in the phenotype that is explained by genetic factors and is estimated by maximizing the likelihood of the trait-data under a polygenic model including fixed effects such as covariates and relatedness among individuals (kinship). The result of the 'polygenic'-call contains the inverse variance-covariance matrix of the estimates and trait residuals and is included in the downstream association calculations together with the posterior genotypic probabilities. Specifically, these calculations are performed using the ProbABEL programme using the '--mmscore' option. Kinship matrix calculations were carried out using the autosomal markers shared (n ¼ 182,916) between the two types of genotyping arrays used in the KA06 and KA09 cohorts. Contribution of single SNP's to phenotypic variation on the unadjusted ddCq values was calculated in R by fitting a linear model (using 'lm') with ddCq values as response and the posterior genotypic probabilities as terms and fraction of variance explained was determined from the resulting model using 'summary.lm'. Fraction of variance explained by a single SNP in the adjusted phenotypes including effects of relatedness was estimated by dividing the resulting chi-square test score (from ProbABEL) with the number of samples used.
The KA06 cohort was used as discovery cohort in the GWAS and KA09 as replication cohort. As we cannot rule out protein degradation effects due to differences in storage time between the two cohorts, this split is favourable to a random split where degradation effects could affect the association analysis. Strict Bonferroni-adjusted P-values (P-valueo1.03 Â 10 À 8 , 0.05/4,840,842) were used to report significance in the discovery cohort and the replication cohort (P-valueo0.05/number of significant SNPs in the discovery cohort). We also ran a combined analysis with the same cutoff used as in the discovery phase. For all proteins with replicated hits, a conditional analysis was carried out in which the genetic associations were re-calculated using the dosage values of the top-ranking SNP as covariate. This analysis was only run in the combined material and on chromosomes that had hits in that replicated in the discovery-replication phase and P-value o5 Â 10 À 8 was used as cutoff.