Genome-wide association studies (GWAS) have identified associations between thousands of common genetic variants and human traits. However, common variants usually explain a limited fraction of the heritability of a trait. A powerful resource for identifying trait-associated variants is whole genome sequencing (WGS) data in cohorts comprised of families or individuals from a limited geographical area. To evaluate the power of WGS compared to imputations, we performed GWAS on WGS data for 72 inflammatory biomarkers, in a kinship-structured cohort. When using WGS data, we identified 18 novel associations that were not detected when analyzing the same biomarkers with genotyped or imputed SNPs. Five of the novel top variants were low frequency variants with a minor allele frequency (MAF) of <5%. Our results suggest that, even when applying a GWAS approach, we gain power and precision using WGS data, presumably due to more accurate determination of genotypes. The lack of a comparable dataset for replication of our results is a limitation in our study. However, this further highlights that there is a need for more genetic epidemiological studies based on WGS data.
Over the past decade, genome-wide association studies (GWAS) have successfully identified associations of thousands of single-nucleotide polymorphisms (SNPs) with human traits and diseases1. Most of the associated alleles discovered so far are common, with a minor allele frequency (MAF) above 5%2. Many SNPs are also located outside coding regions, which complicates the identification of causal mechanisms, functional variants and relevant genes. Additionally, identified SNPs collectively only explain a limited fraction of the heritability3,4,5. A number of hypotheses for this “hidden heritability” have been proposed, such that part of the heritability is due to rare variants6, that there is a non-negligible fraction of unmapped or untagged common variants7, or that variants with very low effect sizes have not been captured in current GWAS. The extent to which rare and low-frequency coding variants (<5%) influence traits and diseases is still not completely understood2. Rare variants may not be present on currently available SNP arrays nor be well tagged by the available SNPs8 on the array due to the low linkage disequilibrium (LD) between common SNPs and rare variants9. Performing GWAS with imputed or genotyped variants is therefore not ideal for detecting associations with rare variants. Additionally, rare variants are often specific to individual populations7,10,11,12, or even families, making them hard to detect with standard GWAS in unrelated participants.
However, the limitations in association studies can be partly reduced. Firstly, a powerful approach to identify complex trait- and disease-associated rare variants is to use populations that comprise of families or individuals from a limited geographical area13,14. Secondly, whole genome sequencing (WGS) data can be used to better capture rare and low frequency variants and variants not in LD with SNPs on a genotyping array. WGS is superior to imputation when it comes to determining genotypes of rare variants with high accuracy8. Simulation studies have shown that the mapping precision for rare variants increases considerably when using WGS data in a GWAS approach, making it an efficient way of detecting and fine-mapping rare variants simultaneously15. Hence, by shifting from genotyped and imputed data to WGS data, a standard GWAS can be performed with a likely increase in both variant capture and precision. Yet, few GWAS have been performed using WGS data to date. During the last years, WGS has been performed in a variety of different populations7,11,12,16. A recent study within a small kinship-structured cohort (similar to ours), tested for the burden of rare variants from WGS data on six cardiometabolic traits17. The authors found novel signals that neither were captured with low-depth sequencing, nor with genome-wide genotyping with dense imputation in the same samples17, supporting the notion that WGS data in kinship-structured cohorts can improve power to identify genetic associations. A number of studies have also performed GWAS in cohorts where the variants were imputed from unique reference panels based on WGS of a subset of the participants of the same cohorts, or performed GWAS on low coverage (~4×) WGS data18,19,20,21,22. For example, in a study on circulating lipid levels and five inflammatory biomarkers22, WGS of 2,120 Sardinians was performed to assess the impact of the variants common in the Sardinian cohort but rare in the 1000 Genomes Project. In total, 14 signals were found, including two new loci that would have been missed if data had been imputed using the 1000 Genomes reference panel, further underlining the advantages of large-scale sequencing.
Biomarkers are often strongly genetically regulated23,24 and have been shown to be less polygenic in comparison to complex traits and diseases, which increases the power to study the effect in a smaller cohort where WGS data is available. When used for diagnosis, an ideal biomarker should be uniquely present or overexpressed in the tissue of interest and not be influenced by confounding factors, such as genetic variants23,24. However, genetic factors commonly have a considerable effect on biomarker levels and introduce noise when biomarkers are used for diagnosis. Better characterization of the genetic contribution to variation in biomarker levels is therefore of great importance.
In this project, we used a GWAS approach to test for associations with no fewer than 72 inflammatory plasma protein biomarkers, in order to investigate the gain in precision and rare variant-capture with WGS data compared to genotyped/imputed SNPs. In total, 100525 individuals with high coverage WGS data from the kinship-structured population-based Northern Swedish population health study (NSPHS)26 were included. This cohort has also been genotyped27 and imputed, making this a valuable opportunity to compare the relative performance of WGS and genotyping/imputation in relation to the same phenotype measurements. This study is one of few that uses WGS data with a GWAS approach in order to capture a greater number of low frequency variants associated with inflammatory protein biomarkers, and to further characterize the genetic structure underlying these associations, aiming to extend our knowledge of the genetic contribution to these biomarkers.
A total of 1005 individuals with WGS data and biomarker data were included in this study. The age of the participants ranged from 14 to 94 years with a median of 52 years, and 50.8% of the participants were females. The biomarkers were measured at two timepoints. At the first timepoint, biomarkers from Olink’s Oncology I and Cardiovascular I panels were measured of which 31 are inflammatory (from now on called ONC_CVD). At the second timepoint, 95 inflammatory biomarkers from the Inflammatory I panel, were measured (from now on called INF). After quality control (QC), 72 unique biomarkers remained of which 42 were only from the INF panel, while one had been measured only on ONC_CVD. As many as 29 biomarkers were included on both the INF and ONC_CVD panels and they were considered technical replicates. The average number of individual measurements per biomarker was 915 (median 929, range: 430–957) in ONC_CVD and 829 (median 871, range: 424–892) in INF.
Genome-wide association for biomarker levels
In the WGS data, a total of 16,890,549 biallelic single-nucleotide variants (SNVs) were called. A MAF threshold of 0.15% was chosen in order to reach enough statistical power in the GWAS (Supplementary Fig. S1). After filtering on MAF and Hardy-Weinberg equilibrium (HWE), 12,210,410 SNVs remained for downstream analyses. For the 72 individual biomarkers analyzed, 5,812 genome-wide significant (P < 1.62 × 10−8) associations were identified, and for 41 (56.9%) of the biomarkers, there was at least one associated SNV (Fig. 1, Table 1, Supplementary Table S1). For CCL4 and CXCL5, two independent associations each were identified, making it a total of 43 independent associations.
We identified 11 biomarkers that had significant associations in both ONC_CVD and INF, representing 1,418 SNV-biomarker associations. Seven of the biomarkers had significant associations only when analyzing the measurements from ONC_CVD, but not when analyzing the same biomarker measured on the INF panel. However, these variants had p-values just below the genome-wide threshold (ranging from 1.17 × 10−8 to 3.43 × 10−13) in ONC_CVD and p-values just above the genome-wide threshold in INF (Supplementary Table S2). Here, the larger sample size in ONC_CVD (90–100 more individuals) probably increased the power enough to reach genome-wide significance.
Most biomarkers (67.44%) with at least one significant hit identified, had an association in cis (i.e., within 1 Mb of the gene encoding the biomarker) or even within the gene encoding the biomarker itself. The rest of the associations were in trans, all located on another chromosome than the gene encoding the biomarker (Fig. 2). Adjusting for the most significant SNV resulted in 15 biomarkers having a secondary, significant signal close to the primary signal, and adjusting for both the primary and secondary SNV resulted in seven biomarkers having a tertiary signal (Table 2, with more extensive variant data in Supplementary Tables S3 and S4). In the conditional analyses, only the SNVs that were located within each associated region (see methods) were analyzed. Due to the reduced number of variants analyzed in the conditional analyses, as compared to the primary GWAS, the power to analyze rarer variants increased and we therefore did not have a MAF threshold in the conditional analyses. Here, we then identified four variants with MAF < 0.15%
In general, the biomarkers without a genome-wide significant association had heritability estimates below 0.3, i.e. less than 30% of the variation in biomarker abundance is due to genetic factors (Supplementary Table S5). For many GWAS-associated biomarkers, the heritability was still fairly high, with the top SNVs and the top conditional SNVs accounting for a total of 5–20% of the total variance in biomarker abundance in most cases (Fig. 3).
Comparison with our previous GWAS using genotyped/imputed data suggests novel loci for many biomarkers
Twenty of the biomarkers (ADA, CASP-8, CCL11, CCL20, CCL23, CD244, CDCP1, CST5, CX3CL1, CXCL1, CXCL11, CXCL9, FGF-5, MCP-3, ST1A1, STAMBP, TGFB1, TNFB, TNFSF14, uPA) with significant associations in the present study, did not have any significant associations in our previous GWAS when using genotyped/imputed SNP data25,28 (Supplementary Figs S2–S20). The abundance of two of these biomarkers (CXCL9 and CXC11) had an associated variant that in our previous studies was identified to be associated only with CXCL10 and is most likely a false positive finding for CXCL9 and CXCL11 (discussed more thoroughly in Supplementary, including Supplementary Figs S21–S25). The remaining novel biomarker associations represented 18 unique loci that were not found to be associated with the levels of the same biomarker using genotyped/imputed data in the same cohort. Of these, 15 loci (see overlap with GWAS catalog below) have not been reported in any previous study of the same biomarkers, thus making them novel loci. In the novel loci, six top variants are considered to be low-frequency variants (MAF < 5%). Additional to the novel loci, four biomarkers (CD6, CXCL5, CCL4, MMP-10) had associations driven by top variants that are only in moderate in LD (R2 < 0.8) with the top variants from our previous studies, and might therefore be considered independent associations (Supplementary Figs S26–S30). Another 19 loci overlapped between the present study and our previous studies with SNP data25,28, for which nine loci had the same top variant. The remaining ten overlapping loci had different top variants, although these variants were in high LD (R2 > 0.8). The top variants in the overlapping loci were more strongly associated (more significant p-value) in the present study than in our previous GWAS, except for two biomarkers (MMP-10 and TRAIL), for which more significant GWAS top variants were found in the previous studies (Tables 3 and 4).
Replication in an independent cohort
Due to the lack of a similar dataset (WGS data and measured levels of the same inflammatory biomarkers) for replication, we could only test for replication for a subset of our results using GWAS results of circulating cytokines in a Finnish population29 (Supplementary Table S6). Of the cytokines that were analyzed in both studies, we fully replicated our primary results for CCL11, i.e., the same top SNV was found in both cohorts. One of the two independent associations with CCL4 was also fully replicated. We also replicated one of the three independent associations with MCP-1, and one of the two independent associations with CXCL1, although with an LD between the top variants of R2 = 0.75 and 0.72 for MCP-1 and CXCL1, respectively. For our second independent association with CCL4, our top variant was either monomorphic in the Finnish population or had not been analyzed. However, the most significant CCL4 associated SNV in the Finnish population is also genome-wide significant in our study (P = 3.52 × 10−12), even if not our most significant. Our results for MCP-3, SCF and CXCL9 did not replicate in the Finnish population (P > 0.05 in the Finnish cohort for our top SNVs) and our result for TNFB was only nominally significant (P = 0.027). On the other hand, the most significant SNVs for IL-7, IL-10, IL-18 and HGF in the Finnish population, were not genome-wide significant in our cohort, even if rs5745687 (for HGF) as well as rs385076, rs17229943, and rs71478720 (for IL-18) were nominally significant (P < 0.05).
Colocalization with eQTL data in blood
Colocalization (the same top variant) with cis-eQTLs in peripheral blood was found for five of the top SNVs (Table 5), associated with the levels of three different biomarkers (CD40, CXCL5 and IL-15RA). For the other top variants from our biomarker GWAS, LD was calculated between each top variant and the most significant eQTL. Two top variants, associated with CXCL5, were colocalized (R2 > 0.8) with a variant associated with a trans-eQTL in blood. Two secondary hits associated with IL-18R1 and TNFSF14 respectively, and one tertiary hit associated with CCL23, were also found to be colocalized with a cis-eQTL. All overlapping variants had the same direction of effect except for the overlap with the trans-eQTL, where two SNVs (rs10740118 and rs7088799) in JMJD1C was associated with increased protein levels of CXCL5, but was in LD (R2 = 0.86) with a variant, rs10761779 that was associated with decreased RNA levels of CXCL5.
Colocalization with data from the GWAS catalog
The association signal for eight biomarkers (CCL19, CCL4, CD40, CD6, CXCL5, IL-12B, IL-18R1, TNFSF14) colocalized with association signals for one or several inflammatory diseases (Table 6, Supplementary Table S7). Here, we regard the signals to be colocalized when a top SNV or any SNVs in LD (R2 > 0.8) with a top SNV identified in our study was also the top SNV for an inflammatory disease in the GWAS catalog (v 1.0.2). In four cases, our top SNV had been associated with an inflammatory disease in previous GWAS: rs113010081/CCL4 with inflammatory bowel disease, rs1569723/CD40 with Crohn’s disease, rs4239702/CD40 with rheumatoid arthritis and rs11230563/CD6 with ulcerative colitis. The remaining top variants are in LD (R2 > 0.8) with previously inflammatory disease-associated top variants. For example, the top variants for CXCL5 found in this study (rs352045 in ONC_CVD and rs425535 in INF) are both in high LD (R2 = 0.944) with a variant previously associated with ulcerative colitis. Some of our top SNVs were also colocalized with associations for different blood-trait (Supplementary Table S8).
We performed a GWAS on 72 inflammatory biomarkers in a Swedish cohort using WGS data, and identified SNVs that were associated with the plasma levels for as many as 41 biomarkers. Of the biomarkers with at least one significant hit, 67.44% had an association within 1 Mb of the gene encoding the biomarker (in cis) and the rest (32.56%) had an association on another chromosome (in trans). Many of the biomarker levels are highly heritable and some top SNVs explained as much as 25% of the variability. When comparing the results to our previous GWA analyses25,28 using genotyped/imputed data, novel associations were identified for 18 biomarkers when WGS data was used, 15 of which has not been identified in any previous study. Additionally, in four of the biomarkers, for which the associated loci overlapped with our previous study, the top variants in the present and former studies were not correlated (R2 < 0.8), thus making these findings potentially independent associations.
We have previously used both mass spectrometry and the recently developed protein extension assay (PEA) to identify the genetic contribution to variation in protein levels in the NSPHS cohort, where we showed that more than 30% of the biomarkers are influenced by genetic variants23,28,30,31. In a recent study with a larger sample size32 (N = 3,394), but also based on genotyped/imputed data, we contributed to the identification of 79 genome-significant loci for 83 plasma protein biomarkers for cardiovascular disease. A more recent study by Sun et al.33 identified nearly two thousand genetic associations with almost 1,500 proteins, which increased the existing knowledge about the human plasma proteome by fourfold. With the use of WGS data, we can extend our knowledge even further. In the present study, we have shown that we can increase the power in identifying novel loci by using WGS data in GWAS, instead of using genotyped or imputed SNPs. We were able to identify associations for 58% of the biomarkers, which is a considerable higher fraction compared to the 30% identified in the same cohort using genotyped/imputed data.
In addition to a gain of power, we have also shown that we can increase the precision by using WGS data instead of genotype/imputed SNP data25,28. Overall, MAF agreed well between the genotyped/imputed dataset and the WGS dataset, for all associated SNVs. The MAF threshold in our previous studies was set to at least one chromosome in the dataset, which corresponds to a lower threshold than 0.15%. This means that even the rarest variants in the present study were included in the previous analyses with genotyped or imputed SNPs, even though there was no power to identify an association with such rare variants. For the 18 associations not found in our previous studies, the imputation quality was overall good, except for the SNVs that did not pass imputation QC (Table 7). In some cases, the associations were just below the significance threshold in the imputed data. Some such examples are ADA, CCL11 and TGFB1 where the top variants in the present study are suggestive hits in our previous study. These three associations are in regions with not many variants genotyped or imputed, but with many SNVs called in the WGS. For example, in the WGS analyses, a missense SNV (rs11555566) within ADA, which encodes adenosine deaminase, was strongly associated with the expression level of the corresponding adenosine deaminase protein (P = 4.9 × 10−19). This variant is quite rare in our cohort (MAF = 1.9%) and was not identified as genome-wide significant in the imputed data, even if imputation quality was suggested as good and the MAF was similar to the WGS data (Supplementary Fig. S2). In the cases of CD244, CDCP1 and ST1A1 on the other hand, the top SNVs from the present study were not imputed in our previous study (CD244), or did not pass imputation QC (CDCP1 and ST1A1). These can therefore all be considered novel associations. However, in the case of STAMBP, the genotype quality in the WGS data is also low, making this the most uncertain association. Excluding this variant, all novel top variants had a genotype quality of at least 75 in the WGS data, and are therefore considered quite robust.
We also compared our association signals for colocalization with data from the GWAS catalog. A total of eight biomarkers (CCL19, CCL4, CD40, CD6, CXCL5, IL-12B, IL-18R1, TNFSF14) had top variants previously associated with an inflammatory disease, or correlated (R2 > 0.8) with variants previously associated with an inflammatory disease, suggesting that variation in biomarker levels might mediate the disease association, although, we did not determine the direction of causality. For example, the minor allele of rs10190555 was found to be associated with higher levels of interleukin 18 receptor 1 (IL-18R1). The minor alleles of two SNPs in LD (R2 = 0.95) with rs10190555 (rs917997[T] and rs6708413[G]) are also associated with a higher risk of inflammatory bowel disease (IBD) and Crohn’s disease respectively. IL-18R1 is a cytokine receptor that binds interleukin 18 (IL-18) and is essential for IL-18 mediated signal transduction. IL18R1 has previously been found to have colocalized disease and eQTL association patterns in CD4 and CD8 cells for both ulcerative colitis and Crohn’s disease34. In that study, reduced transcript levels of IL18R1 in CD4 and CD8 cells was associated with increased risk for IBD, and the SNP most strongly associated with expression was rs11123923. However, rs11123923 is only in weak LD with the top variant (rs10190555) from our study (R2 = 0.20). In our study, the association to IL-18R1 spans over a large region with strong LD. When adjusting for the top variant, a secondary top variant (rs12999517) reached genome-wide significance. This variant is intronic in IL1RL1, which encodes the interleukin-1 receptor-like 1 (IL1RL1, alternatively ST2) protein. In a previous functional study, the C allele in rs6543115, which is located in a distal IL1RL1 promoter, was shown to confer susceptibility to ulcerative colitis as well as increase expression of the soluble ST2 isoform35. However, the top conditional variant (rs12999517) from our study is not in LD with the previously found variant, and is not in strong LD with any other variant previously associated with ulcerative colitis.
When adjusting for the effects of the top SNP (rs344560) associated with TNFSF14, the minor allele of rs2291668 was found to be significantly associated with increased TNFSF14 levels in our study. Interestingly, the secondary variant rs2291668 is found to explain significantly more of the variability in TNFSF14 levels (9.1%, as compared to 4.4% explained by rs344560 [likelihood-ratio test, p < 0.001, χ2 = 26.95, 1 d.f.]). The top SNP (rs344560) in the primary signal is a missense variant located in the gene TNFSF14, which encodes the biomarker. This variant was found to be associated with lower TNFSF14 levels. The missense variant is neither reported in the GWAS catalog, nor is it in LD with any previously reported variants. However, the conditional SNP, rs2291668, is correlated with rs1077667 that has previously been associated with multiple sclerosis. A possible explanation is that the missense variant affects the antibody affinity for the biomarker which in turn will give lower measured protein levels. These lower levels might give a strong association signal and it is not until the top missense variant is adjusted for, that the true regulatory variant is detected.
A subset of our top SNVs overlapped with eQTLs in blood. For example, the correlated variants rs352045 and rs425535 (top SNVs for CXCL5 in the INF and ONC_CVD respectively) are in high LD with an eQTL for CXCL5 RNA levels in blood. Two variants, rs425535: a nonsynonymous exonic variant that lies within a splicing enhancer site, and rs352046: a promoter variant that is located within a transcription factor binding site for myeloid zinc finger proteins, have also previously been associated with CXCL5 mRNA expression38 and CXCL5 protein levels in blood39. Both variants have been shown to be in almost complete LD (R2 = 0.94) to complete LD (R2 = 1) in both U.S and European populations36,39 and are in complete LD in our cohort. The minor allele for rs425535 was previously shown to be associated with significantly higher CXCL5 plasma concentrations and the minor allele for rs352046 with higher CXCL5 expression levels, which agrees with our results where the minor alleles of rs352045 and rs425535 were associated with higher CXCL5 levels (Supplementary Table S6). In our study, rs352046 was not identified as the top variant but instead rs352045 which is in almost complete LD with rs352046 (R2 = 0.97). Our top SNV is located only 137 bp from rs352046 and both are found within transcription factor binding sites. Neither the association with rs352045 nor rs352046 are reported in the GWAS catalog. However, both rs352045 and rs425535 are in high LD (R2 = 0.94) with a variant, rs2457996, that has previously been associated with ulcerative colitis. Ulcerative colitis is a sub type of IBD and CXCL5 have previously been shown to play a role in IBDs, such as ulcerative colitis and Crohn’s disease. In a study by Z’Graggen et al.40, a preferential expression of CXCL5 mRNA in the epithelium of the intestinal tissue from patients with IBD was observed. They also found a strong expression of CXCL5 at protein level. CXCL5, which encodes an epithelial cell-derived neutrophil activating peptide (also called ENA-78), has previously been suggested as a possible candidate gene for inflammatory diseases36,37 Since the previously mentioned variants, rs425535 and rs352046, have been shown to be associated with higher CXCL5 plasma concentrations and higher CXCL5 expression levels respectively, this further indicates that these variants might play a role in the pathogenesis of IBD.
Despite the many results, our study has some limitations with one being the limited sample size. Standard GWAS of complex traits commonly includes hundreds of thousands of samples. However, by analyzing quantitative phenotypes that are less complex, such as biomarkers, we can gain power and the sample size can be dramatically reduced. Despite this, for some biomarkers a sample size of <1000 individuals are not enough to make a robust assessment, and further studies in larger cohorts, or meta-analyzes needs to be performed. Another limitation is the lack of reproducibility, given the nature of the study. It is a small kinship-structured cohort, which makes the results not generalizable to more mixed population or a population of another ancestry. While this population structure increases power to detect rare variants that might be more common in an isolated population, it also makes it harder to reproduce in another cohort. As of now, there are also only a limited number of cohorts that have been measuring the levels of the same inflammatory biomarkers and that have WGS data available. Even within our own data, we fail to replicate some results, both in the technical replicates as well as in the novel associations with regards to the results from genotyped/imputed data. The limitations mentioned above are most likely also dependent on differences in biomarker quantifications between the biomarker panels. As with imputation, variant calling can be more or less precise. Only a genotype probability is given, with additional quality measures. Caution should be taken, until the associations have been validated. This applies especially to the associations containing only a few variants, or variants not yet given an official identifier. Validation has to be performed in a similar cohort, in order to obtain higher confidence and better understanding of the results. This limitation was especially apparent in our validation in the Finnish population, with a possible reason for lack of replication being the discrepancy in population size (around 950 in NSPHS compared to up to 8,293 Finns). Other possible explanations for the lack of replication are the different LD structures, the different techniques used for protein quantification, and that one cohort being based on WGS and the other on genotyped and imputed data.
In summary, we have performed GWAS in a family-based cohort with WGS data in relation to inflammatory biomarkers. By analyzing only sequencing data, we seek to further extend our knowledge on the genetic contribution to these important biomarkers. The cost-efficient solution of sequencing a few individuals and creating a reference panel to be able to do dense genotyping is becoming a well-established method in genetic studies. The use of low-depth sequencing as a way of increasing power is also more common today. By using high coverage WGS data, we do see an increase in both power and precision despite our limited sample size, an increase which indeed appears promising. We compared our results to earlier studies using genotyped/imputed data as well as to previously published GWAS. This study found several new loci associated with inflammatory biomarkers and nearly 50% of the associations were only detected in the present study. The associations were also stronger, with lower p-values, compared to those identified with genotyped/imputed data, suggesting that genotypes are in general more accurately determined using WGS compared to imputed data. Our results demonstrate the need of deep coverage WGS data with deeper coverage to be able to fully understand the genetic structure of common diseases and complex traits.
The NSPHS was initiated in 2006 to provide a health survey of the population in the parish of Karesuando county of Norrbotten, Sweden, to study the medical consequences of lifestyle and genetics. Additional participants were recruited in a second phase from the neighboring village Soppero, in 2009. These parishes have about 2,000 inhabitants of which a total of 1,069 participated in the study, whereof 719 individuals participated from Karesuando (2006) while another 350 individuals participated from Soppero (2009). For each participant in the NSPHS, blood samples were taken and serum and plasma were separated and immediately frozen and stored at −70 °C23.
The NSPHS was approved by the local ethics committee at the University of Uppsala (Regionala Etikprövningsnämnden, Uppsala, 2005:325, and extension of the project was approved 2016-03-09) in compliance with the declaration of Helsinki41. Informed consent to the study was given by all participants, including the examination of environmental and genetic cause of disease. If a person was not of age (<18 years), a legal guardian signed additionally. The procedure that was used to obtain informed consent and the respective informed consent form has recently been discussed in light of present ethical guidelines42.
A total of 1,041 samples were successfully sequenced using Illumina short read technology (X-ten) to 30x coverage per individual. The library preparation, sequencing, and variant calling were performed as previously described10. Briefly, WGS data were aligned to the GR37 using bwa-mem v0.7.1243. The raw alignments were then processed according to GATK best practice44 using GATK v3.3. Variants were called by the GATK HaplotyeCaller 3.3 followed by variant quality score recalibration (VQSR). Sample quality control (QC) was then performed to remove genetic outliers and identify potentially contaminated samples and individuals with sex discordance errors. After QC, 1,021 unique samples with WGS data remained. Before analysis, the VCF-files were converted to PLINK-format with the PLINK software, version 1.90b4.945. Only autosomes and biallelic single nucleotide variants (SNVs) were included in the analysis. If a position had more than two alleles, PLINK keeps the two most common variants and sets the third one to a missing genotype. SNVs within a deletion were also excluded (spanning deletion/overlapping deletion, denoted *). In the same process, the variants without an rs-id were renamed to chr:position. MAF and deviation from HWE information were assessed with the --freq and --hardy commands in PLINK. The GWA analyses were performed using the GenABEL package in R46,47. To make the files compatible with GenABEL, they were first transposed with the --recode transpose command in PLINK, and then imported into GenABEL. Variants were annotated with ANNOVAR v 2017.07.1648 using the refGene database.
Out of 1,021 individuals with WGS data, up to 1,011 individuals also have measured levels of any inflammatory biomarkers using the Proximity Extension Assay (PEA) technology provided by Olink (https://www.olink.com/products/inflammation/). Inflammatory biomarkers have been measured at three different timepoints within the cohort. At the first two timepoints the panels Oncology I (ONC I)23 and Cardiovascular I (CVD I)25 were measured. These include 31 inflammatory biomarkers (ONC_CVD). A total of 1,00525 samples were measured at the first two timepoints. At the third timepoint28, the panel INF I was used to measure biomarker levels (INF). Here, 92 biomarkers were measured in 903 individuals28. Of the inflammatory biomarkers from the ONC_CVD dataset, 30 had overlapping measurements in the INF panel, and thus served as technical replicates. The quality control of the biomarkers has been described previously23,25,28. After quality control, up to 957 individuals had available biomarker data in ONC_CVD and up to 892 from INF. Biomarkers with measurements in less than 400 individuals were excluded from downstream analyses.
The GenABEL package was used to perform GWAS adjusting for relatedness among individuals. GenABEL utilizes a genetic kinship matrix which was estimated with the ibs function. The kinship matrix was estimated based on the SNPs listed for the HumanHap300v2_A chip. This chip contains >300,000 SNPs that are selected to be tagSNPs, i.e. that are not in high LD with each other. This was done to remove non-informative variants in the construction of the kinship matrix. The phenotypic measurements and possible covariates, together with the kinship matrix, are passed to the polygenic function of GenABEL. The residuals from the polygenic model and the inverse covariance-matrix are then passed on to the mmscore, a linear mixed-effects model, which was used to perform the association analysis. All biomarker levels were rank-transformed to standard normal distributions with the rntransform function in GenABEL prior to the GWAS. All biomarker values were adjusted for sex, age and batch effect prior to, or in the GWA analyses. A Bonferroni adjusted p-value threshold was applied to account for the number of independent tests. To calculate the number of independent SNVs in the analysis, LD-pruning was performed in PLINK, using the --indep-pairwise function, with a window size of 10 Mb and variant jump count of 1. This resulted in a p-value cut-off of pthreshold = 0.05 / 3,078,707 independent SNVs = 1.62 × 10−8. A MAF threshold of 0.15% in the primary analyses and HWE cut-off of 5 × 10−8 was used. The MAF threshold was determined from simulation by assuming that the individuals with the most extreme biomarker levels were the only carriers of the minor allele at a given position. The minimum p-values were estimated depending on the number of individuals carrying one copy of the minor allele, as well as on sample size (Supplementary Fig. S1). Given a sample size of 700–1,000 individuals, which corresponds to the sample size for the biomarkers analyzed, a minimum of four individuals with one copy of the minor allele is needed to reach the genome wide significance threshold of our study. Since three copies in 1000 individuals corresponds to a frequency of 0.15%, we therefore used 0.15% as the MAF threshold in the primary GWA analyses of our study (more than three copies per 1000 individuals).
QQ-plots and Manhattan plots were produced with the qqman package in R49. Regional association plots were constructed using Locuszoom50. The 1000 G Nov 2014 EUR population was used for the coloration based on LD. If one biomarker had different top SNVs in the technical replicates, the LD coefficient (R2) was calculated using PLINK within the study population itself (NSPHS). To assess the size of the associated regions, an SNV clumping was performed in PLINK. A clump kb radius of 15 Mb (--clump-kb), a p-value threshold of 1 × 10−8 (--clump-p1) and an R2 cut-off set to default (--clump-r2 0.1) was used. This function clumps SNVs together based on empirical estimates of LD. The range (in bp) of the clumps was then calculated for each biomarker and used to define biomarker-based loci. We also performed conditional analysis, where the top SNV, for each marker with a significant hit, was used as covariate in the mmscore function to see whether there was more than one independent association for each biomarker. These clump-defined loci were then used to calculate a biomarker-based significance threshold for results in the conditional analyses. SNVs with a conditional p-value below 0.05/number of SNVs tested in the predefined locus, were considered as an independent association. If the biomarker had a significant secondary (conditional) signal, a third analysis was performed, adjusting for both the primary and secondary signal. No MAF cutoff was used in the conditional analyses.
Narrow-sense heritability estimates
Narrow-sense heritability (h2) was estimated using the polygenic model in GenABEL. First, the heritability for the biomarker measurements was estimated only adjusting for age, sex, batch effects and kinship. Then, as an estimation of SNV heritability, the top variant was used as a covariate. The difference in heritability estimated between the models gives the variance explained by the top variant. If secondary and tertiary signals were present, they were added as additional covariates and SNV heritability was calculated for each separately. To test significance of heritability, the reported function minimum (twice the negative maximum log-likelihood) is compared to the reported function minimum in a polygenic model with a fixed heritability estimate set to zero. The difference gives a test approximately distributed as chi-squared with 1 degree of freedom.
Colocalization with published GWAS data and comparison with previous biomarker studies
If the top SNVs of two different GWASes are in LD (R2 > 0.8), the phenotypes are considered to be colocalized. In the study cohort, the LD pattern was calculated between each top SNVs and all SNVs within 2 Mb, using PLINK. All variants that were in LD (R2 > 0.8) with a top SNV were extracted. These were used as query to test for colocalization with data from the GWAS catalog (The NHGRI-EBI Catalog of published genome-wide association studies, https://www.ebi.ac.uk/gwas/home) to find out whether the variants have already been published in earlier association studies of inflammatory biomarkers. Entries with p-values up to 1 × 10−6 in the GWAS catalog (version 1.0.2 – downloaded 2018-10-29) were included in the comparison. All entries for the top variants and variants in LD found in the catalog was extracted together with their metadata.
Our GWAS results were also tested for colocalization with the expression quantitative trait locus (eQTL) dataset from Westra et al.51. This dataset consists of both cis-eQTL and trans-eQTLs and is based on an eQTL meta-analysis in non-transformed peripheral blood samples. The names of the genes encoding the biomarkers were matched both in the cis and trans dataset to see if the top SNVs in this study had been reported as an eQTL. If not, LD was calculated between the top SNVs of our study and the most significant eQTL variant, using PLINK, also here within the study cohort (NSPHS).
Further replication was made using another population from Northern Europe. Circulation cytokines have been measured in Finnish populations by Ahola-Olli et al.29. We sought to replicate the results from the 19 inflammatory biomarkers (i.e. cytokines) that were present in both studies: bNGF, CCL11, CCL3, CCL4 (MIP1b), CXCL1 (GROa), CXCL10 (IP10), CXCL9 (MIG), HGF, IL-4, IL-5, IL-5, IL-7, IL-8, IL-10. IL-13, IL-18, MCP-1, MCP-3, SCF, TNFB and TRAIL. However, IL-4, IL-5 and IL-13 did not pass QC in our study, and thus, only 16 biomarkers could be compared. Only top SNVs were used in the replication.
Top variants were compared to the results from previous studies with genotyped/imputed data in the same cohort25,28. In the first study, the ONC_CVD biomarkers were analyzed25 in the whole cohort whereas in the second study28, the INF biomarkers were analyzed using a two-stage design by splitting the cohort into a discovery and replication cohort (genotyped using different arrays and imputed independently) followed by combined analyses for the SNVs that replicated. Even if the two previous studies were based on the same genotyping and imputation data, the quality control of the imputed genotypes where slightly different with one additional requirements of a genotype probability score of Info >0.9 in 95% of the individuals for the analyses of the INF biomarkers whereas the criteria of imputation quality >0.3 and HWE p-value >0.05/the number of SNPs within the sub cohorts based on different genotyping arrays as well as after combining the two sub cohorts. More information about genotyping, imputation and quality control of imputed variants is included in the previous articles25,28. In this comparison, the additional filters were also applied to the imputed data similar to the WGS data, i.e. only biallelic autosomal hits were compared. If an association was previously found with an indel or a variant in a spanning deletion, the biallelic variant with the lowest p-value after the indel or spanning deletion was used for comparison. For a subset of the non-overlapping hits, we also compared the WGS genotypes with the imputed dosage values form the previous studies.
Summary statistics from the GWAS will be uploaded to the GWAS catalog.
Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, 1001–1006 (2014).
Marouli, E. et al. Rare and low-frequency coding variants alter human adult height. Nature 542, 186–190 (2017).
Allen, H. L. et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838 (2010).
Barrett, J. C. et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn’s disease. Nat. Genet. 40, 955–962 (2008).
Frazer, K. A., Murray, S. S., Schork, N. J. & Topol, E. J. Human genetic variation and its contribution to complex traits. Nat. Rev. Genet. 10, 241–251 (2009).
Eichler, E. E. et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 11, 446–450 (2010).
Boomsma, D. I. et al. The Genome of the Netherlands: Design, and project goals. Eur. J. Hum. Genet. 22, 221–227 (2014).
Yang, J. et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat. Genet. 47, 1114–1120 (2015).
Li, B. & Leal, S. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321 (2008).
Ameur, A. et al. SweGen: A whole-genome data resource of genetic variability in a cross-section of the Swedish population. Eur. J. Hum. Genet. 25, 1253–1260 (2017).
Gudbjartsson, D. F. et al. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 47, 435–444 (2015).
Besenbacher, S. et al. Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios. Nat. Commun. 6, (2015).
Panoutsopoulou, K., Tachmazidou, I. & Zeggini, E. In search of low-frequency and rare variants affecting complex traits. Hum. Mol. Genet. 22, 16–21 (2013).
Peltonen, L., Palotie, A. & Lange, K. Use of population isolates for mapping complex traits. Nat. Rev. Genet. 1, 182–90 (2000).
Wu, Y., Zheng, Z., Visscher, P. M. & Yang, J. Quantifying the mapping precision of genome-wide association studies using whole-genome sequencing data. Genome Biol. 18, 1–10 (2017).
Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Natl. Acad. Sci. 113, 11901–11906 (2016).
Gilly, A. et al. Cohort-wide deep whole genome sequencing and the allelic architecture of complex traits. Nat. Commun. 9, (2018).
Southam, L. et al. Whole genome sequencing and imputation in isolated populations identify genetic associations with medically-relevant complex traits. Nat. Commun. 8, 1–11 (2017).
Morrison, A. C. et al. Whole-genome sequence-based analysis of high-density lipoprotein cholesterol. Nat. Genet. 45, 899–901 (2013).
de Vries, P. S. et al. Whole-genome sequencing study of serum peptide levels: The Atherosclerosis Risk in Communities study. Hum. Mol. Genet. 26, 3442–3450 (2017).
Danjou, F. et al. Genome-wide association analyses based on whole-genome sequencing in Sardinia provide insights into regulation of hemoglobin levels. Nat. Genet. 47, 1264–1271 (2015).
Sidore, C. et al. Genome sequencing elucidates Sardinian genetic architecture and augments association analyses for lipid and blood inflammatory markers. Nat. Genet. 47, 1272–1281 (2015).
Enroth, S., Johansson, Å., Enroth, S. B. & Gyllensten, U. Strong effects of genetic and lifestyle factors on biomarker variation and use of personalized cutoffs. Nat. Commun. 5 (2014).
Enroth, S., Bosdotter Enroth, S., Johansson, Å. & Gyllensten, U. Effect of genetic and environmental factors on protein biomarkers for common non-communicable disease and use of personally normalized plasma protein profiles (PNPPP). Biomarkers 20, 355–364 (2015).
Ahsan, M. et al. The relative contribution of DNA methylation and genetic variants on protein biomarkers for human diseases. PLoS Genet. 13, 1–24 (2017).
Igl, W., Johansson, A. & Gyllensten, U. The Northern Swedish Population Health Study (NSPHS)–a paradigmatic study in a rural population combining community health and basic research. Rural Remote Health 10, 1363 (2010).
Enroth, S., Hallmans, G., Grankvist, K. & Gyllensten, U. Effects of Long-Term Storage Time and Original Sampling Month on Biobank Plasma Protein Concentrations. EBioMedicine 12, 309–314 (2016).
Enroth, S. et al. Systemic and specific effects of antihypertensive and lipid-lowering medication on plasma protein biomarkers for cardiovascular diseases. Sci. Rep. 8, 1–10 (2018).
Ahola-Olli, A. V. et al. Genome-wide Association Study Identifies 27 Loci Influencing Concentrations of Circulating Cytokines and Growth Factors. Am. J. Hum. Genet. 100, 40–50 (2017).
Johansson, A. et al. Identification of genetic variants influencing the human plasma proteome. Proc. Natl. Acad. Sci. 110, 4673–4678 (2013).
Enroth, S., Enroth, S. B., Johansson, Å. & Gyllensten, U. Protein profiling reveals consequences of lifestyle choices on predicted biological aging. Sci. Rep. 5, 1–10 (2015).
Folkersen, L. et al. Mapping of 79 loci for 83 plasma protein biomarkers in cardiovascular disease. PLoS Genet. 13, 1–21 (2017).
Sun, B. B. et al. Genomic atlas of the human plasma proteome. Nature 558, 73–79 (2018).
Dmitrieva, J. et al. IBD risk loci are enriched in multigenic regulatory modules encompassing putative causative genes. Nat. Commun. 9, 1–18 (2018).
Díaz-Jiménez, D. et al. A functional IL1RL1 variant regulates corticosteroid-induced sST2 expression in ulcerative colitis. Sci. Rep. 7, 1–14 (2017).
Zineh, I., Welder, G. J. & Langaee, T. Y. Development and cross-validation of sequencing-based assays for genotyping common polymorphisms of the CXCL5 gene. Clin. Chim. Acta 370, 72–75 (2006).
Amoli, M. M., Larijani, B., Thomson, W., Ollier, W. E. R. & Gonzalez-Gay, M. A. Two polymorphisms in the epithelial cell-derived neutrophil-activating peptide (ENA-78) gene. Dis. Markers 21, 75–77 (2005).
Simon, L. M. et al. Integrative Multi-omic Analysis of Human Platelet eQTLs Reveals Alternative Start Site in Mitofusin 2. Am. J. Hum. Genet. 98, 883–897 (2016).
Zineh, I. et al. CXCL5 gene polymorphisms are related to systemic concentrations and leukocyte production of epithelial neutrophil-activating peptide (ENA-78). Cytokine 33, 258–263 (2006).
Z’Graggen, K., Walz, A., Mazzucchelli, L., Strieter, R. M. & Mueller, C. The C-X-C chemokine ENA-78 is preferentially expressed in intestinal epithelium in inflammatory bowel disease. Gastroenterology 113, 808–816 (1997).
Anonymous. World Medical Association. World Medical Association Declaration of Helsinki Ethical Principles for Medical Research Involving Human Subjects. JAMA2 284, 3043–3045 (2000).
Mascalzoni, D. et al. Comparison of participant information and informed consent forms of five European studies in genetic isolated populations. Eur. J. Hum. Genet. 18, 296–302 (2010).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
Van der Auwera, G. A. et al. From fastQ data to high-confidence variant calls: The genome analysis toolkit best practices pipeline. Current Protocols in Bioinformatics, https://doi.org/10.1002/0471250953.bi1110s43 (2013).
Purcell, S. et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Aulchenko, Y. S., Ripke, S., Isaacs, A. & van Duijn, C. M. GenABEL: An R library for genome-wide association analysis. Bioinformatics 23, 1294–1296 (2007).
R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2018).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, 1–7 (2010).
Turner, D. S. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. J. Open Source Softw. 3, 731 (2018).
Pruim, R. J. et al. LocusZoom: Regional visualization of genome-wide association scan results. Bioinformatics 27, 2336–2337 (2011).
Westra, H. et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat. Publ. Gr. 45, 1238–1243 (2013).
The computations were performed on resources provided by SNIC through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under project sens2016007. The research was funded by The Swedish Society for Medical Research (SSMF), the Swedish Medical Research Council, and the Kjell and Märta Beijers, the Marcus Borgström, the K and O F Hedströms, the Åke Wiberg, the Swedish Heart-Lung Foundation, and Science for Life Laboratory.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Höglund, J., Rafati, N., Rask-Andersen, M. et al. Improved power and precision with whole genome sequencing data in genome-wide association studies of inflammatory biomarkers. Sci Rep 9, 16844 (2019). https://doi.org/10.1038/s41598-019-53111-7