Improved power and precision with whole genome sequencing data in genome-wide association studies of inflammatory biomarkers

Genome-wide association studies (GWAS) have identified associations between thousands of common genetic variants and human traits. However, common variants usually explain a limited fraction of the heritability of a trait. A powerful resource for identifying trait-associated variants is whole genome sequencing (WGS) data in cohorts comprised of families or individuals from a limited geographical area. To evaluate the power of WGS compared to imputations, we performed GWAS on WGS data for 72 inflammatory biomarkers, in a kinship-structured cohort. When using WGS data, we identified 18 novel associations that were not detected when analyzing the same biomarkers with genotyped or imputed SNPs. Five of the novel top variants were low frequency variants with a minor allele frequency (MAF) of <5%. Our results suggest that, even when applying a GWAS approach, we gain power and precision using WGS data, presumably due to more accurate determination of genotypes. The lack of a comparable dataset for replication of our results is a limitation in our study. However, this further highlights that there is a need for more genetic epidemiological studies based on WGS data.


Supplementary material
. Sequence depth and quality for top GWAS hits. All variants are from WGS data sequence at a mean coverage of 30x. VQSR tranche ‡ , genotype quality score and sequence depth are presented for each top variant as well as whether they are in a low complexity region (as defined in ref 3 ) Table S3. Results from biomarkers with significant conditional results. The most significant variant and its pvalue is presented for each panel. If the top variant is the same for both panels, the p-value is stated twice. When a technical replicate is available, linkage disequilibrium (R 2 ) between the top variant and the technical replicate is shown.  Table S4. Sequence depth and quality for the top GWAS hits in the conditional analyses. All variants are from the WGS data sequence at a mean coverage of 30x. VQSR tranche ‡ , genotype quality score and sequence depth are presented for each top variant as well as whether they are in a low complexity region (as defined in ref 3 ) Supplementary Table S5. Narrow sense heritability (h2) for each biomarker. h2 is the overall heritability. h2 (~top) is the heritability conditioned of the top variant, h2 (~top+second) on the top variant and top from the conditional analysis when it has been performed. h2 (~top+second+tertiary) shows the heritability conditioned on the top variant, the top from the first conditional analysis and the second, when the biomarkers had significant tertiary peaks. The biomarkers with a non-significant heritability estimate are marked with "n.s".   NSPHS, LD taken from 1000G FIN ** no assoc -indicates that there was no genome-wide significant association in our cohort (P < 1.62e-08) or in the Finnish cohort (P < 1.2e-09) ´r s5745687, rs385076, rs17229943, rs71478720 were nominally associated with the same biomarker in our cohort (P = 3.6e-06, P= 0.00017, P= 3.1e-06, P = 2.7e-06, respectively) ´´ R 2 between the top variant found in the present study and Ahola-Olli et al. Table S7. Disease-associations for inflammatory diseases from the GWAS catalog. Each biomarker is presented with each GWAS variant as well as whether the variant was associated with a disease itself or in LD > 0.8 with a previously associated variant.  Figure S1. Plot of the simulation to find a suitable MAF cutoff. The p-values (-log10) are plotted against the number of individuals with genotype 1, i.e. a genotype giving rise to an extreme biomarker concentration. The rest of the individuals are assumed to have genotype 0 and the biomarker levels are assumed to be rank-transformed. The total number of individuals are coded as follows: red -1000, black -900, green -800, blue -700, magenta -600 and grey -500. The standard genome-wide significance of 5x10 -8 is represented by the dashed line. The plus signs (+) and the triangles (D) represents cases with n=900 when the individuals with genotype 1 does not have the most extreme concentrations. The number of individuals in this study with WGS data is 1021 of which up to 957 in ONC_CVD and 893 in INF have biomarker data. A MAF of 0.15% (3 alleles in 1021 individuals) was therefore used since 4 alleles with extreme biomarker concentrations are detectable in down to 800 individuals with our genome-wide significance of 1.62x10 -8 .

Supplementary Figures S2-S20.
Locuszoom plots for the loci where at least one significant hit was found with WGS data but not with genotyped/imputed data. The regional plots to the right are for the WGS data where the ones only denoted "regional plot" is the INF panel. The ones denoted "regional plot technical replicate" is the ONC_CVD panel. All graphs have colored dots based on the LD data information from 1000G nov 2014 EUR dataset. Grey dots depict when no usable LD information could be found for the reference SNP or other SNPs in relation to the reference SNP.
Supplementary Figure S2. Regional association plots for ADA. Imputed data to the left and WGS data to the right. The lead variant (rs11555566) is in high LD with other variants in the 1000G data (red points in the plot). However, rs11555566 has a slightly lower frequency (1.9%) in our cohort compared to the 1000G cohorts 6.6%) and the LD with other SNVs is much lower, which could explain the less accurate imputation of rs11555566 and the lack of other SVNs with genome wide significant P-values in the region. Recombination rate (cM/Mb)  Figure S3. Regional association plots for CASP-8. Imputed data to the left, WGS data from INF in the middle and from ONC_CVD, the technical replicate, to the right.
CASP−8 regional plot Recombination rate (cM/Mb) Recombination rate (cM/Mb) Recombination rate (cM/Mb)  Recombination rate (cM/Mb)  Figure S6. Regional association plots for CCL23. Imputed data to the left and WGS data to the right. When analyzing the imputed data, CCL23 only had a significant association (shown in the plot below) in the smaller sub-cohort (recruited in 2009) and not in the other sub-cohort (recruited in 2006). This biomarker was therefore note used in the replication step 2 and was not counted as a significant association. Recombination rate (cM/Mb)  Figure S9. Regional association plots for CX3CL1. Imputed data to the left and WGS data to the right. Recombination rate (cM/Mb) Supplementary Figure S10. Regional association plots for CXCL1. Imputed data to the left, WGS data from INF in the middle and from ONC_CVD, the technical replicate, to the right. Recombination rate (cM/Mb) Recombination rate (cM/Mb) Supplementary Figure S11. Regional association plots for CXCL9. Imputed data to the left and WGS data to the right. As discussed below, rs11548618 is most likely a false positive association and has not been considered a novel finding in this study. Recombination rate (cM/Mb) Supplementary Figure S12. Regional association plots for CXCL11. Imputed data to the left and WGS data to the right. As discussed below, rs11548618 is most likely a false positive association and has not been considered a novel finding in this study. Recombination rate (cM/Mb) Supplementary Figure S13. Regional association plots for FGF-5. Imputed data to the left and WGS data to the right. FGF−5 regional plot Recombination rate (cM/Mb) Recombination rate (cM/Mb) Supplementary Figure S15. Regional association plots for ST1A1. Imputed data to the left and WGS data to the right. Recombination rate (cM/Mb) Supplementary Figure S16. Regional association plots for STAMBP. Imputed data to the left and WGS data to the right. No usable LD information could be loaded for the reference SNP. The top variant (1:53206258) could not be plotted. The variant with the second lowest p-value is highlighted instead.
STAMBP regional plot, imputed data Recombination rate (cM/Mb) Recombination rate (cM/Mb) Supplementary Figure S17. Regional association plots for TGFB1. Imputed data to the left, WGS data from INF in the middle and from ONC_CVD, the technical replicate, to the right. Recombination rate (cM/Mb) Recombination rate (cM/Mb) Supplementary Figure S18. Regional association plots for TNFB. Imputed data to the left and WGS data to the right. When analyzing the imputed data, TNFB only had a significant (shown in the plot below) association in the smaller sub-cohort (recruited in 2009) and not in the other sub-cohort (recruited in 2006). This biomarker was therefore note used in the replication step 2 and was not reported as a significant association in the previous study. Recombination rate (cM/Mb) Supplementary Figure S19. Regional association plots for TNFSF14. Imputed data to the left and WGS data to the right. CXCL10, also known as interferon gamma-induced protein type 10 have previously been associated with blood protein measurements, where the variant rs11548618 has been the top SNV 2,[4][5][6] . In this study, rs11548618 was replicated as the top variant. The beta coefficient has previously been reported as positive 1,2,4 . However, in one of the overlapping measurements, ONC_CVD, the same variant showed a significant association with the nearby chemokines CXCL9 and CXCL11 with the opposite direction. This association is not apparent in the INF measurement, where the SNV (rs11548618) does not reach nominal significance (CXCL9: p=0.78, CXCL11: p=0.68). All variants that are significantly associated with CXCL9 or CXCL11 in ONC_CVD are associated with CXCL10 in INF, with a positive beta coefficient in INF and a negative in ONC_CVD. The variants associated with CXCL9 and CXCL11 in ONC_CVD are also associated with CXCL10, with a more significant association. The genes encoding these are all closely located at chromosome 4. The consequence of the minor allele is a missense mutation resulting in an amino acid change from arginine to cysteine. It is possible that this results in a change in antibody affinity when measuring the blood plasma proteins. The PEA is dependent on proximity antibody probe pairs, which binds to their specific antigens and then hybridize. By adding DNA polymerase, and extension and joining of the oligonucleotides is produces and this forms a PCR template 7 . When carrying the allele that leads to a missense mutation, it might be that the antibodies normally binding to CXCL9 and CXCL11 have a higher probability of binding to CXCL10. The hybridization at CXCL9 and CXCL11 cannot form if not both antibodies bind. This then results in a seemingly lower concentration of CXCL9 and CXCL11, when in reality, the protein concentrations could not be accurately measured. Since ONC_CVD have been measured in the cohort on an earlier timepoint than INF, the panels have been refined and the antibody binding have become more precise. rs115485618 has been significantly associated with CXCL10 blood plasma measurements before, thus it is more plausible that this is the true association. The previous studies from the same cohort have used genotyped/imputed data 1,2,6 . There were no associations to those biomarkers that passed genome-wide significance but it might just be due to power issues, when comparing results from genotyped/imputed sequence data to whole genome sequence data. Figure S21. Left: correlation of plasma protein level measurements between ONC_CVD (x axis) and INF (y axis). Right: correlation of nominal p-values from the GWA analysis. Results from ONC_CVD on the y axis and INF on the x axis. A) Measurements from CXCL9. B) Measurements from CXCL10. C) Measurements from CXCL11. Supplementary Figures S26-S30. Locus zoom plots for the loci where at least one significant hit was found with WGS data and with genotyped/imputed data but the top variants are in LD < 0.8. All graphs have colored dots based on the LD data information from 1000G nov 2014 EUR dataset. Grey dots depict when no usable LD information could be found for the reference SNP or other SNPs in relation to the reference SNP. Figure S26. Regional association plots for CCL4. Imputed data to the left, WGS data from INF in the middle and from ONC_CVD, the technical replicate, to the right.  Figure S28. Regional association plots for CD6. Imputed data to the left and WGS data to the right.