Introduction

The importance of immune and inflammatory factors related to many common diseases is becoming more and more evident. Inflammation has long been associated with the development of cancer [1, 2]. Schmitt et al. [3] discovered that natural killer (NK) cells provide immune surveillance of cancer and Lanier [4] reported that their cytotoxic activity is controlled by a balance of activating and inhibitory signals.

Pahl [5] demonstrated the importance of NF-κB complexes in the activation by inflammatory cytokines and promotion of tumour-cell survival through anti-apoptotic signalling in prostate cancer [5, 6]. It is also known that Toll-like receptors play a key role in the innate immune system and have been found expressed by some types of tumour cells [7].

Previous genetic association studies have suggested that not only human leukocyte antigen (HLA) genes but also non-HLA genes may play an important role in haematopoietic stem cell transplantation (HSCT) and graft-versus-host disease (GvHD) [8]. HLA genes are encoded in a ~3500 kb segment on human chromosome 6p21.3, which is the most variable region in the human genome [8]. In addition, disturbances in immune and inflammatory-related systems have been associated with several psychiatric disorders [9].

Genome-wide association studies (GWAS) involving hundreds of thousands of genetic markers have led to the discovery of susceptibility genes for cancers and inflammatory diseases, as well as many other conditions and illnesses. As one of the first critical steps in developing such a large-scale study, researchers must choose the genotyping array to be used. As Delano et al. reported [10], the choice of genotyping platform strongly influences the power of the study and thus the likelihood of GWAS success. During the decision-making process, one therefore needs to compare different single nucleotide polymorphism (SNP) arrays with the intended goal of the study in mind. The most relevant and most widely used criterion in SNP array evaluation is coverage. Global coverage is defined as the fraction of SNPs captured in terms of linkage disequilibrium (LD) by the SNPs on the array, representing the average level of coverage of all SNPs across the genome [11]. However, although global coverage provides an average evaluation of the array, it is not sufficient to capture the variability in LD across the genome [12].

In order to achieve a fully informative estimate of coverage with respect to selected immune and inflammatory pathways and related genes, we evaluated three SNP arrays from Illumina, Inc.: the Infinium Global Screening Array-24 v1.0 (GSA), the Infinium OncoArray-500 K BeadChip (OncoArray) and the Infinium PsychArray-24 v1.2 BeadChip (PsychArray). Our investigation provided estimates of global coverage across the genome, coverage for each chromosome, coverage for selected pathways and finally coverage for genes of interest within these pathways. Pathways were selected based on their involvement in innate and adaptive immunity, crosstalk with inflammation and haematopoietic stem cell transplantation. To this end, we chose to investigate eight inflammatory and immunological pathways: natural-killer-cell-mediated cytotoxicity (hsa04650), NF-kappa B signalling (hsa04064), Wnt signalling (hsa04310), antigen processing and presentation (hsa04612), Toll-like receptor signalling (hsa04620), JAK-STAT signalling (hsa04630), insulin signalling (hsa04910) and B-cell receptor signalling (hsa04662). Europeans were used as the reference sample in the analyses.

Materials and methods

The arrays

In order to compare arrays meaningfully, we selected only the arrays with at least 500 K markers. We estimated coverage for the GSA and two other consortia-derived customised arrays, the OncoArray and the PsychArray, all three manufactured by Illumina, Inc. These arrays are cost-effective, high-density arrays with at least 500 K SNPs and developed for large-scale genetic studies. Although our focus was on inflammatory and immune-related pathways, we did not consider the Infinium ImmunoArray-24 v2 BeadChip owing to its much smaller size of only 254 K SNPs. The arrays are summarised in Table 1.

Table 1 Summary of arrays

Illumina developed the OncoArray in cooperation with cancer scientists forming the OncoArray Consortium. The array contains ~500 K SNPs, with 250 K proven tag SNPs covering the whole genome (backbone) and 250 K hand-selected SNPs of particular interest. It was built to provide insight into the relationship between gene variants predisposing to breast, ovarian, prostate, colorectal and lung cancer, the most relevant cancers in terms of mortality [13, 14].

The PsychArray comprises 593 K markers and was developed in collaboration with the Psychiatric Genomics Consortium and several leading research institutions for genetic studies focussed on psychiatric predisposition and risk. The array contains ~271 K proven tag SNPs found on the Infinium Core-24 BeadChip, ~277 K markers from the Infinium Exome-24 BeadChip and ~50 K markers associated with common psychiatric disorders, such as schizophrenia, bipolar disorder, autism-spectrum disorders, attention deficit hyperactivity disorder, major depressive disorder, obsessive-compulsive disorder, anorexia nervosa and Tourette’s syndrome [15].

The GSA contains ~640 K SNPs and is presented as a genomic tool for clinical research applications including disease risk profiling studies, pharmacogenomics research, wellness characterisation and complex disease discovery. The GSA has been optimised for genomic coverage and imputation performance in all five defined super populations (Africans, mixed Americans, East Asians, Europeans, South Asians) [16].

Only SNPs defined by rs-numbers were included in the analysis, since their genomic location is obtainable from the NCBI database, dbSNP [17]. This filter is unlikely to bias coverage downward appreciably, since the excluded markers are mostly rare: for the OncoArray, for example, their median MAF is 0.03 in a sample of cancer-free Caucasians from the International Lung Cancer Consortium [18]. The three arrays have ~125 K SNPs in common. All information on the markers contained in the arrays is publicly available on the Illumina website [15, 16, 19].
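The rs-number filter and the overlap between arrays can be sketched as follows. This is a minimal illustration, not the authors' R code: the function names and the toy marker IDs are hypothetical, and we assume that an rs-number is an ID of the form `rs` followed by digits only.

```python
import re

# Assumption: a marker "defined by an rs-number" has an ID of the form rs<digits>.
RS_PATTERN = re.compile(r"^rs\d+$")


def keep_rs_markers(marker_ids):
    """Return the set of marker IDs that are plain dbSNP rs-numbers."""
    return {m for m in marker_ids if RS_PATTERN.match(m)}


def common_markers(*arrays):
    """Intersect the rs-filtered marker sets of several arrays."""
    filtered = [keep_rs_markers(a) for a in arrays]
    return set.intersection(*filtered)


# Toy manifests with hypothetical IDs (not taken from the Illumina files).
gsa = ["rs1", "rs2", "rs3", "GSA-rs4", "exm5"]
onco = ["rs1", "rs3", "rs6", "chr1_123_A_G"]
psych = ["rs1", "rs3", "rs7", "exm5"]

shared = common_markers(gsa, onco, psych)
print(sorted(shared))
```

In this toy example only the markers present as rs-numbers on all three manifests survive, which mirrors how the set of SNPs common to the three arrays is obtained.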

The reference set

We estimated the coverage of these three Illumina arrays using the 1000 Genomes Project (Version 3 April 2012, NCBI Build 37) as reference set, as it is often adopted for imputation purposes. For our analysis we concentrated on the 286 European individuals (samples: GBR; IBS; CEU; TSI) and on SNPs with MAF ≥ 1%, calculating coverage based only on common SNPs. After MAF filtering, the final reference set included 8,846,061 SNPs on 23 chromosomes (1–22, X). In total, 13,725,914 SNP pairs were found to be in LD at a threshold of r2 ≥ 0.8.
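The two filters used to build the reference set (MAF ≥ 1% and r2 ≥ 0.8) can be illustrated with a small sketch. It rests on assumptions not stated in the text: genotypes are coded as 0, 1 or 2 copies of one allele, and r2 is approximated by the squared Pearson correlation of genotype vectors rather than computed from phased haplotypes. All names and data are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0, 1 or 2 copies of the alternate allele."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(g1)
    mean1, mean2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var1 * var2)


def ld_pairs(genotype_table, maf_min=0.01, r2_min=0.8):
    """List SNP pairs with MAF >= maf_min and r2 >= r2_min (all-pairs scan)."""
    common = {snp: g for snp, g in genotype_table.items()
              if minor_allele_frequency(g) >= maf_min}
    ids = sorted(common)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if r_squared(common[a], common[b]) >= r2_min:
                pairs.append((a, b))
    return pairs


# Toy genotypes for eight individuals (hypothetical data).
table = {
    "snpA": [0, 1, 2, 0, 1, 2, 0, 1],
    "snpB": [0, 1, 2, 0, 1, 2, 0, 1],  # identical to snpA, so in perfect LD
    "snpC": [0, 0, 0, 0, 2, 2, 2, 2],  # weakly correlated with snpA
    "snpD": [0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic, removed by the MAF filter
}
print(ld_pairs(table))
```

The all-pairs scan is quadratic in the number of SNPs; for a reference set with millions of SNPs one would restrict the comparison to SNPs within a physical window, a detail the text does not specify.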

The estimation equation

We calculated global coverage across the genome, global coverage for each chromosome, coverage for selected pathways and coverage for genes of interest. The coverage rate represents the fraction of all SNPs that can be captured by the array. We applied the equation:

$${\mathrm{CR}} = \frac{{\frac{{\mathrm{L}}}{{{\mathrm{R}} \ - \ {\mathrm{T}}}}\left( {{\mathrm{G}} - {\mathrm{T}}} \right) + {\mathrm{T}}}}{{\mathrm{G}}}$$

defined by Barrett and Cardon [11] and Li et al. [12]. Here, R is the number of SNPs in the reference set; T is the number of SNPs directly genotyped on the array that are also in the reference set; L is the number of SNPs in the reference set that are not on the array but are in LD (r2 ≥ 0.8) with a SNP on the array; and G is the number of all SNPs validated in the dbSNP database with MAF ≥ 1%, taken as 19 million genome-wide and thus 6809 SNPs in each 1 Mbp region. Since this equation ignores SNPs that are on the array but not in the reference set, we consider CR as the lower bound of the coverage rate. We also computed a modified coverage estimate:

$${\mathrm{CR}}_1 = \frac{{\frac{{{\mathrm{L}}_1}}{{{\mathrm{R}}_1 \ - \ {\mathrm{T}}_1}}\left( {{\mathrm{G}} - {\mathrm{T}}_1} \right) + {\mathrm{T}}_1}}{{\mathrm{G}}}$$

as proposed by Li et al. [12]; using their notation, we set R1 = R + m, T1 = T + m and \({\mathrm{L}}_1 = \frac{{{\mathrm{T}}_1}}{{\mathrm{T}}}{\mathrm{L}}\), where m is the number of SNPs on the array that are not in the reference set. Since this implicitly assumes that the number of tagged SNPs L increases linearly with the number of SNPs on the array, CR1 tends to overestimate coverage [20]. We therefore used the average of CR and CR1 as the final estimate.
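The two estimates and their average can be written out as a short sketch. The function names are hypothetical and the input numbers are invented for illustration; only the formulas follow the definitions above. The helper `coverage_counts` assumes that LD is known only between reference SNPs, so a tagging SNP must be both on the array and in the reference set.

```python
def coverage_counts(reference, array, ld_pairs):
    """Derive R, T, L and m from SNP ID sets and a list of SNP pairs in LD."""
    reference, array = set(reference), set(array)
    typed = reference & array  # genotyped on the array and in the reference
    tagged = set()
    for a, b in ld_pairs:
        if a in typed and b in reference and b not in array:
            tagged.add(b)
        if b in typed and a in reference and a not in array:
            tagged.add(a)
    m = len(array - reference)  # on the array but absent from the reference
    return len(reference), len(typed), len(tagged), m


def coverage_rate(R, T, L, G):
    """CR = (L / (R - T) * (G - T) + T) / G."""
    if R == T:
        raise ValueError("R must exceed T (no untyped reference SNPs)")
    return (L / (R - T) * (G - T) + T) / G


def averaged_coverage(R, T, L, m, G):
    """Return (CR, CR1, mean of both) with R1 = R + m, T1 = T + m, L1 = T1 / T * L."""
    if T == 0:
        raise ValueError("T must be positive to scale L")
    cr = coverage_rate(R, T, L, G)
    R1, T1 = R + m, T + m
    L1 = T1 / T * L
    cr1 = coverage_rate(R1, T1, L1, G)
    return cr, cr1, (cr + cr1) / 2


# Invented counts for illustration only.
cr, cr1, final = averaged_coverage(R=100, T=20, L=40, m=5, G=200)
print(round(cr, 4), round(cr1, 4), round(final, 4))
```

Note that R1 − T1 equals R − T, so CR1 differs from CR only through the rescaled L1 and the larger T1; this is why CR1 is never smaller than CR when m ≥ 0.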

The calculations were performed for the total set of SNPs contained on each array and for the set of overlapping SNPs that the arrays have in common. We defined the latter set of markers as the genome-wide, common non-customised backbone, hereafter denoted the common backbone. Note that this common backbone has only approximately half of the sites of the Illumina backbone for the OncoArray and PsychArray.

The reference set and the array data were reduced to the regions of interest in order to estimate coverage, except for global coverage. A gene region was defined as spanning the transcription start to end positions, including both exons and introns, extended by 50 kbp upstream and downstream [12]. Genes containing fewer than five SNPs were excluded from the analysis in order to prevent unreliable results [12, 20]. For the pathways, the reference and array data were reduced to the regions defined by all the genes in the pathway. The estimation was performed for pathways with more than five SNPs, which in our case applied to every pathway of interest. All analyses were performed using the statistical computing language and environment R 3.4.1 [21].
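The reduction to gene and pathway regions can be sketched as follows. Several points are assumptions for illustration: all genes lie on one chromosome, the five-SNP threshold is applied to reference SNPs in the flanked window, and overlapping gene windows within a pathway are merged so that a SNP is not counted twice. The two gene coordinates are the HLA-DQA1 and HLA-DQB1 positions quoted in the Results; the SNP positions are invented.

```python
FLANK = 50_000   # 50 kbp upstream and downstream of the gene
MIN_SNPS = 5     # genes with fewer SNPs are excluded


def gene_window(start, end, flank=FLANK):
    """Extend a gene's start and end positions by the flanking distance."""
    return max(1, start - flank), end + flank


def snps_in_window(positions, window):
    """Keep the SNP positions that fall inside a closed interval."""
    lo, hi = window
    return [p for p in positions if lo <= p <= hi]


def merge_windows(windows):
    """Merge overlapping intervals into a sorted list of disjoint intervals."""
    merged = []
    for lo, hi in sorted(windows):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Gene coordinates on chromosome 6 as quoted in the text (GRCh37).
genes = {
    "HLA-DQA1": (32595956, 32614839),
    "HLA-DQB1": (32627244, 32636160),
}
# Hypothetical reference SNP positions on the same chromosome.
reference_positions = [32550000, 32600000, 32610000,
                       32630000, 32670000, 32700000]

windows = {g: gene_window(s, e) for g, (s, e) in genes.items()}
kept_genes = [g for g, w in windows.items()
              if len(snps_in_window(reference_positions, w)) >= MIN_SNPS]

pathway_region = merge_windows(windows.values())
pathway_snps = [p for w in pathway_region
                for p in snps_in_window(reference_positions, w)]
print(kept_genes, pathway_region, len(pathway_snps))
```

In this toy example each gene window holds only four SNPs and would be dropped, while the merged pathway region holds five, which shows how a pathway can remain analysable even when some of its genes are excluded.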

The pathways and genes of interest

Since we had a special interest in HSCT, we primarily focussed on the following inflammatory and immunological pathways: natural-killer-cell-mediated cytotoxicity (hsa04650), NF-kappa B signalling (hsa04064), Wnt signalling (hsa04310), antigen processing and presentation (hsa04612), Toll-like receptor signalling (hsa04620), JAK-STAT signalling (hsa04630), insulin signalling (hsa04910) and B-cell-receptor signalling (hsa04662). The NF-κB signalling pathways have an essential role in many aspects of inflammation and of innate and adaptive immunity. Importantly, NF-κB is also a crucial player in many steps of cell transformation. NF-κB cooperates with multiple other signalling molecules and pathways. Prominent nodes of crosstalk are mediated by JAK-STAT and Wnt signalling, leading to mutual pathway activation or even negative regulation in a context-dependent manner. The JAK-STAT pathway is also tightly linked to the regulation of inflammatory processes, mediating the responses of immune cells to pro-inflammatory and anti-inflammatory cytokines. The capability of the host to build a defence against infections or danger signals depends crucially on its ability to process and present antigens to immune cells and to activate the corresponding response cascades. It is well known that antigen presentation is mediated by MHC class I and class II molecules. These molecules are detectable on the surface of antigen-presenting cells, and their compatibility is key to the outcome of HSCT. Natural killer (NK) cells are large granular cells that often lack antigen-specific cell surface receptors and are involved in certain innate immune responses. In addition, Toll-like receptors (TLRs) play crucial roles in the innate immune system, mainly by activating different branches of NF-κB and IRF signalling. Gene coverage was additionally determined for the 37 genes that play a central role in these pathways.
The Ensembl and KEGG databases were consulted to annotate SNPs to genes and genes to pathways, respectively [22, 23]. The positions of the markers on the arrays and annotation files correspond to Genome Reference Consortium Human Build 37 (GRCh37).

Results

Details of global coverage and coverage for each chromosome across the arrays studied are given in Table 2. The global coverage rates were estimated as 14.23% for the GSA, 12.51% for the OncoArray and 12.24% for the PsychArray. The global coverage estimated for the OncoArray and PsychArray is comparable with the findings of Ha et al. [20] for arrays similar in size, considering the same genetic super population. The GSA achieved the highest coverage on most chromosomes. The range of coverage between chromosomes is 10.44–13.52% for the GSA, 9.63–13.93% for the OncoArray and 9.92–12.56% for the PsychArray. Chromosomes 4, 5, 7 and 14 consistently have coverage below 11% for all three arrays studied. Note also that the coverage of chromosome 19 for the PsychArray is lower than 10%. Nevertheless, there are two genes on chromosome 19, HCST and TYROBP, that are particularly well covered by all three arrays (24.32% and 21.02% for the GSA, 43.77% and 40.83% for the OncoArray, and 25.21% and 21.72% for the PsychArray; these two genes do not belong to the pathways of interest).

Table 2 Global and chromosome coverage

Table 3 gives details of the coverage per pathway and the number of SNPs for each array. The coverage for the eight pathways of interest ranges from 6.21% (PsychArray for hsa04612) to 13.91% (GSA for hsa04064). With one exception, the GSA always outperformed the competing arrays, although the differences between the arrays remain relatively small. Specifically, the range of coverage across the considered pathways is 8.11–13.91% for the GSA, 8.65–12.04% for the OncoArray and 6.21–11.35% for the PsychArray. However, across the eight pathways of interest, there are 16 genes for the GSA, 22 for the OncoArray and 24 for the PsychArray with coverage of less than 1%. The estimates of pathway coverage differ little between the two equations CR and CR1, with a median difference of 1.44 percentage points. The maximum difference was observed for the coverage of hsa04612 by the OncoArray, with CR = 4.2% and CR1 = 13.1% (see Supplementary Material).
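As a worked check of how the two estimates combine, and assuming the reported pathway coverage is the simple average of CR and CR1 described in the methods, the OncoArray values for hsa04612 give

$$\frac{{\mathrm{CR}} + {\mathrm{CR}}_1}{2} = \frac{4.2\% + 13.1\%}{2} = 8.65\%$$

which agrees with the lower end of the pathway coverage range reported for the OncoArray (8.65%).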

Table 3 Pathway coverage for selected inflammatory and immunological pathways

The distribution of the gene coverage within each pathway and for each array is displayed in Fig. 1. In comparison, GSA performs the best on average. However, a strong variability in gene coverage can be easily recognised across all arrays. Individual genes reach coverage of more than 50% in all but two pathways (hsa04612/hsa04620); it is worth noting that the best results were obtained with the OncoArray. Most prominent is pathway hsa04612, which contains four genes of the HLA group well covered by GSA and the OncoArray: HLA-DQA1 (6: 32595956-32614839) with 54.08% (GSA) and 19.42% (OncoArray), HLA-DQB1 (6: 32627244-32636160) with 51.49% and 24.18%, also HLA-C (6: 31236526-31239907) with 23.85% and 40.05% and HLA-DOB (6: 32780540-32784825) with 29.49 and 38.45%.

Fig. 1 Gene coverage by array within selected pathways. Distribution of gene coverage for the following pathways: hsa04650 natural-killer-cell-mediated cytotoxicity, hsa04064 NF-kappa B signalling, hsa04310 Wnt signalling, hsa04612 antigen processing and presentation, hsa04620 Toll-like receptor signalling, hsa04630 JAK-STAT signalling, hsa04910 insulin signalling, hsa04662 B-cell-receptor signalling. GSA Infinium Global Screening Array-24 v1.0, OncoArray Infinium OncoArray-500 K BeadChip, PsychArray Infinium PsychArray-24 v1.2 BeadChip

Details of coverage for the 37 genes of interest are presented in Table 4. Coverage for these genes ranges from 0.28% (OncoArray for CCR5) to 39.48% (OncoArray for IL10RA). The greatest variation in coverage between arrays was found for the gene MICA (GSA: 27.34%, OncoArray: 25.10%, PsychArray: 7.82%), the least for TNFSF13 (GSA: 7.78%, OncoArray: 8.11%, PsychArray: 7.06%). For most genes, the GSA proved to have the highest coverage; for only five of the 37 genes was the GSA the worst of the three arrays studied in terms of coverage. In general, the estimates of gene coverage differ little between the two equations CR and CR1, with a median difference of 1.18 percentage points. However, the maximum difference was observed for the coverage of MICA by the GSA, with CR = 9.2% and CR1 = 45.5%. Similarly strong deviations were also observed for LTA, LTB, MICB and TNF. All these relatively short genes are located at 6p21.33 within the human major histocompatibility complex (MHC), known for its long LD structure [24] (see Supplementary Material).

Table 4 Gene coverage for a set of genes of interest

Lastly, the common backbone (125 K SNPs overlapping between the arrays) accounts for 5.47% global coverage. This compares to 12.24–14.23% global coverage using all SNPs of an array; the SNP set common to all arrays thus captures less than half of the global coverage of any of the arrays. Coverage across chromosomes in the common backbone ranges from 4.26% to 5.82% (see Table 2). However, the common backbone accounts for no more than 0.04% of coverage (hsa04310) across pathways (see Table 3). Nevertheless, the gene coverage of the common backbone ranges between 0% (CD40LG) and 20.93% (IL10RA) (see Table 4).
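Using only the global figures quoted above, the share of each array's global coverage that the common backbone attains is, at the two ends of the range,

$$\frac{5.47}{12.24} \approx 0.45 \qquad \text{and} \qquad \frac{5.47}{14.23} \approx 0.38$$

that is, roughly 45% of the global coverage of the PsychArray and 38% of that of the GSA.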

Discussion

We calculated the global coverage for three high-throughput genotyping arrays manufactured by Illumina, Inc., as well as the coverage by chromosome, the gene coverage and the coverage for eight selected inflammatory and immune pathways. Analogous to global coverage, gene coverage is the fraction of SNPs in a gene region that are captured, in terms of LD, by the SNPs on the array located in that region; it represents the average level of coverage of all SNPs of the 1000 Genomes reference set in the same region. The same applies to chromosomes and to pathways, where the “region” may be scattered across the genome.

The three arrays demonstrated noteworthy variations in their coverage, not only in general but also with respect to inflammatory and immune response genes and pathways.

Our estimates of coverage indicate better performance of the GSA compared to the OncoArray and the PsychArray globally, for most of the chromosomes and for most of the considered inflammatory and immunological pathways and genes. However, the GSA's advantage in global coverage of about 2 percentage points may simply reflect its larger size. Indeed, this improvement is small considering that the GSA contains about twice as many rs-SNPs (~627 K) as the OncoArray (~343 K) or the PsychArray (~306 K) and that the loss from filtering for rs-numbers was lowest for the GSA.

We believe that researchers interested in specific regions of the genome will be able to use our approach to choose the array that best fits their goals. If interested in specific molecular mechanisms or gene families, the emphasis should be placed on pathway or gene coverage rather than on SNP numbers and global coverage. For relatively short genes in regions of low or long-range LD, it seems advisable to examine the lower and upper limits of coverage in addition to their mean.

Low coverage of a gene may result in low power to detect a genetic association, which can often lead to the wrong scientific conclusion that there is no association. Differences in coverage can also lead to inconsistent results between comparable GWAS, and even to heterogeneous marker selection for polygenic risk scores [12]. Considering coverage can also help to identify markers that are not included in an array already in use but are worth genotyping additionally, e.g., by PCR-based methods. One can also exclude genes or pathways of low coverage from the analysis of already genotyped samples to reduce the burden of multiple testing. Even though it appears that the GSA generally performs best, no general recommendation can be made, because our comparison is limited in scope: we focussed on a selection of inflammatory and immune pathways, and we considered only the genotyping of Caucasians with a single reference sample. The extension to other molecular mechanisms, other populations and other reference samples may well affect the results. We therefore recommend calculating global, pathway and gene coverage for the pathways and genes targeted by the planned study, or at the very least for the envisaged population.

To conclude, global coverage alone does not provide enough information to choose the most appropriate genotyping array for a study in planning. Pathway or gene coverage should be considered instead. Moreover, local coverage should be taken into account when discussing inconsistencies in the findings between GWAS, and it can be useful in data analysis and in decisions about additional genotyping.