Introduction

Mapping studies of gene expression phenotypes have successfully led to the identification of regulatory variants and networks across the genome.1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 In these expression quantitative trait locus (eQTL) analyses, genes have been identified whose expression is regulated by SNP markers, which are either in close proximity to the gene locus (cis-acting SNPs) or at greater distances from it (trans-acting SNPs).12 Whereas cis-regulation is influenced by factors such as 5′ promoter or 3′ transcript variants, the mechanisms involved in trans-regulation include gene-mediated effects (eg, transcription factors) or steric interactions such as ‘chromosome cross-talk’.13, 14, 15, 16 However, at many gene loci it must be assumed that both cis- and trans-effects are involved simultaneously in the regulation of expression. Furthermore, it is possible that expression at certain gene loci is regulated by a more complex process that involves epistasis (eg, cis–trans interaction). Unfortunately, these regulatory effects are not detected in one-locus eQTL studies, in which genetic variants are examined in isolation. There are two main reasons why two-locus or interaction eQTL mappings have not been applied to existing data. First, potential two-locus effects are difficult to identify and interpret, as substantial correction for multiple testing is required if the interaction is analyzed in a genome-wide fashion. In a genome-wide 100K SNP set, for example, the P-value of an observed interaction would have to be in the range of P=5 × 10^−12 per transcript before being considered significant. Second, systematic two-locus eQTL mappings require substantial computational resources, although this limitation has recently been overcome by the introduction of novel biostatistical methods.17, 18, 19
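The multiple-testing burden quoted above can be illustrated with a short calculation. The following Python sketch assumes a nominal significance level of 0.05 and assumes that the quoted threshold counts all n × n SNP combinations; the function name and the `unordered` switch are illustrative and are not taken from the study. Counting each unordered SNP pair only once would roughly double the threshold.

```python
from math import comb


def pairwise_bonferroni(n_snps: int, alpha: float = 0.05, unordered: bool = False) -> float:
    """Bonferroni threshold for a genome-wide SNP x SNP interaction scan.

    By default all n * n combinations are counted (assumption); with
    unordered=True each pair of distinct SNPs is counted once.
    """
    n_tests = comb(n_snps, 2) if unordered else n_snps ** 2
    return alpha / n_tests


# 100K SNP set, one transcript
print(f"{pairwise_bonferroni(100_000):.2e}")                  # 5.00e-12
print(f"{pairwise_bonferroni(100_000, unordered=True):.2e}")  # 1.00e-11
```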

In the present study we sought to circumvent some of the limitations associated with interaction scans and performed a systematic two-locus eQTL study for epistasis. Of the three possible two-locus interaction models (ie, cis–cis, cis–trans, trans–trans), we restricted our analysis to cis–trans epistasis. We used the expression data of 3107 high-quality transcripts and 86 613 linkage disequilibrium (LD)-pruned SNP markers obtained from 210 HapMap founders. For each transcript, we tested whether expression levels showed statistical epistasis between a locus-specific cis-SNP and an interacting trans-SNP located elsewhere in the genome. Although other interaction effects may be involved in gene regulation, cis–trans interaction effects were investigated because they may be easier to interpret. For example, it is difficult to control for intermarker LD in cis–cis interaction studies or for multiple testing in trans–trans interaction studies. A further aim of the study was to characterize the identified cis–trans interaction effects, for example, to determine whether SNP markers involved in epistatic gene regulation also represent significant one-locus eQTLs.

Materials and methods

Expression data and study sample

For our genome-transcriptome eQTL analysis we used the expression phenotypes generated by The Wellcome Trust Sanger Institute, Cambridge (GENEVAR, ftp://ftp.sanger.ac.uk/pub/genevar/) from human lymphoblastoid cell lines (LCLs) of all 210 founders in the four International HapMap II populations (http://snp.cshl.org/).8, 9 The sample includes 60 Caucasian individuals (CEU, of northern and western European ancestry), 90 Asian individuals (45 Han Chinese, CHB; and 45 Japanese, JPT), as well as 60 African individuals (YRI, from Nigeria). Although this strategy cannot detect interaction effects on gene regulation that are restricted to one particular population, use of the combined sample provides improved statistical power for the detection of epistasis and has been successfully used in previous one-locus eQTL studies.8, 9 In this sample, we used only expression phenotypes for transcripts that passed a detailed and extensive quality control. Of the 47 294 transcripts analyzed using Illumina's human whole genome expression (WG-6 version 1) array (Illumina Inc., San Diego, CA, USA), only those probes that showed an Illumina detection score of >0.99 in each of the four hybridization experiments conducted across all 210 HapMap individuals were used. These scores were obtained from the Sanger Institute website (‘gene_profile-files’ at ftp://ftp.sanger.ac.uk/pub/genevar/); this filter reduced the number of transcripts included in the present study to 7978 probes. The respective transcripts could be expected to be robustly expressed in human LCLs. In a subsequent step, probes containing SNPs in their hybridization sequence were excluded using the web-based program ReMOAT (version March 2009, http://www.compbio.group.cam.ac.uk/Resources/Annotation/index.html)20 and the dbSNP 126 database (http://www.ncbi.nlm.nih.gov/projects/SNP/). 
Although there is a current debate in the field as to whether this step is necessary and other studies have included SNP-containing probes, we decided to exclude them, as SNPs within a probe might distort the measured expression level. However, the removal of probes with known coding SNPs did not reduce the number of included probes substantially (to 6226 probes). Furthermore, we used ReMOAT to include only probes that are located on autosomes and that mapped over their full length (50 bp) to a contiguous genomic location (ie, no intron-spanning probes). We decided to use exon-specific probes only in order to avoid inaccurate expression signals, which could be caused by insufficient hybridization to different isoforms of the gene (eg, due to exon skipping or exon incorporation). This step reduced the number of included probes to 5237. Next, the uniqueness of genomic hits for each probe was determined using nuID (https://prod.bioinformatics.northwestern.edu/nuID/), which represents a probe identifier for microarray experiments. This reduced the number of included probes further to 4418, each showing a nuID uniqueness score of 100. Only these probes could be specifically mapped to a single Entrez GeneID. Entrez Gene is a repository from the National Center for Biotechnology Information (NCBI) for gene-specific information. In final steps, we filtered for probes whose corresponding transcripts were annotated as ‘reviewed’ or ‘validated’ RefSeq NM_ transcripts (N=3124). The RefSeq database provides a collection of annotated sequences including transcripts. When multiple probes hybridized to the same RefSeq NM_ transcript, only one randomly selected probe was included in the analyses. In the final filtering step, the UCSC Browser version HG18 (http://genome.ucsc.edu/cgi-bin/hgGateway) was used to identify probes with defined transcription start and end sites. Exact matches were found for a total of 3107 transcripts, and these were included in the two-locus eQTL analysis. 
The expression data for each of these 3107 probes were subjected to inverse quantile normalization according to the procedure described by Veyrieras et al10 and the normalized data were saved as PLINK21 alternate phenotype files. PLINK was the program used for the interaction analysis (see below).
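To make the normalization step more concrete, the following Python sketch shows a generic rank-based inverse normal transformation for the expression values of one probe. It is an illustration only: the exact procedure of Veyrieras et al is not reproduced in the text, so the rank offset of 0.5, the averaging of tied ranks and the function name are assumptions.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Map values to standard-normal quantiles of their ranks.

    Assumptions (not from the source): ties receive their average rank,
    and rank r of n is mapped to the quantile (r - 0.5) / n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # find the run of tied values starting at sorted position i
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average 1-based rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    normal = NormalDist()  # mean 0, standard deviation 1
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]


# expression values of one probe in three (hypothetical) individuals
print(inverse_normal_transform([3.2, 1.1, 2.5]))
```

The transformed values keep the rank order of the input but follow an approximately normal distribution, which limits the influence of outlying expression values on a regression-based test.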

Genotyping data

SNP genotypes of each of the 210 founder individuals were obtained from HapMap release 23 using PLINK.21 A total of 3.95 million SNPs were available for each individual after exclusion of SNPs with Mendel errors. The Mendel check was performed in the 30 CEU and 30 YRI trios analyzed in the HapMap Project. Next, we selected only SNPs that were located on autosomes, showed no deviation from Hardy–Weinberg equilibrium (HWE, P>0.05), had allele frequencies between 0.2 and 0.8, and met a per-SNP genotyping missingness cutoff of 0.02. Although this filtering procedure was performed in each of the four populations separately, the LD-pruning step was restricted to the YRI sample, as this population has the lowest level of LD. Here, a pairwise SNP–SNP r² of 0.8 was used as the pruning criterion. The filtering process resulted in N=86 613 SNPs, which were saved as a PLINK binary file for inclusion in the analyses.
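The per-SNP filters described above can be summarized in a few lines of Python. This is a sketch under stated assumptions and not the PLINK workflow that was actually run: the `SnpStats` record and both function names are hypothetical, the missingness cutoff is read as "at most 0.02", and a SNP is assumed to be retained only if it passes in every population. The LD-pruning step is not covered.

```python
from dataclasses import dataclass

AUTOSOMES = {str(c) for c in range(1, 23)}


@dataclass
class SnpStats:
    """Hypothetical per-population summary statistics of one SNP."""
    snp_id: str
    chromosome: str
    allele_freq: float
    hwe_p: float
    missing_rate: float


def passes_qc(s: SnpStats) -> bool:
    """Apply the autosome, HWE, allele frequency and missingness filters."""
    return (
        s.chromosome in AUTOSOMES
        and s.hwe_p > 0.05
        and 0.2 <= s.allele_freq <= 0.8
        and s.missing_rate <= 0.02  # assumption: cutoff is inclusive
    )


def passes_in_all_populations(stats_by_population: dict) -> bool:
    """Assumption: a SNP is kept only if it passes in every population."""
    return all(passes_qc(s) for s in stats_by_population.values())


example = {
    "CEU": SnpStats("rs_example", "7", 0.35, 0.40, 0.01),
    "YRI": SnpStats("rs_example", "7", 0.15, 0.60, 0.00),
}
print(passes_in_all_populations(example))  # False: YRI allele frequency < 0.2
```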

Interaction analysis

The two-locus interaction eQTL analysis was performed using the PLINK --epistasis command. For every transcript that corresponded to an included probe, cis-SNPs were defined as variants located within the transcript or <1 Mb from the transcription start or end site. Each cis-SNP of a transcript was then tested for epistasis with all remaining SNPs, which were defined as trans-SNPs (ie, 86 613 SNPs minus the number of cis-SNPs per transcript). For the interaction eQTL mapping, the four different HapMap populations were used as categorical covariates. To determine the significance of our findings, we applied a transcript-wise Bonferroni correction: for each transcript, the number of cis–trans tests was calculated as the number of analyzed cis-variants multiplied by the number of included trans-SNPs. This resulted in transcript-wise Bonferroni-adjusted P-value thresholds between 5.77 × 10^−07 (1 cis-SNP and 86 612 trans-SNPs for DNAJA2, NETO2 and ORC6L) and 2.84 × 10^−09 (204 cis-SNPs and 86 409 trans-SNPs for CHD8 and SUPT16H). Under the null hypothesis of no enrichment for transcripts showing cis–trans interactions, 0.05 × 3107 ≈ 155 transcripts would be expected to have at least one significant cis–trans interaction following the transcript-wise Bonferroni correction. The applied correction procedure is also given in detail in Supplementary Table 1.
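The cis definition and the transcript-wise correction can be written out as a small Python sketch. The function names and the coordinates in the usage lines are illustrative assumptions; only the 1 Mb window, the total of 86 613 SNPs, the nominal level of 0.05 and the two example SNP counts are taken from the text.

```python
TOTAL_SNPS = 86_613


def is_cis(snp_chrom, snp_pos, tx_chrom, tx_start, tx_end, window=1_000_000):
    """True if the SNP lies within the transcript or <1 Mb from either end."""
    if snp_chrom != tx_chrom:
        return False
    lo, hi = min(tx_start, tx_end), max(tx_start, tx_end)
    return lo - window < snp_pos < hi + window


def transcript_threshold(n_cis, n_total=TOTAL_SNPS, alpha=0.05):
    """Bonferroni threshold for one transcript: alpha / (n_cis * n_trans)."""
    n_trans = n_total - n_cis
    return alpha / (n_cis * n_trans)


print(f"{transcript_threshold(1):.2e}")    # 5.77e-07
print(f"{transcript_threshold(204):.2e}")  # 2.84e-09
print(int(0.05 * 3107))                    # 155 transcripts expected under the null
```

Because the number of cis-SNPs differs between transcripts, each transcript carries its own threshold, which is why the thresholds span about two orders of magnitude.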

Results

Of all 3107 included probes, we identified 440 transcripts whose expression was regulated by a cis–trans interaction after transcript-wise Bonferroni adjustment (Supplementary Table 2). The significant two-locus eQTL P-values ranged between 4.69 × 10^−08 and 2.82 × 10^−12. The observed interactions showed a significant (P=2.86 × 10^−144) and almost threefold enrichment compared with the number of transcripts expected under the null hypothesis, under which 5% of all probes (N=155) would show an interaction by chance. Table 1 lists the top-16 interaction findings, which were all associated with P-values of <10^−10. Importantly, as an LD-pruning step was applied, all of the 440 cis–trans SNP combinations were independent and not the result of LD between cis- or trans-markers.
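The size of this enrichment can be checked with a simple calculation. The Python sketch below computes the fold enrichment and an exact one-sided binomial tail probability on the log10 scale. The text does not name the test behind the reported P-value, so the binomial model is an assumption and the resulting number need not match the reported one; the function names are illustrative.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p)."""
    return (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1 - p)
    )


def log10_upper_tail(k, n, p):
    """log10 of P(X >= k), summed in log space to avoid underflow."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    m = max(logs)
    return (m + log(sum(exp(x - m) for x in logs))) / log(10)


n_transcripts, n_hits, null_rate = 3107, 440, 0.05
fold = n_hits / (null_rate * n_transcripts)
print(f"fold enrichment: {fold:.2f}")
print(f"log10 P(X >= {n_hits}): {log10_upper_tail(n_hits, n_transcripts, null_rate):.1f}")
```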

Table 1 Column 1 lists the top-16 cis–trans interacting transcripts; column 2 shows the number of tested cis-SNPs for each transcript; column 3 shows the number of cis–trans tests; column 4 shows the Bonferroni-adjusted P-value threshold necessary for a ‘significant’ finding; column 5 shows the uncorrected P-value per transcript obtained in the two-locus interaction analysis; the next columns provide information about the cis- and trans-SNPs including their eQTL effects under a one-locus model

To elucidate the nature of the epistasis, an analysis was performed to determine whether SNPs that are involved in gene regulation via one-locus eQTL effects mainly contributed to the interactions. At present there is no consensus on whether SNPs with so-called ‘marginal effects’ are more likely to be involved in epistasis and should be prioritized for SNP–SNP interaction scans. We therefore tested whether the 440 cis- and trans-SNPs involved in epistasis also have regulatory effects on gene expression without their interacting markers, that is, in a one-locus fashion. This proved to be true for the cis-markers: a total of 40 of the 440 cis-SNPs (9.09%) also showed regulatory effects in the one-locus analysis at an uncorrected significance level of P≤0.05. This was significantly more than the expected number of SNPs with marginal effects (N=22, P=8.27 × 10^−05) (Supplementary Table 3). However, it is notable that the majority of cis-markers (>90%) were not involved in gene regulation at the one-locus level.

In contrast, only 16 of the 440 two-locus trans-SNPs (3.63%) were involved in gene regulation at the one-locus level. This did not differ significantly from the number of expected markers (N=22, P=0.187, Supplementary Table 3) and points to largely independent mechanisms involved in one- and two-locus trans-regulation.
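The comparison of observed and expected numbers of SNPs with marginal effects can be sketched as a one-sample proportion test. The Python example below uses a two-sided normal approximation with a null proportion of 0.05 (22 of 440). The text does not state which test was used, so this choice and the function name are assumptions, and the resulting P-values are only expected to be of the same order as the reported ones.

```python
from math import erfc, sqrt


def two_sided_proportion_test(observed, n, p0=0.05):
    """Normal-approximation test of an observed count against n * p0.

    Returns the z statistic and the two-sided P-value.
    """
    expected = n * p0
    z = (observed - expected) / sqrt(n * p0 * (1 - p0))
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


for label, observed in (("cis", 40), ("trans", 16)):
    z, p = two_sided_proportion_test(observed, 440)
    print(f"{label}: z = {z:.2f}, two-sided P = {p:.2e}")
```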

As the mechanisms involved in trans-regulation and trans-epistasis are complex and not well understood, we tried to characterize them in more detail. We analyzed whether the trans-epistasis is gene or pathway mediated rather than the result of other regulatory mechanisms, and tested whether more trans-loci than expected have genes in close vicinity to the marker. Of all 440 trans-markers, 198 SNPs (45.10%) were located close to at least one gene according to the program SNPper (http://snpper.chip.org/bio/snpper-enter), that is, the SNP is located within a distance of ≤10 kb of a corresponding gene (Supplementary Table 2). However, the number of observed genes involved in trans-epistasis was not significantly increased compared with the number of all potentially involved genes tagged by all included trans-SNPs using SNPper (N=35 731, 41.35%, P=0.112).

Previous one-locus eQTL studies have reported an enrichment of certain chromosomal regions involved in the regulation of gene expression. We adapted the approach of Morley et al6 and analyzed our data for evidence of so-called ‘master regulator’ SNP regions at the two-locus interaction level. Master regulator regions are chromosomal regions that contain more SNPs involved in epistasis than expected by chance. All 86 613 SNPs were used, and the entire autosomal genome was divided into 444 non-overlapping bins, each containing 200 neighboring SNPs. We estimated that a bin comprising more than four of the 440 trans-SNPs would be a master regulator region. However, after correcting by a factor of 444, which corresponds to the number of analyzed bins, more than six trans-SNPs per bin are necessary to define a significant master regulator region. Only for bins at the ends of chromosomes did we adapt our approach to account for the number of SNPs within these regions. For example, if 100 neighboring SNPs were located within the last bin of a chromosome, more than three trans-SNPs were necessary to fulfill the criterion of a significant master regulator region. Although we found that 8 of the 444 bins harbored four trans-SNPs, which is nominally significant (P=0.019), no bin fulfilled the criterion of a significant master regulator region after the correction procedure. In addition, we analyzed whether certain chromosomal ‘hotspot’ regions harbor more regulated transcripts than expected; our data provide no evidence for such superordinate mechanisms involved in epistasis. We used all 3107 transcripts, divided the autosomal genome into 321 bins, each containing 10 neighboring transcripts, and estimated that a bin with more than six of the 440 identified transcripts would be a significant hit. 
After correction for the number of analyzed bins (factor 321), no hotspot could be identified, although one bin harbored six transcripts and 12 further bins harbored four transcripts (uncorrected P=0.001 and P=0.041, respectively).
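The bin-based criteria can be illustrated with a binomial model. The Python sketch below assumes that every SNP (or transcript) in a bin is independently a ‘hit’ with probability 440/86 613 (or 440/3107). This model and the function names are assumptions, as the text does not state the test that was used; under this model, however, the uncorrected hotspot probabilities come out close to the reported values.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_significant_count(n, p, alpha):
    """Smallest count k whose upper-tail probability falls below alpha."""
    for k in range(n + 1):
        if binom_upper_tail(k, n, p) < alpha:
            return k
    return None


# master regulator bins: 200 SNPs per bin, 440 trans-SNPs among 86 613 SNPs
p_snp = 440 / 86_613
print(binom_upper_tail(4, 200, p_snp))                # uncorrected, four trans-SNPs
print(min_significant_count(200, p_snp, 0.05 / 444))  # corrected for 444 bins

# hotspot bins: 10 transcripts per bin, 440 hits among 3107 transcripts
p_tx = 440 / 3107
print(binom_upper_tail(4, 10, p_tx), binom_upper_tail(6, 10, p_tx))
print(min_significant_count(10, p_tx, 0.05 / 321))    # corrected for 321 bins
```

In both settings the corrected criterion corresponds to at least seven hits per bin, that is, more than six.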

On the functional level, we tested whether certain cellular processes are particularly regulated by epistatic effects. We used all 440 genes that were identified as being cis–trans regulated and performed an analysis for enriched cellular functions using Ingenuity Pathways Analysis (IPA, version 8.6, http://www.ingenuity.com). IPA is a web-based interface that provides computational algorithms to identify biological processes and networks on the basis of functional annotation and molecular interactions. The top biological category was ‘gene expression’, including 69 transcripts. However, the most enriched subcategory ‘transcription of chromosome components’ (P=0.046 after Benjamini–Hochberg correction) was defined by only 4 of all 440 included transcripts (CREBBP, EP300, SRC and TBP). Finally, an analysis was performed to determine whether any of the two-locus regulated genes are implicated in complex disorders. Complex disorders were considered because genome-wide association studies (GWAS) of a number of diseases have failed to identify any one-locus variants with a strong genetic effect size. Two-locus regulation may therefore have an impact on the respective phenotypes. Furthermore, the functional consequence of many top GWAS-SNPs is unknown, which suggests that expression differences may be disease-relevant mechanisms. In total, we identified 25 cis–trans regulated genes that have been implicated in complex disorders using the web tool GWAS Catalog (http://genome.gov/26525384). For example (Table 2), we identified a two-locus interaction between a trans-SNP 5.9 kb upstream of CCL4 (MIM 182284) and a cis-SNP of BLK (MIM 191305) influencing its expression. 
BLK is one of the strongest risk genes for rheumatoid arthritis and systemic lupus erythematosus, and CCL4 encodes a chemokine ligand involved in immune activation.22, 23, 24, 25, 26 However, the connection between BLK and CCL4 remains speculative, as it is unclear whether the close proximity of the trans-SNP to CCL4 reflects a gene- or pathway-mediated mechanism, or whether other interaction mechanisms that do not involve CCL4 exist. Unfortunately, we could not test the effect of the trans-SNP on the expression of CCL4 because no probe for CCL4 was included in our analysis. Another interesting finding concerns STAT2 (MIM 600556). Its expression was found to be cis–trans regulated, and the corresponding trans-SNP is located 31.1 kb upstream of IL23R (MIM 607562) (Table 2). Again, we could not test whether this SNP is involved in the expression of IL23R because no corresponding probe was available, but it is noteworthy that both genes have an important role in the innate immune system and have been implicated in the development of psoriasis in a recently published GWAS.27, 28, 29

Table 2 Column 1 lists the 25 cis–trans interacting transcripts listed in the GWAS Catalog; column 3 lists the observed two-locus P-values; the remaining columns provide information concerning the cis- and trans-SNPs

Discussion

Genes function through a complex mechanism that involves multiple genetic factors. These effects are missed if genetic factors are examined in isolation without taking potential interactions with other genetic factors into account. The aim of the present study was to elucidate the genetic architecture of gene expression by performing a systematic cis–trans interaction analysis. Out of 47 294 expression phenotypes, we used 3107 transcripts that survived a stringent quality control procedure and 86 613 LD-pruned SNP markers, which were in linkage equilibrium and had been genotyped in the 210 HapMap founder individuals. Using a conservative correction procedure, we found that the expression of about 15% of all included transcripts (N=440) is regulated by a two-locus interaction, which is far more than expected by chance (P=2.86 × 10^−144). The results of the present study confirm that epistasis has an important role in the genetic architecture of complex phenotypes and imply that this approach may be of relevance to other eQTL and GWAS data sets. Such studies could also benefit from samples that are ethnically more homogeneous. Although we have used the four different populations as categorical covariates, we cannot completely rule out that our results are to a certain degree inflated by the heterogeneity of the present sample.

The present findings also indicate that regulatory one-locus cis-markers are more likely to be involved in two-locus gene regulation than would be expected by chance alone (P=8.27 × 10^−05). This suggests that there is a correlation between the mechanisms that underlie one- and two-locus gene regulation. However, as the majority of cis-markers involved in epistasis showed no ‘marginal effects’, our findings imply that most epistasis effects would be missed if interaction studies were focused on cis-markers with marginal effects only.

Furthermore, the present results indicate that gene- or pathway-mediated trans-effects were not the major source of epistasis, as trans-SNPs were not more likely to be located in or in close proximity to an annotated gene or transcript (P=0.112). Therefore, other regulatory mechanisms, such as non-coding sequence-mediated effects (eg, RNA) and intra- or interchromosomal cross-talk, seem to be of equal importance in trans-epistatic regulation.

Our analyses as to whether particular chromosomal regions are involved in epistasis produced negative results (P>0.05 for master regulators and hotspots). This implies that cis–trans epistasis is not ‘topographically’ organized throughout the genome. In addition, the IPA analysis revealed that only one functional category (involving only four transcripts) was enriched for epistatic effects (P=0.046 for the subcategory ‘transcription of chromosome components’ within the high-level category ‘gene expression’). This suggests that two-locus interactions regulate multiple cellular processes rather than a few specific ones. Furthermore, 25 of all cis–trans-regulated genes have been found to be associated with complex diseases through GWAS. The trans-markers and trans-genes identified in the present study may therefore represent interesting candidates for epistatic tests in the respective GWAS data.

In conclusion, the present cis–trans interaction approach identified transcripts that are potentially influenced by two-locus epistasis, and revealed certain characteristics of the complex process of genome-transcriptome regulation. Furthermore, the approach may represent a solution for overcoming the problem of multiple testing in interaction scans, and it may thus be worthwhile to apply this approach to other eQTL data. A limitation of this approach, however, is that it is only able to detect cis–trans epistasis and cannot be used to detect other regulation mechanisms such as cis–cis, trans–trans or higher-order interactions.