Introduction

Genome-wide association studies (GWAS) of common complex diseases have expanded rapidly during the past decade, and multiple susceptibility loci have been identified. For colorectal cancer (CRC), more than 60 single nucleotide polymorphisms (SNPs) have reached genome-wide significance, but only a few of them have functional studies that identify the gene involved.1 Few disease-associated SNPs are protein-altering non-synonymous variants.2 Most disease-associated variants fall into non-coding regions. Some are located in gene deserts far from putative functional genes,3 and only 5% of the currently validated disease associations are restricted to coding variants.4 Regarding CRC, only 3 (5%) of the GWAS SNPs known so far map to coding regions (Supplementary Table 1). Some SNPs are located in gene introns, but that does not guarantee that the real functional gene is the one where the SNP is located, as it is known that some enhancers are located in introns of nearby genes.5

There is general agreement that gene transcript abundance may act as an intermediate phenotype useful to assess the effect of genetic variation on a clinical phenotype.6,7,8,9 From this observation arose the concept that disease-associated SNPs located in regulatory regions of close (cis) or distant (trans) genes act as quantitative trait loci through the modification of gene expression levels (expression quantitative trait loci (eQTL)). Regulatory sequences of a given gene (promoter, enhancer, splice site), when modified by genetic variation, may result in diverse RNA abundance and, ultimately, a diverse phenotype.

The emergence of microarray technologies, and more recently next-generation sequencing, has allowed the genome-wide exploration of both genetic variation and gene expression. Thus, an agnostic search for eQTL is feasible and may be useful to explain SNP-disease associations, to provide prior weights for the statistical analysis of new SNP discovery10,11,12 and to prioritise which SNPs should be followed up with additional functional studies.

Data on eQTL-based association studies on CRC are still scarce. Most eQTL analyses have used lymphoblastoid cell lines.13,14,15 Although germ-line genetic variation is assumed constant for all cells, gene expression is tissue specific. The analysis of eQTL may vary across tissues, either because some genes may not be expressed in a specific tissue, thus being undetectable, or because other (epigenetic) regulatory mechanisms of gene expression may interact with the effect of genetic variation. The sharing of eQTL among tissues is of interest.16 The Genotype Tissue Expression (GTEx) project17 has been designed to address this issue, and preliminary analyses showed a high degree of sharing,18 though colon tissue was not available at the time of that publication. Some studies have analysed tumour tissue from The Cancer Genome Atlas project,19 but gene expression is strongly altered in the tumour tissue compared to normal tissue.20 Other than GTEx data, to the best of our knowledge, there are no easy sources for eQTL analysis specific to colon tissue, though some researchers have published their analyses.21,22

In this article, we present an eQTL analysis specific for colon tissue, and a web tool that allows the user to query specific SNPs or genes for analysis, with options to combine data from healthy patients, adjacent normal mucosa from patients with colon cancer and tumour tissue. This source of data has been useful to identify genes associated with CRC susceptibility SNPs,23 and may also be of interest for studies of genetic determinants of inflammatory bowel disease (IBD).24

Materials and Methods

Subjects and samples

Colon tumour and paired adjacent normal mucosa tissue samples were selected from a series of cases with a new diagnosis of colon adenocarcinoma attending the University Hospital of Bellvitge in Barcelona between January 1996 and December 2000. Patients included were diagnosed with stage II, microsatellite stable colon cancer, were surgically treated and had not received adjuvant chemotherapy. Adjacent mucosa was obtained from the proximal surgical margins and was at least 10 cm distant from the tumour lesion. Healthy colon mucosa samples were obtained during colonoscopy between February and May 2010. These samples were donated by a series of unselected individuals who underwent a colonoscopy indicated by screening or suspicion of colonic pathology and in whom no colonic lesions were observed. Biopsies were obtained from the left and right colon. For this study, we randomly selected approximately half from each tumour location. All subjects provided written informed consent to participate in the study and the ethics committee of the hospital cleared the protocol with reference PR074/11. Additional information about the study can be found at https://www.colonomics.org/eqtl-browser. The eQTL analysis was focused on expression data assessed in normal mucosa. Though we initially selected 100 patients and 50 healthy controls, the final sample size after quality control of the data was N = 144: 97 adjacent normal mucosae from cancer patients and 47 from healthy donors. Gene expression in tumours (n = 97) was also analysed, and these data can be compared or combined with those of normal mucosa.

Gene expression and SNP analyses

DNA was extracted from colon mucosa specimens using a standard phenol–chloroform protocol. Total RNA was isolated from tissue samples using the miRCURY™ RNA Isolation Kit (Exiqon, Vedbæk, Denmark) according to the manufacturer’s protocol, quantified with a NanoDrop® ND-1000 Spectrophotometer (NanoDrop Technologies, Wilmington, DE) and stored at −80 °C. The quality of these RNA samples was assessed with the RNA 6000 Nano Assay (Agilent Technologies, Santa Clara, CA). RNA integrity numbers showed good quality (mean = 8.1 for tumours, 7.5 for adjacent normal and 8.2 for healthy normal). RNA purity was measured with the ratio of absorbance at 260 nm and 280 nm (mean = 1.96, SD = 0.04), with no differences among tissue types.

Expression data were obtained with the Affymetrix Human Genome U219 ArrayPlate platform (Affymetrix, Santa Clara, CA). Three 96-array plates were used with a block experimental design to avoid batch effects. Four arrays (two normal-tumour pairs) were excluded due to low quality. Therefore, a final data set of 246 arrays was used for subsequent analyses. Raw data were normalised using the Robust Multiarray Average algorithm implemented in the affy package of R/Bioconductor. Expression levels of a set of genes on this microarray have been validated with quantitative PCR and showed excellent correlation coefficients.25 Prior to the analysis of eQTL, expression probe sets were mapped to genes. For genes with more than one probe set in the array, a principal component analysis was used to capture the largest common variability by extracting the first component. The expression array provided data on 20,070 genes, but those with very low variability (SD < 0.1 among all samples) and those on chromosome Y or in the mitochondrial genome were excluded from the analysis of eQTL, leaving 15,298 genes.

Genotypes were obtained by hybridising genomic DNA extracted from colonic mucosa to the Affymetrix Genome-Wide Human SNP 6.0 array (Affymetrix), which includes nearly 1 million SNP markers. One cancer patient and three healthy subjects had to be excluded because the array quality was not good enough. Thus, the final sample size for eQTL analyses was 47 healthy colon mucosae and 97 paired tumour and adjacent normal tissues. Genotype calling was performed for samples of healthy mucosa and normal tissues with the Corrected Robust Linear Model with Maximum Likelihood Classification algorithm as implemented in the R/Bioconductor package crlmm. In addition, untyped genotypes were imputed using IMPUTE226 after haplotyping with SHAPEIT.27 The 1000 Genomes panel (March 2012 version) was used as reference (http://www.1000genomes.org). SNPs and indels with low imputation quality (info <0.2 or minor allele frequency [MAF] concordance <0.9) were excluded from the data set. Also, SNPs with MAF < 0.05 were excluded, and the eQTL analysis was based on 6.76 million SNPs. No filters for redundant SNPs related to linkage disequilibrium were applied.

The gene expression data set is available at the project website: https://www.colonomics.org/data and at Gene Expression Omnibus with GEO series accession number GSE44076. SNP data have been deposited at the European Genome-phenome Archive (EGA, http://www.ebi.ac.uk/ega/), which is hosted by the EBI, under accession number EGAS00001002453.

Statistical analysis

To reduce the number of tests performed, while maintaining high power to identify eQTL, only the additive genetic model was considered. Genotypes were coded as the number of variant alleles (0, 1, 2) and this variable was treated as quantitative. For imputed genotypes, the dosage derived from the posterior genotype probabilities was used to account for imputation uncertainty. Dosage was calculated as twice the posterior probability of the BB genotype plus that of AB. The additive model is known to capture most of the dominant and recessive effects.28
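The dosage definition above can be illustrated with a minimal sketch. The function name and the posterior probabilities are hypothetical and serve only to show the arithmetic; they are not taken from the study data or from the IMPUTE2 output format.

```python
from typing import List, Tuple


def dosage(p_ab: float, p_bb: float) -> float:
    """Expected number of B alleles: 2 * P(BB) + P(AB)."""
    return 2.0 * p_bb + p_ab


# Hypothetical posterior probabilities (P(AA), P(AB), P(BB)) for three subjects.
posteriors: List[Tuple[float, float, float]] = [
    (1.0, 0.0, 0.0),    # confidently AA
    (0.1, 0.8, 0.1),    # most likely AB, with some uncertainty
    (0.0, 0.25, 0.75),  # most likely BB
]

# One quantitative value per subject, on the same 0-2 scale as typed genotypes.
dosages = [dosage(p_ab, p_bb) for _, p_ab, p_bb in posteriors]
print(dosages)
```

A confidently called genotype yields an integer dosage, whereas an uncertain call yields an intermediate value, so the same additive model can be applied to typed and imputed variants.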

Analyses of eQTL were performed with the R package MatrixEQTL.29 SNPs within 1 Mb of the gene transcription start site were considered cis-eQTL. SNPs located further than 1 Mb away or in other chromosomes were considered trans-eQTL. The search was limited to associations with P-values < 1e−6. Linear models were fitted, adjusted for age, sex, tumour location and group (healthy/affected). To avoid spurious results derived from extreme expression values in some genes, a non-parametric analysis was performed, using a rank transformation of the expression values for each gene. This is equivalent to the Spearman correlation, but in a regression framework that allows adjusting for potential confounders.
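The rank-based regression described above can be sketched in a few lines. This is an illustration in plain Python rather than the MatrixEQTL implementation used in the study; the helper names, the single covariate and the eight data points are assumptions made for the example.

```python
from typing import List, Sequence


def rank_transform(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def ols(columns: List[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares coefficients (intercept first) via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[r][a] * X[r][b] for r in range(n)) for b in range(p)]
         + [sum(X[r][a] * y[r] for r in range(n))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [x - factor * z for x, z in zip(A[r], A[c])]
    return [A[a][p] / A[a][a] for a in range(p)]


# Hypothetical data for eight subjects (not from the study).
dosage = [0.0, 1.0, 2.0, 0.1, 1.9, 1.0, 0.0, 2.0]
age = [61.0, 70.0, 58.0, 66.0, 73.0, 64.0, 69.0, 55.0]
expression = [5.1, 6.0, 7.2, 5.0, 25.0, 6.1, 4.9, 7.0]  # one extreme value

# The extreme value becomes just the highest rank, limiting its leverage.
ranked = rank_transform(expression)
intercept, beta_snp, beta_age = ols([dosage, age], ranked)
print(f"adjusted SNP effect on expression rank: {beta_snp:.3f}")
```

Replacing expression values by their ranks before fitting the model is what makes the analysis robust to outliers, while the regression form keeps room for covariates such as age, sex, tumour location and group.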

To account for multiple comparisons, false discovery rates (FDR) were calculated for all findings,30 taking into account all tests performed and ignoring redundancy related to linkage disequilibrium (approximately 15,000 genes × 5000 SNPs within 1 Mb per gene). The FDR for the least significant association (P = 1e−6) was 0.001. Also, for a more stringent analysis, a permutation procedure was used. The distribution of the most significant associations for the cis-eQTL analysis across 500 randomly permuted analyses was used to calculate the adjusted 0.01 significance level. This yielded nominal P-value thresholds of 2e−10 for cis-eQTL and 4e−13 for trans-eQTL. These significance levels are very similar to those based on Bonferroni correction (1.3e−10 and 1.1e−13, respectively), indicating an overall low degree of correlation in the data.
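The relation between the Bonferroni and permutation thresholds can be made concrete with a short sketch. The first part reproduces the arithmetic for the approximate number of cis tests quoted above. The second part is a toy simulation under assumptions of our own (1000 independent null tests per permutation, uniformly distributed P-values) and does not reproduce the study's permutation procedure. The function names are hypothetical.

```python
import random
from typing import Sequence


def bonferroni_threshold(alpha: float, n_tests: float) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


def permutation_threshold(min_pvalues: Sequence[float], alpha: float) -> float:
    """Empirical alpha-quantile of the smallest P-value of each permutation."""
    ordered = sorted(min_pvalues)
    k = max(int(round(alpha * len(ordered))) - 1, 0)
    return ordered[k]


# Approximate number of cis tests quoted in the text: 15,000 genes x 5000 SNPs.
cis_bonferroni = bonferroni_threshold(0.01, 15_000 * 5_000)
print(f"Bonferroni threshold for cis tests: {cis_bonferroni:.2e}")

# Toy null simulation (assumption: 1000 independent tests per permutation).
rng = random.Random(0)
n_tests, n_perm = 1000, 500
min_p = [min(rng.random() for _ in range(n_tests)) for _ in range(n_perm)]
print(f"permutation threshold: {permutation_threshold(min_p, 0.01):.2e}")
print(f"Bonferroni threshold:  {bonferroni_threshold(0.01, n_tests):.2e}")
```

When tests are independent, the two thresholds are expected to be of the same order. With strongly correlated tests the permutation threshold would be noticeably less strict than Bonferroni, which is why their similarity is read as a sign of low overall correlation.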

Results

Table 1 shows the distribution of the number of genes and SNPs per chromosome selected for the descriptive analysis of eQTL.

Table 1 Distribution of SNPs, genes and eQTL across chromosomes

cis-eQTL

The distribution of SNPs near genes was fairly uniform. On average, 2255 SNPs with MAF > 0.05 were found within 500 Kb (SD = 978) and 4555 (SD = 1815) within 1 Mb. A total of 77,139 cis-eQTL associations were identified with P < 1e−6 (FDR < 0.001). Using the more stringent significance level based on permutations (P < 2e−10), 29,073 cis-eQTL were significant with adjusted P-value < 0.01. We provide all identified eQTL in a supplementary file, but will restrict further analyses to those significant according to the permuted P-value.

A total of 363 genes (2.4%) had at least one significant cis-eQTL within 1 Mb. We will refer to these as ‘eGenes’ (Supplementary Table 2). The median number of significant cis-eQTL per gene was 36 (range: 1–1,023). Approximately 27% of the significant cis-eQTL per gene were intragenic, 44% when only the most significant eQTL per gene were considered. The distribution of eQTL was symmetric upstream and downstream around the gene transcription start site (TSS), irrespective of the gene orientation (Fig. 1). The median distance of the most significant eQTL to the gene TSS was 14 Kb (median absolute deviation = 72 Kb). There was no association between the average gene expression and the significance of the eQTL or the proportion of variance explained. Regarding SNPs, 95% of the 29,067 significant eQTL were associated with one gene, 2% were associated with two genes and the other 3% were associated with 3–6 genes.

Fig. 1

The histogram shows the distribution of the distance between the significant cis-eQTL and the gene TSS in Kb. Most of the eQTL are within 200 Kb of the gene TSS

To further analyse the relevance of the identified eQTL database, we analysed the impact of using tumour samples in the analysis of eQTL. When tumours were combined with the normal samples, the number of significant cis-eQTL increased to 37,099 significant gene–SNP pairs, 22,759 (78%) common to the analysis of normal mucosa only, but there were 14,337 new eQTL. Since part of this increment could be due to the power gain related to a larger sample size, we also compared the number of significant eQTL identified in tumours only (n = 6063), which was 55% lower than the number identified in the adjacent normal mucosa of patients (n = 13,411). Only 4858 of the eQTL were identified both in tumours and adjacent normal mucosa. A similar number intersected with our list identified in the combination of normal tissue from healthy donors and adjacent to tumour (18%). Figure 2 shows a Venn diagram with these numbers.

Fig. 2

Venn diagram of cis-eQTL intersections according to the tissues used in the analysis

Validation

The significant eQTL identified were compared to those reported by the GTEx consortium for transverse colon tissue. GTEx (version 6) had reported a total of 592,069 eQTL in their analysis. Of those, 149,773 could be considered significant at P < 2e−10. We could find 12,076 (42%) of our eQTL among the GTEx significant results using the same significance level. An additional 6500 of our significant eQTL were in the GTEx file with a lower significance level, increasing the percentage of validated eQTL to 64%. The analysis of sigmoid colon samples was similar, and the combination slightly increased the number of validated eQTL. The concordance of genes with at least one eQTL was higher, with 263 genes in common out of 362 (73%). We also compared our eQTL with those published by Ongen et al. These authors report 1693 genes with the most significant SNP for each gene. Our equivalent list included 363 genes, and 210 of them (58%) were in the list by Ongen et al., but only 15 were a perfect match for both gene and SNP. However, 157 of Ongen’s eQTL were in our extended list containing all significant SNPs for each gene, not only the most significant one.
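The comparison above amounts to intersecting sets of gene–SNP pairs and, separately, sets of genes. The sketch below shows the idea with hypothetical identifiers, and then reproduces the quoted percentages under the assumption that the denominator is the 29,073 significant cis-eQTL reported earlier.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (gene symbol, SNP identifier)


def replication_rate(ours: Set[Pair], reference: Set[Pair]) -> float:
    """Fraction of our gene-SNP pairs that also appear in the reference set."""
    return len(ours & reference) / len(ours)


def gene_level(pairs: Set[Pair]) -> Set[str]:
    """Collapse gene-SNP pairs to the set of genes with at least one eQTL."""
    return {gene for gene, _ in pairs}


# Hypothetical identifiers, for illustration only.
ours = {("GENE1", "rs1"), ("GENE1", "rs2"), ("GENE2", "rs3"), ("GENE3", "rs4")}
reference = {("GENE1", "rs1"), ("GENE2", "rs9"), ("GENE4", "rs4")}

pair_rate = replication_rate(ours, reference)
gene_rate = len(gene_level(ours) & gene_level(reference)) / len(gene_level(ours))
print(f"pair-level: {pair_rate:.0%}, gene-level: {gene_rate:.0%}")

# Arithmetic behind the percentages quoted in the text
# (assumed denominator: 29,073 significant cis-eQTL).
strict = 12_076 / 29_073              # same significance level in both studies
relaxed = (12_076 + 6_500) / 29_073   # plus pairs found at a lower significance
print(f"strict: {strict:.0%}, relaxed: {relaxed:.0%}")
```

Gene-level concordance is expected to exceed pair-level concordance, because two studies can agree that a gene is regulated in cis while selecting different SNPs in linkage disequilibrium.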

Mapping to other functional chromatin elements

We explored whether eQTL mapped to regions of the genome with marks of functional activity. We used the Haploreg database31 to search eQTL and could retrieve chromatin states from 24,697 (85%) of them. To interpret the information, we also searched 100,000 random SNPs within 1 Mb of TSS and retrieved data from 90,558 of them. In general, eQTL were more often located in chromatin marks (89% vs 84%), DNAse (30% vs 25%), promoter histone marks (17% vs 11%) and enhancer histone marks (48% vs 44%). We also observed that eQTL were more often GWAS SNPs (0.5%) than expected (0.2%).

trans-eQTL

Long-distance effects were frequently identified in our analysis. These have not been implemented in the eQTL-browser, but the files can be downloaded from the site. As each SNP is tested against all genes, a more stringent P-value should be used to consider an association significant. We report here significant associations after permutation correction (P < 4e−13), but the supplementary file includes all the associations with nominal P < 1e−6 (FDR < 0.001).

There were 209,144 trans-eQTL identified at nominal P < 1e−6, but only 10,665 were statistically significant when the permutation-derived significance level (P < 4e−13) was applied. Interestingly, they were distributed predominantly in chromosomes 6 and 17, with a lower frequency in chromosomes 2, 4 and 12 (Table 1). These trans-eQTL were essentially among SNPs in the same chromosome as the gene, but at a distance larger than 1 Mb, with three exceptions. The first was a block of 66 SNPs in chromosome 12 that are eQTL for the pseudogene RPS26P11 in chromosome X. These SNPs are near RPS26, the active ribosomal protein S26 gene, located in chromosome 12, but less expressed than the pseudogene in the colon. The second was a block of 51 SNPs in chromosome 22 that are eQTL for GGTLC1 (gamma-glutamyltransferase light chain 1) in chromosome 20. The same SNPs are significant cis-eQTL for GGT1, GGT2, GGT3P and GGTLC2, genes of the gamma-glutamyltransferase family located in chromosome 22. Finally, there was a block of SNPs mapping to HLA-A in chromosome 6 that were eQTL for troponin T3 (TNNT3), a gene located in chromosome 11 and related to muscle contraction.

eQTL browser

A web tool was designed to explore gene expression in colon mucosa and analyse eQTL. The R package shiny (https://shiny.rstudio.com) was used to develop the application, which can be accessed at https://www.colonomics.org/eqtl-browser. Some screenshots are available as Supplementary Figs. 1 and 2. The eQTL browser allows searching either for one gene by its gene symbol, to explore nearby SNPs as candidate eQTL, or for one specific SNP, by rsID or chromosome/position, to explore whether its genotypes are associated with the expression of nearby genes. In both types of searches, the output includes location plots and tables with the statistical analyses. By default, the application selects samples from both healthy mucosa and adjacent normal tissue, but the user can also exclude some of these or include tumours. Also, samples can be selected according to sex and tumour location (left or right colon). The initial search includes SNPs within 100 Kb upstream and downstream of the selected gene, but the window can be modified up to ±2 Mb. SNPs can be pruned by allele frequency (MAF > 0.01 by default).

If tumours are also included in the analysis, the pairing is ignored. Thus, the P-values for analyses that include both tumours and adjacent normal tissue should be interpreted with caution. By default, partial Pearson correlation coefficients are calculated, adjusted for age, sex, tumour location and group (healthy/affected). Non-parametric partial Spearman correlation can also be used when expression values for some genes show a grossly asymmetric distribution (expression plots are provided).

The initial analysis for one gene includes a plot of r², as a measure of the proportion of the gene expression variance explained by each SNP. The SNP with the highest r² is highlighted in the plot. The user can click one SNP in the plot to show a new plot with the association of gene expression and genotypes (or dosage for imputed SNPs).
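As a simplified illustration of this per-SNP summary, the sketch below computes the squared Pearson correlation between dosage and expression for a few SNPs and picks the one with the highest value. It is unadjusted, whereas the web tool reports partial correlations adjusted for covariates. The SNP names and values are hypothetical.

```python
from typing import Dict, List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation: proportion of variance in y explained by x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant variable explains no variance
    return sxy * sxy / (sxx * syy)


# Hypothetical expression values and SNP dosages for six subjects.
expression: List[float] = [5.0, 5.5, 6.1, 6.4, 7.2, 7.0]
dosages: Dict[str, List[float]] = {
    "rsA": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
    "rsB": [0.0, 1.0, 0.0, 2.0, 1.0, 1.0],
    "rsC": [2.0, 1.0, 1.0, 1.0, 0.0, 1.0],
}

r2_by_snp = {snp: r_squared(d, expression) for snp, d in dosages.items()}
top_snp = max(r2_by_snp, key=r2_by_snp.get)  # SNP to highlight in the plot
for snp, value in r2_by_snp.items():
    marker = " <- highest" if snp == top_snp else ""
    print(f"{snp}: r2 = {value:.3f}{marker}")
```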

To reduce false positive results, a threshold line is shown at significance level of 0.01 by default. Also, to account for multiple comparisons, a line with Bonferroni correction threshold is shown, considering the number of SNPs analysed in the selected region. Bonferroni correction may be too conservative since it ignores linkage disequilibrium (LD) among SNPs. Additional threshold lines can be added by the user.

When one SNP is selected, the plot shows r² for the expression of nearby genes (within 500 Kb by default). Options like the ones explained above can be used to select gene ranges, tissues, statistical test or covariates for adjustment.

Analysis of trans-eQTL, searching for genes associated with SNPs outside the ±2 Mb region, is not implemented in the web tool, but has been performed and the results can be downloaded for additional analysis (Supplementary file).

Discussion

Our web tool for eQTL analysis is useful to inform on the potential functional roles of SNPs identified in GWAS of CRC1 or IBD.32 This data source can complement others available that are specific for colon tissue, like the GTEx portal,17 and contribute to the study of the genetic architecture of colorectal diseases.

Our whole-genome analysis of eQTL has revealed that genetic variability may explain a substantial fraction of gene expression in colon tissue. Nearly 6% (n = 869) of the expressed genes had at least one cis-eQTL within 1 Mb. This number of eGenes is similar to that reported for other tissues in the pilot GTEx project,18 for a similar sample size, or that reported by Ongen et al.22

In an attempt to avoid false positive findings, we have used a non-parametric analysis method, and have restricted the analysis to SNPs with MAF > 5%. Regarding protection against multiple testing, we have used a significance level of 1e−6 to search for eQTL. The reported results at this level had a theoretical FDR of 0.001. We have also performed a permutation test to define the significance threshold for a 1% family-wise false positive rate, and found that we should consider significant only findings with P < 2e−10 for cis-eQTL and P < 4e−13 for trans-eQTL. These strict significance levels have reduced the number of significant eGenes to 363. As this might be too conservative, for the analysis of specific genes or candidate SNPs, the web tool only reports nominal significance levels for each association and provides several threshold options (fixed significance levels or Bonferroni correction according to the number of SNPs/genes considered).

We have compared our eQTL findings with those of GTEx and Ongen et al.,22 and found reasonably good agreement, taking into account that the expression data had been generated on diverse platforms and that the populations analysed probably had different environmental exposures that may affect gene expression. If we relax the significance level, more than 64% of our eQTL were reported in GTEx and up to 78% of the eGenes were concordant. The agreement with Ongen’s results was lower, but this is probably related to the fact that these authors only reported the most significant eQTL for each gene.

Our expression data are based on 3′ microarrays, which do not allow estimation of allele-specific expression at the individual level or of splicing variants. This is a limitation relative to studies based on RNA-seq for gene expression estimation, which may also better assess low-abundance transcripts and have a larger dynamic range.33 Our microarray data, however, have shown good reproducibility with qPCR in the validation studies performed.25

Only a few SNPs identified in GWAS are strong eQTL for candidate genes.23 Many eQTL, though significant, have low correlation values, which indicates that the proportion of gene expression variance explained by the SNP is small. This may not rule out a causal effect, as lifetime exposure to small alterations in gene expression may be sufficient to increase cancer susceptibility. In fact, the associations observed for most common SNPs with cancer are also of small magnitude.

As the specific SNPs identified in GWAS are dependent on the genotyping platform used, SNP imputation is a useful option to increase the likelihood of narrowing the window where the functional SNP lies. We have also used SNP imputation in our web tool to increase the likelihood of finding stronger eQTL. In fact, we have observed that the genetic polymorphism showing the highest association with gene expression is often an imputed indel in the region. Indels probably have a larger effect in disrupting the genome sequence required for enhancers or promoters. However, SNP imputation may introduce errors in the data and lead to false results. Though recent imputation panels are improving quality for rare variants, we have filtered out imputed SNPs with low imputation quality, which generally correspond to low allele frequency. This filter was also chosen because of the limited sample size of our study (144 independent subjects).

The recommended use of our web tool, when the search starts with a candidate SNP identified in a GWAS, is to first search for nearby genes whose expression may be related to the SNP. Then, if one or more genes are identified, specific searches by gene may reveal other SNPs in LD with the candidate SNP that are also eQTL but show stronger association. These SNPs are candidate causal variants and may then be searched in the UCSC genome browser34 to identify whether they lie in regions of open chromatin or other genomic marks suggesting a functional role. In fact, we have compared chromatin marks for our eQTL with those of random SNPs within 1 Mb of the genes’ TSS and found that eQTL are significantly enriched in DNAse, promoter and enhancer chromatin marks, which is in agreement with previous reports that eQTL more often fall within regulatory elements.22,35 Also, we observed that eQTL more often correspond to GWAS SNPs of diverse traits than random SNPs near genes, as reported by the Haploreg database.31

Our study only included colon tissue, but the eQTL analysis should also be useful for rectal cancer. Previous studies have shown that colon and rectal tissue have very similar expression profiles,19,36 although these cancer sites may have important differences regarding clinical outcomes.

Our study combines expression data from normal colon tissue obtained from volunteers at colonoscopy, with adjacent macroscopically normal colon tissue obtained from patients. Other studies on eQTL in CRC have analysed only tumour tissue or paired tumour and adjacent normal, without healthy tissue as reference.21,22 The inclusion of normal tissue adjacent to tumour may not be completely optimal, as we know that gene expression is altered in adjacent macroscopically normal tissue.37

The inclusion of tumour tissue in the analysis is a matter of debate. Tumour gene expression is very different from normal,20 with genes changing towards both over- and under-expression. Although tumours may reveal some new eQTL that are only evident when the gene is overexpressed, diverse genomic alterations, like copy number aberrations or epigenetic changes, occur simultaneously in tumours and may generate false positive and false negative results. In our analysis, the list of eQTL identified in tumours had low overlap with that of normal tissue. Similar results were obtained by Ongen et al., who observed that 36% of the detected eQTL were tumour specific, and more prone to carry somatic mutations and epigenetic modifications, like methylation changes, compared to matched samples of normal colon mucosa.22 Our web tool allows the user to select which tissues are incorporated into the analysis among healthy colon, adjacent normal colon and tumour. The user should be aware that, if tumour and adjacent tissue are combined, samples that belong to the same subject may show correlated expression, which is not considered in the statistical analysis.

Even though the potential for a real functional role under a trans-eQTL is low, since some long-distance mechanism must mediate the association, it is interesting to note that a large number of trans-eQTL effects were identified at P < 1e−6, but very few remained significant when multiple comparisons were properly accounted for. Most of them were in the same chromosome, probably related to long-distance linkage disequilibrium. The limited sample size of this study does not provide enough power for a proper trans-eQTL analysis, and a deeper analysis would require validation in other datasets and functional studies, like knocking down the candidate trans-eQTL, to understand the mediation pathway that leads to the association.14

In conclusion, eQTL analysis is an approach to study functional SNPs identified in GWAS. The web tool that we provide at https://www.colonomics.org/eQTL-browser allows easy analysis of expression and eQTL specific for colon tissue and will help researchers in the area to identify which SNPs deserve further functional research.