Main

Colorectal cancer (CRC) is the third most commonly diagnosed cancer in the world and the second leading cause of cancer deaths in both men and women. Colorectal cancer’s incidence trend is still increasing in most countries (Ferlay et al, 2013). At a molecular level, CRC is a complex disease involving different alterations (An et al, 2015). Large chromosomal aberrations have been described in colon tumours, with recurrent gains in chromosome arms 7p, 8q, 13q, and 20, and losses in 8p, 17p, and 18 (Meijer et al, 1998; Ashktorab et al, 2010; Brosens et al, 2011; Goossens-Beumer et al, 2015). Tsafrir et al (2006) showed that tumour copy number aberrations (CNAs) may lead to changes in gene expression relevant in colorectal carcinogenesis. In particular, genes in amplified chromosome regions (7p, 8q, 13q, and 20q) usually were overexpressed and genes in regions with chromosome losses (1p, 4, 5q, 8p, 14q, 15q, and 18) were under-expressed. These aberrations can lead to the silencing or amplification of tumour suppressor genes, oncogenes, or non-coding RNAs that modify the expression of genes. Some examples of the relevance of CNA in CRC are losses of chromosome 17p, which contains tumour suppressor genes TP53 and MAP2K4 (Han et al, 2013), and gains in 7q31 associated with WNT2 overexpression, which alters Wnt signalling activation (Wang et al, 2016). Gains in 20q have been studied in more detail, because they are associated with poor prognosis in CRC (Hidaka et al, 2000; Wang et al, 2016). This amplification is correlated with the overexpression of TPX2 and AURKA genes (Sillars-Hardebol et al, 2012; Wang et al, 2016), both involved in processes that promote colorectal adenoma to carcinoma progression, cell viability, and the anchorage-independent growth and invasion processes. 
This region also harbours C20orf24 (20q11.23), ADRM1 (20q13.33), TCFL5 (20q13.33), PLCG1 (20q12), and TH1L (20q13.32), genes that have been highlighted for their importance in chromosomal instability and adenoma to carcinoma progression (Loo et al, 2013; Ali Hassan et al, 2014; Sokolova et al, 2016). However, the relationship between CNA and gene expression is complex and still not completely defined. Furthermore, difficulties in the methodology to define CNAs from SNP arrays may explain some of the heterogeneity in the results reported so far.

Colorectal tumours have classically been classified as microsatellite instable (MSI), derived from deficient DNA mismatch repair machinery, which leads to hyper-mutated tumours, or microsatellite stable (MSS), also referred to as chromosomal instable tumours (CINs). Microsatellite stable tumours often show CNA (Trautmann et al, 2006; Brosens et al, 2011; The Cancer Genome Atlas Network, 2012; Xie et al, 2012) and follow the classical adenoma-to-carcinoma progression model (Brosens et al, 2011). Recently, consensus molecular subtypes (CMSs) of CRC have been defined by means of non-supervised classification techniques using gene expression data (Guinney et al, 2015). This classification establishes four major subtypes (CMS1–4) with specific molecular characteristics. CMS1 (14% of CRC) comprises tumours associated with an MSI phenotype and with activation of immune pathways. This subtype usually has the best prognosis. CMS2 (41% of CRC) is characterised by high CIN and strong WNT/MYC pathway activation. CMS3 tumours (8% of CRC) show low CIN, but are generally KRAS mutant and have activated pathways related to energy metabolism. Finally, CMS4 tumours (20% of CRC) show upregulation of TGF-β signalling and have been associated with the worst survival and poor response to chemotherapy. Some controversy exists around whether tumours of the CMS4 subtype exhibit a mesenchymal phenotype or are enriched in the stromal component, as genes upregulated in this subtype are mainly expressed by stromal cells rather than by epithelial cells (Isella et al, 2015). Indeed, this is an important issue to consider in copy number analysis, as the diploid stromal cells mixed within the tumour bulk could mask real CNA changes in cancer epithelial cells.

In this study we have performed a detailed characterisation of CNA in stage II, MSS colon tumours, taking into account the quantity of diploid stromal cells, which was estimated for each tumour sample. Moreover, we have explored the relation of these aberrations with gene expression changes and characteristics of the tumours such as molecular subtyping and prognosis, aiming to decipher the complex biology underlying colon cancer.

Methods

Patients and samples

Tumour tissues and their paired adjacent normal mucosa from 100 stage II, MSS colon cancer patients have been molecularly profiled to obtain data on copy number, gene expression, methylation, and somatic mutations (with exome sequencing in a subset of 42 samples; Colonomics project: www.colonomics.org; NCBI BioProject PRJNA188510). All patients were treated with radical surgery, did not receive adjuvant therapy, and have been followed up for a minimum of 3 years (Supplementary Table 1). Adjacent normal tissue was dissected at pathology from the proximal tumour resection margin, at a minimum distance of 10 cm from the tumour lesion. All patients were recruited at the Bellvitge University Hospital (Spain) between 1998 and 2002 and provided written informed consent; the hospital Ethics Committee approved the protocol with reference PR074/11.

Copy number and gene expression data on 222 colon tumours and 22 normal adjacent tissues from The Cancer Genome Atlas (TCGA) repository (The Cancer Genome Atlas Network, 2012) were downloaded and used as a validation data set. These tumours were selected because gene expression was available on the Agilent (San Diego, CA, USA) array platform, equivalent to our setting, which was convenient to estimate the stromal content. To use the maximum sample size, both colon and rectal samples, and diverse stages at diagnosis were included (Supplementary Table 1).

Copy number analysis

Copy number aberrations were inferred from the analysis of Affymetrix (Santa Clara, CA, USA) Genome-Wide Human SNP Array 6.0 genotyping arrays (Carter, 2007; Eldai et al, 2013). This array includes probes for the detection of over 906 600 SNPs and an additional 946 000 non-polymorphic oligonucleotides for the assessment of copy number variation. The average inter-marker distance was <700 bp. Affymetrix Power Tools (Version 1.16.1) software was used (Eckel-Passow et al, 2011), with default parameters, to obtain a quantitative locus-level copy number estimate (CNE) for each tumour sample, as $\mathrm{CNE}_{ij} = \log_2(x_{ij}/r_j)$, where $x_{ij}$ was the normalised intensity at probe j for sample i and $r_j$ was a reference intensity at probe j, typically representing the mean diploid signal derived from the average pool of normal mucosa samples.
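As a minimal illustration of a locus-level estimate of this kind, the sketch below computes a log2 ratio of the tumour intensity to a reference taken as the mean of a pool of normal samples. The log2-ratio form, the function name, and the toy intensities are assumptions made for illustration; this is not the Affymetrix Power Tools implementation.

```python
import math

def copy_number_estimates(tumour, normals):
    """Return log2(tumour / mean normal intensity) for each probe.

    tumour:  list of normalised intensities, one per probe (hypothetical input).
    normals: list of normal samples, each a list of per-probe intensities.
    """
    n_probes = len(tumour)
    # Reference signal: mean intensity of the normal pool at each probe.
    reference = [
        sum(sample[j] for sample in normals) / len(normals)
        for j in range(n_probes)
    ]
    return [math.log2(tumour[j] / reference[j]) for j in range(n_probes)]

# Toy data: three probes, two normal samples.
tumour = [200.0, 100.0, 50.0]
normals = [[90.0, 110.0, 100.0], [110.0, 90.0, 100.0]]
cne = copy_number_estimates(tumour, normals)
```

Under this convention a value of 0 corresponds to the diploid reference signal, positive values suggest gains, and negative values suggest losses.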

Segmentation

A segmentation algorithm was applied to split the set of ordered locus-specific CNE into regions of adjacent elements that had similar CNE. Each region was assigned a single value that represented the average CNE of the segment. Segmentation was performed for each sample in three steps: (1) normalisation: a smoothing spline was fitted to the raw data and used to normalise the distribution of the sample CNE; (2) raw partition: the Vega R package (Morganella et al, 2010) was used to locate change points in CNE patterns that would split each chromosome into discrete segments; and (3) consolidation: a t-test was used to compare the CNE values between consecutive regions and those with similar CNE values (P-value<0.0001) were merged.

A high content of stroma, which is assumed to be diploid, could bias the CNA measure in cancer cells through a masking effect. For this reason, the threshold value to identify a CNA from the CNE was defined as a function of the estimated proportion of stroma in the tumour. The stromal proportion in each sample was calculated with the ESTIMATE R package from gene expression data (Yoshihara et al, 2013). A hierarchical cluster analysis was used to group the tumour samples into four clusters reflecting their different levels of stromal content and a different cut-off was assigned to each cluster: ±0.5 for low stromal, ±0.4 for medium-low stromal, ±0.3 for medium-high stromal and ±0.2 for high stromal content (Supplementary Figure 1). Copy number estimates that exceeded these cut-offs were considered aberrations (gains or losses). The proportion of altered genome was estimated for each tumour by summing the length of regions with CNA.
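The stroma-dependent calling rule described above can be sketched as follows. The cut-off values are those given in the text; the function names, the segment representation as (start, end, mean CNE) tuples, and the toy genome length are assumptions for illustration only.

```python
# Cut-offs per stromal cluster, as listed in the text.
CUTOFFS = {"low": 0.5, "medium-low": 0.4, "medium-high": 0.3, "high": 0.2}

def call_segment(cne, stroma_cluster):
    """Classify a segment as gain, loss, or diploid given its mean CNE."""
    cutoff = CUTOFFS[stroma_cluster]
    if cne > cutoff:
        return "gain"
    if cne < -cutoff:
        return "loss"
    return "diploid"

def altered_fraction(segments, stroma_cluster, genome_length):
    """Fraction of the genome covered by segments called as gain or loss.

    segments: list of (start, end, mean_cne) tuples, end exclusive.
    """
    altered = sum(
        end - start
        for start, end, cne in segments
        if call_segment(cne, stroma_cluster) != "diploid"
    )
    return altered / genome_length

# Toy tumour: the same segments give different calls depending on stroma.
segments = [(0, 1000, 0.05), (1000, 1500, 0.35), (1500, 2000, -0.25)]
frac_low = altered_fraction(segments, "low", 2000)
frac_high = altered_fraction(segments, "high", 2000)
```

In this toy example the same segments yield no aberrations under the low-stroma cut-off and half of the genome altered under the high-stroma cut-off, which is the masking effect the varying threshold is meant to compensate for.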

Data availability

Segmented data for each sample are freely available to download at the project website: https://www.colonomics.org/data and the raw data have been deposited at the European Genome-phenome Archive (http://www.ebi.ac.uk/ega/), which is hosted by the EBI, under accession number EGAS00001002453.

Recurrent aberrations in chromosomal segments

Segments were then aligned across all tumours to identify recurrent aberrations. The first level of analysis was the calculation of minimal recurrent regions (MRRs) based on the segmentation of each sample. MRRs were small regions with CNA in at least 5% of individuals. These were the smallest units of analysis used in this study. Focal regions with CNA were also calculated. Focal regions were defined as a set of consecutive MRRs with the same CNA sign. It is noteworthy that two consecutive regions with similar CNA may exist but be defined by different tumours contributing to each minimal region. This may dilute associations analysed at this level of aggregation.
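One possible way to derive minimal recurrent regions is sketched below: all segment breakpoints on a chromosome are pooled, the intervals between consecutive breakpoints are taken as candidate regions, and those altered in a sufficient fraction of samples are kept. This construction, the function name, and the toy data are assumptions for illustration; the text does not specify the exact procedure used.

```python
def minimal_recurrent_regions(calls, n_samples, min_fraction=0.05):
    """Return (start, end, n_carriers) for recurrent minimal intervals.

    calls: dict mapping sample id to a list of (start, end) altered segments
           on one chromosome, all with the same sign (gain or loss),
           end exclusive.
    """
    # Pool every breakpoint observed in any sample.
    breakpoints = sorted(
        {pos for segs in calls.values() for seg in segs for pos in seg}
    )
    regions = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        # Count samples with a segment fully covering this interval.
        carriers = sum(
            any(start <= left and right <= end for start, end in segs)
            for segs in calls.values()
        )
        if carriers / n_samples >= min_fraction:
            regions.append((left, right, carriers))
    return regions

# Toy example with four tumours and a 50% recurrence threshold.
calls = {
    "T1": [(100, 400)],
    "T2": [(200, 500)],
    "T3": [(300, 350)],
    "T4": [],
}
mrrs = minimal_recurrent_regions(calls, n_samples=4, min_fraction=0.5)
```

With the 5% threshold and 100 tumours described in the text, an interval would need to be altered in at least five samples to be retained.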

Focal regions were analysed with the GAIA package (Morganella et al, 2011; Yuan et al, 2012), which allowed using different cut-offs according to the proportion of stroma. GAIA uses a statistical framework based on a conservative permutation test that estimates the null probability distribution of CNA based on the observed data. A stringent false discovery rate (FDR<1e−5) was used to identify focal regions.

Finally, the third level of analysis considered broad events. A broad event was defined as a gain or loss of >50% of a chromosome arm. A permutation test was performed to detect these changes. In this test, each sample was randomly assigned a CNA status by chromosome and arm to define the null distribution. From that, the significance of observing a loss of more than 50% of a chromosome arm was calculated.
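The broad-event definition (a gain or loss covering more than 50% of a chromosome arm) can be illustrated with a short sketch. The permutation test is not reproduced here; the function name, the segment representation, and the arm coordinates are hypothetical.

```python
def broad_event(segments, arm_start, arm_end, threshold=0.5):
    """Return "gain" or "loss" if that state covers more than `threshold`
    of the chromosome arm, otherwise None.

    segments: non-overlapping (start, end, state) tuples for one sample,
              with state "gain" or "loss" and end exclusive.
    """
    arm_length = arm_end - arm_start
    covered = {"gain": 0, "loss": 0}
    for start, end, state in segments:
        # Only the part of the segment lying inside the arm counts.
        overlap = min(end, arm_end) - max(start, arm_start)
        if overlap > 0:
            covered[state] += overlap
    for state, length in covered.items():
        if length / arm_length > threshold:
            return state
    return None

# Toy arm spanning positions 0 to 1000: 60% of it is gained.
sample_segments = [(0, 400, "gain"), (500, 700, "gain"), (800, 900, "loss")]
status = broad_event(sample_segments, 0, 1000)
```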

Molecular characterisation of the tumour samples

The CMSclassifier R package was used to classify our samples into the four CRC CMS, using a random forest approach (Guinney et al, 2015). Tumour CIMP classification was derived from the methylation status in CpG islands of genes MLH1, RUNX3, CACNA1G, IGF2, NEUROG1, SOCS1, CRABP1 and CDKN2A. The CIMP high cut-off was set at six or more out of eight methylated promoters, CIMP low was defined as one to five out of eight methylated markers, and No CIMP as zero out of eight methylated markers (Ogino et al, 2007). The frequency of somatic mutations located in coding regions was assessed for a subset of 42 samples with whole-exome sequencing results (Sanz-Pamplona et al, 2015). Briefly, genomic DNA from the set of 42 paired tumour and adjacent tissue samples was sequenced at the National Center of Genomic Analysis (Barcelona, Spain; CNAG) using the Illumina (San Diego, CA, USA) HiSeq-2000 platform. Exome capture was performed with the commercial kit Sure Select XT Human All Exon 50 MB (Agilent). Tumour exomes were sequenced at 60 × coverage and exomes from adjacent tissues were sequenced at 40 ×. Bowtie 2.0 software was used to align sequences to the human reference genome. Variant calling was executed with GATK software and low-quality variants (mapping quality below 30, read depth below 10, or frequency <10%) were discarded. Germline variants were also removed, that is, variants that were present in the paired adjacent normal sequence for each tumour and variants reported in the 1000G project. Finally, variants were annotated using the SeattleSeq Variant Annotation web tool. Mutation data are freely available to download at the project website: https://www.colonomics.org/data.
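The CIMP thresholds given in this section translate directly into a small rule. In the sketch below the marker list and cut-offs come from the text, whereas the function name and the input format (a per-marker methylated or unmethylated call, assumed to be available already) are hypothetical.

```python
CIMP_MARKERS = (
    "MLH1", "RUNX3", "CACNA1G", "IGF2",
    "NEUROG1", "SOCS1", "CRABP1", "CDKN2A",
)

def cimp_status(methylated):
    """Classify a tumour from per-marker methylation calls.

    methylated: dict mapping each of the eight markers to True or False.
    """
    n_methylated = sum(bool(methylated[marker]) for marker in CIMP_MARKERS)
    if n_methylated >= 6:
        return "CIMP high"
    if n_methylated >= 1:
        return "CIMP low"
    return "No CIMP"

# Toy tumour with three of eight promoters methylated.
example = {marker: False for marker in CIMP_MARKERS}
example.update({"MLH1": True, "IGF2": True, "SOCS1": True})
status = cimp_status(example)
```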

Association of CNA with clinical and molecular features and prognosis

Non-parametric tests were used to assess the association between clinical or molecular features and the proportion of altered genome or MRRs. Separate analyses for gains and losses were also performed.

The Kaplan–Meier method was used to estimate disease-free survival curves for each CNA state in a specific region. A total of 21 progression events had been observed in the sample with a minimum follow-up of 3 years (median 5 years). Multivariate proportional hazards models were used to assess CNA gains and losses as independent prognostic predictors, adjusted for age, sex, tumour location, and the proportion of stroma. Only MRRs were analysed. False discovery rate was used to control for multiple testing in all analyses when MRRs were explored.
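For readers less familiar with the Kaplan–Meier estimator, a minimal sketch follows. It computes the product-limit survival estimate at each observed event time; the function name and the toy follow-up data are illustrative assumptions and do not correspond to the study data.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with an event.

    times:  follow-up time for each patient.
    events: 1 if progression was observed at that time, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = sum(1 for time, event in data if time == t and event == 1)
        n_leaving = sum(1 for time, _ in data if time == t)
        if n_events:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Patients with events or censoring at t leave the risk set.
        at_risk -= n_leaving
        i += n_leaving
    return curve

# Toy data: five patients, follow-up in years.
curve = kaplan_meier([2, 3, 3, 5, 6], [1, 0, 1, 1, 0])
```

Curves estimated in this way for each CNA state (loss, diploid, gain) in a region can then be compared, as done in the analyses described above.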

Association of CNA with gene expression changes

Gene expression data, assessed with the Affymetrix Human Genome U219 expression array, have been previously analysed (Sanz-Pamplona et al, 2014). A unique expression value for each gene was estimated from multiple probes using the first principal component to capture maximal variability among probes. The analyses were focused on gene expression changes between tumour and paired adjacent normal tissue. Thus, gene expression differences were analysed in relation to CNA status (loss, diploid, and gain) with linear models, adjusted for age, sex, tumour location, and stromal content. Furthermore, partial Pearson’s correlation was calculated to assess adjusted correlations between the quantitative CNE at each region and gene expression changes. These analyses were restricted to 14 654 genes (out of 18 902 annotated in the microarray) that had enough variability among samples (s.d. >0.2). Two analyses (cis and trans) were performed. The former only interrogated genes located within each minimal recurrent region (FDR<0.05). The latter assessed associations of all CNA with all genes, except for genes located in chromosomes X and Y, and the Bonferroni method was used to adjust for multiple comparisons. These analyses were replicated using TCGA data.
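The idea of a partial correlation can be shown with a simplified sketch that adjusts for a single covariate using the first-order formula. The analyses described above adjusted for several covariates (age, sex, tumour location, and stromal content), so this is a deliberately reduced illustration; the function names and the toy values are assumptions.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy values: CNE, tumour-normal expression change, and one covariate.
cne = [1.0, 2.0, 3.0, 4.0]
expression_change = [2.0, 1.0, 4.0, 3.0]
stroma = [1.0, -1.0, -1.0, 1.0]
r_adj = partial_correlation(cne, expression_change, stroma)
```

In this toy example the covariate is uncorrelated with both variables, so the adjusted correlation equals the unadjusted one; with several covariates, the same quantity can be obtained by correlating the residuals of two linear regressions.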

Functional analysis

A functional analysis was performed to characterise the list of genes showing significant associations with CNA. The Sigora R package was used, which focuses on genes or gene pairs that are (as a combination) specific to a single pathway (Foroushani et al, 2013). This analysis was restricted to genes with a strong association with CNA (FDR<0.05 and r2>0.33). Furthermore, transcription factor (TF) enrichment was assessed using Fisher’s exact test.

Results

Proportion of altered genome: correlation with clinical and molecular characteristics

CNA events were detected in all samples, with a range of altered genome between 0.04 and 26.6% (Figure 1A). Although a homogeneous group of MSS stage II tumours was analysed, high variability among samples was observed. Interestingly, 10% of the tumours had <0.1% of the genome altered. A more detailed analysis, dividing CNA between gains and losses, revealed tumours that essentially only showed either gains or losses. The proportion of gained genome ranged between 0.007 and 10.4% (Figure 1B), whereas the proportion of lost genome ranged between 0.02 and 16.9% (Figure 1C). The proportion of altered genome was independent of age, sex, site location, progression, CIMP, CMS, and stromal infiltration (Table 1). The latter is not surprising, as the proportion of altered genome in each tumour was adjusted by the proportion of stroma. However, a significant positive correlation was observed between the number of somatic mutations and total CNA (Spearman’s r=0.42, P=0.006) (Supplementary Figure 2). This association, whose analysis was restricted to the 42 samples with whole-exome sequencing available, was weaker and no longer significant when gains and losses were analysed separately.

Figure 1

Distribution of clinical characteristics according to the proportion of altered genome. The histogram represents the proportion of altered genome by sample (purple: gains, red: losses). In the lower part, the clinical characteristics of the individuals are represented: sex (blue: female, red: male), age (sliding scale from white: minimum to brown: maximum), tumour location (light green: left, dark green: right), development of metastases (light pink: no, dark pink: yes), CIMP (white: no, green: CIMP low, blue: CIMP high), number of mutations (sliding scale from white: 0 to dark blue: maximum), proportion of stroma (light green: low, dark green: high), molecular subtype (yellow: CMS1, blue: CMS2, pink: CMS3, green: CMS4). (A) All CNA. (B) Gains. (C) Losses.

Table 1 Association between the percentage of altered genome and clinical characteristics

Minimal recurrent regions: correlation with clinical and molecular characteristics

A total of 26 423 segments with CNA (10 777 gains and 15 646 losses) were identified. The median number of altered segments per tumour was 53 gains and 118 losses. These segments were transformed into 13 279 MRRs, defined as CNA segments shared by at least five samples (5%) (Supplementary Table 2). Figure 2 shows the chromosomal distribution and frequency of the MRRs (both gains and losses). It should be noted that 54% of these regions were located in recurrent regions already described in CRC (recurrent gains in chromosome arms 7p, 8q, 13q, and 20, and recurrent losses in 8p, 17p, and 18). Interestingly, 116 of these MRRs were shared by >50% of the samples (Table 2 shows a summary of these regions). Only three of these regions included genes: GSTM1 in 1p13.3, SIRPB1 in 20p13, and ADAM5/ADAM3A in 8p11. The median number of samples per MRR was 8 (interquartile range 6 to 66). This small number of affected samples at each segment limited the power to detect associations with clinical variables. Indeed, no relevant association between MRRs and any clinical characteristic was found (FDR>0.05). The association of all MRRs with prognosis was also evaluated. After correction for multiple testing (FDR<0.05), only one region in 1p36.33 (chr1:1 627 906–1 628 405) was found to be statistically associated with disease-free survival (P=0.00002). Tumours losing this region (n=6) showed poor prognosis in comparison with diploid ones (Supplementary Figure 3). The gene CDK11B is located within this region.

Figure 2

Frequency of CNA by chromosome. Each graph represents a chromosome with chromosomal position on the X-axis. The Y-axis displays the percentage of tumours with gains (>0, purple) or losses (<0, red). The height of the bar is proportional to the number of samples showing the CNA change. Dashed lines represent the frequency (black: 5%, green: 20%, red: 50%).

Table 2 Summary of MRR with >50% of the samples altered in the CLX data set

Minimal recurrent regions: correlation with gene expression

CIS analysis

Only one-third (4292 out of 13 279) of the MRRs contained genes. Some large genes were included in more than one region. The linear models for the 2168 genes included in such regions revealed that 785 of them (36.2% in 545 MRRs) showed a significant relationship between the differences of expression and the CNA state at FDR<0.05 (Supplementary Table 3). The median of the partial Pearson’s correlation coefficient between the differences of expression and CNE of these 785 genes was 0.54 (interquartile range 0.20 to 0.81), indicating that CNA explained a large fraction of the variability in gene expression changes between normal and tumour tissues. It is noteworthy that potential stromal contamination was considered in these analyses by adjusting each test for the stromal content of each sample. Interestingly, 64 out of the 785 differentially expressed genes showed a partial correlation with CNE higher than 0.7 (>50% of variance explained; see some examples in Figure 3A–D).

Figure 3

Relationship between gene expression and CNA. (A–D) Boxplots showing examples of gene expression changes based on CNA levels. Spearman’s correlation and FDR P-value are shown. ‘L’ indicates number of individuals losing the region, whereas ‘G’ indicates number of individuals gaining the region. (E) Circos plot of CNA recurrent regions and their association with changes in gene expression. Outer circle shows ideograms of the chromosomes. Inner circles show, in order, focal regions (gains in purple, losses in red), broad events, and genomic location of significant associations between CNA and the difference in expression between tumour and adjacent normal (blue) in cis analysis. The central arcs indicate genomic locations with significant trans associations between CNA and changes of gene expression.

As expected, these genes were mainly located on chromosomes 6, 7, 8, 13, 17, 18, and 20, because these are the regions most often showing CNA (Table 3 and Figure 3E). Also unsurprisingly, CNA gains were associated with higher gene expression and CNA losses were associated with lower gene expression levels. This happened in 236 genes located in gained regions and 30 genes located in lost regions, respectively. Furthermore, the expression of genes located in regions in which both losses and gains had been observed (n=266) showed a good correlation with the quantitative CNE in each tumour, which can be interpreted as proportional to the average number of DNA copies. There were 117 MRR in which >1 gene showed significant changes in gene expression (20.6% of 567 MRR with >1 gene). Reinforcing the idea that copy number alterations are associated with changes in gene expression, 78% (92 of 117) of these regions showed half or more of the genes with altered expression. Specifically, 38 of them (32%) were MRR in which all included genes had consistent significant changes in expression.

Table 3 Chromosomal distribution of genes related to CNA and validation in TCGA data

A functional analysis was performed with the 325 genes whose changes in expression were strongly associated with CNA (FDR<0.05 and r2>0.33). A significant enrichment was identified in the ‘Colorectal cancer pathway’. This analysis also identified a significant enrichment of pathways related to ‘RNA degradation’, ‘Endocytosis’, ‘Basal transcription factors’, and ‘Glycerophospholipid biosynthesis’, among others (Supplementary Table 4 and Supplementary Table 5).

Trans analysis

Under the hypothesis that CNA could also have long distance effects (trans) on gene expression, due to regulatory effects, the association between CNA and gene expression changes of all annotated genes was evaluated. This analysis explored 15 225 697 relationships (the expression of 14 654 variable genes and 13 279 MRR; cis relationships previously analysed were excluded). Of them, only 191 were significant after Bonferroni correction (P<3.3e−9), involving 42 genes and 168 MRR (Figure 3E). All relationships were between genes and regions located in different chromosomes. Unexpectedly, 105 out of the 168 MRR (62.5%) did not contain genes, pointing to regulatory elements different from TF activity. The remaining CNA regions (n=63) included 53 genes. We tested whether these genes were predominantly TFs, but only 4 of them (GATA3, ST18, PRDM6, and ZNF641) had this function, whereas we expected 10% by chance alone (Supplementary Table 6).
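The Bonferroni threshold quoted above follows from dividing a family-wise significance level by the number of tests. The sketch below assumes a level of 0.05, which is not stated explicitly in the text but is consistent with the reported cut-off; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

# Number of trans relationships explored, as reported in the text.
n_relationships = 15_225_697
threshold = bonferroni_threshold(0.05, n_relationships)  # alpha is assumed
```

The result is approximately 3.3e−9, in agreement with the threshold used for the trans analysis.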

Focal regions and broad events: correlation with clinical and molecular characteristics

Larger focal regions were identified from specific MRR, defined as consecutive MRRs with the same CNA status (but possibly different quantitative CNE value). If one of these regions involved >50% of a chromosome arm, it was defined as a broad event.

From 13 279 MRR, 353 focal regions were found (97 focal gains and 256 focal losses; Figure 3E and Supplementary Table 7). The median number of samples with some aberration in these focal regions was 11 (interquartile range 8 to 14). The focal regions represented 12.5% of the altered genome (9.8% in lost focal regions and 2.7% in gained focal regions). These focal regions were enriched in genes, as 26% of the total number of human genes was in these CNA regions (16% in lost focal regions and 10% in gained focal regions). However, no significant associations were found between the average CNE in these focal regions and any of the clinical characteristics explored, including prognosis.

Five recurrent broad regions were identified as follows: gains in 8q (6% of the samples), 13q (7% of the samples), 20p (6% of the samples), and 20q (24% of the samples), and losses in 8p (7% of the samples). No significant association between recurrent broad CNA and clinical characteristics or prognosis was found, except for gains in 20q, which were related to the number of somatic mutations (P=0.00006). Other classically altered regions in CRC were also detected, but at lower frequency, as follows: 7p gain (n=4), 17p loss (n=4), and 18q loss (n=3). Indeed, if less stringent criteria were used to detect broad regions, more altered tumours emerged (Supplementary Table 8).

Validation in TCGA data

To assess the consistency of our findings, a validation was performed using the TCGA data set comprising 222 CRC tumours. To ensure comparable data, the same analysis pipeline used for our samples was followed, starting from the raw TCGA data. In agreement with our results, the range of altered genome was 0.038 to 28.1% (0.004–12.1% gains and 0.025–22.5% losses). Unexpectedly, the proportion of altered genome in the TCGA showed a significant negative association with the number of somatic mutations (Spearman’s r=−0.15, P=0.03). Nevertheless, when only MSS samples were considered, this negative correlation disappeared and a nonsignificant positive correlation between the number of mutations and the proportion of lost genome emerged (Spearman’s r=0.14, P=0.08). This change in correlation derives from the fact that MSI tumours are hyper-mutated and usually diploid. Interestingly, a strong association between CMS and the proportion of altered genome was found in TCGA data. Subtypes CMS2 and CMS4 accumulated higher levels of chromosomal alterations than CMS1 and CMS3 (Supplementary Figure 4).

A total of 8771 MRRs (66% of 13 279) were validated in TCGA samples and the agreement was very high for MRRs altered in >50% of the samples (Table 2). A total of 4105 MRRs identified in TCGA had not previously been observed in our data. If only stage II and III MSS tumours were considered, the percentage of MRRs validated in TCGA increased to 69% (n=51) and 68% (n=44), respectively. Finally, when comparing samples from different stages in the TCGA data set, 72% of MRRs from stage II tumours were found in stage III tumours.

Regarding the association with gene expression, it should be noted that the TCGA data set only analysed 22 normal tissues; thus, the tumour-normal changes have been estimated with respect to the average expression of these normal tissues in an unpaired analysis. Furthermore, only 631 out of the 785 significant genes were found in the TCGA validation data set. For this subset, 79% (496 genes) of our gene expression–CNA associations were replicated, thus confirming that expression levels of such genes were in part explained by CNA in colon cancer (Table 3 and Supplementary Table 3).

The trans validation was performed in 127 out of the 191 associations, because 19 genes and 45 MRRs were not found in the TCGA. From these, only 64 relationships (50%) were confirmed in the TCGA data set, which indicated that some of our findings could be spurious even though we used Bonferroni correction to protect from false positive findings.

Concerning focal and broad events, 51% of the focal regions were validated. Surprisingly, almost all were lost regions. Only 10 out of 97 focal gained regions were validated and all broad regions were validated.

Discussion

This comprehensive analysis confirms that CNA are frequent in MSS colon tumours and probably induce relevant changes in gene expression that alter key cancer pathways. Even though all analysed samples were MSS, stage II colon tumours, a high heterogeneity in CNA among them has been observed, both in the percentage of altered genome and the location of the CNA.

The percentage of altered genome ranged from 0.04 to 26.6% (mean 2.6%). This percentage, validated in TCGA data when the same methodology was used to define CNA, is lower than that reported in previous studies (Trautmann et al, 2006; Brosens et al, 2011; Xie et al, 2012). A probable reason is the rigorous cut-off used in our analysis, selected to reduce the number of false positive CNA that could attenuate the associations with gene expression. However, the frequency of recurrent CNA regions found in our study is consistent with previous reports, with gains in chromosomes 7, 8, 13, and 20, and losses in chromosomes 8, 17, and 18 (Tsafrir et al, 2006; Ashktorab et al, 2010; Brosens et al, 2011; Xie et al, 2012).

In this study we have paid special attention to the stromal content of tumours. An initial analysis that used a fixed cut-off (±0.4) for all samples revealed a strong association between CNA and the proportion of stroma in the tumours. Specifically, tumours with molecular subtype CMS4, which are characterised by high stromal content (Calon et al, 2015; Isella et al, 2015), also showed a reduced frequency of CNA. Therefore, and as stromal cells are diploid, we thought that this result could reflect a biased estimation of CNA in tumours with high stromal content due to a dilution effect. Moreover, as other studies have described an association of CNA with poor prognosis (Kurashina et al, 2008; Andersen et al, 2011; Orsetti et al, 2014), it seemed paradoxical that the CMS4 subtype, which has poor prognosis, was the least altered subtype. Based on this observation, we adjusted the cut-off to define a CNA as a function of the stromal content of the sample. After this correction, which we consider less biased, no significant associations were observed between CNA and stromal infiltration. Although not statistically significant, differences were observed among molecular subtypes. CNA were more frequent in CMS4 and CMS2 tumours than in CMS1 and CMS3 (Supplementary Figure 5). In the analysis of the TCGA data we found a statistically significant CNA enrichment in CMS2 and CMS4 tumours. This result agrees with that reported in the study describing the molecular characteristics of CRC CMSs (Guinney et al, 2015). Of note, the subtype CMS4, which had the least CNA when the stromal component was not considered, emerged as the subtype with the most CNA changes when adjusting for stromal content. This observation should be taken into account when interpreting the differences in CNA among tumours with diverse proportions of stroma in studies that have not adjusted for this effect.

Most MRR identified in our tumours were validated in the TCGA data set, confirming the validity of our analysis. Furthermore, this percentage was high when only MSS stage II tumours were used for validation purposes. Regarding focal regions, it is interesting to note that almost all lost regions were validated in TCGA data, whereas only 10% of gains were validated. This result suggests that a higher heterogeneity in gained events across patients exists, whereas lost events are prone to be more recurrent. In addition, all described broad events were validated in TCGA data, confirming their validity.

Aberrations in copy number are relevant because of the changes in gene dosage that they may produce. These can have a direct effect on the protein levels of genes located in regions with CNA or an effect mediated through modifications in regulatory elements. As expected, we have observed that a large fraction of expression changes in colon tumours can be explained by changes in CNA in the regions where these genes are located. Furthermore, when a region contains multiple genes, most of them change their level of expression in a similar pattern. Nevertheless, although frequent, this is not a general mechanism of gene expression alteration in tumours, as not all genes located in CNA regions change their levels of expression between normal and tumour samples. Indeed, almost 15% of genes were located in CNA regions but only 36% of them changed their level of expression between normal and tumour tissues in a way that might be causal. As expected, most expression changes directly followed the change in gene dosage (although some nonsignificant exceptions have been observed, possibly due to multiple comparisons). Indeed, this relation has been widely described in CRC (Tsafrir et al, 2006; Sillars-Hardebol et al, 2012; Wang et al, 2016).

Gene expression regulation is complex. In addition to these direct relationships, trans associations between CNA and gene expression were also found. We hypothesised that TFs located in CNA regions could explain changes in the level of expression of genes located in distant regions of the genome. However, only 3% of such genes are known TFs, whereas we expected 10%. What is more, 105 out of 168 CNA regions implicated in trans relationships did not contain genes; thus, alternative regulatory mechanisms, possibly involving enhancers, methylation, or non-coding RNAs, must be involved in these long-distance effects of CNA on gene expression. It is reassuring that most cis (79%) and half of the trans (50%) relationships were validated using TCGA public data.

We also assessed the association of CNA with clinical and molecular parameters. We found that tumours with a higher number of CNA also exhibited a higher number of somatic mutations (though this analysis was restricted to the 42 tumours with exome data). As only MSS tumours were included in this analysis, we hypothesise that the inverse relationship between CNA and mutational load previously described only emerges when MSS tumours are compared with hypermutated MSI tumours. Indeed, this inverse relationship was observed in the TCGA validation data set, which included MSI tumours. When only MSS tumours were considered, in line with our results, the trend was towards a positive correlation between both types of aberration (CNA and somatic mutations). Interestingly, these CNA are likely to be segment losses, which might be related to the requirement of a double hit for many mutations to be active.

Specific CNA have been previously suggested as prognostic biomarkers (Brosens et al, 2011; Zhang et al, 2015; Wang et al, 2016). In our data, we found only one region, in 1p36.33, that was significantly associated with prognosis after correction for multiple comparisons. Tumours in which this region was lost showed worse prognosis than diploid tumours. This region contains the CDK11B gene, which encodes a cyclin-dependent kinase that has multiple roles in cell cycle progression and apoptosis regulation. Thus, we hypothesise that in a subset of colon tumours CDK11B could act as a tumour suppressor gene. However, owing to the small number of cancer recurrence events in our study (21 out of 99 patients), we cannot exclude the possibility that this region was associated with prognosis just by chance. This result could not be validated in TCGA data because the follow-up information of the individuals is of poor quality, so it deserves further study.

Although originally developed to assess genetic diversity, genotyping arrays have emerged as a useful technology to identify regions with CNA. It is particularly important to highlight the high rates of false positive focal regions that can result from using these high-throughput techniques. For this reason, we have used a conservative threshold that varies according to the proportion of stroma in each sample. The selection of a method to call CNA regions represents a great challenge, because there are many available, usually with little experimental validation, and their results are not necessarily consistent. So far, little work has been devoted to comparing the results obtained through different methods (Morganella et al, 2010; Koike et al, 2011). After exploring diverse software tools, we selected a method that provided more precise results when focal CNA regions were visually inspected. Also, the results obtained regarding the frequency and chromosomal distribution of CNA were similar to those previously reported for CRC using different methods, which supports the validity of our approach (Morganella et al, 2010; Rueda and Diaz-Uriarte, 2010; Morganella et al, 2011). Smaller regions with CNA observed in multiple samples help to better identify potential causal genes behind the observed associations. Larger focal regions, as identified by the GAIA software, paradoxically decrease the power to detect associations with clinical variables, because the enlarged region usually combines samples with heterogeneous CNA.

In conclusion, this comprehensive analysis has shown that CNA are highly frequent and heterogeneous events in MSS stage II colon tumours. The variation of gene expression between tumour tissues and their paired adjacent normal mucosa was explained by CNA in 36% of the genes affected by this type of aberration, and the genes most often altered belong to key cancer pathways. These genes altered by CNA represent 5% of the total number of genes expressed in the colon (approximately 36% of the almost 15% of genes located in CNA regions).

In addition, from a methodological perspective, we have found that the proportion of tumour stroma may bias the estimation of CNA. To avoid this effect, we used an adjusted cut-off definition proportional to the estimated stromal content, which produced more accurate results.