A series of studies have been published that evaluate the chromosomal copy number changes of different tumor classes using array comparative genomic hybridization (array CGH); however, the chromosomal aberrations that distinguish the different tumor classes have not been fully characterized. Therefore, we performed a meta-analysis of different array CGH data sets in an attempt to classify samples tested across different platforms. As opposed to RNA expression, a common reference is used in dual channel CGH arrays: normal human DNA, theoretically facilitating cross-platform analysis. To this aim, cell line and primary cancer data sets from three different dual channel array CGH platforms obtained by four different institutes were integrated. The cell line data were used to develop preprocessing methods, which performed noise reduction and transformed samples into a common format. The transformed array CGH profiles allowed perfect clustering by cell line, but importantly not by platform or institute. The same preprocessing procedures used for the cell line data were applied to data from 373 primary tumors profiled by array CGH, including controls. Results indicated that there is no apparent feature related to the institute or platform and that array CGH allows for unambiguous cross-platform meta-analysis. Major clusters with common tissue origin were identified. Interestingly, tumors of hematopoietic and mesenchymal origins cluster separately from tumors of epithelial origin. Therefore, it can be concluded that chromosomal aberrations of tumors from hematopoietic and mesenchymal origin versus tumors of epithelial origin are distinct, and these differences can be picked up by meta-analysis of array CGH data. This suggests the possibility of prospectively using combined analysis of diverse copy number data sets for cancer subtype classification.
Array comparative genomic hybridization (array CGH) is the high-resolution laboratory technique of choice for the detection of chromosomal DNA copy number aberrations on a genome-wide scale (Oostlander et al., 2004; Pinkel and Albertson, 2005). We and others hypothesize that driver genes determine the location of aberrations, which would occur through clonal selection during the development of cancer (Albertson et al., 2000). Chromosomal aberrations vary across cancer types with a certain specificity, which can be exploited in a meta-analysis of array CGH data. For example, within a gastric cancer set we previously identified array CGH signatures associated with different clinical outcome (Weiss et al., 2003). In addition, more advanced, aggressive or metastatic cancers often display more chromosomal aberrations (Zardo et al., 2002) and different breast tumor classes harbor distinct chromosomal aberrations (Jonsson et al., 2005).
A variety of commercial and non-commercial platforms for array CGH have become available using bacterial artificial chromosomes (BACs), P1-derived artificial chromosomes (PACs), cosmids, cDNAs, fosmids and synthetic oligonucleotides (Ylstra et al., 2006). Several genome-wide array CGH studies of different cancers, such as breast (Pollack et al., 1999), fallopian tube (Snijders et al., 2003b), prostate (van Dekken et al., 2004) and soft tissue cancers (Linn et al., 2003), have been published with actual array data made public (Table 1 and Supplementary Table 1).
In this paper, we successfully performed cross-platform analysis of array CGH data obtained from different platforms and institutes. Such meta-analysis is important because: it gives the possibility to achieve more robust and reliable results by considering data sets from different studies, it offers the ability to perform analysis of samples from different types of cancers from different geographical locations and, finally, it allows the identification of subgroups of samples not only within a cancer type but also across different types of cancers. Such cross-cancer-type patterns may show specific biological and/or clinical features (Weiss et al., 2003; Mitelman et al., 2004; Lu et al., 2005).
Array CGH data from different platforms cannot be compared directly, as each platform contains different numbers of clones at variable spacing and resolution, and the noise distribution varies across platforms, as does the amplitude for a given copy number change (Ylstra et al., 2006). Finally, the way each institute performs the experiments may introduce specific types of noise in the data and influence dynamic range. We developed a five-step preprocessing methodology to overcome these problems.
The use of a common reference in array CGH experiments provides an intrinsic baseline, which is an important advantage for array CGH meta-analysis, compared to RNA expression array meta-analysis (Rhodes et al., 2004). Moreover, array CGH data are less complex than expression array data, as the hybridization ratios resulting from the chromosomal copy numbers are integers and nearby chromosomal positions tend to have a highly correlated CGH value. To test and optimize the performance of the methodology, we initially evaluated array CGH data from cell lines obtained by different laboratories with different platforms for preprocessing and hierarchical clustering. This informatics approach yielded additional insights into how array CGH data can best be clustered. Subsequently, we applied the same procedure to primary cancer data from a total of 373 samples.
Common cell lines to optimize cross-platform settings
Different array platforms use different types, lengths, mappings and numbers of clones, different print pins and slide surfaces, different labeling and hybridization procedures, different normal reference DNA samples and DNA isolation procedures, different scanning and imaging procedures, etc. This results in profiles that in a given case recognize the same chromosomal aberrations, yet differ in noise distribution, amplitude and resolution (Supplementary Figure 1).
As a first check of performance of the preprocessing steps, a variance analysis was applied to the normalized meta-data set. The distribution of the s.d.'s is such that for each cell line at least 75% of the s.d.'s are smaller than the top 75% of s.d.'s within the entire data set (Figure 1). This analysis indicates that after preprocessing, the variance owing to the difference in chromosomal composition of cell lines is larger than the variance owing to other factors, such as institute or platform. This conclusion was confirmed by performing a one-way analysis of variance (ANOVA) to test the null hypothesis of no difference between the cell lines. Because the positions are statistically dependent, the ANOVA was carried out separately for each position. For 90% of the positions the P-value (based on the appropriate F-distribution) was found to be below 0.05 and for 82% it was below 0.01.
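The per-position ANOVA described above can be illustrated with a minimal sketch. It uses only the Python standard library and computes the F statistic for each sampled position, treating each cell line as one group. The function names and the toy data are hypothetical and are not taken from the study. Converting each F value to a P-value additionally requires the F-distribution with (k − 1, N − k) degrees of freedom, which is left out here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, (df_between, df_within)) for a one-way ANOVA.

    `groups` is a list of lists; each inner list holds the values
    measured for one cell line at a single chromosomal position.
    Raises ZeroDivisionError if there is no within-group variation.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, (k - 1, n - k)

def anova_per_position(replicates):
    """Run the ANOVA separately at every sampled position.

    `replicates` maps a cell line name to a list of profiles, one
    profile (list of values over positions) per experiment.
    """
    n_positions = len(next(iter(replicates.values()))[0])
    f_values = []
    for p in range(n_positions):
        groups = [[profile[p] for profile in profiles]
                  for profiles in replicates.values()]
        f_values.append(one_way_anova_f(groups)[0])
    return f_values

# Toy data (hypothetical): two cell lines, two experiments each,
# three sampled positions. Positions 0 and 2 differ between the
# cell lines, position 1 does not.
replicates = {
    "line_A": [[1.0, 0.0, -1.0], [1.2, 0.1, -0.8]],
    "line_B": [[-1.0, 0.1, 1.0], [-0.9, -0.1, 1.1]],
}
f_values = anova_per_position(replicates)
```

A large F value at a position indicates that the differences between cell lines dominate the differences between repeated experiments on the same cell line at that position.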
As a second approach for validating the preprocessing procedure for the cross-platform analysis, we used hierarchical clustering. The dendrogram produced by the hierarchical cluster analysis shows that all experiments on the same cell lines, performed on different platforms, cluster together, not interspersed by other cell lines (Figure 2). Interestingly, we also noted that cell lines VP229 and VP267 derived from a primary breast cancer and its relapse, respectively, clustered together in the same terminal branch of the tree. We therefore conclude that the cell line experiments cluster by copy number characteristics rather than by platform or institute.
Hierarchical cluster analysis of primary cancers
Hierarchical cluster analysis was applied to the primary cancer meta-data, using the same preprocessing settings as used for the cell line data. Internal controls included three colorectal cancer samples that were performed independently (different days and technicians) on two different platforms, the VUMC BAC array and the VUMC oligoarrays (van den IJssel et al., 2005). Figure 3 shows that the three colorectal cancer samples on the two different platforms perfectly pair (Figure 3, black bars). A second set of controls included 37 lymphoma samples, all from the VUMC pathology department, of which 34 were hybridized to the VUMC BAC arrays and three to the oligoarrays. Most lymphomas form one large cluster (Figure 3, green bars), which included all three lymphomas performed on the oligoarrays (indicated by arrowheads). In addition, one large cluster containing 37 colorectal samples is apparent (Figure 3, red bar), where samples profiled on BAC arrays from three different institutes intermingle. The last indication that the meta-clustering is consistent is provided by the small breast cancer cluster (Figure 3 column A, blue bar) comprising five samples that were analysed on three different platforms: cDNA (two samples), oligo- (two samples) and BAC CGH arrays (one sample).
Thus, four controls produced the expected results. First, identical colorectal cancer samples on different platforms cluster perfectly together. Secondly, the lymphoma cases show that samples from one institute and the same tissue origin, but profiled on different platforms, cluster together. Thirdly, colorectal cancer samples from different institutes on the same type of platform cluster together. Finally, breast cancer samples from different institutes also cluster together. It can therefore be concluded that the meta-clustering is reliable on primary tumor data, as samples cluster by cancer type rather than by institute or platform, in an unsupervised manner.
The meta-analysis discerned major clusters with common tissue origin, although the clusters are interspersed by tumors of different origins. The large clusters that can be recognized are the lymphoma, breast cancer and colorectal cancer clusters (Figure 3, columns A and B and highlights on the right), as well as a large head and neck cancer cluster, one cluster with most of the Fallopian tube carcinomas, one cluster with all four prostate cancers and two smaller soft tissue tumor clusters. Notably, colorectal and gastric cancer samples regularly intermingle in the array CGH data set (Figure 3), although this is not particularly evident in the highlighted colon cluster. Intermingling of gastrointestinal tumors has also been observed using micro-RNA profiling (Lu et al., 2005), supporting the biological relevance of the cross-platform analysis results presented here.
Although breast cancers clustered together in a major cluster, most breast cDNA array CGH samples clustered apart from the breast oligoarray CGH samples (Figure 3). This reflects the intrinsic differences between the two cohorts analysed: most of the cases profiled on cDNA arrays were large tumors (3 cm or larger) and over 50% were node positive (Pollack et al., 2002), whereas the tumors profiled on oligoarrays were mostly small (less than 2 cm) and only 30% were node positive (unpublished data).
Array CGH meta-analysis divides tumors by embryonic origin
Apart from clustering tumors by tissue origin, the cross-platform meta-analysis also clusters tumors by their embryonic origin. The 60 samples at the top of the meta-cluster are primarily (97%) of hematopoietic and mesenchymal origin, whereas all but three (1%) of the lower 293 samples are of epithelial origin. These two large meta-clusters are separated by a cluster of 18 samples that have a mixed origin: 12 of epithelial origin (66.7%) and six lymphoma and retinoblastoma samples (33.3%). Therefore, it can be concluded that chromosomal aberrations of tumors from hematopoietic and mesenchymal origin versus tumors of epithelial origin are distinct, which is likely related to different underlying oncogenic mechanisms.
Cross-platform analysis on array data is an important aim, as it would allow better communal use of the large amounts of genomic data collected worldwide, which are uploaded in public databases (Brazma et al., 2001; Edgar et al., 2002). Array CGH is a powerful technique that measures unbalanced chromosomal copy number aberrations (Oostlander et al., 2004). One of the advantages of array CGH as a genomics array technique is that cross-platform analysis of array CGH is more straightforward to perform as well as less ambiguous to interpret compared to meta-analysis of other types of genome-wide data, such as for RNA expression microarrays (Michiels et al., 2005; Rhodes et al., 2004). We conclude here, based on the cell line data, and the lymphoma, breast and colon controls, that cross-platform effects are much smaller than the differences observed between the samples.
The array CGH meta-analysis divides the cancer samples in separate clusters based on tissue origin as well as on embryonic origin. Based on balanced translocations, the distinction between hematological and mesenchymal tumors on the one hand and solid tumors on the other was questionable (Mitelman et al., 2004). In contrast, a distinction between tumors of hematological and mesenchymal origin versus solid tumors has recently also been observed using micro-RNA profiling (Lu et al., 2005), consistent with the data presented here using array CGH meta-analysis. Interestingly, micro-RNA profiling intermingled colorectal and gastric cancers as we observed with array CGH meta-analysis.
Although the majority of tumors analysed cluster by tissue type, some tumors are scattered throughout. This is not surprising and can probably be explained by the fact that genetic aberrations other than chromosomal copy number changes may drive tumorigenesis in these cases, like point mutations, promoter methylation and balanced translocations, and these aberrations cannot be detected by array CGH (Oostlander et al., 2004).
In conclusion, we describe here the first cross-platform unsupervised meta-clustering of array CGH data from primary cancers. Using array CGH data from a selection of different platforms and institutes, we have unambiguously shown that data on chromosomal copy number changes can be successfully compared across different studies. In addition, meta-analysis of such a large series of primary cancers reveals new biological information in that patterns of chromosomal copy number changes cluster primary cancers by tumor site as well as embryonic origin. This raises the exciting prospect that meta-analysis of future array CGH data sets using the methodology described here will help unravel sub-types of cancers of similar tissue origin, which might then be correlated with histopathological and clinical characteristics.
Materials and methods
We collected dual channel array CGH data sets that are publicly available from cell lines and primary cancers. To obtain a reasonable number of data sets while maintaining an overall high resolution, only those with a genome-wide average resolution of 1.5 Mb or higher were selected. Some cell lines were used in experiments on different platforms at different institutes. The cell line data set contained seven duplicates, three triplicates, one quadruplet and one quintuplet (Supplementary Table 1).
Dual channel array CGH profiles of 373 primary tumor samples were collected that were either publicly available or generated in our laboratory. These include BAC arrays from the Sanger Institute in Cambridge (UK), the University of California in San Francisco (UCSF, San Francisco, USA) and our Amsterdam laboratories (VUMC, The Netherlands); oligonucleotide CGH arrays from our laboratory in Amsterdam; and cDNA arrays from Stanford University (Table 1). The new VUMC BAC array CGH and oligoarray CGH experiments were performed according to protocols previously described (Schreurs et al., 2005; van den IJssel et al., 2005) and data are available at Gene Expression Omnibus (GEO) http://www.ncbi.nlm.nih.gov/geo/ (Edgar et al., 2002), under Accession number GSE5051. The fallopian data set is available at http://cc.ucsf.edu/albertson/public/; all others are as stated in the references (Table 1).
Preprocessing data to transform samples into a common format.
Preprocessing methods were developed and applied, which perform noise reduction and transform samples into a common format.
The May 2004 freeze of the UCSC, ENSEMBL and CHORI databases was used. Clones not found in these databases were excluded.
To overcome the problem of varying noise across platforms, the array CGH smoothing algorithm was applied (Jong et al., 2004). Smoothing settings depend on the number of clones and noise profile, and therefore on platform. The settings for the BAC platform were previously adjusted by expert opinion (Jong et al., 2004), and the same approach was applied here to the other platforms. The parameter (λ) in the smoothing algorithm, a value that is inversely proportional to the number of breakpoints identified, was set to 2 for the oligo platform, 1.5 for the cDNA platform and 0.8 for the BAC platform.
Chromosomal position sampling.
To deal with the varying positions of the different clones on the genome, 100 positions were sampled on each chromosome at equal spacing. This approach weighs each chromosome equally and emphasizes breakpoints over chromosomal length.
The DNA copy number ratio for each sampled position was set to the ratio of the closest position in the smoothed data.
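The sampling and nearest-position assignment described above can be sketched as follows, using only the Python standard library. The function names are hypothetical. Placing each sampled position at the centre of one of 100 equal-width bins is an assumption, as the text specifies only equal spacing.

```python
from bisect import bisect_left

def sample_positions(chrom_length, n=100):
    """Return n equally spaced positions (in bp) along one chromosome.

    Assumption: each position sits at the centre of one of n equal bins.
    """
    step = chrom_length / n
    return [(k + 0.5) * step for k in range(n)]

def nearest_value(clone_pos, clone_val, query):
    """Return the smoothed ratio of the clone closest to `query`.

    `clone_pos` must be sorted in ascending order and aligned with
    `clone_val`. Ties are resolved in favour of the left-hand clone.
    """
    i = bisect_left(clone_pos, query)
    if i == 0:
        return clone_val[0]
    if i == len(clone_pos):
        return clone_val[-1]
    left, right = clone_pos[i - 1], clone_pos[i]
    return clone_val[i - 1] if query - left <= right - query else clone_val[i]

def resample_chromosome(clone_pos, clone_val, chrom_length, n=100):
    """Map one chromosome of smoothed data onto n common positions."""
    return [nearest_value(clone_pos, clone_val, q)
            for q in sample_positions(chrom_length, n)]

# Toy example (hypothetical): three clones on a chromosome of length 100,
# resampled at four positions instead of 100 to keep the output short.
toy = resample_chromosome([10, 50, 90], [0.0, 1.0, -1.0], 100, n=4)
```

Because every chromosome receives the same number of sampled positions regardless of its length, profiles from platforms with different clone sets end up as vectors of identical length that can be compared position by position.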
The dynamic range for the CGH ratios may vary across platforms such that a single-copy gain in one platform gives a higher or lower value compared to another platform (Ylstra et al., 2006). One of the most notable differences between DNA from cell lines and primary cancers is the purity of the samples. Whereas the chromosomal DNA of cell line samples is identical in nearly 100% of its cells, cancer samples have different admixtures of cancer and normal cells. This heterogeneity of samples results in cancer/reference ratios closer to one, or equivalently log ratios closer to zero in primary cancer samples. To reduce this effect, the smoothed log2 ratios, Xi, where i indicates position, were transformed to z-scores by subtracting from Xi the average over all positions and dividing by the standard deviation calculated over all positions, using standard settings in MatLab version 6.5.1.
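The effect of this normalization can be illustrated with a short sketch in Python (standard library only). It uses the conventional z-score, which centres a profile on its mean and scales it by the sample standard deviation. That this corresponds to the "standard settings" mentioned above is an assumption. The toy profiles are hypothetical, and modelling normal-cell admixture as a constant scaling of the log ratios is a simplification.

```python
from statistics import mean, stdev

def zscore_profile(log2_ratios):
    """Transform one smoothed profile to z-scores over all positions."""
    mu = mean(log2_ratios)
    sd = stdev(log2_ratios)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("flat profile: z-score is undefined")
    return [(x - mu) / sd for x in log2_ratios]

# Toy profiles (hypothetical): the same aberrations measured in a pure
# cell line and in a tumor sample whose amplitudes are compressed by
# admixture of normal cells (modelled here as a constant factor of 0.4).
pure = [0.0, 0.0, 1.0, 1.0, -1.0, 0.0]
diluted = [0.4 * x for x in pure]

z_pure = zscore_profile(pure)
z_diluted = zscore_profile(diluted)
```

Under this simplified model the two z-score profiles coincide up to floating-point error, which is why the transformation reduces amplitude differences caused by sample purity or platform dynamic range.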
The cell line data were used to optimize the preprocessing procedure outlined in the previous section. Hierarchical clustering with the average linkage method was then used for the analysis of array CGH data (Snijders et al., 2005). To overcome scaling problems, we used the Spearman rank correlation distance, as it is scale independent (Eisen et al., 1998). The robustness of the clustering obtained with the cell line data was evaluated with the ‘support tree’ method in TIGR MeV (Saeed et al., 2003). This is a bootstrapping method where chromosomal positions are randomly sampled with replacement and the hierarchical clustering is performed again on these data. This was performed 100 times. A particular cluster that reappears frequently is unlikely to be driven by a small number of clones. Secondly, we analysed the amount of variance owing to platform or institute. This was performed for each cell line and each sampled position by computing the s.d. at that position over all the experiments on that cell line.
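The clustering step can be sketched in plain Python as follows. This is a minimal illustration of average linkage clustering on a Spearman rank correlation distance (1 minus the rank correlation) and is not the TIGR MeV implementation used in the study. All function names and the toy profiles are hypothetical, and the bootstrap step is omitted.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation; fails on a constant vector (zero variance)."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def spearman_distance(a, b):
    """1 - Spearman correlation, i.e. Pearson correlation of the ranks."""
    return 1.0 - pearson(ranks(a), ranks(b))

def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (members, members, dist)."""
    names = list(profiles)
    d = {(p, q): spearman_distance(profiles[p], profiles[q])
         for p in names for q in names if p != q}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Average linkage: mean pairwise distance between two clusters.
        candidates = [(mean(d[p, q] for p in c1 for q in c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j]
        dist, i, j = min(candidates)
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy data (hypothetical): two cell lines, each profiled on two platforms
# with different amplitudes for the same aberrations.
line_a = [0.0, 0.0, 1.0, 1.0, 0.0, -1.0, -1.0, 0.0]
line_b = [1.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0, 1.0]
profiles = {
    "A_bac": line_a,
    "A_oligo": [0.5 * x for x in line_a],
    "B_bac": line_b,
    "B_oligo": [2.0 * x for x in line_b],
}
merges = average_linkage(profiles)
```

In this toy case the first two merges join the two experiments on each cell line, because a rank-based distance ignores the difference in amplitude between platforms. The two cell lines are joined only in the final merge.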
The primary cancer data were handled using the same preprocessing steps as for the cell line data. Likewise, the clustering settings were kept identical to those used for the clustering of the cell line meta-data.
Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D et al. (2000). Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nat Genet 25: 144–146.
Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C et al. (2001). Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat Genet 29: 365–371.
Douglas EJ, Fiegler H, Rowan A, Halford S, Bicknell DC, Bodmer W et al. (2004). Array comparative genomic hybridization analysis of colorectal cancer cell lines and primary carcinomas. Cancer Res 64: 4817–4825.
Edgar R, Domrachev M, Lash AE. (2002). Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30: 207–210.
Eisen MB, Spellman PT, Brown PO, Botstein D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863–14868.
Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B. (2004). Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics 20: 3636–3637.
Jonsson G, Naylor TL, Vallon-Christersson J, Staaf J, Huang J, Ward MR et al. (2005). Distinct genomic profiles in hereditary breast tumors identified by array-based comparative genomic hybridization. Cancer Res 65: 7612–7621.
Linn SC, West RB, Pollack JR, Zhu S, Hernandez-Boussard T, Nielsen TO et al. (2003). Gene expression patterns and gene copy number changes in dermatofibrosarcoma protuberans. Am J Pathol 163: 2383–2395.
Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D et al. (2005). MicroRNA expression profiles classify human cancers. Nature 435: 834–838.
Michiels S, Koscielny S, Hill C. (2005). Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 365: 488–492.
Mitelman F, Johansson B, Mertens F. (2004). Fusion genes and rearranged genes as a linear function of chromosome aberrations in cancer. Nat Genet 36: 331–334.
Nakao K, Mehta KR, Fridlyand J, Moore DH, Jain AN, Lafuente A et al. (2004). High-resolution analysis of DNA copy number alterations in colorectal cancer by array-based comparative genomic hybridization. Carcinogenesis 25: 1345–1357.
Oostlander AE, Meijer GA, Ylstra B. (2004). Microarray-based comparative genomic hybridization and its applications in human genetics. Clin Genet 66: 488–495.
Pinkel D, Albertson DG. (2005). Array comparative genomic hybridization and its applications in cancer. Nat Genet 37 (Suppl): S11–S17.
Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF et al. (1999). Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet 23: 41–46.
Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE et al. (2002). Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci USA 99: 12963–12968.
Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D et al. (2004). Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA 101: 9309–9314.
Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N et al. (2003). TM4: a free, open-source system for microarray data management and analysis. Biotechniques 34: 374–378.
Schreurs MW, Hermsen MA, Geltink RI, Scholten KB, Brink AA, Kueter EW et al. (2005). Genomic stability and functional activity may be lost in telomerase-transduced human CD8+ T lymphocytes. Blood 106: 2663–2670.
Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D et al. (2003a). Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22: 4370–4379.
Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J et al. (2001). Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet 29: 263–264.
Snijders AM, Nowee ME, Fridlyand J, Piek JM, Dorsman JC, Jain AN et al. (2003b). Genome-wide-array-based comparative genomic hybridization reveals genetic homogeneity and frequent copy number increases encompassing CCNE1 in fallopian tube carcinoma. Oncogene 22: 4281–4286.
Snijders AM, Schmidt BL, Fridlyand J, Dekker N, Pinkel D, Jordan RC et al. (2005). Rare amplicons implicate frequent deregulation of cell fate specification pathways in oral squamous cell carcinoma. Oncogene 24: 4232–4242.
van Dekken H, Paris PL, Albertson DG, Alers JC, Andaya A, Kowbel D et al. (2004). Evaluation of genetic patterns in different tumor areas of intermediate-grade prostatic adenocarcinomas by high-resolution genomic array analysis. Genes Chromosomes Cancer 39: 249–256.
van den IJssel P, Tijssen M, Chin SF, Eijk P, Carvalho B, Hopmans E et al. (2005). Human and mouse oligonucleotide-based array CGH. Nucleic Acids Res 33: e192.
Weiss MM, Kuipers EJ, Postma C, Snijders AM, Siccama I, Pinkel D et al. (2003). Genomic profiling of gastric cancer predicts lymph node status and survival. Oncogene 22: 1872–1879.
Ylstra B, van den IJssel P, Carvalho B, Brakenhoff RH, Meijer GA. (2006). BAC to the future! or oligonucleotides: a perspective for micro array comparative genomic hybridization (array CGH). Nucleic Acids Res 34: 445–450.
Zardo G, Tiirikainen MI, Hong C, Misra A, Feuerstein BG, Volik S et al. (2002). Integrated genomic and epigenomic analyses pinpoint biallelic gene inactivation in tumors. Nat Genet 32: 453–458.
We thank Tony Cox, Ian Stalker (Sanger Institute, UK) and Nir Gamliel (Compugen USA, San Jose, CA, USA) for mapping the oligos into ENSEMBL. In addition, we would like to thank Cindy Postma and Erik Hopmans for their help with array hybridizations. We wish to thank Dr Stephen Ethier for the SUM cell lines and Dr Morag McCallum for the VP cell lines. SFC and CC are funded by Cancer Research UK. SFC was furthermore funded by COST-STSM B19. We furthermore acknowledge funding by the EU-sixth framework project: DISMAL, Contract No.: LSHC-CT-2005-018911. The work on the primary colon tumors at VUMC was supported by the Dutch Cancer Society Grant KWF-VU 02-2618.
Jong, K., Marchiori, E., van der Vaart, A. et al. Cross-platform array comparative genomic hybridization meta-analysis separates hematopoietic and mesenchymal from epithelial tumors. Oncogene 26, 1499–1506 (2007). https://doi.org/10.1038/sj.onc.1209919
- array CGH
- chromosomal aberrations