DNA copy number amplification profiling of human neoplasms


DNA copy number amplifications activate oncogenes and are hallmarks of nearly all advanced tumors. Amplified genes represent attractive targets for therapy, diagnostics and prognostics. To investigate DNA amplifications in different neoplasms, we performed a bibliomics survey using 838 published chromosomal comparative genomic hybridization studies and collected amplification data at chromosome band resolution from more than 4500 cases. Amplification profiles were determined for 73 distinct neoplasms. Neoplasms were clustered according to the amplification profiles, and frequently amplified chromosomal loci (amplification hot spots) were identified using computational modeling. To investigate the site specificity and mechanisms of gene amplifications, colocalization of amplification hot spots, cancer genes, fragile sites, virus integration sites and gene size cohorts were tested in a statistical framework. Amplification-based clustering demonstrated that cancers with similar etiology, cell-of-origin or topographical location have a tendency to obtain convergent amplification profiles. The identified amplification hot spots were colocalized with the known fragile sites, cancer genes and virus integration sites, but global statistical significance could not be ascertained. Large genes were significantly overrepresented on the fragile sites and the reported amplification hot spots. These findings indicate that amplifications are selected in the cancer tissue environment according to the qualitative traits and localization of cancer genes.


Gene amplifications cause an increase in the gene copy number and subsequently elevate the expression of the amplified genes. The expression of amplified genes has been shown to increase significantly in breast and prostate cell lines and in primary tumors, but the number of overexpressed genes in the amplified chromosomal areas varies upon cancer tissue type (Pollack et al., 1999, 2002; Hyman et al., 2002). Amplification-activated human oncogenes include AKT2 in ovarian cancer, ERBB2 in breast and ovarian tumors, MYCL1 in small-cell lung cancer, MYCN in neuroblastoma, REL in Hodgkin's lymphoma, EGFR in glioma and non-small-cell lung cancer and MYC in various cancers (Futreal et al., 2004). Amplifications are frequently observed in advanced cancers, which have lost the p53-mediated maintenance of genomic integrity (Livingstone et al., 1992; Yin et al., 1992). Amplifications of various drug target genes have been observed to confer drug resistance in several in vitro experiments as well as in clinical studies (Wahl et al., 1979; Johnston et al., 1983; Goker et al., 1995; Shannon, 2002).

DNA copy number amplifications arise as multiplication of intra-chromosomal regions of 0.5–10 Mb in length. In contrast to the definition of amplification, DNA copy number increase in larger chromosomal areas or intact chromosomes, owing to translocations or aneuploidy, is defined as a gain (Lengauer et al., 1998). Amplifications produce excess gene copies in homogeneously staining regions (HSRs) and extrachromosomal acentric DNA fragments (double minutes and episomes) (Albertson et al., 2003). HSRs manifest as a ladder-like structure of inverted repeats within chromosomes (Schwab, 1998). Extrachromosomal amplicons appear as double minutes, which can be seen using conventional cytogenetics (Hahn, 1993), and episomes (250 bp in length), which are only detectable using molecular biology methods (Maurer et al., 1987; Graux et al., 2004). HSRs, double minutes and episomes may contain fused genetic material from different chromosomal loci (Guan et al., 1994; Graux et al., 2004). Models for pathways that result in gene amplification have been reviewed in detail by Myllykangas and Knuutila (2006). Amplification pathway models predict that two independent DNA double-stranded breaks at the margins of an amplified region are required to occur to enable the amplification of the intra-chromosomal DNA segment. The human genome is perturbed by specific labile domains, such as fragile sites, virus integration sites and large genes, which are more sensitive to DNA damage than other loci. Fragile sites are chromosomal regions, in which chromosomal breaks can be induced using replication-stalling chemicals. Based on their prevalence in the population, the fragile sites (119) are classified as common or rare. Eighty-eight common fragile sites are found in all individuals and 31 rare fragile sites in less than 5% of the population. 
Susceptibility to damage varies between the fragile sites, as 75% of the DNA damage lesions have been located to 22 common fragile sites (out of the 139 chromosomal loci with identified lesions) (Glover et al., 1984). Especially, common fragile sites are colocalized with amplifications (Hellman et al., 2002). WWOX (1.1 Mb) and FHIT (0.8 Mb) are particularly large genes and located in two of the most damage-prone fragile sites (16q23.1 and 3p14.2, respectively). WWOX and FHIT are frequently involved in DNA breaks. Besides fragile site expression, retroviral DNA insertion at specific genomic sites is believed to induce DNA breaks.


DNA copy number amplification profiles of human neoplasms

DNA copy number amplifications in human neoplasms were extracted from a publicly available data collection. A data matrix containing binary chromosome sub-band-specific information of DNA amplifications was generated and made available for download. The number of examined loci was 393 (the list of all analysed chromosomal loci is given in Appendix 1 in the Supplementary information). The assembled data matrix contained 4590 cases with 5740 coherent, amplified chromosomal regions. In all, 1 803 870 chromosome bands were analysed and 26 527 of them were labeled amplified. Appendix 2 in the Supplementary information presents an overview of the studied neoplasms (n=73) and DNA amplifications. The majority of the neoplasms were malignant tumors or hematologic malignancies, but non-malignant or pre-malignant lesions were also included in this study. All 73 neoplasm categories included cases with amplifications. Amplification profiling was carried out by calculating chromosome sub-band-specific amplification frequencies. Amplification profiles of all studied neoplasms are presented in Figure 1a, which shows amplification frequencies at chromosome band resolution with frequency values of individual neoplasms zoomed to the maximum. Neoplasms in Figure 1a are sorted according to the hierarchical clustering. Four main neoplasm clusters (Figure 1a) were identified according to the clustering analysis. Detailed amplification profiles were generated for 73 neoplasms (see Appendix 3 in the Supplementary information).
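The frequency calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the toy labels "A" and "B" and the three-band vectors are all hypothetical, and "zoomed to the maximum" is assumed to mean dividing each profile by its largest frequency.

```python
from collections import defaultdict

def amplification_profiles(cases):
    """cases: iterable of (neoplasm_label, binary_vector) pairs.
    Returns {label: [amplification frequency per chromosome band]}."""
    sums = {}
    counts = defaultdict(int)
    for label, vec in cases:
        if label not in sums:
            sums[label] = [0] * len(vec)
        for i, v in enumerate(vec):
            sums[label][i] += v
        counts[label] += 1
    return {lab: [s / counts[lab] for s in tot] for lab, tot in sums.items()}

def scale_to_max(profile):
    """Rescale a profile so that its largest frequency becomes 1
    (assumed reading of 'zoomed to the maximum')."""
    peak = max(profile)
    return [f / peak for f in profile] if peak > 0 else list(profile)

# Toy data (hypothetical): two neoplasm labels, three bands per case.
toy_cases = [
    ("A", [1, 0, 0]),
    ("A", [1, 1, 0]),
    ("A", [0, 0, 0]),
    ("A", [1, 0, 0]),
    ("B", [0, 0, 1]),
    ("B", [0, 1, 1]),
]
profiles = amplification_profiles(toy_cases)
scaled = {lab: scale_to_max(p) for lab, p in profiles.items()}

# Sanity check on the figures quoted in the text: cases x loci.
total_band_observations = 4590 * 393
amplified_fraction = 26527 / total_band_observations
```

The product 4590 × 393 reproduces the 1 803 870 analysed bands reported above, and the amplified fraction comes out at roughly 1.5%, consistent with the sparsity noted in the Materials and methods.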

Figure 1

DNA amplification profiling of human neoplasms and amplification hot spots in the human genome. (a) Overview of amplification profiles of all analysed neoplasms. Amplification incidences were calculated for each chromosome band and zoomed to the maximum value in each neoplasm. Hierarchical clustering was applied to sort the neoplasms. Four main clusters are marked (clusters 1–4). (b) Amplification hot spots were identified using ICA. The cutoff value for chromosome bands, which were included in the amplification hot spots, was set at 0.5. Hot spot temperature (values between 0.5 and 1) is presented using gray to black scaling. Stability indexes of the ICA factors (amplification hot spots) were assessed using bootstrapping analysis. Amplification hot spots were sorted by the stability indexes.

Amplification hot spots represent frequently amplified genomic domains

DNA copy number amplifications were found to be preferentially localized in the genome (Figure 1a and Appendix 3 in the Supplementary information). Therefore, amplification hot spots, the genomic loci that are prone to amplify in human neoplasms, were identified using computational modeling, specifically the independent component analysis (ICA). Chromosome bands with values >0.5 were included in the genome-wide ICA factors representing the amplification hot spots. Values between 0.5 and 1 within the amplification hot spots are shown using gray to black scaling. Figure 1b presents the top 30 identified amplification hot spots (10% of all identified independent factors) in the order of decreasing stability index, which was evaluated by bootstrapping the ICA factors. Besides factor 19, which embodies the whole chromosome 21, the longest coherent amplification hot spots encompassed single chromosome arms. Note that three factors are composed of two disjoint regions (factors 9, 12 and 27). The disjoint hot spots spanned across the centromeres of chromosomes 8, 17 and 5, respectively. Amplification hot spots did not cross chromosome boundaries, although this would have been conceivable in the continuous vector presentation of amplifications. The amplification hot spots are also presented in Table 1, which lists the fragile sites and cancer genes that colocalize with the amplification hot spots. The amplification hot spots covered 178 chromosome bands (45% of the genome).
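The band selection rule can be illustrated with a short Python sketch. It assumes a factor is available as a list of scaled values, one per chromosome band in genomic order; the example values and helper names are hypothetical and only show how a cutoff of 0.5 can yield a factor made of two disjoint regions.

```python
def hot_spot_bands(loadings, cutoff=0.5):
    """Indices of bands whose scaled factor value exceeds the cutoff."""
    return [i for i, v in enumerate(loadings) if v > cutoff]

def contiguous_regions(indices):
    """Group sorted band indices into (first, last) runs of adjacent bands."""
    regions = []
    for i in indices:
        if regions and i == regions[-1][1] + 1:
            regions[-1][1] = i
        else:
            regions.append([i, i])
    return [tuple(r) for r in regions]

# Hypothetical scaled factor over eight consecutive bands.
loadings = [0.1, 0.6, 0.9, 0.4, 0.0, 0.7, 1.0, 0.2]
bands = hot_spot_bands(loadings)
regions = contiguous_regions(bands)

# Coverage quoted in the text: 178 of the 393 examined bands.
coverage_percent = round(178 / 393 * 100)
```

In this toy factor the selected bands fall into two separate runs, which is the situation described for factors 9, 12 and 27.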

Table 1 Amplification hot spots of the human genome with colocalized fragile sites and cancer genes

Amplification profile-based clustering reveals relations between human neoplasms

DNA amplification profiling (Figure 1a) showed that amplifications were not randomly distributed in the human genome, but specific amplifications were observed in distinct neoplasms. Hierarchical clustering was performed to investigate whether there were amplification-based similarities between neoplasms. Figure 2 shows the correlation matrix and the dendrogram of the neoplasms. Four main neoplasm clusters are marked in the correlation matrix.

Figure 2

Hierarchical clustering of human neoplasms. Neoplasms were sorted using pairwise hierarchical clustering. In the correlation matrix, neoplasms (marked 1–73) appear in the same order on the x and y axis, and black lines separate the clusters. Four main clusters are marked (clusters 1–4). Linkage distances of paired neoplasms were used to join the best matching neoplasm pairs to form the dendrogram tree.

Associations between amplification hot spots, cancer genes, fragile sites, virus integration sites and gene size cohorts

The identification of amplification hot spots (Figure 1b) suggested that some structural genomic properties or labile genomic features might explain the site specificity of the amplifications. Localization of cancer genes, virus integration sites and fragile sites relative to amplification hot spots was evaluated to explain the mechanisms behind the amplification hot spot findings. Only one of the hot spot areas showed significant overrepresentation of cancer genes (8p23-q12; 8q23; difference of means 0.0227, P=0.0070). The same analysis performed with regard to the virus integration sites yielded four significant findings of hot spot areas with overrepresentation, namely 2q13–q36 (difference 0.0404, P<0.0001), 8q24.1–q24.3 (difference 0.0407, P=0.0020), 9q11–q34 (difference 0.1264, P<0.0001) and 18q11.2–q23 (difference 0.1975, P<0.0001). National Center for Biotechnology Information (NCBI)-derived fragile sites were not significantly overrepresented in any of the reported amplification hot spots. In a global statistical setting, the fragile sites, which had been extracted from the NCBI database or from a publication (Glover et al., 1984), were not significantly over- or underrepresented in the reported amplification hot spot areas when compared with non-amplified chromosomal regions. Similarly, when analysed on a global scale, neither cancer genes nor virus integration sites were significantly overrepresented on the reported amplification hot spots.

Associations between amplification hot spots, prevalence weighted fragile sites and gene size cohorts were tested to analyse whether the gene size affects chromosomal fragility or has influence on gene amplifications. Differences in proportions of genes located on amplification hot spots and non-amplified chromosomal regions as well as in fragile and non-fragile sites were measured for each gene size cohort, and statistical testing was applied to assess the significance of the findings. Gene size cohorts 0–100 kb (difference of means 0.234 and P-value <0.0000001), 50–150 kb (difference of means 0.206 and P-value=0.002), 250–350 kb (difference of means 0.204 and P-value <0.0000001) and 300–400 kb (difference of means 0.146 and P-value=0.001) were found to be associated with amplification hot spots. When the gene size cohorts were measured against the prevalence weighted fragile sites, the 200–300, 250–350, and 900 kb–1 Mb gene size samplings were found to be associated with the fragile sites (differences of means 0.000021, 0.000022 and 0.000103 and P-values 0.006, <0.0000001 and 0.002, respectively). The low number of genes in the test sets prevented both gene size assessments in cohorts >1.5 Mb.


Cytogenetic and molecular genetic markers have long been applied in cancer diagnostics. In recent years, it has become evident that molecular classification of tumors provides essential knowledge about the mechanisms of carcinogenesis and guides the development of targeted therapies and clinical practices. Although some leukemias have already been classified using molecular biology parameters, such classifications are generally lacking. Amplification profiles can be regarded as the molecular signature of the history of the neoplasm, because they are distributed across the entire genome in a fairly resolute manner. Moreover, amplifications are mechanistically associated with cancer progression and common in late-stage cancers of various organs. In this context, amplification profile-based molecular classification of neoplasms is well grounded. Naturally, amplification profiling is only one means to classify neoplasms and similar classification can be based on any other measurable molecular variable.

DNA copy number amplification data was collected in a bibliomics survey. Data mining by means of statistical modeling was applied to elicit knowledge from DNA copy number amplifications in human cancer. Amplification frequency profiles were identified for neoplasm-type-specific sample groups (Figure 1a and Appendix 3 in the Supplementary information). The method applied in the amplification profiling analysis assumes that a single measurement of significance only applies to that specific chromosome band ignoring the uncertainty resulting from multiple testing. Conservative multiple testing correction was not applied as the method was not used in hypothesis testing but in the context of data mining and knowledge discovery. In addition, the DNA amplification profiling tests were not independent (required by the correction algorithms) as the genomic map positions are interconnected. Thus, uncertainty of the amplification frequency profiling analyses is mainly reflected in the sample size of the respective classes (see Appendix 2 in the Supplementary information). Hierarchical clustering was used to identify neoplasm clusters based on amplification profiles (Figure 2) and ICA was applied to identify amplification hot spots (Figure 1b) that were subsequently probed with parameters related to carcinogenesis using hypothesis testing.

Neoplasm cluster 1 consists of miscellaneous cancers (Figure 2). Sarcomas are the main group of malignancies in cluster 1, but adenocarcinomas (breast and prostate) and also other types of neoplasms are present. No underlying unambiguously common denominators could be designated. It is worth mentioning that both breast and prostate adenocarcinomas have the same cell-of-origin (undifferentiated suprabasal stem cells) (Sell, 2004). Cluster 2 comprises two tight clusters of squamous cell carcinomas (head and neck, cervical, anal and oral carcinoma) and carcinomas of columnar epithelium (fallopian tube carcinoma, non-small-cell lung cancer and endometrial carcinoma). Notably, the non-small-cell lung cancer group also contains pulmonary squamous carcinoma (almost half of the cases), which strengthens the association of cluster 2 with squamous carcinomas. According to the dendrogram and correlation matrix analyses, clusters 1 and 2 are also related to each other.

The third neoplasm cluster consists of assorted sarcomas and carcinomas with preference to childhood solid tumors. Besides Ewing's sarcoma and Wilms' tumor, hepatoblastoma and renal cell carcinoma are tumors that usually affect young patients. A somewhat diverging clustering (from other neoplasms) of these adolescent cancers suggests that they also differ from adult cancers on the basis of the amplifications. Wilms' tumor, Ewing's sarcoma and hepatocellular carcinoma were tightly related within this cluster. Although Wilms' tumor and Ewing's sarcoma were the second most related pair of cancers in this research setting, they have not been considered to be closely related. Wilms' tumor usually arises in children and originates from residual embryonic stem cells (Sell, 2004). The cell-of-origin of Ewing's sarcoma is unknown but it has been postulated to originate from bone marrow stem cells (Torchia et al., 2003). It has been debated whether hepatocellular carcinoma is of bone marrow stem cell origin (Sell, 2004). In this context, Wilms' tumor, Ewing's sarcoma and hepatocellular carcinoma share similar pluripotent stem cell origin, which might explain the close amplification profile-based kinship. Histologically, Wilms' tumor and Ewing's sarcoma are small round cell tumors. In cluster 3, malignant fibrous histiocytoma (MFH), liposarcoma and malignant mesenchymoma of bone grouped together. This was expected, because they resemble each other histologically. Interestingly, melanomas of the skin and eye were totally separated in this evaluation, even though they are thought to be related.

The fourth main neoplasm cluster contains adenocarcinomas of the gastrointestinal tract. Gastric and colorectal adenocarcinomas were closely related, which might be explained by the fact that both of these cancers originate from functionally anchored stem cells of the crypts (Sell, 2004). All gastrointestinal adenocarcinomas clustered together, with the exception of pancreatic adenocarcinoma. Such gender-specific adenocarcinomas as breast cancer and prostate cancer were clearly outside this main adenocarcinoma cluster. In general, adenocarcinomas were scattered, with the exception of gastrointestinal adenocarcinomas, which suggests that adenocarcinomas are a highly heterogeneous group with concordant subclustering based on gender, topography, progenitor cell or surrounding tissue function. The gastrointestinal tract cluster was more related to clusters 1 and 2 than to cluster 3, which overrepresented the childhood solid tumors.

Instead of forming clear clusters, the rest of the neoplasms appeared as highly related cancer pairs and triplets or as single outliers. Small-cell lung cancer and neuroendocrine lung cancers are regarded as related, and they clustered together in this setting as well. Pituitary adenoma (not neoplastic), adrenocortical carcinoma and thyroid carcinoma clustered together, but neuroendocrine lung carcinoma was clearly separated from the rest of the endocrine tumors. Germline tumors were found to be connected. Retinoblastoma and Merkel cell carcinoma were clearly related and differed considerably from the other studied neoplasms. These two tumors have not been reported to be related previously.

ICA was used to identify amplification hot spots, the frequently amplified chromosome bands. Similar factor-based methods are applicable as well as other methods, for example, the Hidden Markov Model. Amplification hot spots were congruent with the amplification profiles (Figure 1). For example, 11q13 amplification was observed in the majority of the cancers. 8q24.1–q24.3 peaked in the amplification profiles in addition to the other chromosome 8 amplifications, and also 1q, 3q, 5p, 12p, 12q and 17q amplifications stood out in the amplification profile presentation (Figure 1). The amplification hot spots were related to known amplified genes. The reported amplification hot spots harbor such amplification-activated oncogenes as EGFR, MYCN, MYC and ERBB2 (Futreal et al., 2004) (Table 1). The RUNX1 (alias AML1) gene amplification (gene locus 21q22.3 maps to amplification hot spot 27) is typical in a variety of hematologic malignancies (Roumier et al., 2003). 11q23-qter is typically amplified in AML but rarely in other tumors (Zatkova et al., 2004). The BCL2 (18q21.3) amplification is typical in B-cell non-Hodgkin's lymphoma (Monni et al., 1999). 9q11–q34 amplification was a somewhat unsuspected finding among the top 10% hot spots, because the ABL1 oncogene (located in 9q34) is usually activated by translocations in hematologic malignancies (CML and AML) and not by amplification. Recently, ABL1 was shown to be amplified in T-cell acute lymphoblastic leukemia (Bernasconi et al., 2005). 20p13–p11.1 amplification was peculiar among the hot spots as no known cancer genes are located in this region (Table 1). Likewise, the 5p15.3–p15.1 amplification hot spot diverged from the other hot spots, because the region is not known to harbor any fragile sites or cancer genes (Table 1). Amplifications at 13q, 19 and 20q would be expected to appear among the amplification hot spots.
These amplifications did not break into the top 10%, but they were seen in the individual cancer profiles. For example, 20q amplifications were frequent in gastric cancer and colorectal cancer, and 19q (the AKT2 gene is located at 19q13.1–q13.2) contained amplifications in the majority of neoplasms (Figure 1). The 2p13 locus of the REL gene is also missing from the amplification hot spot list. REL is frequently amplified in some lymphomas (Futreal et al., 2004). Amplifications in 3q, 5, 6p, 8q, 11q13, 15q, 17q, 21, 22 and X are typical of cancers in general. In addition to these general amplifications, cancer-specific amplifications emerged in the global analysis of amplification hot spots. 2p25–p23 amplification (containing the MYCN gene) is specific to neuroblastoma (Schwab, 2004), 12p amplifications frequently appear in embryonal tumors (teratomas) and liposarcomas (Looijenga and Oosterhuis, 1999; Rieker et al., 2002), 17p amplifications are typical of some sarcomas (osteosarcoma, MFH and leiomyosarcoma) (El-Rifai et al., 1998; Weng et al., 2003; Atiye et al., 2005) and 1q21–q24 amplifications are frequent in osteosarcomas (Ozaki et al., 2002). Our findings suggest that some amplification hot spots can be used as clinical markers for cancer diagnostics and confirm the general conception that DNA amplifications are characteristic of malignant tumors, whereas benign tumors show no amplifications. Our findings clearly indicate that DNA amplifications may be applied in the differential diagnosis of small round cell tumors: 2p amplification (hot spot 4) in neuroblastoma, 17p (hot spot 6) in osteosarcoma, 18q (hot spot 18) in lymphoma, and 1q (hot spots 17 and 24), 12 (hot spots 2, 21 and 30) and 8 (hot spots 9, 14 and 16) in Ewing's sarcoma. More than 10 malignant small round cell tumors exist; they mainly affect children and adolescents, differ in prognosis and treatment, and are difficult to differentiate.

Colocalization of gene size cohorts and amplification hot spots or fragile sites was studied using statistical modeling to assess the associations between gene size and DNA amplifications and chromosomal fragility. Gene sizes between 0 and 150 kb (92.32% of all genes) and 250 and 400 kb (2.5% of all genes) were found to be overrepresented in the amplification hot spot loci when compared with a random reference. Genes between 200 and 350 kb (2% of all genes) and 900 kb and 1 Mb (0.09% of all genes) were overrepresented in the prevalence weighted fragile sites. These gene size cohorts represent a minority in the human gene population, suggesting that genes in the fragile chromosomal regions are generally quite large. Interestingly, the genes that were between 900 kb and 1 Mb in length were associated with fragile sites, supporting the hypothesis that extra large genes are breakage prone, even though neither of the known fragile genes (WWOX or FHIT) was included in this gene set. Noteworthy is that the genes, which were 250–350 kb in length, showed the best association scores in both fragile site and amplification hot spot assessments. These genes represent 1.5% of the human genome and are lengthwise among the top 3.8%. These findings suggest that large genes might be involved in the amplification process and genomic fragility.

Associations between labile genomic features and specific amplification hot spots were statistically analysed. Only one amplification hot spot factor (8p23-q12; 8q23) (amplification hot spot no. 9, Figure 1b and Table 1) was enriched with cancer genes when compared to a random reference. This finding may be explained by overrepresentation of cancer genes in this chromosomal location. Virus integration sites were overrepresented in four amplification hot spots, at 2q13–q36, 8q24.1–q24.3, 9q11–q34 and 18q11.2–q23. None of the amplification hot spots were enriched with fragile sites and most of the amplification hot spot findings cannot be explained by cancer gene accumulation or virus integration site colocalization. Evaluation of general associations revealed no statistical overrepresentation of fragile sites, virus integration sites or cancer genes on the reported amplification hot spots when compared with non-amplified chromosomal regions. Even when the expression intensity of the common fragile site (measured as a lesion count) (Glover et al., 1984) was taken into account, no global association was observed.

Although no statistical associations were observed in the global setting, most of the amplification hot spots contain many cancer genes and fragile sites (Table 1), suggesting that these sites might be selected to amplify because of the qualitative properties of the cancer genes and facilitation conducted by the fragile site expression. Furthermore, the resolution of the microscopic data used in this study is inadequate to enable thorough investigation of the relationship between genomic fragility and amplifications, because amplification hot spots span 45% of the human genome and fragile sites are located in 30% of the chromosome bands. Thus, further studies using molecular biology resolution are needed to address the colocalization of amplification breakpoint regions and fragile DNA sequence features. Amplification breakpoint regions need to be mapped using genome-wide array comparative genomic hybridization (aCGH) and cloning is essential in determining fragile sequences (Buttel et al., 2004). A cross-cancer database of aCGH results linked with labile genomic features is instrumental for systems biology oriented data mining and research of amplification mechanisms.

This study was restricted to amplifications because identification of DNA copy number gains and losses by conventional CGH is relatively unreliable. Nonetheless, it is well accepted that DNA copy number aberrations owing to deletions and low-level gains are also of great importance in cancer pathology. The profiling analysis was restricted to group averages, even though quantitative data for a single case might also be important in deciphering tumor-maintaining cancer gene networks. It is also recognized that cancer cell genome is variable and tumor tissue is not homogeneous with regard to tissue architecture or function. DNA copy number losses as well as minor gains, single case experiments and cancer genome plasticity are elements that can be studied more explicitly using the proposed aCGH approach.

In conclusion, the amplification profiling analyses showed that cancers with similar cell-of-origin, histology or topographical location obtained convergent amplification profiles, suggesting that amplifications are selected in the cancer tissue environment according to the qualitative traits and localization of cancer genes. The amplification patterns seem to be too complex for very simple mechanistic explanations, although some associations of colocalization between labile genomic features and DNA amplifications were observed. The amplification profiling results demonstrate the independent value of DNA copy number amplifications in cancer pathology. The identified amplification hot spots may have clinical value in differential diagnostics of some tumors. These findings encourage the development of molecular classification based on amplification profiling to supplement the conventional, histology-based classification of tumors. Subtyping of tumors based on amplification signatures would be advantageous, because characterization based on molecular properties is more relevant in the clinical setting and enables development of approaches directed at biologically active targets in cancer. Knowledge and comparison of molecular pathology of tumors are also essential in designing innovative clinical practices, such as multi-cancer therapies, and flexible diagnostic and prognostic applications.

Materials and methods

DNA copy number amplification data

The CGH data was obtained from a publicly available data collection containing 23 284 cases, which have been collected by inspecting 838 original research articles of chromosomal CGH studies published in peer-reviewed journals between 1992 and 2002 (Knuutila et al., 2000). Amplifications were determined from the CGH data collection. A cutoff value of 1.5 in the CGH measurement ratios or definitions given by the authors were applied when selecting the amplified chromosome bands. Amplification data for each case in the data collection was presented as a vector of zeros and ones (not amplified or amplified chromosome bands). Chromosome sub-band resolution was used to map the DNA copy number amplifications to the genome (ISCN, 1995). Vectors of amplification observations in chromosome sub-bands were assembled into a data matrix.
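The binarization step can be sketched as follows. This is an illustrative assumption-laden example: the function name, the toy ratios and the choice of an inclusive comparison (ratio ≥ 1.5) are not taken from the source, which states only that a cutoff value of 1.5 was used where author definitions were not applied.

```python
AMPLIFICATION_CUTOFF = 1.5  # CGH ratio cutoff stated in the text

def to_binary_vector(ratios, cutoff=AMPLIFICATION_CUTOFF):
    """Map per-band CGH ratios to 1 (amplified) or 0 (not amplified).
    Using >= rather than > is an assumption of this sketch."""
    return [1 if r >= cutoff else 0 for r in ratios]

# Hypothetical ratios for two cases over four bands.
case_ratios = [
    [1.0, 1.5, 2.1, 0.7],
    [1.2, 1.1, 1.6, 1.0],
]
# One row per case, one column per band.
data_matrix = [to_binary_vector(r) for r in case_ratios]
```

Each row of `data_matrix` corresponds to one case, matching the vector-per-case layout described above.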

Amplification profiling of human neoplasms

Each case in the amplification data matrix was classified according to the World Health Organization (WHO) guidelines. The cases were then grouped according to the neoplasm classifications. An amplification profile for each neoplasm (n=73) was defined by calculating chromosome band-specific and chromosome arm-specific frequencies of amplification observations. For reference, similar amplification profiles were calculated for all neoplasms excluding the neoplasm that was studied. To define characteristic amplifications of each neoplasm, differences between the amplification profiles of studied neoplasms and the reference were analysed using a hypothesis test. Significance of the difference between the studied neoplasm and the reference profiles was analysed using a permutation test (1000 permutations) (Good, 2000). The significance threshold was set at P=0.01.
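A permutation test of this kind can be sketched for a single chromosome band as follows. The test statistic (absolute difference in amplification frequency), the two-sided form, the add-one correction of the P-value and the fixed random seed are assumptions of this sketch; the source specifies only that 1000 permutations and a threshold of P=0.01 were used.

```python
import random

def frequency(values):
    """Fraction of cases in which the band is amplified."""
    return sum(values) / len(values)

def permutation_p_value(group, reference, n_perm=1000, seed=0):
    """Two-sided permutation test for a difference in amplification
    frequency between one neoplasm and the pooled reference cases."""
    rng = random.Random(seed)
    observed = abs(frequency(group) - frequency(reference))
    pooled = list(group) + list(reference)
    n = len(group)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign cases to the two groups at random
        diff = abs(frequency(pooled[:n]) - frequency(pooled[n:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical band: amplified in 15 of 20 studied cases, 5 of 100 reference cases.
studied = [1] * 15 + [0] * 5
others = [1] * 5 + [0] * 95
p_value = permutation_p_value(studied, others)
is_characteristic = p_value < 0.01
```

With 1000 permutations the smallest attainable P-value under this correction is 1/1001, so the P=0.01 threshold remains resolvable.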

Clustering of human neoplasms based on amplification profiles

Hierarchical clustering was applied to evaluate the amplification-based similarity between neoplasms. The average values of neoplasm-specific groups were used in the clustering. A pair-wise correlation matrix was calculated based on the neoplasm group-specific amplification profiles, and a dendrogram was built using average linkage distance between the neoplasms. Four coherent clusters were identified based on the dendrogram and the correlation matrix.
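The clustering procedure can be illustrated with a small pure-Python sketch. It assumes that the distance between two neoplasms is one minus the Pearson correlation of their profiles, which the source does not state explicitly; the labels N1 to N3 and their profile values are hypothetical.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation of two equal-length profiles (non-constant vectors)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def average_linkage(profiles):
    """Agglomerative clustering with average linkage.
    Returns the merges as (cluster_a, cluster_b, distance) in merge order."""
    names = list(profiles)
    # Assumed distance: 1 - correlation.
    dist = {(a, b): 1 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}

    def linkage(c1, c2):
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)  # closest pair of clusters
        merges.append((clusters[i], clusters[j], best))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical amplification profiles over four bands.
toy_profiles = {
    "N1": [0.0, 0.1, 0.6, 0.5],
    "N2": [0.0, 0.2, 0.5, 0.6],
    "N3": [0.7, 0.1, 0.0, 0.0],
}
merges = average_linkage(toy_profiles)
```

The merge order corresponds to the dendrogram read from the leaves upward: the two similar profiles join first, and the dissimilar one joins last at a larger linkage distance.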

Identification of amplification hot spots

Amplification hot spots were identified computationally using chromosome band resolution and case-specific data without neoplasm information. Independent component analysis (ICA) was applied to identify independent factors in the data set. The ICA factor model for identifying candidate amplification hot spots assumes that an observed pattern of amplifications, vector x, is formed as a sum of basis vectors (the columns of matrix A) triggered by independent latent variables in vector s, that is, x=As. One basis vector represents a hot spot. In this model, a hot spot need not be contiguous with respect to the spatial location on bands.

As the amplification data is binary, we used the threshold method described previously (Himberg and Hyvärinen, 2001). Himberg and Hyvärinen (2001) show that the linear ICA can be successfully used for ‘sparse’ data even if A, x and s are binary, and the mixing model contains a post-mixture threshold function U, that is, x=U(As). The amplification data is sparse (less than 5% ones among zeros).
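
The role of the post-mixture threshold can be seen in a small worked example. The matrix, the source vector and the threshold level of 0.5 are hypothetical choices made for illustration; the sketch shows only the generative model x=U(As), not the estimation of A from data.

```python
def mix(A, s):
    """Compute the linear mixture As for a bands x sources matrix A."""
    return [sum(a_ij * s_j for a_ij, s_j in zip(row, s)) for row in A]


def threshold(values, level=0.5):
    """Post-mixture threshold U: 1 where the mixture exceeds `level`.

    The level of 0.5 is an assumption made for this illustration.
    """
    return [1 if v > level else 0 for v in values]


# Two hypothetical hot spots over five bands (columns of A are basis vectors).
A = [
    [1, 0],
    [1, 0],
    [0, 0],
    [0, 1],
    [1, 1],  # a band shared by both hot spots
]
s = [1, 1]             # both latent sources are active in this case
linear = mix(A, s)     # [1, 1, 0, 1, 2]: the shared band sums to 2
x = threshold(linear)  # [1, 1, 0, 1, 1]: binary again after thresholding
```

Without U, the band shared by both hot spots would take the value 2, which cannot occur in binary amplification data.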

To reduce noise, the dimensionality was lowered using principal component analysis (Hyvärinen et al., 2001). We selected the 60 components with the largest eigenvalues. Then, FastICA was run 30 times with different initial conditions and the data were bootstrapped (Himberg et al., 2004). The contrast function was based on skewness. The mixing matrix A was then scaled and thresholds were set (Himberg and Hyvärinen, 2001). The components were ordered according to the stability index defined by Himberg et al. (2004), and the 30 basis vectors corresponding to the most stable components were selected.

Statistical evaluation of colocalization of amplification hot spots, fragile sites, virus integration sites, cancer genes and gene size cohorts

Elemental factors contributing to the induction of amplifications were tested in a global statistical setting to investigate the underlying mechanisms. Amplification hot spots were extracted as explained above. Fragile sites and virus integration sites were adopted from the NCBI LocusLink (National Center for Biotechnology Information, Bethesda, MD, USA), and lesion prevalence in the common fragile sites was derived from a previous publication (Glover et al., 1984). A census of human cancer genes was published by Futreal et al. (2004), and an updated list of cancer genes was retrieved from the Cancer Genome Project web site. When fragile sites, virus integration sites and cancer genes were matched with amplifications, their genomic locations were reported at the same chromosome sub-band resolution as the amplification data.

Enrichment of cancer genes, fragile sites (NCBI derived) and virus integration sites on specific amplification hot spots was measured by counting the factor-specific observations and comparing the results against a random reference. In addition, the distribution of fragile sites (NCBI annotated as well as prevalence weighted), cancer genes and virus integration sites between amplification hot spots and non-amplified chromosome regions was measured. The significance of the differences in the observations was defined using a hypothesis test in a non-parametric setting. P-values for the differences in observations were assigned according to the empirical distribution generated using a permutation test.
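
One way to compare such counts against a random reference is sketched below. The null model (random band sets of the same size as the hot spot), the one-sided alternative and all names and data are assumptions; the text does not specify how the random reference was constructed.

```python
import random


def enrichment_p_value(hot_spot_bands, feature_bands, all_bands,
                       n_permutations=1000, seed=0):
    """Count feature-carrying bands in a hot spot and estimate how often a
    random band set of the same size carries at least as many."""
    rng = random.Random(seed)
    feature = set(feature_bands)
    hot_spot = set(hot_spot_bands)
    observed = len(hot_spot & feature)
    hits = 0
    for _ in range(n_permutations):
        random_bands = rng.sample(all_bands, len(hot_spot))
        if len(feature.intersection(random_bands)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical genome of 100 bands; 10 carry the feature (for example a
# fragile site) and the hot spot spans 8 bands, 6 of them with the feature.
all_bands = ["band_%d" % i for i in range(100)]
feature_bands = all_bands[:10]
hot_spot_bands = all_bands[:6] + all_bands[50:52]
observed, p_value = enrichment_p_value(hot_spot_bands, feature_bands, all_bands)
```

The same function can be applied to cancer genes, fragile sites or virus integration sites by changing `feature_bands`.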

In order to test whether gene size contributes to site specificity of chromosomal fragility or amplifications, the distribution of gene size cohorts between fragile sites (NCBI and prevalence weighted) and non-fragile regions as well as amplification hot spots and non-amplified chromosomal regions was measured. The human genes were collected using the Ensembl services (Birney et al., 2004) and sorted by length. Genomic size-based gene cohorts were extracted by selecting windows of 100 kb in length with an overlap of 50 kb, that is, 0–100, 50–150, 100–200 kb, etc. Significance of the differences in means was evaluated using a non-parametric permutation test similar to the method described above.
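
The overlapping size windows can be generated as in the following sketch. The gene names and lengths are hypothetical, and treating the windows as half-open intervals is an assumption; the window length of 100 kb and the overlap of 50 kb come from the text.

```python
def size_cohorts(gene_lengths_kb, window_kb=100, step_kb=50):
    """Group genes into overlapping length windows: 0-100, 50-150, 100-200 kb...

    `gene_lengths_kb` maps gene name -> genomic length in kb. Windows are
    half-open [start, start + window_kb), which is an assumption.
    """
    longest = max(gene_lengths_kb.values())
    cohorts = {}
    start = 0
    while start <= longest:
        end = start + window_kb
        cohorts[(start, end)] = sorted(
            name for name, length in gene_lengths_kb.items()
            if start <= length < end
        )
        start += step_kb
    return cohorts


# Hypothetical gene lengths in kb.
genes = {"gene_a": 20, "gene_b": 75, "gene_c": 120, "gene_d": 260}
cohorts = size_cohorts(genes)
```

Because consecutive windows overlap by 50 kb, most genes fall into two cohorts, as `gene_b` does here.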


  1. Albertson DG, Collins C, McCormick F, Gray JW . (2003). Nat Genet 34: 369–376.

  2. Atiye J, Wolf M, Kaur S, Monni O, Bohling T, Kivioja A et al. (2005). Genes Chromosomes Cancer 42: 158–163.

  3. Bernasconi P, Calatroni S, Giardini I, Inzoli A, Castagnola C, Cavigliano PM et al. (2005). Cancer Genet Cytogenet 162: 146–150.

  4. Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L et al. (2004). Genome Res 14: 925–928.

  5. Buttel I, Fechter A, Schwab M . (2004). Ann NY Acad Sci 1028: 14–27.

  6. El-Rifai W, Sarlomo-Rikala M, Knuutila S, Miettinen M . (1998). Am J Pathol 153: 985–990.

  7. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R et al. (2004). Nat Rev Cancer 4: 177–183.

  8. Glover TW, Berger C, Coyle J, Echo B . (1984). Hum Genet 67: 136–142.

  9. Goker E, Waltham M, Kheradpour A, Trippett T, Mazumdar M, Elisseyeff Y et al. (1995). Blood 86: 677–684.

  10. Good P . (2000). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag: Berlin.

  11. Graux C, Cools J, Melotte C, Quentmeier H, Ferrando A, Levine R et al. (2004). Nat Genet 36: 1084–1089.

  12. Guan XY, Meltzer PS, Dalton WS, Trent JM . (1994). Nat Genet 8: 155–161.

  13. Hahn PJ . (1993). Bioessays 15: 477–484.

  14. Hellman A, Zlotorynski E, Scherer SW, Cheung J, Vincent JB, Smith DI et al. (2002). Cancer Cell 1: 89–97.

  15. Himberg J, Hyvärinen A . (2001). In: Lee T-W, Jung T-P, Makeig S, Sejnowski TJ (eds). Third International Workshop on Independent Component Analysis and Blind Signal Separation (ICA2001). Institute for Neural Computation, University of California San Diego: San Diego, CA, USA, pp 552–556.

  16. Himberg J, Hyvärinen A, Esposito F . (2004). Neuroimage 22: 1214–1222.

  17. Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E et al. (2002). Cancer Res 62: 6240–6245.

  18. Hyvärinen A, Oja E, Karhunen J . (2001). Independent Component Analysis. John Wiley & Sons: New York.

  19. ISCN (1995). An International System for Human Cytogenetic Nomenclature. S Karger: Basel.

  20. Johnston RN, Beverley SM, Schimke RT . (1983). Proc Natl Acad Sci USA 80: 3711–3715.

  21. Knuutila S, Autio K, Aalto Y . (2000). Am J Pathol 157: 689.

  22. Lengauer C, Kinzler KW, Vogelstein B . (1998). Nature 396: 643–649.

  23. Livingstone LR, White A, Sprouse J, Livanos E, Jacks T, Tlsty TD . (1992). Cell 70: 923–935.

  24. Looijenga LH, Oosterhuis JW . (1999). Rev Reprod 4: 90–100.

  25. Maurer BJ, Lai E, Hamkalo BA, Hood L, Attardi G . (1987). Nature 327: 434–437.

  26. Monni O, Franssila K, Joensuu H, Knuutila S . (1999). Leukemia Lymphoma 34: 45–52.

  27. Myllykangas S, Knuutila S . (2006). Cancer Lett 232: 79–89.

  28. Ozaki T, Schaefer KL, Wai D, Buerger H, Flege S, Lindner N et al. (2002). Int J Cancer 102: 355–365.

  29. Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF et al. (1999). Nat Genet 23: 41–46.

  30. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE et al. (2002). Proc Natl Acad Sci USA 99: 12963–12968.

  31. Rieker RJ, Joos S, Bartsch C, Willeke F, Schwarzbach M, Otano-Joos M et al. (2002). Int J Cancer 99: 68–73.

  32. Roumier C, Fenaux P, Lafage M, Imbert M, Eclache V, Preudhomme C . (2003). Leukemia 17: 9–16.

  33. Schwab M . (1998). Bioessays 20: 473–479.

  34. Schwab M . (2004). Cancer Lett 204: 179–187.

  35. Sell S . (2004). Crit Rev Oncol Hematol 51: 1–28.

  36. Shannon KM . (2002). Cancer Cell 2: 99–102.

  37. Torchia EC, Jaishankar S, Baker SJ . (2003). Cancer Res 63: 3464–3468.

  38. Wahl GM, Padgett RA, Stark GR . (1979). J Biol Chem 254: 8679–8689.

  39. Weng WH, Ahlen J, Lui WO, Brosjo O, Pang ST, Von Rosen A et al. (2003). Br J Cancer 89: 720–726.

  40. Yin Y, Tainsky MA, Bischoff FZ, Strong LC, Wahl GM . (1992). Cell 70: 937–948.

  41. Zatkova A, Ullmann R, Rouillard JM, Lamb BJ, Kuick R, Hanash SM et al. (2004). Genes Chromosomes Cancer 39: 263–276.


This research was supported by grants from the Finnish Academy (SYSBIO Research Program), the Helsinki University Central Hospital Research Funds and the Sigrid Jusélius Foundation. The University of Helsinki, the Helsinki University of Technology and the HUSLAB provided the research facilities and the equipment used in this study.

Correspondence to S Myllykangas.

Additional information

Supplementary Information accompanies the paper on the Oncogene website.

Myllykangas, S., Himberg, J., Böhling, T. et al. DNA copy number amplification profiling of human neoplasms. Oncogene 25, 7324–7332 (2006).

  • cancer
  • gene amplification
  • fragile site
  • bioinformatics
  • data mining
  • molecular pathology
