Introduction

Comparative genomic hybridization (CGH) is a method designed for identifying chromosomal segments with copy number aberration. Differentially labeled genomic DNA samples are competitively hybridized to chromosomal targets, where copy number balance between the two samples is reflected by their signal intensity ratio. Since its development in the early 1990s, a great deal of effort has been devoted to improving the resolution of this technology. The use of DNA targets immobilized in an array format, replacing the conventional metaphase chromosome spreads, represents a significant advance.1, 2 Traditionally, the resolution of array-based CGH has been defined by the genomic distance between each DNA target represented on the array.3, 4 Pollack et al5 extended this technology to facilitate high-resolution genome-wide survey of segmental alterations by using cDNA microarrays for CGH analysis. The development of a whole-genome bacterial artificial chromosome (BAC) tiling path array has further improved the resolution of CGH.6 This article describes the various types of genomic arrays, highlights their application in identifying genetic alterations in cancer and genetic diseases, and summarizes computational software used in the visualization and analysis of array CGH data.

Genomic microarrays

Although numerous platforms have been developed to support array CGH studies, they all revolve around a common principle of detecting copy number alterations between two samples (Figure 1). These platforms vary in terms of the size of the genomic elements spotted and their coverage of the genome. This section characterizes commonly used genome-wide approaches (Supplemental Table 1) and highlights their relevant features.

Figure 1

Principles of array comparative genomic hybridization. (a) Sample and reference DNA are differentially labeled with fluorescent dyes (typically cyanine-3 and cyanine-5), combined, and cohybridized to a microarray containing spots of genomic material. The sample and reference competitively bind to the spots, and the resulting fluorescence intensity ratios reflect their relative quantities. (b) Whole-genome idiogram of a small cell lung cancer cell line hybridized against a normal male reference on the submegabase resolution tiling array. Each black dot represents a single BAC clone spotted on the array. The red, purple and green vertical lines adjacent to each chromosome represent log2 fluorescence ratios of 0.5, 0 and −0.5, respectively. (c) Magnified view of a high-level amplification at the c-Myc oncogene locus at 8q24.21 in the small cell lung cancer cell line.
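The ratio principle summarized in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular scanner software: the spot names and intensities are made up, and the ±0.5 cutoffs simply reuse the guide lines drawn in panel (b) as an assumption rather than a calibrated threshold.

```python
import math

def log2_ratio(sample_intensity, reference_intensity):
    """Return log2(sample / reference) for a single array spot."""
    return math.log2(sample_intensity / reference_intensity)

def call_spot(ratio, gain_cutoff=0.5, loss_cutoff=-0.5):
    """Classify a log2 ratio; the cutoffs are illustrative assumptions."""
    if ratio >= gain_cutoff:
        return "gain"
    if ratio <= loss_cutoff:
        return "loss"
    return "balanced"

# Hypothetical spots: (sample intensity, reference intensity)
spots = {
    "cloneA": (2400.0, 1200.0),  # sample twice as bright as reference
    "cloneB": (1000.0, 1000.0),  # equal intensities
    "cloneC": (500.0, 1000.0),   # sample half as bright as reference
}
calls = {name: call_spot(log2_ratio(s, r)) for name, (s, r) in spots.items()}
```

In practice the raw intensities would first be background-corrected and normalized; those steps are omitted here.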

Genome-wide marker-based arrays

The genome-wide approach to array CGH was introduced using cDNA microarrays, which were originally used in gene expression profiling.5 One advantage of this technique is that high-level amplifications and deletions can be directly correlated to expression changes using the same platform.5, 7 However, only exonic regions of the genome are covered by the cDNA targets, making alterations to promoter regions and other protein binding sites undetectable. New generation cDNA arrays consist of exon-specific targets.8

Marker-based large insert clone (LIC) arrays sample the genome at megabase intervals, typically covering about 10% of the genome9, 10, 11, 12 (Figure 2a). Despite this partial coverage, these arrays are often labeled as ‘high resolution’ relative to classical chromosomal CGH analysis. The main advantages of these genome-wide arrays are that LICs, such as BACs, provide robust targets for sensitive detection of hybridization signals and that BACs are not limited to loci annotated with genes. The size of the arrayed elements also provides a higher signal to noise ratio than platforms using smaller targets, because signal intensities increase with the complexity of the DNA spotted.13 Thus, BAC-based platforms allow highly sensitive and reproducible detection of a wide range of copy number changes including single copy number gains and losses, homozygous deletions and high-level amplifications.13

Figure 2

Genomic array and sample labeling design. (a) Marker-based (top) and tiling path (bottom) approaches to array design. Marker-based approaches sample the genome at intervals, while the tiling path approach improves resolution by using overlapping clones. (b) Genomic representation (top) and whole-genome (bottom) approaches to sample labeling. The representational approach enriches for short restriction fragments by linker-mediated PCR amplification, while whole-genomic labeling typically involves random priming of the genomic DNA sample without complexity reduction.

Use of single nucleotide polymorphism oligonucleotide arrays in CGH

Arrays of photolithographically synthesized short oligonucleotides (21–25 nucleotides in length) originally designed for detecting single nucleotide polymorphisms (SNP), have been recruited for use in copy number assessment in CGH experiments.14 In a method known as whole-genome sampling assay (WGSA), linker-mediated PCR is performed on the sample DNA to enrich for small XbaI restriction fragments throughout the genome in order to reduce sample complexity prior to hybridization15 (Figure 2b). Although the reduced sample no longer represents the entire genome, this process decreases the probability of crosshybridization to multiple short oligonucleotide targets on the array, effectively decreasing nonspecific signals.14 The strength of this strategy is its ability to relate copy number and allelic status at selected loci.

Using cancer cell lines, Bignell et al14 compared the performance of WGSA coupled with an array consisting of 8473 predicted SNPs with that of conventional BAC array CGH hybridized with the same samples. Although high-level amplifications and homozygous deletions were evident, the oligonucleotide array showed greater variation than the BAC array in the detection of single copy gains and losses.14 In another study, Zhao et al16 compared SNP, cDNA and BAC arrays for their ability to detect copy number changes in the breast cancer cell line BT474. These array platforms detected a similar but not identical pattern of alterations across the genome, with SNP results showing 70% similarity to BAC arrays and 62% similarity to the cDNA method.16 The BAC arrays showed the highest signal to noise ratios, making them better suited to detect single copy alterations.16 However, the SNP arrays allow copy number changes and genotype to be measured in a single experiment. The recent development of high-density SNP arrays, for example the Affymetrix GeneChip Mapping 100 K array, will improve copy number assessment at the SNP loci residing within the genome-wide linker-mediated PCR amplified restriction fragments.17

Long oligonucleotide arrays

Increasing the length of the target oligonucleotide aims to improve hybridization specificity. Unlike SNP array platforms, arrays of spotted oligonucleotides, typically 60–70 mers in length, are used to directly assay genomic DNA samples without the need for a complexity reduction step prior to hybridization.18, 19, 20 The ratios detected on a genome-wide scale have been reported to be comparable to BAC arrays in the magnitude of signal and background noise.18 Recently, a tiling path oligonucleotide array with 6 kb median probe spacing was utilized to analyze chromosomal breakpoints in neuroblastoma.21 However, these array platforms typically require the calculation of a moving average to observe single copy changes, which may decrease their effective resolution.19 Future studies are needed to determine the effectiveness of these techniques for use with archival clinical samples.
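The trade-off between smoothing and resolution can be made concrete with a short sketch. The function names and the example ratios below are hypothetical, and the span calculation assumes evenly spaced probes, which real arrays only approximate.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 smoothed points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    smoothed = [running / window]
    for i in range(window, len(values)):
        # Slide the window one probe to the right
        running += values[i] - values[i - window]
        smoothed.append(running / window)
    return smoothed

def window_span_kb(probe_spacing_kb, window):
    """Approximate genomic span covered by one window of evenly spaced probes."""
    return probe_spacing_kb * (window - 1)

# Hypothetical log2 ratios: three balanced probes followed by three gained probes
ratios = [0.0, 0.0, 0.0, 0.6, 0.6, 0.6]
smoothed = moving_average(ratios, window=3)

# With 6 kb probe spacing, a 5-probe window averages over roughly 24 kb
span = window_span_kb(6, 5)
```

The sharp step in the raw ratios becomes a gradual ramp after smoothing, which is why a breakpoint can no longer be placed more precisely than the span of the window.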

Sample genome complexity reduction was combined with hybridization to long oligonucleotide arrays in a method called representational oligonucleotide microarray analysis (ROMA). A genomic DNA sample is cleaved with a methylation insensitive restriction enzyme (usually BglII) followed by linker-mediated PCR which enriches for fragments <1.2 kb in length.22 This results in a low complexity representation comprising ∼2.5% of the genome, which improves the signal to noise ratio when hybridized to oligonucleotide targets. The arrays are comprised of 70 mer probes designed to hybridize to a specific representation fragment from the genome.22 These oligonucleotides are from random portions of the genome and are picked according to their signal strength. This leads to variable coverage across the genome with some areas poorly represented and others densely represented. Currently, arrays with approximately 85 000 probes have been developed allowing amplifications and single copy deletions in cancer cell line genomes to be observed.22
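The complexity reduction step can be illustrated with an in-silico digest. This is a toy sketch rather than the ROMA probe design procedure: it cuts a sequence at the BglII recognition site (AGATCT, cleaved after the first A) and keeps only fragments flanked by two cuts and shorter than 1.2 kb, on the assumption that only such fragments would carry linkers at both ends and amplify efficiently. The test sequence is artificial.

```python
BGLII_SITE = "AGATCT"  # BglII recognition sequence; cleavage occurs after the first A

def digest(sequence, site=BGLII_SITE, cut_offset=1):
    """Cut `sequence` at every occurrence of `site`; return fragments in order."""
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]

def representation(fragments, max_len=1200):
    """Keep internal fragments (cut at both ends) shorter than `max_len` bases."""
    internal = fragments[1:-1]  # terminal pieces have only one restriction end
    return [f for f in internal if len(f) < max_len]

# Artificial sequence with three BglII sites
toy = "CC" + BGLII_SITE + "G" * 10 + BGLII_SITE + "T" * 2000 + BGLII_SITE + "AA"
fragments = digest(toy)
kept = representation(fragments)
```

Here only the short fragment between the first two sites survives size selection; the 2 kb fragment is excluded, mirroring how the representation covers only a small fraction of the genome.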

Whole-genome tiling path array

Although the methods described above allow copy number changes to be assessed on a genome-wide scale, the coverage of the arrayed elements can vary greatly. This leads to large gaps where no genomic information is obtainable. Thus, to fully understand the alterations occurring in various diseases, probes covering the entire genome are required. To date, Ishkanian et al6 have produced the only array CGH platform with whole-genome coverage (Figure 1). This submegabase resolution tiling set (SMRT) array is comprised of 32 433 overlapping BAC clones spotted in triplicate on two glass slides. Like other large insert clone-based approaches, the SMRT array yields high signal to noise ratios due to the hybridization sensitivity of the BACs to their corresponding genome targets. In contrast to marker-based approaches, the overlapping arrangement of the BAC clones abrogates the need to infer genetic events between marker clones and the redundancy provides confirmation of copy number status at each locus (Figure 2). The tiling nature also increases the probability of detecting microalterations that may fall between marker probes in other array platforms.

However, the major consideration in interpreting whole-genome BAC array data is the fact that some of the clones map to multiple places in the genome due to crosshybridization to highly homologous sequences. In megabase interval arrays, these clones would be excluded; however, in order to provide tiling path coverage, such clones have to be included. Tracking of these clones computationally would improve the accuracy of interpretation.
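One simple way to track such clones computationally is to flag every clone with more than one mapped location so that its ratio can be interpreted with caution. The sketch below assumes a hypothetical mapping table of (clone, chromosome, start, end) records; the clone names and coordinates are invented for illustration.

```python
from collections import defaultdict

def group_mappings(mappings):
    """Group (clone, chrom, start, end) records by clone name."""
    by_clone = defaultdict(list)
    for clone, chrom, start, end in mappings:
        by_clone[clone].append((chrom, start, end))
    return dict(by_clone)

def ambiguous_clones(by_clone):
    """Return the sorted names of clones that map to more than one location."""
    return sorted(clone for clone, locs in by_clone.items() if len(locs) > 1)

# Hypothetical mapping records
mappings = [
    ("clone_001", "chr1", 1_000_000, 1_150_000),
    ("clone_002", "chr1", 1_100_000, 1_260_000),
    ("clone_002", "chr9", 40_200_000, 40_360_000),  # second, homologous location
    ("clone_003", "chr2", 5_000_000, 5_140_000),
]
flagged = ambiguous_clones(group_mappings(mappings))
```

Flagged clones can then be displayed at each mapped position with a visual marker, or excluded from automated calls, depending on the analysis.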

Sequencing-based alternatives for copy number analysis

Digital karyotyping and fosmid paired-end sequencing are emerging methods for genome-wide profiling of copy number variations.23, 24 Digital karyotyping involves the isolation and enumeration of short sequence tags corresponding to specific loci, and tag abundance reflects copy number status.23 Similarly, fosmid paired-end sequencing enumerates the relative abundance of cloned fragments created from a genome of interest. However, by aligning the end sequences against the human genome sequence assembly, one can also detect structural variations such as insertions, deletions and inversions.24 Despite the precision of these sequence-based technologies, their widespread use will likely require a reduction in costs associated with DNA sequencing.
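The idea that tag abundance reflects copy number can be sketched as a windowed count. This is a deliberate simplification: it normalizes each window to the mean count across all windows, whereas a real analysis would compare observed tags with the tags expected from the reference genome. The tag positions and window size are hypothetical.

```python
from collections import Counter

def window_counts(tag_positions, window_size):
    """Count mapped tags per fixed-size genomic window (indexed from 0)."""
    return Counter(pos // window_size for pos in tag_positions)

def relative_abundance(counts, n_windows):
    """Tag count per window divided by the mean count over all windows."""
    mean = sum(counts.values()) / n_windows
    return [counts.get(w, 0) / mean for w in range(n_windows)]

# Hypothetical tag positions on one chromosome, 1000 bp windows
tags = [100, 900, 1100, 1300, 1500, 1900]
counts = window_counts(tags, window_size=1000)
profile = relative_abundance(counts, n_windows=3)
```

Windows with values above 1 carry more tags than average and windows with values near 0 carry almost none, which is the digital analogue of a gain or a deletion.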

Array requirements and considerations

The choice of platform technology for an array CGH study primarily depends on the type of samples being analyzed and the level of detail desired. Here, we examine issues pertaining to input material as well as genomic resolution.

Quality and quantity of sample DNA

A major consideration in selecting an array platform is sample requirement. DNA quantity may be limiting when analyzing small biopsies, while DNA quality may be compromised in formalin-fixed, paraffin-embedded archival specimens. Large insert clone arrays (such as those comprising BAC clones of ∼150 kb in size) efficiently capture signals from samples of low DNA quantity and quality for genome-wide analysis, while oligonucleotide and small PCR fragment arrays could facilitate more detailed investigation at selected regions when DNA quality and quantity are not limiting. BAC arrays require 200–400 ng of DNA,2, 6 whereas oligonucleotide and cDNA platforms typically require microgram amounts.5, 18, 25 Amplification techniques, such as those used in WGSA and ROMA, have proven effective in increasing hybridization signal strength and limiting noise through the reduction of sample complexity at the expense of genomic coverage. However, these techniques yield variable results when repeated on the same sample, and the biases potentially introduced by the PCR step are not fully understood.18

Tissue heterogeneity in a sample affects detection sensitivity of copy number changes and therefore is another consideration for array selection.26 Noncancerous cells in a tumor sample effectively dampen the shift in signal ratio associated with genetic alterations in the cancer cells. Garnis et al27 mimicked this phenomenon experimentally and concluded that increasing the number of measurements over a genomic distance could provide more data points within a segmental alteration, thereby increasing the probability of detection. The use of tiling path resolution arrays, as opposed to interval marker arrays, should be considered in analyzing heterogeneous tissue samples.
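The dampening effect of normal cells can be worked through numerically. The sketch below assumes a simple linear mixing model in which the measured ratio reflects the average copy number of tumor and normal cells, with a diploid reference; the function name and parameters are illustrative and ignore dye bias and noise.

```python
import math

def expected_log2_ratio(tumor_copies, tumor_fraction, normal_copies=2):
    """Expected log2 ratio for a tumor/normal cell mixture against a diploid reference."""
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must lie between 0 and 1")
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * normal_copies
    if mixed == 0:
        return float("-inf")  # homozygous deletion in a pure tumor sample
    return math.log2(mixed / normal_copies)

# Single copy gain (3 copies) in a pure tumor sample: log2(3/2), about 0.58
pure = expected_log2_ratio(3, 1.0)

# The same gain with 50% normal cells: log2(2.5/2), about 0.32
diluted = expected_log2_ratio(3, 0.5)
```

Under this model a single copy gain in a sample containing 50% normal cells produces a shift little more than half that of the pure tumor, so more data points within the altered segment are needed to distinguish it from noise.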

Another important consideration in array CGH experiments is the selection of reference DNA. Common options include a sex-matched reference (eg male reference for male sample), a sex-mismatched reference, a reference consisting of DNA pooled from several individuals, or a reference from a single individual.

Functional resolution of genome-wide array CGH

The definition of resolution in terms of array CGH is ambiguous. A practical definition is the genomic distance between array elements (clones or oligonucleotides). However, such elements may not be evenly distributed throughout the genome and some platforms may require multiple elements to detect an alteration, so that calculating resolution based on the mean or median distance would be misleading. Furthermore, tiling path arrays that span chromosomes with overlapping clones cannot be assessed in this way. A functional measure of resolution can be the size limit of detecting a segmental copy number alteration.28
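A small calculation shows why a mean or median spacing can mislead when elements are unevenly distributed. The positions below are hypothetical and expressed in kilobases.

```python
from statistics import median

def spacing_summary(positions):
    """Summarize the gaps between consecutive array element positions."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {"median_gap": median(gaps), "max_gap": max(gaps)}

# Hypothetical element positions (kb): five tightly spaced, then one distant element
positions_kb = [0, 10, 20, 30, 40, 1040]
summary = spacing_summary(positions_kb)
```

The median gap here is 10 kb, yet an alteration of up to 1 Mb could sit undetected in the largest gap, which is the motivation for a functional definition of resolution.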

Applications of array CGH

Although the most frequent use of array CGH is in the detection of somatic segmental changes pertaining to cancer (Figure 3a), there are other applications. We now review the use of array CGH in measuring copy number status in cancer, in genetic diseases, and in evolutionary comparisons. We also consider its potential as a diagnostic tool in a clinical setting.

Figure 3

Somatic alterations and copy number variations. (a) Example of a segmental duplication observed at chromosome arm 2p present in the cancer cells but absent in the normal cells from the same individual. Each black dot represents a single BAC clone spotted on the array. The purple line represents equal fluorescent intensity ratio between sample and reference. Copy number gain (and loss) shifts the ratio to the right (and left). (b) Illustrates a copy number variation observed at chromosomal region 21q21.1. Three normal individuals exhibit equal, more and fewer copies relative to the reference DNA, indicating variation in the population.

Identifying somatic DNA alterations in cancer

In the past 5 years, numerous high-resolution array CGH studies of copy number alterations have been reported, initially focusing on specific regions of tumor genomes and later expanding to entire chromosome arms. Examples include a 3p array comprising a tiling set of 535 BAC clones, used to define common gains and losses in oral cancer;29 and a 5p array of 491 BAC clones spanning the 50 Mb of 5p28 and a 1p array of 642 ordered BACs spanning 120 Mb,30 both of which were instrumental in the discovery of genes involved in lung tumorigenesis.28, 30, 31, 32 Furthermore, arm-specific arrays have been used to profile astrocytic tumors and various other cancers.33, 34

In terms of genome-wide approaches, cDNA microarrays and interval LIC arrays have yielded much information on the genomic landscape of a variety of cancers and the discovery of recurrent genetic alterations.5, 35, 36, 37, 38 In addition, genome-wide profiles are used to deduce features for disease classification including drug response.39, 40, 41, 42, 43

The recent development of a whole-genome tiling path array has advanced such analysis to examining tumor genomes in unprecedented detail.6 Whole-genome profiling of mantle cell lymphoma (MCL) cell lines using the SMRT array has identified an average of 35 alterations per genome with equal numbers of gains and losses and found recurrent alterations as small as 130 kb in size.44 Further utilization of the SMRT array has proved useful in mapping focal amplifications in oral and lung cancer as well as osteosarcoma.38, 45, 46, 47 Application of array CGH to epithelial and hematological malignancies has yielded novel genetic alterations that had escaped detection by conventional methods, and has facilitated a concerted search for multiple disruptions in biological pathways.42, 43

Identifying segmental copy number changes in genetic diseases

Segmental duplications and deletions have been well documented in inherited diseases.10, 48, 49 Advances in array-based CGH have greatly facilitated the discovery of such genetic alterations. Megabase interval genomic arrays have been instrumental in delineating regions affected in a variety of genetic diseases. Submicroscopic chromosomal deletions and duplications were identified in cytogenetically normal patients exhibiting mental retardation and dysmorphisms.10, 48 In addition, copy number changes were refined in Cri-du-chat syndrome, congenital diaphragmatic hernia (CDH), and Prader–Willi Syndrome (PWS) using array CGH.50, 51, 52 These examples illustrate the value of array CGH in identifying aberrations that have escaped traditional cytogenetic analysis and in refining aberrations previously characterized in various disorders.

Identifying DNA copy number variation in the human population

A new application of array CGH is the characterization of large-scale DNA variations (Figure 3b). Using an array of 2632 large-insert clones, 55 unrelated individuals were examined to quantify genetic variation.53 Overall, 255 loci across the human genome contained genomic imbalances; 24 of them were present in over 10% of the participants, and six of these large-scale copy number variations (LCVs) were present in 20% of the individuals. Among the total number of LCVs, over half (142) harbored genes. Strikingly, 14 LCVs were located near loci associated with cancer or genetic diseases, suggesting that certain individuals may have higher susceptibility to disease than others. A subsequent study used a BAC array specific to potential rearrangement hotspots in the genome to assess copy number variation in 41 normal individuals.54

Similar conclusions were derived using the ROMA technology. In total, 20 subjects were analyzed and 76 unique copy number variations were discovered.55 Among a total of 226 copy number differences, each individual harbored an average of 11 variations, with an average length of 465 kb. The copy number variations contain genes that have been implicated in cell growth and other functions.

Clearly, these three studies illustrate the utility of array CGH in investigating large-scale variation. However, due to the limited genomic coverage of the techniques used, more comprehensive studies using whole-genome tiling path arrays are necessary to enumerate and identify all such LCVs in the human population.

Evolutionary characterization

Array CGH technology has also been employed in interspecies comparisons. In a comparison of the human genome against four great ape genomes, using a LIC array of 2460 BACs, 63 sites of DNA copy number variation between the human and great apes were identified. A significant number of these sites existed in interstitial euchromatin.56 In a recent study, using a cDNA array CGH approach, over 29 000 human genes across five hominoid species (human, bonobo, chimpanzee, gorilla, and orangutan) were compared, leading to the identification of >800 genes that gave genetic signatures unique to a specific hominoid lineage.57 Moreover, there was a more pronounced difference between copy number increases and decreases in humans, and a number of the amplified genes are thought to be involved in the structure and function of the brain. These studies illustrate the use of array CGH in interspecies comparisons.

Array CGH as a diagnostic tool and applications to clinical settings

The development of genomic and gene expression profiling technologies has allowed the simultaneous interrogation of thousands of loci and offers unprecedented opportunities to obtain global molecular signatures of the state of activity of cells in patient samples. The use of DNA-based technology has notable practical advantages in a clinical setting. DNA is stable, relatively easy to transport, and can be obtained from archival paraffin tissue blocks, while the acquisition and optimal transport of high-quality RNA is challenging due to its inherent instability. Furthermore, with array CGH, genomic DNA from normal cells of any origin from the same individual can be used as a baseline to define changes. In contrast, normal tissue from matching tissue type or precursor cells is required in order to properly define expression changes.

Current application of DNA-based diagnosis, primarily by standard cytogenetic analysis and chromosomal CGH, has wide application in a clinical setting, but suffers from low resolution and is not precisely linked to sequence-based map information. Array CGH offers the opportunity to globally profile segmental copy number imbalances at unprecedented resolution in constitutional or tumor DNA samples, thus serving as a diagnostic and investigative tool.

Disease-specific arrays have been constructed for cancer diagnostics. These arrays are enriched for the coverage of multiple cancer gene loci facilitating simultaneous assessment of gains and losses of tumor suppressor and oncogenes in a variety of cancers.12, 42, 43 Similarly, diagnostic arrays have been designed for the diagnosis of congenital anomalies, developmental delay, and mental retardation58, 59, 60 as well as the detection of chromosomal aberrations in embryos.49, 61, 62 In order for array CGH to have a more prominent role in clinical diagnosis, many factors such as cost, standardization of protocol, robustness of arrays, and user acceptance need to be addressed. Furthermore, advancements in array CGH software pertaining to ease of use, interpretation, visualization, and functionality will also be necessary.

Array CGH data computational visualization and analysis

With the increase in array CGH applications and the diversity of platform technologies, a variety of software has been developed for data visualization and statistical analysis63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75 (Table 1).

Table 1 Software for array CGH visualization and analysis

The first step in visualization is the conversion of spot image data to locus-specific copy number ratio, a function included in many microarray scanner software packages as well as custom software.76 The next step involves the linking of the array elements to their genomic positions. The addressing is achieved by relating sequence information, such as the sequence of an oligonucleotide element or the end sequences of a BAC clone, to the human genome sequence, so that the signal ratio for each locus can be discretely displayed. This task becomes a challenge when displaying ratio data from tiling path arrays, where the array elements represent overlapping genomic segments.75
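One possible way to handle overlapping tiling path elements is to split the chromosome at every clone boundary and report, for each resulting interval, the mean ratio of all clones covering it. The sketch below is an illustrative assumption, not the display strategy of any package discussed here; the clone coordinates (in kb) and ratios are made up.

```python
def segment_ratios(clones):
    """Split overlapping clones into disjoint intervals with mean log2 ratios.

    `clones` is a list of (start, end, log2_ratio) tuples on one chromosome.
    Returns (interval_start, interval_end, mean_ratio) for each covered interval.
    """
    bounds = sorted({pos for start, end, _ in clones for pos in (start, end)})
    segments = []
    for a, b in zip(bounds, bounds[1:]):
        # Ratios of every clone that fully covers the interval [a, b]
        covering = [r for start, end, r in clones if start <= a and end >= b]
        if covering:
            segments.append((a, b, sum(covering) / len(covering)))
    return segments

# Three hypothetical overlapping clones; only the last one shows a gain
clones = [(0, 150, 0.0), (100, 250, 0.0), (200, 350, 0.6)]
segments = segment_ratios(clones)
```

Intervals covered by two clones with discordant ratios, such as 200 to 250 kb here, receive an intermediate value, which hints that a breakpoint lies within one of the overlapping clones.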

The identification and detection of segmental losses and gains requires statistical analysis. We have compared 17 publicly accessible software packages in terms of functionality, hardware and software requirements, input format, types of algorithms used, cost, and availability. We also compared the types of analysis supported: filtering and excluding data using cutoffs or thresholds, visualizing data at various levels of magnification, and viewing multiple experiments simultaneously. Computationally basic software can display single experiments and perform simple data analysis. This category of programs includes ArrayCyCHt, Caryoscope, and SeeGH v1.5. Computationally advanced software can display multiple experiments simultaneously, provide numerous forms of visualization, and apply sophisticated analysis techniques to determine alterations. Examples of computationally advanced software are CGHPro, CGHAnalyzer and SeeGH v2.0. Figure 4 illustrates some features of SeeGH v2.0 software, including the alignment of multiple experiments and the calculation of frequency of alteration for array loci, which are useful in the analysis of large data sets. A comparison of all the software programs is provided in Table 1. Although some of the software described can perform multiple types of analyses, further development is necessary to assemble these functionalities in a single software package that can support spot data normalization, map position addressing, multiple profile alignment, automated breakpoint detection, frequency and cluster analysis, gene track referencing, and alignment with exogenous data tracks such as complementary data on gene expression profiles and allelic status. Furthermore, specialized software and user interfaces may have to be developed for clinical application as opposed to research use.
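The frequency-of-alteration calculation mentioned above can be sketched as follows. This is a minimal illustration and not the SeeGH implementation: profiles are assumed to be dictionaries mapping clone names to log2 ratios, the clone names are hypothetical, and the ±0.5 cutoffs are arbitrary placeholders.

```python
def alteration_frequency(profiles, gain_cutoff=0.5, loss_cutoff=-0.5):
    """Return {clone: (gain_fraction, loss_fraction)} across sample profiles.

    Each profile maps clone name to log2 ratio; a clone missing from a
    profile (for example, a filtered spot) is left out of that clone's total.
    """
    clones = sorted(set().union(*profiles))
    frequencies = {}
    for clone in clones:
        values = [p[clone] for p in profiles if clone in p]
        gains = sum(v >= gain_cutoff for v in values)
        losses = sum(v <= loss_cutoff for v in values)
        frequencies[clone] = (gains / len(values), losses / len(values))
    return frequencies

# Four hypothetical tumor profiles
profiles = [
    {"cloneX": 0.8, "cloneY": 0.0},
    {"cloneX": 0.7, "cloneY": -0.6},
    {"cloneX": 0.1, "cloneY": 0.1},
    {"cloneX": -0.9},  # cloneY was filtered out in this sample
]
freq = alteration_frequency(profiles)
```

Plotting the two fractions on either side of each chromosome gives a frequency plot of the kind used to summarize recurrent gains and losses in large data sets.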

Figure 4

Analysis of array CGH data using SeeGH software. (a) Shows a multiple alignment of chromosome 8 from six tumor profiles. The main function of array CGH software packages is to link the array elements to genomic position. The elements are mapped according to base pair position to a specific chromosomal location. Segmental gains and losses are identified by shifting signal ratio to the left (loss) and right (gain) of the purple line, which represents a log2 signal ratio of zero. (b) Illustrates a frequency plot summarizing genetic alterations. The vertical lines on each side of the chromosome represent the proportion of the samples containing a loss (green) or gain (red) of a particular array element. (c) Representation of the gene track (green lines) corresponding to the BAC clones (black lines) facilitating the linkage to public databases such as OMIM, NCBI Entrez and UCSC Genome Browser. For example, the amplification at 8q24.21 corresponds to the c-Myc oncogene. (d) A summary of the frequency of gains and losses in the six karyograms.

Conclusion

Array CGH technology has advanced greatly in the past decade with the development of a myriad of array platforms, expanding its application to many aspects of genetic research.38, 42, 49, 77 The relative stability of DNA (as compared to RNA) and the ease of isolation allow investigation of clinical specimens that may not be suitable for gene expression profiling. New array designs will continue to improve resolution and detection sensitivity, while more efficient production strategies and streamlined experimental protocols will reduce the cost and effort required. In addition, the emergence of new software aimed at automating breakpoint detection and statistical analysis will simplify the daunting task of interpreting array CGH data sets. The continuing technical advances and growing databases of disease-specific profiles will broaden the use of array CGH in both research and clinical settings.