There has been much focus recently on normal genome variation, particularly in the form of single-nucleotide polymorphisms1, and the extent of segmental duplication in the human genome is now well-documented2. But the importance of normal copy-number variation involving large segments of DNA has been largely unappreciated, as only a handful of instances have been reported3,4,5,6,7,8. Now, using DNA microarrays (array comparative genomic hybridization) to screen the human genome for changes in copy number, two studies9,10 report a substantial degree of large-scale copy-number variation (LCV) in the human population. John Iafrate and colleagues9, reporting on page 949 of this issue, describe their use of large insert clone arrays for this purpose, and Sebat et al., reporting in Science10, describe similar findings using oligonucleotide arrays. The conclusion is that >200 large segments of the genome vary severalfold in copy number in the human population. This unexpected level of LCV forces us to re-evaluate our view of the structure of the normal human genome.

How many, and how common?

Iafrate et al. report 255 variable loci in 55 individuals, whereas Sebat et al. report 76 LCVs in 20 individuals, with average differences between any two individuals of 12.4 and 11 LCVs, respectively. The two sets of loci can be viewed at the Genome Variation Database (http://projects.tcag.ca/variation). The two studies report only 11 loci in common (within 1 Mb). It is not clear whether the loci described in each study were mapped onto the same build of the human genome, but it seems at first glance that each study underestimates the number of LCVs in the human population.
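To make the "in common (within 1 Mb)" comparison concrete, here is a minimal sketch of how two lists of variable loci might be matched. It is not the procedure used by either study or by the database. The function name, the representation of each locus as a chromosome and a single position, and the coordinates in the usage example are all hypothetical. The sketch also assumes that both lists are mapped onto the same genome build, which, as noted above, is not certain for the two published sets.

```python
def count_shared_loci(loci_a, loci_b, window=1_000_000):
    """Count loci in loci_a that lie within `window` bp of any locus
    in loci_b on the same chromosome.

    Each locus is a (chromosome, position) tuple. This representation is
    an assumption made for illustration; both lists are assumed to use
    coordinates from the same genome build.
    """
    # Index the second study's positions by chromosome.
    positions_by_chrom = {}
    for chrom, pos in loci_b:
        positions_by_chrom.setdefault(chrom, []).append(pos)

    shared = 0
    for chrom, pos in loci_a:
        candidates = positions_by_chrom.get(chrom, ())
        if any(abs(pos - other) <= window for other in candidates):
            shared += 1
    return shared


# Hypothetical coordinates, not taken from either study.
study_a = [("chr1", 104_000_000), ("chr7", 143_000_000), ("chr14", 105_000_000)]
study_b = [("chr1", 104_600_000), ("chr7", 150_000_000), ("chr2", 105_000_000)]
shared = count_shared_loci(study_a, study_b)
```

With these made-up inputs only the chr1 pair falls within the 1-Mb window. If the two lists were on different builds, coordinates could shift enough to change the count, which is one reason the reported overlap should be read with caution.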

How can we explain this discrepancy? One factor might be the resolution of the arrays used. Neither array achieves both high resolution and complete coverage of the genome (150 kb every 1 Mb with a detection size of 50 kb (ref. 9) versus 1 probe every 35 kb with a detection size of 105 kb (ref. 10)). Moreover, it is common practice in selecting clones or probes for array comparative genomic hybridization to avoid regions that hybridize to more than one genomic location or show variation in normal hybridizations. Such arrays are probably biased away from LCVs. Further studies with genome-wide clone tiling path arrays or higher density oligonucleotide arrays will be required to settle this issue, but many more LCVs probably remain to be discovered.

How common are LCVs? In both studies, approximately one-half of the variable loci were polymorphic in only one individual, whereas one-half showed copy-number variation in more than one individual. Only a few of the most common LCVs were detected in both studies (e.g., 7q35, 14q32.33). For example, the most common variant in the Iafrate et al. study, located at 1p21.1 (AMY1A–AMY2A) and present in >49% of the individuals studied (Fig. 1), was not detected by Sebat et al. Again, does this discrepancy reflect a technical difference between the two studies or the very different ethnic mix of the two study groups? We will need to study many more individuals from a range of ancestral backgrounds to arrive at an accurate picture of the frequency of the more common LCVs.

Figure 1: Fluorescence in situ hybridization on stretched DNA fibers shows copy-number variation at the AMY1A–AMY2A locus on 1p21.1.

Photo courtesy of Charles Lee

The image shows hybridization of a 5′ amylase gene probe (red) and a 3′ amylase gene probe (green) to DNA fibers (blue) from three different individuals, each with a different number of tandem copies of the variable segment.

LCVs, duplications and disease

Analysis of the reference sequence shows that 5% of the human genome is duplicated2. These segmental duplications, defined as multiple regions sharing at least 1 kb of 90% identical sequence, are thought to have had a key role in human genome evolution11 and may be responsible, through nonallelic homologous recombination (NAHR), for many chromosome rearrangements leading to disease12. Both Iafrate et al. and Sebat et al. report a higher than expected association of LCVs with known segmental duplications and with regions associated with human genetic disease or cancer. This suggests that LCVs and other genomic rearrangements might have a common mechanistic basis.

It has also been suggested that large segmental duplications could complicate sequence assembly and lead to gaps in the sequence13. Iafrate et al. note that 12.7% of LCVs are located in the 100 kb of gaps in the current sequence assembly. Similarly, LCVs with high sequence homology might be assembled out of the reference sequence. Furthermore, the reference sequence was produced from clone libraries generated from a small number of individuals, and so most LCVs would not be represented in the libraries. This raises the question, “What is the sequence of the normal human genome?” Much more detailed sequence analysis of LCVs in a large number of individuals will be needed to address this issue.

In both studies, a high proportion of LCVs overlapped with known genes. Further studies of these genes in individuals with different copy numbers will be interesting, as copy-number differences will probably be found to influence gene expression14. Alternatively, regulatory mechanisms could compensate for differences in copy number between individuals. While LCVs have been identified in phenotypically normal individuals, we cannot determine the phenotypic consequences of such large polymorphisms. Some of these LCVs may be associated with age-related susceptibilities to disease, and deletion polymorphisms may reveal recessive mutations with phenotypic consequences. A recent study of 50 individuals with learning disability and dysmorphology identified five LCVs that were inherited from normal parents and so did not segregate with the disease phenotype15. These observations underscore the importance of identifying LCVs in the normal population so that we can gauge the importance of copy-number changes in individuals with diseases.

Both Iafrate et al. and Sebat et al. report that LCVs are frequently located in regions of the genome that are susceptible to rearrangement, particularly by NAHR. The copy-number variation found in LCVs could certainly be generated by NAHR, suggesting that there may be a common mechanism for disease-associated and normal copy-number variation. If this turns out to be true, the LCVs themselves may point to unstable regions of the genome at which new disease-associated rearrangements may be found in the future.