We have placed 7,600 cytogenetically defined landmarks on the draft sequence of the human genome to help with the characterization of genes altered by gross chromosomal aberrations that cause human disease. The landmarks are large-insert clones mapped to chromosome bands by fluorescence in situ hybridization. Each clone contains a sequence tag that is positioned on the genomic sequence. This genome-wide set of sequence-anchored clones allows structural and functional analyses of the genome. This resource represents the first comprehensive integration of cytogenetic, radiation hybrid, linkage and sequence maps of the human genome; provides an independent validation of the sequence map1,2 and framework for contig order and orientation; surveys the genome for large-scale duplications, which are likely to require special attention during sequence assembly; and allows a stringent assessment of sequence differences between the dark and light bands of chromosomes. It also provides insight into large-scale chromatin structure and the evolution of chromosomes and gene families and will accelerate our understanding of the molecular bases of human disease and cancer.
With the draft of the human genome available2, scientists can conduct global analyses of its gene content, structure, function and variation. One important challenge is to define the genetic contribution to human diseases. For many developmental disorders, inherited conditions and cancers, gross chromosomal aberrations provide clues to the locations of the causative molecular defects. These aberrations are visible as alterations in chromosomal banding patterns3 or in the number or relative positions of DNA sequences labelled by fluorescence in situ hybridization (FISH)4. Although tracing gross abnormalities to the level of DNA sequence5 has revealed the genetic causes of many diseases, molecular characterization of chromosomal aberrations has lagged far behind their discovery6. To proceed from cytogenetic observation to gene discovery and mechanistic explanation, scientists will need access to a resource of experimental reagents that effectively integrates the cytogenetic and sequence maps of the human genome.
We describe here the results of a concerted effort to assemble such a genome-wide resource of well mapped, large-insert DNA clones. Each clone has been localized directly to chromosomal band(s) by FISH (Fig. 1a) and assigned one or more unique sequence tags, which can anchor the clone to the emerging draft sequence. We used complementary strategies to amass the current set of 8,877 clones. The set, which consists primarily of bacterial artificial chromosome (BAC) clones, includes clones targeted to contain sequence-tagged sites (STSs) ordered along the genome by genetic linkage or radiation hybrid mapping (for well ordered and distributed coverage); clones randomly selected for end sequencing from the RPCI-11 library (for coverage of regions low in STSs); clones identified during intense mapping efforts that preceded sequencing of some chromosomes (for denser coverage); and clones suspected of being partially duplicated at more than one location in the genome (to flag regions of the genome that might complicate sequence assembly7). The molecular signatures are STSs (many corresponding to genes or expressed sequence tags (ESTs)), BAC end sequences, or the actual draft or final sequence of the clone (Table 1). Earlier publications have described genome-wide and chromosome-specific subsets of this collection8,9,10,11,12.
Each clone is publicly available as a single-colony-purified bacterial stock and is ready for distribution. Clones can be obtained from one of three stock centres by e-mail: mapped-clones@mail.cho.org, email@example.com and firstname.lastname@example.org. The website http://www.ncbi.nlm.nih.gov/genome/cyto provides information about all clones in this collection, including how to obtain each clone. Additional information can be obtained at the websites listed in Supplementary Information 1.
The 8,877 clones provide excellent coverage of the human genome (Table 1), with an average of at least one clone per megabase (Mb) for 23 of the 24 chromosomes. Clone density ranges from more than ∼5 clones per Mb for chromosomes 1, 6, 20, 22 and X to about 0.3 clones per Mb for chromosome Y.
Our study provides an assessment of the representation of the human genome in the RPCI-11 BAC library13, which serves as the intermediate template for most sequencing efforts2 and the foundation of genome-wide contig assembly by fingerprint analyses1. We randomly selected 1,243 clones from this library for FISH analysis. The number of clones assigned to each chromosome correlated well with chromosome size, with no significant bias in the distribution of clones between Giemsa (G)-dark and G-light bands of chromosomes (see Supplementary Information 2 and 3).
Cytogenetic mapping is one of several methods that can produce a framework of ordered clones upon which the human sequence can be assembled. The resource provides an opportunity to cross-check these critical framework maps, because over 3,300 FISH-mapped clones have STSs that reference the radiation hybrid14 or linkage maps15,16. Overall, the concordance between cytogenetic map order and marker order established by radiation hybrid and linkage mapping is very high for clones with single cytogenetic locations (94–98%, depending on the map; Table 2). Significant discrepancies were observed for only around 140 of these clones and are probably due to errors in clone tracking. Integration of cytogenetic and linkage maps also aids efforts to map disease genes. The location of the cytogenetic abnormality in one patient can guide the choice of polymorphic markers to assess linkage in other families that have similar phenotypes, but no visible chromosomal aberrations.
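As an illustration of how an order concordance of this kind could be scored, the sketch below lists the clones in radiation hybrid or linkage map order and reports the largest fraction whose cytogenetic positions form a non-decreasing series. This is a minimal sketch under stated assumptions, not the procedure used for Table 2: the function name, the numeric encoding of cytogenetic positions and the choice of a longest non-decreasing subsequence as the measure are all hypothetical.

```python
from bisect import bisect_right


def concordance(cyto_positions):
    """Fraction of clones consistent with a single non-decreasing order.

    cyto_positions: cytogenetic positions of the clones (for example band
    midpoints as a fraction of chromosome length), listed in the order given
    by the radiation hybrid or linkage map. The encoding is an assumption.
    """
    if not cyto_positions:
        return 1.0
    # tails[k] holds the smallest possible last value of a non-decreasing
    # subsequence of length k + 1 seen so far.
    tails = []
    for pos in cyto_positions:
        i = bisect_right(tails, pos)  # ties are allowed (same band)
        if i == len(tails):
            tails.append(pos)
        else:
            tails[i] = pos
    return len(tails) / len(cyto_positions)


# Hypothetical positions: one clone (0.9) is out of order with its neighbours.
print(concordance([0.1, 0.2, 0.2, 0.9, 0.3, 0.4]))
```

Allowing ties reflects the fact that several clones can map to the same band, so only a strict reversal counts against the concordance.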
At present, 7,303 clones that map to single cytogenetic locations are positioned by their sequence tags on the draft sequence assembly of 7 October 2000 (Table 1). The fraction of clones located on the draft sequence ranges from 76% to 91% across different chromosomes (see Supplementary Information 4). We expect these percentages to rise as more sequence is merged into the draft and algorithms for locating tags are refined.
The connections between the cytogenetic map and the draft sequence are well distributed across the genome, and the correspondence in position on the two maps is excellent for these 7,303 clones (Fig. 2 shows chromosome 12 as an example). Of the 943 contigs of overlapping clones in the 7 October 2000 draft sequence, 660 are connected to the cytogenetic map by at least one clone, and 531 by two or more clones. Thus, many contigs can be oriented on the chromosome on the basis of FISH results of constituent clones. Relatively few discrepancies between cytogenetic location and position in the draft sequence are apparent at this level of resolution (∼5% of the clones map either to other chromosomes or more than one band away from the expected position; Table 1). We found only eight locations where the cytogenetic data indicated that portions of the sequence were misplaced within an earlier draft assembly (5 September 2000). The sequencing centres used these cytogenetic findings to locate errors in the assembly and produce the later draft of improved quality (Table 2).
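The reasoning that a contig anchored by two or more FISH-mapped clones can be oriented on the chromosome can be sketched as follows. The function name, the input layout and the pairwise voting rule are assumptions made for illustration; they are not taken from the assembly pipeline.

```python
def orient_contig(anchors):
    """Infer contig orientation from FISH-mapped clones.

    anchors: list of (contig_offset_bp, cyto_position) pairs, one per clone.
    cyto_position is any number that increases from pter to qter
    (a hypothetical encoding of the band assignment).

    Returns '+' if contig offsets increase with cytogenetic position,
    '-' if they decrease, and None if the data cannot decide.
    """
    if len(anchors) < 2:
        return None  # a single clone places a contig but cannot orient it
    score = 0
    for i in range(len(anchors)):
        for j in range(i + 1, len(anchors)):
            d_offset = anchors[j][0] - anchors[i][0]
            d_cyto = anchors[j][1] - anchors[i][1]
            product = d_offset * d_cyto
            # +1 for a pair in the same order, -1 for a reversed pair,
            # 0 when the two clones share a band or an offset.
            score += (product > 0) - (product < 0)
    if score > 0:
        return "+"
    if score < 0:
        return "-"
    return None


# Hypothetical contig with two anchoring clones in adjacent bands.
print(orient_contig([(100_000, 12.1), (2_400_000, 13.3)]))
```

Clones that fall in the same band contribute nothing to the vote, which mirrors the limited resolution of metaphase FISH.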
FISH analyses of this clone collection reveal abundant paralogous relationships among sites dispersed across the human genome. Of 1,243 clones randomly selected from the RPCI-11 library, 5.4% hybridize to more than one chromosomal location (see Supplementary Information 3). The entire collection includes 393 clones that together identify over 150 bands containing at least one segment with significant homology to one or more (up to 25) other sites in the genome (see Supplementary Information 5). These data provide clues to duplications and exchanges that have occurred within and between chromosomes. Among the 393 clones, 111 contain blocks duplicated within the same chromosome; 282 hybridize to more than one chromosome. Paralogous relationships involving pericentromeric and subtelomeric regions of multiple chromosomes are particularly frequent and complex. Clones in the collection also identify low-copy duplications specific to chromosomes 1, 7, 11 and 16, the pseudoautosomal regions of X and Y, and sites of the olfactory receptor gene family17. Many previously undescribed patterns were also observed; some were confirmed with two or more clones, but others require further study to verify that they reflect true duplications.
Many of these duplications are functionally significant, as some have generated multigene families, and some are potential sites of recombination events, which can result in chromosome abnormalities. The cytogenetic data should greatly facilitate analyses of these regions, which are likely to pose challenges to sequence assembly. The sequence tags of 84% of the clones that hybridize to more than one site were placed in the 7 October 2000 draft assembly, and the location(s) were roughly consistent with at least one FISH observation for 88% of these clones. Collectively, the multisite clones highlight regions that are more likely to become entangled with other regions of the genome during sequence assembly than clones with single FISH locations. Indeed, global BLAST analyses show that regions encompassing sequence tags of multi-site clones (either the sequence of the FISH-mapped clone or a surrogate clone from the assembly) contain blocks of homology found at an average of around 3.9 chromosomal locations (compared to around 1.3 for the regions underlying clones with single FISH signals). The regions observed by FISH and revealed through homology searches are not fully congruent, however (not shown). These findings indicate that both FISH and sequence analyses may underestimate large-scale duplications and that these complex, inter-related regions of the genome will require special attention during the finishing stages of genome sequencing.
The extensive integration of cytogenetic and primary sequence data gives investigators access to fine-structure information—including details on predicted genes—for cytogenetic locations of interest. Tools such as NCBI's MapViewer and the UCSC and ENSEMBL genome browsers (see Fig. 2 for URLs) allow researchers to navigate readily between chromosomal location and annotated sequence.
This integration provides insight into the sequence differences underlying cytogenetic banding patterns. Sequence analyses of 200-kilobase (kb) regions surrounding the sequence tags of 338 clones mapped with the finest band resolution reveal more striking differences in the base-pair composition between Giemsa-positive and -negative bands than were predicted from earlier studies18. These clones were mapped with high precision to 850-level bands of varying staining intensity19 on seven chromosomes. The AT content of 58 of the 59 clones in the darkest G-bands exceeds the genome-wide average of 0.59 (mean 0.63), whereas the AT content of only 22 of the 143 clones in G-negative bands is higher than average (mean 0.55; χ² = 43, P < 0.005). These data confirm that dark G-bands are more AT-rich than G-negative bands.
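The AT content of a region can be computed as sketched below. This is a minimal sketch: the function names are hypothetical, and ignoring uncalled bases (such as N) is an assumption rather than a documented detail of the analysis. The default threshold of 0.59 is the genome-wide average quoted above.

```python
def at_content(seq):
    """Fraction of A and T among the called bases (A, C, G, T) of seq."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no called bases")
    return at / (at + gc)


def exceeds_average(seq, genome_average=0.59):
    """True if the region is more AT-rich than the genome-wide average."""
    return at_content(seq) > genome_average


# Toy sequence standing in for a 200-kb window around a sequence tag.
window = "ATGCAATTNN"
print(at_content(window), exceeds_average(window))
```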
The utility of a sequence-integrated cytogenetic resource is illustrated by two examples. In the first, clones are applied in conventional FISH assays to rapidly narrow the search for candidate genes disrupted or deregulated by translocations causing developmental disorders. The process is expedited by selection of clones assigned to the regions implicated by banding analyses. In a patient with multiple congenital malformations and mental retardation (DGAP012, http://dgap.harvard.edu), a breakpoint-spanning clone was identified (Fig. 1b). This clone spans a 170-kb interval containing the gene for MKK7, a human mitogen-activated protein kinase, and a novel sequence with homology to the tre-2 oncogene, both plausible candidate genes. More typically, breakpoints will be mapped to an interval between neighbouring clones. For example, a translocation implicated in mental retardation in another patient maps to an interval containing at least 12 genes, including protocadherin 8, a promising candidate given its exclusive expression in fetal and adult brain20.
In the second example, an array of around 2,000 BAC clones from the collection is used to perform a genome-wide scan for segmental aneuploidy by comparative genomic hybridization (CGH) (Fig. 3 and A. Snijders et al., in preparation). The array format offers better sensitivity and resolution21,22 than metaphase chromosomes, the traditional target for CGH23, and, because the arrayed clones are integrated into the draft, copy-number abnormalities can be related directly to sequence information. To illustrate the power of array CGH, the ML-2 cancer cell line was ‘karyotyped’ using the array. Array CGH revealed relative copy-number losses on 1p, 6q, 11q and 20p and gains of 12, 13 and 20q (Fig. 3). Copy-number abnormalities on chromosomes 6, 11 and 20 were subsequently confirmed by FISH using clones predicted by array CGH to be included in the region of loss. Several of these alterations were noted in previous banding analyses (1p-, 6q-, 11q-, +12, +13q+)24, but array CGH locates the breakpoints precisely relative to BACs that reference specific locations in the sequence.
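The way relative copy-number gains and losses are read from an array can be sketched with log2 ratios of test to reference signal. The median centring, the threshold of 0.25 and all names below are illustrative assumptions; they are not the analysis parameters used for Fig. 3.

```python
import math
from statistics import median


def call_clones(intensities, threshold=0.25):
    """Classify each arrayed clone as 'gain', 'loss' or 'normal'.

    intensities: dict mapping clone name -> (test_signal, reference_signal).
    threshold: hypothetical cut-off on the centred log2 ratio.
    """
    log_ratios = {
        clone: math.log2(test / ref) for clone, (test, ref) in intensities.items()
    }
    # Centre on the median so that the typical clone has a ratio of zero.
    centre = median(log_ratios.values())
    calls = {}
    for clone, value in log_ratios.items():
        centred = value - centre
        if centred > threshold:
            calls[clone] = "gain"
        elif centred < -threshold:
            calls[clone] = "loss"
        else:
            calls[clone] = "normal"
    return calls


# Hypothetical signals for five clones.
example = {
    "cloneA": (100, 100),
    "cloneB": (100, 100),
    "cloneC": (150, 100),
    "cloneD": (50, 100),
    "cloneE": (100, 100),
}
print(call_clones(example))
```

Because each arrayed clone is tied to a position in the draft sequence, a run of adjacent 'loss' or 'gain' calls translates directly into a sequence interval.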
More than 7,500 clones now link the cytogenetic map and sequence of the human genome. Application of these reagents in combination with increasingly detailed knowledge of genes and other functional motifs in the human sequence will transform the process of identifying genes that are altered in cancer and other diseases. Ultimately, this resource will contribute to a better understanding of the organization of the cell nucleus, the compacting of DNA into mitotic chromosomes, and the basis of the chromosomal banding patterns that have been so valuable in uncovering the aetiology of human diseases.
GenBank was screened for draft, finished or end sequences derived from clones in this collection. BACs were screened for STS content by a combination of hybridization and polymerase chain reaction (see refs 8, 25 and Supplementary Information for details). Sequence tags were located on the draft sequence by a combination of methods (see Supplementary Information and refs 26, 27). Sequence at these locations was compiled with the results of a genome-wide BLAST analysis (ref. 2 and J. A. Bailey and E. E. Eichler, in preparation) to identify paralogous regions of the genome (regions in the draft sequence containing ∼20 kb of sequence that match sequence of the FISH-mapped clone or that of a surrogate clone from the assembly at ∼90% identity in non-repeat-masked bases over each 1-kb segment), and these locations were translated into estimated band positions using a dynamic programming algorithm (T. S. Furey et al., in preparation; and see Supplementary Information).
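The paralogy criterion described above can be illustrated with a small filter over per-segment identities. This sketch treats the approximate figures as minimum thresholds (20 segments of 1 kb, identity of 0.90) and requires the qualifying segments to be consecutive; both choices, and the function name, are assumptions made for illustration and may differ from the analysis actually performed.

```python
def paralogous_runs(segment_identity, min_identity=0.90, min_segments=20):
    """Find runs of consecutive 1-kb segments that meet the identity cut-off.

    segment_identity: list with one entry per 1-kb segment of a region,
    giving the identity (0 to 1) of its best match in non-repeat-masked
    bases, or None where there is no match.

    Returns a list of (start, end) segment indices, end exclusive, for runs
    of at least min_segments qualifying segments.
    """
    runs = []
    start = None
    for i, identity in enumerate(segment_identity):
        qualifies = identity is not None and identity >= min_identity
        if qualifies and start is None:
            start = i  # a new run begins
        elif not qualifies and start is not None:
            if i - start >= min_segments:
                runs.append((start, i))
            start = None  # the run is broken
    # Close a run that extends to the end of the region.
    if start is not None and len(segment_identity) - start >= min_segments:
        runs.append((start, len(segment_identity)))
    return runs


# Hypothetical region: 20 well-matched segments, a gap, then a short match.
region = [0.95] * 20 + [0.50] + [0.92] * 5
print(paralogous_runs(region))
```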
Details of FISH procedures are provided elsewhere4,28. Only locations of unique or low-copy portions of the clone are identified, because high-copy interspersed repetitive sequences were suppressed by addition of unlabelled Cot1 DNA. Replicate analyses indicate that the precision of FISH assignments to metaphase bands is roughly 5–10 Mb (1–1.5 band). A subset of 442 clones was ordered at very high (∼2–3-Mb) resolution11. FISH analyses were performed using DNA from the bacterial stock used for STS typing. Data that failed to replicate (for example, replicate FISH analyses of the same clone or different clones assigned the same marker) have been removed. Hybridization to arrays was carried out as described previously29 and by Snijders et al. (in preparation).
We thank M. Arcaro, M. Bakis, J. Burdick, J. Chang, H.-C. Chen, S. Chiu, Y. Fan, C. Harris, L. Haley, R. Hosseini, J. Kent, M. A. Leversha, J. Martin, L.-T. Nguyen, P. Quinn, Y. H. Ramsey, T. Reppert, L. J. Rogers, J. Shreve, J. Stalica, M. Wang, T. Weber, A. M. Yavor, J. Young, K. Zatloukal, and members of the TIGR BAC Ends Team for assistance. This work was supported by grants from NIH (NCI, NHGRI, NIDCD and NICHD), US DOE, NSF, HHMI, PPG, Merck Genome Research Institute, Vysis, Inc., and start-up funds provided by Obstetrics and Gynecology at Brigham and Women's Hospital.
About this article
Nature Biotechnology (2013)