Here we present a finished sequence of human chromosome 15, together with a high-quality gene catalogue. As chromosome 15 is one of seven human chromosomes with a high rate of segmental duplication1, we have carried out a detailed analysis of the duplication structure of the chromosome. Segmental duplications in chromosome 15 are largely clustered in two regions, on proximal and distal 15q; the proximal region is notable because recombination among the segmental duplications can result in deletions causing Prader-Willi and Angelman syndromes2,3. Sequence analysis shows that the proximal and distal regions of 15q share extensive ancient similarity4. Using a simple approach, we have been able to reconstruct many of the events by which the current duplication structure arose. We find that most of the intrachromosomal duplications seem to share a common ancestry. Finally, we demonstrate that some remaining gaps in the genome sequence are probably due to structural polymorphisms between haplotypes; this may explain a significant fraction of the gaps remaining in the human genome.
The present work describes the completion of a physical map, high-quality finished sequence, and gene catalogue for the euchromatic q arm of human chromosome 15, representing 2.9% of the human genome. The finished sequence contains 81,871,010 bases and is interrupted by nine euchromatic gaps and one gap containing the heterochromatic p arm and centromere regions (Fig. 1). The total size of the euchromatic gaps is estimated at 544 kilobases (kb) (Methods and Supplementary Table S1). These gaps remain despite the screening of genomic libraries containing a combined ∼53-fold physical coverage, and are refractory to current cloning and mapping technology; six are within or adjacent to large duplicated regions. Of the finished sequence, 74% was generated by the Broad Institute of MIT and Harvard (formerly the Whitehead Institute/MIT Center for Genome Research (WICGR)), 25% by the Multimegabase Sequencing Center (initially at the University of Washington, currently at the Institute for Systems Biology), and the remaining ∼1% by three other groups (Supplementary Table S2). The analyses here are referenced to NCBI Build 35; however, we have slightly improved this sequence (including closing one of the euchromatic gaps), and provide the updated clone path in Supplementary Table S3. Details of construction of the clone map and sequencing are described in the Supplementary Information. The short arm of chromosome 15, as in other acrocentric human chromosomes (chromosomes 13, 14, 21 and 22), is heterochromatic and was not sequenced as part of the Human Genome Project; it is estimated at 17 Mb (ref. 5) and contains arrays of ribosomal RNA genes, satellite sequences and other repeated sequences6.
We assessed the local accuracy of the clone path by aligning paired-end sequences from a human fosmid library (designated WIBR2, representing 10× physical coverage) to the finished sequence7,8. This analysis revealed no aberrant clones. In addition, an independent quality assessment exercise commissioned by the National Human Genome Research Institute9 estimated the accuracy of the finished sequence to be better than one error in 100,000 bases (J. Schmutz, personal communication).
Several analyses suggest that nearly the entire euchromatic region of chromosome 15 is present and accurately represented in the finished sequence. All genes in the RefSeq10 database (596 loci, 742 transcripts) previously mapped to chromosome 15 are present and complete in the finished sequence. Furthermore, the finished sequence shows excellent alignment to genetic and radiation hybrid maps (Supplementary Fig. S1). The genetic map11 shows perfect alignment, with no discrepancies among 125 sequence-based genetic markers (Supplementary Table S4). The radiation hybrid map12 contains only local discrepancies, owing to its lower resolution (Supplementary Table S5). A large gap in the radiation hybrid coordinates (254–280 cR) at ∼74 Mb in the physical map, near a region where chromosome breakage has been observed independently in multiple mammalian lineages (see below), is probably the result of non-random breakage in the generation of the radiation hybrid panel.
We produced a manually curated8 catalogue of genes, containing 695 gene loci (including all genes in RefSeq) and 250 pseudogene loci on chromosome 15. Table 1 classifies the genes according to standardized categories. The 3% of genes in the ‘novel’ and ‘putative’ categories were annotated based only on spliced expressed-sequence-tag (EST) evidence; some of these may prove to be pseudogenes. The full-length transcripts of known genes have an average length of 3,267 bp, with an average of 11.6 exons. Internal exon lengths average 156 bp. Gene loci have an average of 4.6 distinct transcripts, with 66% having at least two transcripts. These gene statistics are similar to recent reports8,13,14,15,16. Examples of genes that represent extremes of these distributions are described in the Supplementary Information. Most (74%) of the 250 pseudogenes are processed. In addition, we identified 9 transfer RNA genes (Supplementary Table S6) and found six known microRNAs mapping to chromosome 15 (Supplementary Table S7).
In most aspects of its landscape, chromosome 15 is close to genome-wide averages7. The overall gene density is 8.6 genes per Mb. There are 18 gene deserts (defined as 500 kb without an identified coding gene, Supplementary Table S8) comprising 14.9 Mb (∼18.3% of the chromosome). The overall G + C content is 42.2%, but varies substantially across the chromosome (Fig. 1b). Transposable element fossils cover 38.3% of the sequence. Chromosome 15 is also typical in its content of non-coding sequence conservation (see Supplementary Information).
Chromosome 15 is, however, one of seven autosomes that are significantly enriched in segmental duplications (defined as regions >1 kb that are not high-copy repeats and have >90% identity to another region in the genome17), with 8.8% of its euchromatin composed of such sequence (Supplementary Fig. S2). As with other heavily duplicated chromosomes, chromosome 15 has a large fraction of intrachromosomal duplication: 50% is strictly intrachromosomal, 30% is both intra- and interchromosomal, and 20% is solely interchromosomal (largely in the proximal 1.5 Mb). The proportion of purely interchromosomal duplication might be even lower, as some undetected tandem duplication may exist near the centromere (see below). Recombination among segmental duplications within the region 15q11–q13 gives rise to deletions that are known to cause Prader-Willi and Angelman syndromes2,3 (Supplementary Information).
We sought to investigate the duplication landscape of chromosome 15 by studying the relationships among the duplicated segments. Previous work has shown that a sequence within the Prader-Willi/Angelman syndrome region, termed LCR15 (ref. 4), is also duplicated on distal 15q (Supplementary Fig. S2). By extending our analysis to detect more ancient relationships (sequence identity less than 90%), we found much more extensive similarity among the duplicated sequences in both proximal and distal 15q (Fig. 1a). We clustered together segmental duplications containing related sequence (Methods) and found that most fell into a single large cluster, which we refer to as ‘class 1’. The class includes 67% of all bases in segmental duplications and 91% of all pairwise duplication events (as some bases reside within multiple independent events) (Supplementary Table S9).
Although the segmental duplications are related to one another in a complex fashion, we sought to identify a ‘core element’ that was present in many of the class 1 elements. We took the longest duplicated class 1 region (213 kb starting at 18.89 Mb, within the Prader-Willi/Angelman syndrome region) and aligned all duplicated regions of the chromosome to it, counting the number of different duplication regions that aligned to each base. We selected a core element that includes the highest peak of coverage (Supplementary Fig. S3); the element is 2,920 bp long and lies within the ∼15-kb LCR15 element.
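The coverage count described above can be illustrated with a minimal sketch. It assumes that each duplicated region has already been aligned to the 213-kb reference region and reduced to a half-open interval in reference coordinates; the function names and the toy intervals are hypothetical, and the rule used here (first run of maximal depth) is only one way to pick an element that includes the highest peak.

```python
from typing import List, Tuple

def coverage_profile(length: int, hits: List[Tuple[int, int]]) -> List[int]:
    """Count, for every reference base, how many aligned regions cover it.

    Each hit is a half-open interval [start, end) in reference coordinates
    and is assumed to represent one distinct duplication region.
    """
    diff = [0] * (length + 1)          # difference array: +1 at start, -1 at end
    for start, end in hits:
        start, end = max(0, start), min(length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    cov, running = [], 0
    for d in diff[:length]:            # prefix sum turns differences into depth
        running += d
        cov.append(running)
    return cov

def peak_interval(cov: List[int]) -> Tuple[int, int, int]:
    """Return (start, end, depth) of the first run of maximal depth (half-open)."""
    depth = max(cov)
    start = cov.index(depth)
    end = start
    while end < len(cov) and cov[end] == depth:
        end += 1
    return start, end, depth

# Toy example: three aligned regions on a 20-base reference.
toy_cov = coverage_profile(20, [(0, 10), (5, 15), (8, 12)])
toy_peak = peak_interval(toy_cov)
```

In this toy case the deepest stretch is bases 8 and 9, covered by all three regions. On real data the peak would be inspected by eye (as in Supplementary Fig. S3) before element boundaries were chosen.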
The human genome contains 41 nearly full-length copies of the core element: there are 37 on chromosome 15, two on the Y chromosome, and one each on chromosomes 2 and 10. To understand the origins of the element, we compared the core element to the dog18 and mouse19 genomes. The dog and mouse genomes each contain a single copy of the element, which is orthologous to the copy on human chromosome 2. The similarity among the sequences is shown in a phylogenetic tree (Fig. 2, see Methods). The copy on chromosome 2 is at the root of the human duplications, closest to mouse and dog, as would be expected from conserved synteny. The duplications on chromosome 15 fall into two distinct and well-separated branches: a proximal branch containing all the elements in the Prader-Willi/Angelman syndrome region (chromosome position 18–32 Mb), and a distal branch containing all the elements from 73 to 88 Mb, with a tight clustering of elements around 80–83 Mb. A further two repeats in the subtelomeric region (98–100 Mb) are closely related to the proximal branch. Pairwise divergence between elements in the two branches is ∼11%, indicating that they share an ancient origin followed by local duplications, but with no recent interaction between branches.
From the tree, it is possible to reconstruct the likely history of the core element. The sequence on chromosome 2 lies in the 3′ untranslated region (UTR) of a splice variant of the gene intersectin 2 (ITSN2). This sequence seems to have moved by retroposition to chromosome 10 (at 30.68 Mb), inserting immediately downstream of the 5′ coding sequence of an interchromosomally duplicated copy of GOLGA2 (the origin of which is on chromosome 9). A combined unit (15 kb, consisting of GOLGA2 and the ITSN2 UTR) was then copied to chromosome 15, where it has duplicated extensively. Finally, two copies exist on the arms of a large palindrome on the Y chromosome, and seem to have moved to the Y chromosome by segmental duplication of ∼40 kb of chromosome 15 (at 82.7 Mb).
We next sought to understand why the large regions of segmental duplication in proximal 15q (denoted ‘A’) and distal 15q (denoted ‘C’) are separated by a large stretch that contains almost no duplicated sequence (denoted ‘B’). Analysis of conserved synteny with other species allows a reconstruction of the history of chromosome 15 (Fig. 3). Briefly, the three segments were adjacent in the boreoeutherian ancestor (the common ancestor of Euarchontoglires and Laurasiatheria), but were found in the order A–C–B. In the primate lineage, the chromosome apparently underwent a single large inversion that separated segments A and C. (Details of the reconstruction and comparison to recent reports20,21 can be found in the Supplementary Information and Supplementary Fig. S4.) This suggests that the core element was transferred to chromosome 15 before the divergence of apes and Old World monkeys, and expanded locally (in the originally contiguous A–C region). The inversion subsequently separated regions A and C, and the element continued to expand separately in each region.
To test this hypothesis, we examined the current draft assembly of the rhesus macaque genome (rheMac1; R. Gibbs, personal communication). We found at least 12 nearly full-length copies of the core element that we added to the evolutionary tree (Supplementary Fig. S5). We also found unique orthologues of the copies on human chromosomes 2 and 10. The remaining macaque elements were split between the proximal and distal clusters, confirming that the element had already appeared and begun to duplicate on chromosome 15 before the divergence of Old World monkeys and apes. The human and macaque elements are grouped into separate clusters in both the proximal and distal branches, indicating that local duplication has continued to occur in both the human and macaque lineages.
The analysis of conserved synteny also reveals that the segmental duplications are closely associated with chromosomal rearrangements. Chromosome 15 has 15 human-specific breakpoints of conserved synteny, all of which are inversions. Of these, 13 occur in regions containing class 1 duplications. This suggests that the segmental duplications may have mediated the inversions and that these inversions may have helped to disperse the elements.
The class 1 core element serves as a useful marker for tracing chromosomal history. However, the ubiquity of the core element raises the possibility that it had a causal role in the process of segmental duplication on chromosome 15. The element is derived from a UTR on chromosome 2, of which at least 500 bases are highly conserved across mammals and thus are presumably functional. Moreover, many of the copies on chromosome 15 are transcribed: 13 known genes on chromosome 15 (all golgins or golgin-like proteins) contain this duplicated UTR, and another 16 transcripts stop just short of it (Supplementary Table S10). It will be interesting to investigate whether functional properties of the fusion element on chromosome 15 promote local duplication, and to explore whether this had significant implications for primate evolution.
Finally, we note that the segmental duplications represent the main challenge in closing the remaining gaps in the sequence of chromosome 15. Build 35 contains ten gaps, seven of which lie within or immediately adjacent to class 1 duplications (Fig. 1). In some cases, the duplicated sequences flanking the gaps are so similar (>99.7% identity) that they may represent allelic variants. Moreover, six of the seven duplication-associated gaps are adjacent to or within reported sites of copy-number polymorphism22,23 (Supplementary Table S1). We have recently been able to close one gap (at 82.7 Mb), decreasing the number of gaps to nine, by finding a previously missed overlap between two flanking clones; another clone spanning this gap carries an alternative haplotype with an additional 100 kb, including an 80-kb near-perfect duplication. Examination of three of the other gaps suggests that they might also be due to structural variation, although more work will be required to confirm this.
The finished sequence of chromosome 15 offers a window into the natural history of segmental duplications and the structural history of chromosomes. Notably, most of the intrachromosomal duplication involves a single class of duplicons. On the basis of these results, we suggest an important role for such duplicons in structural evolution and gene diversification.
Production of gene catalogue and annotation
The gene catalogue was produced as described previously8. Gene symbols were assigned by the HUGO Gene Nomenclature Committee for biologically characterized loci. A complete list of gene symbols from this paper can be found in Supplementary Table S11. Annotation was performed as described previously8. Our annotations are available from the Vertebrate Genome Annotation database (VEGA, http://vega.sanger.ac.uk/Homo_sapiens).
Segmental duplications were defined as pairs of regions of 90% or greater identity (excluding repeat-masked bases) that extend for 1 kb or more. The map of segmental duplications was prepared using a method adapted from ref. 17, by concatenating all-against-all MegaBlast24 alignments. A genome database was built using hard-masked sequence. This same hard-masked sequence was presented to MegaBlast as a probe, chromosome by chromosome. All alignments of 80% or better identity with expectation <10⁻⁴ were kept. Alignments were then concatenated if they were contiguous except for masked repeats. Unmasked gaps could be crossed but, to prevent over-merging, were penalized by being treated as bases of 50% identity. Final segments meeting the 1-kb length and 90% identity criteria were retained.
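The scoring of a concatenated segment can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used here: the `Block` type, the function names and the choice to count unmasked gap bases towards segment length are all hypothetical, and the chaining step that decides which alignments are contiguous is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Block:
    start: int        # half-open coordinates of one kept alignment
    end: int
    identity: float   # fraction of identical bases, 0..1 (kept if >= 0.80)

def score_chain(blocks: List[Block], unmasked_gap_bases: int,
                gap_identity: float = 0.5) -> Tuple[int, float]:
    """Length and length-weighted identity of a chain of alignment blocks.

    Unmasked bases skipped between blocks are charged as if they aligned at
    `gap_identity` (50%), which penalizes chains that bridge real sequence.
    Masked repeat bases are assumed to be excluded before this point.
    """
    aligned = sum(b.end - b.start for b in blocks)
    matches = sum((b.end - b.start) * b.identity for b in blocks)
    total = aligned + unmasked_gap_bases
    return total, (matches + gap_identity * unmasked_gap_bases) / total

def is_segmental_duplication(blocks: List[Block], unmasked_gap_bases: int,
                             min_len: int = 1000,
                             min_identity: float = 0.90) -> bool:
    """Apply the final 1-kb length and 90% identity criteria to a chain."""
    if not blocks:
        return False
    length, identity = score_chain(blocks, unmasked_gap_bases)
    return length >= min_len and identity >= min_identity

# Toy chain: two blocks separated by a short or a long unmasked gap.
toy_blocks = [Block(0, 600, 0.95), Block(650, 1100, 0.92)]
short_gap_ok = is_segmental_duplication(toy_blocks, unmasked_gap_bases=50)
long_gap_ok = is_segmental_duplication(toy_blocks, unmasked_gap_bases=400)
```

With a 50-base gap the toy chain scores about 91.7% identity and is retained; with a 400-base gap the same blocks fall to about 81.7% and are rejected, which is the over-merging guard described above.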
Duplication class clustering
Pairwise intrachromosomal duplications were defined as above. A pairwise duplication A∼A′ was considered to be in the same class as another pairwise duplication B∼B′ if B or B′ overlapped A or A′ by 150 bp or more. We extended this by transitive closure to build maximally linked sets (that is, if A∼A′ linked to B∼B′ and C∼C′, all were clustered, even if B∼B′ did not overlap C∼C′). The number of duplications in a class is counted as the number of distinct pairwise alignments X∼X′ that were clustered. The number of bases in a class is counted as the number of distinct bases covered by at least one pairwise duplication in that class.
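The transitive-closure clustering can be sketched with a union-find structure. The sketch assumes that each pairwise duplication is given as two half-open intervals on the same chromosome; the names and the toy coordinates are hypothetical, and the all-against-all comparison is quadratic, which is adequate for a single chromosome but not tuned for speed.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]          # half-open [start, end) on the chromosome
Pair = Tuple[Interval, Interval]    # one pairwise duplication A~A'

def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def cluster_pairs(pairs: List[Pair], min_overlap: int = 150) -> List[List[int]]:
    """Group pairwise duplications into classes by transitive closure."""
    parent = list(range(len(pairs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            # Link if either copy of one pair overlaps either copy of the other.
            if any(overlap(x, y) >= min_overlap
                   for x in pairs[i] for y in pairs[j]):
                parent[find(i)] = find(j)

    classes: Dict[int, List[int]] = {}
    for i in range(len(pairs)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())

def distinct_bases(pairs: List[Pair], members: List[int]) -> int:
    """Bases covered by at least one duplication in the class (union length)."""
    intervals = sorted(iv for i in members for iv in pairs[i])
    total = 0
    cur_start, cur_end = intervals[0]
    for s, e in intervals[1:]:
        if s > cur_end:
            total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    return total + (cur_end - cur_start)

toy_pairs = [
    ((0, 1000), (5000, 6000)),
    ((800, 2000), (9000, 10200)),      # overlaps pair 0 by 200 bp
    ((9900, 11000), (20000, 21100)),   # overlaps pair 1 by 300 bp, not pair 0
    ((30000, 31000), (40000, 41000)),  # unrelated
]
toy_classes = cluster_pairs(toy_pairs)
```

In the toy data, pairs 0 and 2 share no sequence directly but end up in one class through pair 1. The number of duplications in a class is then the length of its member list, and the number of bases is given by `distinct_bases`.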
Construction of core element phylogeny
Full-length or nearly full-length copies of the core element in human were identified by MegaBlast (release 2.2.11). Copies in the mouse and dog genomes were identified by MegaBlast followed by blastn (release 2.2.11) to refine the boundaries and extend the regions. Multiple alignments of the elements were generated with ClustalW (v.1.83). Pairwise and multiple alignment parameters were adjusted by reducing the gap extension penalty to 0.1 and replacing the standard DNA matrix with a custom matrix scoring 10 for any match, -5 for any mismatch, and 0 for any alignment to an unknown base (N). The alignments were output in phylip format and all gaps of length >1 converted to single indels by substitution of ‘?’ characters for all but the first ‘-’ in the gap to avoid generating disproportionately long branches for element copies with substantial deletion. Terminal gaps were also treated this way. Trees were built with the dnapars parsimony module of phylip (v.3.65)25. The tree represented is the first of 15 equally likely trees that differ only in the leaf placement of the seven nearly identical copies of the element at 80 and 82 Mb on chromosome 15.
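Two of the steps above lend themselves to a short sketch: the custom substitution scores and the conversion of multi-base gaps to single indels. The function names are hypothetical, the scoring function only mirrors the stated values (it is not a ClustalW matrix file), and the gap conversion is assumed to be applied to each aligned sequence string before the alignment is passed to dnapars.

```python
import re

def pair_score(a: str, b: str) -> int:
    """Custom DNA scores as stated: 10 for a match, -5 for a mismatch, 0 against N."""
    a, b = a.upper(), b.upper()
    if a == "N" or b == "N":
        return 0
    return 10 if a == b else -5

def collapse_gaps(aligned: str) -> str:
    """Keep the first '-' of every gap run and replace the rest with '?'.

    A run of k gap characters then counts as one indel event rather than k,
    because '?' is treated as missing data. Terminal gaps are handled the
    same way as internal ones.
    """
    return re.sub(r"-{2,}",
                  lambda m: "-" + "?" * (len(m.group()) - 1),
                  aligned)

toy_row = collapse_gaps("AC---GT-A--")
```

Here the three-base gap and the terminal two-base gap each keep a single `-`, while the isolated one-base gap is left unchanged.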
We thank L. Gaffney for help with figures and text. We are grateful to T. Furey for help with lists of genetic markers and placement of RefSeqs, and to K. Lindblad-Toh for sharing data from the genome projects of dog and opossum. Fluorescence in situ hybridization (FISH) data for opossum were provided by M. Breen. We thank the members of the Baylor College of Medicine Human Genome Sequencing Center, the J. Craig Venter Institute Joint Technology Center, and the Washington University Genome Sequencing Center for generation and early release of the assembly of the rhesus macaque genome. We thank the Sanger Institute for gap sizing by FISH. We also acknowledge the HUGO Gene Nomenclature Committee (S. Povey (chair), E. A. Bruford, V. K. Khodiyar, R. C. Lovering, M. J. Lush, T. P. Sneddon, C. C. Talbot Jr and M. W. Wright) for assigning official gene symbols. We are grateful to all members, present and past, of the Broad (and Whitehead) sequencing platform for their dedication and the consistent high quality of their data.
Relationship of the finished sequence map to genetic and radiation hybrid maps of chromosome 15.
Segmental duplication structure of human Chromosome 15.
Density plot(s) of the proximal element (coverage map).
Detailed history of the structural organization of human Chromosome 15.
Phylogenetic tree of core element with macaque included.
Segmental duplication structure of the Prader-Willi/Angelman syndrome region on proximal Chromosome 15.