Chromosome 5 is one of the largest human chromosomes and contains numerous intrachromosomal duplications, yet it has one of the lowest gene densities. This is partially explained by numerous gene-poor regions that display a remarkable degree of noncoding conservation with non-mammalian vertebrates, suggesting that they are functionally constrained. In total, we compiled 177.7 million base pairs of highly accurate finished sequence containing 923 manually curated protein-coding genes including the protocadherin and interleukin gene families. We also completely sequenced versions of the large chromosome-5-specific internal duplications. These duplications are very recent evolutionary events and probably have a mechanistic role in human physiological variation, as deletions in these regions are the cause of debilitating disorders including spinal muscular atrophy.
The US Department of Energy's interest in chromosome 5 emerged from a series of pilot studies begun at the Lawrence Berkeley National Laboratory focusing on a cluster of interleukin genes located at human 5q31. The insights gained from these detailed analyses of a single megabase of chromosome 5 illustrated how finished human sequence could contribute to gene annotation and how multi-mammalian sequence comparisons could lead to the sequence-based identification of noncoding elements possessing gene regulatory activities1,2,3. The finished sequence of chromosome 5 and its analysis, alone and in comparison to orthologous regions in other vertebrate genomes, now provides a chromosome-wide catalogue of genes and evolutionarily conserved noncoding sequences. Many of these observations, as well as clues to disease-causing deletions arising from the segmental duplication landscape of chromosome 5, can only now be appreciated upon finishing the sequence of this chromosome.
Mapping and sequencing
After the completion of the initial draft sequencing in 2001 we selected clones with an approach that integrated all of the public sequence, previously reported clone contigs4,5,6 including the Celera scaffolds7, bacterial artificial chromosome (BAC) and fosmid end sequences, and BACs isolated with an overgo hybridization strategy to close gaps between anchored contigs. The final version of the tiling path contains 1,763 clones (96% BACs), with four gaps remaining, all in the long arm. None of these remaining gaps is part of the large chromosome 5 duplications, and they appear to be unclonable in current vector systems. In addition, our standard strategy of seeding and then walking into gaps based on restriction maps proved unworkable in the duplication region of 5q13 associated with spinal muscular atrophy (SMA), and led to mapping errors with its primary insertion copy at 5p14 and secondary copy at 5p13. Therefore, we adopted a strategy of drafting high depth clone coverage from the single individual RPCI-11 BAC library in order to construct single haplotype paths spanning the duplications.
On the basis of internal and external quality checks, we estimate the accuracy of our finished sequence to exceed 99.99%8. In total, we finished 177,702,766 base pairs (bp) and estimate the total chromosome size, including the clone gaps and the recalcitrant centromeric and subtelomeric regions, to be 180.8 megabases (Mb). The finished sequence covers 99.9% of the euchromatic sequence and captures all known genes that were previously mapped to chromosome 5 (T. Furey, personal communication). The Stanford v.4 G3 radiation hybrid map9 was compared to the sequence and it matched the marker order well (see Supplementary Fig. S1). Thirteen (out of 442) unplaced markers were found to have been originally incorrectly assigned to chromosome 5. Recombination distances from the deCODE10 meiotic maps were compared to physical distances with recombination rates accurately tracking physical distance (see Supplementary Fig. S2), as previously reported for other chromosomes11,12,13.
We placed gene model transcripts on the chromosome 5 sequence and manually reviewed these models using previously described methods11 (Table 1). Ultimately, 923 protein-coding regions were verified as gene loci (see Supplementary Table S1 and http://www.jgi.doe.gov/human_chr5). These loci contain 1,598 full-length (or nearly full-length) transcripts, including partial evidence for additional splice variants (see Supplementary Information). Loci were placed in three categories: ‘known’, ‘novel’ and ‘pseudogenes’, consistent with our previous definitions11. Transcripts for which a unique open reading frame (ORF) could not be determined and putative genes defined by ab initio models but with no supporting experimental evidence were not considered valid. A total of 827 known loci were identified based on 2,203 RefSeq genes and other full-length complementary DNA sequences in GenBank, extending 36% of RefSeq transcripts by more than 50 bp at the 5′ end and 18% at the 3′ end, while maintaining the original ORF. Gene loci 3′ ends were not extended when the only evidence was from rare expressed sequence tag (EST) variants. Evidence for 55 novel loci was supported by full-length cDNA sequence, spliced ESTs, and/or similarity to known human or mouse gene sequences. Forty-one putative gene loci were modelled using orthologous mouse cDNA sequences. Twenty transfer RNA genes and four tRNA pseudogenes were predicted, similar in density to other finished chromosomes11,12,13.
The extent of alternative splicing was characterized based on the existing cDNA and EST data. Considering only messenger RNA sequences in GenBank, 1,598 distinct transcripts were identified, providing an average coverage of 1.7 annotated transcripts per locus (see Supplementary Information). These mRNAs provide strong evidence for alternative splicing of 408 (44%) of the 923 loci, each having two or more associated transcripts. A total of 577 pseudogenes and pseudogene fragments were also identified, representing two classes: (1) 98 non-processed pseudogenes that display a structure similar to the parent locus and probably resulted from genomic duplication events; (2) 479 processed pseudogenes that presumably resulted from viral retrotransposition of spliced mRNAs (see Supplementary Information). No significant bias towards over-representation of pseudogenes from a particular gene family was observed.
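The summary figures in this paragraph follow directly from the reported counts. The short Python sketch below recomputes them as a consistency check; the variable and function names are illustrative and are not part of the original annotation pipeline.

```python
# Counts quoted in the text; names are hypothetical, chosen for readability.
protein_coding_loci = 923
mrna_transcripts = 1598
loci_with_multiple_transcripts = 408
non_processed_pseudogenes = 98
processed_pseudogenes = 479


def summarize(loci, transcripts, multi, non_processed, processed):
    """Derive the per-locus summary statistics from raw counts."""
    return {
        "transcripts_per_locus": round(transcripts / loci, 1),
        "percent_alternatively_spliced": round(100 * multi / loci),
        "total_pseudogenes": non_processed + processed,
    }


summary = summarize(
    protein_coding_loci,
    mrna_transcripts,
    loci_with_multiple_transcripts,
    non_processed_pseudogenes,
    processed_pseudogenes,
)
print(summary)
```

The derived values agree with the 1.7 transcripts per locus, 44% of loci with two or more transcripts, and 577 pseudogenes stated above.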
Chromosome 5 genomic duplications
We performed a detailed analysis of duplicated sequence (≥ 90% identity and ≥1 kilobase (kb) length) by comparing chromosome 5 against the July 2003 human genome assembly. An estimated 3.49% (6.26 Mb) of the chromosome consists of segmental duplications, lower than the genome-wide average of 5.3% (see Supplementary Table S2 and Supplementary Fig. S4). Chromosome 5 segmental duplications, however, show a higher degree of sequence identity (≥ 97.5%), especially with other regions of chromosome 5 (see Supplementary Fig. S5), than do the duplications on other chromosomes. Intrachromosomal duplications are clustered in ten regions (Fig. 1) and represent the majority of the gene duplications, including the largest gene family: the protocadherins (see Supplementary Information). The high degree of sequence identity underlying most of these intrachromosomal genomic duplications suggests that these structures are relatively recent duplications or gene conversion events that emerged during the separation of humans and the great apes (see Supplementary Fig. S3 and Supplementary Table S2).
Subtelomeric and pericentromeric biases have been reported for segmental duplications for other human chromosomes. Despite the fact that large tracts of alpha-satellite DNA have been sequenced on both chromosomal arms near the centromere, there is little evidence for extensive pericentromeric duplication, with 5p11 showing almost a complete absence of duplications. A single duplication in 5q11 (96% identity over 250 kb) between chromosomes 1 and 5 accounts for nearly all pericentromeric duplicated bases. The pericentromeric region of chromosome 5, along with 19q11, may define a duplication-quiescent model of pericentromeric organization. The telomeric regions do show extensive interchromosomal duplications (Fig. 1), with 27% (2.48 out of 9.08 Mb) of all interchromosomal alignments occurring within 2 Mb of the long arm telomeric repeat sequence (see Supplementary Table S3).
SMA duplication region
One of the most duplicated regions on chromosome 5 occurs in a 1–2-Mb interval in 5q13.3. Homozygous deletions of the SMN1 gene and variable copies of the SMN2 duplication in this region have been associated with various forms of spinal muscular atrophy and susceptibility to the disease14,15. Analysis of carriers and controls suggests extreme locus variability, but the underlying structural variation has never been documented at the sequence level16. We identified a complex arrangement of 311 pairwise alignments representing the SMA region (Fig. 1). On average, the duplications are long (∼ 200 kb) and show a high degree of identity (98.66%). Duplications in this region include interchromosomal duplications, all of which map to chromosome 6, with three very large tandem (> 99.5% identity) and other various interspersed intrachromosomal duplications (Fig. 2). Interestingly, this region is enriched in genes. We annotated 14 loci in this region, including SERF1 (small EDRK-rich factor 1), BIRC1 (baculoviral IAP repeat-containing 1) and SMN (survival of motor neuron), the gene for SMA.
During the sequencing and assembly of this region, we generated a consensus sequence for a second haplotype variant from the RPCI-11 BAC library. Both haplotypes represent high-quality finished sequence and differ only by a remaining ∼50-kb clone gap within SMAvar2. Sequence comparison of these regions (SMAvar1 against SMAvar2) revealed extensive variation. At least two large-scale rearrangements (> 400 kb) and multiple smaller insertion/deletion events are required to reconstruct an ancestral haplotype. Although there are many scenarios for the evolution of these variants, one explanation may be that a portion of the SMAvar2 region (0.3–0.9 Mb) was inverted (68.9–69.4 Mb) and subsequently duplicated in SMAvar1 (69.8–70.4 Mb). Such extensive structural variation between haplotypes may not be uncommon in regions of extensive segmental duplication.
To understand further the evolution and functional sequences of human chromosome 5, we performed comparative analyses against the available chimpanzee, mouse, rat, chicken, frog (Xenopus tropicalis) and fish (Fugu rubripes) draft genomes. These comparisons revealed numerous large-scale chromosomal rearrangement events occurring since each of these species' last common ancestor with humans, as well as a variety of nonrandomly distributed conserved noncoding regions (Fig. 3a). Additional analyses of the distribution of genes and conserved noncoding sequences along the length of the chromosome support the existence of large gene-poor regions with highly conserved noncoding sequences that may regulate genes from a distance. Furthermore, we examined conservation in a comparative analysis of the extensively studied interleukin gene cluster.
By building segmental maps from DNA alignments of all the vertebrate species described above, we were able to confirm and extend previous homologous chromosomal relationships with human chromosome 5. Whereas recent experimental studies indicate that large-scale rearrangements (40–175 kb) have frequently occurred during primate genome evolution17, our comparison of chromosome 5 and the recent chimpanzee draft genome sequence (International Chimpanzee Genome Sequencing Consortium, manuscript in preparation) uncovered even larger-scale events. For example, we found a large 80-Mb inversion in comparison to the chimpanzee genome, homologous to almost half of human chromosome 5 between 5p14 and 5q15 (Fig. 3a). This finding using the genomic draft data independently confirms previous fluorescence in situ hybridization (FISH) experiments18. It has been proposed that these large-scale rearrangements created barriers to fertile mating and triggered the speciation that separated these two lineages19. Comparison with the mouse genome sequence20 yielded 142 chromosomal rearrangements ranging in size from 200 kb to 17 Mb. Between human and chicken, we found that one-third of chromosome 5 is homologous to the chicken sex chromosome Z21, further indicating that sex chromosomes have evolved independently after the avian and mammalian split some 300 million years ago22.
In addition to exploring the syntenic relationship between chromosome 5 and the chimpanzee draft assembly, we catalogued sequence changes between these two primates. To explore the constraint on human–chimpanzee evolution in noncoding regions, we compared the number of nucleotide substitutions in coding sequences, as well as in noncoding regions conserved and not conserved in rodents. We found a substitution rate of 0.0067 changes per nucleotide in coding sequences, 0.0091 in noncoding regions conserved in rodents, and 0.015 in noncoding regions not conserved in rodents. The decreased substitution rate in coding sequences and noncoding sequences conserved in rodents (compared to noncoding regions not conserved in rodents) supports the theory that both of the former categories are under evolutionary constraint. This also supports the theory that human/chimpanzee coding and noncoding sequences conserved in rodents have been under moderate selective constraint since the last common human/chimpanzee ancestor. We next compared the patterns of variation within human and chimpanzee exons to identify genes potentially under positive selection in the human lineage as reported in ref. 23. We found 21 genes randomly distributed over chromosome 5 displaying a P-value less than 0.01 for an increased evolutionary rate in humans. Notably, the two highest ranked genes (FBN2 and SQSTM1) are both linked to human diseases. Mutations in FBN2 cause pathologies similar to Marfan syndrome (FBN1), whereas SQSTM1 has been linked to Paget's disease of the bone24. As the chimpanzee genome reaches a further draft state, a similar complete re-analysis of the entire human gene set will probably yield large numbers of quickly evolving genes, which may explain unique aspects of human biology.
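As an illustration of how a per-nucleotide substitution rate can be obtained from a pairwise alignment, the Python sketch below counts mismatches over ungapped aligned positions. It is a simplified stand-in (a raw proportion of differing sites), not the procedure used in this study; the function name and the toy sequences are assumptions made for the example. The final lines express the rates reported above relative to the rate in noncoding regions not conserved in rodents.

```python
def substitution_rate(aligned_a, aligned_b, gap="-"):
    """Fraction of compared positions that differ between two aligned sequences.

    Positions with a gap or an ambiguous base (N) in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    changes = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap or a == "N" or b == "N":
            continue
        compared += 1
        if a != b:
            changes += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return changes / compared


# Toy alignment (not real data): one substitution over 12 comparable sites.
toy_human = "ACGTACGTAC-GT"
toy_chimp = "ACGTACCTACAGT"
rate = substitution_rate(toy_human, toy_chimp)
print(f"toy rate: {rate:.4f} changes per nucleotide")

# Rates reported in the text, scaled to the least constrained category.
reported = {
    "coding": 0.0067,
    "noncoding_conserved_in_rodents": 0.0091,
    "noncoding_not_conserved_in_rodents": 0.015,
}
baseline = reported["noncoding_not_conserved_in_rodents"]
relative = {name: round(value / baseline, 2) for name, value in reported.items()}
print(relative)
```

On this scale, coding sequence and rodent-conserved noncoding sequence evolve at roughly 0.45 and 0.61 of the rate seen in non-conserved noncoding sequence.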
To annotate functional elements, we identified slowly evolving regions, presumably under evolutionary constraint25,26, through DNA comparison with rodent, chicken, Xenopus and Fugu (P-value <0.01). A chromosome-wide analysis resulted in 15,325 discrete noncoding regions conserved between human/mouse/rat, 2,429 between human/mouse/chicken, 258 between human/mouse/Xenopus and 213 between human/mouse/Fugu. We found that the distribution of human/mouse/Fugu conserved noncoding sequences was highly uneven along the chromosome (Fig. 3b), with 42 centred in 5p15 around an Iroquois homeobox (IRX) gene family. These discrete evolutionarily conserved sequences represent a prioritized substrate for future experimental studies to elucidate their function and potential role in gene regulation.
Recent work has shown that a significant fraction of noncoding elements conserved between human and Fugu has gene regulatory activity even though many are located at great distances from the genes whose expression they control27. In addition to their location between conserved flanking genes, evidence to support distant gene regulatory sequences is found in the maintenance of long syntenic blocks across distant evolutionary species28. To determine whether such regions exist on human chromosome 5, we built a segmental homology map between human, chimp, mouse, rat and chicken. This map revealed two segments larger than 3 Mb that do not contain any evolutionary break points or insertions (> 250 kb) within all examined species. Notably, despite this high level of conservation, these two large segments have very few known genes and overlap the extreme gene-poor regions at 5p15 (3.1 Mb) and 5q34 (5.0 Mb). In addition, each is highly enriched for conserved noncoding sequences with distantly related non-mammalian vertebrates (Fig. 3c). In contrast to the interleukin cluster (described below) and despite being gene poor, the 5p15 region contains 378, 220 and 42 noncoding elements conserved in rodents, chicken and Fugu, respectively3. A similar level of noncoding conservation was observed in the 5q34 gene desert region containing 1,087 noncoding elements conserved with rodents, 301 with chicken, but none with Fugu. Although functional studies are needed to determine whether these ancient conserved sequences regulate the limited number of genes in these regions, it is interesting to note that the 5p15 region contains a cluster of IRX genes that have multiple roles during pattern formation in vertebrate development. The high density of conserved noncoding elements with extended synteny in these gene-poor regions suggests that these regions contain elements that regulate distant genes.
The interleukin gene cluster on 5q31 is a region of particular interest to immunologists because of the presence of five haematopoietic growth factor genes (IL3, CSF2, IL5, IL13 and IL4) and two quantitative trait loci associated with atopic asthma and Crohn's disease susceptibility. From the comparative analysis of this 1 Mb of sequence, we found that 140 of the 190 (76%) human coding exons overlap regions conserved in mouse. This number decreased to 126 (66%) when examining human/mouse/chicken conservation (P-value <0.01; Fig. 3d; see also Supplementary Table S4). Consistent with the known fast evolutionary rate of the interleukin genes, most of the interleukin exons (18 of 21) are among the exon sequences that lack similarity between the species. In the analysis of noncoding sequences, we found 83 conserved human/mouse elements that include two previously characterized gene enhancers (CNS-1 and CNS-7)22. One of these elements is more highly conserved than CNS-1 and CNS-7, yet remains functionally undefined. In addition, we found six human/mouse/chicken conserved noncoding sequences, one of which is also conserved in Xenopus.
Not long after the concept of using anonymous polymorphic DNA markers to localize disease loci was proposed, linkage to many diseases on chromosome 5 was found, and positional cloning and other strategies rapidly isolated the genes for these clearly segregating disorders. So far, mutations in 66 specific genes are known for mendelian diseases (see Supplementary Table S5); an additional 14 single-gene diseases have been mapped to chromosome 5 but have not yet been linked to specific genes. In one of the first examples of a study taking advantage of linkage disequilibrium to positionally clone a gene, ref. 29 identified the DTD gene mutated in diastrophic dysplasia in the Finnish population in 1994. Identification of mutations in the growth hormone receptor gene, at 5p12-p13, in Laron dwarfism was an early case of ‘positional candidate cloning’, in which the gene was cloned and its location known before mapping the trait30. In addition to SMA, microdeletions in a duplicated region in 5q35 cause Sotos syndrome, a debilitating disorder that results in cranial overgrowth and mental retardation31, in which the duplication is thought to mediate severity32. The availability of this completed sequence will further advance our understanding of human disease, and the rate at which disease genes are identified and cloned with causative mutations should be greatly accelerated.
Mapping and sequencing
We seeded chromosome 5 with P1, PAC and Caltech BAC clones anchored to a set of 1,645 radiation hybrid markers and known genes, mapping 5,392 clones to chromosome 5, of which 4,943 were localized by FISH. After constructing a single enzyme restriction digest map, we chose a minimal tiling path. For the SMA duplication regions, hybridization probes were designed at 50-kb intervals across the working maps, with additional probes for each uniquely identified duplicon, and screened against RPCI-11. Results were binned and ∼40% of positives were selected for sequencing. Single haplotype maps were constructed by sequence analysis, relying on >30-kb alignments with zero or one discrepancy and multiple clone depth. For the complex 5q13 copy, we used an iterative cycle of probing, sequencing, direct repeat resolution, finishing and re-analysis.
We generated sequence by using a clone-by-clone shotgun sequencing strategy33 followed by finishing with a custom primer approach. BAC DNA was sheared by using a Hydroshear Instrument (GeneMachines), size selected (3–4 kb) and subcloned into the vector pUC18. Randomly selected subclones were sequenced in both directions using universal primers and BigDye Terminator chemistry to an average depth of ×8. Sequences were assembled and edited by using the Phred/Phrap/Consed suite of programs34,35. After manual inspection of the assembled sequences, clones were finished by re-sequencing and by sequencing from plasmid subclones or the large insert clone by using custom primers. All finishing reactions were performed with dGTP BigDye Terminator chemistry (Applied Biosystems). Clones with high repeat content or that showed considerable bias when cloned into pUC18 had additional 8–10-kb libraries constructed in a low copy number vector. Recalcitrant areas and difficult-to-sequence gaps were closed with sequence data derived from transposon sequencing, small insert shatter libraries36, or PCR. Each clone was finished according to the agreed international standard for the human genome (http://genome.wustl.edu/Overview/g16stand.php).
Genetic markers were placed on the genomic sequence using electronic PCR37. Markers were allowed to have up to three mismatches and were subsequently verified by placing the STS sequence (downloaded from UniSTS) via NCBI Megablast using a drop-off value of 180, a match reward of 10, a gap penalty of -20, and a word size of 22.
Pseudogenes were defined as gene models built by homology to known human genes where alignment between the model and the homologue shows at least one stop codon or frameshift mutation. For the fragments of chromosome 5 genomic sequence that were masked of repeats by using RepeatMasker (A. Smit and P. Green, unpublished data)38, we identified homology to human IPI proteins by using NCBI BLASTX. For each fragment of genomic sequence homologous to an IPI protein, we built gene models by using the GeneWise program. The overlapping gene models were clustered and the alignment of the top-scoring model with its human homologue was analysed for the presence of stop codons and frameshifts. The models were then manually analysed to confirm pseudogene status. Sequences of 431 processed pseudogenes that had been identified previously39 were mapped to the genomic sequence of chromosome 5 by using the BLAT tool. Loci with multi-exon mapping, overlaps with the pseudogenes described above, and simple repeats identified by RepeatMasker were eliminated. Pseudogene status of the remaining sequences was manually validated.
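The defining test, at least one stop codon or frameshift in the alignment between a model and its homologue, can be sketched in a few lines of Python. The sketch below works on a toy nucleotide-level alignment rather than the protein-to-DNA alignments produced by GeneWise, so it is a simplified illustration only; all function names and sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gap_runs(aligned, gap="-"):
    """Return the lengths of consecutive gap runs in an aligned sequence."""
    runs, current = [], 0
    for ch in aligned:
        if ch == gap:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def has_frameshift(aligned_parent, aligned_model):
    """A gap run whose length is not a multiple of 3 shifts the reading frame."""
    runs = gap_runs(aligned_parent) + gap_runs(aligned_model)
    return any(length % 3 != 0 for length in runs)


def has_premature_stop(aligned_model):
    """Look for an in-frame stop codon before the final codon of the model."""
    seq = aligned_model.replace("-", "").upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def looks_like_pseudogene(aligned_parent, aligned_model):
    """Flag a model that carries a frameshift or a premature stop codon."""
    return has_frameshift(aligned_parent, aligned_model) or has_premature_stop(aligned_model)


# Toy sequences (not real genes).
parent = "ATGGCTGAAAAATAA"          # ATG GCT GAA AAA TAA
intact_model = "ATGGCTGAGAAATAA"    # synonymous change only
stop_model = "ATGGCTTAAAAATAA"      # third codon mutated to TAA
shifted_model = "ATGGCT-AAAAATAA"   # single-base deletion

for name, model in [("intact", intact_model), ("stop", stop_model), ("shifted", shifted_model)]:
    print(name, looks_like_pseudogene(parent, model))
```

In practice such automatic flags would only nominate candidates; as described above, pseudogene status was confirmed by manual analysis.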
Segmental duplication analysis
We used a BLAST-based detection scheme40 to identify all pairwise similarities representing duplicated regions (≥ 1 kb and ≥90% identity) within the finished sequence of chromosome 5 and compared to all other chromosomes in the NCBI genome assembly (build 34). A total of 1,818 pairwise alignments representing 16.57 Mb of aligned base pairs and 6.26 Mb of non-redundant duplicated bases were analysed on chromosome 5. The program Parasight (J. A. Bailey, unpublished data) was used to generate images of pairwise alignments. We also analysed pairwise alignments for per cent identity and the number of aligned bases. Satellite repeats were detected by using RepeatMasker (version 15 May 2002) on slow settings. Analysis of haplotype structural variation was performed using the program Miropeats (threshold = 7,000)41.
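The distinction between aligned base pairs and non-redundant duplicated bases arises because pairwise alignments overlap on the chromosome. A minimal Python sketch of the bookkeeping is given below: alignments are filtered by the stated thresholds (≥1 kb and ≥90% identity) and overlapping intervals are then merged before counting. The tuple layout, function names and toy coordinates are assumptions for illustration and do not reproduce the detection scheme of ref. 40.

```python
def filter_alignments(alignments, min_length=1000, min_identity=90.0):
    """Keep alignments meeting the length and per cent identity thresholds.

    Each alignment is a (start, end, percent_identity) tuple with half-open
    coordinates on the chromosome of interest.
    """
    return [
        (start, end, identity)
        for start, end, identity in alignments
        if end - start >= min_length and identity >= min_identity
    ]


def aligned_bases(alignments):
    """Sum of alignment lengths, counting overlapping bases repeatedly."""
    return sum(end - start for start, end, _ in alignments)


def nonredundant_bases(alignments):
    """Number of distinct bases covered after merging overlapping intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted((s, e) for s, e, _ in alignments):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Toy alignments (not real data).
toy = [
    (0, 5000, 98.7),
    (3000, 9000, 99.6),      # overlaps the first alignment
    (20000, 20500, 99.0),    # shorter than 1 kb, dropped
    (30000, 32000, 88.0),    # below 90% identity, dropped
    (40000, 41500, 92.0),
]
kept = filter_alignments(toy)
total_aligned = aligned_bases(kept)
total_nonredundant = nonredundant_bases(kept)
print(len(kept), total_aligned, total_nonredundant)
```

In the toy example the three retained alignments span 12,500 aligned bases but cover only 10,500 distinct bases, the same kind of gap that separates the 16.57 Mb of aligned base pairs from the 6.26 Mb of non-redundant duplicated bases reported above.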
In this work, we used the following genomic assembly builds: chimpanzee November 2003, mouse October 2003, rat June 2003, chicken February 2004 (from http://genome.ucsc.edu), X. tropicalis v1.0 and F. rubripes v3.0 (from http://jgi.doe.gov/). All the segmental homology maps in n-dimensions were computed using PARAGON (v2.13; O. Couronne, unpublished data). As input for PARAGON, we used BLASTZ (v6)42 DNA pairwise alignments of all the species to human. Slowly evolving regions were extracted from the alignments using PEAK-VISTA (P-value <0.01; S. Prabhakar, unpublished data). We built a four-dimension human/chimp/mouse/rat segmental homology map with PARAGON, aligned all the segments with MLAGAN (v12)43 and computed the slowly evolving conserved regions with PEAK-VISTA. Interleukin homology among species was extracted from the PARAGON segmental map, built with MLAGAN multiple alignments; the slowly evolving conserved regions were extracted with RANK-VISTA.
Frazer, K. A. et al. Computational and biological analysis of 680 kb of DNA sequence from the human 5q31 cytokine gene cluster region. Genome Res. 7, 495–512 (1997)
Symula, D. J. et al. Functional screening of an asthma QTL in YAC transgenic mice. Nature Genet. 23, 241–244 (1999)
Loots, G. G. et al. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288, 136–140 (2000)
Church, D. M., Yang, J., Bocian, M., Shiang, S. & Wasmuth, J. J. A high-resolution physical and transcript map of the Cri du Chat region of human chromosome 5p. Genome Res. 7, 787–801 (1997)
Puechberty, J. et al. Genetic and physical analyses of the centromeric and pericentromeric regions of human chromosome 5: Recombination across 5cen. Genomics 56, 274–287 (1999)
Riethman, H. C. et al. Integration of telomere sequences with the draft human genome sequence. Nature 409, 948–951 (2001)
Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001)
Schmutz, J. et al. Quality assessment of the human genome sequence. Nature 429, 365–368 (2004)
Olivier, M. et al. A high-resolution radiation hybrid map of the human genome draft sequence. Science 291, 1298–1302 (2001)
Kong, A. et al. A high-resolution recombination map of the human genome. Nature Genet. 31, 241–247 (2002)
Grimwood, J. et al. The DNA sequence and biology of human chromosome 19. Nature 428, 529–535 (2004)
Heilig, R. et al. The DNA sequence and analysis of human chromosome 14. Nature 421, 601–607 (2003)
Hillier, L. W. et al. The DNA sequence of human chromosome 7. Nature 424, 157–164 (2003)
Melki, J. et al. De novo and inherited deletions of the 5q13 region in spinal muscular atrophies. Science 264, 1474–1477 (1994)
Monani, U. et al. A single nucleotide difference that alters splicing patterns distinguishes the SMA gene SMN1 from the copy gene SMN2. Hum. Mol. Genet. 8, 1177–1183 (1999)
Chen, Q. et al. Sequence of a 131-kb region of 5q13.1 containing the spinal muscular atrophy candidate genes SMN and NAIP. Genomics 48, 121–127 (1998)
Locke, D. P. et al. Large-scale variation among human and great ape genomes determined by array comparative genomic hybridization. Genome Res. 13, 347–357 (2003)
Yunis, J. J. & Prakash, O. The origin of man: a chromosomal pictorial legacy. Science 215, 1525–1530 (1982)
Noor, M. A., Grams, K. L., Bertucci, L. A. & Reiland, J. Chromosomal inversions and the reproductive isolation of species. Proc. Natl Acad. Sci. USA 98, 12084–12088 (2001)
Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002)
Groenen, M. A. et al. A consensus linkage map of the chicken genome. Genome Res. 10, 137–147 (2000)
Nanda, I. et al. 300 million years of conserved synteny between chicken Z and human chromosome 9. Nature Genet. 21, 258–259 (1999)
Clark, A. G. et al. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302, 1960–1963 (2003)
Hocking, L. J. et al. Domain-specific mutations in sequestosome 1 (SQSTM1) cause familial and sporadic Paget's disease. Hum. Mol. Genet. 11, 2735–2739 (2002)
Pennacchio, L. A. & Rubin, E. M. Genomic strategies to identify mammalian regulatory sequences. Nature Rev. Genet. 2, 100–109 (2001)
Ghanem, N. et al. Regulatory roles of conserved intergenic domains in vertebrate Dlx bigene clusters. Genome Res. 13, 533–543 (2003)
Nobrega, M. A., Ovcharenko, I., Afzal, V. & Rubin, E. M. Scanning human gene deserts for long-range enhancers. Science 302, 413 (2003)
Flint, J. et al. Comparative genome analysis delimits a chromosomal domain and identifies key regulatory elements in the alpha globin cluster. Hum. Mol. Genet. 10, 371–382 (2001)
Hästbacka, J. et al. The diastrophic dysplasia gene encodes a novel sulfate transporter: positional cloning by fine-structure linkage disequilibrium mapping. Cell 78, 1073–1087 (1994)
Barton, D. E., Foellmer, B. E., Wood, W. I. & Francke, U. Chromosome mapping of the growth hormone receptor gene in man and mouse. Cytogenet. Cell Genet. 50, 137–141 (1989)
Kurotaki, N. et al. Haploinsufficiency of NSD1 causes Sotos syndrome. Nature Genet. 30, 365–366 (2002)
Kurotaki, N. et al. Fifty microdeletions among 112 cases of Sotos syndrome: low copy repeats possibly mediate the common deletion. Hum. Mutat. 22, 378–387 (2003)
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001)
Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using Phred. I. accuracy assessment. Genome Res. 8, 175–185 (1998)
Gordon, D., Abajian, C. & Green, P. Consed: A graphical tool for sequence finishing. Genome Res. 8, 195–202 (1998)
McMurray, A. A., Sulston, J. E. & Quail, M. A. Short insert libraries as a method of problem solving in genome sequencing. Genome Res. 8, 562–566 (1998)
Schuler, G. D. Sequence mapping by electronic PCR. Genome Res. 7, 541–550 (1997)
Jurka, J. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 16, 418–420 (2000)
Zhang, Z., Harrison, P. M., Liu, Y. & Gerstein, M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13, 2541–2558 (2003)
Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 1005–1017 (2001)
Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput. Appl. Biosci. 11, 615–619 (1995)
Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res. 13, 103–107 (2003)
Brudno, M. et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13, 721–731 (2003)
Kuroda-Kawaguchi, T. et al. The AZFc region of the Y chromosome features massive palindromes and uniform recurrent deletions in infertile men. Nature Genet. 29, 279–286 (2001)
We thank the International Chimpanzee Sequencing Consortium for pre-publication access to and permission to analyse the relevant portions of the chimpanzee genomic sequence, and the Washington University Genome Sequencing Center for pre-publication access to the chicken genomic assembly. We also thank M. Christensen, P. Butler and E. Fields for technical support, D. Gordon of the University of Washington for his assistance in developing and customizing finishing tools, T. Furey and G. Schuler for their efforts towards assessing the quality and completeness of our assembly, and P. DeJong for the construction of genomic resources. This work was performed under the auspices of the US DOE's Office of Science, Biological and Environmental Research Program, by the University of California, Lawrence Livermore National Laboratory, Lawrence Berkeley National Laboratory and Stanford University.
The authors declare that they have no competing financial interests.
Additional information to what is presented in the text. (DOC 33 kb)
Supplementary Figure 1
A comparison of the Stanford G3 radiation hybrid map v4 to the finished chromosome 5 sequence. (JPG 35 kb)
Supplementary Figure 2
Recombination distance from the deCODE genetic map compared to physical sequence of chromosome 5. (JPG 107 kb)
Supplementary Figure 3
Sequence Similarity of Segmental Duplications: For all pairwise alignments, the total number of aligned bases was calculated and binned based on percent sequence identity. Sequence identity distributions for interchromosomally (red) and intrachromosomally (blue) duplicated bases are shown. (PPT 38 kb)
Supplementary Figure 4
Distribution of Segmental Duplications. A schematic of chromosome 5 segmental duplications depicting the location of interchromosomal (red) and intrachromosomal (blue) duplicated sequence. Each horizontal line represents 5 Mb of sequence, with tick marks every 500 kb. Sequencing gaps are represented as discontinuities within the horizontal line. The centromere is shown as a purple bar. Duplications detected by whole genome shotgun sequence are represented as green bars above the chromosome sequence. (PDF 45 kb)
Supplementary Figure 5
Sequence Identity of Segmental Duplications on Chromosome 5. Interchromosomal (red) and intrachromosomal (blue) duplications are shown to scale along the horizontal line in 2-Mb increments. Green bars above the horizontal line correspond to duplications detected by another method, whole genome shotgun sequence detection6. The underlying pairwise alignments of segmental duplications (>90% identity, >1 kb) are depicted as a function of % identity below the horizontal line. Different colors correspond to the location of the pairwise alignment on different human chromosomes (for example, chromosome 5 is shown as tan). (PDF 132 kb)
Supplementary Table 1
The gene catalog for chromosome 5. PPG=processed pseudogene, NPG=non-processed pseudogene. (XLS 192 kb)
Supplementary Table 2
Chromosome 5 bases involved in segmental duplication and pairwise alignment. Percent of non-redundant duplications are based on the total non-gap genome size 2,865,069,170 and chromosome 5 size 177,702,766. All segmental duplications have at least 1kb aligned bases with 90 ~ 100% identities. (XLS 18 kb)
Supplementary Table 3
Segmental duplication in pericentromeric and telomeric regions. Segmental duplications within 2 Mb of the centromere and within 2 Mb of the chromosome ends are counted as pericentromeric and telomeric, respectively. (XLS 16 kb)
Supplementary Table 4
Interleukin locus non exonic conserved regions in human/mouse and human/chicken. (XLS 20 kb)
Supplementary Table 5
Mendelian Disease genes on Chromosome 5 (from OMIM). (DOC 153 kb)
Schmutz, J., Martin, J., Terry, A. et al. The DNA sequence and comparative analysis of human chromosome 5. Nature 431, 268–274 (2004). https://doi.org/10.1038/nature02919