The DNA sequence and comparative analysis of human chromosome 5

Schmutz, Jeremy; Martin, Joel; Terry, Astrid; Couronne, Olivier; Grimwood, Jane; Lowry, Steve; Gordon, Laurie A.; Scott, Duncan; Xie, Gary; Huang, Wayne; Hellsten, Uffe; Tran-Gyamfi, Mary; She, Xinwei; Prabhakar, Shyam; Aerts, Andrea; Altherr, Michael; Bajorek, Eva; Black, Stacey; Branscomb, Elbert; Caoile, Chenier; Challacombe, Jean F.; Man Chan, Yee; Denys, Mirian; Detter, John C.; Escobar, Julio; Flowers, Dave; Fotopulos, Dea; Glavina, Tijana; Gomez, Maria; Gonzales, Eidelyn; Goodstein, David; Grigoriev, Igor; Groza, Matthew; Hammon, Nancy; Hawkins, Trevor; Haydu, Lauren; Israni, Sanjay; Jett, Jamie; Kadner, Kristen; Kimball, Heather; Kobayashi, Arthur; Lopez, Frederick; Lou, Yunian; Martinez, Diego; Medina, Catherine; Morgan, Jenna; Nandkeshwar, Richard; Noonan, James P.; Pitluck, Sam; Pollard, Martin; Predki, Paul; Priest, James; Ramirez, Lucia; Retterer, James; Rodriguez, Alex; Rogers, Stephanie; Salamov, Asaf; Salazar, Angelica; Thayer, Nina; Tice, Hope; Tsai, Ming; Ustaszewska, Anna; Vo, Nu; Wheeler, Jeremy; Wu, Kevin; Yang, Joan; Dickson, Mark; Cheng, Jan-Fang; Eichler, Evan E.; Olsen, Anne; Pennacchio, Len A.; Rokhsar, Daniel S.; Richardson, Paul; Lucas, Susan M.; Myers, Richard M.; Rubin, Edward M.

doi:10.1038/nature02919

Article
Published: 16 September 2004

Nature volume 431, pages 268–274 (2004)

Abstract

Chromosome 5 is one of the largest human chromosomes and contains numerous intrachromosomal duplications, yet it has one of the lowest gene densities. This is partially explained by numerous gene-poor regions that display a remarkable degree of noncoding conservation with non-mammalian vertebrates, suggesting that they are functionally constrained. In total, we compiled 177.7 million base pairs of highly accurate finished sequence containing 923 manually curated protein-coding genes including the protocadherin and interleukin gene families. We also completely sequenced versions of the large chromosome-5-specific internal duplications. These duplications are very recent evolutionary events and probably have a mechanistic role in human physiological variation, as deletions in these regions are the cause of debilitating disorders including spinal muscular atrophy.

Main

The US Department of Energy's interest in chromosome 5 emerged from a series of pilot studies begun at the Lawrence Berkeley National Laboratory focusing on a cluster of interleukin genes located at human 5q31. The insights gained from these detailed analyses of a single megabase of chromosome 5 illustrated how finished human sequence could contribute to gene annotation and how multi-mammalian sequence comparisons could lead to the sequence-based identification of noncoding elements possessing gene regulatory activities^1,2,3. The finished sequence of chromosome 5, analysed alone and in comparison to orthologous regions in other vertebrate genomes, now provides a chromosome-wide catalogue of genes and evolutionarily conserved noncoding sequences. Many of these observations, as well as clues to disease-causing deletions arising from the segmental duplication landscape of chromosome 5, can be appreciated only now that the sequence of this chromosome is finished.

Mapping and sequencing

After the completion of the initial draft sequencing in 2001, we selected clones with an approach that integrated all of the public sequence, previously reported clone contigs^4,5,6 including the Celera scaffolds⁷, bacterial artificial chromosome (BAC) and fosmid end sequences, and BACs isolated with an overgo hybridization strategy to close gaps between anchored contigs. The final version of the tiling path contains 1,763 clones (96% BACs) with four gaps remaining, all in the long arm. None of these remaining gaps is part of the large chromosome 5 duplications, and they appear to be unclonable in current vector systems. In addition, our standard strategy of seeding and then walking into gaps based on restriction maps proved unworkable in the duplication region of 5q13 associated with spinal muscular atrophy (SMA), and led to mapping errors with its primary insertion copy at 5p14 and secondary copy at 5p13. Therefore, we adopted a strategy of drafting high-depth clone coverage from the single-individual RPCI-11 BAC library in order to construct single haplotype paths spanning the duplications.

On the basis of internal and external quality checks, we estimate the accuracy of our finished sequence to exceed 99.99%⁸. In total, we finished 177,702,766 base pairs (bp) and estimate the total chromosome size, including the clone gaps and the recalcitrant centromeric and subtelomeric regions, to be 180.8 megabases (Mb). The finished sequence covers 99.9% of the euchromatic sequence and captures all known genes that were previously mapped to chromosome 5 (T. Furey, personal communication). The Stanford v.4 G3 radiation hybrid map⁹ was compared to the sequence and it matched the marker order well (see Supplementary Fig. S1). Thirteen (out of 442) unplaced markers were found to have been originally incorrectly assigned to chromosome 5. Recombination distances from the deCODE¹⁰ meiotic maps were compared to physical distances with recombination rates accurately tracking physical distance (see Supplementary Fig. S2), as previously reported for other chromosomes^11,12,13.

Gene catalogue

We placed gene model transcripts on the chromosome 5 sequence and manually reviewed these models using previously described methods¹¹ (Table 1). Ultimately, 923 protein-coding regions were verified as gene loci (see Supplementary Table S1 and http://www.jgi.doe.gov/human_chr5). These loci contain 1,598 full-length (or nearly full-length) transcripts, including partial evidence for additional splice variants (see Supplementary Information). Loci were placed in three categories: ‘known’, ‘novel’ and ‘pseudogenes’, consistent with our previous definitions¹¹. Transcripts for which a unique open reading frame (ORF) could not be determined and putative genes defined by ab initio models but with no supporting experimental evidence were not considered valid. A total of 827 known loci were identified based on 2,203 RefSeq genes and other full-length complementary DNA sequences in GenBank, extending 36% of RefSeq transcripts by more than 50 bp at the 5′ end and 18% at the 3′ end, while maintaining the original ORF. Gene loci 3′ ends were not extended when the only evidence was from rare expressed sequence tag (EST) variants. Evidence for 55 novel loci was supported by full-length cDNA sequence, spliced ESTs, and/or similarity to known human or mouse gene sequences. Forty-one putative gene loci were modelled using orthologous mouse cDNA sequences. Twenty transfer RNA genes and four tRNA pseudogenes were predicted, similar in density to other finished chromosomes^11,12,13.

Table 1 Chromosome 5 sequence features

The extent of alternative splicing was characterized based on the existing cDNA and EST data. Considering only messenger RNA sequences in GenBank, 1,598 distinct transcripts were identified, providing an average coverage of 1.7 annotated transcripts per locus (see Supplementary Information). These mRNAs provide strong evidence for alternative splicing of 408 (44%) of the 923 loci, each having two or more associated transcripts. A total of 577 pseudogenes and pseudogene fragments were also identified, representing two classes: (1) 98 non-processed pseudogenes that display a structure similar to the parent locus and probably resulted from genomic duplication events; (2) 479 processed pseudogenes that presumably resulted from viral retrotransposition of spliced mRNAs (see Supplementary Information). No significant bias towards over-representation of pseudogenes from a particular gene family was observed.
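The counts quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below only re-derives the reported totals and ratios from the numbers in the text; the variable names are illustrative and not part of the original analysis.

```python
# Counts quoted in the Gene catalogue section (variable names are illustrative).
known_loci, novel_loci, putative_loci = 827, 55, 41
total_loci = known_loci + novel_loci + putative_loci

transcripts = 1598          # distinct mRNA-supported transcripts
spliced_loci = 408          # loci with two or more associated transcripts
non_processed, processed = 98, 479

transcripts_per_locus = transcripts / total_loci
spliced_percent = 100 * spliced_loci / total_loci
total_pseudogenes = non_processed + processed

print(f"loci: {total_loci}")
print(f"transcripts per locus: {transcripts_per_locus:.1f}")
print(f"alternatively spliced loci: {spliced_percent:.0f}%")
print(f"pseudogenes and fragments: {total_pseudogenes}")
```

The three locus categories sum to the 923 loci reported, and the derived ratios agree with the 1.7 transcripts per locus and 44% figures given above.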

Chromosome 5 genomic duplications

We performed a detailed analysis of duplicated sequence (≥ 90% identity and ≥1 kilobase (kb) length) by comparing chromosome 5 against the July 2003 human genome assembly. An estimated 3.49% (6.26 Mb) of the chromosome consists of segmental duplications, lower than the genome-wide average of 5.3% (see Supplementary Table S2 and Supplementary Fig. S4). Chromosome 5 segmental duplications, however, show a higher degree of sequence identity (≥ 97.5%), especially with other regions of chromosome 5 (see Supplementary Fig. S5), than do the duplications on other chromosomes. Intrachromosomal duplications are clustered in ten regions (Fig. 1) and represent the majority of the gene duplications, including the largest gene family: the protocadherins (see Supplementary Information). The high degree of sequence identity underlying most of these intrachromosomal genomic duplications suggests that these structures are relatively recent duplications or gene conversion events that emerged during the separation of humans and the great apes (see Supplementary Fig. S3 and Supplementary Table S2).

**Figure 1: Distribution of segmental duplications on chromosome 5.**

Subtelomeric and pericentromeric biases have been reported for segmental duplications for other human chromosomes. Despite the fact that large tracts of alpha-satellite DNA have been sequenced on both chromosomal arms near the centromere, there is little evidence for extensive pericentromeric duplication, with 5p11 showing almost a complete absence of duplications. A single duplication in 5q11 (96% identity over 250 kb) between chromosomes 1 and 5 accounts for nearly all pericentromeric duplicated bases. The pericentromeric region of chromosome 5, along with 19q11, may define a duplication-quiescent model of pericentromeric organization. The telomeric regions do show extensive interchromosomal duplications (Fig. 1), with 27% (2.48 out of 9.08 Mb) of all interchromosomal alignments occurring within 2 Mb of the long arm telomeric repeat sequence (see Supplementary Table S3).

SMA duplication region

One of the most duplicated regions on chromosome 5 occurs in a 1–2-Mb interval in 5q13.3. Homozygous deletions of the SMN1 gene and variable copies of the SMN2 duplication in this region have been associated with various forms of spinal muscular atrophy and susceptibility to the disease^14,15. Analysis of carriers and controls suggests extreme locus variability, but the underlying structural variation has never been documented at the sequence level¹⁶. We identified a complex arrangement of 311 pairwise alignments representing the SMA region (Fig. 1). On average, the duplications are long (∼ 200 kb) and show a high degree of identity (98.66%). Duplications in this region include interchromosomal duplications, all of which map to chromosome 6, with three very large tandem (> 99.5% identity) and other various interspersed intrachromosomal duplications (Fig. 2). Interestingly, this region is enriched in genes. We annotated 14 loci in this region, including SERF1 (small EDRK-rich factor 1), BIRC1 (baculoviral IAP repeat-containing 1) and SMN (survival of motor neuron), the gene for SMA.

**Figure 2: Diagram of the SMA region showing both SMAvar1, the published variant, and SMAvar2, the alternative RPCI-11 variant.**

During the sequencing and assembly of this region, we generated a consensus sequence for a second haplotype variant from the RPCI-11 BAC library. Both haplotypes represent high-quality finished sequence, apart from a remaining ∼50-kb clone gap within SMAvar2. Sequence comparison of these regions (SMAvar1 against SMAvar2) revealed extensive variation. At least two large-scale rearrangements (> 400 kb) and multiple smaller insertion/deletion events are required to reconstruct an ancestral haplotype. Although there are many scenarios for the evolution of these variants, one explanation may be that a portion of the SMAvar2 region (0.3–0.9 Mb) was inverted (68.9–69.4 Mb) and subsequently duplicated in SMAvar1 (69.8–70.4 Mb). Such extensive structural variation between haplotypes may not be uncommon in regions of extensive segmental duplication.

Comparative biology

To understand further the evolution and functional sequences of human chromosome 5, we performed comparative analyses against the available chimpanzee, mouse, rat, chicken, frog (Xenopus tropicalis) and fish (Fugu rubripes) draft genomes. These comparisons revealed numerous large-scale chromosomal rearrangement events that have occurred since each species' last common ancestor with humans, as well as a variety of nonrandomly distributed conserved noncoding regions (Fig. 3a). Additional analyses of the distribution of genes and conserved noncoding sequences along the length of the chromosome support the existence of large gene-poor regions with highly conserved noncoding sequences that may regulate genes from a distance. Furthermore, we examined conservation in a comparative analysis of the extensively studied interleukin gene cluster.

Synteny

By building segmental maps from DNA alignments of all the vertebrate species described above, we were able to confirm and extend previous homologous chromosomal relationships with human chromosome 5. Whereas recent experimental studies support that large-scale rearrangements (40–175 kb) have frequently occurred during primate genome evolution¹⁷, our comparison of chromosome 5 and the recent chimpanzee draft genome sequence (International Chimpanzee Genome Sequencing Consortium, manuscript in preparation) uncovered even larger-scale events. For example, we found a large 80-Mb inversion in comparison to the chimpanzee genome, homologous to almost half of human chromosome 5 between 5p14 and 5q15 (Fig. 3a). This finding, based on the genomic draft data, independently confirms previous fluorescence in situ hybridization (FISH) experiments¹⁸. It has been proposed that such large-scale rearrangements created barriers to fertile mating and triggered the speciation that separated these two lineages¹⁹. Comparison with the mouse genome sequence²⁰ yielded 142 chromosomal rearrangements ranging in size from 200 kb to 17 Mb. Between human and chicken, we found that one-third of chromosome 5 is homologous to the chicken sex chromosome Z²¹, further indicating that sex chromosomes have evolved independently after the avian and mammalian split some 300 million years ago²².

Chimpanzee

In addition to exploring the syntenic relationship between chromosome 5 and the chimpanzee draft assembly, we catalogued sequence changes between these two primates. To explore the constraint on human–chimpanzee evolution in noncoding regions, we compared the number of nucleotide substitutions in coding sequences, as well as in noncoding regions conserved and not conserved in rodents. We found a substitution rate of 0.0067 changes per nucleotide in coding sequences, 0.0091 in noncoding regions conserved in rodents, and 0.015 in noncoding regions not conserved in rodents. The decreased substitution rate in coding sequences and in noncoding sequences conserved in rodents (compared to noncoding regions not conserved in rodents) supports the theory that both of the former categories are under evolutionary constraint. It also supports the theory that human/chimpanzee coding and noncoding sequences conserved in rodents have been under moderate selective constraint since the last common human/chimpanzee ancestor. We next compared the patterns of variation within human and chimpanzee exons to identify genes potentially under positive selection in the human lineage, as reported in ref. 23. We found 21 genes randomly distributed over chromosome 5 displaying a P-value less than 0.01 for an increased evolutionary rate in humans. Of note, the two highest ranked genes (FBN2 and SQSTM1) are both linked to human diseases. Mutations in FBN2 cause pathologies similar to Marfan syndrome (FBN1), whereas SQSTM1 has been linked to Paget's disease of the bone²⁴. As the chimpanzee genome reaches a further draft state, a similar complete re-analysis of the entire human gene set will probably yield large numbers of quickly evolving genes, which may explain unique aspects of human biology.
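One way to read the three human–chimpanzee substitution rates is relative to the least constrained category. The sketch below is only an illustration of that comparison: using the non-conserved noncoding rate as a baseline is an assumption made here, and the dictionary keys and function name are hypothetical.

```python
# Substitution rates (changes per nucleotide) quoted in the text.
rates = {
    "coding": 0.0067,
    "noncoding_conserved_in_rodents": 0.0091,
    "noncoding_not_conserved_in_rodents": 0.015,
}


def relative_rate(category, baseline="noncoding_not_conserved_in_rodents"):
    """Rate of a category as a fraction of the baseline rate.

    Treating the non-conserved noncoding rate as the baseline is an
    assumption of this sketch, not a statement from the paper.
    """
    return rates[category] / rates[baseline]


for name in rates:
    print(f"{name}: {relative_rate(name):.2f} of baseline")
```

Under this reading, coding sequence accumulates substitutions at a bit less than half the baseline rate and rodent-conserved noncoding sequence at roughly three-fifths of it, which is the ordering the paragraph above describes.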

Vertebrate conservation

To annotate functional elements, we identified slowly evolving regions, presumably under evolutionary constraint^25,26, through DNA comparison with rodent, chicken, Xenopus and Fugu (P-value <0.01). A chromosome-wide analysis resulted in 15,325 discrete noncoding regions conserved between human/mouse/rat, 2,429 between human/mouse/chicken, 258 between human/mouse/Xenopus and 213 between human/mouse/Fugu. We found that the distribution of human/mouse/Fugu conserved noncoding sequences was highly uneven along the chromosome (Fig. 3b), with 42 centred in 5p15 around an Iroquois homeobox (IRX) gene family. These discrete evolutionarily conserved sequences represent a prioritized substrate for future experimental studies to elucidate their function and potential role in gene regulation.

Gene-poor regions

Recent work has shown that a significant fraction of noncoding elements conserved between human and Fugu has gene regulatory activity even though many are located at great distances from the genes whose expression they control²⁷. In addition to their location between conserved flanking genes, evidence to support distant gene regulatory sequences is found in the maintenance of long syntenic blocks across distant evolutionary species²⁸. To determine whether such regions exist on human chromosome 5, we built a segmental homology map between human, chimp, mouse, rat and chicken. This map revealed two segments larger than 3 Mb that do not contain any evolutionary break points or insertions (> 250 kb) within all examined species. Notably, despite this high level of conservation, these two large segments have very few known genes and overlap the extreme gene-poor regions at 5p15 (3.1 Mb) and 5q34 (5.0 Mb). In addition, each is highly enriched for conserved noncoding sequences with distantly related non-mammalian vertebrates (Fig. 3c). In contrast to the interleukin cluster (described below) and despite being gene poor, the 5p15 region contains 378, 220 and 42 noncoding elements conserved in rodents, chicken and Fugu, respectively³. A similar level of noncoding conservation was observed in the 5q34 gene desert region containing 1,087 noncoding elements conserved with rodents, 301 with chicken, but none with Fugu. Although functional studies are needed to determine whether these ancient conserved sequences regulate the limited number of genes in these regions, it is interesting to note that the 5p15 region contains a cluster of IRX genes that have multiple roles during pattern formation in vertebrate development. The high density of conserved noncoding elements with extended synteny in these gene-poor regions suggests that these regions contain elements that regulate distant genes.
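The search for long segments free of evolutionary breakpoints can be pictured as a scan over breakpoint coordinates pooled from all compared species. The following is a minimal sketch under that assumption; it is not the PARAGON method, the function name and toy coordinates are hypothetical, and the additional insertion criterion (> 250 kb) mentioned above is not modelled.

```python
def unbroken_segments(breakpoints, chrom_length, min_size=3_000_000):
    """Return (start, end) segments of at least min_size bases that
    contain no breakpoint, given breakpoint positions pooled over species."""
    edges = [0] + sorted(breakpoints) + [chrom_length]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b - a >= min_size]


# Toy coordinates, not real chromosome 5 data.
toy_breakpoints = [1_000_000, 5_000_000, 6_000_000]
segments = unbroken_segments(toy_breakpoints, chrom_length=10_000_000)
print(segments)
```

Segments returned this way would then be intersected with gene and conserved-element annotations to ask whether they are gene poor yet rich in conserved noncoding sequence.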

Interleukin cluster

The interleukin gene cluster on 5q31 is a region of particular interest to immunologists because of the presence of five haematopoietic growth factor genes (IL3, CSF2, IL5, IL13 and IL4) and two quantitative trait loci associated with atopic asthma and Crohn's disease susceptibility. From the comparative analysis of this 1 Mb of sequence, we found that 140 of the 190 (74%) human coding exons overlap regions conserved in mouse. This number decreased to 126 (66%) when examining human/mouse/chicken conservation (P-value <0.01; Fig. 3d; see also Supplementary Table S4). Consistent with the known fast evolutionary rate of the interleukin genes, most of the interleukin exons (18 of 21) are among the exon sequences that lack similarity between the species. In the analysis of noncoding sequences, we found 83 conserved human/mouse elements that include two previously characterized gene enhancers (CNS-1 and CNS-7)³. One of these elements is more highly conserved than CNS-1 and CNS-7, yet remains functionally undefined. In addition, we found six human/mouse/chicken conserved noncoding sequences, one of which is also conserved in Xenopus.
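Counting how many coding exons overlap conserved regions reduces to an interval-overlap test. The sketch below shows that step on toy coordinates; the half-open interval convention, the function name and the data are assumptions for illustration, and the conserved regions themselves would come from a separate conservation analysis.

```python
def count_overlapping(exons, conserved):
    """Count exons that overlap at least one conserved interval.

    Intervals are (start, end) pairs, half-open, on the same coordinate system.
    """
    return sum(
        any(e_start < c_end and c_start < e_end for c_start, c_end in conserved)
        for e_start, e_end in exons
    )


# Toy coordinates, not real data from the 5q31 interval.
toy_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
toy_conserved = [(150, 250), (390, 395), (600, 650)]

n_overlapping = count_overlapping(toy_exons, toy_conserved)
print(f"{n_overlapping} of {len(toy_exons)} exons overlap a conserved region")
```

With half-open intervals, an exon ending exactly where a conserved region begins is not counted as overlapping; a real analysis might also require a minimum overlap length.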

Human disease

Not long after the concept of using anonymous polymorphic DNA markers to localize disease loci was proposed, linkage to many diseases on chromosome 5 was found, and positional cloning and other strategies rapidly isolated the genes for these clearly segregating disorders. So far, mutations in 66 specific genes are known for mendelian diseases (see Supplementary Table S5); an additional 14 single-gene diseases have been mapped to chromosome 5 but have not yet been linked to specific genes. In one of the first examples of a study taking advantage of linkage disequilibrium to positionally clone a gene, ref. 29 identified the DTD gene mutated in diastrophic dysplasia in the Finnish population in 1994. Identification of mutations in the growth hormone receptor gene, at 5p12-p13, in Laron dwarfism was an early case of ‘positional candidate cloning’, in which the gene was cloned and its location known before mapping the trait³⁰. In addition to SMA, microdeletions in a duplicated region in 5q35 cause Sotos syndrome, a debilitating disorder that results in cranial overgrowth and mental retardation³¹, in which the duplication is thought to mediate the deletion³². The availability of this completed sequence will further advance our understanding of human disease, and the rate at which disease genes are identified and cloned with causative mutations should be greatly accelerated.

Methods

Mapping and sequencing

We seeded chromosome 5 with P1, PAC and Caltech BAC clones anchored to a set of 1,645 radiation hybrid markers and known genes, mapping 5,392 clones to chromosome 5, 4,943 of which were localized by FISH. After constructing a single-enzyme restriction digest map, we chose a minimal tiling path. For the SMA duplication regions, hybridization probes were designed at 50-kb intervals across the working maps, with additional probes for each uniquely identified duplicon, and screened against RPCI-11. Results were binned and ∼40% of positives were selected for sequencing. Single haplotype maps were constructed by sequence analysis, relying on >30-kb alignments with zero or one discrepancy and multiple clone depth. For the complex 5q13 copy, we used an iterative cycle of probing, sequencing, direct repeat resolution, finishing and re-analysis.

We generated sequence by using a clone-by-clone shotgun sequencing strategy³³ followed by finishing with a custom primer approach. BAC DNA was sheared by using a Hydroshear Instrument (GeneMachines), size selected (3–4 kb) and subcloned into the vector pUC18. Randomly selected subclones were sequenced in both directions using universal primers and BigDye Terminator chemistry to an average depth of 8×. Sequences were assembled and edited by using the Phred/Phrap/Consed suite of programs^34,35. After manual inspection of the assembled sequences, clones were finished by re-sequencing and by sequencing from plasmid subclones or the large insert clone by using custom primers. All finishing reactions were performed with dGTP BigDye Terminator chemistry (Applied Biosystems). For clones with high repeat content or that showed considerable bias when cloned into pUC18, additional 8–10-kb libraries were constructed in a low copy number vector. Recalcitrant areas and difficult-to-sequence gaps were closed with sequence data derived from transposon sequencing, small insert shatter libraries³⁶, or PCR. Each clone was finished according to the agreed international standard for the human genome (http://genome.wustl.edu/Overview/g16stand.php).

Marker placement

Genetic markers were placed on the genomic sequence using electronic PCR³⁷. Markers were allowed to have up to three mismatches and were subsequently verified by placing the STS sequence (downloaded from UniSTS) via NCBI Megablast using a drop-off value of 180, a match reward of 10, a gap penalty of -20, and a word size of 22.
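Placing a marker with a bounded number of mismatches can be illustrated with a brute-force primer search. This is a rough stand-in for what an electronic-PCR style search does, not the e-PCR program itself: applying the mismatch limit per primer, the product-size window, and all names and toy sequences below are assumptions of this sketch.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def approx_hits(seq, primer, max_mismatches=3):
    """Start positions where primer matches seq with at most max_mismatches."""
    n = len(primer)
    return [
        i
        for i in range(len(seq) - n + 1)
        if sum(a != b for a, b in zip(seq[i:i + n], primer)) <= max_mismatches
    ]


def place_sts(seq, fwd, rev, min_size, max_size, max_mismatches=3):
    """Return (start, end) spans, end exclusive, where the forward primer and
    the reverse complement of the reverse primer bracket an allowed product size."""
    rev_rc = revcomp(rev)
    ends = [j + len(rev_rc) for j in approx_hits(seq, rev_rc, max_mismatches)]
    return [
        (i, e)
        for i in approx_hits(seq, fwd, max_mismatches)
        for e in ends
        if min_size <= e - i <= max_size
    ]


# Toy example: the forward primer site carries one mismatch.
fwd_primer = "ACGTTGCAAGCT"
rev_primer = "TTGACCGGTAAC"
toy_seq = "GGGG" + "ACGTAGCAAGCT" + "C" * 10 + revcomp(rev_primer) + "TTTT"
spans = place_sts(toy_seq, fwd_primer, rev_primer, 30, 40, max_mismatches=1)
print(spans)
```

A production search would index the genome rather than scan every position, and hits would still need independent confirmation, as was done here with Megablast alignment of the STS sequences.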

Pseudogene identification

Pseudogenes were defined as gene models built by homology to known human genes where alignment between the model and the homologue shows at least one stop codon or frameshift mutation. For the fragments of chromosome 5 genomic sequence that were masked of repeats by using RepeatMasker (A. Smit and P. Green, unpublished data)³⁸, we identified homology to human IPI proteins by using NCBI BLASTX. For each fragment of genomic sequence homologous to an IPI protein, we built gene models by using the GeneWise program. The overlapping gene models were clustered and the alignment of the top-scoring model with its human homologue was analysed for the presence of stop codons and frameshifts. The models were then manually analysed to confirm pseudogene status. Sequences of 431 processed pseudogenes that had been identified previously³⁹ were mapped to the genomic sequence of chromosome 5 by using the BLAT tool. Loci with multi-exon mapping, overlaps with the pseudogenes described above, and simple repeats identified by RepeatMasker were eliminated. Pseudogene status of the remaining sequences was manually validated.
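The defining test, at least one stop codon or frameshift in the alignment between a model and its homologue, can be sketched on a gapped nucleotide alignment. This is a simplified stand-in: the analysis above used protein-to-genome alignments from GeneWise followed by manual review, whereas the function names, the nucleotide-level input and the toy sequences below are assumptions for illustration.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_premature_stop(cds):
    """True if a stop codon occurs in frame before the final codon.

    Assumes cds starts in frame 0 and its length is a multiple of 3.
    """
    return any(cds[i:i + 3] in STOP_CODONS for i in range(0, len(cds) - 3, 3))


def has_frameshift(aligned_model, aligned_parent):
    """True if any gap run in either aligned sequence is not a multiple of 3."""
    return any(
        len(run) % 3 != 0
        for aligned in (aligned_model, aligned_parent)
        for run in re.findall(r"-+", aligned)
    )


def looks_like_pseudogene(aligned_model, aligned_parent):
    """Flag a model showing a frameshift or a premature in-frame stop codon."""
    model_cds = aligned_model.replace("-", "")
    return has_frameshift(aligned_model, aligned_parent) or has_premature_stop(model_cds)


parent = "ATGAAATGGGGCTAA"
print(looks_like_pseudogene("ATGAAATAGGGCTAA", parent))  # premature TAG in the model
print(looks_like_pseudogene("ATGAA-TGGGGCTAA", parent))  # single-base deletion
print(looks_like_pseudogene(parent, parent))             # intact copy
```

An in-frame deletion (a gap run whose length is a multiple of three) is deliberately not flagged, since it does not disrupt the reading frame.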

Segmental duplication analysis

We used a BLAST-based detection scheme⁴⁰ to identify all pairwise similarities representing duplicated regions (≥ 1 kb and ≥90% identity) within the finished sequence of chromosome 5 and compared to all other chromosomes in the NCBI genome assembly (build 34). A total of 1,818 pairwise alignments representing 16.57 Mb of aligned base pairs and 6.26 Mb of non-redundant duplicated bases were analysed on chromosome 5. The program Parasight (J. A. Bailey, unpublished data) was used to generate images of pairwise alignments. We also analysed pairwise alignments for per cent identity and the number of aligned bases. Satellite repeats were detected by using RepeatMasker (version 15 May 2002) on slow settings. Analysis of haplotype structural variation was performed using the program Miropeats (threshold = 7,000)⁴¹.
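The difference between total aligned bases and non-redundant duplicated bases comes from overlapping alignments covering the same chromosome 5 positions. The sketch below applies the stated thresholds (≥ 1 kb, ≥ 90% identity) and then merges overlapping intervals; it is an illustration on toy data rather than the detection pipeline itself, and the function name and tuple layout are hypothetical.

```python
def duplicated_bases(alignments, min_len=1000, min_identity=0.90):
    """Return (aligned_total, nonredundant) base counts.

    alignments: iterable of (start, end, identity) on one chromosome,
    half-open coordinates, identity as a fraction.
    """
    kept = sorted(
        (start, end)
        for start, end, identity in alignments
        if end - start >= min_len and identity >= min_identity
    )
    aligned_total = sum(end - start for start, end in kept)

    nonredundant = 0
    cur_start = cur_end = None
    for start, end in kept:
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                nonredundant += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        nonredundant += cur_end - cur_start
    return aligned_total, nonredundant


# Toy alignments, not real chromosome 5 data.
toy_alignments = [
    (0, 5000, 0.98),
    (3000, 9000, 0.95),
    (20000, 20500, 0.99),   # too short, dropped
    (30000, 32000, 0.85),   # identity too low, dropped
    (40000, 41000, 0.90),
]
totals = duplicated_bases(toy_alignments)
print(totals)
```

The same effect, applied chromosome-wide, is why 16.57 Mb of aligned base pairs corresponds to only 6.26 Mb of non-redundant duplicated sequence.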

Comparative analysis

In this work, we used the following genomic assembly builds: chimpanzee November 2003, mouse October 2003, rat June 2003, chicken February 2004 (from http://genome.ucsc.edu), X. tropicalis v1.0 and F. rubripes v3.0 (from http://jgi.doe.gov/). All the segmental homology maps in n-dimensions were computed using PARAGON (v2.13; O. Couronne, unpublished data). As input for PARAGON, we used BLASTZ (v6)⁴² DNA pairwise alignments of all the species to human. Slowly evolving regions were extracted from the alignments using PEAK-VISTA (P-value <0.01; S. Prabhakar, unpublished data). We built a four-dimension human/chimp/mouse/rat segmental homology map with PARAGON, aligned all the segments with MLAGAN (v12)⁴³ and computed the slowly evolving conserved regions with PEAK-VISTA. Interleukin homology among species was extracted from the PARAGON segmental map, built with MLAGAN multiple alignments; the slowly evolving conserved regions were extracted with RANK-VISTA.

References

Frazer, K. A. et al. Computational and biological analysis of 680 kb of DNA sequence from the human 5q31 cytokine gene cluster region. Genome Res. 7, 495–512 (1997)
Article CAS PubMed Google Scholar
Symula, D. J. et al. Functional screening of an asthma QTL in YAC transgenic mice. Nature Genet. 23, 241–244 (1999)
Article CAS PubMed Google Scholar
Loots, G. G. et al. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288, 136–140 (2000)
Article ADS CAS PubMed Google Scholar
Church, D. M., Yang, J., Bocian, M., Shiang, S. & Wasmuth, J. J. A High-resolution physical and transcript map of the Cri du Chat region of human chromosome 5p. Genome Res. 7, 787–801 (1997)
Article CAS PubMed Google Scholar
Puechberty, J. et al. Genetic and physical analyses of the centromeric and pericentromeric regions of human chromosome 5: Recombination across 5cen. Genomics 56, 274–287 (1999)
Article CAS PubMed Google Scholar
Riethman, H. C. et al. Integration of telomere sequences with the draft human genome sequence. Nature 409, 948–951 (2001)
Article CAS PubMed Google Scholar
Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001)
Article ADS CAS PubMed Google Scholar
Schmutz, J. et al. Quality assessment of the human genome sequence. Nature 429, 365–368 (2004)
Article ADS CAS PubMed Google Scholar
Olivier, M. et al. A high-resolution radiation hybrid map of the human genome draft sequence. Science 291, 1298–1302 (2001)
Article ADS CAS PubMed Google Scholar
Kong, A. et al. A high-resolution recombination map of the human genome. Nature Genet. 31, 241–247 (2002)
Article CAS PubMed Google Scholar
Grimwood, J. et al. The DNA sequence and biology of human chromosome 19. Nature 428, 529–535 (2004)
Article ADS CAS PubMed Google Scholar
Heilig, R. et al. The DNA sequence and analysis of human chromosome 14. Nature 421, 601–607 (2003)
Article ADS CAS PubMed Google Scholar
Hiller, L. W. et al. The DNA sequence of human chromosome 7. Nature 424, 157–164 (2003)
Article ADS Google Scholar
Melki, J. et al. De novo and inherited deletions of the 5q13 region in spinal muscular atrophies. Science 264, 1474–1477 (1994)
Article ADS CAS PubMed Google Scholar
Monani, U. et al. A single nucleotide difference that alters splicing patterns distinguishes the SMA gene SMN1 from the copy gene SMN2. Hum. Mol. Genet. 8, 1177–1183 (1999)
Article CAS PubMed Google Scholar
Chen, Q. et al. Sequence of a 131-kb region of 5q13.1 containing the spinal muscular atrophy candidate genes SMN and NAIP. Genomics 48, 121–127 (1998)
Article CAS PubMed Google Scholar
Locke, D. P. Large-scale variation among human and great ape genomes determined by array comparative genomic hybridization. Genome Res. 13, 347–357 (2003)
Article CAS PubMed PubMed Central Google Scholar
Yunis, J. J. & Prakash, O. The origin of man: a chromosomal pictorial legacy. Science 215, 1525–1530 (1982)
Article ADS CAS PubMed Google Scholar
Noor, M. A., Grams, K. L., Bertucci, L. A. & Reiland, J. Chromosomal inversions and the reproductive isolation of species. Proc. Natl Acad. Sci. USA 98, 12084–12088 (2001)
Article ADS CAS PubMed PubMed Central Google Scholar
Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002)
Groenen, M. A. et al. A consensus linkage map of the chicken genome. Genome Res. 10, 137–147 (2000)
Nanda, I. et al. 300 million years of conserved synteny between chicken Z and human chromosome 9. Nature Genet. 21, 258–259 (1999)
Clark, A. G. et al. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302, 1960–1963 (2003)
Hocking, L. J. et al. Domain-specific mutations in sequestosome 1 (SQSTM1) cause familial and sporadic Paget's disease. Hum. Mol. Genet. 11, 2735–2739 (2002)
Pennacchio, L. A. & Rubin, E. M. Genomic strategies to identify mammalian regulatory sequences. Nature Rev. Genet. 2, 100–109 (2001)
Ghanem, N. et al. Regulatory roles of conserved intergenic domains in vertebrate Dlx bigene clusters. Genome Res. 13, 533–543 (2003)
Nobrega, M. A., Ovcharenko, I., Afzal, V. & Rubin, E. M. Scanning human gene deserts for long-range enhancers. Science 302, 413 (2003)
Flint, J. et al. Comparative genome analysis delimits a chromosomal domain and identifies key regulatory elements in the alpha globin cluster. Hum. Mol. Genet. 10, 371–382 (2001)
Hästbacka, J. et al. The diastrophic dysplasia gene encodes a novel sulfate transporter: positional cloning by fine-structure linkage disequilibrium mapping. Cell 78, 1073–1087 (1994)
Barton, D. E., Foellmer, B. E., Wood, W. I. & Francke, U. Chromosome mapping of the growth hormone receptor gene in man and mouse. Cytogenet. Cell Genet. 50, 137–141 (1989)
Kurotaki, N. et al. Haploinsufficiency of NSD1 causes Sotos syndrome. Nature Genet. 30, 365–366 (2002)
Kurotaki, N. et al. Fifty microdeletions among 112 cases of Sotos syndrome: low copy repeats possibly mediate the common deletion. Hum. Mutat. 22, 378–387 (2003)
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001)
Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998)
Gordon, D., Abajian, C. & Green, P. Consed: A graphical tool for sequence finishing. Genome Res. 8, 195–202 (1998)
McMurray, A. A., Sulston, J. E. & Quail, M. A. Short insert libraries as a method of problem solving in genome sequencing. Genome Res. 8, 562–566 (1998)
Schuler, G. D. Sequence mapping by electronic PCR. Genome Res. 7, 541–550 (1997)
Jurka, J. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 16, 418–420 (2000)
Zhang, Z., Harrison, P. M., Liu, Y. & Gerstein, M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13, 2541–2558 (2003)
Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 1005–1017 (2001)
Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput. Appl. Biosci. 11, 615–619 (1995)
Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res. 13, 103–107 (2003)
Brudno, M. et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13, 721–731 (2003)
Kuroda-Kawaguchi, T. et al. The AZFc region of the Y chromosome features massive palindromes and uniform recurrent deletions in infertile men. Nature Genet. 29, 279–286 (2001)

Acknowledgements

We thank the International Chimpanzee Sequencing Consortium for pre-publication access to and permission to analyse the relevant portions of the chimpanzee genomic sequence, and the Washington University Genome Sequencing Center for pre-publication access to the chicken genomic assembly. We also thank M. Christensen, P. Butler and E. Fields for technical support, D. Gordon of the University of Washington for his assistance in developing and customizing finishing tools, T. Furey and G. Schuler for their efforts towards assessing the quality and completeness of our assembly, and P. DeJong for the construction of genomic resources. This work was performed under the auspices of the US DOE's Office of Science, Biological and Environmental Research Program, by the University of California, Lawrence Livermore National Laboratory, Lawrence Berkeley National Laboratory and Stanford University.

Author information

Authors and Affiliations

Stanford Human Genome Center, Department of Genetics, Stanford University School of Medicine, 975 California Ave, Palo Alto, California, 94304, USA
Jeremy Schmutz, Jane Grimwood, Eva Bajorek, Stacey Black, Chenier Caoile, Yee Man Chan, Mirian Denys, Julio Escobar, Dave Flowers, Dea Fotopulos, Maria Gomez, Eidelyn Gonzales, Lauren Haydu, Frederick Lopez, Catherine Medina, Lucia Ramirez, James Retterer, Alex Rodriguez, Stephanie Rogers, Angelica Salazar, Ming Tsai, Nu Vo, Jeremy Wheeler, Kevin Wu, Joan Yang, Mark Dickson & Richard M. Myers
DOE's Joint Genome Institute, 2800 Mitchell Avenue, Walnut Creek, California, 94598, USA
Joel Martin, Astrid Terry, Steve Lowry, Laurie A. Gordon, Duncan Scott, Gary Xie, Wayne Huang, Uffe Hellsten, Mary Tran-Gyamfi, Andrea Aerts, Michael Altherr, Elbert Branscomb, John C. Detter, Tijana Glavina, David Goodstein, Igor Grigoriev, Nancy Hammon, Trevor Hawkins, Sanjay Israni, Jamie Jett, Kristen Kadner, Heather Kimball, Arthur Kobayashi, Yunian Lou, Diego Martinez, Jenna Morgan, Sam Pitluck, Martin Pollard, Paul Predki, Asaf Salamov, Nina Thayer, Hope Tice, Anna Ustaszewska, Anne Olsen, Len A. Pennacchio, Daniel S. Rokhsar, Paul Richardson, Susan M. Lucas & Edward M. Rubin
Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, California, 94720, USA
Olivier Couronne, Shyam Prabhakar, James Priest, Jan-Fang Cheng, Len A. Pennacchio & Edward M. Rubin
Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, California, 94550, USA
Laurie A. Gordon, Mary Tran-Gyamfi, Elbert Branscomb, Matthew Groza, Arthur Kobayashi, Richard Nandkeshwar & Anne Olsen
Los Alamos National Laboratory, Los Alamos, New Mexico, 87545, USA
Gary Xie, Michael Altherr, Jean F. Challacombe & Nina Thayer
Department of Genetics, Center for Computational Genomics and Center for Human Genetics, Case Western Reserve University School of Medicine and University Hospitals of Cleveland, Cleveland, Ohio, 44106, USA
Xinwei She & Evan E. Eichler
Department of Genetics, Stanford University School of Medicine, Stanford, California, 94305, USA
James P. Noonan

Corresponding authors

Correspondence to Jeremy Schmutz or Edward M. Rubin.

Ethics declarations

Competing interests

The authors declare that they have no competing financial interests.

Supplementary information

Supplementary Data

Additional information to what is presented in the text. (DOC 33 kb)

Supplementary Figure 1

A comparison of the Stanford G3 radiation hybrid map v4 to the finished chromosome 5 sequence. (JPG 35 kb)

Supplementary Figure 2

Recombination distance from the deCODE genetic map compared to the physical sequence of chromosome 5. (JPG 107 kb)

Supplementary Figure 3

Sequence Similarity of Segmental Duplications: For all pairwise alignments, the total number of aligned bases was calculated and binned based on percent sequence identity. Sequence identity distributions for interchromosomally (red) and intrachromosomally (blue) duplicated bases are shown. (PPT 38 kb)

Supplementary Figure 4

Distribution of Segmental Duplications. A schematic of chromosome 5 segmental duplications depicting the location of interchromosomal (red) and intrachromosomal (blue) duplicated sequence. Each horizontal line represents 5 Mb of sequence, with tick marks every 500 kb. Sequencing gaps are represented as discontinuities within the horizontal line. The centromere is shown as a purple bar. Duplications detected by whole genome shotgun sequence are represented as green bars above the chromosome sequence. (PDF 45 kb)

Supplementary Figure 5

Sequence Identity of Segmental Duplications on Chromosome 5. Interchromosomal (red) and intrachromosomal (blue) duplications are shown to scale along the horizontal line in 2 Mb increments. Green bars above the horizontal line correspond to duplications detected by another method, whole-genome shotgun sequence detection⁶. The underlying pairwise alignments of segmental duplications (>90% identity, >1 kb) are depicted as a function of % identity below the horizontal line. Different colors correspond to the location of the pairwise alignment on different human chromosomes (e.g. chromosome 5 is shown in tan). (PDF 132 kb)

Supplementary Table 1

The gene catalog for chromosome 5. PPG=processed pseudogene, NPG=non-processed pseudogene. (XLS 192 kb)

Supplementary Table 2

Chromosome 5 bases involved in segmental duplication and pairwise alignment. Percentages of non-redundant duplications are based on a total non-gap genome size of 2,865,069,170 bp and a chromosome 5 size of 177,702,766 bp. All segmental duplications have at least 1 kb of aligned bases with 90–100% identity. (XLS 18 kb)

Supplementary Table 3

Segmental duplication in pericentromeric and telomeric regions. Segmental duplications within 2 Mb of the centromere and within 2 Mb of the chromosome ends are counted as pericentromeric and telomeric, respectively. (XLS 16 kb)

Supplementary Table 4

Interleukin locus non-exonic conserved regions in human/mouse and human/chicken. (XLS 20 kb)

Supplementary Table 5

Mendelian Disease genes on Chromosome 5 (from OMIM). (DOC 153 kb)

About this article

Cite this article

Schmutz, J., Martin, J., Terry, A. et al. The DNA sequence and comparative analysis of human chromosome 5. Nature 431, 268–274 (2004). https://doi.org/10.1038/nature02919

Received: 10 May 2004
Accepted: 02 August 2004
Issue Date: 16 September 2004
DOI: https://doi.org/10.1038/nature02919
