The DNA sequence of human chromosome 7

Hillier, LaDeana W.; Fulton, Robert S.; Fulton, Lucinda A.; Graves, Tina A.; Pepin, Kymberlie H.; Wagner-McPherson, Caryn; Layman, Dan; Maas, Jason; Jaeger, Sara; Walker, Rebecca; Wylie, Kristine; Sekhon, Mandeep; Becker, Michael C.; O'Laughlin, Michelle D.; Schaller, Mark E.; Fewell, Ginger A.; Delehaunty, Kimberly D.; Miner, Tracie L.; Nash, William E.; Cordes, Matt; Du, Hui; Sun, Hui; Edwards, Jennifer; Bradshaw-Cordum, Holland; Ali, Johar; Andrews, Stephanie; Isak, Amber; VanBrunt, Andrew; Nguyen, Christine; Du, Feiyu; Lamar, Betty; Courtney, Laura; Kalicki, Joelle; Ozersky, Philip; Bielicki, Lauren; Scott, Kelsi; Holmes, Andrea; Harkins, Richard; Harris, Anthony; Strong, Cynthia Madsen; Hou, Shunfang; Tomlinson, Chad; Dauphin-Kohlberg, Sara; Kozlowicz-Reilly, Amy; Leonard, Shawn; Rohlfing, Theresa; Rock, Susan M.; Tin-Wollam, Aye-Mon; Abbott, Amanda; Minx, Patrick; Maupin, Rachel; Strowmatt, Catrina; Latreille, Phil; Miller, Nancy; Johnson, Doug; Murray, Jennifer; Woessner, Jeffrey P.; Wendl, Michael C.; Yang, Shiaw-Pyng; Schultz, Brian R.; Wallis, John W.; Spieth, John; Bieri, Tamberlyn A.; Nelson, Joanne O.; Berkowicz, Nicolas; Wohldmann, Patricia E.; Cook, Lisa L.; Hickenbotham, Matthew T.; Eldred, James; Williams, Donald; Bedell, Joseph A.; Mardis, Elaine R.; Clifton, Sandra W.; Chissoe, Stephanie L.; Marra, Marco A.; Raymond, Christopher; Haugen, Eric; Gillett, Will; Zhou, Yang; James, Rose; Phelps, Karen; Iadanoto, Shawn; Bubb, Kerry; Simms, Elizabeth; Levy, Ruth; Clendenning, James; Kaul, Rajinder; Kent, W. James; Furey, Terrence S.; Baertsch, Robert A.; Brent, Michael R.; Keibler, Evan; Flicek, Paul; Bork, Peer; Suyama, Mikita; Bailey, Jeffrey A.; Portnoy, Matthew E.; Torrents, David; Chinwalla, Asif T.; Gish, Warren R.; Eddy, Sean R.; McPherson, John D.; Olson, Maynard V.; Eichler, Evan E.; Green, Eric D.; Waterston, Robert H.; Wilson, Richard K.

doi:10.1038/nature01782

Article
Published: 10 July 2003

Nature volume 424, pages 157–164 (2003)

Abstract

Human chromosome 7 has historically received prominent attention in the human genetics community, primarily related to the search for the cystic fibrosis gene and the frequent cytogenetic changes associated with various forms of cancer. Here we present more than 153 million base pairs representing 99.4% of the euchromatic sequence of chromosome 7, the first metacentric chromosome completed so far. The sequence has excellent concordance with previously established physical and genetic maps, and it exhibits an unusual amount of segmentally duplicated sequence (8.2%), with marked differences between the two arms. Our initial analyses have identified 1,150 protein-coding genes, 605 of which have been confirmed by complementary DNA sequences, and an additional 941 pseudogenes. Of genes confirmed by transcript sequences, some are polymorphic for mutations that disrupt the reading frame.

Main

As the reference human genome sequence nears completion, the sequences of individual chromosomes are providing foundational information for genome structure, organization and evolution. Previously, chromosome 7 has been the focal point in the search for the gene associated with cystic fibrosis¹ and the frequent cytogenetic changes associated with some forms of cancer². Here, we describe our analysis of the chromosome 7 sequence, which has exploited orthology with the mouse genome in refining gene predictions and has discovered some unusual structural features that have been implicated in genetic diseases. The unrestricted access of the community to this sequence as it was generated has aided in the discovery of genes on chromosome 7 related to human health and well-being, and we anticipate that the full sequence will provide further impetus to these studies.

General features of the chromosome 7 sequence

We generated the sequence of human chromosome 7 using a clone-by-clone shotgun sequencing strategy³,⁴ and organized the resulting sequence into 11 contigs (Fig. 1). With the exception of the centromere and one gap near the terminal end of the short arm (pter), the distances between contigs are relatively small (Table 1), with most sized by fluorescence in situ hybridization (FISH) of DNA fibres or by comparison to the analogous region in the mouse genome sequence. The DNA in these remaining gaps is repetitive or has proved recalcitrant to isolation in bacterial- or yeast-based cloning systems, including the screening of a series of large-insert genomic libraries that together provided 100-fold coverage of the human genome. On the basis of the size estimates of these gaps, the available sequence represents greater than 99.4% of the total euchromatic sequence.

Table 1 Contiguous sequence lengths and gap sizes

Isolation of clones containing the telomeric and pericentromeric regions of the p arm was particularly hampered by the presence of repetitive sequences. A 7p-telomere-containing ‘half-YAC’⁵ (yeast artificial chromosome) has been identified, but has been too unstable to sequence. However, three cosmid clones estimated to be 34 kilobases (kb) from the true telomere were recovered and contain scattered telomere-specific repeats. Such difficulties were not encountered with the 7q telomere, where the sequence appears to extend into the telomere. On either side of the centromere, we analysed the sequence for higher-order alpha satellite repeats indicative of centromeric boundaries. On the long (q) arm side of the centromere, the sequence clearly contains such boundary features. The boundary was less clear for the p-arm side of the centromere, and the search for it was compounded by the duplication of this region elsewhere on the chromosome.

The quality of the chromosome 7 sequence exceeds the 99.99% accuracy standard established by the International Human Genome Consortium for sequencing the human genome⁶. We further checked the integrity of the sequence and its assembly in two ways. First, as each clone was finished, an in silico digest of the sequence was compared to restriction digests of the clone DNA. We also checked the fully assembled sequence by performing in silico digests of clone-sized fragments across the chromosome against the underlying fingerprint data used to construct the physical map. In this way, we confirmed more than 99.9% of the testable bands.
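The digest comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used for chromosome 7: the enzyme (EcoRI, cutting G^AATTC), the function names and the 5% size tolerance are all assumptions, because the text does not state which enzymes or matching tolerances were applied.

```python
def in_silico_digest(seq, site="GAATTC", cut_offset=1):
    """Return fragment lengths from cutting a linear sequence at every `site`.

    cut_offset is the position inside the site where the cut falls
    (1 for EcoRI, G^AATTC). Both defaults are illustrative assumptions.
    """
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def fraction_confirmed(predicted, observed, tolerance=0.05):
    """Fraction of predicted fragments matched by an unused observed band.

    A predicted size counts as confirmed when some remaining observed band
    lies within `tolerance` (relative) of it; each band is used once.
    """
    remaining = sorted(observed)
    hits = 0
    for size in sorted(predicted):
        match = next((o for o in remaining if abs(o - size) <= tolerance * size), None)
        if match is not None:
            remaining.remove(match)
            hits += 1
    return hits / len(predicted) if predicted else 0.0
```

Calling `fraction_confirmed(in_silico_digest(clone_sequence), gel_band_sizes)` on a hypothetical clone sequence and its measured band sizes would give the share of testable bands that agree.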

Comparison to physical and genetic maps

We evaluated the completeness of the chromosome 7 sequence by looking for its representation of sequence-tagged sites (STSs) from previously constructed physical and genetic maps of the chromosome, specifically a YAC-based STS-content map⁷, the Genethon microsatellite-based genetic map⁸, and a chromosome-7-specific radiation-hybrid (RH) map⁷. There were only a small number of unassigned STSs (see Supplementary Information), which included those from multi-copy sequences, sequence polymorphisms, regions within the remaining chromosome 7 sequence gaps, and clerical errors that preclude accurate matching of STS names to their true underlying sequence.

We also used these maps to evaluate the assembly of the sequence. The chromosome 7 sequence positions of the identified STSs were plotted relative to their established map positions (Fig. 2). Less than 1% of the identified STSs in the YAC-based map are in serious disagreement (defined as >3 megabases (Mb)) with their sequence position. In only one and four instances of the Genethon genetic and RH maps, respectively, are there discrepancies of more than several megabases between the map and sequence position of the STSs.
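As a rough illustration of the discrepancy criterion, the sketch below flags STSs whose map and sequence positions differ by more than 3 Mb. It assumes that map positions have already been converted to megabases, which is a hypothetical preprocessing step, and the marker names in any input are placeholders.

```python
def discordant_sts(markers, threshold_mb=3.0):
    """Return names of STSs whose map and sequence positions disagree.

    markers: iterable of (name, map_position_mb, sequence_position_mb).
    Assumes map positions were already scaled to megabases (an assumption;
    genetic and RH maps use other units natively).
    """
    return [name for name, map_mb, seq_mb in markers
            if abs(map_mb - seq_mb) > threshold_mb]
```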

Figure 2: Comparisons of mapped positions of STSs and their locations within the chromosome 7 sequence.

Together, these findings reveal an excellent overall concordance between the chromosome 7 sequence and previously constructed physical and genetic maps. The rare discrepancies can be largely accounted for by the inherent lower-resolution nature of the various mapping methods; however, it is also possible that some of the observed differences reflect polymorphisms between the different copies of chromosome 7 used for map construction and sequence generation. Nonetheless, these results, in conjunction with the robustness of the bacterial artificial chromosome (BAC)-based physical map used for sequence generation³, provide strong support for the established chromosome 7 sequence.

Orthology to mouse

The relationship between the human chromosome 7 sequence and the mouse genome could be readily defined for approximately 92% of the chromosome, with 26 identifiable segments sharing the same order of highly conserved sequences in the two species at a resolution of 300 kb (ref. 9) or 46 segments at a resolution of 100 kb (Supplementary Methods). The smallest and largest defined segments are 200 kb and 38 megabases (Mb), respectively, with the latter residing on 7q and containing the cystic fibrosis gene, the ST7 tumour suppressor gene¹⁰, olfactory and taste receptors¹¹,¹², and the T-cell receptor beta gene region¹³.

General features

We analysed the chromosome 7 sequence for interspersed repeat content, (G + C) content and the presence of CpG islands. The sequence has an overall repeat content (47%) and distribution of individual repeat classes (for example, short interspersed nucleotide elements (SINEs), long interspersed nucleotide elements (LINEs) and long terminal repeats (LTRs)) that differ only slightly from the whole-genome averages⁴ (Supplementary S1). Similarly, its overall (G + C) content (41%) is almost identical to that of the genome as a whole. The regions of highest (G + C) content flank the remaining gaps, consistent with the idea that the (G + C)-rich regions may be difficult to clone. Analysis of the repeat-masked chromosome 7 sequence revealed 1,461 CpG islands. Of the known chromosome-7-derived messenger RNAs (see below), the 5′ ends of 66% were at or near (5-kb upstream to 1-kb downstream of) a CpG island. For the full gene set presented here, 51% had overlapping CpG islands; the two values bracket the reported figure of 60% for the genome as a whole¹⁴. Table 2 provides a comparison of some of the general features of chromosome 7 with the other published chromosomes.
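For readers unfamiliar with these measures, the following Python sketch computes (G + C) content and the observed/expected CpG ratio for a sequence. The island thresholds (at least 200 bp, G + C above 50%, observed/expected CpG above 0.6) are commonly used criteria adopted here as an assumption; the text does not state which CpG island definition was applied.

```python
def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from C and G frequencies."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    return seq.count("CG") * len(seq) / (c * g) if c and g else 0.0


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply assumed length, G + C and observed/expected thresholds to one window."""
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and cpg_obs_exp(seq) > min_ratio)
```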

Table 2 Comparison of human genome and published chromosomes

Previous analyses of the draft human genome sequence suggested that Alu repeat distribution correlated more strongly with gene content than with (G + C) content⁴. We re-examined this issue by correlating (G + C) content, Alu repeat content and known exonic sequences (see below) across chromosome 7. Using a range (50–800 kb) of non-overlapping windows, we found strong positive correlations between (G + C) content, exonic sequence density and Alu content. However, for all window sizes, Alu elements were more strongly correlated with (G + C) content than with exonic sequence density (R² = 0.66 compared with R² = 0.41 for 200-kb windows). This difference from the overall genomic pattern suggests that the factors influencing Alu distribution are complex.
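A windowed correlation of this kind can be sketched as follows. The code is a simplified illustration under stated assumptions: features are given as half-open, mutually non-overlapping intervals, R² is taken as the squared Pearson correlation, and all function names are hypothetical.

```python
def window_fractions(length, features, window):
    """Fraction of each non-overlapping window covered by features.

    features: list of half-open (start, end) intervals, assumed not to
    overlap one another (otherwise a fraction could exceed 1).
    """
    n = (length + window - 1) // window
    covered = [0] * n
    for start, end in features:
        for w in range(start // window, (end - 1) // window + 1):
            w_start, w_end = w * window, min((w + 1) * window, length)
            covered[w] += max(0, min(end, w_end) - max(start, w_start))
    return [covered[w] / (min((w + 1) * window, length) - w * window)
            for w in range(n)]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy) if sxx and syy else 0.0
```

With per-window (G + C) fractions, Alu fractions and exon fractions in hand, `r_squared` can be applied to each pair of lists and repeated for each window size.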

Known protein-coding genes

The identification of genes within genomic sequence typically uses known mRNA sequences¹⁵, ab initio methods¹⁶ and comparative nucleotide or protein sequence data¹⁷,¹⁸. Recognizing the challenges of gene annotation¹⁹, we sought to establish a foundation for the gene catalogue of human chromosome 7 as an intermediate step en route to a full understanding of all the functional elements encoded by the chromosome. First, a set of 1,073 human mRNAs from REFSEQ¹⁹ and the Mammalian Gene Collection²⁰ were uniquely assigned to the sequence and manually edited, resulting in 605 non-overlapping mRNAs with 45% showing alternative splicing. Only two (<0.3%) of the known genes mapped to chromosome 7 were not identified in the existing sequence.

Detailed examination of these mRNAs aligned against the genome revealed some potential artefacts, even within this experimentally supported set. For example, 23 mRNAs (2.4%) had no similarities to any mouse gene or to any known protein in the database. By contrast, less than 1% of predicted mouse genes have no homologue in the human genome⁹. Although some of these could be true genes, they may also represent untranslated segments of bona fide genes or transcripts of uncertain function. Nonetheless, these were retained in the current set. Additionally, eight other mRNAs had no significant open reading frame and were not included in the final gene set.

We also investigated 61 mRNAs where the matched genomic sequence contained differences from the mRNA that caused in some cases a frameshift and/or truncation of the protein product. To determine the origin of the difference, we re-sequenced the region of interest in a panel of 24 diverse individuals²¹, in the starting BAC and in some cases in overlapping BACs. Ten cases could not be uniquely amplified because of surrounding repetitive sequence. In six cases the BAC sequence was found to be in error (representing either a simple sequence error or a mutation at the site during propagation), and the sequence was corrected. For another 35 cases, there was support only for the BAC sequence. Using expressed sequence tag (EST) data, underlying mRNA data and conservation with mouse, these 35 cases were determined to be mRNA error (primarily a deletion or insertion with a second compensatory insertion/deletion downstream). In the remaining ten cases, the site was found to be polymorphic, with support for both BAC and mRNA sequence. One of these polymorphisms is a deletion in zonadhesin (ZAN)²², a sperm membrane protein that binds in a species-specific manner to the extracellular matrix (zona pellucida) of the egg. This deletion creates a frameshift at position 1,922 of this 2,812 amino acid protein.

Predicted protein-coding genes

To predict additional genes on chromosome 7, we applied three gene prediction programs. One, Genewise¹⁷, uses protein homologies to seed prediction, and the other two, Twinscan¹⁸,²³ and FGENESH2¹⁶, use comparative sequence analysis. We used all available protein predictions²⁴ for Genewise and the mouse genome sequence⁹ as informant sequence for Twinscan and FGENESH2. The combined output predicted 90% of all known exons and 98% of the known genes, indicating that for known genes the combined output is reasonably comprehensive with high sensitivity, albeit at the cost of specificity.

To reduce the number of false positives and pseudogenes in the collection, we demanded that the predicted genes have a highly significant match in the mouse gene set in the orthologous region of mouse where possible, and in turn that the matching mouse gene have among its best matches the original chromosome 7 predicted gene (‘reciprocal match’). Furthermore, single-exon genes were removed from the collection if they had matches to multi-exon genes in either the human or mouse genomes. Redundancy between the three sets among themselves and with known genes was eliminated, accepting in order known genes, FGENESH2, Twinscan and Genewise predictions, with gene models with a reciprocal match taking precedence. Predictions showing signs of non-functionality (truncation or absence of introns) and those that produced L1/reverse transcriptase were also removed from the set. This yielded 545 predicted genes, bringing the total number of protein-coding genes on chromosome 7 to 1,150.
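One way to picture the redundancy-elimination step is a greedy selection over gene models sorted by precedence. The sketch below is an assumed reading of the text, not the authors' procedure: known genes are accepted first, then predictions with a reciprocal match, then the remaining predictions in the order FGENESH2, Twinscan, Genewise. The dictionary keys are hypothetical and strand is ignored for brevity.

```python
SOURCE_RANK = {"known": 0, "FGENESH2": 1, "Twinscan": 2, "Genewise": 3}


def select_nonredundant(models):
    """Greedily keep non-overlapping gene models in precedence order.

    models: list of dicts with keys 'source', 'start', 'end' (half-open
    coordinates) and 'reciprocal' (bool). The precedence rule is an
    assumption about how the criteria in the text combine.
    """
    def priority(model):
        is_known = model["source"] == "known"
        lacks_support = not (is_known or model["reciprocal"])
        return (not is_known, lacks_support, SOURCE_RANK[model["source"]])

    kept = []
    for model in sorted(models, key=priority):
        # Accept the model only if it overlaps nothing accepted so far.
        if all(model["end"] <= k["start"] or model["start"] >= k["end"] for k in kept):
            kept.append(model)
    return sorted(kept, key=lambda m: m["start"])
```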

We next examined ESTs to look for genes that the above process may have missed. Of 41,399 spliced ESTs that had their best match to chromosome 7, 93% at least partially overlapped existing exons, and an additional 1% lay near or within existing genes and suggested alternative splice forms (all were represented in the redundant gene prediction set). The remainder lacked significant open reading frames, and none satisfied the reciprocal match criteria used for the gene predictions above (indeed only 5% had any match to mouse predicted genes). Although these unplaced spliced ESTs could represent protein-coding genes, there is currently little corroborative evidence that they do. More likely they represent other transcription products, including non-coding RNA genes or untranslated fragments of protein-coding genes.

By several criteria, the predicted gene set is robust. As expected given the methods used to establish it, 94% have a reciprocal best match (that is, the predicted gene has as its best match a mouse gene, which in turn has as its best match the starting human gene) with the mouse genome (known genes show a 92% best match). The remainder have a reciprocal match. High percentages of the predictions are supported by similarities to non-mammalian vertebrate genomes and by EST matches (Table 3a), but as expected the rates for each of these are not as high as seen for known genes. The known genes are enriched in highly expressed genes, which are more likely to be represented in EST sets and are more likely to be conserved across evolution. The predicted genes also compare favourably to known genes in coding exon number and in total coding sequence (Table 3b). The fact that both exon number and total coding sequence are smaller for the predicted genes suggests that either some terminal exons have been missed or that there are fragmented genes in the set that reduce average values. Finally, the pseudogene analysis carried out below shows that the set is remarkably free of likely pseudogenes, and that the set is not missing many genes (<60) with similarities to known proteins.
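The reciprocal best match criterion defined above can be expressed compactly. The following sketch assumes similarity scores are available as dictionaries keyed by (query, subject) pairs; the scoring scheme and all identifiers are hypothetical.

```python
def best_hit(scores):
    """Map each query to its highest-scoring subject.

    scores: dict of {(query, subject): score}; ties keep the first seen.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_matches(human_vs_mouse, mouse_vs_human):
    """Pairs (human, mouse) where each gene is the other's best hit."""
    h_best = best_hit(human_vs_mouse)
    m_best = best_hit(mouse_vs_human)
    return {(h, m) for h, m in h_best.items() if m_best.get(m) == h}
```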

Table 3 Coverage and characteristics of exons and genes on chromosome 7

Pseudogenes

To identify pseudogenes on chromosome 7, we adapted an approach used for analysing the Anopheles²⁵ and mouse⁹ genome sequences. This involved identifying sequence with significant similarity to known proteins in regions that reside between known or predicted genes, and then evaluating the ratio of non-synonymous to synonymous coding changes (the K_A/K_S ratio) for each potential coding sequence. Although not absolute, this ratio is an indicator of selective constraints associated with particular DNA regions, and can be used to assess differences between genes that evolve under purifying selection and pseudogenes that evolve in a neutral fashion²⁶. Of the 941 such regions identified, nearly all (97 ± 3%) seem to evolve under neutrality and therefore are considered to contain pseudogenes. In contrast, only 5 ± 3% of the predicted and known genes have a K_A/K_S ratio consistent with neutral evolution. As the search did not attempt to identify highly diverged copies and may have merged two or more pseudogenes in the same interval, this probably corresponds to a lower limit for the number of pseudogenes. Indeed, there may be more pseudogenes than true genes on chromosome 7. As with the mouse genome sequence⁹, a significant fraction (33%) of the identified pseudogenes contain neither stop codons nor frameshifts. Virtually all pseudogenes (94%) could be aligned to another region in the human genome with higher sequence identity than to any region in the mouse genome, suggesting that they originated after the human–mouse divergence.
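To make the K_A/K_S ratio concrete, the sketch below computes it from counts of differences and sites using a Jukes-Cantor correction. This is a textbook simplification chosen for illustration and is an assumption; the text does not specify the estimation method used. A ratio near 1 suggests neutral evolution, whereas a ratio well below 1 suggests purifying selection.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion of differing sites p (p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Ratio of nonsynonymous to synonymous substitution rates.

    Inputs are counts of observed differences and of available sites in an
    aligned coding region. Returns infinity when no synonymous change is seen.
    """
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    return ka / ks if ks > 0 else float("inf")
```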

Pseudogenes are generally formed by two independent mechanisms of duplication: retrotransposition (giving rise to processed pseudogenes) and segmental duplication (often leading to non-processed pseudogenes). In an attempt to classify the pseudogenes on chromosome 7, we exploited the available mouse sequence. First, in contrast to non-processed pseudogenes, processed pseudogenes integrate throughout the genome and are unlikely to have sequence similarity in orthologous mouse regions. Furthermore, pseudogenes that arose before the divergence of mouse and human are probably so diverged as to be below the thresholds of detection used here. For 654 pseudogenes within regions of chromosome 7 with identified mouse orthology, 573 (88%) appear to be processed, and 81 (12%) appear to be segmentally duplicated pseudogenes. Processed pseudogenes are broadly distributed along chromosome 7, with a slight tendency to cluster near the telomeres, whereas non-processed pseudogenes are concentrated in gene-rich regions (see Fig. 1). A minority of the former (100) has inserted into the introns of unrelated functional genes. The direction of these processed pseudogenes relative to the host intron varies almost randomly: 41 (41%) are integrated in the same strand as the host gene, whereas 59 (59%) are in the opposite strand.
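Whether a 41 to 59 split among 100 intronic insertions is compatible with random orientation can be checked with an exact two-sided binomial test. The sketch below is one conventional way to do this; the choice of test is an assumption and is not stated in the text.

```python
from math import comb


def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k under success probability p.
    """
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(x for x in pmf if x <= pmf[k] * (1 + 1e-9)))


# 41 of 100 pseudogenes lie on the same strand as the host gene.
p_value = two_sided_binomial_p(41, 100)
```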

Non-coding RNAs

We identified non-coding RNA (ncRNA) genes in the chromosome 7 sequence as described for the draft human genome sequence¹⁹, followed by a refinement step using matches to mouse orthologous regions. De novo computational gene-finding methods for most non-coding RNA genes are not yet sufficiently robust for automated genome annotation²⁷. Annotation was restricted to transfer RNA genes, for which there is a robust identification programme, tRNAscan-SE²⁸, and to strong primary sequence similarities to known mammalian ncRNAs, including recently discovered human microRNAs²⁹.

Twenty-three tRNA genes were identified in the chromosome 7 sequence, including a cluster of 20 tRNAs with 18 tRNACys-GCA genes in one 400-kb interval. Eleven other ncRNA genes were found: four microRNAs, two U6 small nuclear RNA (snRNA) genes, and five genes for the four cytoplasmic Y RNAs (hY1, hY4, hY5 and two near-identical copies of hY3); all of the human Y RNA genes were already known to be located on chromosome 7 (ref. 30). Additionally, 302 putative pseudogenes were detected by BLAST similarity to other human ncRNA genes; this included 65 apparent U6 snRNA pseudogenes, 43 hY3 RNA pseudogenes and 37 non-Alu SRP-RNA pseudogenes.

Protein index

Using the above-generated gene set, we derived an index of predicted protein sequences. In turn, this was compared to the Interpro database³¹ using Interproscan³², which predicts protein families, domain and repeat families, and sequence motifs. The Interpro results were used to assign Gene Ontology (GO) codes³³: 51% of proteins were in the category of molecular function, 31% in biological process and 18% in cellular components. Of the 68% of the proteins that had an Interpro classification, 56% were multi-domain. The two most prevalent families are also two of the most prevalent in the human genome: the immunoglobulins and zinc fingers.

The general homeobox domain proteins are heavily represented on chromosome 7, accounting for 21 of the 211 currently annotated records in the Ensembl gene set³⁴ of the human genome. In addition, the chromosome 7 gene set contains one-third of each of the more specific homeobox proteins in the genome: antennapedia homeobox proteins, engrailed-type homeobox proteins and homeodomain protein CUT domains. The chromosome 7 HOX region, one of the four homeobox clusters in the human genome, contains ten two-exon genes (with one alternative form). This 90.7-kb region has a (G + C) content of 52% and is completely devoid of known interspersed repeats.

Williams syndrome critical region

Williams–Beuren syndrome (WBS) is associated with large (typically about 1.6 Mb) deletions³⁵ within 7q11.23. A series of large (>140 kb) duplicated segments (duplicons) span over 2 Mb in this region and are shown as coloured boxes in Fig. 3. The commonly deleted region is flanked by a duplicon that contains the p47-phox gene (or pseudogene), which has served as a useful marker for differentiating among the different duplicated segments. A third copy of the duplicon resides distal to the other pair within the commonly deleted region, in inverted orientation. At present there is a gap in the physical map and the chromosome sequence between the second and third copies of this duplicon.

Figure 3: Repetitive content surrounding the commonly deleted region of the Williams–Beuren Syndrome area of 7q11.23 illustrated using self_dot_plot (H. Skaletsky and S. Rozen, personal communication).

The WBS region presented the single greatest challenge to the mapping and sequencing of the euchromatic regions of chromosome 7. Indeed, even after considerable effort, there remains some uncertainty about the location and orientation of some sequences. The duplicated segments approach the size of BACs, and the frequency of sequence differences among duplicons approaches (or is less than) the human polymorphism rate. To add to this complexity, the long-range organization of the region, perhaps including the number of duplicons, differs among individuals.

In attempting to establish a representative sequence of the WBS region, we used clones from a single BAC library to reduce the allelic complexity to just two variants. We also deliberately sequenced BACs with extensive overlaps to establish linkage of variant sites and used polymerase chain reaction (PCR) analysis of clones and a human population panel to distinguish polymorphic sites from differences between duplicated segments. This process eventually yielded the two sequence contigs shown in Fig. 3.

Segmental duplications

Segmental duplications are large low-copy repeats that arise as a consequence of duplication of genomic DNA and may range up to hundreds of kilobases in length. We performed a detailed analysis of duplicated sequence (≥90% sequence identity and ≥1 kb in length; see Methods), comparing the chromosome 7 sequence against a recent assembly of the human genome. We identified a total of 3,215 pairwise alignments that met these criteria of length and sequence identity (Fig. 4a; see also Supplementary S3 and S4), making chromosome 7 one of the most duplicated human chromosomes. Overall, 8.2% (12,588 kb) of the sequence shares sequence homology to more than one location in the genome (Fig. 4a; see also Supplementary S4). The enrichment is predominantly due to an increase in intrachromosomal duplications (7.0% of the sequence) rather than interchromosomal duplications (2.2%) with 0.5% (729,982 bp) sequence overlap between the two types. The spatial distribution of the interchromosomal and intrachromosomal duplications is clustered (Fig. 4a; see also Supplementary S3 and S4). As expected, large blocks of interchromosomal duplication locate preferentially within the pericentromeric and subtelomeric regions³⁶,³⁷. A marked asymmetry, however, was observed between the short and long arms of chromosome 7. The short arm of chromosome 7 has large blocks of recent interchromosomal duplications within both the pericentromeric (600 kb) and subtelomeric (about 150 kb) regions. The p arm subtelomere contains the most recent interchromosomal duplications (>99% identity). The low degree of sequence divergence suggests that such regions may have duplicated and/or undergone gene conversion since the separation of the human and chimpanzee lineages from a common ancestor.
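Because pairwise alignments overlap one another, the fraction of a chromosome that is duplicated has to be computed from the union of the aligned intervals rather than from their summed lengths. The sketch below shows that bookkeeping for half-open intervals; the function names and the returned dictionary layout are illustrative assumptions.

```python
def merged_length(intervals):
    """Total number of bases covered by the union of half-open intervals."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def duplication_summary(intra, inter, chrom_length):
    """Fractions of the chromosome in each duplication class and in their overlap."""
    intra_bp = merged_length(intra)
    inter_bp = merged_length(inter)
    total_bp = merged_length(list(intra) + list(inter))
    return {
        "intra": intra_bp / chrom_length,
        "inter": inter_bp / chrom_length,
        "overlap": (intra_bp + inter_bp - total_bp) / chrom_length,
        "total": total_bp / chrom_length,
    }
```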

In contrast to the short arm, the q arm pericentromeric region shows a much smaller block of duplication (<200 kb) followed by a large (about 600 kb) tract of monomeric alpha-satellite repeat sequence. These duplications are highly divergent (<93% identity). Although possibly a much more extensive domain of pericentromeric duplication remains to be sequenced closer to the centromere, the current configuration and the low degree of sequence similarity among these interchromosomal duplications suggest that the 7q pericentromeric region has been relatively quiescent over the last 25 million years of evolution. The subtelomeric region of chromosome 7q is even more striking in this regard. There is virtually no evidence of segmental duplication within a megabase of the telomere. The molecular mechanisms that underlie the difference in subtelomeric and pericentromeric duplication architecture between the two arms of chromosome 7 are unknown, but the asymmetry may represent a general property of metacentric chromosomes as has been suggested previously³⁶,³⁸.

Comparison of the features of inter- and intrachromosomal duplications reveals some intriguing differences. In general, the intrachromosomal alignments tend to be larger than interchromosomal alignments (8 out of 10 pairwise alignments in excess of 100 kb were intrachromosomal duplications; Supplementary S5). This trend may, in part, be due to the fragmented nature of the draft sequence for other unfinished chromosomes. Furthermore, intrachromosomal duplications show an abundance of highly similar duplications (>98% identity), whereas most (57.1%) of the interchromosomally duplicated bases cluster between 93% and 96% identity (mode of 94.5–95.0%) (Fig. 4b; see also Supplementary S5). This interchromosomal mode is significantly different from the average computed for the draft genome as well as other published chromosomes (96.0–98.0%)³⁹. Using sequence divergence as an indicator of evolutionary age, the data suggest that chromosome 7 has been the target of more ancient interchromosomal duplication and/or gene conversion events. In contrast, more recent chromosome-specific duplication/gene conversion events⁴⁰ have occurred since the separation of the human and African ape lineages.

Segmental duplications are known sites of both pathological and evolutionary instability⁴¹,⁴². To identify regions on chromosome 7 that may be associated with genomic disorders, we searched for regions (between 50 kb and 10 Mb apart) that were flanked by large (≥10 kb), highly homologous (≥95%) segmental duplications (Supplementary S6). In addition to the WBS region, we detected ten regions of chromosome 7 (Supplementary S7 and S8), nine of which contained genes and would be considered candidates for genomic disorders. In total, these 11 regions (corresponding to 103 pairwise alignments) of chromosome 7 implicate almost a quarter (40.7 Mb) of the chromosome as being susceptible to duplication-mediated rearrangement. It will be important to investigate these regions experimentally for large-scale variation and association with disease.
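The search criteria in this paragraph amount to a filter over intrachromosomal pairwise alignments. The sketch below applies the stated thresholds (copies of at least 10 kb, at least 95% identity, 50 kb to 10 Mb apart). The input tuple layout is hypothetical, and measuring separation from the end of the first copy to the start of the second is an assumption.

```python
def candidate_rearrangement_regions(alignments, min_len=10_000, min_identity=0.95,
                                    min_gap=50_000, max_gap=10_000_000):
    """Return intervening regions flanked by large, highly similar duplications.

    alignments: list of (start1, end1, start2, end2, identity) tuples for
    intrachromosomal pairs, with half-open coordinates and identity in 0..1.
    """
    regions = []
    for s1, e1, s2, e2, identity in alignments:
        # Order the two copies so that the first one lies proximal.
        if (s1, e1) > (s2, e2):
            s1, e1, s2, e2 = s2, e2, s1, e1
        long_enough = min(e1 - s1, e2 - s2) >= min_len
        gap = s2 - e1
        if long_enough and identity >= min_identity and min_gap <= gap <= max_gap:
            regions.append((e1, s2))
    return regions
```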

Segmental duplications have long been noted for their potential role in the evolution of new genes⁴³. To examine the transcriptional and coding potential of duplicated regions, we analysed a hierarchical, non-overlapping set of known genes, predicted genes and remaining spliced EST clusters (see above). For each group, we categorized every exon as unique or duplicated on the basis of its overlap with duplicated sequence (Table 4). Almost 14% (1,244 out of 9,890) of all exons are duplicated, and most of these lie within intrachromosomal duplicated sequence as opposed to interchromosomal duplicons. Our analysis shows that the relative number of transcribed exons is significantly greater for duplicated DNA when compared with non-duplicated DNA on chromosome 7. These results support a previous observation³⁹ that recently duplicated regions are rich in genes/transcripts. It should be noted, however, that many transcripts within the duplicated sequence have poor translational potential (44% of exons in duplicated regions lie in pseudogenes, compared with 20% in unique regions), based on analysis of open reading frames (see pseudogene analysis above). Indeed, our analyses suggest that genes within duplicated regions show relaxed selective constraint when compared with genes encoded within unique portions of chromosome 7. Duplicated regions of chromosome 7 are, therefore, enriched for a particular class of pseudogene, which may be transcribed and may possess intron–exon structure but is unlikely to be translated. Most of the pseudogenes probably represent dying transcripts, which may on rare occasions lead to the formation of new genes.
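The exon categorization described above amounts to an interval-overlap test. The sketch below assumes half-open (start, end) coordinates and treats any overlap with duplicated sequence as sufficient to call an exon duplicated; the text does not state the exact overlap threshold, so that rule and the function names are assumptions.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) in bp


def overlaps(exon: Interval, duplication: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return exon[0] < duplication[1] and duplication[0] < exon[1]


def classify_exons(exons: List[Interval],
                   duplications: List[Interval]) -> Dict[str, int]:
    """Count exons that do or do not overlap any duplicated interval."""
    counts = {"unique": 0, "duplicated": 0}
    for exon in exons:
        is_duplicated = any(overlaps(exon, dup) for dup in duplications)
        counts["duplicated" if is_duplicated else "unique"] += 1
    return counts
```

For a chromosome-scale exon set, sorting the duplication intervals and using binary search would avoid comparing every exon with every duplication.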

Table 4 Duplicated features for non-redundant transcripts

Conclusions

The sequence of human chromosome 7 described here, and that of several other human chromosomes⁴⁴,⁴⁵,⁴⁶,⁴⁷, represent landmark steps in the Human Genome Project. As chromosome sequences advance from their initial ‘draft’ status to a high-accuracy comprehensive stage, the molecular landscape becomes clearer, and the ability to perform detailed analyses becomes more robust. For our studies of chromosome 7, the combination of a high-quality, nearly complete sequence and a draft sequence of the mouse genome allowed us to perform rigorous gene analyses that included an improved ability to distinguish pseudogenes from bona fide genes. In addition to generating a gene index for the chromosome, our data provide evidence for a small subset of proteins that contain a polymorphism leading to a truncated protein in the human lineage. A chromosome-wide view of segmental duplications revealed that, compared with other chromosomes analysed so far, chromosome 7 exhibits a much higher rate of intrachromosomal duplication. Furthermore, there seems to be evolutionary asymmetry between the long and short arms. Taken together, these findings illustrate the dynamic nature of a mammalian chromosome. Such dynamic behaviour also has adverse consequences, as revealed by the sequence features of the WBS region, where large duplicons of remarkably high sequence similarity mediate disease-causing deletions. Finally, the sequence we report here for chromosome 7 has directly facilitated the identification of a number of genes associated with human disease (for example, refs 10, 48). But these examples, although highly gratifying, simply represent the beginning of efforts to capitalize on the knowledge provided by finished genomic sequence for better understanding the genetic bases for human health and disease.

Methods

Tiling-path verification

To evaluate clone overlaps where the rate of difference between overlapping clone sequences was higher than 1 in 1,000 bases, a PCR product encompassing the differences was sequenced from each BAC in the overlap region and from each of a panel of 24 ethnically diverse genomic DNA samples²¹. If the 24 samples showed allelic variation, the overlap was judged to be correct, but if the 24 samples yielded persistent heterozygosity, the sequence was judged to be derived from a repeated sequence, with sequence differences between the copies.

Assaying mRNA/genomic discrepancies

To investigate discrepancies between mRNA and genomic sequence, PCR products were generated, re-sequenced and the polymorphic bases examined in the DNA from 24 individuals, the original BAC, and in some cases other BACs.

Pseudogene detection

After masking all predicted and known genes and common repeats, we performed homology searches by comparing the DNA sequence of chromosome 7 with a non-redundant protein database (NRDB; see Supplementary Information). All regions matching non-viral and non-transposon known proteins (E value <0.001) were further processed by merging those likely to be parts of the same gene or pseudogene. This step, although likely to cause the loss of some real pseudogenes, is essential to eliminate possible fragmented predictions, which could otherwise lead to an overestimation. We refined the prediction for the resulting DNA fragments by comparing them with the closest protein sequence using GeneWise¹⁷. Finally, we confirmed the integrity of the predictions with a second round of BLASTX against NRDB, removing elements that lacked a significant match (E value <0.001).
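The filtering and merging step can be sketched as follows. The E-value cut-off comes from the text; the rule for deciding that two hits belong to the same gene or pseudogene (same matched protein, separated by at most `max_gap` bases) and the tuple layout are hypothetical stand-ins, because the actual merging criteria are given only in the Supplementary Information.

```python
from typing import List, Tuple

Hit = Tuple[str, int, int, float]   # (protein_id, start, end, e_value)
Locus = Tuple[str, int, int]        # (protein_id, start, end)


def merge_hits(hits: List[Hit],
               max_evalue: float = 1e-3,
               max_gap: int = 1000) -> List[Locus]:
    """Drop weak hits, then merge nearby hits to the same protein.

    max_gap is an illustrative parameter, not a value from the paper.
    """
    # Keep only hits below the E-value cut-off, ordered by protein and position.
    kept = sorted(
        (h for h in hits if h[3] < max_evalue),
        key=lambda h: (h[0], h[1]),
    )
    merged: List[Locus] = []
    for protein, start, end, _ in kept:
        if merged and merged[-1][0] == protein and start - merged[-1][2] <= max_gap:
            # Extend the previous locus instead of reporting a fragment.
            prev = merged[-1]
            merged[-1] = (protein, prev[1], max(prev[2], end))
        else:
            merged.append((protein, start, end))
    return merged
```

Merging before counting is what guards against a single fragmented pseudogene being reported several times.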

Each candidate pseudogene and all predicted and known genes were subjected to a K_A/K_S analysis. We first inferred the ancestral sequence of each of the target sequences (A) using a protein-based DNA multiple alignment of A and its two closest matches in NRDB between 50% and 95% identical. We next estimated the number of synonymous and non-synonymous substitutions occurring in sequence A by comparing it with its ancestral sequence using the YN00 program. In addition to applying the program's preset filters, we excluded those K_A/K_S ratios that were based on an excessively low number of substitutions (<20) or an excessively high one (K_S > 1), resulting in reliable calculations for approximately 50% of analysed sequences.
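The exclusion rule can be written as a small filter. This sketch assumes the substitution count is the total number of substitutions inferred for the sequence, and it additionally rejects a non-positive K_S to avoid division by zero; both points are assumptions rather than details stated in the text, and the function name is illustrative.

```python
from typing import Optional


def reliable_ka_ks(n_substitutions: int,
                   ka: float,
                   ks: float,
                   min_substitutions: int = 20,
                   max_ks: float = 1.0) -> Optional[float]:
    """Return K_A/K_S, or None when the estimate is considered unreliable."""
    if n_substitutions < min_substitutions:
        return None  # too few substitutions for a stable ratio
    if ks > max_ks:
        return None  # synonymous sites are likely saturated
    if ks <= 0:
        return None  # ratio undefined (guard added for this sketch)
    return ka / ks
```

A ratio near 1 is expected for neutrally evolving sequence, whereas functional genes under purifying selection typically show much lower values.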

The fraction of neutrally evolving sequences included in the pseudogene and gene sets was calculated by comparing their K_A/K_S distributions with benchmark distributions for functional and pseudogenic elements. These distributions were obtained from the K_A/K_S analysis of 2,000 functional human genes randomly selected from a 50% non-redundant RefSeq¹⁹ (reviewed) collection, and 1,730 processed pseudogenes with open reading frame truncations identified from a homology search through the whole human genome as performed for chromosome 7 (manuscript in preparation). Each of the K_A/K_S distributions associated with the pseudogene and gene sets of chromosome 7 was compared with the benchmark distributions using a ‘least squares fitting’ to obtain estimates of neutrally evolving fractions of sequences. The error rate associated with this estimate is <3% according to cross-validation analysis with the benchmark sequences.
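One way to read the least-squares step is as a two-component mixture fit: the observed K_A/K_S histogram is modelled as a fraction f of the pseudogene (neutral) benchmark plus (1 − f) of the functional benchmark, and f is chosen to minimize the squared difference. The sketch below implements that reading with a closed-form solution; the exact fitting procedure is not spelled out in the text, so the model, the shared binning and the function name are assumptions.

```python
from typing import Sequence


def neutral_fraction(observed: Sequence[float],
                     functional: Sequence[float],
                     neutral: Sequence[float]) -> float:
    """Least-squares estimate of f in: observed ~ f*neutral + (1 - f)*functional.

    All three inputs are histograms over the same K_A/K_S bins,
    each normalized to sum to 1.
    """
    if not (len(observed) == len(functional) == len(neutral)):
        raise ValueError("histograms must use the same bins")
    # Setting the derivative of sum((obs - func - f*(neut - func))**2) to zero gives f.
    numerator = sum((o - g) * (n - g)
                    for o, g, n in zip(observed, functional, neutral))
    denominator = sum((n - g) ** 2 for g, n in zip(functional, neutral))
    if denominator == 0:
        raise ValueError("benchmark distributions are identical")
    # A fraction must lie between 0 and 1.
    return min(1.0, max(0.0, numerator / denominator))
```

Cross-validation of the kind mentioned in the text could be mimicked by mixing held-out benchmark sequences in known proportions and checking how closely the estimate recovers them.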

We compared each identified pseudogene at the protein level with all mouse genes (Ensembl³⁴) located in the corresponding orthologous regions. A match was considered positive when the associated E value was <10⁻⁸. To avoid mis-annotations owing to the possible absence of decisive sequences in the mouse gene set, we also compared each translated pseudogene with the whole mouse orthologous region using tBLASTn with the same E-value cut-off.

Accession codes

Accessions

GenBank/EMBL/DDBJ

BL000002

Data deposits

Accession numbers for the sequence analysed for this paper can be found in Table 1. All reported DNA sequences have been deposited in GenBank or EMBL. The updated chromosome 7 sequence can be accessed through GenBank accession BL000002.

References

1. Riordan, J. R. et al. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245, 1066–1073 (1989)
2. Liang, H. et al. Molecular anatomy of chromosome 7q deletions in myeloid neoplasms: evidence for multiple critical loci. Proc. Natl Acad. Sci. USA 95, 3781–3785 (1998)
3. Human Genome Mapping Consortium. A physical map of the human genome. Nature 409, 934–941 (2001)
4. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001)
5. Riethman, H. et al. Integration of telomere sequences with the draft human genome sequence. Nature 409, 948–951 (2001)
6. Felsenfeld, A., Peterson, J., Schloss, J. & Guyer, M. Assessing the quality of the DNA sequence from the Human Genome Project. Genome Res. 9, 1–4 (1999)
7. Bouffard, G. G. et al. A physical map of human chromosome 7: an integrated YAC contig map with average STS spacing of 79 kb. Genome Res. 7, 673–692 (1997)
8. Dib, C. et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature 380, 152–154 (1996)
9. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002)
10. Zenklusen, J. C., Conti, C. J. & Green, E. D. Mutational and functional analyses reveal that ST7 is a highly conserved tumor-suppressor gene on human chromosome 7q31. Nature Genet. 27, 392–398 (2001)
11. Nadezhdin, E. V. et al. Identification of paralogous HERV-K LTRs on human chromosomes 3, 4, 7 and 11 in regions containing clusters of olfactory receptor genes. Mol. Genet. Genom. 265, 820–825 (2001)
12. Chandrashekar, J. et al. T2Rs function as bitter taste receptors. Cell 100, 703–711 (2000)
13. Rowen, L., Koop, B. F. & Hood, L. The complete 685-kilobase DNA sequence of the human beta T cell receptor locus. Science 272, 1755–1762 (1996)
14. Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse. Proc. Natl Acad. Sci. USA 90, 11995–11999 (1993)
15. Wheelan, S. J., Church, D. M. & Ostell, J. M. Spidey: a tool for mRNA-to-genomic alignments. Genome Res. 11, 1952–1957 (2001)
16. Solovyev, V. V. Statistical Approaches in Eukaryotic Gene Prediction (ed. Balding, D.) (Wiley, New York, 2001)
17. Birney, E. & Durbin, R. Using GeneWise in the Drosophila annotation experiment. Genome Res. 10, 547–548 (2000)
18. Korf, I., Flicek, P., Duan, D. & Brent, M. R. Integrating genomic homology into gene structure prediction. Bioinformatics 17, 140–148 (2001)
19. Pruitt, K. D. & Maglott, D. R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137–140 (2001)
20. Mammalian Gene Collection Program Team. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc. Natl Acad. Sci. USA 99, 16899–16903 (2002)
21. The International Human SNP Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001)
22. Hardy, D. M. & Garbers, D. L. A sperm membrane protein that binds in a species-specific manner to the egg extracellular matrix is homologous to von Willebrand factor. J. Biol. Chem. 270, 26025–26028 (1995)
23. Flicek, P., Keibler, E., Hu, P., Korf, I. & Brent, M. R. Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map. Genome Res. 13, 46–54 (2003)
24. Benson, D. A. et al. GenBank. Nucleic Acids Res. 30, 17–20 (2002)
25. Zdobnov, E. M. et al. Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science 298, 149–159 (2002)
26. Ohta, T. & Ina, Y. Variation in synonymous substitution rates among mammalian genes and the correlation between synonymous and non-synonymous divergences. J. Mol. Evol. 41, 717–720 (1995)
27. Eddy, S. R. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinform. 2, 18 (2002)
28. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997)
29. Mourelatos, Z. et al. miRNPs: a novel class of ribonucleoproteins containing numerous microRNAs. Genes Dev. 16, 720–728 (2002)
30. Maraia, R. J., Sasaki-Tozawa, N., Driscoll, C. T., Green, E. D. & Darlington, G. J. The human Y4 small cytoplasmic RNA gene is controlled by upstream elements and resides on chromosome 7 with all other hY scRNA genes. Nucleic Acids Res. 22, 3045–3052 (1994)
31. Mulder, N. J. et al. InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform. 3, 225–235 (2002)
32. Zdobnov, E. M. & Apweiler, R. InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847–848 (2001)
33. The Gene Ontology Consortium. Creating the Gene Ontology resource: design and implementation. Genome Res. 11, 1425–1433 (2001)
34. Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Res. 30, 38–41 (2002)
35. Korenberg, J. R. et al. Genome structure and cognitive map of Williams syndrome. J. Cogn. Neurosci. 12, 89–107 (2000)
36. Horvath, J. E., Bailey, J. A., Locke, D. P. & Eichler, E. E. Lessons from the human genome: transitions between euchromatin and heterochromatin. Hum. Mol. Genet. 10, 2215–2223 (2001)
37. Mefford, H. C. & Trask, B. J. The complex structure and dynamic evolution of human subtelomeres. Nature Rev. Genet. 3, 91–102 (2002)
38. Eichler, E. E., Archidiacono, N. & Rocchi, M. CAGGG repeats and the pericentromeric duplication of the hominoid genome. Genome Res. 9, 1048–1058 (1999)
39. Bailey, J. A. et al. Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am. J. Hum. Genet. 70, 83–100 (2002)
40. DeSilva, U., Massa, H., Trask, B. J. & Green, E. D. Comparative mapping of the region of human chromosome 7 deleted in Williams syndrome. Genome Res. 9, 428–436 (1999)
41. Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002)
42. Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends Genet. 18, 74–82 (2002)
43. Ohno, S., Wolf, U. & Atkin, N. Evolution from fish to mammals by gene duplication. Hereditas 59, 169–187 (1968)
44. Heilig, R. et al. The DNA sequence and analysis of human chromosome 14. Nature 421, 601–607 (2003)
45. Deloukas, P. et al. The DNA sequence and comparative analysis of human chromosome 20. Nature 414, 865–871 (2001)
46. Hattori, M. et al. The DNA sequence of human chromosome 21. Nature 405, 311–318 (2000)
47. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489–495 (1999)
48. Everett, L. A. et al. Pendred syndrome is caused by mutations in a putative sulphate transporter gene (PDS). Nature Genet. 17, 411–422 (1997)
49. Collins, J. E. et al. Reevaluating human gene annotation: A second-generation analysis of chromosome 22. Genome Res. 13, 27–36 (2003)

Acknowledgements

We thank the following laboratories, which contributed sequence segments to the final chromosome 7 sequence: Whitehead Institute Center for Genome Research; Institute of Molecular Biotechnology, Jena, Germany; Sanger Institute; University of Washington Multimegabase Sequencing Center; University of Oklahoma Advanced Center for Genome Technology; Stanford DNA Sequencing and Technology Development Center; Joint Genome Institute; Human Genome Studies Group, Los Alamos. We also acknowledge the efforts of the respective staff of the Washington University Genome Sequencing Center and the University of Washington Genome Center. We thank E. Birney for assistance in producing Fig. 1.

Author information

Marco A. Marra
Present address: Genome Sciences Centre, British Columbia Cancer Agency, 600 West 10th Avenue, Room 3427, Vancouver, British Columbia, V5Z-4E6, Canada
Robert H. Waterston
Present address: Department of Genome Sciences, Box 357730, University of Washington, 1705 NE Pacific Street, Seattle, Washington, 98195-7730, USA
John D. McPherson
Present address: Baylor College of Medicine, Human Genome Sequencing Center, 1 Baylor Plaza, N1519, Houston, Texas, 77030, USA

Authors and Affiliations

Genome Sequencing Center, Washington University School of Medicine, Campus Box 8501, 4444 Forest Park Avenue, St Louis, Missouri, 63108, USA
LaDeana W. Hillier, Robert S. Fulton, Lucinda A. Fulton, Tina A. Graves, Kymberlie H. Pepin, Caryn Wagner-McPherson, Dan Layman, Jason Maas, Sara Jaeger, Rebecca Walker, Kristine Wylie, Mandeep Sekhon, Michael C. Becker, Michelle D. O'Laughlin, Mark E. Schaller, Ginger A. Fewell, Kimberly D. Delehaunty, Tracie L. Miner, William E. Nash, Matt Cordes, Hui Du, Hui Sun, Jennifer Edwards, Holland Bradshaw-Cordum, Johar Ali, Stephanie Andrews, Amber Isak, Andrew VanBrunt, Christine Nguyen, Feiyu Du, Betty Lamar, Laura Courtney, Joelle Kalicki, Philip Ozersky, Lauren Bielicki, Kelsi Scott, Andrea Holmes, Richard Harkins, Anthony Harris, Cynthia Madsen Strong, Shunfang Hou, Chad Tomlinson, Sara Dauphin-Kohlberg, Amy Kozlowicz-Reilly, Shawn Leonard, Theresa Rohlfing, Susan M. Rock, Aye-Mon Tin-Wollam, Amanda Abbott, Patrick Minx, Rachel Maupin, Catrina Strowmatt, Phil Latreille, Nancy Miller, Doug Johnson, Jennifer Murray, Jeffrey P. Woessner, Michael C. Wendl, Shiaw-Pyng Yang, Brian R. Schultz, John W. Wallis, John Spieth, Tamberlyn A. Bieri, Joanne O. Nelson, Nicolas Berkowicz, Patricia E. Wohldmann, Lisa L. Cook, Matthew T. Hickenbotham, James Eldred, Donald Williams, Joseph A. Bedell, Elaine R. Mardis, Sandra W. Clifton, Stephanie L. Chissoe, Marco A. Marra, Asif T. Chinwalla, Warren R. Gish, John D. McPherson, Robert H. Waterston & Richard K. Wilson
University of Washington Genome Center, 2225 Fluke Hall on Mason Road, Campus Box 352 145, Seattle, Washington, 98195, USA
Christopher Raymond, Eric Haugen, Will Gillett, Yang Zhou, Rose James, Karen Phelps, Shawn Iadanoto, Kerry Bubb, Elizabeth Simms, Ruth Levy, James Clendenning, Rajinder Kaul & Maynard V. Olson
Center for Biomolecular Science and Engineering, University of California, 321 BE, Santa Cruz, California, 95064, USA
W. James Kent, Terrence S. Furey & Robert A. Baertsch
Department of Computer Science, Washington University, Box 1045, St Louis, Missouri, 63130, USA
Michael R. Brent, Evan Keibler & Paul Flicek
EMBL, Meyerhofstrasse 1, 69117, Heidelberg, Germany
Peer Bork, Mikita Suyama & David Torrents
Department of Genetics, Center for Computational Genomics and Center for Human Genetics, Case Western Reserve University School of Medicine and University Hospitals of Cleveland, Cleveland, Ohio, 44106, USA
Jeffrey A. Bailey & Evan E. Eichler
Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Building 50, Room 5222, South Drive, Bethesda, Maryland, 20892, USA
Matthew E. Portnoy & Eric D. Green
Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, Campus Box 8232, 4566 Scott Ave., St Louis, Missouri, 63110, USA
Sean R. Eddy

Corresponding author

Correspondence to Richard K. Wilson.

Ethics declarations

Competing interests

The authors declare that they have no competing financial interests.

Supplementary information

Supplementary Table 1: Interspersed Repeat Content (DOC 33 kb)

Supplementary Table 2: Correlation of Common Repeats with GC and exonic content. (XLS 16 kb)

Supplementary Figure 1: Distribution of Segmental Duplications. A schematic of chromosome 7 segmental duplications depicting the location of interchromosomal (red) and intrachromosomal (blue) duplicated sequence. Each horizontal line represents 1 Mb of sequence, with tick marks every 200 kb. Sequencing gaps are represented as discontinuities within the horizontal line. Duplications detected by whole genome shotgun sequence are represented as green bars above the chromosome sequence. The centromere is purple. Alpha satellite (black) and beta satellite (grey) are bars below the sequence. (PDF 164 kb)

Supplementary Figure 2: Sequence Identity of Segmental Duplications. A schematic of chromosome 7 segmental duplications depicting the sequence identity of interchromosomal (red) and intrachromosomal (blue) duplicated sequence. Each horizontal line represents 1 Mb of sequence, with tick marks every 200 kb. The vertical axis shows the percent sequence identity. Different colored bars indicate chromosome assignment. Sequencing gaps are represented as discontinuities within the horizontal line. (PDF 285 kb)

Supplementary Figure 3: Segmental Duplication Length Distribution (≥90% and ≥1 kb). The total number of aligned bases was calculated for two categories of duplication (90-95% and 95-100% sequence identity). The length of each interchromosomal (red) and intrachromosomal (blue) alignment is shown. The relative proportion of large to small duplications decreases with decreasing sequence identity. (PDF 103 kb)

Supplementary Figure 4: Duplications flanking Williams-Beuren Syndrome region. The complex structure and interrelationship of sequence alignments are shown. Joining lines demarcate the start and end points of the most highly similar (≥97% and ≥10 kb) alignments (most likely to mediate rearrangements) in the Williams Syndrome region. The alignments demonstrate extensive large-scale sequence variation including insertions, deletions and inversions among paralogous loci. (PDF 41 kb)

Supplementary Table 3: Putative rearrangement prone regions. (XLS 8 kb)

Supplementary Table 4: Known genes with duplicated exon sequence. (XLS 21 kb)

Supplementary Materials and Methods (DOC 124 kb)

PDF version of Figure 1

About this article

Cite this article

Hillier, L., Fulton, R., Fulton, L. et al. The DNA sequence of human chromosome 7. Nature 424, 157–164 (2003). https://doi.org/10.1038/nature01782

Received: 25 February 2003
Accepted: 23 April 2003
Issue Date: 10 July 2003
DOI: https://doi.org/10.1038/nature01782
