Letters to Nature

Nature 424, 788-793 (14 August 2003) | doi:10.1038/nature01858; Received 11 April 2003; Accepted 16 June 2003

Comparative analyses of multi-species sequences from targeted genomic regions

J. W. Thomas1,11, J. W. Touchman1,2,11, R. W. Blakesley1,2, G. G. Bouffard1,2, S. M. Beckstrom-Sternberg1,2, E. H. Margulies1, M. Blanchette3, A. C. Siepel3, P. J. Thomas2, J. C. McDowell2, B. Maskeri2, N. F. Hansen2, M. S. Schwartz3, R. J. Weber3, W. J. Kent3, D. Karolchik3, T. C. Bruen3, R. Bevan3, D. J. Cutler4, S. Schwartz5, L. Elnitski5, J. R. Idol1, A. B. Prasad1, S.-Q. Lee-Lin1, V. V. B. Maduro1, T. J. Summers1, M. E. Portnoy1, N. L. Dietrich2, N. Akhter2, K. Ayele2, B. Benjamin2, K. Cariaga2, C. P. Brinkley2, S. Y. Brooks2, S. Granite2, X. Guan2, J. Gupta2, P. Haghighi2, S.-L. Ho2, M. C. Huang2, E. Karlins2, P. L. Laric2, R. Legaspi2, M. J. Lim2, Q. L. Maduro2, C. A. Masiello2, S. D. Mastrian2, J. C. McCloskey2, R. Pearson2, S. Stantripop2, E. E. Tiongson2, J. T. Tran2, C. Tsurgeon2, J. L. Vogt2, M. A. Walker2, K. D. Wetherby2, L. S. Wiggins2, A. C. Young2, L.-H. Zhang2, K. Osoegawa6, B. Zhu6, B. Zhao6, C. L. Shu6, P. J. De Jong6, C. E. Lawrence7, A. F. Smit8, A. Chakravarti4, D. Haussler3,9, P. Green10, W. Miller5 & E. D. Green1,2

  1. Genome Technology Branch, National Human Genome Research Institute, and
  2. NIH Intramural Sequencing Center, National Institutes of Health, Bethesda, Maryland 20892, USA
  3. Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
  4. Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21287, USA
  5. Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  6. Children's Hospital Oakland Research Institute, Oakland, California 94609, USA
  7. The Wadsworth Center for Laboratories and Research, New York State Department of Health, Albany, New York 12201, USA
  8. The Institute for Systems Biology, Seattle, Washington 98103, USA
  9. Howard Hughes Medical Institute, University of California, Santa Cruz, California 95064, USA
  10. Howard Hughes Medical Institute and Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
  11. Present addresses: Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia 30322, USA (J.W.Th.); Translational Genomics Research Institute, Phoenix, Arizona 85004 and Department of Biology, Arizona State University, Tempe, Arizona 85287, USA (J.W.To.)

Correspondence to: E. D. Green1,2 Email: egreen@nhgri.nih.gov
GenBank accession numbers for BAC-derived sequences are provided in the Supplementary Information.


The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding1, 2, 3, 4, 5, 6 and conserved non-coding4, 6, 7 regions, including regulatory elements8, 9, 10, and provide insight into the forces that have shaped modern-day genomes6. As a complement to whole-genome sequencing efforts3, 5, 6, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates.

The NIH Intramural Sequencing Center (NISC) Comparative Sequencing Program aims to sequence and to analyse targeted genomic regions in multiple vertebrates. Our initial target is a genomic segment of about 1.8 Mb on human chromosome 7q31.3 containing the gene encoding the cystic fibrosis transmembrane conductance regulator11 and nine other genes (referred to below as the 'greater CFTR region'). We sought to clone and to sequence the orthologous genomic segments in multiple other vertebrates (Table 1 and Methods). So far, our efforts have yielded more than 12 Mb of high-quality comparative sequence data (see Supplementary Information), over 95% of which has been finished to the standards established for human genome sequence12. This represents the most diverse collection of large blocks of orthologous vertebrate sequence generated to date.


To identify regions of sequence conservation, we used blastz13 to construct pair-wise alignments of the sequences and MultiPipMaker14 to show pair-wise percentage-identity plots of an annotated reference sequence against multiple query sequences (Fig. 1a and Supplementary Information). Alignments between the human sequence and the sequence of each of the other 12 species allowed the general patterns of conservation to be investigated. As expected, the fraction of sequence that can be aligned generally decreases with increasing evolutionary distance from humans (Fig. 1b). The only exceptions to this trend are mouse and rat, which, although considered to be closer to humans than are the other non-primate mammals included here15, have a lower fraction of sequence that can be aligned with the human sequence. This probably reflects a particularly high rate of sequence evolution, including large deletions, in the rodent lineage6.

Figure 1: Patterns of sequence conservation.

a, MultiPipMaker results for the greater CFTR region. Pair-wise alignments between the human reference sequence and the sequence of each indicated species were generated. The percentage identity of each gap-free alignment is indicated. Numbered boxes correspond to exons. b, Proportion of aligned human sequence in each of four annotated categories, shown for alignments with the sequence of each indicated species. c, Deduced phylogenetic tree of indicated mammalian species (see Supplementary Information). Labels on branches reflect differences in exon lengths, as determined manually by parsimony (+ , insertion; -, deletion; e, extension due to alteration of splice site or stop codon; s, early stop codon).


For all species, reasonably consistent alignments are seen in coding exons (Fig. 1a, b). The human–fish alignments are largely limited to these, with only 14% of the aligned sequence outside coding exons; in fact, only two local human–fish alignments do not at least partially overlap a coding exon (5' to CAV2 and within intron 4 of CORTBP2; see Supplementary Information). Almost a third (31.4%) of the human coding sequence does not align to fish sequence in the orthologous region. By contrast, the human–chicken alignments exclude less than 2% of the coding sequence, which is comparable to the alignments between human and the other mammals. Within the alignments, sequence divergence relative to human (measured as the percentage of single-nucleotide mismatches) varies from a low of 1.15% for chimpanzee to a high of 35.60% for zebrafish (see Supplementary Information).
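The divergence measure quoted above (percentage of single-nucleotide mismatches within the alignments) can be illustrated with a minimal Python sketch. It assumes two already-aligned strings of equal length in which '-' marks a gap; the function name and the toy sequences are hypothetical and do not reflect the blastz output format or the authors' actual pipeline.

```python
def percent_mismatch(seq_a: str, seq_b: str) -> float:
    """Percentage of ungapped alignment columns in which the two bases differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are indels, not single-nucleotide mismatches
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


# Toy example (made-up sequences): one mismatch in ten ungapped columns.
print(percent_mismatch("ACGTACGTAC", "ACGTACGTAA"))  # 10.0
```

Gap columns are skipped here because the text reports indels separately from single-nucleotide mismatches.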

Analyses of large-scale mutational events in this genomic region support mammalian phylogenies15, 16 that place the rodents and primates as sister groups in one clade and the artiodactyls and carnivores as sister groups in another clade (Fig. 1c). Both of these groupings are confirmed by transposon insertions that are present in sister groups and absent from the other species. In particular, we found three clear examples of insertions that support the rodent–primate grouping (all MLT1A0 elements) and three that support the artiodactyl–carnivore grouping (one MLT1A0 and two L1MA9; see Supplementary Information). In each case, insertions of the same transposon subtype, present in the same orientation and with the same target-site duplication, were identified at homologous positions in all members of sister groups.

Both MLT1A0 and L1MA9 elements are thought to have been active at around the time of the eutherian radiation, on the basis of the reconstructed phylogenies and divergence levels of the elements themselves. We found no insertions supporting alternative phylogenies (for example, rodents as an outgroup to primates, artiodactyls and carnivores). For example, out of 81 identified segments present in the primates, artiodactyls and carnivores but missing from rodents, none is an identifiable transposable element; instead, all seem to be deletions in the rodent lineage. These analyses seem to definitively refute alternative phylogenies that place the rodents as an outgroup to the other mammals studied here16, 17.

Tabulation of exon-length differences indicates a significant excess of deletions relative to insertions, which is particularly strong in the rodent lineage (consistent with previous reports for mouse genes6). When this excess is taken into account, the most parsimonious interpretation of the differences again favours the grouping of primates with rodents (Fig. 1c).

Neutral substitution rates estimated from sites in ancestral repeats, which are relics of transposons inserted before the eutherian radiation, also show a higher rate of substitution in the rodent lineage (Fig. 1c). The branch lengths that we estimate from ancestral repeat sites total about 1.2 substitutions per site in the mammals and are similar to previous calculations (made with these sequence data but with a different multi-sequence alignment18) that were based on examining untranslated regions (UTRs; total 0.93 substitutions per site), non-exonic regions (total 1.35 substitutions per site) and synonymous substitutions in codons (total 1.28 substitutions per codon).

It is important to note, however, that there are several currently unresolved methodological issues regarding substitution analysis of non-coding genomic sequences. Alignments involving the more diverged sequences tend to have significant uncertainties, and current substitution models do not easily accommodate several known complexities in neutral mutational patterns, including context effects19, positional variation in substitution rates6, 20 and the fact (discovered with these sequence data21) that there are substitution rate asymmetries associated with transcribed regions. As a result, these rate estimates must be interpreted with caution. Together, these data show that combined molecular evolution studies examining transposon events, neutral substitutions and exonic changes provide a more robust and informative phylogenetic analysis than any method alone.

Of special interest are small genomic regions that are more highly conserved across multiple species than are neutrally evolving sequences, as these may be under purifying selection (that is, selection against mutation of the base) for a functional role. By the use of two methods (E.H.M. et al., manuscript in preparation, and Supplementary Information), we identified 'multi-species conserved sequences' (MCSs) across the greater CFTR region. Each method was calibrated such that 5% of bases in the region fell within an MCS (a value chosen to be consistent with the estimated 5% of the mammalian genome under selection based on human–mouse sequence comparisons6). Nearly 80% of the MCSs identified with the two methods overlap, and 75% of the MCS bases are identical.
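The idea of calibrating a method so that a fixed fraction of bases falls within an MCS can be pictured with a deliberately simplified Python sketch. This is not either of the two methods used here (those are described in the Supplementary Information); it merely assumes a hypothetical per-base conservation score and picks the cutoff that the top 5% of bases meet or exceed. The function name, the score list and the handling of ties are all assumptions made for illustration.

```python
def calibrate_threshold(scores, target_fraction=0.05):
    """Return the score cutoff met or exceeded by roughly target_fraction of bases."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    # Number of bases that should end up at or above the cutoff (at least one).
    k = max(1, round(target_fraction * len(scores)))
    return sorted(scores, reverse=True)[k - 1]


# Hypothetical scores for 100 bases, all distinct.
toy_scores = list(range(100))
cutoff = calibrate_threshold(toy_scores, 0.05)
selected = [s for s in toy_scores if s >= cutoff]
print(cutoff, len(selected) / len(toy_scores))
```

With tied scores, more than the target fraction of bases can meet the cutoff; a real calibration would need an explicit rule for that case.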

We examined the 1,194 MCSs that overlapped between the two methods. These average 58 base pairs (bp) and represent 3.7% of the bases in the region. About 2% of MCS bases fall in ancestral repeats (corresponding to approximately 0.4% of the ancestral repeat sequence in the region), suggesting that the fraction of neutrally evolving sequence falsely identified as conserved is small. A total of 32% of the MCS bases overlap known coding sequence or UTRs, involving 98% (125/128) of the known coding exons and 67% (14/21; note that the 5' UTR of the MET gene spans two exons) of the known UTRs. The MCSs contain 90.4% and 27.2% of coding and UTR bases, respectively. The relative paucity of UTR bases in MCSs might reflect the fact that these regions include unselected bases or are under lower selection or occasional positive selection, or a combination of these. The remaining 68% of the MCS bases are outside known exons, and virtually all of these (92%) are in MCSs that contain less than 5% repetitive sequence and are present in a single-copy fashion in the human genome (see Supplementary Information). Of the non-exonic MCSs, 16 out of 966 (1.7%; averaging 31 bp) fall within the 1-kilobase (kb) segments immediately upstream of a known transcription start site (the presumed location of most core promoters), which is about 2.3 times more frequent than would be expected if the MCS bases were randomly distributed.

Interestingly, 950 of the 1,194 MCSs (80%; averaging 44 bp) are neither exonic nor lie less than 1-kb upstream of transcribed sequence. Of these, 648 fall in introns and 302 are intergenic; this represents a 28% enrichment of MCS bases in introns (as compared with a random distribution). The detected MCSs overlap with 63% of the functionally validated regulatory elements in the region and 26% of promoters predicted by in silico analyses (see Supplementary Information). Several factors could account for the failure of MCSs to overlap all such elements: many of these elements are notably small (< 14 bp), whereas our methods do not identify MCSs of less than 25 bp; some of the regulatory elements may be specific to the primate lineage; and the presence or position of some of the annotated elements may be incorrect. Note that most (98%) non-exonic MCSs do not correspond to currently known regulatory elements, yielding a rich supply of candidates for future functional studies.

An important issue relevant to future genome sequencing projects is the degree to which the detection of highly conserved sequences depends on the particular set of species being studied. In the absence of a comprehensive catalogue of functional elements for any large region of the human genome, we used the detection of MCSs as a surrogate for the ability of a species' sequence to identify functionally important regions. Because a draft mouse genome sequence is now available6, we first examined the ability of human–mouse pair-wise alignments to detect a set of 561 MCSs in a portion of the greater CFTR region for which sequence coverage in all species was nearly complete (see Supplementary Information). Adjustments to the stringency of the human–mouse alignments were ineffective at accurately identifying these MCSs (Fig. 2a). For example, at a percentage-identity threshold of 85%, the sensitivity (percentage of MCS bases that overlap aligned mouse sequence) is only 41%, whereas the specificity (percentage of the aligned mouse sequence that overlaps MCSs) is 77%. This is consistent with the observation that individual conserved elements often cannot be reliably detected with only the human and mouse sequences6.
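The two measures defined above can be made concrete with a short Python sketch. It assumes half-open (start, end) coordinate intervals on the human reference; the function names and the toy intervals are hypothetical, and the set-based expansion is chosen for clarity rather than efficiency.

```python
def to_bases(intervals):
    """Expand half-open (start, end) intervals into a set of base positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def sensitivity_specificity(mcs_intervals, aligned_intervals):
    """Sensitivity: % of MCS bases overlapping aligned sequence.
    Specificity: % of aligned bases overlapping MCSs (as defined in the text)."""
    mcs = to_bases(mcs_intervals)
    aligned = to_bases(aligned_intervals)
    if not mcs or not aligned:
        raise ValueError("both interval sets must cover at least one base")
    shared = len(mcs & aligned)
    return 100.0 * shared / len(mcs), 100.0 * shared / len(aligned)


# Toy coordinates (made up): 100 MCS bases, 60 aligned bases, 40 shared.
toy_mcs = [(100, 150), (300, 350)]
toy_aligned = [(120, 160), (340, 360)]
sens, spec = sensitivity_specificity(toy_mcs, toy_aligned)
print(f"sensitivity={sens:.1f}% specificity={spec:.1f}%")
```

Note that 'specificity' as used here is the fraction of aligned sequence that lies in MCSs, which corresponds to what is often called precision.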

Figure 2: Detection of MCSs by using different mammalian sequences.

a, For each indicated percentage-identity threshold (see Supplementary Information), the number of aligned mouse bases that overlap a set of 561 MCSs (see text) is shown in blue. Also shown are the number of aligned mouse bases that do not overlap MCSs (yellow) and the number of MCS bases that do not overlap aligned mouse sequence (grey). The arrow indicates the percentage-identity threshold that results in a sensitivity of 48% (with a specificity of 62%; see text). b, The relationship between the total branch length of the phylogenetic tree relating the species and the sensitivity of MCS detection is shown for all possible human-containing subsets of the nine mammalian sequences (see Supplementary Information). Results are plotted for three different specificities.


To explore more broadly how MCS detection is dependent on the specific sequences used in the analysis, we identified MCSs with each possible subset of species (always including human) and in each case calibrated the results to yield a defined specificity (percentage of bases overlapping the above 561 MCSs; see Supplementary Information). For the various subsets of mammalian species, there is strong correlation between total divergence of the subset (as measured by the combined branch length of the phylogenetic tree defined by the specific subset of species) and the ability to detect MCSs (Fig. 2b). The results at 90% specificity show that eliminating chimpanzee and baboon does not reduce the number of detected MCS bases. Eliminating the non-human primates, chicken and fish—thereby retaining only the six non-primate mammals in addition to human—reduces the number by 17%. With only one mammal from each major lineage (baboon, cat, cow and mouse), the number is reduced by 29%. Of note, chicken sequence alone detects 40% of MCS bases (representing 94% of the coding but only 29% of the non-coding MCS bases), and this value is higher than for any other single species. Fish sequences alone are effective at detecting coding exons but miss most non-coding MCSs.

The trends in Fig. 2b suggest that most MCSs are broadly distributed among the mammalian lineages, because the power to detect them seems to depend mainly on the total divergence of the subset of species rather than on the particular distribution of the species among lineages. In addition, the fact that at high specificity the sensitivity increases steadily throughout the full range of branch lengths suggests that we may be far from saturating the ability to detect such conserved sequences, even with the full set of mammals examined here. It thus seems that combined branch length will be a useful metric for guiding the selection of additional genomes to sequence.
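The combined branch length of a species subset, as used above, can be sketched in Python as the total length of the minimal subtree connecting the chosen species. The tree topology follows the groupings in Fig. 1c, but the internal node names and all branch lengths below are invented for illustration and are not the estimates reported here.

```python
from itertools import combinations


def subset_branch_length(parent, length, species):
    """Sum the branch lengths of the minimal subtree connecting the given species.

    parent maps each non-root node to its parent; length maps each non-root
    node to the length of the branch leading up to that parent.
    """
    chosen = set(species)
    below = {}  # node -> number of chosen species at or below that node
    for leaf in chosen:
        node = leaf
        while node in parent:  # walk up until the root is reached
            below[node] = below.get(node, 0) + 1
            node = parent[node]
    # A branch is used only if some, but not all, chosen species lie below it.
    return sum(length[n] for n, count in below.items() if count < len(chosen))


# Hypothetical four-species tree; lengths are made-up substitutions per site.
toy_parent = {"human": "primate_rodent", "mouse": "primate_rodent",
              "cat": "carnivore_artiodactyl", "cow": "carnivore_artiodactyl",
              "primate_rodent": "root", "carnivore_artiodactyl": "root"}
toy_length = {"human": 0.10, "mouse": 0.30, "cat": 0.15, "cow": 0.20,
              "primate_rodent": 0.05, "carnivore_artiodactyl": 0.05}

# Enumerate every human-containing subset with at least one other species.
others = ["mouse", "cat", "cow"]
for size in range(1, len(others) + 1):
    for combo in combinations(others, size):
        subset = ("human",) + combo
        print(subset, round(subset_branch_length(toy_parent, toy_length, subset), 2))
```

Branches above the most recent common ancestor of the subset are excluded, so adding a species from a lineage not yet represented contributes the most new branch length.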

The nature of the mutational events that have produced the observed differences in the greater CFTR region among the sequenced species is of fundamental evolutionary interest. Despite strict conservation in gene order and content and the absence of any observed syntenic breaks, there is significant variation in the amount of non-coding sequence, suggesting variation in genome expansion or compression. For example, the region is about tenfold smaller in the two pufferfish species than in the mammals, which themselves vary by as much as approximately 15% (see Supplementary Information). These findings are consistent with the relative genome sizes established for some of these species3, 5, 6 and point to significant changes in this genomic region throughout vertebrate evolution. To account for these differences, we looked for evidence of molecular events that have contributed to genome expansion or compression.

The fraction of sequence corresponding to interspersed repeats, which are remnants of insertion events mediated by active or extinct transposons, varies from roughly 29% to 39% among the mammals (Fig. 3a; see Supplementary Information). The interspersed repeat content is much lower in the non-mammalian vertebrates, with the pufferfish and zebrafish sequences containing less than 1% and 13% interspersed repeats, respectively. In addition, the distribution of interspersed repeat types differs greatly among species (Fig. 3a). Thus, the accumulation of interspersed repeats correlates with expansion of the greater CFTR region in the main lineages.

Figure 3: Comparison of genome dynamics among species.

a, Relative content of different interspersed repeat types plotted for each species. b, Relative contribution of different mutational events within major lineages. Sequence alignments between the indicated species pairs were used to classify all sequence differences: single-nucleotide changes, small indels (< 100 bp), large indels (> 100 bp) and complex events (involving > 100 bp and most probably resulting from an inversion or more than one deletion and/or insertion; see Supplementary Information). The relative fraction of nucleotide differences falling into each class is shown for the indicated species pairs. The overall percentage of single-nucleotide mismatches for the greater CFTR region in each species pair is 1.15%, human–chimpanzee; 5.60%, human–baboon; 13.93%, mouse–rat; 15.57%, cat–dog; 17.52%, cow–pig; and 20.50%, Fugu–Tetraodon. SINE, short interspersed nuclear element; LINE, long interspersed nuclear element.


Differences in the pattern of interspersed repeat types are also apparent between and within mammalian lineages. Indeed, the accumulation of species-specific interspersed repeats seemingly accounts for much of the differences in relative size and interspersed repeat composition of this genomic region within these lineages (see Supplementary Information). Our findings implicate an increased rate of interspersed repeat insertions rather than variation in the extent of deletions as a primary cause of the observed size differences for this region; this contrasts with the observed higher rate of deletion versus insertion in the mouse lineage, which has resulted in a significantly smaller size of the mouse genome relative to human6.

For the three primate species, we estimated the rate of large (> 100 bp) insertions or deletions (indels) using parsimony (see Supplementary Information). This revealed variable rates of insertion and deletion, producing a slight relative expansion in the human lineage and a slight compression in both the chimpanzee and baboon lineages. We also examined the relative importance of interspersed repeat insertions and large indels compared with small indels (< 100 bp) and single-nucleotide changes. Although most mutational events leading to human–chimpanzee differences are single-nucleotide changes, they account for only 33% of the bases that differ between these two species (Fig. 3b and Supplementary Information). Indeed, nearly half of the differing bases correspond to large indels (Fig. 3b; as an example, note the large deletion in the chimpanzee sequence immediately upstream of CAPZA2 exon 2 in Fig. 1a). Similarly, at least 44% of the bases that differ between the human and baboon sequences reflect large indels. Thus, among the primates, large indels are the principal mechanism accounting for the observed sequence differences, a finding that is consistent with other studies22, 23.
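The distinction drawn above, between the share of mutational events and the share of differing bases, can be illustrated with a small Python sketch. The event list is invented, complex events are omitted, and treating an indel of exactly 100 bp as large is an assumption (the text specifies only < 100 bp and > 100 bp).

```python
from collections import Counter

LARGE_INDEL_MIN_BP = 100  # assumption: an indel of exactly 100 bp counts as large


def classify(kind, length_bp):
    """Map a raw difference onto one of the classes used in Fig. 3b."""
    if kind == "substitution":
        return "single-nucleotide"
    if kind == "indel":
        return "large indel" if length_bp >= LARGE_INDEL_MIN_BP else "small indel"
    raise ValueError(f"unknown event kind: {kind}")


def summarize(events):
    """Return {class: (fraction of events, fraction of differing bases)}."""
    if not events:
        raise ValueError("events must not be empty")
    n_events, n_bases = Counter(), Counter()
    for kind, length_bp in events:
        label = classify(kind, length_bp)
        n_events[label] += 1
        n_bases[label] += length_bp
    total_events = sum(n_events.values())
    total_bases = sum(n_bases.values())
    return {label: (n_events[label] / total_events, n_bases[label] / total_bases)
            for label in n_events}


# Made-up data: eight substitutions, one 12-bp indel and one 300-bp indel.
toy_events = [("substitution", 1)] * 8 + [("indel", 12), ("indel", 300)]
for label, (event_share, base_share) in summarize(toy_events).items():
    print(f"{label}: {event_share:.1%} of events, {base_share:.1%} of bases")
```

In this toy case substitutions are most of the events but a small minority of the differing bases, the same qualitative pattern described for the human–chimpanzee comparison.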

Similar analyses of the other mammals identified a similar spectrum of mutational events, with large indels accounting for the largest fraction of bases that differ in each lineage (Fig. 3b). This suggests that the non-aligning regions in mammalian sequence primarily reflect the insertion of repetitive elements and the deletion of ancestral sequences. Given this assumption, the alignment of the human greater CFTR region to the other mammalian sequences suggests that a minimum of 63% of this segment of the human genome was present in the last common ancestor of the mammals sampled here, some 94 million years ago24, and that the rodents represent the most derived mammalian sequence (minimum 39% ancestral; Fig. 1b). Finally, the pufferfish sequences show a distinct pattern of genomic changes (Fig. 3b). As compared with the mammals, single-nucleotide mismatches and small indels are more prominent than large indels. In part, this probably reflects the paucity of transposon insertions in pufferfish and the deleterious consequences of large indels in a more compact genome.

In summary, our approach for multi-species sequence generation and analysis is providing a previously unavailable glimpse through the window of vertebrate evolution. Our findings confirm that the genomes of rodents are changing faster than those of primates, carnivores and artiodactyls, in agreement with many reports6, 25 but in contrast to studies defending the molecular clock26. In addition, we have found a substantial number of sequence elements conserved across multiple species, most of which are located in non-coding regions. Although the general mutational spectrum is similar among all vertebrate lineages, differences in the relative contribution of the various types of molecular events have sculpted the genome of each species in a unique fashion. It should be noted that the findings reported here pertain to a single genomic region, whereas the vertebrate genome is known to show highly non-uniform patterns of sequence conservation6, 20. Our efforts to sequence many other genomic targets in multiple species (see the NISC Comparative Sequencing Program website: http://www.nisc.nih.gov) should provide a broader and more informed view of such regional variation. Together, our studies point to myriad avenues for investigating the evolutionary and functional features of the vertebrate genome.


Methods

Sequence generation and analysis

Targeted genomic regions of interest were isolated in overlapping bacterial artificial chromosome (BAC) clones27 by using libraries derived from different vertebrate species. We used 'universal' hybridization probes, designed from small stretches of sequence conserved between human and mouse28, to isolate BACs from multiple mammals in parallel. To isolate BACs from non-mammalian vertebrates, we mostly used species-specific probes (typically designed from available gene sequences). For each species, a minimally overlapping tiling path of BACs spanning the genomic target was selected, and individual BACs were sequenced by a conventional shotgun sequencing strategy (see Supplementary Information).

To facilitate computational analyses, we compiled a single, non-redundant nucleotide sequence for each species by merging together the data generated for all sequenced BACs. This was then annotated to indicate the locations of known genes and coding regions on the basis of matches to human and/or mouse reference cDNA sequences, exons predicted by Genscan29, and repetitive elements using RepeatMasker. Annotated sequence records are available at the NISC Comparative Sequencing Program website (http://www.nisc.nih.gov/data). A more detailed description of this assembly and annotation process is given in the Supplementary Information.

The assembled and annotated sequences were then subjected to a series of analyses, starting with the generation of multi-sequence alignments and including the studies described above and in Figs 1–3 (such as establishing orthology, examining the general patterns of sequence conservation, performing phylogenetic analyses, detecting highly conserved sequences and investigating genome dynamics). Details of these analyses are given in the Supplementary Information.

Visualization and dissemination of results

The establishment of robust and user-friendly computational-based approaches for capturing, visualizing and disseminating multi-species sequences and the data emanating from their comparative analyses represents an increasingly important challenge. By using our data as a model, we developed a viable solution to this problem by incorporating our data into the University of California Santa Cruz Genome Browser30. The resulting web-based resource (http://genome.ucsc.edu) serves as an additional electronic supplement to this paper; a low-resolution overview of this website is given in the Supplementary Information.

Integrating our data into the UCSC Genome Browser allows it to be visualized within the context of the growing body of information about the human and other genome sequences represented on this browser. We have configured the browser to display custom tracks showing the results of various analyses, including those involving multiple sequences (see Supplementary Information). The dynamic nature of the browser allows convenient navigation from a detected region of conservation to the underlying sequence alignment, as well as the ability to examine the results of comparative analyses using different species' sequence as the reference.

Availability of data and analysis tools

Details about the data underlying the studies reported here (including updated BAC contig maps, information about each sequenced clone, compiled and annotated sequence files for each species and links to the various electronic resources) are available on the NISC Comparative Sequencing Program website (http://www.nisc.nih.gov/data). The blastz program and MultiPipMaker network services can be obtained on the Penn State Bioinformatics Group website (http://bio.cse.psu.edu).


References

  1. Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B. & Lander, E. S. Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res. 10, 950–958 (2000)
  2. Roest Crollius, H. et al. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nature Genet. 25, 235–238 (2000)
  3. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001)
  4. Chen, R., Bouck, J. B., Weinstock, G. M. & Gibbs, R. A. Comparing vertebrate whole-genome shotgun reads to the human genome. Genome Res. 11, 1807–1816 (2001)
  5. Aparicio, S. et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301–1310 (2002)
  6. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002)
  7. Dubchak, I. et al. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 10, 1304–1306 (2000)
  8. Gottgens, B. et al. Analysis of vertebrate SCL loci identifies conserved enhancers. Nature Biotechnol. 18, 181–186 (2000)
  9. Hardison, R. C. Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 16, 369–372 (2000)
  10. Pennacchio, L. A. & Rubin, E. M. Genomic strategies to identify mammalian regulatory sequences. Nature Rev. Genet. 2, 100–109 (2001)
  11. Rommens, J. M. et al. Identification of the cystic fibrosis gene: chromosome walking and jumping. Science 245, 1059–1065 (1989)
  12. Felsenfeld, A., Peterson, J., Schloss, J. & Guyer, M. Assessing the quality of the DNA sequence from The Human Genome Project. Genome Res. 9, 1–4 (1999)
  13. Schwartz, S. et al. Human–mouse alignments with BLASTZ. Genome Res. 13, 103–107 (2003)
  14. Schwartz, S. et al. MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res. 31, 3518–3524 (2003)
  15. Murphy, W. J. et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294, 2348–2351 (2001)
  16. Poux, C., Van Rheede, T., Madsen, O. & de Jong, W. W. Sequence gaps join mice and men: phylogenetic evidence from deletions in two proteins. Mol. Biol. Evol. 19, 2035–2037 (2002)
  17. Huelsenbeck, J. P., Larget, B. & Swofford, D. A compound Poisson process for relaxing the molecular clock. Genetics 154, 1879–1892 (2000)
  18. Cooper, G. M. et al. Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res. 13, 813–820 (2003)
  19. Siepel, A. & Haussler, D. Proc. 7th Annual Int. Conf. Research in Computational Molecular Biology (ACM, New York, 2003)
  20. Hardison, R. C. et al. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13, 13–26 (2003)
  21. Green, P. et al. Transcription-associated mutational asymmetry in mammalian evolution. Nature Genet. 33, 514–517 (2003)
  22. Frazer, K. A. et al. Genomic DNA insertions and deletions occur frequently between humans and nonhuman primates. Genome Res. 13, 341–346 (2003)
  23. Britten, R. J. Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proc. Natl Acad. Sci. USA 99, 13633–13635 (2002)
  24. Springer, M. S., Murphy, W. J., Eizirik, E. & O'Brien, S. J. Placental mammal diversification and the Cretaceous/Tertiary boundary. Proc. Natl Acad. Sci. USA 100, 1056–1061 (2003)
  25. Li, W. H., Ellsworth, D. L., Krushkal, J., Chang, B. H. & Hewett-Emmett, D. Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis. Mol. Phylogenet. Evol. 5, 182–187 (1996)
  26. Kumar, S. & Subramanian, S. Mutation rates in mammalian genomes. Proc. Natl Acad. Sci. USA 99, 803–808 (2002)
  27. Shizuya, H. et al. Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl Acad. Sci. USA 89, 8794–8797 (1992)
  28. Thomas, J. W. et al. Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Res. 12, 1277–1285 (2002)
  29. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997)
  30. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002)

Supplementary Information

Supplementary information accompanies this paper.


Acknowledgements

We thank J. Weissenbach and H. Roest Crollius for Tetraodon BACs; M. Diekhans for computational expertise; N. Goldman and Z. Yang for advice on phylogenetic analyses; and F. Collins and J. Mullikin for critically reading the manuscript. We acknowledge the support of the National Human Genome Research Institute (National Institutes of Health) and the Howard Hughes Medical Institute.


Competing interests statement

The authors declare no competing financial interests.
