Review
Nature Genetics 33, 219–227 (2003)
doi:10.1038/ng1114

Massive parallelism, randomness and genomic advances

J. Craig Venter, Samuel Levy, Tim Stockwell, Karin Remington & Aaron Halpern

The Center for the Advancement of Genomics, 1901 Research Blvd., Rockville, Maryland 20850, USA.

Correspondence should be addressed to J. Craig Venter jcventer@tcag.org
In reviewing the past decade, it is clear that genomics was, and still is, driven by innovative technologies, perhaps more so than any other scientific area in recent memory. From the outset, computing, mathematics and new automated laboratory techniques have been key components in allowing the field to move forward rapidly. We highlight some key innovations that have come together to nurture the explosive growth that makes a new era of genomics a reality. We also document how these new approaches have fueled further innovations and discoveries.
In 1987 Victor McKusick and Frank Ruddle launched a journal titled Genomics. This was the first widespread use of the term, which was actually coined by T.H. Roderick of the Jackson Laboratory1. 'Genomics' was used to describe a field of science differing from genetics in its focus on the study of DNA from a broader standpoint, that of the entire complement of genetic material. Genomics would primarily consider the full haploid set of chromosomes in an organism rather than studying a single gene or a family of functionally or structurally related genes. As is now clear, genomics has built on many of the important discoveries in genetics, beginning with the elucidation of the double helix structure of DNA by Watson and Crick in the 1950s, the discovery of reverse transcriptase in the late 1960s, the discovery of recombinant DNA and restriction enzymes in the 1970s and the discovery of polymerase chain reaction (PCR) in the early 1980s. The method developed in the 1970s by Sanger et al.2 to sequence longer stretches of DNA, dideoxynucleotide sequencing, now known simply as Sanger sequencing, was a guiding force in the development of genomics. This technique was revolutionary at the time, and in fact, was the only method used to determine the base sequence of DNA for many years. It was the combination of these breakthroughs that allowed researchers to even contemplate sequencing a large eukaryotic genome, such as that of humans.

Though the field of genetics was dominated by laborious hunts for single disease genes during the 1980s and early 1990s, this was also the era when active discussions and planning were underway for development of the large, multi-national, publicly funded Human Genome Project (HGP). It soon became clear that the research community would need an additional outlet to publish what was sure to be a myriad of new discoveries from the rapid advances in genetics and the burgeoning field of genomics. In response to this need, a new journal was launched in 1992—Nature Genetics. The inaugural volume of the new journal contained, among other contributions, two articles describing the results of the first human genome test sequencing projects using the Applied Biosystems 373 automated DNA sequencer3, 4. These papers detailed the work of a team at the National Institute of Neurological Disorders and Stroke (NINDS) that involved sequencing three cosmids each from human chromosome 19 and chromosome 4. The cosmids, from regions thought to be associated with myotonic dystrophy and Huntington disease, were deliberately selected with the help of collaborators who had long been looking for the genes associated with these diseases. These two projects were representative of the state of the art in the community at the time, which was focused on a strategy in which genomic sequencing commenced only after cosmid mapping had been done. When these projects began in 1989, the fledgling discipline of genomics was clearly more an abstraction than a practical endeavor.

The HGP settled on a cosmid mapping strategy because the pervading assumption of the time was that clones larger than cosmids (35 kb) were not readily sequenceable because of the limitations of shotgun sequencing. In the late 1980s there was considerable debate and skepticism concerning the use of automated DNA sequencers. Much of this discussion was due to the belief that single machines were expected to sequence the genome. For example, early attention centered on claims from Japan about a million-base-pair DNA sequencing machine built by Wada and his team5, 6. This team was confident that this machine would give Japan the ability to sequence the human genome, but ultimately the program was unsuccessful. Sequencers at the time were very limited; in fact, the initial model of automated DNA sequencers in the United States could handle only 16 templates per day and produce roughly 300 bp per template.

The papers published in the inaugural issue of Nature Genetics were not revolutionary, but they did represent an important first step in the application of automated DNA sequence analysis to unknown areas of the human genome. This was a critical turning point for subsequent genome sequencing and required de novo analysis of raw human genomic DNA sequence where the gene content was unknown.

For the next few years, genomics progressed linearly. The plan for the public genome project by the United States National Human Genome Research Institute (NHGRI) and Department of Energy (DOE) was to sequence six genomes by 2005—those of Escherichia coli, Saccharomyces cerevisiae (yeast), Drosophila melanogaster, Caenorhabditis elegans, mouse and human7. By 2003, however, at least 99 genomes have already been sequenced and published (Fig. 1), including the original six in the plan, and hundreds of others are in progress. Not surprisingly, this exponential growth in sequenced genomes contributed to the exponential growth in GenBank data (Fig. 2). The first four genomes sequenced and published were not part of the original NHGRI/DOE plan. Instead, they were completed by independent, not-for-profit institutes in the United States and Japan. The first genome completed by the publicly funded sequencing consortium was that of yeast8. The E. coli genome, the first to receive public funding, was not completed until 1997 (ref. 9).

Figure 1. Genomes of living organisms sequenced between 1995 and 2002.

Figure 2. The growth of GenBank since 1992.

Cumulative sequences (in millions) are shown as filled diamonds. Cumulative base pairs of DNA (in billions) are shown as filled area. (See URLs.)



In this review, we assess the changes that have caused the dramatic acceleration in sequencing genomes from the original plan of just six completed genomes by 2005. We highlight some key advances that have allowed researchers in both the private and public sectors of genomics to make remarkable progress in sequencing and analyzing the genomes of more organisms. Two dominant themes have driven such innovations in genomics: first, the advent and adoption of massively parallel systems, both in sequencing and computing; and second, the subtle, yet equally important, philosophical change in the way genomic projects have been conceived and organized. Early genomic projects were large and distributed and proceeded in a linear, methodical fashion, whereas current projects are based on smaller, multi-dimensional teams and are organized in a quality-controlled environment to take advantage of the flexibility of random sampling. In this review, we follow these themes from the development of expressed-sequence tags (ESTs) and bacterial artificial chromosomes (BACs), which were key breakthroughs in gene discovery efforts and for mapping and sequencing in clone-by-clone methods, through to whole-genome shotgun (WGS) sequencing.

Expressed-sequence tags
Throughout the 1980s, the process of discovering genes and producing DNA sequences was extremely labor-intensive and time-consuming. But the discovery in the early 1990s of a new way to detect genes radically changed the pace and scope of gene discovery. A paper published in 1991 (ref. 10) established the usefulness of expressed-sequence tags (ESTs) with the discovery of several hundred new genes. Though the use of ESTs has now become commonplace, before 1991 as the idea was being developed, the usefulness of an approach based on random, partial cDNAs was far from clear to many in the scientific community. Early in the planning phase of the HGP, Sydney Brenner and Paul Berg, as well as several other internationally recognized researchers, made strong arguments to include a large cDNA effort in the initial stage of the HGP. But many involved in this area of science maintained that mRNA expression would provide only a small number of highly abundant transcripts and would leave most human genes undetected, and some speculated that as little as 8−9% of genes could be uncovered using the EST method11. It was an innovative idea using random sampling—motivated by attempts to annotate de novo human genome sequence—that ultimately proved this conventional wisdom wrong.

After generating the sequence data from chromosomes 19 and 4 in 1989, the NINDS team discovered that interpreting and annotating the sequences required more time than obtaining the sequence. De novo (meaning sequence-based rather than evidence-based) gene-prediction software of the era that applied a neural-network technique12 accurately predicted only 4 of 10 exons in the known gene ERCC1, and false predictions outnumbered correct exon hits10. Six-frame translations from all predicted exons were screened against GenBank data, but owing to the dearth of available data at the time, few matches were found, with the exception of exact matches to known cDNA clones. Therefore, to verify exons, PCR primers were synthesized and used for PCR amplification from human brain and placental cDNA libraries3. Amplified PCR products were then sequenced to verify the correct annotation and to provide evidence for transcription of the genes. It became clear that without sequenced cDNA clones from human libraries, the annotation of the human genome would be a slow, laborious, error-prone and very expensive task.

The key idea behind the EST publication was this: rather than sequencing 1,000 fragments from a cosmid clone, which would yield at most one human gene but could be entirely fruitless, why not do the same 1,000 sequencing reactions but target clones randomly picked from a cDNA library instead? With the unique automated DNA sequencing and bioinformatics capabilities of the NINDS lab in 1990, this idea could readily be tested. Mark Adams was recruited to do the experiment, which focused on neurotransmission by using a human brain cDNA library. When the results came off the sequencers and were checked against the database, it was clear that a winning idea had been born. Before publishing their results, the researchers concentrated on optimization of the methods and computational analysis of the data. They tested subtractive cDNA libraries to reduce redundancy and mapped a considerable number of the newly discovered genes back to the human genome to demonstrate the bi-directional nature of the method. The new method was named expressed-sequence tags, derived from the term sequence-tagged site13, as the group had successfully mapped sequences, derived from expressed genes, in the genome. From the first 1,000 cDNA clones sequenced, 373 new human genes expressed in the brain were discovered. Though there had been previous small-scale attempts by other researchers to apply random cDNA sequencing, one early study on partial cDNA sequencing from skeletal muscle seemed to confirm early notions that only a few over-expressed transcripts would be found by cDNA sequencing14.

Although the results of the EST experiment initially met with mixed reviews, successful early applications, such as the discovery of new DNA mismatch-repair enzymes linked to colon cancer15, 16 and substantial gene discovery in the plant field with Arabidopsis thaliana17, helped to earn its ultimate acceptance. Figure 3 shows the growth of EST sequencing, with more than 70 organisms now represented by more than 10,000 publicly available EST entries. From the initial deposit of 9,388 EST sequences to GenBank in 1992, today's collection contains more than 15 million sequences, comprising over 7.6 billion base pairs across 484 organisms, making the EST method the principal method of gene discovery. Citation of Adams et al.10 has increased throughout the last decade, and Figure 4 shows the increase in the number of publications that report use of the method. It is clear that many of today's most exciting genomic technologies owe at least a nod to the EST method (Fig. 5). One of the conclusions of Adams et al.10 was that ESTs had potential to form the primary basis of genome annotation. This has clearly been the case for annotating the human genome18, 19, and ESTs contributed substantially to ordering and assembling the HGP human genome as well19.

Figure 3. The growth of EST sequencing since 1992.

Stacked columns indicate the cumulative number of organisms with over 10,000 ESTs in GenBank. Animals shown in white, plants in green, fungi in red, alveolate in blue, euglenozoa in purple, mycetozoa (slime mold) in yellow and red algae in gray.



Figure 4. The number of publications listed in PubMed that contain the expressions 'EST' (shown as red squares; ref. 10), 'BAC' (green diamonds; ref. 93) and 'YAC' (blue triangles; ref. 94).

Data shown are from the year in which the seminal paper describing each technique was published (for 'EST' and 'BAC') or from 1992 (for YAC). Each data point was 'normalized' for spurious matches by subtracting the number of occurrences of the search term for the year before the original publication. Data for 2002 are partial-year data.



Figure 5. Citation tree for the original human EST paper by Adams et al.10.

The 930 articles that have cited this publication were ranked according to the number of citations each received. The tree was then constructed from the concepts elaborated by the publications that have been cited more than 150 times. The citation frequency of each publication is shown in red.



In this review, we use citation trees to elucidate 'idea trails' and thus identify significant research milestones that were reached after the publication of certain seminal articles. This approach is one indicator of how a particular article may have contributed to the subsequent development of new ideas in the scientific community. Figure 5 shows one such citation tree for the original paper by Adams et al. on ESTs10. This tree was generated by ranking all publications that cited the original article according to the number of times they themselves have been cited in subsequent biological literature. The targeted paper was cited 930 times between 1991 and 2002, and 49 citing publications have each been cited more than 150 times.

The tree is organized according to the major concepts that are addressed in these publications. It highlights the development of three distinct methods to detect levels of gene expression: the serial analysis of gene expression (SAGE) sequence tag method20, high-density oligonucleotide microarrays21, 22 and cDNA microarrays23. These methods have been used extensively since their development because they can, in a highly parallel fashion, detect the expression of many genes. Coupled with an ongoing effort to sequence ESTs from different species (Fig. 5; refs. 24, 25, 26, 27, 28, 29, 30), these methods have enhanced our understanding of tissue-specific gene expression. The EST approach, in concert with gene maps31, has more accurately estimated gene numbers in humans, C. elegans and A. thaliana. The growth of sequence databases has improved the efficiency and sophistication of computational search methods32. Ultimately, the combination of greater amounts of sequencing information and computational analysis has contributed significantly to experiments aimed at determining the functions of individual genes and gene sets (Fig. 5; refs. 15,16,33, 34, 35).

Whole-genome shotgun sequencing
With the formation in 1992 of The Institute for Genomic Research (TIGR), a not-for-profit research institute, EST sequencing grew substantially from the original effort at NIH. TIGR's initial project was called the Human Gene Anatomy Project and was aimed at sequencing cDNA clones from libraries made from every major human tissue and organ. By establishing a high-throughput DNA factory to effectively generate and analyze the hundreds of thousands of EST sequences, this team had unknowingly created a new paradigm for genomic facilities that would eventually be the model for many large-scale DNA sequencing projects, including the human genome effort at Celera Genomics. A key component of this model was a new mathematical algorithm created by Granger Sutton. Sutton's algorithm (packaged in the TIGR Assembler36) was designed to assemble ESTs into clusters to reduce the redundancy of sequences, enable assembly of full-length cDNA clones and provide better estimates for the total number of human genes. These innovations allowed TIGR to publish the first comprehensive catalog of human genes and chromosome maps29. Realizing that this algorithm was a powerful tool, the TIGR team began to seek ways to use it more broadly in their research. Thus, the idea for whole-genome shotgun (WGS) sequencing was born, and it debuted with the publication of the genome of Haemophilus influenzae37.

As was the case with ESTs, the WGS innovation relied on a random sampling approach. Its widespread use for large projects is due to its suitability for massively parallel data collection and processing. Although shotgun sequencing of small segments of sequence had been used since the 1970s, this was the first time that shotgun sequencing had been used on the entire genome of a free-living organism. The group selected the genome of H. influenzae as the target both because its GC content is similar to that of the human genome and because they wanted to examine an organism about which virtually nothing was known at the genomic level. By the spring of 1995, the genome was complete, all gaps closed and annotated, with publication soon after37.

This paper was catalytic and enabled innovations in several other areas. As shown in Figure 6, the paper has been cited 2,294 times since its publication in 1995, and 53 of these citing publications themselves have been cited more than 150 times each. Among the major developments based on the whole-genome sequencing of H. influenzae was the sequencing of other prokaryotic species. This increase in sequenced genomes stimulated comparative sequence analysis, which in turn significantly aided in the functional annotation of genes. This was done primarily in the prokaryotes owing to the wealth of genome data available38, 39, 40, 41 but later included D. melanogaster, C. elegans and yeast42. More recently these comparative methods have been applied to the human and mouse genomes43, 44 and provide a means of characterizing protein function by virtue of identifying orthologs across several genomes45. The complete genome sequences of model organisms such as yeast allow the selective and sequential 'knock-out' of individual genes as a means of establishing a molecular basis for phenotype46. Such studies are clearly aided by high-throughput technologies, such as cDNA microarrays to monitor yeast gene expression47, as well as the promise of highly parallel protein analysis48, 49. Furthermore, the use of sequenced genomes as a means to better understand the molecular biology of organisms is evident in the H. influenzae citation tree; for example, numerous subsequent studies concentrated on host−pathogen interactions with various bacterial species serving as the pathogen50, 51, 52, 53.

Figure 6. Citation tree for the WGS sequencing of H. influenzae37, produced in the same manner as the tree in Fig. 5. W, WGS sequencing methodology; C, clone-by-clone sequencing methodology; R, review article.

Despite the advances leading to the H. influenzae publication and the ever-increasing numbers of completed microbial genomes, by 1998 there was some concern in the scientific community that the one large-scale, publicly funded sequencing project—the HGP—had reached a saturation point. Total finished bases amounted to just 3% of the genome, and scientists at major sequencing centers admitted concern and uncertainty about how they would achieve their goals54. Major weaknesses in the sequencing technology of the time, including lane tracking problems, inadequate sequencer throughput and high sequencing error rates, seemed daunting. Though the H. influenzae paper had concluded with the prediction that WGS sequencing could be useful in the human genome sequencing effort, the feasibility of this idea remained hotly debated55, 56, even under the assumption that sequencing technology would ultimately improve.

Technology changes that advanced genomics
In addition to some incremental changes in technology, including increased automation and miniaturization57, 58, improved computational resources and various browsers and search tools for scientists to better access the genomes59, there are a few developments that warrant special acclaim for having enabled the sequencing of large eukaryotic genomes, whether by the clone-by-clone or the WGS approach.

Capillary DNA sequencers. With greater capacity of lanes per run and runs per day and smaller sample volumes, capillary DNA sequencers (most notably the ABI 3700 sequencing machines) vastly increased the throughput that could be achieved for a fixed cost. These machines eliminated the need to pour gels by hand, thus reducing the amount of labor; permitted the generation of 400 kb of sequence in a day on a single machine with as little as 15 minutes for maintenance and sample loading; permitted unprecedented read lengths and improved sequencing accuracy; and, through a vast improvement in lane tracking, enabled approaches to sequence assembly that otherwise would have been unthinkable.

New clones for genomics. Although heroic efforts resulted in maps dense enough for cosmid tiling in some regions of the human genome60, construction of maps suitable for a directed sequencing effort proved elusive. Indeed, Saccharomyces cerevisiae may be the only organism whose genome was sequenced by first producing a map and then sequencing. The first genome project, for E. coli, was initially based on sequencing lambda clones that had been mapped in a three-year effort. Slow progress essentially forced an abandonment of the effort to sequence sequential lambda clones9. Yeast artificial chromosomes (YACs), which for some time were considered the successor to the cosmid, today have been almost completely abandoned. Because significant efforts were diverted toward the use of YACs, they may have delayed progress in genomic sequencing for several years. It was the use of bacterial artificial chromosomes (BACs) that enabled progress both in mapping and sequencing in a clone-by-clone fashion. The decline of the YAC and rise of the BAC is shown in Figure 4.

The acceptance of BACs into the genome community led to a new round of map construction, assigning BACs to locations on existing maps. Additionally, the introduction of BAC-end sequencing61 made possible a clone-by-clone approach that did not require sequencing to wait for map construction yet maintained the efficiency of a directed approach. The draft version of the HGP human genome was not obtained by carefully choosing which clone to work on next, although sequencing was conducted mostly on a clone-by-clone basis. This is arguably because the capacity of the sequencers was so great relative to the rate of mapping that it was not practical to leave the sequencers idle while the next set of clones was determined and samples prepared.

Base-calling with quality values. In the late 1990s significant progress had been made in high-throughput, automated DNA sequencing, as noted above, with the introduction of capillary-based sequencing machines. Though these machines certainly improved the ability to process and assess base pairs through software that was packaged with the machine, it was clear that additional software programs would be useful in determining accuracy of the bases of DNA. Most notable of these efforts were the Phred base-calling program62, 63 and TraceTuner. These two chromatogram base-calling programs included an estimated quality value for each individual base in the raw sequence. These quality values allowed users to 'quality trim' individual DNA sequence reads64 and to determine how well fragments overlap and select the most likely consensus sequence during the process of assembly with software tools such as phrap and CAP4.
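To make the idea of per-base quality values concrete, the following is a minimal sketch in Python. The Phred-style scale relates a quality value Q to an estimated error probability P by Q = -10 log10(P). The trimming routine is an illustrative assumption, not the method of ref. 64 or of any particular package: it keeps the contiguous span of bases whose summed margin over a hypothetical cutoff (`min_quality`) is largest.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Phred-scaled quality value: Q = -10 * log10(P_error)."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Inverse of phred_quality: P_error = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


def quality_trim(qualities, min_quality=20):
    """Return the half-open (start, end) span of the read to keep.

    Each base scores (quality - min_quality); the span with the largest
    total score is kept (a maximum-sum contiguous segment). The cutoff
    of 20 is an illustrative assumption, not a value from the article.
    """
    best_sum, best_span = 0, (0, 0)
    current_sum, current_start = 0, 0
    for i, q in enumerate(qualities):
        if current_sum <= 0:
            # A non-positive running total cannot help; restart here.
            current_sum, current_start = 0, i
        current_sum += q - min_quality
        if current_sum > best_sum:
            best_sum, best_span = current_sum, (current_start, i + 1)
    return best_span


# Toy read: poor-quality ends surrounding a high-quality core.
toy_qualities = [5, 8, 30, 35, 40, 32, 9, 6]
kept_start, kept_end = quality_trim(toy_qualities)
```

On this scale a quality of 20 corresponds to an estimated 1-in-100 chance that the base call is wrong, and a quality of 30 to 1 in 1,000, which is why an assembler can weight overlapping bases by their quality when choosing a consensus.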

Advances in computing. The SPARCcenter 2000 that was used to assemble the genome of H. influenzae in 1995 was the last model of a generation of computers that was limited by an architecture capable of addressing only 2 gigabytes of random access memory (RAM). Revolutionary biology would demand revolutionary computing: 64-bit hardware and operating systems that could use 32 GB of RAM and beyond. In 1992 Digital Equipment Corporation introduced the first 64-bit processor, the Alpha, together with a 64-bit operating system. By 1998, the Alpha microprocessor was already nearly 100 times more powerful than the processor used in the 1995 SPARCcenter. Processors offered today by Intel, Hewlett Packard and IBM are almost 200 times more powerful. But processors and operating systems are only part of the story; the cost of storage has dropped dramatically. In 1992 one terabyte of disk space cost $1,000,000; by 2002 the cost had dwindled to nearly $10,000. It is possible that computational biology would have stalled in the early 1990s had disk storage not dropped dramatically in cost while significantly increasing in performance. In 1992, 10 megabits per second was the state of the art in networks. Today, even some laptops use gigabit network interfaces that are a hundred times faster. The very nature of large-scale computing has changed from systems relying on one or a few powerful custom-designed processors to scalable parallel systems or farms of computers. Complex problems are now broken down into a set of smaller jobs that are run concurrently, an approach that clearly enabled genome assembly.
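The idea of breaking one large problem into smaller jobs that run concurrently can be sketched as follows. This is a toy illustration only, not the software of any sequencing center: the function names, the exact-match overlap test, the `min_overlap` cutoff and the number of jobs are all assumptions made for the example. Threads are used merely to keep the sketch self-contained; a production pipeline would more likely spread the batches across processes or machines in a compute farm.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def longest_suffix_prefix_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlaps_for_batch(fragments, index_pairs, min_overlap):
    """Compare every fragment pair in one batch, in both orders."""
    found = []
    for i, j in index_pairs:
        for x, y in ((i, j), (j, i)):
            length = longest_suffix_prefix_overlap(fragments[x], fragments[y])
            if length >= min_overlap:
                found.append((x, y, length))
    return found


def all_overlaps(fragments, min_overlap=3, n_jobs=4):
    """Split the all-pairs comparison into independent batches and run them concurrently."""
    pairs = list(combinations(range(len(fragments)), 2))
    # Round-robin split: each batch is an independent job.
    batches = [pairs[k::n_jobs] for k in range(n_jobs)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = pool.map(
            lambda batch: overlaps_for_batch(fragments, batch, min_overlap),
            batches,
        )
        return sorted(hit for batch_hits in results for hit in batch_hits)


# Three toy fragments that chain together end to end.
toy_fragments = ["ACGTAC", "TACGGA", "GGATTT"]
toy_hits = all_overlaps(toy_fragments)
```

Because no batch depends on the result of another, the work scales out simply by adding more workers, which is the property that makes farms of commodity computers suitable for this kind of task.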

Genome assembly. In 1989, the GEL sequencing program was limited to 1,000 reads and required extensive human curation of the results3, 4. By 1995, Sutton's TIGR Assembler had proved its ability to assemble microbial genomes. The program ran for some 30 hours on a SPARCcenter 2000, using slightly under 64 MB of RAM (G.G. Sutton, pers. comm.). Total coding time, including code originally written for assembling ESTs, was on the order of one programmer-year. The bulk of the compute time was spent doing fragment overlaps, a process that is naively quadratic. The roughly 27 million fragments making up the human genome assembly were nearly 1,000 times as many as the 28,000 fragments that made up the H. influenzae assembly and required nearly 1,000,000 times as many fragment pairs to be compared. Moreover, repetitive elements longer than a single read often required manual assembly and finishing. Clearly, improvements in computer hardware alone would not be sufficient to allow WGS assembly of large genomes.
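The quadratic scaling can be checked with a few lines of arithmetic. The sketch below assumes the naive approach of comparing every fragment with every other fragment, so that n fragments give n(n - 1)/2 pairs; the fragment counts are the rounded figures quoted above, and the function and variable names are illustrative.

```python
def naive_pair_count(n_fragments: int) -> int:
    """Number of unordered fragment pairs in an all-against-all comparison."""
    return n_fragments * (n_fragments - 1) // 2


# Rounded fragment counts quoted in the text.
H_INFLUENZAE_FRAGMENTS = 28_000
HUMAN_FRAGMENTS = 27_000_000

# Ratio of fragment counts (a little under 1,000).
fragment_ratio = HUMAN_FRAGMENTS / H_INFLUENZAE_FRAGMENTS

# Ratio of pairwise comparisons (a little under 1,000,000).
pair_ratio = naive_pair_count(HUMAN_FRAGMENTS) / naive_pair_count(H_INFLUENZAE_FRAGMENTS)

print(f"fragments: {fragment_ratio:,.0f}x, pairs: {pair_ratio:,.0f}x")
```

A thousandfold increase in input therefore implies roughly a millionfold increase in naive comparison work, which is why algorithmic changes, and not faster hardware alone, were needed.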

In the same time frame, Eugene Myers proposed a novel formulation of the fragment assembly problem65 that involved breaking the problem down into the determination of unique regions (unitigs) followed by resolution of repetitive elements by use of mate pairs. This idea would later become the foundation for the WGS assembler developed at Celera Genomics66. Weber and Myers55 suggested that sequencing all fragments as paired ends would significantly enhance the capabilities of such an assembler. In 1999, Myers and Sutton, now leading the assembly effort at Celera, provided a proof-of-concept for the new assembly algorithm with their team's assembly of the D. melanogaster genome67. Ultimately, WGS assemblies of the human, mouse43 and mosquito68 genomes were obtained with an enhanced version of this prototype. The human genome was assembled in approximately 20,000 CPU hours on Compaq Alpha chips running at 500−667 MHz and using a peak of slightly under 32 GB of RAM and approximately 0.5 terabytes of storage18.
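The role of mate pairs in linking assembled pieces can be illustrated with a deliberately simplified sketch. It is not the Celera assembler: real assemblers also use read orientation and the expected distance between mates, and the `min_links` threshold, the data layout and all names here are assumptions for illustration. The sketch counts how many mate pairs connect each pair of contigs and keeps only connections supported by at least `min_links` independent pairs.

```python
from collections import Counter


def bundle_mate_links(read_to_contig, mate_pairs, min_links=2):
    """Count mate pairs joining distinct contigs; keep well-supported links.

    read_to_contig: dict mapping read name -> contig name (hypothetical layout).
    mate_pairs: iterable of (read_a, read_b) sequenced from the two ends
        of the same clone insert.
    min_links: illustrative cutoff; a single link could be a chimeric clone.
    """
    links = Counter()
    for read_a, read_b in mate_pairs:
        contig_a = read_to_contig.get(read_a)
        contig_b = read_to_contig.get(read_b)
        # Skip unplaced reads and pairs that fall inside one contig.
        if contig_a is None or contig_b is None or contig_a == contig_b:
            continue
        links[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: count for pair, count in links.items() if count >= min_links}


# Toy placement of eight reads on three contigs.
toy_placement = {
    "r1": "A", "r2": "B", "r3": "A", "r4": "B",
    "r5": "A", "r6": "C", "r7": "A", "r8": "A",
}
toy_mates = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")]
toy_links = bundle_mate_links(toy_placement, toy_mates)
```

In the toy data, contigs A and B are joined by two mate pairs and so would be placed in the same scaffold, whereas the single pair between A and C is treated as insufficient evidence.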

In the last two years, several alternate assemblers for large-scale sequencing projects have been described. In the WGS assembly context, these include ARACHNE69, JAZZ70 and Phusion71, each following the basic model of the Celera assembler66. Clone-by-clone projects have assembled reads from single clones into contigs, using programs such as phrap, followed by whole-genome scale ordering and orienting of contigs into scaffolds based on bridging information from a variety of data, as shown with the GigAssembler72. Other assembly programs have introduced important innovations such as 'error correction', which we believe was first described in connection with the CAP assembler73, and the introduction of the Eulerian formulation of the assembly problem74, though these assemblers remain to be tested in the context of a full-scale eukaryotic sequencing project. Advances in assembly methods will undoubtedly continue as WGS sequencing continues to be the method of choice for genome sequencing (Fig. 7) or until sequence read length eliminates the need for assembly algorithms.

Figure 7. The number of genomes sequenced each year since 1995.

a, The number of genomes sequenced using WGS is shown as blue triangles, and the number of genomes sequenced with other strategies is shown as yellow squares. b, The stacked columns indicate the cumulative sizes (in billions of base pairs) of genomes sequenced with WGS and other methods. H. sapiens (Celera Genomics) shown in white, Mus musculus (Celera Genomics) in green, M. musculus (Mouse Genome Sequencing Consortium) in red, Oryza sativa L. ssp. indica in blue, O. sativa L. ssp. japonica in orange, Fugu rubripes in black, Anopheles gambiae in yellow, D. melanogaster in brown, all other genomes sequenced with WGS in purple, H. sapiens (International Human Genome Sequencing Consortium) in gray and all other genomes sequenced with methods other than WGS (including those of A. thaliana, C. elegans, Plasmodium falciparum and Schizosaccharomyces pombe) in pink.



Conclusions
Clearly, the scientific community is benefiting greatly from numerous published genomes across a broad spectrum of organisms (Fig. 1) that could only have been completed with the aforementioned advances. Milestone genomes include the first fully sequenced living organism, H. influenzae; the first archaeon, Methanococcus jannaschii37; the first eukaryote, S. cerevisiae8; the first multi-cellular organism, C. elegans75; the first large eukaryote, D. melanogaster67; Homo sapiens18, 19; and the first plant, A. thaliana76. This wealth of information was hard to imagine just a few years ago. For perspective, in 1994 the largest sequenced genome was a virus (human cytomegalovirus, at roughly 230 kb; GenBank NC_001347; ref. 77), and that high-water mark just barely exceeded what had been reached nearly a decade earlier (Epstein−Barr virus, at roughly 170 kb; GenBank NC_001345; ref. 78).

Each completed genome has had a catalytic effect on its field (for example, Fig. 6). Citation evidence for E. coli (Fig. 8) suggests that studies undertaken since the completion of this genome have provided a better understanding of the mechanisms regulating cellular physiology, including DNA replication, biosynthetic pathways and metabolism79, 80, 81, 82, 83. The genome sequencing of this microorganism has most positively impacted the study of the 'minimal genome', the smallest set of genes that can support the necessary biological mechanisms of self-replication and fructification84, 85, 86. The E. coli citation tree also indicates that as the sequences of more genomes accumulate, molecular evolutionary methods will afford more accurate measures of mutation rates between populations87, 88, and that ultimately the medical consequences of genomics will be realized in the fields of basic and applied biomedical research89.

Figure 8. Citation tree for the WGS sequencing of E. coli9.

All publications that cited the original paper and have themselves been cited more than 100 times since their publication are shown.



In the next decade of genomics we expect new technologies to accelerate the field by orders of magnitude and to have an unimaginable, but clearly catalytic, impact on biology and medicine.

URLs. Information on TraceTuner is available at http://www.paracel.com/publications/tracetuner1_092100.pdf and on CAP4 at http://www.paracel.com/publications/cap4_092200.pdf. Documentation for phrap and cross_match is available at http://www.phrap.org/phrap.docs/phrap.html. Information on the growth of GenBank is available at http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.


REFERENCES
  1. Kuska, B. Beer, Bethesda, and biology: how "genomics" came into being. J. Natl. Cancer Inst. 90, 93 (1998).
  2. Sanger, F., Nicklen, S. & Coulson, A.R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74, 5463–5467 (1977).
  3. Martin-Gallardo, A. et al. Automated DNA sequencing and analysis of 106 kilobases from human chromosome 19q13.3. Nat. Genet. 1, 34–39 (1992).
  4. McCombie, W.R. et al. Expressed genes, Alu repeats and polymorphisms in cosmids sequenced from chromosome 4p16.3. Nat. Genet. 1, 348–353 (1992).
  5. Wada, A. The practicability of and necessity for developing a large-scale DNA-base sequencing system: toward the establishment of international super DNA-sequencing centers. Basic Life Sci. 46, 119–130 (1988).
  6. Wada, A. Fundamental significance of DNA mass-sequencing factory for biological sciences in future. Adv. Biophys. 30, 85–103 (1994).
  7. Understanding Our Genetic Inheritance: The Human Genome Project, The First Five Years, FY 1991–1995. NIH Report 90–1590 (1990).
  8. Goffeau, A. et al. Life with 6000 genes. Science 274, 563–567 (1996).
  9. Blattner, F.R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1474 (1997).
  10. Adams, M.D. et al. Complementary DNA sequencing: expressed-sequence tags and human genome project. Science 252, 1651–1656 (1991).
  11. Roberts, L. Gambling on a shortcut to genome sequencing. Science 252, 1618–1619 (1991).
  12. Uberbacher, E.C. & Mural, R.J. Locating protein-coding regions in human DNA sequences by a multiple sensor–neural network approach. Proc. Natl. Acad. Sci. USA 88, 11261–11265 (1991).
  13. Olson, M., Hood, L., Cantor, C. & Botstein, D. A common language for physical mapping of the human genome. Science 245, 1434–1435 (1989).
  14. Putney, S.D., Herlihy, W.C. & Schimmel, P. A new troponin T and cDNA clones for 13 different muscle proteins, found by shotgun sequencing. Nature 302, 718–721 (1983).
  15. Papadopoulos, N. et al. Mutation of a Mutl homolog in hereditary colon cancer. Science 263, 1625–1629 (1994).
  16. Nicolaides, N.C. et al. Mutations of 2 Pms homologs in hereditary nonpolyposis colon cancer. Nature 371, 75–80 (1994).
  17. Somerville, C. & Somerville, S. Plant functional genomics. Science 285, 380–383 (1999).
  18. Venter, J.C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
  19. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
  20. Velculescu, V.E., Zhang, L., Vogelstein, B. & Kinzler, K.W. Serial analysis of gene expression. Science 270, 484–487 (1995).
  21. Lockhart, D.J. et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675–1680 (1996).
  22. Wodicka, L., Dong, H.L., Mittmann, M., Ho, M.H. & Lockhart, D.J. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat. Biotechnol. 15, 1359–1367 (1997).
  23. Duggan, D.J., Bittner, M., Chen, Y.D., Meltzer, P. & Trent, J.M. Expression profiling using cDNA microarrays. Nat. Genet. 21, 10–14 (1999).
  24. Waterston, R. et al. A survey of expressed genes in Caenorhabditis elegans. Nat. Genet. 1, 114–123 (1992).
  25. Okubo, K. et al. Large-scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nat. Genet. 2, 173–179 (1992).
  26. Adams, M.D., Kerlavage, A.R., Fields, C. & Venter, J.C. 3,400 new expressed-sequence tags identify diversity of transcripts in human brain. Nat. Genet. 4, 256–267 (1993).
  27. Adams, M.D., Soares, M.B., Kerlavage, A.R., Fields, C. & Venter, J.C. Rapid cDNA sequencing (expressed-sequence tags) from a directionally cloned human infant brain cDNA library. Nat. Genet. 4, 373–386 (1993).
  28. Newman, T. et al. Genes galore—a summary of methods for accessing results from large-scale partial sequencing of anonymous Arabidopsis cDNA clones. Plant Physiol. 106, 1241–1255 (1994).
  29. Adams, M.D. et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature 377, 3–174 (1995).
  30. Hillier, L. et al. Generation and analysis of 280,000 human expressed-sequence tags. Genome Res. 6, 807–828 (1996).
  31. Schuler, G.D. et al. A gene map of the human genome. Science 274, 540–546 (1996).
  32. Altschul, S.F., Boguski, M.S., Gish, W. & Wootton, J.C. Issues in searching molecular sequence databases. Nat. Genet. 6, 119–129 (1994).
  33. Fernandes-Alnemri, T., Litwack, G. & Alnemri, E.S. Cpp32, a novel human apoptotic protein with homology to Caenorhabditis elegans cell-death protein Ced-3 and mammalian interleukin-1beta-converting enzyme. J. Biol. Chem. 269, 30761–30764 (1994).
  34. Simonet, W.S. et al. Osteoprotegerin: a novel secreted protein involved in the regulation of bone density. Cell 89, 309–319 (1997).
  35. Messersmith, E.K. et al. Semaphorin III can function as a selective chemorepellent to pattern sensory projections in the spinal cord. Neuron 14, 949–959 (1995).
  36. Sutton, G.G., White, O., Adams, M.D. & Kerlavage, A.R. TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci. Technol. 1, 9–19 (1995).
  37. Fleischmann, R.D. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512 (1995).
  38. Himmelreich, R. et al. Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 24, 4420–4449 (1996).
  39. Nelson, K.E. et al. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399, 323–329 (1999).
  40. Philipp, W.J. et al. An integrated map of the genome of the tubercle bacillus, Mycobacterium tuberculosis H37Rv, and comparison with Mycobacterium leprae. Proc. Natl. Acad. Sci. USA 93, 3132–3137 (1996).
  41. Brown, J.R. & Doolittle, W.F. Archaea and the prokaryote-to-eukaryote transition. Microbiol. Mol. Biol. Rev. 61, 456–502 (1997).
  42. Rubin, G.M. et al. Comparative genomics of the eukaryotes. Science 287, 2204–2215 (2000).
  43. Mural, R.J. et al. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296, 1661–1671 (2002).
  44. Waterston, R.H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
  45. Tatusov, R.L., Koonin, E.V. & Lipman, D.J. A genomic perspective on protein families. Science 278, 631–637 (1997).
  46. Dujon, B. The yeast genome project: what did we learn? Trends Genet. 12, 263–270 (1996).
  47. Lashkari, D.A. et al. Yeast microarrays for genome-wide parallel genetic and gene-expression analysis. Proc. Natl. Acad. Sci. USA 94, 13057–13062 (1997).
  48. Blackstock, W.P. & Weir, M.P. Proteomics: quantitative and physical mapping of cellular proteins. Trends Biotechnol. 17, 121–127 (1999).
  49. Yates, J.R. Mass spectrometry and the age of the proteome. J. Mass Spectrom. 33, 1–19 (1998).
  50. Govan, J.R.W. & Deretic, V. Microbial pathogenesis in cystic fibrosis: mucoid Pseudomonas aeruginosa and Burkholderia cepacia. Microbiol. Rev. 60, 539–574 (1996).
  51. Freiberg, C. et al. Molecular basis of symbiosis between Rhizobium and legumes. Nature 387, 394–401 (1997).