Filling in the gaps

Bork, Peer; Copley, Richard

doi:10.1038/35057274

News & Views
Published: 15 February 2001

The draft sequences

Peer Bork¹ & Richard Copley¹

Nature volume 409, pages 818–820 (2001)

Two rough drafts of the human genome sequence are now published. Completion of the sequences lies ahead, but the implications for studying human diseases and for biotechnology are already profound.

With the publication of the human genome sequence — described and analysed on page 860 of this issue¹ and in this week's Science² — we cross a border on the route to a better understanding of our biological selves. But unlike the previously published sequences of human chromosomes 21 and 22 (refs 3,4), the present sequences of the whole human genome are not considered complete. The bulk of the data make up what is called a 'rough draft'. So what is all the fuss about? What exactly does 'rough draft' mean, and what can we learn from sequences such as this?

In the draft from the publicly funded International Human Genome Sequencing Consortium¹, around 90% of the gene-rich — euchromatic — portion of the genome has been sequenced and 'assembled', the term used to describe the process of using a computer to join up bits of sequence into a larger whole. Each base pair of this 90% was sequenced four times on average, ensuring reasonable precision. Only about a quarter of the whole genome is considered 'finished' — another bit of genomics jargon, which basically means that each base pair has been sequenced eight to ten times on average, with gaps in the sequence existing only because of the limitations of present technology. Nonetheless, the sequence of base pairs in the draft is very accurate, and is unlikely to change much; 91% of the euchromatin sequenced has an error rate of less than one base in 10,000 (ref. 1).

For the other draft, that produced by Celera Genomics², a variety of methods suggest that between 88% and 93% of the euchromatin has been sequenced and assembled. But direct comparison of these numbers with the public consortium's draft is almost impossible — different procedures and measures were used to process the data and to estimate accuracy. Both projects also have sequence data that were not used in the assembly process, raising the real level of coverage by a few percentage points.

These numbers might seem rather arbitrary, but even when the first genome of an animal species was published⁵, it was clear that simple, practical finish lines do not exist (Box 1, Fig. 1). The present level of coverage of the human genome reflects the point where a shift of focus occurs, from sequencing the genome many times over to producing a high-quality, continuous sequence⁶. There is some way to go yet.

**Figure 1: Sequenced eukaryotic genomes.**

Essentially, 'rough draft' refers to the fact that the sequences are not continuous — there are gaps (Box 1). If there are too many gaps, it can be impossible to order and orientate the many small strings of bases that are the raw products of genome sequencing. This might, for example, hamper projects that seek to identify genes involved in inherited diseases. A first step to finding such genes is to work out which region of which chromosome they are on. The complete genome sequence should be immensely useful for the next step — identifying the relevant gene at that region. But gaps and errors in ordering and placing the strings of sequence will make this difficult.

Another problem of incompleteness is that it is difficult to make definitive statements about which genes are unique to other species and do not have relatives in the human genome. So it might be prudent not to place too much emphasis on such 'missing' genes at this stage. Even so, they are running out of places to hide, particularly because the level of coverage of the human genome is probably higher than reported here¹,²: there are other chunks of unassembled genome sequence in public databases, such as in independent collections of so-called expressed sequence tags.

But ensuring high quality and high coverage are only two aspects of producing a finished genome. For most biologists, the real interest is in the genes themselves. Here, the picture is less rosy, although the problems are caused not so much by the draft nature of the sequence as by the difficulty in finding genes among the other genomic DNA (Box 2).

Even coming up with a rough count of the number of genes is not straightforward. The public consortium's initial set contains about 32,000 genes, made up of around 15,000 known genes and 17,000 predictions. But these 32,000 genes are estimated to come from around 24,500 actual genes, because some predicted genes could be 'pseudogenes', or just fragments of real genes. On the other hand, the sensitivity of prediction tends to be only about 60%, so it is reasonable to assume that another 6,800 or so genes (40% of 17,000) have been overlooked. This is how the present estimate of about 31,000 genes (6,800 plus 24,500) was reached¹. Celera predicts that there are around 39,000 genes, but warns that the evidence for some 12,000 of these is weak². The two groups use different gene-identification techniques, so these numbers are not directly comparable. Minor changes in procedures or data could alter either figure considerably. For example, such changes led to a recent estimate being lowered⁷,⁸ from 120,000 to fewer than 81,000, and both now seem untenable. Much is a matter of interpretation.
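The arithmetic behind the consortium's estimate can be retraced in a few lines. The sketch below is an illustration only: the variable names are ours, the figures are the rounded ones quoted above, and the 40% shortfall is applied to the 17,000 predictions exactly as the text does.

```python
# Rounded figures quoted in the text (ref. 1); all names are illustrative.
known_genes = 15_000
predicted_genes = 17_000
initial_gene_set = known_genes + predicted_genes   # about 32,000 entries
actual_genes_in_set = 24_500   # after allowing for pseudogenes and fragments
sensitivity = 0.60             # approximate sensitivity of gene prediction

# The text takes the overlooked genes to be the 40% shortfall,
# applied to the number of predictions.
overlooked = round((1 - sensitivity) * predicted_genes)
estimate = actual_genes_in_set + overlooked

print(overlooked)  # 6800
print(estimate)    # 31300, quoted in the text as "about 31,000"
```

Changing `sensitivity` or `actual_genes_in_set` by a few percent shifts the total by thousands of genes, which is why small changes in procedure can move such estimates so far.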

Fortunately, there is every reason to believe that the quality of gene prediction will rapidly improve, and an experimental technique for doing so is discussed on page 922 (ref. 9). With the sequencing of the genomes of other vertebrates, our ability to detect genes by their similarity to known sequences will get better. This is because, thanks to natural selection, gene sequences tend to be altered less during evolution than the DNA surrounding them. In a couple of years we should have at least a more complete list of testable gene candidates.

Despite all this, the information now available has profound implications. For example, there are already many heavily hunted disease-associated genes that have been identified using the public draft (ref. 1, Table 26, page 912). Together with studies of single nucleotide polymorphisms — the base differences from human to human — the draft also provides a framework for understanding the genetic basis and evolution of many human characteristics.

With the draft in hand, researchers have a new tool for studying the regulatory regions and networks of genes. Comparisons with other genomes should reveal common regulatory elements, and the environments of genes shared with other species may offer insight into function and regulation beyond the level of individual genes. The draft is also a starting point for studies of the three-dimensional packing of the genome into a cell's nucleus. Such packing is likely to influence gene regulation.

On a more applied note, the information can be used to exploit technologies such as chips made using DNA or proteins, complementing more traditional approaches. Such chips could now, for instance, contain all the members of a protein family, making it possible to find out which are active in particular diseased tissues. A new world of biotechnology will provide tools and information by exploiting genome data.

Sequencing the tough leftovers of the human genome will be essential. Without a finished sequence, we will not know what we are missing. Each missed gene is potentially a missed drug target, and even gene-poor areas might be critical for gene regulation. Nevertheless, we must now confront the fact that the era of rapid growth in human genomic information is over. The challenge we face is nothing less than understanding how this comparatively small set of genes creates the diversity of phenomena and characteristics that we see in human life. The human genome lies before us, ready for interpretation.

Box 1 What makes a completely sequenced genome?

When is sequencing work on a genome complete? No genome for a eukaryotic organism (roughly, those organisms whose cells contain a nucleus) has been sequenced to 100%. There are regions, often highly repetitive, that are difficult or impossible to clone (one of the initial steps in a sequencing project) or sequence with current technology. Fortunately, such regions are expected to contain relatively few protein-coding genes⁴,¹⁰.

The extent of these regions varies widely in different species. So, rather than applying a universal gold standard, each sequencing project has made pragmatic decisions as to what constitutes a sufficient level of coverage for a particular genome. For example, as much as one-third of the sequence of the fruitfly Drosophila melanogaster was not stable in the cloning systems used, and so was not sequenced. But 97% of the so-called euchromatic portion, where most genes are thought to reside, was sequenced¹¹ (Fig. 1).

For the human genome, one definition of 'finished' is that fewer than one base in 10,000 is incorrectly assigned⁶; more than 95% of the euchromatic regions are sequenced; and each gap is smaller than 150 kilobases¹². Such standards represent realistic goals given current technology. By this standard, over a quarter of the public consortium's sequence¹ is considered finished at present, including the previously published long arms of chromosomes 21 and 22 (refs 3, 4; Fig. 1). The Celera sequences of chromosomes 21 and 22 are slightly more gappy than those from the public consortium, but the converse seems to be true for the other chromosomes². But again, as different protocols were used, it is not easy to compare the overall status of the two assemblies. In the longer term, as much as possible of the heterochromatin, which is harder to sequence and contains few genes, must be sequenced, because we might otherwise miss important features. P.B. & R.C.
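The three-part definition of 'finished' quoted in this box can be written as a simple check. This is a minimal sketch, not a tool used by either sequencing project: the `RegionStats` structure, its field names and the example values are all hypothetical; only the three thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Summary statistics for a stretch of assembled sequence (hypothetical)."""
    error_rate: float            # fraction of bases incorrectly assigned
    euchromatin_coverage: float  # fraction of euchromatic sequence covered
    largest_gap_kb: float        # size of the largest remaining gap, in kilobases


def is_finished(stats: RegionStats) -> bool:
    """Apply the three thresholds quoted in Box 1."""
    return (
        stats.error_rate < 1 / 10_000          # fewer than one error per 10,000 bases
        and stats.euchromatin_coverage > 0.95  # more than 95% of euchromatin sequenced
        and stats.largest_gap_kb < 150         # every gap smaller than 150 kilobases
    )


# Invented example values, for illustration only.
well_covered = RegionStats(error_rate=5e-5, euchromatin_coverage=0.97, largest_gap_kb=100)
draft_like = RegionStats(error_rate=5e-5, euchromatin_coverage=0.90, largest_gap_kb=100)
```

With these invented values, `is_finished(well_covered)` holds while `is_finished(draft_like)` fails on coverage alone, which mirrors the point that accuracy and completeness are separate requirements.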

Box 2 When is a predicted gene a gene?

How many genes are encoded in the human genome? This is a simple question without — as yet — a straightforward answer¹³. The density of genes in the human genome is much lower than for any other genome sequenced so far (Fig. 1), making it particularly difficult to predict where genes are.

Both Celera and the public sequencing consortium used computational algorithms to model genes and make predictions, but such methods are far from perfect. Not only can the start and end positions of a predicted gene be wrong, but exons (the coding parts of a gene) can be missed entirely or wrongly predicted to exist. To reduce this latter effect, the public sequencing consortium required the exons of predicted genes to be 'confirmed', by showing significant similarity to a known sequence (DNA or protein) in a database. But this requirement might be too conservative, making it difficult to predict the presence of new gene families. Celera has required similar confirmation of predictions, but its mouse-genome sequencing project may have provided evidence for further vertebrate-specific genes.

Spurious prediction is also a problem. All genes are expressed by being copied (transcribed) into messenger RNA; most messenger RNAs are then translated into proteins. But even evidence that a stretch of DNA is transcribed does not definitively show that stretch to be a gene. We do not know how efficiently cells control transcription; indeed, it seems likely that non-gene DNA sequences are transcribed relatively frequently¹². Nor do we know how well the cell identifies transcripts that cannot be translated into a functioning protein. Moreover, proteins that cannot serve any useful function (for example, because they cannot fold correctly) could be made, but rapidly removed. To arrive at a true set of protein-encoding genes, we cannot rely on computational techniques alone, but must continue to characterize proteins and their functions.

These problems provide scope for estimates of human gene number to vary widely. Although recent estimates are converging in the 30,000–40,000 range (as opposed to earlier estimates of 100,000 or so), it could be many years before we have the final answer. P.B. & R.C.

References

1. International Human Genome Sequencing Consortium Nature 409, 860–921 (2001).
2. Venter, J. C. et al. Science 291, 1304–1351 (2001).
3. Dunham, I. et al. Nature 402, 489–495 (1999).
4. The Chromosome 21 Mapping and Sequencing Consortium Nature 405, 311–319 (2000).
5. The C. elegans Sequencing Consortium Science 282, 2012–2018 (1998).
6. Collins, F. S. et al. Science 282, 682–689 (1998).
7. Liang, F. et al. Nature Genet. 25, 239–240 (2000).
8. Liang, F. et al. Nature Genet. 26, 501 (2000).
9. Shoemaker, D. D. et al. Nature 409, 922–927 (2001).
10. The Arabidopsis Sequencing Consortium Cell 100, 377–386 (2000).
11. Adams, M. D. et al. Science 287, 2185–2195 (2000).
12. Normile, D. & Pennisi, E. Science 285, 2038–2039 (1999).
13. Aparicio, S. Nature Genet. 25, 129–130 (2000).
14. Goffeau, A. et al. Nature 387 (suppl.), 1–105 (1997).
15. The Arabidopsis Genome Initiative Nature 408, 796–815 (2000).

Author information

Authors and Affiliations

EMBL, Meyerhofstrasse 1, Heidelberg, 69012, Germany
Peer Bork & Richard Copley

Corresponding authors

Correspondence to Peer Bork or Richard Copley.

About this article

Bork, P., Copley, R. Filling in the gaps. Nature 409, 818–820 (2001). https://doi.org/10.1038/35057274
