Published online 12 February 2001 | Nature | doi:10.1038/news010215-3

News

A journey into the genome: what's there

The human genome is 95% junk. Only about 5% of it consists of genes -- the instructions to make proteins. Despite the complexity of human structure and behaviour, the number of genes in the human genome is comparable to that in much smaller genomes.

The Human Genome Sequencing Consortium estimates that our genomes contain 31,780 protein-coding genes. So far it has spotted 22,000. That tally is fewer than the 25,498 genes in the genome of the tiny plant thale cress (Arabidopsis thaliana) and not much more than the fruitfly's 13,601 or the roundworm's 19,099 genes.

Arabidopsis thaliana. (C) SPL

Clearly, there is little correlation between the complexity of an organism and the amount of DNA it has. The human genome contains at least 200 times more DNA than the yeast genome's 12 million bases (the letters of the genetic code), but the genome of Amoeba dubia, a unicellular creature as simple as yeast, dwarfs the human genome by 200-fold.

How many is not many?

Even with a sequence touted as almost complete, the consortium can still estimate only roughly the number of genes contained in the human genome. There are several reasons for this. One is that human genes are so few and far between. There are, on average, around 12 genes per million bases of human DNA, compared with 117 in fruitflies, 197 in roundworms and 221 in Arabidopsis. Spotting genuine genes amid the morass of meaningless DNA has proven a sore trial for current computer software.
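
To make "few and far between" concrete, the densities quoted above can be turned into an average stretch of DNA per gene. The short Python sketch below is only an illustration: the function names are invented for this example, and the result is a simple average that says nothing about how unevenly genes are actually spread.

```python
# Gene densities (genes per million bases) as quoted in the article.
GENES_PER_MEGABASE = {
    "human": 12,
    "fruitfly": 117,
    "roundworm": 197,
    "Arabidopsis": 221,
}


def bases_per_gene(genes_per_megabase):
    """Average stretch of DNA per gene, in bases (illustrative helper)."""
    return 1_000_000 / genes_per_megabase


def spacing_table(densities=GENES_PER_MEGABASE):
    """Map each organism to its rounded average number of bases per gene."""
    return {name: round(bases_per_gene(d)) for name, d in densities.items()}


if __name__ == "__main__":
    for name, spacing in spacing_table().items():
        print(f"{name}: roughly one gene per {spacing:,} bases")
```

On these figures a human gene has, on average, more than 80,000 bases to itself, against fewer than 10,000 in each of the other three organisms.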

Another reason human genes are hard to detect is that, compared with other creatures' genes, they are highly fragmented. In organisms more complicated than bacteria, genes tend to be divided into sections of coding sequence, 'exons', interrupted by non-coding spacers called 'introns' -- just as TV programmes are interrupted by commercial breaks. Generally, human genes have many small exons and longer-than-average introns -- some are more than 10,000 bases long.

The largest human gene is 2.4 million bases long. It encodes the muscle protein 'dystrophin' (and malfunctions in muscular dystrophy). But most of it is non-coding DNA. The record-holder for coding sequence is the gene for 'titin', another muscle protein. The gene is 80,780 bases long, divided into 178 exons, the largest of which contains 17,106 bases.

Introns in the fruitfly and roundworm have a 'preferred' length, tens or at most hundreds of bases long. Human introns are much more variable. Most are around 87 bases long, but a substantial population are very long, dragging the average length up to more than 3,300 bases. Human exons, in contrast, can be very small indeed, and therefore easy to miss -- more than 40 are known in the human genome that are each just 19 bases long.
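
The gap between the typical intron and the average one is what a long-tailed distribution produces. The Python sketch below uses made-up numbers, not consortium data: 90 toy introns of 87 bases and 10 of 33,000 bases, chosen only to show how a small population of very long introns can pull the mean into the thousands while the most common length stays at 87.

```python
from statistics import mean, mode

# Toy data, NOT real measurements: mostly short introns plus a long tail.
INTRON_LENGTHS = [87] * 90 + [33_000] * 10


def summarize(lengths):
    """Return the most common length and the arithmetic mean."""
    return {"typical": mode(lengths), "mean": mean(lengths)}


if __name__ == "__main__":
    summary = summarize(INTRON_LENGTHS)
    print(f"most common length: {summary['typical']} bases")
    print(f"mean length: {summary['mean']:.1f} bases")
```

In this toy set, one intron in ten is enough to lift the mean nearly fortyfold above the most common length.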

More than 91% of the draft sequence reported by the consortium is 99.99% accurate -- that is, it contains no more than one error in every 10,000 bases. There are still many gaps, but not so many as to cause major confusion in ordering the bases. The gaps could hide missing genes, but if so, "they are running out of places to hide", Peer Bork and Richard Copley of the European Molecular Biology Laboratory in Heidelberg comment in the same issue of Nature.

Apparently, it is not how many genes you have, but how you use them. The fragmentation of human genes allows many different proteins to be built from the same genes, by combining the instructions in different exons in different ways. At least 35% of all human genes, it appears, may be read in several ways. In this way the human genome could encode five times as many proteins as the less flexible genomes of the fruitfly or roundworm.
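
How one fragmented gene can yield several proteins can be sketched as a counting exercise. The Python example below is a deliberately simplified model, not a description of how cells choose exons: it assumes a hypothetical gene with four exons named E1 to E4, of which E2 and E3 may each be kept or skipped, and lists every resulting combination in genomic order.

```python
from itertools import product


def splice_variants(exons, optional):
    """Enumerate exon combinations, keeping the exons in their original order.

    exons: ordered list of exon names (hypothetical labels).
    optional: set of exon names that may be skipped; all others are always kept.
    """
    choices = [(True, False) if name in optional else (True,) for name in exons]
    variants = []
    for keep in product(*choices):
        variants.append(tuple(name for name, k in zip(exons, keep) if k))
    return variants


if __name__ == "__main__":
    for variant in splice_variants(["E1", "E2", "E3", "E4"], {"E2", "E3"}):
        print("-".join(variant))
```

In this model each optional exon doubles the number of possible messages, so two optional exons give four variants from a single gene.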

So much for the genes -- what's all the other stuff?

More than half of the human genome -- including 47 known genes -- consists of 'transposable elements'. These parasitic stretches of DNA copy themselves and spread throughout the genome, determining much of its architecture. Almost all of these rogue elements have been inactive for millions of years.

The human genome is richer in transposable elements and other repetitive DNA sequences than any other genome known, although the density of repeats varies widely. The most cluttered stretch is a 525,000-base region of the X chromosome, 89% of which is repeated sequence. At the other extreme are the 'HOX clusters', which regulate development. These are less than 2% repeated elements.

Many transposable and repeated sequences started life as the genomes of independent entities that became integrated into the genome. Many viruses, including the human immunodeficiency virus HIV-1, have genomes made of RNA, a close chemical relative of DNA. These genomes encode an enzyme, reverse transcriptase, that makes DNA copies of the RNA genome and integrates them into the genome of a host.

Large stretches of the human genome show signs of having once, perhaps millions of years ago, been viruses. David Baltimore of the California Institute of Technology in Pasadena, one of the discoverers of reverse transcriptase, says that "in places, the genome looks like a sea of reverse-transcribed DNA with a small admixture of genes". The genome is a museum of the viral infections suffered by humanity and its ancestors. Viruses made us what we are.

Hundreds of other genes -- encoding at least 223 proteins -- seem to have come from bacteria. Around 40 bacterial genomes are now completely sequenced, from which it is evident that these organisms exchange genes with bohemian abandon. But it is surprising to find evidence for the direct transfer of bacterial genes into humans.

Some proteins of bacterial origin seem to be involved in the metabolism of antibiotics and neurologically active agents. One such protein is the enzyme monoamine oxidase, important in the metabolism of neuroactive substances (such as alcohol) and a target of important psychiatric drugs.

The capacity for bacteria and viruses to exchange genes is the basis for the genetic modification of organisms. It is perhaps ironic that all humans, including those in the anti-GM lobby, are GM organisms.

So we know that the human genome is a large and disordered jumble of ancient viruses punctuated by a modest collection of genes, some from bacteria, and that it has come to do much more than it could possibly have been designed to do. But it is too much to expect that the study of the human genome should further our understanding of what it is that gives humans their complexity of structure, behaviour, conscious action, learning, memory -- humanity.

Nonetheless, as Baltimore notes, the questions that the draft genome now opens to investigation include some of the simplest and deepest, such as: "Daddy, where did I come from?"