Main

We have analysed the draft human genome sequence for data related to evolutionary genomics. Our investigations reveal new information about repetitive elements, domain sharing and conservation, and gene duplication in the human genome (for Methods, see Supplementary Information).

Numbers of repetitive elements

Analysing 76% of the human genome (using almost all available contigs, Table 1), we estimated that around 43% of the human genome is occupied by four major classes of interspersed repetitive element: (1) short interspersed elements (SINEs), (2) long interspersed elements (LINEs), (3) elements with long terminal repeats (LTR elements), and (4) DNA transposons. There are more than 4.3 million repetitive elements in the human genome, with Alu and LINE1 (L1) being the most frequent. These estimates largely agree with previous ones (refs 1, 2). Because many repetitive elements have probably degenerated to the extent that they cannot be detected by the computer program RepeatMasker (http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker), more than 50% of the human genome may have come from insertion of repetitive elements.

Table 1 Repetitive elements in the human genome

Repetitive elements in proteins

Contrary to the belief that a repetitive element insertion into a gene is deleterious and unlikely to survive, many translated repetitive elements are found in proteins (Table 2). From the International Protein Index (ref. 3; http://www.ensembl.org/IPI), we derived a new database by eliminating ‘isoforms’ (due to alternative splicing). We BLASTed the database against itself (using tBLASTN) with the expected (E) value set to < 10^-80, and deleted all but one copy of the genes whose chromosome locations overlapped by more than 50%. This procedure reduced the number of ‘proteins’ in the database from 45,112 to 43,195, of which 15,337 are ‘known’ proteins and 27,858 are predicted proteins (translated from gene predictions). Because of the stringent conditions used, the chance of misidentifying isoforms is negligible. The new database probably still contains some isoforms because the chromosomal locations of many sequences are unknown and their isoforms cannot be identified.
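The location-based part of this filtering step can be illustrated with a minimal sketch. It is not the procedure used in the analysis: the sequence-similarity condition (the BLAST E-value threshold) is omitted, and the definition of "overlapped by more than 50%" as overlap length divided by the length of the shorter locus is an assumption, as are the names `Locus`, `overlap_fraction` and `drop_overlapping`.

```python
from typing import List, NamedTuple


class Locus(NamedTuple):
    """Hypothetical record for a gene's chromosomal location."""
    name: str
    chrom: str
    start: int  # inclusive
    end: int    # exclusive


def overlap_fraction(a: Locus, b: Locus) -> float:
    """Overlap length relative to the shorter locus (assumed definition)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return shared / min(a.end - a.start, b.end - b.start)


def drop_overlapping(loci: List[Locus], threshold: float = 0.5) -> List[Locus]:
    """Keep a locus only if it does not overlap an already kept one
    by more than `threshold`; earlier entries take priority."""
    kept: List[Locus] = []
    for locus in loci:
        if all(overlap_fraction(locus, k) <= threshold for k in kept):
            kept.append(locus)
    return kept


if __name__ == "__main__":
    loci = [
        Locus("g1", "chr1", 100, 200),
        Locus("g2", "chr1", 120, 220),  # overlaps g1 by 80%
        Locus("g3", "chr1", 190, 300),  # overlaps g1 by 10%
        Locus("g4", "chr2", 100, 200),  # different chromosome
    ]
    print([l.name for l in drop_overlapping(loci)])
```

In this toy input only `g2` is removed. Sequences without a known chromosomal location cannot be compared in this way, which is why some isoforms probably remain in the database.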

Table 2 Repetitive elements in ‘known’ and predicted proteins

We then BLASTed each sequence in the new database against a recent release of RepBase (http://www.girinst.org). Predicted proteins on average contain many more matches to repetitive element fragments than ‘known’ proteins (Table 2), suggesting many false positives in gene prediction. This is not a serious problem for ‘known’ proteins, as they have been translated from genes cloned by traditional methods or from ‘genes’ that have a high similarity to known genes. Surprisingly, many ‘known’ proteins also contain (truncated) repetitive elements, especially L1 and Alu. A closer look suggests that repetitive elements were usually not inserted into the original open reading frames, but became part of a gene because of alternative splicing, which can sometimes extend or truncate the coding region. L1 has on average the highest E-scoring matches (Table 2), indicating that L1-mediated gene evolution may be common. In addition, there is evidence that transduction of 3′-flanking sequences (including exons) is common in L1 retrotransposition (ref. 4), so that L1 might have mediated many exon-shuffling events. Therefore, repetitive elements may have been significant in gene evolution and species differentiation.

To reduce the effect of false gene predictions, we deleted from the database 2,615 predicted proteins that had a significant hit (E < 10^-4) by a repetitive element and did not have a domain structure other than reverse transcriptase or transposase. The ‘cleaned’ database contains 15,337 ‘known’ proteins and 25,243 predicted proteins (total, 40,580).
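The deletion rule can be written as a simple predicate. The following is a sketch under stated assumptions rather than the code used for the analysis: the `Protein` record, its field names and the domain labels are hypothetical, and the best repeat hit is assumed to be summarised by a single E value per protein.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical labels for the two domain types exempted by the rule.
REPEAT_DOMAINS = {"reverse_transcriptase", "transposase"}


@dataclass
class Protein:
    name: str
    predicted: bool                      # False for 'known' proteins
    best_repeat_evalue: Optional[float]  # None if no repeat hit at all
    domains: Set[str] = field(default_factory=set)


def is_probable_false_prediction(p: Protein, e_cutoff: float = 1e-4) -> bool:
    """True for a predicted protein with a significant repeat hit and
    no domain other than reverse transcriptase or transposase."""
    if not p.predicted:
        return False
    if p.best_repeat_evalue is None or p.best_repeat_evalue >= e_cutoff:
        return False
    return not (p.domains - REPEAT_DOMAINS)


def clean(proteins):
    """Return the proteins that survive the deletion rule."""
    return [p for p in proteins if not is_probable_false_prediction(p)]


if __name__ == "__main__":
    toy = [
        Protein("known_with_L1", False, 1e-30, {"reverse_transcriptase"}),
        Protein("pred_only_RT", True, 1e-30, {"reverse_transcriptase"}),
        Protein("pred_RT_plus_kinase", True, 1e-30,
                {"reverse_transcriptase", "kinase"}),
        Protein("pred_weak_hit", True, 0.5, set()),
    ]
    print([p.name for p in clean(toy)])
```

Only `pred_only_RT` is removed in this toy input. The counts in the text are consistent with the rule being applied to predicted proteins only: 27,858 − 2,615 = 25,243, and 15,337 + 25,243 = 40,580.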

Domain sharing and conservation

A domain is a structural or functional unit in a protein. To investigate the frequency of domain sharing, where the same domain appears in different proteins, we obtained a collection of human, fruitfly, nematode and yeast proteins (15,312, 8,896, 9,254 and 3,136 polypeptides) containing at least one domain; we used the InterPro domain database. In each case of nested domains, only the shortest one was included in the final dataset. There are 1,865, 1,218, 1,183 and 973 domain types in human, fruitfly, nematode and yeast, respectively, and the proportions of mosaic proteins (containing more than one domain type) in the four taxa are 28%, 27%, 21% and 19%.

First, we consider the sharing of domain types (or domain combination), regardless of the order or the number of times a domain appears in a protein; for example, a protein with A-A-B-B-A contains only two domain types and has the combination AB. In our database, the largest number of domain types per protein is nine in human and fruitfly, and seven in nematode and yeast. The frequency of domain sharing is very high among human proteins (Table 3); for example, there are 88 cases where three proteins share two types of domain. There are also many human proteins that share more than one type of domain with Drosophila, (slightly less frequently) with C. elegans, and (much less frequently) with yeast proteins. But there are only three cases where a combination of more than three domain types is shared by human and yeast proteins and only two of these cases are shared by the four taxa. One of these two cases has a combination of seven domain types; it occurs once in human, nematode and yeast but twice in the fruitfly. It is a carbamoyl-phosphate synthase (EC 6.3.5.5) involved in the first three steps of de novo pyrimidine nucleotide biosynthesis (SwissProt accession nos P07259, Q18990, Q9VXD5, P27708).
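The notion of a domain combination used here can be made concrete with a short sketch. The data layout (a mapping from protein name to its ordered list of domain labels) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence


def domain_combination(domains: Sequence[str]) -> FrozenSet[str]:
    """Set of domain types, ignoring order and repetition."""
    return frozenset(domains)


def proteins_by_combination(
    proteins: Dict[str, Sequence[str]]
) -> Dict[FrozenSet[str], List[str]]:
    """Group protein names by their domain combination."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for name, domains in proteins.items():
        groups[domain_combination(domains)].append(name)
    return dict(groups)


if __name__ == "__main__":
    toy = {
        "p1": ["A", "A", "B", "B", "A"],  # combination AB
        "p2": ["B", "A"],                 # also AB
        "p3": ["A", "C"],
    }
    for combo, names in proteins_by_combination(toy).items():
        print(sorted(combo), names)
```

Here `p1` and `p2` share the combination AB even though their domain order and copy number differ, which is the sense of "sharing" counted in this paragraph.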

Table 3 Domain sharing and order conservation within human and between human and other eukaryotes

We now consider the conservation of domain arrangements (the number and order of domains within a protein). There are 3,433, 1,702, 1,248 and 470 distinct arrangements of two or more domains in human, fruitfly, nematode and yeast proteins, respectively. Some proteins exhibit extensive domain repetition: in human, the largest number of domain types in a protein is nine, but the largest total number of domains in a protein is 130. Many human proteins have identical arrangements (Table 3). In the case of two domain types, many of the human arrangements are shared by fruitfly, (less frequently) by nematode, and (even less frequently) by yeast. The largest domain arrangement shared by all four taxa contains 11 domains, with only two domain types. It is ornithine decarboxylase, which catalyses a rate-limiting step in the biosynthesis of polyamines (SwissProt: P08432, Q94278, Q9V352, P11926). The shared arrangement that has the largest number of domain types (four) contains five domains; it is a sulphonylurea receptor (SURx) in fruitfly, which is a subunit of the ATP-sensitive potassium channel (SwissProt: P53049, Q9U6Z2, Q9V352).
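In contrast to a domain combination, a domain arrangement keeps both the order and the number of domains. A minimal sketch, assuming each protein is represented by its ordered list of domain labels (the function names are hypothetical):

```python
from typing import Dict, Sequence, Set, Tuple


def domain_arrangement(domains: Sequence[str]) -> Tuple[str, ...]:
    """Ordered tuple of domains, preserving order and repetition."""
    return tuple(domains)


def distinct_arrangements(
    proteins: Dict[str, Sequence[str]]
) -> Set[Tuple[str, ...]]:
    """Distinct arrangements of two or more domains."""
    return {
        domain_arrangement(d) for d in proteins.values() if len(d) >= 2
    }


def shared_arrangements(a, b):
    """Arrangements present in both protein collections."""
    return distinct_arrangements(a) & distinct_arrangements(b)


if __name__ == "__main__":
    taxon_1 = {"h1": ["A", "B"], "h2": ["B", "A"], "h3": ["A", "B"], "h4": ["A"]}
    taxon_2 = {"y1": ["A", "B"], "y2": ["C"]}
    print(len(distinct_arrangements(taxon_1)))
    print(shared_arrangements(taxon_1, taxon_2))
```

In this toy input A-B and B-A count as two distinct arrangements although they form a single domain combination, and only A-B is shared between the two collections.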

Duplicate genes

Two genes that were derived from a gene duplication are said to be paralogous; two genes (in two species) are orthologous if they were derived from the same gene through speciation. Predicting whether two proteins are paralogous is relatively simple when their sequence identity (I) is high (>40% for long sequences) but becomes difficult when I is in the medium range (20–35%) or lower, especially for short sequences.

Rost (ref. 5) proposed an empirical formula for clustering proteins in a database (Table 4). Two proteins are assumed to be paralogous if the proportion (p) of identical residues over the L aligned amino-acid residues between the two proteins is higher than the cut-off point (pI) defined by the formula. The cut-off point increases as L decreases because two unrelated short sequences may by chance have a high p value. A common practice in clustering proteins into groups is to use single linkage: if proteins A and B have a p higher than pI and so do proteins B and C, then A, B and C are clustered in the same group, even if the p value for A and C does not meet the cut. Applying Rost's formula with n = 5 (n is a factor to raise the cut-off point) to the ‘cleaned’ protein database, we found that the largest group contained 15,121 members, which is more than one-third of the database and includes various proteins. Even for n = 25 the largest group still contained 4,519 members. Such large groups occur probably because nonhomologous proteins may share the same domains (see above).
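The chaining behaviour of single linkage can be seen in a small sketch. It uses a union-find structure, which is one common way to implement single linkage and is not necessarily how the analysis was carried out. The input is assumed to be the list of protein pairs that already pass the cut-off; the cut-off formula itself is not reproduced here.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def single_linkage(
    items: Iterable[Hashable],
    linked_pairs: Iterable[Tuple[Hashable, Hashable]],
) -> List[List[Hashable]]:
    """Group items so that any two items joined by a chain of
    linked pairs end up in the same group."""
    items = list(items)
    parent: Dict[Hashable, Hashable] = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[Hashable, List[Hashable]] = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())


if __name__ == "__main__":
    # A-B and B-C pass the cut-off; A-C does not, yet all three merge.
    print(single_linkage(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

Proteins A and C are placed in the same group purely through B. If B is a mosaic protein that shares one domain with A and a different domain with C, this is how unrelated proteins can accumulate into a single very large group.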

Table 4 Protein groups inferred from sequence similarities

We propose to use I′ = I × Min(n_1/L_1, n_2/L_2), where I is the proportion of identical amino acids in the aligned region (including gaps) between the query (sequence 1) and target (sequence 2) sequences obtained by the alignment program FASTA, L_i is the length of sequence i, and n_i is the number of amino acids in the aligned region in sequence i. The factor Min(n_1/L_1, n_2/L_2), which means the smaller of n_1/L_1 and n_2/L_2, takes care of the situation where a high I value is obtained when a short protein shares one or more domains with a longer protein. Another difference between I′ and pI is that I′ imposes a gap penalty in the aligned region. For short proteins, however, I′ may become high by chance and so we impose I′ ≥ pI, with n = 5 in Rost's formula.
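A worked sketch of the I′ calculation shows the effect of the coverage factor. The function and parameter names are hypothetical, the numbers are invented for illustration, and the additional requirement I′ ≥ pI is not included because the cut-off formula is not given in the text.

```python
def adjusted_identity(identity: float,
                      n1: int, len1: int,
                      n2: int, len2: int) -> float:
    """I' = I * min(n1/L1, n2/L2).

    identity : proportion of identical residues in the aligned region (I)
    n1, n2   : residues of sequence 1 and 2 inside the aligned region
    len1, len2 : full lengths of sequence 1 and 2
    """
    if len1 <= 0 or len2 <= 0:
        raise ValueError("sequence lengths must be positive")
    return identity * min(n1 / len1, n2 / len2)


if __name__ == "__main__":
    # A 100-residue protein aligned over its full length to one domain
    # of a 1,000-residue protein, with 90% identity in the aligned region.
    domain_only = adjusted_identity(0.90, 100, 100, 100, 1000)

    # Two proteins of similar length aligned over most of their length,
    # with 60% identity in the aligned region.
    full_length = adjusted_identity(0.60, 290, 300, 290, 310)

    print(f"domain-only match: I' = {domain_only:.2f}")
    print(f"full-length match: I' = {full_length:.2f}")
```

The domain-only match has the higher I but the lower I′ (about 0.09, against about 0.56 for the full-length match), because only one-tenth of the longer protein is covered by the alignment.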

Table 4 shows the protein groups inferred from our formula. I′ ≥ 50% corresponds to Dayhoff's definition of protein families. The largest group (139 members) contains the L1 reverse transcriptase (RT) and sequences with high I′ values with L1 RT. This is surprising, but many ‘known’ and predicted proteins contain (truncated) L1 RT; note also that many L1 RTs may still be nearly intact in the human genome. The second largest group (129 members) contains 91 immunoglobulin heavy chains, 1 rheumatoid factor, 6 unnamed proteins and 31 predicted proteins; the third (124 members) contains 85 immunoglobulin light chains, 2 heavy chains, 1 microfibrillar protein, 2 unnamed proteins and 34 predicted proteins; the fourth (104 members) contains 38 zinc finger proteins, 6 unnamed proteins and 60 predicted proteins; and the fifth (51 members) contains 16 olfactory receptors and 35 predicted proteins. This criterion identifies 3,007 families, 2,041 of which are two-protein families. These should be taken as minimum estimates because many human genes remain unidentified. For I′ ≥ 40%, the zinc finger group becomes the largest, and the L1 RT and olfactory receptor groups become the second and third largest. For I′ ≥ 30%, the five largest groups are zinc finger proteins, olfactory receptors, immunoglobulins (both light and heavy chains), L1 RTs and keratins. For I′ ≥ 25%, some of the largest groups become very heterogeneous, indicating that at this level of similarity it requires a more rigorous analysis to determine whether two proteins are related.

The I′ ≥ 30% criterion identifies 3,982 superfamilies (Table 4). Although some of the groupings may be false positives, this number may represent a minimum estimate because many human genes remain unidentified and because many of the proteins in the ‘singleton’ groups (25,237) may actually be related to each other. Taking the data at face value, the proportion of ‘singleton’ groups is 25,237/40,580 = 62% of the total ‘proteins’ in our ‘cleaned’ database. This proportion may be an overestimate and should be interpreted cautiously, because many of the ‘singletons’ may be false positives and because the total number of human genes remains unknown.

Our analysis has provided some insights into the evolutionary genomics of the human genome. There are many repetitive elements in our genome (Table 1), and they may have been very important in the evolution of mammalian proteins (Table 2). Domain sharing is common among proteins, and many domain arrangements have been conserved (Table 3). But many challenges remain. For example, as the number of human genes is still unknown, it remains unclear how many human genes exist as single copies. Reliable annotation of the human genome and clean databases of human genes and proteins are required for a rigorous analysis. In addition, better tools are needed for analysis. Single linkage does not seem appropriate for clustering proteins. Finally, better methods are needed for deciding whether two proteins are homologous, especially for short proteins.