Letter
Nature Genetics 25, 235-238 (2000)
doi:10.1038/76118

Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence

Hugues Roest Crollius, Olivier Jaillon, Alain Bernot, Corinne Dasilva, Laurence Bouneau, Cécile Fischer, Cécile Fizames, Patrick Wincker, Philippe Brottier, Francis Quétier, William Saurin & Jean Weissenbach

Genoscope and CNRS FRE2231, Evry cedex, France.

Correspondence should be addressed to Jean Weissenbach (jsbach@genoscope.cns.fr).
The number of genes in the human genome is unknown, with estimates ranging from 50,000 to 90,000 (refs 1,2), and to more than 140,000 according to unpublished sources. We have developed 'Exofish', a procedure based on homology searches, to identify human genes quickly and reliably. This method relies on the sequence of another vertebrate, the pufferfish Tetraodon nigroviridis, to detect conserved sequences with a very low background. Like Fugu rubripes, a marine pufferfish proposed by Brenner et al. (ref. 3) as a model for genomic studies, T. nigroviridis has a genome eight times more compact than that of human, and it is a more practical alternative (ref. 4). Many comparisons between F. rubripes and human DNA have demonstrated the potential of comparative genomics using the pufferfish genome (ref. 5). Application of Exofish to the December 1999 version of the working draft sequence of the human genome and to Unigene showed that the human genome contains 28,000–34,000 genes, and that Unigene contains less than 40% of the protein-coding fraction of the human genome.

To determine the conditions that would generate alignments in coding regions between human DNA and a pufferfish distant by 400 million years, we first tested a large number of BLAST conditions on a small set of 13 annotated human-pufferfish homologous genes (Table 1). We used F. rubripes genes because no complete T. nigroviridis gene sequence existed at the time of this work. We then applied the optimal conditions to a larger set of 322 annotated human genes and the partial T. nigroviridis genome sequence (33% of which has been determined), in which the positions of genes are unknown. We found that the existing sequence of the T. nigroviridis genome detects 26.5% of the 2,693 human exons in conditions in which no alignments fall in introns (Fig. 1a). The 724 exons detected are distributed in 64.9% of the genes (209/322). To estimate the influence of the amount of T. nigroviridis genome sequenced on the sensitivity of this approach in detecting exons and genes in human DNA, we represented the fraction of exons and genes identified with increasing amounts of T. nigroviridis sequence (Fig. 1b). The fraction of human exons detected increases at a rate proportional to the amount of T. nigroviridis genome coverage generated. The probability of identifying a gene by at least one of its exons is higher because genes in general contain many exons, in addition to the fact that the random sequence tag (RST) database represents approximately 170,000 random sequences in the genome.

Figure 1. Construction of Exofish.

a, Distribution of 8.3 million alignments generated by comparing the partial T. nigroviridis genome with a set of 322 human genes (2,693 exons). Each circle represents a population of alignments of a given length and a given percentage of identity, with a clear boundary between those which exclusively fall in exons (circle) of human genes and those for which at least one alignment falls in an intron (). This provides robust selection criteria to determine if any new alignment corresponds to a human exon, based on its length and identity with a T. nigroviridis sequence. For convenience, all alignments longer than 60 aa were arbitrarily drawn at 60, the longest measuring 245 aa. b, Evolution of the theoretical T. nigroviridis genome coverage (—) and observed sensitivity in gene detection (+) and exon detection (diamond) by Exofish in the set of 322 human genes, as a function of T. nigroviridis sequences produced (10% increments). The dotted line is positioned at the current status of the sequencing project (33% of genome coverage). The theoretical coverage is calculated on the basis of a Poisson distribution of sequences of average size 886 bases on a genome of 385 Mb.
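
The theoretical coverage in panel b can be checked with a short calculation. The sketch below assumes the usual Poisson expression for random sequencing, coverage = 1 - exp(-N·L/G), which the legend does not spell out; the function name is illustrative, and the read count (174,828) is the one reported in the Methods.

```python
import math

def poisson_coverage(n_reads: int, mean_read_len: float, genome_size: float) -> float:
    """Expected fraction of a genome covered by at least one random read."""
    redundancy = n_reads * mean_read_len / genome_size  # mean sequencing depth
    return 1.0 - math.exp(-redundancy)

# Values from the text: 174,828 reads of 886 bases on a 385-Mb genome.
coverage = poisson_coverage(174_828, 886, 385e6)
print(f"{coverage:.1%}")  # about one-third of the genome
```

Under these assumptions the result agrees with the 33% genome coverage marked by the dotted line.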

Table 1. Performance of different BLAST configurations

To reflect the fact that different T. nigroviridis sequences may generate overlapping alignments over the same exon and define a single, conserved human region, we defined the contiguous assembly of the different overlapping alignments as an 'ecore' (for evolutionary conserved region). In the set of 322 reference genes, the 209 genes (or 724 exons) that were detected by T. nigroviridis contained 831 ecores (2.58 ecores per gene when averaged over all 322 genes, including those not detected). The distribution of alignments by length and identity (Fig. 1a) provides a means to decide if new alignments between human and T. nigroviridis DNA overlap human exons. This criterion is the basis of the Exofish (for exon finding by sequence homology) selection mechanism (Fig. 2). To confirm the sensitivity of Exofish in detecting human genes, we performed a second comparison on a set of 4,888 complete human cDNA sequences extracted from Unigene version 105 (ref. 6). Using this set, 70% of the genes were identified, and each gene contained an average of 3.18 ecores (averaged over all genes, including the 30% that were not detected). This ratio was used to derive a number of genes from a given number of ecores detected by Exofish.
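
The ecore definition amounts to merging overlapping alignment intervals on the human sequence. The following is a minimal sketch of that idea, not the Exofish implementation: the function name, the inclusive coordinate convention and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the human sequence, end inclusive

def build_ecores(alignments: List[Interval]) -> List[Interval]:
    """Merge overlapping alignment intervals into contiguous regions (ecores)."""
    ecores: List[Interval] = []
    for start, end in sorted(alignments):
        if ecores and start <= ecores[-1][1]:
            # Overlaps the most recent ecore: extend it if this hit reaches further.
            prev_start, prev_end = ecores[-1]
            ecores[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap: this alignment starts a new ecore.
            ecores.append((start, end))
    return ecores

# Hypothetical coordinates: three alignments over one exon, one over another.
hits = [(100, 180), (150, 240), (230, 260), (900, 960)]
print(build_ecores(hits))  # [(100, 260), (900, 960)]
```

Sorting first means each alignment only needs to be compared with the last ecore built, so the merge runs in a single pass.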

Figure 2. Schematic of Exofish.

We analysed the sequence of chromosome 22 (ref. 7) with Exofish to estimate its capacity to confirm existing annotations and to detect new genes. We found 1,525 ecores over the complete length of the chromosome (Fig. 3). The distribution of ecores among the different types of annotated features varied considerably (Table 2). Related genes (based on homologies to proteins and genes from human and other species) and predicted genes (based on EST sequences) contained fewer ecores than known genes. These two categories of annotations also contained fewer annotated exons per gene, presumably because their respective counterparts in sequence databases are only partially homologous or are incomplete. In fact, 70 of 148 predicted genes consisted of a single exon. We estimate that approximately 50% of the 181 ecores that fell outside of annotations belonged to genes that are incompletely annotated or to pseudogenes. Therefore, the remaining 90 ecores corresponded to approximately 30 novel genes on chromosome 22 (Fig. 3b,c). We thus estimate that chromosome 22 contains fewer than 600 genes.

Figure 3. Examples of chromosome 22 results.

Open blue boxes linked by broken lines represent gene annotations from ref. 7. Red boxes represent exons predicted by Genscan. Green boxes represent ecores generated by Exofish that overlap gene annotations, and dark blue boxes represent ecores that do not overlap annotations. The scale in nucleotides is relative to the sequence described in ref. 7. a, Typical result in which a known gene with 19 exons (encoding carnitine palmitoyltransferase I) is partially predicted by Genscan (17 exons) and Exofish (9 exons). b, On the 'Up' strand (above the scale), five ecores indicate a new gene that is not predicted by Genscan, whereas two genes on the top and the bottom strand (similar to mouse Htf9c and Ranbp1, respectively) have several exons predicted by Exofish, whereas none are correctly predicted by Genscan. c, On the 'Up' strand, both Genscan and Exofish partially confirm a known gene (HMG2L1), whereas on the same strand a new gene seems to be predicted by both approaches. d, LIMK2 has 16 annotated exons, of which 14 are predicted by Genscan and 6 by Exofish. Two additional exons that are presumably alternatively spliced are predicted by Exofish (arrows), one of which is also predicted by Genscan.

Table 2. Distribution of ecores in chromosome 22 annotations

Of the 1,344 ecores that fell within the boundaries of the annotations, 1,197 (89%) corresponded to genes and 147 (11%) to pseudogenes. To estimate the sensitivity of Exofish in detecting genes on chromosome 22, we considered only the 247 known genes, because others are likely to be incomplete. Ecores were found in 32.0% of the 2,298 exons and 66.8% of the 247 known genes. These values are comparable to the 26.5% of exons and 64.9% of genes (209/322) identified in the reference set of 322 human genes, and to the 70% sensitivity obtained on 4,888 full-length cDNA sequences. Exofish detects only 8% of the 325 genes predicted by Genscan that are not confirmed by homologies (compared with 64.9% for known genes), suggesting that most of these predictions are false positives.

It is possible to exploit the compactness of the T. nigroviridis genome to confirm that several neighbouring ecores that fell outside of existing annotations do belong to the same gene. For instance, the five isolated ecores (Fig. 3c) were joined by three T. nigroviridis RSTs. Subsequent to the release of the sequence of chromosome 22 (ref. 7), a human cDNA clone and a homologous gene in a Caenorhabditis elegans cosmid clone have confirmed that these ecores define a true gene. By contrast, ecores identified inside the boundaries of the 545 annotated genes, but outside exons (that is, in introns), would correspond to exons that remained undetected by other homology-based approaches, presumably because of alternative splicing. We found 25 ecores in the introns of 21 annotated genes, of which 19 were also predicted by Genscan (Fig. 3d). Approximately 50% of ecores that fell either in introns or outside of annotations have been confirmed as exons by the chromosome 22 annotation team at the Sanger Centre (J. Collins, D. Beare and I. Dunham, pers. comm.).

To estimate the number of genes in the human genome, we analysed the human working draft sequence with Exofish. In release 61 (December 1999) of the EMBL database, the publicly available human working draft sequence contained 1,272.3 Mb of non-redundant human DNA. Analysis of this fraction of the human genome (approximately 42.4%) by Exofish generated 42,066 ecores. Results on human chromosome 22 indicated that 89% of ecores fell in genes, whereas the remaining 11% fell in pseudogenes. Based on the result that Exofish detects on average 3.18 ecores per human gene, the human genome would contain (42,066 × 0.89)/0.424 = 88,299 ecores and 88,299/3.18 = 27,767 genes. We estimated the gene distribution for each chromosome and compared the results with the EST gene map of the human genome (ref. 8; Fig. 4). The gene-dense chromosomes (17, 19, 22) have an excess of ecores compared with ESTs, as does chromosome 16. To set an upper limit to our estimate, another calculation was based on the lower ratio of ecores per gene found in the initial gene test set and gave 88,299/2.58 = 34,224 genes. We therefore estimate that the human genome contains 28,000–34,000 genes.
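
The extrapolation above is simple enough to reproduce directly. The sketch below only restates the arithmetic given in the text; the function and parameter names are illustrative choices, not part of Exofish.

```python
def estimate_genes(ecores_found: int, gene_fraction: float,
                   genome_fraction: float, ecores_per_gene: float):
    """Scale ecores found in a partial genome up to a whole-genome gene count."""
    # Keep only ecores expected to lie in genes, then scale to the full genome.
    genome_ecores = ecores_found * gene_fraction / genome_fraction
    return genome_ecores, genome_ecores / ecores_per_gene

# 42,066 ecores in 42.4% of the genome; 89% of ecores fall in genes.
total, lower = estimate_genes(42_066, 0.89, 0.424, 3.18)  # cDNA-derived ratio
_, upper = estimate_genes(42_066, 0.89, 0.424, 2.58)      # reference-set ratio
print(round(total), round(lower), round(upper))  # 88299 27767 34224
```

The two ecores-per-gene ratios are the only source of the spread between the lower and upper estimates; the other inputs are shared.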

Figure 4. Distribution of genes and ecores on individual human chromosomes according to the EST physical map (ref. 8) and Exofish.

a, Exofish confirms the density of genes obtained by EST mapping for most chromosomes, except chromosomes 16, 17, 19 and 22, and introduces an estimate for chromosome Y. b, The two independent data sets show a good correlation (correlation factor 0.88), which confirms (in most cases) the distribution obtained by physical mapping of ESTs.

We used Exofish on the non-redundant set of human gene sequences represented by Unigene (ref. 6) to estimate more accurately the fraction of protein-coding DNA present in publicly available databases. Release 105 of Unigene contains 10,501 clusters represented by known genes, whereas the remaining 82,430 clusters only contain EST sequences. When matched to the selected sequences representing Unigene clusters, Exofish detected 33,079 ecores, which identified 62% of the 10,501 known genes and only 4.2% of the EST sequences. As the human genome is estimated to contain 88,299 ecores, the 33,079 ecores found in Unigene represent only 37.5% of the coding fraction of human genes. This result is consistent with the very low number of matches obtained on the EST sequences. Most selected ESTs representing Unigene clusters (87%) are 3' reads of cDNA clones, and most likely correspond to untranslated regions.
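
The Unigene figures can be checked the same way. This short sketch uses only numbers quoted in the text; the variable names are illustrative.

```python
# Figures quoted in the text for Unigene release 105.
unigene_ecores = 33_079
genome_ecores = 88_299       # whole-genome ecore estimate
known_clusters = 10_501
est_only_clusters = 82_430

# Share of the estimated coding fraction that Unigene represents.
coding_fraction = unigene_ecores / genome_ecores
total_clusters = known_clusters + est_only_clusters
print(f"{coding_fraction:.1%}", total_clusters)  # 37.5% 92931
```

The cluster total of roughly 92,000 is about three times the predicted gene count, which is the basis for the redundancy argument in the discussion.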

Because a genome is a finite entity that contains all genes with all exons necessary to express all the proteins required at any stage or in any tissue, the sensitivity of Exofish is not biased by the traditional problems encountered in cDNA databases, such as alternative splicing and varying gene-expression levels. This is confirmed by the fact that Exofish identifies the same fraction of genes (approximately two-thirds) in three collections of human genes of diverse origins and characteristics. Our finding that the human genome contains only 28,000–34,000 genes is unexpected, considering that it corresponds to just over twice the number of genes in the fly or worm. This suggests that organismal complexity is not a direct consequence of gene number, but has its source in other mechanisms that may include alternative splicing and multi-domain proteins. As Unigene contains 92,000 clusters, and Exofish predicts 28,000–34,000 genes in the genome, Unigene is partially redundant and also consists mostly of non-coding sequences. It is likely, however, that Unigene contains a 'tag' for most human genes and as such is an invaluable resource for gene identification. Exofish still cannot detect one-third of human genes (false negatives), including those for which the corresponding T. nigroviridis sequence is not yet known, those that evolve rapidly and for which protein sequence similarity is weak, and those that are strictly specific to mammals. It is likely, however, that smaller protein domains also participate in the detection process and enable Exofish to detect genes outside the limits of orthologous or paralogous genes. As described here, two immediate applications for Exofish include the annotation of genomic DNA and the estimation of the coding fraction in cDNA collections. Exofish also enables comparison of the T. nigroviridis genome with entire vertebrate genomes at the protein level in a few hours of computation time, and as such it is a powerful tool to explore new avenues in vertebrate genome research in a way so far only possible for bacteria or unicellular eukaryotes.

Methods
Construction of Exofish.
A summary of the approach used to select the optimal BLAST conditions for Exofish is shown (Table 1) and a full description is available (see Methods, http://www.nature.com/ng/supplementary_info). Exofish is available as an annotation tool for human sequences (http://www.genoscope.cns.fr/exofish). We constructed a set of 322 complete human genes by global BLASTN alignments between a database of 10,067 human mRNA sequences and 3,930 genomic clones. Details of the parameters and selections used, as well as a file of the 322 human genes in fasta format, are available (see Methods, http://www.nature.com/ng/supplementary_info).

T. nigroviridis genomic sequence.
BAC library construction and insert-end sequencing are described elsewhere (ref. 9). Genomic DNA from a male T. nigroviridis specimen (ascertained as T. nigroviridis using morphological and mitochondrial DNA sequence characteristics) was extracted to construct a plasmid library. DNA was mechanically sheared, separated on a preparative agarose gel, and a size fraction corresponding to approximately 3 kb was excised, end-repaired and cloned in pcDNA2. After electroporation in DH10B electrocompetent cells, clones were plated on 2YT agar plates containing 70 µg/ml carbenicillin and 100,000 recombinants were robotically picked and replicated in microtitre plates. We sequenced 127,229 insert ends as described (ref. 9). Including the BAC ends database, 174,828 sequences of an average useful length of 886 nt were produced, equivalent to 154.9 Mb of combined DNA. T. nigroviridis repeats (rRNAs, transposable elements, satellites) and microsatellite repeats were masked following BLASTN alignments against a T. nigroviridis repeat database (ref. 9) and a microsatellite database (ref. 10), respectively. Minisatellites were identified by Tandem Repeats Finder (ref. 11) and subsequently masked. Microsatellites were further masked based on TBLASTX alignments, and low-complexity regions were identified and masked by RepeatMasker. In total, 11% of nucleotides were masked in the T. nigroviridis genomic sequence database.

Human working draft sequence and Unigene.
We retrieved the entire HTG1, HTG2 and HTG3 sections, and sequences larger than 35 kb from the human DNA section of EMBL release 61, and for each sequence the highest version number was retained to remove internal redundancy. Sequences were distributed as follows: HTG1, containing genomic clone sequences in unordered segments (726.9 Mb, 24.2%); HTG2, in which contigs were ordered within each genomic clone (55.5 Mb, 1.8%); HTG3, in which genomic clones are represented as a contiguous sequence (36.4 Mb, 1.2%); and HUM, in which sequences are considered finished, with an error rate of less than one in 10,000 bases (480.9 Mb, 16.0%). All sequences were filtered to remove remaining cloning vector sequences (0.21%) and stretches of 'N' used to separate sequence contigs in HTG1 and HTG2 (1.87%). We used Unigene version 105 (January 2000) for comparisons with the T. nigroviridis genome.

Computing alignments.
All alignments were computed with the suite of BLAST (ref. 12) algorithms or with the Smith-Waterman (ref. 13) algorithm, implemented in LASSAP (Large Scale Sequence Comparison Package) version 1.1.5 (ref. 14). For all calculations, hardware consisted of four Digital quadriprocessor (AXP 21264 (EV6) at 525 MHz) computers (Compaq GS60) with 4 GB of memory each, except for comparison of the partial T. nigroviridis genome with the human working draft sequence, for which a SUN Enterprise 10000 server with 64 UltraSPARC-II (400 MHz) processors and 64 GB of central memory was used with LASSAP version 1.2.0a.

Accession numbers.
T. nigroviridis sequences, EMBL AL163976 to AL352938; human cDNA clone, AB033118.

Note: Supplementary information is available on the Nature Genetics web site (http://genetics.nature.com/supplementary_info/).

Received 10 March 2000; Accepted 2 May 2000

REFERENCES
  1. Fields, C., Adams, M.D., White, O. & Venter, J.C. How many genes in the human genome? Nature Genet. 7, 345-346 (1994).
  2. Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse. Proc. Natl Acad. Sci. USA 90, 11995-11999 (1993).
  3. Brenner, S. et al. Characterization of the pufferfish (Fugu) genome as a compact model vertebrate genome. Nature 366, 265-268 (1993).
  4. Crnogorac-Jurcevic, T., Brown, J.R., Lehrach, H. & Schalkwyk, L.C. Tetraodon fluviatilis, a new puffer fish model for genome studies. Genomics 41, 177-184 (1997).
  5. Elgar, G. et al. Generation and analysis of 25 Mb of genomic DNA from the pufferfish Fugu rubripes by sequence scanning. Genome Res. 9, 960-971 (1999).
  6. Schuler, G.D. et al. A gene map of the human genome. Science 274, 540-546 (1996).
  7. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489-495 (1999).
  8. Deloukas, P. et al. A physical map of 30,000 human genes. Science 282, 744-746 (1998).
  9. Roest Crollius, H. et al. Characterization and repeat analysis of the compact genome of the freshwater pufferfish Tetraodon nigroviridis. Genome Res. (in press).
  10. Jin, L., Zhong, Y. & Chakraborty, R. The exact numbers of possible microsatellite motifs. Am. J. Hum. Genet. 55, 582-583 (1994).
  11. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573-580 (1999).
  12. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990).
  13. Smith, T.F. & Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 (1981).
  14. Glemet, E. & Codani, J. LASSAP, a large scale sequence comparisons package. Comput. Appl. Biosci. 13, 137-143 (1997).
Acknowledgments
We thank the sequencing and template preparation team at Genoscope; Sun Microsystems for access to the SUN benchmark centre; and F. Francis for critical reading of the manuscript. This work would not have been possible without the public availability of a large fraction of the sequence of the human genome, and we thank all contributing genome centres.

Nature Genetics
ISSN: 1061-4036
EISSN: 1546-1718
© 2000 Nature Publishing Group