
Nature 420, 512-514 (5 December 2002) | doi:10.1038/420512a

Commentary: Mining the mouse genome

Allan Bradley

We have the draft sequence — but how do we unlock its secrets?


Pooled resources: the mouse sequence will offer insight into the workings of the human genome.

The mouse genome sequence, published in this issue¹, has already made a huge impact on the research community. Although only a draft, the sequence is clearly a very high-quality product, with excellent coverage and reliability over large genomic expanses. It is a huge asset to researchers, and its significance matches that of the human genome. In the past six months, for example, the Ensembl genome browser of the Sanger/European Bioinformatics Institute dealt with 2.6 million requests for detailed information about the mouse genome, and 3.2 million queries about the human sequence.

But there is one important difference between these two resources — the mouse genome encodes an experimentally tractable organism. This means that it is now truly possible to determine the function of each and every component gene by experimental manipulation and evaluation, in the context of the whole organism.

An ideal tool

Over the past two decades, the mouse has emerged as the pre-eminent model organism for two fundamental reasons. First, it is a mammal, and so has many physiological, anatomical and metabolic parallels with humans. Although the anatomical differences between humans and mice appear striking, they reflect alterations in size and shape — detailed analysis of organs, tissues and cells reveals many similarities, extending to whole-organ systems, physiological homeostasis, reproduction, behaviour and disease. The mouse is an excellent surrogate for exploring human biology, and disease processes in the animal can accurately reflect those in humans. This explains why the mouse is widely used to investigate diverse aspects of mammalian biology and pathology, ranging from embryonic development to metabolic disease, behaviour and cancer in adults.

Second, although it is certainly true that mammalian biology is available in other species that in some cases are closer to humans, the mouse has one feature that has been uniquely developed compared with all other mammals — genetic tractability. The similarities in biology and pathology between mouse and human are reflected in the genomes. For virtually every gene in the human genome, a counterpart can readily be identified in the mouse. Genetic manipulation within the living mouse is routine and can these days be done with extraordinary precision. Consequently, many mouse strains have been generated with genetic lesions that echo those observed in human genetic disease.

In some cases, these manipulations yield a model that closely resembles the pathology of the analogous human condition — for example, the susceptibility to cancer of mice deficient in the gene encoding the protein p53 resembles that of humans who have mutations of their p53 gene. In other cases, only some aspects of the human pathology are apparent; for instance, mice with mutations in the cystic fibrosis gene do not develop lung disease, which is the most devastating aspect of the condition in humans. Understanding how the disease process is suppressed in mice will provide important clues for treating the human condition. Genetic studies in mice are greatly helped by the availability of inbred strains, which allow experimental parameters to be measured in a homogeneous genetic background.

The mouse genome sequence¹ is already fuelling the next phase of research. At a very fundamental level, we now have a reasonable 'parts list' for the mouse. Some of these parts, the transcription units, are the elements we understand best. Although there are now many new gene transcripts to explore, existing experimental methods can be used to examine them. But the mouse genome also contains many non-coding regions of sequence identical to their counterparts in the human genome. New experimental methods will have to be deployed to examine the function of these conserved sequences.

The complementary DNA (cDNA) clone and sequence resource described on page 563 by a group from the Institute of Physical and Chemical Research (RIKEN) in Japan² illustrates the potential usefulness of the mouse genome sequence. The authors' comparison of their FANTOM clones with the genome sequence immediately reveals gene structures that will enable the mutagenesis work discussed below. The physical cDNA resource can be used in many experimental situations: for instance, in overexpression studies in cell lines or transgenic mice (in which the product of the gene concerned is made in abundance to allow its function to be investigated); to produce proteins for structural studies or antigens to obtain antibodies for investigating gene expression; or in studies of interactions between different proteins. For laboratories interested in a single cDNA clone, knowledge of the end-points of the messenger RNA is sufficient to design primers to rapidly retrieve a cDNA by reverse-transcriptase PCR (polymerase chain reaction) analysis, saving months of time 'walking' through cDNA libraries. The mouse cDNA parts list is already being used to illuminate patterns of gene expression.
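The primer-design step mentioned above can be sketched in a few lines. This is a minimal illustration, not a laboratory protocol: the function names, the 20-nucleotide primer length and the toy transcript are all assumptions made for the example, and real primer design would also weigh melting temperature, GC content and specificity.

```python
# Illustrative sketch: derive RT-PCR primers from the two ends of a known transcript.
# The default length of 20 nt is an arbitrary assumption, not taken from the article.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence made of A, C, G and T."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def end_primers(transcript: str, length: int = 20) -> tuple[str, str]:
    """Return (forward, reverse) primers taken from the transcript end-points."""
    seq = transcript.upper()
    if len(seq) < 2 * length:
        raise ValueError("transcript is too short for non-overlapping primers")
    forward = seq[:length]                       # identical to the 5' end
    reverse = reverse_complement(seq[-length:])  # anneals to the 3' end
    return forward, reverse


if __name__ == "__main__":
    # Hypothetical toy sequence, written as DNA for simplicity.
    toy = "ATGGCTAGCAAGGGCGAGGA" + "GCTGTTCACC" * 3 + "TAAGCATTCGGATCCAATAA"
    fwd, rev = end_primers(toy)
    print("forward:", fwd)
    print("reverse:", rev)
```

Only the two end-points of the transcript are needed as input, which is why a reliable cDNA sequence can replace a library screen.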

Although the draft mouse genome sequence is available now, it will be two years before it is a finished, reference-quality product. Even so, the annotation of function to genes is already under way on a small scale in many laboratories, and larger-scale studies are being initiated. Unlike DNA sequencing, experimental work is not always amenable to high-throughput, automated approaches and can be complex to interpret. The research community clearly faces an enormous task — one that will extend well beyond the next decade.

Joined-up efforts

The assembled mouse genome provides a framework onto which functional information will increasingly be layered. Genome browsers (see Box) provide an interactive graphical view of the genome, rendering its vast size (equivalent to a million pages of text) accessible to the scientific community. They can provide detailed information about a single gene, or enable a wider perspective of the genomic landscape in a single species or across several species.

Mutation data, phenotype information and gene-expression data (described below) are just three of the many potential data sets that have to become accessible by these means. Many laboratories generating large data sets with 'post-sequence' goals in mind recognize the need to make their data accessible electronically. But few have sought ways to link their data back to a browser, and many may not have access to the computational infrastructure needed to respond to a large volume of queries. The mouse genetics community has enjoyed an exceptionally high-quality mouse genome database for many years (see Box), but this must continue to evolve and be intimately integrated with genome browsers. Joined-up data sets are essential if we are to reap the potential rewards from the mouse genome.

The draft mouse genome is being intensively scrutinized, but how much functional information can be deciphered from sequence-gazing alone? Which experimental approaches will provide the greatest information for the resources spent? Computational prediction and alignment programs can already identify many genes and predict some aspects of gene structure. Yet computers remain inferior to cellular machinery in recognizing a gene, where it starts and stops, and when it should be turned off and on. At best, computer predictions provide a very incomplete picture of a genome. Detailed 'hand-crafted' curation, coupled with experimental analysis, is necessary to clarify the encoded gene set. The paper from RIKEN² goes some way towards satisfying the need to identify the complete set of mouse genes. This resource must be distributed widely and quickly, without restrictions on access or usage, if the value of the clone set is to be realized.

Untangling the genome

Computational analysis is also being used to classify genes into families based on conserved motifs in the gene sequences, although such functional insights are limited to some classification of a protein's biochemical activity or cellular function based on the motifs. For instance, transcription factors and enzymes can be recognized from their motifs, but not where and when they would be expressed and who their partners are. This information is encoded in the genome, but it is indecipherable at present and so impossible to extrapolate to the physiological role of any gene.
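As a rough illustration of motif-based classification, the sketch below scans protein sequences for two short, well-known patterns: the P-loop (Walker A) nucleotide-binding consensus, [AG]-x(4)-G-K-[ST], and a C-terminal KDEL signal associated with retention in the endoplasmic reticulum. The motif table, the `classify` function and the toy sequences are assumptions for the example; production tools rely on curated motif databases and profile methods, not two regular expressions.

```python
import re

# Assumed, deliberately tiny motif table: label -> compiled regular expression.
MOTIFS = {
    "P-loop NTP-binding": re.compile(r"[AG].{4}GK[ST]"),
    "ER retention signal": re.compile(r"KDEL$"),
}


def classify(protein: str) -> list[str]:
    """Return the labels of every motif found in a protein sequence."""
    seq = protein.upper()
    return [label for label, pattern in MOTIFS.items() if pattern.search(seq)]


if __name__ == "__main__":
    # Hypothetical toy sequences, not real proteins.
    for name, seq in {
        "toyA": "MKLGAPGSGKSTLAR",
        "toyB": "MRLLPLAVSTEEKDEL",
        "toyC": "MSDQEPLLRAV",
    }.items():
        print(name, classify(seq))
```

A match hints at a biochemical activity or a cellular compartment, but, as the text notes, it says nothing about where and when the gene is expressed or which partners the protein has.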

Considerably more experimental information is needed to predict gene function, but which experimental approaches are applicable to high-throughput analysis of large gene sets in complex multicellular organisms? Several groups believe that information on gene expression is invaluable and should be generated for every gene in the genome. The expression pattern of a gene in a multicellular organism is a basic feature of the biological function of any gene, whatever its function. The more contexts in which expression is examined, the greater the insight into function. In principle, a definitive and comprehensive atlas of the expression pattern of every gene in the genome can be generated.

One experimental approach in which thousands of genes can be analysed in parallel is to isolate messenger RNA and to display the gene-expression profile on a chip. When this technique is applied to tissues, data are lost because aspects of the three-dimensional structures of multiple cell types are destroyed in the biochemical extraction. Data from in situ analyses contain more detailed information about each gene, but the generation of these data is serial and significantly slower.

Gene expression is being systematically examined at the transcriptional level by several groups, for instance in the 9.5-day-old mouse embryo and in adult tissues (see Box). Two other papers in this issue³,⁴ report large-scale analyses of gene expression in embryonic and adult stages, but so far have examined just 0.5% of the genes in the genome, the homologues of the genes on chromosome 21. Transcription studies in situ have relatively limited resolution, and the tissues constituting a multicellular organism are complex mixtures of different cell types. Unless each cell is individually visualized for gene expression in combination with histological criteria, important information relating to biological function is lost, for instance the subcellular compartment(s) occupied by a protein.

The Sanger Institute's Atlas project is being established to systematically examine the expression pattern of every gene product at tissue-, cellular- and subcellular-level resolution, to provide a permanent, definitive and accessible record of the molecular architecture of normal tissues and cells. The ultimate goal is to define protein expression patterns for all 30,000 mouse genes in hundreds of different tissues, all gathered in archival data sets to support research projects worldwide. Data will be collected electronically and archived with a vocabulary allowing complex queries.

A protein's location in a cell and a tissue provides important clues about its function, but such data are still insufficient to reveal its physiological role in vivo, or the temporal and spatial specificity of gene products for an as-yet-unknown functional activity. Gene-expression data are informative and can guide an experimental path, but cannot be interpreted in isolation.

Mutational analysis

One of the most informative experimental approaches for examining gene function is to analyse mutants. Spontaneous and induced mutations in the mouse have been studied for more than 100 years, but in the past decade there has been an explosion in their use.

The isolation of embryonic stem (ES) cells⁵ and demonstration that these cultured cells can recolonize the mouse germ line⁶ were the two fundamental discoveries that led to the first 'knockout' mouse in 1987, through a genetic modification that had been engineered in vitro. This heralded a golden era for mouse genetics. Today, it is possible to engineer mice with genetic changes as subtle as a single nucleotide substitution or with major alterations of the genome such as the deletion or duplication of millions of base pairs⁷. The genetic tractability of ES cells has made the mouse uniquely accessible for genetic studies compared with every other multicellular organism.

Despite this success, the combined output of the mouse genetics community over the past 10 years has described mutations in just a few thousand genes by this method, 10–15% of the predicted gene content of the organism. Can this rate be speeded up so that it will not take 50 years to mutate and analyse the remaining 85–90%? Several leading mouse genetics laboratories have begun to discuss plans to generate a knockout for every gene in the mouse genome, but how will this be achieved?
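The scale of the problem can be checked with a back-of-envelope calculation. The sketch below assumes that the pace of the past decade simply continues unchanged, which the text does not claim, and it borrows the figure of 30,000 genes quoted elsewhere in this article; the function name is made up for the example.

```python
# Back-of-envelope sketch: time to finish at the average pace of the past decade.
# A constant pace is an assumption made for illustration only.

GENE_COUNT = 30_000  # gene total quoted elsewhere in the article


def years_to_finish(fraction_done: float, years_elapsed: float) -> float:
    """Years needed for the remaining genes at the average pace so far."""
    if not 0 < fraction_done < 1:
        raise ValueError("fraction_done must be between 0 and 1")
    rate_per_year = fraction_done / years_elapsed
    return (1.0 - fraction_done) / rate_per_year


if __name__ == "__main__":
    for done in (0.10, 0.15):
        genes = round(done * GENE_COUNT)
        more = years_to_finish(done, 10)
        print(f"{done:.0%} done ({genes} genes): about {more:.0f} more years")
```

Under that assumption the remaining genes would take roughly 57 to 90 years, which is the kind of timescale the question above is meant to avoid.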

Gene targeting requires detailed knowledge of gene structure to ensure that the target locus has been effectively mutated. One of the most immediate practical benefits of the assembled mouse genome sequence is the availability of detailed gene-structure information, enabling mutations to be made with a full understanding of the likely functional consequences. Knowledge of the genome sequence has also made it possible to index libraries of gene-targeting vectors, eliminating the need to screen a library to obtain a genomic clone for targeting. Library indexing by end-sequencing significantly increases the rate at which knockout mice can be generated.
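To make the idea of an indexed library concrete, the sketch below treats each clone as a pair of end-sequence positions on the assembled genome and looks up the clones that span a gene of interest. The `Clone` record, the clone names and all coordinates are hypothetical; this is one simple way such an index could be queried, not a description of any existing resource.

```python
from typing import List, NamedTuple


class Clone(NamedTuple):
    """A library clone located on the genome by its two sequenced ends."""
    name: str
    chrom: str
    start: int  # genome coordinate of the leftmost sequenced end
    end: int    # genome coordinate of the rightmost sequenced end


def clones_spanning(index: List[Clone], chrom: str,
                    gene_start: int, gene_end: int) -> List[str]:
    """Return names of clones that fully contain the given gene interval."""
    return [
        clone.name
        for clone in index
        if clone.chrom == chrom
        and clone.start <= gene_start
        and clone.end >= gene_end
    ]


if __name__ == "__main__":
    # Hypothetical index entries; names and coordinates are invented.
    index = [
        Clone("cloneA", "chr11", 100_000, 240_000),
        Clone("cloneB", "chr11", 180_000, 330_000),
        Clone("cloneC", "chr2", 150_000, 300_000),
    ]
    print(clones_spanning(index, "chr11", 190_000, 230_000))
```

Once every clone is placed on the sequence in this way, choosing a clone for a targeting vector becomes a lookup instead of a library screen.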

On target

Although the genome sequence has enhanced the rate at which targeted mutations can be generated, is this going to be fast enough? Targeting is an inherently serial process — a single experiment typically generates one type of allele. Gene trapping, on the other hand, in which genes are tagged for sequence retrieval by insertional mutagenesis, generates hundreds of different mutations from a single electroporation or viral infection. Recognizing the importance of a genome-wide gene-trap library, but unable to fund this in the academic sector, I and other colleagues set up Lexicon Genetics. Although the Lexicon resource is now quite comprehensive, the cost is significant, intellectual property rights may have to be negotiated, and some mutants will not be available for commercial reasons.

There are also gene-trap libraries in the public sector (see Box), through which about 16,000 ES cell clones are now available. In principle, these resources should be distributed with few constraints. Although their coverage is more limited and the effort less centralized than the Lexicon library, over the next couple of years this resource should expand considerably. The value of such an archive will be fully realized only if it can be exploited by the community. One key aspect of the elaboration of this resource is to ensure that the knowledge of the location of these mutations in the genome is linked with access to the physical resource (the trapped ES cell clone), so that ES cell clones can be retrieved and used to establish mice carrying these mutations.

Over the past decade, knocking out genes has provided a rich source of information about gene function — and as a result, this sequence-driven approach has been widely adopted. But there is considerable uncertainty in predicting the phenotypes that will be displayed by the mutant mice. In my own view, selection of a candidate gene for mutational analysis might as well be stochastic as based on assimilation of existing knowledge.

Sadly, our knowledge of conserved domains, expression patterns, biochemical activity, protein–protein interactions and molecular structure is inadequate to predict function. A knockout phenotype often shamelessly displays our collective ignorance about gene function. We have to accept that in many cases, sequence-directed mutagenesis may not efficiently identify genes specific to certain functions or disease — for instance, the genes involved in diabetes — by knocking out individual candidates.

Mutational analysis can be focused to identify the players in a specific process by performing a genetic screen, as widely used in organisms such as the yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans, and the fruitfly Drosophila melanogaster. Screens did not catch the imagination of the mouse community when first introduced 15 years ago because they emerged in parallel with gene-targeting approaches, which looked so promising. Over the past few years, genetic screens have become much more popular because of a few high-profile successes⁸. Currently, ENU (N-ethyl-N-nitrosourea) mutagenesis is the most widely used method for random mutagenesis, and several programmes using this approach have been initiated (see Box).

Although the 1,000-plus new mutations generated by ENU mutagenesis provide a resource for future studies, they will not add any knowledge about gene function until the underlying genetic lesions have been identified, and the molecular mechanisms relating the lesion to the observed phenotype are understood in detail. This leaves the strategy with two major bottlenecks — the identification of the mutation (typically a nucleotide substitution); and a detailed phenotypic understanding of each mutant.

Community work

Whatever method is used to generate a mutation, understanding the mechanistic cause of the observed phenotype is central to determining gene function. This understanding depends not only on knowing the mutated gene, but also on having a very detailed picture of the phenotype. Several centres are developing standardized expertise in this area, but high-throughput screens will not detect subtle variations from normal, nor will they examine mice for every possible phenotype. Some of the most valuable screens will be pursued in the context of very detailed phenotyping, possibly in the context of another genetic alteration in the background of the mice being screened.

So the job of phenotyping mutants for a specific characteristic is a task that cannot be easily delegated to a centre; rather, it is an activity for the whole community. This requires the mobility of existing strains between groups and availability of funding to pursue smaller, more scientifically focused, screens in specific areas of biological expertise. Progress in understanding the mouse genome will involve the input of diverse experimental approaches in thousands of small and a few large laboratories. The accessibility of mutants is key to this progress, and this comes with a cost — because strains will need to be maintained or archived for decades. So the avalanche of genome sequence will be followed by an explosion of mutant mice, requiring new mouse facilities to house and phenotypically evaluate this global genetic resource.


References


1. Mouse Genome Sequencing Consortium Nature 420, 520-562 (2002).
2. The FANTOM Consortium and the RIKEN Genome Exploration Research Group Phase I & II Team Nature 420, 563-573 (2002).
3. Reymond, A. et al. Nature 420, 582-586 (2002).
4. The HSA21 Expression Map Initiative Nature 420, 586-590 (2002).
5. Evans, M. J. & Kaufman, M. H. Nature 292, 154-156 (1981).
6. Bradley, A., Evans, M. J., Kaufman, M. H. & Robertson, E. J. Nature 309, 255-256 (1984).
7. Ramirez-Solis, R., Liu, P. & Bradley, A. Nature 378, 720-724 (1995).
8. Vitaterna, M. H. et al. Science 264, 719-725 (1994).

Allan Bradley is at the Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.
