The Pharmacogenomics Journal
2002, Volume 2, Number 1, Pages 16-19
Perspectives
The HUGO Mutation Database Initiative
R G H Cotton1,2 and O Horaitis1 on behalf of the HUGO Mutation Database Initiative

1Genomic Disorders Research Centre, St Vincent's Hospital Melbourne, Fitzroy, Australia

2The University of Melbourne, Department of Medicine, Melbourne, Australia

Correspondence to: R G H Cotton, Genomic Disorders Research Centre, St Vincent's Hospital Melbourne, PO Box 2900, Fitzroy VIC 3065, Australia. Tel: +61 3 9288 2980 Fax: +61 3 9288 2989 E-mail: cotton@ariel.ucs.unimelb.edu.au

The Pharmacogenomics Journal (2002) 2, 16-19. DOI: 10.1038/sj/tpj/6500070

The human genome has somewhere around 30 000 genes [1]. Some genes, such as the gene mutated in cystic fibrosis, have nearly 1000 mutations causing this rare inherited disorder, so if mutations in all genes cause disease there may be up to 30 × 10^6 mutations causing single gene disorders. A more conservative figure is 3 × 10^6. If we also consider non-disease-causing polymorphisms, which are thought to occur every 200-1000 bases in the 3 × 10^9 base genome, we arrive at 3-15 million possible polymorphisms. Polymorphisms are important in common disease, in variation in drug metabolism and as markers in linkage studies. When one considers single base changes in the 3 × 10^9 bases, and that each base can change to one of three others, there are potentially 9 × 10^9 base changes possible (without insertions or deletions). Thus it is clear that there are likely to be at least tens of millions of base changes that are important to human health. In the case of single gene disorders, each mutational event needs to be characterized by at least 10 extra pieces of data, ideally more like 50 [2], whereas polymorphisms perhaps need fewer. This means that at least hundreds of millions of pieces of data are needed to fully record variation in the human genome. This is only one order of magnitude less than the task of recording the human genome sequence of 3 × 10^9 units. Thus it is in the interest of medical science that a system be put in place to systematically collect accurate variation data, store it safely, and make it available to those who need it. It was this impending scenario, and the need for a system to cope with it, that induced the formation of the HUGO Mutation Database Initiative in 1994. Its history and progress can be seen on its website [3]. It has been supported by the Human Genome Organization (HUGO) and the March of Dimes and has around 600 members in 34 countries.
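The estimates in the preceding paragraph can be reproduced with a few lines of arithmetic. The sketch below uses only the round figures quoted in the text; the per-gene average behind the conservative figure and the value chosen for "tens of millions" are assumptions made for illustration and are marked as such in the comments.

```python
# Back-of-envelope estimates using the round figures quoted in the text.
GENES = 30_000               # approximate number of human genes
GENOME_BASES = 3 * 10**9     # approximate genome size in bases

# Upper bound: every gene has about as many disease mutations as the
# cystic fibrosis gene (nearly 1000).
max_disease_mutations = GENES * 1_000               # 30 x 10^6

# ASSUMPTION (not stated in the article): the conservative figure of
# 3 x 10^6 corresponds to an average of about 100 mutations per gene.
conservative_disease_mutations = GENES * 100        # 3 x 10^6

# One polymorphism every 200 to 1000 bases.
polymorphisms_high = GENOME_BASES // 200            # 15 x 10^6
polymorphisms_low = GENOME_BASES // 1_000           # 3 x 10^6

# Each base can change to one of three others (no insertions or deletions).
possible_single_base_changes = GENOME_BASES * 3     # 9 x 10^9

# ASSUMPTION: take 30 million as an illustrative value for "tens of
# millions" of important base changes, each needing at least 10 data items.
important_changes = 30 * 10**6
data_items = important_changes * 10                 # 3 x 10^8

print(f"disease mutations: {conservative_disease_mutations:,} to {max_disease_mutations:,}")
print(f"polymorphisms: {polymorphisms_low:,} to {polymorphisms_high:,}")
print(f"possible single base changes: {possible_single_base_changes:,}")
print(f"data items: {data_items:,} (genome sequence: {GENOME_BASES:,} units)")
```

Under these assumptions the number of data items comes out ten times smaller than the number of bases in the genome, which matches the "one order of magnitude" comparison made above.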

Phenotype variation has been known and used for thousands of years; however, molecular variation was only revealed in the 1950s, once protein sequencing of mutant proteins was established. The rate of discovery of molecular variation accelerated considerably in the 1960s and 1970s when gene cloning and DNA sequencing were applied to disease genes in rare inherited disorders. The globin gene was perhaps the first human disease gene in which mutations were described. These mutations were collected by those interested in such variation and printed as a book [4]. With the advent of computer databases, such collections were built with relevant software and placed on the WWW; in the case of the globin genes, the data from the book have been transferred to a website [5]. Also in the 1960s Victor McKusick began collecting inherited syndromes [6]. He later began listing mutations in the genes that were found to be mutant, and ultimately this listing, Online Mendelian Inheritance in Man (OMIM), was placed on the Web [7].

Since these early developments the number of databases has expanded. Databases collecting mutations in single genes are called locus specific mutation databases (LSDBs), whereas those collecting mutations in all or many genes are referred to as central or general mutation databases.

Central or general mutation databases collect mutations in all genes, but those that exist differ because of the reasons for which they were initiated. They have recently been reviewed [8]. OMIM [7] began as a systematic record of inherited syndromes in print form. As genes causing the syndromes were identified, the records in this compilation began to include mutations identified in such genes. Because it cannot keep up with all mutations, it collects only the first mutation and then the most interesting ones after that. For example, as of 12 September 2001, OMIM contained 127 of the 989 cystic fibrosis mutations and 65 of the 443 phenylketonuria mutations recorded in the locus specific databases for these genes. HGMD [9] began as a research tool to document the different types of mutations occurring in humans; it ultimately led to the finding that mutations in CpG doublets were the most frequent, and then to exploration of why this was so. This collection from the published literature has become a useful compilation in which users can find whether a particular mutation has been described and, if so, by whom and where. Currently this compilation is some months behind because of a commercial agreement necessary for its funding. dbSNP [10] was initiated by the NCBI as the US government's public compilation of variation, specifically for polymorphisms (or SNPs, single nucleotide polymorphisms, as they were known from that time) discovered through a major funding initiative. However, despite the name, this database collects all variation that is submitted, including that causing single gene disorders. HGBASE [11] collects any and all variation types, regardless of whether clinical correlations are known, and carries frequency data. Strict criteria are applied to the variants hosted (single copy, mapped uniquely, fully consistent allied data features). Variations from other databases and the literature are actively sought out, giving a significantly broader scope than dbSNP. Other databases reside at the Whitehead Institute, and there is another initiated by a group of pharmaceutical companies.
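To put the OMIM figures quoted above in proportion, the following minimal sketch computes the fraction of LSDB-recorded mutations that OMIM held for the two diseases. The counts are those given in the text; the function and variable names are illustrative only.

```python
# Counts quoted in the text (12 September 2001): (in OMIM, in the LSDB).
counts = {
    "cystic fibrosis": (127, 989),
    "phenylketonuria": (65, 443),
}

def coverage(in_omim: int, in_lsdb: int) -> float:
    """Fraction of the LSDB's mutations that are also listed in OMIM."""
    return in_omim / in_lsdb

for disease, (in_omim, in_lsdb) in counts.items():
    print(f"{disease}: {coverage(in_omim, in_lsdb):.1%} of LSDB mutations in OMIM")
# cystic fibrosis: 12.8% of LSDB mutations in OMIM
# phenylketonuria: 14.7% of LSDB mutations in OMIM
```

In both cases OMIM held roughly one in seven or eight of the mutations known to the corresponding locus specific database.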

As a general rule, the central databases that record variation causing single gene disorders record only published variation. The SNP database tends to record published SNPs as well as large numbers submitted from major funding efforts. Collection from the literature may not be sustainable in the long term, when millions of mutations will need to be documented.

The listing of the mutations in the globin gene(s) was in fact the first locus specific mutation database (LSDB), where the main author was interested in collecting the details of the mutation and the phenotype. Today there are around 260 LSDBs mounted on nearly 100 websites [2]. These databases vary in almost every aspect (except those on the same website, whose characteristics are similar), because not only do they use 10 or so different software types but their initiators have also had different interests and different objectives in mind. Also, some are better funded than others and so appear more professional. There are three main types of LSDBs: those focusing on the mutation only and describing only the first example of each, eg the PAH database [12]; those cataloguing patients with specific diseases and noting the mutations, eg MUTBASE [13]; and those cataloguing somatic mutations, eg TP53 [14].

Mutation View [15] was originally described as an integrated system of central databases and LSDBs. It does catalogue each published case of a particular mutation.

There are two major differences between LSDBs and central databases that have important consequences for their utility for specific purposes. First, LSDBs are run by experts in the gene involved; second, most of them collect unpublished mutations. As a consequence of the first point, many LSDBs are better described as knowledge bases for their genes, eg PAH [12], with enormous amounts of information ranging from that for biochemists to that for patients. As a consequence of the second, a recent survey [16] showed that LSDBs contained around 100% more mutations (that is, roughly twice as many) than HGMD, which collects only published mutations.

The main overarching objective in achieving the aims of the HUGO MDI has been to combine the strengths of the central database and the LSDBs. Thus in broad terms the Initiative set out to establish a federation of LSDB curators to ensure capture and work with central databases to ensure storage and distribution on a proper bioinformatics basis.

It is an enormous challenge to aim towards the day when tens of millions of sequence variations will have been accurately collected, stored and made available to the public. However, because of the contribution such a compilation will make to medical care, medical research and biological science generally, there has been considerable interest, enthusiasm and drive at the twice-yearly Mutation Database Initiative meetings. Members of the Initiative have published extensively on the subject, focusing particularly on topics of concern (see the HUGO-MDI website [3]).

Mutation nomenclature was an early concern: whilst there were several systems in use, proper discussion with consequent recommendations had never occurred. Such a process has now resulted in a HUGO-MDI recommended nomenclature for simple changes [17], with a further discussion for more complex mutations [18].

Recommendations for content have been published [19], as well as a recommended entry form [20] produced by many members of the Initiative. Some members of the Initiative have produced off-the-shelf software for the initiation of web-based databases [21,22].

Quality control and peer review have rightly received considerable attention. One outcome has been automated mutation-by-mutation checkers [23] that simply check whether the correct amino acid change, base numbers, etc have been deduced. In addition, a set of rules has been developed for accepting a mutation as the cause of a single gene disorder [24].
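The kind of consistency check described above can be illustrated with a short sketch. This is not the code of the checker cited in the text; it is a minimal example, assuming a submission that reports a single base substitution in a coding sequence together with the claimed codon number and amino acid change. The function name, its parameters and the short test sequence are all hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def check_substitution(cds, position, ref_base, alt_base,
                       codon_number, ref_aa, alt_aa):
    """Return a list of inconsistencies found in a reported substitution.

    cds          -- coding sequence starting at the first base of the start codon
    position     -- 1-based position of the changed base within cds
    codon_number -- 1-based codon number, counting the start codon as 1
    ref_aa, alt_aa -- claimed amino acids in one-letter code
    An empty list means the report is internally consistent.
    """
    if not 1 <= position <= len(cds):
        return [f"position {position} lies outside the coding sequence"]
    problems = []
    index = position - 1
    if cds[index] != ref_base:
        problems.append(f"reference base at {position} is {cds[index]}, not {ref_base}")
    if index // 3 + 1 != codon_number:
        problems.append(f"position {position} is in codon {index // 3 + 1}, not {codon_number}")
    offset = index % 3
    codon = cds[index - offset:index - offset + 3]
    if len(codon) < 3:
        return problems + ["the affected codon is incomplete"]
    mutant = codon[:offset] + alt_base + codon[offset + 1:]
    if CODON_TABLE.get(codon, "?") != ref_aa:
        problems.append(f"codon {codon} encodes {CODON_TABLE.get(codon, '?')}, not {ref_aa}")
    if CODON_TABLE.get(mutant, "?") != alt_aa:
        problems.append(f"codon {mutant} encodes {CODON_TABLE.get(mutant, '?')}, not {alt_aa}")
    return problems

# Illustrative short sequence (hypothetical test data).
cds = "ATGGTGCATCTGACTCCTGAGGAGAAG"
print(check_substitution(cds, 20, "A", "T", 7, "E", "V"))  # consistent report
print(check_substitution(cds, 20, "A", "T", 7, "E", "K"))  # wrong amino acid claimed
```

The first call returns an empty list because changing base 20 from A to T turns codon 7 from GAG (E) into GTG (V); the second reports that GTG does not encode K. A check of this kind catches transcription errors only and says nothing about whether the change causes disease.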

The most daunting problem is how to ensure complete collection of all the variation that is being uncovered. The problem is compounded by the fact that journals generally do not accept reports of single mutations after the initial wave that follows the discovery of a disease gene. This is especially so for the 452nd mutation causing PKU, or even a group of them. Initially the journal Human Mutation accepted such reports electronically and published them electronically, but this has ceased. The Initiative members have thus been moved to plan an integrated system of receipt, review, publication, PubMed ID registration and public storage. This has resulted in a pilot receiving point, the 'WayStation' [25], an agreement for publication of data by Wiley-Liss in Human Mutation, and an agreement by HGBASE [11] to be the storage database for the data.

Another approach to ensuring mutation capture has been to encourage national databases, which are likely to be able to contact all diagnostic and research laboratories in their country, to induce collection of mutations. One such database is the Turkish database [26]. Besides ensuring mutation collection, such national (or ethnic) databases are a vital aid to the delivery of national genetic health care. Because of past and current large-scale transnational migration, such national/ethnic databases are of international importance.

Because of concerns about genetic privacy, mutation databases need to consider ethical aspects: patients may not wish to have a possibly identifiable mutation on the WWW. Another concern has been copyright and intellectual property. There have been cases of databases being taken from a site without permission and placed on another without attribution. Some mechanism is needed to avoid this problem.

Of concern to readers of this journal is the collection of SNPs now that wholesale collection by concerted public and industry funding has ceased. We expect SNPs in single gene disorder genes to be collected through the same mechanisms particularly from diagnostic laboratories.

The eternal problem of such projects is funding. In the case of MDI we have had generous support from HUGO and the March of Dimes and we are currently looking at government support (see below) and possibly commercial funding.

To ensure that we achieve our most difficult remaining aims, several initiatives are underway. Coupled with the agreement with Wiley/Human Mutation, the HUGO MDI is about to create a Society, with Human Mutation as its journal. Regarding funding, we have been invited to submit, and have submitted, a P41 grant application to NIH to fund the collection (WayStation), the storage (an update of HGBASE [11]) and the administrative office.

The current plan is to receive mutation reports at the WayStation; once they have been automatically checked (mutation checker [23]) they will be sent for expert review. The LSDB curator will be asked to do this where one exists; where one does not, an expert in the gene in question will be asked. Once the submission has been approved it will be sent first to NCBI for a PubMed ID, then to HGBASE and to the LSDB if one exists. The current scheme is illustrated in Figure 1.
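The planned submission route can be summarised as a small routing function. The sketch below follows the steps named in the preceding paragraph; what happens to a report that fails the automatic check or the expert review is not described in the text, so those two branches are assumptions, and all names in the code are hypothetical.

```python
def route_submission(gene, passes_automatic_check, approved_by_reviewer,
                     genes_with_lsdb):
    """Return the ordered list of steps a mutation report goes through.

    genes_with_lsdb -- set of gene symbols for which an LSDB exists
    """
    has_lsdb = gene in genes_with_lsdb
    steps = ["received at WayStation", "automatic mutation check"]

    if not passes_automatic_check:
        # ASSUMPTION: the text does not say what happens on failure.
        steps.append("returned to submitter")
        return steps

    # Review by the LSDB curator where one exists, otherwise by a gene expert.
    reviewer = "LSDB curator" if has_lsdb else "gene expert"
    steps.append(f"expert review by {reviewer}")

    if not approved_by_reviewer:
        # ASSUMPTION: the text does not say what happens on rejection.
        steps.append("not released")
        return steps

    steps.append("sent to NCBI for PubMed ID")
    steps.append("stored in HGBASE")
    if has_lsdb:
        steps.append("forwarded to LSDB")
    return steps

# Example: a gene with an LSDB and a gene without one.
lsdb_genes = {"PAH"}
for gene in ("PAH", "GENE_X"):
    print(gene, "->", route_submission(gene, True, True, lsdb_genes))
```

The only branching that the text itself specifies is the choice of reviewer and the final forwarding to an LSDB, both of which depend on whether an LSDB exists for the gene.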

To ensure systematic and complete capture of all variation described, there will be a need not only for voluntary work, as is now occurring in the HUGO MDI, but certainly (besides software) a need to pay key individuals to ensure the grass roots are searched for mutations and SNPs. Thus the success of our enterprise depends on funding and this will have to be either from government, from sales of updates of our data or straight commercial funding. The latter of course will need to allow the database content to be fully and immediately public.

Much has been achieved since the initiative began in November 1994. However, it is natural that what remains is the most difficult task so we will have to be patient in carrying it out. The indications are that all is in place to make it achievable. This has been and will be a community activity and we invite all who are able to assist to do so by contacting the authors.

DEFINITIONS

Uses of the words mutation, polymorphism and SNP have been problematic. Throughout biology, mutation is any base change but in clinical genetic usage mutations refer to a deleterious change causing single gene disorder. In the same clinics polymorphism is used for harmless base changes. Further confusion has been added by dbSNP at NCBI, which receives not only single nucleotide polymorphisms as 'advertised' but also other types of polymorphisms and base changes of any type causing single gene disorders. The simplest nomenclature is to call them base changes as such or refer to sequence variation without making a judgment on its effect as is made in the clinic.

DUALITY OF INTEREST

None declared.

References

1 Das M et al. Genomics 2001; 77: 71-78.

2 Claustres M et al. Genome Res (under review).

3 HUGO MDI. URL: http://www.genomic.unimelb.edu.au/mdi/.

4 Huisman THJ et al. A Syllabus of Human Haemoglobin Variants, 1st edn. The Sickle Cell Anaemia Foundation: Augusta, GA, USA.

5 The Globin Gene Server. URL: http://globin.cse.psu.edu/.

6 McKusick VA. Mendelian Inheritance in Man: Catalogs of Autosomal Dominant, Autosomal Recessive, and X-Linked Phenotypes. The Johns Hopkins Press: Baltimore, MD, USA, 1966.

7 Online Mendelian Inheritance in Man. URL: http://www3.ncbi.nlm.nih.gov/omim/.

8 Porter CJ et al. Hum Mut 2000; 15: 1236-1244.

9 Krawczak M, Cooper DN. Trends Genet 1997; 13: 121-122. URL: http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html.

10 NCBI dbSNP. URL: http://www.ncbi.nlm.nih.gov/SNP/.

11 Human Genic Bi-Allelic Sequences HGBASE. URL: http://hgbase.cgr.ki.se/.

12 Phenylalanine Hydroxylase Locus Database. URL: http://ww2.mcgill.ca/pahdb/.

13 MUTBASE. URL: http://www.uta.fi/imt/bioinfo/mutdatbas.html#idmdb.

14 TP53 database. URL: http://www.iarc.fr/P53/index.html.

15 Minoshima S et al. NAR 2001; 1: 327-328. URL: http://131.113.190.126/mutview3/mutview/.

16 Cotton RGH. Hum Mut 2000; 15: 4-6.

17 Antonarakis SE and the Nomenclature Working Group. Hum Mut 1998; 11: 1-3.

18 den Dunnen JT, Antonarakis SE. Hum Mut 2000; 15: 7-12.

19 Scriver CR et al. Hum Mut 1999; 13: 344-350.

20 Entry Form. URL: http://www.genomic.unimelb.edu.au/mdi/entry.html.

21 Universal Mutation Database Software. URL: http://www.umd.necker.fr/.

22 MuStar®. URL: http://www.hgu.mrc.ac.uk/Soft data/Mustar/.

23 DNA Mutation checker. URL: http://www2.ebi.ac.uk/cgi-bin/mutations/check.cgi.

24 Cotton RGH, Horaitis O. Hum Mut 2000; 15: 16-21.

25 WayStation Pilot. URL: http://www.centralmutations.org.

26 Turkish mutation database. URL: http://bioserver.bio.boun.edu.tr.

Figures

Figure 1 Proposed data submission to release scheme.

© 2002 Nature Publishing Group