The human genome has somewhere around 30000 genes.1 If we consider that some genes, such as the cystic fibrosis gene, have nearly 1000 mutations causing the associated rare inherited disorder, there may be up to 30 × 10⁶ mutations causing single gene disorders if mutations in all genes cause disease. A more conservative figure is 3 × 10⁶. If we also consider non-disease-causing polymorphisms, which are thought to occur every 200-1000 bases in the 3 × 10⁹ base genome, we arrive at 3-15 million possible polymorphisms. Polymorphisms are important in common disease, in variation in drug metabolism and as markers in linkage studies. When one considers single base changes in the 3 × 10⁹ bases, and that each base can change to one of three others, there are potentially 9 × 10⁹ possible base changes (without insertions or deletions). Thus it is clear that there are likely to be at least tens of millions of base changes that are important to human health. In the case of single gene disorders, each mutational event needs to be characterized by at least 10 extra pieces of data, ideally more like 50,2 whereas polymorphisms perhaps need fewer. This means that at least hundreds of millions of pieces of data are needed to fully record variation in the human genome. This is only one order of magnitude less than the task of recording the human genome sequence of 3 × 10⁹ units. Thus it is in the interest of medical science that a system be put in place to collect accurate variation data systematically, store it safely, and make it available to those who need it. It was this impending scenario, and the need for a system to cope with it, that induced the formation of the HUGO Mutation Database Initiative in 1994. Its history and progress can be seen on its website.3 It has been supported by the Human Genome Organization (HUGO) and the March of Dimes and has around 600 members in 34 countries.
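The order-of-magnitude estimates above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the figures quoted in the text. The variable names are illustrative, and the floor of 10 million for "tens of millions" of important base changes is an assumption made here to obtain a lower bound, not a figure from the Initiative.

```python
# Back-of-envelope estimates using only the figures quoted in the text.
GENES = 30_000                  # approximate number of human genes
MAX_MUTATIONS_PER_GENE = 1_000  # roughly the number known for cystic fibrosis
GENOME_BASES = 3 * 10**9        # size of the human genome in bases

# Upper bound on mutations causing single gene disorders: 30 x 10^6.
max_disease_mutations = GENES * MAX_MUTATIONS_PER_GENE

# One polymorphism every 1000 bases (low) to every 200 bases (high).
polymorphisms_low = GENOME_BASES // 1_000
polymorphisms_high = GENOME_BASES // 200

# Each base can change to any of the three other bases.
single_base_changes = GENOME_BASES * 3

# ASSUMPTION: 10 million is taken as the floor of "tens of millions" of
# important base changes; 10 is the minimum number of data items quoted
# for a mutation causing a single gene disorder.
IMPORTANT_CHANGES_FLOOR = 10_000_000
MIN_DATA_ITEMS_PER_CHANGE = 10
min_data_items = IMPORTANT_CHANGES_FLOOR * MIN_DATA_ITEMS_PER_CHANGE

for label, value in [
    ("max disease mutations", max_disease_mutations),
    ("polymorphisms (low)", polymorphisms_low),
    ("polymorphisms (high)", polymorphisms_high),
    ("single base changes", single_base_changes),
    ("min data items", min_data_items),
]:
    print(f"{label}: {value:.1e}")
```

Under these assumptions the lower bound on data items is 10⁸, roughly one order of magnitude below the 3 × 10⁹ units of the genome sequence itself.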
Phenotype variation has been known and used for thousands of years; however, molecular variation was only revealed in the 1950s, once protein sequencing of mutant proteins was established. The rate of discovery of molecular variation accelerated considerably in the 1960s and 1970s when gene cloning and DNA sequencing were applied to disease genes in rare inherited disorders. The globin gene was perhaps the first human disease gene in which mutations were described. These were collected by those interested in such variation and printed as a book.4 With the advent of computer databases, such collections were assembled with relevant software and placed on the WWW. In the case of the globin genes the data from the book have been transferred to a website.5 Also in the 1960s, Victor McKusick began collecting inherited syndromes6 and later began listing mutations in the genes that were found to be mutant; ultimately this listing, Online Mendelian Inheritance in Man, was placed on the Web.7
Since these early developments the number of databases has expanded. Databases collecting mutations in single genes are called locus specific mutation databases (LSDBs), whereas those collecting mutations in all or many genes are referred to as central or general mutation databases.
Central or general mutation databases collect mutations in all genes, but those that exist differ because of their reasons for being initiated. These have recently been reviewed.8 OMIM7 began as a systematic record of inherited syndromes in print form. As genes causing the syndromes were identified, the records in this compilation began to include mutations identified in such genes. Because it cannot keep up with all mutations, it collects only the first mutation and then the most interesting after that. For example, for cystic fibrosis and phenylketonuria (as of 12 September 2001), OMIM contains 127 of 989 and 65 of 443 mutations respectively, compared with the Locus Specific Databases for these genes. HGMD9 began as a research tool to document the different types of mutations occurring in humans; it ultimately led to the finding that mutations in CpG doublets were the most frequent, and then to exploration of why this was so. This collection from the published literature has become a useful compilation in which users can find whether a particular mutation has been described and, if so, by whom and where. Currently this compilation is some months behind due to a commercial agreement necessary for its funding. dbSNP10 was initiated by the NCBI as the US government's public compilation of variation, specifically for polymorphisms (or SNPs, single nucleotide polymorphisms, as they were known from that time), which were discovered through a major funding initiative. However, despite the name, this database collects all variation that is submitted, including that causing single gene disorders. HGBASE11 collects any and all variation types, regardless of clinical correlation, and also carries frequency data. Strict criteria are applied to the variants hosted (single copy, mapped uniquely, fully consistent allied data features). Variations from other databases and the literature are actively sought out, giving a significantly broader scope than dbSNP. Other databases reside at the Whitehead Institute, and there is another initiated by a group of pharmaceutical companies.
As a general rule, those central databases recording variation causing single gene disorders record only published variation. The SNP database tends to record published SNPs as well as large numbers submitted from major funding efforts. Collection from the literature may not be sustainable in the long term, when millions of mutations will need to be documented.
The listing of the mutations in the globin gene(s) was in fact the first locus specific mutation database (LSDB), where the main author was interested in collecting the details of the mutation and the phenotype. Today there are around 260 LSDBs mounted on nearly 100 websites.2 These databases vary in almost every aspect (except those on the same website, whose characteristics are similar), because not only do they use 10 or so different software types, but their initiators have also had different interests and objectives in mind. Also, some are better funded than others and so appear more professional. There are three main types of LSDB: those focusing on the mutation only and describing only the first example of each, eg the PAH database;12 those cataloguing patients with specific diseases and noting the mutations, eg MUTBASE;13 and those cataloguing somatic mutations, eg TP53.14
Mutation View15 was originally described as an integrated system of central and LSDBs. It does catalogue each published case of a particular mutation.
There are two major differences between LSDBs and central databases that have important consequences for their utility for specific purposes. First, LSDBs are run by experts in the gene involved; second, most of them collect unpublished mutations. The consequence of the first point is that many LSDBs are more like knowledge bases for their genes, eg PAH,12 with enormous amounts of information ranging from that for biochemists to that for patients. The consequence of the second is that, as a recent survey16 showed, LSDBs contained around 100% more mutations than HGMD, which collects only published mutations.
The main overarching objective in achieving the aims of the HUGO MDI has been to combine the strengths of the central database and the LSDBs. Thus in broad terms the Initiative set out to establish a federation of LSDB curators to ensure capture and work with central databases to ensure storage and distribution on a proper bioinformatics basis.
It is an enormous challenge to aim towards the day when tens of millions of sequence variations will have been accurately collected, stored and made available to the public. However, because of the contribution such a compilation will make to medical care, medical research and biological science generally, there has been considerable interest, enthusiasm and drive at the twice yearly Mutation Database Initiative meetings. Members of the Initiative have published extensively on the subject, focusing particularly on topics of concern (see HUGO-MDI website3).
Mutation nomenclature was an early concern: whilst several systems were in use, proper discussion with consequent recommendations had never occurred. Such a process has now resulted in a HUGO-MDI recommended nomenclature for simple changes17 with a further discussion for more complex mutations.18
Recommendations for content have been published19 as well as a recommended entry form20 produced by many members of the Initiative. Some members of the Initiative have produced off-the-shelf software for the initiation of web-based databases.21,22
Quality control and peer review have rightly received considerable attention. One outcome has been automated mutation-by-mutation checkers23 that simply check whether the correct amino acid change, base numbers, etc have been deduced. In addition, a set of rules has been developed for the acceptance of a mutation as causing a single gene disorder.24
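To illustrate the kind of consistency check such a tool might perform, the following is a minimal sketch in Python. It is not the checker cited above: the function name `check_substitution`, its parameters and the toy coding sequence in the usage note are hypothetical. It assumes a single-base substitution in a coding sequence, the standard genetic code, and numbering in which base 1 is the A of the initiator ATG and that codon is codon 1.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def check_substitution(cds, position, ref_base, alt_base,
                       claimed_codon, claimed_ref_aa, claimed_alt_aa):
    """Return a list of inconsistencies; an empty list means none were found.

    ASSUMPTION: `position` is 1-based from the A of the initiator ATG,
    and codon 1 is the initiator codon.
    """
    cds = cds.upper()
    if not 1 <= position <= len(cds):
        return [f"base {position} lies outside the coding sequence"]

    index = position - 1
    start = (index // 3) * 3
    ref_codon = cds[start:start + 3]
    if len(ref_codon) < 3:
        return [f"base {position} falls in an incomplete final codon"]

    problems = []
    if cds[index] != ref_base.upper():
        problems.append(
            f"reference base at {position} is {cds[index]}, not {ref_base}")

    codon_number = index // 3 + 1
    if codon_number != claimed_codon:
        problems.append(
            f"base {position} is in codon {codon_number}, not {claimed_codon}")

    # Build the mutant codon by swapping the single affected base.
    offset = index % 3
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE.get(ref_codon, "?")
    alt_aa = CODON_TABLE.get(alt_codon, "?")

    if ref_aa != claimed_ref_aa:
        problems.append(
            f"codon {ref_codon} encodes {ref_aa}, not {claimed_ref_aa}")
    if alt_aa != claimed_alt_aa:
        problems.append(
            f"codon {alt_codon} encodes {alt_aa}, not {claimed_alt_aa}")
    return problems
```

With the toy sequence `ATGGAGTGA`, `check_substitution("ATGGAGTGA", 5, "A", "T", 2, "E", "V")` finds no inconsistency, since the change turns codon 2 from GAG (Glu) into GTG (Val). A real checker would also need to handle insertions, deletions and changes outside the coding region, which this sketch ignores.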
The most daunting problem is how to ensure complete collection of all variation that is being uncovered. This problem is being compounded by the fact that journals are generally not accepting reports of single mutations after the initial wave once a disease gene is discovered. This is especially so for the 452nd mutation causing PKU or even a group of them. Initially the journal Human Mutation accepted such publications electronically and published them electronically but this has ceased. The Initiative members have thus been moved to plan an integrated system of receival, review, publication, PubMed ID registration, and public storage. This has resulted in a pilot receival point, the 'WayStation'25 and agreement for publication of data by Wiley-Liss in Human Mutation and agreement by HGBASE11 to be the storage database for the data.
Another approach to ensuring mutation capture has been to encourage national databases, which are likely to be able to contact all diagnostic and research laboratories in their country, to induce collection of mutations. One such database is the Turkish database.26 Besides ensuring mutation collection, such national (or ethnic) databases are a vital aid to delivery of national genetic health care. Because of past and current large-scale transnational migration, such national/ethnic databases are of international importance.
Because of concerns about genetic privacy, mutation databases need to consider ethical aspects: patients may not wish to have their perhaps identifiable mutation on the WWW. Another concern has been copyright and intellectual property. There have been cases of databases being taken from a site without permission and placed on another without attribution. There needs to be some mechanism to avoid this problem.
Of concern to readers of this journal is the collection of SNPs now that wholesale collection by concerted public and industry funding has ceased. We expect SNPs in single gene disorder genes to be collected through the same mechanisms particularly from diagnostic laboratories.
The eternal problem of such projects is funding. In the case of MDI we have had generous support from HUGO and the March of Dimes and we are currently looking at government support (see below) and possibly commercial funding.
To ensure that we achieve our most difficult remaining aims, several initiatives are underway. Coupled with the agreement with Wiley/Human Mutation, the HUGO MDI is about to create a Society, with Human Mutation as its journal. Regarding funding, we have been invited to submit, and have submitted, a P41 grant application to NIH to fund the collection (WayStation), the storage (an update of HGBASE11) and the administrative office.
The current plan is to receive mutation reports at the WayStation; once they have been automatically checked (mutation checker23) they will be sent for expert review. The LSDB curator will be asked to do this where one exists, or an expert in the gene in question where one does not. Once the submission has been approved it will be sent first to NCBI for a PubMed ID, then to HGBASE and to the LSDB if one exists. The current scheme is illustrated in Figure 1.
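The routing just described can be sketched as a short sequence of steps. The following Python fragment is an illustration only, not the WayStation software: the names `Submission`, `choose_reviewer` and `process` are hypothetical, the automatic check and the expert's decision are passed in as placeholder functions, and the outcomes for a failed check, a missing reviewer or a rejected report are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    gene: str
    variant: str
    trail: list = field(default_factory=list)  # record of steps taken


def choose_reviewer(gene, lsdb_curators, gene_experts):
    """Prefer the LSDB curator; otherwise fall back to an expert in the gene."""
    if gene in lsdb_curators:
        return lsdb_curators[gene]
    return gene_experts.get(gene)


def process(submission, auto_check, expert_approves, lsdb_curators, gene_experts):
    trail = submission.trail
    trail.append("received at WayStation")

    # ASSUMPTION: a report failing the automatic check goes back to the submitter.
    if not auto_check(submission):
        trail.append("returned to submitter: failed automatic check")
        return trail

    reviewer = choose_reviewer(submission.gene, lsdb_curators, gene_experts)
    if reviewer is None:
        # ASSUMPTION: with no curator or expert known, the report is held.
        trail.append("held: no reviewer identified")
        return trail
    trail.append(f"sent for expert review: {reviewer}")

    if not expert_approves(submission, reviewer):
        trail.append("not approved by reviewer")
        return trail

    # Order taken from the text: NCBI first, then HGBASE, then the LSDB if any.
    trail.append("sent to NCBI for PubMed ID")
    trail.append("stored in HGBASE")
    if submission.gene in lsdb_curators:
        trail.append(f"forwarded to {submission.gene} LSDB")
    return trail
```

For example, with `lsdb_curators = {"PAH": "PAH LSDB curator"}` and checks that always pass, a PAH report ends with the step "forwarded to PAH LSDB", whereas a report for a gene without an LSDB stops after "stored in HGBASE".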
To ensure systematic and complete capture of all variation described, there will be a need not only for voluntary work, as is now occurring in the HUGO MDI, but certainly (besides software) a need to pay key individuals to ensure the grass roots are searched for mutations and SNPs. Thus the success of our enterprise depends on funding and this will have to be either from government, from sales of updates of our data or straight commercial funding. The latter of course will need to allow the database content to be fully and immediately public.
Much has been achieved since the initiative began in November 1994. However, it is natural that what remains is the most difficult task so we will have to be patient in carrying it out. The indications are that all is in place to make it achievable. This has been and will be a community activity and we invite all who are able to assist to do so by contacting the authors.