An unprecedented repository of disease-related data is bringing together information about the genes, health and lifestyles of thousands of subjects studied over many years. The web-based portal will allow any interested investigator to search across multiple epidemiological studies, in the hope of identifying new links to disease.

The 'database of Genotype and Phenotype' (dbGaP) is funded by the US National Institutes of Health (NIH) and was launched earlier this month. So far, it contains data from two big studies that aim to link information about genetic make-up (genotype), physical characteristics including health (phenotype) and lifestyle to particular diseases. These are the 600-subject Age-Related Eye Diseases Study and the 2,573-subject Parkinsonism Study.

Starting next year, the repository will also contain data from the huge Framingham Heart Study, which has followed 14,000 subjects over three generations since 1948. These will be joined by information from the NIH's recently launched Genetic Association Information Network (GAIN), a public–private partnership that genotypes samples from existing studies. Other US and international databases will follow.

It is the first time that some of these big studies, such as the Framingham study, have been available to interested parties. And researchers will now be able to mine the vast stores of genetic, phenotypic and study-protocol data simultaneously. “This is really going to change the scope and efficiency of access to the data,” says Chris O'Donnell, associate director of the Framingham study.

The repository is part of a larger push towards genome-wide association studies at the NIH, and similar efforts are under way elsewhere. For example, Britain's Wellcome Trust Case Control Consortium began collecting genotype data for such studies in 2005. Rather than studying only a few genes or gene markers of interest, as researchers have done in the past, genome-wide association studies try to link features in individuals suffering from certain diseases to variations in a genome-wide set of genetic markers. It is hoped that this approach will identify new associations, and the huge sample sizes in the NIH repository should give such studies more power than ever before.

There are technical caveats. If gene variants for a particular disease are very rare, or a disease risk comes from hundreds of different genes, then links could still go undetected. Furthermore, data might not have been treated in the same way across different studies, and so may not be strictly comparable. “Controls for schizophrenia might not be as good for diabetes,” notes GAIN investigator Pablo Gejman, director of the Center for Genetics in Psychiatry at Northwestern University in Evanston, Illinois. But it is hoped that investigators will gain major insights by looking across studies for genetic variants linked to, for example, blood pressure, which is measured in many studies.

The launch of the dbGaP comes as the NIH wrestles with how best to monitor and disseminate the unprecedented amount of clinical and biological data that studies are now producing. The agency is working on a policy that will govern all the genome-wide association studies that it funds.

There is particular focus on addressing concerns over privacy. Some people worry about the consequences of, say, insurance companies or law-enforcement officials getting hold of the data. The dbGaP's regulations on data access mirror the proposed NIH-wide rules: although study documents, protocols and summary data will be freely available online, researchers will have to apply for access to individual genotype and phenotype results, and these data will lack any information that could identify the subjects.

Also at issue is how much of a publishing advantage should be given to the researchers who may have devoted the best part of their careers to collecting the patient data. The proposed NIH policy calls for a nine-month period during which only the original study investigators can submit a paper for publication based on the data, although all researchers are free to analyse them immediately. In the case of the dbGaP, that period might vary according to the study. Framingham study investigators, for example, will have a 12-month window, according to O'Donnell.