News
Published: 18 July 2012

Gene data to hit milestone

Monya Baker

Nature volume 487, pages 282–283 (2012)

Subjects

Genetic databases

A Correction to this article was published on 01 August 2012

This article has been updated

With close to one million gene-expression data sets now in publicly accessible repositories, researchers can identify disease trends without ever having to enter a laboratory.

DNA microarrays allow researchers to analyse the expression of a huge number of genes simultaneously. Credit: A. Nantel/Shutterstock

Purvesh Khatri sits in front of an oversized computer screen, trawling for treasure in a sea of genetic data. Entering the search term ‘breast cancer’ into a public repository called the Gene Expression Omnibus (GEO), the postdoctoral researcher retrieves a list of 1,170 experiments, representing nearly 33,000 samples and a hoard of gene-expression data that could reveal previously unseen patterns.

That is exactly the kind of search that led Khatri’s boss, Atul Butte, a bioinformatician at the Stanford School of Medicine in California, to identify a new drug target for diabetes. After downloading data from 130 gene-expression studies in mice, rats and humans, Butte looked for genes that were expressed at higher levels in disease samples than in controls. One gene was strikingly consistent: CD44, which encodes a protein found on the surface of white blood cells, was differentially expressed in 60% of the studies (K. Kodama et al. Proc. Natl Acad. Sci. USA 109, 7049–7054; 2012). The CD44 protein is not widely investigated as a drug target for diabetes, but Butte’s team found that treating obese mice with an antibody against it caused their blood glucose levels to drop.
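The cross-study comparison described above can be illustrated with a simple vote-counting sketch. This is not the method used in the cited paper; it assumes a plain fold-change threshold (the hypothetical `min_ratio`), and the gene names and expression values are toy data invented for illustration.

```python
from statistics import mean


def upregulated_genes(study, min_ratio=1.5):
    """Return genes whose mean disease expression is at least
    min_ratio times the mean control expression (assumed threshold)."""
    hits = set()
    for gene, groups in study.items():
        disease = mean(groups["disease"])
        control = mean(groups["control"])
        if control > 0 and disease / control >= min_ratio:
            hits.add(gene)
    return hits


def fraction_of_studies(studies):
    """For each gene, compute the fraction of studies in which it was flagged."""
    counts = {}
    for study in studies:
        for gene in upregulated_genes(study):
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: n / len(studies) for gene, n in counts.items()}


# Toy data: two hypothetical studies, two hypothetical genes.
studies = [
    {
        "GENE_A": {"disease": [4.0, 6.0], "control": [2.0, 2.0]},
        "GENE_B": {"disease": [1.0, 1.0], "control": [1.0, 1.0]},
    },
    {
        "GENE_A": {"disease": [3.0, 3.0], "control": [1.0, 1.0]},
        "GENE_B": {"disease": [5.0, 5.0], "control": [2.0, 2.0]},
    },
]

fractions = fraction_of_studies(studies)
for gene, frac in sorted(fractions.items()):
    print(f"{gene}: flagged in {frac:.0%} of studies")
```

A gene that is flagged in a large share of independent studies, as CD44 was, is less likely to reflect the bias of any single experiment. A real analysis would also need normalization across platforms and a proper statistical test.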

Butte and his team are now using publicly available data to answer a diverse range of questions — Khatri, for instance, hopes to discover secrets behind kidney-transplant rejection. “We don’t do wet lab experiments for discovery,” he says. Those are for validating hypotheses. The beauty of analysing data from multiple experiments is that biases and artefacts should cancel out between data sets, helping true relationships to stand out, Butte says. “There is safety in numbers.”

And those numbers are rising rapidly. Since 2002, many scientific journals have required that data from gene-expression studies be deposited in public databases such as GEO, which is maintained by the National Center for Biotechnology Information in Bethesda, Maryland, and ArrayExpress, a large gene-expression repository at the European Bioinformatics Institute (EBI) in Hinxton, UK. Some time in the next few weeks, the number of deposited data sets will top one million (see ‘Data dump’).

The result is an unprecedented resource that promises to drive down costs and speed up progress in understanding disease. Gene-sequence data are already shared extensively, but expression data are more complex and can reveal which genes are the most active in, say, liver versus brain cells, or in diseased versus healthy tissue. And because studies often look at many genes, researchers can repurpose the data sets, asking questions other than those posed by the original researchers.

Data dump (graphic). Sources: NIH, EBI

It is easy to track how many data sets are being deposited — much harder is working out how they are being used. Heather Piwowar, who studies data reuse with the National Evolutionary Synthesis Center from the University of British Columbia in Vancouver, Canada, found that 20% of data sets deposited in GEO in 2005 and 17% of those in 2007 had been cited by the end of 2010. But those rates are certainly underestimates, she says. The PubMed Central repository, which her study relied on, holds only about one-third of the relevant papers, and her algorithms identify reuse only when researchers cite database accession numbers, which many don’t do. More studies are reusing data every year, she says. “We have every reason to believe it is game-changing.”
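The limitation Piwowar describes, that reuse is detected only when a paper cites an accession number, can be seen in a minimal sketch of accession matching. This is not her algorithm; it assumes only the standard GEO accession prefixes (GSE, GSM, GPL, GDS) followed by digits, and the sample text and accession numbers are made up.

```python
import re

# GEO accessions: series (GSE), samples (GSM), platforms (GPL), data sets (GDS).
GEO_ACCESSION = re.compile(r"\b(?:GSE|GSM|GPL|GDS)\d+\b")


def find_geo_accessions(text):
    """Return the sorted, de-duplicated GEO accession numbers mentioned in text."""
    return sorted({match.group(0) for match in GEO_ACCESSION.finditer(text)})


# Made-up example text; the accession numbers are illustrative only.
paper_text = (
    "We reanalysed GSE12345 and GSE12345 together with sample GSM678901. "
    "Enrichment was assessed with GSEA on public breast cancer data."
)

accessions = find_geo_accessions(paper_text)
print(accessions)
```

A paper that says only "public breast cancer data" yields no match, which is why counts based on accession citations are likely to underestimate reuse.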

Having access to such data is “immensely valuable,” agrees Enrico Petretto, a genomicist at Imperial College London. “We would never be in a position to look across multiple tissues and species with the money we have.” But he cautions that using other people’s data can be tricky. If data sets give contradictory outcomes, it is unclear whether that is because the underlying data contradict each other or because something went wrong with the analysis. “That’s why people sometimes don’t trust this,” he says.

Change of practice

Still, few researchers are using the data to their greatest potential, says Alvis Brazma, a bioinformatician at the EBI. “Being able to reuse functional genomics data is a really new thing,” he says. Researchers rarely download more than half a dozen data sets, and most use the data only to compare with their own results. Studies that use only other scientists’ data to come up with new findings are still unusual.

That makes Butte and Khatri trailblazers. Another pioneer is Gustavo Stolovitzky, a computational biologist at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York, who has used publicly available data to train algorithms to recognize gene signatures for diseases such as lung cancer, chronic obstructive pulmonary disease (COPD) and psoriasis. Not only can the algorithms distinguish lung cancer from COPD, they can also tell squamous-cell carcinoma from adenocarcinoma. “There is enough info in existing databases to predict disease in samples that algorithms have never seen before,” Stolovitzky says.

Other efforts promise to unleash even more power from the growing repositories. In 2009, for instance, curators of ArrayExpress used their database to create the Gene Expression Atlas, which allows researchers to look at how the expression of a gene might vary across tissues, disease states and species without having to download any data.

Curators will have to adjust to the ways that data are changing, says Tanya Barrett, coordinator at GEO. A growing proportion of the data finding their way into repositories are derived from RNA sequences, which poses challenges: the files are larger, methods are still in flux and integration with conventional microarray data is difficult. But the biggest factor to limit data reuse could be cultural. Many researchers are reluctant to use data that are in different formats, or from other experimental designs or materials, says Ann Zimmerman, who studies data reuse at the University of Michigan in Ann Arbor. Familiarity could help to solve the problem, says Barrett. The more examples of data reuse that scientists see, the more ways they will find to reuse data.

Change history

26 July 2012
The graph in this story originally miscounted data sets in ArrayExpress for the years 2003–11. The graphic has now been corrected.

About this article

Cite this article

Baker, M. Gene data to hit milestone. Nature 487, 282–283 (2012). https://doi.org/10.1038/487282a

Issue Date: 19 July 2012
DOI: https://doi.org/10.1038/487282a
