Purvesh Khatri sits in front of an oversized computer screen, trawling for treasure in a sea of genetic data. Entering the search term ‘breast cancer’ into a public repository called the Gene Expression Omnibus (GEO), the postdoctoral researcher retrieves a list of 1,170 experiments, representing nearly 33,000 samples and a hoard of gene-expression data that could reveal previously unseen patterns.
That is exactly the kind of search that led Khatri’s boss, Atul Butte, a bioinformatician at the Stanford School of Medicine in California, to identify a new drug target for diabetes. After downloading data from 130 gene-expression studies in mice, rats and humans, Butte looked for genes that were expressed at higher levels in disease samples than in controls. One gene was strikingly consistent: CD44, which encodes a protein found on the surface of white blood cells, was differentially expressed in 60% of the studies (Proc. Natl Acad. Sci. USA 109, 7049–7054; 2012). The CD44 protein is not widely investigated as a drug target for diabetes, but Butte’s team found that treating obese mice with an antibody against it caused their blood glucose levels to drop.
Butte and his team are now using publicly available data to answer a diverse range of questions — Khatri, for instance, hopes to discover secrets behind kidney-transplant rejection. “We don’t do wet lab experiments for discovery,” he says. Those are for validating hypotheses. The beauty of analysing data from multiple experiments is that biases and artefacts should cancel out between data sets, helping true relationships to stand out, Butte says. “There is safety in numbers.”
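The cross-study tally described above, counting how often a gene is expressed at higher levels in disease samples than in controls, can be illustrated with a rough sketch. This is not Butte's actual pipeline: the function name, the fold-change threshold and the toy numbers are all hypothetical, and a real meta-analysis would add normalization and statistical testing within each study.

```python
from collections import Counter

def fraction_of_studies_upregulated(studies, min_log2_difference=1.0):
    """For each gene, return the fraction of studies in which its mean
    log2 expression in disease samples exceeds the control mean by at
    least min_log2_difference (a hypothetical cut-off)."""
    studies_measuring = Counter()   # studies in which the gene was measured
    studies_upregulated = Counter() # studies in which it passed the cut-off
    for study in studies:
        for gene, (disease_mean, control_mean) in study.items():
            studies_measuring[gene] += 1
            if disease_mean - control_mean >= min_log2_difference:
                studies_upregulated[gene] += 1
    return {
        gene: studies_upregulated[gene] / studies_measuring[gene]
        for gene in studies_measuring
    }

# Invented toy data: each study maps a gene to
# (mean log2 expression in disease, mean log2 expression in control).
# GENE_B is a made-up placeholder name.
toy_studies = [
    {"CD44": (9.0, 7.5), "GENE_B": (6.0, 6.1)},
    {"CD44": (8.2, 7.0), "GENE_B": (7.5, 5.0)},
    {"CD44": (7.1, 7.0), "GENE_B": (5.0, 5.2)},
]

fractions = fraction_of_studies_upregulated(toy_studies)
for gene, fraction in sorted(fractions.items()):
    print(f"{gene}: up-regulated in {fraction:.0%} of studies")
```

A gene that clears the cut-off in most studies, despite differences in species, laboratories and platforms, is the kind of consistent signal Butte describes. Simple vote-counting like this ignores effect sizes and sample numbers, so it is best read as a first screen rather than a finished analysis.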
And those numbers are rising rapidly. Since 2002, many scientific journals have required that data from gene-expression studies be deposited in public databases such as GEO, which is maintained by the National Center for Biotechnology Information in Bethesda, Maryland, and ArrayExpress, a large gene-expression repository at the European Bioinformatics Institute (EBI) in Hinxton, UK. Some time in the next few weeks, the number of deposited data sets will top one million (see ‘Data dump’).
The result is an unprecedented resource that promises to drive down costs and speed up progress in understanding disease. Gene-sequence data are already shared extensively, but expression data are more complex and can reveal which genes are the most active in, say, liver versus brain cells, or in diseased versus healthy tissue. And because studies often look at many genes, researchers can repurpose the data sets, asking questions other than those posed by the original researchers.
It is easy to track how many data sets are being deposited — much harder is working out how they are being used. Heather Piwowar, who studies data reuse with the National Evolutionary Synthesis Center from the University of British Columbia in Vancouver, Canada, found that 20% of data sets deposited in GEO in 2005 and 17% of those in 2007 had been cited by the end of 2010. But those rates are certainly underestimates, she says. The PubMed Central repository, which her study relied on, holds only about one-third of the relevant papers, and her algorithms identify reuse only when researchers cite database accession numbers, which many don’t do. More studies are reusing data every year, she says. “We have every reason to believe it is game-changing.”
Having access to such data is “immensely valuable,” agrees Enrico Petretto, a genomicist at Imperial College London. “We would never be in a position to look across multiple tissues and species with the money we have.” But he cautions that using other people’s data can be tricky. If data sets give contradictory outcomes, it is unclear whether that is because the underlying data contradict each other or because something went wrong with the analysis. “That’s why people sometimes don’t trust this,” he says.
Change of practice
Still, few researchers are using the data to their greatest potential, says Alvis Brazma, a bioinformatician at the EBI. “Being able to reuse functional genomics data is a really new thing,” he says. Researchers rarely download more than half a dozen data sets, and most use the data only to compare with their own results. Studies that use only other scientists’ data to come up with new findings are still unusual.
That makes Butte and Khatri trailblazers. Another pioneer is Gustavo Stolovitzky, a computational biologist at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York, who has used publicly available data to train algorithms to recognize gene signatures for diseases such as lung cancer, chronic obstructive pulmonary disease (COPD) and psoriasis. Not only can the algorithms distinguish lung cancer from COPD, they can also tell squamous-cell carcinoma from adenocarcinoma. “There is enough info in existing databases to predict disease in samples that algorithms have never seen before,” Stolovitzky says.
Other efforts promise to unleash even more power from the growing repositories. In 2009, for instance, curators of ArrayExpress used their database to create the Gene Expression Atlas, which allows researchers to look at how the expression of a gene might vary across tissues, disease states and species without having to download any data.
Curators will have to adjust to the ways that data are changing, says Tanya Barrett, coordinator at GEO. A growing proportion of the data finding their way into repositories are derived from RNA sequences, which poses challenges: the files are larger, methods are still in flux and integration with conventional microarray data is difficult. But the biggest factor to limit data reuse could be cultural. Many researchers are reluctant to use data that are in different formats, or from other experimental designs or materials, says Ann Zimmerman, who studies data reuse at the University of Michigan in Ann Arbor. Familiarity could help to solve the problem, says Barrett. The more examples of data reuse that scientists see, the more ways they will find to reuse data.