One of the best ways for a neuroscientist like me to keep up to date with what colleagues are working on is to attend conferences. But on recent trips I have noticed a problem. Too few researchers are consulting and using publicly available data — my own included. What is going on?
Massive amounts of biological information are being accumulated using high-throughput sequencing techniques. Many scientists have used some of those resources, such as the Encyclopedia of DNA Elements (ENCODE) launched by the US National Human Genome Research Institute. But many more laboratories in neuroscience and other subdisciplines of cell and molecular biology generate their own data sets. These data are piling up in community databases and offer information on gene expression and regulation. Unless this information is used, it is wasted.
For instance, I study brain cells thought to be important for the maintenance of chronic pain. Called microglia, these cells are also investigated by immunologists interested in the cells’ role in, say, multiple sclerosis. Together, the sequencing results from these different fields provide a full profile of which genes these cells express.
For such studies to be published, the data must be put in a public repository for anyone to download. In the case of sequencing data, it is usually the Gene Expression Omnibus (GEO) website run by the US National Center for Biotechnology Information (NCBI). This means that any biologist can find out what microglia should look like from a molecular perspective. The same is true for many other kinds of cell, including neurons and the cell types found in human blood.
This is useful information. Knowing whether your favourite gene is active in a certain cell provides crucial clues about how to proceed with your research. That’s why funders and journals have worked so hard over the past decade to ensure that researchers don’t hoard their results jealously.
Hence my surprise when sitting through long discussions at conferences about whether gene X is found in microglia, when I can open a public database there and then on my laptop and see that it is absent from this cell type in the relevant RNA-sequencing data sets. Similarly, many papers I read or review make claims about which proteins are expressed in the cell, but these don’t match publicly available results.
Of course, people should not take others’ data as gospel. Sequencing data can be wrong. There can be systemic technical artefacts. And known biases are associated with certain approaches: for example, single-cell data sets can miss more than half the transcripts in individual samples. But simply taking no notice of deposited data is akin to ignoring several independently published replication experiments. If your results don’t agree, you should, at the very least, discuss the discrepancy, and propose a biologically valid reason for it.
Why are so many bench biologists overlooking this wealth of cell-type-specific expression data?
My hunch is there are two reasons. First, researchers underestimate how many of these data have been published over the past few years because they are being generated across so many different fields. Second, they are wary of the data. Because you need bioinformatics knowledge to generate and analyse sequencing results, people assume that they also need such expertise to locate and interpret them.
Not so. In the past five years, improvements in technology, together with stricter deposition guidelines, mean that simple Excel files commonly accompany papers. These can be downloaded in minutes from the Supplementary Information of a relevant paper, or from the ‘GEO Datasets’ tab on the NCBI website using search terms. It is like PubMed for spreadsheets. The files require minimal knowledge to browse.
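For those who would rather script the lookup than scroll through a spreadsheet, here is a minimal Python sketch of the kind of check I mean. It assumes the downloaded table has been saved as tab-delimited text with one row per gene and one column per sample. The gene names, sample names and values are invented for illustration, and the helper names `lookup_gene` and `mean_expression` are my own, not part of any GEO tool. Real files vary in layout, so the column names would need adjusting.

```python
import csv
import io

# Toy table standing in for a downloaded expression spreadsheet saved as
# tab-delimited text. Gene names, sample names and values are invented.
TOY_TABLE = (
    "gene\tmicroglia_rep1\tmicroglia_rep2\tneuron_rep1\n"
    "GENE_A\t152.3\t140.8\t0.4\n"
    "GENE_B\t0.0\t0.2\t87.5\n"
)


def lookup_gene(table_text, gene, gene_column="gene"):
    """Return {sample: value} for one gene, or None if the gene is not listed."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    wanted = gene.strip().upper()
    for row in reader:
        # Compare case-insensitively, because naming conventions differ
        # between data sets.
        if row[gene_column].strip().upper() == wanted:
            return {
                sample: float(value)
                for sample, value in row.items()
                if sample != gene_column
            }
    return None


def mean_expression(values, cell_type):
    """Average the samples whose column name starts with the given cell type."""
    matched = [v for name, v in values.items() if name.startswith(cell_type)]
    if not matched:
        raise ValueError(f"no samples found for {cell_type!r}")
    return sum(matched) / len(matched)
```

Calling `lookup_gene(TOY_TABLE, "GENE_B")` returns the values for each sample, and `mean_expression(values, "microglia")` averages the microglia columns. What counts as 'absent' is a judgement that depends on the data set's units and depth, so the sketch deliberately leaves any threshold to the reader.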
It is often difficult to share big data in science. Sequencing data are fairly unusual, in that it is easy to standardize, display and judge them from the outside. This is not the case for many other kinds of scientific output. For instance, resources for data sharing in brain imaging or engineering are less well developed. Obstacles include the high cost of storage — although valiant efforts have been made to overcome this, for instance in 3D neuronal anatomy.
More researchers need to be aware that they can profit from a vast library of material. To that end, I’ve made a step-by-step guide and a video on how to access public sequencing data (see go.nature.com/2lbvcts). This includes links to purpose-built browsers, such as Blueprint, where researchers can enter the name of a gene and get cell-type-specific transcriptional data.
Journals could also help, by requiring scientists to state that they have checked their own claims about gene expression against several publicly available sequencing results. And reviewers could verify these statements: spending 15 minutes searching for a few spreadsheets on GEO is not much different from spending 15 minutes on PubMed to confirm other types of statements on prior literature.
People might argue that this problem is age-old. Have we not always missed papers, gone up blind alleys and repeated work while the answer was in the library all along? Yes. But it has never been easier to avoid doing so. The data are a mouse-click away, and many more are to come. All we have to do is access them.