Sherlock Holmes understood: “It is a capital mistake,” he said, “to theorise before one has data.” Data are the lifeblood of science, the foundation of innovation. Behind every great discovery is a pile of data; but, crucially, it should not be too far behind.
For more than four decades, the Protein Data Bank (PDB) has been where structural biologists keep their data close. Nearly every biology-publishing journal in the world, Nature included, requires protein structures to be deposited in the PDB before publication.
So there was considerable worry at the database when Nature accepted a molecular map of HIV’s capsid protein shell last year (G. Zhao et al. Nature 497, 643–646; 2013). The multimillion-atom complex was larger than anything then in the PDB, and the database’s team had to devise a way to make the data dump available (and useful) at short notice.
Thus it goes at the PDB — whose trove surpasses 100,000 structures this week (see page 265) — and other long-running archives that have managed to stay relevant and essential. It is not easy. Just ask the scientists, funders, technicians and others who shepherd them.
Money is often the limiting factor. Computer storage and processing power may be getting cheap as chips, but much of the expense is in paying the people (many of them highly trained scientists) who organize and verify data entries, and engage scientific communities.
There are many ways for a database to stay in the black. The three-decades-old GenBank, a clearing house for DNA sequences, is funded directly by the US government’s support of the National Center for Biotechnology Information (NCBI). By contrast, the 50-year-old Cambridge Structural Database, which stores 700,000 small-molecule structures, gets by on support from industry and around 1,300 institutes.
The PDB is actually hosted by several organizations that provide access to the same data trove, each funded independently. Gerard Kleywegt, who heads the European franchise at the European Bioinformatics Institute (EBI) in Hinxton, UK, says that healthy competition between his portal and others in the United States and Japan helps him to get grants, and keeps the database pertinent. Scientists “vote with their mouse clicks”, he says. “They go to the place where they get the best answer for their questions.”
In the 1970s, protein structures were consumed by a small community of X-ray crystallographers interested in the nitty-gritty of individual enzymes. Now scientists use a range of techniques to determine structures, and researchers of many stripes want to know how proteins behave in a larger context, such as in a malignant cancer cell. A database must change with the times, or face extinction.
The closure of a database is not so awful — as long as its useful information remains available elsewhere. In 2011, NCBI announced that it was mothballing a database that collected information about protein fragments used in proteomics experiments. A competing database run by the EBI has since swallowed up those data. But with 100,147 structures (as Nature went to press), and growing at about 200 per week, the PDB, at least, shows no sign of folding.