Protein structures are getting regular makeovers with the help of 're-refinement' software developed by Dutch structural biologists.

The Protein Data Bank (PDB) holds nearly 53,000 three-dimensional structures of protein molecules and nucleic acids that have mainly been deciphered through X-ray crystallography. Most journals, including Nature, require such data to be deposited in the PDB if a paper with a protein structure is to be published.

But some structures are not as accurate as they could be. The data bank began in 1971, and the ability to analyse crystallographic data has improved dramatically since then.

"There are definitely errors in the PDB," says crystallographer Nenad Ban at the Swiss Federal Institute of Technology Zurich.

This has consequences for scientists who want to use the PDB to look for sites on proteins to target with small-molecule drugs, or to feed the data into molecular dynamics simulations. More profoundly, a wrong structure in the data bank could also trigger wrong ideas about how the protein works.

Human papilloma virus protein E7. Credit: LAGUNA DESIGN/SPL

To help, Gert Vriend at Radboud University Medical Centre in Nijmegen, the Netherlands, and his colleagues are writing software that they hope will eventually re-refine, automatically and at the click of a mouse, all the data deposited in the PDB.

So far, Vriend has gone through 38,000 data files with his PDB_REDO program. Earlier this year, he published the initial results of work on 16,807 files (R. P. Joosten et al. J. Appl. Crystallogr. 42, 376–384; 2009); for each he produced a new structure based on the old data. Roughly 67% of those new structures were better than the original structure as measured by a quantity known as R-free, used by crystallographers to determine structure quality.
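For readers unfamiliar with R-free, the idea can be sketched in a few lines of Python. The R-factor is the summed disagreement between observed and model-calculated structure-factor amplitudes, divided by the summed observed amplitudes; R-free is the same quantity computed only on a small set of reflections held out of refinement, and lower values indicate better agreement. This is a minimal illustration, not PDB_REDO's code: the function names are hypothetical, and it assumes the calculated amplitudes are already on the same scale as the observed ones.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|) over the given reflections."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def r_work_and_r_free(f_obs, f_calc, is_free):
    """Split reflections by the free flag and return (R-work, R-free).

    is_free[i] is True when reflection i was held out of refinement.
    Assumes both subsets are non-empty.
    """
    work_obs, work_calc, free_obs, free_calc = [], [], [], []
    for o, c, free in zip(f_obs, f_calc, is_free):
        if free:
            free_obs.append(o)
            free_calc.append(c)
        else:
            work_obs.append(o)
            work_calc.append(c)
    return r_factor(work_obs, work_calc), r_factor(free_obs, free_calc)


# Toy numbers, invented purely for illustration.
f_obs = [10.0, 20.0, 30.0, 40.0]
f_calc = [11.0, 18.0, 30.0, 44.0]
is_free = [False, False, False, True]
r_work, r_free = r_work_and_r_free(f_obs, f_calc, is_free)
```

Under this measure, a re-refined structure counts as better than the original when its R-free against the same data is lower.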

The automated program could also detect other problems caused by human error: data deposited with the wrong labels, for example, in which the intensity of a signal is labelled as amplitude. "If the intensity and amplitude are swapped, the structure doesn't make sense," says collaborator Robbie Joosten.
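Why a swapped label matters can be shown with a small Python sketch. The measured intensity of a reflection is proportional to the square of its structure-factor amplitude, so amplitudes are recovered by taking a square root. The function name and the toy values below are hypothetical, and clamping negative intensities to zero is a simplification made for this illustration rather than a description of how any particular program treats noisy data.

```python
import math


def amplitudes_from_intensities(intensities):
    """Convert intensities to amplitudes using |F| = sqrt(I).

    Noise can make a measured intensity slightly negative; this sketch
    simply clamps such values to zero before taking the square root.
    """
    return [math.sqrt(max(i, 0.0)) for i in intensities]


# Toy intensities, invented purely for illustration.
intensities = [4.0, 25.0, 100.0]
amplitudes = amplitudes_from_intensities(intensities)

# If the intensities were mislabelled as amplitudes, the strongest
# reflection would appear 25 times the weakest instead of 5 times.
ratio_if_mislabelled = max(intensities) / min(intensities)
ratio_correct = max(amplitudes) / min(amplitudes)
```

Because the two quantities differ by a square, a model refined against mislabelled values is being fitted to data with badly distorted relative strengths.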

Vriend runs his program on all new entries in the PDB every two weeks and sends the PDB a monthly report flagging problems. Administrators can correct small problems like names of labels being swapped; bigger problems are added to an ongoing list of things to fix.

Maintenance for the PDB is spread over three sites: the Research Collaboratory for Structural Bioinformatics based at Rutgers University in Piscataway, New Jersey; the European Bioinformatics Institute (EBI), in Hinxton, UK; and the Japan Science and Technology Agency in Tokyo.

Helen Berman of Rutgers, who runs the US part of the PDB, says that the data bank welcomes efforts to improve deposited data. "This is exactly the vision we had when we started," she says. Staff at the data bank run a standard set of quality checks before depositing a structure and flag any problems to the researcher who submitted it, but "we're not the data police". The data bank doesn't reject an entry even if its advice on improving a structure is ignored.

Vriend occasionally contacts the scientists who deposited data that he has refined, with mixed reactions. "Sometimes people are very grateful, and sometimes they are insulted," he says.


Occasionally, the program can cause researchers to change interpretations of their data. Annalisa Pastore, a molecular biologist at the National Institute for Medical Research, London, UK, recalls asking Vriend to validate a protein structure she had worked out from her nuclear magnetic resonance (NMR) data. Vriend told her she had got it wrong, and she took a closer look. It turned out that she wasn't wrong, but that she had uncovered a histidine residue that was unusually buried within the protein. "Gert correctly focused our attention to this residue," she says. "In the end we could definitely say the structure was right."

Pastore says researchers might be more careful about submitting their structures to the data bank if they think re-refinement software might be checking up on them.

Vriend is not the only one looking at data-bank quality. At the Lawrence Berkeley National Laboratory in California, computational biologist Paul Adams is testing the PHENIX crystallography software he develops on PDB data before sharing the software with other academics or licensing it to companies. "We [want to] make sure the software we are giving to people can do the right thing," he says. Adams doesn't make his results public, but says he has noticed that his software often improves an original structure assignment.

Vriend hopes his re-refinement software will eventually be linked to the PDB so that a user could click through from the data bank to obtain the most up-to-date protein structure. Gerard Kleywegt, who took over the European PDB operations at the EBI late last month, says that this will probably be implemented at some point.

Even so, the software is not sophisticated enough to automatically fix problems that are more than cosmetic, Kleywegt says. More serious problems, such as amino-acid side chains that have been assigned to the wrong location, require manual intervention.

"I see this as a first step," he says.