The steady march of the crowdsourcing movement has reached molecular biology in force. Earlier this week, bioinformatics experts announced the creation of a Wikipedia-style database that allows anyone to create and amend a repository of complex biochemical pathways. The database, called WikiPathways, joins WikiGene, Protein Data Bank Wiki, and WikiProteins, not to mention a host of other wikis used by individual labs and smaller communities of researchers.

Crowdsourcing relies on the goodwill and enthusiasm of large numbers of volunteers to maintain a useful online collection of knowledge — most notably in Wikipedia.

WikiPathways: crowdsourcing molecular biology. Credit: WikiPathways

The recent explosion of wikis (named after the Hawaiian word for ‘quick’) covering molecular biology is heralded by many as a new and necessary way to manage the enormous amounts of data flooding in from labs. Over the past few years, faster DNA sequencing machines and refined protein-analysis methods have enabled increasingly ambitious projects. As a result, the number and complexity of databases trying to organize these data have multiplied.

Some databases rely on expert curators to annotate individual genes and proteins with information culled from the scientific literature, although many have struggled to keep up with the surge of new data.

Many researchers have raised concerns about annotation errors that creep into these databases. On his blog 'Sandwalk', biochemist Laurence Moran of the University of Toronto, Canada, has referred to the errors in DNA sequence databases as “one of the ugly little secrets of molecular biology”. These mistakes can percolate up into the scientific literature, perpetuating the errors indefinitely.

The solution, some biologists argue, is to adopt a crowdsourcing model in which anyone is free to add their own annotations and corrections to a given gene or pathway. “Right now, it’s the large sequencing centres that are distributing the information,” says Andrew Su, a bioinformaticist at the Genomics Institute of the Novartis Research Foundation in San Diego, California. “The scientist working on their one protein at the bench doesn’t have a way to participate in that process.”

Harnessing the long tail

Not everyone is comfortable with the idea. In a letter to Science in March, 256 researchers asked that GenBank, a US National Institutes of Health repository for molecular biology data, incorporate a wiki component as a way to combat errors [1]. But GenBank declined, reportedly concerned that some users would fail to distinguish between verified, ‘gold-standard’ annotation and the relative anarchy of Wikipedia-style entries.

That hesitancy hasn’t stopped bioinformaticists from starting wiki databases of their own. On 28 May, researchers unveiled a protein annotation database called WikiProteins [2]. The database, spearheaded by Barend Mons at the Erasmus Medical Centre in Rotterdam, the Netherlands, stops short of becoming a chaotic wiki free-for-all by requiring that data be entered into specific categories, such as protein function, and by noting those who participate in the editing.

This structured approach makes the data easier to mine, but some argue that the structure can limit participation. “The more structure you add, it can become an activation barrier to having people contribute content,” says Su.

Survival of the wikiest

Instead, Su and his collaborators have taken a much more grass-roots approach. They simply created Wikipedia pages corresponding to individual human genes – about 7,500 in all – and modified the format of the 650 gene pages that already existed, before inviting researchers to annotate each gene with information about its function and the role of the protein it encodes [3]. The result is more chaotic, and more difficult for computer programs to mine for data, but Su hopes that it will provide a complementary approach to the more structured WikiProteins.

WikiPathways, meanwhile, has had to be more structured. Created by bioinformaticists Bruce Conklin of the University of California, San Francisco, and Chris Evelo of Maastricht University, the Netherlands, and their colleagues, the program relies on Wikipedia’s software but adds specialized code for creating models of molecular pathways [4]. It is the first of the wikis to take on such pathways – a particularly complex field for annotators to tackle.

Building a community

Conklin has experience in the area. In 2001 he and his collaborators released a free piece of software called the Gene Map Annotator and Pathway Profiler (GenMAPP) that allowed researchers to assemble information about complex molecular pathways into a simple schematic. Although the program was widely used, Conklin says that scientists rarely sent their schematics back to the team so that the results could be built into a library. “We just thought the community would self organize and it doesn’t,” he says.

In the short term, the proliferation of crowdsourced databases may also have created a sense of wiki fatigue. In 2006, Kai Wang, then a bioinformatics graduate student at the University of Washington in Seattle, wrote a letter to Nature [5] urging the community to create wiki-based gene annotation tools. Now a postdoc at the Children’s Hospital of Philadelphia in Pennsylvania, Wang says he has not yet contributed to any of the new wikis. “There are too many of them to invest a lot of time to test drive every single one of them,” he says. “I haven’t found a website yet that really promises to be the gold standard in the future.”

Both Su and Conklin realize that they will have to work hard to build a community around their wikis to make sure that they self-perpetuate. Many early biology databases were abandoned when their creators lost interest or funding, and funding agencies such as the National Science Foundation and National Institutes of Health are increasingly interested in ensuring that databases live on past their creators, says Conklin. “WikiPathways would live on if I got hit by a truck tomorrow,” he says. “Or if our entire team were to decide to go work for Google.”