Interest in natural products (active compounds produced by organisms) for applications in biotechnology and medicine is resurging. A key technique in natural product analysis is mass spectrometry. Research in this field has been hampered, however, by the lack of an electronic mechanism for sharing mass spectrometry data, according to University of California, San Diego (UCSD) researchers Nuno Bandeira and Pieter Dorrestein. “The knowledge that is generated through structure elucidation by the natural product community is important for many in the life sciences as well as the metabolomics community,” they write in a joint e-mail. “Structures are shared, but not the data that led to the structure.”

Together, they aimed to collate, curate and make available such valuable raw data in a searchable resource by developing the Global Natural Products Social Molecular Networking knowledgebase (GNPS), available at http://gnps.ucsd.edu. “We wanted to create an infrastructure that has the ability to store, analyze and disseminate both data and knowledge in an integrated manner to make data continuously more informative,” they say. They liken GNPS to GenBank, which facilitates the sharing of DNA sequence data, and BLAST, which facilitates searching of such sequence data.

GNPS allows users to make their raw tandem mass spectrometry (MS/MS) data for natural products available through UCSD's MassIVE (Mass Spectrometry Interactive Virtual Environment) repository. It also contains a spectral library of known natural product spectra, aggregated from the developers' own libraries as well as third-party public libraries. At the time of publication, the GNPS spectral library contained more than 220,000 spectra for more than 18,000 compounds. All data sets uploaded to GNPS are subjected to an automated analysis that compares the experimental spectra against the library spectra, a process known as dereplication.
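The idea behind dereplication, comparing an experimental spectrum against library spectra, can be illustrated with a minimal sketch. The scoring GNPS actually uses is not described here; the example below assumes a simple cosine similarity over binned peak intensities, and the function names, bin width, score threshold and toy spectra are all hypothetical.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=0.5):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned

def cosine_score(peaks_a, peaks_b, bin_width=0.5):
    """Cosine similarity between two binned spectra (0 = unrelated, 1 = identical)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def dereplicate(query, library, min_score=0.7):
    """Return (name, score) for the best library match at or above min_score, else None."""
    best = None
    for name, reference in library.items():
        score = cosine_score(query, reference)
        if score >= min_score and (best is None or score > best[1]):
            best = (name, score)
    return best

# Toy data, invented for illustration only.
library = {
    "compound_A": [(100.1, 50.0), (150.3, 100.0), (200.2, 30.0)],
    "compound_B": [(90.0, 80.0), (120.4, 100.0)],
}
query = [(100.2, 48.0), (150.3, 100.0), (200.2, 33.0)]
match = dereplicate(query, library)
```

Here `match` names `compound_A`, because the query shares all three binned peaks with that library entry and none with `compound_B`. A spectrum with no entry above the threshold returns `None`, which corresponds to a spectrum that remains unidentified.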

However, the chemical structure space of natural products far exceeds the coverage of the current spectral library in GNPS (or of any other spectral library, for that matter), so most spectra in a data set for a particular organism cannot be identified. Uniquely, GNPS relies on crowdsourcing not only to help grow its spectral library but also to help annotate unknown spectra in submitted data sets. At the time of publication, contributors had added spectra representing 1,325 new compounds to the spectral library and had revised the annotations of 563 library spectra. Each spectrum added to the library is given a gold, silver or bronze quality rating depending on how it was derived.

GNPS also includes a molecular networking tool that enables users to visualize related molecules, much as sequence alignment is used to reveal related genes and other coding sequences. Network analysis can help reveal connections between data sets from disparate sources that would otherwise remain hidden.
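As a rough illustration of the networking idea, spectra can be treated as nodes and linked whenever a pairwise similarity score passes a threshold; connected groups then suggest related molecules. The sketch below is not the GNPS implementation. The similarity measure (overlap of rounded peak m/z values), the threshold, the function names and the toy spectra are all assumptions chosen to keep the example short.

```python
from itertools import combinations

def shared_peak_similarity(peaks_a, peaks_b):
    """Toy similarity: Jaccard overlap of peak m/z values rounded to integers."""
    a = {round(mz) for mz, _ in peaks_a}
    b = {round(mz) for mz, _ in peaks_b}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def build_network(spectra, similarity=shared_peak_similarity, min_score=0.5):
    """Return adjacency sets linking spectra whose similarity is >= min_score."""
    network = {name: set() for name in spectra}
    for a, b in combinations(spectra, 2):
        if similarity(spectra[a], spectra[b]) >= min_score:
            network[a].add(b)
            network[b].add(a)
    return network

def connected_groups(network):
    """Group nodes into connected components with a depth-first traversal."""
    seen = set()
    groups = []
    for start in network:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(network[node] - component)
        seen |= component
        groups.append(component)
    return groups

# Toy spectra as (m/z, intensity) pairs, invented for illustration only.
spectra = {
    "s1": [(100.0, 1.0), (150.0, 1.0), (200.0, 1.0)],
    "s2": [(100.0, 1.0), (150.0, 1.0), (214.0, 1.0)],
    "s3": [(300.0, 1.0), (350.0, 1.0)],
}
network = build_network(spectra)
groups = connected_groups(network)
```

In this toy case `s1` and `s2` share two of four distinct peaks and end up linked, while `s3` stays on its own. Because `build_network` accepts any `similarity` callable, a more realistic spectral score could be swapped in without changing the grouping logic.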

GNPS implements the concept of 'living data': as the spectral library grows, all public data sets in GNPS are periodically reanalyzed by the dereplication and molecular networking tools. This allows the data to become more annotated over time, say Bandeira and Dorrestein. Data contributors are automatically notified of new spectral matches made to their data sets, and users can 'follow' specific data sets of interest. “GNPS changes the way we interact with the data as the data starts to interact with the user,” note Bandeira and Dorrestein.
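The 'living data' loop can be pictured as a reanalysis pass that is repeated whenever the library grows and that reports only the matches that are new. The following is a minimal sketch under stated assumptions: the data layout, the `reanalyze` function and the `match` callable are hypothetical, and for brevity the sketch skips spectra that already carry an annotation rather than revising them.

```python
def reanalyze(datasets, library, annotations, match):
    """Re-run matching over every data set and return newly found matches.

    datasets:    {dataset_id: {spectrum_id: spectrum}}
    library:     whatever reference collection `match` expects
    annotations: {dataset_id: {spectrum_id: compound_name}}, updated in place
    match:       callable(spectrum, library) -> compound name or None
    """
    notifications = {}
    for dataset_id, spectra in datasets.items():
        known = annotations.setdefault(dataset_id, {})
        for spectrum_id, spectrum in spectra.items():
            if spectrum_id in known:
                continue  # simplification: existing annotations are not revisited
            hit = match(spectrum, library)
            if hit is not None:
                known[spectrum_id] = hit
                notifications.setdefault(dataset_id, []).append((spectrum_id, hit))
    return notifications

def exact_lookup(spectrum, library):
    """Stand-in matcher: spectra are plain labels looked up directly in the library."""
    return library.get(spectrum)

# Toy data, invented for illustration only.
datasets = {"ds1": {"a": "spectrum_1", "b": "spectrum_2"}}
library = {"spectrum_1": "compound_X"}
annotations = {}

first_pass = reanalyze(datasets, library, annotations, exact_lookup)

# The library grows, so the same public data set is analyzed again.
library["spectrum_2"] = "compound_Y"
second_pass = reanalyze(datasets, library, annotations, exact_lookup)
```

The first pass reports a match only for spectrum `a`. After the library gains an entry, the second pass reports the new match for spectrum `b`, which is the kind of event a contributor or follower of the data set would be notified about.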

In the future, they plan to expand GNPS to include other types of MS data, provide tools to support automated metadata capture and 3D visualization, and add analysis capabilities for specific applications such as microbiome research.

“Unfortunately too many groups in the natural product community or in the metabolomics community still do not share their data,” Bandeira and Dorrestein lament. GNPS could help change this, as individual researchers can benefit from the positive feedback loop that crowdsourcing creates in return for sharing their precious data.