Dereplication of microbial metabolites through database search of mass spectra

Natural products have traditionally been rich sources for drug discovery. In order to clear the road toward the discovery of unknown natural products, biologists need dereplication strategies that identify known ones. Here we report DEREPLICATOR+, an algorithm that improves on the previous approaches for identifying peptidic natural products, and extends them for identification of polyketides, terpenes, benzenoids, alkaloids, flavonoids, and other classes of natural products. We show that DEREPLICATOR+ can search all spectra in the recently launched Global Natural Products Social molecular network and identify an order of magnitude more natural products than previous dereplication efforts. We further demonstrate that DEREPLICATOR+ enables cross-validation of genome-mining and peptidogenomics/glycogenomics results.



SUPPLEMENTARY NOTES
Supplementary Note 1. Datasets. MSV000078839 dataset. Thirty-six strains of Streptomyces were grown on A1, MS and R5 agar, extracted sequentially with ethyl acetate, butanol and methanol, and analyzed on an Agilent 6530 Accurate-Mass Q-TOF spectrometer coupled to a C18 RP Agilent 1260 LC system (ESI ionization, CID fragmentation).
MSV000078568 dataset. A total of 317 cyanobacterial collections were extracted repeatedly with CH2Cl2:MeOH (2:1) and dried in vacuo. The extracts were fractionated into nine fractions (A-I) by silica gel vacuum liquid chromatography (VLC) using a stepwise gradient of hexanes/EtOAc and EtOAc/MeOH, and then analyzed on a Maxis Impact mass spectrometer coupled to C18 RP-UHPLC (ESI ionization, CID fragmentation).
Supplementary Note 2. Further analysis of Spectra ActiSeq dataset. We divided Spectra ActiSeq into two parts: Spectra ActiSeq-CID, consisting of 473135 spectra from the MSV000078839 dataset, and Spectra ActiSeq-HCD, consisting of 178635 spectra from the MSV000078604 dataset. DEREPLICATOR+ identified 2979 spectra (129 compounds) in Spectra ActiSeq-CID and 5215 spectra (404 compounds) in Spectra ActiSeq-HCD. Forty-five compounds were found in both the CID and HCD datasets. Supplementary Data 1 shows the compounds discovered in CID and HCD spectra at 1% FDR.
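The counts in this note can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming that the 45 shared compounds are counted once in each of the CID and HCD totals; the variable names are illustrative and are not part of DEREPLICATOR+.

```python
# Counts reported in Supplementary Note 2.
cid_spectra_total, cid_spectra_identified, cid_compounds = 473135, 2979, 129
hcd_spectra_total, hcd_spectra_identified, hcd_compounds = 178635, 5215, 404
shared_compounds = 45

# Inclusion-exclusion: compounds seen in either dataset, counted once.
unique_compounds = cid_compounds + hcd_compounds - shared_compounds
cid_only = cid_compounds - shared_compounds
hcd_only = hcd_compounds - shared_compounds

# Fraction of spectra that received an identification in each dataset.
cid_rate = cid_spectra_identified / cid_spectra_total
hcd_rate = hcd_spectra_identified / hcd_spectra_total

print(f"unique compounds: {unique_compounds}")
print(f"CID only: {cid_only}, HCD only: {hcd_only}")
print(f"identified spectra: CID {cid_rate:.2%}, HCD {hcd_rate:.2%}")
```

Under that assumption the two parts together cover 488 distinct compounds, and the fraction of identified spectra is roughly 0.6% for CID and 2.9% for HCD.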
To evaluate whether DEREPLICATOR+ incorrectly identifies common mass spectrometry contaminants as natural products, we evaluated the masses of 760 contaminants from Keller et al.
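One way such a mass-based contaminant check could be organized is sketched below. This is an illustration only: the note does not describe the actual procedure or tolerance, so the function names, the 10 ppm default and the three example masses are all assumptions (the masses are placeholders, not values taken from Keller et al.).

```python
from bisect import bisect_left, bisect_right

def ppm_window(mass, ppm):
    """Return the (low, high) mass interval for a given ppm tolerance."""
    delta = mass * ppm * 1e-6
    return mass - delta, mass + delta

def matching_contaminants(query_mass, sorted_masses, ppm=10.0):
    """Return all contaminant masses within `ppm` of `query_mass`.

    `sorted_masses` must be sorted in ascending order so that the
    tolerance window can be located by binary search.
    """
    low, high = ppm_window(query_mass, ppm)
    start = bisect_left(sorted_masses, low)
    stop = bisect_right(sorted_masses, high)
    return sorted_masses[start:stop]

# Placeholder contaminant masses (hypothetical, for illustration only).
contaminant_masses = sorted([149.0233, 278.1518, 391.2843])

hits = matching_contaminants(278.1520, contaminant_masses)
misses = matching_contaminants(300.0000, contaminant_masses)
print(hits, misses)
```

A compound whose mass falls inside the window of any contaminant would be flagged for closer inspection rather than rejected outright, since a mass match alone does not establish identity.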
Supplementary Note 3. Further analysis of Spectra library dataset.
Compounds in Spectra Library have on average 8 isomers with identical chemical formula in DNP. Among the 4360 compounds (80%) of Spectra Library that have at least one other isomer in the DNP database, DEREPLICATOR+ correctly identified 1746 (40%) (Supplementary Data 6).
To assess the capability of DEREPLICATOR+ in identifying adducts, we searched 1207 annotated spectra with sodium/potassium adducts from the GNPS spectral library using DEREPLICATOR+. Of these, 280 (23%) were correctly identified at 1% FDR, while 23 (2%) were falsely identified as non-adduct compounds (Supplementary Data 8).
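To illustrate why adduct assignment matters, the sketch below computes singly charged precursor m/z values for protonated, sodiated and potassiated forms of a neutral molecule. It is a minimal example using standard monoisotopic ion masses; the function names and the example neutral mass are assumptions, and the code does not represent how DEREPLICATOR+ handles adducts internally.

```python
# Monoisotopic masses of singly charged adduct ions (atom minus one electron), in Da.
ADDUCT_ION_MASS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989221,
    "[M+K]+": 38.963158,
}

def adduct_mz(neutral_mass, adduct):
    """m/z of a singly charged adduct of a molecule with the given neutral mass."""
    return neutral_mass + ADDUCT_ION_MASS[adduct]

def neutral_mass_from_mz(mz, adduct):
    """Neutral mass implied by an observed m/z under an assumed adduct type."""
    return mz - ADDUCT_ION_MASS[adduct]

# Hypothetical compound with a neutral monoisotopic mass of 1000.0 Da.
neutral = 1000.0
sodiated_mz = adduct_mz(neutral, "[M+Na]+")

# Interpreting the sodiated peak as [M+H]+ overestimates the neutral mass.
misassigned = neutral_mass_from_mz(sodiated_mz, "[M+H]+")
mass_error = misassigned - neutral

print(f"[M+Na]+ m/z: {sodiated_mz:.4f}")
print(f"error if treated as [M+H]+: {mass_error:.4f} Da")
```

Treating a sodium adduct as a protonated ion shifts the inferred neutral mass by about 21.98 Da (about 37.96 Da for potassium), which is large enough to point a database search at a different compound.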
To assess the capability of DEREPLICATOR+ in identifying compounds in the negative ionization mode, we analyzed 341 additional spectra in the GNPS spectral library collected in the ESI (-) mode. The DEREPLICATOR+ search correctly identified 88 (26%) of these compounds as top predictions (Supplementary Data 9).
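The percentages quoted in this note follow from the reported counts. The short check below recomputes each one; it is a consistency sketch only, and the dictionary labels are illustrative names rather than terms from the paper.

```python
def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)

# (identified, total, percentage stated in the text)
reported = {
    "compounds with isomers, correctly identified": (1746, 4360, 40),
    "adduct spectra, correctly identified": (280, 1207, 23),
    "adduct spectra, identified as non-adducts": (23, 1207, 2),
    "negative-mode spectra, correct top prediction": (88, 341, 26),
}

# Collect any entry whose recomputed percentage differs from the stated one.
mismatches = [
    label
    for label, (part, whole, stated) in reported.items()
    if percent(part, whole) != stated
]

for label, (part, whole, stated) in reported.items():
    print(f"{label}: {part}/{whole} = {percent(part, whole)}% (stated {stated}%)")
```

All four recomputed values agree with the stated percentages after rounding.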