Increased diversity of peptidic natural products revealed by modification-tolerant database search of mass spectra

Abstract

Peptidic natural products (PNPs) include many antibiotics and other bioactive compounds. While the recent launch of the Global Natural Products Social (GNPS) molecular networking infrastructure is transforming PNP discovery into a high-throughput technology, PNP identification algorithms are needed to realize the potential of the GNPS project. GNPS relies on the assumption that each connected component of a molecular network (representing related metabolites) illuminates the ‘dark matter of metabolomics’ as long as it contains a known metabolite present in a database. We reveal a surprising diversity of PNPs produced by related bacteria and show that, contrary to the ‘comparative metabolomics’ assumption, two related bacteria are unlikely to produce identical PNPs (even though they are likely to produce similar PNPs). Since this observation undermines the utility of GNPS, we developed a PNP identification tool, VarQuest, that illuminates the connected components in a molecular network even if they do not contain known PNPs and only contain their variants. VarQuest reveals an order of magnitude more PNP variants than all previous PNP discovery efforts and demonstrates that GNPS already contains spectra from 41% of the currently known PNP families. The enormous diversity of PNPs suggests that biosynthetic gene clusters in various microorganisms constantly evolve to generate a unique spectrum of PNP variants that differ from PNPs in other species.

Main

After a decline in the pace of antibiotics discovery in the 1990s, antibiotics and other natural products are again the centre of attention, as exemplified by the recent discovery of teixobactin1. Previous studies of natural products mainly relied on low-throughput nuclear magnetic resonance-based technologies requiring large amounts of highly purified material, which is often difficult to obtain. The key condition for enabling a renaissance in antibiotics research is the development of high-throughput computational discovery pipelines such as the recently launched Global Natural Products Social (GNPS) molecular network2, which already contains over one billion tandem mass spectra, a gold mine for the discovery of bioactive compounds. However, natural products identification algorithms are needed to transform antibiotics discovery into a high-throughput technology and to realize the promise of the GNPS project.

This study focuses on peptidic natural products (PNPs), an important class of natural products with an unparalleled track record in pharmacology: many antibiotics, antiviral and anti-tumour agents and immunosuppressors are PNPs. PNPs are produced by non-ribosomal peptide synthetases (NRPSs)3 and ribosomally synthesized and post-translationally modified peptide synthases (RiPPSs)4. NRPSs and RiPPSs synthesize non-ribosomal peptides (NRPs) and ribosomally synthesized and post-translationally modified peptides (RiPPs), respectively. NRPs are not directly inscribed in genomes, but instead are encoded by NRPSs using non-ribosomal code, with each A-domain in NRPS responsible for a single amino acid in NRP5,6. While RiPPs are encoded in the genome, the RiPP-encoding genes are often short, making it difficult to annotate them7.

PNP identification

Given a spectrum and a peptide database, ‘peptide identification’ refers to finding a peptide in the database that generated the given spectrum. Identification of spectra derived from PNPs8,9,10,11 is more difficult than traditional peptide identification in proteomics because many PNPs are nonlinear peptides (for example, cyclic or branch-cyclic) that contain non-standard amino acids and complex modifications.

Identification of nonlinear peptides is only one of the two major challenges in PNP identification. In many cases, a PNP is absent in the database of known PNPs, but its modified or mutated variant is present in this database. Identification of an unknown PNP from its known variants is called ‘variable identification’ (as opposed to ‘standard identification’ when a PNP is present in the database). Similar to the problem of variable identification of modified peptides in traditional proteomics12,13,14,15, variable PNP identification is difficult because the computational space of this problem is several orders of magnitude larger than for standard PNP identification.

Since most PNPs form families of related peptides, variable identification is crucial for PNP discovery. Finding variants of known PNPs is important since these variants are sometimes more effective than the most abundant representatives of PNP families that currently dominate the PNP databases. The anti-fungal drug Cancidas16 and modified variants of vancomycin17 are examples of variant PNPs that have proven to be effective in clinical applications.

Spectral networks

Given a set of PNPs P1, …, Pm, their peptide network is a graph with nodes P1, …, Pm and edges connecting two PNPs if they differ by a single modification or mutation (substitution, insertion or deletion)18. Each component in the peptide network defines a PNP family. In reality, we are not given PNPs P1, …, Pm, but only their spectra S1, …, Sm. Nevertheless, one can approximate the peptide network by constructing the spectral network on nodes S1, …, Sm, where spectra Si and Sj are connected by an edge if they are similar; for example, if they can be aligned against each other using the spectral alignment algorithm13.

Although spectral networks19 reveal spectra of related peptides without us knowing what these peptides are, they have an important limitation: they work only when one of the spectra (nodes) in the connected component of the network corresponds to an unmodified peptide from a database (referred to as an ‘unmodified parent’). As a result, ‘orphan components’ in the spectral network (components without annotated nodes) represent the ‘dark matter of PNPs’ since the spectral network propagation approach18,19 lacks the ability to interpret them in the absence of an unmodified parent (Fig. 1a).

Fig. 1: Network-based and network-independent strategies for variable PNP identification.

Variant PNPs are coloured the same as their known compounds in the database; modified or mutated amino acids are highlighted by darker colours. a, Network-based PNP identification starts with the standard identification of spectra (1) and construction of a spectral network (2). Next, the network is annotated (3) using the identified PNPs via the spectral network propagation approach. In this example, the network component on the left has a single unmodified parent (coloured green) as the related PNP, while the component on the right is an orphan. Annotation propagation (4) through the network results in two variable PNP identifications, represented by additional green nodes. b, Network-independent PNP identification relies on efficient enumeration of all PNP variants (1) and further matching of spectra against these variants using the standard identification strategy (2).

PNP identification strategies

The DEREPLICATOR algorithm11 identified many PNPs in the GNPS dataset through standard identification (without modifications) and variable identification via spectral networks (Fig. 1a). However, the limitation of the spectral network approach prevents DEREPLICATOR from finding many PNP variants. Indeed, variable identification via spectral networks works only when an unmodified parent exists in a given connected component. Since PNPs vary across diverse related bacteria, this condition does not hold for many GNPS datasets because a PNP identified in one bacterium (and present in a database) is often represented by its modified variant in another bacterium. This limitation raises the challenge of developing methods for variable PNP identification that do not rely on spectral networks.

Modification-tolerant search reveals diverse PNP variants

Since PNP databases are dominated by the most abundant representatives of PNP families, existing algorithms focusing on the identification of known PNPs annotate only a small fraction of GNPS spectra. To address this limitation, we developed VarQuest, a network-independent algorithm for modification-tolerant PNP identification (Fig. 1b).

VarQuest revealed that a vast majority of PNP families (78%) identified in GNPS were not represented by any non-modified known PNP and thus are not detectable using the spectral network approach. This observation suggests not only that PNPs are extremely diverse across evolutionarily distant microbial species, but also that PNP families rapidly evolve so that PNP variants present in one species are often mutated or modified even in closely related species. This evolution of PNP families may reflect adaptation to unique ecological niches under various pressures, not unlike the evolution of skyllamycins in Pseudomonas aeruginosa20.

The great diversity of PNP variants underscores the importance of variable PNP identification via VarQuest and reveals a limitation of the spectral network approach implemented in DEREPLICATOR (most components in the GNPS spectral network turned out to be orphans). VarQuest has now fixed this unanticipated limitation. We benchmarked VarQuest and identified an order of magnitude more PNP variants compared with existing PNP identification strategies.

Results

Brute-force approach

A novel PNP is referred to as a variant of a known PNP if it has the same topology and sequence of amino acids, except for a single modified or mutated amino acid. We focus on the identification of PNP variants with mass offset up to MaxMod (the default value MaxMod = 300 Da).

The brute-force approach to variable identification (BruteForce) is based on enumeration of all possible modifications and mutations for each peptide from the database21. Given a spectrum S and each PNP P from the database (with mass difference δ < MaxMod), it considers a modification of mass δ on all possible amino acids in P, forms a list CandidatePeptides(S) containing all such modified PNPs and finds a PNP in CandidatePeptides(S) with the best match to S. Since the resulting list CandidatePeptides(S) contains a large portion of the entire PNPdatabase, this approach is prohibitively time consuming. Various database filtering strategies and spectral alignment algorithms were developed to speed up the brute-force approach in traditional proteomics12,13,14,22,23. However, extending variable identification algorithms from linear peptides to nonlinear PNPs remains a challenge.

Spectral network approach

The spectral network approach (SpecNets) constructs the spectral network of all spectra and identifies the connected component of the spectral network that contains a given spectrum S (denoted Component(S)). It further forms the set CandidatePeptides(S) as the set of identifications of all spectra in Component(S) that were discovered using the standard identification method. Afterwards, it applies the spectral network propagation approach to CandidatePeptides(S) to perform variable identification of S. Although this approach is fast (since the CandidatePeptides(S) is typically small), it fails when Component(S) is an orphan; that is, when it does not contain any spectra originating from known PNPs.

VarQuest algorithm

The VarQuest pipeline for a single spectrum S (Fig. 2) starts with selection of a short list CandidatePeptides(S) from the PNPdatabase. Afterwards, VarQuest scores S against each PNP (with a single modification) in CandidatePeptides(S) and computes P values of resulting PNP spectrum matches (PSMs)24. A peptide with the lowest P value among all PNPs in CandidatePeptides(S) is reported as a candidate PNP that gave rise to the spectrum S.

Fig. 2: VarQuest pipeline.

For a spectrum and a PNPdatabase, VarQuest starts by scoring the spectrum (S) against the entire database (1) to form a list of candidate PNPs. All possible modifications are considered for each candidate (2) and the spectrum is scored against all variants (3) to select the highest-scoring variant per candidate PNP. Statistical significance of the scores is computed (4) and the most statistically significant PSM is reported (5).

Efficient selection of the small list of CandidatePeptides(S) is the key step of VarQuest. The standard identification approaches include a peptide P in CandidatePeptides(S) as long as Mass(S) ≈ Mass(P) with error up to Δ. Since Δ is small for high-resolution spectra, the list CandidatePeptides(S) is much smaller than the number of PNPs in the database, enabling fast standard identification but preventing detection of novel PNP variants. VarQuest detects novel PNP variants by constructing a short list of CandidatePeptides(S) that is different from the long list constructed by BruteForce, as described in the Methods.

Although VarQuest searches for unknown PNPs with a single modification (as compared to a known PNP), it has the ability to find PNPs with multiple modifications. However, in such cases, instead of reporting multiple modifications, it reports a single modification with a combined mass equal to the total mass of multiple modifications. Below, we illustrate how further analysis allows one to infer the positions and masses of multiple modifications.

Benchmarking VarQuest

We benchmarked VarQuest on five high-resolution GNPS datasets: SpectraPSEUD (~400,000 spectra from 260 Pseudomonas isolates25), \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{1}}\) (~200,000 spectra from Streptomyces7), \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{2}}\) (~500,000 spectra from Streptomyces11,26), SpectraCYANO (~11 million spectra from Cyanobacteria27) and SpectraGNPS (~130 million spectra from GNPS11). Details of these datasets are provided in Supplementary Table 1.

We matched all spectral datasets against our target database (PNPdatabase), which was constructed by combining all PNPs from AntiMarin28, DNP29, MIBiG30 and StreptomeDB31 (5,021 PNPs forming 1,582 PNP families). The results were compared with BruteForce, SpecNets and the standard identification algorithms (Table 1). The time and memory requirements of these methods are described in Supplementary Table 2. Although BruteForce can find all VarQuest identifications (for a given P value threshold) on the same set of spectra, it becomes prohibitively time consuming even on moderately sized spectral datasets such as SpectraCYANO. Note that all methods are based on DEREPLICATOR11 and that BruteForce and SpecNets failed to process the SpectraCYANO and SpectraGNPS datasets due to large memory requirements.

Table 1 Comparison of four PNP identification approaches on various spectral datasets against our PNPdatabase with 5,021 PNPs (1,582 PNP families)

We compared the number of identified PSMs and unique peptides for all methods at 0 and 5% false discovery rate (FDR) levels. To compute the FDR, VarQuest uses the concept of a decoy database extended to nonlinear peptides (see Methods). All PSMs with P values above \(10^{-10}\) were removed beforehand and the FDR was conservatively computed for the remaining PSMs. Table 1 shows a greater than tenfold increase in the number of PSMs and a fivefold increase in the number of PNPs and PNP families identified in GNPS via variable identification with VarQuest compared with the standard DEREPLICATOR at 5% FDR. Figure 3 shows the peptide network of the largest PNP family (cyclosporins) in the PNPdatabase identified by VarQuest (Supplementary Table 3).

Fig. 3: Peptide network of the cyclosporin family in the PNPdatabase extended by the newly identified cyclosporin variants.

The peptide network was constructed for 47 known cyclosporins using the GNPS interface (the Molecular Networking workflow2). Each node represents the theoretical spectrum of a cyclosporin variant. The number inside each node represents the monoisotopic mass (in Da) rounded to integers. Two nodes are connected by an edge if the corresponding theoretical spectra are similar (a cosine score of at least 0.8). Blue edges correspond to characteristic mass shifts of 14, 16, 28, 32 and 42 Da. The remaining edges are black. a, Peptide network of cyclosporin variants present in the PNP database. The red nodes are 29 cyclosporins identified in SpectraGNPS as known PNPs (both by DEREPLICATOR and VarQuest). The blue nodes are theoretical spectra of the other 18 PNPs, which are not present in GNPS in their known form and were added to the network for the sake of completeness. b, Peptide network of known and novel cyclosporin variants. The green nodes are theoretical spectra of 18 novel variants identified by VarQuest in GNPS. Each novel variant is the most statistically significant identification of the corresponding blue node (an absent, known cyclosporin PNP). Likely insertions and deletions are shown on the corresponding edges. Hiv, hydroxyisovaleric acid; mLeu, methylated leucine.

While DEREPLICATOR revealed spectra corresponding to only 8% of peptides in the PNPdatabase, VarQuest increased this to 40%. Of the 2,025 PNPs identified by VarQuest in SpectraGNPS (at 5% FDR), 1,605 have their unknown variants present in SpectraGNPS, while their known variants are absent and thus cannot be detected by the standard identification strategies.

Our analysis of the entire GNPS dataset at 5% FDR identified 648 PNP families (41% of all known PNP families). At the same time, only a small fraction of identified PNP families (143 out of 648) were identified as an unmodified parent; that is, the majority of identified PNP families (78%) were not represented by any non-modified peptides in the PNPdatabase and thus are not detectable using the spectral network approach.

Modifications and mutations identified by VarQuest

Supplementary Table 4 shows the most common mass offsets identified by VarQuest in the SpectraGNPS dataset at 5% FDR. For each mass offset, we identified its most likely position in the PNP. As expected, the most common offsets are −14 Da (demethylation), +14 Da (methylation), +28 Da (dimethylation), +18 Da (hydration) and +16 Da (hydroxylation), corresponding to either modifications/adducts or mutations. In addition, Supplementary Table 4 reveals many surprising offsets, such as −95 Da (primarily at leucine or isoleucine) and −81 Da (primarily at valine). These offsets may correspond to the combination of the amino acid loss (−113 Da for leucine or isoleucine and −99 Da for valine) and hydration (+18 Da). The abundance of such offsets suggests that the recently described phenomenon of amino acid deletions or insertions in NRPs25 (due to A-domain stuttering and skipping in NRPSs32) may be more prevalent than was previously thought. Genome mining efforts typically rule out such events due to the consecutive arrangements of A-domains in NRP synthetases.

To conservatively estimate the number of indels revealed by spectra in SpectraGNPS, we considered identified mass offsets that matched the monoisotopic masses of proteinogenic amino acids within a 0.02 Da error. This analysis revealed 217 putative insertions and 169 deletions out of 19,619 PNP variants identified by VarQuest in SpectraGNPS at 5% FDR. Our confidence in the deletions is higher than in the insertions since for each of them we also checked that the deleted amino acid is present in the known PNP structure (which reduced the initial number of potential deletions by 30% from 242 to 169).
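The offset-matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the VarQuest implementation: the function name `match_indel` and the example offset of −128.09 Da are hypothetical, and the residue masses are the standard monoisotopic values for the 20 proteinogenic amino acids.

```python
# Standard monoisotopic residue masses (Da) of the proteinogenic amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def match_indel(offset, known_sequence=None, tol=0.02):
    """Interpret a mass offset as a putative amino acid insertion or deletion.

    Hypothetical helper: a positive offset is treated as an insertion and a
    negative offset as a deletion. For deletions, candidate residues are kept
    only if they occur in the known PNP sequence (when one is supplied).
    Returns (kind, residues) or None if no residue mass matches within tol.
    """
    kind = "insertion" if offset > 0 else "deletion"
    hits = [aa for aa, mass in RESIDUE_MASS.items()
            if abs(abs(offset) - mass) <= tol]
    if kind == "deletion" and known_sequence is not None:
        hits = [aa for aa in hits if aa in known_sequence]
    return (kind, hits) if hits else None


# Illustrative offsets (not values taken from the supplementary tables).
print(match_indel(113.08))                # insertion of L or I
print(match_indel(-128.09, "IAIVKIFL"))   # deletion of K
print(match_indel(14.016))                # None: methylation, not an indel
```

Note that a 0.02 Da tolerance separates lysine (128.09496 Da) from glutamine (128.05858 Da), whereas leucine and isoleucine remain indistinguishable by mass.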

Analysis of PNP diversity

VarQuest identified 19,619 PNP variants related to 2,025 distinct PNPs in SpectraGNPS at 5% FDR. More than 70% of the identified PNPs (1,489) were found in at least two different forms. Each identified PNP was found in 9.7 different forms on average, with a maximum of 239 forms for tolybyssidin A. Our analysis adds a chemical dimension to the recently revealed PNP diversity at the biosynthetic gene cluster (BGC) level33,34. We further revealed that related bacteria are likely to produce similar PNP variants rather than identical PNPs (see Methods).

Validation of VarQuest identifications

Our analysis revealed that about 60% of the PNP variants identified by VarQuest in Pseudomonas and Streptomyces datasets are missed by DEREPLICATOR and SpecNets. We validated the most statistically significant VarQuest hits using a literature search for identified PNP origin, which should correlate with the sample origin (see Methods), and searched for BGCs by genome mining35 whenever the genome of the analysed species was available. We further analysed three identified PNP variants (referred to as Massetolide-1252, Venepeptide-2154 and Surugamide-769) in more detail (Supplementary Fig. 1).

Massetolide-1252

Massetolide A is a known NRP from Pseudomonas36 that consists of a cycle TISLSLI and a branch EL (along with a 3-hydroxydecanoic acid lipid tail of mass 171 Da) attached to the cycle via a bond connecting T in the cycle with E in the branch. We represent each branch-cyclic peptide as a concatenation of its cyclic sequence and its branch sequence, both starting from their attachment points; for example, massetolide A is represented as TISLSLI*EL. VarQuest identified massetolide A and its novel variant Massetolide-1252 (sequence TISL+113SLI*EL and mass 1,252.8 Da) with a P value of \(4.2\times 10^{-19}\) using a spectrum from Pseudomonas synxantha. The +113 Da offset corresponds to the insertion of a leucine or isoleucine residue and matches the recently identified poaeamide B with sequence TISLLSLI*EL and mass 1,252.8 Da. Note that a single run of VarQuest instantly achieved the same goal as the time-consuming semi-manual discovery of poaeamide B25.

VarQuest also rediscovered bananamides, a family of PNPs discovered in the same study25. Bananamide (referred to as Bananamide-1093) was identified with a P value of \(4.3\times 10^{-10}\) as a variant of massetolide A (sequence TIS−46LSLI*EL and mass 1,093.7 Da) using a spectrum from Pseudomonas fluorescens. While the recent study25 did not derive the amino acid sequence from this spectrum, it purified and sequenced a related PNP (named bananamide 2) with sequence TLLQLI*DL (along with a C12 3-hydroxy unsaturated acid lipid tail of mass 197 Da) and mass 1,105.7 Da (amino acids differing from massetolide A are highlighted except for a change between amino acids I and L with identical masses). While the amino acid sequences TIS−46LSLI*EL and TLLQLI*DL appear to be rather different, note that S−46LS has the same mass as LQ, suggesting that TIS−46LSLI*EL may actually correspond to TILQLI*EL with a single deleted amino acid compared with massetolide A. Note that there is only one difference with respect to the masses of the amino acids between this sequence (TILQLI*EL) and the sequence of bananamide 2 (TLLQLI*DL).

Our analysis of Bananamide-1093 suggests that bananamides emerged from the massetolides family after deletion of a single amino acid (or alternatively, massetolides emerged from bananamides after insertion of a single amino acid). Interestingly, while the PSM for Bananamide-1093 is statistically significant, PSMs for bananamides 1, 2 and 3 identified in ref. 25 have rather high P values that did not pass the VarQuest P value threshold. Remarkably, the manual analysis in ref. 25 missed the most statistically significant PSM for bananamides identified by VarQuest, illustrating the power of automated approaches to PNP identification. Moreover, after identifying Bananamide-1093, VarQuest identifies spectra of bananamides 1, 2 and 3 against Bananamide-1093 as statistically significant PSMs with P values of \(1.1\times 10^{-13}\), \(9.8\times 10^{-12}\) and \(1.1\times 10^{-16}\), respectively.

Surugamide-769

Surugamides are cyclic NRPs from marine Streptomyces11,37. VarQuest identified both the known PNP surugamide B with sequence IAIVKIFL and its novel variant IAIVK−128IFL using a spectrum from Streptomyces albus (\(P=1.7\times 10^{-19}\)). The SpecNets approach missed this compound because its connected component does not contain known surugamides.

The amino acid sequence IAIVK−128IFL of Surugamide-769 corresponds to a loss of lysine. This annotation is consistent with the arrangement of the genes in the surugamide BGC since the deleted lysine corresponds to the last A-domain in one of two genes in this BGC (Supplementary Fig. 2). Thus, Surugamide-769 represents a second piece of evidence of the same NRP synthetase producing two cyclic peptides with different numbers of amino acids, similar to poaeamide B and massetolide A25. However, further experimental validation of this hypothesis and the many other likely insertions and deletions listed in Supplementary Table 4 is needed.

Venepeptide-2154

Venepeptide is a linear ribosomal peptide M+28NVITNLLAGVVHFLGWLV that was identified from Streptomyces venezuelae38. VarQuest identified its variant M+28NVITN+31LLAGVVHFLGWLV (mass 2,154.1 Da) with a P value of \(3.2\times 10^{-15}\) using a spectrum from Streptomyces lividans. DEREPLICATOR missed this compound because GNPS does not contain a spectrum corresponding to the known venepeptide. A sequence similarity search39 of this peptide against the genome of S. lividans revealed the sequence MNLLTDILAGLVHFVGWLV (the differences with venepeptide are underlined). A match of the spectrum against this sequence resulted in a PSM with a P value of \(8.5\times 10^{-24}\) and suggested a modification of +44 Da on the M residue. Note that while VarQuest is limited to finding variants with a single modified amino acid, it was able to identify that a spectrum from S. lividans has arisen from a variant of venepeptide. Further manual analysis revealed that the Venepeptide-2154 structure differs from venepeptide in four amino acids.

Discussion

Although the launch of high-throughput natural product discovery pipelines, such as the GNPS molecular network, is an important step towards future discoveries, the lack of computational approaches is still a bottleneck for spectral identifications in the GNPS infrastructure. Currently, the GNPS spectral library, a collection of identified spectra from GNPS, represents a minuscule fraction of all GNPS spectra. While molecular networks2,19 have already resulted in discoveries of various PNPs and their variants25,40, these discoveries still require time-consuming manual follow-up analysis. Here, we demonstrate how the same goal can be achieved in a single push-of-a-button VarQuest run, replicating recent PNP discoveries and finding previously unknown PNP variants. Moreover, variable dereplication of the entire GNPS revealed both surprising diversity of PNPs and limitations of the spectral networks approach. In particular, we demonstrated that the recently discovered phenomenon of insertions and deletions of amino acids is widespread among NRPs.

There is yet another reason why variable identification is important. A recent study41 revealed many BGCs with PNPs representing the largest group of secondary metabolites encoded by these BGCs. However, the vast majority of the predicted BGC products remained unknown, reflecting the limited information available for characterized natural products and the lack of genome mining and peptidogenomics tools for matching BGCs and spectra.

While databases in traditional proteomics consist of known peptides, the ongoing genome mining efforts for PNP discovery35 generate vast databases of still-uncharacterized putative PNPs7,42,43. Since predicting an NRP encoded by an NRPS is difficult, various tools for predicting the specificities of A-domains44 output multiple rather than single candidate amino acids for each A-domain. Supplementary Fig. 2 presents three top candidate amino acids for each of eight A-domains in the surugamide-encoding NRPS, resulting in \(3^{8}=6,561\) candidate NRPs. As a result, genome mining efforts typically generate large databases of error-prone putative PNPs, and matching spectra against such databases is prohibitively time consuming. Thus, the development of fast algorithms for variable PNP identification is important for the success of genome mining efforts.

We present a VarQuest algorithm for variable PNP identification via a database search of mass spectra—the only modification-tolerant approach capable of searching the entire GNPS spectral network. Our method revealed an order of magnitude more PNPs than the standard search by DEREPLICATOR, illuminating the 'dark matter of PNPs'45. It also greatly increased the spectral library of PNPs in GNPS by identifying 41% of all known PNP families in the PNPdatabase. An iterative run of VarQuest has the potential to identify even more PNP variants with multiple modifications.

VarQuest revealed a surprising diversity of PNPs that may reflect evolutionary adaptation of various bacterial species to changing environments and competition; for example, a continuous change in the repertoire of variants of peptidic antibiotics in response to developing antibiotic resistance. It also revealed a limitation of existing NRP mining tools that were developed based on 'NRPS–single NRP' pairs as the training datasets44 aimed at predicting a single NRP. A more biologically adequate approach would be to use the training datasets 'NRPS–NRP network' that have recently become available. With the growing availability of paired genomics and mass spectrometry datasets, it is now possible to generate such training datasets using VarQuest.

Methods

Scoring PSMs

A PNP graph of a PNP P is defined as a graph with nodes corresponding to amino acids in P and edges corresponding to generalized peptide bonds11. The mass of a PNP graph (referred to as Mass(P)) is defined as the total mass of its amino acids and TheoreticalSpectrum(P) is defined as the set of masses (theoretical peaks) of all connected components of the PNP graph resulting from removal of two edges (a 2-cut in cyclic and branch-cyclic PNPs) or a single edge (a bridge in a branch-cyclic PNP)11. Note that each such removal results in two peaks with a total mass equal to Mass(P).
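For the simplest topology, a purely cyclic PNP, the definition above reduces to enumerating all contiguous arcs of the cycle. The following Python sketch illustrates this special case only; the function name `cyclic_theoretical_spectrum` is hypothetical, branch-cyclic and linear topologies are not handled, and the toy residue masses are chosen for readability rather than taken from real amino acids.

```python
def cyclic_theoretical_spectrum(masses):
    """Fragment masses obtained by removing two bonds of a cyclic peptide.

    Removing two edges of a cycle with n residues leaves two contiguous
    arcs, so every arc of length 1..n-1 starting at every position
    contributes one theoretical peak (n * (n - 1) peaks in total).
    """
    n = len(masses)
    peaks = []
    for start in range(n):
        running = 0.0
        for length in range(1, n):
            # Extend the arc by one residue, wrapping around the cycle.
            running += masses[(start + length - 1) % n]
            peaks.append(running)
    return sorted(peaks)


# Toy cyclic peptide with residue masses 1, 2 and 4 (total mass 7).
spectrum = cyclic_theoretical_spectrum([1.0, 2.0, 4.0])
print(spectrum)
```

In this toy example the peaks pair up as 1 + 6, 2 + 5 and 3 + 4, so each 2-cut indeed yields two peaks whose masses sum to Mass(P) = 7.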

Given a peptide P and a spectrum S, SPCScore(P,S) is defined as the shared peak count—the number of peaks shared between TheoreticalSpectrum(P) and S. Two peaks are shared if their masses are within a threshold ε (0.02 Da for high-resolution spectra). We compute this score only if the precursor mass of the spectrum, denoted as Mass(S), matches Mass(P) with error up to Δ (0.02 Da for high-resolution spectra).
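A minimal Python sketch of the shared peak count follows. It assumes one particular counting convention (each theoretical peak is counted once if any experimental peak lies within ε of it), which the text does not spell out; the function names `spc_score` and `precursor_matches` are hypothetical.

```python
from bisect import bisect_left


def precursor_matches(spectrum_mass, peptide_mass, delta=0.02):
    """Check that Mass(S) matches Mass(P) with error up to delta."""
    return abs(spectrum_mass - peptide_mass) <= delta


def spc_score(theoretical_peaks, spectrum_peaks, eps=0.02):
    """Count theoretical peaks with an experimental peak within eps."""
    observed = sorted(spectrum_peaks)
    shared = 0
    for t in theoretical_peaks:
        # First experimental peak that is not below t - eps.
        j = bisect_left(observed, t - eps)
        if j < len(observed) and observed[j] <= t + eps:
            shared += 1
    return shared


# Two of the three theoretical peaks have an experimental peak nearby.
print(spc_score([100.0, 200.0, 300.0], [100.01, 250.0, 299.99]))
```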

If (A1, …, An) is the list of amino acid masses in a PNP P, we define Variant(P,i,δ) as (A1,…, Ai + δ, …, An), where P and Variant(P,i,δ) have the same topology and Ai + δ ≥ 0. VariableScore(P,S) is defined as:

$$\mathop{\max }\limits_{1\le i\le | P| }{\rm{SPCScore}}({\rm{Variant}}(P,i,\omega ),S),$$
(1)

where ω is Mass(S) − Mass(P), so that the mass of the variant matches the precursor mass of the spectrum, and i varies from 1 to \(| P| \) (\(| P| \) stands for the number of amino acids in the peptide P). We define a variant of peptide P derived from a spectrum S (referred to as Variant(P,S)) as Variant(P,i,ω) of peptide P that maximizes SPCScore(Variant(P,i,ω),S) across all positions i in P.
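The maximization over positions can be illustrated with a short Python sketch for cyclic peptides. It is a sketch under assumptions rather than the VarQuest code: the offset is taken as the precursor mass of the spectrum minus the peptide mass (so that the variant's mass equals Mass(S)), positions are 0-based, ties are resolved in favour of the first position, and all function names and toy masses are hypothetical.

```python
from bisect import bisect_left


def cyclic_spectrum(masses):
    """Masses of all contiguous arcs (lengths 1..n-1) of a cyclic peptide."""
    n = len(masses)
    peaks = []
    for start in range(n):
        running = 0.0
        for length in range(1, n):
            running += masses[(start + length - 1) % n]
            peaks.append(running)
    return peaks


def shared_peaks(theoretical, observed_sorted, eps=0.02):
    """Count theoretical peaks with an experimental peak within eps."""
    count = 0
    for t in theoretical:
        j = bisect_left(observed_sorted, t - eps)
        if j < len(observed_sorted) and observed_sorted[j] <= t + eps:
            count += 1
    return count


def variable_score(masses, spectrum_peaks, precursor_mass, eps=0.02):
    """Return (best score, 0-based position) over single-residue variants."""
    observed = sorted(spectrum_peaks)
    offset = precursor_mass - sum(masses)  # assumed sign convention
    best = (-1, None)
    for i, residue in enumerate(masses):
        if residue + offset < 0:
            continue  # a residue mass cannot become negative
        variant = list(masses)
        variant[i] = residue + offset
        score = shared_peaks(cyclic_spectrum(variant), observed, eps)
        if score > best[0]:
            best = (score, i)
    return best


# Toy example: the spectrum comes from a +14 Da variant at position 1.
known = [71.0, 99.0, 113.0]
observed = cyclic_spectrum([71.0, 113.0, 113.0])
print(variable_score(known, observed, precursor_mass=297.0))
```

In this toy case the correct placement of the offset explains all six theoretical peaks, whereas placing it on either of the other residues explains only two.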

Selecting candidate peptides

Consider a peptide P and its variant P* = Variant(P,i,δ). TheoreticalSpectrum(P) and TheoreticalSpectrum(P*) share approximately half of their peaks while the remaining peaks in TheoreticalSpectrum(P*) are shifted by δ with respect to the corresponding peaks in TheoreticalSpectrum(P) (Supplementary Fig. 3). Thus, if an experimental spectrum S is produced by a peptide P* and shares N peaks with TheoreticalSpectrum(P*), we expect that \({\rm{SPCScore}}(P,S)\approx \frac{N}{2}\). However, this condition often does not hold in practice due to many noisy and missing peaks in the experimental spectra. In practice, we reduce the size of CandidatePeptides(S) by retaining all PNPs that satisfy the condition:

$${\rm{SPCScore}}(P,S)\ge \eta ,$$
(2)

for a small value η. To select the threshold η, we analysed the values of SPCScore for peptides reported by the brute-force method at various significance levels (Supplementary Table 5). Since the majority of statistically significant PSMs (P value ≤ \(10^{-10}\)) share at least 5 peaks with the corresponding known peptides (74, 80 and 72% for SpectraPSEUD, \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{1}}\) and \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{2}}\), respectively), we set the default value η = 5.

For a given spectrum S, VarQuest forms the list CandidatePeptides(S) by selecting all PNPs satisfying equation (2) (among all PNPs with mass differing from Mass(S) by at most MaxMod). Checking this condition requires computing SPCScore(P,S) values for each peptide P from the PNPdatabase Peptides. Since a naive approach (computing SPCScore for each PNP) is time consuming, VarQuest pre-processes the PNPdatabase and scores a spectrum S against the entire database at once.

Pre-processing a PNPdatabase

For a given PNPdatabase Peptides, the pre-processing starts by generating theoretical spectra for each PNP in the database (stage 1 in Supplementary Fig. 4). All peaks from all theoretical spectra are combined and sorted to form the array SortedPeaks(Peptides) (stage 2). The peaks in SortedPeaks(Peptides) are partitioned into M bins of size θ (the default values are M = 20,000 and θ = 0.2 Da). Afterwards, VarQuest constructs an indexing table Index(Peptides,M,θ) (stage 3). The table is designed in such a way that, for every i ∈ [0..M − 1], the i-th cell Index(i) contains a pointer to the smallest peak p such that p ≥ iθ.
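The three pre-processing stages can be sketched as below. This is an illustrative sketch under stated assumptions, not the VarQuest code: theoretical spectra are assumed to be given already as plain lists of peak masses keyed by a peptide identifier, the function name `build_index` is hypothetical, and one extra cell is appended to the table so that the end of the last bin is also addressable.

```python
from bisect import bisect_left


def build_index(theoretical_spectra, num_bins=20_000, bin_size=0.2):
    """Return (sorted_peaks, index) for a dict {peptide_id: [peak masses]}.

    sorted_peaks is a list of (mass, peptide_id) pairs sorted by mass;
    index[i] is the position of the smallest peak with mass >= i * bin_size.
    """
    # Stage 2: pool the peaks of all theoretical spectra and sort them.
    sorted_peaks = sorted(
        (mass, pid) for pid, peaks in theoretical_spectra.items() for mass in peaks
    )
    masses = [mass for mass, _ in sorted_peaks]
    # Stage 3: one pointer per bin, plus a final cell closing the last bin.
    index = [bisect_left(masses, i * bin_size) for i in range(num_bins + 1)]
    return sorted_peaks, index
```

With the default values the table covers masses up to M·θ = 4,000 Da; peaks above that bound would all fall behind the last pointer in this sketch.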

Scoring a spectrum against a PNPdatabase

To score a given spectrum S against all PNPs in the PNPdatabase Peptides, VarQuest iterates through all the peaks in S and stores the PNPs matching each peak in a counting set FeasiblePeptides(S) (Supplementary Fig. 5). To match a peak s against all the PNPs, VarQuest counts all the matching theoretical peaks in the interval (s − ε, s + ε) by finding \({p}_{lower}\) (the smallest matched peak) and \({p}_{upper}\) (the largest matched peak). VarQuest sets \({i}_{lower}=\lfloor \frac{s-\varepsilon }{\theta }\rfloor \) and uses binary search to find the smallest matching peak in the interval between Index(\({i}_{lower}\)) and Index(\({i}_{lower}\) + 1) (\({p}_{upper}\) is found in a similar way). Since the interval is small, the binary search is much faster than a search over the entire array SortedPeaks(Peptides).

After processing all peaks in the spectrum S, the number of occurrences of a PNP P in FeasiblePeptides(S) corresponds to the number of shared peaks between P and S. Thus, the list FeasiblePeptides(S) contains information about SPCScore(P,S) for all PNPs P sharing at least one theoretical peak with S.
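A compact sketch of this scoring step, together with the filter of equation (2), is given below. It is an illustration under assumptions rather than the actual implementation: all names are hypothetical, theoretical peaks are assumed to lie below M·θ, the interval is treated as closed, each peptide is counted at most once per experimental peak, and the MaxMod restriction on precursor mass is omitted.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def build_index(theoretical_spectra, num_bins, bin_size):
    """Sort all theoretical peaks and record where each mass bin starts."""
    sorted_peaks = sorted(
        (mass, pid) for pid, peaks in theoretical_spectra.items() for mass in peaks
    )
    masses = [mass for mass, _ in sorted_peaks]
    index = [bisect_left(masses, i * bin_size) for i in range(num_bins + 1)]
    return sorted_peaks, masses, index


def feasible_peptides(spectrum, sorted_peaks, masses, index, bin_size, eps=0.02):
    """Count, for every peptide, the experimental peaks it matches within eps."""
    counts = Counter()
    last_bin = len(index) - 2
    for s in spectrum:
        i_lower = min(max(int((s - eps) // bin_size), 0), last_bin)
        i_upper = min(max(int((s + eps) // bin_size), 0), last_bin)
        # Binary search restricted to one bin instead of the whole array.
        lo = bisect_left(masses, s - eps, index[i_lower], index[i_lower + 1])
        hi = bisect_right(masses, s + eps, index[i_upper], index[i_upper + 1])
        # A peptide is counted at most once per experimental peak.
        counts.update({pid for _, pid in sorted_peaks[lo:hi]})
    return counts


def candidate_peptides(counts, eta=5):
    """Keep peptides whose shared peak count reaches the threshold eta."""
    return sorted(pid for pid, shared in counts.items() if shared >= eta)
```

A `Counter` plays the role of the counting set here: peptides that share no peak with the spectrum never appear in it, so the cost of scoring depends on the number of matched peaks rather than on the database size alone.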

Computing FDR

The target-decoy approach46 for estimating FDR is based on generating a decoy database DecoyPeptides from the target database Peptides and searching all spectra against the combined DecoyPeptides and Peptides databases. The target-decoy approach further uses the numbers of PSMs found in both databases to evaluate FDR. We refer to the set of all PSMs found in Peptides (DecoyPeptides) with P values below a threshold τ as \({{\rm{PSM}}}_{\tau }\)(Peptides,Spectra) (\({{\rm{PSM}}}_{\tau }\)(DecoyPeptides,Spectra)). As the decoy database consists of randomly generated peptides, we expect to find very few PSMs in \({{\rm{PSM}}}_{\tau }\)(DecoyPeptides,Spectra) for an appropriately chosen τ (we used τ = \({10}^{-10}\)). Note that the size of DecoyPeptides is not necessarily equal to the size of Peptides. We consider the situation when the frequencies of target and decoy peptides in the combined database are t and d, respectively (t + d = 1). We define the decoy ratio D as \(\frac{d}{t}\) and compute FDR as follows:

$${{\rm{FDR}}}_{\tau }=\frac{1}{D}\frac{\left|{{\rm{PSM}}}_{\tau }({\rm{DecoyPeptides,Spectra}})\right|}{\left|{{\rm{PSM}}}_{\tau }({\rm{Peptides,Spectra}})\right|}.$$
(3)

Since the running time of the VarQuest algorithm is linear with respect to the size of the PNPdatabase, a larger DecoyPeptides database leads to an increased running time. In contrast, a small DecoyPeptides database may result in an inaccurate estimate of FDR. We thus benchmarked VarQuest with various values of D to show that D = 1 is a good trade-off (Supplementary Table 6).
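Equation (3) amounts to a few lines of arithmetic, sketched below. The function name `estimate_fdr` and its arguments are hypothetical, and the numbers in the usage note are invented for illustration only.

```python
def estimate_fdr(n_target_psms, n_decoy_psms, n_target_peptides, n_decoy_peptides):
    """FDR estimate of equation (3) for a combined target and decoy database."""
    if n_target_psms == 0:
        raise ValueError("FDR is undefined when no target PSMs pass the threshold")
    total = n_target_peptides + n_decoy_peptides
    t = n_target_peptides / total  # frequency of target peptides
    d = n_decoy_peptides / total   # frequency of decoy peptides
    decoy_ratio = d / t            # D in the text
    return (n_decoy_psms / n_target_psms) / decoy_ratio
```

With equally sized databases (D = 1), 10 decoy PSMs against 200 target PSMs give an estimate of 5%; doubling the decoy database (D = 2) while observing the same counts halves the estimate to 2.5%.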

Generating decoy databases

A popular method for generating decoy databases in traditional proteomics is random shuffling of amino acids for each target protein. However, this strategy (Supplementary Fig. 6b, further referred to as ‘classical’) is not suitable for PNPs because (1) PNPs are much smaller than proteins, (2) many PNPs are cyclic or branch-cyclic and (3) many PNPs contain multiple copies of the same amino acid (Supplementary Fig. 7). As a result, the shuffled decoy peptides are often similar to the target peptides, which inflates the FDR estimate.

To address this challenge, DEREPLICATOR11 randomly redistributes the total mass of a peptide over the nodes of its PNP graph (Supplementary Fig. 6c, DEREPLICATOR strategy). This strategy is motivated by the fact that PNPs often contain non-standard amino acids with a wide range of masses.

VarQuest uses a novel decoy generation approach based on amino acid shuffling and random bond displacement (Supplementary Fig. 6d, VarQuest strategy). For each target PNP, VarQuest first generates a decoy PNP by rearranging the amino acids. Afterwards, it randomly selects an edge in the PNP graph and substitutes it with a new edge, connected to a randomly selected position, such that the resulting decoy structure represents a connected graph. This strategy takes into account the complex structures present in many PNPs, resulting in a more diverse decoy database.
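The two steps of this strategy (amino acid shuffling followed by displacement of one bond) can be sketched as follows. This is a hedged illustration, not the VarQuest code: the PNP graph is assumed to be given as a list of residue masses plus a list of undirected edges between residue indices, one endpoint of the removed bond is kept while the other is moved, and the names `is_connected` and `make_decoy` as well as the retry limit are hypothetical.

```python
import random


def is_connected(num_nodes, edges):
    """Depth-first search check that all nodes are reachable from node 0."""
    adjacency = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adjacency[u] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == num_nodes


def make_decoy(masses, edges, rng, max_tries=100):
    """Shuffle residue masses, then move one end of a randomly chosen bond."""
    decoy_masses = masses[:]
    rng.shuffle(decoy_masses)
    existing = {frozenset(e) for e in edges}
    for _ in range(max_tries):
        removed = rng.choice(edges)
        u = rng.choice(removed)          # endpoint that stays in place
        w = rng.randrange(len(masses))   # new partner for that endpoint
        if w == u or frozenset((u, w)) in existing:
            continue                     # self-loop or already a bond
        candidate = [e for e in edges if e != removed] + [(u, w)]
        if is_connected(len(masses), candidate):
            return decoy_masses, candidate
    return decoy_masses, list(edges)     # fallback: shuffled masses only
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the generated decoy database reproducible between runs.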

To compare the accuracy of the FDR estimation using these methods, we conducted the following experiment. We took the 200 top-scoring unique PNP identifications from a DEREPLICATOR run on the entire GNPS11. These annotations were manually curated and validated as reliable. In experiment 1, we ran VarQuest on the spectra with the same PNPdatabase as in ref. 11. In experiment 2, we excluded the 200 target peptides and all their known variants from the PNPdatabase and ran VarQuest again. We expected the FDR to be around 0% in experiment 1 (all mass spectra are highly reliable) and around 50% in experiment 2 (the correct peptides are missing from the database and matches to the target and decoy PNPs are equally likely). Supplementary Table 7 shows the FDR estimations for both experiments computed based on various decoy generation approaches. The classical strategy overestimates FDR in experiment 1 while the DEREPLICATOR method underestimates FDR in experiment 2. The VarQuest decoy generation strategy has an acceptable performance in both cases (0.5 and 55.0%, respectively).

Constructing the PNPdatabase

We combined all compounds with at least four generalized peptide bonds from AntiMarin28, DNP29, MIBiG30 and StreptomeDB31 into a single non-redundant database with 10,067 distinct compounds (Supplementary Table 8). These chemical entities were classified into chemical classes using the ClassyFire47 software tool. Compounds related to peptidic classes were included in our target database (PNPdatabase), which consisted of 5,021 distinct PNPs forming 1,582 PNP families (Supplementary Table 9). Supplementary Tables 10–14 show the distributions of PNP origins, PNP family sizes and PNP structures, the number of peptide bonds and the most frequent amino acids in the PNPdatabase.

Revealing PNP diversity in related bacteria

Spectral libraries in metabolomics2,48 rely on the comparative metabolomics assumption that two related bacteria are likely to produce identical metabolites. Our analysis revealed that in the case of PNPs, such cases are relatively rare and that related bacteria are likely to produce similar rather than identical PNPs. To illustrate this point, we visualized strain relations based on DEREPLICATOR (identical known PNPs) and VarQuest (PNP variants of the same origin) identifications in the \({{\rm{Spectra}}}_{{\rm{CYANO}}}\) (the largest) and \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{1}}\) (the least contaminated) datasets at 5% FDR. To illustrate the diversity of PNPs across related bacteria, we introduced the concept of the strain graph, with nodes representing strains and edges connecting two strains if they produce variants of the same known PNP (see Supplementary Fig. 8).
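The strain graph and its connected components can be computed as in the sketch below. The input format (a mapping from each strain to the set of known PNPs whose variants were identified in it), the function names and the toy labels are assumptions made for illustration; they are not taken from the VarQuest software.

```python
from collections import defaultdict
from itertools import combinations


def strain_graph(identifications):
    """Edges between strains that share variants of the same known PNP."""
    strains_by_pnp = defaultdict(set)
    for strain, pnps in identifications.items():
        for pnp in pnps:
            strains_by_pnp[pnp].add(strain)
    edges = set()
    for strains in strains_by_pnp.values():
        for a, b in combinations(sorted(strains), 2):
            edges.add((a, b))
    return edges


def connected_components(nodes, edges):
    """Group nodes into connected components with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())
```

Collecting edges in a set means that two strains sharing several PNPs are still joined by a single edge, which matches the definition of the strain graph given above.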

\({{\rm{Spectra}}}_{{\rm{CYANO}}}\)

DEREPLICATOR identified 42 known PNPs in 68 out of 352 Cyanobacteria strains. The strain graph constructed on these PNPs has 284 edges (two strains share identical PNPs) and consists of 13 connected components (Supplementary Fig. 8a). VarQuest detected PNP variants of 334 known PNPs in the same set of 68 strains. The strain graph has 618 edges (two strains produce PNP variants of the same known PNP) and a single connected component (Supplementary Fig. 8b). The VarQuest strain graph on the entire \({{\rm{Spectra}}}_{{\rm{CYANO}}}\) contains 272 nodes, 3,791 edges and 19 connected components.

\({{\rm{Spectra}}}_{{{\rm{STREP}}}_{1}}\)

DEREPLICATOR identified 20 PNPs in 10 out of 17 Streptomyces strains. Its strain graph has only six edges and consists of six connected components (Supplementary Fig. 8c). In contrast, VarQuest identified 78 PNP variants in these 10 strains and enlarged the graph by 29 additional edges (35 total) turning it into a single connected component (Supplementary Fig. 8d). Moreover, VarQuest was able to identify PNP variants in all 17 Streptomyces strains in this dataset. The full strain graph has 73 edges and 3 connected components.

Validating VarQuest identifications using a literature search

Supplementary Table 15 shows the list of 244 peptide variants identified by VarQuest in Pseudomonas and Streptomyces datasets. We considered all PNP variants at 5% FDR (871, 287 and 56 for \({{\rm{Spectra}}}_{{\rm{PSEUD}}}\), \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{1}}\) and \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{2}}\), respectively) and excluded identifications of known PNPs with zero mass offsets (resulting in 662, 239 and 43 remaining peptide variants, respectively). Afterwards, we analysed 100 peptide variants with the lowest P values per dataset (for \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{2}}\) we considered all 43 variants). The origin of each PNP family was determined based on a literature search. The most contaminated dataset is \({{\rm{Spectra}}}_{{\rm{PSEUD}}}\), in which only 52% of variants have Pseudomonas origin. Four large non-Pseudomonas families (surfactins, xentrivalpeptides, bacillomycins and SNA-60-367) cover 26 variants in this dataset. SpecNets identified 19 of these 26 variants, which indirectly suggests that these spectra are true contaminants rather than VarQuest false positives. Half of the singleton (PNP families with a single identified member of the family) contaminants (10 out of 22) are also reported by SpecNets. Both Streptomyces datasets have a higher rate of PNPs originally found in Streptomyces (94% for \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{1}}\) and 65% for \({{\rm{Spectra}}}_{{{\rm{STREP}}}_{2}}\)).

Apart from false PSMs, there are a few reasons why spectra from the \({{\rm{Spectra}}}_{{\rm{PSEUD}}}\) dataset may form PSMs with PNPs from other bacterial sources, for example laboratory contamination and morphology misidentification, as many laboratory collections contain organisms that are misidentified11. In addition, Luria broth growth medium is not sterile before autoclaving (for example, surfactins are commonly found in the growth media, even in freshly opened bottles).

Running VarQuest iteratively

While VarQuest is limited to searching for PNP variants with a single modification, this limitation can potentially be addressed by running VarQuest iteratively. In this case, PNP variants identified in the initial VarQuest run are used as an input PNPdatabase for the subsequent run on the same spectral dataset. Supplementary Table 16 presents the results of an iterative VarQuest run on \({{\rm{Spectra}}}_{{\rm{CYANO}}}\). On the initial run on this dataset, VarQuest identified 2,083 and 95 PNP variants in the target and decoy versions of the PNPdatabase, respectively. For the second iteration, we selected PNP variants with the most reliable mass offsets equal to ±14 Da (methylation), ±28 Da (dimethylation), ±18 Da (hydration) and ±16 Da (hydroxylation) and ended up with a new PNPdatabase with 81 PNP variants (78 targets and 3 decoys) representing 53 unique PNPs. We further refer to this database as FirstIterationDB.
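The selection of variants for the second iteration can be sketched as a simple filter on mass offsets. The nominal offsets are those listed above; the tolerance of 0.5 Da around each nominal value, the input format and the function name are assumptions introduced for illustration and are not stated in the text.

```python
# Nominal offsets in Da: methylation, dimethylation, hydration, hydroxylation.
RELIABLE_OFFSETS = (14, 28, 18, 16)


def select_for_next_iteration(variants, tolerance=0.5):
    """Keep (pnp_id, mass_offset) pairs whose offset is close to a reliable one.

    Both gains and losses are accepted, so the sign of the offset is ignored.
    The tolerance is a hypothetical parameter, not a VarQuest default.
    """
    selected = []
    for pnp_id, offset in variants:
        if any(abs(abs(offset) - ref) <= tolerance for ref in RELIABLE_OFFSETS):
            selected.append((pnp_id, offset))
    return selected
```

The selected variants would then be written out as the PNPdatabase for the next run, with target and decoy entries kept apart so that FDR can still be estimated.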

Out of 353 PNP variants identified on the second iteration, 69 were already identified among the 2,083 PNP variants reported in the first VarQuest run and the remaining 284 are novel (14% increase). We investigated why 353 PNP variants identified at the second iteration were not reported on the first run of VarQuest (Supplementary Fig. 9). It turned out that a large fraction of newly identified PNP variants (114 out of 353) were actually identified (but not reported) by VarQuest since they have P values slightly above the default P value threshold of \({10}^{-10}\) (varying from \({10}^{-10}\) to \({10}^{-7}\)). Thus, FirstIterationDB presents a better PNPdatabase for identifying PNPs with multiple modifications compared with the original PNPdatabase.

To provide additional evidence that the newly found PNP variants represent correct rather than erroneous modifications, we further checked whether the most frequent offsets identified in the second iteration are consistent with the most frequent offsets identified in the initial run of VarQuest (Supplementary Table 17). It turned out that most of these offsets correlate with the most common offsets identified by VarQuest in \({{\rm{Spectra}}}_{{\rm{GNPS}}}\) (Supplementary Table 4).

Life Sciences Reporting Summary

Further information on experimental design is available in the Life Sciences Reporting Summary.

Code availability

VarQuest is available both as a command line tool (http://cab.spbu.ru/software/varquest) and as a web application on the GNPS website (http://gnps.ucsd.edu).

Data availability

Liquid chromatography-tandem mass spectrometry data are publicly accessible under MassIVE accession numbers MSV000079450 (\({{\rm{Spectra}}}_{{\rm{PSEUD}}}\)), MSV000078604 (\({{\rm{Spectra}}}_{{{\rm{STREP}}}_{1}}\)), MSV000078839 (\({{\rm{Spectra}}}_{{{\rm{STREP}}}_{2}}\)) and MSV000078568 (\({{\rm{Spectra}}}_{{\rm{CYANO}}}\)) at http://gnps.ucsd.edu/ProteoSAFe/datasets.jsp. The list of 120 MassIVE accession numbers for \({{\rm{Spectra}}}_{{\rm{GNPS}}}\) is available at http://cab.spbu.ru/software/varquest. The PNPdatabase is available at http://cab.spbu.ru/software/varquest.

References

  1. Ling, L. L. et al. A new antibiotic kills pathogens without detectable resistance. Nature 517, 455–459 (2015).
  2. Wang, M. et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34, 828–837 (2016).
  3. Marahiel, M. A., Stachelhaus, T. & Mootz, H. D. Modular peptide synthetases involved in nonribosomal peptide synthesis. Chem. Rev. 97, 2651–2674 (1997).
  4. Arnison, P. G. et al. Ribosomally synthesized and post-translationally modified peptide natural products: overview and recommendations for a universal nomenclature. Nat. Prod. Rep. 30, 108–160 (2013).
  5. Stachelhaus, T., Mootz, H. D., Bergendahl, V. & Marahiel, M. A. Peptide bond formation in nonribosomal peptide biosynthesis. Catalytic role of the condensation domain. J. Biol. Chem. 273, 22773–22781 (1998).
  6. Von Dohren, H., Dieckmann, R. & Pavela-Vrancic, M. The nonribosomal code. Chem. Biol. 6, R273–R279 (1999).
  7. Mohimani, H. et al. Automated genome mining of ribosomal peptide natural products. ACS Chem. Biol. 9, 1545–1551 (2014).
  8. Ng, J. et al. Dereplication and de novo sequencing of nonribosomal peptides. Nat. Methods 6, 596–599 (2009).
  9. Ibrahim, A. et al. Dereplicating nonribosomal peptides using an informatic search algorithm for natural products (iSNAP) discovery. Proc. Natl Acad. Sci. USA 109, 19196–19201 (2012).
  10. Mohimani, H. & Pevzner, P. A. Dereplication, sequencing and identification of peptidic natural products: from genome mining to peptidogenomics to spectral networks. Nat. Prod. Rep. 33, 73–86 (2016).
  11. Mohimani, H. et al. Dereplication of peptidic natural products through database search of mass spectra. Nat. Chem. Biol. 13, 30–37 (2017).
  12. Pevzner, P. A., Mulyukov, Z., Dancik, V. & Tang, C. L. Efficiency of database search for identification of mutated and modified proteins via mass spectrometry. Genome Res. 11, 290–299 (2001).
  13. Tsur, D., Tanner, S., Zandi, E., Bafna, V. & Pevzner, P. A. Identification of post-translational modifications by blind search of mass spectra. Nat. Biotechnol. 23, 1562–1567 (2005).
  14. Tanner, S. et al. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 77, 4626–4639 (2005).
  15. Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017).
  16. Balkovec, J. M. et al. Discovery and development of first in class antifungal caspofungin (CANCIDAS®)—a case study. Nat. Prod. Rep. 31, 15–34 (2014).
  17. Okano, A., Isley, N. & Boger, D. L. Peripheral modifications of vancomycin with added synergistic mechanisms of action provide durable and potent antibiotics. Proc. Natl Acad. Sci. USA 114, 5052–5061 (2017).
  18. Mohimani, H. et al. Multiplex de novo sequencing of peptide antibiotics. J. Comput. Biol. 18, 1371–1381 (2011).
  19. Bandeira, N. Spectral networks: a new approach to de novo discovery of protein sequences and posttranslational modifications. Biotechniques 42, 687–695 (2007).
  20. Navarro, G. et al. Image-based 384-well high-throughput screening method for the discovery of skyllamycins A to C as biofilm inhibitors and inducers of biofilm detachment in Pseudomonas aeruginosa. Antimicrob. Agents Ch. 58, 1092–1099 (2014).
  21. Yates, J. R., Eng, J. K., McCormack, A. L. & Schieltz, D. Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal. Chem. 67, 1426–1436 (1995).
  22. Pevzner, P. A., Dancik, V. & Tang, C. L. Mutation-tolerant protein identification by mass spectrometry. J. Comput. Biol. 7, 777–787 (2000).
  23. Na, S., Bandeira, N. & Paek, E. Fast multi-blind modification search through tandem mass spectrometry. Mol. Cell. Proteom. 11, M111.010199 (2012).
  24. Mohimani, H., Kim, S. & Pevzner, P. A. A new approach to evaluating statistical significance of spectral identifications. J. Proteome Res. 12, 1560–1568 (2013).
  25. Nguyen, D. D. et al. Indexing the Pseudomonas specialized metabolome enabled the discovery of poaeamide B and the bananamides. Nat. Microbiol. 2, 16197 (2016).
  26. Duncan, K. R. et al. Molecular networking and pattern-based genome mining improves discovery of biosynthetic gene clusters and their products from Salinispora species. Chem. Biol. 22, 460–471 (2015).
  27. Luzzatto-Knaan, T. et al. Digitizing mass spectrometry data to explore the chemical diversity and distribution of marine cyanobacteria and algae. eLife 6, e24214 (2017).
  28. Blunt, J., Munro, M. & Laatsch, H. AntiMarin Database (Univ. Canterbury, Christchurch, and Univ. Gottingen, Gottingen, 2007); https://www.scienceopen.com/document?vid=03a1a98e-434c-4255-a287-5a900f59d024
  29. Gozalbes, R. & Pineda-Lucena, A. Small molecule databases and chemical descriptors useful in chemoinformatics: an overview. Comb. Chem. High T. Scr. 14, 548–458 (2011).
  30. Medema, M. H. et al. Minimum information about a biosynthetic gene cluster. Nat. Chem. Biol. 11, 625–631 (2015).
  31. Lucas, X. et al. StreptomeDB: a resource for natural compounds isolated from Streptomyces species. Nucleic Acids Res. 41, D1130–D1136 (2013).
  32. Challis, G. L. & Naismith, J. H. Structural aspects of non-ribosomal peptide biosynthesis. Curr. Opin. Struc. Biol. 14, 748–756 (2004).
  33. Schmidt, E. W. The hidden diversity of ribosomal peptide natural products. BMC Biol. 8, 83 (2010).
  34. Hadjithomas, M. et al. IMG-ABC: new features for bacterial secondary metabolism analysis and targeted biosynthetic gene cluster discovery in thousands of microbial genomes. Nucleic Acids Res. 45, D560–D565 (2017).
  35. Medema, M. H. et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 39, W339–W346 (2011).
  36. Gerard, J. et al. Massetolides A-H, antimycobacterial cyclic depsipeptides produced by two pseudomonads isolated from marine habitats. J. Nat. Prod. 60, 223–229 (1997).
  37. Takada, K. et al. Surugamides A-E, cyclic octapeptides with four D-amino acid residues, from a marine Streptomyces sp.: LC-MS-aided inspection of partial hydrolysates for the distinction of D- and L-amino acid residues in the sequence. J. Org. Chem. 78, 6746–6750 (2013).
  38. Kodani, S., Sato, K., Hemmi, H. & Ohnish-Kameyama, M. Isolation and structural determination of a new hydrophobic peptide venepeptide from Streptomyces venezuelae. J. Antibiot. 67, 839–842 (2014).
  39. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
  40. Watrous, J. et al. Mass spectral molecular networking of living microbial colonies. Proc. Natl Acad. Sci. USA 109, E1743–E1752 (2012).
  41. Mukherjee, S. et al. 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat. Biotechnol. 35, 676–683 (2017).
  42. Mohimani, H. et al. Cycloquest: identification of cyclopeptides via database search of their mass spectra against genome databases. J. Proteome Res. 10, 4505–4512 (2011).
  43. Mohimani, H. et al. NRPquest: coupling mass spectrometry and genome mining for nonribosomal peptide discovery. J. Nat. Prod. 77, 1902–1909 (2014).
  44. Rottig, M. et al. NRPSpredictor2—a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res. 39, W362–W367 (2011).
  45. Da Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).
  46. Elias, J. E. & Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 (2007).
  47. Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminformatics 8, 61 (2016).
  48. Smith, C. A. et al. METLIN: a metabolite mass spectral database. Ther. Drug. Monit. 27, 747–751 (2005).

Acknowledgements

We thank K. Vyatkina for fruitful discussions and A. Prjibelski for help with manuscript preparation. The work of A.G., A.M., A.S., A.K. and P.A.P. was supported by the Russian Science Foundation (grant 14-50-00069). The work of H.M. and P.A.P. was supported by the US National Institutes of Health (grant 2-P41-GM103484).

Author information

Contributions

A.G. implemented the VarQuest algorithm. A.S. and A.K. improved and sped up the DEREPLICATOR software. A.G., A.M. and H.M. designed the webserver. A.G. and A.M. did the VarQuest benchmarking. H.M. and P.A.P. designed and directed the work. A.G., H.M. and P.A.P. wrote the manuscript.

Corresponding author

Correspondence to Pavel A. Pevzner.

Ethics declarations

Competing interests

P.A.P. has an equity interest in Digital Proteomics, a company that may potentially benefit from the research results. The terms of this arrangement have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Tables 1–18, Supplementary Figures 1–9 and Supplementary References.

Life Sciences Reporting Summary

Cite this article

Gurevich, A., Mikheenko, A., Shlemov, A. et al. Increased diversity of peptidic natural products revealed by modification-tolerant database search of mass spectra. Nat Microbiol 3, 319–327 (2018). https://doi.org/10.1038/s41564-017-0094-2
