Introduction

Microbial genome mining is an important emerging technology to discover new and novel natural products (NPs) for drug discovery.1, 2, 3, 4 The genome mining concept stems from the seminal observations of the Hopwood and Ōmura groups that the genome sequences of Streptomyces coelicolor5 and Streptomyces avermitilis6 appeared to encode about 10-fold more potential secondary metabolites (SMs) than were known from the expressed secondary metabolomes. These predictions have been confirmed experimentally,7, 8 and generalized to actinomycetes and other bacterial taxa with large genomes.9, 10, 11, 12, 13, 14, 15, 16

NPs continue to be important sources of new and novel chemical scaffolds for drug discovery,17, 18 and actinomycetes, particularly Streptomyces species, continue to be the most productive sources.4, 19, 20 To access the enormous untapped cryptic SM coding capacity of actinomycetes, it is critical to develop robust approaches to activate gene cluster expression. Many approaches, including isolation of mutants altered in transcription or translation, genetic manipulation of positive and negative regulation, heterologous expression in specialized hosts and others, have been described.8, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

For microbial genome mining to become a robust methodology to drive the discovery of new and novel NPs for drug discovery, it is important to have a set of bioinformatics tools that can predict which microorganisms are the most ‘gifted’ for SM production. antiSMASH 3.031 is particularly useful to identify the numbers and types of SMs encoded by microbes, and has been used to survey a wide range of bacteria and archaea for those microorganisms most gifted for SM production.16 Among the most gifted are bacteria with genomes >8.0 Mb, including many actinomycetes. None-the-less, having a large genome is not sufficient to predict abundant coding capacity for SMs. Furthermore, antiSMASH 3.0 gives accurate accounting of SM coding capacity for finished genomes, but not for draft genomes which often have poorly assembled NRPS, PKS-I or mixed NRPS/PKS-I (NRPS-PKS-I refers to all three types) gene clusters because of the short reads by the most economical sequencing technologies, and the high sequence similarities within repeating functional domains in modular NRPS and PKS-I mega-genes.32, 33 Since the vast majority of bacterial genomes in public databases remain in draft form, it would be useful to have bioinformatics search methods to predict which microbes are the most gifted to target for complete genome sequencing and genome mining.

Historically, the most productive sources for NP-derived drugs have been biosynthetic pathways that employ NRPS-PKS-I mechanisms.4 The majority of SM pathways employing these mechanisms in bacteria terminate assembly on mega-enzymes with TEs that release linear or cyclized molecules from terminal modules.34 TE domains associated with NRPS-PKS-I mega-enzymes encoded by actinomycetes are usually preceded by PCPs or ACPs (see below). The peptidyl carrier protein-thioesterase (PCP-TE) and acyl carrier protein-thioesterase (ACP-TE) di-domains are relatively small, so their DNA sequences should be assembled correctly by any sequencing methodology. There are exceptions to this release strategy. For instance, the NRPS mega-enzymes that encode glycopeptides related to vancomycin and cephamycins have terminal modules with epimerase (E) domains between the PCP and TE,35, 36 and the hybrid PKS-I/NRPS pathways for rapamycin and FK506 cyclize the molecules by a different mechanism.37, 38 In spite of the exceptions, a survey of the number of PCP-TE and ACP-TE di-domains per genome might help identify the most gifted microbes. Gifted microbes also tend to encode multiple MbtH homologs, which serve as non-enzymatic chaperones for NRPS adenylation reactions,39, 40 and multiple phosphopantetheinyl transferase (PPTase) genes involved in converting apo-PCPs and apo-ACPs to active holo-enzymes.16, 23 In this report, I describe the use of concatenated PCP-TE and ACP-TE pentamers as beacons to estimate the numbers of NRPS and PKS-I gene clusters in finished and draft genomes from actinomycetes, and also survey the numbers of genes encoding MbtH and PPTases using concatenated multi-probes.16, 40 The results indicate that these four types of multi-probes targeting NRPS-PKS-I mega-enzymes serve as useful beacons to identify gifted microbes from finished or draft genome sequences of actinomycetes.

Materials and methods

Concatenated CP-TE di-domains

The sources of ACP-TE and PCP-TE di-domains are shown in Table 1. Five ACP-TEs and five PCP-TEs were concatenated to generate individual multi-probes (Supplementary Figures S1 and S2) for analyses of actinomycete genomes.
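The concatenation step can be sketched as follows. This is a minimal illustration that assumes the five di-domain sequences are available as plain amino-acid strings; the function name `build_multiprobe`, the placeholder sequences and the bookkeeping of segment coordinates are hypothetical and are not the procedure used to build the probes in Supplementary Figures S1 and S2.

```python
def build_multiprobe(didomains):
    """Concatenate di-domain sequences into one probe.

    didomains: list of (name, amino_acid_sequence) tuples.
    Returns the concatenated probe and the 1-based (name, start, end)
    coordinates of each di-domain within it.
    """
    probe_parts = []
    segments = []
    start = 1
    for name, seq in didomains:
        # Remove whitespace and normalize case before joining.
        seq = "".join(seq.split()).upper()
        end = start + len(seq) - 1
        segments.append((name, start, end))
        probe_parts.append(seq)
        start = end + 1
    return "".join(probe_parts), segments


# Placeholder sequences (hypothetical, far shorter than real di-domains).
example = [("didomain_1", "MKTAY"), ("didomain_2", "GGSLV AAP")]
probe, segments = build_multiprobe(example)
print(probe)
print(segments)
```

Recording the segment coordinates is a design choice that makes it possible to tell later which part of the multi-probe a given hit aligned to.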

Table 1 Sources of ACP-TE and PCP-TE di-domains

MbtH and PPTase multi-probes

The MbtH and PPTase multi-probes have been described elsewhere.16, 40

Protein searches

Protein searches were carried out with multi-probes by BLASTp for MbtH homologs and DELTA BLASTp for PPTase, ACP-TE and PCP-TE homologs (http://blast.ncbi.nlm.nih.gov/Blast.cgi).41, 42 The multi-probes also pick up many ACPs and PCPs not linked to TEs, so only full-length hits were counted. In addition, each hit was verified manually to correspond to the appropriate ACP or PCP annotation (PKS-I or NRPS); in some cases, ACP-TEs pick up PCP-TEs and vice versa, so false-positive ‘hits’ were not counted.
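The full-length filter can be approximated in code. The sketch below assumes each hit has been reduced to a subject identifier plus the start and end of the alignment on the multi-probe, and it treats a hit as full-length when it covers at least 90% of one di-domain segment. The 90% threshold, the function names and the example coordinates are assumptions made for illustration; the manual annotation check described above is not reproduced here.

```python
def covers_segment(q_start, q_end, seg_start, seg_end, min_fraction):
    """True if the alignment covers at least min_fraction of one segment."""
    overlap = min(q_end, seg_end) - max(q_start, seg_start) + 1
    return overlap >= min_fraction * (seg_end - seg_start + 1)


def count_full_length_hits(hits, segments, min_fraction=0.9):
    """Count distinct subject proteins with a full-length di-domain hit.

    hits: list of (subject_id, query_start, query_end) on the multi-probe.
    segments: list of (name, start, end) di-domain coordinates in the probe.
    """
    counted = set()
    for subject_id, q_start, q_end in hits:
        if any(covers_segment(q_start, q_end, s, e, min_fraction)
               for _, s, e in segments):
            counted.add(subject_id)  # each protein is counted once
    return len(counted)


# Hypothetical probe layout and hits.
segments = [("d1", 1, 400), ("d2", 401, 800)]
hits = [
    ("protA", 5, 398),    # spans nearly all of d1: counted
    ("protB", 401, 480),  # carrier-protein-sized match only: not counted
    ("protA", 410, 790),  # same protein again: not double-counted
    ("protC", 402, 799),  # spans nearly all of d2: counted
]
n_full_length = count_full_length_hits(hits, segments)
print(n_full_length)
```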

Bioinformatic searches for SMGCs

The numbers and sizes of NRPS, PKS-I and mixed NRPS/PKS-I gene clusters in finished and unfinished actinomycete genomes were determined with antiSMASH 3.0.31

Results

Assembly of NRPS and PKS-I mega-genes in draft genome sequences

To illustrate the problem of assembly of large NRPS and PKS-I gene clusters in draft genomes, BLASTp analysis was carried out on unfinished genomes of the producers of spinosad (Saccharopolyspora spinosa), tylosin (Streptomyces fradiae) and daptomycin (Streptomyces roseosporus) as subjects using PKS-I and NRPS mega-enzymes from finished biosynthetic gene clusters as queries (Table 2). The PKS-I mega-enzyme involved in biosynthesis of spinosad has five subunits, SpnA–SpnE.43 The draft genome of S. spinosa44 has the spnA gene assembled correctly. However, spnB is truncated; spnC is missing; spnD is split into two unlinked segments, one fused to a heterologous sequence and the other fused to two heterologous sequences; and spnE is split into two unlinked segments.

Table 2 Annotation of PKS and NRPS genes in finished clusters and draft genomes

The PKS-I mega-enzyme involved in tylosin biosynthesis in S. fradiae has five subunits, TylGI–TylGV.45 The draft genome of S. fradiae46 has a correctly assembled tylGV. However, tylGI is truncated; the complete tylGII gene is fused to a heterologous segment; tylGIII is truncated; and tylGIV is split into three truncated segments which span only two-thirds of the protein (Table 2).

The NRPS mega-enzyme involved in daptomycin biosynthesis in S. roseosporus has three subunits, DptA, DptBC and DptD,47 which have been exploited in combinatorial biosynthesis.48 The three-subunit arrangement has also been observed in the highly related cryptic daptomycin-like pathway in the finished genome of Saccharomonospora viridis49 and the finished biosynthetic gene cluster for taromycin A in the marine Saccharomonospora sp. CNQ-490.50 Draft genomes of two strains of S. roseosporus are available on the NCBI website from a sequencing project by the Broad Institute. S. roseosporus NRRL 11379 (ATCC 31568; A21978.6) is the wild-type strain discovered by Eli Lilly and Company, and S. roseosporus NRRL 15998 (A21978.65) is a more productive derivative of NRRL 11379 derived by N-methyl-N′-nitro-N-nitrosoguanidine mutagenesis.51 It is instructive that the draft genomes of these highly related strains have the daptomycin gene cluster assembled in two different ways. The wild-type strain (A21978.6) has dptA, dptBC and dptD correctly assembled and in proper order, but has an additional partial sequence of dptBC. A21978.65 has a correctly assembled dptA gene, but dptBC is split into two partial segments, each fused to two heterologous DNA fragments. The dptD gene is also split into two segments, one of which is fused to a heterologous DNA segment.

These examples demonstrate that draft genome sequences are not adequate to correctly assemble secondary metabolite gene clusters (SMGCs) that employ NRPS or PKS-I biosynthetic mechanisms, the hallmarks of the most productive sources for clinically useful drugs.4 This major shortcoming limits the utility of antiSMASH 3.031 and other bioinformatics tools that require high-quality finished genome sequences for reliable predictions,16 as further demonstrated below.

Distribution of NRPS, PKS-I and mixed NRPS/PKS-I clusters in finished and draft genomes

To further exemplify the problem of incorrect assembly of large SMGCs containing NRPS-PKS-I mega-genes, I have compiled the numbers and sizes of clusters employing these biosynthetic mechanisms determined by antiSMASH 3.0 analyses of ten Streptomyces genomes, five finished and five drafts (Table 3). The finished genomes encode 7–26 NRPS-PKS-I clusters ranging from 41.4 to 246.2 kb, and averaging 70.5 to 94.7 kb. Of the 71 total clusters, 41 were >60 kb, and none were <40 kb. In contrast, among the 69 clusters from the unfinished genomes, which ranged from 4.2 to 79.7 kb, and averaged 26.5 to 50.5 kb, only 5 were >60 kb and 40 were <40 kb. It is apparent that draft genomes have many fragmented, and therefore incorrectly assembled NRPS-PKS-I gene clusters.
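The size comparison above amounts to binning cluster lengths at 60 kb and 40 kb. The following sketch shows that tally on made-up cluster sizes; the function name and the example lists are hypothetical and do not reproduce the data in Table 3.

```python
def summarize_cluster_sizes(sizes_kb, large_kb=60.0, small_kb=40.0):
    """Tally clusters above and below the two size cut-offs (in kb)."""
    return {
        "total": len(sizes_kb),
        "over_large": sum(1 for s in sizes_kb if s > large_kb),
        "under_small": sum(1 for s in sizes_kb if s < small_kb),
        "mean_kb": sum(sizes_kb) / len(sizes_kb),
    }


# Illustrative values only, not taken from Table 3.
finished_example = [41.4, 75.0, 246.2]
draft_example = [4.2, 22.0, 38.5, 79.7]

finished_summary = summarize_cluster_sizes(finished_example)
draft_summary = summarize_cluster_sizes(draft_example)
print(finished_summary)
print(draft_summary)
```

A shift of the counts from the large bin to the small bin, as in the draft example, is the signature of fragmentation described in the text.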

Table 3 Distribution of NRPS, PKS-I and mixed NRPS/PKS-I cluster sizes in finished and draft Streptomyces genomes

Construction of ACP-TE and PCP-TE multi-probes

Many NRPS-PKS-I biosynthetic pathways encode terminal modules containing ACP-TE or PCP-TE di-domains to release the nascent peptides, polyketides or mixed peptide-polyketides from the mega-enzymes as linear or cyclized intermediates or final products. A survey of 37 important NPs produced by actinomycetes and biosynthesized by these mechanisms indicated that 25 (68%) have terminal modules with ACP-TE or PCP-TE di-domains (Supplementary Figure S3). Individual ACP-TEs and PCP-TEs are relatively small, ranging from 394 to 440 amino acids and 371 to 378 amino acids, respectively (Table 1). As such, their coding domains should remain together in genome assemblies regardless of the sequencing technologies employed. The ACP-TE pentamer was constructed from well-characterized di-domains from PKS subunits from tylosin, erythromycin, spinosad and tautomycetin pathways and an uncharacterized PKS from S. avermitilis. The PCP-TE pentamer was constructed from the third subunit of the daptomycin NRPS cluster (DptD) and four uncharacterized NRPS subunits from Streptomyces griseus, S. avermitilis, Amycolatopsis orientalis and Streptomyces clavuligerus. The use of five diverse CP-TE di-domains in each case, coupled with the use of DELTA BLASTp,48 gives high likelihood that individual ACP-TE and PCP-TE di-domains in finished and draft genomes will be counted as surrogates for the number of NRPS-PKS-I gene clusters, independent of poor assembly and fragmentation of large pathways into smaller erroneous SMGCs in antiSMASH 3.0 searches of draft genomes. It is instructive that the ACP-TE multi-probe readily picks up the terminal ACP-TEs of SpnE and TylGV, and the PCP-TE multi-probe picks up the terminal PCP-TE of DptD in draft genomes, even though the tylosin, spinosad and daptomycin gene clusters were assembled incorrectly (Table 1). The multi-probes count each pathway one time, regardless of the other mistakes in gene cluster assembly.

Survey of PCP-TE and ACP-TE di-domains in finished actinomycete genomes

To establish the correlation between the numbers of CP-TEs and NRPS-PKS-I gene clusters, 25 finished actinomycete genomes ranging from 3.64 to 12.7 Mb were surveyed for the numbers of CP-TEs by DELTA BLASTp with the two multi-probes, and for NRPS-PKS-I gene clusters by antiSMASH 3.0 (Table 4). The numbers of MbtH homologs and PPTases encoded by these strains23, 40 are also shown. The numbers of NRPS-PKS-I gene clusters generally increased in proportion to genome size, ranging from 1 in Thermobifida fusca (genome, 3.64 Mb) to 26 in Streptomyces rapamycinicus (genome, 12.7 Mb) (Table 4; Figure 1). The numbers of CP-TEs also increased with genome size (Table 4), and the ratios of CP-TEs/NRPS-PKS-I clusters ranged from 0.46 to 1.0, with a mean of 0.65 and a slope of ~0.7 (Figure 2a), consistent with data from well-characterized SMGCs (Supplementary Figure S3). It is noteworthy that the four most ‘gifted’ actinomycetes (Kutzneria albida, Streptomyces violaceusniger, Streptomyces bingchenggensis and Streptomyces rapamycinicus), which devote 2.3–3.1 Mb of their very large genomes (9.9–12.7 Mb) to SM biosynthesis and encode 43–53 SMGCs,16 encode 19–26 NRPS-PKS-I clusters and 13–21 CP-TEs. These stand out as highly gifted by counting NRPS-PKS-I clusters or CP-TEs.
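The two summary statistics quoted above, the mean per-genome ratio and the slope, are different quantities and need not agree exactly. The sketch below computes both from paired counts. It assumes a least-squares line forced through the origin, which is one plausible reading of the slope and is not confirmed by the text; the function names and the three example genomes are hypothetical and are not values from Table 4.

```python
def mean_ratio(cp_tes, clusters):
    """Mean of the per-genome CP-TE/cluster ratios."""
    ratios = [c / k for c, k in zip(cp_tes, clusters)]
    return sum(ratios) / len(ratios)


def slope_through_origin(cp_tes, clusters):
    """Least-squares slope of CP-TEs versus clusters with zero intercept."""
    num = sum(k * c for c, k in zip(cp_tes, clusters))
    den = sum(k * k for k in clusters)
    return num / den


# Hypothetical counts for three genomes (not data from Table 4).
clusters = [1, 10, 20]
cp_tes = [1, 6, 14]

print(round(mean_ratio(cp_tes, clusters), 3))
print(round(slope_through_origin(cp_tes, clusters), 3))
```

The mean of ratios weights every genome equally, whereas the fitted slope is dominated by genomes with many clusters, which is why the two values can differ.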

Table 4 CP-TEs and NRPS-PKS-I gene clusters in finished actinomycete genomes

Figure 1 NRPS-PKS-I clusters as a function of actinomycete genome size. Data from Table 4.

Figure 2 Relationship between CP-TEs and NRPS-PKS-I clusters in finished (a) and draft (b) actinomycete genomes.

It is noteworthy that the average numbers of MbtH and PPTase genes for the 25 strains were 4.0 and 3.5, respectively, and the most gifted strains generally encode higher total numbers (Table 4), as demonstrated previously.16, 23 The two least gifted strains, Thermobifida fusca YX and Pseudonocardia dioxanivorans CB1190, encode only single MbtH and PPTase proteins, and a single NRPS-PKS-I cluster with one CP-TE di-domain.

Survey of PCP-TE and ACP-TE di-domains in draft actinomycete genomes

Less than 10% of large actinomycete genomes in public databases are finished (Baltz, unpublished). To further examine the use of the CP-TE multi-probes to search for gifted actinomycetes among draft genomes, 30 strains from two populations were chosen for antiSMASH 3.0 and multi-probe analyses. The first population included 15 draft genomes from producers of important SMs,4 and the other 15 were chosen from a group of 369 unspeciated Streptomyces isolates (https://www.ncbi.nlm.nih.gov/genome/?term=streptomyces) with genomes >8.4 Mb (Table 5). The first set had an average of 14.7 NRPS-PKS-I ‘clusters’ and 5.1 CP-TEs per strain, giving a ratio of CP-TEs/NRPS-PKS-I clusters of 0.35, or approximately one-half that observed with finished genomes. The 0.35 ratio suggests that this set of draft genomes has many fragmented NRPS-PKS-I clusters, and that the number of clusters is overestimated by antiSMASH 3.0 by ~2-fold. The second set had an average of 11.6 NRPS-PKS-I clusters and 2.8 CP-TEs, giving a ratio of CP-TEs/NRPS-PKS-I clusters of 0.24, suggesting that the numbers of NRPS-PKS-I clusters are overestimated by >2-fold. Figure 2b shows a scatter plot of these data. Note that there is much more scatter in the plot of CP-TEs versus NRPS-PKS-I clusters with draft genomes than with finished genomes (Figure 2a). This is likely due to the variable quality in assemblies of NRPS-PKS-I clusters in draft genomes. The regression lines in Figures 2a and b are drawn to reflect the average ratios calculated in Table 5. In spite of the fragmentation of actual NRPS-PKS-I clusters in draft genomes, the six most gifted Streptomyces sp., encoding 8–12 CP-TEs, were easily identified by multi-probe analyses. These strains also encode 1–6 MbtH and 2–3 PPTase proteins, and could be candidates for complete genome sequencing.
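The arithmetic behind the overestimation argument can be made explicit. The sketch below recomputes the two draft-genome ratios from the averages quoted above and divides the finished-genome mean ratio (0.65) by each. Treating that quotient as the fold overestimation assumes that CP-TE counts are unaffected by fragmentation and that the finished-genome ratio applies to these strains; the function name is hypothetical.

```python
FINISHED_MEAN_RATIO = 0.65  # mean CP-TE/cluster ratio in finished genomes


def draft_ratio_and_fold(mean_cp_tes, mean_clusters,
                         finished_ratio=FINISHED_MEAN_RATIO):
    """Return the draft CP-TE/cluster ratio and the implied fold overestimate."""
    ratio = mean_cp_tes / mean_clusters
    return ratio, finished_ratio / ratio


# Averages quoted in the text for the two sets of 15 draft genomes.
set1_ratio, set1_fold = draft_ratio_and_fold(5.1, 14.7)
set2_ratio, set2_fold = draft_ratio_and_fold(2.8, 11.6)

print(round(set1_ratio, 2), round(set1_fold, 1))
print(round(set2_ratio, 2), round(set2_fold, 1))
```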

Table 5 CP-TEs and NRPS-PKS-I gene clusters in draft genomes of actinomycetes

Discussion

Microbial genome mining is a promising approach to discover new and novel NPs for drug development.3, 4, 16 It is now clear that microbes with large genomes encode the largest numbers of SMs, and that the uncultured microbial majority, which generally have small genomes, are nearly devoid of SM pathways encoding drug-like NPs.16 Historically, the majority of important NP drugs were biosynthesized by NRPS, PKS-I or mixed NRPS/PKS-I mechanisms, mostly by actinomycetes.4 It has been demonstrated that among the actinomycetes, there are some strains that are gifted or highly gifted for SM biosynthesis, encoding 20–53 SMs, and dedicating 0.8–3.1 Mb of DNA coding capacity to SMGCs. These gifted strains also encode multiple NRPS-PKS clusters.16

The analysis of gifted status among a wide range of microbes with finished genome sequences was carried out by using the standard antiSMASH 3.0 algorithm.16, 31 However, the quality and reliability of antiSMASH 3.0 analysis is limited by the quality of the genome sequences analyzed. In this report, I demonstrate that the large SMGCs encoding daptomycin, spinosad and tylosin were assembled incorrectly in draft genome sequences, and that draft genomes generally tend to have fragmented assemblies of NRPS-PKS-I gene clusters, resulting in overestimation of cluster numbers by antiSMASH 3.0. This unfortunate outcome of the current acceptance of draft genome quality for publication and deposition of genome sequences makes it difficult to mine genomic data for SMGCs encoding drug-like molecules. To address this shortcoming, it is necessary to first sort through genomic data with small DNA sequences that can serve as beacons for desired drug-like pathways. In the current report, multi-probes directed at di-domains containing TE functionality were evaluated to identify gifted actinomycetes that encode multiple NRPS, PKS-I and mixed NRPS/PKS-I-derived SMs. The approach derives from the observation that the majority of NRPS and PKS-I mega-enzymes have terminal modules containing single TE domains preceded by small PCP or ACP domains, respectively. Exceptions include NRPSs that terminate with D-amino acids (for example, vancomycin and cephamycin, which have PCP-E-TE tri-domains)35, 36 and the PKS-I/NRPSs for rapamycin and FK506, which terminate assembly by a different mechanism.37, 38 NRPSs that insert terminal D-amino acids could be identified by extending the PCP-TE multi-probe to include PCP-E-TE tri-domains from the vancomycin and cephamycin pathways. Rapamycin/FK506-like SMGCs can be identified by BLASTp searching with pathway-specific probes (for example, cyclodeaminase).37, 38

The sizes of the individual PCP-TE and ACP-TE di-domains assembled in the multi-probes are small, ranging from 371 to 440 amino acids, and their coding sequences are likely to be assembled correctly by any DNA sequencing method employed. Thus CP-TE multi-probes can be used as molecular beacons to count NRPS-PKS-I clusters, even if the actual gene clusters are fragmented and misassembled. The ratio of CP-TEs/NRPS-PKS-I clusters in 25 finished actinomycete genomes was ~0.7. Thus a count of 7 CP-TEs by DELTA BLASTp analysis predicts ~10 NRPS-PKS-I clusters. The ratio of CP-TEs/NRPS-PKS-I clusters in 30 draft genomes was ~0.3, but with much scatter, consistent with gene cluster splitting and variable quality in assembly. None-the-less, the six most gifted strains were easily identified by the CP-TE multi-probes. Those six strains encode 8–12 CP-TEs, which predict ~11–17 NRPS-PKS-I clusters.
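The prediction in this paragraph is a single division by the finished-genome ratio. The sketch below reproduces the worked numbers, assuming the ~0.7 ratio holds for the genome being screened; the function name is hypothetical.

```python
CP_TE_PER_CLUSTER = 0.7  # approximate ratio observed in finished genomes


def predict_clusters(cp_te_count, ratio=CP_TE_PER_CLUSTER):
    """Estimate the number of NRPS-PKS-I clusters from a CP-TE count."""
    return cp_te_count / ratio


# CP-TE counts discussed in the text: 7, and the 8-12 range of the
# six most gifted draft-genome strains.
for count in (7, 8, 12):
    print(count, round(predict_clusters(count)))
```

Because the draft-genome data show considerable scatter, such a prediction is better read as a ranking aid for choosing strains to sequence than as an exact cluster count.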

Only 15 large genomes among the 369 unspeciated Streptomyces draft genomes available on the NCBI website were surveyed. The CP-TE multi-probe analysis could be used to identify the top 10–20% of gifted streptomycetes in this group and also to screen much larger libraries of proprietary draft actinomycete genomes. The most gifted strains could be targeted for complete genome sequencing, which is now feasible and relatively inexpensive by using a combination of Illumina and PacBio sequencing.16, 52, 53, 54 By adding CP-TEs from other taxa, the CP-TE multi-probe approach can be adapted to survey draft genomes of other bacteria with large genomes, such as species of the Myxobacteria, Burkholderia and Photorhabdus, all of which have gifted members that encode multiple SMGCs.16