Relationship between bacterial phylotype and specialized metabolite production in the culturable microbiome of two freshwater sponges

Microbial drug discovery programs rely heavily on accessing bacterial diversity from the environment to acquire new specialized metabolite (SM) lead compounds for the therapeutic pipeline. Therefore, knowledge of how commonly culturable bacterial taxa are distributed in nature, in addition to the degree of variation of SM production within those taxa, is critical to informing these front-end discovery efforts and making the overall sample collection and bacterial library creation process more efficient. In the current study, we employed MALDI-TOF mass spectrometry and the bioinformatics pipeline IDBac to analyze diversity within phylotype groupings and SM profiles of hundreds of bacterial isolates from two Eunapius fragilis freshwater sponges, collected 1.5 km apart. We demonstrated that within two sponge samples of the same species, the culturable bacterial populations contained significant overlap in approximate genus-level phylotypes but mostly nonoverlapping populations of isolates when grouped lower than the level of genus. Further, correlations between bacterial phylotype and SM production varied at the species level and below, suggesting SM distribution within bacterial taxa must be analyzed on a case-by-case basis. Our results suggest that two E. fragilis freshwater sponges collected in similar environments can exhibit large culturable diversity on a species-level scale, thus researchers should scrutinize the isolates with analyses that take both phylogeny and SM production into account to optimize the chemical space entering into a downstream bacterial library.

Since 2015 our team has worked with citizen scientist SCUBA divers in the Great Lakes region of the United States to collect freshwater sponges as sources of bacterial isolates for our SM drug discovery pipeline. Historically, when samples are collected from the environment, differences in both the phylogenetic diversity and the SM capacity of the culturable microbial population are not fully considered when creating a bacterial library. However, this information is critical to determine the extent to which samples should be collected, particularly when operating under a limited research budget [32]. It is important to note that while metagenomic and 16S rRNA gene surveys are useful tools to characterize microbiomes, previous studies employing both culture-dependent and -independent techniques on the same samples have shown low species overlap between methods [33][34][35][36], exposing biases of both approaches and making direct comparison of the two difficult. Thus, understanding the culturable bacterial phylogenetic and SM variance between sponge samples would make the overall sample collection and bacterial library creation process more efficient [37,38].
With this in mind, two samples of the freshwater sponge species Eunapius fragilis (Porifera: Demospongiae: Spongillida: Spongillidae) were collected 1.5 km apart in the St. Lawrence River. The collected tissue was prepared using protocols described in the experimental section, and every distinguishable bacterial colony was isolated from nutrient media plates. We focused solely on the readily culturable population because this reflects the methodological constraints of many modern microbial drug discovery programs. Protein and SM mass spectra were collected using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) and analyzed using the opensource bioinformatics software IDBac to simultaneously assess the proteomic and metabolomic variation of the resulting isolates [39,40]. MALDI-TOF MS protein spectra (3000-15,000 Daltons) consist primarily of ions of intact ribosomal, cell structure, and regulatory proteins [41,42] and are often used to group microorganisms at the genus, species, and subspecies taxonomic levels [43][44][45][46][47][48][49]. This technology has been applied extensively in clinical settings [49,50]. However, the widespread adoption of MALDI-TOF MS for the study of environmental bacteria has lagged behind clinical applications [51]. This is primarily due to the lack of reference spectra needed to identify unknown isolates and a scarcity of MALDI-TOF MS analysis tools that are freely available to the public [44,45,[52][53][54]. We previously designed IDBac as an open-source bioinformatic tool to first cluster bacterial isolates by protein spectrum similarity, followed by subgrouping them based on feature overlap within corresponding SM spectra (<2000 Daltons) [39,40]. A unique advantage of IDBac is that it combines proteomic and metabolomic data into a semi-automated pipeline that facilitates the assessment of microbial diversity beyond the scope of typical MALDI-TOF MS pseudo-phylogenetic analyses. "Pseudo" refers to the fact that sequencing-based bacterial phylogenetic analyses typically rely on ribosomal protein similarity, whereas MALDI-TOF MS analyses rely on ionized protein similarity that includes but is not limited to ribosomal proteins. This allowed us to investigate differences in culturable bacterial pseudo-phylogenetic and SM populations between two E. fragilis samples collected close in location and time. Specifically, we examined (1) the extent to which readily culturable bacterial populations varied between the two sponges, and (2) patterns of SM variation within groups of closely related bacterial isolates.

RESULTS/DISCUSSION
Culturing the Eunapius fragilis microbiome Two samples of Eunapius fragilis were collected 1.5 km apart, in both a similar time window and life cycle, and are referenced herein as sponges "SCD18" and "SCD21". Both sponge samples were processed simultaneously, using identical protocols (see "Methods" section). As expected, some of the resulting isolation plates exhibited no discernible colonies or were overgrown with bacterial biofilms or cycloheximide-resistant fungus. For the remaining plates, every distinguishable colony was then isolated, resulting in 522 isolates from sponge SCD18 and 329 isolates from sponge SCD21 (total of 851 isolates). To compare bacterial diversity equally between the two sponge samples, only isolates derived from matching sample plates were included in our analysis (e.g., if a nutrient agar diversity plate was contaminated with a fungus, the plate with matching conditions from the second sponge was removed from analysis as well). This resulted in a total of 692 bacterial isolates that were advanced to our MALDI-TOF MS pipeline, where protein and SM spectra were acquired for three biological replicates of each isolate.
Documenting the pseudo-phylogenetic bacterial diversity within Eunapius fragilis To assess the taxonomic diversity and overlap between all 851 bacterial isolates from both sponge samples, IDBac was used to cluster isolates into a dendrogram (Fig. S1) by measuring the cosine similarity of MALDI-TOF MS protein spectra [45] and clustering with Ward's method [55,56]. The resulting groupings were evaluated for accuracy by subsampling isolates across the dendrogram and using 16S rRNA gene sequencing analysis to obtain genus-level identifications. Additional genus-level assignments were made by seeding MALDI-TOF MS protein spectra of previously identified isolates [38] into the dataset, see Data and Code Availability. Taxonomic assignments made by MALDI-TOF MS and 16S rRNA analyses complemented one another (Figs. S2 and S3).
In addition to confirming accurate clustering, these complementary techniques revealed that isolates spanned at least four phyla commonly associated with sponges: Proteobacteria, Actinobacteria, Bacteroidetes, and Firmicutes (Fig. S4). These phyla are responsible for greater than half of known microbially-produced SMs to date [32,57]. In total, 11 genera were identified in the dendrogram, though more were cultivated and remain unidentified (Figs. S1-S4). This initial organization of sponge isolates into a dendrogram allowed for more in-depth analyses, including comparison of recovered bacterial communities between the two sponges, and the overlap of SM features within phylotypes.
Measuring variation between freshwater sponge culturable bacterial populations using MALDI-TOF MS Unsupervised machine learning techniques such as hierarchical clustering are often used to group MALDI-TOF MS protein spectra of bacteria [39,44,45]. One method to create discrete groups from hierarchical clustering (visualized as a dendrogram) is to "cut" across the axis representing distance. However, a notable limitation in hierarchical clustering of MALDI-TOF MS spectra is its inability to consistently link last common ancestors at higher taxonomic ranks such as Family and above [38]. This limitation extends in some cases to highly diverse genera such as Streptomyces and Bacillus [38]. This is attributed to the widespread use of similarity measures (by all current intact-cell MALDI-TOF MS analyses, including IDBac) that cannot account for biologically based chemical modifications, however, minor, which result in a mass shift. Because MALDI-TOF MS spectrum similarity does not have a linear correlation with 16S rRNA similarity [45], inferred pseudo-phylogenetic relationships made using MALDI-TOF MS spectra are most accurate when comparing isolates at the species to subspecies levels [39,44,45,58]. In other words, MALDI-TOF MS dendrograms can accurately group isolates of the same species and subspecies, but often will not cluster higher taxonomic ranks together.
Groupings from hierarchical clustering of MALDI-TOF spectra are more accurate with an increasing number of isolates. For this reason, we performed initial clustering using all 851 isolates (Fig.  S1), but then pruned 159 isolates from non-matched diversity plates from the full dendrogram ( Fig. 1). This left a total of 692 recovered bacterial isolates that were subjected to comparative analyses.
The pruned dendrogram was cut 160 times at height intervals of 0.05, and for every cut, the composition of the resulting pseudo-phylogenetic groups (phylotypes) was tallied, whether groups contained isolates from SCD18, SCD21, or both (Fig. 2). Between cut heights ranging from 0 to 2 (roughly species level and below), most groups consisted of isolates from either SCD18 or SCD21, but not both. When the dendrogram was cut at a height greater than 2.5 (roughly genus level and above), groups largely consisted of isolates from both sponges, except for the Delftia group which consisted of isolates only from SCD18. This suggests that at the bacterial species level, most phylotypes were only present in either one sponge or the other. This conclusion was supported by observing similar results from hierarchical clustering using a different peak binning algorithm and distance measure (Fig. S5), and additional analysis by k-means clustering (Fig. S6).
Many of the clusters identified at the genus level in Fig. 2 could be defined by linkage heights between 2 and 3, while much of the variation between sponges occurred at or below a height of 2. To confirm that the different populations were not simply the result of experimental MS artifacts, a phylogenetic tree was created with SILVA Alignment, Classification and Tree (ACT) [59] from 16S rRNA gene sequences of isolates and closely related reference strains (Fig. S7). This revealed that identified genera were not composed of a single species. It also revealed that cuts of the protein spectra C.M. Clark et al. dendrogram at a height of approximately 2 and below (where variation between sponges was highest) represented true species differentiation, not simply MS artifacts.
These results are likely not representative of the relative abundances of in situ bacterial populations [60], rather they represent the populations of culturable bacteria that are easily accessible to researchers who engage in microbial drug discovery efforts. These results suggest that although the readily recoverable bacterial populations from two Eunapius fragilis samples collected in similar location and time windows shared many of the same genera, they varied significantly at the species, and possibly subspecies level. Since phylogenetic diversification is correlated  Fig. 1 IDBac dendrogram of culturable bacterial population from freshwater sponges Eunapius fragilis SCD18 and SCD21. Dendrogram of 692 sponge-derived bacterial isolates, grouped by MALDI-TOF MS protein spectra similarity. The occurrence of bacterial isolates as a function of growth condition was mapped onto the dendrogram and color coded by source sponge.
with differential SM production capacity, this information has great potential to drive future SM discovery efforts [61][62][63]. With this in mind, it was important to analyze whether isolates within similar phylotypes harbored overlapping SM mass features.
The relationship between phylotype and chemotype in bacteria cultured from Eunapius fragilis Since the re-isolation of known SM scaffolds from redundant entries in a microbial library is a major concern for researchers engaging in microbial drug discovery, it was necessary to document the degree of SM overlap within dendrogram groupings. When determining the overlap of SMs between cultured bacterial isolates, it is important to consider that many bacterial SMs are encoded by co-localized genes within biosynthetic gene clusters (BGCs). Within genera, the loss and/or functional replacement of BGCs are critical forces that drive SM diversity within and between bacterial communities [64,65]. This, in addition to transcriptional and translational regulation of BGCs [66], means that SMs may be observed frequently across multiple species in a given genus or may be specific to a single species. To determine the intra-genus SM variation among our freshwater sponge bacterial isolates, we generated metabolite association networks (MANs) from dendrogram groupings, an embedded function within IDBac. MANs were used to analyze MALDI-TOF SM spectra by linking isolates with shared mass features (Fig. 3) [39]. Larger colored circles represent individual bacterial isolates, while smaller gray circles are mass features in the spectra, which commonly represent SMs [39]. For this analysis a subset of six genus-level phylotypes were chosen based on their historic precedent for producing SMs (Streptomyces, Bacillus, Micromonospora, Pseudomonas; these account for over half of documented bacterial SMs to date) [32,57] or due to the abundance of recovered isolates from sponge samples (Chryseobacterium and Delftia; these  represented 45% of the cultured isolates obtained from both sponges). Representative isolates from these six genera were identified either by matching MALDI-TOF MS spectra to spectra from characterized isolates in our in-house library and/or 16S rRNA gene sequence analysis (Figs. S2 and S3).
Across the dataset, the largest observed influence on SM spectrum similarity was protein spectrum similarity, which is a proxy for phylogenetic relatedness [43,48]. Positive Pearson correlation coefficients between protein and SM spectra for the six groups mentioned are displayed in Fig. S8, in addition to pairwise comparisons between all isolates in Figs. S9 and S10. This phenomenon is clearly demonstrated in the MAN of the Pseudomonas dendrogram grouping (Fig. 3a), which contained sub-groups of isolates that exhibited differential SM production patterns and correlated directly to breaks in the dendrogram (Fig. 3b). This correlation aligns well with other recent studies [61,62,[67][68][69] and highlights that species-level phylogenetic identity can predict SM diversity.
Some groupings were exceptions to the previous example and did not show a high correlation of protein and SM spectra similarities at the species level. For example, isolates within the Chryseobacterium MAN (Fig. 3b) showed significant SM overlap that was less correlated to protein spectrum similarity ( Fig. 3b and  S8). Possible explanations for this include (1) SM BGCs are relatively conserved within the Chryseobacterium genus and/or this set of isolates (recent preliminary evidence supports this, though extensive investigation [70]); (2) Chryseobacterium isolates exhibit fewer expressed SMs under these cultivation conditions compared to the Pseudomonas isolates, and as a result provide fewer opportunities to observe differentiation; (3) different resolution of the MALDI-TOF MS protein pseudo-phylogeny was achieved between Chryseobacterium and Pseudomonas isolates. However, MALDI-TOF analysis, which specializes in species/ subspecies differentiation, delineated several distinct groups of Chryseobacterium isolates below the level of genus (Figs. S12 and S14). Observing varying relationship of phylotype to chemotype (Figs. S8 and S11a) highlights the need to resolve isolates using orthogonal methods, such as protein and SM spectra similarity.
Lastly, SM variation as a function of source sponge was analyzed within the six genera (Fig. S11b). When analyzing isolates within each genus-level dendrogram grouping, MANs did not exhibit any noticeable partition of SM mass features according to their source sponge, indicating that if a bacterial species occurred in both sponges, it was likely to produce similar suites of SMs. This indicates that in the dataset analyzed, phylogeny was a better prediction of SM production similarity than sample source.
The positive correlation of phylogeny with SM production, though generally true, is not sufficient to fully annotate the extent to which SMs are distributed within bacterial taxa. This principle was recently demonstrated on large-scale analyses of the human gut microbiome [71]. More complex correlations between phylotype and SM production that align with this principle are exhibited in each of the MANs of Micromonospora, Streptomyces, Bacillus, and Delftia groups (Fig. S11a). To gain greater insight of SM variation within a large bacterial library, both phylogenetic and metabolomic analyses should be undertaken. A linear-mode-only MALDI-TOF MS instrument (limited to acquiring protein spectra) is sufficient to provide species to subspecies resolution of a bacterial isolate collection, and a dendrogram of those isolates may be a strong indicator of SM variation. However, to observe sub-genus SM variation, linear and reflectron (SM analysis) mode MALDI-TOF MS is required.
A major obstacle to current microbial drug discovery campaigns is the accumulation of redundant isolates in a library resulting in highly overlapping SM populations. Phylogenetic and SM redundancy greatly reduce drug discovery efficiency, particularly in resource-limited academic laboratories. One cause of this limitation is that these campaigns have cultivated relatively small portions of host microbiomes. Our findings show that morphological-based colony isolation tends to be reductionist, as it does not take into consideration the complexities of vertical and horizontal transfer of SM BGCs. For example, SM populations in a drug discovery library could be highly redundant if researchers incorporated multiple isolates from the sponge Chryseobacterium group in Fig. 3b. Alternatively, unique SM populations could be overlooked in the case of the Pseudomonas group, where although the gross SM mass features appear largely homogeneous, many of the MAN subgroupings contain unique mass features that correspond to a SM that is highly specific to a given bacterial isolate, and may hold a unique ecological purpose and biological activity. These complexities highlight the need to consider both phylogeny and SM production on a case-bycase basis.

CONCLUSIONS
While metagenomic analyses would afford a more complete, though likely nonoverlapping [33][34][35][36] representation of the taxa that comprise the sponge holobiome, the current manuscript aimed to analyze culturable microbial populations that are often studied by researchers that work in specialized metabolism, chemical ecology, and drug discovery. It is important to study this culturable population since it is the source of nearly all natureinspired microbial metabolites in clinical, industrial, and agricultural use [72,73]. The current study revealed that MALDI-TOF MS and IDBac are valuable tools for post-hoc analyses of culturable bacterial isolates. We demonstrated that within two sponge samples of the same species, the culturable bacterial populations contained significant overlap in approximate genus-level phylotypes but mostly nonoverlapping populations of isolates when grouped lower than the level of genus. Further, correlations between bacterial phylotype and SM production varied at the species level and below, suggesting SM distribution within bacterial taxa must be analyzed on a case-by-case basis. Our results suggest that two Eunapius fragilis freshwater sponges collected in similar environments can exhibit large culturable diversity on a species-level scale, thus researchers should scrutinize the isolates with analyses that take both phylogeny and SM production into account to optimize the chemical space entering into a downstream bacterial library. Depending solely on 16S rRNA gene sequencing, MALDI protein spectra, or morphological analyses will not fully inform researchers of the SM potential of recovered isolates, and this will lessen the downstream efficiency within a SM discovery program.

MATERIALS AND METHODS Sample collection and processing
Samples of Eunapius fragilis were collected by citizen science divers using SCUBA. Sponge SCD18 was collected July 15, 2015 from a depth of 21 m, off of the wooden substrate of the Vickery shipwreck (-76.0191, 44.28025). Sponge SCD21 was collected June 17, 2015 from a depth of 21 m, off of the wooden substrate of the Iroquois shipwreck (−76.0055, 44.28025). Samples were scraped into 50 mL sterile centrifuge tubes and shipped overnight to the laboratory where they were processed immediately, at room temperature. Following removal of any macro-debris, sponges were rinsed five times with autoclaved lake water and sectioned to include approximately equal volumes of all anatomic parts, totaling 1 cm 3 of material from each sponge. Samples were then placed, separately, in a sterilized mortar, 10 mL of sterilized 20% glycerol was added (allowing both short-term dilution and long-term cryopreservation), and ground thoroughly with a sterilized pestle for two minutes. The resulting solution was separated into 500 μL aliquots. Two pretreatment conditions (with and without heat treatment), three dilution series, and nine agar-based nutrient plate formulations generated 108 microbial diversity plates. Heat treated aliquots were held at 54°C for 9 min to select for spore-forming species. Dilution series of 0, 1/10, and 1/100 were created. Each agar media type contained 28 μM cycloheximide (for media recipes, see supplemental C.M. Clark et al. information). For each preparation condition 50 μL sponge/glycerol solution was spread evenly over the surface. Agar/nutrient plates were sealed with Parafilm ® and left at room temperature in the dark; these are referred to as "diversity plates". To capture more bacterial isolates from low-nutrient/low-dilution diversity plates, colonies were isolated over a longer period (four months) and plates were regularly checked to prevent overgrowth. A total of eight diversity plates immediately overgrew with biofilms, while fungus was prevalent and overgrew in 39, despite including cycloheximide in the isolation media. These plates and their counterparts that were not overgrown in both samples were removed from the dataset to allow for direct sample comparison from a total of 61 diversity plates. See "Data and code availability" for a link to matching figures for both the full and reduced datasets.

Sponge identification
Classic taxonomic morphological analysis was carried out at the genus and species level. A set of diagnostic macro-and micro-morphotraits was focused for diagnosis (i.e., growth form, consistency, skeletal architecture, traits and dimensions of skeletal megascleres and microscleres such as siliceous spicules, gemmuloscleres morphs, and architecture of gemmules such as resting stages [9,10]. Morphometries were performed on gemmules and spicules of each spicular type. Samples were studied by stereomicroscope for macro-traits. For skeletal analyses, representative body fragments were dissected by hand, processed by dissolution of organic matter in boiling 65% nitric acid, and rinsed in tap water. After decantation, cleaned spicules were suspended in ethanol, dropped onto slides, and glued in Eukitt with a cover slide as permanent preparations. The spicular complements were compared with recent and historical literature and with registered collections. Both samples were ascribed to Eunapius fragilis (Porifera: Demospongiae: Spongillida: Spongillidae). This species is common in inland water and widespread from almost all Terrestrial Ecoregions of the World (Nearctic, Palaearctic, Neotropical, Afrotropical, Oriental, and Australian). In the Nearctic Region this species was recorded from Canada and the United States (Colorado, Connecticut, Florida, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Michigan, Minnesota, Montana, Newfoundland, New Jersey, New Scotland, New York, Ohio, Pennsylvania, Texas, Wisconsin, and Wyoming).
The main morphotraits of the species are the following: Growth form encrusting, variably thick. Consistency is notably soft and fragile in both in vivo and dry conditions. Color is whitish to greenish. Surface is slightly hispid likely due to erected spicules. Oscules are not conspicuous in vivo, and are scattered in a network of subdermal canals. Skeletal network is made of siliceous monaxial spicules and collagen. Megascleres are smooth oxeas (160-270 × 4-15 μm). Microscleres are absent. Gemmules are subspherical (300-450 μm). Gemmuloscleres are straight to slightly curved spiny strongyles to strongyloxeas (35-140 × 3-8 μm) and are irregularly tangential and embedded into the gemmular theca. The habitat consists of a wide range of lentic and lotic habitats.

MALDI-TOF MS analysis
Isolates were purified onto high nutrient agar plates (A1) and biological replicates (three colonies of each isolate) were separately applied, as a thin smear, to a 384-spot MALDI target plate (Bruker Daltonics, Billerica, MA, USA) using a sterile toothpick. As we compared protein and SM spectra, it was important to calibrate spectra and perform manual inspections to ensure correlation wasn't due to experimental design/batch effects. MALDI-TOF MS settings can be found in previous IDBac publications [38,39], including in a written and video protocol [40]. A minor modification was the shift in the analysis range to 4-20 kDa as opposed to 3-20 kDa, based on recent work by Strejcek et al. [45] IDBac relies on a number of R packages including mzR [74] and MALDIquant [75,76].

16S-rRNA analysis
DNA from 361 bacterial isolates was extracted using a DNeasy UltraClean microbial kit (Qiagen). The 16S rRNA gene was amplified using 27F (5′-CAGAGTTTGATCCTGGCT-3′) and 1492R (5′-AGGAGGTGATCCAGCCGCA-3′) [77] primers using polymerase chain reaction (PCR) under the following conditions: initial denaturation at 95°C for 5 min; followed by 35 cycles of denaturation at 95°C for 15 s, annealing at 60°C for 15 s, and extension at 72°C for 30 s; and a final extension step at 72°C for 2 min. PCR products were purified using a QIAquick PCR purification kit from Qiagen, and the amplicons sequenced by Sanger sequencing. Data were analyzed by Geneious V11.1.4 software and genus assigned using the SILVA ACT Service [59].

DATA AVAILABILITY
MALDI-TOF MS data were deposited in MassIVE (https://doi.org/10.25345/C5F84T, accession: MSV000087941). The IDBac database and MALDI-TOF MS spectra of previously identified bacteria used for spectra matching and identification are available from MassIVE (https://doi.org/10.25345/C5261K, accession: MSV000083461). Partial and full 16S-rRNA sequences of environmental bacteria used in this study were deposited in GenBank with the accession numbers MT596540 to MT596560.