Multiplex amplicon sequencing for microbe identification in community-based culture collections

Microbiome analysis using metagenomic sequencing has revealed a vast microbial diversity associated with plants. Identifying the molecular functions associated with microbiome-plant interaction is a significant challenge concerning the development of microbiome-derived technologies applied to agriculture. An alternative to accelerate the discovery of the microbiome benefits to plants is to construct microbial culture collections concomitant with accessing microbial community structure and abundance. However, traditional methods of isolation, cultivation, and identification of microbes are time-consuming and expensive. Here we describe a method for identification of microbes in culture collections constructed by picking colonies from primary platings that may contain single or multiple microorganisms, which we named community-based culture collections (CBC). A multiplexing 16S rRNA gene amplicon sequencing based on two-step PCR amplifications with tagged primers for plates, rows, and columns allowed the identification of the microbial composition regardless if the well contains single or multiple microorganisms. The multiplexing system enables pooling amplicons into a single tube. The sequencing performed on the PacBio platform led to recovery near-full-length 16S rRNA gene sequences allowing accurate identification of microorganism composition in each plate well. Cross-referencing with plant microbiome structure and abundance allowed the estimation of diversity and abundance representation of microorganism in the CBC.

Scientific RepoRts | 6:29543 | DOI: 10.1038/srep29543 interactions for their growth 10,13 . In such cases, the strict mutual dependence within microbes may drastically reduce their growth in culture media thus lessen the chance that it would be represented in collections obtained using pure colonies pickings. To this point, there is a need for new methods to isolate microbes, store them and annotate their identification in a high-throughput manner, preserving potential microbe-microbe interactions that is lost when culture collections are constructed based on axenic colonies.
Although technologies for microbial surveys have rapidly evolved, the same did not occur for methods of isolation and identification of microbial communities. This impairment between microbial surveys and isolation procedures constitute a barrier for microbiome research imposed by the lack of methods for systematically store microbiota culture collections 11,12,14 . Also, it is important to have the possibility to cross-reference culture collections with culture-independent community analysis which is an important step to explore fundamental aspects of microbial studies such as estimate microbial recovery of isolation procedures, perform functional screenings and study host-microbe interactions 11 .
Here we advance the concept of community-based culture collections (CBC) that is based on the isolation of colonies containing single or multiple microorganisms, and store them in a single plate well. The rationale is to bypass the laborious work of pure culture isolation and to concentrate on relevant insights based on community analysis, microbe-microbe and microbe-host interactions. The CBC approach may be useful in case of need to recover larger fractions of microbiota from a given environment to preserve putative microbe-microbe interactions. If microbes present in a given plate well appear to be of interest, pure colonies can be easily obtained, or their interactions further studied. Similar methods based on isolation of communities instead of axenic colonies have recently been reported 15 . We also presented a method to annotate the CBC through a multiplex strategy for pooling and sequencing near-full-length 16S ribosomal gene amplicons. This approach allows precise identification of CBC microorganisms that can be directly compared with culture-independent community analysis for the estimation of microbial community recovery and selection of microbial candidates. The method presented here is suitable for the construction of culture collections from any environment in a high-throughput and large-scale manner.

Results and Discussion
Community-based culture collection (CBC) of sugarcane. The method described here was developed when preparing a sugarcane CBC. The sugarcane CBC was prepared by picking colonies from primary platings of enriched microbial fractions of the rhizosphere, endophytic root and endophytic stalk of developing plants. Microbial samples were diluted and plated on three different culture media supplemented with sugarcane syrup. Colonies were picked regardless if they were single or multiple (community) microorganisms. A total of 5,137 colony communities were picked and stored in 96-well plates.
Multiplex amplicon sequencing for microbe identification. In order to identify and annotate the CBC near-full-length microbial 16S rRNA amplicons were sequenced from 96-well plates pooled down to a single tube using the PacBio RS sequencing platform (Fig. 1a). The use of longer 16S sequences enhances the quality of The multiplex strategy allows pooling of libraries from more than one 96-well plate in one tube for sequencing of the near-full-length gene using a single run of the PacBio RS platform. (a) A two-step PCR for library preparation. The first step adds unique plate barcodes and the second step, rows and columns barcodes. (b) The combination of plates, rows and columns barcodes allows tracing back the exact well of origin from each sequence. Highlighted wells are kept empty and allow us to use them as positive (H10) and negative (H11 and H12) controls during PCR reactions and libraries preparation. bc P : plate barcode. bc R : row barcode. bc C : column barcode.
Scientific RepoRts | 6:29543 | DOI: 10.1038/srep29543 biological information, by either improving the depth of taxonomic assignment 16,17 and comparison with available full-length 16S rRNA databases 18,19 . Two-step PCR amplifications were designed to add individual barcodes for the identification of each plate and well using primers for the 16S rRNA gene based on the V3-V9 region ( Supplementary Fig. 1a). The first PCR step amplifies the 16S rRNA gene and adds specific barcodes for each 96-well plate ( Supplementary Fig. 1b). First step PCR-barcoded amplicons were pooled in a single 96-well plate and used as templates for the second PCR step, which adds specific barcodes for columns and rows ( Supplementary  Fig. 1b). The second PCR-barcoded amplicons were pooled into a single tube. The pooled amplicons were sequenced on the PacBio RS platform because it conveniently allows the retrieval of the near-full-length sequence of the 16S rRNA genes. The multi-tagging method (Fig. 1b), coupled to PacBio SMRT sequencing, allows tracing back individual sequences to their original wells regardless of whether the well contains one or multiple microorganisms. The method was developed using five 96-well plates from the sugarcane CBC. A bioinformatics pipeline was established for the sequence demultiplexing, removal of chimeric and nonspecific sequences, clustering into operational taxonomic units (OTUs) and their taxonomical assignment ( Supplementary Fig. 2). Circular consensus sequence (CCS) accuracy and OTU assignment. The intrinsically low raw-read quality of PacBio was overcome by generating circular consensus sequences (CCSs). Since errors are randomly distributed along raw sequences, error rates are almost eliminated by the CCS processing, and the higher the coverage of CCSs, the higher their accuracy [20][21][22] . Data accuracy was evaluated by assembling CCSs using a minimum value of 2× coverage. The per-base error rates of assembled CCSs were assessed by comparing CCSs generated from positive control wells that contained 16S rRNA amplicons from E. coli to a 16S reference sequence from this organism ( Supplementary Fig. 3a). The average per-base error rate was 2.3% for CCSs with coverage lower than 5× , 0.7% for coverage from 5 to 9× , and 0.3% for CCSs with coverage higher than 9× ( Supplementary Fig. 3b). Notably, 74.7% of the dataset was constituted by CCSs with coverage greater than 9× ( Supplementary Fig. 4a) and showed the expected amplicon lengths ( Supplementary Fig. 4b).
We then investigated if the low-coverage CCSs could lead to a misidentification or produce false positive OTUs. The assumption was that OTUs called by high-coverage assembled CCSs have a low probability of being false positives because of their high sequence accuracy. Low-coverage CCSs have a greater chance of containing more errors and therefore could lead to singleton OTUs that do not represent true biological sequences. We first addressed this question by looking for sequence similarity through pairwise alignment of CCSs. Based on quality accuracy, CCSs were arbitrarily classified as low-(< 10× ) and high-coverage (≥ 10× ). We found that the majority of low-coverage CCSs align with at least one high-coverage CCS, which raises their reliability and lowers the chance that they represent false and non-biological sequences (Fig. 2a). We investigated if it directly reflects in OTU clustering by looking how many OTUs were formed by the clustering of high-coverage CCS, low-coverage CCS or a combination of both. A total of 361 out of 1,003 identified OTUs presented at least one low-coverage CCS clustered with high-coverage CCSs. Hence, they could more confidently be assigned as true OTUs. Among 558 singleton OTUs, 335 were called from one low-coverage CCS and 223 from one high-coverage CCS ( Supplementary Fig. 5).
Reliability filter for removal of error-prone sequences. Traditionally, quality filters based on CCS coverage are used to obtain highly accurate sequences. However, our analysis indicates that removing sequences based on coverage is not an adequate method to remove error-prone CCSs (Fig. 2b) because it could potentially lead to the loss of relevant biological information. The trade-off between accuracy and throughput should be determined for each application. In addition, considering that preserving information is highly desirable, we elaborated a strategy to maximize the number of sequences with high confidence to represent true biological sequences (reliable sequences). The rationale involved filtering the data based on the reliability of the CCS, rather than solely on its coverage. A CCS could be considered reliable if (1) it is above a threshold of similarity to a sequence deposited in a curated database and/or (2) if it is above a threshold of similarity to any other sequence within the dataset. The first case comprised instances in which the microorganism was already identified and had its gene sequences deposited in a database. The second one assumed that within the collection there might exist redundancy, i.e. more than one well having the same microbe. Thus, the fact that a sequence appeared more than once in the collection, especially if called by high-coverage CCSs, makes it reliable.
After sequence demultiplexing, all chimeric, non-specific and larger than expected sequences were removed ( Table 1). Out of 11,750 demultiplexed CCSs, a total of 12.3% and 0.14% represented chimeras and nonspecific sequences, respectively (Table 1). From the remaining CCSs (usable CCS) we were able to recover a total of 375 wells out of the expected 480 (78.1%; Table 1). Unrecovered wells might have occurred due to the lack of microorganism growth, low DNA concentration or low primer specificity. Usable CCSs were filtered by reliability ( Supplementary Fig. 6a), and a total of 312 CCSs were discarded after being classified as error-prone ( Table 1). The reliability filter allowed us to retain 22,6% of the total usable CCSs that would have been discarded if a filter based on coverage had been applied (Table 1 and Supplementary Fig. 6b). As isolates in the sugarcane CBC might contain multiple microorganisms, the gain in the number of CCSs per well is advantageous for identification of microbes and also for further community analysis.

Microbial identification and cross-reference with culture-independent community analysis.
We next evaluated the number of identified OTUs per well. Over half of the wells showed two or more OTUs (Fig. 2c). The high number of wells that contained more than one identified microorganism can be considered as a proof of the concept for the method described in this work ( Supplementary Fig. 7 and Supplementary Table 1). Considering the identification at the genus level with a confidence score of >0.95 given by UTAX (www.drive5. com/usearch/manual/utax_algo.html) we counted 34 distinct genera in the five plates tested (Supplementary  Table 2). Together, the four most representative genera (Bacillus, Chitinophaga, Rhizobium and Burkholderia) accounted for over 56% of the total OTUs assigned in the five plates. A total of 206 OTUs could not be assigned at the genus level.
Finally, we estimated the extent of the sugarcane microbiota diversity that we recovered using the described method. The OTUs identified in the five CBC plates were cross-referenced with the 20,731 OTUs from the sugarcane microbiota previously identified by culture-independent community analysis 23 (Supplementary Note). The cross-referencing was performed by sequence alignment using USEARCH 24 with a threshold of 97% identity. This analysis revealed that the sugarcane CBC contains members of the most abundant microbial genera of the root (Fig. 3a and Supplementary Fig. 8) and endophytic stalk sugarcane organ core microbiomes 23 (Fig. 3b). Identified isolates comprised OTUs that account for 13.3%, 14.8% and 29.1% of the total bacterial relative abundance in sugarcane rhizosphere, endophytic root and endophytic stalk, respectively (Fig. 4). Although we aimed to isolate root and stalk endophytic microorganisms, our collection contained microorganisms that represent 12.2%, 39.5% and 50.8% of the exophytic stalk, exophytic leaf and endophytic leaf diversity, respectively (Fig. 4). Furthermore, most of the isolates represent groups of bacteria that have never been studied regarding their biological role in association with plants ( Fig. 3a,b and Supplementary Fig. 8).

Conclusions
The biotechnological use of environmental microbiota requires methods for construction and identification of large microbial culture collections. Conventional techniques for isolation and identification of microorganisms are based on axenic cultures obtained by several steps of picking and streaking colonies. These methods are hardly scalable for massive isolation of microorganisms and do not allow cross-referencing with data from culture-independent community analysis. Here we presented an alternative for high-throughput identification of microorganisms in a culture collection constructed by colony picking of primary platings. These colonies may contain single of multiple microorganisms, and for this reason it was called community-based culture collection (CBC). The identity of the microorganisms in each well was assigned using multiplex sequencing of amplicons that allowed tagging plates and wells. The generated data could be cross-referenced with culture-independent community analysis to estimate precisely the microbial recovery. The method can be used for annotation of culture collections of any environment as it overcomes the major limitation of current microbiome research related to microbiota culture collections.

Material and Methods
Construction of a sugarcane community-based culture collection. The microorganisms used in this study represent a sample of a sugarcane community-based culture collection (CBC). This culture collection was constructed from microorganism-enriched pellets prepared from the rhizosphere, endophytic root and endophytic stalk compartments 23   The sugarcane community profile has identified a core bacterial community comprised of less than 20% of the total microbial richness but accounting for over 90% of the total microbial relative abundance 23 (Supplementary Note). Because of its relevance, we have highlighted the recovery for the core microbiome. OTUs in five culture collection plates were cross-referenced against core community profile from root (rhizosphere and endophytic root) and stalks (endophytic bottom, medium and superior region of stalks) from sugarcane plants. The identified isolates comprised OTUs that account for 14.2% and 29.1% of the total bacterial relative abundance in sugarcane roots and endophytic stalk, respectively. Cladograms were constructed at the genus level based on taxonomy assignment. (a) Endophytic root and rhizosphere microbiota. The cladogram was simplified to show only five phyla; genera with no representatives in the collection were collapsed (see Supplementary Fig. 8  Primer design. The 16S rRNA V3-V9 gene region ( Supplementary Fig. 1a) was amplified using 341f 25 and 1492r 26 primers. For the first PCR step, plate barcodes were added to the 1492r primer (first PCR; Supplementary  Fig. 1b). Primers for the second step were designed with MT-FS region and Illumina barcode sequences (Nextera sequences) for columns and rows (second PCR; Supplementary Fig. 1b). A two-step library preparation is a standard protocol used for 16S community analysis and provides reliable sequences 27 . Additionally, we used Illumina sequences (Nextera sequences) to give flexibility regarding sequencing platform. Using these primers, amplicons are compatible with both Illumina SBS chemistry and PacBio platform. Both the first and second PCR forward and reverse primers are listed in the Supplementary Data 1. All of the primers were obtained from Integrated DNA Technologies (IDT, Coralville, IA, USA).
First PCR step. PCR reactions were performed in 96-well plates (Fig. 1a). The reaction was designed to amplify the 16S rRNA V3-V9 gene region with unique plate barcodes. The reverse primer contained an overhang with a 9-bp barcode sequence ( Supplementary Fig. 1b). To minimize the pipetting variation of small volumes, master mixes were prepared for 100 reactions. The master mix was prepared using the KAPA2G Robust PCR Kit Pooling and purification of the first-PCR amplicons. The amplicons of the five 96-well plates were pooled by transferring, with a multichannel pipette, 2 μ L of each well of first-PCR reactions into a single 96-well plate (Fig. 1a). The final 10-μ L volume was purified with Agencourt AMPure XP Beads (Beckman Coulter, Brea, CA, USA) according to the manufacturer's instruction, at a bead-to-DNA ratio of 0.6:1. Purified PCR products were resuspended in 30 μ L of sterile deionized water and stored at − 20 °C.
Second PCR step. The pooled first-PCR amplicons were used as templates for the second PCR step to add unique barcodes for rows and columns in a 96-well plate format (Fig. 1a). The PCR reactions were performed using forward and reverse primers from Nextera XT DNA Library Prep Kit (FC-121-1012; Illumina, San Diego, Data processing. The raw sequence data generated by the PacBio RS II sequencing was assembled in circular consensus sequences (CCSs) using the RS_ReadsOfInsert protocol from the PacBio's SMRT Portal v2.1.1 and the parameters of "minFullPasses 2" and "minPredictedAccuracy 90". The read coverage and length distributions are shown in Supplementary Fig. 4a,b, respectively. Data processing was automated using in-house implemented scripts. The scripts used for data processing are available in the GitHub repository (https://github.com/ saccharome/CBC). The data analysis considered only CCSs with at least one hit against the Greengenes 18 high-quality full-length 16S rRNA sequences database (May 2013 release) and/or the CCS dataset. CCSs that did not meet these criteria were considered error-prone and were discarded.
CCS coverage. The CCSs coverage was calculated as the sum of nucleotides of all subreads related to the CCS (i.e. after removing SMRTbell adapters from raw sequences) divided by the length of the assembled CCS. Numbers were rounded to the nearest integer value.
Demultiplexing. CCSs were demultiplexed according to their PCR barcodes. The UBLAST algorithm (www.drive5.com/usearch) was used to identify sequences of both forward and reverse Nextera transposase and non-barcoded 1492r primer sequences (14, 15 and 16 nucleotides, respectively). Based on their location, the positions of barcodes were determined. The processing pipeline accepted a maximum of 3 mismatches for the alignments. The high stringency was chosen to ensure the correct traceability of sequences, and CCSs with at least one unassigned barcode were discarded. Barcodes identified on their predicted positions were considered consistent. The "search_oligodb" program was used to assign the barcodes with no mismatch. Both programs used are from the software package USEARCH 24 .
Sequence reliability filtering. CCSs were aligned to the Greengenes 18 database using "usearch_global" from USEARCH 23 software package with a 97% identity threshold ( Supplementary Fig. 6a). CCSs with no hit against the database were then aligned to the original CCS dataset using the same parameters. CCSs with no hit against both databases were considered error-prone and discarded.
Chimera and nonspecific sequences removal. Sequences that contained fragments from different templates were considered chimeras. Chimeras were removed using the UCHIME 28 program against a high-quality chimera-free reference database (www.drive5.com/uchime/rdp_gold.fa). CCSs larger than expected (> 1,600 base pairs) were also discarded as they may also represent chimeras 21 . The dataset that contained cleaned sequences was aligned to the most recent version of the Greengenes 18 database using the "usearch_global" program from the USEARCH 24 software package. Sequences with no hit or identity lower than 75% were discarded.
Clustering. CCS sequences were clustered within each well individually. Clustering was performed with a 97% identity threshold using UPARSE 29 pipeline in a Perl script to automate the process. OTUs from this step were used for taxonomical classification, cross-referencing of culture collection with culture-independent community analysis and recovery estimate. In order to access the redundancy of microorganisms, OTUs were re-clustered at 97% identity in a second step, using the above-mentioned parameters.
Cross-referencing of culture-dependent and -independent community analysis and recovery estimate. An adapted culture-independent community profile of sugarcane root, leaf, and stalks using V4 16S ribosomal gene 23 (Supplementary Note) was used for cross-referencing between culture-dependent and -independent community analysis. Sequences were globally aligned against OTUs from culture-independent data using "usearch_global" from USEARCH 24 software package and only hits with ≥ 97% sequence identity were considered. Recovery estimate was calculated by summing the relative abundance of OTUs from the culture-independent data that had at least one hit against a CCS from the culture collection. Cladograms and trees for comparison of culture-dependent and -independent data were constructed using GraPhlAn 30 .