Multi-marker metabarcoding of coral skeletons reveals a rich microbiome and diverse evolutionary origins of endolithic algae

Bacteria, fungi and green algae are common inhabitants of coral skeletons. Their diversity is poorly characterized because they are difficult to identify with microscopy or environmental sequencing, as common metabarcoding markers have low phylogenetic resolution and miss a large portion of the biodiversity. We used a cost-effective protocol and a combination of markers (tufA, 16S rDNA, 18S rDNA and 23S rDNA) to characterize the microbiome of 132 coral skeleton samples. We identified a wide range of prokaryotic and eukaryotic organisms, many never reported in corals before. We additionally investigated the phylogenetic diversity of the green algae—the most abundant eukaryotic member of this community, for which previous literature recognizes only a handful of endolithic species. We found more than 120 taxonomic units (near species level), including six family-level lineages mostly new to science. The results suggest that the existence of lineages with an endolithic lifestyle predates the existence of modern scleractinian corals by ca. 250my, and that this particular niche was independently invaded by over 20 lineages in green algae evolution. These results highlight the potential of the multi-marker approach to assist in species discovery and, when combined with a phylogenetic framework, clarify the evolutionary origins of host-microbiota associations.

After collection, the coral tissue was removed with pliers and razor blades and samples were stored in RNAlater or 100% ethanol. The DNA was extracted using either: (1) a modified phenol-chloroform protocol -see Cremen et al. (2016); our only modification was to use phenol/chloroform/isoamyl alcohol [25:24:1] (instead of chloroform/isoamyl alcohol only) in the first extraction step. Or (2) the Wizard Genomic DNA Purification Kit (Promega). More amplification success and DNA yield was obtained using the kit with a small modification: incubating a piece of coral skeleton (ca. 80mm 3 ) in the lysis buffer for 3 hours without grinding the sample, then proceeding with the manufacturer's instructions for plant tissue.

Library preparation:
The amplicons and respective primers used here were: 16S rDNA: We used either the 515f/806r (Caporaso et al., 2012;Gilbert et al., 2014) or the S-D-Bact-0341-b-S-17/S-D-Bact-0785-a-A-21 (Klindworth et al., 2013) primer pairs to PCRamplify this marker. The amplicons generated by these 2 primer pairs overlap in the V3-V4 region, so we used only this overlapping region -the sequence length after trimming primer sequences was, on average, 225 base pairs.

23S rDNA:
We used the algal specific primer pair p23SrV_r1/p23SrV_f1, which PCR-amplifies the Universal Plastid Amplicon (Presting 2006). tufA: We used primers tufAR (Fama et al., 2002) and a forward primer designed here for Ostreobium (Oq-tuf: ACN GGN CGN GGN ACN GT), which has several ambiguous bases in order to amplify a larger range of green algae species. for the ribosomal DNA markers. Annealing temperature was set at 50°C for primer pair 515f/806r, 55°C for p23SrV_r1/p23SrV_f1 and S-D-Bact-0341-b-S-17/S-D-Bact-0785-a-A-21 and 60°C for NF1/18Sr2b. Because tufA is a coding gene, it has higher mutation rates (especially at 3 rd codon positions) when compared to ribosomal DNA, therefore a touchdown step and a lower annealing temperature is required (55-48°C for 14 cycles followed by 24 cycles at 48°C).

Supplementary
Unspecific amplification does occur, but those are excluded in the analysis pipeline (e.g. steps 5, 6 and 9 of the pipeline).
For the second PCR we used the following conditions: initial denaturation step at 94°C for 5 min, followed by 8 cycles of denaturation (94°C at 30 s), annealing (55°C at 30 s) and extension (72°C at 30 s) and a final extension step at 72°C for 5 min. We purified the samples using home-made magnetic beads as described in Rohland and Reich (2012) (Magoč and Salzberg, 2011).
5. Quality control: filter merged reads based on a quality threshold (average of 35 per merged read) using PRINSEQ (Schmieder and Edwards, 2011).
6. Trim primers from merged reads. Sequences that do not meet a minimum length threshold and/or do not have the exact primer sequence at the 3' and 5' ends are excluded from analysis in order to ensure quality (i.e., the sequence belong to the target gene) and global trimming (i.e., they start and end at the same position).
7. Format reads' identification and generate one file per gene containing all samples.
8. Run UPARSE pipeline (Edgar, 2013): dereplication, sort by size, cluster OTUs and produce OTU map. We chose UPARSE because other available software (e.g. Qiime and Mothur) seem to significantly overestimate the number of OTUs (Edgar, 2013). Based on the divergence of tufA among Bryopsidales we used a similarity threshold of 98% for OTU clustering in this marker, which is a conservative threshold for species level (i.e., most Bryopsidales species are more similar than that, so at 98% the OTUs will be somewhere between species and genus level).
We choose the 97% threshold for the other markers for two main reasons: 1) there is not enough information in literature about the rDNA markers similarity among Bryopsidales species, on the contrary, it is known that they do not have phylogenetic signal to distinguish them; 2) our aim was to compare how the normally used markers (with their commonly used thresholds) perform in distinguishing algae species.
9. Alignment: we used PyNAST (Caporaso et al., 2010a) to align the 16S and 18S rDNA sequences. This aligner requires a reference database with aligned sequences and lots of gaps in the alignment. Due to the lack of such reference databases for 23S rDNA and tufA, we chose MAFFT (Katoh et al., 2002) to align 23S rDNA and tufA. The OTUs that failed to align were excluded from downstream analysis.
10. Assign taxonomy using the Naïve Bayesian Classifier (RDP) implemented in Qiime (Wang et al., 2007;Caporaso et al., 2010b). We used Greengenes and SILVA databases for the 16S and 18S rDNA sequences respectively. In order to produce an RDP-friendly database for the 23S rDNA and the tufA, we downloaded reference sequences from Genbank, used a phylogenetic similarity threshold (based on a UPGMA tree) to equalize the dataset (i.e. exclude repetitive species, which will bias the RDP classifier) and produced the reference dataset (one file with the sequences and another with taxonomic ranks). We used RDP taxonomic assignments to: i) infer the abundance of reads assigned to the main microbial groups ( Figure 2); and ii) pre-filter OTUs to build a green algae phylogenetic tree: OTUs that were not classified as "Eukaryotic" (or "Chloroplast" in the 16S) were excluded from the phylogenetic analysis.
11. Filter OTUs found in negative controls (mock extractions and negative control PCRs). Although virtually no DNA was detected (with Qubit) in these controls, we added those samples to our library in order to detect any possible contaminant. But apart from cross contamination, sequencing errors can yield false-positives. Therefore some OTUs found in the controls could be, for example, the most abundant Ostreobium sequences which should not be excluded. So we filtered the OTUs present in the controls, but only if they would represent less than 1% of the total number of reads. For the 18S dataset, OTUs matching Cnidaria and Dinophyceae were considered contaminants from the coral tissue and were also removed from the analysis.
13. Filter rare OTUs (less than 5 reads) and produce final filtered OTU fasta file. This is the input for the phylogenetic analysis.
14. Produce final OTU table and statistics. These statistics are another sort of quality check, and the OTU table is necessary for beta-diversity/comparative analysis (which we do not do in this study). From here one can check the sequencing depth and proceed to Qiime's core_diversity_analysis.py, for example.

Phylogenetic analysis:
The short reads generated by high-throughput sequencing technologies remain an issue for phylogenetic analysis. To overcome this problem, we used longer larger parts of the sequenced genes (available on Genbank, generated by Sanger sequencing) and additional genes to reconstruct the backbone of the green algal phylogeny. That way, even though the relationship within OTU-only clades may not be well resolved due to short fragments, the position of these clades among the green algae phylogeny can be inferred with strong support. We concatenated genes of different species of the same genus when same-species-sequences were not available, therefore filling as much as possible the gaps in the alignment. We used only species for which there was a reference sequence (for species or genus) for the marker analyzed -for example: there is no Ostreobium 18S rDNA sequence available (Supplementary Table 4 • tufA-OTUs phylogeny: tufA (including OTUs) + rbcL + 18S rDNA = 4833 bp.