Introduction

During the past decade, studies of the diversity of unicellular eukaryotes, or protists, have been deeply affected by the advent of cultivation-independent molecular methods. Initially, this approach was based on the analysis of 18S ribosomal RNA (rRNA) genes using PCR amplification, cloning and Sanger sequencing. When these approaches were first applied to study marine plankton, they revealed a wide and unsuspected protist diversity (López-García et al., 2001; Moon-van der Staay et al., 2001; Not et al., 2007a, 2007b; Guillou et al., 2008; Massana and Pedrós-Alió, 2008). More recently, the development of 454 pyrosequencing, which produces thousands of 18S ribosomal DNA (rDNA) amplicon sequence reads from a single sample, has revolutionized the study of natural microbial communities via a so-called ‘deep-sequencing’ approach (Margulies et al., 2005; Sogin et al., 2006). In particular, this has revealed the existence of large collections of lineages represented by only a few sequences, which have been interpreted as evidence of a rare microbial biosphere (Pedrós-Alió, 2007). Furthermore, very high estimates of eukaryotic species richness have been established in this way, notably in marine waters with several thousands of different operational taxonomic units (OTUs) detected in just a few liters (Brown et al., 2009; Edgcomb et al., 2011). The 18S rDNA amplicon pyrosequencing has become a widespread approach for protist diversity studies.

The sequence length obtained by 454 pyrosequencing, originally of 100 bp, now exceeds 400 bp (Titanium chemistry, 454 Life Sciences, Basel, Switzerland), allowing a more precise taxonomic classification of reads. However, multiple sources of bias have been identified, including pyrosequencing errors (Huse et al., 2007; Kunin et al., 2010), poor alignment quality (Sun et al., 2009; Schloss, 2010), inappropriate clustering approaches (Huse et al., 2010) or inconsistent results from different targeted 18S rDNA regions (Stoeck et al., 2010). These biases can have profound effects on diversity estimation, most often leading to the overestimation of OTU richness. Algorithms to remove pyrosequencing errors have been developed to reduce such overestimations. This is the case of AmpliconNoise (Quince et al., 2011), based on flowgram clustering, or the single-linkage preclustering (SLP) approach (Huse et al., 2010), based on sequence similarity analysis. Nevertheless, only a few studies have compared the results of using different in silico analytical approaches when assessing species diversity in natural samples using pyrosequencing reads (see, for example, Quince et al., 2009; Huse et al., 2010; Bik et al., 2012).

In spite of a rich and long history of taxonomic description of microbial eukaryotes, accurate comparative studies between morphological and molecular approaches in natural environments remain rare. One exception is the recent study of the overall diversity of eukaryotes in a freshwater lake, which demonstrated the efficiency of 18S rDNA pyrosequencing in detecting rare species but revealed strong incongruencies among taxa with regard to their relative abundances (Medinger et al., 2010). In fact, to be reliable and informative, environmental 18S rDNA sequence data analysis requires knowledge about how genotypic and phenotypic diversity relate (Caron et al., 2009). Consequently, we need to know how diversity defined by classical morphology approaches relates to that based on molecular methods, either conventional (cloning/Sanger sequencing) or recent deep-sequencing technologies (for example, pyrosequencing). Employing these different approaches on the same samples, using morphology-based species assessments as a control, could be a robust approach to assess the magnitude of the microbial rare biosphere indicated by pyrosequencing of 18S rDNA fragments. A robust comparison of these distinct approaches requires their application on a biological model group that is abundant, diverse and easily distinguishable by morphology.

The species-rich order Tintinnida represents one of the most morphologically diverse groups of planktonic protists. Tintinnids are commonly found in marine surface waters of the globe, in concentrations generally ranging from 10¹ to 10⁴ cells per liter (Pierce and Turner, 1993). By examining specimens in plankton settling chambers using an inverted microscope, they can be easily identified based on their species-specific secreted shell, the lorica (see, for example, Dolan et al., 2009). Several studies based on tintinnid single-cell identification, 18S rRNA gene and internal transcribed spacer (ITS) PCR amplification and sequencing have recently been carried out, providing a comprehensive phylogenetic framework that allows the identification of environmental sequences at the species level (Snoeyenbos-West et al., 2002; Strüder-Kypke and Lynn, 2008; Bachy et al., 2012; Saccà et al., 2012; Santoferrara et al., 2012). Thus, we applied a taxon-specific approach focused on marine tintinnid ciliates to assess their diversity in aliquots of the same samples using microscopic analysis (species identified based on classical morphological descriptions) and both cloning/Sanger sequencing and pyrosequencing analyses of 18S rDNA and ITS regions. Furthermore, we also examined different pyrosequencing data processing approaches to estimate species richness.

Materials and methods

Sample collection

Seawater samples were collected with Niskin bottles at the deep chlorophyll maximum depth from two sites in the Mediterranean Sea: the Bay of Villefranche-sur-Mer (Ligurian Sea, sample VilleFr-43) at the site ‘Point B+’ (43°41′00″N, 7°18′44″E; 50 m depth) on 18 November 2009; and the Ionian Sea (sample Ioni-7; 36°28′58″N, 15°38′58″E; 77 m depth) on 1 October 2010, during the Marmara 2010 cruise. A total of 20 l were collected in the coastal station of Villefranche and 168 l in the Ionian Sea offshore station. After sampling, plankton samples were immediately prefiltered through a 200-μm Nitex screen and then subdivided into two equal volumes. For the VilleFr-43 sample (Dynamic Aqua-Supply Ltd, Surrey, BC, Canada), one subsample was immediately preserved with Lugol iodine solution (2% final concentration) and then concentrated via serial gravity sedimentation (that is, 10 to 2 liters and then to 200 ml) to a final volume of 20 ml for taxonomic identification by morphological observation (see below). The second subsample was filtered onto a 5-μm pore-diameter polycarbonate filter (Millipore, Billerica, MA, USA) to concentrate all biomass. The filter was preserved in absolute ethanol at −20 °C following a protocol already tested for plankton samples (López-García et al., 2003; Bachy et al., 2011). For the Ioni-7 sample, the two subsamples were concentrated by gravity filtering through a 20-μm Nitex filter to a final volume of 4 ml. One subsample was fixed with Lugol solution (Sigma, St Louis, MO, USA) and the other subsample was fixed in ethanol (80% final concentration) for further processing. The cells fixed in ethanol were then filtered onto a 5-μm pore-diameter filter to concentrate the biomass and immediately processed for DNA extraction (see below).

Morphological observation

For each subsample fixed in Lugol solution, the entire concentrated volume was examined to exhaustion at × 200 magnification using an inverted microscope (Olympus IX51, Rungis, France) in 2 ml aliquots settled in sedimentation chambers. Every tintinnid cell was assigned to a species based on lorica morphology following standard taxonomic monographs (Kofoid and Campbell, 1929, 1939; Marshall, 1969). Empty loricae were included in the enumerations.

Generation of clone libraries and sequencing

Total DNA was purified from filters cut in small pieces using the PowerSoil DNA Isolation Kit (MoBio, Carlsbad, CA, USA). Nucleic acids were resuspended in 10 mM Tris-HCl, pH 8. Near-full-length 18S rDNA fragments were amplified by PCR using the tintinnid-specific forward primer 18S-Tin3F (5′-GCGGTATTTATTAGATAWCAGCC-3′; Bachy et al., 2012) and the eukaryotic-specific reverse primer 1498R (5′-CACCTACGCAAACCTTGTTA-3′; López-García et al., 2003). PCR was carried out under the following conditions: 35 cycles (denaturation at 94 °C for 15 s, annealing at 52 °C for 30 s, extension at 72 °C for 2 min) preceded by 3 min denaturation at 94 °C and followed by 15 min extension at 72 °C. We also targeted longer DNA fragments covering the near-full-length 18S rRNA gene, the ITSs 1 and 2, the 5.8S rDNA gene and a partial 28S rDNA fragment under similar PCR conditions as described above. These fragments were amplified with the tintinnid-specific forward primer 18S-Tin3F and the tintinnid-specific reverse primer 28S-TinR1 (5′-TGGTGCACTAGTATCAAAGT-3′). This primer set yielded an amplicon size of 2200 bp. Clone libraries were constructed using the Topo TA cloning system (Invitrogen, Carlsbad, CA, USA) following the instructions provided by the manufacturer. Positive inserts of expected size were selected from each library and Sanger-sequenced with the forward primer. We generated a total of 200 high-quality partial sequences (>700 bp) for each sample, covering the 18S rDNA regions also targeted by the primers used for pyrosequencing analysis. These clone sequences served for a preliminary phylogenetic analysis and the identification of OTUs (defined here as clusters of sequences having 99% identity). For each library, we completely sequenced at least one clone per OTU to obtain complete sequences representative of the whole taxonomic diversity found, preferentially the ones that also encompassed the ITS, 5.8S and 28S rDNA. 
A total of 116 ciliate sequences (among which 100 were tintinnid sequences) formed a reference data set that provided a phylogenetic framework for the attribution of environmental clones. Applying our stringent criterion (99% sequence identity), we detected 25 and 17 tintinnid OTUs in the coastal and offshore samples, respectively. In addition, two OTUs represented by one sequence each were excluded as non-tintinnids from the VilleFr-43 sample as they belonged to strombiliid ciliate species. Together with the previous reference data set, the new tintinnid sequences formed the ‘18S rDNA data set’ (that is, 100 tintinnid reference sequences plus 42 new clone sequences attributed to tintinnids). The sequences for which the ITS region was also sequenced formed the ‘ITS data set’, comprising a smaller number of 43 sequences (22 tintinnid reference sequences and 21 new clone sequences attributed to tintinnids). The 42 new complete sequences were submitted to GenBank (accession numbers JX567350–JX567503).

Clustering of clone sequences into OTUs

To affiliate clone sequences to different tintinnid taxa, we compared them by BLAST (Altschul et al., 1997) against the reference data set. A FASTA file with all the high-quality partial 18S rDNA clone sequences was created and submitted to the module NDIST of AmpliconNoise (Quince et al., 2011) to generate a distance matrix of all pairwise sequence combinations by Needleman–Wunsch pairwise alignment (with default parameters). A first distance matrix was generated with an alignment of all available positions (840 bp), and a second one with only the region covering positions targeted in the 18S rDNA pyrosequence read analysis (see below). The resulting matrices were used as input for Mothur (Schloss et al., 2009) in order to group clone sequences by average linkage clustering at different sequence identity levels.
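As an illustration of the two steps just described (pairwise global alignment followed by average linkage clustering), the following is a minimal pure-Python sketch. It is not the NDIST or Mothur implementation: the scoring values (`match`, `mismatch`, `gap`), the definition of identity as matching columns divided by alignment length, and the function names `nw_identity` and `average_linkage` are all assumptions made for the example.

```python
from itertools import combinations


def nw_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of a and b.

    Returns the fraction of alignment columns that are identical.
    The scoring values are illustrative, not NDIST defaults.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting columns and identical columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return matches / columns if columns else 1.0


def average_linkage(seqs, cutoff=0.01):
    """Group sequences (by index) into OTUs with average linkage clustering.

    Two clusters are merged while their mean pairwise distance
    (1 - identity) does not exceed cutoff; 0.01 mirrors a 99% threshold.
    """
    dist = {(i, j): 1.0 - nw_identity(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}
    clusters = [[i] for i in range(len(seqs))]

    def mean_dist(c1, c2):
        total = sum(dist[(min(x, y), max(x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: mean_dist(clusters[pq[0]], clusters[pq[1]]))
        if mean_dist(clusters[p], clusters[q]) > cutoff:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p stays valid
    return clusters
```

Calling `average_linkage(sequences, cutoff=0.01)` returns lists of sequence indices, one list per OTU. The quadratic alignment step makes this sketch suitable only for small sets such as clone libraries.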

Generation of amplicon libraries and pyrosequencing

Two sets of primers were designed for this study to specifically target variable regions of the tintinnid 18S rRNA gene and ITS region. The primers were adapted for 454 pyrosequencing following the manufacturer’s recommendations: addition of the forward and reverse fusion primers (A/B) and also of a 10-bp molecular identifier (MID) tag inserted between the adapter A and the specific primer to distinguish the two samples. The primers targeted 18S rDNAs from all tintinnid families sequenced to date. The first set of primers amplified a 330–340-bp fragment of the 18S rRNA gene including the hypervariable V4 region, whereas the second one amplified a 250–260-bp fragment of the ITS region including partial ITS1, complete 5.8S and partial ITS2. PCR reactions were carried out using 2 μl of environmental DNA in a volume of 20 μl Platinum HIFI reaction mix (Invitrogen) containing the tintinnid-specific 18S rDNA primers Tin454-18SFw (5′-GTAGTGACAAGAAATAAYGG-3′) and Tin454-18SRev (5′-CAACCCCTCARACCAACAG-3′). To amplify the ITS region, PCR reactions were carried out using the tintinnid-specific ITS primers Tin454-ITSFw (5′-CAATYTGTTGCAGRGCGMAAGC-3′) and Tin454-ITSRev (5′-AGCAATAGAAGGGCATCTA-3′). The PCR reactions were done under the following conditions: 25 cycles (denaturation at 94 °C for 15 s, annealing at 52 °C for 30 s, extension at 72 °C for 1 min) preceded by 3 min denaturation at 94 °C and followed by 15 min extension at 72 °C. We carried out only 25 PCR cycles in order to minimize the generation of recombinant PCR products and compensated for the lower yield by pooling the products of 10 independent PCRs. DNA was purified for each amplicon from each sample and mixed in 1:1:1:1 proportions before being 454-pyrosequenced from the A adapter on a 454 GS FLX sequencer using Titanium protocols (454 Life Sciences) and supplies (Eurofins MWG Operon, Ebersberg, Germany). Sequences are available at the Sequence Read Archive (ERP001739).

Low-quality and phylogenetic filtering of pyrosequencing reads

As a first step of filtering, only pyrosequencing reads having exact forward and reverse primer sequences and with no ambiguous nucleotides were kept. The primer sequences were then trimmed from the remaining reads. Then, the trimmed reads were compared by BLAST (Altschul et al., 1997) against the ‘18S rDNA’ and ‘ITS’ data sets (see above) to determine their preliminary taxonomic affiliation. Sequences with a similarity <98% with bona fide tintinnid sequences were extracted and their affiliation to the Tintinnida was checked by phylogenetic analysis. These reads were aligned using MAFFT (Katoh et al., 2002) and an approximately maximum likelihood (ML) tree was constructed using FastTree (Price et al., 2009) with choreotrich and oligotrich sequences as outgroup for the ‘18S rDNA’ data set and with spirotrich sequences as outgroup for the ‘ITS’ data set. All 18S rDNA and ITS Ionian Sea reads belonged to tintinnid species, whereas 804 18S rDNA Villefranche reads appeared to belong to strombiliid species, a sister clade of tintinnids, and were therefore discarded. The rest of the Villefranche sequences corresponded to tintinnid species. No chimera was detected with the Chimera Slayer algorithm (Haas et al., 2011) implemented in Mothur (Schloss, 2010) using the ‘18S rDNA’ and ‘ITS’ curated data sets as references. To avoid potentially biased comparisons because of variable data set sizes (Gihring et al., 2012), we randomly selected within each sample the number of pyrosequence reads contained in the smaller data set (10 504 and 8734 reads in VilleFr-43 and Ioni-7, respectively). The trimmed, phylogenetically checked and size-equivalent pyrosequence read libraries served as starting point for subsequent analyses. With the corresponding read identifiers, we also prepared flowgram data sets for AmpliconNoise, which needs them as input files.
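The first filtering step (exact primer matches, no ambiguous nucleotides, primer trimming) can be sketched as follows. This is an illustrative reconstruction, not the script used in the study. It assumes that each read spans the whole amplicon, so that the reverse primer appears as its reverse complement at the 3′ end of the read, and it treats degenerate primer positions (such as Y or R) with the standard IUPAC codes. The names `trim_read`, `primer_matches` and `reverse_complement` are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a sequence, IUPAC codes included."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_matches(primer, segment):
    """True if segment matches primer at every position (degeneracy allowed)."""
    return len(primer) == len(segment) and all(
        base in IUPAC[code] for code, base in zip(primer, segment)
    )


def trim_read(read, fwd_primer, rev_primer):
    """Return the read without primers, or None if it fails the filter.

    A read is rejected when it contains any character other than A, C, G, T
    or when either primer is not found exactly at its expected end.
    """
    read = read.upper()
    if any(base not in "ACGT" for base in read):
        return None
    rev_rc = reverse_complement(rev_primer)
    if len(read) < len(fwd_primer) + len(rev_rc):
        return None
    if not primer_matches(fwd_primer, read[:len(fwd_primer)]):
        return None
    if not primer_matches(rev_rc, read[-len(rev_rc):]):
        return None
    return read[len(fwd_primer):len(read) - len(rev_rc)]
```

With the 18S rDNA primers given above, `trim_read(read, "GTAGTGACAAGAAATAAYGG", "CAACCCCTCARACCAACAG")` returns the insert sequence for a passing read and `None` otherwise.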

Clustering of pyrosequence reads into OTUs

We applied several ways of clustering the pyrosequencing reads. First, with the quality- and taxonomy-filtered pyrosequence read data sets (see above), we used MAFFT (Katoh et al., 2002) with default parameters for multiple sequence alignment and distance matrix generation. We then used Mothur (Schloss et al., 2009) to estimate OTU richness using the average linkage clustering option. In a second clustering procedure, we applied the NDIST module of AmpliconNoise (Quince et al., 2011) with default parameters to generate the distance matrix of all Needleman–Wunsch pairwise sequence alignments. Then, the average linkage option in the AmpliconNoise FCluster module was used to cluster reads into OTUs. As an alternative, we also applied the clustering procedure implemented in Usearch (Edgar, 2010) with default parameters at different identity levels. The first steps of filtering (see above) were done to diminish the per nucleotide error rate of sequences (estimated at 0.25% for pyrosequencing; see Huse et al., 2007). We then used different additional approaches to correct the remaining PCR and pyrosequencing errors. We first analyzed the filtered pyrosequence reads with the SLP algorithm as detailed in Huse et al. (2010), with a pairwise distance <0.005 used as threshold (equivalent to a single-nucleotide difference in the 18S rDNA and ITS regions). We then used Mothur to create OTUs using the average linkage option. Alternatively, we ran AmpliconNoise with parameters identified as efficient to minimize the number of spurious OTUs for pyrosequencing: –r 0.005 for FCluster, –s 60.0 and –c 0.005 for PyroNoiseM and –s 10.0 and –c 0.08 for SeqNoise (Quince et al., 2011). The sequences passing this filter were clustered using pairwise alignments and average linkage clustering as described above.
In addition, we ran CD-HIT-OTU, a recent clustering algorithm for 18S rDNA pyrosequence data, with default parameters (including –e 0.0025) on the OTU finder web server (http://weizhong-lab.ucsd.edu/metagenomic-analysis/server/cd-hit-otu/). It should be noted that before the last step of OTU clustering with this algorithm, the clusters of only 1 or 2 sequences are considered as noise and therefore removed. For OTUs inferred using all the different methods, rarefaction analyses were computed with Mothur (Schloss et al., 2009).
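To make the rarefaction step concrete, the sketch below computes the expected number of OTUs in a random subsample of n reads with the classical analytical rarefaction formula, E[S_n] = Σ_i [1 − C(N − N_i, n) / C(N, n)], where N_i is the number of reads in OTU i and N the total. This is offered as an illustration only: Mothur may compute its curves differently (for example, by repeated random resampling), and the function names are hypothetical.

```python
from math import comb


def expected_otus(abundances, n):
    """Expected OTU count in a random subsample of n reads (no replacement).

    abundances: list with the number of reads in each OTU.
    """
    total = sum(abundances)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total number of reads")
    denom = comb(total, n)
    # comb(total - a, n) is 0 when fewer than n reads lie outside the OTU,
    # in which case the OTU is certain to be sampled.
    return sum(1 - comb(total - a, n) / denom for a in abundances)


def rarefaction_curve(abundances, step=1):
    """List of (n, expected OTUs) pairs from 0 reads up to the full data set."""
    total = sum(abundances)
    depths = list(range(0, total, step)) + [total]
    return [(n, expected_otus(abundances, n)) for n in depths]
```

A curve that flattens well before the full read count suggests that additional sequencing would recover few further OTUs at the chosen identity threshold.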

Phylogenetic analyses

All representative clone sequences and 18S rDNA reads selected after filtering of pyrosequences by AmpliconNoise, together with their most similar sequences identified by BLAST (Altschul et al., 1997) in the reference data set, were aligned using MAFFT. The multiple sequence alignment was then manually edited using the program ED from the MUST package (Philippe, 1993). An approximate ML tree was constructed with FastTree for all these sequences in order to choose one sequence (the longest one) as representative for each OTU for further phylogenetic analyses. The selected data set was then analyzed by ML using TREEFINDER (Jobb et al., 2004), applying a general time reversible model of sequence evolution with a four-category Gamma distribution to accommodate among-site rate variation. ML bootstrap proportions were inferred using 1000 replicates.

Results

We studied the tintinnid ciliate community in two deep chlorophyll maximum marine plankton samples from a coastal (Villefranche) and an offshore (Ionian Sea) location. Each sample was divided into two equal parts, the first for morphological identification of tintinnid cells and the second for both 18S rRNA gene library construction and Sanger sequencing and direct pyrosequencing of 18S rRNA gene and ITS amplicons using tintinnid-specific primers.

Morphology and sequence data sets

We identified 673 tintinnid cells in the plankton subsamples subjected to microscopic examination (465 in Villefranche VilleFr-43 sample and 208 in the Ionian Sea Ioni-7 sample). From plankton subsamples devoted to molecular analyses, after DNA extraction and PCR amplification, we generated partial 18S rDNA sequences of 400 clones (200 from each locality) using Sanger sequencing and a total of 76 206 pyrosequence reads (termed here as ‘pyroreads’) for both samples and molecular markers. Quality and phylogenetic filtering of these pyroreads yielded 50 817 sequences (10 504 and 8734 18S rDNA reads and 17 378 and 14 201 ITS reads for Villefranche and the Ionian Sea, respectively). To allow direct comparison between the two fragments targeted by pyrosequencing (18S rDNA and ITS) within each sample, we limited the pyroread data sets to the same size by random selection of sequences within each set (that is, 10 504 18S rDNA and ITS reads for VilleFr-43, and 8734 18S rDNA and ITS reads for Ioni-7; see Table 1).
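The size equalization described here amounts to drawing, without replacement, as many reads from each data set as the smallest one contains. A minimal sketch follows; the function name, the dictionary layout and the fixed `seed` (used only to make the draw reproducible) are assumptions of the example.

```python
import random


def subsample_to_common_depth(datasets, seed=0):
    """Randomly reduce every read list to the size of the smallest one.

    datasets: dict mapping a data set name to a list of reads.
    Reads are drawn without replacement, so no read is picked twice.
    """
    depth = min(len(reads) for reads in datasets.values())
    rng = random.Random(seed)
    return {name: rng.sample(reads, depth) for name, reads in datasets.items()}
```

For VilleFr-43, for instance, this would reduce the 17 378 ITS reads to the 10 504 reads available for the 18S rDNA region.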

Table 1 Number of tintinnid cells, clone sequences and pyroreads retained for comparison from the VilleFr-43 and Ioni-7 samples

In the coastal sample VilleFr-43, 21 morphospecies were found, largely dominated by the family Tintinnidae, with Salpingella faurei and Steenstrupiella steenstrupii representing 54% of the total number of cells (Figure 1). The corresponding 18S rDNA clone library was also dominated by Tintinnidae (37.3% of clones), followed by Undellidae (26%) and Codonellidae (18%). Similarly, pyroreads affiliating to Tintinnidae were the most abundant in both the 18S rDNA and ITS data sets, where they represented 64.6% and 43.2% of pyroreads, respectively. In the offshore sample Ioni-7, a more diverse community of 36 morphospecies was found, with the community more evenly distributed compared with the Villefranche assemblage. Individuals identified as belonging to the families Tintinnidae (23%), Undellidae (22%) and Eutintinnidae (18%) co-dominated the tintinnid community. In contrast, the proportions of tintinnid families in the clone library were different from the morphological observations, with a majority of 18S rDNA sequences attributed to Undellidae (47%), followed by Tintinnidae (22%) and Codonellidae and Codonellopsidae (12% each). Pyrosequencing of the 18S rDNA region yielded a majority of reads identified as Tintinnidae (51%) and Undellidae (32%), whereas the ITS region produced a majority of sequences affiliated to Undellidae (51%) and then Tintinnidae (16%; Figure 1). Overall, the distribution of 18S rDNA and ITS pyroreads agreed well with that of clone sequences and with the diversity of tintinnids based on microscopic observations at the family level. However, 18S rDNA and ITS pyroread data sets differed significantly from each other in both the coastal and offshore samples (χ² test, P-value <0.001), showing the difficulties in comparing diversity estimates when different markers are used.
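The comparison of two markers by a χ² test can be illustrated with a test of homogeneity on per-family read counts. The sketch below returns the test statistic and its degrees of freedom; it is a generic textbook computation, not the exact procedure used for the figures reported above, and the function name is hypothetical.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic for two vectors of category counts.

    counts_a, counts_b: read counts per family for two data sets, in the
    same family order. Families absent from both data sets are ignored.
    Returns (statistic, degrees_of_freedom).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both data sets must list the same families")
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand_total = total_a + total_b
    statistic, categories = 0.0, 0
    for a, b in zip(counts_a, counts_b):
        column_total = a + b
        if column_total == 0:
            continue
        categories += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, categories - 1
```

The statistic is then compared with the χ² distribution for the returned degrees of freedom; with one degree of freedom, for example, a value above about 10.83 corresponds to P < 0.001.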

Figure 1

Frequencies of the different tintinnid families found in the VilleFr-43 and Ioni-7 samples by morphology (cells), 18S rDNA library sequencing (clones) and pyrosequencing of 18S rDNA and ITS regions. Sequences were assigned to families by BLAST searches against a curated tintinnid sequence database. Micrographs on the right show representative morphologies encountered in each family. Complete absence of a family in a sample is indicated by 0.

The microscopy-based determination of species diversity showed that both the Villefranche and Ionian Sea samples contained diverse tintinnid communities. The differences observed between the two samples were probably because of the different geographical origins and environmental conditions and sampling periods (collected at >1000 km of distance and at a 1-year interval). In particular, despite the fact that the number of cells in the Ioni-7 sample was half of that found in the VilleFr-43 sample, the Ionian sample harbored a more species-rich community (Supplementary Table S1). Furthermore, the coastal VilleFr-43 sample was dominated by a few forms, with >50% of the tintinnid community composed of two species, whereas the Ioni-7 sample showed a more even species distribution, with the five most abundant species just reaching 42% of the total community. These differences in the tintinnid communities in terms of cell abundance and species compositions were in agreement with previous studies (see, for example, Dolan et al., 2002, 2006). Thus, the observed differences between the two samples provided us with two different species distribution models to study the performance of different molecular approaches to assess OTU composition.

Species richness estimation

To estimate species richness from sequence data, we defined tintinnid OTUs applying a sequence similarity threshold of 99%, which was based on the 18S rDNA nucleotide differences observed between distinct tintinnid morphospecies (Bachy et al., 2012; and Supplementary Figure S1). A lower sequence similarity threshold (that is, <99%) would entail the mixing of not only different species but even different genera, such as Dictyocysta and Codonella, Amphorella and Amphorellopsis, or Tintinnopsis and Climacocylis. Moreover, considering both full-length 18S rDNA sequences or exclusively the short 340-bp fragment targeted by our pyrosequencing analysis, the use of 99% sequence identity provided similar species-level resolution. In fact, applying this threshold on the Sanger sequences from the clone libraries, we retrieved comparable OTU numbers for the complete 18S rDNA sequences and for the fragments corresponding to the region chosen for pyrosequencing (25 and 22 OTUs in VilleFr-43 and 17 and 22 OTUs in Ioni-7, respectively). The rarefaction curves corresponding to these clone libraries were close to saturation (Supplementary Figure S2). In the case of VilleFr-43, not only the number of OTUs but also the species abundance distributions (with 32–40% of rare OTUs, namely, those representing 1% of the total number of sequences in a data set) were quite close to the ones described by morphological identification. In contrast, the OTUs identified in Ioni-7 represented only 50% of the 36 species identified by morphological examination. However, the percentage of rare OTUs in this sample ranged from 35% to 52% and was similar to the 42% observed by morphological identification.

A variety of approaches have been developed to assess species richness based on pyroreads (see review in Zinger et al., 2012). We applied several of these methods in order to compare the results among them and with the classical 18S rDNA clone libraries/Sanger sequencing and traditional morphological analysis (Table 2 and Supplementary Figure S3). As explained above, we first defined an adequate sequence similarity threshold (99%) to identify OTUs in the 18S rDNA pyroread data set. According to pairwise sequence comparisons of reference sequences belonging to identified species, and sequences from clone libraries, the 18S rDNA and ITS regions targeted by our pyrosequencing analysis showed similar divergence rates, comparable to those of the complete 18S rDNA sequences (Supplementary Figures S1 and S4). Therefore, we also used the 99% sequence identity threshold to identify OTUs with both the 18S rDNA and ITS pyroreads. Nevertheless, we have shown in previous work that phylogenetically distinct lineages may contain identical ITS region sequences (Bachy et al., 2012); and hence it should be noted that the ITS pyroreads were not as accurate as the 18S rDNA ones to discriminate different species (see Supplementary Figure S1). This may reflect a greater selective constraint at the ITS than that experienced by the targeted 18S rDNA variable region V4, which is characterized by particularly rapid rates of evolution within eukaryotes. With the 99% sequence identity threshold established, we tested several sequence alignment, clustering and denoising methods on the 18S rDNA and ITS pyroread data sets (see Materials and methods). Surprisingly, depending on the analytical method and molecular marker used, 18S rDNA or ITS, the number of OTUs varied from only 7 up to 1848. Accordingly, the proportion of rare OTUs represented by few reads showed extreme variations, between 14% and 99% of the total OTUs. 
Table 2 shows that the number of OTUs detected using the common method of multiple sequence alignment or the more stringent pairwise alignment and average-linkage exceeded the number of expected species (based on unambiguous morphological taxon identification) by one or two orders of magnitude. The greedy algorithm for clustering of representative sequences implemented in Usearch (Edgar, 2010) appeared to be more efficient, but the number of OTUs still exceeded the number of expected species by one order of magnitude. We can reject these diversity estimates with some confidence, not only because they overestimated the number of species observed based on microscope examination but also because they exceeded, sometimes by twofold or more, the total number of tintinnid cells observed in twin volumes of samples.

Table 2 Number of tintinnid species and total and rare OTUs identified using morphological and molecular methods

To avoid this kind of artificial inflation, we applied several approaches recently developed to remove PCR and pyrosequencing errors, including sequence preclustering and the more complex pipelines implemented in CD-HIT-OTU (Wu et al., 2011) and AmpliconNoise (Quince et al., 2011). The agglomerative approach applying SLP before the classical clustering step also overestimated species richness, especially for the 18S rDNA pyroreads (Table 2). This could be because the preclustering step at 98% sequence identity usually used for SLP (Huse et al., 2010) was too large compared with the threshold that we empirically determined for tintinnids. Nevertheless, applying a preclustering step with a 99.5% similarity threshold was not sufficient to denoise the data efficiently (between 166 and 1848 OTUs were defined depending on the marker, Table 2). Compared with the other approaches that we tested, the denoising and clustering procedures implemented in CD-HIT-OTU and AmpliconNoise were the only ones able to produce estimates coherent with those obtained from classical clone library/Sanger sequencing and, especially, with the morphological identification. These methods inferred a total number of OTUs of 25 for the 18S rDNA region and 20 for the ITS region in VilleFr-43 and 12–23 for the 18S rDNA region and 7–18 for the ITS region in Ioni-7. Whereas the other methods produced estimates of rare OTUs ranging from 94% to 99.2% of all OTUs, they accounted for only 36.4–55.6% and 33.3–56.5% of all OTUs inferred by AmpliconNoise and CD-HIT-OTU in VilleFr-43 and Ioni-7, respectively (Figure 1). As AmpliconNoise produced reasonable species estimates by taking into account all sequences (that is, it does not eliminate the clusters with only 1 or 2 sequences as CD-HIT-OTU does), we kept the OTUs identified from 18S rDNA pyroreads by this method for subsequent analyses.
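The idea behind preclustering can be sketched in a few lines: unique sequences are processed from most to least abundant, and each one joins the first existing cluster that contains a member closer than the threshold, otherwise it seeds a new cluster. The code below is a simplified illustration in the spirit of SLP and not the published implementation of Huse et al. (2010). As a simplifying assumption it measures distance as the mismatch fraction between reads of equal length instead of using pairwise alignments, and all names are hypothetical.

```python
from collections import Counter


def mismatch_fraction(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this simplified distance needs equal-length reads")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage_precluster(reads, max_dist=0.005):
    """Group reads into preclusters, most abundant sequences first.

    A sequence joins the first cluster holding any member at a distance
    below max_dist (single linkage); otherwise it starts a new cluster.
    Returns a list of dicts with the seed sequence, members and read count.
    """
    counts = Counter(reads)
    # Sort by decreasing abundance; ties are broken alphabetically so that
    # the result is deterministic.
    ordered = sorted(counts, key=lambda seq: (-counts[seq], seq))
    clusters = []
    for seq in ordered:
        for cluster in clusters:
            if any(mismatch_fraction(seq, member) < max_dist
                   for member in cluster["members"]):
                cluster["members"].append(seq)
                cluster["size"] += counts[seq]
                break
        else:
            clusters.append({"seed": seq, "members": [seq], "size": counts[seq]})
    return clusters
```

Because membership is decided against any member of a cluster and not only against its seed, chains of closely related variants can be absorbed into one abundant sequence, which is why the choice of threshold matters so much for the resulting OTU counts.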

Phylogenetic analyses

We built a phylogenetic tree based on sequences representative of all OTUs identified in clone libraries and pyrosequencing data together with their closest described relatives available in public databases (Figure 2). The resulting tree was strongly supported at the genus level, giving us the opportunity to compare at this fine taxonomic rank the abundance patterns generated by both molecular approaches and the morphological analysis. From a qualitative point of view, both the cloning and pyrosequencing approaches detected the majority of families and genera found using microscopic observations, although a slightly larger diversity was detected by cloning/Sanger sequencing than by pyrosequencing. Surprisingly, the cloning and pyrosequencing approaches thus seemed to be almost equally efficient at detecting the most frequent species observed in our samples (Salpingella-, Undella-, Codonella- and Dictyocysta-related species) and even those present in low abundance (for example, those of families Xystonellidae or Codonellopsidae). Moreover, whereas no Tintinnidium spp. was found in the coastal sample (VilleFr-43) examined microscopically, 18S rDNA sequences affiliated to this genus were detected (4% of pyroreads and 0.5% of clones). In contrast, Rhabdonella sp. and Steenstrupiella sp. specimens observed in the sample Ioni-7 were not detected with any of the molecular approaches. The comparison of the two molecular approaches revealed that >50% of all OTUs were detected by both methods (16 of the 27 OTUs in the coastal sample VilleFr-43 and 11 of the 17 OTUs in the offshore sample Ioni-7; see Figure 2). The dissimilarities between the two methods concerned relatively rare OTUs, such as those affiliated to Tintinnopsis sp. (OTU 24, 5 clones in VilleFr-43) or Codonella aspera (OTU 36, 3 clones in VilleFr-43), undetected by pyrosequencing, or three OTUs within the Eutintinnus clade (OTUs 14, 15 and 18, with 37, 22 and 17 18S rDNA pyrosequence reads, respectively) undetected in clone libraries. Part of the diversity found only by pyrosequencing corresponded to OTUs represented by singletons or doubletons (OTUs 7, 19 and 21).
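The '>50%' figure can be made explicit with a short worked calculation from the counts given in the text (16 of 27 OTUs in VilleFr-43 and 11 of 17 in Ioni-7). The helper name below is hypothetical and serves only this illustration.

```python
def shared_percentage(shared, total):
    """Percentage of all OTUs that were detected by both molecular methods."""
    if total <= 0 or not 0 <= shared <= total:
        raise ValueError("need 0 <= shared <= total and total > 0")
    return 100.0 * shared / total


# Counts taken from the text above.
villefr_shared = shared_percentage(16, 27)  # coastal sample VilleFr-43
ioni_shared = shared_percentage(11, 17)     # offshore sample Ioni-7

print(f"VilleFr-43: {villefr_shared:.1f}% shared OTUs")
print(f"Ioni-7: {ioni_shared:.1f}% shared OTUs")
```

This gives roughly 59% for VilleFr-43 and 65% for Ioni-7.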

Figure 2

Maximum likelihood phylogenetic tree of 18S rDNA sequences of detected OTUs and their closest tintinnid relatives, based on 1200 aligned positions. Sequences from this study are in bold. Relative proportions of the different OTUs in each sample estimated from clone libraries and 18S rDNA pyroreads are indicated by circles of proportional size on the right. Relative proportions of the different tintinnid species estimated from cell counts under the microscope are indicated by hexagons of proportional size on the right. The numbers at nodes are bootstrap values (values <50% are omitted). Accession numbers are provided in brackets. The scale bar represents the number of substitutions for a unit branch length.

From a semiquantitative point of view, the species found to be the most abundant by microscopic examination were also the ones preferentially detected by the molecular approaches. Thus, in the VilleFr-43 sample, OTUs 6, 8 and 27, which were dominant in the clone and pyroread libraries, were related to Salpingella sp., Steenstrupiella steenstrupii and Undella sp., respectively, and were found to dominate the microscopically determined tintinnid community (Figure 2). In the Ioni-7 sample, the Undella-related OTU 27 was dominant in the clone and pyroread data sets and was also the dominant genus based on microscopic analysis (22.6% of cells). However, there were some discrepancies among the data sets. Eutintinnus spp. specimens, which were abundant based on microscopic observations, especially in Ioni-7, were detected only in low abundances by the molecular approaches. In contrast, Codonellopsidae appeared overrepresented in the molecular results because only a few Codonellopsis sp. and Stenosemella sp. specimens were present based on microscopic examination (Figure 1).

Discussion

Pyrosequencing versus classical clone libraries/Sanger sequencing

Pyrosequencing and the classical cloning method revealed a relatively similar phylogenetic structure of the tintinnid community at the species level. These observations resembled other molecular comparative studies of whole eukaryotic communities targeted with eukaryotic universal primers (Edgcomb et al., 2011) or of particular groups, such as fungi, targeted with specific primers (Tedersoo et al., 2010). In the case of our tintinnid communities, the presence and proportion of the dominant lineages were similar for both molecular methods and the OTUs unique to each method corresponded to rare lineages. These discrepancies in low-abundance OTUs might be caused by the relatively low quantity of DNA coming from rare species in the extracted DNA samples, which would hence be randomly amplified or not amplified during PCR. Some discrepancies could also result from the very different lengths of clone sequences and pyroreads, and consequently a different amount of phylogenetic information, leading to a different number of potentially detectable OTUs. The fact that differences in amplicon length could notably influence estimates of richness has been shown with amplicon libraries constructed for bacteria (Huber et al., 2009; Engelbrektson et al., 2010).

A total of 97.7% of clone sequences and 81.5% of pyroreads (18S rDNA and ITS together) could be reliably assigned to known tintinnid taxa at the genus level (98% sequence identity) and, furthermore, a majority of them (85% of clones and 69.5% of pyroreads) could be related with strong confidence to a particular species (99% sequence identity). In fact, no novel distinct clades without described representatives were detected despite our deep sequence coverage, especially by the pyrosequencing approach. The lack of novel clades could be explained by recent efforts of species genetic characterization in the Mediterranean area (Bachy et al., 2012). However, for certain genera such as Salpingella, some lineages (OTUs 2, 5 and 6) were only represented by environmental sequences, which argues for the need for further taxonomic work to genotype more tintinnid species.

Molecular versus morphological identification

In contrast with previous studies addressing the accuracy of pyrosequencing and/or clone libraries to describe microbial diversity in natural samples, we targeted a particular protist group with conspicuous morphological traits that permitted a precise description of the community using morphological inspection in parallel with molecular techniques. This provided us not only with a maximum diversity limit (no more OTUs than cells could exist in the samples), but also with a control reference for the expected number of OTUs and their relative abundances. In fact, the accurate characterization of microbial communities requires not only taxonomic identification but also measurement of relative abundances. OTU frequencies (cloning or pyrosequencing derived) are often used in molecular studies of microbial communities to infer the relative ecological relationships among taxa (for example, dominance or evenness; Hughes et al., 2001).

In our study, the frequency distribution of clone and pyroread sequences did not fit perfectly the frequency distribution of tintinnid morphospecies encountered under the microscope. For example, Tintinnidium was found in sequence libraries but not observed under the microscope. This could be because its amorphous lorica can be mistaken for detrital material, so that it would be overlooked in microscopic examinations if it was present in low numbers. In addition, differences in rRNA gene copy number among taxa probably exist, which is often cited as the major reason for incongruence between environmental sequence data and morphology (Medinger et al., 2010). Among tintinnids, the number of rRNA gene repeats probably varies, in particular because species may contain different numbers of nuclei (Agatha and Strüder-Kypke, 2007). In addition, in the case of the less abundant species, our use of two different subsamples for each locality may explain some differences between morphological- and molecular-based estimations, as very rare species could have been unequally distributed when the initial water samples were divided into two subsamples.

Influence of marker choice and computational analyses on diversity estimates

Identifying and counting each tintinnid cell in our samples allowed us to unambiguously compare their species richness and distribution between different analyses. As previously observed by Doherty et al. (2007) for planktonic ciliates, we found that the number of morphologically recognizable tintinnid species was roughly similar to the number of OTUs detected in clone libraries. However, pyrosequencing data and the multiple approaches existing to analyze them yielded much more contradictory results. For our samples, the number of OTUs increased by a factor up to 100 depending on the in silico approach applied. The multiple sequence alignment and average clustering grossly overestimated the number of OTUs. Similarly, methods such as pairwise alignment, greedy algorithm for clustering of representative sequences and SLP inferred a number of OTUs that did not approach the numbers estimated by morphology until we went down to a 95% identity threshold for OTU definition (Supplementary Figure S3). However, a 95% threshold is clearly not realistic as we know from described tintinnid morphospecies that members of two different species may have 99% identical sequences (Supplementary Figure S1 and Santoferrara et al., 2012). Only the denoising algorithms based on sequence clustering approaches implemented in AmpliconNoise (Quince et al., 2011) and CD-HIT-OTU (Wu et al., 2011) were able to correctly estimate OTU diversities at 99% sequence identity to yield values comparable to those observed by morphological study and clone library analysis.
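The dependence of OTU numbers on the identity threshold can be illustrated with a toy example. The sketch below is a simplified greedy centroid clustering written for this illustration only. It is not the implementation used by any of the tools cited above, it assumes sequences that are already aligned and of equal length, and the sequences, mismatch positions and function names are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid founds a new cluster.
    """
    clusters = []  # each cluster is a list whose first element is the centroid
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def mutate(seq, positions):
    """Return a copy of `seq` with a substitution at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


# Toy data: a 200-nt reference, a read with 1 mismatch (99.5% identity)
# and a read with 6 mismatches (97% identity).
reference = "ACGT" * 50
one_error = mutate(reference, [10])
six_errors = mutate(reference, [5, 40, 80, 120, 160, 190])
reads = [reference, one_error, six_errors]

otu_counts = {t: len(greedy_cluster(reads, t)) for t in (0.999, 0.99, 0.95)}
print(otu_counts)
```

In this toy case, a handful of substitutions is enough to create additional clusters at stringent thresholds, and they merge only when the threshold is lowered to 95%. This is the same trade-off described above: a relaxed threshold absorbs errors but can no longer separate species that differ by 1% or less.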

Although the denoising methods have been demonstrated to minimize the number of spurious OTUs (Quince et al., 2011), they tended to underestimate to some extent the species richness in the Ionian sample (Ioni-7). Compared with the 36 morphospecies found in this sample, the estimates obtained from the clone libraries (17–22 OTUs), 18S rDNA pyroreads (12–23 OTUs) and ITS pyroreads (7–18 OTUs) were low. Distinguishing as separate species what might be just morphological varieties could explain, at least partially, why we found such reduced diversity in molecular approaches. For instance, Undella spp. encompassed six morphotypes in the sample (Supplementary Table S1) whereas it is known that there is very low genetic divergence in this genus for the markers we studied (Bachy et al., 2012). Moreover, the relatively small power of the ITS region selected for pyrosequencing to discriminate closely related species could explain the even lower diversity found in the ITS pyroread data set.

PCR and sequencing errors may inflate diversity estimates by creating false taxa, and this can also be the case if inadequate methodological approaches are applied to analyze pyrosequencing data (Quince et al., 2009; Reeder and Knight, 2009; Kunin et al., 2010). In fact, the use of a morphological reference allowed us to show that certain methods led to massive overestimation of OTU numbers, in particular for rare OTUs, not only because of sequence errors but also because of other methodological problems. Notably, misaligned sequences in multiple sequence alignments can be considered as different OTUs. The size and diversity of the ‘rare biosphere’ may be somewhat less than what has been so far inferred by the first analyses of microbial diversity based on pyrosequencing data. The increasing availability of rigorous bioinformatic tools to treat these massive sequence data sets should allow realistic biodiversity estimates to be made. This and the continued effort to obtain sequence data from described species to serve as reference will greatly improve the description of microbial communities by massive sequencing approaches.