Metabarcoding analysis on European coastal samples reveals new molecular metazoan diversity

Although animals are among the best studied organisms, we still lack a full description of their diversity, especially for microscopic taxa. This is partly due to the time-consuming and costly nature of surveying animal diversity through morphological and molecular studies of individual taxa. A powerful alternative is the use of high-throughput environmental sequencing, providing molecular data from all organisms sampled. We here address the unknown diversity of animal phyla in marine environments using an extensive dataset designed to assess eukaryotic ribosomal diversity among European coastal locations. A multi-phylum assessment of marine animal diversity that includes water column and sediments, oxic and anoxic environments, and both DNA and RNA templates, revealed a high percentage of novel 18S rRNA sequences in most phyla, suggesting that marine environments have not yet been fully sampled at a molecular level. This novelty is especially high among Platyhelminthes, Acoelomorpha, and Nematoda, which are well studied from a morphological perspective and abundant in benthic environments. We also identified, based on molecular data, a potentially novel group of widespread tunicates. Moreover, we recovered a high number of reads for Ctenophora and Cnidaria in the smaller fractions suggesting their gametes might play a greater ecological role than previously suspected.


Introduction 42
The animal kingdom is one of the best-studied branches of the tree of life 1 , with more 43 than 1.5 million species described in around 35 different phyla 2 . Some authors have 44 suggested there may be more than 10 million species of animals, indicating that there 45 is an extensive unknown animal diversity. This hidden diversity may vary according 46 to the animal phyla considered. Not surprisingly, those animal phyla with microscopic 47 representatives (i.e., those animals with a size below 2mm 3 , also known as 48 micrometazoans 4 ) are suggested to contain most of this potential unknown diversity 49 3 . 50

51
Marine environments cover most of the earth's surface. More importantly, all 52 metazoan phyla, except onycophorans, have marine representatives, with up to 60% 53 including microscopic members 5 . Copepods, for instance, are the most abundant 54 multicellular group of organisms on earth 6 , highlighting the key role of microbial 55 animals in marine ecosystems. Given that the marine benthic meiofauna is also one of 56 the hot spots of alpha-diversity in the biosphere, marine environments thus appear to 57 be ideal sites in which to analyze animal diversity across phyla. 58 59 Classical methods to survey animal diversity, such as isolation and morphological 60 identification, might be ineffective to comprehensively analyze 61 micro/mesozooplanktonic 7 and meiofaunal diversity 8 . The microscopic size of the 62 organisms and the wide variety of morphologies makes the identification process 63 tedious and slow, requiring taxonomists with experience in different groups to 64 properly assess the composition of the community and describe new species or 65 groups. Molecular techniques, and especially high-throughput environmental 66 environments and in oxic sediments (Fig.1B). Interestingly, metazoan reads were not 113 only abundant in the micro/mesoplankton fraction (68% DNA, 49% RNA of the total 114 eukaryotic reads), but also in the smaller fractions (i.e., the pico/nano fractions which 115 are less than 20um). The presence of a high percentage of metazoan reads in the 116 smaller fractions is especially relevant in the anoxic environment, with 75% of the 117 DNA reads (and 33% of the RNA) being assigned to metazoans. 118 The clustering of reads into OTUs yielded 1067 OTUs from 23 different metazoan 119 phyla (Fig.2, Table S4). 469 OTUs were found to be exclusive to benthic 120 environments, 505 to pelagic environments and 102 OTUs were present in both 121 ( Fig.2A). Crustacea appeared as the richest clade (246 OTUs) within the pelagic-122 exclusive dataset, followed by Polychaeta (45). Within the benthic (sediment)-specific 123 samples, the largest number of OTUs were from Nematoda (227), followed by 124 Crustacea (101). Polychaeta (31) and Crustacea (23) dominated the OTUs present in 125 both environments ( Fig.2A). 126 The largest proportion of animal reads in oxic water column environments were from 127 Crustacea, which represented up to the 89% of DNA and 53% of RNA in the overall 128 metazoans reads from the micro/meso fractions (Fig. 1A). More than 80% of the 129 crustacean RNA reads, however, corresponded to 8 specific OTUs that were assigned 130 to copepods (Table S5). Besides crustaceans, there was also a high abundance of 131 reads from tunicates (5% DNA only, but 28% RNA) within the oxic 132 micro/mesoplanktonic samples, most of them corresponding to appendicularians 133 (Table S5). On the other hand, benthic samples were dominated by polychaetes (30% 134 DNA, 23% RNA) and crustaceans (19% DNA, 23% RNA) (Fig. 1B). Within benthic 135 Crustacea, ostracods and copepods were the most abundant groups (Table S6). 136 137

Community structure across environments and size fractions 138
To determine the biogeographical patterns of the microbial animals in our dataset, we 139 analyzed the presence/absence of OTUs in all five sites (discarding the anoxic 140 samples). A large fraction of the OTUs (668 out of 1076) were present in just one 141 single location. However, the number of reads of these "endemic" OTUs (around 142 4·10 4 ) was three times lower than the 8 OTUs present in all sampling sites (around 143 1.2·10 5 reads) (Fig. 2B). The taxonomic composition of the cosmopolitan OTUs ( Fig.  144 2B) differed greatly from the complete dataset except for the crustacean dominance 145 (Fig. 2B). In particular, there were no nematodes or polychaetes among the 146 cosmopolitan OTUs, whereas a cnidarian and a craniate OTU appeared to be present 147 over the 5 sampling sites. Our analysis also showed that all the cosmopolitan OTUs 148 belonged to the water column, whereas more than half (56%) of the "endemic" ones 149 belonged to the sediments. These endemic OTUs represented 80% of the total benthic 150

OTUs. 151
RNA reads indicate metabolically active cells 22 . Interestingly, we found a relatively 152 high percentage of RNA reads assigned to metazoans in the smaller fractions (from 153 0.8 to 20 µm): 2.4 % in oxic and 32.4 % in anoxic samples (Fig. 1A). Therefore, we 154 decided to analyze the potential source of those RNA reads. Most of the reads were 155 crustaceans (36% RNA reads), followed by tunicates, ctenophores, cnidarians and 156 polychaetes (Fig. 1A). Ctenophores (85% RNA pico/nano fractions) and cnidarians 157 (16% RNA pico/nano fractions) dominated the reads assigned to metazoans in the 158 anoxic waters of Varna, Black Sea (Fig. 1B). 159 To understand whether the reads from the smaller fractions were directly derived from 160 the larger ones, we filtered the data based on their co-occurrence between the 161 pico/nano fraction and the micro/meso fractions. We observed that OTUs present in 162 both smaller and larger fractions had a clearly different proportion of reads (Fig. 3). 163 Most of the reads in the smaller fractions belonged to the ctenophores (58%), whereas 164 crustaceans dominated (52%) the micro/mesoplanktonic fractions. In this regard, 165 OTUs corresponding to Pleurobrachia pileus (a ctenophore) and Aurelia aurita (a 166 cnidarian) were especially enriched in the smaller fraction (Fig. 3), representing 57% 167 of all metazoan RNA reads, and up to 33% of all eukaryotic RNA reads in the anoxic 168 samples (Table S7) (Fig.1A). 169 170

Sequence novelty 171
We performed BLAST searches against the NCBI nt nr database to interrogate the 172 level of novelty in our molecular dataset across all animal phyla. The results revealed 173 a high degree of sequence novelty (Fig. 4A). In particular, 35.5% of our OTUs 174 (representing 10.5% of the reads) had a BLAST identity lower than 97% compared to 175 NCBI sequences (Fig. 4B). Moreover, up to 10% of the OTUs, which accounts for 5% 176 of the metazoan reads, had BLAST identities lower than 90%. The putative novelty 177 was especially high among platyhelminthes, acoelomorphs, and nematodes, in which 178 most of their OTUs (75%) had a BLAST identity lower than 97%. Gastrotrichs and 179 crustaceans also had significant novelty (40-50% of their OTUs had a BLAST identity 180 below 97%). 181 Interestingly, the OTUs that appear to be most abundant within the water column 182 (Table S5) and sediments (Table S6) correspond either to already known sequences or 183 with high similarity to known sequences. The level of novelty is also different 184 between benthic and pelagic environments. Thus, 70% of the OTUs found in benthic 185 environments had a BLAST identity of less than 97% ( Fig. 2A), while this percentage 186 decreased to 21% of OTUs in the water column or to 11% of OTUs present in both 187 water column and benthos. This suggests that benthic marine environments are a 188 potential hot-spot to find new metazoan taxa or lineages. 189 Among the potential novelty, we detected a group of three OTUs that had a relatively 190 large number of RNA reads in the water column (1.8%). (Fig. 1, labelled as "MAME To have a better understanding of its phylogenetic position, we performed 202 phylogenetic trees. Our trees placed the MAME 1 GenBank sequence within tunicates 203 by both maximum likelihood and Bayesian inference (Table S3), and with good nodal 204 support (79% bootstrap support and 0.99 Bayesian posterior probability), although 205 with relatively longer branches than the rest of the metazoans. To determine its 206 specific phylogenetic position within the tunicates, we inferred an additional tree with 207 most of the available 18S rRNA sequences of tunicates, representing most of the 208 known diversity of this phylum. In this tunicate-focused tree, the MAME 1 sequence 209 clustered with thaliaceans as sister-group to the genus Doliolium, although with low 210 nodal support (Fig. 5). Finally, we ran a RAxML-EPA analysis to place the 69 OTUs 211 plus the other NCBI sequence within the reference tree of metazoans and the tree of 212 tunicates. In both cases, the 69 OTUs clustered together, with the reference MAME 1 213 sequences forming a monophyletic clade. Thus, our phylogenetic analysis suggests 214 that MAME 1 represents a novel, previously undescribed group of tunicates. Given 215 their extremely long-branches, however, additional molecular data will be needed to 216 further confirm this relationship. 217 218

Discussion 219
High-throughput sequencing, a powerful methodology to assess diversity 220 HTES is a useful method, but it also has some caveats. For example, it is well known 221 that it may be misleading to directly translate reads and OTU numbers into biomass 222 and number of species, respectively. In particular, the use of amplicon data as a proxy 223 for metazoan biomass abundance has been disputed, also with RNA data 23 . Different 224 number of rRNA copies in the genomes of different taxa, PCR primer mismatches and 225 amplicon lengths can all affect the correlation between morphological and molecular 226 data 7,24 . However, some studies have indeed shown positive correlations between 227 read abundances and biomass patterns in bivalve and decapod larvae 19 and within 228 copepod groups 7 . Thus, we believe our approach to biomass abundance, although not 229 perfect, is useful enough to report the most abundant groups. A good indication of our 230 approach is that we recovered the general patterns previously described in 231 micro/mesoplanktonic communities based on morphological observations 25,26 , in 232 which copepods were found to be predominant within micro/mesoplanktonic 233 communities 6 followed by appendicularians 26 . Moreover, we found a more 234 heterogenic distribution in benthic habitats, which is to be expected considering that 235 sediments are known to harbor most of the metazoan diversity 5 . 236 Overall, our data confirms that, although with some caveats, HTES is a powerful tool 237 to assess diversity. In this regard, the construction of a phylogenetically curated 238 database to assign the OTU taxonomy has proven to be crucial for our analysis aimed 239 at describing novelty in different metazoan phyla. Our clustering of OTUs at 97% is 240 likely a conservative approach for metazoans 27 , and some of our OTUs may indeed 241 represent more than one species. This largely depends on each metazoan lineage and 242 its specific 18S rRNA evolution rate. Moreover, primer bias can affect the detection 243 of some groups, meaning that some taxa can be present in the environment but 244 missing in our dataset 28 . However, by clustering at 97% we can directly compare the 245 results with the rest of the eukaryotes and get a more stringent output avoiding 246 polymorphisms effects and an overrepresentation of the retrieved diversity. 247 248

Benthic-Pelagic relationship 249
Analysis of benthic and pelagic metazoan communities in our dataset revealed that 250 most OTUs are exclusively pelagic or benthic, showing few overlaps between the two 251 communities, in agreement with our beta-diversity analyses (Fig. S3, Fig. S4A) and 252 the literature available 29,30 . Only 10% of OTUs from our dataset were present in both 253 benthic and pelagic communities, and these mainly corresponded to polychaetes, 254 crustaceans, molluscs and cnidarians ( Fig. 2A). Among the shared OTUs Polychaeta 255 and Mollusca water column reads probably represent juvenile pelagic stages 31,32 256 while the benthic reads from crustaceans and cnidarians, that are predominantly 257 pelagic, come likely from death organisms or debris. 258 In addition, our data clearly shows that the pelagic OTUs tend to be present in more 259 sites, while most of the benthic OTUs are restricted to one location. The restricted 260 presence of meiofaunal OTUs has been described previously 20 . Thus, the distribution 261 in the water column fits more with the consideration that "everything is everywhere" Somewhat surprisingly, we observed a high percentage of metazoan reads in the 267 smaller size fractions of most water column samples (Figure 1). This includes, as 268 well, the samples derived from RNA templates, probably indicating a significant 269 biological activity of metazoans in those smaller fractions. We believe it is unlikely 270 that those metazoan RNA reads could come from an extracellular origin because RNA 271 is fragile and quickly degraded by ribonucleases, and its structure is easily affected by 272 both oxygen and water 35 . Furthermore, the RNA reads from pico/nanoplanktonic 273 fractions contain a different taxonomic distribution compared to the extracellular 274 DNA samples and the micro/mesoplanktonic RNA samples ( Fig. 1A and Fig. 3A). 275 Thus, and taking into account the small size reported for certain animal gametes, we 276 hypothesize that a large part of those metazoan reads from the smaller fractions most 277 likely come from metazoan gametes. 278 This is the case, for example, of the reads from smaller fractions assigned to tunicates, 279 ctenophores, cnidarians and polychaetes, since they all use external fertilization. 280 Ctenophora and Cnidaria, which are not only abundant in DNA reads but also have a 281 relatively high number of RNA reads in the smaller fractions (Fig. 3B), might be a 282 particularly notable example of the importance of gametes in the environment. The 283 co-occurrence of reads in both smaller and larger fractions, the overrepresentation in 284 the smaller ones and the fact that their sperm size is smaller than 5 µm 36,37 are good 285 indicators that at least the RNA signal of cnidarians and ctenophores might 286 corresponds to gametes. That will not be the case for the reads assigned to copepods 287 in the smaller fractions. They cannot come from gametes, since copepods use internal 288 fertilization and release eggs larger than 50 µm 38 . Therefore, the crustacean RNA 289 reads observed in smaller fractions (from 0.8 to 20 µm) are probably the result of cell 290 breakage from larger fractions (Fig. 3A). Finally, we note that some of the OTUs that 291 are exclusively retrieved from smaller fractions could also correspond to sperm from 292 organisms that are larger than 2mm or from benthic fauna with external fertilization 293 and gamete sizes less than 10 µm, such as certain ctenophores and polychaetes (Table  294 S7). 295 It is worth mentioning that metazoan RNA reads corresponding to germline cells 296 could account, in our data, for as much as 3.2% of the total eukaryotic RNA reads in 297 the smaller fractions (Table S7), and up to 33% of eukaryotic reads in anoxic samples. We performed an analysis on novelty by plotting the pairwise identities of the first 309 BLAST hit against NCBI non-redundant database. This provided a distribution of the 310 "novel" OTUs (those with sequence identities lower than 97% to any NCBI sequence) 311 along different environments (Fig. 2) and for different metazoan phyla (Fig. 4). 312 Interestingly, we found that 45% of our metazoan OTUs had less than 97% identity 313 against the NCBI nt nr database. Why a threshold of 97% for novelty? We believe it 314 is the safest one to detect novelty, although we probably miss a lot of intra-genera or 315 intra-class variation, depending in the animal group. It is worth mentioning, however, 316 that by having a threshold of pair-wise identities below 97%, we avoid any potential 317 intra-individual polymorphic variants 40 . Therefore, we follow the rationale that OTUs 318 that do not have 100% identities but close (98% or higher) against the first BLAST hit 319 from NCBI non-redundant database, are probably the same taxa (maybe representing 320 intraindividual variations) or very closely related species. In contrast, the OTUs that 321 have a BLAST identity under 97% represent much deeper changes, and so, they 322 clearly represent, at least, different taxa than the ones represented in Genbank. Some 323 OTUs, especially those 10% of our OTUs with pairwise identities against GenBank 324 under 90%, may even represent new clades. 325 Although one could argue that this degree of novelty might reflect sequencing 326 artifacts, we are confident it is not the case because 1) we have followed a stringent 327 chimera and singletons removal process, 2) the reads are distributed across different 328 samples, and 3) they are not homogeneously distributed among taxonomic groups. In 329 addition, around 80% of our OTUs have RNA reads and their taxonomic distribution 330 is almost identical to the DNA OTUs. So, these novel variants present in the RNA 331 subset are transcribed by active organisms and are less prone to be artifacts or rare 332 variants 41 . 333 We are aware that detection of novelty in metazoans just with molecular data is 334 challenging, given that the number of described animal species is larger than the 335 number of 18S rRNA sequences available in public databases (Fig. S7B). Therefore, a 336 novel sequence might belong to a species that has already been described but not yet 337 sequenced. A complete database linking morphological and molecular data is needed 338 to fully solve this issue. However, the 18S rRNA data so far available certainly is a 339 good representation of known animal diversity (Fig. S7B), and we believe our study 340 does indicates which metazoan lineages contain the higher levels of hidden molecular 341 diversity, and so, which are the animal groups needed for a more extensive sampling. We also recovered and genetically described a potential novel group of tunicates, here 361 named as "MAME 1". It could be argued that this group represents an already 362 described Thaliacean related to the genus Doliolum that happens to have never been 363 sequenced or rare variants of the 18S gene belonging to known species. However, we 364 consider these two options unlikely for several reasons. First, the group seems to be 365 well populated (69 OTUs between our data and public repositories) and present in 366 many environments worldwide, not only in coastal waters (Fig. S5). Moreover, the 367 pairwise identity of the two MAME 1 sequences retrieved from NCBI is about 89%, 368 suggesting is not a single species, but rather an entire group of sequences with high 369 genetic variability, forming an independent clade related to Thaliaceans (Fig. 5). In 370 fact, the nucleotide identity among MAME 1 OTUs is similar as the observed among ribosomal sequences available at Genbank) and the percentage of identity of MAME 375 1 sequences against described Tunicate species seems too low (78% of identity with 376 the best BLAST hit Thalia democratica) for a different 18S rRNA type. In animal 377 groups in which different classes of 18S rRNA gene have been described, such as in 378 chaetognaths, the intra-individual variation among 18S classes lies around 90-93% of 379 identity 44 . Therefore, we suggest that MAME 1 might corresponds to a new group of 380 tunicates that contains a large number of RNA reads within micro/mesoplankton 381 environments and is present in different habitats. However, without morphological 382 data, we cannot truly discard the possibility that those sequences belong to a 383 molecular divergent group of Thaliacean species, already morphologically described, 384 but without genetic data available. Although this emphasizes the powerful of HTES to 385 assess biodiversity and detect novelty, it also highlights its limitations. Thus, it is 386 crucial to continue and improve the classical screenings of marine diversity, with the 387 aim to link altogether morphological and genetic information in order to better 388 understand the metazoan biodiversity of our oceans. 389

Conclusions 390
We have reported an analysis of micrometazoan diversity in the European coast based 391 on HTES that includes, for the first time, both water column and sediments, oxic and 392 anoxic environments, and both DNA and RNA templates. To assess taxonomy, we 393 constructed a novel reference dataset comprising all animal phyla, which was 394 manually and phylogenetically curated. Our data show general read abundance and 395 richness patterns that partially corroborate previous morphological 5,6,25,26 and 396 molecular studies 8,10,13,19,20,45 . Our data showed a high relative abundance of 397 metazoan RNA reads within pico-nano size fractions (0.8-20 µm), suggesting that the 398 sperm of Ctenophores and Cnidarians plays a relevant ecological role as part of the 399 microbial food network. These results show the potential of HTES techniques as a fast 400 and exhaustive method to approach the study of micrometazoan biomass and diversity 401

patterns. 402
This kind of data has allowed us to describe novelty values found in different animal 403 phyla. We observed that some animal phyla have much genetic novelty that is yet to 404 be unraveled, including novelty in several well sampled groups such as Crustacea, 405 Platyhelminthes or Nematoda. Our finding of a potential new group of widespread 406 tunicates (MAME 1) highlights the value of phylogenetic approaches to identify novel 407 groups within phyla. The finding of MAME 1 in several HTES datasets could be 408 considered the first step in a reverse taxonomic process 46 potentially leading to 409 isolation and detailed description. Overall, our data show that, if we truly want to 410 understand the biodiversity of marine environments, it is important to further sample 411 animal taxa within those environments. To achieve that, we need to have better tools 412 for the genetic screening, and especially for the isolation and morphological 413 characterization of these organisms. 414

Sampling, 454 sequencing, curation of the sequences and diversity analysis 416
During the BioMarKs project (biomarks.eu), samples were collected in six European 417 coastal sites (Fig. S1

Analysis of the RNA reads from the small fractions 431
Using QIIME scripts, we binned the OTUs that contain RNA reads within the water 432 column of each sampling site into three different groups: 1) OTUs containing the 433 small fractions (pico/nano), 2) OTUs containing the larger fraction (micro/meso), and 434 3) OTUs present in both small and large size classes. OTUs representing less than 10 435 RNA reads per site were discarded. 436

Phylogenetic analysis of MAME1 sequence tags 437
In order to phylogenetically place the short reads assigned to the novel metazoan 438 group (MAME 1) within an animal and tunicate backbone, we performed a RAxML-439 EPA analysis 51 using a metazoan and a tunicate reference tree using the longest 440 putative MAME 1 sequence found by BLAST at NCBI nt nr database (KC582969), as 441 a unique MAME 1 representative. Using the MAME1 tree and alignment as a 442 reference we recruited environmental 18S rDNA short reads from SRA and Tara 443 Oceans and used them to perform abundance and distribution analyses (see the 444 electronic supplementary material). OTUs is shown based on prevalence: In blue, pelagic-specific OTUs (i.e., OTU with 638 more than 90% of the reads within the water column); in green, OTUs present both in 639 the water column and the sediments; in brown, OTUs present only in sediments (i.e., 640 OTUs with more than 90% of the reads within the sediments). In addition, BLAST 641 identities are shown against NCBI nr nt in dark/light blue. The number of OTUs (blue 642 line) and number of reads (red line) based on their occurrence in 1 or more (up to 5) 643 geographical site is shown to the right. 644 (blue) and the number of reads (red) of the given phyla. 652