Integrative analysis of large scale transcriptome data draws a comprehensive functional landscape of Phaeodactylum tricornutum genome and evolutionary origin of diatoms

Diatoms are one of the most successful and ecologically important groups of eukaryotic phytoplankton in the modern ocean. Deciphering their genomes is a key step towards better understanding of their biological innovations, evolutionary origins, and ecological underpinnings. Here, we have used 90 RNA-Seq datasets from different growth conditions combined with published expressed sequence tags and protein sequences from multiple taxa to explore the genome of the model diatom Phaeodactylum tricornutum, and introduce 1,489 novel genes. The new annotation additionally permitted the discovery for the first time of extensive alternative splicing (AS) in diatoms, including intron retention and exon skipping which increases the diversity of transcripts to regulate gene expression in response to nutrient limitations. In addition, we have used up-to-date reference sequence libraries to dissect the taxonomic origins of diatom genomes. We show that the P. tricornutum genome is replete in lineage-specific genes, with up to 47% of the gene models present only possessing orthologues in other stramenopile groups. Finally, we have performed a comprehensive de novo annotation of repetitive elements showing novel classes of TEs such as SINE, MITE, LINE and TRIM/LARD. This work provides a solid foundation for future studies of diatom gene function, evolution and ecology.

Diatoms are one of the most important and abundant photosynthetic micro-eukaryotes, and contribute annually about 40% of marine primary productivity and 20% of global carbon fixation 1. Marine diatoms are highly diverse and span a wide range of latitudes, from tropical to polar regions. The diversity of planktonic diatoms was recently estimated using metabarcoding to be around 4,748 operational taxonomic units (OTUs) 2,3. In addition to [...]

The availability of a sequenced genome for P. tricornutum has also opened the gate for functional genomics studies, e.g., using Gateway, RNAi, CRISPR, TALEN and conjugation, which [...]

[...] elements. We further identified extensive alternative splicing (AS) involved in regulation of gene expression in response to nutrient starvation, suggesting that AS is likely to be used by diatoms to cope with environmental changes. We report a conserved epigenetic code, providing the host with different chromatin states involved in transcriptional regulation of genes and TEs. Finally, our work dissected the proposed complex chimeric nature of diatom genomes, demonstrating the transfer of green, red and bacterial genes into diatoms, using the greatly expanded genomic and transcriptomic reference libraries that have become available across the tree of life since the publication of the initial genome 10,17. This resource was released as Phaeodactylum tricornutum annotation 3 (Phatr3) and is available to the community on the Ensembl portal (http://protists.ensembl.org/Phaeodactylum_tricornutum/Info/Index).

Results and Discussion

Structural re-annotation of P. tricornutum genome reveals numerous new gene models

To generate a new annotation of the P. tricornutum nuclear genome, our approach combined high-throughput RNA sequencing data (RNA-Seq) along with ESTs and protein sequences. [...] (Table 1, Table S1):

1) New gene models, genes that are newly discovered and are not present in Phatr2 gene annotations.

2) Unchanged gene models, genes whose structural annotation remains the same as in Phatr2.

3) Modified gene models, genes whose structural annotation has a different 5' end, 3' end or both 5' and 3' ends with respect to Phatr2. Thirty of these genes with a different N-terminus in Phatr3 compared to Phatr2 were validated by their presence within a previously constructed multi-gene reference dataset of aligned plastid-targeted proteins that are well conserved across ochrophyte lineages (File S1, panel A) 10. The N-terminus identified by the Phatr3 gene model in each alignment broadly matches the N-termini identified for orthologous sequences from other ochrophytes (File S1, panel B). Furthermore, RT-PCR analysis of six genes within this dataset amplified products of the length predicted by Phatr3 (File S1, panel C).

[...] Table S1.

5) Split gene models, genes that are formed by splitting one Phatr2 gene into two Phatr3 gene models.

6) Antisense gene models, genes that are found localized on the antisense strand of previously annotated Phatr2 genes.

7) Others, 566 genes which do not fall into any of the categories above. These genes require manual curation, which can be achieved through the Web Apollo Portal we implemented to improve the Phatr3 genome annotation.

Since the length of 56 genes in the Phatr3 repertoire is less than 100 bp, we only considered 12,177 genes for further functional analysis (Table S1).
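The gene model categories can be illustrated with a minimal sketch that compares gene coordinates between an old and a new annotation. This is not the pipeline used to build Phatr3: the `Gene` record, the `classify` function and the comparison of outer coordinates only are assumptions made for illustration. A real structural comparison would also consider exon boundaries, and only the categories whose definitions are given in the text are modelled.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record: outer coordinates only, no exon structure."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str


def overlaps(a: Gene, b: Gene) -> bool:
    """True if the two genes share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def classify(new_gene: Gene, old_genes: list, new_genes: list) -> str:
    """Assign a new gene model to one illustrative category."""
    hits = [g for g in old_genes if overlaps(new_gene, g)]
    same_strand = [g for g in hits if g.strand == new_gene.strand]
    if not hits:
        return "new"
    if not same_strand:
        return "antisense"
    if len(same_strand) == 1:
        old = same_strand[0]
        if (old.start, old.end) == (new_gene.start, new_gene.end):
            return "unchanged"
        # Another new gene on the same strand covering the same old gene
        # suggests that the old model was split.
        siblings = [g for g in new_genes
                    if g != new_gene and g.strand == old.strand and overlaps(g, old)]
        return "split" if siblings else "modified"
    return "other"
```

Genes returned as "other" in this sketch correspond loosely to models that would need manual curation.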

Assessment of the conservation and complex evolutionary origin of P. tricornutum genome

In light of the recent availability of numerous genome and transcriptome sequences from many under-sampled taxa (e.g., red algae) through resources such as the Marine Microbial Eukaryote Transcriptome Sequencing Project 10,17, we wished to update and re-dissect the proposed complex chimeric nature of the P. tricornutum genome. We first aimed to assess the conservation of the P. tricornutum proteome across various taxonomic categories, which we grouped together based on recent published phylogenies and taxonomic reviews (File S2) 10,19. From this analysis, a total of 9,008 (74.0%) of the genes within the P. tricornutum genome were found to be shared with at least one other group within the tree of life. This is substantially greater than the ~60% of P. tricornutum genes previously identified to have orthologues in other groups 7, underlining the importance of dataset size and taxonomic sampling when considering gene conservation 11. Up to 251 different conservation patterns were identified across the entire genome, thirteen of which each accounted for 100 genes or more (Fig. 1). Many of the genes were found to have broad distributions across the tree of life, with 4,543 genes (37.3%) found in at least five of the nine groups considered. These [...] Table S2). A further 1,188 genes (9.8%) were universally found across all eukaryotic groups but neither in prokaryotes nor viruses, hence might constitute eukaryote-specific genes (Fig. 1; Table S2).
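As an illustration of what a "conservation pattern" means here, the following sketch tallies genes by the exact set of groups in which they have orthologues. The function names and the input layout (a mapping from gene identifier to a set of group labels) are assumptions; the group labels used in any call are placeholders rather than the nine groups defined in File S2.

```python
from collections import Counter


def conservation_patterns(orthologue_groups: dict) -> Counter:
    """Count genes per pattern, where a pattern is the sorted tuple of groups
    in which a gene has orthologues (the empty tuple means species-specific)."""
    return Counter(tuple(sorted(groups)) for groups in orthologue_groups.values())


def broadly_conserved(orthologue_groups: dict, min_groups: int = 5) -> list:
    """Return genes found in at least `min_groups` groups."""
    return [gene for gene, groups in orthologue_groups.items()
            if len(groups) >= min_groups]
```

With nine groups there are 2^9 = 512 possible patterns, so the 251 patterns reported above cover roughly half of the possible combinations.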

We still found, with the expanded dataset, that many genes within the P. tricornutum genome have limited evolutionary conservation, with 5,750 genes (47.2%) having originated within the recent vertical history of the stramenopile lineage. A total of 3,170 genes (26.0%) were found to be specific to P. tricornutum, 1,929 were only shared between P. tricornutum and other diatoms (15.8%), and 651 were only shared with diatoms and other stramenopiles (5.3%) (Fig. 1). We found only limited evidence for genes that were not shared between Phaeodactylum and other diatoms, but were shared with other groups (410 genes, 3.4%), or for genes that were not shared between Phaeodactylum and other stramenopiles but were shared with other groups (242 genes, 2.0%), suggesting largely vertical recent inheritance of the Phaeodactylum genome (Table S2).
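The lineage-restricted total and its percentages can be reproduced with a short calculation. The sketch assumes that the denominator is the 12,177 genes retained for functional analysis; the variable and function names are illustrative.

```python
TOTAL_GENES = 12177  # assumed denominator: genes retained for functional analysis

# Counts quoted in the text for genes restricted to the stramenopile lineage.
lineage_restricted = {
    "P. tricornutum only": 3170,
    "shared only with other diatoms": 1929,
    "shared only with diatoms and other stramenopiles": 651,
}


def percent(count: int, total: int = TOTAL_GENES) -> float:
    """Percentage of the gene set, rounded to one decimal place."""
    return round(100 * count / total, 1)


subtotal = sum(lineage_restricted.values())
for label, count in lineage_restricted.items():
    print(f"{label}: {count} genes ({percent(count)}%)")
print(f"lineage-restricted total: {subtotal} genes ({percent(subtotal)}%)")
```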

Further, we wished to determine whether the 1,489 novel genes uncovered by Phatr3 differ in terms of evolutionary conservation from those previously identified. While many of the novel genes are specific to P. tricornutum (864 genes; 58.0%; Fig. S1) or are limited to diatoms (222 genes; 14.9%; Fig. S1), 44 genes (13.6%) are shared with at least five other groups, and 4 novel genes are shared with all nine groups considered (Fig. S1), including a UvrD-like DNA helicase containing protein (Phatr3_EG00261), CTP biosynthetic process (Phatr3_EG00931), telomere recombination (Phatr3_J11434) genes and a high mobility group protein (Phatr3_J1241), confirming that many of these genes are likely to have important biological functions.

In our second analysis, we aimed to reassess the evolutionary origins of the P. tricornutum genome. In particular, we wished to validate the presence of genes derived from green algae and from prokaryotes, which have previously been controversial 7,9,11, and identify whether different predicted gene transfer events occurred specifically in P. tricornutum, or are more ancient events, occurring prior to the radiation of pennate diatoms, all extant diatoms, stramenopiles, or earlier.

Across the entire dataset, 584 genes yielded top BLAST hits against prokaryotes (Table S3), which is similar to the number of prokaryotic genes (587) identified in the initial publication of the P. tricornutum genome 7. Similarly to the initial genome publication, the prokaryotic sub-category that produced the most top hits (235) was the proteobacteria (Table S3; Fig. S2A). Nine other sub-categories (cyanobacteria, firmicutes, chlorobi, archaea, actinobacteria, chlamydiae, chloroflexi, the Deinococcus-Thermus clade, and planctomycetes) contributed more than ten hits each (Fig. S2A; Table S3). The 15 gene transfers involving members of the Deinococcus-Thermus clade are of particular interest, as this lineage has not previously been reported to have specifically exchanged genes with an ancestor of P. tricornutum 7.

We considered whether the prokaryotic genes present in the P. tricornutum genome are recent acquisitions (e.g., species-specific), or were acquired at earlier points in the evolution of diatom lineages. This was not possible for the initial genome publication, for which the only other available diatom genome was that of the centric species T. pseudonana 7,8. For this, we performed an analysis in which we serially removed the closest relatives of P. tricornutum in our sequence library (which includes seven complete diatom genomes and transcriptomes for a further 92 diatom species available through MMETSP) 11,36, and assessed the number of prokaryotic genes that could be identified in each analysis. Twenty-two of the prokaryotic genes were identifiable with the full dataset (hence were specifically acquired by P. tricornutum following its divergence from other diatoms), 69 were identifiable with a full dataset excluding pennate diatoms (hence were presumably acquired within the evolutionary history of the pennate [...] showing more ancient origins (Table S3). Thus multiple gene transfer events involving prokaryotes have occurred progressively through the evolution of ancestors of P. tricornutum.
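The serial-removal logic can be sketched as follows: relatives are excluded step by step, and a gene is assigned to the first step at which its best remaining hit is prokaryotic. This is a simplified illustration and not the authors' code. Only the first two steps (full dataset, then the dataset minus pennate diatoms) are named in the text; the later step label, the lineage labels and the function name are hypothetical, and the sketch ignores the additional top-hit consistency filter described in the Methods.

```python
def first_prokaryotic_step(hits, removal_steps):
    """Return the label of the first removal step at which the best remaining
    hit is prokaryotic, or None if that never happens.

    hits: list of (lineage, evalue) pairs for one gene.
    removal_steps: ordered list of (label, lineages_removed_at_this_step).
    """
    ranked = sorted(hits, key=lambda hit: hit[1])  # best (lowest) e-value first
    excluded = set()
    for label, removed in removal_steps:
        excluded |= set(removed)
        remaining = [lineage for lineage, _ in ranked if lineage not in excluded]
        if remaining and remaining[0] == "prokaryote":
            return label
    return None


# Illustrative schedule; the third step is an assumption.
REMOVAL_STEPS = [
    ("full dataset", set()),
    ("minus pennate diatoms", {"pennate diatom"}),
    ("minus all diatoms", {"centric diatom"}),
]
```

A gene first called prokaryotic at a given step would be interpreted as having been acquired before the divergence of the lineages removed at that step, but after the divergence of those still present.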

Red algal genes

Across the entire dataset, 459 genes produced BLAST top hits against members of the red algae, consistent with the red algal ancestry of the diatom plastid 4,20 (Fig. 2A). This is broadly equivalent to the number of red genes identified in previous studies of diatom plastids 21. The two sub-categories with the greatest contributions to these genes were the Porphyridiophytes (150 genes) and Bangiophytes/Florideophytes (147 genes) (Fig. S2B; Table S3). A total of 353 of the red algal genes were identified following removal of all ochrophyte sequences from the dataset, with only a further 28 identified following the removal of aplastidic stramenopile groups (oomycetes, labyrinthulomycetes, and slopalinids) and a further 25 identified following the removal of the two remaining SAR clade groups (ciliates, and aplastidic rhizaria) considered (Fig. 2B; Table S3). The limited number of genes of red algal origin identified within aplastidic SAR clade members supports a late acquisition of a red algal plastid by a common ancestor of all ochrophytes, following their divergence from oomycetes 10,22.

Green genes

A total of 1,981 genes generated top BLAST hits from members of the green group (green algae and plants). This is similar in size to the number of green genes (>1700) identified in previous studies of the origins of diatom groups 21, and could be consistent with large scale gene transfer between diatom ancestors and green algae (Table S3). Some of these genes may be misidentified genes of red algal origin, as has been discussed elsewhere 11,12; however, we believe that many are genuinely of green origin, for two reasons. Firstly, compared to previous phylogenomic studies of diatom genomes, our reference library contains a much larger amount of red algal sequence information, including five complete genomes, and large-scale transcriptomes for a further twelve red algal species (Table S4) 10. Up to 685 of the identified green genes had orthologues (as confirmed by the reciprocal best-hit (RbH) analysis) in two or more red sub-categories, 314 had identified orthologues in two or more subcategories each of red algae, green groups, amorphea (opisthokonts, amoebozoa and excavates) 19 and prokaryotes, and 222 had identified orthologues in all five of the red sub-categories and all eleven of the green sub-categories considered (Fig. S3A). We saw no difference in the representation of red and green algal sub-categories in genes with annotated red or green origin (Fig. S3B).

Secondly, green gene transfers appear to have occurred at a different time point to the red algal gene transfers. Although the largest number of putative green genes (805) were identified with the dataset from which all ochrophyte groups were removed (Fig. 2B), nearly as many (691) were identified following the removal of aplastidic stramenopiles from the dataset (Fig. 2A). This contrasts with the situation for red genes (which were overwhelmingly identified following the removal of all ochrophyte sequences from the library, as discussed

In summary, our data support previous findings 10,21 of gene transfers between an ancestor of stramenopiles and one or more groups of chlorophyte algae. More broadly, the presence of green, red and prokaryotic genes in the P. tricornutum genome, which appear to have arisen at different points in its evolutionary history, confirms that it is an evolutionary

Interesting domain architectures were found among the genes that contain either Rv, RNase

We further aimed to determine whether genes with different levels of conservation, as determined by our analysis (Table S2) [...] by RbH analysis (Fig. 1). These were: the 3170 genes that are specific to P. tricornutum (Pt-specific genes), the 1929 genes that are uniquely shared with other diatoms (diatom-specific genes), the 1188 genes that are shared across all eukaryotic groups, and the 203 genes that are shared with all other eukaryotic groups and with prokaryotes (Fig. 1; Fig S4; Table S5).

Interestingly, a high number of Pt-specific genes encode the DNA integration GO category,

Next, we used ASAFind and HECTAR to predict the sub-cellular targeting of the Phatr3 proteome (Table S6)

We also considered the expression dynamics of each gene using a quartile approach (where genes in the 1st quartile are considered to have no or low expression, those in the 2nd quartile low to moderate expression, those in the 3rd quartile moderate to high expression, and those in the 4th quartile very high expression). Most of the novel genes (~70%) are expressed below the median level inferred for all other genes in the genome (Fig. 3A) and are mostly specific to P. tricornutum (Fig. S1). We then compared chromatin marks associated with new versus unchanged Phatr3 gene models and found that the proportion of DNA methylated genes within new gene models (30%, 448 genes) was clearly higher than the proportion within unchanged gene models (9%, 4,667 genes) (Table S1). The majority of these are methylated at least in a CG context, which is in line with previous work 15. Thus, along with boosting the functional content of the genome, newly discovered genes will certainly expand our capacity [...] (Table S1, Fig 3B). The co-localization effect of different chromatin-level modifications on these new genes regulates repressive, active and moderate states of expression of the genome (Fig 3B).
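The quartile approach to expression can be sketched with the Python standard library. The function name and the input layout (gene identifier mapped to a single expression value) are assumptions, and the normalisation used to obtain expression values is not specified here.

```python
import statistics


def expression_quartiles(expression: dict) -> dict:
    """Assign each gene to quartile 1 (no or low expression) through
    quartile 4 (very high expression), using cut points computed over
    all genes in `expression`."""
    q1, q2, q3 = statistics.quantiles(expression.values(), n=4)
    bins = {}
    for gene, value in expression.items():
        if value <= q1:
            bins[gene] = 1
        elif value <= q2:
            bins[gene] = 2
        elif value <= q3:
            bins[gene] = 3
        else:
            bins[gene] = 4
    return bins
```

Under this scheme, a gene expressed below the median falls into quartile 1 or 2, which is how a statement such as "~70% of novel genes are expressed below the median" could be checked against a list of novel gene identifiers.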

Finally, we considered an update on the distribution of DNA methylation and post-translational modifications of histone H3 (PTMs) across the P. tricornutum genome, following previous work 15,16. Up to 11534 genes (~95%) within Phatr3 were found to be associated either with the studied H3 PTMs or with DNA methylation. Most of the genes are preferentially labelled only by marks (6708 genes, ~55%) associated with an active transcriptional state, such as acetylation (H3K9_14Ac) and/or H3K4me2 (Fig 3C; Table S1), whereas ~8% of genes are marked only by repressive modifications (DNA methylation, H3K27me3, H3K9me2 and

However, in most unicellular eukaryotes first introns are found to be significantly longer than non-first introns (Fig S6) 39. The functional consequences of this intron organization remain to be determined.

Next, we profiled alternative splicing events in P. tricornutum using RNA-Seq data generated in different growth and stress conditions (see Methods). From the 12177 Phatr3 gene models, 2924 (~24%) genes are found to have introns that can be retained in more than 20% of the total experimental samples studied (intron retention, IR), while 2444 (~20%) genes are observed to skip one or more exons in various samples (exon skipping, ES). A total of 1335 (~11%) genes are found to undergo both ES and IR, hence display both forms of alternative splicing (Fig 4A; Table S1). Like most unicellular eukaryotes and unlike metazoans, P. tricornutum shows a higher rate of IR than ES, supporting the hypothesis that ES has become more prevalent over the course of metazoan evolution 32.
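The sample-fraction criterion for calling IR, and the overlap between IR and ES gene sets, can be sketched as below. The input layout (one boolean per sample for each gene) and the function name are assumptions; how an individual retention or skipping event is detected from read alignments is outside the scope of this sketch. The ES call here uses "at least one sample", which is one reading of "various samples".

```python
def genes_with_event(calls: dict, min_fraction: float = 0.0) -> set:
    """Return genes for which the event is observed in a fraction of samples
    strictly greater than `min_fraction`.

    calls: gene -> list of booleans, one per sample (True = event observed).
    """
    return {gene for gene, flags in calls.items()
            if flags and sum(flags) / len(flags) > min_fraction}


def splicing_summary(ir_calls: dict, es_calls: dict, ir_min_fraction: float = 0.20):
    """Return (IR genes, ES genes, genes showing both)."""
    ir_genes = genes_with_event(ir_calls, ir_min_fraction)
    es_genes = genes_with_event(es_calls)  # observed in at least one sample
    return ir_genes, es_genes, ir_genes & es_genes
```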

We then considered the expression dynamics of P. tricornutum genes that undergo IR or ES (Condition used: WT, Bio sample accession: SAMN06350643). Surprisingly, we found that genes that can undergo intron-retention are more highly expressed than genes that do not show alternative splicing (two sample t-test, P-value < 0.008, Fig 4B). This is in contrast to the situation in mammals in which intron-retention down-regulates the genes that are physiologically less relevant 40.

To further assess the biological role of alternative splicing, we identified 1341 genes showing [...] creating mRNA diversity, seems to be used for transcriptional regulation of specific genes under specific conditions in P. tricornutum and is likely to be widespread.

Copia-type LTR makes up most of the TEs in the P. tricornutum genome

In the context of the Phatr3 re-annotation of the P. tricornutum genome, we also revisited the annotation of repetitive elements in the genome assembly. In the current analysis, we applied a robust de novo approach for the whole genome annotation of repeat sequences. Collectively, repeats were found to contribute ~3.4 Mb (12%) of the assembly, including transposable elements (TEs), unclassified and tandem repeats, as well as fragments of host genes (Table 2). TEs are the dominant repetitive elements in P. tricornutum and represent 75% of the repeat set, i.e., 2.3 Mb as compared to 1.7 Mb in the previous TE annotation. By comparing the Phatr3 repertoire of TEs, including both large and small elements, with the previous TE annotations, 1988 (~54%) TEs were found to be novel (Table S8).

In line with previous analyses, Copia-type LTR retrotransposons (LTR-RTs) are the most abundant type of TEs, contributing over 55% of the repeat annotation, while Gypsy-type LTR-

Next, we considered the epigenetic marks and expression profiles associated with TEs within Phatr3. Consistent with previous reports 15,16, the majority (2790, ~75%) of the Phatr3 TE repertoire is associated with one or other studied chromatin marks known to maintain either active (H3K4me2, H3K9_14Ac) and/or repressive states (DNA methylation, H3K27me3,

H3K9-14Ac and H3K4me2, respectively (Fig S8B; Table S8). A total of 458 (~23%) TEs were methylated, most of which (368; ~80%) in a CG context only (Fig S8C; Table S8), while 19 (~4%) [...] least two sub-contexts of DNA methylation (CG, CHH and CHG) (Fig S8C; Table S8). As for genes, TEs marked by active PTMs of histones show high levels of expression while those that carry repressive marks display lower expression levels, and TEs with combinations of both marks are expressed at intermediate levels (Fig S8A, S8D). TEs that are methylated specifically in CG context are typically expressed at lower levels compared to those specifically methylated in CHG or CHH contexts (Fig S8E).

Finally, we compared the epigenetic marks and expression profiles associated with different types of TEs, based on their methods of transposition. We noted distinct patterns of DNA methylation and histone modifications associated with class I and II TEs: class I TEs are enriched with CG methylation, co-localizing with or without CHH methylation, while class II TEs are predominantly marked by CHH DNA methylation, and CG and CHG methylation events co-localize with one another (Fig S9). A similar pattern was reported in soybean where a high

When checked, many of the TEs with CHG methylation were found to be inserted into or overlapping with genes (Table S8), reflecting the importance of maintaining these TEs in a silent state. A similar phenomenon has been observed for Arabidopsis TEs which are inserted into genes and whose repression is required to avoid the deleterious effects of TE insertion into host genes 46.

In summary the dissection of P. tricornutum genome reported here has led to the discovery [...]

Phaeodactylum tricornutum genome re-annotation (named as Phatr3) was done on the Phatr2 genome assembly (ASM15095v2). The Phatr2 assembly was generated by the Joint Genome Institute (JGI), which resulted in 10,402 gene models from 33 assembled scaffolds (12 complete and 21 partial chromosomes) and 55 unassembled scaffolds 7. Gene models were predicted from RNA-Seq mapping and aligning the EST data-set using est2genome. Additionally we used SNAP, Augustus and MAKER2 for final gene predictions. Apart from the previous assembly information, the species-specific data used in this re-annotation included the following. [...]

[...] which identifies a custom similarity threshold between each constituent library above which sequence pairs are inferred to be contaminants. In addition, following the methodology of a previous study 10, MMETSP libraries from a further twelve species were excluded due to the presence of larger scale systematic contamination (Table S4).

The reference sequence library was split into twenty-five prokaryotic sub-categories (including archaea) and forty-nine eukaryotic sub-categories, which were finally binned into nine distinct groups 10,19 (Table S4;

Next, the presence of organelle signaling signatures within the entire Phatr3 gene repertoire was further investigated using ASAFind and HECTAR, under the default conditions as specified in the original publications for each program 63,64. HECTAR was run remotely, using the Galaxy integrated server provided by the Roscoff Culture Collection (http://webtools.sb-roscoff.fr/).

Total cellular RNA was extracted from approximately 30 ml late-log phase P. tricornutum, grown as described above, by phase extraction with Trizol (Thermo, France), followed by treatment using RNase-free DNase (Qiagen, France) and cleanup using an RNeasy column (Qiagen) as previously described 10. RNA was verified to be free of residual DNA contamination by PCR using previously generated universal 18S rDNA primers 66. cDNA was synthesized from 100 ng DNA-free RNA using a Maxima First cDNA synthesis kit (Thermo), and PCR was performed using the cDNA template and primers designed against the 5' and 3' ends of genes of interest using DreamTaq DNA polymerase (Thermo), per the manufacturers' instructions. Products were separated by electrophoresis on a 1% agarose TAE gel containing 0.2 µg/ml ethidium bromide at 100 V for 30 minutes, and visualized with a UV transilluminator. Representative products from each reaction were purified using PCR cleanup spin columns (Macherey-Nagel, France), and confirmed by Sanger sequencing (GATC, France) using both the forward and reverse PCR primers.

Conservation analysis of Phatr3 gene repertoire

The evolution of the P. tricornutum genome was examined using gene homology searches. Orthologues of each gene were identified from each taxonomic sub-category, following the methodology used in the original Phaeodactylum genome annotation, by reciprocal BLAST best hit with an initial threshold e-value of 1 × 10⁻¹⁰. To minimize the effects of sequence contamination, and subgroup-specific gene transfer events, genes were only denoted as being shared with a particular group if reciprocal BLAST best hits were found in at least two separate taxonomic sub-categories within that group, following methodology established elsewhere 10,67.

[...] Table S3. The results obtained by this pipeline were compared to a subset of 324 Phatr3 genes for which single-gene tree topologies generated using the expanded reference library have previously been published 10, and found to give broadly equivalent results (see Results; Fig. S10; Table S10).
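The reciprocal best-hit rule used for the conservation analysis, together with the requirement for support from at least two sub-categories, can be sketched as follows. The data layout and function names are assumptions for illustration: `forward` holds the best hit of each P. tricornutum gene in each sub-category library, and `reverse` holds the best P. tricornutum hit of each subject sequence. Parsing of BLAST output is not shown.

```python
from collections import defaultdict

E_VALUE_THRESHOLD = 1e-10  # initial threshold stated in the text


def reciprocal_best_hits(forward: dict, reverse: dict,
                         threshold: float = E_VALUE_THRESHOLD) -> dict:
    """Return gene -> set of sub-categories with a reciprocal best hit.

    forward: (gene, sub_category) -> (subject_id, evalue)
    reverse: subject_id -> best-hit gene in P. tricornutum
    """
    rbh = defaultdict(set)
    for (gene, sub_category), (subject, evalue) in forward.items():
        if evalue <= threshold and reverse.get(subject) == gene:
            rbh[gene].add(sub_category)
    return dict(rbh)


def shared_groups(sub_categories: set, sub_category_to_group: dict,
                  min_sub_categories: int = 2) -> set:
    """Groups supported by reciprocal best hits in at least
    `min_sub_categories` distinct sub-categories."""
    per_group = defaultdict(set)
    for sub_category in sub_categories:
        per_group[sub_category_to_group[sub_category]].add(sub_category)
    return {group for group, subs in per_group.items()
            if len(subs) >= min_sub_categories}
```

Requiring two sub-categories means that a single contaminated library, or a transfer into one sub-lineage only, is not enough to mark a gene as shared with the whole group.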

Conceptual translations of the entire Phaeodactylum genome were searched against this modified library again using BLASTP, and the top ten hits for each gene were ranked. The group and sub-category for each BLAST top hit was profiled. BLAST top hits were only recorded if the top ten hits contained another sequence from a different sub-category within the same group, as defined using the taxonomic categories defined above, with a better expected value than the first hit from outside the same group as the top hit. For example, a BLAST output consisting of a first hit from a centric diatom, a second hit from a pennate diatom, and a third hit from a non-diatom group, would be considered to be genuine, whereas an output consisting of a first hit from a centric diatom, second hit from a non-diatom, and third hit from a pennate diatom would not. Genes for which no BLAST hits were obtained were annotated as producing "no match". Genes for which top hits were identified, but were not taxonomically consistent with one another, as defined above, were annotated as being "ambiguous".
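The consistency rule for BLAST top hits, including the centric/pennate example given above, can be expressed as a short function. This is a sketch under assumptions: the function name and the `(group, sub_category, evalue)` tuple layout are illustrative, and hits with tied e-values keep their input order.

```python
def classify_top_hit(hits, top_n: int = 10) -> str:
    """Return the group of the top hit if it is supported by a second
    sub-category of the same group before any outside hit; otherwise
    return "ambiguous", or "no match" when there are no hits.

    hits: list of (group, sub_category, evalue) tuples.
    """
    if not hits:
        return "no match"
    ranked = sorted(hits, key=lambda hit: hit[2])[:top_n]  # best e-value first
    top_group, top_sub_category, _ = ranked[0]
    for group, sub_category, _ in ranked[1:]:
        if group != top_group:
            return "ambiguous"   # an outside hit outranks any supporting hit
        if sub_category != top_sub_category:
            return top_group     # supported by a second sub-category
    return "ambiguous"           # no second sub-category among the top hits
```

For instance, hits ranked centric diatom, pennate diatom, then a non-diatom would be assigned to the diatom group, whereas centric diatom, non-diatom, pennate diatom would be "ambiguous".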

The BLAST top hit analysis was modified in two further ways, to allow more precise

Significance of these terms was interpreted by calculating the observed to expected ratio of their percent occurrence. The occurrence of an individual biological process within a specific functional set (genes exhibiting intron-retention/exon-skipping, etc.) was compared to that of its occurrence in the complete annotated Phatr3 biological process catalog. The degree of significance of enrichment of each biological process was quantified using a chi-squared test, with a threshold significance P value of 0.05.
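One way to compute the observed to expected ratio and a chi-squared P value for a single biological process is sketched below. The exact test layout is not specified in the text, so the one-degree-of-freedom goodness-of-fit form used here is an assumption, as are the function and parameter names. For one degree of freedom the chi-squared survival function equals erfc(sqrt(x/2)), which avoids the need for an external statistics library.

```python
import math


def enrichment(k: int, n: int, K: int, N: int):
    """Observed/expected ratio and chi-squared test for one term.

    k: genes with the term in the functional set
    n: size of the functional set
    K: genes with the term in the whole annotated catalog
    N: size of the whole annotated catalog (requires 0 < K < N)
    Returns (ratio, chi2, p_value).
    """
    expected = n * K / N  # genes expected in the set by chance
    ratio = k / expected
    chi2 = ((k - expected) ** 2 / expected
            + ((n - k) - (n - expected)) ** 2 / (n - expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-squared survival, 1 d.o.f.
    return ratio, chi2, p_value
```

A term would then be reported as enriched when the ratio exceeds 1 and the P value falls below the 0.05 threshold.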

To gain insights into the role of alternative splicing in regulating the molecular physiology of the [...]

[...] genes that were assigned (i.e., two or more top hits from two or more sub-categories from a particular lineage, prior to the first top hit from outside that lineage) using the most reduced reference dataset (i.e., all reference sequences, excluding SAR clade members, and other algal lineages with secondary or tertiary plastids) is shown. For prokaryotic genes, two other distributions (obtained for the entire dataset minus non-ochrophyte algae with secondary or tertiary plastids, and the entire dataset minus pennate diatoms, and all non-ochrophyte algae with secondary or tertiary plastids) are shown. Each chart additionally shows the relative size of each sub-category within the reference sequence library, demonstrating that certain sub-categories contribute to substantially more of the top hits (e.g., the Deinococcus-Thermus clade, in the distribution of prokaryotic genes for the full and pennate diatom-free datasets that were modified to remove all non-ochrophyte lineages with secondary or tertiary plastids) or fewer of the top hits (e.g., the streptophytes, in the distribution of green genes for the dataset from which all SAR clade sequences, and other non-ochrophyte lineages with secondary or tertiary plastids were removed) than might be expected given the corresponding dataset size.