Diploid genomic architecture of Nitzschia inconspicua, an elite biomass production diatom

A near-complete diploid nuclear genome and accompanying circular mitochondrial and chloroplast genomes have been assembled from the elite commercial diatom species Nitzschia inconspicua. The 50 Mbp haploid size of the nuclear genome is nearly double that of model diatom Phaeodactylum tricornutum, but 30% smaller than closer relative Fragilariopsis cylindrus. Diploid assembly, which was facilitated by low levels of allelic heterozygosity (2.7%), included 14 candidate chromosome pairs composed of long, syntenic contigs, covering 93% of the total assembly. Telomeric ends were capped with an unusual 12-mer, G-rich, degenerate repeat sequence. Predicted proteins were highly enriched in strain-specific marker domains associated with cell-surface adhesion, biofilm formation, and raphe system gliding motility. Expanded species-specific families of carbonic anhydrases suggest potential enhancement of carbon concentration efficiency, and duplicated glycolysis and fatty acid synthesis pathways across cytosolic and organellar compartments may enhance peak metabolic output, contributing to competitive success over other organisms in mixed cultures. The N. inconspicua genome delivers a robust new reference for future functional and transcriptomic studies to illuminate the physiology of benthic pennate diatoms and harness their unique adaptations to support commercial algae biomass and bioproduct production.


Results
Cellular description and taxonomy. Nitzschia diatom isolate GAI-293, originally collected from the tidal area of a stream near Lihue on the island of Kauai, Hawaii, USA, was isolated in axenic culture by Global Algae Innovations. This strain is euryhaline and can be cultivated in sea water and low salinity brackish water, matching the ecological description of N. inconspicua 13 . Isolate GAI-293 is capable of growth using a CO 2 supply derived from power plant flue gas and using recycled growth media in large-scale outdoor raceways (Fig. 1A), demonstrating average productivity of 22 g/m 2 /day on an ash-free dry weight basis over a 2.5 month cultivation period. Vegetative cells have a typical raphid pennate morphology, with gliding motility and prominent lipid droplets visible under conditions of silica depletion in the media (Fig. 1B). Auxospores are formed paedogamously following a fusion of gametes produced within the same gametangium 14 .
Molecular taxonomic classification was confirmed using a concatenated 4-gene multilocus tree, constructed from 18S and 23S nuclear rRNA genes combined with chloroplast genes psbC (photosystem II reaction center protein C), and rbcL (ribulose bisphosphate carboxylase large chain). Isolate GAI-293 formed a well-supported, independent branch nested within the N. inconspicua clade of the Bacillariaceae family (Fig. 2, Supplementary  Figure S1) 17 . To honor the memory of diatom biologist Dr. Mark Hildebrand (1958Hildebrand ( -2018, who worked on lipid formation in this diatom and was a strong supporter of diatoms for biofuels and bioproducts, this organism has been given the name Nitzschia inconspicua strain hildebrandi. Genome sequencing and assembly. DNA was extracted from actively growing, diploid vegetative cells and used to generate 646,344 PacBio reads, ranging in size from 1,000 to 117,697 nt, with an average length of 31,140 nt. Long reads were assembled using the diploid-aware program CANU, producing 123 ungapped, nuclear contigs plus individual circular contigs for mitochondrial and chloroplast genomes (Table 1). Mitochondrial and chloroplast contigs were recognizable by their divergent nucleotide compositions of 29.3% and 33.4% G + C, respectively, versus an average of 45.1% ± 0.9% for nuclear sequences.  Concatenated multilocus gene tree of Nitzschia inconspicua strains. Alignments were constructed from 18S (SSU) and 23S (LSU) nuclear rRNA genes combined with chloroplast genes psbC (photosystem II reaction center protein C), and rbcL (ribulose bisphosphate carboxylase large chain). Supplementary Figure S1 shows a more complete version of this tree including the entire Bacillariaceae family. Accession numbers for concatentated genes are provided in Supplementary Table S5. www.nature.com/scientificreports/ Overall quality of the N. inconspicua assembly is higher than most currently available diatom genomes, based on sizes and numbers of contigs, completeness of conserved BUSCO marker genes, and percentage of non-fragmented protein predictions, defined as coding sequences that begin and end with canonical start and stop codons (Table 1). Six assembled contigs, ranging in size from 2.2-4.1 Mbp, were capped with telomeres of 355-414 nt at both 5' and 3' ends, suggesting complete chromosomes. Exceptional assembly quality was made possible by high-coverage, long-read PacBio read technology and the ability of the CANU assembler to resolve allelic ambiguities [18][19][20] .
The 99.7 Mbp size of the nuclear assembly and its 38,601 predicted proteins were larger than expected based on previously published numbers for other diatom species. However, diatom genome size statistics are conventionally reported as haploid values, even when obtained from diploid organisms, making an estimated haploid size of 49.9 Mbp and 17,988 predicted proteins for N. inconspicua unremarkable. Assembly diploidy was first demonstrated by MUMmer dot plots of paired contigs (Fig. 3, Supplementary Figures S2A-S2M). 42 contigs were resolved into 14 highly syntenic paired alignments, at approximately 94% average nucleotide identity per pair (including introns and non-coding, as well as coding sequences) and nearly equal assembly coverage depths between partner contigs (Supplementary Table S1). Paired contig alignments account for 93% (92,897,432 nt) of the nuclear assembly. At least eleven of the matched contig sets have few insertions, deletions, or re-arrangements, but it is not clear whether incompleteness of the remaining alignments, covering multiple contigs, are due to biologically relevant sequence differences, assembly errors, or some combination of these factors.
Additional evidence supporting diploid completeness of the assembly was obtained using predicted protein sequences based on BRAKER2 21 gene models. Not only were 100% of the conserved BUSCO markers from Stramenopiles present in predicted N. inconspicua gene models, but 92% of these markers were also present in a second, unfragmented copy, consistent with the 92.8% genome alignment coverage observed in MUMmer matched contig pairs (Supplementary Table S2). Duplicated BUSCO gene markers were not observed in any other previously published diatom genome at greater than 1%, except for putatively alloploid Fistulifera solaris 6 .
Diploid assembly and gene model correctness were further validated by comparing coding sequences for all 38,601 predicted proteins with 452,784 transcriptome sequences that were generated from growth under a variety of environmental conditions and de novo assembled (Supplementary Table S3). 98.8% of the coding sequences matched assembled transcriptome sequences at an average of 98.5% ± 2.5% nt identity (median 99.4%) (Supplementary Figure S3). These results are consistent with expression of both diploid alleles, supporting previous reports suggesting the absence of genome-wide silencing mechanisms in diatoms 2,22,23 .
To determine the level of genomic variability between diploid alleles, predicted coding sequences and corresponding translations were compared to each other to identify matched pairs (Supplementary Figure S4). Average nucleotide sequence identity for paired coding sequences was 97.3 ± 2.9%. 98% of the sequences were more than 90% identical to their partners, and 32% were 100% identical. These results are consistent with an independent estimate of 3.5% overall genome heterozygosity, obtained from unassembled raw reads using relative k-mer abundance (Supplementary Figure S5).  www.nature.com/scientificreports/ Non-protein coding genes. The N. inconspicua genome contained multiple copies of large and small subunit rRNAs and tRNAs for all 20 standard amino acids plus tRNA-SeC, potentially enabling the incorporation of selenocysteine. Paired allelic copies of other well-known RNA genes were also present, including a THI element TPP riboswitch (thiamine biosynthesis regulator), a 5' ureB small RNA (urease regulator), U1 and U6 spliceosomal RNAs, and small nucleolar RNAs of types SNORD24 and SNORA11, potentially acting as guides for the methylation of RNA targets and the conversion of uridine to pseudouridine, respectively. Non-protein coding regions comprised 41% of the N. inconspicua nuclear genome, including 7% identified as low-complexity elements by RepeatModeler 24 . These relative proportions were similar to levels previously reported in finished diatom chromosomes from P. tricornutum, and T. pseudonana, but much lower than draft genomes for P. multistriata, and F. cylindrus (Table 1, percent coding sequences). N. inconspicua transposable elements were widely distributed across 44 different contigs (Fig. 4). The most abundant elements were Class II (cut and paste) sequences belonging to the DDE Insertion and Eukaryotic Terminal Interspersed Repeat Mutator (MULE) subtypes, followed by Class I (copy and paste) elements of the LTR-Gypsy subtype. Class II elements are rare in previously sequenced diatom genomes, and expansion of the MULE subtype is unique to N. inconspicua. Transposable elements were most abundant in contigs tig00044028, tig00000004, tig00000012 and tig00000120, potentially contributing to the lower alignment synteny shown in Fig. 3 for contig pairs 13 and 14.
The most frequently repeated sequence in the N. inconspicua nuclear genome (630 exact matches) was the 12-mer TTA GGG TTG GGG , a G-rich extension of the conserved eukaryotic telomere motif TTA GGG . The second half of this sequence (TTG GGG ), originally described in the macronucleus of ciliate Tetrahymena thermophila 25,26 , is quite different from the consistent 6-mer telomer pattern used by T. pseudonana and P. tricornutum (TTA GGG ), and the more T-rich sequences found in Chlamydomonas reinhardtii (TTT TAG GG) and most land plants (TTT AGG G) [27][28][29][30] . Tandem repeats containing multiple copies of N. inconspicua telomer motifs were found exclusively at the ends of contigs, including thirteen of the fourteen matched pair sets shown in Fig. 3.
N. inconspicua telomeres differ from those of T. pseudonana and P. tricornutum, whose 6-mer repeats are all identical, in containing additional degenerate, truncated subsequences of varying lengths interspersed between full-length 12-mers (Supplementary Figure S6). Although degenerate N. inconspicua telomer repeat patterns were supported by PacBio read assembly depths of 12 to 31-fold in multiple contigs (Supplementary Table S4), excluding all potential artifacts that might have been introduced during sequencing and assembly would require verification using an alternative technology (e.g. Illumina).
Examples of telomer pattern degeneracy are well-known in Paramecium 31 and Saccharomyces 32 , but have not previously been reported in diatoms. Telomer sequence inconsistencies are believed to result from site specific nucleotide mis-incorporation in the telomerase enzyme of Paramecium, and slippage with premature termination of reverse transcriptase activity in Saccharomyces 33,34 . The variable lengths of partial repeats in N. inconspicua telomeres suggest greater consistency with the yeast degeneracy model. Amino acid sequences of the two N. inconspicua telomerase alleles are 95% identical to each other but only 37% identical to their closest diatom match (P. multistriata), consistent with potential differences from previously sequenced diatoms in enzyme processivity.
Mitochondrial and chloroplast genomes. The circular N. inconspicua mitochondrial genome is larger than many other pennate diatoms due to a 19,055 nt intragenic spacer (Fig. 5A), consistent with the observation of intragenic spacers of widely varying sizes in mitochondrial genomes from other diatoms 35 . In addition to small and large mitochondrial rRNA subunits (rrnS, rrnL), the mitochondrial genome encodes ATP synthase  , a tatC transporter, and 24 tRNA genes. COX1 and ribosomal subunit genes flanking the large intergenic spacer are interrupted by multiple group II introns matching RFAM patterns RF02012 and RF00029, but no sequences matching these motifs were detected within the spacer region itself. Group II introns have previously been reported to occur sporadically within COX1 and ribosomal rRNA genes in the mitochondrial sequences of other diatoms [35][36][37][38] . The large intergenic spacer region contained many low complexity repeats, but no coding sequences, RFAM database RNA gene matches, or blast matches to sequences in the Genbank nt database were detected. The N. inconspicua chloroplast genome architecture is typical of that found in other diatoms, as well as many plants and algae, with two inverted repeats separating long single copy (LSC) and short single copy (SSC) regions (Fig. 5B). The inverted repeat sections are each 13,169 nt long and contain three ribosomal RNA genes and two tRNAs. Chloroplast genes encoded in non-duplicated regions include those associated with photosystem I, photosystem II, cytochrome b/f complex, ATP synthase, RuBisCO, thiamin synthesis, tRNAs, RNA polymerase, large and small subunit RNA proteins, TatC and SecA/SecY type transporters, molecular chaperones, antioxidants, and quality control protease FtsH. Five conserved open reading frames of unknown function (ycf39, ycf45, ycf66, ycf89, ycf90) were shared with the F. cylindrus chloroplast genome at 56-85% amino acid sequence identity. No group II introns were found in the N. inconspicua chloroplast genome, despite their abundance in the mitochondrial genome and a recent report describing their discovery in the chloroplast of pennate diatom Toxarium undulatum 39 . However, the N. inconspicua chloroplast genome does contain a Tn3 transposon recombinase related to those typically found in bacteria, but also present in the Semanivis robusta chloroplast sequence 40 at 76% amino acid identity.
Like other diatoms, many sequences from the N. inconspicua nuclear genome contain targeting signals for intracellular localization to mitochondrial and chloroplast compartments. 1600 mitochondrial target candidates were predicted by MitoFates 41 , and 2172 chloroplast target candidates were predicted by ASAfind 42 . Some of these genes may have been transferred from ancestral organelles and their symbiont predecessors to the nucleus, a process that is believed to be ongoing even in current times 43,44 . However, no evidence of recent intracellular transfer was detected in N. inconspicua genes predicted to contain organelle-targeting sequences; the closest GenBank sequence matches for these genes were all to nuclear genomes from other diatoms.

Shared, unique and expanded protein families. Predicted protein sequences from N. inconspicua
were clustered together with those of four other diatom genomes (F. cylindrus, P. multistriata, P. tricornutum, and T. pseudonana) to identify shared versus unique families (Fig. 6). 3,797 protein families were shared among all five species, a number similar to the 3,742 families previously reported as shared between three Pseudo-nitzschia transcriptomes and the genomes of P. tricornutum, and T. pseudonana 45 .
Shared families encoded complete enzyme sets for many well-characterized, highly conserved metabolic pathways, including glycolysis, gluconeogenesis, citric acid cycle, C-4 metabolism, oxidative and reductive pentose phosphate pathways, urea cycle, mevalonate and non-mevalonate pathways, and assimilatory sulfate reduction. Complete, paralogous sets of glycolysis and fatty acid pathway enzymes were predicted to be localized in both cytosolic and chloroplast cellular compartments of N. inconspicua, along with mitochondrial duplication of the  Functions for 46% of the shared protein families and 67% of those unique to N. inconspicua were annotated as hypothetical, while many others were described only by general similarity to PFAM domain patterns. While some cases of apparent divergence within families of indeterminate function might be due to sequencing or assembly errors, transcriptome sequence matches and the existence of closely related allelic copies on syntenic contigs for > 99% of the N. inconspicua-unique loci provides strong evidence supporting their validity.
A large number of N. inconspicua protein sequences contained domains likely to be involved in the processes of cell-surface adhesion, biofilm formation, and raphe system gliding motility. These domains were identified within the 17,968 haploid gene subset of N. inconspicua by protein descriptions containing the keywords fibronectin (n = 171), ankyrin (n = 113), fasciclin (n = 22), laminin (n = 15), outer membrane adhesin (n = 11), lectin (n = 9), von Willebrand factor (n = 3), and villin headpiece domain (n = 2). Several OrthoVenn protein family clusters annotated as fibronectins, outer membrane adhesins, and fasciclins were unique to N. inconspicua, without detectable orthologs in the other 4 diatom genomes (Supplementary Data Set 1). Additional conserved diatom loci annotated as capsular and exopolysaccharide biosynthesis proteins, exostosins, and sulfotransferases may participate in the synthesis of sulfated polymer exudates supporting raphid motility. However, no predicted proteins from N. inconspicua included the adhesion-specific GDPH pattern recently described in Semanivis robusta and the transcriptomes of other benthic diatoms 8 .
N. inconspicua tolerates ambient chemistries that are often high in pH and low in dissolved CO 2 , conditions known to limit productivity of aquatic photosynthetic organisms. To understand the molecular basis of this tolerance, we investigated the identities and abundance of several putative components of the biophysical carbon concentrating mechanism (CCM), including bicarbonate transporters and carbonic anhydrases (CAs).
Phylogenetic analysis of bicarbonate transporter sequences from N. inconspicua (Supplementary Figure S7) reveals two loci belonging to a metazoan-type clade orthologous to plasma membrane-localized and low CO 2 -sensitive bicarbonate transporters from P. tricornutum, as well as a single transporter ortholog predicted to be localized to the P. tricornutum chloroplast. Additional bicarbonate transporter family proteins detected using HMM pattern PF00955 belong to a more distant clade, along with other diatom sequences annotated as boron transporters (TC 2.A.31.3, 49 ), leading to uncertainty about a potential role in the CCM. Excluding these, N. inconspicua has slightly fewer (n = 3 per genome) bicarbonate transporters overall than diatoms on average (n = 4 per genome) suggesting there has been little need to expand bicarbonate acquisition capabilities through genetic diversification in this species.
N. inconspicua encodes 21 CA genes from the α, γ, δ, θ, and LCIP63 subclasses but does not encode genes from the β, or ζ types (Supplementary Dataset 2, 50 ). LCIP63s and αCAs are particularly enriched relative to other diatoms. N. inconspicua has six putative LCIP63 candidate loci, three of which are chloroplast-localized, as compared to a single copy each in P. tricornutum and T. pseudonana 50 , and more than twice the number of αCA loci (n = 11) found in P. tricornutum or T. pseudonana 51 . Phylogenetic analysis shows this is largely the result of major independent expansions of αCAs belonging to two distinct clades (Fig. 7A). The first expansion (clade EC1) likely occurred prior to the divergence of N. inconspicua, P. multiseries, and F. cylindrus, giving rise to five putatively periplastid space/ER lumen/secretion-targeted genes, while the second expansion, specific to N.   52,53 has been shown experimentally to anchor TpδCA1 in the plasmalemma, supporting the idea that N. inconspicua clade EC2 αCA proteins may be similarly targeted to the cell surface 53 . Predicted identities and subcellular localizations of N. inconspicua CCM components were used to construct a cellular model for comparison with existing models of T. pseudonana and P. tricornutum (Fig. 7B, 51,54 ). The most notable differences between N. inconspicua and other diatoms are i) extra copies of putative chloroplast-localized www.nature.com/scientificreports/ LCIP63 genes, and ii) independent gene family expansions in N. inconspicua, giving rise to extra αCA copies that either decorate the cell surface, surround the chloroplast, or are possibly secreted extracellularly. Operating as a CCM, these extra CAs should capture and concentrate CO 2 towards the pyrenoid, the site of RuBisCO and carbon fixation, in order to sustain high productivity at low concentrations of dissolved CO 2 55 . Future experimental work is needed to conclusively determine the localization and roles of these CCM components. Nonetheless, these expansions provide good genomic evidence for the presence of a unique and powerful CCM and intracellular pH buffering system relative to what is observed in other diatoms, which is likely to be one of the features underlying tolerance and high productivity over a spectrum of salinities and dissolved CO 2 levels.
Based on reports of antimicrobial activity in some Nitzschia strains [56][57][58] , bioinformatic searches were performed to identify potential biosynthetic gene clusters that might encode secondary metabolite pathways. AntiSMASH 59 detected a number of individual genes similar to those commonly associated with biosynthetic gene clusters, including Type III polyketide synthases, non-ribosomal peptide synthetase-like proteins, cytochrome P450s, and isoprenoid synthesis enzymes. However, none of these protein families were identified as unique to N. inconspicua, and none of these genes were recognized as belonging to multi-gene clusters similar to those of known secondary metabolites. A separate blast search for genes involved in the production of domoic acid in many strains of Pseudo-nitzschia 60 produced no matches in N. inconspicua.

Discussion
This work describes a new reference-quality, diploid nuclear genome for Nitzschia inconspicua. The haploid size of this genome is nearly double that of well-characterized model diatom P. tricornutum, but 15-60% smaller than closer pennate relatives P. multistriata and F. cylindrus. An estimate of 14 chromosome pairs, supported by syntenic alignments of contig pairs with telomer-capped ends, will need to be verified by physical measurements. However, this number is similar to cytological observations of 15-17 chromosomes for other Nitzschia species, within the much wider range (8-130) previously reported for more distantly related pennate taxa 61,62 .
The expansion of Class II versus Class I transposable elements in N. inconspicua suggests lineage-specific activities promoting enhanced genomic plasticity. This plasticity may be expanded in candidate chromosome pairs 13 and 14, where these elements are particularly abundant. Group II introns like those found in the especially large mitochondrial genome could serve a similar plasticity-enhancing function in this organelle, consistent with widely varying mitochondrial genome sizes reported among other sequenced diatoms 7,36 . The N. inconspicua chloroplast genome conforms to the more uniform size characteristic of other diatoms, including a well-conserved overall structure similar to that found in many plants.
Some types of information provided by the N. inconspicua diploid genome could not have been obtained from more fragmented, less complete assemblies. Paired coding sequences covering 93% of the genome enabled a direct calculation of allelic heterozygosity at 2.7% (97.3% nucleotide identity), contrasting sharply with the estimate of 25% heterozygosity reported for F. cylindrus, although the latter value was calculated from only 5,400 allelic pairs, representing less than 30% of predicted coding sequences 10 . Differences between relatively complete and highly fragmented assemblies may also affect coding sequence percentages, which were calculated at 59% for N. inconspicua, 54% for P. tricornutum, and 50% for T. pseudonana, versus 34% for P. multistriata and 27% for F. cylindrus. It is not clear if these disparities represent true allelic variability or potential assembly artifacts, but the latter explanation is favored by the lower percentages of complete coding sequences (bounded by canonical start and stop codons) found in diatom genomes with larger numbers of short contigs and lower N50 values (Table 1).
Telomer sequences have long been the subject of intense interest with respect to vertebrate cellular aging, but have not yet been systematically studied in diatoms, and only recently investigated in algae 28 . The unusually G-rich, degenerate pattern of telomer repeats found in N. inconspicua was surprising in light of the unremarkable, uniform telomeres found in model diatoms P. tricornutum and T. pseudonana. The altered nucleotide composition of the N. inconspicua repeat pattern should increase secondary structure stability in telomer caps protecting single-stranded chromosome ends, especially at elevated temperatures or under high pH conditions. It will be interesting to learn whether similar degenerate, G-rich motifs occur in other diatoms, and whether they might serve an adaptive function.
N. inconspicua genome sequencing has revealed a number of features that may contribute to successful exploitation of benthic habitats and suitability for commercial biofuel production. Like other diatoms, distributed duplication of glycolysis and fatty acid synthesis among different cellular compartments could provide an energy boost enhancing competitive success against other organisms in mixed cultures. N. inconspicua's extensive repertoire of adhesive domain proteins may facilitate attachment to suspended particles during sediment mixing, increasing access to surface light and enhancing photosynthetic efficiency in the shallow, turbid environments where it thrives. Expansion and diversification of carbonic anhydrase paralogs may enable fine-tuning of carbon concentration activities under variable environmental conditions. Resistance to predation and bacterial pathogenesis may be achieved through currently uncharacterized secondary metabolite compounds, produced using genes encoding type III polyketide synthases, ribosomally-produced post-translationally modified proteins, cytochrome P450s, and isoprenoid synthesis enzymes.
The nearly complete set of diploid information delivered by the N. inconspicua genome provides a robust new set of reference material for future functional and transcriptomic studies illuminating the physiology of benthic pennate diatoms and harnessing their unique adaptions to support commercial algae biomass and bioproduct production.

Materials and methods
Nucleic acid isolation, library construction, and sequencing and assembly. An axenic N. inconspicua culture was obtained from a single colony on a streaked agar plate, then scaled up to a 500 mL liquid culture in an artificial brackish growth medium similar to that previously described 63,64 . After 36 h of growth under continuous light, cells were harvested at late exponential phase via centrifugation. High molecular weight DNA was extracted using CTAB-chloroform:isoamyl alcohol 65 . A large-insert (10-20 kb) genomic library for PacBio sequencing was constructed with the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences) and sequenced on a single SMRT Cell on the PacBio Sequel platform at the University of California Davis Genome Center DNA Technologies Core. PacBio sequencing produced 646,344 reads that were cleaned, trimmed, and assembled with CANU v.1.8 18 , using default program parameters except for genomeSize = 175m. This assembly produced contigs ranging in size from 3,477-6,574,884 nucleotides, with an N50 of 3.7 MB. Read mapping to assembled contigs, coverage statistics, and predicted circularity were obtained from CANU program output files. Contigs supported by only a single PacBio read (1X coverage) were discarded.
Nuclear contigs were distinguished from those derived from mitochondria and chloroplasts on the basis of higher G + C nucleotide composition, lower coverage depths, and predicted linearity versus circularity. Length weighted average G + C values for nuclear, chloroplast, and mitochondrial contigs were 45.1%, and 33.4%, and 29.3%, respectively, while coverage depths were 59X, 364X, and 222X. Contig assignments to nuclear, chloroplast, or mitochondrial bins were confirmed by BLASTX searches versus Genbank reference sequences for chloroplast and mitochondrial genomes. Randomly selected, mapped reads from circular organelle contigs were also used for re-assembly at a range of lower coverage depths (12-90X) with more stringent CANU program parameters (correctedErrorRate = 0.01, minOverlapLength = 1000, min_read_length = 20,000).
Transcriptome data to support gene model predictions was obtained using Illumina PE-150 reads from cultures obtained from six N. inconspicua cultures grown under conditions of varying light and nutrient availability (Supplementary Table S3). The cells were collected by centrifugation (2,000 × g for 4 min, 4 °C), flash frozen in liquid N 2 and stored at -80 °C prior to RNA isolation. Total RNA was collected from each sample using TRIzol followed by a Zymo RNA Clean and concentrate kit. Quality was confirmed with an Agilent Bioanalyzer and RNA was treated with DNase to remove contaminating DNA. RNA sequencing was carried out by GENEWIZ (South Plainfield, NJ) using the Illumina HiSeq platform. Raw Illumina reads were cleaned using Trimmomatic version 0.36 66  Gene models and functional annotation. Organelle genomes were annotated using GeSeq 68 , supported by reference sequences from P. multiseries (NC_027265.1), Halamphora coffeaeformis (NC_037727.1), P. tricornutum (NC_016739.1), and F. cylindrus (NC_045244.1). Nuclear genome models were obtained using BRAKER2 21 , with the F. cylindrus genome as a seed training model and the Trinity-assembled N. inconspicua transcriptome contigs as reference sequences for two subsequent iterations. The number of gene model coding sequences beginning and ending with canonical start (ATG) and termination (TAG, TAA, TGA) codons was tallied using a custom perl script (count_cds_partials.pl, code provided in Supplementary Materials). BLASTN searches were used to measure agreement between predicted coding regions and the original assembled transcriptome sequences used as BRAKER2 input.
Nuclear genome completeness was assessed using BUSCO v.4.06 69,70 under the taxonomic setting "stramenopiles". Functional gene descriptions were assigned to predicted nuclear proteins based on BLAST matches to reference database sequences at an e-value cutoff of 1e-3 and HMM pattern matches above model-specified gathering cutoffs. Selection of product descriptive names was prioritized in the following order: blast matches to COGs, 2014 update 71 , followed by HMM pattern matches to TIGRFAM v. 15.0 72 and PFAM v. 32 73 . Candidate transporter proteins were identified based on BLAST searches against the Transporter Classification Database 49 and PFAM v.32 models bearing the following descriptive terms: transport, permease, efflux, export, antiporter, channel, and porin. KEGG metabolic pathway associations were assigned using KoFAM Koala 74 84 , together with transposase-associated patterns from PFAM v. 32 73 . Total percentages of genomic nucleic acid residues present as low-complexity repeats were determined using RepeatModeler2 24 . Tandemly repeated telomer sequences were identified using BioSerf 85 .
Taxonomic classification. Light micrographs for axenic cultures were taken under bright field DIC and under fluorescence using a Zeiss Imager.A2 (63 × oil objective). Frustules were visualized by acid washing according to 89 . Specimens were prepared for Scanning Electron Microscopy (SEM) as described in 90  www.nature.com/scientificreports/