Draft genome of the brown alga, Nemacystus decipiens, Onna-1 strain: Fusion of genes involved in the sulfated fucan biosynthesis pathway

The brown alga, Nemacystus decipiens (“ito-mozuku” in Japanese), is one of the major edible seaweeds, cultivated principally in Okinawa, Japan. N. decipiens is also a significant source of fucoidan, which has various physiological activities. To facilitate brown algal studies, we decoded the ~154 Mbp draft genome of N. decipiens Onna-1 strain. The genome is estimated to contain 15,156 protein-coding genes, ~78% of which are substantiated by corresponding mRNAs. Mitochondrial gene analysis showed a close relationship between N. decipiens and Cladosiphon okamuranus. Comparisons with the C. okamuranus and Ectocarpus siliculosus genomes identified a set of N. decipiens-specific genes. Gene ontology annotation showed that more than half of these are classified as molecular function, enzymatic activity, and/or biological process. Extracellular matrix analysis revealed domains shared among the three brown algae. Characterization of genes that encode enzymes involved in the biosynthetic pathway for sulfated fucan showed two sets of genes fused in the genome. One is a fusion of the l-fucokinase and GDP-fucose pyrophosphorylase genes, a feature shared with C. okamuranus. The other is a fusion between an ST-domain-containing gene and an alpha/beta hydrolase gene. Although the function of the fused genes should be examined in the future, these results suggest that N. decipiens is another promising source of fucoidan.

The genome size of N. decipiens was estimated by counting K-mer frequencies of raw reads (K = 32). In Supplementary Fig. S2A, the peak appeared at ~95, and the calculated genome size was ~190 Mbp. The total of 80.1 Gbp of reads therefore corresponds to approximately 420-fold sequencing coverage of the estimated genome.
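The coverage figure above follows from simple division, which can be sketched as below. This is only an illustrative sketch: the function names are hypothetical, and the k-mer formula shown is the textbook relation (total k-mers divided by the depth at the main peak), not necessarily the exact model applied by the software used in this study. The k-mer numbers in the usage comment are invented for illustration.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Sequencing depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def genome_size_from_kmers(total_kmers: float, peak_depth: float) -> float:
    """Textbook k-mer estimate: genome size ~= total k-mers / depth at the main peak.

    Assumption: a single dominant peak and negligible error k-mers.
    """
    return total_kmers / peak_depth


# Values taken from the text.
total_bases = 80.1e9     # 80.1 Gbp of raw reads
estimated_size = 190e6   # ~190 Mbp k-mer-based estimate

coverage = fold_coverage(total_bases, estimated_size)
print(f"Estimated coverage: about {coverage:.0f}-fold")

# Hypothetical numbers, only to show how the k-mer relation is used:
# 1.9e10 counted k-mers with a main peak at depth 100 gives 1.9e8 bp.
example_size = genome_size_from_kmers(1.9e10, 100)
```

The result is slightly above 420, in line with the "approximately 420-fold" stated in the text.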
Illumina paired-end reads were assembled de novo using Platanus. The assembled genome contained 411,597 contigs with an N50 size of 6,265 bp (Table 1). The longest contig was 135,338 bp, and approximately 47% of sequences were covered with contigs over 2 kb in length. Subsequent scaffolding of the 411,597 Platanus contigs was performed with SSPACE, using Illumina mate-pair sequence information (Supplementary Table S1). Gaps inside the scaffolds were closed with GapCloser. Contaminating bacterial and other microbial scaffolds identified using Maxbin and RNAmmer were deleted. The final assembly of the N. decipiens genome comprised 685 scaffolds with an N50 size of 1.863 Mbp. The total length of the scaffolds reached 154 Mbp (Table 1).
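For readers unfamiliar with the N50 statistic quoted above, a minimal sketch of its standard definition follows. The function name and the toy lengths are hypothetical and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a set of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one sequence length")


# Toy example (hypothetical lengths in bp): total = 290, half = 145.
toy_lengths = [100, 80, 50, 30, 20, 10]
toy_n50 = n50(toy_lengths)  # 100 + 80 = 180 >= 145, so the N50 is 80
```

A large rise in N50 after scaffolding, as reported here (6,265 bp for contigs versus 1.863 Mbp for scaffolds), reflects contigs being joined into much longer sequences.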
CEGMA analysis indicated 93.6% sequences for partial yields and 84.3% sequences for complete yields (Table 1). For comparison, CEGMA partial and complete values for genome sequences of C. okamuranus and E. siliculosus are 88.3% and 87.5%, and 83.1% and 72.6% (Table 1), respectively. This suggests that the assembled genome of N. decipiens has the highest quality of the three brown algal genomes.
GC content. The GC content of the N. decipiens genome was calculated as ~56% (Supplementary Fig. S2B; Table 1), versus 54% for both C. okamuranus and E. siliculosus (Table 1).
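GC content as reported above is a simple base-composition ratio. The sketch below shows one common way to compute it; the function name is hypothetical, and ignoring ambiguous bases such as N is an assumption, since the text does not state how gaps were treated.

```python
def gc_content(sequences):
    """Fraction of G and C among unambiguous bases (A, C, G, T).

    Assumption: ambiguous bases such as N (scaffold gaps) are excluded
    from both numerator and denominator.
    """
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return gc / total


# Toy scaffolds (hypothetical): 4 GC bases out of 8 unambiguous bases.
toy_gc = gc_content(["GGCC", "ATAT", "NNNN"])
```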
RNA-seq, assembling, and mapping. Transcriptomic data are essential to analyze composition and expression of genes. RNA extracted from protonemas (Supplementary Fig. S1B) was sequenced using the HiSeq 4000 platform (average library size was 260 nucleotides (nts), and read length 151 nts) (Supplementary Table S1). A total of 28.5 giga nts were generated. Transcripts assembled with Velvet/Oases yielded 204,065 contigs (a total of 345 mega nts) with an N50 size of 3,313 nts. Of these, 152,212 (74.6%) assembled transcripts were aligned to the assembled genome with BLAT (default settings). These data were used to produce gene models and annotations.
The C. okamuranus and E. siliculosus genomes are intron-rich 12,14; average numbers of introns per gene are 9.14 and 6.96, and average intron lengths are 530 bp and 740 bp, respectively (Table 1). This feature was more prominent in the N. decipiens genome. The average number of introns per gene was 10.24, and the average length of an intron was 588 bp (Table 1).
DNA transposons and retrotransposons accounted for 0.2098% and 2.0143% of the N. decipiens genome, respectively (Supplementary Table S2). DNA transposons included EnSpm (0.0440% of assembled sequences), Helitron (0.0186%), hAT (0.0157%), and Polinton (0.0110%). Retrotransposons included LTR (long terminal repeat) retrotransposons such as Gypsy (0.8189%), Copia (0.4700%), and Bel_Pao (0.0681%), and the non-LTR retrotransposon CR1 (0.0016%). Percentages for LINEs (long interspersed nuclear elements) are 0.0733% for Jockey, 0.0458% for Tx1, and 0.0072% for L1, and that for SINEs (short interspersed nuclear elements) is 0.0024%. Repetitive sequences, including unclassified repeats, comprised 8.8% of the N. decipiens genome (Supplementary Table S2). This is lower than in the two other brown algae: 11.2% for C. okamuranus and 22.7% for E. siliculosus (Table 1). An interesting question for future studies is how the variation in quality and quantity of repetitive sequences affects the composition of brown algal genomes.

Genome browser.
A genome browser has been established at: http://marinegenomics.oist.jp/ito_mozuku_v1/viewer/info?project_id=68. Gene annotations from domain searches and Blast2GO 18 are provided on the site.
Phylogenetic position of Nemacystus decipiens. Based on morphological and molecular criteria, N. decipiens was classified as belonging to the family Spermatochnaceae of the order Chordariales 4,5. On the other hand, C. okamuranus has been classified into the family Chordariaceae of the same order. Another brown alga, E. siliculosus, belongs to the order Ectocarpales. To examine the phylogenetic relationship of the three algae, we carried out molecular phylogenetic analysis based on a comparison of nucleotide sequences of 32 protein-coding genes in the mitochondrial genomes of 38 brown algae. As shown in Fig. 2 and Supplementary Fig. S3, N. decipiens and C. okamuranus form a clade corresponding to the order Chordariales, while Scytosiphon lomentaria and three other species form a clade corresponding to the order Scytosiphonales, and E. siliculosus belongs to an independent clade of the order Ectocarpales. This indicates that N. decipiens and C. okamuranus share a more recent common ancestor with each other than either does with E. siliculosus.
Transcription factor genes. We searched for genes that encode transcription factors (TFs) in the N. decipiens genome using hmmer3 and the Pfam database (e-value cutoff <1e-5), and compared them with those in the C. okamuranus 14 and E. siliculosus 16 genomes (Supplementary Table S3). The domains include HSF, Myb, bZIP, Zinc Finger, bHLH, CCAAT-binding, Homeobox, AP2-EREBP, Nin-like, TAF, E2F-DP, CBF/NF-Y/archaeal, and Sigma-70 r2/r3/r4 (Supplementary Table S3). It appears that the N. decipiens genome contains 299 transcription factor genes (Supplementary Table S3), versus 257 in the C. okamuranus genome (version 2) and 274 in the E. siliculosus genome (version 2), suggesting a small expansion of the TF family in N. decipiens.
The most abundant TFs occurred in the Myb family, with 79, 74, and 70 genes detected in the N. decipiens, C. okamuranus, and E. siliculosus genomes, respectively. Others that were plentiful in the N. decipiens genome were CBF/NF-Y/archaeal (42), bZIP (36), Sigma-70 r2/r3/r4 (32), Zinc Finger C2H2-type (26), Zinc Finger CCCH-type (22), and HSF (22). The N. decipiens genome contains four genes with bHLH domains, three with homeobox domains, and ten with TAF domains.
Comparison of orthologous gene groups. The Nemacystus genome contains 15,156 gene models, which is comparable to the genomes of Cladosiphon (12,999) and Ectocarpus (17,418) 14,16. A total of 9,179 orthologous gene groups were conserved among the three algae (Fig. 3). In addition, 455 orthologous groups were shared by N. decipiens and C. okamuranus, 549 by C. okamuranus and E. siliculosus, and 623 by N. decipiens and E. siliculosus. A further 2,878, 1,093, and 5,007 groups were found to be unique to the genomes of N. decipiens, C. okamuranus, and E. siliculosus, respectively. Of the 2,878 unique groups in the N. decipiens genome, 1,526 could be GO-annotated (Supplementary Table S4). Among these, 55.8% were categorized as "molecular function," 37.5% as "biological process," and 6.3% as "cellular component." This indicates that many genes unique to N. decipiens may not be involved in cellular structure or composition, but in physiological processes such as alanine dehydrogenase and xanthine phosphoribosyl transferase activity. In fact, many of these genes encoded enzymes involved in polysaccharide biosynthetic processes (Supplementary Table S5). Furthermore, 617 of 1,352 non-GO-annotated gene groups were not found in the non-redundant protein sequence database at NCBI, and 200 of the 617 genes were annotated (Supplementary Table S6).
Extracellular matrix genes. The extracellular matrix (ECM) is composed of collagens, elastin, and proteoglycans, elements of which are polysaccharides and glycoproteins [19][20][21]. It regulates morphogenesis, cell differentiation, evolution of multicellularity, cell-to-cell communication, and responses to stimuli from the environment [19][20][21]. In order to examine brown algae-unique and Chordariales (N. decipiens and C. okamuranus)-unique ECM components, we searched for genes possibly associated with the ECM in the genomes of the three brown algae, a diatom (Thalassiosira pseudonana), an oomycete (Phytophthora infestans), a green alga (Chlamydomonas reinhardtii), and a land plant (Arabidopsis thaliana), as described in the Materials and Methods. In total, 676, 649, 901, 644, 1,116, 699, and 1,116 genes were defined as putative ECM genes in the N. decipiens, C. okamuranus, E. siliculosus, T. pseudonana, P. infestans, C. reinhardtii, and A. thaliana genomes, respectively (Supplementary Tables S7 and S8). These genes were annotated with the Pfam database and the number of annotated domains was counted. As a result, 140, 88, and 159 unique domains were found in N. decipiens, C. okamuranus, and E. siliculosus, respectively (Fig. 4). Twenty-six domains were shared among the three brown algae, and an additional 23 domains were conserved in the order Chordariales (Fig. 4). One GlcNAc gene (PF11397.6) that was also annotated as glycosyl transferase family 60 was found in each of the three genomes. On the other hand, three and two glycosyl transferase family 2 genes (PF13704.4) were found only in the N. decipiens and C. okamuranus genomes, respectively (Supplementary Table S8). Glycosyl transferase is necessary for polysaccharide biosynthesis 22. [...] recently from a common ancestor that had acquired the glycosyl transferase family 2 gene, and that the GlcNAc gene may play an important role in polysaccharide biosynthesis in the brown algae.
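The orthologous-group counts reported in the comparison above partition a three-way Venn diagram, so per-species totals can be recovered by summing the regions each species belongs to. The sketch below does this with the numbers quoted in the text. The species abbreviations and the function name are hypothetical labels, and the totals are numbers of orthologous groups, not gene models.

```python
# Orthologous-group counts per Venn region, as quoted in the text (Fig. 3).
# "Nde", "Cok", "Esi" are shorthand labels chosen for this sketch.
regions = {
    frozenset({"Nde", "Cok", "Esi"}): 9179,
    frozenset({"Nde", "Cok"}): 455,
    frozenset({"Cok", "Esi"}): 549,
    frozenset({"Nde", "Esi"}): 623,
    frozenset({"Nde"}): 2878,
    frozenset({"Cok"}): 1093,
    frozenset({"Esi"}): 5007,
}


def groups_per_species(region_counts):
    """Sum the counts of every Venn region that contains each species."""
    totals = {}
    for members, count in region_counts.items():
        for species in members:
            totals[species] = totals.get(species, 0) + count
    return totals


totals = groups_per_species(regions)

# Unique N. decipiens groups without GO annotation: 2,878 - 1,526.
non_go_annotated = 2878 - 1526
```

The last value reproduces the 1,352 non-GO-annotated groups mentioned in the text, which serves as a quick consistency check on the reported figures.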
Genes associated with fucoidan biosynthesis. Fucoidans are a family of sulfated homo- and hetero-polysaccharides of brown algae that contain l-fucose residues. The family comprises a broad spectrum of polysaccharides, from compounds with high uronic acid content and low fucose and sulfate content to almost pure α-l-fucan with fucose as the dominant monosaccharide. Genes encoding key enzymes for polysaccharide metabolism in brown algae were first predicted from the E. siliculosus genome 10. Six enzymes are involved in this pathway (Fig. 5). GDP (guanosine diphosphate)-mannose and l-fucose are the original sources of GDP-fucose, which is transformed to sulfated fucan via fucan (Fig. 5).
With a BLAST search, our previous analyses indicated that genes encoding these key enzymes are conserved between C. okamuranus and E. siliculosus, although those for downstream enzymes have likely expanded independently in each lineage (Fig. 5) 14. Specifically, the C. okamuranus and E. siliculosus genomes each contain two genes for GDP-mannose 4,6-dehydratase, and one gene for GDP-l-fucose synthase (Fig. 5). Both genomes hold one gene for l-fucokinase (FK) and one gene for GDP-fucose pyrophosphorylase. We found that the N. decipiens genome contained the same number of genes for the four enzymes (Fig. 5). The number of fucosyltransferases and sulfotransferases is variable among the three brown algae (Fig. 5). The N. decipiens, C. okamuranus, and E. siliculosus genomes contain four, five, and four genes for fucosyltransferase, and ten, nine, and six genes for sulfotransferase, respectively (Fig. 5; details are given in Supplementary Table S9).
Our previous study of the C. okamuranus genome found a possible fusion of the genes for l-fucokinase and GDP-fucose pyrophosphorylase (FK-GFPP) 14, which was not found in the E. siliculosus genome (Figs 5 and 6). The present study confirmed that the genes are also fused in the N. decipiens genome (Fig. 6). There were no stop codons in the sequence of the transcript. The protein predicted from the mRNA contained both the FK and GFPP domains (Supplementary Fig. S5). This suggests that the fused gene produces a bifunctional enzyme and that two enzyme-mediated processes are replaced by a single process. Although the function of the fused gene should be confirmed in the future, N. decipiens and C. okamuranus may have developed a more efficient means of producing sulfated fucans, compared to E. siliculosus.
The genomic region that contains the FK-GFPP genes shows synteny among the three brown algae (Fig. 6). The FK-GFPP genes are flanked by an ankyrin repeat-containing gene on the 5′ side and by an ST-domain-containing gene, the alpha/beta hydrolase gene, the RNA-binding ASCH domain gene, and the tyrosinase gene on the 3′ side. We found another possible fusion in the N. decipiens genome involving the ST-domain-containing gene and the alpha/beta hydrolase gene (Fig. 6 and Supplementary Fig. S5). Fusion seems probable because there were no stop codons in the sequence of the transcript and because RT-PCR analysis, in which two primers were designed to produce a ~2-kb single transcript, resulted in a transcript of the corresponding size (Supplementary Fig. S6). The ST-domain-containing gene is one of the 10 sulfotransferase genes. Although the function of the alpha/beta hydrolase has not been analyzed yet, this fusion may be another means of facilitating sulfated fucan biosynthesis.
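One line of evidence cited above for both fusions is the absence of stop codons within the transcript sequence. A minimal sketch of such a check is given below. It assumes the standard genetic code and a known reading frame; the function name and the toy sequences are hypothetical and are not the sequences analyzed in this study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def internal_stop_positions(cds, frame=0):
    """Return 0-based nucleotide positions of in-frame stop codons.

    The final complete codon is ignored, since a legitimate coding
    sequence is expected to end with a stop codon.
    """
    seq = cds.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [
        frame + 3 * index
        for index, codon in enumerate(codons[:-1])
        if codon in STOP_CODONS
    ]


# Toy sequences (hypothetical): an uninterrupted reading frame, and one
# interrupted by a stop codon between two coding parts.
uninterrupted = internal_stop_positions("ATGGCCGGATAA")
interrupted = internal_stop_positions("ATGTAAGCCTGA")
```

An empty result is consistent with a single continuous open reading frame spanning both domains, although, as the text notes, it does not by itself demonstrate that the fused product is functional.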

Discussion
As described above, the present decoding of a draft genome of the "ito-mozuku" alga, Nemacystus decipiens, identified 15,156 protein-coding genes, approximately 78% of which were substantiated by corresponding mRNAs. CEGMA analysis showed that the N. decipiens genome assembly is of higher quality than those of the two other brown algae. To facilitate understanding of brown algal biology, we compared features of the three genomes. First, molecular phylogeny using 32 mitochondrial genes showed that N. decipiens and C. okamuranus share a more recent common ancestor. Although taxonomic classification of these brown algae should include morphological and life cycle data, the results appear to support the order Chordariales, including N. decipiens and C. okamuranus. An intimate relationship between N. decipiens and C. okamuranus can also be deduced from their morphology.
Our present analysis of genes for components of the extracellular matrix (ECM) showed that 26 and 23 types of domain-containing genes are common in genomes of the brown algae and Chordariales, respectively. In contrast, 16 domains were shared by Stramenopiles, and the majority of domains were species-specific (Fig. 4, Supplementary Fig. S4, Supplementary Tables S7 and S8). This result was consistent with a previous report 21, suggesting independent evolution of ECM-associated genes of the brown algae. The GlcNAc gene that is also annotated as glycosyl transferase family 60 was shared among N. decipiens, C. okamuranus, and E. siliculosus, whereas the glycosyl transferase family 2 gene was unique to N. decipiens and C. okamuranus (Supplementary Table S8). These results suggest that each organism has unique ECMs, whereas the glycosyl transferase family 60 gene is one of the key genes for polysaccharide biosynthesis in brown algae, and the glycosyl transferase family 2 gene was acquired and became abundant in the Chordariales lineage.
A search for genes of enzymes involved in sulfated fucan biosynthesis identified all genes in this pathway. Our previous study demonstrated the fusion of genes for l-fucokinase (FK) and GDP-fucose pyrophosphorylase (GFPP) in the genome of C. okamuranus, but not E. siliculosus 14. This suggests that "Okinawa mozuku" may have developed a more efficient way to synthesize sulfated fucans. The present study confirmed the presence of a fused FK-GFPP gene in the N. decipiens genome as well. This fusion was supported by the corresponding mRNA. In addition, we found that the ST-domain-containing gene and the alpha/beta hydrolase gene are fused to each other in N. decipiens (Fig. 6). This fusion is evidenced by the lack of a stop codon between the sequences and by the results of RT-PCR analysis in which two primers designed to produce a ~2-kb transcript resulted in a single transcript of the corresponding size (Supplementary Fig. S6). The ST-domain-containing gene encodes a sulfotransferase. Therefore, this draft genome of Nemacystus decipiens may provide a platform for future studies of sulfated fucan biosynthesis.
Cultivation of "ito-mozuku" in the Onna Fisheries Cooperative has a long history, commencing with the isolation of the "Ito5" strain in 1993 (Supplementary Fig. S1). We decoded the genome of the "Onna-1" strain, established in 2006. The Onna Fisheries Cooperative now maintains more than ten strains with different sporophyte morphology and responses to environmental changes. Due to world-wide environmental changes, including oceanic temperature rise, acidification, and pollution, brown algal culture is now facing critical conditions 11 . Continuous efforts toward maintenance and improvement are urgent. Genomic information about the "Onna-1" strain provides a reference for characterization of other strains with different features, and may facilitate subsequent improvement of "ito-mozuku" aquaculture to resist various environmental changes.

Materials and Methods
Biological materials. The Nemacystus decipiens ("ito-mozuku" in Japanese) strains employed here were established and are maintained by the Onna Fisheries Cooperative. The first, "Ito5," was isolated from a wild population in 1993 (Supplementary Fig. S1A). The "Onna-1" strain was selected in 2006 and has been steadily maintained. This strain was used in the present study. It is cultivated at 22.5 °C with a 12-h light-dark cycle in sea water containing 0.5% KW21 (Daiichi Seimo Co. Ltd., Kumamoto, Japan). The life cycle of N. decipiens includes both haploid (n) and diploid (2n) generations (Supplementary Fig. S1B) 4. The 2n protonemas mature into sporophytes, which are harvested for market. Because the strain has been maintained as protonemas without contamination from other eukaryotes, it is easy to extract genomic DNA 14, with protonemas as the dominant material.
[...] N. decipiens were frozen in liquid nitrogen and crushed to powder with a frozen-cell crusher, Cryo-Press (Microtec Co., Ltd, Chiba, Japan). Genomic DNA was extracted from the powder using a DNA-Suisui-VS extraction kit (Rizo Co., Ltd, Ibaraki, Japan). Illumina MiSeq and HiSeq 4000 platforms were used for sequencing 23 (Supplementary Table S1). The BioProject ID was PRJDB7493.
K-mer counting and estimation of genome size were done with JELLYFISH 2.2.0 software 24,25 and GenomeScope 26. Adapter sequences were trimmed from all reads using Trimmomatic-0.30 27. High-quality paired-end reads (quality >20) were assembled de novo using Platanus 1.2.4 28 to create contigs. Subsequent scaffolding of the Platanus output was performed using SSPACE 3.0 29, based on Illumina mate-pair information. Gaps inside scaffolds were closed using GapCloser 1.12 30. Assembled sequences were aligned against one another with blastn (e-value cutoff 1e-50). Sequences that aligned over more than 50% of their length were removed as errors arising from diploid sequences. CEGMA 2.5 software 31 was used to evaluate the genome assembly. Sequences likely originating from bacteria and other microbiota were removed from the assembled genome with Maxbin version 2.2 32 and RNAmmer 1.2 33.
Gene annotation and identification. In order to identify putative N. decipiens orthologous genes, reciprocal BLAST analysis was performed. This was carried out using mutual best hits of genes of C. okamuranus, E. siliculosus, and the non-redundant protein sequence database from NCBI against N. decipiens gene models (BLASTP) or their assembly (TBLASTN). A second approach, used for encoded proteins with one or more specific protein domains, was to screen the models using HMMER (hmmer3) 42 against the Pfam database (Pfam-A.hmm, release 24.0, http://pfam.sanger.ac.uk) 43, which contains approximately 11,000 conserved domains. Encoded proteins were also analyzed using InterProScan 5.25-64.0 44 for gene ontology annotations. The mitochondrial genome was annotated with GeSeq 45.
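The mutual-best-hit criterion mentioned above can be sketched as follows. This is an illustrative sketch only: it assumes that hits have already been parsed into (query, subject, bit score) tuples, that the best hit is the one with the highest bit score, and that ties are resolved by keeping the first hit seen. The function names and toy gene identifiers are hypothetical and do not reproduce the authors' actual scripts.

```python
def best_hits(hits):
    """Map each query to its best subject, judged by the highest bit score.

    `hits` is an iterable of (query, subject, bitscore) tuples, for example
    parsed from tabular BLAST output (parsing not shown here).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy data with hypothetical gene identifiers.
nde_vs_cok = [("n1", "c1", 200), ("n1", "c2", 150), ("n2", "c2", 90), ("n3", "c1", 80)]
cok_vs_nde = [("c1", "n1", 210), ("c2", "n2", 95), ("c2", "n1", 60)]
pairs = mutual_best_hits(nde_vs_cok, cok_vs_nde)
```

In this toy case, n3 hits c1 best, but c1 hits n1 best, so n3 is not assigned an ortholog; only the pairs (n1, c1) and (n2, c2) are reciprocal.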
Mitochondrial gene collection and phylogenetic tree analysis. Sets of related sequences were subjected to phylogenetic analyses to more precisely determine orthologous relationships between N. decipiens, C. okamuranus, and E. siliculosus. Mitochondrial genome sequences of 38 brown algae were downloaded from the NCBI database or our genome browsers (Supplementary Table S10). The mitochondrial genomes were annotated using GeSeq, and cDNA sequences of the Atp6, Atp8, Atp9, Cox1, Cox3, Cob, Nad1, Nad2, Nad3, Nad4, Nad4l, Nad5, Nad6, Nad7, Nad9, Rpl2, Rpl5, Rpl14, Rpl16, Rpl31, Rps2, Rps3, Rps4, Rps7, Rps8, Rps10, Rps11, Rps12, Rps13, Rps14, Rps19, and Tatc genes from the 38 brown algae were collected. The 32 gene sequences were independently aligned using MAFFT 46 with default options. Spurious sequences or poorly aligned regions were filtered using trimAl 47, and the filtered sequences were then concatenated. Phylogenetic trees were constructed by the maximum likelihood method (GTR-gamma model) using RAxML version 8.2.11 48, with a partitioned analysis excluding third codon positions and 1,000 bootstrap replications.
Searching extracellular matrix genes. Data of N. decipiens, C. okamuranus, E. siliculosus, Thalassiosira pseudonana, Phytophthora infestans, Arabidopsis thaliana, and Chlamydomonas reinhardtii were downloaded from the websites shown in Supplementary Table S11. Downloaded protein sequences were first analyzed using SignalP 4.1 49, HECTAR 50, and TMHMM 2.0 51 to ensure that proteins contain signal sequences in their N-terminal, extra-membrane domains. Then, intracellular proteins were removed by searching for the endoplasmic reticulum [...]
Figure 6. A diagrammatic representation of a syntenic region in genomes of three brown algae, Ectocarpus siliculosus (E.si), Cladosiphon okamuranus (C.ok) and Nemacystus decipiens (N.de). This region contains seven genes that encode Ankyrin repeat-containing protein, GDP-fucose pyrophosphorylase (GFPP), l-fucokinase (FK), ST-domain-containing protein, alpha/beta hydrolase, RNA-binding ASCH domain protein, and tyrosinase. In the E. siliculosus genome, all seven genes exist independently. However, in the C. okamuranus and N. decipiens genomes, the gene for l-fucokinase (FK) and the gene for GDP-fucose pyrophosphorylase (GFPP) are fused. In addition, in the N. decipiens genome, genes for ST-domain-containing protein and alpha/beta hydrolase are fused. These fusions are supported by corresponding mRNAs (Supplementary Figs S5 and S6), although the fused mRNA for the latter needs further examination. Insertion of a gene for topoisomerase DNA-binding C4 Zinc Finger protein was discovered in the N. decipiens genome. Transcriptional direction is shown by the arrowhead. Numbers under genes indicate gene ID numbers.