Sequence analysis of nonulosonic acid biosynthetic gene clusters in Vibrionaceae and Moritella viscosa

Nonulosonic acid (NulO) biosynthesis in bacteria is directed by nab gene clusters that can lead to neuraminic, legionaminic or pseudaminic acids. Analysis of the gene content from a set mainly composed of Aliivibrio salmonicida and Moritella viscosa strains reveals the existence of several unique nab clusters, for which the NulO products were predicted. This prediction method can be used to guide tandem mass spectrometry studies in order to verify the products of previously undescribed nab clusters and identify new members of the NulOs family.

www.nature.com/scientificreports/ investigated so far, but its nab gene cluster appears similar to that of Vibrio vulnificus CMCP6, which produces the alanyl carrying Leg variant Leg5Ac7AcAla 39 . The diversity of nab clusters within Vibrionaceae has already been shown, with an emphasis on Vibrio vulnificus 28 . The distribution of NulOs in Moritellaceae has not been studied. This study investigates the presence and composition of nab gene clusters in a set of strains from A. salmonicida (Vibrionaceae) and M. viscosa (Moritellaceae), as well as A. wodanis and a few other members of the Vibrionaceae family. The strains were previously isolated from different kinds of organisms such as fish, sponge and amphipod (for a summary of the strains used in this study, see Table 1 in the Materials and Methods section). They were screened for the presence of NeuB sequences, which were then used to locate nab clusters in their genomes as well as investigate the sequence determinants of NeuB substrate specificity. Each cluster was analyzed in terms of gene content and organization, using sequence homology as a basis for functional annotation. From this, hypotheses pertaining to the nature of the NulO(s) produced by each strain could be made, and a set of candidates for experimental analysis was obtained. The NulO content of a few strains was analyzed using mass spectrometry in order to obtain preliminary experimental results.

Results and discussion
comparison of neuB sequences. Sequence alignment and phylogeny. NeuB protein sequences from A. salmonicida LFI1238, M. viscosa 06/09/139 and A. wodanis were used as queries for conducting a sequence similarity search within the set of target genomes (see Table 1). Out of the 15 targets, 10 had at least one sequence identical to one of the queries, and 3 had a similar sequence. As expected, no NeuB coding sequences were found in either P. phosphoreum or V. anguillarum. A fingerprint of the alignment of unique NeuB sequences and NeuB phylogeny are presented in Fig. 2 (the full alignment is included as Supplementary Information, in Figure S1).
Six of the eight targeted M. viscosa strains have two NeuB coding sequences identical to the ones from the 06/09/139 strain. The two remaining strains contain each one sequence, identical to either the Neu-pathway homolog (strain F57) or the Pse-pathway homolog (strain NVI-5482). This is consistent with the evolutionary relations between M. viscosa genomes according to which those two strains form their own clade 40 . It is also interesting to note that the strains with identical sequences were isolated from either salmond or rainbow trout, while the NVI-5482 and F57 were obtained from cod and lumpfish, respectively. For the Aliivibrio strains, two strains (A. salmonicida R8-68 and R8-70, isolated from the amphipod Eurythenes gryllus) have sequences Figure 1. Structure of known NulOs. The common, nine-carbon backbone of NulOs is represented with white bonds, and the carbons are numbered. The absolute configuration of each isomer is indicated in either gray or green depending on the concerned chiral centers, which are marked by disks of the corresponding color. (a) The three main NulOs families, according to their synthesis pathways, and the newly identified fusaminic acid (Fus). Neuraminic acid (Neu) is represented carrying an N-acetyl group in C5 (Neu5Ac), since it is the most commonly found species. Legionaminic (Leg), pseudaminic (Pse) and Fus carry N-linked groups in C5 and C7, which were omitted for clarity leaving only the nitrogen atom. (b) Isomers from the legionaminic acid synthesis pathway presenting different absolute configurations. As for Leg, N-linked groups carried at the C5 and C7 positions are represented by the nitrogen atom only. The orange background serves to highlight the common synthesis pathway of Leg isomers. www.nature.com/scientificreports/ identical two both homologs from LFI1238 (which was isolated from cod). The A. logei MR17-77 (from the sponge Porifera indet) and A. sp. R8-63 (from Eurythenes gryllus)) strains each have one hit that is almost identical (97% sequence identity for both) to the Leg-pathway homolog of A. salmonicida LFI1238 and A. wodanis (from salmon), respectively. While the clustering of aliivibrios does not reflect the type of host they were isolated from, it is in accordance with the phylogenetic data available for these strains 41,42 . As expected, the NeuB sequence from V. vulnificus groups with that of A. wodanis, and is within the clade clustering Leg pathway sequences. The NeuB sequence from Vibrio sp. B9-25K2 is most similar (54% sequence identity) to the M. viscosa Pse-pathway homolog, although they do not belong to the same taxonomic family and were isolated from different organisms (sponge and salmon, respectively). This is expected since the only representative of the Pse pathway in our set of sequences is from Moritellaceae, and NeuB sequences are known to cluster according to pathway before species 28 . According to previously published phylogenetic data based on 16S rDNA sequences, this strain is closest to V. anguillarum, which does not produce NulOs 41 . Considering that phylogeny based on Multilocus Sequence Analysis (MLSA) 43,44 has been previously shown to be the preferable method for distinguishing between vibrios, we built a tree using this method with a subset of our target genomes (see Fig. 3). The tree only somewhat reflects the clades most recently proposed 44 for the Vibrionaceae, which is most likely due to the small number of sequences considered in this study. The M. viscosa strains cluster together with some of the aliivibrios, which indicates that the set of genes chosen (16SrDNA, ftsZ, gapA, gyrB, recA, rpoA and topA) is not fully able to distinguish between the Vibrionaceae and Moritellaceae families. The tree does however provide a relevant phylogenetic context for V. sp. B9-25K2 for the scope of this study, so this was not investigated further.
Comparison of NeuB active site residues. The possibility of assigning NAB pathways solely based on the sequence of the NeuB homologs raises the question of what the determinants which discriminate between the pathways are. The genome of Campylobacter jejuni strains contain genes coding for NeuB variants for all three pathways (cjNeuB1-3), and was therefore chosen for studying the pathway-related characteristics of NeuB homologs 27 . The NCTC 11168 strain is able to produce Neu5Ac, Pse5Ac7Ac, and Leg5Ac7Ac, as well as derivatives of the Leg/Pse compounds (assuming that the acetamidino-derivatives are synthetized from the diacetylated compounds) 45 . Their sequences were modelled onto the structure of the NeuB homolog (Neu pathway) from Nesseria meningitidis (nmNeuB, PDB ID: 1XUU). The structure of their active sites was compared with its PEP-bound form (PDB ID: 1XUZ) 46 . The results, presented in Fig. 4 (panels a-d), show that while the active residues are strictly conserved between the Neu-pathway enzymes, they differ for the Leg-and Pse-pathway ones, as is expected. Indeed, the different pathways indicate that while NeuB from the Neu pathway acts on ManNAc for the production of Neu5Ac, the substrates leading to diacetylated Pse and Leg compounds are 2,4-diacetamido-2,4,6-trideoxyaltrose and 2,4-diacetamido-2,4,6-trideoxymannose, respectively [47][48][49] . Neither Leg nor Pse substrates carry a -OH group at C6 (circled in black in panels b and c), which explains the variations observed for the residues neighboring it. It is interesting to note that the OH group carried by the Neu substrate in C4 is interacting with a water molecule, which is probably not present in the cjNeuB2 (Leg) and cjNeuB3 (Pse) active sites. Both Leg and Pse have an N-acetyl group at this position, and the water molecule could be compensating for the N-acetyl group. In addition to carrying a different substituent, C4 also has a different configuration in the Leg (S) and Pse (R) substrates (same orientation as the -OH of rManNAc for the Leg substrate). This could account for the variability of residues in proximity to the C4 position (indicated by a black arrow in panels b and www.nature.com/scientificreports/ c). Most interesting is the F129 residue of cjNeuB3 clashing with the N-acetyl group carried in C2 by the Neu substrate. This group has a different orientation in the Pse substrate (C2 has a S configuration) compared to the others. The conserved part of the active site, around the PEP substrate and the C1 of rManNAc, is in agreement with the proposed mechanism of NeuB enzymes, which involves this particular region 46 . This area is conserved in all three isozymes, while the area binding the rest of the sugar molecule is less so, allowing for specificity at C2 via position 132 (F129 in NeuB3) and at C4-C6 via the regions corresponding to the loop S2 and helix H4 of nmNeuB (positions 70-85). Our set of unique NeuB sequences was also used in a sequence-structure alignment, and the active site residues can be described as 12 regions on the NeuB sequences (see Fig. 4 panel d; residues indicated by their nmNeuB sequence number). Those regions were mapped around the nmNeuB substrate structure in Fig. 4 (panel e) as a simpler way to refer to active site sublocations and their corresponding residues. The multiple alignment strengthens the observations made by structure comparison, with the regions surrounding the sugar chain (from C2) less conserved than the others. Regions 5, 6 and 8 each contains several active site residues interacting with the substrates at different locations. PEP is surrounded by positions 110 (region 5); 129, 131 and 132 (region 6); 182 and 184 (region 8), with positions 132, 182 and 184 less conserved. Position 132 corresponds to the aforementioned G132/F129 residues of cjNeuB1/cjNeuB3, which may influence specificity at C2. The multiple alignments confirms that the Gly appears to be conserved within the Neu and Leg pathways, while the Phe is conserved for the Pse pathway. A rapid check with a larger set of sequences from public databases shows that the positions are mostly but not strictly conserved. Position 184 is occupied by either Asn or Tyr in Figure 2. Comparison of unique NeuB sequences. Top: Evolutionary history of NeuB sequences amongst target genomes. The evolutionary history was inferred using the Neighbor-Joining method 59 , with a bootstrap test (500 replicates) 83 . The tree is drawn to scale, with branch lengths (same units as the evolutionary distances used to infer the tree) and bootstrap values shown next to the branches. The evolutionary distances were computed using the Poisson correction method 60 and are in the units of the number of amino acid substitutions per site. The rate variation among sites was modeled with a gamma distribution (shape parameter = 5). This analysis involved 10 amino acid sequences. All positions containing gaps and missing data were eliminated (complete deletion option). There were a total of 318 positions in the final dataset. Evolutionary analyses were conducted in MEGA X 61 . Bottom: Fingerprint of NeuB sequence alignment. The sequences were aligned using the MUSCLE algorithm 57,58 . Gaps are shown as "-". In the nmNeuB structure, N184 interacts at least weakly with all susbtrates. Whether the differences in residue between pathways is a matter of susbtrate specificity is not clear. The same can be said of position 182, where the differences seem more to reflect phylogenetic distance than function. Amongst the Neu-pathway homologs, the sequence from A. salmonicida LFI1238 is more similar to that of the Leg pathway homologs, at least in part. This homolog is thought to accept 4-acetylated ManNAc (ManNAc2) as a natural substrate rather than the monoacetylated form, even though it is capable of utilizing it 34 . Considering that there may be some flexibility near C4 with the water molecule and that ManNAc2 is identical to LegAc2 at this position, the preference for ManNAc2 is supported by the sequence-structure comparison. Amongst the Leg pathway homologs, the closely related homologs from A. sp. R8-63 and A. wodanis share all their active site residues, as do the A. salmonicida LFI1238 and A. logei MR17-77 strains. The two groups differ from each other in regions 3-8, indicating that they may have a different susbstrate specificity. 8eLeg5Am7Ac was detected in A. salmonicida LFI1238, while A. wodanis, whose cluster is closest to that of Vibrio vulnificus CMCP6, might produce Leg5Ac7AcAla (or a variant of it) 34,39 . The corresponding biosynthesis pathways seem to involve different NeuB susbtrates, which is consistent with the observations made here.
nab cluster identification and analysis. The genome regions surrounding each NeuB homolog hit were investigated for the presence of putative nab clusters. The results, presented in Figs. 5 (panels a and b) and 6 (panel a), reveal that hits (or queries) with identical NeuB sequences had identical clusters (identical gene composition, sequences and organization). This is especially interesting in the case of the Leg pathway cluster from A. salmonicida LFI1238 (see Fig. 6 In order to confirm if this trend is observed for other species, we investigated the set of NeuB sequences similar to that of Vibrio B9-25K2, leading to the conclusion that when NeuB sequences are identical, the corresponding clusters might share a similar architecture with the same number of proteins coding for the same homologs in the same order, but they are not necessarily all identical in protein sequence (data not shown).
The cluster of Vibrio B9-25K2, for which the NeuB sequence was most similar to that of Pse pathway homolog from M. viscosa, contains homologs of sequences coding for PseB, PseC and PseG, which confirms its appartenance to this pathway (see Fig. 5, panel b). It also contains a homolog of PseH as a domain in a bi-functional protein for which the second domain is of unknown function, but that is associated to lipid metabolism. A conserved domains analysis of the cluster coding sequences revealed that the bi-functional enzyme encoded after PseC contains a putative cytidylyltransferase domain similar to the SpsF spore coat protein, indicating that it may be replacing the NeuA homolog absent from this cluster. The other coded domain is that of an aldo-keto reductase. Following this gene is a sequence coding for a dehydrogenase. The PseH homolog is followed by a gene www.nature.com/scientificreports/ coding for a hypothetical protein, with a domain similar to that of methylmalonyl CoA epimerase. The clusters corresponding to the closest NeuB sequences from public databases also contain the hypothetical protein and the bi-functional PseH homolog, both with around 60% sequence identity compared to that of Vibrio B9-25K2 (data not shown). A likewise search for the bi-functional cytidylyltransferase yielded a different set of strains sharing similar architectures, but none for which the NulO content has been investigated (data not shown). From the analysis of the NeuB homologs, we were able to hypothetize that the NeuBs from A. logei MR17-77 and R8-63 most likely produces Leg compounds, with the one from MR17-77 producing the same as that of A. salmonicida LFI1238 (LegAc2) and R8-63 the same as that of A. wodanis (Leg5Ac7AcAla or similar, since the sugar in A. wodanis has not been identified yet). However, the composition of their clusters is different (see Fig. 6, panel a), and they most likely ultimately each produce different compounds. The NeuB from Vibrio B9-25K2 seems to produce a Pse related NulO, but the lack of close sequences with identified NulO content in the corresponding organisms prevented a more precise hypothesis. The results from the cluster analysis are in agreement with these suggestions, and they suggest that Vibrio B9-25K2 may produce a diacetylated Pse on account of the presence of a PseH homolog (see Fig. 5, panel b). The presence of enzymes of unknown function might indicate further modifications of produced compound. The NulO content for the Neu pathway clusters has been previously identified, with both A. salmonicida LFI1238 and M. viscosa producing Neu5Ac and Neu5Ac7(9)Ac 32-34, 38 .   Fig. 6, panel b). While the former does not correspond to any identified NulO by this method, the latter can be tentatively assigned to the Leg5Ac7AcAlaQ compound observed in Vibrio vulnificus 39 . As mentioned above, the MS/MS spectra of the PseAc2 compounds from M. viscosa and Vibrio B9-25K2 show separate fragmentation patterns (see Fig. 5, panel c). This suggests that either the N-acetyl groups are substituents to different carbons of the Pse backbone, or that some of the asymmetrical carbons have a different geometry. In the latter case, the strains would produce epimers of PseAc2. No reports of a di-N-acetylated NulO carrying groups at positions other than C5 and C7 exist yet, and the only pathway known to produce epimers of its NulO is the Leg pathway [11][12][13][14] . The determination of the absolute configuration of PseAc2 from Vibrio B9-25K2, as well as studies targeting the proteins of its cluster should provide the necessary information. A common feature of all the presented MS/MS spectra is that the parent ions are either weak or absent, as is expected from molecules rich in alcohol groups (see Fig. 5, panel c) for the PseAc2 spectra and Fig. 6 (panel b, top) for the LegAmAc spectrum). The base peak, at m/z = 415, can result from the loss of either two water molecules from the PseAc2 compounds or one water and one ammonia from LegAmAc. The fragmentation route proposed by Klein et al., which involves the formation of a ring between the C4 and C8 of the NulO moiety, is consistent with the observed peaks (see Fig. 7). Indeed, the ring formation corresponds to the loss of a water molecule gives a compound at m/z = 433 for PseAc2 (432 for LegAmAc), where a small peak is observed for Vibrio B9-25K2. This route leads, when the ring substituents are removed in C5-C7, to a compound with m/z = 297 which loses the ring to form www.nature.com/scientificreports/ the compound at m/z = 229 . The compound from Vibrio B9-25K2 most likely forms another ring, due to a different proximity of the groups either because it is an epimer of Pse5Ac7Ac, and/or because the N-acetyl groups are carried by other positions. This preliminary experimental characterization strengthens the hypotheses made after analysis of the NeuB homologs and the nab clusters concerning the type of sugar produced by each cluster. A similar approach has already been used successfully in several species and the method appears robust, although it is limited to detecting compounds similar (by either mass or gene cluster) to previously identified sugars 24,39,52 . Obviously, a more thorough experimental analysis is needed to fully characterize the compounds detected. The results should thus be treated with caution.

conclusion
This study investigated the nab clusters from a set of bacteria from the Vibrionaceae and Moritellaceae families, based on the sequence comparison of their NeuB homologs. We found that each unique NeuB sequence corresponded to a unique nab cluster, each potentially producing different NulOs. This opens the possibility of screening sequence databases as the first step in the identification of previously undescribed NulOs. It also allows for the mapping of potential NulO diversity within species with published NeuB sequences and/or genomes. Considering that NeuB sequences cluster according to NAB pathway rather than species, we used sequencestructure comparison to identify putative substrate specificity regions for the NeuB homologs. In particular, we located a critical position occupied by either a glycine (Neu and Leg pathways) or a phenylalanine (Pse pathway) which seems to discriminates between NAc orientation at the C2 position of NeuB substrates. Further study of NeuB specificity not only opens for a better prediction of its substrates and products, but also the tailoring of already characterized enzymes for the production of various NulOs.
Going further, the study of the nab clusters associated with NeuB sequences of interest allows for better hypotheses concerning NulO structure as well as the detection of new NAB proteins previously unidentified, such as the non-NeuA homolog cytidylyltransferase from Vibrio B9-25K2. Preliminary results from mass spectrometry analysis of the NulO content of the target organisms both confirmed the presence of predicted sugars, www.nature.com/scientificreports/ but also revealed a new compound in A. sp. R8-63 for which the corresponding quinoxaline has a m/z of 530, which awaits identification. Taken together, the results support the use of NeuB sequence comparison as the basis for screening genomic data of NulO producing organisms, with the purpose of finding new protein and/or monosaccharide targets. In the case of Vibrio B9-25K2, which seems to produce a variant of PseAc2 with a different configuration and which cluster presents several unknown coding sequences, further studies might uncover a new branch of the Pse pathway.

Methods
Strains and sequences. The source information for the genomes used in this study is summarized in Table 1.  www.nature.com/scientificreports/ A. wodanis 06/09/139 (WP_045100955.1) were used as query for a similarity search against the set of target genomes using the BLAST suite 55 . The NeuB sequence from V. vulnificus CMCP6 was added to the set and unique NeuB protein sequences were thereafter aligned with MUSCLE, using default settings 57,58 . The evolutionary history of the NeuB sequences was inferred using the Neighbor-Joining method 59 , using the NeuB sequence from Legionella moravica (WP_028385079.1) as outgroup. The final dataset was composed of 10 sequences and 318 positions, with all positions containing gaps and missing data eliminated (complete deletion option). The evolutionary distances were computed using the Poisson correction method 60 and are in the units of the number of amino acid substitutions per site. The rate variation among sites was modeled with a gamma distribution (shape parameter = 5). Evolutionary analyses were conducted in MEGA X 61 .

Sequence alignments and phylogentic analyses.
The evolutionary history of Vibrio sp. B9-25K2 was inferred using Multilocus Sequence Analysis (MSLA) 43 . The sequences for the 16SrDNA, ftsZ, gapA, gyrB, mreB, pyrH, recA, rpoA and topA genes were retrieved for V. sp.  (Table S1). Each gene group was aligned in MUSCLE and the sequences from the same strains were concatenated (in the same order for all). The tree was calculated using the Neighbor-Joining method 59 and tested with bootstrap (1000 replicates). The evolutionary distances were computed using the Jukes-Cantor method 62 and are in the units of the number of base substitutions per site. The rate variation among sites was modeled with a gamma distribution (shape parameter = 5). This analysis involved 10 nucleotide sequences. All positions containing gaps and missing data were eliminated (complete deletion option). There were a total of 8050 positions in the final dataset. Evolutionary analyses were conducted in MEGA X 61 . nab cluster identification and analysis. The positive NeuB hits were located on the target genomes and the surrounding area was investigated for the presence of putative nab gene clusters, using the Artemis software for viewing 63 . The cluster regions were determined by the gene content and direction of the relevant areas. Gene names were assigned according to protein sequence similarity to known nab genes, using the most comprehensive (to the authors) denominations.
Identified clusters were analyzed for the function of each coding sequence composing them by performing sequence similarity searches within the non-redundant protein sequence database (BLAST) as well as domain searches within the Conserved domain Database 64 .
Homology modelling of C. jejuni neuB homologs and active site comparison. The sequences for the NeuB homologs of C. jejuni NCTC11168 (WP_002858213.1, WP_002864265.1, and WP_002870258.1) were retrieved from public databases and used as targets for homology modelling using the NeuB homolog from N. meningitidis (nmNeuB, PDB IDs: 1XUU, 1XUZ) as a template 46,65 . The modelling was performed on the SWISS-MODEL server 66 . The unique NeuB sequences from the target strains and the aforementioned C. jejuni sequences were also aligned to nmNeuB using the PROMALS3D server (https ://doi.org/10.1093/nar/gkn07 2). Sequence regions with active site residues were located by measuring distances to the substrates in the structures as well as using the PDBePISA server for nmNeuB (https ://doi.org/10.1016/j.jmb.2007.05.022). nonulosonic content release and derivatization. Cultures (15 mL) of A. salmonicida LFI1238, M. viscosa 06/09/139, and Vibrio B9-25K2 were grown in liquid LB media containing 2.5% NaCl (48 h, 12 °C, 200 rpm). 3 mL pellets were harvested and washed with dH 2 O before they were resuspended in 0.1 µ L phenylmethane sulfonyl fluoride (PMSF). After a 15 min incubation on ice, 200 µ L acetic acid (2M) and 2 µ L butylated hydroxytoluene (BHT, 1%) was added. The samples were thereafter incubated for 3 h at 80 • C and spun down for 10 min at 13,000 rpm. The supernatant was collected and filtered (Amicon 10K spin column) in order to remove large molecules. The filtrate was dried for 2 h using a speed-vac, and the resulting samples were stored at − 20 • C until use.
Samples were resuspended in 10 µ L dH 2 O before performing the labelling reaction with 1,2-diamino-4,5-methylenedioxybenzene (DMB, from TaKaRa) according to the manufacturer's instructions. The reaction mixtures were incubated in the dark at 50 • C for 2.5 h.
Mass spectrometry analyses. The quinoxaline (Q) content of the samples described in the previous section was analyzed by HPLC-MS/MS using the procedure described by Gurung et al. 34 . Water with 0.1% formic acid (A) and acetonitrile with 0.1% formic acid (B) were used for the HPLC-MS elution gradient (see a previous file and put the gradient here), at a flow rate of 400 µL/min. Tandem mass spectrometry was performed on samples containing compounds corresponding to masses equivalent to that of LegAmAcQ (m/z[M+H]+ = 450.19887323) and LegAc2Q/ PseAc2Q (m/z[M+H]+ = 451.1823402). The scan range was m/z 350-550.
Graphical output generation. Gene clusters were rendered in SVG graphics using scripts written by the author. The scripts are available through the python-bioinformatics repository on GitHub 67 . Molecular structures were obtained from PubChem and modified in Molview and Pymol 68,69 . All figures were prepared using Inkscape 70 .