Unravelling the diversity of magnetotactic bacteria through analysis of open genomic databases

Magnetotactic bacteria (MTB) are prokaryotes that possess genes for the synthesis of membrane-bounded crystals of magnetite or greigite, called magnetosomes. Despite over half a century of studying MTB, only about 60 genomes have been sequenced. Most belong to Proteobacteria, with a minority affiliated with the Nitrospirae, Omnitrophica, Planctomycetes, and Latescibacteria. Due to the scanty information available regarding MTB phylogenetic diversity, little is known about their ecology, evolution and about the magnetosome biomineralization process. This study presents a large-scale search of magnetosome biomineralization genes and reveals 38 new MTB genomes. Several of these genomes were detected in the phyla Elusimicrobia, Candidatus Hydrogenedentes, and Nitrospinae, where magnetotactic representatives have not previously been reported. Analysis of the obtained putative magnetosome biomineralization genes revealed a monophyletic origin capable of putative greigite magnetosome synthesis. The ecological distributions of the reconstructed MTB genomes were also analyzed and several patterns were identified. These data suggest that open databases are an excellent source for obtaining new information of interest.


Introduction
The amount of data obtained from genome and metagenome sequencing has been sharply increasing for the last several years 1 . These data are kept in open databases, such as the widely used NCBI 2 and IMG 3 databases. In the case of IMG, the number of entries for metagenomic data greatly exceeds that for genomic ones 3 . In most cases, scientists use only a part of the sequencing information uploaded to the databases, leaving large quantities of information essentially unanalyzed. This gives the possibility that the obtained data may contribute to other studies and shorten the time and efforts of other scientists. In the present study, data stored in open genomic and metagenomic databases were used to search for magnetosome biomineralization genes related to magnetotactic bacteria (MTB).
The MTB are a group of organisms characterized by the ability to synthesize magnetosomes, which are crystals of magnetite (Fe 3 O 4 ) or greigite (Fe 3 S 4 ) enveloped by a lipid membrane 4 . These crystals can be applied in medicine as contrast agents for MRI 5 and for treating tumors using magnetic hyperthermia 6 , and they are also of great interest in geology [7][8][9] and astrobiology 10 . The synthesis of magnetosomes is controlled by the magnetosome gene cluster (MGC), previously called the magnetosome island or MAI. The MGC comprises genes that control magnetosome biosynthesis and that determine magnetosome morphology and chemical composition. The MGCs are unique and are associated only with MTB. The genes essential to the biomineralization process are called mam (magnetosome membrane) genes. Nine of them (mamA, -В, -M, -K, -P, -Q, -E, -O, and -I), are present in all MGCs 11,12 . In addition to the mam genes, genes specific to certain groups may also occur; for instance, mad genes are found in MTB from the Deltaproteobacteria and Nitrospirae, while man genes are present only in the Nitrospirae 13 .
At present, only about 60 MTB genomes are known, and most are affiliated with the phyla Proteobacteria, Nitrospirae, and Ca. Omnitrophica. Recently, MTB genomes associated with Latescibacteria 14 and Planctomycetes 12 have been found in open databases, implying that these databases could contain substantial amounts of new information about MTB.

The search for magnetosome biomineralization genes in open databases. The search for MTB
genomes in open databases was guided by detecting MGCs unique to magnetotactic bacteria. Unfortunately, MGC sequences are not annotated as magnetosomal in open databases. This necessitated the use of previously known sequences of MGCs as search targets. The search was further complicated by the low identity values between the sequences of the same MGC gene in different MTB taxonomic groups. To cover the maximum number of new MTB representatives, MGC protein sequences were drawn from all known taxonomic groups where MTB were found previously. For this purpose, a database was created of known MGC protein sequences  (Supplementary Table S1). The database included 67 MGCs from Proteobacteria, Nitrospirae, Ca. Omnitrophica, Latescibacteria, and Planctomycetes. The sequences of nine Mam proteins present in all MGCs were used to conduct BLASTp with genomic data from the NCBI and IMG databases. This resulted in the detection of four new genomes containing magnetosome biomineralization genes (Table 1, Supplementary Table S2).
The use of all nine Mam proteins in metagenomic databases is complicated by the fact that much more data is kept in metagenomic than in genomic ones. To hasten the search process, one Mam protein out of nine common ones that met the required parameters was chosen for further BLAST analysis. The first chosen parameter was the identity between sequences from different taxonomic groups in each protein. The low values of these identities allowed exclusion of MamE, MamO, and MamP proteins from the analysis. The remaining MamA, -B, -M, -K, -I, and -Q proteins were assessed for sequences with the highest -ln of e-values, in addition to high identities (Fig. 1a). MamI was the least consistent with these requirements and was not used in further analyses. By contrast, MamK was the most consistent.
Each Mam protein has its homologs in non-MTB that are not involved in the magnetosomes biomineralization process. These homologs should be avoided when searching for MGCs. For this, Mam protein was chosen whose identities and -ln of e-values were significantly varied from these parameters in homologs (Fig. 1b). MamK showed the best result in this case, and its minimum identity and -ln e-value between sequences were 30 and 135, respectively. However, part of homologs had identities and -ln of e-values similar to the values found between Mam protein sequences. These homologs were confirmed not to be Mam sequences by verifying their phylogenetic separation (Fig. 1c). The sequences of each Mam protein formed monophyletic clades, while MamK formed two clades. Despite this, no homologs were observed inside the MamK clades. Based on all the investigated parameter results, the MamK protein sequences were chosen for the MGC gene search in the open databases.
The MamK protein sequences were used for BLAST for 10587 metagenomes from water, terrestrial, engineered, and host-associated ecosystems. The analysis revealed 2798 sequences potentially affiliated with the MamK protein ( Supplementary Fig. S1a). Their scaffolds were checked for the presence of other Mam protein sequences. After that, 227 MamK sequences referring to 135 metagenomes were obtained (Supplementary  Tables S3 and S4). These and previously known MamK sequences were used to construct a phylogenetic tree ( Supplementary Fig. S1b), which revealed that the identified MamK sequences were not closely related to previously known sequences. This assumes that they could refer to taxonomic groups in which MTB were not found before.
Metagenome binning, phylogenomic inferences, and MGC reconstruction. The phylogenetic position of genomes to which the MamK sequences belonged was assessed by conducting metagenome binning, and it yielded 14688 metagenome-assembled genomes (MAGs) (Supplementary Table S3). Two metagenomes were also determined to be single-cell amplified genomes (SAGs), so no binning procedures were required for www.nature.com/scientificdata www.nature.com/scientificdata/ them. Of all the MAGs obtained in this study, only 140 contained previously detected MamK sequences. For those of the 140 whose completeness was >45% decontamination was conducted. This left 32 MAGs with completeness >45% and contamination <10% that contained MGCs ( Table 2, Supplementary Table S6). The phylogenomic affiliations of the obtained MAGs, SAGs, and genomes were then determined, the MGСs genes were reconstructed, and the ecological distributions were studied.
The identification of the phylogenomic position of the studied genomes revealed, for the first time, their affiliation to the phyla Elusimicrobia, Ca. Hydrogenedentes, and Nitrospinae ( Supplementary Fig. S2, Supplementary Tables S2 and S5). One genome was affiliated with the phylum Elusimicrobia and referred to order UBA1565 in the Elusimicrobia class. After MGC reconstruction, the mamI, -B, -M, and -N genes were revealed in the investigated genome (Fig. 2). Two MAGs from Ca. Hydrogenedentes belonged to the same species (98.70% average nucleotide identity), but they were obtained independently from different metagenomes. These MAGs referred to the GCA-2746185 family in the order Hydrogenedentiales. The 16S rRNA gene from the Ca. Hydrogenedentes bacterium MAG_17971_hgd_130 44 had 90% similarity with the closest non-MTB Ca. Hydrogenedentes bacterium YC-ZSS-LKJ63. All these data confirmed that the obtained binning results were regular and did not represent a computational error. Only mam genes were found in the MGCs of the studied genomes.
In the Nitrospinae phylum, two MAGs were affiliated with different genera of the order Nitrospinales. Their MGCs revealed the presence of mam and mms (magnetic particle-membrane specific) genes. Samples for the metagenomes of the obtained MAGs were collected from the Gulf of Mexico 45 and Arctic Ocean waters. Non-MTB representatives of this phylum were also detected only in marine habitats 46,47 , indicating that bacteria from the Nitrospinae could prefer to inhabit marine environments.
The 14 reconstructed MAGs belonged to different families of Deltaproteobacteria. Of the 14, three MAGs were affiliated with the UBA8499 genus in the Pelobacteraceae family. In their MGCs, apart from the mam and mad genes, which are typical for Deltaproteobacteria, the man genes were detected for the first time. Previously, the man genes were associated only with MTB from the Nitrospirae. Another two MAGs were affiliated with the Syntrophobacteraceae family, where MTB were discovered previously 41 . This is further evidence that binning was conducted correctly and that MTB representatives are indeed present in this family.
Three genomes also belonged to the Desulfobulbales order. Of these, the Deltaproteobacteria bacterium MAG_22309_dsfv_022 48 contained man3 gene in addition to the mam and mad genes, thereby confirming the routine presence of man genes in Deltaproteobacteria. A further four MAGs were related to the NaphS2 family in the Desulfatiglanales order. Analysis of their MGCs revealed genes responsible for putative greigite magnetosome synthesis. Metagenomic samples of the studied genomes were obtained from marine sediments, as well as all other known non-MTB genomes of this family 49,50 .   www.nature.com/scientificdata www.nature.com/scientificdata/ In Alphaproteobacteria, three MAGs and one genome were related to a 2-02-FULL-58-16 family in the Rhodospirillales order. Metagenomic samples of the studied genomes were isolated from marine ecosystems. The other non-MTB genomes of this family were also detected only in marine ecosystems 51 . For the first time, two MAGs containing MGCs were also detected in Telmatospirillum genus. Their metagenomic samples were collected from a freshwater bog. Telmatospirillum siberiense, the only known representative of this genus, was also isolated from freshwater peat soil 52 . Thus, this group possibly tends to inhabit freshwater ecosystems. Reconstruction of the MGCs revealed mam and mms genes in the studied MAGs. One MAG was referred to the Ca. Etaproteobacteria class. Genomes from this class previously were found in both saline and freshwater habitats 15,31,53 . The obtained MAG clustered with genomes isolated from freshwater environments. The MGC of the recovered MAG revealed a standard gene set inherent to MTB from this class. A further six MAGs were affiliated with the Gammaproteobacteria. All of these were sampled from one source and had 100% identity between their genes. Only the mam genes were detected in their MGCs.
The Nitrospirae phylum was affiliated with one MAG. A metagenomic sample of this phylum was obtained from a hot spring. Previously, other MTB and non-MTB from this phylum were also detected in hot springs 54,55 . Three of the recovered MAGs belonged to the SG8-4 order in the Phycisphaerae class of Planctomycetes. Apart from the reconstructed MAGs, one SAG was also obtained from the UBA1845 order in Phycisphaerae class. The completeness of this SAG was very low (39%), but it was also taken into analyses due to the large number of mam genes detected in the MGC. Another detected SAG was affiliated with Сa. Omnitrophica and was referred to the GWA2-52-8 family in the Omnitrophales order. The MGC of this genome had a set of genes that were specific to all magnetotactic representatives from this phylum.

Reconstruction of the evolutionary pathways for MGCs. The identification of putative genes involved
in magnetosome biomineralization allowed investigation of MGC evolutionary pathways. These were analyzed by constructing a phylogenetic tree of concatenated MamABKMPQ sequences ("Mam tree", Fig. 3b) and comparing this tree with one based on 120 single-copy marker gene proteins ("core genome tree", Fig. 3a). Comparative analysis of the MTB position on the trees revealed some incongruences. For instance, the Deltaproteobacteria group from "core genome tree" was divided into three subgroups on the "Mam tree. " The first subgroup comprised representatives capable of putative greigite magnetosome synthesis, while the other two subgroups included representatives with MGCs for magnetite magnetosome biomineralization. One of the magnetite subgroups included representatives of the Pelobacteraceae, Syntrophia, and Desulfurivibrionaceae families, which clustered with the Nitrospirae. According to the "Mam tree" topology, the man genes could be assumed to have originated in the Deltaproteobacteria and were inherited by the Nitrospirae through horizontal gene transfer. The compared trees also indicated vertical inheritance in the Alpha-and Ca. Etaproteobacteria groups, although the occurrence of horizontal transfer events was previously established in these groups 27,31 . These types of transfers have been confirmed to have occurred recently, which is why they cannot be detected through the tree topology analysis.
A further investigation examined whether MGC originated once or more than once. This was done by adding the Mam protein sequences recovered in this study to previously known Mam protein sequences and their non-MTB homologs and then constructing phylogenetic trees ( Supplementary Fig. S3). Analysis of the constructed trees confirmed the previous results 15 showing that all Mam protein sequences, except for MamK, formed monophyletic clades and that these clades did not contain any homolog sequences. This indicates that the MGCs for magnetite and greigite synthesis are likely to have a common origin.
The magnetosome chemical composition in genomes of every phylum where MTB were found for the first time were predicted by counting the phylogenetic distances of the concatenated sequences of six essential Mam proteins (MamA, -B, -K, -M, -P, and -Q) and conducting a principal component analysis (Fig. 4). All values clustered to three groups. First was the group that comprised Planctomycetes, and Latescibacteria, which are known to have genes for putative greigite magnetosome synthesis 12,14,56 . The NaphS2 family of Deltaproteobateria, Ca. Hydrogenedentes, Сa. Omnitrophica, and Elusimicrobia also clustered with this group. The other two groups comprised representatives with magnetite magnetosome synthesis genes. The first magnetite group included Nitrospinae and all classes of Proteobacteria where MTB were known. The exception was the remaining studied classes of Deltaproteobateria, which clustered with the second magnetite group, together with Nitrospirae.

Discussion
This study represents the first large-scale search of magnetosome biomineralization genes in open databases. Bioinformatic analysis of the gathered data almost doubled the number of MTB genomes from the 60 previously known; 4 genomes, 2 SAGs, and 32 MAGs were obtained as a result of this research. Besides, analysis of the database of collected MGC protein sequences revealed MamK as the most appropriate protein for MGC searching in open databases. This finding will allow the use of these putative protein sequences as markers for MTB detection in environmental samples.
This study also provides the first description of magnetosome biomineralization genes in the genomes of Elusimicrobia, Nitrospinae, and Ca. Hydrogenedentes. Non-MTB representatives of Elusimicrobia phylum were previously found as free-living 57 and ecto-and endosymbionts 58,59 of multicellular eukaryotes. MTB living symbiotically with eukaryotes have been detected previously 60,61 . Further investigations are needed to solve the enigma of whether MTB from Elusimicrobia free-living or symbiotic organisms are.
To date, little is known about Ca. Hydrogenedentes, except for its genome presence [62][63][64] . More is known about Nitrospinae, where one axenic culture was previously described 65 . However, these reports do not give an extensive understanding of the capabilities of this phylum's representatives. Thus, the detection of MGCs in genomes that belong to these phyla significantly supplements the knowledge of MTB diversity and evolution, while also providing new information about these phyla. www.nature.com/scientificdata www.nature.com/scientificdata/ This work also gives much new information about groups where MTB were previously recognized. For instance, the relatively few genomes were affiliated with Alpha-and Ca. Etaproteobacteria, while the current belief is that representatives of these classes dominate among MTB in all natural environments 12 . In addition, within the Alphaproteobacteria class, the presence of MGCs was discovered for the first time in genomes belonging to the Telmatospirillum genus. This may indicate a common origin for magnetosome biomineralization genes among the Magnetospirillum, Magnetospira, and Magnetovibrio genera.
Furthermore, for the first time the presence of man genes was revealed in MGCs of the Deltaproteobacteria. Previously, these genes were found only in Nitrospinae. Whether horizontal gene transfer events occurred     Supplementary Table S1. www.nature.com/scientificdata www.nature.com/scientificdata/ between representatives of these phylogenetic groups or their MGCs shared a common origin is not known. Further studies are required to determine which possibility is correct.
The genomes with magnetosome biomineralization genes obtained in this study allowed the investigation of the origin and evolution of the MGCs. A comparison of the "core genome" and "Mam" trees revealed clustering of the Deltaproteobacteria greigite subgroup sequences with the Planctomycetes, Latescibacteria, Ca. Hydrogenedentes, Сa. Omnitrophica, and Elusimicrobia phyla. Of these, Latescibacteria 14 and Planctomycetes 12 were already known to have MGCs for putative greigite synthesis. Note that Ca. Omnitrophica was also associated www.nature.com/scientificdata www.nature.com/scientificdata/ with the greigite subgroup, although it is believed that they biomineralize magnetite magnetosomes 43 . Such assumptions are based on Сa. Omnitrophus magneticus SKK-01 however, this genome is highly contaminated (Supplementary Table S1). Thus, further investigations are needed to study Ca. Omnitrophica magnetosome chemical composition.
In addition to all mentioned findings, the latest version of the bacterial tree of life 66 , based on GTDB R04-RS89 reference data ( Supplementary Fig. S4) helped to reveal the most ancient phylum in which MTB representatives were known. It was indicated that the Elusimicrobia phylum is the most closely related to the last universal common ancestor (LUCA). If the MTB of this phylum are assumed capable of greigite magnetosome synthesis, then greigite MGCs could have appeared much earlier than commonly believed, and the first MTB could have greigite, not magnetite, MGCs. The other phyla with MTB representatives in the vicinity of LUCA are Ca. Omnitrophica and Proteobacteria, although Nitrospirae MTB was previously thought to be the most ancient 40 .
Considering the existing data regarding the presence of horizontal transfer events among MTB and analyzing the discrepancies in "core genome" and "Mam" trees, the proposal could be made that horizontal gene transfers occur much more often than previously thought and are of great importance in MGC evolution.
The genomes obtained in this work require further confirmation by morphological identification. Once confirmed, these data will allow a more thorough study of the contribution of vertical and horizontal gene transfer events with respect to MGC inheritance. The data obtained in the present work will allow the study of the environmental and metabolic preferences of newly discovered MTB genomes, which may become the key to isolating them in axenic cultures. Moreover, a detailed MGC analysis could help to find as yet unidentified genes that are involved in magnetosome synthesis and to reveal much about the biomineralization process.
Generally, in this work, it was shown that MamK is the most appropriate protein for MGCs detecting in open databases. The search results allowed to receive 38 new genomes containing MGCs, that were affiliated to both taxonomic groups where MTB were found before and three new phyla. Thus, received MTB genomes permitted to unravel the MTB diversity and can be used in further MTB studies or in receiving new information about these phyla. Also, a comparison of MTB position on "mam tree" and "core genome tree" helped to reveal signs of putative horizontal gene transfers. This led to assumptions that such MGC transfers could occur with higher frequency and probably play a much more important role in MGC evolution than it was previously thought. Moreover, a proposal was made that the origin of MGC probably is more ancient than it was suggested earlier and possibly was capable of greigite magnetosomes biomineralization rather than magnetite.
Thus, all received data allowed the expansion of knowledge about MTB diversity, ecology, and evolution and has opened up new opportunities for further searches for and investigations of magnetotactic bacteria.

Materials and methods
The search for magnetosome biomineralization genes in open databases. The search for magnetosome biomineralization genes was conducted by collecting a database of MGC protein sequences based on currently known MTB genomes (Supplementary Table S1). The search was provided using BLASTp analysis, with identity >30% and e-value >1e −05 . Searches of the IMG and NCBI genomic databases used sequences of nine essential Mam proteins from different taxonomic groups as targets. The IMG metagenomic database was searched by BLASTp using MamK sequences. The sequences obtained from BLAST analysis were further checked to separate MGC proteins from their homologs. For this, each Mam protein sequence was checked for joint clustering on the phylogenetic trees. The presence of other Mam proteins in the same scaffold provided additional support for choosing those scaffolds for further analysis. The search was conducted in April 2018.
Genome reconstruction and analyses. Metagenome assembled genome (MAG) reconstruction was conducted using the Busybee web 67  www.nature.com/scientificdata www.nature.com/scientificdata/ involves the classification of obtained genes and scaffolds in each MAG relative to the gene base with a known taxonomic classification. After that, scaffolds with incongruent taxonomic classifications are removed from the MAGs. The quality metrics were assessed using the QUAST 72 tool. The average nucleotide identity (ANI) was calculated using fastANI 73 . The MGCs were determined using local BLAST and comparison with reference sequences of magnetotactic bacteria.

Phylogenetic analyses.
Taxonomic assignments for the studied genomes 16S rRNA genes were obtained using the GTDB 16S r89 dataset in IDTAXA 74 . The GTDB-Tk v.0.1.3 75 'classify_wf ' command was used to find 120 single-copy bacterial marker protein sequences, to construct their multiple alignments and to get the taxonomic assignment using the GTDB r86 database 76 . Amino acid sequence sets of the MamA, -B, -M, -K, -P, and -Q proteins were independently aligned using MAFFT 77 , curated with Gblocks v. 0.91b 78 with an option that allows gap positions within the final blocks, and then concatenated. These Mam protein sequences were also used to build trees with their homologs. Maximum-likelihood trees were inferred with IQ-TREE 79 using evolutionary models selected by ModelFinder 80 . Branch supports were obtained with 1000 ultrafast bootstraps 81 . Trees were visualized with iTOL v4 82 . The genomes of Ca. Omnitrophus magneticus SKK-01, Ca. Magnetoglobus multicellularis str. Araruama, Ca. Magnetobacterium bavaricum TM-1, and Ca. Magnetoovum chiemensis CS-04 were not subjected to phylogenetic analyses because they had failed the quality check (Supplementary Table S1). Taxonomic classification of the obtained genomes on phylum rank was performed using NCBI taxonomy; other ranks were named using GTDB.

Data availability
The genomes and metagenomes used during the current study are publicly available in NCBI (https://www. ncbi.nlm.nih.gov/) 44,48, and IMG (https://img.jgi.doe.gov/cgi-bin/m/main.cgi) [114][115][116][117] databases. Scaffolds of obtained MAGs could be found in Supplementary Table S6, hosted at figshare 118 . All data generated and analyzed in this study are also available in figshare 118 and in supplementary information accompany this paper. Assembly of Rhodospirillaceae bacterium MAG_01419_mvb_30 could be found in RAST (https://rast.nmpdr.org/) using 'guest' as login and as password.