Genome sequence of segmented filamentous bacteria present in the human intestine

Segmented filamentous bacteria (SFB) are unique immune modulatory bacteria colonizing the small intestine of a variety of animals in a host-specific manner. SFB exhibit filamentous growth and attach to the host’s intestinal epithelium, offering a physical route of interaction. SFB affect functions of the host immune system, among them IgA production and T-cell maturation. Until now, no human-specific SFB genome has been reported. Here, we report the metagenomic reconstruction of an SFB genome from a human ileostomy sample. Phylogenomic analysis clusters the genome with SFB genomes from mouse, rat and turkey, but the genome is genetically distinct, displaying 65–71% average amino acid identity to the others. By screening human faecal metagenomic datasets, we identified individuals carrying sequences identical to the new SFB genome. We thus conclude that a unique SFB variant exists in humans and foresee a renewed interest in the elucidation of SFB functionality in this environment.


Introduction
The interdependency of the intestinal microbiota and its host manifests itself in various ways, of which some on the host side are highly spectacular. Thus, effects spanning from improving nutrient uptake (Krajmalnik-Brown et al., 2012) or metabolizing drugs (Zimmermann et al., 2019) to influencing risk of cancer (Feng et al., 2015) and altering cognitive function (Martin et al., 2018) have been reported to be dependent on microbiota composition and functionality.
Although this area of research has attracted great attention during the last decades, the codes for communication between microbes and man have only begun to be deciphered.
Investigations of intestinal host-microbe interactions have been revolutionized by the development and application of powerful DNA sequencing and bioinformatics tools.
Nevertheless, although huge amounts of data are gained in this way, the translation of these data into a meaningful context lags behind. Another, somewhat interlinked, approach has been to search for tentative "key players" which share an evolutionary history with the host and maintain important host functions through specific mechanisms. The identification of such key players is not straightforward and, although some exquisite examples exist (Atarashi et al., 2013, 2015; Tan et al., 2016; Tanoue et al., 2019), only a limited number of commensal bacteria have so far been identified as having defined effects on their host. Segmented filamentous bacteria (SFB) represent one of these key players and hold a so far unique capacity to elicit full maturation of the mouse gut immune barrier. The work with SFB during the last decades beautifully illustrates the cross-fertilization between different areas of research, particularly microbiology and immunology (Cerf-Bensussan, 2019).
SFB were discovered in the mid-1960s in laboratory animals (Hampton and Rosario, 1965; Savage, 1969), where they could be identified by microscopy due to their filamentous growth and the unique attachment of the filament to the intestinal wall. Several intriguing features connected to the lifestyle of these organisms later prompted a deeper interest in their possible role as important symbionts (Davis and Savage, 1974). Thus, they colonized primarily the terminal part of the small intestine, where many immune cells are located; they appeared in greater numbers around weaning, which is an important period for maturation of the immune system; and, not least, they exhibited an intimate contact with the host through a specific anchoring to the intestinal cell wall. Together, these observations led to speculations, and later also the first reports, that SFB affected immune functions of the host (Klaasen et al., 1993; Umesaki et al., 1995). After these early observations, SFB have been the subject of a large number of studies (reviewed in Ericsson et al., 2014; Schnupf et al., 2017), which have firmly established their role as immunomodulatory bacteria. Thus, they are credited with a long list of effects, including stimulation of chemokine and antimicrobial component production, induction of gut lymphoid tissue and a strong increase in faecal IgA (Umesaki et al., 1999). However, their potent triggering of T helper 17 (TH17) cell differentiation is perhaps their most eye-catching attribute in terms of immunomodulation (Ivanov et al., 2009). Interestingly, very recent experiments using immunodeficient mice demonstrated the ability of SFB to also confer protection against rotavirus infection independent of immune cells (Shi et al., 2019).
Although SFB are not yet cultivable, SFB mono-colonized laboratory animals have offered a route to isolation and characterization of this group of organisms. Complete genomes are available from SFB isolated from mice (Bolotin et al., 2014; Kuwahara et al., 2011; Prakash et al., 2011; Sczesnak et al., 2011) and rats (Prakash et al., 2011), and an unpublished draft genome sequence from turkey is publicly available at NCBI (GenBank accession number GCA_001655775.1). Genomic analysis has revealed that SFB are gram-positive spore-forming bacteria with a distinct phylogenetic position within the Clostridiales. They have small genomes of around 1.5 Mb, which is reflected in a limited biosynthetic capability, giving them a functional position between free-living bacteria and obligate intracellular symbionts (Bolotin et al., 2014; Sczesnak et al., 2011).
A hallmark of SFB biology seems to be host specificity, as supported by both experimental and genetic data. Thus, colonization experiments have shown that cross-colonization with mouse SFB in rats or rat SFB in mice is not possible (Tannock et al., 1984). This argues for different species or lineages of SFB adapted to different hosts. After the discovery of SFB in rodents and with the accumulating evidence for their ability to affect crucial steps in immune development, it was natural to search for them also in humans. The first study indicating their presence in humans visualized a tentative SFB organism adherent to ileal biopsy tissue by light microscopy (Klaasen et al., 1993). More recently, 16S rRNA gene sequences of SFB were reported in human samples using SFB-specific PCR primers; Yin et al. (2013) found SFB sequences in 55 faecal samples, while one of us (Jonsson) detected an SFB sequence in an ileostomy sample (Jonsson, 2013). While the faecal SFB sequences were phylogenetically interleaved with SFB sequences from mice from the same study, the ileostomy sequence was distinct from SFB sequences from other animals. Except for the 16S sequences from these studies, no human SFB DNA sequences have been published. Moreover, human gut shotgun metagenomic sample sets have been scanned by attempting to map reads to the SFB genomes of mice and rats, but without success (Sczesnak et al., 2011). Thus, up until now, no genomic data have been presented for human-derived SFB, and it is still an open question whether a human-adapted variant of the organism actually exists.
We now report the draft genome sequence of a tentatively human-adapted representative of the SFB group. With metagenomic approaches, we have reconstructed the SFB genome from the same ileostomy sample that earlier produced the unique 16S rRNA gene sequence.
Phylogenetic analysis clusters this genome to the SFB genomes described earlier, yet clearly defines it as unique. In addition, we could show the presence of sequences derived from the new genome in unrelated individuals through screening of published metagenome data. Our data strengthen the likelihood that the paradigm with host-specific colonization is valid also for SFB-human symbiosis. Considering the possibility of analogous immune-modulatory activities of SFB in humans and rodents, this finding could be of paramount importance.

Genome reconstruction
To verify the presence of an SFB 16S rRNA gene sequence in the human ileostomy sample where it was earlier detected with SFB-specific primers, we subjected the same sample to amplicon sequencing using broad-taxonomic range PCR primers. This confirmed the existence of an SFB sequence: after sequence noise removal, a single amplicon sequence variant (ASV) was classified as Candidatus Arthromitus, and this was identical over its full length to the previously published 16S sequence from the same sample. The relative abundance of this ASV was, however, low, as it represented 0.16-0.37% of the microbial community's ASV sequences, depending on the DNA extraction method used.
In order to assemble the genome of the candidate SFB organism, we conducted deep shotgun metagenomic sequencing using Illumina NovaSeq, which generated a total of 953,167,834 read-pairs for four different DNA preparations from the same sample. The 317,687 contigs of the resulting assembly were binned into genomes using information on sequence composition and coverage. To improve the binning procedure, the coverage of the contigs was estimated not only using the four different DNA libraries from the sample that were prepared using three different DNA extraction methods, but also using publicly available human gut metagenomes.
We tried two different binning tools, CONCOCT and MetaBAT2, and applied two different contig length cutoffs for each tool. The two binners generated approximately the same number of bins with comparable quality estimates (Figure S1), but only MetaBAT2 generated a bin at each length cutoff that was classified as SFB (genus Savagella according to the Genome Taxonomy Database (GTDB)). These two bins differed by a few contigs, and we used a conservative approach of defining the SFB metagenome-assembled genome (MAG) as all contigs shared by both bins (127 contigs, 1,221,164 bp), as well as those found in only one bin but taxonomically classified as SFB (Candidatus Arthromitus according to NCBI; 25 contigs, 89,165 bp). As is often the case for MAGs, a contig encoding a 16S rRNA gene was missing.
rRNA gene prediction, however, identified a 4.2 kb contig encoding a 16S gene with a region identical to our SFB amplicon sequence, and this contig could be linked to contigs of the MAG using read-pair information. The contig (k141_89555) encodes a full-length 16S gene as well as a 23S gene. The 16S gene is 96% similar across its full length to those encoded in the SFB mouse and rat genomes. Notably, it has mismatches to commonly used primers for PCR identification of SFB (Figure S2). Adding this contig resulted in a 1,314,549 bp (153 contig) MAG that we denote SFB-human-IMAG (IMAG; ileostomy metagenome-assembled genome).
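The bin-merging rule described above can be written as a few set operations. The following is a minimal sketch under stated assumptions, not the code used in the study: the function name, the toy contig identifiers and the taxonomy labels are all hypothetical, and the 16S contig is added by hand in the same way as it was linked manually through read-pair information.

```python
def define_mag(bin_a, bin_b, taxonomy, target="SFB"):
    """Return contigs shared by both bins plus bin-unique contigs classified as `target`.

    bin_a, bin_b: sets of contig identifiers from the two length cutoffs.
    taxonomy: dict mapping contig identifier to a taxonomic label (hypothetical labels).
    """
    shared = bin_a & bin_b                 # contigs present in both bins
    only_one = bin_a ^ bin_b               # contigs present in exactly one bin
    rescued = {c for c in only_one if taxonomy.get(c) == target}
    return shared | rescued


# Toy input (hypothetical contig names, not the real assembly).
bin_short_cutoff = {"c1", "c2", "c3", "c4"}
bin_long_cutoff = {"c1", "c2", "c5"}
taxonomy = {"c3": "SFB", "c4": "Bacteroides", "c5": "SFB"}

mag = define_mag(bin_short_cutoff, bin_long_cutoff, taxonomy)

# A contig linked by other evidence (here standing in for the 16S contig) is added afterwards.
mag_with_16s = mag | {"c_16S"}
```

With the numbers reported in the text, the same rule gives 127 shared contigs plus 25 rescued contigs, and the 16S contig brings the total to 153.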

Phylogenomic analysis
The reconstructed genome was subjected to phylogenomic analysis using a set of universally conserved protein sequences. This verified the placement of SFB-human-IMAG among the SFB (with 100% support). Intriguingly, the human-assembled SFB genome was most closely related to SFB isolated from turkey (GCA_001655775.1), and the two formed a sister clade to the SFB genomes from mouse and rat (Figure 1). This pattern was supported by average amino acid identity (AAI) analysis, with SFB-human-IMAG displaying 71% AAI to SFB-turkey, while displaying 65% AAI to the SFB from rodents (Table 1). It was, however, not supported by a phylogenetic tree based solely on the full-length 16S genes of the genomes (Figure S3). The conflicting phylogenies between the SFB and their hosts could indicate that the SFB have switched hosts during the course of evolution. It could also reflect that the human and turkey SFB belong to a different lineage than the mouse and rat SFB, and that the two lineages diverged before mammals diverged from birds. The two SFB lineages may exist in all hosts, or one could have gone extinct in some of the hosts.
[...] Table S2); twelve of which were also found in SFB-turkey. Several of the 29 COGs, like DNA methylase, transfer proteins TraG and TraE and SNF2 family helicases, appear to be encoded on prophage sequences or other mobile elements. Conversely, SFB-human-IMAG lacks 114 COGs that were found in all the other SFB, but most of these are likely missing because the genome is incomplete. SFB have previously been described as having a fermentative metabolism. We have identified all but a few of the enzymes for glucose utilization also in the draft genome of SFB-human-IMAG.
No enzymes involved in the tricarboxylic acid cycle were identified, and, accordingly, there are no proteins that can be assumed to take part in an electron transport chain, confirming a [...]
The SFB-human-IMAG genome also encodes a large number of transport functions. This is in agreement with a restricted metabolic capability and similar to other SFB. Since the genome is incomplete, some transport functions are likely missing from the assembly.
A notable exception is the lack of the ABC transporter for phosphonate, where the specific genes are missing in the middle of an SFB-human-IMAG contig that otherwise displays conserved synteny with SFB-mouse-Japan. However, since phosphorus is indispensable, bacteria have evolved several systems for acquisition of this macronutrient, and SFB, including SFB-human-IMAG, carry genes for a phosphate-specific transport system (sfb.merged_01113-sfb.merged_01115).
When comparing with the annotation of the complete genome of SFB-mouse-Japan, we conclude that SFB-human-IMAG is likely to carry a complete set of genes for sporulation and germination. Likewise, a complete set of genes for flagellar motility and chemotaxis is present, and it is thus reasonable to assume that the bacterium is capable of motility and chemotaxis.
This intimate contact suggests a strong potential for interaction with the host, and indeed, data were recently published that show how SFB in mice can transfer cell wall proteins into the enterocyte (Ladinsky et al., 2019). This protein (p3340) was earlier shown to be a major target in the antigen-specific CD4 TH17 cell response induced by SFB (Yang et al., 2014). The corresponding protein is also encoded in the SFB-human-IMAG genome (sfb.merged_00774). It is interesting to note that while the N-terminal (signal sequence) and the C-terminal parts of these proteins display high amino acid identity, the main part shows only a low degree of identity (Figure S6). In the work by Yang et al. (2014), two peptides from p3340 were reported to strongly stimulate TH17 cells. These peptides are conserved only to a limited degree in SFB-human-IMAG, leaving open the possibility that the variability in sequence reflects host adaptation and thus the evolution of human-specific TH17-triggering epitopes.
The components of SFB responsible for attachment to the enterocytes have not been identified.
Secreted and cell surface located bacterial proteins generally play major roles in signal transduction, ion transport and host cell adhesion. While enzymes and transporters often contain signature motifs, the functional sites of many proteins involved in adhesion are undefined, and these proteins are therefore annotated as hypothetical. A number of secreted and cell surface proteins were predicted in our genome based on N-terminal signal peptides, and 60 of these are hypothetical proteins. The size of the hypothetical and tentatively extracellular proteins in SFB-human-IMAG ranges from 57 to 2040 amino acids, and the identity to homologous proteins from SFB from other animal hosts is in the range of 34-72%, with a mean of 52% identity. This is substantially lower than the overall identity of the SFB-human-IMAG proteome with other SFB (Table 1), and thus indicates more rapid evolution in proteins communicating with the exterior environment. It is plausible that some of these proteins play a role in attachment and host communication and thereby mediate the host specificity that is characteristic of SFB.
SFB do not harbor a gene for sortase, the enzyme that normally anchors many cell surface proteins in gram-positive bacteria, and a corresponding mechanism has not been described in the SFB group. It is likely, though, that an alternative route for anchoring of cell surface proteins exists in SFB. Interestingly, a conserved amino acid motif located C-terminally was earlier identified in a number of putative cell surface proteins in SFB (Pamp et al., 2012). We have localized this motif in a number of predicted extracellular proteins, including the TH17-stimulating protein p3340 from mouse SFB and the orthologous protein 00774 from SFB-human-IMAG mentioned above. Supporting evidence for anchoring comes from the study of Ladinsky et al. (2019), where immuno-EM shows the location of p3340 to the SFB cell wall. We therefore postulate that SFB-human-IMAG has at least 14 cell surface proteins that may be anchored to the bacterial surface through the involvement of this amino acid motif. Furthermore, twelve proteins encoded by the SFB-human-IMAG genome are predicted to be anchored via a lipoprotein motif, and six proteins could possibly be anchored via an N-terminal transmembrane helix (TMHMM 2). This leaves a substantial number of predicted extracellular proteins seemingly anchorless. While some of these are likely true secretory proteins, it is notable that a number of them have a very high isoelectric point, giving them a positive charge, which in turn could allow them to re-associate with the bacterial surface (Turner et al., 1997).

Presence in other metagenomes
To verify that SFB-human-IMAG resides in the human intestine, we searched for it among published metagenomes from the human gut. The genome was first BLAST-searched against an integrated catalogue of reference genes in the human gut microbiome (IGC 9.9) (Chen et al., 2019), which consists of 9.9 million genes assembled from 1,267 human faecal samples. Only fourteen of the IGC genes gave matches to the genome when requiring ≥95% identity and ≥70% of the IGC gene's bases aligned. However, this gene catalogue is mainly derived from samples from adults, while SFB in most animals peak in young individuals during weaning (Jiang et al., 2001). Therefore, we instead scanned a large recent metagenomic study consisting of a time series of faecal samples from children aged 0-3 years born in Russia, Estonia and Finland (Vatanen et al., 2016). The reads from the metagenome samples were first mapped against SFB-human-IMAG using standard settings. This rendered substantial mapping for many samples. However, manual inspection of the alignments revealed that the mapped reads were typically only partially aligned, and to regions displaying unusually high sequence conservation, such as structural RNA genes. Redoing the mapping with stringent settings (see Methods) and only counting reads mapped to protein-coding genes (CDS) gave substantially reduced mapping; however, 61 out of the 153 contigs were mapped by at least one read pair, and 7 out of the 817 samples had at least one read pair mapping. Two of these samples, one Estonian infant at day 390 (SRS1719092) and one Finnish infant at day 320 (SRS1719390), had particularly many reads mapping, with 1-3 read pairs each mapping to 24 and 44 contigs, respectively.
Although the SFB-mapping reads only corresponded to three and eleven out of a million mapped reads, respectively, in these samples, the reads appeared to be randomly distributed over the genome, indicating that the genome is present in these samples, rather than that some genome regions are wrongly binned or horizontally transferred. In comparison, zero reads from the 817 samples mapped to any of the rodent SFB genomes using the same settings, and the draft SFB genome from turkey was mapped only at two CDS, both corresponding to genes that are 100% identical at the nucleotide level to genes of several Firmicutes genomes. Since the infant metagenomes analysed were generated in another lab, it can be excluded that the mappings to SFB-human-IMAG are due to contamination with DNA from our ileostomy sample or sequencing library. We also checked for the presence of SFB in the metagenomes from intestinal luminal fluids from three Chinese children that had earlier been screened positive for SFB with PCR (Chen et al., 2018). With the exception of reads mapping to one of the above SFB-turkey CDS, no mapping to any of the SFB genomes was obtained for these samples. In summary, our analyses show that SFB-human-IMAG is present in human infant faecal material, although in very low relative abundance.
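The summary statistics quoted above (contigs and samples with at least one mapped read pair, and SFB read pairs per million mapped reads) can be derived from a per-sample count table. The sketch below is illustrative only: the function name and the nested-dictionary layout are assumptions, and the toy counts are invented for the example rather than taken from the study.

```python
def summarize_mapping(cds_counts, total_mapped):
    """Summarize read-pair counts per sample and contig.

    cds_counts: {sample: {contig: read pairs mapped inside CDS}} (assumed layout).
    total_mapped: {sample: total number of mapped read pairs in that sample}.
    Returns the set of contigs hit, the list of samples hit, and SFB pairs per million.
    """
    contigs_hit = {
        contig
        for per_contig in cds_counts.values()
        for contig, n in per_contig.items()
        if n > 0
    }
    samples_hit = [
        sample for sample, per_contig in cds_counts.items()
        if any(n > 0 for n in per_contig.values())
    ]
    per_million = {
        sample: 1e6 * sum(cds_counts[sample].values()) / total_mapped[sample]
        for sample in samples_hit
    }
    return contigs_hit, samples_hit, per_million


# Toy counts (invented for illustration).
cds_counts = {
    "s1": {"k1": 2, "k2": 1},
    "s2": {"k1": 0},
    "s3": {"k3": 1},
}
total_mapped = {"s1": 1_000_000, "s2": 500_000, "s3": 250_000}

contigs_hit, samples_hit, per_million = summarize_mapping(cds_counts, total_mapped)
```

Checking that hits are spread over many contigs, rather than concentrated on one or two, is what supports the argument that the whole genome is present and not just a shared or misbinned region.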

Conclusions
SFB hold a so far unique position in our collective knowledge of how individual components of the intestinal microbiota can affect host functions. The intimate interaction with the intestinal cells represents a remarkable evolutionary mechanism, and recent data have shown that this is indeed a route for SFB-host interaction. Although SFB have been described from many host species, conclusive data regarding a human-specific SFB have been lacking. The data presented in this study strongly suggest that such a lineage actually exists. The assembled genome clusters with the previously described SFB genomes while being clearly distinct from these. The insight that SFB could be a natural component of the human microbiota calls for intensified attempts to elucidate their impact on human physiology in general and immune development in particular.

Sample collection and storage
Samples were initially collected and processed as described by Lundin et al. (2004). Briefly, 10 adult subjects previously proctocolectomised for ulcerative colitis volunteered to participate in the experiment (two female subjects, eight males, age range 24-65 y, BMI 20.7-35.6 kg/m2).
The subjects were living a normal life based on physical examination and blood tests before the experiment. The study was approved by the Ethical Committee of the Umeå University Hospital.
Ileostomy bags were immediately frozen on dry ice and stored at −30°C. Ileostomy effluents from each 24 h period were freeze-dried to constant weight, mixed, homogenized and stored at −70°C until analysis. One of the subjects was earlier ( [...]
The products were sequenced on Illumina MiSeq with 2x300 bp together with amplicon samples from a different project. Cutadapt v.1.18 (Martin, 2012) was used to remove primer sequences, 3'-bases with a Phred score <15, and sequences not containing the expected primers. The resulting sequences were submitted to Unoise3 (Edgar, 2016). Taxonomic annotation was performed with SINA based on SILVA 132 (Pruesse et al., 2012).
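To illustrate what 3'-end quality trimming at a Phred cutoff does, the sketch below follows the partial-sum approach described in the Cutadapt documentation: the cutoff is subtracted from each quality value, the differences are summed from the 3' end, and the read is cut where that running sum is lowest. This is a plain Python re-statement for explanation, not Cutadapt's own code; the function name and the example quality values are assumptions.

```python
def trim_3prime_index(qualities, cutoff):
    """Return the index at which to cut a read when trimming low-quality 3' bases.

    qualities: list of Phred scores, one per base, in 5' to 3' order.
    cutoff: quality threshold (15 in the amplicon processing described above).
    Bases at positions >= the returned index are removed.
    """
    running = 0
    lowest = 0
    cut = len(qualities)           # default: keep the whole read
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running > 0:            # good bases now outweigh the bad tail: stop
            break
        if running < lowest:       # new minimum of the partial sum: best cut so far
            lowest = running
            cut = i
    return cut


# Example quality values (illustrative only).
quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
cut_at_10 = trim_3prime_index(quals, 10)
cut_at_15 = trim_3prime_index(quals, 15)
kept = quals[:cut_at_15]
```

A single base above the cutoff inside a poor-quality tail (the value 11 in the example at cutoff 10) does not stop the trimming, which is the point of summing rather than cutting at the first acceptable base.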

Metagenomic library preparation and sequencing
Libraries were prepared with the ThruPLEX DNA-seq kit (Rubicon genomics, Ann Arbor, MI, USA), aiming at an average fragment length of 350 bp. Sequencing was performed in a NovaSeq 6000 in S1 mode, yielding 358-410 million reads/sample.
Reads classified as human were removed prior to assembly.
Three external datasets of human gut samples were used for binning and for checking the presence of the obtained SFB MAG: 21 samples from BioProject PRJNA288044 (unpublished), 785 samples from BioProject PRJNA290380 (Vatanen et al., 2018), and 11 samples from BioProject PRJNA299342 (Chen et al., 2018). The 21 PRJNA288044 samples and the 11 PRJNA299342 samples were preprocessed by adapter and quality trimming using Trimmomatic (Bolger et al., 2014)

Taxonomic annotation of contigs
Assembled contigs were classified taxonomically using package tango (https://github.com/johnne/tango, v. 0.5.6) and the UniRef100 protein database (release 2019_02). The package queried contigs in a blastx search using diamond (Buchfink et al., 2015) (v. 0.9.22) with parameters '--top 5 --evalue 0.001'. From the results, contigs were assigned a lowest common ancestor from hits with bitscores within 5% of the best hit. Assignments were first attempted at species level using only hits at ≥ 85% identity. If no hits were available at that cutoff, an attempt was made to assign taxonomy at the genus level using hits at ≥ 60% identity, followed by the phylum level at ≥45% identity. These rank-specific thresholds were chosen from (Luo et al., 2014).
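The rank-specific assignment logic can be sketched as follows. This is a simplified illustration of the rule as described in the paragraph above, not the tango implementation: the function name, the hit dictionary layout and the toy lineages are assumptions, and the order in which the bitscore and identity filters are applied is one plausible reading of the text.

```python
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
# (deepest rank allowed, minimum percent identity), tried in this order.
RANK_THRESHOLDS = [("species", 85.0), ("genus", 60.0), ("phylum", 45.0)]


def assign_taxonomy(hits, top_percent=5.0):
    """Assign a lowest common ancestor to a contig from its protein hits.

    hits: list of dicts with keys "identity", "bitscore" and "lineage"
          (a dict mapping rank name to taxon name); this layout is assumed.
    """
    if not hits:
        return {}
    best = max(hit["bitscore"] for hit in hits)
    # Keep only hits whose bitscore is within top_percent of the best hit.
    kept = [h for h in hits if h["bitscore"] >= best * (1 - top_percent / 100)]
    for rank, min_identity in RANK_THRESHOLDS:
        usable = [h for h in kept if h["identity"] >= min_identity]
        if not usable:
            continue  # fall back to the next, more permissive rank
        assigned = {}
        for r in RANKS[: RANKS.index(rank) + 1]:
            names = {h["lineage"].get(r) for h in usable}
            if len(names) != 1 or None in names:
                break  # hits disagree here: stop at the last shared rank
            assigned[r] = names.pop()
        return assigned
    return {}


def lineage(species):
    """Build a toy lineage (illustrative names only)."""
    names = ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales",
             "Clostridiaceae", "Candidatus Arthromitus", species]
    return dict(zip(RANKS, names))


two_species = [
    {"identity": 90.0, "bitscore": 200.0, "lineage": lineage("species A")},
    {"identity": 88.0, "bitscore": 195.0, "lineage": lineage("species B")},
]
distant_hit = [{"identity": 70.0, "bitscore": 150.0, "lineage": lineage("species A")}]

result_two = assign_taxonomy(two_species)
result_distant = assign_taxonomy(distant_hit)
```

In the first toy case the two high-identity hits disagree at species level, so the contig is assigned to the shared genus; in the second, a single hit at 70% identity is not trusted below genus level.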

Phylogenetic and amino acid similarity analyses
The phylogeny of the SFB genomes was inferred using GTDB-Tk (Parks et al., 2018) (v. 0.2.2) with GTDB release 86, in both 'classify_wf' and 'denovo_wf' modes. The former placed the query genomes into an existing reference tree using pplacer (Matsen et al., 2010) while keeping the reference tree intact and was used to assign a GTDB taxonomy to the genomes. The latter instead created a new phylogenetic tree using both reference and query genomes and was used to investigate the phylogenetic relationship between the genomes. In the 'denovo_wf' mode, FastTree (Price et al., 2010) (v. 2.1.10) was used with the WAG protein model and Gamma20-based likelihoods ('-wag -gamma').
For the 16S phylogenetic analysis, one full-length 16S rRNA gene from each of the previously published complete SFB genomes, as well as from the genomes of five different species of Clostridium, was downloaded from the RDP (Cole et al., 2014). The positioning of the 16S rRNA gene in SFB-human-IMAG contig k141_89555 and in SFB-turkey contig GCF.001655775_NZ_LXFF01000001.1 was predicted with CheckM. The six 16S genes were aligned with Muscle (Edgar, 2004) and columns with gaps were removed with DegePrime (Hugerth et al., 2014). A phylogenetic tree was constructed with FastTree using the GTR+CAT model (results were nearly identical using the Jukes-Cantor + CAT model).
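The gap-column removal step can be illustrated in a few lines. The study used DegePrime for this; the sketch below is a generic stand-in in plain Python with a hypothetical function name and toy sequences, shown only to make the operation concrete.

```python
def drop_gapped_columns(alignment, gap_chars="-."):
    """Remove every alignment column in which at least one sequence has a gap.

    alignment: dict mapping sequence name to aligned sequence (equal lengths).
    """
    sequences = list(alignment.values())
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    keep = [
        i for i, column in enumerate(zip(*sequences))
        if not any(ch in gap_chars for ch in column)
    ]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


# Toy alignment (not real 16S sequences).
toy_alignment = {
    "seq_a": "AC-GT",
    "seq_b": "ACGGT",
    "seq_c": "A-GGT",
}
ungapped = drop_gapped_columns(toy_alignment)
```

Only columns where all sequences carry a base are kept, so every remaining position is comparable across all genes when the tree is built.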

Prediction of extracellular proteins
SignalP-5.0 was used to identify signal peptides in the translated ORFs of the SFB-human-IMAG draft genome. The setting of organism group was gram-positive.

Quantifying SFB in external metagenomes
Matching of the ORFs in IGC v9.9 (db.cngb.org/microbiome) against SFB-human-IMAG was performed with blastn v2.7.1+ (Camacho et al., 2009), requiring at least 80% identity over at least 70% of the query sequence. To assess the presence of SFB-human-IMAG and of SFB from mouse, rat and turkey in the faeces of young children, we used the recent work of Vatanen et al. (2016), one of the datasets that we used for the binning. Mapping of the preprocessed reads against the SFB genomes was run in 'strict' mode, where only alignments without mismatches were reported ('--score-min C,0,0' in bowtie2). Counts of read pairs mapping inside protein-coding regions (CDS) were obtained with featureCounts (Liao et al., 2014) (v. 1.6.4) with settings '-p -B -M' to only count read pairs with both ends mapped and to allow multimapping reads. The same procedure was used for mapping the shotgun reads from Chen et al. (2018).
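Why '--score-min C,0,0' only admits mismatch-free alignments follows from how bowtie2 evaluates its minimum-score function: the option gives a function type (C constant, L linear, S square root, G natural log), a constant term and a coefficient, and in end-to-end mode the best possible alignment score is 0, with penalties subtracted for mismatches and gaps. The sketch below re-states that arithmetic in Python for illustration; it does not call bowtie2, the helper names are hypothetical, and the 150 bp read length is an assumption chosen for the example.

```python
import math


def min_score(spec, read_length):
    """Evaluate a bowtie2-style score function such as 'C,0,0' or 'L,-0.6,-0.6'."""
    kind, constant, coefficient = spec.split(",")
    constant, coefficient = float(constant), float(coefficient)
    if kind == "C":                        # constant: independent of read length
        return constant
    transforms = {"L": lambda x: x, "S": math.sqrt, "G": math.log}
    return constant + coefficient * transforms[kind](read_length)


def is_reported(alignment_score, spec, read_length):
    """An alignment is valid only if its score reaches the minimum score."""
    return alignment_score >= min_score(spec, read_length)


read_length = 150                               # assumed for illustration
strict = min_score("C,0,0", read_length)        # setting used for the strict mapping
default = min_score("L,-0.6,-0.6", read_length) # bowtie2 end-to-end default

perfect_ok = is_reported(0, "C,0,0", read_length)
one_mismatch_ok = is_reported(-6, "C,0,0", read_length)  # -6: a high-quality mismatch
```

Under the default function a 150 bp read may accumulate a sizeable penalty and still be reported, whereas the constant threshold of 0 rejects any alignment carrying a penalty at all.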

Data and Code Availability
The preprocessed amplicon and shotgun sequencing reads generated during this study, and the contig sequences of SFB-human-IMAG bin, are available at the European Nucleotide Archive (ENA) under the study accession number PRJEB34939. Data files for amplicon sequence variants, genome annotations, phylogenomic analysis, genome quality estimates and metagenome read mappings are available at XXXX.