Main

Microbe–microbe interactions are important drivers of ecological and evolutionary trajectories in marine environments1,2. These interactions underpin trophic transfer of energy and carbon cycling, both being of key concern for understanding potential shifts in community assembly and ecosystem function during changing ocean conditions1,3. Protistan bacterivory via phagocytosis has a broadly appreciated role in transfer of energy to higher trophic levels in marine ecosystems1,4. Over evolutionary time scales, the simplicity of such predator–prey relationships and resultant energy transfer have been altered in cases of phagocytic engulfment that resulted in either temporary or permanent host–bacterium symbioses5,6. However, identification of symbioses, which encapsulate a range of relationships along a continuum from mutualistic to antagonistic7, has been impeded by challenges in capturing or culturing physically interacting marine microbes, inhibiting the advancement of this critical area of ocean science1,3.

Among important marine bacterivores8,9, choanoflagellates have recently been noted for their global distribution10 and interactions with bacteria. These interactions extend beyond consumption of bacteria as food, to laboratory demonstrations that bacterially derived compounds initiate both the transition to multicellularity11,12 and sexual reproduction13 in choanoflagellates. However, little is known about distributions of specific choanoflagellates or their ecological associations with individual bacterial lineages in the natural environment. Specifically, insights into the existence, nature and consequence of specific cell–cell interactions that extend beyond known predator–prey interactions with bacteria are important for moving past correlative approaches that neither capture the directionality of exchanges in interactions nor hold under non-steady state conditions. An understanding of co-associations and their ecological ramifications is essential for characterizing the mechanistic underpinnings of interactions1,2, which can then be used to improve models of the carbon cycle and of the movement of organic carbon in the ocean.

We hypothesized that the microbiomes of choanoflagellates would reveal a variety of interactions, such as predator–prey or potentially microbial symbioses, with giant viruses of choanoflagellates having been discovered recently14,15. To discover such cell–cell interactions without inducing possible selection through culturing, we used Fluorescence Activated Cell Sorting (FACS)14 to isolate single choanoflagellate cells from Pacific Ocean surface waters.

Results

Sorting wild marine choanoflagellates and microbiome analyses

Scattering properties and labelling of acidic food vacuoles with a fluorescent probe identified a population of active choanoflagellate cells that were then sorted (Fig. 1b). After DNA amplification, 18S rRNA gene V4 amplicon sequencing showed that 90% of 188 sorted choanoflagellate cells had 100% nucleotide identity across the 378 bp of the amplicon sequence to the Bicosta minor gene sequence (KU587839; ref. 16). To understand more about the distribution of this uncultivated choanoflagellate, we examined two large global ocean survey datasets, Tara Oceans and Malaspina10,17. Classification of 18S rRNA amplicons showed Bicosta in all Tara Oceans surface samples10, where they made up an average of 12% of the choanoflagellate community, and in 54% of Malaspina samples, where they made up 7% of the choanoflagellate community (Fig. 1c). Thus, B. minor appears to be relevant not only in coastal and fjord environments, where it was previously reported16, but also in the open ocean.

Fig. 1: Identification of a physical association between the uncultivated choanoflagellate B. minor and a divergent Gammaproteobacterial lineage.

a, Locations of the single-cell sorting experiment (Station M2), metatranscriptome sequencing (Stations M1 and M2) and amplicon-based rRNA gene sequencing surveys performed herein at the Monterey Bay Time Series (MBTS; M1, M2) and amplicon data analysed from the San Pedro Ocean Time-series (SPOT). b, Left: population of choanoflagellates, which belong to the Opisthokont supergroup that includes metazoans37, sorted by FACS on the basis of forward angle light scatter (proxy for cell size), fluorescence labelling of food vacuoles and absence of fluorescence from photosynthetic pigments. The green box indicates the position of bead standards run before and after sorting at the same settings. Middle: choanoflagellates sorted were almost exclusively B. minor (family Salpingoecidae). Here, ‘B. minor ASV 1’ (100% nucleotide identity to uncultivated B. minor) refers to the dominant B. minor ASV (90% of all choanoflagellate cells), while the others are less frequently observed choanoflagellate ASVs. Right: the most common bacteria detected with the dominant B. minor ASV (only bacterial ASVs present with more than two B. minor cells are shown) comprised two Gammaproteobacteria (Comchoano-1 and Comchoano-2) and a third, less common Gammaproteobacterium (86% 16S rRNA amplicon identity to Comchoano with 2% of cells), alongside a Planctomycete (Blastopirellula, 93% identity to Mariniblastus fucicola with 3% of cells), two Flavobacteria (Saprospiraceae and Lewinella, 93% and 91% identity to closest relatives Membranicola marinus and Portibacter lacus, respectively, with 2% and 1% of cells) and two Rickettsiales (87% identity to an endosymbiont of Oligobrachia haakonmosbiensis with 2% and 1% of cells, respectively). c, Oceanic distributions of choanoflagellates, Bicosta and B. minor as classified by QIIME 2 (ref. 71).
Relative abundances in surface samples are represented as percentage of 18S rRNA gene V4 amplicons from Malaspina 2010 (3 m depth, solid circles) and V9 amplicons from Tara Oceans (5 m depth, dashed circles) circumnavigations, with metazoan sequences (about 0.001% on average) excluded. Data were summed across ASVs (Malaspina) or OTUs (operational taxonomic unit; Tara Oceans) for the taxonomic level assigned by QIIME 2. Choanoflagellates were not detected in six samples (all from Malaspina).

To determine whether a microbiome exists in individual choanoflagellate cells, we sequenced 16S rRNA gene V4 amplicons from the sorted cells using primers that exclude the most probable bacterial prey, that is, the abundant surface ocean bacteria, SAR11. Low bacterial diversity was observed from the B. minor cells, with 28% exhibiting bacterial amplicon sequence variants (ASVs) at an average of 1.6 ± 1.3 ASVs per B. minor cell. In total, 22 ASVs were detected, eight of which were associated with B. minor in more than one instance (Fig. 1b). The most common microbiome members were represented by two Gammaproteobacterial ASVs that had low nucleotide identity (89–90%) to the closest cultured bacteria, members of the Coxiellales order, among which is the animal pathogen Coxiella burnetii responsible for Q fever in humans18. These two divergent Gammaproteobacteria were present in 12% of sorted B. minor cells and were detected only once in the same B. minor cell. The low average number of bacterial ASVs (1.6 ± 1.0) in B. minor cells harbouring one of the two Gammaproteobacteria indicated extremely low community diversity within these B. minor cells. Inspired by the association between the microbiome dominants and the choanoflagellate, we tentatively named these two bacteria ‘Comchoano’, meaning ‘with choano(flagellate)’.

We next sequenced genomic DNA from the sorts exhibiting Comchoano co-associated with B. minor cells. The low diversity within the single B. minor cells facilitated assembly of the full-length B. minor 18S rRNA gene (1,568 out of 1,569 bp identity to KU587839; ref. 16) as well as full-length Comchoano 16S rRNA gene sequences. In each case, the full-length rRNA genes were identical to the respective ASVs. Phylogenetic analyses of the full-length Comchoano 16S rRNA gene sequences showed that they belonged to a divergent, diverse and entirely uncultured marine lineage and confirmed that their closest cultured relatives belong to the Coxiellales order. The overall clade to which Comchoano belonged was statistically supported and otherwise comprised members of the enigmatic UBA7916 order (Fig. 2a) recognized in 201819.

Fig. 2: Phylogenetic relatedness of diverse uncultivated Gammaproteobacterial order and daily time-series dynamic with B. minor.

a, ML reconstruction using full-length (including Comchoano) or nearly full-length (minimum 1,200 bp) 16S rRNA gene sequences and 1,593 analysed positions under the GTR+Γ+I model demonstrates that Comchoano-1 (orange star) and Comchoano-2 (purple star) branch within an uncultivated lineage of marine Gammaproteobacteria. All assembled full-length 16S rRNA genes for a respective Comchoano were 100% identical. ML bootstrap support was generated with 1,000 rapid replicates and additional support (posterior probability) was generated via Bayesian inference. Black stars indicate 16S rRNA genes from SAG assemblies. The dashed orange box indicates sequences that have 95% nt identity to Comchoano. b, Daily relative abundances of B. minor, Comchoano-1 and Comchoano-2 at SPOT following a phytoplankton bloom (as captured in chlorophyll data; green shading, right axis). B. minor and Comchoano values reflect percentages of the total rRNA gene sequence reads of all eukaryotic (18S) and prokaryotic (16S) amplicons, respectively. Lines represent temporal dynamics within the size-fractionated seawater samples, with ‘protistan size fraction’ (1–80 µm) and ‘free-living prokaryote fraction’ (0.2–1 µm) being represented by solid and dashed lines, respectively. B. minor was correlated with Comchoano-1 and 2 in the protistan size fraction (r = 0.73, 0.80; P = 0.001, 0.0002, respectively; Supplementary Data 1) and with Comchoano-1 and 2 in the free-living prokaryote fraction (2 d time-shifted r = 0.62, 0.78; P = 0.019, 0.0011, respectively; Supplementary Data 1).

Geographic distribution and interaction ecology

Surprised by the co-association between this uncultivated marine bacterial lineage and wild choanoflagellate cells, we next mapped Comchoano distributions. Phylogenetic placement of amplicons from a variety of marine environments17 demonstrated that, like B. minor (Fig. 1c), they are globally distributed (Extended Data Fig. 1a) albeit at low relative amplicon abundances. We then turned to time-resolved data to gain insight into ecological relationships, specifically to data collected daily-to-weekly at SPOT, a time-series study in Pacific waters of the Southern California Bight (Fig. 1a). Across the daily portion of the time-series, Comchoano and B. minor relative amplicon abundances increased contemporaneously in the ‘protistan size fraction’ (1–80 µm) (r = 0.73, 0.80; P = 0.001, 0.0002 for Comchoano-1 and 2, respectively; Fig. 2b and Supplementary Data 1). Moreover, across the full daily-to-monthly time-series at SPOT (53 samples over 6 months), pairwise Spearman correlation analyses showed that in the protistan size fraction, Comchoano relative abundance correlates with choanoflagellates (P and q <0.05), with a higher percentage of correlations to choanoflagellates than to other eukaryotes apart from telonemids (Extended Data Fig. 1b,c). However, in the ‘free-living prokaryote fraction’ (0.2–1 µm) of the daily time-resolved part of the time-series, Comchoano 16S rRNA amplicon relative abundances increased as B. minor 18S rRNA amplicon relative abundances decreased. Thus, a significant time-lagged positive correlation was observed between B. minor in the protistan size fraction and Comchoano in the free-living prokaryote fraction (2 d time-shifted r = 0.62, 0.78; P = 0.019, 0.0011 for Comchoano-1 and 2, respectively; Supplementary Data 1). These results suggest that Comchoano were potentially released from lysed or dying B. minor cells (subject to the limitations of relative abundance-based analyses), and the time-delayed increase in Comchoano is consistent with a lytic parasitic or pathogenic role, akin to extensions of Lotka-Volterra equations for host and parasite20, or could arise by other phenomena whereby the choanoflagellate decreases in abundance while Comchoano are released into the environment. The low relative abundances of Comchoano in global ocean data, at SPOT and at Stations M1 and M2 (Fig. 1) indicate that they are not probable prey items. Indeed, at the more northerly Pacific stations where sorting was performed, monthly data showed that relative abundances of Comchoano-1 and 2 were approximately 500 and 150 times lower than that of the most abundant bacterial taxon, respectively (Extended Data Fig. 2a,b). Despite their rarity among other bacteria, the frequency of Comchoano co-associations with B. minor (12% of B. minor cells) is comparable to other co-associations known to have major biogeochemical impacts21. For example, recent estimates suggest that about 1% of unicellular plankton in the ocean are typically infected by viruses22,23 and about 25% are infected during blooms of specific taxa when virus–host encounter rates are higher22. The observed environmental distributions, the fact that Comchoano are physically co-associated with choanoflagellates, the relatively long-branch relationships24 with other bacteria and their phylogenetic proximity to bacterial pathogens of animals18 all point to Comchoano being obligately host associated, in this case with B. minor.
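
The time-shifted correlation described above can be illustrated with a minimal sketch. It assumes a Pearson coefficient and evenly spaced daily samples; the coefficient underlying the daily SPOT values is not stated in this section, and the function names and example series are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def lagged_r(host, associate, lag_days):
    """Correlate host[t] with associate[t + lag_days] for daily samples."""
    if lag_days < 0 or lag_days >= len(host) - 1:
        raise ValueError("lag must be >= 0 and leave at least two pairs")
    if lag_days == 0:
        return pearson_r(host, associate)
    # Drop the last `lag_days` host values and the first `lag_days`
    # associate values so that each pair is offset by the lag.
    return pearson_r(host[:-lag_days], associate[lag_days:])


# Hypothetical relative abundances (not data from the study): the
# associate series repeats the host series two days later.
host = [1, 2, 3, 4, 5, 4, 3]
associate = [0, 0, 1, 2, 3, 4, 5]
print(lagged_r(host, associate, 2))  # 1.0 for this constructed example
```

Relative abundances are compositional, so a coefficient computed in this way is subject to the same limitations of relative abundance-based analyses noted in the text.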

Complete genomes of Comchoano-1 and Comchoano-2

We found that the genomes of Comchoano were starkly different from those of common free-living marine bacteria. Assembly efforts rendered a single consensus circular genome of 1.01 Mb for Comchoano-1 (Fig. 3a) and two consensus contigs of 1.07 Mb for Comchoano-2 (Extended Data Fig. 3 and Supplementary Data 2). Comchoano genomes are small relative to most marine prokaryotes, smaller than all but one (archaeal) metagenome-assembled genome (MAG) among available marine MAGs, single amplified genomes (SAGs) and cultured prokaryotic genomes (see Methods for dataset description) considered to be as complete as Comchoano (Fig. 3b and Extended Data Fig. 4a). With respect to predicted protein numbers, Comchoano-1 encodes 951 and Comchoano-2 encodes 1,004 proteins, fewer than the genome-streamlined free-living bacterial lineage SAR1125, which has ~40% more predicted proteins. Estimates suggest that the Comchoano genomes are 96.6% complete based on predictions from single-copy gene (SCG) counts26. However, multiple lines of evidence contrast with a prediction of ‘incompleteness’: (1) the circularized Comchoano-1 genome sequence, (2) clear origin and terminus of replication in both Comchoano genomes, (3) nearly identical genome assemblies from across multiple single (B. minor) cells, including identical 16S rRNA gene sequences (Supplementary Data 2) and (4) the observation that Comchoano genomes always lack the same two SCGs, which are also commonly missing from other UBA7916 lineage members (Supplementary Data 3). By these measures, we concluded that the Comchoano consensus genome sequences are effectively complete, providing a springboard for investigation of gain and loss of functions in this putatively obligately associated lineage.
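
The arithmetic behind an SCG-based completeness estimate can be sketched as follows. The marker-set size used here (59) is a hypothetical value chosen for illustration and is not stated in this section; only the count of two consistently missing SCGs comes from the text.

```python
def scg_completeness(markers_present, markers_expected):
    """Percentage of expected single-copy marker genes found in a genome."""
    if markers_expected <= 0:
        raise ValueError("markers_expected must be positive")
    if not 0 <= markers_present <= markers_expected:
        raise ValueError("markers_present must lie in [0, markers_expected]")
    return 100.0 * markers_present / markers_expected


EXPECTED_MARKERS = 59  # hypothetical marker-set size (assumption)
MISSING_MARKERS = 2    # the two SCGs always absent from Comchoano genomes

estimate = scg_completeness(EXPECTED_MARKERS - MISSING_MARKERS, EXPECTED_MARKERS)
print(f"{estimate:.1f}%")  # 96.6%
```

Under this assumption, a genome that lacks two markers by lineage-specific loss would score below 100% even when its sequence is closed, which is the distinction the four lines of evidence above are drawing.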

Fig. 3: Diminutive genomes and gaps in multiple biosynthesis pathways highlight host dependencies.

a, Genome map of Comchoano-1 genome. The innermost layer shows GC skew with axes of −0.4 to 0.4 (green negative; purple positive), demonstrating the location of the replication origin and terminus. The second layer indicates GC content (with axes of 27–54% GC content; 1,000 bp moving average), showing stability of GC content across the genome at 39% GC content. The two outermost layers indicate metatranscriptome read mapping from Stations M2 and M1 one month after M2 sampling, demonstrating Comchoano gene expression in nature. b, Genome size or scaled total size (as opposed to sequence assembly size) for genomes of ‘marine’ prokaryotes (Methods). Data for Comchoano show consistent estimated size regardless of assembly method and source data. Red circles indicate taxa we identified as belonging to UBA7916 on the basis of GTDB-tk classification (Methods, also see Extended Data Fig. 7). The boxplot represents the lower and upper quartiles, the centre line is the median and whiskers are 1.5× interquartile range. c, Binned and summed SNP data across Comchoano-1 from nine B. minor single cells, indicating low variability between cells. The total variability considering all SNPs (4.5 ± 3.1 SNPs per cell) showed that just one position was variable across more than one cell (present in a gene lacking functional annotation; indicated by asterisk). d, Major pathways present in Comchoano-1 and Comchoano-2, and selected opportunistic pathogens, obligate endosymbionts and free-living bacteria. Genomes indicated by black stars are MAGs (estimated as over 90% complete) belonging to UBA7916. The heat map colour scale corresponds to the number of proteins present in each pathway or function (Methods). The estimated genome size of each bacterium is shown to the left of the taxon names (logarithmic scale). Gammaproteobacteria (apart from the Comchoano) are indicated by green text. 
Free-living replication refers to bacteria capable of free-living growth as demonstrated by cultivation studies. Genomes are clustered by the scaled values of the metabolic pathways as shown in the heat map (Manhattan distance and complete linkage clustering), with pvclust bootstrap analysis (n = 1,000). Hierarchical clustering is for visualization purposes and may change if additional pathways are considered.
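
The GC skew and GC content tracks in panel a can be illustrated with a minimal sketch. It uses the standard definition of GC skew, (G − C)/(G + C), over fixed windows; the window and step sizes, the function names and the toy sequence are assumptions, not the settings used to draw the figure.

```python
def gc_skew(window):
    """GC skew (G - C) / (G + C) of a sequence window; 0.0 if no G or C."""
    g = window.count("G")
    c = window.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def gc_content(window):
    """Fraction of G and C bases in a sequence window."""
    if not window:
        raise ValueError("window must not be empty")
    return (window.count("G") + window.count("C")) / len(window)


def sliding(sequence, size, step, statistic):
    """Apply `statistic` to each full-length window along `sequence`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    sequence = sequence.upper()
    last_start = len(sequence) - size
    return [statistic(sequence[i:i + size]) for i in range(0, last_start + 1, step)]


# Toy sequence (not genome data): a G-rich half followed by a C-rich half,
# so the skew flips sign at the midpoint.
toy = "GGGGCCCC"
print(sliding(toy, 4, 4, gc_skew))     # [1.0, -1.0]
print(sliding(toy, 4, 4, gc_content))  # [1.0, 1.0]
```

In a circular bacterial chromosome, the two positions at which windowed skew changes sign are commonly read as the replication origin and terminus, which is how the innermost track is interpreted in the caption.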

The level of Comchoano-1 and 2 genome completeness is high compared with most marine SAG assemblies to date (Extended Data Fig. 4b,c)27, providing potential insights into how the interaction between Comchoano and B. minor manifests. The results point to the presence of multiple Comchoano cells in each sorted B. minor, which would minimize the impacts of multiple displacement amplification (MDA) bias. The assemblies clearly benefited from low sequence variation in Comchoano genomes. Specifically, we observed 0.49 ± 0.34 and 27 ± 38 single nucleotide polymorphisms (SNPs) per 100 kb based on read mapping to the Comchoano-1 consensus genome assembled from different B. minor cells and the Comchoano-2 consensus genome, respectively (Fig. 3c and Supplementary Data 2). Only one SNP occurred more than once across eight Comchoano-1 genomes (each from a different B. minor cell), indicating that they are essentially clonal. The observed population structure for these dominant Bicosta-microbiome members contrasts with the extensive microdiversity in abundant free-living marine bacteria, a factor considered important to niche breadth, population stability and frequency-dependent interactions, such as host–virus dynamics27,28,29,30. Our results indicate that the evolutionary pressures acting on Comchoano differ from those acting on most known free-living pelagic bacteria31.
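
The SNP densities quoted above are normalized counts, and the normalization can be sketched as follows. The per-cell counts and the covered length in the example are hypothetical; whether the study normalized by mapped length or by full genome length is not stated in this section, so the `covered_bp` parameter is an assumption.

```python
from statistics import mean, stdev


def snps_per_100kb(snp_count, covered_bp):
    """SNP count normalized to a density per 100 kb of covered sequence."""
    if covered_bp <= 0:
        raise ValueError("covered_bp must be positive")
    return snp_count * 100_000 / covered_bp


def summarize(snp_counts, covered_bp):
    """Mean and sample standard deviation of per-cell SNP densities."""
    densities = [snps_per_100kb(n, covered_bp) for n in snp_counts]
    return mean(densities), stdev(densities)


# Hypothetical per-cell SNP counts against a 1 Mb reference (not study data).
average, spread = summarize([2, 4, 6], 1_000_000)
print(f"{average:.2f} ± {spread:.2f} SNPs per 100 kb")
```

Reporting the density rather than the raw count allows genomes of slightly different covered length, such as Comchoano-1 and Comchoano-2, to be compared on the same scale.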

Evolutionary relationships within Gammaproteobacteria

With the Comchoano genomes in hand, we resolved evolutionary relationships between this unique lineage, UBA7916 and other Gammaproteobacteria using phylogenomic approaches. The phylogenomic reconstruction provided robust statistical support for Comchoano placement within the UBA7916 order (Extended Data Fig. 5a,b), as first indicated by full-length 16S rRNA gene analysis (which also included environmental 16S rRNA sequences for which no genomic information is available; Fig. 2a). The relative evolutionary distance (RED) value19 between Comchoano and existing UBA7916 genomes (0.83) suggests that Comchoano represent a family level of divergence within UBA7916. The genomes do retain some large-scale syntenic patterns (Extended Data Fig. 5c) and the amino acid identity between Comchoano-1 and Comchoano-2 is 49% based on comparison of homologous proteins. Together, these results suggest that despite their close relationship to each other relative to other sequenced bacteria (Fig. 2a and Extended Data Fig. 5a,b), the two Comchoanos are at least separate genera, a conclusion supported by the 16S rRNA gene nucleotide sequence identity level (95%).

The overall clade representing the UBA7916 order incorporates multiple long-branching, uncultivated Gammaproteobacteria represented only by marine SAGs and MAGs (Extended Data Fig. 5b) which, similar to Comchoano, have smaller than average estimated genome sizes compared with other marine prokaryotes (Fig. 3b and Extended Data Fig. 4a). Small genome sizes are one indicator of obligate endosymbionts and pathogens6, including some Coxiellales such as the Coxiella endosymbiont (0.82 Mb) of Amblyomma americanum, a tick species (Supplementary Data 4). Until now, the UBA7916 order has remained ecologically and biologically mysterious. This is largely because no UBA7916 member has been cultured—a situation we now hypothesize persisted due to the disruption of physical associations between these bacteria and possible eukaryotic hosts during most studies or culturing efforts, or selection for other types of bacteria that thrive as free-living cells under culture conditions. The partial MAGs and SAGs available from UBA7916 most closely related to Comchoano (for example, the marine bacterial SAG CACOCF) may therefore also have come from bacteria with a host or even choanoflagellate-associated lifestyle. While lifestyle attributes or associations of UBA7916 members apart from Comchoano remain unknown, the Coxiellales—which again branched as the closest cultivated relatives—and other related orders, including the Berkiellales, Diplorickettsiales and Legionellales (Extended Data Fig. 5a,b), are noted pathogens of insects and terrestrial mammals, as well as amoebozoan and ciliate protists32,33,34 unrelated to choanoflagellates. 
Additionally, a long-branching group (Candidatus Azoamicus) in an unsupported position adjacent to Legionellales and Francisellales (with 84% nucleotide identity to Comchoano) is an obligate endosymbiont of an anaerobic ciliate from waste sludge and other freshwater environments where the much-reduced bacterium appears to provide energy to the host, seemingly akin to mitochondria35. Alongside small genome sizes, long-branching phylogenetic relationships24 with other bacteria and low microdiversity are well known features of endosymbionts found in eukaryotes6.

Metabolic traits and pathway reductions in Comchoano

To this point, our findings suggested an obligate association between Comchoano and B. minor, leading us to consider the molecular mechanisms needed to support a host-associated lifestyle, the nature of the putative association, and its possible extension to other UBA7916 lineage members. In regard to central cellular biochemistry, Comchoano and UBA7916 have most of the key proteins associated with replication and translation, including a full complement of transfer RNAs and tRNA synthetases, as well as nearly all ribosomal and cell division-related proteins (Fig. 3d, Supplementary Data 5 and 6, and Extended Data Figs. 6 and 7). Additionally, similar to many other marine bacteria, Comchoano and most other UBA7916 encode photoactive microbial rhodopsin proteins (Extended Data Fig. 8). These contain motifs suggestive of a proton-pumping function with various hypothesized roles, including energy transfer or acidification of host cellular compartments14,36. Comchoano do not encode genes for biosynthesis of retinal, the chromophore required for rhodopsin function (Fig. 3d). Thus, Comchoano would need to scavenge the chromophore from prey ingested by the host, akin to chromophore scavenging that has been demonstrated for the cultured choanoflagellate Choanoeca flexa37, or Comchoano could potentially produce retinal in an uncharacterized manner. Strikingly, however, Comchoano central metabolism is disrupted. Multiple glycolysis enzymes are absent, specifically the first two and the last three in the pathway, similar to presence/absence patterns seen in other UBA7916 (Fig. 3d and Extended Data Fig. 9a). Two of the three glycolytic proteins that are retained, 6-phosphofructokinase and fructose-bisphosphate aldolase, are also essential to the pentose phosphate pathway (PPP), which is fully encoded, as are the tricarboxylic acid (TCA) cycle and oxidative phosphorylation (Fig. 3d and Extended Data Fig. 7). 
Utilizing the pentose phosphate pathway would require fructose-6-phosphate from the choanoflagellate (which can produce it via glycolysis, Supplementary Data 7) and the product, 3-phosphoglycerate, could then be used in processes such as amino acid biosynthesis. In place of losses of function for traditional glycolysis, Comchoano may meet energy demands using an ATP/ADP translocase that they possess, which can directly import ATP from the host environment. This has been reported as the mechanism by which obligate endosymbiotic bacteria38,39 obtain energy from the host (for example, the obligate intracellular pathogen Chromulinavorax destructans found in a freshwater stramenopile protist40). The ATP/ADP translocase encoded by Comchoano and most UBA7916 (Fig. 3d and Extended Data Fig. 6) is phylogenetically similar to those reported in other ‘energy parasites’ (Extended Data Fig. 10a,b)38. Some translocases within this broad protein family have been reported to have affinity for nucleotides such as guanosine di/tri-phosphate, calling for further experimental studies of translocase affinities35,41. Recently, this translocase was implicated as the mechanism by which Ca. Azoamicus endosymbionts provide ATP to their freshwater ciliate hosts. Here we find that the closest affiliation of the Comchoano translocases is with an ATP/ADP translocase in an alphaproteobacterial endosymbiont (Extended Data Fig. 10b), with both maintaining all the motifs for ATP/ADP translocation that have been demonstrated as essential on the basis of site-specific mutations (Supplementary Fig. 1) in versions present in parasitic fungi42. Collectively, these results show that Comchoano, and potentially most other UBA7916, have reached a host-dependent state for sugar compounds. 
With the twist of potential augmentation by microbial rhodopsins, the energetics of these bacteria are akin to those of bacterial parasites of hosts in other environments40, with the energy requirements of Comchoano being satisfied through exploitation of B. minor.

Comchoano have multiple other host dependencies, with marked losses in biosynthesis of major constituents of bacterial cell walls, mirroring patterns in endosymbiotic and parasitic bacteria, such as Mycoplasma6 and Chromulinavorax40. In particular, Comchoano and several UBA7916 lack or have reduced pathways for fatty acid biosynthesis (Fig. 3d, and Extended Data Figs. 6 and 9b). Production of essential phospholipids therefore probably utilizes enzymes present for fatty acid degradation and a partial glycerophospholipid biosynthesis pathway (Extended Data Fig. 7). Biosynthesis of a major structural component of bacterial cell walls, lipopolysaccharide, is also lacking (Fig. 3d). In host-associated model bacteria, including several Gammaproteobacteria, loss of this structural barrier in outer cell membranes increases permeability for hydrophobic molecules43,44. Lipopolysaccharides also stimulate host recognition and immune responses in characterized pathogenic host–microbe interactions43, and their detection results in phagocytosis of bacteria as they invade multicellular eukaryotes45. By functional analogy, lipopolysaccharide loss in Comchoano suggests an enhanced capacity for compound acquisition from the host and avoidance of detection and phagocytosis by the host.

Additional Comchoano auxotrophies highlight other facets of a host-dependent lifestyle. Unlike many free-living bacteria, Comchoano encode only a few proteins of B-vitamin and amino acid synthesis pathways and those that they do encode have roles in other pathways (Fig. 3d, Extended Data Fig. 7 and Supplementary Data 8). Although they are vitamin46 and amino acid auxotrophs, there are clear mechanisms for alleviation in their gene repertoires. For example, amino acid auxotrophy is alleviated by 24 and 23 amino acid transporters encoded by Comchoano-1 and Comchoano-2 (Supplementary Data 9), respectively, demonstrating utilization of the host environment for amino acid supplies. The amino acids transported from B. minor remain to be established as the specificities of the transporters identified are generally not known even in cultivated bacteria such as Coxiella47. Comchoano also encode enzymes for amino acid conversions in other pathways, alongside numerous transporters for the uptake of ions (for example, nitrogen, phosphorus and iron), osmolytes and other small organics (Supplementary Data 9). Thus, the hypothesis of an intracellular host-dependent lifestyle was borne out by Comchoano’s metabolic capacities and reductions therein.

Tracing co-association along the symbiosis continuum

Determination of where microbial associates rank along the continuum from mutualism to pathogenesis is challenging whether in cultured systems or not, because interactions with hosts are often context dependent and can shift under different environmental scenarios, as well as being labile across evolutionary time48,49,50. Examples of microbial endosymbionts for which genomic information is available abound for heterotrophic protists belonging to other (non-Opisthokonta) eukaryotic supergroups, especially taxa residing in freshwater, soil, sediment and host-associated (for example, protists residing in termite guts) environments33. Fewer are known from marine habitats apart from sediments where, for example, symbionts with foraminiferans, excavates and ciliates have been reported51,52,53. Those reported from seawater appear to generally come from ciliates and diplonemids isolated from coastal environments54 and saltwater aquaria55. The best-known examples of microbial symbioses in the pelagic ocean between protists and endosymbiotic bacteria generally involve eukaryotic algal species and nitrogen-fixing cyanobacteria, with the latter providing organic nitrogen to the eukaryotic phytoplankton cell in exchange for carbon resources56.

Visualization of the Bicosta-Comchoano relationship has not yet been achieved, largely due to their uncultivated status and ephemeral abundances in a dynamic habitat, where the likelihood of re-encountering the targeted interaction on the spatial and temporal scales at which oceanographic research is conducted is low. This presents challenges and statistical limitations for efforts such as hybridization chain reaction fluorescence in situ hybridization (HCR-FISH, as attempted herein; Methods) to capture co-associations that occur either ephemerally or at relatively low abundances. These efforts can also be hampered by insufficient signal-to-noise/negative fluorescence ratios in field samples (which also contain naturally autofluorescent particles)57,58, or by difficulties in interpreting signals from small uncultivated heterotrophic protists such as B. minor, which could also contain prey cells. Collectively, these issues render the genomic data collected in the context of unperturbed cell co-associations invaluable. With respect to choanoflagellates and higher-resolution transmission electron microscopy (which is possible for abundant cells in culture), one possible endosymbiont has been noted in a cultured Baltic Sea species from brackish waters; however, the nature of the putative host–microbe interaction and the functional and phylogenetic features of the putative bacterium59 all remain unknown. Hence, in many regards, the completion and analysis of the Comchoano genomes and the technology-enabled methods used herein for establishing their physical association with choanoflagellates provide the most compelling evidence yet possible for their relationship and its ramifications.

Analyses of the Comchoano genomes did not reveal an obvious ‘exchange currency’ that would benefit the choanoflagellate host. This contrasts with findings from cultured bacterial endosymbionts and endosymbiont consortia of many different animal and protistan hosts6,60,61. For example, in carpenter ants, the bacterium Blochmannia floridanus produces amino acids that are ‘exchanged’ for host-supplied carbon and nitrogen substrates. Thus, Comchoano’s requirements appear to impose an energy and resource ‘tax’ on hosts, with no apparent return benefit. The impact of this ‘tax’ could range from mostly neutral for the host to strict pathogenicity under ocean conditions that are challenging for host growth and energy acquisition.

One feature of confirmed intracellular pathogens like Legionella, Coxiella and Chromulinavorax32,40 is that they encode a large suite of proteins attributed to pathogen–host interactions. These types of proteins, including effectors used for host manipulation by all three of these pathogens, are absent from free-living marine bacteria such as SAR11 but present in both Comchoano genomes (Fig. 3d and Supplementary Data 10). In Comchoano, they also include abundant eukaryote-like domains, especially ankyrin and leucine-rich repeat domains. Additionally, Comchoano-2 encodes three Sel1 repeat-containing proteins, involved in cellular trafficking during Legionella infection62, alongside a transcription activator-like (TAL) effector involved in manipulating host transcription in plant pathogens and fungal endosymbionts63. We did not find the recently described polysaccharide lyase (EroS) shown to induce sexual reproduction in the cultivated choanoflagellate Salpingoeca rosetta13. The identified gene repertoire in Comchoano is generally present in other UBA7916 members as well, often in similarly high numbers as in Comchoano (Extended Data Fig. 6 and Supplementary Data 10), and in numerous protist-associated endosymbionts (Fig. 3d), as exemplified by the amoeba symbiont Amoebophilus asiaticus64. These results point to host manipulation being important across most of the UBA7916 lineage and a lifestyle involving multiple levels of host-directed interactions.

Comchoano has a specialized type IV secretion system

A unique aspect of Comchoano relative to the vast majority of other marine bacteria is the presence of a complete type IV secretion system (T4SS) of the subtype pT4SSi. All pT4SSi genes were expressed in metatranscriptomes constructed from the station and time point when co-associated cells were sorted (Fig. 4a). This specific T4SS subtype is homologous to those of the pathogens Coxiella and Legionella, which secrete a broad array of effector proteins for evading host defenses and manipulating host pathways32. Phylogenetic analyses of a protein present across all T4SS (VirB4/IcmB/TraU) placed Comchoano in a statistically supported position within the pT4SSi clade adjacent to those for Coxiella (Fig. 4b,c). This placement indicates that pT4SSi is ancestral to UBA7916 and Coxiellales and was present before divergence of marine and non-marine lineages. Moreover, our analyses showed that while other secretion systems are common in marine bacteria, pT4SSi are scarce in marine bacteria (Fig. 4d). It should be noted that pT4SSi lack the relaxase used for conjugation that is typical of other T4SS, and hence cannot function in conjugation between bacterial cells65. Apart from Comchoano, pT4SSi were detected in five additional UBA7916 and 25 other marine bacteria (Supplementary Data 11). The majority of the latter are related to described pathogens: seven Micavibrionales (Bdellovibrio-like bacteria66), four Coxiellales, four Pseudomonadales and two Legionellales (Supplementary Data 11). The presence, synteny and phylogenetic conservation of the pT4SSi of Comchoano and distant relatives such as Coxiellales and Legionellales, imply that host association is probably an ancient trait for these lineages broadly. 
However, the pT4SSi and other features (for example, eukaryotic-like domains, ATP/ADP translocase) that are conserved among Comchoano and close relatives, paired with variations in the extent of reduction in genome size and metabolic capacities, suggest that the obligate nature of association is a more recent and sporadic trait.

Fig. 4: Comchoano’s T4SS is rare in free-living marine bacteria and phylogenetically closest to those of confirmed pathogens.

a, T4SS proteins present in C. burnetii, Comchoano-1 and Comchoano-2 (represented to scale on the genome scaffold, that is, nucleotides). Synteny is indicated (grey shading), along with percent amino acid similarity between homologous proteins (darkness level of shading). Colour fills indicate metatranscriptome read mapping (reads per kb million, rpkm) from Pacific Ocean station M2 (all pT4SSi genes were expressed). The break mark shown in the C. burnetii genomic segment represents ~5,800 bp. b, ML phylogenetic reconstruction of 1,629 T4SS VirB4/IcmB/TraU proteins with tips coloured by bacterial class. c, Rooted subtree ML phylogeny of 72 pT4SSi, T4 type I and related sequences from a supported clade extracted from b representing an analysis of homologous positions. For the subtree, some clades were collapsed for display purposes and node support is indicated by open circles (≥95% ML; 1 posterior probability, Bayesian) or numerical percentages (if ≥50% ML or ≥0.9 posterior probability, Bayesian). d, Prevalence of secretion systems across genomes of 18,671 marine prokaryotic isolates, SAGs and MAGs, regardless of genome completion level (Methods). Arrow, the pT4SSi possessed by Comchoano and other UBA7916 members, as well as 25 Proteobacteria from other lineages (as defined by the presence of the mandatory VirB4/IcmB/TraU and greater than six component genes; Methods), the majority being classified as Coxiellales or Legionellales and relatives of host-associated bacteria in terrestrial ecosystems. Shading within bars reflects distribution of each subtype of the respective secretion system type, where applicable (for example, the T5a–c bar has three shaded segments representing T5a, T5b and T5c).

Discussion

Our findings to this point call for the naming of what has thus far been an enigmatic order of uncultivated bacteria with little known about its ocean roles or lifestyle. We propose the following status: Candidatus Comchoanobacterales ord. nov., Candidatus Comchoanobacteraceae fam. nov. for what has been known so far as the UBA7916 order and family, respectively, following the protocols of order and family naming after type species (see below and protologue in Supplementary Information). This status is proposed due to Comchoano and related UBA7916 members being phylogenetically distinct from other Gammaproteobacteria orders (Fig. 2a and Extended Data Fig. 5a,b) and their distance based on RED and amino acid identity (AAI) metrics. For the type species, that is, Comchoano-1, we propose the status Candidatus Comchoanobacter bicosticola gen. et sp. nov. and for Comchoano-2, Candidatus Synchoanobacter obligatus gen. et sp. nov.

Understanding the drivers of microbial dynamics and cell–cell interactions is essential to elucidating marine elemental and energetic cycles1,21. Nevertheless, little is known about direct microbe–microbe associations in the oceans and their ecological implications, especially for small heterotrophic protists. Advancements have been hindered by the paucity of methods for preserving intact relationships between cells collected in pelagic marine ecosystems. Here we report the discovery of a microbiome in Pacific Ocean single-celled choanoflagellates enabled by methodologies for circumventing both community changes that occur during cultivation efforts and disruption of associations caused by most field sampling approaches. The implications stemming from the phylogenetic, population and genomic features of the microbiome-dominant Comchoano have still unappreciated ramifications. Recently, obligate bacterial associates of heterotrophic protists have been increasingly noted and their genomes have been sequenced from freshwater and terrestrial habitats33. However, the vast majority of bacteria sequenced or cultured so far from the ocean maintain pathways for energy conversions and biosynthesis of essential compounds important to survival as free-living cells in the often resource-depleted marine environment29,30. In the case of Comchoano, its membrane modifications and molecular mechanisms for avoiding host detection and manipulating hosts, alongside limited metabolic capacities, requirement for numerous substrates and direct energy exchange, implicate a ‘resource drain’ that could have detrimental impacts for the host, depending on environmental conditions. The findings also provide a possible mechanism for concentrated delivery of high levels of bacterial compounds directly to the eukaryotic host, without dilution caused by release in seawater, such as the compound levels required to trigger sex and multicellularity in choanoflagellates11,12,13. 
The widespread distribution of both the bacterivorous host and bacterial symbiont discovered herein, as well as the diversity of potentially host-associated uncultivated bacteria related to Comchoano, call for intensified efforts to identify cryptic symbioses and deeper knowledge of the strength and directionality of their influence on resource flow in the ocean.

Methods

Sampling and cell sorting

For cell sorting, seawater was collected on 20 March 2014 at Monterey Bay Time Series station M2 (36.688° N; 122.386° W, 56 km from shore, Fig. 1a) using Niskin bottles mounted on a rosette bearing a conductivity, temperature, depth (CTD) instrument. As reported in an earlier paper describing the sample preparations for the choanoflagellate sorting experiment14, water from 20 m depth was pre-filtered with a 30 μm nylon mesh (to remove protists of a size likely to clog the flow cytometer) and concentrated by gravity over a 47 mm diameter, 0.8 μm pore size Supor (Pall) filter to a theoretical concentration of 250x. The latter step concentrates protists while lowering the relative numbers of free-living bacteria in the sample; it is not essential to remove all free-living bacteria as they will be selectively removed by single-cell sorting of the protists, where only co-associated bacteria will be present. The concentrated seawater was labelled with LysoTracker Green DND-26 (Invitrogen), a label that targets acidic food vacuoles of living cells67, to a final concentration of 25 nM from a working stock of 10 µM diluted in artificial seawater. The labelled sample was analysed on a BD Influx flow cytometer equipped with a 488 nm laser and running on sterile nuclease-free 1x PBS as sheath fluid.

The sorted population was discriminated on the basis of positive LysoTracker signal (that is, fluorescence detected in the 520/35 nm bandpass filter under 488 nm excitation) as compared to an unlabelled sample, as well as absence of chlorophyll-a autofluorescence (that is, fluorescence detected in the 692/40 nm bandpass filter) and comparable forward angle light scatter (a proxy for cell size) to select for a coherent population of heterotrophic eukaryotes. Listmodes were analysed using Winlist (version 7.0, Verity Software House). Single cells from the population with these characteristics were sorted in a 384-well plate using the single-cell sorting mode from the BD FACS Sortware (software v1.0.0.650), ensuring that only one cell would be sorted in each well. A subset of wells was left empty or received 20 cells for negative and positive controls, respectively. The plate was illuminated with ultraviolet light for 2 min before performing the sort, and covered with foil and frozen at −80 °C immediately after its completion.

Sorted-cell sequencing and analysis

For single-cell sorts, sorted cells were subjected to alkaline lysis at room temperature, followed by whole-genome amplification by MDA with the RepliG single-cell kit (Qiagen) or WGA-X workflow (Supplementary Data 2) in 2 μl reactions set up with an Echo acoustic liquid handler (Labcyte). For single-cell samples, partial 16S rRNA gene sequences were amplified using V4 primers 515F-Y (GTGYCAGCMGCCGCGGTAA) and 806R (GGACTACNVGGGTWTCTAAT)68 and partial 18S rRNA gene sequences were amplified using V4 primers TAReuk454FWD1 (CCAGCASCYGCGGTAATTCC) and TAReukREV3 (ACTTTCGTTCTTGATYRA)69. 18S rRNA gene amplicons were initially processed as reported when the 18S data were originally published14 using UCLUST70 to cluster sequences into 99% OTUs (Fig. 1b, left panel). Subsequently, to potentially further resolve OTUs to ASVs using state-of-the-art methods, all amplicon sequences (16S and 18S) were processed within QIIME 271 by trimming primers with cutadapt (v1.13)72, and then quality filtering and denoising with DADA273, generating amplicon sequence variants (ASVs). During the denoising step, forward and reverse 16S rRNA gene sequences were trimmed to 210 and 180 bp, respectively; 18S rRNA gene sequences were trimmed to 250 and 200 bp, respectively. Separate MiSeq runs were denoised independently and combined after denoising, as required for generation of accurate error profiles. Both 16S and 18S ASVs were classified in QIIME 2 with classify-consensus-blast with --p-perc-identity 0.85 and --p-maxaccepts 1 using the SILVA 132 99% clustered representative as a reference database74 and majority_taxonomy_7_levels as the taxonomy file. The dominant 18S rRNA amplicon sequence was affirmed to have 100% similarity across 378 bp to the choanoflagellate B. minor, which was hand-picked and sequenced from Danish marine surface waters but remains uncultured16. For ASV processing, 16S rRNA ASVs classified as eukaryotic or chloroplasts were removed. 
The remaining ASVs classified as bacterial and archaeal reads were selected and then those that were classified as mitochondria went through a second curation step. The mitochondrial sequences were searched via blastn75 against the NCBI nucleotide database, and after excluding the hits to uncultured taxa, those that had best hits to mitochondria were removed from further processing.

After denoising and classification, cells that had fewer than 5,000 reads were removed from further processing. The remaining single-cell samples had an average number (±s.d.) of reads of 710,947 ± 464,742 and 159,492 ± 134,364 for the 16S and 18S rRNA gene sequences, respectively. The environmental samples had 93,352 ± 23,463 and 163,737 ± 16,046 reads for the 16S and 18S rRNA amplicon sequences, respectively. For the single-cell analyses, due to the biases introduced from MDA, ASV relative abundance data were converted into presence and absence data, where read proportions >5% and >25% for the 16S and 18S rRNA gene datasets, respectively, were considered ‘present’. These conservative values were chosen to reduce the likelihood of ‘cross-talk’ (for example, see ref. 76) between samples influencing co-occurrence frequencies.
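The read-depth filter and presence/absence conversion described above can be illustrated with a minimal Python sketch. The thresholds (5,000 reads; >5% for 16S and >25% for 18S) come from the text, whereas the function name, the input layout (a dictionary of per-cell ASV read counts) and the example values are hypothetical and are not taken from the original analysis code.

```python
# Thresholds from the text; names and data layout are illustrative assumptions.
MIN_READS = 5000
PRESENCE_THRESHOLD = {"16S": 0.05, "18S": 0.25}


def presence_absence(counts_per_cell, marker):
    """Convert per-cell ASV read counts into presence/absence calls.

    counts_per_cell: {cell_id: {asv_id: read_count}}
    marker: "16S" or "18S"
    """
    threshold = PRESENCE_THRESHOLD[marker]
    calls = {}
    for cell, counts in counts_per_cell.items():
        total = sum(counts.values())
        if total < MIN_READS:
            continue  # cell removed: fewer than 5,000 reads
        # An ASV is 'present' only if its read proportion exceeds the threshold.
        calls[cell] = {asv: (n / total) > threshold for asv, n in counts.items()}
    return calls


example = {
    "cellA": {"asv1": 9000, "asv2": 400, "asv3": 600},
    "cellB": {"asv1": 100},
}
calls_16s = presence_absence(example, "16S")
```

In this toy input, cellB is dropped for low read depth and asv2 (4% of reads in cellA) is scored as absent.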

Genome sequencing, assembly, binning, curation and assessment

Metagenomic sequence libraries were prepared from MDA products as described above with either the Qiagen repliG enzyme or WGA-X (Supplementary Data 2). Metagenomic sequencing was performed with paired-end 2x 150 or 2x 250 Illumina sequencing for population sorts and individual cells (Supplementary Data 2). For each sample, the sequences were assembled with SPAdes (v3.11)77, with the --sc setting and kmers of 21, 33, 55, 77, 99 and 127. Contigs longer than 1 kb were then binned on the basis of their tetranucleotide coverage and GC (Guanine-Cytosine) content with anvi’o v6.278 (Supplementary Fig. 2). Genomes were recovered from 18 single cells and two population sorts (20 cells each) (Supplementary Data 2). From both individual single cells and population sorts, the Comchoano genomes were often highly complete (up to 96.6%, average ± s.d. 92.0 ± 8.0%) (Supplementary Data 2), as estimated with CheckM taxonomy_wf workflow for domain Bacteria26. Assemblies from only one B. minor cell with Comchoano rendered a second full-length 16S rRNA gene sequence, a Flavobacterium whose genome was otherwise not well-recovered (estimated <5% complete, Supplementary Data 2).

Preliminary analysis showed the individual genomes to be highly similar within a given type (identical 18S rRNA gene sequences and 99.9 ± 0.08% whole-genome nucleotide identity, FastANI79), so we sought to improve the genome assemblies by a combination of automated and manual curation steps. Contigs from each Comchoano individual genome (that is, from individual B. minor cells or the combined 20 cell population sorts) were pooled and re-assembled in Geneious Prime (version 2020.2.4), using the Geneious assembler with the following settings: maximum percent gap of 1% per contig, maximum gap size of 50 bp, overlap of 50 bp with a percent identity of 99%, and total mismatch number of 1%. Before this step, three contigs from Comchoano-2 were excluded due to long (>5 kb) repeat sequences not observed in the other contigs. For Comchoano-1, this approach produced a single contig similar in size (1.01 Mb) to the average individual genome size of the individual genome assemblies. Other (smaller) contigs produced by the assembler were observed to be highly similar via progressiveMauve80 with the large contig with minimal differences probably due to sequencing and assembly artefacts. For these reasons, we proceeded with this single contig for final genome curation and polishing (see below). For Comchoano-2, the Geneious assembler produced multiple contigs of ~696 kb and ~370 kb. Similar to Comchoano-1, these two sets of contigs within these two size ranges were shown to be highly similar via progressiveMauve, so we proceeded with the longest contigs (696,580 and 370,342 bp) for additional genome polishing.

For final genome curation steps, we mapped all the original contigs from single cells and population sorts to their respective Comchoano genome with Minimap2 for highly similar sequences (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 -N50 --min-occ-floor=100)81. From these mapped contigs, the consensus was predicted on the basis of nucleotides that were greater than 50%, with coverage greater than two. Subsequent whole-genome alignment demonstrated that the ends of the contigs of each Comchoano tended to be enriched in repeats. In the case of the single Comchoano-1 contig, inspection of the alignment revealed a stretch of nearly identical sequences on the 5’ end and near the end of the 3’ end (98.6% of nucleotides across 212 bp), which was followed by a highly similar repeat region on the 3’ end of the contig. This 3’ end was represented by only a single original contig, hence was probably an artefact. Thus, the 3’ end of the Comchoano-1 genome was trimmed to remove this redundancy starting at the highly similar overlap. To further curate on the basis of the original paired Illumina reads, all reads were mapped to the single Comchoano-1 contig and two Comchoano-2 contigs with Bowtie 282, after which up to 125,000 reads from each sample were examined to confirm consistent coverage and further polish the consensus sequences. From these mapped reads, a new consensus was determined on the basis of 50% majority of mapped reads (the original consensus and mapped-reads curated contigs were greater than 99.99% similar for both Comchoano-1 and Comchoano-2). At this point, in the case of Comchoano-1, reads overhanging each end were identical, enabling circularization (further validated by read mapping and circularization, see below). For each consensus Comchoano genome, the origin of replication was predicted with OriFinder (on the basis of GC skew and DNA replication binding motifs)83. 
Comchoano-1 was subsequently re-oriented, with the first position being the origin of replication and binding the two ends together; because Comchoano-2 was not a single contig, the contigs were re-oriented to be in the proper orientation according to GC skew, but not arranged to start at the origin of replication. For Comchoano-1, the reads from each sample were then re-mapped with Bowtie 2 and up to 200,000 paired reads from each sample were again examined (consensus taken again, resulting in 82 bp changes across the genome) with 100% coverage.
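The majority-rule consensus used during curation (a base is accepted when it accounts for more than 50% of the mapped sequences at a position covered more than twice) can be sketched as follows. This is an illustration only: the actual consensus was generated in Geneious, and the function names, the column-wise input and the choice to emit 'N' where the rule is not met are assumptions made for this example.

```python
from collections import Counter


def consensus_base(column, min_fraction=0.5, min_coverage=2):
    """Return the majority base for one alignment column, or 'N'.

    column: string of bases mapped at one position ('-' marks no coverage).
    Emitting 'N' when the rule fails is an assumption of this sketch.
    """
    bases = [b for b in column if b != "-"]
    coverage = len(bases)
    if coverage <= min_coverage:  # the text requires coverage greater than two
        return "N"
    base, count = Counter(bases).most_common(1)[0]
    # The text requires the base to exceed 50% of mapped nucleotides.
    return base if count / coverage > min_fraction else "N"


def consensus(columns):
    """Build a consensus string from a list of alignment columns."""
    return "".join(consensus_base(col) for col in columns)


toy_consensus = consensus(["AAAT", "CC", "GGTT", "TTT"])
```

In the toy input, the second column fails the coverage requirement and the third is an exact 50:50 split, so both are left unresolved.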

Upon re-mapping of the original contigs from individual single cells with nucmer (default settings, except -b 5000) to the Comchoano-1 working consensus, SNP analysis (predicted with show-snps -ClHTr) showed three instances in which the consensus departed from the majority of single-cell assemblies. Thus, the original contigs were remapped with Minimap2 in Geneious and the majority bases (over 50%) were taken.

After this, SNPs between the consensus Comchoano-1 genome and the assembled Comchoano-1 from single B. minor cells were again predicted with nucmer (default settings, except -b 5000)84 and then SNPs were again predicted with show-snps -ClHTr. Nine positions had more than one cumulative SNP across all cells. All but one of these SNPs occurred across one of two pairs of highly similar paralogous sequences, and as such were probably the result of challenges to assembly of these highly similar sequences. These SNPs were manually inspected via read mapping and visualization in Geneious, which revealed that 1 bp in the consensus was probably incorrect and was thus corrected in the Comchoano-1 consensus genome, resulting in the final consensus Comchoano-1 genome.

To identify single nucleotide polymorphisms between the Comchoano consensus genomes and Comchoano from B. minor single cells, reads were mapped with Bowtie 2 default settings from eight B. minor cells containing Comchoano sequences. Read pairs that mapped discordantly or more than once were excluded. Coverage was then calculated with samtools mpileup85 and polymorphic sites predicted with bcftools call --ploidy 1 -P 1.1e-10 -v -m86. Resulting SNPs were further filtered with bcftools filter to exclude SNPs within 3 bases of indels, quality of less than 30 and coverage of 10 or less. Additionally, SNPs were examined between the original contigs from the single cells and population sorts to the consensus genomes with nucmer and show-snps, as described above. The end (2,555 bp) of one 167 kb contig from a single cell was filtered from the final output (Supplementary Data 2) because a region with greater than 10% divergence was observed over this region, which was suspected to be due to assembly error.
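The three SNP filtering criteria listed above (proximity to indels, quality and coverage) can be expressed as a small Python sketch. The filtering itself was done with bcftools filter; the `Variant` record and `filter_snps` function here are hypothetical stand-ins, and each indel is simplified to a single coordinate.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    pos: int          # 1-based reference position
    qual: float       # variant quality score
    depth: int        # read coverage at the site
    is_indel: bool = False


def filter_snps(variants, indel_gap=3, min_qual=30, min_depth=10):
    """Keep SNPs that pass the criteria stated in the text.

    Removed: SNPs within `indel_gap` bases of an indel, with quality below
    `min_qual`, or with coverage of `min_depth` or less.
    """
    indel_positions = [v.pos for v in variants if v.is_indel]
    kept = []
    for v in variants:
        if v.is_indel:
            continue  # only SNPs are reported
        if v.qual < min_qual or v.depth <= min_depth:
            continue
        if any(abs(v.pos - p) <= indel_gap for p in indel_positions):
            continue
        kept.append(v)
    return kept


calls = [
    Variant(100, 50, 20),                 # 2 bp from an indel
    Variant(102, 60, 30, is_indel=True),
    Variant(200, 29, 50),                 # quality below 30
    Variant(300, 40, 10),                 # coverage of 10 or less
    Variant(400, 30, 11),                 # passes all filters
]
passing = filter_snps(calls)
```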

Genome annotation

Comchoano-1 and Comchoano-2, and other genomes used in comparative analyses were annotated as follows (note, the same methods were used for annotating the genomes compared to Comchoano in Fig. 3d). tRNAs were predicted with the Aragorn pipeline (v1.2.38)87. Proteins were predicted by Prodigal (v2.6.3)88. Protein annotations were determined using EggNOG emapper.py (version 2.0.1–14)89,90, with the diamond blastp search option and diamond database downloaded on 19 Mar 2019. Additionally, protein domains were annotated via hmmscan91 using the pfam database as a reference92; ankyrin repeats and leucine-rich repeats identified with hmmscan of pfam were checked using blastp (e-value < 1 × 10−10). Amino acid auxotrophy and pathway calculations were predicted by annotation on the KBase93 web server by first predicting proteins via RAST94 and then applying the fba_tools Predict Genome Auxotrophies tool (v.1.7.6)93. Metabolic pathway maps (Extended Data Fig. 7) were created using Pathway Tools (v22.0)95. Putative signal peptides were identified with the Phobius web server (phobius.sbc.su.se)96. Rhodopsins and retinal biosynthesis proteins were identified from hmmscan of pfam (e-value < 1 × 10−10) as follows: rhodopsin (PF01036.19), GGPP synthase (PF00348), phytoene synthase (PF00494.18), lycopene cyclase (PF05834) and beta-carotene 15,15’-dioxygenase (PF15461.5). Phytoene dehydrogenase was identified by blastp (e-value < 1 × 10−10) search of the conserved protein domain family TIGR02734: crtI_fam. Transport proteins were additionally annotated via the web-based transporter annotation tool TransAAP97 and amino acid transporters were identified by blastp against a dataset of predicted amino acid transporters in Coxiella burnetii RSA49398. ATP/ADP translocases were identified via hmmscan using the TLC ATP/ADP transporter pfam (PF03219) (e-value < 1 × 10−25). The B. minor genome was searched for proteins involved in glycolysis using the EggNOG diamond search blastx search option89,90.

Initial secretion system analysis prediction (Fig. 3d) was performed via txsscan65 (galaxy webserver version), excluding type IV pili and flagellar genes. Subsequently, MacSyFinder (version 1.0.5) was used to identify the secretion systems in Comchoano-1 and Comchoano-2, as well as genomes from the same dataset of genomes as used for genome size comparison (n = 18,671, but excluding any genomes <50% complete)65, with default settings and the ordered replicon (due to the assembled contiguous nature of the genes, even when the genomes were in multiple contigs). The ‘--min-mandatory-genes-required’ parameter was set to three for T4SS_typeI, the maximum number for this system (due to some overlap with pT4SSi, in particular for Comchoano-1 and Comchoano-2), but left at default for all other systems. Specifically, Comchoano pT4SSi was identified through the presence of virB4/icmB and ≥6 accessory proteins (each Comchoano has nine), alongside the absence of the conjugation-related relaxase typical of other T4SS subtypes. For pT4SSi comparison with C. burnetii, the nomenclature is based on the original publication of the C. burnetii genome99. T4SS proteins not localized to the genomic regions shown in Fig. 4a are IcmX and IcmW in Comchoano-1, and IcmN and IcmE in Comchoano-2.
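The decision rule for calling a pT4SSi, as stated above, reduces to three conditions. The Python sketch below encodes only that rule; the record type and function name are hypothetical, and in the study the calls were made with MacSyFinder rather than with code of this kind.

```python
from dataclasses import dataclass


@dataclass
class T4ssProfile:
    has_virb4: bool      # mandatory VirB4/IcmB ATPase detected
    n_accessory: int     # number of accessory pT4SSi proteins detected
    has_relaxase: bool   # conjugation-related relaxase detected


def is_pt4ssi(profile, min_accessory=6):
    """Apply the rule from the text: VirB4/IcmB present, at least
    `min_accessory` accessory proteins, and no relaxase."""
    return (
        profile.has_virb4
        and profile.n_accessory >= min_accessory
        and not profile.has_relaxase
    )


# Each Comchoano genome is reported to carry nine accessory proteins.
comchoano_like = T4ssProfile(has_virb4=True, n_accessory=9, has_relaxase=False)
conjugative_like = T4ssProfile(has_virb4=True, n_accessory=9, has_relaxase=True)
```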

Metatranscriptomic reads were mapped to the Comchoano genome assemblies via bbmap.sh100 (v37.17) at a sequence similarity cut-off of 0.99. Mapped reads were parsed via HTSeq-count101.
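Expression values in Fig. 4a are given as rpkm. As a hedged illustration of how per-gene counts such as those from HTSeq-count are conventionally normalized, the sketch below applies the standard reads per kilobase per million mapped reads formula. Which read total served as the denominator is not stated in the text, so `total_mapped_reads` is an assumed input, and the function name is illustrative.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bp per kb) x 1e6 (reads per million)
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2 kb gene in a 10-million-read library.
example_rpkm = rpkm(500, 2000, 10_000_000)
```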

Genome size comparison

To compare Comchoano-1 and Comchoano-2 genome size and completion to a wide variety of marine bacterial and archaeal genome sizes, 4,931 marine bacterial and archaeal genomes were surveyed from the JGI/IMG database that were identified as ocean, coastal, pelagic or neritic102, 12,714 marine single-cell genome sequences27, 894 MAGs from Tara Oceans103 and 4 archaeal MAGs from a recent analysis of the BioGEOTRACES dataset104. The genome completion for all genomes was estimated with CheckM taxonomy_wf workflow at the domain level for bacteria and archaea26. Genome sizes were then estimated by accounting for the CheckM-derived completion and contamination metrics from each genome. Only those genomes that were estimated to be >80% complete on the basis of SCG estimates and having <5% contamination (3,652 total) were used in the size comparisons.
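The text does not give the exact formula used to adjust genome size for completion and contamination. One common correction, shown here purely as an assumption, scales the assembly size down by the contaminated fraction and up by the inverse of completeness. The quality filter (>80% complete, <5% contamination) follows the text; function names are illustrative.

```python
def estimated_genome_size(assembly_bp, completeness_pct, contamination_pct):
    """Estimate full genome size from an incomplete assembly.

    Assumed formula (not stated in the source):
        size = assembly_bp * (1 - contamination) / completeness
    """
    if not 0 < completeness_pct <= 100:
        raise ValueError("completeness must be in (0, 100]")
    return assembly_bp * (1 - contamination_pct / 100) / (completeness_pct / 100)


def passes_quality_filter(completeness_pct, contamination_pct):
    """Filter from the text: >80% complete and <5% contamination."""
    return completeness_pct > 80 and contamination_pct < 5


# Hypothetical genome: 1 Mb assembled, 80% complete, 5% contaminated.
size = estimated_genome_size(1_000_000, 80.0, 5.0)
```

Under this assumed formula, the hypothetical 1 Mb assembly corresponds to an estimated genome of about 1.19 Mb, although it would not pass the quality filter.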

Phylogeny and classification

Initially, to classify bacteria on the basis of their whole genome, Comchoano-1 and Comchoano-2, and the other genomes described above were classified via the GTDB-tk (v1.4.0) classify_wf command19, which extracts 120 putatively vertically transferred genes, aligns them and then places them on a reference tree with pplacer; this analysis assigned the Comchoano to the order UBA7916, of which the only representatives with genomes are uncultivated oceanic SAGs and MAGs. This command was also used to calculate the RED between Comchoano and sequences already in the GTDB (default settings), as well as to recalculate with newly added sequences (Comchoano plus UBA7916 from other datasets, described below, not found in release 95 of GTDB) with the ‘--recalculate_red’ flag. In each case, the RED values for each Comchoano were 0.83. To determine the average AAI between Comchoano-1 and Comchoano-2, the aai.rb programme from enveomics105 was used, which resulted in the two-way AAI value of 49.0 ± 14.8% based on 729 homologous proteins. This AAI is roughly that expected for class-level differences106, yet the 16S rRNA gene sequences share 95% identity, which is similar to genus-level differences. With this low similarity, the average nucleotide identity may be unreliable;107 OrthoANIu108 suggests an average nucleotide identity of 67.03% on the basis of 159,905 bp on average, which is about 15% of the genome size. To circumvent such a problem of rapid evolutionary divergence in, for example, symbionts, Parks et al.19 recommend the RED metric that we used, which takes this into account. 
Subsequently, we also performed phylogenomic reconstruction of Comchoano and other Gammaproteobacteria. For this, 38 single-copy ribosomal proteins from Proteobacteria were extracted with GToTree109 for Comchoano-1 and Comchoano-2, all representative Gammaproteobacterial genomes in the GTDB database (n = 5,784), all identified Gammaproteobacteria marine single-cell genome sequences (n = 1,413)27, six UBA7916 genomes from a Tara Oceans MAG study103 and one UBA7916 MAG from JGI/IMG not found in the other datasets (IMG Genome ID 2721755926). Additionally, ten Alphaproteobacterial genomes were used as outgroup sequences. Ribosomal proteins were identified via hmmscan with the intrinsic pfam gathering threshold used as cut-offs for each respective protein. Any ribosomal proteins that were detected more than once in a genome, or that were more than 30% shorter or longer than the median sequence length for a given protein, were excluded. Genomes were removed from the analysis if they encoded <30% of the 38 proteins. The ribosomal proteins were aligned with MAFFT110 and positions with greater than 50% gaps were removed via trimAl111. The sequences were then concatenated; the total alignment included 6,744 genomes and 4,502 amino acid positions. Phylogenetic reconstruction was performed with FastTree112, with the -notop setting selected. Phylogeny was visualized in iTOL113 with the GTDB order-level taxonomy used as branch colours19.
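The marker-protein curation rules described above (single-copy hits only, a length window around the median, and a minimum fraction of markers per genome) can be sketched in Python. In the study these steps were handled by GToTree; the sketch reads the length criterion as a deviation of more than 30% from the median, computes medians over single-copy hits, and counts markers after filtering, all of which are assumptions. The data layout and function name are hypothetical.

```python
from statistics import median


def filter_marker_hits(hits, n_markers=38, length_tol=0.30, min_fraction=0.30):
    """Filter ribosomal protein hits per genome.

    hits: {genome_id: {protein_name: [length of every hit in that genome]}}
    Returns {genome_id: {protein_name: length}} for retained genomes.
    """
    # 1. Discard proteins detected more than once in a genome.
    single = {
        g: {p: lens[0] for p, lens in prots.items() if len(lens) == 1}
        for g, prots in hits.items()
    }
    # 2. Median length per protein (computed over single-copy hits; assumption).
    by_protein = {}
    for prots in single.values():
        for p, length in prots.items():
            by_protein.setdefault(p, []).append(length)
    med = {p: median(v) for p, v in by_protein.items()}
    # 3. Drop hits deviating >30% from the median, then drop sparse genomes.
    kept = {}
    for g, prots in single.items():
        ok = {p: n for p, n in prots.items()
              if abs(n - med[p]) <= length_tol * med[p]}
        if len(ok) / n_markers >= min_fraction:
            kept[g] = ok
    return kept


toy_hits = {
    "g1": {"L2": [100], "L3": [200], "S8": [130]},
    "g2": {"L2": [100, 100], "L3": [200]},   # duplicated L2 is discarded
    "g3": {"L2": [100], "L3": [50]},         # L3 is far shorter than the median
}
retained = filter_marker_hits(toy_hits, n_markers=4)
```

The toy call uses four markers instead of 38 so that the genome-level cut-off is easy to follow by hand.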

To more finely resolve the phylogenomics of Comchoano, on the basis of the FastTree reconstruction produced above, the genomes from the 15 closest Gammaproteobacterial orders to UBA7916 (Berkiellales, Coxiellales, Diplorickettsiales, DSM-16500, Legionellales, Piscirickettsiales, PIVX01, PIWD01, UBA1113, UBA12402, UBA6002, UBA6186, UBA7366, UBA7916 and UBA9339), plus Francisellales as the outgroup, were selected. Notably, these GTDB orders, besides Francisellales, correspond to the Legionellales order of the NCBI taxonomy. GToTree was then re-run for these orders. Genomes with fewer than five of the ribosomal proteins were removed. This selection removed three genomes, one of which (CACFNL) was included in the 16S rRNA tree, as described below. Then, the same alignment and trimming parameters as above were applied, resulting in 187 taxa and 4,573 amino acid positions. Maximum likelihood (ML) phylogenetic reconstruction was then performed using the LG+Γ+I model (the best model for 36 out of the 38 genes using the corrected Akaike information criterion (AICc) scores generated by IQ-TREE114) for amino acid substitution and 500 rapid bootstraps.

For 16S rRNA gene phylogenetic analysis, we gathered sequences in three ways: (1) rRNA genes were extracted from Comchoano, the genome-sequenced relatives and the same genomes from the 16 orders used in the phylogenomic subtree analysis with barrnap default settings115. (2) These sequences were then searched by blastn against a SILVA138 database that had been curated to remove sequences that were shorter than 1,200 bp and/or contained any degeneracies. Three sequences with best hits to mitochondria were removed from analysis on the basis of this search. From this search, the five closest hits (based on bit-score) to each remaining genome-derived rRNA gene sequence were selected (excluding the Francisellales, which were the outgroup). (3) To more broadly sample the rRNA sequence diversity from the 15 GTDB orders of the SILVA database, five sequences were selected (at random) from the five individual groups that were included among the best scoring hits to any of the genome-derived rRNA sequences (Coxiellales, Diplorickettsiales, EC3, Berkiella and Legionellales). Sequences from genome-sequenced taxa that had degeneracies or were shorter than 1,200 bp were then removed, and the remaining sequences were combined with sequences from steps 1 and 2 and clustered with cd-hit-est116 at 99% sequence identity, resulting in 240 sequences. The resultant sequences were then aligned with MAFFT and trimmed to remove positions with greater than 95% gaps with trimAl111. Phylogenetic reconstruction was performed in RAxML117 with the GTR+Γ+I model and 1,000 rapid bootstraps. Additionally, a Bayesian inference (BI) phylogenetic reconstruction analysis was performed with MrBayes118, with two independent runs of 2,500,000 generations with four chains each (that is, one cold and three heated), sampling every 250 generations and printing every 1,000 generations. After a burn-in of the first 25% of trees, posterior probabilities for node supports were calculated on the basis of assessment of convergence among runs using the R package RWTY119.
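The selection of the five closest hits per query in step 2 can be sketched as follows. This is a minimal illustration that assumes the blastn results have already been parsed into (query, subject, bit-score) tuples; the function name and the toy identifiers are hypothetical, and ties are resolved by input order, which the text does not specify.

```python
from collections import defaultdict


def top_hits_per_query(hits, n=5):
    """Keep the n highest bit-score subjects for each query.

    hits is an iterable of (query_id, subject_id, bitscore) tuples.
    Ties keep their input order because Python's sort is stable.
    """
    by_query = defaultdict(list)
    for query, subject, bitscore in hits:
        by_query[query].append((subject, bitscore))
    return {
        query: [s for s, _ in sorted(pairs, key=lambda p: p[1], reverse=True)[:n]]
        for query, pairs in by_query.items()
    }


# Hypothetical toy hits; n=2 is used only to keep the example short.
toy_hits = [
    ("q1", "s1", 900.0),
    ("q1", "s2", 950.0),
    ("q1", "s3", 100.0),
    ("q2", "s4", 500.0),
]
selected = top_hits_per_query(toy_hits, n=2)
print(selected)
```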

For the type IV secretion phylogeny, putative VirB4 ATPases (also known as virB4, icmB or TraU) from the following sources were included: (1) VirB4 from the same dataset of genomes as used for genome size comparison (n = 18,671), (2) VirB4 from the SecReT4 database (n = 570)120 and (3) VirB4 from an additional study of secretion systems across public databases (n = 562)65. Using MacSyfinder as described above, this analysis identified 1,629 unique putative VirB4 homologues from T4SSs. For phylogenetic reconstruction, a tree was constructed by the approximate ML approach with FastTree112 using the complete dataset aligned with MAFFT110 and masking positions having ≥5% gaps. Then, a subset of bacterial sequences that grouped together in a well-supported clade (100%) with sequences identified as pT4SSi or T4SS type I was selected. On this subset, ML and BI phylogenetic analyses were then performed. First, the selected reference sequences were re-aligned with MAFFT using default parameters. The alignment was masked by removing positions having ≥5% gaps. The best-fit model of amino acid evolution was determined on the 956 amino acid positions with ProtTest 3.2121 as being LG+Γ4+I, using the AICc. In RAxML117, a tree search was performed with 1,000 nonparametric bootstrap replications using the same evolution model. The BI analyses were conducted in MrBayes 3.2.6118, with two independent runs of 2,500,000 generations with four chains each (that is, one cold and three heated), sampling and printing every 100 generations. After a burn-in of the first 10% of trees, posterior probabilities for node supports were calculated on the basis of assessment of convergence among runs using RWTY. Taxonomy for genomes in the VirB4 phylogeny was determined with the GTDB-tk classify_wf command19 using data release 89.
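The gap-masking step described above (removing alignment positions having ≥5% gaps) can be sketched as below. This is an illustrative re-implementation on a toy alignment, not the tool used in the study; the function name and sequences are hypothetical, and only '-' is treated as a gap character.

```python
def mask_gappy_columns(alignment, gap_fraction_cutoff=0.05):
    """Remove columns whose gap fraction is at or above the cutoff.

    alignment maps sequence names to equal-length aligned strings.
    """
    seqs = list(alignment.values())
    n_seqs = len(seqs)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col
        for col in range(length)
        if sum(s[col] == "-" for s in seqs) / n_seqs < gap_fraction_cutoff
    ]
    return {name: "".join(seq[c] for c in keep) for name, seq in alignment.items()}


# Hypothetical toy alignment: columns 3 and 5 each contain one gap (25%).
toy_alignment = {"a": "MK-LV", "b": "MKTLV", "c": "MKTL-", "d": "MKTLV"}
masked = mask_gappy_columns(toy_alignment)
print(masked)
```

With only four toy sequences, any column containing a gap exceeds the 5% cutoff, so both gapped columns are dropped and three positions remain.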

For rhodopsin phylogeny, representatives of diverse rhodopsins were initially collected from a previous study14. Additionally, to broadly survey for rhodopsins similar to those in Comchoano and proteorhodopsins in general, rhodopsin proteins were extracted from global marine metagenome surveys103,122,123, marine single-cell surveys27 and predicted proteins from metagenomic assemblies from the North Pacific124 via hmmscan against the Bac_rhodopsin profile, with a gathering threshold of greater than 26. These sequences plus amplicons from the Red Sea125 and the MicRhoDE126 database were then searched by blastp75 against the Comchoano-1 and Comchoano-2 rhodopsins. The sequences with a bit-score greater than 250 were then added to the sequences from the previous study14, as well as all rhodopsins from MAGs and SAGs classified as UBA7916. This resulted in 480 sequences that were then aligned using MAFFT default settings110. The alignment was trimmed to remove positions with greater than 50% gaps via trimAl111, resulting in an alignment of 250 positions. ML reconstruction was performed in RAxML117 with 1,000 rapid bootstraps and the LG+Γ+F substitution model as in ref. 14.

For ATP/ADP translocase phylogeny, we leveraged two datasets from recent publications for collecting representative sequences35,41, in addition to collecting putative ATP/ADP translocases from single-cell prokaryotic genomes from the North Atlantic27, MAGs from Tara103 and genomes from GTDB19. For the latter three datasets, ATP/ADP translocases were predicted by searching against the PFAM92 database with hmmscan91 using a gathering cut-off of 20.6. Combining these datasets resulted in 1,695 sequences. These sequences were then clustered at 0.95 amino acid similarity via cd-hit116 and subsequently filtered to remove sequences shorter than 250 amino acids via seqkit, resulting in 1,379 sequences. The sequences were then aligned with MAFFT and ambiguous positions were trimmed with trimAl via the automated heuristic on the basis of similarity statistics (‘-automated1’). This resulted in an alignment of positions. Phylogenetic reconstruction was then performed with IQ-TREE114 with extended model selection (-m MFP) and 1,000 ultrafast bootstraps. Subsequently, the region of the tree containing UBA7916, Comchoano and numerous characterized proteins from parasites and symbionts was extracted (as indicated in Extended Data Fig. 10a), re-aligned and trimmed as performed for the full dataset, resulting in an alignment of 413 sequences and 412 positions. Phylogenetic reconstruction was again performed as above. All trees were visualized in iTOL113. To examine motifs important for nucleotide transport, we selected the same subset as in Graf et al. 2020, re-aligned it with MAFFT and visualized the alignment in the ESPript web server (https://espript.ibcp.fr)127.

Environmental distributions

For 16S and 18S rRNA gene distributions from Monterey Bay, seawater was collected in 2014 (20 March, 2 April and 5 May) via CTD rosette from the top 1 m at three locations (M2: as above; M1: 36.762° N, 122.038° W; C1: 36.797° N, 121.847° W) and 500 ml was filtered through 47 mm diameter, 0.2-µm-pore-size Supor (Pall) filters. Additionally, seawater from the top 5 m was collected via the ship intake at M1 and M2 stations, pre-filtered through a 30 µm nylon mesh (except for the March samples where no pre-filtration occurred) and 20–30 l were sequentially size fractionated through 142 mm diameter, 3 µm Versapor and 0.22 µm Supor filters. One additional sample, from 30 April 2015, was collected from 5 m depth with a CTD rosette; 500 ml were collected on 47 mm diameter, 0.2 µm Supor filters and size fractionated on 2 µm polycarbonate and 0.2 µm Supor (Pall) filters. DNA was extracted with a DNeasy kit, with modifications described in an earlier report128. In the case of the 142 mm filters from the 2014 size-fractionated samples, only 1/6 of the filter was used for DNA extraction. 16S and 18S rRNA gene V4 amplicons were amplified and processed as described above. Additionally, V4-V5 amplicons were amplified and processed the same way as the V4 amplicons, except for the use of the primers 515F-Y (GTGYCAGCMGCCGCGGTAA) and 926R (CCGYCAATTYMTTTRAGTTT).
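The primer sequences above contain IUPAC degenerate bases (for example Y and M in 515F-Y). As a hedged illustration of what that degeneracy means, the sketch below expands a degenerate primer into a regular expression and checks which concrete sequences it covers. The helper name and the test sequences are hypothetical, and the sketch is not part of the amplicon processing pipeline used in the study.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_to_regex(primer: str) -> re.Pattern:
    """Compile a degenerate primer into a regex over A, C, G and T."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


primer_515f_y = primer_to_regex("GTGYCAGCMGCCGCGGTAA")

# Hypothetical template sequences differing at the degenerate position.
covers_c_variant = primer_515f_y.fullmatch("GTGCCAGCAGCCGCGGTAA") is not None
covers_t_variant = primer_515f_y.fullmatch("GTGTCAGCCGCCGCGGTAA") is not None
covers_a_variant = primer_515f_y.fullmatch("GTGACAGCAGCCGCGGTAA") is not None
print(covers_c_variant, covers_t_variant, covers_a_variant)
```

The Y at the fourth position accepts C or T but not A, so the third toy sequence is not covered.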

For 16S and 18S rRNA gene distributions from Malaspina cruises, data were downloaded from the NCBI via BioProject PRJEB25224 and PRJEB23913. This project analysed prokaryotic (primers, 515F-Y, 926R129) and pico-eukaryotic composition (TAReuk454FWD1 (CCAGCASCYGCGGTAATTCC) and TAReukREV3 (ACTTTCGTTCTTGATYRA)69) from surface seawater during the circumnavigating Malaspina 2010 cruise. The data were imported into QIIME 2 where primers were removed from the 18S data (16S primers had already been removed), and 16S and 18S denoised independently via dada2 denoise-paired (16S, --p-trunc-len-f 210 --p-trunc-len-r 180; 18S, --p-trunc-len-f 210 --p-trunc-len-r 200). Taxonomy was then assigned using QIIME 2 with the same settings as above. To examine possible 16S rRNA gene ASVs with affiliation to UBA7916, ASVs classified as ‘Coxiellales’ were extracted (n = 226; note, in SILVA, ‘Coxiellales’ is the classification for Comchoano and UBA7916 because UBA7916 is not defined in that rRNA database) for phylogenetic placement with epa-ng130 with default settings, except for the use of the --no-heur setting, using the alignment and maximum likelihood 16S rRNA gene phylogenetic reconstruction previously described. The best placement was used to determine the affiliation of a given ASV within UBA7916 (n = 167).
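Selecting the best placement per ASV can be sketched as follows. The sketch assumes the placement results are in the jplace JSON format and that "best" means the placement with the highest likelihood weight ratio; that criterion is an assumption, as the text does not define it. The function name, ASV names and numbers are hypothetical toy values.

```python
def best_placements(jplace: dict) -> dict:
    """Map each query name to (edge_num, like_weight_ratio) of its top placement."""
    fields = jplace["fields"]
    edge_idx = fields.index("edge_num")
    lwr_idx = fields.index("like_weight_ratio")
    best = {}
    for entry in jplace["placements"]:
        top = max(entry["p"], key=lambda row: row[lwr_idx])
        # Query names are stored under "n", or under "nm" with multiplicities.
        names = entry.get("n") or [pair[0] for pair in entry.get("nm", [])]
        for name in names:
            best[name] = (top[edge_idx], top[lwr_idx])
    return best


# Hypothetical toy placement document with two queries.
toy_jplace = {
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[12, -5000.1, 0.85, 0.01, 0.02], [7, -5002.3, 0.15, 0.03, 0.02]],
         "n": ["ASV_1"]},
        {"p": [[3, -4100.0, 0.40, 0.02, 0.05], [9, -4099.5, 0.60, 0.01, 0.04]],
         "n": ["ASV_2"]},
    ],
}
best = best_placements(toy_jplace)
print(best)
```

In practice the dictionary would come from `json.load` on a placement file, and the returned edge numbers would then be mapped back to clades of the reference tree to decide whether an ASV falls within UBA7916.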

For 18S rRNA from Tara Oceans, data were downloaded as a published OTU dataset amplified with V9 rRNA primers 1389F (TTGTACACACCGCCC) and 1510R (CCTTCYGCAGGTTCACCTAC), universal for eukaryotes10,131. To make the taxonomy consistent with the Malaspina 18S data, the Tara Oceans V9 OTU representative sequences were reclassified with QIIME 2 as described above.

For the San Pedro Ocean Time-series daily time-series, the published OTU datasets and representative sequence dataset were downloaded from FigShare132. As described elsewhere132, these samples originated from the top 1 m of seawater at 33.55° N, 118.4° W. Seawater was pre-filtered (80 µm), and then cells were sequentially collected on a 1 µm AE (Pall) glass filter and a 0.2 µm Durapore (Millipore) filter. 16S and 18S rRNA genes (V4-V5 amplicons) were amplified with the 515F (GTGCCAGCMGCCGCGGTAA) and 926R (CCGYCAATTYMTTTRAGTTT) primers, as reported in the paper originally publishing the data132. The sequences were clustered at a 99% similarity threshold and representative sequences were chosen on the basis of the most abundant sequence within a cluster. Statistical correlations between Comchoano-1, Comchoano-2 and B. minor were assessed via eLSA133 for the initial daily time-series portion of this study (12 March–1 April 2011), allowing for a maximum time delay of 5 d, and for the full time-series, allowing a time delay of up to three time points; only correlations with no time delay (Spearman correlation) were used in Extended Data Fig. 1b,c. In both cases, P and q-value determination was performed using the ‘theoretical’ option, not permutation, and the calculations were performed only on taxa that were detected on 33% of sampling points.
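The zero-delay correlation and the detection filter described above can be illustrated with the sketch below. It is a plain re-implementation of Spearman's rank correlation (average ranks for ties) and is not eLSA itself. It reads "detected on 33% of sampling points" as a minimum prevalence of 33%, which is an assumption, and the abundance series are hypothetical toy values.

```python
def prevalence(counts):
    """Fraction of sampling points at which a taxon was detected."""
    return sum(c > 0 for c in counts) / len(counts)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks.

    Assumes neither series is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical relative abundances at six sampling points.
host = [0, 2, 5, 9, 4, 0]
symbiont = [0, 1, 3, 8, 2, 0]
passes_filter = prevalence(host) >= 0.33 and prevalence(symbiont) >= 0.33
rho = spearman(host, symbiont)
print(passes_filter, round(rho, 3))
```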

HCR-FISH probes

To design probes for Comchoano-1, Comchoano-2 and B. minor for HCR-FISH134, 23 probe candidates targeting 16S rRNA for Comchoano-1 and Comchoano-2, and 17 probe candidates targeting 18S rRNA genes for B. minor were provided by Molecular Technologies. Subsequently, these sequences were searched against NCBI, as well as against available sequences from closely related organisms (for Comchoano-1 and Comchoano-2, this included searches against each other), to confirm that they were specific only to their target (Comchoano-1, Comchoano-2 or B. minor, respectively). Two top candidates were identified for each target on the basis of their sequence specificity. For Comchoano-1: probe ‘2’, tcgggaaaagtgatggcgagtggcggacgggtgagtaatgcgtaggaatcta and probe ‘5’, tgcgatgaaggctttcgggtcgtaaagcactttcagttgggaagatggctta; for Comchoano-2: probe ‘2’, tcggaagaaatgatggcgagtggcgaacgggtgagtaatgcgtaggaatcta and probe ‘14’, ctttagtaataaaggggtgccttcgggaaccgagatacaggtgttgcatggc; for B. minor: probe ‘5’, tgattcttcgagtcttcctctcgtagttgtttggcgcacttgattgggtgcc and probe ‘v4-1’, tctgattcgaaagatcggtccgccgcaaggcgagcactgattcttcgagtct. For each probe set, these sequences were converted into ‘even’ and ‘odd’ split-initiator probes in accordance with the HCR v3.0 protocol134. Because the three cell types are currently not in culture, we used a Clone-FISH approach135 in Escherichia coli for positive and negative control testing of probes. Ultimately, both the evaluated B. minor and Comchoano-1 probes were deemed to be specific in a limited set of cross-reactivity tests, meaning that fluorescent signal was not appreciable when the B. minor probe was paired with another choanoflagellate sequence, and fluorescence was also not appreciable when Comchoano-1 probes were paired with Comchoano-2 sequences (clones). However, only one (probe ‘14’) of the Comchoano-2 probes was specific, as Comchoano-2 probe ‘2’ also amplified in Comchoano-1 clones. Comchoano-2 probe ‘14’ can be used individually to target only Comchoano-2.
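As a rough illustration of comparing candidate probes between the two symbionts, the sketch below counts position-wise differences between the Comchoano-1 and Comchoano-2 probe ‘2’ sequences given above. This ungapped comparison is only a crude proxy for specificity; it is not the NCBI search or the Clone-FISH testing described in the text, and the function name is hypothetical.

```python
def mismatch_positions(seq_a: str, seq_b: str) -> list:
    """Return 1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        pos
        for pos, (a, b) in enumerate(zip(seq_a.lower(), seq_b.lower()), start=1)
        if a != b
    ]


# Probe '2' sequences as listed in the text.
comchoano1_probe2 = "tcgggaaaagtgatggcgagtggcggacgggtgagtaatgcgtaggaatcta"
comchoano2_probe2 = "tcggaagaaatgatggcgagtggcgaacgggtgagtaatgcgtaggaatcta"

differences = mismatch_positions(comchoano1_probe2, comchoano2_probe2)
print(len(comchoano1_probe2), differences)
```

For these two sequences the comparison reports four differing positions out of 52.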

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.