Introduction

The most prominent characteristic of the deep continental subsurface is the absence of sunlight. However, the diversity of subsurface ecosystems is manifold. Physicochemical characteristics, as well as the availability of electron donors and acceptors shape different microbial communities within these ecosystems (e.g., Refs. [1, 2]). In some environments, the availability of fossil organic matter, burial depth, and temperature exert strong control on community structure [3,4,5]. Other subsurface environments have low availability of buried organic matter. In such environments, genomic analyses suggest that in situ CO2 fixation supports microbial communities [6,7,8,9,10,11]. Most subsurface environments may be sustained by fixed carbon from multiple sources, and the relative importance of in situ CO2 fixation has been difficult to ascertain [12].

The candidate phyla radiation (CPR) of bacteria is a monophyletic group [13], which includes enigmatic small-celled microbes [14] that appear to be abundant predominantly in the subsurface [15]. Cocultures of CPR bacteria indicate that some are symbionts of other bacteria and heavily depend on their hosts for basic resources [16]. To date, none of the reconstructed CPR genomes encodes a complete fatty acid (FA)-based lipid biosynthesis pathway [15]. Other putative bacterial and archaeal symbionts from different branches of the tree of life also do not encode their own lipid biosynthesis pathway [17,18,19], and at least one hyperthermophilic episymbiont (Nanoarchaeum equitans) has been suggested to acquire its lipids from the host archaeon [20]. However, the origin and types of lipids used by CPR bacteria remain elusive.

Analysis of the stable carbon isotopic ratios of lipid molecules has enabled researchers to track carbon flow through communities. For instance, it was shown that archaea growing in syntrophy with sulfate-reducing bacteria mediate the anaerobic oxidation of methane [21, 22]. This analysis was possible because the consortia were based on simple bacterial and archaeal assemblages that produce diagnostic lipid types. In another study, the stable carbon isotope ratios of methane and lipids were used to track the flow of carbon from methane into the two species thought to be present based on rRNA sequence profiling [23]. Coupled lipidomic, tag sequencing, and isotopic analyses also allow spatiotemporal tracking of carbon flow through complex microbial communities [24, 25]. However, the power of this approach is limited when microbial communities contain numerous organisms that produce unknown lipid molecules [26]. In fact, lack of information about the types of lipids produced by uncultivated organisms remains a major gap in microbial ecology.

A recent large-scale environmental genomics survey of subsurface microbial ecosystems within the Colorado Plateau, USA, provided evidence for a depth-based distribution of organisms affiliated with more than 100 different phylum-level lineages [12]. Samples were acquired from groundwater that erupted through the cold (i.e., nonthermal), CO2-driven Crystal Geyser. During the eruption cycle groundwater was sourced from different depths, enabling the assignment of organisms to their respective depths. Genomic resolution of the tracked organisms linked three different carbon fixation pathways to groundwater from different depths. However, a major question remains regarding the extent to which autotrophic organisms provide organic carbon to these complex microbial communities. Further, the types and sources of lipids used to construct the cell envelope of CPR bacteria remain elusive. We postulated that clues regarding the types of lipids produced by uncultivated bacteria and archaea could be addressed by correlation-based analyses so long as sufficient numbers of samples were defined in terms of the abundances of the microorganisms present and overall lipid compositions of the same samples were available. Here, we use coupled metagenomic-lipidomic data sets to test this approach and to resolve the importance of autotrophy as the source of organic carbon in the studied environment.

Material and methods

Sampling scheme

Samples for lipid analyses were retrieved by collecting cells from groundwater sampled from the Crystal Geyser ecosystem onto a 0.1-µm Teflon filter (Gravertech 10″ MEMTREX-HFE). Filters with biomass were immediately frozen on dry ice. One post-0.2-µm fraction was also collected to enrich for organisms of the CPR and DPANN radiations (sample ID 26, beginning of the recovery phase of the geyser). The samples span an entire cycle of the geyser, which lasted for ~5 days [12]. Collection for each metagenomic sample proceeded for around 4 h (141 L, SD 31%, Table S1). Collection of lipid samples proceeded simultaneously, but the collection time was around 8 h (114–338 L, Table S2), so there are half as many lipid samples as metagenome samples. The sampling scheme details are presented in Fig. S1. For infrared analysis coupled to metagenomics, one additional size-fractionated sample (first 0.2 µm, then 0.1 µm filtration) was included, which was collected during the recovery phase of the geyser in August 2014 and whose genomes had been analyzed earlier [12]. Details on samples and SRA accessions are provided in the Supplementary information.

Sampling and isotopic analysis of dissolved inorganic carbon

Twenty-four groundwater samples were collected from about 8.5 m below ground surface in the geyser borehole using a peristaltic pump and copper pipe. Samples were collected in 12 mL glass vials. The vials were flushed with fresh geyser water and were filled underwater in a bucket that was overflowing with groundwater to avoid atmospheric contact; this was confirmed by gas chromatography analyses that did not detect contamination by atmospheric gases (N2, O2, or Ar; unpublished data). The stable carbon isotopic composition of the dissolved inorganic carbon was analyzed by Continuous Flow Isotope Ratio Mass Spectrometry (CF-IRMS) using a Thermo Finnigan GasBench coupled to a DeltaVPlus. Water pressure, temperature, and electrical conductivity were measured in situ at the same depth using a Solinst LTC Levelogger Edge.

Lipidomics

Methods for lipid extraction and analysis are described in detail in the Supplementary information (sample overview is given in Table S2). In brief, lipids were extracted using a modified Bligh and Dyer method [27] after addition of an internal standard. Archaeal and bacterial intact polar lipids (IPLs; for structures see Fig. S2) were quantified using a Dionex Ultimate 3000 ultra-high-performance liquid chromatography (UPLC) system connected to a Bruker maXis Ultra-High Resolution quadrupole time-of-flight mass spectrometer equipped with an electrospray ion source operating in positive mode (Bruker Daltonik, Bremen, Germany). Lipids were separated using normal phase UPLC on an Acquity UPLC BEH Amide column (1.7 µm, 2.1 × 150 mm; Waters Corporation, Eschborn, Germany) maintained at 40 °C as described in Ref. [28]. For isotopic analysis, IPLs were separated from free core lipids using semi-preparative high-performance liquid chromatography. For mass spectrometric analysis of previously uncharacterized IPLs, see Figs. S3 and S4. Ether cleavage and saponification were performed on the IPL fractions to release isoprenoid hydrocarbons and FA, respectively. The stable carbon isotopic compositions of these compounds were analyzed using gas chromatography–IRMS. Fourier-transform infrared (FTIR) spectromicroscopy was performed to detect lipids in intact cells. The FTIR system consisted of a Hyperion 3000 Infrared-Visible microscope coupled to a Vertex70V interferometer (Bruker Optics, Billerica, MA). For FTIR analysis, cells were deposited on a double-side-polished silicon slide and dried with a gentle nitrogen gas stream in a biological safety cabinet. Lipid identification was achieved by comparing spectra from samples with those from dry films of lipid standards.

Metagenomics

Methods for DNA extraction and metagenomic sequencing are described in Ref. [12]. In brief, DNA was extracted from filters using the MoBio PowerMax Soil DNA isolation kit, and library preparation and sequencing were performed at the Joint Genome Institute (details on extracted DNA, type of library and sequencing are provided in Ref. [12]). Quality filtered reads (https://github.com/najoshi/sickle, https://sourceforge.net/projects/bbmap) were assembled using IDBA_UD [29], and genes were predicted using prodigal (meta-mode; [30]). Coverage of scaffolds was calculated using bowtie2 (sensitive) [31]. Taxonomy of scaffolds was determined by searching proteins against an in-house database.

Tracking taxa across time using ribosomal protein S3

In order to get a near-complete picture of specific taxa present in the samples, we extracted ribosomal protein S3 (rpS3) sequences from all assembled scaffolds >1 kb using separately designed HMMs for archaea, bacteria, and eukaryotes (https://github.com/AJProbst/rpS3_trckr). The extracted amino acid sequences were clustered at 99% identity (collapsing most of the strains of the same species [32]) and the longest scaffold bearing a representative rpS3 sequence was obtained for each cluster. Using read mapping (bowtie2, [31]) and allowing a maximum of three mismatches per read (according to the 99% identity of the de-replicated rpS3 sequences), the relative abundance of each selected rpS3 scaffold was calculated across all samples. The breadth (i.e., the fraction of a scaffold's sequence that is covered by reads) of the scaffolds was calculated in each sample. To call an rpS3 sequence present in a sample, it had to be either assembled or have a breadth of at least 95% of the entire scaffold in that sample. Since we worked with scaffolds, we did not consider ambiguous bases for calculating the breadth. The rpS3 sequences were taxonomically annotated against a combined database from previous publications [12, 33, 34], which was de-replicated at 99% rpS3 identity. Taxonomic assignments were performed with similarity cutoffs as described earlier: ≥99% for species, ≥95% for genus, and ≥90% for family level. Lower percentages were assigned to phylum or domain level (<50%).
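The presence call and the identity cutoffs described above can be illustrated with a minimal Python sketch. It is not the code from the rpS3_trckr repository. The function names, the per-base depth input, and the treatment of `N` as the only ambiguous base are assumptions made for illustration, and the fallback label simply groups everything below 90% identity as "phylum or domain".

```python
from typing import Sequence


def breadth(depth: Sequence[int], scaffold: str) -> float:
    """Fraction of unambiguous scaffold positions covered by at least one read.

    `depth` holds the per-base read depth (hypothetical input format);
    positions with an ambiguous base ('N') are ignored, as in the text.
    """
    positions = [i for i, base in enumerate(scaffold) if base.upper() != "N"]
    if not positions:
        return 0.0
    covered = sum(1 for i in positions if depth[i] > 0)
    return covered / len(positions)


def is_present(assembled: bool, breadth_value: float, min_breadth: float = 0.95) -> bool:
    """An rpS3 scaffold counts as present if assembled or sufficiently covered."""
    return assembled or breadth_value >= min_breadth


def rank_from_identity(identity_percent: float) -> str:
    """Map percent rpS3 identity to the lowest taxonomic rank assigned."""
    if identity_percent >= 99.0:
        return "species"
    if identity_percent >= 95.0:
        return "genus"
    if identity_percent >= 90.0:
        return "family"
    return "phylum or domain"
```

For example, a five-base toy scaffold `"ACNGT"` with depths `[3, 1, 0, 0, 2]` has four unambiguous positions, three of which are covered, so it would not pass the 95% breadth criterion unless it was also assembled in that sample.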

Statistical analysis to correlate taxa abundance with IPLs

Relative abundance measures of rpS3 genes were correlated (Pearson correlation) with relative abundance measures of IPLs if the rpS3 gene and the IPL species were each present in at least 7 out of 14 samples. Resulting p values underwent correction for multiple testing using the Bonferroni procedure, and these corrected values (q values) were then weighted by dividing each q value by the percent relative abundance of the rpS3 gene. Each lipid was allowed to be assigned to only one organism (with the best score). This assignment of lipids to rpS3 genes considers that highly abundant organisms are more likely to be detected in lipid analyses than low abundant organisms. IPL signatures were co-correlated (Bonferroni-corrected p value < 0.005) and lipid species that correlated with other lipids were identified for further analyses. These co-correlated lipid species, as well as the correlation of rpS3 genes and lipid species, were used to construct a network (code is available at https://github.com/AJProbst/lip_metgen), which was visualized in Cytoscape. Primary lipids were assigned based on direct correlation of lipids with organisms; secondary lipids were assigned based on a correlation with primary lipids. Lipids were classified as unspecific if the secondary lipid correlated with two primary lipids of different organisms.
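The scoring steps in this paragraph can be sketched in a few lines of Python. This is an illustrative reading of the text, not the code in the lip_metgen repository. The sketch assumes that Pearson p values have already been computed elsewhere (for example with `scipy.stats.pearsonr`), that "present" means an abundance above zero, and that a single percent relative abundance value per organism is used for the weighting. All function and variable names are hypothetical.

```python
def prevalent(values, min_samples=7):
    """True if a gene or lipid is detected (abundance > 0) in enough samples."""
    return sum(1 for v in values if v > 0) >= min_samples


def bonferroni(p_values):
    """Bonferroni correction: multiply each p value by the number of tests."""
    n_tests = len(p_values)
    return {pair: min(1.0, p * n_tests) for pair, p in p_values.items()}


def assign_lipids(p_values, percent_abundance):
    """Assign each lipid to the organism with the lowest weighted score.

    p_values: {(organism, lipid): raw Pearson p value}
    percent_abundance: {organism: percent relative abundance of its rpS3 gene}
    Returns {lipid: (organism, weighted_score)}.
    """
    corrected = bonferroni(p_values)
    best = {}
    for (organism, lipid), q in corrected.items():
        # Dividing by abundance favors abundant organisms, as described above.
        score = q / percent_abundance[organism]
        if lipid not in best or score < best[lipid][1]:
            best[lipid] = (organism, score)
    return best
```

With this weighting, an organism at 10% relative abundance and a corrected value of 0.003 scores lower (better) than an organism at 0.5% relative abundance with a corrected value of 0.0015, so the lipid would be assigned to the more abundant organism.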

Binning of genomes

rpS3 genes that were not found in existing genomes [12, 35] were identified based on sequence similarity (<98% [36]) and searched for in the respective metagenomes. Genomes containing these rpS3 sequences were binned using a consensus of guanine–cytosine content, coverage and taxonomy information in the ggKbase platform [37]. Genomes were subsequently curated with ra2 [13] for scaffolding errors. Genomes have been deposited at DDBJ/ENA/GenBank under the accessions SAMN13287258-462 (Umbrella BioProject PRJNA602879).

Genomic analysis of lipid biosynthesis pathways in CPR genomes

Protein sequences were annotated from USEARCH (–ublast) searches against UniProt, UniRef100 [38], and KEGG databases [39] and uploaded to ggKbase (https://ggkbase.berkeley.edu). Based on existing annotations, target proteins involved in bacterial FA, isoprenoid, and lipid biosynthesis were identified in CPR genomes and can be accessed using the following link: https://ggkbase.berkeley.edu/genome_summaries/1491-Bacterial_membrane_lipids_AJP.

Results and discussion

Microbial community profile based on marker genes

We de novo assembled 27 metagenome samples, the reads from which were previously used in a study that involved mapping to 505 genomes reconstructed from prior data sets to link organisms to groundwater of different depths [12]. In the current study, we extracted assembled sequences of rpS3 and used read mapping to scaffolds carrying this gene to follow organisms over the 5-day eruption cycle. This approach allowed us to track 914 putatively distinct microbial species (Fig. 1), greatly exceeding the 505 previously reconstructed genomes [12].

Fig. 1: Community structure of 27 metagenomic samples from Crystal Geyser based on percent relative abundance of scaffolds carrying rpS3 sequences (clustered at 99% amino acid similarity).

Nonmetric multidimensional scaling based on the Bray–Curtis index. The connections show the trajectory of the different samples taken throughout the eruption cycle. Sample 01 was not included as it was an amplified library due to low biomass (see “Material and methods” for further details). Sample 26 was collected after the end of the major eruptions and is already part of the recovery phase (thus colored in pink). Black color indicates samples that were collected during transition between phases. Note that sample 26 was also size-fractionated onto a 0.2-µm and a 0.1-µm filter. For details on individual rpS3 abundances please see Fig. S5 and Table S5.

We detected a large community shift associated with different eruption phases. According to previously published geochemical data [12], the first phase, referred to as the recovery phase, sources groundwater from an aquifer of intermediate depth, likely a Navajo Sandstone-hosted aquifer. During the second, minor eruption phase, water from a deeper aquifer is sourced (likely Wingate Sandstone-hosted), and during the third, major eruption phase, an increased fraction of shallow groundwater is sourced (Fig. S1). Grouping of samples into different clusters in an ordination analysis based on community composition (Fig. 1) revealed stepwise changes throughout the eruption cycle. The final sample, which was taken after the end of the major eruption phase as the geyser transitioned into the next recovery phase, was size-fractionated, with cells collected sequentially on a 0.2 µm filter followed by a 0.1 µm filter (sample 26, Fig. 1). The community composition on the 0.2 µm filter plots near samples from the beginning of the first cycle in the ordination analysis, indicative of a restoration of the initial microbial community (Fig. 1).
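The ordination in Fig. 1 rests on pairwise Bray–Curtis dissimilarities between samples. As a minimal sketch of that distance only (the nonmetric multidimensional scaling step is not shown), the following Python function compares two relative-abundance profiles. The dictionary input format and the function name are assumptions for illustration, and any data transformation applied before the published ordination is not reproduced here.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Each profile maps a taxon (e.g., an rpS3 cluster) to its abundance;
    taxa missing from a profile are treated as zero.
    Returns 0.0 for identical profiles and 1.0 for profiles sharing no taxa.
    """
    taxa = set(profile_a) | set(profile_b)
    difference = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    total = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    return difference / total if total else 0.0
```

Two hypothetical samples that share one taxon at 50% and differ in the remaining 50% would have a dissimilarity of 0.5, placing them at an intermediate distance in the ordination.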

In situ carbon fixation sustains microbial communities irrespective of aquifer depth

Previous community-wide genomic analyses suggested that carbon fixation might sustain the relatively complex aquifer microbial communities, but direct evidence was lacking [10, 33]. We measured the stable carbon isotope composition (i.e., δ13C values) of IPL-derived bacterial FA and archaeal phytane. The values for 14 samples were plotted as a function of sampling time and compared with the δ13C values of DIC and CO2 in the ecosystem (Fig. 2a, b). The δ13C values for DIC sampled from the geyser discharge over its 5-day cycle ranged from 3.6 to 8.0‰ (average = 5.0‰, std. dev. = 1.4‰) and showed no systematic variation with relative depth of source water (Fig. S6). The δ13C values for phytane range between −47.0 and −32.8‰ and for bacterial lipids (expressed as weighted average of all FAs) from −32.7 to −22.1‰. We found very little genomic evidence for utilization of methane [35] by these communities and methane was not detected in the geyser gas emissions [12]. Thus, we do not attribute the 13C-depletion of phytane to methane metabolism by methanogens/methanotrophs. Alternatively, heterotrophy could sustain microbial metabolism in the aquifers sourcing Crystal Geyser. However, the Wingate and Navajo aeolian sandstone aquifers have little associated sedimentary organic carbon [40, 41] that could serve as substrate. Similarly, dissolved organic carbon (DOC) concentrations in minor eruption phase fluids (~1 ppm, Table S3) are overall similar to global median groundwater [42], suggesting no significant admixture of exogenous DOC, for example from nearby oil reservoirs. It is possible that advection of exogenous DOC is more prevalent during major eruptions, but no DOC samples could be obtained from this phase. 
Still, the 13C-depletion in FA and phytane suggests that the majority of biomass is not primarily derived from heterotrophic incorporation of DOC during minor eruptions: phytane (δ13C is −42 to −47‰) and the C16:0, C16:1w7, and C18:1w7 FA are too depleted in 13C (δ13C is −29 to −34‰) to originate primarily from DOC (δ13C is −19 to −24‰, Table S3), while only the δ13C value of C18:0 FA (−27‰) is compatible with the small fractionation between substrate and FA observed in heterotrophic bacteria [43]. Importantly, the DOC in Crystal Geyser aquifers could be derived from in situ primary production and thus sustain heterotrophic bacteria.

Fig. 2: Carbon isotopic ratios and relative abundance of unsaturated intact polar lipids relative to the cycle of the geyser.

a Water pressure and temperature over the geyser cycle showing sourcing of fluids from the conduit (mixed), the deep aquifer, and the shallow aquifer from Ref. [12]. b Stable carbon isotope fractionation of archaeal lipids (phytane, released from archaeol), individual bacterial fatty acids (FA, released from diacylglycerols), bacterial lipids (weighted average of FA), and dissolved inorganic carbon (DIC) relative to CO2 (εCO2-Lipid) over the geyser cycle. Lines to the left of the panel show expected ranges of εCO2-Lipid (accounting for up to 5‰ additional 13C-depletion of lipids relative to biomass, indicated by shaded areas) for the Calvin–Benson–Bassham cycle (CBB; [46,47,48]), the reductive tricarboxylic acid cycle (rTCA [46, 49, 50]), and the Wood–Ljungdahl pathway (WL, reductive acetyl-coenzyme A pathway; [39, 62, 63]). The blue dashed line indicates relative contribution of carbon fixation through the CBB cycle versus the rTCA cycle for bacterial lipids (assuming maximum fractionation due to high in situ [CO2] and [DIC]). The red dashed line indicates the relative contribution of autotrophy versus heterotrophy (uptake of bacterial CBB/rTCA-fixed carbon) to archaeal lipid biomass, calculated from mass balance of δ13C values of bacterial and archaeal lipids (assuming maximum fractionation for archaeal autotrophy due to high in situ [CO2] and [DIC]). c Relative abundance of unsaturated diacylglycerol membrane lipids (the number indicates the sum of double bonds in both acyl chains). The distribution is dominated by mono- and di-unsaturated diacylglycerols but polyunsaturated lipids (6–15 unsaturations) increase markedly in deep aquifer fluids. Grey shading indicates major eruptions, which source deep aquifer water under high pressure.

Stable carbon isotopic compositions of lipids point to a predominantly autotrophic origin of microbial biomass. Due to the high in situ concentration of both HCO3 (69–84 mmol/L; [44]) and CO2 (at saturation level throughout the geyser [12, 44]), maximum fractionation by carbon-fixing microorganisms in the geyser can be assumed [45]. Changes in inorganic carbon speciation and thus fractionation are unlikely, as HCO3 concentrations, temperature (~16.8–17.5 °C, Fig. 2a), ionic strength (15–19 mS/cm, Fig. S1), and pH (6.4–6.5, [44]) stay in narrow ranges. In addition, growth rates of Crystal Geyser communities are likely to be low and thus carbon isotope fractionation (ε) would be expected to be maximally expressed. Based on this, and the known range of ε for carbon fixed via different pathways [46,47,48,49,50,51], it is plausible that the majority of archaeal lipids were synthesized via the Wood–Ljungdahl (WL, reductive acetyl-CoA, εDIC-lipid > 30‰) pathway from DIC, with εDIC-lipid of 38.3–53.9‰ observed in phytane derived from archaeol-based IPLs (Fig. 2b). This is in accordance with previous investigations of Crystal Geyser, which reported dominance of Altiarchaeota in the deepest aquifer [12]. Altiarchaeota fix carbon via a variant of the WL pathway with a fractionation εDIC-lipid of ~63‰ [52] (assuming εDIC-CO2 as ~10‰ at 15 °C calculated after Ref. [53] and δ13C of archaeal lipids from a 99% enrichment of Altiarchaeota reported in Ref. [52]). The observed εDIC-lipid values for archaeal lipids in many samples are below the maximum theoretical fractionation, implying that archaea in Crystal Geyser are not exclusively autotrophic but also take up isotopically heavier organic carbon. One likely source is archaeal utilization of organic carbon fixed by bacteria via the Calvin–Benson–Bassham (CBB) and reductive tricarboxylic acid (rTCA) cycles, which would be more enriched in 13C than carbon fixed via the WL pathway.
The degree of heterotrophic uptake by archaea can be approximated using a mass balance calculation involving mixtures of carbon with (i) the maximum theoretical fractionation for autotrophic archaeal carbon fixation via the WL pathway and (ii) the observed δ13C values of bacterial lipids (accounting for up to 5‰ additional 13C-depletion of lipids relative to biomass). This calculation would imply that archaea are predominantly autotrophic in deep groundwater (up to 70% of the biomass carbon fixed through WL pathway), but in the intermediate and shallow groundwater form up to 69% of their biomass by taking up bacterial organic carbon fixed through the CBB and rTCA cycles (Fig. 2b).
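The two quantities used in this argument, the fractionation ε between a carbon source and a lipid and the share of each carbon source in a two-end-member mass balance, can be sketched as follows. This is an illustration only. The ε definition shown is one commonly used form and is assumed here rather than taken from the text. The linear mixing model, the function names, and the numbers in the usage note are hypothetical and are not expected to reproduce the percentages reported above, which depend on the specific end-member values chosen by the authors.

```python
def epsilon(delta_substrate, delta_product):
    """Isotope fractionation (in permil) between substrate and product.

    Uses epsilon = (alpha - 1) * 1000 with
    alpha = (delta_substrate + 1000) / (delta_product + 1000),
    where both delta values are given in permil.
    """
    alpha = (delta_substrate + 1000.0) / (delta_product + 1000.0)
    return (alpha - 1.0) * 1000.0


def mixing_fraction(observed, end_member_a, end_member_b):
    """Fraction of carbon derived from end member A in a linear two-source mix.

    All three arguments must be in the same units (delta or epsilon values).
    The result is clamped to the range 0..1.
    """
    if end_member_a == end_member_b:
        raise ValueError("end members must differ")
    fraction = (observed - end_member_b) / (end_member_a - end_member_b)
    return max(0.0, min(1.0, fraction))
```

As a usage example with made-up values, an observed fractionation of 40‰ that lies halfway between a hypothetical autotrophic end member of 60‰ and a hypothetical heterotrophic end member of 20‰ gives `mixing_fraction(40.0, 60.0, 20.0) == 0.5`, i.e., half of the lipid carbon would be attributed to each source.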

Bacterial lipids display the carbon isotopic fractionation expected from the CBB cycle relative to CO2 (εCO2-lipid of 20.9–28.8‰ observed vs. 30‰ theoretical) and not that expected from fixation via the rTCA cycle (εCO2-lipid < 12‰ theoretical). Sequences encoding the CBB pathway are fairly abundant in the ecosystem throughout the recovery phase [12] and likely contributed to the bacterial lipid pool of samples collected during that period. This agrees with previous genomic findings that identified several highly active iron-oxidizing Gallionella species carrying this pathway [12]. Importance of Gallionella in Colorado Plateau aquifers is further indicated by the association of organic carbon with fossilized Gallionella cells in postdepositional iron concretions of the Navajo sandstone [40]. However, genomic analyses suggested that one of the most abundant organisms in the shallow aquifer (Sulfurimonas sp.) fixes carbon via the rTCA cycle [12]. From mass balance calculations using the observed and theoretical fractionations, we estimate that carbon fixed via the rTCA cycle contributes as little as 12% to the bacterial biomass in the deep and intermediate aquifer but up to 78% of the biomass in the shallow aquifer (Fig. 2b). Overall, the observed carbon isotopic composition of the bacterial lipids could be explained as the result of a mixture of Sulfurimonas-derived lipids and lipids formed via the CBB pathway.

Degree of unsaturation of bacterial IPL changes with groundwater source depth

Using in-depth analyses of IPLs we tracked the abundance of IPL-bound bacterial unsaturated FAs across the eruption cycle. The unsaturations presumably correspond to double bonds but due to the mode of detection, we cannot strictly rule out cycloalkyl groups found in FAs of some bacteria [54], although typically not in higher numbers than one per FA. Interestingly, the relative abundance of highly unsaturated FAs correlated with the groundwater depth source (Fig. 2c). The cumulative abundances of IPLs with one or two double bond equivalents in their FA side chains were fairly consistent throughout the cycle, indicating little variation between the different groundwater sources. However, IPLs with seven or more unsaturations, i.e., at this high number presumably double bonds, were relatively abundant during the first phase, when groundwater was sourced from intermediate depths. These lipids were even more abundant during the middle phase, during which groundwater derives from the greatest depth, and almost undetectable in samples collected in the final shallow groundwater eruption phase. One explanation for elevated abundance of polyunsaturated lipids is their derivation from eukaryotes [55]. The occurrence of tentatively identified DGCC-type (1,2-Diacylglyceryl-3-O-carboxyhydroxymethylcholine) betaine lipids is unprecedented in bacteria and supports the presence of eukaryotes in the ecosystem, although the pathway for generating these lipids and its phylogenetic distribution remain unknown [56]. In general, eukaryotes have been found in the geyser [57], primarily in a sample of decayed wood added to the geyser conduit, and they have been detected by rpS3 analysis in the current study. However, they are not very abundant, and fluctuate heavily throughout the cycle (Fig. S7).
Due to the pronounced abundance maxima during deep aquifer eruptions, the most likely explanation for the presence of polyunsaturated FAs is their origin from organisms adapted to high pressures in the deeper subsurface. Bacteria are an additional, potential source of polyunsaturated FA, as the biosynthetic capacity for these lipids is widespread in terrestrial and aquatic bacteria such as Shewanella, Vibrio, and Geobacter spp. [58,59,60]. Incorporation of double bonds in bacterial FAs is a well-known mechanism that increases membrane fluidity at high pressure and low temperature [61, 62]. Consequently, a great diversity of unsaturated FA biosynthesis gene sequences is found in the Crystal Geyser metagenomes. For instance, we detected 1959 different protein clusters (>10% dissimilarity) of 3-oxoacyl reductases, representing 11,548 protein sequences in total (Fig. S8). As temperature remained nearly constant at around 17 °C (Fig. 2), high FA unsaturation could represent an adaptation to the high pressures faced by indigenous bacterial communities in the intermediate and deep aquifers, supporting a direct link between groundwater sources and lipid profiles.

Predicting linkage of IPLs to uncultivated organisms

We detected 295 different IPLs in the 14 lipidomes but a strict organism-lipid relation was unresolved due to the complexity of the community. Assignment of lipids to specific organisms is further complicated by the existence of multiple potential source organisms for common lipid types and the distinct characteristics of low-energy habitats in the subterranean aquifers. Distinct turnover times of lipids and DNA as well as lipid recycling, which may be a common strategy utilized by energy-starved archaea in the subsurface [63,64,65], could adversely affect correlations. While relative turnover times of DNA and lipids remain unconstrained, the predominance of chemically labile phosphoester IPLs in Crystal Geyser facilitates faster turnover of lipids than in marine deep biosphere environments, where ether-based IPLs, including glycolipids, are prevalent [66, 67]. Irrespective of whether they represent snapshots of a dynamic system or signals accumulated over longer timescales, the systematic changes in metagenomes and lipidomes indicate distinct, stratified habitats within Crystal Geyser.

In the current study, we used a time series of 14 metagenomic and coupled lipidomic data sets to establish correlations between marker gene abundances and IPLs. Based on this analysis, we tested for evidence supporting the assignment of lipids to organisms. Specifically, relative abundance patterns of individual organisms were correlated with the relative abundance of the 295 IPLs (organisms and lipids were considered only if they were identified in at least seven out of fourteen samples). Lipids were also co-correlated with other lipids, and primary and secondary lipid assignments were investigated via a network analysis (Fig. 3). Although the majority of IPLs were found to be unspecific, significant correlations were observed between a subset of lipids and organisms: 44 primary lipids correlate significantly with 22 different marker genes (organisms) and 63 secondary lipids (Table 1).
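The distinction between primary, secondary, and unspecific lipids can be expressed as a small classification routine. The sketch below is one possible reading of the definitions given in "Material and methods" and is not the authors' network code. The input structures, the function name, and the lipid and organism labels in the usage note are hypothetical.

```python
def classify_lipids(primary, lipid_links):
    """Label lipids as primary, secondary, or unspecific.

    primary: {lipid: organism} for lipids that correlate directly with an organism.
    lipid_links: iterable of (lipid, lipid) pairs with a significant co-correlation.
    Returns {lipid: (label, set_of_organisms)}; lipids linked to no primary
    lipid are left out of the result.
    """
    result = {lipid: ("primary", {organism}) for lipid, organism in primary.items()}
    linked = {}
    for first, second in lipid_links:
        # Check the pair in both directions, since links are undirected.
        for lipid, partner in ((first, second), (second, first)):
            if lipid not in primary and partner in primary:
                linked.setdefault(lipid, set()).add(primary[partner])
    for lipid, organisms in linked.items():
        # A lipid tied to primary lipids of different organisms is unspecific.
        label = "secondary" if len(organisms) == 1 else "unspecific"
        result[lipid] = (label, organisms)
    return result
```

For instance, with primary lipids `P1` (organism A) and `P2` (organism B), a lipid that co-correlates only with `P1` would be labeled secondary for organism A, whereas a lipid that co-correlates with both `P1` and `P2` would be labeled unspecific.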

Fig. 3: Correlation network analysis of relative abundances of organisms (rpS3 genes) and relative abundance of IPL signatures.

The primary lipids were defined based on a direct correlation of their relative abundance with rpS3 gene abundance (Bonferroni-corrected p value < 0.005). Secondary lipids showed a significant correlation with primary lipids and are indicative of a biological connection between the lipids (e.g., lipids from microbial symbionts or co-correlated organisms). Unspecific lipids shared primary lipids with different organism assignments. Due to visual limitations only a few IPL names are displayed in the figure; all organism-to-lipid correlations are provided in Table 1, and raw data can be accessed in Tables S4 and S5.

Table 1 Correlation of rpS3 gene abundances from metagenomic read mapping with relative abundance of IPL signatures across samples. Primary lipids are direct correlations, secondary lipids are those that correlated with primary lipids.

It is important to note that all significantly correlating ether-based isoprenoid lipids were assigned to archaea (Ca. Huberiarchaeum crystalense), as this provides confidence in the correlation-based approach. However, it is unclear whether correlation of the main lipid of Ca. H. crystalense and one bacterial lipid is a spurious covariation or if this represents assimilation of a bacterial membrane lipid by archaea (Huberiarchaeum did not correlate with that bacterial lipid; Table 1). Of particular interest were the lipids of Altiarchaeota, since these had been characterized earlier [52]. These previously detected lipids, including hexose-pentose archaeol (1G-1pentose-AR; for mass spectrometric identification see Fig. S4) and dihexose extended archaeol (2G-ext-AR), were the most abundant archaeal lipids in the current study but most abundances showed little correlation with the Altiarchaeota abundances. On the one hand, this might be due to the presence of multiple different strains of Altiarchaeum sp. in the samples (based on rpS3 genes; Fig. S9), which can harbor different lipid profiles as shown previously [68]. On the other hand, the main archaeal IPL (2G-AR) was also present in the sample filtered through a 0.2-µm filter and collected onto a 0.1-µm filter but Altiarchaeum sp. DNA was not (based on rpS3 genes). This indicates the lysis of Altiarchaeum sp. during the filtration process, possibly due to oxygen stress, to which Altiarchaeota in Crystal Geyser are apparently not resistant [52]. Altiarchaeota in Crystal Geyser also have Ca. H. crystalense as a symbiotic partner [69], which could derive its lipids from the Altiarchaeota and was indicated to possess genes whose products might be involved in lysis of Altiarchaeota cells [12]. In addition, longer turnover times of the chemically stable ether-bound lipids of archaea [66, 67] compared with DNA could weaken correlations.
Nevertheless, some IPL signatures (e.g., 2G-ext-AR) showed a significant correlation with the sum of rpS3 abundances of all Altiarchaeum sp. in the sample, supporting the above-mentioned assumptions (Fig. S9).

We detected one low-abundance archaeal lipid, an unsaturated variant of 2G-ext-AR (2G-1uns-ext-AR), which had not been identified in Altiarchaeota. This may be a previously unrecognized membrane component of Altiarchaeota or may derive from another archaeon. Its abundance correlated only weakly with other Altiarchaeota lipids but highly significantly with the abundance of Huberiarchaeum; thus, it may derive from this organism. Huberarchaeota are the second most abundant archaea after Altiarchaeota in this ecosystem, and they are predicted to have the genes required to synthesize lipids from scavenged isopentenylpyrophosphate [12]. The molecular structure of 2G-1uns-ext-AR differs by only one double bond from the Altiarchaeota lipid 2G-ext-AR, so Huberarchaeota may largely derive their lipids from Altiarchaeota, which were suggested to be their hosts [12]. The relative abundance of 2G-1uns-ext-AR correlated significantly with 2G-ext-AR, highlighting the potential biological meaning that can be inferred from IPLs whose abundances correlate with other lipids rather than with particular organisms. Given the confident assignment of 2G-1uns-ext-AR to Huberarchaeota, we used the p value for that assignment as a conservative correlation p value for further predictions (Bonferroni-corrected p value < 0.005), which are presented in Table 1.
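The correlation-based assignment with a Bonferroni-corrected threshold can be illustrated with a minimal sketch. The correlation statistic used in the study is not specified in this section, so the sketch assumes a Pearson correlation with an exact permutation p value. The abundance values, the names `lipid_A` and `lipid_B`, and the helper functions are hypothetical and serve only to show the logic; only the 0.005 threshold is taken from the text.

```python
from itertools import permutations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def exact_permutation_p(x, y):
    """Two-sided p value from all permutations of y (feasible for few samples)."""
    observed = abs(pearson(x, y))
    hits = total = 0
    for perm in permutations(y):
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(pearson(x, perm)) >= observed - 1e-12:
            hits += 1
    return hits / total

def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical relative abundances across seven samples (not data from the study).
organism = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
lipids = {
    "lipid_A": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0],  # tracks the organism
    "lipid_B": [5.0, 1.0, 6.0, 2.0, 7.0, 3.0, 4.0],     # unrelated pattern
}

raw_p = {name: exact_permutation_p(organism, values) for name, values in lipids.items()}
adjusted = dict(zip(raw_p, bonferroni(list(raw_p.values()))))
THRESHOLD = 0.005  # corrected threshold quoted in the text
significant = {name: p < THRESHOLD for name, p in adjusted.items()}

for name in lipids:
    print(name, round(raw_p[name], 5), round(adjusted[name], 5), significant[name])
```

In this toy case only the lipid that tracks the organism remains significant after correction. With real data, the number of tests would be the number of lipid-organism pairs evaluated.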

Several bacterial groups were correlated with the occurrence of cardiolipins (diphosphatidylglycerol (DPG) lipids), which are involved in osmotic stress response, membrane ordering, and regulation of cell curvature [70,71,72,73]. Specifically, DPGs are required for maintaining cell shape in rod-shaped bacteria [71]. Consistent with this role, DPGs found in Crystal Geyser correlated with clades that typically form rods or elongated cells, including the Flavobacteriaceae and Gallionellaceae (Table 1). These matching correlations further validate our statistical approach.

Lysolipids and Candidate Phyla Radiation bacteria

To investigate lipids of bacteria from the CPR [13], we analyzed the IPLs of a small cell size fraction collected on a 0.1-µm pore-size filter after 0.2-µm pre-filtration. Based on the corresponding metagenome, the sample contained 186 different organisms, 165 of which were classified as CPR based on rpS3 sequences, and one low-abundance organism was classified as a member of the DPANN radiation (Ca. H. crystalense). Surprisingly, the most abundant organism in the sample based on metagenomics was a Sulfurimonas, which apparently passed through the 0.2-µm filter (read mapping-based coverage was 8.4 on the 0.2-µm filter and 1081.9 on the corresponding 0.1-µm filter). We identified 72 different IPLs in the post-0.2-µm sample, all of which were acylglycerols. Consequently, the CPR organisms in this sample must possess FA-based lipids. This is important because the composition of lipids of CPR bacteria is unknown. Interestingly, 22 of the 72 lipids (31%) were lysolipids, all of which contained betaine headgroups (for structural characterization see Fig. S3). By contrast, these lipids constituted only 18% across the entire sample set. Cultured bacteria contain only a small fraction of lysolipids; e.g., Sulfurimonas has been reported to contain only a single lysolipid with ~4% abundance [74]. Further, the abundances of several CPR bacteria also correlated significantly with the abundance of specific lysolipids (Table 1).
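The two ratios quoted above can be reproduced with simple arithmetic. The following sketch uses only the numbers given in the text; the variable names are illustrative.

```python
# Counts and coverages as quoted in the text.
n_ipls = 72          # IPLs identified in the post-0.2-µm sample
n_lysolipids = 22    # of which lysolipids
coverage_02 = 8.4    # Sulfurimonas coverage on the 0.2-µm filter
coverage_01 = 1081.9 # Sulfurimonas coverage on the 0.1-µm filter

# Share of lysolipids among the identified IPLs, in percent.
lyso_percent = 100 * n_lysolipids / n_ipls

# Fold difference in Sulfurimonas coverage between the two filters.
coverage_fold = coverage_01 / coverage_02

print(f"lysolipid share: {lyso_percent:.1f}%")
print(f"coverage ratio (0.1-µm / 0.2-µm): {coverage_fold:.1f}")
```

The lysolipid share rounds to the 31% given in the text, and the coverage on the 0.1-µm filter is roughly 129 times that on the 0.2-µm filter.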

To further investigate the lysolipid content of CPR bacteria, we selected a sample taken during the recovery phase of the geyser, when only small amounts of Sulfurimonas are present, as indicated by metagenome sequencing [12]. For this sample, cells that passed through a 0.2-µm filter were collected onto a 0.1-µm filter for subsequent metagenomic sequencing and infrared spectromicroscopy. Metagenomic sequencing analysis of the selected sample (CG10_big_fil_rev_8_21_14_0.10; [12]) showed a high abundance of CPR, which occupied the first seven ranks of the community (rank abundance curve in Fig. S10). To test for the abundance of lysolipids in this CPR-rich sample, we performed FTIR analysis of the cells (Fig. 4a) and compared the results against a set of reference spectra (Fig. S11). For the first PCA, in the 3050–2800 cm−1 spectral region dominated by the aliphatic chains of the lipids, ~85% of the spectral variance is explained by the first five loading vectors (Fig. 4b). Here, the first loading vector contains 55% of the variance, with features that are similar to those of palmitic acid, including the asymmetric CH2 stretching peak centered at 2916 cm−1 (Fig. 4). The peak corresponding to the CH3 asymmetric stretching vibration was used to evaluate the nature of the polar head. Its position at 2951 cm−1 in PC1 agrees with that of lyso-phosphatidylcholine, whereas the peak for phosphatidylcholine is sharper and centered at 2957 cm−1. The corresponding heatmap of the PC1 scores (Fig. 4c) shows the presence of hotspots a few microns in diameter. The second, third, and fourth loading vectors, which explain 18, 10, and 2% of the variance, respectively, show different CH3 to CH2 ratios, and PC3 in particular can be assigned to free FA. In contrast, although the fifth loading vector accounts for only 1% of the variance, its spectral features can be assigned to highly branched and unsaturated lipids similar to those of archaea (Fig. 4b, c; Refs. [14, 75]; see Supplementary material for additional results). This agrees with the presence of DPANN archaea as the second most prominent group of organisms in this sample based on metagenomic profiling (Fig. S10). The combination of the detailed analysis of the IPLs and infrared imaging of two independently sampled small cell fractions suggests that a substantial fraction of some CPR cell membranes consists of lysolipids.
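Two steps of the spectral reasoning above can be sketched in a few lines: summing the variance explained by the loading vectors, and assigning the PC1 CH3 band to the nearest reference lipid. The percentages and band positions are those quoted in the text. Placing the lyso-phosphatidylcholine reference at 2951 cm−1 is an assumption based on the stated agreement with PC1, and the nearest-position rule is a simplification of the shape-based comparison with the reference spectra.

```python
from itertools import accumulate

# Variance explained by loading vectors 1 to 5, in percent (values from the text).
explained = [55, 18, 10, 2, 1]
cumulative = list(accumulate(explained))
print("cumulative variance (%):", cumulative)

# CH3 asymmetric stretching band positions in cm^-1.
# The lyso-PC position is an assumption inferred from its agreement with PC1.
reference_bands = {
    "lyso-phosphatidylcholine": 2951.0,
    "phosphatidylcholine": 2957.0,
}
pc1_band = 2951.0  # position observed in the first loading vector

# Assign PC1 to the reference whose band position is closest.
closest = min(reference_bands, key=lambda name: abs(reference_bands[name] - pc1_band))
print("closest reference for PC1:", closest)
```

The five percentages sum to 86%, in line with the ~85% stated for the first five loading vectors. A full analysis would compare entire band shapes rather than single peak positions.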

Fig. 4: FTIR analysis of a small cell size fraction (post-0.2-µm filter collected onto a 0.1-µm filter).

a Field of view in FTIR, 1 × 1 mm (red square). b First five PCA loadings accounting for ~90% of the variance. They describe the directions of maximum variability of the analyzed system. The figure presents the first five vectors, which can be assigned spectroscopically, by similarity of shape and band position, to different types of lipids. c False color maps representing PCA scores PC1 and PC5, respectively. These maps show how the different lipids represented by the eigenvectors in (b) are distributed in the sample. The comparison of the spectral features of the loadings and the reference spectra in Fig. S11 allows assignment of PC1 to lysolipids and PC5 to unsaturated/branched lipids. The arrows point to a hotspot of cells indicating a particularly high distribution of lysolipids (PC1), surrounded by several smaller hotspots of unsaturated/branched lipids (PC5). Given the micrometric lateral resolution of the image (each pixel is 2.6 µm), it is possible to hypothesize that there is a small group of cells in the hotspot area that is characterized by a distinct membrane lipid composition. This can also be observed in other spots throughout the measured biomass. Loadings of the PCA over the whole 900–3700 cm−1 spectral range are provided in Fig. S12. Scale bar 200 µm.

Genome-resolved metagenomics generated 206 new genomes from the entire sample set. Together with 1215 previous genomes [12, 35], our data set included 675 genomes of CPR bacteria that were used to comprehensively investigate their potential for lipid biosynthesis (accessible through https://ggkbase.berkeley.edu/genome_summaries/1491-Bacterial_membrane_lipids_AJP). We found that the CPR genomes do not encode for any known, complete bacterial lipid biosynthesis pathway, yet CPR bacteria are known to have a cytoplasmic membrane based on cryogenic-transmission electron microscopy studies [14]. Interestingly, some members of the Nealsonbacteria phylum (Parcubacteria superphylum) have near-complete pathways for FA and phospholipid synthesis. They possess some homologs of the FA synthase type II (FAS-II), the main FA biosynthesis pathway in most bacteria. However, they lack the FAS-related acyl carrier protein (ACP) processing machinery (ACP synthase and malonyl-CoA:ACP transacylase). ACP is a peptide cofactor that functions as a shuttle that covalently binds all FA intermediates. Although they lack key genes for FA synthesis, we cannot rule out that this group could synthesize FAs by an ACP-independent pathway, as suggested for some archaea [76]. We also searched these genomes for genes coding for glycerol-3-phosphate (G3P) dehydrogenase, an enzyme responsible for the stereochemistry of the glycerol units of their membrane lipids, and for acyl-ACP transferases responsible for the formation of ester bonds between FAs and the G3P backbone in phospholipid synthesis. There are two families of acyltransferases responsible for the acylation of the C1-position of the G3P. The PlsB acyltransferase primarily uses ACP end products of FA biosynthesis (acyl-ACP) as acyl donors. The second family involves the PlsY acyltransferase and is more widely distributed in Bacteria. PlsY uses acyl-phosphate, produced from acyl-ACP by PlsX (an acyl-ACP:PO4 transacylase enzyme), as the acyl donor. The acylation in the C2-position of the G3P is carried out by the 1-acylglycerol-3-phosphate O-acyltransferase (PlsC). Screening the Nealsonbacteria genomes, we did not detect any homologs of the first family of acyltransferase, PlsB. However, we identified PlsY and PlsC, but not PlsX. Absence of PlsX raises the question of which enzyme or mechanism produces the acyl-phosphate needed as a substrate for PlsY. Overall, mechanisms or enzymes that produce and/or require ACP were not identified in CPR genomes in this study. Even though this finding opens the possibility for the presence of ACP-independent pathways for FA and/or lipid synthesis in these CPR bacteria, we cannot conclude with confidence that even a few of these organisms can synthesize lipids de novo. Thus, we suggest that most CPR bacteria derive their membrane lipids, including lysolipids, from coexisting bacteria. Given the small cell size of CPR, lysolipids may be preferred due to their role in reducing membrane curvature stress (e.g., Ref. [77]). As lysolipids can form during lipid breakdown (e.g., mediated by phospholipase A [78]) and can be taken up by other bacteria [79], their utilization by CPR may indicate uptake from degraded bacterial biomass or direct derivation from host cells.
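The acyltransferase screening logic described above can be summarized as a presence/absence check. The following is a minimal sketch, assuming each G3P acylation route is reduced to the enzymes named in the text. The route labels and the set-based representation are illustrative and do not reflect the actual annotation pipeline.

```python
# Enzymes required for each G3P acylation route, simplified from the text.
routes = {
    "PlsB route": {"PlsB", "PlsC"},             # acyl-ACP used directly at C1
    "PlsX/PlsY route": {"PlsX", "PlsY", "PlsC"}, # acyl-phosphate intermediate at C1
}

# Homologs reported as identified in the Nealsonbacteria genomes.
detected = {"PlsY", "PlsC"}

# For each route, list the required enzymes that were not detected.
missing = {name: sorted(required - detected) for name, required in routes.items()}

for name, gaps in missing.items():
    status = "complete" if not gaps else "missing " + ", ".join(gaps)
    print(f"{name}: {status}")
```

Under these assumptions neither route is complete: the first lacks PlsB and the second lacks PlsX, which matches the open question about the source of acyl-phosphate for PlsY.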

Model of lipid transfer in the community and conclusions

Our approach combined detailed metagenomics with whole community lipidomics and infrared spectroscopy and was informed by isotopic measurements that were constrained by detailed understanding of the geological context. The objective was to probe the carbon cycle within the subsurface microbial ecosystem, particularly the source of fixed organic carbon, but also to investigate evidence for its redistribution into other organisms, especially putative symbionts. Although sample limitation resulted in a lower resolution of isotopic analyses compared with metagenomics, carbon isotope systematics of archaeal and bacterial lipids confidently support the metagenomic predictions that microbial biomass is mostly of autotrophic origin in all aquifers sampled. In particular, our results provide evidence that predicted autotrophs were fixing CO2 in situ, using the WL (Altiarchaeum), rTCA (Sulfurimonas), and CBB cycles (Gallionella).

Using lipidomics and infrared spectroscopy on size-fractionated cells, we demonstrate that CPR bacteria with small cell size possess FA-based IPLs, although the corresponding genomes do not encode for a known pathway to synthesize them. Similarly, Huberarchaeota, potential symbionts of Altiarchaeota, were predicted to possess altered archaeal lipids related to those of their putative hosts. Our results support the notion that organisms of the CPR and DPANN radiation do not only scavenge (or symbiotically receive) molecular building blocks or even intact lipids from other bacteria and archaea but also use the corresponding lipids and introduce modifications (Fig. 5).

Fig. 5: Model for the acquisition and redistribution of carbon and lipids in the deep subsurface ecosystems of the Colorado Plateau (USA) accessible through Crystal Geyser.

Organic carbon and lipids are produced by Gallionella, Sulfurimonas, Altiarchaeum spp. or other autotrophs, redistributed through the ecosystem and acquired by other community members including CPR bacteria and DPANN archaea.