Introduction

Symbiotic interactions are often mediated by specialised (secondary) metabolites of symbiont origin that serve in communication, signalling, and defence1,2. The genetic machinery encoding the production of these metabolites is organised in biosynthetic gene clusters (BGCs) that usually encode all genes required for their biosynthesis3. The repertoire of compounds produced by bacterial and fungal symbionts is vast and rarely comprehensively characterised in a symbiotic or coevolutionary context. Characterising such landscapes can provide insights into the natural product repertoire of symbionts and the genetic underpinnings of selection regimes that have characterised the evolutionary histories of metabolite gene clusters of importance in symbioses4,5.

Fungus-farming termites in the subfamily Macrotermitinae (Blattodea, Termitidae) have cultivated basidiomycete fungi in the genus Termitomyces (Agaricomycetes, Lyophyllaceae) for 30 million years (Fig. 1A)6. Termite colonies maintain monoculture fungus gardens (Fig. 1B) in optimal growth conditions on a constant supply of plant substrate for fungal growth and, in return, benefit from a dependable nutritionally enriched fungal food source7,8. Over the course of 30 million years, the symbiosis has diversified to include more than 300 described termite and 52 described Termitomyces species9, displaying some degree of association specificity indicative of host-symbiont adaptation and coevolution6,10.

Fig. 1: Phylogenomic analyses of the genus Termitomyces and its biosynthetic potential.

A A fungus-farming termite colony of Macrotermes natalensis in South Africa (Photo credit: Michael Poulsen), within which termite hosts cultivate Termitomyces in fungus combs (Photo credit: Saria Otani) (B). C Phylogeny based on 100 single-copy orthologous genes allowed for solid placement of 39 genomes of Termitomyces, for which individual strain labels are coloured by termite host genus (green: Odontotermes, blue: Macrotermes, yellow: Microtermes, pink: Ancistrotermes, brown: Pseudacanthotermes and black: unknown). Support values are ultrafast bootstrap support (left) and SH-aLRT (right). Based on species-delimitation results using the Poisson tree processes (PTP)64, genomes were assigned to 21 species, indicated by letters A-U (Supplementary Table 2). The genomes ranged in completeness based on the BUSCO Agaricales gene set (48–98%). D The BUSCO measure of overall genome completeness was only weakly associated with the number of BGCs identified per genome (10–35). The left portion of the panel provides a heatmap of predicted BGCs for each genome (Supplementary Table 3), amounting to a total of 754 BGCs (Supplementary Data 3). E Violin plot of the percentage of complete BGCs by class across all Termitomyces species.

Specialised metabolites produced by Termitomyces have been hypothesized to play roles in signalling and defence, and Termitomyces species produce a range of metabolites, including terpenes, fatty acids, indoles, polyketides, non-ribosomal peptides (NRPs), and phenylpropanoids11,12. These classes of specialised metabolites cover a broad range of functions in fungi, from the frequent antimicrobial properties of NRPs13, the iron capture of siderophores14, intra- and inter-kingdom signalling roles of terpenes and terpenoids15, to roles of indoles in growth and development16. More than 257 bioactive specialised metabolites have been identified over the past decades, almost exclusively from chemical analyses of mushrooms17,18, fungal gardens19, and cultured isolates under different growth conditions20,21, but only a fraction of these has been tied to BGCs22. This is in part a consequence of the length of BGCs, which often precludes their identification in low-quality genomes23. With only 11 of 52 Termitomyces species having been explored, and with varying analytical approaches, it has remained unclear how consistent the biosynthetic potential is across the genus and how the evolutionary histories of BGCs have been shaped by millions of years of coevolution with termite hosts.

To obtain a robust understanding of the biosynthetic potential of Termitomyces, we improved the annotation of 22 existing genomes and generated 17 new genomes, which allowed for comparative genomics analyses of the evolutionary histories of gene clusters coding for the production of specialised metabolites across 21 Termitomyces species. We combined in silico tools with manual curation to identify BGCs that putatively synthesize compounds from five major specialised metabolite compound groups. We then established their consistency across species and evaluated their evolutionary trajectories and signatures of positive selection that could indicate an ongoing arms race between Termitomyces chemistry and antagonists. We show that a small number of Termitomyces BGCs represent a core set of specialised metabolites for the genus that has likely remained important over the course of millions of years of coevolution with termites. The variation we observe in BGC profiles between species is characteristic of gains or losses over time that could indicate ecological importance to specific termite-Termitomyces pairs, and signatures of positive selection in a subset of BGCs imply the potential for defensive functions subject to selection pressures invoked by target antagonists.

Results

High-resolution consensus phylogeny and extensive presence of BGCs across the genus

Capitalising on 22 existing and 17 newly sequenced genomes of Termitomyces and extensive optimisation of assemblies and annotation procedures, we generated a consensus phylogeny revealing that the 39 genomes spanned 21 different species – a substantial fraction of the known Termitomyces species diversity6,24 (Fig. 1C). Robust confidence scores indicated that most branch splits were well supported. This high-resolution phylogeny represents one of the most robust phylogenetic analyses of the genus to date, based on the best-quality Termitomyces genome set available, and is congruent with previously established termite-fungus co-phylogenies6,10.

The optimised annotations allowed identification of 754 BGCs (Supplementary Table 3 and Supplementary Data 3) across all strains, ranging from 10–35 BGCs per genome. BGCs encoding terpene biosynthesis were most abundant, with 2–19 per genome, followed by 1–11 non-ribosomal peptide synthetases (NRPS-like), 1–6 ribosomally-synthesized and fungal post-translationally modified peptides (fungal-RiPP-like), one indole-like (missing in two and fragmented in three genomes), one NRPS-independent (NI)-siderophore (missing in two and fragmented in three genomes), and one NRPS-siderophore (missing in eight genomes) (Fig. 1D). Throughout the manuscript, we refer to the compound class a BGC encodes the biosynthesis of as its “class”, e.g., BGCs encoding terpene biosynthesis are referred to as terpene BGCs25,26. Predictably, there was greater variation in the percentage of complete BGCs in each class for terpene and NRPS-like BGCs, as they were more abundant. However, the majority of BGCs were complete for each Termitomyces species within a class (Fig. 1E and Supplementary Data 3). Notably, BGCs classified as fungal-RiPP, based on the DUF3328 UstY-like HMM profile hit, should be interpreted with some caution. To establish that these BGCs are indeed fungal-RiPPs, the precursor peptide would need to be annotated25,26. We therefore tentatively consider these BGCs as “fungal-RiPP-like”. Similarly, indole BGCs were classified as such when a BGC encodes a dimethylallyltryptophan synthase (DMATS)-type prenyltransferase; however, DMATS-type prenyltransferases can also prenylate non-indole substrates27. To remain cautious about the final products of these BGCs, we refer to them as indole-like BGCs.

Grouping the diversity of detected BGCs into GCFs based on their shared domains and encoded backbone enzyme identities allowed us to distribute 660 of the 754 BGCs into 61 distinct GCFs (Fig. 2B–D, Supplementary Table 4). BGCs not assigned to a GCF, e.g., singletons or very short BGCs, were excluded from subsequent analyses. The retained GCFs represented putative BGCs encoding 32 terpene cyclases, 19 NRPS-like, seven fungal-RiPP-like, one potential indole, one NRPS-siderophore, and one NI-siderophore. The NRPS-siderophore (GCF6) had a conserved core enzyme domain architecture (adenylation (A)1-thiolation (T)1-condensation (C)1-T2-C2-T3-C3) (Fig. 3B) and subsequent sequence analysis revealed close similarity to a type VI basidiomycete siderophore synthetase. This synthetase is believed to produce the trimeric siderophore basidioferrin14,28, indicating that this BGC encodes an NRPS-based siderophore (NRPS-siderophore). The BGCs for the NI-siderophore (GCF2a+b) and fungal-RiPP-like (GCF5a+b) GCFs each encoded two core enzymes that were situated either on a single contig or split across two contigs (Supplementary Data 3 and 4). This separation appeared to be due to fragmented genome assemblies, as subsequent alignment and phylogenetic analyses of core and tailoring genes revealed that the members of each pair belong to a single GCF (Supplementary Data 4 and Figs. 1 and 2). The two core enzymes in the NI-siderophore both contained an N-terminal IucA/IucC family domain and a C-terminal conserved ferric iron reductase FhuF-like transporter domain, whereas the fungal-RiPP-like BGCs encode two enzymes: one with a hit to the adenylate-forming (AMP-binding) profile and one with a hit to the DUF3328 profile (Fig. 3 and Supplementary Data 4).

Fig. 2: Specificity in BGC content within and between Termitomyces species.

A Comparison of the Termitomyces species tree (left, derived from Fig. 1C) with cluster analyses of the similarity in GCF composition (Supplementary Table 4) (based on complete linkage) revealed some degree of matching. Colours behind the species tree tip label are by termite host genus: green: Odontotermes, blue: Macrotermes, yellow: Microtermes, pink: Ancistrotermes, brown: Pseudacanthotermes and black: unknown. Presence (blue) / absence (white) heatmap of GCFs (class labels at the bottom), organised by their profile similarity based on the BiG-SCAPE distance matrix67 across Termitomyces species. This revealed a set of universal GCFs across species, a set of common but not universal GCFs, sets of species-specific GCFs and a set of inconsistently distributed GCFs, including singletons. B–D Network representation of the 61 GCFs (BGCs) visualised using Cytoscape 3.9.1, where each node in a network represents a BGC in a species (labelled A-U), edges represent their relations, and colours represent BGC class (Supplementary Table 5 and Supplementary Data 4). B Networks of the seven GCFs whose member BGCs were identified in all 21 Termitomyces species. Networks 2a + 2b (the siderophore) and 5a + 5b appeared to be distinct; however, further analysis showed each pair to be a single GCF (see text for details). C Networks of 21 terpene, NRPS-like, and fungal-RiPP-like GCFs for which member BGCs were present in only a subset of Termitomyces species but where all 1–4 genomes of a given species had the BGC. D Networks of the remaining 33 GCFs which contained BGCs that were only present in a subset of genomes within a species.

Fig. 3: A subset of BGCs contain genes under positive selection.

A Box plot showing the percentage of genes with positive selection across BGC classes (left) and 25 GCFs ordered from highest to lowest mean of genes with positive selection (right). Whiskers extend to 1.5*Interquartile Range (IQR). NRPS and NRPS-like BGCs generally contain genes experiencing more positive selection than other BGC classes (Supplementary Table 6 and Supplementary Data 6). B Gene cluster structures of the nine GCFs with most genes with positive selection, of which two (GCFs 6 and 7) displayed consistent positive selection in the same gene in every BGC assigned to the GCF. GCFs are displayed as consensus representations of their contained BGCs. The number of species that contain a gene is indicated in the box denoting the gene in the cluster. The type of gene as assigned by fungiSMASH is indicated by colour. Genes with positive selection are indicated with bar plots of the percentage of genes displaying signatures of positive selection. If one or two species have positive selection for a given gene, the species identities are indicated with icons (A–U) (Supplementary Data 4). This is only noted if positive selection was present in 50% or more of the genomes for a given species. In each representation of a BGC we show the identified protein families obtained from the Pfam database, a = more than one known domain was present within the gene. All Pfam annotations are given in Supplementary Data 4. DUF = Domain of unknown function. Significance levels for positive selection analyses are: * = p < 0.05, ** = p < 0.01, and *** = p < 0.001. The remaining GCFs that were analysed for positive selection but had no or fewer branches under positive selection are available in Fig. S4.

Consistent GCFs across the Termitomyces phylogeny included one indole-like, one NI-siderophore, one NRPS-like, one fungal-RiPP-like, and three terpene-encoding GCFs that were identified in all 21 Termitomyces species (Fig. 2A + B, Supplementary Tables 4 and 5), implying that identical or very similar specialised metabolites are universally present across the genus. Twenty-one GCFs were only present in a subset of Termitomyces species (Fig. 2C), but they were represented in all genomes sequenced for each of these species (excluding genomes with BUSCO score <60%). These GCFs thus likely exist only in these species and their absence in other species is unlikely to be a product of poor annotation or assembly (Fig. 2, Supplementary Table 5 and Supplementary Data 3). Of these, GCFs 33, 34, 44, 60, and 65 were only present in Termitomyces species cultivated by Macrotermes spp. The remaining 33 GCFs were inconsistently detected across the diversity of Termitomyces (Fig. 2D). Given that the genomes were incomplete, we cannot rule out that some GCFs exist in more genomes than our analyses indicate. Although GCF profiles were also significantly affected by genome completeness (BUSCO score) (PERMANOVA: F1,11 = 2.427, R2 = 0.0313, p = 0.0242), termite host (F4,11 = 6.535, R2 = 0.3369, p < 0.001) and Termitomyces species (F13,11 = 2.860, R2 = 0.4859, p < 0.001) more strongly impacted profiles.
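The three-way grouping of GCFs described above (universal, present in every genome of only some species, or inconsistently detected) can be illustrated with a minimal sketch. The function name, the input layout and the category labels are assumptions made for illustration and are not taken from the analysis pipeline; the sketch also does not apply the exclusion of genomes with BUSCO scores below 60%.

```python
from collections import defaultdict

def classify_gcfs(presence, species_of):
    """Group GCFs by how consistently they occur within and across species.

    presence   : dict mapping genome id -> set of GCF ids found in that genome
    species_of : dict mapping genome id -> species label (e.g. "A" to "U")
    Both layouts are hypothetical and chosen only for this illustration.
    """
    genomes_by_species = defaultdict(list)
    for genome, species in species_of.items():
        genomes_by_species[species].append(genome)

    all_gcfs = set().union(*presence.values())
    categories = {}
    for gcf in sorted(all_gcfs):
        full = set()     # species in which every genome carries the GCF
        partial = set()  # species in which only some genomes carry it
        for species, genomes in genomes_by_species.items():
            hits = sum(gcf in presence.get(g, set()) for g in genomes)
            if hits == len(genomes):
                full.add(species)
            elif hits > 0:
                partial.add(species)
        if partial:
            categories[gcf] = "inconsistent"
        elif len(full) == len(genomes_by_species):
            categories[gcf] = "core"
        else:
            categories[gcf] = "species-consistent"
    return categories

# Toy input: two genomes of species A and one genome of species B.
toy_presence = {"g1": {"GCF1", "GCF2"}, "g2": {"GCF1", "GCF2", "GCF3"}, "g3": {"GCF1"}}
toy_species = {"g1": "A", "g2": "A", "g3": "B"}
print(classify_gcfs(toy_presence, toy_species))
```

In the toy input, GCF1 is found in every genome of every species, GCF2 in every genome of species A only, and GCF3 in just one of the two genomes of species A.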

BGCs with signatures of positive selection

Significant signatures of positive selection were absent for the vast majority of BGC genes belonging to the 25 GCFs we tested (143 of 216 orthologs; aBSREL, p > 0.05) (Supplementary Table 6 and Supplementary Data 5). The remaining 72 orthologs had significant (p < 0.05) dN/dS ratios (ω > 1) on specific gene branches, suggesting positive selection (Supplementary Data 6). From the 25 tested GCFs, genes originating from the single indole-like and 14 terpene biosynthesis-encoding BGCs generally experienced less positive selection than genes from the single NI-siderophore, fungal-RiPP-like, NRPS-siderophore and seven NRPS-like (Fig. 3A) BGCs, as evident from lower omega values (Supplementary Data 6). This suggests that these BGCs are more conserved than e.g., those of NRPS and NRPS-like GCFs highlighted in Fig. 3B. We observed evidence for gene-wide positive selection in a core gene of all BGCs in GCF6 and an unknown gene in GCF7 (Fig. 3B). Evidence for positive selection was, at times, limited to genes from specific species, such as gene 18 in the NRPS-siderophore GCF6, where positive selection was detected in Termitomyces sp. L (Fig. 3B, Supplementary Data 4). Similar patterns were identified in GCF1 (spp. K and J), GCF5 (spp. F, and H and J), GCF9 (spp. D and E), GCF11 (spp. K and J), GCF29 (spp. L, M and N), GCF30 (sp. E) and GCF44 (sp. L) (Fig. 3B and Supplementary Fig. 3).

Similarity to biochemically characterised BGCs

Using the MIBiG database, we were able to identify two GCFs for which the constituent BGCs had high similarity matches. First, the NRPS-siderophore (GCF6) showed high similarity to a synthetase forming the siderophore basidioferrin, which is widely distributed in basidiomycete fungi14. Second, GCF27 had high similarity to a BGC for (+)-δ-cadinol, which has also previously been identified in a basidiomycete29 (Supplementary Table 4) and shows cytotoxic activity30. Both BGCs were also identified in the closest known relative of Termitomyces (A. matolae; GCA_018855395). Many of the biochemically characterised terpene cyclases identified in Termitomyces have yet to be added to the MIBiG database22,28. Our manual similarity assessment demonstrated matches to four terpene BGCs (Supplementary Table 4), including GCF29, which comprises BGCs encoding the biosynthesis of (+)-germacrene D-4-ol and is present in all Termitomyces species. (+)-Germacrene D-4-ol is typically produced by plants, where it has been linked to antimicrobial properties31,32, and a study has linked it to roles as an insect pheromone33. Our analysis also uncovered BGCs in GCF37 in five species of Termitomyces that encode (-)-δ-cadinene28,34. GCF34, which was only identified in Macrotermes spp.-associated Termitomyces, contained BGCs that have previously displayed bifunctional activity capable of transforming geranyl pyrophosphate and farnesyl pyrophosphate into numerous terpenes, most of which have been identified within the fungus comb volatilome, including camphene and d-limonene28.

Discussion

Improved genomes allowed robust phylogenomics and elucidation of a core set of Termitomyces BGCs

By significantly enhancing the annotation of publicly available genomes and generating new high-quality annotated Termitomyces genomes, we established the most robust phylogenomic analysis of the genus to date. This allowed solid species assignment of 39 genomes to 21 species and the identification of 754 BGCs distributed across 61 GCFs that we could analyse in an evolutionary context. These efforts increased the available genomes for the genus from 28 to 45, spanning the five termite host genera Macrotermes, Microtermes, Ancistrotermes, Odontotermes, and Pseudacanthotermes, and will serve as an important genomics resource to address new questions in the termite-fungus symbiosis. Despite previous reports of polyketides in Termitomyces12, our analysis found no BGCs encoding polyketide synthases (PKSs). This implies that PKS BGCs were either not identified with the pipeline we employed or that they were recognised as other types of BGCs. Future work is needed to improve the bioinformatic identification and classification of PKS BGCs, and to develop more comprehensive databases. Experimental validation of predicted BGCs and their associated compounds will also be crucial to understand the chemical identities of the BGCs uncovered. This further underlines that although our approach provides a comprehensive overview, it does not allow a complete account of the chemical diversity encoded in the genus.

We identified a core set of seven GCFs that were consistently present across the genus, and future improvements of genomes from sequenced species, as well as the sequencing of new species, will inevitably expand this set. As a case in point, GCF31 (encoding a terpene synthase), GCF6 (NRPS-siderophore), and GCF7 (NRPS-like) were consistently present across the Termitomyces phylogeny, yet missing in, respectively, one, two and seven fungal species, suggesting that improved genomes will fill these gaps. We therefore cautiously conclude that at least 10 GCFs are consistently present across the Termitomyces genus, encoding an indole-like, an NI-siderophore, an NRPS (presumably basidioferrin), a fungal-RiPP-like, two NRPS-like compounds, and four terpenes (two of which were putatively identified to synthesise (+)-δ-cadinol and (+)-germacrene D-4-ol). Of these BGCs, only two have previously been identified in other basidiomycete fungi (GCF6 and GCF27). The maintenance of a core set of BGCs across more than 30 million years of coevolution with termite hosts suggests important functions, and future work should prioritise deciphering roles of their products in symbiotic interactions with hosts or in defence.

BGC compositions are non-random across the phylogenetic history of the genus

Beyond the core set of GCFs, we identified a further 21 that were only present in subsets of Termitomyces species, indicating that some gene sequences linked to specialised metabolite syntheses likely serve specific roles in some species and in interactions with hosts. These were dominated by BGCs encoding terpenes (13) but also five NRPS-like and three fungal-RiPP-like BGCs. The species specificity in these BGCs is intriguing and may have arisen from species-specific losses of ancestrally universal BGCs or from gains over evolutionary time through horizontal gene transfers from other microorganisms. However, given that the assignments of several of the GCFs are based on only few genomes, it will be important in future work to verify this species specificity, in concert with determining the ecological factors that have led to their origins and persistence.

In line with species specificities and events of BGC gains and losses, we found that GCF profiles were to some extent congruent with the Termitomyces phylogeny, albeit with several instances of incongruence, suggesting either recent divergence or incomplete predictions. Despite these discrepancies, the patterns of broad-scale association specificity between termite host genera and Termitomyces imply some degree of host-specificity in BGC segregation and biosynthetic potential of Termitomyces. The most notable examples of this include five terpene clusters (GCFs 33, 34, 44, 45, and 60) and a fungal-RiPP-like cluster (GCF65) that were exclusive to Termitomyces cultivated by Macrotermes spp. Similarly, an NRPS-like compound (GCF24) and five terpenes (GCFs 45, 46, 47, 51, and 57) were only present in the Termitomyces sister species J and K, which are cultivated by Ancistrotermes and Microtermes6,35. As alluded to above, such patterns may arise from losses in all other Termitomyces species, should they have been present in the most recent common ancestor of the genus, or via gains over evolutionary time. Irrespective of their origin, the patterns imply that there is not a universal recipe for the biosynthetic capacities associated with being a termite cultivar. Elucidating the idiosyncrasies that surround uniqueness in biosynthetic potential will be important to improve our understanding of the fundamental biology of host-symbiont interactions and cultivar roles.

Potential for distinct evolutionary trajectories of different BGC classes

Although some BGCs contained genes subject to positive selection based on dN/dS ratios, the selection analyses indicated that most did not, and no GCF contained BGCs that experienced positive selection on all genes. Our analyses did, however, reveal 18 GCFs with significant positive selection on at least one site or branch. This was primarily within the NRPS-like clusters, which may reflect that they often are antimicrobial and therefore may engage in arms-race dynamics with pathogens, where positive selection could generate novel chemistry for antimicrobial defence. Furthermore, gene-specific positive selection was inferred to varying degrees; e.g., the finding that GCFs 6 and 7 experienced gene-wide positive selection acting consistently on specific genes could indicate that gene functions within clusters play adaptive roles.

The majority of the 25 GCFs subjected to positive selection analyses experienced multiple episodic positive selection events within their representative BGCs. In some instances, episodic positive selection events appeared to be associated with termite host species. For example, positive selection confined to a single gene in the NRPS-like GCF11 was inferred only in Termitomyces species associated with Ancistrotermes. Similarly, BGCs in Macrotermes-associated Termitomyces that encode the biosynthesis of GCF4 (NRPS-like) and GCF29 (terpene), which are both present in all species of Termitomyces, exhibited episodic positive selection events on non-core genes. This suggests termite host-dependent effects on specialised metabolites through specific ecological pressures that are then experienced by given fungal species. BGCs with evidence for positive selection may be of particular interest for further characterisation of the produced compound for the discovery of chemical novelties and compounds with antimicrobial properties.

GCFs with no or very few genes under positive selection could either reflect conserved functions or alternative avenues to chemical novelty. The latter may be the case for several terpene-encoding BGCs that overall displayed lower rates of inferred positive selection, but for which it has been established that a single BGC can give rise to extraordinary chemical diversity28. For example, a single terpene cyclase in Termitomyces species L, M, and N (based on our species assignment) allows for the production of more than 20 different compounds28. Further understanding of the evolution of these gene clusters, the functions of their genes, and the compounds they encode will be needed to determine the relative roles of positive selection and alternative avenues for chemical novelty in fungus-farming termite cultivars.

Conclusions and perspectives

Through significant improvement of genomes across the phylogenetic history of the genus Termitomyces, our comparative genomics unravelled a rich set of biosynthetic gene clusters encoded by the fungal cultivar of farming termites. BGC profiling allowed us to establish a core set of BGCs that likely has been present since the origin of fungiculture in termites 30 million years ago, as well as indications of BGC gains and losses over evolutionary time. Based on our positive selection analyses, our findings suggest that different compound classes may be subject to distinct evolutionary trajectories. Specifically, our findings suggest that NRPS and NRPS-like gene clusters are subject to more frequent consistent or episodic positive selection than e.g., terpenes. Chemical novelties may nevertheless occur in the latter, where substantial chemical diversity may arise from just a single conserved gene cluster28. Collectively, this indicates that millions of years of termite-fungus symbiosis have led to rich biosynthetic potential with distinct evolutionary trajectories of biosynthetic gene clusters and ample chemical novelties.

The vast non-random and largely undescribed chemical potential in Termitomyces implies a rich potential for future discoveries of specialised metabolites. Only five of the GCFs we describe encode metabolites or analogues of metabolites that have been characterized. The ecological function of these compounds for the fungus and potential roles in interactions with termite hosts remain largely unknown. Our evolutionary-guided approach serves as an important framework to further our understanding of both universally-present and unique chemistry in the fungal genus, and broadly for detailing identities and ecological roles of specialised metabolites in the ecology of fungal species and host-specific contexts. The vast roles and activities these natural products likely represent provide a rich potential to elucidate the chemical ecology of farming-termite symbiosis and the roles of natural products in interactions and defence in ecological and evolutionary contexts.

Materials and methods

Termitomyces genomes

To secure a comprehensive set of genomes that span the diversity of Termitomyces and termite hosts, we compiled genomes of 22 strains from published work (Supplementary Data 1) and sequenced 17 additional genomes of Termitomyces obtained from termite colonies collected in the Comoé and Lamto field stations in Côte d’Ivoire in 2018 and 2019 (Supplementary Data 1). New isolates were obtained by placing nodules (asexual structures produced within fungus combs) of Termitomyces on Potato Dextrose Agar (PDA; 39 g/l) and subculturing until pure.

The resulting 39 Termitomyces genomes spanned at least ten species (five genera) of termite hosts, but conceivably more since host origin was unknown for eight strains (Supplementary Data 1). Isolates from known hosts included one strain from Pseudacanthotermes sp., one from Ancistrotermes sp., one from Ancistrotermes guinensis, three from Ancistrotermes cavithorax, two from Odontotermes transvaalensis, two from Odontotermes cf. badius, two from unknown Odontotermes species, six from Macrotermes bellicosus, three from Macrotermes subhyalinus, five from Macrotermes natalensis, one from Macrotermes gilvus, and four from Microtermes spp. (Supplementary Data 1). Termite species were confirmed with barcoding as described by Zaman et al. (Supplementary Data 1)36.

Strain genotyping

New isolates were verified as Termitomyces by barcoding of the Internal Transcribed Spacer (ITS) region37. DNA was extracted using the Chelex protocol as described by Conlon (2022)38. PCR was performed using basidiomycete-specific primers ITS1F and ITS4B39 following the protocol described by Schmidt et al.40. Purified PCR products were sent to Eurofins MWG Operon (Ebersberg, Germany) for sequencing. Forward and reverse sequences were aligned in Geneious prime version 2019.1.1 (Biomatters Ltd., New Zealand) using Geneious’ own algorithm after primers and low-quality ends were trimmed. Sequences were then blasted against the NCBI database to confirm their identity.

Genome sequencing, assembly, and annotation

We extracted DNA from the 17 Termitomyces isolates using a CTAB extraction optimised for high yield and fragment length, with an initial fast-freeze step to increase yield38. Whole-genome sequencing was performed using a combination of 100 bp/150 bp paired-end shotgun (BGISEQ/DNBSEQ) and long-read (PacBio Sequel) sequencing by BGI. Short reads were filtered to a phred score of 30 using bbduk.sh v38.89 from the BBtools package (BBMap). PacBio long reads and BGIseq/DNBseq short reads were hybrid assembled with SPAdes (v.3.13.0)41,42 using the isolate mode. The resulting contigs were improved and re-scaffolded via RagTag (v. 2.1.0)43 using the low error rate short read sequences and the high-quality Termitomyces draft assembly “Termitomyces_v3.0”44. Assembly quality was assessed by quantifying genome completeness based on the expected gene content of the Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.0.045 against the database for Agaricales-odb10 genomes. Genome assembly statistics are available in Supplementary Data 2. In addition, we obtained 22 assembled genomes from GenBank. We used these 39 genomes to build a transposable element library and model with RepeatModeler v2.0.146, which was subsequently used by RepeatMasker v4.1.147 to soft mask all genomes with the rmblastn engine on sensitive mode, skipping bacterial insertions. We annotated the 39 soft masked genomes with Braker2 v2.1.648 using OrthoDBs odb10_fungi database49 with fungi mode and skipping the fixing of broken genes48,49,50,51,52,53. We retained the longest isoforms of the braker.gtf output, filtering out all other isoforms with ProtHints print_longest_isoform.py54. An antiSMASH fungal version (fungiSMASH) v7.055 ready EMBL annotation file was generated by converting the gtf output to gff3 and then EMBL.
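The isoform filtering step can be illustrated with a minimal sketch that keeps, for each gene, the transcript with the greatest summed CDS length. This is not the ProtHint script named above; the function name is hypothetical, and the sketch assumes tab-separated GTF lines whose ninth column carries `transcript_id "..."` and `gene_id "..."` attributes.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the GTF attribute column (assumed format).
ATTRIBUTE = re.compile(r'(\w+) "([^"]+)"')

def longest_isoforms(gtf_lines):
    """Return a dict mapping gene id -> transcript id with the longest total CDS."""
    cds_length = defaultdict(int)
    gene_of = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        transcript = attributes.get("transcript_id")
        gene = attributes.get("gene_id")
        if transcript is None or gene is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        cds_length[transcript] += int(fields[4]) - int(fields[3]) + 1
        gene_of[transcript] = gene

    best = {}
    for transcript, length in cds_length.items():
        gene = gene_of[transcript]
        # On ties, the transcript encountered first is kept.
        if gene not in best or length > cds_length[best[gene]]:
            best[gene] = transcript
    return best

# Toy records: gene g1 has two isoforms, gene g2 has one.
toy_gtf = [
    'c1\tAUGUSTUS\tCDS\t1\t300\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";',
    'c1\tAUGUSTUS\tCDS\t1\t300\t.\t+\t0\ttranscript_id "g1.t2"; gene_id "g1";',
    'c1\tAUGUSTUS\tCDS\t400\t600\t.\t+\t0\ttranscript_id "g1.t2"; gene_id "g1";',
    'c1\tAUGUSTUS\tCDS\t900\t1000\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";',
]
print(longest_isoforms(toy_gtf))
```

In the toy records, isoform g1.t2 spans 501 coding bases against 300 for g1.t1, so it is the one retained for gene g1.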

Consensus phylogeny and species assignment of genomes

A total of 6807 orthologous groups (Supplementary Data 2) in the 39 genomes and a species from the sister genus of Termitomyces, Arthromyces matolae (GCA_018855395)10, were generated with OrthoFinder v2.5.456. Single copy orthogroups present in at least 90% of the species were aligned with MAFFT v. 7.45357 in auto mode followed by gene tree generation with IQ-TREE v2.1.358 using model finder (-m MFP)59. These gene trees were used to infer a species tree with ASTRAL-Pro from ASTRAL v5.7.160. Both the species tree and all gene trees served as input into GenesortR61 to evaluate the phylogenetic information of the orthologs and subsample 100 genes (Supplementary Data 2), thereby minimizing potential sources of systematic bias62,63. A matrix of these 100 genes was subsequently used by IQ-TREE v2.1.3 to reconstruct a final phylogeny using model finder (-m MFP), and 1000 ultra-fast bootstrap (-B 1000)1 to estimate the support of each node. We grouped genomes according to putative species boundaries using the PTP (Poisson Tree Processes) model via the PTP web server64 (Supplementary Table 2). This allowed unambiguous assignment of genomes into 21 species for subsequent BGC comparisons (Fig. 1C).
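The orthogroup selection criterion (single copy, present in at least 90% of the species) can be sketched as a simple filter. The function name and input layout are hypothetical, and the sketch assumes that "single copy" means no species contributes more than one gene to the orthogroup, which is an interpretation rather than a detail stated above.

```python
def single_copy_orthogroups(orthogroups, species, min_fraction=0.9):
    """Select orthogroups that are single copy and sufficiently widespread.

    orthogroups  : dict mapping orthogroup id -> dict of species -> list of gene ids
    species      : list of all species (or genome) labels in the analysis
    min_fraction : minimum fraction of species that must carry exactly one copy
    """
    kept = []
    for og_id, members in orthogroups.items():
        counts = [len(members.get(sp, [])) for sp in species]
        # Assumption: any multi-copy species disqualifies the orthogroup.
        if any(count > 1 for count in counts):
            continue
        present = sum(count == 1 for count in counts)
        if present / len(species) >= min_fraction:
            kept.append(og_id)
    return kept

# Toy input with ten species labelled s0 to s9.
toy_species = [f"s{i}" for i in range(10)]
toy_orthogroups = {
    "OG1": {sp: [f"{sp}_a"] for sp in toy_species},              # all ten, single copy
    "OG2": {sp: [f"{sp}_b"] for sp in toy_species[:9]},          # nine of ten
    "OG3": {**{sp: [f"{sp}_c"] for sp in toy_species},
            "s0": ["s0_c1", "s0_c2"]},                           # duplicated in s0
    "OG4": {sp: [f"{sp}_d"] for sp in toy_species[:8]},          # eight of ten
}
print(single_copy_orthogroups(toy_orthogroups, toy_species))
```

With the toy input, OG1 and OG2 pass the filter, OG3 is rejected because of the duplication in s0, and OG4 is rejected because it is found in only 80% of the species.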

In silico analysis of Biosynthetic Gene Clusters (BGCs) and Gene Cluster Families (GCFs)

To identify putative BGCs, we uploaded genomes to the fungal version of antiSMASH (v7.0)55 with default parameters (Supplementary Table 3 and Supplementary Data 3). BGCs were deemed incomplete if they were positioned close to the edge of a contig65,66; BGCs positioned in the middle of a sequence thus represent BGCs with presumed known boundaries and an accompanying complete set of tailoring enzymes (Supplementary Data 3). To reduce the complexity of the BGC dataset, GenBank files from the fungiSMASH analysis were assigned to biosynthetic gene cluster families (GCFs; i.e., families of BGCs that share a similar organisation and likely code for the biosynthesis of the same or similar compounds) through a pairwise distance analysis using the BiG-SCAPE network prediction software67 with default settings (Supplementary Tables 4 and 5). GCFs were illustrated as sequence similarity networks using Cytoscape 3.8.0 (Shannon et al., 2003). All homologous enzymes within each GCF were aligned using MUSCLE in Geneious Prime version 2019.1.1. Phylogenetic trees for each GCF were built using RAxML with the rapid bootstrapping algorithm and a search for the best-scoring maximum likelihood tree.
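The contig-edge criterion for BGC completeness, and the per-class percentage of complete BGCs summarised in Fig. 1E, can be illustrated with a short Python sketch. The distance threshold (`margin`), the record layout, and the function names are assumptions made for illustration. The source does not state the cutoff that was applied.

```python
from collections import defaultdict


def near_contig_edge(start, end, contig_length, margin=1000):
    """True if a region lies within `margin` bp of either contig end.

    Assumptions: 0-based start, end-exclusive coordinates; margin is a
    hypothetical cutoff, not a value reported in the study.
    """
    return start < margin or (contig_length - end) < margin


def percent_complete_by_class(bgcs, margin=1000):
    """Return {bgc_class: percentage of BGCs not near a contig edge}.

    bgcs: iterable of dicts with keys 'bgc_class', 'start', 'end', 'contig_length'.
    """
    total = defaultdict(int)
    complete = defaultdict(int)
    for bgc in bgcs:
        total[bgc["bgc_class"]] += 1
        if not near_contig_edge(bgc["start"], bgc["end"], bgc["contig_length"], margin):
            complete[bgc["bgc_class"]] += 1
    return {cls: 100.0 * complete[cls] / total[cls] for cls in total}


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    toy = [
        {"bgc_class": "terpene", "start": 500, "end": 20000, "contig_length": 100000},
        {"bgc_class": "terpene", "start": 30000, "end": 50000, "contig_length": 100000},
        {"bgc_class": "NRPS", "start": 80000, "end": 99800, "contig_length": 100000},
    ]
    print(percent_complete_by_class(toy))
```

A cluster flagged this way may be truncated by the assembly rather than genuinely short, which is why edge-adjacent BGCs are treated as incomplete.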

To improve BGC annotations, we manually curated outputs from fungiSMASH and BiG-SCAPE by aligning all genes within each putative BGC from each network, together with all BGCs without an assigned GCF (i.e., singletons). This was done to 1) ensure that all BGCs were assigned to a GCF correctly, 2) organise all genes within a GCF to correct for rearrangements and enable comparison of gene similarity and BGC completeness, and 3) identify putative matches to protein domain families using the Pfam database68. To encourage caution when inferring the final product of these BGCs, Supplementary Data 3 provides an overview of the information available for each BGC: GCF number, Termitomyces species, BGC class, BGC length, number of genes within the BGC, and whether the BGC is close to a contig edge. The resulting GCF profiles (i.e., the collective of BGCs in a genome belonging to different GCFs) are available in Supplementary Data 4. Comparative analysis of GCF profiles was undertaken by modelling the effect of termite host genus (excluding genomes with unknown hosts) and Termitomyces species on GCF presence with a PERMANOVA on pairwise Bray-Curtis distances between GCF profiles using adonis2 from the vegan package v2.6-269.
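To make the PERMANOVA step concrete, the following Python sketch computes pairwise Bray-Curtis distances between GCF profiles and a permutation-based pseudo-F test for a single grouping factor. It is a simplification and not a reimplementation of adonis2: the study modelled two terms (host genus and Termitomyces species) in R, whereas this sketch handles one factor only. All function names and the toy profiles are hypothetical.

```python
import random
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(profiles):
    """Symmetric matrix of pairwise Bray-Curtis distances."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = bray_curtis(profiles[i], profiles[j])
    return d


def pseudo_f(d, groups):
    """One-factor PERMANOVA pseudo-F from a distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(d[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i, g in enumerate(groups) if g == label]
        ss_within += sum(d[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
    if ss_within == 0:
        return float("inf")  # no within-group variation
    ss_among = ss_total - ss_within
    return (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))


def permanova(profiles, groups, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value) for one grouping factor."""
    d = distance_matrix(profiles)
    observed = pseudo_f(d, groups)
    rng = random.Random(seed)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute labels, keep distances fixed
        if pseudo_f(d, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Invented presence/absence GCF profiles for six genomes from two host genera.
    profiles = [
        [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 0, 1],
    ]
    hosts = ["Macrotermes"] * 3 + ["Odontotermes"] * 3
    print(permanova(profiles, hosts))
```

With very few samples the number of distinct label permutations is small, so the attainable p-value has a floor. This is worth keeping in mind for host genera represented by only a few genomes.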

We further compared BGCs to chemically characterised BGCs from the Minimum Information about a Biosynthetic Gene Cluster (MIBiG v3.0) repository34, following a published pipeline28,70. We also compared terpene core enzymes to previously identified terpenes28 to see if we could link GCFs to previously characterised BGCs. Consensus graphical representations of a GCF as a single BGC (Fig. 3B and Supplementary Figs 1 and 2) were created using Adobe Illustrator based on the BGC structure shown in the fungiSMASH output and the manual MUSCLE alignments of individual genes (using Geneious Prime version 2019.1.1) (Fig. 3). In each representation of a BGC, we show the protein families identified from the Pfam database68 to highlight which domain(s) each gene carries, i.e., its domain architecture (Fig. 3 and Supplementary Data 4).

Assessment of BGC genes under positive selection

To find signatures of selection on genes within BGCs of the 25 most abundant GCFs, we first assessed the orthology relationships between the genes from all BGCs comprising a GCF using OrthoFinder v2.5.456. Then, the nucleotide coding sequences of genes in orthogroups present in at least five genomes were extracted with exonerate v.2.4.071 and aligned independently for each orthogroup using the PRANK v.170427 codon model72. Considering the divergence and some degree of fragmentation in the genomes, alignment quality was assessed using Zorro73, and unreliable positions were filtered. We then ran hmmcleaner v0.180750 to identify alignment errors and subsequently mask them74. For each orthogroup, we inferred a gene tree from the codon alignment in IQ-TREE v2.1.358, with ModelFinder to determine the best-fit model (“-m MFP”) and 1000 ultrafast bootstrap replicates (“-B 1000”). To test for positive selection occurring across lineages, we used the adaptive Branch-Site Random Effects Likelihood (aBSREL) model implemented in the HyPhy package, using the codon alignment and gene tree as input75,76. This model fits an optimal number of ω rate classes for each branch, where ω is the ratio of nonsynonymous (dN) to synonymous (dS) substitution rates, and allows inference of positive selection in specific lineages when ω > 1. In addition, we tested for relaxation or intensification of the strength of positive selection using the RELAX method in HyPhy77. The p-values from the selection analyses were adjusted for the false discovery rate (FDR; Benjamini and Hochberg 1995) (Supplementary Data 5 and 6). Individual gene trees for the orthogroups with evidence for positive selection are available in Fig. 3 and Supplementary Fig. 3.
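The false discovery rate adjustment of Benjamini and Hochberg can be written compactly in Python. The sketch below shows the standard step-up procedure on a list of raw p-values. It is offered as an illustration of the correction and makes no claim about the software used for this step in the study. The function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value of rank r (1 = smallest) among m tests is scaled to p * m / r,
    then a running minimum from the largest rank downwards enforces monotonicity.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Invented p-values, e.g. one per tested branch or orthogroup.
    raw = [0.01, 0.04, 0.03, 0.005]
    print([round(q, 4) for q in benjamini_hochberg(raw)])
```

Tests whose adjusted value falls below the chosen threshold are then reported as significant while controlling the expected proportion of false positives among them.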

Statistics and reproducibility

This study presents the first extensive analysis of biosynthetic potential across the Termitomyces phylogeny, using 39 genomes sourced from field collections and existing GenBank records. Significance thresholds for statistical analyses were set at *P < 0.05, **P < 0.01, and ***P < 0.001. To support reproducibility, data sources, processing steps, and analytical pipelines were documented, and all code and scripts used in this analysis are publicly accessible online.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.