Discovery of fungal onoceroid triterpenoids through domainless enzyme-targeted global genome mining

Genomics-guided methodologies have revolutionized the discovery of natural products. However, a major challenge in the field of genome mining is determining how to selectively extract biosynthetic gene clusters (BGCs) for untapped natural products from the numerous available genome sequences. In this study, we developed a fungal genome mining tool that extracts BGCs encoding enzymes that lack a detectable protein domain (i.e., domainless enzymes) and are not recognized as biosynthetic proteins by existing bioinformatic tools. We searched for BGCs encoding a homologue of Pyr4-family terpene cyclases, which are representative examples of apparently domainless enzymes, in approximately 2000 fungal genomes and discovered several BGCs with unique features. The subsequent characterization of selected BGCs led to the discovery of fungal onoceroid triterpenoids and unprecedented onoceroid synthases. In addition to the onoceroids, a previously unreported sesquiterpene hydroquinone, whose biosynthesis involves a Pyr4-family terpene cyclase, was obtained. Our genome mining tool has broad applicability in fungal genome mining and can serve as a beneficial platform for accessing diverse, unexploited natural products.


INTRODUCTION
Recent years have witnessed an exponential increase in data accumulation across all fields, ushering us into the era of big data. Research on naturally occurring organic compounds, typically referred to as natural products, is no exception to this global trend. 1 Traditional methods used to isolate natural products often lead to the rediscovery of known compounds. Thus, scientists are now urged to discover or synthesize novel natural products or their analogs using alternative strategies. Leveraging big data presents a promising solution to this problem. The rapid accumulation of microbial genome sequence data has revealed that microbial genomes harbor more natural product biosynthetic gene clusters (BGCs) than one can predict based on the number of metabolites produced under standard laboratory cultivation conditions. 2 Thus, the activation and characterization of these silent or orphan BGCs present in microbial genomes can lead to the discovery of new natural products. This approach, known as genome mining, has proven useful for discovering new metabolites for over a decade. 3 However, a major challenge in the field of genome mining is determining how to prioritize BGCs identified in available genome sequence data for the efficient discovery of untapped natural products.
Fungi are prolific producers of natural products with diverse molecular structures and biological activities, establishing them as promising sources of unexplored metabolites. To facilitate genome mining in fungi, several bioinformatics tools, such as antiSMASH, 4 DeepBGC, 5 and TOUCAN, 6 can be employed to detect BGCs in a specific fungal genome. Among the available tools, antiSMASH has gained the most popularity, possibly because of its user-friendly online platform and versatile functions. However, some challenges have been encountered while using antiSMASH for BGC detection and genome mining. First, antiSMASH classifies many genes as "other genes," even when they encode a biosynthetic enzyme. For example, when antiSMASH (version 7.0.0) was employed to analyze the BGC of the fungal meroterpenoid novofumigatonin, 7 a hybrid molecule of polyketide and terpenoid origin, it categorized 6 out of 13 genes as "other genes" (Figure S1). This may be because some of these genes lacked a detectable Pfam domain. 8 Moreover, one of these six overlooked genes encodes the terpene cyclase NvfL, which has widely distributed homologues in natural product pathways. 9 Thus, although antiSMASH should theoretically categorize this BGC as "T1PKS, terpene" (type I polyketide synthase + terpene synthase), it recognized only T1PKS. In addition, these domainless enzymes included the α-ketoglutarate-dependent dioxygenase NvfI and the methyltransferase NvfJ, indicating that antiSMASH failed to recognize various biosynthetic proteins. Furthermore, although antiSMASH and the other aforementioned tools can be used to extract all possible BGCs from a genome sequence, they do not facilitate the prioritization of the detected BGCs. Thus, users need to select BGCs for subsequent wet lab experiments based on their own criteria. In this study, we aimed to overcome these challenges by developing a widely applicable genome mining methodology for the rapid discovery of novel natural products and obtaining a new class of metabolites.

Development of a fungal genome mining pipeline
Recent studies on the biosynthesis of fungal natural products have identified many biosynthetic enzymes lacking a detectable Pfam domain, such as NvfI and NvfL, which are not recognized by antiSMASH. Because these domainless biosynthetic enzymes often drive diverse chemical reactions, we hypothesized that a genome mining strategy focusing on BGCs encoding the weak homologues of such enzymes would offer a promising approach to obtaining untapped metabolites. To identify biosynthetic proteins without detectable Pfam domains, we first collected known fungal biosynthetic proteins. Although these proteins can be obtained from the UniProtKB/Swiss-Prot or MIBiG database, 10 these databases contain gene or protein sequences that can be mispredicted and are not manually corrected. Thus, we created our own fungal BGC database by manually collecting, reviewing, correcting where necessary, and systematically annotating approximately 700 fungal BGCs (Data S1). We named this database the FunBGCs database.
Next, we extracted all protein sequences from the created database and conducted a Pfam domain search on the resulting 5,070 proteins. This led to the identification of 520 Pfam domains. However, 572 proteins lacked a detectable Pfam domain. These domainless proteins were divided into groups based on their sequence similarity, and their hidden Markov model (HMM) profiles were generated for HMMER 11 analysis. The representative members of these protein groups are the terpene cyclase Pyr4, 12 the Diels-Alderase Fsa2, 13 the hetero Diels-Alderase AsR5, 14 the epoxide hydrolase CtvD, 15 and the isomerase Trt14. 16 Furthermore, several additional HMM profiles were created for the domains of polyketide synthases (PKSs) and nonribosomal peptide synthetases (NRPSs). The identified Pfam domains, the manually added protein families or domains, and a few additional protein domains from the SMART 17 and TIGRFAMs 18 databases (Tables S1 and S2) were combined for use in biosynthetic protein detection. Moreover, a DIAMOND 19 database was constructed with all of the extracted biosynthetic proteins.
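The partitioning and grouping step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy protein identifiers, the placeholder accession `PF_EXAMPLE`, and the use of single-linkage grouping are all assumptions made for illustration. The input is assumed to be a precomputed mapping from each protein to its Pfam hits and a precomputed list of protein pairs judged similar (for example, from an all-versus-all sequence comparison).

```python
from collections import defaultdict


def split_domainless(protein_ids, pfam_hits):
    """Split proteins into those with at least one Pfam hit and those with none.

    pfam_hits maps a protein ID to a list of Pfam accessions; a missing key
    or an empty list is treated as "no detectable domain".
    """
    with_domain = [p for p in protein_ids if pfam_hits.get(p)]
    domainless = [p for p in protein_ids if not pfam_hits.get(p)]
    return with_domain, domainless


def group_by_similarity(proteins, similar_pairs):
    """Single-linkage grouping: proteins joined by any similar pair share a group.

    This is an assumed grouping rule, implemented with a small union-find.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for p in proteins:
        groups[find(p)].append(p)
    # Largest groups first; ties broken alphabetically for stable output.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the FunBGCs database.
    ids = ["pks1", "cycA", "cycB", "daseA"]
    hits = {"pks1": ["PF_EXAMPLE"], "cycA": []}
    _, no_domain = split_domainless(ids, hits)
    print(group_by_similarity(no_domain, [("cycA", "cycB")]))
```

Each resulting group would then be aligned and converted into an HMM profile with external tools; that step is outside the scope of this sketch.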
BGCs are extracted as follows (Figure 1A). Initially, genes whose products belong to a protein family of interest are identified from a given fungal genome. Subsequently, genes in the flanking regions of each identified gene are examined to determine if they encode a core protein (e.g., PKSs, NRPSs, terpene synthases or cyclases, and prenyltransferases; refer to Table S3 for more details) or another type of biosynthetic protein using the previously created in-house HMM library and DIAMOND database. If a target gene is found to be colocalized with a core protein gene, then the genomic region is extracted as a BGC. To determine whether a gene in flanking regions is clustered with the target, the following factors are considered: (i) whether the gene encodes a biosynthetic protein, (ii) whether the gene is duplicated in the genome (to include a possible self-resistance enzyme gene 20 ), (iii) whether the gene is biosynthetically related to nearby genes, (iv) the size of the gene product, and (v) the distance between a given gene and the next one (refer to Supporting Information for the detailed method). In addition, the extracted BGCs are visualized using a web browser, which displays general information on each BGC and each gene (Figure 1B). The fungal genome mining pipeline developed in this study was named FunBGCeX (Fungal BGC eXtractor).
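The extraction workflow can be sketched in Python as a simple outward walk from the target gene. This is a deliberately reduced illustration, not the FunBGCeX algorithm: it models only factors (i) and (v) above, the `Gene` fields and the `max_gap` threshold of 5,000 bp are hypothetical, and genes are assumed to be sorted by position on a single contig with their biosynthetic and core status already assigned.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int
    end: int
    is_biosynthetic: bool = False  # factor (i): hit in the HMM library or DIAMOND database
    is_core: bool = False          # e.g., PKS, NRPS, terpene synthase/cyclase, prenyltransferase


def extract_bgc(genes, target_index, max_gap=5000):
    """Extend a region outward from the target gene and return it if it holds a core gene.

    A neighbour is kept only if it is biosynthetic and lies within max_gap bp
    of the gene already in the region (factor (v)). max_gap is an assumed value.
    Returns the list of genes in the region, or None if no core gene is present.
    """
    lo = hi = target_index

    # Walk upstream of the target.
    while lo > 0:
        candidate, inner = genes[lo - 1], genes[lo]
        if candidate.is_biosynthetic and inner.start - candidate.end <= max_gap:
            lo -= 1
        else:
            break

    # Walk downstream of the target.
    while hi < len(genes) - 1:
        candidate, inner = genes[hi + 1], genes[hi]
        if candidate.is_biosynthetic and candidate.start - inner.end <= max_gap:
            hi += 1
        else:
            break

    region = genes[lo:hi + 1]
    return region if any(g.is_core for g in region) else None


if __name__ == "__main__":
    # Hypothetical contig; g3 is the target gene.
    contig = [
        Gene("g1", 0, 1000),
        Gene("g2", 3000, 4000, is_biosynthetic=True, is_core=True),
        Gene("g3", 4500, 5500, is_biosynthetic=True),
        Gene("g4", 6000, 7000, is_biosynthetic=True),
        Gene("g5", 20000, 21000, is_biosynthetic=True),
    ]
    bgc = extract_bgc(contig, target_index=2)
    print([g.name for g in bgc] if bgc else None)
```

In this sketch the target itself may count as the core gene, which mirrors the case of a terpene cyclase target; the remaining factors (gene duplication, biosynthetic relatedness, and product size) would need additional inputs.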
https://doi.org/10.26434/chemrxiv-2023-f0z8n ORCID: https://orcid.org/0000-0001-5650-4732 Content not peer-reviewed by ChemRxiv. License: CC BY-NC-ND 4.0
The top table provides general information on a BGC. A schematic representation of the BGC is displayed in the middle. On clicking each gene, information on that gene is provided in the bottom table. The BGC shown here is identical to the alli cluster (Figure 2C).

Discovery of unusual fungal triterpenoid BGCs
To examine whether FunBGCeX allows for the efficient discovery of BGCs that potentially synthesize a new class of natural products, we extracted BGCs that encoded a homologue of Pyr4 (Figure 2A), 12,21 which is a noncanonical transmembrane terpene cyclase and does not have a conserved domain, as inferred from the National Center for Biotechnology Information (NCBI) Conserved Domain Search 22 (Figure S2). Such BGCs were extracted from 1,990 annotated fungal reference genomes downloaded from the NCBI database using an HMM profile created from Pyr4 and its homologues. 9,23 This process resulted in the identification of 1,050 BGCs (including those with a single gene; Data S2). More than half (690) of the identified BGCs did not contain an additional core protein gene, whereas 182 BGCs possessed at least one PKS gene and 131 BGCs were found to encode a PaxC-like prenyltransferase, which is specifically required for the biosynthesis of indole sesquiterpenoids and diterpenoids. 24 Thus, except for standalone pyr4-like genes, the majority of the pyr4-like genes were clustered with either a PKS gene or a paxC-like gene. This observation is consistent with the fact that, in fungi, Pyr4-family terpene cyclases have been found only in the biosynthetic pathways of polyketide-terpenoid hybrids and indole terpenoids. 25 Intriguingly, 19 BGCs encode one or two protein(s) homologous to squalene-hopene cyclases (SHCs) or oxidosqualene cyclases (OSCs), which are involved in the biosynthesis of triterpenoids. Although a few of these BGCs were excluded from further investigation after manual review, most of them were found to encode Pyr4-like proteins that form a new clade when subjected to phylogenetic analysis with known Pyr4-family terpene cyclases (Figure 2B). During our study, we observed that Aspergillus fumigatus A1163, a non-reference strain of A. fumigatus, harbored a similar BGC, which was reserved for further investigation. To date, no metabolic pathway involving both a Pyr4-family terpene cyclase and an SHC/OSC-like enzyme has been identified. Furthermore, the nature of these BGCs suggests that their products are triterpenoids whose biosynthesis requires two cyclization events. However, to the best of our knowledge, none of the known fungal metabolites appear to be synthesized by these BGCs. Therefore, we speculate that these BGCs are responsible for the biosynthesis of untapped triterpenoid molecules.
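The kind of tally reported in this section (BGCs grouped by the class of core gene that accompanies the pyr4-like gene) can be sketched as follows. This is only an illustration under stated assumptions: the class labels, the toy BGC identifiers, and both function names are hypothetical, and each BGC is assumed to have already been annotated with the set of core-protein classes it encodes. Because one BGC can encode several partner classes, the per-class counts in such a tally need not sum to the total number of BGCs.

```python
from collections import Counter


def tally_partner_classes(bgcs, target_class="Pyr4-like"):
    """Count, for each core-protein class, how many BGCs pair it with the target.

    bgcs maps a BGC identifier to the set of core-protein classes it encodes.
    BGCs with no class other than the target are counted under "none".
    """
    counts = Counter()
    for classes in bgcs.values():
        partners = set(classes) - {target_class}
        if not partners:
            counts["none"] += 1
        for partner in partners:
            counts[partner] += 1
    return counts


def select_bgcs(bgcs, required_class):
    """Return the sorted identifiers of BGCs that encode the required class."""
    return sorted(name for name, classes in bgcs.items() if required_class in classes)


if __name__ == "__main__":
    # Hypothetical annotations, not the contents of Data S2.
    toy = {
        "bgc_01": {"Pyr4-like"},
        "bgc_02": {"Pyr4-like", "PKS"},
        "bgc_03": {"Pyr4-like", "PaxC-like"},
        "bgc_04": {"Pyr4-like", "SHC/OSC-like"},
        "bgc_05": {"Pyr4-like", "PKS", "SHC/OSC-like"},
    }
    print(tally_partner_classes(toy))
    print(select_bgcs(toy, "SHC/OSC-like"))
```

Filtering on an unusual partner class, as in the second function, is one simple way to prioritize candidates for manual review.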

Characterization of selected triterpenoid BGCs
To examine whether these BGCs yield new triterpenoids, three of them were selected for experimental investigation. The three BGCs are from Aspergillus homomorphus CBS 101889, A. fumigatus A1163 (CBS 144.89), and Aspergillus alliaceus CBS 536.65, which were designated as homo, fumi, and alli clusters, respectively (Figure 2C and Tables S4 and S5). The homo cluster is the simplest of the three and encodes the squalene or oxidosqualene cyclase HomoS, the Pyr4-family terpene cyclase HomoB, and the FAD-dependent monooxygenase (FMO) HomoM, the last of which is homologous to epoxidases involved in fungal meroterpenoid pathways. 25 In addition, the homologues of the three enzymes are conserved in the other two BGCs, although the fumi cluster contains two homoS homologues, fumiS1 and fumiS2. On the basis of the predicted functions of Homo enzymes, the product of the homo cluster could be synthesized as follows. First, HomoS cyclizes squalene or oxidosqualene to yield a cyclized product. Then, HomoM epoxidizes one of the olefinic double bonds present in the cyclized product, and HomoB finally performs a second round of cyclization to produce the end pathway product. The other two BGCs would employ a similar biosynthetic mechanism; however, they each encode one or two tailoring enzymes predicted to modify the triterpenoid scaffold. The fumi cluster contains a cytochrome P450 monooxygenase gene fumiP, whereas the alli cluster encodes the P450 AlliP and the acetyltransferase AlliA. Collectively, all three BGCs are expected to yield different triterpenoid species.
To analyze the functions of the three BGCs and identify the metabolites produced, we performed heterologous expression experiments in Aspergillus oryzae NSAR1, 26 which has been widely utilized for the characterization of orphan BGCs. 21a,27 We first individually expressed the four putative triterpene synthase genes (homoS, fumiS1, fumiS2, and alliS) in A. oryzae and then analyzed metabolites from the resulting transformants using gas chromatography-mass spectrometry (GC-MS). The findings revealed that all of the enzymes, except FumiS2, yielded specific metabolite(s) (Figure 2D, traces i to v). The homoS-transformed strain produced a major product 1 with m/z 410 [M]+, which was not present in the host strain. After isolation, metabolite 1 was identified as the bicyclic triterpene α-polypodatetraene based on the comparison of its nuclear magnetic resonance (NMR) spectra and specific rotation with reported data 28 (Figure 2E). Compound 1 was also obtained from the A. oryzae transformant harboring fumiS1; however, the transformant produced another product 2, which was identified as α-polypodatetraen-3β-ol, 29 the C-3 hydroxy analog of 1 (Figure 2E). The metabolic profile of the A. oryzae strain with alliS differed from those of the other two. A new metabolite 3 was detected as a major peak and was identified as 8α-hydroxypolypoda-13,17,21-triene 30 (Figure 2E). We sought to identify the downstream metabolites of the three pathways and first focused on the end product of the homo pathway. The co-expression of the epoxidase gene homoM and the pyr4 homologue homoB with homoS resulted in the formation of a new metabolite 4 with m/z 426 [M]+ (Figure 2D, trace vi). Analysis of the NMR data of 4 revealed the presence of two additional ring systems, whose absolute structure was established using the modified Mosher's method 31 (Figures 2E and S3). Thus, compound 4, hereby named homomonoceroid A, is synthesized through the cyclization of squalene from both termini and is classified as an onoceroid.
Next, the co-expression system of fumiS1, fumiM, and fumiB was constructed, which yielded two additional metabolites, 5 and 6 (Figure 2D, trace vii). The major product 5, designated as fumionoceroid A, was determined to be another onoceroid; its structure was established through NMR and X-ray crystallographic analyses (Figures 2E and S4; CCDC: 2279439). Like 4, compound 5 was found to be a tetracyclic onoceroid, and both 4 and 5 appear to be derived from 1. However, the newly formed bicyclic system of 5 differs from that of 4. The other product 6, fumionoceroid B, was determined to be the C-3 hydroxy form of 5 and appears to be synthesized in the same manner as 5, using oxidosqualene as a starting material (Figure 2E). The absolute structure of 6 was deduced based on that of 5. Subsequently, the P450 gene fumiP was introduced, yielding an additional product 7 (Figure 2D, trace viii). NMR and X-ray crystallographic analyses revealed that 7, named fumionoceroid C, was a hydroxylated form of 5 at C-24 (Figures 2E and S4; CCDC: 2279440).
We then identified triterpenoids synthesized by the alli cluster. When alliS, alliM, and alliB were introduced into A. oryzae, the transformant with the three genes produced a new metabolite 8 (Figure 2D, trace ix). In contrast to the products obtained from the other BGCs, 8 was identified as a pentacyclic onoceroid and named alliaonoceroid A (Figure 2E). Although 8 has never been isolated from natural sources, it was previously synthesized as a predicted biosynthetic precursor of cupacinoxepin, 32 which had been isolated from the Ecuadorian plant Cupania cinerea. 33 We next turned our attention to the tailoring steps, which should involve the P450 AlliP and the acetyltransferase AlliA. We created four-gene expression systems with either alliP or alliA to determine the enzyme required first in the biosynthetic process. Both transformants yielded a new product. The introduction of alliP and alliA led to the formation of 9 and 10, respectively (Figure 2D, traces x and xi), which were characterized to be the C-21 keto and C-6α hydroxy analog of 8 and the O-acetylated form of 8, respectively, through NMR and X-ray crystallographic analyses (Figures 2E and S4

DISCUSSION
In this study, we developed a fungal genome mining pipeline, FunBGCeX, which was based on a manually curated fungal BGC database and custom-made HMM profiles. We demonstrated that FunBGCeX could effectively identify BGCs for previously undiscovered natural products. The onoceroid BGCs characterized in this study could also be identified using antiSMASH; however, the BGCs extracted by antiSMASH contained substantially more outside genes than those obtained using FunBGCeX (Figure S5). In addition, antiSMASH could not detect the presence of pyr4-like genes in these BGCs. Therefore, the onoceroid BGCs would not have been effectively discovered using antiSMASH or other existing fungal genome mining tools. Although we focused only on BGCs with a pyr4-like terpene cyclase gene in this study, FunBGCeX can be readily applied to extract BGCs that encode a homologue of user-selected proteins. Thus, FunBGCeX can facilitate and accelerate the discovery of unexploited natural products, particularly those synthesized with the involvement of a domainless enzyme.
In our global fungal genome mining study, we discovered three types of onoceroids, which are triterpenoids synthesized from squalene or oxidosqualene through cyclization at both ends of the prenyl chain. Onoceroids have been isolated from bacteria, 34 ferns, 35 higher plants, 36 and animals. 37 However, to the best of our knowledge, onoceroids have never been isolated from fungi. Thus, the present study provides the first examples of fungal onoceroids and their biosynthetic pathways. In terms of the biosynthesis of onoceroids, BmeTC from the bacterium Bacillus megaterium was the first onoceroid synthase characterized in 2013, and this enzyme solely transforms squalene into onoceroids. 34 In the fern Lycopodium clavatum, a pair of homologous enzymes (LCC and LCD or LCE) convert oxidosqualene into onoceroid species. 38 These known enzymes for onoceroid biosynthesis are all homologous to SHCs and OSCs. In contrast, the biosynthesis of fungal onoceroids requires two families of terpene cyclases (i.e., SHC/OSC-like enzyme and Pyr4-family terpene cyclase), introducing an unprecedented biosynthetic mechanism of onoceroids. This is the first study to demonstrate the involvement of Pyr4-family terpene cyclases in the biosynthesis of "pure" (not mero-) terpenoids. The majority of fungal triterpenoids are synthesized through the cyclization of oxidosqualene, 39 and hexaprenyl pyrophosphate is known to serve as the precursor of a few fungal triterpenoids. 40 However, the fungal onoceroid pathways identified in this study represent rare examples in which squalene is directly cyclized to produce fungal triterpenoids.
The biosynthetic pathways of the fungal onoceroids discovered in this study can be proposed as follows (Scheme 1). The biosynthesis of homomonoceroid A (4) begins with squalene being cyclized into α-polypodatetraene (1) by the squalene cyclase HomoS through the carbocationic intermediate 12. Subsequently, the FMO HomoM epoxidizes the terminal olefin of 1 to yield epoxide 13. The Pyr4-family terpene cyclase HomoB then protonates epoxide 13 to initiate the second round of cyclization, and the resulting tetracyclic carbocationic species 14 is neutralized by a water attack, yielding 4. In addition, the bicyclic intermediate 13 is involved in the biosynthesis of fumionoceroid C (7), where FumiB accepts 13 to produce a differently cyclized product, fumionoceroid A (5).
FumiB first cyclizes 13 into the carbocationic intermediate 15. Instead of being quenched by water, the reaction concludes with a 1,2-hydride shift, 1,2-methyl shift, and deprotonation from C-19. Subsequently, the P450 FumiP hydroxylates 5 at C-24 to complete the biosynthesis. Oxidosqualene can also be accepted by Fumi enzymes, except for FumiP, to produce the C-3 hydroxy analog of 5, fumionoceroid B (6), through α-polypodatetraen-3β-ol (2) (Figure S6). Meanwhile, the biosynthetic pathway leading to alliaonoceroid D (11) branches from the other two pathways in the first step. AlliS transforms squalene into 8α-hydroxypolypoda-13,17,21-triene (3) instead of 1. After the epoxidation of 3 by AlliM to yield 16, AlliB cyclizes 16 into alliaonoceroid A (8). The cyclization mode adopted by AlliB is similar to that adopted by HomoB, but the AlliB-catalyzed reaction uses the C-8 hydroxy group instead of a water molecule to produce the pentacyclic onoceroid
In conclusion, we successfully isolated several previously unreported natural products using our genome mining pipeline, FunBGCeX. We believe that this tool can be widely applied to the discovery of natural products with novel scaffolds. Currently, FunBGCeX mainly targets "known-unknown" BGCs, 41 which produce unknown natural products synthesized by the known classes of core enzymes. A recent study has highlighted the genome mining-driven discovery of "unknown-unknown" natural products. 42 We are now enhancing the genome mining platform by incorporating additional functions to readily extract BGCs encoding self-resistance enzymes 20 or BGCs that lack a known core protein. This would facilitate the discovery of unexploited bioactive or unknown-unknown natural products from fungi.

Figure 1. Fungal genome mining pipeline. (A) General workflow to extract BGCs with a target gene from a given fungal genome. (B) Example of output from the genome mining pipeline. For this panel, BGCs encoding a Pyr4 homologue were extracted from the fungus Aspergillus alliaceus CBS 536.65, yielding five BGCs. The top table provides general

Figure 2. Discovery and characterization of fungal onoceroid BGCs. (A) Reactions catalyzed by the selected members of Pyr4-family terpene cyclases. (B) Phylogenetic analysis of known Pyr4 homologues and those clustered with an (oxido)squalene cyclase identified in this study. Because of space limitations, only selected enzymes were labeled. The gene and protein sequences of several identified Pyr4 homologues were manually revised upon creating the phylogenetic tree (refer to Table S6 for their protein sequences). (C) Fungal onoceroid BGCs. (D) GC-MS chromatograms of metabolites obtained from A. oryzae transformants. (E) Structures of triterpenoids obtained in this study.