Introduction

The basidiomycetous fungal family Trichosporonaceae belongs to the order Trichosporonales, the class Tremellomycetes, and subphylum Agaricomycotina and incorporates morphologically and physiologically diverse, aromatic compound-assimilating yeasts1. Recently the taxonomy of this family was revised to include six genera, namely Apiotrichum, Cutaneotrichosporon, Effuseotrichosporon, Haglerozyma, Trichosporon (type genus) and Vanrija. This revision was based on phylogenetic analysis of seven markers, namely LSU (D1/D2 domains) and SSU rRNA, the Internal Transcribed Spacer (ITS) and the protein coding genes RPB1, RPB2, TEF1 and CYTB and a combination of morphological, biochemical and physiological characteristics1,2. Members of the Trichosporonaceae show a global distribution and have been recovered from a wide range of environments. Cutaneotrichosporon spp. are most frequently associated with a human host, and may represent opportunistic human pathogens. Trichosporon spp. form part of the natural microflora on human and animal skin and result in a non-serious mycosis of hair termed white piedra3. However, they have also been implicated in trichosporonosis, a collection of opportunistic infections caused by a number of species, including Trichosporon asahii, T. asteroides and T. ovoides4. By contrast Apiotrichum and Vanrija spp. are generally free-living and have been isolated from water bodies, food sources and rotten wood (Table 1).

Table 1 Genome features of thirty-three Trichosporonaceae and two outgroup species included in the analysis.

While the Trichosporonaceae include several opportunistic human pathogens, there has also been increased interest in these taxa for a broad range of biotechnological applications. Most pertinently, members of the Trichosporonaceae are known to produce and accumulate large amounts of single cell oil (SCO) relative to their dry biomass5,6,7,8,9,10,11, with up to 70% w/dwbiomass (weight/dry weight of biomass) accumulated by Cutaneotrichosporon oleaginosus12. Furthermore, they are amenable to large-scale fermentations as they are not as sensitive as other oleaginous yeasts to fermentation inhibitors including furanes and phenolic compounds8. These factors make members of the Trichosporonales suitable candidates in a wide range of biotechnological applications such as the production of oleo-chemicals and biofuels13,14.

The rapid development of genome sequencing technologies and bioinformatics has been pivotal in shaping our understanding of fungal genetics. Since the publication of the first fungal genome, that of Saccharomyces cerevisiae, in 199615, fungal genomics has expanded quickly. As of June 2019, 5,269 fungal genome assemblies had been deposited in the NCBI database16. With the increasing availability of fungal genomes, recent works have harnessed the information contained in the genomes to develop more robust taxonomic frameworks for several fungal taxa. For instance, Takashima et al.17,18,19 have pioneered and variously reported a genome-based characterisation and phylogenetic analysis of the order Trichosporonales using 24 haploid and 3 natural hybrid genomes. Furthermore, genome sequencing provides access to the full complement of proteins encoded on a fungal genome, which can serve as a resource for modelling the functional capacities of fungal strains and for furthering their use as biological resources in a wide range of biotechnological applications20.

In the current study, we have employed comparative genomic strategies to study thirty-three members of the family Trichosporonaceae. Phylogenomic analysis identified three mis-classified taxa within this family, while genes coding for enzymes involved in oleagenicity and their regulatory regions show evolutionary patterns distinct from the genome scale phylogeny. Furthermore, the genome comparisons highlighted a range of genetic determinants underlying the distinct lifestyles and niche specialisations of the different taxa within this family.

Results and Discussion

Genomic characteristics of the Trichosporonaceae

The genomes of thirty-three taxa belonging to the genera Apiotrichum (nine strains), Cutaneotrichosporon (twelve strains), Pascua guehoae, Prillingera fragicola, Trichosporon (eight strains) and Vanrija (two strains) were incorporated in the analyses. Twenty-nine of the strains have haploid genome structures, while three strains, namely C. mucoides JCM 9939T, T. ovoides JCM 9940 and T. coremiiforme JCM 2938T, have been shown to comprise hybrid genomes18. In this study, genome duplication and phylogenetic analyses revealed that one additional strain, C. cutaneum B3, also has a hybrid genome. Two strains of Takashimella (belonging to the closely related family Tetragoniomycetaceae) were included as outgroups. A survey of the origin of the Trichosporonaceae strains shows a wide geographic distribution, with isolates obtained from food, decomposing wood, the human body, soils and water bodies, among others (Table 1). The two outgroup strains originated from two distinct sources: a plant leaf and stream water. However, the majority of the Trichosporon and Cutaneotrichosporon strains for which genome sequences are available are associated with human or animal skin, while genomes of isolates from insects1,21 are not available. This may reflect a preference for sequencing clinically important strains. The phylogenomic analyses of thirty-three members of the family Trichosporonaceae, including Apiotrichum porosum DSM 27194 and one putative hybrid genome strain, C. cutaneum B3, are presented here. The estimated genome sizes of the thirty-three Trichosporonaceae strains ranged between 16.4 and 42.4 Mb, with average G + C contents of 56.5–62.8%. The N50, which is the contig/scaffold size for which at least 50% of the assembly is contained in equal or larger contigs/scaffolds, ranged between 53.5 kb in T. akiyoshidainum HP2023 and 5.6 Mb in C. cutaneum ACCC 20271, indicating wide variation in assembly quality.
However, previous studies have shown that large N50 values may arise because of erroneous concatenation of contigs, thereby limiting the value of this metric in evaluating assembly quality22. The largest genome sizes (average of 40.5 Mb) are observed for the four hybrid genomes incorporated in the analysis. Among the haploid genomes, the largest genome sizes belong to the yeast strains that are predominantly isolated from various soil types. Prediction of protein encoding gene models revealed that the genomes of these fungi encode between 6,477 (C. curvatus SBUG-Y 855) and 15,061 (T. coremiiforme JCM 2938T) proteins. Evaluation of the predicted protein models using the BUSCO23 basidiomycota_odb9, which includes 1335 single copy genes/proteins, revealed that the genome completeness of the yeast strains included in this study ranged between 80.8 and 97.2% (Table 1). Additionally, BUSCO23 analysis revealed extensive protein duplication ranging between 55.7 to 70% in the four hybrid genomes that harbour the largest genome sizes. In contrast, the two outgroup species have genome sizes of 22.4 and 25.1 Mb and G + C content of 44.66 and 54.94% for Takashimella tepidaria JCM 11965T and T. koratensis JCM 12878T, respectively.
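The N50 definition used above can be made concrete with a short calculation. The following is a minimal sketch in Python, not the procedure used to generate Table 1; the function name `n50` and the toy contig lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of an assembly given its contig/scaffold lengths.

    The N50 is the length L such that contigs of length >= L together
    contain at least 50% of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the largest contig downwards, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


# Toy example (hypothetical contig lengths, not data from this study).
toy_contigs = [5, 4, 3, 2, 1]
toy_n50 = n50(toy_contigs)
```

Because the value depends only on the length distribution, a single erroneously joined scaffold can raise the N50 considerably, which is consistent with the caution expressed above about using this metric alone.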

Genome-wide phylogenetic analysis reveals several misclassifications in the Trichosporonaceae

Orthologous proteins conserved among all compared taxa were identified using Proteinortho524. A total of 1,351 proteins are common to all the studied strains, including the outgroups. However, to put the hybrid genomes into phylogenomic perspective, 405 orthologous proteins present solely in single copies among the haploid genomes and only in duplicate copies in the hybrid genomes were used to reconstruct the phylogeny of the Trichosporonaceae. The trimmed concatenated protein alignment was 223,082 amino acids long. The resultant maximum likelihood phylogeny (Fig. 1) shows the clustering of the Trichosporonaceae into six distinct clades. Eleven of the twelve Cutaneotrichosporon, seven of the eight Trichosporon and all nine Apiotrichum strains incorporated in the study fall into three separate clades congruent with the distinct Trichosporonaceae genera that they represent1,2.

Figure 1

Phylogenomic analysis of members of the family Trichosporonaceae. The maximum likelihood (ML) tree was inferred from the concatenated protein alignment (223,082 amino acids) of 405 proteins present in single copies among the haploid genomes and only in duplicate copies in the hybrid genomes. The phylogeny was generated using IQ-TREE version 1.6.7 based on the LG + F + R10 model. The ML tree was generated with confidence values based on 1,000 bootstrap replicates. The documented oil accumulating members of the family are indicated in blue font. The labels ‘_1’ and ‘_2’ indicate the two sets of single copy orthologs (SCOs) in the hybrid genomes, where the latter shows higher amino acid similarity to the closest haploid genome.

While three clear genus clades can be observed in the single copy orthologues phylogeny (SCOP), two taxa, namely C. cutaneum ACCC 20271 and T. akiyoshidainum HP2023, are clearly delineated within the Apiotrichum clade in the SCOP, and should thus be reassigned to the latter genus. As has previously been observed through separation of subgenomes18,25, the duplicate orthologue copies (here referred to as ‘strain number’_1 and _2) in the three described hybrid genomes form distinct branches but are still retained within their genus clades (Fig. 1). When considering the fourth putative hybrid genome identified in this study, C. cutaneum B3, B3_1 clusters with C. mucoides JCM 9939T_1, while B3_2 clusters with C. mucoides JCM 9939T_2, suggesting that the two strains are likely to have shared a similar evolutionary history, including episodes of hybridization. In addition to evidence from gene duplication (55.7%) determined using BUSCO23 basidiomycota_odb9, BLASTP analyses showed that C. dermatis JCM 11170 shares 92.41% and 97.92% amino acid similarity among the 405 single copy orthologues (SCO) with those of C. cutaneum B3_1 and B3_2, respectively. In addition, the 405 SCO sets of B3_1 and B3_2 shared on average 92.76% amino acid similarity, providing further support for the distinct origin of the duplicated single copy orthologue sets.

Differences in the proteolytic and carbohydrate metabolic enzyme complements of the Trichosporonaceae may influence their lifestyles

To further enhance our understanding of various functional and adaptational capacities of the studied strains, proteins annotated as Carbohydrate-Active enZYmes (CAZYmes) and proteolytic enzymes (MEROPS) were identified and compared (Fig. 2a). The presence of these proteins can provide an indication of the ranges of possible carbohydrate and protein substrates utilised by an organism. CAZYmes represent a broad scope of proteins associated with the assembly, modification and degradation of various types of carbohydrates26 and are curated in the Carbohydrate-Active EnZYmes database (http://www.cazy.org). The Cutaneotrichosporon strains displaying hybrid genomes showed the highest numbers of CAZYmes: 671 in Cutaneotrichosporon cutaneum B3 and 689 in Cutaneotrichosporon mucoides JCM 9939T8 (Supplementary Fig. 1). Aside from these hybrid genome taxa, the genomes of the two Apiotrichum porosum strains encode the highest numbers of CAZYmes (570 & 604 proteins), with ~68% of these belonging to the class of glycoside hydrolases (GH). Similarly, GHs form the largest proportion of the CAZYmes in all studied strains. Considering the average CAZYme numbers within each genus, the Apiotrichum species also harbour the most CAZYmes (average 421), followed by Vanrija (379), Trichosporon (378) and Cutaneotrichosporon (365). However, the single available genome of Pascua guehoae also encodes 460 CAZYmes. Among the genera, Trichosporon has the highest average numbers of CAZYmes linked to auxiliary activities (AA) and glycosyltransferases (GTs), with 65 and 55, respectively. Vanrija harbours the highest average numbers of carbohydrate-binding modules (CBM) and carbohydrate esterases (CE), with 17 and 30, respectively. The highest mean numbers of glycoside hydrolases (261) and polysaccharide lyases (PL; 19) were recorded in Apiotrichum and Cutaneotrichosporon, respectively.
The abundance of CAZYmes has been linked to fungal lifestyle, with saprophytic fungi harbouring larger numbers of these enzymes than their parasitic counterparts27. This pattern is apparent in the current comparison, where on average the free-living fungi of the genera Apiotrichum and Vanrija harbour greater numbers of CAZYmes than the predominantly host-associated Trichosporon and Cutaneotrichosporon taxa. Furthermore, the abundance of GHs and CEs28 in Apiotrichum and Vanrija, respectively, may reflect their capacity to break down and utilise a wide range of substrates. These taxa are frequently isolated from soil and other environments where they degrade and subsist on various forms of complex substrates29.

Figure 2

Comparison of the number of proteins associated with (a) CAZymes and (b) MEROPS among thirty-three strains of Trichosporonaceae. CAZymes; AA: auxiliary activities, CBM: carbohydrate-binding modules, CE: carbohydrate esterases, GH: glycoside hydrolases and GT: glycosyltransferases. MEROPS; A: aspartic peptidases, C: cysteine peptidases, M: metallo-peptidases, N: asparagine peptide lyases, S: serine peptidases, T: threonine peptidases, and I: protease inhibitors.

Proteolytic enzymes are proteins that hydrolyse peptide bonds and are widely distributed across all domains of life with estimates showing that they comprise ~2% of all proteins encoded on the genomes of organisms across all domains of life30. These enzymes form an important component of the biomass degradation capacities of both fungi and bacteria31 and their distribution is reflective of the lifestyle of the organisms. For instance, comparison of pathogenic and non-pathogenic Pseudogymnoascus strains revealed a marked underrepresentation of proteases in the former relative to the latter organisms32. To predict these enzymes, proteins of the organisms included in this study were searched against the manually curated enzymes in the MEROPS database33. All seven classes of MEROPS, namely aspartic peptidases (A), cysteine peptidases (C), metallo-peptidases (M), asparagine peptide lyases (N), serine peptidases (S), threonine peptidases (T), and protease inhibitors (I) are represented in the genomes of the thirty-three Trichosporonaceae, comprising approximately 3% of the proteins of the organisms (Fig. 2b). As observed with the CAZYmes, the hybrid genomes in the genera Trichosporon and Cutaneotrichosporon harbour the most abundant peptidases, ranging between 428 and 458 proteins. Omitting the hybrid genomes, the highest average number of the MEROPS was observed among the Vanrija and Apiotrichum species, with 264 and 270 proteins, respectively. The three most abundant MEROPS belong to the class S (56–143 proteins), M (63–130 proteins) and C (49–101 proteins) across the different genera. 
However, asparagine peptide lyase (N), which is the only member of the MEROPS that is not a peptidase34, appears to be restricted to only five of the strains: Apiotrichum domesticum JCM 9580T (1 protein), Apiotrichum laibachii JCM 2947T (1 protein), Cutaneotrichosporon arboriformis JCM 14201T (2 proteins), Trichosporon faecale JCM 2941T (1 protein) and Trichosporon inkin JCM 9195 (1 protein). Serine and metallo-peptidases are widely distributed in fungi and may reflect the capacity of these organisms to use proteinaceous substrates35,36. However, serine peptidase content has been shown to be determined by both the proteome size and the lifestyle of fungi. Parasitic fungi, which are often associated with reduced genomes/proteomes, and those involved in symbiosis have been shown to harbour fewer serine proteases37. The predominance of serine peptidases (S; average of 81 and 82 proteins, respectively) and metallo-peptidases (average of 77 and 78 proteins, respectively) among the mainly soil inhabiting Vanrija and Apiotrichum spp. reflects their versatility in sequestering a wide range of complex substrates in their environment. Cysteine peptidases have been reported as pivotal in sustaining parasitic lifestyles38. Among the Trichosporonaceae, the upper range of the cysteine peptidases is seen among the predominantly host-associated Trichosporon (on average 66 proteins) and Cutaneotrichosporon (on average 60 proteins) strains, while Apiotrichum and Vanrija strains had on average only 56 and 53 of these proteins encoded on their genomes, respectively.

Phylogeny of oleagenic proteins and promoter regions of their genes highlights the complex evolution of lipid biosynthetic pathway

The biochemical production and accumulation of single cell oil in fungi has received extensive interest because these organisms could serve as eco-friendly sources of lipids and other important biochemicals with a wide range of biotechnological applications7,39. To provide additional insights into the genomic basis of oil accumulation among the compared strains, six proteins involved in the biochemical pathway (Fig. 3) central to lipid production and accumulation were analysed. These were acetyl-CoA carboxylase (ACC), AMP deaminase (AMPD), ATP-citrate synthase (ACS), fatty acid synthase subunits alpha and beta (FASI & II) and isocitrate dehydrogenase [NADP] (ICDH). Understanding the structure of the regulatory elements of the genes that code for these proteins may be pivotal in deciphering approaches for enhanced oil production. For instance, an increase in lipid accumulation was achieved through the overexpression of ACC under various promoter systems40,41. As such, the transcription factor binding domains (TFBDs) 600 bp upstream of these genes were analysed.

Figure 3

Illustration of the initiation of biochemical oil production in yeasts, showing the steps within the pathway catalysed by the studied enzymes under nitrogen limitation. ACC, acetyl-CoA carboxylase; AMPD, AMP deaminase; ACS, ATP-citrate synthase; FASI & II, fatty acid synthase subunits alpha and beta; and ICDH, isocitrate dehydrogenase [NADP]. × and ↑ indicate the inhibition of ICDH and the increased activity of AMPD under nitrogen limitation, respectively. Modified from7

Evaluation of the proteomes of the yeasts included in this study reveals that orthologues of the selected proteins occur in all of the strains studied, with the exception of T. asahii var. asahii CBS 8904 and T. akiyoshidainum HP2023, in which orthologues of ACC are absent, and C. curvatus SBUG-Y 855, which does not encode an orthologue of AMPD on its genome. The hybrid genomes of C. cutaneum B3, T. coremiiforme JCM 2938, T. ovoides JCM 9940 and C. mucoides JCM 9939 harbour two copies of FASI & II, ACC and ACS. However, only JCM 2938 retains the duplicate copy of ICDH, while AMPD is present in two copies in B3 and JCM 2938. Given the essential nature of these proteins, it is likely that the absence of some of the orthologues is associated with the level of genome completeness rather than the lack of the affected function.

Oil production in yeasts has been linked to nutrient limitation, where the organisms channel carbon flux to lipid instead of energy production5,7. Two enzymes directly associated with this function are AMPD and ICDH, with the former shown to enhance the depletion of AMP and consequently to play a role in the inhibition of ICDH42. A phylogeny based on the AMPD amino acid sequences (Supplementary Fig. 2,a) showed that, apart from the placement of V. humicola, this tree has a similar topology and clustering to the SCOP. Clustering of the strains based on the distribution and abundance of TFBDs upstream of the AMPD gene (Supplementary Fig. 2,b) shows distinct grouping of the organisms, suggesting disparate evolution of this regulatory region. In the ICDH tree (Fig. 4a), only the Trichosporon species showed a coherent grouping, while members of the genus Cutaneotrichosporon, including the known oleaginous strain C. curvatus, show an incongruent branching pattern relative to the SCOP, indicating a distinct evolutionary history of the ICDH gene. Comparison of the TFBDs of the ICDH gene revealed that these fungi form six distinct clusters (Fig. 4b), with the documented oleaginous strains A. porosum, C. curvatus and C. oleaginosum clustering together, thereby suggesting a possible similarity in the regulation of the ICDH gene among these strains. Two other reported oil accumulating yeasts, namely C. cutaneum B3 and C. cutaneum ACCC 20271, are also closely clustered with the rest of the oleaginous strains. The affiliation of these two strains has been discussed above. The predicted TFBDs of ICDH include binding motifs for Gis1; Gat1p, Gln3p, Gzf3p; and Gln3p, all of which have been implicated in the regulation of gene expression under nutrient starvation, including amino acid and nitrogen limitation43,44,45.

Figure 4

Evolutionary analyses of the ICDH protein and the upstream region of its gene among thirty-three strains of Trichosporonaceae. (a) ML tree of ICDH (380 amino acids long trimmed alignment) generated using IQ-TREE version 1.6.7 with confidence values based on 1,000 bootstrap replicates. (b) Distribution of predicted transcription factor binding sites 600 nucleotide bases upstream of the transcription initiation site of ICDH gene clustered using hierarchical clustering on principal components (HCPC) in R. The documented oil accumulating members of the family are indicated in blue fonts in the phylogeny and with blue arrows in the HCPC.

Suppression of ICDH, which is considered a feature specific to oleaginous yeasts5, results in the accumulation of citrate in the mitochondrion. The citrate is then transferred into the cytoplasm, where ACS catalyses its conversion into acetyl-CoA and oxaloacetate. The ACS phylogeny (Fig. 5a) showed a branching pattern similar to that of the SCOP, although P. guehoae is placed within the well supported Cutaneotrichosporon clade. Based on the TFBDs of ACS, however, the strains group into six distinct clusters (Fig. 5b), with two of the known oleaginous strains, A. porosum and C. curvatus, clustering together. In addition to the Gis1p, Msn2p, Msn4p, Rph1p and YER130C binding domains, which are known to regulate gene expression under nutrient limitation and stress45, the regulatory region of ACS includes the Adr1p TFBD. Adr1p is a carbon source-responsive transcription factor involved in the regulation of genes associated with ethanol, glycerol, and fatty acid utilization and peroxisome biogenesis46,47,48. As reflected in the characteristic clustering of A. porosum and C. curvatus, each of these two strains carries two putative binding sites for Adr1p, whereas C. oleaginosum harbours four such TFBDs.

Figure 5

Evolutionary analyses of the ACS protein and the upstream region of its gene among thirty-three strains of Trichosporonaceae. (a) ML tree of ACS (1,097 amino acids long trimmed alignment) generated using IQ-TREE version 1.6.7 with confidence values based on 1,000 bootstrap replicates. (b) Distribution of predicted transcription factor binding sites 600 nucleotide bases upstream of the transcription initiation site of ACS gene clustered using hierarchical clustering on principal components (HCPC) in R. The documented oil accumulating members of the family are indicated in blue fonts in the phylogeny and with blue arrows in the HCPC.

One of the products of the cleavage of citrate, acetyl-CoA, is either directly channelled to fatty acid synthesis via the FAS complex (catalysed by FASI & II) or converted into malonyl-CoA, which is subsequently directed to fatty acid synthesis. The latter reaction is catalysed by ACC. Incongruent with the SCOP, the ACC of Apiotrichum and Trichosporon species as well as those of Pascua guehoae and Prillingera fragicola appear to share a similar evolutionary history, clustering distinctly from the Cutaneotrichosporon species (Fig. 6a). The TFBDs of the ACC gene grouped the studied strains into eight distinct clusters (Fig. 6b). Based on this grouping, the five documented oleaginous yeasts assemble in two close clades. In addition to the previously discussed putative sites for transcription factors regulating genes under nutrient limitation, adaptation to stress and utilisation of ethanol, glycerol and fatty acids, the TFBDs of the ACC gene include putative binding sites for the zinc cluster protein Gsm1p and the basic helix-loop-helix transcription factor Pho4p. Gsm1p has been predicted to regulate energy metabolism49,50 while Pho4p was shown to be activated in response to phosphate limitation and controls genes of the phosphatase regulon and an inorganic phosphate (Pi) transport system in Saccharomyces cerevisiae51,52. Pi limitation has been used as an alternative means of inducing oil accumulation in oleaginous yeast53. The phylogeny generated based on FAS subunits (Supplementary Fig. 3c,e) revealed a clustering similar to that observed in the SCOP with the exception of the placements of P. guehoae and P. fragicola in both trees and the distinct grouping of C. curvatus and C. cyanovorans in FASII (Supplementary Fig. 2,f). This may indicate a disparate evolution of the FASII genes in the latter strains. In terms of the TFBDs, the oleaginous strains group in separate clusters for both FASI & II (Supplementary Fig. 2d,f), indicating a more complex evolution of these genomic regions. However, the TFBDs of both genes include Gis1p, Msn2p, Msn4p, Rph1p and YER130C binding sites, which are involved in gene regulation under nutrient starvation45, while the FASI regulatory region harbours Adr1p46,47,48 and Gsm1p49,50 binding domains and that of FASII includes Pho4p49,50 TFBDs. Overall, the prediction of the TFBDs could serve as a preliminary approach for the genomic exploration and identification of potential oleaginous yeasts.

Figure 6

Evolutionary analyses of the ACC protein and the upstream region of its gene among thirty-three strains of Trichosporonaceae. (a) ML tree of ACC (2094 amino acids long trimmed alignment) generated using IQ-TREE version 1.6.7 with confidence values based on 1,000 bootstrap replicates. (b) Distribution of predicted transcription factor binding sites 600 nucleotide bases upstream of the transcription initiation site of ACC gene clustered using hierarchical clustering on principal components (HCPC) in R. The documented oil accumulating members of the family are indicated in blue fonts in the phylogeny and with blue arrows in the HCPC.

Clustering of the fungal isolates based on the regulatory regions of genes encoding the enzymes that determine oil production pathway may be useful in selecting strains with similar pattern of putative regulatory mechanisms for subsequent characterisation. Considering the TFBDs clustering pattern of ICDH and ACC, seven strains namely, A. porosum JCM 1458T, A. gamsii JCM 9941T, A. brassicae JCM 1599T, A. laibachii JCM 2947T, C. arboriformis JCM 14201T, C. mucoides JCM 9939T and C. dermatis JCM 11170 are closely grouped with the oil accumulating isolates in the two clusters. Whereas the Cutaneotrichosporon species may not be excellent candidates because of their association with human host, the Apiotrichum species, all of which are free-living and isolated from various environments (Table 1) could potentially be oleagenic. A. porosum JCM 1458T and A. gamsii JCM 9941T, are the closest relatives of the oleagenic A. porosum DSM 27194.

Conclusion

Here, we have analysed the genomes of thirty-three members of the Trichosporonaceae, including five yeasts, A. porosum, C. curvatus, C. oleaginosum, C. cutaneum B3 and C. cutaneum ACCC 20271, for which data regarding substantial lipid accumulation are available. Analysis of the whole genome phylogeny based on single copy orthologs shows that certain strains currently assigned to the genera Trichosporon and Cutaneotrichosporon belong to the genus Apiotrichum. This highlights the need for appropriate genomic evaluation schemes when genomes are deposited in databases. Comparison of the proteomes of these strains suggests functional diversification consistent with the various lifestyles and isolation sources of the studied organisms. For instance, the abundance of the various CAZYmes and MEROPS indicates the potential capacity of the yeasts to degrade a wide variety of biomass, with distinct enzyme sets linked to these capacities in free-living and host-associated taxa within the Trichosporonaceae. The evaluation of selected genes coding for proteins involved in lipid biosynthesis and their corresponding transcription factor binding domains suggests a complex evolution, with some level of conservation for the TFBDs of ACC, ACS and ICDH among the well-studied oil accumulating members of the family Trichosporonaceae. This indicates a possible similarity in the regulation of the genes encoding these enzymes among the clustered strains. Further work should focus on investigating the specific binding potentials of the predicted TFBDs and their potential roles in oil production and accumulation in oleaginous yeasts. Taken together, this information could be harnessed towards the selection of strains with potential functional capabilities that could be explored for the generation of environmentally friendly bioproducts, including single cell oils, biopharmaceuticals, and various raw materials in the food industry.

Methods

Genome sequences, gene predictions and annotation

Thirty-five genomes, comprising those of thirty-three members of the family Trichosporonaceae and two from the family Tetragoniomycetaceae (outgroup strains), were incorporated in this study (Table 1). Genome annotation was accomplished using the Funannotate pipeline (v. 1.5.0–8f86f8c)54. In brief, small duplicate contigs were removed (clean), the remaining contigs were sorted by size and renamed (sort), and repeats were masked using RepeatMasker v4.0.7 prior to gene prediction and annotation. Gene models were predicted using Augustus v3.2.3, GeneMark-ES v4.35, EVidenceModeler v1.1.1 and tRNAscan-SE v1.3.1. For all gene predictions, the Augustus training set for ‘cryptococcus’ was used. The predicted proteins were functionally annotated using InterProScan v.5.30–69.0, eggNOG-mapper v1.0.3.3-g3e22728, PFAM v.31.0, UniProtKB 2018_07, MEROPS v12.0, CAZYme (dbCAN v6.0), Phobius v1.01 and SignalP v4.1. The completeness of the studied genomes was determined using BUSCO v3.0.3.

Phylogenomic analysis

Single copy orthologues conserved among the predicted protein sequences of the thirty-three Trichosporonaceae and two outgroup strains were identified using Proteinortho524 with default parameters, except for the percent amino acid identity, which was set at 40%. To restrict the phylogeny to single copy orthologs (SCOs), the analysis included only proteins occurring in single copies among the haploid genomes and strictly in two copies in the hybrid genomes. The subgenome SCO complement for each hybrid genome was determined by BLASTP comparison of the duplicate SCOs with the corresponding SCOs of the most closely related non-hybrid genomes18,25. The orthologous proteins were aligned using T-coffee v11.00.8cbe48655,56. The resultant alignment was concatenated and trimmed using Gblocks v0.9b57,58 with -b5 = h. The trimmed alignment was used to construct a maximum likelihood (ML) tree using IQ-TREE version 1.6.759 based on the LG + F + R10 model (predicted using IQ-TREE) and 1,000 bootstrap replicates.
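The copy-number criterion described above (one copy in every haploid genome, exactly two copies in every hybrid genome) can be illustrated as a simple filter over orthologue groups. This is a minimal sketch under stated assumptions: the input is taken to be a mapping from group identifier to per-genome protein lists, and the function name, group names and genome labels are hypothetical. It does not parse Proteinortho output.

```python
def select_orthologue_groups(groups, genomes, hybrid_genomes):
    """Return IDs of groups that fit the single-copy criterion.

    groups: dict mapping group ID -> dict mapping genome -> list of protein IDs
    genomes: iterable of all genome names that must be represented
    hybrid_genomes: set of genome names treated as hybrids (expect 2 copies)
    """
    selected = []
    for group_id, members in groups.items():
        # Every genome must be present with exactly the expected copy number.
        if all(
            len(members.get(genome, [])) == (2 if genome in hybrid_genomes else 1)
            for genome in genomes
        ):
            selected.append(group_id)
    return selected


# Toy input (hypothetical identifiers, not data from this study).
toy_genomes = ["haploid_A", "haploid_B", "hybrid_H"]
toy_groups = {
    "og1": {"haploid_A": ["a1"], "haploid_B": ["b1"], "hybrid_H": ["h1", "h2"]},
    "og2": {"haploid_A": ["a2"], "haploid_B": ["b2"], "hybrid_H": ["h3"]},
    "og3": {"haploid_A": ["a3", "a4"], "haploid_B": ["b3"], "hybrid_H": ["h4", "h5"]},
    "og4": {"haploid_A": ["a5"], "hybrid_H": ["h6", "h7"]},
}
toy_selected = select_orthologue_groups(toy_groups, toy_genomes, {"hybrid_H"})
```

In this toy input only `og1` passes: `og2` has a single copy in the hybrid, `og3` is duplicated in a haploid genome, and `og4` is missing from one haploid genome.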

Evolutionary analysis of oleagenic proteins and promoter regions of their genes

Orthologs of proteins that play a major role in the biochemical pathways of lipid production in yeasts were identified among the Trichosporonaceae and Tetragoniomycetaceae based on BLASTP (percent identity cutoff value of 40%) using Proteinortho524. Individual orthologous proteins were aligned using T-coffee v11.00.8cbe48655,56 and manually inspected to ensure accuracy of the alignments. The alignments were trimmed using Gblocks v0.9b57,58 and maximum likelihood (ML) trees were generated using IQ-TREE version 1.6.759 with 1,000 bootstrap replicates. Bedtools v2.27.160 was employed to extract the regulatory regions of the genes encoding these proteins, comprising 600 nucleotide bases upstream of the transcription initiation site. Each set of regulatory regions was scanned for putative transcription factor binding domains (TFBDs) using the tools in YEASTRACT61, a database that curates the transcription factors (TF) and their target regulatory binding sites in Saccharomyces cerevisiae. The variation in the distribution of the TFBDs among the studied strains was used to group them using hierarchical clustering on principal components (HCPC) computed in R.
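The extraction of a 600-nucleotide upstream region depends on gene orientation, which the following sketch makes explicit. It is an illustration only and is not the Bedtools command used in this study, which is not specified in the text. The sketch assumes 0-based, half-open (BED-style) coordinates, uses the annotated gene start as a proxy for the transcription initiation site, and clips the window at contig ends; all function names and the toy sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_interval(start, end, strand, contig_length, flank=600):
    """Return (lo, hi) of the window upstream of a gene, clipped to the contig.

    start/end are 0-based half-open gene coordinates on the contig.
    """
    if strand == "+":
        # Upstream lies at lower coordinates for forward-strand genes.
        return max(0, start - flank), start
    if strand == "-":
        # Upstream lies at higher coordinates for reverse-strand genes.
        return end, min(contig_length, end + flank)
    raise ValueError("strand must be '+' or '-'")


def upstream_sequence(contig, start, end, strand, flank=600):
    """Return the upstream sequence, oriented 5'->3' relative to the gene."""
    lo, hi = upstream_interval(start, end, strand, len(contig), flank)
    region = contig[lo:hi]
    return region if strand == "+" else reverse_complement(region)


# Toy contig (hypothetical sequence), using a 4-base flank for readability.
toy_contig = "AAAACCCCGGGGTTTT"
forward_upstream = upstream_sequence(toy_contig, 8, 12, "+", flank=4)
reverse_upstream = upstream_sequence(toy_contig, 4, 8, "-", flank=4)
```

Genes close to a contig end yield windows shorter than the requested flank, so regions from fragmented assemblies may contain fewer candidate binding sites for purely technical reasons.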