Introduction

Comparative genomics is a powerful tool for the investigation of yeast evolution1,2. Genome sequences are now available for a large number of Saccharomycetaceae and Debaryomycetaceae species within the subphylum Saccharomycotina3,4,5,6,7,8,9,10,11,12. Species associated with the Pichia/Ogataea clade such as Dekkera bruxellensis, Komagataella pastoris, Ogataea polymorpha and Kuraishia capsulata have also attracted a great deal of attention13,14,15,16, but the basal lineages of the Saccharomycotina remain poorly studied. To date, the genome sequences of only two basal species, Yarrowia lipolytica6 and Blastobotrys adeninivorans17, have been reported.

The ubiquitous species Geotrichum candidum (teleomorph = Galactomyces candidus), a member of the basal family Dipodascaceae, can be found in a wide range of habitats, from plant tissue and silage to soil, air, water, milk and cheese18,19,20. G. candidum is well known as an important component of the surface microbiota of soft cheeses and has also been used as a starter in the cheese industry21. It is also involved in beer making22 and industrial enzyme production23. In addition, G. candidum presents unusual characteristics that have complicated its taxonomic classification. For instance, it displays high morphological variability and wide phenotypic diversity and has many features generally associated with filamentous fungi. Although initially classified as a yeast by the two major yeast taxonomic monographs24,25, it was later reclassified as a mould or filamentous yeast-like fungus18,26.

Saccharomycotina yeasts have greatly contributed to the understanding of major molecular evolutionary mechanisms leading to functional diversity, such as gene duplication followed by neo- or sub-functionalization4,9,17,27,28,29,30,31,32,33. Recent developments have shown that horizontal gene transfer (HGT) also contributes to the diversity between species34,35,36. However, these two gene-gain processes alone cannot account for most of the major and rapid transitions during yeast evolution, such as the split between Pezizomycotina (filamentous fungi) and Saccharomycotina (yeasts), which was associated with genome contraction in the Saccharomycotina subphylum. Based on our whole genome comparisons between G. candidum and the other ascomycetes, we show that significant differential gene loss has occurred in lineages associated with major evolutionary transitions in yeasts, underscoring this evolutionary mechanism as an important force shaping genomic and functional diversity.

Results

Overall characteristics of the G. candidum CLIB 918 genome

A high-quality draft genome sequence of Geotrichum candidum strain CLIB 918 (= ATCC 204307) was obtained by combining 454 pyrosequencing of an 8 kb mate-pair library, Illumina/Solexa sequencing of genomic fragments and a single whole genome shotgun 454 pyrosequencing run. The final assembly yielded 134 scaffolds with 1416 sized gaps, as highly repeated sequences such as transposable elements are typically missing from the assembly. We estimated the number of transposons and related elements to be of the order of 1000, corresponding to the gaps in the sequence assembly (Supplementary Note). A preliminary analysis based on scaffold size and presence of genes shortlisted the 27 largest scaffolds, totaling 24.2 Mb, i.e. 97.5% of the assembly. The 107 remaining scaffolds were merged into the artificial scaffold 32 with a size of 620.6 kb. The genome had a GC content of 48% and its size was estimated to be 24.8 Mb by the Newbler assembler. As such, it constitutes the largest Saccharomycotina yeast genome described to date, 25% larger than that of Y. lipolytica with 20.5 Mb6. The overall number of protein-coding genes in CLIB 918 is 6804 (excluding transposons and pseudogenes). The data are summarized in Table 1, Supplementary Table S1 and Supplementary Note. In addition to the nuclear genome, the mitochondrial genome was also sequenced, assembled and annotated (Supplementary Fig. S1), producing a single, circular contig of 29 kb with a GC content of 27.6%.

Table 1 Genome characteristics comparison

Automated annotation followed by manual curation identified 4713 genes presenting unambiguous sequence similarity to Saccharomyces cerevisiae genes and 1245 genes coding for conserved hypothetical proteins with similarity to fungal proteins but no clear ortholog in S. cerevisiae. The latter set of genes included 371 ORFs to which functions could be tentatively assigned based on comparison against annotated genomes and conserved domains, 34 genes encoding subunits of the NADH-ubiquinone oxidoreductase complex 1 (Supplementary Table S2) and 27 genes with unique fungal homologs. Further, we found 846 genes with no similarity to any gene outside G. candidum. Finally, we identified three cases of bacterial HGT (Supplementary Data 1).

Phylogenomic analysis performed on the 246 genes previously identified by Aguileta and coworkers37 unambiguously placed G. candidum within the Saccharomycotina subphylum, with B. adeninivorans and Y. lipolytica as its closest neighbors. However, the branch lengths indicate that these species are not closely related (Fig. 1). This observation was confirmed by the reduced synteny existing between G. candidum and the two other basal species (Supplementary Fig. S2). As few as 778 and 511 syntenic blocks were identified between G. candidum and B. adeninivorans or Y. lipolytica, respectively (Supplementary Table S3). The large majority of these blocks comprised only 2 genes (50% of the blocks of synteny with B. adeninivorans and 64% of those with Y. lipolytica) or 3 genes (31% and 26%, respectively).

Figure 1

Phylogenetic position of G. candidum.

Maximum likelihood phylogenomic reconstruction of 29 fungal species based on 246 concatenated gene sequences. The analysis was based on 64,105 informative positions remaining after curation of the 176,113 original aligned amino acids. Percentage bootstrap values for 100 replicates were 100% at each node. The bar represents 5 amino acid changes per 100 amino acids.

G. candidum genes are characterized by an average of 0.56 introns per protein-coding gene (3830 introns in 6804 ORFs). Thirty-five percent (2414) of the genes have at least one intron. This high intron content and the short intron size (71 nt median) depart from the situation in other yeasts (Supplementary Fig. S3, Supplementary Table S4). Indeed, the number of introns in G. candidum is 12.9-fold higher than in S. cerevisiae and 3.4-fold higher than in Y. lipolytica, the most intron-rich Saccharomycotina yeast described to date (Table 1). Finally, a striking feature of the spliceosomal introns in G. candidum is the poor conservation of the 5’ splice site and the branch point when compared to other yeasts within the Saccharomycotina38 (Supplementary Fig. S4; Supplementary Note).
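The intron statistics quoted above follow directly from the raw counts. The short Python sketch below recomputes them as an illustrative check; it is not part of the original analysis, and the function and variable names are hypothetical.

```python
# Counts quoted in the text for G. candidum CLIB 918.
TOTAL_GENES = 6804          # protein-coding genes
TOTAL_INTRONS = 3830        # spliceosomal introns
GENES_WITH_INTRON = 2414    # genes carrying at least one intron


def intron_summary(total_genes: int, total_introns: int, genes_with_intron: int) -> dict:
    """Return intron density and the share of intron-containing genes."""
    return {
        "introns_per_gene": total_introns / total_genes,
        "percent_genes_with_intron": 100 * genes_with_intron / total_genes,
        # Mean intron count restricted to genes that have at least one intron.
        "introns_per_intron_containing_gene": total_introns / genes_with_intron,
    }


summary = intron_summary(TOTAL_GENES, TOTAL_INTRONS, GENES_WITH_INTRON)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The third value is not stated in the text; it is derived here only to show that intron-containing genes carry, on average, more than one intron.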

G. candidum has a sexual state39. A single gene (GECA02s02545g) coding for a protein of 281 amino acids that we have named MATA was identified on the basis of its sequence similarity with other fungal MAT genes and its position in a chromosomal region sharing a conserved organization with that of mating type loci in other yeasts and fungal species (Supplementary Fig. S5). In a survey of G. candidum strains we identified the MATB idiomorph, indicating that this species is heterothallic (Supplementary Note).

Functional analysis and gene family expansion

To gain insight into the evolutionary dynamics of G. candidum genes and compare them to those of other yeasts, we reconstructed the phylome (i.e. the complete set of individual gene phylogenies) for G. candidum as described in Materials and Methods. The resulting phylogenies, stored in PhylomeDB40 (www.phylomedb.org), span the evolution of yeasts across the main Dikarya groups (Ascomycota and Basidiomycota). The phylome was analyzed to bring to light G. candidum-specific duplications and infer orthology and paralogy relationships.

This analysis showed that G. candidum has 56 amplified gene families, that is, groups of paralogs containing three or more genes (Supplementary Data 2). The most highly amplified gene family (unknown function) with 21 copies has no counterpart in any other genome. The second largest expansion contains 16 members in a GRE2-like gene family, GRE2 being a pleiotropic gene involved in ergosterol biosynthesis and control of filamentous growth in S. cerevisiae41,42. This gene family is also amplified in most other yeasts, but to a lesser extent. Finally, the category of transporters and permeases is also highly amplified in G. candidum, both general permeases and, more specifically, allantoate permeases and transporters for bile acid, nicotinic acid and monocarboxylate.

The number of genes involved in chitin metabolism is striking, as many of the genes of this pathway are present in more than one copy. Interestingly, six copies of the ortholog encoding chitin synthase III (CHS3-like), necessary for the majority of cell wall chitin synthesis, are found. This analysis also revealed six co-orthologs (including a pseudogene) of the activator of chitin synthase III (SKT5). By comparison, Y. lipolytica, a dimorphic species with a strong tendency to form filaments, contains only three chitin synthase-related genes and a single SKT5 regulator (Supplementary Table S5). The high number of genes involved in chitin metabolism compared with other yeasts correlates with the phenotype of high production of hyphae and pseudo-hyphae in G. candidum.

G. candidum is a major component of the microbiota of soft cheeses. In agreement with its propensity for growth in the dairy ecosystem, an expanded family with a total of four carboxylesterase/type B lipase genes was identified, of which two have previously been cloned and sequenced23,43 (Supplementary Table S6). Interestingly, none of these genes had an equivalent in the Saccharomycotina subphylum, but had homologs in the Pezizomycotina (see later section on specific gene retention). These lipases were predicted from their sequence to be secreted extracellular enzymes, in accordance with the first step of triacylglycerol catabolism in the dairy matrix involving secreted lipases. Volatile sulfur compounds, key to cheese aroma, are produced from the catabolism of methionine and cysteine by yeasts44. Seven of the genes in this pathway are duplicated in G. candidum (Supplementary Fig. S6), in accordance with its known preeminent role in the cheese ripening process45 and a putative domestication of this yeast.

The most surprising gene amplification concerned gene families involved in the degradation of plant polysaccharides, which are typically associated with filamentous fungi. G. candidum has undergone amplification of three distinct families of cellulolytic enzymes (Supplementary Data 2). These included four copies of an endoglucanase GH45, five copies of a lytic polysaccharide monooxygenase and five copies of an endo-polygalacturonase. Such functions have not been described in yeasts, except for a single gene encoding an endoglucanase GH45 in K. pastoris46 and one distantly related polygalacturonase in S. cerevisiae47,48. These enzymes, whose presence greatly varies among fungi, are responsible for plant cell wall polysaccharide degradation, leading to cell-wall decomposition in a saprophytic or pathogenic context49. The gene complement of carbohydrate-degrading enzymes is unique in G. candidum among yeasts (Supplementary Note; Supplementary Data 3). Further experimental investigations will be necessary to validate the hypothesis that this permits the use of a broad range of carbon and energy sources. The overall distribution of the annotated gene functions is shown in Supplementary Fig. S7a,b,c,d.

Specifically retained ancestral genes in G. candidum

Functional annotation of the G. candidum genome was performed using the proteome of S. cerevisiae as well as those of other taxa of Saccharomycotina, Pezizomycotina and Basidiomycota. An initial BlastP analysis showed that there exists a set of a few hundred G. candidum genes which do not have orthologs in any sequenced Saccharomycotina species, but which display a good level of sequence conservation with predicted proteins from filamentous fungi (Pezizomycotina and Basidiomycota).

A detailed analysis of the topology of the phylogenies for each of the predicted proteins (phylome analysis) showed that 280 genes (4.1% of the 6804 G. candidum genes) presented discordant phylogenies. The simplest explanation for the presence of such genes, and the one most often put forward, is that they are the result of horizontal gene transfer (HGT), which has been shown to occur, albeit infrequently, between eukaryotes35,50,51. In this respect, we identified a total of 17 clear cases of HGT from filamentous fungi, where the G. candidum gene grouped outside the Saccharomycotina, either within the sister subphylum Pezizomycotina (16 genes; Table 2 and Supplementary Fig. S8) or outside the Ascomycota (1 gene). In this latter case, the G. candidum gene (GECA13s02485g, putatively involved in polyamine metabolism) grouped within the Basidiomycota (Fig. 2). To the best of our knowledge, this is the first report of a gene horizontally transferred from the Basidiomycota to a Saccharomycotina species (Supplementary Note).

Table 2 List of putative HGTs from Pezizomycotina species to G. candidum.
Figure 2

Phylogenetic position of the G. candidum gene GECA13s02485g, potentially encoding a spermine synthase, among Pezizomycotina and Basidiomycota orthologs.

Sequences of the fungal genes most closely related to GECA13s02485g were retrieved from NCBI after Blast comparison to Pezizomycotina and to Basidiomycota. Sequences were aligned using MUSCLE, the alignment was curated using Gblocks and the phylogenetic reconstruction was performed using Phyml with default settings as implemented in phylogeny.fr (http://www.phylogeny.fr/). The list of species can be found in Supplementary Data 6.

However, the remaining 263 of the 280 discordant genes did not appear to be due to HGT, grouping phylogenetically neither within the Saccharomycotina nor within the Pezizomycotina. Further analysis revealed that 141 of these 263 genes had no orthologs within the Saccharomycotina, but had counterparts in Ascomycota or in Ascomycota and in Basidiomycota (131 in the Pezizomycotina subphylum, of which 45 were also present in the basidiomycetes). We call this group of genes set A (Supplementary Data 4). The other 122 genes were associated with a homolog in S. cerevisiae which, in contrast, presented a phylogeny that followed that of the species tree. We denote this second group of genes as set B (Supplementary Data 4).
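The partition of the discordant genes into HGT candidates, set A and set B can be verified with simple bookkeeping. The sketch below uses only the counts given in the text; the variable and function names are hypothetical, and the code is an illustrative consistency check rather than part of the original pipeline.

```python
# Counts quoted in the text (G. candidum CLIB 918).
TOTAL_GENES = 6804
DISCORDANT = 280            # genes with discordant phylogenies
HGT_PEZIZOMYCOTINA = 16     # putative transfers from Pezizomycotina
HGT_BASIDIOMYCOTA = 1       # putative transfer from Basidiomycota
SET_A = 141                 # no ortholog within the Saccharomycotina
SET_B = 122                 # S. cerevisiae homolog follows the species tree

hgt_total = HGT_PEZIZOMYCOTINA + HGT_BASIDIOMYCOTA
non_hgt = DISCORDANT - hgt_total

# Sets A and B should together account for every non-HGT discordant gene.
assert non_hgt == SET_A + SET_B


def percent_of_genome(count: int, total: int = TOTAL_GENES) -> float:
    """Express a gene count as a percentage of all protein-coding genes."""
    return 100 * count / total


print(f"HGT candidates: {hgt_total}")
print(f"Non-HGT discordant genes: {non_hgt}")
print(f"Discordant genes: {percent_of_genome(DISCORDANT):.1f}% of the genome")
print(f"Sets A and B combined: {percent_of_genome(non_hgt):.1f}% of the genome")
```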

In order to elucidate the origins and history of these genes of discordant phylogeny, we compared their characteristics with those that would be expected of horizontally-transferred genes. In most cases of HGT described in yeasts, the genes involved were exclusively clustered and had resulted from introgressions13,52,53. In filamentous fungi, HGT affects few single genes, but mostly larger regions of DNA, typically containing functionally related groups of genes54. In contrast, the set A and B G. candidum genes were found to be scattered through the genome sequence and did not cluster together as part of larger regions of transferred DNA (Fig. 3). In addition, these genes were distributed in the scaffolds independently of functional class.

Figure 3

Distribution of the phylogenetically discordant sets A and B genes on the five largest scaffolds of the G. candidum genome.

Scaffolds are represented as horizontal bars, numbered at the left and red lines show the position of SRAGs. The scale indicates gene number.

HGT can usually be detected because the phylogenetic position of the transferred genes with respect to homologs in related species differs from that of the other genes within the genome. Patristic distances (i.e. sum of branch lengths separating two tree nodes) between each G. candidum gene and their counterparts in the Pezizomycotina species were calculated from the phylome. Figure 4 presents the normalized patristic distances of the G. candidum genes, including the set A genes, the set B genes, all the G. candidum genes and the hypothetical HGT genes, from their closest Pezizomycotina orthologs. This analysis shows that the genes showing discordant phylogenies, both set A and set B, are not distinguishable from the entire gene complement of G. candidum in terms of their distances to Pezizomycotina orthologs. On the other hand, the normalized patristic distance between the HGT genes and their Pezizomycotina orthologs is clearly reduced. Genes originating from lateral transfers would be expected to display a reduced distance from their Pezizomycotina orthologs, since they are more or less recently diverged. The fact that distances between Pezizomycotina and set A and set B genes are not different from distances between Pezizomycotina and the G. candidum genes rules out the possibility that the set A and B genes were the result of HGT.
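To make the notion of patristic distance concrete, the following Python sketch computes the sum of branch lengths separating two tips of a small rooted tree. The tree, its node names and its branch lengths are hypothetical and chosen only for illustration; the normalization step applied in the analysis above is not reproduced, since its details are not given in this section.

```python
from typing import Dict, Optional, Tuple

# Toy rooted tree stored as child -> (parent, branch length to parent).
# All names and lengths are hypothetical.
Tree = Dict[str, Tuple[Optional[str], float]]

TREE: Tree = {
    "root": (None, 0.0),
    "pezizo_anc": ("root", 0.30),
    "saccharo_anc": ("root", 0.40),
    "pezizo_gene": ("pezizo_anc", 0.20),
    "geca_gene": ("saccharo_anc", 0.50),
    "scer_gene": ("saccharo_anc", 0.60),
}


def distances_to_ancestors(tree: Tree, node: str) -> Dict[str, float]:
    """Map `node` and each of its ancestors to their distance from `node`."""
    distances: Dict[str, float] = {}
    total = 0.0
    current: Optional[str] = node
    while current is not None:
        distances[current] = total
        parent, length = tree[current]
        total += length
        current = parent
    return distances


def patristic_distance(tree: Tree, a: str, b: str) -> float:
    """Sum of branch lengths on the path joining nodes `a` and `b`."""
    up_from_a = distances_to_ancestors(tree, a)
    total = 0.0
    current = b
    # Climb from b until reaching a node that is also an ancestor of a.
    while current not in up_from_a:
        parent, length = tree[current]
        total += length
        current = parent
    return total + up_from_a[current]


print(patristic_distance(TREE, "geca_gene", "pezizo_gene"))
print(patristic_distance(TREE, "geca_gene", "scer_gene"))
```

In this toy tree the path between the two Saccharomycotina tips passes only through their shared ancestor, whereas the path to the Pezizomycotina tip passes through the root, so the second distance is the shorter one. A recently transferred gene would instead sit close to its donor lineage, which is the signal examined in Figure 4.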

Figure 4

Phylogenetic distance of HGT and sets A and B genes from G. candidum to Pezizomycotina.

Normalized distances between each G. candidum gene and its closest ortholog in the Pezizomycotina are represented as box plots. The graphs show the maximum, minimum and median values and the first and third quartiles. The points at the bottom of the “All gene trees” box plot are outliers, whose phylogenetic distance from the traced box is greater than 1.5 times the interquartile distance.
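The outlier criterion used in the box plots (points lying more than 1.5 times the interquartile distance beyond the box) can be sketched as follows. The distance values are invented for illustration, and the quartile convention (the "inclusive" method of Python's statistics module) is an assumption; plotting software may use a slightly different definition.

```python
import statistics
from typing import List, Tuple


def box_fences(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return the lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Return the values lying outside the fences."""
    low, high = box_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical normalized distances, with one unusually short value.
distances = [0.1, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
outliers = find_outliers(distances)
print(outliers)
```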

For all these reasons, it seems highly unlikely that the genes of sets A and B result from HGT events. Rather, a more plausible explanation considering the above observations would be that they have been specifically retained during the radiation that followed the separation of the Pezizomycotina and Saccharomycotina. We therefore propose to designate this type of gene as a Specifically Retained Ancestral Gene (SRAG). Figure 5 presents the proposed scheme leading to the occurrence of SRAGs in a present-day yeast species such as G. candidum.

Figure 5

Schematic representation of the origin of SRAGs.

(a) The hypothetical fate of a gene transmitted vertically to the Pezizomycotina and the Saccharomycotina lineages from the Ascomycota ancestor is represented by a continuous green line. The dotted line indicates the lineages in which the gene is lost. (b) This results in a situation where the gene is found in the Pezizomycotina lineage and only in G. candidum, where it has been retained (set A genes). (c) Transmission of members of a duplicated gene family in the Ascomycota ancestor to the Pezizomycotina and the Saccharomycotina lineages (set B genes). The green line indicates that one paralog has been lost in the entire Saccharomycotina lineage, except in G. candidum where it has been retained (similarly to (a) and (b)). The black line indicates that the second paralog has been transmitted to the Saccharomycotina lineage. Whereas only one paralog is present in the Saccharomycotina, both paralogs are present in G. candidum.

The expression of genes with a discordant phylogeny was compared to that of the rest of the genes using data from high-throughput RNA sequencing. We observed that the overall expression level of set A genes was reduced compared to the rest of the genes in the genome (1.4-fold reduction, P < 10⁻⁷). The overall expression of set B genes was not significantly different from that of the other genes (P = 0.84) (Table 3; Fig. 6). The reduced expression of set A may be due to a higher specificity of the genes in this set, including lignocellulolytic enzymes and a number of transcription factors, which might not be expressed under the chosen laboratory growth conditions.

Table 3 Gene expression of SRAGs in G. candidum.
Figure 6

Expression of genes with discordant phylogenies.

The distribution of the RNA sequence reads was plotted for the genes of set A, set B and the whole genome. The numbers of genes in sets A and B are shown multiplied by a factor of 10 to facilitate comparison.

SRAGs are a common feature in yeasts

We examined other well-characterized yeast genomes to investigate whether such genes could also be found. To this end, we reconstructed the phylomes of three other species: S. cerevisiae, Debaryomyces hansenii and Y. lipolytica. A search in PhylomeDB for genes with discordant phylogeny permitted the identification of putative SRAGs in these species. Again we detected genes with orthologs in Pezizomycotina only, as well as genes with discordant phylogeny which were present in the Pezizomycotina and absent from a majority of Saccharomycotina (Supplementary Data 5).

S. cerevisiae was found to have 15 genes presenting discordant phylogenies (Table 4, see www.phylomedb.org/phylome_236). These S. cerevisiae genes are involved in a variety of pathways (respiration, cell wall, post-transcriptional quality control, protein translation, sterol uptake); two of them are of unknown function. Interestingly, none of these 15 genes is essential for growth under normal conditions (PDR11, a sterol uptake protein, is however required for anaerobic growth, where sterol biosynthesis is compromised55); they are all expressed under conditions that are either unusual or stressful for S. cerevisiae (http://www.yeastgenome.org). The IRC7 gene, encoding a putative cystathionine beta-lyase, was proposed to be the result of HGT, originating in bacteria56; however, this gene proved unambiguously closer to Pezizomycotina than to bacterial counterparts (data not shown).

Table 4 List of SRAGs in S. cerevisiae.

Functional analysis of the SRAGs in G. candidum, D. hansenii and Y. lipolytica revealed that they are associated with diverse functional classes and are responsible for at least part of the specificity of each species, although the functional classes themselves are shared between these yeasts. A functional classification of the SRAGs highlighted differences between D. hansenii and the two basal yeasts G. candidum and Y. lipolytica (Fig. 7).

Figure 7

Functional distribution of SRAGs in three yeast species.

The SRAGs of D. hansenii, Y. lipolytica and G. candidum, as listed in Supplementary Data 6, were assigned to functional categories. For each species, the distribution of SRAGs by category is expressed as a percentage of the total number of SRAGs. Orange, G. candidum; blue, D. hansenii; green, Y. lipolytica.

The halophilic and psychrophilic yeast D. hansenii is found in environments such as seawater, brine and salted foods and is a major component of cheese surface microbiota57. The functional classes overrepresented in the SRAG gene set are those of Amino acid metabolism (13 genes), Carbon metabolism (with seven SRAGs involved in glycosidic bond hydrolysis) and Transport (with nine SRAGs involved in sugar transport). There are also five extracellular lipases that hydrolyze triacylglycerols in this lipid-rich environment to fatty acids and glycerol, the latter being the main compatible osmolyte accumulated by D. hansenii as an osmoprotectant on the highly saline cheese surface58. Thus, D. hansenii SRAGs are representative of functions needed to grow under these conditions.

Y. lipolytica has long been a focus of research for its lipid metabolism and its capacity for protein secretion59,60. It is encountered on the surface of ripened cheese61,62. The functions that are over-represented in Y. lipolytica SRAGs are Lipid metabolism (10 genes) and Proteolysis (20 genes, of which 10 encode extracellular proteases). Y. lipolytica and G. candidum are both dimorphic yeasts, whose transition from budding to hyphal growth involves complex subcellular processes. We built an inventory of the Y. lipolytica and G. candidum genes homologous to N. crassa genes necessary for filamentous growth63 (Supplementary Table S8). Among the 55 Y. lipolytica genes and 70 G. candidum genes in the inventory, 29 and 37 SRAGs were found, respectively. Thus, over 50% of the Y. lipolytica and G. candidum genes necessary for filamentous growth are SRAGs. This contrasts with the proportion of SRAGs in the whole genomes (3.7% and 3.9% in Y. lipolytica and G. candidum, respectively) and highlights the strong association of SRAGs with filamentous growth.
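The contrast between the SRAG share among filamentous growth genes and the genome-wide SRAG share can be expressed as a simple fold enrichment. The sketch below uses only the figures quoted above; the data structure and function name are hypothetical, and no statistical test is implied.

```python
# Figures quoted in the text: (SRAGs in inventory, inventory size,
# genome-wide SRAG percentage) for each species.
FILAMENTOUS_GROWTH = {
    "Y. lipolytica": (29, 55, 3.7),
    "G. candidum": (37, 70, 3.9),
}


def fold_enrichment(srags: int, inventory: int, genome_percent: float) -> float:
    """Ratio of the SRAG share in the inventory to the genome-wide share."""
    inventory_percent = 100 * srags / inventory
    return inventory_percent / genome_percent


results = {}
for species, (srags, inventory, genome_percent) in FILAMENTOUS_GROWTH.items():
    share = 100 * srags / inventory
    results[species] = (share, fold_enrichment(srags, inventory, genome_percent))
    print(f"{species}: {share:.1f}% of inventory genes are SRAGs, "
          f"{results[species][1]:.1f}-fold over the genome-wide share")
```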

In the case of G. candidum, with the exception of functions related to filamentous growth, the presence of SRAGs in the various functional categories is generally low, varying from 1 to 4%. The notably large proportion of G. candidum SRAGs in the Transcription regulation category (11%) is an indication that the reactivity and adaptability of this yeast to environmental changes may be carried by SRAGs. Our analysis of the functional classification of these SRAGs highlighted the specific properties of these yeasts according to their natural morphology and ecological niche. SRAGs contribute to the phenotypic specificity of these yeasts. An over-representation of the Transcription regulation and Transport categories is expected in wild yeasts, as they have to adapt to various environments by being able to use a wide variety of nutrients and to reorganize gene expression in response to environmental changes. We also noted that each of the three yeasts examined, D. hansenii, Y. lipolytica and G. candidum, possesses SRAGs associated with lipid metabolism, which may be linked to their presence in dairy products. It is important to note that the genes in the “Lipid metabolism” category in all three species are phylogenetically unrelated, suggesting parallel evolution. Indeed, the same is true for most of the SRAGs, suggesting that these genes are interesting candidates for the analysis of species-specific technological properties.

Discussion

The genome sequence of G. candidum permits new insights into the genome structure of yeasts and their evolution. In particular, its relatively basal position among the Saccharomycotina and its unusually large genome for a yeast make it ideal for investigating the ancestral genomic repertoire of this subphylum. Comparative genomics between G. candidum and other Saccharomycotina yeasts demonstrated the existence of groups of genes specific to G. candidum and greatly amplified gene families which appear to contribute to the known phenotypic specificity of this yeast, while the significance of others, such as the large repertoire of carbohydrate hydrolases otherwise only found in filamentous fungi, can only be hypothesized. We were interested in whether the origins of these genes specifically present in G. candidum could be explained by HGT or another mechanism and therefore undertook further analyses based on individual gene phylogenies. This brought to light a larger group of genes with discordant phylogenies, of which some had no homologs within the Saccharomycotina. When this analysis was extended to other species representative of different lineages of the yeast phylogenetic tree, it was seen that the presence of such genes is common to all the yeasts examined. We propose that such genes have been specifically retained after the split between Pezizomycotina and Saccharomycotina and during the subsequent genome reduction of the latter clade; we would therefore denote them Specifically Retained Ancestral Genes (SRAGs). Several lines of evidence argue for this explanation and against the simplest hypothesis, acquisition through HGT, for the presence of these genes in G. candidum: (i) The large evolutionary distance between the putative participants, similar to that of clear vertically-inherited genes, makes HGT unlikely. HGT between eukaryotes usually results from interspecific or intergeneric hybridization64,65,66, but, to the best of our knowledge (and excepting the case of HGT that we describe here with GECA13s02485g), inter-subphylum transfers between filamentous fungi and yeasts have not been documented. (ii) The phylogenetic distances separating the SRAGs from their orthologs were similar to those separating the other genes from their respective orthologs, whereas a hallmark of HGT is the phylogenetic closeness of the orthologs thus transferred. This is illustrated by the position of SRAGs being outside the Pezizomycotina clade in the phylogenetic trees. (iii) The number and relative frequencies of SRAGs present in the different species argue for specific retention rather than HGT. Indeed, numerous SRAGs were found in each of the four yeasts examined (almost 4% of gene content in the case of G. candidum). It is unlikely that HGT events would occur at such a frequency. Furthermore, the distribution of the numbers of SRAGs in the different yeasts is intriguing: of the species studied here, G. candidum, Y. lipolytica and D. hansenii possess a higher number of SRAGs than does S. cerevisiae (263, 230 and 111, respectively, compared to 15). Whereas we might expect a fairly constant frequency of genes with discordant phylogenies if their presence were due to HGT, there is a clear difference in their number, which may be due to their different evolutionary histories. This variability is also seen in the recent detection, in B. adeninivorans17, of 121 genes with orthologs only in Pezizomycotina and, in Zygosaccharomyces bailii67, of 27 genes with similarity to filamentous fungal genes or highly divergent from yeasts, though the latter group attributed these to HGT.

Lineage-specific gene retention has been described following mitochondrial endosymbiosis in crown group eukaryotes68, where the co-occurrence of genes could be used to predict their functional links. Lineage-specific losses of genes associated with gain or loss of function have been reported in widely separated lineages6,69,70,71,72. In addition, a number of metabolic pathways present in the Pezizomycotina are not found in Saccharomycotina73,74,75. The latter authors observed a differential presence or absence of peroxisomal and non-peroxisomal pathways of β-oxidation in some yeasts and fungi and proposed that the pathway had been duplicated in the ancestor and differentially lost or retained in the studied species. We expand this observation by a global comparison of four yeast genomes within the same subphylum. We define two categories of G. candidum-specific genes, based on their distributions:

1) One group of genes has homologs within the Saccharomycotina, but is derived from the paralog in the common ancestor of Saccharomycotina and Pezizomycotina that was lost by the other yeasts. Lineage-specific gene retention following Whole Genome Duplication (WGD) is well known in organisms including Saccharomyces species32, filamentous fungi76, alveolates77, seed plants78 and vertebrates79. However, no such WGD has been described in the ascomycete ancestor, so the above-mentioned paralogs have probably resulted from gene duplications in the ancestor. This situation corresponds to that of the beta-oxidation genes described75; G. candidum has retained one of the paralogs, while the other Saccharomycotina species kept the other (Fig. 5). In some cases G. candidum has retained both genes of the ancestral duplication, for instance some snRNPs.

2) In G. candidum, in addition to the cases of gene retention after ancestral gene duplications, we discovered a second set of 141 genes that were present in single copy in the Ascomycota ancestor and were lost in the other Saccharomycotina species. Cases of specifically retained genes not derived from genomic duplication are rarely documented, although some have been proposed to play an important role in species differentiation80,81,82. Our analysis suggests, however, that this may be an important mechanism of generation of biodiversity, at least in the yeast subphylum studied.

The above discussion is limited to genes that were unique in each studied yeast species, but we also noted the existence of SRAGs present in two or more species. Further work on this class of SRAGs to determine their distribution within the subphylum, will certainly greatly increase our understanding of the evolution and biodiversity of the yeasts.

Thus, evolution by differential gene retention is widespread in a broad but well-defined clade, the Saccharomycotina. The distribution of SRAGs in distantly related yeast species argues for a mechanism of sustained loss throughout the yeast tree, permitting adaptation of yeast species to various ecological niches and resulting in the genome reduction characteristic of yeasts, rather than a massive genome contraction in one branch of the Ascomycota.

Saccharomycotina yeasts use a combination of various mechanisms such as WGD4,6,9,17,83, gene duplication6,83 and HGT6,36,56,84,85,86,87, which contribute to generating biodiversity to a variable extent. To date, the major genetic mechanisms proposed to affect adaptation of fungi are duplication or gene amplification followed by neospecialization28,32,33 and HGT; for example, the bacterial nitrate assimilation cluster is suggested to have contributed to the success of the Dikarya on land88, and the acquisition of genes is suggested to have increased the efficiency of alcoholic fermentation by S. cerevisiae53,89. Here we highlight the importance of another mechanism: the yeasts that we have analyzed, and probably others17,67, contain different proportions of SRAGs, which are associated with biochemical or growth characteristics of the species concerned, thus contributing to the great biodiversity shown by this group of organisms.

Material and methods

Strains

The sequenced G. candidum strain was isolated by Micheline Gueguen (University of Caen) from Pont-L’Evêque cheese in Normandy (France) in 1975. It has been shown to produce compounds that inhibit the growth of Listeria and has been extensively studied90,91,92,93,94,95. The strains used in this study, CLIB 918 (=ATCC 204307), CLIB 1368NT (=CBS 615.84NT) and 61 G. candidum isolates are preserved at the CIRM-Levures (http://www6.inra.fr/cirm/Levures). They were routinely propagated on complete medium (YPD: yeast extract 10 g/L, peptone 20 g/L, glucose 20 g/L) at 28 °C.

Preparation of DNA and RNA

DNA was extracted as previously described (Jacques et al., 2009) from strain CLIB 918 grown in YNBN5000 (1.7 g/L Yeast Nitrogen Base, 20 g/L glucose, 5 g/L ammonium sulfate) at 28 °C to increase the yeast-like form and promote cell lysis. For RNA preparation, strain CLIB 918 was grown at 28 °C with agitation on three different media, i.e. complete medium (YPD), minimal medium (YNBN5000) and Synthetic Cheese Medium (SCM, described in96), to maximize the diversity of gene expression. Total RNAs were extracted using the method described by Mansour et al.97 from cultures grown in the three different conditions and then pooled.

454 library preparation and sequencing

The single 454 library was constructed on genomic DNA (500 ng) according to the Roche standard procedure using RL adaptors (GS FLX Titanium Rapid Library Preparation Kit, Roche Diagnostic, USA). The 8 kb mate pair library was constructed following the Roche 454 protocol. Briefly, 15 μg of genomic DNA was sheared to about 8 kb using a HydroShear instrument. Fragments were end-repaired and extremities were ligated with 454 circularization adaptors. After gel size selection of 8 kb bands and fill-in, DNA fragments were circularized by Cre recombinase and the remaining linear DNA was digested by Plasmid Safe ATP-dependent DNAse (Epicentre) and exonuclease I. Circular DNA was refragmented by nebulization. Fragments were end-repaired and ligated with library adaptors used for downstream processes. The mate pair library was amplified and purified. Both single and mate pair libraries were isolated, then bound to capture beads and amplified in an oil emulsion (emPCR). They were then sequenced using 1/2 Pico Titer Plate on a 454 GSFlx instrument with Titanium chemistry (Roche Diagnostic, USA) according to the manufacturer's protocol.

Illumina GA library preparation and sequencing

The genomic DNA and cDNA were sonicated separately to a 150- to 1000-bp size-range using the Covaris E210 (Covaris Inc., MA). Fragments were end-repaired then 3′-adenylated and Illumina adapters were added using NEBNext Sample Reagent Set (New England Biolabs). Ligation products were purified and DNA fragments (>200 bp) were PCR-amplified using Illumina adapter-specific primers. After library profile analysis on an Agilent 2100 Bioanalyzer (Agilent Technologies, USA) for genomic DNA and Qubit quantification for cDNA, the respective libraries were sequenced using 76 base-length read chemistry in a single or paired-end flow cell on the Illumina GAIIx (Illumina, USA).

Genome assembly and automatic error corrections with Solexa/Illumina reads

All 454 reads were assembled with Newbler version 2.3. From the initial 3,322,644 reads, 92.2% were assembled, yielding 1688 contigs that were linked into 134 scaffolds. The contig N50 (the contig size cut-off above which 50% of the total length of the draft sequence assembly is included) was 26.7 kb and the scaffold N50 was 1.159 Mb. Cumulative scaffold size was 24.865 Mb. Sequence quality of scaffolds from the Newbler assembly was improved as described in Aury et al.98 by automatic error correction with Solexa/Illumina reads which have a different bias in error type compared to 454 reads. Following the correction process, we fixed 3415 mismatches and 6559 indels.
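The N50 definition given above can be illustrated with a short sketch. The function name and the toy contig lengths are illustrative assumptions; this is not the Newbler implementation, only the standard calculation applied to a list of sequence lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Lengths are sorted from longest to shortest and accumulated; the N50 is
    the length of the sequence at which the running total first reaches
    half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("at least one sequence length is required")


# Toy example (hypothetical lengths, not data from the assembly).
toy_contigs = [2, 4, 6, 8, 10]
print(n50(toy_contigs))
```

Applied to the real contig and scaffold length lists, the same calculation would be expected to reproduce the contig and scaffold N50 values reported above.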

Genome annotation

Gene models were predicted using the Eugene pipeline99 on the URGI platform (http://urgi.versailles.inra.fr/). Eugene relies on a combination of ab initio gene predictions (Eugene_IMM, SpliceMachine100 and Fgenesh http://www.softberry.com/berry.phtml) and similarity evidence (BlastX against Swissprot and Trembl). All the gene models were then manually curated with the help of RNAseq data previously assembled with SOAP on the ORCAE platform (http://bioinformatics.psb.ugent.be/orcae/101) and visualized on GenomeView (http://genomeview.org102) and Artemis (http://www.sanger.ac.uk/resources/software/artemis/). All regions potentially coding for peptides of over 100 amino acids (aa) were annotated. CDS of less than 100 aa were only annotated when they presented sequence similarity with known proteins and/or were associated with spliceosomal introns and were represented in the RNAseq library. The genes encoding tRNA were predicted using tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/) with default parameters. The protein coding genes were first functionally annotated by comparison with the S. cerevisiae genome. Genes that failed to show sufficient sequence similarity with S. cerevisiae genes were annotated by comparison against other available yeast genomes, filamentous fungal genomes and Swissprot; they received the annotation “conserved hypothetical protein” when their sequence showed similarity with that of proteins from several species. When a functional annotation was available in the databanks, it was associated with the “conserved hypothetical protein” annotation. The nomenclature for naming genes is the following: species name GECA, scaffold number from 1 to 27 and 32, s for scaffold, gene number with an incrementing step of 11, g for protein coding gene (for example, GECA01s00065g encodes a protein similar to Saccharomyces cerevisiae YNR018W), r for RNA coding gene (for example, GECA01s00238r encodes tRNA-Asp).
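The gene nomenclature described above can be decomposed programmatically. The following is a minimal sketch; the function name is hypothetical, and the field widths (two digits for the scaffold, five for the gene number) are an assumption inferred only from the two example identifiers given in the text.

```python
import re

# Assumed layout, inferred from GECA01s00065g and GECA01s00238r:
# species code, 2-digit scaffold, "s", 5-digit gene number, type letter.
GENE_ID = re.compile(r"^(GECA)(\d{2})s(\d{5})([gr])$")

GENE_TYPES = {"g": "protein-coding", "r": "RNA-coding"}


def parse_gene_id(identifier):
    """Split an identifier into (species, scaffold, gene number, gene type)."""
    match = GENE_ID.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized gene identifier: {identifier!r}")
    species, scaffold, number, kind = match.groups()
    return species, int(scaffold), int(number), GENE_TYPES[kind]


print(parse_gene_id("GECA01s00065g"))
print(parse_gene_id("GECA01s00238r"))
```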

Assembly and annotation of the mitochondrial genome

A total of four mtDNA contigs were identified. Ordering of contigs and junction was performed using PCR. Protein coding genes and ribosomal genes were detected using blastX against the available Saccharomycotina mtDNAs. tRNA genes were detected using tRNAscan-SE with default parameters and the mitochondrial search model (http://lowelab.ucsc.edu/tRNAscan-SE/).

Phylogenomic analysis

Orthologs were first selected using blast with a P-value of 10⁻⁵ against proteomes of strains listed in Supplementary Table S9. Single-copy G. candidum genes were verified using ORCAE and homology was verified using Fungipath103. Sequences were concatenated and were aligned using MUSCLE v3.8104 with default settings. Alignments were curated using GBlocks v0.91b105. Species trees were reconstructed using PhyML v2.4.4106 with the WAG model. Bootstrap analysis was used to obtain branch support. Trees were visualized with njplot107.

Synteny analysis

Conserved synteny blocks were defined using Synchro with default settings108. First, reciprocal blast hits were computed with a similarity threshold of 40% and a length ratio between the two protein sequences smaller than 1.3. Second, syntenic homologs that were not involved in reciprocal blast hits were added to the synteny blocks when they shared at least 30% similarity over at least 50% of their length.
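The first step above, the reciprocal-hit filter, can be sketched as follows. This is not the Synchro implementation; the function name, the input layout (best hit and percent similarity per protein) and the choice to apply the threshold to the lower of the two directional similarities are all illustrative assumptions.

```python
def reciprocal_best_hits(best_ab, best_ba, len_a, len_b,
                         min_similarity=40.0, max_length_ratio=1.3):
    """Return sorted (a, b) pairs that are mutual best hits and pass both filters.

    best_ab / best_ba map a protein to (best hit in the other genome, % similarity).
    len_a / len_b map proteins to their lengths in amino acids.
    """
    pairs = []
    for a, (b, sim_ab) in best_ab.items():
        back = best_ba.get(b)
        if back is None or back[0] != a:
            continue  # not reciprocal
        if min(sim_ab, back[1]) < min_similarity:
            continue  # below the 40% similarity threshold
        longer = max(len_a[a], len_b[b])
        shorter = min(len_a[a], len_b[b])
        if longer / shorter >= max_length_ratio:
            continue  # length ratio must be smaller than 1.3
        pairs.append((a, b))
    return sorted(pairs)


# Hypothetical toy data.
best_ab = {"a1": ("b1", 80.0), "a2": ("b2", 35.0),
           "a3": ("b3", 90.0), "a4": ("b1", 50.0)}
best_ba = {"b1": ("a1", 78.0), "b2": ("a2", 35.0), "b3": ("a3", 90.0)}
len_a = {"a1": 300, "a2": 200, "a3": 100, "a4": 300}
len_b = {"b1": 310, "b2": 200, "b3": 200}
print(reciprocal_best_hits(best_ab, best_ba, len_a, len_b))
```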

Phylome reconstruction

A phylome comprises the collection of phylogenetic trees for each gene encoded in a genome. We reconstructed the G. candidum phylome in the context of 21 additional fungal species ranging across the main dikarya groups, i.e. 10 Saccharomycotina, 8 Pezizomycotina, one Taphrinomycotina and two Basidiomycota (Supplementary Table S9). An automatic pipeline described previously was used to reconstruct the phylome109. This pipeline includes the standard tree reconstruction steps: homology search, multiple sequence alignment and finally reconstructing the maximum likelihood tree. The homology search was performed using a Smith-Waterman search for each gene (seed gene) in the G. candidum genome (seed genome) against the protein database that contained the proteomes of interest. Results were filtered to select only sequences with an e-value below 10⁻⁵ and a continuous overlap of 0.5. A maximum of 150 sequences for each protein were considered. Homologous sequences were then aligned using three different alignment algorithms: MUSCLE v3.8104, MAFFT v6.712b110 and kalign111. Alignments were performed in forward and reverse direction using the head-or-tail approach112 and the 6 resulting alignments were combined with M-COFFEE113. TrimAl v1.3114 was used to clean the alignment (consistency-score cut-off 0.1667, gap-score cut-off 0.9). To reconstruct maximum likelihood trees, an evolutionary model needed to be selected. This was done by reconstructing a neighbor joining tree for each alignment using BioNJ115. The likelihood of the resulting topology according to one of 7 different models (JTT, LG, WAG, Blosum62, MtREV, VT and Dayhoff) was computed. The model best fitting the data, as determined by the AIC criterion116, was used to derive ML trees using PhyML v3.0 with four rate categories and inferring invariant positions from the data117. Branch support was computed using an aLRT (approximate likelihood ratio test) based on a χ² distribution.
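The hit-filtering step of the homology search described above can be sketched as follows. The record layout and function name are hypothetical, and the overlap is assumed here to be the fraction of the seed protein covered by the aligned region; the pipeline itself may define it differently.

```python
def filter_hits(hits, seed_length, max_evalue=1e-5, min_overlap=0.5, max_hits=150):
    """Keep hits below the e-value cut-off with sufficient overlap.

    Each hit is a dict with keys "target", "evalue" and "aligned_length".
    Overlap is ASSUMED to mean aligned_length / seed_length.
    At most `max_hits` hits are returned, best e-value first.
    """
    kept = [
        hit for hit in hits
        if hit["evalue"] < max_evalue
        and hit["aligned_length"] / seed_length >= min_overlap
    ]
    kept.sort(key=lambda hit: hit["evalue"])
    return kept[:max_hits]


# Hypothetical toy hits for a 400-aa seed protein.
toy_hits = [
    {"target": "p1", "evalue": 1e-50, "aligned_length": 380},
    {"target": "p2", "evalue": 1e-3, "aligned_length": 390},   # e-value too high
    {"target": "p3", "evalue": 1e-20, "aligned_length": 120},  # overlap too short
    {"target": "p4", "evalue": 1e-80, "aligned_length": 200},
]
print([hit["target"] for hit in filter_hits(toy_hits, seed_length=400)])
```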
Three additional phylomes were reconstructed using the same proteome set but with different species as seeds: Saccharomyces cerevisiae, Y. lipolytica and Debaryomyces hansenii. The resulting trees and alignments are stored in phylomeDB (http://phylomedb.org) with phylome IDs 233 (G. candidum phylome), 234 (Y. lipolytica phylome), 235 (D. hansenii phylome) and 236 (S. cerevisiae phylome).

Species tree reconstruction

Proteins with a one-to-one orthology relationship to all the considered species were selected from the G. candidum phylome. The 302 protein alignments were concatenated into a multiple sequence alignment. The alignment was trimmed using trimAl v1.3114 to discard columns with more than 50% gaps (-gt 0.5 -cons 50). RAxML v8.0 was used to reconstruct the species tree118 using the PROTGAMMALG model (Supplementary Fig. S9). Additionally, a super-tree based species tree was derived from the G. candidum phylome using DupTree119.
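The gap-based trimming described above can be illustrated with a minimal sketch. It is not trimAl: only the removal of columns with more than 50% gaps is modeled, the minimum-conservation option is not, and the function name and toy alignment are illustrative assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds `max_gap_fraction`.

    `alignment` maps sequence names to aligned strings of equal length,
    with "-" as the gap character.
    """
    if not alignment:
        return {}
    sequences = list(alignment.values())
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    keep = [
        column for column in range(len(sequences[0]))
        if sum(seq[column] == "-" for seq in sequences) / len(sequences)
        <= max_gap_fraction
    ]
    return {name: "".join(seq[column] for column in keep)
            for name, seq in alignment.items()}


# Hypothetical toy alignment: the third column is 75% gaps and is removed.
toy_alignment = {"sp1": "AC-G", "sp2": "A--G", "sp3": "AT-G", "sp4": "-TCG"}
print(trim_gappy_columns(toy_alignment))
```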

Phylome analysis

Trees in the phylome were scanned using ETE v2109 to detect duplications that had occurred specifically in G. candidum by searching for clades that contained exclusively G. candidum sequences. Orthology and paralogy relations were inferred from the phylome trees using a species overlap algorithm120. Briefly, for each node in the tree, the algorithm tries to detect overlapping species at either side of the node. If there are overlapping species, the node is considered a duplication node and therefore the sequences are paralogs. If there are no overlapping species, then the node is considered a speciation node and sequences are orthologs. Finally, we used the phylome to assess phyletic distribution of genes, based on homology or orthology and selected genes that had only homologs in each of the following six clades: i) the family Saccharomycetaceae (S. cerevisiae, Zygosaccharomyces rouxii, Candida glabrata, Kluyveromyces lactis and Lachancea thermotolerans), ii) the Saccharomycetales incertae sedis clade (K. pastoris and O. angusta), iii) the CTG clade (D. hansenii and Clavispora lusitaniae), iv) other fungi (Ajellomyces capsulata, Aspergillus oryzae, Penicillium chrysogenum, Neurospora crassa, Cryptococcus neoformans, Ustilago maydis, Schizosaccharomyces pombe, Botrytis fuckeliana, Trichoderma reesei, Magnaporthe grisea and Mycosphaerella graminicola), v) Y. lipolytica, or vi) G. candidum. The same analysis was performed using the orthology predictions obtained from the phylomes (see above).
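The species overlap rule described above can be sketched on a toy tree. The sketch does not use ETE; the tree encoding (nested two-element tuples with "species|gene" leaf labels), the function names and the example sequences are all illustrative assumptions.

```python
def species_of(leaf):
    """Extract the species code from a leaf labeled 'species|gene'."""
    return leaf.split("|")[0]


def classify_nodes(tree):
    """Label every internal node of a binary tree as duplication or speciation.

    A node is a duplication if the species sets of its two child clades
    overlap, and a speciation otherwise. Returns a list of
    (event, left_leaves, right_leaves) tuples in post-order.
    """
    events = []

    def walk(node):
        if isinstance(node, str):  # a leaf
            return [node]
        left_child, right_child = node
        left, right = walk(left_child), walk(right_child)
        shared = {species_of(x) for x in left} & {species_of(x) for x in right}
        events.append(("duplication" if shared else "speciation", left, right))
        return left + right

    walk(tree)
    return events


# Hypothetical gene tree: Gc appears on both sides of one node.
toy_tree = ((("Gc|g1", "Yl|y1"), ("Gc|g2", "Sc|s1")), "Sp|p1")
for event, left, right in classify_nodes(toy_tree):
    print(event, left, right)
```

Sequences on opposite sides of a duplication node (here Gc|g1 and Gc|g2, for example) would be treated as paralogs, and sequences on opposite sides of a speciation node as orthologs.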

In order to calculate the patristic distances, trees that contained at least one ortholog in Pezizomycotina and at least one in any of the outgroup species (S. pombe, U. maydis and C. neoformans) were selected. For each of those trees the patristic distance was calculated between the G. candidum protein and its closest Pezizomycotina ortholog. This distance was then normalized by dividing it by the patristic distance between the same G. candidum sequence and its farthest orthologous outgroup.
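The normalization described above amounts to a simple ratio, sketched below. The patristic distances are assumed to have been extracted from each tree beforehand; the function name and the toy values are hypothetical.

```python
def normalized_patristic_distance(pezizomycotina_distances, outgroup_distances):
    """Normalize the distance to the closest Pezizomycotina ortholog.

    Both arguments map ortholog names to their patristic distance from the
    G. candidum protein. The closest Pezizomycotina distance is divided by
    the distance to the farthest outgroup ortholog.
    """
    if not pezizomycotina_distances or not outgroup_distances:
        raise ValueError("at least one ortholog is required in each group")
    closest = min(pezizomycotina_distances.values())
    farthest = max(outgroup_distances.values())
    return closest / farthest


# Hypothetical distances for one tree.
pezizo = {"Ncra_1": 0.8, "Aory_1": 1.1}
outgroups = {"Spom_1": 1.6, "Umay_1": 2.0}
print(normalized_patristic_distance(pezizo, outgroups))
```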

Gene expression analysis

Available RNAseq reads were mapped against the produced reference genome using the GSNAP software121 with default parameters. The resulting alignment files were transformed into raw read counts for each gene making use of htseq-count122 and the predicted G. candidum gene-models. To obtain the final expression values the raw read counts were normalized for CDS length. Afterwards, subsets of genes (and expression values) were created based on whether the gene has an ortholog in other Saccharomycotina (141 genes) or not (122 genes). The expression of the genes in these two subsets was then compared to the expression of all other genes in the genome. To investigate the potential difference in expression between the gene sets a Wilcoxon rank-sum test was applied.
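The length normalization and rank-sum comparison described above can be sketched with the standard library alone. The function names and toy values are assumptions, and the z-score uses the plain normal approximation without a tie correction; a statistics package would normally be used for real data.

```python
import math


def length_normalize(raw_counts, cds_lengths):
    """Divide each gene's raw read count by its CDS length."""
    return {gene: raw_counts[gene] / cds_lengths[gene] for gene in raw_counts}


def rank_sum(sample, background):
    """Wilcoxon rank-sum statistic W for `sample` and its normal-approximation z.

    Tied values receive mid-ranks; the variance is NOT corrected for ties.
    """
    pooled = sorted([(v, 0) for v in sample] + [(v, 1) for v in background])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(sample), len(background)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return w, (w - mean) / sd


# Hypothetical toy values.
expression = length_normalize({"g1": 100, "g2": 50}, {"g1": 1000, "g2": 250})
w, z = rank_sum([1, 2, 3], [4, 5, 6, 7])
print(expression, w, round(z, 3))
```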

Additional Information

Accession codes: Geotrichum candidum genome sequence data have been deposited at EMBL under the accession number PRJEB4557; the mitochondrial genome of strain CLIB 918 and the MATB gene of strain CBS 615.84 were deposited under accession numbers HG530139 and HF558449, respectively.

How to cite this article: Morel, G. et al. Differential gene retention as an evolutionary mechanism to generate biodiversity and adaptation in yeasts. Sci. Rep. 5, 11571; doi: 10.1038/srep11571 (2015).