Analyses of transcriptomes and the first complete genome of Leucocalocybe mongolica provide new insights into phylogenetic relationships and conservation

In this study, we report a de novo assembly of the first high-quality genome for a wild mushroom species Leucocalocybe mongolica (LM). We performed high-throughput transcriptome sequencing to analyze the genetic basis for the life history of LM. Our results show that the genome size of LM is 46.0 Mb, including 26 contigs with a contig N50 size of 3.6 Mb. In total, we predicted 11,599 protein-coding genes, of which 65.7% (7630) could be aligned with high confidence to annotated homologous genes in other species. We performed phylogenetic analyses using genes form 3269 single-copy gene families and showed support for distinguishing LM from the genus Tricholoma (L.) P.Kumm., in which it is sometimes circumscribed. We believe that one reason for limited wild occurrences of LM may be the loss of key metabolic genes, especially carbohydrate-active enzymes (CAZymes), based on comparisons with other closely related species. The results of our transcriptome analyses between vegetative (mycelia) and reproductive (fruiting bodies) organs indicated that changes in gene expression among some key CAZyme genes may help to determine the switch from asexual to sexual reproduction. Taken together, our genomic and transcriptome data for LM comprise a valuable resource for both understanding the evolutionary and life history of this species.

www.nature.com/scientificreports/ to breakthroughs in conservation efforts for LM, but genomic resources are currently largely unavailable for this species. Therefore, in this work, we obtained a high-quality genome of LM using de novo assembly. We utilized the resulting dataset to reconstruct a phylogeny of LM and found that it is distinct from Tricholoma. Moreover, based on comparative genomic analyses, we detected deletion of several essential metabolic genes that may help to explain its rarity in the wild. Nevertheless, the composition of CAZymes in LM suggests potential domestications. Additionally, we observed that expression differences of oxidoreductase genes in LM appear to promote the transition from asexual to sexual reproduction. Taken together, these new genomic and transcriptomic resources of LM resulting from our study comprise a framework for future studies on taxonomy, genome function, and conservation of this species. Throughout, we follow the NCBI taxonomic database (REF) for taxonomic names except as noted.

Results
Genome sequence analysis. We sequenced genomic DNA from LM using the PacBio SMRT Sequel and Illumina platforms, and generated a total of ~ 228 × coverage (10.5 Gb, PacBio platform) and ~ 246 × coverage (11.35 Gb, Illumina platform) of high-quality data respectively ( Table 1). The size of the assembled sequence of LM is 45.98 Mb, including 26 contigs with an N50 of 3.63 Mb and a GC content of 47.06% (Table 1). K-mer analysis based on Illumina reads indicated that the genome size was 47.69 Mb when K-mer was set at 21, while that was 44.67 Mb when K-mer was 35 ( Supplementary Fig. S1), both of which were very close to the total length (45.98 Mb) of assembled genome sequence by using PacBio sequencing. In addition, the heterozygous ratio and the repetitive sequences of the LM genome were estimated to be 0.18% and 26.89%, respectively. We identified 95.2% (1271/1335) of well-known conserved fungal orthologs in this LM assembly using BUSCO 28 , suggesting a high-quality assembled mononuclear genome (Supplementary Table S1). Detailed genome statistics are shown in Supplementary Tables S1-S3 and Supplementary Figs. S2-S3. We performed genome annotation by de novo prediction and homology-based searches as well as a cDNA-based search using the transcriptome data  www.nature.com/scientificreports/ generated in this study. In total, we predicted 11,599 protein-coding genes, which had a total length of 1835 bp and 6.3 exons on average ( Table 2). Of these, we annotated 7630 (65.8%) genes (Supplementary Table S4) and functionally annotated 6855 (59.1%) genes according to the Gene Ontology (GO) database ( Supplementary  Fig. S4). Additionally, we performed annotations of repetitive sequences and non-coding RNAs (Table S5- (Fig. 2c).
We compared the genome of LM to five closely related species in Tricholomataceae: Lepista nuda (Bull.) Cooke (LN), Tricholoma matsutake (S. Ito & Imai) Singer (TM), Clitocybe gibba (Pers.) Harmaja (CG), Laccaria bicolor (Maire) P.D.Orton (LB), and L. amethystina Cooke (LA), and found that LM has the smallest number of genes among the sampled species (Table 2). Among all the six species including LM, we identified 15,817 gene families, of which 4389 occur in all the species and 3269 of them are single-copy gene families, which were used as gene markers to perform phylogenetic analyses in this study. However, LM exhibits a very low level of genetic differentiation with only 95 unique gene families compared to 221 in LN, which has the next lowest, and 989 in TM, which has the highest ( Fig. 2a and Supplementary Table S7). This is particularly surprising given that TM is sister to a clade of LM and LN, thus highlighting the limited divergence in LM compared to the two most closely related species (Fig. 2c). Similarly, we identified 83 unique clusters of genes in LM, the smallest among all the species (Fig. 2b). The analysis by OrthoVenn also revealed that LM had the largest number of uniquely lost gene clusters (390) present in all the sampled species of Tricholomataceae except for LM ( Supplementary  Fig. S4). We annotated functions of the 390 gene families uniquely lost in LM according to the GO database. We found that most of the missing genes were mapped to biological processes, in which 93 genes matched biological process (GO:0008150) and metabolic process (GO:0008152), respectively, as shown in Supplementary Table S8. The analysis by OrthoVenn suggests that LM is experiencing gene loss at a higher rate than other species in the family. In fact, it is the only sampled member of the family that appears to be gaining genes at a slower rate than losing them (i.e., compare Fig. 2b and Supplementary Fig. S5). This unique evolutionary process in LM sets it apart from other Tricholomataceae and seems to offer additional support for its independence from Tricholoma as suggested by Yu et al. 7 .
CAZymes of LM. We sought to better understand the diversity of CAZymes in LM and thus the mechanisms of the species to metabolize carbon for nutrition. We annotated and compared all modules of gene families form CAZymes in LM with nine other fungal species including six grass-or wood-rot fungi. Of these, two species (CG and LN) were among those used in comparative genomic analyses; the other four species include Volvariella volvacea (Bull.) Singer (VV), Lentinus edodes (Berk.) Pegler (LE), Pleurotus ostreatus (Jacq. ex Fr.) P.Kumm. (PO), and Trametes versicolor (L.) Lloyd (TV). The remaining three species used for comparison are ectomycorrhizal fungi, which we also used in the comparative genomics analysis: TM, LB, and LA. In total, we annotated 384 CAZymes genes in LM, which is the smallest among the grass-and wood-rot fungi ( Fig. 3 and Supplementary Table S9). These genes were divided into six main modules corresponding to major CAZyme modules: 159 belonged to glycoside hydrolases (GHs, hydrolysis and/or rearrangement of glycosidic bonds), 71 were resolved to have auxiliary activities (AAs, redox enzymes that act in conjunction with CAZymes) and whose modules could code an important lignin degradation enzyme lytic polysaccharide monooxygenase (LPMO) 29 , 69 belonged to glycosyl transferases (GTs, formation of glycosidic bonds), 50 belonged to carbohydrate-binding modules (CBMs, adhesion to carbohydrates), 18 belonged to carbohydrate esterases (CEs, hydrolysis of carbohydrate esters), and 15 belonged to polysaccharide lyases (PLs, non-hydrolytic cleavage of glycosidic bonds) (Fig. 3). Notably, the number of CAZymes in LM detected in this study differed from our previous de novo transcriptome study 30 , in which we found 446 CAZyme genes in 6 modules. The difference may be attributed to unexpressed genes in the transcriptome data compared to our present analysis of the whole genome data. LM does not exhibit depletion or enrichment of any of the six CAZyme modules. However, in families of AA1, GH16, GH5, GT2, and GT4, LM has fewer genes than all the other grass-and wood-rot fungi. www.nature.com/scientificreports/ Transcriptome analysis of vegetative mycelia and fruiting bodies of LM. In order to determine genes involved in fruiting body formation in LM (i.e., transition from vegetative growth), we performed highthroughput transcriptome sequencing of monocaryon mycelia (asexual stage) and fruiting bodies (sexual stage) www.nature.com/scientificreports/ with three biological replicates for each stage (Fig. 4c). In total, we obtained 449 million high-quality reads (average ~ 224 × depth per replicate per stage) and more than 80% of the sequencing data were properly aligned to the annotated exons (Supplementary Table S10 and Supplementary Fig. S6). Gene expression levels were estimated using FPKM (Fragments per Kilobase per Million Mapped Fragments) to compare between the two growth stages (Supplementary Table S11). Based on FPKM, the overall gene expression levels in fruiting bodies were higher than in mycelia (Supplementary Fig. S7a) in all replicate analyses ( Supplementary Fig. S7c).
To determine differently expressed genes (DEGs) in LM, we compared gene expression among all samples and found that 2,192 genes in fruiting bodies were significantly up-regulated and 2,438 were significantly downregulated. We used this result to build a volcano plot, which enabled us to visualize that the values of log 2 fold change of most DEGs were within the range of ± 5 (Supplementary Fig. S7c. In addition, we found that the patterns of alternative splicing events between the two stages were slightly different ( Supplementary Fig. S7d). Specifically, TTS (transcript start site) events occurred more frequently in mycelia compared with fruiting bodies, whereas IR (intron retention) events showed the opposite trend ( Supplementary Fig. S7d).
To assess the enrichment of DEGs in fruiting bodies of LM compared with mycelia, we sought to detect enrichment for GO terms. (Fig. 4a) We found that DEGs in fruiting bodies were enriched in 16 terms representing biological processes (BP), 25 terms representing cellular components (CC), and 14 terms representing molecular functions (FC) (Fig. 4a). Among those, BP terms were primarily associated with metabolic processes, cellular processes, responses to stimuli, localizations and biological regulation; CC terms were mainly associated with the cell membrane, organelles and cell parts, and FC terms were related to catalytic activity and binding. We analyzed enrichment further according to expression levels and visualized the result using a Directed Acyclic Graph (DAG; Supplementary Fig. S8). Among CC terms, we found that genes annotated with the extracellular region (GO:0005576) and cytosolic part (GO:0044445) terms were enriched in the fruiting body stage. For BP, we found that carbohydrate metabolic process (GO:0005975) was greatly enriched in the fruiting body stage. For FC, iron ion binding (GO:0005506), cofactor binding (GO:0048037), catalytic activity (GO:0003824), and oxidoreductase activity (GO:0016491) were enriched in fruiting bodies.
We referred to genes related to fruiting body formation of Lentinus edodes and Agrocybe aegerita, but did not find similar expression in our transcriptome sequencing, which again proves that the mechanism of fruiting body formation of mushrooms is complex 22,31 . But notably, one BP term, carbohydrate metabolic process (GO:0005975), included most of the CAZymes, suggesting that some differentially expressed CAZyme may make significant contributions to the switch between vegetative and reproductive growth stages. All DEGs comprising CAZymes of LM are shown in Supplementary Tables S12-S13 and Fig. 4b. We identified 10 highly expressed CAZymes from the fruiting stage: five genes within the GH family, two in the CBM family, and three in the AA, CE, and GT families, respectively (Fig. 4b). Among these CAZymes there were (i) genes with expression levels that are equivalent in the two different stages (i.e., not DEGs), including AA1, CBM13, CBM43, CE16, and GT1, and (ii) genes that are up-regulated in fruiting bodies, including GH128, GH16, GH20, GH27, and GH47 (Fig. 4b). Based on their up-regulation in fruiting bodies, CAZymes of the GH family specifically may be involved in switching between reproductive stages in LM. www.nature.com/scientificreports/ To verify the results from transcriptome data, we performed quantitative real-time PCR (qPCR) analysis on the 10 highly expressed CAZyme genes using the six materials shown in Fig. 4c. The results showed that the expression of all genes except for GT1 was consistent with that obtained from transcriptome analysis (Supplementary Fig. S9), indicating our transcriptome data are reliable.

Discussion
Taxonomy and phylogenetic relationships of LM. LM was firstly described by Imai Sanshi in 1917 32 and included in the genus Tricholoma. More recently, Yu et al. 7 separated LM from Tricholoma and proposed that the species is more closely related to Lepista based on morphology and a phylogeny of LSU DNA sequences. Yu et al. also argued that LM is morphologically distinct from Lepista, and according to our identification, the spore print of LM is white but that of genus Lepista should be more pink 33 . Based on these evidences, in combination with polyphyletic analysis, Yu et al. proposed a new genus, Leucocalocybe 7 . However, due to a single marker is used in Yu et al., the proposed taxonomic opinion remains controversial.
In our study, we used 3,269 single-copy genes to reconstruct a phylogenetic tree including six species of Tricholomataceae. The dated phylogeny presented here (Fig. 2c) reveals that LM is closer to Lepista, which is in agreement with the results of Yu et al. and our early study that LM diverged 25.5 Mya and Tricholoma diverged ca. 60 Mya. In addition, unique habitat (the fairy ring as shown in Supplementary Fig. 1), restricted distribution area (only grow in a small part of the Mongolian Plateau), and gene loss events also imply us that LM has a unique biological status. Collectively, these results support the conclusion of Yu et al. that LM belongs to the monotypic genus Leucocalocybe.

Gene loss may be an intrinsic cause for rarity of LM in nature.
In this study, we unexpectedly discovered several lines of evidence suggesting that LM has experienced extensive gene loss. Most notably, LM has the smallest number of genes among the six sampled species of Tricholomataceae (Table 2) as well as the smallest number of unique gene families (Fig. 2a), suggesting a low rate of differentiation and a high rate of gene loss. In addition, LM appears to have the highest number of uniquely lost gene families (i.e., absent in LM but none of the other five species) ( Supplementary Fig. S5), supporting a high rate of gene loss instead of a slow rate of gene gain, though these are not mutually exclusive processes and both could work in this species. LM also has the smallest number of CAZymes among the six sampled saprophytic species (Fig. 3), suggesting that it has lost some key functions central to its carbon metabolism, thus, potentially affecting its ability to obtain nutrients. These intrinsic and evolutionary features of LM may be partially responsible for its rarity in nature.

GO annotations and analyses of CAZymes represent genetic resources supporting cultivation of the species.
In our study, we performed transcriptome sequencing on mononuclear mycelia and dikaryon fruiting bodies of LM, which represent the asexual and sexual stages of the species respectively. Our results for GO annotation of the two LM transcriptomes were consistent with prior studies on Agrocybe aegerita 34 , Hypsizygus marmoreus 25 , and L. edodes 24 , which indicate that GO terms related to the cell and cell membrane were the most common within the CC category, terms of catalytic activity and binding were the most common in the FC category, and terms of cellular processes and metabolic processes were the most common in the BP category (Fig. 4a). Additionally, we found the results of GO annotation in this study are very similar to our earlier de novo transcriptome study of LM 24 , which confirmed that we gain a reliable reference genome in this study.
Our analyses of GO terms between the two transcriptomes also revealed that carbohydrate metabolic processes (GO:0005975) genes are the most significantly enriched in fruiting bodies among DEGs ( Supplementary  Fig. S7). This seems to be in agreement with our findings based on CAZymes that acquisition and use of nutrients, particularly carbon sources, in LM may play a critical role in shaping the observed evolutionary and ecological patterns (fairy rings) and processes in the species.
Nutrient acquisition in LM is poorly understood. We compared CAZymes in grass-and wood-rot fungi (LM, LN, CG, VV, LE, PO, and TV) and symbiotic mycorrhizal fungi (TM, LA, and LB) and found that the symbiotic mycorrhizal fungi had fewer CAZymes, especially for the CBM1 family (Supplementary Table S9). n contrast to mycorrhizal fungi, the distribution of CAZymes in LM is more similar to grass-and wood-rot fungi, despite that LM has the smallest number of CAZymes among the grass-and wood-rot fungi. If the CAZymes in LM suggest similarity to grass-and wood-rot fungi in terms of nutrient acquisition modes, then LM may have high potential to be domesticated because many grass-and wood-rot fungi are relatively easy to domesticate because they do not require a living host.
We examined ten CAZyme genes in detail that were highly expressed within mycelia and fruiting bodies ( Table 3). Among these genes, evm.model.contig14.1 (AA10), evm.model.contig24.31 (CBM13), evm.model. contig16.164 (CBM43), and evm.model.contig13.330 (GT1) were highly expressed in both organs, and they play important roles in the metabolism of carbon nutrients. Interestingly, three of these five genes belong to small-size gene families: evm.model.contig14.1 is the only member of the AA10 family in LM, evm.model.contig16.164 represents one of two total genes in the CBM43 family, and evm.model.contig13.330 is one of four total genes in the GT1 family. This indicates that the size of a CAZyme family is not predictive of its expression level. We also found that, of the ten highly-expressed CAZymes, five were related to metabolism of glucose, xylose, galactose, or mannose, which were previously deemed important for nutrition in LM 14 . They are evm.model.contig13.330 (GT1, glucose and xylose), evm.model.contig4.294 (GH128, glucose), evm.model.contig6.808 (GH16, xylose), evm.model.contig17.149 (GH27, galactose), and evm.model.contig10.338 (GH47, mannose). Therefore, these ten genes may play important roles in the nutrient metabolism of LM, and could thus affect fruiting body formation and have implications for domestication and breeding. Extraordinarily, we found the expression of CAZyme family GH128 (evm.model.contig4.294) was very high in the fruiting body group (  35 , we found that this gene is widely distributed in many fungi, such as Aspergillus nidulans and Neurospora crassa 35 . Our study was the first to reveal and verify the high expression of the GH128 family gene in the fruiting bodies of LM, which promotes the research on the ex-site conservation, and also sheds light on the research of fruiting body formation of the other mushrooms.

Conclusion
We sequenced, annotated, and studied the first whole genome of LM as well as transcriptomes representing its vegetative and fruiting organs. Our study may help to facilitate conservation of this rare species by suggesting a genomic basis for its rarity in the wild and by providing specific genomic resources for its domestication for food and medicine. In particular, we found that ten key CAZymes are associated with nutrient acquisition and sexual reproduction in LM and could be utilized for rapid advancement in its domestication, which represents an important method of ex situ conservation.

Methods
Materials. The s29 strain of LM was isolated from a spore print of a specimen acquired in the fall of 2018 in Chenbaerhu Banner of Hulunbeier City, Inner Mongolia Autonomous Region, China. The voucher specimen is deposited in the Herbarium of Mycology of Jilin Agricultural University (HMJAU), under NO. 55229. We verified separation of the monospore strain using an Axio Imager A2 fluorescence microscope (Zeiss). We cultured the monokaryotic mycelia in potato dextrose and carrot sucrose solid-state (potato 100 g, carrot 100 g, and sucrose 20 g per liter) culture in the dark at 23 °C for 20 d. The fruiting body for transcriptome sequencing was acquired in the fall of 2017 in Wubuer Baolige Sumu of Hulunbeier City, Inner Mongolia Autonomous Region, China. Then, we selected three samples from a tufted wild fruiting body (Fig. 4c) as biological duplications to guarantee the same growth conditions and stored in a − 80° freezer prior to use, and all sequencing strains and materials were stored in Engineering Research Center of Chinese Ministry of Education for Edible and Medicinal Fungi, Jilin Agricultural University. For collection ethics and protection of the species, we neither collect fruiting bodies of LM less than two in pre fairy ring, nor collect fruiting bodies that have not yet begun to eject spores. Finally, we leave at least one mature fruiting body in each fairy ring after collection.
Isolation of RNA and construction of cDNA library. Total RNA was extracted from frozen mycelium and internal tissues of fruiting body by using the Transzol plant kit (TransGen Biotech, Inc.) following the manufacturer's instructions. After extraction and purification, we checked the purity of RNA using a K5500 spectrophotometer (Kaiao, Beijing, China) and determined the integrity of RNA and its concentration with an RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, CA, USA). A total amount of 2 μg RNA per sample was used as input for the RNA sample preparations. We generated sequencing libraries with a NEB Next Ultra RNA Library Prep Kit for Illumina (#E7530L, NEB, USA) following recommendations of the manufacturer, and added index codes to attribute sequences to each sample. Briefly, we purified mRNA from total RNA using poly-T oligo-attached magnetic beads, and carried out fragmentation using divalent cations under elevated temperature in NEB Next First Strand Synthesis Reaction Buffer (5 ×). We synthesized the first strand of cDNA using a random hexamer primer and RNase H, and the second strand using buffer, dNTPs, DNA polymerase I, and RNase H. We purified the library fragments with QiaQuick PCR kits and performed elution with EB buffer. www.nature.com/scientificreports/ DNA extraction, library preparation, and LM genome sequencing and assembling. (a) Genomic DNA was extracted from strain s29 using DNeasy Plant Mini Kit (QIAGEN). Sequencing was carried out on the Pacific Bioscience Sequel platform and Illumina platform at Annoroad Gene Technology Company, China. In total, we generated 10.5 Gb of high-quality reads from the SMRT cells and 11.35 GB of high-quality paired-end reads from the Illumina platform. (b) Assembly: We assembled the PacBio reads using the Mecat pipeline 36 , and curated the assembled contigs using the Arrow algorithm 37 . We curated the data from Illumina platform using Pilon 38 . We used BUSCO 39 with OrthoDB database 40 to assess the integrity of the assembled genome sequence.
Annotation of the LM genome. (a) Gene prediction: We performed genome annotation by de novo prediction and homology-based searches as well as a cDNA-based search using the transcriptome data generated in this study. We used Augustus v3.3 43  Comparative genomics and phylogenetic analyses. (a) Gene family identification: A GeneFamily approach 57 was conducted. Briefly, we first filtered the gene set of protein sequences of LM, AB, CG, LN, TM, LA, and LB based on filtration standard, that is, when a gene has more than one transcript, the longest transcript is taken, and the protein sequences with lengths greater than 50 amino acids (aa) were picked. Second, we used Blastp to format the filtered data with parameters "-p blastp -m 8 -e 1e-5 -a 10 -F F" and then OrthoMCL 58 software with the parameter " -I 1.5" to statistical gene family data form formatted data. (b) Phylogenetic analyses: By using results from gene family identification, we found 3,269 single-copy gene families in LM, AB, CG, LN, TM, LA, and LB genome data. Then, we used MUSCLE 59 to generate a super-alignment of 3,269 single-copy gene families and reconstructed a phylogenetic ML tree by PhyML v3.0 60  qPCR nalysis. qPCR was performed using the same DNA samples and primers described in Fig. 4c. The qPCR reaction conducted in a 15-μL volume containing 2 μL AceQ qPCR SYBR Green Master Mix (JZ121-02, Jizhenbio), 0.7 μL of each primer (10 μM), 100 ng (1 μL) of cDNA templates, and ddH 2 O to a final volume of 15 μL. The qPCR cycling parameters were: 95 °C for 5 min, 40 cycles of 95 °C for 10 s and 60 °C for 30 s. The Actin gene was used as the internal control and the relative expression level of each gene was calculated by the 2 -ΔΔCt method. Each qPCR reaction was performed in triplicate. All of the primer sequences used are shown in Table S14.