Genome sequencing and analysis of Talaromyces pinophilus provide insights into biotechnological applications

Species from the genus Talaromyces produce useful biomass-degrading enzymes and secondary metabolites. However, these enzymes and secondary metabolites are still poorly understood and have not been explored in depth because of a lack of comprehensive genetic information. Here, we report a 36.51-megabase genome assembly of Talaromyces pinophilus strain 1–95, with coverage of nine scaffolds of eight chromosomes with telomeric repeats at their ends and circular mitochondrial DNA. In total, 13,472 protein-coding genes were predicted. Of these, 803 were annotated to encode enzymes that act on carbohydrates, including 39 cellulose-degrading and 24 starch-degrading enzymes. In addition, 68 secondary metabolism gene clusters were identified, mainly including T1 polyketide synthase genes and nonribosomal peptide synthase genes. Comparative genomic analyses revealed that T. pinophilus 1–95 harbors more biomass-degrading enzymes and secondary metabolites than other related filamentous fungi. The prediction of the T. pinophilus 1–95 secretome indicated that approximately 50% of the biomass-degrading enzymes are secreted into the extracellular environment. These results expanded our genetic knowledge of the biomass-degrading enzyme system of T. pinophilus and its biosynthesis of secondary metabolites, facilitating the cultivation of T. pinophilus for high production of useful products.

. Several sequenced Penicillium and Aspergillus species such as P. oxalicum 114-2 12 and Aspergillus niger ATCC 1015 13 are also predicted to have eight chromosomes. The overall GC content of the T. pinophilus 1-95 genome was 46.23%. The GC content was 50.08% for the coding sequences and 24.84% for the mitochondrial genome. Other general features of the T. pinophilus 1-95 genome are in Table 1.
In total, 13,472 protein-coding genes were predicted from the genome assembly using five ab initio gene prediction programs: Augustus (http://bioinf.uni-greifswald.de/augustus/), GeneMark-ES (http://exon.gatech. edu/GeneMark/), Genewise (http://www.ebi.ac.uk/Tools/psa/genewise/), SNAP 14 and an unsupervised learning system program Glean version 1 (https://sourceforge.net/projects/glean-gene/). The number of coding genes was significantly higher than other filamentous fungi that produce biomass-degrading enzymes (see Supplementary  Table S1). Of the T. pinophilus coding genes, 8162 (60.58%), were annotated in the Gene Ontology (GO) database (http://geneontology.org/), 12,828 (95.21%) in the UniProt database (http://www.uniprot.org/), 12,946 (96.09%) in the NCBI non-redundant (NR) protein sequences database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) and 4437 (32.93%) in the Clusters of Orthologous Groups of proteins database (http://www.ncbi.nlm.nih.gov/COG/). A total of 6817 (50.6%) genes belonging to 331 pathways were also annotated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.kegg.jp/) (see Supplementary Fig. S1). The coding regions of the predicted genes constituted almost 53.37% of the genome, with an average length of 1446.5 bp. BUSCO 15 and CEGMA 16 were used to evaluate the integrity of the genome assembly and prediction of gene sets. A BUSCO set for fungi comprising 1438 single-copy ortholog genes from more than 100 fungal strains was used to evaluate the genome assembly and gene sets of T. pinophilus 1-95. More than 98% of the orthologous genes matched with genes in the genome and gene sets of T. pinophilus 1-95 (see Supplementary Table S2). Using CEGMA, 238 of 248 core eukaryotic genes for fungi were completely identified by evaluating the genome assembly, and 427 of 437 eukaryotic clusters of orthologous groups were identified with an overlap rate > 0.5 when predicted gene sets were assessed (see Supplementary Table S1). These results indicated that our genome assembly and prediction of gene sets for T. pinophilus 1-95 were of high quality and confidence.
Overall genome and proteome comparison. The genome size and number of protein-encoding genes of T. pinophilus 1-95 were comparable to other sequenced filamentous fungi (see Supplementary Table S1). A phylogenetic tree constructed based on 2082 single-copy orthologs (see Supplementary Dataset S1) indicated that T. pinophilus 1-95 was most closely related to T. cellulolyticus Y-94 (Fig. 2a). Analyzing the top hits of a BLASTp (https://blast.ncbi.nlm.nih.gov/Blast.cgi) search of all-vs.-all findings showed that T. pinophilus 1-95 and T. cellulolyticus Y-94 shared 10,260 orthologous proteins, accounting for 76.16% of the total proteome, with an average amino acid sequence identity of 98.28%. In contrast, low identity was observed to other sequenced Talaromyces species 17 , Trichoderma sp. 18 , Penicillium sp. 19,20 and Aspergillus sp. 13,21,22 , ranging from 70% to 88% (see Supplementary Table S3).
The predicted proteome of T. pinophilus 1-95 was 23.84% larger than the proteome of T. cellulolyticus Y-94 9 , whereas the genome size of 1-95 was similar to Y-94 (see Supplementary Table S1). The reason for this finding was that the genome sequencing and assembly of T. pinophilus 1-95 were of high quality (see Supplementary  Table S2), partly due to the lack of gaps in the scaffolds. A comparative analysis of proteins annotated by the GO database between T. pinophilus 1-95 and T. cellulolyticus Y-94 showed functional differences between the proteins mostly for "biological regulation", "cellular process" and "metabolic process" in the biological process category and "binding", "catalytic activity" and "nucleic acid binding transcription factor activity" in the molecular function category (see Supplementary Fig. S2). We used data from the PE library with an average insert size of 500 bp from T. pinophilus 1-95 to map the entire genome sequence of T. cellulolyticus Y-94 7 . The PE reads covered 35 scaffolds with 32.06 Mb of sequence from the T. cellulolyticus Y-94 genome, resulting in 88% average coverage and 83-times average depth (see Supplementary Table S4). We found 489 genes from T. cellulolyticus Y-94 contained insertion-deletion mutations when mapping PE reads from the T. pinophilus 1-95 to the genome of T. cellulolyticus Y-94; 257 of these occurred in coding sequence regions. Only 33 of the mutated genes showed no similarity with genes in the T. pinophilus 1-95 proteome, and approximately half were annotated by the GO database as related to "oxidation-reduction", "binding" and "catalytic activity".
Transcription factor prediction. Transcription factors (TFs) are essential for modulating diverse biological processes by regulating gene expression. In total, 943 TFs were found in the predicted proteome of T. pinophilus 1-95. The largest family was proteins (716 members) containing zinc-finger structures, such as the Zn 2 Cys 6 type, the C2H2 type and CCHC type, followed by the winged helix repressor DNA-binding family (97 members) (Fig. 4). TFs are used in genetic engineering to improve biomass-degrading enzyme yields. We carried out an orthology search of known TFs involved in regulation of lignocellulolytic genes in filamentous fungi against the translated proteins in the T. pinophilus 1-95 genome using an identity of 40% and an E-value threshold of 1E-10. Orthologs of most known TFs were identified in T. pinophilus 1-95, including the carbon catabolite repressor CreA (TP09972), the cellulase transcription activator CLR-2 (TP10486), the starch degradation regulator AmyR (TP09286) and the xylan degradation regulator XlnR (TP02627) ( Table 2). Several related proteins include ACEII, Xpp1, ClrC and BglR, had no ortholog in T. pinophilus 1-95.
A repertoire of secondary metabolism gene clusters. We found that T. pinophilus 1-95 had a wealth of secondary metabolites using the AntiSMASH web service 30 . A total of 68 secondary metabolism gene clusters harboring 401 putative genes were identified. The predicted products of 52 secondary metabolism gene clusters were classified into 8 different types: 28 T1 polyketide synthase (T1PKS) gene clusters, 9 non-ribosomal peptide synthase (NRPS) gene clusters, 9 terpene gene clusters, 2 Nrps-T1pks gene clusters, 1 phosphonate gene cluster, 1 T1pks-Indole gene cluster, 1 T1pks-Nrps gene cluster and 1 Terpene-T1pks gene cluster; the remaining 16 gene clusters synthesized other unknown secondary metabolites ( Fig. 2c; see Supplementary Table S7). When compared with known gene clusters for secondary metabolites, eight were predicted to produce emericellin, pestheic acid, azanigerone and azaphilone (data not shown).   The transporter and secretory system. Transporters are important in microbial growth and reproduction because they assist microbes in the uptake of nutrients and energy from the surrounding environment. In total, 1,238 genes encoding putative transporters belonging to seven superfamilies were identified in the T. pinophilus 1-95 genome (see Supplementary Dataset S2). Of these, members of the electrochemical potential-driven transporters (EPTs) were the most abundant, accounting for 54.7%, followed by primary active transporters (18.3%). Among the EPTs, 431 members of a major facilitator superfamily were selected; these are involved in the transport of substances including sugar, drugs and peptides. Of these EPTs, TP13272 and TP06909, respectively, were orthologs of the cellodextrin transporters Cdt-C and Cdt-D, which are crucial for induction of cellulase gene expression by insoluble cellulose 31 . Many proteins are commonly secreted into extracellular regions, including enzymes involved in biomass degradation. A comprehensive pipeline was designed to predict the T. pinophilus 1-95 secretome. T. pinophilus 1-95 potentially secreted 1,203 extracellular proteins (8.9% of predicted total proteins), comprising 831 classical and 372 nonclassical secretory proteins (see Supplementary Dataset S3). Using KEGG annotation, 546 putatively secreted proteins were assigned, indicating that these abundant secretory proteins were mainly involved in metabolism, especially carbohydrate metabolism and xenobiotic biodegradation and metabolism (see Supplementary  Fig. S3). We found that 323 of these, including 35 nonclassical secreted proteins, were present in the CAZyme database, accounting for 26.8% of the total secretome. The repertoire of secreted CAZymes consisted of 54 GH families, 18 GT families, 12 CE families, 8 AA families, 4 PL families and 13 CBM families. The most common GH family, comprising 188 enzymes, contributed to 58.2% of total secreted CAZymes, followed by the AA families (14.9%). The most prevalent GH CAZyme classes were GH7, GH3, GH5, GH10-13 and GH31, which represent cellulases, endoxylanases and amylases, all of which are required for biomass degradation. The most abundant CBM family was CBM1, accounting for 43.8% of total CBMs, which are known to bind to crystalline cellulose and aid in its enzymatic hydrolysis 32 .

Discussion
A systematic genetic investigation of filamentous fungi would contribute to genetic engineering of more diverse and productive industrial microbial strains for improving cellulolytic enzyme production. We sequenced, assembled and analyzed the entire T. pinophilus 1-95 genome in detail. T. pinophilus is a promising filamentous fungus for the industrial production of biomass-degrading enzymes. This study describes the nearly complete genome sequence of a member of the genus Talaromyces. The total assembled genome size was 36.51 Mb, which was within the range of filamentous fungi that produce cellulolytic enzymes, including Penicillium, Aspergillus, Trichoderma and Neurospora species.
Comparative genome analysis indicated that the most closely related species to T. pinophilus 1-95 was T. cellulolyticus Y-94. T. cellulolyticus Y-94 was identified as T. pinophilus based on only an internal transcribed spacer sequence 33 . It was proposed as the new species T. cellulolyticus in the genus based on morphological and phenotypic differences from T. pinophilus 34 . The reported genome sequence of Y-94 is a draft with a number of gaps 9 . In this study, a nearly complete genome sequence of T. pinophilus 1-95 was obtained. The number of protein-encoding genes in T. pinophilus was higher than in other investigated fungal strains. This result may be because of the presence of more genes or the result of high-quality SMRT DNA sequencing technology and a different methodology for gene prediction. In particular, a large inventory of CAZymes was found, including CWDEs and starch-degrading enzymes. This result supported the high capacity of this species to degrade biomass, comparable to T. cellulolyticus Y-94 35 and P. oxalicum HP7-1 20 . Of note, T. pinophilus 1-95 possessed the most BGLs (29 coding genes), glucoamylases (5 coding genes) and α-glucosidases (13 coding genes) among species we compared. BGLs are important for releasing inhibition of cellulase activity 36 . Furthermore, the predicted secretome of T. pinophilus 1-95 showed that approximately 50% of CWDEs and starch-degrading enzymes were secreted into the extracellular region, including major cellulases, hemicellulases and amylases. This result indicated a promising application of T. pinophilus in biorefining. These results also demonstrated that T. pinophilus 1-95 is more excellent cellular machinery for biomass-degrading enzymes than that of P. oxalicum HP7-1 and T. cellulolyticus Y-94 as previously observed 20,35 , meriting further study.
Comparative analysis to 10 filamentous fungi from four genera, Talaromyces, Trichoderma, Penicillium and Aspergillus, indicated that T. pinophilus 1-95 possessed the most secondary metabolism gene clusters except for A. niger and A. oryzae. T. pinophilus 1-95 had more T1PKs than Trichoderma sp., Penicillium sp. and Aspergillus sp., and fewer NRPS than Aspergillus species. These results suggested that T. pinophilus 1-95 has potential for producing bioactive secondary metabolites. Although thus far, several bioactive secondary metabolites have been extracted and characterized from T. pinophilus 6,7 , according to the genomic data, additional secondary metabolites could be generated using this species.
In summary, this study provided a nearly complete genome sequence for the genus Talaromyces. The result provided new insights for a comprehensive understanding of the biomass-degrading enzyme system of Talaromyces at the genome level. Detailed comparative genomic analysis revealed a complex biomass-degrading enzyme system in T. pinophilus, indicating its promising application in biomass biorefineries. This study provides a genome-sequence basis for developing strategies that use T. pinophilus as a microbial cell factory for production of high-value enzymes and secondary metabolites.

Materials and Methods
Culture conditions and genomic DNA extraction. T. pinophilus 1-95 was isolated from soil in a dried, ploughed paddy field in Wuzhou, China 1 and was deposited at the China General Microbiological Culture Collection Center (CGMCC) under accession number CGMCC No. 2645. Total DNA extraction from mycelia was performed using a phenol-chloroform method with some modifications 37 . Mycelia were ground in liquid nitrogen and put in 1 mL lysate reagent (40 mM Tris-HCl, 20 mM sodium acetate, 10 mM ethylenediaminetetraacetic acid, and 1% sodium dodecyl sulfate, pH 8.0) per 100 mg mycelia powder. Genomic DNA was collected by centrifugation at 11,300 × g for 10 min.
Genome sequencing and assembly. The T. pinophilus strain 1-95 genome was sequenced using a PacBio RS II platform and Illumina HiSeq 4000 platform at the Beijing Genomics Institute (BGI, Shenzhen, China). Four SMRT cells zero-mode waveguide arrays of sequencing, were used by the PacBio platform to generate the subreads set. PacBio subreads (length < 1 kb) were removed. The program Pbdagcon (https://github.com/ PacificBiosciences/pbdagcon) was used for self-correction. Draft genomic unitigs, which are uncontested groups of fragments, were assembled using the Celera Assembler 38 against a high-quality corrected circular consensus sequence subreads set. Order, distance and orientation of unitigs and combined scaffolds were generated using software SSPACE 39 . An upgraded draft genome was obtained after filling or reducing as many captured gaps as possible using software PBJelly 40 . To improve the accuracy of the genome sequences, GATK (https://www.broadinstitute.org/gatk/) and SOAP tool packages (SOAP2, SOAPsnp, SOAPindel) 41, 42 were used to make single-base corrections.
A DNA library of 500 bp inserts was constructed and PE sequenced. For generated HiSeq reads, Q20, representing the probability of the incorrectness of the corresponding base call, was detected. If Q20 reads accounted for less than 60%, they were discarded. Software Pilon 43 used reasonable PE sequence data from Illumina libraries to perform comprehensive variant detection and improve genome assembly.
For functional annotation of translated proteins in the T. pinophilus 1-95 genome, a BLASTp version 2.2.28 + search against the NCBI NR database (update 05, 2015) and Swiss-Prot and TrEMBL databases (http://www. mrc-lmb.cam.ac.uk/genomes/madanm/pres/swiss2.htm) (update 01, 2016) and KEGG database (http://www. kegg.jp/) version 76, were used to assign general protein function profiles. We used cut-off E-value ≤ 1e-5, overlap 0.4 and identity 30. InterProScan5 (http://www.ebi.ac.uk/interpro/), stand-alone version 55, and GO (http:// geneontology.org/) were also used to annotate the predicted proteome. TFs were predicted based on InterPro IDs in the Fungal Transcription Factor Database (http://ftfd.snu.ac.kr/). The hmmsearch program in the HMMER 3.1b2 package (http://hmmer.org/), was used to search all predicted proteomes with the family-specific hidden Markov model profiles of CAZymes from the dbCAN database 26 . Primary results were processed with an E-value threshold of 1E-7. Protein kinases and phosphatases were detected using hmmsearch based on the Eukaryotic Kinase and Phosphatase Database (http://ekpd.biocuckoo.org/). Membrane transport proteins were classified and identified by a BLASTp search against the transport classification database 44 , with E-value threshold 1E-10, overlap 0.4 and identity of 30. AntiSMASH 30 was used to annotate secondary metabolism gene clusters.
Phylogenetic analysis. An all-against-all pairwise BLASTp similarity search was performed using proteomes from 11 filamentous fungi (see Supplementary Table S2) with E-value cutoff 1E-7, according to the method described previously 45 . We selected 2,082 single-copy genes from 118,099 genes in 11 fungal genomes. MUSCLE (http://www.ebi.ac.uk/Tools/msa/muscle/) version 3.7 with default parameters was used to perform multiple sequence alignment of single-copy genes. A neighbor-joining tree was calculated using TreeBeST 46 with bootstrapping set to 100. The phylogenetic tree was visualized using SVGKit (http://svgkit.sourceforge.net/) and PERL scripts. Secretome prediction. The total set of 13,472 proteins of T. pinophilus strain 1-95 was analyzed using Secretome P 47 v1.0 and SignalP 48 v4.0 for ab initio prediction of classical and nonclassical secretory proteins, except for 11 proteins with more than 4,000 amino acid residues. All proteins containing signal peptides were selected, with proteins without signal peptides chosen as candidates in cases of neural network score ≥ 0.6. All selected proteins were analyzed by TargetP 49 v1.1 with localization = secretory pathway and RC≤ 4 as screening criteria. Protein sets were scrutinized for the presence of transmembrane domains using TMHMM v2.0 (http:// www.cbs.dtu.dk/services/TMHMM/) and for glycosylphosphatidyl inositol anchors use web server PredGPI (http://gpcr.biocomp.unibo.it/predgpi/). Software tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/) version 1.3 was used for transfer RNA prediction using the T. pinophilus 1-95 genome with option C and other default parameters.