Main

Lignin, a major component of plant cell walls that gives strength to wood, is the second most abundant natural polymer on earth. This amorphous and insoluble aromatic material lacks stereoregularity and, unlike hemicellulose and cellulose (the most abundant natural polymers), is not susceptible to hydrolytic attack. Only a small group of fungi are able to completely degrade lignin to carbon dioxide and thereby gain access to the carbohydrate polymers of plant cell walls, which they use as carbon and energy sources. Selective degradation of lignin by these fungi leaves behind crystalline cellulose with a bleached appearance that is often referred to as “white rot”1,2. These filamentous wood decay fungi are common inhabitants of forest litter and fallen trees, and have potential in a wide range of biotechnological applications including hazardous waste remediation and the industrial processing of paper and textiles. All white rot fungi are basidiomycetes, a diverse fungal phylum that accounts for over one-third of fungal species, including edible mushrooms, plant pathogens such as smuts and rusts, mycorrhizae and opportunistic human pathogens.

The most intensively studied white rot basidiomycete, P. chrysosporium3, is phylogenetically distant from other sequenced fungi, all of which are members of the Ascomycotina, e.g., Saccharomyces cerevisiae4, Schizosaccharomyces pombe5 and Neurospora crassa6,7,8. Like most basidiomycetes, the vegetative mycelium of P. chrysosporium contains two distinct haploid nuclei, a condition known as dikaryosis. Restriction-fragment length polymorphism (RFLP) mapping and low-resolution pulsed field gels suggest it contains seven to nine chromosomes with a haploid genome size of approximately 30 million base pairs (Mbp)9,10. To reveal the genetic repertoire of white rot fungi we have sequenced the genome of P. chrysosporium, a representative fungus, using a whole genome shotgun strategy. Here we present an initial analysis of the draft genome sequence. An interactive web portal to the white rot genome can be found at http://www.jgi.doe.gov/whiterot.

Results

Assembly and general characteristics

Using a pure whole genome shotgun approach, we sequenced the P. chrysosporium genome to >10.5× coverage (Methods). The net length of assembled contigs totaled 29.9 Mbp, which is in excellent agreement with pulsed-field gel estimates of a 30-Mbp genome size9,10. Genome statistics are presented in Table 1.

Table 1 General features of the P. chrysosporium genome

Identifying genes in the P. chrysosporium genome is particularly challenging, because it is the first genome representing the phylum Basidiomycota. Fungi are believed to have appeared approximately one billion years ago, and the divergence of the Basidiomycota and Ascomycota occurred over 500 million years ago11. To reveal the gene repertoire of P. chrysosporium, we combined comparative methods with an ab initio gene-finding approach, bootstrapping the gene prediction process by first obtaining high confidence homologs and subsequently using these models to identify less conserved genes (see Methods). This modeling approach yielded a prediction of 11,777 genes in the P. chrysosporium genome.

Of the 11,777 predicted models, 72.1% (8,486) have significant sequence similarity to GenBank proteins (Smith-Waterman alignment is at least 60% of either model or hit length), and 6.4% (757) have <60% alignment but contain conserved protein domains according to InterPro scanning12. Of the remaining 2,534 models, 2,499 are GRAIL13 predictions (99.7%), representing highly divergent genes, previously unrecognized genes that are potentially unique to filamentous fungi, basidiomycetes, or white rot fungi, or spurious gene predictions. Within the subset of 8,486 gene models with significant sequence similarity to known genes or domains, 28.7% (2,439) are in essence full-length homologs of known proteins (i.e., the alignment is at least 80% of both the predicted model and its homolog). Lastly, 16.7% (1,418) of the predicted gene products show sequence similarity only to proteins with 'unknown' or 'hypothetical' annotations. Their existence corroborates the presence of these functionally ambiguous proteins in both white rot fungi and other genomes, but provides no additional information. The taxonomic distribution of gene model homologies is summarized in Figure 1.
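The alignment-coverage criteria above can be written as a small decision rule. The following Python sketch is illustrative only: the function name, argument names and category labels are hypothetical and are not part of the annotation pipeline, and it assumes that coverage is computed as alignment length divided by sequence length.

```python
def classify_model(aln_len, model_len, hit_len, has_interpro_domain=False):
    """Assign a gene model to a support category (labels are hypothetical).

    aln_len   -- length of the Smith-Waterman alignment (0 if no hit)
    model_len -- length of the predicted protein
    hit_len   -- length of the best GenBank hit (0 if no hit)
    """
    if aln_len > 0 and hit_len > 0:
        cov_model = aln_len / model_len
        cov_hit = aln_len / hit_len
        # "In essence full-length": alignment spans >= 80% of BOTH sequences.
        # This class is a subset of the significant-similarity class.
        if cov_model >= 0.80 and cov_hit >= 0.80:
            return "full_length_homolog"
        # "Significant similarity": alignment spans >= 60% of EITHER sequence.
        if cov_model >= 0.60 or cov_hit >= 0.60:
            return "significant_similarity"
    # Below 60% coverage (or no hit): fall back on conserved-domain evidence.
    return "domain_only" if has_interpro_domain else "unsupported"


if __name__ == "__main__":
    print(classify_model(90, 100, 100))        # covers both sequences well
    print(classify_model(65, 100, 200))        # covers the model only
    print(classify_model(30, 100, 100, True))  # short alignment, domain present
```

Testing the 80% rule first keeps the full-length homologs distinguishable from the broader 60% class that contains them.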

Figure 1: Taxonomic distribution of gene models that correspond to Smith-Waterman double-affine alignments with BLOSUM62 scores >100.

There are 7,336 total gene models within this minimum score.

Typical of shotgun sequencing of eukaryotes, extended repeats, telomeres and rRNA clusters were excluded from the assembly. Nevertheless, substantial numbers of noncoding repetitive sequences and putative mobile elements were assembled. Short repeats (<3 kbp) not clearly associated with transposons varied in copy number from >40 (GenBank no. Z31724) to 4 (GenBank nos. AF134289–AF134291). Several putative transposase-encoding sequences resemble class II transposons of ascomycetous fungi such as Aspergillus niger Ant, Cochliobolus carbonum Fot1, Nectria “Restless,” Fusarium oxysporum Tfo1 and Cryphonectria parasitica Crypt1 (for review see refs. 14, 15). Additional transposase-encoding sequences included EN/Spm- and TNP-like elements (gx.25.15.1; pc.90.8.1) that are common in higher plants (pfam02992) but hitherto unknown in fungi14,15. Fungal class II elements often exceed 50–100 copies per genome, but the corresponding P. chrysosporium transposases are each represented by only 1–4 copies.

Numerous multicopy retrotransposons were identified, and several seem likely to affect expression of genes related to lignin degradation (Supplementary Table 1 online). Most of these elements appear truncated and/or rearranged, and the long terminal repeats (LTRs), typical of retroelements, often lie apart as “solo LTRs”16,17. Several non-LTR retrotransposons, similar to other fungal long interspersed nuclear element (LINE)-like retroelements, were also identified. Copia-like retroelements are particularly abundant, and one such element interrupts a putative member of the cytochrome P450 gene family within the seventh exon (Supplementary Table 2 online). A similar situation was observed for a multicopper oxidase gene mco3, where a Skippy-like gypsy retroelement has been inserted within the twelfth intron. Intact coding regions flanked these inserts, suggesting recent transpositions and/or splicing of the elements. Another gypsy-like element is inserted 100 bp upstream of a 'hybrid' peroxidase gene pc.91.32.1 (Supplementary Table 1 online). The occurrence of intact transposons and other highly conserved repetitive elements is in marked contrast to the recently sequenced N. crassa genome, where repeat-induced point mutations (RIP) have greatly reduced the frequency of repeats greater than 400 bp6,7,8.

Protein families and domains in P. chrysosporium

InterPro12 identification of conserved domains among predicted genes of P. chrysosporium provides an overview of the coding capability of this filamentous fungus (Fig. 2; Supplementary Table 3 online). A large expansion in the InterPro category corresponding to the cytochrome P450 superfamily may reflect the complexity of metabolizing lignin derivatives and related aromatic compounds (see below). Also consistent with efficient depolymerization and degradation of lignin is the relatively high number of putative glucose-methanol-choline (GMC) oxidoreductases (IPR000172). This family includes extracellular alcohol oxidases and cellobiose dehydrogenase18, enzymes directly involved in lignocellulose degradation (below). Short chain dehydrogenase/reductases (IPR002198), aspartyl proteases (IPR001461) and Ras small GTPase (IPR003579) domains are more abundant in filamentous fungi (P. chrysosporium, N. crassa) relative to the sequenced yeasts (S. cerevisiae, S. pombe). The filamentous fungal genomes are considerably larger and encode more genes than those of the yeasts. Increased size and complexity of filamentous fungal genomes are likely due, in part, to their hyphal morphology, elongation and penetration of complex substrata.

Figure 2: Schematic showing the InterPro version 5.3 (ref. 12) domains in P. chrysosporium (outer ring), N. crassa, S. pombe and S. cerevisiae (inner ring).

InterPro contains the combined protein signature databases of Swiss-Prot, TrEMBL, PROSITE, PRINTS, Pfam, ProDom, SMART and TIGRFAM. Note the large difference between the four organisms in the size of bands corresponding to the cytochrome P450 domain (IPR001128), indicating the expansion of this gene family in P. chrysosporium. Type I Antifreeze domains (IPR000104) are artifacts attributed to a common tripeptide repeat, Ala-Ala-Thr. The number of sequences per domain is listed separately (Supplementary Table 3 online).

Degradation of lignin and related aromatic compounds

White rot fungi catalyze the initial depolymerization of lignin by secreting an array of oxidases and peroxidases that generate highly reactive and nonspecific free radicals, which in turn undergo a complex series of spontaneous cleavage reactions19. Major components of the P. chrysosporium lignin depolymerization system include multiple isozymes of lignin peroxidase (LiP) and manganese-dependent peroxidase (MnP). Consistent with previous studies, ten lip genes were identified. The sequence of scaffold 85 confirms and extends earlier genetic data9,20 providing a detailed view of the principal lignin peroxidase gene cluster. Genes encoding lipC and lipI lie 45 kb apart within this cluster, which is in good agreement with the observed 1% recombination between these genes21 and the estimate of one crossover per 60 kb inferred from RFLP mapping9.

Additional analyses exposed five mnp genes, including two previously unknown members of this gene family (Supplementary Table 1 online). One of these was designated mnp5; its predicted peptide corresponds to the N-terminal sequence of a peroxidase partially purified from colonized wood pulp (GenBank no. A61147). Unexpectedly, the other new MnP gene (mnp4) was located only 5.7 kb from the well-characterized gene, mnp1. The predicted proteins of mnp4 and mnp1 are nearly identical, with a single amino acid substitution. Clustering of mnp genes has been observed in other white rot fungi but not previously in P. chrysosporium. An interesting peroxidase gene, model pc.91.32.1, is unlinked to all other peroxidases, but shares residues common to both mnps and lips. The pc.91.32.1 sequence is most closely related to the 'hybrid' peroxidase of Pleurotus eryngii (GenBank no. AF007224), but not all catalytic and manganese-binding residues are conserved22.

LiP and MnP require extracellular H2O2 for their in vivo catalytic activity, and one likely source is the copper radical oxidase, glyoxal oxidase (GLOX)23,24,25. In addition to glx, the genome sequence reveals at least six other sequences predicted to encode copper radical oxidases (cro1 through cro6). Moreover, three highly homologous genes, designated cro3, cro4 and cro5, were uncovered within the lignin peroxidase gene cluster on scaffold 85. The position of these new genes strongly suggests a relationship between genomic organization and the proposed dependency25 between lignin peroxidases and copper radical oxidases. Additional GLOX-like sequences were detected on scaffolds 46, 72 and 120, apparently unlinked to any peroxidases (Supplementary Table 1 online).

Beyond copper radical oxidases, extracellular FAD-dependent oxidases are likely candidates for generating H2O2, but such genes had not been previously characterized in P. chrysosporium. Genes encoding FAD oxidases in related white rot fungi include aryl alcohol oxidases (AAO) of P. eryngii (GenBank nos. AF064069, AF143814) and a pyranose oxidase from Coriolus versicolor (GenBank no. D73369). Until now, the only extracellular FAD oxidase sequence known for P. chrysosporium was within the oxidoreductase domain of cellobiose dehydrogenase, an enzyme implicated in both cellulose and lignin degradation18. Nevertheless, at least four distinct AAO-like sequences, a pyranose oxidase-like sequence and a glucose oxidase-like sequence have been identified in the genome data. The precise roles and interactions of these genes in lignin degradation remain to be determined26, but when viewed together with the copper radical oxidase genes, it is clear that P. chrysosporium possesses an impressive array of genes encoding extracellular oxidative enzymes.

In addition to extracellular peroxidase systems, laccases have been implicated in lignin degradation. These blue copper oxidases catalyze one-electron oxidation of phenolics, aromatic amines and other electron-rich substrates with the concomitant reduction of O2 to H2O (ref. 27). Genome searches revealed no conventional laccases. Instead, four multicopper oxidase (MCO) sequences are found clustered within a 25-kb segment on scaffold 56. SignalP identified a putative secretion signal in gene mco1, and subsequent analysis has shown that it encodes a ferroxidase-like protein28. Thus it appears that P. chrysosporium does not have the capacity to produce laccases, although distantly related multicopper oxidases may have a role in extracellular oxidations.

Degradation of cellulose and hemicellulose

In addition to lignin, P. chrysosporium completely degrades all major components of plant cell walls including cellulose and hemicellulose. The genome harbors the genetic information to encode more than 240 putative carbohydrate-active enzymes (ref. 29; http://afmb.cnrs-mrs.fr/CAZY/) including 166 glycoside hydrolases, 14 carbohydrate esterases and 57 glycosyltransferases, comprising at least 69 distinct families (Supplementary Table 1 online). A global correlation between the number of carbohydrate active enzymes and the total number of open reading frames in bacterial and eukaryotic genomes has been observed30. The number of glycoside hydrolases and glycosyltransferases predicted in the P. chrysosporium genome matches this correlation accurately. In other sequenced eukaryotic genomes (S. cerevisiae, S. pombe, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Homo sapiens), the overall number of glycosyltransferases exceeds, sometimes by a large factor, that of the glycoside hydrolases (B.H., unpublished observation), consistent with the greater relative importance of constructing rather than breaking down polysaccharides. In P. chrysosporium, the situation is reversed, pointing both to the development of a large repertoire of glycosidases (in accord with its lifestyle) and to a comparatively smaller catalog of glycosyltransferases (57 in P. chrysosporium, 61 in S. cerevisiae, 64 in S. pombe, over 140 in D. melanogaster, over 230 in C. elegans and H. sapiens, and over 410 in A. thaliana).

As in the case of the extracellular oxidases (above), many of the glycoside hydrolases appear within large families of closely related genes that may encode redundant (or partially overlapping) functions. Among the cellulases, endoglucanases are thought to hydrolyze cellulose at random positions in less crystalline regions whereas exocellobiohydrolases act processively on the chains to liberate the disaccharide cellobiose. Lastly, β-glucosidases (cellobiases) cleave cellobiose to yield glucose. The complement of cellulases in P. chrysosporium includes at least 40 putative endoglucanases, seven exocellobiohydrolases, and at least nine β-glucosidases (Supplementary Table 1 online). The list of putative endoglucanases comprises five glycoside hydrolase families (GH5, GH9, GH12, GH61 and GH74). Scanning the genome for exocellobiohydrolase genes revealed only those that had been previously described, encompassing six members of family GH7 (CBH1 isozymes) and a single member of type GH6 (CBH2). Multiple β-glucosidase genes that code for two enzymes in family GH1 and seven members of GH3 were also found. Unlike the genes for ligninolytic enzymes, the cellulase genes of P. chrysosporium do not appear to be tightly clustered in the genome, with the noted exception of the previously known case of neighboring cel7A and cel7B genes31.

In addition to the enzymes responsible for hydrolysis of cellulose, numerous other polysaccharide-degrading enzymes are predicted in the P. chrysosporium genome, including catalytic activities for degradation of hemicellulose, starch and glycogen, mutan, chitin and β-glucans (Supplementary Note and Supplementary Table 1 online).

Secondary metabolism

Examination of the genome suggests potential for production of an array of biologically active compounds (Supplementary Note online). Among these are numerous putative polyketide synthases and nonribosomal peptide synthases (Supplementary Table 4 online). In addition, a minimum of 148 cytochrome P450 sequences representing 12 cytochrome P450 families32 were observed (Supplementary Table 2 and Supplementary Fig. 1 online).

Mating type loci

Two multigenic mating type loci, A and B, were identified in the P. chrysosporium genome. The Aα locus is similar to orthologs in other homobasidiomycetes33,34,35 in that it features two divergently transcribed homeodomain-encoding genes immediately adjacent to the gene encoding a mitochondrial intermediate peptidase (mip) on scaffold 7. In addition, five sequences were identified with similarity to the pheromone receptor genes of mating type B loci. Three receptors are clustered within a 12-kb region on scaffold 110, and these are most similar to Coprinus cinereus rcb2 and rcb336,37. Members of the same receptor family were also identified on two separate scaffolds (255 and 88), both of which were similar to C. cinereus rcb3. Mating type loci are typically associated with fleshy fruiting bodies and believed to govern the fusion of compatible homokaryons, the migration of nuclei, and the formation of a morphological structure known as the clamp connection. P. chrysosporium does not form clamp connections and the sexual basidiospores are formed in a simple, resupinate layer on the substrate. Thus, the conservation of mating type genes in P. chrysosporium was unexpected.

Discussion

It has been proposed that colonization of land by eukaryotes was facilitated by a symbiotic partnership between a photosynthetic organism and a fungus, and that this relationship had far-reaching effects on climate and atmosphere, possibly contributing to the rapid evolution of animals in the Precambrian38. The oldest fossil evidence for land plants and fungi suggests their appearance 480–460 million years ago; however, molecular clock estimates indicate an earlier colonization of land at about 600 million years ago. Divergence of the major fungal taxa Basidiomycotina and Ascomycotina occurred during the Paleozoic period (550 million years ago)11. The ascomycetes include the sequenced yeasts S. cerevisiae and S. pombe, and the filamentous fungus N. crassa. Draft genomes of other filamentous fungi such as Aspergillus nidulans, C. cinereus, Ustilago maydis, Fusarium graminearum and Magnaporthe grisea are in various stages of completion.

Our objective was to generate high quality draft coverage of the genomic DNA sequence for the white rot fungus P. chrysosporium, a filamentous basidiomycete. In addition to the lignin-degrading white rot fungi, basidiomycetes include commercially important species (mushrooms), mycorrhizae, as well as pathogens of plants (smuts, rusts) and animals. Providing further insight into this large and diverse phylum, the genomes of the human pathogen, Filobasidiella neoformans (= Cryptococcus neoformans) (http://www-sequence.stanford.edu/group/C.neoformans/index.html), the maize pathogen Ustilago maydis (http://www-genome.wi.mit.edu/seq/fgi/candidates.html), soybean pathogens Phakopsora pachyrhizi and Phakopsora meibomiae (J. Boore, Joint Genome Institute, personal communication) and an inky cap mushroom C. cinereus (http://www-genome.wi.mit.edu/seq/fgi/candidates.html), will soon be available. Comparative analyses of these genomes will undoubtedly provide valuable information about the genetic features that distinguish pathogenic and saprophytic lifestyles among the basidiomycetes. In addition, comprehensive comparisons with the forthcoming genome data from other fungal taxa may yield important clues about their origins and influences on other organisms in the tree of life.

Fungi are the only eukaryotic organisms that digest their food components extracellularly through a process that typically involves the secretion of degradative and/or oxidative enzymes before absorption of the nutrients. As the primary degraders of lignin in wood, white rot fungi play an important role in the global carbon cycle. P. chrysosporium has been the most intensively studied species and is capable of efficient degradation of all wood components through the production of an array of oxidative and hydrolytic enzymes. Extensive genetic diversity was observed within complex gene families encoding peroxidases, oxidases, glycosyl hydrolases and cytochrome P450s. Reasons for multiple genes in the white rot fungus are largely unknown, but their presence may reflect that multiple specificities are essential for the effective hydrolysis of these complex wood polymers. The redundant activities might suggest a need for biochemical diversity for optimum growth under varying environmental conditions (e.g., temperature, pH, ionic strength), or simply that similar but not identical enzyme functions are necessary to effectively break down these complex carbohydrate polymers, whose structure, physical state and accessibility vary widely depending on the botanical source or the extent of decay. The occurrence of P. chrysosporium gene families with closely related members is in clear contrast to the genome of N. crassa, in which the repeat-induced point mutation system appears to have restricted the sizes of these gene families.

With the P. chrysosporium genome in hand, we are poised to achieve a deeper understanding of the processes by which white rot fungi colonize wood, interact with other organisms in their ecosystem and perform a vital function in the carbon cycle. Large-scale processes for the hydrolysis of plant cell-wall polysaccharides may one day expand the utility of plant biomass for fuels and biochemicals through industrial fermentation. Modification or degradation of specific carbohydrate components in wood is especially attractive to the textile, fuel and paper industries. Determining the choreography of expression and secretion of the oxidative and hydrolytic enzymes and their individual and collective contributions in the breakdown of lignocellulose will be greatly facilitated by the availability of the white rot fungus genome data.

Methods

Genome sequencing.

Genomic DNA was purified from a homokaryotic strain RP-78 (ref. 39; available from USDA Forest Mycology Center) and was sheared to give an approximate fragment size of 3 kbp. The DNA fragments were blunt-end repaired and then size selected on an agarose gel. Four principal small insert (3–4 kbp) libraries were generated by blunt-end ligation into pUC18. These libraries were end-sequenced with dye terminator chemistry using standard m13-40 and m13-28 primers40. Details of the libraries generated are presented in Table 2. Approximately 15% of the sequence was mitochondrial contaminant, which (after initial detection as a large 115 kb circular contig) was assembled separately to produce a finished mitochondrial sequence, which will be analyzed in depth elsewhere (J.C., N.P., D.R. unpublished data). After vector, mitochondrial and quality screening, 619,803 end sequences, representing approximately 9.75× coverage, were produced.
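As a rough consistency check on the coverage figure, fold coverage can be approximated as (number of reads × mean read length) / genome size. The mean quality-trimmed read length is not stated here, so the Python sketch below solves for the value implied by the reported numbers, taking the 29.9-Mbp assembled contig length as the genome size. The function names are illustrative, and the simple formula ignores reads that fail to assemble.

```python
def fold_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Approximate sequence coverage as total read bases over genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


def implied_read_length(n_reads, coverage, genome_size_bp):
    """Mean read length (bp) that would give the stated coverage."""
    return coverage * genome_size_bp / n_reads


if __name__ == "__main__":
    n_reads = 619_803       # end sequences retained after screening
    coverage = 9.75         # approximate fold coverage reported above
    genome_size = 29.9e6    # net length of assembled contigs, in bp

    read_len = implied_read_length(n_reads, coverage, genome_size)
    print(f"implied mean read length: {read_len:.0f} bp")
    # Round trip: plugging the implied length back in recovers the coverage.
    print(f"coverage check: {fold_coverage(n_reads, read_len, genome_size):.2f}x")
```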

Table 2 Sequence summary by library

Genome assembly.

These paired sequence fragments were assembled using the JAZZ suite of assembly tools (ref. 41; J.C., N.P., D.R. unpublished data), yielding a high quality draft assembly. Ninety percent of the assembled sequence was reconstructed in 165 scaffolds (414 contigs); half of the assembled sequence was arranged in 45 scaffolds longer than 203 kbp and 109 contigs longer than 79 kbp. The fidelity of the assembly is supported by the high degree (96%) of plasmid-end pairs preserved in contigs and scaffolds, as well as by end sequences from a sampling of cosmids. The net length of assembled contigs totaled 29.9 Mbp and included substantial numbers of repetitive elements. Completeness of the assembly with regard to coding region was supported by analysis of 1,390 unique ESTs derived from colonized wood. Approximately 98% of these ESTs as well as all 39 previously known P. chrysosporium genes in GenBank were recovered. Part of this group was a cluster of linked genes that notably includes tandemly repeated copies of lignin peroxidases, which we would have expected to be challenging to assemble. Based on an analysis of high confidence gene models (see below), the rate of short inserts and deletions in coding sequence is expected to be less than one per sixteen kilobases, providing an excellent substrate for annotation.
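The contiguity figures quoted above (the number of scaffolds or contigs needed to cover half, or 90%, of the assembled sequence, and the length of the shortest one in that set) correspond to what are commonly called N50- and N90-style statistics. A minimal Python sketch of the calculation follows; the function name and the toy lengths are illustrative and are not taken from the JAZZ output.

```python
def nx_stats(lengths, fraction=0.5):
    """Return (count, shortest_length) for the largest sequences that
    together cover at least `fraction` of the total assembled length."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count, length
    raise ValueError("lengths must be a non-empty list of positive numbers")


if __name__ == "__main__":
    toy_scaffolds_kbp = [400, 300, 150, 100, 50]  # hypothetical lengths
    print(nx_stats(toy_scaffolds_kbp, 0.5))  # half of the sequence
    print(nx_stats(toy_scaffolds_kbp, 0.9))  # ninety percent of the sequence
```

Applied to the real scaffold lengths, a call with `fraction=0.5` would be expected to return the pair reported above (45 scaffolds, shortest about 203 kbp), provided the same definition was used.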

To further corroborate the long-range structure of the assembled genome we end-sequenced 888 clones from a pWE15 cosmid library with average inserts of 40 kb (refs. 10, 31; 1.5× clone coverage). This analysis confirmed that 785 (88%) of the end-sequences from clones successfully sequenced at both ends were found in opposing orientations within 3 s.d. of each other or split between scaffolds with each read pointing outward from the end of a scaffold, consistent with new linking information not found in the short insert data. In the absence of a large-scale physical or genetic mapping effort (as is available for many model systems), we used genetic methods20 to test the assembly on the longest length scales. Ten pairs of polymorphic markers were identified from opposite ends of long scaffolds generated by an earlier assembly (Supplementary Table 5 online). Scoring of recombinant progeny confirmed that these markers were indeed linked as predicted.

Gene identification.

The genome assembly was compared to all known proteins in GenBank (release 131) at low stringency using ungapped BLASTX42 (Blosum62, score > 30), with significant hits indicating potential exons. Alignments were parsed to derive one or more 'optimal' colinear sets of hits for each protein (S. Rash, Joint Genome Institute, personal communication) and the best-scoring putative homolog was submitted with the surrounding genomic sequence to GeneWise43, which predicts gene structure based on homology, recognizing splicing signals and the potential for short insertions and deletions that occur in draft sequence.
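The parsing step that derives a colinear set of hits is not described in detail, and the script used is unpublished. One common way to pose the problem is as a highest-scoring chain: choose hits whose genomic and protein coordinates both increase without overlap, maximizing the summed score. The Python sketch below illustrates that general technique with a simple quadratic-time dynamic program; the `Hit` fields, the function name and the no-overlap rule are assumptions made for illustration and should not be read as the method actually used.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    g_start: int   # start on the genomic sequence (same strand for all hits)
    g_end: int     # end on the genomic sequence
    p_start: int   # start on the protein
    p_end: int     # end on the protein
    score: float   # alignment score of the hit


def best_colinear_chain(hits: List[Hit]) -> List[Hit]:
    """Return the highest-scoring set of hits that is colinear in both
    genomic and protein coordinates (no overlaps allowed)."""
    if not hits:
        return []
    ordered = sorted(hits, key=lambda h: (h.g_start, h.p_start))
    best = [h.score for h in ordered]   # best chain score ending at each hit
    prev = [-1] * len(ordered)          # back-pointers for chain recovery
    for j, hj in enumerate(ordered):
        for i in range(j):
            hi = ordered[i]
            # hi may precede hj only if it ends before hj starts on both axes
            if hi.g_end < hj.g_start and hi.p_end < hj.p_start:
                if best[i] + hj.score > best[j]:
                    best[j] = best[i] + hj.score
                    prev[j] = i
    j = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(ordered[j])
        j = prev[j]
    return chain[::-1]


if __name__ == "__main__":
    toy_hits = [                      # hypothetical coordinates and scores
        Hit(100, 200, 1, 33, 50),
        Hit(300, 400, 34, 66, 40),
        Hit(250, 350, 10, 40, 45),    # conflicts with its neighbors
        Hit(500, 600, 67, 100, 60),
    ]
    for h in best_colinear_chain(toy_hits):
        print(h)
```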

The resulting homology-based gene models exhibit a low rate of single-base insertions and deletions (395 models out of 9,638 (4%), or one per 16,000 bases of coding sequence) corroborating the accuracy of the assembled draft sequence. These models were often incomplete, though they appeared to possess accurate intron-exon boundaries compared with known P. chrysosporium genes. To obtain a more complete set of gene models from this phylogenetically distant fungus, the homology-derived transcripts were submitted as “faux” ESTs (along with 1,134 real ESTs) to GrailEXP13, an ab initio gene finder that incorporates expressed sequence information and has tunable modules for the detection of splice boundaries and other sequence features that could be bootstrapped from the homology models. The outputs of GeneWise and GrailEXP were post-processed to select the best gene model at each locus. To identify tRNAs we used tRNAscan-SE version 1.23 (ref. 44). Accuracy of modeling was assessed as described previously45,46 for 19 full-length cDNA sequences. The correlation coefficient, sensitivity and specificity, all measures of predictive accuracy, were 0.73, 0.75 and 0.96, respectively (Supplementary Table 6 online). Manually curated models appearing on the browser are designated with the prefix “ug” (user gene).
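The three accuracy measures reported above are not defined in this section. The Python sketch below assumes the nucleotide-level definitions commonly used to evaluate gene finders: sensitivity = TP/(TP+FN), specificity = TP/(TP+FP) (the gene-finding convention, which differs from the statistical definition of specificity), and a correlation coefficient computed from the 2×2 table of coding versus noncoding positions. The function name and toy coordinates are illustrative.

```python
import math


def nucleotide_accuracy(predicted, actual, region_length):
    """Return (sensitivity, specificity, correlation coefficient).

    predicted, actual -- iterables of positions called coding by the
                         prediction and by the reference cDNA, respectively
    region_length     -- total number of positions evaluated
    Assumes both sets are non-empty and do not cover the whole region.
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)          # coding in both
    fp = len(predicted - actual)          # predicted coding, actually not
    fn = len(actual - predicted)          # coding but missed
    tn = region_length - tp - fp - fn     # noncoding in both
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom
    return sensitivity, specificity, cc


if __name__ == "__main__":
    # Toy example: a 100-bp region, true exon at 10..49, predicted at 20..49.
    sn, sp, cc = nucleotide_accuracy(range(20, 50), range(10, 50), 100)
    print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

Under these definitions, a high specificity paired with a lower sensitivity, as reported above, would mean that predicted coding bases are mostly correct while some true coding bases are missed.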

Genome data availability.

The annotated genome is available on an interactive web portal at http://www.jgi.doe.gov/whiterot/. The genome sequence has been deposited and assigned GenBank accession number AADS00000000.

Note: Supplementary information is available on the Nature Biotechnology website.