Discovery of genes coding for carbohydrate-active enzyme by metagenomic analysis of lignocellulosic biomasses

In this study, a high-throughput sequencing approach was applied to discover novel biocatalysts for lignocellulose hydrolysis from three dedicated energy crops, Arundo donax, Eucalyptus camaldulensis and Populus nigra, after natural biodegradation. The microbiomes of the three lignocellulosic biomasses were dominated by bacterial species (approximately 90%) with the highest representation by the Streptomyces genus both in the total microbial community composition and in the microbial diversity related to GH families of predicted ORFs. Moreover, the functional clustering of the predicted ORFs showed a prevalence of poorly characterized genes, suggesting these lignocellulosic biomasses are potential sources of as yet unknown genes. 1.2%, 0.6% and 3.4% of the total ORFs detected in A. donax, E. camaldulensis and P. nigra, respectively, were putative Carbohydrate-Active Enzymes (CAZymes). Interestingly, the glycoside hydrolases abundance in P. nigra (1.8%) was higher than that detected in the other biomasses investigated in this study. Moreover, a high percentage of (hemi)cellulases with different activities and accessory enzymes (mannanases, polygalacturonases and feruloyl esterases) was detected, confirming that the three analyzed samples were a reservoir of diversified biocatalysts required for an effective lignocellulose saccharification.

auxiliary enzymes acting towards recalcitrant highly crystalline cellulose by a non-hydrolytic mechanism, such as lytic polysaccharides monooxygenases, are needed to enhance the fermentable sugars yield 8 .
Although different combinations of processes for conversion of dedicated energy crops and waste materials into fermentable sugars have been widely studied [9][10][11][12][13][14][15] , the saccharification step is still the main bottleneck in the biorefinery 16 due to the high costs of the enzyme production and the need for biocatalysts that are efficient and stable at the operative conditions 17 . Therefore, the discovery of novel biocatalysts that could satisfy these criteria is one of the main challenges to overcome this bottleneck. At present, the most advanced researches exploit metagenomes, namely genomic DNAs extracted directly from different environments 18 , bypassing the need for culture under laboratory conditions and avoiding the restrictions related to in vitro techniques. Two different methods can be used to screen the metagenomes. The function-driven strategy is performed by a biological activity-screening of expression libraries 18 . The sequence-driven approach is based on the direct sequencing of all genetic material from a target environment and on the homology analysis in comparison with sequences already present in the databases 18 . The increasing number of works focusing on the study of microbiota from guts of wood-eating insects 19 , cow 20 , green-waste compost 21 shows the relevance of the research for new lignocellulolytic microorganisms and enzymes. At the present, among natural environments, decaying lignocellulosic materials could represent an important reservoir of novel genes encoding enzymes involved in (hemi)cellulose degradation, necessary for the development of eco-compatible and economically favorable industrial processes. In a previous study 22 , new multifunctional degrading bacteria that were potential producers of multiple enzymes that have synergistic actions on cellulose and hemicellulose were isolated and selected from lignocellulosic biomasses using a cultural-dependent approach.
Therefore, in the present work, a sequence-driven metagenomic approach was applied to the three dedicated lignocellulosic energy crops Arundo donax, Eucalyptus camaldulensis and Populus nigra after natural biodegradation to identify candidate genes coding for enzymes that may be of use in lignocellulose hydrolysis. Moreover, metagenomic DNA sequences were also analysed to assess the complex microbial community structure and taxonomic diversity of the analyzed biomasses and to evaluate the microbial diversity related to GH families of predicted ORFs.
This study provides high-quality results for the identification of sequences coding for enzymes involved in breakdown, biosynthesis or modification of complex carbohydrates such as lignocellulosic biomass.
The data obtained in this work indicate that the investigated feedstock represent a source of biocatalysts potentially suitable for industrial applications to enhance the conversion of lignocellulosic crops into fermentable sugars.

Results
Data Statistics. The microbiota of three different lignocellulosic biomasses were analysed by Illumina sequencing of the metagenomic DNAs. A total of 11,208,388,400, 11,274,127,600 and 2,392,000 raw reads for A. donax, E. camaldulensis and P. nigra, respectively, were obtained. Sequence reads accounting for around 10.0 Gb, for A. donax and E. camaldulensis samples, and 2 Gb, for P. nigra, were selected ( Table 1).
Microbial community composition of lignocellulosic biomasses. The reads were compared against sequences in the NCBI NR database and the results processed by MEGAN version 4.70.4 to determine the composition of the microbial communities. The three lignocellulosic biomass samples were shown dominated by Proteobacteria and Actinobacteria. These phyla together accounted for approximately 87.5%, 87.2% and 89.4% of the total biodiversity in A. donax, E. camaldulensis and P. nigra, respectively (Table 2). In P. nigra biomass, Firmicutes, and in particular Bacilli, were detected at a high incidence (approximately 10%). A low percentage of reads matched fungal species in A. donax and E. camaldulensis (2.7% and 5.3%, respectively) ( Table 2).
The relative abundances of microbial taxa were examined at the level of genera to determine the dominant taxa within the bacterial communities degrading biomass from the different investigated plant species. The composition of prokaryotic and eukaryotic subpopulations within the biomass were also separately assessed and presented below.
As in the P. nigra biomass, Streptomyces was the taxa that heavily dominated the microbial community in A. donax and E. camaldulensis (35.0% and 47.7%, respectively), followed by Pseudomonas, Agrobacterium, Xanthomonas, Pantoea and Stenotrophomonas. The relative abundance of these taxa was very variable showing a percentage ranging approximately from 1% to 6.5%, depending on lignocellulosic plant species (Fig. 1).
The relative abundances of fungal taxa accounted for 2.2%, 4.9% and 0.2% of the total biodiversity in A. donax, E. camaldulensis and P. nigra, respectively (Fig. 1). In detail, the incidence of all fungal genera identified in A. donax and P. nigra biomass was < 1%; while in E. camaldulensis biomass, Penicillium strongly dominated the eukaryotic biodiversity showing a relative abundance of 3.2% (Fig. 1). eggNOG and KEGG functional profiling of lignocellulosic biomass. With the aim to investigate the functional diversity in the three samples, the predicted amino acid sequences were also aligned to the databases Evolutionary genealogy of Genes non-supervised orthologous groups-eggNOG-and Kyoto Encyclopedia of Genes and Genomes-KEGG-by using BLAST.
As shown in Fig. 2, the data revealed a prevalence of poorly characterized genes belonging to S (function unknown) or R (general function prediction only) eggNOG category. Moreover, for all three samples, a high  Table 2. Relative abundance of dominant taxa at the phylum and class rank mapping the high quality reads to the NT database (NCBI). Only taxa with an incidence ≥1% in each sample are shown.     Table 3. CAZYmes classification of predicted ORFs from T3ADSB, T3ESB and T3PSB sample. * The total numbers of CAZYmes is less than the sum (AAs + CBMs + CEs + GHs + GTs + PLs) due to the fact that some multimodular predicted proteins were detected. percentage (~38%) of genes matching to non-supervised orthologous groups were classified involving in metabolism (categories C, E, F, G, H, I, P, Q) with ~8% of genes related to the carbohydrate transport and metabolism. As shown in Fig. 3, although the majority of predicted ORFs were related to the membrane transport, this analysis confirmed that many genes matching to KEGG database (~12%) originated from pathways involved in the carbohydrate metabolism. Inventory of the detected Carbohydrate-Active Enzymes families and putative plant-polysaccharides-targeting Glycoside Hydrolases. In order to identify putative genes and enzymes involved in breakdown, biosynthesis or modification of carbohydrates, the total predicted ORFs in the three investigated biomass samples were compared to the entries of the Carbohydrate-Active Enzymes (CAZymes) database. A total of 1792, 1279 and 2113 putative CAZymes were identified in the samples T3ADSB (from A. donax after 135 days of natural biodegradation in underwood), T3ESB (from E. camaldulensis after 135 days of natural biodegradation in underwood) and T3PSB (from P. nigra after 135 days of natural biodegradation in underwood) respectively, corresponding to 1.2%, 0.6% and 3.4% of the total ORFs (Table 3). A high relative abundance (25-26%) of predicted CAZymes was reported belonging to glycosyltransferases (GTs) families and involved in forming glycosidic bonds for the biosynthesis of di-, oligo-and polysaccharides. A less amount of Carbohydrate Esterases-CEs-(~5-7%), Polysaccharide Lyases-PLs-(~1-3%) and Auxiliary Activities-AAs-(~2-4%) enzymes were detected in the three samples. Moreover, ORFs coding for putative Carbohydrate-binding
The microbial diversity of the ORFs predicted to encode GHs from the three lignocellulosic biomasses was also investigated to identify the bacterial and fungal genera encoding enzymes involved in the carbohydrate metabolism. The microbial biodiversity related to GHs was very high and twenty-six bacterial and forty-two fungal genera were recovered with an incidence ≥ 1% in at least one sample (Fig. 5). Streptomyces was a dominant genus in all samples accounting for 18.1%, 28.3% and 30.0% in A. donax, E. camaldulensis and P. nigra, respectively, of the microbial genera related to GH (Fig. 5A).
Unlike the other biomasses in which Streptomyces was the dominant taxon, in P. nigra the most abundant GHs were related to Paenibacillus (30.38%) (Fig. 5A). Pseudomonas and Rhizobium were the other genera recovered in all lignocellulosic biomasses in relationship to the GHs (Fig. 5A) showing an abundance ranging from 1.1% to 6.3% depending on plant species.
In this study, the GHs also originated from a wide range of fungal taxa. Among the forty-two genera occurring with an abundance ≥ 1% in at least one sample, only Pestalotiopsis was recovered in all lignocellulosic biomasses (with an incidence of 2.7%, 2.0% and 12.50 in A. donax, E. camaldulensis and P. nigra, respectively) (Fig. 5B).
Finally, the lowest fungal diversity was found in P. nigra biomass. The most abundant taxa recovered in this plant sample was Togninia (37.5%) followed by Batrachochytrium (25.5%) and Pestalotiopsis, Meyerozyma and Ustilago (12.5%) (Fig. 5B). However, although this result seemed suggest that the fungal taxa were abundant, overall very few GHs were related to them because only the 0.2% of total biodiversity was determined by fungi in this sample (Fig. 1).

KEGG pathway classification related to Glycoside
Hydrolases. An in-depth KEGG pathway mapping was carried out for the putative genes coding for plant polysaccharides-degrading enzymes in order to obtain a specific, unique activity for each detected GH. As shown in Fig. 6, a high percentage of different cellulases were detected. In particular, β -glucosidases (EC 3.2.1.21, hydrolyzing cellobiose and other cellodextrins) and endo-1,4-β -glucanases (EC 3.2.1.4, performing the random internal hydrolysis of amorphous cellulose) were the most abundant putative enzymes involved in the hydrolysis of glycosidic bonds. In the samples from Arundo donax and Populus nigra, an abundance of chitinases (EC 3.2.1.14) was also detected (7.41% and 6.29% respectively). It is noteworthy that in all three samples putative genes coding for hemicellulases and accessory enzymes with a broad spectrum of activities were recognized. In particular, a high percentage of proteins involved in the degradation of (glucurono)(arabino)xylan-such as endoxylanases (E.C. 3.2.1.8) and β -xylosidases-and in the removal of arabinose-α -L-arabinofuranosidases (E.C. 3

Discussion
In the last decades, the increasing interest in the use of renewable sources for green energy and chemicals has strongly stimulated search for new biocatalysts from different ecosystems for lignocellulose conversion. Therefore, in this work, microbial and enzymatic diversities potentially relevant to the degradation of plant biomass into fermentable sugars were explored through metagenomic approach in three dedicated lignocellulosic energy crops, Arundo donax, Eucalyptus camaldulensis and Populus nigra, after natural biodegradation 22 . Metagenomic DNA sequences were analysed to assess the total biodiversity, identify candidate genes coding for enzymes putatively involved in carbohydrates metabolism and that may be of use in lignocellulosic degradation, and evaluate microbial diversity related to GH families of predicted ORFs.
The microbial diversity results from this study were performed on the same samples previously characterised using 16S phylotyping in our earlier study 22 with samples T3ADSB, T3ESB and T3PSB corresponding to samples At3UW, Et3UW and Pt3UW in that publication. Some taxa differed sharply in composition, e.g. Actinobacterial content of 40.1% vs 8.6% when T3ADSB and At3UW were compared. The substantial differences could be due to the different molecular methods adopted by Ventorino et al. 22 for sequencing in comparison to those ones used in this study (amplicon sequencing of the 16S rRNA gene vs shotgun metagenomic sequencing) as well as to the different methods used for microbial DNA extraction. In fact, in the present work eDNA was extracted directly from lignocellulosic biomass samples, whereas Ventorino et al. 22 extracted DNA from pellets obtained from microbial cells desorbed from lignocellulosic materials. This approach could determine an underrepresentation of filamentous bacteria, and in general of relative abundance of Actinobacteria, in amplicon data reported in the previous work. Discrepancies between different approaches to quantifying the taxonomic composition of microbiomes are a known phenomenon. According to Morgan et al. 25 the relative abundances of microbial taxa inferred from metagenomic sequences significantly varied depending on the DNA extraction and sequencing protocols utilized. Recently, Duncan et al. 26 revealed that shotgun metagenomics detected a much higher abundance of Actinobacteria than amplicon sequencing.
Nevertheless, Actinobacteria were significant components of biomass in both studies. The prevalence of the actinobacterial genus Streptomyces could be due to the ability to synthetize enzymes, such as cellulases 27,28 , which efficiently degrade lignocellulosic materials under a wide range of environmental conditions 29 . Actinobacteria, and in particular, Streptomyces spp. were found to be major plant biomass degrading microbes in peat swamp forests and also ubiquituously present during the composting of chestnut green waste 30,31 .
Bacterial species belonging to Proteobacteria phylum, such as Pseudomonas spp. and Stenotrophomonas spp., were also retrieved in all lignocellulosic samples. Bacteria belonging to these genera are known to be able to produce a wide range of enzymes for efficient degradation of carboxymethylcellulose, (hemi)cellulose and lignin 32,33 . These results are in according with previous study in which culture-independent approach based on 16S rRNA gene sequence demonstrated that Proteobacteria was the taxa that heavily dominated the microbial community in different lignocellulosic biomass piles, remaining high during all degradation processes in natural conditions 22 . Moreover, Actinobacteria and Proteobacteria have been identified as the predominant bacterial phyla during composting of lignocellulosic waste exhibiting the enzymatic activities required for the degradation of this recalcitrant polymeric material 34 .
The occurrence of other bacterial taxa with a different abundance depending on plant species was also demonstrated in the investigated lignocellulosic biomasses. Interestingly, Bacillus genus covered approximately 8.0% of the total microbial biodiversity in P. nigra. Members belonging to Bacillus spp. isolated from different environments exhibit cellulolytic and/or hemicellulolytic activities to potentially breakdown the components of lignocellulosic material [35][36][37][38] . Moreover, different microbial strains belonging to Enterobacteriaceae family such as Pantoea, Rahnella and Erwinia, are frequently recovered in the gut of insects producing digestive enzymes implicated in the hydrolysis of cellulose 39,40 .
Moreover, a low abundance of eukaryotic populations was observed in all the lignocellulosic biomass samples. This result could be due to the fact that fungi have tough chitin walls that are difficult to breach. In fact, since fungal community patterns could be strongly dependent on the extraction method used 41 , their representation in this work could be depressed. However, among the fungal taxa retrieved, only Penicillium showed an incidence > 1% in E. camaldulensis. Cellulolytic activity of this genus is well documented and there are several reports on β -glucosidase, cellulases and xylanases production from different Penicillium species [42][43][44] . Moreover, Ryckeboer et al. 45 reported also the ability of Penicillium spp. to degrade lignin and starch making it a good candidate in the producing of industrial cellulases 46 .
Analysing the biodiversity related to GH families of predicted ORFs, a highly complex microbial community was found. With regard to bacterial biodiversity, Streptomyces, Pseudomonas and Rhizobium were found in all lignocellulosic biomass samples. In agreement with the results obtained analysing the total biodiversity, Scientific RepoRts | 7:42623 | DOI: 10.1038/srep42623 Streptomyces was the dominant taxon, confirming the ability of the members belonging to this genus to encode enzymes involved in cellulose and hemicellulose degradation. In fact, Streptomyces spp. is reported to produce different GHs that are well characterized 47,48 . In addition, the production of cellulolytic enzymes in Rhizobium spp. is related to their ability to nodulate leguminous plants. In fact, Rhizobium is a plant growth promoting rhizobacterium living as free-living saprophytes in the soil but also able to fix nitrogen establishing a symbiotic associations with a host plant 49 . The production of enzymes, such as cellulases, is fundamental to degrade plant cell wall polymers and penetrate in the host root 50 . García-Fraile et al. 51 reported the ability to actively hydrolyse CM-cellulose of two bacterial strains isolated from decaying wood of Populus alba and classified as Rhizobium cellulosilyticum.
The prokaryotic biodiversity related to GHs was also dominated by Paenibacillus genus in the P. nigra biomass. Eida et al. 52 reported the ability of different Paenibacillus isolates to efficiently contribute to cellulolytic and hemicellulolytic processes during composting of sawdust. Other taxa recovered in the P. nigra biomass that are known as plant biomass-degrading microbes were Stenotrophomonas and Xanthomonas (Proteobacteria) and Bacillus (Firmicutes). De Angelis et al. 17 reported that the members of Proteobacteria as well as Firmicutes strongly dominated switchgrass-adapted communities comprising approximately 80% of the microbial richness.
Differently, in A. donax biomass the most of GHs was encoded by genera belonging to the class of Actinobacteria. These taxa are related to well characterized potent plant polysaccharide-degrading bacteria and play an important role in degradation of numerous polymers such as chitin, cellulose, lignin and polyphenol 53 .
With regard to fungal biodiversity related to GHs, diverse genera were found, and among these only Pestalotiopsis was recovered in all lignocellulosic biomasses. This result is in agreement with Cahyani et al. 54 that reported the ubiquitous presence of Pestalotiopsis spp. during the composting process of rice straw. In fact, this endophytic fungus is able to secrete xylanases and cellulases also in salt stress conditions 55 as well as produce a considerable amount of ligninolytic enzymes such as laccase 56 .
However, Sporothrix, Fusarium, Nectria and Trichoderma dominated the eukaryotic biodiversity related to GHs in A. donax and E. camaldulensis biomasses. These Ascomycota are known for their ability to produce cellulolytic enzymes 57,58 and comprise many species involved in the degradation of recalcitrant substances such as cellulose, hemicellulose, pectin, and lignin 59 . Jurado et al. 60 reported that fungi belonged to Ascomycota group were ubiquitous throughout the whole lignocellulose-based composting process.
The functional clustering of the predicted ORFs to eggNOG and KEGG databases showed high similarity among the three analyzed samples.
The prevalence of poorly characterized genes obtained by matching to eggNOG categories suggested the three detected biomasses as potential sources of not yet known genes. Moreover, the analysis of functional classification distribution among these three metagenomes, based on both the eggNOG and KEGG database, suggests that a large number of predicted genes were putatively associated with formation, breakdown and interconversion of polysaccharides. In particular, the relative abundance of genes linked to carbohydrates metabolism pathway was higher than or similar to that detected in metagenomes from samples with well-known lignocellulose-degrading ability, such as invasive snail crop microbiome 61 and lower termite Coptotermes gestroi gut 62 . This result confirmed the high potentiality of the three analyzed metagenomes to express genes involved in lignocellulosic biomasses biotransformation.
Moreover, the inventory of the Carbohydrate-Active Enzymes families detected in the three samples interestingly revealed ORFs codifying for putative lytic polysaccharide monooxygenases (LPMOs). Nowadays, the interest is moving towards the LPMOs belonging to AA9 (formerly reported as GH61), AA11 or AA10 (formerly reported as CMB33) families, due to their ability to depolymerize the recalcitrant insoluble polysaccharides from highly crystalline cellulose, increasing the efficiency of lignocellulose saccharification 8 . Only a few of LPMOs have been discovered by metagenomic approach 18 . In metagenomes analyzed in this study, 3, 5 and 9 ORFs (for T3ADSB, T3ESB and T3PSB respectively) were assigned to family AA10, whereas only in the sequenced eDNA from A. donax 11 and 2 ORFs encoding putative enzymes belonging respectively to families AA9 and to AA11 were detected.
However, most of CAZymes detected in the three samples were related to putative plant-polysaccharidestargeting GHs. Based on the results obtained by Li et al. 63 analyzing 46 finished metagenomic studies collected in Genomes OnLine Database (GOLD) by comparison against the CAZy sequences for homologues of glycosyl hydrolases using an e-value < 10 −40 as a cut-off threshold, the percentages of detected GHs in our study were higher than those present in metagenomic samples from soil, sludge and marine or lake environments. Furthermore, the diversity of GH family enzymes detected in the three samples was greater than that observed in insect or mammalian fecal and gut samples with high lignocellulose-degrading potentiality 64 , in line with the detected high phylogenetic diversity.
The putative genes encoding proteins involved in the degradation of plant polysaccharides were detected in the three samples. Moreover, accepted that the obtained data are sensitive to the bioinformatics workflow used in the different studies, a comparison between the GHs detected in our samples and in metagenomes well known as reservoirs of genes involved in lignocellulose-degradation was attempted (Table 5), based on the classification provided by Allgaier et al. 65 . The detection and assignment of glycoside hydrolases in our metagenomes and bovine rumen metagenome 66 were performed by BLAST-based procedures against the CAZy database, whilst the searches for glycoside hydrolases in metagenomes from six years old elephant feces 64 , yak 67 and cow rumen 66 , snail crop 61 , macropod gut 68 and termite hindgut 19 were performed by using HMMER hmmsearch with Pfam. The putative ORFs encoding enzymes related to the oligosaccharides degradation represented the majority of the total plant-polysaccharides-targeting GHs and their abundance (~26% for T3ADSB, ~22% for T3ESB and ~24% for T3PSB) was comparable to that detected in samples from cow rumen 20 and termite hindgut 19 . Most belonged to GH1, GH2 and GH3 families including β -glucosidases, β -galactosidases, β -mannosidase, β -glucuronidase, β -xylosidase and other enzymes involved in the breakdown of a large variety of β -linked disaccharides. Due to the high diversity of protein structural arrangements, a robust phylogenetic classification of these families is currently not available. In addition, enzymes belonging to GH43 family were highly represented (mainly in T3ADSB and T3PSB). This family includes β -xylosidases and α -L-arabinofuranosidases and several bifunctional enzymes; moreover, due to a remarkable expansion in GH43 family resulting from novel studies about plant cell wall degrading organisms, members of this family may have a more extensive range of specificities 69 .
In the sample T3ADSB, the abundance of endocellulases was double than T3ESB and T3PSB and comparable to that detected in the six-years-old elephant feces by Ilmberger et al. 64 and in yak rumen by Dai et al. 67 . The GH5 and GH6 were the most represented families. While only endoglucanase and cellobiohydrolase activities have been reported for the members of GH6 family, the enzymes belonging to Glycoside Hydrolases family 5 have a variety of specificities: this is one of the largest of all CAZy glycoside hydrolase families comprising not only cellulases, such as endo-and exo-glucanases and β -glucosidases, but even hemicellulases, such as endo-and exo-mannanases and β -mannosidase. Interestingly, in T3ADSB an amount of enzymes belonging to GH7 family (that includes mainly enzymes from fungi) was detected, although in this sample only a small amount of fungi was identified. The cellobiohydrolases belonging to GH7 family are the most active exoglucanases known 70 .
The abundance of hemicellulases detected in the three investigated samples was comparable with the percentage occurred in bovine rumen 66 and macropod gut 68 . In T3ADSB, more that 1% of CAZymes belonged to GH10 family. These enzymes have received much attention for their use in degradation of lignocellulosic biomass for biochemicals production, due to their involvement in breaking down of xylan, the major component of the hemicellulose. Moreover, in the three samples a percentage of 1-2% of enzymes belonging to Glycoside Hydrolases family 28 was identified. These CAZymes are involved in the degradation of pectin, a structural constituent of the plant cell wall.
About 1% of the debranching enzymes detected in the three samples belonged to family GH51: this percentage was higher than that detected in yak and cow rumen 20,67 and snail crop 61 . Moreover, the samples T3ESB and T3PSB revealed an abundance of family GH67 members. The enzymes belonging to these two families (α -L-arabinofuranosidases and α -glucuronidases respectively) are required for the optimal breakdown of glucoronoarabinoxylans (GAXs), one of the major component of hemicellulose, composed by β (1-4)-D-xylose linked polymers branched with arabinose and glucuronic acid. Interestingly, in the samples from Eucalyptus camaldulensis and Populus nigra 2.5% and 0.6% respectively of total GHs belonged to GH78 α -L-rhamnosidases. These enzymes catalyze the hydrolysis of α -L-rhamnosyl-linkages in L-rhamnosides present in polysaccharides such as rhamnogalacturonan.
Furthermore, the in-depth KEGG pathway mapping of the genes encoding enzymes involved in the polysaccharides hydrolysis confirmed that all three analyzed samples were a valuable source of a full set of diversified (hemi)cellulases and accessory enzymes required for an effective pretreated lignocellulosic biomass hydrolysis 71,72 .

Methods
Lignocellulosic biomasses and DNA extraction. Chipped wood from A. donax, E. camaldulensis and P. nigra was used to form piles of approximately 30 kg that were submitted to biodegradation under natural conditions as previously reported 22  Metagenome shotgun sequencing and assembly. Three qualified 270 bp short-insert libraries were constructed from the eDNA samples. The genetic material was firstly sheared into smaller fragments by nebulization. Then the overhangs resulting from fragmentation were converted into blunt ends by using T4 DNA polymerase, Klenow Fragment and T4 Polynucleotide Kinase. An "A" base was added to the 3′ phosphorylated blunt ends of the DNA fragments and the adapters were ligated. Undersized fragments were removed with Agencourt AMPure XP Beads (Beckman Coulter Inc, Brea, CA, USA). The libraries were then subjected to 151 paired-end sequencing on Illumina HiSeq2000 platform by using TruSeq SBS Kit v3-HS (Illumina, San Diego, CA, USA) following standard pipelines. The generated raw data were trimmed: leading or trailing low quality (below quality 3) or 3 N bases were cut off and reads contaminated by adapter (15 bases overlapped by reads and adapter) or with low quality (20) bases (40% as default, parameter setting at 36 bp) were removed. The data were filtered by using readfq.v5 (unpublished software, BGI).
The obtained Clean Data were used to perform the metagenome sequences. Before assembly, k-mer analysis (K-mer length 15) was done to evaluate the sequencing depth for each sample. SOAPdenovo (Version 1.06) 73 was used to assemble filtered data in contigs and scaffolds and assembly results were optimized by in-house scripts (key parameters: -r 2; -l 35; -M 4; -p 1) using the SOAP-aligner tool.

Metagenome analyses.
To evaluate the microbial composition, the assembled contigs were matched against the bacteria, fungi and archaea sequences extracted from NCBI NR database (release-20130408) by BLASTx with 1 × 10 −8 and ≥ 90% identity cut-off. Each contig was subsequently taxonomically assigned by MEGAN version 4.70.4 74 , based on lowest common ancestor (LCA). The taxonomic abundance was determined by read count of each taxon, after mapping to the assembled contigs using SOAPaligner version 2.21 75 62 . For each GH family, the total number and the percentage on total GHs detected in the respective sample were shown. a Searches for glycoside hydrolases were performed by using HMMER hmmsearch with Pfam_Is HMMs (full-length models) to identify complete matches to the family, which were named in accordance with the CAZy nomenclature scheme. b The detection and assignment of glycoside hydrolases were performed by BLAST-based procedures against the CAZy database.
default parameters. Assembled contigs are used to predict genes by using MetaGeneMark Software 76 (version 2.10, default parameters) based on assembly results. Functional annotations of predicted amino acid sequences were performed by BGI Tech Solutions Co., Ltd. (Hongkong, China) by using BLASTP (version 2.2.23). In particular, the metabolism pathway assignment of the predicted protein was performed using the Enzyme Commission (EC) number in the Kyoto Encyclopedia of Genes and Genomes (KEGG)-version 59-databases 77 and the annotation of each contig with functional categories was carried out by matching against Evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG)-version 3.0 78 . Both comparison were performed by using BLAST 79 with e-value threshold of 1e-5 and a 40% minimum percentage of identity to assign the subject sequence to a specific function family. Moreover, in order to explore in depth the ability of the microbial biodiversity detected in the samples to degrade lignocellulose, the putative encoded protein sequences were first compared to the full length sequences of the CAZy database using BLAST 75 and query sequences that produced a e-value > 10 −6 were discarded. Query sequences that produced an e-value < 10 −6 and aligned over their entire length with a protein in the database with > 50% identity were automatically assigned to the same family as the subject sequence. The remaining query sequences were subjected to manual curation which involved BLAST searches against a library built with partial sequences corresponding to individual GH, PL, CE and CBM modules and examination of the conservation of specific family patterns and features such as catalytic residues (where known).