Structural Features of a Bacteroidetes-Affiliated Cellulase Linked with a Polysaccharide Utilization Locus

Previous gene-centric analysis of a cow rumen metagenome revealed the first potentially cellulolytic polysaccharide utilization locus, of which the main catalytic enzyme (AC2aCel5A) was identified as a glycoside hydrolase (GH) family 5 endo-cellulase. Here we present the 1.8 Å three-dimensional structure of AC2aCel5A, and characterization of its enzymatic activities. The enzyme possesses the archetypical (β/α)8-barrel found throughout the GH5 family, and contains the two strictly conserved catalytic glutamates located at the C-terminal ends of β-strands 4 and 7. The enzyme is active on insoluble cellulose and acts exclusively on linear β-(1,4)-linked glucans. Co-crystallization of a catalytically inactive mutant with substrate yielded a 2.4 Å structure showing cellotriose bound in the −3 to −1 subsites. Additional electron density was observed between Trp178 and Trp254, two residues that form a hydrophobic “clamp”, potentially interacting with sugars at the +1 and +2 subsites. The enzyme’s active-site cleft was narrower compared to the closest structural relatives, which in contrast to AC2aCel5A, are also active on xylans, mannans and/or xyloglucans. Interestingly, the structure and function of this enzyme seem adapted to less-substituted substrates such as cellulose, presumably due to the insufficient space to accommodate the side-chains of branched glucans in the active-site cleft.

the cow rumen 7 . A PUL-based mechanism 8-10 relies on cellulose binding proteins and cellulases which are tethered to the outer membrane of the bacterium. Since the AC2a PUL is the first cellulose-targeting PUL described 7 , further characterization of its cellulases and their activity on crystalline cellulose is of interest. We have studied one of these cellulases, a member of the glycoside hydrolase family 5 (GH5), referred to as AC2aCel5A.
Enzymes classified as GH5 in the Carbohydrate Active Enzymes database (CAZy) are functionally diverse with 20 experimentally verified activities on various polysaccharides, including xyloglucans, mannans, mixed-linkage β -glucans and cellulose (β -1,4 linked glucose) 11 . GH5s can be further divided into subfamilies based on their sequence similarity and enzyme specificities 12 . AC2aCel5A is affiliated with subfamily 4, a polyspecific family which typically includes extracellular bacterial enzymes that exhibit one or more activities categorized as endoglucanase, xyloglucan-specific endoglucanase, xylanase, and licheninase. GH5 enzymes share a common (β /α ) 8 -barrel fold, which contains two conserved catalytic glutamic acid residues at the C-terminal ends of the fourth and seventh β -strands. As endo-cellulases are important constituents in commercial cellulase mixtures, the discovery and characterization of additional cellulolytic GH5 enzymes from previously non-accessible biodiversity may prove fruitful for the development of better biomass conversion technology.
Here, we present the three dimensional structure and biochemical characterization of AC2aCel5A. The PUL-derived GH5 cellulase was discovered in the genome of an uncultured Bacteroidetes phylotype reconstructed from a cow rumen metagenome, which was previously assigned to the order Bacteroidales by genome-wide alignment against the NCBI 13,14 . The gene was synthesized for expression in Escherichia coli with a C-terminal His 6 -tag for ease of purification. The native structure was solved to 1.8 Å resolution and revealed a narrow substrate-cleft, likely determining the specificity of the enzyme for linear and less substituted substrates.

Results
Structural features of AC2aCel5A. Wild-type AC2aCel5A contains a predicted N-terminal signal peptide, which was removed to simplify over-expression of the 405-residue recombinant protein, including mature AC2aCel5A (397 residues) followed by an eight residue C-terminal His 6 -tag. The structure of AC2aCel5A was solved to 1.8 Å, and contained two molecules in the asymmetric unit. The final model comprised residues 9-397; no electron density was observed for the nine N-terminal residues nor for the final eight C-terminal residues, including the His 6 -tag. The data collection and refinement parameters are summarized in Table 1. The enzyme displays the typical (β /α ) 8 barrel structure associated with the GH5 family (Fig. 1a). This canonical (β /α ) 8 fold is preceded by an extra alpha-helix (Fig. 1a,α0), as was also observed in its close structural homologues, cellulase EngD from Clostridium cellulovorans (PDB id 3NDY) 15 and CelAcd from Piromyces rhizinflata (PDB id 3AYR) 16 . Additionally, the structure includes six short helices, one in the N-terminus and five in the loop regions between the canonical strands and helices ( Fig. 1a; α 1′ -α 6′ ).
The six closest structural homologues, which include all the biochemically characterized members of subfamily GH5_4, were identified using the Dali server 17 and were used to produce a structure-based multiple sequence alignment using the PROMALS3D server 18 (Supplementary Fig. S1). This allowed identification of the conserved catalytic residues, i.e. Glu172 and Glu303, acting as the catalytic acid and nucleophile/base, respectively. Six other strictly conserved residues in the catalytic centers of GH5s 15 were also identified (Fig. 1b). A surface projection shows the narrow groove stretching across the enzyme, with the active site located in its center (Fig. 1c-d). The roles of Glu172 and Glu303 as catalytic residues were confirmed by site-directed mutagenesis experiments, showing that AC2aCel5A_E172A and AC2aCel5A_E303A were both inactive in the standard carboxymethylcellulose (CMC) assay.
Cellotriose complex. Co-crystallization of the E172A mutant with cellotetraose resulted in the complex structure solved to 2.4 Å resolution, with two molecules in the asymmetric unit. Additional electron density was observed in the active-site cleft of molecule A, and was refined as a cellotriose molecule bound to the − 3 to − 1 subsites ( Fig. 2a-b). No electron-density for a bound ligand could be observed in the B molecule, which also lacked electron density for several main-chain residues, namely 34-45 and 112-127. These residues correspond to the loop regions between β 1 and α 1, and between β 3 and α 2' , respectively, and have been omitted from the final model. Even though the ligand density appears to fit a cellotriose molecule, it is likely that a cellotetraose molecule is present in the complex as the enzyme only produces dimers from cellotetraose 7 . The observed cellotriose density could reflect three glucose moieties of a tetramer, where the fourth moiety could be positioned in the putative − 4 position and be disordered. This is supported by the observation that the electron density of the -3 bound moiety is incomplete towards O4 (Fig. 2b).
Furthermore, in both the apo and the complex structures, extra electron density was observed close to, and between, Trp178 and Trp254 in the substrate-binding cleft of both chains A and B (Fig. 2c). The electron density was insufficient to allow meaningful interpretation, but did not appear to resemble any of the components found in the cryoprotectant or mother liquor. In GH5 complexes where either a tetramer or pentamer ligand is bound across the catalytic site, a tryptophan corresponding to Trp178 of AC2aCel5A always interacts with the sugar moiety bound in the +1 subsite and in some cases also with the sugar in the +2 subsite (PDB ids: 4HU0, 3AZT 19 , 2CKR, 1H5V, 1ECE 20 , 3QHN 21 and 4OOZ 22 ). The aromatic residue corresponding to Trp254 in AC2aCel5A is much less conserved; its role most commonly appears to be taken by tyrosine (2CKR, 1ECE, 3QHN), whereas proline (4HU0), phenylalanine (3AZT) and histidine (4OOZ) also occur in a similar position of the active-site cleft. 1H5V lacks a residue corresponding to Trp254. While the role of the Trp178-Trp254 "clamp" in ligand binding seems obvious, it seems unlikely that the density, observed in both the apo and the ligand structures, reflects substrate. It is worth noting that the C-terminus of an adjacent B molecule is located in close proximity to the active-site cleft of molecule A, and vice versa. It is therefore likely that the electron density in the Trp-Trp "clamp", as well as the additional density seen outside the clamp, reflects a non-specific interaction of the C-terminal His6-tag, which protrudes towards the active-site cleft.
LigPlot+ 23 was used to investigate the interactions of the enzyme with the cellotriose molecule, and the estimated interactions are shown in Fig. 2d. The glucose moiety bound in the −1 subsite is stabilized by hydrogen bonding from its O2 group to the catalytic Glu303 residue, as well as to the strictly conserved Asn171. O3 interacts with His111, as well as with Asn30, Asp33 and His112 via a water molecule, HOH25. The same water also interacts with O5 of the glucose bound in the −2 subsite. The O2 of the latter sugar has hydrogen-bond type interactions with Trp339, Asn341 and Asp349; Asp349 also interacts with the O3 together with Asn30. Unlike the sugars bound in the −2 and −1 subsites, the sugar bound in the −3 subsite forms no hydrogen bonds with the protein. Its binding seems to be mediated by stacking interactions with Trp45. Other hydrophobic and van der Waals contacts contributing to ligand binding involve the strictly conserved His249 and Tyr251, which interact with the −1 glucose moiety, and Phe115, which interacts with the −2 sugar. Superposition of the native structure on the cellotriose complex placed a carboxyl oxygen of the catalytic acid, Glu172, at 1.7 Å from O1 of the glucose bound in the −1 subsite.

Biochemical characterization.
As previously reported 7 , AC2aCel5A is an endoglucanase active on β- (1,4) glycosidic bonds between glucose units in soluble and insoluble cellulosic substrates and in linear β -glucans. Further characterization and quantification of enzyme activities revealed high activities on linear β-(1,4) glucans, whereas the enzyme displayed only trace activity on soluble tamarind xyloglucan and no activity could be detected on solubilized Birchwood xylan ( Table 2). Using the standard CMC assay, the optimum temperature of the enzyme was found to be 40 °C, correlating with the 38-40 °C environment of the cow rumen 24 (Fig. 3). However, the enzyme showed activity over a relatively broad temperature range, retaining close to 60% of maximum activity between 20 °C and 50 °C. Assays with different buffer systems of various pH values also showed a broad activity range. Maximum activity was observed at pH 5.0, while over 60% of maximum activity was maintained between pH 4.5 and 8.0.
Comparison to structural homologues. The six closest structural homologues of AC2aCel5A previously identified with the Dali server were subjected to further investigation (Table 2). The selected structures had root-mean-squared deviations (R.m.s.d.) of less than 1.9 Å when compared with AC2aCel5A. Two of the homologues have been characterized as xyloglucanases (PDB ids: 2JEP and 3ZMR), whereas the four others are identified as cellulases (PDB ids: 3NDY, 4IM4, 1EDG and 3AYR) (Table 2). The xyloglucanase homologues are reported as strictly specific for xyloglucan, whereas the cellulase-type homologues generally seem more promiscuous in their substrate specificities. For example, EngD (PDB id 3NDY) has comparable activity on xyloglucan (1,6-xylose substituted β-1,4 glucan), CMC and barley β-glucan, and also showed activity on birchwood xylan (β-1,4 xylose substituted with 10.2% hexuronic acids 25 ). CelCCa (PDB id 1EDG) from Clostridium cellulolyticum and CelAcd (PDB id 3AYR) from Piromyces rhizinflata were not tested on xyloglucan, but showed activities on insoluble xylan from Sigma, and soluble oat spelt xylan (substituted with arabinose (9%), glucose (7%), galactose (1%) 25 ), respectively. Conversely, AC2aCel5A is specific for unsubstituted β-1,4 glucans and CMC (cellulose substituted with relatively small carboxymethyl groups), with only trace activity on xyloglucan and no detectable activity on birchwood xylan. A comparison of the active-site clefts of the GH5s shows that AC2aCel5A has a narrower active-site cleft than its homologues, especially in the vicinity of the −3 to −1 subsites (Fig. 4). This is primarily caused by the loop region between β3 and α3' (red surface), which protrudes into the substrate-binding cleft and which is extended in AC2aCel5A compared to the examined homologues (Supplementary Fig. S1, residues 115-126). The figure also indicates the aromatic residues, corresponding to the putative +1 site of AC2aCel5A, that are structurally conserved in these close homologues (yellow surface). Fig. 5a shows a detailed view of the differences between the surface projections of AC2aCel5A (black mesh) and the cellulase CelAcd (blue), highlighting that this homologue has a wider active-site cleft. The picture further shows that the active-site cleft of AC2aCel5A is far too narrow to accommodate a xyloglucan tetramer due to the extended loop between β3 and α3' (red backbone and mesh), whereas CelAcd has enough space to accommodate this branched oligosaccharide. Fig. 5b shows that a true xyloglucanase such as XG5 from Paenibacillus pabuli also has a wider cleft than AC2aCel5A, but that this cleft is narrower than that of the promiscuous CelAcd. In the XG5 xyloglucanase, the cleft seems more optimally shaped to harbour the xyloglucan fragment, providing many tight interactions.
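The R.m.s.d. values quoted for the structural homologues summarize how far equivalent Cα atoms lie from each other after superposition. As a minimal sketch of that arithmetic only, the following assumes two lists of already superposed and already paired Cα coordinates; the function name and the toy coordinates are hypothetical, and the superposition and residue pairing performed by the Dali server are not reproduced here.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """Root-mean-square deviation (in Å) between paired, pre-superposed Cα atoms.

    coords_a and coords_b are equal-length sequences of (x, y, z) tuples.
    The pairing and the superposition are assumed to have been done elsewhere.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy coordinates (hypothetical, not taken from any PDB entry).
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
toy_rmsd = ca_rmsd(model_a, model_b)
```

With real structures, the coordinate lists would be extracted from the aligned residues reported by the structural alignment, so the value depends on which residue pairs are included.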

Discussion
AC2aCel5A is an endo-acting cellulase that is encoded within the first reported cellulolytic PUL, originating from an uncultured Bacteroidetes phylotype. Biochemical data presented in this study demonstrate that optimal conditions for AC2aCel5A activity correspond well with the cow rumen environment. AC2aCel5A was evolutionarily distinct compared to its nearest structural homologues (30-34% sequence identity, Table 2), whereas its nearest sequence homologue was a putative GH5 recovered from a goat rumen metagenome (51% sequence identity, accession number: AIF26005). According to the CAZy subfamily classification proposed by Aspeborg et al. 12 , AC2aCel5A and all of the examined homologues (Table 2) belong to subfamily GH5_4. Among these subfamily GH5_4 enzymes, the deeply branched AC2aCel5A is dissimilar in that it shows no detectable activity on xylan and only trace activity on xyloglucan. The interactions between cellotriose and the −3 to −1 subsites of AC2aCel5A are similar to the interactions observed in enzyme-substrate complexes of EngD 15 (PDB id 3NDZ) and CelAcd 16 (PDB id 3AYS). Nonetheless, AC2aCel5A is unique in that it has an exceptionally narrow active-site cleft, due to an extended loop region, that seems well adapted to (only) acting on less-substituted β-glucans.
Table 2. Structural comparison of AC2aCel5A with its six closest structural homologues identified using the DALI server 17 , along with reported enzyme activities. RMSD: root-mean-square deviation of C-alpha atoms. The structural homologues are sorted based on the Z-score obtained in the DALI search. One Unit of enzyme activity was defined as the amount of enzyme releasing 1 μmol of reducing sugar equivalents per minute. "nd" means not detected, whereas a hyphen, "-", indicates "not tested". a: μmol reducing sugar equivalents calculated as μmol cellotriose + μmol cellobiose + μmol glucose, quantified by HPAEC-PAD 7 .
AC2aCel5A contains a tryptophan "clamp" in its putative +1 and +2 subsites, analogous to what has been observed in other glycoside hydrolases acting on insoluble polysaccharides 26,27 . This aromatic pair is not conserved in all GH5s (see text above), but all of the examined structural homologues except the Bacteroides ovatus xyloglucanase, BoGH5A (PDB id 3ZMR), do contain two structurally conserved aromatic residues that form a similar substrate-binding "clamp" (Fig. 4). Although the complex structure of AC2aCel5A does not provide direct new insight into the interaction between the two aromatic residues and the substrate, the presence of electron density in the "clamp" in both the apo and complex structures suggests a potential for strong stacking interactions, and likely indicates the enzyme's putative +1 (and +2) subsites.
This study presents new insights into the biology of an as-yet uncultured Bacteroidetes-affiliated phylotype (AC2a). Collectively, the AC2aCel5A enzyme structure and activity data suggest specificity towards linear and less substituted β-(1,4)-linked glucans, and strengthen previous hypotheses that the PUL encoding AC2aCel5A is cellulolytic. Ongoing studies directed towards elucidating the structures and functions of other proteins encoded by this PUL (e.g. SusD and SusE cellulose-binding proteins) will enhance our understanding of its polysaccharide degradation potential.

Methods
Protein discovery and expression. The GH5 enzyme (AC2aCel5A) was identified and produced as previously described 7 . In brief, the gene was synthesized without its predicted signal peptide (residues 1-18), cloned into the pNIC-CH vector 28 and over-expressed in Escherichia coli. The C-terminally His6-tagged protein was purified to near homogeneity by immobilized metal affinity chromatography, and the purity was assessed by SDS-PAGE. The protein concentration was estimated by A280 using the molar extinction coefficient calculated from the protein sequence. The protein was stable at concentrations around 20 mg/ml in 20 mM Tris-HCl pH 8 with 0.2 M NaCl, at 4 °C, for several months.
[Figure caption fragment: The xyloglucan tetramer binds to subsites −1 to −4.]
Site-directed mutagenesis. The QuikChange II site-directed mutagenesis kit (Agilent) was used to mutate the catalytic glutamic acid residues (E172 and E303) to alanines, using the following primers: E172AF, GGT TTT TGA AAC CCT GAA TGC AAT TCA GGA TGG TGA TTG GG; E172AR, CCC AAT CAC CAT CCT GAA TTG CAT TCA GGG TTT CAA AAA CC; E303AF, GCC GGT TTA TGG TGC ATT TGG TGC CGT TCG; E303AR, ACG AAC GGC ACC AAA TGC ACC AAA ATA AAC CGG.
Crystallization, diffraction data collection and structure determination. Native crystals were produced by screening sitting drop vapor diffusion conditions using the JCSG+ 1 kit (Molecular Dimensions; Altamonte Springs, FL, USA), with a reservoir volume of 100 μL, and a drop comprising 0.5 μL well solution + 0.5 μL purified protein solution (10.2 mg/ml in 20 mM Tris-HCl pH 8.0, 0.2 M NaCl). Crystals grew overnight in several conditions, with the highest quality crystals forming in conditions consisting of 0.1 M sodium cacodylate pH 6.5, 40% v/v 2-methyl-2,4-pentanediol and 5% w/v PEG 8000. Crystals for the ligand-bound complex were obtained by co-crystallization of the catalytically inactive AC2aCel5A E172A mutant in conditions consisting of 0.2 M potassium nitrate and 20% w/v PEG 3350. The reservoir volume was 200 μL and the 2 μL drop contained a 1:1 mixture of well solution and protein solution (11 mg/ml protein in 20 mM Tris-HCl pH 8.0, 0.2 M NaCl, 5 mM cellotetraose). Crystals were mounted in loops and flash-frozen in liquid nitrogen. X-ray diffraction data were collected at the ID29 beamline at the European Synchrotron Radiation Facility (ESRF), to a resolution of 1.8 Å for the apo protein, and 2.1 Å for the complex. The apo data were processed using iMOSFLM 29 , Aimless 30 and tools in the CCP4i package 31 . The structure was solved by molecular replacement with Phaser 32 using a poly-alanine model of the structure of a GH5 family xyloglucanase from Paenibacillus pabuli 33 (PDB entry 2JEP). The initial model was built using the autobuild function of PHENIX 34 and further refined using PHENIX, RefMac5 35,36 and manual rebuilding in Coot 37 . The dataset for the complex was processed with the XDS package 38 and scaled using Scala 39 , cutting the data to 2.4 Å. The structure was solved by molecular replacement using the apo structure of AC2aCel5A with MolRep, and the structure was refined using RefMac5 and manual rebuilding in Coot. All protein structure figures were produced in PyMOL.
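Mutagenesis primers of this kind are typically designed as a complementary pair, so a quick consistency check is whether the reverse primer equals the reverse complement of the forward primer. The sketch below applies that check to the E172A pair exactly as printed under Site-directed mutagenesis; the function names are hypothetical, and the check assumes an unambiguous A/C/G/T alphabet. It says nothing about primer design rules beyond strand complementarity.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalize(seq):
    """Strip whitespace (the primers are printed in triplets) and upper-case."""
    return "".join(seq.split()).upper()


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return normalize(seq).translate(COMPLEMENT)[::-1]


def is_complementary_pair(forward, reverse):
    """True if the reverse primer is the exact reverse complement of the forward primer."""
    return reverse_complement(forward) == normalize(reverse)


# E172A primer pair as printed in the text.
E172A_F = "GGT TTT TGA AAC CCT GAA TGC AAT TCA GGA TGG TGA TTG GG"
E172A_R = "CCC AAT CAC CAT CCT GAA TTG CAT TCA GGG TTT CAA AAA CC"

e172a_pair_ok = is_complementary_pair(E172A_F, E172A_R)
```

The same function can be applied to any other primer pair; a mismatch would indicate either a transcription problem in the printed sequences or primers that were deliberately designed with an offset.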
Enzyme characterization. The substrate specificity of AC2aCel5A had previously been determined using Azurine-Crosslinked Polysaccharide (AZCL) substrates, along with activity on insoluble cellulose substrates 7 . In this study, further characterization of the enzymatic activity was performed using the soluble substrates carboxymethylcellulose (CMC) (Sigma-Aldrich), barley β-glucan (Megazyme), tamarind xyloglucan (Megazyme), lichenan (Sigma-Aldrich), and birchwood xylan (Carl Roth). The standard reaction using CMC contained 20 mM BisTris buffer pH 6.5, 25 nM enzyme, 20 mM CaCl2, and 10 mg/ml substrate in a total volume of 200 μL. Enzyme was added to pre-heated assay mixtures and the reactions were incubated at 40 °C, with 900 rpm vertical shaking. A 100 μL sample was taken after 10 minutes and added to 100 μL DNS reagent 40 . The amount of reducing ends released was determined as glucose equivalents using the DNS reducing-end assay and a glucose standard curve. Assays for activity on barley β-glucan, lichenan, xylan and xyloglucan were performed with 0.5% substrate (w/v), and 10, 10, 200 and 200 nM enzyme load, respectively. A Unit of enzyme activity was defined as the amount of enzyme releasing one μmol of glucose equivalents per minute.
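The Unit definition above can be turned into a short calculation. The following is a minimal sketch, assuming the amount of glucose equivalents released in the whole reaction has already been read from the glucose standard curve; the function names and the example value of 0.5 μmol are hypothetical and are not measurements from this study. Only the reaction time (10 minutes), enzyme concentration (25 nM) and reaction volume (200 μL) are taken from the standard CMC assay.

```python
import math


def enzyme_units(umol_glucose_equivalents, minutes):
    """Units = μmol of glucose equivalents released per minute."""
    if minutes <= 0:
        raise ValueError("reaction time must be positive")
    return umol_glucose_equivalents / minutes


def enzyme_amount_nmol(concentration_nM, volume_uL):
    """Amount of enzyme in the reaction: nM is nmol per litre, and 1 μL is 1e-6 L."""
    return concentration_nM * volume_uL * 1e-6


def units_per_nmol_enzyme(units, concentration_nM, volume_uL):
    """Activity normalized to the amount of enzyme present in the reaction."""
    return units / enzyme_amount_nmol(concentration_nM, volume_uL)


# Hypothetical example: 0.5 μmol glucose equivalents released in the 10 minute assay.
example_units = enzyme_units(0.5, 10)
# Standard CMC assay conditions from the text: 25 nM enzyme in 200 μL.
example_normalized = units_per_nmol_enzyme(example_units, 25, 200)
```

Converting to Units per mg of protein would additionally require the molecular mass of the recombinant enzyme, which is not stated in this section and is therefore left out of the sketch.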