Structure and functional implications of WYL-domain-containing transcription factor PafBC involved in the mycobacterial DNA damage response

In mycobacteria, the transcriptional activator PafBC is responsible for upregulating the majority of genes induced by DNA damage. Understanding the mechanism of PafBC activation is impeded by a lack of structural information on this transcription factor, which contains a widespread but poorly understood WYL domain frequently encountered in bacterial transcription factors. Here, we determined the crystal structure of Arthrobacter aurescens PafBC. The protein consists of two modules, each harboring an N-terminal helix-turn-helix DNA binding domain followed by a central WYL and a C-terminal extension (WCX) domain. The WYL domains exhibit Sm-folds, while the WCX domains adopt ferredoxin-like folds, both characteristic of RNA-binding proteins. Our results suggest a mechanism of regulation in which WYL domain-containing transcription factors may be activated by binding RNA molecules. Using an in vivo mutational screen in Mycobacterium smegmatis, we identify potential co-activator binding sites on PafBC.


Introduction
DNA damage represents a threat to the integrity of genetic information and is therefore counteracted in all organisms by an arsenal of DNA repair processes that are activated by specific DNA damage response pathways. Mycobacteria and many other actinobacteria employ two distinct yet interconnected pathways in order to upregulate the expression of specific sets of genes required for repair and survival of DNA damage.
The "SOS response", the canonical pathway described in most bacterial species, relies on cleavage and removal of LexA, a transcriptional repressor of DNA repair genes ("SOS genes") (reviewed in (Kreuzer, 2013;Maslowska et al., 2018)). Under normal conditions, LexA ensures low expression levels of the SOS genes by binding to a promoter motif called the "SOS box" (Little et al., 1981). Single-stranded DNA (ssDNA) fragments accumulating under DNA damage conditions serve as a DNA stress signal for the SOS response and are sensed by the ATPase RecA. RecA forms a filamentous complex with ssDNA that is able to induce autoproteolytic cleavage of the LexA repressor, leading to derepression of the SOS genes (Galletto et al., 2006;Little et al., 1980;Phizicky and Roberts, 1981). In Mycobacterium tuberculosis (Mtb), the LexA repressor controls about 20 genes (Davis et al., 2002a;Smollett et al., 2012).
In contrast, the second pathway regulates over 150 genes, including many of the LexA-controlled genes, such as recA. This predominant pathway operates independently of LexA and RecA, as demonstrated by deletion of recA in Mtb, which leaves upregulation of most DNA repair genes intact (Davis et al., 2002b;Rand et al., 2003). Different from the regulatory principle of derepression, these genes are regulated by transcriptional activation by the heterodimeric protein complex PafBC (Fudrini Olivencia et al., 2017;Müller et al., 2018). The complex consists of the close sequence homologs PafB and PafC (proteasome accessory factors B and C) that are encoded together in an operon that is tightly associated with the bacterial proteasome gene locus, suggesting a functional connection. Indeed, many DNA repair proteins are removed by proteasomal degradation after the DNA damage has been repaired, thereby helping to shut down the stress response and preventing negative impact of DNA-modifying activities under normal conditions (Müller et al., 2018).
PafBC activates its target genes via a promoter motif called RecA-NDp (RecA-independent promoter), which was demonstrated by in vivo identification of PafBC binding sites using cell culture cross-linking followed by immunoprecipitation of PafBC-DNA complexes (Müller et al., 2018). However, PafBC protein levels do not change in response to DNA stress (Fudrini Olivencia et al., 2017). Furthermore, specific interaction between PafBC and the identified DNA-target regions could not be reconstituted in vitro. Taken together, these results suggest that an additional "response-producing" event must take place to initiate PafBC transcription activation.
In order to establish the mechanistic principles employed by PafBC to activate transcription at a molecular level, understanding of the structural framework is crucial. Based on sequence similarity, PafBC belongs to a family of bacterial regulators featuring a winged helix-turn-helix (HTH) domain at the N-terminus, followed by a C-terminal domain of unknown function named WYL domain after a consecutive W-Y-L sequence motif. It has been suggested that the WYL domain might play the role of a ligand-binding domain in the context of this class of transcription factors. A handful of other WYL domain-containing proteins have been studied to date: (1) DriD, an SOS response-independent transcriptional activator of a cell division inhibitor protein in Caulobacter crescentus (Modell et al., 2014), (2) Sll7009, Sll7062, Sll7078, transcriptional repressors of CRISPR/Cas system mature crRNA in Synechocystis 6803 (Hein et al., 2013), (3) PIF1 helicase from Thermotoga elfii (Andis et al., 2018) and (4) WYL domain-containing proteins stimulating RNA cleavage by Cas13d in Eubacterium siraeum and Ruminococcus sp. (Yan et al., 2018). However, structural information on WYL domain-containing transcriptional regulators is missing, and evidence as to how they exert their functions mechanistically has remained elusive.
In this study, we determine the crystal structure of PafBC from Arthrobacter aurescens in its non-activated, DNA-free state. The structure reveals that the WYL domain exhibits an Sm-fold, commonly encountered in RNA-binding proteins, and is followed by an additional C-terminal extension (WCX) domain featuring a ferredoxin-like fold. Based on the structure of the PafBC WYL domain, we carry out a comprehensive computational analysis of WYL domain-containing proteins, and demonstrate that the WYL domain is a widespread feature of bacterial transcription factors present in almost all bacterial taxa. Our study shows that Sm-fold proteins are a much more frequent occurrence in bacteria than previously thought. Based on the high structural similarity of the WYL motif-containing domain to the bacterial RNA chaperone Hfq and the known binding sites of Hfq, we identify functionally essential residues in the WYL domain of PafBC, which are likely involved in binding of a response-producing ligand in this distinct class of transcriptional regulators.

Results
The crystal structure of PafBC in the non-activated state exhibits an asymmetric domain arrangement incompatible with DNA binding
In order to obtain information about the architecture of the PafBC class of transcriptional regulators, we set out to determine the crystal structure of PafBC. We carried out crystallization experiments using a range of PafBC orthologs from different actinobacterial organisms, also including PafBC proteins from organisms encoding a naturally fused PafBC complex (i.e. from Kocuria rhizophila, Thermobifida fusca, and Arthrobacter aurescens). Three-dimensional crystals suitable for data collection were ultimately obtained using an A. aurescens PafBC construct (Aau PafBCΔNC), which was shortened by 17 amino acids at the N-terminus and 7 amino acids at the C-terminus based on sequence alignment, since these residues are not conserved amongst the orthologs and not even present in most of them. Indeed, it is likely that the start site for the A. aurescens protein was misassigned, since a valine (encoded by GTG) is present at the position where the other PafBC proteins feature the conserved initiator methionine, and GTG is a frequent start codon in actinobacteria (Belinky et al., 2017;DeJesus et al., 2013) (Figure S1). Selenomethionine-labelled protein was used for crystallization and the structure was determined de novo by single-wavelength anomalous diffraction. Aau PafBCΔNC crystallized in space group P2₁2₁2₁ with two molecules in the asymmetric unit, and the structure was refined to 2.2 Å with an Rwork/Rfree of 20%/24% (Table 1). The structural model was built to near completion, encompassing 1279 residues out of 1328. The missing residues are all located in three poorly ordered loop regions.
Although Aau PafBC is a natural fusion protein and thus consists of a single polypeptide chain, for simplicity and easier comparison to the majority of actinobacteria encoding separate PafB/PafC proteins we will refer to the PafB- and PafC-corresponding parts as PafB or PafC, respectively.
Although most transcriptional regulators with an HTH domain feature an internal symmetry axis when bound to DNA (Jones et al., 1999), PafBC in our structure adopts an asymmetric conformation (Figure 1, S2 and S3). The asymmetric arrangement of PafBC is not surprising, since PafBC is not in complex with its consensus DNA binding site and its domain arrangement reflects the non-activated form of PafBC. This is in agreement with the observation that specific interaction between PafBC and the identified DNA-target regions takes place in vivo, but could not be reconstituted in vitro (Fudrini Olivencia et al., 2017;Müller et al., 2018), further supporting the notion that PafBC is in a non-activated state in the absence of the putative DNA-stress sensing ligand.
The PafBC structure features six distinct domains, three in each of the homologous PafB and PafC modules. Each module includes an N-terminal HTH domain followed by two other domains (Figure 1 and Figure S2). To distinguish between the homologous domains of each module, we refer to the domains belonging to the PafB module as helix-turn-helix (HTH-B), WYL (WYL-B) and C-terminal extension of the WYL (WCX-B) domains, while in PafC the corresponding domains are termed HTH-C, WYL-C and WCX-C, respectively. All individual domains are connected by long loops, which are particularly pronounced between the WYL and WCX domains, suggesting a high degree of flexibility and conformational adaptability of the PafBC complex.
Both HTH domains adopt a classical winged helix-turn-helix fold with a three-helix bundle consisting of helices H1, H2 and H3 followed by a two-stranded wing, a topology that is typical for many transcriptional regulators (Figure 2 and S4a). From structures of other winged HTH domains in complex with DNA, it is known that H3 usually mediates the specific DNA recognition by binding into the major groove, while the wing provides additional contacts in the minor groove (Clark et al., 1993;Rajagopalan et al., 2013;Wisedchaisri et al., 2007). The putative recognition helix of PafB features two highly conserved phenylalanines (F42 and F46) involved in forming the hydrophobic core of the three-helix bundle (Figure S4b). From the opposing surface-accessible side of the recognition helix, two strictly conserved arginine side chains project outwards (R44 and R48). These residues might be involved in sequence-specific DNA binding with guanine and cytosine bases. Notably, the putative recognition helix of PafC is shorter and comprises a different amino acid sequence that would suggest recognition of a non-palindromic DNA sequence, which is in accordance with the non-palindromic nature of the PafBC binding motif (RecA-NDp) (Gamulin et al., 2004;Müller et al., 2018). However, the most striking difference between the HTH domains in PafB and PafC is their location and accessibility in the solved complex structure.
The HTH domain of PafB (HTH-B) is accessible in the structure, since only helix H1 and a small part of H3 make hydrophobic contacts to the rest of the molecule (Figure S4b). In contrast, the HTH domain of PafC (HTH-C) interacts extensively with the other domains of the protein (Figure S2b and S4c-f). Importantly, the putative recognition helix (H3) appears wedged into the protein core, and following H3, a long loop extends around the back of the central helix contacting the central α-helix of PafB (α4) and the WYL-B domain (Figure 2b and S4c). Superposition of the HTH-B and HTH-C domains reveals a good agreement of the folds except for a much longer β1/β2-loop (wing) in HTH-B, while HTH-C displays a very short connection between β1 and β2 and instead features a long loop between H3 and β1 (Figure S4a). Sequence alignment-assisted comparison of the structural elements reveals that the β2 strands of both HTH domains occupy the same position relative to the helices, while the β1 position differs. This opens up the possibility that the wing of HTH-C has undergone a register shift to accommodate HTH-C in the protein core of the non-activated PafBC complex (Figure S5).
In all PafB and PafC homologs, the HTH domains are connected to the WYL domains via a sequence stretch of roughly 30 residues. This region forms a long single helix in the PafB module of our structure. The helix is located in the core of the PafBC structure, traversing the entire complex (Figure 1 and S3). In contrast, the equivalent region in PafC consists of three smaller helices forming a bundle between HTH-C and WYL-C (α4', α4'', α4''', Figure S2). The long central helix of PafB packs against helix H1 of the HTH domain in PafC (Figure S4c). The WYL domains of PafB and PafC are located at the perimeter on opposite sides of the complex.

The PafBC WYL domains exhibit an Sm-fold and are followed by a ferredoxin-like domain
The structure shows that the domain previously referred to as the "WYL domain" of PafB or PafC in fact consists of two separate domains, with only the first domain featuring the characteristic WYL sequence motif. In the context of this study we refer to this domain as the WYL domain and to the second domain as WCX domain. The PafBC WYL domains are located at the periphery of the molecule on opposite sides, while the WCX domains come together to form a dimeric interaction module.
Notably, in this interaction module the WCX domains are arranged in a two-fold pseudosymmetric manner, in spite of the overall asymmetric arrangement of the domains in the PafBC structure (Figure 3a and 3b). Each WCX domain harbors a four-stranded anti-parallel β-sheet of 4-1-3-2 topology framed by two short α-helices, which contain a hydrophobic core (Figure 3b, 3c left and S6a). The main chain sharply bends at a highly conserved cis-proline into a C-terminal α-helix, which crosses the C-terminal α-helix of the other WCX domain (Figure S6b). Interaction of the WCX domains arises through interdigitation of two pairs of helices. The contacts are stabilized by salt-bridges, hydrogen bonds and a conserved hydrophobic island containing a pair of highly conserved leucines. In fact, the majority of hydrogen bonding occurs at the ends of the crossed C-terminal α-helices, while high conservation among the interacting residues seems to be restricted to the two leucines (Figure S6b). The PafBC interaction via the WCX domains represents a strong element in the PafBC non-covalent interaction and is probably maintained also in the active DNA-binding form.
We carried out a comparative analysis based on the protein fold of the WYL and WCX domains using the Dali fold recognition program to discover structural homologues (Holm and Laakso, 2016). The search using the isolated WCX domain yielded a very broad spectrum of hits ranging from spliceosomal proteins, elongation factors, oxidoreductases to proteases. After further manual assessment of the individual hits, we discovered that the WCX domain follows a typical ferredoxin-like fold (βαββαβ) with an ancillary C-terminal α-helix. Interestingly, a number of RNA-binding proteins such as human hnRNP A1 (Figure 3c, middle), CRISPR/Cas protein Cse3 (Figure 3c, right) and ribosomal proteins also contain the ferredoxin-like fold and these domains are directly involved in RNA interaction (Figure S7).
The fold of the WYL domain consists of a five-stranded anti-parallel β-sheet with a 5-1-2-3-4 topology preceded by an α-helix (Figure 4a). The strands are strongly curved and the middle β2 strand is almost twice the length of the other four, causing it to arch back over itself and resulting in a β-sandwich topology, where β-strands 5-1-2 make up one half and strands 2-3-4 the other. The middle strand β2 participates in both and connects the two halves. The eponymous WYL residues are located in β3, with the highly conserved tyrosine pointing away from the hydrophobic core. Structure similarity searches using the isolated WYL domain on the Dali webserver (Holm and Laakso, 2016) returned PDB entries of proteins containing an Sm-fold, like the bacterial RNA chaperone Hfq (host factor for RNA bacteriophage Qβ replication) and certain spliceosomal proteins. Proteins containing an Sm-fold are very abundant in eukaryotes, while only few examples (amongst them Hfq) have been described in bacteria. Closer comparison of the WYL domain with Hfq shows that the WYL domain Sm-fold features a slightly longer N-terminal helix, longer β2 and β3 strands as well as longer loops between strands β1/β2 and β3/β4 (Figure 4a and b). Many proteins of the Sm-like family were shown or predicted to bind RNA. The structural similarity of the WYL domain to Hfq and other Sm-like proteins therefore suggests that the WYL domains provide a binding site for an RNA molecule.

The WYL domain Sm-2 loop contains essential residues for PafBC function
Previously, we showed that PafBC levels do not change under stress conditions and we could not detect any specific DNA binding activity towards the RecA-NDp motif in vitro (Fudrini Olivencia et al., 2017;Müller et al., 2018), suggesting that PafBC requires a co-activator for its activity, which is only present during stress conditions. The structural homology between the WYL domains of PafBC and Sm-folds involved in RNA binding indicates that the response-producing ligand might be an RNA molecule and the WYL domain could act as a ligand-sensing domain.
In order to deduce potential ligands and ligand binding locations from the homology between the PafBC WYL domains and the Sm-fold-containing bacterial RNA chaperone Hfq, we carried out a comparative analysis of potential binding regions. Hfq forms a homohexameric ring-shaped complex that was shown to bind RNA at three distinct sites (Figure 4b) (Updegrove et al., 2016): the proximal site exposing the α-helices and binding sRNA and mRNA (shown with ligand in Figure 4b); the distal site binding A-rich oligonucleotides; and the rim (also called lateral) site literally represented by the rim of the Hfq ring. There is also increasing evidence that the C-terminus is functionally involved in RNA binding (Santiago-Frangos et al., 2016;Vecerek et al., 2008). Given the structural similarity of PafBC's WYL domains to Hfq, we compared sequence conservation patterns in both proteins. Hfq exhibits a highly conserved loop in its Sm-2 motif, containing residues that contact the RNA backbone at the proximal binding site. Such a highly conserved loop is also found at the corresponding locations in the PafBC WYL domains, where two arginine side chains point into the direction where the RNA ligand is positioned in Hfq (Figure 4d and e). A potential ligand may be bound at this location and transduce the signal of DNA damage to PafBC, which in turn becomes activated to carry out its role as transcriptional activator.
To test the hypothesis that the conserved sequence stretch between strands β4 and β5 in the PafBC WYL domains has functional significance, we complemented the M. smegmatis ΔpafBC strain with PafBC mutants featuring amino acid substitutions at this location and assessed the viability of the mutants in the presence of the DNA damaging agent mitomycin C (MMC). We also chose residues at other sites in the WYL domain based on sequence conservation and structural similarity to Hfq. Specifically, we selected the conserved tyrosine that is part of the WYL triplet, another conserved tyrosine (sometimes histidine) in strand β1, and the conserved patch between β4 and β5 for mutation. The chosen residues were mutated to alanine and the mutations were introduced separately into PafB or PafC or into both proteins simultaneously. To reduce the permutation space, we decided to treat the two arginine residues in the β4/β5 loop as functionally redundant (i.e. they were simultaneously substituted with alanine). Since the HTH domain of PafC would not be able to bind DNA in the observed conformation (Figure 1 and 2b), we also deleted the HTH domains individually to establish if they are required for PafBC function.
To assess the viability of the PafBC mutant strains, the cells were first grown for a defined period of time in the presence of increasing concentrations of MMC. Subsequently, the dye resazurin was added, which is reduced by living (but not by dead) cells to resorufin, giving rise to a color change. Wild type M. smegmatis cells grow in the presence of up to 100 ng/ml MMC, while the ΔpafBC strain shows growth only up to about 8 ng/ml MMC, which is in agreement with the previously determined minimal inhibitory concentrations for these strains (Figure 5a) (Fudrini Olivencia et al., 2017).
Complementation of the pafBC knockout strain with wild type PafBC restores the viability to the level observed for the wild type. Deletion of either the HTH domain of PafB or PafC leads to the same reduced viability as observed for the ΔpafBC strain (Figure 5a), indicating that both domains are required for the function of PafBC. It has to be noted that, in contrast to ΔHTH-C, which expressed well, the expression of ΔHTH-B was barely detectable and may therefore not be sufficient for complementation (Figure 5g and 5h). Nevertheless, the complementation experiment demonstrates that the second HTH domain (HTH-C), which in our structure is in an inaccessible conformation for DNA-binding, is required for a fully functional PafBC complex.
We then tested the alanine point mutants for complementation. For most alanine-substitution mutants a pattern could be observed: if the mutation is present in only one of the WYL domains, the viability is only moderately affected, but in case both WYL domains carry the mutation, the effect seems to be additive and the viability of the cells is lowered to the level of the knockout strain. This is the case for the double arginine mutants (Figure 5c), the tyrosine of the WYL triplet (Figure 5d), and the phenylalanine of the β4/β5 loop. Mutation of the β1 histidine/tyrosine leads to a comparable result, except that the decrease in viability is milder if one of the WYL domains carries the mutations, and the effect is less severe than in the knockout strain if both WYL domains are mutated (Figure 5b). Furthermore, mutation of the serine/aspartate in the β4/β5 loop did not affect viability (Figure 5e).
Our results demonstrate that the conserved residues in the β4/β5 loop and β1 of the WYL domain are required for the function of PafBC, likely because they interact with a signal-transducing ligand. Furthermore, successive inactivation of the subunits has an additive effect, suggesting that there are two ligand binding sites present, one at each WYL domain. In summary, our observations show that both subunits of the PafBC complex are functional and both WYL domains are required for full viability.

The WYL domain is mainly associated with DNA-binding domains
Based on the structural analysis of the WYL domains in the PafBC complex and the fact that this domain also occurs in other bacterial regulators, we carried out a thorough bioinformatics analysis of WYL domain-containing proteins to understand in which functional context they occur and how widely distributed they are.
We computationally analyzed the co-occurrence of the WYL domain with other domains along with its taxonomic distribution using hidden Markov models (HMMs) (Eddy, 2011). HMMs are widely used for finding distant protein homologs and they provide the basis for one of the largest protein family databases, Pfam, which groups proteins containing the same domain into families. Our structural analysis has shown that the PafBC C-terminal part originally assigned as "WYL" domain as a whole, in fact consists of two domains, the actual WYL domain and a C-terminal extension (WCX) domain. Based on the domain boundaries of the WYL domains in our structure and sequence alignment with other PafBC orthologs, we generated a WYL domain HMM and used it to retrieve all WYL domain-containing proteins from the UniProt reference proteomes yielding 15'079 entries (Table S1). The resulting entries were distributed across 5'330 different species with only 81 sequences from 50 species among Eukaryota, Archaea or Viruses, which were mostly candidate species (Table S2). Thus, the WYL domain appears to be limited to bacteria and we restricted our subsequent analyses to bacterial sequences.
In order to identify domain families associated with WYL domain proteins, the retrieved WYL domain-containing bacterial sequences were annotated based on all Pfam HMMs and additional HMM profiles we generated for the WCX domain and the PafBC N-terminal winged HTH domain that was not recognized by any of the existing Pfam HMMs. Two observations can immediately be made from the final set of domain architecture classes (Figure 6a and S8b): First, the majority of classes, covering more than 90% of sequences, shows co-occurrence of the WYL domain with an HTH domain preceding it. Second, the WYL domain is primarily present together with a C-terminally located WCX domain, and only about 25% of sequences exhibit the WYL domain alone.
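The grouping of annotated sequences into domain architecture classes described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the input format (a mapping from a protein identifier to a list of domain hits, each given as a domain name and a start position), the function names and the example entries are all hypothetical.

```python
from collections import Counter

def architecture(hits):
    """Return the N- to C-terminal domain order as a string, e.g. 'HTH-WYL-WCX'.

    hits: list of (domain_name, start_position) tuples for one protein
    (hypothetical input format).
    """
    ordered = sorted(hits, key=lambda hit: hit[1])  # order domains by start position
    return "-".join(name for name, _ in ordered)

def tally_architectures(annotations):
    """Count how many proteins share each domain architecture."""
    return Counter(architecture(hits) for hits in annotations.values())

# Made-up example data, not taken from the study.
annotations = {
    "protA": [("WYL", 150), ("HTH", 5), ("WCX", 250)],
    "protB": [("HTH", 10), ("WYL", 140)],
    "protC": [("HTH", 3), ("WYL", 160), ("WCX", 240)],
    "protD": [("WYL", 20)],
}

counts = tally_architectures(annotations)
total = sum(counts.values())
# Fraction of sequences per architecture class.
fractions = {arch: n / total for arch, n in counts.items()}
```

Sorting the hits by start position before joining the names makes the class label independent of the order in which the HMM hits were reported.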
About two thirds of all sequences are found in class A, which also comprises all PafB and PafC sequences. The second largest group, class B (15% of all hits), contains proteins with only an HTH and a WYL domain, lacking the WCX domain. Notably, class C could also be viewed as a subgroup of class A, as it is made up of natural fusion proteins of actinobacterial PafB and PafC homologs. A significant number of sequences contain a Helicase C3 domain in combination with the WYL domain, but lacking the WCX domain.
The distribution of WYL domain-containing proteins among bacterial phyla reflects the distribution of these phyla in the reference proteomes, showing that WYL domains are ubiquitous among bacteria (Figure S8a). Interestingly, the gram-positive phyla of Actinobacteria and Firmicutes exhibit on average roughly 5.2 and 2.8 WYL domain-containing proteins per organism, respectively, while the gram-negative phyla of Proteobacteria and Bacteroidetes show only 2.0 and 2.1 average WYL domain-containing proteins per organism, respectively (Figure 6b). By analyzing the taxonomic distribution for each domain architecture class, we observed that the PafBC-like class A is most prominently found in Actinobacteria, Firmicutes, and Bacteroidetes, but much less abundant in Proteobacteria (Figure 6c). On the other hand, more than two thirds of the HTH-WYL architecture members (class B) are found in proteobacterial species (Figure 6d).
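A per-phylum average of this kind can be computed as sketched below. The sketch assumes one (phylum, organism) record per WYL domain-containing protein and averages only over organisms that have at least one hit; whether the averages reported here were normalized in this way or over all organisms of a phylum is not stated, so this choice is an assumption. The function name and the example records are hypothetical.

```python
from collections import defaultdict

def mean_hits_per_organism(records):
    """Average number of WYL domain-containing proteins per organism, by phylum.

    records: iterable of (phylum, organism) tuples, one per protein hit
    (hypothetical input format). Organisms without any hit are not counted.
    """
    per_phylum = defaultdict(lambda: defaultdict(int))
    for phylum, organism in records:
        per_phylum[phylum][organism] += 1
    return {
        phylum: sum(orgs.values()) / len(orgs)
        for phylum, orgs in per_phylum.items()
    }

# Made-up example data, not taken from the study.
records = [
    ("Actinobacteria", "orgA"),
    ("Actinobacteria", "orgA"),
    ("Actinobacteria", "orgB"),
    ("Proteobacteria", "orgC"),
]

means = mean_hits_per_organism(records)
```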
Taken together, our analysis shows that the majority of WYL domain-containing proteins are transcriptional regulators based on the presence of an HTH domain. It therefore seems likely that the mechanism of transcriptional regulation and signal relay employed by PafBC, although currently unknown, is a widespread principle found in almost all bacteria. Moreover, in some phyla multiple of these transcriptional regulators are present in one organism, suggesting that WYL domain-containing regulators may be involved in different pathways.

Discussion
During the mycobacterial DNA damage response, the heterodimeric transcriptional regulator PafBC activates most of the genes required for an adequate response to DNA stress (Fudrini Olivencia et al., 2017;Müller et al., 2018). However, understanding and experimentation concerning the regulatory mechanism of PafBC was largely hampered by a lack of knowledge about its molecular structure. This limitation has manifested also in other studies concerning WYL domain-containing proteins (Andis et al., 2018;Hein et al., 2013;Modell et al., 2014;Yan et al., 2018). Our computational analysis showed that roughly 90% of all WYL domain-containing proteins possess an N-terminal HTH domain suggesting that these are transcriptional regulators (Figure 6a). Furthermore, our analysis revealed the WYL domain as a domain specific to bacteria that is present in nearly all bacterial phyla (Figure 6b and S8). Thus, it is conceivable that the regulatory mechanism employed by PafBC might represent a shared mode of action for all of these regulators. This is not only an exciting possible concept, but might also be helpful in gaining a full understanding of the nature of the regulation.
We obtained the crystal structure of PafBC in the absence of DNA, in a largely asymmetric domain arrangement that is likely characteristic for the non-activated state. The domains are connected through long loops, suggesting a great degree of flexibility for the entire protein and that the protein might undergo large domain movements upon DNA binding, where it could eventually adopt a more symmetric arrangement, as observed for the WCX domains (Figure 7). In such a state, the helices connecting HTH-C and WYL-C (α4', α4'', α4''') could merge into a single helix and act as the counterpart to PafB helix α4 in a coiled-coil fashion at the center of the protein. Such an interaction could be mediated by the row of hydrophobic residues that are featured along the axes of helices α4'-α4'''. Also, the HTH-C domain was observed in a state inaccessible for DNA binding (Figure 2b). Besides their role in protein-DNA interaction, winged HTH domains were found to mediate protein-protein interactions (Littlefield and Nelson, 1999;Woo et al., 2009;Zheng et al., 2002). Thus, the conformation of HTH-C may represent a state that is part of a regulatory mechanism, in which PafBC is prevented from efficient DNA binding under non-stress conditions. In agreement with this notion, the PafBC mutant lacking HTH-C cannot complement the phenotype of ΔpafBC observed under DNA stress (Figure 5a), suggesting that HTH-C must fulfill an essential function, i.e. DNA binding/recognition. Furthermore, the recognition helices in the HTH domains of PafB and PafC are different in length and also in amino acid composition (Figure S4a and S5), and likewise the PafBC binding motif (RecA-NDp) is non-palindromic (Müller et al., 2018). Together with the regulatory switch, this could then also explain why PafBC is a heterodimer.
Our results provide key insights into the WYL and WCX domains, revealing that they adopt folds similar to proteins associated with RNA binding (Figure 3, 4 and S7). Considering that the activation of PafBC upon DNA damage does not rely on changes in protein levels (Fudrini Olivencia et al., 2017), the possibility of another, stress-dependent factor required to elicit PafBC activity is likely. The results obtained in the complementation study with single amino acid substitutions in the WYL domains strongly suggest a binding interface for a signal-transducing ligand (Figure 5). Such a potential factor may thus well be an RNA molecule or, in a broader context, a nucleic acid or nucleic acid derivative, relaying the signal of DNA damage to PafBC by recognition at the WYL and/or WCX domains. In fact, the WYL domain of PIF1 helicase from Thermotoga elfii was shown to bind single-stranded DNA, thereby stimulating helicase activity (Andis et al., 2018). Binding of single-stranded DNA may be conceivable for the WYL domain proteins of class J of our computational analysis, which are also associated with a helicase domain.
The Sm-fold of the WYL domain is characteristic of eukaryotic RNA binding proteins, the Sm proteins. Their ring-shaped assemblies are core components of the spliceosomal snRNPs (small nuclear ribonucleoproteins) (Bertram et al., 2017). Through analogy, the bacterial protein Hfq is considered the sole representative of the Sm-like/LSm family based on its hexameric assembly state and RNA chaperone function (Khusial et al., 2005). Interestingly, no Hfq homolog has been identified in actinobacteria to date using sequence searches (Chao and Vogel, 2010), but we found WYL domain-containing proteins to be significantly enriched in the actinobacterial phylum (Figure 6b). It is possible that some of these actinobacterial WYL domain-containing proteins carry out a similar function to Hfq.
The crystal structure of PafBC provides the framework for understanding the mechanism by which PafBC connects the signal of DNA stress with a transcriptional response through use of its WYL/WCX domains. These results will also help us to better understand WYL domain-containing proteins in general.

Protein expression and purification of Arthrobacter aurescens PafBC
Full-length Aau PafBC was amplified from genomic DNA of Arthrobacter aurescens strain 579 (DSM-20116) using the primers ACGCGCCTTGCTGCTTTCC (forward) and CTAGCCAGCCTTGGTGCCCG (reverse), which were designed based on the sequenced genome of strain TC1 (NC_008711; locus tag AAur_2182). The amplicon was cloned into a temporary vector and a truncated variant of Aau PafBC (Aau PafBCΔNC) missing the first 17 amino acids at the N-terminus and the last 7 amino acids at the C-terminus was amplified from this vector using primers GCATCCCGCACCGAACG (forward) and TGAGTCGTACTGCACCAAAG (reverse). The amplicon of Aau PafBCΔNC was cloned into an isopropyl-β-D-thiogalactopyranoside (IPTG)-inducible expression vector with a cleavable His6-TEV tag at the N-terminus. Selenomethionine-labeled protein was expressed according to a procedure adapted from (Doublié, 2007): E. coli Rosetta (DE3) cells harboring the expression vector were grown as shaking cultures at 37°C in M9 medium (M9 salts supplemented with 2 mM MgSO4, 0.1 mM CaCl2, 0.5% w/v glucose, 2 mg/l biotin, 2 mg/l thiamine, 0.03 mg/l FeSO4). At an OD600 of 0.5, 100 mg/ml of phenylalanine, lysine, and threonine, 50 mg/ml of isoleucine, leucine, and valine, as well as 80 mg/ml of selenomethionine (Chemie Brunschwig) were added as solid powder to the cultures, which were further incubated for 30 min. Expression was then induced with 0.5 mM IPTG and cells were further incubated at 16°C overnight. Cells were harvested (F9S, 7'000 rpm, 10 min, 4°C) and pellets were resuspended in lysis buffer (50 mM HEPES-NaOH pH 7.8/4°C, 300 mM NaCl, 2 mM TCEP). The cell suspension was homogenized using a Heidolph DIAX600 mixer and cells were lysed by high pressure shear force using a Microfluidizer M110-L device (Microfluidics; 5 passes, 11'000 psi chamber pressure).
After removal of cell debris (SS34, 20'000 rpm, 4°C, 30 min), the cleared lysate was supplemented with 1 mM PMSF, 1x cOmplete EDTA-free protease inhibitors (Roche), 50 U/ml DNase I, 10 mM imidazole and incubated for 30 min on ice. The lysate was passed over a self-packed Ni2+-charged IMAC Sepharose 6 Fast Flow (GE Healthcare) column, and bound protein was eluted step-wise with lysis buffer containing 80 mM to 250 mM imidazole. After pooling protein-containing elution fractions, His-tagged TEV protease was added to a 1:30 molar ratio and the protein sample was dialyzed against 25 mM HEPES-NaOH pH 7.8/4°C, 150 mM NaCl, 2 mM DTT, 1 mM EDTA at 4°C overnight. TEV protease was removed by affinity chromatography and the protein sample was dialyzed against 25 mM HEPES-NaOH pH 7.8/4°C, 40 mM NaCl, 2 mM DTT, 1 mM EDTA at 4°C overnight. The protein was further loaded on a Source 30Q column and eluted with a 50 mM to 400 mM NaCl gradient in 25 mM HEPES-NaOH pH 7.8/4°C, 1 mM TCEP. Protein-containing elution fractions were pooled and concentrated using an Amicon Ultra 30K centrifugal filter (3'500 g, 4°C). The concentrated protein was run on a self-packed 100 ml Superose 12 prep grade column in crystallization buffer (10 mM HEPES-NaOH pH 7.8/4°C, 50 mM NaCl, 1 mM TCEP, 0.1 mM EDTA). Protein was concentrated as above to 22 mg/ml, aliquoted, frozen in liquid nitrogen and stored at -80°C until use.

Data collection, experimental phasing, structure determination, and refinement
Reflection image data was collected at the X06SA beamline of the Swiss Light Source (SLS, Paul-Scherrer-Institut, Villigen, Switzerland) at 100 K and 12'670 eV beam energy (0.978561 Å). Diffraction images were processed using XDS (Kabsch, 2010) and scaled using AIMLESS (Evans and Murshudov, 2013). Determination of heavy atom sites, initial phases, and crude main chain tracing were carried out using the SHELX programs (Sheldrick, 2010). The resulting experimental electron density map displayed easily discernible protein features. The initial model from SHELX was further extended with PHENIX AutoBuild (Terwilliger et al., 2008) and subsequent iterative model building and refinement was carried out using Coot (Emsley et al., 2010) and phenix.refine (Afonine et al., 2012), respectively.
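As a quick consistency check on the beam parameters given above, photon energy and wavelength are related by λ = hc/E. The following minimal sketch converts the reported beam energy into a wavelength; the function name is illustrative and not part of the original workflow, and the constant hc is taken as 12398.419843 eV·Å. The result agrees with the stated wavelength to within rounding of the constants used.

```python
# Photon energy to wavelength conversion: lambda = h*c / E.
# h*c expressed in eV*Angstrom (from the exact SI values of h, c and e).
HC_EV_ANGSTROM = 12398.419843


def energy_to_wavelength(energy_ev: float) -> float:
    """Return the X-ray wavelength in Angstrom for a photon energy in eV."""
    if energy_ev <= 0:
        raise ValueError("photon energy must be positive")
    return HC_EV_ANGSTROM / energy_ev


# Beam energy reported for the data collection (12'670 eV).
wavelength = energy_to_wavelength(12670.0)
print(f"{wavelength:.4f} Angstrom")
```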

Mutational screening of Mycobacterium smegmatis PafBC
The coding sequence of pafBC was amplified from Mycobacterium smegmatis mc2-155 SMR5 (Sander et al., 1995) genomic DNA and cloned into a pMyNT-derived integrative plasmid (pMyNT template provided by A. Geerlof, EMBL Hamburg). As promoter, a DNA fragment containing 347 bp upstream of the pafA coding sequence was inserted in front of pafBC to generate the pafBC complementation plasmid. Variants were then generated by KLD site-directed mutagenesis or by Gibson assembly using the pafBC complementation plasmid as template. The various plasmids were then transformed into the Mycobacterium smegmatis ΔpafBC strain (Fudrini Olivencia et al., 2017), and viability in presence of mitomycin C was assessed with the resazurin assay as described previously (Müller et al., 2018).
For the initial analysis, PafBC homologs from other actinobacteria were identified by BLAST (blast.ncbi.nlm.nih.gov; restricted search to actinobacterial species, otherwise default settings) (Camacho et al., 2009) using Mycobacterium smegmatis PafB or PafC as input. The identity of the obtained PafBC homologs was cross-checked on the genome level for the operon organization of the genes and association with the Pup-proteasome system gene locus. In total, ten PafB/PafC sequence pairs were retrieved (UniProt accession numbers P9WIM1, P9WIL9, I7G3U5, A0QZ41, A7BCC5, A7BCC6, A4X749, A4X750, C0ZZU3, C0ZZU2, A1SK18, A1SK19, A0LU62, A0LU63, A6W976, A6W977, Q8NQE2, Q8NQE3, Q9RJ64, Q9RJ65). A global alignment of these sequences was used to generate an HMM with "hmmbuild" (default settings), which was subsequently used to search against the reference proteomes database with "hmmsearch" (command line options were -E 1 --domE 1 --incE 0.01 --incdomE 0.03). A list of domains contained in the retrieved sequences was obtained with the "hmmscan" module (command line options were -E 0.1 --domE 0.1 --incE 0.01 --incdomE 0.03) using all HMMs from the Pfam database release 32.0 (El-Gebali et al., 2019). Domain lengths were analyzed based on the envelope boundaries given by "hmmscan" for sequences with a domain score above 30.0 (independent domain e-value < 0.001).
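The filtering step described above (domain score above 30.0, independent domain e-value below 0.001, lengths taken from envelope boundaries) can be sketched as follows. This is a minimal illustration and not the authors' script: it assumes the "hmmscan" results were written in HMMER's per-domain table format (option --domtblout), and the two example rows, the sequence name "protA" and the profile name "OtherDomain" are synthetic placeholders.

```python
from typing import Iterable, Iterator, List, NamedTuple


class DomainHit(NamedTuple):
    target: str      # for hmmscan: name of the matching profile (domain)
    query: str       # for hmmscan: name of the protein sequence
    i_evalue: float  # independent domain e-value
    score: float     # per-domain bit score
    env_from: int    # envelope start on the sequence
    env_to: int      # envelope end on the sequence

    @property
    def env_length(self) -> int:
        """Domain length based on the envelope boundaries."""
        return self.env_to - self.env_from + 1


def parse_domtblout(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Parse rows of a HMMER3 --domtblout table (whitespace-delimited)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split(maxsplit=22)  # the 23rd field is a free-text description
        yield DomainHit(
            target=f[0], query=f[3],
            i_evalue=float(f[12]), score=float(f[13]),
            env_from=int(f[19]), env_to=int(f[20]),
        )


def confident_hits(hits: Iterable[DomainHit],
                   min_score: float = 30.0,
                   max_i_evalue: float = 1e-3) -> List[DomainHit]:
    """Keep hits passing the thresholds stated in the text."""
    return [h for h in hits if h.score > min_score and h.i_evalue < max_i_evalue]


# Synthetic example rows (placeholder values, not real search output).
EXAMPLE = """\
# target name accession tlen query name accession qlen ...
WYL - 170 protA - 326 1e-40 140.2 0.1 1 1 1e-43 2e-40 139.5 0.1 1 168 140 310 138 312 0.95 WYL domain
OtherDomain - 60 protA - 326 0.3 12.3 0.0 1 1 0.2 0.5 12.3 0.0 5 40 20 55 18 58 0.70 placeholder
"""

all_hits = list(parse_domtblout(EXAMPLE.splitlines()))
kept = confident_hits(all_hits)
for hit in kept:
    print(hit.query, hit.target, hit.env_length)
```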
Because the Pfam HMM profile of the WYL domain (Pfam-WYL) includes both the WYL and WCX domains, we had to define a new WYL domain HMM in order to annotate it correctly for our analysis. To build the WYL HMM, 250 sequences were randomly sampled from sequences with a Pfam-WYL length greater than 127 residues and another 250 sequences were randomly sampled from sequences with a Pfam-WYL length less than 127 residues (other thresholds: domain score > 30.0, independent domain e-value < 0.001). The threshold was chosen based on the domain boundaries seen in the crystal structure of Aau PafBC and the length distribution of the C-terminal region of the retrieved Pfam-WYL-containing proteins. Sequences with obvious defects in the WYL region were discarded manually, resulting in 477 entries that were used for alignment. From this alignment, the WYL domain boundaries were established using the N-terminal boundary of the Pfam-WYL, while the C-terminal boundary was chosen based on the C-termini of short Pfam-WYLs and the crystal structure. The alignment was then trimmed to the WYL boundaries, sequences with a pairwise identity above 70% were clustered using CD-HIT v4.6.8 (command line options -n 4 -c 0.7) (Li and Godzik, 2006) and an HMM was generated as above. To mature the WYL HMM, an iterative approach similar to the Pfam HMM generation was chosen. Sequences were retrieved from the reference proteomes using the WYL HMM to generate a full alignment, which was then again trimmed to the WYL domain boundaries to make up a new seed alignment used to build a new HMM (gathering threshold: domain score > 27.0). The process was repeated for a total of three iterations.
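The length-stratified sampling used to assemble the initial WYL seed set can be illustrated with a short sketch. It assumes that each candidate has already passed the score and e-value thresholds and is represented as a (sequence id, Pfam-WYL length) pair; the function name, the fixed random seed and the synthetic records are illustrative assumptions, not details taken from the original analysis.

```python
import random
from typing import List, Sequence, Tuple

Record = Tuple[str, int]  # (sequence id, Pfam-WYL domain length in residues)


def sample_seed_sequences(records: Sequence[Record],
                          threshold: int = 127,
                          n_per_group: int = 250,
                          seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Draw equal-sized random samples of long and short Pfam-WYL hits.

    Records with a length exactly equal to the threshold fall into neither
    group, mirroring the "greater than" / "less than" wording of the text.
    random.Random.sample raises ValueError if a group is too small.
    """
    long_hits = [r for r in records if r[1] > threshold]
    short_hits = [r for r in records if r[1] < threshold]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return rng.sample(long_hits, n_per_group), rng.sample(short_hits, n_per_group)


# Synthetic records with lengths between 90 and 169 residues.
records = [(f"seq{i}", 90 + i % 80) for i in range(2000)]
long_sample, short_sample = sample_seed_sequences(records)
print(len(long_sample), len(short_sample))
```

In the procedure described above, the 500 sampled sequences would then be inspected manually before alignment; that curation step is not modeled here.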
To build the HMM for the unrecognized (winged) HTH domains in PafBC and many other WYL domain-containing proteins, 500 sequences were randomly sampled from the group of sequences with a median length of 326 amino acids and containing only a WYL domain (thresholds: domain score > 30.0, independent domain e-value < 0.001). The sequences were aligned and curated based on the presence of the conserved blocks representing the helices of the HTH fold. The sequences were further split into groups exhibiting a PafB-like or PafC-like wing sequence of 184 and 235 sequences, respectively. For each group, an HMM was built and searched against reference proteomes database as described above, but using an identity threshold of 90% and iterating only once in order to keep a separation to other HTH-type domains.
To build the HMM for the C-terminal extension found in many WYL domains (WCX), 500 sequences were randomly sampled from the group containing a Pfam-WYL with a length greater than 127 amino acids (thresholds: domain score > 30.0, independent domain e-value < 0.001). Sequences missing the C-terminal conserved block were removed, leaving 434 sequences for HMM generation. The HMM generated from the manually curated seed alignment was used without iteration.
The custom HMMs were then added to the local search database of the Pfam HMM database (see also above).
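One way to add custom profiles to a local "hmmscan" database is to concatenate the HMM files and index the result with "hmmpress". The sketch below shows this under stated assumptions: the file names are hypothetical, the file contents are placeholders rather than valid profiles, and the "hmmpress" command is only assembled, not executed. Whether the original analysis was set up this way is not stated in the text.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def build_combined_hmm_db(hmm_files: Iterable[Path], combined: Path) -> List[str]:
    """Concatenate HMM flat files and return the hmmpress command to index them.

    HMMER profile files are plain text, with each profile terminated by a
    line containing "//", so several files can be joined by concatenation.
    """
    with combined.open("w") as out:
        for path in hmm_files:
            text = Path(path).read_text()
            # Make sure each file ends with a newline before appending the next.
            out.write(text if text.endswith("\n") else text + "\n")
    # -f overwrites index files left over from an earlier hmmpress run.
    return ["hmmpress", "-f", str(combined)]


# Demonstration with placeholder files (not valid HMM profiles).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pfam = tmp_dir / "Pfam-A.hmm"          # hypothetical local Pfam copy
    custom = tmp_dir / "WYL_custom.hmm"    # hypothetical custom profile file
    pfam.write_text("HMMER3/f [placeholder]\n//\n")
    custom.write_text("HMMER3/f [placeholder]\n//")
    combined = tmp_dir / "combined.hmm"
    command = build_combined_hmm_db([pfam, custom], combined)
    record_count = combined.read_text().count("//\n")

print(command[:2], record_count)
```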
To establish the domain architecture classes of the WYL domain-containing proteins, the WYL HMM was used to search against the reference proteome database with "hmmsearch" (command line options as above). Obtained sequences with a domain score > 27.0 were analyzed for the presence of other domains using "hmmscan" together with the local Pfam HMM database including the custom HMMs as described above. Protein sequences were then categorized according to their sequence length, their identified domains (on the level of Pfam clans; independent domain e-value < 0.01) and the domain length. Domain architecture classes with large regions apparently containing no domain were checked manually by alignment for presence of conserved features, by alignment to other classes and by alignment to the manually curated PafBC seed sequences (see above). They were then grouped together with other classes, where appropriate. Due to low overall abundance and being mostly candidate species, all non-bacterial sequences were excluded from the main analysis.

The recognition helices (H3) of wHTH domains typically insert into the major groove of the DNA, while the wings establish contact to the minor groove (small inset, middle). The recognition helix of HTH-B (orange) is exposed and available to accept a DNA ligand. The main chain of H1 and the wing coordinate a potassium ion (lilac sphere). (b) The HTH domain of PafC (HTH-C, red) on the other hand forms part of the protein core. H3 of HTH-C (orange) is shorter by two helical turns compared to H3 of HTH-B, and an unstructured loop reaches into the protein core. Additionally, the β-sheet of the wing extends the β-sheet of the WYL domain of PafB (WYL-B, blue). Dashed lines bridge gaps in the models. See also Figure S4.

The WCX domains (left; shown for PafC; residues 585-664) contain a ferredoxin-like fold with an additional C-terminal α-helix (α3). Very versatile and present in proteins with highly diverse functions, the ferredoxin-like fold is also found in many RNA-binding proteins such as human hnRNP A1 (also known as UP1; middle; brown; PDB 6DCL; residues 7-89 shown) or the C-terminal domain of CRISPR-Cas protein Cse3 (right; beige; PDB 2Y8W; residues 90-211 shown). RNA ligands are colored in orange.

(c) Class A, featuring the WCX domain located C-terminally of the WYL domain, is mainly found in Gram-positive bacteria, namely Actinobacteria and Firmicutes, while (d) Class B, exhibiting only the WYL domain, is mostly found in Proteobacteria. The segment radian represents the number of unique species, while the thickness of the segment represents the average number of sequences per species within that taxonomic group. The number of species is given below the class labels with the number of sequences in parentheses. For clarity, taxonomic groups smaller than 1.5% of the total number are not shown. See also Figure S1, S2 and S3.

Figure 7: Hypothetical model of DNA-binding by activated PafBC.
In the non-activated state, PafBC buries the recognition helix (orange) of PafC's HTH domain (HTH-C) and cannot bind to its cognate promoter motif. Upon activator binding, PafBC likely undergoes large structural rearrangements of its domains to release the HTH-C domain, allowing promoter recognition and transcriptional activation of DNA repair genes.