Deep Analysis of Residue Constraints (DARC): identifying determinants of protein functional specificity

Protein functional constraints are manifest as superfamily and functional-subgroup conserved residues, and as pairwise correlations. Deep Analysis of Residue Constraints (DARC) aids the visualization of these constraints, characterizes how they correlate with each other and with structure, and estimates statistical significance. This can identify determinants of protein functional specificity, as we illustrate for bacterial DNA clamp loader ATPases. These load ring-shaped sliding clamps onto DNA to keep polymerase attached during replication and contain one δ, three γ, and one δ’ AAA+ subunits semi-circularly arranged in the order δ-γ1-γ2-γ3-δ’. Only γ is active, though both γ and δ’ functionally influence an adjacent γ subunit. DARC identifies, as functionally-congruent features linking allosterically the ATP, DNA, and clamp binding sites: residues distinctive of γ and of γ/δ’ that mutually interact in trans, centered on the catalytic base; several γ/δ’-residues and six γ/δ’-covariant residue pairs within the DNA binding N-termini of helices α2 and α3; and γ/δ’-residues associated with the α2 C-terminus and the clamp-binding loop. Most notable is a trans-acting γ/δ’ hydroxyl group that 99% of other AAA+ proteins lack. Mutation of this hydroxyl to a methyl group impedes clamp binding and opening, DNA binding, and ATP hydrolysis—implying a remarkably clamp-loader-specific function.


DARC.
For this study, DARC used as input: (i) an MSA of AAA+ proteins; (ii) the E. coli δ' or γ sequence as the query to seed the analysis of γ/δ' or γ subunits, respectively; and (iii) corresponding 3D structural coordinates. DARC executes the following: Step 1. Run Bayesian Partitioning with Pattern Selection (BPPS) 1-5 to define within the MSA the functionally divergent subgroups to which the query sequence belongs based on pattern residues distinguishing the members of each subgroup from other, closely related sequences. Step 2. Use the sub-MSA corresponding to the query's family to compute DCA-scores; this generates a DCA score output file in PSICOV format 6 . Step 3. For available structures, determine the statistical significance of the concurrence between DCA scores and 3D contacts (3DSDC) and between BPPS scores and 3D contacts (3DSP). Step 4. Identify statistically significant 3D clusters of BPPS-defined residues (CLSp). Step 5. Use PyMOL (http://www.pymol.org/) to visualize pattern residues and directly coupled residue pairs within protein structures. Structural coordinate files were obtained from the RCSB protein data bank (PDB) 7 . The PDB identifiers for proteins examined here are: 3glf, 3glg, 3glh, 3gli, 1a5t, and 1jr3. Hydrogen atoms were added to these files using the Reduce program 8 version 3.3; this may also be done using the PyMOL h_add command.
Bayesian Partitioning with Pattern Selection (BPPS). Given a typically very large multiple sequence alignment (MSA), denoted here as X , BPPS applies Markov chain Monte Carlo (MCMC) sampling to articulate a superfamily into a set of hierarchically nested partitions corresponding to a tree. Each subtree h, which may consist of only a single node, when attached to the root corresponds to a family, and, in general, when attached to a parent node corresponds to a child subgroup. The sampler defines each subgroup, also denoted by h, based on residue patterns distinguishing subgroup members from 2 sequences assigned to other nodes in the parent subtree. For instance, a simple pattern for subgroup h might consist of   V,I,L ,   D,E , and   F,Y at column positions 3, 10 and 23, respectively. BPPS favors assignment of those sequences to subtree h conserving a pattern that is not conserved in sequences assigned to other nodes in the parent subtree. Hence, BPPS favors assignment to each parent node those sequences conserving the parent node's pattern but lacking each of the descendent nodes' patterns. For a non-root node n in the hierarchy this process defines a 'contrast' alignment (as in Fig.   S1) divided into foreground and background sequences, corresponding respectively to the subtree rooted at n and to the rest of the subtree rooted at the parent of n. MCMC sampling is used to determine the number and arrangement of the nodes in the tree, the sequences belonging to each node, the pattern positions for each subgroup, and the conserved residues at each pattern position. The sampler favors convergence on a hierarchy where the pattern defining the partitioning for each node best distinguishes its foreground from its background. BPPS weights sequences, as in PSI-BLAST 9,10 , to avoid modeling conserved patterns merely due to sequence redundancy.
For the ensuing discussion, we define the following: For vectors and where N a , the number of unlabeled, unordered rooted trees with N nodes, is defined recursively as: with k j being the number of subtrees with k nodes 11  Nf are, respectively, the background frequency and the total foreground number (with background contamination included) of the pattern-matching pseudo-residues in column c for subtree h.

X H S A α Θ X H S A α Θ H S A αΘ
(1) where, assuming statistical independence among subtrees (but see below),  ; hence, these positions contribute nothing to the posterior probability. However, because the configuration of each subtree constrains to some degree the possible configurations of other subtrees above or below it in the hierarchy, our independence assumption is invalid. These constraints reduce the probabilities for some states to zero, so that the probabilities assigned to the remaining, reachable states will sum to less than 1, and our formulation is therefore conservative (i.e., computed probabilities are smaller than they should be). This also occurs due to other imposed constraints, such as placing an upper bound on the number of pattern positions or on the depth 6 of the hierarchy, or requiring that a minimum number of sequences be assigned to each node. Nevertheless, in searching for an optimum, Equation 1 is valid as an objective function, its use here. After convergence with N = 1 nodes, a child node may be added to the root node, as follows. First, some of the sequences assigned to the root are reassigned to the child node by selecting a subset of sequences that are more similar to each other than they are to the remaining sequences. Selections are based on similarity to the query, when one is designated, and on similarity to an arbitrary sequence otherwise. Next, BPPS samples for a few cycles over S and A, as described above, to search for a configuration that improves upon the previous hierarchy based on Equation 1. If the hierarchy fails to improve and a query has not been provided, several other candidate queries may be selected in turn until either an improved state is found or until a prespecified number of attempts are tried. If the hierarchy is improved, BPPS further enlarges and rearranges the evolving hierarchy H by adding more leaf nodes 7 and by deleting, inserting or moving nodes using this same basic strategy. To avoid excessively complex hierarchies, BPPS requires that each leaf node contain a minimum number of sequences (50 by default); those that do not are pruned. After convergence, the sampler applies simulated annealing 13 to 'drop into' a more nearly optimal configuration.

BPPS sampling strategies. Conditioned on fixed H, BPPS samples over S and
When DARC applies BPPS, it focuses on the query's lineage within the superfamily hierarchy by first defining the query's family based on residues that most distinguish family members from other superfamily members. Next, DARC seeks to recursively define, in a similar manner, the query's subfamily, and other subgroups further down the query's lineage to a prespecified maximum depth. For the analysis here, using a MSA of 463,471 AAA+ proteins and the E. coli δ' clamp loader subunit as the query, BPPS identified those residues that most distinguish both γ and δ' from other AAA+ proteins ( Fig. S1c). Using the γ/δ' sub-MSA and the E. coli γ subunit as a query, BPPS identified those residues that most distinguish γ from δ' (Fig. S1e) where Z is a normalization constant to ensure that the sum over all sequences equals 1. However, because computing ( ) P εX is intractable for a non-trivial MSA, the following pseudo-log-likelihood is used instead as the objective function: where the regularization coefficients are λsingle = 1; λpair = 0.2 × (L -1) 16 , and where where "." denotes averaging over the corresponding row or column and .,. K is the average over all matrix elements.
Evaluating the robustness of DCA score rankings. To determine whether different input MSAs rank DC-pairs consistently, an auxiliary subsampling routine is included as an option in DARC. For the analysis here, this routine draws from the input MSA 1,000 samples of 1,000 sequences, from each of which DCA scores are computed. Between samplings, the previously sampled sequences are replaced prior to sampling the subsequent set. The percentage of times that each residue pair was among those with the top 20, 10, 5 or 2 DC-pairs is given in Table 1. Those pairs consistently selected among the 20 top scores were included in Fig. 2 and Table 1.

Initial Cluster Analysis (ICA). To compute
CLSP, 3DSDC, 3DSP, or DCSP (denoted generically here as S) we apply Initial Cluster Analysis 18 , a statistical approach to address the following question: Consider an array of 0s and 1s of length L and containing D 1s. Are some or all of the 1s significantly clustered near the start of the array, and, if so, how surprising is the most significant such clustering? To make this determination, ICA applies the Minimum Description Length (MDL) principle 19 , an information theoretical regularization method for finding the best hypothesis for a given set of data.
The MDL principle defines a theory  as a probability distribution P  over all possible sets of data and the description length of a data set E given a theory  as  For ICA, the MDL principle determines whether the hypothesis 1 that the 1s cluster near the start of the sequence is better than the null hypothesis 0 that the 1s and 0s occur randomly.
ICA treats 1 as a single-parameter model, whose parameter x describes the location of a cut at a discrete point from 1 to 1 L − along the array, thereby dividing it into an initial segment s1 of length x, and a terminal segment s2 of length y L x =−. If s1 contains D1 1s, and s2 contains 21 D D D =− 1s, assume that s1 is generated by Bernoulli trials with maximum-likelihood probability 11 P D x = for a 1, and s2 is generated by Bernoulli trials with probability 22 P D y = for a 1. Given a particular fixed value for x, the probability of E is ( ) ( ) ( ) with an alternative ordering based on 3D pairwise distances. More specifically, we seek to identify an optimal initial cluster of elements of the array (defined by a cut), as measured by a relevant p-value. We are given an array of L residue pairs ordered by their DC-scores. D of the pairs (denoted by '1's) are 11 separated within a reference 3D structure by ≤ z Å (with z = 3.5 Å by default) and L -D (denoted by '0's) are not.
We ask: what initial cluster, consisting of pairs up to and including a cut point X, contains the most surprising number d of '1's, and what is its probability of occurring by chance? (We term the d 1s in an initial cluster "left-distinguished pairs.") For L = 18 and D = 7, for example, one such array is "101101100000010001", with optimal cut point X = 7 (underlined), yielding d = 5. Since the pairs are ranked by pairwise distance, we might then represent our example array as "401603200000070005" with digits > 0 denoting the ranks of distinguished pairs. ICA ignores these ranks when choosing the optimal X, whereas we would prefer the d distinguished pairs to the left of X to have superior ranks (i.e., lower numbers) than those to the right. For homomeric structures, DARC assesses the correspondence, not only between DCA scores and internal 3D-contacts alone (e.g., labeled as 'A' for chain A), but also between DCA scores and both internal and adjacent subunit interface 3D-contacts (e.g., labeled as 'A:B' for chain A and adjacent chain B). The change in 3DSDC upon inclusion of interface contacts is denoted as ΔSDC. High positive values for ΔSDC suggest that strong selective pressures are maintaining 3D contacts between adjacent subunits. In contrast, negative values for ΔS suggest that subunit interactions may be functionally insignificant and perhaps due to a crystallographic artifact. Pattern residue 3D-clustering significance scores (CLSP). DARC applies ICA to estimate constraints tending to cluster BPPS-defined residues structurally 25 . The L positions of the ICA array correspond to the L residues within a 3D structure of the DARC query protein or of other proteins belonging to the query family. The 1s in the array correspond to a fixed number of BPPS-defined pattern residues and the 0s to the remaining residues. DARC orders array elements based on their 3D distance from a starting residue. It then determines the most significant 3D-cluster of these residues among a nested set of clusters, each centered on the starting residue. This is performed, starting with each of the BPPSdefined residues in turn, and the highest scoring 3D-cluster among these is reported along with the starting residue and the corresponding CLSP-score. The CLSP score measures the significance of the intersection between a 3D cluster and the BPPS residue set. In addition to the strategy just described (termed "spherical expansion"), DARC allows either core expansion or hydrogen-bond-network expansion 25 . Core expansion sequentially adds the residue closest to a residue within the cluster's "core". This core is defined as the starting residue R plus all cluster residues whose distance to their k th closest cluster residue is less than R's distance to its k th closest cluster residue (with k=7 by default; this was selected empirically to avoid both spherical-and tentacle-shaped clusters.) In this case, the cluster typically expands less symmetrically. Hydrogen-bond-network expansion sequentially adds a residue 13 forming the closest sidechain-to-sidechain or sidechain-to-backbone hydrogen bond with a cluster residue.
Equilibrium β clamp binding assay. Fluorescence was measured using a QuantaMaster QM1 spectrofluorometer (Photon Technology International). Reagents were added to 3x3x5 mm quartz cuvette (Hellma 105.251-QS) to a final volume of 80 μL and data was collected at room temperature immediately after mixing. A β mutant, Glu-299 to Cys, was covalently labeled with N-(1-pyrene) maleimide at residue 299 to measure β binding by γ-complex clamp loader 26 where γc is total concentration of clamp loader, β is total concentration of β PY and Imax and Imin are maximum and minimum emission intensities, respectively.
Equilibrium β clamp opening assay. The basic protocol for clamp opening assays is the same for  binding assays except that a different  construct is used. For opening assays, a  mutant in which Arg-103 and Ile-305 were mutated to Cys and surface Cys-260 and Cys-333 were mutated to Ser was labeled with Alexa Fluor 488 (AF488) maleimide (Invitrogen) 27 . AF488 was excited at 490 nm and emission was measured from 500 nm to 550 nm with a 3 nm band-pass. As in the β binding assay, three measurements were taken: 1) buffer containing ATP and MgCl2, 2) solution after addition of β AF488 , and 3) solution after the addition of γ complex. Storage buffer was added instead of  complex to generate the 0 nM γ-clamp loader point. After correcting for buffer background, the intensity of bound β AF488 at 14 517 nm was divided by the intensity for free β AF488 at each clamp loader concentration. constants measured in the -AF4882 opening assay are a function of the microscopic equilibrium constants for both the discrete binding and opening reactions . Therefore, the apparent dissociation (Kd,app) and opening (Kop,app) constants differ, and can be expressed in terms of the microscopic equilibrium constants as Kd,app = Kd ( +1 ) and Kop,app = KdKop . In general, the value of Kd,app will be smaller than Kop,app by a factor of 1/(Kop+1). However, when the value for Kop is much less than one, the value of Kd,app will be about the same as the value of Steady state ATP hydrolysis assay. To measure ATPase activity of the clamp loader complex, an enzyme-coupled reaction was used in which pyruvate kinase (PK) and lactate dehydrogenase (LDH) ultimately oxidize one mole of NADH to NAD+ per mole of ADP produced by the clamp loader 30,31 .
Loss of NADH is measured by decrease in absorbance at 340 nm or fluorescence emission at 460 nm.
The reaction scheme for the PK/LDH-coupled assay is as followed: (1) ATP ⇆ ADP + Pi, by the clamp loader (2) phosphoenolpyruvate + ADP ⇆ ATP + pyruvate, by PK (3) pyruvate + NADH + H + ⇆ NAD + + lactate, by LDH In Fig. 5b and 5e, the ATP hydrolysis-dependent decrease in NADH concentration was measured by the decrease in absorbance at 340 nm, and in Fig. 5c, ATP hydrolysis was measured by the decrease in NADH fluorescence at 460 nm using a 2 nm bandpass. For absorbance measurements, the change in 16 absorbance over time is then converted to the rate of ADP production using the extinction coefficient of NAD + (340 = 6220 M -1 cm -1 ) 30  2. Figure S1. Figure S1. DARC-generated alignments highlighting all residues conserved in γ and δ' clamp loader proteins and residues distinctive of the AAA+ superfamily, of the γ + δ' subgroup, and of γ but not δ'. These are shown using five versions of the same representative set of γ proteins (in panles a-e) and of δ' proteins (in panels a to c).
Residues are highlighted to indicate amino acid biochemical properties based on the following color code: red font with yellow highlight, non-polar (AVILMWFY) ; blue font with yellow highlight, cysteine (C); red, acidic (DE); cyan, basic (KR); magenta, polar (STNQ); green, glycine (G); blue, histidine (H); black, proline (P). Nonconserved positions in panels (a) and (d) and non-pattern residues in panels (b), (c) and (e) are shown in gray font.
The leftmost columns in panels (b), (c) and (e) give the NCBI sequence identifiers; these are colored the same as the residue sidechains in Figure 2 of the paper. a. Alignment highlighting all γ + δ' conserved residues. Those sequences above the line within the alignment correspond to representative γ proteins from the (distinct) phyla denoted in the leftmost column; the first sequence corresponds to the E. coli γ subunit (pdb_id: 3glfB). Those sequences below the line correspond to representative δ' proteins from distinct phyla, the first sequence of which corrsponds to the E. coli δ' subunit (pdb_id: 3glfE), which was used as the DARC query. The positions listed at the bottom correspond to the E. coli γ subunit. b. BPPS contrast alignment showing the same sequences as in panel (a), but highlighting only those residues most distinctive of the AAA+ superfamily. The heights of the red bars above each highlighted column estimate the selective pressure imposed on pattern residues at that position using a semi-logarithmic scale. Directly below the aligned sequences, the characteristic AAA+ residues at each position are shown and, directly below these, corresponding frequencies are given in integer tenths. A '7', for example, indicates that the corresponding residue occurs in 70-80% of the 452,949 AAA+ sequences in the alignment. Below this is shown the residue positions and sequence of the E. coli γ subunit (with the Thr 165 residue that was mutated to Val highlighted in red), and shown below these are predicted secondary structure elements (symbol: H, helix; E, strand), helix and strand designations, and AAA+ structural motifs (red font) and putative clamp binding loops C1 and C2 (green font). Secondary structure assignments were calculated for the E.