Selection of ZF specificity and compatibility

Two general approaches have been used to engineer ZFs with novel specificity (Supplementary Fig. 2). The first focused on engineering one finger at a time by selection of functional variants from ZF libraries where the six base-specifying positions of the helix have been randomized (Supplementary Fig. 2b). The second approach focused on the interface between adjacent ZFs of an array, because the influence of adjacent fingers on one another has been apparent since the first structures of ZFs bound to DNA were solved (Supplementary Fig. 2c); this influence leads to combinatorially greater complexity, which is the main reason for the failure of previous attempts to build a code. While the first approach allows for a comprehensive screen of all amino acid combinations at the six critical positions of the ZF alpha-helix24,26,27,29,30,31,32, it samples these combinations only in a single-adjacent-finger context. As a result, only ZF strategies enabled by this initial selection environment are available in subsequent rounds of selection or as the foundation of a ZF model. By contrast, the second approach captures the complexity of compatibility at the interface between ZFs25,28,33 (Supplementary Fig. 2c). However, because combinatorial explosion quickly exceeds the maximum practical library size for any screening platform, incomplete randomization schemes and the sampling of a limited number of helical positions become necessary. We hence reasoned that the solution lies in a combined approach that uses multiple exhaustive libraries in a comprehensive set of interface environments. In other words, each library presents a fully randomized single ZF helix in a unique interface environment, producing a broad catalog of binding strategies that are enabled by that single-interface environment. When considered across all libraries screened, this laborious but inclusive approach produces a comprehensive portfolio of general and interface-specific ZF binding solutions (Fig. 1). 
We theorized that this interface-derived complexity would provide (1) the diversity necessary to generate compatible ZF pairs able to bind a wide range of DNA targets and (2) the depth of data required to support a model for ZF array design.

Fig. 1: Overview of interface-focused ZF screens. a, Structure of adjacent ZF domains showing their close proximity. Helical position 6 of domain 1 (red) and position −1 (blue) of domain 2 are outlined. b, Cartoon of interactions between adjacent helices and DNA. The six helical positions of the three domains are shown as circles, with the common contacts made by positions −1, 2, 3 and 6 indicated by arrows. The overlap environment, which includes the base adjacent to the library interaction and the amino acid used to specify that base, is highlighted in green. This environment is unique for each library. c, Cartoon of B1H selections. The three-fingered protein is expressed as a C-terminal fusion to the omega subunit of RNA polymerase. For each library, ZF domain 2 is randomized at six helical positions and screened for amino acid combinations able to specify each of the 64 possible NNN targets. This is done in 64 independent screens. Domains 0 and 1 bind to their known, preferred targets and thereby anchor the protein adjacent to the NNN target sequence and present an overlap environment unique to that library. Only helices able to bind the target in the unique library overlap environment will recruit the polymerase, activate the reporter and survive on selective media. d, Left, helical residues for domains 0, 1 and 2 are shown for each library screened. Domain 2 contains all possible combinations of the six helical residues whereas domain 1 is fixed in the selections but varied by library. The sixth residue of domain 1 is the side chain that will be exposed at the interface between domains 1 and 2. Domain 0 is the same in all libraries except library 1. Right, there are 64 DNA targets for domain 2 to be screened against in 64 independent selections. The fixed targets for domain 1 of each library are shown, with the overlap base color coded by nucleotide. e, Left, to assay the success of each selection we determined clusters from the data for each selection. 
Here we show the maximum information content at one position of the strongest cluster to provide a relative measure of enrichment across all selections. Right, molecular dynamics simulations were performed on all domain 1 helices in their previously characterized contexts. The number of suggested contacts between domain 1 and the DNA is shown for each library.

The profound influence exerted by adjacent ZFs on one another can be explained by the multiple side chains of adjacent ZFs that bind DNA in close proximity to one another. This is most obvious at the ‘overlap’, where position 6 of an N-terminal helix can be within hydrogen-bonding distance of positions −1 and 2 side chains of its C-terminal neighbor (Fig. 1b). In this way the N-terminal helix is presenting a specific interface interaction, or ‘environment’, to its C-terminal neighbor that is based on the side chain employed and the base specified at the overlap position (Fig. 1a,b). Therefore we screened ten ZF libraries, each presenting the randomized C-terminal helix in a unique adjacent finger environment defined by the adjacent ZF helix (Fig. 1c,d). We screened these libraries across each of the 64 possible three base pair (bp) targets in independent selections to recover functional ZF helices. Each library presents a unique interaction between the side chain at position 6 of the adjacent finger and the base it specifies at the overlap, and this interaction defines the adjacent finger influence of that library (Fig. 1d and Supplementary Fig. 3). We designed the majority of our libraries to contact adenine or cytosine at the overlap, to provide a contrast to the arginine–guanine contacts that were presented at the overlap in most previous ZF screens. In addition, two of our libraries can specify two different bases at the overlap (nos. 1-A or -C and 3-A or -G). Therefore, we completed two comprehensive screens of these libraries, one with each base presented at the overlap. In total, we screened >49 billion protein–DNA interactions from ten libraries in 12 sets of 64 selections (two of the libraries were screened with each of two overlap bases), for 768 independent selections.
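The selection count follows from simple arithmetic, sketched here for illustration only. The variable names are ours, not from the study: ten libraries, two of which were screened with two overlap bases, give 12 screen sets, each covering all 64 NNN targets.

```python
from itertools import product

BASES = "ACGT"

# All possible 3-bp (NNN) targets: 4^3 = 64.
targets = ["".join(triplet) for triplet in product(BASES, repeat=3)]

n_libraries = 10               # libraries screened
n_dual_overlap_libraries = 2   # libraries 1 and 3, each screened with two overlap bases

# Each dual-overlap library contributes one extra set of 64 selections.
n_screen_sets = n_libraries + n_dual_overlap_libraries
n_selections = n_screen_sets * len(targets)

print(len(targets), n_screen_sets, n_selections)
```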

From these screens we found global and target-specific differences induced by library environments, indicative of the strength of the constraint exerted by each adjacent finger context on the selected ZF. The total number of selected helices ranged from 128,000 to >1 million per library screened (Supplementary Fig. 3). To distinguish selections that appeared to have low enrichment because of overlapping but unique strategies to bind the selection target from selections that truly failed to enrich functional helices, we used reasoning based on information content. We first used MUSI34, a method designed to identify multiple, unique sequence clusters in complex datasets such as these. We then quantified the information content across motifs generated from the different sequence clusters recovered in each selection. Reasoning that a successful selection should produce clusters where at least one helical position has been strongly selected for, we removed selections lacking any clusters with at least one position with high information content (Supplementary Fig. 3). We used this same threshold, the maximum information content at a single helical position of a cluster, to quantitatively compare different libraries (Fig. 1e). From this analysis we found that 39–100% of the 3-bp target selections led to successful enrichment of ZFs, depending on the library (Supplementary Fig. 3). In fact, for nine of the libraries screened at least 55 of the 64 selections (>85%) successfully enriched ZFs with specific sequence composition. In addition, for each of the 64 three-base-pair targets, at least eight different library contexts resulted in successful enrichment of ZFs, demonstrating the ability of ZFs to bind any 3-bp target in a wide range of adjacent finger environments. 
Also note that at least one library that bound either A, C or G at the overlap successfully enriched helices in at least 61 of the 64 selections (for example, library 1 with an A overlap, library 7 with a C overlap and library 9 with a G overlap), suggesting that functional ZFs exist in a wide variety of contexts independent of the overlap base. By contrast we found libraries 6 (C overlap) and 10 (A overlap) to be the least successful (Supplementary Fig. 3). To assess the impact of the biophysical properties of the adjacent helix on library success, we performed molecular dynamics simulations using the helices utilized in library contexts. We found that the number of contacts between the adjacent finger (domain 1 in Fig. 1d) employed in each library and the DNA it specifies were related to global library success, indicating that higher affinity of the neighboring finger enables more ZF strategies (Fig. 1e). Hence, adjacent fingers have a large impact on ZF function while viable ZF binding strategies exist for each overlap base.
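The information-content filter described above can be illustrated with a minimal sketch. It assumes a plain Shannon formulation over the 20 amino acids with no small-sample correction; the exact computation, the MUSI clustering step and the threshold value used in the study are not reproduced here, and `selection_passes` and its `threshold` argument are hypothetical names.

```python
import math
from collections import Counter

N_AMINO_ACIDS = 20  # maximum information per position is log2(20) bits


def position_information(helices):
    """Information content (bits) at each helical position of one cluster."""
    info = []
    for i in range(len(helices[0])):
        counts = Counter(h[i] for h in helices)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        info.append(math.log2(N_AMINO_ACIDS) - entropy)
    return info


def max_information(helices):
    """Highest information content at any single position of the cluster."""
    return max(position_information(helices))


def selection_passes(clusters, threshold):
    """Keep a selection if any cluster has one strongly selected position.

    `threshold` is a hypothetical cutoff; the study's value is not given here.
    """
    return any(max_information(cluster) >= threshold for cluster in clusters)
```

A cluster in which every helix carries the same residue at one position reaches the maximum of log2(20), about 4.32 bits, at that position, whereas a position split evenly between two residues scores one bit less.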

G-rich binding modularity and promiscuity

The majority of published ZF selections have been carried out with an arginine–guanine contact presented at the overlap, due to the high affinity offered by this contact and its historical presence in the parent protein scaffolds that were used to engineer specificity. Consequently we reasoned that the libraries reported here presenting adenine and cytosine contacts at the overlap would enrich novel types of ZF-binding strategies. Therefore, to measure the similarity of helices enriched in different library contexts we computed pairwise Hamming distances (normalized by sequence length) between all helices enriched for each successful 3-bp target selection across all different library contexts. We then compared the mean normalized Hamming distance for each of the 3-bp targets to compare library differences. While there were general trends that libraries employing the same overlap base were more similar (Supplementary Fig. 4), the most striking difference was found when comparing libraries with adenine and cytosine at the overlap with the two libraries displaying an arginine–guanine contact at the overlap (Supplementary Fig. 4c). The arginine–guanine contact libraries (3 (G) and 9) were more similar to each other than any other libraries screened. Interestingly, a comparison of helices selected to bind various targets across all libraries showed that G-rich binding is less influenced by library context (Fig. 2a and Supplementary Fig. 5). This suggests that G-rich binding is more modular, because these helices appear less dependent on the adjacent finger interaction. However, this independence in binding could lead to more promiscuity. To address this possibility, we calculated how frequently helices recovered in a particular 3-bp target selection were recovered in other target selections. 
The 15 targets with the greatest target selection entropy (that is, recovered in the majority of other selections) all had a G at the GNN or NNG position, where arginines were the dominant amino acid enriched at corresponding positions 6 and −1, respectively (Supplementary Fig. 6). Conversely, none of the 13 targets with the lowest target selection entropy had a G at these positions. These results demonstrate that helices binding a G at either the first or third position of a binding site are more likely to be promiscuous ZFs. This could help explain the G-rich bias in ZFs previously selected, engineered or assembled as modules. This may also suggest that these modules tend towards more off-target binding.
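The two measures used in this analysis can be sketched as follows. This is an illustrative reading rather than the study's code: the normalized Hamming distance is the fraction of mismatched helical positions, and the "selection entropy" here is assumed to be the Shannon entropy of how often a helix is recovered across target selections. All function names are ours.

```python
import math


def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length helices differ."""
    if len(a) != len(b):
        raise ValueError("helices must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_cross_distance(helices_a, helices_b):
    """Mean normalized Hamming distance between two enriched pools."""
    distances = [normalized_hamming(a, b) for a in helices_a for b in helices_b]
    return sum(distances) / len(distances)


def selection_entropy(counts_by_target):
    """Shannon entropy (bits) of a helix's recovery counts across selections.

    Assumed definition: 0 bits if the helix appears in one selection only,
    larger values if it is spread across many target selections.
    """
    total = sum(counts_by_target.values())
    probs = [c / total for c in counts_by_target.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)
```

Under this reading, a helix recovered equally in two target selections has an entropy of 1 bit, and a promiscuous helix recovered in many selections has a correspondingly higher value.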

Fig. 2: Specificity solutions are library specific. a, Top, dot plot of 1 − Hamming distance comparing the similarity of helical strategies enriched in libraries 1–9 for three G-rich targets (right) and three G-poor targets (left). The darkness of the dot represents the similarity of the enriched populations, with darker dots being more similar. Empty spots indicate a failed target selection for one or both of the libraries compared. Bottom, normalized Hamming distance for all libraries across all targets, listed from least similar (left) to most similar (right). The targets compared above are underlined in yellow for G-poor targets and in blue for G-rich targets. b, Clusters were determined by MUSI from the enriched helices in each library selection. Three clusters are shown for four different binding sites (CCA, TTT, CCG and GAG). If a cluster was enriched in a library selection, the corresponding box is filled black in the table. c, Schematic illustration (top) and molecular dynamics snapshot (bottom) of hydrogen bonds between the arginine at position 2 of the domain 2 helix QsRYtt with the G* of the CCG* target when an asparagine is at position 6 of the adjacent finger (library 2 environment), or when an arginine is at position 6 of the adjacent finger (library 3 and 9 environments). d, Left, paired format for two-finger selections using the base-skipping linker to encourage modularity, allowing test pairs (yellow) to function independently from the fixed pair (blue). Right, cartoon of B1H two-finger selections. e, The number of helices enriched in two-finger selections is shown as a function of the number of single-finger libraries in which they originated. f, Comparison of helices enriched in the two-finger selections showing average number of single-finger libraries in which a helix originated, by binding site.

General and specialized binding strategies

For a more fine-grained analysis of the differences between libraries, such as the types of binding strategies enabled by one library environment versus another, we compared the clusters generated by MUSI for each target site selection. For most targets we found general strategies common to several successful library selections. We also found specialized strategies recovered in a small number of selections and, in some cases, recovered with only a single library environment (Fig. 2b). Previous work has shown that recovery of helical strategies in one library versus another is indicative of activity only in the recovered contexts, rather than sampling influences35. In addition, because different library contexts present different structural influences at the overlap, we investigated the physical influences that might lead to the selection of a particular type of ZF in a specific library context. For example, in most NCG selections we found a cluster of ‘QxRYxx’ helices (see CCG in Fig. 2b). However, this cluster was not recovered in libraries that presented an arginine from the adjacent finger at the overlap (libraries 3 and 9). Molecular dynamics simulations suggested that this is due to potential competition between position 2 of the selected finger and the arginine at position 6 of the adjacent finger (Fig. 2c).

The data demonstrate global and specific differences in ZF function dictated by the adjacent finger environment, but they represent only a small number of potential adjacent finger influences. To test how greater variability at the interface might influence compatibility, we created 200 two-finger libraries by assembling pools of helices successfully selected to bind each 3-bp half-site of a 6-bp target. These pools represented helices preselected to bind each half-site of the target across the small but diverse adjacent finger influences assayed in our primary selections. The 6-bp targets for these 200 two-finger selections were chosen to accommodate the construction of ZF nucleases (ZFNs) that will bind longer sequences in the enhanced green fluorescent protein (eGFP) coding sequence (Supplementary Fig. 10). In this way we were able to test and validate the function of the selected two-fingered modules in the context of longer arrays necessary for ZFN activity while providing detailed compatibility data that could be used to train the model. Mimicking single-finger selections, we used two fixed fingers with known specificity to anchor the binding and properly position ZF pairs in the pool library to engage the test 6-bp target (Fig. 2d). To minimize the potential influence of fixed fingers on ZF pairs in the library we employed a long, flexible linker between the fixed and library pairs to encourage independence in binding. This linker prefers a base to be skipped between the binding sites of the fixed and screened ZF pairs, reducing the potential influence of fixed fingers on ZFs in the library and encouraging these pairs to work as independent modules36. In this way, the screens should produce ZF pairs that are dependent on one another but also function as an independent module relative to the fixed pair in the array. 
We selected compatible pairs of ZFs from these 200 libraries and analyzed the number of starting library environments from which the helices were enriched. Most helices enriched in these compatibility assays were recovered in only a minority of the library environments (Fig. 2e). This suggests that, despite the fact that all of these helices were preselected to bind each half-site, only a fraction is enabled in these new environments. Interestingly, when we plotted compatible helices by target selection and examined the number of primary libraries in which they were recovered, we again found that G-binding ZFs recovered in the two-finger selections originated in a large number of the primary libraries while compatible ZFs recovered to bind G-poor targets originated in a small number of library selections (Fig. 2f). Together these results demonstrate that, even for a more comprehensive set of presented environments, the interface has a large influence on ZF function and that G-rich binding helices tend to be more modular and promiscuous. The data from these two-finger library selections offer crucial insight into the pairwise compatibility of individually functional ZFs.

Hierarchical transformer integrates selection data

Despite considerable effort, previous attempts to generate a general ZF design code have failed. However, those attempts were hampered by sparse datasets that ignored adjacent finger influences and/or severely undersampled the potential complexity. Given the unprecedented depth of our screening data, we sought to develop a unique model that explicitly addresses these neighbor finger influences. Moreover, artificial intelligence technology, in particular from natural language processing, vastly outperforms earlier machine learning models at capturing intricate detail in large pools of data. We separately make use of the comprehensive single-finger library selections that describe specificity in a variety of neighbor finger contexts, as well as the 200 pair selections confirming which ZFs are compatible with each other as neighbors (Fig. 3a). This information is by nature hierarchical and, to make optimal use of it, we developed a neural network architecture that implements attention modules in a hierarchical manner (Fig. 3b). The first layer of this hierarchical architecture contains two modules trained on the single-finger selection data for each of the half-sites of a desired two-finger target. Because we consider each 3-bp target plus the adjacent base, this becomes two overlapping 4-bp targets or a 7-bp, two-finger target. The single-helix modules generalize to unseen sequences; interestingly, residue–nucleotide relationships are captured in the attention values (Supplementary Figs. 7 and 8). The residue embeddings from these two modules are then fed into a top module trained on data recovered from the 200 ZF pair selections (Fig. 3b). This is akin to the experimental procedure of taking selection pools from single-finger selections and performing two-finger selections on them (compare Fig. 3a,b). In effect, the modules of the first layer design functional single ZFs (for a given neighbor environment) while the second layer module assembles compatible ZF pairs.
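The target decomposition described above can be made concrete with a short sketch. It only illustrates the bookkeeping: a 7-bp two-finger target splits into two 4-bp windows that share one base. Which window is assigned to which finger depends on strand and orientation conventions that are not restated here, so the function name and the neutral "first"/"second" naming are our assumptions.

```python
def split_two_finger_target(target_7mer):
    """Split a 7-bp target into two overlapping 4-bp windows.

    Each window is a 3-bp half-site plus one adjacent base, so the two
    windows share the middle base of the 7-mer (index 3).
    """
    if len(target_7mer) != 7:
        raise ValueError("expected a 7-bp target")
    first_window = target_7mer[:4]
    second_window = target_7mer[3:]
    return first_window, second_window


first, second = split_two_finger_target("ACGTACG")
print(first, second)  # ACGT TACG
```

Each window would be passed to one of the two single-helix modules, and their residue embeddings combined by the top module.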

Fig. 3: An interface-focused ZF design model. a, The model comprises two modules trained on single-helix B1H selections to predict residues in partially masked helices that bind 4-mer nucleotide sequences. b, The residue embeddings generated from these modules are fed into a third module that learns interhelix compatibility. The full model is trained on two-helix B1H selection data to predict residues in partially masked helix pairs that bind 7-mer nucleotide sequences. In the model architecture schematic, layer normalization is abbreviated to “layer norm.” and concatenation is abbreviated to “concat”.

The overall model retains a traditional encoder–decoder architecture: An encoder generates a high-dimensional representation for each DNA base and a decoder then generates predictions for each residue in a ZF helix, using self-attention layers and attention layers that relate nucleotide bases to helical residues. The model was trained using the masked language model objective37; during training we provided the nucleotide target as well as a partially masked ZF sequence and evaluated cross-entropy loss between predicted residues and the ground-truth ZF sequence (Methods). We achieved reconstruction accuracy (sequence identity to the six masked residues) of 0.62 and 0.69 on the validation and test data, respectively; some positions (such as −1) that were strong determinants of binding specificity had higher reconstruction accuracies (Fig. 4a–c). Overall, because some variability in the 12 residues is allowable while retaining the ability to bind a target sequence, 0.62–0.69 reconstruction accuracy can be considered quite high (Fig. 4c).
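The masked-residue objective and the reconstruction accuracy reported above can be illustrated without a neural network. The sketch below assumes a 12-residue helix pair, a `#` mask symbol and per-position probability dictionaries as the model output; all of these are illustrative choices, not details taken from the Methods.

```python
import math
import random

MASK = "#"  # hypothetical mask symbol


def mask_helix(helix, n_masked, rng):
    """Hide n_masked randomly chosen residues; return masked string and positions."""
    positions = sorted(rng.sample(range(len(helix)), n_masked))
    masked = "".join(MASK if i in positions else aa for i, aa in enumerate(helix))
    return masked, positions


def cross_entropy(predicted_probs, truth, positions):
    """Mean negative log-probability of the true residue at masked positions.

    predicted_probs[i] is a dict mapping residue -> probability at position i.
    """
    return -sum(math.log(predicted_probs[i][truth[i]]) for i in positions) / len(positions)


def reconstruction_accuracy(predicted, truth, positions):
    """Fraction of masked positions where the predicted residue is correct."""
    return sum(predicted[i] == truth[i] for i in positions) / len(positions)
```

With six residues masked, a prediction that recovers four of them has a reconstruction accuracy of about 0.67, which is in the range reported for the validation and test data.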

Fig. 4: Performance of two-helix design model. a, Training and validation accuracy during pretraining step. b, Training and validation accuracy during fine-tuning step. c, Helix sequence reconstruction accuracy with different numbers of masked residues. d, Comparison of differences between predicted and real selection logos using the developed model and ZFPred based on the mean-square error (MSE) of predicted position weight matrices (PWMs) to ground-truth PWMs. e, Comparison of differences between predicted and real selection logos using the two-helix model and concatenated logos from the single-helix design model. f, Comparison of differences between predicted and real selection logos using the two-helix model and concatenated logos from single-helix B1H selections. g, Predicted logos, real B1H logos and concatenated single-helix B1H logos for test set sequences.
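The PWM-based comparison in panels d–f can be sketched as follows. This is a minimal illustration that assumes PWMs are simple per-position residue frequencies with no pseudocounts or weighting; the study's exact PWM construction is not reproduced, and the function names are ours.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def pwm(helices):
    """Per-position residue frequencies for a set of equal-length helices."""
    matrix = []
    for i in range(len(helices[0])):
        counts = Counter(h[i] for h in helices)
        matrix.append([counts[aa] / len(helices) for aa in ALPHABET])
    return matrix


def pwm_mse(predicted, observed):
    """Mean-square error over all entries of two PWMs of the same shape."""
    diffs = [
        (p - o) ** 2
        for row_p, row_o in zip(predicted, observed)
        for p, o in zip(row_p, row_o)
    ]
    return sum(diffs) / len(diffs)
```

A lower MSE between the PWM of generated helices and the PWM of helices recovered in the corresponding B1H selection indicates a closer match between predicted and real logos.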

ZFDesign generates compatible ZF pairs

Our method (ZFDesign) generates sequences in an incremental fashion: Starting from an empty sequence, the model is run once for each amino acid in the ZF helix pair. At each iteration an amino acid is predicted, and this prediction is provided as context in subsequent iterations. For optimal sequence generation we adapted both an A*-based sampling method38 and a temperature-dependent sampling procedure39. We sought to compare ZFDesign with a baseline, but no previous model has explicitly attempted to perform full ZF-array design for a given target and with only a few collections of ZFs available. However, previous models designed to capture ZF binding specificity exist and can be adapted to design ZFs for given targets; we used ZFpred, a recently developed method that outperformed previous models35. We then used both ZFDesign and ZFpred to generate ZF sequences to target 6-mers from our test dataset. As alternative baseline comparisons, we first used the single-finger models (for example, only the bottom module in Fig. 3b) to generate ZF sequences for each DNA 3-mer and concatenated them. In a similar fashion, we also took sequences directly from each of our 3-mer bacterial one-hybrid (B1H) selections and concatenated them, which is akin to previous methods of simply concatenating pre-existing collections of fingers as modules. All three methods performed noticeably worse than our hierarchical model (Fig. 4d–f). When directly comparing representative sequence logos of the sequences generated, ZFDesign produced logos that broadly captured those from the B1H two-helix selections whereas the concatenated logos from the one-helix selections were noticeably different (Fig. 4g and Supplementary Fig. 9), underlining the fact that ZFDesign captures interhelix relationships absent from single-helix selections.
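The incremental, temperature-dependent generation described above can be sketched with a stand-in model. Everything here is an assumption for illustration: `model` is any callable returning per-residue scores given the target and the residues generated so far, and the A*-based sampling used in the study is not implemented.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample one residue from softmax(logits / temperature).

    Low temperatures concentrate probability on the top-scoring residue;
    high temperatures flatten the distribution.
    """
    scaled = {aa: score / temperature for aa, score in logits.items()}
    top = max(scaled.values())  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


def generate(model, target, length, temperature, rng):
    """Build a helix one residue at a time, feeding each choice back as context."""
    sequence = ""
    for _ in range(length):
        logits = model(target, sequence)  # hypothetical model interface
        sequence += sample_with_temperature(logits, temperature, rng)
    return sequence


def toy_model(target, prefix):
    """Stand-in scorer that strongly prefers arginine at every position."""
    return {"R": 100.0, "Q": 0.0}


print(generate(toy_model, "GCGTGGG", 12, 0.1, random.Random(0)))
```

Raising the temperature in this sketch trades fidelity to the top-scoring residues for diversity among the generated helices.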

For experimental validation of ZFDesign we performed a GFP disruption assay in a U2OS cell line previously used to approximate nuclease activity for ZFNs40, TALENs41 and SpCas9 (ref. 42), because indels in the coding sequence of GFP lead to frameshifts and loss of fluorescence. For each ZFN, two ZF arrays were designed, each carrying the FokI catalytic domain as a C-terminal fusion in a tail-to-tail orientation, such that cleavage requires dimerization of FokI (Supplementary Fig. 10a). The arrays use a longer linker between two-finger modules to enable independent binding, because the linker allows a base to be skipped between the binding sites for each two-finger module36. The DNA targets for the two-finger selections detailed above had been specifically chosen to accommodate targets in the GFP coding sequence. Therefore, for each target we first assembled ZFNs that use four ZFs per monomer (eight per ZFN) based on the most frequent pairs recovered in the corresponding two-finger selections. Next, we designed five ZFNs that also use four ZFs per monomer for comparison with the B1H-selected ZFs that bind the same targets. All of the designed ZFNs were functional above background, but four of the five demonstrated decreased activity relative to the selected arrays (Supplementary Fig. 10b). However, the substitution of single modules largely increased activity (Supplementary Fig. 10c), demonstrating the stringency of the assay because a single weak module can have a large impact on overall function. Nevertheless, because these designs were functional on all targets, and longer arrays have overcome the presence of weak modules43, we designed and tested 16 ZFNs that use six ZFs per monomer (12 per ZFN). We found all 16 to be functional, with a mean 53.6% loss of fluorescence (Supplementary Fig. 10d). Finally, to determine whether six fingers are sufficient for monomeric binding, we designed a six-finger array to label a genomic locus as a GFP fusion. 
Because many copies of GFP are necessary to visualize punctate GFP expression, we designed the array to bind a repetitive sequence on chromosome 14, which appears three times in HEK293T cells. We observed three points of GFP fluorescence by live cell imaging (Supplementary Fig. 10e). These results suggest that ZFDesign consistently produces highly functional ZF arrays and that six or more fingers routinely produce strong on-target activity in the human genome.

Seamless reprogramming of human transcription factors

Because half of human TFs use ZFs to engage DNA, we reasoned that these endogenous ZF domains could be seamlessly replaced by designed ZFs without impacting the protein’s regulatory function (Fig. 5a). This approach presents the designed ZFs in the exact context in which ZFs would occur naturally in the parent protein. Such reprogrammed transcription factors (RTFs) present the effector domain in its natural context, maximize secondary interactions of the TF, avoid the use of foreign effector domains and enable research focused on the precise investigation of TF binding events. As potential therapeutics, RTFs present maximally native-like human proteins with correspondingly low immunogenicity risk. We chose the TF encoded by KLF6 as our activation scaffold because we recently identified KLF6 as a potent activator when tethered to a reporter gene44. To test the activity of the KLF6 architecture, we replaced KLF6 ZFs with a series of ZF arrays designed to bind the tet operator sequence (tetO) (Supplementary Fig. 12) and expressed the resulting RTFs in a stable HEK293T cell line containing a GFP reporter with a minimal promoter and seven tetO sites44,45 (Fig. 5b). Three of the four designs activated the reporter at a similar or greater level than rTetR-VP64. Next, we replaced the DBDs of three other activating TFs (genes KLF7, FOXR2 and ZXDC)44 with Tet ZF array 3 (Fig. 5b). All of these RTFs activated the reporter as well as or better than the rTetR-VP64 control. This included the FOXR2 RTF, where its natural forkhead DBD was replaced by the ZF array (Supplementary Fig. 13a).

Fig. 5: RTFs. a, Left, the ZFs of KLF6 are seamlessly replaced by designed ZFs. The consensus ZF motif, listed below, is used to guide the seamless replacement of parent ZFs. Right, sequence of the KLF6 TF and precise location of ZF replacement. b, A GFP reporter is activated with four ZF arrays designed to bind the tetO sequence. Array 3 is used to show that TFs other than KLF6 can also be reprogrammed to bind the tetO sequence and regulate the target. c, A GFP reporter is repressed by ZIM3 reprogrammed with tetO-binding array 3. This array can also be used to reprogram other repressing TFs in addition to ZIM3. d, Relative expression of CDKN1C by KLF6 reprogrammed with seven ZF arrays designed to bind sequences upstream of the TSS. e, Relative expression of DPH1 repressed by ZIM3 reprogrammed with 11 ZF arrays designed to bind sequences downstream of the TSS.

We chose a TF encoded by ZIM3 as our TF scaffold for repression, because the ZIM3 KRAB domain has proven a potent repressor as a SpCas9 fusion46. We replaced ZIM3 ZFs with the same series of tetO-binding ZF arrays tested with KLF6. We expressed these ZIM3 RTFs in a HEK293T cell line with a GFP reporter driven by a constitutive promoter. Three of the four ZF arrays repressed GFP expression relative to controls, with array 3 outperforming dCas9 (Fig. 5c and Supplementary Fig. 13b). To confirm that this RTF approach for repression was not restricted to the ZIM3 protein, we replaced the ZFs of three other KRAB-containing proteins (genes ZNF10, ZNF264 and ZNF324) with ZF array 3. In all cases we observed similar levels of repression (Fig. 5c). Interestingly, despite the fact that the KOX1 KRAB domain (ZNF10) provides less repression potential than the ZIM3 KRAB domain when expressed as an isolated SpCas9 fusion domain46, its activity was similar when expressed here as an RTF, suggesting that the presentation context can have a large impact on the potency of these domains.

Moving beyond reporter-based assays, we next designed RTFs to regulate genes from their natural loci in the human genome. We applied the ZIM3 architecture to repress three endogenous genes (DPH1, RAB1A and UBE4A). For each target gene we designed three arrays: one that binds at the transcriptional start site (1), and two that bind the forward (2) and reverse (3) sequence corresponding to a guide RNA target previously identified as a potent repressor of these genes by CRISPR interference. To maximize the likelihood of function, we designed these as eight-finger proteins and maintained the base-skipping linker between each designed ZF pair (Supplementary Fig. 15). HEK293T cells were transfected with RTFs and expression levels assayed by quantitative PCR with reverse transcription (RT–qPCR). While eight of the nine RTFs repressed the target gene, only two did so by >50% (Supplementary Fig. 15b). However, considering the extreme size difference between Cas9 and ZFs, it is possible that these functional positions for Cas9 are not optimal for ZFs. Therefore, we designed 11 ZF arrays to bind sequences across a 252-bp region downstream of the transcription start site (TSS) for DPH1. Again, we expressed these arrays in the ZIM3 scaffold and assayed DPH1 expression. All arrays repressed DPH1 relative to controls, and two arrays, 15 and 126, repressed DPH1 by >80% (Fig. 5e). Finally, to activate an endogenous target we took a similar approach and reprogrammed KLF6 with a series of arrays designed to canvass a 150-bp region upstream of the TSS in the CDKN1C promoter. All seven RTFs increased the expression of CDKN1C (in three of the seven by 9–43-fold) (Fig. 5d).

Specificity and genome-wide regulatory activity of RTFs

ZFDesign enables the reprogramming of TFs for either activation or repression. To test the precision of regulation we used RNA sequencing (RNA-seq) to quantify RTF on- and off-target regulation. We focused on the two most potent KLF6 RTF activators for CDKN1C (arrays 125 and 200) and the most potent ZIM3 RTF repressor of DPH1 (array 15) (Fig. 5d,e). In all cases the target gene was the most, or one of the most, significantly regulated genes, but off-target activity ranged from six (DPH1, array 15) to 268 (CDKN1C, array 200) and 1,173 (CDKN1C, array 125) misregulated genes (Fig. 6a). Because KLF6 and ZIM3 are human TFs, we tested whether off-target activity is due to secondary interactions of the TF rather than of the ZF arrays. RNA-seq was carried out for CDKN1C arrays 125 and 200 using VP64 as the activation domain in place of KLF6, as well as for KLF6 without any ZFs. These data suggest that off-target activity is primarily dictated by the ZF arrays, because the KLF6-ZF and VP64-ZF fusions had similar off-targets and KLF6 without ZFs altered the expression of only four genes (Supplementary Fig. 16).

Fig. 6: ZF specificity and genome-wide activity. a, Genome-wide RNA-seq results for CDKN1C arrays 125 and 200 and DPH1 array 15, and comparison with an array that binds the reverse complement of CDKN1C array 125. b, Left, structure of a ZF bound to DNA highlighting two potential phosphate contacts51. Right, the human ZF consensus with phosphate-contacting positions highlighted in yellow (−5) and blue (9)49. c, qPCR comparison for activation of the on-target CDKN1C gene as well as two off-target sequences with CDKN1C array 200 with between none and eight modifications at phosphate-contacting position −5. d, RNA-seq results for CDKN1C array 200 with arginines or glutamines at the −5 position of each ZF. e, On-target qPCR results for arrays with the N-terminal (F3–8) or C-terminal (F1–6) ZF pairs removed compared to an empty vector negative control (neg.). f, Specificity of CDKN1C array 200 with glutamine at position −5 as determined by ChIP–seq, B1H selection at low (5 mM) and high (10 mM) stringency and specificity as predicted by ZFDesign. B1H specificity is a concatenation of the specificities determined for each of the two-finger pairs. ChIP–seq peaks contained two independent motifs, suggesting that the base-skipping linker allows modular, independent binding.

The specificity of ZF arrays can be impacted by both target content and affinity. As noted, G-rich binding tends to be more promiscuous. Consistent with this observation, the CDKN1C target with the lowest G content (no. 200; Supplementary Fig. 17) also led to the fewest off-target events while target 125 led to the most. To further test the influence of G-rich binding, we designed an array to target the reverse complement of the 125 sequence, which is necessarily a C-rich sequence. This approach reduced off-target activity to just one off-target gene (Fig. 6a, right). In addition to minimization of target G content, ZF specificity can be improved by reducing the nonspecific affinity provided by contacts between each ZF and the phosphate backbone47,48 (Fig. 6b). This puts more pressure on the base-specifying interactions of each helix to provide the binding affinity necessary for function. We created mutant versions of CDKN1C array 200 that replace two, four or eight phosphate-contacting arginines at position −5 of the ZF scaffold with glutamines. Position −5 is dominated by basic residues in an analysis of human ZFs49 (Fig. 6b). We first compared the impact of these mutations by qPCR at both the target locus and two off-target loci found to be upregulated by RNA-seq (Fig. 6c). The expression of these two off-target genes was reduced by up to 70% and 55%, respectively, as we increased the number of phosphate-contacting modifications. On-target activity, on the other hand, was reduced by only 12%. RNA-seq demonstrated that the number of off-targets decreased as the number of modifications increased, and that only CDKN1C was upregulated with the full eight arginine-to-glutamine modifications, thus providing single-target regulation (Fig. 6d). Interestingly, taking the same approach with DPH1-repressing array 15 resulted in a large reduction in on-target activity (Supplementary Fig. 18). It is notable, however, that the unmodified version of this array produced only six misregulated genes initially. This might suggest that DPH1 array 15 is already in a low-affinity regime, in which affinity cannot be reduced further while preserving on-target activity. Together these data suggest that affinity and genome-wide specificity are tightly linked.
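The statement that the reverse complement of a G-rich target is necessarily C-rich can be checked with a few lines of code. The sketch below is illustrative only: the 18-bp sequence is a hypothetical placeholder, not the actual CDKN1C array 125 target, and the helper names are our own.

```python
# Minimal sketch: the G content of a site equals the C content of its reverse complement.
# Assumption: `target` is a made-up G-rich placeholder, not a sequence from this study.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def base_fraction(seq: str, base: str) -> float:
    """Fraction of positions in `seq` occupied by `base`."""
    seq = seq.upper()
    return seq.count(base.upper()) / len(seq)


target = "GGGAGGGCGGGTGGGAGG"          # hypothetical G-rich 18-bp site
rc = reverse_complement(target)

print(rc)
print(f"G fraction of target:             {base_fraction(target, 'G'):.2f}")
print(f"G fraction of reverse complement: {base_fraction(rc, 'G'):.2f}")
print(f"C fraction of reverse complement: {base_fraction(rc, 'C'):.2f}")
```

Because every G on one strand pairs with a C on the other, an array designed against the opposite strand addresses the same genomic locus while reading a G-poor sequence.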

For the arrays used to regulate endogenous targets we used four pairs of ZFs to provide a greater opportunity for function, but it is not clear whether eight ZFs are necessary. A six-finger array provides up to 18 bp of specificity, which in principle would give single-target resolution in the human genome. To test whether eight fingers are necessary in each array, we tested the on-target activity of CDKN1C arrays 125 and 200 and DPH1 array 15 with either the N- or C-terminal ZF pair removed (Fig. 6e). In the case of CDKN1C array 200, removal of either terminal pair led to a large decrease in activity. Conversely, removal of the C-terminal pair from CDKN1C array 125 or of the N-terminal pair from DPH1 array 15 had very little impact on activity. Together, these data suggest that in some cases (CDKN1C array 125 and DPH1 array 15), but not all (CDKN1C array 200), a six-finger version of the protein can provide similar activity, whereas eight-finger proteins may capture activity at more challenging targets.
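The claim that 18 bp is enough for single-target resolution can be illustrated with a back-of-the-envelope estimate. The sketch below assumes 3 bp recognized per finger, a haploid genome of roughly 3.1 × 10^9 bp, both strands searched and uniformly random base composition. These are simplifying assumptions of ours rather than values from this study, and the real genome is far from random.

```python
# Minimal sketch: expected number of chance occurrences of an n-bp site in a random genome.
# Assumptions (not from the source): genome size, uniform base composition, 3 bp per finger.

GENOME_BP = 3.1e9      # approximate haploid human genome size
BP_PER_FINGER = 3      # idealized recognition length per zinc finger


def expected_random_matches(site_bp: int, genome_bp: float = GENOME_BP) -> float:
    """Expected count of exact matches to one specific site, counting both strands."""
    positions = 2 * genome_bp          # each position can be read on either strand
    return positions / 4 ** site_bp    # probability of an exact match is 4^-n


for fingers in (4, 6, 8):
    site_bp = fingers * BP_PER_FINGER
    print(f"{fingers} fingers ({site_bp} bp): "
          f"{expected_random_matches(site_bp):.2e} expected chance matches")
```

Under these idealized assumptions a 12-bp site is expected to recur hundreds of times by chance, whereas an 18-bp site is expected fewer than once, which is consistent with the "up to 18 bp" argument. Imperfect specificity at individual positions reduces the effective site length in practice.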

Finally, to characterize the DNA-binding specificity of the ZF arrays, we first performed independent B1H binding selections for each pair within CDKN1C arrays 125 and 200 and DPH1 array 15 (Supplementary Fig. 19). Because the flexible linker between pairs in these arrays will lead to modular and independent binding (Fig. 2d), these screens were used to confirm that each pair can bind its subtarget independently of the full array. We screened these pairs with an 8-bp library at two stringencies and found that specificities were in general agreement with their designed targets (Fig. 6f and Supplementary Fig. 20). Next, we tested the genome-wide specificity of the full eight-finger arrays by chromatin immunoprecipitation sequencing (ChIP–seq). Here we found that, despite the proper specificity provided by each pair, subsets of ZFs appear to drive binding genome wide. This is probably due to the modular design of our arrays, which allows each pair to bind independently, so that the highest-affinity pairs will guide most ChIP–seq peaks. We used this design approach for simplicity in this proof-of-concept stage but, in future work, ZF arrays employing conventional linkers that do not skip bases should exhibit reduced off-target binding, because such linkers allow much less independent binding of ZF pairs. It is furthermore worth noting that ZFDesign can be used to approximate the specificity of designed arrays with relatively high accuracy (Methods), and with much better accuracy than approaches trained specifically to predict ZF specificity (Fig. 6f and Supplementary Fig. 20). This suggests that ZFDesign can potentially be used to identify ZF arrays that are more likely to show specific binding before experimental validation.