Dear Editor,

The most striking feature of a transcription activator-like effector (TALE) is the presence of a central DNA-binding region composed of tandem repeats of about 34 amino acids1. Two hypervariable residues at positions 12 and 13 (repeat-variable diresidues or RVDs) in each repeat bind to DNA, and this modular DNA-binding feature of TALE repeats has inspired the development of custom-designed TALE repeats for gene editing2,3,4,5. The nucleotide recognition preference of the commonly used RVDs has been experimentally or computationally determined2,5. For instance, RVD NN has a high preference for both G and A. The rare RVDs, NK and NH, have better specificity for guanine than NN, but their affinity is relatively low3,6,7. Because the commonly used RVDs thus leave room for improvement, we conducted a systematic investigation of the DNA recognition capabilities of all potential RVDs, covering every possible combination of amino acid diresidues.

We set up a screening platform composed of an artificial TALE-VP64-mCherry construct, which expresses RVD (XX') in 3-tandem repeat format (from 7th to 9th, TALE-(XX')3), and 4 corresponding EGFP reporter constructs, in which potential TALE-(XX')3-binding sites composed of 3 consecutive nucleotides (A, T, C or G) are located in front of a minCMV promoter and its downstream EGFP gene (Figure 1A, Supplementary information, Figure S1A and Data S1). To test this system, we made a control TALE (TALE-Ctrl) that is identical to TALE-(XX')3 except for the repeat domain (Supplementary information, Figure S1A), and confirmed that it could not activate any of the 4 EGFP reporters (Supplementary information, Figure S1B), thus serving as the control for basal activity. We then constructed 4 TALE-(XX')3 expression plasmids by placing the common RVDs (NI, NG, HD and NN) in the middle to target the 3A, 3T, 3C and 3G EGFP reporters, respectively (Supplementary information, Figure S1C). These TALE-(XX')3 constructs were individually introduced into HEK293T cells together with 1 of the 4 EGFP reporter plasmids to examine their specificities, which were determined by the fold induction of EGFP expression compared with the basal level (Supplementary information, Data S1). The identity of XX' determined TALE-(XX')3 specificity on different EGFP reporters: NI, NG, HD and NN predominantly recognized A, T, C and G or A, respectively. This result is consistent with the current knowledge regarding the base preference of these 4 common RVDs, demonstrating that this artificial system is suitable for testing the DNA recognition ability of RVDs (Supplementary information, Figure S1D-S1E).
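
As a minimal sketch of the specificity readout described above, the fold induction for each reporter can be computed as the EGFP signal obtained with a TALE-(XX')3 construct divided by the basal signal obtained with TALE-Ctrl. The exact normalization used in the study is given in Data S1 and is not reproduced here; the function name, the dictionary layout and all numbers below are hypothetical, and any normalization to mCherry is omitted.

```python
def fold_induction(reporter_signal: float, basal_signal: float) -> float:
    """Ratio of the EGFP signal with a test TALE to the basal (control) signal."""
    if basal_signal <= 0:
        raise ValueError("basal signal must be positive")
    return reporter_signal / basal_signal


# Made-up EGFP readings in arbitrary units; these are NOT data from the study.
basal = {"3A": 5.0, "3T": 4.0, "3C": 5.0, "3G": 4.0}       # with TALE-Ctrl
with_tale = {"3A": 50.0, "3T": 4.4, "3C": 5.5, "3G": 6.0}  # with a test TALE-(XX')3

# One fold-induction value per reporter, then the most strongly induced reporter.
folds = {rep: fold_induction(with_tale[rep], basal[rep]) for rep in basal}
best_reporter = max(folds, key=folds.get)
```

With these illustrative values the 3A reporter is induced most strongly, which is the pattern expected for an adenine-preferring RVD such as NI.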

Figure 1

Complete assessment of TALE RVD efficiencies and specificities. (A) Design of the screening system for novel TALE RVDs. (B) A heat map generated from library screening of TALE-(XX')3 with four reporters (3A, 3T, 3C, and 3G) reflecting the base preference of 400 RVDs. EGFP activities from different reporters were coded by different colors representing the reporter identities (3A, green; 3T, red; 3C, blue; 3G, yellow), and the brightness of the colors indicates the fold induction of reporters by TALE-(XX')3 compared to the basal levels. The single-letter abbreviations for the amino acids are used. (C) Design of TALE-(XX')6 and its corresponding reporters. (D) Design of TALE-(XX')12 and its corresponding reporters. (E-F) Base preference of RVDs in TALE-(XX')6 (E) and TALE-(XX')12 (F). RVDs were clustered by base preference. The x-axis labels indicate the variable RVDs tested in TALE-(XX')6 or TALE-(XX')12. Data are means ± SD, n = 3; *P < 0.05, **P < 0.01, and ***P < 0.005.

To quantitatively measure the base preference of all theoretical RVDs, we created a library of TALE-(XX')3 constructs, which covers a total of 400 types of RVDs, following a special protocol combined with the ULtiMATE assembly method8 (Supplementary information, Figure S2 and Data S1). X and X' correspond to the 12th and 13th amino acids in a classical TALE module, respectively. We introduced each of the 400 TALE-(XX')3 constructs (Supplementary information, Tables S1 and S2) individually into HEK293T cells together with 1 of the 4 EGFP reporter plasmids and measured both the EGFP and mCherry levels by FACS analysis. We then determined the base-recognition efficiencies of the 400 diresidues. A total of 1,600 data points (400 RVDs × 4 reporters) were summarized in 3 formats: a heat map (Figure 1B), histograms categorized by the 13th residue (X') (Supplementary information, Figure S3) and histograms categorized by the 12th residue (X) (Supplementary information, Figure S4). The results obtained from this screening provide substantial information regarding the base preference of all theoretical RVDs. In addition to NI, NG, HD and NN, all the natural RVDs and a few artificial RVDs showed base-recognition preferences that were similar to those reported previously2,5,6,7 (Supplementary information, Table S3). Besides these 25 RVDs, we determined the DNA base-recognition preference of the remaining 375 RVDs that did not evolve naturally and have not been previously examined.
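
The size of the library follows directly from the combinatorics: 20 amino acids at position 12 times 20 at position 13 gives 400 RVDs, and testing each against 4 reporters gives the 1 600 data points. The short sketch below only enumerates these combinations and groups them by the 13th residue, as in the histogram layout described above; the variable names are illustrative and it says nothing about how the constructs were assembled.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, single-letter codes
REPORTERS = ("3A", "3T", "3C", "3G")

# Every RVD is a diresidue XX': X at position 12, X' at position 13.
rvds = [x + x_prime for x, x_prime in product(AMINO_ACIDS, repeat=2)]

# One measurement per (RVD, reporter) pair.
measurements = [(rvd, reporter) for rvd in rvds for reporter in REPORTERS]

# Group RVDs by their 13th residue (X').
by_13th = {aa: [rvd for rvd in rvds if rvd[1] == aa] for aa in AMINO_ACIDS}

print(len(rvds), len(measurements))  # 400 1600
```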

Notably, many of these artificial RVDs showed a distinct preference for DNA bases compared with the 25 reported RVDs, and only a few of them start with 1 of the 2 frequently occurring amino acids, Asn and His (Figure 1B, Supplementary information, Figures S3 and S4). From these artificial RVDs, we selected those that showed potential base-recognition preference, based on the criteria shown in Supplementary information, Data S1, for further intensive analyses. We found that the adenine recognition ability of KI and RI was similar to that of NI (Supplementary information, Figure S5A). For thymine recognition, we identified 3 additional RVDs aside from NG that all end with Gly (RG, KG and HG), as well as 7 RVDs that all end with Ala (KA, CA, FA, YA, RA, PA and AA); the latter, however, appeared to have higher background, especially for C recognition (Supplementary information, Figure S5B). HD and ND, as reported previously2,5,7, were optimal RVDs for C recognition, with almost no non-specific recognition of other bases (Supplementary information, Figure S5C). Five groups of RVDs were identified to recognize guanine, with each group sharing the same 13th residue: Asn (N), His (H), Arg (R), Gln (Q), or Lys (K). Most of these RVDs predominantly recognized guanine except for HN and NN (Supplementary information, Figure S5D). These data support the prediction from previous TALE structural studies suggesting that the 13th residues of TALE repeats make the base-specific contact9,10. Nevertheless, our data indicate that the 12th residue also affects RVD specificity. For example, with the same N13, KN and RN only recognized G, whereas HN and NN recognized both A and G, and LN and MN preferred T and C. Similarly, HQ, KQ and RQ preferentially recognized G, whereas LQ preferred T (Supplementary information, Figures S3-S6).
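
To illustrate how a set of four fold-induction values can be turned into a statement such as "KN only recognized G, whereas NN recognized both A and G", the sketch below applies a simple two-threshold rule. The actual selection criteria are those given in Data S1 and are not reproduced here; the function, both threshold values and all example numbers are hypothetical and serve only to show the kind of rule that could be used.

```python
def recognized_bases(folds, min_fold=3.0, min_fraction=0.5):
    """Return the bases whose reporters are clearly induced.

    A base counts as recognized when its fold induction is at least
    `min_fold` and at least `min_fraction` of the strongest induction.
    Both cutoffs are arbitrary illustration values, not the study's criteria.
    """
    top = max(folds.values())
    if top < min_fold:
        return []  # no reporter is induced above the cutoff
    return sorted(
        base for base, fold in folds.items()
        if fold >= min_fold and fold >= min_fraction * top
    )


# Made-up fold inductions, chosen only to mimic the patterns described in the text.
g_specific = {"A": 1.1, "T": 1.0, "C": 1.2, "G": 9.0}   # KN-like pattern
g_and_a = {"A": 8.0, "T": 1.2, "C": 1.0, "G": 10.0}     # NN-like pattern
inactive = {"A": 1.2, "T": 0.9, "C": 1.1, "G": 1.3}

print(recognized_bases(g_specific))  # ['G']
print(recognized_bases(g_and_a))     # ['A', 'G']
print(recognized_bases(inactive))    # []
```

A relative cutoff is included alongside the absolute one so that a weak secondary signal next to a very strong primary signal is not reported as dual recognition; whether that is appropriate depends on the assay's dynamic range.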

To further examine the base-recognition preference of RVDs, we created two additional artificial platforms with increased stringency, in which multiple TALE repeats carrying the same RVDs were aligned in tandem: TALE-(XX')6 and its corresponding EGFP reporter constructs (6A, 6T, 6C, and 6G) (Figure 1C) were used to test RVDs in 6-tandem repeat format, and TALE-(XX')12 and its corresponding EGFP reporter constructs (12A, 12T, 12C, and 12G) (Figure 1D) were used to test RVDs in 12-tandem repeat format (Supplementary information, Table S1 and Figure S2). In addition to the 4 most common RVDs, which were used as controls, we mainly chose those that demonstrated outstanding base-recognition specificities from the initial screening. We found that KI and NI functioned similarly with respect to A recognition in the 6-repeat format (Figure 1E). The activities of TALE-(RG)6 and TALE-(HG)6 were similar to TALE-(NG)6 for 6T recognition (Figure 1E), whereas TALE-(KG)6 showed reduced specificity for 6T (Supplementary information, Figure S7). HD and ND again demonstrated strong C preference (Figure 1E). KN, RN, NH and HH showed specific G recognition with variable efficiencies in 6-tandem repeats, whereas NN and HN recognized both G and A as in the 3-repeat format (Figure 1E), and the 6G preference of TALE-(XX')6 containing either NR, FR, KH, NK, FK or RQ was significantly reduced (Supplementary information, Figure S7). Interestingly, only TALE-(XX')12 with RG (for T), HD (for C), NN (for G) and KN (for G) in 12-tandem repeats maintained recognition efficiency and specificity (Figure 1F). This result is somewhat surprising for RG as it is assumed that RVDs ending with Gly cannot form hydrogen bonds with thymine10. Consistent with previous reports6,7, neither TALE-(NH)12 nor TALE-(HH)12 could support 12G reporter activation. Considering the strong preference of NH for G in the 6-repeat format, it is unclear why TALE-(KN)12 but not TALE-(NH)12 retained activity for the 12G reporter. By the same token, it is also unclear why TALE-(ND)12 completely lost its preference for the 12C reporter (Figure 1F). Although the combination of the 12th and 13th amino acids determines the ultimate binding activity of TALE, the increase of repeat number also leads to the decrease or even complete loss of DNA-recognition activity of TALE, which is likely due to either steric or static repulsion between consecutive TALE repeat units.

To further evaluate these novel RVDs, we used KN and RG in TALEN assembly in place of NN and NG, respectively, and compared them with conventional RVDs in TALEN-mediated DNA cleavage by measuring indel rates. TALENs assembled with KN for G-targeting were as efficient at creating indels as those assembled with NN in 2 independent tests, and both performed better than TALENs assembled with NH. In contrast, TALENs assembled with RG, although functional, were less effective than those assembled with NG (Supplementary information, Table S4). It is possible that other diresidues newly revealed in this study could function as valid RVDs in recognizing DNA bases with high specificity. However, rigorous tests are needed to determine their DNA recognition capabilities more accurately.
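
The substitution tested here can be pictured as swapping entries in a base-to-RVD lookup table. The sketch below maps a target sequence to one RVD per base using the conventional code stated in this letter (NI for A, NG for T, HD for C, NN for G) and optionally replaces NN with KN or NG with RG. It is an illustration only: the function name and the example sequence are hypothetical, and it does not model repeat assembly, flanking sequence requirements or cleavage efficiency.

```python
CONVENTIONAL = {"A": "NI", "T": "NG", "C": "HD", "G": "NN"}


def rvds_for_target(target: str, overrides=None) -> list:
    """Return one RVD per base of `target`, applying optional substitutions."""
    code = {**CONVENTIONAL, **(overrides or {})}
    rvds = []
    for base in target.upper():
        if base not in code:
            raise ValueError(f"unsupported base: {base!r}")
        rvds.append(code[base])
    return rvds


# Hypothetical 4-base target, first with conventional RVDs, then with the
# KN-for-G and RG-for-T substitutions evaluated in the TALEN comparison.
print(rvds_for_target("GATC"))                          # ['NN', 'NI', 'NG', 'HD']
print(rvds_for_target("GATC", {"G": "KN", "T": "RG"}))  # ['KN', 'NI', 'RG', 'HD']
```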

In addition, we identified a substantial number of RVDs that target multiple DNA bases (Supplementary information, Table S5). The availability of RVDs that target different combinations of bases in a degenerate manner may provide flexibility in future applications, such as the engineering of sophisticated genetic circuitry11.

By further deciphering the DNA base preference of all RVDs, natural or artificial, we can achieve a clear understanding of the mechanism that guides the base preference of TALE RVDs. Comprehensive information regarding the specific DNA associations of all RVDs may improve the application of TAL effectors in bioengineering and precision therapy.