Introduction

The past few years have seen the emergence of novel and extremely robust DNA targeting platforms, namely transcription activator-like effectors (TALEs) and clustered regularly interspaced short palindromic repeats (CRISPR)1,2,3. The ability to control and finely tune the targeting specificity of such platforms is a key issue. In TALEs, the DNA-binding core is composed of 33- to 35-amino-acid-long repeats arranged in a tandem array. The individual repeat modules differ at two essential positions (12 and 13), the so-called repeat variable diresidue (RVD), and each contacts a single nucleotide on one strand of the DNA target in an N-terminus to C-terminus direction (in line with 5′ to 3′ of the target)4,5. Structural, biochemical and functional studies have shown that the amino acid at position 13 is responsible for the specificity of recognition, contacting the base of the top DNA strand in the major groove6,7,8. The amino acid at position 12, on the other hand, participates in the stabilization of the repeat units.

The current TALE scaffold has turned out to be a very robust DNA targeting platform for molecular tools. Today, only four DNA targeting modules (RVDs) are routinely used by most researchers (NI, HD, NN and NG). Recently, a few studies have started to highlight the benefit of implementing other natural RVDs9,10. However, naturally occurring RVDs explore only a limited fraction (~5%) of the possible diversity at these two key positions.

As naturally occurring RVDs present various patterns of specificity on the four nucleotides, we hypothesized that exploring the 400 combinations of the two key amino acids (positions 12 and 13) could pinpoint non-conventional RVDs (ncRVDs) with novel alternative specificity patterns. To identify such ncRVDs, we generated and screened collections of TALEN containing alternative combinations of amino acids at positions 12 and 13 at defined positions of an array. Among the thousands of TALEN/target combinations tested (~18,000), we identified ncRVDs that presented novel and valuable specificity features (exclusion properties). The additional level of specificity control offered by these alternative DNA targeting modules represents a major advantage for alleviating off-site targeting. Off-site targets are defined as sequences within the genome that contain a few mismatches relative to the targeted sequence of interest. When considering an intended target and a related off-site sequence, the adequate implementation of ncRVDs that discriminate between the nucleotides present at the mismatched positions (exclusion properties) will prevent TALEN activity at the off-site sequences while allowing processing of the desired locus. As proof of concept, we demonstrated the potential of ncRVD incorporation in TALE arrays by generating TALEN able to discriminate, in mammalian cells, between two closely related loci in the HBB gene (in the vicinity of the mutation responsible for sickle cell anaemia) and the HBD gene (off-site sequence, 94% identity with HBB).

Results

Experimental setup and collection screening

To systematically explore the landscape of RVD specificity and stringency, we implemented a library-based high-throughput screen of TALE nucleases (TALEN) containing non-conventional RVDs (ncRVDs)11,12,13. First, a library of degenerate RVDs randomized at positions 12 and 13 was created by short overlap PCR (SOE) using the conventional NNK codon degeneracy, which encodes the complete set of natural amino acids. This RVD library was then incorporated at different positions of TALE arrays using a solid phase assembly strategy (Fig. 1a)14. We focused in particular on the first four positions of the array (Fig. 1b), as we12 and others15 have demonstrated that this N-terminal section of the array tends to present a higher specificity than the C-terminal section.

Figure 1

Design of experimental setup and validation experiments.

(a) Graphical representation of the library assembly process. The randomized RVD is represented in orange and pre-assembled blocks in grey. The size of each block is indicated. The solid support is composed of streptavidin-coated magnetic beads. (b) Schematic representation of the libraries of TALE and targets used in the yeast activity assay. This nuclease activity assay relies on the single strand annealing (SSA) pathway used after the generation, by the TALEN, of a double strand break (DSB) in the target sequence. Target sites were designed for homodimeric TALEN. X represents any combination of the 20 amino acids (encoded by the NNK codon) at positions 12 and 13. N represents any of the four bases (A, C, G and T). Libraries are composed of arrays containing 9.5 repeats. (c) Comparison of the activity between clones encoding the same TALE array. For each TALEN for which two clones encoding the same protein were obtained, the activity was compared on all nucleotides. Each point corresponds to a pair of clones measured on a given nucleotide (the first and second clones are arbitrarily assigned). (d) Comparison of the distribution of distances between profiles of TALEN pairs sharing the same RVD at different positions and TALEN pairs having different RVDs at the same position. For each pair of TALEN, the Euclidean distance was computed between their profiles on the four nucleotides. Pairs for the blue distribution have different RVDs at the same position (from the same collection), while pairs for the red distribution have the same RVD at different positions (from different collections). The curves are density curves corresponding to the histograms.

The four TALEN collections were assayed against their respective targets containing the four nucleotides (A, C, G and T) at the appropriate position, according to the location of the ncRVD in the TALE array (Fig. 1b). Altogether, about 18,000 TALEN/target combinations were assayed (~4200 to 4600 per collection). For each collection, we selected approximately one hundred clones (101 clones for position 1, 105 for position 2, 136 for position 3 and 128 for position 4) that displayed a high nuclease activity in the yeast assay on at least one of the four targets (containing A, C, G and/or T). All these clones were individually re-assayed to ensure consistency of measurements. We then sequenced the validated clones and found that, overall, all 20 amino acids could be recovered at either position 12 or 13. However, the frequency of occurrence of each amino acid was highly variable, with a predominance of the naturally occurring amino acids (Supplementary Fig. 1a and 1b). We next monitored the variability of the nuclease activity for a given RVD (same position in the array), also taking into account the codon degeneracy. The mean activity and standard deviation were computed and revealed a low variability, altogether validating the experimental setup (Fig. 1c).

Variability by position

The experimental setup allowed us to study several parameters, including the effect of the amino acid identities and of the position in the array. To investigate the effect of these two parameters, we computed the variability of the cleavage profiles due to (i) the position (1 to 4) in the array, keeping the amino acids at positions 12 and 13 fixed, and (ii) the identity of the amino acids at positions 12 and 13, keeping the position (1 to 4) fixed. We calculated the Euclidean distance between all pairs of profiles for the two sets (position or identity), which gave a distribution of distances for each set (Fig. 1d). The comparison of both distributions clearly showed that the nature of the amino acids at positions 12 and 13 has a much more pronounced impact on the nuclease activity than their position in the array, as demonstrated by the shift to the left (smaller distances) of the position distribution (Fig. 1d). Notably, this feature was also observed in our recent study focused exclusively on the activity and specificity of the four most commonly used RVDs (NI, HD, NN and NG)12.

Analysis of the cleavage profiles of ncRVDs on the 4 nucleotides

For each collection, we then calculated heatmaps of the nuclease activity as a function of the amino acids found at positions 12 and 13 (Supplementary Fig. 2, 3, 4 and 5). As we showed above that the identity of the amino acids at positions 12 and 13 is of major importance compared to the position in the array, we additionally calculated a global heatmap in which nuclease activity values were averaged over all positions (Fig. 2). To further decipher the cleavage profiles that can be obtained on the four nucleotides, we performed hierarchical clustering of the profiles and drew the corresponding heatmap (Fig. 3a). This analysis defined nine clusters of cleavage profiles. We first observed that, in contrast to thymine, adenine was the least stringent nucleotide to be accommodated by ncRVDs. We also noted that only two clusters contained clones presenting a nuclease activity almost restricted to one nucleotide, with members of cluster 7 targeting nearly exclusively a guanine and members of cluster 2 targeting predominantly an adenine (Fig. 3a, Supplementary Fig. 6). All other clusters targeted at least two nucleotides.

Figure 2

Analysis of global nuclease activity as a function of the identity of amino acids 12 and 13.

Heatmap showing the nuclease activity of identified combinations of amino acids at position 12 and 13 of an RVD on targets containing the four possible nucleotides. The activity was averaged on all four collections. A cross indicates that the combination of amino acids was not found in the present analysis of TALE array libraries and therefore, the activity is unknown. Activity is represented on a scale from 0 (no nuclease activity) to 1 (strong nuclease activity) by increasingly darker colors (arbitrary units).

Figure 3

Hierarchical clustering of the cleavage profiles of ncRVDs on the 4 nucleotides.

(a) Clusters are visualized by the alternation of red and blue and are numbered from top to bottom. Position of classical RVDs and ncRVDs used in the mammalian discrimination experiments are indicated. (b) Occurrence of every amino acid in clusters 1, 2 and 5. Each amino acid at position 12 and 13 is represented by a different color. The size of the cluster (number of different RVDs) and the targeting preference are indicated.

We further noticed that the most frequently occurring natural RVDs are not always found in the most specific clusters. For instance, NI, which is commonly used to target an adenine, belonged to the first cluster (with relative activities: A/100%, C/73%, G/58%, T/0%), while the similar variant MI was located in cluster 2 and turned out to be more specific (with relative activities: A/100%, C/0%, G/0%, T/0%).

We then determined the frequency of each amino acid in each cluster in order to detect enrichment for particular amino acids (Fig. 3b and Supplementary Fig. 6). We found, as expected, that position 12, mainly involved in the stabilization of the RVD, was much more variable in each cluster than position 13, which presented strong biases in some of the clusters. For instance, cluster 5, which is composed of clones targeting mainly an adenine and a guanine, contained predominantly the amino acids K, R, N and Q at position 13. Clones from cluster 1, which contains HD but also NI, target mainly an adenine and a cytosine and are enriched in acidic residues at position 13. Finally, cluster 2, which predominantly targets an adenine, is enriched in hydrophobic residues, essentially F and L (Fig. 3b).

Use of ncRVDs to improve targeting specificity in vivo

The ability to master the targeting specificity of TALEN by implementation of ncRVDs would be of great interest, notably to minimize off-site targeting. In contrast to previous work aiming to fully replace an RVD type in a TALE array16, we sought to develop an exclusion strategy focused on the educated incorporation of a limited number of ncRVDs based on their discriminative properties. Based on our dataset, we selected ncRVDs that maintained a robust activity on one nucleotide while allowing discrimination against one or more other nucleotides when incorporated in a TALEN (Fig. 4a and Supplementary Fig. 7). As proof of concept, we aimed at targeting a sequence (in-site) close to the mutation responsible for sickle cell disease in the HBB gene, knowing that this locus of interest is highly similar to a portion (off-site) of the HBD gene (93% identity over the 140 base pairs surrounding the mutation, Fig. 4b).

Figure 4

Evaluation of discriminative properties of ncRVDs in mammalian cells.

(a) Potential discriminating ncRVDs identified in the library screens. These ncRVDs were selected by maximizing the difference in nuclease activity between two nucleotides (favouring a strong nuclease activity on the desired base and a weak or no nuclease activity on the other base). Italics indicate either a non-negligible residual activity on the weaker nucleotide or an intermediate activity on the stronger nucleotide. The value in parentheses represents the discrimination score (scale from 0, no discrimination, to 1, maximum discrimination; arbitrary units, computed as the difference of intensities on both nucleotides). (b) Sequence alignment of the 2 × 69 base pairs of the HBB and HBD loci surrounding the position of the mutation responsible for sickle cell anaemia. The two TALEN binding sequences are underlined. (c) Schematic representation of the loci and TALEN used in the mammalian cell target discrimination experiment. (d) Targeted mutagenesis frequency detected at the two endogenous loci (desired locus or in-site, HBB in violet, and off-site, HBD in orange) using the HD/NG-, SD/VG- and NM/LP-containing TALEN, as measured by deep sequencing.

Using the features of our TALEN scaffold (TAL DNA binding array of 15.5 RVDs and spacer length of 15 bp), we found an optimal target that does not present, within the human genome, any off-site sequence with fewer than 5 mismatches, with the exception of the HBD locus, which contains 2 mismatches. The HBB and HBD loci thus present 94% identity, with 32/34 identical nucleotides in the targeted sequence (Fig. 4b). The corresponding TALEN may therefore significantly process not only the HBB but also the HBD locus12,17,18. A careful examination of (i) the identity of the mismatches between the HBB and HBD sequences and (ii) our ncRVD dataset allowed us to identify ncRVDs that would permit controlled targeting, ultimately excluding the processing of the HBD locus. We thus implemented either SD and VG or NM and LP as alternative ncRVDs to replace an HD and an NG RVD in the classical TALEN (Fig. 4b and c).
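The mismatch and identity figures quoted above (2 mismatches, 32/34 identical nucleotides, about 94%) follow from a simple position-by-position comparison. The Python sketch below illustrates that arithmetic only. The two sequences are made-up placeholders of the right length, not the HBB or HBD sequences, and the function names are hypothetical.

```python
def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length sequences."""
    return 100.0 * (len(a) - count_mismatches(a, b)) / len(a)


# Placeholder 34-nt "in-site" sequence (NOT the real HBB target).
in_site = "ACGT" * 8 + "AC"

# Placeholder "off-site" sequence: same as in_site except at two positions.
off_site_bases = list(in_site)
off_site_bases[5] = "T"   # was C
off_site_bases[20] = "G"  # was A
off_site = "".join(off_site_bases)

mismatches = count_mismatches(in_site, off_site)
identity = percent_identity(in_site, off_site)
print(f"{mismatches} mismatches, {identity:.1f}% identity")
```

With 34 compared positions, two mismatches leave 32 identical nucleotides, which rounds to the 94% quoted in the text.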

To assess the discriminative potential of ncRVDs, we performed targeted mutagenesis experiments at the HBB and HBD loci in a human cell line using one classical TALEN and two TALEN that contained a combination of ncRVDs (Fig. 4c). We monitored mutagenic events (small deletions and insertions, indels) generated by the non-homologous end joining (NHEJ) repair pathway at the double strand break (DSB) site induced by the TALEN. We quantified, by deep sequencing, targeted mutagenesis frequencies on specific PCR products surrounding the HBB or HBD loci. We observed that the three TALEN induced similar levels of targeted mutagenesis at the expected HBB locus (Table 1, Fig. 4d). Remarkably, while the classical TALEN was unable to discriminate between the two loci, the incorporation of the two ncRVDs NM and LP allowed a dramatic decrease in off-site targeting at the HBD locus (Fig. 4d).

Table 1 Activities of TALEN on their endogenous cognate target (HBB) and off-target sequence (HBD)

Although a larger in vivo data set would be desirable to precisely define the limits of ncRVD incorporation in TALEN, taken together our data indicate that the implementation of such ncRVDs in TALEN is a valuable strategy to control and tune TALEN specificity.

Discussion

The TALE scaffold has proven over the past few years to be one of the most robust platforms to generate molecular tools to edit, regulate and detect genes of interest19,20,21,22,23,24,25,26,27,28,29,30. However, as with all engineered DNA binding domains, depending on the experimental conditions and the targeted sequence, a certain extent of unwanted (off-site) targeting events might need to be controlled.

In this work, we aimed at extending the level of TALEN targeting control by taking advantage of the vast repertoire of possibilities offered by the DNA targeting characteristics of RVDs (positions 12 and 13). We and others recently demonstrated that only a small number of mismatches relative to the “standard” association code (NI:A, HD:C, NN:G and NG:T) are tolerated in a TALEN12,17,18. We thus reasoned that the use of ncRVDs would mainly benefit TALEN presenting off-site targets with few mismatches (1–3 mismatches, representing approximately a fourth of all possible TALEN12) and that a complete replacement of one or several of the four most common natural RVDs would not be required. We were encouraged in this strategy as previous studies reported mixed results with complete replacement of classical RVDs16,31,32. We therefore proposed a strategy focused on the educated incorporation of a limited number of ncRVDs based on their base preference properties. In particular, their exclusion capacities, which prevent targeting and processing of sequences containing defined nucleotides at the relevant position, expand the utility of this technology. We thus provide a panel of characterized ncRVDs that will allow researchers to easily implement this methodology, which is also compatible with published state-of-the-art TALE assembly protocols33,34,35. Using such an approach, we were able to significantly improve the targeting specificity (reduce the off-site targeting) while maintaining the robustness (high level of gene targeting) of the TALE platform. Such improvement in the fine tuning of the targeting specificity using ncRVDs, combined with other recent strategies17,18 (e.g. reduction of RVD number, engineering of the TALE C-terminal domain), should allow approaching single nucleotide discriminative resolution. We thus anticipate that ncRVDs can greatly benefit nuclease design, especially for therapeutic applications and synthetic biology36.

Methods

Preparation of libraries and extrachromosomal SSA assay in yeast

The library of degenerate RVDs randomized at amino acid positions 12 and 13 was created by short overlap PCR (SOE) using the conventional NNK codon degeneracy (encoding the complete set of natural amino acids) with the following oligonucleotides: Fragment 1A: CCCAGTCACGACGTTGTAAAAC/Fragment 1B: GTCTCCAGCGCCTGCTTGCCGCCMNNMNNGCTGGCGATGGCCACCACCTGCTC and Fragment 2A: CCCAGTCACGACGTTGTAAAAC/Fragment 2B: CACAGGAAACAGCTATGACCATG. The DNA template consisted of a single tandem repeat commercially synthesized and subcloned in the pAPG10 plasmid (Top Gene Technologies)14. The second (assembly) PCR was performed using the following oligonucleotides: Assembly1: Biotin-CCCAGTCACGACGTTGTAAAAC/Assembly2: CCCGGTACCGCATCTCGAGG. The resulting PCR product was purified and analysed by DNA sequencing to ensure the randomization of the two positions. This single randomized tandem repeat was used in the preparation of the four TALE array libraries. Assembly of the complete TALE arrays using a solid support (magnetic beads) was performed as described in Daboussi et al.14. All intermediate arrays released from the solid support were subcloned in the pAPG10 plasmid and transformed into XL1b (Stratagene) according to standard molecular biology procedures. A number of colonies at least 5-fold larger than the theoretical library size (1024) was scraped from the agarose plate and DNA was extracted according to standard procedures. The intermediate array was prepared for the next assembly step by restriction digestion using SfiI and BbvI as previously described14. The final arrays were subcloned in plasmids already containing a TALEN scaffold (described in refs 12, 37). The resulting constructs were reanalysed by DNA sequencing to ensure the randomization of the two positions.
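The theoretical library size of 1024 quoted above can be checked by enumerating the NNK codons (N = A, C, G or T; K = G or T) at the two randomized positions. The following Python sketch is an illustration under the standard genetic code. It is not part of the original analysis, and all variable names are hypothetical.

```python
from itertools import product

# Standard genetic code in the classic compact layout: bases ordered T, C, A, G
# at each codon position, with the third position varying fastest.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# NNK degeneracy: any base at positions 1 and 2, G or T at position 3.
nnk_codons = [n1 + n2 + k for n1, n2, k in product("ACGT", "ACGT", "GT")]

encoded = {CODON_TABLE[codon] for codon in nnk_codons}
amino_acids = encoded - {"*"}          # "*" marks a stop codon
stop_codons = [c for c in nnk_codons if CODON_TABLE[c] == "*"]

# Two randomized codons (positions 12 and 13) give 32 x 32 DNA variants.
library_size = len(nnk_codons) ** 2
min_colonies = 5 * library_size        # the 5-fold coverage stated in the text

print(len(nnk_codons), "NNK codons encoding", len(amino_acids), "amino acids")
print("stop codons within NNK:", stop_codons)
print("theoretical library size:", library_size)
print("colonies for 5-fold coverage:", min_colonies)
```

The 32 NNK codons cover all 20 amino acids while retaining a single stop codon, so the 1024 DNA variants correspond to 400 possible amino acid pairs at positions 12 and 13.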

For each of the four TALE array libraries, plasmid DNA from 1152 clones was extracted and transformed into yeast cells with a success rate >91% (colony growth). The nuclease activity screens in yeast were performed for each of the yeast clones as previously described in Juillerat et al.12.

For experiments in the mammalian cell line (Human 293H cells, Life Technologies), all TALE arrays were obtained from Cellectis Bioresearch (Paris, France) and the TALE-based nucleases were expressed under the control of an EF1a promoter (described in ref. 37). TALEN™ is a trademark owned by Cellectis Bioresearch.

Nuclease transfection and targeted mutagenesis

Cell culture and nuclease transfections were performed as described in Juillerat et al.12. Cells were pelleted by centrifugation and genomic DNA was extracted using the DNeasy Blood & Tissue Kit (Qiagen) according to the manufacturer's instructions. Targeted mutagenesis was estimated on the whole cell population three days post transfection. PCR amplifications of the endogenous loci (410 bp for HBB, in-site, and 400 bp for HBD, off-site) were performed using the following oligonucleotides: In-site1: CCACACCCTAGGGTTGGCCAATCTACTCCC/In-site2: GTAGACCACCAGCAGCCTAAGGGTGGG and Off-site1: GCCTCCTTGAGCCTCTCTTATAAC/Off-site2: GCGGTGGGGAGATATGTAGAGGAG. PCR products were purified using the AMPure kit (Invitrogen). Amplicons were further analysed by deep sequencing38.

Bioinformatics analysis

The distance profile was computed as follows: for all identical RVDs found active in different collections, we identified all pairs with one member in one collection and the other member in another collection, each member having its own nuclease activity profile on the four nucleotides. We then computed the Euclidean distance between these two profiles as a measure of similarity (the lower the distance, the more similar the profiles), giving a distribution of distances. As a point of comparison, we computed the same distance distribution between pairs of different RVDs from the same collection. Smoothed lines were computed as kernel density estimates with a Gaussian kernel and a bandwidth of 0.1.
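The two distance distributions described above can be sketched in a few lines of Python. This is an illustrative reimplementation only: the original computations were done in R, the activity values are invented toy numbers, and the data layout and names (`activity`, `same_rvd`, `diff_rvd`, `gaussian_kde`) are assumptions made for this example.

```python
import math
from itertools import combinations

# Toy data (invented): activity[collection][rvd] = activity on (A, C, G, T).
activity = {
    1: {"NI": (1.0, 0.7, 0.6, 0.0), "HD": (0.3, 1.0, 0.0, 0.1), "NG": (0.1, 0.2, 0.1, 1.0)},
    2: {"NI": (0.9, 0.6, 0.5, 0.0), "HD": (0.2, 0.9, 0.1, 0.0), "NN": (0.6, 0.0, 1.0, 0.1)},
}

# Same RVD at different positions: one member per collection.
same_rvd = [
    math.dist(activity[c1][rvd], activity[c2][rvd])
    for c1, c2 in combinations(sorted(activity), 2)
    for rvd in sorted(activity[c1].keys() & activity[c2].keys())
]

# Different RVDs at the same position: both members from one collection.
diff_rvd = [
    math.dist(profiles[r1], profiles[r2])
    for profiles in activity.values()
    for r1, r2 in combinations(sorted(profiles), 2)
]


def gaussian_kde(samples, x, bandwidth=0.1):
    """Kernel density estimate at x, Gaussian kernel with the given bandwidth."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


# Evaluate both smoothed curves on a small grid of distances.
grid = [i / 10 for i in range(0, 21)]
density_same = [gaussian_kde(same_rvd, x) for x in grid]
density_diff = [gaussian_kde(diff_rvd, x) for x in grid]
```

In this toy data set the same-RVD distances are all smaller than the different-RVD distances, mirroring the leftward shift of the position distribution reported in Fig. 1d. `math.dist` requires Python 3.8 or later.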

Clustering analysis was done by hierarchical clustering using the Euclidean distance between profiles and the complete linkage method.
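As an illustration of what complete-linkage clustering on Euclidean distances does, the Python sketch below implements a naive agglomerative procedure. It is not the code used in the study, which was run in R. The profile labels and values are invented, and the function name `complete_linkage` is hypothetical.

```python
import math


def complete_linkage(profiles, n_clusters):
    """Naive agglomerative clustering with complete linkage.

    At each step, merge the two clusters whose *farthest* pair of members
    (Euclidean distance) is the closest, until n_clusters remain.
    """
    clusters = [[name] for name in profiles]

    def cluster_distance(a, b):
        return max(math.dist(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs, key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]])
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Invented cleavage profiles on (A, C, G, T); labels are placeholders.
profiles = {
    "A_only_1": (1.0, 0.0, 0.0, 0.0),
    "A_only_2": (0.9, 0.1, 0.0, 0.0),
    "G_only_1": (0.0, 0.0, 1.0, 0.0),
    "G_only_2": (0.1, 0.0, 0.9, 0.1),
    "AC_1": (1.0, 0.8, 0.1, 0.0),
    "AC_2": (0.9, 0.9, 0.0, 0.1),
}

clusters = complete_linkage(profiles, n_clusters=3)
for members in clusters:
    print(members)
```

On these toy profiles, the procedure groups the adenine-restricted, guanine-restricted and adenine/cytosine profiles separately. Cutting the tree at a given number of clusters, as done here, is one way to obtain a fixed set of groups such as the nine clusters reported in the Results.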

Discriminating ncRVDs were identified as follows: for each nucleotide pair N1 and N2 for which we wanted to find an ncRVD that cuts N1 but not N2, the solutions (ncRVDs 1 to 3) were first sorted by occurrence number (number of collections in which the RVD was observed to cut N1) and then by the difference of nuclease activity between N1 and N2. In particular, we filtered out ncRVDs having an intensity on N1 smaller than 0.75 and an intensity on N2 larger than 0.30 (nuclease activities, arbitrary units). The remaining ncRVD candidates were ordered by decreasing difference between N1 and N2, and the three best RVDs for (N1, N2) were retained. If there were fewer than three solutions, the list was completed to three by adding ncRVDs having a positive difference between the intensities on N1 and N2, without taking into consideration the occurrence number or the threshold constraints (simple ordering by decreasing difference).
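The selection procedure above can be read as a filter followed by a ranking and a fallback. The Python sketch below shows one such reading. It assumes that a candidate is kept only when its activity on N1 is at least 0.75 and its activity on N2 is at most 0.30, and that ties in occurrence number are broken by the activity difference. The candidate labels, values and all names (`Candidate`, `select_discriminating`) are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    rvd: str          # placeholder label, not a real RVD
    occurrences: int  # number of collections in which the RVD cut N1
    on_n1: float      # nuclease activity on the desired nucleotide N1
    on_n2: float      # nuclease activity on the excluded nucleotide N2

    @property
    def score(self) -> float:
        """Discrimination score: difference of activities on N1 and N2."""
        return self.on_n1 - self.on_n2


def select_discriminating(
    candidates: List[Candidate],
    min_n1: float = 0.75,
    max_n2: float = 0.30,
    top: int = 3,
) -> List[Candidate]:
    # Assumed reading of the thresholds: keep strong-on-N1, weak-on-N2 candidates.
    passing = [c for c in candidates if c.on_n1 >= min_n1 and c.on_n2 <= max_n2]
    # Rank by occurrence number first, then by discrimination score.
    passing.sort(key=lambda c: (c.occurrences, c.score), reverse=True)
    chosen = passing[:top]

    if len(chosen) < top:
        # Fallback: ignore occurrences and thresholds, keep positive scores only.
        fallback = [c for c in candidates if c not in chosen and c.score > 0]
        fallback.sort(key=lambda c: c.score, reverse=True)
        chosen += fallback[: top - len(chosen)]
    return chosen


# Invented candidates for one (N1, N2) pair.
candidates = [
    Candidate("r1", 4, 0.95, 0.05),  # passes both thresholds
    Candidate("r2", 2, 0.90, 0.10),  # passes both thresholds
    Candidate("r3", 1, 0.80, 0.50),  # too active on N2
    Candidate("r4", 3, 0.60, 0.00),  # too weak on N1
    Candidate("r5", 1, 0.20, 0.40),  # negative score, never selected
]

selected = select_discriminating(candidates)
print([c.rvd for c in selected])
```

Here only two candidates pass the thresholds, so the third slot is filled from the fallback list by decreasing score, which is how the sketch interprets the completion rule in the text.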

All computations were done using the R statistical software package (http://cran.r-project.org/index.html).