Introduction

CRISPR/Cas9 technology has been widely used for genome editing and is currently being tested as a therapeutic in clinical trials1,2,3. However, the risk of Cas9 cleaving sequences with high similarity to the targeted DNA has raised critical concerns in its scientific and clinical applications4,5,6,7,8. Thus, it is crucial to diminish off-target effects in CRISPR/Cas9 genome editing.

A number of experimental techniques, including high-throughput screens9,10,11 and various genome-wide detection approaches12,13,14,15,16,17,18,19, have been developed to quantitatively assess the off-target effects. These approaches have facilitated the exploration of rules governing the specificity of CRISPR/Cas9, and it is widely accepted that the off-target cleavage by Cas9 is dependent on the positions and nucleotide contexts of mismatches between crRNA and targeted protospacer9,11,20. Nevertheless, these mismatch-dependent rules only partially account for the observed off-target effects. For example, recent studies revealed a guide-intrinsic mismatch tolerance (GMT) in addition to the impact of specific mismatches21,22. While the observations of GMT suggest the possibility to identify the highly specific gRNA regardless of genetic context, the sequence determinants underlying GMT is unclear. Over 95% of the detected genomic off-target sites harbor two or more mismatches with respect to the crRNA sequences23, but the combinatorial effect of multiple mismatches is not well understood due to the lack of adequate experimental data that allow for quantitative separation of the combinatorial effect from that of individual mismatches.

Although experimental techniques are capable of quantifying Cas9 off-target activities, in silico prediction of the off-target effects remains the most efficient and cost-effective method for designing and optimizing CRISPR-based applications. The advancement of machine learning approaches has fueled the progressive improvement of off-target prediction over the past several years24,25,26,27,28,29. Moreover, biophysical modeling has provided new insights into the prediction of off-targeting from a bottom-up perspective30,31,32. Notwithstanding this progress, the power for computational prediction of CRISPR/Cas9 off-target effects remains limited as large-scale training datasets covering more gRNAs and an enhanced understanding of sequence-dependent rules are needed.

In the present study, we devise a high-throughput synthetic system with a dual-target design to measure the relative cleavage rate between off- and on-target sequences (off-on ratios) of ~2,000 gRNAs. This system shows lower variance among experimental replicates and better assessment of the genomic off-targets compared to the previous single-target system. The large number of gRNAs and customized library design facilitate the mining of a more comprehensive set of rules underlying CRISPR/Cas9 specificity, which lead to the development of an improved off-target prediction tool and an optimized strategy for allele-specific genome editing.

Results

High-throughput assessment of off-on ratios using a synthetic dual-target system

Recently, paired gRNA-target systems have been developed for the high-throughput assessment of CRISPR/Cas9 genome editing outcomes33,34,35. These systems include a gRNA expression cassette together with a target sequence flanked by barcodes that are uniquely mapped to the gRNA sequence (Fig. 1a). Combined with high-throughput oligonucleotide synthesis and lentiviral delivery, the variable context of target sequences allows flexible design of the experiments to detect Cas9-mediated editing events associated with a large number of gRNAs. To facilitate a direct comparison of the off- and on-target effects, we modified the paired gRNA-target system by introducing a dual-target sequence that contains two 23-bp PAM-endowed target sequences arranged in tandem, corresponding to an off-target (left) and an on-target (right). These two targets are separated by an optimized 15-nt linker sequence and are surrounded by a 10-nt barcode 1 at the 5′ end and a 15-nt barcode 2 at the 3′ end (Fig. 1b). Since the off- and on-targets are integrated to the same genomic locus and are PCR-amplified together, the on-target cleavage rate acts as an internal control for the normalization against confounding factors in the experiment. Compared to the single-target design without the use of internal normalizations, the dual-target design is expected to reduce the experimental variations and biases for accurate measurement of off-on ratios.

To explore the editing outcomes mediated by different cleavage events at almost identical tandem targets, we designed control dual-target sequences to represent four combinations of cleavage events (no cleavage, left, right, and both), where the cleavage can be turned off at a specific target by the replacement of the “NGG” PAM sequences with “NTT” (Fig. 1c). We first tested two gRNAs associated with distinct repair mechanisms upon double-strand breaks (Fig. 1d and Supplementary Fig. 1). In addition to anticipated small indels, large deletions (>30-nt) are enriched when cleavage occurs at both targets (NGG + NGG) or the left target alone (NGG + NTT). The latter is likely due to the similarity of the two target sequences that induces long-range resection via microhomology-mediated end joining (MMEJ). These observations are consistent between the two gRNAs, suggesting a general cleavage-editing model as demonstrated in Fig. 1e. We further extended the analysis on 276 control gRNA-target pairs using a high-throughput pooled screen. Computational analysis of the Cas9-mediated mutational profiles at the control sequences rendered a matrix that represents the probability of editing outcomes conditional on cleavage types, as well as the background noise rates measured from the “no cleavage” sequences (Fig. 1f). This matrix enabled subsequent inference of the cleavage frequency and off-on ratio from the observation of editing outcomes (see Methods section for details).

To evaluate the dual-target approach, we designed a library that includes 35 benchmark gRNAs collected from previous in vitro or in vivo studies, corresponding to 296 reported off-target sites (positive controls) and 295 similar genomic sequences that are never detected to be off-targets (negative controls)1,12,17,36. Additionally, the library includes 328 random gRNAs corresponding to 2261 off-targets harboring 1–3 mismatches (Supplementary Data 1). For a systematic comparison, we also generated a single-target library (Fig. 1a and Supplementary Data 2) that includes the same sets of off- and on-targets. Both lentiviral libraries were transduced to HEK293T cells expressing doxycycline-inducible Cas9 (HEK293T-iCas9) for the assessment of off-on ratios, following almost identical experimental configurations (Supplementary Fig. 2). As expected, the dual-target system significantly improved the correlations of the measured off-on ratios between biological replicates or different barcode sets (Fig. 1g and Supplementary Fig. 3). Comparison of the positive and negative control targets suggests a dynamic range of ~0.02–1.0 in the measurement of off-on ratios for the dual-target system (Fig. 1h). On the positive controls, the off-on ratios detected by the dual-target system showed higher consistency with the measures from in vitro GUIDE-seq12 (Fig. 1i and Supplementary Fig. 4a) and an in vivo study on mouse embryos using whole-genome sequencing (WGS)36 (Fig. 1j and Supplementary Fig. 4b) as compared to the single-target system. Overall, these head-to-head comparisons confirmed the rationale and advantage of the dual-target system that improves the accuracy of off-on ratio measurements via internal normalization using on-target controls.

With the proven robustness of our system, we measured the off-on ratios of 1902 gRNAs on 13,314 synthetic target sequences (Supplementary Data 1). We designed 7 off-target sequences for each gRNA, including 3 targets with 1 mismatch (1-MM), 3 with 2 mismatches (2-MM), and 1 with 3 mismatches (3-MM). The mismatches in the 2-MM and 3-MM sequences are the combinations of the mismatches in 1-MM sequences, as exemplified in Fig. 1k. These settings allow systematic decomposition of the estimated off-on ratios into single-mismatch effect, combinatorial effect, and GMT.

Sequence determinants of guide-intrinsic mismatch tolerance

To confirm the existence of the GMT effect, we first estimated the indel rates of 6 gRNAs at synthetic targets harboring 0 to 6 mismatches (Supplementary Data 3). While the on-target indel rates are largely consistent for all the gRNAs, we found that 4 gRNAs were associated with high indel rates at the 1-MM sequences and the other 2 showed obvious mismatch intolerance (Fig. 2a and Supplementary Fig. 5). This guide-intrinsic effect was also observed at the 2-MM and 3-MM sequences regardless of various mismatch contexts (Fig. 2b). To quantitatively estimate the GMT effects of the gRNAs in the large-scale dataset, we used a gradient descent algorithm to decompose the off-on ratios at the 1-MM sequences into (i) mismatch-dependent effects determined by the position and nucleotide context of the mismatches, and (ii) the GMT effects independent of the mismatch contexts (see Methods section for details and Supplementary Data 4). The estimated mismatch-dependent effect, as demonstrated by a matrix in Fig. 2c, is highly consistent with a previous report based on CRISPR/Cas9 functional screens9 (r = 0.86, Supplementary Fig. 6), supporting the robustness of the decomposition.

Comparing the nucleotide frequency of the protospacer sequences associated with high or low GMT effect, we observed an enrichment of guanine and depletion of thymine in the high-GMT protospacers (Fig. 2d). Unlike the on-target activity that is mainly dependent on the context of the nucleotides proximal to PAM37, the GMT effect relies on the sequences in both the seed and non-seed regions. Further analysis showed only a moderate correlation between the on-target activity and the GMT effect (r = 0.44, Supplementary Fig. 7), suggesting that it is possible to select gRNAs that are associated with both high on-target activity and low GMT. To predict GMT from a sequence, we tested two convolutional neural network (CNN) regression models corresponding to mono- and di-nucleotide kernels (Supplementary Fig. 8). Cross-validation showed that the dinucleotide model achieved the optimal performance (Fig. 2e). Tested on independent TTISS18 and CHANGE-seq19 datasets, the predicted GMT score is significantly associated with the specificity of the gRNAs on the vast majority of Cas9 variants (Fig. 2f and Supplementary Fig. 9).

Several studies have confirmed that R-loop formation is required for CRISPR/Cas9-mediated DNA cleavage38,39. In principle, the progression of R-loop formation can be driven by free-energy transactions involved in substituting DNA-DNA duplex interactions for RNA-DNA hybrid interactions in the context of the Cas9 protein. The sequence dependence of this free energy should be largely set by the base-pairing energies, particularly by the base-stacking energy30,31,32. Comparing the GMT effect with the difference of cumulative base-stacking energy between DNA-DNA and RNA-DNA hybridization, we found that the high-GMT and low-GMT gRNAs are associated with distinct free-energy landscapes (Fig. 2g). Moreover, we observed a strong correlation between the dinucleotide-specific base-stacking energy and the weights of dinucleotides that contribute to GMT prediction (Fig. 2h), suggesting that the free-energy change accounts for a majority of sequence-dependent GMT effect. Next, we asked why those high-GMT gRNAs are mismatch tolerant. We considered a recent kinetic model in which the process of Cas9-mediated cleavage is a sequential transition among the states of unbinding, PAM recognition, R-loop formation, and cleavage40 (Fig. 2i). A mismatch causes an energy barrier that could halt R-loop progression, and the probability of unbinding before cleavage is determined by the relative height ($$\triangle {G}_{{\max }}$$) of the kinetic barriers. If $$\triangle {G}_{{\max }}$$< 0, the free-energy barrier to cleavage is lower than that to unbinding, and cleavage is the most likely outcome; if $$\triangle {G}_{{\max }}$$> 0, the free-energy barrier to cleavage is higher than that to unbinding, the R-loop formation is kinetically blocked, and unbinding is the most likely outcome. As illustrated in Fig. 2j, the difference of free-energy landscapes intrinsic to high-GMT or low-GMT gRNAs leads to different levels of $$\triangle {G}_{{\max }}$$, thus resulting in a high or low probability of reaching the final cleavage state, respectively.

Combinatorial effect of multiple mismatches

In a simplified theoretical model, the combinatorial effect of two or more mismatches is assumed to be “marginally independent”9. That is, the tolerance to a combination of mismatches, in terms of off-on ratio, is taken to equal the multiplied tolerances for the individual mismatches. To test this model, we compared the observed off-on ratios of 2-MM targets to the expected off-on ratios computed from 1-MM targets. The marginally independent model provides an upper bound of the combinatorial effects, but many combinations are associated with much lower off-on ratios (Fig. 3a), suggesting an “epistasis-like” effect. Consistently, the CFD score9, a benchmark method based on the marginal independence model, overestimates the off-target effects at genomic targets harboring multiple mismatches (Fig. 3b). The “epistasis-like” combinatorial effect is explainable by the above-mentioned kinetic model40, where the combination of two tolerated mismatches results in a higher energy barrier that blocks R-loop progression (Fig. 3c).

To quantitatively model the combinatorial effect, we introduced a parameter $${\delta }_{{ij}}$$ to represent the ratio of the observed and expected effects when two mismatches occur at the $$i$$th and the $$j$$th nucleotide with respect to the PAM. A smaller value of $${\delta }_{{ij}}$$ indicates a stronger “epistasis-like” effect. The maximum likelihood estimates of $${\delta }_{{ij}}$$ are shown in a heatmap in Fig. 3d. Additionally, we computed a relative co-occurrence score (RCS) that represents the observed frequency of two mismatches relative to random expectation, based on 96,555 genomic off-target sites detected by CHANGE-seq19 (Fig. 3e). Cross-referencing $${\delta }_{{ij}}$$ and RCS, we derived three reproducible rules of the combinatorial effect. First, the combinatorial factor $${\delta }_{{ij}}$$ is smaller when both mismatches occur in the seed region (<10-nt from PAM) and is greater when one of the mismatches occurs at the PAM-distal 19th or 20th nucleotide. Second, the “epistasis-like” effect is dependent on the distance between the two mismatches, where a minimal $${\delta }_{{ij}}$$ is observed when the distance is between 1 and 6 (Fig. 3f). Third, a greater $${\delta }_{{ij}}$$ is associated with adjacent mismatches (distance = 0) that correspond to a 2-nt bubble in RNA-DNA hybridization, as compared to two separated proximal bubbles of 1-nt in size. Of note, all three revealed rules can be recapitulated using the kinetic model, suggesting an underlying biophysical mechanism (Fig. 3g). In extension to the third rule, we further asked if the number of bubbles is associated with the combinatorial effect given a fixed number of mismatches. We examined the genomic sequences harboring 4–6 mismatches with respect to 109 gRNAs in the CHANGE-seq dataset19 and found that the sequences with a smaller number of bubbles are more likely to be detected as off-target sites (Fig. 3h and Supplementary Fig. 10). Consistently, those sequences with fewer bubbles are associated with lower energy barriers computed from the cumulative dinucleotide base-stacking energy changes, providing a biophysical explanation of the “bubble rule” (Fig. 3i).

Predicting off-target effect and guide specificity with MOFF

With the observations of GMT and the combinatorial effect of multiple mismatches, we sought to combine these sequence-dependent rules for in silico CRISPR/Cas9 off-target evaluation. We developed MOFF, a model-based off-target predictor that includes three components corresponding to the multiplication of individual mismatch effect (IME), the combinatorial effect (CE), and the GMT effect (Fig. 4a). Given a gRNA and an off-target sequence, we defined a MOFF-target score to be the logarithm of the expected off-on ratio as follows:

$${S}_{{{{{{{\mathrm{MOFF}}}}}}}}=\mathop{\sum }\limits_{i=1}^{k}{{\log }}\left({s}_{i}\right)+\frac{2}{k}\mathop{\sum }\limits_{i=1}^{k}\mathop{\sum }\limits_{j=1}^{i}{{\log }}\left({\delta }_{{ij}}\right)+k\,{{\log }}\ ({s}_{{{{{{{\mathrm{GMT}}}}}}}})$$

where k is the number of mismatches, $${s}_{i}$$ is the effect of the $$i$$th mismatch computed from the matrix in Fig. 2c, $${\delta }_{{ij}}$$ is the pairwise combinatorial effect with respect to the positions of the $$i$$th and $$j$$th mismatches as demonstrated in Fig. 3d, and $${s}_{{{{{{{\mathrm{GMT}}}}}}}}$$ is the GMT effect estimated from the dinucleotide CNN regression model (see Methods section for a detailed explanation of MOFF-target score). To facilitate the genome-wide assessment of the specificity of a given gRNA, we also defined a MOFF-aggregate score which is the logarithm of the sum of predicted off-on ratios for all genomic sequences harboring up to six mismatches.

We evaluated the performance of MOFF-target and MOFF-aggregate using three independent datasets generated by the platforms of GUIDE-seq12, TTISS18, and CHANGE-seq19. We reasoned that the classification of off-target and untargeted sites is highly dependent on the sensitivity of each platform, where the off-target effects take continuous values that may differ by several orders of magnitude. Thus, we adopted the Spearman correlation in log-scale for quantitative evaluations. We compared the performance of MOFF-target and MOFF-aggregate to 5 off-target prediction methods, including the benchmark CFD score9 and its improved version Elevation24, two recent deep learning-based methods CNN_std27 and CRISPR-Net29, and an energy-based model CRISPRoff31 (Fig. 4b-c). Among them, three machine learning-based methods (Elevation, CNN_std, and CRISPR-Net) achieved good predictive power on their training dataset generated by GUIDE-seq, but the performances degraded when tested on datasets from the other two platforms, suggesting overfitting. Alternatively, MOFF-target and MOFF-aggregate, which are based on explicit sequence-dependent rules, achieved consistently superior performance in all three independent datasets. Furthermore, feature importance analyses suggest higher weights of IME and CE in MOFF-target. In contrast, the GMT effect significantly contributes to genome-wide specificity in MOFF-aggregate because it impacts the cleavage rates at all the off-target sites regardless of various contexts of mismatches (Supplementary Fig. 11).

Previous studies suggested that unintended off-target cleavages lead to decreased cell proliferation through induction of a G2 cell cycle arrest41,42. To test if MOFF could select gRNAs to minimize the confounding off-target effect in the high-throughput CRISPR/Cas9 screens, we analyzed 7,403 gRNAs targeting 758 non-essential genes in the Avana and GeCKO-v2 libraries used in the large-scale DepMap project43. As expected, the MOFF-aggregate score is correlated with a cell depletion phenotype in viability screens (Fig. 4d and Supplementary Fig. 12). Further analysis showed that the combination of MOFF-aggregate and DeepHF44, a gRNA efficiency prediction tool, optimizes gRNA selection to achieve greater effect size in CRISPR/Cas9 screens (Fig. 4e, f and Supplementary Fig. 13). Moreover, the gRNA selection based on MOFF and DeepHF leads to improved cross-library reproducibility on cell-specific essential genes (Supplementary Fig. 14). Collectively, these lines of evidence support the future application of MOFF for the rational design of customized CRISPR/Cas9 libraries.

Improving allele-specific genome editing with mismatched gRNA

Cas9-mediated allele-specific genome editing holds great potential for functional elucidation of disease-associated heterozygous mutations45,46,47,48,49. However, the selectivity of allele-specific targeting remains a significant challenge due to the sequence similarity between the mutant and wild-type alleles, which differ by only 1-bp for most point mutations46,48,50,51. The existing strategies are limited to the mutations that occur in the PAM or the seed regions45,52. Truncated 17-19 nt crRNAs improved the selectivity to a moderate degree but failed in some applications46,48,50.

Given the observation of the “epistasis-like” combinatorial effect of two mismatches, we hypothesized that the selectivity of allele-specific targeting could be improved using a mismatched gRNA that differs by 1-nt from the mutant and 2-nt from the wildtype allele (Fig. 5a). To test this hypothesis, we used the dual-target synthetic system to perform a high-throughput allele-editing screen on 18 cancer hotspot mutations (Fig. 5b and Supplementary Data 5). In library design, the mutant and the wildtype sequences are encompassed in the dual-target site, and the gRNAs are of variable contexts corresponding to perfect match, truncated forms, and all possible single mismatches relative to the mutant. Among 2445 screened gRNAs, 574 (23.5%) showed high mutant-editing efficiency (>75% indel frequency at the mutant allele relative to perfectly matched gRNA) and high selectivity (<20% indel frequency at the wildtype allele relative to perfectly matched gRNA). 16 of the 18 cancer hotspot mutations are selectively targetable by mismatched gRNA, with the exception of 2 mutations lacking efficient mutant-editing potential. Using the same criteria, 4 and 9 cancer hotspot mutations are targetable by gRNAs of perfect match and in a truncated form, respectively (Table 1). The improvement made by mismatched gRNA was confirmed by experimental validations based on 6 selected cancer hotspot mutations, as shown in Fig. 5c. Importantly, the mismatched gRNAs achieve satisfactory selectivity for the mutations that occur in both seed and non-seed regions of the corresponding protospacers.

To test if the effective mismatched gRNAs can be predicted computationally, we applied MOFF-target to determine a subset of 364 mismatched gRNAs that are expected to be efficient at the mutant sequences and selective against the wildtype sequences (see Methods section for details). We found that the predicted gRNAs maintained a high level of mutant-editing efficiency comparable to the perfectly matched gRNAs, and significantly improved the selectivity when compared to the truncated 17-18 nt forms (Fig. 5d, e). Among the predicted mismatched gRNAs, 191 (52.5%) are effective in the high-throughput screen, confirming the feasibility of allele-specific editing using the computer-aided design of mismatched gRNAs. To address the needs of guide design in allele-specific editing, we implemented a module in the MOFF package, named MOFF-allele, that allows the users to select the optimal mismatched gRNA for their study.

Discussion

In this study, we devised a high-throughput dual-target synthetic system to explore the sequence features associated with CRISPR/Cas9 off-target effect. We determined the rules underlying a GMT and the combinatorial effect of multiple mismatches. The sequence-dependent rules are significantly associated with the free-energy landscape during R-loop formation and support a recent multi-state kinetic model40. Of note, R-loop formation is not only required for the DNA cleavage mediated by Cas9 and its variants, but also for other CRISPR techniques based on dCas9 and Cas12a53,54. Therefore, we anticipate that these sequence-dependent rules and the kinetic model are generally applicable to a wide spectrum of CRISPR applications such as dCas9-mediated transcriptional repression and activation, base editing, and prime editing55,56,57,58,59,60, at least to a significant extent. In addition to R-loop formation, the interaction between Cas9 and unwound target DNA strand, the conformational change of Cas9-gRNA structure, as well as the folding stability of gRNAs, collectively contribute to the sequence-dependent free-energy landscape. Therefore, the kinetic model can be further improved by taking these factors into consideration, which will lead to a better explanation of off-target effect from a biophysical perspective.

A simple combination of our rule sets led to the development of MOFF, a model-based off-target prediction method that outperforms previous methods including complex deep learning frameworks. We ascribe the improvement to two factors. First, the previous methods were developed on datasets with a limited number of gRNAs that are insufficient for modeling on a large feature space spanned by 20-nt nucleotides. Second, complex models are sensitive to platform-specific biases and are prone to overfitting, whereas rule-based approaches such as MOFF are more robust against the variation of platforms. The second point is supported by our additional evaluation results as shown in Supplementary Fig. 15, in which MOFF outperforms traditional machine learning methods (Gradient Boosted Tree, Random Forest Regressor, and Support Vector Machine) when trained on our dual-target dataset and tested on public datasets. Meanwhile, we acknowledge that complex machine learning models are advantageous in extracting subtle features that are indiscernible by simple rule mining if the platform-specific biases are minimal. We envision that MOFF can be further improved by: (i) supplying advanced machine learning techniques with known rules and biophysical laws, followed by training the model on a compendium of independent datasets; (ii) considering DNA/RNA bulges, which account for ~10% of genomic off-targeting events61; (iii) accounting the mismatches in the PAM sequence, as some alternative PAM sequences, such as NAG, also lead to active Cas9/gRNA editing9; and (iv) incorporating sequence-independent features such as chromatin structure and epigenetic markers, as reported previously19,62.

Current strategies for allele-specific editing using discriminating gRNAs are mainly focused on the mutations in the PAM or the seed region. Although several successful applications have been reported using Cas9 and its orthologs/variants45,46,49, the selectivity of allele-specific editing remains a significant challenge. Based on the “epistasis-like” combinatorial effect of dual mismatches that we reported, we propose to use mismatched gRNAs to improve the selectivity of allele-specific genome editing, where the intended allele harbors a single tolerable mismatch relative to the gRNA and the unintended allele harbors two intolerable mismatches. The high-throughput allele-editing screen confirmed that a vast majority of cancer hotspot mutations located either in the seed region or non-seed region are selectively targetable through the computer-aided design of mismatched gRNAs. This improvement expands the application domain of allele-specific editing as a time-saving and cost-effective approach for the perturbation of endogenous mutant alleles, thus benefiting functional studies of disease- or trait-associated heterozygous point mutations in a variety of species. Besides allele-specific editing, mismatched gRNAs can be applied to selectively edit one of the paralogous genes that share high degrees of sequence homology, which will facilitate the elucidation of genetic interactions and functional crosstalk between paralogs. Moreover, while our validations are focused on cancer hotspot mutations, the strategy of computer-aided design of mismatched gRNAs holds the clinical potential to correct dominant-negative mutations that drive mendelian disorders, which will be addressed in future work.

Methods

Oligonucleotide library design

In this study, we designed 11 high-throughput libraries for systematically decomposing sequence determinants that govern CRISPR/Cas9 specificity. Each library consists of ~12,000 oligonucleotides. These 11 libraries are summarized as follows and more details are provided in Supplementary Data 6.

• Library T1, T2, and T3: paired gRNA and dual-target design (Fig. 1b) for method evaluation and rule mining. The T1 library includes control gRNAs, benchmark gRNAs and random gRNAs for method evaluation. The T2 and T3 libraries, together with the fraction of random gRNAs in T1, are used to uncover the rules underlying off-targeting.

• Library S1: paired gRNA and single-target design (Fig. 1a) for method evaluation. We designed this library following a previously described high-throughput library design protocol34. For a systematic comparison, Library S1 contains the same sets of gRNA and corresponding target combination as Library T1.

• Library Allele: paired gRNA and dual-target design (Fig. 1b) for the allele-editing screen. We designed this allele-editing library to leverage the mismatched gRNAs to target 18 cancer hotspot mutations, which were curated from Memorial Sloan Kettering Cancer Center (MSKCC) Cancer Hotspots (https://www.cancerhotspots.org) and cBioportal for Cancer Genomics (https://www.cbioportal.org) with reported gain-of-function or dominant-negative effects. For dual-target design, the wildtype sequence is placed at the off-target position and the mutant sequence at the on-target position (Fig. 5b).

• Library sg1, sg2, sg3, sg4, sg5, and sg6: target-only design (no gRNA pooling) for GMT exploration (Fig. 2a, b). Two gRNAs, sg2 and sg6, have the crystal structure with Cas9 and their targets (PDBs: 4OO8 and 4UN3). Target sequences in each library include the on-target, all the possible 1 mismatch (1-MM, 66 sequences) and 2 mismatches (2-MM, 2,079 sequences) targets, and randomly selected 3 mismatches (3-MM, 100 sequences), 4 mismatches (4-MM, 100 sequences), 5 mismatches (5-MM, 100 sequences), and 6 mismatches (6-MM, 47 sequences) targets.

The gRNAs in these 11 libraries are categorized into 5 types: control gRNAs, benchmark gRNAs, randomly generated gRNAs, allele-editing gRNAs, and truncated gRNAs. Every control gRNA was designed with 4 types of dual-target sequences by the replacement of “NGG” with “NTT” sequence at the PAM to represent 4 combinations of cleavage events (no cleavage, left, right, and both) (Fig. 1c). The benchmark gRNAs with reported genomic off-target sites were curated from different methods detecting genome-wide off-targets induced by Cas9 editing in vitro or in vivo and used to evaluate the dual-target approach1,12,17,36. Unreported genomic off-target sites with <5 mismatches relative to corresponding benchmark gRNAs were generated by CRISPRitz63 (Fig. 1h–j). Except for the gRNAs in Libraries sg1-6, each randomly generated gRNA was designed with 7 types of off-target sequences to systematically decompose the estimated off-on ratios into single-mismatch effect, combinatorial effect, and GMT. These seven sequence types include three targets with 1-MM, 3 with 2-MM, and 1 with 3-MM. The mismatches in the 2-MM and 3-MM targets are the combinations of the mismatches in 1-MM targets (Fig. 1k). Frequencies of four nucleotide types (A, G, C, and T) at each position for randomly designed gRNAs are roughly evenly distributed (Supplementary Fig. 16). Allele-editing gRNAs were designed to test the selectivity for allele-specific targeting on 18 cancer hotspot mutations (Fig. 5b). Truncated 17–19 nt gRNAs were designed to compare to mismatched gRNAs for allele-editing assessment (Fig. 5d, e).

In the paired gRNA and dual-target libraries (Library T1, T2, T3, and Allele), each 190-bp oligonucleotide consists of an 18-bp left primer sequence, a 10-bp barcode 1, a 23-bp corresponding off-target sequence containing a PAM, a 15-bp linker to segregate two targets, a 23-bp corresponding on-target sequence containing a PAM, a 15-bp barcode 2, a 45-bp cloning linker sequence containing a vector homology sequence, an SspI enzyme recognition site and an hU6 homology sequence, a 21-bp guide sequence beginning with “g” (gN20), and a 20-bp right primer sequence. In the paired gRNA and single-target library (Library S1), each 157-bp oligonucleotide consists of a longer 20-bp barcode 2 but lacks the 15-bp linker and the second 23-bp target. All other sequence components are identical between the single- and dual-target design. In the target-only libraries (Library sg1, sg2, sg3, sg4, sg5, and sg6), each 109-bp oligonucleotide consists of a 25-bp left primer sequence, a 10-bp barcode 1, a 27-bp corresponding target sequence containing a PAM, a 20-bp barcode 2, and a 27-bp right primer sequence (Supplementary Note). All the oligonucleotides in these libraries passed the filters of having no thymine homopolymers more than three nucleotides long in gRNA sequences, carrying no more than one SspI enzyme recognition site, one left primer site, and one right primer site, and no sites besides the target sequence being cut by gRNAs in the constructed plasmids.

Plasmid cloning

gRNA I and D associated with distinct repair mechanisms upon double-strand breaks were selected from a previous single-target library predicting mutations generated by CRISPR/Cas9. Dual-target I and D were designed with four types of dual-targets by the replacement of “NGG” with “NTT” sequence at the PAM as described above. For gRNA I/D and corresponding dual-target I/D pairs cloning, the 152-bp single-strand Ultramer™ DNA oligonucleotides with 4 types of dual-target I and D were synthesized (IDT) and amplified with Q5 High-Fidelity DNA polymerase (NEB #M0492). Meanwhile, gRNA I and D oligonucleotides (Sigma) were annealed and ligated to the BsmBI (NEB #R0739)-linearized LentiGuide-BSD-PacI plasmid. The LentiGuide-BSD-PacI plasmid was derived from the LentiGuide-BSD plasmid by inserting a PacI recognition site to 33-bp downstream of the scaffold sequence using Q5 site-directed mutagenesis (NEB #E0552). The resulting plasmids were then linearized by PacI (NEB #R0547), and undergone Gibson assembly (NEBuilder HiFi DNA Assembly Master Mix, NEB #E2621) with the amplified four types of dual-target I and D fragments. Assembled products were then transformed to homemade Stbl3 competent cells, and the plasmids were extracted using QIAprep Spin Miniprep Kit (QIAGEN) and confirmed by Sanger sequencing.

For plasmid library cloning, the oligonucleotide libraries with paired gRNA and dual-target design were synthesized by Twist Bioscience, and libraries with paired gRNA and single-target design or target-only design were ordered from CustomArray. The plasmid library containing pooling gRNAs and corresponding single- or dual-target pairs was prepared by undergoing an intermediate circulation, SspI enzyme linearization, and Gibson assembly into a lentiviral plasmid LentiGuide-BSD-Δscaffold. This multistep cloning procedure was adapted and modified from previously described protocols33,34. Briefly, the oligonucleotide pools were amplified with Q5 High-Fidelity DNA polymerase for extension of the oligonucleotide sequences with overhangs complementary to a donor G-block. Gibson assembly was performed to ligate the amplified pools to the 125-nt donor gBlock fragment (IDT) encoding the gRNA scaffold at a molar ratio of 1:3 at 50 °C for 1 h, and the resulting products were incubated with Plasmid Safe ATP-Dependent DNase (Lucigen #E3101K) at 37 °C for 1 h to remove the linear fragments. The assembled circular DNA was purified with QIAquick PCR Purification Kit (QIAGEN), followed by linearization with SspI enzyme (NEB #R3132) at 37 °C for 4 h, and purification with QIAquick PCR Purification Kit. Next, the purified products were performed the second amplification with Q5 High-Fidelity DNA polymerase for the addition of overhangs complementary to gRNA-expressing plasmid LentiGuide-BSD-Δscaffold, and the PCR products were then purified with QIAquick Gel Extraction Kit (QIAGEN). This LentiGuide-BSD-Δscaffold plasmid lacking the scaffold sequence was generated by removing the gRNA scaffold from LentiGuide-BSD plasmid using Q5 site-directed mutagenesis. Gibson Assembly was then employed to fuse the second amplified oligonucleotide pools to BsmBI-linearized LentiGuide-BSD-Δscaffold at a molar ratio of 3:1 at 50 °C for 1 h. The resulting products were then purified by isopropanol precipitation with GlycoBlueTM Coprecipitant (Thermo Fisher Scientific #AM9516), dissolved in TE buffer (pH 8.0) to a final concentration of 100 ng/μl at 55 °C for 10 min, and transformed into Endura electrocompetent cells (Lucigen #60242). The plasmid library was extracted by NucleoBond Xtra Maxi Kit (MACHEREY-NAGEL). The construction of the plasmid library containing single gRNA and corresponding targets followed the cloning processes of gRNA I/D and dual-target I/D pairs as mentioned above. In brief, oligonucleotides of each gRNA were annealed and inserted into BsmBI-linearized LentiGuide-BSD-PacI, and the constructs then were verified by Sanger sequencing. In the meantime, the target-only library was amplified with Q5 High-Fidelity DNA polymerase and purified with gel purification. Gibson Assembly was applied to ligate the amplified target-only oligonucleotide pools to their corresponding PacI-linearized single gRNA-expressing plasmids LentiGuide-BSD-PacI at a molar ratio of 6.5:1 at 50 °C for 1 h. The assembled products were purified by isopropanol precipitation and transformed into electrocompetent cells, followed by plasmid extraction as described above.

To validate the results of the high-throughput allele-editing screen, 6 genes with cancer hotspot mutations were selected. For allele-editing validation plasmid cloning, oligonucleotides containing cancer hotspot mutation-derived wildtype and mutant target sites, also referred to as wildtype and mutant target, were synthesized (IDT) and annealed, followed by insertion into BsmBI-linearized plasmid LentiEGFP-P2A-Puro. Meanwhile, 6 gRNAs were designed to target each selected cancer hotspot mutation, including one perfectly matched, two truncated, and three effective mismatched gRNAs based on the screen results. LentiCRISPR-V2-BSD (Addgene plasmid #83480) was employed to insert each gRNA according to the protocol from Feng Zhang’s lab. The ligated products were extracted using QIAprep Spin Miniprep Kit and confirmed by Sanger sequencing.

All the primers, gRNA, and fragments used for plasmid construction are listed in Supplementary Data 7.

Cell culture

HEK293T cell line was purchased from ATCC (CRL-3216) and cultured in DMEM medium (Gibco) supplemented with 10% FBS (Sigma) and 1% penicillin-streptomycin (Gibco) at 37 °C with 5% CO2. A monoclonal HEK293T cell line expressing doxycycline-inducible Cas9 (HEK293T-iCas9) was generated by infecting parental cells with pCW-Cas9 (Addgene plasmid #50661) lentivirus. All the cell lines were routinely tested for being free of mycoplasma contamination using MycoAlert™ Mycoplasma Detection Kit (Lonza #LT07-218).

Lentivirus production

HEK293T cells (5 × 106) were seeded in a 10-cm tissue culture dish 1 day before transfection. On the day of transfection, 4 μg of lentiviral plasmid, 4 μg of psPAX2 (Addgene plasmid #12260), and 2 μg of pMD2.G (Addgene plasmid #12259) in 500 μl Opti-MEM (Gibco) were mixed with 25 μl X-tremeGene HP DNA transfection reagent (Roche #06366236001) and then added to the pre-seeded HEK293T cells. The culture medium was changed with the addition of 20 μM HEPES buffer 6 h after transfection. Lentiviral supernatant was harvested and filtered through a 0.45-μM syringe filter (Millipore) 48 h after transfection. Aliquots were frozen at −80 °C for later use.

CRISPR screen and deep sequencing

HEK293T-iCas9 cells were seeded in eighteen 10 cm tissue culture dishes (4 × 106 cells per dish) with three independent biological replicates 1 day before transduction. On the day of transduction, the lentiviral plasmid library was transduced into the cells at an MOI of 0.3 in the presence of 8 μg/ml polybrene (Millipore). The infected cells were passaged to 15 cm tissue culture dishes 48 h after infection, selected with 20 μg/ml blasticidin (InvivoGen #ant-bl-1) for 3 days, and then incubated with doxycycline (Sigma #D9891) to induce Cas9 expression. Cells were passed every 2–3 days with at least 2 × 107 cells for each passage to maintain the library diversity and harvested (2 × 107 cells per replicate) on days 0, 3, 5, and 9 after Cas9 induction. Cell pellets were frozen at −80 °C for later genomic extraction using Blood & Cell Culture DNA Midi Kit (QIAGEN).

For the transduction of lentiviral plasmids with gRNA I/D and corresponding dual-target I/D, HEK293T-iCas9 cells were seeded in six-well plates (5 × 104 cells per well) with two biological replicates following the same procedure as described above. In all, 2 × 106 cells were harvested on day 9 after Cas9 induction for genomic extraction using QIAamp® DNA Mini Kit (QIAGEN).

After genomic extraction, the integrated target sequences were PCR amplified using NEBNext® Ultra™ II Q5® Master Mix (NEB #M0544) and prepared for deep sequencing by 2 rounds of PCR as described previously64. The 1st round PCR (16 cycles) was performed with 40 μg genomic DNA divided into eight 100 μl-PCR reactions to achieve 500-fold coverage over the library. To attach the Illumina adaptor and barcode sequences, the 2nd round PCR (12 cycles) was conducted with 5 μl of the 1st purified PCR product. The gel-purified samples were pooled and sequenced on Illumina Miseq, Hiseq 3000 or Nextseq 500 by 75-bp paired-end sequencing at MDACC-Smithville Next Generation Sequencing Core. All the primers applied for sequencing library preparation are provided in Supplementary Data 7.

Allele-specific editing validation

To validate the allele-editing results from the high-throughput screen, 2 rounds of transduction were performed. HEK293T cells were seeded in six-well plates (1 × 105 cells per well) and infected with lentivirus of LentiEGFP-P2A-Puro harboring the wildtype or mutant target sequence related to the 6 selected cancer hotspot mutations. On day 4 after puromycin (InvivoGen #ant-pr-1) selection, the resulting HEK293T cells were re-seeded in six-well plates (1 × 105 cells per well), followed by infection with lentivirus of LentiCRISPR-V2-BSD expressing perfectly matched, truncated or mismatched gRNAs. The infected cells from 2 biological replicates were passaged every 2–3 days and harvested (2 × 106 cells) on day 9 after blasticidin selection for genomic extraction.

Genomic DNA was isolated from cell pellets with QIAamp® DNA Mini Kit (QIAGEN) according to the manufacturer’s protocol. In total, 2 μg genomic DNA of each sample was used to amplify the integrated wildtype or mutant target sequence with Q5 High-Fidelity DNA polymerase. PCR products were purified by gel purification and undergone Sanger sequencing at MDACC-Smithville Molecular Biology Core. The wildtype and mutant allele-editing frequencies by perfectly matched, truncated, and mismatched gRNAs were analyzed through a Sanger sequencing-based CRISPR analysis online tool Inference of CRISPR Edits (ICE, https://ice.synthego.com).

Sequencing data analysis

A computational pipeline was developed to process the generated sequencing data. We first combined the partially overlapped, paired-end reads into a single sequence using FLASH v1.2.1165 with the options “-m 20, -M 150, -x 0.1, -allow-outies” (a minimum overlap of 20 bp, a maximum overlap of 150 bp, a maximum mismatch density of 0.1 and allow outie pairs). Next, we extracted the unique barcode 1 and 2 pairs from the merged reads and compared them to the barcodes of the designed constructs in the library. Sequencing reads were assigned to a specific construct if their barcodes were perfectly matched to the 10-nt of barcode 1 and last 10 bp of 15-nt of barcode 2 of the construct. The sequences between the two barcodes were then aligned to the designed sequences for indel calling using the Smith-Waterman algorithm66 with gap open penalty of 8, mismatch penalty of 1, match score of 2, and mismatch score of −2. Finally, the sequencing reads were assigned to one of five different indel types, i.e., wt + wt, indel + wt, wt + indel, indel + indel and large deletion, based on the alignment results. Specifically, if a deletion larger than 32-bp was identified in the alignments, the read was assigned to the large deletion category. Otherwise, we adopted a 10-bp window around the cutting site to capture the potential CRISPR-induced editing at on-target and off-target sequence. We can then easily assign the reads into four other categories: indel + wt (indels detected at on-target site only), wt  + indel (indels detected at off-target site only), indel + indel (indels detected at both on- and off-target sites), and wt + wt (indels detected at neither on- nor off-target sites). Of note, we only observed rare cases (~0.005%) where two cleavage sites are inverted in our dual-target design67. This is consistent with previous report where inversion rate of small DNA fragments with size <100 bp is much lower compared to that of larger fragments of hundreds to thousands bps68. Thus, here we did not include this inversion type in the classification of indel types. Distributions of the number of sequenced reads per gRNA-target pair to indicate the sequencing depths of the experiments with dual-target design are shown in Supplementary Fig. 17.

Estimation of off-on ratios

We developed a statistical method to infer the read counts of different cleavage states from the observed editing types of the sequencing data. Given a gRNA-target pair, we denote $$C={\left[{c}_{1},{c}_{2},{c}_{3},{c}_{4},{c}_{5}\right]}^{T}$$ to represent the observations of read counts, where $${c}_{1}$$ to $${c}_{5}$$ correspond to the editing types of wt + wt, indel + wt, wt + indel, indel + indel, and large deletion, respectively. We denote $$S={\left[{s}_{1},{s}_{2},{s}_{3},{s}_{4}\right]}^{T}$$ to represent the numbers of reads in the four cleavage states, where $${s}_{1}$$ to $${s}_{4}$$ correspond to the states of “no cleavage”, “cleavage at off-target only”, “cleavage at on-target only”, and “both cleavage”, respectively. We model the observation $$C$$ to be:

$$C={P}^{T}S+\varepsilon$$
(1)

where $$P$$ is a $$4\times 5$$ stochastic matrix and $$\varepsilon$$ is a vector of Gaussian noise. An entry $${p}_{i,j}$$ ($$i=1,2,3,4{;\; j}=1,2,\ldots ,5$$) in the matrix $$P$$ represents the conditional probability of observing the $$j$$th editing type given the $$i$$th cleavage state. The sum of each row in $$P$$ is 1, i.e., $${\sum }_{j=1}^{5}{p}_{i,j}=1$$, $$\forall i$$. Given the model in Fig. 1e, $$P$$ can be written as:

$$P=\left[\begin{array}{cc}\begin{array}{ccc}{p}_{1,1} & {p}_{1,2} & {p}_{1,3}\\ \,0 & {p}_{2,2} & 0\\ \,0 & 0 & {p}_{3,3}\end{array} & \begin{array}{cc}{p}_{1,4} & {p}_{1,5}\\ 0 & 0\\ 0 & {p}_{3,5}\end{array}\\ \begin{array}{ccc}0\, & \,0\, & 0\end{array} & \begin{array}{cc}{p}_{4,4} & {p}_{4,5}\end{array}\end{array}\right]$$
(2)

The parameters in $$P$$ were estimated from the observations of read counts on the control dual-target sequences. The first row, denoted $${P}_{1}$$, represents the background noise and PCR artifacts, which were estimated based on the read distribution of the NTT + NTT (no cleavage) controls. The second row, denoted $${P}_{2}$$, was estimated based on the observations from the NGG + NTT controls. Depending on the gRNA cutting efficiency, the reads derived from the NGG + NTT controls are subject to a mixture of the “no cleavage” and “cleavage at off-target only” states. Given the estimated $${P}_{1}$$ and the read counts of the five editing types on the NGG + NTT controls, $${P}_{2}$$ can be explicitly estimated by solving linear equations. Similarly, the parameters in the third and fourth rows in $$P$$ were computed from the observations on the NTT + NGG and NGG + NGG controls.

Given the stochastic matrix $$P$$, we adopted a maximum-likelihood estimation (MLE) approach to compute $$S$$. This can be achieved by minimizing $${\Vert {P}^{T}S-C\Vert }^{2}$$, under the constraint of $${s}_{i}\ge 0$$, $$\forall i$$. With the estimated reads for the four cleavage states, the off-on ratio of a gRNA-target pair was calculated as:

$$r=\frac{{s}_{2}+{s}_{4}}{{s}_{3}+{s}_{4}}$$
(3)

Where s2, s3 and s4 are the estimated read counts for “cleavage at off-target only”, “cleavage at on-target only”, and “both cleavage”, respectively. Considering that the calculation of off-on ratios is associated with large variation using a limited number of reads in the denominator, we filtered out gRNA-target pairs with less than 100 total edited reads at on-target site, i.e., $${s}_{3}+{s}_{4} < 100$$, resulting in a total number of 10,460 gRNA-target pairs with the estimated off-on ratios (Supplementary Data 1). When computing log off-on ratio, we added a small constant c to control the variation, i.e., $${{\log }}\left(r+c\right)$$. The constant c is chosen to be 0.01 because the dynamic range of off-on ratio evaluation is approximately 0.02–1.

Estimation of GMT

We assume that the intrinsic sequence of the gRNA and the mismatch context between gRNA and target sequence jointly contribute to the overall off-target effects. To decompose these two factors, we collected the off-on ratios of 3897 1-mismatch targets corresponding to 1438 gRNAs in our dataset (Supplementary Data 4). There are 12 types of unmatched nucleotide pairs (A-C, A-G, …, U-C), and 20 different positions of the mismatches, resulting in 240 mismatch types in total. Suppose $${r}_{{ij}}$$ represents the off-on ratio of the $$i$$th gRNA with a mismatch of the $$j$$th type, we model $${r}_{{ij}}$$ to be the multiplication of mismatch-dependent effect $${m}_{j}$$ and gRNA-intrinsic mismatch tolerance $${g}_{i}$$. In a log scale, this can be written as:

$${{{{{\rm{Log}}}}}}({r}_{{ij}}+c)={{\log }}({m}_{j})+{{\log }}({g}_{i})+\varepsilon$$
(4)

where $$\varepsilon$$ is the Gaussian noise and c is the small constant for controlling the variation of log off-on ratio. We then applied a gradient descent algorithm to compute the MLE estimation of $${m}_{j}$$ and $${g}_{i}$$, which represents the mismatch-dependent effect and the gRNA-intrinsic mismatch tolerance, respectively. The mismatch-dependent effects for single-mismatches are presented as a 12 $$\times$$ 20 matrix, namely M1 matrix, in which each entry represents the effects of a nucleotide mismatch at a specific position of the target sequence. The estimated $${g}_{j}$$ was subsequently used to explore the sequence determinants of GMT.

Prediction of GMT

To predict GMT from a gRNA sequence, we took all the 1438 gRNAs with estimated GMT scores ($${g}_{j}$$) to train a CNN model. For each gRNA, we tested two different encoding strategies, i.e., mononucleotide and dinucleotide, to vectorize the gRNA sequence as inputs (Supplementary Fig. 8). For the mononucleotide encoding, a 20-nt gRNA sequence was binarized into a 4 $$\times$$ 20 two-dimension array, with 0 s and 1 s indicating the absence or presence of 4 different nucleotides (A, T, C, G) at every single position. For the dinucleotide encoding, 20-nt gRNA sequence was binarized into a 16 $$\times$$ 19 two-dimension array, with 0 s and 1 s indicating the absence or presence of 16 different dinucleotides (AT, AC, AG, AA, TT, TA, TG, TC, CC, CA, CG, CT, GG, GA, GT, GC) at each of the constitutive positions along the gRNA sequence. A CNN regression model was then designed using Keras (https://keras.io/) with a TensorFlow backend engine, consisting of one convolution layer and one dense layer, terminating in a single neuron.

We compared two encoding methods for data vectorization with different settings of parameters including the size, shape, and number of convolution kernels using a five-fold cross-validation strategy. The performance was assessed by computing Spearman’s correlation between the predicted and observed GMT scores. Finally, we selected the dinucleotide inputs and CNN model with three kernels in the convolutional layer. We further tested our model on two independent datasets: the TTISS dataset, which includes 59 sgRNAs with genome-wide off-targets detected by TTISS for nine different Cas9 variants18; the CHANGE-seq dataset, which includes 110 gRNAs with genome-wide SpCas9 off-targets detected by CHANGE-seq19. For each gRNA, an overall off-on ratio was calculated as the sum of detected off-target reads divided by on-target reads. We predicted the GMT scores for all the gRNAs and compared the overall off-on ratios of gRNAs with high (top 25%) and low (bottom 25%) GMT scores. Statistical significance was measured using Mann–Whitney U test.

Estimation of combinatorial effects

We denote $${\delta }_{{ij}}$$ as the position-dependent combinatorial effect between two mismatches that occur at the $$i$$th and the $$j$$th nucleotide relative to the PAM ($$i,j=0,1,2,\ldots ,20$$). A value of $${\delta }_{{ij}}\,$$close to 1 suggests “independence” of the combination, whereas $${\delta }_{{ij}}\,$$close to 0 suggests a strong “epistasis-like” combinatorial effect. We estimated $${\delta }_{{ij}}$$ from the off-on ratios of 2-MM and individual 1-MM targets, as specified below.

To obtain a sufficient number of data points for a robust estimation of $${\delta }_{{ij}}$$, we consider the mismatches located in the $$i$$th and the $$j$$th nucleotide, as well as those in the adjacent locations. Suppose there are $${N}_{{ij}}$$ 2-MM targets with mismatches at positions (i, j), (i, j-1), (i, j+1), (i-1, j), and (i+1, j), we denote $${x}_{{ij}}^{k}$$ to be the off-on ratio of the $$k$$th 2-MM target in this group ($$k=0,1,2,\ldots ,{N}_{{ij}}$$) and model $${x}_{{ij}}^{k}$$ as:

$${x}_{{ij}}^{k}={y}_{{ij}}^{k}{z}_{{ij}}^{k}{\delta }_{{ij}}+\varepsilon$$
(5)

where $${y}_{{ij}}^{k}$$ and $${z}_{{ij}}^{k}$$ are the off-on ratios of the 1-MM targets corresponding to the 2-MM target in the library design, and $$\varepsilon$$ is the Gaussian noise. The MLE estimate of $${\delta }_{{ij}}$$ can be explicitly computed as:

$$\hat{{\delta }_{{ij}}}=\mathop{\sum }\limits_{k=1}^{{N}_{{ij}}}\left({x}_{{ij}}^{k}{y}_{{ij}}^{k}{z}_{{ij}}^{k}\right)\bigg/\mathop{\sum }\limits_{k=1}^{{N}_{{ij}}}{\left({y}_{{ij}}^{k}{z}_{{ij}}^{k}\right)}^{2}$$
(6)

The combinatorial effects for 2-position combinations were presented as a 20 × 20 matrix, namely M2 matrix, in which each entry represents the combinatorial effects between two mismatched positions.

To cross-reference the combinatorial effects derived from our data, we also calculated the relative co-occurrence score (RCS) that represents the observed frequency of two mismatches relative to random expectation based on CHANGE-seq dataset19. Assuming that there are n off-target sites detected in the genome, the RCS is defined as:

$${RCS}=\,\frac{{a}_{{ij}}* n}{{b}_{i}* {c}_{j}}$$
(7)

Where $${a}_{{ij}}$$ is the number of off-target sites harboring mismatches at both positions i and j, $${b}_{i}$$ is the number of target sites harboring mismatches at position i, and $${c}_{j}$$ is the number of target sites harboring mismatches at position j. We computed the RCS for all the gRNAs in the CHANGE-seq dataset and took the average RSC over the gRNAs to obtain the matrix in Fig. 3e.

To explore the combinatorial effect from a biophysical point of view, we used a previous kinetic model40 to perform simulation and to estimate $${\delta }_{{ij}}$$ from the simulated data. In brief, the cleavage rate can be directly calculated given the free energy gain of binding the PAM, the energy gain for extending the hybrid over a match, the cost associated with extending the hybrid over an isolated mismatch, as well as the cost of extending the hybrid over neighboring mismatches. As an illustration, we plotted the combinatorial effect calculated for a specific choice of these parameters with the gain in energy due to PAM binding of 5 $${k}_{{{{{{\rm{B}}}}}}}T$$, the gain per correctly matched hybrid base pair of 0.25$$\,{k}_{{{{{{\rm{B}}}}}}}T$$, the cost of a mismatch of 4 $${k}_{{{{{{\rm{B}}}}}}}T$$ if it is isolated and 3 $${k}_{{{{{{\rm{B}}}}}}}T$$ if it is added to an existing bubble (Fig. 3g).

Model-based off-target prediction with MOFF

In MOFF, we integrate three factors including the individual mismatch effect (IME), the combinatorial effect (CE), and the GMT effect.

To explain the MOFF model, we start with a gRNA g and a target with a single mismatch $${m}_{1}$$. The expected off-on ratio, S(g, m1), is the multiply of IME and GMT, i.e.,

$$S\left(g,\,{m}_{1}\right)={s}_{1}{s}_{{{{{{{\mathrm{GMT}}}}}}}}$$
(8)

where $${s}_{1}$$ is the effect of $${m}_{1}$$ computed from the M1 matrix as described in “Estimation of GMT” section, and $${s}_{{{{{{{\mathrm{GMT}}}}}}}}$$ is the GMT effect estimated from the dinucleotide CNN regression model as described in “Prediction of GMT” section.

Next, we consider two mismatches $${m}_{1}$$ and $${m}_{2}$$. The expected off-on ratio, S(g, m1, m2), is the multiply of the effects of individual single mismatches and the combinatorial effect:

$$S\left(g,\,{m}_{1},{m}_{2}\right)=S\left(g,\,{m}_{1}\right)\,S\left(g,\,{m}_{2}\right){\delta }_{12}={s}_{1}{s}_{2}{\delta }_{12}{({s}_{{{{{{{\mathrm{GMT}}}}}}}})}^{2}$$
(9)

where $${\delta }_{12}$$ is the pairwise combinatorial effect with respect to the positions of $${m}_{1}$$ and $${m}_{2}$$, as computed from M2 matrix as described in “Estimation of Combinatorial Effects” section.

When three mismatches are considered, it is ideal to estimate the combinatorial effect of all three. Unfortunately, the number of possible combinations increases exponentially, which makes experimental estimation of parameters impractical. Here, we consider a sequential model to add the 3rd mismatch, $${m}_{3}$$, to the model of two mismatches $$S\left(g,{m}_{1},{m}_{2}\right)$$:

$$S\left(g,\,{m}_{1},{m}_{2},{m}_{3}\right)=S\left(g,\,{m}_{1},{m}_{2}\right)S\left(g,\,{m}_{3}\right){\delta }_{12,3}$$
(10)

where $${\delta }_{{{{{\mathrm{12,3}}}}}}$$ is the additional combinatorial effect and is modeled to be the geometric mean of pairwise combinatorial effects of $${\delta }_{13}$$ and $${\delta }_{23}$$.

Combining Eqs. (8)–(10), and the above definition of $${\delta }_{{{{{\mathrm{12,3}}}}}}$$, we have:

$$S\left(g,\,{m}_{1},{m}_{2},{m}_{3}\right)={s}_{1}{s}_{2}{s}_{3}{\delta }_{12}\sqrt{{\delta }_{13}{\delta }_{23}}{({s}_{{GMT}})}^{3}$$
(11)

Note that $${m}_{1},{m}_{2},{m}_{3}$$ are indeed unordered. With a different order in the sequential model, $$S\left(g,{m}_{1},{m}_{2},{m}_{3}\right)$$ can also be computed as $${s}_{1}{s}_{2}{s}_{3}{\delta }_{13}\sqrt{{\delta }_{12}{\delta }_{23}}{({s}_{{GMT}})}^{3}$$ or $${s}_{1}{s}_{2}{s}_{3}{\delta }_{23}\sqrt{{\delta }_{12}{\delta }_{13}}{({s}_{{GMT}})}^{3}$$. Therefore, we compute $$S\left(g,{m}_{1},{m}_{2},{m}_{3}\right)\,$$to be the geometric mean of the scores computed with all three possible orders in the sequential model, simplified as follows:

$$S\left(g,\,{m}_{1},{m}_{2},{m}_{3}\right)={s}_{1}{s}_{2}{s}_{3}{({\delta }_{12}{\delta }_{13}{\delta }_{23})}^{2/3}{({s}_{{GMT}})}^{3}$$
(12)

Finally, we extend the three-mismatch model to k mismatches using mathematical induction approach, and define the MOFF score to be the logarithm of predicted off-on ratio:

$${S}_{{MOFF}}=\mathop{\sum }\limits_{i=1}^{k}{{\log }}({s}_{i})+\frac{2}{k}\mathop{\sum }\limits_{i=1}^{k}\mathop{\sum }\limits_{j=1}^{i-1}{{\log }}({\delta }_{{ij}})+k\,{{\log }}({s}_{{GMT}})$$
(13)

To assess the genome-wide specificity of a given gRNA, we first mapped the gRNA to the genome to search for potential off-target sites harboring up to 6 mismatches using CRISPRitz, a software for rapid and high-throughput in silico off-target site identification63. Next, we defined a MOFF-aggregate score, which is the logarithm of the sum of the MOFF-target scores for all the potential off-target sites. We note that, the current version of MOFF only considers genomic sites with mismatches, but not indels, relative to the 20-nt crRNA sequence as the potential off-targets.

Evaluation of the models and feature importance analysis

To evaluate the performance of our model, we curated three independent testing datasets generated by three different platforms, i.e., GUIDE-seq12, TTISS18, and CHANGE-seq19. GUIDE-seq dataset includes 348 detected off-target sites for 9 gRNAs, TTISS dataset contains 630 detected off-target sites across 59 gRNAs and CHANGE-seq dataset consists of 96,555 detected off-target sites corresponding to 109 gRNAs. For each gRNA-target pair, we measured the off-target effect as off-on ratio, which is calculated as the detected off-target reads divided by on-target reads. For each gRNA, we measured its genome-wide specificity as overall off-on ratio, which is calculated as the total detected off-target reads across the genome divided by on-target reads. We reasoned that the classification of off-target and untargeted sites is highly dependent on the sensitivity of each platform, where the off-target effects indeed take continuous values that may differ by several orders of magnitude. Thus, we adopted the Spearman correlations between measured and predicted off-on ratios for quantitative evaluations.

We further evaluated the importance of different features using Gini importance, which was implemented through the Random Forest Regressor module from the scikit-learn package in Python with default parameters. For MOFF-target, we consider three features: IME, CE, and GMT. For MOFF-aggregate, we consider two features: (1) the mismatch-dependent feature, which is the sum of the predicted mismatch-dependent effects (IME + CE) without considering GMT; (2) the GMT effect corresponding to each gRNA.

Comparison of off-target prediction methods

We compared the performance of MOFF to five representative off-target prediction methods: (1) Cutting Frequency Determination (CFD) score is the multiplication of single mismatch effects derived from a cleavage dataset targeting the coding sequence of the human CD33 gene in MOLM-13 cells9. For the implementation of CFD, we used the Supplementary Table 19 of their original publication which includes all the single-mismatch effects. (2) Elevation is a two-layer regression model where the first layer learns to predict the effects of single-mismatch and the second layer learns how to combine single-mismatch effects into a final score24. The source codes of Elevation were downloaded from https://github.com/Microsoft/Elevation. (3) CNN_std is a deep learning model to predict off-target effects using a standard convolutional neuron network27 and the source codes to implement CNN_std were downloaded from https://github.com/MichaelLinn/off_target_prediction. (4) CRISPR-Net is a more recent deep learning method using a recurrent convolutional network, which shows superior performance compared to other machine learning approaches29. The source codes to implement CRISPR-Net were downloaded from https://codeocean.com/capsule/9553651/tree/v1. And (5) CRISPRoff is an approximate binding energy model for the Cas9-gRNA-DNA complex, which systematically combines the energy parameters obtained for RNA-RNA, DNA-DNA, and RNA-DNA duplexes31. The source codes for executing CRISPRoff were downloaded from https://github.com/RTH-tools/crisproff.

We adopted the same strategy to evaluate the performance of different methods as described in “Evaluation of the models and feature importance analysis” section. We note that aggregation models for Elevation and CRISPR-Net assigned different weights to off-targets sites occurring at genic or non-genic regions, which were trained on cell viability screen data of gRNAs targeting non-essential genes; however, cell viability is not a direct indication of DNA cleavage at off-target sites and the genomic features used for training are not associated with the sequence determinants for gRNA specificity. Therefore, the performance of their models degraded when applied to the datasets that directly measure DNA cleavage at off-target sites across the genome (Supplementary Fig. 18). For a fair comparison, we used logarithm of the sum of individual scores for all potential off-target sites identified by CRISPRitz63 to predict gRNA specificity for all these methods.

Application to gRNA design

We collected Avana, GeCKO-v2, and Sanger CRISPR screen data43,69 from the Depmap portal at https://depmap.org/portal/download. The efficiency of gRNAs was predicted using the DeepHF method44, and the off-target effect of gRNAs was predicted using MOFF-aggregate. As the analysis involves the computation on tens of thousands of gRNAs, we configured the alignment input to allow up to five mismatches for computational efficiency. All the gRNAs were then classified into three categories: Low efficiency (DeepHF score < 0.6), High efficiency High off-target (DeepHF score > 0.6 and MOFF-aggregate > 1), and High efficiency Low off-target (DeepHF score > 0.6 and MOFF-aggregate < 1).

To compare the effective size of gRNAs within different categories, we compared the average log-fold change (LFC) of gRNAs targeting 1246 core essential genes and 758 non-essential genes70 within each category using strictly standardized mean difference (SSMD), which is widely used for quality control in the high-throughput screen data. Higher SSMD values indicate that gRNAs can better discriminate essential and non-essential genes, therefore achieving greater effective size.

To test the cross-library reproducibility for different types of gRNAs, we compared gRNAs targeting 529 cell-specific essential genes from Avana data and Sanger data. Specifically, for gRNAs in different categories, we measured the Pearson correlation between averaged LFC of gRNAs targeting the same genes in Avana and Sanger data across different cell lines. Higher correlation indicates that the effects of gRNAs are more reproducible among different libraries. The cell-specific essential gene list was downloaded from the Depmap portal at https://ndownloader.figshare.com/files/27902064.

gRNA selection for allele-specific editing

Given the local DNA sequences of the wildtype and mutant alleles, we first searched for all the possible gRNAs followed by a PAM (NGG) motif targeting the DNA sequence of the mutant allele that harbors a single mutation compared to the wildtype allele. The selected gRNAs, which we termed seed gRNAs, are perfectly matched to the mutant allele, and have a single mismatch relative to the wildtype allele. Next, we introduced all possible single mismatches to the seed gRNAs to generate the candidate gRNAs that differ by 1-nt from the mutant and 2-nt from the wildtype allele. To select the best gRNAs from these candidates, we predicted the MOFF-target scores between the gRNAs and the targeted DNA sequence from the mutant and wildtype alleles, which indicate the sensitivity and selectivity of the gRNAs, respectively. In practice, we selected gRNAs that satisfy: (1) MOFF-target scores at mutant allele >0.5 to ensure high knockout efficiency, (2) the ratio of MOFF-target scores between wildtype allele and mutant allele <0.2, for high selectivity.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.