Abstract
Prime editing is a versatile genome editing tool but requires experimental optimization of the prime editing guide RNA (pegRNA) to achieve high editing efficiency. Here we conducted a high-throughput screen to analyze prime editing outcomes of 92,423 pegRNAs on a highly diverse set of 13,349 human pathogenic mutations that include base substitutions, insertions and deletions. Based on this dataset, we identified sequence context features that influence prime editing and trained PRIDICT (prime editing guide prediction), an attention-based bidirectional recurrent neural network. PRIDICT reliably predicts editing rates for all small-sized genetic changes with a Spearman’s R of 0.85 and 0.78 for intended and unintended edits, respectively. We validated PRIDICT on endogenous editing sites as well as an external dataset and showed that pegRNAs with high (>70) versus low (<70) PRIDICT scores showed substantially increased prime editing efficiencies in different cell types in vitro (12-fold) and in hepatocytes in vivo (tenfold), highlighting the value of PRIDICT for basic and for translational research applications.
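The two summary statistics quoted above, a Spearman rank correlation between predicted and measured editing rates and a fold difference between pegRNAs scoring above and below 70, can be illustrated with a minimal standard-library sketch. The function names and the input values below are hypothetical and are not taken from the PRIDICT repository. Computing the fold difference as a ratio of group means is an assumption about how such a comparison might be made, not a description of the authors' analysis.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman's R: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def fold_change_by_score(scores, measured, threshold=70):
    """Mean measured efficiency above the score threshold divided by the
    mean below it (assumed definition, for illustration only)."""
    high = [m for s, m in zip(scores, measured) if s > threshold]
    low = [m for s, m in zip(scores, measured) if s < threshold]
    return mean(high) / mean(low)


# Made-up illustrative values, not data from the study.
predicted = [82.0, 75.0, 40.0, 15.0, 60.0]
measured = [55.0, 61.0, 12.0, 3.0, 20.0]
print(spearman_r(predicted, measured))
print(fold_change_by_score(predicted, measured))
```

Because Spearman's R depends only on rank order, it rewards a model for ordering pegRNAs correctly even when the predicted absolute editing rates are off, which is the property that matters when choosing among candidate pegRNAs for one edit.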
Data availability
Measured editing rates used for analysis and figures in this study are provided as Supplementary Tables and on GitHub (https://github.com/uzh-dqbm-cmi/PRIDICT). DNA-sequencing data are available via the National Center for Biotechnology Information Sequence Read Archive (PRJNA825584). Target sequences of pathogenic mutations were based on the ClinVar database (accessed December 2019), and corresponding genomic sequences (flanking the edit) were acquired via the UCSC Genome Browser (Table Browser, hg38). The pCMV-PE2-tagRFP-BleoR plasmid is available from Addgene (no. 192508).
Code availability
Custom Python code used in this study is provided on GitHub (https://github.com/uzh-dqbm-cmi/PRIDICT). Additional information on the PRIDICT algorithm can be found in Supplementary Methods 1.
References
Anzalone, A. V. et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576, 149–157 (2019).
Hsu, J. Y. et al. PrimeDesign software for rapid and simplified design of prime editing guide RNAs. Nat. Commun. 12, 1034 (2021).
Hwang, G.-H. et al. PE-Designer and PE-Analyzer: web-based design and analysis tools for CRISPR prime editing. Nucleic Acids Res. 49, W499–W504 (2021).
Kim, H. K. et al. Predicting the efficiency of prime editing guide RNAs in human cells. Nat. Biotechnol. 39, 198–206 (2021).
Li, Y., Chen, J., Tsai, S. Q. & Cheng, Y. Easy-Prime: a machine learning–based prime editor design tool. Genome Biol. 22, 235 (2021).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Nielsen, S., Yuzenkova, Y. & Zenkin, N. Mechanism of eukaryotic RNA polymerase III transcription termination. Science 340, 1577–1580 (2013).
Gao, Z., Herrera-Carrillo, E. & Berkhout, B. Delineation of the exact transcription termination signal for type 3 polymerase III. Mol. Ther. Nucleic Acids 10, 36–44 (2018).
Bill, C. A., Duran, W. A., Miselis, N. R. & Nickoloff, J. A. Efficient repair of all types of single-base mismatches in recombination intermediates in Chinese hamster ovary cells: competition between long-patch and G-T glycosylase-mediated repair of G-T mismatches. Genetics 149, 1935–1943 (1998).
Walton, R. T., Christie, K. A., Whittaker, M. N. & Kleinstiver, B. P. Unconstrained genome targeting with near-PAMless engineered CRISPR-Cas9 variants. Science 368, 290–296 (2020).
Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems (eds von Luxburg, U. et al.) 4768–4777 (Curran Associates Inc., 2017).
Kim, H. K. et al. SpCas9 activity prediction by DeepSpCas9, a deep learning–based model with high generalization performance. Sci. Adv. 5, eaax9249 (2019).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proc. 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 3319–3328 (PMLR, 2017).
Doench, J. G. et al. Rational design of highly active sgRNAs for CRISPR-Cas9–mediated gene inactivation. Nat. Biotechnol. 32, 1262–1267 (2014).
Nelson, J. W. et al. Engineered pegRNAs improve prime editing efficiency. Nat. Biotechnol. 40, 402–410 (2022).
Chen, P. J. et al. Enhanced prime editing systems by manipulating cellular determinants of editing outcomes. Cell 184, 5635–5652.e29 (2021).
Nair, N. et al. Computationally designed liver-specific transcriptional modules and hyperactive factor IX improve hepatic gene therapy. Blood 123, 3195–3199 (2014).
Untergasser, A. et al. Primer3—new capabilities and interfaces. Nucleic Acids Res. 40, e115 (2012).
Villiger, L. et al. Treatment of a metabolic liver disease by in vivo genome base editing in adult mice. Nat. Med. 24, 1519–1525 (2018).
Kim, H. K. et al. In vivo high-throughput profiling of CRISPR-Cpf1 activity. Nat. Methods 14, 153–159 (2017).
Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
Kim, N. et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nat. Biotechnol. 38, 1328–1336 (2020).
Dang, Y. et al. Optimizing sgRNA structure to improve CRISPR-Cas9 knockout efficiency. Genome Biol. 16, 280 (2015).
Böck, D. et al. In vivo prime editing of a metabolic liver disease in mice. Sci. Transl. Med. 14, eabl9238 (2022).
Jensen, K. T. et al. Chromatin accessibility and guide sequence secondary structure affect CRISPR-Cas9 gene editing efficiency. FEBS Lett. 591, 1892–1901 (2017).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10 (2011).
Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).
Lorenz, R. et al. ViennaRNA package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
Clement, K. et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat. Biotechnol. 37, 224–226 (2019).
Schep, R. et al. Impact of chromatin context on Cas9-induced DNA double-strand break repair pathway balance. Mol. Cell 81, 2216–2230.e10 (2021).
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2012).
Luo, Y. et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 48, D882–D889 (2020).
Karabacak Calviello, A., Hirsekorn, A., Wurmus, R., Yusuf, D. & Ohler, U. Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling. Genome Biol. 20, 42 (2019).
Lamb, K. N. et al. Discovery and characterization of a cellular potent positive allosteric modulator of the polycomb repressive complex 1 chromodomain, CBX7. Cell Chem. Biol. 26, 1365–1379.e22 (2019).
Hattori, T. et al. Antigen clasping by two antigen-binding sites of an exceptionally specific antibody for histone methylation. Proc. Natl Acad. Sci. USA 113, 2092–2097 (2016).
Lee, B. T. et al. The UCSC Genome Browser database: 2022 update. Nucleic Acids Res. 50, D1115–D1122 (2022).
Zerbino, D. R., Johnson, N., Juettemann, T., Wilder, S. P. & Flicek, P. WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis. Bioinformatics 30, 1008–1009 (2014).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785–794 (ACM, 2016).
Marquart, K. F. et al. Predicting base editing outcomes with an attention-based deep learning algorithm trained on high-throughput target library screens. Nat. Commun. 12, 1–25 (2020).
Paszke, A. et al. Automatic differentiation in PyTorch. In Proc. 31st Annual Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems 2017 (NIPS, 2017).
Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Moschitti, A. et al.) 1724–1734 (Association for Computational Linguistics, 2014).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/abs/1412.3555 (2014).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Bengio, Y., Simard, P. & Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 157–166 (1994).
Graves, A. Supervised Sequence Labelling with Recurrent Neural Networks 385 (Springer, 2012).
Luong, T., Pham, H. & Manning, C. D. Effective approaches to attention-based neural machine translation. In Proc. 2015 Conference on Empirical Methods in Natural Language Processing (eds Màrquez, L. et al.) 1412–1421 (Association for Computational Linguistics, 2015).
Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (eds von Luxburg, U. et al.) 6000–6010 (Curran Associates Inc., 2017).
Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).
Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).
Eggington, J. M., Greene, T. & Bass, B. L. Predicting sites of ADAR editing in double-stranded RNA. Nat. Commun. 2, 319 (2011).
Acknowledgements
We thank the Functional Genomics Center Zurich for their help and support in next-generation sequencing; the Flow Cytometry Facility of the University of Zurich and especially M. Wickert for performing liver hepatocyte sorting experiments; the Science IT team at the University of Zurich for providing infrastructure used for data analysis and especially P. Shemella for helpful discussions about code performance optimizations; R. Schep for discussions about chromatin marks; G. Affentranger for support in the design of figures; and the members of the Schwank laboratory for fruitful discussions. This work was supported by the SNF (grant nos. 310030_185293 and 201184) and the University Research Priority Programs ‘Human Reproduction Reloaded’ and ‘ITINERARE’ of the University of Zurich. K.F.M. holds a PHRT iDoc Fellowship (PHRT_324).
Author information
Contributions
N.M. designed the study, performed experiments and analyzed data. A.A. designed and generated attention-based bidirectional RNNs (PRIDICT) and implemented feature extraction strategies. A.A. and N.M. built linear regression and tree-based machine learning models and performed feature extraction analysis. L.K. performed in vivo experiments. K.F.M. and C.S. contributed to arrayed validation experiments. L.S. performed pegRNA and AdV cloning experiments. Z.B. performed the analysis of chromatin characteristics of endogenous loci. N.M., A.A. and G.S. wrote the manuscript. M.K. and G.S. designed and supervised the research. All authors revised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Sangsu Bae and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Self-targeting screen characteristics.
a, Visualization of library design (library 1) and numbers before and after filtering results. b, Distribution of edit positions for single base replacement edits in library 1. c, Distribution of edit positions for insertion edits in library 1. d, Distribution of edit positions for deletion edits in library 1. e, Distribution of insertion lengths in library 1. f, Distribution of deletion lengths in library 1. g, Distribution of edit types in library 1 (number of design variants and percentage of the total library). h,i, Editing rates of a test self-targeting locus with a forward (Fw) or reverse (Rv) orientation of the target sequence, either at the plasmid level or integrated by lentiviral transduction in HEK293T cells. Data points for bars 2, 3 and 5, 6 (from left) correspond to two technical replicates (simultaneous transfection of two separate wells). Only one data point was used for the plasmid controls (bars 1 and 4). h, pegRNA with TAG to TGG edit. i, pegRNA with TAG to TAC edit. The observed editing in the forward direction in the absence of PE2 could be caused by lentiviral reshuffling or ADAR-mediated A to I (G) RNA editing. The latter could occur during lentiviral packaging in HEK293T cells: HEK293T cells endogenously express ADAR, and the target site is present as RNA on the lentiviral vector and targeted by the complementary pegRNA with a mismatch, providing an ideal template for ADAR-dependent RNA editing. The observation that primarily TAG to TGG (but not TAG to TAC) showed background editing is in line with this hypothesis, as previous studies showed ADAR preference for UAG sequences (ref. 52).
Extended Data Fig. 2 Additional validation of the DeepPE model.
a, Predicted (PRIDICT) and measured intended editing efficiency for G-to-C edits at position 5 of the RTT in the dataset from this study. Data from all five test sets (fivefold cross-validation) were combined for this visualization. n = 540. b, Evaluation of the attention-based bidirectional RNN (PRIDICT-AttnBiRNN; trained on the dataset from this study) by testing on pegRNAs from the Kim et al. 2021 HT dataset (only G-to-C edits at position 5). n = 4,457. c, Evaluation of the DeepPE model (original, trained on the Kim et al. 2021 HT dataset) by testing on the dataset from this study (only G-to-C edits at position 5). n = 540. d,e, SHAP analysis of XGBoost models trained and tested on the DeepPE dataset (n = 43,149) (d) or on G-to-C position 5 edits from library 1 (e). Feature descriptions are listed in Supplementary Table 1. f, Editing efficiency with different RTT overhang lengths (5, 7, 10, 15 bp) in the DeepPE (Kim et al.) dataset. n for each bar (left to right) = 10,746, 10,828, 10,921, 10,654. g, Editing efficiency with different RTT overhang lengths (3, 7, 10, 15 bp) in G-to-C position 5 edits of library 1 for a direct comparison to identical edits in the DeepPE dataset. n for each bar (left to right) = 135, 135, 137, 133. f,g, Error bars = mean ±s.d. h,i, Evaluation of the DeepPE model (n = 18) on 18 of the 45 endogenous edits from this study in HEK293T (h) and K562 (i).
Extended Data Fig. 3 Additional validation of the Easy-Prime PE2 model.
a, Edit type count distribution in the original Easy-Prime test dataset. b, Evaluation of the Easy-Prime PE2 model by testing this XGBoost model on the original Easy-Prime test dataset (ref. 5), filtered against 1 bp edits at position 5 of the RTT to eliminate the bias towards this edit type. n = 585. c–g, Evaluation of Easy-Prime PE2 by testing the model on datasets generated in this study. c, Library 1 in HEK293T, n = 92,423. d, Library 2 (editing with PE2 and pegRNAs without tevopreQ1) in HEK293T, n = 915. e, Library 2 (editing with PE2 and pegRNAs without tevopreQ1) in K562, n = 876. f,g, Endogenous loci from Fig. 4a,b in HEK293T (f) and K562 (g), n = 45. h, Intended editing efficiency rank of the best-predicted pegRNA for each pathogenic locus in library 1 (PRIDICT and Easy-Prime). Pathogenic loci with multiple pegRNAs on rank 1 (identical efficiency) and loci with fewer than three pegRNAs were excluded from this analysis. Predictions from PRIDICT were taken from five different cross-validations to ensure that none of the predictions are included in the training set. n = 12,189. i, Intended editing efficiency rank of the best-predicted pegRNA for each endogenous locus (PRIDICT and Easy-Prime). n = 15.
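The rank analysis described for panels h and i can be sketched as follows, under stated assumptions. The function name and the record layout (locus, predicted score, measured efficiency) are hypothetical, and picking the first pegRNA when predicted scores are tied is an assumption, since the caption does not say how such ties were handled. The two exclusion rules follow the caption: loci with fewer than three pegRNAs and loci with several pegRNAs sharing the top measured efficiency are skipped.

```python
from collections import defaultdict


def rank_of_best_predicted(records, min_pegrnas=3):
    """Map each locus to the measured-efficiency rank (1 = best) of the
    pegRNA with the highest predicted score.

    records: iterable of (locus, predicted, measured) tuples.
    """
    by_locus = defaultdict(list)
    for locus, predicted, measured in records:
        by_locus[locus].append((predicted, measured))

    result = {}
    for locus, pegs in by_locus.items():
        if len(pegs) < min_pegrnas:
            continue  # excluded: too few pegRNAs at this locus
        top_measured = max(m for _, m in pegs)
        if sum(1 for _, m in pegs if m == top_measured) > 1:
            continue  # excluded: several pegRNAs tied at rank 1
        # Measured efficiency of the pegRNA the model scores highest
        # (first one wins on tied predictions; an assumption).
        _, chosen = max(pegs, key=lambda p: p[0])
        # Rank = 1 + number of pegRNAs measured strictly higher.
        result[locus] = 1 + sum(1 for _, m in pegs if m > chosen)
    return result


# Made-up illustrative records, not data from the study.
records = [
    ("locusA", 80, 50), ("locusA", 60, 65), ("locusA", 30, 10),
    ("locusB", 70, 40), ("locusB", 20, 5),
    ("locusC", 90, 30), ("locusC", 50, 30), ("locusC", 10, 2),
]
print(rank_of_best_predicted(records))
```

In this toy input only locusA survives the filters: its top-scored pegRNA has the second-highest measured efficiency, so it receives rank 2. A model that always picked the truly best pegRNA would yield rank 1 at every locus.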
Extended Data Fig. 4 Additional library 2 evaluation with PEmax.
a, Mean editing efficiencies of each replicate, including all pegRNAs in library 2 with different experimental conditions in U2OS and K562 cells. Error bars indicate the mean ±s.d. of three biologically independent replicates. n = 3. Mean editing of library 2 for each of the three replicates is based on the following number of pegRNAs for each data point (bars left to right) = 916, 922, 917, 924, 879, 869, 877, 866. Note that absolute levels of editing efficiency for PEmax cannot be directly compared to PE2 in this study due to the use of different selection agents (Blasticidin for PEmax screens compared to Zeocin for PE2 screens). Previous studies showed that in identical setups, PEmax surpasses the performance of PE2 (ref. 16). b, Spearman correlation for PEmax editing efficiencies in library 2 between different experimental conditions (MLH1dn, tevopreQ1) and cell lines (K562, U2OS). c, Editing efficiency rank correlations (Spearman) in library 2 between editing performed with PE2 versus editing performed with PEmax.
Supplementary information
Supplementary Information
Supplementary Notes 1–3, Tables 1–3, Figs. 1–19 and Methods 1.
Supplementary Table 4
List of library sequences used in this study.
Supplementary Table 5
List of oligo sequences used for cloning and PCR.
Supplementary Table 6
Tables containing information about pegRNAs used in library 1, library 2 and endogenous editing experiments and their associated editing efficiencies.
Supplementary Table 7
Table containing Spearman correlations between every pair of features listed in Supplementary Table 1.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mathis, N., Allam, A., Kissling, L. et al. Predicting prime editing efficiency and product purity by deep learning. Nat Biotechnol 41, 1151–1159 (2023). https://doi.org/10.1038/s41587-022-01613-7
DOI: https://doi.org/10.1038/s41587-022-01613-7