Predicting prime editing efficiency and product purity by deep learning

Abstract

Prime editing is a versatile genome editing tool but requires experimental optimization of the prime editing guide RNA (pegRNA) to achieve high editing efficiency. Here we conducted a high-throughput screen to analyze prime editing outcomes of 92,423 pegRNAs on a highly diverse set of 13,349 human pathogenic mutations that include base substitutions, insertions and deletions. Based on this dataset, we identified sequence context features that influence prime editing and trained PRIDICT (prime editing guide prediction), an attention-based bidirectional recurrent neural network. PRIDICT reliably predicts editing rates for all small-sized genetic changes with a Spearman’s R of 0.85 and 0.78 for intended and unintended edits, respectively. We validated PRIDICT on endogenous editing sites as well as an external dataset and showed that pegRNAs with high (>70) versus low (<70) PRIDICT scores showed substantially increased prime editing efficiencies in different cell types in vitro (12-fold) and in hepatocytes in vivo (tenfold), highlighting the value of PRIDICT for basic and for translational research applications.
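The Spearman's R values quoted above are rank correlations between predicted and measured editing rates. A minimal sketch of that metric follows, assuming nothing about the authors' implementation; the function names and the numbers are hypothetical and only illustrate the calculation.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every entry tied with order[start].
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_r(predicted, measured):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(predicted), average_ranks(measured)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up values for illustration only (not data from the study).
predicted = [82.0, 15.5, 40.2, 71.3, 5.0]
measured = [75.1, 20.4, 22.0, 80.6, 3.2]
rho = spearman_r(predicted, measured)
```

Because only ranks enter the calculation, the metric rewards a model for ordering pegRNAs correctly even when its absolute predicted rates are offset from the measured ones.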

Fig. 1: High-throughput screen for determinants of prime editing efficiency.
Fig. 2: Prediction of pegRNA editing rates by an attention-based bidirectional RNN.
Fig. 3: Feature importance overview for editing prediction.
Fig. 4: Validation of PRIDICT on endogenous loci and external datasets.
Fig. 5: Evaluation of MLH1dn and tevopreQ1 effect on PE2 editing efficiency and PRIDICT performance in library 2.

Data availability

Measured editing rates used for analysis and figures in this study are provided as Supplementary Tables and on GitHub (https://github.com/uzh-dqbm-cmi/PRIDICT). DNA-sequencing data are available via the National Center for Biotechnology Information Sequence Read Archive (PRJNA825584). Target sequences of pathogenic mutations were based on the ClinVar database (accessed December 2019), and corresponding genomic sequences (flanking the edit) were acquired via UCSC Genome Browser (Table Browser, hg38). The pCMV-PE2-tagRFP-BleoR plasmid is available from Addgene (no. 192508).

Code availability

Custom Python code used in this study is provided on GitHub (https://github.com/uzh-dqbm-cmi/PRIDICT). Additional information on the PRIDICT algorithm can be found in Supplementary Methods 1.

References

  1. Anzalone, A. V. et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576, 149–157 (2019).

  2. Hsu, J. Y. et al. PrimeDesign software for rapid and simplified design of prime editing guide RNAs. Nat. Commun. 12, 1034 (2021).

  3. Hwang, G.-H. et al. PE-Designer and PE-Analyzer: web-based design and analysis tools for CRISPR prime editing. Nucleic Acids Res. 49, W499–W504 (2021).

  4. Kim, H. K. et al. Predicting the efficiency of prime editing guide RNAs in human cells. Nat. Biotechnol. 39, 198–206 (2021).

  5. Li, Y., Chen, J., Tsai, S. Q. & Cheng, Y. Easy-Prime: a machine learning–based prime editor design tool. Genome Biol. 22, 235 (2021).

  6. Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).

  7. Nielsen, S., Yuzenkova, Y. & Zenkin, N. Mechanism of eukaryotic RNA polymerase III transcription termination. Science 340, 1577–1580 (2013).

  8. Gao, Z., Herrera-Carrillo, E. & Berkhout, B. Delineation of the exact transcription termination signal for type 3 polymerase III. Mol. Ther. Nucleic Acids 10, 36–44 (2018).

  9. Bill, C. A., Duran, W. A., Miselis, N. R. & Nickoloff, J. A. Efficient repair of all types of single-base mismatches in recombination intermediates in Chinese hamster ovary cells: competition between long-patch and G-T glycosylase-mediated repair of G-T mismatches. Genetics 149, 1935–1943 (1998).

  10. Walton, R. T., Christie, K. A., Whittaker, M. N. & Kleinstiver, B. P. Unconstrained genome targeting with near-PAMless engineered CRISPR-Cas9 variants. Science 368, 290–296 (2020).

  11. Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems (eds von Luxburg, U. et al.) 4768–4777 (Curran Associates Inc., 2017).

  12. Kim, H. K. et al. SpCas9 activity prediction by DeepSpCas9, a deep learning–based model with high generalization performance. Sci. Adv. 5, eaax9249 (2019).

  13. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proc. 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 3319–3328 (PMLR, 2017).

  14. Doench, J. G. et al. Rational design of highly active sgRNAs for CRISPR-Cas9–mediated gene inactivation. Nat. Biotechnol. 32, 1262–1267 (2014).

  15. Nelson, J. W. et al. Engineered pegRNAs improve prime editing efficiency. Nat. Biotechnol. 40, 402–410 (2022).

  16. Chen, P. J. et al. Enhanced prime editing systems by manipulating cellular determinants of editing outcomes. Cell 184, 5635–5652.e29 (2021).

  17. Nair, N. et al. Computationally designed liver-specific transcriptional modules and hyperactive factor IX improve hepatic gene therapy. Blood 123, 3195–3199 (2014).

  18. Untergasser, A. et al. Primer3—new capabilities and interfaces. Nucleic Acids Res. 40, e115 (2012).

  19. Villiger, L. et al. Treatment of a metabolic liver disease by in vivo genome base editing in adult mice. Nat. Med. 24, 1519–1525 (2018).

  20. Kim, H. K. et al. In vivo high-throughput profiling of CRISPR-Cpf1 activity. Nat. Methods 14, 153–159 (2017).

  21. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).

  22. Kim, N. et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nat. Biotechnol. 38, 1328–1336 (2020).

  23. Dang, Y. et al. Optimizing sgRNA structure to improve CRISPR-Cas9 knockout efficiency. Genome Biol. 16, 280 (2015).

  24. Böck, D. et al. In vivo prime editing of a metabolic liver disease in mice. Sci. Transl. Med. 14, eabl9238 (2022).

  25. Jensen, K. T. et al. Chromatin accessibility and guide sequence secondary structure affect CRISPR-Cas9 gene editing efficiency. FEBS Lett. 591, 1892–1901 (2017).

  26. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10 (2011).

  27. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).

  28. Lorenz, R. et al. ViennaRNA package 2.0. Algorithms Mol. Biol. 6, 26 (2011).

  29. Clement, K. et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat. Biotechnol. 37, 224–226 (2019).

  30. Schep, R. et al. Impact of chromatin context on Cas9-induced DNA double-strand break repair pathway balance. Mol. Cell 81, 2216–2230.e10 (2021).

  31. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2012).

  32. Luo, Y. et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 48, D882–D889 (2020).

  33. Karabacak Calviello, A., Hirsekorn, A., Wurmus, R., Yusuf, D. & Ohler, U. Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling. Genome Biol. 20, 42 (2019).

  34. Lamb, K. N. et al. Discovery and characterization of a cellular potent positive allosteric modulator of the polycomb repressive complex 1 chromodomain, CBX7. Cell Chem. Biol. 26, 1365–1379.e22 (2019).

  35. Hattori, T. et al. Antigen clasping by two antigen-binding sites of an exceptionally specific antibody for histone methylation. Proc. Natl Acad. Sci. USA 113, 2092–2097 (2016).

  36. Lee, B. T. et al. The UCSC Genome Browser database: 2022 update. Nucleic Acids Res. 50, D1115–D1122 (2022).

  37. Zerbino, D. R., Johnson, N., Juettemann, T., Wilder, S. P. & Flicek, P. WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis. Bioinformatics 30, 1008–1009 (2014).

  38. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  39. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785–794 (ACM, 2016).

  40. Marquart, K. F. et al. Predicting base editing outcomes with an attention-based deep learning algorithm trained on high-throughput target library screens. Nat. Commun. 12, 1–25 (2020).

  41. Paszke, A. et al. Automatic differentiation in PyTorch. In Proc. 31st Annual Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems 2017 (NIPS, 2017).

  42. Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Moschitti, A. et al.) 1724–1734 (Association for Computational Linguistics, 2014).

  43. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/abs/1412.3555 (2014).

  44. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  45. Bengio, Y., Simard, P. & Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 157–166 (1994).

  46. Graves, A. Supervised Sequence Labelling with Recurrent Neural Networks 385 (Springer, 2012).

  47. Luong, T., Pham, H. & Manning, C. D. Effective approaches to attention-based neural machine translation. In Proc. 2015 Conference on Empirical Methods in Natural Language Processing (eds Màrquez, L. et al.) 1412–1421 (Association for Computational Linguistics, 2015).

  48. Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (eds von Luxburg, U. et al.) 6000–6010 (Curran Associates Inc., 2017).

  49. Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).

  50. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).

  51. Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).

  52. Eggington, J. M., Greene, T. & Bass, B. L. Predicting sites of ADAR editing in double-stranded RNA. Nat. Commun. 2, 319 (2011).

Acknowledgements

We thank the Functional Genomics Center Zurich for their help and support in next-generation sequencing; the Flow Cytometry Facility of the University of Zurich and especially M. Wickert for performing liver hepatocyte sorting experiments; the Science IT team at the University of Zurich for providing infrastructure used for data analysis and especially P. Shemella for helpful discussions about code performance optimizations; R. Schep for discussions about chromatin marks; G. Affentranger for support in the design of figures; and the members of the Schwank laboratory for fruitful discussions. This work was supported by the SNF (grant nos. 310030_185293 and 201184) and the University Research Priority Programs ‘Human Reproduction Reloaded’ and ‘ITINERARE’ of the University of Zurich. K.F.M. holds a PHRT iDoc Fellowship (PHRT_324).

Author information

Contributions

N.M. designed the study, performed experiments and analyzed data. A.A. designed and generated attention-based bidirectional RNNs (PRIDICT) and implemented feature extraction strategies. A.A. and N.M. built linear regression and tree-based machine learning models and performed feature extraction analysis. L.K. performed in vivo experiments. K.F.M. and C.S. contributed to arrayed validation experiments. L.S. performed pegRNA and AdV cloning experiments. Z.B. performed the analysis of chromatin characteristics of endogenous loci. N.M., A.A. and G.S. wrote the manuscript. M.K. and G.S. designed and supervised the research. All authors revised the manuscript.

Corresponding authors

Correspondence to Michael Krauthammer or Gerald Schwank.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Sangsu Bae and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Self-targeting screen characteristics.

a, Visualization of library design (library 1) and numbers before and after filtering results. b, Distribution of edit positions for single base replacement edits in library 1. c, Distribution of edit positions for insertion edits in library 1. d, Distribution of edit positions for deletion edits in library 1. e, Distribution of insertion lengths in library 1. f, Distribution of deletion lengths in library 1. g, Distribution of edit types in library 1 (number of design variants and percentage of the total library). h,i, Editing rates of a test self-targeting locus with a forward (Fw) or reverse (Rv) orientation of the target sequence, either at the plasmid level or integrated by lentiviral transduction in HEK293T cells. Data points for bars (from left) 2, 3 and 5, 6 correspond to two technical replicates (simultaneous transfection of two separate wells). Only one data point was used for the plasmid controls (bars 1 and 4). h, pegRNA with TAG to TGG edit. i, pegRNA with TAG to TAC edit. The observed editing in the forward direction in the absence of PE2 could be caused by lentiviral reshuffling or ADAR-mediated A to I (G) RNA editing. The latter could occur during lentiviral packaging in HEK293T cells: HEK293T cells endogenously express ADAR and the target site is present as RNA on the lentiviral vector and targeted by the complementary pegRNA with a mismatch, providing an ideal template for ADAR-dependent RNA editing. The observation that primarily TAG to TGG (but not TAG to TAC) showed background editing is in line with this hypothesis, as previous studies showed ADAR preference for UAG sequences (ref. 52).

Extended Data Fig. 2 Additional validation of the DeepPE model.

a, Predicted (PRIDICT) and measured intended editing efficiency for G-to-C edits at position 5 of the RTT in the dataset from this study. Data from all five test sets (fivefold cross-validation) were combined for this visualization. n = 540. b, Evaluation of attention-based bidirectional RNN (PRIDICT-AttnBiRNN; trained on the dataset from this study) by testing on pegRNAs from the Kim et al. 2021 HT dataset (only G-to-C edits at position 5). n = 4,457. c, Evaluation of DeepPE model (original, trained on the Kim et al. 2021 HT dataset) by testing on the dataset from this study (only G-to-C edits at position 5). n = 540. d,e, SHAP analysis of XGBoost models trained and tested on the DeepPE dataset (n = 43,149) (d) or on G-to-C position 5 edits from library 1 (e). Feature descriptions are listed in Supplementary Table 1. f, Editing efficiency with different RTT overhang lengths (5, 7, 10, 15 bp) in the DeepPE (Kim et al.) dataset. n for each bar (left to right) = 10,746, 10,828, 10,921, 10,654. Error bars = mean ±s.d. g, Editing efficiency with different RTT overhang lengths (3, 7, 10, 15 bp) in G-to-C position 5 edits of library 1 for a direct comparison to identical edits in the DeepPE dataset. n for each bar (left to right) = 135, 135, 137, 133. Error bars = mean ±s.d. h,i, Evaluation of DeepPE model (n = 18) on 18/45 endogenous edits from this study in HEK293T (h) and K562 (i).
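Panel a pools predictions from the five held-out test sets of a fivefold cross-validation. The sketch below shows one way such pooling can work, so that every sample is predicted by a model that never saw it during training. The helper names, the random split and the placeholder mean predictor are assumptions for illustration; the source does not state how the folds were drawn, and the real study may, for example, group pegRNAs by target locus.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def pooled_out_of_fold(n_samples, fit, predict, n_folds=5, seed=0):
    """Return one prediction per sample, each made by a model trained
    on the other folds only."""
    pooled = [None] * n_samples
    for held_out in kfold_indices(n_samples, n_folds, seed):
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        model = fit(train)  # train on the remaining folds
        for i in held_out:
            pooled[i] = predict(model, i)  # predict the held-out fold
    return pooled


# Placeholder "model": predicts the mean target of its training samples.
targets = [float(v) for v in range(10)]  # made-up values


def fit_mean(train_idx):
    return sum(targets[i] for i in train_idx) / len(train_idx)


def predict_mean(model, sample_idx):
    return model


pooled = pooled_out_of_fold(len(targets), fit_mean, predict_mean)
```

The pooled list can then be compared against the measured values in a single scatter plot or correlation, which is what combining the five test sets amounts to.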

Extended Data Fig. 3 Additional validation of the Easy-Prime PE2 model.

a, Edit type count distribution in the original Easy-Prime test dataset. b, Evaluation of Easy-Prime PE2 model by testing this XGBoost model on the original Easy-Prime test dataset (ref. 5), filtered against 1 bp edits at position 5 of the RTT to eliminate the bias towards this edit type. n = 585. c–g, Evaluation of Easy-Prime PE2 by testing the model on datasets generated in this study. c, Library 1 in HEK293T, n = 92,423. d, Library 2 (editing with PE2 and pegRNAs without tevopreQ1) in HEK293T, n = 915. e, Library 2 (editing with PE2 and pegRNAs without tevopreQ1) in K562, n = 876. f,g, Endogenous loci from Fig. 4a,b in HEK293T (f) and K562 (g), n = 45. h, Intended editing efficiency rank of the best-predicted pegRNA for each pathogenic locus in library 1 (PRIDICT and Easy-Prime). Pathogenic loci with multiple pegRNAs on rank 1 (identical efficiency) and loci with fewer than three pegRNAs were excluded from this analysis. Predictions from PRIDICT were taken from five different cross-validations to ensure none of the predictions are included in the training set. n = 12,189. i, Intended editing efficiency rank of the best-predicted pegRNA for each endogenous locus (PRIDICT and Easy-Prime). n = 15.
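The rank analysis in panels h and i can be sketched as follows: for each locus, take the pegRNA with the highest predicted score and report where it falls when the pegRNAs are ordered by measured efficiency. This is an illustrative reading of the caption, not the authors' code; the function name and input layout are hypothetical, and the handling of ties below rank 1 is an assumption.

```python
def rank_of_best_predicted(pegrnas):
    """pegrnas: list of (predicted_score, measured_efficiency) for one locus.

    Returns the measured-efficiency rank (1 = most efficient) of the pegRNA
    with the highest predicted score, or None if the locus is excluded.
    """
    if len(pegrnas) < 3:
        return None  # caption: loci with fewer than three pegRNAs excluded
    measured = [m for _, m in pegrnas]
    if measured.count(max(measured)) > 1:
        return None  # caption: several pegRNAs share rank 1
    best_idx = max(range(len(pegrnas)), key=lambda i: pegrnas[i][0])
    # Assumed tie handling: rank = 1 + number of strictly better pegRNAs.
    return 1 + sum(m > measured[best_idx] for m in measured)


# Made-up loci for illustration only.
loci = {
    "locus_A": [(0.9, 30.0), (0.5, 10.0), (0.1, 20.0)],  # top pick is best
    "locus_B": [(0.9, 10.0), (0.5, 30.0), (0.1, 20.0)],  # top pick is worst
    "locus_C": [(0.9, 10.0), (0.5, 30.0)],               # too few pegRNAs
    "locus_D": [(0.9, 30.0), (0.5, 30.0), (0.1, 5.0)],   # tie at rank 1
}
ranks = {name: rank_of_best_predicted(p) for name, p in loci.items()}
```

A rank of 1 means the model's top choice was also the experimentally best pegRNA for that locus, so the distribution of these ranks summarizes how useful the predictions are for picking a single design.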

Extended Data Fig. 4 Additional library 2 evaluation with PEmax.

a, Mean editing efficiencies of each replicate, including all pegRNAs in library 2 with different experimental conditions in U2OS and K562 cells. Error bars indicate the mean ±s.d. of three biologically independent replicates. n = 3. Mean editing of library 2 for each of the three replicates is based on the following number of pegRNAs for each data point (bars left to right) = 916, 922, 917, 924, 879, 869, 877, 866. Note that absolute levels of editing efficiency for PEmax cannot be directly compared to PE2 in this study due to the use of different selection agents (Blasticidin for PEmax screens compared to Zeocin for PE2 screens). Previous studies showed that in identical setups, PEmax surpasses the performance of PE2 (ref. 16). b, Spearman correlation for PEmax editing efficiencies in library 2 between different experimental conditions (MLH1dn, tevopreQ1) and cell lines (K562, U2OS). c, Editing efficiency rank correlations (Spearman) in library 2 between editing performed with PE2 versus editing performed with PEmax.
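The summary in panel a can be read as a two-step calculation: average the editing efficiency over all pegRNAs within each replicate, then take the mean and s.d. of the three replicate means. The sketch below follows that reading with made-up numbers. The use of the sample standard deviation (`statistics.stdev`) is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, stdev


def replicate_summary(replicates):
    """replicates: one list of per-pegRNA editing efficiencies per replicate.

    Returns (mean of replicate means, s.d. of replicate means, replicate means).
    """
    replicate_means = [mean(r) for r in replicates]
    # Assumption: sample s.d. (n - 1 in the denominator) across replicates.
    return mean(replicate_means), stdev(replicate_means), replicate_means


# Made-up efficiencies (%) for three replicates of one condition.
replicates = [
    [10.0, 20.0, 30.0],
    [12.0, 22.0, 32.0],
    [14.0, 24.0, 34.0],
]
overall_mean, sd, per_replicate = replicate_summary(replicates)
```

With n = 3 the error bar reflects replicate-to-replicate variation only, not the spread of efficiencies across the pegRNAs within a replicate.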

Supplementary information

Supplementary Information

Supplementary Notes 1–3, Tables 1–3, Figs. 1–19 and Methods 1.

Reporting Summary

Supplementary Table 4

List of library sequences used in this study.

Supplementary Table 5

List of oligo sequences used for cloning and PCR.

Supplementary Table 6

Tables containing information about pegRNAs used in library 1, library 2 and endogenous editing experiments and their associated editing efficiencies.

Supplementary Table 7

Table containing Spearman correlations between every pair of features listed in Supplementary Table 1.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mathis, N., Allam, A., Kissling, L. et al. Predicting prime editing efficiency and product purity by deep learning. Nat Biotechnol 41, 1151–1159 (2023). https://doi.org/10.1038/s41587-022-01613-7

  • DOI: https://doi.org/10.1038/s41587-022-01613-7
