Predicting prime editing efficiency and product purity by deep learning

Mathis, Nicolas; Allam, Ahmed; Kissling, Lucas; Marquart, Kim Fabiano; Schmidheini, Lukas; Solari, Cristina; Balázs, Zsolt; Krauthammer, Michael; Schwank, Gerald

doi:10.1038/s41587-022-01613-7

Article
Published: 16 January 2023

Nature Biotechnology volume 41, pages 1151–1159 (2023)

Abstract

Prime editing is a versatile genome editing tool but requires experimental optimization of the prime editing guide RNA (pegRNA) to achieve high editing efficiency. Here we conducted a high-throughput screen to analyze prime editing outcomes of 92,423 pegRNAs on a highly diverse set of 13,349 human pathogenic mutations that include base substitutions, insertions and deletions. Based on this dataset, we identified sequence context features that influence prime editing and trained PRIDICT (prime editing guide prediction), an attention-based bidirectional recurrent neural network. PRIDICT reliably predicts editing rates for all small-sized genetic changes with a Spearman’s R of 0.85 and 0.78 for intended and unintended edits, respectively. We validated PRIDICT on endogenous editing sites as well as an external dataset and showed that pegRNAs with high (>70) versus low (<70) PRIDICT scores showed substantially increased prime editing efficiencies in different cell types in vitro (12-fold) and in hepatocytes in vivo (tenfold), highlighting the value of PRIDICT for basic and for translational research applications.
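The two summary statistics quoted in the abstract, a Spearman rank correlation between predicted and measured editing and a fold difference between pegRNAs above and below a score of 70, can be illustrated with a minimal sketch. The function names and the numbers below are hypothetical and chosen only for illustration; this is not the code or data from the study, and the published analysis may differ in detail (for example in how ties or scores of exactly 70 are handled).

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman's R computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def fold_change_by_score(scores, measured, threshold=70):
    """Mean measured efficiency for scores > threshold over that for scores < threshold."""
    high = [m for s, m in zip(scores, measured) if s > threshold]
    low = [m for s, m in zip(scores, measured) if s < threshold]
    return mean(high) / mean(low)

# Made-up illustrative values (assumption), not measurements from the paper.
predicted = [82.0, 75.5, 40.0, 12.0, 66.0, 91.0]
measured = [60.0, 48.0, 10.0, 2.0, 6.0, 55.0]

rho = spearman_r(predicted, measured)
fold = fold_change_by_score(predicted, measured)
print(f"Spearman's R = {rho:.3f}, high/low fold change = {fold:.2f}")
```

Because Spearman's R depends only on ranks, it rewards a model for ordering pegRNAs correctly even when the absolute predicted values are off, which is the property that matters when choosing the best pegRNA for a given edit.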

**Fig. 1: High-throughput screen for determinants of prime editing efficiency.**

**Fig. 2: Prediction of pegRNA editing rates by an attention-based bidirectional RNN.**

**Fig. 3: Feature importance overview for editing prediction.**

**Fig. 4: Validation of PRIDICT on endogenous loci and external datasets.**

**Fig. 5: Evaluation of MLH1dn and tevopreQ1 effect on PE2 editing efficiency and PRIDICT performance in library 2.**

Data availability

Measured editing rates used for analysis and figures in this study are provided as Supplementary Tables and on GitHub (https://github.com/uzh-dqbm-cmi/PRIDICT). DNA-sequencing data is available via the National Center for Biotechnology Information Sequence Read Archive (PRJNA825584). Target sequences of pathogenic mutations were based on the ClinVar database (accessed December 2019), and corresponding genomic sequences (flanking the edit) were acquired via UCSC Genome Browser (Table Browser, hg38). Plasmid encoding for pCMV-PE2-tagRFP-BleoR is available from Addgene (no. 192508).

Code availability

Custom Python code used in this study is provided on GitHub (https://github.com/uzh-dqbm-cmi/PRIDICT). Additional information on the PRIDICT algorithm can be found in Supplementary Methods 1.

References

1. Anzalone, A. V. et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576, 149–157 (2019).
2. Hsu, J. Y. et al. PrimeDesign software for rapid and simplified design of prime editing guide RNAs. Nat. Commun. 12, 1034 (2021).
3. Hwang, G.-H. et al. PE-Designer and PE-Analyzer: web-based design and analysis tools for CRISPR prime editing. Nucleic Acids Res. 49, W499–W504 (2021).
4. Kim, H. K. et al. Predicting the efficiency of prime editing guide RNAs in human cells. Nat. Biotechnol. 39, 198–206 (2021).
5. Li, Y., Chen, J., Tsai, S. Q. & Cheng, Y. Easy-Prime: a machine learning–based prime editor design tool. Genome Biol. 22, 235 (2021).
6. Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
7. Nielsen, S., Yuzenkova, Y. & Zenkin, N. Mechanism of eukaryotic RNA polymerase III transcription termination. Science 340, 1577–1580 (2013).
8. Gao, Z., Herrera-Carrillo, E. & Berkhout, B. Delineation of the exact transcription termination signal for type 3 polymerase III. Mol. Ther. Nucleic Acids 10, 36–44 (2018).
9. Bill, C. A., Duran, W. A., Miselis, N. R. & Nickoloff, J. A. Efficient repair of all types of single-base mismatches in recombination intermediates in Chinese hamster ovary cells: competition between long-patch and G-T glycosylase-mediated repair of G-T mismatches. Genetics 149, 1935–1943 (1998).
10. Walton, R. T., Christie, K. A., Whittaker, M. N. & Kleinstiver, B. P. Unconstrained genome targeting with near-PAMless engineered CRISPR-Cas9 variants. Science 368, 290–296 (2020).
11. Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems (eds von Luxburg, U. et al.) 4768–4777 (Curran Associates Inc., 2017).
12. Kim, H. K. et al. SpCas9 activity prediction by DeepSpCas9, a deep learning–based model with high generalization performance. Sci. Adv. 5, eaax9249 (2019).
13. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proc. 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 3319–3328 (PMLR, 2017).
14. Doench, J. G. et al. Rational design of highly active sgRNAs for CRISPR-Cas9–mediated gene inactivation. Nat. Biotechnol. 32, 1262–1267 (2014).
15. Nelson, J. W. et al. Engineered pegRNAs improve prime editing efficiency. Nat. Biotechnol. 40, 402–410 (2022).
16. Chen, P. J. et al. Enhanced prime editing systems by manipulating cellular determinants of editing outcomes. Cell 184, 5635–5652.e29 (2021).
17. Nair, N. et al. Computationally designed liver-specific transcriptional modules and hyperactive factor IX improve hepatic gene therapy. Blood 123, 3195–3199 (2014).
18. Untergasser, A. et al. Primer3—new capabilities and interfaces. Nucleic Acids Res. 40, e115 (2012).
19. Villiger, L. et al. Treatment of a metabolic liver disease by in vivo genome base editing in adult mice. Nat. Med. 24, 1519–1525 (2018).
20. Kim, H. K. et al. In vivo high-throughput profiling of CRISPR-Cpf1 activity. Nat. Methods 14, 153–159 (2017).
21. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
22. Kim, N. et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nat. Biotechnol. 38, 1328–1336 (2020).
23. Dang, Y. et al. Optimizing sgRNA structure to improve CRISPR-Cas9 knockout efficiency. Genome Biol. 16, 280 (2015).
24. Böck, D. et al. In vivo prime editing of a metabolic liver disease in mice. Sci. Transl. Med. 14, eabl9238 (2022).
25. Jensen, K. T. et al. Chromatin accessibility and guide sequence secondary structure affect CRISPR-Cas9 gene editing efficiency. FEBS Lett. 591, 1892–1901 (2017).
26. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10 (2011).
27. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).
28. Lorenz, R. et al. ViennaRNA package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
29. Clement, K. et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat. Biotechnol. 37, 224–226 (2019).
30. Schep, R. et al. Impact of chromatin context on Cas9-induced DNA double-strand break repair pathway balance. Mol. Cell 81, 2216–2230.e10 (2021).
31. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2012).
32. Luo, Y. et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 48, D882–D889 (2020).
33. Karabacak Calviello, A., Hirsekorn, A., Wurmus, R., Yusuf, D. & Ohler, U. Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling. Genome Biol. 20, 42 (2019).
34. Lamb, K. N. et al. Discovery and characterization of a cellular potent positive allosteric modulator of the polycomb repressive complex 1 chromodomain, CBX7. Cell Chem. Biol. 26, 1365–1379.e22 (2019).
35. Hattori, T. et al. Antigen clasping by two antigen-binding sites of an exceptionally specific antibody for histone methylation. Proc. Natl Acad. Sci. USA 113, 2092–2097 (2016).
36. Lee, B. T. et al. The UCSC Genome Browser database: 2022 update. Nucleic Acids Res. 50, D1115–D1122 (2022).
37. Zerbino, D. R., Johnson, N., Juettemann, T., Wilder, S. P. & Flicek, P. WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis. Bioinformatics 30, 1008–1009 (2014).
38. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
39. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785–794 (ACM, 2016).
40. Marquart, K. F. et al. Predicting base editing outcomes with an attention-based deep learning algorithm trained on high-throughput target library screens. Nat. Commun. 12, 1–25 (2020).
41. Paszke, A. et al. Automatic differentiation in pytorch. In Proc. 31st Annual Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems 2017 (NIPS, 2017).
42. Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Moschitti, A. et al.) 1724–1734 (Association for Computational Linguistics, 2014).
43. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/abs/1412.3555 (2014).
44. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
45. Bengio, Y., Simard, P. & Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 157–166 (1994).
46. Graves, A. Supervised Sequence Labelling with Recurrent Neural Networks 385 (Springer, 2012).
47. Luong, T., Pham, H. & Manning, C. D. Effective approaches to attention-based neural machine translation. In Proc. 2015 Conference on Empirical Methods in Natural Language Processing (eds Màrquez, L. et al.) 1412–1421 (Association for Computational Linguistics, 2015).
48. Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (eds von Luxburg, U. et al.) 6000–6010 (Curran Associates Inc., 2017).
49. Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).
50. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).
51. Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).
52. Eggington, J. M., Greene, T. & Bass, B. L. Predicting sites of ADAR editing in double-stranded RNA. Nat. Commun. 2, 319 (2011).

Acknowledgements

We thank the Functional Genomics Center Zurich for their help and support in next-generation sequencing; the Flow Cytometry Facility of the University of Zurich and especially M. Wickert for performing liver hepatocyte sorting experiments; the Science IT team at the University of Zurich for providing infrastructure used for data analysis and especially P. Shemella for helpful discussions about code performance optimizations; R. Schep for discussions about chromatin marks; G. Affentranger for support in the design of figures; the members of the Schwank laboratory for fruitful discussions. This work was supported by the SNF (grant nos. 310030_185293 and 201184), the University Research Priority Program ‘Human Reproduction Reloaded’ and ‘ITINERARE’ of the University of Zurich. K.F.M. holds a PHRT iDoc Fellowship (PHRT_324).

Author information

These authors contributed equally: Nicolas Mathis, Ahmed Allam.

Authors and Affiliations

Institute of Pharmacology and Toxicology, University of Zurich, Zurich, Switzerland
Nicolas Mathis, Lucas Kissling, Kim Fabiano Marquart, Lukas Schmidheini, Cristina Solari & Gerald Schwank
Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
Ahmed Allam, Zsolt Balázs & Michael Krauthammer
Institute of Molecular Health Sciences, ETH Zurich, Zurich, Switzerland
Kim Fabiano Marquart & Lukas Schmidheini

Contributions

N.M. designed the study, performed experiments and analyzed data. A.A. designed and generated attention-based bidirectional RNNs (PRIDICT) and implemented feature extraction strategies. A.A. and N.M. built linear regression and tree-based machine learning models and performed feature extraction analysis. L.K. performed in vivo experiments. K.F.M. and C.S. contributed to arrayed validation experiments. L.S. performed pegRNA and AdV cloning experiments. Z.B. performed the analysis of chromatin characteristics of endogenous loci. N.M., A.A. and G.S. wrote the manuscript. M.K. and G.S. designed and supervised the research. All authors revised the manuscript.

Corresponding authors

Correspondence to Michael Krauthammer or Gerald Schwank.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Sangsu Bae and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Self-targeting screen characteristics.

a, Visualization of library design (library 1) and numbers before and after filtering results. b, Distribution of edit positions for single base replacement edits in library 1. c, Distribution of edit positions for insertion edits in library 1. d, Distribution of edit positions for deletion edits in library 1. e, Distribution of insertion lengths in library 1. f, Distribution of deletion lengths in library 1. g, Distribution of edit types in library 1 (number of design variants and percentage of the total library). h,i, Editing rates of a test self-targeting locus with a forward (Fw) or reverse (Rv) orientation of the target sequence, either on plasmid level or integrated by lentiviral transduction in HEK293T cells. Data points for bars (from left) 2, 3 and 5, 6 correspond to two technical replicates (simultaneous transfection of two separate wells). Only one data point was used for the plasmid controls (bars 1 and 4). h, pegRNA with TAG to TGG edit. i, pegRNA with TAG to TAC edit. The observed editing in the forward direction in the absence of PE2 could be caused by lentiviral reshuffling or ADAR-mediated A to I (G) RNA editing. The latter could occur during lentiviral packaging in HEK293T cells: HEK293T cells endogenously express ADAR and the target site is present as RNA on the lentiviral vector and targeted by the complementary pegRNA with a mismatch, providing an ideal template for ADAR-dependent RNA editing. The observation that primarily TAG to TGG (but not TAG to TAC) showed background editing is in line with this hypothesis, as previous studies showed ADAR preference for UAG sequences⁵².

Extended Data Fig. 2 Additional validation of the DeepPE model.

a, Predicted (PRIDICT) and measured intended editing efficiency for G-to-C edits at position 5 of RTT in the dataset from this study. Data from all five test sets (fivefold cross-validation) were combined for this visualization. n = 540. b, Evaluation of attention-based bidirectional RNN (PRIDICT-AttnBiRNN; trained on the dataset from this study) by testing on pegRNAs from Kim et al. 2021 HT dataset (only G-to-C at position 5). n = 4,457. c, Evaluation of DeepPE model (original, trained on Kim et al. 2021 HT dataset) by testing on the dataset from this study (only G-to-C at position 5). n = 540. d,e, SHAP analysis of XGBoost models trained and tested on DeepPE dataset (n = 43,149) (d) or on G-to-C position 5 edits from library 1 (e). Feature descriptions are listed in Supplementary Table 1. f, Editing efficiency with different RTT overhang lengths (5, 7, 10, 15 bp) in DeepPE (Kim et al.) dataset. n for each bar (left to right) = 10,746, 10,828, 10,921, 10,654. g, Editing efficiency with different RTT overhang lengths (3, 7, 10, 15 bp) in G-to-C position 5 edits of library 1 for a direct comparison to identical edits in the DeepPE dataset. n for each bar (left to right) = 135, 135, 137, 133. f,g, Error bars = mean ± s.d. h,i, Evaluation of DeepPE model (n = 18) on 18/45 endogenous edits from this study in HEK293T (h) and K562 (i).

Extended Data Fig. 3 Additional validation of the Easy-Prime PE2 model.

a, Edit type count distribution in the original Easy-Prime test dataset. b, Evaluation of Easy-Prime PE2 model by testing this XGBoost model on the original Easy-Prime test dataset⁵, filtered against 1 bp edits at position 5 of the RTT to eliminate the bias towards this edit type. n = 585. c–g, Evaluation of Easy-Prime PE2 by testing the model on datasets generated in this study. c, Library 1 in HEK293T, n = 92,423. d, Library 2 (editing with PE2 and pegRNAs without tevopreQ1) in HEK293T, n = 915. e, Library 2 (editing with PE2 and pegRNAs without tevopreQ1) in K562, n = 876. f,g, Endogenous loci from Fig. 4a,b in HEK293T (f) and K562 (g), n = 45. h, Intended editing efficiency rank of the best-predicted pegRNA for each pathogenic locus in library 1 (PRIDICT and Easy-Prime). Pathogenic loci with multiple pegRNAs on rank 1 (identical efficiency) and loci with fewer than three pegRNAs were excluded from this analysis. Predictions from PRIDICT were taken from five different cross-validations to ensure that none of the predictions are included in the training set. n = 12,189. i, Intended editing efficiency rank of the best-predicted pegRNA for each endogenous locus (PRIDICT and Easy-Prime). n = 15.
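The per-locus rank analysis described for panels h and i can be sketched as follows. This is a hypothetical illustration under stated assumptions, not the authors' code: the function name, the record layout (locus, predicted score, measured efficiency) and the example values are invented, and ties in the predicted score are resolved here by taking the first pegRNA encountered, which the caption does not specify.

```python
from collections import defaultdict

def best_predicted_rank(records, min_pegrnas=3):
    """Measured-efficiency rank of the top-predicted pegRNA for each locus.

    records: iterable of (locus, predicted, measured) tuples, one per pegRNA.
    Returns {locus: rank}; rank 1 means the pegRNA with the highest predicted
    score was also the most efficient pegRNA measured at that locus.
    """
    by_locus = defaultdict(list)
    for locus, predicted, measured in records:
        by_locus[locus].append((predicted, measured))

    ranks = {}
    for locus, pegs in by_locus.items():
        # Exclusion 1: loci with fewer than three pegRNAs.
        if len(pegs) < min_pegrnas:
            continue
        # Exclusion 2: loci where several pegRNAs share the top measured efficiency.
        top_measured = max(m for _, m in pegs)
        if sum(1 for _, m in pegs if m == top_measured) > 1:
            continue
        # Pick the pegRNA with the highest predicted score (first one on ties).
        _, chosen_measured = max(pegs, key=lambda p: p[0])
        # Rank = 1 + number of pegRNAs that were measured as more efficient.
        ranks[locus] = 1 + sum(1 for _, m in pegs if m > chosen_measured)
    return ranks

# Made-up records (assumption): (locus, predicted score, measured efficiency in %).
records = [
    ("locusA", 80, 50), ("locusA", 60, 30), ("locusA", 20, 5),
    ("locusB", 70, 10), ("locusB", 50, 40), ("locusB", 30, 25),
    ("locusC", 90, 12), ("locusC", 10, 3),                       # only two pegRNAs
    ("locusD", 55, 20), ("locusD", 45, 20), ("locusD", 35, 1),   # tie on rank 1
]

ranks = best_predicted_rank(records)
print(ranks)
```

A distribution of these ranks concentrated at 1 indicates that picking the top-scored pegRNA usually selects the best-performing design for a locus.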

Extended Data Fig. 4 Additional library 2 evaluation with PEmax.

a, Mean editing efficiencies of each replicate, including all pegRNAs in library 2 with different experimental conditions in U2OS and K562 cells. Error bars indicate the mean ±s.d. of three biologically independent replicates. n = 3. Mean editing of library 2 for each of the three replicates is based on the following number of pegRNAs for each data point (bars left to right) = 916, 922, 917, 924, 879, 869, 877, 866. Note that absolute levels of editing efficiency for PEmax cannot be directly compared to PE2 in this study due to the use of different selection agents (Blasticidin for PEmax screens compared to Zeocin for PE2 screens). Previous studies showed that in identical setups, PEmax surpasses the performance of PE2¹⁶. b, Spearman correlation for PEmax editing efficiencies in library 2 between different experimental conditions (MLH1dn, tevopreQ1) and cell lines (K562, U2OS). c, Editing efficiency rank correlations (Spearman) in library 2 between editing performed with PE2 versus editing performed with PEmax.

Supplementary information

Supplementary Information

Supplementary Notes 1–3, Tables 1–3, Figs. 1–19 and Methods 1.

Reporting Summary

Supplementary Table 4

List of library sequences used in this study.

Supplementary Table 5

List of oligo sequences used for cloning and PCR.

Supplementary Table 6

Tables containing information about pegRNAs used in library 1, library 2 and endogenous editing experiments and their associated editing efficiencies.

Supplementary Table 7

Table containing Spearman correlations between every pair of features listed in Supplementary Table 1.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mathis, N., Allam, A., Kissling, L. et al. Predicting prime editing efficiency and product purity by deep learning. Nat Biotechnol 41, 1151–1159 (2023). https://doi.org/10.1038/s41587-022-01613-7

Received: 05 April 2022
Accepted: 15 November 2022
Published: 16 January 2023
Issue Date: August 2023
DOI: https://doi.org/10.1038/s41587-022-01613-7
