  • Brief Communication
Machine learning prediction of prime editing efficiency across diverse chromatin contexts

Abstract

The success of prime editing depends on the prime editing guide RNA (pegRNA) design and target locus. Here, we developed machine learning models that reliably predict prime editing efficiency. PRIDICT2.0 assesses the performance of pegRNAs for all edit types up to 15 bp in length in mismatch repair-deficient and mismatch repair-proficient cell lines and in vivo in primary cells. With ePRIDICT, we further developed a model that quantifies how local chromatin environments impact prime editing rates.

Fig. 1: Characterization and prediction of pegRNA efficiencies based on sequence context.
Fig. 2: Validation of PRIDICT2.0 predictions in different contexts and in comparison with existing models.
Fig. 3: Characterization and prediction of prime editing efficiency based on chromatin context.

Data availability

Measured editing rates used for analysis and for creating the figures in this study are provided in Supplementary Tables 2, 7, 8 and 12. DNA sequencing data are available via the NCBI Sequence Read Archive (PRJNA1025026; ref. 51). ENCODE datasets for K562, HEK293T and HepG2 cells are listed in Supplementary Tables 9–11 and are available from encodeproject.org.

Code availability

Scripts used in this study for data analysis or for running the prediction models (PRIDICT2.0, ePRIDICT) offline are provided on GitHub (refs. 33, 34). An online implementation of both models can be accessed via www.pridict.it. Additional information on the PRIDICT2.0 algorithm can be found in the Supplementary Methods.

References

  1. Mathis, N. et al. Predicting prime editing efficiency and product purity by deep learning. Nat. Biotechnol. 41, 1151–1159 (2023).

  2. Kim, H. K. et al. Predicting the efficiency of prime editing guide RNAs in human cells. Nat. Biotechnol. 39, 198–206 (2021).

  3. Koeppel, J. et al. Prediction of prime editing insertion efficiencies using sequence features and DNA repair determinants. Nat. Biotechnol. 41, 1446–1456 (2023).

  4. Yu, G. et al. Prediction of efficiencies for diverse prime editing systems in multiple cell types. Cell 186, 2256–2272 (2023).

  5. Chen, P. J. et al. Enhanced prime editing systems by manipulating cellular determinants of editing outcomes. Cell 184, 5635–5652 (2021).

  6. Ferreira da Silva, J. et al. Prime editing efficiency and fidelity are enhanced in the absence of mismatch repair. Nat. Commun. 13, 760 (2022).

  7. Trojan, J. et al. Functional analysis of hMLH1 variants and HNPCC-related mutations using a human expression system. Gastroenterology 122, 211–219 (2002).

  8. Matheson, E. C. & Hall, A. G. Assessment of mismatch repair function in leukaemic cell lines and blasts from children with acute lymphoblastic leukaemia. Carcinogenesis 24, 31–38 (2003).

  9. Böck, D. et al. In vivo prime editing of a metabolic liver disease in mice. Sci. Transl. Med. 14, eabl9238 (2022).

  10. Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (NIPS 2017) (eds. Guyon, I. et al.) 4766–4775 (Curran Associates, 2017).

  11. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785–794 (ACM, 2016).

  12. Anzalone, A. V. et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576, 149–157 (2019).

  13. Brooks, D. L. et al. Efficient in vivo prime editing corrects the most frequent phenylketonuria variant, associated with high unmet medical need. Am. J. Hum. Genet. 110, 2003–2014 (2023).

  14. Schep, R. et al. Impact of chromatin context on Cas9-induced DNA double-strand break repair pathway balance. Mol. Cell 81, 2216–2230 (2021).

  15. Chen, E. et al. Decorating chromatin for enhanced genome editing using CRISPR−Cas9. Proc. Natl Acad. Sci. USA 119, e2204259119 (2022).

  16. Daer, R. M., Cutts, J. P., Brafman, D. A. & Haynes, K. A. The impact of chromatin dynamics on Cas9-mediated genome editing in human cells. ACS Synth. Biol. 6, 428–438 (2017).

  17. Ding, X. et al. Improving CRISPR−Cas9 genome editing efficiency by fusion with chromatin-modulating peptides. CRISPR J. 2, 51–63 (2019).

  18. Pokusaeva, V. O., Diez, A. R., Espinar, L., Pérez, A. T. & Filion, G. J. Strand asymmetry influences mismatch resolution during single-strand annealing. Genome Biol. 23, 93 (2022).

  19. Akhtar, W. et al. Using TRIP for genome-wide position effect analysis in cultured cells. Nat. Protoc. 9, 1255–1281 (2014).

  20. Luo, Y. et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 48, D882–D889 (2020).

  21. Buenrostro, J., Wu, B., Chang, H. & Greenleaf, W. ATAC-seq: a method for assaying chromatin accessibility genome-wide. Curr. Protoc. Mol. Biol. 109, 21.29.1–21.29.9 (2015).

  22. Wang, Z. et al. Genome-wide mapping of HATs and HDACs reveals distinct functions in active and inactive genes. Cell 138, 1019–1031 (2009).

  23. Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007).

  24. Bonasio, R., Tu, S. & Reinberg, D. Molecular signals of epigenetic states. Science 330, 612–616 (2010).

  25. Peters, A. H. F. M. et al. Partitioning and plasticity of repressive histone methylation states in mammalian chromatin. Mol. Cell 12, 1577–1589 (2003).

  26. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2020).

  27. Bannister, A. J. et al. Spatial distribution of di- and tri-methyl lysine 36 of histone H3 at active genes. J. Biol. Chem. 280, 17732–17736 (2005).

  28. Li, X. et al. Chromatin context-dependent regulation and epigenetic manipulation of prime editing. Preprint at bioRxiv https://doi.org/10.1101/2023.04.12.536587 (2023).

  29. Kim, H. K. et al. SpCas9 activity prediction by DeepSpCas9, a deep learning–based model with high generalization performance. Sci. Adv. 5, eaax9249 (2019).

  30. Park, S.-J. et al. Targeted mutagenesis in mouse cells and embryos using an enhanced prime editor. Genome Biol. 22, 170 (2021).

  31. Liu, N. et al. HDAC inhibitors improve CRISPR/Cas9 mediated prime editing and base editing. Mol. Ther. Nucleic Acids 29, 36–46 (2022).

  32. Cirincione, A. et al. A benchmarked, high-efficiency prime editing platform for multiplexed dropout screening. Preprint at bioRxiv https://doi.org/10.1101/2024.03.25.585978 (2024).

  33. Mathis, N. & Allam, A. GitHub code repository for PRIDICT2.0. GitHub https://github.com/uzh-dqbm-cmi/PRIDICT2 (2024).

  34. Mathis, N. GitHub code repository for ePRIDICT. GitHub https://github.com/Schwank-Lab/epridict (2024).

  35. Arbab, M. et al. Determinants of base editing outcomes from target library analysis and machine learning. Cell 182, 463–480 (2020).

  36. Nelson, J. W. et al. Engineered pegRNAs improve prime editing efficiency. Nat. Biotechnol. 40, 402–410 (2022).

  37. Sanjana, N. E., Shalem, O. & Zhang, F. Improved vectors and genome-wide libraries for CRISPR screening. Nat. Methods 11, 783–784 (2014).

  38. Mátés, L. et al. Molecular evolution of a novel hyperactive Sleeping Beauty transposase enables robust stable gene transfer in vertebrates. Nat. Genet. 41, 753–761 (2009).

  39. Richter, M. F. et al. Phage-assisted evolution of an adenine base editor with improved Cas domain compatibility and activity. Nat. Biotechnol. 38, 883–891 (2020).

  40. Walton, R. T., Christie, K. A., Whittaker, M. N. & Kleinstiver, B. P. Unconstrained genome targeting with near-PAMless engineered CRISPR−Cas9 variants. Science 368, 290–296 (2020).

  41. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10 (2011).

  42. Lorenz, R. et al. ViennaRNA package 2.0. Algorithms Mol. Biol. 6, 26 (2011).

  43. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

  44. Clement, K. et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat. Biotechnol. 37, 224–226 (2019).

  45. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Preprint at https://arxiv.org/abs/1912.01703 (2019).

  46. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  47. Yu, G. GenET: Python package for genome editing research (v. 0.12.0). Python Software Foundation https://pypi.org/project/genet/0.12.0/ (2024).

  48. Weller, J. GitHub code repository for Koeppel et al. 2023 (MinsePIE). GitHub https://github.com/julianeweller/MinsePIE (2023).

  49. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. Zenodo https://zenodo.org/records/8117732 (2015).

  50. Ryan, D. et al. deeptools/pyBigWig: 0.3.22. Zenodo https://doi.org/10.5281/zenodo.7809144 (2023).

  51. Mathis, N. Sequencing data deposition: prime editing efficiency prediction with sequence and chromatin context (PRJNA1025026). Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA1025026 (2024).

Acknowledgements

We thank the Functional Genomics Center Zurich for their support in next-generation sequencing, the Science IT team at the University of Zurich for the computational infrastructure used for data analysis, the ENCODE consortium for providing the datasets used for the analysis of TRIP libraries, C. Leemans for consulting during the TRIP library analysis, G. Affentranger for assistance in figure design and the members of the Schwank laboratory for fruitful discussions. This work was supported by the Swiss National Science Foundation (SNSF) through grants 185293, 214936 and 201184, URPPs (University Research Priority Programs) ‘Human Reproduction Reloaded’ and ‘ITINERARE’, the State Secretariat for Education, Research and Innovation (SERI)-financed European Research Council Consolidator grant ‘GeneREPAIR’, an EMBO Long-Term Fellowship (S.J.) and the ProMedica Stiftung.

Author information

Contributions

N.M. designed the study, performed experiments and analyzed data. A.A. designed and generated attention-based bidirectional RNNs. Linear regression and tree-based machine learning models for Library-Diverse were built by A.A. and models for TRIP predictions were built by N.M. E.B. and S.J. were involved in TRIP library screening. A.T. and T.D. performed editing experiments on endogenous loci. L.K. performed in vivo experiments with help from E.I.I. R.S. and B.v.S. provided the TRIP plasmid library and performed tagmentation experiments of the cell pool. Z.B. contributed to the integration analysis of the TRIP library. L.S. and D.B. helped with NGS and cloning experiments. N.M. and G.S. wrote the manuscript with input from A.A. and R.S. M.K. and G.S. designed and supervised the research. All authors revised the manuscript.

Corresponding author

Correspondence to Gerald Schwank.

Ethics declarations

Competing interests

G.S. is a scientific advisor to Prime Medicine. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Library-Diverse characteristics.

(a) Edit type distribution in Library 1 from Mathis et al. (ref. 1), which had a focus on 1 bp replacements and short insertions and deletions. (b) Edit type distribution in ‘Library-Diverse’ screened in this study. (c) Self-targeting construct with the promoter (hU6) and different pegRNA domains (spacer, scaffold, reverse transcription template/RTT, primer binding sequence/PBS, tevopreQ1 motif (ref. 36), and poly T stop signal), target sequence (including position of the protospacer), and primer location for NGS-PCR (forward (Fw) and reverse (Rv) primer). (d-g) Correlation of background-subtracted individual replicates of ‘Library-Diverse’ prime editing screens in (d) HEK293T (n = 22,619), (e) K562 (n = 22,752), (f) K562 cells with MMR inhibition through MLH1dn expression (n = 20,477), and (g) in vivo (mouse liver hepatocytes; n = 17,775). Color gradient from dark purple to yellow indicates increasing point density, per Gaussian KDE.

Extended Data Fig. 2 Machine learning metrics for training models on ‘Library-Diverse’.

(a-d) Comparison of 7 different machine learning model performances on editing efficiency prediction in HEK293T (Spearman (a), Pearson (b)) and K562 (Spearman (c), Pearson (d)). Bars show the mean of fivefold cross-validation, and each of the five cross-validations is visualized as individual data points (n = 5). Error bar indicates the mean +/− s.d. (e) Prediction Pearson correlations of PRIDICT1 and PRIDICT2.0, tested on different edit types and cell types. n for rows from top to bottom: 5,957, 4,455, 6,283, 5,924, 5,969, 4,508, 6,302, 5,973. (f) Prediction Pearson correlations of PRIDICT1 and PRIDICT2.0, tested on insertions and deletions of different lengths in HEK293T and K562 cells. n for different edit lengths combined are as follows: HEK293T insertions: 6,283, HEK293T deletions: 5,924, K562 insertions: 6,302, K562 deletions: 5,973. (g) Prediction Spearman correlations of PRIDICT1, the updated attention-based bi-directional RNN architecture trained on Library 1 (ref. 1) only (see Fig. 1n), or on Library 1 (ref. 1) and Library-ClinVar (ref. 4), and PRIDICT2.0 (includes fine-tuning on Library-Diverse); tested on different edit and cell types. n for rows from top to bottom: 22,619, 5,957, 4,455, 6,283, 5,924, 22,752, 5,969, 4,508, 6,302, 5,973.
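
The per-fold summary described in this legend (a correlation computed on each of five held-out folds, then reported as mean and s.d.) can be sketched as follows. This is a minimal illustration and not the authors' code: all function names are hypothetical, Spearman is computed as Pearson on average ranks, and the sample standard deviation (n − 1 denominator) is an assumption, since the legend does not state which estimator was used.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def summarize_folds(fold_scores):
    """Mean and sample s.d. (assumed estimator) across cross-validation folds."""
    return mean(fold_scores), stdev(fold_scores)


# Hypothetical per-fold (measured, predicted) efficiencies, not data from the study.
folds = [
    ([5.0, 20.0, 35.0, 60.0], [8.0, 18.0, 40.0, 55.0]),
    ([2.0, 15.0, 30.0, 70.0], [5.0, 12.0, 45.0, 50.0]),
    ([10.0, 25.0, 40.0, 80.0], [20.0, 15.0, 42.0, 75.0]),
    ([1.0, 12.0, 33.0, 50.0], [3.0, 10.0, 30.0, 60.0]),
    ([7.0, 22.0, 45.0, 65.0], [9.0, 30.0, 28.0, 70.0]),
]
spearman_per_fold = [spearman(measured, predicted) for measured, predicted in folds]
fold_mean, fold_sd = summarize_folds(spearman_per_fold)
print(f"Spearman R = {fold_mean:.3f} +/- {fold_sd:.3f} (n = {len(folds)} folds)")
```

Each element of `spearman_per_fold` corresponds to one of the individual data points drawn on top of a bar, and `fold_mean` and `fold_sd` to the bar height and error bar.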

Extended Data Fig. 3 Editing characteristics in K562 with MMR inhibition (MLH1dn) and in vivo (mouse liver).

(a-f) Editing efficiencies of different edit/pegRNA features in K562 with MMR inhibition (MLH1dn; ref. 5). (g-l) Editing efficiencies of different edit/pegRNA features in the in vivo (mouse liver) setting. Editing efficiency for different edit lengths of (a,g) insertions and (b,h) deletions. (c,i) Heatmap visualizing editing efficiency of pegRNAs (1 bp replacements) in Library-Diverse with different RTT overhang lengths (3, 7, 10, 15 bp) and edit positions (1–15). PAM position is highlighted with black dotted rectangle. (d,j) Editing efficiency of replacements with edit lengths of 1 to 5 bp. (e,k) Editing efficiency of single and double 1 bp replacements with or without editing of at least 1 base within the GG PAM sequence. (f,l) Editing efficiencies for double edits where 2 separated 1 bp replacements were installed. Intended editing means that both replacements were installed, whereas intermediate editing means that only 1 of the 2 replacements was installed. Distance of 0 corresponds to single 1 bp edits. (a-l) Bars show mean with error bar indicating mean +/− s.e.m. The numbers of analyzed pegRNA–target combinations are as follows. a, n = 371, 492, 504, 409, 412, 382, 406, 405, 405, 397, 323, 340, 333, 295, 320. b, n = 384, 444, 365, 345, 370, 385, 347, 379, 379, 369, 354, 339, 318, 314, 320. c, n = 2,815. d, n = 4,520, 764, 588, 497, 526. e, n = 4,092, 428, 2,528, 761. f, n = 4,520, 764, 130, 65, 151, 117, 25, 119, 133, 25, 149. g, n = 315, 415, 439, 351, 338, 321, 339, 356, 340, 337, 277, 303, 286, 243, 277. h, n = 317, 367, 294, 275, 288, 308, 289, 324, 304, 302, 273, 285, 255, 259, 249. i, n = 2,362. j, n = 4,265, 653, 499, 408, 459. k, n = 3,889, 376, 2,289, 663. l, n = 4,265, 653, 134, 59, 155, 119, 24, 119, 138, 26, 159.

Extended Data Fig. 4 TRIP screen characteristics for different edit modalities.

(a-d) Distribution of editing efficiency across different TRIP reporter integrations for (a) PE, (b) ABE8e, (c) BE4max, and (d) SpCas9 genome editing. Dotted vertical line indicates mean editing efficiency. (e-h) Correlation (Pearson) of individual TRIP screening replicates for PE (n = 1,182) (e), ABE8e (n = 1,169) (f), BE4max (n = 1,194) (g), and SpCas9 (n = 1,196) (h) genome editing. (i,j) Correlation of replicate means between different edit modalities: Spearman (i), Pearson (j). Only barcode integrations available from all editors are used for analysis (n = 1,165).

Extended Data Fig. 5 Additional analysis of TRIP screens and predictive modeling of editing rates.

(a-c) UMAP projection based on chromatin characteristics of genomic locations in the TRIP library (n = 1,165; corresponding to integrations with mappings to all editors), with editing efficiency overlay of (a) ABE8e, (b) BE4max, and (c) Cas9. (d-g) Visualization of chromatin characteristics of clusters defined in Fig. 3i. For each target/dataset type, we selected the averaging window with the largest deviation from the library mean. The relative difference to the library mean, calculated as the absolute difference between the cluster average and the library mean divided by the library mean, is shown. (h) Evaluation of the ePRIDICT-light XGBoost model trained on a subset of 6 features. Predictions from 5 different cross-validation runs were combined (n = 1,182). (i) Spearman and Pearson correlation of XGBoost model prediction to editing efficiencies in TRIP library for prime editing, adenine base editing (ABE8e), cytosine base editing (BE4max), and SpCas9 genome editing. Bars show the mean of fivefold cross-validation, and each of the five cross-validations is visualized as individual data points (n = 5). Error bar indicates the mean +/− s.d. (j) Validation of ePRIDICT on an independent dataset from Li et al. (ref. 28), where one sequence was integrated and edited (CTT insertion) at 4,144 genomic locations. (h,j) Color gradient from dark purple to yellow indicates increasing point density, per Gaussian KDE.
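
The quantity plotted in panels d-g can be written out as a small calculation. The sketch below is one possible reading of the legend, not the authors' implementation: the function names, the `signed` option and the window sizes are hypothetical, and ranking windows by relative (rather than raw) deviation is an assumption. The legend's "absolute difference" could mean either the unsigned magnitude or simply the non-relative difference, so both variants are offered.

```python
def relative_difference(cluster_mean, library_mean, signed=False):
    """Difference between cluster and library mean, divided by the library mean.

    signed=False returns the magnitude |cluster - library| / library;
    signed=True keeps the direction of the deviation.
    """
    if library_mean == 0:
        raise ValueError("library mean must be non-zero")
    diff = cluster_mean - library_mean
    if not signed:
        diff = abs(diff)
    return diff / library_mean


def pick_window(cluster_means, library_means):
    """Return the averaging window whose cluster mean deviates most
    (in relative terms, an assumption) from the library mean."""
    return max(
        cluster_means,
        key=lambda w: relative_difference(cluster_means[w], library_means[w]),
    )


# Hypothetical signal values keyed by averaging-window size in bp.
cluster = {100: 1.2, 2000: 1.8, 5000: 1.1}
library = {100: 1.0, 2000: 1.0, 5000: 1.0}

window = pick_window(cluster, library)
value = relative_difference(cluster[window], library[window], signed=True)
print(f"window = {window} bp, relative difference = {value:+.2f}")
```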

Extended Data Fig. 6 Validation of ePRIDICT at endogenous loci in K562, HEK293T and HepG2 cells.

(a) Spearman correlation analysis of ENCODE feature values for 19 selected endogenous loci, comparing datasets from K562 and HEK293T cells. (b) Validation of prime editing efficiency in HEK293T cells on endogenous loci with high (>50) or low (<35) ePRIDICT scores, normalized to editing on the reporter sequence. 1 bp replacements (n-high: 9, n-low: 10), 4 bp insertions (n-high: 9, n-low: 9), and 4 bp deletions (n-high: 9, n-low: 10). (c-d) Validation of genome editing efficiency on endogenous loci normalized to editing on integrated reporter in K562 (c) and HEK293T (d) with high (>50, n = 8 (K562) and 9 (HEK293T)) and low (<35, n = 9 (K562) and 10 (HEK293T)) ePRIDICT values for ABE8e, BE4max, and Cas9. (e) Spearman correlation analysis of ENCODE feature values for 19 selected endogenous loci, comparing datasets from K562 and HepG2 cells. (f) Validation in HepG2 cells as described in b. 1 bp replacements (n-high: 9, n-low: 10), 4 bp insertions (n-high: 8, n-low: 8), and 4 bp deletions (n-high: 9, n-low: 10). (g, h) Binning editing efficiency and predicted score from Fig. 3n, o into 3 categories each. Editing efficiency is binned into ‘Low’ (n = 92), ‘Middle’ (n = 27), and ‘High’ (n = 27) categories based on the cutoffs <5%, 5–20%, and >20%. The prediction score is binned in three even-sized tertiles. (g) PRIDICT2.0 K562 value as prediction score. (h) Combined PRIDICT2.0 K562 and ePRIDICT value (average of both scores) as prediction score. (i,j) Performance of PRIDICT2.0 HEK293T (i) or PRIDICT2.0 HEK293T in combination with ePRIDICT (j) in predicting the editing efficiency of 56 pegRNAs targeting endogenous loci in HEK293T. (k) Additional visualization of the performance of PRIDICT2.0 HEK293T alone or in combination with ePRIDICT on 56 pegRNAs targeting endogenous loci in HEK293T, including highly and poorly accessible loci. (l-n) Performance of PRIDICT2.0 K562 or PRIDICT2.0 K562 in combination with ePRIDICT in HepG2 (54 pegRNAs), as described for i-k.
(b-d, f) Box plots represent the 25th, 50th and 75th percentiles. Whiskers extend to points within 1.5 times the interquartile range from the quartiles.
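
The binning used for panels g and h of this legend can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, the boundary values 5% and 20% are assigned to the ‘Middle’ bin (the legend gives the cutoffs <5%, 5–20% and >20%), and tertiles are formed by rank so that the three groups are as even-sized as possible.

```python
def efficiency_bin(efficiency_percent):
    """Bin a measured editing efficiency (in %) using the legend's cutoffs."""
    if efficiency_percent < 5:
        return "Low"
    if efficiency_percent <= 20:
        return "Middle"
    return "High"


def tertile_labels(scores):
    """Label each score by its rank-based tertile (three even-sized groups)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        labels[idx] = ("Low", "Middle", "High")[rank * 3 // n]
    return labels


def combined_score(pridict2_score, epridict_score):
    """Average of the sequence-based and chromatin-based scores (panel h)."""
    return (pridict2_score + epridict_score) / 2


# Hypothetical scores and measured efficiencies, not values from the study.
pridict2 = [70.0, 40.0, 55.0, 20.0, 65.0, 30.0]
epridict = [60.0, 30.0, 25.0, 45.0, 70.0, 20.0]
measured = [25.0, 3.0, 8.0, 1.0, 40.0, 4.5]

combined = [combined_score(p, e) for p, e in zip(pridict2, epridict)]
for score_bin, eff in zip(tertile_labels(combined), measured):
    print(f"predicted tertile: {score_bin:<6}  measured bin: {efficiency_bin(eff)}")
```

Cross-tabulating the two labels per pegRNA gives the kind of 3 × 3 comparison the legend describes.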

Supplementary information

Supplementary Information

Supplementary Figs. 1–6 and Methods.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–12.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mathis, N., Allam, A., Tálas, A. et al. Machine learning prediction of prime editing efficiency across diverse chromatin contexts. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02268-2

  • DOI: https://doi.org/10.1038/s41587-024-02268-2
