Various small Cas9 orthologs and variants have recently been reported for use in in vivo delivery applications. Although small Cas9s are particularly suited for this purpose, selecting the optimal small Cas9 for a specific target sequence remains challenging. To address this, we systematically compared the activities of 17 small Cas9s at thousands of target sequences. For each small Cas9, we characterized the protospacer adjacent motif and determined the optimal single guide RNA expression format and scaffold sequence. High-throughput comparative analyses revealed distinct high- and low-activity groups of small Cas9s. We also developed DeepSmallCas9, a set of computational models that predict the activities of the small Cas9s at matched and mismatched target sequences. Together, this analysis and these computational models provide a useful guide for researchers to select the most suitable small Cas9 for specific applications.
Data availability
The deep sequencing data used in this study are available at the NCBI Sequence Read Archive under BioProject accession number PRJNA807878. The indel frequency datasets used in this study are provided as Supplementary Tables 2–6. The training and test datasets for DeepSmallCas9 and DeepSpCas9-v2 are provided as Supplementary Table 11. The human genetic variations analyzed in this study are available at https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/2020/. The reference genomes for human (GRCh38.p13 v.104) and mouse (GRCm39 v.104) are accessible at https://ftp.ensembl.org/pub/release-104/, and protein-coding annotations for human (MANE Select v.0.95) and mouse (RefSeq Select v.109) are accessible at https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_0.95/ and https://www.ncbi.nlm.nih.gov/nuccore/?term=%22Mus+musculus%22%5BOrganism%5D+AND+Refseq_select%5Bfilter%5D%E2%80%9D+AND+srcdb_refseq%5BPROP%5D, respectively. Source data are provided with this paper.
Code availability
Source code for DeepSmallCas9 and the custom Python scripts used for the indel frequency calculations are available on GitHub at https://github.com/SangyeonSeo/DeepSmallCas9 and https://github.com/CRISPRJWCHOI/CRISPR_toolkit/tree/master/Indel_searcher_2, respectively.
References
Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823 (2013).
Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, 823–826 (2013).
Cho, S. W., Kim, S., Kim, J. M. & Kim, J. S. Targeted genome engineering in human cells with the Cas9 RNA-guided endonuclease. Nat. Biotechnol. 31, 230–232 (2013).
Jinek, M. et al. RNA-programmed genome editing in human cells. eLife 2, e00471 (2013).
Hwang, W. Y. et al. Efficient genome editing in zebrafish using a CRISPR–Cas system. Nat. Biotechnol. 31, 227–229 (2013).
Jiang, W., Bikard, D., Cox, D., Zhang, F. & Marraffini, L. A. RNA-guided editing of bacterial genomes using CRISPR–Cas systems. Nat. Biotechnol. 31, 233–239 (2013).
Slaymaker, I. M. et al. Rationally engineered Cas9 nucleases with improved specificity. Science 351, 84–88 (2016).
Kleinstiver, B. P. et al. High-fidelity CRISPR–Cas9 nucleases with no detectable genome-wide off-target effects. Nature 529, 490–495 (2016).
Chen, J. S. et al. Enhanced proofreading governs CRISPR–Cas9 targeting accuracy. Nature 550, 407–410 (2017).
Casini, A. et al. A highly specific SpCas9 variant is identified by in vivo screening in yeast. Nat. Biotechnol. 36, 265–271 (2018).
Lee, J. K. et al. Directed evolution of CRISPR–Cas9 to increase its specificity. Nat. Commun. 9, 3048 (2018).
Kleinstiver, B. P. et al. Engineered CRISPR–Cas9 nucleases with altered PAM specificities. Nature 523, 481–485 (2015).
Anders, C., Bargsten, K. & Jinek, M. Structural plasticity of PAM recognition by engineered variants of the RNA-guided endonuclease Cas9. Mol. Cell 61, 895–902 (2016).
Hu, J. H. et al. Evolved Cas9 variants with broad PAM compatibility and high DNA specificity. Nature 556, 57–63 (2018).
Nishimasu, H. et al. Engineered CRISPR–Cas9 nuclease with expanded targeting space. Science 361, 1259–1262 (2018).
Swiech, L. et al. In vivo interrogation of gene function in the mammalian brain using CRISPR–Cas9. Nat. Biotechnol. 33, 102–106 (2015).
Chew, W. L. et al. A multifunctional AAV–CRISPR–Cas9 and its host response. Nat. Methods 13, 868–874 (2016).
Long, C. et al. Postnatal genome editing partially restores dystrophin expression in a mouse model of muscular dystrophy. Science 351, 400–403 (2016).
Ran, F. A. et al. In vivo genome editing using Staphylococcus aureus Cas9. Nature 520, 186–191 (2015).
Pardi, N., Hogan, M. J., Porter, F. W. & Weissman, D. mRNA vaccines—a new era in vaccinology. Nat. Rev. Drug Discov. 17, 261–279 (2018).
Schmidt, M. J. et al. Improved CRISPR genome editing using small highly active and specific engineered RNA-guided nucleases. Nat. Commun. 12, 4219 (2021).
Esvelt, K. M. et al. Orthogonal Cas9 proteins for RNA-guided gene regulation and editing. Nat. Methods 10, 1116–1121 (2013).
Muller, M. et al. Streptococcus thermophilus CRISPR–Cas9 systems enable specific editing of the human genome. Mol. Ther. 24, 636–644 (2016).
Agudelo, D. et al. Versatile and robust genome editing with Streptococcus thermophilus CRISPR1–Cas9. Genome Res. 30, 107–117 (2020).
Hou, Z. et al. Efficient genome engineering in human pluripotent stem cells using Cas9 from Neisseria meningitidis. Proc. Natl Acad. Sci. USA 110, 15644–15649 (2013).
Lee, C. M., Cradick, T. J. & Bao, G. The Neisseria meningitidis CRISPR–Cas9 system enables specific genome editing in mammalian cells. Mol. Ther. 24, 645–654 (2016).
Amrani, N. et al. NmeCas9 is an intrinsically high-fidelity genome-editing platform. Genome Biol. 19, 214 (2018).
Friedland, A. E. et al. Characterization of Staphylococcus aureus Cas9: a smaller Cas9 for all-in-one adeno-associated virus delivery and paired nickase applications. Genome Biol. 16, 257 (2015).
Najm, F. J. et al. Orthologous CRISPR–Cas9 enzymes for combinatorial genetic screens. Nat. Biotechnol. 36, 179–189 (2018).
Tycko, J. et al. Pairwise library screen systematically interrogates Staphylococcus aureus Cas9 specificity in human cells. Nat. Commun. 9, 2962 (2018).
Kim, E. et al. In vivo genome editing with a small Cas9 orthologue derived from Campylobacter jejuni. Nat. Commun. 8, 14500 (2017).
Yamada, M. et al. Crystal structure of the minimal Cas9 from Campylobacter jejuni reveals the molecular diversity in the CRISPR–Cas9 systems. Mol. Cell 65, 1109–1121 e1103 (2017).
Edraki, A. et al. A compact, high-accuracy Cas9 with a dinucleotide PAM for in vivo genome editing. Mol. Cell 73, 714–726.e4 (2018).
Hu, Z. et al. A compact Cas9 ortholog from Staphylococcus auricularis (SauriCas9) expands the DNA targeting scope. PLoS Biol. 18, e3000686 (2020).
Hu, Z. et al. Discovery and engineering of small SlugCas9 with broad targeting range and high specificity and activity. Nucleic Acids Res. 49, 4008–4019 (2021).
Kleinstiver, B. P. et al. Broadening the targeting range of Staphylococcus aureus CRISPR–Cas9 by modifying PAM recognition. Nat. Biotechnol. 33, 1293–1298 (2015).
Tan, Y. et al. Rationally engineered Staphylococcus aureus Cas9 nucleases with high genome-wide specificity. Proc. Natl Acad. Sci. USA 116, 20969–20976 (2019).
Xie, H. et al. High-fidelity SaCas9 identified by directional screening in human cells. PLoS Biol. 18, e3000747 (2020).
Nakagawa, R. et al. Engineered Campylobacter jejuni Cas9 variant with enhanced activity and broader targeting range. Commun. Biol. 5, 211 (2022).
Koblan, L. W. et al. Improving cytidine and adenine base editors by expression optimization and ancestral reconstruction. Nat. Biotechnol. 36, 843–846 (2018).
Zafra, M. P. et al. Optimized base editors enable efficient editing in cells, organoids and mice. Nat. Biotechnol. 36, 888–893 (2018).
Kim, N. et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nat. Biotechnol. 38, 1328–1336 (2020).
Kim, H. K. et al. In vivo high-throughput profiling of CRISPR–Cpf1 activity. Nat. Methods 14, 153–159 (2017).
Kim, H. K. et al. Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity. Nat. Biotechnol. 36, 239–241 (2018).
Kim, H. K. et al. SpCas9 activity prediction by DeepSpCas9, a deep learning-based model with high generalization performance. Sci. Adv. 5, eaax9249 (2019).
Wang, D. et al. Optimized CRISPR guide RNA design for two high-fidelity Cas9 variants by deep learning. Nat. Commun. 10, 4284 (2019).
Kim, H. K. et al. High-throughput analysis of the activities of xCas9, SpCas9-NG and SpCas9 at matched and mismatched target sequences in human cells. Nat. Biomed. Eng. 4, 111–124 (2020).
Shen, M. W. et al. Predictable and precise template-free CRISPR editing of pathogenic variants. Nature 563, 646–651 (2018).
Allen, F. et al. Predicting the mutations generated by repair of Cas9-induced double-strand breaks. Nat. Biotechnol. 37, 64–72 (2018).
Chen, W. et al. Massively parallel profiling and predictive modeling of the outcomes of CRISPR/Cas9-mediated double-strand break repair. Nucleic Acids Res. 47, 7989–8003 (2019).
Song, M. et al. Sequence-specific prediction of the efficiencies of adenine and cytosine base editors. Nat. Biotechnol. 38, 1037–1043 (2020).
Arbab, M. et al. Determinants of base editing outcomes from target library analysis and machine learning. Cell 182, 463–480 (2020).
Kim, H. K. et al. Predicting the efficiency of prime editing guide RNAs in human cells. Nat. Biotechnol. 39, 198–206 (2021).
Schlub, T. E., Smyth, R. P., Grimm, A. J., Mak, J. & Davenport, M. P. Accurately measuring recombination between closely related HIV-1 genomes. PLoS Comput. Biol. 6, e1000766 (2010).
Sack, L. M., Davoli, T., Xu, Q., Li, M. Z. & Elledge, S. J. Sources of error in mammalian genetic screens. G3 6, 2781–2790 (2016).
Feldman, D., Singh, A., Garrity, A. J. & Blainey, P. C. Lentiviral co-packaging mitigates the effects of intermolecular recombination and multiple integrations in pooled genetic screens. Preprint at bioRxiv https://doi.org/10.1101/262121 (2018).
Hill, A. J. et al. On the design of CRISPR-based single-cell molecular screens. Nat. Methods 15, 271–274 (2018).
Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020).
Doench, J. G. et al. Rational design of highly active sgRNAs for CRISPR–Cas9-mediated gene inactivation. Nat. Biotechnol. 32, 1262–1267 (2014).
Doench, J. G. et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR–Cas9. Nat. Biotechnol. 34, 184–191 (2016).
Fu, Y., Sander, J. D., Reyon, D., Cascio, V. M. & Joung, J. K. Improving CRISPR–Cas nuclease specificity using truncated guide RNAs. Nat. Biotechnol. 32, 279–284 (2014).
Kim, S., Bae, T., Hwang, J. & Kim, J. S. Rescue of high-specificity Cas9 variants using sgRNAs with matched 5′ nucleotides. Genome Biol. 18, 218 (2017).
Zhang, D. et al. Perfectly matched 20-nucleotide guide RNA sequences enable robust genome editing using high-fidelity SpCas9 nucleases. Genome Biol. 18, 191 (2017).
Xie, K., Minkenberg, B. & Yang, Y. Boosting CRISPR/Cas9 multiplex editing capability with the endogenous tRNA-processing system. Proc. Natl Acad. Sci. USA 112, 3570–3575 (2015).
He, X. et al. Boosting activity of high-fidelity CRISPR/Cas9 variants using a tRNA(Gln)-processing system in human cells. J. Biol. Chem. 294, 9308–9315 (2019).
Dang, Y. et al. Optimizing sgRNA structure to improve CRISPR–Cas9 knockout efficiency. Genome Biol. 16, 280 (2015).
Riesenberg, S., Helmbrecht, N., Kanis, P., Maricic, T. & Paabo, S. Improved gRNA secondary structures allow editing of target sites resistant to CRISPR–Cas9 cleavage. Nat. Commun. 13, 489 (2022).
Tsai, S. Q. et al. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR–Cas nucleases. Nat. Biotechnol. 33, 187–197 (2015).
Lin, Y. et al. CRISPR/Cas9 systems have off-target activity with insertions or deletions between target DNA and guide RNA sequences. Nucleic Acids Res. 42, 7473–7485 (2014).
Jones, S. K. Jr et al. Massively parallel kinetic profiling of natural and engineered CRISPR nucleases. Nat. Biotechnol. 39, 84–93 (2021).
Courtney, D. G. et al. CRISPR/Cas9 DNA cleavage at SNP-derived PAM enables both in vitro and in vivo KRT12 mutation-specific targeting. Gene Ther. 23, 108–112 (2015).
Christie, K. A. et al. Towards personalised allele-specific CRISPR gene editing to treat autosomal dominant disorders. Sci. Rep. 7, 16174 (2017).
Bakondi, B. et al. In vivo CRISPR/Cas9 gene editing corrects retinal dystrophy in the S334ter-3 rat model of autosomal dominant retinitis pigmentosa. Mol. Ther. 24, 556–563 (2016).
Gao, X. et al. Treatment of autosomal dominant hearing loss by in vivo delivery of genome editing agents. Nature 553, 217–221 (2018).
Gyorgy, B. et al. Allele-specific gene editing prevents deafness in a model of dominant progressive hearing loss. Nat. Med. 25, 1123–1130 (2019).
Koo, T. et al. Selective disruption of an oncogenic mutant allele by CRISPR/Cas9 induces efficient tumor regression. Nucleic Acids Res. 45, 7897–7908 (2017).
Li, Y. et al. Exploiting the CRISPR/Cas9 PAM constraint for single-nucleotide resolution interventions. PLoS ONE 11, e0144970 (2016).
Kim, W. et al. Targeting mutant KRAS with CRISPR-Cas9 controls tumor growth. Genome Res. 28, 374–382 (2018).
Cruz, L. et al. Mutant allele-specific CRISPR disruption in DYT1 dystonia fibroblasts restores cell function. Mol. Ther. Nucleic Acids 21, 1–12 (2020).
Xie, C. et al. Genome editing with CRISPR/Cas9 in postnatal mice corrects PRKAG2 cardiac syndrome. Cell Res. 26, 1099–1111 (2016).
Trochet, D. et al. Allele-specific silencing therapy for Dynamin 2-related dominant centronuclear myopathy. EMBO Mol. Med. 10, 239–253 (2018).
Rabai, A. et al. Allele-specific CRISPR/Cas9 correction of a heterozygous DNM2 mutation rescues centronuclear myopathy cell phenotypes. Mol. Ther. Nucleic Acids 16, 246–256 (2019).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Landrum, M. J. et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 48, D835–D844 (2020).
Liu, Z. et al. Versatile and efficient in vivo genome editing with compact Streptococcus pasteurianus Cas9. Mol. Ther. 30, 256–267 (2022).
Harrington, L. B. et al. A thermostable Cas9 with increased lifetime in human plasma. Nat. Commun. 8, 1424 (2017).
Hirano, S. et al. Structural basis for the promiscuous PAM recognition by Corynebacterium diphtheriae Cas9. Nat. Commun. 10, 1968 (2019).
Fedorova, I. et al. PpCas9 from Pasteurella pneumotropica—a compact Type II-C Cas9 ortholog active in human cells. Nucleic Acids Res. 48, 12297–12309 (2020).
Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
Shen, J. P. et al. Combinatorial CRISPR–Cas9 screens for de novo mapping of genetic interactions. Nat. Methods 14, 573–576 (2017).
Joung, J. et al. Genome-scale CRISPR–Cas9 knockout and transcriptional activation screening. Nat. Protoc. 12, 828–863 (2017).
Shalem, O. et al. Genome-scale CRISPR–Cas9 knockout screening in human cells. Science 343, 84–87 (2014).
Clement, K. et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat. Biotechnol. 37, 224–226 (2019).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785–794 (ACM, 2016).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Cock, P. J. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
Lorenz, R. et al. ViennaRNA Package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (eds Keeton, K. & Roscoe, T.) 265–283 (USENIX Association, 2016).
Bae, S., Park, J. & Kim, J. S. Cas-OFFinder: a fast and versatile algorithm that searches for potential off-target sites of Cas9 RNA-guided endonucleases. Bioinformatics 30, 1473–1475 (2014).
Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49, D884–D891 (2021).
O'Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Diedenhofen, B. & Musch, J. cocor: a comprehensive solution for the statistical comparison of correlations. PLoS ONE 10, e0121945 (2015).
Acknowledgements
We thank J. Park, S. Park and Y. Kim for assisting with the experiments. This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (2022R1A3B1078084 (H.H.K.) and 2018R1A5A2025079 (H.H.K.)); the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (2022M3A9E4017127 (H.H.K.) and 2022M3A9F3017506 (H.H.K.)); the Korea Drug Development Fund funded by the Ministry of Science and ICT, the Ministry of Trade, Industry, and Energy, and the Ministry of Health and Welfare, Republic of Korea (HN21C0917 (H.H.K.)); the Yonsei Signature Research Cluster Program of 2021-22-0014 (H.H.K.); the Brain Korea 21 FOUR Project for Medical Science (Yonsei University College of Medicine); the SNUH Kun-hee Lee Child Cancer & Rare Disease Project, Republic of Korea (22B-000-0101 (H.H.K.)); the Korea Research Institute of Bioscience and Biotechnology (KRIBB) Research Initiative Program (KGM5162221 (H.H.K.)); and the Korea Health Technology R&D Project funded by the Ministry of Health and Welfare, Republic of Korea (HI21C1314 (H.H.K.)).
Competing interests
Yonsei University has filed a patent based on this work, in which S.-Y.S., S.L. and H.H.K. are the co-inventors (patent no. 10-2022-0060290). H.H.K. is a consultant for EcoR1 capital. The remaining authors declare no competing interests.
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors: Lei Tang and Madhura Mukhopadhyay, in collaboration with the Nature Methods team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Expression of the small Cas9s in HEK293T cells.
a, Schematic of the small Cas9-expressing cassette. LTR, long terminal repeat; psi, psi packaging signal; RRE, Rev response element; CMV, cytomegalovirus promoter; NLS, nuclear localization signal from SV40 T antigen; FLAG, FLAG-tag; P2A, self-cleaving 2A peptide from porcine teschovirus-1; BlastR, blasticidin selection marker; WPRE, woodchuck hepatitis virus post-transcriptional regulatory element. b, Representative images of Western blotting used to measure the amount of the small Cas9 proteins in HEK293T cells transduced with the lentiviral vectors encoding the small Cas9s. The levels of the small Cas9 proteins were determined using a FLAG-tag; β-actin was used as a loading control. Unavoidably, images from two Western blotting experiments conducted in parallel are shown because two gels were required to accommodate the 18 evaluated Cas9 proteins. Unprocessed images are available in Source Data Extended Data Fig. 1. c, Relative levels of the small Cas9 proteins. Data represent mean ± SD. The numbers of replicates (n) are as follows: SaCas9, n = 12; SaCas9-KKH, n = 12; SaCas9*, n = 8; St1Cas9, n = 8; Nm1Cas9, n = 8; Nm2Cas9, n = 8; CjCas9, n = 8; SauriCas9, n = 5; SauriCas9-KKH, n = 5; sRGN3.1, n = 5; SlugCas9, n = 5; SlugCas9-HF, n = 5; Sa-SlugCas9, n = 5; SaCas9-HF, n = 5; efSaCas9, n = 5; eSaCas9, n = 5; SaCas9-KKH-HF, n = 5; enCjCas9, n = 5. Subsets of the Cas9 protein levels normalized to the β-actin protein levels without statistically significant differences (one-way analysis of variance followed by Bonferroni post-hoc test) are represented with the letters a, b, c, and d.
Extended Data Fig. 2 PAM compatibilities of small Cas9s in human cells.
a, Heatmaps showing the average indel frequencies in the target sequences with the indicated PAM sequences. Indel frequencies were measured four days after transduction of the paired libraries in SaCas9-, SaCas9-KKH-, eSaCas9-, and efSaCas9-expressing cells; in the cells expressing the other three small Cas9s, indel frequencies were measured seven days after transduction. Protospacers for which the highest indel frequencies were < 5% across candidate PAM sequences were excluded from the analyses. Fixed positions and nucleotides are indicated above each heatmap. For instance, to evaluate the preferences for the 3rd, 4th, and 5th nucleotides of the PAM for SaCas9, the 6th and 7th nucleotides of the PAM were fixed as TN. The numbers of analyzed protospacers (n) are as follows: SaCas9, n = 29; SaCas9-KKH, n = 29; eSaCas9, n = 29; efSaCas9, n = 29; SaCas9-HF, n = 29; SaCas9-KKH-HF, n = 29; St1Cas9, n = 27. b, Summary of the analyzed PAM compatibilities.
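The PAM notation used in this caption (for example NNGRRT) relies on IUPAC degenerate nucleotide codes. The following Python sketch illustrates how such a pattern could be matched against a concrete PAM and how the < 5% exclusion rule could be applied. The function names and the dictionary layout of the input are assumptions made for illustration and are not taken from the authors' scripts.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pam_to_regex(pam_pattern):
    """Compile a degenerate PAM pattern such as 'NNGRRT' into a regex."""
    return re.compile("".join(IUPAC[base] for base in pam_pattern.upper()))


def matches_pam(sequence, pam_pattern):
    """Return True if a concrete PAM sequence fits the degenerate pattern."""
    return pam_to_regex(pam_pattern).fullmatch(sequence.upper()) is not None


def filter_active_protospacers(indel_by_pam, min_peak=5.0):
    """Drop protospacers whose highest indel frequency (%) across all
    candidate PAMs is below min_peak, mirroring the exclusion rule in
    the caption. Hypothetical input layout: {protospacer: {pam: indel %}}.
    """
    return {
        protospacer: freqs
        for protospacer, freqs in indel_by_pam.items()
        if max(freqs.values()) >= min_peak
    }
```

For instance, `matches_pam("AAGAAT", "NNGRRT")` is true because R stands for A or G, whereas `matches_pam("AAGCAT", "NNGRRT")` is false.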
Extended Data Fig. 3 Activities of sRGN3.1 and SlugCas9 at diverse potential off-target sequences.
a,b, Comparison of the activities of sRGN3.1 and SlugCas9 at different potential off-target sequences. The average relative indel frequencies are indicated using the red plus symbols. The numbers of analyzed target sequences (n) are as follows: sRGN3.1, n = 3,715 (1-bp mismatch), 1,195 (2-bp mismatch), 588 (3-bp mismatch), 1,263 (1-nt deletion), and 1,131 (1-nt insertion); SlugCas9, n = 3,715 (1-bp mismatch), 1,195 (2-bp mismatch), 588 (3-bp mismatch), 1,263 (1-nt deletion), and 1,131 (1-nt insertion). Subsets of the small Cas9-induced relative indel frequencies without statistically significant differences (one-way analysis of variance followed by Bonferroni post-hoc test) are represented with the letters a, b, c, and d. c–h, Heatmaps showing the average specificities of sRGN3.1 (c,e,g) and SlugCas9 (d,f,h) when there were 1-bp mismatches (c,d), 1-nt RNA bulges (e,f), or 1-nt DNA bulges (g,h) between sgRNAs and target sequences with a primary or secondary PAM. The specificity was calculated as 1 − (indel frequency at mismatched target sequences divided by that at perfectly matched targets). c,d, To distinguish mismatch types, wobble, nonwobble, and transversion mismatches are shown in red, green, and blue, respectively. i–l, Box plots showing the effects of deleted (i,j) or inserted (k,l) bases on the activities of sRGN3.1 (i,k) and SlugCas9 (j,l). The numbers of analyzed target sequences (n) are as follows: sRGN3.1 RNA bulge, n = 271 (A), 325 (T), 329 (C), and 338 (G); SlugCas9 RNA bulge, n = 271 (A), 325 (T), 329 (C), and 338 (G); sRGN3.1 DNA bulge, n = 279 (C), 308 (G), 289 (T), and 255 (A); SlugCas9 DNA bulge, n = 308 (G), 279 (C), 289 (T), and 255 (A). Subsets of the small Cas9-induced relative indel frequencies without statistically significant differences (one-way analysis of variance followed by Bonferroni post-hoc test) are represented with the letters a and b. a,b,i–l, Indel frequencies were normalized to those at perfectly matched target sequences.
Boxes represent the 25th, 50th, and 75th percentiles and whiskers show the 10th and 90th percentiles.
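The specificity and the normalized (relative) indel frequency defined in this caption can be written out as a short Python sketch. The function names are illustrative, and the guard against a zero on-target frequency is an added assumption, since the caption does not say how such cases were handled.

```python
def relative_indel_frequency(mismatched_indel, matched_indel):
    """Indel frequency at a mismatched target, normalized to the frequency
    at the perfectly matched target (both in the same units, e.g. %)."""
    if matched_indel <= 0:
        # Assumption: targets with no on-target activity cannot be normalized.
        raise ValueError("matched_indel must be positive")
    return mismatched_indel / matched_indel


def specificity(mismatched_indel, matched_indel):
    """Specificity = 1 - (mismatched indel frequency / matched indel frequency)."""
    return 1.0 - relative_indel_frequency(mismatched_indel, matched_indel)
```

With these definitions, a guide producing 50% indels at its matched target and 5% at a mismatched target has a relative indel frequency of 0.1 and a specificity of 0.9.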
Extended Data Fig. 4 Development of DeepSmallCas9.
DeepSmallCas9 is a set of deep learning-based models that predict the activities of the small Cas9s at matched and mismatched target sequences. Additional features include the melting temperature (Tm), the number of G or C nucleotides (GC count), the minimum free energy (MFE), and the mismatch position and type between guide and protospacer sequences (mismatch profile). See also Methods.
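Two of the additional features named above, the GC count and the mismatch profile, are simple enough to sketch in Python. This is only an illustration under stated assumptions: the function names are hypothetical, positions are counted from 1 at the 5′ end of the guide, and the way DeepSmallCas9 actually encodes these features is not described in this caption. Tm and MFE are omitted because they require dedicated thermodynamic tools.

```python
def gc_count(sequence):
    """Number of G or C nucleotides in a sequence."""
    return sum(1 for base in sequence.upper() if base in "GC")


def mismatch_profile(guide, protospacer):
    """List the mismatches between a guide and a protospacer of equal length.

    Returns (position, guide_base, target_base) tuples. Positions are
    1-indexed from the 5' end, which is an assumed convention.
    """
    if len(guide) != len(protospacer):
        raise ValueError("guide and protospacer must have equal length")
    return [
        (index + 1, g, t)
        for index, (g, t) in enumerate(zip(guide.upper(), protospacer.upper()))
        if g != t
    ]
```

A perfectly matched pair yields an empty profile, so the same representation covers matched and mismatched targets.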
Extended Data Fig. 5 Performance comparison of algorithms used to develop computational models that predict the activities of the small Cas9s.
Heatmaps showing correlations between the measured and computationally predicted indel frequencies. Average Pearson (top) and Spearman (bottom) correlation coefficients were calculated from five-fold cross-validation. The algorithms that showed the highest average correlation coefficients are shown in bold. XGBoost, extreme gradient boosting; Boosted RT, gradient-boosted regression trees; Lasso, L1-regularized linear regression; Ridge, L2-regularized linear regression; Elastic Net, L1 and L2-regularized linear regression; RF, random forest; SVM, support vector machine.
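The evaluation described here, average Pearson and Spearman coefficients over five-fold cross-validation, can be sketched with the Python standard library alone. The fold assignment, the random seed, and the `fit` callback interface are assumptions for illustration, and the one-variable least-squares model is only a stand-in for the algorithms compared in the figure.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def fit_line(xs, ys):
    """Stand-in model: least-squares line on one scalar feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum(
        (a - mx) ** 2 for a in xs
    )
    return lambda value: slope * value + (my - slope * mx)


def cross_validated_correlations(features, targets, fit, k=5, seed=0):
    """Return (mean Pearson, mean Spearman) over k held-out folds."""
    indices = list(range(len(targets)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    pearsons, spearmans = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predict = fit([features[i] for i in train], [targets[i] for i in train])
        predicted = [predict(features[i]) for i in held_out]
        measured = [targets[i] for i in held_out]
        pearsons.append(pearson(predicted, measured))
        spearmans.append(spearman(predicted, measured))
    return sum(pearsons) / k, sum(spearmans) / k
```

Any model can be plugged in through `fit`, provided it takes training features and targets and returns a prediction function.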
Extended Data Fig. 6 Comparison of the performance of DeepSmallCas9 with those of existing computational models predicting SaCas9 activity.
a,b, Evaluation of DeepSmallCas9 and ‘SaCas9 on-target rules’ (ref. 29), an existing computational model predicting SaCas9 activities at matched target sequences, using the fraction of the hold-out test dataset including matched targets with NNGRRN PAM; n = 3,975. c,d, Evaluation of DeepSmallCas9 and ‘Model of SaCas9 specificity’ (ref. 30), an existing computational model predicting SaCas9 activities at mismatched target sequences, using the fraction of the hold-out test dataset including mismatched targets with NNGRRT PAM; n = 217. Predicted activities at mismatched targets were normalized to those at perfectly matched targets. a,c, The Spearman correlation coefficient (Rho) and the Pearson correlation coefficient (r) are shown. Dashed line represents y = x. b,d, Data indicate correlation coefficient ± 95% confidence interval. Statistically significant differences between two correlations were determined by two-tailed Steiger’s z-test. The P-values from left to right are < 2.2 × 10⁻¹⁶, < 2.2 × 10⁻¹⁶, 3.1 × 10⁻⁴, and 4.7 × 10⁻⁸.
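For readers unfamiliar with the statistics in panels b and d, the Python sketch below shows one common way to obtain a confidence interval for a correlation coefficient and to compare two dependent correlations that share a variable. Both parts are assumptions about the general technique rather than a reproduction of the authors' analysis: the interval uses the Fisher z-transformation, and the test follows the pooled-correlation form usually attributed to Steiger (1980). Here `r_jk` and `r_jh` would be the correlations of the measured activities with each of two models' predictions, and `r_kh` the correlation between the two sets of predictions.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z-transformation.

    Assumption: the caption does not state how its intervals were computed.
    """
    z = math.atanh(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    quantile = NormalDist().inv_cdf(0.5 + level / 2)
    return (
        math.tanh(z - quantile * standard_error),
        math.tanh(z + quantile * standard_error),
    )


def steiger_z(r_jk, r_jh, r_kh, n):
    """Two-tailed z-test for two dependent correlations sharing variable j.

    Returns (z statistic, two-tailed P value).
    """
    r_bar = (r_jk + r_jh) / 2
    # Covariance term of the two Fisher-transformed correlations.
    psi = r_kh * (1 - 2 * r_bar**2) - 0.5 * r_bar**2 * (
        1 - 2 * r_bar**2 - r_kh**2
    )
    s = psi / (1 - r_bar**2) ** 2
    z = (math.atanh(r_jk) - math.atanh(r_jh)) * math.sqrt(n - 3) / math.sqrt(
        2 - 2 * s
    )
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

When the two correlations are equal the statistic is zero and the P value is 1. Larger gaps, or larger n, push the P value toward zero.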
Extended Data Fig. 7 Evaluation and prediction of the activities of four small Cas9s in three different cell lines.
Cell lines expressing sRGN3.1, efSaCas9, SauriCas9-KKH, or Nm2Cas9 were transduced with lentiviral pairwise libraries of sgRNA-encoding sequences and target sequences. Four days after transduction, the indel frequencies were measured. In addition, the indel frequencies were predicted using DeepSmallCas9. a, Measured activities of four small Cas9s in three cell lines. Boxes represent the 25th, 50th, and 75th percentiles and whiskers show the 10th and 90th percentiles. Guide formats, PAM sequences, and the numbers of target sequences (n) analyzed for the small Cas9s are as follows: sRGN3.1, G/gN21, NNGGRT, and n = 197 (DLD-1), 197 (HCT116), and 4,809 (HEK293T); efSaCas9, G/gN21, NNGRRT, and n = 394, 394, and 9,514; SauriCas9-KKH, G/gN21, NNGGRT, and n = 197, 197, and 4,855; Nm2Cas9, G/gN22, NNNNCCA, and n = 95, 95, and 2,304. Subsets of the small Cas9-induced indel frequencies without statistically significant differences (one-way analysis of variance followed by Bonferroni post-hoc test) are represented with the letters a, b, c, and d. b, Correlations between predicted and measured activities of four small Cas9s. Results of four Cas9s in each cell line were combined to generate one dataset per cell line. The Spearman correlation coefficient (Rho) and the Pearson correlation coefficient (r) are shown. Red dashed line represents y = x. Guide formats, PAM sequences, and the numbers of target sequences (n) analyzed for the small Cas9s are as follows: sRGN3.1, G/gN21, NNGRRT, and n = 394 (DLD-1), 394 (HCT116), and 951 (HEK293T); efSaCas9, G/gN21, NNGRRT, and n = 394, 394, and 946; SauriCas9-KKH, G/gN21, NNGRRT, and n = 394, 394, and 988; Nm2Cas9, G/gN22, NNNNCCN, and n = 362, 362, and 962.
Extended Data Fig. 8 Computational prediction of preferred small Cas9s at targets with diverse PAM sequences.
a, Heatmap showing the most efficient Cas9 out of eight highly active small Cas9s, which include sRGN3.1, SlugCas9, SaCas9, SauriCas9, Sa-SlugCas9, SaCas9-KKH, eSaCas9, and efSaCas9, at target sequences with a given PAM sequence. To compare the activities of the small Cas9s at sites with 4,096 (= 4⁶) PAMs (all possible NNNNNN sequences for the 1st–6th nucleotides of the PAM), 204,800 target sequences were generated by combining 50 randomly designed protospacer sequences and 4,096 PAM sequences and used as input data for the prediction of the activities (i.e., the induced indel frequencies) using DeepSmallCas9. The color-coded squares represent the small Cas9 that is predicted to be the most efficient, in cases in which the average indel frequency is higher than 10%, at a given PAM sequence. When the predicted average indel frequencies of the most efficient small Cas9s at given target sequences are lower than 10%, the squares representing those PAM sequences are shown in white. The color-code for each Cas9 is shown in b. b, Pie chart showing the number of PAM sequences that could be most efficiently targeted with each Cas9 with an average activity higher than 10%. c, Bar graph showing the number of efficiently targetable PAM sequences out of 4,096 (= 4⁶) PAMs for each Cas9 with an average activity higher than 10%.
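The construction of the input set and the per-PAM selection rule described in panel a can be sketched in Python as follows. The function names and the nested-dictionary layout of the predictions are hypothetical. The predictions themselves would come from DeepSmallCas9, which is not called here.

```python
from itertools import product


def all_pams(length=6):
    """Enumerate every PAM of the given length over A, C, G, T."""
    return ["".join(bases) for bases in product("ACGT", repeat=length)]


def build_targets(protospacers, pams):
    """Pair every protospacer with every PAM."""
    return [(protospacer, pam) for pam in pams for protospacer in protospacers]


def best_cas9_per_pam(predicted, threshold=10.0):
    """Pick the Cas9 with the highest mean predicted indel frequency per PAM.

    `predicted` is a hypothetical layout: {cas9: {pam: [indel %, ...]}}.
    A PAM maps to None (a white square in the heatmap) when even the best
    Cas9 does not exceed the threshold.
    """
    pams = set().union(*(by_pam.keys() for by_pam in predicted.values()))
    best = {}
    for pam in pams:
        means = {
            cas9: sum(by_pam[pam]) / len(by_pam[pam])
            for cas9, by_pam in predicted.items()
            if pam in by_pam
        }
        top = max(means, key=means.get)
        best[pam] = top if means[top] > threshold else None
    return best
```

Six positions with four possible bases give 4⁶ = 4,096 PAMs, and pairing them with 50 protospacers gives the 204,800 target sequences stated in the caption.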
Extended Data Fig. 9 SlugCas9-, SaCas9-KKH-, SlugCas9-HF, Sa-SlugCas9-, or efSaCas9-directed targeting of dominant single-nucleotide variants with or without using DeepSmallCas9 to select sgRNAs.
Pie charts showing the fraction of the dominant single-nucleotide variants in protein-coding sequences in the ClinVar database (ref. 83,94) that can be edited using SlugCas9 (a), SaCas9-KKH (b), SlugCas9-HF (c), Sa-SlugCas9 (d), or efSaCas9 (e) in an efficient and allele-specific manner (on-target activity higher than 10% and off-target activity lower than 2%). Mutations for which no designed sgRNAs met these criteria were classified as either inefficient or nonspecific and those for which no mutant allele-targeting sgRNAs could be designed due to the lack of a nearby PAM were classified as untargetable. (Left pie charts) The specified small Cas9s were chosen and the most appropriate sgRNAs were designed using DeepSmallCas9 such that both the activity at the mutant allele and the allele-specificity are high. (Right pie charts) The specified small Cas9s were chosen and sgRNAs were designed to target given mutations such that the mutations were located in regions in the target sequence with the following order of preference: i) the PAM, ii) the highly selective protospacer region (within 10 bp from the PAM), and iii) the remaining region in the protospacer. The activities at the mutant and corresponding wild-type alleles were predicted afterwards using DeepSmallCas9. (Box plots) The predicted activities of selected Cas9-sgRNA combinations at mutant and wild-type alleles for the indicated SNVs. Boxes represent the 25th, 50th, and 75th percentiles and whiskers show the 10th and 90th percentiles. The fold differences between the average activities at mutant and wild-type alleles are shown (e.g., 34x).
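The classification behind these pie charts, and the position-based design preference used for the right-hand charts, can be sketched in Python as follows. The thresholds come from the caption. The function names, the input layout, and the convention of counting protospacer positions from the PAM-proximal end are assumptions for illustration.

```python
def classify_variant(candidates, on_min=10.0, off_max=2.0):
    """Classify one variant from the predicted activities of its sgRNAs.

    `candidates` is a list of (on_target %, off_target %) pairs, one per
    designable sgRNA; it is empty when no nearby PAM allows any design.
    """
    if not candidates:
        return "untargetable"
    if any(on > on_min and off < off_max for on, off in candidates):
        return "efficient and allele-specific"
    return "inefficient or nonspecific"


def region_priority(in_pam, distance_from_pam=None):
    """Rank where the mutation falls in a target; lower is preferred.

    0: inside the PAM; 1: protospacer within 10 bp of the PAM;
    2: remaining protospacer. `distance_from_pam` counts from 1 at the
    PAM-proximal base (assumed convention).
    """
    if in_pam:
        return 0
    return 1 if distance_from_pam <= 10 else 2
```

Sorting candidate sgRNAs by `region_priority` reproduces the stated order of preference, after which the predicted activities at the mutant and wild-type alleles determine the class.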
Extended Data Fig. 10 Allele-specific gene editing using the small Cas9s and SpCas9.
Of the 13,145 dominant SNVs in protein-coding sequences from the ClinVar database, pie charts show the numbers of dominant SNVs that could be most efficiently and allele-specifically targeted with the indicated Cas9s. (Top pie chart) DeepSmallCas9- and DeepSpCas9-v2-assisted selection of Cas9-sgRNA combinations allowed efficient (expected indel frequency at the mutant allele (on-target) > 10%) and allele-specific (expected indel frequency at the wild-type allele (off-target) < 2%) targeting of 10,925 of the 13,145 SNVs. (Bottom pie chart) Random selection of Cas9 and sgRNA pairs resulted in efficient and allele-specific targeting for only 678 SNVs. (Box plots) The predicted activities of selected Cas9-sgRNA combinations at mutant and wild-type alleles for the indicated SNVs. Boxes represent the 25th, 50th, and 75th percentiles and whiskers show the 10th and 90th percentiles. The fold differences between the average activities at mutant and wild-type alleles are shown (e.g., 37x).
Supplementary Texts 1–5, Figs. 1–15, Tables 1–13, Notes 1–3 and references.
Supplementary Tables 1–13.
Source Data Fig. 1
Statistical source data.
Source Data Fig. 2
Statistical source data.
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Statistical source data.
Source Data Fig. 5
Statistical source data.
Source Data Fig. 6
Statistical source data.
Source Data Extended Data Fig. 1
Statistical source data and unprocessed blots.
Source Data Extended Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data.
Source Data Extended Data Fig. 5
Statistical source data.
Source Data Extended Data Fig. 6
Statistical source data.
Source Data Extended Data Fig. 7
Statistical source data.
Source Data Extended Data Fig. 8
Statistical source data.
Source Data Extended Data Fig. 9
Statistical source data.
Source Data Extended Data Fig. 10
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Seo, SY., Min, S., Lee, S. et al. Massively parallel evaluation and computational prediction of the activities and specificities of 17 small Cas9s. Nat Methods (2023). https://doi.org/10.1038/s41592-023-01875-2