Large dataset enables prediction of repair after CRISPR–Cas9 editing in primary T cells

Leenay, Ryan T.; Aghazadeh, Amirali; Hiatt, Joseph; Tse, David; Roth, Theodore L.; Apathy, Ryan; Shifrut, Eric; Hultquist, Judd F.; Krogan, Nevan; Wu, Zhenqin; Cirolia, Giana; Canaj, Hera; Leonetti, Manuel D.; Marson, Alexander; May, Andrew P.; Zou, James

doi:10.1038/s41587-019-0203-2

Brief Communication
Published: 29 July 2019

Nature Biotechnology volume 37, pages 1034–1037 (2019)

Abstract

Understanding of repair outcomes after Cas9-induced DNA cleavage is still limited, especially in primary human cells. We sequence repair outcomes at 1,656 on-target genomic sites in primary human T cells and use these data to train a machine learning model, which we have called CRISPR Repair Outcome (SPROUT). SPROUT accurately predicts the length, probability and sequence of nucleotide insertions and deletions, and will facilitate design of SpCas9 guide RNAs in therapeutically important primary human cells.

**Fig. 2: SPROUT predicts DNA repair outcomes.**

Data availability

The raw data are openly available through SRA (BioProject PRJNA486372), and the analyses through figshare (https://figshare.com/projects/Systematic_characterization_of_genome_editing_in_primary_T_cells_reveals_proximal_genomic_insertions_and_enables_machine_learning_prediction_of_CRISPR-Cas9_DNA_repair_outcomes/37166).

Code availability

The SPROUT software is publicly available at https://zou-group.github.io/SPROUT. Code is also available in the Supplementary Code.

References

Fischbach, M. A., Bluestone, J. A. & Lim, W. A. Sci. Transl. Med. 5, 179ps7 (2013).
Simeonov, D. et al. Commun. Biol. 2, 70 (2019).
Hultquist, J. F. et al. Nat. Protoc. 14, 1–27 (2019).
Lindsay, H. et al. Nat. Biotechnol. 34, 701–702 (2016).
van Overbeek, M. et al. Mol. Cell 63, 633–646 (2016).
Brinkman, E. K. et al. Mol. Cell 70, 801–813 (2018).
Lemos, B. R. et al. Proc. Natl Acad. Sci. USA 115, E2040–E2047 (2018).
Deriano, L. & Roth, D. B. Annu. Rev. Genet. 47, 433–455 (2013).
Shen, M. W. et al. Nature 563, 646–651 (2018).
Allen, F. et al. Nat. Biotechnol. 37, 64–72 (2019).
Shin, H. Y. et al. Nat. Commun. 8, 15464 (2017).
Kosicki, M., Tomberg, K. & Bradley, A. Nat. Biotechnol. 36, 765–771 (2018).
Roth, T. L. et al. Nature 559, 405–409 (2018).
Simeonov, D. & Marson, A. Annu. Rev. Immunol. 37, 571–597 (2019).
Untergasser, A. et al. Nucleic Acids Res. 40, e115 (2012).
Magoč, T. & Salzberg, S. L. Bioinformatics 27, 2957–2963 (2011).
Bolger, A. M., Lohse, M. & Usadel, B. Bioinformatics 30, 2114–2120 (2014).
Li, H. & Durbin, R. Bioinformatics 25, 1754–1760 (2009).

Acknowledgements

This work was supported by the Chan–Zuckerberg Biohub. J.Z. was supported by a Chan–Zuckerberg Investigator grant and by National Science Foundation grant CRII 1657155. A.M. was supported by National Institutes of Health (NIH)/NIDA Avenir New Innovator Award (DP2DA042423), NIH/NIGMS funding for the HIV Accessory and Regulatory Complexes (HARC) Center (P50 GM082250; to A.M. and N.J.K.) and gifts from J. Aronov, B. Bakar, K. Jordan, F. Caufield and D. Wolkoff. A.M. holds a Career Award for Medical Scientists from the Burroughs Wellcome Fund, has received funding from the Innovative Genomics Institute (IGI) and the Parker Institute for Cancer Immunotherapy (PICI) and is an investigator at the Chan–Zuckerberg Biohub. A.A. was supported by NIH grant 7R01HG008164-04 and the Stanford data science initiative. J.H. was supported by the UCSF Medical Scientist Training Program. We would like to thank N. Neff and R. Sit for assistance collecting sequence data and A. Sellas for laboratory support. We also thank J. Palacios for statistical discussions.

Author information

Andrew P. May
Present address: Sana Biotechnology, South San Francisco, CA, USA
These authors contributed equally: Ryan T. Leenay, Amirali Aghazadeh, Joseph Hiatt.

Authors and Affiliations

Chan–Zuckerberg Biohub, San Francisco, CA, USA
Ryan T. Leenay, Giana Cirolia, Hera Canaj, Manuel D. Leonetti, Alexander Marson, Andrew P. May & James Zou
Department of Electrical Engineering, Stanford University, Stanford, CA, USA
Amirali Aghazadeh, David Tse & James Zou
Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA, USA
Joseph Hiatt
Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA, USA
Joseph Hiatt, Theodore L. Roth, Ryan Apathy, Eric Shifrut & Alexander Marson
Diabetes Center, University of California, San Francisco, San Francisco, CA, USA
Joseph Hiatt & Alexander Marson
Medical Scientist Training Program, University of California, San Francisco, San Francisco, CA, USA
Joseph Hiatt
J. David Gladstone Institutes, San Francisco, CA, USA
Joseph Hiatt, Judd F. Hultquist & Nevan Krogan
Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA, USA
Judd F. Hultquist & Nevan Krogan
Quantitative Biosciences Institute (QBI), University of California, San Francisco, San Francisco, CA, USA
Judd F. Hultquist & Nevan Krogan
Division of Infectious Diseases, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
Judd F. Hultquist
Department of Chemistry, Stanford University, Stanford, CA, USA
Zhenqin Wu
Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
Alexander Marson
Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
Alexander Marson
Parker Institute for Cancer Immunotherapy, San Francisco, CA, USA
Alexander Marson
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
James Zou

Contributions

R.T.L., A.A., J.H., A.M., A.P.M. and J.Z. designed research. R.T.L., A.A., J.H., D.T., T.L.R., R.A., E.S., J.F.H., N.K., Z.W., G.C., H.C. and M.D.L. conducted research. R.T.L., A.A., J.H., A.M., A.P.M., T.L.R. and J.Z. wrote the paper.

Corresponding authors

Correspondence to Alexander Marson, Andrew P. May or James Zou.

Ethics declarations

Competing interests

A.M. is a co-founder of Spotlight Therapeutics. A.M. has served as an advisor to Juno Therapeutics, is a member of the scientific advisory board at PACT Pharma and is an advisor to Sonoma Biotherapeutics. The Marson laboratory has received sponsored research support from Juno Therapeutics, Epinomics, Sanofi and a gift from Gilead. A.M. and T.L.R. are co-founders of Arsenal Biosciences and T.L.R. is chief scientific officer of the company.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Number of donors per site.

Histogram showing the distribution of unique blood donors per SpCas9 cut site in the T cell data.

Supplementary Figure 2 Distribution of repair outcomes.

(a) Distribution of the edit efficiency (left) and indel diversity (right) of the repair outcomes in T cells. We use the entropy of the distribution of the reads over the indel types as a metric to quantify the diversity of the repair outcomes. If there is exactly one repair outcome in all of the reads, then the entropy is 0. Higher entropy means that the repair outcomes are more diverse. (b) Distribution of the fraction of total reads with an insertion (left) and deletion (right) in T cells. (c) Distribution of the average insertion length given an insertion (left) and average deletion length given a deletion (right) in the repair outcomes of T cells.
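The entropy-based diversity metric described in panel (a) can be illustrated with a minimal sketch. It assumes Shannon entropy in bits over per-indel read counts; the caption does not state the logarithm base, and the function name `indel_entropy` and the indel labels are hypothetical, not taken from the SPROUT code.

```python
import math

def indel_entropy(read_counts):
    """Shannon entropy of the read distribution over indel types.

    Assumption: log base 2 (bits); the caption does not specify the base.
    `read_counts` maps an indel label to its number of reads.
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads to compute entropy from")
    entropy = 0.0
    for count in read_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical read counts for two cut sites.
single_outcome = {"1bp_insertion_A": 500}
mixed_outcomes = {"1bp_insertion_A": 250, "2bp_deletion": 250}

print(indel_entropy(single_outcome))  # one outcome only: entropy is zero
print(indel_entropy(mixed_outcomes))  # two equally frequent outcomes: one bit
```

A site with a single repair outcome scores zero, and the score grows as reads spread more evenly over more indel types.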

Supplementary Figure 3 Repair outcome similarity.

Jaccard similarity between the top 20 indels of 250 (of a total of 3,989) randomly selected SpCas9-targeted sequencing experiments in T cells. Experiments performed on cells from different individuals at the same cut site are placed next to each other in the heatmap. These biological replicates show greater Jaccard similarity in repair outcomes compared to outcomes at distinct cut sites, as can be seen in the darker blocks along the diagonal.
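As a rough illustration of the comparison in this figure, the sketch below computes the Jaccard similarity between the top-k indel sets of two experiments. The helper names, the tie-breaking rule and the example read counts are assumptions made for illustration; only the use of the top 20 indels and of Jaccard similarity comes from the caption.

```python
def top_indels(read_counts, k=20):
    """Return the set of the k most frequent indel labels.

    Ties are broken alphabetically by label (an arbitrary, assumed choice).
    """
    ranked = sorted(read_counts.items(), key=lambda item: (-item[1], item[0]))
    return {label for label, _ in ranked[:k]}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two sets."""
    union = a | b
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(union)

# Hypothetical read counts from two donors edited at the same cut site.
donor_a = {"D1": 100, "I1": 50, "D2": 10}
donor_b = {"D1": 90, "I1": 40, "D7": 5}

similarity = jaccard(top_indels(donor_a), top_indels(donor_b))
print(similarity)
```

Here the two donors share two of four distinct indels, so the similarity is 0.5.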

Supplementary Figure 4 Indel predictions.

Spearman’s rank correlation and the Pearson correlation coefficient were calculated to measure the performance of SPROUT in predicting insertion-containing reads. The mean and standard deviation of the cross-validation results are shown in the table. In each cross-validation, 304 T cell sites and 96 sites in each of the cell lines were used for testing.

Supplementary Figure 5 Example predictions of SPROUT.

Performance of SPROUT on three representative target sites with high, low and medium fraction of insertions. Predicted features are described in the SPROUT column, while actual measured features are presented in the experimental validation column.

Supplementary Figure 6 Ranking cut sites.

SPROUT’s performance in ranking guides within a gene based on predicted repair outcome. (a) Schematic of the guide ranking experiment. Assuming a gene with three potential guides (Guide 1, Guide 2, Guide 3), SPROUT ranks the guides by their likelihood of producing a single-nucleotide insertion (or deletion). In this illustration the algorithm predicts that Guide 2 produces the largest number of reads with a 1-bp insertion/deletion. (b) Guide ranking performance on T cells. The algorithm was trained on 435 genes and tested on the remaining 108 genes. Kendall’s tau (in the range [−1, 1]) measures the rank correlation (higher is better and zero indicates no correlation), “SPROUT (# genes)” indicates the number of genes for which SPROUT predicted exactly the correct ranking across all the guides, and “Rnd Shuffle (# genes)” indicates the number of genes predicted correctly by naïve guessing. (c) Guide ranking performance on HEK293. (d) Guide ranking performance on K562. (e) Guide ranking performance on HCT116. For parts (c–e), the model was trained on all T cell genes and tested on 28 genes from these other cell types. (f) For each gene, we order the target sites from the most likely to the least likely to introduce a frameshift outcome, using SPROUT predictions. The table reports the fraction of genes where SPROUT correctly predicts the top target site, the fraction where SPROUT correctly predicts the complete ordering of all the target sites in the gene, and the correlation between the SPROUT predictions and the experimental validations. The same metrics for random prediction are reported as baselines. Bootstrap mean and standard deviation are shown in the table.
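The two ranking metrics named in this caption, Kendall's tau and the exact-ranking match, can be sketched as follows. This is an illustrative implementation only: it assumes the tau-a variant (the caption does not say which variant was used), and the function names and the per-guide fractions are hypothetical.

```python
from itertools import combinations

def kendall_tau(predicted, observed):
    """Kendall's tau-a between two equal-length score lists.

    Assumption: tau-a, where tied pairs count as neither concordant
    nor discordant. The result lies in [-1, 1].
    """
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (predicted[i] - predicted[j]) * (observed[i] - observed[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)

def exact_ranking_match(predicted, observed):
    """True if both score lists order every guide identically."""
    def order(values):
        return sorted(range(len(values)), key=lambda i: -values[i])
    return order(predicted) == order(observed)

# Hypothetical 1-bp insertion fractions for three guides in one gene.
predicted = [0.20, 0.45, 0.10]
observed = [0.25, 0.40, 0.05]

print(kendall_tau(predicted, observed))
print(exact_ranking_match(predicted, observed))
```

In this example both lists rank Guide 2 first and Guide 3 last, so tau is 1 and the gene would count toward the exact-match tally.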

Supplementary Figure 7 Nucleotide analysis.

(a) Average fraction of indel mutant reads with insertion in target sites grouped by their nucleotide type at location −1 (adjacent to the cut site on the 5’ side). Presence of C or G at location −1 is significantly correlated with higher deletion proportion (p < 0.004, two-sided t-test) and presence of A or T is significantly correlated with higher insertion proportion (p < 10⁻⁶, two-sided t-test) consistently across all cell types. We show the results for T cells (left, 1,521 sites) and the aggregate results for HEK293, K562 and HCT116 (right, 96 sites). The results in this supplementary figure are normalized for the background distribution of nucleotide types: we divided by the number of occurrences of each nucleotide in computing the average fraction of indel mutant reads with insertion. As additional controls, we also performed the same analysis at the −2, +1 and +2 locations and did not find significant differences in insertion fraction by nucleotide type (p > 0.1, two-sided t-test). (b) Average fraction of indel mutant reads with insertion conditioned on the nucleotide at position +3 (the last nucleotide before, i.e. immediately 5’ of, the PAM sequence). The presence of A at +3 is correlated with a higher fraction of insertions. The analyses here differ from and complement Fig. 2e. The SPROUT importance scores in Fig. 2e capture the nonlinear model’s overall prediction of the impact of each nucleotide and position. The results here ignore the effects of other positions and plot the conditional insertion fractions. Even though the methods are different, both the feature importance scores and the conditional fractions give consistent biological findings. The mean and standard error of the mean (SEM) are shown in the tables.
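The conditional analysis in this caption (group sites by the nucleotide at one position, then report the mean and SEM of the insertion fraction per group) can be sketched as below. The record layout, the coordinate convention for the cut site and the example values are all assumptions made for illustration.

```python
import math
import statistics
from collections import defaultdict

def insertion_fraction_by_position(sites, position=-1):
    """Return {nucleotide: (mean, sem)} of insertion fractions.

    Each site is (sequence, cut_index, insertion_fraction). Assumed
    convention: cut_index is the 0-based index of the first base 3' of the
    cut, and positions run ..., -2, -1 | +1, +2, ... with no position 0.
    """
    if position == 0:
        raise ValueError("position 0 is undefined")
    groups = defaultdict(list)
    for sequence, cut_index, fraction in sites:
        index = cut_index + position if position < 0 else cut_index + position - 1
        if not 0 <= index < len(sequence):
            raise ValueError("position falls outside the sequence")
        groups[sequence[index]].append(fraction)
    summary = {}
    for base, values in groups.items():
        mean = statistics.mean(values)
        # SEM = sample standard deviation / sqrt(n); undefined for one site.
        if len(values) > 1:
            sem = statistics.stdev(values) / math.sqrt(len(values))
        else:
            sem = float("nan")
        summary[base] = (mean, sem)
    return summary

# Hypothetical sites; the cut lies between index 3 and index 4.
sites = [
    ("ACGTACGT", 4, 0.60), ("ACGTTCGT", 4, 0.50),  # T at position -1
    ("ACGAACGT", 4, 0.40), ("ACGAGCGT", 4, 0.50),  # A at position -1
    ("ACGCACGT", 4, 0.10), ("ACGCTCGT", 4, 0.20),  # C at position -1
]
by_base = insertion_fraction_by_position(sites, position=-1)
print(by_base)
```

Averaging within each nucleotide group is what normalizes for how often each nucleotide occurs at that position.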

Supplementary Figure 8 Homopolymer analysis.

(a) Average fraction of indel mutant reads with deletion in target sites grouped by their homopolymer types. HomP(A) corresponds to target sites that have at least two consecutive A nucleotides adjacent to the cut site, and similarly for HomP(C), HomP(G), and HomP(T). No HomP indicates the rest of the target sites without homopolymers. We show the results for T cells (top, 1,521 sites) and the aggregate results for HEK293, K562 and HCT116 (bottom, 96 sites). The mean and standard error of the mean are shown. (b) SPROUT was evaluated with MMEJ and non-MMEJ deletions, and the accuracies are reported here. On T cells we observe an improved prediction accuracy for the non-MMEJ model. We also see a better generalization performance of the MMEJ model compared to the non-MMEJ model in cancer cell lines. Bootstrap mean and standard deviation are shown in the table.

Supplementary Figure 9 Methods comparison.

(a) Comparison of the performance of SPROUT, inDelphi and FORECasT in predicting frameshift, precision (defined as one minus the entropy of the deletion frequency), and fraction of indel mutant reads with insertion on four independent validation sets. The results for frameshift and precision are reported in terms of accuracy and the results for fraction of insertions are reported in terms of R². (b) Scatter plot of the experimentally observed vs. predicted fraction of indel mutant reads with insertion in T cells. (c) Comparison of the performance of SPROUT with inDelphi and FORECasT in ranking guides within randomly selected genes in the T cell data by the fraction of frameshift outcomes and 1-bp insertions. The original T cell set has 304 sites, New T cells I has 32 sites, New T cells II has 182 sites and iPSC has 30 sites. Bootstrap mean and standard deviation are shown in each table.

Supplementary Figure 10 Insertion distribution.

(a) Histogram of the distances of long insertions from the target cut sites. (b) Distribution of the length of the aligned insertions in T cells. (c) Histogram of the distance of the insertion donor sites to the cut sites in intra-chromosomal long insertions. The x-axis shows distance in bases on a log10 scale.

Supplementary Figure 11 Long insertion analysis.

Genomic DNA sequences from sites in physical proximity can be inserted at SpCas9 cut sites. (a) All insertions longer than or equal to 25 bases were identified and plotted. (b) Number of cut sites with at least one (aligned) long insertion was plotted against the insertion length. (c) Average similarity of the insertion location within the same cut site and gene was compared across donors. This was also performed on a shuffled set of insertions as a control. Average genomic distance quantifies the distance between the donor sites of the long insertions that originate from the same chromosome. Bootstrap mean and standard deviation of the 1,521 sites are shown in the table. (d) Quantification of the HiC contact data to the long insertions. Neighboring blocks as well as a randomly selected block were used as controls. The HiC block size was also varied. Bootstrap mean and standard error of the mean are shown in the table. (e) HiC chromosome contact maps were directly compared to the aligned long insertions. The P value was computed using a t-test.

Supplementary Figure 12 Overlap with chromatin states.

Overlap between the insertion donor sites and 15 core chromatin states. We measured the percentage of insertion donor sites that fall within each of the 15 chromatin states (“% donor sites”) across the 1,521 sites. The chromatin states were obtained for primary CD4+ T cells (E043) from the Human Epigenome Roadmap. Here we considered only the long insertions that are aligned to a different chromosome from the SpCas9 target site, to avoid potential confounding due to the target sites being in exons. For background control, we randomly shuffled each aligned insertion within a ±500-kb window centered at its original location (i.e. donor site), and report the percentage of the shuffled sites that overlap each chromatin state (“% shuffled sites”). Insertion donors are significantly enriched for chromatin states associated with enhancers and transcription (states 1–7, colored red) compared to control (P < 10⁻⁵, two-sided t-test). Altogether 35.8% of inter-chromosome donor sites come from one of states 1 to 7 compared to 29.6% of the control sites.
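The background control described in this caption can be sketched as a random shift of each donor interval within a ±500-kb window, followed by an overlap count against chromatin-state intervals. This is a simplified illustration under stated assumptions: intervals are half-open and on a single chromosome, chromosome ends are ignored apart from clamping at zero, and all names and coordinates are hypothetical.

```python
import random

def shuffle_within_window(start, end, half_window=500_000, rng=random):
    """Shift an interval by a uniform random offset in [-half_window, +half_window].

    The interval length is preserved. Assumption: only the lower chromosome
    bound (0) is enforced; the chromosome length is not modelled.
    """
    shift = rng.randint(-half_window, half_window)
    new_start = max(0, start + shift)
    return new_start, new_start + (end - start)

def percent_overlapping(intervals, state_intervals):
    """Percentage of intervals overlapping any state interval (half-open)."""
    if not intervals:
        raise ValueError("no intervals given")
    hits = sum(
        any(start < s_end and s_start < end for s_start, s_end in state_intervals)
        for start, end in intervals
    )
    return 100.0 * hits / len(intervals)

# Hypothetical donor sites and one chromatin-state segment.
donors = [(100, 150), (1_000, 1_050)]
enhancer_state = [(120, 200)]

rng = random.Random(0)  # seeded for reproducibility
shuffled = [shuffle_within_window(s, e, rng=rng) for s, e in donors]

print(percent_overlapping(donors, enhancer_state))
print(percent_overlapping(shuffled, enhancer_state))
```

Repeating the shuffle many times would give the distribution of "% shuffled sites" against which the observed "% donor sites" is compared.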

Supplementary Figure 13 Insertion analysis in cell lines.

(a) The average similarity and distance of the chromosomal positions between long insertions. We measure the similarity and distance for long insertions across biological samples at the same cut site (“Within cut site”), across different cut sites within the same gene (“Within gene”), and across random pairs of cut sites (“Shuffled control”). We report the results for each of three previously published datasets¹, with 96 sites each. Bootstrap mean and standard deviation are shown. (b) The HiC contact map at the insertion site locations compared to three control cases in two other cell types (HEK293 and K562) across different HiC block sizes for insertions larger than 25 nucleotides¹. The first two controls average the HiC contact map in the neighboring blocks of the insertion donor and cut site. The third control averages the HiC score among random blocks in the same cut site-donor site chromosome pairs. Bootstrap mean and standard error of the mean are shown.

Supplementary Figure 14 Prediction features.

The top table lists all of the features used by SPROUT to predict frameshift, insertion, deletion and repair entropy. The gRNA and PAM were one-hot encoded as features. We also explored including additional chromatin features (second table). These features did not significantly change SPROUT’s performance for these tasks and are not included in the final model. We have additionally explored using longer flanking sequence (up to 50 bp) and DNA melting temperature as features. They did not improve SPROUT’s performance and are not used in the model.
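One-hot encoding of a gRNA and PAM, as mentioned in this caption, can be sketched as follows. The nucleotide order (A, C, G, T), the concatenation of a 20-nt spacer with a 3-nt PAM and the example sequence are assumptions for illustration; the exact feature layout used by SPROUT is given in the figure's table and is not reproduced here.

```python
NUCLEOTIDES = "ACGT"  # assumed channel order

def one_hot(sequence):
    """Flatten a DNA sequence into a 0/1 vector with 4 entries per base."""
    vector = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        vector.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vector

# Hypothetical 20-nt spacer followed by an NGG PAM.
spacer = "GACGTTACCGGATCAAGCTT"
pam = "TGG"
features = one_hot(spacer + pam)

print(one_hot("AG"))
print(len(features))
```

Under these assumptions a 23-nt target yields 92 binary features, with exactly one entry set per position.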

Supplementary Figure 15 Prediction dependence on sample size.

Performance of SPROUT as the training size increases, measured by R². Prediction saturates at 5-fold cross-validation. The leave-one-out validation performance is 0.60 ± 0.01. Error bars show the standard deviation across the folds and the curve indicates the mean values.

Supplementary information

Supplementary Information

Supplementary Figs. 1–15 and Supplementary Note

Reporting Summary

Supplementary Code

About this article

Cite this article

Leenay, R.T., Aghazadeh, A., Hiatt, J. et al. Large dataset enables prediction of repair after CRISPR–Cas9 editing in primary T cells. Nat Biotechnol 37, 1034–1037 (2019). https://doi.org/10.1038/s41587-019-0203-2

Received: 19 October 2018
Accepted: 27 June 2019
Published: 29 July 2019
Issue Date: September 2019
DOI: https://doi.org/10.1038/s41587-019-0203-2
