Date: 2023-02-16

Mammalian cell culture

The human HEK293T cell line was purchased from AMS Biotechnology (EP-CL-0005). The HAP1 WT cell line was provided by Andrew Waters (Wellcome Sanger Institute) and the HAP1 ∆MLH1 cell line was purchased from Horizon Discovery (HZGHC000343c022). HEK293T cells were cultured in DMEM (Invitrogen) and HAP1 cells in IMDM (Invitrogen), both supplemented with 10% FCS (Invitrogen), 2 mM glutamine (Invitrogen), 100 U ml−1 penicillin and 100 mg ml−1 streptomycin (Invitrogen) at 37 °C and 5% CO2.

Primers

All primers used in this study are listed in Supplementary Table 3.

Plasmid cloning

Plasmids generated in this study are listed in Supplementary Table 4.

pCMV-PE2-P2A-PuroR was generated by replacing eGFP from pCMV-PE2-P2A-GFP (Addgene 132776) with PuroR. A gene fragment containing parts of the MMLV reverse transcriptase and the puromycin resistance gene was ordered from IDT (Supplementary Table 5). The gene fragment and pCMV-PE2-P2A-GFP were digested using AgeI, purified with the Monarch PCR & DNA Cleanup Kit (NEB) and ligated with T4 DNA ligase (NEB). The ligation product was transformed into XL10-Gold Ultracompetent Cells (Agilent). Plasmid DNA was isolated using the Plasmid Plus Midi Kit (Qiagen).

pCMV-PE-FeLV-P2A-EGFP was generated by replacing the MMLV coding sequence between the XTEN linker and the 2A cleavage peptide with a synthesized gene fragment from IDT using Gibson Assembly which encodes an IDT human codon-optimized version of the MashUp reverse transcriptase (pipettejockey.com) that is engineered from the Feline Leukemia Virus (UniProt Q85521).

pLentiGuide-BlastR was generated by replacing the puromycin resistance gene from Lenti_gRNA-Puro (Addgene 84752) with a blasticidin resistance gene. A gene fragment containing parts of the EF1a promoter and the blasticidin resistance gene was ordered from Twist Biosciences (Supplementary Table 5). The gene fragment and Lenti_gRNA-Puro were digested using FseI (NEB) and MluI-HF (NEB), purified with the Monarch PCR & DNA Cleanup Kit (NEB), ligated with T4 DNA ligase (NEB) and transformed into XL10-Gold Ultracompetent Cells (Agilent). Plasmid DNA was isolated using the Qiagen Spin Miniprep Kit.

pPB-TREG3G-PE2-rtTA3G-P2A-eGFP was generated by fusing three gene fragments with restriction cloning. The first part contains the ITR sequences for the PiggyBac transposase, the second part contains prime editor 2 under the control of the third-generation doxycycline-inducible rtTA3G promoter and the third part was synthesized by Twist Biosciences and contains a PGK promoter followed by the rtTA3G protein, a P2A sequence and eGFP.

pTwist_FEN1-T2A-tagBFP, TREX1-T2A-mScarlet, TREX2-T2A-emiRFP670 and Acceptor-T2A-eGFP were ordered from Twist Biosciences in a pTwist EF1 Alpha cloning vector. The protein sequences encoded by the primary transcripts of FEN1, TREX1 and TREX2 were identified on ensembl.org (July 2022), fused with the T2A sequence and the respective fluorophores, and reverse translated into codon-optimized nucleotide sequences (Twist Biosciences).

The pCMV-PE2-P2A-PuroR, pLentiGuide-BlastR and pPB-TREG3G-PE2-rtTA3G-P2A-eGFP plasmids will be made available on Addgene.

Generating HAP1 cell lines that stably express prime editor

HAP1 cell lines expressing prime editors were generated by cotransfecting pCMV-hyPBase (ref. 55) and pPB-TREG3G-PE2-rtTA3G-P2A-eGFP. First, 500,000 HAP1 WT and 500,000 HAP1 ∆MLH1 cells were each seeded into one well of a six-well plate 1 d before transfection. For each transfection, 3 µg of each plasmid was mixed with 6 µl of Plus reagent and 7.5 µl of Lipofectamine LTX (Invitrogen) reagent, incubated for 30 min and then added to the cells. At two weeks post transfection, cells were sorted into single clones based on eGFP expression. Two different individual clones were used for each screen.

Library design

Set 1: The insert sequence libraries contained 2,666 unique sequences, made up of useful molecular biology sequences, the eukaryotic motif library (eukaryotic linear motif, ELM) and sequences with strong secondary structure. We designed four separate versions of this library with identical insert sequences to target the CLYBL, EMX1, FANCF and HEK3 sites. The pegRNAs contained a 13-nt PBS and a 34-nt homology arm on the reverse transcriptase template. The utility sequences were hand-picked for their usefulness in molecular biology. The ELM instances library with the corresponding fasta file of the genes was downloaded from elm.eu.org/instances.html?q=* (refs. 26,27) on 19 November 2020 and filtered to only contain sequences from ‘homo sapiens’ that are longer than one amino acid. The amino acid motifs were extracted from the fasta file based on the indicated start and end sites. Finally, the amino acid motifs were reverse translated into DNA sequence using the ‘reversetranslate’ R package (v.1.0.0) and using the most frequent codon from the ‘homo sapiens’ codon table. For the secondary structure library, 100,000 random DNA sequences of 20- and 30-nt length were generated (RBioinf::randDNA function; v.1.48.0) and their secondary structure was calculated (see the Data analysis and feature generation section). The sequences were distributed into ten bins based on the strength of their secondary structure and 20 sequences were randomly picked from each structure bin to be included in the library. Finally, 30 random perfect 20- and 30-nt RNA hairpins were generated and appended to the secondary structure library. The combined library of insert sequences is included as Supplementary Data 1. The insert sequences were then flanked with primer binding sites, random nucleotide stuffer sequences for shorter inserts, BsmBI sites and target vector compatible overhangs, resulting in 11,166 sequences of 199 nt. The oligonucleotide library was ordered from Twist Biosciences.
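The motif extraction and reverse translation steps can be sketched in Python as follows. This is an illustration only: the original analysis used the ‘reversetranslate’ R package, the codon table below is a partial, hand-written stand-in for a full human codon usage table, and the 1-based inclusive coordinate convention and both function names are assumptions.

```python
# Partial, illustrative codon table (assumption): one fixed codon per amino acid.
# A real analysis would load the complete human codon usage table instead.
MOST_FREQUENT_CODON = {
    "K": "AAG", "D": "GAC", "E": "GAG", "L": "CTG",
    "H": "CAC", "G": "GGC", "M": "ATG", "W": "TGG",
}


def extract_motif(protein: str, start: int, end: int) -> str:
    """Slice a motif using 1-based, inclusive start and end coordinates (assumed)."""
    return protein[start - 1:end]


def reverse_translate(motif: str) -> str:
    """Replace every amino acid by its single listed codon and concatenate."""
    try:
        return "".join(MOST_FREQUENT_CODON[aa] for aa in motif.upper())
    except KeyError as err:
        raise ValueError(f"no codon listed for amino acid {err.args[0]!r}") from err


motif = extract_motif("MAKDELQ", 3, 6)  # toy protein, not a real ELM instance
dna = reverse_translate(motif)
```

With this table the toy motif KDEL maps to AAGGACGAGCTG, the endoplasmic reticulum retention signal sequence quoted in the padding experiments.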

Set 2: This set of insert sequences was focused on short sequences between 1 and 10 nt. It included all 1-, 2-, 3- and 4-nt sequences, 100 random sequences for each length from 5 to 10 nt (RBioinf::randDNA function; v.1.48.0) and 61 sequences <10 nt from Set 1, for a total of 999 unique inserts (938 were recovered in screens). The libraries were endowed with target-site-specific adapter sequences and ordered the same way as Set 1.
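The exhaustive and random parts of Set 2 can be sketched as below. The random sequences in the study were drawn with RBioinf::randDNA in R; the Python generator, the seed and the function names here are stand-ins chosen for illustration.

```python
import random
from itertools import product


def all_kmers(k: int) -> list[str]:
    """Every DNA sequence of length k, in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def random_dna(length: int, rng: random.Random) -> str:
    """One uniformly random DNA sequence (stand-in for RBioinf::randDNA)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


# All 1-, 2-, 3- and 4-nt inserts: 4 + 16 + 64 + 256 sequences.
exhaustive = [seq for k in range(1, 5) for seq in all_kmers(k)]

# 100 random inserts for each length from 5 to 10 nt (arbitrary seed).
rng = random.Random(0)
random_inserts = {n: [random_dna(n, rng) for _ in range(100)] for n in range(5, 11)}
```

Random draws of the shorter lengths can collide, so a real library build would deduplicate before counting unique inserts.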

Eighteen-nt insert sequence libraries: This set of sequences consisted of six sublibraries that were designed to target the HEK3 site and five additional nearby sites (within 1 kb), dubbed HEK3-2, HEK3-3, HEK3-4, HEK3-5 and HEK3-6. The sublibraries shared 100 identical, randomly generated (RBioinf::randDNA function; v.1.48.0) 18-nt insert sequences and 256–288 sublibrary-specific 18-nt insert sequences that were picked based on their ability to form secondary structure in the reverse transcriptase template. In contrast to Set 1 and Set 2, we ordered oligos for this set of sequences that already included the spacer (20 nt), improved scaffold (86 nt, sequence: gtttaagagctatgctggaaacagcatagcaagtttaaataaggctagtccgttatcaacttgaaaaagtggcaccgagtcggtgc), PBS (13 nt), insert (18 nt) and homology arm (HA) (15 nt). The oligos were endowed with BsmBI sites, overhangs for cloning and primer binding sites for amplification of the oligo pool. The oligonucleotide library was ordered from Twist Biosciences.

Codon variation library: six protein tags, His-6 (HHHHHH), Flag (DYKDDDDK), a glycine-rich linker (GSSGGSSG), the HiBiT tag (VSGWRLFKKIS; ref. 43), mNeongreen-11 (TELNFKEWQKAFTDMM; ref. 28), mNeongreen with a linker (GSSGTELNFKEWQKAFTDMM) and a drug-inducible superdegron (LQCEICGFTCRQKGNLLRHIKLH; ref. 44), were used to tag the ACTB, LMNB1, NOLC1, RNF2 and TP53 genes, and to insert into the HEK3 site. We chose ACTB, LMNB1, NOLC1 and RNF2 because they have been successfully edited in other publications (ref. 12) and TP53 for its relevance in health and disease. ACTB, LMNB1, NOLC1 and TP53 were tagged at their N termini; an in-frame, internal fusion was made for RNF2. For the ACTB, LMNB1 and TP53 targets, two independent pegRNAs were used that target both the forward and reverse strands (Supplementary Table 6). Because we decided to make in-frame fusions, the position of the insert sequence was shifted up to 6 nt downstream on the reverse transcriptase template relative to the nick. Together, this resulted in nine target sites.

For the His-6 tag and the glycine-rich linker, all possible codon combinations were generated in silico. For the remaining, longer tags, all possible codon variations were generated using only the top two most frequent human codons. MinsePIE was used to predict the insertion efficiencies for the generated codon variants and ten codon variants with both high and low predicted insertion rates were included in the final library. The codon-optimization webtool from Eurofins Genomics (https://eurofinsgenomics.eu/en/gene-synthesis-molecular-biology/geneius/sequence-optimisation/) was used to design an additional version of each tag. This resulted in 594 sequences in total (Supplementary Data 1). The oligos for this set of sequences contained spacer (20 nt), improved scaffold (86 nt, gtttaagagctaagctggaaacagcatagcaagtttaaataaggctagtccgttatcaactcgaaagagtggcaccgagtcggtgc; ref. 56), PBS (13 nt), insert and HA (34 nt). The oligos were endowed with BsmBI sites, overhangs for cloning and primer binding sites for amplification of the oligo pool, and were ordered from Twist Biosciences.
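As an illustration of the in silico codon enumeration, the sketch below lists every synonymous encoding of a tag from the standard genetic code. Only the three amino acids needed for the His-6 tag and the glycine-rich linker are included, and the function names are hypothetical rather than taken from the study's code.

```python
from itertools import product
from math import prod

# Synonymous codons from the standard genetic code (subset needed here).
CODONS = {
    "H": ["CAC", "CAT"],
    "G": ["GGA", "GGC", "GGG", "GGT"],
    "S": ["AGC", "AGT", "TCA", "TCC", "TCG", "TCT"],
}


def count_variants(tag: str, codons=CODONS) -> int:
    """Number of distinct DNA encodings of a tag, without enumerating them."""
    return prod(len(codons[aa]) for aa in tag)


def codon_variants(tag: str, codons=CODONS):
    """Lazily yield every DNA sequence that encodes the tag."""
    for combo in product(*(codons[aa] for aa in tag)):
        yield "".join(combo)


his6_variants = list(codon_variants("HHHHHH"))
```

The His-6 tag has 2^6 = 64 encodings and the GSSGGSSG linker 4^4 × 6^4 = 331,776. Trimming each codon list to two entries, as described for the longer tags, caps a tag of n residues at 2^n variants.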

Library cloning

Set 1 and Set 2: First, a separate, site-specific backbone was cloned for each target site. A gene fragment was ordered containing the protospacer, guide RNA scaffold, parts of the reverse transcriptase template and primer binding site, a stuffer sequence flanked with BsmBI sites for insert library insertion and the T7 terminator motif (Supplementary Table 5). Then, 100 ng of the gene fragments was digested with BsaI-HFv2 (NEB) and purified with the Monarch PCR & DNA Cleanup Kit (NEB). The pLentiGuide-BlastR plasmid was digested with BsmBI-v2 (NEB) at 55 °C for 8 h followed by 20 min of heat inactivation at 80 °C, and gel purified using the QIAEX II Gel Extraction Kit (Qiagen). The gene fragments were ligated into the backbone using T4 DNA ligase (NEB) and transformed into XL10-Gold Ultracompetent bacteria (Agilent). The plasmids were purified with the Qiagen Spin Miniprep Kit.

Second, pegRNA insert libraries were inserted into the site-specific backbones. The insert libraries were synthesized as oligonucleotide pools and amplified using KAPA HiFi HotStart ReadyMix (Roche). Libraries for individual target sites were amplified with separate primers (Supplementary Table 3). The products were purified using the Monarch PCR & DNA Cleanup Kit, digested with BsmBI-v2 at 55 °C for 4 h and heat-inactivated at 80 °C for 20 min alongside 5 μg of site-specific plasmids. The digested oligos were purified using the Monarch PCR & DNA Cleanup Kit. The vectors were treated with Quick CIP (NEB) for 15 min at 37 °C and then purified using the QIAquick PCR Purification Kit (Qiagen). Inserts were ligated into vectors using Golden Gate assembly. A 1:3 molar ratio of insert and vector was mixed with BsmBI-v2 and T4 DNA ligase and incubated in a thermocycler for 30 cycles, alternating between 5 min at 42 °C and 5 min at 16 °C and finishing with a heat inactivation step at 60 °C for 5 min. The ligation products were purified with the Monarch PCR & DNA Cleanup Kit and electroporated into MegaX DH10B T1R Electrocomp Cells (Thermo Fisher). The bacteria were grown overnight in liquid culture and plasmid was extracted using the Plasmid Plus Midi Kit. The pegRNA sequences are shown in Supplementary Table 6.

epegRNA libraries were cloned by first generating a HEK3 site-specific epegRNA backbone with a stuffer sequence for the insert libraries (as above). The tevopreQ1 sequence was added by PCR (using P42, P43; Supplementary Table 3) to the fragment containing the protospacer, guide RNA scaffold, parts of the reverse transcriptase template and primer binding site, a stuffer sequence flanked with BsmBI sites for insert library insertion and the T7 terminator motif. Next, the 379 sequences with strong structure were amplified from the Set 1 oligo pool by PCR and cloned into the epegRNA HEK3 backbone as described above.

Eighteen-nt inserts and codon variation libraries: pLentiGuide-BlastR plasmid was digested with BsmBI-v2 (NEB) at 55 °C for 8 h followed by 20 min of heat inactivation at 80 °C and gel purification of the vector using the QIAEX II Gel Extraction Kit (Qiagen). Amplification, purification, digestion and repurification were performed as described above. The oligo sequences were ligated into pLentiGuide-BlastR using Golden Gate assembly, the ligation product was purified and transformed into bacteria, and the plasmid was extracted after an overnight culture as above.

Lentivirus production

Lentivirus was produced in HEK293FT cells that were transfected with Lipofectamine LTX (Invitrogen). First, 5.4 μg of a lentiviral vector, 5.4 μg of psPax2 (Addgene 12260) and 1.2 μg of pMD2.G (Addgene 12259) were mixed in 3 ml of Opti-MEM together with 12 μl of PLUS reagent and incubated for 5 min at room temperature. Next, 36 μl of the LTX reagent was added and the mix was incubated for another 30 min at room temperature. Then, 3 ml of the transfection mix was added to 80% confluent cells in 10 ml of DMEM medium in a 10-cm dish. After 48 h the supernatant was collected and stored at 4 °C. Fresh medium was added to the cells and collected 24 h later. The two collections were kept separate. For virus titration, Lenti-X GoStix Plus (Takara) was used following the manufacturer’s protocol.

pegRNA insertion screens in HEK293T cells

Infection with pegRNA library: Cells were infected with the pegRNA library (separate infections for each target site and library set), aiming at a multiplicity of infection of 0.5 and a guide coverage of >1,000×. Each screen was performed in three biological replicates and independently infected. To achieve this, 6 × 10⁶ cells were plated in three wells of a six-well plate and spin-infected for 15–30 min at 2,000 r.p.m. Following infection, cells were resuspended and replated at 2 × 10⁴ cells per cm². Cells were cultured for 7 d and selected for pegRNA integration with 10 µg ml−1 blasticidin.

Transfection with prime editors: HEK293T cells were seeded at a concentration of 6.9 × 10⁴ cells per cm² in a 15-cm dish. The next day, the medium was replaced with fresh medium and the cells were transfected using Lipofectamine LTX reagent. Then, 72 µg of PE-Puro or PE-FeLV plasmid was mixed with 8 µg of pCS2-GFP and 40 µl of Lipofectamine P3000 (Invitrogen) in 3.2 ml of Opti-MEM (Gibco). In another tube, 40 µl of Lipofectamine 3000 and 160 µl of Lipofectamine LTX were mixed in 3.2 ml of Opti-MEM. The solutions were combined, incubated for 30 min at room temperature and then added to the cells. For PE3, an additional 6 µg of nicking guide RNA was added. For screens with nuclease overexpression, an additional 30 µg of flap nuclease or eGFP plasmid in the pTwist vectors was added.

pegRNA insertion screens in HAP1 and HAP1 ∆MLH1 cells

Infection with pegRNA library: The pegRNA library viruses for all target sites and sets were individually quantified using the Lenti-X GoStix Plus (Takara) kit and then combined into one virus pool. The HAP1 and HAP1 ∆MLH1 cells with PiggyBac-integrated PE2 were infected with the virus pool, aiming at a multiplicity of infection of 0.5 and a pegRNA coverage of >1,000×. Each screen was performed in two biological replicates with separate PiggyBac prime editor clones and independently infected. To achieve this, 6 × 10⁶ cells were plated in three wells of a six-well plate and spin-infected for 15–30 min at 2,000 r.p.m. Following infection, cells were resuspended and replated at 2 × 10⁴ cells per cm². Cells were cultured for 7 d and selected for pegRNA integration with 10 µg ml−1 blasticidin.

For each replicate, 30 million cells were seeded into five-layer flasks and induced with 1 µM doxycycline. The cells were split once at day 4 and the doxycycline was refreshed. Finally, cells were collected on day 7 post induction.

DNA extraction and library preparation for next-generation sequencing

Genomic DNA extraction and sequencing library preparation for screens were done as described by Allen et al. (ref. 10). Briefly, cell pellets were resuspended in TAIL BUFFER A (100 mM Tris-HCl, 5 mM EDTA, 200 mM NaCl) and then mixed with 1 volume of TAIL BUFFER B (100 mM Tris-HCl, 5 mM EDTA, 200 mM NaCl, 0.4% SDS) supplemented with freshly thawed Proteinase K (20 mg ml−1 final). The lysate was incubated overnight at 56 °C. On the next day, RNase A was added to a final concentration of 10 µg ml−1 and incubated at 37 °C for 30 min to 4 h. Then, 1 volume of isopropanol was added and the DNA spooled on a sterile inoculation loop. The DNA was washed three times by dipping it into consecutive 5-ml tubes containing 70% ethanol. The DNA was air-dried for 5–10 min and resuspended in TE buffer (pH 8.0).

For each screen, two independent amplicons were generated by PCR using Q5 HotStart High-Fidelity 2X Master Mix (NEB). One amplicon was for the targeted locus and one amplicon for the pegRNA locus (primers in Supplementary Table 3). To maintain high coverage for each sample, 40 μg of genomic DNA was used as the template and each PCR reaction was run in 50-μl aliquots containing no more than 5 μg of genomic DNA. The PCR reactions were column-purified using the QIAquick PCR Purification Kit (Qiagen). Sequencing adapters and barcodes were added with a second round of PCR using the KAPA HiFi HotStart ReadyMix (Roche), primers P3 and P4 (Supplementary Table 3) and 1 ng of template DNA. Amplicons were purified with Agencourt AMPure XP beads in a 0.7:1 ratio (beads to PCR reaction volume) and quantified with the Quant-iT High-Sensitivity dsDNA Assay Kit (Invitrogen). The amplicons were pooled together and sequenced on the Illumina HiSeq 2500 using HiSeq Rapid SBS Kit v2 (500 cycles, 250 paired-end).

Reverse transcription of pegRNA libraries

Frozen cell pellets containing 4.5–6.1 million cells from screens targeting the HEK3 site in HEK293T cells were washed with 500 µl of PBS and the RNA was extracted using the mirVana miRNA Isolation Kit (Invitrogen). Then, 8.4–16.6 µg of template RNA split across eight reactions was used for genomic DNA digestion and complementary DNA synthesis with the SuperScript IV VILO Master Mix with ezDNase (Invitrogen). For cDNA synthesis, a primer was used that was reverse complementary to the 13-nt PBS with extra nucleotides on the 5′ end (the first 9 nt of the sequence shown) to provide additional base pairing for PCR amplification (ATCGAGTTTCAGACTGAGCACG; Supplementary Table 3). pegRNAs were amplified from the cDNA mixture by 27 cycles of PCR using KAPA HiFi HotStart ReadyMix (Roche) and primers P39 and P40 (Supplementary Table 3). Library preparation and sequencing were performed as described in the DNA extraction and library preparation for next-generation sequencing section.

Generating read count tables

Paired forward and reverse reads from Illumina sequencing were merged using PEAR v.0.9.11. Data for the same screen but different sequencing lanes were concatenated. The resulting merged fastq files were processed using a custom R script (read_match_pegRNAs.R, GitHub; ref. 45). First, DNA sequences were trimmed to contain the 10 nt up- and downstream of the nick site (for target site amplicon) or to contain 15 nt up- and downstream of the nick site (pegRNA amplicon). On average, 98% of reads were matched for the target site amplicon and 84% for the pegRNA amplicon. The trimmed sequences were then matched to each insert in the pegRNA library flanked by 10 nt of target site sequence (for target site amplicon) or flanked by 15 nt of pegRNA plasmid sequence (pegRNA amplicon), requiring 0 mismatches. Adding the flanking sequences ensures that only insertions at the correct locations are considered. On average, 92% of reads were matched to the unedited locus or an insertion for both the target site amplicon and the pegRNA amplicon.
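The trimming and exact-matching logic can be sketched as follows. This is a Python illustration of the idea, not the read_match_pegRNAs.R script itself; the function names, the toy flanks and the ‘unedited’ and ‘unmatched’ labels are assumptions.

```python
from collections import Counter


def build_lookup(inserts: dict, left_flank: str, right_flank: str) -> dict:
    """Map flank + insert + flank to the insert name; bare flanks mean unedited."""
    lookup = {left_flank + right_flank: "unedited"}
    for name, seq in inserts.items():
        lookup[left_flank + seq + right_flank] = name
    return lookup


def trim(read: str, left_flank: str, right_flank: str):
    """Keep the span from the left flank to the right flank, or None if absent."""
    start = read.find(left_flank)
    end = read.rfind(right_flank)
    if start == -1 or end == -1 or end < start + len(left_flank):
        return None
    return read[start:end + len(right_flank)]


def count_reads(reads, lookup, left_flank, right_flank) -> Counter:
    """Count exact (0-mismatch) matches of trimmed reads against the lookup."""
    counts = Counter()
    for read in reads:
        trimmed = trim(read, left_flank, right_flank)
        counts[lookup.get(trimmed, "unmatched") if trimmed else "unmatched"] += 1
    return counts


# Toy 10-nt flanks and a single toy insert, for illustration only.
left, right = "AAAAACCCCC", "GGGGGTTTTT"
lookup = build_lookup({"ins1": "ACGT"}, left, right)
reads = ["TT" + left + "ACGT" + right + "CC", left + right, left + "AC" + right, "ACGTACGT"]
counts = count_reads(reads, lookup, left, right)
```

Because the lookup keys include the flanks, an insert sequence that appears elsewhere in a read is not counted, which is the point made in the text about restricting matches to the correct location.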

Combining replicates

pegRNAs where any replicate had fewer than 20 reads in the pegRNA amplicon mapping to it were filtered out. Insert counts were normalized to frequencies by dividing the reads for each insert by the number of reads in each screen. Insertion efficiencies were calculated for each replicate and screen by dividing the target insert frequency by the pegRNA insert frequency. (Note: calculating insertion frequencies this way likely underestimates them, as it does not take cells that were not infected with the library into account. In addition, an average of 16% of reads in the pegRNA amplicons did not match to any sequence in the library.) Finally, insertion efficiencies were averaged across replicates. The script used to combine replicates is available on GitHub (ref. 45) as ‘combine_replicates.R’. The processed read count tables are shown in Supplementary Data 2.
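The filtering, normalization and averaging steps can be written out as in the sketch below. It is not the combine_replicates.R script; the dictionary layout, key names and toy numbers are assumptions made for illustration.

```python
from statistics import mean


def insertion_efficiency(target_counts, pegrna_counts, target_total, pegrna_total):
    """Target-site insert frequency divided by pegRNA frequency, per insert."""
    efficiency = {}
    for insert, pegrna_reads in pegrna_counts.items():
        target_freq = target_counts.get(insert, 0) / target_total
        pegrna_freq = pegrna_reads / pegrna_total
        efficiency[insert] = target_freq / pegrna_freq
    return efficiency


def combine_replicates(replicates, min_reads=20):
    """Drop inserts with < min_reads pegRNA reads in any replicate, then average."""
    shared = set.intersection(*(set(r["pegrna_counts"]) for r in replicates))
    kept = [i for i in shared
            if all(r["pegrna_counts"][i] >= min_reads for r in replicates)]
    per_replicate = [
        insertion_efficiency(r["target_counts"],
                             {i: r["pegrna_counts"][i] for i in kept},
                             r["target_total"], r["pegrna_total"])
        for r in replicates
    ]
    return {i: mean(e[i] for e in per_replicate) for i in kept}


# Toy counts: insert "b" has only 10 pegRNA reads in replicate 1 and is dropped.
replicates = [
    {"target_counts": {"a": 10, "b": 5}, "pegrna_counts": {"a": 100, "b": 10},
     "target_total": 1000, "pegrna_total": 1000},
    {"target_counts": {"a": 30}, "pegrna_counts": {"a": 100, "b": 50},
     "target_total": 1000, "pegrna_total": 1000},
]
combined = combine_replicates(replicates)
```

In the toy data insert "a" has efficiencies of 0.1 and 0.3 in the two replicates, giving a combined value of 0.2.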

Mutation rates around the insertion site and indel detection

The fastq reads of the target sites were trimmed by matching a stretch of 10 nt directly upstream of the PBS and 60 nt downstream of the insertion site (CLYBL: CTGAATGGTG, CAGAGTTCCA; EMX1: GGGCCTGAGT, ATGGGGAGGA; FANCF: CCTCATGGAA, AGCACCTGGG; HEK3: CCTTGGGGCC, AGCTTTTCCT). The occurrence of library insertions was detected by pattern matching the trimmed reads for library sequences. Indel detection: The trimmed reads were filtered in a series of steps. First, sequences with insertions at the nick site that perfectly match a sequence in the insert libraries were removed (this also means that our method cannot detect single/double/triple-nucleotide insertions at the nick site because our library contains all possible singlets/doublets/triplets). Second, sequences that contained ‘N’ were removed. Third, sequences with a perfectly preserved sequence around the cut site were removed. Fourth, sequences that were 83 nt long were removed (83 nt corresponds to the length of a sequence without indels). The remaining sequences were annotated according to the indel type. Scaffold integrations were sequences that contained five or more nucleotides of the scaffold (GCACC) directly downstream of the reverse transcriptase template. Mutated insertions were sequences that matched any sequence >10 nt in the library with no more than three mismatches (fuzzyjoin R package v.0.1.6, optimal string alignment method). Duplications were sequences that contained two or more copies of the homology arm sequence. Deletions at the target sites were deletions that overlapped up to 10 nt up- and/or downstream with the nick site. Other deletions were deletions that did not overlap with the nick site and all remaining sequences are classified as ‘other’. The scripts used to call mutation rates and indels are available on GitHub (ref. 45) as ‘find_mutations.R’.

SNV detection: Going from the outside to the inside of the trimmed sequence (with the nicking site being between the two innermost nucleotides), the occurrence of the four nucleotides was counted at every position. Nonreference nucleotides were classified as mutations with the exception of a nonreference SNP (A) in HEK293T cells for one of three alleles at position +9. The reverse transcriptase template on the pegRNA corresponds to the sequence of the major allele (G).

Data analysis and feature generation

Merging data from Set 1 and Set 2: For each target site and cell line, the insertion rates in Set 2 were multiplied by the ratio of the mean insertion rate of the shared sequences in Set 1 and the mean insertion rate in Set 2. For the 140 shared insert sequences, the mean insertion rate between both sets was calculated. Length-normalized insertion rates: Length residuals were calculated by dividing the insertion rate by the median insertion rate for sequences of the same length (for sequences <10 nt) or by dividing sequences into length bins. The length bins consisted of sequences of 10–14, 15–19, 20–24, 25–29, 30–39, 40–49, 50–59 and 60–69 nt (sequences with lengths above 30 nt were divided into length bins of 10 nt because there were fewer longer sequences in the library). The melting temperature for the insert sequence was calculated using SeqUtils.MeltingTemp.Tm_NN from biopython. The RNAfold (v.2.4.16) algorithm of the ViennaRNA (v.2.5.0a) package (refs. 57,58) was used to calculate the tendency of insert sequences (alone or in the context of PBS and/or HA) to form secondary structures. The free energy was normalized to the mean and standard deviation (z score) of 1,000 random sequences with the same length and in the same context.
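A minimal Python sketch of the length binning, the length normalization and the z score is given below. The function names are hypothetical, the use of the sample standard deviation is an assumption, and the free energies passed to z_score would in practice come from RNAfold rather than the toy numbers used here.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def length_bin(n: int) -> tuple:
    """Bin as described: exact length below 10 nt, 5-nt bins up to 29, then 10-nt bins."""
    if n < 10:
        return (n, n)
    if n < 30:
        low = n - n % 5
        return (low, low + 4)
    low = n - n % 10
    return (low, low + 9)


def length_normalize(rates: dict) -> dict:
    """Divide each insertion rate by the median rate of its length bin."""
    by_bin = defaultdict(list)
    for seq, rate in rates.items():
        by_bin[length_bin(len(seq))].append(rate)
    medians = {b: median(v) for b, v in by_bin.items()}
    return {seq: rate / medians[length_bin(len(seq))] for seq, rate in rates.items()}


def z_score(value: float, background: list) -> float:
    """Normalize a value to the mean and (sample) s.d. of a background set."""
    return (value - mean(background)) / stdev(background)


normalized = length_normalize({"AAA": 2.0, "CCC": 4.0, "GGG": 6.0})
```

Lengths of 70 nt or more fall outside the bins listed in the text; this sketch would simply continue with 10-nt bins for them.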

The 6-nt and 9-nt insertion data from Choi et al. (ref. 42) were filtered for sequences with more than 20 sequencing reads for each pegRNA replicate and more than 30 sequencing reads for the plasmid reads, followed by feature calculation as described above. The insertion and plasmid read frequencies were calculated as the fraction of insertion mapping reads in all reads, and the normalized insertion rate as the ratio of insertion read frequency to the plasmid read frequency normalized to the mean and standard deviation of each dataset (z score). The data from Kim et al. were filtered to contain target sites with all seven insertions and no other edits, followed by feature calculation as described above. Edit rates were normalized to the mean and standard deviation of editing rates at each target site.

Comparison of HAP1 and HAP1 ∆MLH1 lines

To account for screen batch effects for direct comparisons (Fig. 2f and Supplementary Fig. 2d), the mean insertion rates across wild-type and MLH1 knockout HAP1 cell lines were scaled to be identical for >13-nt sequences that are not affected by MMR. The fold changes of the scaled insertion efficiencies between HAP1 ∆MLH1 and HAP1 lines were then calculated for each sequence in the library.
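One way to express this scaling is sketched below. The text does not state which cell line is rescaled, so multiplying the ∆MLH1 rates by a single factor is an assumption, as are the function name and the toy values.

```python
from statistics import mean


def scaled_fold_changes(wt: dict, ko: dict, min_length: int = 14) -> dict:
    """Fold change (knockout / wild type) after matching means on long inserts.

    Inserts of at least min_length nt (that is, >13 nt) serve as the reference
    set, because they are described as unaffected by mismatch repair.
    """
    shared = [s for s in wt if s in ko]
    reference = [s for s in shared if len(s) >= min_length]
    factor = mean(wt[s] for s in reference) / mean(ko[s] for s in reference)
    return {s: (ko[s] * factor) / wt[s] for s in shared}


# Toy insertion rates keyed by insert sequence.
wt_rates = {"A" * 14: 2.0, "C" * 20: 4.0, "AC": 1.0}
ko_rates = {"A" * 14: 1.0, "C" * 20: 2.0, "AC": 2.0}
fold = scaled_fold_changes(wt_rates, ko_rates)
```

In the toy data the long inserts end up with a fold change of 1 after scaling, while the 2-nt insert shows a fourfold increase in the knockout line.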

Validation of nuclease overexpression with individual pegRNAs

We chose four different insertions (C, CAG, a BCL6 recognition sequence: TTCTAGGAA and a Myc-tag: GAGCAGAAGCTGATCAGCGAAGAGGACCTC) from our pooled library for validation and cloned them into HEK3 site-targeting pegRNAs endowed with 25- or 34-nt homology arms. At 1 d before transfection, HEK293T cells were seeded in two 24-well plates at 50,000 cells per well. All transfections were done in replicates and each well was transfected with 500 ng of pCMV_PE2_P2A_PuroR, 150 ng of pTwist nuclease or eGFP overexpression constructs, and 100 ng of pegRNA using Lipofectamine LTX according to the manufacturer’s protocol. Successful transfection was confirmed 1 d later by fluorescence microscopy and 2 µg ml−1 puromycin was added 1 d later. Cells were collected 5 d post transfection by direct lysis of cell pellets using home-made quick extract buffer (1 mM CaCl2, 3 mM MgCl2, 1 mM EDTA, 1% Triton X-100, 10 mM Tris pH 7.5) with freshly added proteinase K (0.2 mg ml−1) followed by 15 min of incubation at 65 °C and 20 min of incubation at 95 °C. Then, 1.5 µl of the lysate was directly added to 25 µl of amplicon PCRs. Sequencing adapters and barcodes were added by a second round of PCR and the purified products were sequenced on an Illumina MiSeq (300 cycles). Correctly edited reads were identified by pattern matching for the insert sequence flanked by 10 nt of the target site to each end. Unedited sequences were detected by matching the 20 nt of wild-type sequence around the nick site. The insertion rate was calculated by dividing the number of edited reads by the number of wild-type reads.

Modeling

Insertion efficiencies were normalized (z score) between screens and replicates by subtracting the corresponding mean insertion efficiency from each individual insertion efficiency and dividing it by the standard deviation of the insertion efficiency. Categorical features were one-hot encoded. Hyperparameters were tuned for each model by evaluating average model performance after fivefold cross-validation using each combination of hyperparameters, then choosing the parameter combination resulting in the best cross-validation performance. The Lasso and Ridge regressions were tested with alpha values of 0, 0.00001, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1. The Random Forest regressor was tested with n_estimators of 5, 10, 50, 100, 500 and 1000; max_depth of 2, 5, 7, 10 and None; and min_samples_leaf of 1, 5 and 10. The Multilayer perceptron regressor was tested with hidden_layer_sizes of (10), (100), (100, 10), (1000, 100) and (1000, 100, 10); and alpha of 0.01, 0.1, 0.5 and 1. The gradient boosted tree from XGBoost (ref. 36) was tested with n_trees of 1, 5, 10, 50, 100, 500 and 1000; max_depth of 1, 2, 3, 4, 5, 7 and 10; l1_penalty and l2_penalty of 0, 0.001, 0.01, 0.1, 0.5 and 1; colsample of 0.1, 0.3, 0.5, 0.7, 0.9 and 1; gamma of 0.001, 0.01, 0.1, 0.5 and 1; and learning_rate of 0.0001, 0.001, 0.01, 0.1, 0.3 and 0.5. The scikit-learn models were trained using parameters obtained from hyperparameter tuning: Lasso regression was performed with alpha = 0.1; Ridge regression was performed with alpha of 0.01; Random forest had no maximum depth, 1000 estimators and min_samples_leaf of 5; Multilayer perceptron regressor was trained with alpha = 1, 200 maximum iterations at a constant learning rate of 0.001, a hidden layer size of (1000, 100) and ‘lbfgs’ solver. Gradient boosted tree from XGBoost (ref. 59) was trained with a minimum loss reduction of 0.1, 100 trees, a learning rate of 0.1, maximum depth of 4, 0.00001 L1 regularization on weights, 0.1 L2 regularization on weights and a subsample ratio of one per column when constructing each tree.
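The tuning procedure amounts to an exhaustive grid search with fivefold cross-validation. The sketch below shows that loop in plain Python using the Random Forest grid from the text. The evaluate callable is a hypothetical placeholder for fitting a model on the training fold and scoring it on the validation fold, and the contiguous, unshuffled folds are an assumption.

```python
from itertools import product

# Random Forest grid as listed in the text.
RF_GRID = {
    "n_estimators": [5, 10, 50, 100, 500, 1000],
    "max_depth": [2, 5, 7, 10, None],
    "min_samples_leaf": [1, 5, 10],
}


def expand_grid(grid: dict) -> list:
    """Every combination of hyperparameter values, as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def kfold_indices(n_samples: int, k: int = 5):
    """Yield (train, validation) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if not start <= i < start + size]
        yield train, validation
        start += size


def tune(grid: dict, n_samples: int, evaluate, k: int = 5):
    """Return the parameters with the best mean validation score (higher is better)."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(grid):
        scores = [evaluate(params, train, val) for train, val in kfold_indices(n_samples, k)]
        average = sum(scores) / len(scores)
        if average > best_score:
            best_params, best_score = params, average
    return best_params, best_score
```

The Random Forest grid expands to 6 × 5 × 3 = 90 combinations, each evaluated on five folds. In practice the same search could be delegated to a library routine; the explicit loop is shown only to make the selection rule concrete.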

The final model was trained with XGBoost using the features length; normalized secondary structure of the reverse transcriptase template; MMR proficiency; percentage of the nucleotides C, A and T; the number of paired bases between the first 3 nt of the insert and the last 3 nt of the spacer in addition to the first nucleotide of the scaffold; complementarity between the first nucleotide of the insert and the nucleotide at the nicking site; the maximum number of consecutive adenines in the insert; and the intactness of loop1. Features in each set are summarized in Supplementary Tables 1 and 2.

For training, unique insert sequences were split randomly into training and test sequences at a ratio of 0.7 (Supplementary Fig. 10a). Measurements for different target sites and cell lines were assigned to training and test data based on the grouping of insert sequences. The model was trained and predictions were evaluated using Pearson’s R based on the correlation between test data and corresponding predictions. SHapley Additive exPlanations (SHAP) values for the model and feature importance for the prediction of specific outcomes were calculated using the SHAP TreeExplainer and explainerModel (ref. 60).
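Splitting by unique insert sequence keeps all measurements of one insert, across target sites and cell lines, on the same side of the split. A minimal sketch of such a grouped split and of Pearson’s R is shown below; the record layout, the seed and the function names are assumptions.

```python
import random
from math import sqrt


def split_by_insert(measurements: list, train_fraction: float = 0.7, seed: int = 0):
    """Assign whole insert sequences, not single measurements, to train or test."""
    inserts = sorted({m["insert"] for m in measurements})
    random.Random(seed).shuffle(inserts)
    n_train = round(train_fraction * len(inserts))
    train_inserts = set(inserts[:n_train])
    train = [m for m in measurements if m["insert"] in train_inserts]
    test = [m for m in measurements if m["insert"] not in train_inserts]
    return train, test


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy data: ten inserts, each measured at two target sites.
toy = [{"insert": f"ins{i}", "site": site, "rate": float(i)}
       for i in range(10) for site in ("HEK3", "EMX1")]
train, test = split_by_insert(toy)
```

With ten toy inserts, seven go to training and three to testing, so no insert contributes measurements to both sets.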

Statistics and reproducibility

The n numbers denoted in the figure legends refer to independent experiments that were separately infected with the pegRNA library. Measurements were always taken from distinct samples. No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment. Wherever correlations were indicated, Pearson’s R was used. The t-tests (Supplementary Fig. 5a,b) were performed as two-sided tests. Normal distribution of the underlying data was assumed and no adjustments for multiple comparisons were made.

MinsePIE website

The MinsePIE website uses the MinsePIE package available at https://github.com/julianeweller/MinsePIE to serve as a user-friendly and interactive way to predict insertion efficiency (Supplementary Fig. 10b). There are three main modes: standard mode highlights all relevant sequence features, manual mode allows more advanced usage in which the user can adjust relevant parameters (for example, mean and s.d. of editing rate) and batch mode allows a set of sequences to be uploaded for analysis. A table highlighting insert sequences, respective z scores and insertion prediction scores is given in each usage mode. For ease of analysis, color codes are used in the table and the following distribution graph to highlight the sequences with the highest insertion efficiency scores. The MinsePIE web application makes use of Vue.js (v.2.6.11), D3.js (v.3.5.17) and agGrid (v.24.1.1) libraries and the Flask framework (v.2.0.2). Genomic data are retrieved via https://api.genome.ucsc.edu.

Padding of shorter insert sequences

Three sequences between 12 and 13 nt (an endoplasmic reticulum retention signal, AAGGACGAGCTG; a BRE-TATA element, CCACGCCTATAAA; and a consensus splice motif, TTTTTTTCAGGTT) were chosen for padding. The sequences were padded to 18 nt with all possible nucleotide combinations. MinsePIE was used to predict the insertion rates for these variants at the HEK3 site. The sequences with highest predicted efficiencies were picked for testing: CAAGGACGAGCTGTCCAC, CCCACGCCTATAAAGGCC and GCTTTTTTTCAGGTTCTC. The padded and original inserts were endowed with a 13-nt PBS and 34-nt reverse transcriptase template and cloned into the pU6-pegRNA-GG-acceptor (Addgene no. 132777) as described previously (ref. 12). Editing efficiencies were assessed by transient transfection in an arrayed format. For this, 10,000 HEK293T cells were seeded into a 96-well plate in triplicate. On the following day, 50 ng of pegRNA plasmids and 200 ng of pCMV-PE2-PuroR were transfected using 0.3 µl of Lipofectamine LTX (Thermo Fisher Scientific) and 0.1 µl of Plus reagent per well according to the manufacturer’s instructions. After 1 d, 2 µg ml−1 puromycin was added. Cells were collected 4 d post transfection by direct lysis of cell pellets using home-made quick extract buffer (1 mM CaCl2, 3 mM MgCl2, 1 mM EDTA, 1% Triton X-100, 10 mM Tris pH 7.5) with freshly added proteinase K (0.2 mg ml−1) followed by 10 min of incubation at 65 °C and 15 min of incubation at 95 °C. Then, 3 µl of the lysate was directly added to amplicon PCRs. Sequencing adapters and barcodes were added by a second round of PCR and the purified products were sequenced on an Illumina MiSeq (300 cycles). Correctly edited reads were identified by pattern matching for the insert sequence flanked by 10 nt of the target site to each end. Unedited sequences were detected by matching the 20 nt of wild-type sequence around the nick site. The insertion rate was calculated by dividing the number of edited reads by the number of wild-type reads.
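The padding step can be sketched as an enumeration of candidate 18-nt sequences. The text does not say how padding nucleotides were distributed around the core motif; the selected sequences carry padding on both sides, so the sketch below tries every left/right split, which is an assumption, as is the function name. Scoring the candidates with MinsePIE is not shown.

```python
from itertools import product


def padded_variants(core: str, target_length: int = 18):
    """Yield the core motif padded to target_length with every nucleotide
    combination, for every split of the padding between the two sides."""
    n_pad = target_length - len(core)
    if n_pad < 0:
        raise ValueError("core motif is longer than the target length")
    for n_left in range(n_pad + 1):
        for pad in product("ACGT", repeat=n_pad):
            left = "".join(pad[:n_left])
            right = "".join(pad[n_left:])
            yield left + core + right


# Candidate sets for two of the motifs named in the text.
er_candidates = set(padded_variants("AAGGACGAGCTG"))
tata_candidates = set(padded_variants("CCACGCCTATAAA"))
```

Under this assumption, a 12-nt motif yields at most 7 × 4^6 = 28,672 candidates and a 13-nt motif at most 6 × 4^5 = 6,144, fewer once duplicates are removed. The tested sequences CAAGGACGAGCTGTCCAC and CCCACGCCTATAAAGGCC both appear in the respective candidate sets.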

Software

The software used comprised BaseSpaceCLI (v.1.4.0); Geneius codon-optimization webtool from Eurofins Genomics (accessed 2022); PEAR (v.0.9.11); Python (v.3.8.10); Python packages: Biopython (v.1.79), more-itertools (v.8.5.0), pandarallel (v.1.6.1), scikit-learn (v.0.24.2), scipy (v.1.5.3), shap (v.0.39.0), statannot (v.0.2.3) and XGBoost (v.1.4.0); R (v.4.0.2); ViennaRNA (v.2.5.0); and R packages: Broom (v.0.7.9), fuzzyjoin (v.0.1.6), ggpointdensity (v.0.1.0), RBioinf (v.1.48.0), reversetranslate (v.1.0.0), ShortRead (v.1.46.0), spgs (v.1.0-3), Tidyverse (v.1.3.1) and Viridis (v.0.6.1).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.