Targeted insertion of large genetic payloads using cas directed LINE-1 reverse transcriptase

A difficult genome editing goal is the site-specific insertion of large genetic constructs. Here we describe the GENEWRITE system, where site-specific targetable activity of Cas endonucleases is coupled with the reverse transcriptase activity of the ORF2p protein of the human retrotransposon LINE-1. This is accomplished by providing two RNAs: a guide RNA targeting Cas endonuclease activity and an appropriately designed payload RNA encoding the desired insertion. Using E. coli as a simple platform for development and deployment, we show that with proper payload design and co-expression of helper proteins, GENEWRITE can insert large genetic payloads at precise locations, although with off-target effects. Based upon these results, we describe a potential strategy for implementation of GENEWRITE in more complex systems.

Femila Manoj 1, Laura W. Tai 2, Katelyn Sun Mi Wang 3 & Thomas E. Kuhlman 3*
Despite their flexibility and ease of use, the repertoire of genome editing modalities that CRISPR/Cas systems allow remains limited. Knockout or point mutants can be generated relatively easily by targeting Cas cleavage to coding or control regions of the genome. The cell must repair such cuts to survive, and errors introduced by the non-homologous end joining (NHEJ) repair machinery can lead to inactivation of control regions or introduction of missense or point mutations to coding sequences [26][27][28][29]. An additional editing modality is to introduce novel sequences to the genome through Homology Directed Repair (HDR), where a DNA fragment with ends homologous to the sequences flanking the cut site and containing the desired sequence to be inserted is introduced to the cell along with the Cas-sgRNA ribonucleoprotein (RNP) complexes. After cleavage, the fragment is then used to repair the cut by the cell's homologous recombination repair machinery, resulting in its integration. However, HDR remains inefficient and difficult to accomplish, particularly for gene-sized or larger [≥ ~ 1 kilobase pair (kbp)] fragments [30][31][32][33]. A primary reason for this difficulty is that for HDR to be successful, NHEJ, the primary DNA repair mechanism in advanced eukaryotic cells [34][35][36][37], must be suppressed [38][39][40].
Here we introduce a method for the active insertion of lengthy genetic sequences into host DNA that we call GENEWRITE: Genome Engineering With RNA-Integrating Targetable Endonucleases. This is accomplished by coupling the targetable endonuclease activity of Cas enzymes to the reverse transcriptase activity of the human retrotransposon LINE-1 through translationally fusing Cas and LINE-1 reverse transcriptase proteins (Fig. 1A). A number of recent reports have described approaches coupling the targetability of Cas enzymes with the activity of other transposons or reverse transcriptases. These include Tn7-like transposons whose genomic insertion is accomplished through an associated CRISPR effector, from the cyanobacterium Scytonema hofmanni 41 and Vibrio cholerae 42. Insertion of these 2-3 kbp bacterial transposons is programmable to specific genomic locations in E. coli through a guide RNA, as with other Cas enzymes. Another approach, prime editing 43, uses a catalytically impaired Cas9 fused to an engineered Moloney Murine Leukemia Virus (M-MLV) reverse transcriptase 44-47, with a "prime editing guide RNA" (pegRNA) to target short insertions, deletions, and all types of point mutations into human cells. GENEWRITE offers functionality that is distinct from each of these examples. While prime editing similarly uses a reverse transcriptase to insert RNA-encoded sequences into the genome, insertions

Results
GENEWRITE rationale and design.
The human retrotransposon LINE-1 (Long Interspersed Nuclear Element, or L1) encodes the two proteins ORF1p and ORF2p, and both proteins are required for efficient retrotransposition in humans. A primary function of ORF1p appears to be chaperone activity 50, while ORF2p includes endonuclease (EN) and reverse transcriptase (RT) domains. To retrotranspose, ORF2p EN nicks TA-rich target DNA, and the 3' end of the LINE-1 mRNA hybridizes with DNA adjacent to the nick to initiate reverse transcription through a process called target-primed reverse transcription (TPRT) 51. In most active L1 elements, this hybridization is facilitated through the presence of a ~ 100 bp long poly(A) tract, which is also thought to be the primary binding target of ORF2p to its encoding mRNA 52.
LINE-1 and its accessory proteins naturally exist in human cells, making it an appealing target for optimization as a genome editing tool. To further enhance specifically targeted, reverse-transcribed insertions by ORF2p in vivo, we removed the promiscuous ORF2p EN domain by deleting amino acids 1-347. The remaining fragment, from amino acids 348-1275, which includes the Z, RT, and cysteine-rich RNA-binding domains, we dub ORF2pZRT. Finally, the GENEWRITE protein consists of a translational fusion of ORF2pZRT to targetable Cas endonucleases (Cas9 or Cas12a/Cpf1) with a flexible linker. In addition, the GENEWRITE protein includes N- and C-terminal nuclear localization signals (NLS) and a C-terminal 6xHis tag to enable purification (Fig. 1A). A previous similar attempt at replacing ORF2p EN with Cas9 and using Alu-like payload RNA to target ORF2p RT to specific loci in human cells proved unsuccessful 53. As described below, we have made several refinements to the GENEWRITE system relative to this attempt, including the use of Escherichia coli as a simpler in vivo platform in which to test and optimize. We additionally show that the 10 base pair homology between target and payload used in this previous study is likely inadequate for priming of TPRT.
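The truncation described above uses 1-based, inclusive amino acid numbering, which is easy to misplace by one residue when converted to 0-based sequence slicing. The following minimal Python sketch shows only the coordinate arithmetic. The function names and the placeholder sequence are illustrative assumptions and are not part of the published methods.

```python
# Coordinates from the text: ORF2p residues 1-347 (EN domain) are deleted,
# leaving residues 348-1275 (ORF2pZRT). Numbering is 1-based and inclusive.
ORF2P_LENGTH = 1275
EN_DELETION = (1, 347)


def remaining_fragment(full_length, n_terminal_deletion):
    """Return the 1-based inclusive (start, end) left after an N-terminal deletion."""
    start, end = n_terminal_deletion
    if start != 1 or not 1 <= end < full_length:
        raise ValueError("expected an N-terminal deletion inside the protein")
    return end + 1, full_length


def fragment_length(fragment):
    """Number of residues in a 1-based inclusive (start, end) fragment."""
    start, end = fragment
    return end - start + 1


def slice_protein(sequence, fragment):
    """Convert 1-based inclusive coordinates to a 0-based Python slice."""
    start, end = fragment
    return sequence[start - 1:end]


zrt = remaining_fragment(ORF2P_LENGTH, EN_DELETION)
# Hypothetical stand-in string; this is NOT the real ORF2p sequence.
placeholder = "X" * ORF2P_LENGTH
zrt_sequence = slice_protein(placeholder, zrt)
```

With these coordinates, the retained ORF2pZRT fragment spans 928 residues.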
High expression of GENEWRITE protein is lethal to E. coli.
We designed and synthesized the GENEWRITE protein under control of a T7 promoter, which was cloned into the plasmid pUC57-kan. We transformed this plasmid into E. coli strain BL21-AI, along with either empty plasmid pZA31, or pZA31 carrying the ykoU and ykoV B. subtilis NHEJ enzymes expressed from PLtetO1 49,54. In strain BL21-AI, GENEWRITE expression is inducible by the addition of L-arabinose. Curiously, while expression of Cas9/12a, ORF2pZRT, or both Cas9/12a and ORF2pZRT in individual E. coli cells does not affect growth, strong expression of the GENEWRITE Cas-ORF2pZRT fusion protein induced through the addition of arabinose is lethal to E. coli. This lethality is partially relieved by simultaneous expression of B. subtilis NHEJ enzymes. This suggests lethality may be a consequence of genomic breaks generated by GENEWRITE, perhaps driven by high affinity of ORF2p to arbitrary RNAs in vivo 55. Consequently, the results described below rely upon low, leaky expression of GENEWRITE without induction.

GENEWRITE is effective at insertions into high-copy number targets in E. coli.
We expected the strategy outlined in Fig. 1B-F to be difficult to successfully execute for a number of reasons, including the expected difficulty of co-transforming individual cells with appropriate amounts of both sgRNA and payload RNA, as well as the previously documented preference of ORF2p to act primarily upon its cis-encoding RNA 56. Hence, to maximize chances of success, we chose as an initial integration target the high copy number plasmid pUC57-kan [~ 500-1000 copies/cell] from which the GENEWRITE protein itself is expressed. For experiments described here, the ~ 1200 bp payload RNA consisted of an aadA spectinomycin resistance gene driven by a strong, constitutive lacIQ1 promoter 57 and Shine-Dalgarno ribosomal binding site (RBS). Consequently, after the GENEWRITE protocol, cells were spread on plates containing spectinomycin to select for potentially successful integrants.
Based upon our current understanding of TPRT, design of the payload RNA 3' hybridization region is critical. To determine the optimal length of the hybridization region, we generated an array of six otherwise identical payload RNAs with hybridization lengths varying from 0 to 50 bp in 10 bp increments. Based on prior reports of the essentiality of a 3' poly(A) tract for ORF2p binding and reverse transcription 52, we generated a second array of payload RNAs, identical to the first, but also including the 30 bp poly(A) tract found in the SINE element AluYA5 58.
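The two payload arrays described above amount to a small design matrix: six hybridization lengths, each with and without the poly(A) tract. As a minimal sketch, the matrix can be enumerated as follows. The design names and the dictionary layout are hypothetical bookkeeping conventions and do not appear in the original study.

```python
from itertools import product

# Values taken from the text.
HYBRIDIZATION_LENGTHS_BP = range(0, 51, 10)  # 0, 10, 20, 30, 40, 50
POLY_A_LENGTH_BP = 30                        # AluYA5-derived poly(A) tract


def payload_designs():
    """Enumerate every payload variant: hybridization length x poly(A) presence."""
    designs = []
    for has_poly_a, hyb in product((False, True), HYBRIDIZATION_LENGTHS_BP):
        designs.append({
            # Hypothetical naming scheme, for illustration only.
            "name": f"payload_h{hyb}" + ("_pA" if has_poly_a else ""),
            "hybridization_bp": hyb,
            "poly_a_bp": POLY_A_LENGTH_BP if has_poly_a else 0,
        })
    return designs


designs = payload_designs()
for design in designs:
    print(design)
```

This yields twelve payload designs in total, six per array.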
We transformed the pUC57-targeting sgRNA along with each payload RNA into E. coli weakly expressing GENEWRITE-Cas9, either with or without simultaneous expression of B. subtilis NHEJ enzymes. The results are shown in Fig. 2. For those payload RNAs containing a poly(A) tract, we observed very few spectinomycin resistant colonies, both with and without simultaneous co-expression of NHEJ. Conversely, without the poly(A) tract, we obtained hundreds of spectinomycin resistant colonies when complemented with co-expression of NHEJ (Fig. 2A). Site-specific integration was verified by PCR using primers that amplified across the 5' and 3' integration junctions (Fig. 2B); 69 out of 96 colonies screened yielded a positive signal for a success rate of ~ 72% (Fig. 2C). Sequencing of eight purified plasmids revealed some small deletions at the 5' end of the insertion (Supplementary Fig. S1). From these experiments, we conclude that the optimal payload RNA design includes 40-50 bp of 3' homology to the intended target and no poly(A) tract, with integration facilitated by NHEJ DNA repair. This difference in essentiality of the poly(A) tract to TPRT between E. coli and humans may be the result of mRNA 3' poly(A) tails stabilizing RNAs in eukaryotes, while poly(A) tails designate mRNAs for degradation in bacteria [59][60][61][62].
We performed a series of controls and further investigations using the payload designed to target pUC57-kan with a 40 bp 3' hybridization region (Fig. 2D): (1) as expected, the sgRNA is required for efficient targeting and integration; (2) NLS sequences at the N- and C-termini do not significantly interfere with function; (3) the Cas12a/Cpf1 GENEWRITE variant is functional, although with lower efficiency than the Cas9 variant, consistent with previous findings that fragments with blunt-end cuts serve as better TPRT substrates than those with 3' or 5' overhangs 48; and (4) simultaneous co-expression of unfused Cas9 and ORF2pZRT, rather than the translationally fused GENEWRITE protein, is not functional. However, LINE-1 reverse transcriptase has been shown to function even when encoded and expressed separately from the endonuclease through association via the naturally occurring cryptic Z domain, raising the possibility of potentially using naturally expressed LINE-1 in the human genome as an editing tool.

GENEWRITE can insert payloads into low and single copy targets.
For the next target, we attempted insertion into the much lower copy number pZA31 plasmid hosting the NHEJ genes [~ 20-30 copies/cell 54]. Using 40 bp of 3' homology to the target as described above, we obtained no colonies when transforming payload and sgRNAs into cells expressing GENEWRITE but deficient in NHEJ. However, we obtained ~ 50 colonies on average when transforming into cells expressing both GENEWRITE and NHEJ proteins. PCR screening of putative positive colonies generated a positive signal in 10 out of 50 colonies (Fig. 3A), yielding a success rate of 20%.
We finally attempted to use GENEWRITE to site-specifically insert a payload into single copy chromosomal loci. We attempted insertions at three loci we have previously shown to accept insertions at high efficiency using recombineering-like methods: the nth locus near the terminus of replication; the atpI locus near the origin of replication; and the ybbD locus midway on the right replichore (Fig. 3B) 63,64. In these cases, repeated attempts at the GENEWRITE protocol as described above were unsuccessful. Prior reports 65 and our own studies of retrotransposition of native LINE-1 in E. coli (Supplementary Fig. S2) suggest that homology between the 5' end of the payload and insertion location may also aid in targeting. Consequently, we attempted two strategies: (1) inclusion of 20 bp of 5' homology between the payload and the targeted insertion site; and (2) simultaneous co-expression of ORF1p. Each of these strategies alone was unsuccessful. Only when targeting nth by including 20 bp of 5' and 40 bp of 3' payload homology to the target, along with simultaneous co-expression of ORF1p, did we obtain significant numbers of colonies after transformation (~ 20 colonies on average). Under these conditions, PCR screening of 50 positive colonies (Fig. 3C, Supplementary Fig. S3) demonstrates a success rate of 60%.
However, repeated attempts at insertion into the atpI and ybbD sites with ORF1p co-expression and 20 bp of 5' and 40 bp of 3' payload homology to the target have so far proven unsuccessful. As with 3' homology, further optimization of the amount of 5' homology to the target included in the payload may improve the efficiency of insertion at low copy number targets. Inclusion of homology in the 5' end of the payload, with the same sequence as the sgRNA, suggests the possibility of using a single RNA as both guide and payload. However, attempting to include the necessary secondary structure and using the 5' end of the payload itself as the sgRNA for the Cas component proved unsuccessful, and we found it was necessary to co-transform separate sgRNA and payload RNAs for successful targeted integration.
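Because the PCR screens above involve tens of colonies, the reported success rates carry appreciable sampling uncertainty. As a minimal sketch, the following Python computes the rate for the pZA31 screen (10 positive out of 50 colonies) together with a 95% Wilson score interval. The interval is an illustrative addition, not part of the original analysis, and the function names are assumptions.

```python
import math


def success_rate(positives, screened):
    """Fraction of screened colonies giving a positive PCR signal."""
    if screened <= 0 or not 0 <= positives <= screened:
        raise ValueError("need 0 <= positives <= screened and screened > 0")
    return positives / screened


def wilson_interval(positives, screened, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = success_rate(positives, screened)
    denom = 1 + z**2 / screened
    center = (p + z**2 / (2 * screened)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / screened + z**2 / (4 * screened**2)
    ) / denom
    return center - half_width, center + half_width


# pZA31 screen from the text: 10 positive out of 50 colonies.
rate = success_rate(10, 50)
low, high = wilson_interval(10, 50)
print(f"success rate = {rate:.0%}, 95% Wilson interval = ({low:.1%}, {high:.1%})")
```

The same functions can be applied to any of the other screens for which both the positive count and the number of colonies screened are reported.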

Off-target effects and application to complex organisms.
Whole genome sequencing of cells subjected to GENEWRITE expression revealed larger numbers of high-frequency mutations relative to a negative control, with mutations scattered throughout the genome (Fig. 4). Moreover, in a large fraction of the plasmids purified and sequenced from GENEWRITE-exposed cells, the coding sequences of both GENEWRITE and NHEJ proteins had curiously been excised from their host plasmids (Supplementary Fig. S4), suggesting that GENEWRITE may also be effective in excising coding regions of inappropriately highly expressed genes.

Discussion
We have shown that a fusion of a Cas endonuclease and LINE-1 ORF2p reverse transcriptase, which we call GENEWRITE, is capable of integrating large genetic payloads into the E. coli genome through appropriate design of the homology regions of guide and payload RNAs. We also find that assistance from NHEJ DNA repair enzymes and LINE-1 ORF1p protein may help increase the efficiency and specificity of the insertion (results summarized in Supplementary Table S3). We have not yet tested the limits on size of GENEWRITE payloads, but LINE-1 itself is ~ 6 kbp long and hence similarly sized payloads may be accessible. The above-described results were obtained using a simplistic method where each component is delivered separately: the RNAs through electroporation, and the GENEWRITE protein through constitutive, low-level expression from a plasmid. We find GENEWRITE to be remarkably successful given this simplistic approach, despite ORF2p's cis-preference for its encoding RNA 56 and propensity to produce inserts with 5' truncations [65][66][67][68]. However, using this method, we find significant off-target effects, including an increase in the rate of off-target mutations relative to a control, and the excision of highly expressed DNA segments from the genome. We speculate that these off-target effects and the lethality of the GENEWRITE protein to E. coli may be coupled:

Figure 3 (caption, partial). [...] where x = 0 corresponds to oriC and x = ± 1 corresponds to terC. (C) PCR amplification across the nth integration location. Primers bind to chromosomal regions adjacent to the targeted integration site. The amplicon expected from successful integration is ~ 1600 bp. We conservatively identify the last colony as negative for integration.

Plasmid design and construction.
All GENEWRITE proteins and variants were designed in Vector NTI software (Thermo Fisher Scientific), synthesized de novo, and cloned into pUC57-kan by GENEWIZ Gene synthesis (GENEWIZ); the exception is ORF2pZRT, which was cloned into pUC57-amp by GENEWIZ. A list of all constructs used in this study is found in Supplementary Table S1. Bacillus subtilis NHEJ enzymes [Ku (encoded by the gene ykoV) and LigD (encoded by the gene ykoU)] were expressed from the anhydrotetracycline-inducible PLtetO1 promoter 54 on the plasmid pZA31 49. Cells not expressing NHEJ were transformed with empty pZA31 as a control.

sgRNA and payload RNA synthesis.
DNAs encoding sgRNAs were prepared using primers including a T7 promoter driving a 20 bp guide sequence. The 3' end of this primer was designed with a 14 bp overhang homologous to a 77 bp scaffold oligo containing sequence encoding the necessary sgRNA secondary structure and used to prime amplification of the sgRNA-encoding DNA. Sequences of all oligos used in the study are available in Supplementary Table S2.
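The sgRNA primer layout described above (T7 promoter, 20 bp guide, 14 bp overlap with the scaffold oligo) can be sketched as a simple string assembly. This is an illustration under stated assumptions, not the authors' actual oligo design: the T7 promoter string is a commonly used minimal promoter and may differ from the one in Supplementary Table S2, the guide and scaffold are placeholder sequences, and the overlap is assumed to match the first 14 nt of the scaffold as written.

```python
# Assumed minimal T7 promoter; the exact sequence used in the study is not given here.
T7_PROMOTER = "TAATACGACTCACTATAG"
GUIDE_LENGTH = 20      # from the text
OVERLAP_LENGTH = 14    # from the text
SCAFFOLD_LENGTH = 77   # from the text


def build_sgrna_forward_primer(guide, scaffold):
    """Assemble T7 promoter + guide + the scaffold-overlapping 3' end."""
    guide = guide.upper()
    scaffold = scaffold.upper()
    if len(guide) != GUIDE_LENGTH or set(guide) - set("ACGT"):
        raise ValueError("guide must be 20 nt of A/C/G/T")
    if len(scaffold) != SCAFFOLD_LENGTH or set(scaffold) - set("ACGT"):
        raise ValueError("scaffold must be 77 nt of A/C/G/T")
    # Assumption: the overlap corresponds to the first 14 nt of the scaffold.
    return T7_PROMOTER + guide + scaffold[:OVERLAP_LENGTH]


# Placeholder sequences for illustration only; neither is a real guide or scaffold.
placeholder_guide = "ACGT" * 5
placeholder_scaffold = "GATTACA" * 11  # 77 nt

primer = build_sgrna_forward_primer(placeholder_guide, placeholder_scaffold)
print(primer, len(primer))
```

The length checks are there to catch a mis-entered guide or scaffold before an oligo is ordered; real sequences should be taken from the supplementary oligo list.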
Payload RNAs were prepared using primers including a T7 promoter driving sequence encoding a strong, constitutive PlacIQ1 promoter, a Shine-Dalgarno ribosomal binding site, and 20 bp of sequence homologous to the spectinomycin resistance gene aadA. Reverse primers were designed with 20 bp of homology to the 3' end of aadA and included the indicated lengths of sequence homologous to the intended integration site. Payload RNA-encoding DNA was amplified from the plasmid pTKRED 69 using these primers.