Date: 2022-10-10

Cell lines and cell culture

Experiments were carried out in K562 cells (American Type Culture Collection, CCL-243) and HEK293FT cells. K562 cells were cultured in a controlled humidified incubator at 37 °C and 5% CO₂ in RPMI 1640 (Gibco) media supplemented with 10% FBS (HyClone), penicillin (10,000 IU ml⁻¹), streptomycin (10,000 µg ml⁻¹) and L-glutamine (2 mM). HEK293FT cells, as well as the HEK293T and HEK293T-LentiX cells used to produce lentivirus as described below, were grown in DMEM (Gibco) media supplemented with 10% FBS (HyClone), penicillin (10,000 IU ml⁻¹) and streptomycin (10,000 µg ml⁻¹).

Computational workflow to identify thousands of LSRs and cognate attachment sites

The LSR identification workflow was implemented as described schematically in Fig. 1a. In total, 146,028 bacterial isolate genomes available in the National Center for Biotechnology Information (NCBI) RefSeq database were identified on 22 August 2019. Genomes were then clustered at the species level using the NCBI taxon ID and the TaxonKit (version 0.7.1) tool (ref. 74). Genomes within each species were randomized and split into batches, where the first batch included 50 genomes and all subsequent batches contained 20 genomes. Each batch was then processed by downloading all relevant genomes from NCBI, annotating coding sequences in each genome with Prodigal (version 2.6.3) (ref. 75) and then searching for all encoded proteins that contained a predicted Recombinase Pfam domain (PF07508) using HMMER (version 3.3.2) (refs. 76,77). Genomes that contained a predicted LSR were then compared to genomes that lacked that same LSR using the MGEfinder (version 1.0.6) command ‘wholegenome’, which was developed for this purpose by adapting the default MGEfinder tool to work with draft genomes (ref. 27). If MGE boundaries that contained the LSR were identified, all of the relevant sequence data were saved and stored in a database. The workflow was parallelized using Google Cloud virtual machines.
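The batching scheme described above can be illustrated with a minimal Python sketch. The function name `batch_genomes` and the fixed random seed are assumptions made for illustration and are not taken from the original workflow.

```python
import random


def batch_genomes(genome_ids, first_size=50, rest_size=20, seed=0):
    """Shuffle the genomes of one species and split them into batches.

    The first batch holds up to `first_size` genomes and every later
    batch holds up to `rest_size` genomes. The seed is an assumption
    added here only to make the shuffle reproducible.
    """
    ids = list(genome_ids)
    random.Random(seed).shuffle(ids)
    batches = [ids[:first_size]] if ids else []
    for start in range(first_size, len(ids), rest_size):
        batches.append(ids[start:start + rest_size])
    return batches
```

For a species with 95 genomes, this sketch yields batches of 50, 20, 20 and 5 genomes.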

After this initial round of LSR discovery was complete, a modified approach was taken to further expand the database and avoid redundant searches. First, bacterial species with a high number of isolate genomes available in the first round were analyzed to determine if further inspection of these genomes would be necessary. Rarefaction curves representing the number of new LSR families identified with each additional genome analyzed were estimated for these common species, and species that appeared saturated (that is, fewer than one new LSR cluster per 1,000 genomes analyzed) were considered ‘complete’, meaning no further genomes belonging to this species would be analyzed. Next, 48,557 genomes that met these filtering criteria were downloaded from the GenBank database and prepared for further analysis. The analysis was very similar to round 1 but with some notable differences. First, a database of over 496,133 isolate genomes from the RefSeq and GenBank genomes was constructed. PhyloPhlAn (version 3.0.2) marker genes were then extracted from all of these genomes78. Next, for each genome that was found to contain a given LSR, closely related isolates found in the database were selected according to marker gene homology and used for the comparative genomics analysis and further LSR discovery. This marker gene search approach was made available in a public GitHub repository (https://github.com/bhattlab/GenomeSearch). This second round of LSR and attachment site discovery increased the total number of candidates by approximately 32%.
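The saturation criterion can be sketched as follows. The text states only the threshold (fewer than one new LSR cluster per 1,000 genomes analyzed); how the rate was estimated from the rarefaction curves is not specified, so the trailing-window estimate, the function names and the input layout below are assumptions.

```python
def rarefaction_curve(clusters_per_genome):
    """Cumulative number of distinct LSR clusters after each genome.

    `clusters_per_genome` is an ordered list in which each entry is the
    set of LSR cluster IDs found in one genome (hypothetical layout).
    """
    seen = set()
    curve = []
    for clusters in clusters_per_genome:
        seen.update(clusters)
        curve.append(len(seen))
    return curve


def is_saturated(curve, window=1000, max_new_per_1000=1.0):
    """Return True if the last `window` genomes added fewer than
    `max_new_per_1000` new clusters per 1,000 genomes.

    Species with too few genomes to fill the window are treated as
    not saturated (an assumption of this sketch).
    """
    if len(curve) <= window:
        return False
    new_clusters = curve[-1] - curve[-1 - window]
    return new_clusters * 1000 / window < max_new_per_1000
```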

For each identified LSR, two attB sequences could be chosen to represent the original attB sequence, either through concatenation of the sequences immediately flanking the MGE on the post-integration chromosome or through using the sequence as it exists on the pre-integration chromosome—sequences that could differ from each other slightly. In this study, the sequences flanking the element on the post-integration chromosome were used, motivated by the reasoning that this sequence would more closely represent the original attB as it existed immediately before integration. A sequence spanning 50 bp around the attachment site ‘center’, defined as the short stretch of sequence that was homologous between attB and attP, was used to represent both attB and attP sequences.

Predicting LSR target site specificity

LSR sequences were clustered at 90% or 50% identity using MMseqs2 (version 13-45111) (ref. 79). Protein sequences that overlapped with predicted attachment sites were extracted from their genome of origin and clustered with all other target proteins at 50% identity using MMseqs2. Only LSR-attachment site combinations that met quality control filters were considered. To identify site-specific LSRs, only LSRs clustered at 50% identity and target genes clustered at 50% amino acid identity were considered. Next, LSR-target pairs were filtered to only include target gene clusters that were targeted by three or more LSR clusters. Next, only LSR clusters that targeted a single target gene cluster were considered. The remaining sets of LSR clusters were considered to be single-targeting, meaning that they were thought to site-specifically target only one gene cluster. Multi-targeting or transposable LSRs with minimal site specificity were identified separately. For this analysis, only LSRs clustered at 90% identity and target genes clustered at 50% amino acid identity were considered. Next, the total number of target gene clusters that were targeted by each LSR cluster was counted, and LSR clusters that targeted only one gene cluster were removed from consideration. Next, the remaining LSRs were binned according to the number of protein clusters that they targeted. For the purposes of this paper, ‘>3’ target gene clusters is considered fully multi-targeting. Each 50% identity LSR cluster was then assigned to a multi-targeting bin according to the highest bin attained by any one 90% LSR cluster found within the 50% identity LSR cluster.
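The two filtering schemes can be sketched in Python as below. The input is assumed to be a list of (LSR cluster, target gene cluster) pairs that already passed quality control; the function names are hypothetical. The single-targeting filter is applied sequentially in the order given in the text, which is one possible reading of that description.

```python
from collections import defaultdict


def single_targeting_clusters(pairs, min_lsr_clusters=3):
    """LSR clusters considered single-targeting.

    Step 1 keeps target gene clusters hit by >= `min_lsr_clusters`
    LSR clusters; step 2 keeps LSR clusters that hit exactly one of
    the remaining target gene clusters.
    """
    lsrs_by_target = defaultdict(set)
    for lsr, target in pairs:
        lsrs_by_target[target].add(lsr)
    targets_by_lsr = defaultdict(set)
    for lsr, target in pairs:
        if len(lsrs_by_target[target]) >= min_lsr_clusters:
            targets_by_lsr[lsr].add(target)
    return {lsr for lsr, targets in targets_by_lsr.items() if len(targets) == 1}


def multi_targeting_counts(pairs):
    """Number of distinct target gene clusters per LSR cluster,
    after removing LSR clusters that target only one gene cluster."""
    targets_by_lsr = defaultdict(set)
    for lsr, target in pairs:
        targets_by_lsr[lsr].add(target)
    return {lsr: len(t) for lsr, t in targets_by_lsr.items() if len(t) > 1}
```

LSR clusters whose count exceeds 3 would fall in the ‘>3’ bin described above.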

Phylogenetic tree construction

Representative sequences of each quality-controlled 50% identity LSR cluster were used to construct the phylogenetic tree. LSRs were aligned using MAFFT in G-INS-i mode (version 7.471)80, and IQ-TREE (version 2.1.2) was then used to generate a consensus tree using 1,000 bootstrap replicates and automatic model selection81.

Phylogenetic analysis of site-specific integrases targeting a conserved attachment site

One example of several site-specific integrases targeting a conserved attachment site is shown in Fig. 1e. All attB attachment sites were clustered at 80% identity using MMseqs2 (ref. 82). Candidates were filtered to include only those that met quality control thresholds, and attB sites were then ranked by the number of LSR clusters that were found to target them. An example attB cluster was chosen for further analysis. All LSRs that targeted this attB cluster were extracted from the database and were aligned using the MAFFT L-INS-i algorithm (ref. 80). Amino acid identity distances between all LSRs were calculated, and the distance matrix was used to create a hierarchical tree in R. LSRs that were 99% identical at the amino acid level or more were collapsed into a single cluster. This hierarchical tree was visualized and shown in Fig. 1e, along with all attB sites that were targeted by the LSRs.

Identifying target site motifs from attachment sites in the LSR database

The attachment sites associated with multi-targeting LSRs in the database were analyzed to determine target site motifs, as shown in Fig. 1h and Supplementary Fig. 1e. Multi-targeting LSRs in the database were analyzed at the level of individual orthologs, at the level of 90% amino acid identity clusters and at the level of 50% amino acid identity clusters. For each of these levels, only candidates that were found to target more than ten unique attB sequences or ten target genes clustered at 50% amino acid identity were kept. Then, all of the corresponding attB sequences were extracted, with only one attachment site per target gene cluster being extracted to avoid redundancy. These attB sequences were then initially aligned using MAFFT L-INS-i (ref. 80). Next, possible core dinucleotides were identified in each alignment by extracting all dinucleotides in the alignment and ranking them by the conservation of their most frequent nucleotides and their proximity to the center of the attB sequences, using a custom score that equally weighted high nucleotide conservation and normalized distance to the attB center. Candidates were then re-aligned only with respect to these predicted dinucleotide cores rather than using an alignment algorithm, such as MAFFT. These alignments were then visualized in R using ggseqlogo (version 0.1) to identify conserved target site motifs83.

Annotation of LSR-carrying MGEs

Several tools and approaches were used to annotate LSR-carrying MGEs. Phages/prophages were identified using VirSorter2 (version 2.2.3), keeping predictions with boundaries that covered at least 75% of the MGE (ref. 84). ICEs and IMEs were identified using several approaches. First, conjugative elements were identified using CONJscan (version 1.0.2)85 profile HMMs and hmmsearch77, annotating an element as a conjugative element if it contained at least one VirB, T4CP and MOB protein using an E-value cutoff of 1 × 10−4. Next, MGEs were aligned to ICEberg (version 2.0)86 elements using blastn (version 2.12.0)87, identifying elements as ICE/IME elements if they shared at least 80% nucleotide identity and at least 75% alignment coverage with an ICEberg element. Plasmids and other replicons were identified by aligning elements to PLSDB (ref. 88) plasmids using blastn, identifying elements as plasmids if they shared at least 80% nucleotide identity and at least 75% alignment coverage with one of these plasmids. Other replicons were identified if they encoded proteins that matched (E-value = 1 × 10−4) any of the following Pfam89 profile HMMs: Bac_RepA_C, IncFII_repA, RepA_C, RepA_N, RepL, Rep_1, Rep_2, Rep_3, Rep_trans, Rol_Rep_N and TrfA. For annotating the MGE of origin for the LSR clusters presented in Supplementary Fig. 1a, we assigned MGE categories using plurality voting of all relevant MGE annotations, with ties being resolved in the following order: dsDNA Phage, ICE/IME, Plasmid and Other replicon. MGEs with no annotations were assigned to the ‘Other’ category.
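The plurality-voting rule with its tie-breaking order can be sketched as follows. The function name and the representation of annotations as a flat list of category labels are assumptions made for illustration.

```python
from collections import Counter

# Tie-breaking priority as stated in the text.
TIE_ORDER = ["dsDNA Phage", "ICE/IME", "Plasmid", "Other replicon"]


def assign_mge_category(annotations):
    """Assign one MGE category to an LSR cluster by plurality vote.

    `annotations` lists the category label of every annotated MGE in
    the cluster. Clusters without any annotation are labeled 'Other'.
    """
    votes = Counter(a for a in annotations if a in TIE_ORDER)
    if not votes:
        return "Other"
    best = max(votes.values())
    # Walk the priority list so that ties resolve in the stated order.
    for category in TIE_ORDER:
        if votes[category] == best:
            return category
```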

Target gene Pfam enrichment, Gene Ontology term enrichment and anti-phage analysis

Target genes, or genes that were found to be targeted and disrupted by LSRs upon integration, were annotated using the Pfam-A profile HMM models89. One representative sequence per target gene cluster, clustered at 50% identity using MMseqs2 (ref. 79), was selected and analyzed. Only target genes that were targeted by LSRs outside of the multi-targeting clade (Fig. 1b) were considered. Randomly selected background genes were chosen from the contigs on which each target gene was found, and these background genes were also analyzed using Pfam domain models. A two-sided Fisher’s exact test was then used to identify Pfam domains that were enriched among target genes over background genes, only calculating enrichment for Pfam domains that occurred in at least five different target genes. The false discovery rate (FDR)-adjusted P values were calculated by running the p.adjust R command on all the Fisher’s exact test P values.
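The enrichment statistics can be illustrated with a self-contained Python sketch of a two-sided Fisher's exact test and a false discovery rate adjustment. The original analysis was run in R; the adjustment method passed to p.adjust is not stated, so the Benjamini-Hochberg procedure used here is an assumption, as are the function names.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    For a Pfam domain: a/b = target genes with/without the domain,
    c/d = background genes with/without the domain.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, domains found in fewer than five target genes would simply be excluded before the P values are computed and adjusted.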

For Gene Ontology (GO) term enrichment, InterProScan version 1.8.0_152-release was used to map target genes and background genes to relevant GO terms90. Enrichment of specific GO term pathways was calculated using the two-sided Fisher’s exact test as was done for individual Pfam domains, testing only terms that mapped to at least five different target genes.

Anti-phage defense gene enrichment was determined using a different approach. First, genomes that contained target genes were annotated using DefenseFinder (version 1.0.8)91,92. These annotations were used to identify any target genes that were also predicted to be anti-phage defense genes. Next, reasoning that if target genes were enriched within or near anti-phage systems then that would indicate an evolved LSR integration strategy, we calculated the distance between target genes and the nearest anti-phage defense gene. This distribution of distances was then compared with the distances between randomly selected genes and the nearest defense gene. The difference between these two distributions was calculated using a Wilcoxon rank-sum test in R.

Initial landing pad LSR candidate selection

LSRs for the initial set of 17 landing pad candidates were identified by searching for the Recombinase Pfam domain (PF07508) among the MGEs that we previously identified27,76. The identity of the attachment site was inferred from the boundaries of the MGE that contained each LSR. For example, imagine a sequence of nucleotides that has the following structure:

$$\mathrm{B}_1 - \mathrm{D} - \mathrm{P}_1 - \mathrm{E} - \mathrm{P}_2 - \mathrm{D} - \mathrm{B}_2$$

where B₁ indicates the sequence flanking the MGE insertion on the 5′ end; D indicates the target site duplication created upon insertion (if it exists); P₁ indicates the terminal sequence flanking the 5′ integration boundary that is included in the MGE; E is the intervening MGE; P₂ indicates the terminal sequence flanking the 3′ integration boundary that is included in the MGE; and B₂ indicates the sequence flanking the MGE insertion on the 3′ end. Then, the attB and attP sequences can be reconstructed as:

$$\mathrm{attB} = \mathrm{B}_1 + \mathrm{D} + \mathrm{B}_2$$

$$\mathrm{attP} = \mathrm{P}_2 + \mathrm{D} + \mathrm{P}_1$$

where the ‘+’ operator in this case indicates nucleotide sequence concatenation.
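The reconstruction can be written as a short Python sketch. The function name and the toy sequences in the usage note are illustrative assumptions; only the concatenation order is taken from the equations above.

```python
def reconstruct_att_sites(b1, d, p1, p2, b2):
    """Rebuild attB and attP from the segments around an integrated MGE.

    b1, b2: chromosomal sequence flanking the MGE on the 5' and 3' ends
    d:      target site duplication (empty string if absent)
    p1, p2: terminal MGE sequence at the 5' and 3' integration boundaries
    """
    att_b = b1 + d + b2  # chromosome as it was before integration
    att_p = p2 + d + p1  # MGE ends joined across the duplication
    return att_b, att_p
```

With toy inputs `b1="AAAA"`, `d="TT"`, `p1="CCCC"`, `p2="GGGG"` and `b2="ACGT"`, the sketch returns `("AAAATTACGT", "GGGGTTCCCC")`.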

Candidates were then annotated to determine features such as (1) whether or not the element was predicted to be a phage element93, (2) how many isolates contain the integrated MGE and (3) how often MGEs containing distinct LSRs will integrate at the same location in the genome. Candidates were then given higher priority if they were contained within predicted phage elements, if they appeared in multiple isolates and if the attachment sites were targeted by multiple distinct LSRs. A final list of 17 candidates, listed in Fig. 2b, was then taken forward and validated experimentally.

Subsequent selection of LSR candidates of high quality

As subsequent batches of LSRs were synthesized and tested in our various assays, we improved our quality control criteria for selecting further candidates to synthesize and assay. In our initial batch of human genome-targeting candidates, few quality control filters were put in place. Subsequent batches were more stringently quality-controlled. We settled on one set of quality control criteria that substantially increased the experimental validation rate. First, we only considered LSR-carrying MGEs that were identified through comparing genomes that shared at least 95% average nucleotide identity (ANI) as calculated using FastANI (ref. 94), a commonly accepted ANI cutoff for identifying members of the same species. Next, LSRs with large attachment site centers, above 20 bp in length, were removed. The attachment site center is the portion of the attB and the attP that is identical between the two and should contain the dinucleotide core. Next, LSRs with attachment sites with more than 5% of their nucleotides being ambiguous in the original genome assemblies were removed. Next, only LSRs between 400 amino acids and 650 amino acids in length were kept. Next, only predicted LSRs that contained at least one of the three main LSR Pfam domains were retained (Resolvase, Recombinase and Zn_ribbon_recom). Next, LSRs were removed from consideration if their sequences contained more than 5% ambiguous amino acids. Next, only LSRs that were found on integrative MGEs shorter than 200 kb were retained, because larger elements were presumed to be technical artifacts. And finally, only LSRs that were within 500 nucleotides of their predicted attachment sites were retained. Candidates that met all of these filters were considered to meet quality control thresholds.
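The quality control criteria can be summarized as a single predicate, sketched below. The `Candidate` record and its field names are hypothetical, and whether each boundary value is inclusive is an assumption where the text does not say.

```python
from dataclasses import dataclass, replace

LSR_PFAM_DOMAINS = {"Resolvase", "Recombinase", "Zn_ribbon_recom"}


@dataclass(frozen=True)
class Candidate:
    ani: float                 # ANI (%) between the compared genomes
    att_center_len: int        # attachment site center length (bp)
    att_ambiguous_frac: float  # fraction of ambiguous attachment site bases
    lsr_len_aa: int            # LSR protein length (amino acids)
    pfam_domains: frozenset    # Pfam domains predicted in the LSR
    lsr_ambiguous_frac: float  # fraction of ambiguous amino acids
    mge_len_bp: int            # length of the integrative MGE (bp)
    dist_to_att_nt: int        # distance from LSR to attachment site (nt)


def passes_qc(c):
    """Apply every filter listed in the text; all must hold."""
    return (
        c.ani >= 95.0
        and c.att_center_len <= 20
        and c.att_ambiguous_frac <= 0.05
        and 400 <= c.lsr_len_aa <= 650
        and bool(LSR_PFAM_DOMAINS & set(c.pfam_domains))
        and c.lsr_ambiguous_frac <= 0.05
        and c.mge_len_bp < 200_000
        and c.dist_to_att_nt <= 500
    )


# Illustrative records (made-up values, not data from the study).
example = Candidate(97.2, 8, 0.0, 512, frozenset({"Recombinase"}), 0.0, 45_000, 120)
too_long = replace(example, mge_len_bp=250_000)
```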

Plasmid recombination assay to validate LSR-attD-attA predictions

Three plasmids were designed for each LSR candidate to test recombination function on an episomal reporter. The effector plasmid contains the EF-1α promoter, followed by the recombinase coding sequence (codon optimized for human cells), a 2A self-cleaving peptide and an EGFP coding sequence. The attA plasmid contains an EF-1α promoter, followed by the attA sequence, followed by mTagBFP2 coding sequence, which should constitutively express the mTagBFP2 protein in human cells. The attD plasmid includes only the attD sequence followed by the mCherry coding sequence, which should produce no fluorescent mCherry before integration of the attA plasmid. Next, 20,000 HEK293FT cells were plated into 96-well plates and transfected 1 day later with 200 ng of effector plasmid, 70 ng of attA plasmid and 50 ng of attD plasmid using Lipofectamine 2000 (Invitrogen). Then, 2–3 days after transfection of cells with all three plasmids, cells were measured using flow cytometry on an Attune NxT Flow Cytometer (Thermo Fisher Scientific) and software (version 5.1.1). HEK293FT cells were lifted from the plate using TrypLE (Gibco) and resuspended in Stain Buffer (BD) before flow. These experiments were conducted in triplicate transfections. Cells were gated for single cells using forward and side scatter and then on cells expressing fluorescent EGFP. Next, mTagBFP2 fluorescence was measured to indicate the amount of un-recombined attA plasmids, and mCherry fluorescence was measured to indicate the amount of recombinant plasmid, indicating successful LSR-mediated integration. Corrected MFI was obtained by subtracting the average MFI of all matching attD-only control replicates from the average MFI of the three-plasmid transfected cells. mCherry and EGFP gating was determined based on an empty backbone transfection.
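The corrected MFI calculation amounts to a difference of replicate means, as in the following sketch. The function name is hypothetical, and the values in the usage note are invented for illustration only.

```python
from statistics import mean


def corrected_mfi(three_plasmid_mfis, attd_only_mfis):
    """Average mCherry MFI of the three-plasmid replicates minus the
    average MFI of the matching attD-only control replicates."""
    return mean(three_plasmid_mfis) - mean(attd_only_mfis)
```

For example, replicate MFIs of 1,200, 1,000 and 1,100 with control MFIs of 100, 120 and 110 give a corrected MFI of 990.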

An experiment testing recombinases with matched and unmatched attB and attP plasmids was performed similarly, in K562 cells. In total, 1.2 × 10⁶ K562 cells were electroporated in 100 μl of Amaxa solution (Lonza Nucleofector 2b, program T-016), with 300 ng of the 11.6-kb LSR plasmid, 869 ng of the 4.2-kb attB plasmid and 621 ng of the 3-kb attP plasmid. Three days after transfection, mCherry MFI of ungated cells was measured by flow cytometry on a BD Accuri C6 cytometer and accompanying software (version 227).

Landing pad cell line production

Landing pad LSR candidates were cloned into lentiviral plasmids under the control of the strong EF-1α promoter, with their attB site in between the promoter and start codon, and with a 2A-EGFP fluorescent marker downstream of the LSR coding sequence. Lentivirus production and spinfection of K562 cells were performed as follows. In each well of a six-well tissue culture plate, 5 × 10⁵ HEK293T cells were plated in 2 ml of DMEM, grown overnight and then transfected with 0.75 μg of an equimolar mixture of the three third-generation packaging plasmids (pMD2.G, psPAX2 and pMDLg/pRRE) and 0.75 μg of LSR vectors using 10 μl of polyethylenimine (Polysciences, 23966) and 200 μl of cold serum-free DMEM. pMD2.G (Addgene plasmid 12259; http://n2t.net/addgene:12259; RRID: Addgene_12259), psPAX2 (Addgene plasmid 12260; http://n2t.net/addgene:12260; RRID: Addgene_12260) and pMDLg/pRRE (Addgene plasmid 12251; http://n2t.net/addgene:12251; RRID: Addgene_12251) were gifts from Didier Trono. After 24 hours, 3 ml of DMEM was added to the cells, and, after 72 hours of incubation, lentivirus was harvested. We filtered the pooled lentivirus through a 0.45-μm PVDF filter (Millipore) to remove any cellular debris.

To create polyclonal landing pad cell lines, 2 ml of lentiviral supernatant and 8 μg ml⁻¹ of polybrene was used on 3 × 10⁵ K562 cells to ensure a high MOI. These cells were infected by spinfection for 30 minutes at 1,000g at 33 °C, followed by overnight infection. The next day, the cells were spun down and resuspended in fresh media. This resulted in >50% EGFP+ cell populations, suggesting that each cell likely contained multiple landing pad copies. To create clonal landing pad cell lines, lentivirus doses of 50 μl, 100 μl and 200 μl were used for each vector, to find a condition with low MOI wherein each transduced cell would be likely to contain only a single integrated copy of the landing pad. In total, 3 × 10⁵ K562 cells were mixed with the lentiviruses in 8 μg ml⁻¹ of polybrene and infected overnight, without spinfection. Infected cells grew for 3 days, and then infection efficiency was measured using flow cytometry to measure EGFP (BD Accuri C6); the dose that gave rise to 5–15% EGFP+ cells was selected for each LSR for further experiments. Ten days later, these EGFP+ cells were sorted into a 96-well plate with a single cell in each well, to derive clonal lines with a single landing pad location. Two weeks later, 16 wells per LSR were analyzed by flow cytometry (BD Accuri C6). In some cases, the well was empty, possibly due to a failure to sort a single cell into that well or because the cells died. Four clones for each LSR with a unimodal high EGFP expression level were selected for expansion and subsequent experiments. Altogether, 27 days passed from infection to clone selection, so these are clones that show high EGFP expression stability.

Landing pad integration efficiency assay

Landing pad cell lines were electroporated in 100 μl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with the promoterless mCherry donor containing the matching attP at a dose of either 1,000 ng or 3,000 ng of donor plasmid using 400,000 cells per electroporation. At timepoints from 5 days to 12 days after electroporation, the cells were subjected to flow cytometry to measure mCherry and EGFP (BD Accuri C6 or Bio-Rad ZE5).

Pseudosite integration efficiency assay to measure integration into the human genome

To determine the efficiency of integration of attD donors into pseudosites in the human genome, attD sequences were cloned into a plasmid containing an EF-1α promoter, followed by mCherry, a P2A self-cleaving peptide and a puromycin resistance marker. Integration efficiency was measured in both K562 and HEK293FT cells. In K562 cells, 1.0 × 10⁶ cells were electroporated in 100 μl of Amaxa solution (Lonza Nucleofector SF, program FF-120), with 3,000 ng of LSR plasmid and 2,000 ng of pseudosite attD plasmid. As a non-matching LSR control, 3,000 ng of Bxb1 was substituted for the correct LSR plasmid. A similar experiment was performed with additional doses (1,000–3,000 ng) for Cp36 LSR plasmid, and the attD donor plasmid was delivered at a 1:1 molar ratio. The cells were cultured between 2 × 10⁵ cells per milliliter and 1 × 10⁶ cells per milliliter for 2–3 weeks.

In HEK293FT cells, 20,000 cells were plated into 96-well plates and transfected 1 day later with 200 ng of LSR plasmid and 178 ng of pseudosite attD plasmid using Lipofectamine 2000 (Invitrogen). As a non-matching LSR control, 200 ng of Bxb1 was substituted for the correct LSR plasmid. Additionally, a linear version of the pseudosite attD donor was also tested for integration activity in HEK293FT cells. To create the linear donors, pseudosite attD plasmids were PCR amplified using the KAPA Hifi HotStart ReadyMix (Roche), amplifying the attD and the EF-1α promoter, followed by mCherry, a P2A self-cleaving peptide and a puromycin resistance marker. The PCR product was gel extracted with the Monarch DNA Gel Extraction Kit (New England Biolabs (NEB)). In total, 20,000 HEK293FT cells were plated into 96-well plates and transfected 1 day later with 300 ng of LSR plasmid and 24 ng of the linear pseudosite attD donor. As a non-matching LSR control, 300 ng of Bxb1 was substituted for the correct LSR plasmid.

For all K562 and HEK293FT transfections, 100 µl of each sample was run on the Attune NxT Flow Cytometer every 3–4 days to measure the mCherry signal. After 2–3 weeks, transiently transfected plasmid was nearly fully diluted out in the non-matching LSR control, and the efficiency of the LSR was determined by the difference in percentage of mCherry+ cells between the non-matching LSR control and the experimental condition.

Generation of donor plasmids containing unique molecular identifiers

To differentiate unique integration events from clonal expansion or PCR duplicates, unique molecular identifiers (UMIs) were cloned into the pEF-1α-mCherry-P2A-Puro donor plasmids. Oligos containing 12 random nucleotides (N×12) were synthesized by Integrated DNA Technologies (IDT); six-cycle PCR using Kapa Hifi PCR Mastermix (Roche) was performed with BsaI Golden Gate overhangs to create a double-stranded UMI library insert; and the PCR product was purified using DNA Clean and Concentrator-5 (Zymo). Next, the UMI library was assembled via a Golden Gate reaction into a BsaI landing pad located upstream of the EF-1α promoter. The Golden Gate reaction specifications are as follows: 127 ng of purified insert; 2 µg of pre-digested (BsaI) and purified (DNA Clean and Concentrator-5, Zymo) backbone (3:1 ratio of insert to backbone); 5 µl of 10× T4 DNA Ligase Buffer (NEB); 2.5 µl of T4 DNA Ligase (NEB); 2.5 µl of BsaI-HFv2 (NEB); and water to a final volume of 50 µl. The reaction was run for 1 hour at 37 °C and then inactivated for 20 minutes at 80 °C. The sample was then purified using DNA Clean and Concentrator-5 (Zymo) with the specified plasmid protocol and quantified via NanoDrop. Then, 1 µl of the library was electroporated into Endura Electrocompetent Cells (Lucigen) using the recommended optional electroporation protocol, plated onto two 500-cm² BioAssay plates, grown at 30 °C for 16 hours and harvested using the NucleoBond Xtra Maxi EF Kit (Macherey-Nagel). UMI coverage, estimated by dilution plating, was determined to be 144× of the 16 million possible UMIs. attD sequences for each LSR were next cloned into these UMI-containing backbones, using Golden Gate assembly into an Esp3I landing pad directly upstream of the UMI.
Each assembly contained 34 ng of purified attD insert with Esp3I overhangs, 521 ng of pre-digested, purified backbone (3:1 ratio of insert to backbone), 5 µl of 10× T4 DNA Ligase Buffer (NEB), 2.5 µl of T4 DNA Ligase (NEB), 2.5 µl of Esp3I (Thermo Fisher Scientific) and water to a final volume of 50 µl. Then, 1 µl of the library was electroporated into Endura Electrocompetent Cells (Lucigen) using the recommended optional electroporation protocol, seeded directly into liquid culture in Terrific Broth, grown overnight at 37 °C and harvested using the NucleoBond Xtra Maxi EF Kit (Macherey-Nagel). Calculated UMI library coverage was greater than 30× for all donor plasmids.
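The relationship between UMI length, library diversity and coverage can be made explicit with a short sketch. It assumes that coverage is defined as the estimated number of independent transformants divided by the number of possible UMIs; the function names are hypothetical.

```python
def umi_diversity(n_random_bases=12):
    """Number of distinct UMIs for a fully random stretch of bases."""
    return 4 ** n_random_bases


def library_coverage(n_transformants, n_random_bases=12):
    """Fold coverage of the UMI space by independent transformants.

    `n_transformants` would come from colony counts on dilution plates
    scaled by the dilution factor (an assumption of this sketch).
    """
    return n_transformants / umi_diversity(n_random_bases)
```

A 12-nucleotide random UMI gives 4¹² = 16,777,216 possible sequences, consistent with the 16 million UMIs quoted above.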

Integration site mapping assay to determine human genome integration specificity

Integration site mapping was performed on both K562 and HEK293FT cells. In total, 1.0 × 10⁶ K562 cells were electroporated in Amaxa solution (Lonza Nucleofector SF, program FF-120) with LSR and pseudosite attD plasmids, using the protocol as above for the pseudosite integration efficiency assay. For HEK293FT cells, 20,000 cells were plated into 96-well plates and transfected 1 day later with 200 ng of LSR plasmid and 178 ng of pseudosite attD plasmid using Lipofectamine 2000 (Invitrogen). After 5 days in culture, puromycin was added to the media at 1 μg ml⁻¹ for K562 cells and 0.5 μg ml⁻¹ for HEK293FT cells. The cells were cultured for two more weeks, and then the gDNA was harvested using the Quick-DNA Miniprep Kit (Zymo) and quantified by Qubit HS dsDNA Assay (Thermo Fisher Scientific). A modified version of the UDiTaS sequencing assay was then used as described below (refs. 35,36). Tn5 was purified using the protocol described in Picelli et al. (ref. 95) and stored at 7.5 mg ml⁻¹. Adaptors were assembled by combining 50 μl of 100 μM top and bottom strand, heating to 95 °C for 2 minutes and slowly ramping down to 25 °C over 12 hours. Next, the transposome was assembled by combining 85.7 μl of Tn5 transposase with 14.3 μl of pre-annealed oligos and incubated for 60 minutes at room temperature. Tagmentation was performed by adding 150 ng of gDNA, 4 μl of 5× TAPS-DMF (50 mM TAPS NaOH, 25 mM MgCl₂, 50% v/v DMF (pH 8.5) at 25 °C), 3 μl of assembled transposome and water for a final reaction volume of 20 μl. The reaction was incubated at 55 °C for 10–15 minutes and then purified with Zymo DNA Clean and Concentrator-5. The tagmented products were run on Agilent Bioanalyzer HS DNA Kit to confirm average fragment size of ~2 kb.
Next, PCR was performed with the outer primers (P5_outer, pseudosite_donor_outer; Supplementary Table 6) for 12 cycles using 12.5 μl of Platinum Superfi PCR Master Mix (Thermo Fisher Scientific), 1.5 μl of 0.5 M TMAC, 0.5 μl of 10 μM pseudosite_donor_outer primer, 0.25 μl of 10 μM P5_outer primer, 9 μl of tagmented DNA and 1.25 μl of DMSO. After AMPure XP 0.9× bead cleanup, a second PCR with the inner nested primers (P5_inner, i7 primers; Supplementary Table 6) was performed for 18 cycles. The PCR contained 25 μl of Platinum Superfi Master Mix (Thermo Fisher Scientific), 3 μl of 0.5 M TMAC, 2.5 μl of DMSO, 2.5 μl of 10 μM P5_inner primer, 5 μl of 10 μM i7 primer, 10 μl of the purified 1st round PCR product and 2 μl of water for a final reaction volume of 50 μl. The final library was size selected on a 2% agarose gel for fragments between 300 bp and 800 bp, gel extracted with the Monarch DNA Gel Extraction Kit (NEB), quantified with Qubit HS dsDNA Assay (Thermo Fisher Scientific) and KAPA Library Quantification Kit (Roche), fragment analyzed with Agilent Bioanalyzer HS DNA Kit and sequenced on a MiSeq (Illumina MiSeq Control Software version 4.0.0.1769). The same protocol was performed for on-target and off-target integration mapping on the landing pad samples, with different donor outer and i7 primers corresponding to the donor plasmid used (LP_donor_outer, LP_i7 primers; Supplementary Table 6).

Computational analysis of integration site mapping assay

Snakemake (version 5.32.0) workflows were constructed and used to analyze NGS data from the integration site mapping assay96. First, stagger sequences added to primers during library preparation were removed using custom Python scripts. Next, fastp (version 0.19.6) was used to trim Nextera adapters from reads and to remove reads with low PHRED scores97. Next, reads were aligned to both the human genome (GRCh38) and a donor plasmid sequence containing the LSR-specific attD sequence in single-end mode using BWA MEM (version 0.7.17)98. Next, reads were analyzed individually using custom Python scripts to identify (1) if they aligned to the donor plasmid, human genome or both; (2) whether or not the reads began at the predicted primer; (3) whether or not the pre-integration attachment site was intact; and (4) whether or not the attachment site matched the expected donor plasmid. Reads were then filtered to include only those reads that mapped to both the donor plasmid and the human genome, those that began at the primer site and those that did not have an intact attD sequence (if this could be determined from the length of a particular read). This filtered read set was then aligned in paired-end mode to the human genome using default settings in BWA MEM. Alignments with a mapping quality score less than 30 were removed, along with supplementary alignments and paired read alignments with an insert size longer than 1,500 bp. The SAMtools markdup tool was used to remove potential PCR duplicates and identify unique reads for downstream analysis99. Next, MGEfinder was used to extract clipped-end sequences from reads aligned to the human genome and generate a consensus sequence of the clipped ends, which represent the crossover from the human genome into the integrated attD sequence27. 
Using custom Python scripts, k-mers of length 9 bp were extracted from these consensus sequences and compared with a subsequence of the attD plasmid extending from the original primer to 25 bp after the end of the attD attachment site. If there were no shared 9-mers, the candidate was discarded. Otherwise, consensus sequences were clipped to begin at the primer site, and these consensus sequences were then aligned back to the original attD subsequence using the biopython local alignment tool100. Two aligned portions were extracted: the full local alignment of the consensus sequence to the attD (called the ‘full local alignment’) and the longest subset of the alignment that included no ambiguous bases and no gaps (called the ‘contiguous alignment’). To filter a final set of true insertion sites, only sites with at least 80% nucleotide identity shared between the consensus sequence and the attD subsequence in either the full local alignment or the contiguous alignment were kept. Finally, only sites with a crossover point within 15 bp of the predicted dinucleotide core were kept.
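The 9-mer pre-filter and the identity filter described above can be sketched as follows. This is a minimal illustration and not the authors' custom scripts: the function names are hypothetical, the inputs are assumed to be already-aligned, gap-padded strings (the source obtains these from the biopython local alignment tool), and identity is assumed to be computed per alignment column.

```python
def kmers(seq, k=9):
    """Return the set of all k-length substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(consensus, att_subseq, k=9):
    """True if the consensus and the attD subsequence share at least one k-mer."""
    return not kmers(consensus, k).isdisjoint(kmers(att_subseq, k))


def identity(aligned_a, aligned_b):
    """Fraction of alignment columns holding identical, unambiguous bases."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(a == b and a in "ACGT" for a, b in zip(aligned_a, aligned_b))
    return matches / len(aligned_a)


def longest_contiguous(aligned_a, aligned_b):
    """Longest run of columns with no gaps and no ambiguous bases."""
    best_start, best_end, start = 0, 0, None
    for i, (a, b) in enumerate(zip(aligned_a, aligned_b)):
        if a in "ACGT" and b in "ACGT":
            if start is None:
                start = i
            if i + 1 - start > best_end - best_start:
                best_start, best_end = start, i + 1
        else:
            start = None
    return aligned_a[best_start:best_end], aligned_b[best_start:best_end]


def passes_identity_filter(aligned_a, aligned_b, min_identity=0.8):
    """Keep a site if either alignment view reaches the identity cutoff."""
    contig_a, contig_b = longest_contiguous(aligned_a, aligned_b)
    return (identity(aligned_a, aligned_b) >= min_identity
            or identity(contig_a, contig_b) >= min_identity)
```

A candidate would first be tested with `shares_kmer`, and only candidates that pass would go on to alignment and `passes_identity_filter`.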

This approach could precisely predict integration sites, but errors in sequencing reads led to some variability in this prediction. To account for this, integration sites were combined into integration ‘loci’ by merging all sites that were within 500 bp of each other, using bedtools (version 2.27.0)101. This approach would merge integration events that occurred at the same site but in opposite orientations, for example. When pooling reads across biological or technical replicates, these loci were also merged if they overlapped. When measuring the relative frequency of insertion across different loci, all uniquely aligned reads (de-duplicated using SAMtools markdup) found within each locus were counted, or UMIs were counted if they were available. These were then converted into percentages for each locus by dividing by the total number of unique reads/UMIs aligned to all integration loci.
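The locus merging and the percentage calculation can be illustrated with a small pure-Python sketch. The source used bedtools for the merge; the re-implementation below is only an approximation of that behavior, and the input layout (tuples of chromosome, position and unique read or UMI count) and function names are assumptions made for illustration. Strand is ignored on purpose, so sites in opposite orientations fall into the same locus.

```python
from collections import defaultdict


def merge_sites(sites, max_gap=500):
    """Merge integration sites lying within max_gap bp of each other into loci.

    sites: iterable of (chrom, position, read_count) tuples.
    Returns a list of dicts with chrom, start, end and summed reads.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, count in sites:
        by_chrom[chrom].append((pos, count))

    loci = []
    for chrom in sorted(by_chrom):
        current = None
        for pos, count in sorted(by_chrom[chrom]):
            if current is not None and pos - current["end"] <= max_gap:
                # Close enough to the growing locus: extend it.
                current["end"] = pos
                current["reads"] += count
            else:
                current = {"chrom": chrom, "start": pos, "end": pos, "reads": count}
                loci.append(current)
    return loci


def locus_percentages(loci):
    """Percentage of all locus-assigned reads (or UMIs) found in each locus."""
    total = sum(locus["reads"] for locus in loci)
    if total == 0:
        return []
    return [100.0 * locus["reads"] / total for locus in loci]


example = [("chr1", 1000, 5), ("chr1", 1400, 3), ("chr1", 2500, 2), ("chr2", 100, 10)]
example_loci = merge_sites(example)
example_percentages = locus_percentages(example_loci)
```

In this toy input, the two chr1 sites 400 bp apart collapse into one locus, while the site at position 2,500 remains separate.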

Target site motifs for different LSRs could be determined from precise predictions of dinucleotide cores for all integration sites. For each integration locus, only one integration site was chosen if there were multiple, and integration sites with more reads supporting them were prioritized. Human genome sequences around the predicted dinucleotide core were extracted using bedtools, choosing the forward or reverse strand depending on the orientation of the integration. All such target sites, or a subset of these target sites if desired, were then analyzed for conservation at each nucleotide position using the ggseqlogo package in R (ref. 83).
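The strand orientation and per-position conservation steps can be sketched as below. The source used bedtools for sequence extraction and ggseqlogo in R for the logos; this Python sketch only illustrates the underlying calculation, assumes all target sequences have equal length, and uses hypothetical function names. The information content shown is the uncorrected Shannon measure in bits, which may differ from the exact values a logo package reports.

```python
from math import log2

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orient(seq, strand):
    """Return seq as-is for '+' integrations, reverse-complemented for '-'."""
    return seq.upper() if strand == "+" else reverse_complement(seq)


def position_frequencies(seqs):
    """Per-position nucleotide frequencies across equal-length sequences."""
    freqs = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs if s[i] in "ACGT"]
        n = len(column)
        freqs.append({b: (column.count(b) / n if n else 0.0) for b in "ACGT"})
    return freqs


def information_content(freq):
    """Shannon information content in bits (0 to 2) for one position."""
    if sum(freq.values()) == 0:
        return 0.0
    return 2.0 + sum(p * log2(p) for p in freq.values() if p > 0)


# Two observations of the same target site, one on each strand.
oriented = [orient("AACG", "+"), orient("CGTT", "-")]
profile = position_frequencies(oriented)
```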

DNase hypersensitivity integration site enrichment for multi-targeting LSRs

ENCODE DNase hypersensitivity regions (also referred to as peaks) were used to identify integration sites that overlapped with regions of accessible chromatin. For K562 cells, the DNase hypersensitivity peaks identified in experiment ENCFF274YGF were used, and, for HEK293T (HEK293FT peaks were not available), the peaks identified in experiment ENCFF274YGF were used102,103. Enrichment of integration sites within DNase hypersensitivity sites was calculated using a two-sided Fisher’s exact test, with random background sites selected by randomly choosing two sites that were within 100 kbp of each true integration site.
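The enrichment test can be sketched in pure Python as follows. The source does not state which software computed the test, so this is only one possible implementation: the helper names are hypothetical, peaks are assumed to be half-open (start, end) intervals on a single chromosome, and the background is assumed to be drawn uniformly within the window.

```python
import random
from math import comb


def overlaps_peak(position, peaks):
    """True if position falls inside any half-open (start, end) peak."""
    return any(start <= position < end for start, end in peaks)


def background_sites(position, rng, n=2, window=100_000):
    """Draw n random positions within +/- window bp of a true site."""
    return [max(0, position + rng.randint(-window, window)) for _ in range(n)]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are no more
    likely than the observed one, with row and column totals held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def enrichment_p_value(true_sites, peaks, rng):
    """Compare peak overlap of true sites with matched random background."""
    background = [b for s in true_sites for b in background_sites(s, rng)]
    a = sum(overlaps_peak(s, peaks) for s in true_sites)
    c = sum(overlaps_peak(s, peaks) for s in background)
    return fisher_exact_two_sided(a, len(true_sites) - a, c, len(background) - c)
```

Passing a seeded `random.Random` instance as `rng` keeps the background selection reproducible.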

Post hoc identification of human genome integration sites using database-derived motifs

A computational approach was designed that started with a query LSR sequence and then built sequence motifs by iteratively adding natural attB sequences of the next most closely related LSR ortholog, only adding additional attB sequences if they were 95% identical or less to already selected attB sequences. Only attB sequences that belonged to relatives that were at least 30% identical at the amino acid level to the queried LSR were considered. The attB sequences were oriented with respect to each other by choosing the strand orientation with the highest global alignment to the query LSR’s attB sequence. All attB sequences were then aligned together using MUSCLE to generate a multiple sequence alignment (MSA). The middle 60 nucleotides of the MSA were then extracted, excluding columns with over 50% gaps, and the nucleotide frequencies were mapped onto the query LSR’s attB sequence to generate a final motif, with gaps being replaced by nucleotides with equally weighted frequencies. Motifs built from 20, 50 and 100 such attB sequences were constructed. Then, motifs were searched against the experimentally observed human integration sites and approximately 30,000 randomly selected human genome sequences using HOMER with no minimum score threshold104. HOMER calculates motif scores for each searched sequence by taking the sum of the log-odds probabilities at each nucleotide position. Next, R scripts were used to iterate across a range of motif score cutoffs to calculate the true-positive rate and the false-positive rate at each cutoff, generating a ROC curve. For each LSR, the motif with the greatest AUC was selected from the three motifs that were constructed.
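The motif scoring and ROC analysis can be illustrated with the sketch below. The source performed scoring with HOMER and the ROC calculation with R scripts; this Python version only mirrors the arithmetic. The uniform background of 0.25, the natural logarithm and the probability floor of 0.001 are assumptions, as are all function names.

```python
from math import log


def log_odds_score(seq, motif, background=0.25, floor=0.001):
    """Sum of per-position log-odds for seq under a position frequency motif.

    motif: list of dicts mapping each base to its probability at that position.
    """
    return sum(log(max(motif[i][base], floor) / background)
               for i, base in enumerate(seq))


def roc_points(true_scores, random_scores):
    """(false-positive rate, true-positive rate) at every observed cutoff."""
    cutoffs = sorted(set(true_scores) | set(random_scores), reverse=True)
    points = [(0.0, 0.0)]
    for cutoff in cutoffs:
        tpr = sum(s >= cutoff for s in true_scores) / len(true_scores)
        fpr = sum(s >= cutoff for s in random_scores) / len(random_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


def best_motif(motif_scores):
    """Pick the motif name with the greatest AUC.

    motif_scores: dict of name -> (true_scores, random_scores).
    """
    return max(motif_scores,
               key=lambda name: auc(roc_points(*motif_scores[name])))
```

Here `best_motif` would be called once per LSR with the scores obtained from each of the three candidate motifs.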

NGS of linear donor recircularization and integration

Linear donors were generated as described above (in the section titled ‘Pseudosite integration efficiency assay to measure integration into the human genome’), resulting in a purified PCR product containing the attD and the EF-1α promoter, followed by mCherry, a P2A self-cleaving peptide and a puromycin resistance marker. In total, 20,000 HEK293FT cells were plated into a 96-well plate and 1 day later transfected with 24 ng of linear donor and 300 ng of cognate LSR or Bxb1 as a non-matching LSR control. After 5 days, puromycin was added to enrich for integrants. The cells were cultured for two more weeks, and then gDNA was extracted using the Quick-DNA Miniprep Kit (Zymo) and quantified by Qubit HS dsDNA Assay (Thermo Fisher Scientific). PCR primers (Linear_donor_jxn_F, Linear_donor_jxn_R; Supplementary Table 6) were designed to specifically amplify outwards from the ends of the linear donor, and PCR was performed on the linear donor DNA alone and on the DNA extracted from transfections containing the linear donor with and without cognate LSR. PCR products were only visible by gel electrophoresis when using the post-transfection template DNA. PCR products were amplified with Flap2 primers (Supplementary Table 6) to add P5 and P7 adaptors for Illumina sequencing and sequenced on an Illumina MiSeq (v2, 2 × 150 paired-end reads). To perform indel analysis, sequencing reads were run through the CRISPResso2 (version 2.0.20b) workflow105, with custom parameters: the amplicon was set to be a concatenation of the right and then left flanks of the linear donor; the single guide RNA (sgRNA) was set to be the sequence directly 5′ of the expected rejoining site of the two linear donor ends; the quantification window center was set to 0 (relative to the 3′ end of the sgRNA); and the quantification window size was set to 1.

Comparison of LSR and PiggyBac transposase efficiency

In total, 1.2 × 10^6 K562 cells were electroporated in 100 µl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with 2,000 ng of a pEF-1α-PuroR-P2A-mCherry donor plasmid containing an upstream Cp36 attD site (pJT371), in combination with 3,000 ng of Cp36 expression vector. Cells were grown for 10 days and then analyzed using flow cytometry for mCherry fluorescence (Bio-Rad ZE5, Everest software version 3.1) with analysis using CytoFlow (https://github.com/cytoflow/cytoflow).

Assessment of Cp36 directionality via redosing

To test for possible excision upon Cp36 re-dosing, 1.0 × 10^6 K562 cells were electroporated in 100 μl of Amaxa solution (Lonza Nucleofector SF, program FF-120) with 3,000 ng of Cp36 LSR plasmid and 2,000 ng of the Cp36 pseudosite attD plasmid with an mCherry expression cassette or attD plasmid alone. After 15 days, the Cp36-treated cells were re-electroporated with 3,000 ng of Cp36 LSR plasmid or empty LSR backbone control plasmid. Three days later, the cells were measured by flow cytometry (Attune NxT) for mCherry fluorescence.

To generate a pure population of stable mCherry-integrated cells using Cp36, 1.2 × 10^6 K562 cells were electroporated in 100 µl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with 2,000 ng of the same Cp36 PuroR-P2A-mCherry donor, in combination with 3,000 ng of Cp36 expression vector. After 3 weeks of growth to allow the donor plasmid to dilute, cells with integrants were selected to purity using 1 µg ml−1 of puromycin over 7 days and confirmed using flow cytometry for mCherry fluorescence (Attune NxT). To assess the efficiency of integrating a second donor sequence, we generated a second fluorescent donor construct (pJT396) by replacing mCherry in pJT371 with mTagBFP2 and prepared DNA by Mira prep106. We then electroporated 4.0 × 10^5 wild-type or stably integrated mCherry K562 cells in 100 µl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with pJT396 in combination with an equimolar amount of either pUC19 or a Cp36 expression vector, totalling approximately 4 µg of DNA. The frequency of doubly integrated cells was assessed using flow cytometry for mCherry and mTagBFP2 fluorescence at 13 days after electroporation (Attune NxT), with analysis performed in FlowJo. Note that this method differs from that used for mCherry in the initial pseudosite integration assay.

Simultaneous stable delivery of two genes with Cp36

To generate stable mCherry-integrated and BFP-integrated cells using Cp36, 1.2 × 10^6 K562 cells were electroporated in 100 µl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with 3,000 ng of both the same Cp36 PuroR-P2A-mCherry donor and Cp36 PuroR-P2A-mTagBFP2 donor, in combination with an equimolar dose of 2,400 ng of Cp36 expression vector. Control cells were treated with pUC19 and donors, Cp36 and pUC19 or a single donor and Cp36 or pUC19. The frequency of singly and doubly integrated cells was assessed using flow cytometry for mCherry and mTagBFP2 fluorescence (Attune NxT), with analysis performed in FlowJo.

Activity assay of synthetic enhancer reporters installed at AAVS1

To install the synthetic transcription factor rTetR-VP48 into wild-type K562 cells, 1.0 × 10^6 wild-type K562 cells were electroporated in 100 μl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with 1 µg of PiggyBac expression vector (PB200A-1, SBI) and 1 µg of pMMH4, an ITR-flanked plasmid harboring the EF-1α core promoter driving rTetR-VP48-T2A-hygromycin resistance gene and a separate Tet responsive promoter (TRE3G) driving an mCherry gene. Integrants were selected to purity using 200 μg ml−1 of hygromycin (Thermo Fisher Scientific) over 7 days. Enhancer reporter donor constructs flanked by AAVS1 homology arms (pMMH23,24,26) were subsequently integrated into the AAVS1 locus of cells expressing rTetR-VP48 using TALEN-mediated HDR as follows: 1.0 × 10^6 K562 cells were electroporated in Amaxa solution (Lonza Nucleofector 2b, program T-016) with 1,000 ng of reporter and 500 ng of each TALEN-L (Addgene, 35431) and TALEN-R (Addgene, 35432) plasmid (targeting upstream and downstream of the intended DNA cleavage site, respectively). In the pooled reporter assay, a small library of Tet responsive elements was ordered as an oligo pool (opJS2, IDT), assembled into the reporter plasmid, mini-prepped and electroporated as a pool. The reporters contain a promoterless puromycin resistance gene that traps the AAVS1 promoter. Two days after electroporation, the cells were treated with 1 µg ml−1 of puromycin antibiotic for 7 days to select for a population with reporter donor integrated into AAVS1. Reporter expression was measured by flow cytometry (Bio-Rad ZE5) after 2 days of 1,000 ng ml−1 doxycycline induction (Thermo Fisher Scientific).

Activity assay of synthetic enhancer reporters installed at a landing pad

To install the synthetic transcription factor rTetR-VP48 into landing pad cells, 1.0 × 10^6 clonal Kp03 landing pad cells were electroporated in 100 μl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with 1 µg of PiggyBac expression vector (PB200A-1, SBI) and 1 µg of pMMH4 and selected to purity using 200 μg ml−1 of hygromycin (Thermo Fisher Scientific) over 7 days. To install enhancer reporter plasmids at the landing pad, 1.0 × 10^6 K562 cells harboring a monoclonal Kp03 landing pad and multiclonal rTetR-VP48 expression construct were electroporated in 100 μl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with 1,000 ng of reporter donor plasmid (pMMH56,59,59). In the pooled reporter assay, 200 ng of each of five reporter constructs (pMMH55–59) were combined and electroporated together. As a negative control, cells were electroporated with 1,000 ng of reporter donor with no attP site upstream of the promoterless puromycin resistance gene. The reporters contain a promoterless puromycin resistance gene that traps the landing pad promoter. Three days after electroporation, the cells were treated with 1 µg ml−1 of puromycin antibiotic for 7 days to select for a population with reporter donor correctly integrated into the landing pad. All negative control cells died during selection. Reporter expression was measured at the end of selection by flow cytometry (Bio-Rad ZE5) after 2 days of 1,000 ng ml−1 doxycycline induction (Thermo Fisher Scientific).

Magnetic separation of cells based on reporter expression level

The reporter included a synthetic surface marker, consisting of the human IgG1 Fc region linked to an Igκ leader and PDGFRb transmembrane domain, to enable magnetic separation of OFF from ON cells, which we previously used to study transcriptional effector domains46 and here adapted to study enhancers. Before magnetic separation, the cells were cultured between 2 × 10^5 cells per milliliter and 1 × 10^6 cells per milliliter for 2 weeks after selection. After 2 days of 1,000 ng ml−1 doxycycline induction, 1 × 10^7 cells were spun down at 300g for 5 minutes, and media was aspirated. Cells were then resuspended in the same volume of PBS (Gibco), and the spin-down and aspiration were repeated to wash the cells and remove any IgG from serum. Dynabeads M-280 Protein G (Thermo Fisher Scientific, 10003D) were resuspended by vortexing for 30 seconds. Then, 50 ml of blocking buffer was prepared by adding 1 g of biotin-free BSA (Sigma-Aldrich) and 200 μl of 0.5 M pH 8.0 EDTA (Thermo Fisher Scientific, 15575020) into DPBS (Gibco), vacuum filtering with a 0.22-μm filter (Millipore) and then keeping on ice. Next, 50 μl of beads was prepared for every 1 × 10^7 cells by adding 1 ml of buffer per 200 μl of beads, vortexing for 5 seconds, placing on a magnetic tube rack (Eppendorf), waiting 1 minute, removing the supernatant and finally removing the beads from the magnet and resuspending in 100–600 μl of blocking buffer per initial 50 μl of beads. Beads were added to cells at 1 × 10^7 cells per 25 μl of resuspended beads and then incubated at room temperature while rocking for 30 minutes. We used non-stick Ambion 1.5-ml tubes and a small magnetic rack. After incubation, the bead and cell mixture was placed on the magnetic rack for >2 minutes. The unbound supernatant was transferred to a new tube and placed on the magnet again for >2 minutes to remove any remaining beads, and then the supernatant was transferred to a new tube.
For the LSR PRA, the same magnetic separation procedure was performed two more times (for a total of three times) on this supernatant to remove cells with activated reporters from the unbound population. Only the final unbound population was saved for further analysis by flow cytometry and library preparation. The beads from the first round of magnetic separation were resuspended in the same volume of blocking buffer and magnetically separated again, and then the supernatant was discarded. Resuspension, magnetic separation and discarding of the supernatant were repeated, and the tube with the beads was kept as the bound fraction. The bound fraction was resuspended in blocking buffer or PBS to dilute the cells (the unbound fraction is already dilute). Flow cytometry (Bio-Rad ZE5) was performed using a small portion of each fraction to estimate the number of cells in each fraction and to confirm separation based on reporter levels. Finally, the samples were spun down, and the pellets were frozen at −20 °C until gDNA extraction. Two additional biological replicates of the LSR PRA were performed similarly at a later date, starting from the step of electroporating cells with the pooled reporter plasmid donors.

Library preparation and sequencing of magnetically separated reporter cell pool

gDNA was extracted using Monarch Genomic DNA Purification Kit (NEB) according to manufacturer instructions. After cell lysis, magnetic separation was performed on the bound population to remove beads. No more than 5 × 10^6 cells were loaded onto a single column and eluted with water to avoid subsequent PCR inhibition. Libraries were assembled using three PCRs: PCR1 amplifies enhancer elements off the genome; PCR2 extends these amplicons with TruSeq R1/R2 handle sequences; and PCR3 extends these amplicons to add sample barcodes and p5/p7 sequences. PCR1 reactions contained 20 μl of purified gDNA, 2.5 μl of each 10 μM primer (cTF98 and cTF109; Supplementary Table 6) and 25 μl of Q5 2× Master Mix (NEB) and were amplified with the following thermocycling conditions: 3 minutes at 98 °C and then 23× cycles of 10 seconds at 98 °C, 30 seconds at 66 °C and 1 minute at 72 °C and then a final extension step of 72 °C for 5 minutes. The PCR product was purified using 45 µl of SPRI beads (Beckman Coulter) (0.9× of PCR volume) according to manufacturer instructions and eluted in 21 µl of nuclease-free water. PCR2 reactions were assembled with 1 μl of purified PCR1 product, 1 μl of each 10 μM primer (oBD55 and oBD68), 10 μl of Q5 2× Master Mix and 7 μl of nuclease-free water and amplified using the following thermocycling conditions: 30 seconds at 98 °C and then 3–7× cycles of 10 seconds at 98 °C, 30 seconds at 68 °C, 20 seconds at 72 °C and then a final step of 72 °C for 5 minutes. The PCR2 product was purified using 18 µl of SPRI beads (0.9× of PCR volume) according to manufacturer instructions and eluted in 21 µl of nuclease-free water. PCR3 reactions contained 1 μl of purified PCR2 product, 1 μl of each 10 μM primer (oBD19-26), 10 μl of Q5 2× Master Mix and 7 μl of nuclease-free water. The same thermocycling and purification protocol from PCR2 was performed.
Purified PCR3 products were confirmed to be the correct size using a D1000 TapeStation (Agilent) and quantified with a Qubit HS kit. Samples were pooled with PhiX (Illumina) to ensure appropriate library complexity and sequenced on an Illumina MiSeq with a Nano kit with 4–8 indexing cycles and 150 cycle paired-end reads.

Analysis of PRA sequencing data

Sequencing reads were demultiplexed using bcl2fastq (version 2.20). The HT-recruit-Analyze processing pipeline was used to generate a Bowtie reference and modified to align paired-end reads with 0 mismatch allowance (https://github.com/bintulab/HT-recruit-Analyze). Count matrices for the bound and unbound samples were then used to calculate log2(ON:OFF) for each enhancer, normalizing for read depth across bound and unbound samples.
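The depth-normalized log2(ON:OFF) calculation can be sketched as below. This is an illustration and not the HT-recruit-Analyze code: the function name, the dictionary inputs and the pseudocount (added here only to avoid division by zero) are assumptions, and the bound fraction is taken as ON and the unbound fraction as OFF.

```python
from math import log2


def log2_on_off(bound_counts, unbound_counts, pseudocount=1):
    """Depth-normalized log2(ON:OFF) per enhancer.

    bound_counts, unbound_counts: dicts mapping enhancer name to read count
    in the bound (ON) and unbound (OFF) fractions, respectively.
    """
    total_on = sum(bound_counts.values())
    total_off = sum(unbound_counts.values())
    if total_on == 0 or total_off == 0:
        raise ValueError("both fractions need at least one read")

    ratios = {}
    for enhancer in sorted(set(bound_counts) | set(unbound_counts)):
        # Convert raw counts to fractions of each sample's total depth.
        on = (bound_counts.get(enhancer, 0) + pseudocount) / total_on
        off = (unbound_counts.get(enhancer, 0) + pseudocount) / total_off
        ratios[enhancer] = log2(on / off)
    return ratios


example_ratios = log2_on_off({"e1": 300, "e2": 100},
                             {"e1": 100, "e2": 300},
                             pseudocount=0)
```

Dividing by each sample's total makes the ratio insensitive to the bound and unbound libraries being sequenced to different depths.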

Amplicon barcode library installation

A library of mCherry amplicons with randomized barcodes after the stop codon was generated by PCR, electroporated into landing pad cells and recovered and sequenced from gDNA. More specifically, to construct the library, primers were designed to amplify the attP-mCherry-pA sequence off of the template plasmids used in previous landing pad assays (pC432 and pC494; Supplementary Table 6), and the reverse primer included a randomized 6×N barcode as an extension. This primer was synthesized by IDT using standard mixed bases. A mastermix for 8× reactions of PCR was made using 80 ng of plasmid template, 200 μl of 2× Q5 MM (NEB), 10 μl each of 100 nM forward and reverse primers (cTF334, JT1046) and 172 μl of nuclease-free water and then split into separate reactions and amplified with the following thermocycling conditions: 2 minutes at 98 °C and then 30× cycles of 10 seconds at 98 °C, 30 seconds at 65 °C and 40 seconds at 72 °C and then a final extension step of 72 °C for 5 minutes. The length of the library was confirmed by gel electrophoresis, and its concentration was measured by NanoDrop.

K562 landing pad clonal lines with the associated (or mismatched, as a control) recombinase were then electroporated with these amplicon donors. In total, 1.2 × 10^6 K562 cells were electroporated in 100 µl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with either 250 ng or 750 ng of the amplicon donor. Seven days later, the efficiency of mCherry integration was determined by flow cytometry (Bio-Rad ZE5), and then, 1 day later, gDNA extraction was performed from 5 million cells using a Qiagen DNeasy Mini Prep Kit.

Another experiment was performed similarly with other doses. In total, 1.2 × 10^6 K562 cells were electroporated in 100 µl of Amaxa solution (Lonza Nucleofector 2b, program T-016) with either 500 ng or 2,000 ng of amplicon donors. We also included a matched plasmid donor condition: 4,615 ng of plasmid was used to provide a dose equimolar to 2,000 ng of amplicon donor. Each donor condition was tested in two different clonal Kp03 landing pad lines. Six days later, the efficiency of mCherry integration was determined by flow cytometry (Bio-Rad ZE5).
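Matching doses on a molar basis follows from the fact that, for double-stranded DNA, the number of molecules is proportional to mass divided by length. The sketch below shows that arithmetic only. The function name is hypothetical, and the lengths in the example are illustrative values chosen for round numbers, not the lengths of the constructs used here.

```python
def equimolar_mass(reference_mass_ng, reference_length_bp, other_length_bp):
    """Mass of a second dsDNA species that contains the same number of
    molecules as reference_mass_ng of the reference species."""
    if reference_length_bp <= 0 or other_length_bp <= 0:
        raise ValueError("lengths must be positive")
    return reference_mass_ng * other_length_bp / reference_length_bp


# Hypothetical lengths for illustration only: a 1,000-bp amplicon
# and a 2,500-bp plasmid.
plasmid_ng = equimolar_mass(2000, 1000, 2500)
```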

Junction PCR library preparation and sequencing of amplicon donor barcodes

NGS libraries were prepared from the extracted gDNA harvested 8 days after electroporation of landing pad cells with 750 ng of a 1.2-kb attP-mCherry-pA-Barcode amplicon. Libraries were assembled using three rounds of PCR that capture only barcodes successfully integrated into the on-target landing pad site in the genome: PCR1 amplifies barcodes off the genome across the 3′ donor–genome junction; nested PCR2 further amplifies the barcodes and extends them with TruSeq R1/R2 handle sequences; and PCR3 extends the amplicons to add sample indices and p5/p7 sequences. Specifically, PCR1 mastermixes were assembled with 480 µl of gDNA, 500 μl of 2× NEBNext Ultra II Master Mix (NEB) and 10 μl each of 100 nM forward and reverse primers (cTF347 is a universal forward primer, and JT1067/8 are reverse primers specific to the associated landing pad; Supplementary Table 6) and then split into ten separate 100-µl reactions and amplified with the following thermocycling conditions: 1 minute at 98 °C and then 35× cycles of 10 seconds at 98 °C, 30 seconds at 68 °C and 45 seconds at 65 °C and then a final extension step of 65 °C for 5 minutes. Gel electrophoresis of junction PCR with matched and mismatched LSR donor samples was performed to confirm that the PCR1 product was specific to cells with on-target integrations. Then, 50 µl of the junction PCR1 product was purified using 45 µl of SPRI beads (Beckman Coulter) (0.9× of PCR volume) according to manufacturer instructions and eluted in 23.5 µl of nuclease-free water. PCR2 reactions were assembled with 22.5 μl of purified PCR1 product, 2.5 μl of pooled 10 μM forward and reverse primer (cTF348 and cTF351) and 25 μl of 2× NEBNext Ultra II Master Mix (NEB) and then thermocycled as follows: 30 seconds at 98 °C and then 6× cycles of 10 seconds at 98 °C, 30 seconds at 68 °C, 20 seconds at 72 °C and then a final step of 72 °C for 5 minutes.
The PCR2 product was purified using 45 µl of SPRI beads (0.9× of PCR volume) according to manufacturer instructions and eluted in 21 µl of nuclease-free water. PCR3 reactions contained 22.5 μl of purified PCR2 product, 2.5 μl of each 10 μM primer (oBD19-26) and 10 μl of 2× NEBNext Ultra II Master Mix (NEB). The same thermocycling and purification protocol from PCR2 was performed. Purified PCR3 products were confirmed to be the correct size using a D1000 TapeStation (Agilent) and quantified with a Qubit HS kit. Samples were pooled with PhiX (Illumina) to ensure appropriate library complexity and sequenced on an Illumina MiSeq with a Nano kit with 4–8 indexing cycles and 150 cycle paired-end reads.

Analysis of amplicon donor barcode sequencing data

Sequencing reads were analyzed with a custom Python script to count barcodes. Reads were filtered for an average Qscore ≥30 over all positions and a minimum Qscore ≥30 over the 6-bp barcode region. Matches in that region to any of the 4,096 possible 6×N barcodes were tallied. A barcode was defined as a dropout if there were only 0 or 1 counts. The read depth was 216–272× for the pre-installation control samples and 290–357× for all genomic samples.
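The barcode counting logic can be sketched as follows. This is not the authors' script: the function names, the read representation (a sequence paired with a list of integer Phred scores) and the fixed barcode start position are assumptions made for illustration.

```python
from itertools import product


def count_barcodes(reads, start, length=6, min_q=30):
    """Tally every possible N-mer barcode across quality-filtered reads.

    reads: iterable of (sequence, phred_scores) pairs.
    start: 0-based offset of the barcode within each read (assumed fixed).
    """
    # Pre-populate all 4**length barcodes so that unseen ones count as 0.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=length)}
    for seq, quals in reads:
        if len(seq) < start + length or len(quals) != len(seq):
            continue
        if sum(quals) / len(quals) < min_q:
            continue  # average quality over all positions too low
        if min(quals[start:start + length]) < min_q:
            continue  # at least one low-quality base inside the barcode
        barcode = seq[start:start + length].upper()
        if barcode in counts:  # skips barcodes containing N
            counts[barcode] += 1
    return counts


def dropouts(counts, max_count=1):
    """Barcodes observed max_count times or fewer."""
    return sorted(bc for bc, n in counts.items() if n <= max_count)


example_reads = [
    ("GGACGTAC", [35] * 8),
    ("GGACGTAC", [35] * 8),
    ("GGACGTAC", [35, 35, 35, 35, 10, 35, 35, 35]),  # low base in barcode
    ("GGACNTAC", [35] * 8),                          # ambiguous base
]
example_counts = count_barcodes(example_reads, start=2)
```

With a 6-bp barcode the dictionary holds 4^6 = 4,096 entries, matching the number of possible barcodes stated above.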

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.