A systematic approach to inserting split inteins for Boolean logic gate engineering and basal activity reduction

Split inteins are powerful tools for seamless ligation of synthetic split proteins. Yet, their use remains limited because the already intricate split site identification problem is often complicated by the requirement of extein junction sequences. To address this, we augment a mini-Mu transposon-based screening approach and devise the intein-assisted bisection mapping (IBM) method. IBM robustly reveals clusters of split sites on five proteins, converting them into AND or NAND logic gates. We further show that the use of inteins expands functional sequence space for splitting a protein. We also demonstrate the utility of our approach over rational inference of split sites from secondary structure alignment of homologous proteins, and that basal activities of highly active proteins can be mitigated by splitting them. Our work offers a generalizable and systematic route towards creating split protein-intein fusions for synthetic biology.


Supplementary Information for:
A systematic approach to inserting split inteins for Boolean logic gate engineering and basal activity reduction

Supplementary Figure 1 Workflow of intein-assisted bisection mapping (IBM).
a. Execution-wise, IBM augments the bisection mapping method with a Golden Gate substitution step. It starts with a transposition reaction: a staging plasmid carries a 5'- and 3'-trimmed gene of interest (GOI) with the internal BsaI, BbsI and SapI sites removed. The staging plasmid is mixed in vitro with the MuA transposase and the mini-Mu transposon 1,2, which generates the initial insertion library. After in vivo amplification (transformation into E. coli and overnight growth), the insertion library is cut by BsaI and resolved on a DNA agarose gel. The band of the correct size, corresponding to the GOI with the insertion, is purified and ligated into a linearized vector with BsaI-generated overhangs. This produces the open reading frame (ORF) insertion library 3. The ORF insertion library is again amplified in vivo, and an aliquot is then mixed with a DNA fragment carrying the split Ssp DnaB M86 intein in a Golden Gate reaction using the restriction enzymes BbsI and SapI. In later experiments we replaced the P lux2 promoter with the P PhlF promoter, which was reported to be very tight 4. The DNA fragment contains a different selection marker (chloramphenicol acetyltransferase, cat) to facilitate selection of library members in which the transposon has been replaced. Finally, the library is screened for individual strains that show proper reconstitution of protein function upon expression of both the N- and C-lobes (AND or NAND logic gate behavior). Those strains are then subjected to Sanger sequencing at one of the joints to map the split sites. Identical split sites are then deduplicated and consolidated with activity data to generate the intein-bisection maps. b.
Illustration of modifications to the wild-type (WT) Mu transposon R1R2 recognition sequences 5 to accommodate BbsI and SapI Type IIS restriction sites. R1 sequences are highlighted in cyan and R2 in dark blue. Mutations introduced are colored in red. Recognition sites are marked by square brackets and overhang positions are denoted by staggered lines.

Supplementary Figure 2 Illustration of a slightly trimmed CDS using mCherry as an example.
Before IBM was performed on a protein of interest, most of its coding DNA sequence (CDS), excluding the very N-terminus and the C-terminus, was subcloned onto a staging plasmid and flanked by BsaI sites. If transposition happens within the BsaI sites or beyond them, the transposed fragments cannot be selected for before being subcloned into the protein expression plasmid. The sequence space between the two BsaI sites thus defines the transposition window.

Supplementary Figure 3 Illustration of the effect of transposition and Golden Gate substitution on the amino acid linker in the spliced product from IBM.
The BbsI and SapI overhangs are at the very ends of the R1 recognition sites on the mini-Mu transposon. As a result, the number of amino acids inserted at each junction, between the extein and the intein, is minimal. Transposition by MuA duplicates 5 bp at the insertion site (colored in red) 2. The duplicated sequence at the 3' end is arbitrarily defined as part of the original sequence and that at the 5' end is defined as the extra bases. The extra bases together with the DNA of the BbsI overhang code for three amino acid residues, which include the -1 extein junction for the M86 intein 6. At the 3' end, the bases of the SapI overhang encode a serine, which is the required +1 extein junction. Splicing thus leaves behind a scar of four amino acid residues.

Supplementary Figure 4 Outcome-simulating control suggested that IBM should recover split site 159/160 on mCherry.
A synthetic construct was built to mimic one member of the final library that would result if IBM were performed on mCherry. In this construct, mCherry is split at site 159/160, with a total of four extra amino acid residues added at the split site. This construct yielded the correct AND logic gate behavior, with increased fluorescence only when both the N- and C-lobes were present. Proper execution of IBM on mCherry should in principle recover this construct and its split site. Inducers for the N- and C-lobes were arabinose (ara, 1 mM) and acyl homoserine lactone (AHL, 1 µM). Single-cell fluorescence distributions shown were pooled from three biological replicates (n = 3).
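The 5 bp target-site duplication by MuA described above can be made concrete with a short sketch. This is only an illustration under stated assumptions: the function name, the 20-nt target sequence and the "[Mu]" placeholder for the transposon are hypothetical, and the restriction-site overhangs and linker residues are not modeled.

```python
def insert_transposon(target: str, position: int, transposon: str, dup_len: int = 5) -> str:
    """Insert `transposon` into `target`, duplicating `dup_len` bases at the insertion site.

    The bases target[position:position + dup_len] end up on both sides of the
    transposon, mimicking the 5 bp duplication produced by MuA transposition.
    """
    if position < 0 or position + dup_len > len(target):
        raise ValueError("insertion site must leave room for the duplicated bases")
    duplicated = target[position:position + dup_len]
    # The copy 5' of the transposon is treated as the "extra bases"; the copy 3' of
    # the transposon is treated as part of the original sequence, as in the caption.
    return (
        target[:position]
        + duplicated
        + transposon
        + duplicated
        + target[position + dup_len:]
    )


# Hypothetical 20-nt target and placeholder transposon, for illustration only.
target = "ATGGTGAGCAAGGGCGAGGA"
product = insert_transposon(target, position=6, transposon="[Mu]")
print(product)  # ATGGTGAGCAA[Mu]AGCAAGGGCGAGGA
```

The product is longer than the target by the transposon length plus five bases, which is why a few extra codons remain at the junction once the transposon is replaced.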

Supplementary Figure 5 Full intein-bisection map of mCherry.
Full map of IBM on mCherry, from which the data for Figure 1b were drawn. Left panel, the fluorescence of the controls that provide the references (horizontal dashed lines) for the activity of intact mCherry and the minimal activity theoretically achievable by bipartite constructs. Right panel, bisection map of mCherry. Each vertical group of spots represents an identified split site on the x axis, aligned to the mCherry secondary structure (PDB: 2H5Q) 7 below. A total of 99 filtered candidate strains were characterized and sequenced to generate this map. y locations and error bars are the mean and SD of median fluorescence from independent experiments performed on three different days (n ≥ 3, see Supplementary Data 2, sheet "sample_sizes" for exact values of n). Vertical dashed lines bound the permitted transposition window. AHL: acyl homoserine lactone.
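As a rough illustration of how identical split sites can be consolidated with activity data into the per-site mean and SD plotted on such a map, a minimal sketch follows. The record layout, the function name and the numbers are hypothetical and are not taken from the Supplementary Data.

```python
from collections import defaultdict
from statistics import mean, stdev


def consolidate(records):
    """Group (split_site, median_fluorescence) records by split site.

    Returns {split_site: (n, mean, sd)}, where sd is the sample standard
    deviation (NaN when only one measurement exists for a site).
    """
    by_site = defaultdict(list)
    for site, median_fluorescence in records:
        by_site[site].append(median_fluorescence)

    summary = {}
    for site, values in sorted(by_site.items()):
        sd = stdev(values) if len(values) > 1 else float("nan")
        summary[site] = (len(values), mean(values), sd)
    return summary


# Hypothetical median fluorescence values (arbitrary units), one per experiment.
records = [
    ("192/193", 1000.0), ("192/193", 1200.0), ("192/193", 1100.0),
    ("121/122", 400.0), ("121/122", 500.0), ("121/122", 600.0),
]
summary = consolidate(records)
for site, (n, avg, sd) in summary.items():
    print(site, n, avg, sd)
```

Each entry then supplies one spot (mean) and its error bar (SD) at the x position of the split site.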

Supplementary Figure 6 Proof of mCherry splicing at all split sites identified from IBM of mCherry for BiFC.
For all split sites identified for mCherry from IBM (Fig. 1b), a hexahistidine tag was added to the C-terminus of the split mCherry C-lobes by molecular cloning. Cells harboring the constructs were grown for 5 hours in the presence of 1 mM arabinose and 1 µM AHL, and were then harvested for cell lysis. Whole-cell lysates were analyzed by Western blot. N- and C-lobe expression was probed using antibodies that target the mCherry epitope (red) and the hexahistidine tag (turquoise), respectively. In all constructs, the mCherry epitope is located within the N-lobes despite differences in split sites. Overlap of red and turquoise bands gave white bands at a size between 25 and 30 kDa, indicating formation of splice products. In each lane, a second red band with a size corresponding to the unspliced precursor could be found, but only one turquoise band, at the size of the splice product, was present. This suggested that the C-lobes were the limiting precursors in splicing and the N-lobes were in excess. Our result proves that splicing occurred at all split sites identified from IBM. The blot is representative of two independent experiments with similar results.

Supplementary Figure 7 Backward compatibility of split sites from IBM of mCherry for BiFC.
a. To test whether split sites identified from IBM of mCherry were functional for bimolecular fluorescence complementation (BiFC), mCherry was split at representative sites (121/122, 173/174, 191/192, 192/193) and fused with the synthetic coiled-coil domains SYNZIP17 and SYNZIP18 8 through a 5-residue linker (rightmost panel). To test whether the split mCherry alone could self-dimerize, the same assay was conducted with SYNZIP17 removed from the N-lobes but SYNZIP18 retained on the C-lobes (middle panel). In addition, some IBM-identified split sites located within β-sheets (156/157, 176/177, 185/186) were tested.
Cells induced with 1 mM arabinose and 25 µM DAPG were grown for 16 h at 37 °C, followed by incubation at room temperature for 9 h to maximize signals from complementation. Single-cell fluorescence distributions shown were pooled from three biological replicates (n = 3). Solid black horizontal lines denote population medians, except for autofluorescence, which is denoted by dotted grey lines. Above each histogram, the statistical summary is from a two-tailed t-test, assuming unequal variance, that compares the median fluorescence values of the test population and the autofluorescence population (n = 3). p-values are summarized as: n.s., not significant; *p ≤ 0.05; **p ≤ 0.01; ***p ≤ 0.001. See Supplementary Data 2, sheet "p-values" for exact p-values. BCD, bicistronic design to increase the likelihood that the C-lobes would be expressed at similar levels despite having different 5' coding DNA sequences 9. Note that the same data for constructs with SYNZIP was used to generate
The Western blot result on whole-cell lysates of cells from the constructs used in a, harvested at 5 hours after induction. N- and C-lobe expression was probed using antibodies that target the mCherry epitope (red) and the hexahistidine tag (turquoise), respectively. In all constructs, the mCherry epitope is located within the N-lobes despite differences in split sites. This proves that the lack of fluorescence from constructs missing SYNZIP17 was not due to absence of protein expression. It should be noted that, in the cases where SYNZIP17 was removed from the N-lobes, the C-lobes were generally more poorly expressed than the N-lobes, and so the brightness of the turquoise channel was enhanced to visualize expression of the C-lobes. The blot is representative of two independent experiments with similar results. a-c. Split site 159/160 was not recovered from the IBM of mCherry but served as a reference of a known split site for comparison.
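The statistical summary used in these captions can be sketched as follows. This is a minimal illustration, not the authors' analysis script: it computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom with the standard library, and maps a p-value onto the star notation defined in the caption. The function names and input values are hypothetical; obtaining the two-tailed p-value itself requires a t-distribution (for instance `scipy.stats.ttest_ind` with `equal_var=False`), which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def summarize_p(p: float) -> str:
    """Map a p-value onto the caption's notation (n.s., *, **, ***)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return "n.s."


# Hypothetical median fluorescence values from three replicates each.
test_population = [10.0, 12.0, 14.0]
autofluorescence = [1.0, 2.0, 3.0]
t_stat, dof = welch_t(test_population, autofluorescence)
print(t_stat, dof)
print(summarize_p(0.03))
```

With n = 3 per group the degrees of freedom are small, so large differences in medians are needed before the summary moves beyond a single star.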
10,11 was between the yellow and magenta colored residues. (†) The computationally predicted and then experimentally verified split site 104/105 12 was colored in cyan. It was not identified from this IBM experiment.

Supplementary Figure 9 Explanation of controls in intein-bisection maps using TetR and ECF20.
In the left subplots of the TetR and ECF20 intein-bisection maps, there are three sets of vertical dots labeled "autofluo.", "reporter only" and "TetR/ECF20 + reporter". They are, respectively, cells co-transformed with empty plasmids, cells co-transformed with the reporter plasmid and an empty plasmid, and cells co-transformed with the reporter plasmid and the intact protein expression plasmid. Their fluorescence profiles, under different combinations of inducers, are shown underneath the schematics. Single-cell fluorescence distributions shown were pooled from biological replicates, each assayed on a different day (n = 3). Addition of inducers had no influence on the basal activity or unrepressed expression of the reporter. Expression of the intact protein was under the control of P araBAD, and thus repression or activation should, in theory, only be observed when arabinose was added. Results showed that induction by arabinose led to clear repression (TetR) and stronger activation (ECF20). Yet, in the absence of any induction, the presence of intact TetR and ECF20 already caused some level of repression and activation, respectively, at

Supplementary Figure 12 Full intein-bisection map of ECF20.
Full map of IBM on ECF20, from which the data for Figure 2d were drawn. Left panel, the fluorescence of the controls that provide the references (horizontal dashed lines) for the activity of intact ECF20 and hence the maximum activation theoretically achievable by bipartite ECF20. Right panel, bisection map of ECF20. Each vertical group of spots represents an identified split site on the x axis, aligned to two predicted ECF20 secondary structures below. A total of 78 filtered candidate strains were characterized and sequenced to generate this map, of which 74 displayed AND logic and 4 were truncations.
JPred, the de novo secondary structure predicted from the primary sequence using the JPred 4 server 13. SWISS-MODEL, the secondary structure predicted through the SWISS-MODEL homology modeling pipeline 14.

The crystal structure shows a TetR dimer, and each split site has the -1 and the +1 amino acid residues colored on one monomer (yellow, chain A) only, while the other monomer is shown in grey. Split sites from 67-71 are colored in magenta, 100-118 in blue and 143-184 in purple. The DNA binding domain of TetR consists of the first three alpha helices at the N-terminus. Note that the PDB model 4AC0 had missing residues that are likely unstructured regions. The model shown had those residues filled in by "predicting" the entire structure through the SWISS-MODEL homology modeling pipeline 14.

Supplementary Figure 14 Split SrpR N- or C-lobes alone did not achieve repression.
In Supplementary Fig. 11, at 24 h post-induction, split SrpR at most split sites showed strong repression when C-lobes were induced alone. We hypothesized that this was due to leaky expression leading to accumulation of N-lobes over time. To test this, the operons encoding the N- and C-lobes were separated from one plasmid and subcloned into two different plasmids. Presence or absence of either lobe was achieved by co-transforming a construct-carrying plasmid or an empty plasmid, respectively, with the reporter plasmid and the expression plasmid of the other lobe into E. coli. Shown in the figure are single-cell fluorescence data for the combinatorial transformations, which were performed for all split sites and controls. Results showed that at all split sites, both the N- and C-lobes were required for repression to happen. This proved that neither lobe alone was sufficient for repression and that all split SrpR fused with split M86 intein were authentic NAND gates. Single-cell fluorescence distributions shown were pooled from experiments performed on three different days (n = 3) and cells were induced for 24 h prior to the fluorescence assay.

Fig. 3a. The N-lobe contains a hexahistidine tag at the N-terminus and the C-lobe, an HA tag at the C-terminus. This allowed the abundances of the protein fragments to be measured semi-quantitatively from Western blots. The intact TetR construct also carries the two tags. BCD, bicistronic design to increase the likelihood that the C-lobes would be expressed at similar levels despite having different 5' coding DNA sequences 9. b. Western blot on which quantification was performed. Each whole-cell lysate from one of the three biological replicates, from the same experiment where fluorescence measurements were taken, was loaded on one lane for quantification of protein abundances. N- and C-lobe expression was probed using antibodies that target the hexahistidine tag (turquoise) and the HA tag (red), respectively. The lane "reporter" was a whole-cell lysate from bacteria carrying the reporter and an empty plasmid, and demonstrates that endogenous proteins would not cross-react with either antibody. The reporter plasmid was present in all other whole-cell lysates as well. c.
Quantification result from b. The turquoise and red signals were normalized using intact TetR, which carries both tags, so the two signals must be in a 1:1 ratio. Across the three split constructs, split TetR-SYNZIP at site 166/167 had the lowest expression of one of the lobes (the C-lobe). Since the N- and C-lobes must reconstitute in a 1:1 ratio, split TetR-SYNZIP at site 166/167 should have the lowest abundance of reconstituted TetR. However, it still outperformed the other two split sites in the fluorescence repression assay. This result rejected the possibility that differences in repression strengths were due to inadequate protein expression at sites 70/71 and 112/113. Bar heights and error bars are the mean and SD of quantified and then normalized signal intensities from b, which are from three biological replicates (n = 3).

Supplementary Figure 16 Secondary structure alignment of TetR and SrpR with identified split sites revealed the limitation of inferring split sites from homology alignment.
SrpR and TetR amino acid sequences were aligned with the known TetR structure (PDB: 4AC0) using PROMALS3D with default parameters 17. IBM-identified split sites for SrpR and TetR are marked by green and blue triangles, respectively. The result was re-rendered into Figure 3b. As shown, only the second and the last split seams could be mutually inferred.

Supplementary Figure 17 Bipartite proteins at different split sites have lower overall basal activities over prolonged growth.
For TetR (a), SrpR (b) and ECF20 (c), within each subpanel for 5 or 24 h post-induction, the upper panels are pooled single-cell fluorescence data from three independent experiments performed on three different days, with no induction, or with expression of the intact protein or of both the N- and C-lobes. The lower panels are fold changes obtained by dividing, for each strain and each experiment, the median fluorescence of the high-fluorescence population by that of the low-fluorescence population. In fold change calculations, bar heights and error bars represent mean and SD. The horizontal dashed lines serve as visual guides for the fold changes of the intact protein. For prolonged growth (24 h), most bipartite proteins had less leaky activity than their intact counterparts and hence greater fold changes. Data are from the same source data that were used to produce the intein-bisection maps in Supplementary Figures 10-12, and representative data for TetR and ECF20 were repeated in Figure 3c. For each fold change panel, the statistical summary above each bar is from a two-tailed t-test, assuming unequal variance, that compares the fold change of the intact protein (n = 3, biological replicates) and that of the bipartite construct

18-20 were picked from the literature, synthesized with the known extein junctions, and were inserted into or used to split mCherry at site 192/193. The resulting constructs were then assayed with or without the corresponding inducer alongside a control that simulated the spliced product with the linker due to extein junctions. The result showed that different extein junctions were tolerated at the split site. In this context, none of the assayed conditional inteins gave sufficient differential responses to warrant further bisection or insertion mapping.
Single-cell fluorescence distributions shown were pooled from three biological replicates (n = 3). It should be noted that the intein MtuA RecA 37R3-2 was not reported to work in E. coli, but since it also uses the ER-LBD (albeit mutated) and the ER-LBD should function in bacteria 21, it was included for testing. DMSO: dimethyl sulfoxide.

Supplementary Figure 21 Chemically inducible dimerization by caffeine-binding acVHH is similar to that by FRB/FKBP.
The evolved split T7 RNA polymerase 23 was used as a test platform to compare the efficacy and performance of small molecule-induced dimerization between anti-caffeine VHH 24 and FRB/FKBP domains. Strains carrying the constructs were induced under various arabinose concentrations with or without rapamycin (for FRB/FKBP) or caffeine (for acVHH) for 5 h. Maximal achievable activation was similar between the different domains. In both cases, higher expression levels of the bipartite parts led to higher basal activities, likely due to an increase in local protein concentrations that increased spontaneous reconstitution of T7 RNAP by random collision. However, at lower protein expression strengths, T7 RNA polymerase split by FRB/FKBP had a lower tendency to self-associate and gave lower basal activity. Linker lengths for FRB/FKBP were chosen from the constructs that displayed the strongest activation in Pu et al. 24, and linker lengths for acVHH were chosen to be comparable to those of FRB/FKBP. It should be noted, however, that based on the tertiary structure of acVHH 25, the linker lengths were reduced to 10 residues each in subsequent experiments. Single-cell fluorescence distributions shown were pooled from three independent experiments performed on different days (n = 3).

To provide a control, the M86 intein was split at the conventional split site that permits spontaneous re-assembly (site 100/101). In all constructs, mCherry carried a hexahistidine tag at the C-terminus.
Cells were induced for protein expression with 1 mM arabinose for 24 h in the presence or absence of 100 µM caffeine. Top panel, single-cell fluorescence of induced or uninduced cells prior to cell lysis for Western blot. Fluorescence could be detected for the construct where acVHH was inserted at the split site within the M86 intein for spontaneous re-assembly but not in other cases. Each distribution comes from one biological sample. Bottom panel, the Western blot result for whole-cell lysates of the bipartite constructs after fluorescence was analyzed. N- and C-lobe expression was probed using antibodies that target the mCherry epitope (red) and the hexahistidine tag (turquoise), respectively. In all lanes both the N- and C-lobes could be detected, but spliced product formation (overlapped bands in white) could be observed at site 100/101 but not at the rest, indicating that the tested caffeine-inducible inteins were non-functional when moved out of the contexts in which they were screened. The blot is representative of two independent experiments with similar results.

Supplementary Figure 25 ECF20 as an example where additional domains are not tolerated at a split site.
a. ECF20 was split at site 101/102 and fused to acVHH domains using various linker lengths between the acVHH and the split ECF20 parts, but none was functional. However, expression of the bipartite parts alone without any fusion yielded a much weaker but still functional activation. For this experiment, cells carrying the constructs were induced for 24 h to allow accumulation of any activation activity such that it could be observed. Single-cell fluorescence distributions shown were pooled from three biological replicates (n = 3). Note that populations induced with arabinose, with or without caffeine, appear almost identical and thus give a green appearance due to overlap of colors. b. The results obtained could be explained by a predicted 3D model of ECF20.
During transcription initiation, the linker of a sigma factor is normally embedded within the RNA polymerase (gray mesh). Addition of acVHH domains at the linker (site 101/102) thus obstructs formation of the transcription initiation complex. The model was generated by the SWISS-MODEL homology modeling pipeline with PDB: 6JBQ.1.F 16 as the template.

Glyphs for restriction sites are colored in grey. Elements in parentheses, for instance the hexahistidine tag H6, are unique to some constructs and do not appear in others. Short amino acid insertions (< 10 residues) that are not linkers are omitted in all schematics. Plasmid backbones 27, promoters 4,28, ribosome binding sites 4, bicistronic designs 9 (BCDs), insulators 29, mScarlet-I 30 and terminators 31,32 were obtained or engineered from various studies or the Registry of Standard Biological Parts (http://parts.igem.org/Main_Page). Other genetic elements have been cited in the main text or their respective supplementary figures. In all cases orthogonal terminators were used for constructs that coexisted within the same cell. The only exception was the TetR reporter plasmid which, due to cloning issues, reused the terminator L3S2P21 once. Plasmid sequences are available at SynBioHub 33 (see Data Availability).

Supplementary Figure 28 Example gating strategy for flow cytometry data analysis and sorting.
a. Gating strategy for all figures shown. To differentiate cells from non-cellular materials present in the PBS and diluted LB, events were gated on FSC-H and SSC-H, both for events within 10^3 and 10^5 arbitrary units. The selected population was further subjected to a second gate in YL2-A against YL2-H, for events within 1 to 10^6 arbitrary units, to gate away detection noise that reported negative fluorescence. The number of events available for analysis exceeded 10,000. Measurements were done on the Attune NxT Flow Cytometer.
These plots were produced using the FlowCal package v1.3.0 34, but the actual gating was performed using the FlowCytometryTools package v0.5.0. b. Example gating strategies for cell sorting. Exact sizes and positions of the gates were adjusted according to the specific fluorescence profile of the sample and the experiment of interest. Sorting was done on the FACS Aria IIu cytometer. Plots were generated by the BD FACS Diva Software v6.1
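The two-step gate described in panel a can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the FlowCytometryTools calls actually used: events are represented as hypothetical dictionaries keyed by channel name, the gates are treated as rectangular with inclusive bounds, and the example events are invented.

```python
def gate_events(events, scatter_range=(1e3, 1e5), fluor_range=(1.0, 1e6)):
    """Keep events that pass both the scatter gate and the fluorescence gate.

    Scatter gate: FSC-H and SSC-H both within `scatter_range`.
    Fluorescence gate: YL2-A and YL2-H both within `fluor_range`, which removes
    detection noise reported as zero or negative fluorescence.
    """
    lo_s, hi_s = scatter_range
    lo_f, hi_f = fluor_range
    kept = []
    for event in events:
        in_scatter = lo_s <= event["FSC-H"] <= hi_s and lo_s <= event["SSC-H"] <= hi_s
        in_fluor = lo_f <= event["YL2-A"] <= hi_f and lo_f <= event["YL2-H"] <= hi_f
        if in_scatter and in_fluor:
            kept.append(event)
    return kept


# Hypothetical events (arbitrary units), for illustration only.
events = [
    {"FSC-H": 2.0e4, "SSC-H": 5.0e3, "YL2-A": 350.0, "YL2-H": 300.0},   # passes both gates
    {"FSC-H": 2.0e2, "SSC-H": 5.0e3, "YL2-A": 350.0, "YL2-H": 300.0},   # debris: FSC-H too low
    {"FSC-H": 2.0e4, "SSC-H": 5.0e3, "YL2-A": -12.0, "YL2-H": 4.0},     # negative fluorescence
    {"FSC-H": 2.0e4, "SSC-H": 2.0e5, "YL2-A": 900.0, "YL2-H": 800.0},   # SSC-H too high
]
cells = gate_events(events)
print(len(cells), "of", len(events), "events retained")
```

In practice the retained events would then be checked against the event-count requirement stated in the caption before medians are computed.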