Integrating reductive and synthetic approaches in biology using man-made cell-like compartments

We propose 'integrated synthetic genetics' as a novel methodology that integrates reductive and synthetic approaches used in life science research. Integrated synthetic genetics enables determinations of sets of genes required for the functioning of any biological subsystem. This method utilizes artificial cell-like compartments, including a randomly introduced whole gene library, strictly defined components for in vitro transcription and translation and a reporter that fluoresces 'only when a particular function of a target biological subsystem is active'. The set of genes necessary for the target biological subsystem can be identified by isolating fluorescent artificial cells and multiplex next-generation sequencing of genes included in these cells. The importance of this methodology is that screening for the set of genes involved in a subsystem and reconstructing the entire subsystem can be done simultaneously. This methodology can be applied to any biological subsystem of any species and may remarkably accelerate life science research.

Life science research seeks to elucidate the relationships between genotypes and phenotypes. This typically involves reductive (genetic and omics research) and synthetic (synthetic biology) approaches. Genetic and omics research seeks to identify genes involved in a biological subsystem of interest such as transcription, translation, signal transduction, genome repair and metabolism. This approach identifies individual genes in a target biological subsystem, although the entirety of the subsystem is not readily characterized. To overcome this, synthetic approaches are used, in which a known set of genes is isolated and combined to reconstruct a target biological subsystem, with the aim of proving that this set accounts for the entire subsystem. Life science research combines these two methodologies.
However, it is often difficult to fill in information gaps in a target subsystem on the basis of these approaches, and tremendous amounts of time and effort are required to completely understand a subsystem's functions.
We propose 'integrated synthetic genetics', a novel approach that combines the advantages of reductive and synthetic approaches. This method provides for simultaneous high-throughput implementation of screening for genes involved in a particular subsystem and reconstructing the entire subsystem. The core of this approach is incorporating artificial cell-like compartments, including an in vitro transcription and translation system (PURE system) [1]. The elements of a PURE system are strictly determined, and a functional protein can be synthesized by introducing any gene fragment in these artificial compartments [2]. Because as many as 10^8 artificial cell-like compartments can be constructed and the functions of the genes inside these compartments can be evaluated while maintaining genotype-phenotype associations, they have been used for the directed evolution of proteins and RNAs [3-6].
Figure 1 shows an outline of our proposed methodology. First, a whole gene library from a target organism is prepared and artificial cell-like compartments that contain randomly introduced library components are constructed (Fig. 1a). Simultaneously, a reporter that fluoresces 'only when a particular function of a target biological subsystem is active' is introduced (Fig. 1b). Next, a liposome library that contains various combinations of genes is constructed (Fig. 1c). Because proteins are synthesized from these introduced gene combinations and express their functions within the PURE system, these liposomes will fluoresce if they contain the set of genes required for the target biological subsystem's function (Fig. 1c). These fluorescent liposomes are isolated by FACS (Fig. 1d), and the genes in each of the fluorescent liposomes are determined by multiplex next-generation sequencing [7] (Fig. 1e).
Because each liposome also contains many irrelevant genes, those genes that are common in fluorescent liposomes are identified by hierarchical cluster analysis (Fig. 1f). Finally, these common factors are considered to be a set of genes that are required for a target biological subsystem. The importance of this methodology is that screening for genes involved in a subsystem and reconstructing the whole subsystem can be done simultaneously on the basis of the artificial cell-like compartments that have strictly defined contents and genetic-like screening that begins with a whole gene library. This methodology can be realized due to the large-scale information processing capacity of next-generation sequencing.
Results and Discussion

β-Galactoside hydrolysis subsystem. To demonstrate the feasibility of our methodology, we selected the Escherichia coli 'β-galactoside hydrolysis subsystem' as our target. In E. coli, β-galactosidase encoded by LacZ is necessary and sufficient for β-galactoside hydrolysis [8]. Thus, this target represents the simplest of model systems. To detect this subsystem, we used 5-chloromethylfluorescein di-β-D-galactopyranoside (CMFDG) as the reporter. CMFDG is a non-fluorescent molecule that contains two galactose moieties and fluoresces when these galactose moieties are hydrolysed (Fig. 2a). Using a bulk assay, we verified β-galactoside hydrolysis activity in the PURE system solution. We mixed LacZ with a T7 promoter (T7P-LacZ), a PURE system solution and CMFDG and then incubated this mixture at 37 °C. This resulted in intense fluorescence derived from CMFDG, which indicated a reliable level of activity with this PURE system (Supplementary Fig. 1).
Construction of the E. coli ORF library. We constructed an E. coli ORF library comprising 4,123 genes with a T7 promoter, on the basis of the ASKA library that contains 4,132 E. coli strains harbouring an E. coli gene plasmid library. To simplify the construction of the E. coli ORF library, 4,132 E. coli strains were divided into groups with approximately 100 strains per group. Each group was cultured in Luria-Bertani medium, and plasmids from each group were purified. Gene fragments with a T7 promoter were amplified using PCR. Finally, equal amounts of the amplified gene fragments were mixed to prepare the E. coli ORF library. We used deep sequencing to check the quality of this library. Almost all genes (96.7%) were sequenced at least once, which assured the library's quality (Supplementary Fig. 2 and Supplementary Table 1).
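The library quality check described above (the fraction of genes sequenced at least once) can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function name `library_coverage` and the toy read counts are hypothetical, and the counts are made up for demonstration only.

```python
def library_coverage(read_counts, min_reads=1):
    """Fraction of library genes detected with at least `min_reads` mapped reads.

    `read_counts` maps a gene name to its number of mapped reads.
    """
    if not read_counts:
        raise ValueError("read_counts is empty")
    detected = sum(1 for n in read_counts.values() if n >= min_reads)
    return detected / len(read_counts)


# Toy, made-up counts for illustration only (not data from the study).
toy_counts = {"geneA": 1520, "geneB": 0, "geneC": 87, "geneD": 3}
print(f"coverage = {library_coverage(toy_counts):.1%}")
```

Applied to a per-gene read-count table such as the one in Supplementary Table 1, the same calculation would yield the reported coverage figure.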
IVTT reaction in liposomes. We attempted ultrahigh-throughput reconstruction of the 'β-galactoside hydrolysis subsystem' by starting with the E. coli ORF library. First, we prepared a liposome library that contained the PURE system solution, 100 µM CMFDG, 1 µM transferrin-Alexa Fluor 647 conjugate (volume marker) and 5 nM E. coli ORF library. Microscopic inspection indicated that the average size and volume of these liposomes were 2.4 µm and 7.2 fL, respectively (Supplementary Fig. 3), which indicated that approximately 20 genes were randomly incorporated in each liposome. Using the formula for combination with repetitions, the probability of having a given target gene among 20 genes randomly chosen from 4,123 genes was 0.48%. Subsequently, these liposomes were incubated at 37 °C to allow for gene expression and analysed by FACS. In the FACS analysis, particles that emitted Alexa Fluor 647-derived fluorescence were classified as liposomes. CMFDG-derived fluorescence was not detected in liposomes devoid of the E. coli ORF library, whereas 0.26% of liposomes that contained the E. coli ORF library fluoresced (Fig. 2b). This indicated that fluorescent signals were derived from functioning genes incorporated in these liposomes. The theoretical and experimental proportions of fluorescent liposomes (0.48% and 0.26%, respectively) were similar. These results indicated that genes were distributed in a random manner and that, once distributed, the genes in these liposomes were correctly translated into functional proteins.
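The arithmetic behind the "approximately 20 genes per liposome" and "0.48%" figures can be reproduced with a short sketch. It assumes a spherical liposome of 2.4 µm diameter and a 5 nM library. The multiset formula is one reading of "combination with repetitions", under the assumption that all multisets are equally likely, and an independent-draw model is included as a cross-check. All function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def sphere_volume_litres(diameter_um):
    """Volume of a sphere in litres, given its diameter in micrometres."""
    radius_um = diameter_um / 2.0
    volume_um3 = (4.0 / 3.0) * math.pi * radius_um ** 3
    return volume_um3 * 1e-15  # 1 cubic micrometre = 1 fL = 1e-15 L


def mean_copies(conc_molar, volume_litres):
    """Expected number of molecules at a given molar concentration and volume."""
    return conc_molar * volume_litres * AVOGADRO


def prob_target_multiset(n_library, n_drawn):
    """P(target present) if all multisets of size n_drawn are equally likely.

    Multisets of size k from n types: C(n + k - 1, k);
    those avoiding the target use n - 1 types: C(n + k - 2, k).
    """
    total = math.comb(n_library + n_drawn - 1, n_drawn)
    without = math.comb(n_library + n_drawn - 2, n_drawn)
    return 1.0 - without / total


def prob_target_independent(n_library, n_drawn):
    """P(target present) for n_drawn independent, uniform draws (cross-check)."""
    return 1.0 - (1.0 - 1.0 / n_library) ** n_drawn


volume = sphere_volume_litres(2.4)             # about 7.2 fL
copies = mean_copies(5e-9, volume)             # about 20 gene fragments
p_multiset = prob_target_multiset(4123, 20)
p_independent = prob_target_independent(4123, 20)

print(f"volume = {volume * 1e15:.1f} fL, copies = {copies:.0f}")
print(f"P(target), multiset model    = {p_multiset:.2%}")
print(f"P(target), independent draws = {p_independent:.2%}")
```

Both models give essentially the same value here because the number of genes per liposome is small relative to the library size.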
Multiplex next-generation sequencing. To identify the genes required for the β-galactoside hydrolysis subsystem, fluorescent liposomes and control non-fluorescent liposomes were isolated, and the genes included in them were determined by multiplex next-generation sequencing (Supplementary Table 2). Each liposome contained numerous genes irrelevant to the β-galactoside hydrolysis subsystem. Thus, we used hierarchical cluster analysis for the genes in the isolated liposomes to detect any specific patterns and to identify common genes. Hierarchical cluster analysis revealed no common factor(s) in our control analysis of non-fluorescent liposomes (Fig. 3a). In contrast, LacZ formed a clear cluster that was present in all fluorescent liposomes, with no other common factors (Fig. 3b). This indicated successful ultrahigh-throughput reconstruction of the β-galactoside hydrolysis subsystem from the E. coli ORF library by integrated synthetic genetics. Furthermore, the conclusions drawn from our cluster analysis were confirmed by additional data showing that virtually all liposomes constructed using LacZ only were fluorescence positive (Fig. 4). Although we used the β-galactoside hydrolysis subsystem as a target in this study, similar ultrahigh-throughput reconstructions can be performed for any biological subsystem using an appropriate reporter. Integrated synthetic genetics can be applied to more complex systems such as cancer. One of the fundamental characteristics of cancer cells is unlimited cell proliferation, which involves promoting cell survival and blocking apoptosis [19]. Consistent with this, a few key hallmarks related to apoptosis, cytoskeleton and genomic instability are known to be significantly enriched in tumor genomic alterations [19,20]. Reconstruction and modelling of cancer hallmark-specific networks will provide insights into cancer therapies.
In conclusion, in this study, we successfully constructed integrated synthetic genetics as a novel method to integrate reductive and synthetic approaches. This system combines the advantages of reductive and synthetic approaches and has three beneficial features (Supplementary Fig. 4). First, this method provides for simultaneous high-throughput implementation of screening and reconstruction, which have been previously used as fundamentally distinct methods in biological research. Second, if an appropriate reporter is available, this method provides for ultrahigh-throughput reconstruction of any biological subsystem of any species. Using a cDNA library, this system may even be applicable to non-model organisms for which a whole gene library is unavailable. Third, even when a target subsystem involves numerous unknown factors, screening for genes that are necessary and sufficient is feasible using multiplex sequencing. For example, to address a subsystem that involves five unknown factors within the context of all E. coli genes, conventional methods would require 4,123^5 ≈ 10^18 combinations of experiments, which are practically impossible to implement. With our approach, assuming that 200 genes are randomly introduced into one liposome, the probability that this liposome contains all five unknown genes would be 10^-7, which would correspond to a 10^11-fold increase in efficiency compared with a conventional approach; detection therefore becomes sufficiently realistic.
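The orders of magnitude quoted for the five-factor example can be checked with a short calculation. This sketch assumes that each of the five genes is present in a liposome independently with probability of about 200/4,123, which is one simple way to arrive at the quoted 10^-7. The variable names are illustrative.

```python
import math

n_genes = 4123            # genes in the E. coli ORF library
n_factors = 5             # unknown genes required by the subsystem
genes_per_liposome = 200  # assumed number of genes loaded per liposome

# Exhaustive search over 5-gene combinations, counted as in the text.
exhaustive = n_genes ** n_factors

# Assumption: each required gene is present independently with
# probability genes_per_liposome / n_genes.
p_all_five = (genes_per_liposome / n_genes) ** n_factors

# Efficiency gain relative to testing one specific combination per experiment.
fold = p_all_five * exhaustive

print(f"exhaustive combinations ~ 10^{math.log10(exhaustive):.1f}")
print(f"P(all five in one liposome) ~ 10^{math.log10(p_all_five):.1f}")
print(f"fold increase ~ 10^{math.log10(fold):.1f}")
```

Under these assumptions the fold increase reduces to 200^5, which is of the order 10^11, matching the figure in the text.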
As demonstrated in this study using the β-galactoside hydrolysis subsystem, our method offers the advantage of faster identification, even for single gene identification, as compared with conventional methods. In practical terms, our method takes approximately one week to complete liposome construction, reactions, isolation, multiplex sequencing and data analysis. Thus, our proposed method provides a novel method for life science research and has the potential to substantially enhance research efficiency.

Methods
Figure 3 | (a) Hierarchical cluster analysis for genes included in non-fluorescent liposomes. Ten non-fluorescent liposomes that did not show β-galactoside hydrolysis activity were isolated, and the genes included in these liposomes were determined by multiplex next-generation sequencing using HiSeq 2500. In the hierarchical cluster analysis for these genes, there was no commonality among the genes included in each liposome, which indicated a random pattern. Each row represents a separate gene, each column represents a separate liposome and genes found in each liposome are shown in red. (b) Hierarchical cluster analysis for the genes included in fluorescent liposomes. Hierarchical cluster analysis revealed that LacZ was a clear cluster that was included in all fluorescent liposomes and with no other common factors.
www.nature.com/scientificreports SCIENTIFIC REPORTS | 4 : 4722 | DOI: 10.1038/srep04722
E. coli ORF library. The ASKA library (GFP non-fusion type) [9] contained all 4,132 genes of E. coli and was provided by the NBRP National Institute of Genetics (Mishima, Japan). All 4,132 genes were amplified by PCR using ASKA library plasmids as previously described [10], with some modifications. In brief, 4,132 E. coli strains obtained from the NBRP were divided into groups with approximately 100 strains per group. Each group was cultured in 100 mL of Luria-Bertani medium + chloramphenicol (0.5% w/v yeast extract, 1% w/v tryptone, 1% w/v NaCl and 20 µg/mL of chloramphenicol). Next, plasmids were extracted from each group and gene fragments were amplified using the following common primers: ASKA forward primer, 5'-GGCCTAATACGACTCACTATAGGAGAAATCATAAAAAATTTATTTGCTTTGTGAGCGG-3', and ASKA reverse primer, 5'-GTTATTGCTCAGCGGTTAGCGGCCGCATAGGCC-3'. ASKA forward primers contained the T7 promoter sequence (italicized) for gene expression by the PURE system, and ASKA reverse primers contained the stop codon (italicized). The amplified gene fragments in each group were purified, and equal amounts were mixed to prepare the E. coli ORF library with the added T7 promoter. The average length of all E. coli genes was 880 bp; this value was used to estimate the molarity of the E. coli ORF library.
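The molarity estimate for the E. coli ORF library, based on the average fragment length of 880 bp, can be sketched as below. The conversion uses the common approximation of about 660 g/mol per base pair of double-stranded DNA, which is an assumption not stated in the text, and the function names are hypothetical.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (common approximation, assumed)


def dsdna_molarity_nM(mass_conc_ng_per_uL, length_bp=880):
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    molar_mass = AVG_BP_MASS * length_bp        # g/mol
    grams_per_litre = mass_conc_ng_per_uL * 1e-3  # 1 ng/uL = 1e-3 g/L
    return grams_per_litre / molar_mass * 1e9   # mol/L -> nM


def mass_conc_for_molarity(target_nM, length_bp=880):
    """Mass concentration (ng/uL) needed to reach a target molarity (nM)."""
    molar_mass = AVG_BP_MASS * length_bp        # g/mol
    return target_nM * 1e-9 * molar_mass * 1e3  # g/L -> ng/uL


# Mass concentration corresponding to the 5 nM library used in liposomes,
# under the 880 bp average-length assumption.
print(f"{mass_conc_for_molarity(5.0):.2f} ng/uL")
```

Because the library is a mixture of fragments of different lengths, a single average length gives only an approximate molarity, which is why the text describes it as an estimate.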
IVTT reaction under bulk conditions. The IVTT reaction solution was prepared by mixing the PURE system solution, DNA fragments amplified from the ASKA library and 100 µM 5-chloromethylfluorescein di-β-D-galactopyranoside (CMFDG; Life Technologies, Carlsbad, CA, USA). CMFDG has two galactose moieties and is one of the most sensitive substrates for galactosidases. Hydrolysis of non-fluorescent CMFDG can be monitored by an increase in its fluorescence. The reaction solution was incubated at 37 °C and fluorescent signals were monitored every 10 min at λex = 490 ± 10 nm and λem = 516 ± 10 nm using an Infinite M1000 fluorescence microplate reader (TECAN, Männedorf, Switzerland).
IVTT reaction in liposomes. Liposomes were constructed by the water-in-oil emulsion-transfer method as previously described [12][13][14], with some modifications. In brief, 1 mL of liquid paraffin containing 250 mg of 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (Avanti Polar Lipids, Alabaster, AL, USA) and 25 mg of cholesterol (Nakalai Tesque, Kyoto, Japan) were mixed with the IVTT reaction solution using a syringe pump to prepare water-in-oil emulsion droplets [3]. The IVTT reaction solution contained 20 µL of the PURE system solution supplemented with 5 nM E. coli ORF library, 200 mM sucrose, 0.5 U/µL of RNase inhibitor (RNasin plus, Promega, Madison, WI, USA), 100 µM CMFDG and 1 µM transferrin-Alexa Fluor 647 conjugate (Life Technologies) as a volume marker. The water-in-oil emulsion droplets were mixed with a magnetic stirrer for 1 min and then equilibrated on ice for 10 min to stabilize the emulsions. The mixture was gently placed in 150 µL of the PURE system solution that contained 200 mM glucose in a microtube and was centrifuged at 15,000 × g for 30 min. Prepared liposomes were removed through an opening at the bottom of the tube. It is important that the liposomes are dispersed in the PURE system solution to prolong protein production in these liposomes [15]. The prepared liposomes were incubated at 37 °C for protein production.
FACS analysis. A JSAN cell sorter (Bay Bioscience, Hyogo, Japan) and a FACSAria (Becton Dickinson, Franklin Lakes, NJ, USA) were used for liposome sorting and analysis. CMFDG-derived fluorescence and Alexa Fluor 647-derived fluorescence were monitored separately using a dual band pass filter. Among the particles detected by FACS analysis, those that emitted Alexa Fluor 647-derived fluorescence were classified as liposomes. CMFDG-derived fluorescent liposomes were classified as liposomes that contained the genes required for the β-galactoside hydrolysis subsystem and sorted accordingly. As a negative control, non-fluorescent liposomes that did not emit CMFDG-derived fluorescence were similarly isolated.
Illumina sequencing. To evaluate the quality of the E. coli ORF library and determine the genes included in the isolated liposomes, multiplex next-generation sequencing was done using HiSeq 2500 (Illumina, San Diego, CA, USA). First, gene fragments in the isolated liposomes were amplified by 45 PCR cycles using the primers noted above (i.e. ASKA forward and reverse primers) and KOD-Plus- (Toyobo, Osaka, Japan). The concentrations of amplified gene fragments were determined using Quant-iT PicoGreen (Life Technologies). Next, the liposome-derived DNA fragments and the E. coli ORF library were pre-treated for HiSeq 2500 according to the Nextera XT DNA preparation kit protocol (Illumina). In brief, input DNA was fragmented by a transposome and the dual indexes were tagged by limited-cycle PCR, which allowed for discriminating between DNA fragments derived from different samples. Equal amounts of DNA fragments tagged with dual indexes were mixed for 50-bp single-read sequencing on HiSeq 2500 in rapid-run mode, and approximately 68 million mapped reads were obtained. Next-generation sequencing was done by the Genome Network Analysis Support Facility of Riken.
Figure 4 | β-Galactoside hydrolysis activity in artificial cell-like compartments. Multiplex sequencing identified LacZ as a common gene that was included in fluorescent liposomes. To verify this, liposomes that contained the PURE system solution, 100 µM CMFDG, 1 µM transferrin-Alexa Fluor 647 conjugate (volume marker) and 5 nM LacZ were constructed and assayed for their reproducibility of β-galactoside hydrolysis activity. As a negative control, liposomes without LacZ were also constructed. Liposomes were incubated at 37 °C and analysed by FACS. At 0 h, CMFDG-derived fluorescence (abscissa) was not detected in either group of liposomes, while after 3 h of incubation, intense fluorescence was detected only in liposomes that contained 5 nM LacZ, which verified the reconstruction of β-galactoside hydrolysis activity. Almost all liposomes had shifted to the right and 8.9% of these liposomes were in the upper right quadrant, which indicated intense fluorescence.
Data analysis. Mapping of the output data from HiSeq 2500 was done for E. coli ORF nucleotide sequences obtained from Genobase (http://ecoli.naist.jp/GB8-dev/index.jsp?page=genome_download.jsp) using Bowtie2 [16]. The abundance of ORFeome clones was quantified using R software. To verify the quality of the E. coli ORF library, the number of reads for each of the 4,123 genes was counted using sequence data (Supplementary Table 1). To identify the necessary and sufficient conditions for the β-galactoside hydrolysis subsystem, the genes in the isolated liposomes were determined. To remove non-specific mapping, identified genes were listed using a threshold of 10,000 reads (Supplementary Table 2). Hierarchical cluster analysis using Cluster 3.0 [17] was used to detect any specific patterns among the genes in liposomes. To organize clusters, complete linkage was used as the clustering method and Euclidean distances were used as similarity measures. JAVA Treeview [18] was used to visualize the clustering results.
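The analysis steps above (read-count thresholding, a gene-by-liposome presence matrix, and complete-linkage clustering on Euclidean distances) can be illustrated with a small pure-Python sketch. This is not Cluster 3.0 and not the authors' pipeline: it is a naive stand-in for illustration. The function names are hypothetical, the toy read counts are made up, and treating the threshold as "at least 10,000 reads" is an assumption.

```python
import math

READ_THRESHOLD = 10_000  # threshold stated in the text; ">=" is an assumption


def genes_present(read_counts, threshold=READ_THRESHOLD):
    """Genes whose mapped read count reaches the threshold in one liposome."""
    return {gene for gene, n in read_counts.items() if n >= threshold}


def presence_matrix(liposomes):
    """Map each gene to a 0/1 row with one column per liposome."""
    gene_sets = [genes_present(rc) for rc in liposomes]
    genes = sorted(set().union(*gene_sets))
    return {g: [1 if g in s else 0 for s in gene_sets] for g in genes}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(rows):
    """Naive agglomerative clustering of genes; returns the merge history.

    The distance between two clusters is the largest pairwise Euclidean
    distance between their members (complete linkage).
    """
    clusters = [(name,) for name in rows]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(euclidean(rows[a], rows[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


def common_genes(rows):
    """Genes present in every liposome (all-ones rows)."""
    return [g for g, row in rows.items() if all(row)]


# Toy, made-up read counts for three fluorescent liposomes (illustration only).
toy_liposomes = [
    {"LacZ": 52000, "geneA": 31000, "geneB": 120},
    {"LacZ": 48000, "geneC": 15000},
    {"LacZ": 61000, "geneA": 9000, "geneD": 22000},
]

rows = presence_matrix(toy_liposomes)
merges = complete_linkage(rows)
print("common genes:", common_genes(rows))
```

In this toy input only LacZ passes the threshold in every liposome, mirroring the pattern the cluster analysis is meant to reveal. For real data, a dedicated clustering tool remains the better choice because this naive loop scales poorly with the number of genes.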