Robust and predictably performing synthetic circuits rely on the use of well-characterized regulatory parts across different genetic backgrounds and environmental contexts. Here we report the large-scale metagenomic mining of thousands of natural 5′ regulatory sequences from diverse bacteria, and their multiplexed gene expression characterization in industrially relevant microbes. We identified sequences with broad and host-specific expression properties that are robust in various growth conditions. We also observed substantial differences between species in terms of their capacity to utilize exogenous regulatory sequences. Finally, we demonstrate programmable species-selective gene expression that produces distinct and diverse output patterns in different microbes. Together, these findings provide a rich resource of characterized natural regulatory sequences and a framework that can be used to engineer synthetic gene circuits with unique and tunable cross-species functionality and properties, and also suggest the prospect of ultimately engineering complex behaviors at the community level.
We thank members of the Wang lab for helpful discussions and feedback. H.H.W. acknowledges funding support from the NIH (1DP5OD009172-02, 1U01GM110714-01A1), NSF (MCB-1453219), Sloan Foundation (FR-2015-65795), DARPA (W911NF-15-2-0065), and ONR (N00014-15-1-2704). N.I.J. is supported by an NSF Graduate Research Fellowship (DGE-16-44869). S.S.Y. is supported by the National Research Foundation of Korea (NRF-2017R1A6A3A03003401). We also thank T. Seto for help with plasmid construction; A. Figueroa for assistance with cell sorting; H. Salis for helpful discussions regarding the RBS calculator; D.B. Goodman for discussions regarding FACS-seq; G.M. Church (Harvard Medical School, Boston, Massachusetts, USA) for access to OLS libraries; and D. Dubnau (Rutgers New Jersey Medical School, Newark, New Jersey, USA), S. Lory, and A. Rasouly (both at Harvard Medical School, Boston, Massachusetts, USA) for providing the BD3182 and PAO1 Δpsy2 strains.
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Metadata of the 184 donor genomes used to derive the regulatory sequences used in this study.
(a) genome size, (b) genomic GC content, (c) gram staining, (d) lifestyle, (e) number of regulatory sequences mined per genome, (f) the number of genomes per phylum, and (g) the 16S phylogenetic tree.
Vector maps for pNJ1, pNJ2.1, and pNJ3.1, used for expression measurements of the metagenomic regulatory sequence (RS) library in E. coli, B. subtilis, and P. aeruginosa, respectively, and for pNJ6.0, pNJ3.1, pNJ7, and pNJ8, which were used for RS241 library measurements in E. coli, B. subtilis, P. aeruginosa, S. enterica, V. natriegens, and C. glutamicum.
(a) Correlation of transcriptional measurements of RS library across two independent replicate cultures (>10 DNA counts across both replicates, n = 18,845) in E. coli performed on different days. (b) Correlation of transcriptional measurements of identical RSs with two different barcodes in E. coli (>10 DNA counts across both constructs, n = 2,273). Pearson correlation (r) is listed in each panel.
(a) Correlation of pooled RNA-seq measurements with individual RT-PCR data from isolate strains containing RS library members for three host species. (b) GFP fluorescence distributions of post-FACS RS library populations displayed as violin plots (n = 10,000 cells, mean value shown as horizontal bar). (c) Correlation of pooled FACS-seq measurements with individual flow cytometry measurements of isolate strains. Pearson correlation coefficients (r) and sample sizes (n) are listed in each subplot.
Correlation between transcription (a) and translation (b) data measured using sfGFP and an alternate reporter mCherry. Sample sizes (n) and Pearson correlation coefficients (r) are listed in the lower right of each plot.
Distribution of transcription start sites (TSSs) for active regulatory sequences containing one primary TSS with >70% of reads starting within +/- 5 bp. Most TSSs occur between 20-50 bp upstream of the start codon for B. subtilis, E. coli, and P. aeruginosa.
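The primary-TSS criterion in this caption can be illustrated with a minimal Python sketch. It assumes read-start positions are given in bp relative to the start codon and that the ±5 bp window is centered on the most frequent start position; the function name `primary_tss` and both of these conventions are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter

def primary_tss(read_starts, window=5, min_fraction=0.70):
    """Return the modal read-start position if more than `min_fraction` of
    reads start within +/- `window` bp of it; otherwise return None.

    Assumption: the window is centered on the most frequent start position.
    """
    if not read_starts:
        return None
    counts = Counter(read_starts)
    # Most frequent start; ties are broken by the smallest position.
    peak = min(counts, key=lambda pos: (-counts[pos], pos))
    near = sum(n for pos, n in counts.items() if abs(pos - peak) <= window)
    return peak if near / len(read_starts) > min_fraction else None
```

For example, a sequence with eight reads starting at -35, one at -33, and one at -80 would be assigned a primary TSS at -35, whereas an even split between -35 and -80 would yield no primary TSS.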
(a) Transcription activity for 18,205 members of the RS library across multiple growth conditions in E. coli is clustered and shown as a heatmap. Transcription levels are log2 (RNA/DNA) ratios normalized by the mean activity of control sequences (see Methods). (b) Ranked TSS locations of each RS measured in E. coli during LB exponential phase are shown, along with the TSS distribution (top panel) and the frequency of multiple TSSs (inset) of the RS library. (c) Frequency of matching TSS positions for RSs in LB and M9 growth media. A Pearson correlation of 1 signifies perfectly matched TSSs between conditions, and -1 denotes no correlation or anti-correlation. Intermediate values denote partial TSS matching. Example RSs with high, moderate, and no correlation in TSS positions in LB and M9 are shown in the inset (n = 18,205). (d) A subset of 100 robust RSs with condition-invariant transcription levels of different strengths (top panel) generated from a single TSS of different untranslated region (UTR) lengths (bottom panel) is provided as a useful community resource.
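The transcription measure described in (a) can be sketched in Python as follows. This is a hedged illustration, not the published Methods: the pseudocount on RNA counts, the DNA count threshold (borrowed from the >10 DNA count filter mentioned for the replicate comparison), and the choice to normalize by subtracting the mean log2 ratio of the controls are all assumptions, as is the function name `transcription_activity`.

```python
import math

def transcription_activity(rna, dna, controls, min_dna=10, pseudocount=1.0):
    """Compute log2(RNA/DNA) per regulatory sequence, normalized to controls.

    rna, dna: dicts mapping sequence ID -> read count.
    controls: IDs of control sequences used for normalization.
    Assumptions: a pseudocount is added to RNA counts to avoid log2(0), and
    normalization subtracts the mean log2 ratio of the controls.
    """
    raw = {
        rs: math.log2((rna.get(rs, 0) + pseudocount) / d)
        for rs, d in dna.items()
        if d > min_dna  # drop sequences with too few DNA reads
    }
    control_values = [raw[c] for c in controls if c in raw]
    if not control_values:
        raise ValueError("no control sequence passed the DNA count filter")
    baseline = sum(control_values) / len(control_values)
    return {rs: value - baseline for rs, value in raw.items()}
```

Subtracting in log space is equivalent to dividing each RNA/DNA ratio by the geometric mean of the control ratios; an arithmetic-mean normalization would be an equally plausible reading of the caption.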
Supplementary Figure 8 Comparison of TSS data for regulatory sequences (RSs) across growth conditions in E. coli.
(a) A histogram of the distribution of all 10 pairwise comparisons of TSS position of regulatory sequences measured in 5 growth conditions (LB exponential growth phase, LB-exp; LB exponential with iron depletion, LB-Fe; LB exponential with high salt, LB-NaCl; LB stationary phase, LB-stat; M9 minimal media exponential phase, M9-exp) is shown (n = 18,205). Perfectly matched TSSs in two conditions have a Pearson correlation of 1, while an unmatched pair of TSSs has a correlation of -1. (b) A histogram of the mean TSS correlations (Pearson r) of all RSs across all pairwise conditions shows that almost half of RSs have the same TSS across all 5 conditions (n = 18,205).
(a) Motif analysis of promoters binned by activity levels. The top two motifs identified by MEME for each recipient at the four activity bins (low, medium low, medium high, high) are shown. All motifs resembled the σ70 motif or its degenerate versions. Statistically non-significant motifs are displayed in gray. Additional MEME motif outputs are not shown since none were significantly different from σ70-like motifs. (b) Transcriptional activity heatmap grouped by hierarchical clustering (n=395). Motif finding was performed to identify motifs across ten clusters. The corresponding motif for each cluster is indicated by a colored circle. (c) Removing regulatory sequences containing the σ70 motif from the dataset and repeating the analysis performed in (a) did not reveal additional non-σ70-like motifs (n=76). Statistically non-significant motifs (MEME E-value > 1e-2) are displayed in gray in (b) and (c).
Supplementary Figure 10 The σ70 motif is the dominant factor governing transcriptional activity of horizontally acquired regulatory sequences.
(a) Pearson correlations of transcriptional activity versus promoter GC content (%GC), RNA structural stability (ΔG RNA), best σ70 match score (max(σ70)), and number of σ70 matches (n(σ70)) are displayed per recipient species. (b) Partial correlation of activity versus each variable, controlling for the other variables. Sample sizes (n) are 4314, 14809, and 17787 regulatory sequences for B. subtilis, E. coli, and P. aeruginosa, respectively.
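The difference between the plain correlation in (a) and the partial correlation in (b) can be illustrated with a short Python sketch. It uses the standard recursive formula for partial correlation; the function names `pearson` and `partial_corr` are illustrative, and nothing here is claimed to reproduce the authors' actual analysis code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, controls):
    """Correlation of x and y after controlling for each variable in
    `controls` (a list of sequences), via the recursive formula
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))."""
    if not controls:
        return pearson(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

In this setting, `x` would be transcriptional activity, `y` one feature such as %GC, and `controls` the remaining three features. A feature that correlates with activity only because it tracks another feature would show a high value in (a) but a partial correlation near zero in (b).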
Supplementary Figure 11 Regulatory sequence translation levels determined by FACS-seq in B. subtilis, E. coli, and P. aeruginosa.
(a) The distribution of GFP fluorescence values of the regulatory sequence library in each recipient. (b) Translational activity of 8,898 regulatory sequences with measurable GFP fluorescence data across all three recipients. (c) Analysis of ribosome binding site sequence motifs in highly translated constructs. Motif logos were constructed using WebLogo v3.5.0. The genomic GC content of each species was used for the background nucleotide frequency model and is listed in each subplot.
Heatmap panels show the fraction of the RS library distributed across bins of transcription and translation levels in three recipients (colored columns). Donor RSs from Firmicutes genomes are shown in (a) and from Proteobacteria genomes in (b). The top row of each heatmap subpanel uses values normalized by the total number of regulatory sequences. The middle row uses values normalized by each column bin, corresponding to transcription windows. The bottom row uses values normalized by each row bin, corresponding to translation windows. Gray rows indicate bins with fewer than 10 RSs in total, which were insufficient for analysis.
(a) Correlation of regulatory sequence activity in terms of transcription level and translation efficiency (calculated as the ratio of GFP protein levels to transcription levels) between recipient species. Each point corresponds to a single regulatory sequence that has measurable transcription and translation data. Pearson correlation coefficient (r) and statistical significance values (p) are shown for each subplot (n=212 for all six panels). (b) Correlation between calculated translation (TL) efficiency based on the RBS calculator and our measured translation efficiency across highly transcribed regulatory sequences (top 15%) in each recipient species (n = 581, 2276, and 2198 for B. subtilis, E. coli, and P. aeruginosa, respectively).
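The translation efficiency ratio in (a) and the top-15% transcription filter in (b) can be sketched in Python as below. This sketch assumes both GFP and transcription levels are supplied on a linear (non-log) scale and that "top 15%" means the highest-ranked 15% of sequences by transcription level; the function names and the rounding rule are illustrative assumptions.

```python
def translation_efficiency(gfp, transcription):
    """Ratio of GFP level to transcription level per regulatory sequence.

    Assumption: both inputs are linear-scale values; sequences without a
    positive transcription level are skipped.
    """
    return {
        rs: gfp[rs] / transcription[rs]
        for rs in gfp
        if rs in transcription and transcription[rs] > 0
    }

def top_fraction(values, fraction=0.15):
    """IDs of the highest-valued `fraction` of entries (at least one)."""
    k = max(1, int(len(values) * fraction))
    ranked = sorted(values, key=values.get, reverse=True)
    return set(ranked[:k])
```

A comparison like the one in (b) would then restrict the measured efficiencies to `top_fraction(transcription)` before correlating them with predicted values.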
Regulatory sequences are sorted by activity (from high to low) per species by (a) transcription or (b) translation levels. Regulatory sequences are re-sorted by mean transcription levels (from low to high) across all species and plotted for (c) transcription and (d) translation levels. Transcriptional values were normalized with the highest expression construct having a value of 10^6. Gray lines correspond to sequences where no data was available. Species names are abbreviated as: B. subtilis, B.s.; C. glutamicum, C.g.; P. aeruginosa, P.a.; V. natriegens, V.n.; S. enterica, S.e.; E. coli, E.c.
(a) Pairwise Pearson correlation of transcription (blue triangle) and translation (green triangle) activity profiles of the RS241 library across six host species. Species are arranged based on their 16S phylogenetic similarity. Numbers in each box correspond to the Pearson correlation coefficients (n = 241). (b) Scatter plot showing each pairwise correlation described in (a).
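A pairwise correlation table of the kind shown in (a) can be computed with a short Python sketch. It assumes each species has an activity profile listed in the same regulatory sequence order, with `None` marking sequences without data, and that each pair is correlated over the sequences measured in both species; this handling of missing values is an assumption, and the function names are illustrative.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlations(profiles):
    """Map each species pair to the Pearson r of their activity profiles.

    profiles: dict of species -> list of activities in a shared sequence
    order, with None where no data was available.
    """
    result = {}
    for a, b in combinations(profiles, 2):
        # Keep only sequences measured in both species (assumption).
        pairs = [(u, v) for u, v in zip(profiles[a], profiles[b])
                 if u is not None and v is not None]
        xs, ys = zip(*pairs)
        result[(a, b)] = pearson(xs, ys)
    return result
```

Running this once on transcription profiles and once on translation profiles would give the two triangles of the matrix.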
Supplementary Figures 1–15
Regulatory sequence library metadata
Library expression data for B. subtilis, E. coli, and P. aeruginosa
Library expression data for E. coli in five growth conditions
RS241 library expression data in six species
Strains and materials used in this study
About this article
Cite this article
Johns, N., Gomes, A., Yim, S. et al. Metagenomic mining of regulatory elements enables programmable species-selective gene expression. Nat Methods 15, 323–329 (2018). https://doi.org/10.1038/nmeth.4633