Metagenomic mining of regulatory elements enables programmable species-selective gene expression


Robust and predictably performing synthetic circuits rely on the use of well-characterized regulatory parts across different genetic backgrounds and environmental contexts. Here we report the large-scale metagenomic mining of thousands of natural 5′ regulatory sequences from diverse bacteria, and their multiplexed gene expression characterization in industrially relevant microbes. We identified sequences with broad and host-specific expression properties that are robust in various growth conditions. We also observed substantial differences between species in terms of their capacity to utilize exogenous regulatory sequences. Finally, we demonstrate programmable species-selective gene expression that produces distinct and diverse output patterns in different microbes. Together, these findings provide a rich resource of characterized natural regulatory sequences and a framework that can be used to engineer synthetic gene circuits with unique and tunable cross-species functionality and properties, and also suggest the prospect of ultimately engineering complex behaviors at the community level.

Figure 1: High-throughput characterization of regulatory sequences from 184 prokaryotic genomes.
Figure 2: Transcriptional activity of the regulatory library across three diverse species.
Figure 3: Assessing regulatory features that govern transcriptional activity.
Figure 4: FACS-seq of RS library.
Figure 5: Species-selective gene circuits.

Accession codes

Primary accessions

Sequence Read Archive



We thank members of the Wang lab for helpful discussions and feedback. H.H.W. acknowledges funding support from the NIH (1DP5OD009172-02, 1U01GM110714-01A1), NSF (MCB-1453219), Sloan Foundation (FR-2015-65795), DARPA (W911NF-15-2-0065), and ONR (N00014-15-1-2704). N.I.J. is supported by an NSF Graduate Research Fellowship (DGE-16-44869). S.S.Y. is supported by the National Research Foundation of Korea (NRF-2017R1A6A3A03003401). We also thank T. Seto for help with plasmid construction; A. Figueroa for assistance with cell sorting; H. Salis for helpful discussions regarding the RBS calculator; D.B. Goodman for discussions regarding FACS-seq; G.M. Church (Harvard Medical School, Boston, Massachusetts, USA) for access to OLS libraries; and D. Dubnau (Rutgers New Jersey Medical School, Newark, New Jersey, USA), S. Lory, and A. Rasouly (both at Harvard Medical School, Boston, Massachusetts, USA) for providing the BD3182 and PAO1 Δpsy2 strains.

Author information

N.I.J., A.L.C.G., C.S.S., M.B.S., E.J.A., S.K., and H.H.W. designed the study. N.I.J., S.S.Y., and H.H.W. performed the experiments. N.I.J., A.L.C.G., A.Y., T.B., and H.H.W. analyzed the data. N.I.J., A.L.C.G., and H.H.W. wrote the manuscript, with input from all other authors.

Corresponding author

Correspondence to Harris H Wang.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Metadata of the 184 donor genomes used to derive the regulatory sequences used in this study.

(a) genome size, (b) genomic GC content, (c) gram staining, (d) lifestyle, (e) number of regulatory sequences mined per genome, (f) the number of genomes per phylum, and (g) the 16S phylogenetic tree.

Supplementary Figure 2 Vector designs.

Vector maps for pNJ1, pNJ2.1, and pNJ3.1, used for expression measurements of the metagenomic regulatory sequence (RS) library in E. coli, B. subtilis, and P. aeruginosa, respectively, and for pNJ6.0, pNJ3.1, pNJ7, and pNJ8, which were used for RS241 library measurements in E. coli, B. subtilis, P. aeruginosa, S. enterica, V. natriegens, and C. glutamicum.

Supplementary Figure 3 Replication experiments to validate method performance.

(a) Correlation of transcriptional measurements of RS library across two independent replicate cultures (>10 DNA counts across both replicates, n = 18,845) in E. coli performed on different days. (b) Correlation of transcriptional measurements of identical RSs with two different barcodes in E. coli (>10 DNA counts across both constructs, n = 2,273). Pearson correlation (r) is listed in each panel.
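The replicate comparison in this caption (a count filter followed by a Pearson correlation) can be sketched as below. This is a minimal illustration, not the authors' pipeline: the input layout (a dict of construct ID to DNA counts and activity) and the names `pearson_r` and `replicate_correlation` are assumptions made for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def replicate_correlation(rep1, rep2, min_dna_counts=10):
    """Correlate activities of constructs passing the DNA-count filter.

    rep1, rep2: dict mapping construct ID -> (dna_counts, activity).
    This layout is hypothetical; only constructs with more than
    min_dna_counts DNA reads in BOTH replicates are compared.
    Returns (pearson_r, number_of_constructs_compared).
    """
    shared = sorted(
        k for k in rep1
        if k in rep2
        and rep1[k][0] > min_dna_counts
        and rep2[k][0] > min_dna_counts
    )
    r = pearson_r([rep1[k][1] for k in shared], [rep2[k][1] for k in shared])
    return r, len(shared)
```

The same function applies to panel (b) if the two dicts hold the two barcoded versions of each RS instead of two replicate cultures.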

Supplementary Figure 4 Validations of gene expression measurements.

(a) Correlation of pooled RNA-seq measurements with individual RT-PCR data from isolate strains containing RS library members for three host species. (b) GFP fluorescence distributions of post-FACS RS library populations displayed as violin plots (n = 10,000 cells, mean value shown as horizontal bar). (c) Correlation of pooled FACS-seq measurements with individual flow cytometry measurements of isolate strains. Pearson correlation coefficients (r) and sample sizes (n) are listed in each subplot.

Supplementary Figure 5 Alternative reporter gene experiments.

Correlation between transcription (a) and translation (b) data measured using sfGFP and an alternate reporter mCherry. Sample sizes (n) and Pearson correlation coefficients (r) are listed in the lower right of each plot.

Supplementary Figure 6 Transcription start sites in three species.

Distribution of transcription start sites (TSSs) for active regulatory sequences containing one primary TSS with >70% of reads starting within +/- 5 bp. Most TSSs occur between 20-50 bp upstream of the start codon for B. subtilis, E. coli, and P. aeruginosa.
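The "one primary TSS" criterion above can be read as a simple rule on the 5′ ends of mapped reads. The sketch below is one possible interpretation, assuming the window is centered on the most frequent read start; the function name `primary_tss` and the input format (a list of read start positions for one regulatory sequence) are hypothetical.

```python
from collections import Counter


def primary_tss(read_starts, window=5, min_fraction=0.7):
    """Return the primary TSS position, or None if no single TSS dominates.

    read_starts: iterable of integer 5' read-start positions for one
    regulatory sequence (hypothetical input format).
    Assumption: the candidate TSS is the most frequent start position, and
    it is accepted when more than min_fraction of all reads start within
    +/- window bp of it.
    """
    counts = Counter(read_starts)
    total = sum(counts.values())
    if total == 0:
        return None
    # Most frequent position; ties are broken toward the smaller coordinate.
    candidate = max(counts, key=lambda pos: (counts[pos], -pos))
    in_window = sum(
        c for pos, c in counts.items() if abs(pos - candidate) <= window
    )
    return candidate if in_window / total > min_fraction else None
```

For example, a sequence with eight reads starting at position 30 and two at position 60 would be assigned a primary TSS at 30, while an even split between the two positions would be rejected.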

Supplementary Figure 7 Alternative-growth-condition transcription data.

(a) Transcription activity for 18,205 members of the RS library across multiple growth conditions in E. coli is clustered and shown as a heatmap. Transcription levels are log2 (RNA/DNA) ratios normalized by the mean activity of control sequences (see Methods). (b) Ranked TSS locations of each RS measured in E. coli during LB exponential phase are shown, along with the TSS distribution (top panel) and the frequency of multiple TSSs (inset) of the RS library. (c) Frequency of matching TSS positions for RSs in LB and M9 growth media. A Pearson correlation of 1 signifies perfectly matched TSSs between conditions, and -1 denotes no correlation or anti-correlation. Intermediate values denote partial TSS matching. Example RSs with high, moderate, and no correlation in TSS positions in LB and M9 are shown in the inset (n = 18,205). (d) A subset of 100 robust RSs with condition-invariant transcription levels of different strengths (top panel) generated from a single TSS of different untranslated region (UTR) lengths (bottom panel) is provided as a useful community resource.
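The transcription metric in panel (a), a log2 (RNA/DNA) ratio normalized by control sequences, might be computed along the following lines. The exact normalization is described in the Methods, which are not reproduced here, so this sketch assumes the mean control log ratio is subtracted from every construct; `normalized_activity` and its arguments are illustrative names.

```python
from math import log2


def normalized_activity(rna_counts, dna_counts, control_ids):
    """Compute log2(RNA/DNA) per construct, centered on control sequences.

    rna_counts, dna_counts: dict mapping construct ID -> read count.
    control_ids: IDs of control sequences (hypothetical argument).
    Assumption: "normalized by the mean activity of control sequences" is
    implemented as subtracting the mean control log2 ratio. Constructs with
    zero RNA or DNA counts are skipped because the log ratio is undefined.
    """
    log_ratio = {
        k: log2(rna_counts[k] / dna_counts[k])
        for k in rna_counts
        if k in dna_counts and rna_counts[k] > 0 and dna_counts[k] > 0
    }
    controls = [log_ratio[k] for k in control_ids if k in log_ratio]
    if not controls:
        raise ValueError("no control sequence has usable counts")
    control_mean = sum(controls) / len(controls)
    return {k: v - control_mean for k, v in log_ratio.items()}
```

Under this assumption, a value of 0 means the construct is as active as the average control, and each unit corresponds to a twofold difference.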

Supplementary Figure 8 Comparison of TSS data for regulatory sequences (RSs) across growth conditions in E. coli.

(a) A histogram of the distribution of all 10 pairwise comparisons of TSS position of regulatory sequences measured in 5 growth conditions (LB exponential growth phase, LB-exp; LB exponential with iron depletion, LB-Fe; LB exponential with high salt, LB-NaCl; LB stationary phase, LB-stat; M9 minimal media exponential phase, M9-exp) is shown (n = 18,205). Perfectly matched TSSs in two conditions have a Pearson correlation of 1, while an unmatched pair of TSSs has a correlation of -1. (b) A histogram of the mean TSS correlations (Pearson r) of all RSs across all pairwise conditions shows that almost half of RSs have the same TSS across all 5 conditions (n = 18,205).

Supplementary Figure 9 De novo motif search.

(a) Motif analysis of promoters binned by activity levels. The top two motifs identified by MEME for each recipient at the four activity bins (low, medium low, medium high, high) are shown. All motifs resembled the σ70 motif or its degenerate versions. Statistically non-significant motifs are displayed in gray. Additional MEME motif outputs are not shown since none were significantly different from σ70-like motifs. (b) Transcriptional activity heatmap grouped by hierarchical clustering (n = 395). Motif finding was performed to identify motifs across ten clusters. The corresponding motif for each cluster is indicated by a colored circle. (c) Removing regulatory sequences containing the σ70 motif from the dataset and repeating the analysis performed in (a) did not reveal additional non-σ70-like motifs (n = 76). Statistically non-significant motifs (MEME E-value > 1e-2) are displayed in gray in (b) and (c).

Supplementary Figure 10 The σ70 motif is the dominant factor governing transcriptional activity of horizontally acquired regulatory sequences.

(a) Pearson correlations of transcriptional activity versus promoter GC content (%GC), RNA structural stability (ΔG RNA), best σ70 match score (max(σ70)), and number of σ70 matches (n(σ70)) are displayed per recipient species. (b) Partial correlations of activity versus each variable, controlling for the other variables. Sample sizes (n) are 4,314, 14,809, and 17,787 regulatory sequences for B. subtilis, E. coli, and P. aeruginosa, respectively.
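The partial correlations in panel (b) measure how activity relates to one feature once the other features are held fixed. One standard way to obtain them, shown here as a sketch and not necessarily the method used in the study, is from the inverse covariance (precision) matrix P, where the partial correlation of variables i and j given all others is -P[i][j] / sqrt(P[i][i] * P[j][j]). The function names and the dict-of-lists input are assumptions for illustration.

```python
from math import sqrt


def covariance_matrix(columns):
    """Sample covariance matrix of a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    k = len(columns)
    return [
        [
            sum((columns[i][t] - means[i]) * (columns[j][t] - means[j])
                for t in range(n)) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(k)]
        for i, row in enumerate(matrix)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (collinear variables?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def partial_correlations(activity, features):
    """Partial correlation of activity with each feature, given the rest.

    activity: list of activity values, one per regulatory sequence.
    features: dict mapping feature name -> list of values (hypothetical
    layout, e.g. GC content, RNA folding energy, sigma70 match score).
    """
    names = list(features)
    precision = invert(covariance_matrix([activity] + [features[n] for n in names]))
    return {
        name: -precision[0][i + 1]
        / sqrt(precision[0][0] * precision[i + 1][i + 1])
        for i, name in enumerate(names)
    }
```

With only two features, the result reduces to the familiar first-order formula (r_y1 - r_y2 r_12) / sqrt((1 - r_y2^2)(1 - r_12^2)), which gives a convenient sanity check.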

Supplementary Figure 11 Regulatory sequence translation levels determined by FACS-seq in B. subtilis, E. coli, and P. aeruginosa.

(a) The distribution of GFP fluorescence values of the regulatory sequence library in each recipient. (b) Translational activity of 8,898 regulatory sequences with measurable GFP fluorescence data across all three recipients. (c) Analysis of ribosome binding site sequence motifs in highly translated constructs. Motif logos were constructed using WebLogo v3.5.0. The genomic GC content of each species was used for the background nucleotide frequency model and is listed in each subplot.

Supplementary Figure 12 Protein expression from Firmicute and Proteobacterial regulatory sequences.

Heatmap panels show the fraction of the RS library distributed across bins of transcription and translation levels in three recipients (colored columns). Donor RSs from Firmicutes genomes are shown in (a) and from Proteobacteria genomes in (b). The top row of each heatmap subpanel uses values normalized by the total number of regulatory sequences. The middle row uses values normalized by each column bin, corresponding to transcription windows. The bottom row uses values normalized by each row bin, corresponding to translation windows. Gray rows indicate bins with fewer than 10 RSs in total, which were insufficient for analysis.

Supplementary Figure 13 Cross-species and in silico comparisons of gene expression levels.

(a) Correlation of regulatory sequence activity in terms of transcription level and translation efficiency (calculated as the ratio of GFP protein levels and transcription levels) between recipient species. Each point corresponds to a single regulatory sequence that has measurable transcription and translation data. Pearson correlation coefficient (r) and statistical significance values (p) are shown for each subplot (n=212 for all six panels). (b) Correlation between calculated translation (TL) efficiency based on the RBS calculator and our measured translation efficiency across highly transcribed regulatory sequences (top 15%) in each recipient species (n = 581, 2276, and 2198 for B. subtilis, E. coli, and P. aeruginosa respectively).
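Translation efficiency is defined in this caption as the ratio of GFP protein level to transcription level, and panel (b) restricts the comparison to the top 15% of sequences by transcription. A minimal sketch of both steps follows; it assumes linear-scale (not log-transformed) inputs, and the function name and dict layout are hypothetical.

```python
def translation_efficiency(gfp, transcription, top_fraction=None):
    """Translation efficiency = GFP level / transcription level.

    gfp, transcription: dict mapping regulatory sequence ID -> linear-scale
    measurement (hypothetical layout). Sequences missing either value, or
    with non-positive transcription, are skipped.
    top_fraction: if given (e.g. 0.15), keep only that fraction of
    sequences with the highest transcription before computing the ratio.
    """
    shared = [k for k in gfp if k in transcription and transcription[k] > 0]
    if top_fraction is not None:
        shared.sort(key=lambda k: transcription[k], reverse=True)
        keep = max(1, int(len(shared) * top_fraction))
        shared = shared[:keep]
    return {k: gfp[k] / transcription[k] for k in shared}
```

Restricting to highly transcribed sequences avoids dividing by transcription values that are close to the noise floor, which would otherwise inflate the ratio.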

Supplementary Figure 14 Regulatory activity of RS241 library in six bacterial species.

Regulatory sequences are sorted by activity (from high to low) per species by (a) transcription or (b) translation levels. Regulatory sequences are re-sorted by mean transcription levels (from low to high) across all species and plotted for (c) transcription and (d) translation levels. Transcriptional values were normalized with the highest expression construct having a value of 10^6. Gray lines correspond to sequences for which no data were available. Species names are abbreviated as: B. subtilis, B.s.; C. glutamicum, C.g.; P. aeruginosa, P.a.; V. natriegens, V.n.; S. enterica, S.e.; E. coli, E.c.

Supplementary Figure 15 Cross-species transcription and translation level correlations.

(a) Pairwise Pearson correlation of transcription (blue triangle) and translation (green triangle) activity profiles of the RS241 library across six host species. Species are arranged based on their 16S phylogenetic similarity. Numbers in each box correspond to the Pearson correlation coefficients (n = 241). (b) Scatter plot showing each pairwise correlation described in (a).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15

Life Sciences Reporting Summary

Supplementary Table 1

Regulatory sequence library metadata

Supplementary Table 2

Library expression data for B. subtilis, E. coli, and P. aeruginosa

Supplementary Table 3

Library expression data for E. coli in five growth conditions

Supplementary Table 4

RS241 library expression data in six species

Supplementary Data Set 1

Strains and materials used in this study

Cite this article

Johns, N., Gomes, A., Yim, S. et al. Metagenomic mining of regulatory elements enables programmable species-selective gene expression. Nat Methods 15, 323–329 (2018).
