Abstract
Organisms orchestrate cellular functions through transcription factor (TF) interactions with their target genes, although these regulatory relationships are largely unknown in most species. Here we report a high-throughput approach for characterizing TF–target gene interactions across species and its application to 354 TFs across 48 bacteria, generating 17,000 genome-wide binding maps. This dataset revealed themes of ancient conservation and rapid evolution of regulatory modules. We observed rewiring, where the TF sensing and regulatory role is maintained while the arrangement and identity of target genes diverges, in some cases encoding entirely new functions. We further integrated phenotypic information to define new functional regulatory modules and pathways. Finally, we identified 242 new TF DNA binding motifs, including a 70% increase of known Escherichia coli motifs and the first annotation in Pseudomonas simiae, revealing deep conservation in bacterial promoter architecture. Our method provides a versatile tool for functional characterization of genetic pathways in prokaryotes and eukaryotes.
Data availability
Peak files and their assigned genes for each of the 48 genomes are included as supplementary data files along with readme files that detail file formats. Raw FASTQ sequence data files have been submitted to the National Center for Biotechnology Information under BioProject no. PRJNA758434, for which curated metadata are managed in GOLD (https://gold.jgi.doe.gov/; ref. 53). These raw data correspond to Figs. 2–6, Extended Data Figs. 3–10 and Supplementary Figs. 2 and 4. E. coli motifs were downloaded from RegulonDB (http://regulondb.ccg.unam.mx/). Phenotypic data of gene knockouts were downloaded from the Fitness Browser website (http://fit.genomics.lbl.gov/).
Code availability
Primary sequence data analysis was done using open-source software tools as described in the Methods. Custom scripts used to visualize the data are available upon request.
References
Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007).
Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837 (2007).
Robertson, G. et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods 4, 651–657 (2007).
Mikkelsen, T. S. et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448, 553–560 (2007).
Berger, M. F. & Bulyk, M. L. Protein binding microarrays (PBMs) for the rapid, high-throughput characterization of the sequence specificities of DNA binding proteins. Methods Mol. Biol. 338, 245–260 (2006).
Oliphant, A. R., Brandl, C. J. & Struhl, K. Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell. Biol. 9, 2944–2949 (1989).
Ellington, A. D. & Szostak, J. W. In vitro selection of RNA molecules that bind specific ligands. Nature 346, 818–822 (1990).
Tuerk, C. & Gold, L. Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science 249, 505–510 (1990).
O’Malley, R. C. et al. Cistrome and epicistrome features shape the regulatory DNA landscape. Cell 165, 1280–1292 (2016).
Bartlett, A. et al. Mapping genome-wide transcription-factor binding sites using DAP-seq. Nat. Protoc. 12, 1659–1672 (2017).
Fischer, M. S., Wu, V. W., Lee, J. E., O’Malley, R. C. & Glass, N. L. Regulation of cell-to-cell communication and cell wall integrity by a network of MAP kinase pathways and transcription factors in Neurospora crassa. Genetics 209, 489–506 (2018).
de Mendoza, A., Pflueger, J. & Lister, R. Capture of a functionally active methyl-CpG binding domain by an arthropod retrotransposon family. Genome Res. 29, 1277–1286 (2019).
Galli, M. et al. The DNA binding landscape of the maize AUXIN RESPONSE FACTOR family. Nat. Commun. 9, 4526 (2018).
Uygun, S., Azodi, C. B. & Shiu, S.-H. Cis-regulatory code for predicting plant cell-type transcriptional response to high salinity. Plant Physiol. 181, 1739–1751 (2019).
Brooks, M. D. et al. Network Walking charts transcriptional dynamics of nitrogen signaling by integrating validated and predicted genome-wide interactions. Nat. Commun. 10, 1569 (2019).
Ricci, W. A. et al. Widespread long-range cis-regulatory elements in the maize genome. Nat. Plants 5, 1237–1249 (2019).
Nitta, K. R. et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. eLife 4, e04837 (2015).
Hemberg, M. & Kreiman, G. Conservation of transcription factor binding events predicts gene expression across species. Nucleic Acids Res. 39, 7092–7102 (2011).
Kurzchalia, T. V. et al. tRNA-mediated labelling of proteins with biotin. A nonradioactive method for the detection of cell-free translation products. Eur. J. Biochem. 172, 663–668 (1988).
Santos-Zavaleta, A. et al. RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12. Nucleic Acids Res. 47, D212–D220 (2019).
Ishihama, A., Shimada, T. & Yamazaki, Y. Transcription profile of Escherichia coli: genomic SELEX search for regulatory targets of transcription factors. Nucleic Acids Res. 44, 2058–2074 (2016).
Orenstein, Y. & Shamir, R. A comparative analysis of transcription factor binding models learned from PBM, HT-SELEX and ChIP data. Nucleic Acids Res. 42, e63 (2014).
Shimada, T., Fujita, N., Maeda, M. & Ishihama, A. Systematic search for the Cra-binding promoters using genomic SELEX system. Genes Cells 10, 907–918 (2005).
Ishida, Y., Kori, A. & Ishihama, A. Participation of regulator AscG of the β-glucoside utilization operon in regulation of the propionate catabolism operon. J. Bacteriol. 191, 6136–6144 (2009).
Ogasawara, H., Shinohara, S., Yamamoto, K. & Ishihama, A. Novel regulation targets of the metal-response BasS-BasR two-component system of Escherichia coli. Microbiology (Reading) 158, 1482–1492 (2012).
Shimada, T., Yokoyama, Y., Anzai, T., Yamamoto, K. & Ishihama, A. Regulatory role of PlaR (YiaJ) for plant utilization in Escherichia coli K-12. Sci. Rep. 9, 20415 (2019).
Lamers, J., Schippers, B. & Geels, F. in Cereal Breeding Related to Integrated Cereal Production (eds Jorna, M. L. & Slootmaker, L. A. J.) 134–139 (Pudoc, 1988).
Cole, B. J. et al. Genome-wide identification of bacterial plant colonization genes. PLoS Biol. 15, e2002860 (2017).
Price, M. N. et al. Mutant phenotypes for thousands of bacterial genes of unknown function. Nature 557, 503–509 (2018).
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Wheeler, D. L. GenBank. Nucleic Acids Res. 35, D21–D25 (2007).
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Chen, I.-M. A. et al. IMG/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes. Nucleic Acids Res. 47, D666–D677 (2019).
Browning, D. F. & Busby, S. J. The regulation of bacterial transcription initiation. Nat. Rev. Microbiol. 2, 57–65 (2004).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
Eraso, J. M. et al. The highly conserved MraZ protein is a transcriptional regulator in Escherichia coli. J. Bacteriol. 196, 2053–2066 (2014).
Tamames, J., González-Moreno, M., Mingorance, J., Valencia, A. & Vicente, M. Bringing gene order into bacterial shape. Trends Genet. 17, 124–126 (2001).
Feng, D.-F., Cho, G. & Doolittle, R. F. Determining divergence times with a protein clock: update and reevaluation. Proc. Natl Acad. Sci. USA 94, 13028–13033 (1997).
Grenier, F., Matteau, D., Baby, V. & Rodrigue, S. Complete genome sequence of Escherichia coli BW25113. Genome Announc. 2, e01038-14 (2014).
Xu, C., Shi, W. & Rosen, B. P. The chromosomal arsR gene of Escherichia coli encodes a trans-acting metalloregulatory protein. J. Biol. Chem. 271, 2427–2432 (1996).
Chen, J., Yoshinaga, M., Garbinski, L. D. & Rosen, B. P. Synergistic interaction of glyceraldehydes-3-phosphate dehydrogenase and ArsJ, a novel organoarsenical efflux permease, confers arsenate resistance. Mol. Microbiol. 100, 945–953 (2016).
Chen, Y. M., Zhu, Y. & Lin, E. C. The organization of the fuc regulon specifying l-fucose dissimilation in Escherichia coli K12 as determined by gene cloning. Mol. Gen. Genet. 210, 331–337 (1987).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
Zhang, Y. et al. Model-based analysis of ChIP–seq (MACS). Genome Biol. 9, R137 (2008).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, W256–W259 (2019).
Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
Mukherjee, S. et al. Genomes OnLine Database (GOLD) v.8: overview and updates. Nucleic Acids Res. 49, D723–D733 (2021).
Acknowledgements
The work conducted by the US Department of Energy Joint Genome Institute, a US Department of Energy Office of Science User Facility, is supported under contract no. DE-AC02-05CH11231. We thank A. Deutschbauer for generously providing genomic DNA from many of the bacterial species used in this work. We also thank J. Humphries, N. Mouncey, L. Pennacchio, A. Visel and Z. Zhang for helpful discussions and assistance with editing and C. Beecroft and T. Reddy for assistance with data management and Sequence Read Archive submission.
Author information
Contributions
L.A.B. and R.C.O. designed the experiments. L.A.B., J.E.L., H.N., Y.Z., D.J.D., C.G.D. and Y.Y. performed the experiments. L.A.B., A.S., M.M. and M.J.B. analyzed the data. L.A.B. prepared the figures. L.A.B. and R.C.O. prepared the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Methods thanks Juan Fuxman-Bass, Lijia Ma and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Biotin-DAP-seq in Arabidopsis thaliana.
Demonstration of successful biotin-DAP-seq with eukaryotic transcription factors amplified directly from cDNA. The Arabidopsis thaliana TFs AT3G12730, AT1G72010, and AT1G77920 were selected to represent three distinct TF protein families: MYB, TCP, and bZIP, respectively. The binding signal obtained using biotin-tagged proteins closely matches that obtained by DAP-seq with Halo-tagged proteins, as used in the original 2016 DAP-seq publication.
Extended Data Fig. 2 Fragment size in DAP-seq is a tunable parameter.
DAP-seq fragment library insert size is a tunable parameter, depending on the desired resolution. For multiDAP experiments presented in this work, libraries were constructed from genomic DNA sheared to an average size of 75 bp, because we found this to offer high resolution while still accurately capturing known binding sites. Note that at extremely short insert sizes (20 bp, bottom track) the signal begins to decay, likely because the size is too short to capture the local DNA context of clustered AgaR motifs.
Extended Data Fig. 3 The multiDAP assay produces unique and reproducible TF binding maps.
MultiDAP with 92 E. coli TFs and 4 negative control samples, run as three independent replicate 96-well plate experiments, demonstrates unique binding signatures for each TF that are highly reproducible. Signal correlation among these 288 samples was assessed by splitting the entire E. coli genome into 25 bp bins, each with an assigned signal value determined by the sequence read pileup within that bin. Inset shows detail with strong agreement between triplicates, while negative control wells with mock TF expression are largely uncorrelated. This implies that background signal is primarily random noise, while individual TFs specifically enrich for binding site regions. Median correlation between replicates is 94%. See also Supplementary Table 4 for peak numbers and correlations.
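The binned correlation described in this caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `bin_signal` and `pearson` are hypothetical, each read is counted once at its start position rather than as a full pileup, and the use of Pearson correlation is an assumption, since the caption does not name the correlation measure.

```python
from math import sqrt

BIN_SIZE = 25  # bin width in bp, taken from the caption


def bin_signal(read_starts, genome_length, bin_size=BIN_SIZE):
    """Count reads per fixed-width genomic bin.

    Simplification (assumption): each read contributes one count at its
    0-based start position instead of a full per-base pileup.
    """
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    return counts


def pearson(x, y):
    """Pearson correlation of two equal-length signal vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal carries no correlation information
    return cov / sqrt(var_x * var_y)


# Toy replicate data on a 200 bp "genome" (made-up read positions).
rep1 = bin_signal([10, 12, 14, 60, 130], genome_length=200)
rep2 = bin_signal([11, 13, 61, 62, 131], genome_length=200)
r = pearson(rep1, rep2)
print(f"bins per sample: {len(rep1)}, replicate correlation: {r:.2f}")
```

On real data the same comparison would be repeated for every pair of the 288 samples to produce a correlation matrix like the one shown in the figure.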
Extended Data Fig. 4 MultiDAP benchmarking with 92 E. coli TFs.
Triplicate multiDAP experiments with 92 E. coli TFs show that most TFs have strong affinity for a few genomic sites, and some also reproducibly bind weakly to many additional sites. Scatterplots depict the top 10 peaks ranked by fold-enrichment, as compared to merged negative control samples. Boxes represent median and quartiles, with whiskers each extending to 1.5× the interquartile range and outliers shown as individual points. (a) LacI is known to bind at and repress the promoter upstream of lacZ, along with a weaker accessory binding site just inside the lacZ coding sequence. Both of these binding sites were detected as strong peaks, while the additional weaker detected peaks do not have known biological functions. (b) LexA is known to have multiple binding sites at various promoters. The strongest detected peak corresponds to the known target recN; however, in this case even the weakest detected peak, in the ruvA promoter, is a known functional regulatory site of LexA. (c) and (d) Top 10 peaks detected in triplicate experiments with 92 E. coli TFs.
Extended Data Fig. 5 Conservation of P. simiae TF targets.
Quantification of TF target gene conservation across species reveals global patterns of conservation and evolution. Gray-shaded vertical bars mark phylogenetic clades (top to bottom): Gammaproteobacteria, Betaproteobacteria, Alphaproteobacteria, Bacteroidetes, Gram-positive. For each P. simiae TF (rows), the set of target genes in P. simiae was compared to the set of target genes in each of the other 47 species (columns, labeled 1–48). The two species used in this study as a source of TFs (E. coli and P. simiae) are indicated in the phylogenetic tree as colored dots (blue and green, respectively). Target gene similarity was quantified as the number of matching orthologs appearing in pairs of target gene sets. P-values were determined by comparing to a mock set of target genes randomly selected from each genome for 10,000 iterations (see Methods for details). Blue-to-red shades correspond to significance, with darkest red representing the most significant degree of conservation (p-value ≤ 1e-4).
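The permutation test outlined in this caption can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: genes are represented directly by orthogroup identifiers, the function names `overlap` and `conservation_p_value` are hypothetical, and the `(hits + 1) / (n_iter + 1)` pseudocount is a common convention that the caption does not specify (see the paper's Methods for the actual procedure).

```python
import random


def overlap(targets_a, targets_b):
    """Number of orthogroups shared by two target gene sets."""
    return len(set(targets_a) & set(targets_b))


def conservation_p_value(ref_targets, other_targets, other_genome,
                         n_iter=10_000, seed=0):
    """Empirical p-value for target conservation between two species.

    Compares the observed ortholog overlap with the overlap obtained when
    the second species' targets are replaced by an equally sized mock set
    drawn at random from its genome.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = overlap(ref_targets, other_targets)
    n_targets = len(set(other_targets))
    genome = sorted(set(other_genome))
    hits = 0
    for _ in range(n_iter):
        mock = rng.sample(genome, n_targets)
        if overlap(ref_targets, mock) >= observed:
            hits += 1
    # Pseudocount keeps the estimate above zero (assumed convention).
    return observed, (hits + 1) / (n_iter + 1)


# Toy example with made-up orthogroup IDs: 8 of 10 targets are shared.
genome = [f"og{i}" for i in range(1000)]
ref_targets = [f"og{i}" for i in range(10)]
other_targets = [f"og{i}" for i in range(8)] + ["og500", "og501"]

observed, p = conservation_p_value(ref_targets, other_targets, genome)
print(f"shared orthologs: {observed}, empirical p-value: {p:.5f}")
```

With 10,000 iterations the smallest p-value this estimator can report is about 1e-4, which is consistent with the lower bound of the color scale in the figure.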
Extended Data Fig. 6 Evolutionary repurposing of TFs.
All-vs-all comparison of gene sets targeted by a given TF in each species reveals distinct clusters of conservation in different bacterial clades, suggesting TF functional rewiring. An ancestral form of the E. coli TF AscG appears to have diverged to regulate new metabolic functions in the Pseudomonas clade, while the DNA binding specificity has been maintained. P-values were determined by comparing to a mock set of target genes randomly selected from each genome for 10,000 iterations (see Methods for details). (a) The E. coli TF AscG provides an example of two distinct clusters of target genes: one cluster mainly limited to the Enterobacteria, and another extending across the genus Pseudomonas and into the class β-Proteobacteria. (b) A closer inspection of the AscG target operons in the model organisms E. coli MG1655 and P. aeruginosa PAO1, along with predicted orthologs of these genes in other species, suggests that the TF’s function has diverged between the two clusters. Genes are colored according to their orthogroup: E. coli genes and orthologs in solid colors, and those of P. aeruginosa with stripes. Functional predictions or gene names from RefSeq annotations are shown in the legend. (c) Comparison of the Pseudomonas aeruginosa PAO1 PtxS DNA binding sequence motif (top) to that of the Escherichia coli MG1655 TF AscG (bottom) shows high similarity (p-value = 2e-7 as calculated by Tomtom). (d) Despite the nearly identical binding motifs, alignment of the AscG and PtxS amino acid sequences reveals that they share only 24% average amino acid identity across the entire protein sequence (95% coverage). The helix-turn-helix DNA binding domain is conserved at a higher 43% identity, while the C-terminal ligand binding domain shows only 21% amino acid identity. While AscG is known to be induced by the ligand salicin, PtxS is induced by 2-ketogluconate.
Extended Data Fig. 7 DAP-seq allows bundling of gene knockout phenotypes.
Functional predictions of the P. simiae TF Ps293 regulon based on multiDAP (see Fig. 4c) are supported by phenotypic measurements. (a) MultiDAP data indicate that TF Ps293 targets three distantly located operons, here designated target operons 1, 2, and 3. (b) As evidenced by phenotypic measurements of gene knockouts, the genes in each operon are responsible for distinct metabolic functions: operon 1 is involved in gly-glu dipeptide metabolism, operon 2 in 2′-deoxyinosine metabolism, and operon 3 (which contains the TF Ps293 itself) in both of these functions as well as metabolism of several additional carbon sources. The TF knockout results in a phenotype in all of these conditions, while other genes in the regulon appear to be important for growth on only a subset of these carbon sources. This is consistent with the model that TF Ps293 acts as a master regulator for these diverse metabolic pathways. Gene knockouts in operon 1 result in strong growth defects when grown in the presence of gly-glu as the sole carbon source, while knockouts of the TF itself confer a growth advantage under these conditions. In contrast, the knockout phenotypes of genes in operons 2 and 3 do not show this opposing relationship. Taken together, this allows us to predict that TF Ps293 acts as a repressor of operon 1, and an activator of operons 2 and 3.
Extended Data Fig. 8 MultiDAP is an expedient method for generating TF DNA binding motifs.
New E. coli TF motifs from this work that are not represented in RegulonDB (n = 66).
Extended Data Fig. 9 Validation of TF motifs generated by multiDAP.
E. coli motifs compared to known motifs in RegulonDB. We found good agreement between motifs computed from the multiDAP datasets and RegulonDB: 50 matches (86%) of 58 motifs represented in both datasets. Motifs were considered to be matches if the p-value was less than 0.01, as scored by Tomtom.
Extended Data Fig. 10 The DNA binding motif of YiaJ/PlaR.
The motif for E. coli TF YiaJ was recently described by Shimada et al. (top), who established the TF’s function in plant breakdown product utilization and renamed it PlaR. The motif compares closely to the motif that we identified using multiDAP (bottom).
Supplementary information
Supplementary Information
Supplementary Figs. 1–4, notes and references.
Supplementary Table
Supplementary Tables 1–6.
Supplementary Data 1
MultiDAP peaks and corresponding assigned genes in each of 48 genomes.
Supplementary Data 2
E. coli and P. simiae TF DNA binding motifs in MEME format.
About this article
Cite this article
Baumgart, L.A., Lee, J.E., Salamov, A. et al. Persistence and plasticity in bacterial gene regulation. Nat Methods 18, 1499–1505 (2021). https://doi.org/10.1038/s41592-021-01312-2
DOI: https://doi.org/10.1038/s41592-021-01312-2