Abstract
The ability to determine the location and relative strength of all transcription-factor binding sites in a genome is important both for a comprehensive understanding of gene regulation and for effective promoter engineering in biotechnological applications. Here we present a bioinformatically driven experimental method to accurately define the DNA-binding sequence specificity of transcription factors. A generalized profile [1] was used as a predictive quantitative model for binding sites, and its parameters were estimated from in vitro–selected ligands using standard hidden Markov model training algorithms [2,3]. Computer simulations showed that several thousand low- to medium-affinity sequences are required to generate a profile of desired accuracy. To produce data on this scale, we applied high-throughput genomics methods to the biochemical problem addressed here. A method combining systematic evolution of ligands by exponential enrichment (SELEX) [4] and serial analysis of gene expression (SAGE) [5] protocols was coupled to an automated quality-controlled sequence extraction procedure based on Phred quality scores [6]. This allowed the sequencing of a database of more than 10,000 potential DNA ligands for the CTF/NFI transcription factor. The resulting binding-site model defines the sequence specificity of this protein with a high degree of accuracy not achieved earlier and thereby makes it possible to identify previously unknown regulatory sequences in genomic DNA. A covariance analysis of the selected sites revealed non-independent base preferences at different nucleotide positions, providing insight into the binding mechanism.
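As a rough illustration of two ideas mentioned in the abstract, the sketch below filters reads by Phred quality and then builds a simple log-odds weight matrix from the accepted sequences. It is not the authors' procedure: a plain position weight matrix stands in for the generalized profile and its hidden Markov model training, and the function names, the Q20 cutoff, the pseudocount, the uniform background, and the toy reads are all assumptions made for this example. The only fixed relation used is the standard Phred definition, in which a quality score Q corresponds to an error probability of 10^(-Q/10).

```python
import math
from collections import Counter

BASES = "ACGT"

def phred_to_error_prob(q):
    """Standard Phred relation: Q = -10 * log10(p_error)."""
    return 10 ** (-q / 10)

def passes_quality(quals, min_q=20):
    """Accept a read only if every base call meets the cutoff.
    The cutoff of 20 is a hypothetical choice for this sketch."""
    return all(q >= min_q for q in quals)

def log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a per-position log2-odds matrix from equal-length sites.
    A simplified stand-in for a generalized profile, assuming
    independent positions and a uniform background."""
    length = len(sites[0])
    total = len(sites) + len(BASES) * pseudocount
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        }
        matrix.append(column)
    return matrix

def score(matrix, seq):
    """Sum the per-position log-odds for a candidate site."""
    return sum(column[base] for column, base in zip(matrix, seq))

# Toy reads (sequence, per-base Phred scores); invented for illustration only.
reads = [
    ("TTGGCA", [30, 30, 30, 30, 30, 30]),
    ("TTGGCT", [35, 32, 31, 30, 28, 25]),
    ("TTGGCA", [30, 30, 30, 30, 12, 30]),  # rejected: one base below the cutoff
    ("TGGGCA", [40, 40, 40, 40, 40, 40]),
]

accepted = [seq for seq, quals in reads if passes_quality(quals)]
pwm = log_odds_matrix(accepted)

print(len(accepted), "reads accepted")
print("score TTGGCA:", round(score(pwm, "TTGGCA"), 3))
print("score AACCTG:", round(score(pwm, "AACCTG"), 3))
```

Because each column is scored independently, a matrix of this kind cannot capture the non-independent base preferences that the covariance analysis reports; modeling those would need a richer model than this sketch.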
References
Bucher, P., Karplus, K., Moeri, N. & Hofmann, K. A flexible motif search technique based on generalized profiles. Comput. Chem. 20, 3–29 (1996).
Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, United Kingdom, 1998).
Ehret, G.B. et al. DNA binding specificity of different STAT proteins. Comparison of in vitro specificity with natural target sites. J. Biol. Chem. 276, 6675–6688 (2001).
Klug, S.J. & Famulok, M. All you wanted to know about SELEX. Mol. Biol. Rep. 20, 97–107 (1994).
Velculescu, V.E., Zhang, L., Vogelstein, B. & Kinzler, K.W. Serial analysis of gene expression. Science 270, 484–487 (1995).
Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
Roulet, E., Fisch, I., Junier, T., Bucher, P. & Mermod, N. Evaluation of computer tools for the prediction of transcription factor binding sites on genomic DNA. In Silico Biol. 1, 21–28 (1998).
Roulet, E. et al. Experimental analysis and computer prediction of CTF/NF-I transcription factor DNA binding sites. J. Mol. Biol. 297, 833–848 (2000).
Berg, O.G. & von Hippel, P.H. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 193, 723–750 (1987).
Goodman, S.D., Velten, N.J., Gao, Q., Robinson, S. & Segall, A.M. In vitro selection of integration host factor binding sites. J. Bacteriol. 181, 3246–3255 (1999).
Fields, D.S., He, Y.Y., Al-Uzri, A.Y. & Stormo, G.D. Quantitative specificity of the Mnt repressor. J. Mol. Biol. 271, 178–194 (1997).
Vant-Hull, B., Payano-Baez, A., Davis, R.H. & Gold, L. The mathematics of SELEX against complex targets. J. Mol. Biol. 278, 579–597 (1998).
Meisterernst, M., Gander, I., Rogge, L. & Winnacker, E.L. A quantitative analysis of nuclear factor I/DNA interactions. Nucleic Acids Res. 16, 4419–4435 (1988).
Perier, R.C., Praz, V., Junier, T. & Bucher, P. The eukaryotic promoter database EPD. Nucleic Acids Res. 28, 302–303 (2000).
Man, T.K. & Stormo, G.D. Non-independence of Mnt repressor-operator interactions determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res. 29, 2471–2478 (2001).
Zhang, M.Q. & Marr, T.G. A weight array method for splicing signal analysis. Comput. Appl. Biosci. 9, 499–509 (1993).
Burge, C.B. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
Hughey, R. & Krogh, A. Hidden Markov models for sequence analysis. Extension and analysis of the basic method. Comput. Appl. Biosci. 12, 95–107 (1996).
Acknowledgements
We thank Victor Jongeneel for support and suggestions, Roman Chrast and Stylianos Antonarakis for help with the SAGE procedure, Khalil Kadaoui for assistance, and Alan McNair for helpful comments on the manuscript. The financial support of the Ludwig Institute for Cancer Research, the Etat de Vaud, and the Swiss National Science Foundation (grants 31-63933.00 and 31-59370.99) is gratefully acknowledged.
Cite this article
Roulet, E., Busso, S., Camargo, A. et al. High-throughput SELEX–SAGE method for quantitative modeling of transcription-factor binding sites. Nat Biotechnol 20, 831–835 (2002). https://doi.org/10.1038/nbt718
DOI: https://doi.org/10.1038/nbt718
This article is cited by
- DNA–protein interaction studies: a historical and comparative analysis. Plant Methods (2021)
- A deep learning framework to predict binding preference of RNA constituents on protein surface. Nature Communications (2019)
- From biophysics to ‘omics and systems biology. European Biophysics Journal (2019)
- SELMAP - SELEX affinity landscape MAPping of transcription factor binding sites using integrated microfluidics. Scientific Reports (2016)
- Reliable scaling of position weight matrices for binding strength comparisons between transcription factors. BMC Bioinformatics (2015)