GREAT improves functional interpretation of cis-regulatory regions

Journal name: Nature Biotechnology
Volume: 28
Pages: 495–501
Year published:
DOI: doi:10.1038/nbt.1630
Published online

Abstract

We developed the Genomic Regions Enrichment of Annotations Tool (GREAT) to analyze the functional significance of cis-regulatory regions identified by localized measurements of DNA binding events across an entire genome. Whereas previous methods took into account only binding proximal to genes, GREAT is able to properly incorporate distal binding sites and control for false positives using a binomial test over the input genomic regions. GREAT incorporates annotations from 20 ontologies and is available as a web application. Applying GREAT to data sets from chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) of multiple transcription-associated factors, including SRF, NRSF, GABP, Stat3 and p300 in different developmental contexts, we recover many functions of these factors that are missed by existing gene-based tools, and we generate testable hypotheses. The utility of GREAT is not limited to ChIP-seq, as it could also be applied to open chromatin, localized epigenomic markers and similar functional data sets, as well as comparative genomics sets.
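The region-based binomial test mentioned above can be sketched in a few lines. This is an illustrative sketch only, not GREAT's implementation: it assumes each input region independently falls in a term's regulatory domains with probability equal to the fraction of the genome annotated with that term (the expected fraction described in Figure 1b). The function names `binomial_upper_tail` and `region_enrichment_p` and the numbers in the usage example are hypothetical.

```python
import math


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """Return Pr[X >= k] for X ~ Binomial(n, p), summed in log space."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k <= 0:
        return 1.0
    if k > n or p == 0.0:
        return 0.0
    if p == 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    # log of C(n, i) * p^i * (1 - p)^(n - i) for i = k..n
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * log_p + (n - i) * log_q
        for i in range(k, n + 1)
    ]
    peak = max(log_terms)  # factor out the largest term to avoid underflow
    total = math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)
    return min(1.0, total)


def region_enrichment_p(term_genome_fraction: float,
                        n_regions: int,
                        n_regions_hitting_term: int) -> float:
    """Hypothetical helper: P value for seeing at least the observed number
    of input regions inside a term's regulatory domains by chance."""
    return binomial_upper_tail(n_regions, n_regions_hitting_term,
                               term_genome_fraction)


if __name__ == "__main__":
    # Made-up numbers: a term covering 2% of the genome, 1,000 input
    # regions, 45 of which fall in the term's regulatory domains.
    print(region_enrichment_p(0.02, 1000, 45))
```

Because the test counts regions rather than genes, a gene with a large regulatory domain raises the expected fraction for its terms instead of inflating their significance.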

At a glance

Figures

    Figure 1: Enrichment analysis of a set of cis-regulatory regions.

    (a) The current prevailing methodology associates only proximal binding events with genes and performs a gene-list test of functional enrichments using tools originally designed for microarray analysis. (b) GREAT's binomial approach over genomic regions uses the total fraction of the genome associated with a given ontology term (green bar) as the expected fraction of input regions associated with the term by chance.

    Figure 2: Binding profiles and their effects on statistical tests.

    (a) ChIP-seq data sets of several regulatory proteins show that the majority of binding events lie well outside the proximal promoter, both for sequence-specific transcription factors (SRF and NRSF, ref. 8; Stat3, ref. 43) and a general enhancer-associated protein (p300, refs. 33,43). Cell type is given in parentheses: H, human; M, mouse. (b) When not restricted to proximal promoters, the gene-based hypergeometric test (red) generates false positive enriched terms, especially at the size range of 1,000–50,000 input regions typical of a ChIP-seq set. Negligible false positive enrichment was observed for the region-based binomial test (blue). For each set size, we generated 1,000 random input sets in which each base pair in the human genome was equally likely to be included in each set, avoiding assembly gaps. We calculated all GO term enrichments for both hypergeometric and binomial tests using GREAT's 5+1 kb basal promoter and up to 1 Mb extension association rule (see Results). Plotted is the average number of terms artificially significant at a threshold of 0.05 after application of the conservative Bonferroni correction. (c) GO enrichment P values using the genomic region-based binomial (x axis) and gene-based hypergeometric (y axis) tests on the SRF data (ref. 8) with GREAT's 5+1 kb basal promoter and up to 1 Mb extension association rule (see Results). b1 through b10 denote the top ten most enriched terms when we used the binomial test. h1 through h10 denote the top ten most enriched terms when we used the hypergeometric test. Terms significant by both tests (B ∩ H) provide specific and accurate annotations supported by multiple genes and binding events (Table 3). Terms significant by only the hypergeometric test (H\B) are general and often associated with genes of large regulatory domains, whereas terms significant by only the binomial test (B\H) cluster four to six genomic regions near only one or two genes annotated with the term (Supplementary Table 46).

    Figure 3: Distal binding events contribute substantially to accurate functional enrichments of p300 limb peaks.

    We examined properties of the 2,105 p300 mouse embryonic limb peaks (ref. 33) in the context of three known limb-related terms and a negative control term (GO cortical cytoskeleton). Three different association rules were used (see Results): a gene-based GREAT analysis using only peaks within 2 kb of the nearest transcription start site (labeled 2 kb), an analysis with 5+1 kb basal and up to 50 kb extension (50 kb), and an analysis with 5+1 kb basal and up to 1 Mb extension (1 Mb). For each term, we examined the relevance of distal binding peaks by comparing the experimental results (black bars) to the average values of 1,000 simulated data sets (gray bars) in which the 192 proximal ChIP-seq peaks within 2 kb of the nearest transcription start site were fixed and the 1,913 distal peaks were shuffled uniformly within the mouse genome, avoiding assembly gaps and proximal promoters. By design, simulation results for proximal, 2-kb GREAT are identical to the actual data and are thus omitted. (a) Lengthening a 2-kb proximal promoter to a 50-kb extension, expected to increase genome coverage per term (pπ in Fig. 1b) by 25-fold, causes an actual increase of 19- to 24-fold; in contrast, lengthening a 50-kb extension rule to a 1-Mb extension rule, expected to raise genome coverage 20-fold, leads to an actual increase of only 2.5- to 6-fold because regulatory domains are not extended through neighboring genes. (b) As regulatory domains increase in length from only the proximal 2 kb up to 50 kb and 1 Mb, the number of relevant genes with a p300 limb peak in their regulatory domain increases. The added genes selected only by distal associations are typically enriched for limb functionality compared to simulated data. (c) As regulatory domains increase in length, the number of p300 limb peaks associated with a relevant gene in excess of the number expected by chance increases for all limb-related terms. (d) As in c, the inclusion of distal peaks markedly increases the statistical significance of the correct terms alone. *Statistical significance is measured using the hypergeometric test over genes for 2 kb to mimic current gene-based approaches, and using the binomial test over genomic regions for 50 kb and 1 Mb. Error bars indicate s.d.; NS, not significant at a threshold of 0.05 after false discovery rate multiple test correction; obs, observed; exp, expected. Note scale changes on x axes.
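For contrast with the region-based test, the gene-based hypergeometric test referred to in the captions above can be sketched as follows, together with the Bonferroni correction used in Figure 2b. This is a generic textbook formulation under stated assumptions, not the code used in the paper: the function names `hypergeom_upper_tail` and `bonferroni`, the term labels and all numbers in the usage example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_genes: int, n_term_genes: int,
                         n_selected: int, n_selected_with_term: int) -> float:
    """Return Pr[X >= n_selected_with_term] when n_selected genes are drawn
    without replacement from n_genes, of which n_term_genes carry the term."""
    if n_selected_with_term <= 0:
        return 1.0
    upper = min(n_term_genes, n_selected)
    if n_selected_with_term > upper:
        return 0.0
    total = comb(n_genes, n_selected)
    # comb() returns 0 when the draw is impossible, so no extra bounds checks.
    favourable = sum(
        comb(n_term_genes, i) * comb(n_genes - n_term_genes, n_selected - i)
        for i in range(n_selected_with_term, upper + 1)
    )
    return favourable / total


def bonferroni(p_values: dict) -> dict:
    """Multiply each P value by the number of tested terms, capped at 1."""
    m = len(p_values)
    return {term: min(1.0, p * m) for term, p in p_values.items()}


if __name__ == "__main__":
    # Made-up numbers: 18,000 genes in total, 1,200 genes picked by at least
    # one input region, and two hypothetical terms.
    raw = {
        "term A": hypergeom_upper_tail(18000, 150, 1200, 25),
        "term B": hypergeom_upper_tail(18000, 900, 1200, 65),
    }
    for term, p in bonferroni(raw).items():
        print(term, p, "significant" if p < 0.05 else "NS")
```

The test sees only whether a gene was selected, not how many regions selected it or how large its regulatory domain is, which is consistent with the caption's observation that it favors genes with large regulatory domains once distal regions are included.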

References

  1. Johnson, D.S., Mortazavi, A., Myers, R.M. & Wold, B. Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007).
  2. Mardis, E.R. ChIP-seq: welcome to the new frontier. Nat. Methods 4, 613–614 (2007).
  3. Park, P.J. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680 (2009).
  4. Ji, H. et al. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol. 26, 1293–1300 (2008).
  5. Kharchenko, P.V., Tolstorukov, M.Y. & Park, P.J. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat. Biotechnol. 26, 1351–1359 (2008).
  6. Rozowsky, J. et al. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat. Biotechnol. 27, 66–75 (2009).
  7. Tuteja, G., White, P., Schug, J. & Kaestner, K.H. Extracting transcription factor targets from ChIP-Seq data. Nucleic Acids Res. 37, e113 (2009).
  8. Valouev, A. et al. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat. Methods 5, 829–834 (2008).
  9. Khatri, P. & Draghici, S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21, 3587–3595 (2005).
  10. Allison, D.B., Cui, X., Page, G.P. & Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. 7, 55–65 (2006).
  11. Dopazo, J. Functional interpretation of microarray experiments. OMICS 10, 398–410 (2006).
  12. Lowe, C.B., Bejerano, G. & Haussler, D. Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proc. Natl. Acad. Sci. USA 104, 8005–8010 (2007).
  13. Taher, L. & Ovcharenko, I. Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements. Bioinformatics 25, 578–584 (2009).
  14. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
  15. Bejerano, G. et al. Ultraconserved elements in the human genome. Science 304, 1321–1325 (2004).
  16. Bejerano, G. et al. A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature 441, 87–90 (2006).
  17. Dostie, J. et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res. 16, 1299–1309 (2006).
  18. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
  19. Schoenfelder, S. et al. Preferential associations between co-regulated genes reveal a transcriptional interactome in erythroid cells. Nat. Genet. 42, 53–61 (2010).
  20. Spitz, F. & Duboule, D. Global control regions and regulatory landscapes in vertebrate development and evolution. Adv. Genet. 61, 175–205 (2008).
  21. Huang, da W. et al. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 35, W169–W175 (2007).
  22. Chai, J. & Tarnawski, A.S. Serum response factor: discovery, biochemistry, biological roles and implications for tissue injury healing. J. Physiol. Pharmacol. 53, 147–157 (2002).
  23. Miano, J.M., Long, X. & Fujiwara, K. Serum response factor: master regulator of the actin cytoskeleton and contractile apparatus. Am. J. Physiol. Cell Physiol. 292, 70–81 (2007).
  24. Ruan, J. et al. TreeFam: 2008 update. Nucleic Acids Res. 36, D735–D740 (2008).
  25. Linhart, C., Halperin, Y. & Shamir, R. Transcription factor and microRNA motif discovery: the Amadeus platform and a compendium of metazoan target sets. Genome Res. 18, 1180–1189 (2008).
  26. Natesan, S. & Gilman, M. YY1 facilitates the association of serum response factor with the c-fos serum response element. Mol. Cell. Biol. 15, 5975–5982 (1995).
  27. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550 (2005).
  28. Cerami, E.G., Bader, G.D., Gross, B.E. & Sander, C. cPath: open source software for collecting, storing, and querying biological pathways. BMC Bioinformatics 7, 497 (2006).
  29. Bertolotto, C. et al. Cleavage of the serum response factor during death receptor-induced apoptosis results in an inhibition of the c-FOS promoter transcriptional activity. J. Biol. Chem. 275, 12941–12947 (2000).
  30. Poser, S., Impey, S., Trinh, K., Xia, Z. & Storm, D.R. SRF-dependent gene expression is required for PI3-kinase-regulated cell proliferation. EMBO J. 19, 4955–4966 (2000).
  31. Lee, H.J. et al. SRF is a nuclear repressor of Smad3-mediated TGF-beta signaling. Oncogene 26, 173–185 (2007).
  32. Chen, C.R., Kang, Y., Siegel, P.M. & Massagué, J. E2F4/5 and p107 as Smad cofactors linking the TGFbeta receptor to c-myc repression. Cell 110, 19–32 (2002).
  33. Visel, A. et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457, 854–858 (2009).
  34. Blake, J.A. et al. The Mouse Genome Database genotypes:phenotypes. Nucleic Acids Res. 37, D712–D719 (2009).
  35. Wilkie, A.O. & Morriss-Kay, G.M. Genetics of craniofacial development and malformation. Nat. Rev. Genet. 2, 458–468 (2001).
  36. Capdevila, J. & Izpisúa Belmonte, J.C. Patterning mechanisms controlling vertebrate limb development. Annu. Rev. Cell Dev. Biol. 17, 87–132 (2001).
  37. Kretzschmar, M. & Massagué, J. SMADs: mediators and regulators of TGF-beta signaling. Curr. Opin. Genet. Dev. 8, 103–111 (1998).
  38. Bult, C.J., Eppig, J.T., Kadin, J.A., Richardson, J.E. & Blake, J.A. The Mouse Genome Database (MGD): mouse biology and model systems. Nucleic Acids Res. 36, D724–D728 (2008).
  39. Niswander, L. Pattern formation: old models out on a limb. Nat. Rev. Genet. 4, 133–143 (2003).
  40. Zhou, C.J., Borello, U., Rubenstein, J.L. & Pleasure, S.J. Neuronal production and precursor proliferation defects in the neocortex of mice with loss of function in the canonical Wnt signaling pathway. Neuroscience 142, 1119–1131 (2006).
  41. Wurst, W. & Bally-Cuif, L. Neural plate patterning: upstream and downstream of the isthmic organizer. Nat. Rev. Neurosci. 2, 99–108 (2001).
  42. Park, C.C. et al. Fine mapping of regulatory loci for mammalian gene expression using radiation hybrids. Nat. Genet. 40, 421–429 (2008).
  43. Chen, X. et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106–1117 (2008).
  44. Kent, W.J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
  45. Hsu, F. et al. The UCSC Known Genes. Bioinformatics 22, 1036–1046 (2006).
  46. The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).
  47. Lettice, L.A. et al. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum. Mol. Genet. 12, 1725–1735 (2003).
  48. Maston, G.A., Evans, S.K. & Green, M.R. Transcriptional regulatory elements in the human genome. Annu. Rev. Genomics Hum. Genet. 7, 29–59 (2006).
  49. Levings, P.P. & Bungert, J. The human beta-globin locus control region. Eur. J. Biochem. 269, 1589–1599 (2002).
  50. Spitz, F., Gonzalez, F. & Duboule, D. A global control region defines a chromosomal regulatory landscape containing the HoxD cluster. Cell 113, 405–417 (2003).

Author information

Affiliations

  1. Department of Computer Science, Stanford University, Stanford, California, USA.

    • Cory Y McLean,
    • Dave Bristor,
    • Aaron M Wenger &
    • Gill Bejerano
  2. Department of Developmental Biology, Stanford University, Stanford, California, USA.

    • Dave Bristor,
    • Michael Hiller,
    • Bruce T Schaar &
    • Gill Bejerano
  3. Department of Genetics, Stanford University, Stanford, California, USA.

    • Shoa L Clarke
  4. Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, USA.

    • Craig B Lowe

Contributions

C.Y.M. developed the core calculation engine, processed ontologies, analyzed data sets and co-wrote the manuscript. D.B. designed and developed the web application. M.H. added key ontologies and calculated ontology statistics. S.L.C. performed and wrote the SRF analysis. B.T.S. contributed to data set analysis and manuscript writing. A.M.W. guided website design and wrote user documentation. G.B. and C.B.L. devised the different enrichment tests and developed early core calculation engines. G.B. supervised the project and co-wrote the manuscript. All authors edited the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (5M)

    Supplementary Note, Supplementary Figures 1–4 and Supplementary Tables 1–46
