  • Brief Communication

Targeted exploration and analysis of large cross-platform human transcriptomic compendia

Abstract

We present SEEK (search-based exploration of expression compendia; http://seek.princeton.edu/), a query-based search engine for very large transcriptomic data collections, including thousands of human data sets from many different microarray and high-throughput sequencing platforms. SEEK uses a query-level cross-validation–based algorithm to automatically prioritize data sets relevant to the query and a robust search approach to identify genes, pathways and processes co-regulated with the query. SEEK provides multigene query searching with iterative metadata-based search refinement and extensive visualization-based analysis options.

Figure 1: SEEK system overview and systematic functional evaluation.
Figure 2: Search results for the Hedgehog (Hh) query (GLI1, GLI2, PTCH1) and search refinement.

Acknowledgements

This work was supported in part by the US National Institutes of Health (NIH) award R01 HG005998, and in part by the US National Science Foundation (NSF) CAREER award (DBI-0546275) and NIH award R01 GM071966 to O.G.T. The project was also supported in part by the NIH awards T32 HG003284 and P50 GM071508. M.C. was supported by NSF awards CCF 1218687 and CCF 1302518. O.G.T. receives support as part of the Canadian Institute for Advanced Research in the Genetic Networks group. We thank members of the Troyanskaya lab for comments about SEEK in the regular lab meetings and Q. Zhu for critically reading the manuscript. We thank the volunteers from Princeton and other universities, including attendees of the Canadian Institute for Advanced Research Genetic Networks meetings, for testing the SEEK web interface and providing valuable feedback.

Author information

Contributions

Q.Z. and O.G.T. wrote the manuscript. Q.Z., O.G.T., K.L. and M.C. designed the algorithm. Q.Z. implemented the search back end and front end, and performed evaluations. A.K.W., D.C.C., A.K., R.Z., M.R.A. and C.S.G. contributed ideas, performed analyses and edited the manuscript. A.T., L.A.B. and Q.Z. performed data and metadata processing. V.N.K. and O.G.T. contributed ideas in the biological study. O.G.T., K.L. and M.C. conceived of the study and gave guidance.

Corresponding authors

Correspondence to Moses Charikar, Kai Li or Olga G Troyanskaya.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Gene-retrieval performance vs. query size, and comparisons between SEEK and MEM in single- and multiple-gene queries

(a) SEEK's performance vs. query size (the number of query genes). The plot shows the median (black line) and the IQR (shaded area). Retrieval performance increases with query size, showing that the improved query context that results from including more process-relevant genes in the query can help boost gene retrieval. (b) Gene Recommender's performance vs. query size. (c) The performance of SEEK and MEM. The evaluation is the same as that used in Fig. 2 (main text). These plots additionally show the mean (red line), median (black line), and the IQR (shaded area) across 995 processes. (d) Single-gene query retrieval performance.

Supplementary Figure 2 Robustness of SEEK and Gene Recommender to noisy query genes

(a) SEEK's retrieval robustness in the presence of random gene noise in the query. The red line (at 1.0) denotes the performance level of the no-noise queries. Relative performance is plotted; it is defined as the ratio between noisy and no-noise queries of the fold improvement of precision over random at 10% recall (FIOR@10%) (see Methods). The percentages below the box plot show the median per-query performance drop. (b) Gene Recommender's gene-retrieval robustness in the presence of random gene noise in the query.
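The FIOR@10% measure and the relative performance described in this caption can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the rule used to decide when 10% recall is reached (the first rank at which at least 10% of the gold-standard genes have been retrieved), and the toy gene lists are all assumptions made for illustration.

```python
import math

def fold_over_random(ranked_genes, gold_standard, recall=0.10):
    """Precision at the given recall level divided by the precision expected
    from a random ranking (hypothetical helper, not part of SEEK)."""
    gold = set(gold_standard) & set(ranked_genes)
    if not gold:
        raise ValueError("no gold-standard genes appear in the ranking")
    # Number of gold-standard hits needed to reach the recall level
    # (small tolerance guards against floating-point overshoot).
    needed = max(1, math.ceil(recall * len(gold) - 1e-9))
    baseline = len(gold) / len(ranked_genes)  # precision of a random ranking
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in gold:
            hits += 1
            if hits >= needed:
                return (hits / rank) / baseline
    raise AssertionError("unreachable: every gold gene is in the ranking")

def relative_performance(noisy_fold, clean_fold):
    """Ratio of the noisy query's fold improvement to the no-noise query's."""
    return noisy_fold / clean_fold

# Toy data: 100 ranked genes, 10 of which belong to the gold standard.
clean_ranking = [f"g{i}" for i in range(100)]
gold = {f"g{i}" for i in range(0, 100, 10)}
# Assumed effect of query noise: the first gold gene slips from rank 1 to rank 2.
noisy_ranking = ["g1", "g0"] + clean_ranking[2:]

fior_clean = fold_over_random(clean_ranking, gold)
fior_noisy = fold_over_random(noisy_ranking, gold)
ratio = relative_performance(fior_noisy, fior_clean)
```

A ratio of 1.0 corresponds to the red line in the plot; values below 1.0 indicate a performance drop caused by the noisy query genes.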

Supplementary Figure 3 Performance of SEEK and other search systems over increasing numbers of gene expression data sets

A subset of 121 GO-slim terms (Supplementary Data 3) was used to evaluate each system's gene-retrieval performance on six compendium sizes, each built from random subsets of the data sets in its full compendium. FIOR@10% is measured. All algorithms are applied to the same data compendia. MEM, Gene Recommender, and combined data set correlation algorithms do not scale to the large human compendium that SEEK is able to utilize effectively.

Supplementary Figure 4 SEEK's performance across process groups

Each black dot represents a process. Different statistics were used to summarize performance (FIOR@10%) per group: red dot (mean), blue line (75th percentile), orange line (median). Membership of the biological processes in the 11 term groups is determined by text-mining the process title, except for the 3 super-groups (see figure for their definitions). Red arrows indicate examples of top-performing processes: “erythrocyte differentiation” (44-fold), “lysosomal transport” (25-fold), “glutamate signaling” (104-fold), and “digestive system development” (33-fold).

Supplementary Figure 5 Search results for the Hedgehog (Hh) signaling query GLI1 GLI2 PTCH1: the data sets prioritized for the query

The top 10 of the 5,000 data sets prioritized by SEEK are displayed in the inset. These data sets are weighted by the coexpression of the Hh query genes to indicate the abundance of aberrant Hh signaling activation.

Supplementary Figure 6 Search results for the Hedgehog (Hh) signaling query GLI1 GLI2 PTCH1: data set weight significance and gene-retrieval validation

(a) Top-ranked data sets are highly weighted specifically for the Hedgehog (Hh) query. Data set weight significance is calculated by comparison with 100 random queries. The empirical P value for each data set d is the fraction of the 100 random queries in which the score (or weight) that the random query assigns to d is higher than d's score in the Hh query. (b) Gene-retrieval validation, which serves as an indication of the relevance of the search results (i.e., coexpressed genes) to the Hh context. The gold standard consists of 71 Hh genes assembled from KEGG and GO.
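The empirical P value in panel (a) amounts to a simple counting rule. A minimal sketch follows; the function name and the toy weights are hypothetical, and only the strict "higher than" comparison is taken from the caption.

```python
def empirical_p_value(query_weight, random_weights):
    """Fraction of random queries that give the data set a higher weight
    than the real query does (hypothetical helper for illustration)."""
    if not random_weights:
        raise ValueError("at least one random-query weight is required")
    higher = sum(1 for weight in random_weights if weight > query_weight)
    return higher / len(random_weights)

# Toy data: weights that 100 random queries assign to one data set d,
# and the weight d receives under the real (Hh) query.
random_weights = [i / 100 for i in range(100)]  # 0.00, 0.01, ..., 0.99
hh_weight = 0.95
p_value = empirical_p_value(hh_weight, random_weights)
```

With 100 random queries the smallest nonzero P value this rule can report is 0.01, so a result of 0 should be read as P < 0.01 rather than as exactly zero.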

Supplementary Figure 7 Batch-effect analysis

Enrichment of 100 batch-affected data sets (Supplementary Note 1) in full data prioritizations (4,500 data sets) across 121 GO slim queries. This test checked whether the 100 data sets (with severe batch effects) receive lower scores than randomly selected data sets in the ranking. Score-based PAGE enrichment was used. The batch-affected data sets consistently received lower scores than randomly selected data sets (avg. z-score = –6.3, P < 1.39 × 10^−10), showing that low-quality data sets have a relatively small impact on the prioritizations.
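As a rough illustration of a score-based PAGE test, the sketch below computes the standard PAGE z-score: the difference between the subset mean and the overall mean, scaled by the square root of the subset size and divided by the overall standard deviation. Applying that formula to data set scores in this way is an assumption, as are the function name and the toy numbers; the exact procedure is the one given in Supplementary Note 1.

```python
import math
import statistics

def page_z_score(all_scores, subset_scores):
    """PAGE-style z-score of a subset against the full score distribution
    (hypothetical helper; not the authors' implementation)."""
    mu = statistics.fmean(all_scores)
    sigma = statistics.pstdev(all_scores)
    if sigma == 0:
        raise ValueError("all scores are identical; z-score is undefined")
    m = len(subset_scores)
    return (statistics.fmean(subset_scores) - mu) * math.sqrt(m) / sigma

# Toy data: scores of 10 data sets for one query; the two lowest-scoring
# ones play the role of the batch-affected subset.
all_scores = [float(i) for i in range(10)]
batch_affected = [0.0, 1.0]
z = page_z_score(all_scores, batch_affected)
```

A negative z-score means the subset scores lower than the compendium as a whole, which is the direction reported in the caption.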

Supplementary Figure 8 Correlation standardization

(a) Prior to standardization, the distribution of Pearson correlation (r) (for all pairs of genes in the data set) is not directly comparable across platforms. We picked a data set at random from each of 8 major platforms to illustrate this lack of comparability. (b) After normalization of the Pearson correlation by Fisher's transform (ln[(1 + r) / (1 – r)] / 2) followed by standardization, all of the selected data sets from these different platforms are standardized to an N(0,1) normal distribution.
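The two steps in panel (b) can be sketched as follows. The Fisher transform is the formula quoted in the caption; treating "standardization" as a per-data-set z-score (subtract the mean, divide by the standard deviation), clipping r away from ±1, and the function names and toy values are assumptions made for illustration.

```python
import math
import statistics

def fisher_z(r, eps=1e-6):
    """Fisher's transform ln[(1 + r) / (1 - r)] / 2.
    r is clipped away from +/-1 (an assumption) to keep the value finite."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def standardize_correlations(correlations):
    """Fisher-transform all correlations of one data set, then z-score them
    so the data set has mean 0 and standard deviation 1."""
    z = [fisher_z(r) for r in correlations]
    mean = statistics.fmean(z)
    sd = statistics.pstdev(z)
    if sd == 0:
        raise ValueError("all correlations are identical; cannot standardize")
    return [(value - mean) / sd for value in z]

# Toy values standing in for the gene-pair correlations of one data set.
toy_correlations = [0.9, 0.5, 0.1, -0.2, 0.7, 0.3]
scores = standardize_correlations(toy_correlations)
```

Because every data set ends up on the same N(0,1) scale, a standardized score from one platform can be compared with a score from another, which is the comparability that panel (a) shows to be missing for raw Pearson correlations.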

Supplementary Figure 9 Spearman and bicor correlation measures

Gene retrieval evaluations for GO slim biological process queries, searched using the SEEK algorithm. The correlation measure is varied: Pearson normalized, Spearman, and bicor. The data sets used are a group of 174 breast cancer tumor data sets.

Supplementary Figure 10 Variation of the parameter p in the weighting formula

The value of the parameter p used in Eq. 2 in the Methods was arrived at after testing values in the range from 0.90 to 0.99. At p = 0.99, SEEK is most stable in retrieving genes across the 995 GO biological processes.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10 and Supplementary Notes 1–6 (PDF 5185 kb)

Supplementary Data 1

Microarray and RNA sequencing platforms included in the SEEK compendium (XLSX 10 kb)

Supplementary Data 2

Individual GO biological process performance (XLSX 79 kb)

Supplementary Data 3

The effect of compendium size on the gene-retrieval performance (XLSX 55 kb)

Supplementary Data 4

Data set annotations (XLSX 297 kb)

Supplementary Data 5

Detailed data processing procedure for microarray data sets, TCGA RNASeq data sets, and other RNASeq data sets from GEO (XLSX 28 kb)

About this article

Cite this article

Zhu, Q., Wong, A., Krishnan, A. et al. Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Methods 12, 211–214 (2015). https://doi.org/10.1038/nmeth.3249
