We present SEEK (search-based exploration of expression compendia; http://seek.princeton.edu/), a query-based search engine for very large transcriptomic data collections, including thousands of human data sets from many different microarray and high-throughput sequencing platforms. SEEK uses a query-level cross-validation–based algorithm to automatically prioritize data sets relevant to the query and a robust search approach to identify genes, pathways and processes co-regulated with the query. SEEK provides multigene query searching with iterative metadata-based search refinement and extensive visualization-based analysis options.
The work was supported in part by the US National Institutes of Health (NIH) award R01 HG005998, the US National Science Foundation (NSF) CAREER award DBI-0546275 and NIH award R01 GM071966 to O.G.T. The project was also supported in part by NIH awards T32 HG003284 and P50 GM071508. M.C. was supported by NSF awards CCF 1218687 and CCF 1302518. O.G.T. receives support as part of the Canadian Institute for Advanced Research Genetic Networks group. We thank members of the Troyanskaya lab for comments about SEEK in the regular lab meetings and Q. Zhu for critically reading the manuscript. We thank the volunteers from Princeton and other universities, including attendees of the Canadian Institute for Advanced Research Genetic Networks meetings, for testing the SEEK web interface and providing valuable feedback.
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Gene-retrieval performance vs. query size, and comparisons between SEEK and MEM in single- and multiple-gene queries
(a) SEEK's performance vs. query size (the number of query genes). The plot shows the median (black line) and the IQR (shaded area). Retrieval performance increases with query size, showing that the improved query context that results from including more process-relevant genes in the query can help boost gene retrieval. (b) Gene Recommender's performance vs. query size. (c) The performance of SEEK and MEM. The evaluation is the same as that used in Fig. 2 (main text). These plots additionally show the mean (red line), median (black line) and IQR (shaded area) across 995 processes. (d) Single-gene query retrieval performance.
(a) SEEK's retrieval robustness in the presence of random gene noise in the query. The red line (at 1.0) denotes the performance level of the no-noise queries. Relative performance, defined as the ratio of the fold improvement of precision over random at 10% recall (FIOR@10%) between noisy and no-noise queries, is plotted (see Methods). The percentages below the box plot show the median per-query performance drop. (b) Gene Recommender's gene-retrieval robustness in the presence of random gene noise in the query.
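The FIOR@10% measure and the noisy-to-clean ratio described in this caption can be illustrated with a minimal Python sketch. This is an illustration only, not the authors' implementation: the function names, the convention of reading precision at the first rank where the recall level is reached, and the use of the gold-standard fraction as the random baseline are all assumptions.

```python
import math


def fior_at_recall(ranked_genes, positives, recall_level=0.10):
    """Fold improvement of precision over random at a given recall level.

    Assumption: precision is read at the first rank where the requested
    fraction of gold-standard genes has been retrieved, and the random
    baseline is the fraction of gold-standard genes in the whole ranking.
    """
    positives = set(positives)
    n_pos = sum(1 for gene in ranked_genes if gene in positives)
    if n_pos == 0:
        raise ValueError("no gold-standard genes in the ranking")
    # Number of hits needed to reach the recall level (guard against float error).
    needed = max(1, math.ceil(recall_level * n_pos - 1e-9))
    baseline = n_pos / len(ranked_genes)
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            hits += 1
            if hits >= needed:
                return (hits / rank) / baseline
    raise RuntimeError("unreachable: all positives are in the ranking")


def relative_performance(noisy_fior, clean_fior):
    """Ratio of noisy-query to no-noise-query performance (1.0 means no drop)."""
    return noisy_fior / clean_fior


# Hypothetical rankings of 20 genes with two gold-standard genes.
genes = [f"g{i}" for i in range(1, 21)]
gold = {"g1", "g5"}
clean = fior_at_recall(genes, gold)                    # first hit at rank 1
noisy = fior_at_recall(["g2", "g1"] + genes[2:], gold)  # first hit at rank 2
ratio = relative_performance(noisy, clean)
```

A ratio below 1.0 corresponds to a performance drop caused by the noise genes.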
Supplementary Figure 3 Performance of SEEK and other search systems over increasing numbers of gene expression data sets
A subset of 121 GO-slim terms (Supplementary Data 3) was used to evaluate each system's gene-retrieval performance on six compendium sizes, each built from random subsets of the data sets in its full compendium. Performance is measured by FIOR@10%. All algorithms are applied to the same data compendia; MEM, Gene Recommender and the combined data set correlation algorithm do not scale to the large human compendium that SEEK is able to utilize effectively.
Each black dot represents a process. Different statistics were used to summarize performance (FIOR@10%) per group: red dot (mean), blue line (75th percentile), orange line (median). Membership of the biological processes in the 11 term groups is determined by text-mining the process title, except for the 3 super-groups (see figure for their definitions). Red arrows indicate examples of top-performing processes: “erythrocyte differentiation” (44-fold), “lysosomal transport” (25-fold), “glutamate signaling” (104-fold), and “digestive system development” (33-fold).
Supplementary Figure 5 Search results for the Hedgehog (Hh) signaling query GLI1 GLI2 PTCH1: the data sets prioritized for the query
The top 10 data sets among 5,000 data sets prioritized by SEEK are displayed in the inset. These data sets are weighted by the coexpression of the Hh query genes to indicate the abundance of aberrant Hh signaling activation.
Supplementary Figure 6 Search results for the Hedgehog (Hh) signaling query GLI1 GLI2 PTCH1: data set weight significance and gene-retrieval validation
(a) Top-ranked data sets are specifically highly weighted to the Hedgehog (Hh) query. Data set weight significance is calculated by a comparison with 100 random queries. Empirical P value for each data set d represents the fraction of 100 random queries where the score (or weight) of the data set d prioritized by a random query is higher than d's score in the Hh query. (b) Gene retrieval validation, which serves as an indication of the relevance of the search results (i.e., coexpressed genes) to the Hh context. The gold standard consists of 71 Hh genes assembled from KEGG and GO.
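The empirical P value defined in panel (a) can be sketched in a few lines of Python. The function name and the example weights are hypothetical; only the definition (the fraction of random queries whose weight for a data set exceeds the weight under the Hh query) comes from the caption.

```python
def empirical_p_value(query_weight, random_weights):
    """Fraction of random queries that weight the data set above the real query."""
    if not random_weights:
        raise ValueError("at least one random-query weight is required")
    exceed = sum(1 for w in random_weights if w > query_weight)
    return exceed / len(random_weights)


# Hypothetical weights for one data set d: 100 random queries, 3 of which
# score d higher than the Hh query does.
random_weights = [0.1] * 97 + [0.9] * 3
p = empirical_p_value(0.8, random_weights)
```

With 100 random queries, the smallest nonzero P value this estimate can resolve is 0.01.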
Enrichment of 100 batch-affected data sets (Supplementary Note 1) in full data prioritizations (4,500 data sets) across 121 GO slim queries. This test checks whether the 100 data sets (with severe batch effects) receive a lower score than randomly selected data sets in the ranking. Score-based PAGE enrichment was used. The data sets consistently received a lower score than randomly selected data sets (avg. z-score = –6.3, P < 1.39 × 10⁻¹⁰), showing that low-quality data sets have a relatively small impact on the prioritizations.
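As a rough illustration of a score-based PAGE test, the sketch below computes the usual PAGE z-score: the difference between the subset mean and the overall mean, scaled by the square root of the subset size and divided by the overall standard deviation. Whether the authors used exactly this form (for example, population rather than sample standard deviation) is an assumption, and the scores shown are made up.

```python
import math
import statistics


def page_z_score(all_scores, subset_scores):
    """PAGE-style z-score of a subset's mean score against the full ranking.

    Assumption: population standard deviation of all scores is used.
    """
    mu = statistics.fmean(all_scores)
    sigma = statistics.pstdev(all_scores)
    if sigma == 0:
        raise ValueError("scores have zero variance")
    m = len(subset_scores)
    return (statistics.fmean(subset_scores) - mu) * math.sqrt(m) / sigma


# Hypothetical data set scores for one query; the "batch-affected" subset
# sits at the low end of the ranking, so its z-score is negative.
scores = [float(i) for i in range(10)]
batch_affected = [0.0, 1.0]
z = page_z_score(scores, batch_affected)
```

A strongly negative z-score, averaged over many queries, is what the caption reports for the batch-affected data sets.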
(a) Prior to standardization, the distribution of Pearson correlations (r) for all pairs of genes in a data set is not directly comparable across platforms. We picked a data set at random from each of 8 major platforms to illustrate this lack of comparability. (b) After normalization of the Pearson correlations by Fisher's transform (ln[(1 + r) / (1 – r)] / 2) followed by standardization, all of the selected data sets from these different platforms follow an N(0,1) normal distribution.
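The two-step normalization in panel (b) can be sketched as follows. The Fisher transform is taken directly from the caption; the clamping of r away from ±1, the use of the population standard deviation, and the function names are assumptions made for this illustration.

```python
import math
import statistics


def fisher_z(r, eps=1e-7):
    """Fisher's transform ln[(1 + r) / (1 - r)] / 2.

    Assumption: r is clamped slightly inside (-1, 1) to avoid infinities.
    """
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def standardize_correlations(correlations):
    """Fisher-transform the correlations of one data set, then z-score them."""
    z = [fisher_z(r) for r in correlations]
    mu = statistics.fmean(z)
    sd = statistics.pstdev(z)
    if sd == 0:
        raise ValueError("correlations have zero variance")
    return [(value - mu) / sd for value in z]


# Hypothetical gene-pair correlations from a single data set.
standardized = standardize_correlations([-0.5, 0.0, 0.2, 0.9])
```

Applying this per data set puts correlations from different platforms on a common scale with mean 0 and unit variance.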
Gene retrieval evaluations for GO slim biological process queries, searched using the SEEK algorithm. The correlation measure is varied: Pearson normalized, Spearman, and bicor. The data sets used are a group of 174 breast cancer tumor data sets.
The parameter p used in Eq. 2 in the Methods was chosen after testing values in the range from 0.90 to 0.99. At p = 0.99, SEEK is most stable in retrieving genes across the 995 GO biological processes.
Supplementary Figures 1–10 and Supplementary Notes 1–6 (PDF 5185 kb)
Microarray and RNA sequencing platforms included in the SEEK compendium (XLSX 10 kb)
Individual GO biological process performance (XLSX 79 kb)
The effect of compendium size on the gene-retrieval performance (XLSX 55 kb)
Data set annotations (XLSX 297 kb)
Detailed data processing procedure for microarray data sets, TCGA RNASeq data sets, and other RNASeq data sets from GEO (XLSX 28 kb)
About this article
Cite this article
Zhu, Q., Wong, A., Krishnan, A. et al. Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Methods 12, 211–214 (2015). https://doi.org/10.1038/nmeth.3249