Abstract
We present SEEK (search-based exploration of expression compendia; http://seek.princeton.edu/), a query-based search engine for very large transcriptomic data collections, including thousands of human data sets from many different microarray and high-throughput sequencing platforms. SEEK uses a query-level cross-validation–based algorithm to automatically prioritize data sets relevant to the query and a robust search approach to identify genes, pathways and processes co-regulated with the query. SEEK provides multigene query searching with iterative metadata-based search refinement and extensive visualization-based analysis options.
Acknowledgements
This work was supported in part by the US National Institutes of Health (NIH) award R01 HG005998, the US National Science Foundation (NSF) CAREER award DBI-0546275 and NIH award R01 GM071966 to O.G.T. The project was also supported in part by NIH awards T32 HG003284 and P50 GM071508. M.C. was supported by NSF awards CCF 1218687 and CCF 1302518. O.G.T. receives support as part of the Canadian Institute for Advanced Research Genetic Networks group. We thank members of the Troyanskaya lab for comments about SEEK in the regular lab meetings and Q. Zhu for critically reading the manuscript. We thank the volunteers from Princeton and other universities, including attendees of the Canadian Institute for Advanced Research Genetic Networks meetings, for testing the SEEK web interface and providing valuable feedback.
Contributions
Q.Z. and O.G.T. wrote the manuscript. Q.Z., O.G.T., K.L. and M.C. designed the algorithm. Q.Z. implemented the search back end and front end, and performed evaluations. A.K.W., D.C.C., A.K., R.Z., M.R.A. and C.S.G. contributed ideas, performed analyses and edited the manuscript. A.T., L.A.B. and Q.Z. performed data and metadata processing. V.N.K. and O.G.T. contributed ideas in the biological study. O.G.T., K.L. and M.C. conceived of the study and gave guidance.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Gene-retrieval performance vs. query size, and comparisons between SEEK and MEM in single- and multiple-gene queries
(a) SEEK's performance vs. query size (the number of query genes). The plot shows the median (black line) and the IQR (shaded area). Retrieval performance increases with query size, showing that the improved query context that results from including more process-relevant genes in the query can help boost gene retrieval. (b) Gene Recommender's performance vs. query size. (c) The performance of SEEK and MEM. The evaluation is the same as that used in Fig. 2 (main text). These plots additionally show the mean (red line), median (black line) and IQR (shaded area) across 995 processes. (d) Single-gene query retrieval performance.
Supplementary Figure 2 Robustness of SEEK and Gene Recommender to noisy query genes
(a) SEEK's retrieval robustness in the presence of random gene noise in the query. The red line (at 1.0) denotes the performance level of the no-noise queries. The plotted quantity is relative performance, defined as the ratio of the fold improvement of precision over random at 10% recall (FIOR@10%) for a noisy query to that for the corresponding no-noise query (see Methods). The percentages below the box plot show the median per-query performance drop. (b) Gene Recommender's gene-retrieval robustness in the presence of random gene noise in the query.
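The relative-performance measure in this caption can be illustrated with a minimal sketch. It assumes that FIOR@10% is the precision at the first rank where 10% of the gold-standard genes have been retrieved, divided by the precision expected from a random ranking (the fraction of gold-standard genes among all ranked genes). The function names are hypothetical, and the exact definition in the Methods may differ in detail.

```python
import math


def precision_at_recall(ranked_genes, positives, recall=0.10):
    """Precision at the first rank where the requested recall is reached."""
    positives = set(positives)
    # Number of positives that must be retrieved to reach the recall level.
    needed = max(1, math.ceil(recall * len(positives) - 1e-9))
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            hits += 1
            if hits >= needed:
                return hits / rank
    return 0.0  # recall level never reached


def fior(ranked_genes, positives, recall=0.10):
    """Fold improvement of precision over random at the given recall.

    Assumption: the random baseline is the fraction of positives
    among all ranked genes.
    """
    baseline = len(set(positives) & set(ranked_genes)) / len(ranked_genes)
    return precision_at_recall(ranked_genes, positives, recall) / baseline


def relative_performance(noisy_fior, clean_fior):
    """Ratio plotted in the figure: 1.0 means no loss from query noise."""
    return noisy_fior / clean_fior


# Toy example: 100 ranked genes, 10 gold-standard genes.
genes = [f"g{i}" for i in range(100)]
gold = {f"g{i}" for i in range(0, 100, 10)}
clean = fior(genes, gold)                            # first positive at rank 1
noisy = fior([genes[1], genes[0]] + genes[2:], gold)  # first positive at rank 2
ratio = relative_performance(noisy, clean)
```

In this toy case the noisy ranking pushes the first gold-standard gene from rank 1 to rank 2, so the relative performance is one half.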
Supplementary Figure 3 Performance of SEEK and other search systems over increasing numbers of gene expression data sets
A subset of 121 GO-slim terms (Supplementary Data 3) was used to evaluate each system's gene-retrieval performance on six compendium sizes, each built from random subsets of the data sets in its full compendium. The FIOR@10% is measured. All algorithms are applied to the same data compendia; MEM, Gene Recommender and combined data set correlation algorithms do not scale to the large human compendium that SEEK is able to utilize effectively.
Supplementary Figure 4 SEEK's performance across process groups
Each black dot represents a process. Different statistics were used to summarize performance (FIOR@10%) per group: red dot (mean), blue line (75th percentile), orange line (median). Memberships of the biological processes to the 11 term groups are determined by text-mining the process title, except for the 3 super-groups (see figure for their definitions). Red arrows indicate examples of top-performing processes: “erythrocyte differentiation” (44-fold), “lysosomal transport” (25-fold), “glutamate signaling” (104-fold), and “digestive system development” (33-fold).
Supplementary Figure 5 Search results for the Hedgehog (Hh) signaling query GLI1 GLI2 PTCH1: the data sets prioritized for the query
The top 10 data sets among the 5,000 data sets prioritized by SEEK are displayed in the inset. These data sets are weighted by the coexpression of the Hh query genes to indicate the abundance of aberrant Hh signaling activations.
Supplementary Figure 6 Search results for the Hedgehog (Hh) signaling query GLI1 GLI2 PTCH1: data set weight significance and gene-retrieval validation
(a) Top-ranked data sets are specifically highly weighted to the Hedgehog (Hh) query. Data set weight significance is calculated by a comparison with 100 random queries. Empirical P value for each data set d represents the fraction of 100 random queries where the score (or weight) of the data set d prioritized by a random query is higher than d's score in the Hh query. (b) Gene retrieval validation, which serves as an indication of the relevance of the search results (i.e., coexpressed genes) to the Hh context. The gold standard consists of 71 Hh genes assembled from KEGG and GO.
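The empirical P value described in (a) reduces to a simple counting procedure. The sketch below follows the caption's definition directly; the function name and the toy weights are illustrative assumptions, not values from the study.

```python
def empirical_p_value(query_weight, random_weights):
    """Fraction of random queries that give the data set a higher weight
    than the real query does (100 random queries in the caption)."""
    higher = sum(1 for w in random_weights if w > query_weight)
    return higher / len(random_weights)


# Toy example: weights a data set received under 100 random queries,
# compared with its weight under the real query.
random_weights = list(range(100))   # hypothetical weights 0..99
p = empirical_p_value(95.5, random_weights)
```

With 100 random queries the smallest non-zero P value this procedure can report is 0.01, so a result of 0 is best read as P < 0.01.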
Supplementary Figure 7 Batch-effect analysis
Enrichment of 100 batch-affected data sets (Supplementary Note 1) in full data prioritizations (4,500 data sets) across 121 GO slim queries. This test checked whether the 100 data sets (with severe batch effects) score lower than randomly selected data sets in the ranking. Score-based PAGE enrichment was used. The batch-affected data sets consistently received lower scores than randomly selected data sets (avg. z-score = −6.3, P < 1.39 × 10⁻¹⁰), showing that low-quality data sets have a relatively small impact on the prioritizations.
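As a rough illustration of a score-based PAGE test, the sketch below computes the standard PAGE z-score: the difference between the mean score of the subset and the mean score of all data sets, scaled by the square root of the subset size and divided by the standard deviation of all scores. Whether the analysis here used exactly this variant (for example, population rather than sample standard deviation) is an assumption, and the names and toy scores are hypothetical.

```python
import math
import statistics


def page_z_score(all_scores, subset_scores):
    """PAGE z-score: (subset mean - overall mean) * sqrt(m) / overall sd."""
    mu = statistics.fmean(all_scores)
    sigma = statistics.pstdev(all_scores)  # assumption: population sd
    m = len(subset_scores)
    return (statistics.fmean(subset_scores) - mu) * math.sqrt(m) / sigma


# Toy example: ten data set scores, of which the four lowest form the subset.
all_scores = [float(i) for i in range(10)]
subset = [0.0, 1.0, 2.0, 3.0]
z = page_z_score(all_scores, subset)
```

A negative z-score, as in the toy example, indicates that the subset tends to sit lower in the ranking than the compendium as a whole.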
Supplementary Figure 8 Correlation standardization
(a) Prior to standardization, the distribution of Pearson correlations (r) for all pairs of genes in a data set is not directly comparable across platforms. We picked one data set at random from each of 8 major platforms to illustrate this lack of comparability. (b) After normalization of the Pearson correlations by Fisher's transform (ln[(1 + r) / (1 – r)] / 2) followed by standardization, all of the selected data sets from these different platforms follow an N(0,1) normal distribution.
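The two steps in (b) can be sketched as follows: apply Fisher's transform to each Pearson correlation, then subtract the data set's mean transformed value and divide by its standard deviation. The clipping of correlations near ±1 and the use of the population standard deviation are assumptions added to keep the example well defined; they are not stated in the caption.

```python
import math
import statistics


def fisher_z(r):
    """Fisher's transform: ln[(1 + r) / (1 - r)] / 2."""
    return 0.5 * math.log((1 + r) / (1 - r))


def standardize_correlations(correlations, clip=0.999999):
    """Fisher-transform the correlations of one data set, then z-score them.

    `clip` is an assumption of this sketch: it keeps r = +/-1 from
    producing an infinite transformed value.
    """
    clipped = [max(-clip, min(clip, r)) for r in correlations]
    z = [fisher_z(r) for r in clipped]
    mean = statistics.fmean(z)
    sd = statistics.pstdev(z)
    return [(v - mean) / sd for v in z]


# Toy example: gene-pair correlations from a single hypothetical data set.
pairwise_r = [-0.4, -0.1, 0.0, 0.2, 0.5, 0.9]
standardized = standardize_correlations(pairwise_r)
```

After this step each data set's values have mean 0 and standard deviation 1, which is what makes scores from different platforms comparable.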
Supplementary Figure 9 Spearman and bicor correlation measures
Gene retrieval evaluations for GO slim biological process queries, searched using the SEEK algorithm. The correlation measure is varied: Pearson normalized, Spearman, and bicor. The data sets used are a group of 174 breast cancer tumor data sets.
Supplementary Figure 10 Variation of the parameter p in the weighting formula
The value of the parameter p used in Eq. 2 in the Methods was chosen after testing values in the range from 0.90 to 0.99. At p = 0.99, SEEK is most stable in retrieving genes across the 995 GO biological processes.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–10 and Supplementary Notes 1–6 (PDF 5185 kb)
Supplementary Data 1
Microarray and RNA sequencing platforms included in the SEEK compendium (XLSX 10 kb)
Supplementary Data 2
Individual GO biological process performance (XLSX 79 kb)
Supplementary Data 3
The effect of compendium size on the gene-retrieval performance (XLSX 55 kb)
Supplementary Data 4
Data set annotations (XLSX 297 kb)
Supplementary Data 5
Detailed data processing procedure for microarray data sets, TCGA RNASeq data sets, and other RNASeq data sets from GEO (XLSX 28 kb)
About this article
Cite this article
Zhu, Q., Wong, A., Krishnan, A. et al. Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Methods 12, 211–214 (2015). https://doi.org/10.1038/nmeth.3249
DOI: https://doi.org/10.1038/nmeth.3249