Targeted exploration and analysis of large cross-platform human transcriptomic compendia

Zhu, Qian; Wong, Aaron K; Krishnan, Arjun; Aure, Miriam R; Tadych, Alicja; Zhang, Ran; Corney, David C; Greene, Casey S; Bongo, Lars A; Kristensen, Vessela N; Charikar, Moses; Li, Kai; Troyanskaya, Olga G

doi:10.1038/nmeth.3249

Brief Communication
Published: 12 January 2015

Qian Zhu^1,2, Aaron K Wong^1,2, Arjun Krishnan^2, Miriam R Aure^3, Alicja Tadych^2, Ran Zhang^2,4, David C Corney^2,4, Casey S Greene^5,6, Lars A Bongo^7, Vessela N Kristensen^3,8,9, Moses Charikar^1, Kai Li^1 & Olga G Troyanskaya^1,2,10

Nature Methods volume 12, pages 211–214 (2015)

Abstract

We present SEEK (search-based exploration of expression compendia; http://seek.princeton.edu/), a query-based search engine for very large transcriptomic data collections, including thousands of human data sets from many different microarray and high-throughput sequencing platforms. SEEK uses a query-level cross-validation–based algorithm to automatically prioritize data sets relevant to the query and a robust search approach to identify genes, pathways and processes co-regulated with the query. SEEK provides multigene query searching with iterative metadata-based search refinement and extensive visualization-based analysis options.

**Figure 1: SEEK system overview and systematic functional evaluation.**

**Figure 2: Search results for the Hedgehog (Hh) query (*GLI1*, *GLI2*, *PTCH1*) and search refinement.**

Acknowledgements

The work was supported in part by the US National Institutes of Health (NIH) award R01 HG005998 and partially supported by the US National Science Foundation (NSF) CAREER award (DBI-0546275) and NIH award R01 GM071966 to O.G.T. The project was also supported in part by NIH awards T32 HG003284 and P50 GM071508. M.C. was supported by NSF awards CCF 1218687 and CCF 1302518. O.G.T. receives support as part of the Canadian Institute For Advanced Research in the Genetic Networks group. We thank members of the Troyanskaya lab for comments about SEEK in the regular lab meetings and Q. Zhu for critically reading the manuscript. We thank the volunteers from Princeton and other universities, including the Canadian Institute for Advanced Research Genetic Networks meetings attendees, for testing the SEEK web interface and providing valuable feedback.

Author information

Authors and Affiliations

Department of Computer Science, Princeton University, Princeton, New Jersey, USA
Qian Zhu, Aaron K Wong, Moses Charikar, Kai Li & Olga G Troyanskaya
Lewis-Sigler Institute of Integrative Genomics, Princeton University, Princeton, New Jersey, USA
Qian Zhu, Aaron K Wong, Arjun Krishnan, Alicja Tadych, Ran Zhang, David C Corney & Olga G Troyanskaya
Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Norway
Miriam R Aure & Vessela N Kristensen
Department of Molecular Biology, Princeton University, Princeton, New Jersey, USA
Ran Zhang & David C Corney
Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
Casey S Greene
Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, New Hampshire, USA
Casey S Greene
Department of Computer Science, University of Tromsø, Tromsø, Norway
Lars A Bongo
Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
Vessela N Kristensen
Department of Clinical Molecular Biology (EpiGen), Division of Medicine, Akershus University Hospital, Akershus, Norway
Vessela N Kristensen
Simons Center for Data Analysis, Simons Foundation, New York, New York, USA
Olga G Troyanskaya

Contributions

Q.Z. and O.G.T. wrote the manuscript. Q.Z., O.G.T., K.L. and M.C. designed the algorithm. Q.Z. implemented the search back end and front end, and performed evaluations. A.K.W., D.C.C., A.K., R.Z., M.R.A. and C.S.G. contributed ideas, performed analyses and edited the manuscript. A.T., L.A.B. and Q.Z. performed data and metadata processing. V.N.K. and O.G.T. contributed ideas in the biological study. O.G.T., K.L. and M.C. conceived of the study and gave guidance.

Corresponding authors

Correspondence to Moses Charikar, Kai Li or Olga G Troyanskaya.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Gene-retrieval performance vs. query size, and comparisons between SEEK and MEM in single- and multiple-gene queries

(a) SEEK's performance vs. query size (the number of query genes). The plot shows the median (black line) and the IQR (shaded area). The retrieval performance increases as a function of query size, showing that the improved query context, resulting from including more process-relevant genes in the query, can help boost gene retrieval. (b) Gene Recommender's performance vs. query size. (c) The performance of SEEK and MEM. The evaluation is the same as used in Fig. 2 (main text). These plots additionally show the mean (red line), median (black line), and the IQR (shaded area) across 995 processes. (d) Single-gene query retrieval performance.

Supplementary Figure 2 Robustness of SEEK and Gene Recommender to noisy query genes

(a) SEEK's retrieval robustness in the presence of random gene noise in the query. The red line (at 1.0) denotes the performance level of the no-noise queries. Relative performance, defined as the ratio of the fold improvement of precision over random at 10% recall (FIOR@10%) for noisy queries to that for no-noise queries, is plotted (see Methods). The percentages below the box plot show the median per-query performance drop. (b) Gene Recommender's gene retrieval robustness in the presence of random gene noise in the query.
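
The FIOR@10% measure and the relative performance plotted here can be illustrated with a minimal Python sketch. It assumes that precision is read off at the first rank where 10% of the gold-standard genes have been retrieved, and that the random baseline is the fraction of gold-standard genes among all ranked genes. The function names and these conventions are illustrative assumptions, not the authors' evaluation code, which is described in the Methods.

```python
import math


def fior_at_recall(ranked_genes, positives, recall_level=0.10):
    """Fold improvement of precision over random at a given recall level.

    ranked_genes: genes ordered from best to worst score (no duplicates).
    positives: gold-standard genes for the process being evaluated.
    """
    positives = set(positives) & set(ranked_genes)
    if not positives:
        raise ValueError("no gold-standard genes appear in the ranking")
    # Number of positives needed to reach the recall level; the small
    # epsilon guards against floating-point overshoot inside ceil().
    needed = max(1, math.ceil(recall_level * len(positives) - 1e-9))
    # Expected precision of a random ranking (assumed baseline).
    baseline = len(positives) / len(ranked_genes)
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            hits += 1
            if hits == needed:
                return (hits / rank) / baseline
    raise AssertionError("unreachable: all positives are in the ranking")


def relative_performance(noisy_fior, clean_fior):
    """Ratio of noisy-query FIOR to no-noise-query FIOR (1.0 = no drop)."""
    return noisy_fior / clean_fior
```

A relative performance below 1.0 under this sketch means the noisy query retrieved the gold-standard genes less precisely than the clean query did.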

Supplementary Figure 3 Performance of SEEK and other search systems over increasing numbers of gene expression data sets

A subset of 121 GO-slim terms (Supplementary Data 3) was used to evaluate each system's gene retrieval performance on six compendium sizes, each built from random subsets of the data sets in its full compendium. The FIOR@10% is measured. All algorithms are applied to the same data compendia, and MEM, Gene Recommender, and combined data set correlation algorithms do not scale to the large human compendium that SEEK is able to effectively utilize.

Supplementary Figure 4 SEEK's performance across process groups

Each black dot represents a process. Different statistics were used to summarize performance (FIOR@10%) per group: red dot (mean), blue line (75th percentile), orange line (median). Membership of each biological process in the 11 term groups is determined by text-mining the process title, except for the 3 super-groups (see figure for their definitions). Red arrows indicate examples of top-performing processes: “erythrocyte differentiation” (44-fold), “lysosomal transport” (25-fold), “glutamate signaling” (104-fold), and “digestive system development” (33-fold).

Supplementary Figure 5 Search results for the Hedgehog (Hh) signaling query GLI1 GLI2 PTCH1: the data sets prioritized for the query

The top 10 data sets among 5,000 data sets prioritized by SEEK are displayed in the insert. These data sets are weighted by the coexpression of the Hh query genes to indicate the abundance of aberrant Hh signaling activations.

Supplementary Figure 6 Search results for the Hedgehog (Hh) signaling query GLI1 GLI2 PTCH1: data set weight significance and gene-retrieval validation

(a) Top-ranked data sets are specifically highly weighted to the Hedgehog (Hh) query. Data set weight significance is calculated by a comparison with 100 random queries. Empirical P value for each data set d represents the fraction of 100 random queries where the score (or weight) of the data set d prioritized by a random query is higher than d's score in the Hh query. (b) Gene retrieval validation, which serves as an indication of the relevance of the search results (i.e., coexpressed genes) to the Hh context. The gold standard consists of 71 Hh genes assembled from KEGG and GO.
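
The empirical P value described in (a) can be sketched as follows. This is a minimal illustration that takes the caption's definition literally (the fraction of random queries giving a data set a strictly higher weight than the real query does). The function name, the dictionary-based inputs and the data set identifiers in the usage note are hypothetical.

```python
def empirical_p_values(query_weights, random_query_weights):
    """Empirical P value per data set.

    query_weights: dict mapping data set ID -> weight under the real query.
    random_query_weights: list of dicts, one per random query, mapping
        data set ID -> weight under that random query.
    Returns a dict mapping data set ID -> fraction of random queries whose
    weight for that data set is strictly higher than under the real query.
    """
    n_random = len(random_query_weights)
    if n_random == 0:
        raise ValueError("at least one random query is required")
    p_values = {}
    for dataset, weight in query_weights.items():
        # Data sets missing from a random query are treated as weight 0.0
        # (an assumption made for this sketch).
        higher = sum(
            1 for rw in random_query_weights if rw.get(dataset, 0.0) > weight
        )
        p_values[dataset] = higher / n_random
    return p_values
```

With 100 random queries, as in the caption, the smallest nonzero P value this procedure can report is 0.01. A call such as `empirical_p_values({"DS_A": 0.9}, random_runs)`, where `DS_A` and `random_runs` are placeholders, returns one P value per data set.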

Supplementary Figure 7 Batch-effect analysis

Enrichment of 100 batch-affected data sets (Supplementary Note 1) in full data prioritizations (4,500 data sets) across 121 GO slim queries. This test was done to check whether the 100 data sets (with severe batch effects) have a lower score than randomly selected data sets in the ranking. Score-based PAGE enrichment was used. The data sets consistently received a lower score than randomly selected data sets (avg. z-score = –6.3, P < 1.39×10⁻¹⁰), showing that low-quality data sets have a relatively small impact on the prioritizations.
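
A score-based PAGE statistic is commonly written as z = (mean of the set − mean of all scores) × √m / (s.d. of all scores), where m is the set size. The sketch below assumes this standard form with the population standard deviation. Whether the authors used exactly this variant is not stated in the caption, and the function name is illustrative.

```python
import math
import statistics


def page_z_score(all_scores, subset_scores):
    """PAGE-style z-score of a subset against the full score distribution.

    all_scores: scores (e.g. data set weights) for every ranked item.
    subset_scores: scores of the items in the set being tested.
    A negative value means the subset scores lower than average.
    """
    m = len(subset_scores)
    if m == 0:
        raise ValueError("subset must not be empty")
    mu = statistics.fmean(all_scores)
    sigma = statistics.pstdev(all_scores)  # population s.d. (assumed)
    if sigma == 0:
        raise ValueError("all scores are identical; z-score is undefined")
    return (statistics.fmean(subset_scores) - mu) * math.sqrt(m) / sigma
```

Under this form, one z-score would be computed per query from that query's data set weights, and the values then averaged across queries to give a summary like the one reported above.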

Supplementary Figure 8 Correlation standardization

(a) Prior to standardization, the distribution of Pearson correlation (r) (for all pairs of genes in the data set) is not directly comparable across platforms. We picked a data set at random from each of 8 major platforms to illustrate this lack of comparability. (b) After normalization of the Pearson correlations by Fisher's transform (ln[(1 + r) / (1 – r)] / 2) followed by standardization, all of the selected data sets from these different platforms follow a standard normal distribution, N(0,1).
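
The two steps in (b) can be sketched in a few lines of Python: apply Fisher's transform to each pairwise correlation, then subtract the mean and divide by the standard deviation within the data set. The clipping constant `eps`, the use of the population standard deviation and the function names are assumptions made for this illustration.

```python
import math
import statistics


def fisher_z(r, eps=1e-6):
    """Fisher's transform ln[(1 + r) / (1 - r)] / 2 of a correlation r."""
    # Clip r away from +/-1 so the logarithm stays finite (assumed guard).
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def standardize_correlations(correlations):
    """Fisher-transform the correlations of one data set, then z-score them.

    correlations: Pearson r values for the gene pairs of a single data set.
    Returns values with mean 0 and standard deviation 1 within the data set,
    which makes them comparable across data sets and platforms.
    """
    z = [fisher_z(r) for r in correlations]
    mu = statistics.fmean(z)
    sigma = statistics.pstdev(z)
    if sigma == 0:
        raise ValueError("all correlations are identical")
    return [(value - mu) / sigma for value in z]
```

Because the standardization is done separately for each data set, a value of 2.0 means "two standard deviations above that data set's typical correlation" regardless of the platform it came from.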

Supplementary Figure 9 Spearman and bicor correlation measures

Gene retrieval evaluations for GO slim biological process queries, searched using the SEEK algorithm. The correlation measure is varied: Pearson normalized, Spearman, and bicor. The data sets used are a group of 174 breast cancer tumor data sets.

Supplementary Figure 10 Variation of the parameter p in the weighting formula

The parameter p used in Eq. 2 in the Methods was chosen after testing values in the range from 0.90 to 0.99. At p=0.99, SEEK is most stable in retrieving genes across the 995 GO biological processes.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10 and Supplementary Notes 1–6 (PDF 5185 kb)

Supplementary Data 1

Microarray and RNA sequencing platforms included in the SEEK compendium (XLSX 10 kb)

Supplementary Data 2

Individual GO biological process performance (XLSX 79 kb)

Supplementary Data 3

The effect of compendium size on the gene-retrieval performance (XLSX 55 kb)

Supplementary Data 4

Data set annotations (XLSX 297 kb)

Supplementary Data 5

Detailed data processing procedure for microarray data sets, TCGA RNASeq data sets, and other RNASeq data sets from GEO (XLSX 28 kb)

Source data

Source data to Fig. 1

Source data to Supplementary Fig. 2

Source data to Supplementary Fig. 3

Source data to Supplementary Fig. 4

Source data to Supplementary Fig. 5

Source data to Supplementary Fig. 6

Source data to Supplementary Fig. 7

Source data to Supplementary Fig. 8

Source data to Supplementary Fig. 9

Source data to Supplementary Fig. 10

Source data to Supplementary Fig. 11

About this article

Cite this article

Zhu, Q., Wong, A., Krishnan, A. et al. Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Methods 12, 211–214 (2015). https://doi.org/10.1038/nmeth.3249

Received: 05 September 2014
Accepted: 12 November 2014
Published: 12 January 2015
Issue Date: March 2015
DOI: https://doi.org/10.1038/nmeth.3249
