Over the past decade, high-throughput gene expression experiments have generated data from millions of assays. Data sets linked to publications are stored in functional genomics data archives: ArrayExpress at the European Bioinformatics Institute, Gene Expression Omnibus at the US National Center for Biotechnology Information and the Omics Archive at the DNA Data Bank of Japan.
Secondary added-value and topical databases process data from the primary archives, adding analysis and annotation to make these data accessible to every biologist by allowing queries such as 'in which tissue is a particular gene expressed?' or 'which genes are differentially expressed between a particular disease and normal samples?'
Public gene expression data are commonly reused to study biological questions, both by reanalysis of primary data and by queries to secondary resources. Approximately half of the studies that use public gene expression data rely solely on existing data, and the other half use the public data in combination with newly generated data.
The reproducibility of published microarray-based studies is limited, mostly owing to insufficient experiment annotation and sometimes to unavailability of the raw or processed data. Stricter enforcement of Minimum Information About a Microarray Experiment (MIAME) requirements and the development of easy-to-use experiment annotation tools are needed to achieve better reproducibility.
Although most of the public gene expression data are still based on microarray experiments, the contribution of high-throughput-sequencing-based expression studies, known as RNA sequencing (RNA-seq), is growing rapidly.
Reuse of RNA-seq data can potentially be even more valuable than reuse of microarray data, partly owing to the costs of experiments and data storage but even more importantly because of the more quantitative nature of sequencing-based expression data. Community standards such as Minimum Information about a Sequencing Experiment (MINSEQE) should be adopted to make RNA-seq data maximally reusable.
The bioinformatics resources that store and manage public data are sensitive to short-term funding changes, complicating the maintenance of important databases. The development of long-term infrastructure in bioinformatics, such as the ELIXIR project in Europe, is needed to ensure the long-term availability of public data.
Our understanding of gene expression has changed dramatically over the past decade, largely catalysed by technological developments. High-throughput experiments — microarrays and next-generation sequencing — have generated large amounts of genome-wide gene expression data that are collected in public archives. Added-value databases process, analyse and annotate these data further to make them accessible to every biologist. In this Review, we discuss the utility of the gene expression data that are in the public domain and how researchers are making use of these data. Reuse of public data can be very powerful, but there are many obstacles in data preparation and analysis and in the interpretation of the results. We will discuss these challenges and provide recommendations that we believe can improve the utility of such data.
We would like to thank H. Parkinson and U. Sarkans for useful comments and help in analysing ArrayExpress statistics. The work was partly funded by the European Community's FP7 HEALTH grants ENGAGE (grant agreement 201413), SYBARIS (grant agreement 242220) and EurocanPlatform (grant agreement 260791).
The authors declare no competing financial interests.
- Microarray
A solid surface slide to which a collection of microscopic DNA spots representing specific DNA sequences of genomic regions is attached and to which sample DNA fragments can hybridize. Microarrays are used to measure the expression levels of large numbers of genes simultaneously, to genotype multiple regions of a genome or for other high-throughput assays.
- Minimum Information About a Microarray Experiment
(MIAME). A guideline for information that is necessary for the unambiguous interpretation of the results of the experiment, potentially allowing the reproduction of the experiment. MIAME postulates that raw and processed data, sample annotation, array feature annotation, relationship between the samples used in the experiment, arrays and data files, the overall description of the experiment and experimental variables must be given in a usable format to make the results of a microarray experiment interpretable.
- Gene Expression Omnibus
(GEO). A public functional genomics data repository supporting MIAME-compliant data submissions at the US National Center for Biotechnology Information accepting array- and sequence-based data.
- ArrayExpress
A MIAME-compliant archive of functional genomics data at the European Bioinformatics Institute. It is one of the international public data archives recommended by scientific journals for depositions of microarray or high-throughput sequencing data related to publications.
- High-throughput sequencing
DNA sequencing technologies that parallelize the sequencing operations, thus achieving throughput several orders of magnitude higher than that of the traditional sequencing methods based on processes invented by Fred Sanger.
- RNA sequencing
(RNA-seq). The use of high-throughput sequencing technologies applied to cDNA molecules obtained by reverse transcription from RNA, or sequencing RNA directly, in order to get information about the RNA content of a sample.
- Meta-analysis
Refers to methods focused on contrasting and combining results from different studies to identify common patterns and to improve the signal in the data.
- Normalization
In relation to microarray and other high-throughput data, normalization usually refers to data transformations that remove systematic noise and that make data combined from several assays mutually comparable.
- Minimum Information about a Sequencing Experiment
(MINSEQE). A formulation of the information that is necessary to interpret the results of a sequencing experiment unambiguously and potentially to reproduce the experiment. MINSEQE is an adaptation of the Minimum Information About a Microarray Experiment guidelines to RNA sequencing and other high-throughput-sequencing-based functional genomics experiments.
- ELIXIR
A life sciences infrastructure project that unites Europe's leading life sciences organizations in managing and safeguarding the massive amounts of data being generated every day by publicly funded research.
Rung, J., Brazma, A. Reuse of public genome-wide gene expression data. Nat Rev Genet 14, 89–99 (2013). https://doi.org/10.1038/nrg3394