Introduction

The Gene Expression Omnibus (GEO) is the world’s largest public repository of functional genomic data, founded and run by the US-based National Center for Biotechnology Information (NCBI) within the National Library of Medicine at the National Institutes of Health (NIH) [1]. Together with its European counterpart, ArrayExpress [2], it is central to fostering reproducibility and open access in genomic research [3].

GEO data are classified into four entity types: Platform (GPL), Sample (GSM), Series (GSE), and DataSet (GDS) [1]. Platform (GPL) records detail the specific technology or technologies used to obtain data of a given sample. Sample (GSM) records describe the experimental output of one individual sample. Series (GSE) records consist of a group of related Samples within an experiment. Finally, DataSet (GDS) records are the Series that have been curated by GEO staff, normalized to be biologically and statistically comparable.

Buried within the metadata of GEO deposits, however, lie broader trends in the research ecosystem. Open-access genomic databases on human samples are critical for future advances in oncology and medicine, and have expanded significantly in the past decade [4,5,6,7,8,9,10]. To date, however, there have been no in-depth analyses of the trends in functional genomic data on GEO or ArrayExpress, despite their growing importance and volume. Such information could prove especially useful for research on genomic medicine [11], public health [12, 13], and science funding and policy [14].

Here, we developed a database of human GEO Series (GSE) and their associated metadata, and identified temporal trends in genomic data growth on GEO. Only some of these metadata were readily available from the GEO browser; the disease of interest was extracted from experiment summaries, and funding data were extracted from the associated publications. Probing this database yielded several new insights into the technology, geographic origin, and research focus of functional genomic studies. Most prominently, we observed a rapid adoption of high-throughput sequencing (HTS) in North America, alongside a shift toward basic research in humans.

Materials and methods

Metadata extraction

We identified and included human GEO series using the organism keyword of Homo sapiens without any other search criteria in July 2017, and again in January 2018 for updates. Metadata on all human GEO Series (GSE) were downloaded from the GEO repository browser, including accession codes, title, Series type, release date, and associated curated GDS. Geographic origin (i.e., the corresponding author’s affiliation on the record) and experimental summaries were extracted from each Series’ accession display page using a custom web scraper. For Series with one or more associated publications, further metadata were extracted from MEDLINE, a bibliographic database indexed by the National Library of Medicine. We extracted the grant numbers under the GR field, and the medical subject headings from the MH field. Only the Series uploaded on or before 31 December 2017 were included in the analyses.
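The GR and MH extraction described above can be illustrated with a minimal sketch. It assumes the publication records are available as MEDLINE-format text, in which each field starts with a short tag padded to four characters followed by "- ". The function name `parse_medline_record` and the example record are hypothetical and are not taken from the study's own scraper.

```python
from collections import defaultdict


def parse_medline_record(text):
    """Parse one MEDLINE-format record into {tag: [values]}.

    Each field starts with a tag of up to four characters padded to
    column 4, followed by "- "; continuation lines are indented.
    """
    fields = defaultdict(list)
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if len(line) > 5 and line[4:6] == "- ":
            tag = line[:4].strip()
            fields[tag].append(line[6:].strip())
        elif tag is not None and line.startswith(" "):
            # Continuation of the previous field's value.
            fields[tag][-1] += " " + line.strip()
    return dict(fields)


# Illustrative record with invented values (not a real citation).
EXAMPLE = """\
PMID- 00000000
TI  - A hypothetical expression profiling study of
      human tissue.
MH  - Humans
MH  - Gene Expression Profiling
GR  - R01 CA000000/CA/NCI NIH HHS/United States
GR  - P30 AG000000/AG/NIA NIH HHS/United States
"""

record = parse_medline_record(EXAMPLE)
grants = record.get("GR", [])       # grant numbers (GR field)
mesh_terms = record.get("MH", [])   # medical subject headings (MH field)
print(grants)
print(mesh_terms)
```

Keeping every tag as a list preserves repeated fields such as GR and MH, which is what the later per-grant counting relies on.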

Data curation

From the raw metadata, certain fields were extracted to facilitate analyses. The Series type indicates both the general application (e.g., expression profiling or SNP genotyping) and the technology used (e.g., array or HTS). Due to the large number of such combinations, we separated the application and technology for individual analysis.
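As a rough illustration of this separation, the sketch below assumes that Series types follow the "<application> by <technology>" pattern (for example, "Expression profiling by array"). The function name `split_series_type` and the rule that collapses every "... array" variant into "array" are assumptions made for this example, not the authors' stated procedure.

```python
# Label used in GEO Series types for sequencing-based technologies.
HTS_LABEL = "high throughput sequencing"


def split_series_type(series_type):
    """Split a single Series type into (application, technology).

    Assumes the "<application> by <technology>" pattern; types without
    " by " (e.g. "Other") are assigned the technology "other".
    """
    application, sep, technology = series_type.partition(" by ")
    if not sep:
        return series_type.strip(), "other"
    technology = technology.strip().lower()
    if technology == HTS_LABEL:
        technology = "HTS"
    elif technology.endswith("array"):
        # Collapse "SNP array", "genome tiling array", etc. into "array".
        technology = "array"
    else:
        technology = "other"
    return application.strip(), technology


print(split_series_type("Expression profiling by array"))
print(split_series_type("Expression profiling by high throughput sequencing"))
print(split_series_type("SNP genotyping by SNP array"))
```

A Series that lists several types would need to be split into individual types before calling this function.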

We classified each Series into one of six broad disease areas using a keyword-based classification strategy: cancer, cardiovascular diseases, diabetes, immunologic diseases, infectious diseases, and neurologic diseases. Briefly, we scanned each Series’ summary for keywords relating to each disease classification (Supplementary Table 1), and assigned the Series to the classification with the greatest number of keyword hits. Series with no keyword hits were categorized as “unclassified.”
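The keyword-count rule can be sketched as follows. The keyword lists here are hypothetical placeholders, since the actual lists are given in Supplementary Table 1 and are not reproduced in this text. The tie-breaking behavior (first area in dictionary order wins) is also an assumption, as the text does not specify one.

```python
import re

# Hypothetical keyword lists for illustration only; the study's lists
# are in its Supplementary Table 1.
KEYWORDS = {
    "cancer": ["cancer", "tumor", "carcinoma", "leukemia"],
    "cardiovascular diseases": ["cardiovascular", "heart failure", "atherosclerosis"],
    "diabetes": ["diabetes", "diabetic", "insulin resistance"],
    "immunologic diseases": ["autoimmune", "lupus", "asthma"],
    "infectious diseases": ["infection", "virus", "bacterial"],
    "neurologic diseases": ["alzheimer", "parkinson", "epilepsy"],
}


def classify_summary(summary):
    """Return the disease area with the most keyword hits, or "unclassified"."""
    text = summary.lower()
    hits = {
        area: sum(len(re.findall(re.escape(kw), text)) for kw in kws)
        for area, kws in KEYWORDS.items()
    }
    # Assumption: ties resolve to the first area in KEYWORDS order.
    best_area = max(hits, key=hits.get)
    return best_area if hits[best_area] > 0 else "unclassified"


print(classify_summary("Breast cancer tumor samples compared with normal tissue"))
print(classify_summary("Chromatin accessibility in embryonic stem cells"))
```

Substring matching means "tumor" also matches "tumors"; a stricter variant could use word boundaries at the cost of missing plurals.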

From the grant numbers, we parsed out the specific National Institutes of Health (NIH) institute(s) funding each grant, or recorded “other” for non-NIH funders. During the data analysis, if a Series was funded by more than one NIH institute or center, each was counted once. If a Series was funded by two grants from the same institute, that institute was counted twice.
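The counting rule amounts to one count per grant, which the sketch below illustrates. It assumes GR entries follow the MEDLINE layout "<number>/<code>/<agency>/<country>", with NIH agencies written like "NCI NIH HHS". The function names and the example grant strings are hypothetical.

```python
from collections import Counter


def institute_from_grant(gr_value):
    """Return the NIH institute acronym for one GR entry, or "other".

    Assumes NIH agencies appear as a slash-delimited part such as
    "NCI NIH HHS"; anything else is treated as a non-NIH funder.
    """
    for part in gr_value.split("/"):
        tokens = part.split()
        if len(tokens) >= 2 and tokens[1] == "NIH":
            return tokens[0]
    return "other"


def count_institutes(grants):
    """Count once per grant, so two grants from one institute count twice."""
    return Counter(institute_from_grant(g) for g in grants)


# Invented grant strings for illustration.
example_grants = [
    "R01 CA000000/CA/NCI NIH HHS/United States",
    "P01 CA000001/CA/NCI NIH HHS/United States",
    "R01 AG000000/AG/NIA NIH HHS/United States",
    "000000/Wellcome Trust/United Kingdom",
]
print(count_institutes(example_grants))
```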

Statistical analysis

Statistical analyses, including Fisher’s exact test, were performed using MATLAB (Version R2017a, March 2017, MathWorks). The Joinpoint Regression Program (Version 4.5.0.1, June 2017, Statistical Research and Applications Branch, National Cancer Institute, Bethesda, MD, USA) was used to analyze trends in the number of deposited Series per annum and in subgroups, from which annual percent change (APC) values were computed [15]. Model selection was based on permutation tests on log-transformed data, using 4499 randomly permuted data sets and an overall significance level of P < 0.05. Up to two joinpoints were allowed. All P values were two-sided.
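For readers unfamiliar with the APC, the sketch below shows how it follows from a log-linear fit within a single segment: if ln(count) = b0 + b1 × year, then APC = 100 × (exp(b1) − 1). This is only an illustration of the definition, not a reimplementation of the Joinpoint Regression Program, which also selects the joinpoints. The function name and the synthetic counts are assumptions.

```python
import math


def annual_percent_change(years, counts):
    """APC for one segment from a least-squares fit of ln(count) on year."""
    n = len(years)
    logs = [math.log(c) for c in counts]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    sxx = sum((x - mean_x) ** 2 for x in years)
    slope = sxy / sxx
    return 100.0 * (math.exp(slope) - 1.0)


# Synthetic counts growing exactly 20% per year (not data from this study).
years = list(range(2009, 2014))
counts = [1000 * 1.2 ** i for i in range(5)]
apc = annual_percent_change(years, counts)
print(round(apc, 1))  # prints 20.0
```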

Results

Of the 75,376 Series deposited on GEO between 2001 and 2017, a total of 39,076 (51.8%) were of human samples. Raw data for the human Series are summarized in Table 1. Fig. 1 shows the descriptive statistics of the Series by geographic origin, disease classification, genomic application, and technology. A slight majority of Series (54%) originated from North America, followed by Europe (28%) and Asia (15%) (Fig. 1a). Around 48% of Series could be classified to one of six major disease categories: in descending order, cancer (30%), immunologic diseases (9%), neurologic diseases (4%), infectious diseases (3%), diabetes (2%), and cardiovascular diseases (1%) (Fig. 1b). The remaining “unclassified” Series consisted mostly of basic science studies, along with some less prevalent diseases. Genomic application was dominated by expression profiling (62%) (Fig. 1c). The majority of the Series were collected using array technologies (58%) or HTS (26%) (Fig. 1d).

Table 1 Trends in the characteristics of the functional genomic data deposited in the Gene Expression Omnibus (GEO) from 2001 to 2017
Fig. 1

Summary statistics. a Geographic origin of submitted GSE. b Area of study for submitted GSE. c Platform usage, separated into application (left) and technology (right)

We next sought to discover trends in these data over time. With regard to the number of Series deposited per year, we identified two segments of growth (one joinpoint), namely 2001–2009 (APC = 43.6, P < 0.001) and 2009–2017 (APC = 20.3, P < 0.001). Sharp fluctuations were found in the number of DataSets (GDS) curated from each year (Fig. 2). GDS curation grew rapidly from 2001 to 2006, when it peaked at 193, but very few Series were curated from 2008 to 2010. In 2011 there was a sudden jump to 200 GDS, but the number has since dropped to zero.
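To put the two APC values in more intuitive terms, they can be converted into approximate doubling times. The worked equation below is an illustrative calculation that assumes constant exponential growth within each segment; it is not a result reported by the study.

```latex
% Doubling time implied by a constant annual percent change (APC)
T_2 = \frac{\ln 2}{\ln\left(1 + \mathrm{APC}/100\right)}

% 2001--2009 segment, APC = 43.6
T_2 = \frac{\ln 2}{\ln 1.436} \approx \frac{0.6931}{0.3619} \approx 1.92 \text{ years}

% 2009--2017 segment, APC = 20.3
T_2 = \frac{\ln 2}{\ln 1.203} \approx \frac{0.6931}{0.1848} \approx 3.75 \text{ years}
```

Under this assumption, the yearly number of deposited Series doubled roughly every two years before 2009 and roughly every four years afterwards.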

Fig. 2

Curated datasets. a Absolute number of curated DataSets (GDS). Year indicates the submission date of the associated Series. b Proportion of each year’s submitted GSE that have been curated into GDS

There were also trends in the geographic origin of Series (Fig. 3). When GEO was launched, the vast majority of submitted Series originated from North America. With each passing year, however, Europe and Asia represented an increasingly large proportion of submitted Series. This trend reversed in 2015, after which the proportion of North American Series sharply increased (Fig. 3a). Analysis of the raw number of Series per year shows that European deposits plateaued around 2012, while other regions are still steadily growing (Fig. 3b).

Fig. 3

Geographic distribution. a Proportion and b absolute number of datasets originating from specific geographical regions. Absolute numbers for 2017 are projected off the first 3 months of the year

Given the rapidly evolving nature of genomics, it is perhaps unsurprising that the genomic technologies used to produce the deposited human genomic data changed over time. While array-based technologies initially predominated, HTS overtook them in 2016 (Fig. 4a). The number of HTS Series deposited per year has been increasing exponentially (APC = 79 for 2009–2017, P < 0.001), while arrays have nearly plateaued in recent years (APC = 3.4 for 2011–2017, P = 0.07) (Fig. 4b, Supplementary Table S2). There has also been a sustained increase in the number of Series using “other” technologies (APC = 59 for 2001–2017, P < 0.001), possibly reflecting the growing number of emerging functional genomic techniques. Interestingly, Series originating from North America had more than three times the odds of using HTS technology compared with those from other regions (OR = 3.39), a gap that widened markedly after 2015 (OR for 2017 = 5.52) (Fig. 4c, Table 2).
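The odds ratios reported here come from 2×2 tables of region (North America vs. other) against technology (HTS vs. other). As a minimal sketch of how such a table yields an odds ratio and a two-sided Fisher's exact P value, the code below uses only the Python standard library. The study itself used MATLAB, and the counts in the example are invented for illustration.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# Hypothetical counts: rows = (North America, other regions),
# columns = (HTS, other technologies). Not data from this study.
a, b, c, d = 30, 20, 15, 35
print(odds_ratio(a, b, c, d))  # prints 3.5
print(fisher_exact_two_sided(a, b, c, d))
```

The small tolerance in the comparison guards against floating-point ties between equally probable tables.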

Fig. 4

Genomic platform usage. a Proportion of datasets using certain technologies, by year. b Joinpoint analysis of each technology’s absolute growth. c Proportion of datasets using high-throughput sequencing (HTS) in North America vs. other regions from 2008 onwards

Table 2 Association between human GEO deposits’ technology used (high-throughput sequencing technology vs. other methods) and their corresponding geographic origin (North America vs. other regions) from 2008 to 2017

We next investigated trends in the Series’ disease of interest over time. The proportion of Series that could be classified to one of the six major disease categories increased steadily from 2003 to around 2008, after which it remained steady at around 60% (Fig. 5a, b). However, starting in 2015, the proportion of Series related to a major disease area dropped sharply, down to 36% in 2017. This reflects an increase in “unclassified” Series focusing on basic science and less prevalent diseases. Nevertheless, all six disease classifications still saw steady growth in the number of Series per year (Supplementary Table S3). The decreasing proportion of disease-classified Series was due almost entirely to Series from North America, which dropped from 59% disease-classified in 2015 to just 25% in 2017, while there was no change for the rest of the world (Fig. 5c, Table 3). Importantly, Series utilizing HTS were significantly less likely to be disease-classified than those using other technologies (OR = 0.39), suggesting a preferential use or adoption of HTS in basic science settings (Fig. 5d, Table 4).

Fig. 5

Disease classifications and funding sources. a Proportion and b absolute number of Series studying one of six major disease areas over time. c Proportion of datasets classified to a major disease area, from North America vs. other regions. d Proportion of studies classified to a major disease area, for those using high-throughput sequencing vs. those using other technologies. e Source of funding for human Series with associated publications. f Ranked list of NIH institutes, by the proportion of grants that produced a GEO deposit. Error bars represent standard error

Table 3 Association between human GEO deposits’ geographic origin (North America vs. other regions) and their corresponding disease area (related to a major disease area or unclassified) from 2001 to 2017
Table 4 Association between the human GEO deposits’ technology used (high-throughput sequencing vs. others) and their corresponding disease area (related to a major disease area or unclassified) from 2008 to 2017

Finally, we assessed trends in the funding sources of Series with associated publication(s) indexed in MEDLINE. Funding information could only be extracted and analyzed for Series with associated publications, which accounted for ~68% of all Series. Of the grants with associated publications indexed in MEDLINE, the large majority (86%) were funded by the U.S. NIH. The NIH institutes funding the greatest proportion of Series were, in descending order, the National Cancer Institute (NCI, 33%), National Institute on Aging (NIA, 11%), National Institute of General Medical Sciences (NIGMS, 7.7%), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK, 6.7%), and National Heart, Lung, and Blood Institute (NHLBI, 6.6%). There were no significant trends in funding sources over time (Fig. 5e). However, simply assessing the proportion of Series funded by a particular agency can be misleading, as larger agencies naturally fund more studies. To address this, we normalized the number of Series funded by each NIH institute to the total number of grants funded by that institute, giving the proportion of grants that result in a GEO Series. The overall NIH proportion was 0.063, or nearly one Series produced per 16 grants. Five institutes were above this level: unsurprisingly, the National Human Genome Research Institute (NHGRI, 0.49), followed by the NIA (0.19), NCI (0.18), National Center for Research Resources (NCRR, 0.12), and National Institute of Environmental Health Sciences (NIEHS, 0.085) (Fig. 5f). The NIH was not more likely to fund disease-classified studies compared with non-US agencies (OR = 1.02, P = 0.677) (Supplementary Table S4).
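The per-grant normalization can be sketched as a simple ratio. The standard error shown here uses a binomial approximation, which is an assumption: the text does not state how the error bars in Fig. 5f were computed. The function name and the counts are hypothetical, chosen only so that the ratio matches the overall proportion of 0.063.

```python
import math


def series_per_grant(n_series, n_grants):
    """Proportion of grants yielding a GEO Series, with a binomial SE.

    Assumption: each grant yields at most one Series, so the proportion
    lies in [0, 1] and the binomial standard error applies.
    """
    p = n_series / n_grants
    se = math.sqrt(p * (1 - p) / n_grants)
    return p, se


# Hypothetical counts (not the study's data): 63 Series from 1000 grants.
p, se = series_per_grant(63, 1000)
print(p, se)
print(1 / p)  # grants per Series, roughly 16
```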

Discussion

Since its inception in 2001, the GEO has become a mainstay of molecular biology research [1]. Its exponential growth reflects an evolving research environment where HTS technologies are increasingly used in human genomic studies. GEO metadata thus present a valuable resource for analyzing trends in the research ecosystem. This study, to the best of our knowledge, represents the first in-depth study of human GEO Series, encompassing geography, disease of interest, funding sources, genomic application, and technology. The summary database curated here is powerful because it allows not only for analysis of descriptive statistics and trends, but also for correlations that offer clues as to the origin of specific trends.

Curated DataSets (GDS) are very valuable tools for researchers. They are normalized to be biologically comparable, and are compatible with a suite of data display and analysis tools offered by GEO. Thus, the sharp decline in GDS records in recent years may be troublesome for high-quality secondary genomic analyses. However, due to the increasing use and availability of free bioinformatics packages [16,17,18,19,20], normalization of functional genomic data is no longer a difficult task. Curation was therefore likely deemed to be no longer a sufficient priority for the research community.

The predominant geographic origin of the GEO data has taken some interesting turns. Although the repository was becoming increasingly international, North American deposits once again began dominating after 2015. This was due to a sharp increase in North American deposits as well as a plateau in European ones. The reason behind these trends is not clear, but it is unlikely that Europeans are now preferentially depositing on ArrayExpress, which continues to see only linear growth in its number of deposits [2].

Of note, it seems that North America is spearheading the sharp rise in HTS technologies in recent years, although their use is increasing in other regions as well. This finding is consistent with the fact that the U.S. has invested more in genomic research than any other country in the world [14]. HTS encompasses a variety of techniques, including ChIP-seq for genome binding profiling, and RNA-seq for transcriptome profiling. RNA-seq has some advantages over array-based technologies, being superior for detecting low-abundance transcripts, biologically distinct isoforms, and genetic variants [21, 22]. As sequencing becomes cheaper per base, and analysis software more widespread, RNA-seq may continue to overtake array-based technologies.

Interestingly, HTS was less likely to be used to study one of the six major disease areas. This suggests that HTS, as a relatively new technology, is still largely used for basic science and is still in the process of being adopted for more disease-specific applications (likely clinical studies). Nevertheless, this shifting technology carries important implications for the use of genomic data in clinical decisions and precision medicine [9, 23, 24]. Indeed, array-based transcriptomics are already being used for cancer diagnosis [25, 26], staging, and prognosis [27,28,29,30]. Moreover, the unique ability of RNA-seq to detect gene fusions and disease-associated isoforms appears to be an advantage for clinical tool development [31], although comparatively few RNA-seq-based clinical tests currently exist [32, 33]. As HTS becomes increasingly prevalent in the research world, clinicians will need to adapt so as to be able to effectively collect, analyze, and interpret data in such formats.

Of the investigated disease areas, there was a dominance of unclassified (likely basic research), cancer, and immunologic diseases in the GEO deposits. The low percentages of GEO deposits in other disease areas, such as cardiovascular disease, may be concerning because the lack of sufficient human genomic data and understanding in the field may limit the development of genomics-based diagnostics and treatments [10, 31, 34, 35]. Related to this finding, the higher number of GEO Series per grant in select NIH institutes likely reflects a greater preference for and awareness of genomic data among the researchers they sponsor. Perhaps a more interesting question is whether the research areas with fewer per-grant GEO deposits would need more genomic studies. This question may have profound clinical applications and present unique research opportunities. We found that cancer and basic science dominated GEO deposits, consistent with the largest funding sources (such as the NCI). On the other hand, endocrinological (diabetes, for example), neurological, and cardiovascular diseases lagged far behind. Due to the faster accumulation of human genomic data and deeper understanding of cancer, cancer biologists, pathologists, and oncologists are more likely to take advantage of genome-based diagnostics and targeted therapies than their colleagues in fields with fewer genomic data deposits [4, 5, 36]. These advances in cancer will lead to more rapid and profound benefits for cancer patients.

In conclusion, we report increasing trends in GEO deposits (1) using HTS methods, (2) originating from North America, and (3) focusing on basic science applications. Cancer, immunologic diseases, and neurologic diseases were the three disease areas with the most deposits on the GEO. We also show that the NHGRI, NCI, NIEHS, and NCRR had a higher number of per-grant GEO Series than other NIH institutes and centers. More studies are needed to explain our observations. Our findings nonetheless may help shape future functional genomics-based research and clinical priorities.